Evaluating the Effectiveness of Artificial Intelligence Systems in Intelligence Analysis
by Daniel Ish, 2021, RAND Corporation
Daniel Ish’s 2021 RAND report tackles a problem that has shadowed every wave of automation inside the intelligence community: once an AI system is built and deployed against an analytic workflow, how does anyone actually know whether it is working? The study is aimed at programme managers, evaluators, and acquisition officers who have to sign off on systems whose outputs feed assessments that policymakers and operators rely on.
Ish argues that the standard machine-learning yardsticks (precision, recall, F1, area under the curve) are necessary but badly insufficient for the intelligence setting. A classifier can post strong scores on a held-out test set and still degrade an analyst’s judgement: by anchoring attention on the wrong cues, by failing silently on distribution shift, or by producing outputs whose confidence is poorly calibrated to the decisions they inform. Effectiveness, in his framing, is a property of the analyst-plus-system, not of the model alone, and the evaluation regime has to be designed accordingly.
The report walks through the layers at which evaluation has to happen. At the component level, it discusses the familiar metrics and their pathologies, including the trouble with imbalanced classes, the brittleness of benchmarks drawn from convenience data, and the way adversarial or denied environments invalidate the i.i.d. assumption baked into most testing. At the human-machine level, it looks at how analysts actually use AI outputs — when they trust them, when they over-rely, when they ignore them — and how those behaviours can be measured through controlled experiments and instrumented workflows. At the mission level, it asks whether the system shortens the time to a finished assessment, broadens the set of questions analysts can credibly answer, or changes the quality of the product reaching the customer, and it concedes how rarely those outcomes are clean to measure.
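The component-level trouble with imbalanced classes can be illustrated with a minimal Python sketch. It is not taken from the report; the function names and the toy data (1,000 items, 10 of them of interest) are assumptions chosen purely for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = of interest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def summary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1, using 0.0 where a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy data: 10 items of interest hidden among 990 routine ones.
y_true = [1] * 10 + [0] * 990

# A degenerate "system" that never flags anything.
always_negative = [0] * 1000

metrics = summary_metrics(y_true, always_negative)
print(metrics)
```

On this toy data the system that never flags anything is right on 990 of 1,000 items, yet its recall is zero: it misses every item an analyst would care about. That gap is one reason a single headline score says little about usefulness when the events of interest are rare.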
Specific themes get sustained attention. Calibration of confidence is treated as a first-class concern, given how intelligence products encode uncertainty in words of estimative probability. Distributional robustness is examined through the lens of foreign-language and rare-event analysis, where training data is thin and skewed. Red-teaming and adversarial evaluation get a chapter’s worth of treatment, on the premise that adversaries will probe deployed systems just as they probe human analysts. Ish also returns repeatedly to the data side of the problem: ground truth in intelligence is often delayed, contested, or unavailable, which makes the evaluator’s job structurally harder than in commercial AI.
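One common way to quantify calibration is expected calibration error (ECE), which compares stated confidence with observed hit rate inside confidence bins. The sketch below is an illustration of that general technique, not a method attributed to the report; the function name, the bin count, and the toy inputs are all assumptions.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average gap between mean confidence and hit rate per bin.

    confidences: floats in [0, 1], the system's stated confidence.
    outcomes:    1 if the corresponding judgement proved correct, else 0.
    """
    n = len(confidences)
    if n == 0:
        return 0.0

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, outcome))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        hit_rate = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - hit_rate)
    return ece


# Hypothetical toy inputs: ten judgements, six of which proved correct.
outcomes = [1] * 6 + [0] * 4

overconfident = expected_calibration_error([0.9] * 10, outcomes)
well_calibrated = expected_calibration_error([0.6] * 10, outcomes)
print(overconfident, well_calibrated)
```

In this toy case the system that claims 90 percent confidence while being right 60 percent of the time has a calibration gap of about 0.3, while the one that claims 60 percent has a gap of about zero. Computing either figure requires knowing which judgements proved correct, which is exactly the ground truth that is often delayed, contested, or unavailable in intelligence work.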
The report sits at the practical end of RAND’s AI-for-defence catalogue, closer to acquisition guidance than to strategy. It is most useful to officials who have to write test plans, draft requirements, or judge vendor claims, and it pairs well with the broader literature on human-AI teaming that has accumulated since the IC began experimenting with machine learning at scale.
Publisher's description
- Computers