Citation Verification in Legal AI: Why Accuracy Matters More Than Speed
Even Lexis+ and Westlaw hallucinate on 1 in 6 queries. Here's how to build legal AI that verifies every citation — techniques, architecture, and why this is the most important layer in your stack.
The 1-in-6 problem
In 2024, Stanford researchers published the first rigorous, preregistered evaluation of leading legal AI research tools. The findings were sobering. Lexis+ AI, Westlaw AI-Assisted Research, and Thomson Reuters’ Ask Practical Law AI — tools marketed with claims of eliminating or avoiding hallucinations — all produced hallucinated content at significant rates. Even with RAG architectures, roughly 1 in 6 responses contained fabricated or misgrounded citations.
General-purpose chatbots performed worse: GPT-4 hallucinated on 58–82% of legal queries. But the gap between general chatbots and specialised tools was smaller than the marketing suggested.
Since that study, the problem has only become more visible. As of early 2026, more than 700 instances of AI-generated hallucinations in legal proceedings have been documented — fabricated case citations, non-existent holdings, invented legal principles submitted to courts. Lawyers have been fined, cases have been dismissed, and professional reputations have been damaged.
This context shapes everything we do when building legal AI. Citation verification isn’t a nice-to-have feature. It’s the most important layer in the stack.
Why RAG alone doesn’t solve it
RAG — retrieval-augmented generation — is supposed to prevent hallucinations by grounding the model’s responses in retrieved documents. In theory, the model only synthesises from what it finds in the knowledge base, so it can’t invent sources.
In practice, RAG introduces its own hallucination modes. The Stanford researchers identified several: the retrieval system might return documents that are textually similar but legally irrelevant (a case from the wrong jurisdiction, or a superseded statute). The model might correctly cite a real case but mischaracterise its holding — stating that a court ruled one way when it actually ruled the opposite. The model might blend information from multiple retrieved passages in ways that create a claim none of the individual sources actually support.
These failures are harder to detect than outright fabrication because the citations are real — the error is in what the AI says about them. This is what the Stanford team calls “misgrounded” responses, and they’re arguably more dangerous than obvious fabrications because they’re harder to catch.
How we build verification
Our approach to citation verification has three layers, applied after the LLM generates its response.
The first layer is citation existence checking. For every citation in the response, we verify that the cited document exists in our corpus. This catches outright fabrications — citations to cases or documents that don’t exist. It’s computationally cheap and catches the most egregious errors.
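In code, the first layer amounts to extracting citations and looking each one up in an index. The sketch below is a minimal illustration, not our production implementation: the regex, the citation format, and the in-memory set standing in for a corpus index are all simplifying assumptions (real legal citations need a proper parser and a document database).

```python
import re

# Illustrative citation pattern and corpus index -- a production system
# would use a dedicated legal-citation parser and a document store.
CITATION_RE = re.compile(r"\[([A-Za-z0-9 .v&'-]+, \d{4})\]")

def check_citation_existence(response_text, corpus_index):
    """Return every citation in the response that is NOT in the corpus."""
    cited = CITATION_RE.findall(response_text)
    return [c for c in cited if c not in corpus_index]

corpus = {"Smith v. Jones, 2019", "Doe v. Roe, 2021"}
response = ("The duty was established in [Smith v. Jones, 2019] and "
            "extended in [Acme v. Baker, 2023].")
missing = check_citation_existence(response, corpus)
# `missing` now holds the fabricated citation, ready to be flagged
```

Because this is a set lookup per citation, it adds almost no latency, which is why it runs first.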
The second layer is passage alignment. For each citation, we verify that the cited passage actually says what the response claims it says. We extract the specific claim being attributed to the source, retrieve the relevant passage from the source document, and compute semantic similarity between the claim and the passage. If the similarity is below a threshold, the citation is flagged. This catches mischaracterisations — situations where the AI cites a real source but gets the content wrong.
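The shape of the second layer is a claim-versus-passage similarity score compared against a threshold. The sketch below uses a bag-of-words cosine similarity purely as a cheap, self-contained stand-in — in practice we use a semantic model (see the implementation tips further down), and the 0.4 threshold is illustrative, not a recommendation.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity -- a stand-in for the semantic
    similarity model a production system would use."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def align_citation(claim: str, passage: str, threshold: float = 0.4):
    """Flag the citation when the claim is not supported by the passage."""
    score = cosine_similarity(claim, passage)
    return {"score": score, "flagged": score < threshold}
```

The key design point survives the simplification: alignment is checked per claim, not per response, so one mischaracterised citation can be flagged without discarding an otherwise sound answer.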
The third layer is logical consistency checking. We verify that the overall response doesn’t contain internal contradictions and that the reasoning flow logically connects the cited sources to the conclusions. This is the hardest layer to implement and the most computationally expensive, but it catches the subtle errors that escape the first two checks.
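The third layer is hard to show concretely, but its skeleton is a pairwise scan over the response's claims. In the sketch below, the `contradicts` judgement is injected as a callable — in production that role would be filled by a natural-language-inference model; the keyword-based stub here exists only to make the example runnable.

```python
from itertools import combinations

def check_consistency(sentences, contradicts):
    """Return every pair of sentences the `contradicts` judge flags.
    `contradicts(a, b)` would be an NLI model in production."""
    return [(a, b) for a, b in combinations(sentences, 2)
            if contradicts(a, b)]

def toy_contradicts(a, b):
    # Purely illustrative stub: flags a bare negation mismatch.
    return ("liable" in a and "liable" in b
            and ("not liable" in a) != ("not liable" in b))

sents = ["The defendant was liable.", "The defendant was not liable."]
conflicts = check_consistency(sents, toy_contradicts)
# `conflicts` contains the contradictory pair
```

The pairwise scan is quadratic in the number of claims, which is one reason this layer is the most expensive of the three.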
When any layer flags an issue, the system has several options depending on the configuration: regenerate the response with stricter constraints, remove the problematic citation and rephrase, present the response with a low-confidence warning, or escalate to human review.
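That configuration-dependent dispatch can be sketched as a small policy function. The flag names, config keys, and priority order below are illustrative assumptions, not our actual schema — the point is that remediation is a deliberate, configurable decision rather than an ad-hoc retry.

```python
from enum import Enum

class Action(Enum):
    REGENERATE = "regenerate"
    REMOVE_CITATION = "remove_citation"
    LOW_CONFIDENCE_WARNING = "low_confidence_warning"
    HUMAN_REVIEW = "human_review"

def choose_action(flags, config):
    """Map verification flags to a remediation action.
    Flag names and config keys are illustrative."""
    if not flags:
        return None  # nothing to remediate
    if config.get("strict", False):
        return Action.HUMAN_REVIEW
    if all(f == "low_similarity" for f in flags):
        return Action.REMOVE_CITATION  # rephrase without the weak citation
    if config.get("max_regenerations", 0) > 0:
        return Action.REGENERATE
    return Action.LOW_CONFIDENCE_WARNING
```

A strict deployment (for example, anything filed with a court) routes every flag to human review; a lower-stakes internal tool can tolerate automatic regeneration.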
The “I don’t know” problem
Equally important is training the system to recognise when it doesn’t have enough information to answer reliably. In the Stanford study, one of the key findings was that legal AI tools rarely said “I don’t know” — they almost always provided an answer, even when the retrieved sources didn’t support one.
We treat the “I don’t know” response as a design goal, not a failure mode. When the retrieval system returns no passages above the relevance threshold, or when the retrieved passages contradict each other, or when the confidence score falls below an acceptable level, the system should say so clearly. In our experience, legal professionals strongly prefer a system that says “I don’t have sufficient information to answer this reliably — here are the most relevant sources I found” over one that gives a confident but unverifiable answer.
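Treating "I don't know" as a first-class outcome means the decline conditions are explicit code paths, not emergent behaviour. A minimal sketch of that gate, with illustrative thresholds and parameter names:

```python
def should_decline(retrieved, confidence, contradictory=False,
                   relevance_threshold=0.5, confidence_floor=0.6):
    """Decide whether to answer or decline.
    `retrieved` is a list of (passage, relevance_score) pairs;
    thresholds and the contradiction flag are illustrative inputs."""
    usable = [p for p, score in retrieved if score >= relevance_threshold]
    if not usable:
        return True, "no passage cleared the relevance threshold"
    if contradictory:
        return True, "retrieved passages contradict each other"
    if confidence < confidence_floor:
        return True, "confidence below acceptable level"
    return False, ""
```

Returning the reason alongside the decision matters: it lets the system tell the user *why* it declined and surface the most relevant sources it did find.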
For the AAA ChatBook tools, this meant defining clear boundaries: the system answers within the scope of its grounding material and explicitly declines to answer outside it. No improvising, no pulling from general knowledge, no guessing.
“The hardest thing to teach an AI system is humility. LLMs are trained to be helpful, which means they’re biased toward providing an answer — any answer — rather than admitting uncertainty. For legal AI, you have to actively engineer against this bias. The ‘I don’t know’ path needs to be as well-designed as the answer path.”
Practical implementation tips
If you’re building legal AI and want to add citation verification, here are the concrete steps:
Start with the cheapest check first. Citation existence verification (does this source exist in our corpus?) is fast and cheap. Implement it immediately — it catches the worst failures with minimal latency cost.
For passage alignment, use a cross-encoder rather than cosine similarity. Cross-encoders are more accurate at judging whether a claim is actually supported by a passage, though they’re slower. The accuracy tradeoff is worth it for legal applications.
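One way to keep the cross-encoder swappable is to inject the scorer as a callable, so the same verification code works with any pair-scoring model. In the sketch below, `score_pairs` is an assumption standing in for whatever model you choose (a cross-encoder's batch-predict function fits this shape); the stub scorer in the test exists only to keep the example self-contained.

```python
def verify_with_cross_encoder(score_pairs, claim, passage, threshold=0.5):
    """`score_pairs` is any callable scoring a list of (claim, passage)
    pairs -- in production, a cross-encoder's predict function, which
    jointly encodes both texts rather than comparing separate embeddings.
    The 0.5 threshold is illustrative; calibrate on your own data."""
    score = float(score_pairs([(claim, passage)])[0])
    return score >= threshold
```

The joint encoding is what buys the accuracy: the model attends across both texts at once instead of compressing each into an independent vector first, which is exactly what you need to judge whether a specific claim is supported.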
Set confidence thresholds conservatively. It’s better to flag a correct response for review than to let an incorrect one through. You can always lower thresholds as you build confidence in the system.
Log everything. Every verification check, every flag, every regeneration. This data is invaluable for improving the system and essential for EU AI Act compliance.
Measure your hallucination rate and publish it (internally, at least). If you don’t measure, you can’t improve. We track citation accuracy, passage alignment scores, and “I don’t know” appropriateness across all our legal AI deployments.
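The logging and measurement tips above can be combined in one structure: record every check outcome, then derive flag rates per check from the same events. The field names below are illustrative, not a fixed schema.

```python
import time

class VerificationLog:
    """Append-only log of verification events; one record per check,
    so hallucination-rate metrics fall out of the same data used
    for audit trails. Field names are illustrative."""

    def __init__(self):
        self.events = []

    def record(self, query_id, check, passed, detail=""):
        self.events.append({
            "ts": time.time(), "query_id": query_id,
            "check": check, "passed": passed, "detail": detail,
        })

    def flag_rate(self, check):
        """Fraction of events for `check` that failed."""
        relevant = [e for e in self.events if e["check"] == check]
        if not relevant:
            return 0.0
        return sum(not e["passed"] for e in relevant) / len(relevant)

log = VerificationLog()
log.record("q1", "existence", passed=True)
log.record("q2", "existence", passed=False, detail="fabricated citation")
log.record("q2", "alignment", passed=True)
```

Keeping raw events rather than pre-aggregated counters is deliberate: the same log serves metrics, debugging, and the audit-trail requirements mentioned above.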
Building legal AI that needs to be reliable? Contact us — we’ll show you how citation verification works in practice and help you implement it.