When AI Gets It Wrong: The Hidden Risks in AI-Generated Reports


In July 2025, Deloitte, one of the world’s largest consulting firms, handed the Australian government an independent assurance report for which it was paid roughly A$440,000 (about US$290,000). It looked impeccable: 237 pages, properly cited, written in the house style of a firm that has audited the world for a century. By every signal an organisation normally uses to judge quality, it passed.

Then the report reached someone who happened to know the material cold. Dr Chris Rudge, a welfare law academic at the University of Sydney, was reading the report when he noticed a citation to a book that did not exist, supposedly written by a colleague he knew. Once he pulled that thread, the rest unravelled: more than a dozen fabricated references attributed to academics he knew, and an invented quote attributed to a federal court judgment.

When pressed on it, Deloitte confirmed the errors, disclosed that it had used a generative AI tool to help produce the work, and agreed to refund part of its fee.

The AI did exactly what it was built to do: produce fluent, confident, plausibly sourced text. The failure sits in the process around it. Whoever used the tool to produce that research should have been working to a clear verification standard: treating AI output as an unchecked draft and confirming every citation, quote and figure against its original source before anything left the desk.

It would be comforting to treat the Deloitte report as a one-off event, but the numbers say otherwise

When Stanford’s Human-Centred AI institute (HAI) researchers tested the legal AI tools sold specifically to law firms, they found the tools still invented answers in roughly one out of every six queries. (Stanford HAI, Stanford RegLab paper)

These were not consumer chatbots being used carelessly. The tools were built and sold for accuracy, and their intended users were lawyers, trained to present the truth. Even under those near-ideal conditions, one answer in six was wrong. Outside those conditions, it is worse.

McKinsey’s 2025 survey found that more than half of organisations using AI had already experienced at least one negative consequence, and close to a third traced the damage specifically to inaccuracy. Adoption of AI is now nearly universal, but the reliability of its output has not caught up, and there is no grace period while it does.

The same failure across different industries

What makes the pattern unmistakable is how little the setting matters. In each of the examples below, a machine produced something fluent, confident, and false, and it reached the outside world because no one responsible for checking it had done so.

  • The courts. Lawyers and self-represented litigants have been filing court documents based on materials hallucinated by AI, including citations to non-existent cases, without a human verifying that the cited cases actually existed. A Paris-based researcher, Damien Charlotin, maintains a public database of rulings in which a judge caught the fabrications, and by mid-2026, it had surpassed 1,600 entries. Judges have stopped treating these as honest errors, fining the lawyers responsible and ordering some to cover the opposing side’s wasted hours. Sanctions are now reaching tens of thousands of dollars. The issue keeps surfacing because of how litigation works: every filing is read by an opponent paid to find exactly this kind of mistake. (Charlotin, Bloomberg Law)

  • The newspapers. In May 2025, both the Chicago Sun-Times and the Philadelphia Inquirer published a summer reading list of 15 books. Ten of them did not exist. The authors were real and well-known, but the book titles were not. A freelancer used an AI tool to build the list and sent it on without fact-checking; the company that supplied the papers passed it along without confirmation; and two newsrooms printed it without verification. The fallout was immediate: both papers ran corrections and apologies, the company dropped the writer, and the episode became a national example of AI fabrication reaching print. (New York Times, NPR)

  • The market. In 2023, Google launched its Bard chatbot with a heavily promoted public demo, and one of its answers was wrong. It credited the James Webb Space Telescope with the first image of a planet outside our solar system. That image had been taken years earlier by a ground-based telescope. No one had checked the source before the demo was published. Once the error spread, Alphabet’s shares fell 7.7%, and roughly $100 billion was shaved from its market value. That figure is often quoted as the price of the AI error, but the label inflates it: the market was reacting to wider doubts about Google’s readiness as much as to the single wrong answer. What holds regardless is the pattern. The information went public without anyone verifying it first, and because the launch was so visible, astronomers and reporters caught the mistake within hours, while the market punished the company the same day. (CNN)

Every one of these failures was caught, and for the same reason: the field in which it happened had a checker. Litigation has opposing counsel. Publishing has readers who know the catalogue. Public markets have analysts. In each case, the error surfaced because someone outside the organisation had every reason to go looking for it.

Now ask the same question of your own work

When a board paper, an investment memo, a client deliverable, or a policy brief leaves your organisation, who is the opposing counsel? Who is the reader who would catch an AI-hallucinated citation before it reached the decision? If you can name that person and that checkpoint, you are in better shape than most. If you cannot, you have found the gap.

The fix is not a better AI model

The reason this is so hard to own comes down to one thing: AI hallucinations are invisible. Only a fraction of the generated report is wrong, and nothing on the page indicates which part, so the only way to catch it is to open and check every source manually. That leaves the reader with two options: trust the whole document, or re-verify it line by line and hand back the very time the AI was supposed to save. Under a tight deadline, most people trust it and hope for the best.

Resolving that dilemma is the whole point of SPARK, a tool that we have built. SPARK stands for Self-verifying, Portable, Agentic Research Kit. It works on top of existing AI tools, checking each claim they produce against its source and making sure the claims stay current with the field. Where the evidence genuinely conflicts, SPARK does the opposite of what a general-purpose model will do: instead of resolving the disagreement into a confident answer, it flags the claim as contested and shows you why.
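
For readers who think in code, the idea of flagging rather than resolving can be sketched in a few lines of Python. This is an illustration only and not SPARK's actual implementation: the names `Claim`, `Evidence`, `Status`, `classify` and `review_queue` are hypothetical, and the sketch assumes some earlier step has already looked up each source and recorded whether it supports the claim.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUPPORTED = "supported"    # every source checked backs the claim
    REFUTED = "refuted"        # every source checked contradicts it
    CONTESTED = "contested"    # sources disagree with each other
    UNVERIFIED = "unverified"  # no source could be checked at all


@dataclass
class Evidence:
    source: str     # hypothetical identifier for the source that was checked
    supports: bool  # True if that source backs the claim


@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)


def classify(claim: Claim) -> Status:
    """Label a claim without forcing disagreement into a single answer."""
    if not claim.evidence:
        return Status.UNVERIFIED
    verdicts = {e.supports for e in claim.evidence}
    if verdicts == {True}:
        return Status.SUPPORTED
    if verdicts == {False}:
        return Status.REFUTED
    return Status.CONTESTED  # mixed verdicts: surface it, do not resolve it


def review_queue(claims: list[Claim]) -> list[tuple[str, Status]]:
    """Return only the claims a human still needs to look at."""
    labelled = [(c.text, classify(c)) for c in claims]
    return [(text, status) for text, status in labelled
            if status is not Status.SUPPORTED]
```

The design choice worth noticing is that a claim with no checkable source is never treated as supported, and a claim with conflicting sources is never collapsed into a yes or a no. Both land in the review queue, which is the short list a human reader actually has to open.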

In our own testing, that reduced the time required to fully vet an AI-generated report from 50+ hours to roughly 20 minutes. If that is the discipline you want behind the work that carries your name, read more about what we are building with SPARK.
