NEJM Study: AI not Ready to Perform Basic Medical Coding, Let Alone Replace People

By Brian Murphy

In the battle of artificial intelligence (AI) vs. humans, score one for humanity.

Yeah, a provocative opening line, and probably needlessly so. AI is, after all, supposed to be a tool that allows us to do more and/or focus on higher-order tasks, resulting in increased efficiency and output.

I just don’t happen to like the way some companies position their AI-powered solutions (as human replacement) or choose to employ them (insurance companies denying claims en masse with AI, which is happening now).

These are both misuses of the tech, and now a recent study has further confirmed the limitations of AI in medical coding.

The study, published online in NEJM AI (a journal in the New England Journal of Medicine family) and reported on the Mount Sinai website, found that large language models (LLMs) including GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b were less than 50% accurate at medical coding.

Per the Mount Sinai press release (see link below), the study extracted a list of more than 27,000 unique diagnosis and procedure codes from 12 months of routine care in the Mount Sinai Health System, while excluding identifiable patient data. Using the description for each code, the researchers prompted models from OpenAI, Google, and Meta to output the most accurate medical codes.

The generated codes were compared with the original codes and errors were analyzed for any patterns. GPT-4 demonstrated the best performance, with the highest exact match rates for ICD-9-CM (45.9 percent), ICD-10-CM (33.9 percent), and CPT codes (49.8 percent).

This is well below the 95% accuracy rate most organizations expect out of humans.
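For readers who want to see what an "exact match rate" means in practice, here is a minimal sketch in Python. It is not the study's code: the function name, the normalization step, and the four sample codes are all my own assumptions, made up purely for illustration.

```python
def exact_match_rate(original_codes, generated_codes):
    """Share of generated codes that are identical to the original code.

    Hypothetical helper, not taken from the study. A near miss counts
    as a miss: only an exact string match scores.
    """
    if len(original_codes) != len(generated_codes):
        raise ValueError("Both lists must have the same length")
    if not original_codes:
        return 0.0

    def normalize(code):
        # Assumption: ignore surrounding whitespace and letter case only.
        return code.strip().upper()

    matches = sum(
        normalize(original) == normalize(generated)
        for original, generated in zip(original_codes, generated_codes)
    )
    return matches / len(original_codes)


# Illustrative strings only, not data from the study.
original = ["I10", "E11.9", "J45.909", "N18.3"]
generated = ["I10", "E11.8", "J45.909", "N18.30"]

rate = exact_match_rate(original, generated)
print(f"Exact match rate: {rate:.1%}")
```

Note that a code that is off by a single character scores the same as a code that is completely wrong, which is why exact match is a strict measure.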

The study concluded, “All tested LLMs performed poorly on medical code querying, often generating codes conveying imprecise or fabricated information. LLMs are not appropriate for use on medical coding tasks without additional research.”

Note: This is not meant as a blanket dismissal of dedicated CDI and coding solutions powered by natural language processing (NLP), which are typically built with encoder logic and populated with Coding Clinic references to improve their accuracy.

And, the study apparently did not feed the LLMs all of the supporting documentation for the codes, merely the code descriptions themselves. All in all, a bit of an odd approach and not how coders actually work.

But the takeaway is that an LLM is only a sophisticated word prediction model. There is no thinking going on behind the screen.

My colleague Jason Jobes gave a great example of this in a recent presentation: a machine falsely identified patients living in the state of Michigan as having had an MI (myocardial infarction). Why? MI is also the state abbreviation, and it appeared throughout their medical records.
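To make the failure mode concrete, here is a small sketch of context-free matching in Python. This is not the system from Jason's presentation, whose workings I don't know; the function name and both sample records are hypothetical and invented for illustration.

```python
import re


def naive_mi_flag(record_text):
    """Flag a record if the token 'MI' appears anywhere in it.

    Hypothetical and deliberately naive: it looks at the letters only,
    with no idea whether 'MI' sits in an address or a cardiac history.
    """
    return re.search(r"\bMI\b", record_text) is not None


# Invented sample records, not real patient data.
records = {
    "patient_a": "Address: 12 Main St, Detroit, MI 48201. Seen for annual physical.",
    "patient_b": "History of MI in 2019, now on aspirin and statin.",
}

flags = {name: naive_mi_flag(text) for name, text in records.items()}

for name, flagged in flags.items():
    print(name, "flagged for MI:", flagged)
```

Both records get flagged, even though only the second one mentions a heart attack. Telling the two apart takes context, which is exactly what a coder or CDI specialist brings.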

The world needs you, coders and CDI. To think and apply your irreplaceable contextual experience. It needs all of us.

At least until the day we get artificial general intelligence (AGI), at which point everything changes for everyone, all at once. And that’s not coming any time soon.


Mount Sinai, “Despite AI Advancements, Human Oversight Remains Essential”:

NEJM AI, “Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying”:
