NEJM Study: AI Not Ready to Perform Basic Medical Coding, Let Alone Replace People

In the battle of artificial intelligence (AI) vs. humans, score one for humanity.

 

Yeah, a provocative opening line, and probably needlessly so. AI is, after all, supposed to be a tool that lets us do more and/or focus on higher-order tasks, resulting in increased efficiency and output.

 

I just don’t happen to like the way some companies position their AI-powered solutions (as human replacements) or choose to employ them (insurance companies denying claims en masse with AI, which is happening now; see the STAT article below).

 

These are both misuses of the tech, and now a recent study has further confirmed the limitations of AI in medical coding.

 

A recent study published in NEJM AI, the New England Journal of Medicine’s online artificial intelligence journal, and reported on the Mount Sinai website found that large language models (LLMs) including GPT-4, GPT-3.5, Gemini Pro, and Llama-2-70b were less than 50% accurate when generating medical codes.

 

Per the Mount Sinai press release (see link below), the study extracted a list of more than 27,000 unique diagnosis and procedure codes from 12 months of routine care in the Mount Sinai Health System, while excluding identifiable patient data. Using the description for each code, the researchers prompted models from OpenAI, Google, and Meta to output the most accurate medical codes. 

 

The generated codes were compared with the original codes, and errors were analyzed for patterns. GPT-4 demonstrated the best performance, with the highest exact match rates for ICD-9-CM (45.9%), ICD-10-CM (33.9%), and CPT codes (49.8%).
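
For readers curious what “exact match” scoring actually entails, here is a minimal Python sketch of that kind of benchmark. It is purely illustrative and not the study’s code: the query_llm function is a hypothetical placeholder for whatever model API you use, and the two sample codes are just a toy example.

```python
# Illustrative sketch of an exact-match benchmark for medical code querying.
# query_llm() is a hypothetical placeholder, NOT the study's implementation.

def query_llm(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM API and return its reply."""
    raise NotImplementedError("Wire this up to the model of your choice.")

def exact_match_rate(code_descriptions: dict[str, str], code_system: str) -> float:
    """Prompt the model with each code's description alone, then score
    strict string equality against the original code."""
    hits = 0
    for true_code, description in code_descriptions.items():
        prompt = (
            f"Return only the {code_system} code for this description: "
            f"{description}"
        )
        predicted = query_llm(prompt).strip()
        if predicted == true_code:  # exact match only; near misses count as errors
            hits += 1
    return hits / len(code_descriptions)

# Two illustrative ICD-10-CM entries (real codes, but a toy sample):
sample = {
    "I21.9": "Acute myocardial infarction, unspecified",
    "E11.9": "Type 2 diabetes mellitus without complications",
}
# accuracy = exact_match_rate(sample, "ICD-10-CM")
```

The point of the exercise: anything short of a character-for-character match counts as a miss, which is the same standard a production coding workflow would demand.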

 

This is well below the 95% accuracy rate most organizations expect of human coders.

 

The study concluded, “All tested LLMs performed poorly on medical code querying, often generating codes conveying imprecise or fabricated information. LLMs are not appropriate for use on medical coding tasks without additional research.”

 

Note: This is not meant as a blanket dismissal of dedicated clinical documentation integrity (CDI) and coding solutions powered by natural language processing (NLP), which are typically built with encoder logic and populated with Coding Clinic references to improve their accuracy.

 

Notably, the study apparently did not feed the LLMs the supporting clinical documentation behind the codes, merely the code descriptions themselves. All in all, it's a bit of an odd approach, and not how coders actually work.

 

But the takeaway is that an LLM is, at bottom, a sophisticated word-prediction model. There is no thinking going on behind the screen.

 

My colleague Jason Jobes gave a great example of this in a recent presentation: a machine falsely identifying patients living in the state of Michigan as having an MI (myocardial infarction). Why? MI is also the state's postal abbreviation, and it appeared throughout their medical records.
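
Here is a contrived illustration of how that kind of mistake happens: a bare keyword search for “MI” fires on the state abbreviation just as readily as on the diagnosis. The snippet below is hypothetical (it is not Jason's example system or any vendor's product), but it shows why context matters.

```python
import re

notes = [
    "Patient presented with chest pain; ruled out MI after serial troponins.",
    "Patient resides in Lansing, MI. No acute complaints today.",
]

# Naive approach: flag any note containing the token "MI".
naive_flags = [bool(re.search(r"\bMI\b", note)) for note in notes]
print(naive_flags)  # [True, True] -- the Michigan resident is falsely flagged

# Even a light context check changes the result: ignore "MI" when it
# follows a city-comma pattern typical of an address line.
def looks_like_state_abbrev(note: str) -> bool:
    return bool(re.search(r",\s*MI\b", note))

better_flags = [
    bool(re.search(r"\bMI\b", note)) and not looks_like_state_abbrev(note)
    for note in notes
]
print(better_flags)  # [True, False]
```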

 

Not surprisingly, I’ve had pushback on this article from (you guessed it) tech companies selling AI-powered coding solutions. One commenter tried to undercut the study and my article by noting that the database powering ChatGPT is only updated through 2022, and so is not current. This is a non sequitur. It fails to explain why GPT-4 was only 45.9% accurate with ICD-9-CM coding; ICD-9 was retired long ago, has not been updated since 2013, and was ingested into ChatGPT’s training data long before that 2022 cutoff. Since 2022 both CPT and ICD-10 have had updates, but they are minimal compared to the entirety of the code sets, so this also fails to account for the woefully poor performance of autonomous AI coding.

 

In short, the world needs you, coders and CDI professionals, to think and to apply your irreplaceable contextual experience. It needs all of us.

 

At least until the day we get artificial general intelligence (AGI), at which point everything changes for everyone, all at once. And that’s not coming any time soon.

 

References

 

Mount Sinai, “Despite AI Advancements, Human Oversight Remains Essential”:

https://www.mountsinai.org/about/newsroom/2024/despite-ai-advancements-human-oversight-remains-essential

 

NEJM AI, “Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying”:

https://ai.nejm.org/doi/full/10.1056/AIdbp2300040 

 

STAT, “Denied by AI: How Medicare Advantage plans use algorithms to cut off care for seniors in need”:

https://www.statnews.com/2023/03/13/medicare-advantage-plans-denial-artificial-intelligence/
