Ontario auditors find doctors' AI note takers routinely blow basic facts

(theregister.com)

311 points by sohkamyung 3 days ago | 138 comments

rainsford 3 days ago |

I have generally moved from bearish to bullish on the future of current AI technology, but the continued inaccuracy with basic facts all while the models significantly improve continues to give me significant pause.

As an example, creating recipes with Claude Opus based on flavor profiles and preferences feels magical, right up until the point at which it can't accurately convert between tablespoons and teaspoons. It's like the point in the movie where a character is acting nearly right but something is a bit off and then it turns out they're a zombie and going to try to eat your brain. This note taking example feels similar. It nearly works in some pretty impressive ways and then fails at the important details in a way that something able to do the things AI can allegedly do really shouldn't.

It's these failures that make me more and more convinced that while current generation AI can do some pretty cool things if you manage it right, we're not actually on the right track to achieve real intelligence. The persistence of these incredibly basic failure modes even as models advance makes it fairly obvious that continued advancement isn't going to actually address those problems.

cootsnuck 3 days ago | |

Yup, spot on. There's a capability-reliability gap that the industry does not like to talk about too much.

It often feels like the AI industry is continually glossing over the fact that capability and reliability are fundamentally different qualities. We tend to use "accurate" and "reliable" interchangeably, but they describe different things. A model can ace a benchmark (capability/accuracy) and still be a liability in production (reliability).

Just look at recent reactions to yet another release from METR showing improved capabilities. The less talked about part is that their headline measure is a task time horizon at a 50% success rate (and the even less talked about secondary measure, at an 80% success rate, has a drastically shorter time horizon). https://metr.org/

I implement AI systems for enterprises and I don't know any that would ever be okay with 80% reliability (let alone 50%).
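
To put rough numbers on why 80% is not good enough, here is a back-of-the-envelope sketch. It assumes a workflow is a chain of independent steps that each succeed with the same probability; that is a simplifying assumption for illustration, not METR's methodology, and `chain_success` is a made-up helper name.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


# Compare a few per-step success rates over a 10-step workflow.
for p in (0.5, 0.8, 0.99):
    print(f"per-step {p:.2f} -> 10-step chain {chain_success(p, 10):.3f}")
```

Under that assumption, an 80% per-step success rate leaves roughly an 11% chance that a 10-step chain finishes without a single error.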

jcgrillo 3 days ago | | |

This capability-reliability gap (excellent term btw, more people need to think in these terms or we'll be in real trouble) is also infecting LLM assisted outputs. I just tried VSCode again tonight after a ~3yr hiatus and goddamn has it deteriorated. Lots of new features, lots of interesting looking plugins, but 3 out of the 5 plugins I tried for code CAD (the reason I downloaded VSCode again at all) were completely unusable--like couldn't even be made to work at all--and the other two didn't do anything like what they claimed. Also VSCode itself got into some kind of spastic loop trying to log me into github, and seemed incapable of recognizing the virtual environment in a python project's workspace... It also feels like the UI got even slower. This situation is bad.

retrochameleon 2 days ago | |

I was skeptical that LLMs could be the right path to AGI, but then I kept being impressed by how much further we could take it by expanding upon the way we use it, the harnesses we use with LLMs, and better context engineering.

When I see how LLMs are capable of essentially prompt and context engineering for themselves, it makes me think they won't need human guidance forever.

When it comes to simple fact-based tasks that have a concrete methodology, it is no surprise to me that LLMs aren't the right tool, and I believe it's a failure of the harness to not recognize those types of tasks and handle them with a more concretely functioning tool instead of relying on statistical probabilities in the LLM "brain" to spit out the correct number to a math problem.

In the same sense that LLMs can use "skills" when necessary, they should have tools, or possibly even specialized "brains", to pass off certain types of tasks to.

I'm starting to feel that our first form of AGI is not going to be a single brain but an elaborate system of harnesses, multiple LLM models, skills, domain and task specialized subsystems it passes tasks off to, etc. Whether we get there with current LLM technology before some other evolution in AI is the question, to me.

djhn 2 days ago | | |

This sounds a lot like ignoring the Bitter Lesson, and expending a lot of effort rebuilding slightly better Expert Systems.

smusamashah 3 days ago | |

Your analogy reminds me of the messed up fingers and hands in image generation models just a year ago. Now that is pretty much solved. These days they are generating videos you can't tell apart from reality. This makes me believe these errors will keep shrinking and eventually become very hard to notice in maybe every task.

sillyfluke 2 days ago | | |

I would suggest slightly adjusting your expectations by factoring in the difference between video training data and text training data. Due to computation and cost limitations, the idea of video training data being polluted with AI video slop is less of a thing. Also, humans don't generate a lot of biology and physics defying fictional video relative to the abundance and generation ease of real-life video.

The main problem currently with LLM text is not that they create incoherent sentences, it's that what they purport to be statements of fact or general consensus often aren't, because they are bullshit machines that become better and more accurate bullshitters the more context-accurate data they are fed. AI videos may still have issues with "looking plausible", whereas LLM text currently has fewer issues with "sounding plausible" and more issues with "being correct" with respect to reality, which they have no direct connection to.

No one is penalizing an AI video generator for creating a scene that never happened in real life.

cyberrock 2 days ago | |

If Claude occasionally overestimates the conversion, that might be an artifact of Australian tablespoons being different (4tsp/20mL vs 3tsp/15mL in the US). This error could at least be explained as a complication of the real world.

(If it's saying 3.14tsp or 2tsp then I have no idea)

eigencoder 2 days ago | | |

Usually tsp means teaspoons, and 4tsp / 20 mL vs 3tsp / 15 mL are the same ratios of tsp to mL, aren't they?
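
A minimal sketch of the arithmetic in question, using the rounded figures quoted in this thread (5 mL teaspoon, 15 mL US tablespoon, 20 mL Australian tablespoon). The unit keys and the `convert` helper are hypothetical names for illustration. The teaspoon is the same size in both cases, so the tsp-to-mL ratio matches, but the tablespoon-to-teaspoon factor does not.

```python
# Millilitres per unit, using the rounded figures quoted in the thread.
ML_PER_UNIT = {
    "tsp": 5.0,
    "tbsp_us": 15.0,
    "tbsp_au": 20.0,
}


def convert(amount: float, src: str, dst: str) -> float:
    """Convert a volume between units by going through millilitres."""
    return amount * ML_PER_UNIT[src] / ML_PER_UNIT[dst]


print(convert(1, "tbsp_us", "tsp"))  # 3.0
print(convert(1, "tbsp_au", "tsp"))  # 4.0
```

So a recipe converted with the Australian factor would come out a third larger than the US one, which is the kind of overestimate described above.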

igleria 3 days ago | |

Yesterday I was using opus 4.6 through copilot (don't ask...) to rubber-duck-brainstorm a big feature that needs a lot of care.

I got some inspiration from it but it misinterpreted very basic stuff. might be a skill issue on my side, I do not know.

themafia 3 days ago | |

> we're not actually on the right track to achieve real intelligence.

Real intelligence means you have to say "I don't know" when you don't know, or ask for help, or even just refuse to help, with the subtext being that you don't want to appear stupid.

The models could ostensibly do this when they have low confidence in their own results, but they don't. What I don't know is whether that's because it would be very computationally difficult or because it would harm the reputation of the companies charging a good sum to use them.

cmrdporcupine 3 days ago | | |

That's just not how they work, really. They don't know what they don't know and their process requires an output.

I think they're getting better at it, but it's likely just the number of parameters getting bigger and bigger in the SOTA models more than anything.

bluefirebrand 3 days ago | | |

My theory is because the people building the models and in charge of directing where they go love the sycophantic yes-man behavior the models display

They don't like hearing "I don't know"

vintagedave 2 days ago | | |

> Real intelligence means you have to say "I don't know" when you don't know

I have met many supposedly intelligent, certainly high status, humans who don't appear to be able to do that either.

I have more confidence we can train AIs to do it, honestly.

wagwang 3 days ago | | |

You can just tell the agent to do exactly that

colechristensen 3 days ago | | |

You can TELL the models to do this and they'll follow your prompt.

"Give me your answer and rate each part of it for certainty by percentage" or similar.

Brian_K_White 3 days ago | |

I hate to help provide possible solutions to an entire process I don't approve of, but maybe the fuzzy tools need old style deterministic tools the same way and for the same reasons we do.

So instead of an LLM trying to answer a math or reason question by finding a statistical match with other similar groups of words it found on 4chan and the all in podcast and a terrible recipe for soup written by a terrible cook, it can use a calculator when it needs a calculator answer.
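
As a rough sketch of what such a calculator tool could look like: the function below evaluates plain arithmetic deterministically and rejects everything else. The name `calculator` and the idea that an agent would be handed this function are assumptions for illustration; it is not any vendor's actual tool interface.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _is_number(value: object) -> bool:
    """True for int/float literals, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def calculator(expression: str) -> float:
    """Deterministically evaluate +, -, *, / arithmetic on numeric literals."""

    def evaluate(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and _is_number(node.value):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


print(calculator("3 * 4 + 2"))  # 14
```

The same input always gives the same output, which is the property the statistical guess lacks. Deciding when to call the tool is still left to the model.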

cootsnuck 3 days ago | | |

They absolutely need deterministic tools. What you just described is exactly how the current popular AI agents work. They use "harnesses", which to me is just a rebranding of what we have known all along about building useful and reliable software...composable orchestrated systems with a variety of different pieces selected based on their capabilities and constraints being glued together for specific outcomes.

It just feels like for some reason this is all being relearned with LLMs. I guess shortcuts have always been tempting. And the idea of a "digital panacea" is too hard to resist.

analog31 3 days ago | | |

Doesn't agentic AI do this? I've got AI running in VS Code. If I ask it for something, it can fill a code cell with a little bit of Python, and then run it with my approval. It's using the Python interpreter on my computer as a calculator.

stevula 3 days ago | | |

I think that is how the smarter agents do things? Just like Claude/ChatGPT sometimes does a web search they can do other tool calls instead of just making a statistical guess. Of course it doesn’t always make the bright choice between those options though…

epcoa 3 days ago | | |

That’s exactly how all the current cloud chat bots and agents work now.

colechristensen 3 days ago | | |

No, they just need to be trained to have adversarial self review "thinking" processes.

You ask an LLM "What's wrong with your answer?" and you get pretty good results.

zOneLetter 3 days ago |

Anecdotally, we use an LLM note-taker at work for meetings. I had to intervene recently because our CIO was VERY angry at our vendor for something they promised to do and never did. He wasn't at the meeting where the "promise" was made. I was. They never promised anything, and the discussion was significantly more nuanced than what the LLM wrote in the detailed summary.

In other cases, I have seen it miss the mark when the discussion is not very linear. For example, if I am going back and forth with the SOC team about their response to a recent alert/incident. It'll get the gist of it right, but if you're relying on it for accuracy, holy hell does it miss the mark.

I can see the LLM take great notes for that initial nurse visit when you're at the hospital: summarize your main issue, weight, height, recent changes, etc. I would not trust it when it comes to a detailed and technical back-and-forth with the doctor. I would think for compliance reasons hospitals would not want to alter the records and only go by transcripts, but what do I know...

Groxx 3 days ago |

Yep. It happened to me just recently.

Diagnosed with Runner's Knee.

AI summary said I was diagnosed with osteoporosis, and had hip pain and walking difficulty, though literally none of that was ever said or implied.

CHECK YOUR TRANSCRIPTS. Always, but especially with LLM transcribers, which fairly frequently include common symptoms which don't exist, or claim a diagnosis which is common and fits a few details but not others. Get them fixed, it can very strongly affect your care and costs later if it's wrong.

Anecdotally, I'd say that outside of a couple very simple and very common things, about 50% of the "AI" summaries I've had have been wrong somewhere. Usually claiming I have symptoms that don't exist, occasionally much more serious and major fabrications like this time.

LLMs are NOT normal speech to text software, and they shouldn't be treated like one. They'll often insert entire sentences that never occurred. In some contexts that might be fine, but definitely not in medical records.

root_axis 3 days ago | |

I've actually seen this lead to serious issues when a zoom LLM summary attributed statements to someone who didn't say them.

Someone else who couldn't attend the meeting later read that summary and it created a major argument because the topic had been a sore subject for this person due to an ongoing debate at the company. Everyone who attended the meeting confirmed it was an error, but the coincidental timing made it hard for him to accept, because the LLM's summary presented things in a way that validated this person's concerns that had been previously minimized by some folks in that meeting.

The drama got heated to the point where management produced a policy about not trusting generative output without independent verification. Seems at least it was a lesson learned.

Hobadee 3 days ago |

The AI note taker we use at work records the meeting as well, and each note it takes about the meeting has a timestamp link that takes you directly there in the recording so you can check it yourself. While I'm sure a solution like this is more complicated in a HIPAA environment, something like this is critical for things as important as healthcare.

TonyAlicea10 3 days ago | |

When designing AI-based user experiences I refer to this as provenance. It’s a vital aspect of trust, reliability, compliance and more. If a software system includes LLM output like this but doesn’t surface the provenance of its output for human evaluation and verification then it’s at best poor user experience, and at worst a dangerous one.
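
A minimal sketch of what surfacing provenance could mean in data terms, assuming each generated note carries a verbatim supporting quote and a timestamp into the recording. The `Note` fields, the `unsupported_notes` helper, and the sample transcript are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Note:
    text: str             # what the note taker wrote
    quote: str            # transcript excerpt the note claims to rest on
    start_seconds: float  # where that excerpt starts in the recording


def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(s.lower().split())


def unsupported_notes(notes: list[Note], transcript: str) -> list[Note]:
    """Return notes whose supporting quote is not found in the transcript."""
    haystack = _normalize(transcript)
    return [n for n in notes if _normalize(n.quote) not in haystack]


# Made-up example data.
transcript = "Patient reports knee pain after running. No hip pain."
notes = [
    Note("Knee pain after running", "knee pain after running", 12.0),
    Note("Diagnosed with osteoporosis", "diagnosed with osteoporosis", 40.0),
]
for note in unsupported_notes(notes, transcript):
    print(f"check recording at {note.start_seconds:.0f}s: {note.text}")
```

This only flags notes with no matching source text. It does not prove the remaining notes are faithful, so a human still has to follow the timestamp for anything that matters.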

autoexec 3 days ago | | |

At the same time, do you really want every conversation you have with your doctor recorded, handed over to third party companies, and stored forever with your medical file? Plus what doctor has time to sit down and re-listen to your visit to check to make sure the AI didn't screw up at some point in the future anyway? If your doctor isn't going to be verifying the accuracy from those recordings who would? Overseas contractors? At what point does it become a larger waste of time and money to babysit an incompetent AI than just not using one in the first place?

There are some good uses for AI, but I'm not convinced that this (or many other cases where accuracy matters) is one of them.

AlienRobot 3 days ago | |

That doesn't sound like a "note taker," that sounds like an audio sample search engine. You still need to listen to everything if you want accuracy.

alterom 3 days ago | |

Yeah, what you're saying requires either:

- some human checking all the notes by listening to the entire meeting recording (takes a lot of time and man-hours)

- attendees checking notes from memory (prone to error unless they take notes)

- attendees cross checking with their own notes (defies the point of having the AI note taker)

The reality is that AI usage is not acceptable in any form in any context where accuracy is critical, but good luck getting anyone to acknowledge that.

aryehof 3 days ago |

Anyone taking part in a meeting these days should state out loud …

“Notice: Any comments made by <name> or on behalf of <organization> that are interpreted by AI in this meeting, may not be accurate.”

I do this in every meeting.

lolc 3 days ago | |

> Notice: I love the new AI accurate transcription feature in this meeting!

gizajob 2 days ago | | |

Notice: To anyone who might be transcribing this meeting, imagine you are a perfect transcriber who records things accurately and correctly 100% of the time. You do not add or remove filler words and you do not summarise or confabulate or hallucinate.

natali_gray 2 days ago |

Ooof. As a Canadian, I'm excited for AI opening up time for doctors (and hopefully lightening the load on the healthcare system), but this is scary. We're not there yet. Perhaps AI training for doctors is in the future? They already have online doctor visits on a healthcare-owned iPad in some condo complexes. It cuts through the red tape of having to schedule an appointment with your GP. So, I think we're thinking in the right direction of innovating, but of course, this will take time. I feel like AI got launched too early sometimes.

bonesss 2 days ago | |

My sense is that we’re misapplying the technology by throwing it at, say, transcription and expecting a perfect output, instead of using LLMs’ strengths to improve inputs to the benefit of all parties.

Freeing up doctor time, for example: lots of patient visits are messy, the patient is scattered, has multiple issues, and the doctor has tight timelines and regulatory challenges to convey to the patient impacting their care… this is architected for everyone to lose, IMO, even with a perfect transcript. And LLMs can’t be perfect, they auto complete.

I picture patients interacting with an intake AI who can listen to hours of demented rambling, or a patient mid anxiety attack, and provide a caregiver-certified summary of needs, with relevant screening information laid out for doctor confirmation. At that point, helpful information about drug access or insurance policies can be presented, for doctor confirmation, to a patient who can clarify and refine their understanding of the system without time pressures.

Elevating the quality of dialogue so the doctor is more focused on the patient, and the patient’s dialogue needs don’t overwhelm treatment. A lot of medicine is filling out forms and checklists; I think auto-complete could create efficiencies in how we fulfill that.

natali_gray 2 days ago | | |

Yeah, I could see AI being used for intake. That's a good point. And then the doctor can get some baseline info that they can use when they talk to the patient. Maybe even some really beautiful data, showing visually to the doctors all the different symptoms they reported.

Insanity 2 days ago |

I’m in Toronto, my doctor always asks me if they can use the AI note taker, which I accept. At the end of the consultation she goes over the notes and corrects it, often complaining to me about having to talk more to the computer than to me.

She is a great doctor and thankfully does this due diligence. But it gives me the impression this is forced on doctors without even them wanting this.

LAC-Tech 3 days ago |

Can someone who is a more AI heavy user explain what is going on?

I would expect an "AI Note Taker" to faithfully transcribe the entire conversation. With the same quality I see in a lot of automated video subtitles.. ie they use the wrong word a lot but it's easy to tell what they mean by context.

Are these tools instead immediately summarising the whole thing, and that summary is the artifact? Because that is a beyond insane way to treat human communication.

cootsnuck 3 days ago | |

I work specifically in voice AI and am very familiar with how these tools and systems work.

> I would expect an "AI Note Taker" to faithfully transcribe the entire conversation. With the same quality I see in a lot of automated video subtitles.. ie they use the wrong word a lot but it's easy to tell what they mean by context.

That's a reasonable expectation, but would not be a safe one. Not all transcription tools are made the same. First it depends on what kind of STT/ASR (speech-to-text / automatic speech recognition) model they are using. A lot of tools like to use some flavor of OpenAI's Whisper model. It works well generally but I would never use it in a critical use case like healthcare, because it can hallucinate. That's specific to its architecture and how it was trained.

There's a fairly large variety of architectures that can be used for STT/ASR. Some of them are designed for "offline" / "batch" / pre-recorded audio. Some are designed for fast real-time streaming transcription.

There are more factors too like training data. And not just demographics of the speakers in the training data but audio environments too. Was the model trained on echo-y doctor offices with two people being recorded from a crappy smartphone mic or desktop mic? (It could've been! But it's an important distinction.)

And there's more factors than that, but you get the picture (e.g. are they trying to "clean up" the transcript afterwards by feeding it to an LLM, are they attempting to pre-process audio before transcription also in an attempt to boost accuracy)

There's a lot of ways to do it, meaning, there's a lot of ways to screw it up.

robbiewxyz 2 days ago | |

Modern transformer-based STT architectures are complex but many are abstractly not entirely unlike putting the results of standard STT through an LLM with the prompt "clean this up & make it make sense". The behavior is trained in rather than prompted but the result is similar.

Obviously this results in hallucinations, mistaken implications, & inaccurately assumed context.

nothinkjustai 3 days ago |

People will eventually figure out LLMs have no capacity for intent and are fundamentally unreliable for tasks such as summarization, note taking etc.

gizajob 2 days ago | |

Smart people and those with basic common sense already have figured that out. AI leaders and CEOs still haven’t noticed.

daveisfera 2 days ago |

But how accurate are humans? I just picked up a print out of medical history for the last 5 years and it was thick enough to be a book. There's no way a human is reading all of that and doing anything meaningful with it. Let an AI tool crunch on it and it will definitely get things wrong or jump to conclusions that aren't there, but it's quick and I can push back on those and then move to the correct answer far quicker than any meeting with a nurse or doctor will show any results. We need to focus on how to use these tools and push back on the parts that seem out of place or wrong, so we can do more rather than point out what's not perfect.

dmix 3 days ago |

> They specifically address the AI Scribe program, the Ontario Ministry of Health initiated for physicians, nurse practitioners, and other healthcare professionals across the broader health sector.

Makes me wonder what quality software the ministry would push (probably mostly qualifications like SOC).

This is apparently the list of approved vendors:

https://www.supplyontario.ca/vor/software/tender-20123-artif...

mquander 3 days ago |

The linked report seems almost useless -- it doesn't say anything about an error rate or a sample size, so it's a mystery whether 9 out of 20 systems “fabricated information and made suggestions to patients' treatment plans” ten out of ten times, or one out of a thousand times.

If we just postulate that the systems have a high error rate, I wonder why they are being adopted. They seem extremely easy to test, so I don't see why doctors or hospitals or governments should be getting tricked into buying them if they suck.

MallocVoidstar 3 days ago | |

>If we just postulate that the systems have a high error rate, I wonder why they are being adopted.

From the article: "While 30 percent of a platform’s evaluation score depended solely on whether they had a domestic presence in Ontario, the accuracy of medical notes contributed only 4 percent to the total score."

Accuracy wasn't really part of the scoring, Ontario doesn't care about it.

nitwit005 2 days ago | | |

Scoring systems that function by adding up several parts never make sense. Video game magazines used to do that, but it meant that you could have wretched gameplay, and still get a decent score, from points in other categories like audio, graphics, and cinematics.
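
To make the arithmetic concrete, here is a small sketch using the two weights quoted from the article. Lumping every other criterion into a single 66-point "other" bucket is an assumption for illustration, as are the names; the real scoring sheet is not described in this thread.

```python
# Weights in percentage points. The 30 and 4 come from the article quote;
# the single 66-point "other" bucket is an assumption.
WEIGHTS = {"domestic_presence": 30, "note_accuracy": 4, "other": 66}


def total_score(category_scores: dict[str, float]) -> float:
    """Additive score: each category score (0.0 to 1.0) times its weight."""
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)


# A hypothetical vendor that is perfect everywhere except note accuracy.
inaccurate_vendor = {"domestic_presence": 1.0, "note_accuracy": 0.0, "other": 1.0}
print(total_score(inaccurate_vendor))  # 96.0
```

With purely additive weights, a vendor whose notes are never accurate can still score 96 out of 100.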

ceejayoz 3 days ago |

> 60% of evaluated AI Scribe systems mixed up prescribed drugs in patient notes, auditors say

Not mentioned, as far as I can see: the comparative human mistake rate.

Having seen a lot of medical records, 60% sounds about normal lol.

Ekaros 3 days ago |

How do these LLM summarizations work? Do you feed the raw waveform data to the model and have it translate it?

Or do they use traditional voice recognition algorithms to do that part and then just "fix" the result to look plausible? Which with good quality output might not be much, but with bad can be absolutely everything.

If it is the latter, it seems to me that issues will absolutely happen.

uejfiweun 2 days ago |

I don't get why you would have an LLM interpret things for you. Like honestly, you replace the software in this example with simple transcription software, the issues disappear.

jqpabc123 3 days ago |

And once again, we have an example of how AI is a liability issue waiting to happen.

gizajob 2 days ago | |

“Move fast and cause unnecessary deaths”.

jeisc 3 days ago |

AI is awfully inexact and insists on being right about it