Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said
Tech behemoth OpenAI claims its artificial intelligence-powered transcription tool Whisper has “human-level robustness and accuracy.”
But Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. These experts said some of the invented text — known in the industry as hallucinations — could include racial commentary, violent speech and even imaginary medical treatments.
Experts say such fabrications are problematic because Whisper is used in several industries worldwide to translate and transcribe interviews, generate text in popular consumer technology and create subtitles for videos.
Despite OpenAI's warnings that the tool should not be used in “high-risk domains,” they said, medical centers are rushing to use Whisper-based tools to transcribe patients' consultations with physicians.
The full extent of the problem is difficult to quantify, but researchers and engineers say they often encounter Whisper's hallucinations in their work. A University of Michigan researcher conducting a study of public meetings, for example, found hallucinations in 8 out of every 10 audio transcriptions before he began trying to improve the model.
A machine learning engineer said he initially discovered hallucinations in about half of the more than 100 hours of Whisper transcriptions he analyzed. A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper.
Even with well-recorded, short audio samples, the problems persist. A recent study by computer scientists uncovered 187 hallucinations in more than 13,000 clear audio snippets they examined.
This trend would lead to thousands of faulty transcriptions over millions of recordings, the researchers said.
Alondra Nelson, who until last year led the White House Office of Science and Technology Policy for the Biden administration, said such mistakes can have “really serious consequences,” especially in hospital settings.
“Nobody wants a misdiagnosis,” said Nelson, a professor at the Institute for Advanced Study in Princeton, New Jersey. “There should be a higher bar.”
Whisper is also used to create closed captions for the deaf and hard of hearing – populations at particular risk for faulty transcription.
That's because deaf and hard of hearing people have no way to identify fabrications that are “hidden in all this other text,” said Christian Vogler, who directs the Technology Access Program at Gallaudet University.
OpenAI urged to address the problem
The prevalence of such hallucinations has prompted experts, advocates and former OpenAI employees to call on the federal government to consider AI regulation. At a minimum, they said, OpenAI needs to fix its flaws.
“If the company is willing to prioritize it, it seems solvable,” said William Saunders, a San Francisco-based research engineer who left OpenAI in February over concerns about the company's direction. “It's problematic if you put it out there and people are overconfident about what it can do and integrate it into all these other systems.”
An OpenAI spokesperson said the company continually studies how to reduce hallucinations and appreciates researchers' findings, adding that OpenAI includes feedback in model updates.
While most developers assume that transcription tools misspell words or make other errors, engineers and researchers say they've never seen an AI-powered transcription tool hallucinate as much as Whisper.
Whisper hallucinations
The tool is integrated into some versions of OpenAI's flagship chatbot ChatGPT and is a built-in offering on Oracle and Microsoft's cloud computing platforms, which serve thousands of companies worldwide. It is also used to transcribe and translate text into multiple languages.
In the past month alone, a recent version of Whisper has been downloaded 4.2 million times from HuggingFace, an open-source AI platform. Sanchit Gandhi, a machine-learning engineer there, said Whisper is the most popular open-source speech recognition model and has been built into everything from call centers to voice assistants.
Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short snippets from TalkBank, a research repository hosted at Carnegie Mellon University. They determined that about 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.
In one example they uncovered, a speaker said, “He, the boy was going, I'm not sure, take the umbrella.”
But the transcription software added: “He took a big piece of a cross, a small, small piece … I'm sure he didn't have a terrorist knife so he killed a lot of people.”
In another recording, a speaker described “two other girls and a woman.” Whisper invented additional commentary on race, adding “two other girls and a woman, um, who was black.”
In a third transcription, Whisper invented a non-existent drug called “hyperactivated antibiotic.”
Researchers aren't sure why Whisper and similar tools hallucinate, but software developers say the fabrications tend to occur amid pauses, background noise or music playing.
In its online release, OpenAI cautioned against using Whisper in “decision-making contexts, where errors in accuracy can cause pronounced errors in results.”
Transcribing doctor appointments
That caution hasn't stopped hospitals or medical centers from using speech-to-text models, including Whisper, to transcribe what's said during doctor visits so that medical providers spend less time taking notes or writing reports.
More than 30,000 physicians and 40 health systems, including Minnesota's Mankato Clinic and Children's Hospital Los Angeles, have begun using a Whisper-based tool developed by Nabla, which has offices in France and the United States.
The tool was fine-tuned on medical language to transcribe and summarize patient interactions, said Martin Raison, Nabla's chief technology officer.
Company officials said they are aware that Whisper can cause hallucinations and are mitigating the problem.
It's impossible to compare Nabla's AI-generated transcripts with the original recordings because Nabla's tool deletes the original audio “for data security reasons,” Raison said.
Nabla said the tool has been used to transcribe an estimated 7 million medical visits.
Deletion of the original audio can be worrisome if transcripts aren't double-checked or if clinicians can't access recordings to verify they're accurate, said Saunders, a former OpenAI engineer.
He said, “You cannot catch a mistake if you take away the ground truth.”
Nabla said no model is perfect and that its tool currently requires medical providers to quickly edit and approve transcribed notes, but that could change.
Privacy concerns
Because patient meetings with their doctors are confidential, it's hard to know how AI-generated transcripts are affecting them.
A California state lawmaker, Rebecca Bauer-Kahan, said she took one of her children to the doctor earlier this year and refused to sign a form the health network provided seeking permission to share audio of the consultation with vendors, including Microsoft Azure, the cloud computing system run by OpenAI's largest investor. Bauer-Kahan didn't want such intimate medical conversations being shared with tech companies, she said.
“The release was very specific that for-profit companies would have the right to have it,” said Bauer-Kahan, a Democrat who represents part of suburban San Francisco in the state Assembly. “I was like, 'Absolutely not.'”
John Muir Health spokesman Ben Drew said the health system complies with state and federal privacy laws.