Research Shows Generative AI Can Work Well in EHR, but Only Under Human Supervision – MedCity News

As documentation burdens and various other administrative tasks have increased, physician burnout has reached historic levels. In response, EHR vendors are integrating generative AI tools to help physicians formulate their responses to patient messages. However, there is still a lot we don’t know about the accuracy and effectiveness of these tools.

Researchers at Mass General Brigham recently conducted research to learn more about the performance of these generative AI solutions. Last week, they published a study in The Lancet Digital Health showing that these AI tools can effectively reduce physician workload and improve patient education – but also that they have limitations that require human oversight.

For the study, researchers used OpenAI’s GPT-4 large language model to generate 100 different hypothetical questions from cancer patients.

The researchers had GPT-4 answer these questions and also had six radiation oncologists answer them manually. The research team then gave the same six physicians the GPT-4-generated responses to review and edit.

The oncologists often could not tell whether GPT-4 or a human physician had written an answer – and in almost a third of cases, they assumed a GPT-4-generated answer had been written by a doctor.

The study showed that doctors tended to write shorter answers than GPT-4. The large language model’s responses were longer because they typically contained more educational information for patients – but at the same time, these responses were also less direct, the researchers found.

Overall, physicians reported that using a large language model to help compose their responses to patient messages reduced their workload and the associated burnout. They considered GPT-4-generated responses safe 82% of the time and acceptable to send without further editing 58% of the time.

However, it is important to remember that large language models can be dangerous without a human in the loop. The study also found that 7% of the responses generated by GPT-4 could pose a risk to the patient if left unedited. Most often, this was because the GPT-4-generated response “inaccurately conveys the urgency with which the patient should come to the clinic or be seen by a doctor,” said Dr. Danielle Bitterman, an author of the study and a radiation oncologist at Mass General Brigham.

“These models go through a reinforcement learning process where they are trained to be polite and give answers in a way that a person might want to hear. I think sometimes they become almost too polite and don’t adequately express the urgency when it’s there,” she explained in an interview.

More research is needed on how patients feel about large language models being used to interact with them in this way, Dr. Bitterman noted.

Photo: Halfpoint, Getty Images
