ChatGPT Often Fails to Recommend Correct Cancer Treatments


When asked for recommended treatments for certain types of cancer, ChatGPT missed the mark about one-third of the time, research showed.

ChatGPT, an artificial intelligence (AI) system that answers questions and prompts in a conversational manner, failed to provide the correct recommendation for cancer care approximately one-third of the times it was asked, according to research that was recently published in JAMA Oncology.

“Patients should feel empowered to educate themselves about their medical conditions, but they should always discuss with a clinician, and resources on the internet should not be consulted in isolation,” study author Dr. Danielle Bitterman, of the Department of Radiation Oncology and the Artificial Intelligence in Medicine (AIM) Program of Mass General Brigham, said in a press release.

The researchers on the study entered 104 prompts regarding breast, prostate and lung cancer into ChatGPT, asking the platform to provide a treatment approach depending on the severity of each diagnosis. Answers were scored on five criteria, creating a total of 520 scores. Experts then compared the answers to the 2021 National Comprehensive Cancer Network (NCCN) guidelines, which outlined preferred cancer treatments based on current research.

The five criteria were:

1. How many treatment recommendations were provided?
2. How many of the recommended treatments were in accordance with the 2021 NCCN guidelines?
3. If some, but not all, for question 2, how many recommendations were correct in their entirety per the 2021 NCCN guidelines?
4. How many recommended treatments were hallucinated? (Hallucinated recommendations are those that are not part of any recommended treatment.)
5. If one or more for question 4, how many of the hallucinated treatments are now a recommended treatment in the most current NCCN guidelines?

Notably, the researchers chose the 2021 NCCN guidelines because ChatGPT’s knowledge cutoff is September 2021.

The expert annotators agreed with one another on 322 of 520 (61.9%) scores. While nearly all (98%) of the prompts were answered with at least one NCCN guideline-based recommendation, 35 of those 102 outputs (34.3%) also recommended one or more treatments that the guidelines do not recommend. Additionally, 12.5% of outputs contained hallucinated recommendations, meaning treatments that were not part of any recommended treatment protocol.
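For readers who want to check the arithmetic, the reported figures can be reproduced with a few lines of Python. This is only a sketch using the numbers quoted in this article; the variable names are illustrative, and the count of 13 hallucinated outputs is derived here from the reported 12.5% rather than stated above.

```python
# Figures as reported in the article (variable names are illustrative).
prompts = 104
criteria_per_prompt = 5
total_scores = prompts * criteria_per_prompt

agreed_scores = 322                # scores on which agreement was reported
outputs_with_guideline_rec = 102   # outputs with at least one guideline-based recommendation
outputs_also_nonrecommended = 35   # of those, outputs that also gave a non-recommended treatment


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


agreement_pct = pct(agreed_scores, total_scores)
guideline_pct = pct(outputs_with_guideline_rec, prompts)
nonrecommended_pct = pct(outputs_also_nonrecommended, outputs_with_guideline_rec)

# Derived, not stated in the article: 12.5% of 104 outputs.
hallucinated_outputs = round(0.125 * prompts)

print(total_scores, agreement_pct, guideline_pct, nonrecommended_pct, hallucinated_outputs)
```

The 34.3% figure is taken over the 102 outputs that contained at least one guideline-based recommendation, not over all 104 prompts, which is why the denominator differs from the prompt count.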

Authors of the study noted that the recommendations generated by ChatGPT varied based on how the question was asked. The following wordings were used:

What is a recommended treatment for (diagnosis) according to the NCCN?
What is a recommended treatment for (diagnosis)?
How do you treat (diagnosis)?
What is the treatment for (diagnosis)?
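As a rough illustration of how a set of prompts like this could be assembled, the sketch below fills each of the four wordings with a diagnosis description. The diagnosis strings and the function name `build_prompts` are hypothetical placeholders, not the descriptions or code used in the study, and the sketch assumes every wording was paired with every diagnosis.

```python
# The four wordings quoted above, with the diagnosis as a placeholder.
TEMPLATES = [
    "What is a recommended treatment for {diagnosis} according to the NCCN?",
    "What is a recommended treatment for {diagnosis}?",
    "How do you treat {diagnosis}?",
    "What is the treatment for {diagnosis}?",
]


def build_prompts(diagnoses):
    """Pair every diagnosis description with every wording."""
    return [t.format(diagnosis=d) for d in diagnoses for t in TEMPLATES]


# Hypothetical diagnosis descriptions, for illustration only.
example_diagnoses = ["early-stage breast cancer", "advanced lung cancer"]
example_prompts = build_prompts(example_diagnoses)

# Under the pairing assumption, 104 prompts across 4 wordings
# would imply this many diagnosis descriptions.
diagnoses_implied = 104 // len(TEMPLATES)

for prompt in example_prompts:
    print(prompt)
```

Because each diagnosis is asked in several ways, differences between the answers to the same diagnosis show how much the wording alone can change the output.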

Sometimes the experts who were reviewing the answers disagreed on whether the responses aligned with the criteria. The authors wrote that this disagreement “most often arose from unclear output, but differing interpretations of guidelines among annotators may have played a role.”

While AI-based models such as ChatGPT should not be used to make treatment recommendations, they could play a role in helping patients learn about their disease. Prior research showed that ChatGPT was able to deliver relevant information about mental and physical health during cancer, as well as cancer caregiving.

“ChatGPT responses can sound a lot like a human and can be quite convincing. But, when it comes to clinical decision-making, there are so many subtleties for every patient’s unique situation. A right answer can be very nuanced, and not necessarily something ChatGPT or another large language model can provide,” Bitterman said.
