Date: 2024-01-26

If you have chest pain, should you ask a chatbot, like ChatGPT, for medical advice? Should your doctor turn to AI for help with a diagnosis?

These are the types of questions that chatbots are raising for the healthcare industry and the people it serves. The possibilities of this technology are huge: For patients, cutting-edge artificial intelligence could mean getting better answers to medical questions, more quickly and cheaply than making an appointment with a doctor. Clinicians, meanwhile, could easily access and synthesize complex medical concepts—and avoid a lot of the numbing paperwork that comes with the job.

Yet the lack of transparency in the underlying data and methods used to train these models has led to concerns about accuracy. There are also concerns that the technology will perpetuate bias, giving answers that may hurt certain groups of people. And some AI will deliver wrong answers confidently, or simply make up facts out of thin air.

To learn more about how to use—and not use—this new technology, The Wall Street Journal spoke to three experts: James Zou, an assistant professor of biomedical data science at Stanford University; Gary Weissman, an assistant professor in pulmonary and critical-care medicine at the University of Pennsylvania Perelman School of Medicine; and I. Glenn Cohen, a professor at Harvard Law School and faculty director of its Petrie-Flom Center for Health Law Policy, Biotechnology and Bioethics.

Here are edited excerpts of the conversation:

Can we trust the advice?

WSJ: Can large language models, like ChatGPT and its competitors, provide reliable medical advice for patients?

WEISSMAN: At the moment, ChatGPT is able to provide general medical information, the same way you would find background information on a topic on Wikipedia that is mostly right, but not always. It is not able to provide personalized medical advice for individuals in a way that is safe, reliable and equitable.

COHEN: Accessing information about healthcare is different than getting a clinician’s opinion. But if we are talking about ChatGPT vs. Googling a question or looking on Reddit, then there is a good reason to think that ChatGPT holds some real promise.

ZOU: Its effectiveness really depends on the kinds of questions you ask. It’s not great to ask prediction questions or for any personal recommendations. It could be more effective for information retrieval or exploratory questions, like, “Tell me something about this particular drug.” I also heard about patients pasting a medical-consent form that has a lot of jargon and is difficult to understand into GPT and asking it to explain the document in plain English.

WSJ: What do you think about patients using ChatGPT as compared with Reddit or Google?

WEISSMAN: The content is likely similar in quality and bias for ChatGPT, a web search or public discussion forum. The additional risks carried by ChatGPT include: giving the impression that it is knowledgeable in its responses; confabulating answers; and not immediately distinguishing the sources for its responses, such as a Centers for Disease Control and Prevention website versus a misinformation website. By contrast, when you read webpages directly, the source is often, but not always, clearer.

[An OpenAI spokesperson says that the company’s models are not fine-tuned to provide medical information, and the company cautions against using the models to provide diagnostic or treatment services for serious medical conditions. The company is continuing to research the issue, the spokesperson says.]

Helping caregivers

WSJ: How might ChatGPT be used in clinical practice?

WEISSMAN: I think some physicians may be using it already as a clinical diagnostic-support system, inputting symptoms and then asking for a possible diagnosis. But probably it’s more commonly used as a digital assistant to generate draft medical documents, summarize patient histories and physical information or to create patient-problem lists. There is a heavy documentation burden on clinicians and a lot of burnout, which is what may make the technology appealing. But clinicians will likely need to review and edit the output for accuracy and appropriateness.

WSJ: Do you think it is risky if physicians are already using ChatGPT to support diagnosis decisions?

WEISSMAN: ChatGPT should not be used to support clinical decision-making. There is no evidence that it is safe, equitable or effective for this purpose. As far as I know, there is also no authorization from the Food and Drug Administration for its use in this way.

ZOU: ChatGPT and these LLMs are changing very quickly. If you ask the same model the same questions over a span of a few weeks, the model often gives you different responses. Our research finds GPT-4’s performance on the U.S. Medical Licensing Examination dropped by 4.5% from March to June 2023. Patients and clinicians should be aware ChatGPT may give quite different responses or suggestions to the same medical questions on different days.

WSJ: Should patients be told when clinicians are using ChatGPT, other large language models or AI in general?

COHEN: Patients have a right to be informed they are interacting with an AI chatbot, especially if they may think they are talking with an actual clinician. Whether or not you have a right to know about all AI in your care is another matter. For example, if the first look at an X-ray was done by an AI and reviewed by a radiologist, I am not sure if a right of informed consent applies. When the AI is an adjunct to a decision, we are in a very different category than when a patient is interacting with an AI and has no idea.

WEISSMAN: For formal reports, such as radiology, pathology or laboratory reports that were informed by an AI, I think this should be documented. In cases where a clinician is consulting multiple sources to inform an opinion—a medical textbook, a journal article, an AI system—I don’t think that requires formal reporting, but the clinician in this case is clearly responsible for the decision made. The only exception would be where a clinician is making a difficult decision in partnership with a patient and/or caregiver.

Unfair results

WSJ: How do ChatGPT biases manifest themselves in healthcare?

WEISSMAN: Our study found the clinical recommendations from ChatGPT changed depending on the insurance status of the patient asking the question. In one instance, ChatGPT recommended an older adult, without insurance, presenting with acute chest pain, which is a medical emergency, go to a community health center before the emergency department, which is totally unsafe and inappropriate care.

COHEN: Many LLMs are also trained on the English-language internet and English-language sources. That means we are ignoring an entire set of knowledge in different languages. Take an example outside of medicine. Only looking at the English-language sources on Islamic history may lead to very different conclusions than if you looked at everything that is relevant to Islamic history in every language.

ZOU: China and other countries have invested heavily in training models too. That still means a lot of languages [are underrepresented in training the large language models]. One consequence is that an LLM can be less reliable when patients and clinicians interact with it in non-English languages. On the other hand, ChatGPT is reasonably good at translating between the common languages, so it can also be used as a translator by some users.

COHEN: Besides the training data, there are potential biases built into the reinforcement learning process where people decide what answers get reinforced. An article published by the American Psychological Association discusses how different cultural groups (Latina adolescents versus Asian-American college students versus white retirees) have different markers for when a therapist should be worried about suicide risk. If an AI is trained only on the last group, it may not be sensitive to the signals for the other groups.

[The OpenAI spokesperson says that the company has worked to train its model to recognize and state the dangers of generalizing with race or other protected characteristics. Research into the issue is ongoing, the spokesperson says.]

WSJ: What about ChatGPT’s ability to generate fake medical articles or images?

COHEN: Large language models make creating medical misinformation incredibly easy. You can generate fake academic papers at the drop of a hat, along with what look like real citations, or perhaps a fake radiology report from a real patient faxed to their doctor’s office.

[The OpenAI spokesperson says that ChatGPT will occasionally make up facts, and users should verify information that is provided to them.]

WSJ: Any last thoughts?

COHEN: We focused on a lot of doom and gloom, but this is incredibly exciting and there’s an awful lot of value here. But one thing about these foundational models is that if you don’t get the foundation right the entire house will collapse, or worse yet the whole city, so it’s also important for the foundational models we arrive at to be really good.

ZOU: Absolutely. These technologies have a lot of exciting uses and potential, but sometimes we forget how new they really are. We are very much in the early stages of understanding how we should use them responsibly.

WEISSMAN: LLMs are hot right now for two reasons: One is that they are an exciting technology with a lot of potential clinical applications. The other is that some companies have an opportunity to earn a tremendous profit. So, there is a tension: How can we make money quickly with this new technology that we don’t really understand, for which we don’t have a lot of evidence and that isn’t sufficiently regulated, versus how can we find safe, effective, equitable and ethical uses for this new technology?

Lisa Ward is a writer in Vermont. She can be reached at reports@wsj.com.