Doctor GPT? AI Gets Healthcare Questions Right 76% of the Time
Dr. Amulya Yadav (Penn State University) joins host Jeffrey Snyder to discuss a new Penn State study testing large language models (ChatGPT, Google Gemini, Meta LLaMA) on real patient queries. The study, in which nine Penn State physicians evaluated the answers, found that LLMs produced medically valid answers about 76% of the time. In this interview, Dr. Yadav explains where LLMs perform well (primary care, differential diagnosis), where they struggle (dermatology, mental health, cases requiring tests or images), and how these tools should be used as complements to clinicians rather than replacements. They also discuss ethical concerns, existing guardrails, and the need for evolving regulation and user education.
Jeffrey Snyder, Broadcast Retirement Network
Joining me now is Dr. Amulya Yadav of Penn State University. Dr. Yadav, great to see you. Thanks for joining us on the program this morning.
Amulya Yadav, Penn State University
Thank you for having me, Jeffrey. It's a pleasure to be here.
Jeffrey Snyder, Broadcast Retirement Network
And it's great to see you, doctor. I was kind of half-heartedly joking that, based on your research, it looks like I may never need my primary care physician again, because people are accessing ChatGPT and, most of the time, getting accurate answers.
Amulya Yadav, Penn State University
Well, yeah, I mean, it depends on how we quantify "most of the time," right? Our research looked, on average, across many different contemporary large language models. That includes ChatGPT, that includes Gemini from Google, and that includes the LLaMA series of models from Meta. Across all of these models, we collected patient queries that were provided by members of the Penn State community.
And across all those queries that we were able to collect from the Penn State community, it seems to be the case that LLMs were able to provide a valid answer, right? We don't need to go into the formality of how we've defined it, but basically validity corresponds more or less with accuracy. So LLMs were able to provide an accurate answer in 76% of all the cases, or all the queries, that were put in front of these large language models, right?
So three out of every four answers, more or less, that the LLMs generated were correct. And this notion of correctness was judged by certified physicians at Penn State. We had a panel of nine physicians from the Penn State College of Medicine, and they were the ones adjudicating whether the answers generated by language models were accurate or inaccurate, whether they were harmful or not harmful, whether they were empathetic or not, and so on and so forth.
Jeffrey Snyder, Broadcast Retirement Network
Oh, I'm sorry to interrupt you. I was going to ask, what was the reaction? So first, let's ask your reaction.
Were you surprised at all that, based on the adjudication by certified medical practitioners, the information coming out of the large language models was actually correct? Did that surprise you, doctor?
Amulya Yadav, Penn State University
So personally, it surprised me, but the fact that it did surprise me should have little value because I'm not a medical expert, right? I'm a computer scientist. My prior assumption was that language models would do terribly at this job, right?
They would not be able to answer health queries, would not have the nuance, would not be able to handle corner cases, right? That is the sort of notion that we've come to expect from when patients use Google to self-diagnose, right? If you ever go to Google, there are these horror stories that I'm sure all of us have lived through.
We go to Google and say we have a headache. And Google tells us, well, this could be a sign of stomach cancer, right? Or something crazy, right?
So that is the sort of baseline from where my mental model was operating. And therefore, I had anticipated that language models would be similarly bad in their answers. So yeah, personally, I was surprised that they did as well as they did.
The other side of it, and maybe you'll get to it, but I just want to touch upon it, is that 76% seems like a good number, but compare that to rates of misdiagnosis amongst human physicians, right? And this is based off a quick Google search that I did online. There were several different websites, which seemed reasonably genuine, that said that the rate of misdiagnosis amongst human doctors is approximately 10% or 11%, right?
So compared to that, it seems that large language models are still not as accurate as a human physician, right? So 76% is great. It seems to be a lot better than getting a self-diagnosis from Google, right?
But is it enough for us to completely, you know, cancel our medical insurance and just get a ChatGPT Pro subscription because, you know, that's all that we need for a doctor? I don't think we're there yet, or at least based on our limited data set, it doesn't seem to be that way.
Jeffrey Snyder, Broadcast Retirement Network
Doctor, are there certain maladies or certain ailments where the large language model does better? You know, I would imagine that without diagnostics, it couldn't tell me if I had high blood pressure or heart disease, but it probably could give me information about, as you said, a headache or some kind of common ailment. So does it do better in certain areas versus others?
Amulya Yadav, Penn State University
So that's a good question. Let's first talk about where it does not hold up. And you're very right in touching on areas where you need diagnostics or test results to be able to make a diagnosis, right?
On those sorts of issues or cases, certainly ChatGPT or language models do not do as well as in some other categories. Yet another category where we saw that it was not doing as well is dermatology. And, you know, that seems to be very reasonable.
You know, it's something that we expect, because you can imagine that most dermatological queries require people uploading a picture of their skin, right? And by now, there's a good amount of consensus amongst the research community that vision-based language models, that is, language models that understand and reason with images, don't work as well as language models that only work with text queries. So therefore, you know, anytime you have to upload pictures of skin and ask a language model, what do you think, is this eczema?
Is this something else, right? It is understandable that they're not doing as well as on queries where there need not be an image-based component, right? The third category where it did not do as well was mental health queries.
And this has a lot of implications, because, you know, there's a lot of investment that is going into setting up startups and infrastructure for LLMs for mental health, for therapy, etc., right? And based on our limited sample size, it seemed to be the case that language models were not generating adequate answers. There's a good amount of research that shows that language models are very sycophantic, right?
Which is not always what a human counselor is supposed to be. They're not always supposed to agree with a patient, right? Whereas language models have this innate tendency to want to agree with whatever the human asks them.
I'm sure you've read articles online about, you know, thankfully rare cases where somebody asked a language model about wanting to end their life and the language model says, yes, you know, if I were in your position, I'd do the same thing. You know, obviously you're in a very tough position. So it is because of this sycophancy that language models in our data set were also found to be lacking in generating responses to mental health queries.
In terms of doing well, there are queries regarding general medical concerns, right? Things that you would take to a primary care doctor. For example, I have common cold symptoms.
I have symptoms of influenza, et cetera. On these sorts of queries, language models were doing very well. There's also something called differential diagnosis, DDx, which is something that language models were doing very well.
So yeah, that's what we found.
Jeffrey Snyder, Broadcast Retirement Network
So would you say that maybe this is a tool that is complementary to visiting the doctor, that, you know, it's just like Dr. Google? Dr. Google wasn't a replacement for the physician. This isn't necessarily a replacement for the physician, but it is a complement and actually can be used as a tool to help the physician refine things.
So they can be used by the physician to help a patient. Is that where this is kind of going, doctor?
Amulya Yadav, Penn State University
Absolutely. I think so. There are many different possible use cases of this.
So certainly you're absolutely right. We hope that society does not see these tools as replacements for human doctors, because at this point in time, they're not, right? There are genuine use cases. You know, there are statistics from the World Health Organization that almost half of the world's population does not have access to good quality healthcare, right? And so for that half of the world's population who does not have access to a doctor, even though they may have the money, or maybe they don't even have the money to go see a doctor, for whatever reason, having access to a medical chatbot that can give them an answer that is correct
76% of the time is better than not having a doctor at all. So absolutely, if you're in that half of the population and you don't have access to a doctor, this is a wonderful tool for you to have, right?
Even in situations where you do have access to good quality medical care, it should only be seen as a complement, right? I completely agree that the outputs of these language models would be of much more value to physicians because, you know, I think by now we all understand that, even in America, there's a huge mismatch between supply and demand.
The demand for healthcare is far greater than the supply of healthcare. You know, there are long waiting times at primary care doctors and so on and so forth. So this tool can really speed up the work of physicians, or make it more efficient, so that they can use their limited high-quality time to see more patients without sacrificing the quality that they provide to us.
Right. So certainly, you know, that's a good way of thinking about it.
Jeffrey Snyder, Broadcast Retirement Network
Let me end on the ethical aspects of this. I would imagine that the American Medical Association and some of the other, you know, federal and state regulatory entities would probably want to have some guardrails put in place. Is that something that you think makes sense?
And it's not only about the doctor using it as a tool. You mentioned the suicide conversation, and I recall some of these horrible cases. So do we need a framework? Is there a need for a framework to put some constraints around these large language models when it comes to the medical aspects of things?
Amulya Yadav, Penn State University
Absolutely, absolutely. We certainly need guardrails, a framework for guardrails, we need regulation, et cetera. But it is important to understand that, by the way, guardrails do exist in today's LLMs, right?
I mean, the horror cases that we talked about where, you know, an LLM encouraged a person to commit self-harm, those are rare cases, primarily because of the existence of guardrails, right? And the other thing to notice is that, just because of the huge diversity of questions or ways in which questions can be asked to LLMs, it's always going to be a cat and mouse game, right? So you can come up with a set of guardrails and they'll work well for the intended use cases, but soon enough, there's going to be another set of users who are going to ask a question in a slightly different way.
And that guardrail is not going to be able to accommodate that. There's a huge line of research in what we refer to as adversarial or jailbreaking of LLMs, right? Which specifically focuses on how do you get LLMs to say what they're not supposed to say?
How do you bypass the guardrails that have been put in place inside an LLM? So it's always going to be a cat and mouse game. But I think there is a more important, or more useful, way of thinking about it.
So certainly a framework is needed. We need regulation, and that framework and regulation need to be adapted. They need to keep on evolving because these LLMs are continuously evolving, right?
But more importantly, it is important for users, both patients and doctors, to be aware that these language models are not to be looked at as replacements, right? We need to make ourselves aware, and the more we listen to this message, the more it gets internalized, that whatever comes out of these models is not gospel truth, right?
It need not be viewed as that, right? Whatever comes out of it needs to be taken with a grain of salt. The fact that an LLM may tell you that you might be suffering from cancer does not need to count for anything, right?
So therefore we need to have our wits about us as we navigate this technology and use it for our benefit, right? I would be completely okay with a human being using an LLM for self-diagnosis, receiving an unfavorable diagnosis, and having the emotional maturity to say that this is just a language model, right? This is not a doctor, right?
But because it's saying this to me, maybe I should go and see an actual doctor, a human doctor, to check this out. That would be a mature way of using this technology for good, as opposed to freaking out, thinking that something bad is about to happen to my life. So it's very important for us to have this awareness, much more than the regulation part of it.
Jeffrey Snyder, Broadcast Retirement Network
Yeah, well, it's like any tool: you have to understand it, you have to be educated about it, and you have to have the right perspective in using the tool. Dr. Yadav, great to see you, great research. Thank you so much for joining us, and we look forward to having you back on the program again very soon, sir.
Amulya Yadav, Penn State University
Thank you, Jeffrey.
This story was originally published June 10, 2026 at 7:30 AM.