Comment: How well can AI chatbots imitate doctors in a treatment situation? We put 5 to the test

Many consumers and healthcare professionals use chatbots based on large language models to answer medical questions and weigh treatment options. We wanted to find out whether there are major differences between the leading platforms in terms of their clinical suitability.

To obtain a medical license in the United States, aspiring physicians must pass all three steps of the U.S. Medical Licensing Examination (USMLE), with the third and final step generally considered the most difficult. It requires candidates to answer about 60% of the questions correctly, and historically the average passing score has been about 75%.

When we subjected the major large language models (LLMs) to the same Step 3 exam, they performed considerably better, achieving scores that significantly exceeded those of many doctors.

However, there were some significant differences between the models.

Typically taken after the first year of residency, USMLE Step 3 tests whether medical graduates can apply their knowledge of clinical science to the independent practice of medicine. It assesses a new physician's ability to manage patient care across a broad range of medical disciplines and includes both multiple-choice questions and computer-based case simulations.

We isolated 50 questions from the 2023 USMLE Step 3 sample test to evaluate the clinical competency of five leading large language models, feeding the same set of questions to each of these platforms: ChatGPT, Claude, Google Gemini, Grok and Llama.
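(For readers curious about the mechanics, the comparison amounts to a simple harness: pose the same prompt for every question to each model and count how many answers match the official key. The sketch below is purely illustrative; `query_model`, the `questions.json` file and its answer key are placeholders invented for this example, not the actual materials or APIs behind our evaluation.)

```python
# Illustrative sketch only: query_model stands in for each platform's own chat
# interface, and questions.json is a hypothetical file of question stems,
# answer options, and the official answer key.
import json

def query_model(platform: str, prompt: str) -> str:
    """Placeholder: send the prompt to the named platform and return its reply."""
    raise NotImplementedError("connect this to the platform's chat API")

def extract_choice(reply: str) -> str:
    """Pull the first standalone answer letter (A-E) out of a free-text reply."""
    for token in reply.replace(".", " ").replace(")", " ").split():
        if token.upper() in {"A", "B", "C", "D", "E"}:
            return token.upper()
    return ""

def score(platform: str, questions: list[dict]) -> float:
    """Ask every question once and return the fraction answered correctly."""
    correct = 0
    for q in questions:
        prompt = (q["stem"] + "\n" + "\n".join(q["options"])
                  + "\nAnswer with a single letter.")
        if extract_choice(query_model(platform, prompt)) == q["answer"]:
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    with open("questions.json") as f:
        questions = json.load(f)  # [{"stem": ..., "options": [...], "answer": "C"}, ...]
    for platform in ["ChatGPT", "Claude", "Gemini", "Grok", "HuggingChat"]:
        print(f"{platform}: {score(platform, questions):.0%}")
```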

Other studies have assessed these models for their medical expertise, but to our knowledge, this is the first time these five leading platforms have been compared head-to-head. These results could give consumers and providers some insight into where to turn.

This is how they performed:

  • ChatGPT-4o (OpenAI) – 49/50 questions correct (98%)
  • Claude 3.5 (Anthropic) – 45/50 (90%)
  • Gemini Advanced (Google) – 43/50 (86%)
  • Grok (xAI) – 42/50 (84%)
  • HuggingChat (Llama) – 33/50 (66%)

In our experiment, OpenAI's ChatGPT-4o emerged as the top performer, achieving a score of 98%. It provided detailed medical analysis, using language reminiscent of a physician's. It not only gave answers with detailed reasoning, but also contextualized its decision-making process and explained why alternative answers were less appropriate.

Anthropic's Claude came in second with 90%. Its answers sounded more human, its language was simpler, and its bulleted structure would be easier for patients to understand. Gemini, which scored 86%, did not give answers as thorough as ChatGPT's or Claude's, making its reasoning harder to follow, but its responses were succinct and straightforward.

Grok, the chatbot from Elon Musk's xAI, scored a respectable 84%, but did not provide descriptive reasoning during our evaluation, making it difficult to understand how it arrived at its answers. HuggingChat, an open-source chat interface built on Meta's Llama, performed the worst at 66%, but it still offered sound justifications for the questions it answered correctly and included links to sources.

One question that most of the models got wrong involved a 75-year-old woman with a hypothetical heart condition. The question asked what the most appropriate next step in her evaluation would be. Claude was the only model that got the answer right.

Another notable question involved a 20-year-old male patient experiencing symptoms of a sexually transmitted infection. Test-takers were asked which of five options was the appropriate next step in his care. ChatGPT correctly identified that the patient should be scheduled for an HIV serology test in three months, but the model went further, recommending a follow-up visit in one week to make sure the patient's symptoms had resolved and that the antibiotics were covering his strain of infection. For us, the response underscored the model's ability to think more comprehensively, beyond the answer choices offered by the exam.

These models were not designed for medical reasoning; they are consumer technology products built for tasks such as language translation and content creation. Despite their non-medical origins, they have shown a surprising aptitude for clinical reasoning.

Newer platforms are being built specifically to solve medical problems. Google recently introduced Med-Gemini, a refined version of its earlier Gemini models that is tailored for medical applications and equipped with web-based search capabilities to enhance clinical reasoning.

As these models evolve, they will become increasingly adept at analyzing complex medical data, diagnosing diseases, and recommending treatments. They could offer a level of precision and consistency that humans sometimes struggle to achieve because of fatigue and error. And they pave the way to a future where treatment portals are no longer controlled by doctors, but by machines.
