Author’s Corner: First-Person Account of Exploring Automated Clinical Concordance

A behind-the-scenes perspective on our study: Automated Evaluation of Large Language Model Response Concordance with Human Specialist Responses on Physician-to-Physician eConsult Cases, accepted to the 2026 Pacific Symposium on Biocomputing.
I think it’s rare that people read scientific articles today. I am guilty of this myself; usually it’s just a key figure or a couple of headline-grabbing findings that people really know a study by. I’ve been pleasantly surprised by the amount of positive feedback we’ve gotten even just from a little preprint. Ethan, one of the co-lead authors of the study, is always full of great ideas and suggested I write a little blog post too, a “behind-the-scenes” of this study. My intention with this post is to share with y’all a first-person account of what we were thinking during the study, as well as candid reflections on what my experience was like carrying it out. Enjoy!
Study findings TL;DR: LLM-as-judge outperforms a Decompose-then-Verify method (F1: 0.89 vs. 0.72) and reaches human-level rater performance in assessing clinical concordance between AI and human specialist responses (Cohen’s κ = 0.75, comparable to human inter-rater agreement of κ = 0.69-0.90), suggesting that clinical concordance evaluation can be done in a scalable way.
My personal TL;DR: Intelligence is different from Life, and working with LLMs requires a different kind of focus that takes a mental toll!
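For anyone curious what the numbers in the study TL;DR actually measure, here is a minimal sketch of scoring an automated judge against human reference labels. The ratings below are made up for illustration (they are not our data), and the scikit-learn calls are just one convenient way to compute F1 and Cohen’s κ:

```python
# Minimal sketch: scoring an automated judge against human reference ratings.
# The labels below are invented for illustration; 1 = concordant, 0 = not concordant.
from sklearn.metrics import f1_score, cohen_kappa_score

human_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]   # reference ratings from a human rater
judge_labels = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]   # binary ratings from the automated judge

print("F1 vs. human reference:", f1_score(human_labels, judge_labels))
print("Cohen's kappa:", cohen_kappa_score(human_labels, judge_labels))
```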
The Beginnings
Ethan messaged me out of the blue one day and was like, “Hey, I have a study idea, would you be interested?” I said sure. Things aligned really well with my residency schedule, and I carved out some time to really buckle down, dive into the problem, and work with the amazing group here at the Stanford ARISE/HealthRex Lab. This final part was a great, unexpected joy. It’s rare to work with such a world-class team of collaborators, and it really makes the academic process so much more fun and rewarding.
I had a spooky moment early on when testing these LLMs. I tried out LLM-as-judge on 100 cases and had it rate on a binary scale whether an AI-generated response was concordant with the human specialist response. This would take a human rater hours, maybe even days, to do. Though I am no serious coder, my small amount of coding experience has taught me the importance of prayer, and as I hit run, I prayed for the code to run correctly. Around the 9th time… it did! And it was spooky seeing LLMs so quickly do a task that humans would take so long to do. I also asked for an “explanation” of each rating, and reading these explanations was what convinced me there was some merit to this method. Seeing all this happen so fast made me realize, “Oh shoot, this is something different.”

My “oh shoot” moment was when I ran 99 cases of human vs. AI responses through LLM-as-Judge and got responses back in <5 minutes… each one with some pretty good reasoning behind its concordance rating. This would take a human… multiple days to do!
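For the curious, here is a rough sketch of what a single LLM-as-judge call can look like. This is not our study’s code: the prompt wording, the gpt-4o model name, and the use of the OpenAI client are placeholders for illustration only.

```python
# Illustrative sketch of one LLM-as-judge call (placeholder prompt and model, not the study's).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing two answers to the same physician eConsult question.
Specialist response: {specialist}
AI-generated response: {ai}
Rate whether the AI response is clinically concordant with the specialist response.
Reply with 'CONCORDANT' or 'NOT CONCORDANT', followed by a one-paragraph explanation."""

def judge_case(specialist_response: str, ai_response: str) -> str:
    """Send one case to the judge model and return its rating plus explanation."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            specialist=specialist_response, ai=ai_response)}],
    )
    return completion.choices[0].message.content
```

Looping a function like this over a batch of cases is what turned a multi-day human task into a few minutes of waiting.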
Deeper Thoughts
I’ve been meaning to write another post on this, but one of the key things I want to get across is that I am realizing there is a difference between Life and Intelligence. I used to work with mice and cell culture. These are alive, but not intelligent. LLMs, on the other hand, are intelligent, but not alive. Working with model organisms, whether mice or cells or zebrafish or you-name-it, comes with certain quirks and, in a way, takes a toll on the researcher: mice are smelly, they can bite you, you have to adapt to their nocturnal schedule and habits, and you design your experiments around their quirks. Working with LLMs is different. They each have their own personalities, and they are stochastic but readily available. The biggest thing is realizing… I am the bottleneck. My own willpower and ability to “hold the greater vision of the experiment” in my head is the limiting factor. It can be mentally exhausting working with these intelligent models in experiments while using other models to help write the code that corrals, contains, and executes those experiments.
The Work Itself: As far as the manuscript itself goes, the general question is whether there is a scalable way to evaluate clinical concordance between an AI response and a specialist response to the same question. We can ask this because we have a nice dataset of thousands of retrospective eConsult cases. Clinical concordance can be used as a benchmark, one among many, a “vital sign” of sorts to see how new models are performing. For example, what if GPT-5 was only 70% concordant with Stanford specialists on 16,000 cases but Claude Opus was 80% concordant? Of course, one weakness is that Stanford specialists are not the Gold Standard, but in real-world clinical cases there isn’t really a true Gold Standard. Instead, what we have are silver-label cases.
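To make the “vital sign” idea concrete, here is a toy sketch of aggregating per-model concordance rates. The model names and ratings are entirely hypothetical; nothing here comes from the actual dataset.

```python
# Toy sketch of using concordance rate as a "vital sign" benchmark across models.
# Ratings are hypothetical: 1 = judged concordant with the specialist, 0 = not.
model_ratings = {
    "model_a": [1, 0, 1, 1, 1, 0, 1, 1],
    "model_b": [1, 1, 1, 1, 0, 1, 1, 1],
}

for model, ratings in model_ratings.items():
    rate = 100 * sum(ratings) / len(ratings)
    print(f"{model}: {rate:.0f}% concordant with specialist responses")
```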

I went on a lot of walks when I got stuck…
So to do this, we tested out two different methods: Decompose-then-Verify (DtV) vs. LLM-as-judge (LaJ). Initially, I really thought that DtV would be better because it was systematic; it was based on some cool work out of Stanford (VeriFact) and Johns Hopkins (MedScore). We had decent results at first with DtV. Originally we were going to publish just that, but my collaborators insisted we try LaJ, so I was like sure… let’s do it… and then found, oh my, it was even better than DtV! In both methods, iterative prompt improvement boosted performance, but that really is a no-brainer these days. I do want to interpret these results with caution, though: it’s possible that with a different concordance scale (what we’re working on next), the DtV method might outperform the LaJ method, as there was relatively noisy agreement between the human raters themselves.
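To give a feel for how the two methods differ structurally, here is a toy sketch. In the real pipelines the decomposition and verification steps are LLM calls; the crude string heuristics below are only stand-ins so the skeleton runs end to end, and none of these function names come from the study.

```python
# Toy sketch contrasting Decompose-then-Verify with a holistic judge.
# The heuristics below are placeholders; the actual pipelines use LLM calls.

def decompose(ai_response: str) -> list[str]:
    """Split the AI response into rough 'claims' (in practice, an LLM does this)."""
    return [s.strip() for s in ai_response.split(".") if s.strip()]

def verify_claim(claim: str, specialist_response: str) -> bool:
    """Crude stand-in for claim verification: word overlap with the specialist text."""
    claim_words = set(claim.lower().split())
    spec_words = set(specialist_response.lower().split())
    return len(claim_words & spec_words) / max(len(claim_words), 1) > 0.5

def decompose_then_verify(ai_response: str, specialist_response: str,
                          threshold: float = 0.8) -> bool:
    """DtV: concordant if enough decomposed claims are supported by the specialist response."""
    claims = decompose(ai_response)
    supported = sum(verify_claim(c, specialist_response) for c in claims)
    return supported / max(len(claims), 1) >= threshold

# LaJ, by contrast, is a single holistic model call per case (see the judge_case sketch above).
```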
The Difficult Question of Concordance: One deeper meta-question we ran into, and one that was actually quite difficult to get at, was the meaning of Concordance itself. What does it mean for two things to be ‘concordant’? For example, if the specialist says go get imaging first and then go to the ED if there are concerning findings, whereas the AI says ___, are they concordant? I was so stumped by this question that one day I just stood up from my computer, exasperated, left, and went on a hike with my best friend Chris. I thought maybe thinking through things on the hike would untangle it, but it didn’t. Instead, we ended up stopping at Barnes & Noble afterward, where I checked out the logic/philosophy section and picked up a great book on logic, “The Art of Logic in an Illogical World” by Eugenia Cheng, which really helped me get a better grasp of how to even approach the problem. Highly recommend the book.

Where we hiked
Ultimately, we did decide on a rudimentary framework, just a binary scale, but as you can see in the study, it was quite hard to get hospitalists to agree on something! You will just have to wait until our next study to see what we have cooking in terms of a Concordance framework… I will just hint that our cousins in Radiology have a nice framework called RADPEER.
What is Good Medicine?
There is one final question. Just because an answer is concordant with a Stanford specialist doesn’t mean that it’s correct, or the best medicine. But sometimes there isn’t a clear answer either. What makes good medicine, really? Is it always just going to be “what a panel of physicians agree upon”? So far, that’s the way it’s been throughout the history of medicine from the 20th century onward: “a consensus panel of experts” who base their decisions on data, or, where there’s no data, on “clinical expertise.”
I leave you with one final thought: what happens when Superintelligent AI arrives? How would we even know that it’s superintelligent, when in medicine there aren’t really clear benchmarks for measuring whether something is “superhuman”? If an AI can run the 100m faster than Usain Bolt, sure, we know it’s faster, but there isn’t really a 100m in medicine. What do y’all think? Would love to hear your thoughts.
David JH Wu is a Stanford Radiation Oncology resident originally from Cupertino, CA. He is interested in the application and evaluation of frontier models in medicine. In his spare time, he enjoys reading books from Little Free Libraries, napping on park benches, and Pilates.


