Fully Open Meditron:
An Auditable Pipeline for Clinical LLMs

LiGHT Lab, EPFL  •  arXiv preprint, May 2026
HealthBench timeline

TL;DR — Most "open" medical AI models hide the data and recipe they were trained on, so no one can audit what they actually learned. We built the first medical AI where every step is open and it still beats leading partially open systems.

Abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS (LLM-CDSS) remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. We apply the recipe to five FO base models. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA for LLM-CDSS, and Gemma-3-27B-MeditronFO surpasses MedGemma on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

Fully Open Meditron

Corpus pipeline

Our release includes:

  • A reproducible end-to-end framework spanning corpus construction, synthetic data generation, decontamination, training, and evaluation.
  • A structured, clinician-audited knowledge corpus normalizing eight public medical QA datasets and expanding coverage via clinician-vetted synthetic generation.
  • Auto-MOOVE, an automated, open-ended clinical evaluation protocol validated against 204 human raters.
  • A family of fully open medical specialists across three model families (Apertus, OLMo-2, EuroLLM).

We argue that clinically competitive medical specialization can be achieved using reproducible, auditable pipelines rather than opaque adaptation procedures, providing a reusable foundation for training and evaluating future fully open medical models.

Model MedMCQA MedQA PubMedQA MedXpertQA HealthBench Avg Gain
Fully open base
Apertus-70B-Instruct 52.43 60.64 66.80 12.33 43.72 47.18±0.84
+ Fully Open Meditron 56.32 68.58 75.20 16.90 51.86 53.77±0.86 6.59
OLMo-2-32B-SFT 59.10 66.22 72.00 13.02 31.03 51.52±0.81
+ Fully Open Meditron 57.83 69.44 76.60 17.96 44.00 53.17±0.85 1.65
EuroLLM-22B-Instruct 54.94 66.61 73.60 14.61 43.72 50.70±0.84
+ Fully Open Meditron 54.79 63.16 78.00 14.61 46.22 51.36±0.84 0.66
Small fully open base
Apertus-8B-Instruct 45.80 51.14 37.60 11.71 31.61 35.57±0.76
+ Fully Open Meditron 48.74 58.44 75.60 13.67 45.38 48.37±0.84 12.80
EuroLLM-9B-Instruct 37.84 48.55 40.00 10.33 23.00 31.94±0.72
+ Fully Open Meditron 46.98 49.73 67.40 11.63 37.53 42.65±0.81 10.71
Open-access base
gemma-3-27b-it 62.75 76.20 74.60 16.69 57.49 57.55±0.83
+ Fully Open Meditron 63.71 77.61 75.80 18.00 58.02 58.63±0.83 1.08
MedGemma-27B 66.44 86.10 73.00 21.88 55.92 60.67±0.81 3.12

Medical benchmark accuracy (%). Best within partition in bold; best fully open underlined. HealthBench uses the full benchmark with Qwen3-235B-A22B as judge.

AutoMOOVE evaluation

Auto-MOOVE results: MeditronFO vs base models
Auto-MOOVE results: Gemma-MeditronFO vs MedGemma

Auto-MOOVE pairwise preference results: every MeditronFO model is preferred over its corresponding base (left), and Gemma-3-27B-MeditronFO is preferred over MedGemma (right).


Likert profile: Apertus-70B-MeditronFO
Likert profile: Gemma-3-27B-MeditronFO

Per-criterion Likert profiles show improvements across all nine clinical evaluation dimensions.

Key Results

  • Fully open state of the art. Fine-tuning on the Fully Open Meditron corpus consistently improves base models. Apertus-70B-MeditronFO reaches a 53.8% average on aggregate medical benchmarks (+6.6 over its base), establishing a new fully-open state of the art. Smaller bases benefit most, with Apertus-8B improving +12.8 points.

  • Surpassing closed-data specialists. On the open-ended Auto-MOOVE protocol, validated against 204 human raters, every MeditronFO variant is preferred over its base, with adjusted win rates from 67.2% to 92%. Gemma-3-27B-MeditronFO surpasses MedGemma, a closed-data specialist, on both HealthBench (58.0 vs 55.9) and Auto-MOOVE (preferred in 58.6% of comparisons), despite being built from a fully open, auditable pipeline.
  • Model Base Base model openness Medical adaptation openness License
    Data Code Weights Data Synth Code Weights
    Closed
    Med-GeminiGemini R
    Open Weights
    BioMistralMistral O
    Meditron-70BLlama 2 IC
    Meditron-3Llama 3.1 IC
    Aloe BetaLlama 3.1 R
    MedGemma-27BGemma-3-27B-it C
    Partially Open
    Gemma-3-27B-MeditronFOGemma-3-27B-it IC
    Fully Open
    Apertus-70B-MeditronFOApertus-70B O
    OLMo-2-32B-MeditronFOOLMo-2-32B O
    EuroLLM-22B-MeditronFOEuroLLM-22B O

    Openness across medical LLMs. MeditronFO is the first family to satisfy every dimension end-to-end. License: O = permissive open, C = community/commercial with restrictions, IC = inherited base-model license (adaptation permissively released), R = restrictive/research-only/proprietary.

BibTeX

Please cite our paper if you use our models, data, code, or results:

@article{theimerlienhard2026fullyopenmeditron,
  title={Fully Open Meditron: An Auditable Pipeline for Clinical LLMs},
  author={Theimer-Lienhard, Xavier and El-Amin, Mushtaha and Elhassan, Fay and Vaidya, Sahaj and Cartier-Negadi, Victor and Sasu, David and Klein, Lars and Hartley, Mary-Anne},
  journal={arXiv preprint arXiv:2605.16215},
  year={2026},
  url={https://arxiv.org/abs/2605.16215}
}

Acknowledgements

This work was supported under project ID #27 as part of the Swiss AI Initiative, through a grant from the ETH Domain and computational resources provided by the Swiss National Supercomputing Centre (CSCS) under the Alps infrastructure. We thank the physician review panel within the LiGHT laboratory for their clinical auditing, and the many physicians and clinical experts who contributed to the MOOVE initiative.