Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Theimer-Lienhard, Xavier; El-Amin, Mushtaha; Elhassan, Fay; Vaidya, Sahaj; Cartier-Negadi, Victor; Sasu, David; Klein, Lars; Hartley, Mary-Anne

Fully Open Meditron:
An Auditable Pipeline for Clinical LLMs

Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi,
David Sasu, Lars Klein, Mary-Anne Hartley

LiGHT Lab, EPFL • arXiv preprint, May 2026

xavier.theimer-lienhard@epfl.ch

arXiv 🤗 Models Datasets Code

TL;DR — Most "open" medical AI models hide the data and recipe they were trained on, so no one can audit what they actually learned. We built the first medical AI where every step is open and it still beats leading partially open systems.

Abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS (LLM-CDSS) remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. We apply the recipe to five FO base models. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA for LLM-CDSS, and Gemma-3-27B-MeditronFO surpasses MedGemma on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

Fully Open Meditron

Our release includes:

A reproducible end-to-end framework spanning corpus construction, synthetic data generation, decontamination, training, and evaluation.
A structured, clinician-audited knowledge corpus normalizing eight public medical QA datasets and expanding coverage via clinician-vetted synthetic generation.
Auto-MOOVE, an automated, open-ended clinical evaluation protocol validated against 204 human raters.
A family of fully open medical specialists across three model families (Apertus, OLMo-2, EuroLLM).

We argue that clinically competitive medical specialization can be achieved using reproducible, auditable pipelines rather than opaque adaptation procedures, providing a reusable foundation for training and evaluating future fully open medical models.

Model	MedMCQA	MedQA	PubMedQA	MedXpertQA	HealthBench	Avg	Gain
Fully open base
Apertus-70B-Instruct	52.43	60.64	66.80	12.33	43.72	47.18±0.84	–
+ Fully Open Meditron	56.32	68.58	75.20	16.90	51.86	53.77±0.86	6.59
OLMo-2-32B-SFT	59.10	66.22	72.00	13.02	31.03	51.52±0.81	–
+ Fully Open Meditron	57.83	69.44	76.60	17.96	44.00	53.17±0.85	1.65
EuroLLM-22B-Instruct	54.94	66.61	73.60	14.61	43.72	50.70±0.84	–
+ Fully Open Meditron	54.79	63.16	78.00	14.61	46.22	51.36±0.84	0.66
Small fully open base
Apertus-8B-Instruct	45.80	51.14	37.60	11.71	31.61	35.57±0.76	–
+ Fully Open Meditron	48.74	58.44	75.60	13.67	45.38	48.37±0.84	12.80
EuroLLM-9B-Instruct	37.84	48.55	40.00	10.33	23.00	31.94±0.72	–
+ Fully Open Meditron	46.98	49.73	67.40	11.63	37.53	42.65±0.81	10.71
Open-access base
gemma-3-27b-it	62.75	76.20	74.60	16.69	57.49	57.55±0.83	–
+ Fully Open Meditron	63.71	77.61	75.80	18.00	58.02	58.63±0.83	1.08
MedGemma-27B	66.44	86.10	73.00	21.88	55.92	60.67±0.81	3.12

Medical benchmark accuracy (%). Best within partition in bold; best fully open underlined. HealthBench uses the full benchmark with Qwen3-235B-A22B as judge.

AutoMOOVE evaluation

Auto-MOOVE results: MeditronFO vs base models

Auto-MOOVE results: Gemma-MeditronFO vs MedGemma

Auto-MOOVE pairwise preference results: every MeditronFO model is preferred over its corresponding base (left), and Gemma-3-27B-MeditronFO is preferred over MedGemma (right).

Per-criterion Likert profiles show improvements across all nine clinical evaluation dimensions.

Key Results

Fully open state of the art. Fine-tuning on the Fully Open Meditron corpus consistently improves base models. Apertus-70B-MeditronFO reaches a 53.8% average on aggregate medical benchmarks (+6.6 over its base), establishing a new fully-open state of the art. Smaller bases benefit most, with Apertus-8B improving +12.8 points.
Surpassing closed-data specialists. On the open-ended Auto-MOOVE protocol, validated against 204 human raters, every MeditronFO variant is preferred over its base, with adjusted win rates from 67.2% to 92%. Gemma-3-27B-MeditronFO surpasses MedGemma, a closed-data specialist, on both HealthBench (58.0 vs 55.9) and Auto-MOOVE (preferred in 58.6% of comparisons), despite being built from a fully open, auditable pipeline.

Model	Base	Base model openness			Medical adaptation openness				License
Model	Base	Data	Code	Weights	Data	Synth	Code	Weights	License
Closed
Med-Gemini	Gemini								R
Open Weights
BioMistral	Mistral								O
Meditron-70B	Llama 2								IC
Meditron-3	Llama 3.1								IC
Aloe Beta	Llama 3.1								R
MedGemma-27B	Gemma-3-27B-it								C
Partially Open
Gemma-3-27B-MeditronFO	Gemma-3-27B-it								IC
Fully Open
Apertus-70B-MeditronFO	Apertus-70B								O
OLMo-2-32B-MeditronFO	OLMo-2-32B								O
EuroLLM-22B-MeditronFO	EuroLLM-22B								O

Openness across medical LLMs. MeditronFO is the first family to satisfy every dimension end-to-end. License: O = permissive open, C = community/commercial with restrictions, IC = inherited base-model license (adaptation permissively released), R = restrictive/research-only/proprietary.

Fully Open Meditron:
An Auditable Pipeline for Clinical LLMs

Abstract

Fully Open Meditron

AutoMOOVE evaluation

Key Results

BibTeX

Acknowledgements

.logo { width: 1.5em; position: relative; top: -10px; left: -5px; vertical-align: middle; } Fully Open Meditron:An Auditable Pipeline for Clinical LLMs

Abstract

Fully Open Meditron

AutoMOOVE evaluation

Key Results

BibTeX

Acknowledgements

Fully Open Meditron:
An Auditable Pipeline for Clinical LLMs