Dataset format

Supported modalities

This section describes the dataset format supported by the MultiMeditron pipeline. The format varies from one modality to another, and you can add your own modality by following this tutorial.

Our training pipeline supports two types of datasets: pretraining and instruction-tuning. We support two storage formats:

  1. Arrow/Parquet format (recommended): where the modalities are directly stored in the dataset

  2. JSONL format (not recommended): where the images and other modalities are stored on the file system. Such datasets must be processed with merge_inputs.py

We support the following modalities. For a detailed format description, please refer to the corresponding documentation.

Dataset validation

Use the CLI checker to validate that a dataset follows the expected schema before training. This tool reads Arrow/Parquet datasets saved with save_to_disk as well as JSONL files, and reports the first formatting issues it finds.

python -m multimeditron check-dataset /path/to/dataset -m image

You can also control parallelism with --num-proc to speed up checks on large datasets:

python -m multimeditron check-dataset /path/to/dataset -m image --num-proc 8

If the dataset is invalid, the command prints a list of errors describing which samples failed and why.
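
When the check is one step of a larger preparation script, the same command can be assembled from Python. The following is a minimal sketch and not part of MultiMeditron: the helper names build_check_command and run_check are hypothetical, and only the arguments shown above (the dataset path, -m, and --num-proc) are used. How the command signals failure through its exit status is not documented here, so the sketch only returns the status without interpreting it.

```python
import subprocess
import sys
from typing import List, Optional


def build_check_command(dataset_path: str, modality: str,
                        num_proc: Optional[int] = None) -> List[str]:
    """Build the argument list for the check-dataset command shown above.

    Hypothetical helper: it uses only the arguments documented in this section.
    """
    cmd = [sys.executable, "-m", "multimeditron", "check-dataset",
           dataset_path, "-m", modality]
    if num_proc is not None:
        # --num-proc is optional; add it only when a value is given.
        cmd += ["--num-proc", str(num_proc)]
    return cmd


def run_check(dataset_path: str, modality: str,
              num_proc: Optional[int] = None) -> int:
    """Run the checker, let it print its own report, and return its exit status."""
    result = subprocess.run(build_check_command(dataset_path, modality, num_proc))
    return result.returncode


# Usage (requires multimeditron to be installed):
# status = run_check("/path/to/dataset", "image", num_proc=8)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the dataset path contains spaces.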

JSONL format (deprecated)

We also support .jsonl files in which each line is one sample. The sections below describe how each sample must be formatted.

Warning

Please note that the JSONL format is not recommended! We provide scripts to convert JSONL-formatted datasets into Arrow datasets. If your dataset is in JSONL format, you need to convert it to Arrow before training.

Pretraining format

{
  "text": "Let's compare the first image: <|reserved_special_token_0|>, and the second 3D image: <|reserved_special_token_0|>",
  "modalities": [{"type" : "image", "value" : "path/to/png"}, {"type" : "image_3d", "value" : "path/to/npy"}]
}
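
Before converting a JSONL file, it can help to sanity-check each line. The sketch below is illustrative and is not the check-dataset implementation. The function check_pretraining_sample is a hypothetical name. It assumes, based only on the example above, that the text contains exactly one <|reserved_special_token_0|> placeholder per entry in "modalities".

```python
import json

PLACEHOLDER = "<|reserved_special_token_0|>"


def check_pretraining_sample(sample: dict) -> list:
    """Return a list of problems found in one pretraining sample (empty if none)."""
    problems = []
    text = sample.get("text")
    modalities = sample.get("modalities")
    if not isinstance(text, str):
        return ["'text' must be a string"]
    if not isinstance(modalities, list):
        return ["'modalities' must be a list"]
    for i, modality in enumerate(modalities):
        # Each entry in the example carries a "type" and a "value" key.
        if not isinstance(modality, dict) or "type" not in modality or "value" not in modality:
            problems.append(f"modality {i} needs 'type' and 'value' keys")
    # Assumption: one placeholder in the text per listed modality.
    n_placeholders = text.count(PLACEHOLDER)
    if n_placeholders != len(modalities):
        problems.append(
            f"{n_placeholders} placeholders but {len(modalities)} modalities"
        )
    return problems


sample = {
    "text": f"Let's compare the first image: {PLACEHOLDER}, "
            f"and the second 3D image: {PLACEHOLDER}",
    "modalities": [
        {"type": "image", "value": "path/to/png"},
        {"type": "image_3d", "value": "path/to/npy"},
    ],
}

# One JSONL line is one JSON object serialized on a single line.
jsonl_line = json.dumps(sample)
problems = check_pretraining_sample(json.loads(jsonl_line))
print(problems)
```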

Instruction-tuning format

{
  "conversations": [
    {"role": "system", "content" : "You are Meditron"},
    {"role": "user", "content" : "Compare the CT scan <|reserved_special_token_0|> with the image <|reserved_special_token_0|>."},
    {"role": "assistant", "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit."},
    {"role": "user", "content" : "How is it related to that signal: <|reserved_special_token_0|>?"},
    {"role": "assistant", "content" : "Sed non risus. Suspendisse lectus tortor, dignissim sit amet, adipiscing nec, ultricies sed, dolor."}
  ],
  "modalities": [{"type" : "image_3d", "value" : "path/to/npy"}, {"type" : "image", "value" : "path/to/png"}, {"type" : "signal", "value" : "path/to/npy"}]
}
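
In the example above, the three placeholders appear in the same order as the three entries of "modalities" (3D image, image, signal). The sketch below pairs them up on that basis. It is a hedged illustration rather than the pipeline's actual logic: pair_placeholders is a hypothetical name, and matching by order of appearance across turns is an assumption inferred from the example.

```python
PLACEHOLDER = "<|reserved_special_token_0|>"


def pair_placeholders(sample: dict) -> list:
    """Pair each placeholder with a modality, by order of appearance.

    Returns (turn_index, modality_type) tuples. Raises ValueError when the
    number of placeholders and modalities differ.
    """
    modalities = sample["modalities"]
    pairs = []
    next_modality = 0
    for turn_index, message in enumerate(sample["conversations"]):
        for _ in range(message["content"].count(PLACEHOLDER)):
            if next_modality >= len(modalities):
                raise ValueError("more placeholders than modalities")
            pairs.append((turn_index, modalities[next_modality]["type"]))
            next_modality += 1
    if next_modality != len(modalities):
        raise ValueError("more modalities than placeholders")
    return pairs


sample = {
    "conversations": [
        {"role": "system", "content": "You are Meditron"},
        {"role": "user", "content": f"Compare the CT scan {PLACEHOLDER} with the image {PLACEHOLDER}."},
        {"role": "assistant", "content": "Lorem ipsum dolor sit amet."},
        {"role": "user", "content": f"How is it related to that signal: {PLACEHOLDER}?"},
        {"role": "assistant", "content": "Sed non risus."},
    ],
    "modalities": [
        {"type": "image_3d", "value": "path/to/npy"},
        {"type": "image", "value": "path/to/png"},
        {"type": "signal", "value": "path/to/npy"},
    ],
}

pairs = pair_placeholders(sample)
print(pairs)  # [(1, 'image_3d'), (1, 'image'), (3, 'signal')]
```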