Dataset format¶

Supported modalities

Image modality

This section describes the dataset formats supported by the MultiMeditron pipeline. The format varies from one modality to another, and you can add your own modality by following this tutorial.

Our training pipeline supports two types of datasets: pretraining and instruction-tuning. Both can be stored in one of two formats:

Arrow/Parquet format (recommended): the modalities are stored directly in the dataset.
JSONL format (not recommended): the images and other modalities are stored on the file system. Such datasets must be processed with merge_inputs.py.

We support the following modalities. For a detailed format description, please refer to the corresponding documentation.

Dataset validation¶

Use the CLI checker to validate that a dataset follows the expected schema before training. This tool reads Arrow/Parquet datasets saved with save_to_disk as well as JSONL files, and reports the first formatting issues it finds.

python -m multimeditron check-dataset /path/to/dataset -m image

You can also control parallelism with --num-proc to speed up checks on large datasets:

python -m multimeditron check-dataset /path/to/dataset -m image --num-proc 8

If the dataset is invalid, the command prints a list of errors describing which samples failed and why.

Arrow format (recommended)¶

Pretraining dataset¶

Each dataset must contain a text column and a modalities column. The text column contains a string of the following form:

"Let's compare the first image: <|reserved_special_token_0|>, and the second 3D image: <|reserved_special_token_0|>"

And the modalities column must be of the following form:

[{"type": "modality_type", "value" : some_modality}]

For instance, for the image type, some_modality must be a PIL Image object.

Note that we use the special placeholder <|reserved_special_token_0|> to indicate where the tokens from each modality are inserted.
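As a quick sanity check before building a dataset, the following minimal sketch compares the number of placeholders in a pretraining sample with the number of entries in its modalities list. It assumes that placeholders and modalities are matched one-to-one, in order, which the example above suggests but does not state. The helper name placeholder_count_matches is hypothetical and not part of MultiMeditron.

```python
PLACEHOLDER = "<|reserved_special_token_0|>"


def placeholder_count_matches(sample: dict) -> bool:
    """Return True when the text has exactly one placeholder per modality entry."""
    return sample["text"].count(PLACEHOLDER) == len(sample["modalities"])


# Illustrative sample: the values are left empty here, whereas a real Arrow
# dataset would store the modality itself (e.g. a PIL Image for "image").
sample = {
    "text": (
        f"Let's compare the first image: {PLACEHOLDER}, "
        f"and the second 3D image: {PLACEHOLDER}"
    ),
    "modalities": [
        {"type": "image", "value": None},
        {"type": "image_3d", "value": None},
    ],
}

print(placeholder_count_matches(sample))
```

This check does not replace the check-dataset command; it only illustrates the relationship between the two columns.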

Instruction-tuning dataset¶

The format is the same as for the pretraining dataset, except that the text column is replaced by a conversations column:

[
    {"role" : "system", "content" : "You are Meditron"},
    {"role" : "user", "content" : "Compare the CT scan <|reserved_special_token_0|> with the image <|reserved_special_token_0|>."},
    {"role" : "assistant", "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit."},
    {"role" : "user", "content" : "How is it related to that signal: <|reserved_special_token_0|>?"},
    {"role" : "assistant", "content" : "Sed non risus. Suspendisse lectus tortor, dignissim sit amet, adipiscing nec, ultricies sed, dolor."}
]
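In a conversation, placeholders can appear in several turns. The sketch below counts them per turn so that the total can be compared with the length of the modalities column. It assumes, as the example suggests, that modalities are listed in the order their placeholders appear across the turns. The function name placeholders_per_turn is hypothetical and not part of MultiMeditron.

```python
PLACEHOLDER = "<|reserved_special_token_0|>"


def placeholders_per_turn(conversations: list[dict]) -> list[int]:
    """Count the modality placeholders in the content of each turn."""
    return [turn["content"].count(PLACEHOLDER) for turn in conversations]


conversations = [
    {"role": "system", "content": "You are Meditron"},
    {"role": "user", "content": f"Compare the CT scan {PLACEHOLDER} with the image {PLACEHOLDER}."},
    {"role": "assistant", "content": "Lorem ipsum dolor sit amet."},
    {"role": "user", "content": f"How is it related to that signal: {PLACEHOLDER}?"},
    {"role": "assistant", "content": "Sed non risus."},
]

# Types only; the actual values are omitted in this illustration.
modality_types = ["image_3d", "image", "signal"]

counts = placeholders_per_turn(conversations)
print(counts, sum(counts) == len(modality_types))
```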

JSONL format (deprecated)¶

We also support .jsonl files, where each line corresponds to one sample. The sections below describe how each sample must be formatted.

Warning

Please note that the JSONL format is not recommended! We provide scripts to convert JSONL-formatted datasets into Arrow datasets. If your dataset is in JSONL format, you must convert it to Arrow before training.

Pretraining format¶

{
  "text": "Let's compare the first image: <|reserved_special_token_0|>, and the second 3D image: <|reserved_special_token_0|>",
  "modalities": [{"type" : "image", "value" : "path/to/png"}, {"type" : "image_3d", "value" : "path/to/npy"}]
}

Instruction-tuning format¶

{
  "conversations": [
    {"role": "system", "content" : "You are Meditron"},
    {"role": "user", "content" : "Compare the CT scan <|reserved_special_token_0|> with the image <|reserved_special_token_0|>."},
    {"role": "assistant", "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit."},
    {"role": "user", "content" : "How is it related to that signal: <|reserved_special_token_0|>?"},
    {"role": "assistant", "content" : "Sed non risus. Suspendisse lectus tortor, dignissim sit amet, adipiscing nec, ultricies sed, dolor."}
  ],
  "modalities": [{"type" : "image_3d", "value" : "path/to/npy"}, {"type" : "image", "value" : "path/to/png"}, {"type" : "signal", "value" : "path/to/npy"}]
}
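The samples above are pretty-printed for readability; in an actual .jsonl file each sample occupies a single line. The sketch below reads such lines with the standard json module, distinguishes the two sample types by their keys, and collects the modality paths. The helper names iter_jsonl_samples and sample_kind are hypothetical, the in-memory file stands in for a real one, and none of this is MultiMeditron's own loading code.

```python
import io
import json


def iter_jsonl_samples(lines):
    """Yield one parsed sample per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def sample_kind(sample: dict) -> str:
    """Classify a sample by the column that carries its text."""
    if "conversations" in sample:
        return "instruction-tuning"
    if "text" in sample:
        return "pretraining"
    return "unknown"


# In-memory stand-in for a .jsonl file; the paths are placeholders.
jsonl_file = io.StringIO(
    '{"text": "An image: <|reserved_special_token_0|>", '
    '"modalities": [{"type": "image", "value": "path/to/png"}]}\n'
    '{"conversations": [{"role": "user", "content": "Describe <|reserved_special_token_0|>."}], '
    '"modalities": [{"type": "image_3d", "value": "path/to/npy"}]}\n'
)

samples = list(iter_jsonl_samples(jsonl_file))
kinds = [sample_kind(s) for s in samples]
paths = [m["value"] for s in samples for m in s["modalities"]]
print(kinds)
print(paths)
```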