Training a MultiMeditron model¶
This tutorial provides a step-by-step guide on how to train a model using MultiMeditron. We will walk you through the process with clear examples.
Configuration files¶
Each training run is configured through a YAML file. For the full documentation of the arguments supported by the configuration file, refer to the configuration reference.
base_llm: Qwen/Qwen3-8B
base_model: null
attachment_token: <|reserved_special_token_0|>
tokenizer_type: qwen3
token_size: 4096
loaders:
  - loader_type: raw-image
    modality_type: image
modalities:
  - model_type: meditron_clip
    clip_name: openai/clip-vit-large-patch14
    hidden_size: 4096
training_mode: ALIGNMENT
datasets:
  - packed_path: /path/to/dataset
training_args:
  output_dir: /path/to/checkpoint
  dataloader_num_workers: 16
  dataloader_prefetch_factor: 4
  remove_unused_columns: false
  ddp_find_unused_parameters: false
  learning_rate: 1.0e-4
  bf16: true
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8
  num_train_epochs: 1
  gradient_checkpointing: true
  gradient_checkpointing_kwargs:
    use_reentrant: true
  save_strategy: epoch
  max_grad_norm: 1.0
  deepspeed: deepspeed.json
  accelerator_config:
    dispatch_batches: false
  lr_scheduler_type: cosine_with_min_lr
  lr_scheduler_kwargs:
    min_lr: 3.0e-5
  logging_steps: 1
  weight_decay: 0.01
Make sure to replace /path/to/dataset and /path/to/checkpoint with your actual dataset path and output checkpoint path.
Save this configuration as a YAML file. In this tutorial, we store it in config.yaml.
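Before moving on, it can be useful to check that the file parses as valid YAML. The command below is a minimal sketch, assuming Python and the pyyaml package are available in your environment:

# quick sanity check: fails loudly if config.yaml is not valid YAML
# (assumes Python with the pyyaml package installed)
python -c "import yaml; yaml.safe_load(open('config.yaml'))" && echo "config.yaml is valid YAML"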
Additionally, we use DeepSpeed for parallelism, so we also need a DeepSpeed configuration file. Here is the configuration we use on an NVIDIA GH200 setup:
{
"bf16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": false,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false
},
"flops_profiler": {
"enabled": false
},
"aio": {
"block_size": 1048576,
"queue_depth": 8,
"single_submit": false,
"overlap_events": false
}
}
Save this file as deepspeed.json and make sure that its path matches the training_args.deepspeed argument in the YAML configuration.
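A malformed JSON file will only surface as an error once DeepSpeed initializes, so it can save time to validate it up front. This is a minimal sketch using Python's built-in json.tool module:

# verify that deepspeed.json parses as valid JSON before launching a run
python -m json.tool deepspeed.json > /dev/null && echo "deepspeed.json is valid JSON"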
Launch the training¶
Once the training configuration is ready, we can launch a training run. Both single-node and multi-node training are supported.
Single node training¶
To launch a single-node training, run the following command:
torchrun --nproc-per-node $PROC_PER_NODE -m multimeditron train --config config.yaml
where $PROC_PER_NODE is the number of GPUs available on the node.
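For example, on a machine with NVIDIA GPUs you can let nvidia-smi determine the GPU count. This is a minimal sketch, assuming nvidia-smi is on your PATH:

# use the number of visible GPUs as the number of processes per node
PROC_PER_NODE=$(nvidia-smi --list-gpus | wc -l)
torchrun --nproc-per-node $PROC_PER_NODE -m multimeditron train --config config.yaml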
Multi node training¶
We provide scripts to launch MultiMeditron training on multi-node clusters. Currently, we provide scripts for:
SLURM cluster
TODO: Provide script for Run:ai cluster
SLURM cluster¶
To launch a training on a SLURM cluster, we can use the following sbatch script:
#!/bin/bash
#SBATCH --job-name multimeditron-training
#SBATCH --output ~/reports/R-%x.%j.out
#SBATCH --error ~/reports/R-%x.%j.err
#SBATCH --nodes 4 # number of Nodes
#SBATCH --ntasks-per-node 1 # number of MP tasks. IMPORTANT: torchrun represents just 1 Slurm task
#SBATCH --gres gpu:4 # Number of GPUs
#SBATCH --cpus-per-task 288 # number of CPUs per task.
#SBATCH --time 11:59:59 # maximum execution time (HH:MM:SS)
#SBATCH --environment ~/.edf/multimodal.toml
#SBATCH --export=ALL,SCRATCH=/iopsstor/scratch/cscs/$USER
#SBATCH -A a127
export CUDA_LAUNCH_BLOCKING=1
echo "START TIME: $(date)"
# auto-fail on any errors in this script
set -eo pipefail
# logging script's variables/commands for future debug needs
set -x
######################
### Set environment ###
######################
GPUS_PER_NODE=4
echo "NODES: $SLURM_NNODES"
export HF_HOME=/path/to/hf/home
######################
#### Set network #####
######################
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6200
######################
# note that we don't want to interpolate `\$SLURM_PROCID` till `srun` since otherwise all nodes will get
# 0 and the launcher will hang
#
# same goes for `\$(hostname -s|tr -dc '0-9')` - we want it to interpolate at `srun` time
LAUNCHER="
torchrun \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $SLURM_NNODES \
--node_rank \$SLURM_PROCID \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d \
--max_restarts 0 \
--tee 3 \
"
export CMD="$LAUNCHER -m multimeditron train --config config.yaml"
echo $CMD
SRUN_ARGS=" \
--cpus-per-task $SLURM_CPUS_PER_TASK \
--jobid $SLURM_JOB_ID \
--wait 60 \
-A a06 \
--reservation=sai-a127
"
# bash -c is needed for the delayed interpolation of env vars to work
srun $SRUN_ARGS bash -c "$CMD"
echo "END TIME: $(date)"
Make sure to set $HF_HOME properly before launching the training: models and datasets will be downloaded into this folder, which can take many gigabytes of disk space! Save this script in a file called training.sh.
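If your compute nodes have limited or no internet access, it can also help to pre-download the base model into $HF_HOME from a login node. This is a minimal sketch, assuming the huggingface_hub CLI is installed and the path matches the HF_HOME exported in the script above:

# pre-populate the Hugging Face cache so compute nodes do not need to download the model
export HF_HOME=/path/to/hf/home
huggingface-cli download Qwen/Qwen3-8B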
Finally, launch the training by running this command:
sbatch training.sh
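Once the job is submitted, you can monitor it with standard SLURM tools. The commands below are a sketch; the log path follows the #SBATCH --output pattern in the script above, where <jobid> is the job ID printed by sbatch:

# check the state of your jobs in the queue
squeue -u $USER
# follow the training log (file name comes from the #SBATCH --job-name and --output directives)
tail -f ~/reports/R-multimeditron-training.<jobid>.out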