Configuration Reference
MMIRAGE pipelines are configured through a single YAML file split into four top-level sections.
processors
A list of processor definitions. Currently the only supported type is `llm`.
processors:
  - type: llm
    server_args:
      model_path: Qwen/Qwen3-8B
      tp_size: 4
      trust_remote_code: true
      disable_custom_all_reduce: false
    chat_template: ""  # Set to e.g. "qwen2-vl" for VLMs
    default_sampling_params:
      temperature: 0.1
      top_p: 0.9
      max_new_tokens: 1024
processors[*].server_args
| Field | Type | Default | Description |
|---|---|---|---|
| `model_path` | | | HuggingFace model ID or local path |
| `tp_size` | | auto from | Tensor parallelism size |
| `trust_remote_code` | | | Allow custom model code from HuggingFace |
| `disable_custom_all_reduce` | | | Disable custom all-reduce kernel |
processors[*].default_sampling_params
Any key-value pairs accepted by the SGLang sampling API, e.g.:
| Field | Description |
|---|---|
| `temperature` | Sampling temperature |
| `top_p` | Top-p nucleus sampling |
| `max_new_tokens` | Maximum tokens to generate |
processors[*].chat_template
Optional. Set to a named template (e.g. `qwen2-vl`, `llava`, `internvl`, `phi3_v`) for vision-language models. Defaults to the tokenizer's built-in template.
loading_params
Controls how datasets are loaded and distributed across shards.
loading_params:
  state_dir: ~/.cache/MMIRAGE/state_dir
  datasets:
    - path: /path/to/data.jsonl
      type: JSONL
      output_dir: /path/to/output/shards
      image_base_path: /path/to/images  # optional, for vision tasks
  num_shards: 4
  shard_id: "$SLURM_ARRAY_TASK_ID"
  batch_size: 64
| Field | Type | Default | Description |
|---|---|---|---|
| `state_dir` | | | Shared directory for shard state, retry markers, and status files |
| `datasets` | | | List of dataset configurations (see below) |
| `num_shards` | | | Total number of shards to split datasets into |
| `shard_id` | | | Index of this shard (0-based). In SLURM use `$SLURM_ARRAY_TASK_ID` |
| `batch_size` | | | Batch size for processing samples |
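To make the interplay of `num_shards` and `shard_id` concrete, the sketch below shows one possible way to select the samples that belong to a shard and to turn a value such as `"$SLURM_ARRAY_TASK_ID"` into an integer. The round-robin split and the helper names `shard_indices` and `resolve_shard_id` are assumptions for illustration only; the reference does not state how MMIRAGE partitions a dataset internally.

```python
import os


def shard_indices(num_samples: int, num_shards: int, shard_id: int) -> list[int]:
    """Return the sample indices owned by one shard (hypothetical round-robin split)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError(f"shard_id must be in [0, {num_shards}), got {shard_id}")
    # Shard k takes samples k, k + num_shards, k + 2 * num_shards, ...
    return list(range(shard_id, num_samples, num_shards))


def resolve_shard_id(raw: str) -> int:
    """Expand a value such as "$SLURM_ARRAY_TASK_ID" and convert it to an int."""
    return int(os.path.expandvars(raw))
```

Whatever split is used, every sample should land in exactly one shard, so the shards together cover the dataset without overlap.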
loading_params.datasets[*]
| Field | Type | Required | Description |
|---|---|---|---|
| `path` | | ✓ | Path to dataset file or directory |
| `type` | | ✓ | Loader type, e.g. `JSONL` |
| `output_dir` | | ✓ | Directory where processed shards are written |
| `image_base_path` | | — | Base directory for resolving relative image paths |
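The role of `image_base_path` can be illustrated with a short sketch. It assumes that absolute image paths are left untouched and that relative ones are joined onto the base directory; the helper name `resolve_image_path` is hypothetical and not part of MMIRAGE.

```python
from pathlib import Path
from typing import Optional


def resolve_image_path(image: str, image_base_path: Optional[str] = None) -> Path:
    """Resolve an image reference against an optional base directory."""
    path = Path(image)
    # Absolute paths, or a missing base directory, leave the reference unchanged.
    if path.is_absolute() or image_base_path is None:
        return path
    return Path(image_base_path) / path
```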
processing_params
Defines variable extraction, LLM-driven generation, and the final output structure.
processing_params:
  inputs:
    - name: my_var
      key: field.nested[0].value  # JMESPath expression
      type: text  # "text" (default) or "image"
  outputs:
    - name: my_output
      type: llm
      output_type: plain  # "plain" or "JSON"
      prompt: |
        Do something with {{ my_var }}
      output_schema:  # Only for output_type: JSON
        - field_a
        - field_b
  remove_columns: false
  output_schema:
    result: "{{ my_output }}"
processing_params.inputs[*]
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | | — | Variable name used in Jinja2 templates |
| `key` | | — | JMESPath expression to extract value from a sample |
| `type` | | `text` | Input type: `text` or `image` |
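For readers unfamiliar with JMESPath, the example expression `field.nested[0].value` selects the same value as chained Python indexing. The sample record below is invented for illustration; only the expression comes from the configuration example.

```python
# Hypothetical sample; the field names mirror the example expression only.
sample = {"field": {"nested": [{"value": "a photo of a cat"}, {"value": "unused"}]}}

# JMESPath `field.nested[0].value` walks:
#   key "field" -> key "nested" -> list index 0 -> key "value"
my_var = sample["field"]["nested"][0]["value"]
print(my_var)
```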
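processing_params.outputs[*]
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | | — | Variable name made available in `output_schema` templates |
| `type` | | — | Processor type; must match a registered processor (currently only `llm`) |
| `output_type` | | | `plain` or `JSON` |
| `prompt` | | — | Jinja2 template for the LLM prompt |
| `output_schema` | | | Required field names when `output_type` is `JSON` |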
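When `output_type` is `JSON`, the per-output `output_schema` lists the field names the reply must contain. A minimal sketch of such a check follows; the function name `parse_json_output` and the choice to raise `ValueError` are assumptions, not MMIRAGE's documented behaviour.

```python
import json


def parse_json_output(raw: str, required_fields: list[str]) -> dict:
    """Parse an LLM reply as JSON and check that every required field is present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in required_fields if name not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data
```

With the example configuration, `required_fields` would be `["field_a", "field_b"]`.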
processing_params.output_schema
A dictionary describing the structure of each output sample. Values are Jinja2 templates that reference input or output variable names. Nested dicts and lists are supported.
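The recursive handling of nested dicts and lists can be sketched as follows. To stay dependency-free, the sketch replaces Jinja2 with a simple regular-expression substitution of `{{ name }}` placeholders; this is a simplified stand-in, not the way MMIRAGE renders templates, and the `meta` key in the sample schema is invented.

```python
import re

# Matches placeholders such as "{{ my_var }}" (simplified stand-in for Jinja2).
_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(node, variables: dict):
    """Recursively substitute {{ name }} placeholders in strings, dicts, and lists."""
    if isinstance(node, str):
        return _VAR.sub(lambda m: str(variables[m.group(1)]), node)
    if isinstance(node, dict):
        return {key: render(value, variables) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, variables) for item in node]
    return node  # numbers, booleans, and None pass through unchanged


schema = {"result": "{{ my_output }}", "meta": {"inputs": ["{{ my_var }}"]}}
rendered = render(schema, {"my_var": "hello", "my_output": "HELLO"})
```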
processing_params.remove_columns
If `true`, all original columns are removed from the dataset before writing; only columns defined in `output_schema` are kept. Defaults to `false`.
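The effect of `remove_columns` on a single sample might look like the sketch below. It assumes that, when the flag is `false`, the rendered output columns are added alongside the original ones and win on a name clash; that merge rule and the helper name `finalize_sample` are assumptions.

```python
def finalize_sample(original: dict, rendered: dict, remove_columns: bool = False) -> dict:
    """Combine the original columns with the rendered output_schema columns."""
    if remove_columns:
        # Keep only the columns defined in output_schema.
        return dict(rendered)
    # Keep original columns; rendered columns override on a name clash (assumption).
    return {**original, **rendered}
```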
execution_params
Controls where and how the pipeline runs.
execution_params:
  mode: local  # "local" or "slurm"
  retry: false
  merge: false
  max_retries: 3
  poll_interval_seconds: 30
  settle_time_seconds: 60
  # SLURM-specific (required when mode: slurm)
  account: my_account
  job_name: mmirage-sharded
  reservation: ""
  nodes: 1
  ntasks_per_node: 1
  gpus: 4
  cpus_per_task: 288
  time_limit: "11:59:59"
  # Paths
  project_root: /path/to/project  # Supports ${ENV_VAR} expansion
  report_dir: ~/reports
  hf_home: ~/hf
  edf_env: ""
Core fields
| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | | | `local` or `slurm` |
| `retry` | | | Auto-retry failed shards until success or `max_retries` is reached |
| `merge` | | | Merge shard outputs after a successful run |
| `max_retries` | | | Maximum retries per shard |
| `poll_interval_seconds` | | | Seconds between SLURM job status polls |
| `settle_time_seconds` | | | Seconds to wait after a SLURM job finishes before checking shard state |
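How `retry` and `max_retries` interact can be sketched as a simple loop. The sketch assumes that `max_retries` counts re-runs after the first attempt and that a shard run reports success as a boolean; both are assumptions, as are the names `run_with_retries` and `run_shard`. The default values mirror the example configuration rather than documented defaults.

```python
from typing import Callable


def run_with_retries(run_shard: Callable[[int], bool], shard_id: int,
                     retry: bool = False, max_retries: int = 3) -> bool:
    """Run one shard; if `retry` is set, re-run it up to `max_retries` more times."""
    attempts_allowed = 1 + (max_retries if retry else 0)
    for _ in range(attempts_allowed):
        if run_shard(shard_id):
            return True  # stop at the first successful attempt
    return False  # every allowed attempt failed
```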
SLURM-specific fields
| Field | Type | Default | Description |
|---|---|---|---|
| `account` | | — | HPC account/partition (required for SLURM mode) |
| `job_name` | | | SLURM job name |
| `reservation` | | — | Optional SLURM reservation |
| `nodes` | | | Number of nodes |
| `ntasks_per_node` | | | Tasks per node |
| `gpus` | | | GPUs per node |
| `cpus_per_task` | | | CPUs per task |
| `time_limit` | | | Wall-clock time limit (e.g. `"11:59:59"`) |
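These fields correspond naturally to standard `sbatch` options. The sketch below builds `#SBATCH` directive lines from them; the key-to-option mapping (in particular `gpus` to `--gpus-per-node`) and the function name `sbatch_header` are assumptions about how such a translation could work, not a description of the script MMIRAGE generates.

```python
def sbatch_header(params: dict) -> list[str]:
    """Translate execution_params-style keys into #SBATCH directive lines."""
    # (config key, sbatch long option); the mapping itself is an assumption.
    mapping = [
        ("account", "account"),
        ("job_name", "job-name"),
        ("reservation", "reservation"),
        ("nodes", "nodes"),
        ("ntasks_per_node", "ntasks-per-node"),
        ("gpus", "gpus-per-node"),
        ("cpus_per_task", "cpus-per-task"),
        ("time_limit", "time"),
    ]
    lines = []
    for key, option in mapping:
        value = params.get(key)
        if value in (None, ""):  # skip unset optional fields such as reservation
            continue
        lines.append(f"#SBATCH --{option}={value}")
    return lines
```

Called with the values from the example configuration, an empty `reservation` is simply omitted from the header.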
Path fields
| Field | Type | Default | Description |
|---|---|---|---|
| `project_root` | | — | Base project directory. Supports `${ENV_VAR}` expansion |
| `report_dir` | | | Directory for SLURM stdout/stderr logs |
| `hf_home` | | | HuggingFace cache directory |
| `edf_env` | | — | Optional EDF environment file path |
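The `${ENV_VAR}` expansion mentioned for `project_root`, and the `~` prefix used in the example paths, can be reproduced with the standard library. Whether MMIRAGE expands `~` in the same way is an assumption, and `PROJECT_HOME` in the usage note is a made-up variable name.

```python
import os


def expand_path(raw: str) -> str:
    """Expand ${ENV_VAR} references first, then a leading ~ (home directory)."""
    return os.path.expanduser(os.path.expandvars(raw))
```

For example, with `PROJECT_HOME=/scratch/me` set in the environment, `expand_path("${PROJECT_HOME}/mmirage")` yields `/scratch/me/mmirage`.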
Merge output behaviour
| Trigger | Merged output location |
|---|---|
| | |
| | |
| | |
| | |
If shard_* folders are present directly inside --input-dir, MMIRAGE merges that dataset and ignores nested subdirectories (e.g. _pipeline_state).
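The rule about `shard_*` folders sitting directly inside the input directory can be sketched with a non-recursive glob. The function name `find_shard_dirs` is hypothetical; the sketch only illustrates the "direct children only" behaviour described above.

```python
from pathlib import Path


def find_shard_dirs(input_dir: str) -> list[Path]:
    """Return shard_* directories located directly inside input_dir, sorted by name."""
    root = Path(input_dir)
    # Path.glob without "**" only inspects direct children, so anything nested
    # deeper (e.g. under _pipeline_state) is never visited.
    return sorted(p for p in root.glob("shard_*") if p.is_dir())
```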