# Configuration Reference

MMIRAGE pipelines are configured through a single YAML file split into four top-level sections.

---

## `processors`

A list of processor definitions. Currently the only supported type is `llm`.

```yaml
processors:
  - type: llm
    server_args:
      model_path: Qwen/Qwen3-8B
      tp_size: 4
      trust_remote_code: true
      disable_custom_all_reduce: false
      chat_template: ""  # Set to e.g. "qwen2-vl" for VLMs
    default_sampling_params:
      temperature: 0.1
      top_p: 0.9
      max_new_tokens: 1024
```

### `processors[*].server_args`

| Field | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | `"none"` | HuggingFace model ID or local path |
| `tp_size` | `int` | auto from `SLURM_GPUS_ON_NODE` | Tensor parallelism size |
| `trust_remote_code` | `bool` | `true` | Allow custom model code from HuggingFace |
| `disable_custom_all_reduce` | `bool` | `false` | Disable the custom all-reduce kernel |

### `processors[*].default_sampling_params`

Any key-value pairs accepted by the SGLang sampling API, e.g.:

| Field | Description |
|---|---|
| `temperature` | Sampling temperature |
| `top_p` | Top-p nucleus sampling |
| `max_new_tokens` | Maximum tokens to generate |

### `processors[*].chat_template`

Optional. Set to a named template (e.g. `qwen2-vl`, `llava`, `internvl`, `phi3_v`) for vision-language models. Defaults to the tokenizer's built-in template.

---

## `loading_params`

Controls how datasets are loaded and distributed across shards.
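For intuition, the split that `num_shards` and `shard_id` describe can be sketched in Python. This shows a simple round-robin assignment consistent with the 0-based `shard_id` convention; the real loader's partitioning strategy is an internal detail and may differ (e.g. contiguous ranges):

```python
def shard_samples(samples, num_shards, shard_id):
    """Round-robin sketch: sample i goes to the shard where i % num_shards == shard_id.

    Illustrative only -- MMIRAGE's actual assignment is an implementation detail.
    """
    if not 0 <= shard_id < num_shards:
        raise ValueError(f"shard_id must be in [0, {num_shards})")
    return [s for i, s in enumerate(samples) if i % num_shards == shard_id]

# 10 samples over 4 shards; shard 0 receives indices 0, 4, 8.
print(shard_samples(list(range(10)), num_shards=4, shard_id=0))  # [0, 4, 8]
```

Whatever the scheme, every sample lands in exactly one shard, which is why `shard_id` must be unique per array task (hence `"$SLURM_ARRAY_TASK_ID"`).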
```yaml
loading_params:
  state_dir: ~/.cache/MMIRAGE/state_dir
  datasets:
    - path: /path/to/data.jsonl
      type: JSONL
      output_dir: /path/to/output/shards
      image_base_path: /path/to/images  # optional, for vision tasks
  num_shards: 4
  shard_id: "$SLURM_ARRAY_TASK_ID"
  batch_size: 64
```

| Field | Type | Default | Description |
|---|---|---|---|
| `state_dir` | `str` | `~/.cache/MMIRAGE/state_dir` | Shared directory for shard state, retry markers, and status files |
| `datasets` | `list` | `[]` | List of dataset configurations (see below) |
| `num_shards` | `int` or env var | `1` | Total number of shards to split datasets into |
| `shard_id` | `int` or env var | `0` | Index of this shard (0-based). In SLURM use `"$SLURM_ARRAY_TASK_ID"` |
| `batch_size` | `int` | `1` | Batch size for processing samples |

### `loading_params.datasets[*]`

| Field | Type | Required | Description |
|---|---|---|---|
| `path` | `str` | ✓ | Path to the dataset file or directory |
| `type` | `str` | ✓ | Loader type: `JSONL` or `loadable` (HuggingFace `load_from_disk`) |
| `output_dir` | `str` | ✓ | Directory where processed shards are written |
| `image_base_path` | `str` | — | Base directory for resolving relative image paths |

---

## `processing_params`

Defines variable extraction, LLM-driven generation, and the final output structure.

```yaml
processing_params:
  inputs:
    - name: my_var
      key: field.nested[0].value  # JMESPath expression
      type: text                  # "text" (default) or "image"
  outputs:
    - name: my_output
      type: llm
      output_type: plain  # "plain" or "JSON"
      prompt: |
        Do something with {{ my_var }}
      output_schema:  # Only for output_type: JSON
        - field_a
        - field_b
  remove_columns: false
  output_schema:
    result: "{{ my_output }}"
```

### `processing_params.inputs[*]`

| Field | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | — | Variable name used in Jinja2 templates |
| `key` | `str` | — | JMESPath expression to extract a value from a sample |
| `type` | `str` | `text` | `"text"` or `"image"`. Image variables are resolved to PIL Images / absolute paths |

### `processing_params.outputs[*]`

| Field | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | — | Variable name made available in `output_schema` templates |
| `type` | `str` | — | Processor type — must match a registered processor (`llm`) |
| `output_type` | `str` | `plain` | `"plain"` (raw text) or `"JSON"` (structured object) |
| `prompt` | `str` | — | Jinja2 template for the LLM prompt |
| `output_schema` | `list[str]` | `[]` | Required field names when `output_type: JSON` |

### `processing_params.output_schema`

A dictionary describing the structure of each output sample. Values are Jinja2 templates that reference input or output variable names. Nested dicts and lists are supported.

### `processing_params.remove_columns`

If `true`, all original columns are removed from the dataset before writing; only columns defined in `output_schema` are kept. Defaults to `false`.

---

## `execution_params`

Controls where and how the pipeline runs.
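The retry behaviour governed by the settings below (`retry`, `max_retries`, `poll_interval_seconds`) can be pictured as a supervision loop. `get_shard_states` and `resubmit_shard` here are hypothetical stand-ins for MMIRAGE's internals, not its real API:

```python
import time

def supervise(shard_ids, get_shard_states, resubmit_shard,
              max_retries=3, poll_interval_seconds=30):
    """Sketch of a shard supervision loop (hypothetical helper callables).

    get_shard_states() -> {shard_id: "running" | "done" | "failed"}
    resubmit_shard(shard_id) resubmits one failed shard.
    Returns True once all shards are done, False if any shard
    exhausts its retries.
    """
    retries = {sid: 0 for sid in shard_ids}
    while True:
        time.sleep(poll_interval_seconds)  # poll cadence from the config
        states = get_shard_states()
        if all(states.get(sid) == "done" for sid in shard_ids):
            return True
        for sid in shard_ids:
            if states.get(sid) == "failed":
                if retries[sid] >= max_retries:
                    return False  # give up on this shard
                retries[sid] += 1
                resubmit_shard(sid)
```

In SLURM mode the real pipeline additionally waits `settle_time_seconds` after the array job finishes before trusting the shard state files on the shared `state_dir`.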
```yaml
execution_params:
  mode: local  # "local" or "slurm"
  retry: false
  merge: false
  max_retries: 3
  poll_interval_seconds: 30
  settle_time_seconds: 60

  # SLURM-specific (required when mode: slurm)
  account: my_account
  job_name: mmirage-sharded
  reservation: ""
  nodes: 1
  ntasks_per_node: 1
  gpus: 4
  cpus_per_task: 288
  time_limit: "11:59:59"

  # Paths
  project_root: /path/to/project  # Supports ${ENV_VAR} expansion
  report_dir: ~/reports
  hf_home: ~/hf
  edf_env: ""
```

### Core fields

| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | `str` | `local` | `"local"` (run in-process) or `"slurm"` (submit an sbatch array job) |
| `retry` | `bool` | `false` | Auto-retry failed shards until success or `max_retries` is reached |
| `merge` | `bool` | `false` | Merge shard outputs after a successful run |
| `max_retries` | `int` | `3` | Maximum retries per shard |
| `poll_interval_seconds` | `int` | `30` | Seconds between SLURM job status polls |
| `settle_time_seconds` | `int` | `60` | Seconds to wait after a SLURM job finishes before checking shard state |

### SLURM-specific fields

| Field | Type | Default | Description |
|---|---|---|---|
| `account` | `str` | — | HPC account/partition (**required** for SLURM mode) |
| `job_name` | `str` | `mmirage-sharded` | SLURM job name |
| `reservation` | `str` | — | Optional SLURM reservation |
| `nodes` | `int` | `1` | Number of nodes |
| `ntasks_per_node` | `int` | `1` | Tasks per node |
| `gpus` | `int` | `4` | GPUs per node |
| `cpus_per_task` | `int` | `288` | CPUs per task |
| `time_limit` | `str` | `11:59:59` | Wall-clock time limit (`HH:MM:SS`) |

### Path fields

| Field | Type | Default | Description |
|---|---|---|---|
| `project_root` | `str` | — | Base project directory. Supports `${VAR}` expansion |
| `report_dir` | `str` | `~/reports` | Directory for SLURM stdout/stderr logs |
| `hf_home` | `str` | `~/hf` | HuggingFace cache directory |
| `edf_env` | `str` | — | Optional EDF environment file path |

---

## Merge output behaviour

| Trigger | Merged output location |
|---|---|
| `run` with `merge: true` | `/merged/` per dataset |
| `merge` without `--output-root` | `/merged/` per dataset |
| `merge --output-root /path` | `/path//` per dataset |
| `merge-dir --input-dir /path --output-dir /out` | `/out/` (single dataset) |

If `shard_*` folders are present **directly** inside `--input-dir`, MMIRAGE merges that dataset and ignores nested subdirectories (e.g. `_pipeline_state`).
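The `shard_*` detection rule can be illustrated with a small directory check. This is a sketch of the idea, not MMIRAGE's actual code: only folders sitting directly inside `--input-dir` count, so nested directories such as `_pipeline_state` are ignored.

```python
from pathlib import Path

def find_mergeable_shards(input_dir):
    """Return shard_* folders directly inside input_dir, sorted by name.

    Illustrative sketch of the detection rule described above; a
    non-recursive glob means nested subdirectories are never matched.
    """
    root = Path(input_dir)
    return sorted(p for p in root.glob("shard_*") if p.is_dir())
```

For example, an input directory containing `shard_0/`, `shard_1/`, and `_pipeline_state/` would yield exactly the two shard folders.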