`mmirage.config` — Configuration¶

`mmirage.config.config`¶

Main configuration dataclasses for the MMIRAGE pipeline.

Configuration dataclasses for MMIRAGE pipeline.

class mmirage.config.config.ExecutionParams(mode='local', retry=False, merge=False, max_retries=3, poll_interval_seconds=30, settle_time_seconds=60, project_root=None, report_dir='~/reports', hf_home='~/hf', edf_env=None, account=None, job_name='mmirage-sharded', reservation=None, nodes=1, ntasks_per_node=1, gpus=4, cpus_per_task=288, time_limit='11:59:59')[source]¶

Bases: object

Parameters for executing the MMIRAGE pipeline.

Defines how the pipeline is executed, including local or SLURM-based distributed execution, retry logic, and resource allocation.

Parameters:

mode (str)
retry (bool)
merge (bool)
max_retries (int)
poll_interval_seconds (int)
settle_time_seconds (int)
project_root (str | None)
report_dir (str)
hf_home (str)
edf_env (str | None)
account (str | None)
job_name (str)
reservation (str | None)
nodes (int)
ntasks_per_node (int)
gpus (int)
cpus_per_task (int)
time_limit (str)

mode¶

Execution mode: “local” or “slurm”. Defaults to “local”.

Type:: str

retry¶

Whether automatic retry orchestration is enabled. Defaults to False.

Type:: bool

merge¶

Whether to merge shard outputs after a successful run. Defaults to False.

Type:: bool

max_retries¶

Maximum number of retries for failed shards. Defaults to 3.

Type:: int

poll_interval_seconds¶

Seconds to wait between polling job status. Defaults to 30.

Type:: int

settle_time_seconds¶

Seconds to wait after job completes before checking results. Defaults to 60.

Type:: int

# SLURM-specific parameters

account¶

HPC account/partition to charge. Required for SLURM mode.

Type:: str | None

job_name¶

SLURM job name. Defaults to “mmirage-sharded”.

Type:: str

reservation¶

Optional SLURM reservation name.

Type:: str | None

nodes¶

Number of nodes. Defaults to 1.

Type:: int

ntasks_per_node¶

Number of tasks per node. Defaults to 1.

Type:: int

gpus¶

Number of GPUs per node. Defaults to 4.

Type:: int

cpus_per_task¶

Number of CPUs per task. Defaults to 288.

Type:: int

time_limit¶

Job time limit (HH:MM:SS). Defaults to “11:59:59”.

Type:: str

# Paths

project_root¶

Base project directory. Can use environment variables with ${VAR}.

Type:: str | None

report_dir¶

Directory for SLURM output/error files. Defaults to ~/reports.

Type:: str

hf_home¶

HuggingFace cache directory. Defaults to ~/hf.

Type:: str

edf_env¶

Optional EDF environment file path.

Type:: str | None

mode: str = 'local'¶

retry: bool = False¶

merge: bool = False¶

max_retries: int = 3¶

poll_interval_seconds: int = 30¶

settle_time_seconds: int = 60¶

project_root: str | None = None¶

report_dir: str = '~/reports'¶

hf_home: str = '~/hf'¶

edf_env: str | None = None¶

account: str | None = None¶

job_name: str = 'mmirage-sharded'¶

reservation: str | None = None¶

nodes: int = 1¶

ntasks_per_node: int = 1¶

gpus: int = 4¶

cpus_per_task: int = 288¶

time_limit: str = '11:59:59'¶

is_slurm()[source]¶

Check if execution mode is SLURM.

Return type:: bool

class mmirage.config.config.ProcessingParams(inputs, outputs, output_schema, remove_columns=False)[source]¶

Bases: object

Parameters for processing dataset samples.

Defines how input variables are extracted, outputs are generated, and the final output schema is constructed.

Parameters:

inputs (List[InputVar])
outputs (List[OutputVar])
output_schema (Dict[str, Any])
remove_columns (bool)

inputs¶

List of input variables to extract from source datasets.

Type:: List[mmirage.core.process.variables.InputVar]

outputs¶

List of output variables to generate using processors.

Type:: List[mmirage.core.process.variables.OutputVar]

output_schema¶

Dictionary defining the structure of output samples.

Type:: Dict[str, Any]

remove_columns¶

If True, removes all columns from original dataset.

Type:: bool

inputs: List[InputVar]¶

outputs: List[OutputVar]¶

output_schema: Dict[str, Any]¶

remove_columns: bool = False¶

class mmirage.config.config.MMirageConfig(processors, loading_params, processing_params, execution_params=<factory>)[source]¶

Bases: object

Main configuration class for MMIRAGE pipeline.

Contains all configuration needed to run a MMIRAGE processing pipeline, including processor configurations, dataset loading parameters, processing parameters, and execution parameters.

Parameters:

processors (List[BaseProcessorConfig])
loading_params (LoadingParams)
processing_params (ProcessingParams)
execution_params (ExecutionParams)

processors¶

List of processor configurations for data transformation.

Type:: List[mmirage.core.process.base.BaseProcessorConfig]

loading_params¶

Parameters for loading input datasets.

Type:: mmirage.config.loading.LoadingParams

processing_params¶

Parameters for processing dataset samples.

Type:: mmirage.config.config.ProcessingParams

execution_params¶

Parameters for executing the pipeline (local/SLURM).

Type:: mmirage.config.config.ExecutionParams

processors: List[BaseProcessorConfig]¶

loading_params: LoadingParams¶

processing_params: ProcessingParams¶

execution_params: ExecutionParams¶

`mmirage.config.loading`¶

Dataset loading and sharding configuration.

Data loading configuration for MMIRAGE pipeline.

class mmirage.config.loading.LoadingParams(datasets=<factory>, state_dir='~/.cache/MMIRAGE/state_dir', output_dir='', num_shards=1, shard_id=0, batch_size=1)[source]¶

Bases: object

Parameters for loading and distributing datasets across shards.

Defines how datasets are loaded and processed in a distributed manner, supporting sharding for parallel processing.

datasets¶

List of dataset configurations to load.

Type:: List[mmirage.core.loader.base.BaseDataLoaderConfig]

state_dir¶

Shared directory for logical shard state/markers/retry tracking.

Type:: str

output_dir¶

Legacy top-level output directory. Prefer per-dataset output_dir.

Type:: str

num_shards¶

Total number of shards to split the dataset into.

Type:: int | str

shard_id¶

ID of this shard (0-indexed).

Type:: int | str

batch_size¶

Batch size for processing samples.

Type:: int | str

Raises:

ValueError – If num_shards, shard_id, or batch_size cannot be converted to int.

Parameters:

datasets (List[BaseDataLoaderConfig])
state_dir (str)
output_dir (str)
num_shards (int | str)
shard_id (int | str)
batch_size (int | str)

datasets: List[BaseDataLoaderConfig]¶

state_dir: str = '~/.cache/MMIRAGE/state_dir'¶

output_dir: str = ''¶

num_shards: int | str = 1¶

shard_id: int | str = 0¶

batch_size: int | str = 1¶

get_state_root()[source]¶

Get the state root path.

Returns:: State root path.
Return type:: str

get_num_shards()[source]¶

Get the total number of shards.

Returns:: Total number of shards.
Return type:: int

get_shard_id()[source]¶

Get the ID of this shard.

Returns:: Shard ID (0-indexed).
Return type:: int

get_batch_size()[source]¶

Get the batch size for processing.

Returns:: Batch size (minimum 1).
Return type:: int

`mmirage.config.utils`¶

YAML parsing helpers and configuration loaders.

Configuration loading utilities for MMIRAGE pipeline.

mmirage.config.utils.load_mmirage_config(config_path)[source]¶

Load MMIRAGE configuration from a YAML file.

Supports environment variable expansion and dynamic processor/loader configuration based on registered types.

Example config:

processors:

type: llm server_args:

model_path: Qwen/Qwen2-VL-7B-Instruct tp_size: 4 trust_remote_code: true

chat_template: qwen2-vl default_sampling_params:

temperature: 0.1 top_p: 0.9 max_new_tokens: 1024

loading_params:

datasets:

path: /path/to/dataset.jsonl type: JSONL output_dir: /path/to/output image_base_path: /path/to/images

num_shards: 4 shard_id: 0 batch_size: 64

processing_params:

inputs:

name: text key: text
name: image key: image_path type: image

outputs:

name: formatted_answer type: llm output_type: JSON output_schema:
- question
- answer
prompt: |
Generate a Q&A pair from: {{ text }}

remove_columns: True output_schema:

conversations:

role: “user” content: “{{ formatted_answer.question }}”

role: “assistant” content: “{{ formatted_answer.answer }}”

Parameters:: config_path (str) – Path to the YAML configuration file.
Returns:: Parsed and validated configuration object.
Return type:: MMirageConfig

mmirage.config — Configuration¶