mmirage.config — Configuration

mmirage.config.config

Main configuration dataclasses for the MMIRAGE pipeline.

Configuration dataclasses for MMIRAGE pipeline.

class mmirage.config.config.ExecutionParams(mode='local', retry=False, merge=False, max_retries=3, poll_interval_seconds=30, settle_time_seconds=60, project_root=None, report_dir='~/reports', hf_home='~/hf', edf_env=None, account=None, job_name='mmirage-sharded', reservation=None, nodes=1, ntasks_per_node=1, gpus=4, cpus_per_task=288, time_limit='11:59:59')[source]

Bases: object

Parameters for executing the MMIRAGE pipeline.

Defines how the pipeline is executed, including local or SLURM-based distributed execution, retry logic, and resource allocation.

Parameters:
  • mode (str)

  • retry (bool)

  • merge (bool)

  • max_retries (int)

  • poll_interval_seconds (int)

  • settle_time_seconds (int)

  • project_root (str | None)

  • report_dir (str)

  • hf_home (str)

  • edf_env (str | None)

  • account (str | None)

  • job_name (str)

  • reservation (str | None)

  • nodes (int)

  • ntasks_per_node (int)

  • gpus (int)

  • cpus_per_task (int)

  • time_limit (str)

mode

Execution mode: “local” or “slurm”. Defaults to “local”.

Type:

str

retry

Whether automatic retry orchestration is enabled. Defaults to False.

Type:

bool

merge

Whether to merge shard outputs after a successful run. Defaults to False.

Type:

bool

max_retries

Maximum number of retries for failed shards. Defaults to 3.

Type:

int

poll_interval_seconds

Seconds to wait between polling job status. Defaults to 30.

Type:

int

settle_time_seconds

Seconds to wait after job completes before checking results. Defaults to 60.

Type:

int

# SLURM-specific parameters
account

HPC account/partition to charge. Required for SLURM mode.

Type:

str | None

job_name

SLURM job name. Defaults to “mmirage-sharded”.

Type:

str

reservation

Optional SLURM reservation name.

Type:

str | None

nodes

Number of nodes. Defaults to 1.

Type:

int

ntasks_per_node

Number of tasks per node. Defaults to 1.

Type:

int

gpus

Number of GPUs per node. Defaults to 4.

Type:

int

cpus_per_task

Number of CPUs per task. Defaults to 288.

Type:

int

time_limit

Job time limit (HH:MM:SS). Defaults to “11:59:59”.

Type:

str

# Paths
project_root

Base project directory. Can use environment variables with ${VAR}.

Type:

str | None

report_dir

Directory for SLURM output/error files. Defaults to ~/reports.

Type:

str

hf_home

HuggingFace cache directory. Defaults to ~/hf.

Type:

str

edf_env

Optional EDF environment file path.

Type:

str | None

mode: str = 'local'
retry: bool = False
merge: bool = False
max_retries: int = 3
poll_interval_seconds: int = 30
settle_time_seconds: int = 60
project_root: str | None = None
report_dir: str = '~/reports'
hf_home: str = '~/hf'
edf_env: str | None = None
account: str | None = None
job_name: str = 'mmirage-sharded'
reservation: str | None = None
nodes: int = 1
ntasks_per_node: int = 1
gpus: int = 4
cpus_per_task: int = 288
time_limit: str = '11:59:59'
is_slurm()[source]

Check if execution mode is SLURM.

Return type:

bool

class mmirage.config.config.ProcessingParams(inputs, outputs, output_schema, remove_columns=False)[source]

Bases: object

Parameters for processing dataset samples.

Defines how input variables are extracted, outputs are generated, and the final output schema is constructed.

Parameters:
inputs

List of input variables to extract from source datasets.

Type:

List[mmirage.core.process.variables.InputVar]

outputs

List of output variables to generate using processors.

Type:

List[mmirage.core.process.variables.OutputVar]

output_schema

Dictionary defining the structure of output samples.

Type:

Dict[str, Any]

remove_columns

If True, removes all columns from original dataset.

Type:

bool

inputs: List[InputVar]
outputs: List[OutputVar]
output_schema: Dict[str, Any]
remove_columns: bool = False
class mmirage.config.config.MMirageConfig(processors, loading_params, processing_params, execution_params=<factory>)[source]

Bases: object

Main configuration class for MMIRAGE pipeline.

Contains all configuration needed to run a MMIRAGE processing pipeline, including processor configurations, dataset loading parameters, processing parameters, and execution parameters.

Parameters:
processors

List of processor configurations for data transformation.

Type:

List[mmirage.core.process.base.BaseProcessorConfig]

loading_params

Parameters for loading input datasets.

Type:

mmirage.config.loading.LoadingParams

processing_params

Parameters for processing dataset samples.

Type:

mmirage.config.config.ProcessingParams

execution_params

Parameters for executing the pipeline (local/SLURM).

Type:

mmirage.config.config.ExecutionParams

processors: List[BaseProcessorConfig]
loading_params: LoadingParams
processing_params: ProcessingParams
execution_params: ExecutionParams

mmirage.config.loading

Dataset loading and sharding configuration.

Data loading configuration for MMIRAGE pipeline.

class mmirage.config.loading.LoadingParams(datasets=<factory>, state_dir='~/.cache/MMIRAGE/state_dir', output_dir='', num_shards=1, shard_id=0, batch_size=1)[source]

Bases: object

Parameters for loading and distributing datasets across shards.

Defines how datasets are loaded and processed in a distributed manner, supporting sharding for parallel processing.

datasets

List of dataset configurations to load.

Type:

List[mmirage.core.loader.base.BaseDataLoaderConfig]

state_dir

Shared directory for logical shard state/markers/retry tracking.

Type:

str

output_dir

Legacy top-level output directory. Prefer per-dataset output_dir.

Type:

str

num_shards

Total number of shards to split the dataset into.

Type:

int | str

shard_id

ID of this shard (0-indexed).

Type:

int | str

batch_size

Batch size for processing samples.

Type:

int | str

Raises:

ValueError – If num_shards, shard_id, or batch_size cannot be converted to int.

Parameters:
datasets: List[BaseDataLoaderConfig]
state_dir: str = '~/.cache/MMIRAGE/state_dir'
output_dir: str = ''
num_shards: int | str = 1
shard_id: int | str = 0
batch_size: int | str = 1
get_state_root()[source]

Get the state root path.

Returns:

State root path.

Return type:

str

get_num_shards()[source]

Get the total number of shards.

Returns:

Total number of shards.

Return type:

int

get_shard_id()[source]

Get the ID of this shard.

Returns:

Shard ID (0-indexed).

Return type:

int

get_batch_size()[source]

Get the batch size for processing.

Returns:

Batch size (minimum 1).

Return type:

int

mmirage.config.utils

YAML parsing helpers and configuration loaders.

Configuration loading utilities for MMIRAGE pipeline.

mmirage.config.utils.load_mmirage_config(config_path)[source]

Load MMIRAGE configuration from a YAML file.

Supports environment variable expansion and dynamic processor/loader configuration based on registered types.

Example config:

processors:
  • type: llm server_args:

    model_path: Qwen/Qwen2-VL-7B-Instruct tp_size: 4 trust_remote_code: true

    chat_template: qwen2-vl default_sampling_params:

    temperature: 0.1 top_p: 0.9 max_new_tokens: 1024

loading_params:
datasets:
  • path: /path/to/dataset.jsonl type: JSONL output_dir: /path/to/output image_base_path: /path/to/images

num_shards: 4 shard_id: 0 batch_size: 64

processing_params:
inputs:
  • name: text key: text

  • name: image key: image_path type: image

outputs:
  • name: formatted_answer type: llm output_type: JSON output_schema:

    • question

    • answer

    prompt: |

    Generate a Q&A pair from: {{ text }}

remove_columns: True output_schema:

conversations:
  • role: “user” content: “{{ formatted_answer.question }}”

  • role: “assistant” content: “{{ formatted_answer.answer }}”

Parameters:

config_path (str) – Path to the YAML configuration file.

Returns:

Parsed and validated configuration object.

Return type:

MMirageConfig