mmirage.config — Configuration
mmirage.config.config
Main configuration dataclasses for the MMIRAGE pipeline.
- class mmirage.config.config.ExecutionParams(mode='local', retry=False, merge=False, max_retries=3, poll_interval_seconds=30, settle_time_seconds=60, project_root=None, report_dir='~/reports', hf_home='~/hf', edf_env=None, account=None, job_name='mmirage-sharded', reservation=None, nodes=1, ntasks_per_node=1, gpus=4, cpus_per_task=288, time_limit='11:59:59')
Bases: object
Parameters for executing the MMIRAGE pipeline.
Defines how the pipeline is executed, including local or SLURM-based distributed execution, retry logic, and resource allocation.
- Parameters:
mode (str)
retry (bool)
merge (bool)
max_retries (int)
poll_interval_seconds (int)
settle_time_seconds (int)
project_root (str | None)
report_dir (str)
hf_home (str)
edf_env (str | None)
account (str | None)
job_name (str)
reservation (str | None)
nodes (int)
ntasks_per_node (int)
gpus (int)
cpus_per_task (int)
time_limit (str)
- settle_time_seconds
Seconds to wait after a job completes before checking results. Defaults to 60.
- Type: int
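To show how these defaults can be overridden, here is a minimal sketch that mirrors the documented signature in a stand-in dataclass. It is not the MMIRAGE source: the class name `ExecutionParamsSketch`, the mode value "slurm", and the account name are assumptions made for illustration (only 'local' appears in the signature above).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ExecutionParamsSketch:
    """Stand-in that copies the documented defaults; not the MMIRAGE class."""

    mode: str = "local"
    retry: bool = False
    merge: bool = False
    max_retries: int = 3
    poll_interval_seconds: int = 30
    settle_time_seconds: int = 60
    project_root: Optional[str] = None
    report_dir: str = "~/reports"
    hf_home: str = "~/hf"
    edf_env: Optional[str] = None
    account: Optional[str] = None
    job_name: str = "mmirage-sharded"
    reservation: Optional[str] = None
    nodes: int = 1
    ntasks_per_node: int = 1
    gpus: int = 4
    cpus_per_task: int = 288
    time_limit: str = "11:59:59"


# All defaults: a local run with no retries.
local = ExecutionParamsSketch()

# Derive a cluster variant without mutating the original.
# "slurm" and "my-account" are assumed values, not taken from the docs.
cluster = replace(local, mode="slurm", retry=True, nodes=2, account="my-account")
```

Using `dataclasses.replace` keeps the local configuration untouched, which is convenient when the same base settings feed several runs.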
- class mmirage.config.config.ProcessingParams(inputs, outputs, output_schema, remove_columns=False)
Bases: object
Parameters for processing dataset samples.
Defines how input variables are extracted, outputs are generated, and the final output schema is constructed.
- Parameters:
- inputs
List of input variables to extract from source datasets.
- Type:
- outputs
List of output variables to generate using processors.
- Type:
- class mmirage.config.config.MMirageConfig(processors, loading_params, processing_params, execution_params=<factory>)
Bases: object
Main configuration class for the MMIRAGE pipeline.
Contains all configuration needed to run an MMIRAGE processing pipeline, including processor configurations, dataset loading parameters, processing parameters, and execution parameters.
- Parameters:
processors (List[BaseProcessorConfig])
loading_params (LoadingParams)
processing_params (ProcessingParams)
execution_params (ExecutionParams)
- processors
List of processor configurations for data transformation.
- Type: List[BaseProcessorConfig]
- loading_params
Parameters for loading input datasets.
- processing_params
Parameters for processing dataset samples.
- execution_params
Parameters for executing the pipeline (local/SLURM).
- processors: List[BaseProcessorConfig]
- loading_params: LoadingParams
- processing_params: ProcessingParams
- execution_params: ExecutionParams
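The `execution_params=<factory>` default in the signature above indicates a dataclass default factory. The sketch below uses hypothetical stand-in classes (`ExecutionSketch`, `ConfigSketch`), not the MMIRAGE classes, to illustrate what that implies: each config instance gets its own execution parameters object rather than sharing one. The "slurm" mode value is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ExecutionSketch:
    """Hypothetical stand-in for ExecutionParams (one field only)."""

    mode: str = "local"


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for MMirageConfig with the same four fields."""

    processors: List[Any]
    loading_params: Any
    processing_params: Any
    # default_factory builds a fresh ExecutionSketch for every instance.
    execution_params: ExecutionSketch = field(default_factory=ExecutionSketch)


a = ConfigSketch(processors=[], loading_params=None, processing_params=None)
b = ConfigSketch(processors=[], loading_params=None, processing_params=None)

# Changing one instance's execution parameters leaves the other untouched.
a.execution_params.mode = "slurm"  # assumed mode value
```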
mmirage.config.loading
Dataset loading and sharding configuration.
- class mmirage.config.loading.LoadingParams(datasets=<factory>, state_dir='~/.cache/MMIRAGE/state_dir', output_dir='', num_shards=1, shard_id=0, batch_size=1)
Bases: object
Parameters for loading and distributing datasets across shards.
Defines how datasets are loaded and processed in a distributed manner, supporting sharding for parallel processing.
- datasets
List of dataset configurations to load.
- Type: List[BaseDataLoaderConfig]
- Raises:
ValueError – If num_shards, shard_id, or batch_size cannot be converted to int.
- Parameters:
- datasets: List[BaseDataLoaderConfig]
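The ValueError described above suggests that num_shards, shard_id, and batch_size are coerced to int when the object is created, which would be useful if those values arrive as strings (for example from expanded environment variables). A minimal sketch of such coercion follows; `LoadingParamsSketch` and its `__post_init__` are assumptions for illustration, not the MMIRAGE implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class LoadingParamsSketch:
    """Stand-in with the documented fields and defaults; not the MMIRAGE class."""

    datasets: List[Any] = field(default_factory=list)
    state_dir: str = "~/.cache/MMIRAGE/state_dir"
    output_dir: str = ""
    num_shards: int = 1
    shard_id: int = 0
    batch_size: int = 1

    def __post_init__(self) -> None:
        # Coerce the numeric fields so that string inputs such as "4" are accepted.
        for name in ("num_shards", "shard_id", "batch_size"):
            value = getattr(self, name)
            try:
                setattr(self, name, int(value))
            except (TypeError, ValueError) as exc:
                raise ValueError(
                    f"{name} must be convertible to int, got {value!r}"
                ) from exc


# String values are converted to integers on construction.
params = LoadingParamsSketch(num_shards="4", shard_id="0", batch_size="64")

# A value that cannot be converted is rejected with ValueError.
try:
    LoadingParamsSketch(num_shards="four")
    rejected = False
except ValueError:
    rejected = True
```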
mmirage.config.utils
YAML parsing helpers and configuration loaders.
- mmirage.config.utils.load_mmirage_config(config_path)
Load MMIRAGE configuration from a YAML file.
Supports environment variable expansion and dynamic processor/loader configuration based on registered types.
Example config:

    processors:
      - type: llm
        server_args:
          model_path: Qwen/Qwen2-VL-7B-Instruct
          tp_size: 4
          trust_remote_code: true
        chat_template: qwen2-vl
        default_sampling_params:
          temperature: 0.1
          top_p: 0.9
          max_new_tokens: 1024

    loading_params:
      datasets:
        - path: /path/to/dataset.jsonl
          type: JSONL
          output_dir: /path/to/output
          image_base_path: /path/to/images
      num_shards: 4
      shard_id: 0
      batch_size: 64

    processing_params:
      inputs:
        - name: text
          key: text
        - name: image
          key: image_path
          type: image
      outputs:
        - name: formatted_answer
          type: llm
          output_type: JSON
          output_schema:
            - question
            - answer
          prompt: |
            Generate a Q&A pair from: {{ text }}
      remove_columns: True
      output_schema:
        conversations:
          - role: "user"
            content: "{{ formatted_answer.question }}"
          - role: "assistant"
            content: "{{ formatted_answer.answer }}"

- Parameters:
config_path (str) – Path to the YAML configuration file.
- Returns:
Parsed and validated configuration object.
- Return type: MMirageConfig
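The environment variable expansion mentioned above can be pictured as a recursive pass over the parsed configuration. The sketch below is one possible approach using only the standard library; the helper `expand_env` and the variable `MMIRAGE_DATA_DIR` are hypothetical, and the actual loader may expand variables differently (for instance on the raw YAML text before parsing).

```python
import os
from typing import Any


def expand_env(value: Any) -> Any:
    """Recursively expand $VAR and ${VAR} references in all string values."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    # Numbers, booleans, and None pass through unchanged.
    return value


# Hypothetical variable, set here only so the example is self-contained.
os.environ["MMIRAGE_DATA_DIR"] = "/data/demo"

# A fragment shaped like the parsed loading_params section of the example config.
raw = {
    "loading_params": {
        "datasets": [
            {"path": "$MMIRAGE_DATA_DIR/dataset.jsonl", "type": "JSONL"},
        ],
        "num_shards": 4,
    }
}

expanded = expand_env(raw)
```

Note that `os.path.expandvars` leaves references to undefined variables unchanged, so a missing variable shows up as a literal `$NAME` in the resulting path instead of raising an error.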