Shard Utilities
mmirage.shard_process
Main entry point for processing a single dataset shard with MMIRAGE.
Supports both text-only and multimodal (vision-language) processing.
- mmirage.shard_process.rewrite_batch(batch, mapper, renderer, image_base_path=None)
Rewrite a batch of samples by applying transformations.
- Parameters:
batch – Dictionary mapping column names to lists of values.
mapper (MMIRAGEMapper) – MMIRAGEMapper for processing transformations.
renderer (TemplateRenderer) – TemplateRenderer for generating output.
image_base_path (str) – Optional base directory for resolving relative image paths.
- Returns:
Dictionary mapping output keys to lists of rendered values.
- Raises:
ValueError – If variables are not computable given the configuration.
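The column-oriented batch shape used by rewrite_batch can be illustrated with a minimal sketch. This is not MMIRAGE's implementation: `sketch_rewrite_batch`, `compute_vars`, and `templates` are hypothetical stand-ins for the mapper and renderer, and only the input shape, the output shape, and the ValueError on uncomputable variables follow the description above.

```python
from typing import Any, Callable, Dict, List


def sketch_rewrite_batch(
    batch: Dict[str, List[Any]],
    compute_vars: Callable[[Dict[str, Any]], Dict[str, Any]],
    templates: Dict[str, str],
) -> Dict[str, List[str]]:
    """Hypothetical stand-in: column batch in, rendered column batch out."""
    columns = list(batch)
    n_rows = len(batch[columns[0]]) if columns else 0
    out: Dict[str, List[str]] = {key: [] for key in templates}
    for i in range(n_rows):
        # Rebuild one sample (row) from the column-oriented batch.
        row = {col: batch[col][i] for col in columns}
        variables = compute_vars(row)  # plays the role of the mapper
        for key, template in templates.items():
            try:
                # str.format plays the role of the renderer here.
                out[key].append(template.format(**variables))
            except KeyError as exc:
                raise ValueError(f"variable {exc} is not computable") from exc
    return out


batch = {"question": ["2+2?", "3+3?"], "answer": ["4", "6"]}
result = sketch_rewrite_batch(
    batch,
    compute_vars=lambda row: {"q": row["question"], "a": row["answer"]},
    templates={"text": "Q: {q} A: {a}"},
)
print(result)
```

Each output key holds one rendered value per input row, so the output lists have the same length as the input columns.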
mmirage.shard_utils
Low-level helpers for shard state management, atomic saves, and status markers.
This module contains helper functions for dataset sharding, state management, and file operations used in the MMIRAGE shard and merge processing pipeline.
- class mmirage.shard_utils.ShardStatus(status='unknown', retry_count=0, shard_id=None, started_at=None, finished_at=None, error=None, hostname=None, pid=None, slurm_job_id=None, slurm_array_task_id=None, datasets=None)
Bases: object
Typed representation of the shard status.json payload.
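As a rough illustration of a typed status.json payload combined with an atomic save, the sketch below mirrors the field names and defaults from the ShardStatus signature. The field types, the `SketchShardStatus` class, and the helpers `save_status_atomically` and `load_status` are assumptions made for this example; the temp-file-plus-rename strategy is one common way to write atomically and is not confirmed to be what mmirage.shard_utils does.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class SketchShardStatus:
    """Hypothetical mirror of the ShardStatus fields; types are assumed."""
    status: str = "unknown"
    retry_count: int = 0
    shard_id: Optional[int] = None
    started_at: Optional[str] = None
    finished_at: Optional[str] = None
    error: Optional[str] = None
    hostname: Optional[str] = None
    pid: Optional[int] = None
    slurm_job_id: Optional[str] = None
    slurm_array_task_id: Optional[str] = None
    datasets: Optional[Any] = None


def save_status_atomically(status: SketchShardStatus, state_dir: str) -> str:
    """Write status.json through a temp file and os.replace (assumed strategy)."""
    os.makedirs(state_dir, exist_ok=True)
    target = os.path.join(state_dir, "status.json")
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(asdict(status), fh)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target


def load_status(state_dir: str) -> SketchShardStatus:
    """Read status.json back into the typed representation."""
    with open(os.path.join(state_dir, "status.json"), encoding="utf-8") as fh:
        return SketchShardStatus(**json.load(fh))


with tempfile.TemporaryDirectory() as tmp:
    saved = SketchShardStatus(status="failed", retry_count=1, shard_id=7, error="OOM")
    save_status_atomically(saved, tmp)
    loaded = load_status(tmp)
    leftovers = [name for name in os.listdir(tmp) if name.endswith(".tmp")]
```

The temp file is created in the same directory as the target so that the final rename stays on one filesystem.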
mmirage.merge_shards
Merging logic for processed dataset shards.
- mmirage.merge_shards.merge_dataset_dir(dataset_dir, output_dir)
Merge one dataset directory containing shard_* folders.
- Returns:
MergeReport with summary details.
- mmirage.merge_shards.merge_input_dir(input_dir, output_dir)
Merge all shard datasets found under an input directory.
The input can be either a single dataset directory that contains shard_* folders directly, or a parent directory containing multiple dataset subdirectories, each with its own shard_* folders.
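The two accepted input layouts can be told apart with a simple directory scan. The sketch below is an illustration only: `has_shards` and `find_dataset_dirs` are hypothetical helpers, and the rule that a directory with its own shard_* folders is treated as a single dataset is an assumption drawn from the description above.

```python
import tempfile
from pathlib import Path
from typing import List


def has_shards(directory: Path) -> bool:
    """True if the directory directly contains at least one shard_* folder."""
    return any(p.is_dir() for p in directory.glob("shard_*"))


def find_dataset_dirs(input_dir: str) -> List[Path]:
    """Return the dataset directories implied by the input layout."""
    root = Path(input_dir)
    if has_shards(root):
        # Layout 1: the input itself is one dataset directory.
        return [root]
    # Layout 2: the input is a parent of several dataset directories.
    return sorted(p for p in root.iterdir() if p.is_dir() and has_shards(p))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ds_a" / "shard_0").mkdir(parents=True)
    (root / "ds_b" / "shard_0").mkdir(parents=True)
    (root / "notes").mkdir()  # no shard_* folders, so it is skipped
    parent_layout = [p.name for p in find_dataset_dirs(tmp)]
    single_layout = [p.name for p in find_dataset_dirs(str(root / "ds_a"))]
```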
- mmirage.merge_shards.merge_from_config(cfg, output_root=None)
Merge shard outputs described in config.loading_params.datasets.
- Parameters:
cfg (MMirageConfig) – Loaded MMIRAGE config.
output_root (str | None) – Optional destination root. If omitted, each dataset writes into <dataset.output_dir>/merged.
- Returns:
Merge reports for each dataset entry.
mmirage.cli_utils.status
Shard status checking and retry helpers for the MMIRAGE CLI.
- class mmirage.cli_utils.status.ShardSummary(total, successful, running, failed, max_retries_exceeded)
Bases: object
Compact status summary for shard execution.
- mmirage.cli_utils.status.max_allowed_attempts(max_retries)
Return max allowed total attempts for a shard.
Total attempts = initial attempt + max_retries.
- mmirage.cli_utils.status.is_retry_budget_exceeded(attempt_count, max_retries)
Return whether a shard has exceeded the retry budget.
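The retry arithmetic can be made concrete with a small sketch. The first function follows the documented rule (initial attempt plus max_retries). The comparison in the second is an assumption: the page does not say whether the budget counts as exceeded when the attempt count reaches the limit or only when it passes it, and this sketch picks the former. Both `sketch_` functions are illustrative and are not the library's code.

```python
def sketch_max_allowed_attempts(max_retries: int) -> int:
    # Documented rule: total attempts = initial attempt + max_retries.
    return 1 + max_retries


def sketch_is_retry_budget_exceeded(attempt_count: int, max_retries: int) -> bool:
    # Assumption: once every allowed attempt has been used, no retry remains.
    return attempt_count >= sketch_max_allowed_attempts(max_retries)


allowed = sketch_max_allowed_attempts(2)
still_retryable = not sketch_is_retry_budget_exceeded(2, max_retries=2)
exhausted = sketch_is_retry_budget_exceeded(3, max_retries=2)
```

With max_retries=2, a shard may run three times in total under this reading: once initially and twice as a retry.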
- mmirage.cli_utils.status.shard_state_dir(state_root, shard_id)
Return the state directory for a shard.
- mmirage.cli_utils.status.get_shard_status(state_dir)
Read the current status and attempt counter for a shard.
- mmirage.cli_utils.status.check_failed_shards(cfg)
Return retryable failed shards and a compact summary.
- Parameters:
cfg (MMirageConfig)
- Return type:
Tuple[List[int], ShardSummary]
- mmirage.cli_utils.status.confirm_retry(count, confirm_mode)
Return whether retry submission is confirmed.
Modes:
prompt – ask the user interactively
yes – submit without prompting
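A minimal sketch of the two confirmation modes follows. The prompt wording, the accepted replies, the injectable `ask` parameter, and the ValueError for unknown modes are all assumptions added for illustration; only the mode names `prompt` and `yes` come from the description above.

```python
from typing import Callable


def sketch_confirm_retry(
    count: int,
    confirm_mode: str,
    ask: Callable[[str], str] = input,
) -> bool:
    """Hypothetical stand-in for confirm_retry with a replaceable prompt."""
    if confirm_mode == "yes":
        # Submit without prompting.
        return True
    if confirm_mode == "prompt":
        # Ask interactively; anything other than y/yes counts as a refusal.
        reply = ask(f"Resubmit {count} failed shard(s)? [y/N] ")
        return reply.strip().lower() in {"y", "yes"}
    raise ValueError(f"unknown confirm mode: {confirm_mode!r}")


auto = sketch_confirm_retry(3, "yes")
accepted = sketch_confirm_retry(3, "prompt", ask=lambda _prompt: "Y")
declined = sketch_confirm_retry(3, "prompt", ask=lambda _prompt: "")
```

Passing `ask` explicitly keeps the function testable without a terminal.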
- mmirage.cli_utils.status.status_exit_code(failed_shards, summary)
Map shard status to an exit code.
- Parameters:
summary (ShardSummary)
mmirage.cli_utils.slurm
SLURM submission and job monitoring helpers for the MMIRAGE CLI.
- mmirage.cli_utils.slurm.build_sbatch_script(cfg, config_path)
Build the sbatch payload executed for each array task.
- Parameters:
cfg (MMirageConfig)
config_path (str)
- mmirage.cli_utils.slurm.submit_slurm_job(cfg, config_path, shard_ids=None)
Submit a SLURM array job and return its job ID.
- Parameters:
cfg (MMirageConfig)
config_path (str)
- Return type:
int | None
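Submitting an array job and recovering its ID can be sketched with the standard `sbatch --parsable --array=...` flags. This is an assumption about how such a helper might work, not a description of submit_slurm_job: the function names, the script path argument, and the choice to return None on failure are hypothetical.

```python
import subprocess
from typing import List, Optional, Sequence


def build_sbatch_command(script_path: str, shard_ids: Sequence[int]) -> List[str]:
    """Build an sbatch call that runs one array task per shard ID."""
    array_spec = ",".join(str(i) for i in shard_ids)
    return ["sbatch", "--parsable", f"--array={array_spec}", script_path]


def parse_job_id(sbatch_stdout: str) -> Optional[int]:
    """--parsable prints '<jobid>' or '<jobid>;<cluster>'."""
    first = sbatch_stdout.strip().split(";")[0]
    return int(first) if first.isdigit() else None


def sketch_submit(script_path: str, shard_ids: Sequence[int]) -> Optional[int]:
    """Run sbatch; return the job ID, or None if submission failed (assumed)."""
    proc = subprocess.run(
        build_sbatch_command(script_path, shard_ids),
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return None
    return parse_job_id(proc.stdout)


command = build_sbatch_command("job.sbatch", [0, 3, 7])
```

Passing only the failed shard IDs in the array specification is one way a retry could resubmit a subset of shards.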
- mmirage.cli_utils.slurm.wait_for_slurm_job(job_id, cfg)
Wait for a SLURM job array to leave the queue.
- Parameters:
job_id (int)
cfg (MMirageConfig)
- Return type:
None
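Waiting for a job array to leave the queue usually comes down to polling `squeue`. The sketch below assumes that approach; the polling interval, the helper names, and the returned poll count are hypothetical and are not taken from wait_for_slurm_job, which returns None.

```python
import subprocess
import time
from typing import Callable


def job_in_queue(job_id: int) -> bool:
    """True while squeue still lists the job (-h drops the header line)."""
    proc = subprocess.run(
        ["squeue", "-h", "-j", str(job_id)],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code or empty output is treated as 'no longer queued'.
    return proc.returncode == 0 and bool(proc.stdout.strip())


def sketch_wait_for_job(
    job_id: int,
    poll_seconds: float = 30.0,
    in_queue: Callable[[int], bool] = job_in_queue,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until the job leaves the queue; return how many polls found it queued."""
    polls = 0
    while in_queue(job_id):
        polls += 1
        sleep(poll_seconds)
    return polls


# Dry run with a fake queue: the job is listed twice, then disappears.
states = iter([True, True, False])
polls = sketch_wait_for_job(
    42,
    in_queue=lambda _job_id: next(states),
    sleep=lambda _seconds: None,
)
```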
- mmirage.cli_utils.slurm.require_slurm(cfg, command_name)
Ensure command can only run in SLURM mode.
- Parameters:
cfg (MMirageConfig)
command_name (str)
mmirage.cli_utils.runtime
Runtime environment setup and path helpers for the MMIRAGE CLI.
- mmirage.cli_utils.runtime.expand_path(path, project_root=None)
Expand environment variables, user home and relative paths.
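The three expansions named above map onto standard library calls. The sketch below shows one plausible ordering; the fallback to the current working directory when project_root is omitted and the final normalization are assumptions, and `sketch_expand_path` is not the library function.

```python
import os
from typing import Optional


def sketch_expand_path(path: str, project_root: Optional[str] = None) -> str:
    """Expand $VARS and ~, then anchor relative paths at project_root."""
    expanded = os.path.expanduser(os.path.expandvars(path))
    if not os.path.isabs(expanded):
        # Assumption: relative paths resolve against the project root,
        # or against the current working directory if none is given.
        base = project_root if project_root is not None else os.getcwd()
        expanded = os.path.join(base, expanded)
    return os.path.normpath(expanded)


os.environ["SKETCH_DATA_DIR"] = "data"  # hypothetical variable for the demo
relative = sketch_expand_path("$SKETCH_DATA_DIR/out", project_root="/proj")
absolute = sketch_expand_path("/scratch/run", project_root="/proj")
```

Absolute paths pass through unchanged apart from normalization, so project_root only affects relative inputs.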
- mmirage.cli_utils.runtime.get_project_root(cfg)
Return the configured project root, or the current working directory.
- Parameters:
cfg (MMirageConfig)
- mmirage.cli_utils.runtime.create_directories(paths)
Create directories if they do not already exist.
- mmirage.cli_utils.runtime.validate_edf_env_path(cfg)
Validate the optional EDF environment file path.
- Parameters:
cfg (MMirageConfig)
- Return type:
None
- mmirage.cli_utils.runtime.add_file_logging(log_file, level)
Add a file handler so logs are also written to disk.
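Attaching a file handler with the standard `logging` module might look like the sketch below. It assumes `level` is a level name such as "INFO", that the handler is attached to the root logger, and that the log directory should be created on demand; none of these details are confirmed by the page, and the format string is arbitrary.

```python
import logging
import os
import tempfile


def sketch_add_file_logging(log_file: str, level: str) -> logging.Handler:
    """Attach a FileHandler to the root logger and return it."""
    os.makedirs(os.path.dirname(os.path.abspath(log_file)), exist_ok=True)
    handler = logging.FileHandler(log_file, encoding="utf-8")
    handler.setLevel(getattr(logging, level.upper()))
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logging.getLogger().addHandler(handler)
    return handler


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "logs", "run.log")
    handler = sketch_add_file_logging(log_path, "info")
    logging.getLogger("sketch").warning("shard 3 failed")
    # Detach and close so the temporary directory can be removed cleanly.
    logging.getLogger().removeHandler(handler)
    handler.close()
    with open(log_path, encoding="utf-8") as fh:
        logged = fh.read()
```

Existing handlers stay in place, so console output continues alongside the file.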
- mmirage.cli_utils.runtime.setup_runtime(cfg, log_level)
Initialize runtime-level logging.
- Parameters:
cfg (MMirageConfig)
log_level (str)
- Return type:
None