mmirage.core.loader — Data Loaders

Base classes

Base classes and registry for data loaders in MMIRAGE.

class mmirage.core.loader.base.BaseDataLoaderConfig(type, output_dir, image_base_path=None)[source]

Bases: object

Base configuration class for data loaders.

All data loader configurations must inherit from this class and specify a type identifier.

Parameters:
  • type (str)

  • output_dir (str)

  • image_base_path (str | None)

type

String identifier for the loader type (e.g., “JSONL”, “loadable”).

Type:

str

output_dir

Directory path for saving processed output shards.

Type:

str

image_base_path

Optional base directory for resolving relative image paths in this dataset.

Type:

str | None

type: str
output_dir: str
image_base_path: str | None = None
class mmirage.core.loader.base.BaseDataLoader[source]

Bases: ABC, Generic[C]

Abstract base class for data loaders.

Data loaders are responsible for loading datasets from various sources (JSONL files, Hugging Face datasets, etc.) and returning them as Hugging Face Dataset objects.

Type Parameters:

C: The configuration class type for this loader.

abstractmethod from_config(ds_config)[source]

Load a dataset from the given configuration.

Parameters:

ds_config (C) – Configuration object for loading the dataset.

Returns:

A Hugging Face Dataset or DatasetDict, or None if loading fails.

Raises:

NotImplementedError – If not implemented by subclass.

Return type:

Dataset | DatasetDict | None
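As a rough illustration of the subclassing contract described above, the sketch below defines stand-in versions of BaseDataLoaderConfig and BaseDataLoader (so that it runs without MMIRAGE installed) and a hypothetical InMemoryLoader. The names InMemoryConfig, InMemoryLoader, and rows are assumptions made for this example only, and a plain list is used in place of a Hugging Face Dataset.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, List, Optional, TypeVar


@dataclass
class BaseDataLoaderConfig:
    # Stand-in mirroring the documented fields; not the real MMIRAGE class.
    type: str
    output_dir: str
    image_base_path: Optional[str] = None


C = TypeVar("C", bound=BaseDataLoaderConfig)


class BaseDataLoader(ABC, Generic[C]):
    # Stand-in for the documented abstract base class.
    @abstractmethod
    def from_config(self, ds_config: C):
        raise NotImplementedError


@dataclass
class InMemoryConfig(BaseDataLoaderConfig):
    # Hypothetical config that carries its rows directly.
    rows: Optional[List[dict]] = None


class InMemoryLoader(BaseDataLoader[InMemoryConfig]):
    # Hypothetical loader: returns the rows, or None to signal a failed load.
    def from_config(self, ds_config: InMemoryConfig):
        if not ds_config.rows:
            return None
        return list(ds_config.rows)


loader = InMemoryLoader()
loaded = loader.from_config(
    InMemoryConfig(type="in_memory", output_dir="out", rows=[{"text": "hi"}])
)
empty = loader.from_config(InMemoryConfig(type="in_memory", output_dir="out"))
```

Because from_config is abstract, instantiating the base class directly raises TypeError; only subclasses that implement it can be created.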

class mmirage.core.loader.base.DataLoaderRegistry[source]

Bases: object

Registry for managing and accessing available data loaders.

Provides a centralized registry for data loader classes and their associated configuration classes, allowing dynamic loader instantiation based on type names.

_registry

Mapping from loader name to registered loader class.

_config_registry

Mapping from loader name to its configuration class.

classmethod register(name, config_cls)[source]

Register a data loader class.

Parameters:
  • name (str) – String identifier for the loader.

  • config_cls (Type[BaseDataLoaderConfig]) – Configuration class associated with this loader.

Returns:

Decorator function to register the loader class.

Return type:

Callable

classmethod get_processor(name)[source]

Get a registered loader class by name.

Parameters:

name (str) – String identifier of the loader.

Returns:

The registered loader class.

Raises:

ValueError – If no loader is registered under the given name.

Return type:

Type[BaseDataLoader]

classmethod get_config_cls(name)[source]

Get a registered configuration class by loader name.

Parameters:

name (str) – String identifier of the loader.

Returns:

The registered configuration class.

Raises:

ValueError – If no loader is registered under the given name.

Return type:

Type[BaseDataLoaderConfig]
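The decorator-based registration described above can be sketched as follows. This is a minimal stand-in written for illustration, assuming the registry stores both mappings in class-level dictionaries; the actual MMIRAGE implementation may differ. DemoConfig and DemoLoader are hypothetical names.

```python
from typing import Callable, Dict


class DataLoaderRegistry:
    # Stand-in registry; mirrors the documented method names only.
    _registry: Dict[str, type] = {}
    _config_registry: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str, config_cls: type) -> Callable[[type], type]:
        def decorator(loader_cls: type) -> type:
            # Record the loader and its config class under the same name.
            cls._registry[name] = loader_cls
            cls._config_registry[name] = config_cls
            return loader_cls

        return decorator

    @classmethod
    def get_processor(cls, name: str) -> type:
        if name not in cls._registry:
            raise ValueError(f"No data loader registered under {name!r}")
        return cls._registry[name]

    @classmethod
    def get_config_cls(cls, name: str) -> type:
        if name not in cls._config_registry:
            raise ValueError(f"No data loader registered under {name!r}")
        return cls._config_registry[name]


class DemoConfig:
    # Hypothetical configuration class.
    pass


@DataLoaderRegistry.register("demo", DemoConfig)
class DemoLoader:
    # Hypothetical loader class.
    pass
```

Returning loader_cls from the decorator keeps the decorated class usable under its own name, so registration has no effect on normal imports.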

class mmirage.core.loader.base.AutoDataLoader[source]

Bases: object

Factory class for instantiating data loaders by name.

classmethod from_name(name)[source]

Retrieve a data loader class by its registered name.

Parameters:

name (str) – The registry name of the data loader.

Returns:

The registered data loader class.

Raises:

ValueError – If no data loader is registered under the given name.

Return type:

Type[BaseDataLoader]

JSONL loader

JSONL data loader implementation.

class mmirage.core.loader.jsonl.JSONLDataConfig(type, output_dir, image_base_path=None, path='')[source]

Bases: BaseDataLoaderConfig

Configuration for loading JSONL datasets.

Parameters:
  • type (str)

  • output_dir (str)

  • image_base_path (str | None)

  • path (str | Dict[str, str])

type

Type identifier (must be “JSONL”).

Type:

str

path

File path to the JSONL file, or dict mapping split names to paths.

Type:

str | Dict[str, str]

output_dir

Directory for saving processed output.

Type:

str

path: str | Dict[str, str] = ''
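To illustrate the two accepted shapes of path, the sketch below normalizes a single string or a split-to-path mapping into one dictionary and reads each file with the standard library. The helpers read_jsonl and load_jsonl_splits are hypothetical, and treating a single path as a "train" split is an assumption; the real loader returns Hugging Face objects rather than plain lists.

```python
import json
import os
import tempfile
from typing import Dict, List, Union


def read_jsonl(path: str) -> List[dict]:
    # One JSON object per non-empty line.
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def load_jsonl_splits(path: Union[str, Dict[str, str]]) -> Dict[str, List[dict]]:
    # Assumption: a single path is exposed as one "train" split.
    files = {"train": path} if isinstance(path, str) else dict(path)
    return {split: read_jsonl(p) for split, p in files.items()}


with tempfile.TemporaryDirectory() as tmp:
    train_file = os.path.join(tmp, "train.jsonl")
    with open(train_file, "w", encoding="utf-8") as fh:
        fh.write('{"id": 1}\n{"id": 2}\n')

    single = load_jsonl_splits(train_file)
    multi = load_jsonl_splits({"train": train_file, "validation": train_file})
```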

Hugging Face local loader

Local Hugging Face dataset loader implementation.

class mmirage.core.loader.local_hf.LocalHFConfig(type, output_dir, image_base_path=None, path='')[source]

Bases: BaseDataLoaderConfig

Configuration for loading local Hugging Face datasets.

Parameters:
  • type (str)

  • output_dir (str)

  • image_base_path (str | None)

  • path (str)

type

Type identifier (must be “loadable”).

Type:

str

path

Directory path to the saved Hugging Face dataset.

Type:

str

output_dir

Directory for saving processed output.

Type:

str

path: str = ''

Loader utilities

Utility functions for loading datasets and handling images.

mmirage.core.loader.utils.load_datasets_from_configs(configs)[source]

Load multiple datasets from configurations.

Attempts to load datasets using the specified loader configurations. Failed loads are logged as warnings and skipped.

Parameters:

configs (List[BaseDataLoaderConfig]) – List of dataset configuration objects.

Returns:

List of Hugging Face Datasets/DatasetDicts.

Raises:

RuntimeError – If no datasets could be loaded successfully.

Return type:

List[Dataset | DatasetDict]
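The skip-on-failure behaviour described above can be sketched roughly as below. This is not the library's code: load_all is a hypothetical helper, configurations are plain dictionaries, and loaders are plain callables keyed by type name, all assumed for the sake of a self-contained example.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger(__name__)


def load_all(
    configs: List[dict],
    loaders: Dict[str, Callable[[dict], Optional[list]]],
) -> List[list]:
    loaded = []
    for cfg in configs:
        try:
            dataset = loaders[cfg["type"]](cfg)
        except Exception as exc:
            # A failed load is logged and skipped, not propagated.
            logger.warning("Failed to load %r: %s", cfg, exc)
            continue
        if dataset is None:
            logger.warning("Loader returned nothing for %r", cfg)
            continue
        loaded.append(dataset)
    if not loaded:
        raise RuntimeError("No datasets could be loaded successfully")
    return loaded


def broken_loader(cfg: dict) -> list:
    # Hypothetical loader that always fails.
    raise OSError("missing file")


demo_loaders = {"ok": lambda cfg: [cfg["name"]], "broken": broken_loader}
result = load_all(
    [
        {"type": "ok", "name": "a"},
        {"type": "broken", "name": "b"},
        {"type": "unknown", "name": "c"},
    ],
    demo_loaders,
)
```

Only the first configuration loads here; the other two are logged and dropped. If every configuration failed, the function would raise RuntimeError instead of returning an empty list.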

mmirage.core.loader.utils.resolve_image_input(value, image_base_path=None)[source]

Resolve image input to a format SGLang can use.

Handles multiple image input formats:
  • PIL Image objects: passed through directly

  • URLs (http/https): passed through as-is

  • Absolute file paths: validated and passed through

  • Relative file paths: resolved using image_base_path

Parameters:
  • value (PIL.Image.Image | str) – The image value to resolve (PIL Image, path string, or URL).

  • image_base_path (str | None) – Optional base directory for resolving relative paths.

Returns:

Resolved image value suitable for SGLang processing.

Return type:

PIL.Image.Image | str
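The resolution order listed above might look roughly like the following sketch. The function name resolve_image is hypothetical, and two details are assumptions not confirmed by this page: that "validated" means an existence check raising FileNotFoundError, and that a relative path with no image_base_path is returned unchanged.

```python
import os
from typing import Any, Optional


def resolve_image(value: Any, image_base_path: Optional[str] = None) -> Any:
    # Non-string values (for example PIL images) pass through untouched.
    if not isinstance(value, str):
        return value
    # URLs are passed through as-is.
    if value.startswith(("http://", "https://")):
        return value
    # Absolute paths: assumed here to be checked for existence.
    if os.path.isabs(value):
        if not os.path.exists(value):
            raise FileNotFoundError(value)
        return value
    # Relative paths are joined onto the base directory when one is given.
    if image_base_path is not None:
        return os.path.join(image_base_path, value)
    return value


root = os.path.abspath(os.sep)  # an absolute path that always exists
image_object = object()  # placeholder standing in for a PIL image
resolved_relative = resolve_image("cats/1.png", image_base_path="images")
```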