`mmirage.core.process`: Processors

Variables

Variable system for the MMIRAGE pipeline, with multimodal support.

class mmirage.core.process.variables.BaseVar(name='')

Bases: ABC

Base class for variables in the MMIRAGE pipeline.

Parameters: name (str)

name

Name of the variable.

Type: str

name: str = ''

class mmirage.core.process.variables.InputVar(name='', key='', type='text')

Bases: BaseVar

Input variable extracted from source datasets.

Parameters:

name (str)
key (str)
type (Literal['text', 'image'])

key

JMESPath query to extract the variable from a sample.

Type: str

type

Variable type: "text" or "image".

Type: Literal['text', 'image']

key: str = ''

type: Literal['text', 'image'] = 'text'

is_image()

Check if this input variable represents an image.

Returns: True if the variable type is "image", False otherwise.
Return type: bool
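
To illustrate the fields above, here is a minimal stand-in rather than the library source. It assumes that BaseVar and InputVar behave like plain dataclasses with the documented defaults (the real BaseVar is abstract), and the variable names and keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class BaseVar:
    """Stand-in for the documented base class (the real one derives from ABC)."""
    name: str = ""


@dataclass
class InputVar(BaseVar):
    """Stand-in mirroring the documented fields and defaults."""
    key: str = ""  # JMESPath query into a sample
    type: Literal["text", "image"] = "text"

    def is_image(self) -> bool:
        # True only when the declared type is "image".
        return self.type == "image"


# Hypothetical declarations: one text field and one image field.
question = InputVar(name="question", key="conversations[0].text")
picture = InputVar(name="picture", key="image_path", type="image")

print(question.is_image(), picture.is_image())
```

Because type defaults to 'text', only image inputs need to state their type explicitly.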

class mmirage.core.process.variables.OutputVar(name='', type='')

Bases: BaseVar

Output variable generated by processors.

Output variables are created by processors (e.g., LLMs) and can depend on input variables and previously computed output variables.

Parameters:

name (str)
type (str)

name

Name of the variable.

Type: str

type

Type identifier for the processor that generates this variable.

Type: str

type: str = ''

abstractmethod is_computable(vars)

Check whether this variable can be computed from the given variables.

Parameters: vars (Sequence[BaseVar])
Return type: bool

class mmirage.core.process.variables.VariableEnvironment(var_env, image_vars=None)

Bases: object

Environment for storing and accessing variables during processing.

Parameters:

var_env (Dict[str, Any])
image_vars (Optional[set])

__init__(var_env, image_vars=None)

Initialize a variable environment.

Parameters:

var_env (Dict[str, Any]) – Dictionary mapping variable names to their values.
image_vars (set | None) – Set of variable names that represent images. Defaults to an empty set.

Return type:

None

with_variable(key, value, is_image=False)

Create a new environment with an additional variable.

Parameters:

key (str) – Name of the variable to add.
value (Any) – Value of the variable.
is_image (bool) – Whether the variable represents an image.

Returns:

New environment with the added variable.

Return type:

VariableEnvironment

to_dict()

Get a read-only view of the variable dictionary.

Returns: MappingProxyType providing read-only access to variables.
Return type: MappingProxyType
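
The copy-on-extend behaviour of with_variable and the read-only view returned by to_dict can be sketched as follows. SketchEnvironment is a hypothetical stand-in, not the library class, and it assumes that with_variable leaves the original environment untouched.

```python
from types import MappingProxyType
from typing import Any, Dict, Optional, Set


class SketchEnvironment:
    """Hypothetical stand-in for VariableEnvironment (not the library source)."""

    def __init__(self, var_env: Dict[str, Any], image_vars: Optional[Set[str]] = None) -> None:
        self._vars = dict(var_env)
        self._image_vars = set(image_vars or ())

    def with_variable(self, key: str, value: Any, is_image: bool = False) -> "SketchEnvironment":
        # Copy instead of mutating, so the original environment is unchanged.
        new_vars = {**self._vars, key: value}
        new_images = (self._image_vars | {key}) if is_image else set(self._image_vars)
        return SketchEnvironment(new_vars, new_images)

    def to_dict(self) -> MappingProxyType:
        # Read-only view: item assignment on it raises TypeError.
        return MappingProxyType(self._vars)


base = SketchEnvironment({"question": "What is shown?"})
extended = base.with_variable("answer", "A cat.")

print(sorted(base.to_dict()))      # keys of the original environment
print(sorted(extended.to_dict()))  # keys of the new environment
```

Returning a new environment on each step means a processor can add its output without affecting environments held elsewhere in the pipeline.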

get(key, default=None)

Get a variable value by name.

Parameters:

key (str) – Name of the variable to retrieve.
default (Any) – Default value to return if the variable is not found.

Returns:

Variable value, or default if not found.

Return type:

Any

is_image_var(key)

Check if a variable represents an image.

Parameters: key (str) – Name of the variable to check.
Returns: True if the variable is an image variable, False otherwise.
Return type: bool

get_image_vars()

Get all image variable names.

Returns: Copy of the set containing names of all image variables.
Return type: set

get_images()

Get image values in a deterministic order.

Return type: List[Any]

has_images()

Check if the environment contains any image variables.

Returns: True if at least one image variable is present, False otherwise.
Return type: bool

static from_input_variables(sample, input_vars, image_base_path=None)

Create a variable environment from a single sample.

Parameters:

sample (Dict[str, Any]) – Dictionary containing the data for one sample.
input_vars (List[InputVar]) – List of input variable definitions to extract.
image_base_path (str | None) – Optional base directory for resolving relative image paths.

Returns:

Environment populated with extracted variables.

Return type:

VariableEnvironment

Raises:

ValueError – If a required input variable is not found in the sample.

static from_batch_input_variables(batch, input_vars, image_base_path=None)

Extract input variables from a batch of samples.

Parameters:

batch (Dict[str, List[Any]]) – Dictionary mapping column names to lists of values.
input_vars (List[InputVar]) – List of input variable definitions.
image_base_path (str | None) – Optional base directory for resolving relative image paths.

Returns:

List of VariableEnvironments, one for each sample in the batch.

Return type:

List[VariableEnvironment]
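
The batch argument is column-oriented, while the result holds one environment per sample. The sketch below shows only that reshaping step. The helper split_batch is hypothetical, and the real method may proceed differently; it also runs each InputVar's JMESPath query, which is omitted here.

```python
from typing import Any, Dict, List


def split_batch(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a column-oriented batch into one dictionary per sample."""
    columns = list(batch)
    # zip(*...) walks the columns in lockstep, yielding one row at a time.
    return [dict(zip(columns, row)) for row in zip(*(batch[c] for c in columns))]


batch = {
    "question": ["What is shown?", "How many dogs?"],
    "image_path": ["a.png", "b.png"],
}
samples = split_batch(batch)
print(samples[1])
```

Each resulting dictionary has the shape that from_input_variables expects for its sample argument.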

Base processor

Base classes and registry for processors in MMIRAGE.

class mmirage.core.process.base.BaseProcessorConfig(type='')

Bases: object

Base configuration class for processors.

All processor configurations must inherit from this class.

Parameters: type (str)

type

String identifier for the processor type (e.g., "llm").

Type: str

type: str = ''

class mmirage.core.process.base.BaseProcessor(config)

Bases: ABC, Generic[C]

Abstract base class for data processors.

Processors are responsible for transforming data by generating new output variables from existing variables.

Type Parameters: C, the output variable type this processor works with.

Parameters: config (BaseProcessorConfig)

config: Configuration object for this processor.

__init__(config)

Initialize the processor with configuration.

Parameters: config (BaseProcessorConfig) – Configuration object for this processor.
Return type: None

abstractmethod batch_process_sample(batch, output_var)

Process a batch of variable environments.

Parameters:

batch (List[VariableEnvironment]) – List of variable environments to process.
output_var (C) – Output variable definition to generate.

Returns:

List of updated variable environments with the new output variable.

Raises:

NotImplementedError – If not implemented by subclass.

Return type:

List[VariableEnvironment]
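
A subclass only has to implement batch_process_sample. The following is a rough sketch of that contract using stand-ins: SketchProcessor and UppercaseProcessor are hypothetical, plain dictionaries replace VariableEnvironment, and the output variable is reduced to its name.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generic, List, TypeVar

C = TypeVar("C")
Env = Dict[str, Any]  # simplified stand-in for VariableEnvironment


class SketchProcessor(ABC, Generic[C]):
    """Hypothetical stand-in for BaseProcessor."""

    def __init__(self, config: Any) -> None:
        self.config = config

    @abstractmethod
    def batch_process_sample(self, batch: List[Env], output_var: C) -> List[Env]:
        raise NotImplementedError


class UppercaseProcessor(SketchProcessor[str]):
    """Toy processor: the output variable is identified by its name only."""

    def batch_process_sample(self, batch: List[Env], output_var: str) -> List[Env]:
        # Return new environments instead of mutating the inputs.
        return [{**env, output_var: env["text"].upper()} for env in batch]


processor = UppercaseProcessor(config=None)
result = processor.batch_process_sample([{"text": "hello"}, {"text": "world"}], "shout")
print(result)
```

Taking a whole batch at once lets an implementation forward all samples to its backend in a single call.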

class mmirage.core.process.base.ProcessorRegistry

Bases: object

Registry for managing and accessing available processors.

Provides a centralized registry for processor classes, their configuration classes, and their output variable classes.

_registry: Mapping from processor name to registered processor class.

_config_registry: Mapping from processor name to its configuration class.

_output_var_registry: Mapping from processor name to its output variable class.

classmethod register_types(name, config_cls, output_var_cls)

Parameters:

name (str)
config_cls (Type[BaseProcessorConfig])
output_var_cls (Type[OutputVar])

Return type:

None

classmethod register(name, config_cls, output_var_cls)

Parameters:

name (str) – String identifier for the processor.
config_cls (Type[BaseProcessorConfig]) – Configuration class associated with this processor.
output_var_cls (Type[OutputVar]) – Output variable class associated with this processor.

Returns:

Decorator function to register the processor class.

Return type:

Callable
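
Since register returns a decorator, registration can sit directly on the processor class definition. The sketch below shows the general decorator-registry pattern under that assumption. SketchRegistry and the Echo classes are hypothetical, and the real registry may store or validate entries differently.

```python
from typing import Callable, Dict


class SketchRegistry:
    """Hypothetical stand-in for ProcessorRegistry."""

    _registry: Dict[str, type] = {}
    _config_registry: Dict[str, type] = {}
    _output_var_registry: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str, config_cls: type, output_var_cls: type) -> Callable[[type], type]:
        def decorator(processor_cls: type) -> type:
            cls._registry[name] = processor_cls
            cls._config_registry[name] = config_cls
            cls._output_var_registry[name] = output_var_cls
            return processor_cls  # leave the class itself unchanged
        return decorator

    @classmethod
    def get_processor(cls, name: str) -> type:
        if name not in cls._registry:
            raise ValueError(f"No processor registered under {name!r}")
        return cls._registry[name]


class EchoConfig:
    pass


class EchoOutputVar:
    pass


@SketchRegistry.register("echo", EchoConfig, EchoOutputVar)
class EchoProcessor:
    pass


print(SketchRegistry.get_processor("echo").__name__)
```

Keeping the three mappings keyed by the same name allows a configuration file to select the processor, its configuration class, and its output variable class with a single type string.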

classmethod get_processor(name)

Get a registered processor class by name.

Parameters: name (str) – String identifier of the processor.
Returns: The registered processor class.
Raises: ValueError – If no processor is registered under the given name.
Return type: Type[BaseProcessor]

classmethod get_config_cls(name)

Get a registered configuration class by processor name.

Parameters: name (str) – String identifier of the processor.
Returns: The registered configuration class.
Raises: ValueError – If no processor is registered under the given name.
Return type: Type[BaseProcessorConfig]

classmethod get_output_var_cls(name)

Get a registered output variable class by processor name.

Parameters: name (str) – String identifier of the processor.
Returns: The registered output variable class.
Raises: ValueError – If no processor is registered under the given name.
Return type: Type[OutputVar]

class mmirage.core.process.base.AutoProcessor

Bases: object

Factory class for instantiating processors by name.

classmethod from_name(name)

Retrieve a processor class by its registered name.

Parameters: name (str) – The registry name of the processor.
Returns: The registered processor class.
Raises: ValueError – If no processor is registered under the given name.
Return type: Type[BaseProcessor]

Mapper

Mapper for orchestrating variable transformations.

class mmirage.core.process.mapper.MMIRAGEMapper(processor_configs, input_vars, output_vars)

Bases: object

Mapper for orchestrating variable transformations in the MMIRAGE pipeline.

Manages processors, validates variable dependencies, and applies transformations to batches of data. Supports multimodal inputs.

Parameters:

processor_configs (List[BaseProcessorConfig])
input_vars (List[InputVar])
output_vars (List[OutputVar])

processors: Dictionary mapping processor types to processor instances.

output_vars: List of output variables to generate.

input_vars: List of input variables to extract.

__init__(processor_configs, input_vars, output_vars)

Initialize the MMIRAGE mapper.

Parameters:

processor_configs (List[BaseProcessorConfig]) – List of processor configurations.
input_vars (List[InputVar]) – List of input variable definitions.
output_vars (List[OutputVar]) – List of output variable definitions.

Return type:

None

validate_vars()

Validate that all output variables are computable.

Checks that each output variable can be computed given the available variables (inputs and previously computed outputs).

Returns: True if all variables are computable, False otherwise.
Return type: bool
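
The check described above can be pictured as a single pass over the outputs that grows the set of known names as it goes. This is a sketch under two assumptions: outputs are validated in declaration order, and each output's dependencies are already known as a set of names. The function validate_order is hypothetical.

```python
from typing import Dict, List, Set


def validate_order(input_names: List[str], outputs: Dict[str, Set[str]]) -> bool:
    """Check outputs in declaration order against the names known so far.

    `outputs` maps each output name to the names it depends on
    (dicts preserve insertion order in Python 3.7+).
    """
    available = set(input_names)
    for name, dependencies in outputs.items():
        if not dependencies <= available:
            return False  # depends on something not yet defined
        available.add(name)  # later outputs may build on this one
    return True


ok = validate_order(["question"], {"answer": {"question"}, "summary": {"answer"}})
bad = validate_order(["question"], {"summary": {"answer"}, "answer": {"question"}})
print(ok, bad)
```

In the second call, summary is declared before the answer it depends on, so the check fails even though both variables exist.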

rewrite_batch(batch, image_base_path=None)

Transform a batch of samples by computing output variables.

Parameters:

batch (Dict[str, List[Any]]) – Dictionary mapping column names to lists of values.
image_base_path (str | None) – Optional base directory for resolving relative image paths.

Returns:

List of VariableEnvironments with all output variables computed.

Raises:

RuntimeError – If an output variable type has no registered processor.

Return type:

List[VariableEnvironment]
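
Conceptually, rewrite_batch dispatches each output variable to the processor matching its type. The loop below is a simplified sketch of that dispatch, not the library code. The function rewrite, the (name, type) tuples, and the callable processors are all hypothetical simplifications.

```python
from typing import Any, Callable, Dict, List, Tuple

Env = Dict[str, Any]  # simplified stand-in for VariableEnvironment
# Hypothetical processor signature: (environments, output name) -> environments
Processor = Callable[[List[Env], str], List[Env]]


def rewrite(
    envs: List[Env],
    output_vars: List[Tuple[str, str]],
    processors: Dict[str, Processor],
) -> List[Env]:
    """Apply each (name, type) output variable in order."""
    for name, var_type in output_vars:
        if var_type not in processors:
            raise RuntimeError(f"No processor registered for type {var_type!r}")
        envs = processors[var_type](envs, name)
    return envs


def length_processor(envs: List[Env], name: str) -> List[Env]:
    # Toy processor: stores the length of each sample's "text" value.
    return [{**env, name: len(env["text"])} for env in envs]


result = rewrite(
    [{"text": "abc"}, {"text": "hello"}],
    [("n_chars", "length")],
    {"length": length_processor},
)
print(result)
```

Because each step receives the environments produced by the previous one, later output variables can read the values of earlier ones.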

LLM processor

Configuration

Configuration for the LLM processor in MMIRAGE.

class mmirage.core.process.processors.llm.config.SGLangServerArgs(model_path='none', tp_size=<factory>, trust_remote_code=True, disable_custom_all_reduce=False)

Bases: object

Server arguments for the SGLang engine.

Parameters:

model_path (str)
tp_size (int)
trust_remote_code (bool)
disable_custom_all_reduce (bool)

model_path

Path to the model or HuggingFace model ID.

Type: str

tp_size

Tensor parallelism size.

Type: int

trust_remote_code

Whether to trust remote code from HuggingFace.

Type: bool

disable_custom_all_reduce

Whether to disable custom all-reduce.

Type: bool

model_path: str = 'none'

tp_size: int

trust_remote_code: bool = True

disable_custom_all_reduce: bool = False

class mmirage.core.process.processors.llm.config.SGLangLLMConfig(type='', server_args=<factory>, default_sampling_params=<factory>, chat_template='')

Bases: BaseProcessorConfig

Configuration for the LLM processor using SGLang.

Supports both text-only and multimodal (vision-language) models.

Parameters:

type (str)
server_args (SGLangServerArgs)
default_sampling_params (Dict[str, Any])
chat_template (str)

type

Type identifier (must be "llm").

Type: str

server_args

SGLang server arguments including model path and TP size.

Type: mmirage.core.process.processors.llm.config.SGLangServerArgs

default_sampling_params

Default sampling parameters for generation.

Type: Dict[str, Any]

chat_template

Chat template name for vision-language models (e.g., "qwen2-vl").

Type: str

server_args: SGLangServerArgs

default_sampling_params: Dict[str, Any]

chat_template: str = ''

class mmirage.core.process.processors.llm.config.LLMOutputVar(name='', type='', prompt='', output_schema=<factory>, output_type='')

Bases: OutputVar

Output variable generated by the LLM processor.

Uses Jinja2 templating for prompts and supports both plain text and structured JSON outputs.

Parameters:

name (str)
type (str)
prompt (str)
output_schema (List[str])
output_type (str)

name

Name of the variable.

Type: str

type

Type identifier (must be "llm").

Type: str

prompt

Jinja2 template for the LLM prompt.

Type: str

output_schema

List of field names for JSON output (empty for plain text).

Type: List[str]

output_type

Output format: "JSON" or "plain".

Type: str

prompt: str = ''

output_schema: List[str]

output_type: str = ''

get_output_schema()

Generate a Pydantic model for JSON output validation.

Returns: A Pydantic BaseModel class if output_type is "JSON" and output_schema is non-empty, otherwise None.
Return type: Type[pydantic.BaseModel] | None

is_computable(vars)

Check if all variables referenced in the prompt are available.

Parameters: vars (Sequence[BaseVar]) – Sequence of currently available variables.
Returns: True if all template variables are declared, False otherwise.
Return type: bool
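
The idea behind this check can be sketched with a deliberately simplified placeholder scan. The regular expression below is an assumption for illustration only: it handles plain {{ name }} placeholders and ignores filters, attribute access, and control blocks. Jinja2 itself offers jinja2.meta.find_undeclared_variables for a full analysis; whether the library uses it is not stated here. The helper names are hypothetical.

```python
import re
from typing import Sequence, Set

# Matches plain placeholders such as {{ question }}; filters, attribute
# access, and control blocks are deliberately out of scope for this sketch.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def referenced_names(prompt: str) -> Set[str]:
    """Collect the names used in simple {{ name }} placeholders."""
    return set(PLACEHOLDER.findall(prompt))


def is_computable(prompt: str, available: Sequence[str]) -> bool:
    # Computable when every referenced name is among the available ones.
    return referenced_names(prompt) <= set(available)


prompt = "Answer the question about the image: {{ question }}"
print(is_computable(prompt, ["question", "picture"]))
print(is_computable(prompt, ["picture"]))
```

The second call fails because the prompt refers to question, which is not among the available names.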

Implementation

LLM processor implementation using SGLang with multimodal support.
