mmirage.core.process — Processors

Variables

Variable system for MMIRAGE pipeline with multimodal support.

class mmirage.core.process.variables.BaseVar(name='')[source]

Bases: ABC

Base class for variables in the MMIRAGE pipeline.

Parameters:

name (str)

name

Name of the variable.

Type:

str

name: str = ''

class mmirage.core.process.variables.InputVar(name='', key='', type='text')[source]

Bases: BaseVar

Input variable extracted from source datasets.

Parameters:
  • name (str)

  • key (str)

  • type (Literal['text', 'image'])

key

JMESPath query to extract the variable from a sample.

Type:

str

type

Variable type - “text” or “image”.

Type:

Literal['text', 'image']

key: str = ''
type: Literal['text', 'image'] = 'text'

is_image()[source]

Check if this input variable represents an image.

Returns:

True if the variable type is “image”, False otherwise.

Return type:

bool
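As a minimal sketch of how these fields fit together, the stand-in dataclasses below mirror the documented signatures of BaseVar and InputVar. They are illustrative re-creations rather than the library's source, and the variable names and keys in the usage lines are made up.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class BaseVar:
    """Stand-in for the documented BaseVar (illustrative only)."""
    name: str = ""


@dataclass
class InputVar(BaseVar):
    """Stand-in mirroring the documented fields of InputVar."""
    key: str = ""                            # JMESPath query into a sample
    type: Literal["text", "image"] = "text"  # "text" or "image"

    def is_image(self) -> bool:
        # True only when the variable is declared as an image.
        return self.type == "image"


# Hypothetical variable definitions.
caption = InputVar(name="caption", key="metadata.caption")
photo = InputVar(name="photo", key="image_path", type="image")
```

Since type defaults to "text", only image inputs need to set it explicitly.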

class mmirage.core.process.variables.OutputVar(name='', type='')[source]

Bases: BaseVar

Output variable generated by processors.

Output variables are created by processors (e.g., LLMs) and can depend on input variables and previously computed output variables.

Parameters:
  • name (str)

  • type (str)

name

Name of the variable.

Type:

str

type

Type identifier for the processor that generates this variable.

Type:

str

type: str = ''

abstractmethod is_computable(vars)[source]
Parameters:

vars (Sequence[BaseVar])

Return type:

bool

class mmirage.core.process.variables.VariableEnvironment(var_env, image_vars=None)[source]

Bases: object

Environment for storing and accessing variables during processing.

Parameters:
  • var_env (Dict[str, Any])

  • image_vars (Optional[set])

__init__(var_env, image_vars=None)[source]

Initialize a variable environment.

Parameters:
  • var_env (Dict[str, Any]) – Dictionary mapping variable names to their values.

  • image_vars (set | None) – Set of variable names that represent images. If None, an empty set is used.

Return type:

None

with_variable(key, value, is_image=False)[source]

Create a new environment with an additional variable.

Parameters:
  • key (str) – Name of the variable to add.

  • value (Any) – Value of the variable.

  • is_image (bool) – Whether the variable represents an image.

Returns:

New environment with the added variable.

Return type:

VariableEnvironment

to_dict()[source]

Get a read-only view of the variable dictionary.

Returns:

MappingProxyType providing read-only access to variables.

Return type:

MappingProxyType
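The copy-on-write behaviour of with_variable and the read-only view returned by to_dict can be sketched as follows. This is a simplified stand-in written for illustration, assuming the environment keeps a private dictionary and a private set of image names; the real class may store its state differently.

```python
from types import MappingProxyType
from typing import Any, Dict, Optional, Set


class VariableEnvironment:
    """Stand-in showing the pattern the documentation describes."""

    def __init__(self, var_env: Dict[str, Any],
                 image_vars: Optional[Set[str]] = None) -> None:
        self._vars = dict(var_env)
        self._image_vars = set(image_vars) if image_vars else set()

    def with_variable(self, key: str, value: Any,
                      is_image: bool = False) -> "VariableEnvironment":
        # Build a new environment; the current one is left untouched.
        images = (self._image_vars | {key}) if is_image else set(self._image_vars)
        return VariableEnvironment({**self._vars, key: value}, images)

    def to_dict(self) -> MappingProxyType:
        # Read-only view: item assignment on it raises TypeError.
        return MappingProxyType(self._vars)

    def is_image_var(self, key: str) -> bool:
        return key in self._image_vars


base = VariableEnvironment({"question": "What is shown?"})
extended = base.with_variable("photo", "cat.png", is_image=True)
```

Here base still holds only "question", while extended holds both variables and marks "photo" as an image.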

get(key, default=None)[source]

Get a variable value by name.

Parameters:
  • key (str) – Name of the variable to retrieve.

  • default (Any) – Default value to return if variable is not found.

Returns:

Variable value, or default if not found.

Return type:

Any

is_image_var(key)[source]

Check if a variable represents an image.

Parameters:

key (str) – Name of the variable to check.

Returns:

True if the variable is an image variable, False otherwise.

Return type:

bool

get_image_vars()[source]

Get all image variable names.

Returns:

Copy of the set containing names of all image variables.

Return type:

set

get_images()[source]

Get image values in a deterministic order.

Return type:

List[Any]

has_images()[source]

Check if the environment contains any image variables.

Returns:

True if at least one image variable is present, False otherwise.

Return type:

bool

static from_input_variables(sample, input_vars, image_base_path=None)[source]

Create a variable environment from a single sample.

Parameters:
  • sample (Dict[str, Any]) – Dictionary containing the data for one sample.

  • input_vars (List[InputVar]) – List of input variable definitions to extract.

  • image_base_path (str | None) – Optional base directory for resolving relative image paths.

Returns:

Environment populated with extracted variables.

Return type:

VariableEnvironment

Raises:

ValueError – If a required input variable is not found in the sample.
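The extraction step can be pictured roughly as below. The sketch resolves plain dotted keys only, which is a small subset of JMESPath, and it represents each input variable as a hypothetical (name, key) pair instead of an InputVar; both simplifications are assumptions made to keep the example self-contained.

```python
from typing import Any, Dict, List, Tuple


def lookup(sample: Dict[str, Any], key: str) -> Any:
    """Resolve a dotted key such as "meta.caption"; return None if absent."""
    current: Any = sample
    for part in key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def extract(sample: Dict[str, Any],
            input_vars: List[Tuple[str, str]]) -> Dict[str, Any]:
    """Build a name -> value mapping from (name, key) pairs."""
    env: Dict[str, Any] = {}
    for name, key in input_vars:
        value = lookup(sample, key)
        if value is None:
            # Mirrors the documented ValueError for a missing input variable.
            raise ValueError(f"input variable {name!r} not found (key={key!r})")
        env[name] = value
    return env


sample = {"meta": {"caption": "a red bicycle"}, "id": 7}
env = extract(sample, [("caption", "meta.caption"), ("sample_id", "id")])
```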

static from_batch_input_variables(batch, input_vars, image_base_path=None)[source]

Extract input variables from a batch of samples.

Parameters:
  • batch (Dict[str, List[Any]]) – Dictionary mapping column names to lists of values.

  • input_vars (List[InputVar]) – List of input variable definitions.

  • image_base_path (str | None) – Optional base directory for resolving relative image paths.

Returns:

List of VariableEnvironments, one for each sample in the batch.

Return type:

List[VariableEnvironment]
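A batch here is columnar (one list per column), while the result holds one environment per sample. A minimal sketch of that reshaping, using plain dictionaries in place of VariableEnvironment and a hypothetical helper name split_batch, might look like this:

```python
from typing import Any, Dict, List


def split_batch(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a columnar batch into one dictionary per sample."""
    if not batch:
        return []
    columns = list(batch)
    lengths = {len(batch[c]) for c in columns}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same length")
    n = lengths.pop()
    # Row i collects the i-th entry of every column.
    return [{c: batch[c][i] for c in columns} for i in range(n)]


batch = {"question": ["Q1", "Q2"], "image_path": ["a.png", "b.png"]}
samples = split_batch(batch)
```

Each per-sample dictionary could then be passed through the single-sample extraction described for from_input_variables.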

Base processor

Base classes and registry for processors in MMIRAGE.

class mmirage.core.process.base.BaseProcessorConfig(type='')[source]

Bases: object

Base configuration class for processors.

All processor configurations must inherit from this class.

Parameters:

type (str)

type

String identifier for the processor type (e.g., “llm”).

Type:

str

type: str = ''

class mmirage.core.process.base.BaseProcessor(config)[source]

Bases: ABC, Generic[C]

Abstract base class for data processors.

Processors are responsible for transforming data by generating new output variables from existing variables.

Type Parameters:

C: The output variable type this processor works with.

Parameters:

config (BaseProcessorConfig)

config

Configuration object for this processor.

__init__(config)[source]

Initialize the processor with configuration.

Parameters:

config (BaseProcessorConfig) – Configuration object for this processor.

Return type:

None

abstractmethod batch_process_sample(batch, output_var)[source]

Process a batch of variable environments.

Parameters:
  • batch (List[VariableEnvironment]) – List of variable environments to process.

  • output_var (C) – Output variable definition to generate.

Returns:

List of updated variable environments with the new output variable.

Raises:

NotImplementedError – If not implemented by subclass.

Return type:

List[VariableEnvironment]
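To illustrate the subclassing contract, the sketch below re-creates a stand-in BaseProcessor and a hypothetical UppercaseProcessor. For brevity it uses plain dictionaries as environments and a plain string as the output variable; those are assumptions of the example, not the library's types.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generic, List, TypeVar

C = TypeVar("C")
Env = Dict[str, Any]


class BaseProcessor(ABC, Generic[C]):
    """Stand-in for the documented abstract base class."""

    def __init__(self, config: Any) -> None:
        self.config = config

    @abstractmethod
    def batch_process_sample(self, batch: List[Env], output_var: C) -> List[Env]:
        raise NotImplementedError


class UppercaseProcessor(BaseProcessor[str]):
    """Hypothetical processor: stores an upper-cased copy of "text"."""

    def batch_process_sample(self, batch: List[Env], output_var: str) -> List[Env]:
        # Return new environments instead of mutating the inputs.
        return [{**env, output_var: str(env["text"]).upper()} for env in batch]


processor = UppercaseProcessor(config={"type": "upper"})
result = processor.batch_process_sample([{"text": "hello"}], "shout")
```

Because batch_process_sample is abstract, the base class itself cannot be instantiated; only concrete subclasses can.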

class mmirage.core.process.base.ProcessorRegistry[source]

Bases: object

Registry for managing and accessing available processors.

Provides a centralized registry for processor classes, their configuration classes, and their output variable classes.

_registry

Mapping from processor name to registered processor class.

_config_registry

Mapping from processor name to its configuration class.

_output_var_registry

Mapping from processor name to its output variable class.

classmethod register_types(name, config_cls, output_var_cls)[source]

Register config/output-var types without importing processor implementations.

Parameters:
  • name

  • config_cls

  • output_var_cls

Return type:

None

classmethod register(name, config_cls, output_var_cls)[source]

Register a processor class with its associated classes.

Parameters:
  • name (str) – String identifier for the processor.

  • config_cls (Type[BaseProcessorConfig]) – Configuration class associated with this processor.

  • output_var_cls (Type[OutputVar]) – Output variable class associated with this processor.

Returns:

Decorator function to register the processor class.

Return type:

Callable

classmethod get_processor(name)[source]

Get a registered processor class by name.

Parameters:

name (str) – String identifier of the processor.

Returns:

The registered processor class.

Raises:

ValueError – If no processor is registered under the given name.

Return type:

Type[BaseProcessor]

classmethod get_config_cls(name)[source]

Get a registered configuration class by processor name.

Parameters:

name (str) – String identifier of the processor.

Returns:

The registered configuration class.

Raises:

ValueError – If no processor is registered under the given name.

Return type:

Type[BaseProcessorConfig]

classmethod get_output_var_cls(name)[source]

Get a registered output variable class by processor name.

Parameters:

name (str) – String identifier of the processor.

Returns:

The registered output variable class.

Raises:

ValueError – If no processor is registered under the given name.

Return type:

Type[OutputVar]
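The decorator-based registration described above follows a common pattern, sketched here with a stand-in registry. The Echo classes are hypothetical placeholders, and the sketch covers only register and get_processor.

```python
from typing import Callable, Dict


class ProcessorRegistry:
    """Stand-in registry keyed by processor name (illustrative only)."""

    _registry: Dict[str, type] = {}
    _config_registry: Dict[str, type] = {}
    _output_var_registry: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str, config_cls: type,
                 output_var_cls: type) -> Callable[[type], type]:
        def decorator(processor_cls: type) -> type:
            # Record all three classes under the same name.
            cls._registry[name] = processor_cls
            cls._config_registry[name] = config_cls
            cls._output_var_registry[name] = output_var_cls
            return processor_cls
        return decorator

    @classmethod
    def get_processor(cls, name: str) -> type:
        if name not in cls._registry:
            raise ValueError(f"no processor registered under {name!r}")
        return cls._registry[name]


class EchoConfig:
    """Hypothetical configuration class."""


class EchoOutputVar:
    """Hypothetical output variable class."""


@ProcessorRegistry.register("echo", EchoConfig, EchoOutputVar)
class EchoProcessor:
    """Hypothetical processor class."""
```

After the decorator runs, ProcessorRegistry.get_processor("echo") returns EchoProcessor, and an unknown name raises ValueError.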

class mmirage.core.process.base.AutoProcessor[source]

Bases: object

Factory class for resolving processor classes by their registered name.

classmethod from_name(name)[source]

Retrieve a processor class by its registered name.

Parameters:

name (str) – The registry name of the processor.

Returns:

The registered processor class.

Raises:

ValueError – If no processor is registered under the given name.

Return type:

Type[BaseProcessor]

Mapper

Mapper for orchestrating variable transformations.

class mmirage.core.process.mapper.MMIRAGEMapper(processor_configs, input_vars, output_vars)[source]

Bases: object

Mapper for orchestrating variable transformations in the MMIRAGE pipeline.

Manages processors, validates variable dependencies, and applies transformations to batches of data. Supports multimodal inputs.

Parameters:
  • processor_configs

  • input_vars

  • output_vars

processors

Dictionary mapping processor types to processor instances.

output_vars

List of output variables to generate.

input_vars

List of input variables to extract.

__init__(processor_configs, input_vars, output_vars)[source]

Initialize the MMIRAGE mapper.

Parameters:
  • processor_configs

  • input_vars

  • output_vars

Return type:

None

validate_vars()[source]

Validate that all output variables are computable.

Checks that each output variable can be computed given the available variables (inputs and previously computed outputs).

Returns:

True if all variables are computable, False otherwise.

Return type:

bool
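One way to read this check is as a single in-order pass over the output variables, where each one may depend on the inputs and on outputs declared before it. The sketch below assumes exactly that ordering rule and represents each output as a hypothetical (name, dependencies) pair; the actual method delegates to each variable's is_computable.

```python
from typing import Iterable, List, Set, Tuple


def validate(input_names: Iterable[str],
             outputs: List[Tuple[str, Set[str]]]) -> bool:
    """Return True if every output only needs names that are already available."""
    available = set(input_names)
    for name, needs in outputs:
        if not needs <= available:
            return False      # depends on something not yet defined
        available.add(name)   # later outputs may depend on this one
    return True


ok = validate(["question"],
              [("answer", {"question"}), ("summary", {"question", "answer"})])
bad = validate(["question"],
               [("summary", {"answer"}), ("answer", {"question"})])
```

In the second call, "summary" is declared before the "answer" it depends on, so the check fails under this ordering assumption.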

rewrite_batch(batch, image_base_path=None)[source]

Transform a batch of samples by computing output variables.

Parameters:
  • batch (Dict[str, List[Any]]) – Dictionary mapping column names to lists of values.

  • image_base_path (str | None) – Optional base directory for resolving relative image paths.

Returns:

List of VariableEnvironments with all output variables computed.

Raises:

RuntimeError – If an output variable type has no registered processor.

Return type:

List[VariableEnvironment]
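The dispatch implied by the RuntimeError above can be sketched as a loop that picks a processor by each output variable's type. In this simplified illustration, environments are plain dictionaries, output variables are hypothetical (name, type) pairs, and processors are plain functions; none of these shapes are taken from the library.

```python
from typing import Any, Callable, Dict, List, Tuple

Env = Dict[str, Any]
Processor = Callable[[List[Env], str], List[Env]]


def rewrite(envs: List[Env], output_vars: List[Tuple[str, str]],
            processors: Dict[str, Processor]) -> List[Env]:
    """Apply each (name, type) output variable in order."""
    for name, var_type in output_vars:
        if var_type not in processors:
            raise RuntimeError(f"no processor registered for type {var_type!r}")
        envs = processors[var_type](envs, name)
    return envs


def length_processor(envs: List[Env], name: str) -> List[Env]:
    # Hypothetical processor: stores the length of the "text" variable.
    return [{**env, name: len(env["text"])} for env in envs]


out = rewrite([{"text": "abc"}, {"text": "hello"}],
              [("n_chars", "length")],
              {"length": length_processor})
```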

LLM processor

Configuration

Configuration for LLM processor in MMIRAGE.

class mmirage.core.process.processors.llm.config.SGLangServerArgs(model_path='none', tp_size=<factory>, trust_remote_code=True, disable_custom_all_reduce=False)[source]

Bases: object

Server arguments for SGLang engine.

Parameters:
  • model_path (str)

  • tp_size (int)

  • trust_remote_code (bool)

  • disable_custom_all_reduce (bool)

model_path

Path to the model or HuggingFace model ID.

Type:

str

tp_size

Tensor parallelism size.

Type:

int

trust_remote_code

Whether to trust remote code from HuggingFace.

Type:

bool

disable_custom_all_reduce

Whether to disable custom all reduce.

Type:

bool

model_path: str = 'none'
tp_size: int
trust_remote_code: bool = True
disable_custom_all_reduce: bool = False

class mmirage.core.process.processors.llm.config.SGLangLLMConfig(type='', server_args=<factory>, default_sampling_params=<factory>, chat_template='')[source]

Bases: BaseProcessorConfig

Configuration for LLM processor using SGLang.

Supports both text-only and multimodal (vision-language) models.

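Parameters:
  • type (str)

  • server_args (SGLangServerArgs)

  • default_sampling_params (Dict[str, Any])

  • chat_template (str)

type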

Type identifier (must be “llm”).

Type:

str

server_args

SGLang server arguments including model path and TP size.

Type:

mmirage.core.process.processors.llm.config.SGLangServerArgs

default_sampling_params

Default sampling parameters for generation.

Type:

Dict[str, Any]

chat_template

Chat template name for vision-language models (e.g., “qwen2-vl”).

Type:

str

server_args: SGLangServerArgs
default_sampling_params: Dict[str, Any]
chat_template: str = ''

class mmirage.core.process.processors.llm.config.LLMOutputVar(name='', type='', prompt='', output_schema=<factory>, output_type='')[source]

Bases: OutputVar

Output variable generated by LLM processor.

Uses Jinja2 templating for prompts and supports both plain text and structured JSON outputs.

Parameters:
  • name (str)

  • type (str)

  • prompt (str)

  • output_schema (List[str])

  • output_type (str)

name

Name of the variable.

Type:

str

type

Type identifier (must be “llm”).

Type:

str

prompt

Jinja2 template for the LLM prompt.

Type:

str

output_schema

List of field names for JSON output (empty for plain text).

Type:

List[str]

output_type

Output format - “JSON” or “plain”.

Type:

str

prompt: str = ''
output_schema: List[str]
output_type: str = ''

get_output_schema()[source]

Generate a Pydantic model for JSON output validation.

Returns:

A Pydantic BaseModel class if output_type is “JSON” and output_schema is non-empty, otherwise None.

Return type:

Type[pydantic.BaseModel] | None

is_computable(vars)[source]

Check if all variables referenced in the prompt are available.

Parameters:

vars (Sequence[BaseVar]) – Sequence of currently available variables.

Returns:

True if all template variables are declared, False otherwise.

Return type:

bool
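The idea behind this check can be sketched with the standard library alone. The example uses a regular expression that only recognises simple {{ name }} placeholders, which is a deliberate simplification; it is not Jinja2's parser, and Jinja2's own jinja2.meta.find_undeclared_variables is the more robust way to collect template variables. The helper names are hypothetical.

```python
import re
from typing import Iterable, Set

# Captures the leading identifier of simple {{ name }} expressions only.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")


def referenced_names(prompt: str) -> Set[str]:
    """Collect the variable names a prompt template refers to."""
    return set(_PLACEHOLDER.findall(prompt))


def is_computable(prompt: str, available: Iterable[str]) -> bool:
    """True if every referenced name is among the available variables."""
    return referenced_names(prompt) <= set(available)


prompt = "Describe {{ image }} and answer: {{ question }}"
names = referenced_names(prompt)
```

With this prompt, the check passes when both "image" and "question" are available and fails if either is missing.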

Implementation

LLM processor implementation using SGLang with multimodal support.