multimeditron.model package

Subpackages

Submodules

multimeditron.model.constants module

Constants used throughout the multimeditron model.

multimeditron.model.data_loader module

class multimeditron.model.data_loader.DataCollatorForMultimodal(tokenizer: PreTrainedTokenizerBase, modality_processors: Dict[str, BaseModalityProcessor], modality_loaders: Dict[str, BaseModalityLoader], attachment_token: str, chat_template: ChatTemplate, add_generation_prompt: bool = False, use_2d_position_ids: bool = False, return_tensors: str = 'pt')

Bases: DataCollatorMixin

A data collator for multimodal datasets that prepares batches of data for input into models.

This class is designed to handle datasets containing multiple modalities (e.g., text, images, etc.). It processes and collates the data into a format suitable for multimodal model training and inference.

add_generation_prompt: bool = False
attachment_token: str
chat_template: ChatTemplate
modality_loaders: Dict[str, BaseModalityLoader]
modality_processors: Dict[str, BaseModalityProcessor]
numpy_call(features: List[Dict[str, Any]]) → Dict[str, Any]

Placeholder for NumPy integration.

This function raises a NotImplementedError indicating that NumPy support is not implemented.

Parameters:

features (List[Dict[str, Any]]) – A batch consisting of dictionaries where each dictionary represents a sample.

Raises:

NotImplementedError – Always raised to indicate that NumPy support is not available.

Alternatives:

Users can consider implementing a NumPy-specific collator if required for their use case.

return_tensors: str = 'pt'
tf_call(features: List[Dict[str, Any]]) → Dict[str, Any]

Placeholder for TensorFlow integration.

This function raises a NotImplementedError indicating that TensorFlow support is not implemented.

Parameters:

features (List[Dict[str, Any]]) – A batch consisting of dictionaries where each dictionary represents a sample.

Raises:

NotImplementedError – Always raised to indicate that TensorFlow support is not available.

Alternatives:

Users can consider implementing a TensorFlow-specific collator if required for their use case.

tokenizer: PreTrainedTokenizerBase
torch_call(raw_features: List[Dict[str, Any]]) → Dict[str, Any]

Collate a batch of multimodal data.

Parameters:

raw_features (List[Dict[str, Any]]) –

A batch consisting of dictionaries where each dictionary represents a sample. Each sample must have:

  • conversations (List[Dict[str, str]]):

    Conversation history with roles and content.

  • modalities (List[Dict[str, Any]]):

    Information about additional modalities in the sample. Each modality contains:

    • type (str): Type of the modality (e.g., ‘image’, ‘audio’).

    • value (Any): Data associated with the modality.

or:

  • text (str):

    The text content of the sample.

  • modalities (List[Dict[str, Any]]):

    Information about additional modalities in the sample. Each modality contains:

    • type (str): Type of the modality (e.g., ‘image’, ‘audio’).

    • value (Any): Data associated with the modality.

Returns:

A dictionary structured as follows:

  • input_ids (torch.Tensor):

    Batch tensor of tokenized input sequences.

  • labels (torch.Tensor):

    Batch tensor of tokenized labels.

  • attention_mask (torch.Tensor):

    Batch tensor indicating padded positions (0 for padding, 1 otherwise).

  • position_ids (torch.Tensor):

    Batch tensor of position indices for each token in the sequence.

  • processed_multimodal_inputs (Dict[str, Any]):

    Contains processed modality data with keys:

    • batch_idx (Dict[str, torch.Tensor]):

      Maps modality types to tensors indicating which batch sample each token belongs to.

    • token_range (Dict[str, torch.Tensor]):

      Maps modality types to tensors specifying the token range for each modality.

    • stacked (Dict[str, List[Any]]):

      Stores lists of modality values grouped by their types.

Return type:

Dict[str, Any]

The function performs the following steps:

  1. Separates input features by modality.

  2. Loads and processes modality-related data.

  3. Converts lists of modality features into tensors using the modality processors.

  4. Tokenizes the text, expanding each modality placeholder to the number of embedding tokens required by that modality.

  5. Computes positional and attention masks for sequence data.

use_2d_position_ids: bool = False
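
A minimal usage sketch (not part of the generated reference): the processor and loader instances, the attachment token string, and the sample path below are assumptions; substitute the ones from your own modality configuration. Calling the collator dispatches to torch_call because return_tensors defaults to 'pt'.

   from transformers import AutoTokenizer
   from multimeditron.model.data_loader import DataCollatorForMultimodal
   from multimeditron.model.model import ChatTemplate

   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

   collator = DataCollatorForMultimodal(
       tokenizer=tokenizer,
       modality_processors={"image": image_processor},  # assumed BaseModalityProcessor instance
       modality_loaders={"image": image_loader},        # assumed BaseModalityLoader instance
       attachment_token="<|attachment|>",               # hypothetical placeholder token
       chat_template=ChatTemplate.from_name("llama"),
   )

   batch = collator([
       {
           "conversations": [
               {"role": "user", "content": "Describe the attached scan. <|attachment|>"},
               {"role": "assistant", "content": "The scan shows ..."},
           ],
           "modalities": [{"type": "image", "value": "scan_0001.png"}],  # placeholder path
       },
   ])
   # batch contains input_ids, labels, attention_mask, position_ids and
   # processed_multimodal_inputs, as described for torch_call above.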

multimeditron.model.model module

class multimeditron.model.model.ChatTemplate(name: str = 'custom', delimiters: Dict[str, Dict[str, str]] = <factory>, special_tokens: Dict[str, str] = <factory>)

Bases: object

A generic chat template class to serialize conversation messages for different LLM families (LLaMA, Qwen, Apertus, etc.).

static apertus() → ChatTemplate
delimiters: Dict[str, Dict[str, str]]
static from_name(name: str) → ChatTemplate
static llama() → ChatTemplate
name: str = 'custom'
static qwen3() → ChatTemplate
special_tokens: Dict[str, str]
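
A short selection sketch; the accepted name strings are assumed to mirror the static constructors listed above.

   from multimeditron.model.model import ChatTemplate

   # Pick a predefined template by family name (assumed names: "llama", "qwen3", "apertus").
   template = ChatTemplate.from_name("llama")

   # Or call the family-specific constructor directly.
   template = ChatTemplate.llama()
   print(template.name, template.special_tokens)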
class multimeditron.model.model.MultiModalModelForCausalLM(config: MultimodalConfig, bootstrap=False)

Bases: PreTrainedModel

A multimodal model for causal language modeling that integrates various modalities with a language model.

This model extends PreTrainedModel and is designed to process multiple modalities (such as images, audio, etc.) alongside text inputs. It embeds the multimodal inputs into the same embedding space as the text tokens and processes them through a shared transformer model.

The model architecture consists of:

  1. A base language model (like Llama-3)

  2. Multiple modality processors (one for each supported modality)

  3. Projection layers to map modality embeddings to the language model’s embedding space

This enables end-to-end training and inference with multimodal inputs, allowing the model to understand and generate text that incorporates information from multiple sources.
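
A minimal loading sketch; because the class extends PreTrainedModel, the standard from_pretrained entry point applies. The checkpoint path is a placeholder.

   import torch
   from multimeditron.model.model import MultiModalModelForCausalLM

   model = MultiModalModelForCausalLM.from_pretrained(
       "path/to/multimodal-checkpoint",   # hypothetical directory saved with save_pretrained
       torch_dtype=torch.bfloat16,
   )
   model.eval()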

base_model_prefix = 'model'
config_class

alias of MultimodalConfig

embed_modalities_with_text(input_ids: Tensor, processed_multimodal_inputs: List[Dict[str, Any]])

Embeds multimodal inputs alongside text tokens in a unified embedding space.

This method takes text token IDs and processed multimodal inputs, embeds them both, and combines them into a single embedding tensor that can be processed by the transformer model. It first embeds the text tokens using the model’s token embeddings, then processes each modality’s inputs through their respective modality processors, projects them to the language model’s hidden dimension, and places them at the appropriate positions in the embedding sequence.

Parameters:
  • input_ids (torch.Tensor) – Token IDs for the text input, shape [batch_size, seq_len].

  • processed_multimodal_inputs (List[Dict[str, Any]]) – Dictionary containing:

    • ‘stacked’: Dict mapping modality names to tensors of processed inputs

    • ‘batch_idx’: Dict mapping modality names to batch indices for placement

    • ‘token_range’: Dict mapping modality names to token indices for placement

Returns:

Combined embeddings of text and multimodal inputs, shape [batch_size, seq_len, hidden_size].

Return type:

torch.Tensor
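
An illustrative layout of processed_multimodal_inputs, pieced together from the field descriptions above; the shapes, dtypes, and token span are assumptions for a single image attached to sample 0.

   import torch

   processed_multimodal_inputs = {
       # One entry per modality type present in the batch.
       "stacked": {"image": torch.randn(1, 3, 224, 224)},   # assumed preprocessed pixel tensor
       "batch_idx": {"image": torch.tensor([0])},           # the image belongs to batch sample 0
       "token_range": {"image": torch.tensor([[5, 261]])},  # assumed placeholder span in the sequence
   }

   # embeddings = model.embed_modalities_with_text(input_ids, processed_multimodal_inputs)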

forward(input_ids: LongTensor = None, inputs_embeds: FloatTensor | None = None, attention_mask: Tensor | None = None, position_ids: LongTensor | None = None, past_key_values: List[FloatTensor] | None = None, labels: LongTensor | None = None, use_cache: bool | None = None, multimodal_inputs=None, processed_multimodal_inputs=None, return_dict: bool | None = True, cache_position=None, **kwargs) → Tuple | CausalLMOutputWithPast

Performs a forward pass through the multimodal model.

This is the main computation method that processes both text and multimodal inputs. It first embeds all inputs (if not already embedded), handles truncation if configured, and then passes the combined embeddings through the language model.

Parameters:
  • input_ids (torch.LongTensor, optional) – Token IDs for text input. Shape [batch_size, sequence_length].

  • inputs_embeds (torch.FloatTensor, optional) – Pre-computed input embeddings. If provided, input_ids will not be used. Shape [batch_size, sequence_length, hidden_size].

  • attention_mask (torch.Tensor, optional) – Mask to avoid attention on padding tokens. Shape [batch_size, sequence_length].

  • position_ids (torch.LongTensor, optional) – Indices of positions for positional embeddings. Shape [batch_size, sequence_length].

  • past_key_values (List[torch.FloatTensor], optional) – Cached key/values for faster inference.

  • labels (torch.LongTensor, optional) – Labels for computing language modeling loss. Shape [batch_size, sequence_length].

  • use_cache (bool, optional) – Whether to return the key/value states for future use.

  • multimodal_inputs (Any, optional) – Raw multimodal inputs that need processing.

  • processed_multimodal_inputs (Dict, optional) – Pre-processed multimodal inputs ready for embedding.

  • return_dict (bool, optional) – Whether to return a dictionary output. Defaults to True.

  • cache_position (Any, optional) – Position in the cache for retrieval.

  • **kwargs – Additional arguments passed to the base model.

Returns:

Model outputs, typically containing:
  • loss (if labels provided)

  • logits (prediction scores for each token)

  • past_key_values (if use_cache=True)

  • hidden_states (if output_hidden_states=True)

  • attentions (if output_attentions=True)

Return type:

Union[Tuple, CausalLMOutputWithPast]
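
A hedged training-step sketch using a batch produced by DataCollatorForMultimodal; the optimizer and learning rate are illustrative choices, not part of the library.

   import torch

   model.train()
   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

   outputs = model(
       input_ids=batch["input_ids"],
       attention_mask=batch["attention_mask"],
       position_ids=batch["position_ids"],
       labels=batch["labels"],
       processed_multimodal_inputs=batch["processed_multimodal_inputs"],
   )
   outputs.loss.backward()   # loss is present because labels were provided
   optimizer.step()
   optimizer.zero_grad()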

freeze_for_alignment()

Freezes model parameters for alignment training.

This method prepares the model for alignment training by:

  1. Freezing only the modality parts of each modality processor (keeping projections trainable)

  2. Freezing the entire language model

This configuration is useful when aligning modality representations with the language model’s embedding space while keeping the core LM frozen.

freeze_for_end2end()

Freezes partial parameters for end-to-end training.

This method prepares the model for end-to-end training by:

  1. Freezing only the modality parts of each modality processor (keeping projections trainable)

  2. Making the language model parameters trainable

This configuration is useful for fine-tuning the language model and modality projections together, while keeping the core modality encoders fixed.

freeze_for_lm()

Freezes modality parameters for language model fine-tuning.

This method prepares the model for language model fine-tuning by:

  1. Freezing all modality processors completely (including projections)

  2. Making the language model parameters trainable

This configuration is useful when you want to fine-tune the language model on multimodal inputs while keeping the modality processors fixed.
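
The three freezing helpers (together with unfreeze(), documented below) map naturally onto training stages; the staging below is an assumed recipe, not a prescribed one.

   # Stage 1 - alignment: only the modality projections are trainable.
   model.freeze_for_alignment()

   # Stage 2 - end-to-end fine-tuning: projections and language model trainable,
   # modality encoders stay frozen.
   model.freeze_for_end2end()

   # Alternative - LM fine-tuning on multimodal inputs: all modality processors
   # (including projections) frozen, language model trainable.
   model.freeze_for_lm()

   # Full training: make every parameter trainable again.
   model.unfreeze()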

generate(batch: Dict[str, Any], max_new_tokens=512, temperature=0.1, do_sample=True, **kwargs) → Tensor | CausalLMOutputWithPast

Generates text from multimodal inputs using the model.

This method implements custom token generation logic for multimodal inputs. It processes a batch containing text token IDs and multimodal inputs, then performs autoregressive generation of new tokens until either the maximum token count is reached or all sequences have generated an end-of-sequence token.

Parameters:
  • batch (Dict[str, Any]) – Dictionary containing the following keys:

    • input_ids: Text token IDs (torch.Tensor)

    • processed_multimodal_inputs: Processed multimodal inputs

    • attention_mask: Attention mask for the input sequence

    • position_ids: Position IDs for the input sequence

  • max_new_tokens (int) – Maximum number of tokens to generate. Defaults to 512.

  • temperature (float) – Sampling temperature for controlling randomness in generation. Lower values make generation more deterministic. Defaults to 0.1.

  • do_sample (bool) – Whether to use sampling for generation instead of greedy decoding. Defaults to True.

  • **kwargs – Additional keyword arguments passed to the underlying generation process.

Returns:

Generated token IDs, shape [batch_size, sequence_length]

Return type:

torch.Tensor
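
A generation sketch using a collated batch; the decoding step assumes the same tokenizer that was passed to the collator.

   import torch

   with torch.no_grad():
       generated_ids = model.generate(batch, max_new_tokens=256, temperature=0.1, do_sample=True)

   print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))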

get_input_embeddings() → Embedding

Returns the input embeddings of the underlying LLM.

get_model()
processors() → Dict[str, BaseModalityProcessor]
set_input_embeddings(value: Embedding)

Sets the input embeddings of the underlying LLM to the given value.

supports_gradient_checkpointing = True
unfreeze()

Unfreezes all model parameters for full training.

This method makes all parameters of the model trainable by:

  1. Unfreezing all modality processors (both core encoders and projections)

  2. Making the language model parameters trainable

This configuration enables full end-to-end training of the entire model.

class multimeditron.model.model.MultimodalConfig(vocab_size: int | None = None, modalities: List[BaseModalityConfig] = [], pad_token_idx: int = 0, eos_token_idx: int = 0, padding_side: str = 'left', initializer_range: float = 0.02, llm_path: str = 'meta-llama/Llama-3.1-8B-Instruct', truncation: bool = False, max_sequence_length: int | None = None, dtype='bfloat16', **kwargs)

Bases: PretrainedConfig

Configuration class for a multimodal model that integrates various modalities with a language model.

classmethod from_dict(config_dict, **kwargs)

Creates a MultimodalConfig instance from a dictionary.

This classmethod extends the parent class’s from_dict method to handle the special processing required for modality configurations. It extracts the modalities from the configuration dictionary, creates the appropriate ModalityConfig objects, and then initializes the MultimodalConfig with these processed modalities.

Parameters:
  • config_dict (dict) – Dictionary containing configuration parameters.

  • **kwargs – Additional keyword arguments passed to parent class’s from_dict method. Should include ‘return_unused_kwargs’ which determines the return format.

Returns:

Either just the config object or a tuple of (config, unused_kwargs) if return_unused_kwargs is True.

Return type:

Union[MultimodalConfig, Tuple[MultimodalConfig, Dict]]

model_type: str = 'multimodal'
to_dict()

Converts the MultimodalConfig object to a dictionary representation.

This method extends the parent class’s to_dict method by properly handling the modalities list, converting each ModalityConfig object to its dictionary representation.

Returns:

Dictionary containing all configuration parameters, with modalities properly serialized.

Return type:

dict
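
A round-trip sketch through to_dict() and from_dict(); the modalities list is left empty for brevity and the vocabulary size is an assumed value.

   from multimeditron.model.model import MultimodalConfig

   config = MultimodalConfig(
       vocab_size=128256,                 # assumed vocabulary size
       modalities=[],                     # fill with BaseModalityConfig instances
       llm_path="meta-llama/Llama-3.1-8B-Instruct",
       dtype="bfloat16",
   )

   as_dict = config.to_dict()             # modalities are serialized to plain dicts
   restored = MultimodalConfig.from_dict(as_dict)
   assert restored.llm_path == config.llm_path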

multimeditron.model.model.bootstrap(config, tokenizer, modalities_config)
Bootstraps the model, initializing it as follows:
  • The LLM is initialized with pretrained weights

  • The modality embedders are initialized with pretrained weights

  • The modality projectors are initialized randomly

Parameters:
  • config (dict) – The configuration dictionary for the multimodal model.

  • tokenizer (PreTrainedTokenizerBase) – The tokenizer instance to use for tokenization.

  • modalities_config (List[BaseModalityConfig]) – List of modality configurations.

Returns:

The initialized multimodal model.

Return type:

MultiModalModelForCausalLM
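
A hedged bootstrapping sketch; the configuration dictionary keys and the empty modality list are placeholders to be replaced with your own values.

   from transformers import AutoTokenizer
   from multimeditron.model.model import bootstrap

   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

   config = {"llm_path": "meta-llama/Llama-3.1-8B-Instruct", "dtype": "bfloat16"}  # assumed keys
   modalities_config = []                  # List[BaseModalityConfig] instances

   model = bootstrap(config, tokenizer, modalities_config)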

multimeditron.model.prompt_tokenizers module

class multimeditron.model.prompt_tokenizers.PromptTokenizer(tokenizer: PreTrainedTokenizerBase, chat_template: ChatTemplate, attachment_token: str, modalities_num_embeddings: Dict[str, int | None], ignore_index: int = -100)

Bases: object

compute_token_range(sequence_input_ids: Tensor, sequence_modalities: List[Dict[str, Any]]) → List[Tuple[int, int]]

Compute token range for a sample

Returns:

A list of (start, end) tuples, each giving the start and end positions of a modality within the token sequence.

convert_tokens_to_ids(tokens: List[str]) → int
expand_attachment_input_tokens(token_ids: Tensor, attention_mask: Tensor, modalities_for_message: List[Dict[str, Any]]) → Tuple[Tensor, Tensor]

Expands attachment tokens in the token sequence based on the number of embeddings for each modality.

Parameters:
  • token_ids (torch.Tensor) – The original sequence of token IDs.

  • attention_mask (torch.Tensor) – The attention mask corresponding to the token_ids.

  • modalities_for_message (List[Dict[str, Any]]) – A list of modality dictionaries, each containing modality information.

Returns:

The expanded token IDs and corresponding attention mask.

Return type:

Tuple[torch.Tensor, torch.Tensor]
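
A conceptual sketch of the expansion step (not the library's implementation): each occurrence of the attachment placeholder is replaced by one copy per modality embedding, and the attention mask grows accordingly.

   import torch

   def expand_placeholder(token_ids, attention_mask, placeholder_id, num_embeddings):
       """Replace every occurrence of placeholder_id with num_embeddings copies."""
       out_ids, out_mask = [], []
       for tok, mask in zip(token_ids.tolist(), attention_mask.tolist()):
           repeat = num_embeddings if tok == placeholder_id else 1
           out_ids.extend([tok] * repeat)
           out_mask.extend([mask] * repeat)
       return torch.tensor(out_ids), torch.tensor(out_mask)

   ids, mask = expand_placeholder(
       torch.tensor([1, 42, 99, 2]), torch.tensor([1, 1, 1, 1]),
       placeholder_id=99, num_embeddings=3,
   )
   # ids  -> tensor([ 1, 42, 99, 99, 99,  2])
   # mask -> tensor([1, 1, 1, 1, 1, 1])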

get_num_embeddings(modality: Dict[str, Any]) → int
pad_tokenized(tokenized: List[Dict[str, Tensor]]) → Dict[str, Tensor]
tokenize_conversation(prompt: List[List[Dict[str, str]]], modalities: List[List[Dict[str, Any]]], add_eos_token=True, add_generation_prompt=False) → List[Dict[str, Any]]
tokenize_samples(samples: List[Dict[str, Any]] | Dict[str, Any], **kwargs) → List[Dict[str, Any]]

Tokenizes samples, each of which is either a plain text or a conversation.

Parameters:

samples (List[Dict[str, Any]] | Dict[str, Any]) – The sample(s) to tokenize. Each sample must contain either ‘text’ or ‘conversations’.

Returns:

The tokenized samples, each containing ‘input_ids’, ‘attention_mask’, and ‘labels’.

Return type:

List[Dict[str, Any]]

tokenize_text(prompt: List[str], modalities: List[List[Dict[str, Any]]]) → List[Dict[str, Any]]
update_with_token_range(tokenized: Dict[str, Tensor], samples: List[Dict[str, Any]]) → List[Dict[str, Any]]
validate_tokenized_results(results: List[Dict[str, Any]])
property vocab_size
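
A hedged construction sketch; the attachment token and the per-image embedding count are assumptions to be replaced by the values used in your configuration.

   from transformers import AutoTokenizer
   from multimeditron.model.model import ChatTemplate
   from multimeditron.model.prompt_tokenizers import PromptTokenizer

   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

   prompt_tokenizer = PromptTokenizer(
       tokenizer=tokenizer,
       chat_template=ChatTemplate.from_name("llama"),
       attachment_token="<|attachment|>",          # hypothetical placeholder token
       modalities_num_embeddings={"image": 256},   # assumed embeddings per image
   )

   results = prompt_tokenizer.tokenize_samples({
       "text": "Describe the scan. <|attachment|>",
       "modalities": [{"type": "image", "value": "scan_0001.png"}],  # placeholder path
   })
   # Each entry in results contains 'input_ids', 'attention_mask', and 'labels'.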
multimeditron.model.prompt_tokenizers.find_tag_pos(tensor, tag)
multimeditron.model.prompt_tokenizers.replace_between_tags_v2(tensor, left_tag, right_tag, replace_value=-100)

Module contents