multimeditron.dataset package

Submodules

multimeditron.dataset.sample_preprocessor module

class multimeditron.dataset.sample_preprocessor.SamplePreprocessor(tokenizer: PreTrainedTokenizerBase, chat_template: ChatTemplate, modality_processors: Dict[str, BaseModalityProcessor], attachment_token: str)

Bases: object

A class that preprocesses input data samples for multimodal models.

It handles tokenization of input samples and converts modality-specific data into tensors, serving as the intermediary between raw input data and the model-ready format by delegating to the prompt tokenizer and the modality-specific processors.
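
A minimal construction sketch follows. Only the constructor signature above is taken from this documentation; the tokenizer checkpoint, the attachment token string, and the pre-built chat_template and image_processor objects are illustrative assumptions whose exact construction depends on the rest of the multimeditron package.

from transformers import AutoTokenizer

from multimeditron.dataset.sample_preprocessor import SamplePreprocessor

# Assumed to be built elsewhere with the package's own APIs:
# chat_template    -- a ChatTemplate instance
# image_processor  -- a BaseModalityProcessor subclass handling "image" inputs

tokenizer = AutoTokenizer.from_pretrained("path/to/base-model")  # any PreTrainedTokenizerBase

sample_preprocessor = SamplePreprocessor(
    tokenizer=tokenizer,
    chat_template=chat_template,
    modality_processors={"image": image_processor},
    attachment_token="<attachment>",  # hypothetical placeholder token marking attachments
)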

process_modality_to_tensor(samples: List[Dict[str, Any]]) → List[Dict[str, Any]]

Convert the modality-specific data in the input samples into tensors.

Parameters:

samples (List[Dict[str, Any]]) – A list of input samples where each sample contains modality-specific data.

Returns:

A list of dictionaries where each dictionary has processed modality data represented as tensors.

Return type:

List[Dict[str, Any]]

Example:

samples = [
    {"modalities": [
        # "value" carries the raw payload for this modality (e.g. an image
        # path or pre-loaded image data, depending on the registered processor)
        {"type": "image", "value": "image"}
    ]}
]
processed_samples = sample_preprocessor.process_modality_to_tensor(samples)

tokenize(samples: List[Dict[str, Any]], **kwargs) → List[Dict[str, Any]]

Tokenize a batch of input samples using the prompt tokenizer.

Parameters:
  • samples (List[Dict[str, Any]]) – A batch of input samples where each sample is a dictionary containing raw data to be tokenized.

  • kwargs – Additional arguments to customize the tokenization process.

Returns:

A list of dictionaries representing tokenized samples.

Return type:

List[Dict[str, Any]]

Example:

samples = [
    {
        # "conversations" is assumed to follow the usual chat format:
        # a list of {"role", "content"} messages.
        "conversations": [
            {"role": "user", "content": "Describe the image in detail."}
        ]
    },
]
tokenized_samples = sample_preprocessor.tokenize(samples)
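
In practice the two methods can be combined to turn raw samples into model-ready batches, for example inside a collate step. The sketch below is illustrative: the call order, the file path, and the combined sample layout are assumptions rather than documented behaviour.

raw_samples = [
    {
        "conversations": [
            {"role": "user", "content": "Describe the image in detail."}
        ],
        "modalities": [
            {"type": "image", "value": "path/to/image.png"}  # hypothetical path
        ],
    },
]

# Assumed order: convert modality payloads to tensors, then tokenize the prompts.
batch = sample_preprocessor.process_modality_to_tensor(raw_samples)
batch = sample_preprocessor.tokenize(batch)
# `batch` now holds tokenized conversations plus modality tensors;
# the exact output keys depend on the tokenizer and modality processors.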

Module contents