multimeditron.dataset package

Submodules

multimeditron.dataset.sample_preprocessor module

class multimeditron.dataset.sample_preprocessor.SamplePreprocessor(tokenizer: PreTrainedTokenizerBase, chat_template: ChatTemplate, modality_processors: Dict[str, BaseModalityProcessor], attachment_token: str)

Bases: object

A class that preprocesses input data samples for multimodal models.

It handles tokenization of input samples and converts modality-specific data into tensors, serving as the intermediary between raw input data and the model-ready format by delegating to the prompt tokenizer and the modality-specific processors.
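
A minimal construction sketch follows. Only the constructor signature above is taken from this documentation; the tokenizer checkpoint, the attachment token string, and the pre-built chat_template and image_processor objects are illustrative assumptions whose exact construction depends on the rest of the multimeditron package.

from transformers import AutoTokenizer

from multimeditron.dataset.sample_preprocessor import SamplePreprocessor

# Assumed to be built elsewhere with the package's own APIs:
# chat_template    -- a ChatTemplate instance
# image_processor  -- a BaseModalityProcessor subclass handling "image" inputs

tokenizer = AutoTokenizer.from_pretrained("path/to/base-model")  # any PreTrainedTokenizerBase

sample_preprocessor = SamplePreprocessor(
    tokenizer=tokenizer,
    chat_template=chat_template,
    modality_processors={"image": image_processor},
    attachment_token="<attachment>",  # hypothetical placeholder token marking attachments
)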

process_modality_to_tensor(samples: List[Dict[str, Any]]) → List[Dict[str, Any]]

Convert the modality-specific data in the input samples into tensors.

Parameters:

samples (List[Dict[str, Any]]) – A list of input samples where each sample contains modality-specific data.

Returns:

A list of dictionaries where each dictionary has processed modality data represented as tensors.

Return type:

List[Dict[str, Any]]

Example:

samples = [
    {"modalities": [
        # "value" carries the raw payload for this modality (e.g. an image
        # path or pre-loaded image data, depending on the registered processor)
        {"type": "image", "value": "image"}
    ]}
]
processed_samples = sample_preprocessor.process_modality_to_tensor(samples)

tokenize(samples: List[Dict[str, Any]], **kwargs) → List[Dict[str, Any]]

Tokenize a batch of input samples using the prompt tokenizer.

Parameters:
  • samples (List[Dict[str, Any]]) – A batch of input samples where each sample is a dictionary containing raw data to be tokenized.

  • kwargs – Additional arguments to customize the tokenization process.

Returns:

A list of dictionaries representing tokenized samples.

Return type:

List[Dict[str, Any]]

Example:

samples = [
    {
        # "conversations" is assumed to follow the usual chat format:
        # a list of {"role", "content"} messages.
        "conversations": [
            {"role": "user", "content": "Describe the image in detail."}
        ]
    },
]
tokenized_samples = sample_preprocessor.tokenize(samples)
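
In practice the two methods can be combined to turn raw samples into model-ready batches, for example inside a collate step. The sketch below is illustrative: the call order, the file path, and the combined sample layout are assumptions rather than documented behaviour.

raw_samples = [
    {
        "conversations": [
            {"role": "user", "content": "Describe the image in detail."}
        ],
        "modalities": [
            {"type": "image", "value": "path/to/image.png"}  # hypothetical path
        ],
    },
]

# Assumed order: convert modality payloads to tensors, then tokenize the prompts.
batch = sample_preprocessor.process_modality_to_tensor(raw_samples)
batch = sample_preprocessor.tokenize(batch)
# `batch` now holds tokenized conversations plus modality tensors;
# the exact output keys depend on the tokenizer and modality processors.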

Module contents