multimeditron.dataset package¶
Subpackages¶
Submodules¶
multimeditron.dataset.sample_preprocessor module¶
- class multimeditron.dataset.sample_preprocessor.SamplePreprocessor(tokenizer: PreTrainedTokenizerBase, chat_template: ChatTemplate, modality_processors: Dict[str, BaseModalityProcessor], attachment_token: str)¶
Bases: object
A class designed to preprocess input data samples for multimodal models.
This class handles tokenization of input samples and converts modality-specific data into tensors. It serves as the intermediary between raw input data and the model-ready format, leveraging the modality-specific processors and the tokenizer it is constructed with.
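A minimal construction sketch based on the signature above. The tokenizer checkpoint, the attachment token string, and the commented-out import paths for ChatTemplate and an image processor are assumptions for illustration, not taken from this page.

from transformers import AutoTokenizer
from multimeditron.dataset.sample_preprocessor import SamplePreprocessor

# Assumed imports; the actual modules providing ChatTemplate and an image
# BaseModalityProcessor may differ in your installation.
# from multimeditron.model.prompt_tokenizer import ChatTemplate
# from multimeditron.dataset.registry import ImageProcessor

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # example checkpoint

sample_preprocessor = SamplePreprocessor(
    tokenizer=tokenizer,
    chat_template=chat_template,                      # a ChatTemplate instance
    modality_processors={"image": image_processor},   # maps modality type to a BaseModalityProcessor
    attachment_token="<|attachment|>",                # assumed placeholder token
)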
- process_modality_to_tensor(samples: List[Dict[str, Any]]) List[Dict[str, Any]]¶
Process modality-specific data in the input samples into tensors.
- Parameters:
samples (List[Dict[str, Any]]) – A list of input samples where each sample contains modality-specific data.
- Returns:
A list of dictionaries where each dictionary has processed modality data represented as tensors.
- Return type:
List[Dict[str, Any]]
Example:
samples = [ {"modalities": [ {"type": "image", "value": "image"} ]} ] processed_samples = sample_preprocessor.process_modality_to_tensor(samples)
- tokenize(samples: List[Dict[str, Any]], **kwargs) List[Dict[str, Any]]¶
Tokenize a batch of input samples using the prompt tokenizer.
- Parameters:
samples (List[Dict[str, Any]]) – A batch of input samples where each sample is a dictionary containing raw data to be tokenized.
kwargs – Additional arguments to customize the tokenization process.
- Returns:
A list of dictionaries representing tokenized samples.
- Return type:
List[Dict[str, Any]]
Example:
samples = [ {"conversations": { "role": "user", "content": "Describe the image in detail." } }, ] tokenized_samples = sample_preprocessor.tokenize(samples)