modalities.tokenization package
Submodules
modalities.tokenization.tokenizer_wrapper module
- class modalities.tokenization.tokenizer_wrapper.PreTrainedHFTokenizer(pretrained_model_name_or_path, truncation=False, padding=False, max_length=None, special_tokens=None)[source]
  Bases: TokenizerWrapper

  Wrapper for pretrained Hugging Face tokenizers.

  Initializes the PreTrainedHFTokenizer.

  Args:
    - pretrained_model_name_or_path (str): Name or path of the pretrained model.
    - truncation (bool, optional): Flag whether to apply truncation. Defaults to False.
    - padding (bool | str, optional): Defines the padding strategy. Defaults to False.
    - max_length (int, optional): Maximum length of the tokenization output. Defaults to None.
    - special_tokens (dict[str, str], optional): Added token keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Example: {"pad_token": "[PAD]"}. Tokens are only added if they are not already in the vocabulary (tested by checking whether the tokenizer assigns the index of the unk_token to them). Defaults to None.

  Parameters:
    - pretrained_model_name_or_path (str)
    - truncation (bool)
    - padding (bool | str)
    - max_length (int | None)
    - special_tokens (dict[str, str] | None)
 - decode(token_ids)[source]
    Decodes a list of token IDs into the original text.

    Args:
      - token_ids (list[int]): List of token IDs.
- Returns:
- str: Decoded text. 
 
 - get_token_id(token)[source]
    Returns the token ID for a given token.

    Args:
- token (str): Token to get the ID for. 
- Raises:
- ValueError: If the token cannot be represented by a single token ID. 
- Returns:
- int: Token ID. 
 
 - is_special_token_id(token_id)[source]
    Returns whether a token ID is a special token ID.

    Args:
- token_id (int): Token ID to check. 
- Returns:
- bool: Flag whether the token ID is a special token ID. 
 
 - property special_tokens: dict[str, str | list[str]]
    Returns the special tokens of the tokenizer.

    Returns:
- dict[str, str | list[str]]: Special tokens dictionary. 
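
The special_tokens behavior documented above (a token is only added if the tokenizer currently resolves it to the unk_token's index) can be sketched in pure Python. `ToyTokenizer`, its dict-based vocabulary, and the method names below are illustrative stand-ins, not the real Hugging Face tokenizer API:

```python
# Sketch of the "add only if unknown" check described above, using a toy
# dict-based vocabulary instead of a real Hugging Face tokenizer.

class ToyTokenizer:
    def __init__(self):
        self.vocab = {"[UNK]": 0, "hello": 1, "world": 2}
        self.unk_token = "[UNK]"

    def token_to_id(self, token: str) -> int:
        # Tokens missing from the vocabulary fall back to the unk_token's ID.
        return self.vocab.get(token, self.vocab[self.unk_token])

    def add_special_token(self, token: str) -> bool:
        # Add the token only if it is not already in the vocabulary, i.e.
        # the tokenizer currently assigns it the unk_token's index.
        unk_id = self.vocab[self.unk_token]
        if token != self.unk_token and self.token_to_id(token) == unk_id:
            self.vocab[token] = len(self.vocab)
            return True
        return False

tok = ToyTokenizer()
tok.add_special_token("[PAD]")   # unknown, so it is added
tok.add_special_token("hello")   # already in the vocabulary, so skipped
```

With a real Hugging Face tokenizer the same check happens internally when special_tokens such as {"pad_token": "[PAD]"} are passed to the wrapper.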
 
 
- class modalities.tokenization.tokenizer_wrapper.PreTrainedSPTokenizer(tokenizer_model_file)[source]
  Bases: TokenizerWrapper

  Wrapper for pretrained SentencePiece tokenizers.

  Initializes the PreTrainedSPTokenizer.

  Args:
- tokenizer_model_file (str): Path to the tokenizer model file. 
 - Parameters:
- tokenizer_model_file (str) 
 - decode(token_ids)[source]
    Decodes a list of token IDs into the original text.

    Args:
      - token_ids (list[int]): List of token IDs.
- Returns:
- str: Decoded text. 
 
 - get_token_id(token)[source]
    Returns the token ID for a given token.

    Args:
- token (str): Token to get the ID for. 
- Raises:
- ValueError: If the token cannot be represented by a single token ID. 
- Returns:
- int: Token ID. 
 
 - is_special_token_id(token_id)[source]
    Returns whether a token ID is a special token ID.

    Args:
- token_id (int): Token ID to check. 
- Raises:
- NotImplementedError: Must be implemented by a subclass. 
- Returns:
- bool: Flag whether the token ID is a special token ID. 
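
The get_token_id contract above (raise ValueError when a token does not encode to exactly one ID) can be sketched with a toy subword vocabulary. `TOY_VOCAB` and the greedy `encode` helper below are hypothetical stand-ins for a real SentencePiece model:

```python
# Sketch of the get_token_id contract: a ValueError is raised unless the
# token encodes to exactly one token ID. A tiny hand-built vocab and a
# greedy longest-match encoder stand in for a real SentencePiece model.

TOY_VOCAB = {"<s>": 0, "</s>": 1, "he": 2, "llo": 3, "hello": 4}

def encode(text: str) -> list[int]:
    # Greedy longest-match segmentation over the toy vocabulary.
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot encode {text[i]!r}")
    return ids

def get_token_id(token: str) -> int:
    token_ids = encode(token)
    if len(token_ids) != 1:
        raise ValueError("Token cannot be represented by a single token ID.")
    return token_ids[0]

get_token_id("hello")   # one piece -> ID 4
```

A token like "hehello" would segment into two pieces ("he" + "hello") and therefore trigger the ValueError, mirroring the Raises clause documented above.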
 
 
- class modalities.tokenization.tokenizer_wrapper.TokenizerWrapper[source]
  Bases: ABC

  Abstract interface for tokenizers.

  - decode(input_ids)[source]

    Decodes a list of token IDs into the original text.

    Args:
- input_ids (list[int]): List of token IDs. 
- Raises:
- NotImplementedError: Must be implemented by a subclass. 
- Returns:
- str: Decoded text. 
 
 - get_token_id(token)[source]
    Returns the token ID for a given token.

    Args:
- token (str): Token to get the ID for. 
- Raises:
- NotImplementedError: Must be implemented by a subclass. 
- Returns:
- int: Token ID. 
 
 - is_special_token_id(token_id)[source]
    Returns whether a token ID is a special token ID.

    Args:
- token_id (int): Token ID to check. 
- Raises:
- NotImplementedError: Must be implemented by a subclass. 
- Returns:
- bool: Flag whether the token ID is a special token ID.
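
The abstract interface above defines the three methods every wrapper must implement. A minimal sketch of the same pattern, using a stand-alone ABC (`TokenizerInterface`) and a toy word-level subclass (`WhitespaceTokenizer`) in place of the real TokenizerWrapper import:

```python
# Sketch of the abstract-interface pattern documented above. The names
# TokenizerInterface and WhitespaceTokenizer are illustrative, not part
# of the modalities package.
from abc import ABC, abstractmethod

class TokenizerInterface(ABC):
    @abstractmethod
    def decode(self, input_ids: list[int]) -> str: ...

    @abstractmethod
    def get_token_id(self, token: str) -> int: ...

    @abstractmethod
    def is_special_token_id(self, token_id: int) -> bool: ...

class WhitespaceTokenizer(TokenizerInterface):
    """Toy concrete subclass: word-level vocab with ID 0 reserved for <pad>."""

    def __init__(self, words: list[str]):
        self.id_to_token = ["<pad>"] + words
        self.token_to_id_map = {t: i for i, t in enumerate(self.id_to_token)}

    def decode(self, input_ids: list[int]) -> str:
        return " ".join(self.id_to_token[i] for i in input_ids)

    def get_token_id(self, token: str) -> int:
        if token not in self.token_to_id_map:
            raise ValueError("Token cannot be represented by a single token ID.")
        return self.token_to_id_map[token]

    def is_special_token_id(self, token_id: int) -> bool:
        return token_id == 0  # only <pad> is special in this toy example
```

Instantiating TokenizerInterface directly raises TypeError, which is how the NotImplementedError-style contract in the docs is enforced for subclasses that forget to override a method.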