modalities.tokenization package
Submodules
modalities.tokenization.tokenizer_wrapper module
- class modalities.tokenization.tokenizer_wrapper.PreTrainedHFTokenizer(pretrained_model_name_or_path, truncation=False, padding=False, max_length=None, special_tokens=None)[source]
Bases: TokenizerWrapper
Wrapper for pretrained Hugging Face tokenizers.
Initializes the PreTrainedHFTokenizer.
- Args:
pretrained_model_name_or_path (str): Name or path of the pretrained model.
truncation (bool, optional): Flag whether to apply truncation. Defaults to False.
padding (bool | str, optional): Defines the padding strategy. Defaults to False.
max_length (int, optional): Maximum length of the tokenization output. Defaults to None.
special_tokens (dict[str, str], optional): Added token keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Example: {"pad_token": "[PAD]"}. Tokens are only added if they are not already in the vocabulary (tested by checking whether the tokenizer assigns the index of the unk_token to them). Defaults to None.
- Parameters:
pretrained_model_name_or_path (str)
truncation (bool)
padding (bool | str)
max_length (int)
special_tokens (dict[str, str])
- decode(token_ids)[source]
Decodes a list of token IDs into the original text.
- Args:
token_ids (list[int]): List of token IDs.
- Returns:
str: Decoded text.
- get_token_id(token)[source]
Returns the token ID for a given token.
- Args:
token (str): Token to get the ID for.
- Raises:
ValueError: If the token cannot be represented by a single token ID.
- Returns:
int: Token ID.
- is_special_token_id(token_id)[source]
Returns whether a token ID is a special token ID.
- Args:
token_id (int): Token ID to check.
- Returns:
bool: Flag whether the token ID is a special token ID.
- property special_tokens: dict[str, str | list[str]]
Returns the special tokens of the tokenizer.
- Returns:
dict[str, str | list[str]]: Special tokens dictionary.
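The special_tokens handling described above only registers tokens that are missing from the vocabulary, detected by checking whether a lookup falls back to the unk_token ID. A minimal sketch of that check, using a hypothetical helper and a plain dict in place of a real Hugging Face tokenizer:

```python
def resolve_special_tokens(special_tokens, token_to_id, unk_id):
    """Return only the special tokens that are missing from the vocabulary.

    Hypothetical helper (not part of the modalities API): a token counts as
    already present unless looking it up yields the unk_token ID.
    """
    return {
        attribute: token
        for attribute, token in special_tokens.items()
        if token_to_id.get(token, unk_id) == unk_id
    }


# Toy vocabulary: "[SEP]" is known, "[PAD]" is not.
vocab = {"[UNK]": 0, "[SEP]": 1, "hello": 2}
missing = resolve_special_tokens(
    {"pad_token": "[PAD]", "sep_token": "[SEP]"}, vocab, unk_id=0
)
# Only the pad token needs to be added.
```

Only tokens that survive this filter would actually be added to the tokenizer, which keeps the vocabulary (and any tied embedding matrix) from growing unnecessarily.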
- class modalities.tokenization.tokenizer_wrapper.PreTrainedSPTokenizer(tokenizer_model_file)[source]
Bases: TokenizerWrapper
Wrapper for pretrained SentencePiece tokenizers.
Initializes the PreTrainedSPTokenizer.
- Args:
tokenizer_model_file (str): Path to the tokenizer model file.
- Parameters:
tokenizer_model_file (str)
- decode(token_ids)[source]
Decodes a list of token IDs into the original text.
- Args:
token_ids (list[int]): List of token IDs.
- Returns:
str: Decoded text.
- get_token_id(token)[source]
Returns the token ID for a given token.
- Args:
token (str): Token to get the ID for.
- Raises:
ValueError: If the token cannot be represented by a single token ID.
- Returns:
int: Token ID.
- is_special_token_id(token_id)[source]
Returns whether a token ID is a special token ID.
- Args:
token_id (int): Token ID to check.
- Raises:
NotImplementedError: Checking for special token IDs is not implemented for this tokenizer.
- Returns:
bool: Flag whether the token ID is a special token ID.
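get_token_id raises ValueError when a string does not map to exactly one token ID, which is common with SentencePiece since an out-of-vocabulary string is split into multiple pieces. A sketch of that contract, with a stand-in `encode` function in place of a real SentencePiece processor (both names are hypothetical, for illustration only):

```python
def single_token_id(encode, token):
    """Return the ID for `token`, enforcing the single-ID contract above.

    `encode` stands in for a SentencePiece encoder that maps a string to a
    list of token IDs.
    """
    token_ids = encode(token)
    if len(token_ids) != 1:
        raise ValueError(
            f"{token!r} is represented by {len(token_ids)} token IDs, expected exactly 1"
        )
    return token_ids[0]


# Fake encoder: one known piece, everything else splits into two IDs.
fake_encode = lambda s: {"<eos>": [42]}.get(s, [1, 2])
single_token_id(fake_encode, "<eos>")  # returns a single ID
```

Callers that need a guaranteed single ID (e.g. to locate an end-of-sequence marker in packed training data) should catch this ValueError rather than silently taking the first piece.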
- class modalities.tokenization.tokenizer_wrapper.TokenizerWrapper[source]
Bases: ABC
Abstract interface for tokenizers.
- decode(input_ids)[source]
Decodes a list of token IDs into the original text.
- Args:
input_ids (list[int]): List of token IDs.
- Raises:
NotImplementedError: Must be implemented by a subclass.
- Returns:
str: Decoded text.
- get_token_id(token)[source]
Returns the token ID for a given token.
- Args:
token (str): Token to get the ID for.
- Raises:
NotImplementedError: Must be implemented by a subclass.
- Returns:
int: Token ID.
- is_special_token_id(token_id)[source]
Returns whether a token ID is a special token ID.
- Args:
token_id (int): Token ID to check.
- Raises:
NotImplementedError: Must be implemented by a subclass.
- Returns:
bool: Flag whether the token ID is a special token ID.
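A concrete implementation must supply all three abstract methods. A self-contained sketch with a toy word-level tokenizer (the simplified base class and ToyTokenizer are illustrative assumptions, not the modalities implementations):

```python
from abc import ABC, abstractmethod


class TokenizerWrapper(ABC):
    """Simplified sketch of the abstract interface documented above."""

    @abstractmethod
    def decode(self, input_ids):
        raise NotImplementedError

    @abstractmethod
    def get_token_id(self, token):
        raise NotImplementedError

    @abstractmethod
    def is_special_token_id(self, token_id):
        raise NotImplementedError


class ToyTokenizer(TokenizerWrapper):
    """Hypothetical word-level tokenizer used only to illustrate the contract."""

    def __init__(self):
        self._vocab = {"<pad>": 0, "<eos>": 1, "hello": 2, "world": 3}
        self._inverse = {i: t for t, i in self._vocab.items()}
        self._special_ids = {0, 1}

    def decode(self, input_ids):
        # Toy choice: drop special tokens and join the rest with spaces.
        return " ".join(
            self._inverse[i] for i in input_ids if i not in self._special_ids
        )

    def get_token_id(self, token):
        if token not in self._vocab:
            raise ValueError(f"{token!r} is not a single token in the vocabulary")
        return self._vocab[token]

    def is_special_token_id(self, token_id):
        return token_id in self._special_ids
```

Usage: `ToyTokenizer().decode([2, 3, 1])` reconstructs the text while filtering the `<eos>` marker, and `is_special_token_id` lets downstream code (e.g. sequence packing or loss masking) distinguish content tokens from structural ones.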