modalities.tokenization package

Submodules

modalities.tokenization.tokenizer_wrapper module

class modalities.tokenization.tokenizer_wrapper.PreTrainedHFTokenizer(pretrained_model_name_or_path, truncation=False, padding=False, max_length=None, special_tokens=None)[source]

Bases: TokenizerWrapper

Wrapper for pretrained Hugging Face tokenizers.

Initializes the PreTrainedHFTokenizer.

Args:

pretrained_model_name_or_path (str): Name or path of the pretrained model.

truncation (bool, optional): Flag whether to apply truncation. Defaults to False.

padding (bool | str, optional): Defines the padding strategy. Defaults to False.

max_length (int, optional): Maximum length of the tokenization output. Defaults to None.

special_tokens (dict[str, str], optional): Added token keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Example: {"pad_token": "[PAD]"}. Tokens are only added if they are not already in the vocabulary (tested by checking whether the tokenizer assigns the index of the unk_token to them). Defaults to None.

Parameters:
  • pretrained_model_name_or_path (str)

  • truncation (bool | None)

  • padding (bool | str | None)

  • max_length (int | None)

  • special_tokens (dict[str, str] | None)
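Example (a minimal instantiation sketch; the model name "gpt2" and the concrete settings are illustrative assumptions, not defaults of the wrapper):

from modalities.tokenization.tokenizer_wrapper import PreTrainedHFTokenizer

# "gpt2" is an assumed model identifier; any Hugging Face tokenizer name
# or local path can be passed instead.
tokenizer = PreTrainedHFTokenizer(
    pretrained_model_name_or_path="gpt2",
    truncation=True,
    max_length=512,
    special_tokens={"pad_token": "[PAD]"},  # added only if not already in the vocabulary
)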

decode(token_ids)[source]

Decodes a list of token IDs into the original text.

Return type:

str

Parameters:

token_ids (list[int])

Args:

token_ids (list[int]): List of token IDs.

Returns:

str: Decoded text.
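Example (a round-trip sketch, assuming the tokenizer instance constructed above; the decoded string may include special tokens added by the underlying tokenizer):

token_ids = tokenizer.tokenize("The quick brown fox.")
text = tokenizer.decode(token_ids)  # turns the IDs back into a string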

get_token_id(token)[source]

Returns the token ID for a given token.

Return type:

int

Parameters:

token (str)

Args:

token (str): Token to get the ID for.

Raises:

ValueError: If the token cannot be represented by a single token ID.

Returns:

int: Token ID.
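Example (assuming a "[PAD]" token was registered via special_tokens at construction):

pad_id = tokenizer.get_token_id("[PAD]")
# Raises ValueError if "[PAD]" does not map to exactly one token ID.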

is_special_token_id(token_id)[source]

Returns whether a token ID is a special token ID.

Return type:

bool

Parameters:

token_id (int)

Args:

token_id (int): Token ID to check.

Returns:

bool: Flag whether the token ID is a special token ID.
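Example (a sketch that filters special tokens out of a tokenized sequence):

ids = tokenizer.tokenize("Hello world")
content_ids = [i for i in ids if not tokenizer.is_special_token_id(i)]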

property special_tokens: dict[str, str | list[str]]

Returns the special tokens of the tokenizer.

Returns:

dict[str, str | list[str]]: Special tokens dictionary.
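Example (the exact contents depend on the loaded tokenizer; the output in the comment is illustrative):

print(tokenizer.special_tokens)
# e.g. {'pad_token': '[PAD]', 'eos_token': '<|endoftext|>'}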

tokenize(text)[source]

Tokenizes a text into a list of token IDs.

Return type:

list[int]

Parameters:

text (str)

Args:

text (str): Text to be tokenized.

Returns:

list[int]: List of token IDs.
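Example (a sketch assuming truncation=True and max_length=512 were set at construction, so the wrapper caps the output length accordingly):

long_text = "some very long input " * 1000  # illustrative input
ids = tokenizer.tokenize(long_text)
assert len(ids) <= 512  # truncation limits the output to max_length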

property vocab_size: int

Returns the vocabulary size of the tokenizer.

Returns:

int: Vocabulary size.
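Example (the reported size depends on the loaded tokenizer; the number in the comment is illustrative for a GPT-2 style vocabulary):

print(tokenizer.vocab_size)  # e.g. 50257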

class modalities.tokenization.tokenizer_wrapper.PreTrainedSPTokenizer(tokenizer_model_file)[source]

Bases: TokenizerWrapper

Wrapper for pretrained SentencePiece tokenizers.

Initializes the PreTrainedSPTokenizer.

Args:

tokenizer_model_file (str): Path to the tokenizer model file.

Parameters:

tokenizer_model_file (str)
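Example (a minimal sketch; "tokenizer.model" is an assumed path to a trained SentencePiece model file):

from modalities.tokenization.tokenizer_wrapper import PreTrainedSPTokenizer

sp_tokenizer = PreTrainedSPTokenizer(tokenizer_model_file="tokenizer.model")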

decode(token_ids)[source]

Decodes a list of token IDs into the original text.

Return type:

str

Parameters:

token_ids (list[int])

Args:

token_ids (list[int]): List of token IDs.

Returns:

str: Decoded text.

get_token_id(token)[source]

Returns the token ID for a given token.

Return type:

int

Parameters:

token (str)

Args:

token (str): Token to get the ID for.

Raises:

ValueError: If the token cannot be represented by a single token ID.

Returns:

int: Token ID.

is_special_token_id(token_id)[source]

Returns whether a token ID is a special token ID.

Return type:

bool

Parameters:

token_id (int)

Args:

token_id (int): Token ID to check.

Raises:

NotImplementedError: Not implemented for the SentencePiece tokenizer wrapper.

Returns:

bool: Flag whether the token ID is a special token ID.

tokenize(text)[source]

Tokenizes a text into a list of token IDs.

Return type:

list[int]

Parameters:

text (str)

Args:

text (str): Text to be tokenized.

Returns:

list[int]: List of token IDs.
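Example (a round-trip sketch, assuming the sp_tokenizer instance constructed above):

ids = sp_tokenizer.tokenize("Hello world")
text = sp_tokenizer.decode(ids)  # turns the IDs back into a string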

property vocab_size: int

Returns the vocabulary size of the tokenizer.

Returns:

int: Vocabulary size.

class modalities.tokenization.tokenizer_wrapper.TokenizerWrapper[source]

Bases: ABC

Abstract interface for tokenizers.
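Example (a hypothetical subclass sketch showing the interface a concrete tokenizer must provide; WhitespaceTokenizer and its toy vocabulary are illustrative and not part of the package):

from modalities.tokenization.tokenizer_wrapper import TokenizerWrapper

class WhitespaceTokenizer(TokenizerWrapper):
    # Toy tokenizer over a fixed word list, for illustration only.
    def __init__(self, vocab: list[str]):
        self._vocab = {word: i for i, word in enumerate(vocab)}
        self._inverse = {i: word for word, i in self._vocab.items()}

    def tokenize(self, text: str) -> list[int]:
        return [self._vocab[word] for word in text.split()]

    def decode(self, input_ids: list[int]) -> str:
        return " ".join(self._inverse[i] for i in input_ids)

    def get_token_id(self, token: str) -> int:
        if token not in self._vocab:
            raise ValueError(f"{token} cannot be represented by a single token ID")
        return self._vocab[token]

    def is_special_token_id(self, token_id: int) -> bool:
        return False  # the toy vocabulary defines no special tokens

    @property
    def vocab_size(self) -> int:
        return len(self._vocab)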

decode(input_ids)[source]

Decodes a list of token IDs into the original text.

Return type:

str

Parameters:

input_ids (list[int])

Args:

input_ids (list[int]): List of token IDs.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

str: Decoded text.

get_token_id(token)[source]

Returns the token ID for a given token.

Return type:

int

Parameters:

token (str)

Args:

token (str): Token to get the ID for.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

int: Token ID.

is_special_token_id(token_id)[source]

Returns whether a token ID is a special token ID.

Return type:

bool

Parameters:

token_id (int)

Args:

token_id (int): Token ID to check.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

bool: Flag whether the token ID is a special token ID.

tokenize(text)[source]

Tokenizes a text into a list of token IDs.

Return type:

list[int]

Parameters:

text (str)

Args:

text (str): Text to be tokenized.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

list[int]: List of token IDs.

property vocab_size: int

Returns the vocabulary size of the tokenizer.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

int: Vocabulary size.

Module contents