modalities.tokenization package

Submodules

modalities.tokenization.tokenizer_wrapper module

class modalities.tokenization.tokenizer_wrapper.PreTrainedHFTokenizer(pretrained_model_name_or_path, truncation=False, padding=False, max_length=None, special_tokens=None)[source]

Bases: TokenizerWrapper

Wrapper for pretrained Hugging Face tokenizers.

Initializes the PreTrainedHFTokenizer.

Args:

pretrained_model_name_or_path (str): Name or path of the pretrained model.

truncation (bool, optional): Flag whether to apply truncation. Defaults to False.

padding (bool | str, optional): Defines the padding strategy. Defaults to False.

max_length (int, optional): Maximum length of the tokenization output. Defaults to None.

special_tokens (dict[str, str], optional): Added token keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens]. Example: {"pad_token": "[PAD]"}. Tokens are only added if they are not already in the vocabulary (tested by checking whether the tokenizer assigns the index of the unk_token to them). Defaults to None.

Parameters:
  • pretrained_model_name_or_path (str)

  • truncation (bool | None)

  • padding (bool | str | None)

  • max_length (int | None)

  • special_tokens (dict[str, str] | None)
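Example (a minimal instantiation sketch; the model name "gpt2" and the concrete settings are illustrative assumptions, not defaults of the wrapper):

from modalities.tokenization.tokenizer_wrapper import PreTrainedHFTokenizer

# "gpt2" is an assumed model identifier; any Hugging Face tokenizer name
# or local path can be passed instead.
tokenizer = PreTrainedHFTokenizer(
    pretrained_model_name_or_path="gpt2",
    truncation=True,
    max_length=512,
    special_tokens={"pad_token": "[PAD]"},  # added only if not already in the vocabulary
)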

decode(token_ids)[source]

Decodes a list of token IDs into the original text.

Return type:

str

Parameters:

token_ids (list[int])

Args:

token_ids (list[int]): List of token IDs.

Returns:

str: Decoded text.
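Example (a round-trip sketch, assuming the tokenizer instance constructed above; the decoded string may include special tokens added by the underlying tokenizer):

token_ids = tokenizer.tokenize("The quick brown fox.")
text = tokenizer.decode(token_ids)  # turns the IDs back into a string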

get_token_id(token)[source]

Returns the token ID for a given token.

Return type:

int

Parameters:

token (str)

Args:

token (str): Token to get the ID for.

Raises:

ValueError: If the token cannot be represented by a single token ID.

Returns:

int: Token ID.
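Example (assuming a "[PAD]" token was registered via special_tokens at construction):

pad_id = tokenizer.get_token_id("[PAD]")
# Raises ValueError if "[PAD]" does not map to exactly one token ID.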

is_special_token_id(token_id)[source]

Returns whether a token ID is a special token ID.

Return type:

bool

Parameters:

token_id (int)

Args:

token_id (int): Token ID to check.

Returns:

bool: Flag whether the token ID is a special token ID.
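Example (a sketch that filters special tokens out of a tokenized sequence):

ids = tokenizer.tokenize("Hello world")
content_ids = [i for i in ids if not tokenizer.is_special_token_id(i)]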

property special_tokens: dict[str, str | list[str]]

Returns the special tokens of the tokenizer.

Returns:

dict[str, str | list[str]]: Special tokens dictionary.
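Example (the exact contents depend on the loaded tokenizer; the output in the comment is illustrative):

print(tokenizer.special_tokens)
# e.g. {'pad_token': '[PAD]', 'eos_token': '<|endoftext|>'}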

tokenize(text)[source]

Tokenizes a text into a list of token IDs.

Return type:

list[int]

Parameters:

text (str)

Args:

text (str): Text to be tokenized.

Returns:

list[int]: List of token IDs.
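Example (a sketch assuming truncation=True and max_length=512 were set at construction, so the wrapper caps the output length accordingly):

long_text = "some very long input " * 1000  # illustrative input
ids = tokenizer.tokenize(long_text)
assert len(ids) <= 512  # truncation limits the output to max_length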

property vocab_size: int

Returns the vocabulary size of the tokenizer.

Returns:

int: Vocabulary size.
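Example (the reported size depends on the loaded tokenizer; the number in the comment is illustrative for a GPT-2 style vocabulary):

print(tokenizer.vocab_size)  # e.g. 50257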

class modalities.tokenization.tokenizer_wrapper.PreTrainedSPTokenizer(tokenizer_model_file)[source]

Bases: TokenizerWrapper

Wrapper for pretrained SentencePiece tokenizers.

Initializes the PreTrainedSPTokenizer.

Args:

tokenizer_model_file (str): Path to the tokenizer model file.

Parameters:

tokenizer_model_file (str)
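Example (a minimal sketch; "tokenizer.model" is an assumed path to a trained SentencePiece model file):

from modalities.tokenization.tokenizer_wrapper import PreTrainedSPTokenizer

sp_tokenizer = PreTrainedSPTokenizer(tokenizer_model_file="tokenizer.model")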

decode(token_ids)[source]

Decodes a list of token IDs into the original text.

Return type:

str

Parameters:

token_ids (list[int])

Args:

token_ids (list[int]): List of token IDs.

Returns:

str: Decoded text.

get_token_id(token)[source]

Returns the token ID for a given token.

Return type:

int

Parameters:

token (str)

Args:

token (str): Token to get the ID for.

Raises:

ValueError: If the token cannot be represented by a single token ID.

Returns:

int: Token ID.

is_special_token_id(token_id)[source]

Returns whether a token ID is a special token ID.

Return type:

bool

Parameters:

token_id (int)

Args:

token_id (int): Token ID to check.

Raises:

NotImplementedError: Not implemented for the SentencePiece tokenizer wrapper.

Returns:

bool: Flag whether the token ID is a special token ID.

tokenize(text)[source]

Tokenizes a text into a list of token IDs.

Return type:

list[int]

Parameters:

text (str)

Args:

text (str): Text to be tokenized.

Returns:

list[int]: List of token IDs.
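Example (a round-trip sketch, assuming the sp_tokenizer instance constructed above):

ids = sp_tokenizer.tokenize("Hello world")
text = sp_tokenizer.decode(ids)  # turns the IDs back into a string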

property vocab_size: int

Returns the vocabulary size of the tokenizer.

Returns:

int: Vocabulary size.

class modalities.tokenization.tokenizer_wrapper.TokenizerWrapper[source]

Bases: ABC

Abstract interface for tokenizers.
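Example (a hypothetical subclass sketch showing the interface a concrete tokenizer must provide; WhitespaceTokenizer and its toy vocabulary are illustrative and not part of the package):

from modalities.tokenization.tokenizer_wrapper import TokenizerWrapper

class WhitespaceTokenizer(TokenizerWrapper):
    # Toy tokenizer over a fixed word list, for illustration only.
    def __init__(self, vocab: list[str]):
        self._vocab = {word: i for i, word in enumerate(vocab)}
        self._inverse = {i: word for word, i in self._vocab.items()}

    def tokenize(self, text: str) -> list[int]:
        return [self._vocab[word] for word in text.split()]

    def decode(self, input_ids: list[int]) -> str:
        return " ".join(self._inverse[i] for i in input_ids)

    def get_token_id(self, token: str) -> int:
        if token not in self._vocab:
            raise ValueError(f"{token} cannot be represented by a single token ID")
        return self._vocab[token]

    def is_special_token_id(self, token_id: int) -> bool:
        return False  # the toy vocabulary defines no special tokens

    @property
    def vocab_size(self) -> int:
        return len(self._vocab)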

decode(input_ids)[source]

Decodes a list of token IDs into the original text.

Return type:

str

Parameters:

input_ids (list[int])

Args:

input_ids (list[int]): List of token IDs.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

str: Decoded text.

get_token_id(token)[source]

Returns the token ID for a given token.

Return type:

int

Parameters:

token (str)

Args:

token (str): Token to get the ID for.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

int: Token ID.

is_special_token_id(token_id)[source]

Returns whether a token ID is a special token ID.

Return type:

bool

Parameters:

token_id (int)

Args:

token_id (int): Token ID to check.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

bool: Flag whether the token ID is a special token ID.

tokenize(text)[source]

Tokenizes a text into a list of token IDs.

Return type:

list[int]

Parameters:

text (str)

Args:

text (str): Text to be tokenized.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

list[int]: List of token IDs.

property vocab_size: int

Returns the vocabulary size of the tokenizer.

Raises:

NotImplementedError: Must be implemented by a subclass.

Returns:

int: Vocabulary size.

Module contents