modalities.utils package
Submodules
modalities.utils.logging module
modalities.utils.mfu module
- class modalities.utils.mfu.GPT2MFUCalculator(n_layer, sequence_length, n_embd, world_size, wrapped_model)[source]
Bases:
MFUCalculatorABC
Class to calculate the Model Flops Utilization (MFU) for a given model.
- Parameters:
n_layer (int)
sequence_length (int)
n_embd (int)
world_size (int)
wrapped_model (FullyShardedDataParallel | FSDPModule)
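A minimal construction sketch, assuming a distributed process group has already been initialized (e.g. via torchrun); the stand-in linear module and the hyperparameter values are illustrative only, and the method for querying the resulting MFU value is defined by MFUCalculatorABC and not shown here.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

from modalities.utils.mfu import GPT2MFUCalculator

# Stand-in module; in practice this is the GPT-2 model wrapped with FSDP.
model = torch.nn.Linear(768, 768)
wrapped_model = FSDP(model)  # requires an initialized process group (e.g. via torchrun)

mfu_calculator = GPT2MFUCalculator(
    n_layer=12,            # number of transformer layers
    sequence_length=1024,  # tokens per sequence
    n_embd=768,            # embedding dimension
    world_size=dist.get_world_size(),
    wrapped_model=wrapped_model,
)
```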
modalities.utils.number_conversion module
- class modalities.utils.number_conversion.LocalNumBatchesFromNumSamplesConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_samples (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
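A short usage sketch for this config model with illustrative values; the field names and constraints follow the parameter list above, and violating a constraint raises the ValidationError mentioned in the docstring.

```python
from pydantic import ValidationError

from modalities.utils.number_conversion import LocalNumBatchesFromNumSamplesConfig

# Valid instantiation (illustrative values; all fields are strict integers).
config = LocalNumBatchesFromNumSamplesConfig(
    num_ranks=8,
    global_num_samples=1_000_000,
    local_micro_batch_size=4,
)

# num_ranks must be > 0, so this raises a ValidationError.
try:
    LocalNumBatchesFromNumSamplesConfig(
        num_ranks=0, global_num_samples=10, local_micro_batch_size=1
    )
except ValidationError as err:
    print(err)
```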
- class modalities.utils.number_conversion.LocalNumBatchesFromNumTokensConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_tokens (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumSamplesFromNumTokensConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromNumSamplesConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_samples (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromNumTokensConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_tokens (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromRawDatasetIndexConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
raw_index_path (Path)
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumTokensFromNumStepsConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumTokensFromPackedMemMapDatasetContinuousConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
dataset_path (Path)
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumberConversion[source]
Bases:
object
- static get_global_num_seen_tokens_from_checkpoint_path(checkpoint_path)[source]
Returns the global num seen tokens from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Num seen tokens from the checkpoint path.
- static get_global_num_target_tokens_from_checkpoint_path(checkpoint_path)[source]
Returns the global num target tokens from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Num target tokens from the checkpoint path.
- static get_last_step_from_checkpoint_path(checkpoint_path)[source]
Returns the last step from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Last step from the checkpoint path.
- static get_local_num_batches_from_num_samples(num_ranks, global_num_samples, local_micro_batch_size)[source]
Calculates the number of local batches for each rank, given the global number of samples and number of ranks. This helper function is primarily used to calculate the number of batches to skip when resuming a dataloader during warmstart.
- Args:
num_ranks (int): Global number of ranks.
global_num_samples (int): Global number of samples.
local_micro_batch_size (int): Local micro batch size on single rank.
- Returns:
int: Number of local batches for single rank.
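A usage sketch with illustrative numbers; the exact rounding behaviour (e.g. how a partial trailing batch is counted) is not specified here, so the result is only indicative.

```python
from modalities.utils.number_conversion import NumberConversion

# Illustrative values: 1,000,000 samples spread across 8 ranks,
# with 4 samples per local micro batch.
local_num_batches = NumberConversion.get_local_num_batches_from_num_samples(
    num_ranks=8,
    global_num_samples=1_000_000,
    local_micro_batch_size=4,
)
print(local_num_batches)
```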
- static get_local_num_batches_from_num_tokens(num_ranks, global_num_tokens, sequence_length, local_micro_batch_size)[source]
Calculates the number of local batches for each rank, given the global number of tokens and number of ranks. This helper function is primarily used to calculate a dataloader’s number of batches (total and to skip).
- Args:
num_ranks (int): Global number of ranks.
global_num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
local_micro_batch_size (int): Local micro batch size on single rank.
- Returns:
int: Number of local batches for single rank.
- static get_num_samples_from_num_tokens(num_tokens, sequence_length)[source]
Calculates the number of samples given the global number of tokens and sequence length.
- Args:
num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
- Returns:
int: Number of samples.
- static get_num_seen_steps_from_checkpoint_path(checkpoint_path)[source]
Returns the number of seen steps from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Number of seen steps from the checkpoint path.
- static get_num_steps_from_num_samples(num_ranks, local_micro_batch_size, global_num_samples, gradient_accumulation_steps)[source]
Calculates the number of steps given the global number of samples, local micro batch size, number of ranks and gradient accumulation steps.
- Args:
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
global_num_samples (int): Global number of samples.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
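A usage sketch with illustrative values; assuming each optimizer step consumes num_ranks * local_micro_batch_size * gradient_accumulation_steps samples (consistent with the token formula given further below), the result should be on the order of global_num_samples divided by that product.

```python
from modalities.utils.number_conversion import NumberConversion

num_steps = NumberConversion.get_num_steps_from_num_samples(
    num_ranks=8,
    local_micro_batch_size=4,
    global_num_samples=1_000_000,
    gradient_accumulation_steps=2,
)
# 8 * 4 * 2 = 64 samples per optimizer step, so roughly 1_000_000 // 64 steps.
print(num_steps)
```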
- static get_num_steps_from_num_tokens(num_ranks, local_micro_batch_size, global_num_tokens, sequence_length, gradient_accumulation_steps)[source]
Calculates the number of steps given the global number of tokens, sequence length, local micro batch size, number of ranks and gradient accumulation steps.
- Args:
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
global_num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
- static get_num_steps_from_raw_dataset_index(raw_index_path, num_ranks, local_micro_batch_size, gradient_accumulation_steps)[source]
Get the number of steps from the raw index, number of ranks, local micro batch size and gradient accumulation steps. The index is a list of tuples where each tuple contains the offset and length of a sample in the raw data. Note that the index is not packed, and the number of samples in the respective raw JSONL file is the same as the length of the index.
- Args:
raw_index_path (Path): Path to the raw index file of the JSONL dataset.
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
- static get_num_tokens_from_num_steps(num_steps, num_ranks, local_micro_batch_size, sequence_length, gradient_accumulation_steps)[source]
Calculates the number of global tokens given the number of steps, number of ranks, local micro batch size, sequence length and gradient accumulation steps.
- Args:
num_steps (int): Number of steps.
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
sequence_length (int): Sequence length of the model.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of global tokens.
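A round-trip sketch pairing this method with get_num_steps_from_num_tokens above, using illustrative values; since the step count drops any partial batch, converting tokens to steps and back should yield at most the original token count.

```python
from modalities.utils.number_conversion import NumberConversion

global_num_tokens = 10_000_000
shared_kwargs = dict(
    num_ranks=8,
    local_micro_batch_size=4,
    sequence_length=1024,
    gradient_accumulation_steps=2,
)

num_steps = NumberConversion.get_num_steps_from_num_tokens(
    global_num_tokens=global_num_tokens, **shared_kwargs
)
effective_tokens = NumberConversion.get_num_tokens_from_num_steps(
    num_steps=num_steps, **shared_kwargs
)

# effective_tokens <= global_num_tokens, since partial batches are dropped.
print(num_steps, effective_tokens)
```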
- static get_num_tokens_from_packed_mem_map_dataset_continuous(dataset_path, sequence_length, num_ranks, local_micro_batch_size, gradient_accumulation_steps)[source]
Get the number of tokens in a tokenized dataset that will be effectively used during training. Due to the way the data is packed, batched and distributed, the number of tokens used during training might not be the same as the number of tokens in the dataset.
- The number of tokens that are used during training is calculated as follows:
num_steps = num_dataset_tokens // sequence_length // num_ranks // local_micro_batch_size // gradient_accumulation_steps
global_num_tokens = num_steps * sequence_length * num_ranks * local_micro_batch_size * gradient_accumulation_steps
- Args:
dataset_path (Path): Path to the tokenized dataset.
sequence_length (int): Sequence length of the model.
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of tokens that will be effectively used during training.
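A worked instance of the two formulas above with illustrative numbers (num_dataset_tokens would in practice be determined from the tokenized dataset at dataset_path), showing why the effective token count can be smaller than the raw dataset size:

```python
# Illustrative numbers for the formula above.
num_dataset_tokens = 10_000_000
sequence_length = 1024
num_ranks = 8
local_micro_batch_size = 4
gradient_accumulation_steps = 2

num_steps = (
    num_dataset_tokens // sequence_length // num_ranks
    // local_micro_batch_size // gradient_accumulation_steps
)  # 10_000_000 // 1024 // 8 // 4 // 2 = 152

global_num_tokens = (
    num_steps * sequence_length * num_ranks
    * local_micro_batch_size * gradient_accumulation_steps
)  # 152 * 1024 * 8 * 4 * 2 = 9_961_472 < 10_000_000
```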
- class modalities.utils.number_conversion.NumberConversionFromCheckpointPathConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
checkpoint_path (Path)
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
modalities.utils.seeding module
modalities.utils.typing module
modalities.utils.verify_tokenization_consistency module
- class modalities.utils.verify_tokenization_consistency.TokenizerTypes(value)[source]
Bases:
Enum
- hugging_face = 'hugging_face'
- sentence_piece = 'sentence_piece'
- modalities.utils.verify_tokenization_consistency.build_hf_tokenization_components(tokenizer_path_or_name, eod_token)[source]
- modalities.utils.verify_tokenization_consistency.build_sp_tokenization_components(tokenizer_path, eod_token)[source]
- modalities.utils.verify_tokenization_consistency.verify_tokenization_consistency(src_path, eod_token, eod_token_id, tokenizer, tokenizer_config, jsonl_text_key)[source]
Verifies that the indexation and tokenization are consistent. This function applies the indexation and tokenization routines and then verifies that the index always captures entire samples and that the tokens in the JSON are correctly determined. For an example verification, check out the test_end_to_end_indexation_and_tokenization_consistency test; a usage sketch follows the argument list below.
- Args:
src_path (Path): Path to the JSONL file
eod_token (str): end of document token
eod_token_id (int): The token id of the end of document token
tokenizer (Callable[[str], list[int]]): Callable executing the tokenization
tokenizer_config (dict): Tokenizer config (same as used in the tokenization entry point)
jsonl_text_key (str): The key mapping to the text of interest in each JSON file
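A hedged usage sketch; the Hugging Face tokenizer, the tokenizer_config shape and the file path below are illustrative assumptions and need to match the tokenization entry point actually used.

```python
from pathlib import Path

from transformers import AutoTokenizer

from modalities.utils.verify_tokenization_consistency import verify_tokenization_consistency

hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

verify_tokenization_consistency(
    src_path=Path("data/train.jsonl"),  # placeholder path to the JSONL file
    eod_token="<|endoftext|>",
    eod_token_id=hf_tokenizer.convert_tokens_to_ids("<|endoftext|>"),
    tokenizer=lambda text: hf_tokenizer(text)["input_ids"],
    tokenizer_config={"tokenizer_path_or_name": "gpt2"},  # assumed config shape
    jsonl_text_key="text",
)
```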