modalities.utils package
Submodules
modalities.utils.logging module
modalities.utils.mfu module
- class modalities.utils.mfu.GPT2MFUCalculator(n_layer, sequence_length, n_embd, world_size, wrapped_model)[source]
Bases:
MFUCalculatorABC
Class to calculate the Model Flops Utilization (MFU) for a given model.
- Parameters:
n_layer (int)
sequence_length (int)
n_embd (int)
world_size (int)
wrapped_model (FullyShardedDataParallel | FSDPModule)
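A minimal construction sketch, assuming a distributed process group has already been initialized (e.g. via torchrun); the stand-in linear module and the hyperparameter values are illustrative only, and the method for querying the resulting MFU value is defined by MFUCalculatorABC and not shown here.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

from modalities.utils.mfu import GPT2MFUCalculator

# Stand-in module; in practice this is the GPT-2 model wrapped with FSDP.
model = torch.nn.Linear(768, 768)
wrapped_model = FSDP(model)  # requires an initialized process group (e.g. via torchrun)

mfu_calculator = GPT2MFUCalculator(
    n_layer=12,            # number of transformer layers
    sequence_length=1024,  # tokens per sequence
    n_embd=768,            # embedding dimension
    world_size=dist.get_world_size(),
    wrapped_model=wrapped_model,
)
```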
modalities.utils.number_conversion module
- class modalities.utils.number_conversion.LocalNumBatchesFromNumSamplesConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_samples (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
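A short usage sketch for this config model with illustrative values; the field names and constraints follow the parameter list above, and violating a constraint raises the ValidationError mentioned in the docstring.

```python
from pydantic import ValidationError

from modalities.utils.number_conversion import LocalNumBatchesFromNumSamplesConfig

# Valid instantiation (illustrative values; all fields are strict integers).
config = LocalNumBatchesFromNumSamplesConfig(
    num_ranks=8,
    global_num_samples=1_000_000,
    local_micro_batch_size=4,
)

# num_ranks must be > 0, so this raises a ValidationError.
try:
    LocalNumBatchesFromNumSamplesConfig(
        num_ranks=0, global_num_samples=10, local_micro_batch_size=1
    )
except ValidationError as err:
    print(err)
```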
- class modalities.utils.number_conversion.LocalNumBatchesFromNumTokensConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_tokens (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumSamplesFromNumTokensConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromNumSamplesConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_samples (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromNumTokensConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_tokens (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromRawDatasetIndexConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
raw_index_path (Path)
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumTokensFromNumStepsConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumTokensFromPackedMemMapDatasetContinuousConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
dataset_path (Path)
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumberConversion[source]
Bases:
object
- static get_global_num_seen_tokens_from_checkpoint_path(checkpoint_path)[source]
Returns the global num seen tokens from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Num seen tokens from the checkpoint path.
- static get_global_num_target_tokens_from_checkpoint_path(checkpoint_path)[source]
Returns the global num target tokens from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Num target tokens from the checkpoint path.
- static get_last_step_from_checkpoint_path(checkpoint_path)[source]
Returns the last step from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Last step from the checkpoint path.
- static get_local_num_batches_from_num_samples(num_ranks, global_num_samples, local_micro_batch_size)[source]
Calculates the number of local batches for each rank, given the global number of samples and number of ranks. This helper function is primarily used to calculate the number of batches to skip when resuming a dataloader during warmstart.
- Args:
num_ranks (int): Global number of ranks.
global_num_samples (int): Global number of samples.
local_micro_batch_size (int): Local micro batch size on single rank.
- Returns:
int: Number of local batches for single rank.
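A usage sketch with illustrative numbers; the exact rounding behaviour (e.g. how a partial trailing batch is counted) is not specified here, so the result is only indicative.

```python
from modalities.utils.number_conversion import NumberConversion

# Illustrative values: 1,000,000 samples spread across 8 ranks,
# with 4 samples per local micro batch.
local_num_batches = NumberConversion.get_local_num_batches_from_num_samples(
    num_ranks=8,
    global_num_samples=1_000_000,
    local_micro_batch_size=4,
)
print(local_num_batches)
```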
- static get_local_num_batches_from_num_tokens(num_ranks, global_num_tokens, sequence_length, local_micro_batch_size)[source]
Calculates the number of local batches for each rank, given the global number of tokens and number of ranks. This helper function is primarily used to calculate a dataloader’s number of batches (total and to skip).
- Args:
num_ranks (int): Global number of ranks.
global_num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
local_micro_batch_size (int): Local micro batch size on single rank.
- Returns:
int: Number of local batches for single rank.
- static get_num_samples_from_num_tokens(num_tokens, sequence_length)[source]
Calculates the number of samples given the global number of tokens and sequence length.
- Args:
num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
- Returns:
int: Number of samples.
- static get_num_seen_steps_from_checkpoint_path(checkpoint_path)[source]
Returns the number of seen steps from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Number of seen steps from the checkpoint path.
- static get_num_steps_from_num_samples(num_ranks, local_micro_batch_size, global_num_samples, gradient_accumulation_steps)[source]
Calculates the number of steps given the global number of samples, local micro batch size, number of ranks and gradient accumulation steps.
- Args:
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
global_num_samples (int): Global number of samples.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
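A usage sketch with illustrative values; assuming each optimizer step consumes num_ranks * local_micro_batch_size * gradient_accumulation_steps samples (consistent with the token formula given further below), the result should be on the order of global_num_samples divided by that product.

```python
from modalities.utils.number_conversion import NumberConversion

num_steps = NumberConversion.get_num_steps_from_num_samples(
    num_ranks=8,
    local_micro_batch_size=4,
    global_num_samples=1_000_000,
    gradient_accumulation_steps=2,
)
# 8 * 4 * 2 = 64 samples per optimizer step, so roughly 1_000_000 // 64 steps.
print(num_steps)
```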
- static get_num_steps_from_num_tokens(num_ranks, local_micro_batch_size, global_num_tokens, sequence_length, gradient_accumulation_steps)[source]
Calculates the number of steps given the global number of tokens, sequence length, local micro batch size, number of ranks and gradient accumulation steps.
- Args:
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
global_num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
- static get_num_steps_from_raw_dataset_index(raw_index_path, num_ranks, local_micro_batch_size, gradient_accumulation_steps)[source]
Get the number of steps from the raw index, number of ranks, local micro batch size and gradient accumulation steps. The index is a list of tuples where each tuple contains the offset and length of a sample in the raw data. Note that the index is not packed, and the number of samples in the respective raw JSONL file is the same as the length of the index.
- Args:
raw_index_path (Path): Path to the raw index file of the JSONL dataset.
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
- static get_num_tokens_from_num_steps(num_steps, num_ranks, local_micro_batch_size, sequence_length, gradient_accumulation_steps)[source]
Calculates the number of global tokens given the number of steps, number of ranks, local micro batch size, sequence length and gradient accumulation steps.
- Args:
num_steps (int): Number of steps.
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
sequence_length (int): Sequence length of the model.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of global tokens.
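A round-trip sketch pairing this method with get_num_steps_from_num_tokens above, using illustrative values; since the step count drops any partial batch, converting tokens to steps and back should yield at most the original token count.

```python
from modalities.utils.number_conversion import NumberConversion

global_num_tokens = 10_000_000
shared_kwargs = dict(
    num_ranks=8,
    local_micro_batch_size=4,
    sequence_length=1024,
    gradient_accumulation_steps=2,
)

num_steps = NumberConversion.get_num_steps_from_num_tokens(
    global_num_tokens=global_num_tokens, **shared_kwargs
)
effective_tokens = NumberConversion.get_num_tokens_from_num_steps(
    num_steps=num_steps, **shared_kwargs
)

# effective_tokens <= global_num_tokens, since partial batches are dropped.
print(num_steps, effective_tokens)
```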
- static get_num_tokens_from_packed_mem_map_dataset_continuous(dataset_path, sequence_length, num_ranks, local_micro_batch_size, gradient_accumulation_steps)[source]
Get the number of tokens in a tokenized dataset that will be effectively used during training. Due to the way the data is packed, batched and distributed, the number of tokens used during training might not be the same as the number of tokens in the dataset.
- The number of tokens that are used during training is calculated as follows:
num_steps = num_dataset_tokens // sequence_length // num_ranks // local_micro_batch_size // gradient_accumulation_steps
global_num_tokens = num_steps * sequence_length * num_ranks * local_micro_batch_size * gradient_accumulation_steps
- Args:
dataset_path (Path): Path to the tokenized dataset.
sequence_length (int): Sequence length of the model.
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on single rank.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of tokens that will be effectively used during training.
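A worked instance of the two formulas above with illustrative numbers (num_dataset_tokens would in practice be determined from the tokenized dataset at dataset_path), showing why the effective token count can be smaller than the raw dataset size:

```python
# Illustrative numbers for the formula above.
num_dataset_tokens = 10_000_000
sequence_length = 1024
num_ranks = 8
local_micro_batch_size = 4
gradient_accumulation_steps = 2

num_steps = (
    num_dataset_tokens // sequence_length // num_ranks
    // local_micro_batch_size // gradient_accumulation_steps
)  # 10_000_000 // 1024 // 8 // 4 // 2 = 152

global_num_tokens = (
    num_steps * sequence_length * num_ranks
    * local_micro_batch_size * gradient_accumulation_steps
)  # 152 * 1024 * 8 * 4 * 2 = 9_961_472 < 10_000_000
```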
- class modalities.utils.number_conversion.NumberConversionFromCheckpointPathConfig(**data)[source]
Bases:
BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
checkpoint_path (Path)
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
modalities.utils.seeding module
modalities.utils.typing module
modalities.utils.verify_tokenization_consistency module
- class modalities.utils.verify_tokenization_consistency.TokenizerTypes(value)[source]
Bases:
Enum
- hugging_face = 'hugging_face'
- sentence_piece = 'sentence_piece'
- modalities.utils.verify_tokenization_consistency.build_hf_tokenization_components(tokenizer_path_or_name, eod_token)[source]
- modalities.utils.verify_tokenization_consistency.build_sp_tokenization_components(tokenizer_path, eod_token)[source]
- modalities.utils.verify_tokenization_consistency.verify_tokenization_consistency(src_path, eod_token, eod_token_id, tokenizer, tokenizer_config, jsonl_text_key)[source]
Verifies that the indexation and tokenization are consistent. This function applies the indexation and tokenization routines and then verifies that the index always captures entire samples and that the tokens in the JSON are correctly determined. For an example verification, check out the test_end_to_end_indexation_and_tokenization_consistency test; a usage sketch follows the argument list below.
- Args:
src_path (Path): Path to the JSONL file
eod_token (str): end of document token
eod_token_id (int): The token id of the end of document token
tokenizer (Callable[[str], list[int]]): Callable executing the tokenization
tokenizer_config (dict): Tokenizer config (same as used in the tokenization entry point)
jsonl_text_key (str): The key mapping to the text of interest in each JSON file
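A hedged usage sketch; the Hugging Face tokenizer, the tokenizer_config shape and the file path below are illustrative assumptions and need to match the tokenization entry point actually used.

```python
from pathlib import Path

from transformers import AutoTokenizer

from modalities.utils.verify_tokenization_consistency import verify_tokenization_consistency

hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

verify_tokenization_consistency(
    src_path=Path("data/train.jsonl"),  # placeholder path to the JSONL file
    eod_token="<|endoftext|>",
    eod_token_id=hf_tokenizer.convert_tokens_to_ids("<|endoftext|>"),
    tokenizer=lambda text: hf_tokenizer(text)["input_ids"],
    tokenizer_config={"tokenizer_path_or_name": "gpt2"},  # assumed config shape
    jsonl_text_key="text",
)
```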