modalities.utils package
Subpackages
- modalities.utils.benchmarking package
- modalities.utils.profilers package
- Submodules
- modalities.utils.profilers.batch_generator module
- modalities.utils.profilers.modalities_profiler module
- modalities.utils.profilers.steppable_component_configs module
- modalities.utils.profilers.steppable_components module
- modalities.utils.profilers.steppable_components_if module
- Module contents
Submodules
modalities.utils.communication_test module
modalities.utils.debug module
- modalities.utils.debug.debug_nan_hook(module, input, output, module_path=None, raise_exception=False)[source]
Hook to detect NaNs in the forward pass.
- modalities.utils.debug.enable_deterministic_cuda()[source]
Context manager to enable deterministic CUDA operations and restore previous state.
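A minimal usage sketch of these two helpers, assuming `debug_nan_hook` follows the standard PyTorch forward-hook calling convention and `enable_deterministic_cuda` is used as a context manager, as its docstring states; the toy model is illustrative:

```python
from functools import partial

import torch

from modalities.utils.debug import debug_nan_hook, enable_deterministic_cuda

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())

# Attach the NaN hook to every submodule. functools.partial pre-binds the optional
# arguments so the hook keeps PyTorch's (module, input, output) call signature.
handles = [
    module.register_forward_hook(partial(debug_nan_hook, module_path=name, raise_exception=True))
    for name, module in model.named_modules()
]

# Run a forward pass with deterministic operations enabled; the previous state is restored on exit.
with enable_deterministic_cuda():
    _ = model(torch.randn(2, 8))

for handle in handles:
    handle.remove()
```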
modalities.utils.debug_components module
- class modalities.utils.debug_components.Debugging(*, forward_hooks, enable_determinism)[source]
Bases: object
- class modalities.utils.debug_components.HookRegistration[source]
Bases: object
Utility component to register and manage hooks on a PyTorch model.
- static register_forward_hooks(model, hook_fn, module_filter=<function HookRegistration.<lambda>>)[source]
Registers forward hooks on all modules that satisfy the module_filter condition.
- Args:
model (torch.nn.Module): The PyTorch model to register hooks on.
hook_fn (Any): The hook function to be registered.
module_filter (Any, optional): A function that takes a module and returns True if the hook should be registered. Defaults to a function that always returns True.
- Returns:
list[torch.utils.hooks.RemovableHandle]: A list of handles for the registered hooks.
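For illustration, the following sketch shows the plain PyTorch mechanism such a helper presumably wraps; `shape_hook` and `only_linear` are made-up examples of a hook function and a module filter:

```python
import torch


def shape_hook(module, inputs, output):
    # Example hook: print the module class and output shape.
    if isinstance(output, torch.Tensor):
        print(f"{module.__class__.__name__}: {tuple(output.shape)}")


def only_linear(module) -> bool:
    # Example filter: only register hooks on Linear layers.
    return isinstance(module, torch.nn.Linear)


model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Tanh(), torch.nn.Linear(4, 2))

# register_forward_hook returns a RemovableHandle, matching the documented return type.
handles = [m.register_forward_hook(shape_hook) for m in model.modules() if only_linear(m)]

_ = model(torch.randn(1, 4))

for handle in handles:
    handle.remove()
```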
- static register_nan_hooks(model, raise_exception=False, module_filter=<function HookRegistration.<lambda>>)[source]
Registers NaN detection hooks on all modules that satisfy the module_filter condition.
- Return type:
list[RemovableHandle]
- Args:
model (torch.nn.Module): The PyTorch model to register hooks on.
raise_exception (bool, optional): Whether to raise an exception when NaN is detected. Defaults to False.
module_filter (Any, optional): A function that takes a module and returns True if the hook should be registered. Defaults to a function that always returns True.
- Returns:
list[torch.utils.hooks.RemovableHandle]: A list of handles for the registered hooks.
- static register_print_forward_hooks(model, print_shape_only=False, module_filter=<function HookRegistration.<lambda>>)[source]
Registers print hooks on all modules that satisfy the module_filter condition.
- Return type:
list[RemovableHandle]
- Args:
model (torch.nn.Module): The PyTorch model to register hooks on.
module_filter (Any, optional): A function that takes a module and returns True if the hook should be registered. Defaults to a function that always returns True.
- Returns:
list[torch.utils.hooks.RemovableHandle]: A list of handles for the registered hooks.
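A usage sketch combining the two registration helpers above; the model and the module filter are illustrative:

```python
import torch

from modalities.utils.debug_components import HookRegistration

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU(), torch.nn.Linear(16, 4))

# NaN detection on every module, shape printing only on Linear layers.
nan_handles = HookRegistration.register_nan_hooks(model, raise_exception=True)
print_handles = HookRegistration.register_print_forward_hooks(
    model,
    print_shape_only=True,
    module_filter=lambda m: isinstance(m, torch.nn.Linear),
)

_ = model(torch.randn(2, 16))

# Both helpers return RemovableHandle lists, so the hooks can be detached again.
for handle in nan_handles + print_handles:
    handle.remove()
```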
modalities.utils.debugging_configs module
- class modalities.utils.debugging_configs.DebuggingConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
enable_determinism (bool): Whether to enable deterministic operations in PyTorch for debugging purposes.
forward_hooks (list[list[Annotated[RemovableHandle]]]): List of lists of forward hook handles registered on the model.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.debugging_configs.NaNHookConfig(**data)[source]
Bases: BaseModel
Configuration for registering NaN detection hooks on a model.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.debugging_configs.PrintForwardHookConfig(**data)[source]
Bases: BaseModel
Configuration for registering print hooks on a model.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
modalities.utils.file_ops module
modalities.utils.logger_utils module
modalities.utils.mfu module
- class modalities.utils.mfu.GPT2MFUCalculator(n_layer, sequence_length, n_embd, world_size, wrapped_model, device_mesh=None)[source]
Bases: MFUCalculatorABC
Class to calculate the Model Flops Utilization (MFU) for a given model.
- Parameters:
n_layer (int)
sequence_length (int)
n_embd (int)
world_size (int)
wrapped_model (FullyShardedDataParallel | FSDPModule)
device_mesh (DeviceMesh | None)
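The entry above only lists the constructor signature. As a rough, standalone illustration of what an MFU calculation involves (not necessarily the exact formula used by GPT2MFUCalculator), here is a sketch based on the common ~6 * params * tokens FLOPs approximation; the peak-FLOPs default is an assumed A100 BF16 value:

```python
def estimate_mfu(
    num_params: int,
    tokens_per_second: float,
    world_size: int,
    peak_flops_per_gpu: float = 312e12,  # assumed BF16 peak of an A100
) -> float:
    """Rough MFU estimate via the ~6 * params * tokens FLOPs approximation."""
    achieved_flops = 6 * num_params * tokens_per_second
    available_flops = peak_flops_per_gpu * world_size
    return achieved_flops / available_flops


# Example: a 124M-parameter GPT-2-style model processing 200k tokens/s on 8 GPUs.
print(f"MFU ~ {estimate_mfu(124e6, 200_000, 8):.2%}")
```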
modalities.utils.number_conversion module
- class modalities.utils.number_conversion.LocalNumBatchesFromNumSamplesConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_samples (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.LocalNumBatchesFromNumTokensConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_tokens (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumSamplesFromNumTokensConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromNumSamplesConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_samples (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromNumTokensConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
dp_degree (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
global_num_tokens (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumStepsFromRawDatasetIndexConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
raw_index_path (Path)
num_ranks (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumTokensFromNumStepsConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
num_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Ge(ge=0)])])
dp_degree (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumTokensFromPackedMemMapDatasetContinuousConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
dataset_path (Path)
sequence_length (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
dp_degree (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
local_micro_batch_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
gradient_accumulation_steps (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
sample_key (str)
reuse_last_target (bool)
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class modalities.utils.number_conversion.NumberConversion[source]
Bases: object
- static get_global_num_seen_tokens_from_checkpoint_path(checkpoint_path)[source]
Returns the global num seen tokens from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Num seen tokens from the checkpoint path.
- static get_global_num_target_tokens_from_checkpoint_path(checkpoint_path)[source]
Returns the global num target tokens from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Num target tokens from the checkpoint path.
- static get_last_step_from_checkpoint_path(checkpoint_path)[source]
Returns the last step from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Last step from the checkpoint path.
- static get_local_num_batches_from_num_samples(num_ranks, global_num_samples, local_micro_batch_size)[source]
Calculates the number of local batches for each rank, given the global number of samples and number of ranks. This helper function is primarily used to calculate the number of batches to skip when resuming a dataloader during warmstart.
- Args:
num_ranks (int): Global number of ranks.
global_num_samples (int): Global number of samples.
local_micro_batch_size (int): Local micro batch size on a single rank.
- Returns:
int: Number of local batches for a single rank.
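A sketch of the arithmetic this helper plausibly performs, assuming samples are split evenly across ranks; whether a trailing partial batch is counted (this sketch rounds up) depends on the actual implementation:

```python
import math


def local_num_batches_from_num_samples(
    num_ranks: int, global_num_samples: int, local_micro_batch_size: int
) -> int:
    # Samples assigned to one rank, then grouped into local micro-batches.
    local_num_samples = global_num_samples // num_ranks
    return math.ceil(local_num_samples / local_micro_batch_size)


# e.g. 10,000 samples on 4 ranks with a local micro batch size of 8 -> 313 local batches
print(local_num_batches_from_num_samples(4, 10_000, 8))
```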
- static get_local_num_batches_from_num_tokens(num_ranks, global_num_tokens, sequence_length, local_micro_batch_size)[source]
Calculates the number of local batches for each rank, given the global number of tokens and number of ranks. This helper function is primarily used to calculate a dataloader’s number of batches (total and to skip).
- Return type:
int
- Args:
num_ranks (int): Global number of ranks.
global_num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
local_micro_batch_size (int): Local micro batch size on a single rank.
- Returns:
int: Number of local batches for a single rank.
- static get_num_samples_from_num_tokens(num_tokens, sequence_length)[source]
Calculates the number of samples given the global number of tokens and sequence length.
- Args:
num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
- Returns:
int: Number of samples.
- static get_num_seen_steps_from_checkpoint_path(checkpoint_path)[source]
Returns the number of seen steps from the checkpoint path.
- Args:
checkpoint_path (Path): Path to the checkpoint file.
- Returns:
int: Number of seen steps from the checkpoint path.
- static get_num_steps_from_num_samples(dp_degree, local_micro_batch_size, global_num_samples, gradient_accumulation_steps)[source]
Calculates the number of steps given the global number of samples, local micro batch size, number of data parallel ranks and gradient accumulation steps.
- Return type:
int
- Args:
dp_degree (int): Number of data parallel ranks.
local_micro_batch_size (int): Local micro batch size on a single rank.
global_num_samples (int): Global number of samples.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
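A sketch of the step arithmetic described above; dropping incomplete trailing steps is an assumption about the rounding behaviour:

```python
def num_steps_from_num_samples(
    dp_degree: int,
    local_micro_batch_size: int,
    global_num_samples: int,
    gradient_accumulation_steps: int,
) -> int:
    # One optimizer step consumes dp_degree * local_micro_batch_size
    # * gradient_accumulation_steps samples; incomplete steps are dropped.
    samples_per_step = dp_degree * local_micro_batch_size * gradient_accumulation_steps
    return global_num_samples // samples_per_step


# e.g. 1,000,000 samples, 8 DP ranks, micro-batch 4, 2 accumulation steps -> 15,625 steps
print(num_steps_from_num_samples(8, 4, 1_000_000, 2))
```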
- static get_num_steps_from_num_tokens(dp_degree, local_micro_batch_size, global_num_tokens, sequence_length, gradient_accumulation_steps)[source]
Calculates the number of steps given the global number of tokens, sequence length, local micro batch size, number of data parallel ranks and gradient accumulation steps.
- Return type:
int
- Args:
dp_degree (int): Number of data parallel ranks.
local_micro_batch_size (int): Local micro batch size on a single rank.
global_num_tokens (int): Global number of tokens.
sequence_length (int): Sequence length of the model.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
- static get_num_steps_from_raw_dataset_index(raw_index_path, num_ranks, local_micro_batch_size, gradient_accumulation_steps)[source]
Get the number of steps from the raw index, number of ranks, local micro batch size and gradient accumulation steps. The index is a list of tuples where each tuple contains the offset and length of a sample in the raw data. Note that the index is not packed, so the number of samples in the respective raw JSONL file is the same as the length of the index.
- Return type:
int
- Args:
raw_index_path (Path): Path to the raw index file of the JSONL dataset.
num_ranks (int): Global number of ranks.
local_micro_batch_size (int): Local micro batch size on a single rank.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of steps.
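A sketch of how this could be computed, assuming the index file is a pickled list of (offset, length) tuples as described above; the on-disk serialization format and the rounding behaviour are assumptions:

```python
import pickle
from pathlib import Path


def num_steps_from_raw_dataset_index(
    raw_index_path: Path,
    num_ranks: int,
    local_micro_batch_size: int,
    gradient_accumulation_steps: int,
) -> int:
    # One index entry per raw JSONL sample, so the index length equals the sample count.
    with raw_index_path.open("rb") as f:
        index = pickle.load(f)  # assumed serialization format
    samples_per_step = num_ranks * local_micro_batch_size * gradient_accumulation_steps
    return len(index) // samples_per_step
```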
- static get_num_tokens_from_num_steps(num_steps, dp_degree, local_micro_batch_size, sequence_length, gradient_accumulation_steps)[source]
Calculates the number of global tokens given the number of steps, number of data parallel ranks, local micro batch size, sequence length and gradient accumulation steps.
- Return type:
int
- Args:
num_steps (int): Number of steps.
dp_degree (int): Number of data parallel ranks.
local_micro_batch_size (int): Local micro batch size on a single rank.
sequence_length (int): Sequence length of the model.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of global tokens.
- static get_num_tokens_from_packed_mem_map_dataset_continuous(dataset_path, sequence_length, dp_degree, local_micro_batch_size, gradient_accumulation_steps, sample_key, reuse_last_target)[source]
Get the number of tokens in a tokenized dataset that will be effectively used during training. Due to the way the data is packed, batched and distributed, the number of tokens used during training might not be the same as the number of tokens in the dataset.
The number of tokens that are used during training is calculated as follows:
num_steps = num_dataset_tokens // sequence_length // dp_degree // local_micro_batch_size // gradient_accumulation_steps
global_num_tokens = num_steps * sequence_length * dp_degree * local_micro_batch_size * gradient_accumulation_steps
- Return type:
int
- Args:
dataset_path (Path): Path to the tokenized dataset.
sequence_length (int): Sequence length of the model.
dp_degree (int): Number of data parallel ranks.
local_micro_batch_size (int): Local micro batch size on a single rank.
gradient_accumulation_steps (int): Number of gradient accumulation steps.
- Returns:
int: Number of tokens that will be effectively used during training.
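The two formulas above translate directly into code; this standalone sketch assumes the total token count of the packed dataset is already known (reading it from the dataset file is not shown):

```python
def effective_num_tokens(
    num_dataset_tokens: int,
    sequence_length: int,
    dp_degree: int,
    local_micro_batch_size: int,
    gradient_accumulation_steps: int,
) -> int:
    # Integer divisions drop any partial step, exactly as in the formula above.
    num_steps = (
        num_dataset_tokens
        // sequence_length
        // dp_degree
        // local_micro_batch_size
        // gradient_accumulation_steps
    )
    # Multiply back to get the number of tokens actually consumed during training.
    return (
        num_steps
        * sequence_length
        * dp_degree
        * local_micro_batch_size
        * gradient_accumulation_steps
    )
```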
- class modalities.utils.number_conversion.NumberConversionFromCheckpointPathConfig(**data)[source]
Bases: BaseModel
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
checkpoint_path (Path)
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
modalities.utils.seeding module
modalities.utils.typing_utils module
modalities.utils.verify_tokenization_consistency module
- class modalities.utils.verify_tokenization_consistency.TokenizerTypes(value)[source]
Bases: Enum
- hugging_face = 'hugging_face'
- sentence_piece = 'sentence_piece'
- modalities.utils.verify_tokenization_consistency.build_hf_tokenization_components(tokenizer_path_or_name, eod_token)[source]
- modalities.utils.verify_tokenization_consistency.build_sp_tokenization_components(tokenizer_path, eod_token)[source]
- modalities.utils.verify_tokenization_consistency.verify_tokenization_consistency(src_path, eod_token, eod_token_id, tokenizer, tokenizer_config, jsonl_text_key)[source]
Verifies that the indexation and tokenization are consistent. This function applies the indexation and tokenization routines and then verifies that the index always captures entire samples and that the tokens in the JSON are correctly determined. For an example verification, check out the test_end_to_end_indexation_and_tokenization_consistency test.
- Args:
src_path (Path): Path to the JSONL file.
eod_token (str): End of document token.
eod_token_id (int): The token id of the end of document token.
tokenizer (Callable[[str], list[int]]): Callable executing the tokenization.
tokenizer_config (dict): Tokenizer config (same as used in the tokenization entry point).
jsonl_text_key (str): The key mapping to the text of interest in each JSON file.
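A usage sketch with a Hugging Face tokenizer wrapped into the expected Callable[[str], list[int]]; the file path, text key and the empty tokenizer_config are placeholders, not values prescribed by the function:

```python
from pathlib import Path

from transformers import AutoTokenizer

from modalities.utils.verify_tokenization_consistency import verify_tokenization_consistency

hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")
eod_token = "<|endoftext|>"
eod_token_id = hf_tokenizer.convert_tokens_to_ids(eod_token)

verify_tokenization_consistency(
    src_path=Path("data/train.jsonl"),  # hypothetical JSONL file
    eod_token=eod_token,
    eod_token_id=eod_token_id,
    tokenizer=lambda text: hf_tokenizer(text)["input_ids"],
    tokenizer_config={},  # placeholder; the real schema matches the tokenization entry point
    jsonl_text_key="text",
)
```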