modalities package
Subpackages
- modalities.checkpointing package
- Subpackages
- Submodules
- modalities.checkpointing.checkpoint_conversion module
- modalities.checkpointing.checkpoint_loading module
- modalities.checkpointing.checkpoint_saving module
- modalities.checkpointing.checkpoint_saving_execution module
- modalities.checkpointing.checkpoint_saving_instruction module
- modalities.checkpointing.checkpoint_saving_strategies module
- Module contents
- modalities.config package
- Submodules
- modalities.config.component_factory module
- modalities.config.config module
- ActivationCheckpointedModelConfig, ActivationCheckpointedModelConfig.FullACParams, ActivationCheckpointedModelConfig.SelectiveLayerACParams, ActivationCheckpointedModelConfig.SelectiveOpACParams, ActivationCheckpointedModelConfig.ac_fun_params, ActivationCheckpointedModelConfig.ac_variant, ActivationCheckpointedModelConfig.layers_fqn, ActivationCheckpointedModelConfig.model, ActivationCheckpointedModelConfig.model_config
- AdamOptimizerConfig, AdamWOptimizerConfig, BatchSamplerConfig, CLMCrossEntropyLossConfig, CheckpointSavingConfig, CombinedDatasetConfig, CompiledModelConfig, ConstantLRSchedulerConfig, CosineAnnealingLRSchedulerConfig, DCPAppStateConfig, DCPCheckpointLoadingConfig, DCPCheckpointSavingConfig, DebuggingEnrichedModelConfig, DistributedSamplerConfig, DummyLRSchedulerConfig, DummyProgressSubscriberConfig, DummyResultSubscriberConfig, EvaluationResultToDiscSubscriberConfig, FSDP1ActivationCheckpointedModelConfig, FSDP1CheckpointLoadingConfig, FSDP1CheckpointLoadingConfig.block_names, FSDP1CheckpointLoadingConfig.global_rank, FSDP1CheckpointLoadingConfig.mixed_precision_settings, FSDP1CheckpointLoadingConfig.model_config, FSDP1CheckpointLoadingConfig.parse_mixed_precision_setting_by_name(), FSDP1CheckpointLoadingConfig.parse_sharding_strategy_by_name(), FSDP1CheckpointLoadingConfig.sharding_strategy
- FSDP1CheckpointSavingConfig, FSDP1CheckpointedModelConfig, FSDP1CheckpointedOptimizerConfig, FSDP2WrappedModelConfig, FSDP2WrappedModelConfig.block_names, FSDP2WrappedModelConfig.device_mesh, FSDP2WrappedModelConfig.mixed_precision_settings, FSDP2WrappedModelConfig.model, FSDP2WrappedModelConfig.model_config, FSDP2WrappedModelConfig.reshard_after_forward, FSDP2WrappedModelConfig.validate_dp_mesh_existence(), FSDP2WrappedModelConfig.validate_mixed_precision_settings()
- FSDPWrappedModelConfig, FSDPWrappedModelConfig.block_names, FSDPWrappedModelConfig.mixed_precision_settings, FSDPWrappedModelConfig.model, FSDPWrappedModelConfig.model_config, FSDPWrappedModelConfig.parse_mixed_precision_setting_by_name(), FSDPWrappedModelConfig.parse_sharding_strategy_by_name(), FSDPWrappedModelConfig.sharding_strategy, FSDPWrappedModelConfig.sync_module_states
- GPT2LLMCollateFnConfig, GPT2MFUCalculatorConfig, GPT2ModelTPConfig, LLMDataLoaderConfig, LinearLRSchedulerConfig, MemMapDatasetConfig, OneCycleLRSchedulerConfig, OneCycleLRSchedulerConfig.anneal_strategy, OneCycleLRSchedulerConfig.base_momentum, OneCycleLRSchedulerConfig.check_totals_steps_and_epchs(), OneCycleLRSchedulerConfig.cycle_momentum, OneCycleLRSchedulerConfig.div_factor, OneCycleLRSchedulerConfig.epochs, OneCycleLRSchedulerConfig.final_div_factor, OneCycleLRSchedulerConfig.last_epoch, OneCycleLRSchedulerConfig.max_lr, OneCycleLRSchedulerConfig.max_momentum, OneCycleLRSchedulerConfig.model_config, OneCycleLRSchedulerConfig.optimizer, OneCycleLRSchedulerConfig.pct_start, OneCycleLRSchedulerConfig.steps_per_epoch, OneCycleLRSchedulerConfig.three_phase, OneCycleLRSchedulerConfig.total_steps
- PackedMemMapDatasetContinuousConfig, PackedMemMapDatasetMegatronConfig, ParallelDegreeConfig, PassType, PreTrainedHFTokenizerConfig, PreTrainedSPTokenizerConfig, PrecisionEnum, ProcessGroupBackendType, RawAppStateConfig, ReferenceConfig, ResumableDistributedSamplerConfig, ResumableDistributedSamplerConfig.dataset, ResumableDistributedSamplerConfig.drop_last, ResumableDistributedSamplerConfig.epoch, ResumableDistributedSamplerConfig.model_config, ResumableDistributedSamplerConfig.num_replicas, ResumableDistributedSamplerConfig.rank, ResumableDistributedSamplerConfig.seed, ResumableDistributedSamplerConfig.shuffle, ResumableDistributedSamplerConfig.skip_num_global_samples
- RichProgressSubscriberConfig, RichResultSubscriberConfig, SaveEveryKStepsCheckpointingStrategyConfig, SaveKMostRecentCheckpointsStrategyConfig, SequentialSamplerConfig, StepLRSchedulerConfig, TokenizerTypes, TorchCheckpointLoadingConfig, WandBEvaluationResultSubscriberConfig, WandBEvaluationResultSubscriberConfig.config_file_path, WandBEvaluationResultSubscriberConfig.directory, WandBEvaluationResultSubscriberConfig.experiment_id, WandBEvaluationResultSubscriberConfig.global_rank, WandBEvaluationResultSubscriberConfig.mode, WandBEvaluationResultSubscriberConfig.model_config, WandBEvaluationResultSubscriberConfig.project
- WandbMode, WeightInitializedModelConfig, load_app_config_dict()
- modalities.config.instantiation_models module
- ConsistencyEnforcement, CudaEnvSettings, InstructionTuningDataInstantiationModel, InstructionTuningDataInstantiationModel.InstructionDataTransformation, InstructionTuningDataInstantiationModel.Settings, InstructionTuningDataInstantiationModel.chat_template_data, InstructionTuningDataInstantiationModel.instruction_data_transformation, InstructionTuningDataInstantiationModel.jinja2_chat_template, InstructionTuningDataInstantiationModel.model_config, InstructionTuningDataInstantiationModel.settings
- Intervals, PackedDatasetComponentsInstantiationModel, SplitConfig, Splitting, StepProfile, TextGenerationInstantiationModel, TrainingComponentsInstantiationModel, TrainingComponentsInstantiationModel.Settings, TrainingComponentsInstantiationModel.app_state, TrainingComponentsInstantiationModel.checkpoint_saving, TrainingComponentsInstantiationModel.device_mesh, TrainingComponentsInstantiationModel.eval_dataloaders, TrainingComponentsInstantiationModel.evaluation_subscriber, TrainingComponentsInstantiationModel.gradient_clipper, TrainingComponentsInstantiationModel.loss_fn, TrainingComponentsInstantiationModel.mfu_calculator, TrainingComponentsInstantiationModel.model_config, TrainingComponentsInstantiationModel.model_raw, TrainingComponentsInstantiationModel.progress_subscriber, TrainingComponentsInstantiationModel.scheduled_pipeline, TrainingComponentsInstantiationModel.settings, TrainingComponentsInstantiationModel.train_dataloader, TrainingComponentsInstantiationModel.train_dataset
- TrainingProgress, TrainingReportGenerator, TrainingTarget
- modalities.config.lookup_enum module
- modalities.config.pydantic_if_types module
- modalities.config.utils module
- Module contents
- modalities.conversion package
- Subpackages
- modalities.conversion.gpt2 package
- Submodules
- modalities.conversion.gpt2.configuration_gpt2 module
- modalities.conversion.gpt2.conversion_code module
- modalities.conversion.gpt2.conversion_model module
- modalities.conversion.gpt2.conversion_tokenizer module
- modalities.conversion.gpt2.convert_gpt2 module
- modalities.conversion.gpt2.modeling_gpt2 module
- Module contents
- Subpackages
- modalities.dataloader package
- Subpackages
- Submodules
- modalities.dataloader.apply_chat_template module
- modalities.dataloader.create_index module
- modalities.dataloader.create_instruction_tuning_data module
- modalities.dataloader.create_packed_data module
- modalities.dataloader.dataloader module
- modalities.dataloader.dataloader_factory module
- modalities.dataloader.dataset module
- CombinedDataset, Dataset, DummyDataset, DummyDatasetConfig, DummySampleConfig, DummySampleDataType, MemMapDataset, PackedMemMapDatasetBase, PackedMemMapDatasetBase.DATA_SECTION_LENGTH_IN_BYTES, PackedMemMapDatasetBase.HEADER_SIZE_IN_BYTES, PackedMemMapDatasetBase.TOKEN_SIZE_DESCRIPTOR_LENGTH_IN_BYTES, PackedMemMapDatasetBase.np_dtype_of_tokens_on_disk_from_bytes, PackedMemMapDatasetBase.token_size_in_bytes, PackedMemMapDatasetBase.type_converter_for_torch
- PackedMemMapDatasetContinuous, PackedMemMapDatasetMegatron
- modalities.dataloader.dataset_factory module
- modalities.dataloader.filter_packed_data module
- modalities.dataloader.large_file_lines_reader module
- modalities.dataloader.sampler_factory module
- ResumableDistributedMultiDimSamplerConfig, ResumableDistributedMultiDimSamplerConfig.data_parallel_key, ResumableDistributedMultiDimSamplerConfig.dataset, ResumableDistributedMultiDimSamplerConfig.device_mesh, ResumableDistributedMultiDimSamplerConfig.drop_last, ResumableDistributedMultiDimSamplerConfig.epoch, ResumableDistributedMultiDimSamplerConfig.model_config, ResumableDistributedMultiDimSamplerConfig.seed, ResumableDistributedMultiDimSamplerConfig.shuffle, ResumableDistributedMultiDimSamplerConfig.skip_num_global_samples
SamplerFactory
- modalities.dataloader.samplers module
- Module contents
- modalities.inference package
- modalities.logging_broker package
- modalities.models package
- Subpackages
- Submodules
- modalities.models.model module
- modalities.models.model_factory module
- GPT2ModelFactory, ModelFactory, ModelFactory.get_activation_checkpointed_fsdp1_model_(), ModelFactory.get_activation_checkpointed_fsdp2_model_(), ModelFactory.get_compiled_model(), ModelFactory.get_debugging_enriched_model(), ModelFactory.get_fsdp1_checkpointed_model(), ModelFactory.get_fsdp1_wrapped_model(), ModelFactory.get_fsdp2_wrapped_model(), ModelFactory.get_weight_initialized_model()
- modalities.models.utils module
- Module contents
- modalities.nn package
- modalities.optimizers package
- modalities.preprocessing package
- modalities.registry package
- modalities.running_env package
- modalities.tokenization package
- modalities.training package
- Subpackages
- Submodules
- modalities.training.training_progress module
- TrainingProgress, TrainingProgress.num_seen_steps_current_run, TrainingProgress.num_seen_steps_previous_run, TrainingProgress.num_seen_steps_total, TrainingProgress.num_seen_tokens_current_run, TrainingProgress.num_seen_tokens_previous_run, TrainingProgress.num_seen_tokens_total, TrainingProgress.num_target_steps, TrainingProgress.num_target_tokens
- Module contents
- modalities.utils package
- Subpackages
- modalities.utils.benchmarking package
- modalities.utils.profilers package
- Submodules
- modalities.utils.profilers.batch_generator module
- modalities.utils.profilers.modalities_profiler module
- modalities.utils.profilers.steppable_component_configs module
- modalities.utils.profilers.steppable_components module
- modalities.utils.profilers.steppable_components_if module
- Module contents
- Submodules
- modalities.utils.communication_test module
- modalities.utils.debug module
- modalities.utils.debug_components module
- modalities.utils.debugging_configs module
- modalities.utils.file_ops module
- modalities.utils.logger_utils module
- modalities.utils.mfu module
- modalities.utils.number_conversion module
- LocalNumBatchesFromNumSamplesConfig, LocalNumBatchesFromNumTokensConfig, NumSamplesFromNumTokensConfig, NumStepsFromNumSamplesConfig, NumStepsFromNumTokensConfig, NumStepsFromRawDatasetIndexConfig, NumTokensFromNumStepsConfig, NumTokensFromPackedMemMapDatasetContinuousConfig, NumTokensFromPackedMemMapDatasetContinuousConfig.dataset_path, NumTokensFromPackedMemMapDatasetContinuousConfig.dp_degree, NumTokensFromPackedMemMapDatasetContinuousConfig.gradient_accumulation_steps, NumTokensFromPackedMemMapDatasetContinuousConfig.local_micro_batch_size, NumTokensFromPackedMemMapDatasetContinuousConfig.model_config, NumTokensFromPackedMemMapDatasetContinuousConfig.reuse_last_target, NumTokensFromPackedMemMapDatasetContinuousConfig.sample_key, NumTokensFromPackedMemMapDatasetContinuousConfig.sequence_length
- NumberConversion, NumberConversion.get_global_num_seen_tokens_from_checkpoint_path(), NumberConversion.get_global_num_target_tokens_from_checkpoint_path(), NumberConversion.get_last_step_from_checkpoint_path(), NumberConversion.get_local_num_batches_from_num_samples(), NumberConversion.get_local_num_batches_from_num_tokens(), NumberConversion.get_num_samples_from_num_tokens(), NumberConversion.get_num_seen_steps_from_checkpoint_path(), NumberConversion.get_num_steps_from_num_samples(), NumberConversion.get_num_steps_from_num_tokens(), NumberConversion.get_num_steps_from_raw_dataset_index(), NumberConversion.get_num_target_steps_from_checkpoint_path(), NumberConversion.get_num_tokens_from_num_steps(), NumberConversion.get_num_tokens_from_packed_mem_map_dataset_continuous()
NumberConversionFromCheckpointPathConfig
- modalities.utils.seeding module
- modalities.utils.typing_utils module
- modalities.utils.verify_tokenization_consistency module
- Module contents
Submodules
modalities.api module
- class modalities.api.FileExistencePolicy(value)[source]
Bases: Enum
- ERROR = 'error'
- OVERRIDE = 'override'
- SKIP = 'skip'
- modalities.api.convert_pytorch_to_hf_checkpoint(config_file_path, output_hf_checkpoint_dir, prediction_key)[source]
Converts a PyTorch checkpoint to a Hugging Face checkpoint.
- Return type:
HFModelAdapter
- Parameters:
config_file_path (Path)
output_hf_checkpoint_dir (Path)
prediction_key (str)
- Args:
config_file_path (Path): Path to the config that generated the PyTorch checkpoint.
output_hf_checkpoint_dir (Path): Path to the output directory for the converted HF checkpoint.
prediction_key (str): The key in the model's output where one can find the predictions of interest.
- Returns:
HFModelAdapter: The Hugging Face model adapter.
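A minimal usage sketch; the config path, output directory, and prediction key below are hypothetical placeholders:

    from pathlib import Path

    from modalities.api import convert_pytorch_to_hf_checkpoint

    # Hypothetical paths and prediction key; adjust to the config that produced the checkpoint.
    hf_adapter = convert_pytorch_to_hf_checkpoint(
        config_file_path=Path("configs/pretraining_config.yaml"),
        output_hf_checkpoint_dir=Path("checkpoints/hf_converted"),
        prediction_key="logits",
    )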
- modalities.api.create_filtered_tokenized_dataset(input_data_path, filter_routine, output_data_path, file_existence_policy)[source]
- modalities.api.create_raw_data_index(src_path, index_path, file_existence_policy=FileExistencePolicy.ERROR)[source]
Creates the index file for the content of a large jsonl file. The index file contains the byte offsets and lengths of each line in the jsonl file. The rationale is to be able to further process the file without loading it entirely, while splitting its content line by line. This step is necessary before further processing such as tokenization. It is only required once per jsonl file and therefore allows different tokenizations without re-indexing.
- Args:
src_path (Path): The path to the jsonl file.
index_path (Path): The path to the index file that will be created.
file_existence_policy (FileExistencePolicy): Policy to apply when the index file already exists. Defaults to FileExistencePolicy.ERROR.
- Raises:
ValueError: If the index file already exists.
- Parameters:
src_path (Path)
index_path (Path)
file_existence_policy (FileExistencePolicy)
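A minimal usage sketch with hypothetical file locations (the .idx suffix for the index file is an assumption, not a requirement of the API):

    from pathlib import Path

    from modalities.api import FileExistencePolicy, create_raw_data_index

    # Hypothetical locations; the index only needs to be created once per jsonl file.
    create_raw_data_index(
        src_path=Path("data/corpus.jsonl"),
        index_path=Path("data/corpus.idx"),  # .idx suffix is an assumption
        file_existence_policy=FileExistencePolicy.SKIP,  # do nothing if the index already exists
    )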
- modalities.api.create_shuffled_dataset_chunk(file_path_list, output_chunk_file_path, chunk_id, num_chunks, file_existence_policy, global_seed=None)[source]
Creates a shuffled dataset chunk. Given a dataset consisting of multiple tokenized pbin files, this function creates a shuffled dataset chunk for a given chunk id. From each tokenized pbin file, the respective chunk is extracted, shuffled and written to a new pbin file.
- Args:
file_path_list (list[Path]): List of paths to the tokenized input pbin files.
output_chunk_file_path (Path): Path to the output chunk, which will be stored in pbin format.
chunk_id (int): The id of the chunk to create.
num_chunks (int): The total number of chunks to create.
file_existence_policy (FileExistencePolicy): Policy to apply when the output chunk file already exists.
global_seed (Optional[int]): The global seed to use for shuffling.
- Raises:
ValueError: If the chunk has no samples.
- modalities.api.create_shuffled_jsonl_dataset_chunk(file_path_list, output_chunk_file_path, chunk_id, num_chunks, file_existence_policy, global_seed=None)[source]
Creates a shuffled jsonl dataset chunk. Given a dataset consisting of multiple jsonl files, this function creates a shuffled dataset chunk for a given chunk id. From each jsonl file, the respective chunk is extracted, shuffled and written to a new jsonl file.
- Args:
file_path_list (list[Path]): List of paths to the input jsonl files.
output_chunk_file_path (Path): Path to the output chunk, which will be stored in jsonl format.
chunk_id (int): The id of the chunk to create.
num_chunks (int): The total number of chunks to create.
file_existence_policy (FileExistencePolicy): Policy to apply when the output chunk file already exists.
global_seed (Optional[int]): The global seed to use for shuffling.
- Raises:
ValueError: If the chunk has no samples.
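A minimal sketch that creates all chunks of a hypothetical two-file tokenized dataset; the file names, number of chunks, and seed are placeholders:

    from pathlib import Path

    from modalities.api import FileExistencePolicy, create_shuffled_dataset_chunk

    tokenized_files = [Path("data/part_0.pbin"), Path("data/part_1.pbin")]  # hypothetical inputs
    num_chunks = 4

    for chunk_id in range(num_chunks):
        create_shuffled_dataset_chunk(
            file_path_list=tokenized_files,
            output_chunk_file_path=Path(f"data/shuffled/chunk_{chunk_id}.pbin"),
            chunk_id=chunk_id,
            num_chunks=num_chunks,
            file_existence_policy=FileExistencePolicy.ERROR,
            global_seed=42,
        )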
- modalities.api.enforce_file_existence_policy(file_path, file_existence_policy)[source]
Enforces the file existence policy. Returns True if processing should be stopped, otherwise False.
- Return type:
bool
- Parameters:
file_path (Path)
file_existence_policy (FileExistencePolicy)
- Args:
file_path (Path): File path to the file to check.
file_existence_policy (FileExistencePolicy): The file existence policy.
- Raises:
ValueError: Raised if the file existence policy is unknown or the policy requires to raise a ValueError.
- Returns:
bool: True if processing should be stopped, otherwise False.
- modalities.api.generate_text(config_file_path)[source]
Inference function to generate text with a given model.
- Args:
config_file_path (FilePath): Path to the YAML config file.
- Parameters:
config_file_path (Annotated[Path, PathType(path_type=file)])
- modalities.api.merge_packed_data_files(src_paths, target_path)[source]
Utility function for merging different pbin files into one. This is especially useful if different datasets were created at different points in time, or if one encoding run takes so long that the overall process was done in chunks. It is important that the same tokenizer was used for all chunks.
Specify an arbitrary number of pbin files and/or directories containing such files as input.
- Args:
src_paths (list[Path]): List of paths to the pbin files or directories containing such files.
target_path (Path): The path to the merged pbin file that will be created.
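A minimal usage sketch with hypothetical input and output paths (all inputs are assumed to have been produced with the same tokenizer):

    from pathlib import Path

    from modalities.api import merge_packed_data_files

    # Hypothetical chunked outputs; directories containing pbin files may be passed as well.
    merge_packed_data_files(
        src_paths=[Path("data/shuffled/chunk_0.pbin"), Path("data/shuffled/chunk_1.pbin")],
        target_path=Path("data/merged_corpus.pbin"),
    )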
- modalities.api.pack_encoded_data(config_dict, file_existence_policy)[source]
Packs and encodes an indexed, large jsonl file (see also create_index for more information). Produces a .pbin file, which can be used directly in a training process and no longer requires the original jsonl file or the respective index file.
- Args:
config_dict (dict): Dictionary containing the configuration for the packed data generation.
file_existence_policy (FileExistencePolicy): Policy to apply when the output file already exists.
- Parameters:
config_dict (dict)
file_existence_policy (FileExistencePolicy)
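A minimal sketch, assuming the configuration is read from a hypothetical YAML file; the expected keys of config_dict follow the packed-data generation schema (see modalities.config.instantiation_models) and are not spelled out here:

    from pathlib import Path

    import yaml

    from modalities.api import FileExistencePolicy, pack_encoded_data

    # Hypothetical config file; its keys must follow the packed-data generation schema.
    config_dict = yaml.safe_load(Path("configs/tokenization_config.yaml").read_text())

    pack_encoded_data(
        config_dict=config_dict,
        file_existence_policy=FileExistencePolicy.OVERRIDE,  # replace an existing .pbin output
    )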
- modalities.api.shuffle_jsonl_data(input_data_path, output_data_path, file_existence_policy, seed=None)[source]
Shuffles a JSONL file (.jsonl) and stores it on disc.
- Args:
input_data_path (Path): File path to the jsonl data (.jsonl).
output_data_path (Path): File path to write the shuffled jsonl data.
file_existence_policy (FileExistencePolicy): Policy to apply when the output file already exists.
seed (Optional[int]): The seed to use for shuffling.
- Parameters:
input_data_path (Path)
output_data_path (Path)
file_existence_policy (FileExistencePolicy)
seed (int | None)
- modalities.api.shuffle_tokenized_data(input_data_path, output_data_path, batch_size, file_existence_policy, seed=None)[source]
Shuffles a tokenized file (.pbin) and stores it on disc.
- Args:
input_data_path (Path): File path to the tokenized data (.pbin).
output_data_path (Path): File path to write the shuffled tokenized data.
batch_size (int): Number of documents to process per batch.
file_existence_policy (FileExistencePolicy): Policy to apply when the output file already exists.
seed (Optional[int]): The seed to use for shuffling.
- Parameters:
input_data_path (Path)
output_data_path (Path)
batch_size (int)
file_existence_policy (FileExistencePolicy)
seed (int | None)
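A minimal usage sketch with hypothetical paths and an arbitrary batch size:

    from pathlib import Path

    from modalities.api import FileExistencePolicy, shuffle_tokenized_data

    # Hypothetical paths; batch_size controls how many documents are processed per batch.
    shuffle_tokenized_data(
        input_data_path=Path("data/merged_corpus.pbin"),
        output_data_path=Path("data/merged_corpus_shuffled.pbin"),
        batch_size=1024,
        file_existence_policy=FileExistencePolicy.ERROR,
        seed=42,
    )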
modalities.batch module
- class modalities.batch.Batch[source]
Bases: ABC
Abstract class that defines the necessary methods any Batch implementation needs to implement.
- class modalities.batch.DatasetBatch(samples, targets, batch_dim=0)[source]
Bases: Batch, TorchDeviceMixin
A batch of samples and its targets. Used to batch train a model.
- class modalities.batch.EvaluationResultBatch(dataloader_tag, num_train_steps_done, losses=<factory>, metrics=<factory>, throughput_metrics=<factory>)[source]
Bases: Batch
Data class for storing the results of a single batch or multiple batches. Entire-epoch results can also be stored here.
- Parameters:
dataloader_tag (str)
num_train_steps_done (int)
losses (dict[str, ResultItem])
metrics (dict[str, ResultItem])
throughput_metrics (dict[str, ResultItem])
- losses: dict[str, ResultItem]
- metrics: dict[str, ResultItem]
- throughput_metrics: dict[str, ResultItem]
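A minimal construction sketch; the dataloader tag is a placeholder, and the result dictionaries are left at their dataclass factory defaults (presumably empty dicts) to be filled with ResultItem values afterwards:

    from modalities.batch import EvaluationResultBatch

    # losses, metrics and throughput_metrics have factory defaults and can be omitted here.
    result_batch = EvaluationResultBatch(
        dataloader_tag="val_dataloader",  # hypothetical tag
        num_train_steps_done=1000,
    )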
- class modalities.batch.InferenceResultBatch(targets, predictions, batch_dim=0)[source]
Bases: Batch, TorchDeviceMixin
Stores targets and predictions of an entire batch.
modalities.evaluator module
- class modalities.evaluator.Evaluator(progress_publisher, evaluation_result_publisher)[source]
Bases: object
Evaluator class responsible for evaluating the model on a set of datasets.
Initializes the Evaluator class.
- Args:
progress_publisher (MessagePublisher[ProgressUpdate]): Publisher for progress updates.
evaluation_result_publisher (MessagePublisher[EvaluationResultBatch]): Publisher for evaluation results.
- Parameters:
progress_publisher (MessagePublisher[ProgressUpdate])
evaluation_result_publisher (MessagePublisher[EvaluationResultBatch])
- evaluate(model, data_loaders, loss_fun, num_train_steps_done, scheduled_pipeline=None)[source]
Evaluate the model on a set of datasets.
- Return type:
dict[str, EvaluationResultBatch]
- Parameters:
model (Module)
data_loaders (list[LLMDataLoader])
loss_fun (Callable[[InferenceResultBatch], Tensor])
num_train_steps_done (int)
scheduled_pipeline (Pipeline | None)
- Args:
model (nn.Module): The model to evaluate.
data_loaders (list[LLMDataLoader]): List of dataloaders to evaluate the model on.
loss_fun (Callable[[InferenceResultBatch], torch.Tensor]): The loss function to calculate the loss.
num_train_steps_done (int): The number of training steps done so far, for logging purposes.
scheduled_pipeline (Pipeline | None, optional): In case of pipeline parallelism, this is used to operate the model. Defaults to None.
- Returns:
dict[str, EvaluationResultBatch]: A dictionary containing the evaluation results for each dataloader
- evaluate_batch(batch, model, loss_fun, scheduled_pipeline=None)[source]
Evaluate a single batch by forwarding it through the model and calculating the loss.
- Return type:
Tensor | None
- Parameters:
batch (DatasetBatch)
model (Module)
loss_fun (Callable[[InferenceResultBatch], Tensor])
scheduled_pipeline (Pipeline | None)
- Args:
batch (DatasetBatch): The batch to evaluate.
model (nn.Module): The model to evaluate.
loss_fun (Callable[[InferenceResultBatch], torch.Tensor]): The loss function to calculate the loss.
scheduled_pipeline (Pipeline | None, optional): In case of pipeline parallelism, this is used to operate the model. Defaults to None.
- Returns:
torch.Tensor | None: The loss of the batch, or None if a non-last stage was processed in pipeline parallelism.
modalities.exceptions module
modalities.gym module
- class modalities.gym.Gym(trainer, evaluator, loss_fun, num_ranks)[source]
Bases: object
Class to perform the model training, including evaluation and checkpointing.
Initializes a Gym object.
- Args:
trainer (Trainer): Trainer object to perform the training.
evaluator (Evaluator): Evaluator object to perform the evaluation.
loss_fun (Loss): Loss function applied during training and evaluation.
num_ranks (int): Number of ranks used for distributed training.
- run(app_state, training_log_interval_in_steps, checkpointing_interval_in_steps, evaluation_interval_in_steps, train_data_loader, evaluation_data_loaders, checkpoint_saving, scheduled_pipeline=None)[source]
Runs the model training, including evaluation and checkpointing.
- Args:
app_state (AppState): Application state containing the model, optimizer and lr scheduler.
training_log_interval_in_steps (int): Interval in steps to log training progress.
checkpointing_interval_in_steps (int): Interval in steps to save checkpoints.
evaluation_interval_in_steps (int): Interval in steps to perform evaluation.
train_data_loader (LLMDataLoader): Data loader with the training data.
evaluation_data_loaders (list[LLMDataLoader]): List of data loaders with the evaluation data.
checkpoint_saving (CheckpointSaving): Routine for saving checkpoints.
scheduled_pipeline (Pipeline | None, optional): In case of pipeline parallelism, this is used to operate the model. Defaults to None.
- Parameters:
app_state (AppState)
training_log_interval_in_steps (int)
checkpointing_interval_in_steps (int)
evaluation_interval_in_steps (int)
train_data_loader (LLMDataLoader)
evaluation_data_loaders (list[LLMDataLoader])
checkpoint_saving (CheckpointSaving)
scheduled_pipeline (Pipeline | None)
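A minimal sketch of how a Gym could be driven; the trainer, evaluator, loss function, and remaining components are assumed to have been built from a config beforehand (e.g., via Main.build_components), and the interval values are placeholders:

    from modalities.evaluator import Evaluator
    from modalities.gym import Gym
    from modalities.trainer import Trainer


    def run_training(trainer: Trainer, evaluator: Evaluator, loss_fun, num_ranks: int,
                     app_state, train_dataloader, eval_dataloaders, checkpoint_saving) -> None:
        # All arguments are assumed to have been instantiated from a config beforehand.
        gym = Gym(trainer=trainer, evaluator=evaluator, loss_fun=loss_fun, num_ranks=num_ranks)
        gym.run(
            app_state=app_state,
            training_log_interval_in_steps=10,
            checkpointing_interval_in_steps=1000,
            evaluation_interval_in_steps=500,
            train_data_loader=train_dataloader,
            evaluation_data_loaders=eval_dataloaders,
            checkpoint_saving=checkpoint_saving,
        )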
modalities.loss_functions module
- class modalities.loss_functions.CLMCrossEntropyLoss(target_key, prediction_key, tag='CLMCrossEntropyLoss')[source]
Bases: Loss
- class modalities.loss_functions.NCELoss(prediction_key1, prediction_key2, is_asymmetric=True, temperature=1.0, tag='NCELoss')[source]
Bases: Loss
Noise Contrastive Estimation Loss.
- Args:
prediction_key1 (str): Key to access embedding 1.
prediction_key2 (str): Key to access embedding 2.
is_asymmetric (bool, optional): Specifies symmetric or asymmetric calculation of the NCE loss. Defaults to True.
temperature (float, optional): Temperature. Defaults to 1.0.
tag (str, optional): Defaults to "NCELoss".
- modalities.loss_functions.nce_loss(embedding1, embedding2, device, is_asymmetric, temperature)[source]
This implementation calculates the noise contrastive estimation loss between embeddings of two different modalities. The implementation is slightly adapted from https://arxiv.org/pdf/1912.06430.pdf and https://github.com/antoine77340/MIL-NCE_HowTo100M; changes include adding a temperature value and the choice of calculating the loss asymmetrically w.r.t. one modality. It is also adapted to the contrastive loss of the CoCa model (https://arxiv.org/pdf/2205.01917.pdf).
- Return type:
Tensor
- Parameters:
embedding1 (Tensor)
embedding2 (Tensor)
device (device)
is_asymmetric (bool)
temperature (float)
- Args:
embedding1 (torch.Tensor): Embeddings from modality 1 of size batch_size x embed_dim.
embedding2 (torch.Tensor): Embeddings from modality 2 of size batch_size x embed_dim.
device (torch.device): Torch device for calculating the loss.
is_asymmetric (bool): Boolean value to specify if the loss is calculated in one direction or both directions.
temperature (float): Temperature value for regulating the loss.
- Returns:
torch.Tensor: loss tensor.
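A toy usage sketch with random embeddings (shapes, seed-free values, and the CPU device are arbitrary choices):

    import torch

    from modalities.loss_functions import nce_loss

    # Random embeddings of shape (batch_size, embed_dim); values are arbitrary.
    embedding1 = torch.randn(8, 256)
    embedding2 = torch.randn(8, 256)

    loss = nce_loss(
        embedding1=embedding1,
        embedding2=embedding2,
        device=torch.device("cpu"),
        is_asymmetric=True,
        temperature=1.0,
    )
    print(loss.item())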
modalities.main module
- class modalities.main.Main(config_path, additional_resolver_funs=None, experiment_id=None)[source]
Bases: object
Main class that orchestrates the training process.
- Parameters:
- add_custom_component(component_key, variant_key, custom_component, custom_config)[source]
Add a custom component to the registry.
This method comes in especially handy when Modalities is used as a library and the user wants to add custom components (e.g., custom model or custom loss function) to the registry.
- Return type:
- Parameters:
- Args:
component_key (str): Key of the component to be added to the registry.
variant_key (str): Key of the variant to be added to the registry.
custom_component (Type): The class type of the custom component.
custom_config (Type): The pydantic config type of the custom component.
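A minimal sketch of registering a custom component when Modalities is used as a library; the component key, variant key, config path, and custom classes below are hypothetical:

    from pathlib import Path

    from pydantic import BaseModel

    from modalities.main import Main


    class MyCustomLoss:
        # Hypothetical custom loss component (a real one would implement the Loss interface).
        def __init__(self, target_key: str, prediction_key: str):
            self.target_key = target_key
            self.prediction_key = prediction_key


    class MyCustomLossConfig(BaseModel):
        # Hypothetical pydantic config describing the constructor arguments above.
        target_key: str
        prediction_key: str


    main = Main(config_path=Path("configs/pretraining_config.yaml"))  # hypothetical config path
    main.add_custom_component(
        component_key="loss",            # assumed registry key for loss components
        variant_key="my_custom_loss",
        custom_component=MyCustomLoss,
        custom_config=MyCustomLossConfig,
    )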
- build_components(components_model_type)[source]
Given a pydantic basemodel, this method builds the components specified in the config file.
Depending on the use case (e.g., training, inference, etc.), the user can pass different pydantic base models. For instance, for tokenization, the basemodel would only have the tokenization-related components specified.
- Return type:
BaseModel
- Parameters:
components_model_type (Type[BaseModel])
- Args:
components_model_type (Type[BaseModel]): The pydantic basemodel type that should be used to build the components.
- Returns:
BaseModel: The components built based on the config file.
- get_logging_publishers(progress_subscriber, results_subscriber, global_rank, local_rank)[source]
Returns the logging publishers for the training.
These publishers are used to pass the evaluation results and the progress updates to the message broker. The message broker is then used to pass the messages to the subscribers, such as WandB.
- Return type:
tuple[MessagePublisher[EvaluationResultBatch], MessagePublisher[ProgressUpdate]]
- Parameters:
progress_subscriber (MessageSubscriberIF[ProgressUpdate])
results_subscriber (MessageSubscriberIF[EvaluationResultBatch])
global_rank (int)
local_rank (int)
- Args:
progress_subscriber (MessageSubscriberIF[ProgressUpdate]): The progress subscriber.
results_subscriber (MessageSubscriberIF[EvaluationResultBatch]): The results subscriber.
global_rank (int): The global rank of the current process.
local_rank (int): The local rank of the current process on the current node.
- Returns:
tuple[MessagePublisher[EvaluationResultBatch], MessagePublisher[ProgressUpdate]]: The evaluation result publisher and the progress publisher.
- run(components)[source]
Entrypoint for running the training process.
We pass in a TrainingComponentsInstantiationModel, which is a pydantic model that contains all the components needed for the training process.
- Args:
components (TrainingComponentsInstantiationModel): The components needed for the training process.
- Parameters:
components (TrainingComponentsInstantiationModel)
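A minimal end-to-end sketch of the library entrypoint, assuming a hypothetical config path and execution inside an appropriate distributed launch:

    from pathlib import Path

    from modalities.config.instantiation_models import TrainingComponentsInstantiationModel
    from modalities.main import Main

    # Hypothetical config path; typically run inside a torchrun / distributed launch.
    main = Main(config_path=Path("configs/pretraining_config.yaml"))
    components = main.build_components(components_model_type=TrainingComponentsInstantiationModel)
    main.run(components)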
modalities.trainer module
- class modalities.trainer.ThroughputAggregationKeys(value)[source]
Bases: Enum
- FORWARD_BACKWARD_TIME = 'FORWARD_BACKWARD_TIME'
- NUM_SAMPLES = 'NUM_SAMPLES'
- class modalities.trainer.Trainer(global_rank, progress_publisher, evaluation_result_publisher, gradient_acc_steps, global_num_tokens_per_train_step, device_mesh, num_seen_train_steps, global_num_seen_tokens, num_target_steps, num_target_tokens, gradient_clipper, mfu_calculator=None)[source]
Bases: object
Initializes the Trainer object.
- Args:
global_rank (int): The global rank.
progress_publisher (MessagePublisher[ProgressUpdate]): Progress publisher.
evaluation_result_publisher (MessagePublisher[EvaluationResultBatch]): Evaluation result publisher.
gradient_acc_steps (int): Gradient accumulation steps.
global_num_tokens_per_train_step (int): Global number of tokens per train step.
dp_degree (int): Data parallelism degree.
pp_degree (int): Pipeline parallelism degree.
num_seen_train_steps (int): Number of seen train steps.
global_num_seen_tokens (int): Global number of seen tokens.
num_target_steps (int): Number of target steps.
num_target_tokens (int): Number of target tokens.
gradient_clipper (GradientClipperIF): Gradient clipper.
mfu_calculator (Optional[MFUCalculatorABC]): MFU calculator.
- Returns:
None
- Parameters:
global_rank (int)
progress_publisher (MessagePublisher[ProgressUpdate])
evaluation_result_publisher (MessagePublisher[EvaluationResultBatch])
gradient_acc_steps (int)
global_num_tokens_per_train_step (int)
device_mesh (DeviceMesh | None)
num_seen_train_steps (int)
global_num_seen_tokens (int)
num_target_steps (int)
num_target_tokens (int)
gradient_clipper (GradientClipperIF)
mfu_calculator (MFUCalculatorABC | None)
- train(app_state, train_loader, loss_fun, training_log_interval_in_steps, evaluation_callback, checkpointing_callback, scheduled_pipeline=None)[source]
Trains the model.
- Args:
app_state (AppState): The application state containing the model, optimizer and lr scheduler.
train_loader (LLMDataLoader): The data loader containing the training data.
loss_fun (Loss): The loss function used for training.
training_log_interval_in_steps (int): The interval at which training progress is logged.
evaluation_callback (Callable[[TrainingProgress], None]): A callback function for evaluation.
checkpointing_callback (Callable[[TrainingProgress], None]): A callback function for checkpointing.
scheduled_pipeline (Pipeline | None, optional): In case of pipeline parallelism, this is used to operate the model. Defaults to None.
- Returns:
None
- Parameters:
app_state (AppState)
train_loader (LLMDataLoader)
loss_fun (Loss)
training_log_interval_in_steps (int)
evaluation_callback (Callable[[TrainingProgress], None])
checkpointing_callback (Callable[[TrainingProgress], None])
scheduled_pipeline (Pipeline | None)
modalities.util module
- class modalities.util.Aggregator[source]
Bases: Generic[T]
- class modalities.util.TimeRecorder[source]
Bases: object
Class with a context manager to record execution time.
- class modalities.util.TimeRecorderStates(value)[source]
Bases: Enum
- RUNNING = 'RUNNING'
- STOPPED = 'STOPPED'
- modalities.util.format_metrics_to_gb(item)[source]
Quick function to format numbers to gigabytes and round to 4-digit precision.
- modalities.util.get_experiment_id_from_config(config_file_path, hash_length=16)[source]
Creates an experiment ID including the date and time for file-save uniqueness, e.g. '2022-05-07__14-31-22_fdh1xaj2'.
- modalities.util.get_local_number_of_trainable_parameters(model)[source]
Returns the number of trainable parameters that are materialized on the current rank. The model can be sharded with FSDP1 or FSDP2 or not sharded at all.
- Args:
model (nn.Module): The model for which to calculate the number of trainable parameters.
- Returns:
int: The number of trainable parameters materialized on the current rank.
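A small, self-contained sketch on an unsharded toy model:

    import torch.nn as nn

    from modalities.util import get_local_number_of_trainable_parameters

    # Works for unsharded models as well as models sharded with FSDP1 or FSDP2.
    model = nn.Linear(10, 10)
    print(get_local_number_of_trainable_parameters(model))  # 110 parameters on this single rank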
- modalities.util.get_module_class_from_name(module, name)[source]
From the Accelerate source code (https://github.com/huggingface/accelerate/blob/1f7a79b428749f45187ec69485f2c966fe21926e/src/accelerate/utils/dataclasses.py#L1902). Gets a class from a module by its name.
- Args:
module (torch.nn.Module): The module to get the class from.
name (str): The name of the class.
- modalities.util.get_synced_experiment_id_of_run(config_file_path=None, hash_length=16, max_experiment_id_byte_length=1024)[source]
Creates a unique experiment ID for the current run on rank 0 and broadcasts it to all ranks. Internally, the experiment ID is generated by hashing the configuration file path and appending the current date and time. The experiment ID is then converted to a byte array (with a maximum length of max_experiment_id_byte_length) and broadcast to all ranks. In the unlikely case of the experiment ID being too long, a ValueError is raised and max_experiment_id_byte_length must be increased. Each rank then decodes the byte array to the original string representation and returns it. Having a globally synced experiment ID is mandatory for saving files / checkpoints in a distributed training setup.
- Return type:
str
- Parameters:
- Args:
config_file_path (Path): Path to the configuration file.
hash_length (Optional[int], optional): Defines the char length of the commit hash. Defaults to 16.
max_experiment_id_byte_length (Optional[int]): Defines the max byte length of the experiment_id to be shared with other ranks. Defaults to 1024.
- Returns:
str: The experiment ID.
- modalities.util.get_synced_string(string_to_be_synced, from_rank=0, max_string_byte_length=1024)[source]
Broadcast a string from one rank to all other ranks in the distributed setup.
- Args:
string_to_be_synced (str): The string to be synced across ranks.
from_rank (int, optional): The rank that generates the string. Defaults to 0.
max_string_byte_length (Optional[int], optional): Maximum byte length of the string to be synced. Defaults to 1024.
- Returns:
str: The synced string, decoded from the byte array.
- Raises:
ValueError: If the string exceeds the maximum byte length.
- modalities.util.get_total_number_of_trainable_parameters(model, device_mesh)[source]
Returns the total number of trainable parameters across all ranks. The model must be sharded with FSDP1 or FSDP2.
- Return type:
- Parameters:
model (FullyShardedDataParallel | FSDPModule)
device_mesh (DeviceMesh | None)
- Args:
model (FSDPX): The model for which to calculate the number of trainable parameters.
device_mesh (DeviceMesh | None): The device mesh used for distributed training.
- Returns:
Number: The total number of trainable parameters across all ranks.