modalities.checkpointing package
Subpackages
- modalities.checkpointing.fsdp package
- modalities.checkpointing.stateful package
- modalities.checkpointing.torch package
Submodules
modalities.checkpointing.checkpoint_conversion module
- class modalities.checkpointing.checkpoint_conversion.CheckpointConversion(config_file_path, output_hf_checkpoint_dir)[source]
Bases:
object
Class to convert a PyTorch checkpoint to a Hugging Face checkpoint.
Initializes the CheckpointConversion object.
- Args:
config_file_path (Path): The path to the configuration file containing the PyTorch model configuration.
output_hf_checkpoint_dir (Path): The path to the output Hugging Face checkpoint directory.
- Raises:
ValueError: If the config_file_path does not exist.
- convert_pytorch_to_hf_checkpoint(prediction_key)[source]
Converts a PyTorch checkpoint to a Hugging Face checkpoint.
- Return type:
HFModelAdapter
- Parameters:
prediction_key (str)
- Args:
prediction_key (str): The prediction key to be used in the HFModelAdapter.
- Returns:
HFModelAdapter: The converted Hugging Face model adapter.
modalities.checkpointing.checkpoint_loading module
- class modalities.checkpointing.checkpoint_loading.DistributedCheckpointLoadingIF[source]
Bases:
ABC
Distributed checkpoint loading interface for loading PyTorch model and optimizer checkpoints.
- abstractmethod load_checkpoint_(app_state, checkpoint_dir_path)[source]
Loads the distributed checkpoint from the specified directory path into the AppState.
- Args:
app_state (AppState): The application state with the model, optimizer and lr scheduler.
checkpoint_dir_path (Path): The directory path to the distributed checkpoint.
- Raises:
NotImplementedError: This abstract method is not implemented and should be overridden in a subclass.
- Returns:
AppState: The application state with the loaded checkpoint.
- class modalities.checkpointing.checkpoint_loading.FSDP1CheckpointLoadingIF[source]
Bases:
ABC
Checkpoint loading interface for loading PyTorch model and optimizer checkpoints.
- abstractmethod load_model_checkpoint(model, file_path)[source]
Loads a model checkpoint from the specified file path.
- Args:
model (nn.Module): The model to load the checkpoint into.
file_path (Path): The path to the checkpoint file.
- Returns:
nn.Module: The loaded model with the checkpoint parameters.
- Raises:
NotImplementedError: This abstract method is not implemented and should be overridden in a subclass.
- abstractmethod load_optimizer_checkpoint_(optimizer, model, file_path)[source]
Loads an optimizer checkpoint from the specified file path (in-place).
- Args:
optimizer (Optimizer): The optimizer to load the checkpoint into (in-place).
model (nn.Module): The model associated with the optimizer.
file_path (Path): The path to the checkpoint file.
- Raises:
NotImplementedError: This abstract method is not implemented and should be overridden in a subclass.
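The abstract-method pattern used by the loading interfaces can be illustrated with a minimal, self-contained sketch. `LoaderIFSketch` and `InMemoryLoader` are hypothetical stand-ins, not classes from modalities, and a plain dict takes the place of an `nn.Module`. Note that Python refuses to instantiate a class with unimplemented abstract methods (raising `TypeError`), so the documented `NotImplementedError` would only surface if a subclass explicitly delegated to the base implementation.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class LoaderIFSketch(ABC):
    """Hypothetical interface mirroring the abstract-method pattern."""

    @abstractmethod
    def load_model_checkpoint(self, model: dict, file_path: Path) -> dict:
        raise NotImplementedError


class InMemoryLoader(LoaderIFSketch):
    """Hypothetical subclass: reads 'weights' from a dict keyed by path, not from disk."""

    def __init__(self, store: dict):
        self.store = store

    def load_model_checkpoint(self, model: dict, file_path: Path) -> dict:
        # Overwrite the model's entries with the stored ones and return the model.
        model.update(self.store[file_path])
        return model


# The interface itself cannot be instantiated.
try:
    LoaderIFSketch()
    instantiation_error = None
except TypeError as error:
    instantiation_error = error

loader = InMemoryLoader({Path("ckpt.bin"): {"weight": 1.0}})
model = loader.load_model_checkpoint({"weight": 0.0}, Path("ckpt.bin"))
```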
modalities.checkpointing.checkpoint_saving module
- class modalities.checkpointing.checkpoint_saving.CheckpointSaving(checkpoint_saving_strategy, checkpoint_saving_execution)[source]
Bases:
object
Class for saving checkpoints based on a saving strategy and an execution strategy.
Initializes the CheckpointSaving object.
- Args:
checkpoint_saving_strategy (CheckpointSavingStrategyIF): The strategy for saving checkpoints.
checkpoint_saving_execution (CheckpointSavingExecutionABC): The execution for saving checkpoints.
- Parameters:
checkpoint_saving_strategy (CheckpointSavingStrategyIF)
checkpoint_saving_execution (CheckpointSavingExecutionABC)
- save_checkpoint(training_progress, evaluation_result, app_state, early_stoppping_criterion_fulfilled=False)[source]
Saves a checkpoint of the model and optimizer.
- Args:
training_progress (TrainingProgress): The training progress.
evaluation_result (dict[str, EvaluationResultBatch]): The evaluation result.
app_state (AppState): The application state to be checkpointed.
early_stoppping_criterion_fulfilled (bool, optional): Whether the early stopping criterion is fulfilled. Defaults to False.
- Parameters:
training_progress (TrainingProgress)
evaluation_result (dict[str, EvaluationResultBatch])
app_state (AppState)
early_stoppping_criterion_fulfilled (bool)
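One plausible way the two constructor arguments cooperate is that the strategy decides what should happen and the execution carries it out. The sketch below assumes exactly that flow; it is not taken from the modalities source. All class names ending in `Sketch`, as well as `EveryOtherStepStrategy` and `RecordingExecution`, are hypothetical, and plain integers stand in for `TrainingProgress`.

```python
from dataclasses import dataclass, field


@dataclass
class InstructionSketch:
    """Hypothetical stand-in for CheckpointingInstruction."""
    save_current: bool = False
    checkpoints_to_delete: list[int] = field(default_factory=list)


class EveryOtherStepStrategy:
    """Hypothetical strategy: save on even step counts, never delete."""

    def get_checkpoint_instruction(self, training_progress, evaluation_result=None,
                                   early_stoppping_criterion_fulfilled=False):
        return InstructionSketch(save_current=training_progress % 2 == 0)


class RecordingExecution:
    """Hypothetical execution: records the steps it would save instead of writing files."""

    def __init__(self):
        self.saved_steps = []

    def run_checkpoint_instruction(self, checkpointing_instruction, training_progress, app_state):
        if checkpointing_instruction.save_current:
            self.saved_steps.append(training_progress)


class CheckpointSavingSketch:
    def __init__(self, checkpoint_saving_strategy, checkpoint_saving_execution):
        self.checkpoint_saving_strategy = checkpoint_saving_strategy
        self.checkpoint_saving_execution = checkpoint_saving_execution

    def save_checkpoint(self, training_progress, evaluation_result, app_state,
                        early_stoppping_criterion_fulfilled=False):
        # Assumed flow: ask the strategy for an instruction, then hand it to the execution.
        instruction = self.checkpoint_saving_strategy.get_checkpoint_instruction(
            training_progress=training_progress,
            evaluation_result=evaluation_result,
            early_stoppping_criterion_fulfilled=early_stoppping_criterion_fulfilled,
        )
        self.checkpoint_saving_execution.run_checkpoint_instruction(
            checkpointing_instruction=instruction,
            training_progress=training_progress,
            app_state=app_state,
        )


execution = RecordingExecution()
saver = CheckpointSavingSketch(EveryOtherStepStrategy(), execution)
for step in range(1, 5):
    saver.save_checkpoint(training_progress=step, evaluation_result={}, app_state=None)
```

Separating the decision from the I/O in this way lets the same strategy be reused with different storage backends.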
modalities.checkpointing.checkpoint_saving_execution module
- class modalities.checkpointing.checkpoint_saving_execution.CheckpointSavingExecutionABC[source]
Bases:
ABC
Abstract class for saving PyTorch model and optimizer checkpoints.
- run_checkpoint_instruction(checkpointing_instruction, training_progress, app_state)[source]
Runs the checkpoint instruction.
- Args:
checkpointing_instruction (CheckpointingInstruction): The checkpointing instruction.
training_progress (TrainingProgress): The training progress.
app_state (AppState): The application state to be checkpointed.
- Parameters:
checkpointing_instruction (CheckpointingInstruction)
training_progress (TrainingProgress)
app_state (AppState)
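A concrete execution has to act on both parts of an instruction: saving the current state and deleting stale checkpoints. The following sketch assumes a template-method layout with two hooks, `_save` and `_delete`, and assumes that saving happens before deleting. Both the hook names and the ordering are illustrative assumptions, and a dict replaces real storage.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class InstructionSketch:
    """Hypothetical stand-in for CheckpointingInstruction."""
    save_current: bool = False
    checkpoints_to_delete: list[int] = field(default_factory=list)


class ExecutionSketch(ABC):
    def run_checkpoint_instruction(self, checkpointing_instruction, training_progress, app_state):
        # Assumed order: write the new checkpoint first, then remove stale ones.
        if checkpointing_instruction.save_current:
            self._save(training_progress, app_state)
        for stale in checkpointing_instruction.checkpoints_to_delete:
            self._delete(stale)

    @abstractmethod
    def _save(self, training_progress, app_state): ...

    @abstractmethod
    def _delete(self, training_progress): ...


class DictExecution(ExecutionSketch):
    """Hypothetical backend that keeps checkpoints in a dict keyed by step."""

    def __init__(self):
        self.storage = {}

    def _save(self, training_progress, app_state):
        self.storage[training_progress] = dict(app_state)

    def _delete(self, training_progress):
        self.storage.pop(training_progress, None)


execution = DictExecution()
execution.run_checkpoint_instruction(InstructionSketch(save_current=True), 1, {"lr": 0.1})
execution.run_checkpoint_instruction(
    InstructionSketch(save_current=True, checkpoints_to_delete=[1]), 2, {"lr": 0.05}
)
```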
modalities.checkpointing.checkpoint_saving_instruction module
- class modalities.checkpointing.checkpoint_saving_instruction.CheckpointingInstruction(save_current=False, checkpoints_to_delete=<factory>)[source]
Bases:
object
Represents a checkpointing instruction (i.e., saving and deleting).
- Attributes:
save_current (bool): Indicates whether to save the current checkpoint.
checkpoints_to_delete (list[TrainingProgress]): The checkpoints to delete, each identified by its training progress.
- Parameters:
save_current (bool)
checkpoints_to_delete (list[TrainingProgress])
- checkpoints_to_delete: list[TrainingProgress] = <dataclasses._MISSING_TYPE object>
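The `<factory>` default in the signature suggests that `checkpoints_to_delete` is a dataclass field with a default factory. A minimal sketch of that pattern follows; `CheckpointingInstructionSketch` and `TrainingProgressStub` are hypothetical mirrors of the documented fields, and the `num_seen_steps` attribute is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingProgressStub:
    """Hypothetical stand-in for TrainingProgress; only a step counter is modeled."""
    num_seen_steps: int


@dataclass
class CheckpointingInstructionSketch:
    """Illustrative mirror of the documented fields, not the library class."""
    save_current: bool = False
    # default_factory gives every instance its own empty list.
    checkpoints_to_delete: list[TrainingProgressStub] = field(default_factory=list)


first = CheckpointingInstructionSketch(save_current=True)
second = CheckpointingInstructionSketch()
first.checkpoints_to_delete.append(TrainingProgressStub(num_seen_steps=100))
```

Because each instance gets a fresh list, appending to `first.checkpoints_to_delete` leaves `second.checkpoints_to_delete` empty.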
modalities.checkpointing.checkpoint_saving_strategies module
- class modalities.checkpointing.checkpoint_saving_strategies.CheckpointSavingStrategyIF[source]
Bases:
ABC
Checkpoint saving strategy interface.
- abstractmethod get_checkpoint_instruction(training_progress, evaluation_result=None, early_stoppping_criterion_fulfilled=False)[source]
Returns the checkpointing instruction.
- Return type:
CheckpointingInstruction
- Parameters:
training_progress (TrainingProgress)
evaluation_result (dict[str, EvaluationResultBatch] | None)
early_stoppping_criterion_fulfilled (bool)
- Args:
training_progress (TrainingProgress): The training progress.
evaluation_result (dict[str, EvaluationResultBatch] | None, optional): The evaluation result. Defaults to None.
early_stoppping_criterion_fulfilled (bool, optional): Whether the early stopping criterion is fulfilled. Defaults to False.
- Returns:
CheckpointingInstruction: The checkpointing instruction.
- class modalities.checkpointing.checkpoint_saving_strategies.SaveEveryKStepsCheckpointingStrategy(k)[source]
Bases:
CheckpointSavingStrategyIF
Strategy for saving a checkpoint every k steps.
Initializes the SaveEveryKStepsCheckpointingStrategy object.
- Args:
k (int): The checkpointing interval in steps.
- Returns:
None
- Parameters:
k (int)
- get_checkpoint_instruction(training_progress, evaluation_result=None, early_stoppping_criterion_fulfilled=False)[source]
Returns a CheckpointingInstruction object.
- Return type:
CheckpointingInstruction
- Parameters:
training_progress (TrainingProgress)
evaluation_result (dict[str, EvaluationResultBatch] | None)
early_stoppping_criterion_fulfilled (bool)
- Args:
training_progress (TrainingProgress): The training progress.
evaluation_result (dict[str, EvaluationResultBatch] | None, optional): The evaluation result. Defaults to None.
early_stoppping_criterion_fulfilled (bool, optional): Whether the early stopping criterion is fulfilled. Defaults to False.
- Returns:
CheckpointingInstruction: The checkpointing instruction object.
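The class name suggests that a checkpoint is requested whenever the step count is a multiple of k. The sketch below assumes that rule; the actual criterion and the `TrainingProgress` attribute it reads are not stated in this reference. `SaveEveryKStepsSketch` and `InstructionSketch` are hypothetical, and an integer step count replaces `TrainingProgress`.

```python
from dataclasses import dataclass, field


@dataclass
class InstructionSketch:
    """Hypothetical stand-in for CheckpointingInstruction."""
    save_current: bool = False
    checkpoints_to_delete: list[int] = field(default_factory=list)


class SaveEveryKStepsSketch:
    def __init__(self, k: int):
        self.k = k

    def get_checkpoint_instruction(self, num_seen_steps: int) -> InstructionSketch:
        # Assumption: save when the step count is a multiple of k; never delete.
        return InstructionSketch(save_current=num_seen_steps % self.k == 0)


strategy = SaveEveryKStepsSketch(k=3)
saved_at = [
    step for step in range(1, 10)
    if strategy.get_checkpoint_instruction(step).save_current
]
```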
- class modalities.checkpointing.checkpoint_saving_strategies.SaveKMostRecentCheckpointsStrategy(k=-1)[source]
Bases:
CheckpointSavingStrategyIF
Strategy for saving the k most recent checkpoints only.
Initializes the checkpoint saving strategy.
- Args:
k (int, optional): The number of most recent checkpoints to save. Defaults to -1, which means all checkpoints are saved. Set to 0 to not save any checkpoints. Set to a positive integer to save the specified number of checkpoints.
- Parameters:
k (int)
- get_checkpoint_instruction(training_progress, evaluation_result=None, early_stoppping_criterion_fulfilled=False)[source]
Generates a checkpointing instruction based on the given parameters.
- Return type:
CheckpointingInstruction
- Parameters:
training_progress (TrainingProgress)
evaluation_result (dict[str, EvaluationResultBatch] | None)
early_stoppping_criterion_fulfilled (bool)
- Args:
training_progress (TrainingProgress): The training progress.
evaluation_result (dict[str, EvaluationResultBatch] | None, optional): The evaluation result. Defaults to None.
early_stoppping_criterion_fulfilled (bool, optional): Whether the early stopping criterion is fulfilled. Defaults to False.
- Returns:
CheckpointingInstruction: The generated checkpointing instruction.
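The three regimes of k described above (-1 keeps everything, 0 saves nothing, a positive value keeps a sliding window) can be sketched as follows. This is an illustrative reimplementation under the assumption that the oldest checkpoint is evicted once the window is full; `SaveKMostRecentSketch` and `InstructionSketch` are hypothetical names, and integer step IDs replace `TrainingProgress`.

```python
from dataclasses import dataclass, field


@dataclass
class InstructionSketch:
    """Hypothetical stand-in for CheckpointingInstruction."""
    save_current: bool = False
    checkpoints_to_delete: list[int] = field(default_factory=list)


class SaveKMostRecentSketch:
    def __init__(self, k: int = -1):
        self.k = k
        self.saved_steps: list[int] = []  # newest first

    def get_checkpoint_instruction(self, step: int) -> InstructionSketch:
        if self.k == 0:
            # k == 0: never save anything.
            return InstructionSketch(save_current=False)
        self.saved_steps.insert(0, step)
        to_delete: list[int] = []
        # k > 0: evict the oldest checkpoint once more than k are tracked.
        # k == -1: the window is unbounded, so nothing is ever evicted.
        if self.k > 0 and len(self.saved_steps) > self.k:
            to_delete = [self.saved_steps.pop()]
        return InstructionSketch(save_current=True, checkpoints_to_delete=to_delete)


strategy = SaveKMostRecentSketch(k=2)
history = [strategy.get_checkpoint_instruction(step) for step in (10, 20, 30)]

keep_all = SaveKMostRecentSketch(k=-1)
keep_all_history = [keep_all.get_checkpoint_instruction(step) for step in (10, 20, 30)]

keep_none = SaveKMostRecentSketch(k=0)
none_instruction = keep_none.get_checkpoint_instruction(10)
```

With k=2, the third call is the first one that asks for a deletion, namely of the checkpoint from step 10.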