modalities.checkpointing.fsdp package

Submodules

modalities.checkpointing.fsdp.fsdp_checkpoint_loading module

class modalities.checkpointing.fsdp.fsdp_checkpoint_loading.DCPCheckpointLoading(global_rank)[source]

Bases: DistributedCheckpointLoadingIF

Distributed checkpoint loading implementation for PyTorch model and optimizer checkpoints.

Initializes the DCPCheckpointLoading object.

Args:

global_rank (int): The global rank of the process.

Returns:

None

Parameters:

global_rank (int)

load_checkpoint_(app_state, checkpoint_dir_path)[source]

Loads the distributed checkpoint from the specified directory path. NOTE: The model in the app_state must already be FSDP-wrapped.

Args:

app_state (AppState): The application state containing the model and optimizer.
checkpoint_dir_path (Path): The directory path of the distributed checkpoint.

Parameters:

  • app_state (AppState)

  • checkpoint_dir_path (Path)
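A minimal usage sketch (not part of the API reference): it assumes torch.distributed is initialized and that `app_state` is an AppState whose model is already FSDP-wrapped; the checkpoint path is a placeholder.

    from pathlib import Path

    import torch.distributed as dist

    from modalities.checkpointing.fsdp.fsdp_checkpoint_loading import DCPCheckpointLoading

    checkpoint_loading = DCPCheckpointLoading(global_rank=dist.get_rank())

    # Restores model and optimizer state in place; each rank reads its own
    # shards, so no full state dict is materialized on a single rank.
    checkpoint_loading.load_checkpoint_(
        app_state=app_state,  # assumed: model inside is already FSDP-wrapped
        checkpoint_dir_path=Path("checkpoints/my_run"),  # placeholder path
    )
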
class modalities.checkpointing.fsdp.fsdp_checkpoint_loading.FSDP1CheckpointLoading(global_rank, block_names, mixed_precision_settings, sharding_strategy)[source]

Bases: FSDP1CheckpointLoadingIF

FSDP1 checkpoint loading class.

Initializes the FSDP1CheckpointLoading object.

Args:

global_rank (int): The global rank of the process.
block_names (list[str]): The names of the blocks.
mixed_precision_settings (MixedPrecisionSettings): The settings for mixed precision.
sharding_strategy (ShardingStrategy): The sharding strategy.

Returns:

None

Parameters:

  • global_rank (int)

  • block_names (list[str])

  • mixed_precision_settings (MixedPrecisionSettings)

  • sharding_strategy (ShardingStrategy)

load_model_checkpoint(model, file_path)[source]

Loads the checkpoint as a full state dict into the model on rank 0. After the model is loaded into CPU RAM, it is wrapped with FSDP and sharded across the ranks according to the sharding strategy.

Args:

model (nn.Module): The model to load the checkpoint into.
file_path (Path): The path to the checkpoint file.

Returns:

nn.Module: The model wrapped with FSDP and sharded according to the sharding strategy.

Return type:

Module

Parameters:

  • model (Module)

  • file_path (Path)
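A minimal sketch, assuming `rank` (int), an unwrapped `raw_model` (nn.Module), and `mixed_precision_settings` (a MixedPrecisionSettings instance) are defined elsewhere; the block name and file path are placeholders.

    from pathlib import Path

    from torch.distributed.fsdp import ShardingStrategy

    from modalities.checkpointing.fsdp.fsdp_checkpoint_loading import FSDP1CheckpointLoading

    loading = FSDP1CheckpointLoading(
        global_rank=rank,
        block_names=["TransformerBlock"],  # placeholder block name
        mixed_precision_settings=mixed_precision_settings,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
    )

    # Rank 0 loads the full state dict into CPU RAM; the model is then
    # FSDP-wrapped and sharded across all ranks.
    wrapped_model = loading.load_model_checkpoint(
        model=raw_model,
        file_path=Path("checkpoints/model.bin"),  # placeholder path
    )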

load_optimizer_checkpoint_(optimizer, model, file_path)[source]

Loads the checkpoint as a full state dict into the optimizer on rank 0 (in-place).

Args:

optimizer (Optimizer): The optimizer to load the checkpoint into (in-place).
model (FSDP): The FSDP-wrapped model.
file_path (Path): The path to the checkpoint file.

Parameters:

  • optimizer (Optimizer)

  • model (FSDP)

  • file_path (Path)
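Continuing the sketch above: the optimizer must be constructed on the already FSDP-wrapped model before its state is restored in place. The optimizer choice and file path are placeholders.

    import torch
    from pathlib import Path

    optimizer = torch.optim.AdamW(wrapped_model.parameters(), lr=1e-4)
    loading.load_optimizer_checkpoint_(
        optimizer=optimizer,
        model=wrapped_model,
        file_path=Path("checkpoints/optimizer.bin"),  # placeholder path
    )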

modalities.checkpointing.fsdp.fsdp_checkpoint_saving module

class modalities.checkpointing.fsdp.fsdp_checkpoint_saving.CheckpointingEntityType(value)[source]

Bases: Enum

Enum class representing the types of entities that can be checkpointed.

Attributes:

MODEL (str): Represents the model entity.
OPTIMIZER (str): Represents the optimizer entity.

MODEL = 'model'

OPTIMIZER = 'optimizer'

class modalities.checkpointing.fsdp.fsdp_checkpoint_saving.DCPCheckpointSaving(checkpoint_path, experiment_id, global_rank)[source]

Bases: CheckpointSavingExecutionABC

DCPCheckpointSaving class for saving checkpoints of FSDP2 models and optimizers in a distributed fashion. Each rank saves its own model and optimizer state in a combined file. The advantage over FSDP1CheckpointSaving is that the model and optimizer states do not have to be synced to rank 0 or loaded into CPU memory.

Initializes the DCPCheckpointSaving class.

Args:

checkpoint_path (Path): folder path to the checkpoint
experiment_id (str): ID of the experiment
global_rank (int): global rank within the current process group

Returns:

None

Parameters:
  • checkpoint_path (Path)

  • experiment_id (str)

  • global_rank (int)

CHECKPOINT_FOLDER_STRUCTURE = 'eid_{experiment_id}-seen_steps_{num_seen_steps}-seen_tokens_{num_seen_tokens}-target_steps_{num_target_steps}-target_tokens_{num_target_tokens}'
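For illustration (values are made up), the folder name template expands as follows:

    from modalities.checkpointing.fsdp.fsdp_checkpoint_saving import DCPCheckpointSaving

    folder_name = DCPCheckpointSaving.CHECKPOINT_FOLDER_STRUCTURE.format(
        experiment_id="2025-01-01__12-00-00",
        num_seen_steps=1000,
        num_seen_tokens=2048000,
        num_target_steps=10000,
        num_target_tokens=20480000,
    )
    # 'eid_2025-01-01__12-00-00-seen_steps_1000-seen_tokens_2048000-target_steps_10000-target_tokens_20480000'
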
class modalities.checkpointing.fsdp.fsdp_checkpoint_saving.FSDP1CheckpointSaving(checkpoint_path, experiment_id, global_rank)[source]

Bases: CheckpointSavingExecutionABC

FSDP1CheckpointSaving class for saving checkpoints of FSDP models and optimizers. NOTE: This checkpoint saving routine loads the model into CPU memory before saving it to disk and stores the model and optimizer in separate files. This routine only works in conjunction with FSDP1.

Initializes the FSDP1CheckpointSaving class.

Args:

checkpoint_path (Path): folder path to the checkpoint
experiment_id (str): ID of the experiment
global_rank (int): global rank within the current process group

Returns:

None

Parameters:
  • checkpoint_path (Path)

  • experiment_id (str)

  • global_rank (int)

CHECKPOINT_STRUCTURE = 'eid_{experiment_id}-{entity}-seen_steps_{num_seen_steps}-seen_tokens_{num_seen_tokens}-target_steps_{num_target_steps}-target_tokens_{num_target_tokens}.bin'
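For illustration (values are made up), the file name template yields one file per entity, with the entity taken from CheckpointingEntityType:

    from modalities.checkpointing.fsdp.fsdp_checkpoint_saving import (
        CheckpointingEntityType,
        FSDP1CheckpointSaving,
    )

    file_name = FSDP1CheckpointSaving.CHECKPOINT_STRUCTURE.format(
        experiment_id="2025-01-01__12-00-00",
        entity=CheckpointingEntityType.MODEL.value,
        num_seen_steps=1000,
        num_seen_tokens=2048000,
        num_target_steps=10000,
        num_target_tokens=20480000,
    )
    # 'eid_2025-01-01__12-00-00-model-seen_steps_1000-seen_tokens_2048000-target_steps_10000-target_tokens_20480000.bin'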

Module contents