modalities.optimizers package

Submodules

modalities.optimizers.lr_schedulers module

class modalities.optimizers.lr_schedulers.DummyLRScheduler(optimizer, last_epoch=-1)[source]

Bases: LRScheduler

Parameters:

optimizer (Optimizer)
last_epoch (int)
get_lr()[source]

Compute learning rate using chainable form of the scheduler.

Return type:

list[float]
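DummyLRScheduler is a no-op scheduler: stepping it never changes the optimizer's learning rates. A minimal pure-Python stand-in illustrating that contract (an illustrative sketch mirroring the LRScheduler interface, not the actual torch-based implementation):

```python
# Illustrative stand-in for a no-op LR scheduler (not the real,
# torch-based DummyLRScheduler): get_lr() returns the base learning
# rates unchanged, so stepping never varies the optimizer's lr.
class NoOpScheduler:
    def __init__(self, base_lrs: list[float], last_epoch: int = -1):
        self.base_lrs = base_lrs
        self.last_epoch = last_epoch

    def get_lr(self) -> list[float]:
        # The chainable form degenerates to the identity here.
        return list(self.base_lrs)

    def step(self) -> None:
        self.last_epoch += 1  # advance the counter; lrs stay fixed


sched = NoOpScheduler(base_lrs=[3e-4, 1e-3])
for _ in range(5):
    sched.step()
print(sched.get_lr())  # prints [0.0003, 0.001] — unchanged after any number of steps
```

Such a scheduler is useful as a default when a config requires a scheduler object but the learning rate should stay constant.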

modalities.optimizers.optimizer_factory module

class modalities.optimizers.optimizer_factory.OptimizerFactory[source]

Bases: object

static get_adam(lr, betas, eps, weight_decay, weight_decay_groups_excluded, wrapped_model, foreach=None, fused=None)[source]
Return type:

Optimizer

static get_adam_w(lr, betas, eps, weight_decay, weight_decay_groups_excluded, wrapped_model, foreach=None, fused=None)[source]
Return type:

Optimizer

static get_fsdp1_checkpointed_optimizer_(checkpoint_loading, checkpoint_path, wrapped_model, optimizer)[source]

Loads an FSDP1-checkpointed optimizer from a checkpoint file.

Return type:

Optimizer

Parameters:

checkpoint_loading (FSDP1CheckpointLoadingIF): The FSDP1 checkpoint loading strategy.
checkpoint_path (Path): The path to the checkpoint file.
wrapped_model (FSDP1): The FSDP1 model associated with the optimizer.
optimizer (Optimizer): The optimizer to load the checkpoint into.

Returns:

Optimizer: The optimizer loaded from the checkpoint.
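The trailing underscore in the method name signals in-place mutation: the optimizer passed in is updated with the checkpointed state and then returned. A framework-free sketch of that pattern (names and the plain-dict checkpoint are hypothetical; the real method delegates to the FSDP1 checkpoint-loading strategy):

```python
# Hypothetical sketch of the load-in-place-and-return pattern behind
# get_fsdp1_checkpointed_optimizer_ (the real code delegates to an
# FSDP1 checkpoint-loading strategy; a plain dict stands in here).
class TinyOptimizer:
    def __init__(self):
        self.state: dict = {}

    def load_state_dict(self, state_dict: dict) -> None:
        self.state = dict(state_dict)  # overwrite state in place


def get_checkpointed_optimizer_(checkpoint: dict, optimizer: TinyOptimizer) -> TinyOptimizer:
    optimizer.load_state_dict(checkpoint)  # mutate the caller's optimizer ...
    return optimizer                       # ... and return it for convenience


opt = TinyOptimizer()
same_opt = get_checkpointed_optimizer_({"step": 42}, opt)
assert same_opt is opt  # same object: loaded in place, not copied
```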

modalities.optimizers.optimizer_factory.get_optimizer_groups(model, weight_decay, weight_decay_groups_excluded)[source]

Divides model parameters into optimizer groups (with or without weight decay).

Inspired by:

  • https://github.com/pytorch/pytorch/issues/101343
  • https://github.com/karpathy/nanoGPT

Return type:

list[dict[str, list[Parameter] | float]]

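The core idea is to split parameters into one group that receives the configured weight decay and one that is exempt. A schematic pure-Python version (the substring-matching rule and the plain name-to-parameter dict are simplifications for illustration; the real implementation operates on the model's named parameter groups):

```python
# Schematic version of dividing parameters into optimizer groups:
# names matching an excluded group get weight_decay=0.0, everything
# else gets the configured weight decay. Substring matching is a
# simplification for illustration.
def get_optimizer_groups(
    named_params: dict[str, object],
    weight_decay: float,
    weight_decay_groups_excluded: list[str],
) -> list[dict]:
    decayed, undecayed = [], []
    for name, param in named_params.items():
        if any(group in name for group in weight_decay_groups_excluded):
            undecayed.append(param)
        else:
            decayed.append(param)
    return [
        {"params": decayed, "weight_decay": weight_decay},
        {"params": undecayed, "weight_decay": 0.0},
    ]


params = {"linear.weight": "w", "linear.bias": "b", "norm.weight": "n"}
groups = get_optimizer_groups(params, 0.1, ["bias", "norm"])
# groups[0] carries weight decay; groups[1] (biases, norms) is exempt
```

Excluding biases and normalization parameters from weight decay is the common practice the linked nanoGPT code follows.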

modalities.optimizers.optimizer_list module

class modalities.optimizers.optimizer_list.OptimizersList(model_parts, optimizers)[source]

Bases: Optimizer, Stateful, list[Optimizer]

Class to handle multiple optimizers for different model parts. Particularly relevant for pipeline parallelism, where each stage has its own optimizer. This class wraps a list of optimizers and provides a unified interface for step, zero_grad, state_dict and load_state_dict.

load_state_dict(state_dict)[source]

Load the optimizer state.

Args:

state_dict (dict): optimizer state. Should be an object returned from a call to state_dict().

Note

The names of the parameters (if they exist under the “param_names” key of each param group in state_dict()) will not affect the loading process. To use the parameters’ names for custom cases (such as when the parameters in the loaded state dict differ from those initialized in the optimizer), a custom register_load_state_dict_pre_hook should be implemented to adapt the loaded dict accordingly. If param_names exist in loaded state dict param_groups they will be saved and override the current names, if present, in the optimizer state. If they do not exist in loaded state dict, the optimizer param_names will remain unchanged.

Parameters:

state_dict (dict[str, Any])

state_dict()[source]

Return the state of the optimizer as a dict.

It contains two entries:

  • state: a Dict holding current optimization state. Its content differs between optimizer classes, but some common characteristics hold. For example, state is saved per parameter, and the parameter itself is NOT saved. state is a Dictionary mapping parameter ids to a Dict with state corresponding to each parameter.

  • param_groups: a List containing all parameter groups where each parameter group is a Dict. Each parameter group contains metadata specific to the optimizer, such as learning rate and weight decay, as well as a List of parameter IDs of the parameters in the group. If a param group was initialized with named_parameters() the names content will also be saved in the state dict.

NOTE: The parameter IDs may look like indices but they are just IDs associating state with param_group. When loading from a state_dict, the optimizer will zip the param_group params (int IDs) and the optimizer param_groups (actual nn.Parameter objects) in order to match state WITHOUT additional verification.

A returned state dict might look something like:

{
    'state': {
        0: {'momentum_buffer': tensor(...), ...},
        1: {'momentum_buffer': tensor(...), ...},
        2: {'momentum_buffer': tensor(...), ...},
        3: {'momentum_buffer': tensor(...), ...}
    },
    'param_groups': [
        {
            'lr': 0.01,
            'weight_decay': 0,
            ...
            'params': [0],
            'param_names': ['param0']  (optional)
        },
        {
            'lr': 0.001,
            'weight_decay': 0.5,
            ...
            'params': [1, 2, 3],
            'param_names': ['param1', 'layer.weight', 'layer.bias'] (optional)
        }
    ]
}
Return type:

list[dict[str, Any]]

step(*args, **kwargs)[source]

Perform a single optimization step to update parameters.

Args:

closure (Callable): A closure that reevaluates the model and returns the loss. Optional for most optimizers.

zero_grad(*args, **kwargs)[source]

Reset the gradients of all optimized torch.Tensor objects.

Args:

set_to_none (bool): instead of setting to zero, set the grads to None. This will in general have lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:

  1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.
  2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.
  3. torch.optim optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
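The unified interface described above is a fan-out pattern: one call on the list forwards to every wrapped optimizer. A minimal sketch of that idea with stub optimizers (the real OptimizersList additionally satisfies the torch Optimizer and Stateful interfaces; this is illustrative only):

```python
# Minimal sketch of the OptimizersList idea (pipeline parallelism:
# one optimizer per stage): subclass list and fan every call out to
# the wrapped optimizers, so callers can treat the whole thing as a
# single optimizer. Stubs stand in for torch optimizers.
class OptimizersListSketch(list):
    def step(self) -> None:
        for opt in self:
            opt.step()

    def zero_grad(self) -> None:
        for opt in self:
            opt.zero_grad()

    def state_dict(self) -> list[dict]:
        return [opt.state_dict() for opt in self]


class StubOptimizer:
    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1

    def zero_grad(self):
        pass

    def state_dict(self):
        return {"steps": self.steps}


opts = OptimizersListSketch([StubOptimizer(), StubOptimizer()])
opts.step()  # one call advances every per-stage optimizer
```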

modalities.optimizers.scheduler_list module

class modalities.optimizers.scheduler_list.SchedulerList(schedulers)[source]

Bases: LRScheduler, Stateful, list[LRScheduler]

A list of learning rate schedulers that can be treated as a single scheduler. Each scheduler in the list should correspond to an optimizer in a multi-optimizer setup. NOTE: Similar to torchtitan, this class assumes that all schedulers have the same state.

Parameters:

schedulers (Iterable[LRScheduler])

property base_lrs
get_last_lr()[source]

Return last computed learning rate by current scheduler.

get_lr()[source]

Compute learning rate using chainable form of the scheduler.

property last_epoch
load_state_dict(state_dict)[source]

Load the scheduler’s state.

Return type:

None

Parameters:

state_dict (dict[str, Any])

Args:

state_dict (dict): scheduler state. Should be an object returned from a call to state_dict().

state_dict()[source]

Return the state of the scheduler as a dict.

It contains an entry for every variable in self.__dict__ which is not the optimizer.

Return type:

dict[str, Any]

step(epoch=None)[source]

Perform a step.

Parameters:

epoch (int | None)
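Under the stated assumption that all schedulers share the same state, the list can step every member and broadcast one state dict to all of them on load. A pure-Python sketch of that delegation pattern (illustrative only; whether the real class saves one shared copy or per-scheduler copies is an implementation detail of modalities/torchtitan):

```python
# Sketch of SchedulerList's delegation pattern under its stated
# assumption that all schedulers hold identical state: step() fans
# out to every scheduler, and a loaded state is broadcast to all.
# Stubs stand in for torch LRSchedulers; not the actual implementation.
class SchedulerListSketch(list):
    def step(self) -> None:
        for sched in self:
            sched.step()

    def state_dict(self) -> dict:
        return dict(self[0].state_dict())  # all schedulers agree

    def load_state_dict(self, state: dict) -> None:
        for sched in self:
            sched.load_state_dict(state)  # broadcast the shared state


class StubScheduler:
    def __init__(self):
        self.last_epoch = -1

    def step(self):
        self.last_epoch += 1

    def state_dict(self):
        return {"last_epoch": self.last_epoch}

    def load_state_dict(self, state):
        self.last_epoch = state["last_epoch"]


scheds = SchedulerListSketch([StubScheduler(), StubScheduler()])
scheds.step()
scheds.load_state_dict({"last_epoch": 9})  # every scheduler ends up in sync
```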

Module contents