modalities.training.gradient_clipping package

Submodules

modalities.training.gradient_clipping.fsdp_gradient_clipper module

class modalities.training.gradient_clipping.fsdp_gradient_clipper.FSDP1GradientClipper(wrapped_model, max_norm, norm_type=<enum 'GradientClippingMode'>)

Bases: GradientClipperIF

Clips the gradients of a model wrapped with FSDP. The behavior follows the PyTorch documentation at https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.clip_grad_norm_

Initialize the FSDP1GradientClipper object.

Args:

wrapped_model (FSDP1): The wrapped model.
max_norm (float): The maximum norm value for gradient clipping.
norm_type (GradientClippingMode, optional): The type of gradient clipping. Defaults to GradientClippingMode.

Returns:

None

clip_gradients()

Clips the gradients of the wrapped model using the specified maximum norm and norm type.

Return type:

Tensor

Returns:

torch.Tensor: The total gradient norm, measured before the gradients are scaled.
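The clipping rule itself can be illustrated without any distributed setup. The following is a minimal sketch on a plain Python list, not the FSDP implementation: the function name `clip_by_total_norm` and the `eps` constant are assumptions chosen to mirror how `torch.nn.utils.clip_grad_norm_` computes its scaling coefficient, and the sketch returns the norm measured before scaling.

```python
import math


def clip_by_total_norm(grads, max_norm, norm_type=2.0, eps=1e-6):
    """Scale `grads` in place so their total norm is at most `max_norm`.

    Hypothetical helper for illustration; returns the norm before scaling.
    """
    if math.isinf(norm_type):
        total_norm = max(abs(g) for g in grads)
    else:
        total_norm = sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)
    # Clamp the coefficient to 1.0 so small gradients are never scaled up.
    clip_coef = min(max_norm / (total_norm + eps), 1.0)
    for i, g in enumerate(grads):
        grads[i] = g * clip_coef
    return total_norm


grads = [3.0, 4.0]  # Euclidean norm is 5.0
norm = clip_by_total_norm(grads, max_norm=1.0)

small = [0.3, 0.4]  # norm is about 0.5, below the threshold
clip_by_total_norm(small, max_norm=1.0)
```

After the first call, `grads` has a Euclidean norm just below 1.0, while `small` is left unchanged because its norm is already under the threshold.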

class modalities.training.gradient_clipping.fsdp_gradient_clipper.FSDP1LoggingOnlyGradientClipper(wrapped_model, norm_type=<enum 'GradientClippingMode'>)

Bases: GradientClipperIF

Computes the gradient norm of a model wrapped with FSDP1 for logging, without actually clipping the gradients.

Initialize the FSDP1LoggingOnlyGradientClipper.

Args:

wrapped_model (FSDP1): The wrapped FSDP1 model.
norm_type (GradientClippingMode, optional): The type of gradient clipping. Defaults to GradientClippingMode.

Returns:

None

Parameters:

wrapped_model (FullyShardedDataParallel)

clip_gradients()

Returns the gradient norm, but does not apply clipping since max_norm is set to infinity.

Return type:

Tensor

Returns:

torch.Tensor: The gradient norms.
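Why an infinite `max_norm` disables clipping can be seen from the scaling coefficient. This is a small sketch under the assumption that the coefficient is computed as `max_norm / (norm + eps)` and clamped to 1.0, as in `torch.nn.utils.clip_grad_norm_`; the helper name `total_norm_only` is hypothetical.

```python
import math


def total_norm_only(grads, norm_type=2.0):
    """Return the total gradient norm and leave `grads` untouched."""
    if math.isinf(norm_type):
        return max(abs(g) for g in grads)
    return sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)


grads = [3.0, 4.0]
max_norm = math.inf
norm = total_norm_only(grads)

# With an infinite threshold the clamped scaling factor is always 1.0,
# so the gradients would be multiplied by 1.0 and stay as they are.
clip_coef = min(max_norm / (norm + 1e-6), 1.0)
```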

class modalities.training.gradient_clipping.fsdp_gradient_clipper.FSDP2GradientClipper(model_parts, max_norm, norm_type, device_mesh=None, error_if_nonfinite=False, foreach=None)

Bases: FSDP2LoggingOnlyGradientClipper

Clips the gradients of a model, or a list of model parts, wrapped with FSDP2.

Initialize the FSDP2GradientClipper object.

Args:

model_parts (FSDP2 | list[FSDP2]): The wrapped FSDP2 model or list of model parts.
max_norm (float): The maximum norm value for gradient clipping.
norm_type (GradientClippingMode): The type of gradient clipping.
device_mesh (DeviceMesh, optional): The device mesh used for distributed training. Defaults to None.
error_if_nonfinite (bool): If True, an error is thrown if the total norm of the gradients from parameters is nan, inf, or -inf. Default: False (will switch to True in the future).
foreach (bool): Use the faster foreach-based implementation. If None, use the foreach implementation for CUDA and CPU native tensors and silently fall back to the slow implementation for other device types. Default: None.

Returns:

None

clip_gradients()

Clips the gradients of the wrapped model using the specified maximum norm and norm type.

Return type:

Tensor

Returns:

torch.Tensor: The total gradient norm, measured before the gradients are scaled.
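When `model_parts` is a list, a single total norm has to cover the gradients of all parts. The sketch below shows one way to think about that on plain Python lists, assuming the parts are treated as one flat vector. The function name `combined_norm` is hypothetical and this is not the modalities implementation.

```python
import math


def combined_norm(part_grads, norm_type=2.0):
    """Total norm over several model parts, treated as one flat vector."""
    flat = [abs(g) for grads in part_grads for g in grads]
    if math.isinf(norm_type):
        return max(flat)
    return sum(g ** norm_type for g in flat) ** (1.0 / norm_type)


# Two hypothetical model parts, each holding one gradient value.
parts = [[3.0], [4.0]]
l1 = combined_norm(parts, 1.0)         # sum of absolute values
l2 = combined_norm(parts, 2.0)         # Euclidean norm
linf = combined_norm(parts, math.inf)  # largest absolute value
```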

class modalities.training.gradient_clipping.fsdp_gradient_clipper.FSDP2LoggingOnlyGradientClipper(model_parts, norm_type, device_mesh=None, error_if_nonfinite=False, foreach=None)

Bases: GradientClipperIF

Computes the gradient norm of a model, or a list of model parts, wrapped with FSDP2 for logging, without actually clipping the gradients.

Initialize the FSDP2LoggingOnlyGradientClipper.

Args:

model_parts (FSDP2 | list[FSDP2]): The wrapped FSDP2 model or list of models.
norm_type (GradientClippingMode): The type of gradient clipping.
device_mesh (DeviceMesh, optional): The device mesh used for distributed training. Defaults to None.
error_if_nonfinite (bool): If True, an error is thrown if the total norm of the gradients from parameters is nan, inf, or -inf. Default: False (will switch to True in the future).
foreach (bool): Use the faster foreach-based implementation. If None, use the foreach implementation for CUDA and CPU native tensors and silently fall back to the slow implementation for other device types. Default: None.

Returns:

None

clip_gradients()

Returns the gradient norm, but does not apply clipping since max_norm is set to infinity.

Return type:

Tensor

Returns:

torch.Tensor: The gradient norms.
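The effect of `error_if_nonfinite` can be sketched as a simple guard on the computed norm. The helper name `check_total_norm` and the exception type are assumptions for illustration. The actual check and error raised by the library may differ.

```python
import math


def check_total_norm(total_norm, error_if_nonfinite=False):
    """Raise if the norm is nan, inf, or -inf and the flag is set."""
    if error_if_nonfinite and not math.isfinite(total_norm):
        raise RuntimeError(f"Non-finite total gradient norm: {total_norm}")
    return total_norm


# With the default flag, a non-finite norm is passed through unchanged.
passed_through = check_total_norm(float("nan"))

# With the flag set, the same value triggers an error.
try:
    check_total_norm(float("nan"), error_if_nonfinite=True)
    raised = False
except RuntimeError:
    raised = True
```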

class modalities.training.gradient_clipping.fsdp_gradient_clipper.GradientClippingMode(value)

Bases: LookupEnum

Enum class representing different modes of gradient clipping.

Attributes:

P1_NORM (int): Mode for Manhattan norm based clipping.
P2_NORM (int): Mode for Euclidean norm based clipping.
MAX_NORM (str): Mode for maximum norm based clipping.

MAX_NORM = 'inf'
P1_NORM = 1
P2_NORM = 2
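The enum values map naturally onto the exponent of a p-norm. The sketch below uses the standard library `Enum` as a stand-in for `LookupEnum`, and both `ClippingModeSketch` and `as_norm_type` are hypothetical names. It only illustrates that the string `'inf'` converts cleanly to a float.

```python
from enum import Enum


class ClippingModeSketch(Enum):
    """Stand-in for GradientClippingMode, using the values listed above."""

    P1_NORM = 1
    P2_NORM = 2
    MAX_NORM = "inf"


def as_norm_type(mode):
    """Convert an enum member to a float p (float('inf') for MAX_NORM)."""
    return float(mode.value)


p_manhattan = as_norm_type(ClippingModeSketch.P1_NORM)
p_euclidean = as_norm_type(ClippingModeSketch.P2_NORM)
p_max = as_norm_type(ClippingModeSketch.MAX_NORM)
```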

modalities.training.gradient_clipping.fsdp_gradient_clipper_config module

modalities.training.gradient_clipping.gradient_clipper module

class modalities.training.gradient_clipping.gradient_clipper.GradientClipperIF

Bases: ABC

The GradientClipper interface that defines the methods for clipping gradients.

abstractmethod clip_gradients()

Clip the gradients of the model.

Return type:

Tensor

Returns:

torch.Tensor: The gradient norm, as returned by the implementations in this package.
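The interface pattern can be sketched with the standard library alone. `ClipperSketchIF` and `ConstantNormClipper` are hypothetical names, and the toy subclass returns a plain float where a real implementation would return a `torch.Tensor`.

```python
from abc import ABC, abstractmethod


class ClipperSketchIF(ABC):
    """Stand-in for GradientClipperIF with a single abstract method."""

    @abstractmethod
    def clip_gradients(self):
        """Clip the gradients and return a norm value."""


class ConstantNormClipper(ClipperSketchIF):
    """Toy implementation that reports a fixed norm."""

    def __init__(self, norm):
        self.norm = norm

    def clip_gradients(self):
        return self.norm


clipper = ConstantNormClipper(1.5)
reported = clipper.clip_gradients()

# The abstract base cannot be instantiated directly.
try:
    ClipperSketchIF()
    abstract_ok = False
except TypeError:
    abstract_ok = True
```

Because callers depend only on `clip_gradients()`, a training loop can switch between a clipping and a logging-only implementation without other changes.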

Module contents