modalities.training.gradient_clipping package

Submodules

modalities.training.gradient_clipping.fsdp_gradient_clipper module

class modalities.training.gradient_clipping.fsdp_gradient_clipper.DummyGradientClipper[source]

Bases: GradientClipperIF

The DummyGradientClipper class that does not apply gradient clipping.

clip_gradients()[source]

Returns a tensor with value -1.0 indicating that DummyGradientClipper does not actually apply gradient clipping.

Return type: Tensor

Returns: torch.Tensor: Tensor with value -1.0

class modalities.training.gradient_clipping.fsdp_gradient_clipper.FSDP1GradientClipper(wrapped_model, max_norm, norm_type=<enum 'GradientClippingMode'>)[source]

Bases: GradientClipperIF

Clips the gradients of a model wrapped with FSDP1 (FullyShardedDataParallel). Follows the PyTorch documentation for FullyShardedDataParallel.clip_grad_norm_: https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.clip_grad_norm_

Initialize the FSDP1GradientClipper object.

Args:

wrapped_model (FSDP1): The wrapped model.
max_norm (float): The maximum norm value for gradient clipping.
norm_type (GradientClippingMode, optional): The type of gradient clipping. Defaults to GradientClippingMode.

Returns: None

Parameters:

wrapped_model (FullyShardedDataParallel)
max_norm (float)

clip_gradients()[source]

Clips the gradients of the wrapped model using the specified maximum norm and norm type.

Return type: Tensor

Returns: torch.Tensor: The total gradient norm, as computed by the underlying clip_grad_norm_ call before the gradients are scaled.
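
The arithmetic behind norm-based clipping can be sketched in plain Python. This is a simplified illustration, not the FSDP implementation: it uses a flat list of floats instead of sharded tensors, and the helper name clip_by_total_norm and the eps value are assumptions made for the example.

```python
import math


def clip_by_total_norm(grads, max_norm, p=2.0, eps=1e-6):
    """Scale grads in place so that their total p-norm is at most max_norm.

    Returns the total norm measured before any scaling is applied.
    """
    if math.isinf(p):
        total_norm = max(abs(g) for g in grads)
    else:
        total_norm = sum(abs(g) ** p for g in grads) ** (1.0 / p)
    # The coefficient is capped at 1.0, so small gradients are left untouched.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    for i, g in enumerate(grads):
        grads[i] = g * clip_coef
    return total_norm


grads = [3.0, 4.0]
norm = clip_by_total_norm(grads, max_norm=1.0)  # Euclidean norm of [3, 4]

small_grads = [0.3, 0.4]
small_norm = clip_by_total_norm(small_grads, max_norm=1.0)  # below max_norm
```

In this sketch the returned value is the norm measured before scaling, which makes it usable for logging regardless of whether clipping took effect.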

class modalities.training.gradient_clipping.fsdp_gradient_clipper.FSDP1LoggingOnlyGradientClipper(wrapped_model, norm_type=<enum 'GradientClippingMode'>)[source]

Bases: GradientClipperIF

The FSDP1LoggingOnlyGradientClipper class that is responsible for logging the gradient norms without actually clipping the gradients.

Initialize the FSDP1LoggingOnlyGradientClipper.

Args:

wrapped_model (FSDP1): The wrapped FSDP1 model.
norm_type (GradientClippingMode, optional): The type of gradient clipping. Defaults to GradientClippingMode.

Returns: None

Parameters:

wrapped_model (FullyShardedDataParallel)

clip_gradients()[source]

Returns the gradient norm, but does not apply clipping since max_norm is set to infinity.

Return type: Tensor

Returns: torch.Tensor: The gradient norm.
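
Why an infinite max_norm turns clipping into pure logging can be seen from the usual clip coefficient. The following is a minimal sketch with plain floats; the function name clip_coefficient and the eps value are assumptions for illustration, not part of this package.

```python
import math


def clip_coefficient(total_norm, max_norm, eps=1e-6):
    """Factor by which gradients would be multiplied, capped at 1.0."""
    return min(1.0, max_norm / (total_norm + eps))


# With max_norm = infinity the ratio is infinite, so the cap of 1.0 always
# applies and the gradients are left unchanged.
coef_logging_only = clip_coefficient(total_norm=12.5, max_norm=math.inf)

# With a finite max_norm below the measured norm, gradients are scaled down.
coef_clipping = clip_coefficient(total_norm=12.5, max_norm=1.0)
```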

class modalities.training.gradient_clipping.fsdp_gradient_clipper.FSDP2GradientClipper(wrapped_model, max_norm, norm_type=<enum 'GradientClippingMode'>)[source]

Bases: GradientClipperIF

Clips the gradients of a model wrapped with FSDP2 (FSDPModule).

Initialize the FSDP2GradientClipper object.

Args:

wrapped_model (FSDP2): The wrapped model.
max_norm (float): The maximum norm value for gradient clipping.
norm_type (GradientClippingMode, optional): The type of gradient clipping. Defaults to GradientClippingMode.

Returns: None

Parameters:

wrapped_model (FSDPModule)
max_norm (float)

static clip_grad_norm_(parameters, max_norm, norm_type=2.0, error_if_nonfinite=False, foreach=None)[source]

Clip the gradient norm of an iterable of parameters.

Gradient norm clipping requires computing the gradient norm over the entire model. torch.nn.utils.clip_grad_norm_ only computes gradient norm along DP/FSDP/TP dimensions.

TODO: Pipeline parallelism is not supported yet. It would need to be implemented as in https://github.com/pytorch/torchtitan/blob/b291ad662493b63d25b038a30a915082d3617baf/torchtitan/distributed/utils.py#L245. All code related to pipeline parallelism has been removed for now.

Return type:

Tensor

Parameters:

parameters (Tensor | Iterable[Tensor])
max_norm (float)
norm_type (float)
error_if_nonfinite (bool)
foreach (bool | None)

Args:

parameters: an iterable of Tensors or a single Tensor that will have gradients normalized
max_norm (float): max norm of the gradients
norm_type (float): type of the used p-norm. Can be 'inf' for infinity norm.
error_if_nonfinite (bool): if True, an error is thrown if the total norm of the gradients from parameters is nan, inf, or -inf. Default: False (will switch to True in the future)
foreach (bool): use the faster foreach-based implementation. If None, use the foreach implementation for CUDA and CPU native tensors and silently fall back to the slow implementation for other device types. Default: None

Returns:

Total norm of the parameter gradients (viewed as a single vector).
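
The requirement to compute the norm over the entire model can be illustrated with plain Python. The sketch below shows only the arithmetic of combining per-shard norms into one total norm; the helper names and the two-shard layout are assumptions, and the distributed reduction that the real method relies on is not shown.

```python
import math


def p_norm(values, p):
    """p-norm of a list of floats; p may be math.inf for the max norm."""
    if math.isinf(p):
        return max(abs(v) for v in values)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def combine_local_norms(local_norms, p):
    """Total norm from per-shard norms: the p-norm of the local norms."""
    return p_norm(local_norms, p)


# Two hypothetical shards that together hold the gradient vector [3, 0, 0, 4].
shards = [[3.0, 0.0], [0.0, 4.0]]
flat = [g for shard in shards for g in shard]

totals = {}
for p in (1.0, 2.0, math.inf):
    local_norms = [p_norm(shard, p) for shard in shards]
    totals[p] = combine_local_norms(local_norms, p)
```

For each p, the combined value matches the norm of the concatenated vector, which is the sense in which the gradients are "viewed as a single vector".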

clip_gradients()[source]

Clips the gradients of the wrapped model using the specified maximum norm and norm type.

Return type: Tensor

Returns: torch.Tensor: The total gradient norm, as computed by clip_grad_norm_ before the gradients are scaled.

class modalities.training.gradient_clipping.fsdp_gradient_clipper.FSDP2LoggingOnlyGradientClipper(wrapped_model, norm_type=<enum 'GradientClippingMode'>)[source]

Bases: GradientClipperIF

The FSDP2LoggingOnlyGradientClipper class that is responsible for logging the gradient norms without actually clipping the gradients.

Initialize the FSDP2LoggingOnlyGradientClipper.

Args:

wrapped_model (FSDP2): The wrapped FSDP2 model.
norm_type (GradientClippingMode, optional): The type of gradient clipping. Defaults to GradientClippingMode.

Returns: None

Parameters:

wrapped_model (FSDPModule)

clip_gradients()[source]

Returns the gradient norm, but does not apply clipping since max_norm is set to infinity.

Return type: Tensor

Returns: torch.Tensor: The gradient norm.

class modalities.training.gradient_clipping.fsdp_gradient_clipper.GradientClippingMode(value)[source]

Bases: LookupEnum

Enum class representing different modes of gradient clipping.

Attributes:

P1_NORM (int): Mode for Manhattan norm based clipping.
P2_NORM (int): Mode for Euclidean norm based clipping.
MAX_NORM (str): Mode for maximum norm based clipping.

MAX_NORM = 'inf'

P1_NORM = 1

P2_NORM = 2
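
The enum values line up with the norm_type values that p-norm routines accept (1, 2, or 'inf'). A minimal sketch of that mapping follows. It uses a plain Enum rather than LookupEnum, and both the class name GradientClippingModeSketch and the helper as_norm_type are hypothetical; how the package itself converts the value is not shown in this reference.

```python
from enum import Enum


class GradientClippingModeSketch(Enum):
    """Stand-in mirroring the documented members and values."""

    P1_NORM = 1        # Manhattan norm
    P2_NORM = 2        # Euclidean norm
    MAX_NORM = "inf"   # maximum (infinity) norm


def as_norm_type(mode):
    """Convert a mode to a float p value; float('inf') parses the string."""
    return float(mode.value)


p_values = {mode.name: as_norm_type(mode) for mode in GradientClippingModeSketch}
```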

modalities.training.gradient_clipping.fsdp_gradient_clipper_config module

class modalities.training.gradient_clipping.fsdp_gradient_clipper_config.DummyGradientClipperConfig(**data)[source]

Bases: BaseModel

Configuration class for dummy gradient clipper.

This class is a placeholder and does not have any specific functionality.

Attributes: None
Methods: None

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class modalities.training.gradient_clipping.fsdp_gradient_clipper_config.FSDPDummyGradientClipperConfig(**data)[source]

Bases: BaseModel

Configuration class for FSDP dummy gradient clipper.

Args:

wrapped_model (PydanticPytorchModuleType): The wrapped PyTorch model.
norm_type (GradientClippingMode): The type of gradient clipping to be applied.

Attributes:

wrapped_model (PydanticPytorchModuleType): The wrapped PyTorch model.
norm_type (GradientClippingMode): The type of gradient clipping to be applied.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

wrapped_model (Annotated[Module, modalities.config.pydantic_if_types.PydanticThirdPartyTypeIF])
norm_type (GradientClippingMode)

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

norm_type: GradientClippingMode

wrapped_model: Annotated[Module]

class modalities.training.gradient_clipping.fsdp_gradient_clipper_config.FSDPGradientClipperConfig(**data)[source]

Bases: BaseModel

Configuration class for FSDP gradient clipper.

Args:

max_norm (float): The maximum norm value for gradient clipping.
norm_type (GradientClippingMode): The type of gradient clipping to be applied.
wrapped_model (PydanticPytorchModuleType): The wrapped PyTorch model.

Attributes:

max_norm (float): The maximum norm value for gradient clipping.
norm_type (GradientClippingMode): The type of gradient clipping to be applied.
wrapped_model (PydanticPytorchModuleType): The wrapped PyTorch model.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

max_norm (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Strict(strict=True), Gt(gt=0)])])
norm_type (GradientClippingMode)
wrapped_model (Annotated[Module, modalities.config.pydantic_if_types.PydanticThirdPartyTypeIF])

max_norm: Annotated[float]

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

norm_type: GradientClippingMode

wrapped_model: Annotated[Module]

modalities.training.gradient_clipping.gradient_clipper module

class modalities.training.gradient_clipping.gradient_clipper.GradientClipperIF[source]

Bases: ABC

The GradientClipper interface that defines the methods for clipping gradients.

abstractmethod clip_gradients()[source]

Clip the gradients of the model.

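Return type: Tensor

Returns: torch.Tensor: A tensor defined by the implementation. The clippers in this package return the gradient norm, or -1.0 in the case of DummyGradientClipper.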
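
The interface pattern can be sketched with the standard abc module. This is an illustration only: the class names below are hypothetical stand-ins, and a plain float replaces the Tensor return type so that the example runs without PyTorch.

```python
from abc import ABC, abstractmethod


class ClipperInterfaceSketch(ABC):
    """Stand-in for an interface with a single abstract clip_gradients()."""

    @abstractmethod
    def clip_gradients(self) -> float:
        """Clip the gradients and return a scalar describing the result."""
        raise NotImplementedError


class NoOpClipperSketch(ClipperInterfaceSketch):
    """Applies no clipping and returns -1.0 as a sentinel value."""

    def clip_gradients(self) -> float:
        return -1.0


clipper = NoOpClipperSketch()
reported_norm = clipper.clip_gradients()
# A negative value cannot be a real norm, so callers can detect the sentinel.
clipping_was_applied = reported_norm >= 0.0
```

Because every clipper exposes the same method, a training loop can call clip_gradients() without knowing which variant it was configured with.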
