modalities.models.vision_transformer package

Submodules

modalities.models.vision_transformer.vision_transformer_model module

class modalities.models.vision_transformer.vision_transformer_model.ImagePatchEmbedding(n_img_channels=3, n_embd=768, patch_size=16, patch_stride=16, add_cls_token=True)[source]

Bases: Module

ImagePatchEmbedding class.

Initializes an ImagePatchEmbedding object.

Args:

  • n_img_channels (int): Number of image channels. Defaults to 3.
  • n_embd (int): Number of embedding dimensions. Defaults to 768.
  • patch_size (int): Patch size for convolutional layer. Defaults to 16.
  • patch_stride (int): Patch stride for convolutional layer. Defaults to 16.
  • add_cls_token (bool): Flag indicating whether to add a classification token. Defaults to True.

Returns:

None

Parameters:
  • n_img_channels (int)

  • n_embd (int)

  • patch_size (int)

  • patch_stride (int)

  • add_cls_token (bool)

forward(x)[source]

Forward pass of the ImagePatchEmbedding.

Return type:

Tensor

Parameters:

x (Tensor)

Args:

x (torch.Tensor): Input tensor.

Returns:

torch.Tensor: Output tensor.
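
Example (a minimal sketch, not taken from the library's documentation): with the default configuration above, a 224x224 input split into non-overlapping 16x16 patches gives a 14x14 grid, i.e. 196 patch embeddings, plus one classification token.

import torch

from modalities.models.vision_transformer.vision_transformer_model import ImagePatchEmbedding

# Default configuration: 3 input channels, 768-dim embeddings, 16x16 patches, CLS token prepended.
patch_embedding = ImagePatchEmbedding(
    n_img_channels=3, n_embd=768, patch_size=16, patch_stride=16, add_cls_token=True
)

images = torch.randn(2, 3, 224, 224)  # (batch_size, channels, height, width)
embeddings = patch_embedding(images)

# Assumed output shape: (2, 197, 768), i.e. 196 patches from the 14x14 grid plus the CLS token.
print(embeddings.shape)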

class modalities.models.vision_transformer.vision_transformer_model.VisionTransformer(sample_key, prediction_key, img_size=224, n_classes=1000, n_layer=12, attention_config=None, n_head=8, n_embd=768, ffn_hidden=3072, dropout=0.0, patch_size=16, patch_stride=16, n_img_channels=3, add_cls_token=True, bias=True)[source]

Bases: Module

VisionTransformer class.

The Vision Transformer (ViT) is a pure transformer architecture that applies attention mechanisms directly to sequences of image patches for image classification tasks.

Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Link: https://arxiv.org/abs/2010.11929

Initializes the VisionTransformer object.

Args:

  • sample_key (str): The key for the samples.
  • prediction_key (str): The key for the predictions.
  • img_size (tuple[int, int] | int, optional): The size of the input image. Defaults to 224.
  • n_classes (int, optional): The number of classes. Defaults to 1000.
  • n_layer (int, optional): The number of layers. Defaults to 12.
  • attention_config (AttentionConfig, optional): The attention configuration. Defaults to None.
  • n_head (int, optional): The number of attention heads. Defaults to 8.
  • n_embd (int, optional): The embedding dimension. Defaults to 768.
  • ffn_hidden (int, optional): The hidden dimension of the feed-forward network. Defaults to 3072.
  • dropout (float, optional): The dropout rate. Defaults to 0.0.
  • patch_size (int, optional): The size of the image patch. Defaults to 16.
  • patch_stride (int, optional): The stride of the image patch. Defaults to 16.
  • n_img_channels (int, optional): The number of image channels. Defaults to 3.
  • add_cls_token (bool, optional): Flag indicating whether to add a classification token. Defaults to True.
  • bias (bool, optional): Flag indicating whether to include bias terms. Defaults to True.

Returns:

None

Parameters:
  • sample_key (str)
  • prediction_key (str)
  • img_size (tuple[int, int] | int)
  • n_classes (int)
  • n_layer (int)
  • attention_config (AttentionConfig)
  • n_head (int)
  • n_embd (int)
  • ffn_hidden (int)
  • dropout (float)
  • patch_size (int)
  • patch_stride (int)
  • n_img_channels (int)
  • add_cls_token (bool)
  • bias (bool)

forward(inputs)[source]

Forward pass of the VisionTransformer module.

Return type:

dict[str, Tensor]

Parameters:

inputs (dict[str, Tensor])

Args:

inputs (dict[str, torch.Tensor]): Dictionary containing input tensors.

Returns:

dict[str, torch.Tensor]: Dictionary containing output tensor.
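
Example (a hedged usage sketch): the dictionary keys are whatever sample_key and prediction_key were set to at construction; "images" and "logits" below are illustrative choices, not fixed names, and the output shape assumes the default classification head.

import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformer

# All other constructor arguments keep their documented defaults.
model = VisionTransformer(sample_key="images", prediction_key="logits")

batch = {"images": torch.randn(4, 3, 224, 224)}
outputs = model(batch)

# Assumed output shape: (4, 1000) with the default n_classes=1000.
print(outputs["logits"].shape)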

forward_images(x)[source]

Forward pass for processing images using the VisionTransformer module.

Return type:

Tensor

Parameters:

x (Tensor)

Args:

x (torch.Tensor): Input tensor of shape (batch_size, channels, height, width).

Returns:

torch.Tensor: Output tensor after processing the input images.
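
Example (a sketch under the same assumptions as above): forward_images can be called directly on an image batch; the output shape is an assumption based on the default patching and embedding sizes.

import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformer

vit = VisionTransformer(sample_key="images", prediction_key="logits")
features = vit.forward_images(torch.randn(4, 3, 224, 224))

# Assumed shape with the defaults: (4, 197, 768): 196 patch tokens plus the CLS token, each 768-dim.
print(features.shape)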

class modalities.models.vision_transformer.vision_transformer_model.VisionTransformerBlock(n_embd=768, n_head=8, ffn_hidden=3072, bias=True, dropout=0.0, attention_config=None)[source]

Bases: Module

VisionTransformerBlock class.

Initializes a VisionTransformerBlock object.

Args:

  • n_embd (int, optional): The dimensionality of the embedding layer. Defaults to 768.
  • n_head (int, optional): The number of attention heads. Defaults to 8.
  • ffn_hidden (int, optional): The number of hidden units in the feed-forward network. Defaults to 3072.
  • bias (bool, optional): Flag indicating whether to include bias terms. Defaults to True.
  • dropout (float, optional): The dropout rate. Defaults to 0.0.
  • attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.

Returns:

None

Parameters:
  • n_embd (int)
  • n_head (int)
  • ffn_hidden (int)
  • bias (bool)
  • dropout (float)
  • attention_config (AttentionConfig)

forward(x)[source]

Forward pass of the VisionTransformerBlock module.

Return type:

Tensor

Parameters:

x (Tensor)

Args:

x (torch.Tensor): Input tensor.

Returns:

torch.Tensor: Output tensor.
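
Example (a minimal sketch): a single block applied to a sequence of patch embeddings; the sequence length 197 is just an illustrative value matching the default patching above (196 patches plus a CLS token), and the block is assumed to preserve the input shape.

import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformerBlock

block = VisionTransformerBlock(n_embd=768, n_head=8, ffn_hidden=3072, bias=True, dropout=0.0)

tokens = torch.randn(2, 197, 768)  # (batch_size, sequence_length, n_embd)
out = block(tokens)

# Assumed to preserve the input shape: (2, 197, 768).
print(out.shape)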

class modalities.models.vision_transformer.vision_transformer_model.VisionTransformerConfig(**data)[source]

Bases: BaseModel

Configuration class for the VisionTransformer.

Args:

  • sample_key (str): The key for the input sample.
  • prediction_key (str): The key for the model prediction.
  • img_size (tuple[int, int] | int, optional): The size of the input image. Defaults to 224.
  • n_classes (int, optional): The number of output classes. Defaults to 1000.
  • n_layer (int): The number of layers in the model. Defaults to 12.
  • attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
  • n_head (int): The number of attention heads. Defaults to 8.
  • n_embd (int): The dimensionality of the embedding. Defaults to 768.
  • dropout (float): The dropout rate. Defaults to 0.0.
  • patch_size (int): The size of the image patches. Defaults to 16.
  • patch_stride (int): The stride of the image patches. Defaults to 16.
  • n_img_channels (int): The number of image channels. Defaults to 3.
  • add_cls_token (bool): Flag indicating whether to add a classification token. Defaults to True.
  • bias (bool): Flag indicating whether to include bias terms. Defaults to True.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.


Parameters:
  • sample_key (str)

  • prediction_key (str)

  • img_size (tuple[int, int] | int, >= 1)

  • n_classes (int | None, >= 1)

  • n_layer (int, >= 1)

  • attention_config (AttentionConfig)

  • n_head (int, >= 1)

  • n_embd (int, >= 1)

  • dropout (float, >= 0.0)

  • patch_size (int, >= 1)

  • patch_stride (int, >= 1)

  • n_img_channels (int, >= 1)

  • add_cls_token (bool)

  • bias (bool)

add_cls_token: bool
attention_config: AttentionConfig
bias: bool
dropout: float
img_size: tuple[int, int] | int
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

n_classes: Optional[int]
n_embd: int
n_head: int
n_img_channels: int
n_layer: int
patch_size: int
patch_stride: int
prediction_key: str
sample_key: str
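
Example (a sketch of building the config): the values below mirror the documented defaults, and sample_key/prediction_key are illustrative. attention_config=None follows the "Defaults to None" note above; if your version requires an AttentionConfig instance, pass one instead.

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformerConfig

config = VisionTransformerConfig(
    sample_key="images",
    prediction_key="logits",
    img_size=224,
    n_classes=1000,
    n_layer=12,
    attention_config=None,
    n_head=8,
    n_embd=768,
    dropout=0.0,
    patch_size=16,
    patch_stride=16,
    n_img_channels=3,
    add_cls_token=True,
    bias=True,
)

# Inspect the validated configuration as a plain dictionary.
print(config.model_dump())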

Module contents