modalities.models.vision_transformer package

Submodules

modalities.models.vision_transformer.vision_transformer_model module

class modalities.models.vision_transformer.vision_transformer_model.ImagePatchEmbedding(n_img_channels=3, n_embd=768, patch_size=16, patch_stride=16, add_cls_token=True)[source]

Bases: Module

ImagePatchEmbedding class.

Initializes an ImagePatchEmbedding object.

Args:

  • n_img_channels (int): Number of image channels. Defaults to 3.
  • n_embd (int): Number of embedding dimensions. Defaults to 768.
  • patch_size (int): Patch size for convolutional layer. Defaults to 16.
  • patch_stride (int): Patch stride for convolutional layer. Defaults to 16.
  • add_cls_token (bool): Flag indicating whether to add a classification token. Defaults to True.

Returns:

None

Parameters:
  • n_img_channels (int)

  • n_embd (int)

  • patch_size (int)

  • patch_stride (int)

  • add_cls_token (bool)

forward(x)[source]

Forward pass of the ImagePatchEmbedding.

Return type:

Tensor

Parameters:

x (Tensor)

Args:

x (torch.Tensor): Input tensor.

Returns:

torch.Tensor: Output tensor.
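
Example (a minimal sketch, not taken from the library's documentation): with the default configuration above, a 224x224 input split into non-overlapping 16x16 patches gives a 14x14 grid, i.e. 196 patch embeddings, plus one classification token.

import torch

from modalities.models.vision_transformer.vision_transformer_model import ImagePatchEmbedding

# Default configuration: 3 input channels, 768-dim embeddings, 16x16 patches, CLS token prepended.
patch_embedding = ImagePatchEmbedding(
    n_img_channels=3, n_embd=768, patch_size=16, patch_stride=16, add_cls_token=True
)

images = torch.randn(2, 3, 224, 224)  # (batch_size, channels, height, width)
embeddings = patch_embedding(images)

# Assumed output shape: (2, 197, 768), i.e. 196 patches from the 14x14 grid plus the CLS token.
print(embeddings.shape)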

class modalities.models.vision_transformer.vision_transformer_model.VisionTransformer(sample_key, prediction_key, img_size=224, n_classes=1000, n_layer=12, attention_config=None, n_head=8, n_embd=768, ffn_hidden=3072, dropout=0.0, patch_size=16, patch_stride=16, n_img_channels=3, add_cls_token=True, bias=True)[source]

Bases: Module

VisionTransformer class.

The Vision Transformer (ViT) is a pure transformer architecture that applies attention mechanisms directly to sequences of image patches for image classification tasks.

Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Link: https://arxiv.org/abs/2010.11929

Initializes the VisionTransformer object.

Args:

  • sample_key (str): The key for the samples.
  • prediction_key (str): The key for the predictions.
  • img_size (tuple[int, int] | int, optional): The size of the input image. Defaults to 224.
  • n_classes (int, optional): The number of classes. Defaults to 1000.
  • n_layer (int, optional): The number of layers. Defaults to 12.
  • attention_config (AttentionConfig, optional): The attention configuration. Defaults to None.
  • n_head (int, optional): The number of attention heads. Defaults to 8.
  • n_embd (int, optional): The embedding dimension. Defaults to 768.
  • ffn_hidden (int, optional): The hidden dimension of the feed-forward network. Defaults to 3072.
  • dropout (float, optional): The dropout rate. Defaults to 0.0.
  • patch_size (int, optional): The size of the image patch. Defaults to 16.
  • patch_stride (int, optional): The stride of the image patch. Defaults to 16.
  • n_img_channels (int, optional): The number of image channels. Defaults to 3.
  • add_cls_token (bool, optional): Flag indicating whether to add a classification token. Defaults to True.
  • bias (bool, optional): Flag indicating whether to include bias terms. Defaults to True.

Returns:

None

Parameters:
  • sample_key (str)
  • prediction_key (str)
  • img_size (tuple[int, int] | int)
  • n_classes (int)
  • n_layer (int)
  • attention_config (AttentionConfig)
  • n_head (int)
  • n_embd (int)
  • ffn_hidden (int)
  • dropout (float)
  • patch_size (int)
  • patch_stride (int)
  • n_img_channels (int)
  • add_cls_token (bool)
  • bias (bool)

forward(inputs)[source]

Forward pass of the VisionTransformer module.

Return type:

dict[str, Tensor]

Parameters:

inputs (dict[str, Tensor])

Args:

inputs (dict[str, torch.Tensor]): Dictionary containing input tensors.

Returns:

dict[str, torch.Tensor]: Dictionary containing output tensor.
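
Example (a hedged usage sketch): the dictionary keys are whatever sample_key and prediction_key were set to at construction; "images" and "logits" below are illustrative choices, not fixed names, and the output shape assumes the default classification head.

import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformer

# All other constructor arguments keep their documented defaults.
model = VisionTransformer(sample_key="images", prediction_key="logits")

batch = {"images": torch.randn(4, 3, 224, 224)}
outputs = model(batch)

# Assumed output shape: (4, 1000) with the default n_classes=1000.
print(outputs["logits"].shape)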

forward_images(x)[source]

Forward pass for processing images using the VisionTransformer module.

Return type:

Tensor

Parameters:

x (Tensor)

Args:

x (torch.Tensor): Input tensor of shape (batch_size, channels, height, width).

Returns:

torch.Tensor: Output tensor after processing the input images.
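
Example (a sketch under the same assumptions as above): forward_images can be called directly on an image batch; the output shape is an assumption based on the default patching and embedding sizes.

import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformer

vit = VisionTransformer(sample_key="images", prediction_key="logits")
features = vit.forward_images(torch.randn(4, 3, 224, 224))

# Assumed shape with the defaults: (4, 197, 768): 196 patch tokens plus the CLS token, each 768-dim.
print(features.shape)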

class modalities.models.vision_transformer.vision_transformer_model.VisionTransformerBlock(n_embd=768, n_head=8, ffn_hidden=3072, bias=True, dropout=0.0, attention_config=None)[source]

Bases: Module

VisionTransformerBlock class.

Initializes a VisionTransformerBlock object.

Args:

  • n_embd (int, optional): The dimensionality of the embedding layer. Defaults to 768.
  • n_head (int, optional): The number of attention heads. Defaults to 8.
  • ffn_hidden (int, optional): The number of hidden units in the feed-forward network. Defaults to 3072.
  • bias (bool, optional): Flag indicating whether to include bias terms. Defaults to True.
  • dropout (float, optional): The dropout rate. Defaults to 0.0.
  • attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.

Returns:

None

Parameters:
  • n_embd (int)
  • n_head (int)
  • ffn_hidden (int)
  • bias (bool)
  • dropout (float)
  • attention_config (AttentionConfig)

forward(x)[source]

Forward pass of the VisionTransformerBlock module.

Return type:

Tensor

Parameters:

x (Tensor)

Args:

x (torch.Tensor): Input tensor.

Returns:

torch.Tensor: Output tensor.
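
Example (a minimal sketch): a single block applied to a sequence of patch embeddings; the sequence length 197 is just an illustrative value matching the default patching above (196 patches plus a CLS token), and the block is assumed to preserve the input shape.

import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformerBlock

block = VisionTransformerBlock(n_embd=768, n_head=8, ffn_hidden=3072, bias=True, dropout=0.0)

tokens = torch.randn(2, 197, 768)  # (batch_size, sequence_length, n_embd)
out = block(tokens)

# Assumed to preserve the input shape: (2, 197, 768).
print(out.shape)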

class modalities.models.vision_transformer.vision_transformer_model.VisionTransformerConfig(**data)[source]

Bases: BaseModel

Configuration class for the VisionTransformer.

Args:

  • sample_key (str): The key for the input sample.
  • prediction_key (str): The key for the model prediction.
  • img_size (tuple[int, int] | int, optional): The size of the input image. Defaults to 224.
  • n_classes (int, optional): The number of output classes. Defaults to 1000.
  • n_layer (int): The number of layers in the model. Defaults to 12.
  • attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
  • n_head (int): The number of attention heads. Defaults to 8.
  • n_embd (int): The dimensionality of the embedding. Defaults to 768.
  • dropout (float): The dropout rate. Defaults to 0.0.
  • patch_size (int): The size of the image patches. Defaults to 16.
  • patch_stride (int): The stride of the image patches. Defaults to 16.
  • n_img_channels (int): The number of image channels. Defaults to 3.
  • add_cls_token (bool): Flag indicating whether to add a classification token. Defaults to True.
  • bias (bool): Flag indicating whether to include bias terms. Defaults to True.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.


Parameters:
  • sample_key (str)

  • prediction_key (str)

  • img_size (tuple[int, int] | int, >= 1)

  • n_classes (int | None, >= 1)

  • n_layer (int, >= 1)

  • attention_config (AttentionConfig)

  • n_head (int, >= 1)

  • n_embd (int, >= 1)

  • dropout (float, >= 0.0)

  • patch_size (int, >= 1)

  • patch_stride (int, >= 1)

  • n_img_channels (int, >= 1)

  • add_cls_token (bool)

  • bias (bool)

add_cls_token: bool
attention_config: AttentionConfig
bias: bool
dropout: float
img_size: tuple[int, int] | int
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

n_classes: Optional[int]
n_embd: int
n_head: int
n_img_channels: int
n_layer: int
patch_size: int
patch_stride: int
prediction_key: str
sample_key: str
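
Example (a sketch of building the config): the values below mirror the documented defaults, and sample_key/prediction_key are illustrative. attention_config=None follows the "Defaults to None" note above; if your version requires an AttentionConfig instance, pass one instead.

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformerConfig

config = VisionTransformerConfig(
    sample_key="images",
    prediction_key="logits",
    img_size=224,
    n_classes=1000,
    n_layer=12,
    attention_config=None,
    n_head=8,
    n_embd=768,
    dropout=0.0,
    patch_size=16,
    patch_stride=16,
    n_img_channels=3,
    add_cls_token=True,
    bias=True,
)

# Inspect the validated configuration as a plain dictionary.
print(config.model_dump())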

Module contents