modalities.models.vision_transformer package
Submodules
modalities.models.vision_transformer.vision_transformer_model module
- class modalities.models.vision_transformer.vision_transformer_model.ImagePatchEmbedding(n_img_channels=3, n_embd=768, patch_size=16, patch_stride=16, add_cls_token=True)[source]
Bases:
Module
ImagePatchEmbedding class. Splits an input image into patches via a strided convolutional layer, projects each patch to an embedding vector, and optionally prepends a classification (CLS) token.
Initializes an ImagePatchEmbedding object.
- Args:
n_img_channels (int): Number of image channels. Defaults to 3.
n_embd (int): Number of embedding dimensions. Defaults to 768.
patch_size (int): Patch size for convolutional layer. Defaults to 16.
patch_stride (int): Patch stride for convolutional layer. Defaults to 16.
add_cls_token (bool): Flag indicating whether to add a classification token. Defaults to True.
- Returns:
None
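A minimal usage sketch. The tensor-in, tensor-out forward call and the resulting token count are assumptions inferred from the documented defaults (a 16x16 patch grid over a 224x224 image plus an optional CLS token), not guarantees from the source:

```python
import torch

from modalities.models.vision_transformer.vision_transformer_model import ImagePatchEmbedding

# Patchify a batch of 224x224 RGB images with the documented defaults.
# With patch_size == patch_stride == 16, a 224x224 image yields a 14x14 grid,
# i.e. 196 patches, plus one prepended CLS token when add_cls_token=True.
embedding = ImagePatchEmbedding(
    n_img_channels=3,
    n_embd=768,
    patch_size=16,
    patch_stride=16,
    add_cls_token=True,
)

images = torch.randn(4, 3, 224, 224)  # (batch, channels, height, width)
patch_tokens = embedding(images)      # assumed output shape: (4, 197, 768)
```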
- class modalities.models.vision_transformer.vision_transformer_model.VisionTransformer(sample_key, prediction_key, img_size=224, n_classes=1000, n_layer=12, attention_config=None, n_head=8, n_embd=768, ffn_hidden=3072, dropout=0.0, patch_size=16, patch_stride=16, n_img_channels=3, add_cls_token=True, bias=True)[source]
Bases:
Module
VisionTransformer class.
The Vision Transformer (ViT) is a pure transformer architecture that applies attention mechanisms directly to sequences of image patches for image classification tasks.
Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Link: https://arxiv.org/abs/2010.11929
Initializes the VisionTransformer object.
- Args:
sample_key (str): The key for the samples.
prediction_key (str): The key for the predictions.
img_size (tuple[int, int] | int, optional): The size of the input image. Defaults to 224.
n_classes (int, optional): The number of classes. Defaults to 1000.
n_layer (int, optional): The number of layers. Defaults to 12.
attention_config (AttentionConfig, optional): The attention configuration. Defaults to None.
n_head (int, optional): The number of attention heads. Defaults to 8.
n_embd (int, optional): The embedding dimension. Defaults to 768.
ffn_hidden (int, optional): The hidden dimension of the feed-forward network. Defaults to 3072.
dropout (float, optional): The dropout rate. Defaults to 0.0.
patch_size (int, optional): The size of the image patch. Defaults to 16.
patch_stride (int, optional): The stride of the image patch. Defaults to 16.
n_img_channels (int, optional): The number of image channels. Defaults to 3.
add_cls_token (bool, optional): Flag indicating whether to add a classification token. Defaults to True.
bias (bool, optional): Flag indicating whether to include bias terms. Defaults to True.
- Returns:
None
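A usage sketch for the full model. The dictionary-based forward convention (input keyed by sample_key, output keyed by prediction_key) and the logits shape are assumptions inferred from the constructor arguments, not confirmed by the source:

```python
import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformer

model = VisionTransformer(
    sample_key="images",
    prediction_key="logits",
    img_size=224,
    n_classes=1000,
    n_layer=12,
    n_head=8,
    n_embd=768,
    ffn_hidden=3072,
    dropout=0.0,
    patch_size=16,
    patch_stride=16,
    n_img_channels=3,
    add_cls_token=True,
    bias=True,
)

# Assumed convention: the model reads its input from batch[sample_key] and
# writes its prediction to the output dict under prediction_key.
batch = {"images": torch.randn(2, 3, 224, 224)}
output = model(batch)
logits = output["logits"]  # assumed shape: (2, 1000)
```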
- class modalities.models.vision_transformer.vision_transformer_model.VisionTransformerBlock(n_embd=768, n_head=8, ffn_hidden=3072, bias=True, dropout=0.0, attention_config=None)[source]
Bases:
Module
VisionTransformerBlock class. A single transformer encoder block combining multi-head attention with a feed-forward network.
Initializes a VisionTransformerBlock object.
- Args:
n_embd (int, optional): The dimensionality of the embedding layer. Defaults to 768.
n_head (int, optional): The number of attention heads. Defaults to 8.
ffn_hidden (int, optional): The number of hidden units in the feed-forward network. Defaults to 3072.
bias (bool, optional): Flag indicating whether to include bias terms. Defaults to True.
dropout (float, optional): The dropout rate. Defaults to 0.0.
attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
- Returns:
None
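A sketch of running a single block over a token sequence. That the block maps a (batch, tokens, n_embd) tensor to a tensor of the same shape is an assumption based on standard transformer encoder blocks, not taken from the source:

```python
import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformerBlock

block = VisionTransformerBlock(
    n_embd=768,
    n_head=8,
    ffn_hidden=3072,
    bias=True,
    dropout=0.0,
)

# 197 tokens per image (196 patches + CLS) for a batch of 4 images.
tokens = torch.randn(4, 197, 768)
out = block(tokens)  # assumed to preserve the input shape: (4, 197, 768)
```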
- class modalities.models.vision_transformer.vision_transformer_model.VisionTransformerConfig(**data)[source]
Bases:
BaseModel
Configuration class for the VisionTransformer.
- Args:
sample_key (str): The key for the input sample.
prediction_key (str): The key for the model prediction.
img_size (tuple[int, int] | int, optional): The size of the input image. Defaults to 224.
n_classes (int, optional): The number of output classes. Defaults to 1000.
n_layer (int): The number of layers in the model. Defaults to 12.
attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
n_head (int): The number of attention heads. Defaults to 8.
n_embd (int): The dimensionality of the embedding. Defaults to 768.
dropout (float): The dropout rate. Defaults to 0.0.
patch_size (int): The size of the image patches. Defaults to 16.
patch_stride (int): The stride of the image patches. Defaults to 16.
n_img_channels (int): The number of image channels. Defaults to 3.
add_cls_token (bool): Flag indicating whether to add a classification token. Defaults to True.
bias (bool): Flag indicating whether to include bias terms. Defaults to True.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
sample_key (str)
prediction_key (str)
img_size (tuple[int, int] | int, constrained to values >= 1)
n_classes (int | None, constrained to values >= 1)
n_layer (int, constrained to values >= 1)
attention_config (AttentionConfig)
n_head (int, constrained to values >= 1)
n_embd (int, constrained to values >= 1)
dropout (float, constrained to values >= 0.0)
patch_size (int, constrained to values >= 1)
patch_stride (int, constrained to values >= 1)
n_img_channels (int, constrained to values >= 1)
add_cls_token (bool)
bias (bool)
- attention_config: AttentionConfig
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
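A sketch of constructing and validating the config. It assumes attention_config may be omitted (the Args above document a default of None) and that the validated fields can be unpacked into the VisionTransformer constructor; both points are assumptions, not guarantees from the source:

```python
from pydantic import ValidationError

from modalities.models.vision_transformer.vision_transformer_model import (
    VisionTransformer,
    VisionTransformerConfig,
)

# Pydantic validates the fields on construction; the constraints listed above mean,
# for example, that patch_size must be >= 1 and dropout must be >= 0.0.
config = VisionTransformerConfig(
    sample_key="images",
    prediction_key="logits",
    img_size=224,
    n_classes=1000,
    n_layer=12,
    n_head=8,
    n_embd=768,
    dropout=0.0,
    patch_size=16,
    patch_stride=16,
    n_img_channels=3,
    add_cls_token=True,
    bias=True,
)

# A value that violates a constraint raises pydantic_core.ValidationError.
try:
    VisionTransformerConfig(**{**config.model_dump(), "dropout": -0.5})
except ValidationError as err:
    print(err)

# Illustrative wiring only (assumption): unpack the validated fields into the model.
model = VisionTransformer(**config.model_dump())
```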