modalities.models.vision_transformer package
Submodules
modalities.models.vision_transformer.vision_transformer_model module
- class modalities.models.vision_transformer.vision_transformer_model.ImagePatchEmbedding(n_img_channels=3, n_embd=768, patch_size=16, patch_stride=16, add_cls_token=True)[source]
- Bases: Module
- ImagePatchEmbedding class.
- Initializes an ImagePatchEmbedding object.
- Args:
- n_img_channels (int): Number of image channels. Defaults to 3.
- n_embd (int): Number of embedding dimensions. Defaults to 768.
- patch_size (int): Patch size for convolutional layer. Defaults to 16.
- patch_stride (int): Patch stride for convolutional layer. Defaults to 16.
- add_cls_token (bool): Flag indicating whether to add a classification token. Defaults to True.
- Returns:
- None 
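A minimal usage sketch for the patch embedding, assuming its forward pass takes a batch of images shaped (B, C, H, W) and returns patch tokens shaped (B, num_patches + 1, n_embd) when add_cls_token=True; the forward signature is not documented above, so treat the shapes as illustrative.

```python
import torch

from modalities.models.vision_transformer.vision_transformer_model import ImagePatchEmbedding

# Assumed forward contract: (B, C, H, W) images -> (B, num_patches [+ CLS], n_embd) tokens.
patch_embedding = ImagePatchEmbedding(
    n_img_channels=3,
    n_embd=768,
    patch_size=16,
    patch_stride=16,
    add_cls_token=True,
)

images = torch.randn(2, 3, 224, 224)  # dummy batch of two 224x224 RGB images
tokens = patch_embedding(images)
print(tokens.shape)  # expected: torch.Size([2, 197, 768]) -> (224 / 16)**2 patches + 1 CLS token
```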
- class modalities.models.vision_transformer.vision_transformer_model.VisionTransformer(sample_key, prediction_key, img_size=224, n_classes=1000, n_layer=12, attention_config=None, n_head=8, n_embd=768, ffn_hidden=3072, dropout=0.0, patch_size=16, patch_stride=16, n_img_channels=3, add_cls_token=True, bias=True)[source]
- Bases: Module
- VisionTransformer class.
- The Vision Transformer (ViT) is a pure transformer architecture that applies attention mechanisms directly to sequences of image patches for image classification tasks.
- Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Link: https://arxiv.org/abs/2010.11929
- Initializes the VisionTransformer object.
- Args:
- sample_key (str): The key for the samples.
- prediction_key (str): The key for the predictions.
- img_size (tuple[int, int] | int, optional): The size of the input image. Defaults to 224.
- n_classes (int, optional): The number of classes. Defaults to 1000.
- n_layer (int, optional): The number of layers. Defaults to 12.
- attention_config (AttentionConfig, optional): The attention configuration. Defaults to None.
- n_head (int, optional): The number of attention heads. Defaults to 8.
- n_embd (int, optional): The embedding dimension. Defaults to 768.
- ffn_hidden (int, optional): The hidden dimension of the feed-forward network. Defaults to 3072.
- dropout (float, optional): The dropout rate. Defaults to 0.0.
- patch_size (int, optional): The size of the image patch. Defaults to 16.
- patch_stride (int, optional): The stride of the image patch. Defaults to 16.
- n_img_channels (int, optional): The number of image channels. Defaults to 3.
- add_cls_token (bool, optional): Flag indicating whether to add a classification token. Defaults to True.
- bias (bool, optional): Flag indicating whether to include bias terms. Defaults to True.
- Returns:
- None 
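A minimal inference sketch, assuming the Modalities convention that the model's forward pass consumes a dict of tensors keyed by sample_key and returns a dict holding the class logits under prediction_key; the key names "images" and "logits" are placeholders, and the exact forward contract should be verified against the implementation.

```python
import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformer

# Assumed forward contract: dict keyed by sample_key in, dict keyed by prediction_key out.
model = VisionTransformer(
    sample_key="images",      # placeholder key name
    prediction_key="logits",  # placeholder key name
    img_size=224,
    n_classes=1000,
    n_layer=12,
    n_head=8,
    n_embd=768,
    ffn_hidden=3072,
    dropout=0.0,
    patch_size=16,
    patch_stride=16,
    n_img_channels=3,
    add_cls_token=True,
    bias=True,
)

batch = {"images": torch.randn(4, 3, 224, 224)}  # dummy batch of four RGB images
with torch.no_grad():
    output = model(batch)
print(output["logits"].shape)  # expected: torch.Size([4, 1000])
```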
 
- class modalities.models.vision_transformer.vision_transformer_model.VisionTransformerBlock(n_embd=768, n_head=8, ffn_hidden=3072, bias=True, dropout=0.0, attention_config=None)[source]
- Bases: Module
- VisionTransformerBlock class.
- Initializes a VisionTransformerBlock object.
- Args:
- n_embd (int, optional): The dimensionality of the embedding layer. Defaults to 768.
- n_head (int, optional): The number of attention heads. Defaults to 8.
- ffn_hidden (int, optional): The number of hidden units in the feed-forward network. Defaults to 3072.
- bias (bool, optional): Flag indicating whether to include bias terms. Defaults to True.
- dropout (float, optional): The dropout rate. Defaults to 0.0.
- attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
- Returns:
- None 
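A minimal sketch of running a single block on a token sequence, assuming it maps a (B, seq_len, n_embd) tensor to a tensor of the same shape, as is standard for a ViT encoder block; the forward signature is not documented above.

```python
import torch

from modalities.models.vision_transformer.vision_transformer_model import VisionTransformerBlock

# Assumed forward contract: (B, seq_len, n_embd) tokens in, same shape out
# (self-attention plus feed-forward with residual connections, as in a standard ViT block).
block = VisionTransformerBlock(
    n_embd=768,
    n_head=8,
    ffn_hidden=3072,
    bias=True,
    dropout=0.0,
)

tokens = torch.randn(2, 197, 768)  # e.g. 196 patch tokens + 1 CLS token
out = block(tokens)
print(out.shape)  # expected: torch.Size([2, 197, 768])
```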
- class modalities.models.vision_transformer.vision_transformer_model.VisionTransformerConfig(**data)[source]
- Bases: BaseModel
- Configuration class for the VisionTransformer.
- Args:
- sample_key (str): The key for the input sample.
- prediction_key (str): The key for the model prediction.
- img_size (tuple[int, int] | int, optional): The size of the input image. Defaults to 224.
- n_classes (int, optional): The number of output classes. Defaults to 1000.
- n_layer (int): The number of layers in the model. Defaults to 12.
- attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
- n_head (int): The number of attention heads. Defaults to 8.
- n_embd (int): The dimensionality of the embedding. Defaults to 768.
- dropout (float): The dropout rate. Defaults to 0.0.
- patch_size (int): The size of the image patches. Defaults to 16.
- patch_stride (int): The stride of the image patches. Defaults to 16.
- n_img_channels (int): The number of image channels. Defaults to 3.
- add_cls_token (bool): Flag indicating whether to add a classification token. Defaults to True.
- bias (bool): Flag indicating whether to include bias terms. Defaults to True.
- Create a new model by parsing and validating input data from keyword arguments.
- Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
- self is explicitly positional-only to allow self as a field name.
- Parameters:
- sample_key (str) 
- prediction_key (str) 
- img_size (tuple[int, int] | int, values >= 1)
- n_classes (int >= 1 | None)
- n_layer (int >= 1)
- attention_config (AttentionConfig)
- n_head (int >= 1)
- n_embd (int >= 1)
- dropout (float >= 0.0)
- patch_size (int >= 1)
- patch_stride (int >= 1)
- n_img_channels (int >= 1)
- add_cls_token (bool) 
- bias (bool) 
 
- attention_config: AttentionConfig
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
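A minimal construction sketch for the config, passing the documented defaults explicitly since the field metadata above marks the values as required; attention_config is shown as None per the Args, but if the field is strictly typed as AttentionConfig an actual instance may be needed. The key names "images" and "logits" are placeholders.

```python
from modalities.models.vision_transformer.vision_transformer_model import VisionTransformerConfig

# All values below mirror the documented defaults and are passed explicitly,
# since the pydantic field metadata above marks them as required.
config = VisionTransformerConfig(
    sample_key="images",      # placeholder key name
    prediction_key="logits",  # placeholder key name
    img_size=224,
    n_classes=1000,
    n_layer=12,
    attention_config=None,    # documented default; may require an AttentionConfig instance instead
    n_head=8,
    n_embd=768,
    dropout=0.0,
    patch_size=16,
    patch_stride=16,
    n_img_channels=3,
    add_cls_token=True,
    bias=True,
)
print(config.model_dump())  # pydantic v2 serialization of the validated config
```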