modalities.models.coca package
Submodules
modalities.models.coca.attention_pooling module
- class modalities.models.coca.attention_pooling.AttentionPooling(n_embd, n_head, bias, epsilon, attention_config=None)[source]
Bases: Module
Attention pooling class.
Initializes an instance of the AttentionPooling class.
- Args:
n_embd (int): The size of the embeddings.
n_head (int): The number of attention heads.
bias (bool): Flag indicating whether to include bias in the layer normalization.
epsilon (float): A small value to avoid division by zero in layer normalization.
attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
- Returns:
None
- Parameters:
n_embd (int)
n_head (int)
bias (bool)
epsilon (float)
attention_config (AttentionConfig)
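Example. A minimal construction sketch; the tensor shapes and the commented-out forward() call are assumptions for illustration, since the module's forward signature is not documented here:

    import torch
    from modalities.models.coca.attention_pooling import AttentionPooling

    pooler = AttentionPooling(n_embd=768, n_head=12, bias=False, epsilon=1e-5)

    # Assumed usage: a small set of learned query tokens attends over the
    # full token sequence, pooling it into a fixed number of outputs.
    queries = torch.randn(2, 8, 768)    # (batch, n_queries, n_embd)
    context = torch.randn(2, 196, 768)  # (batch, seq_len, n_embd)
    # pooled = pooler(queries, context)  # exact call signature may differ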
modalities.models.coca.coca_model module
- class modalities.models.coca.coca_model.CoCa(prediction_key, vision_cls_prediction_key, text_cls_prediction_key, vision_embd_prediction_key, text_embd_prediction_key, n_vision_queries, n_pool_head, bias_attn_pool, epsilon_attn_pool, vision_encoder_config, text_decoder_config)[source]
Bases: NNModel
CoCa model
The Contrastive Captioner (CoCa) is an encoder-decoder model that integrates the concepts of CLIP and generative models such as SimVLM by using contrastive and captioning losses for training.
Paper: CoCa: Contrastive Captioners are Image-Text Foundation Models
Link: https://arxiv.org/abs/2205.01917
Initializes the CoCa model object.
- Args:
prediction_key (str): The key for the predictions.
vision_cls_prediction_key (str): The key for the vision cls token.
text_cls_prediction_key (str): The key for the text cls token.
vision_embd_prediction_key (str): The key for the vision embeddings.
text_embd_prediction_key (str): The key for the text embeddings.
n_vision_queries (int): The number of vision queries.
n_pool_head (int): The number of pool heads.
bias_attn_pool (bool): Flag indicating whether to use bias in attention pooling.
epsilon_attn_pool (float): The epsilon value for attention pooling.
vision_encoder_config (VisionTransformerConfig): The configuration for the vision encoder.
text_decoder_config (TextDecoderConfig): The configuration for the text decoder.
- Returns:
None
- Parameters:
prediction_key (str)
vision_cls_prediction_key (str)
text_cls_prediction_key (str)
vision_embd_prediction_key (str)
text_embd_prediction_key (str)
n_vision_queries (int)
n_pool_head (int)
bias_attn_pool (bool)
epsilon_attn_pool (float)
vision_encoder_config (VisionTransformerConfig)
text_decoder_config (TextDecoderConfig)
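Example. A hedged construction sketch: the key names are arbitrary placeholders, and the two config objects are elided because their fields are documented under TextDecoderConfig below and in the vision transformer module:

    from modalities.models.coca.coca_model import CoCa

    # vision_cfg: a VisionTransformerConfig instance (fields documented elsewhere)
    # text_cfg: a TextDecoderConfig instance (see TextDecoderConfig below)
    model = CoCa(
        prediction_key="logits",
        vision_cls_prediction_key="vision_cls",
        text_cls_prediction_key="text_cls",
        vision_embd_prediction_key="vision_embeddings",
        text_embd_prediction_key="text_embeddings",
        n_vision_queries=256,
        n_pool_head=8,
        bias_attn_pool=False,
        epsilon_attn_pool=1e-5,
        vision_encoder_config=vision_cfg,
        text_decoder_config=text_cfg,
    )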
- class modalities.models.coca.coca_model.CoCaConfig(**data)[source]
Bases: BaseModel
Configuration class for CoCa model.
- Args:
prediction_key (str): The key for the predictions.
vision_embd_prediction_key (str): The key for the vision embeddings.
text_embd_prediction_key (str): The key for the text embeddings.
vision_cls_prediction_key (str): The key for the vision cls token.
text_cls_prediction_key (str): The key for the text cls token.
vision_encoder_config (VisionTransformerConfig): Configuration for the vision encoder.
text_decoder_config (TextDecoderConfig): Configuration for the text decoder.
n_pool_head (int): Number of attention heads for pooling.
n_vision_queries (int): Number of vision queries.
bias_attn_pool (bool): Flag indicating whether to use bias in attention pooling.
epsilon_attn_pool (float): Epsilon value for attention pooling.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
prediction_key (str)
vision_embd_prediction_key (str)
text_embd_prediction_key (str)
vision_cls_prediction_key (str)
text_cls_prediction_key (str)
vision_encoder_config (VisionTransformerConfig)
text_decoder_config (TextDecoderConfig)
n_pool_head (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
n_vision_queries (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
bias_attn_pool (bool)
epsilon_attn_pool (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- text_decoder_config: TextDecoderConfig
- vision_encoder_config: VisionTransformerConfig
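Example. Because CoCaConfig is a pydantic model, keyword arguments (e.g. parsed from a YAML config) are validated against the constraints listed above. A minimal sketch, with the nested configs elided:

    from pydantic import ValidationError
    from modalities.models.coca.coca_model import CoCaConfig

    raw = dict(
        prediction_key="logits",
        vision_embd_prediction_key="vision_embeddings",
        text_embd_prediction_key="text_embeddings",
        vision_cls_prediction_key="vision_cls",
        text_cls_prediction_key="text_cls",
        vision_encoder_config=vision_cfg,  # VisionTransformerConfig (elided)
        text_decoder_config=text_cfg,      # TextDecoderConfig (see below)
        n_pool_head=8,           # must be >= 1
        n_vision_queries=256,    # must be >= 1
        bias_attn_pool=False,
        epsilon_attn_pool=1e-5,  # must be >= 0.0
    )
    config = CoCaConfig(**raw)

    try:
        CoCaConfig(**{**raw, "n_pool_head": 0})  # violates the Ge(ge=1) constraint
    except ValidationError:
        pass  # constraint violations surface as pydantic ValidationError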
- class modalities.models.coca.coca_model.TextDecoderConfig(**data)[source]
Bases: BaseModel
Configuration class for the TextDecoder.
- Args:
sample_key (str): The key for the samples.
prediction_key (str): The key for the predictions.
block_size (int): The block size. Must be greater than or equal to 1.
vocab_size (int): The vocabulary size. Must be greater than or equal to 1.
n_layer_text (int): The number of layers for processing text. Must be greater than or equal to 1.
n_layer_multimodal_text (int): The number of layers for processing multimodal text. Must be greater than or equal to 1.
n_head (int): The number of attention heads. Must be greater than or equal to 1.
n_embd (int): The embedding size. Must be greater than or equal to 1.
ffn_hidden (int): The hidden size for the feed-forward network. Must be greater than or equal to 1.
dropout (float): The dropout rate. Must be greater than or equal to 0.0.
bias (bool): Flag indicating whether to include bias in the model.
attention_config (AttentionConfig): The attention configuration.
activation (ActivationType): The activation type.
epsilon (float): The epsilon value. Must be greater than or equal to 0.0.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
sample_key (str)
prediction_key (str)
block_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
vocab_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
n_layer_text (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
n_layer_multimodal_text (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
n_head (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
n_embd (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
ffn_hidden (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])
dropout (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])
bias (bool)
attention_config (AttentionConfig)
activation (ActivationType)
epsilon (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])
- activation: ActivationType
- attention_config: AttentionConfig
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
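Example. A hedged sketch; the ActivationType member name, its import path, and the AttentionConfig instance are assumptions, since neither type is documented in this module:

    from modalities.models.coca.coca_model import TextDecoderConfig

    text_cfg = TextDecoderConfig(
        sample_key="input_ids",
        prediction_key="logits",
        block_size=256,
        vocab_size=50304,
        n_layer_text=6,
        n_layer_multimodal_text=6,
        n_head=8,
        n_embd=512,
        ffn_hidden=2048,
        dropout=0.1,
        bias=True,
        attention_config=attention_cfg,  # an AttentionConfig instance (elided)
        activation=ActivationType.GELU,  # assumed member name
        epsilon=1e-5,
    )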
modalities.models.coca.collator module
- class modalities.models.coca.collator.CoCaCollateFnConfig(**data)[source]
Bases: BaseModel
Configuration class for CoCaCollateFn.
- Args:
sample_keys (list[str]): List of sample keys.
target_keys (list[str]): List of target keys.
text_sample_key (str): Key for the text samples.
text_target_key (str): Key for the text targets.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
sample_keys (list[str])
target_keys (list[str])
text_sample_key (str)
text_target_key (str)
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- class modalities.models.coca.collator.CoCaCollatorFn(sample_keys, target_keys, text_sample_key, text_target_key)[source]
Bases: CollateFnIF
Collator function for CoCa model.
Initializes the CoCaCollatorFn object.
- Args:
sample_keys (list[str]): List of sample keys.
target_keys (list[str]): List of target keys.
text_sample_key (str): Key for the text samples.
text_target_key (str): Key for the text targets.
- Raises:
ValueError: If text_sample_key is not part of sample_keys.
ValueError: If text_target_key is part of target_keys.
- Returns:
None
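Example. An illustrative sketch; the key names are placeholders. Note the constraints from the Raises section: text_sample_key must appear in sample_keys, while text_target_key must not already appear in target_keys:

    from modalities.models.coca.collator import CoCaCollatorFn

    collate_fn = CoCaCollatorFn(
        sample_keys=["images", "input_ids"],
        target_keys=[],
        text_sample_key="input_ids",
        text_target_key="target_ids",
    )
    # Typically handed to a torch DataLoader, e.g.:
    # DataLoader(dataset, batch_size=32, collate_fn=collate_fn)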
modalities.models.coca.multi_modal_decoder module
- class modalities.models.coca.multi_modal_decoder.MultiModalTextDecoder(sample_key, prediction_key, block_size, vocab_size, n_layer, n_head, n_embd, ffn_hidden, dropout, bias, activation, epsilon, attention_config)[source]
Bases: NNModel
MultiModalTextDecoder class.
Initializes the MultiModalTextDecoder object.
- Args:
sample_key (str): The key for the input samples.
prediction_key (str): The key for the predictions.
block_size (int): The size of the blocks.
vocab_size (int): The size of the vocabulary.
n_layer (int): The number of layers.
n_head (int): The number of attention heads.
n_embd (int): The dimension of the embeddings.
ffn_hidden (int): The size of the feed-forward network hidden layer.
dropout (float): The dropout rate.
bias (bool): Flag indicating whether to include bias terms.
activation (ActivationType): The activation function to use.
epsilon (float): The epsilon value for layer normalization.
attention_config (AttentionConfig): The attention configuration.
- Returns:
None
- Parameters:
sample_key (str)
prediction_key (str)
block_size (int)
vocab_size (int)
n_layer (int)
n_head (int)
n_embd (int)
ffn_hidden (int)
dropout (float)
bias (bool)
activation (ActivationType)
epsilon (float)
attention_config (AttentionConfig)
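Example. A hedged construction sketch; the ActivationType member and the AttentionConfig instance are assumptions. Presumably n_layer here corresponds to n_layer_multimodal_text in TextDecoderConfig, i.e. the decoder layers that also attend to the image features:

    from modalities.models.coca.multi_modal_decoder import MultiModalTextDecoder

    decoder = MultiModalTextDecoder(
        sample_key="input_ids",
        prediction_key="logits",
        block_size=256,
        vocab_size=50304,
        n_layer=6,
        n_head=8,
        n_embd=512,
        ffn_hidden=2048,
        dropout=0.1,
        bias=True,
        activation=ActivationType.GELU,  # assumed member name
        epsilon=1e-5,
        attention_config=attention_cfg,  # an AttentionConfig instance (elided)
    )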
- class modalities.models.coca.multi_modal_decoder.TransformerBlock(n_embd, bias, epsilon, activation, n_head, dropout, ffn_hidden, with_context, attention_type, attention_config=None, add_extra_mlp=False)[source]
Bases: Module
Transformer block class.
Initializes the TransformerBlock object.
- Args:
n_embd (int): The size of the embeddings.
bias (bool): Flag indicating whether to include bias terms.
epsilon (float): Small value to avoid division by zero in LayerNorm.
activation (ActivationType): The type of activation function to use.
n_head (int): The number of attention heads.
dropout (float): The dropout rate.
ffn_hidden (int): The number of hidden units in the feed-forward network.
with_context (bool): Flag indicating whether to include context in the decoder.
attention_type (AttentionType): The type of attention mechanism to use.
attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
add_extra_mlp (bool, optional): Flag indicating whether to add an extra MLP layer. Defaults to False.
- Parameters:
n_embd (int)
bias (bool)
epsilon (float)
activation (ActivationType)
n_head (int)
dropout (float)
ffn_hidden (int)
with_context (bool)
attention_type (AttentionType)
attention_config (AttentionConfig)
add_extra_mlp (bool)
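Example. A hedged sketch; the ActivationType and AttentionType member names are assumptions, and their import paths are not documented in this module. with_context=True enables the cross-attention context path described above:

    from modalities.models.coca.multi_modal_decoder import TransformerBlock

    block = TransformerBlock(
        n_embd=512,
        bias=True,
        epsilon=1e-5,
        activation=ActivationType.GELU,  # assumed member name
        n_head=8,
        dropout=0.1,
        ffn_hidden=2048,
        with_context=True,  # include context (cross-attention) in the decoder
        attention_type=AttentionType.CAUSAL_SELF_ATTENTION,  # assumed member name
    )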
modalities.models.coca.text_decoder module
- class modalities.models.coca.text_decoder.TextDecoder(sample_key, prediction_key, block_size, vocab_size, n_layer, n_head, n_embd, ffn_hidden, dropout, bias, activation, epsilon, attention_config=None)[source]
Bases: NNModel
TextDecoder class.
Initializes the TextDecoder class.
- Args:
sample_key (str): The key for the samples.
prediction_key (str): The key for the predictions.
block_size (int): The block size.
vocab_size (int): The size of the vocabulary.
n_layer (int): The number of layers.
n_head (int): The number of attention heads.
n_embd (int): The embedding dimension.
ffn_hidden (int): The hidden dimension of the feed-forward network.
dropout (float): The dropout rate.
bias (bool): Flag indicating whether to include bias terms.
activation (ActivationType): The activation function to use.
epsilon (float): Small value to avoid division by zero in LayerNorm.
attention_config (AttentionConfig, optional): The attention configuration. Defaults to None.
- Parameters:
sample_key (str)
prediction_key (str)
block_size (int)
vocab_size (int)
n_layer (int)
n_head (int)
n_embd (int)
ffn_hidden (int)
dropout (float)
bias (bool)
activation (ActivationType)
epsilon (float)
attention_config (AttentionConfig)
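Example. A hedged construction sketch; the ActivationType member name is an assumption. Presumably n_layer here corresponds to n_layer_text in TextDecoderConfig, i.e. the text-only decoder layers that precede the multimodal ones:

    from modalities.models.coca.text_decoder import TextDecoder

    decoder = TextDecoder(
        sample_key="input_ids",
        prediction_key="logits",
        block_size=256,
        vocab_size=50304,
        n_layer=6,
        n_head=8,
        n_embd=512,
        ffn_hidden=2048,
        dropout=0.1,
        bias=True,
        activation=ActivationType.GELU,  # assumed member name
        epsilon=1e-5,
    )  # attention_config defaults to None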