modalities.models.coca package
Submodules
modalities.models.coca.attention_pooling module
- class modalities.models.coca.attention_pooling.AttentionPooling(n_embd, n_head, bias, epsilon, attention_config=None)[source]
- Bases: Module
- Attention pooling class (usage sketch below).
- Initializes an instance of the AttentionPooling class.
- Args:
- n_embd (int): The size of the embeddings.
- n_head (int): The number of attention heads.
- bias (bool): Flag indicating whether to include bias in the layer normalization.
- epsilon (float): A small value to avoid division by zero in layer normalization.
- attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
- Returns:
- None 
 - Parameters:
- n_embd (int) 
- n_head (int) 
- bias (bool) 
- epsilon (float) 
- attention_config (AttentionConfig) 
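A minimal usage sketch, assuming standard PyTorch tensors. The constructor arguments match the signature above; the forward interface of AttentionPooling is not documented in this section, so the final call is an assumption and left commented out.

```python
import torch

from modalities.models.coca.attention_pooling import AttentionPooling

# Construct the pooling module with the documented arguments.
pool = AttentionPooling(n_embd=768, n_head=12, bias=True, epsilon=1e-5)

# In CoCa-style attention pooling, learned query tokens attend over a sequence
# of vision embeddings to produce a fixed number of pooled output tokens.
queries = torch.randn(1, 8, 768)    # hypothetical learned pooling queries
context = torch.randn(1, 196, 768)  # hypothetical ViT patch embeddings
# pooled = pool(queries, context)   # assumed call shape; verify before use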
 
 
modalities.models.coca.coca_model module
- class modalities.models.coca.coca_model.CoCa(prediction_key, vision_cls_prediction_key, text_cls_prediction_key, vision_embd_prediction_key, text_embd_prediction_key, n_vision_queries, n_pool_head, bias_attn_pool, epsilon_attn_pool, vision_encoder_config, text_decoder_config)[source]
- Bases: NNModel
- CoCa model (output-key sketch below).
- The Contrastive Captioner (CoCa) is an encoder-decoder model that integrates the concepts of CLIP and generative models such as SimVLM by using contrastive and captioning losses for training.
- Paper: CoCa: Contrastive Captioners are Image-Text Foundation Models. Link: https://arxiv.org/abs/2205.01917
- Initializes the CoCa model object.
- Args:
- prediction_key (str): The key for the predictions.
- vision_cls_prediction_key (str): The key for the vision cls token.
- text_cls_prediction_key (str): The key for the text cls token.
- vision_embd_prediction_key (str): The key for the vision embeddings.
- text_embd_prediction_key (str): The key for the text embeddings.
- n_vision_queries (int): The number of vision queries.
- n_pool_head (int): The number of pool heads.
- bias_attn_pool (bool): Flag indicating whether to use bias in attention pooling.
- epsilon_attn_pool (float): The epsilon value for attention pooling.
- vision_encoder_config (VisionTransformerConfig): The configuration for the vision encoder.
- text_decoder_config (TextDecoderConfig): The configuration for the text decoder.
- Returns:
- None 
 - Parameters:
- prediction_key (str) 
- vision_cls_prediction_key (str) 
- text_cls_prediction_key (str) 
- vision_embd_prediction_key (str) 
- text_embd_prediction_key (str) 
- n_vision_queries (int) 
- n_pool_head (int) 
- bias_attn_pool (bool) 
- epsilon_attn_pool (float) 
- vision_encoder_config (VisionTransformerConfig) 
- text_decoder_config (TextDecoderConfig) 
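A hedged sketch of how the documented prediction keys partition a CoCa forward output: the two cls keys feed the contrastive loss, while prediction_key holds the captioning logits. The output-dict layout and the default key names are assumptions based on the Args above, not a documented contract.

```python
import torch

# Hypothetical helper illustrating the key convention; `out` stands for the
# dict returned by a CoCa forward pass.
def split_coca_outputs(
    out: dict[str, torch.Tensor],
    prediction_key: str = "logits",
    vision_cls_prediction_key: str = "vision_cls",
    text_cls_prediction_key: str = "text_cls",
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # Captioning logits for the generative loss, plus the two cls embeddings
    # for the contrastive loss.
    return (
        out[prediction_key],
        out[vision_cls_prediction_key],
        out[text_cls_prediction_key],
    )
```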
 
 
- class modalities.models.coca.coca_model.CoCaConfig(**data)[source]
- Bases: BaseModel
- Configuration class for the CoCa model (example below).
- Args:
- prediction_key (str): The key for the predictions.
- vision_embd_prediction_key (str): The key for the vision embeddings.
- text_embd_prediction_key (str): The key for the text embeddings.
- vision_cls_prediction_key (str): The key for the vision cls token.
- text_cls_prediction_key (str): The key for the text cls token.
- vision_encoder_config (VisionTransformerConfig): Configuration for the vision encoder.
- text_decoder_config (TextDecoderConfig): Configuration for the text decoder.
- n_pool_head (int): Number of attention heads for pooling.
- n_vision_queries (int): Number of vision queries.
- bias_attn_pool (bool): Flag indicating whether to use bias in attention pooling.
- epsilon_attn_pool (float): Epsilon value for attention pooling.
- Create a new model by parsing and validating input data from keyword arguments.
- Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
- self is explicitly positional-only to allow self as a field name.
- Parameters:
- prediction_key (str) 
- vision_embd_prediction_key (str) 
- text_embd_prediction_key (str) 
- vision_cls_prediction_key (str) 
- text_cls_prediction_key (str) 
- vision_encoder_config (VisionTransformerConfig) 
- text_decoder_config (TextDecoderConfig) 
- n_pool_head (int, >= 1)
- n_vision_queries (int, >= 1)
- bias_attn_pool (bool)
- epsilon_attn_pool (float, >= 0.0)
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- text_decoder_config: TextDecoderConfig
- vision_encoder_config: VisionTransformerConfig
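A sketch of constructing the config. Key names and numeric values are illustrative only, and vision_encoder_config / text_decoder_config are placeholders for pre-built VisionTransformerConfig and TextDecoderConfig instances (the latter is documented below).

```python
from modalities.models.coca.coca_model import CoCaConfig

config = CoCaConfig(
    prediction_key="logits",
    vision_embd_prediction_key="vision_embeddings",
    text_embd_prediction_key="text_embeddings",
    vision_cls_prediction_key="vision_cls",
    text_cls_prediction_key="text_cls",
    vision_encoder_config=vision_encoder_config,  # placeholder: assumed pre-built instance
    text_decoder_config=text_decoder_config,      # placeholder: assumed pre-built instance
    n_pool_head=8,           # validated: must be >= 1
    n_vision_queries=256,    # validated: must be >= 1
    bias_attn_pool=False,
    epsilon_attn_pool=1e-5,  # validated: must be >= 0.0
)
```

Being a pydantic BaseModel, the config raises a ValidationError if a constraint is violated, e.g. if n_pool_head were 0.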
 
- class modalities.models.coca.coca_model.TextDecoderConfig(**data)[source]
- Bases: BaseModel
- Configuration class for the TextDecoder (example below).
- Args:
- sample_key (str): The key for the samples.
- prediction_key (str): The key for the predictions.
- block_size (int): The block size. Must be greater than or equal to 1.
- vocab_size (int): The vocabulary size. Must be greater than or equal to 1.
- n_layer_text (int): The number of layers for processing text. Must be greater than or equal to 1.
- n_layer_multimodal_text (int): The number of layers for processing multimodal text. Must be greater than or equal to 1.
- n_head (int): The number of attention heads. Must be greater than or equal to 1.
- n_embd (int): The embedding size. Must be greater than or equal to 1.
- ffn_hidden (int): The hidden size for the feed-forward network. Must be greater than or equal to 1.
- dropout (float): The dropout rate. Must be greater than or equal to 0.0.
- bias (bool): Flag indicating whether to include bias in the model.
- attention_config (AttentionConfig): The attention configuration.
- activation (ActivationType): The activation type.
- epsilon (float): The epsilon value. Must be greater than or equal to 0.0.
- Create a new model by parsing and validating input data from keyword arguments.
- Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
- self is explicitly positional-only to allow self as a field name.
- Parameters:
- sample_key (str) 
- prediction_key (str) 
- block_size (int, >= 1)
- vocab_size (int, >= 1)
- n_layer_text (int, >= 1)
- n_layer_multimodal_text (int, >= 1)
- n_head (int, >= 1)
- n_embd (int, >= 1)
- ffn_hidden (int, >= 1)
- dropout (float, >= 0.0)
- bias (bool) 
- attention_config (AttentionConfig) 
- activation (ActivationType) 
- epsilon (float, >= 0.0)
 
- activation: ActivationType
- attention_config: AttentionConfig
- model_config: ClassVar[ConfigDict] = {}
- Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
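An illustrative instantiation; all field names and constraints come from the Args above, while attention_config and activation are placeholders for an AttentionConfig instance and an ActivationType member defined elsewhere in modalities.

```python
from modalities.models.coca.coca_model import TextDecoderConfig

text_decoder_config = TextDecoderConfig(
    sample_key="input_ids",
    prediction_key="logits",
    block_size=1024,            # >= 1
    vocab_size=50304,           # >= 1
    n_layer_text=6,             # >= 1
    n_layer_multimodal_text=6,  # >= 1
    n_head=8,                   # >= 1
    n_embd=512,                 # >= 1
    ffn_hidden=2048,            # >= 1
    dropout=0.1,                # >= 0.0
    bias=True,
    attention_config=attention_config,  # placeholder: assumed AttentionConfig
    activation=activation,              # placeholder: assumed ActivationType member
    epsilon=1e-5,               # >= 0.0
)
```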
 
modalities.models.coca.collator module
- class modalities.models.coca.collator.CoCaCollateFnConfig(**data)[source]
- Bases: BaseModel
- Configuration class for CoCaCollateFn (example below).
- Args:
- sample_keys (list[str]): List of sample keys.
- target_keys (list[str]): List of target keys.
- text_sample_key (str): Key for the text samples.
- text_target_key (str): Key for the text targets.
- Create a new model by parsing and validating input data from keyword arguments.
- Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
- self is explicitly positional-only to allow self as a field name.
- Parameters:
- sample_keys (list[str])
- target_keys (list[str])
- text_sample_key (str)
- text_target_key (str)
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
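An illustrative config; the key names are hypothetical. Note the constraints documented for CoCaCollatorFn below: text_sample_key must appear in sample_keys, while text_target_key must not appear in target_keys.

```python
from modalities.models.coca.collator import CoCaCollateFnConfig

cfg = CoCaCollateFnConfig(
    sample_keys=["images", "input_ids"],
    target_keys=[],
    text_sample_key="input_ids",
    text_target_key="target_ids",
)
```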
 
- class modalities.models.coca.collator.CoCaCollatorFn(sample_keys, target_keys, text_sample_key, text_target_key)[source]
- Bases: CollateFnIF
- Collator function for the CoCa model (usage sketch below).
- Initializes the CoCaCollatorFn object.
- Args:
- sample_keys (list[str]): List of sample keys.
- target_keys (list[str]): List of target keys.
- text_sample_key (str): Key for the text samples.
- text_target_key (str): Key for the text targets.
- Raises:
- ValueError: If text_sample_key is not part of sample_keys.
- ValueError: If text_target_key is part of target_keys.
- Returns:
- None 
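A usage sketch based on the documented constructor and its ValueError conditions. The per-sample batch layout and the collated output format are assumptions, so the final call is left commented.

```python
import torch

from modalities.models.coca.collator import CoCaCollatorFn

collate_fn = CoCaCollatorFn(
    sample_keys=["images", "input_ids"],  # must contain text_sample_key
    target_keys=[],                       # must NOT contain text_target_key
    text_sample_key="input_ids",
    text_target_key="target_ids",
)

# Hypothetical raw batch of per-sample dicts keyed by the sample keys.
batch = [
    {"images": torch.randn(3, 224, 224), "input_ids": torch.randint(0, 100, (16,))},
    {"images": torch.randn(3, 224, 224), "input_ids": torch.randint(0, 100, (16,))},
]
# collated = collate_fn(batch)  # assumed input layout; verify before use
```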
 
modalities.models.coca.multi_modal_decoder module
- class modalities.models.coca.multi_modal_decoder.MultiModalTextDecoder(sample_key, prediction_key, block_size, vocab_size, n_layer, n_head, n_embd, ffn_hidden, dropout, bias, activation, epsilon, attention_config)[source]
- Bases: NNModel
- MultiModalTextDecoder class (example below).
- Initializes the MultiModalTextDecoder object.
- Args:
- sample_key (str): The key for the input samples.
- prediction_key (str): The key for the predictions.
- block_size (int): The size of the blocks.
- vocab_size (int): The size of the vocabulary.
- n_layer (int): The number of layers.
- n_head (int): The number of attention heads.
- n_embd (int): The dimension of the embeddings.
- ffn_hidden (int): The size of the feed-forward network hidden layer.
- dropout (float): The dropout rate.
- bias (bool): Flag indicating whether to include bias terms.
- activation (ActivationType): The activation function to use.
- epsilon (float): The epsilon value for layer normalization.
- attention_config (AttentionConfig): The attention configuration.
- Returns:
- None 
- Parameters:
- sample_key (str)
- prediction_key (str)
- block_size (int)
- vocab_size (int)
- n_layer (int)
- n_head (int)
- n_embd (int)
- ffn_hidden (int)
- dropout (float)
- bias (bool)
- activation (ActivationType)
- epsilon (float)
- attention_config (AttentionConfig)
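An illustrative instantiation mirroring the signature above; activation and attention_config are placeholders for an ActivationType member and an AttentionConfig instance defined elsewhere in modalities, and all numeric values are illustrative.

```python
from modalities.models.coca.multi_modal_decoder import MultiModalTextDecoder

decoder = MultiModalTextDecoder(
    sample_key="input_ids",
    prediction_key="logits",
    block_size=1024,
    vocab_size=50304,
    n_layer=6,
    n_head=8,
    n_embd=512,
    ffn_hidden=2048,
    dropout=0.1,
    bias=True,
    activation=activation,              # placeholder: assumed ActivationType member
    epsilon=1e-5,
    attention_config=attention_config,  # placeholder: assumed AttentionConfig
)
```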
 
- class modalities.models.coca.multi_modal_decoder.TransformerBlock(n_embd, bias, epsilon, activation, n_head, dropout, ffn_hidden, with_context, attention_type, attention_config=None, add_extra_mlp=False)[source]
- Bases: Module
- Transformer block class (example below).
- Initializes the TransformerBlock object.
- Args:
- n_embd (int): The size of the embeddings.
- bias (bool): Flag indicating whether to include bias terms.
- epsilon (float): Small value to avoid division by zero in LayerNorm.
- activation (ActivationType): The type of activation function to use.
- n_head (int): The number of attention heads.
- dropout (float): The dropout rate.
- ffn_hidden (int): The number of hidden units in the feed-forward network.
- with_context (bool): Flag indicating whether to include context in the decoder.
- attention_type (AttentionType): The type of attention mechanism to use.
- attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.
- add_extra_mlp (bool, optional): Flag indicating whether to add an extra MLP layer. Defaults to False.
 - Parameters:
- n_embd (int) 
- bias (bool) 
- epsilon (float) 
- activation (ActivationType) 
- n_head (int) 
- dropout (float) 
- ffn_hidden (int) 
- with_context (bool) 
- attention_type (AttentionType) 
- attention_config (AttentionConfig) 
- add_extra_mlp (bool) 
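A sketch of the two block flavors implied by the with_context flag, as used in a CoCa-style decoder; activation and attention_type are placeholders for enum members defined elsewhere in modalities, and all numeric values are illustrative.

```python
from modalities.models.coca.multi_modal_decoder import TransformerBlock

# Text-only block: self-attention without cross-attention context.
text_block = TransformerBlock(
    n_embd=512, bias=True, epsilon=1e-5, activation=activation,
    n_head=8, dropout=0.1, ffn_hidden=2048,
    with_context=False,
    attention_type=attention_type,  # placeholder: assumed AttentionType member
)

# Multimodal block: additionally attends to context (e.g. vision tokens).
multimodal_block = TransformerBlock(
    n_embd=512, bias=True, epsilon=1e-5, activation=activation,
    n_head=8, dropout=0.1, ffn_hidden=2048,
    with_context=True,
    attention_type=attention_type,
    add_extra_mlp=False,
)
```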
 
 
modalities.models.coca.text_decoder module
- class modalities.models.coca.text_decoder.TextDecoder(sample_key, prediction_key, block_size, vocab_size, n_layer, n_head, n_embd, ffn_hidden, dropout, bias, activation, epsilon, attention_config=None)[source]
- Bases: NNModel
- TextDecoder class (example below).
- Initializes the TextDecoder class.
- Args:
- sample_key (str): The key for the samples.
- prediction_key (str): The key for the predictions.
- block_size (int): The block size.
- vocab_size (int): The size of the vocabulary.
- n_layer (int): The number of layers.
- n_head (int): The number of attention heads.
- n_embd (int): The embedding dimension.
- ffn_hidden (int): The hidden dimension of the feed-forward network.
- dropout (float): The dropout rate.
- bias (bool): Flag indicating whether to include bias terms.
- activation (ActivationType): The activation function to use.
- epsilon (float): Small value to avoid division by zero in LayerNorm.
- attention_config (AttentionConfig, optional): The attention configuration. Defaults to None.
- Parameters:
- sample_key (str)
- prediction_key (str)
- block_size (int)
- vocab_size (int)
- n_layer (int)
- n_head (int)
- n_embd (int)
- ffn_hidden (int)
- dropout (float)
- bias (bool)
- activation (ActivationType)
- epsilon (float)
- attention_config (AttentionConfig)
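An illustrative instantiation; activation is a placeholder for an ActivationType member defined elsewhere in modalities, and the commented forward call assumes the NNModel contract of keying inputs by sample_key.

```python
import torch

from modalities.models.coca.text_decoder import TextDecoder

text_decoder = TextDecoder(
    sample_key="input_ids",
    prediction_key="logits",
    block_size=1024,
    vocab_size=50304,
    n_layer=6,
    n_head=8,
    n_embd=512,
    ffn_hidden=2048,
    dropout=0.1,
    bias=True,
    activation=activation,  # placeholder: assumed ActivationType member
    epsilon=1e-5,
    attention_config=None,  # optional per the signature above
)

# tokens = torch.randint(0, 50304, (2, 16))
# out = text_decoder({"input_ids": tokens})  # assumed forward contract
```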