modalities.models.coca package

Submodules

modalities.models.coca.attention_pooling module

class modalities.models.coca.attention_pooling.AttentionPooling(n_embd, n_head, bias, epsilon, attention_config=None)[source]

Bases: Module

Attention pooling class.

Initializes an instance of the AttentionPooling class.

Args:

n_embd (int): The size of the embeddings.

n_head (int): The number of attention heads.

bias (bool): Flag indicating whether to include bias in the layer normalization.

epsilon (float): A small value to avoid division by zero in layer normalization.

attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.

Returns:

None

Parameters:
  • n_embd (int)

  • n_head (int)

  • bias (bool)

  • epsilon (float)

  • attention_config (AttentionConfig, optional)

forward(queries, context)[source]

Forward pass of the attention pooling module.

Return type:

Tensor

Parameters:
  • queries (Tensor)

  • context (Tensor)

Args:

queries (torch.Tensor): The input queries tensor.

context (torch.Tensor): The input context tensor.

Returns:

torch.Tensor: The output tensor.
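
A minimal usage sketch follows. The tensor shapes, the batch-first layout, and relying on the default attention_config are assumptions for illustration; they are not specified by the documentation above.

    import torch

    from modalities.models.coca.attention_pooling import AttentionPooling

    # Sketch only: sizes and shapes below are assumptions, not documented values.
    n_embd, n_head = 768, 12
    pool = AttentionPooling(n_embd=n_embd, n_head=n_head, bias=True, epsilon=1e-5)

    # Assumed layout: learned queries (batch, n_queries, n_embd) attend over a
    # context sequence (batch, seq_len, n_embd), e.g. vision-token embeddings.
    queries = torch.randn(4, 1, n_embd)
    context = torch.randn(4, 196, n_embd)
    pooled = pool(queries, context)  # torch.Tensor produced by the forward pass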

modalities.models.coca.coca_model module

class modalities.models.coca.coca_model.CoCa(prediction_key, vision_cls_prediction_key, text_cls_prediction_key, vision_embd_prediction_key, text_embd_prediction_key, n_vision_queries, n_pool_head, bias_attn_pool, epsilon_attn_pool, vision_encoder_config, text_decoder_config)[source]

Bases: NNModel

CoCa model

The Contrastive Captioner (CoCa) is an encoder-decoder model that integrates the concepts of CLIP and generative models such as SimVLM by using contrastive and captioning losses for training.

Paper: CoCa: Contrastive Captioners are Image-Text Foundation Models

Link: https://arxiv.org/abs/2205.01917

Initializes the CoCa object.

Args:

prediction_key (str): The key for the predictions.

vision_cls_prediction_key (str): The key for the vision cls token.

text_cls_prediction_key (str): The key for the text cls token.

vision_embd_prediction_key (str): The key for the vision embeddings.

text_embd_prediction_key (str): The key for the text embeddings.

n_vision_queries (int): The number of vision queries.

n_pool_head (int): The number of pool heads.

bias_attn_pool (bool): Flag indicating whether to use bias in attention pooling.

epsilon_attn_pool (float): The epsilon value for attention pooling.

vision_encoder_config (VisionTransformerConfig): The configuration for the vision encoder.

text_decoder_config (TextDecoderConfig): The configuration for the text decoder.

Returns:

None

Parameters:
  • prediction_key (str)

  • vision_cls_prediction_key (str)

  • text_cls_prediction_key (str)

  • vision_embd_prediction_key (str)

  • text_embd_prediction_key (str)

  • n_vision_queries (int)

  • n_pool_head (int)

  • bias_attn_pool (bool)

  • epsilon_attn_pool (float)

  • vision_encoder_config (VisionTransformerConfig)

  • text_decoder_config (TextDecoderConfig)

forward(inputs)[source]

Forward pass of the CoCa model.

Return type:

dict[str, Tensor]

Parameters:

inputs (dict[str, Tensor])

Args:

inputs (dict[str, torch.Tensor]): Input dictionary containing the tensors.

Returns:

dict[str, torch.Tensor]: Output dictionary.
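
The forward outputs feed the two objectives described above (contrastive and captioning). The following self-contained sketch shows, in plain PyTorch, how such losses are typically combined; the key names, shapes, and temperature are assumptions, and this is not the loss code used by Modalities.

    import torch
    import torch.nn.functional as F

    # Illustrative stand-ins for the forward output; key names and shapes are assumed.
    outputs = {
        "logits": torch.randn(2, 256, 50_000),  # captioning logits (prediction_key)
        "vision_cls": torch.randn(2, 512),      # pooled image embedding (vision_cls_prediction_key)
        "text_cls": torch.randn(2, 512),        # pooled text embedding (text_cls_prediction_key)
    }
    text_targets = torch.randint(0, 50_000, (2, 256))

    # Captioning loss: next-token cross-entropy over the decoder logits.
    captioning_loss = F.cross_entropy(outputs["logits"].flatten(0, 1), text_targets.flatten())

    # Contrastive loss: CLIP-style symmetric InfoNCE on the normalized cls embeddings.
    img = F.normalize(outputs["vision_cls"], dim=-1)
    txt = F.normalize(outputs["text_cls"], dim=-1)
    sim = img @ txt.t() / 0.07  # temperature 0.07 is an assumed value
    labels = torch.arange(sim.size(0))
    contrastive_loss = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

    loss = captioning_loss + contrastive_loss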

class modalities.models.coca.coca_model.CoCaConfig(**data)[source]

Bases: BaseModel

Configuration class for CoCa model.

Args:

prediction_key (str): The key for the predictions.

vision_embd_prediction_key (str): The key for the vision embeddings.

text_embd_prediction_key (str): The key for the text embeddings.

vision_cls_prediction_key (str): The key for the vision cls token.

text_cls_prediction_key (str): The key for the text cls token.

vision_encoder_config (VisionTransformerConfig): Configuration for the vision encoder.

text_decoder_config (TextDecoderConfig): Configuration for the text decoder.

n_pool_head (int): Number of attention heads for pooling.

n_vision_queries (int): Number of vision queries.

bias_attn_pool (bool): Flag indicating whether to use bias in attention pooling.

epsilon_attn_pool (float): Epsilon value for attention pooling.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:
  • prediction_key (str)

  • vision_embd_prediction_key (str)

  • text_embd_prediction_key (str)

  • vision_cls_prediction_key (str)

  • text_cls_prediction_key (str)

  • vision_encoder_config (VisionTransformerConfig)

  • text_decoder_config (TextDecoderConfig)

  • n_pool_head (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_vision_queries (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • bias_attn_pool (bool)

  • epsilon_attn_pool (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])

bias_attn_pool: bool
epsilon_attn_pool: Annotated[float]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

n_pool_head: Annotated[int]
n_vision_queries: Annotated[int]
prediction_key: str
text_cls_prediction_key: str
text_decoder_config: TextDecoderConfig
text_embd_prediction_key: str
vision_cls_prediction_key: str
vision_embd_prediction_key: str
vision_encoder_config: VisionTransformerConfig
class modalities.models.coca.coca_model.TextDecoderConfig(**data)[source]

Bases: BaseModel

Configuration class for the TextDecoder.

Args:

sample_key (str): The key for the samples.

prediction_key (str): The key for the predictions.

block_size (int): The block size. Must be greater than or equal to 1.

vocab_size (int): The vocabulary size. Must be greater than or equal to 1.

n_layer_text (int): The number of layers for processing text. Must be greater than or equal to 1.

n_layer_multimodal_text (int): The number of layers for processing multimodal text. Must be greater than or equal to 1.

n_head (int): The number of attention heads. Must be greater than or equal to 1.

n_embd (int): The embedding size. Must be greater than or equal to 1.

ffn_hidden (int): The hidden size for the feed-forward network. Must be greater than or equal to 1.

dropout (float): The dropout rate. Must be greater than or equal to 0.0.

bias (bool): Flag indicating whether to include bias in the model.

attention_config (AttentionConfig): The attention configuration.

activation (ActivationType): The activation type.

epsilon (float): The epsilon value. Must be greater than or equal to 0.0.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:
  • sample_key (str)

  • prediction_key (str)

  • block_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • vocab_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_layer_text (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_layer_multimodal_text (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_head (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_embd (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • ffn_hidden (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • dropout (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])

  • bias (bool)

  • attention_config (AttentionConfig)

  • activation (ActivationType)

  • epsilon (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])

activation: ActivationType
attention_config: AttentionConfig
bias: bool
block_size: Annotated[int]
dropout: Annotated[float]
epsilon: Annotated[float]
ffn_hidden: Annotated[int]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

n_embd: Annotated[int]
n_head: Annotated[int]
n_layer_multimodal_text: Annotated[int]
n_layer_text: Annotated[int]
prediction_key: str
sample_key: str
vocab_size: Annotated[int]

modalities.models.coca.collator module

class modalities.models.coca.collator.CoCaCollateFnConfig(**data)[source]

Bases: BaseModel

Configuration class for CoCaCollateFn.

Args:

sample_keys (list[str]): List of sample keys.

target_keys (list[str]): List of target keys.

text_sample_key (str): Key for the text samples.

text_target_key (str): Key for the text targets.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:
  • sample_keys (list[str])

  • target_keys (list[str])

  • text_sample_key (str)

  • text_target_key (str)

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

sample_keys: list[str]
target_keys: list[str]
text_sample_key: str
text_target_key: str
class modalities.models.coca.collator.CoCaCollatorFn(sample_keys, target_keys, text_sample_key, text_target_key)[source]

Bases: CollateFnIF

Collator function for CoCa model.

Initializes the CoCaCollatorFn object.

Args:

sample_keys (list[str]): List of sample keys.

target_keys (list[str]): List of target keys.

text_sample_key (str): Key for the text samples.

text_target_key (str): Key for the text targets.

Raises:

ValueError: If text_sample_key is not part of sample_keys.

ValueError: If text_target_key is part of target_keys.

Returns:

None

Parameters:
  • sample_keys (list[str])

  • target_keys (list[str])

  • text_sample_key (str)

  • text_target_key (str)
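
For orientation, here is a conceptual sketch in plain PyTorch of what a CoCa-style collate step produces: images are stacked, and the text targets are derived from the text samples by shifting them one token, as is common in captioning setups. The key names are assumptions, and this is not the CoCaCollatorFn implementation.

    import torch

    def collate(batch: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
        # Stack the per-sample tensors into batch tensors.
        images = torch.stack([item["images"] for item in batch])
        text = torch.stack([item["input_ids"] for item in batch])
        return {
            "images": images,            # vision samples
            "input_ids": text[:, :-1],   # text samples: all tokens except the last
            "target_ids": text[:, 1:],   # text targets: shifted by one token
        }

    batch = [
        {"images": torch.randn(3, 224, 224), "input_ids": torch.randint(0, 50_000, (257,))}
        for _ in range(4)
    ]
    out = collate(batch)  # images: (4, 3, 224, 224); input_ids/target_ids: (4, 256)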

modalities.models.coca.multi_modal_decoder module

class modalities.models.coca.multi_modal_decoder.MultiModalTextDecoder(sample_key, prediction_key, block_size, vocab_size, n_layer, n_head, n_embd, ffn_hidden, dropout, bias, activation, epsilon, attention_config)[source]

Bases: NNModel

MultiModalTextDecoder class.

Initializes the MultiModalTextDecoder object.

Args:

sample_key (str): The key for the input samples.

prediction_key (str): The key for the predictions.

block_size (int): The size of the blocks.

vocab_size (int): The size of the vocabulary.

n_layer (int): The number of layers.

n_head (int): The number of attention heads.

n_embd (int): The dimension of the embeddings.

ffn_hidden (int): The size of the feed-forward network hidden layer.

dropout (float): The dropout rate.

bias (bool): Flag indicating whether to include bias terms.

activation (ActivationType): The activation function to use.

epsilon (float): The epsilon value for layer normalization.

attention_config (AttentionConfig): The attention configuration.

Returns:

None

Parameters:
  • sample_key (str)

  • prediction_key (str)

  • block_size (int)

  • vocab_size (int)

  • n_layer (int)

  • n_head (int)

  • n_embd (int)

  • ffn_hidden (int)

  • dropout (float)

  • bias (bool)

  • activation (ActivationType)

  • epsilon (float)

  • attention_config (AttentionConfig)

forward(inputs)[source]

Forward pass of the MultiModalTextDecoder module.

Return type:

dict[str, Tensor]

Parameters:

inputs (dict[str, Tensor])

Args:

inputs (dict[str, torch.Tensor]): Input dictionary containing the input tensors.

Returns:

dict[str, torch.Tensor]: Output dictionary containing the output logits tensor.

class modalities.models.coca.multi_modal_decoder.TransformerBlock(n_embd, bias, epsilon, activation, n_head, dropout, ffn_hidden, with_context, attention_type, attention_config=None, add_extra_mlp=False)[source]

Bases: Module

Transformer block class.

Initializes the TransformerBlock object.

Args:

n_embd (int): The size of the embeddings.

bias (bool): Flag indicating whether to include bias terms.

epsilon (float): Small value to avoid division by zero in LayerNorm.

activation (ActivationType): The type of activation function to use.

n_head (int): The number of attention heads.

dropout (float): The dropout rate.

ffn_hidden (int): The number of hidden units in the feed-forward network.

with_context (bool): Flag indicating whether to include context in the decoder.

attention_type (AttentionType): The type of attention mechanism to use.

attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.

add_extra_mlp (bool, optional): Flag indicating whether to add an extra MLP layer. Defaults to False.

Parameters:
  • n_embd (int)

  • bias (bool)

  • epsilon (float)

  • activation (ActivationType)

  • n_head (int)

  • dropout (float)

  • ffn_hidden (int)

  • with_context (bool)

  • attention_type (AttentionType)

  • attention_config (AttentionConfig, optional)

  • add_extra_mlp (bool, optional)

forward(x, context=None)[source]

Forward pass of the TransformerBlock module.

Return type:

Tensor

Parameters:
  • x (Tensor)

  • context (Tensor, optional)

Args:

x (torch.Tensor): Input tensor.

context (torch.Tensor, optional): Context tensor. Defaults to None.

Returns:

torch.Tensor: Output tensor.
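
To make the role of with_context concrete, the following self-contained sketch implements the same pattern in plain PyTorch: a block that always applies self-attention and an MLP, and additionally cross-attends to a context sequence when with_context is set. It is an illustration of the pattern only, not the library's TransformerBlock (which also takes attention_type, attention_config, add_extra_mlp, and a configurable activation).

    from typing import Optional

    import torch
    from torch import nn

    class Block(nn.Module):
        """Conceptual decoder block: self-attention, optional cross-attention, MLP."""

        def __init__(self, n_embd: int, n_head: int, with_context: bool):
            super().__init__()
            self.with_context = with_context
            self.ln1 = nn.LayerNorm(n_embd)
            self.self_attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
            if with_context:
                self.ln_ctx = nn.LayerNorm(n_embd)
                self.cross_attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
            self.ln2 = nn.LayerNorm(n_embd)
            self.mlp = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
            )

        def forward(self, x: torch.Tensor, context: Optional[torch.Tensor] = None) -> torch.Tensor:
            h = self.ln1(x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]  # causal mask omitted for brevity
            if self.with_context and context is not None:
                h = self.ln_ctx(x)
                x = x + self.cross_attn(h, context, context, need_weights=False)[0]
            return x + self.mlp(self.ln2(x))

    tokens = torch.randn(2, 16, 64)          # text token embeddings
    image_tokens = torch.randn(2, 196, 64)   # context, e.g. pooled image tokens
    out = Block(n_embd=64, n_head=8, with_context=True)(tokens, image_tokens)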

modalities.models.coca.text_decoder module

class modalities.models.coca.text_decoder.TextDecoder(sample_key, prediction_key, block_size, vocab_size, n_layer, n_head, n_embd, ffn_hidden, dropout, bias, activation, epsilon, attention_config=None)[source]

Bases: NNModel

TextDecoder class.

Initializes the TextDecoder class.

Args:

sample_key (str): The key for the samples.

prediction_key (str): The key for the predictions.

block_size (int): The block size.

vocab_size (int): The size of the vocabulary.

n_layer (int): The number of layers.

n_head (int): The number of attention heads.

n_embd (int): The embedding dimension.

ffn_hidden (int): The hidden dimension of the feed-forward network.

dropout (float): The dropout rate.

bias (bool): Flag indicating whether to include bias terms.

activation (ActivationType): The activation function to use.

epsilon (float): Small value to avoid division by zero in LayerNorm.

attention_config (AttentionConfig, optional): The attention configuration. Defaults to None.

Parameters:
  • sample_key (str)

  • prediction_key (str)

  • block_size (int)

  • vocab_size (int)

  • n_layer (int)

  • n_head (int)

  • n_embd (int)

  • ffn_hidden (int)

  • dropout (float)

  • bias (bool)

  • activation (ActivationType)

  • epsilon (float)

  • attention_config (AttentionConfig, optional)

forward(inputs)[source]

Forward pass of the TextDecoder module.

Return type:

dict[str, Tensor]

Parameters:

inputs (dict[str, Tensor])

Args:

inputs (dict[str, torch.Tensor]): Input dictionary.

Returns:

dict[str, torch.Tensor]: Output dictionary containing the predictions.
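
The decoders in this package share a dict-in / dict-out convention: forward reads the tensor stored under sample_key and returns its result under prediction_key. A minimal self-contained sketch of that convention follows; the toy architecture and key names are assumptions, not the TextDecoder implementation.

    import torch
    from torch import nn

    class ToyTextDecoder(nn.Module):
        """Toy model illustrating the sample_key -> prediction_key convention."""

        def __init__(self, sample_key: str, prediction_key: str, vocab_size: int, n_embd: int):
            super().__init__()
            self.sample_key = sample_key
            self.prediction_key = prediction_key
            self.tok_emb = nn.Embedding(vocab_size, n_embd)
            self.lm_head = nn.Linear(n_embd, vocab_size)

        def forward(self, inputs: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
            hidden = self.tok_emb(inputs[self.sample_key])
            return {self.prediction_key: self.lm_head(hidden)}

    decoder = ToyTextDecoder(sample_key="input_ids", prediction_key="logits", vocab_size=50_000, n_embd=64)
    out = decoder({"input_ids": torch.randint(0, 50_000, (2, 16))})
    # out["logits"] has shape (2, 16, 50_000)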

Module contents