modalities.models.coca package

Submodules

modalities.models.coca.attention_pooling module

class modalities.models.coca.attention_pooling.AttentionPooling(n_embd, n_head, bias, epsilon, attention_config=None)[source]

Bases: Module

Attention pooling class.

Initializes an instance of the AttentionPooling class.

Args:

n_embd (int): The size of the embeddings.

n_head (int): The number of attention heads.

bias (bool): Flag indicating whether to include bias in the layer normalization.

epsilon (float): A small value to avoid division by zero in layer normalization.

attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.

Returns:

None

Parameters:
  • n_embd (int)

  • n_head (int)

  • bias (bool)

  • epsilon (float)

  • attention_config (AttentionConfig, optional)

forward(queries, context)[source]

Forward pass of the attention pooling module.

Return type:

Tensor

Parameters:
  • queries (Tensor)

  • context (Tensor)

Args:

queries (torch.Tensor): The input queries tensor.

context (torch.Tensor): The input context tensor.

Returns:

torch.Tensor: The output tensor.
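
A minimal usage sketch follows. The tensor shapes, the batch-first layout, and relying on the default attention_config are assumptions for illustration; they are not specified by the documentation above.

    import torch

    from modalities.models.coca.attention_pooling import AttentionPooling

    # Sketch only: sizes and shapes below are assumptions, not documented values.
    n_embd, n_head = 768, 12
    pool = AttentionPooling(n_embd=n_embd, n_head=n_head, bias=True, epsilon=1e-5)

    # Assumed layout: learned queries (batch, n_queries, n_embd) attend over a
    # context sequence (batch, seq_len, n_embd), e.g. vision-token embeddings.
    queries = torch.randn(4, 1, n_embd)
    context = torch.randn(4, 196, n_embd)
    pooled = pool(queries, context)  # torch.Tensor produced by the forward pass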

modalities.models.coca.coca_model module

class modalities.models.coca.coca_model.CoCa(prediction_key, vision_cls_prediction_key, text_cls_prediction_key, vision_embd_prediction_key, text_embd_prediction_key, n_vision_queries, n_pool_head, bias_attn_pool, epsilon_attn_pool, vision_encoder_config, text_decoder_config)[source]

Bases: NNModel

CoCa model

The Contrastive Captioner (CoCa) is an encoder-decoder model that integrates the concepts of CLIP and generative models such as SimVLM by using contrastive and captioning losses for training.

Paper: CoCa: Contrastive Captioners are Image-Text Foundation Models

Link: https://arxiv.org/abs/2205.01917

Initializes the CoCa object.

Args:

prediction_key (str): The key for the predictions.

vision_cls_prediction_key (str): The key for the vision cls token.

text_cls_prediction_key (str): The key for the text cls token.

vision_embd_prediction_key (str): The key for the vision embeddings.

text_embd_prediction_key (str): The key for the text embeddings.

n_vision_queries (int): The number of vision queries.

n_pool_head (int): The number of pool heads.

bias_attn_pool (bool): Flag indicating whether to use bias in attention pooling.

epsilon_attn_pool (float): The epsilon value for attention pooling.

vision_encoder_config (VisionTransformerConfig): The configuration for the vision encoder.

text_decoder_config (TextDecoderConfig): The configuration for the text decoder.

Returns:

None

Parameters:
  • prediction_key (str)

  • vision_cls_prediction_key (str)

  • text_cls_prediction_key (str)

  • vision_embd_prediction_key (str)

  • text_embd_prediction_key (str)

  • n_vision_queries (int)

  • n_pool_head (int)

  • bias_attn_pool (bool)

  • epsilon_attn_pool (float)

  • vision_encoder_config (VisionTransformerConfig)

  • text_decoder_config (TextDecoderConfig)

forward(inputs)[source]

Forward pass of the CoCa model.

Return type:

dict[str, Tensor]

Parameters:

inputs (dict[str, Tensor])

Args:

inputs (dict[str, torch.Tensor]): Input dictionary containing the tensors.

Returns:

dict[str, torch.Tensor]: Output dictionary.
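
The forward outputs feed the two objectives described above (contrastive and captioning). The following self-contained sketch shows, in plain PyTorch, how such losses are typically combined; the key names, shapes, and temperature are assumptions, and this is not the loss code used by Modalities.

    import torch
    import torch.nn.functional as F

    # Illustrative stand-ins for the forward output; key names and shapes are assumed.
    outputs = {
        "logits": torch.randn(2, 256, 50_000),  # captioning logits (prediction_key)
        "vision_cls": torch.randn(2, 512),      # pooled image embedding (vision_cls_prediction_key)
        "text_cls": torch.randn(2, 512),        # pooled text embedding (text_cls_prediction_key)
    }
    text_targets = torch.randint(0, 50_000, (2, 256))

    # Captioning loss: next-token cross-entropy over the decoder logits.
    captioning_loss = F.cross_entropy(outputs["logits"].flatten(0, 1), text_targets.flatten())

    # Contrastive loss: CLIP-style symmetric InfoNCE on the normalized cls embeddings.
    img = F.normalize(outputs["vision_cls"], dim=-1)
    txt = F.normalize(outputs["text_cls"], dim=-1)
    sim = img @ txt.t() / 0.07  # temperature 0.07 is an assumed value
    labels = torch.arange(sim.size(0))
    contrastive_loss = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

    loss = captioning_loss + contrastive_loss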

class modalities.models.coca.coca_model.CoCaConfig(**data)[source]

Bases: BaseModel

Configuration class for CoCa model.

Args:

prediction_key (str): The key for the predictions.

vision_embd_prediction_key (str): The key for the vision embeddings.

text_embd_prediction_key (str): The key for the text embeddings.

vision_cls_prediction_key (str): The key for the vision cls token.

text_cls_prediction_key (str): The key for the text cls token.

vision_encoder_config (VisionTransformerConfig): Configuration for the vision encoder.

text_decoder_config (TextDecoderConfig): Configuration for the text decoder.

n_pool_head (int): Number of attention heads for pooling.

n_vision_queries (int): Number of vision queries.

bias_attn_pool (bool): Flag indicating whether to use bias in attention pooling.

epsilon_attn_pool (float): Epsilon value for attention pooling.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:
  • prediction_key (str)

  • vision_embd_prediction_key (str)

  • text_embd_prediction_key (str)

  • vision_cls_prediction_key (str)

  • text_cls_prediction_key (str)

  • vision_encoder_config (VisionTransformerConfig)

  • text_decoder_config (TextDecoderConfig)

  • n_pool_head (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_vision_queries (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • bias_attn_pool (bool)

  • epsilon_attn_pool (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])

bias_attn_pool: bool
epsilon_attn_pool: Annotated[float]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

n_pool_head: Annotated[int]
n_vision_queries: Annotated[int]
prediction_key: str
text_cls_prediction_key: str
text_decoder_config: TextDecoderConfig
text_embd_prediction_key: str
vision_cls_prediction_key: str
vision_embd_prediction_key: str
vision_encoder_config: VisionTransformerConfig
class modalities.models.coca.coca_model.TextDecoderConfig(**data)[source]

Bases: BaseModel

Configuration class for the TextDecoder.

Args:

sample_key (str): The key for the samples.

prediction_key (str): The key for the predictions.

block_size (int): The block size. Must be greater than or equal to 1.

vocab_size (int): The vocabulary size. Must be greater than or equal to 1.

n_layer_text (int): The number of layers for processing text. Must be greater than or equal to 1.

n_layer_multimodal_text (int): The number of layers for processing multimodal text. Must be greater than or equal to 1.

n_head (int): The number of attention heads. Must be greater than or equal to 1.

n_embd (int): The embedding size. Must be greater than or equal to 1.

ffn_hidden (int): The hidden size for the feed-forward network. Must be greater than or equal to 1.

dropout (float): The dropout rate. Must be greater than or equal to 0.0.

bias (bool): Flag indicating whether to include bias in the model.

attention_config (AttentionConfig): The attention configuration.

activation (ActivationType): The activation type.

epsilon (float): The epsilon value. Must be greater than or equal to 0.0.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:
  • sample_key (str)

  • prediction_key (str)

  • block_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • vocab_size (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_layer_text (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_layer_multimodal_text (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_head (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • n_embd (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • ffn_hidden (Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=1)])])

  • dropout (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])

  • bias (bool)

  • attention_config (AttentionConfig)

  • activation (ActivationType)

  • epsilon (Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0.0)])])

activation: ActivationType
attention_config: AttentionConfig
bias: bool
block_size: Annotated[int]
dropout: Annotated[float]
epsilon: Annotated[float]
ffn_hidden: Annotated[int]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

n_embd: Annotated[int]
n_head: Annotated[int]
n_layer_multimodal_text: Annotated[int]
n_layer_text: Annotated[int]
prediction_key: str
sample_key: str
vocab_size: Annotated[int]

modalities.models.coca.collator module

class modalities.models.coca.collator.CoCaCollateFnConfig(**data)[source]

Bases: BaseModel

Configuration class for CoCaCollateFn.

Args:

sample_keys (list[str]): List of sample keys.

target_keys (list[str]): List of target keys.

text_sample_key (str): Key for the text samples.

text_target_key (str): Key for the text targets.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:
  • sample_keys (list[str])

  • target_keys (list[str])

  • text_sample_key (str)

  • text_target_key (str)

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

sample_keys: list[str]
target_keys: list[str]
text_sample_key: str
text_target_key: str
class modalities.models.coca.collator.CoCaCollatorFn(sample_keys, target_keys, text_sample_key, text_target_key)[source]

Bases: CollateFnIF

Collator function for CoCa model.

Initializes the CoCaCollatorFn object.

Args:

sample_keys (list[str]): List of sample keys.

target_keys (list[str]): List of target keys.

text_sample_key (str): Key for the text samples.

text_target_key (str): Key for the text targets.

Raises:

ValueError: If text_sample_key is not part of sample_keys.

ValueError: If text_target_key is part of target_keys.

Returns:

None

Parameters:
  • sample_keys (list[str])

  • target_keys (list[str])

  • text_sample_key (str)

  • text_target_key (str)
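
For orientation, here is a conceptual sketch in plain PyTorch of what a CoCa-style collate step produces: images are stacked, and the text targets are derived from the text samples by shifting them one token, as is common in captioning setups. The key names are assumptions, and this is not the CoCaCollatorFn implementation.

    import torch

    def collate(batch: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
        # Stack the per-sample tensors into batch tensors.
        images = torch.stack([item["images"] for item in batch])
        text = torch.stack([item["input_ids"] for item in batch])
        return {
            "images": images,            # vision samples
            "input_ids": text[:, :-1],   # text samples: all tokens except the last
            "target_ids": text[:, 1:],   # text targets: shifted by one token
        }

    batch = [
        {"images": torch.randn(3, 224, 224), "input_ids": torch.randint(0, 50_000, (257,))}
        for _ in range(4)
    ]
    out = collate(batch)  # images: (4, 3, 224, 224); input_ids/target_ids: (4, 256)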

modalities.models.coca.multi_modal_decoder module

class modalities.models.coca.multi_modal_decoder.MultiModalTextDecoder(sample_key, prediction_key, block_size, vocab_size, n_layer, n_head, n_embd, ffn_hidden, dropout, bias, activation, epsilon, attention_config)[source]

Bases: NNModel

MultiModalTextDecoder class.

Initializes the MultiModalTextDecoder object.

Args:

sample_key (str): The key for the input samples.

prediction_key (str): The key for the predictions.

block_size (int): The size of the blocks.

vocab_size (int): The size of the vocabulary.

n_layer (int): The number of layers.

n_head (int): The number of attention heads.

n_embd (int): The dimension of the embeddings.

ffn_hidden (int): The size of the feed-forward network hidden layer.

dropout (float): The dropout rate.

bias (bool): Flag indicating whether to include bias terms.

activation (ActivationType): The activation function to use.

epsilon (float): The epsilon value for layer normalization.

attention_config (AttentionConfig): The attention configuration.

Returns:

None

Parameters:
  • sample_key (str)

  • prediction_key (str)

  • block_size (int)

  • vocab_size (int)

  • n_layer (int)

  • n_head (int)

  • n_embd (int)

  • ffn_hidden (int)

  • dropout (float)

  • bias (bool)

  • activation (ActivationType)

  • epsilon (float)

  • attention_config (AttentionConfig)

forward(inputs)[source]

Forward pass of the MultiModalTextDecoder module.

Return type:

dict[str, Tensor]

Parameters:

inputs (dict[str, Tensor])

Args:

inputs (dict[str, torch.Tensor]): Input dictionary containing the input tensors.

Returns:

dict[str, torch.Tensor]: Output dictionary containing the output logits tensor.

class modalities.models.coca.multi_modal_decoder.TransformerBlock(n_embd, bias, epsilon, activation, n_head, dropout, ffn_hidden, with_context, attention_type, attention_config=None, add_extra_mlp=False)[source]

Bases: Module

Transformer block class.

Initializes the TransformerBlock object.

Args:

n_embd (int): The size of the embeddings.

bias (bool): Flag indicating whether to include bias terms.

epsilon (float): Small value to avoid division by zero in LayerNorm.

activation (ActivationType): The type of activation function to use.

n_head (int): The number of attention heads.

dropout (float): The dropout rate.

ffn_hidden (int): The number of hidden units in the feed-forward network.

with_context (bool): Flag indicating whether to include context in the decoder.

attention_type (AttentionType): The type of attention mechanism to use.

attention_config (AttentionConfig, optional): The configuration for the attention mechanism. Defaults to None.

add_extra_mlp (bool, optional): Flag indicating whether to add an extra MLP layer. Defaults to False.

Parameters:
  • n_embd (int)

  • bias (bool)

  • epsilon (float)

  • activation (ActivationType)

  • n_head (int)

  • dropout (float)

  • ffn_hidden (int)

  • with_context (bool)

  • attention_type (AttentionType)

  • attention_config (AttentionConfig, optional)

  • add_extra_mlp (bool, optional)

forward(x, context=None)[source]

Forward pass of the TransformerBlock module.

Return type:

Tensor

Parameters:
  • x (Tensor)

  • context (Tensor, optional)

Args:

x (torch.Tensor): Input tensor.

context (torch.Tensor, optional): Context tensor. Defaults to None.

Returns:

torch.Tensor: Output tensor.
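
To make the role of with_context concrete, the following self-contained sketch implements the same pattern in plain PyTorch: a block that always applies self-attention and an MLP, and additionally cross-attends to a context sequence when with_context is set. It is an illustration of the pattern only, not the library's TransformerBlock (which also takes attention_type, attention_config, add_extra_mlp, and a configurable activation).

    from typing import Optional

    import torch
    from torch import nn

    class Block(nn.Module):
        """Conceptual decoder block: self-attention, optional cross-attention, MLP."""

        def __init__(self, n_embd: int, n_head: int, with_context: bool):
            super().__init__()
            self.with_context = with_context
            self.ln1 = nn.LayerNorm(n_embd)
            self.self_attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
            if with_context:
                self.ln_ctx = nn.LayerNorm(n_embd)
                self.cross_attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
            self.ln2 = nn.LayerNorm(n_embd)
            self.mlp = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
            )

        def forward(self, x: torch.Tensor, context: Optional[torch.Tensor] = None) -> torch.Tensor:
            h = self.ln1(x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]  # causal mask omitted for brevity
            if self.with_context and context is not None:
                h = self.ln_ctx(x)
                x = x + self.cross_attn(h, context, context, need_weights=False)[0]
            return x + self.mlp(self.ln2(x))

    tokens = torch.randn(2, 16, 64)          # text token embeddings
    image_tokens = torch.randn(2, 196, 64)   # context, e.g. pooled image tokens
    out = Block(n_embd=64, n_head=8, with_context=True)(tokens, image_tokens)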

modalities.models.coca.text_decoder module

class modalities.models.coca.text_decoder.TextDecoder(sample_key, prediction_key, block_size, vocab_size, n_layer, n_head, n_embd, ffn_hidden, dropout, bias, activation, epsilon, attention_config=None)[source]

Bases: NNModel

TextDecoder class.

Initializes the TextDecoder class.

Args:

sample_key (str): The key for the samples.

prediction_key (str): The key for the predictions.

block_size (int): The block size.

vocab_size (int): The size of the vocabulary.

n_layer (int): The number of layers.

n_head (int): The number of attention heads.

n_embd (int): The embedding dimension.

ffn_hidden (int): The hidden dimension of the feed-forward network.

dropout (float): The dropout rate.

bias (bool): Flag indicating whether to include bias terms.

activation (ActivationType): The activation function to use.

epsilon (float): Small value to avoid division by zero in LayerNorm.

attention_config (AttentionConfig, optional): The attention configuration. Defaults to None.

Parameters:
  • sample_key (str)

  • prediction_key (str)

  • block_size (int)

  • vocab_size (int)

  • n_layer (int)

  • n_head (int)

  • n_embd (int)

  • ffn_hidden (int)

  • dropout (float)

  • bias (bool)

  • activation (ActivationType)

  • epsilon (float)

  • attention_config (AttentionConfig, optional)

forward(inputs)[source]

Forward pass of the TextDecoder module.

Return type:

dict[str, Tensor]

Parameters:

inputs (dict[str, Tensor])

Args:

inputs (dict[str, torch.Tensor]): Input dictionary.

Returns:

dict[str, torch.Tensor]: Output dictionary containing the predictions.
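
The decoders in this package share a dict-in / dict-out convention: forward reads the tensor stored under sample_key and returns its result under prediction_key. A minimal self-contained sketch of that convention follows; the toy architecture and key names are assumptions, not the TextDecoder implementation.

    import torch
    from torch import nn

    class ToyTextDecoder(nn.Module):
        """Toy model illustrating the sample_key -> prediction_key convention."""

        def __init__(self, sample_key: str, prediction_key: str, vocab_size: int, n_embd: int):
            super().__init__()
            self.sample_key = sample_key
            self.prediction_key = prediction_key
            self.tok_emb = nn.Embedding(vocab_size, n_embd)
            self.lm_head = nn.Linear(n_embd, vocab_size)

        def forward(self, inputs: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
            hidden = self.tok_emb(inputs[self.sample_key])
            return {self.prediction_key: self.lm_head(hidden)}

    decoder = ToyTextDecoder(sample_key="input_ids", prediction_key="logits", vocab_size=50_000, n_embd=64)
    out = decoder({"input_ids": torch.randint(0, 50_000, (2, 16))})
    # out["logits"] has shape (2, 16, 50_000)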

Module contents