modalities.conversion.gpt2 package

Submodules

modalities.conversion.gpt2.configuration_gpt2 module

LLaMA-like GPT2 model configuration

class modalities.conversion.gpt2.configuration_gpt2.GPT2Config(vocab_size=32000, hidden_size=4096, intermediate_size=11008, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=None, hidden_act='silu', max_position_embeddings=2048, initializer_range=0.02, rms_norm_eps=None, layer_norm_eps=1e-06, layer_norm_bias=True, layer_norm_elementwise_affine=True, use_cache=True, pad_token_id=None, bos_token_id=1, eos_token_id=2, pretraining_tp=1, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, attention_bias=False, attention_dropout=0.0, mlp_bias=False, head_dim=None, **kwargs)[source]

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [GPT2Model]. It is used to instantiate a GPT2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of LLaMA-7B.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

Args:
vocab_size (int, optional, defaults to 32000):

Vocabulary size of the GPT2 model. Defines the number of different tokens that can be represented by the input_ids passed when calling [GPT2Model].

hidden_size (int, optional, defaults to 4096):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 11008):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (int, optional):

The number of key/value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, it uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://arxiv.org/pdf/2305.13245.pdf). If not specified, defaults to num_attention_heads.

hidden_act (str or function, optional, defaults to “silu”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

rms_norm_eps (float, optional):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

pad_token_id (int, optional):

Padding token id.

bos_token_id (int, optional, defaults to 1):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 2):

End of stream token id.

pretraining_tp (int, optional, defaults to 1):

Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).

tie_word_embeddings (bool, optional, defaults to False):

Whether to tie the input and output word embeddings.

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type and expect the model to work on a longer max_position_embeddings, we recommend updating this value accordingly. Expected contents:

rope_type (str):

The sub-variant of RoPE to use. Can be one of [‘default’, ‘linear’, ‘dynamic’, ‘yarn’, ‘longrope’, ‘llama3’], with ‘default’ being the original RoPE implementation.

factor (float, optional):

Used with all rope types except ‘default’. The scaling factor to apply to the RoPE embeddings. In most scaling types, a factor of x will enable the model to handle sequences of length x * original maximum pre-trained length.

original_max_position_embeddings (int, optional):

Used with ‘dynamic’, ‘longrope’ and ‘llama3’. The original max position embeddings used during pretraining.

attention_factor (float, optional):

Used with ‘yarn’ and ‘longrope’. The scaling factor to be applied on the attention computation. If unspecified, it defaults to value recommended by the implementation, using the factor field to infer the suggested value.

beta_fast (float, optional):

Only used with ‘yarn’. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.

beta_slow (float, optional):

Only used with ‘yarn’. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.

short_factor (List[float], optional):

Only used with ‘longrope’. The scaling factor to be applied to short contexts (< original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.

long_factor (List[float], optional):

Only used with ‘longrope’. The scaling factor to be applied to long contexts (> original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.

low_freq_factor (float, optional):

Only used with ‘llama3’. Scaling factor applied to low frequency components of the RoPE.

high_freq_factor (float, optional):

Only used with ‘llama3’. Scaling factor applied to high frequency components of the RoPE.

attention_bias (bool, optional, defaults to False):

Whether to use a bias in the query, key, value and output projection layers during self-attention.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

mlp_bias (bool, optional, defaults to False):

Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.

head_dim (int, optional):

The attention head dimension. If None, it will default to hidden_size // num_attention_heads.
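
The defaults for num_key_value_heads and head_dim, and the MHA/MQA/GQA distinction described above, can be sketched in plain Python. This is only an illustration of the documented rules: the helper name `resolve_attention_layout` is hypothetical, and the divisibility check is an assumption rather than something this page states.

```python
def resolve_attention_layout(hidden_size=4096, num_attention_heads=32,
                             num_key_value_heads=None, head_dim=None):
    """Hypothetical helper that mirrors the documented defaults."""
    # Documented default: fall back to num_attention_heads.
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads
    # Documented default: hidden size split evenly across attention heads.
    if head_dim is None:
        head_dim = hidden_size // num_attention_heads
    # Assumption: query heads must divide evenly into key/value groups.
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be a multiple of num_key_value_heads")

    if num_key_value_heads == num_attention_heads:
        variant = "MHA"
    elif num_key_value_heads == 1:
        variant = "MQA"
    else:
        variant = "GQA"

    return {
        "variant": variant,
        "num_key_value_heads": num_key_value_heads,
        "head_dim": head_dim,
        "queries_per_kv_head": num_attention_heads // num_key_value_heads,
    }


default_layout = resolve_attention_layout()
gqa_layout = resolve_attention_layout(num_key_value_heads=8)
mqa_layout = resolve_attention_layout(num_key_value_heads=1)
```

With the default arguments the helper reports MHA with a head dimension of 128, which matches 4096 // 32.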

```python
>>> from modalities.conversion.gpt2.configuration_gpt2 import GPT2Config
>>> from modalities.conversion.gpt2.modeling_gpt2 import GPT2Model

>>> # Initializing a GPT2 with a llama-7b style configuration
>>> configuration = GPT2Config()
>>> # Initializing a model from the llama-7b style configuration
>>> model = GPT2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
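
The size constraints on rope_scaling can be checked with a little arithmetic. The sketch below only restates the rules from the argument list (factor lists of length hidden size / attention heads / 2, and a target length of roughly factor times the original length); the helper names and the placeholder values in the dictionary are assumptions for illustration.

```python
def expected_longrope_factor_length(hidden_size=4096, num_attention_heads=32):
    """Length required for short_factor and long_factor, per the description above."""
    return hidden_size // num_attention_heads // 2


def target_context_length(original_max_position_embeddings, factor):
    """Approximate sequence length enabled by a scaling factor (holds for most rope types)."""
    return int(original_max_position_embeddings * factor)


factor_length = expected_longrope_factor_length()

# Placeholder values only; real factors come from the method used to extend the context.
longrope_scaling = {
    "rope_type": "longrope",
    "factor": 4.0,
    "original_max_position_embeddings": 2048,
    "short_factor": [1.0] * factor_length,
    "long_factor": [1.0] * factor_length,
}

new_max_positions = target_context_length(
    longrope_scaling["original_max_position_embeddings"], longrope_scaling["factor"]
)
```
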
Parameters:
  • layer_norm_eps (float)

  • layer_norm_bias (bool)

  • layer_norm_elementwise_affine (bool)

base_model_tp_plan: Optional[dict[str, Any]] = {'layers.*.mlp.down_proj': 'rowwise', 'layers.*.mlp.gate_proj': 'colwise', 'layers.*.mlp.up_proj': 'colwise', 'layers.*.self_attn.k_proj': 'colwise', 'layers.*.self_attn.o_proj': 'rowwise', 'layers.*.self_attn.q_proj': 'colwise', 'layers.*.self_attn.v_proj': 'colwise'}
keys_to_ignore_at_inference = ['past_key_values']
model_type: str = 'modalities-gpt2'
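
The mean-pooling rule mentioned for num_key_value_heads (building each grouped key or value head from the original heads in its group) can be sketched as follows. This uses plain lists instead of tensors, and it assumes that consecutive heads form a group; the function name `mean_pool_kv_heads` is hypothetical and not part of this package.

```python
def mean_pool_kv_heads(heads, num_key_value_heads):
    """Average groups of per-head weight vectors into num_key_value_heads vectors.

    heads: list of equally long lists of floats, one entry per original head.
    """
    num_heads = len(heads)
    if num_heads % num_key_value_heads != 0:
        raise ValueError("number of heads must be a multiple of num_key_value_heads")
    group_size = num_heads // num_key_value_heads

    pooled = []
    for group_index in range(num_key_value_heads):
        # Assumption: heads in the same group are stored next to each other.
        group = heads[group_index * group_size:(group_index + 1) * group_size]
        # Element-wise mean over the heads in this group.
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


original_heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
gqa_heads = mean_pool_kv_heads(original_heads, num_key_value_heads=2)
mqa_heads = mean_pool_kv_heads(original_heads, num_key_value_heads=1)
```
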

modalities.conversion.gpt2.conversion_code module

modalities.conversion.gpt2.conversion_code.transfer_model_code(output_dir)[source]
Copies the required model code to the output directory and replaces modalities imports.

This allows the converted model to be used without the modalities package via:

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("path/to/converted/model", trust_remote_code=True)

Args:

output_dir (str): Directory of the converted model.

Parameters:

output_dir (str)

modalities.conversion.gpt2.conversion_model module

modalities.conversion.gpt2.conversion_model.check_converted_model(hf_model, modalities_model, num_testruns, vocab_size)[source]

Tests the converted model by inputting a random token sequence and comparing the output logits of both models.

Args:

hf_model (GPT2ForCausalLM): Huggingface transformers model.

modalities_model (GPT2LLM): Modalities model.

num_testruns (int): Number of test runs to perform.

vocab_size (int): Vocabulary size of the model. (Required for generating random input tokens.)
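
The idea behind this check (feed the same random token sequence to both models and compare the logits) can be sketched without any deep learning framework. The version below is an illustration, not the package's implementation: the models are stand-in callables, and `seq_len`, `atol`, and `seed` are assumed parameters that do not appear in the documented signature.

```python
import random


def compare_models(model_a, model_b, num_testruns, vocab_size,
                   seq_len=16, atol=1e-5, seed=0):
    """Return True if both callables produce matching logits on random inputs."""
    rng = random.Random(seed)
    for _ in range(num_testruns):
        # vocab_size bounds the random token ids, as noted in the argument list.
        tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]
        logits_a = model_a(tokens)
        logits_b = model_b(tokens)
        for row_a, row_b in zip(logits_a, logits_b):
            for value_a, value_b in zip(row_a, row_b):
                if abs(value_a - value_b) > atol:
                    return False
    return True


def toy_model(tokens, vocab_size=8):
    """Stand-in model: one row of fake logits per input token."""
    return [[float((token + j) % vocab_size) for j in range(vocab_size)]
            for token in tokens]


def shifted_toy_model(tokens):
    """Same as toy_model, but every logit is off by one."""
    return [[value + 1.0 for value in row] for row in toy_model(tokens)]


same = compare_models(toy_model, toy_model, num_testruns=3, vocab_size=8)
different = compare_models(toy_model, shifted_toy_model, num_testruns=3, vocab_size=8)
```
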

modalities.conversion.gpt2.conversion_model.convert_model_checkpoint(modalities_config)[source]
Return type:

tuple[GPT2ForCausalLM, GPT2LLM]

Parameters:

modalities_config (dict)

Converts the modalities model to a Huggingface transformers model.

Both the loaded modalities model and the converted Huggingface model are returned so that they can be compared.

Args:

modalities_config (dict): Modalities config dictionary.

Returns:

tuple[GPT2ForCausalLM, GPT2LLM]: Converted Hugging Face model and the original modalities model.

modalities.conversion.gpt2.conversion_model.convert_model_config(modalities_config)[source]
Return type:

GPT2Config

Parameters:

modalities_config (dict)

Converts the modalities model configuration to a Huggingface transformers configuration.

For this, the model_raw or model section of the modalities config is used. Corresponding entries are mapped to the Huggingface configuration.

Args:

modalities_config (dict): Modalities config dictionary.

Returns:

GPT2Config: Converted Huggingface model configuration.

modalities.conversion.gpt2.conversion_tokenizer module

modalities.conversion.gpt2.conversion_tokenizer.convert_tokenizer(tokenizer_model_path, output_dir)[source]

Converts a SentencePiece tokenizer to a Huggingface tokenizer.

Return type:

tuple[int, int, int, int]

Parameters:
  • tokenizer_model_path (str)

  • output_dir (str)

Args:

tokenizer_model_path (str): Path to the SentencePiece tokenizer model file.

output_dir (str): Path to the directory where the converted tokenizer will be saved.

Returns:
tuple[int, int, int, int]: The actual bos_token_id, eos_token_id, pad_token_id and unk_token_id of the tokenizer. Note that these are not set in the transformers part of the created tokenizer, only in the wrapped SentencePiece tokenizer.

modalities.conversion.gpt2.convert_gpt2 module

usage: convert_gpt2.py [-h] [--num_testruns NUM_TESTRUNS] [--device_modalities DEVICE_MODALITIES]
                       [--device_hf DEVICE_HF] modalities_config output_dir

Convert GPT-2 model checkpoint to Huggingface transformers format.

positional arguments:

modalities_config

Path to the modalities config file.

output_dir

Directory to save the converted model.

options:
-h, --help

show this help message and exit

--num_testruns NUM_TESTRUNS

Number of test runs to perform.

--device_modalities DEVICE_MODALITIES

Device for the modalities model.

--device_hf DEVICE_HF

Device for the Hugging Face model.
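
For readers who want to script around the command line, the interface above can be reconstructed with argparse. This is a sketch based only on the help text; the default values are assumed to match the defaults of the convert_gpt2 function documented below, and the file names passed to `parse_args` are hypothetical.

```python
import argparse


def build_parser():
    """Rebuild a parser that accepts the arguments listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="convert_gpt2.py",
        description="Convert GPT-2 model checkpoint to Huggingface transformers format.",
    )
    parser.add_argument("modalities_config", type=str,
                        help="Path to the modalities config file.")
    parser.add_argument("output_dir", type=str,
                        help="Directory to save the converted model.")
    # Assumed defaults, taken from the convert_gpt2 function signature.
    parser.add_argument("--num_testruns", type=int, default=0,
                        help="Number of test runs to perform.")
    parser.add_argument("--device_modalities", type=str, default="cpu",
                        help="Device for the modalities model.")
    parser.add_argument("--device_hf", type=str, default="cpu",
                        help="Device for the Hugging Face model.")
    return parser


# Hypothetical invocation: config.yaml and converted/ are placeholder paths.
args = build_parser().parse_args(["config.yaml", "converted/", "--num_testruns", "5"])
```
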

modalities.conversion.gpt2.convert_gpt2.convert_gpt2(modalities_config_path, output_dir, num_testruns=0, device_modalities='cpu', device_hf='cpu')[source]
Return type:

None

Parameters:
  • modalities_config_path (str)

  • output_dir (str)

  • num_testruns (int)

  • device_modalities (str)

  • device_hf (str)

Takes a modalities gpt2 model and converts it to a Huggingface transformers model.

The provided config yaml file should contain the model_raw or model section with the model configuration. Additionally, the checkpointed_model section should be present and contain the path to the model checkpoint. Optionally, the function can run a number of test runs to compare the converted model with the original one. If a tokenizer is specified in the config, it will be converted as well.

Args:

modalities_config_path (str): Path to the modalities config file.

output_dir (str): Directory to save the converted model.

num_testruns (int, optional): Number of test runs to perform. Defaults to 0.

device_modalities (str, optional): Device for the modalities model. Defaults to “cpu”.

device_hf (str, optional): Device for the Hugging Face model. Defaults to “cpu”.
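
Before starting a conversion, it can help to check that a loaded config contains the sections named above. The following sketch works on an already parsed dictionary; the function name `check_conversion_config` is hypothetical, and the check covers only the section names mentioned in this description, not their contents.

```python
def check_conversion_config(config):
    """Return a list of problems found in a parsed modalities config dictionary."""
    problems = []
    # Either model_raw or model must carry the model configuration.
    if "model_raw" not in config and "model" not in config:
        problems.append("missing 'model_raw' or 'model' section")
    # The checkpoint path is expected inside checkpointed_model.
    if "checkpointed_model" not in config:
        problems.append("missing 'checkpointed_model' section")
    return problems


complete = check_conversion_config({"model_raw": {}, "checkpointed_model": {}})
incomplete = check_conversion_config({"settings": {}})
```
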

modalities.conversion.gpt2.modeling_gpt2 module

class modalities.conversion.gpt2.modeling_gpt2.GPT2DecoderLayer(config, layer_idx)[source]

Bases: Module

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None, position_embeddings=None, **kwargs)[source]
Return type:

Tuple[FloatTensor, Optional[Tuple[FloatTensor, FloatTensor]]]

Parameters:
  • hidden_states (Tensor)

  • attention_mask (Tensor | None)

  • position_ids (LongTensor | None)

  • past_key_value (Cache | None)

  • output_attentions (bool | None)

  • use_cache (bool | None)

  • cache_position (LongTensor | None)

  • position_embeddings (Tuple[Tensor, Tensor] | None)

Args:

hidden_states (torch.FloatTensor): input to the layer of shape (batch, seq_len, embed_dim)

attention_mask (torch.FloatTensor, optional):

attention mask of size (batch_size, sequence_length) if flash attention is used or (batch_size, 1, query_sequence_length, key_sequence_length) if default attention is used.

output_attentions (bool, optional):

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

use_cache (bool, optional):

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

past_key_value (Tuple(torch.FloatTensor), optional): cached past key and value projection states

cache_position (torch.LongTensor of shape (sequence_length), optional):

Indices depicting the position of the input sequence tokens in the sequence.

position_embeddings (Tuple[torch.FloatTensor, torch.FloatTensor], optional):

Tuple containing the cosine and sine positional embeddings of shape (batch_size, seq_len, head_dim), with head_dim being the embedding dimension of each attention head.

kwargs (dict, optional):

Arbitrary kwargs to be ignored, used for FSDP and other methods that inject code into the model.
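
The two attention_mask shapes mentioned above are related: a 2D padding mask of shape (batch_size, sequence_length) can be expanded into a 4D mask of shape (batch_size, 1, query_sequence_length, key_sequence_length) by combining it with a causal pattern. The sketch below shows that relationship with nested lists and booleans. It is an assumption-laden illustration: real implementations typically use additive float tensors, and `expand_padding_mask` is a hypothetical name.

```python
def expand_padding_mask(padding_mask):
    """Expand a (batch, seq) padding mask into a (batch, 1, seq, seq) causal mask.

    padding_mask uses 1 for real tokens and 0 for padding.
    In the result, True means the query position may attend to the key position.
    """
    expanded = []
    for row in padding_mask:
        seq_len = len(row)
        causal = [
            # A query at position q sees keys at positions <= q that are not padding.
            [(k <= q) and bool(row[k]) for k in range(seq_len)]
            for q in range(seq_len)
        ]
        expanded.append([causal])  # the extra list is the singleton head dimension
    return expanded


mask_4d = expand_padding_mask([[1, 1, 0]])
```
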

class modalities.conversion.gpt2.modeling_gpt2.GPT2ForCausalLM(config)[source]

Bases: GPT2PreTrainedModel, GenerationMixin

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, cache_position=None, num_logits_to_keep=0, **kwargs)[source]

The [GPT2ForCausalLM] forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Return type:

Union[Tuple, CausalLMOutputWithPast]

Parameters:
  • input_ids (LongTensor)

  • attention_mask (Tensor | None)

  • position_ids (LongTensor | None)

  • past_key_values (Cache | List[FloatTensor] | None)

  • inputs_embeds (FloatTensor | None)

  • labels (LongTensor | None)

  • use_cache (bool | None)

  • output_attentions (bool | None)

  • output_hidden_states (bool | None)

  • return_dict (bool | None)

  • cache_position (LongTensor | None)

  • num_logits_to_keep (int)

  • kwargs (Unpack[KwargsForCausalLM])

Args:
input_ids (torch.LongTensor of shape (batch_size, sequence_length)):

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

[What are input IDs?](../glossary#input-ids)

attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,

  • 0 for tokens that are masked.

[What are attention masks?](../glossary#attention-mask)

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read [modeling_opt._prepare_decoder_attention_mask] and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy.

  • 1 indicates the head is not masked,

  • 0 indicates the head is masked.

position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

[What are position IDs?](../glossary#position-ids)

past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional):

Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:

  • a [~cache_utils.Cache] instance, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);

  • Tuple of tuple(torch.FloatTensor) of length config.num_hidden_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.

The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

use_cache (bool, optional):

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

output_attentions (bool, optional):

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

output_hidden_states (bool, optional):

Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

return_dict (bool, optional):

Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.

cache_position (torch.LongTensor of shape (sequence_length), optional):

Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

Args:
labels (torch.LongTensor of shape (batch_size, sequence_length), optional):

Labels for computing the masked language modeling loss. Indices should either be in [0, …, config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size].

num_logits_to_keep (int, optional):

Calculate logits for the last num_logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
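
The special case for num_logits_to_keep=0 follows naturally from negative slicing: since -0 equals 0, a slice starting at -0 keeps every position. The sketch below shows this on a plain list. Whether this module uses exactly this idiom is an assumption, and `keep_last_positions` is a hypothetical name.

```python
def keep_last_positions(hidden_states, num_logits_to_keep=0):
    """Select the sequence positions for which logits would be computed."""
    # -0 == 0, so [-0:] starts at index 0 and keeps the whole sequence.
    return hidden_states[-num_logits_to_keep:]


positions = ["h0", "h1", "h2", "h3"]
all_positions = keep_last_positions(positions)       # num_logits_to_keep=0
last_position = keep_last_positions(positions, 1)    # what generation needs
last_two = keep_last_positions(positions, 2)
```
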

Returns:

[transformers.modeling_outputs.CausalLMOutputWithPast] or tuple(torch.FloatTensor): A [transformers.modeling_outputs.CausalLMOutputWithPast] or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration ([GPT2Config]) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) – It is a [~cache_utils.Cache] instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
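
The interaction between labels, the -100 ignore value, and the next-token loss can be sketched for a single sequence. This assumes the usual convention in Hugging Face causal language models, where the logits at position t are scored against the label at position t + 1; it is written with the standard library only and is not this module's loss code.

```python
import math

IGNORE_INDEX = -100


def next_token_loss(logits, labels):
    """Mean cross-entropy over one sequence, skipping labels equal to -100.

    logits: list (sequence) of lists (vocabulary scores); labels: list of ints.
    """
    total, count = 0.0, 0
    # Position t predicts the token at position t + 1.
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        max_logit = max(step_logits)
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else float("nan")


uniform_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = next_token_loss(uniform_logits, [0, 1, IGNORE_INDEX])
```

With uniform scores over two tokens and one scored position, the loss equals log 2.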

Example:

```python
>>> from transformers import AutoTokenizer
>>> from modalities.conversion.gpt2.modeling_gpt2 import GPT2ForCausalLM

>>> model = GPT2ForCausalLM.from_pretrained("...")
>>> tokenizer = AutoTokenizer.from_pretrained("...")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```
get_decoder()[source]
get_input_embeddings()[source]

Returns the model’s input embeddings.

Returns:

nn.Module: A torch module mapping vocabulary to hidden states.

get_output_embeddings()[source]

Returns the model’s output embeddings.

Returns:

nn.Module: A torch module mapping hidden states to vocabulary.

set_decoder(decoder)[source]
set_input_embeddings(value)[source]

Set model’s input embeddings.

Args:

value (nn.Module): A module mapping vocabulary to hidden states.

set_output_embeddings(new_embeddings)[source]
class modalities.conversion.gpt2.modeling_gpt2.GPT2ForQuestionAnswering(config)[source]

Bases: GPT2PreTrainedModel

The Llama-like Model transformer with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

This model inherits from [PreTrainedModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
config ([GPT2Config]):

Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

base_model_prefix = 'transformer'
forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, start_positions=None, end_positions=None, output_attentions=None, output_hidden_states=None, return_dict=None, **kwargs)[source]

The [GPT2ForQuestionAnswering] forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Return type:

Union[Tuple, QuestionAnsweringModelOutput]

Parameters:
  • input_ids (LongTensor | None)

  • attention_mask (FloatTensor | None)

  • position_ids (LongTensor | None)

  • past_key_values (Cache | List[FloatTensor] | None)

  • inputs_embeds (FloatTensor | None)

  • start_positions (LongTensor | None)

  • end_positions (LongTensor | None)

  • output_attentions (bool | None)

  • output_hidden_states (bool | None)

  • return_dict (bool | None)

Args:
input_ids (torch.LongTensor of shape (batch_size, sequence_length)):

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

[What are input IDs?](../glossary#input-ids)

attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,

  • 0 for tokens that are masked.

[What are attention masks?](../glossary#attention-mask)

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read [modeling_opt._prepare_decoder_attention_mask] and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy.

  • 1 indicates the head is not masked,

  • 0 indicates the head is masked.

position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

[What are position IDs?](../glossary#position-ids)

past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional):

Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:

  • a [~cache_utils.Cache] instance, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);

  • Tuple of tuple(torch.FloatTensor) of length config.num_hidden_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.

The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

use_cache (bool, optional):

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

output_attentions (bool, optional):

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

output_hidden_states (bool, optional):

Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

return_dict (bool, optional):

Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.

cache_position (torch.LongTensor of shape (sequence_length), optional):

Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

start_positions (torch.LongTensor of shape (batch_size,), optional):

Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.

end_positions (torch.LongTensor of shape (batch_size,), optional):

Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
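
The clamping rule for start_positions and end_positions can be sketched as below. The approach shown (clamp to sequence_length and treat that value as the ignored index) is the one commonly used in Hugging Face question-answering heads; that this module does the same is an assumption, and `clamp_positions` is a hypothetical name.

```python
def clamp_positions(positions, sequence_length):
    """Clamp span positions and flag which ones would contribute to the loss."""
    # Valid token indices are 0 .. sequence_length - 1, so sequence_length
    # itself can serve as the "ignore this example" marker.
    ignored_index = sequence_length
    clamped = [max(0, min(position, ignored_index)) for position in positions]
    contributes = [position != ignored_index for position in clamped]
    return clamped, contributes


clamped_starts, start_contributes = clamp_positions([3, 12, 50], sequence_length=10)
```
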

get_input_embeddings()[source]

Returns the model’s input embeddings.

Returns:

nn.Module: A torch module mapping vocabulary to hidden states.

set_input_embeddings(value)[source]

Set model’s input embeddings.

Args:

value (nn.Module): A module mapping vocabulary to hidden states.

class modalities.conversion.gpt2.modeling_gpt2.GPT2ForSequenceClassification(config)[source]

Bases: GPT2PreTrainedModel

The LLaMa-like GPT2 Model transformer with a sequence classification head on top (linear layer).

[GPT2ForSequenceClassification] uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.

Since it does classification on the last token, it needs to know the position of the last token. If a pad_token_id is defined in the configuration, it finds the last token that is not a padding token in each row. If no pad_token_id is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of input_ids, it does the same (takes the last value in each row of the batch).
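
The rule for locating the last token can be sketched on plain lists of token ids. This follows the behavior described above; the fallback for a row that consists only of padding is an assumption, and `last_token_indices` is a hypothetical name.

```python
def last_token_indices(input_ids, pad_token_id=None):
    """Return, for each row, the index of the token used for classification."""
    indices = []
    for row in input_ids:
        if pad_token_id is None:
            # Without a padding id, the last value in the row is used.
            indices.append(len(row) - 1)
            continue
        non_pad = [i for i, token in enumerate(row) if token != pad_token_id]
        # Assumption: a row with only padding falls back to the last position.
        indices.append(non_pad[-1] if non_pad else len(row) - 1)
    return indices


batch = [[5, 6, 0, 0], [7, 8, 9, 0]]
with_padding_id = last_token_indices(batch, pad_token_id=0)
without_padding_id = last_token_indices(batch)
```
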

This model inherits from [PreTrainedModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
config ([GPT2Config]):

Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]

The [GPT2ForSequenceClassification] forward method, overrides the __call__ special method.

<Tip>

Although the recipe for forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

</Tip>

Return type:

Union[Tuple, SequenceClassifierOutputWithPast]

Parameters:
  • input_ids (LongTensor | None)

  • attention_mask (Tensor | None)

  • position_ids (LongTensor | None)

  • past_key_values (Cache | List[FloatTensor] | None)

  • inputs_embeds (FloatTensor | None)

  • labels (LongTensor | None)

  • use_cache (bool | None)

  • output_attentions (bool | None)

  • output_hidden_states (bool | None)

  • return_dict (bool | None)

Args:
input_ids (torch.LongTensor of shape (batch_size, sequence_length)):

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

[What are input IDs?](../glossary#input-ids)

attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,

  • 0 for tokens that are masked.

[What are attention masks?](../glossary#attention-mask)
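As a minimal illustration of the 1/0 convention above, a mask can be derived from padded token IDs. The pad ID of 0 and the helper name build_attention_mask are assumptions for this sketch; in practice the tokenizer usually returns the mask alongside input_ids.

```python
def build_attention_mask(input_ids, pad_token_id):
    """Return 1 for real tokens and 0 for padding, row by row."""
    return [[0 if tok == pad_token_id else 1 for tok in row] for row in input_ids]


padded = [[11, 12, 13, 0, 0], [21, 22, 23, 24, 25]]
mask = build_attention_mask(padded, pad_token_id=0)
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```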

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read [modeling_opt._prepare_decoder_attention_mask] and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy.

  • 1 indicates the head is not masked,

  • 0 indicates the head is masked.

position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

[What are position IDs?](../glossary#position-ids)

past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional):

Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:

  • a [~cache_utils.Cache] instance, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);

  • a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.

The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

use_cache (bool, optional):

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

output_attentions (bool, optional):

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

output_hidden_states (bool, optional):

Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

return_dict (bool, optional):

Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.

cache_position (torch.LongTensor of shape (sequence_length), optional):

Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

labels (torch.LongTensor of shape (batch_size,), optional):

Labels for computing the sequence classification/regression loss. Indices should be in [0, …, config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (mean-square loss); if config.num_labels > 1, a classification loss is computed (cross-entropy).
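The choice between the two losses can be sketched in plain Python. This covers only the two cases named above; the function name and the list-based inputs are assumptions for illustration and do not reproduce the model's actual loss code.

```python
import math


def sequence_classification_loss(logits, labels, num_labels):
    """Mean-square loss when num_labels == 1, mean cross-entropy otherwise."""
    if num_labels == 1:
        # Regression: each row of logits holds a single predicted value.
        return sum((row[0] - y) ** 2 for row, y in zip(logits, labels)) / len(labels)
    total = 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-sum-exp over the class scores.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[y]
    return total / len(labels)


regression = sequence_classification_loss([[1.0], [3.0]], [0.0, 1.0], num_labels=1)
classification = sequence_classification_loss([[0.0, 0.0]], [1], num_labels=2)
print(regression)  # 2.5
```

With two equal class scores the cross-entropy equals log(2), since both classes receive probability 0.5.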

get_input_embeddings()[source]

Returns the model’s input embeddings.

Returns:

nn.Module: A torch module mapping vocabulary to hidden states.

set_input_embeddings(value)[source]

Set model’s input embeddings.

Args:

value (nn.Module): A module mapping vocabulary to hidden states.

class modalities.conversion.gpt2.modeling_gpt2.GPT2ForTokenClassification(config)[source]

Bases: GPT2PreTrainedModel

The Llama-like GPT2 Model transformer with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from [PreTrainedModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
config ([GPT2Config]):

Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]

The [GPT2ForTokenClassification] forward method, overrides the __call__ special method.

<Tip>

Although the recipe for forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

</Tip>

Return type:

Union[Tuple, TokenClassifierOutput]

Parameters:
  • input_ids (LongTensor | None)

  • attention_mask (Tensor | None)

  • position_ids (LongTensor | None)

  • past_key_values (List[FloatTensor] | None)

  • inputs_embeds (FloatTensor | None)

  • labels (LongTensor | None)

  • use_cache (bool | None)

  • output_attentions (bool | None)

  • output_hidden_states (bool | None)

  • return_dict (bool | None)

Args:
input_ids (torch.LongTensor of shape (batch_size, sequence_length)):

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

[What are input IDs?](../glossary#input-ids)

attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,

  • 0 for tokens that are masked.

[What are attention masks?](../glossary#attention-mask)

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read [modeling_opt._prepare_decoder_attention_mask] and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy.

  • 1 indicates the head is not masked,

  • 0 indicates the head is masked.

position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

[What are position IDs?](../glossary#position-ids)

past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional):

Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:

  • a [~cache_utils.Cache] instance, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);

  • a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.

The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

use_cache (bool, optional):

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

output_attentions (bool, optional):

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

output_hidden_states (bool, optional):

Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

return_dict (bool, optional):

Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.

cache_position (torch.LongTensor of shape (sequence_length), optional):

Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

labels (torch.LongTensor of shape (batch_size, sequence_length), optional):

Labels for computing the token classification loss. Indices should be in [0, …, config.num_labels - 1].

Returns:

[transformers.modeling_outputs.TokenClassifierOutput] or tuple(torch.FloatTensor): A [transformers.modeling_outputs.TokenClassifierOutput] or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration ([GPT2Config]) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Example:

```python
>>> from transformers import AutoTokenizer
>>> from modalities.conversion.gpt2.modeling_gpt2 import GPT2ForTokenClassification
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> model = GPT2ForTokenClassification.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> inputs = tokenizer(
...     "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
... )
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_token_class_ids = logits.argmax(-1)
>>> # Note that tokens are classified rather than input words, which means that
>>> # there might be more predicted token classes than words.
>>> # Multiple token classes might account for the same word.
>>> predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
>>> # Reuse the predictions as labels, only to show how a loss is obtained.
>>> labels = predicted_token_class_ids
>>> loss = model(**inputs, labels=labels).loss
```

get_input_embeddings()[source]

Returns the model’s input embeddings.

Returns:

nn.Module: A torch module mapping vocabulary to hidden states.

set_input_embeddings(value)[source]

Set model’s input embeddings.

Args:

value (nn.Module): A module mapping vocabulary to hidden states.

class modalities.conversion.gpt2.modeling_gpt2.GPT2Model(config)[source]

Bases: GPT2PreTrainedModel

The bare LLaMA Model outputting raw hidden-states without any specific head on top. This model inherits from [PreTrainedModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
config ([GPT2Config]):

Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a [LlamaDecoderLayer]

Args:

config: GPT2Config

Initialize internal Module state, shared by both nn.Module and ScriptModule.

Parameters:

config (GPT2Config)

forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, cache_position=None, **flash_attn_kwargs)[source]

The [GPT2Model] forward method, overrides the __call__ special method.

<Tip>

Although the recipe for forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

</Tip>

Return type:

Union[Tuple, BaseModelOutputWithPast]

Parameters:
  • input_ids (LongTensor)

  • attention_mask (Tensor | None)

  • position_ids (LongTensor | None)

  • past_key_values (Cache | List[FloatTensor] | None)

  • inputs_embeds (FloatTensor | None)

  • use_cache (bool | None)

  • output_attentions (bool | None)

  • output_hidden_states (bool | None)

  • return_dict (bool | None)

  • cache_position (LongTensor | None)

  • flash_attn_kwargs (Unpack[FlashAttentionKwargs])

Args:
input_ids (torch.LongTensor of shape (batch_size, sequence_length)):

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

[What are input IDs?](../glossary#input-ids)

attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,

  • 0 for tokens that are masked.

[What are attention masks?](../glossary#attention-mask)

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read [modeling_opt._prepare_decoder_attention_mask] and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy.

  • 1 indicates the head is not masked,

  • 0 indicates the head is masked.

position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

[What are position IDs?](../glossary#position-ids)

past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional):

Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:

  • a [~cache_utils.Cache] instance, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);

  • a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.

The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

use_cache (bool, optional):

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

output_attentions (bool, optional):

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

output_hidden_states (bool, optional):

Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

return_dict (bool, optional):

Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.

cache_position (torch.LongTensor of shape (sequence_length), optional):

Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
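The difference between position_ids and cache_position can be illustrated with a left-padded row. This is a sketch in plain Python under stated assumptions: deriving position_ids from a running count of the attention mask is one common convention, the fill value for padded slots is arbitrary, and both helper names are hypothetical.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask, fill_value=0):
    """Count real tokens from zero; padded slots get an arbitrary fill value."""
    position_ids = []
    for row in attention_mask:
        running = accumulate(row)
        position_ids.append([count - 1 if keep else fill_value for count, keep in zip(running, row)])
    return position_ids


def cache_positions(past_seen_tokens, num_new_tokens):
    """Slots in the cache that the new tokens occupy, regardless of padding."""
    return list(range(past_seen_tokens, past_seen_tokens + num_new_tokens))


left_padded_mask = [[0, 0, 1, 1, 1]]
print(position_ids_from_mask(left_padded_mask))  # [[0, 0, 0, 1, 2]]
print(cache_positions(0, 5))                     # [0, 1, 2, 3, 4]
print(cache_positions(5, 1))                     # [5]
```

In the last call one new token is decoded after five cached slots: its cache position is 5, while its position ID under the convention above would be 3 because only three real tokens precede it.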

get_input_embeddings()[source]

Returns the model’s input embeddings.

Returns:

nn.Module: A torch module mapping vocabulary to hidden states.

set_input_embeddings(value)[source]

Set model’s input embeddings.

Args:

value (nn.Module): A module mapping vocabulary to hidden states.

class modalities.conversion.gpt2.modeling_gpt2.GPT2PreTrainedModel(config, *inputs, **kwargs)[source]

Bases: PreTrainedModel

The bare LLaMA Model outputting raw hidden-states without any specific head on top. This model inherits from [PreTrainedModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:
config ([GPT2Config]):

Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

Parameters:

config (PretrainedConfig)

base_model_prefix = 'model'

config_class

alias of GPT2Config

supports_gradient_checkpointing = True

class modalities.conversion.gpt2.modeling_gpt2.KwargsForCausalLM[source]

Bases: dict

cumulative_seqlens_k: Optional[LongTensor]
cumulative_seqlens_q: Optional[LongTensor]
max_length_k: Optional[int]
max_length_q: Optional[int]
num_items_in_batch: Optional[Tensor]
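The field names suggest the variable-length (packed) attention convention, in which several sequences are concatenated and their boundaries are passed as cumulative lengths. That reading is an assumption, since only the names and types are given here. Under it, the values could be derived as follows (the helper name is hypothetical):

```python
from itertools import accumulate


def cumulative_seqlens(seq_lengths):
    """Boundaries of packed sequences: a leading 0 followed by running totals."""
    return [0] + list(accumulate(seq_lengths))


lengths = [3, 5, 2]                         # three sequences packed into one row
cu_seqlens = cumulative_seqlens(lengths)
max_length = max(lengths)
print(cu_seqlens)  # [0, 3, 8, 10]
print(max_length)  # 5
```

Sequence i then spans the half-open range cu_seqlens[i] to cu_seqlens[i + 1] in the packed row.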
class modalities.conversion.gpt2.modeling_gpt2.LlamaAttention(config, layer_idx=None)[source]

Bases: Module

Multi-headed attention from the ‘Attention Is All You Need’ paper.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

Parameters:
  • config

  • layer_idx

forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None, position_embeddings=None, **kwargs)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type:

Tuple[Tensor, Optional[Tensor], Optional[Tuple[Tensor]]]

Parameters:
  • hidden_states (Tensor)

  • attention_mask (Tensor | None)

  • position_ids (LongTensor | None)

  • past_key_value (Cache | None)

  • output_attentions (bool)

  • use_cache (bool)

  • cache_position (LongTensor | None)

  • position_embeddings (Tuple[Tensor, Tensor] | None)

class modalities.conversion.gpt2.modeling_gpt2.LlamaDynamicNTKScalingRotaryEmbedding(*args, **kwargs)[source]

Bases: LlamaRotaryEmbedding

LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla

Initialize internal Module state, shared by both nn.Module and ScriptModule.
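Dynamic NTK scaling is usually formulated as enlarging the rotary base once the sequence grows beyond the trained context length. The sketch below uses that commonly cited formula; it is an assumption about this class rather than something the description above states, and the function name is hypothetical.

```python
def dynamic_ntk_base(base, dim, seq_len, max_position_embeddings, scaling_factor):
    """Return the rotary base to use for a sequence of length seq_len."""
    if seq_len <= max_position_embeddings:
        # Within the trained context: the base is left unchanged.
        return base
    scale = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    return base * scale ** (dim / (dim - 2))


unchanged = dynamic_ntk_base(10000.0, dim=128, seq_len=2048, max_position_embeddings=2048, scaling_factor=2.0)
enlarged = dynamic_ntk_base(10000.0, dim=128, seq_len=4096, max_position_embeddings=2048, scaling_factor=1.0)
print(unchanged)  # 10000.0
```

A larger base lowers the rotation frequencies, so positions beyond the trained range map to angles closer to those seen during training.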

class modalities.conversion.gpt2.modeling_gpt2.LlamaFlashAttention2(*args, **kwargs)[source]

Bases: LlamaAttention

Llama flash attention module. This module inherits from LlamaAttention because the weights of the module stay untouched. The only required change is in the forward pass, which needs to call the public API of flash attention correctly and handle padding tokens if the input contains any.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None, position_embeddings=None, **kwargs)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type:

Tuple[Tensor, Optional[Tensor], Optional[Tuple[Tensor]]]

Parameters:
  • hidden_states (Tensor)

  • attention_mask (LongTensor | None)

  • position_ids (LongTensor | None)

  • past_key_value (Cache | None)

  • output_attentions (bool)

  • use_cache (bool)

  • cache_position (LongTensor | None)

  • position_embeddings (Tuple[Tensor, Tensor] | None)

  • kwargs (Unpack[FlashAttentionKwargs])

class modalities.conversion.gpt2.modeling_gpt2.LlamaLinearScalingRotaryEmbedding(*args, **kwargs)[source]

Bases: LlamaRotaryEmbedding

LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev

Initialize internal Module state, shared by both nn.Module and ScriptModule.
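Linear scaling is commonly implemented by dividing the position indices by the scaling factor before the rotation angles are computed. The sketch below shows that idea under this assumption; the helper name is hypothetical and the code is not taken from this module.

```python
def linear_scaled_positions(position_ids, scaling_factor):
    """Compress positions so a longer sequence fits the trained position range."""
    return [p / scaling_factor for p in position_ids]


scaled = linear_scaled_positions([0, 1, 2, 3], scaling_factor=2.0)
print(scaled)  # [0.0, 0.5, 1.0, 1.5]
```

With a factor of 2.0, position 4096 receives the same angles that position 2048 would receive without scaling.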

class modalities.conversion.gpt2.modeling_gpt2.LlamaMLP(config)[source]

Bases: Module

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class modalities.conversion.gpt2.modeling_gpt2.LlamaRotaryEmbedding(dim=None, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0, rope_type='default', config=None)[source]

Bases: Module

Initialize internal Module state, shared by both nn.Module and ScriptModule.

Parameters:

config (GPT2Config | None)

forward(x, position_ids)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
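For orientation, the default rotary embedding is usually computed from inverse frequencies base^(-2i/dim), with the resulting angles duplicated across both halves of the head dimension. The sketch below follows that standard formulation for a single position in plain Python; it is an assumption about what forward returns, not a copy of this class, and the function name is hypothetical.

```python
import math


def rotary_cos_sin(position, dim, base=10000.0):
    """Return (cos, sin) lists of length dim for one position."""
    # One inverse frequency per pair of dimensions.
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    angles = [position * f for f in inv_freq]
    # Duplicate the angles so both halves of the head dimension share them.
    emb = angles + angles
    return [math.cos(a) for a in emb], [math.sin(a) for a in emb]


cos0, sin0 = rotary_cos_sin(position=0, dim=8)
cos1, sin1 = rotary_cos_sin(position=1, dim=4)
print(cos0)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Position 0 yields an angle of zero everywhere, so the rotation leaves the first token unchanged.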

class modalities.conversion.gpt2.modeling_gpt2.LlamaSdpaAttention(config, layer_idx=None)[source]

Bases: LlamaAttention

Llama attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from LlamaAttention because the weights of the module stay untouched. The only changes are in the forward pass, to adapt to the SDPA API.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

Parameters:
  • config

  • layer_idx

forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None, position_embeddings=None, **kwargs)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type:

Tuple[Tensor, Optional[Tensor], Optional[Tuple[Tensor]]]

Parameters:
  • hidden_states (Tensor)

  • attention_mask (Tensor | None)

  • position_ids (LongTensor | None)

  • past_key_value (Cache | None)

  • output_attentions (bool)

  • use_cache (bool)

  • cache_position (LongTensor | None)

  • position_embeddings (Tuple[Tensor, Tensor] | None)

modalities.conversion.gpt2.modeling_gpt2.apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1)[source]

Applies Rotary Position Embedding to the query and key tensors.

Args:

q (torch.Tensor):

The query tensor.

k (torch.Tensor):

The key tensor.

cos (torch.Tensor):

The cosine part of the rotary embedding.

sin (torch.Tensor):

The sine part of the rotary embedding.

position_ids (torch.Tensor, optional):

Deprecated and unused.

unsqueeze_dim (int, optional, defaults to 1):

The ‘unsqueeze_dim’ argument specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
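The shape reasoning above can be checked without tensors by working on shape tuples alone. Both helpers below are hypothetical and only mimic the usual right-aligned broadcasting rule; the sizes (batch 2, heads 4, seq_len 5, head_dim 8) are arbitrary.

```python
def unsqueeze(shape, dim):
    """Insert a singleton axis at index dim."""
    return shape[:dim] + (1,) + shape[dim:]


def broadcastable(a, b):
    """Right-aligned check: each pair of sizes must match or contain a 1."""
    return all(x == y or x == 1 or y == 1 for x, y in zip(reversed(a), reversed(b)))


cos_shape = (2, 5, 8)            # [batch_size, seq_len, head_dim]
q_heads_first = (2, 4, 5, 8)     # [batch_size, heads, seq_len, head_dim]
q_seq_first = (2, 5, 4, 8)       # [batch_size, seq_len, heads, head_dim]

print(broadcastable(unsqueeze(cos_shape, 1), q_heads_first))  # True
print(broadcastable(unsqueeze(cos_shape, 2), q_seq_first))    # True
print(broadcastable(unsqueeze(cos_shape, 1), q_seq_first))    # False
```

The last line shows the failure mode: with the wrong unsqueeze_dim the seq_len axis of cos lines up against the heads axis of q.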

Returns:

tuple(torch.Tensor) comprising the query and key tensors rotated using the Rotary Position Embedding.
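The rotation itself is commonly written as x * cos + rotate_half(x) * sin. The sketch below applies that formula to a single vector in plain Python; it assumes the standard formulation, and apply_rotary is a hypothetical name rather than the function documented here.

```python
import math


def rotate_half(x):
    """Swap the two halves of x and negate the (former) second half."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rotary(vec, cos, sin):
    """Rotate vec elementwise: vec * cos + rotate_half(vec) * sin."""
    rotated = rotate_half(vec)
    return [v * c + r * s for v, r, c, s in zip(vec, rotated, cos, sin)]


theta = math.pi / 2
q = [1.0, 0.0]
q_rot = apply_rotary(q, [math.cos(theta)] * 2, [math.sin(theta)] * 2)
identity = apply_rotary([3.0, 4.0], [1.0, 1.0], [0.0, 0.0])
print(identity)  # [3.0, 4.0]
```

For a two-dimensional vector this is an ordinary planar rotation: a quarter turn maps [1, 0] to approximately [0, 1], and the vector norm is preserved.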

modalities.conversion.gpt2.modeling_gpt2.repeat_kv(hidden_states, n_rep)[source]

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
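The interleaved ordering matters: each key/value head is repeated n_rep times in place, rather than the whole set of heads being tiled. A plain-Python sketch with head labels standing in for the head dimension of the tensor (both helper names are hypothetical):

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each head n_rep times consecutively (interleaved, not tiled)."""
    return [head for head in kv_heads for _ in range(n_rep)]


def kv_head_for_query(query_head, n_rep):
    """Index of the key/value head that a given query head attends with."""
    return query_head // n_rep


expanded = repeat_kv_heads(["kv0", "kv1"], n_rep=3)
print(expanded)  # ['kv0', 'kv0', 'kv0', 'kv1', 'kv1', 'kv1']
```

With 2 key/value heads and n_rep=3 the result has 6 entries, matching num_attention_heads, and query heads 0 to 2 share kv0 while heads 3 to 5 share kv1.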

Return type:

Tensor

Parameters:
  • hidden_states

  • n_rep

modalities.conversion.gpt2.modeling_gpt2.rotate_half(x)[source]

Rotates half the hidden dims of the input.

Module contents