modalities.conversion.gpt2 package

Submodules

modalities.conversion.gpt2.configuration_gpt2 module

LLaMA-like GPT2 model configuration

class modalities.conversion.gpt2.configuration_gpt2.GPT2Config(vocab_size=32000, hidden_size=4096, intermediate_size=11008, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=None, hidden_act='silu', max_position_embeddings=2048, initializer_range=0.02, rms_norm_eps=None, layer_norm_eps=1e-06, layer_norm_bias=True, layer_norm_elementwise_affine=True, use_cache=True, pad_token_id=None, bos_token_id=1, eos_token_id=2, pretraining_tp=1, tie_word_embeddings=False, rope_theta=10000.0, rope_scaling=None, attention_bias=False, attention_dropout=0.0, mlp_bias=False, head_dim=None, **kwargs)[source]

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [GPT2Model]. It is used to instantiate a GPT2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of LLaMA-7B.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

Args:

vocab_size (int, optional, defaults to 32000):

Vocabulary size of the GPT2 model. Defines the number of different tokens that can be represented by the input_ids passed when calling [GPT2Model].

hidden_size (int, optional, defaults to 4096):

Dimension of the hidden representations.

intermediate_size (int, optional, defaults to 11008):

Dimension of the MLP representations.

num_hidden_layers (int, optional, defaults to 32):

Number of hidden layers in the Transformer decoder.

num_attention_heads (int, optional, defaults to 32):

Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (int, optional):

The number of key-value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, it uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://arxiv.org/pdf/2305.13245.pdf). If not specified, it defaults to num_attention_heads.

hidden_act (str or function, optional, defaults to “silu”):

The non-linear activation function (function or string) in the decoder.

max_position_embeddings (int, optional, defaults to 2048):

The maximum sequence length that this model might ever be used with.

initializer_range (float, optional, defaults to 0.02):

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

rms_norm_eps (float, optional):

The epsilon used by the rms normalization layers.

use_cache (bool, optional, defaults to True):

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

pad_token_id (int, optional):

Padding token id.

bos_token_id (int, optional, defaults to 1):

Beginning of stream token id.

eos_token_id (int, optional, defaults to 2):

End of stream token id.

pretraining_tp (int, optional, defaults to 1):

Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).

tie_word_embeddings (bool, optional, defaults to False):

Whether to tie the input and output word embeddings.

rope_theta (float, optional, defaults to 10000.0):

The base period of the RoPE embeddings.

rope_scaling (Dict, optional):

Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type and expect the model to work on a longer max_position_embeddings, we recommend updating this value accordingly. Expected contents:

rope_type (str):
The sub-variant of RoPE to use. Can be one of [‘default’, ‘linear’, ‘dynamic’, ‘yarn’, ‘longrope’, ‘llama3’], with ‘default’ being the original RoPE implementation.

factor (float, optional):
Used with all rope types except ‘default’. The scaling factor to apply to the RoPE embeddings. In most scaling types, a factor of x will enable the model to handle sequences of length x * original maximum pre-trained length.

original_max_position_embeddings (int, optional):
Used with ‘dynamic’, ‘longrope’ and ‘llama3’. The original max position embeddings used during pretraining.

attention_factor (float, optional):
Used with ‘yarn’ and ‘longrope’. The scaling factor to be applied on the attention computation. If unspecified, it defaults to value recommended by the implementation, using the factor field to infer the suggested value.

beta_fast (float, optional):
Only used with ‘yarn’. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.

beta_slow (float, optional):
Only used with ‘yarn’. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.

short_factor (List[float], optional):
Only used with ‘longrope’. The scaling factor to be applied to short contexts (< original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.

long_factor (List[float], optional):
Only used with ‘longrope’. The scaling factor to be applied to long contexts (> original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.

low_freq_factor (float, optional):
Only used with ‘llama3’. Scaling factor applied to the low-frequency components of the RoPE.

high_freq_factor (float, optional):
Only used with ‘llama3’. Scaling factor applied to the high-frequency components of the RoPE.

attention_bias (bool, optional, defaults to False):

Whether to use a bias in the query, key, value and output projection layers during self-attention.

attention_dropout (float, optional, defaults to 0.0):

The dropout ratio for the attention probabilities.

mlp_bias (bool, optional, defaults to False):

Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.

head_dim (int, optional):

The attention head dimension. If None, it defaults to hidden_size // num_attention_heads.
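The defaults for num_key_value_heads and head_dim described in the arguments above can be illustrated with a small helper. This is only a sketch: describe_attention is a hypothetical function, not part of the package, and the divisibility check is an assumption based on how grouped attention is usually set up.

```python
def describe_attention(hidden_size, num_attention_heads,
                       num_key_value_heads=None, head_dim=None):
    """Resolve the documented defaults and name the attention variant (hypothetical helper)."""
    # Documented default: fall back to num_attention_heads when unset.
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads
    # Documented default: hidden_size // num_attention_heads when unset.
    if head_dim is None:
        head_dim = hidden_size // num_attention_heads
    # Assumption: query heads are split evenly across key/value heads.
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")

    if num_key_value_heads == num_attention_heads:
        variant = "MHA"
    elif num_key_value_heads == 1:
        variant = "MQA"
    else:
        variant = "GQA"

    return {
        "variant": variant,
        "num_key_value_heads": num_key_value_heads,
        "head_dim": head_dim,
        "queries_per_kv_head": num_attention_heads // num_key_value_heads,
    }


default_setup = describe_attention(hidden_size=4096, num_attention_heads=32)
grouped_setup = describe_attention(hidden_size=4096, num_attention_heads=32,
                                   num_key_value_heads=8)
```

With the default sizes, each of the 32 heads has its own key/value head; with num_key_value_heads=8, four query heads share one key/value head.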

```python
>>> from modalities.conversion.gpt2.configuration_gpt2 import GPT2Config
>>> from modalities.conversion.gpt2.modeling_gpt2 import GPT2Model

>>> # Initializing a GPT2 with a llama-7b style configuration
>>> configuration = GPT2Config()

>>> # Initializing a model from the llama-7b style configuration
>>> model = GPT2Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```
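To make the rope_theta and rope_scaling arguments more concrete, the following sketch computes the standard RoPE inverse frequencies and applies 'linear' scaling by dividing them by factor. The function name rope_inverse_frequencies is hypothetical, and the code is a plain-Python illustration under these assumptions rather than the code path used by the model.

```python
def rope_inverse_frequencies(head_dim, rope_theta=10000.0, factor=1.0):
    """Inverse RoPE frequencies, optionally with 'linear' scaling (illustrative only)."""
    # Standard RoPE: one frequency per pair of dimensions, theta ** (-i / head_dim)
    # for i = 0, 2, 4, ...
    inv_freq = [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    # 'linear' scaling: dividing the frequencies by `factor` is equivalent to
    # compressing the position indices by the same factor.
    return [f / factor for f in inv_freq]


# A rope_scaling dictionary in the documented shape.
rope_scaling = {"rope_type": "linear", "factor": 2.0}

base = rope_inverse_frequencies(head_dim=128)
scaled = rope_inverse_frequencies(head_dim=128, factor=rope_scaling["factor"])
```

With factor=2.0 every rotation advances half as fast per position, which is why a factor of x is expected to stretch the usable context by roughly x in most scaling types.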

Parameters:

layer_norm_eps (float)
layer_norm_bias (bool)
layer_norm_elementwise_affine (bool)

base_model_tp_plan: Optional[dict[str, Any]] = {'layers.*.mlp.down_proj': 'rowwise', 'layers.*.mlp.gate_proj': 'colwise', 'layers.*.mlp.up_proj': 'colwise', 'layers.*.self_attn.k_proj': 'colwise', 'layers.*.self_attn.o_proj': 'rowwise', 'layers.*.self_attn.q_proj': 'colwise', 'layers.*.self_attn.v_proj': 'colwise'}
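The keys of base_model_tp_plan are patterns in which * stands for the layer index. A rough sketch of how such a pattern could be resolved for a concrete module name is shown below; tp_style_for is a hypothetical helper, and treating * as "one or more digits" is an assumption rather than the matching rule the library necessarily applies.

```python
import re

BASE_MODEL_TP_PLAN = {
    "layers.*.mlp.down_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
}


def tp_style_for(module_name, plan=BASE_MODEL_TP_PLAN):
    """Return the sharding style for a module name, or None if no pattern matches."""
    for pattern, style in plan.items():
        # Escape the pattern literally, then let '*' match a layer index.
        regex = re.escape(pattern).replace(r"\*", r"\d+")
        if re.fullmatch(regex, module_name):
            return style
    return None


down_proj_style = tp_style_for("layers.0.mlp.down_proj")
q_proj_style = tp_style_for("layers.31.self_attn.q_proj")
```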

keys_to_ignore_at_inference = ['past_key_values']

model_type: str = 'modalities-gpt2'

modalities.conversion.gpt2.conversion_code module

modalities.conversion.gpt2.conversion_code.transfer_model_code(output_dir)[source]

Copies the required model code to the output directory and replaces modalities imports.

This allows the converted model to be used without the modalities package via:

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("path/to/converted/model", trust_remote_code=True)

Args:

output_dir (str): Directory of the converted model.

Parameters:

output_dir (str)
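As an illustration of what "replacing modalities imports" could look like, the sketch below rewrites absolute imports from the conversion package into relative ones so that copied files can import each other. rewrite_modalities_imports is a hypothetical helper; whether transfer_model_code produces relative imports in exactly this form is an assumption.

```python
import re


def rewrite_modalities_imports(source, package="modalities.conversion.gpt2"):
    """Turn 'from <package>.<module> import X' into 'from .<module> import X'."""
    pattern = re.compile(
        r"^([ \t]*)from " + re.escape(package) + r"\.(\w+) import ",
        flags=re.MULTILINE,
    )
    # Keep the original indentation (group 1) and module name (group 2).
    return pattern.sub(r"\1from .\2 import ", source)


original = (
    "from modalities.conversion.gpt2.configuration_gpt2 import GPT2Config\n"
    "import torch\n"
)
rewritten = rewrite_modalities_imports(original)
```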

modalities.conversion.gpt2.conversion_model module

modalities.conversion.gpt2.conversion_model.check_converted_model(hf_model, modalities_model, num_testruns, vocab_size)[source]

Tests the converted model by inputting a random token sequence and comparing the output logits of both models.

Args:

hf_model (GPT2ForCausalLM): Huggingface transformers model.
modalities_model (GPT2LLM): Modalities model.
num_testruns (int): Number of test runs to perform.
vocab_size (int): Vocabulary size of the model. (Required for generating random input tokens.)

Parameters:

hf_model (GPT2ForCausalLM)
modalities_model (GPT2LLM)
num_testruns (int)
vocab_size (int)
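The comparison described above can be sketched without any deep-learning dependency. In this simplified version the two models are stand-in callables that map a list of token ids to a list of logit rows; check_outputs_match, seq_len, atol and seed are hypothetical names and values, and the real function operates on the actual model objects instead.

```python
import random


def check_outputs_match(model_a, model_b, num_testruns, vocab_size,
                        seq_len=16, atol=1e-5, seed=0):
    """Feed identical random token sequences to both models and compare logits."""
    rng = random.Random(seed)
    for _ in range(num_testruns):
        # vocab_size bounds the random token ids, as noted in the arguments.
        tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]
        logits_a = model_a(tokens)
        logits_b = model_b(tokens)
        if len(logits_a) != len(logits_b):
            return False
        for row_a, row_b in zip(logits_a, logits_b):
            if len(row_a) != len(row_b):
                return False
            if any(abs(a - b) > atol for a, b in zip(row_a, row_b)):
                return False
    return True


# Toy stand-ins for the two models (not real networks).
def toy_model(tokens):
    return [[float(t), 0.5 * float(t)] for t in tokens]


def shifted_model(tokens):
    return [[float(t) + 1.0, 0.5 * float(t)] for t in tokens]


same = check_outputs_match(toy_model, toy_model, num_testruns=3, vocab_size=100)
different = check_outputs_match(toy_model, shifted_model, num_testruns=3, vocab_size=100)
```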

modalities.conversion.gpt2.conversion_model.convert_model_checkpoint(modalities_config)[source]

Converts the modalities model to a Huggingface transformers model.

Both the loaded modalities model and the converted Huggingface model are returned so that they can be compared.

Return type:

tuple[GPT2ForCausalLM, GPT2LLM]

Parameters:

modalities_config (dict)

Args:

modalities_config (dict): Modalities config dictionary.

Returns:

tuple[GPT2ForCausalLM, GPT2LLM]: Converted Hugging Face model and the original modalities model.

modalities.conversion.gpt2.conversion_model.convert_model_config(modalities_config)[source]

Converts the modalities model configuration to a Huggingface transformers configuration.

For this, the model_raw or model section of the modalities config is used. Corresponding entries are mapped to the Huggingface configuration.

Return type:

GPT2Config

Parameters:

modalities_config (dict)

Args:

modalities_config (dict): Modalities config dictionary.

Returns:

GPT2Config: Converted Huggingface model configuration.
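A minimal sketch of picking the relevant section from the config dictionary might look as follows. select_model_section is a hypothetical helper, the example dictionary contents are made up for illustration, and preferring model_raw over model when both are present is an assumption.

```python
def select_model_section(modalities_config):
    """Return the 'model_raw' section if present, otherwise the 'model' section."""
    # Assumption: 'model_raw' takes precedence when both sections exist.
    for key in ("model_raw", "model"):
        if key in modalities_config:
            return modalities_config[key]
    raise KeyError("config contains neither a 'model_raw' nor a 'model' section")


# Made-up placeholder contents, only to show the lookup.
only_model = select_model_section({"model": {"name": "a"}})
both_sections = select_model_section({"model_raw": {"name": "b"}, "model": {"name": "a"}})
```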

modalities.conversion.gpt2.conversion_tokenizer module

modalities.conversion.gpt2.conversion_tokenizer.convert_tokenizer(tokenizer_model_path, output_dir)[source]

Converts a SentencePiece tokenizer to a Huggingface tokenizer.

Return type:

tuple[int, int, int, int]

Parameters:

tokenizer_model_path (str)
output_dir (str)

Args:

tokenizer_model_path (str): Path to the SentencePiece tokenizer model file.
output_dir (str): Path to the directory where the converted tokenizer will be saved.

Returns:

tuple[int, int, int, int]: The actual bos_token_id, eos_token_id, pad_token_id and unk_token_id of the tokenizer. Note that these are not set in the transformers part of the created tokenizer, only in the wrapped SentencePiece tokenizer.

modalities.conversion.gpt2.convert_gpt2 module

usage: convert_gpt2.py [-h] [--num_testruns NUM_TESTRUNS] [--device_modalities DEVICE_MODALITIES] [--device_hf DEVICE_HF] modalities_config output_dir

Convert GPT-2 model checkpoint to Huggingface transformers format.

positional arguments:

modalities_config: Path to the modalities config file.
output_dir: Directory to save the converted model.

options:

-h, --help: show this help message and exit
--num_testruns NUM_TESTRUNS: Number of test runs to perform.
--device_modalities DEVICE_MODALITIES: Device for the modalities model.
--device_hf DEVICE_HF: Device for the Hugging Face model.
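The command-line interface above can be mirrored with a small argparse sketch, which is handy for checking how arguments are parsed. This is a reconstruction from the usage text, not the script's actual parser; the defaults (0, "cpu", "cpu") are assumed to match the convert_gpt2 function signature, and the file names in the example call are placeholders.

```python
import argparse


def build_parser():
    """Reconstruct a parser matching the documented usage text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="convert_gpt2.py",
        description="Convert GPT-2 model checkpoint to Huggingface transformers format.",
    )
    parser.add_argument("modalities_config", type=str,
                        help="Path to the modalities config file.")
    parser.add_argument("output_dir", type=str,
                        help="Directory to save the converted model.")
    # Assumed defaults, taken from the convert_gpt2 function signature.
    parser.add_argument("--num_testruns", type=int, default=0,
                        help="Number of test runs to perform.")
    parser.add_argument("--device_modalities", type=str, default="cpu",
                        help="Device for the modalities model.")
    parser.add_argument("--device_hf", type=str, default="cpu",
                        help="Device for the Hugging Face model.")
    return parser


# Placeholder paths for illustration.
args = build_parser().parse_args(["config.yaml", "converted/", "--num_testruns", "5"])
```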

modalities.conversion.gpt2.convert_gpt2.convert_gpt2(modalities_config_path, output_dir, num_testruns=0, device_modalities='cpu', device_hf='cpu')[source]

Return type:

None

Parameters:

modalities_config_path (str)
output_dir (str)
num_testruns (int)
device_modalities (str)
device_hf (str)

Takes a modalities gpt2 model and converts it to a Huggingface transformers model.

The provided config yaml file should contain the model_raw or model section with the model configuration. Additionally, the checkpointed_model section should be present and contain the path to the model checkpoint. Optionally, the function can run a number of test runs to compare the converted model with the original one. If a tokenizer is specified in the config, it will be converted as well.

Args:

modalities_config_path (str): Path to the modalities config file.
output_dir (str): Directory to save the converted model.
num_testruns (int, optional): Number of test runs to perform. Defaults to 0.
device_modalities (str, optional): Device for the modalities model. Defaults to “cpu”.
device_hf (str, optional): Device for the Hugging Face model. Defaults to “cpu”.

modalities.conversion.gpt2.modeling_gpt2 module

class modalities.conversion.gpt2.modeling_gpt2.GPT2ForCausalLM(config)[source]

Bases: GPT2PreTrainedModel, GenerationMixin

This model inherits from [PreTrainedModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:

config ([GPT2Config]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Parameters:: config (GPT2Config)

config_class: alias of GPT2Config

forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, cache_position=None, logits_to_keep=0, **kwargs)[source]

The [GPT2ForCausalLM] forward method, overrides the __call__ special method.

<Tip>

Although the recipe for forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

</Tip>

Return type:

CausalLMOutputWithPast

Parameters:

input_ids (LongTensor | None)
attention_mask (Tensor | None)
position_ids (LongTensor | None)
past_key_values (Cache | None)
inputs_embeds (FloatTensor | None)
labels (LongTensor | None)
use_cache (bool | None)
cache_position (LongTensor | None)
logits_to_keep (int | Tensor)
kwargs (Unpack[TransformersKwargs])

Args:

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

[What are input IDs?](../glossary#input-ids)

attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

1 for tokens that are not masked,
0 for tokens that are masked.

[What are attention masks?](../glossary#attention-mask)

position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

[What are position IDs?](../glossary#position-ids)

past_key_values (~cache_utils.Cache, optional):

Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Only [~cache_utils.Cache] instance is allowed as input, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). If no past_key_values are passed, [~cache_utils.DynamicCache] will be initialized by default.

The model will output the same cache format that is fed as input.

If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional):

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

labels (torch.LongTensor of shape (batch_size, sequence_length), optional):

Labels for computing the masked language modeling loss. Indices should either be in [0, …, config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size].

use_cache (bool, optional):

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

cache_position (torch.LongTensor of shape (sequence_length), optional):

Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

logits_to_keep (Union[int, torch.Tensor], defaults to 0):

If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
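The logits_to_keep semantics described above can be illustrated on a plain Python list standing in for the sequence dimension. select_positions is a hypothetical helper, and the list of indices stands in for a 1D tensor; the sketch only shows why 0 behaves as the "keep everything" special case.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Pick the sequence positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) is slice(0, None), so 0 keeps every position,
        # while k > 0 keeps only the last k positions.
        return hidden_states[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit indices along the sequence.
    return [hidden_states[i] for i in logits_to_keep]


sequence = ["h0", "h1", "h2", "h3"]
all_positions = select_positions(sequence)            # keeps all four positions
last_position = select_positions(sequence, 1)         # what generation needs
chosen_positions = select_positions(sequence, [0, 3])
```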

Returns:

[transformers.modeling_outputs.CausalLMOutputWithPast] or tuple(torch.FloatTensor): A [transformers.modeling_outputs.CausalLMOutputWithPast] or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration ([GPT2Config]) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) – It is a [~cache_utils.Cache] instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).

Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
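How labels and the -100 ignore index feed into the returned loss can be sketched in plain Python. This is an illustration only: causal_lm_loss is a hypothetical function, and both the one-position shift between logits and labels and the mean reduction are assumptions based on the usual next-token prediction setup.

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy over positions whose label is not -100."""
    total, count = 0.0, 0
    # Assumed shift: the logits at position t are scored against the label at t + 1.
    for row, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # ignored positions contribute nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else float("nan")


# Uniform logits over a vocabulary of 4 tokens; the last label is ignored.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = causal_lm_loss(uniform_logits, [1, 2, IGNORE_INDEX])
```

With uniform logits, the single remaining position yields a loss of log(4), the entropy of a uniform guess over four tokens.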

Example:

```python
>>> from transformers import AutoTokenizer, LlamaForCausalLM

>>> model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```

get_decoder()[source]

Best-effort lookup of the decoder module.

Order of attempts (covers ~85 % of current usages):

self.decoder
self.model (many wrappers store the decoder here)
self.model.get_decoder() (nested wrappers)
fallback: raise for the few exotic models that need a bespoke rule
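The lookup order listed above could be sketched roughly as follows. find_decoder is a hypothetical stand-alone function, and how the self.model and self.model.get_decoder() steps are combined here is an assumption; the actual method may differ in detail.

```python
from types import SimpleNamespace


def find_decoder(obj):
    """Best-effort decoder lookup following the documented order of attempts."""
    # 1. A direct `decoder` attribute wins.
    if hasattr(obj, "decoder"):
        return obj.decoder
    # 2./3. Otherwise look at `model`, delegating to nested wrappers if possible.
    if hasattr(obj, "model"):
        inner = obj.model
        if hasattr(inner, "get_decoder"):
            return inner.get_decoder()
        return inner
    # 4. Fallback: no generic rule applies.
    raise AttributeError("could not locate a decoder; a bespoke rule is needed")


core = object()
direct = SimpleNamespace(decoder=core)
wrapped = SimpleNamespace(model=core)
nested = SimpleNamespace(model=SimpleNamespace(get_decoder=lambda: core))
```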

set_decoder(decoder)[source]: Symmetric setter. Mirrors the lookup logic used in get_decoder.

class modalities.conversion.gpt2.modeling_gpt2.GPT2ForQuestionAnswering(config)[source]

Bases: GenericForQuestionAnswering, GPT2PreTrainedModel

Args: config ([PretrainedConfig]):

Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Parameters:: config (GPT2Config)

base_model_prefix = 'transformer'

config_class: alias of GPT2Config

class modalities.conversion.gpt2.modeling_gpt2.GPT2ForSequenceClassification(config)[source]

Bases: GenericForSequenceClassification, GPT2PreTrainedModel

Args: config ([PretrainedConfig]):

Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Parameters:: config (GPT2Config)

config_class: alias of GPT2Config

class modalities.conversion.gpt2.modeling_gpt2.GPT2ForTokenClassification(config)[source]

Bases: GenericForTokenClassification, GPT2PreTrainedModel

Args: config ([PretrainedConfig]):

Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Parameters:: config (GPT2Config)

config_class: alias of GPT2Config

class modalities.conversion.gpt2.modeling_gpt2.GPT2Model(config)[source]

Bases: GPT2PreTrainedModel

This model inherits from [PreTrainedModel]. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Parameters:

config ([GPT2Config]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [~PreTrainedModel.from_pretrained] method to load the model weights.

Parameters:: config (GPT2Config)

config_class: alias of GPT2Config

forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, cache_position=None, use_cache=None, **kwargs)[source]

The [GPT2Model] forward method, overrides the __call__ special method.

<Tip>

Although the recipe for forward pass needs to be defined within this function, one should call the [Module] instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

</Tip>

Return type:

BaseModelOutputWithPast

Parameters:

input_ids (LongTensor | None)
attention_mask (Tensor | None)
position_ids (LongTensor | None)
past_key_values (Cache | None)
inputs_embeds (FloatTensor | None)
cache_position (LongTensor | None)
use_cache (bool | None)
kwargs (Unpack[TransformersKwargs])

Args:

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

[What are input IDs?](../glossary#input-ids)

attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

1 for tokens that are not masked,
0 for tokens that are masked.