modalities.preprocessing package
Submodules
modalities.preprocessing.create_chunks module
- class modalities.preprocessing.create_chunks.Chunking[source]
Bases:
object
- static get_tokenized_file_chunk(dataset, num_chunks, chunk_id)[source]
- Return type:
- Parameters:
dataset (PackedMemMapDatasetBase)
num_chunks (int)
chunk_id (int)
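The page does not describe how a chunk is selected, so the following is only an illustrative sketch of one common convention: splitting a sequence into `num_chunks` contiguous, near-equal parts and returning the part at index `chunk_id`. The helper `get_contiguous_chunk` is hypothetical and is not part of `modalities`. The contiguous layout and the handling of the remainder are assumptions, and the actual behavior of `Chunking.get_tokenized_file_chunk` may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def get_contiguous_chunk(items: Sequence[T], num_chunks: int, chunk_id: int) -> List[T]:
    """Hypothetical helper: return the chunk_id-th of num_chunks contiguous chunks.

    Assumption (not taken from the library): when len(items) is not divisible
    by num_chunks, the first `remainder` chunks each receive one extra item.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    if not 0 <= chunk_id < num_chunks:
        raise ValueError("chunk_id must satisfy 0 <= chunk_id < num_chunks")

    base, remainder = divmod(len(items), num_chunks)
    # Chunks before this one contribute `base` items each, plus one extra
    # item for every earlier chunk that absorbed part of the remainder.
    start = chunk_id * base + min(chunk_id, remainder)
    end = start + base + (1 if chunk_id < remainder else 0)
    return list(items[start:end])
```

With this convention, every item belongs to exactly one chunk, so concatenating the chunks for `chunk_id = 0 .. num_chunks - 1` reproduces the original sequence.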
modalities.preprocessing.shuffle_data module
- class modalities.preprocessing.shuffle_data.DataShuffler[source]
Bases:
object
- static shuffle_tokenized_data(input_data_path, output_data_path, batch_size, seed=None)[source]
Shuffles a tokenized file (.pbin) and writes the shuffled data to the specified output file.
Note that the tokenized data is fully materialized in memory.
- Return type:
None
- Parameters:
input_data_path (Path): Path to the tokenized data (.pbin).
output_data_path (Path): Path to write the shuffled tokenized data.
batch_size (int): Number of documents to process per batch.
seed (Optional[int], optional): Seed for the random number generator. Defaults to None.
- Returns:
None
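To illustrate the role of the `seed` parameter and the in-memory behavior noted above, here is a minimal sketch of a seeded, document-level shuffle. The helper `shuffle_documents` is hypothetical and is not the implementation used by `DataShuffler`. It assumes each document is a sequence of token IDs, and it leaves out the `.pbin` reading and writing and the `batch_size` handling entirely.

```python
import random
from typing import List, Optional, Sequence


def shuffle_documents(
    documents: Sequence[Sequence[int]], seed: Optional[int] = None
) -> List[List[int]]:
    """Hypothetical helper: shuffle whole documents, keeping each one's tokens in order.

    All documents are copied into a list first, mirroring the note that the
    data is fully materialized in memory.
    """
    # A dedicated Random instance avoids touching the global random state.
    # With seed=None, Python seeds the generator from an OS entropy source.
    rng = random.Random(seed)
    shuffled = [list(doc) for doc in documents]
    rng.shuffle(shuffled)
    return shuffled
```

Passing the same integer seed yields the same document order on repeated runs, which is useful for reproducible dataset builds. With the default `seed=None`, the order is expected to differ between runs.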