modalities.preprocessing package
Submodules
modalities.preprocessing.create_chunks module
- class modalities.preprocessing.create_chunks.Chunking
- Bases: object
- static get_tokenized_file_chunk(dataset, num_chunks, chunk_id)
- Return type:
- Parameters:
- dataset (PackedMemMapDatasetBase) 
- num_chunks (int) 
- chunk_id (int) 
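
The page does not document how `get_tokenized_file_chunk` divides the dataset. As a minimal sketch only, the following assumes the common scheme of splitting a sequence into `num_chunks` contiguous, near-equal parts and selecting the part at index `chunk_id`. The helper names `chunk_bounds` and `get_chunk` are hypothetical and not part of modalities; a plain Python list stands in for a `PackedMemMapDatasetBase`.

```python
from typing import List, Sequence, Tuple


def chunk_bounds(num_items: int, num_chunks: int, chunk_id: int) -> Tuple[int, int]:
    """Return the half-open [start, end) index range of one chunk.

    Hypothetical helper: assumes contiguous, near-equal chunks, which may
    differ from what modalities actually does.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    if not 0 <= chunk_id < num_chunks:
        raise ValueError("chunk_id must be in [0, num_chunks)")
    base, remainder = divmod(num_items, num_chunks)
    # The first `remainder` chunks each receive one extra item.
    start = chunk_id * base + min(chunk_id, remainder)
    end = start + base + (1 if chunk_id < remainder else 0)
    return start, end


def get_chunk(items: Sequence, num_chunks: int, chunk_id: int) -> List:
    """Return the items belonging to chunk `chunk_id` as a list."""
    start, end = chunk_bounds(len(items), num_chunks, chunk_id)
    return list(items[start:end])
```

Under this assumption the chunks never overlap, and concatenating them in order of `chunk_id` reproduces the original sequence, so each worker can process its own chunk independently.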
 
 
 
modalities.preprocessing.shuffle_data module
- class modalities.preprocessing.shuffle_data.DataShuffler
- Bases: object
- static shuffle_tokenized_data(input_data_path, output_data_path, batch_size, seed=None)
- Shuffles a tokenized file (.pbin) and writes the shuffled data to the specified output file.
- Note that the tokenized data is fully materialized in memory.
- Return type: None
- Parameters:
- input_data_path (Path): Path to the tokenized data (.pbin).
- output_data_path (Path): Path to write the shuffled tokenized data.
- batch_size (int): Number of documents to process per batch.
- seed (Optional[int], optional): Seed for the random number generator. Defaults to None.
- Returns:
- None
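
To illustrate what the `seed` and `batch_size` parameters imply, here is a minimal sketch of a seeded, in-memory document shuffle. It is not the modalities implementation and does not read or write `.pbin` files. The function `shuffle_documents` is hypothetical, and documents are represented as plain `bytes` objects purely for illustration.

```python
import random
from typing import List, Optional, Sequence


def shuffle_documents(
    documents: Sequence[bytes],
    batch_size: int,
    seed: Optional[int] = None,
) -> List[bytes]:
    """Return the documents in a shuffled order (hypothetical sketch).

    All documents are held in memory at once. `batch_size` only controls
    how many documents are copied to the output per step.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # A dedicated generator keeps the result reproducible for a given seed
    # without touching the global random state.
    rng = random.Random(seed)
    order = list(range(len(documents)))
    rng.shuffle(order)

    shuffled: List[bytes] = []
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        shuffled.extend(documents[i] for i in batch)
    return shuffled
```

With a fixed `seed`, repeated runs produce the same order; with `seed=None`, the order differs between runs. The documented method itself takes file paths instead of in-memory documents, per the signature `shuffle_tokenized_data(input_data_path, output_data_path, batch_size, seed=None)`.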