modalities.preprocessing package

Submodules

modalities.preprocessing.create_chunks module

class modalities.preprocessing.create_chunks.Chunking[source]

Bases: object

static get_jsonl_file_chunk(dataset, num_chunks, chunk_id)[source]
Return type:

list[Any]

static get_tokenized_file_chunk(dataset, num_chunks, chunk_id)[source]
Return type:

list[ndarray]

static shuffle_file_chunks_in_place(file_chunks, seed=None)[source]
Return type:

None

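A minimal usage sketch for the chunking helpers. The expected type of dataset is not documented above, so the example assumes a plain in-memory list of JSONL records; the file name is a placeholder.

    import json
    from pathlib import Path

    from modalities.preprocessing.create_chunks import Chunking

    # Assumption: the dataset argument is treated here as a plain list of
    # already-loaded JSONL records; "data.jsonl" is a placeholder path.
    records = [
        json.loads(line)
        for line in Path("data.jsonl").read_text().splitlines()
        if line.strip()
    ]

    # Split the records into 4 chunks and select the first chunk.
    chunk = Chunking.get_jsonl_file_chunk(dataset=records, num_chunks=4, chunk_id=0)

    # Shuffle the selected chunk in place, reproducibly via the seed.
    Chunking.shuffle_file_chunks_in_place(file_chunks=chunk, seed=42)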

modalities.preprocessing.shuffle_data module

class modalities.preprocessing.shuffle_data.DataShuffler[source]

Bases: object

static shuffle_jsonl_data(input_data_path, output_data_path, seed=None)[source]
Parameters:
  • input_data_path (Path)

  • output_data_path (Path)

  • seed (int | None)
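
A short usage sketch (file names are placeholders):

    from pathlib import Path

    from modalities.preprocessing.shuffle_data import DataShuffler

    # Shuffle the lines of a JSONL file and write the result to a new file;
    # the seed makes the shuffle reproducible. File names are placeholders.
    DataShuffler.shuffle_jsonl_data(
        input_data_path=Path("train.jsonl"),
        output_data_path=Path("train_shuffled.jsonl"),
        seed=42,
    )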

static shuffle_tokenized_data(input_data_path, output_data_path, batch_size, seed=None)[source]

Shuffles a tokenized file (.pbin) and writes the shuffled data to the specified output file.

Note that the tokenized data is fully materialized in memory.

Return type:

None

Parameters:
  • input_data_path (Path): Path to the tokenized data (.pbin).

  • output_data_path (Path): Path to write the shuffled tokenized data.

  • batch_size (int): Number of documents to process per batch.

  • seed (int | None): Seed for the random number generator. Defaults to None.
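
A short usage sketch (file names and batch size are placeholders; keep the in-memory note above in mind for large files):

    from pathlib import Path

    from modalities.preprocessing.shuffle_data import DataShuffler

    # Shuffle a packed tokenized file (.pbin). Documents are processed in
    # batches of batch_size, but the tokenized data is fully materialized
    # in memory, so ensure the file fits into RAM.
    DataShuffler.shuffle_tokenized_data(
        input_data_path=Path("train.pbin"),
        output_data_path=Path("train_shuffled.pbin"),
        batch_size=1024,
        seed=42,
    )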

Module contents