modalities.preprocessing package

Submodules

modalities.preprocessing.create_chunks module

class modalities.preprocessing.create_chunks.Chunking[source]

Bases: object

static get_jsonl_file_chunk(dataset, num_chunks, chunk_id)[source]
Return type:

list[Any]

static get_tokenized_file_chunk(dataset, num_chunks, chunk_id)[source]
Return type:

list[ndarray]

static shuffle_file_chunks_in_place(file_chunks, seed=None)[source]
Return type:

None

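A minimal usage sketch for the chunking helpers. The expected type of dataset is not documented above, so the example assumes a plain in-memory list of JSONL records; the file name is a placeholder.

    import json
    from pathlib import Path

    from modalities.preprocessing.create_chunks import Chunking

    # Assumption: the dataset argument is treated here as a plain list of
    # already-loaded JSONL records; "data.jsonl" is a placeholder path.
    records = [
        json.loads(line)
        for line in Path("data.jsonl").read_text().splitlines()
        if line.strip()
    ]

    # Split the records into 4 chunks and select the first chunk.
    chunk = Chunking.get_jsonl_file_chunk(dataset=records, num_chunks=4, chunk_id=0)

    # Shuffle the selected chunk in place, reproducibly via the seed.
    Chunking.shuffle_file_chunks_in_place(file_chunks=chunk, seed=42)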

modalities.preprocessing.shuffle_data module

class modalities.preprocessing.shuffle_data.DataShuffler[source]

Bases: object

static shuffle_jsonl_data(input_data_path, output_data_path, seed=None)[source]
Parameters:
  • input_data_path (Path)

  • output_data_path (Path)

  • seed (int | None)
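
A short usage sketch (file names are placeholders):

    from pathlib import Path

    from modalities.preprocessing.shuffle_data import DataShuffler

    # Shuffle the lines of a JSONL file and write the result to a new file;
    # the seed makes the shuffle reproducible. File names are placeholders.
    DataShuffler.shuffle_jsonl_data(
        input_data_path=Path("train.jsonl"),
        output_data_path=Path("train_shuffled.jsonl"),
        seed=42,
    )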

static shuffle_tokenized_data(input_data_path, output_data_path, batch_size, seed=None)[source]

Shuffles a tokenized file (.pbin) and writes the shuffled data to the specified output file.

Note that the tokenized data is fully materialized in memory.

Return type:

None

Parameters:
  • input_data_path (Path): Path to the tokenized data (.pbin).

  • output_data_path (Path): Path to write the shuffled tokenized data.

  • batch_size (int): Number of documents to process per batch.

  • seed (int | None): Seed for the random number generator. Defaults to None.
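
A short usage sketch (file names and batch size are placeholders; keep the in-memory note above in mind for large files):

    from pathlib import Path

    from modalities.preprocessing.shuffle_data import DataShuffler

    # Shuffle a packed tokenized file (.pbin). Documents are processed in
    # batches of batch_size, but the tokenized data is fully materialized
    # in memory, so ensure the file fits into RAM.
    DataShuffler.shuffle_tokenized_data(
        input_data_path=Path("train.pbin"),
        output_data_path=Path("train_shuffled.pbin"),
        batch_size=1024,
        seed=42,
    )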

Module contents