modalities.dataloader package

Submodules

modalities.dataloader.create_index module

modalities.dataloader.create_packed_data module

modalities.dataloader.dataloader module

modalities.dataloader.dataloader_factory module

modalities.dataloader.dataset module

modalities.dataloader.dataset_factory module

modalities.dataloader.large_file_lines_reader module

class modalities.dataloader.large_file_lines_reader.BaseReader

Bases: ABC

class modalities.dataloader.large_file_lines_reader.LargeFileLinesReader(raw_data_path, index_path=None, encoding='utf-8', use_sample_length_from_index=True)

Bases: BaseReader

LargeFileLinesReader class that reads lines from a large file efficiently.

Initializes a LargeFileLinesReader object.

Args:
  • raw_data_path (Path): Path to a jsonl file, which holds text data.

  • index_path (Optional[Path]): Path to an index file, which indicates the start character/byte position and length of samples given in raw_data_path. If not defined, an index next to raw_data_path is picked, by replacing its suffix with “.idx”.

  • encoding (Optional[str]): The encoding of the file (default: “utf-8”). If encoding is None, the raw data is read as bytes.

  • use_sample_length_from_index (bool): If True, the sample length is taken from the index file, i.e., the (offset, sample_length) pairs. If False, the sample length is calculated as the difference between the starting point of the next and the current sample.

Returns:

None

Parameters:
  • raw_data_path (Path)

  • index_path (Path | None)

  • encoding (str | None)

  • use_sample_length_from_index (bool)
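
The following is a minimal, self-contained sketch of the idea behind an (offset, sample_length) index, not the modalities implementation. The helper names build_index and read_sample are hypothetical, and the index is kept as an in-memory list of byte offsets rather than the library's on-disk “.idx” format. It shows how a sample can be read with a single seek, and how the two values of use_sample_length_from_index could differ.

```python
import tempfile
from pathlib import Path


def build_index(raw_data_path: Path) -> list[tuple[int, int]]:
    """Hypothetical helper: (byte offset, byte length) per line, newline excluded."""
    index = []
    offset = 0
    with raw_data_path.open("rb") as f:
        for line in f:
            index.append((offset, len(line.rstrip(b"\n"))))
            offset += len(line)
    return index


def read_sample(raw_data_path, index, i, encoding="utf-8",
                use_sample_length_from_index=True):
    """Hypothetical helper: read sample i by seeking to its offset."""
    offset, length = index[i]
    if not use_sample_length_from_index:
        # Derive the length from the start of the next sample instead
        # (or from the end of the file for the last sample).
        if i + 1 < len(index):
            length = index[i + 1][0] - offset
        else:
            length = raw_data_path.stat().st_size - offset
    with raw_data_path.open("rb") as f:
        f.seek(offset)
        raw = f.read(length)
    # encoding=None mirrors the documented behaviour: return raw bytes.
    return raw if encoding is None else raw.decode(encoding)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl"
    path.write_bytes(b'{"text": "first"}\n{"text": "zweite \xc3\xa4"}\n')
    index = build_index(path)
    second = read_sample(path, index, 1)
    second_raw = read_sample(path, index, 1, encoding=None)
    second_derived = read_sample(path, index, 1,
                                 use_sample_length_from_index=False)
```

In this sketch the derived length spans everything up to the next sample's start, so second_derived also contains the trailing newline, while the indexed length excludes it. Whether the real reader behaves the same way depends on how its index was created.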

close()

static default_index_path(raw_data_path, index_path=None)

Returns the default index path for the given raw data path.

Return type:

Path

Parameters:
  • raw_data_path (Path)

  • index_path (Path | None)

Args:
  • raw_data_path (Path): The path to the raw data file.

  • index_path (Optional[Path]): The path to the index file (default: None).

Returns:

Path: The default index path.

Note:
If index_path is not provided, the default index path is generated by appending the extension “.idx” to the stem of the raw_data_path.
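
The rule in the note can be sketched with pathlib as follows. This is an illustration under assumptions, not the library's source: the function name default_index_path_sketch is hypothetical, and returning an explicitly passed index_path unchanged is assumed from the signature rather than stated above.

```python
from pathlib import Path
from typing import Optional


def default_index_path_sketch(raw_data_path: Path,
                              index_path: Optional[Path] = None) -> Path:
    """Hypothetical stand-in for the documented static method."""
    if index_path is not None:
        # Assumption: an explicitly provided index path is used as-is.
        return index_path
    # Stem of the raw data file plus ".idx", placed next to the raw data.
    return raw_data_path.with_name(raw_data_path.stem + ".idx")


derived = default_index_path_sketch(Path("data/train.jsonl"))
explicit = default_index_path_sketch(Path("data/train.jsonl"),
                                     Path("indices/custom.idx"))
```

Here derived points to data/train.idx, next to the raw data file, and explicit is the path that was passed in.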

modalities.dataloader.samplers module

Module contents