modalities.dataloader package
Submodules
modalities.dataloader.create_index module
modalities.dataloader.create_packed_data module
modalities.dataloader.dataloader module
modalities.dataloader.dataloader_factory module
modalities.dataloader.dataset module
modalities.dataloader.dataset_factory module
modalities.dataloader.large_file_lines_reader module
- class modalities.dataloader.large_file_lines_reader.LargeFileLinesReader(raw_data_path, index_path=None, encoding='utf-8', use_sample_length_from_index=True)
Bases: BaseReader
LargeFileLinesReader class that reads lines from a large file efficiently.
Initializes a LargeFileLinesReader object.
- Args:
- raw_data_path (Path): Path to a JSONL file that holds text data.
- index_path (Optional[Path]): Path to an index file that gives the start position (character or byte) and the length of each sample in raw_data_path. If not defined, an index file next to raw_data_path is picked, named by replacing the suffix of raw_data_path with “.idx”.
- encoding (Optional[str]): The encoding of the file (default: “utf-8”). If encoding is None, the raw data is read as bytes.
- use_sample_length_from_index (bool): If True, the sample length is taken from the index file, i.e., from the (offset, sample_length) pairs. If False, the sample length is calculated as the difference between the starting point of the next sample and that of the current sample.
- Returns:
None
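The effect of use_sample_length_from_index can be illustrated with a minimal sketch. This is not the library's implementation: the function read_sample, the in-memory raw buffer, and the handling of the last sample (using the end of the data as the "next" starting point) are assumptions made for illustration only.

```python
from typing import List, Tuple


def read_sample(
    raw: bytes,
    index: List[Tuple[int, int]],
    i: int,
    use_sample_length_from_index: bool = True,
) -> bytes:
    """Hypothetical helper: slice sample i out of raw using (offset, length) pairs."""
    offset, length = index[i]
    if not use_sample_length_from_index:
        # Derive the length from the start of the next sample.
        # Assumption: for the last sample, the end of the data is used.
        next_offset = index[i + 1][0] if i + 1 < len(index) else len(raw)
        length = next_offset - offset
    return raw[offset : offset + length]


# Two JSONL lines; the index stores the byte offset and length of each line,
# excluding the trailing newline.
raw = b'{"text": "a"}\n{"text": "bb"}\n'
index = [(0, 13), (14, 14)]

from_index = read_sample(raw, index, 0)
from_offsets = read_sample(raw, index, 0, use_sample_length_from_index=False)
```

In this sketch, the length derived from neighbouring offsets also covers the newline separator, so the two modes can return slightly different slices for the same sample.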
- static default_index_path(raw_data_path, index_path=None)
Returns the default index path for the given raw data path.
- Args:
- raw_data_path (Path): The path to the raw data file.
- index_path (Optional[Path]): The path to the index file (default: None).
- Returns:
Path: The default index path.
- Note:
- If index_path is not provided, the default index path is generated by
appending the extension “.idx” to the stem of the raw_data_path.
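The naming rule in the note can be sketched with pathlib. This is a hedged illustration rather than the library's code: the name default_index_path_sketch is hypothetical, and returning index_path unchanged when it is provided is an assumption that the documentation above does not state explicitly.

```python
from pathlib import Path
from typing import Optional


def default_index_path_sketch(raw_data_path: Path, index_path: Optional[Path] = None) -> Path:
    """Hypothetical stand-in for the documented default index path rule."""
    if index_path is None:
        # Keep the directory and stem of the raw data file, swap the suffix for ".idx".
        return raw_data_path.with_suffix(".idx")
    # Assumption: an explicitly provided index path is returned as-is.
    return index_path


derived = default_index_path_sketch(Path("data/train.jsonl"))
explicit = default_index_path_sketch(Path("data/train.jsonl"), Path("indices/custom.idx"))
```

Here derived points to data/train.idx, next to the raw data file, while explicit keeps the path that was passed in.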