Merger

class Merger(merge_config: dict[str, tuple[str, ...]])

Bases: object

Merge existing splits of the dataset and assign them custom names.

Creates a new DatasetDict whose split names are the keys of merge_config. Each new split is built by merging the existing splits (e.g. “train”, “valid” and “test”) listed for it.

Parameters:

merge_config (Dict[str, Tuple[str, ...]]) – Dictionary that maps each desired split name (key) to a tuple of the current split names (value) that will be merged together under that name.

Examples

Create a new DatasetDict with a split named “new_train”, formed by merging the “train” and “valid” splits, and keep the “test” split unchanged.

>>> # Assuming there is a dataset_dict of type `DatasetDict`
>>> # dataset_dict is {"train": train-data, "valid": valid-data, "test": test-data}
>>> merger = Merger(
...     merge_config={
...         "new_train": ("train", "valid"),
...         "test": ("test",),
...     }
... )
>>> new_dataset_dict = merger(dataset_dict)
>>> # new_dataset_dict is
>>> # {"new_train": concatenation of train-data and valid-data, "test": test-data}

Methods

resplit(dataset)

Resplit the dataset according to the merge_config.

resplit(dataset: DatasetDict) → DatasetDict

Resplit the dataset according to the merge_config and return the resulting DatasetDict.
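
The merging behaviour described above can be illustrated with a minimal, library-free sketch. It is not the actual implementation of Merger: the function name merge_splits is hypothetical, plain Python lists stand in for the splits of a DatasetDict, and it assumes that splits not mentioned in merge_config are dropped and that an unknown split name raises a KeyError.

```python
from typing import Dict, List, Tuple


def merge_splits(
    splits: Dict[str, List[int]],
    merge_config: Dict[str, Tuple[str, ...]],
) -> Dict[str, List[int]]:
    """Hypothetical stand-in for Merger.resplit, using lists instead of datasets."""
    merged: Dict[str, List[int]] = {}
    for new_name, old_names in merge_config.items():
        # Assumption: referencing a split that does not exist is an error.
        missing = [name for name in old_names if name not in splits]
        if missing:
            raise KeyError(f"Unknown splits for {new_name!r}: {missing}")
        # Concatenate the listed splits in the order given in the tuple.
        rows: List[int] = []
        for name in old_names:
            rows.extend(splits[name])
        merged[new_name] = rows
    # Assumption: splits absent from merge_config are not carried over.
    return merged


splits = {"train": [1, 2], "valid": [3], "test": [4, 5]}
new_splits = merge_splits(
    splits,
    {"new_train": ("train", "valid"), "test": ("test",)},
)
print(new_splits)  # {'new_train': [1, 2, 3], 'test': [4, 5]}
```

Note the trailing comma in ("test",): without it, ("test") is just the string "test" rather than a one-element tuple.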