Divider

class Divider(divide_config: Dict[str, float] | Dict[str, int] | Dict[str, Dict[str, float]] | Dict[str, Dict[str, int]], divide_split: str | None = None, drop_remaining_splits: bool = False)

Bases: object

Divide existing split(s) of the dataset and assign them custom names.

Create a new DatasetDict whose splits carry the custom names and hold the corresponding share of the data.

Parameters:

divide_config (Union[Dict[str, int], Dict[str, float], Dict[str, Dict[str, int]], Dict[str, Dict[str, float]]]) – If a single-level dictionary, the keys are the new split names. Integer values give the number of samples in each split; float values give the fraction of the total samples assigned to that split. The fractions do not have to sum to 1.0. The order of the entries matters: the first key gets the first block of samples starting from the beginning of the dataset, and so on. If a two-level dictionary (a dictionary of dictionaries), the outer keys are the names of the existing splits to be divided. This is an alternative to specifying divide_split when you need to divide several splits.
divide_split (Optional[str]) – For a single-level divide_config, the name of the split to divide. May be left as None for a single-split dataset (it is inferred automatically). Ignored for a two-level (multi-split) configuration.
drop_remaining_splits (bool) – For a single-level divide_config, whether the splits that are not divided are dropped.
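
The ordering rule for a single-level divide_config can be illustrated without the library. The following is a minimal sketch, not the Divider implementation: the helper name plan_ranges is hypothetical, fractions are assumed to be truncated with int(), and raising when the requested sizes exceed the dataset is an assumption of this sketch.

```python
from typing import Dict, Tuple, Union

Number = Union[int, float]


def plan_ranges(
    divide_config: Dict[str, Number], num_samples: int
) -> Dict[str, Tuple[int, int]]:
    """Map each new split name to a half-open (start, end) index range.

    Hypothetical helper: it only mirrors the documented ordering rule.
    """
    ranges: Dict[str, Tuple[int, int]] = {}
    start = 0
    # Dicts preserve insertion order, so the first key gets the first block.
    for name, size in divide_config.items():
        if isinstance(size, float):
            # Assumption: a fraction is converted to a count by truncation.
            count = int(num_samples * size)
        else:
            count = size
        end = start + count
        if end > num_samples:
            # Assumption: asking for more samples than exist is an error.
            raise ValueError(f"Split '{name}' exceeds the {num_samples} available samples.")
        ranges[name] = (start, end)
        start = end
    return ranges


print(plan_ranges({"train": 0.8, "valid": 0.2}, 10))
print(plan_ranges({"a": 3, "b": 5}, 10))
```

In this sketch, sizes that add up to less than the whole dataset simply leave the trailing samples unassigned, which is one way to read "these fractions do not have to sum to 1.0".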

Raises:

ValueError – If the specified name of a new split is already present in the dataset and drop_remaining_splits is False.
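
The name-collision condition can be sketched as a standalone check. This is an illustration under assumptions, not the library's code: the function name check_new_split_names is hypothetical, and the split being divided is assumed to be allowed to reuse its own name (as "train" does in the examples on this page).

```python
from typing import Iterable


def check_new_split_names(
    existing_splits: Iterable[str],
    divided_split: str,
    new_names: Iterable[str],
    drop_remaining_splits: bool,
) -> None:
    """Raise ValueError if a new split name clashes with a split that is kept."""
    if drop_remaining_splits:
        # The untouched splits are dropped, so nothing can clash with them.
        return
    # Assumption: the divided split itself is replaced, so its name is free.
    remaining = set(existing_splits) - {divided_split}
    clashes = sorted(remaining.intersection(new_names))
    if clashes:
        raise ValueError(f"New split names already present in the dataset: {clashes}")


# Dividing "train" into "train" and "valid" while keeping "test" passes.
check_new_split_names(["train", "test"], "train", ["train", "valid"], False)
```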

Examples

Create a new DatasetDict by dividing the “train” split into “train” and “valid” splits that hold 80% and 20% of the data, respectively. Keep the “test” split.

Using the divide_split parameter and the “smaller” (i.e. single-level) divide_config:

>>> # Assuming there is a dataset_dict of type `DatasetDict`
>>> # dataset_dict is {"train": train-data, "test": test-data}
>>> divider = Divider(
>>>     divide_config={
>>>         "train": 0.8,
>>>         "valid": 0.2,
>>>     },
>>>     divide_split="train",
>>> )
>>> new_dataset_dict = divider(dataset_dict)
>>> # new_dataset_dict is
>>> # {"train": 80% of train, "valid": 20% of train, "test": test-data}

Using the “bigger” (i.e. two-level dict) version of divide_config without divide_split to accomplish the same (dividing train into train and valid with 80% and 20%, respectively) and additionally dividing the test split:

>>> # Assuming there is a dataset_dict of type `DatasetDict`
>>> # dataset_dict is {"train": train-data, "test": test-data}
>>> divider = Divider(
>>>     divide_config={
>>>         "train": {
>>>             "train": 0.8,
>>>             "valid": 0.2,
>>>         },
>>>         "test": {"test-a": 0.4, "test-b": 0.6},
>>>     }
>>> )
>>> new_dataset_dict = divider(dataset_dict)
>>> # new_dataset_dict is
>>> # {"train": 80% of train, "valid": 20% of train,
>>> # "test-a": 40% of test, "test-b": 60% of test}
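
The relationship between the two configuration forms can be sketched as a normalization step. This is a hypothetical helper (to_two_level is not part of the documented API), written under the assumption that leaving divide_split as None on a multi-split dataset is an error; the page only states that None is allowed for single-split datasets.

```python
from typing import Dict, Iterable, Optional, Union

Number = Union[int, float]
SingleLevel = Dict[str, Number]
TwoLevel = Dict[str, SingleLevel]


def to_two_level(
    divide_config: Union[SingleLevel, TwoLevel],
    divide_split: Optional[str],
    split_names: Iterable[str],
) -> TwoLevel:
    """Express any divide_config as {split_to_divide: {new_name: size}}."""
    values = list(divide_config.values())
    if values and all(isinstance(v, dict) for v in values):
        # Already two-level: divide_split is ignored, as documented.
        return {name: dict(inner) for name, inner in divide_config.items()}
    if divide_split is None:
        names = list(split_names)
        if len(names) != 1:
            # Assumption: inference is only possible with exactly one split.
            raise ValueError("divide_split is required for a multi-split dataset.")
        divide_split = names[0]
    return {divide_split: dict(divide_config)}


single = {"train": 0.8, "valid": 0.2}
print(to_two_level(single, "train", ["train", "test"]))
```

Under these assumptions, the single-level example with divide_split="train" and a two-level config with only a "train" entry describe the same division.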

Methods

resplit(dataset)

Resplit the dataset according to the configuration.

resplit(dataset: DatasetDict) → DatasetDict

Resplit the dataset according to the configuration.