Split

libreco.data.random_split(data, shuffle=True, test_size=None, multi_ratios=None, filter_unknown=True, pad_unknown=False, pad_val=None, seed=42)

Split the data randomly.

Parameters:

data (pandas.DataFrame) – The data to split.
shuffle (bool, default: True) – Whether to shuffle data when splitting.
test_size (float or None, default: None) – Test data ratio.
multi_ratios (list of float, tuple of (float,) or None, default: None) – Ratios for splitting data in multiple parts. If test_size is not None, multi_ratios will be ignored.
filter_unknown (bool, default: True) – Whether to filter out users and items that don’t appear in the train data from eval and test data, since models can only recommend items in the train data.
pad_unknown (bool, default: False) – Fill the unknown users/items with pad_val. If filter_unknown is True, this parameter will be ignored.
pad_val (any, default: None) – Pad value used in pad_unknown.
seed (int, default: 42) – Random seed.

Returns:

multiple data – The split data.

Return type:

list of pandas.DataFrame

Raises:

ValueError – If neither test_size nor multi_ratios is provided.

Examples

>>> train, test = random_split(data, test_size=0.2)
>>> train_data, eval_data, test_data = random_split(data, multi_ratios=[0.8, 0.1, 0.1])
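The interplay of filter_unknown, pad_unknown and pad_val described above can be pictured with a small pure-Python sketch. It is only an illustration under stated assumptions, not the library's implementation: the helper name handle_unknown and the (user, item) tuple layout are hypothetical, and the real functions operate on pandas.DataFrame objects.

```python
def handle_unknown(train, test, filter_unknown=True, pad_unknown=False, pad_val=None):
    """Illustrative only: treat test rows whose user or item is absent from train.

    train and test are lists of (user, item) tuples (a hypothetical layout).
    """
    known_users = {user for user, _ in train}
    known_items = {item for _, item in train}
    if filter_unknown:
        # Drop every test row that mentions an unseen user or item.
        return [(u, i) for u, i in test if u in known_users and i in known_items]
    if pad_unknown:
        # Keep every row, but replace unseen users/items with pad_val.
        return [
            (u if u in known_users else pad_val, i if i in known_items else pad_val)
            for u, i in test
        ]
    return list(test)


train = [("u1", "a"), ("u1", "b"), ("u2", "a")]
test = [("u1", "a"), ("u3", "b"), ("u2", "c")]

filtered = handle_unknown(train, test)
padded = handle_unknown(train, test, filter_unknown=False, pad_unknown=True, pad_val=0)
```

As in the parameter description, filter_unknown takes precedence here: pad_unknown only has an effect when filter_unknown is False.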

libreco.data.split_by_ratio(data, order=True, shuffle=False, test_size=None, multi_ratios=None, filter_unknown=True, pad_unknown=False, pad_val=None, seed=42)

Assign a certain ratio of items to test data for each user.

Note

If a user has interacted with fewer than 3 items, all of those items are assigned to train data.

Parameters:

data (pandas.DataFrame) – The data to split.
order (bool, default: True) – Whether to preserve order for user’s item sequence.
shuffle (bool, default: False) – Whether to shuffle data after splitting.
test_size (float or None, default: None) – Test data ratio.
multi_ratios (list of float, tuple of (float,) or None, default: None) – Ratios for splitting data in multiple parts. If test_size is not None, multi_ratios will be ignored.
filter_unknown (bool, default: True) – Whether to filter out users and items that don’t appear in the train data from eval and test data, since models can only recommend items in the train data.
pad_unknown (bool, default: False) – Fill the unknown users/items with pad_val. If filter_unknown is True, this parameter will be ignored.
pad_val (any, default: None) – Pad value used in pad_unknown.
seed (int, default: 42) – Random seed.

Returns:

multiple data – The split data.

Return type:

list of pandas.DataFrame

Raises:

ValueError – If neither test_size nor multi_ratios is provided.

See also

split_by_ratio_chrono
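A per-user ratio split, including the rule that users with fewer than 3 interactions stay entirely in train data, might look roughly like the sketch below. This is an assumption-laden illustration rather than the library's code: the name split_by_ratio_sketch, the (user, item) tuple input, and the use of round() to size each user's test part are all hypothetical choices.

```python
from collections import defaultdict


def split_by_ratio_sketch(interactions, test_size=0.2):
    """Split (user, item) pairs per user, keeping each user's original order."""
    per_user = defaultdict(list)
    for user, item in interactions:
        per_user[user].append(item)

    train, test = [], []
    for user, items in per_user.items():
        # Users with fewer than 3 interactions stay entirely in train.
        # Rounding the test count is an assumption made for this sketch.
        n_test = round(len(items) * test_size) if len(items) >= 3 else 0
        cut = len(items) - n_test
        train.extend((user, item) for item in items[:cut])
        test.extend((user, item) for item in items[cut:])
    return train, test


interactions = [("u1", i) for i in "abcde"] + [("u2", "x"), ("u2", "y")]
train, test = split_by_ratio_sketch(interactions, test_size=0.2)
```

Here user u1 contributes the last of five items to the test part, while user u2, with only two interactions, remains entirely in train.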

libreco.data.split_by_num(data, order=True, shuffle=False, test_size=1, filter_unknown=True, pad_unknown=False, pad_val=None, seed=42)

Assign a certain number of items to test data for each user.

Note

If a user has interacted with fewer than 3 items, all of those items are assigned to train data.

Parameters:

data (pandas.DataFrame) – The data to split.
order (bool, default: True) – Whether to preserve order for user’s item sequence.
shuffle (bool, default: False) – Whether to shuffle data after splitting.
test_size (int, default: 1) – Number of items assigned to test data for each user.
filter_unknown (bool, default: True) – Whether to filter out users and items that don’t appear in the train data from eval and test data, since models can only recommend items in the train data.
pad_unknown (bool, default: False) – Fill the unknown users/items with pad_val. If filter_unknown is True, this parameter will be ignored.
pad_val (any, default: None) – Pad value used in pad_unknown.
seed (int, default: 42) – Random seed.

Returns:

multiple data – The split data.

Return type:

list of pandas.DataFrame

Raises:

ValueError – If neither test_size nor multi_ratios is provided.

See also

split_by_num_chrono
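Holding out a fixed number of items per user can be sketched in the same spirit. The helper below is hypothetical and not taken from the library. It assumes (user, item) tuples in each user's original order, and it caps the held-out count so that every user keeps at least one train item, which is an assumption of this sketch.

```python
from collections import defaultdict


def split_by_num_sketch(interactions, test_size=1):
    """Hold out the last `test_size` items of each user as test data."""
    per_user = defaultdict(list)
    for user, item in interactions:
        per_user[user].append(item)

    train, test = [], []
    for user, items in per_user.items():
        if len(items) < 3:
            # Users with fewer than 3 interactions stay entirely in train.
            n_test = 0
        else:
            # Assumption: always leave at least one item in train.
            n_test = min(test_size, len(items) - 1)
        cut = len(items) - n_test
        train.extend((user, item) for item in items[:cut])
        test.extend((user, item) for item in items[cut:])
    return train, test


interactions = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "x")]
train, test = split_by_num_sketch(interactions, test_size=1)
```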

libreco.data.split_by_ratio_chrono(data, order=True, shuffle=False, test_size=None, multi_ratios=None, seed=42)

Assign a certain ratio of items to test data for each user, where items are sorted by time first.

Important

This function requires the data to contain a time column.

Parameters:

data (pandas.DataFrame) – The data to split.
order (bool, default: True) – Whether to preserve order for user’s item sequence.
shuffle (bool, default: False) – Whether to shuffle data after splitting.
test_size (float or None, default: None) – Test data ratio.
multi_ratios (list of float, tuple of (float,) or None, default: None) – Ratios for splitting data in multiple parts. If test_size is not None, multi_ratios will be ignored.
seed (int, default: 42) – Random seed.

Returns:

multiple data – The split data.

Return type:

list of pandas.DataFrame

Raises:

ValueError – If neither test_size nor multi_ratios is provided.

See also

split_by_ratio
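The chronological variant differs only in that interactions are sorted by time before the per-user split, so the most recent interactions end up in the test part. A minimal sketch, assuming (user, item, time) tuples and a hypothetical helper name, and again using round() as an illustrative way to size the test part:

```python
from collections import defaultdict


def split_by_ratio_chrono_sketch(rows, test_size=0.2):
    """Sort (user, item, time) rows by time, then split each user's rows by ratio."""
    per_user = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[2]):
        per_user[row[0]].append(row)

    train, test = [], []
    for user_rows in per_user.values():
        # The latest interactions of each user go to the test part.
        n_test = round(len(user_rows) * test_size)
        cut = len(user_rows) - n_test
        train.extend(user_rows[:cut])
        test.extend(user_rows[cut:])
    return train, test


rows = [("u1", "e", 5), ("u1", "a", 1), ("u1", "c", 3), ("u1", "b", 2), ("u1", "d", 4)]
train, test = split_by_ratio_chrono_sketch(rows)
```

Sorting first means the split never trains on interactions that happened after the held-out ones, which is the usual motivation for a time-based split.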

libreco.data.split_by_num_chrono(data, order=True, shuffle=False, test_size=1, seed=42)

Assign a certain number of items to test data for each user, where items are sorted by time first.

Important

This function requires the data to contain a time column.

Parameters:

data (pandas.DataFrame) – The data to split.
order (bool, default: True) – Whether to preserve order for user’s item sequence.
shuffle (bool, default: False) – Whether to shuffle data after splitting.
test_size (int, default: 1) – Number of items assigned to test data for each user.
seed (int, default: 42) – Random seed.

Returns:

multiple data – The split data.

Return type:

list of pandas.DataFrame

Raises:

ValueError – If neither test_size nor multi_ratios is provided.

See also

split_by_num

libreco.data.split_multi_value(data, multi_value_col, sep, max_len=None, pad_val='missing', user_col=None, item_col=None)

Split multi-valued features into separate sub-features.

Parameters:

data (pandas.DataFrame) – Original data.
multi_value_col (list of str) – Multi-value column names.
sep (str) – Delimiter to use.
max_len (list or tuple of int or None, default: None) – The maximum number of sub-features after transformation. If None, the maximum category length across all samples is used. Otherwise it must be a list or tuple with one entry per column in multi_value_col, since there may be several multi-value features.
pad_val (Any or list of Any, default: "missing") – The padding value used for missing features.
user_col (list of str or None, default: None) – User column names.
item_col (list of str or None, default: None) – Item column names.

Returns:

data (pandas.DataFrame) – Transformed data.
multi_sparse_col (list of str) – Transformed multi-sparse column names.
user_sparse_col (list of str) – Transformed user columns.
item_sparse_col (list of str) – Transformed item columns.

Raises:

AssertionError – If max_len is not a list or tuple.
AssertionError – If max_len size != multi_value_col size.
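The core idea of splitting one multi-valued column into fixed-width sub-features can be shown in a few lines of plain Python. This sketch is not the library's implementation: it handles a single column given as a list of strings, takes max_len as a single int rather than a list, and truncates values beyond max_len, all of which are assumptions made for illustration.

```python
def split_multi_value_sketch(values, sep, max_len=None, pad_val="missing"):
    """Split each string on `sep` and pad (or truncate) to a common width."""
    parts = [value.split(sep) for value in values]
    # With max_len=None, use the longest split found in the data.
    width = max_len if max_len is not None else max(len(p) for p in parts)
    # A negative pad count yields an empty list, so long rows are only truncated.
    return [p[:width] + [pad_val] * (width - len(p)) for p in parts]


genres = ["Action|Comedy|Drama", "Comedy", "Drama|Thriller"]
full = split_multi_value_sketch(genres, sep="|")
capped = split_multi_value_sketch(genres, sep="|", max_len=2)
```

Each inner list corresponds to one row, and each position in it to one generated sub-feature column.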