Split
- libreco.data.random_split(data, shuffle=True, test_size=None, multi_ratios=None, filter_unknown=True, pad_unknown=False, pad_val=None, seed=42)
Split the data randomly.
- Parameters:
data (pandas.DataFrame) – The data to split.
shuffle (bool, default: True) – Whether to shuffle data when splitting.
test_size (float or None, default: None) – Test data ratio.
multi_ratios (list of float, tuple of (float,) or None, default: None) – Ratios for splitting data into multiple parts. If test_size is not None, multi_ratios will be ignored.
filter_unknown (bool, default: True) – Whether to filter out users and items that don’t appear in the train data from eval and test data, since models can only recommend items in the train data.
pad_unknown (bool, default: False) – Fill the unknown users/items with pad_val. If filter_unknown is True, this parameter will be ignored.
pad_val (any, default: None) – Pad value used in pad_unknown.
seed (int, default: 42) – Random seed.
- Returns:
multiple data – The split data.
- Return type:
- Raises:
ValueError – If neither test_size nor multi_ratios is provided.
Examples
>>> train, test = random_split(data, test_size=0.2)
>>> train_data, eval_data, test_data = random_split(data, multi_ratios=[0.8, 0.1, 0.1])
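The effect of filter_unknown and pad_unknown can be illustrated without the library. The following is a minimal sketch in plain Python, assuming interactions are (user, item) tuples; the helper names filter_unknown_rows and pad_unknown_rows are hypothetical and are not part of libreco, and the library's actual implementation may differ.

```python
def filter_unknown_rows(train, test):
    """Keep only test rows whose user and item both occur in train."""
    users = {u for u, _ in train}
    items = {i for _, i in train}
    return [(u, i) for u, i in test if u in users and i in items]


def pad_unknown_rows(train, test, pad_val):
    """Replace users/items unseen in train with pad_val instead of dropping rows."""
    users = {u for u, _ in train}
    items = {i for _, i in train}
    return [
        (u if u in users else pad_val, i if i in items else pad_val)
        for u, i in test
    ]


train = [(1, "a"), (1, "b"), (2, "a")]
test = [(1, "a"), (3, "b"), (2, "c")]

# User 3 and item "c" never appear in train.
filtered = filter_unknown_rows(train, test)
padded = pad_unknown_rows(train, test, pad_val=-1)
```

Filtering shrinks the evaluation set, while padding keeps every row and marks the unseen ids with a sentinel value.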
- libreco.data.split_by_ratio(data, order=True, shuffle=False, test_size=None, multi_ratios=None, filter_unknown=True, pad_unknown=False, pad_val=None, seed=42)
Assign a certain ratio of items to test data for each user.
Note
If a user’s total number of interacted items is less than 3, these items will all be assigned to train data.
- Parameters:
data (pandas.DataFrame) – The data to split.
order (bool, default: True) – Whether to preserve order for user’s item sequence.
shuffle (bool, default: False) – Whether to shuffle data after splitting.
test_size (float or None, default: None) – Test data ratio.
multi_ratios (list of float, tuple of (float,) or None, default: None) – Ratios for splitting data into multiple parts. If test_size is not None, multi_ratios will be ignored.
filter_unknown (bool, default: True) – Whether to filter out users and items that don’t appear in the train data from eval and test data, since models can only recommend items in the train data.
pad_unknown (bool, default: False) – Fill the unknown users/items with pad_val. If filter_unknown is True, this parameter will be ignored.
pad_val (any, default: None) – Pad value used in pad_unknown.
seed (int, default: 42) – Random seed.
- Returns:
multiple data – The split data.
- Return type:
- Raises:
ValueError – If neither test_size nor multi_ratios is provided.
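A per-user ratio split can be sketched as follows. This is an illustration in plain Python, not the library's implementation: split_by_ratio_sketch is a hypothetical helper, interactions are assumed to be (user, item) tuples, and the rounding rule is an assumption.

```python
from collections import defaultdict


def split_by_ratio_sketch(rows, test_size):
    """Per user, move the trailing `test_size` fraction of items to test.

    Users with fewer than 3 interactions stay entirely in train,
    mirroring the note above. Rounding is an assumption of this sketch.
    """
    by_user = defaultdict(list)
    for user, item in rows:          # keeps each user's item order
        by_user[user].append(item)

    train, test = [], []
    for user, items in by_user.items():
        if len(items) < 3:
            train.extend((user, i) for i in items)
            continue
        n_test = max(1, round(len(items) * test_size))
        train.extend((user, i) for i in items[:-n_test])
        test.extend((user, i) for i in items[-n_test:])
    return train, test


rows = [(1, "a"), (1, "b"), (1, "c"), (1, "d"), (1, "e"), (2, "a"), (2, "b")]
train, test = split_by_ratio_sketch(rows, test_size=0.2)
```

User 1 has five interactions, so one of them goes to test; user 2 has only two, so both stay in train.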
- libreco.data.split_by_num(data, order=True, shuffle=False, test_size=1, filter_unknown=True, pad_unknown=False, pad_val=None, seed=42)
Assign a certain number of items to test data for each user.
Note
If a user’s total number of interacted items is less than 3, these items will all be assigned to train data.
- Parameters:
data (pandas.DataFrame) – The data to split.
order (bool, default: True) – Whether to preserve order for user’s item sequence.
shuffle (bool, default: False) – Whether to shuffle data after splitting.
test_size (int, default: 1) – Number of items assigned to test data for each user.
filter_unknown (bool, default: True) – Whether to filter out users and items that don’t appear in the train data from eval and test data. Since models can only recommend items in the train data.
pad_unknown (bool, default: False) – Fill the unknown users/items with pad_val. If filter_unknown is True, this parameter will be ignored.
pad_val (any, default: None) – Pad value used in pad_unknown.
seed (int, default: 42) – Random seed.
- Returns:
multiple data – The split data.
- Return type:
- Raises:
ValueError – If neither test_size nor multi_ratio is provided.
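Holding out a fixed number of items per user might look like the sketch below. It is written in plain Python for illustration only: split_by_num_sketch is a hypothetical helper, and treating order=False as "shuffle each user's items before splitting" is an assumption about the parameter, not confirmed library behavior.

```python
import random
from collections import defaultdict


def split_by_num_sketch(rows, test_size=1, order=True, seed=42):
    """Per user, hold out `test_size` items for test.

    With order=True the last items in the user's sequence are held out;
    with order=False the items are shuffled first (an assumption made
    for illustration only).
    """
    if test_size < 1:
        raise ValueError("test_size must be at least 1")
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in rows:
        by_user[user].append(item)

    train, test = [], []
    for user, items in by_user.items():
        if len(items) < 3:               # too few interactions: all train
            train.extend((user, i) for i in items)
            continue
        if not order:
            items = items[:]             # copy before shuffling
            rng.shuffle(items)
        train.extend((user, i) for i in items[:-test_size])
        test.extend((user, i) for i in items[-test_size:])
    return train, test


rows = [(1, "a"), (1, "b"), (1, "c"), (2, "x"), (2, "y")]
train_o, test_o = split_by_num_sketch(rows, test_size=1)
train_s, test_s = split_by_num_sketch(rows, test_size=1, order=False)
```

In both calls exactly one interaction of user 1 is held out, and every original row ends up in either train or test.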
- libreco.data.split_by_ratio_chrono(data, order=True, shuffle=False, test_size=None, multi_ratios=None, seed=42)
Assign a certain ratio of items to test data for each user, where items are sorted by time first.
Important
This function requires the data to contain a time column.
- Parameters:
data (pandas.DataFrame) – The data to split.
order (bool, default: True) – Whether to preserve order for user’s item sequence.
shuffle (bool, default: False) – Whether to shuffle data after splitting.
test_size (float or None, default: None) – Test data ratio.
multi_ratios (list of float, tuple of (float,) or None, default: None) – Ratios for splitting data into multiple parts. If test_size is not None, multi_ratios will be ignored.
seed (int, default: 42) – Random seed.
- Returns:
multiple data – The split data.
- Return type:
- Raises:
ValueError – If neither test_size nor multi_ratios is provided.
- libreco.data.split_by_num_chrono(data, order=True, shuffle=False, test_size=1, seed=42)
Assign a certain number of items to test data for each user, where items are sorted by time first.
Important
This function requires the data to contain a time column.
- Parameters:
data (pandas.DataFrame) – The data to split.
order (bool, default: True) – Whether to preserve order for user’s item sequence.
shuffle (bool, default: False) – Whether to shuffle data after splitting.
test_size (int, default: 1) – Number of items assigned to test data for each user.
seed (int, default: 42) – Random seed.
- Returns:
multiple data – The split data.
- Return type:
- Raises:
ValueError – If neither test_size nor multi_ratio is provided.
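The chronological variants differ from the plain per-user splits in one step: each user's interactions are sorted by time before the most recent ones are held out. A minimal sketch in plain Python, assuming (user, item, time) tuples and a hypothetical helper name split_by_num_chrono_sketch (not the library's code):

```python
from collections import defaultdict
from operator import itemgetter


def split_by_num_chrono_sketch(rows, test_size=1):
    """Sort each user's interactions by time, then hold out the latest ones.

    Assumes test_size >= 1 and rows shaped as (user, item, time).
    """
    by_user = defaultdict(list)
    for user, item, time in rows:
        by_user[user].append((user, item, time))

    train, test = [], []
    for interactions in by_user.values():
        interactions.sort(key=itemgetter(2))   # oldest first
        train.extend(interactions[:-test_size])
        test.extend(interactions[-test_size:])
    return train, test


rows = [
    (1, "a", 30), (1, "b", 10), (1, "c", 20),
    (2, "x", 5), (2, "y", 1), (2, "z", 9),
]
train, test = split_by_num_chrono_sketch(rows)
```

Each user's latest interaction lands in test regardless of the row order in the input, which is the usual reason for choosing a chronological split.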
- libreco.data.split_multi_value(data, multi_value_col, sep, max_len=None, pad_val='missing', user_col=None, item_col=None)
Transform multi-valued features to the divided sub-features.
- Parameters:
data (pandas.DataFrame) – Original data.
sep (str) – Delimiter to use.
max_len (list or tuple of int or None, default: None) – The maximum number of sub-features after transformation. If it is None, the maximum category length of all the samples will be used. If not None, it should be a list or tuple, because there are possibly many multi_value features.
pad_val (Any or list of Any, default: "missing") – The padding value used for missing features.
user_col (list of str or None, default: None) – User column names.
item_col (list of str or None, default: None) – Item column names.
- Returns:
data (pandas.DataFrame) – Transformed data.
multi_sparse_col (list of str) – Transformed multi-sparse column names.
user_sparse_col (list of str) – Transformed user columns.
item_sparse_col (list of str) – Transformed item columns.
- Raises:
AssertionError – If max_len is not a list or tuple.
AssertionError – If max_len size != multi_value_col size.
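The transformation of a multi-valued feature into fixed-width sub-features can be sketched in plain Python. This is only an illustration for a single column: split_multi_value_sketch is a hypothetical helper, records are assumed to be dicts, max_len is a single int here rather than a list, and the generated column names (genre_1, genre_2) are an assumption, not the names libreco produces.

```python
def split_multi_value_sketch(records, col, sep, max_len=None, pad_val="missing"):
    """Expand one delimited column into fixed-width sub-feature columns."""
    parts = [r[col].split(sep) for r in records]
    # Without max_len, use the longest value list found in the data.
    width = max_len if max_len is not None else max(len(p) for p in parts)
    new_cols = [f"{col}_{k + 1}" for k in range(width)]   # naming is hypothetical

    out = []
    for record, values in zip(records, parts):
        values = (values + [pad_val] * width)[:width]     # pad, then truncate
        row = {k: v for k, v in record.items() if k != col}
        row.update(zip(new_cols, values))
        out.append(row)
    return out, new_cols


records = [
    {"item": 1, "genre": "Action|Comedy|Drama"},
    {"item": 2, "genre": "Horror"},
]
rows, cols = split_multi_value_sketch(records, "genre", sep="|", max_len=2)
```

With max_len=2 the third genre of item 1 is truncated, and the single genre of item 2 is padded with "missing".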