Dataset
Classes for Transforming and Building Data.
- class libreco.data.dataset.DatasetPure
Bases: _Dataset
Dataset class used for building pure collaborative filtering data.
Examples
>>> from libreco.data import DatasetPure
>>> train_data, data_info = DatasetPure.build_trainset(train_data)
>>> eval_data = DatasetPure.build_evalset(eval_data)
>>> test_data = DatasetPure.build_testset(test_data)
- classmethod build_trainset(train_data, shuffle=False, seed=42)
Build transformed train data and data_info from original data.
Changed in version 1.0.0: Data construction in Model Retrain has been moved to merge_trainset().
- Parameters:
train_data (pandas.DataFrame) – Data must contain at least three columns, i.e. user, item, label.
shuffle (bool, default: False) – Whether to fully shuffle data.
Warning
If your data is order- or time-dependent, shuffling it is not recommended.
seed (int, default: 42) – Random seed.
- Returns:
trainset (TransformedSet) – Transformed Data object used for training.
data_info (DataInfo) – Object that contains some useful information.
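As a rough picture of what a transformed train set involves, the sketch below maps raw user and item IDs to contiguous integer indices. It is not libreco's implementation: `build_id_mapping`, the first-seen index ordering, and the toy rows are assumptions made for illustration, and plain dicts stand in for the pandas.DataFrame input.

```python
def build_id_mapping(raw_ids):
    """Map each distinct raw ID to a contiguous integer index, in first-seen order."""
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)
    return mapping


# Toy interactions with the three required columns: user, item, label.
rows = [
    {"user": "u10", "item": "i7", "label": 1},
    {"user": "u42", "item": "i3", "label": 0},
    {"user": "u10", "item": "i3", "label": 1},
]

# Hypothetical stand-ins for the kind of lookup tables a data_info object might hold.
user2idx = build_id_mapping(r["user"] for r in rows)
item2idx = build_id_mapping(r["item"] for r in rows)

# Encoded rows hold indices instead of raw IDs; labels pass through unchanged.
encoded = [(user2idx[r["user"]], item2idx[r["item"]], r["label"]) for r in rows]
```

In this picture, the encoded rows play the role of the trainset and the two lookup tables play the role of the information kept in data_info.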
- classmethod merge_trainset(train_data, data_info, merge_behavior=True, shuffle=False, seed=42)
Build transformed data by merging new train data with old data.
New in version 1.0.0.
Changed in version 1.1.0: Uses a more functional approach. A new data_info will be constructed and returned, and the passed old data_info should be discarded.
- Parameters:
train_data (pandas.DataFrame) – Data must contain at least three columns, i.e. user, item, label.
data_info (DataInfo) – Object that contains past data information.
merge_behavior (bool, default: True) – Whether to merge the user behavior in old and new data.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
new_trainset (TransformedSet) – New transformed Data object used for training.
new_data_info (DataInfo) – New data_info that contains some useful information.
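The "functional approach" noted above can be pictured as follows: merging returns a fresh object and leaves the old one untouched. This is a minimal sketch under assumptions, not libreco's code. `merge_id_mapping` is a hypothetical helper, and appending unseen IDs after the existing indices is an assumed policy.

```python
def merge_id_mapping(old_mapping, new_ids):
    """Return a new mapping that keeps old indices and appends unseen IDs."""
    merged = dict(old_mapping)  # copy, so the caller's mapping is not mutated
    for raw in new_ids:
        if raw not in merged:
            merged[raw] = len(merged)
    return merged


old_user2idx = {"u10": 0, "u42": 1}
# New train data mentions two known users and two unseen ones.
new_user2idx = merge_id_mapping(old_user2idx, ["u42", "u77", "u10", "u99"])
```

Keeping old indices stable matters in a retraining setting, because anything learned for an existing index still refers to the same user after the merge.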
- classmethod build_evalset(eval_data, shuffle=False, seed=42)
Build transformed eval data from original data.
Changed in version 1.0.0: Data construction in Model Retrain has been moved to merge_evalset().
- Parameters:
eval_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
Transformed Data object used for evaluating.
- Return type:
TransformedSet
- classmethod build_testset(test_data, shuffle=False, seed=42)
Build transformed test data from original data.
Changed in version 1.0.0: Data construction in Model Retrain has been moved to merge_testset().
- Parameters:
test_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
Transformed Data object used for testing.
- Return type:
TransformedSet
- classmethod merge_evalset(eval_data, data_info, shuffle=False, seed=42)
Build transformed data by merging new eval data with old data.
New in version 1.0.0.
- Parameters:
eval_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.
data_info (DataInfo) – Object that contains past data information.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
Transformed Data object used for evaluating.
- Return type:
TransformedSet
- classmethod merge_testset(test_data, data_info, shuffle=False, seed=42)
Build transformed data by merging new test data with old data.
New in version 1.0.0.
- Parameters:
test_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.
data_info (DataInfo) – Object that contains past data information.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
Transformed Data object used for testing.
- Return type:
TransformedSet
- static shuffle_data(data, seed)
Shuffle data randomly.
- Parameters:
data (pandas.DataFrame) – Data to shuffle.
seed (int) – Random seed.
- Returns:
Shuffled data.
- Return type:
pandas.DataFrame
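The role of the seed can be illustrated with a small sketch: the same seed reproduces the same order, and the input is left unchanged. This uses Python lists and the standard-library `random` module as a stand-in for the pandas.DataFrame the method actually takes. `shuffle_rows` is a hypothetical helper, and which random generator libreco uses internally is not stated here.

```python
import random


def shuffle_rows(rows, seed):
    """Return a shuffled copy of rows; the same seed gives the same order."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    shuffled = list(rows)      # copy so the caller's data keeps its original order
    rng.shuffle(shuffled)
    return shuffled


rows = [("u1", "i1", 1), ("u1", "i2", 0), ("u2", "i1", 1), ("u3", "i4", 1)]
first = shuffle_rows(rows, seed=42)
second = shuffle_rows(rows, seed=42)
```

Because a full shuffle discards row order, it is a poor fit for data that was split or sorted by time, which is the reason for the shuffle warning on the build methods.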
- class libreco.data.dataset.DatasetFeat
Bases: _Dataset
Dataset class used for building data that contains features.
Examples
>>> from libreco.data import DatasetFeat
>>> train_data, data_info = DatasetFeat.build_trainset(train_data)
>>> eval_data = DatasetFeat.build_evalset(eval_data)
>>> test_data = DatasetFeat.build_testset(test_data)
- classmethod build_trainset(train_data, user_col=None, item_col=None, sparse_col=None, dense_col=None, multi_sparse_col=None, unique_feat=False, pad_val='missing', shuffle=False, seed=42)
Build transformed feat train data and data_info from original data.
Changed in version 1.0.0: Data construction in Model Retrain has been moved to merge_trainset().
- Parameters:
train_data (pandas.DataFrame) – Data must contain at least three columns, i.e. user, item, label.
user_col (list of str or None, default: None) – List of user feature column names.
item_col (list of str or None, default: None) – List of item feature column names.
sparse_col (list of str or None, default: None) – List of sparse feature column names.
multi_sparse_col (nested lists of str or None, default: None) – Nested lists of multi_sparse feature column names. For example, [["a", "b", "c"], ["d", "e"]].
dense_col (list of str or None, default: None) – List of dense feature column names.
unique_feat (bool, default: False) – Whether the features of users and items are unique in train data.
pad_val (int or str or list, default: "missing") – Padding value in multi_sparse columns, used so that all samples have the same length.
Warning
If pad_val is a single value, it will be used in all multi_sparse columns. To use a different pad_val for different multi_sparse columns, pass pad_val as a list.
shuffle (bool, default: False) – Whether to fully shuffle data.
Warning
If your data is order- or time-dependent, shuffling it is not recommended.
seed (int, default: 42) – Random seed.
- Returns:
trainset (TransformedSet) – Transformed Data object used for training.
data_info (DataInfo) – Object that contains some useful information.
- Raises:
ValueError – If the feature columns specified by the user are inconsistent.
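The pad_val rule described in the parameters (a single value applies to every multi_sparse group, a list gives one value per group) can be sketched as below. This is an illustrative sketch, not libreco's code. `resolve_pad_vals` and `pad_group` are hypothetical helpers, the column names are invented, and raising ValueError on a length mismatch is an assumption.

```python
def resolve_pad_vals(pad_val, multi_sparse_col):
    """Return one padding value per multi_sparse group."""
    if isinstance(pad_val, (list, tuple)):
        if len(pad_val) != len(multi_sparse_col):
            raise ValueError("pad_val list must have one entry per multi_sparse group")
        return list(pad_val)
    # A single value is broadcast to every group.
    return [pad_val] * len(multi_sparse_col)


def pad_group(values, width, pad):
    """Extend a sample's values with `pad` until it fills exactly `width` columns."""
    if len(values) > width:
        raise ValueError("sample has more values than columns in the group")
    return list(values) + [pad] * (width - len(values))


# Two multi_sparse groups: three genre columns and two tag columns.
multi_sparse_col = [["genre1", "genre2", "genre3"], ["tag1", "tag2"]]
pads = resolve_pad_vals(["missing", -1], multi_sparse_col)

# One sample with a single genre and two tags.
genres = pad_group(["comedy"], len(multi_sparse_col[0]), pads[0])
tags = pad_group([5, 9], len(multi_sparse_col[1]), pads[1])
row = {**dict(zip(multi_sparse_col[0], genres)), **dict(zip(multi_sparse_col[1], tags))}
```

A per-group list is useful when the groups hold different types, as here, where a string placeholder suits the genre columns and an integer placeholder suits the tag columns.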
- classmethod merge_trainset(train_data, data_info, merge_behavior=True, shuffle=False, seed=42)
Build transformed data by merging new train data with old data.
New in version 1.0.0.
Changed in version 1.1.0: Uses a more functional approach. A new data_info will be constructed and returned, and the passed old data_info should be discarded.
- Parameters:
train_data (pandas.DataFrame) – Data must contain at least three columns, i.e. user, item, label.
data_info (DataInfo) – Object that contains past data information.
merge_behavior (bool, default: True) – Whether to merge the user behavior in old and new data.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
new_trainset (TransformedSet) – New transformed Data object used for training.
new_data_info (DataInfo) – New data_info that contains some useful information.
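One way to picture what merge_behavior=True might amount to is a union of each user's interaction history across old and new data. The sketch below rests on that assumption: "user behavior" is read here as a per-user list of interacted items, `merge_histories` is a hypothetical helper, and the case merge_behavior=False is deliberately not modelled because its exact semantics are not described above.

```python
def merge_histories(old_consumed, new_consumed):
    """Combine per-user item histories, appending new items not already present."""
    # Copy the old lists so the old histories are not mutated.
    merged = {user: list(items) for user, items in old_consumed.items()}
    for user, items in new_consumed.items():
        history = merged.setdefault(user, [])
        for item in items:
            if item not in history:
                history.append(item)
    return merged


old_consumed = {"u10": ["i7", "i3"], "u42": ["i3"]}
new_consumed = {"u10": ["i3", "i9"], "u77": ["i1"]}
merged = merge_histories(old_consumed, new_consumed)
```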
- classmethod build_evalset(eval_data, shuffle=False, seed=42)
Build transformed eval data from original data.
Changed in version 1.0.0: Data construction in Model Retrain has been moved to merge_evalset().
- Parameters:
eval_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
Transformed Data object used for evaluating.
- Return type:
TransformedSet
- classmethod build_testset(test_data, shuffle=False, seed=42)
Build transformed test data from original data.
Changed in version 1.0.0: Data construction in Model Retrain has been moved to merge_testset().
- Parameters:
test_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
Transformed Data object used for testing.
- Return type:
TransformedSet
- classmethod merge_evalset(eval_data, data_info, shuffle=False, seed=42)
Build transformed data by merging new eval data with old data.
New in version 1.0.0.
- Parameters:
eval_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.
data_info (DataInfo) – Object that contains past data information.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
Transformed Data object used for evaluating.
- Return type:
TransformedSet
- classmethod merge_testset(test_data, data_info, shuffle=False, seed=42)
Build transformed data by merging new test data with old data.
New in version 1.0.0.
- Parameters:
test_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.
data_info (DataInfo) – Object that contains past data information.
shuffle (bool, default: False) – Whether to fully shuffle data.
seed (int, default: 42) – Random seed.
- Returns:
Transformed Data object used for testing.
- Return type:
TransformedSet
- static shuffle_data(data, seed)
Shuffle data randomly.
- Parameters:
data (pandas.DataFrame) – Data to shuffle.
seed (int) – Random seed.
- Returns:
Shuffled data.
- Return type:
pandas.DataFrame