Dataset

Classes for Transforming and Building Data.

class libreco.data.dataset.DatasetPure

Bases: _Dataset

Dataset class used for building pure collaborative filtering data.

Examples

>>> from libreco.data import DatasetPure
>>> train_data, data_info = DatasetPure.build_trainset(train_data)
>>> eval_data = DatasetPure.build_evalset(eval_data)
>>> test_data = DatasetPure.build_testset(test_data)

classmethod build_trainset(train_data, shuffle=False, seed=42)

Build transformed train data and data_info from original data.

Changed in version 1.0.0: Data construction for model retraining has been moved to merge_trainset().

Parameters:
  • train_data (pandas.DataFrame) – Data must contain at least three columns, i.e. user, item, label.

  • shuffle (bool, default: False) –

    Whether to fully shuffle data.

    Warning

    If your data is order- or time-dependent, shuffling is not recommended.

  • seed (int, default: 42) – Random seed.

Returns:

  • trainset (TransformedSet) – Transformed Data object used for training.

  • data_info (DataInfo) – Object that contains some useful information.
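As a minimal sketch of the input this method expects, the snippet below builds a small DataFrame with the three required columns and checks them before handing the frame to build_trainset. The helper missing_columns, the wrapper build_pure_trainset, and the sample values are hypothetical and not part of the library; the library import is deferred so the column check can run without LibRecommender installed.

```python
import pandas as pd

# Column names taken from the train_data parameter description.
REQUIRED_TRAIN_COLS = ("user", "item", "label")


def missing_columns(df, required=REQUIRED_TRAIN_COLS):
    """Hypothetical helper: return the required columns absent from df."""
    return [col for col in required if col not in df.columns]


def build_pure_trainset(df, shuffle=False, seed=42):
    """Hypothetical wrapper: validate columns, then call build_trainset."""
    missing = missing_columns(df)
    if missing:
        raise ValueError(f"train_data is missing columns: {missing}")
    # Imported lazily; this line needs LibRecommender to be installed.
    from libreco.data import DatasetPure

    return DatasetPure.build_trainset(df, shuffle=shuffle, seed=seed)


# Illustrative interactions: one row per (user, item) pair with a label.
train_df = pd.DataFrame(
    {"user": [1, 1, 2, 3], "item": [10, 20, 10, 30], "label": [5, 3, 4, 1]}
)
print(missing_columns(train_df))  # → []
```

With LibRecommender available, build_pure_trainset(train_df) would return the (trainset, data_info) pair described above.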

classmethod merge_trainset(train_data, data_info, merge_behavior=True, shuffle=False, seed=42)

Build transformed data by merging new train data with old data.

New in version 1.0.0.

Changed in version 1.1.0: Applies a more functional approach. A new data_info is constructed and returned, and the old data_info that was passed in should be discarded.

Parameters:
  • train_data (pandas.DataFrame) – Data must contain at least three columns, i.e. user, item, label.

  • data_info (DataInfo) – Object that contains past data information.

  • merge_behavior (bool, default: True) – Whether to merge the user behavior in old and new data.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

  • new_trainset (TransformedSet) – New transformed Data object used for training.

  • new_data_info (DataInfo) – New data_info that contains some useful information.
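The retrain pattern implied by the 1.1.0 change note can be sketched as follows: call merge_trainset, then carry on with the returned data_info instead of the old one. The function retrain_trainset and the class _StubDataset are hypothetical; the stub only mimics the documented call signature so the snippet runs without LibRecommender, and in practice DatasetPure would be passed in its place.

```python
def retrain_trainset(dataset_cls, new_train_df, old_data_info, merge_behavior=True):
    """Hypothetical helper: merge new data with past data, return the new pair."""
    new_trainset, new_data_info = dataset_cls.merge_trainset(
        new_train_df, old_data_info, merge_behavior=merge_behavior
    )
    # Per the 1.1.0 change note, the old data_info should not be reused.
    return new_trainset, new_data_info


class _StubDataset:
    """Hypothetical stand-in that copies only the documented signature."""

    @classmethod
    def merge_trainset(cls, train_data, data_info, merge_behavior=True,
                       shuffle=False, seed=42):
        # Build a fresh info object rather than mutating the one passed in.
        new_info = {"generation": data_info["generation"] + 1}
        return list(train_data), new_info


old_info = {"generation": 0}
trainset, data_info = retrain_trainset(_StubDataset, [("u1", "i1", 1)], old_info)
print(data_info is old_info)  # → False
```

Rebinding the name data_info to the returned object, as in the last assignment, is one simple way to make sure the old object is not used by accident.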

classmethod build_evalset(eval_data, shuffle=False, seed=42)

Build transformed eval data from original data.

Changed in version 1.0.0: Data construction for model retraining has been moved to merge_evalset().

Parameters:
  • eval_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

Transformed Data object used for evaluating.

Return type:

TransformedEvalSet

classmethod build_testset(test_data, shuffle=False, seed=42)

Build transformed test data from original data.

Changed in version 1.0.0: Data construction for model retraining has been moved to merge_testset().

Parameters:
  • test_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

Transformed Data object used for testing.

Return type:

TransformedEvalSet
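To show how the three build methods fit together, here is a rough sketch that slices one interaction table into train, eval and test frames by row position and then builds each set in the same order as the class examples. The helpers split_in_order and build_pure_splits are hypothetical, as are the default fractions; slicing by position is only one possible choice, used here because it keeps row order intact for order- or time-dependent data.

```python
import pandas as pd


def split_in_order(df, eval_frac=0.1, test_frac=0.1):
    """Hypothetical helper: slice df into train/eval/test without reordering."""
    n = len(df)
    n_test = int(n * test_frac)
    n_eval = int(n * eval_frac)
    n_train = n - n_eval - n_test
    train = df.iloc[:n_train]
    evaluation = df.iloc[n_train:n_train + n_eval]
    test = df.iloc[n_train + n_eval:]
    return train, evaluation, test


def build_pure_splits(train_df, eval_df, test_df):
    """Hypothetical wrapper around the three documented build methods."""
    # Imported lazily; this line needs LibRecommender to be installed.
    from libreco.data import DatasetPure

    trainset, data_info = DatasetPure.build_trainset(train_df)
    evalset = DatasetPure.build_evalset(eval_df)
    testset = DatasetPure.build_testset(test_df)
    return trainset, evalset, testset, data_info


# Illustrative data: ten interactions, already in the desired order.
interactions = pd.DataFrame(
    {"user": range(10), "item": range(100, 110), "label": [1] * 10}
)
train_df, eval_df, test_df = split_in_order(interactions)
print(len(train_df), len(eval_df), len(test_df))  # → 8 1 1
```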

classmethod merge_evalset(eval_data, data_info, shuffle=False, seed=42)

Build transformed data by merging new eval data with old data.

New in version 1.0.0.

Parameters:
  • eval_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.

  • data_info (DataInfo) – Object that contains past data information.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

Transformed Data object used for evaluating.

Return type:

TransformedEvalSet

classmethod merge_testset(test_data, data_info, shuffle=False, seed=42)

Build transformed data by merging new test data with old data.

New in version 1.0.0.

Parameters:
  • test_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.

  • data_info (DataInfo) – Object that contains past data information.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

Transformed Data object used for testing.

Return type:

TransformedEvalSet

static shuffle_data(data, seed)

Shuffle data randomly.

Parameters:
  • data (pandas.DataFrame) – Data to shuffle.

  • seed (int) – Random seed.

Returns:

Shuffled data.

Return type:

pandas.DataFrame
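A seeded full shuffle of a DataFrame can be sketched with plain pandas as below. This is an illustration of the idea and not necessarily how shuffle_data is implemented; the function name shuffle_frame and the sample data are hypothetical.

```python
import pandas as pd


def shuffle_frame(data, seed):
    """Sketch of a seeded full shuffle; not necessarily the library's code."""
    # frac=1 samples every row exactly once, i.e. a permutation of the rows.
    shuffled = data.sample(frac=1, random_state=seed)
    # Drop the old index so rows are numbered 0..n-1 in their new order.
    return shuffled.reset_index(drop=True)


df = pd.DataFrame({"user": [1, 2, 3, 4], "item": [10, 20, 30, 40]})
first = shuffle_frame(df, seed=42)
second = shuffle_frame(df, seed=42)
print(first.equals(second))  # → True
```

Using the same seed gives the same row order on every call, which is what makes a shuffled training run reproducible.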

class libreco.data.dataset.DatasetFeat

Bases: _Dataset

Dataset class used for building data that contains features.

Examples

>>> from libreco.data import DatasetFeat
>>> train_data, data_info = DatasetFeat.build_trainset(train_data)
>>> eval_data = DatasetFeat.build_evalset(eval_data)
>>> test_data = DatasetFeat.build_testset(test_data)

classmethod build_trainset(train_data, user_col=None, item_col=None, sparse_col=None, dense_col=None, multi_sparse_col=None, unique_feat=False, pad_val='missing', shuffle=False, seed=42)

Build transformed feat train data and data_info from original data.

Changed in version 1.0.0: Data construction for model retraining has been moved to merge_trainset().

Parameters:
  • train_data (pandas.DataFrame) – Data must contain at least three columns, i.e. user, item, label.

  • user_col (list of str or None, default: None) – List of user feature column names.

  • item_col (list of str or None, default: None) – List of item feature column names.

  • sparse_col (list of str or None, default: None) – List of sparse feature column names.

  • multi_sparse_col (nested lists of str or None, default: None) – Nested lists of multi_sparse feature column names. For example, [["a", "b", "c"], ["d", "e"]].

  • dense_col (list of str or None, default: None) – List of dense feature column names.

  • unique_feat (bool, default: False) – Whether the features of users and items are unique in train data.

  • pad_val (int or str or list, default: "missing") –

    Padding value used in multi_sparse columns so that all samples have the same length.

    Warning

    If pad_val is a single value, it is used for all multi_sparse columns. To use different padding values for different multi_sparse columns, pass pad_val as a list.

  • shuffle (bool, default: False) –

    Whether to fully shuffle data.

    Warning

    If your data is order- or time-dependent, shuffling is not recommended.

  • seed (int, default: 42) – Random seed.

Returns:

  • trainset (TransformedSet) – Transformed Data object used for training.

  • data_info (DataInfo) – Object that contains some useful information.

Raises:

ValueError – If the feature columns specified by the user are inconsistent.
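The feature-column arguments can be illustrated with a small table that has one sparse column, one dense column and one multi_sparse group. The column names, the sample values, and the helpers undeclared_feature_columns and build_feat_trainset are all hypothetical; the pre-check shown here only confirms that the named columns exist in the frame and is not the consistency check that raises ValueError inside the library.

```python
import pandas as pd

train_df = pd.DataFrame(
    {
        "user": [1, 2, 3],
        "item": [10, 20, 30],
        "label": [1, 0, 1],
        "sex": ["F", "M", "F"],  # sparse user feature
        "age": [23, 35, 41],  # dense user feature
        # genre1 and genre2 form one multi_sparse item feature; the item
        # with a single genre is padded with the default pad_val "missing".
        "genre1": ["drama", "comedy", "action"],
        "genre2": ["crime", "missing", "sci-fi"],
    }
)

feat_kwargs = dict(
    user_col=["sex", "age"],
    item_col=["genre1", "genre2"],
    sparse_col=["sex"],
    dense_col=["age"],
    multi_sparse_col=[["genre1", "genre2"]],
    pad_val="missing",  # a list would give one value per multi_sparse group
)


def undeclared_feature_columns(df, sparse_col=None, dense_col=None,
                               multi_sparse_col=None, **_):
    """Hypothetical pre-check: feature names passed in but absent from df."""
    names = list(sparse_col or []) + list(dense_col or [])
    for group in multi_sparse_col or []:
        names.extend(group)
    return [name for name in names if name not in df.columns]


def build_feat_trainset(df, **kwargs):
    """Hypothetical wrapper around DatasetFeat.build_trainset."""
    # Imported lazily; this line needs LibRecommender to be installed.
    from libreco.data import DatasetFeat

    return DatasetFeat.build_trainset(df, **kwargs)


print(undeclared_feature_columns(train_df, **feat_kwargs))  # → []
```

With LibRecommender available, build_feat_trainset(train_df, **feat_kwargs) would forward these arguments to build_trainset unchanged.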

classmethod merge_trainset(train_data, data_info, merge_behavior=True, shuffle=False, seed=42)

Build transformed data by merging new train data with old data.

New in version 1.0.0.

Changed in version 1.1.0: Applies a more functional approach. A new data_info is constructed and returned, and the old data_info that was passed in should be discarded.

Parameters:
  • train_data (pandas.DataFrame) – Data must contain at least three columns, i.e. user, item, label.

  • data_info (DataInfo) – Object that contains past data information.

  • merge_behavior (bool, default: True) – Whether to merge the user behavior in old and new data.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

  • new_trainset (TransformedSet) – New transformed Data object used for training.

  • new_data_info (DataInfo) – New data_info that contains some useful information.

classmethod build_evalset(eval_data, shuffle=False, seed=42)

Build transformed eval data from original data.

Changed in version 1.0.0: Data construction for model retraining has been moved to merge_evalset().

Parameters:
  • eval_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

Transformed Data object used for evaluating.

Return type:

TransformedEvalSet

classmethod build_testset(test_data, shuffle=False, seed=42)

Build transformed test data from original data.

Changed in version 1.0.0: Data construction for model retraining has been moved to merge_testset().

Parameters:
  • test_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

Transformed Data object used for testing.

Return type:

TransformedEvalSet

classmethod merge_evalset(eval_data, data_info, shuffle=False, seed=42)

Build transformed data by merging new eval data with old data.

New in version 1.0.0.

Parameters:
  • eval_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.

  • data_info (DataInfo) – Object that contains past data information.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

Transformed Data object used for evaluating.

Return type:

TransformedEvalSet

classmethod merge_testset(test_data, data_info, shuffle=False, seed=42)

Build transformed data by merging new test data with old data.

New in version 1.0.0.

Parameters:
  • test_data (pandas.DataFrame) – Data must contain at least two columns, i.e. user, item.

  • data_info (DataInfo) – Object that contains past data information.

  • shuffle (bool, default: False) – Whether to fully shuffle data.

  • seed (int, default: 42) – Random seed.

Returns:

Transformed Data object used for testing.

Return type:

TransformedEvalSet

static shuffle_data(data, seed)

Shuffle data randomly.

Parameters:
  • data (pandas.DataFrame) – Data to shuffle.

  • seed (int) – Random seed.

Returns:

Shuffled data.

Return type:

pandas.DataFrame