YouTubeRetrieval#

class libreco.algorithms.YouTubeRetrieval(task='ranking', data_info=None, loss_type='sampled_softmax', embed_size=16, norm_embed=False, n_epochs=20, lr=0.001, lr_decay=False, epsilon=1e-05, reg=None, batch_size=256, use_bn=True, dropout_rate=None, hidden_units=(128, 64), num_sampled_per_batch=None, sampler='uniform', recent_num=10, random_num=None, multi_sparse_combiner='sqrtn', seed=42, lower_upper_bound=None, tf_sess_config=None)[source]#

Bases: DynEmbedBase

YouTubeRetrieval algorithm. See YouTubeRetrieval / YouTubeRanking for more details.

Note

The implemented algorithm mainly corresponds to the candidate-generation phase of the original paper.

Warning

YouTubeRetrieval can only be used in the ranking task.

Parameters:
  • task ({'ranking'}) – Recommendation task. See Task.

  • data_info (DataInfo object) – Object that contains useful information for training and inference.

  • loss_type ({'sampled_softmax', 'nce'}, default: 'sampled_softmax') – Loss for model training.

  • embed_size (int, default: 16) – Vector size of embeddings.

  • norm_embed (bool, default: False) – Whether to l2 normalize output embeddings.

  • n_epochs (int, default: 20) – Number of epochs for training.

  • lr (float, default 0.001) – Learning rate for training.

  • lr_decay (bool, default: False) – Whether to use learning rate decay.

  • epsilon (float, default: 1e-5) – A small constant added to the denominator to improve numerical stability in the Adam optimizer. According to the official comment, the default value of 1e-8 for epsilon is generally not good, so 1e-5 is used here. Users can try tuning this hyperparameter if training is unstable.

  • reg (float or None, default: None) – Regularization parameter, must be non-negative or None.

  • batch_size (int, default: 256) – Batch size for training.

  • use_bn (bool, default: True) – Whether to use batch normalization.

  • dropout_rate (float or None, default: None) – Probability of an element to be zeroed. If it is None, dropout is not used.

  • hidden_units (int, list of int or tuple of (int,), default: (128, 64)) –

    Number of layers and corresponding layer size in MLP.

    Changed in version 1.0.0: Accept type of int, list or tuple, instead of str.

  • num_sampled_per_batch (int or None, default: None) – Number of negative samples in a batch. If None, it is set to batch_size.

  • sampler (str, default: 'uniform') – Negative sampling strategy. 'uniform' uses a uniform sampler; any other value uses log_uniform_candidate_sampler in TensorFlow. In recommendation scenarios the uniform sampler is generally preferred.

  • recent_num (int or None, default: 10) – Number of recent items to use in user behavior sequence.

  • random_num (int or None, default: None) – Number of random sampled items to use in user behavior sequence. If recent_num is not None, random_num is not considered.

  • multi_sparse_combiner ({'normal', 'mean', 'sum', 'sqrtn'}, default: 'sqrtn') – Options for combining multi_sparse features.

  • seed (int, default: 42) – Random seed.

  • lower_upper_bound (tuple or None, default: None) – Lower and upper score bound for rating task.

  • tf_sess_config (dict or None, default: None) – Optional TensorFlow session config, see ConfigProto options.

References

Paul Covington et al. Deep Neural Networks for YouTube Recommendations.
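Conceptually, retrieval reduces to a nearest-neighbor search over item embeddings once the user tower has produced a user vector. A minimal NumPy sketch of this idea (all names, shapes, and the single-layer tower are illustrative assumptions, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, embed_size = 100, 16

# Illustrative item embedding table (the output softmax weights in the paper).
item_embeds = rng.normal(size=(n_items, embed_size))

# Toy "user tower": average the embeddings of recently watched items,
# then pass through one dense layer with ReLU (the paper stacks several).
recent_items = [3, 17, 42]
w = rng.normal(size=(embed_size, embed_size))
hidden = np.maximum(item_embeds[recent_items].mean(axis=0) @ w, 0.0)
user_vec = hidden / (np.linalg.norm(hidden) + 1e-12)  # norm_embed=True analogue

# Softmax logits over all items score P(watch item | user); for retrieval
# only the top-k logits matter, so nearest-neighbor search over item_embeds suffices.
logits = item_embeds @ user_vec
top_k = np.argsort(-logits)[:10]
print(top_k)
```

At training time the full softmax is replaced by sampled_softmax or NCE over a few sampled negatives per batch (num_sampled_per_batch), which is what makes this objective tractable for large item catalogs.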

dyn_user_embedding(user, user_feats=None, seq=None, include_bias=False, inner_id=False)[source]#

Generate user embedding based on given user features or item sequence.

New in version 1.2.0.

Parameters:
  • user (int or str) – Query user id. Must be a single user.

  • user_feats (dict or None, default: None) – Extra user features for recommendation.

  • seq (list or numpy.ndarray or None, default: None) – Extra item sequence for recommendation. If the sequence length is larger than recent_num hyperparameter specified in the model, it will be truncated. If smaller, it will be padded.

  • include_bias (bool, default: False) – Whether to include bias term in returned embeddings. Note some models such as SVD, BPR etc., use bias term in model inference.

  • inner_id (bool, default: False) – Whether to use inner_id defined in libreco. For library users inner_id may never be used.

Returns:

user_embedding – Generated dynamic user embeddings.

Return type:

numpy.ndarray

Raises:
  • ValueError – If user is not a single user.

  • ValueError – If seq is provided but the model doesn’t support sequence recommendation.
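The truncation/padding behavior described for seq can be sketched in plain Python (the padding value and the padding side are assumptions for illustration, not necessarily what the library uses internally):

```python
def fit_seq(seq, recent_num, pad_id=0):
    """Keep the most recent `recent_num` items, or left-pad with `pad_id`."""
    if len(seq) >= recent_num:
        return list(seq[-recent_num:])  # truncate, keeping the most recent items
    return [pad_id] * (recent_num - len(seq)) + list(seq)  # pad to fixed length

print(fit_seq([1, 2, 3, 4, 5], recent_num=3))  # [3, 4, 5]
print(fit_seq([7, 8], recent_num=4))           # [0, 0, 7, 8]
```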

convert_array_id(user, inner_id)#

Convert a single user to inner user id.

If the user doesn’t exist, it will be converted to padding id. The return type should be array_like for further shape compatibility.
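A hypothetical illustration of this padding-id fallback (the mapping and padding scheme are assumed for the sketch, not the library's actual internals):

```python
import numpy as np

# Assumed id mapping built at training time; unknown users fall back to a
# padding id, and the result is wrapped in an array for shape compatibility.
user_to_inner = {"alice": 0, "bob": 1}
padding_id = len(user_to_inner)  # inner ids run 0..n_users-1; pad is n_users

def convert_array_id(user):
    inner = user_to_inner.get(user, padding_id)
    return np.asarray([inner])

print(convert_array_id("alice"))  # [0]
print(convert_array_id("carol"))  # [2]  (unknown user -> padding id)
```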

fit(train_data, neg_sampling, verbose=1, shuffle=True, eval_data=None, metrics=None, k=10, eval_batch_size=8192, eval_user_num=None, num_workers=0)#

Fit embed model on the training data.

Parameters:
  • train_data (TransformedSet object) – Data object used for training.

  • neg_sampling (bool) –

    Whether to perform negative sampling for training or evaluating data.

    New in version 1.1.0.

    Note

    Negative sampling is needed if your data is implicit (i.e., the task is ranking) and ONLY contains positive labels. Otherwise, it should be False.

  • verbose (int, default: 1) –

    Print verbosity.

    • verbose <= 0: Print nothing.

    • verbose == 1: Print progress bar and training time.

    • verbose > 1: Print evaluation metrics if eval_data is provided.

  • shuffle (bool, default: True) – Whether to shuffle the training data.

  • eval_data (TransformedSet object, default: None) – Data object used for evaluating.

  • metrics (list or None, default: None) – List of metrics for evaluating.

  • k (int, default: 10) – Parameter of metrics, e.g., recall at k, ndcg at k.

  • eval_batch_size (int, default: 8192) – Batch size for evaluating.

  • eval_user_num (int or None, default: None) – Number of users for evaluating. Setting it to a positive number will sample users randomly from eval data.

  • num_workers (int, default: 0) –

    How many subprocesses to use for training data loading. 0 means that the data will be loaded in the main process, which is slower than multiprocessing.

    New in version 1.1.0.

    Caution

    Using multiprocessing (num_workers > 0) may consume more memory than single-process loading. See Multi-process data loading.

get_item_embedding(item=None, include_bias=False)#

Get item embedding(s) from the model.

Parameters:
  • item (int or str or None, default: None) – Query item id. If it is None, all item embeddings will be returned.

  • include_bias (bool, default: False) – Whether to include bias term in returned embeddings.

Returns:

item_embedding – Returned item embeddings.

Return type:

numpy.ndarray

Raises:
  • ValueError – If the item does not appear in the training data.

  • AssertionError – If the model has not been trained.

get_user_embedding(user=None, include_bias=False)#

Get user embedding(s) from the model.

Parameters:
  • user (int or str or None, default: None) – Query user id. If it is None, all user embeddings will be returned.

  • include_bias (bool, default: False) – Whether to include bias term in returned embeddings.

Returns:

user_embedding – Returned user embeddings.

Return type:

numpy.ndarray

Raises:
  • ValueError – If the user does not appear in the training data.

  • AssertionError – If the model has not been trained.

init_knn(approximate, sim_type, M=100, ef_construction=200, ef_search=200)#

Initialize k-nearest-search model.

Parameters:
  • approximate (bool) – Whether to use approximate nearest neighbor search. If it is True, nmslib must be installed. The HNSW method in nmslib is used.

  • sim_type ({'cosine', 'inner-product'}) – Similarity space type.

  • M (int, default: 100) – Parameter in HNSW, refer to nmslib doc.

  • ef_construction (int, default: 200) –

    Parameter in HNSW, refer to nmslib doc.

  • ef_search (int, default: 200) –

    Parameter in HNSW, refer to nmslib doc.

Raises:
  • ValueError – If sim_type is not one of (‘cosine’, ‘inner-product’).

  • ModuleNotFoundError – If approximate=True and nmslib is not installed.
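The two sim_type spaces can be contrasted with a brute-force NumPy search over toy embeddings; nmslib's HNSW index approximates exactly this ranking at scale (illustrative data only):

```python
import numpy as np

rng = np.random.default_rng(0)
embeds = rng.normal(size=(50, 8))  # toy item embeddings
query = embeds[0]

# 'inner-product': raw dot products; favors large-norm vectors.
ip_scores = embeds @ query

# 'cosine': dot products of l2-normalized vectors; norm-invariant,
# so the query itself always ranks first.
normed = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
cos_scores = normed @ (query / np.linalg.norm(query))

k = 5
print(np.argsort(-ip_scores)[:k])
print(np.argsort(-cos_scores)[:k])
```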

classmethod load(path, model_name, data_info, **kwargs)#

Load saved embed model for inference.

Parameters:
  • path (str) – File folder path to save model.

  • model_name (str) – Name of the saved model file.

  • data_info (DataInfo object) – Object that contains some useful information.

Returns:

model – Loaded embed model.

Return type:

type(cls)

See also

save

predict(user, item, cold_start='average', inner_id=False)#

Make prediction(s) on given user(s) and item(s).

Parameters:
  • user (int or str or array_like) – User id or batch of user ids.

  • item (int or str or array_like) – Item id or batch of item ids.

  • cold_start ({'popular', 'average'}, default: 'average') –

    Cold start strategy.

    • 'popular' will sample from popular items.

    • 'average' will use the average of all the user/item embeddings as the representation of the cold-start user/item.

  • inner_id (bool, default: False) – Whether to use inner_id defined in libreco. For library users inner_id may never be used.

Returns:

prediction – Predicted scores for each user-item pair.

Return type:

float or numpy.ndarray

rebuild_model(path, model_name, full_assign=True)#

Assign the saved model variables to the newly initialized model.

This method is used before retraining the new model, in order to avoid training from scratch every time we get some new data.

Parameters:
  • path (str) – File folder path for the saved model variables.

  • model_name (str) – Name of the saved model file.

  • full_assign (bool, default: True) – Whether to also restore the variables of Adam optimizer.

recommend_user(user, n_rec, user_feats=None, seq=None, cold_start='average', inner_id=False, filter_consumed=True, random_rec=False)#

Recommend a list of items for given user(s).

If both user_feats and seq are None, the model will use the precomputed embeddings for recommendation, and the cold_start strategy will be used for unknown users.

If either user_feats or seq is provided, the model will generate user embedding dynamically for recommendation. In this case, if the user is unknown, it will be set to padding id, which means the cold_start strategy will not be applied. This situation is common when one wants to recommend for an unknown user based on user features or behavior sequence.

Parameters:
  • user (int or str or array_like) – User id or batch of user ids to recommend.

  • n_rec (int) – Number of recommendations to return.

  • user_feats (dict or None, default: None) –

    Extra user features for recommendation.

    New in version 1.2.0.

  • seq (list or numpy.ndarray or None, default: None) –

    Extra item sequence for recommendation. If the sequence length is larger than recent_num hyperparameter specified in the model, it will be truncated. If smaller, it will be padded.

    New in version 1.1.0.

  • cold_start ({'popular', 'average'}, default: 'average') –

    Cold start strategy.

    • 'popular' will sample from popular items.

    • 'average' will use the average of all the user/item embeddings as the representation of the cold-start user/item.

  • inner_id (bool, default: False) – Whether to use inner_id defined in libreco. For library users inner_id may never be used.

  • filter_consumed (bool, default: True) – Whether to filter out items that a user has previously consumed.

  • random_rec (bool, default: False) – Whether to sample recommended items randomly, weighted by their prediction scores, instead of always returning the top-n items.

Returns:

recommendation – Recommendation result with user ids as keys and array_like recommended items as values.

Return type:

dict of {Union[int, str, array_like] : numpy.ndarray}
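Once user and item embeddings exist, the top-n step reduces to a dot-product ranking with consumed items filtered out. A minimal NumPy sketch (names and the brute-force ranking are illustrative, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, embed_size = 20, 8
item_embeds = rng.normal(size=(n_items, embed_size))
user_embed = rng.normal(size=embed_size)
consumed = {2, 5, 11}  # items this user already interacted with

scores = item_embeds @ user_embed
ranked = np.argsort(-scores)  # best-first item ids

# filter_consumed=True analogue: skip already-consumed items, keep top 5
recs = [int(i) for i in ranked if int(i) not in consumed][:5]
print(recs)
```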

save(path, model_name, inference_only=False, **_)#

Save embed model for inference or retraining.

Parameters:
  • path (str) – File folder path to save model.

  • model_name (str) – Name of the saved model file.

  • inference_only (bool, default: False) – Whether to save model only for inference. If it is True, only embeddings will be saved. Otherwise, model variables will be saved.

See also

load

search_knn_items(item, k)#

Search most similar k items.

Parameters:
  • item (int or str) – Query item id.

  • k (int) – Number of similar items.

Returns:

similar items – A list of k similar items.

Return type:

list

search_knn_users(user, k)#

Search most similar k users.

Parameters:
  • user (int or str) – Query user id.

  • k (int) – Number of similar users.

Returns:

similar users – A list of k similar users.

Return type:

list