Tutorial#
This tutorial walks you through the complete process of training a model in LibRecommender, i.e. data processing -> feature engineering -> training -> evaluation -> saving/loading -> retraining. We will use Wide & Deep as the example algorithm.
First make sure LibRecommender is installed.
$ pip install LibRecommender
Serving
For how to deploy a trained model in LibRecommender, see Serving Guide.
TensorFlow1 issue
If you encounter errors like
Variables already exist, disallowed...
in this tutorial, just call
tf.compat.v1.reset_default_graph()
first. It’s one of the inconveniences of TensorFlow 1.
Load Data#
In this tutorial we will use the MovieLens 1M dataset. The following code loads the data into pandas.DataFrame format. If the data does not exist locally, it will be downloaded first.
import random
import warnings
import zipfile
from pathlib import Path
from urllib.request import urlretrieve
import pandas as pd
import tensorflow as tf
import tqdm
warnings.filterwarnings("ignore")
def split_genre(line):
    genres = line.split("|")
    if len(genres) == 3:
        return genres[0], genres[1], genres[2]
    elif len(genres) == 2:
        return genres[0], genres[1], "missing"
    elif len(genres) == 1:
        return genres[0], "missing", "missing"
    else:
        return "missing", "missing", "missing"
def load_ml_1m():
    download_path = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"
    original_file = "ml-1m.zip"
    cur_path = Path(".").absolute()
    if not Path(original_file).exists():
        print("Data does not exist, start downloading...")
        with tqdm.tqdm(unit="B", unit_scale=True) as p:
            def report(chunk, chunksize, total):
                p.total = total
                p.update(chunksize)

            urlretrieve(download_path, original_file, reporthook=report)
        print("Download successful!")
    # extract zip file
    with zipfile.ZipFile(original_file, "r") as f:
        f.extractall(cur_path)
    # read and merge data into the same table
    ratings = pd.read_csv(
        cur_path / "ml-1m" / "ratings.dat",
        sep="::",
        usecols=[0, 1, 2, 3],
        names=["user", "item", "rating", "time"],
    )
    users = pd.read_csv(
        cur_path / "ml-1m" / "users.dat",
        sep="::",
        usecols=[0, 1, 2, 3],
        names=["user", "sex", "age", "occupation"],
    )
    items = pd.read_csv(
        cur_path / "ml-1m" / "movies.dat",
        sep="::",
        usecols=[0, 2],
        names=["item", "genre"],
        encoding="iso-8859-1",
    )
    # split the "|"-separated genre string into three separate columns
    items["genre1"], items["genre2"], items["genre3"] = zip(*items["genre"].apply(split_genre))
    items.drop("genre", axis=1, inplace=True)
    data = ratings.merge(users, on="user").merge(items, on="item")
    data.rename(columns={"rating": "label"}, inplace=True)
    return data
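Note how split_genre pads movies that have fewer than three genres with "missing":
>>> split_genre("Action|Adventure|Sci-Fi")
('Action', 'Adventure', 'Sci-Fi')
>>> split_genre("Comedy")
('Comedy', 'missing', 'missing')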
>>> data = load_ml_1m()
>>> data.shape
(1000209, 10)
>>> data.iloc[random.choices(range(len(data)), k=10)] # randomly select 10 rows
|        | user | item | label | time       | sex | age | occupation | genre1   | genre2    | genre3  |
|--------|------|------|-------|------------|-----|-----|------------|----------|-----------|---------|
| 951319 | 4913 | 3538 | 3     | 962677962  | F   | 25  | 1          | Comedy   | missing   | missing |
| 969300 | 3246 | 2977 | 5     | 968309625  | F   | 35  | 1          | Comedy   | Drama     | missing |
| 914441 | 1181 | 3015 | 2     | 976142934  | M   | 35  | 7          | Thriller | missing   | missing |
| 905593 | 2063 | 695  | 2     | 974665086  | M   | 25  | 4          | Mystery  | Thriller  | missing |
| 512570 | 4867 | 1200 | 4     | 962817971  | M   | 25  | 16         | missing  | missing   | missing |
| 524227 | 4684 | 3174 | 2     | 963667810  | F   | 25  | 0          | Comedy   | Drama     | missing |
| 801408 | 3792 | 1224 | 4     | 966360592  | M   | 25  | 6          | Drama    | War       | missing |
| 117662 | 2270 | 480  | 5     | 974574449  | M   | 18  | 1          | Action   | Adventure | Sci-Fi  |
| 935170 | 1088 | 3825 | 1     | 1037975844 | F   | 1   | 10         | Drama    | missing   | missing |
| 309994 | 4808 | 3051 | 3     | 962934115  | M   | 35  | 0          | Drama    | missing   | missing |
Now we have about 1 million rows of data. In order to perform evaluation after training, we first need to split the data into train, eval and test sets. In this tutorial we will simply use random_split(). For other ways of splitting data, see Data Processing.
Note
For now, we will only use the first half of the data for training. Later we will use the rest of the data to retrain the model.
Process Data & Features#
>>> from libreco.data import random_split
# split data into three folds for training, evaluating and testing
>>> first_half_data = data[: (len(data) // 2)]
>>> train_data, eval_data, test_data = random_split(first_half_data, multi_ratios=[0.8, 0.1, 0.1], seed=42)
>>> print("first half data shape:", first_half_data.shape)
first half data shape: (500104, 10)
The data contains some categorical features such as “sex” and “genre”, as well as a numerical feature “age”. In LibRecommender we use sparse_col to represent categorical features and dense_col to represent numerical features. So one should specify the column information and then use DatasetFeat.build_* functions to process the data.
>>> from libreco.data import DatasetFeat
>>> sparse_col = ["sex", "occupation", "genre1", "genre2", "genre3"]
>>> dense_col = ["age"]
>>> user_col = ["sex", "age", "occupation"]
>>> item_col = ["genre1", "genre2", "genre3"]
>>> train_data, data_info = DatasetFeat.build_trainset(train_data, user_col, item_col, sparse_col, dense_col)
>>> eval_data = DatasetFeat.build_evalset(eval_data)
>>> test_data = DatasetFeat.build_testset(test_data)
user_col means features that belong to the user, and item_col means features that belong to the item. Note that the column counts should match, i.e. len(sparse_col) + len(dense_col) == len(user_col) + len(item_col).
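With the columns defined above, this is easy to verify:
>>> assert len(sparse_col) + len(dense_col) == len(user_col) + len(item_col)  # 5 + 1 == 3 + 3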
>>> print(data_info)
n_users: 6040, n_items: 3576, data density: 1.8523 %
In this example we treat all the samples in the data as positive samples and perform negative sampling. This is a standard procedure for “implicit data”.
# sample negative items for each record
>>> train_data.build_negative_samples(data_info)
>>> eval_data.build_negative_samples(data_info)
>>> test_data.build_negative_samples(data_info)
Training the Model#
Now with all the data and features prepared, we can start training the model!
As its name suggests, the Wide & Deep algorithm has wide and deep parts, and they use different optimizers, so we should specify the learning rates separately using a dict: {"wide": 0.01, "deep": 3e-4}. For other model hyper-parameters, see the API reference of WideDeep.
from libreco.algorithms import WideDeep
model = WideDeep(
    task="ranking",
    data_info=data_info,
    embed_size=16,
    n_epochs=2,
    loss_type="cross_entropy",
    lr={"wide": 0.01, "deep": 3e-4},
    batch_size=2048,
    use_bn=True,
    hidden_units=(128, 64, 32),
)
model.fit(
    train_data,
    verbose=2,
    shuffle=True,
    eval_data=eval_data,
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)
Epoch 1 elapsed: 3.053s
train_loss: 0.6778
eval log_loss: 0.5676
eval roc_auc: 0.8005
eval precision@10: 0.0277
eval recall@10: 0.0409
eval ndcg@10: 0.1119
Epoch 2 elapsed: 3.008s
train_loss: 0.4994
eval log_loss: 0.4928
eval roc_auc: 0.8373
eval precision@10: 0.0321
eval recall@10: 0.0506
eval ndcg@10: 0.1384
We’ve trained the model for 2 epochs and evaluated the performance on the eval data during training. Next we can evaluate on the independent test data.
>>> from libreco.evaluation import evaluate
>>> evaluate(model=model, data=test_data, metrics=["loss", "roc_auc", "precision", "recall", "ndcg"])
{'loss': 0.49392982253743395,
'roc_auc': 0.8364561294428758,
'precision': 0.03056640625,
'recall': 0.05029253291880213,
'ndcg': 0.12794099009836263}
Make Recommendation#
The recommend part is pretty straightforward. You can make recommendations for one user or a batch of users.
>>> model.recommend_user(user=1, n_rec=3)
{1: array([ 260, 2858, 1210])}
>>> model.recommend_user(user=[1, 2, 3], n_rec=3)
{1: array([ 260, 2858, 1210]),
2: array([527, 356, 480]),
3: array([ 589, 2571, 1240])}
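Besides top-N recommendation, you can also score a single user-item pair with predict(). The item id 2333 below is just an arbitrary example; for the "ranking" task the returned score is a probability:
>>> model.predict(user=1, item=2333)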
Save, Load and Inference#
When saving the model, we should also save the data_info for feature information.
>>> data_info.save("model_path", model_name="wide_deep")
>>> model.save("model_path", model_name="wide_deep")
Then we can load the model and make recommendations again.
>>> tf.compat.v1.reset_default_graph() # need to reset graph in TensorFlow1
>>> from libreco.data import DataInfo
>>> loaded_data_info = DataInfo.load("model_path", model_name="wide_deep")
>>> loaded_model = WideDeep.load("model_path", model_name="wide_deep", data_info=loaded_data_info)
>>> loaded_model.recommend_user(user=1, n_rec=3)
Retrain the Model with New Data#
Remember that we split the original MovieLens 1M data into two parts in the first place? We will treat the second half of the data as our new data and retrain the saved model with it. In real-world recommender systems, data may be generated every day, so it is inefficient to train the model from scratch every time we get some new data.
>>> second_half_data = data[(len(data) // 2) :]
>>> train_data, eval_data = random_split(second_half_data, multi_ratios=[0.8, 0.2])
>>> print("second half data shape:", second_half_data.shape)
second half data shape: (500105, 10)
The data processing is similar, except that we should use merge_trainset() and merge_evalset() in DatasetFeat. The purpose of these functions is to combine information from the old data with that from the new data, especially for possible new users/items in the new data. For more details, see Model Retrain.
>>> train_data = DatasetFeat.merge_trainset(train_data, loaded_data_info, merge_behavior=True) # use loaded_data_info
>>> eval_data = DatasetFeat.merge_evalset(eval_data, loaded_data_info)
>>> train_data.build_negative_samples(loaded_data_info, seed=2022) # use loaded_data_info
>>> eval_data.build_negative_samples(loaded_data_info, seed=2222)
Then we construct a new model and call the rebuild_model() method to assign the old trained variables into the new model.
>>> tf.compat.v1.reset_default_graph() # need to reset graph in TensorFlow1
new_model = WideDeep(
    task="ranking",
    data_info=loaded_data_info,  # pass loaded_data_info
    embed_size=16,
    n_epochs=2,
    loss_type="cross_entropy",
    lr={"wide": 0.01, "deep": 3e-4},
    batch_size=2048,
    use_bn=True,
    hidden_units=(128, 64, 32),
)
new_model.rebuild_model(path="model_path", model_name="wide_deep", full_assign=True)
Finally, the training and recommendation parts are the same as before.
new_model.fit(
    train_data,
    verbose=2,
    shuffle=True,
    eval_data=eval_data,
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)
Epoch 1 elapsed: 2.955s
train_loss: 0.4604
eval log_loss: 0.4497
eval roc_auc: 0.8678
eval precision@10: 0.1015
eval recall@10: 0.0715
eval ndcg@10: 0.3106
Epoch 2 elapsed: 2.657s
train_loss: 0.4332
eval log_loss: 0.4371
eval roc_auc: 0.8760
eval precision@10: 0.1043
eval recall@10: 0.0740
eval ndcg@10: 0.3189
>>> new_model.recommend_user(user=1, n_rec=3)
{1: array([2858, 1259, 3175])}
>>> new_model.recommend_user(user=[1, 2, 3], n_rec=3)
{1: array([2858, 1259, 3175]),
2: array([1222, 1240, 858]),
3: array([2858, 1580, 589])}
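Since the new data may contain users that were never seen during the first training, recommend_user also accepts a cold_start strategy. Treat the snippet below as a sketch: the cold_start parameter exists in recent LibRecommender versions, and user id 6041 is a hypothetical new user, not one from an actual run:
>>> new_model.recommend_user(user=6041, n_rec=3, cold_start="popular")  # fall back to popular items for unknown users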
This completes our tutorial!
Where to go from here
For more examples, see the examples/ folder on GitHub.
For more usage details, please head to the User Guide.
For serving a trained model, please head to Python Serving Guide.