Comedy Rating Prediction Using Content-Based Filtering
I was going through my datasets and came across one from a competition organized by Data Science Nigeria in 2018. It contains jokes and the ratings viewers gave them. The task is to predict the ratings viewers will give to new jokes based on their previous choices and habits.
The general solution to this kind of problem is called Collaborative Filtering: it looks at what products the current user has rated, finds other users that have rated similar products, and then recommends other products that those users have rated.
But in our case, which is Content-Based Filtering, instead of recommending other new products, we predict the ratings the same users will give to new products that are similar to their previous ones.
So, for example, viewer A1 may have watched jokes by the Nigerian comedian Alibaba; he may also have watched jokes made in Lagos. The model may not know these particular properties of the jokes viewer A1 has watched, but it will be able to see other jokes that are similar to viewer A1’s choices and predict the rating viewer A1 will give to those jokes. In other words, to use this approach, we don’t necessarily need to know anything about the jokes, except who likes to watch them.
We never told our model that viewer A1 likes jokes by Alibaba or jokes made in Lagos, and yet there are latent factors underlying concepts such as “jokes by Alibaba” and “jokes made in Lagos”, and these concepts turn out to be relevant for jokes similar to the ones viewer A1 has watched.
So we will start by getting the dataset for this problem.
import fastbook
fastbook.setup_book()

from fastai.collab import *
from fastai.tabular.all import *

base_path = "/content/gdrive/MyDrive/kaggle/joke_recommend/"
train = pd.read_csv(base_path + 'train.csv')
test = pd.read_csv(base_path + 'test.csv')
The dataset is relatively large, with over 600,000 rows in the train set, over 40,000 unique viewers_id values, and 127 joke_identifier values.
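As a quick sanity check (assuming the column names viewers_id, joke_identifier and Rating mentioned above; adjust if your copy of the files differs), you can inspect the cardinalities directly:
print(train.shape)                          # number of rows and columns
print(train['viewers_id'].nunique())        # distinct viewers
print(train['joke_identifier'].nunique())   # distinct jokes
print(train['Rating'].describe())           # rating distribution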
So, I used a large batch size: 2¹⁷ (131072). There are trade-offs with large batch sizes: there are fewer updates per epoch, but each update is averaged over more data and so is a better global step, in contrast to a smaller batch size, which gives many noisy, less effective updates.
CollabDataLoaders is a data loader class from the fastai library, and we load our data into it. By default, it takes the first column for the user (viewers_id), the second for the item (joke_identifier), and the third for the rating. In our case, the third column is Response_ID, so we pass rating_name='Rating' to use the ‘Rating’ column instead.
dls = CollabDataLoaders.from_df(train, rating_name='Rating', bs=131072)
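To confirm the loaders picked up the right columns, it helps to peek at a batch:
dls.show_batch()  # displays a few (viewers_id, joke_identifier, Rating) samples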
Learning The Latent Factors
First, we randomly initialize some parameters. These parameters will be a set of latent factors for each viewer and each joke. We decided to use n_factors = 1000, so each viewer gets a set of 1000 factors and each joke gets a set of 1000 factors.
Secondly, we calculate our prediction from the match between each joke and each viewer: the dot product of their latent factors.
And thirdly, we calculate our loss, tracking Root Mean Square Error (RMSE) as our metric, since it is a reasonable way to measure how far the predicted ratings are from the actual ones.
With this in place, we can optimize our latent factors using Stochastic Gradient Descent (SGD) to minimize the loss. At each step, the SGD optimizer calculates the predictions (from the match between the latent factors of each joke and each viewer) and compares them to the actual rating each viewer gave to each joke. It then calculates the gradient of the loss and steps the weights in the opposite direction, scaled by the learning rate. After doing this many times, the loss gets better and better, and so do the predictions.
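As a minimal illustration of these three steps, here is a toy sketch in plain PyTorch (not fastai’s internal implementation, and with made-up sizes):
import torch

n_viewers, n_jokes, n_factors = 5, 7, 4                        # toy sizes for the sketch
viewer_factors = torch.randn(n_viewers, n_factors, requires_grad=True)
joke_factors = torch.randn(n_jokes, n_factors, requires_grad=True)

v, j, actual = 0, 3, torch.tensor(4.0)                         # one (viewer, joke, rating) example
pred = (viewer_factors[v] * joke_factors[j]).sum()             # prediction = dot product of latent factors
loss = (pred - actual) ** 2                                    # squared error for this example

loss.backward()                                                # gradients of the loss w.r.t. the factors
lr = 5e-3
with torch.no_grad():                                          # one SGD step on the latent factors
    viewer_factors -= lr * viewer_factors.grad
    joke_factors -= lr * joke_factors.grad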
The data loaders generated from our data, dls, will be passed into the collaborative filtering learner from fastai, collab_learner. To make this model better, we force the predicted ratings to lie between 0 and 5. From experience, it is better to set the upper end of the range a little above the maximum rating, so we use (0, 5.5).
learn = collab_learner(dls, n_factors=1000, y_range=(0, 5.5), metrics=rmse)
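Under the hood, y_range works by squashing the raw prediction through a scaled sigmoid; fastai’s sigmoid_range does roughly the equivalent of:
import torch

def sigmoid_range(x, lo, hi):
    # squash x into the open interval (lo, hi) -- the same trick y_range relies on
    return torch.sigmoid(x) * (hi - lo) + lo

raw = torch.tensor([-3.0, 0.0, 3.0])
print(sigmoid_range(raw, 0, 5.5))  # every value lands strictly between 0 and 5.5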
We are now ready to fit our model. We used a large number of epochs: 100. There are two functions for fitting the model: fit_one_cycle for training a model from scratch, and fine_tune for a pretrained model. We found that fine_tune also works for our dataset (fastai lets us use fine_tune even for a non-pretrained model). You can experiment and choose whichever works well for you.
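If you are unsure which learning rate to pass, fastai’s learning-rate finder is a quick way to pick one before fitting (we settled on 5e-3 below):
learn.lr_find()  # plots loss against learning rate and suggests a value to try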
Weight Decay
To prevent overfitting, we used weight decay. Weight decay, or L2 regularization, consists of adding the sum of all the weights squared to the loss function. Although it slows down the training of the model, it helps the model generalize better by keeping the weights as small as possible. We used weight decay by passing wd=0.01 in our call to fine_tune.
learn.fine_tune(100, 5e-3, wd=0.01)
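As a rough illustration of what that L2 penalty amounts to (fastai applies it for us when we pass wd; the tensors below are stand-ins, not the real model):
import torch

wd = 0.01
weights = torch.randn(10, requires_grad=True)         # stand-in for the model's parameters
base_loss = torch.tensor(1.0)                         # stand-in for the unpenalized loss
loss_with_wd = base_loss + wd * (weights ** 2).sum()  # sum of squared weights added to the loss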
Fitting the model gave us an RMSE of ~2.312712, and learn.show_results() shows that the predicted ratings are close to the actual ones.
We created our test dataloader, similar to our train dataloader, with:
test_dl = learn.dls.test_dl(test)
And made our predictions of the ratings the same viewers will give to the new jokes with:
preds = learn.get_preds(dl=test_dl)
rating_test = pd.DataFrame(preds[0], columns=['Rating'])
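An optional quick check that the predictions respect the y_range we set:
print(rating_test['Rating'].min(), rating_test['Rating'].max())  # should fall within (0, 5.5)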
The predicted ratings and the Response_IDs were merged to make a new dataframe for submission.
submission = pd.DataFrame(test['Response_ID']).join(rating_test)
submission.to_csv('submission.csv', index=False)
The RMSE (2.312712) is better than the score we previously submitted for the competition, which was 2.69805, courtesy of the robust fast.ai library we used here. Shout out to Jeremy Howard and the team.
Conclusion
This is a brief technical write-up on content-based filtering. We did not go into all the details; the aim was to explain it in the context of a real competition problem. The full code can be found here.