Movie Recommender¶
Being a movie fanatic, I am always looking for new films that will blow my mind. The problem is that I always lack ideas or lose myself in the plethora of movies offered on streaming platforms.
My go-to website for films is the French https://www.senscritique.com/, the reference site for movie enthusiasts. There, I have already rated about 1000 titles. This gave me an idea: can I use my ratings to build a custom movie recommender?
So, let's see if I can set up a movie recommender engine.
The idea:
- take my senscritique.com movie ratings
- take the movies dataset from https://www.kaggle.com/rounakbanik/movie-recommender-systems
- set up similarity filtering
  - cosine similarity of movie descriptions
- set up collaborative filtering
  - find users with similar taste in movies
- set up hybrid filtering
  - retrieve similar movies that I am likely to like
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
import re
#from nltk.stem.wordnet import WordNetLemmatizer
#from nltk.corpus import wordnet
#from surprise import Reader, Dataset, SVD, evaluate
Get my movie ratings from senscritique¶
First, I am going to scrape my ratings from senscritique.com.
For that, requests and BeautifulSoup are always the way to go.
import requests
from bs4 import BeautifulSoup
import time
cURL = """curl ..."""  # cURL command copied from the browser's network tab (redacted here)
import uncurl
import json
# uncurl turns the copied cURL command into its url, JSON payload and headers
context = uncurl.parse_context(cURL)
s = requests.session()
# replay the same request the site makes, re-using its headers and payload
response = s.post(context.url, json=json.loads(context.data[1:]), headers=context.headers)
from benedict import benedict
# benedict gives keypath access into the nested JSON response
response_json = benedict(response.json()[0])
senscrtq = pd.json_normalize(response_json['data.user.collection.products'])
senscrtq.loc[senscrtq.originalTitle.isnull(),"originalTitle"] = senscrtq.loc[senscrtq.originalTitle.isnull(),"title"]
senscrtq = senscrtq[["originalTitle","dateReleaseOriginal","otherUserInfos.rating"]]
senscrtq.dateReleaseOriginal = pd.to_datetime(senscrtq.dateReleaseOriginal)
senscrtq
originalTitle | dateReleaseOriginal | otherUserInfos.rating | |
---|---|---|---|
0 | Barbarian | 2022-08-31 | 5 |
1 | Eternals | 2021-11-05 | 5 |
2 | The Many Saints of Newark | 2021-10-01 | 5 |
3 | The Banshees of Inisherin | 2022-10-21 | 7 |
4 | The Menu | 2022-11-18 | 3 |
... | ... | ... | ... |
1267 | Shutter Island | 2010-02-19 | 8 |
1268 | Forrest Gump | 1994-07-06 | 10 |
1269 | Django Unchained | 2012-12-25 | 8 |
1270 | Avatar | 2009-12-16 | 8 |
1271 | Inglourious Basterds | 2009-08-20 | 7 |
1272 rows × 3 columns
Out of curiosity, let's see my favorite decade.
senscrtq["decade"] = ((senscrtq.dateReleaseOriginal.dt.year//10)*10).astype("Int64")
sns.boxplot(x="decade", y='otherUserInfos.rating', data=senscrtq.dropna(), showfliers=False)
<AxesSubplot:xlabel='decade', ylabel='otherUserInfos.rating'>
The 70s, golden age of independent cinema!
senscrtq["year"] = senscrtq.dateReleaseOriginal.dt.year
senscrtq.rename(columns={"originalTitle":"original_title","otherUserInfos.rating":"note"},inplace=True)
senscrtq.to_csv('my_note3.csv',index=False)
Load Data¶
my_notes = pd.read_csv('my_note3.csv')
my_notes['year'] = my_notes['year'].astype(float)
my_notes
original_title | dateReleaseOriginal | note | decade | year | |
---|---|---|---|---|---|
0 | Barbarian | 2022-08-31 | 5 | 2020.0 | 2022.0 |
1 | Eternals | 2021-11-05 | 5 | 2020.0 | 2021.0 |
2 | The Many Saints of Newark | 2021-10-01 | 5 | 2020.0 | 2021.0 |
3 | The Banshees of Inisherin | 2022-10-21 | 7 | 2020.0 | 2022.0 |
4 | The Menu | 2022-11-18 | 3 | 2020.0 | 2022.0 |
... | ... | ... | ... | ... | ... |
1267 | Shutter Island | 2010-02-19 | 8 | 2010.0 | 2010.0 |
1268 | Forrest Gump | 1994-07-06 | 10 | 1990.0 | 1994.0 |
1269 | Django Unchained | 2012-12-25 | 8 | 2010.0 | 2012.0 |
1270 | Avatar | 2009-12-16 | 8 | 2000.0 | 2009.0 |
1271 | Inglourious Basterds | 2009-08-20 | 7 | 2000.0 | 2009.0 |
1272 rows × 5 columns
I also need to load data from other users. For that, the dataset provided by Rounak Banik on his Kaggle project page is extremely useful.
Some cleaning and additions are necessary though:
- parse dates into the right format
- retrieve the year of release
- retrieve all film genres as a list
movies = pd.read_csv('movies_metadata.csv')
movies['release_date'] = pd.to_datetime(movies.release_date,infer_datetime_format=True,errors='coerce')
movies['year'] = movies.release_date.dt.year
movies['genres'] = movies['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
movies.head()
C:\Users\coren\AppData\Local\Temp\ipykernel_15060\2263965596.py:1: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False. movies = pd.read_csv('movies_metadata.csv')
adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 373554033.0 | 81.0 | Released | NaN | Toy Story | False | 7.7 | 5415.0 | 1995.0 | ||
1 | False | NaN | 65000000 | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 | 1995.0 | |
2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 0.0 | 101.0 | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 | 1995.0 | ||
3 | False | NaN | 16000000 | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 81452156.0 | 127.0 | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 | 1995.0 | ||
4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 76578911.0 | 106.0 | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 | 1995.0 |
5 rows × 25 columns
Following Rounak's method, I will use IMDB's weighted rating formula to construct my chart. Mathematically, it is represented as follows:

$$ WR = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right) $$

where,
- v is the number of votes for the movie
- m is the minimum number of votes required to be listed in the chart
- R is the average rating of the movie
- C is the mean vote across the whole dataset
vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C
5.244896612406511
m = vote_counts.quantile(0.70)
m
25.0
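As a sanity check, plugging the Toy Story numbers from the table above (v = 5415, R = 7.7) into the formula with m = 25 and C ≈ 5.245 gives

$$ WR = \frac{5415}{5440} \times 7.7 + \frac{25}{5440} \times 5.245 \approx 7.689 $$

which matches the wr value computed further down.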
movies = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())][['id','original_title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
movies
id | original_title | year | vote_count | vote_average | popularity | genres | |
---|---|---|---|---|---|---|---|
0 | 862 | Toy Story | 1995.0 | 5415.0 | 7.7 | 21.946943 | |
1 | 8844 | Jumanji | 1995.0 | 2413.0 | 6.9 | 17.015539 | |
2 | 15602 | Grumpier Old Men | 1995.0 | 92.0 | 6.5 | 11.7129 | |
3 | 31357 | Waiting to Exhale | 1995.0 | 34.0 | 6.1 | 3.859495 | |
4 | 11862 | Father of the Bride Part II | 1995.0 | 173.0 | 5.7 | 8.387519 | |
... | ... | ... | ... | ... | ... | ... | ... |
45380 | 432789 | The Incredible Jessica James | 2017.0 | 37.0 | 6.2 | 5.667067 | |
45437 | 455661 | In a Heartbeat | 2017.0 | 146.0 | 8.3 | 20.82178 | |
45441 | 14008 | Cadet Kelly | 2002.0 | 145.0 | 5.2 | 4.392389 | |
45443 | 49279 | L'Homme à la tête de caoutchouc | 1901.0 | 29.0 | 7.6 | 1.618458 | |
45460 | 30840 | Robin Hood | 1991.0 | 26.0 | 5.7 | 5.683753 |
13810 rows × 7 columns
OK, so we went from 45466 movies down to the 13810 most-voted ones. This could actually be a problem for recommendation, as I am always eager to watch niche, underrated movies, which are likely to be removed by this method. Maybe if I can find a more qualitative dataset where movie nerds rate those specific films, it could work. But for now, I will experiment with what I have.
def weighted_rating(x):
"""IMDB weighting formula"""
v = x['vote_count']
R = x['vote_average']
return (v/(v+m) * R) + (m/(m+v) * C)
movies['wr'] = movies.apply(weighted_rating, axis=1)
movies = movies[pd.to_numeric(movies['id'], errors='coerce').notnull()]
movies['id'] = movies.id.astype(int)
When it comes to movies, I tend to trust certain members of the crew. Indeed, the director and the director of photography are often reliable indicators.
Therefore, I will retrieve those crew members from the credits.csv data linked to the movies.
def get_crew(x,job):
"""
Retrieve a crew member from his/her job label
"""
list_ = literal_eval(x)
if isinstance(list_, list):
job_list = [i['name'] for i in list_ if i['job']==job]
if len(job_list)>0:
return job_list[0]
else:
return ""
else:
return ""
credit = pd.read_csv('credits.csv')
credit['Director'] = credit['crew'].fillna('[]').apply(lambda x: get_crew(x,'Director'))
credit['Director of Photography'] = credit['crew'].fillna('[]').apply(lambda x: get_crew(x,'Director of Photography'))
credit
cast | crew | id | Director | Director of Photography | |
---|---|---|---|---|---|
0 | [{'cast_id': 14, 'character': 'Woody | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | 862 | John Lasseter | |
1 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | 8844 | Joe Johnston | Thomas E. Ackerman |
2 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | 15602 | Howard Deutch | |
3 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | 31357 | Forest Whitaker | |
4 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | 11862 | Charles Shyer | Elliot Davis |
... | ... | ... | ... | ... | ... |
45471 | [{'cast_id': 0, 'character': '', 'credit_id': ... | [{'credit_id': '5894a97d925141426c00818c', 'de... | 439050 | Hamid Nematollah | |
45472 | [{'cast_id': 1002, 'character': 'Sister Angela... | [{'credit_id': '52fe4af1c3a36847f81e9b15', 'de... | 111109 | Lav Diaz | |
45473 | [{'cast_id': 6, 'character': 'Emily Shaw', 'cr... | [{'credit_id': '52fe4776c3a368484e0c8387', 'de... | 67758 | Mark L. Lester | João Fernandes |
45474 | [{'cast_id': 2, 'character': '', 'credit_id': ... | [{'credit_id': '533bccebc3a36844cf0011a7', 'de... | 227506 | Yakov Protazanov | |
45475 | [{'credit_id': '593e676c92514105b702e68e', 'de... | 461257 | Daisy Asquith |
45476 rows × 5 columns
Finally, let's get some keywords that will help the matching process.
keywords = pd.read_csv('keywords.csv')
keywords['keywords'] = keywords['keywords'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
keywords
id | keywords | |
---|---|---|
0 | 862 | [jealousy, toy, boy, friendship, friends, riva... |
1 | 8844 | [board game, disappearance, based on children'... |
2 | 15602 | [fishing, best friend, duringcreditsstinger, o... |
3 | 31357 | [based on novel, interracial relationship, sin... |
4 | 11862 | [baby, midlife crisis, confidence, aging, daug... |
... | ... | ... |
46414 | 439050 | |
46415 | 111109 | |
46416 | 67758 | |
46417 | 227506 | |
46418 | 461257 |
46419 rows × 2 columns
movies = movies.merge(credit.merge(keywords, on='id'), on='id')
movies
id | original_title | year | vote_count | vote_average | popularity | genres | wr | cast | crew | Director | Director of Photography | keywords | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 862 | Toy Story | 1995.0 | 5415.0 | 7.7 | 21.946943 | 7.688717 | [{'cast_id': 14, 'character': 'Woody | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | John Lasseter | [jealousy, toy, boy, friendship, friends, riva... | ||
1 | 8844 | Jumanji | 1995.0 | 2413.0 | 6.9 | 17.015539 | 6.883028 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | Joe Johnston | Thomas E. Ackerman | [board game, disappearance, based on children'... | |
2 | 15602 | Grumpier Old Men | 1995.0 | 92.0 | 6.5 | 11.7129 | 6.231816 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | Howard Deutch | [fishing, best friend, duringcreditsstinger, o... | ||
3 | 31357 | Waiting to Exhale | 1995.0 | 34.0 | 6.1 | 3.859495 | 5.737668 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | Forest Whitaker | [based on novel, interracial relationship, sin... | ||
4 | 11862 | Father of the Bride Part II | 1995.0 | 173.0 | 5.7 | 8.387519 | 5.642537 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | Charles Shyer | Elliot Davis | [baby, midlife crisis, confidence, aging, daug... | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14041 | 432789 | The Incredible Jessica James | 2017.0 | 37.0 | 6.2 | 5.667067 | 5.814878 | [{'cast_id': 6, 'character': 'Jessica James', ... | [{'credit_id': '586a688dc3a3680f4e017d65', 'de... | Jim Strouse | Sean McElwee | ||
14042 | 455661 | In a Heartbeat | 2017.0 | 146.0 | 8.3 | 20.82178 | 7.853347 | [{'credit_id': '5981a15c92514151e0011b51', 'de... | Beth David | ||||
14043 | 14008 | Cadet Kelly | 2002.0 | 145.0 | 5.2 | 4.392389 | 5.206602 | [{'cast_id': 1, 'character': 'Kelly Collins', ... | [{'credit_id': '52fe45c29251416c75061803', 'de... | Larry Shaw | |||
14044 | 49279 | L'Homme à la tête de caoutchouc | 1901.0 | 29.0 | 7.6 | 1.618458 | 6.509674 | [{'cast_id': 2, 'character': '', 'credit_id': ... | [{'credit_id': '52fe478dc3a36847f813bd5f', 'de... | Georges Méliès | [laboratory, mad scientist, disembodied head, ... | ||
14045 | 30840 | Robin Hood | 1991.0 | 26.0 | 5.7 | 5.683753 | 5.476910 | [{'cast_id': 1, 'character': 'Sir Robert Hode'... | [{'credit_id': '52fe44439251416c9100a899', 'de... | John Irvin | Jason Lehel |
14046 rows × 13 columns
Let's create a 'soup' column that merges crew members, genres and keywords.
The final string will describe the movie by its attributes. It will serve later to compare movies using Natural Language Processing.
movies['Director'] = movies['Director'].str.lower().str.replace(" ", "")
movies['Director of Photography'] = movies['Director of Photography'].str.lower().str.replace(" ", "")
#movies['genres'] = movies['genres'].apply(lambda x: [stemmer.stem(i) for i in x])
movies['genres'] = movies['genres'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
#movies['keywords'] = movies['keywords'].apply(lambda x: [stemmer.stem(i,) for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [str.lower(i) for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [''] if len(x) < 1 else x)
# split multi-word keywords into individual tokens (non-word characters become spaces)
movies['keywords'] = movies['keywords'].apply(lambda x: np.hstack([re.sub(r"[^\w]", " ", word).split() for word in x]).tolist())
movies['soup'] = movies['keywords']+movies['genres']
Again, following Rounak's methodology, we are going to use SnowballStemmer in order to keep only the stem of the keywords and genres. This helps harmonize keywords and genres into similar classes. Quite an ingenious trick which, I have to admit, I would never have thought of.
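Before applying it to the whole soup, here is a tiny preview of what the stemmer does to a few of the Toy Story keywords seen above (the same transformation is run on every keyword and genre in the next cell):

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
# 'jealousy' -> 'jealousi', 'friends' -> 'friend', 'disappearance' -> 'disappear', 'rivalry' -> 'rivalri'
print([stemmer.stem(w) for w in ['jealousy', 'friends', 'disappearance', 'rivalry']])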
stemmer = SnowballStemmer('english')
movies['soup'] = movies['soup'].apply(lambda x: [stemmer.stem(i) for i in x])
movies['soup'] = movies['soup'].apply(lambda x: ' '.join(x))+' '+movies['Director of Photography'] +' '+ movies['Director']
movies['soup'] = movies['soup'].fillna('')
movies
id | original_title | year | vote_count | vote_average | popularity | genres | wr | cast | crew | Director | Director of Photography | keywords | soup | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 862 | Toy Story | 1995.0 | 5415.0 | 7.7 | 21.946943 | 7.688717 | [{'cast_id': 14, 'character': 'Woody | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | johnlasseter | [jealousy, toy, boy, friendship, friends, riva... | jealousi toy boy friendship friend rivalri boy... | ||
1 | 8844 | Jumanji | 1995.0 | 2413.0 | 6.9 | 17.015539 | 6.883028 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | joejohnston | thomase.ackerman | [board, game, disappearance, based, on, childr... | board game disappear base on children s book n... | |
2 | 15602 | Grumpier Old Men | 1995.0 | 92.0 | 6.5 | 11.7129 | 6.231816 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | howarddeutch | [fishing, best, friend, duringcreditsstinger, ... | fish best friend duringcreditssting old men ro... | ||
3 | 31357 | Waiting to Exhale | 1995.0 | 34.0 | 6.1 | 3.859495 | 5.737668 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | forestwhitaker | [based, on, novel, interracial, relationship, ... | base on novel interraci relationship singl mot... | ||
4 | 11862 | Father of the Bride Part II | 1995.0 | 173.0 | 5.7 | 8.387519 | 5.642537 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | charlesshyer | elliotdavis | [baby, midlife, crisis, confidence, aging, dau... | babi midlif crisi confid age daughter mother d... | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14041 | 432789 | The Incredible Jessica James | 2017.0 | 37.0 | 6.2 | 5.667067 | 5.814878 | [{'cast_id': 6, 'character': 'Jessica James', ... | [{'credit_id': '586a688dc3a3680f4e017d65', 'de... | jimstrouse | seanmcelwee | romanc comedi seanmcelwee jimstrouse | ||
14042 | 455661 | In a Heartbeat | 2017.0 | 146.0 | 8.3 | 20.82178 | 7.853347 | [{'credit_id': '5981a15c92514151e0011b51', 'de... | bethdavid | love teenag lgbt short famili anim romanc come... | ||||
14043 | 14008 | Cadet Kelly | 2002.0 | 145.0 | 5.2 | 4.392389 | 5.206602 | [{'cast_id': 1, 'character': 'Kelly Collins', ... | [{'credit_id': '52fe45c29251416c75061803', 'de... | larryshaw | militari school comedi larryshaw | |||
14044 | 49279 | L'Homme à la tête de caoutchouc | 1901.0 | 29.0 | 7.6 | 1.618458 | 6.509674 | [{'cast_id': 2, 'character': '', 'credit_id': ... | [{'credit_id': '52fe478dc3a36847f813bd5f', 'de... | georgesméliès | [laboratory, mad, scientist, disembodied, head... | laboratori mad scientist disembodi head silent... | ||
14045 | 30840 | Robin Hood | 1991.0 | 26.0 | 5.7 | 5.683753 | 5.476910 | [{'cast_id': 1, 'character': 'Sir Robert Hode'... | [{'credit_id': '52fe44439251416c9100a899', 'de... | johnirvin | jasonlehel | drama action romanc jasonlehel johnirvin |
14046 rows × 14 columns
Cosine Similarity Filtering¶
Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of their lengths.
Cosine similarity is often used in NLP as a metric for text similarity between two documents: each word is represented in vector form, and documents are represented in an n-dimensional vector space.
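In symbols, for two vectors A and B:

$$ \text{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} $$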
Therefore, we need to change our soup strings into vectors before proceeding to Cosine similarity. I checked 2 ways:
- CountVectorizer
- BERT transformer
Count Vectorizer¶
CountVectorizer assigns a token to each word of each text input, resulting in a sparse matrix of token counts.
Using stop words, one can remove the most commonly used words of a language, as they carry little meaning in a sentence. I use them in our case, though I am not sure it is really useful since the "soup" is not a real sentence.
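To see what the vectorizer produces, here is a minimal sketch on two made-up "soups" (the strings are purely illustrative, not from the dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_soups = ["drama heist paris johndoe", "comedy heist paris janedoe"]  # two fake soups
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
count_matrix = count.fit_transform(toy_soups)  # sparse matrix: one row per soup, one column per uni/bigram
print(count_matrix.toarray())
print(cosine_similarity(count_matrix))  # off-diagonal entries = similarity between the two soups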
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
def get_recommendations_CV(df, title, n=100):
    """Return the n movies whose 'soup' is most similar to the given title (CountVectorizer + cosine similarity)."""
    count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
    count_matrix = count.fit_transform(df['soup'])
    cosine_sim = cosine_similarity(count_matrix, count_matrix)
    # locate the reference movie and rank every other movie by its similarity to it
    idx = df[df['original_title'] == title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:1+n]  # skip the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return df[['id', 'original_title', 'year', 'wr']].iloc[movie_indices].sort_values('wr', ascending=False)
BERT Transformer¶
Bidirectional Encoder Representations from Transformers, or BERT, is a technique for natural-language-processing pre-training developed by Google. This kind of transformer is commonly used to embed sentences into dense vectors that can then be compared.
Many pre-trained models are available on https://huggingface.co/transformers/
from sentence_transformers import SentenceTransformer
from sentence_transformers import models, losses
model = SentenceTransformer('bert-base-nli-mean-tokens')
text_embeddings = model.encode(movies['soup'], show_progress_bar = True)
similarities = cosine_similarity(text_embeddings)
with open('similarities.npy', 'wb') as f:
np.save(f, similarities)
def get_recommendations_BERT(df, title, n=100):
    """Same as get_recommendations_CV, but ranks movies by the precomputed BERT embedding similarities."""
    similarities = np.load('similarities.npy')
    idx = df[df['original_title'] == title].index[0]
    sim_scores = list(enumerate(similarities[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:1+n]  # skip the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return df[['id', 'original_title', 'year', 'wr']].iloc[movie_indices].sort_values('wr', ascending=False)
Here are the top 10 recommendations based on movie genres, crew and keywords:
display(get_recommendations_CV(movies, 'The Social Network',n=10))
display(get_recommendations_BERT(movies, 'The Social Network',n=10))
id | original_title | year | wr | |
---|---|---|---|---|
1863 | 550 | Fight Club | 1999.0 | 8.292128 |
10332 | 210577 | Gone Girl | 2014.0 | 7.889025 |
8692 | 65754 | The Girl with the Dragon Tattoo | 2011.0 | 7.180480 |
2718 | 41050 | La notte | 1961.0 | 7.067241 |
10787 | 295144 | Marvellous | 2015.0 | 6.484483 |
5276 | 10429 | Takedown | 2000.0 | 5.974351 |
2585 | 39507 | The Dead | 1987.0 | 5.968177 |
2375 | 16094 | House Party | 1990.0 | 5.822369 |
13210 | 368031 | Unfriend | 2016.0 | 5.296565 |
7071 | 13991 | College | 2008.0 | 4.952362 |
id | original_title | year | wr | |
---|---|---|---|---|
10314 | 250658 | The Internet's Own Boy: The Story of Aaron Swartz | 2014.0 | 7.488402 |
10474 | 284427 | Who Am I - Kein System ist sicher | 2014.0 | 7.470599 |
13148 | 328387 | Nerve | 2016.0 | 7.079721 |
8755 | 76726 | Chronicle | 2012.0 | 6.582976 |
11800 | 317144 | Cyberbully | 2015.0 | 6.260476 |
9695 | 115782 | Jobs | 2013.0 | 5.984373 |
5276 | 10429 | Takedown | 2000.0 | 5.974351 |
2559 | 9989 | Antitrust | 2001.0 | 5.723328 |
114 | 9886 | Johnny Mnemonic | 1995.0 | 5.484253 |
13210 | 368031 | Unfriend | 2016.0 | 5.296565 |
Collaborative Filtering¶
We will now set up a recommender based on collaborative filtering. The idea is to recommend films based on what other users with a profile similar to yours liked. We are going to use surprise, a package for recommender systems.
We must first match the IDs between the ratings table and our movies table.
match = pd.read_csv('links_small.csv')
match = match.dropna(how="any")
match['tmdbId'] = match['tmdbId'].astype('Int64')#.astype(str)
match.rename(columns={'tmdbId':'id'},inplace=True)
match.set_index('id',inplace=True)
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate, GridSearchCV
reader = Reader()
For such a recommendation task, we need to gather users' ratings. The following dataset is precisely that:
ratings = pd.read_csv('ratings_small.csv')
ratings
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
... | ... | ... | ... | ... |
99999 | 671 | 6268 | 2.5 | 1065579370 |
100000 | 671 | 6269 | 4.0 | 1065149201 |
100001 | 671 | 6365 | 4.0 | 1070940363 |
100002 | 671 | 6385 | 2.5 | 1070979663 |
100003 | 671 | 6565 | 3.5 | 1074784724 |
100004 rows × 4 columns
Now I need to align my ratings with the available ratings so that I become a user among the others. For that, I set userId = -1, which is meant to be me, and rescale my ratings from a 10-point scale down to the 5-point scale used here. I also have to match on ['original_title', 'year'] to get the movieId and the imdbId.
inner_df = pd.merge(my_notes,movies, on=['original_title','year'],how='inner')
inner_df['id'] = inner_df.id.astype("int64",errors='ignore')
#inner_df= inner_df[['original_title','year','note','id']]
#inner_df.drop_duplicates(inplace=True)
inner_df['rating'] = inner_df['note']/2
inner_df['userId'] = -1
inner_df = inner_df.merge(match.reset_index(),how='inner',on='id')
We will train an SVD algorithm on our dataset of ratings.
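For reference, the SVD implementation in surprise predicts a rating as a biased matrix factorization (this is the standard formulation documented by the library):

$$ \hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u $$

where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$, $q_i$ are the latent user and item factors learned by stochastic gradient descent. The number of epochs, learning rate and regularization tuned in the grid search below are exactly the hyperparameters of that fit.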
X = pd.concat([ratings[['userId', 'movieId', 'rating']],inner_df[['userId', 'movieId', 'rating']]])
data = Dataset.load_from_df(X, reader)
param_grid = {
"n_epochs": [5, 10],
"lr_all": [0.002, 0.005],
"reg_all": [0.4, 0.6]
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
0.9151389977377514 {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
Using gs.best_params and gs.best_score gave me an RMSE of about 0.91.
svd = gs.best_estimator['rmse']
svd.fit(data.build_full_trainset())
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x11717f06d00>
Now, we can loop through all our movies and compute an estimated rating for me, based on other users with similar tastes.
def helper(x, userId, algo):
    """Estimated rating of the movie with TMDB id x for userId, from the trained surprise model."""
    try:
        # map the TMDB id to its MovieLens movieId before asking the model for a prediction
        return algo.predict(userId, match['movieId'].to_dict()[x]).est
    except KeyError:
        # the movie is not in the links table, so no prediction is possible
        return np.nan
movies['est'] = movies['id'].apply(lambda x: helper(x,-1,svd))
movies
id | original_title | year | vote_count | vote_average | popularity | genres | wr | cast | crew | Director | Director of Photography | keywords | soup | est | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 862 | Toy Story | 1995.0 | 5415.0 | 7.7 | 21.946943 | 7.688717 | [{'cast_id': 14, 'character': 'Woody | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | johnlasseter | [jealousy, toy, boy, friendship, friends, riva... | jealousi toy boy friendship friend rivalri boy... | 3.643723 | ||
1 | 8844 | Jumanji | 1995.0 | 2413.0 | 6.9 | 17.015539 | 6.883028 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | joejohnston | thomase.ackerman | [board, game, disappearance, based, on, childr... | board game disappear base on children s book n... | 3.307559 | |
2 | 15602 | Grumpier Old Men | 1995.0 | 92.0 | 6.5 | 11.7129 | 6.231816 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | howarddeutch | [fishing, best, friend, duringcreditsstinger, ... | fish best friend duringcreditssting old men ro... | 3.104655 | ||
3 | 31357 | Waiting to Exhale | 1995.0 | 34.0 | 6.1 | 3.859495 | 5.737668 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | forestwhitaker | [based, on, novel, interracial, relationship, ... | base on novel interraci relationship singl mot... | 2.903563 | ||
4 | 11862 | Father of the Bride Part II | 1995.0 | 173.0 | 5.7 | 8.387519 | 5.642537 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | charlesshyer | elliotdavis | [baby, midlife, crisis, confidence, aging, dau... | babi midlif crisi confid age daughter mother d... | 3.183032 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14041 | 432789 | The Incredible Jessica James | 2017.0 | 37.0 | 6.2 | 5.667067 | 5.814878 | [{'cast_id': 6, 'character': 'Jessica James', ... | [{'credit_id': '586a688dc3a3680f4e017d65', 'de... | jimstrouse | seanmcelwee | romanc comedi seanmcelwee jimstrouse | NaN | ||
14042 | 455661 | In a Heartbeat | 2017.0 | 146.0 | 8.3 | 20.82178 | 7.853347 | [{'credit_id': '5981a15c92514151e0011b51', 'de... | bethdavid | love teenag lgbt short famili anim romanc come... | NaN | ||||
14043 | 14008 | Cadet Kelly | 2002.0 | 145.0 | 5.2 | 4.392389 | 5.206602 | [{'cast_id': 1, 'character': 'Kelly Collins', ... | [{'credit_id': '52fe45c29251416c75061803', 'de... | larryshaw | militari school comedi larryshaw | NaN | |||
14044 | 49279 | L'Homme à la tête de caoutchouc | 1901.0 | 29.0 | 7.6 | 1.618458 | 6.509674 | [{'cast_id': 2, 'character': '', 'credit_id': ... | [{'credit_id': '52fe478dc3a36847f813bd5f', 'de... | georgesméliès | [laboratory, mad, scientist, disembodied, head... | laboratori mad scientist disembodi head silent... | NaN | ||
14045 | 30840 | Robin Hood | 1991.0 | 26.0 | 5.7 | 5.683753 | 5.476910 | [{'cast_id': 1, 'character': 'Sir Robert Hode'... | [{'credit_id': '52fe44439251416c9100a899', 'de... | johnirvin | jasonlehel | drama action romanc jasonlehel johnirvin | NaN |
14046 rows × 15 columns
Hybrid¶
Final step: let's combine the two filtering techniques.
We will first filter movies by similarity, and then see what other users might recommend to me within this cluster.
def hybrid(df, userId, title, algo = svd):
ds = get_recommendations_CV(df, title, n=50)
df = df.loc[ds.index,['id','original_title','year','wr']]
#df = ds.reset_index().rename(columns={'index':'id'})
df['est'] = df['id'].apply(lambda x: helper(x,userId,algo))
df.dropna(axis=0,how='any', inplace=True)
df.sort_values('est', ascending=False, inplace =True)
df = df.head(10)
df['year'] = df['year'].astype(int)
df['link'] = df['id'].apply(lambda x: 'https://www.themoviedb.org/movie/{}'.format(str(x)))
return df.head(10)
def hybrid_bis(df, userId, title, algo = svd):
ds = get_recommendations_BERT(df, title, n=50)
df = df.loc[ds.index,['id','original_title','year','wr']]
#df = ds.reset_index().rename(columns={'index':'id'})
df['est'] = df['id'].apply(lambda x: helper(x,userId,algo))
df.dropna(axis=0,how='any', inplace=True)
df.sort_values('est', ascending=False, inplace =True)
df = df.head(10)
df['year'] = df['year'].astype(int)
df['link'] = df['id'].apply(lambda x: 'https://www.themoviedb.org/movie/{}'.format(str(x)))
return df.head(10)
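As a side note, hybrid and hybrid_bis differ only in the similarity function they call, so a single parameterized helper would avoid the duplication. A quick sketch (hybrid_generic is just an illustrative name, not used below):

def hybrid_generic(df, userId, title, algo=svd, recommender=get_recommendations_CV):
    """Similarity filtering (CountVectorizer or BERT) followed by collaborative re-ranking."""
    ds = recommender(df, title, n=50)
    out = df.loc[ds.index, ['id', 'original_title', 'year', 'wr']].copy()
    out['est'] = out['id'].apply(lambda x: helper(x, userId, algo))  # estimated rating for this user
    out = out.dropna(how='any').sort_values('est', ascending=False).head(10)
    out['year'] = out['year'].astype(int)
    out['link'] = out['id'].apply(lambda x: 'https://www.themoviedb.org/movie/{}'.format(x))
    return out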
response = hybrid(movies,-1, 'This Is Spinal Tap', algo = svd)
response2 = hybrid_bis(movies,-1, 'This Is Spinal Tap', algo = svd)
def make_clickable(val):
# target _blank to open new window
return '<a target="_blank" href="{}">{}</a>'.format(val, val)
display(response.style.format({'link': make_clickable}))
display(response2.style.format({'link': make_clickable}))
id | original_title | year | wr | est | link | |
---|---|---|---|---|---|---|
1931 | 11663 | The Commitments | 1991 | 7.056831 | 3.616117 | https://www.themoviedb.org/movie/11663 |
6523 | 5723 | Once | 2007 | 7.288221 | 3.575962 | https://www.themoviedb.org/movie/5723 |
2257 | 31516 | On the Town | 1949 | 6.348299 | 3.574190 | https://www.themoviedb.org/movie/31516 |
5711 | 15258 | The Aristocrats | 2005 | 5.737811 | 3.539981 | https://www.themoviedb.org/movie/15258 |
4531 | 9459 | Woodstock | 1970 | 6.595896 | 3.535589 | https://www.themoviedb.org/movie/9459 |
1682 | 11779 | Buena Vista Social Club | 1999 | 6.791311 | 3.505207 | https://www.themoviedb.org/movie/11779 |
658 | 12614 | Victor/Victoria | 1982 | 6.537039 | 3.503791 | https://www.themoviedb.org/movie/12614 |
4095 | 1584 | School of Rock | 2003 | 6.773714 | 3.484627 | https://www.themoviedb.org/movie/1584 |
4989 | 13671 | The Music Man | 1962 | 6.194756 | 3.483782 | https://www.themoviedb.org/movie/13671 |
6313 | 2179 | Tenacious D in The Pick of Destiny | 2006 | 6.429330 | 3.473225 | https://www.themoviedb.org/movie/2179 |
id | original_title | year | wr | est | link | |
---|---|---|---|---|---|---|
3728 | 11949 | Monty Python Live at the Hollywood Bowl | 1982 | 6.411652 | 3.546626 | https://www.themoviedb.org/movie/11949 |
4531 | 9459 | Woodstock | 1970 | 6.595896 | 3.535589 | https://www.themoviedb.org/movie/9459 |
5818 | 19082 | No Direction Home: Bob Dylan | 2005 | 6.527973 | 3.517776 | https://www.themoviedb.org/movie/19082 |
4439 | 132 | The Rolling Stones: Gimme Shelter | 1970 | 6.846603 | 3.510834 | https://www.themoviedb.org/movie/132 |
1634 | 6396 | SLC Punk | 1998 | 6.864597 | 3.496494 | https://www.themoviedb.org/movie/6396 |
2223 | 27745 | The Filth and the Fury | 2000 | 6.348563 | 3.494963 | https://www.themoviedb.org/movie/27745 |
4095 | 1584 | School of Rock | 2003 | 6.773714 | 3.484627 | https://www.themoviedb.org/movie/1584 |
5707 | 1665 | Last Days | 2005 | 5.288520 | 3.482124 | https://www.themoviedb.org/movie/1665 |
6313 | 2179 | Tenacious D in The Pick of Destiny | 2006 | 6.429330 | 3.473225 | https://www.themoviedb.org/movie/2179 |
3014 | 27327 | Phantom of the Paradise | 1974 | 7.052559 | 3.463405 | https://www.themoviedb.org/movie/27327 |
Not too bad of a selection!