Movie Recommender¶
Being a movie fanatic, I am always looking for new films that will blow my mind. The problem is that I always lack ideas or lose myself in the plethora of movies offered on streaming platforms.
My go-to website for films is the French https://www.senscritique.com/, the reference site for movie enthusiasts. There, I have already rated about 1000 titles. This gave me an idea: can I use my ratings to build a custom movie recommender?
So, let's see if I can set up a movie recommender engine.
The idea:
- take my senscritique.com movie ratings
- take the movies dataset from https://www.kaggle.com/rounakbanik/movie-recommender-systems
- set up similarity filtering
  - cosine similarity of movie descriptions
- set up collaborative filtering
  - find users with similar taste in movies
- set up hybrid filtering
  - retrieve similar movies that I am likely to like
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
import re
#from nltk.stem.wordnet import WordNetLemmatizer
#from nltk.corpus import wordnet
#from surprise import Reader, Dataset, SVD, evaluate
Get my movie ratings from senscritique¶
First, I am going to scrape my ratings from senscritique.com.
For that, requests and BeautifulSoup are always the way to go.
import requests
from bs4 import BeautifulSoup
import time
cURL = """curl ..."""  # cURL command copied from the browser's network tab (redacted here)
import uncurl
import json
# uncurl turns the copied cURL command into its url, JSON payload and headers
context = uncurl.parse_context(cURL)
s = requests.session()
# replay the same request the site makes, re-using its headers and payload
response = s.post(context.url, json=json.loads(context.data[1:]), headers=context.headers)
from benedict import benedict
# benedict gives keypath access into the nested JSON response
response_json = benedict(response.json()[0])
senscrtq = pd.json_normalize(response_json['data.user.collection.products'])
senscrtq.loc[senscrtq.originalTitle.isnull(),"originalTitle"] = senscrtq.loc[senscrtq.originalTitle.isnull(),"title"]
senscrtq = senscrtq[["originalTitle","dateReleaseOriginal","otherUserInfos.rating"]]
senscrtq.dateReleaseOriginal = pd.to_datetime(senscrtq.dateReleaseOriginal)
senscrtq
originalTitle | dateReleaseOriginal | otherUserInfos.rating | |
---|---|---|---|
0 | Barbarian | 2022-08-31 | 5 |
1 | Eternals | 2021-11-05 | 5 |
2 | The Many Saints of Newark | 2021-10-01 | 5 |
3 | The Banshees of Inisherin | 2022-10-21 | 7 |
4 | The Menu | 2022-11-18 | 3 |
... | ... | ... | ... |
1267 | Shutter Island | 2010-02-19 | 8 |
1268 | Forrest Gump | 1994-07-06 | 10 |
1269 | Django Unchained | 2012-12-25 | 8 |
1270 | Avatar | 2009-12-16 | 8 |
1271 | Inglourious Basterds | 2009-08-20 | 7 |
1272 rows × 3 columns
Out of curiosity, let's see my favorite decade.
senscrtq["decade"] = ((senscrtq.dateReleaseOriginal.dt.year//10)*10).astype("Int64")
sns.boxplot(x="decade", y='otherUserInfos.rating', data=senscrtq.dropna(), showfliers=False)
<AxesSubplot:xlabel='decade', ylabel='otherUserInfos.rating'>
The 70s, golden age of independent cinema!
senscrtq["year"] = senscrtq.dateReleaseOriginal.dt.year
senscrtq.rename(columns={"originalTitle":"original_title","otherUserInfos.rating":"note"},inplace=True)
senscrtq.to_csv('my_note3.csv',index=False)
Load Data¶
my_notes = pd.read_csv('my_note3.csv')
my_notes['year'] = my_notes['year'].astype(float)
my_notes
original_title | dateReleaseOriginal | note | decade | year | |
---|---|---|---|---|---|
0 | Barbarian | 2022-08-31 | 5 | 2020.0 | 2022.0 |
1 | Eternals | 2021-11-05 | 5 | 2020.0 | 2021.0 |
2 | The Many Saints of Newark | 2021-10-01 | 5 | 2020.0 | 2021.0 |
3 | The Banshees of Inisherin | 2022-10-21 | 7 | 2020.0 | 2022.0 |
4 | The Menu | 2022-11-18 | 3 | 2020.0 | 2022.0 |
... | ... | ... | ... | ... | ... |
1267 | Shutter Island | 2010-02-19 | 8 | 2010.0 | 2010.0 |
1268 | Forrest Gump | 1994-07-06 | 10 | 1990.0 | 1994.0 |
1269 | Django Unchained | 2012-12-25 | 8 | 2010.0 | 2012.0 |
1270 | Avatar | 2009-12-16 | 8 | 2000.0 | 2009.0 |
1271 | Inglourious Basterds | 2009-08-20 | 7 | 2000.0 | 2009.0 |
1272 rows × 5 columns
I also need to load data from other users. For that, the dataset provided by Rounak Banik on his Kaggle project page is extremely useful.
Some cleaning and additions are necessary though:
- parse dates into the right format
- retrieve the year of release
- retrieve all film genres as a list
movies = pd.read_csv('movies_metadata.csv')
movies['release_date'] = pd.to_datetime(movies.release_date,infer_datetime_format=True,errors='coerce')
movies['year'] = movies.release_date.dt.year
movies['genres'] = movies['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
movies.head()
C:\Users\coren\AppData\Local\Temp\ipykernel_15060\2263965596.py:1: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False. movies = pd.read_csv('movies_metadata.csv')
adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 373554033.0 | 81.0 | Released | NaN | Toy Story | False | 7.7 | 5415.0 | 1995.0 | ||
1 | False | NaN | 65000000 | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 | 1995.0 | |
2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 0.0 | 101.0 | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 | 1995.0 | ||
3 | False | NaN | 16000000 | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 81452156.0 | 127.0 | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 | 1995.0 | ||
4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 76578911.0 | 106.0 | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 | 1995.0 |
5 rows × 25 columns
Following Rounak's method, I will use IMDB's weighted rating formula to construct my chart. Mathematically, it is represented as follows:

$$ WR = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right) $$

where,
- v is the number of votes for the movie
- m is the minimum number of votes required to be listed in the chart
- R is the average rating of the movie
- C is the mean vote across the whole dataset
vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C
5.244896612406511
m = vote_counts.quantile(0.70)
m
25.0
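As a sanity check, plugging the Toy Story numbers from the table above (v = 5415, R = 7.7) into the formula with m = 25 and C ≈ 5.245 gives

$$ WR = \frac{5415}{5440} \times 7.7 + \frac{25}{5440} \times 5.245 \approx 7.689 $$

which matches the wr value computed further down.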
movies = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())][['id','original_title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
movies
id | original_title | year | vote_count | vote_average | popularity | genres | |
---|---|---|---|---|---|---|---|
0 | 862 | Toy Story | 1995.0 | 5415.0 | 7.7 | 21.946943 | |
1 | 8844 | Jumanji | 1995.0 | 2413.0 | 6.9 | 17.015539 | |
2 | 15602 | Grumpier Old Men | 1995.0 | 92.0 | 6.5 | 11.7129 | |
3 | 31357 | Waiting to Exhale | 1995.0 | 34.0 | 6.1 | 3.859495 | |
4 | 11862 | Father of the Bride Part II | 1995.0 | 173.0 | 5.7 | 8.387519 | |
... | ... | ... | ... | ... | ... | ... | ... |
45380 | 432789 | The Incredible Jessica James | 2017.0 | 37.0 | 6.2 | 5.667067 | |
45437 | 455661 | In a Heartbeat | 2017.0 | 146.0 | 8.3 | 20.82178 | |
45441 | 14008 | Cadet Kelly | 2002.0 | 145.0 | 5.2 | 4.392389 | |
45443 | 49279 | L'Homme à la tête de caoutchouc | 1901.0 | 29.0 | 7.6 | 1.618458 | |
45460 | 30840 | Robin Hood | 1991.0 | 26.0 | 5.7 | 5.683753 |
13810 rows × 7 columns
OK, so we went from 45466 movies down to the 13810 most-voted ones. This could actually be a problem for recommendation, as I am always eager to watch niche, underrated movies, which are likely to be removed by this method. Maybe if I can find a more qualitative dataset where movie nerds rate those specific films, it could work. But for now, I will experiment with what I have.
def weighted_rating(x):
"""IMDB weighting formula"""
v = x['vote_count']
R = x['vote_average']
return (v/(v+m) * R) + (m/(m+v) * C)
movies['wr'] = movies.apply(weighted_rating, axis=1)
movies = movies[pd.to_numeric(movies['id'], errors='coerce').notnull()]
movies['id'] = movies.id.astype(int)
When it comes to movies, I tend to trust certain members of the crew. Indeed, the director and the director of photography are often reliable indicators.
Therefore, I will retrieve those crew members from the credits.csv data linked to the movies.
def get_crew(x,job):
"""
Retrieve a crew member from his/her job label
"""
list_ = literal_eval(x)
if isinstance(list_, list):
job_list = [i['name'] for i in list_ if i['job']==job]
if len(job_list)>0:
return job_list[0]
else:
return ""
else:
return ""
credit = pd.read_csv('credits.csv')
credit['Director'] = credit['crew'].fillna('[]').apply(lambda x: get_crew(x,'Director'))
credit['Director of Photography'] = credit['crew'].fillna('[]').apply(lambda x: get_crew(x,'Director of Photography'))
credit
cast | crew | id | Director | Director of Photography | |
---|---|---|---|---|---|
0 | [{'cast_id': 14, 'character': 'Woody | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | 862 | John Lasseter | |
1 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | 8844 | Joe Johnston | Thomas E. Ackerman |
2 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | 15602 | Howard Deutch | |
3 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | 31357 | Forest Whitaker | |
4 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | 11862 | Charles Shyer | Elliot Davis |
... | ... | ... | ... | ... | ... |
45471 | [{'cast_id': 0, 'character': '', 'credit_id': ... | [{'credit_id': '5894a97d925141426c00818c', 'de... | 439050 | Hamid Nematollah | |
45472 | [{'cast_id': 1002, 'character': 'Sister Angela... | [{'credit_id': '52fe4af1c3a36847f81e9b15', 'de... | 111109 | Lav Diaz | |
45473 | [{'cast_id': 6, 'character': 'Emily Shaw', 'cr... | [{'credit_id': '52fe4776c3a368484e0c8387', 'de... | 67758 | Mark L. Lester | João Fernandes |
45474 | [{'cast_id': 2, 'character': '', 'credit_id': ... | [{'credit_id': '533bccebc3a36844cf0011a7', 'de... | 227506 | Yakov Protazanov | |
45475 | [{'credit_id': '593e676c92514105b702e68e', 'de... | 461257 | Daisy Asquith |
45476 rows × 5 columns
Finally, let's get some keywords that will help the matching process.
keywords = pd.read_csv('keywords.csv')
keywords['keywords'] = keywords['keywords'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
keywords
id | keywords | |
---|---|---|
0 | 862 | [jealousy, toy, boy, friendship, friends, riva... |
1 | 8844 | [board game, disappearance, based on children'... |
2 | 15602 | [fishing, best friend, duringcreditsstinger, o... |
3 | 31357 | [based on novel, interracial relationship, sin... |
4 | 11862 | [baby, midlife crisis, confidence, aging, daug... |
... | ... | ... |
46414 | 439050 | |
46415 | 111109 | |
46416 | 67758 | |
46417 | 227506 | |
46418 | 461257 |
46419 rows × 2 columns
movies = movies.merge(credit.merge(keywords, on='id'), on='id')
movies
id | original_title | year | vote_count | vote_average | popularity | genres | wr | cast | crew | Director | Director of Photography | keywords | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 862 | Toy Story | 1995.0 | 5415.0 | 7.7 | 21.946943 | 7.688717 | [{'cast_id': 14, 'character': 'Woody | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | John Lasseter | [jealousy, toy, boy, friendship, friends, riva... | ||
1 | 8844 | Jumanji | 1995.0 | 2413.0 | 6.9 | 17.015539 | 6.883028 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | Joe Johnston | Thomas E. Ackerman | [board game, disappearance, based on children'... | |
2 | 15602 | Grumpier Old Men | 1995.0 | 92.0 | 6.5 | 11.7129 | 6.231816 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | Howard Deutch | [fishing, best friend, duringcreditsstinger, o... | ||
3 | 31357 | Waiting to Exhale | 1995.0 | 34.0 | 6.1 | 3.859495 | 5.737668 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | Forest Whitaker | [based on novel, interracial relationship, sin... | ||
4 | 11862 | Father of the Bride Part II | 1995.0 | 173.0 | 5.7 | 8.387519 | 5.642537 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | Charles Shyer | Elliot Davis | [baby, midlife crisis, confidence, aging, daug... | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14041 | 432789 | The Incredible Jessica James | 2017.0 | 37.0 | 6.2 | 5.667067 | 5.814878 | [{'cast_id': 6, 'character': 'Jessica James', ... | [{'credit_id': '586a688dc3a3680f4e017d65', 'de... | Jim Strouse | Sean McElwee | ||
14042 | 455661 | In a Heartbeat | 2017.0 | 146.0 | 8.3 | 20.82178 | 7.853347 | [{'credit_id': '5981a15c92514151e0011b51', 'de... | Beth David | ||||
14043 | 14008 | Cadet Kelly | 2002.0 | 145.0 | 5.2 | 4.392389 | 5.206602 | [{'cast_id': 1, 'character': 'Kelly Collins', ... | [{'credit_id': '52fe45c29251416c75061803', 'de... | Larry Shaw | |||
14044 | 49279 | L'Homme à la tête de caoutchouc | 1901.0 | 29.0 | 7.6 | 1.618458 | 6.509674 | [{'cast_id': 2, 'character': '', 'credit_id': ... | [{'credit_id': '52fe478dc3a36847f813bd5f', 'de... | Georges Méliès | [laboratory, mad scientist, disembodied head, ... | ||
14045 | 30840 | Robin Hood | 1991.0 | 26.0 | 5.7 | 5.683753 | 5.476910 | [{'cast_id': 1, 'character': 'Sir Robert Hode'... | [{'credit_id': '52fe44439251416c9100a899', 'de... | John Irvin | Jason Lehel |
14046 rows × 13 columns
Let's create a 'soup' column that merges crew members, genres and keywords.
The final string will describe the movie by its attributes. It will serve later to compare movies using Natural Language Processing.
movies['Director'] = movies['Director'].str.lower().str.replace(" ", "")
movies['Director of Photography'] = movies['Director of Photography'].str.lower().str.replace(" ", "")
#movies['genres'] = movies['genres'].apply(lambda x: [stemmer.stem(i) for i in x])
movies['genres'] = movies['genres'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
#movies['keywords'] = movies['keywords'].apply(lambda x: [stemmer.stem(i,) for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [str.lower(i) for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [''] if len(x) < 1 else x)
# split multi-word keywords into individual tokens (non-word characters become spaces)
movies['keywords'] = movies['keywords'].apply(lambda x: np.hstack([re.sub(r"[^\w]", " ", word).split() for word in x]).tolist())
movies['soup'] = movies['keywords']+movies['genres']
Again, following Rounak's methodology, we are going to use SnowballStemmer in order to keep only the stem of the keywords and genres. This helps harmonize keywords and genres into similar classes. Quite an ingenious trick which, I have to admit, I would never have thought of.
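Before applying it to the whole soup, here is a tiny preview of what the stemmer does to a few of the Toy Story keywords seen above (the same transformation is run on every keyword and genre in the next cell):

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
# 'jealousy' -> 'jealousi', 'friends' -> 'friend', 'disappearance' -> 'disappear', 'rivalry' -> 'rivalri'
print([stemmer.stem(w) for w in ['jealousy', 'friends', 'disappearance', 'rivalry']])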
stemmer = SnowballStemmer('english')
movies['soup'] = movies['soup'].apply(lambda x: [stemmer.stem(i) for i in x])
movies['soup'] = movies['soup'].apply(lambda x: ' '.join(x))+' '+movies['Director of Photography'] +' '+ movies['Director']
movies['soup'] = movies['soup'].fillna('')
movies
id | original_title | year | vote_count | vote_average | popularity | genres | wr | cast | crew | Director | Director of Photography | keywords | soup | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 862 | Toy Story | 1995.0 | 5415.0 | 7.7 | 21.946943 | 7.688717 | [{'cast_id': 14, 'character': 'Woody | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | johnlasseter | [jealousy, toy, boy, friendship, friends, riva... | jealousi toy boy friendship friend rivalri boy... | ||
1 | 8844 | Jumanji | 1995.0 | 2413.0 | 6.9 | 17.015539 | 6.883028 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | joejohnston | thomase.ackerman | [board, game, disappearance, based, on, childr... | board game disappear base on children s book n... | |
2 | 15602 | Grumpier Old Men | 1995.0 | 92.0 | 6.5 | 11.7129 | 6.231816 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | howarddeutch | [fishing, best, friend, duringcreditsstinger, ... | fish best friend duringcreditssting old men ro... | ||
3 | 31357 | Waiting to Exhale | 1995.0 | 34.0 | 6.1 | 3.859495 | 5.737668 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | forestwhitaker | [based, on, novel, interracial, relationship, ... | base on novel interraci relationship singl mot... | ||
4 | 11862 | Father of the Bride Part II | 1995.0 | 173.0 | 5.7 | 8.387519 | 5.642537 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | charlesshyer | elliotdavis | [baby, midlife, crisis, confidence, aging, dau... | babi midlif crisi confid age daughter mother d... | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14041 | 432789 | The Incredible Jessica James | 2017.0 | 37.0 | 6.2 | 5.667067 | 5.814878 | [{'cast_id': 6, 'character': 'Jessica James', ... | [{'credit_id': '586a688dc3a3680f4e017d65', 'de... | jimstrouse | seanmcelwee | romanc comedi seanmcelwee jimstrouse | ||
14042 | 455661 | In a Heartbeat | 2017.0 | 146.0 | 8.3 | 20.82178 | 7.853347 | [{'credit_id': '5981a15c92514151e0011b51', 'de... | bethdavid | love teenag lgbt short famili anim romanc come... | ||||
14043 | 14008 | Cadet Kelly | 2002.0 | 145.0 | 5.2 | 4.392389 | 5.206602 | [{'cast_id': 1, 'character': 'Kelly Collins', ... | [{'credit_id': '52fe45c29251416c75061803', 'de... | larryshaw | militari school comedi larryshaw | |||
14044 | 49279 | L'Homme à la tête de caoutchouc | 1901.0 | 29.0 | 7.6 | 1.618458 | 6.509674 | [{'cast_id': 2, 'character': '', 'credit_id': ... | [{'credit_id': '52fe478dc3a36847f813bd5f', 'de... | georgesméliès | [laboratory, mad, scientist, disembodied, head... | laboratori mad scientist disembodi head silent... | ||
14045 | 30840 | Robin Hood | 1991.0 | 26.0 | 5.7 | 5.683753 | 5.476910 | [{'cast_id': 1, 'character': 'Sir Robert Hode'... | [{'credit_id': '52fe44439251416c9100a899', 'de... | johnirvin | jasonlehel | drama action romanc jasonlehel johnirvin |
14046 rows × 14 columns
Cosine Similarity Filtering¶
Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of their lengths.
Cosine similarity is often used in NLP as a metric for text similarity between two documents: each word is represented in vector form, and documents are represented in an n-dimensional vector space.
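In symbols, for two vectors A and B:

$$ \text{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} $$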
Therefore, we need to change our soup strings into vectors before proceeding to Cosine similarity. I checked 2 ways:
- CountVectorizer
- BERT transformer
Count Vectorizer¶
CountVectorizer assigns a token to each word of each text input, resulting in a sparse matrix of token counts.
Using stop words, one can remove the most commonly used words of a language, as they carry little meaning in a sentence. I use them in our case, though I am not sure it is really useful since the "soup" is not a real sentence.
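To see what the vectorizer produces, here is a minimal sketch on two made-up "soups" (the strings are purely illustrative, not from the dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_soups = ["drama heist paris johndoe", "comedy heist paris janedoe"]  # two fake soups
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
count_matrix = count.fit_transform(toy_soups)  # sparse matrix: one row per soup, one column per uni/bigram
print(count_matrix.toarray())
print(cosine_similarity(count_matrix))  # off-diagonal entries = similarity between the two soups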
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
def get_recommendations_CV(df, title, n=100):
    """Return the n movies whose 'soup' is most similar to the given title (CountVectorizer + cosine similarity)."""
    count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
    count_matrix = count.fit_transform(df['soup'])
    cosine_sim = cosine_similarity(count_matrix, count_matrix)
    # locate the reference movie and rank every other movie by its similarity to it
    idx = df[df['original_title'] == title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:1+n]  # skip the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return df[['id', 'original_title', 'year', 'wr']].iloc[movie_indices].sort_values('wr', ascending=False)
BERT Transformer¶
Bidirectional Encoder Representations from Transformers, or BERT, is a technique for natural-language-processing pre-training developed by Google. This kind of transformer is commonly used to embed sentences into dense vectors that can then be compared.
Many pre-trained models are available on https://huggingface.co/transformers/
from sentence_transformers import SentenceTransformer
from sentence_transformers import models, losses
model = SentenceTransformer('bert-base-nli-mean-tokens')
text_embeddings = model.encode(movies['soup'], show_progress_bar = True)
similarities = cosine_similarity(text_embeddings)
with open('similarities.npy', 'wb') as f:
np.save(f, similarities)
def get_recommendations_BERT(df, title, n=100):
    """Same as get_recommendations_CV, but ranks movies by the precomputed BERT embedding similarities."""
    similarities = np.load('similarities.npy')
    idx = df[df['original_title'] == title].index[0]
    sim_scores = list(enumerate(similarities[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:1+n]  # skip the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return df[['id', 'original_title', 'year', 'wr']].iloc[movie_indices].sort_values('wr', ascending=False)
Here are the top 10 recommendations based on movie genres, crew and keywords:
display(get_recommendations_CV(movies, 'The Social Network',n=10))
display(get_recommendations_BERT(movies, 'The Social Network',n=10))
id | original_title | year | wr | |
---|---|---|---|---|
1863 | 550 | Fight Club | 1999.0 | 8.292128 |
10332 | 210577 | Gone Girl | 2014.0 | 7.889025 |
8692 | 65754 | The Girl with the Dragon Tattoo | 2011.0 | 7.180480 |
2718 | 41050 | La notte | 1961.0 | 7.067241 |
10787 | 295144 | Marvellous | 2015.0 | 6.484483 |
5276 | 10429 | Takedown | 2000.0 | 5.974351 |
2585 | 39507 | The Dead | 1987.0 | 5.968177 |
2375 | 16094 | House Party | 1990.0 | 5.822369 |
13210 | 368031 | Unfriend | 2016.0 | 5.296565 |
7071 | 13991 | College | 2008.0 | 4.952362 |
id | original_title | year | wr | |
---|---|---|---|---|
10314 | 250658 | The Internet's Own Boy: The Story of Aaron Swartz | 2014.0 | 7.488402 |
10474 | 284427 | Who Am I - Kein System ist sicher | 2014.0 | 7.470599 |
13148 | 328387 | Nerve | 2016.0 | 7.079721 |
8755 | 76726 | Chronicle | 2012.0 | 6.582976 |
11800 | 317144 | Cyberbully | 2015.0 | 6.260476 |
9695 | 115782 | Jobs | 2013.0 | 5.984373 |
5276 | 10429 | Takedown | 2000.0 | 5.974351 |
2559 | 9989 | Antitrust | 2001.0 | 5.723328 |
114 | 9886 | Johnny Mnemonic | 1995.0 | 5.484253 |
13210 | 368031 | Unfriend | 2016.0 | 5.296565 |
Collaborative Filtering¶
We will now set up a recommender based on collaborative filtering. The idea is to recommend films based on what other users with a profile similar to yours liked. We are going to use surprise, a package for recommender systems.
We must first match the IDs between the ratings table and our movies table.
match = pd.read_csv('links_small.csv')
match = match.dropna(how="any")
match['tmdbId'] = match['tmdbId'].astype('Int64')#.astype(str)
match.rename(columns={'tmdbId':'id'},inplace=True)
match.set_index('id',inplace=True)
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate, GridSearchCV
reader = Reader()
For such a recommendation task, we need to gather users' ratings. The following dataset is precisely that:
ratings = pd.read_csv('ratings_small.csv')
ratings
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 31 | 2.5 | 1260759144 |
1 | 1 | 1029 | 3.0 | 1260759179 |
2 | 1 | 1061 | 3.0 | 1260759182 |
3 | 1 | 1129 | 2.0 | 1260759185 |
4 | 1 | 1172 | 4.0 | 1260759205 |
... | ... | ... | ... | ... |
99999 | 671 | 6268 | 2.5 | 1065579370 |
100000 | 671 | 6269 | 4.0 | 1065149201 |
100001 | 671 | 6365 | 4.0 | 1070940363 |
100002 | 671 | 6385 | 2.5 | 1070979663 |
100003 | 671 | 6565 | 3.5 | 1074784724 |
100004 rows × 4 columns
Now I need to align my ratings with the available ratings so that I become a user among the others. For that, I set userId = -1, which is meant to be me, and rescale my ratings from a 10-point scale down to the 5-point scale used here. I also have to match on ['original_title', 'year'] to get the movieId and the imdbId.
inner_df = pd.merge(my_notes,movies, on=['original_title','year'],how='inner')
inner_df['id'] = inner_df.id.astype("int64",errors='ignore')
#inner_df= inner_df[['original_title','year','note','id']]
#inner_df.drop_duplicates(inplace=True)
inner_df['rating'] = inner_df['note']/2
inner_df['userId'] = -1
inner_df = inner_df.merge(match.reset_index(),how='inner',on='id')
We will train an SVD algorithm on our dataset of ratings.
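For reference, the SVD implementation in surprise predicts a rating as a biased matrix factorization (this is the standard formulation documented by the library):

$$ \hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u $$

where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$, $q_i$ are the latent user and item factors learned by stochastic gradient descent. The number of epochs, learning rate and regularization tuned in the grid search below are exactly the hyperparameters of that fit.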
X = pd.concat([ratings[['userId', 'movieId', 'rating']],inner_df[['userId', 'movieId', 'rating']]])
data = Dataset.load_from_df(X, reader)
param_grid = {
"n_epochs": [5, 10],
"lr_all": [0.002, 0.005],
"reg_all": [0.4, 0.6]
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
0.9151389977377514 {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
Using gs.best_params and gs.best_score gave me an RMSE of about 0.91.
svd = gs.best_estimator['rmse']
svd.fit(data.build_full_trainset())
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x11717f06d00>
Now, we can loop through all our movies and compute an estimated rating for me, based on other users with similar tastes.
def helper(x, userId, algo):
    """Estimated rating of the movie with TMDB id x for userId, from the trained surprise model."""
    try:
        # map the TMDB id to its MovieLens movieId before asking the model for a prediction
        return algo.predict(userId, match['movieId'].to_dict()[x]).est
    except KeyError:
        # the movie is not in the links table, so no prediction is possible
        return np.nan
movies['est'] = movies['id'].apply(lambda x: helper(x,-1,svd))
movies
id | original_title | year | vote_count | vote_average | popularity | genres | wr | cast | crew | Director | Director of Photography | keywords | soup | est | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 862 | Toy Story | 1995.0 | 5415.0 | 7.7 | 21.946943 | 7.688717 | [{'cast_id': 14, 'character': 'Woody | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | johnlasseter | [jealousy, toy, boy, friendship, friends, riva... | jealousi toy boy friendship friend rivalri boy... | 3.643723 | ||
1 | 8844 | Jumanji | 1995.0 | 2413.0 | 6.9 | 17.015539 | 6.883028 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | joejohnston | thomase.ackerman | [board, game, disappearance, based, on, childr... | board game disappear base on children s book n... | 3.307559 | |
2 | 15602 | Grumpier Old Men | 1995.0 | 92.0 | 6.5 | 11.7129 | 6.231816 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | [{'credit_id': '52fe466a9251416c75077a89', 'de... | howarddeutch | [fishing, best, friend, duringcreditsstinger, ... | fish best friend duringcreditssting old men ro... | 3.104655 | ||
3 | 31357 | Waiting to Exhale | 1995.0 | 34.0 | 6.1 | 3.859495 | 5.737668 | [{'cast_id': 1, 'character': "Savannah 'Vannah... | [{'credit_id': '52fe44779251416c91011acb', 'de... | forestwhitaker | [based, on, novel, interracial, relationship, ... | base on novel interraci relationship singl mot... | 2.903563 | ||
4 | 11862 | Father of the Bride Part II | 1995.0 | 173.0 | 5.7 | 8.387519 | 5.642537 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | charlesshyer | elliotdavis | [baby, midlife, crisis, confidence, aging, dau... | babi midlif crisi confid age daughter mother d... | 3.183032 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14041 | 432789 | The Incredible Jessica James | 2017.0 | 37.0 | 6.2 | 5.667067 | 5.814878 | [{'cast_id': 6, 'character': 'Jessica James', ... | [{'credit_id': '586a688dc3a3680f4e017d65', 'de... | jimstrouse | seanmcelwee | romanc comedi seanmcelwee jimstrouse | NaN | ||
14042 | 455661 | In a Heartbeat | 2017.0 | 146.0 | 8.3 | 20.82178 | 7.853347 | [{'credit_id': '5981a15c92514151e0011b51', 'de... | bethdavid | love teenag lgbt short famili anim romanc come... | NaN | ||||
14043 | 14008 | Cadet Kelly | 2002.0 | 145.0 | 5.2 | 4.392389 | 5.206602 | [{'cast_id': 1, 'character': 'Kelly Collins', ... | [{'credit_id': '52fe45c29251416c75061803', 'de... | larryshaw | militari school comedi larryshaw | NaN | |||
14044 | 49279 | L'Homme à la tête de caoutchouc | 1901.0 | 29.0 | 7.6 | 1.618458 | 6.509674 | [{'cast_id': 2, 'character': '', 'credit_id': ... | [{'credit_id': '52fe478dc3a36847f813bd5f', 'de... | georgesméliès | [laboratory, mad, scientist, disembodied, head... | laboratori mad scientist disembodi head silent... | NaN | ||
14045 | 30840 | Robin Hood | 1991.0 | 26.0 | 5.7 | 5.683753 | 5.476910 | [{'cast_id': 1, 'character': 'Sir Robert Hode'... | [{'credit_id': '52fe44439251416c9100a899', 'de... | johnirvin | jasonlehel | drama action romanc jasonlehel johnirvin | NaN |
14046 rows × 15 columns
Hybrid¶
Final step: let's combine the two filtering techniques.
We will first filter movies by similarity, and then see what other users might recommend to me within this cluster.
def hybrid(df, userId, title, algo = svd):
ds = get_recommendations_CV(df, title, n=50)
df = df.loc[ds.index,['id','original_title','year','wr']]
#df = ds.reset_index().rename(columns={'index':'id'})
df['est'] = df['id'].apply(lambda x: helper(x,userId,algo))
df.dropna(axis=0,how='any', inplace=True)
df.sort_values('est', ascending=False, inplace =True)
df = df.head(10)
df['year'] = df['year'].astype(int)
df['link'] = df['id'].apply(lambda x: 'https://www.themoviedb.org/movie/{}'.format(str(x)))
return df.head(10)
def hybrid_bis(df, userId, title, algo = svd):
ds = get_recommendations_BERT(df, title, n=50)
df = df.loc[ds.index,['id','original_title','year','wr']]
#df = ds.reset_index().rename(columns={'index':'id'})
df['est'] = df['id'].apply(lambda x: helper(x,userId,algo))
df.dropna(axis=0,how='any', inplace=True)
df.sort_values('est', ascending=False, inplace =True)
df = df.head(10)
df['year'] = df['year'].astype(int)
df['link'] = df['id'].apply(lambda x: 'https://www.themoviedb.org/movie/{}'.format(str(x)))
return df.head(10)
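As a side note, hybrid and hybrid_bis differ only in the similarity function they call, so a single parameterized helper would avoid the duplication. A quick sketch (hybrid_generic is just an illustrative name, not used below):

def hybrid_generic(df, userId, title, algo=svd, recommender=get_recommendations_CV):
    """Similarity filtering (CountVectorizer or BERT) followed by collaborative re-ranking."""
    ds = recommender(df, title, n=50)
    out = df.loc[ds.index, ['id', 'original_title', 'year', 'wr']].copy()
    out['est'] = out['id'].apply(lambda x: helper(x, userId, algo))  # estimated rating for this user
    out = out.dropna(how='any').sort_values('est', ascending=False).head(10)
    out['year'] = out['year'].astype(int)
    out['link'] = out['id'].apply(lambda x: 'https://www.themoviedb.org/movie/{}'.format(x))
    return out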
response = hybrid(movies,-1, 'This Is Spinal Tap', algo = svd)
response2 = hybrid_bis(movies,-1, 'This Is Spinal Tap', algo = svd)
def make_clickable(val):
# target _blank to open new window
return '<a target="_blank" href="{}">{}</a>'.format(val, val)
display(response.style.format({'link': make_clickable}))
display(response2.style.format({'link': make_clickable}))
id | original_title | year | wr | est | link | |
---|---|---|---|---|---|---|
1931 | 11663 | The Commitments | 1991 | 7.056831 | 3.616117 | https://www.themoviedb.org/movie/11663 |
6523 | 5723 | Once | 2007 | 7.288221 | 3.575962 | https://www.themoviedb.org/movie/5723 |
2257 | 31516 | On the Town | 1949 | 6.348299 | 3.574190 | https://www.themoviedb.org/movie/31516 |
5711 | 15258 | The Aristocrats | 2005 | 5.737811 | 3.539981 | https://www.themoviedb.org/movie/15258 |
4531 | 9459 | Woodstock | 1970 | 6.595896 | 3.535589 | https://www.themoviedb.org/movie/9459 |
1682 | 11779 | Buena Vista Social Club | 1999 | 6.791311 | 3.505207 | https://www.themoviedb.org/movie/11779 |
658 | 12614 | Victor/Victoria | 1982 | 6.537039 | 3.503791 | https://www.themoviedb.org/movie/12614 |
4095 | 1584 | School of Rock | 2003 | 6.773714 | 3.484627 | https://www.themoviedb.org/movie/1584 |
4989 | 13671 | The Music Man | 1962 | 6.194756 | 3.483782 | https://www.themoviedb.org/movie/13671 |
6313 | 2179 | Tenacious D in The Pick of Destiny | 2006 | 6.429330 | 3.473225 | https://www.themoviedb.org/movie/2179 |
id | original_title | year | wr | est | link | |
---|---|---|---|---|---|---|
3728 | 11949 | Monty Python Live at the Hollywood Bowl | 1982 | 6.411652 | 3.546626 | https://www.themoviedb.org/movie/11949 |
4531 | 9459 | Woodstock | 1970 | 6.595896 | 3.535589 | https://www.themoviedb.org/movie/9459 |
5818 | 19082 | No Direction Home: Bob Dylan | 2005 | 6.527973 | 3.517776 | https://www.themoviedb.org/movie/19082 |
4439 | 132 | The Rolling Stones: Gimme Shelter | 1970 | 6.846603 | 3.510834 | https://www.themoviedb.org/movie/132 |
1634 | 6396 | SLC Punk | 1998 | 6.864597 | 3.496494 | https://www.themoviedb.org/movie/6396 |
2223 | 27745 | The Filth and the Fury | 2000 | 6.348563 | 3.494963 | https://www.themoviedb.org/movie/27745 |
4095 | 1584 | School of Rock | 2003 | 6.773714 | 3.484627 | https://www.themoviedb.org/movie/1584 |
5707 | 1665 | Last Days | 2005 | 5.288520 | 3.482124 | https://www.themoviedb.org/movie/1665 |
6313 | 2179 | Tenacious D in The Pick of Destiny | 2006 | 6.429330 | 3.473225 | https://www.themoviedb.org/movie/2179 |
3014 | 27327 | Phantom of the Paradise | 1974 | 7.052559 | 3.463405 | https://www.themoviedb.org/movie/27327 |
Not too bad of a selection!