This project is a Movie Recommendation Engine that provides personalized suggestions using six different machine learning algorithms. Each model uses structured movie metadata to learn what the user likes and predict similar content.
We start by converting raw movie metadata into a fixed 800-dimensional vector representation.
- Extract features from: `genres`, `cast`, `overview`, `keywords`, `director`
- Use TF-IDF vectorization on text-based columns.
- Reduce dimensionality (e.g., via PCA or Truncated SVD) to form 800-D vectors.
- TF-IDF score: `TF-IDF(t, d) = TF(t, d) × log(N / DF(t))`, where `t` = term, `d` = document, `N` = total number of documents, and `DF(t)` = document frequency of `t`.
➡️ Each movie becomes a numeric vector in ℝ⁸⁰⁰
➡️ These vectors are the basis for all models.
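
The TF-IDF score above can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: it assumes whitespace tokenization and takes TF as the term count divided by document length (both assumptions), and it leaves out the dimensionality-reduction step. The `docs` strings are hypothetical toy overviews.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using TF(t, d) * log(N / DF(t))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # DF(t): number of documents that contain term t at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Assumption: TF is the relative frequency of the term in the document.
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy "overview" strings, not taken from the project's dataset.
docs = [
    "space crew explores distant planet",
    "crew robs a bank",
    "detective hunts a bank robber",
]
weights = tf_idf(docs)
```

Terms that appear in fewer documents receive a larger weight, and a term present in every document gets a weight of zero because `log(N / N) = 0`.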
KNN finds the movies most similar to the ones the user liked, based on vector distance.
- Recommending movies close to the user's liked history in feature space.
- Cosine similarity: `sim(A, B) = (A · B) / (||A|| × ||B||)`
- Euclidean distance: `dist(A, B) = √(Σᵢ (Aᵢ - Bᵢ)²)`
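
Both measures are short enough to write out directly. The sketch below is illustrative only: `nearest_movies` and the `catalog` dictionary are hypothetical names, and 3-D vectors stand in for the 800-D movie vectors.

```python
import math

def cosine_similarity(a, b):
    """(A · B) / (||A|| × ||B||); returns 0.0 for zero vectors (a convention chosen here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_movies(liked_vector, catalog, k=2):
    """Return the k titles in catalog most cosine-similar to liked_vector."""
    ranked = sorted(
        catalog,
        key=lambda title: cosine_similarity(liked_vector, catalog[title]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 3-D vectors standing in for the 800-D movie vectors.
catalog = {
    "Movie A": [1.0, 0.0, 0.0],
    "Movie B": [0.9, 0.1, 0.0],
    "Movie C": [0.0, 1.0, 0.0],
}
liked = [1.0, 0.0, 0.0]
top = nearest_movies(liked, catalog, k=2)
```

A real recommender would likely exclude titles the user has already liked before ranking; that filtering is omitted here for brevity.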
Clustering groups movies into clusters of similar content.
- Recommending movies from the same cluster as liked ones.
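- K-Means objective: minimize the total within-cluster variance `Σᵢ Σ_{x ∈ Cᵢ} ||x - μᵢ||²`, where `μᵢ` is the centroid of cluster `Cᵢ`.
- Linkages in hierarchical clustering:
  - Single-link: minimum distance between any pair of points, one from each cluster
  - Complete-link: maximum distance between any pair of points, one from each cluster
  - Average-link: average pairwise distance between the two clusters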
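
As a rough sketch of how the K-Means objective is minimized, the code below alternates the two standard steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) and evaluates the objective. It also computes the three linkage distances between two clusters. All function names, the toy points, and the initial centroids are assumptions made for illustration.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_clusters(points, centroids):
    """Assign each point the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of its assigned points."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid unchanged
            continue
        dim = len(members[0])
        new_centroids.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return new_centroids

def objective(points, labels, centroids):
    """Total within-cluster squared distance to the assigned centroid."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))

def linkage_distances(cluster_a, cluster_b):
    """Return (single, complete, average) linkage distances between two clusters."""
    pairwise = [math.sqrt(squared_distance(a, b)) for a in cluster_a for b in cluster_b]
    return min(pairwise), max(pairwise), sum(pairwise) / len(pairwise)

# Hypothetical 2-D points; the initial centroids are an arbitrary guess.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids = [[0.0, 0.0], [10.0, 0.0]]
for _ in range(5):  # a fixed number of iterations, enough for this toy data
    labels = assign_clusters(points, centroids)
    centroids = update_centroids(points, labels, centroids)
final_cost = objective(points, labels, centroids)

single, complete, average = linkage_distances(points[:2], points[2:])
```

In practice the loop would usually stop when the labels no longer change rather than after a fixed number of passes.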
The perceptron is a shallow neural net that classifies whether a movie will be liked.
- Binary classification (like/dislike) from past data.
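- Input → Dense Layer → ReLU → Dense Layer → Sigmoid Output
- Output: `y = sigmoid(Wx + b)`, where `sigmoid(z) = 1 / (1 + e⁻ᶻ)`
- Loss function (binary cross-entropy): `L = -[y·log(p) + (1-y)·log(1-p)]`, where `y` here denotes the true label (1 = like, 0 = dislike) and `p` the predicted probability from the sigmoid output.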
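
The forward pass and loss can be traced by hand with a tiny network. The sketch below follows the Dense → ReLU → Dense → Sigmoid layout using plain lists; the weights in `params` are hypothetical hand-picked values, not trained ones, and the input is 2-D rather than 800-D.

```python
import math

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(z):
    """1 / (1 + e^-z), written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dense(weights, bias, inputs):
    """Compute Wx + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]

def predict(x, params):
    """Input -> Dense -> ReLU -> Dense -> Sigmoid; returns the 'like' probability."""
    hidden = relu(dense(params["W1"], params["b1"], x))
    logit = dense(params["W2"], params["b2"], hidden)[0]
    return sigmoid(logit)

def binary_cross_entropy(y, p, eps=1e-12):
    """-[y*log(p) + (1-y)*log(1-p)], with p clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical hand-picked weights for a 2-input, 2-hidden-unit network.
params = {
    "W1": [[1.0, -1.0], [0.5, 0.5]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, 1.0]],
    "b2": [-1.0],
}
p_like = predict([1.0, 1.0], params)
loss = binary_cross_entropy(1, p_like)
```

For this input the hidden activations are `[0.0, 1.0]`, the output logit is 0, and the predicted probability is 0.5, so the loss for a liked movie is `log(2)`. Training would adjust the weights to reduce this loss; that step is not shown.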
Naïve Bayes classifies based on conditional probabilities, assuming feature independence.
- Probabilistic prediction of movie preference.
- Bayes' theorem: `P(A | B) = (P(B | A) × P(A)) / P(B)`
- We predict the class (like/dislike) with the highest posterior.
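
To make the highest-posterior rule concrete, here is a minimal Bernoulli Naïve Bayes sketch. It assumes binary features (for example genre flags) and Laplace smoothing, both chosen for illustration; real-valued vectors like the 800-D ones would more naturally use a Gaussian likelihood instead. The function names and the toy data are hypothetical.

```python
import math

def train_naive_bayes(rows, labels, alpha=1.0):
    """Estimate class priors and per-feature Bernoulli likelihoods with Laplace smoothing."""
    model = {}
    n_features = len(rows[0])
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        prior = len(members) / len(rows)
        # P(feature_j = 1 | cls), smoothed so no probability is exactly 0 or 1.
        likelihoods = [
            (sum(r[j] for r in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(n_features)
        ]
        model[cls] = (prior, likelihoods)
    return model

def predict(model, row):
    """Return the class with the highest log-posterior.

    P(B) is dropped because it is the same for every class.
    """
    def log_posterior(cls):
        prior, likelihoods = model[cls]
        score = math.log(prior)
        for value, p in zip(row, likelihoods):
            # Independence assumption: per-feature log-likelihoods simply add up.
            score += math.log(p if value else 1.0 - p)
        return score
    return max(model, key=log_posterior)

# Hypothetical binary features: [is_scifi, is_comedy, has_favorite_actor]
rows = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
labels = ["like", "like", "dislike", "dislike"]
model = train_naive_bayes(rows, labels)
prediction = predict(model, [1, 0, 1])
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied together.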
The content-based model recommends movies that are textually or semantically similar to the user's liked movies.
- Matching metadata (genres, plot, actors) using vector space models.
- Same as cosine similarity: `sim(A, B) = (A · B) / (||A|| × ||B||)`
| Model | Type | Formula / Key Concept |
|---|---|---|
| TF-IDF + Vectors | Preprocessing | TF-IDF(t, d) = TF × log(N / DF) |
| KNN | Similarity-based | Cosine / Euclidean distance |
| K-Means Clustering | Unsupervised | Minimize Σᵢ Σ_{x ∈ Cᵢ} ‖x - μᵢ‖² |
| Perceptron | Neural Network | y = sigmoid(Wx + b) |
| Naïve Bayes | Probabilistic | P(A \| B) = P(B \| A) × P(A) / P(B) |
| Content-Based | Metadata Matching | Cosine similarity on text-based features |
- TMDB or similar movie dataset.
- Features include: Title, Genres, Cast, Overview, Keywords, Ratings, Popularity, etc.