This repository contains the code, data, and report for the project "Comparative Analysis of Collaborative Filtering Algorithms on MovieLens 100K Dataset". The project explores and compares the performance of various collaborative filtering algorithms, specifically matrix factorisation and Transformer-based models, on the MovieLens 100K dataset.
- Abstract
- Introduction
- Data
- Methods
- Exploratory Data Analysis
- Models
- Limitations
- Conclusion
- Usage
- Acknowledgements
The report addresses the task of recommending movies to users by comparing matrix factorisation methods with Transformer-based models: two matrix factorisation variants and three Transformer implementations are evaluated on how well they predict user ratings on the MovieLens 100K dataset.
The project aims to find the best recommendation system by:
- Assessing the performance of traditional methods.
- Comparing less intuitive techniques like Transformers with basic factorisation methods.
- Investigating potential improvements to the Transformer architecture.
The dataset used is the MovieLens 100K dataset, which contains 100,836 ratings from 610 users on 9,742 movies. It includes the following files (a loading snippet follows the list):
- `ratings.csv`: User ratings for movies.
- `movies.csv`: Movie details.
- `tags.csv`: User-defined tags for movies.
- `links.csv`: Links to IMDb and TMDb.
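As a quick orientation, the snippet below loads the four files with Pandas. The `data/` directory is an assumption and may differ from the repository's actual layout; the column names follow the standard MovieLens schema.

```python
import pandas as pd

# Paths assume the CSVs sit in a local "data/" directory (hypothetical layout).
ratings = pd.read_csv("data/ratings.csv")  # userId, movieId, rating, timestamp
movies = pd.read_csv("data/movies.csv")    # movieId, title, genres
tags = pd.read_csv("data/tags.csv")        # userId, movieId, tag, timestamp
links = pd.read_csv("data/links.csv")      # movieId, imdbId, tmdbId

print(ratings.shape)  # expect (100836, 4)
```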
We used Python packages such as PyTorch, NumPy, Pandas, Statsmodels, WordCloud, Matplotlib, and Seaborn. The following methods were applied:
- Matrix Factorisation: Used for dimensionality reduction, learning latent user and item factor matrices.
- Transformers: Utilised for their self-attention mechanism to find patterns in user-item interactions and temporal sequences (a minimal attention sketch follows this list).
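To make the second point concrete, here is a minimal sketch of scaled dot-product self-attention, the core mechanism the Transformer models rely on. It is illustrative only (no learned projections) and not the exact code from the report.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x):
    """Minimal scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) tensor, e.g. embedded user-item interactions.
    Each output position is a weighted mix of all positions, with
    weights given by query-key similarity.
    """
    d_model = x.size(-1)
    q, k, v = x, x, x  # skip learned projections to keep the sketch small
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

seq = torch.randn(5, 16)  # five interactions, 16-dim embeddings
out = self_attention(seq)  # same shape as the input
```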
The exploratory data analysis section includes plots such as histograms, line graphs, and word clouds to provide insights into the dataset.
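For example, the distribution of ratings can be plotted in a few lines. This is a sketch, assuming the `ratings` DataFrame loaded earlier; it is not the report's exact plotting code.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Ratings are given in 0.5-star steps, so ten bins cover 0.5-5.0.
sns.histplot(ratings["rating"], bins=10)
plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of ratings")
plt.show()
```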
Data preprocessing involved merging datasets, encoding user and movie identifiers, and splitting the data into training and validation sets.
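A sketch of that pipeline, using only the packages listed above; the column names follow the MovieLens schema, while the 90/10 split ratio and the random seed are assumptions.

```python
import numpy as np

# Merge ratings with movie details, then encode identifiers as
# contiguous integers, which embedding layers expect.
df = ratings.merge(movies, on="movieId")
df["user"] = df["userId"].astype("category").cat.codes
df["movie"] = df["movieId"].astype("category").cat.codes

# Shuffle once, then split into training and validation sets.
rng = np.random.default_rng(42)
idx = rng.permutation(len(df))
cut = int(0.9 * len(df))  # 90/10 split (assumed ratio)
train, valid = df.iloc[idx[:cut]], df.iloc[idx[cut:]]
```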
A baseline model that always predicts the median rating was created for performance comparison.
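A sketch of that baseline, assuming the `train`/`valid` split above and RMSE as the metric (the metric choice is an assumption; the report may use a different one):

```python
import numpy as np

# Predict the training-set median for every validation example.
median_rating = train["rating"].median()
errors = valid["rating"] - median_rating
rmse = np.sqrt((errors ** 2).mean())
print(f"Baseline RMSE: {rmse:.4f}")
```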
Two matrix factorisation models were implemented:
- Without Bias Terms: Basic matrix factorisation.
- With Bias Terms: Matrix factorisation with user and item bias terms (sketched below).
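A minimal PyTorch sketch of the second variant; the class name, embedding size, and global-mean offset are illustrative assumptions, not the report's exact code.

```python
import torch
import torch.nn as nn

class MatrixFactorisation(nn.Module):
    """Dot-product factorisation with user and item bias terms."""

    def __init__(self, n_users, n_movies, n_factors=50, global_mean=3.5):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)
        self.user_bias = nn.Embedding(n_users, 1)
        self.movie_bias = nn.Embedding(n_movies, 1)
        self.global_mean = global_mean  # assumed offset near the mean rating

    def forward(self, user, movie):
        # Predicted rating = global mean + user/item dot product + biases.
        dot = (self.user_factors(user) * self.movie_factors(movie)).sum(dim=1)
        bias = self.user_bias(user).squeeze(1) + self.movie_bias(movie).squeeze(1)
        return self.global_mean + dot + bias
```

Dropping the two bias embeddings recovers the first variant.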
Three Transformer models were created with different embeddings; a minimal sketch of Model 1 follows the list:
- Model 1: User and movie embeddings.
- Model 2: User, movie, and rating history embeddings.
- Model 3: User, movie, rating history, and movie genre embeddings.
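To give a feel for the first variant, the sketch below treats the user and movie embeddings as a two-token sequence, passes it through an encoder, and pools it into a rating prediction. All names and hyperparameters here are assumptions, not the report's exact architecture.

```python
import torch
import torch.nn as nn

class TransformerRater(nn.Module):
    """Sketch of Model 1: user and movie embeddings only."""

    def __init__(self, n_users, n_movies, d_model=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, d_model)
        self.movie_emb = nn.Embedding(n_movies, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, user, movie):
        # Stack the two embeddings as a length-2 "sequence" per example.
        seq = torch.stack([self.user_emb(user), self.movie_emb(movie)], dim=1)
        encoded = self.encoder(seq)          # (batch, 2, d_model)
        pooled = encoded.mean(dim=1)         # average over the two tokens
        return self.head(pooled).squeeze(1)  # predicted rating
```

Models 2 and 3 extend this shape by appending rating-history and genre embedding tokens to the sequence.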
The limitations include:
- Linearity assumptions in matrix factorisation.
- Handling of sparse data.
- Cold-start problem.
- Complexity and interpretability issues with Transformer models.
- Computational resource requirements.
The project concluded that Transformer-based models offer higher predictive accuracy than matrix factorisation models. The inclusion of bias terms and sequential data improves predictive performance. Future research may explore hybrid models that combine the strengths of both methodologies.
We acknowledge the use of GPT-4 (OpenAI) and Meta-Llama-3-70B for checking grammar, proofreading, and providing explanations on SGD factorisation and attention mechanisms. These tools also helped in editing plot margins and debugging data preprocessing and PyTorch training loops.
For more details, refer to the full report included in the repository.