#tagReddit

`Motivation:`

Reddit is a community-driven social platform with lots of interesting posts shared every day across a number of subreddits on a variety of topics. This is great for Reddit community but this also leads to some problems for the users. The users have to spend a significant amount of time going over all the posts in order to discover interesting posts of their choice/preference in a subreddit they follow. While exploring a new Subreddit, users would want to spend less time to get an overview of the subreddit community they are exploring. In this project I come up with - Reddit-Tags, that would help to enhance user experience and improve user engagement by showing popular tags. The inspiration for Reddit-Tags come from hashtags used in other social media platforms like Instagram and Twitter. Hashtags allow users to explore new contents corresponding to each tags thereby increasing user engagement and improving user experience on the platform.

In addition to improving the user experience, the tags can also be helpful for Ads team to show ads based on the keywords clicked by users.

`Objective:`

Generate popular tags(keywords) for each SubReddit for each month and year.
Index all the posts corresponding to each generated tag.
Allow user to see popular tags for a subreddit over the time.

`Future Ideas:`

Identify the most active SubReddits (Real Time Processing)
Index posts in real time
Top keywords being used in last one hour

`Business Case:`

Improve Customer Experience by generating tags for each subreddit and displaying corresponding list of posts
Displaying ads on relevant reddit channel
Seasonality insight can be used by Analyst and Marketing team in managing their social media/Ad campaigns

`Dataset:`

Data taken from Pushshift.io.
Size: 500 GB (taken from 2008 - 2015 for comments) Schema -
Data Stored in AWS S3 data lake in three separate buckets -

Raw Comments: Data taken from source, Cleaned Comments: Data generated after initial preprocessing, Frequent Words: Tags identified from the posts

`Data Pipeline`

1. AWS S3: Serve as datalake

2. Spark: 4 node cluster, 1 master-3workers. I have written two separate pipelines on Spark, One to clean the text data, second to generate frequent tags from the cleaned data.

3. Elasticsearch: Indexed comments data on Elasticsearch. Used Spark to transform data into NDJSON format and then loaded on ES using bulk upload API. Comments were indexed by year. Latency to query indexes vary from .15 second to 1.35 seconds with variation in index size.

4. Airflow: My cluster has used

5. Postgres:

`Cluster Setup`

The cluster was set up on AWS. All the services used in this project were set up by myself. I didn't use any managed services as they are very expensive and difficult to debug if you run into any errors. I used about 9 m4.large EC2 instances for the cluster.

`Engineering Challenge:`

Text data preprocessing and cleaning which include - removing stopword, removing punctuations, removing urls etc. Complexity: O(M*(N^2)) O(N^2) to compare two text bodies (comments vs stopwords) O(M) for all posts from reddit M= documents N= number of records
Spark performance improvement by increasing executor.memory from 1g to 6g, resulted in 5% increase in performance.
Implement incremental aggregations and avoid re-computation.

Name		Name	Last commit message	Last commit date
Latest commit History 84 Commits
airflow		airflow
cluster_setup		cluster_setup
elasticsearch		elasticsearch
img		img
postgres		postgres
sample_data		sample_data
spark		spark
upload_to_S3		upload_to_S3
webapp		webapp
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

#tagReddit

`Motivation:`

`Objective:`

`Future Ideas:`

`Business Case:`

`Dataset:`

`Data Pipeline`

`Cluster Setup`

`Engineering Challenge:`

`Presentation Slides:`

`Project Link:`

`Project Demo:`

About

Releases

Packages

Languages

insight-cc-bot/Insight_Data_Engineering_Project

Folders and files

Latest commit

History

Repository files navigation

#tagReddit

Motivation:

Objective:

Future Ideas:

Business Case:

Dataset:

Data Pipeline

Cluster Setup

Engineering Challenge:

Presentation Slides:

Project Link:

Project Demo:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`Motivation:`

`Objective:`

`Future Ideas:`

`Business Case:`

`Dataset:`

`Data Pipeline`

`Cluster Setup`

`Engineering Challenge:`

`Presentation Slides:`

`Project Link:`

`Project Demo:`

Packages