Yelp, founded in 2004, is a major online review platform that influences consumer decisions by allowing users to rate and review businesses. Research suggests that consumers value reviews under certain conditions and rely on cues like reviewer expertise, consistency, and overall rating trends to assess credibility (Fogel & Zachariah, 2017). Additionally, reviews are perceived as more trustworthy when they align with majority opinions and come from experienced sources, such as Yelp Elite users, whose status signals credibility and influence (Lim & Van Der Heide, 2015). These findings highlight how Yelp's user-generated content shapes consumer trust and business reputations. In this project, we analyze how reviews written by different types of users affect the ratings that businesses receive.
In this section we provide the research questions for this project. For the motivation behind the questions we refer you to the report.
Is there a correlation between Yelp user type and the rating given to a business?
- Does the number of fans an elite Yelp user has moderate the relationship between user type and the rating a business receives from this user?
- Does the relationship differ when analyzing only restaurants compared to non-restaurant businesses?
The data set has 14 columns. Their meaning is explained in the following table:
Variable | Data Type | Explanation |
---|---|---|
fans | Numeric integer | The number of fans a user has. |
user_id | Character string | Unique, 22 character long ID that defines which user wrote the review. |
review_count_users | Numeric float | The total number of reviews the user has written. |
elite | Numeric integer | Long integer showing all the years a user was elite. If a user was never an elite, it shows as NA. |
average_stars | Numeric float | Average rating of all reviews a user has given in the past. |
review_id | Character string | Unique, 22 character long ID that defines the review. |
business_id | Character string | Unique, 22 character long ID that defines the business for which the review was given. |
stars_users | Numeric integer | Star rating that was given by the user with the review. |
categories | Array of strings | Array of strings which includes the categories a business has. |
attributes | Object | Business attributes to logic values. Please note: some attributes might be objects of attributes to logic values again. |
name | Character string | Business' name. |
stars_business | Numeric float | Average star rating of a business across all reviews it has received, rounded to half-stars. |
review_count_business | Numeric integer | The total number of reviews a business has received. |
elite_binary | Binary | A binary value indicating whether the user has ever, at least once, had the elite status before (1) or not (0). |
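The relationship between `elite` and `elite_binary` described in the table can be sketched as follows. This is a minimal illustration on made-up values, assuming that `elite` is `NA` for users who were never elite; the actual derivation in the project's data preparation script may differ.

```r
# Toy data (hypothetical values): `elite` encodes the years a user was elite,
# and is NA if the user was never elite
users <- data.frame(
  user_id = c("u1", "u2", "u3"),
  elite   = c(20152016, NA, 2018)
)

# 1 if the user has held elite status at least once, 0 otherwise
users$elite_binary <- ifelse(is.na(users$elite), 0L, 1L)

print(users)
```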
The methods used in this project for answering our research questions are:
- Simple Linear Regression (T-Test)
- Moderated Multiple Linear Regression
- Two-way ANOVA
For a more extensive explanation on how these methods will be used, please refer to the report.
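As a rough orientation, the three methods could be specified in R along the following lines. This is a sketch on simulated data, not the project's actual `analysis.R`: the column `is_restaurant`, the simulated effect sizes, and the exact model formulas are assumptions made for illustration.

```r
set.seed(42)  # reproducible toy data

n <- 400
toy <- data.frame(
  elite_binary  = rbinom(n, 1, 0.3),   # 1 = has been elite, 0 = never elite
  fans          = rpois(n, 5),         # hypothetical fan counts
  is_restaurant = rbinom(n, 1, 0.5)    # hypothetical category flag
)
# Simulated star ratings, clamped to the 1-5 range
toy$stars_users <- pmin(5, pmax(1, round(3.5 + 0.3 * toy$elite_binary + rnorm(n))))

# 1. Simple linear regression of rating on user type
m1 <- lm(stars_users ~ elite_binary, data = toy)

# With a binary predictor, the slope test matches a pooled-variance t-test
tt <- t.test(stars_users ~ elite_binary, data = toy, var.equal = TRUE)

# 2. Moderated multiple regression: `*` adds both main effects and the
#    elite_binary:fans interaction term
m2 <- lm(stars_users ~ elite_binary * fans, data = toy)

# 3. Two-way ANOVA with user type and business category as factors
m3 <- aov(stars_users ~ factor(elite_binary) * factor(is_restaurant), data = toy)

print(summary(m1))
print(summary(m2))
print(summary(m3))
```

In this sketch, the interaction coefficient in `m2` and the interaction row in the ANOVA table of `m3` are the terms that speak to the two moderation questions.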
Overall, our analysis reveals a nuanced relationship between Yelp user type and business ratings. Using simple linear regression and a t-test, we found that elite users tend to give more favorable and more consistent ratings than non-elite users.
Plot 1 visualizes the difference between the ratings given to businesses by elite and non-elite users.
A moderated multiple linear regression further shows that the number of fans an elite user has can influence this relationship: elite status tends to boost ratings, but the size of this effect depends on the user's fanbase.
Plot 2 visualizes the moderation effect of different fan percentiles on the relationship between user status and the ratings given to a business.
Finally, a two-way ANOVA indicates that the effect of elite status on ratings varies by business category, with distinct patterns emerging when comparing restaurants to non-restaurant businesses.
Plot 3 visualizes the moderation effect of business category on the relationship between user status and the ratings given to a business.
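To make the restaurant versus non-restaurant comparison concrete, the sketch below derives a category flag from `categories` and computes the mean rating per combination of user type and category. It runs on made-up rows and assumes that restaurants carry the category label "Restaurants" and that `categories` is stored as a list column; both are assumptions, not confirmed details of the project's code.

```r
# Toy reviews (hypothetical values)
reviews <- data.frame(
  stars_users  = c(5, 4, 3, 2, 4, 5),
  elite_binary = c(1, 0, 1, 0, 1, 0)
)
# `categories` as a list column: one character vector per review
reviews$categories <- list(
  c("Restaurants", "Italian"),
  c("Restaurants", "Pizza"),
  c("Hair Salons"),
  c("Gyms", "Fitness & Instruction"),
  c("Restaurants"),
  c("Hotels")
)

# TRUE if the business lists "Restaurants" among its categories
reviews$is_restaurant <- vapply(
  reviews$categories,
  function(x) "Restaurants" %in% x,
  logical(1)
)

# Mean rating for each user type x business category cell
cell_means <- aggregate(
  stars_users ~ elite_binary + is_restaurant,
  data = reviews,
  FUN  = mean
)
print(cell_means)
```

If the gap between elite and non-elite means differs between the two category groups, that is the pattern an interaction term in a two-way ANOVA would pick up.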
The end product of this project is deployed as an integrated analytical report and interactive dashboard, offering clear tables and visualizations that help businesses and platforms like Yelp interpret user feedback more effectively.
For a more elaborate explanation of the results, please refer to the report.
This research holds significance for multiple stakeholders, including consumers, business owners, online review platforms, and the academic community.
For consumers, the findings provide insights into how different types of reviewers shape business ratings, enabling them to make more informed purchasing decisions. By understanding the influence of elite and non-elite users, as well as the role of reviewer popularity, consumers can better assess the credibility of ratings and reviews before choosing a business. This knowledge allows them to navigate potential biases in online reviews, ensuring that their decisions are based on a more accurate representation of a business’s quality and service.
For business owners, this study highlights the factors that influence online reputation, helping them better understand how different types of reviewers affect consumer perceptions. By recognizing the impact of elite reviewers and highly followed users, businesses can refine their engagement strategies, respond more effectively to feedback, and leverage online reviews to build stronger customer relationships. Understanding these dynamics can also help businesses anticipate potential biases in ratings and adjust their marketing efforts accordingly.
Online review platforms, such as Yelp, can also benefit from this research by gaining deeper insights into how reviewer characteristics shape rating distributions. These findings can inform platform policies regarding elite status designations, ranking algorithms, and review visibility, ultimately improving the fairness and reliability of their rating systems. Platforms can use this research to enhance user trust and engagement, ensuring that reviews provide an accurate reflection of business quality.
For the academic community, this study contributes to the broader literature on online reviews, consumer purchasing behavior, and the influence of digital opinion leaders. By analyzing how elite status and reviewer influence impact business ratings, this research deepens the understanding of social dynamics in digital review platforms. It provides a foundation for future studies exploring trust in user-generated content, platform design, and the psychological factors that drive consumer decision-making. As online reviews continue to play a critical role in e-commerce and digital marketing, these findings offer valuable perspectives for researchers in marketing, behavioral economics, and information systems.
One limitation of this study is that it categorizes businesses into only two broad groups: restaurants and non-restaurants. While this approach provides a general understanding of how reviewer type influences ratings across different business types, it overlooks potential nuances within more specific industries. For example, customer expectations and rating behaviors may differ significantly between hotels, rental services, beauty salons, or fitness centers. A more granular analysis of distinct business categories could offer deeper insights into how elite and non-elite reviewers interact with different types of businesses.
Additionally, due to the immense size of the dataset, this study analyzes a reduced sample of 10,000 Yelp users rather than the full dataset. While this sample allows for computational efficiency and meaningful statistical analysis, it may not fully capture the broader trends present in the complete dataset. Future research could expand this work by utilizing the entire Yelp dataset, ensuring a more representative analysis and potentially uncovering additional patterns in reviewer behavior.
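Sampling at the user level, as described above, can be sketched as follows. This is an illustration on toy data with a hypothetical `sample_size` of 10 instead of 10,000; the seed and the exact sampling procedure used in the project are not stated here, so treat them as assumptions.

```r
set.seed(123)  # fixed seed so the same users are drawn on every run

# Toy review table: 500 reviews written by up to 50 hypothetical users
reviews <- data.frame(
  user_id     = sample(sprintf("user_%05d", 1:50), 500, replace = TRUE),
  stars_users = sample(1:5, 500, replace = TRUE)
)

sample_size <- 10  # the study uses 10,000 users

# Draw users (not individual reviews), then keep all reviews of those users
all_users      <- unique(reviews$user_id)
sampled_users  <- sample(all_users, min(sample_size, length(all_users)))
reviews_sample <- reviews[reviews$user_id %in% sampled_users, ]

print(length(unique(reviews_sample$user_id)))
```

Sampling users rather than rows keeps each selected user's full review history intact, which matters when user-level variables such as `fans` or `elite_binary` are part of the analysis.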
The diagram below illustrates the repository structure.
├── README.md
├── makefile
├── .gitignore
├── raw data
│ ├── load-packages.R
│ ├── download-data.R
│ ├── data-cleaning.R
│ ├── final_data.R
├── reporting
│ ├── report.Rmd
│ ├── start_app.R
├── src
│ ├── analysis
│ │ ├── analysis.R
│ ├── data-preparation
│ │ ├── Data_exploration.Rmd
│ │ ├── Data-preparation.R
In order to run the code for this project the following packages should be installed and loaded in R. To install the packages, please run this code:
install.packages("googledrive")
install.packages("dplyr")
install.packages("readr")
install.packages("data.table")
install.packages("httr")
install.packages("ggplot2")
install.packages("tidyverse")
install.packages("tinytex")
install.packages("knitr")
install.packages("car")
install.packages("effsize")
Then to load the packages, please run this code:
library(googledrive)
library(dplyr)
library(readr)
library(data.table)
library(httr)
library(ggplot2)
library(tidyverse)
library(tinytex)
library(knitr)
library(car)
library(effsize)
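As an alternative to the two blocks above, the following sketch installs only the packages that are not yet present and then loads all of them. The helper names `missing_packages` and `install_and_load` are hypothetical and not part of the repository.

```r
# Packages listed in this section
required <- c(
  "googledrive", "dplyr", "readr", "data.table", "httr", "ggplot2",
  "tidyverse", "tinytex", "knitr", "car", "effsize"
)

# Return the subset of `pkgs` that is not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Install whatever is missing, then attach every package
install_and_load <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage (uncomment to run; may download packages from CRAN):
# install_and_load(required)
```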
For the workflow to run properly, follow the steps below in order. Please note that steps 2 and 3 take a lot of time and storage. They create the final data set used for this project and can be skipped, since the final data set is also loaded directly in step 4.
1. Run load-packages.R OR run the code in the Dependencies section.
2. Run download-data.R (optional).
3. Run data-cleaning.R (optional).
4. Run final-data.R to load the data set that will be used for the project.
5. Run Data-preparation.R to prepare the data for the exploration and analysis.
6. Run the Data-exploration.Rmd file to get to know the data set.
7. Run analysis.R to perform the data analysis that answers the research questions.
This project is set up as part of the Master's course Data Preparation & Workflow Management at the Department of Marketing, Tilburg University, the Netherlands.
The project is implemented by team 9, which consists of the following members:
- Mitsal Athaya Minantoputra (2153569)
- Amartya Iqra Akhlaqi (2099128)
- Naomi Parmentier (2053479)
- Niusha Amri (2149204)
- Lan Vu (2055251)
- Fogel, J., & Zachariah, S. (2017). Intentions to use the Yelp review website and purchase behavior after reading reviews. Journal of Theoretical and Applied Electronic Commerce Research, 12(1), 17–30. https://doi.org/10.4067/S0718-18762017000100005
- Karaca-Mandic, P., Norton, E. C., & Dowd, B. (2012). Interaction terms in nonlinear models. Health Services Research, 47(1 Pt 1), 255–274. https://doi.org/10.1111/j.1475-6773.2011.01314.x
- Lim, Y., & Van Der Heide, B. (2015). Evaluating the wisdom of strangers: The perceived credibility of online consumer reviews on Yelp. Journal of Computer-Mediated Communication, 20(1), 67–82. https://doi.org/10.1111/jcc4.12093
- Luca, M. (2016). Reviews, reputation, and revenue: The case of Yelp.com. Harvard Business School NOM Unit Working Paper No. 12-016. https://doi.org/10.2139/ssrn.1928601
- Luepsen, H. (2023). ANOVA with binary variables: The F-test and some alternatives. Communications in Statistics - Simulation and Computation, 52(3), 745–769. https://doi.org/10.1080/03610918.2020.1869983
- Moe, W. W., & Trusov, M. (2011). The value of social dynamics in online product ratings forums. Journal of Marketing Research, 48(3), 444–456. https://doi.org/10.1509/jmkr.48.3.444
- Su, X., Yan, X., & Tsai, C.-L. (2012). Linear regression. Wiley Interdisciplinary Reviews: Computational Statistics, 4(3), 275–294. https://doi.org/10.1002/wics.1198
- Su, Y., Gao, X., Li, X., & Tao, D. (2012). Multivariate multilinear regression. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(6). https://doi.org/10.1109/TSMCB.2012.2195171