diff --git a/_posts/2024-12-13-price-control.md b/_posts/2024-12-13-price-control.md
new file mode 100644
index 0000000..d49c2bf
--- /dev/null
+++ b/_posts/2024-12-13-price-control.md
@@ -0,0 +1,315 @@
+---
+layout: distill
+title: Does Price Matter?
+description: Investigating the impact of price in Chatbot Arena
+giscus_comments: true
+date: 2024-12-13
+featured: false
+thumbnail: assets/img/blog/style_control/logo.png
+
+authors:
+ - name: Sophie Xie*
+ url: "https://www.linkedin.com/in/sxie2/"
+ affiliations:
+ name: UC Berkeley
+ - name: Anastasios Angelopoulos*
+ url: "http://angelopoulos.ai"
+ - name: Wei-Lin Chiang*
+ url: "https://infwinston.github.io/"
+ - name: Jackie Lian*
+---
+
+When deciding which model to use, one of the biggest factors people consider is _price_. Which model gives me the best bang for my buck? Which model is the most cost-effective?
+
+Today, we are introducing price analysis to Chatbot Arena to reveal the most cost-effective models!
+
+The strongest models in Chatbot Arena also tend to be the most expensive: smaller models are cheaper to run, and thus can be offered at a lower price, but they usually perform worse. In other words, the models are not playing on a level field. To understand which models are the best, we have to adjust for this fact.
+
+One way to make this comparison is to plot the Pareto frontier between cost and performance. We have included this plot below, and it provides an information-rich signal about which models are best. **Check it out!**
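To make the idea concrete, here is a minimal sketch of how such a cost-performance frontier can be computed. The model names, prices, and scores below are invented for illustration and are not real Arena data:

```python
# Hypothetical sketch: finding the cost-performance Pareto frontier.
# Names, prices, and scores are made-up illustrative numbers.

def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return the models not dominated by any cheaper-and-stronger model.

    `models` maps name -> (price_per_1m_tokens, arena_score).
    A model is on the frontier if no other model is at most as
    expensive and strictly stronger.
    """
    frontier = []
    for name, (price, score) in models.items():
        dominated = any(
            other_price <= price and other_score > score
            for other, (other_price, other_score) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

example = {
    "big-expensive": (15.0, 1300),
    "mid-balanced": (3.0, 1280),
    "small-cheap": (0.3, 1250),
    "overpriced": (10.0, 1260),  # dominated by mid-balanced
}
```

Every model on the frontier is the strongest option at its price point; "overpriced" is excluded because "mid-balanced" is both cheaper and stronger.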
+
+
+
+
+Figure 1. Pareto frontier plot between cost and performance. For an interactive version of the plot, check out Chatbot Arena’s Arena Explorer tab.
+
+Although the cost-performance Pareto frontier gives us a full description of the interactions between cost and performance, we might also want to perform an _explicit_ adjustment for price. In other words, can we check which models exhibit surprisingly strong performance, controlling for the effect of price? For this, we introduce a _price-controlled_ leaderboard. **See it below!**
+
+### Overall Ranking + Price Control
+
+Figure 2. Overall Chatbot Arena ranking vs. overall Chatbot Arena ranking where price is “controlled”.
+
+The changes in the leaderboard are relatively drastic, as the price of a model is a good predictor of its strength. The effect of a model’s identity, adjusted for its price, is therefore generally smaller (hence the Arena Scores all go down), and the more expensive models drop more drastically than the cheaper ones. Models like Claude 3.5 Sonnet fell in the rankings, while models like Gemini-1.5-Flash-Exp-0827 rose. In the hard prompt subset, models like gpt-4-turbo-2024-04-09 fell in the rankings, while smaller, open-source models like athene-70b-0725 rose.
+
+### Hard Prompt Ranking + Price Control
+
+Figure 3. Hard Prompt category ranking vs. Hard Prompt category ranking where price is "controlled".
+
+Before launching into the methodology, it is worth saying that, as in our last post on [style control](https://blog.lmarena.ai/blog/2024/style-control/), this analysis is not causal without strong parametric assumptions. It shows how the rankings shift when we adjust for price, not the effect of intervening on a model’s price.
+
+### Full Leaderboard with Price Control
+
+The leaderboard is still in progress.
+
+Please find below the links to the leaderboard and Colab notebook.
+
+- Leaderboard [link](https://lmarena.ai/?leaderboard)
+- Colab [link](https://colab.research.google.com/drive/15M9Kng_eeS0VglpGJHu_VyaDgKyU6XrO?usp=sharing)
+
+## Methodology
+
+To control for price, we gathered pricing data for each model. For publicly listed models, we used the available prices directly. For models without public pricing (typically smaller, open-source models), we estimated costs based on model size and a third-party pricing framework (e.g., together.ai). Battles involving models without public size or pricing data (e.g., Grok) were excluded from the dataset.
+
+Our current pricing information is available here. We encourage and appreciate contributions of updated pricing data to keep our information accurate and current.
+
+Similarly to how [style](https://blog.lmarena.ai/blog/2024/style-control/) was controlled for when determining style’s effect on Arena Score, we explicitly modeled price as an independent variable in Chatbot Arena’s Bradley-Terry regression. We define our price feature as the normalized difference between model A’s and model B’s total response prices. More formally, our price feature is:
+
+\begin{equation}
+\text{normalize }\left(\frac{\text{total_response_price}\_A - \text{total_response_price}\_B}{\text{total_response_price}\_A + \text{total_response_price}\_B}\right)
+\end{equation}
+
+where total_response_price is the number of tokens in the model’s response multiplied by the model’s price per token.
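As a concrete illustration, the per-battle feature (before the dataset-level normalization) can be sketched as follows. The token counts and per-token prices are made-up numbers, not real model pricing:

```python
# Hypothetical sketch of the per-battle price feature described above.
# Token counts and per-token prices are made up for illustration; the
# dataset-level `normalize` step (e.g., standardization) is omitted.

def total_response_price(n_output_tokens: int, price_per_token: float) -> float:
    """Price of one response: output tokens times price per token."""
    return n_output_tokens * price_per_token

def price_feature(price_a: float, price_b: float) -> float:
    """Signed, scale-free price difference in [-1, 1].

    Positive when model A's response cost more than model B's.
    """
    denom = price_a + price_b
    if denom == 0:  # both responses free: no price signal
        return 0.0
    return (price_a - price_b) / denom

# Model A: 500 output tokens at $15 per 1M tokens; model B: 800 at $0.30 per 1M.
p_a = total_response_price(500, 15e-6)
p_b = total_response_price(800, 0.3e-6)
feat = price_feature(p_a, p_b)  # close to +1: A's response cost far more
```

In the actual regression, one such feature is computed per battle and appended to the Bradley-Terry design matrix.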
+
+We tested several different price features:
+
+1. Raw output token price
+2. Total response price
+3. Log of total response price
+4. Indicator (1 if output_token_price_A > output_token_price_B else 0)
+
+Below is a table of the coefficients for each price attribute across the different methods of controlling for price. We determined that the difference in total response price was the most effective gauge of price.
+
+|   | Price | Total Response Price | Log Price | Indicator |
+|---|---|---|---|---|
+| Control All | 0.022 | 0.117 | -0.0 | -0.0 |
+| Control Raw Price Only | 0.01 | - | - | - |
+| Control Total Response Price Only | - | 0.117 | - | - |
+| Control Log Price Only | - | - | 0.0 | - |
+| Indicator Only | - | - | - | 0.005 |
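For intuition, coefficients like those above come from appending the price feature as one extra column to the Bradley-Terry logistic regression, as in style control. Below is a minimal sketch on synthetic battles; the model strengths, the price effect, and the use of `scikit-learn` are all illustrative assumptions, not the production fitting code:

```python
# Synthetic sketch: Bradley-Terry regression with a price covariate.
# All battle data is simulated; only the structure mirrors the method.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_models, n_battles = 3, 5000

idx_a = rng.integers(0, n_models, n_battles)
idx_b = rng.integers(0, n_models, n_battles)
keep = idx_a != idx_b                    # drop self-battles
idx_a, idx_b = idx_a[keep], idx_b[keep]
n = idx_a.size

# Design matrix: one +1/-1 column per model, plus the price feature.
X = np.zeros((n, n_models + 1))
X[np.arange(n), idx_a] = 1.0
X[np.arange(n), idx_b] = -1.0
X[:, -1] = rng.uniform(-1, 1, n)         # stand-in normalized price diff

# Simulate outcomes: stronger models and pricier responses win more often.
true_strength = np.array([1.0, 0.0, -1.0])
logits = true_strength[idx_a] - true_strength[idx_b] + 0.5 * X[:, -1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)  # 1 = A wins

fit = LogisticRegression(fit_intercept=False).fit(X, y)
model_coefs = fit.coef_[0, :-1]          # price-adjusted model strengths
price_coef = fit.coef_[0, -1]            # analogous to the tabled coefficients
```

A positive `price_coef` means pricier responses win more often even after accounting for model identity, which is exactly what the price-controlled leaderboard subtracts out.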
+
+We also fit the price feature jointly with the style features from [style control](https://blog.lmarena.ai/blog/2024/style-control/). Controlling for style shrinks the price coefficient (0.117 to 0.055), indicating that style accounts for part of price’s apparent effect:
+
+|   | Total Response Price | Length | Markdown List | Markdown Header | Markdown Bold |
+|---|---|---|---|---|---|
+| Control All | 0.055 | 0.243 | 0.034 | 0.019 | 0.024 |
+| Control Price Only | 0.117 | - | - | - | - |
+| Control Length Only | - | 0.270 | - | - | - |
+| Control Markdown Only | - | - | 0.115 | 0.042 | 0.055 |
+
+For reference, the table below lists each model’s rank before and after controlling for the style features (length only, markdown only, or both):
+
+| Model | Rank Diff (Length Only) | Rank Diff (Markdown Only) | Rank Diff (Both) |
+|---|---|---|---|
+| chatgpt-4o-latest | 1->1 | 1->1 | 1->1 |
+| gemini-1.5-pro-exp-0827 | 2->2 | 2->2 | 2->2 |
+| gemini-1.5-pro-exp-0801 | 2->2 | 2->2 | 2->2 |
+| gpt-4o-2024-05-13 | 5->3 | 5->3 | 5->2 |
+| claude-3-5-sonnet-20240620 | 6->5 | 6->4 | 6->4 |
+| gemini-advanced-0514 | 7->5 | 7->8 | 7->6 |
+| grok-2-2024-08-13 | 2->4 | 2->4 | 2->5 |
+| llama-3.1-405b-instruct | 6->6 | 6->4 | 6->6 |
+| gpt-4o-2024-08-06 | 7->6 | 7->8 | 7->6 |
+| gpt-4-turbo-2024-04-09 | 11->8 | 11->8 | 11->9 |
+| claude-3-opus-20240229 | 16->14 | 16->8 | 16->10 |
+| gemini-1.5-pro-api-0514 | 10->8 | 10->13 | 10->10 |
+| gemini-1.5-flash-exp-0827 | 6->8 | 6->9 | 6->9 |
+| gpt-4-1106-preview | 16->14 | 16->8 | 16->11 |
+| gpt-4o-mini-2024-07-18 | 6->8 | 6->11 | 6->11 |
+| gpt-4-0125-preview | 17->14 | 17->12 | 17->13 |
+| mistral-large-2407 | 16->14 | 16->13 | 16->13 |
+| athene-70b-0725 | 16->16 | 16->17 | 16->17 |
+| grok-2-mini-2024-08-13 | 6->15 | 6->15 | 6->18 |
+| gemini-1.5-pro-api-0409-preview | 11->16 | 11->21 | 11->18 |