diff --git a/_posts/2024-12-13-price-control.md b/_posts/2024-12-13-price-control.md
new file mode 100644
index 0000000..d49c2bf
--- /dev/null
+++ b/_posts/2024-12-13-price-control.md
@@ -0,0 +1,315 @@
+---
+layout: distill
+title: Does Price Matter?
+description: Investigating the impact of price in Chatbot Arena
+giscus_comments: true
+date: 2024-12-13
+featured: false
+thumbnail: assets/img/blog/style_control/logo.png
+
+authors:
+  - name: Sophie Xie*
+    url: "https://www.linkedin.com/in/sxie2/"
+    affiliations:
+      name: UC Berkeley
+  - name: Anastasios Angelopoulos*
+    url: "http://angelopoulos.ai"
+  - name: Wei-Lin Chiang*
+    url: "https://infwinston.github.io/"
+  - name: Jackie Lian*
+---
+
+When deciding which model to use, one of the biggest factors people consider is _price_. Which model gives me the best bang for my buck? Which model is the most cost-effective?
+
+We are introducing price analysis to Chatbot Arena to reveal the most cost-effective models!
+
+The strongest models in Chatbot Arena also tend to be the most expensive, while smaller models are cheaper to run and can therefore be offered at a lower price. In other words, the models are not playing on a level field. To understand which models are the best, we have to adjust for this fact.
+
+One way to make this comparison is to plot the Pareto frontier between cost and performance. We have included this plot below, and it provides an information-rich signal about which models are best. **Check it out!**
+

Figure 1. Pareto frontier plot between cost and performance. For an interactive version of the plot, check out Chatbot Arena's Arena Explorer tab.
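
As a rough illustration of what "being on the Pareto frontier" means here, the sketch below keeps only models that no cheaper (or equally priced) model matches or beats on score. The function name, model names, prices, and scores are all hypothetical and are not taken from the Arena data.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (lower price, higher score).

    models: list of (name, price, score) tuples.
    """
    frontier = []
    best_score = float("-inf")
    # Sweep from cheapest to most expensive; on a price tie, higher score first.
    for name, price, score in sorted(models, key=lambda m: (m[1], -m[2])):
        # A model stays only if it beats every model that costs no more than it.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Hypothetical (name, price, Arena-style score) entries.
toy_models = [
    ("model-a", 0.5, 1100),
    ("model-b", 1.0, 1080),
    ("model-c", 3.0, 1200),
    ("model-d", 10.0, 1250),
    ("model-e", 12.0, 1240),
]

frontier = pareto_frontier(toy_models)
```

In this toy data, `model-b` and `model-e` fall off the frontier because a cheaper model already scores higher than each of them.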

+
+Although the cost-performance Pareto frontier gives us a full description of the interactions between cost and performance, we might also want to perform an _explicit_ adjustment for price. In other words, can we check which models exhibit surprisingly strong performance, controlling for the effect of price? For this, we introduce a _price-controlled_ leaderboard. **See it below!**
+
+### Overall Ranking + Price Control
+

Figure 2. Overall Chatbot Arena ranking vs. overall Chatbot Arena ranking where price is "controlled".

+
+The changes in the leaderboard are relatively drastic because a model's price is a good predictor of its strength. Once price is adjusted for, the effect of a model's identity is generally smaller (hence the Arena Scores all go down), and the more expensive models drop more sharply than the cheaper ones. Models like Claude 3.5 Sonnet dropped in the rankings, while models like Gemini-1.5-Flash-Exp-0827 rose. In the hard prompt subset, models like gpt-4-turbo-2024-04-09 dropped, while smaller, open-source models like athene-70b-0725 rose.
+
+### Hard Prompt Ranking + Price Control
+

Figure 3. Hard Prompt category ranking vs. Hard Prompt category ranking where price is "controlled".

+
+Before launching into the methodology, it is worth saying that, as in our last post on [style control](https://blog.lmarena.ai/blog/2024/style-control/), this analysis is not causal without strong parametric assumptions. It is better read as showing how the rankings shift once price is controlled for.
+
+### Full Leaderboard with Price Control
+
+The leaderboard is still in progress.
+
+The leaderboard and Colab notebook are linked below.
+
+- Leaderboard [link](https://lmarena.ai/?leaderboard)
+- Colab [link](https://colab.research.google.com/drive/15M9Kng_eeS0VglpGJHu_VyaDgKyU6XrO?usp=sharing)
+
+## Methodology
+
+To control for price, we gathered pricing data for each model. For publicly listed models, we used the available prices directly. For models without public pricing (typically smaller, open-source models), we estimated costs based on model size and a third-party pricing framework (together.ai). Battles involving models without public size or pricing data (e.g., Grok) were excluded from the dataset.
+
+Our current pricing information is available here. We encourage and appreciate contributions of updated pricing data to keep our information accurate and current.
+
+Similar to how we controlled for [style](https://blog.lmarena.ai/blog/2024/style-control/) when measuring its effect on Arena score, we explicitly modeled price as an independent variable in Chatbot Arena's Bradley-Terry regression. We define our price feature as the normalized relative difference between model A's and model B's total response prices. More formally, the price feature is:
+
+\begin{equation}
+\text{normalize }\left(\frac{\text{total_response_price}\_A - \text{total_response_price}\_B}{\text{total_response_price}\_A + \text{total_response_price}\_B}\right)
+\end{equation}
+
+Here, total_response_price is the number of tokens in the model's response multiplied by the model's price per token.
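
As a rough illustration of the price feature defined above, the sketch below computes each response's total price, takes the relative difference between the two sides of a battle, and then standardizes that value across battles. Reading `normalize` as a z-score is an assumption (the post does not spell it out), and the helper names, token counts, and per-token prices are hypothetical.

```python
from statistics import mean, pstdev


def total_response_price(num_tokens, price_per_token):
    """Price of one response: tokens generated times the model's per-token price."""
    return num_tokens * price_per_token


def relative_price_diff(price_a, price_b):
    """(A - B) / (A + B): positive when A's response cost more (prices must be positive)."""
    return (price_a - price_b) / (price_a + price_b)


def normalize(values):
    """Assumed normalization: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical battles: (tokens_a, price_per_token_a, tokens_b, price_per_token_b)
toy_battles = [
    (200, 15e-6, 300, 0.6e-6),
    (150, 0.6e-6, 150, 0.6e-6),
    (100, 3e-6, 400, 15e-6),
]

raw_features = [
    relative_price_diff(
        total_response_price(tokens_a, rate_a),
        total_response_price(tokens_b, rate_b),
    )
    for tokens_a, rate_a, tokens_b, rate_b in toy_battles
]
price_features = normalize(raw_features)
```

Before normalization, a positive value means model A's response was the more expensive one in that battle, and a value of zero means both responses cost the same.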
+
+We tested several different price features:
+
+1. Raw output token price
+2. Total response price
+3. Log of total response price
+4. Indicator (1 if output_token_priceA > output_token_priceB else 0)
+
+Below is a table of the coefficients for each price feature across different methods of controlling for price. We determined that the difference in total response price was the most effective gauge of price.
+
+| | Raw Price | Total Response Price | Log Price | Indicator |
+| --- | --- | --- | --- | --- |
+| Control All | 0.022 | 0.117 | -0.0 | -0.0 |
+| Control Raw Price Only | 0.01 | - | - | - |
+| Control Total Response Price Only | - | 0.117 | - | - |
+| Control Log Price Only | - | - | 0.0 | - |
+| Indicator Only | - | - | - | 0.005 |
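
To make concrete what a price coefficient in a Bradley-Terry regression is, here is a minimal sketch, assuming a plain logistic Bradley-Terry model with one extra price covariate, fit by gradient ascent with a small L2 penalty (added only so the toy data yields finite values). The function name, hyperparameters, and toy battles are hypothetical; this is not the Arena's actual fitting code.

```python
import math


def fit_bradley_terry_with_price(battles, l2=0.1, lr=0.1, steps=2000):
    """Fit per-model strengths plus one price coefficient by gradient ascent.

    battles: list of (model_a, model_b, price_feature, a_wins) tuples, where
    price_feature is the already-normalized price difference for the battle
    and a_wins is 1 if model A won, else 0.
    """
    models = sorted({m for a, b, _, _ in battles for m in (a, b)})
    strength = {m: 0.0 for m in models}
    beta = 0.0  # coefficient on the price feature
    for _ in range(steps):
        grad_strength = {m: 0.0 for m in models}
        grad_beta = 0.0
        for a, b, x, y in battles:
            # P(A wins) depends on the strength gap plus the price term.
            logit = strength[a] - strength[b] + beta * x
            p_a_wins = 1.0 / (1.0 + math.exp(-logit))
            err = y - p_a_wins
            grad_strength[a] += err
            grad_strength[b] -= err
            grad_beta += err * x
        # Ascent step on the log-likelihood, shrunk by the assumed L2 penalty.
        for m in models:
            strength[m] += lr * (grad_strength[m] - l2 * strength[m])
        beta += lr * (grad_beta - l2 * beta)
    return strength, beta


# Hypothetical battles: (model_a, model_b, price_feature, a_wins)
toy_battles = [
    ("strong", "weak", 1.0, 1),
    ("strong", "weak", -1.0, 1),
    ("strong", "weak", 1.0, 1),
    ("strong", "weak", -1.0, 0),
    ("weak", "strong", 1.0, 1),
    ("weak", "strong", -1.0, 0),
    ("weak", "strong", 1.0, 0),
    ("weak", "strong", -1.0, 0),
]

strength, beta = fit_bradley_terry_with_price(toy_battles)
```

In this toy setup, `beta` plays the role of a price coefficient like those reported in the table, and `strength` holds the price-adjusted model strengths on the logit scale.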
+
+## Ablation
+
+We also controlled for style and price together. Below is a table of the coefficients for each relevant price and style attribute across different methods of controlling for price and style. When controlling for both price and style attributes, length is still the dominant feature (which makes sense, as length is a factor in total response price), and the price and markdown features are second order.
+
+| | Total Response Price | Length | Markdown List | Markdown Header | Markdown Bold |
+| --- | --- | --- | --- | --- | --- |
+| Control All | 0.055 | 0.243 | 0.034 | 0.019 | 0.024 |
+| Control Price Only | 0.117 | - | - | - | - |
+| Control Length Only | - | 0.270 | - | - | - |
+| Control Markdown Only | - | - | 0.115 | 0.042 | 0.055 |
+
+Below we compare the rank changes between controlling for price alone and controlling for both price and style. Some model rankings go up when controlled for price and style but go down when controlled only for price, and vice versa. This aligns with the coefficients above, where style seems to have a more dominant effect on rankings than price.
+
+**Table in progress**
+
+| Model | Rank Diff (Length Only) | Rank Diff (Markdown Only) | Rank Diff (Both) |
+| --- | --- | --- | --- |
+| chatgpt-4o-latest | 1->1 | 1->1 | 1->1 |
+| gemini-1.5-pro-exp-0827 | 2->2 | 2->2 | 2->2 |
+| gemini-1.5-pro-exp-0801 | 2->2 | 2->2 | 2->2 |
+| gpt-4o-2024-05-13 | 5->3 | 5->3 | 5->2 |
+| claude-3-5-sonnet-20240620 | 6->5 | 6->4 | 6->4 |
+| gemini-advanced-0514 | 7->5 | 7->8 | 7->6 |
+| grok-2-2024-08-13 | 2->4 | 2->4 | 2->5 |
+| llama-3.1-405b-instruct | 6->6 | 6->4 | 6->6 |
+| gpt-4o-2024-08-06 | 7->6 | 7->8 | 7->6 |
+| gpt-4-turbo-2024-04-09 | 11->8 | 11->8 | 11->9 |
+| claude-3-opus-20240229 | 16->14 | 16->8 | 16->10 |
+| gemini-1.5-pro-api-0514 | 10->8 | 10->13 | 10->10 |
+| gemini-1.5-flash-exp-0827 | 6->8 | 6->9 | 6->9 |
+| gpt-4-1106-preview | 16->14 | 16->8 | 16->11 |
+| gpt-4o-mini-2024-07-18 | 6->8 | 6->11 | 6->11 |
+| gpt-4-0125-preview | 17->14 | 17->12 | 17->13 |
+| mistral-large-2407 | 16->14 | 16->13 | 16->13 |
+| athene-70b-0725 | 16->16 | 16->17 | 16->17 |
+| grok-2-mini-2024-08-13 | 6->15 | 6->15 | 6->18 |
+| gemini-1.5-pro-api-0409-preview | 11->16 | 11->21 | 11->18 |
+
+## Citation
+
+```
+@misc{stylearena2024,
+    title = {Does Style Matter? Disentangling style and substance in Chatbot Arena},
+    url = {https://blog.lmarena.ai/blog/2024/style-control/},
+    author = {Tianle Li*, Anastasios Angelopoulos*, Wei-Lin Chiang*},
+    month = {August},
+    year = {2024}
+}
+```
+
+---
diff --git a/assets/img/blog/price_control/comparison_hard.png b/assets/img/blog/price_control/comparison_hard.png
new file mode 100644
index 0000000..0e7b76d
Binary files /dev/null and b/assets/img/blog/price_control/comparison_hard.png differ
diff --git a/assets/img/blog/price_control/comparison_overall.png b/assets/img/blog/price_control/comparison_overall.png
new file mode 100644
index 0000000..36014de
Binary files /dev/null and b/assets/img/blog/price_control/comparison_overall.png differ
diff --git a/assets/img/blog/price_control/cost_performance_scatterplot.png b/assets/img/blog/price_control/cost_performance_scatterplot.png
new file mode 100644
index 0000000..80e1fa9
Binary files /dev/null and b/assets/img/blog/price_control/cost_performance_scatterplot.png differ