Test Pandasai on the spider dateset #1105
Replies: 2 comments
-
@HITjunle we are building an open source benchmark ourselves. The problem in the spider dataset is it benchmark the SQL, but in our case it's slightly different. We could take some of the examples from that benchmark and adapt it to PandasAI. If you are interested in contributing, let me know, we're creating a task force for it! |
Beta Was this translation helpful? Give feedback.
-
Although Spider is a benchmark for SQL, in pandas, we can use read_sql_query to execute SQL queries. Therefore, we can compare the results of queries performed with pd.read_sql_query and pandasai. Here's my approach: First, we read the Spider dataset's .json files to extract questions and corresponding SQL queries. Then, we use these questions as input for pandasai to generate Pandas code and query results. Finally, we compare the results with those obtained using pd.read_sql_query(query) to validate correctness. I've tested a portion of it, and in some questions, it seems that Pandasai can generate code but fails to execute it correctly. Additionally, the generated code contains errors. For example, for a table: CREATE TABLE "stadium" (
"Stadium_ID" int,
"Location" text,
"Name" text,
"Capacity" int,
"Highest" int,
"Lowest" int,
"Average" int,
PRIMARY KEY ("Stadium_ID")
);
INSERT INTO "stadium" VALUES (1,"Raith Rovers","Stark's Park","10104","4812","1294","2106");
INSERT INTO "stadium" VALUES (2,"Ayr United","Somerset Park","11998","2363","1057","1477");
INSERT INTO "stadium" VALUES (3,"East Fife","Bayview Stadium","2000","1980","533","864");
INSERT INTO "stadium" VALUES (4,"Queen's Park","Hampden Park","52500","1763","466","730");
INSERT INTO "stadium" VALUES (5,"Stirling Albion","Forthbank Stadium","3808","1125","404","642");
INSERT INTO "stadium" VALUES (6,"Arbroath","Gayfield Park","4125","921","411","638");
INSERT INTO "stadium" VALUES (7,"Alloa Athletic","Recreation Park","3100","1057","331","637");
INSERT INTO "stadium" VALUES (9,"Peterhead","Balmoor","4000","837","400","615");
INSERT INTO "stadium" VALUES (10,"Brechin City","Glebe Park","3960","780","315","552"); Question: What is the maximum capacity and the average of all stadiums? Pandasai output code: # TODO: import the required dependencies
import pandas as pd
# Access the stadium dataframe from dfs list
stadium_df = dfs[0]
# Calculate maximum capacity and average of all stadiums
max_capacity = stadium_df['Capacity'].max()
avg_capacity = stadium_df['Capacity'].mean()
# Declare result variable as a dictionary of type and value
result = {
"type": "string",
"value": f"The maximum capacity is {max_capacity} and the average capacity is {avg_capacity}."
} It incorrectly interprets the average value as the average capacity. Actually, there's an 'average' column in the stadium table, so it should return the data corresponding to this 'average' column. The correct SQL query is select max(capacity), average from stadium |
Beta Was this translation helpful? Give feedback.
-
Spider dataset
Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students.
Test pandasai
Beta Was this translation helpful? Give feedback.
All reactions