-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Joining dataframes #16
Comments
Hey @dave-killough , I attached to a specific dataframe because it was the easiest to "gather" the local context. In a python kernel, just doing In many jupyter notebooks, people often re-use the same dataframe as well (editing in place), so the clarity of the answer I felt should be contextualized to a single dataframe. One option for multiple joining I considered was something like df.sketch.howto("question", extra_contextual_dfs=[*dfs]) but that didn't feel "easy to type" / "native". Do you have any ideas or preferences for what would make a good API? |
Thanks for the explanation. Specifying the dataframes to include is important. There are often many working dataframes created in a notebook that should be excluded from end-result queries. It seems useful to specify the dataframes once, and then have a higher level ask function that can connect the dataframes. ChatGPT is quite good at modeling table relationships from just schema, and we're getting effective SQL joins as a result. I expect it could work similarly with dataframe merges. Maybe something like this: sketch.dataset([client_df, purchase_df, contact_df])
sketch.ask("who bought the most last year?") |
Asking questions across multiple dataframes would be very interesting. I was hoping to do something like this: import pandas as pd
import sketch
from sqlalchemy import create_engine
# Define a class to store views
class Views:
def __init__(self, schema_name, db_url):
self.views = {}
self.history = []
self.engine = create_engine(db_url)
self.schema_name = schema_name
# Get all views in the schema
views = pd.read_sql_query(f"SELECT table_name FROM information_schema.views WHERE table_schema = '{schema_name}'", con=self.engine)
# Loop through each view and read it into a Pandas dataframe
for view in views['table_name']:
df = pd.read_sql_query(f"SELECT * FROM {schema_name}.{view}", con=self.engine)
self.add_view(view, df)
def add_view(self, name, df):
self.views[name] = df
def ask(self, question):
answer = sketch.ask(question, self.views.items())
self.history.append((question, answer))
# Connect to Postgres
db_url = 'postgresql://user:password@localhost:5432/mydatabase'
schema_name = 'my_schema'
# Create an instance of Views
my_views = Views(schema_name, db_url)
# Ask questions of all views and store history
my_views.ask('What are the top problematic projects with respect to planned vs actual time tracked?')
my_views.ask('Which activities appear to require upskilling?')
# Interrogate my_views.history for example |
Nice. I've been doing similar things with SQL sources. I'm curious why you attach the ask to a specific dataframe. Can multiple dataframes be considered in one request? Thanks
The text was updated successfully, but these errors were encountered: