-
-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Project: Build new tools to help data scientists clean our database #1837
Comments
Cleaning
tab in GrapherCleaning
tab in Grapher admin
We should talk about this in a tech tea to better understand the issue. It seems that datasette could potentially get us x% of the way there (especially if we add pageview data to it) But there's also a way forward that integrates more into the bulk chart editor (e.g. we could have a saved set of views which allow users to operate on charts that match each of the suggested filters in the OP), which could be worthwhile depending on how advanced the editing/cleaning tasks are. |
Tech tea outcomes: Some feature ideas that came upAnother common task is merging two charts. The charts have to be identified, the better of the two charts has to be picked, the "loser" deleted, and then a redirect created. Right now it's quite manual. It would be nice to have an automated flow that deletes one chart and adds a redirect from it to its equivalent. Some of these issues could be prevented through better input validation in the chart editor (e.g. enforcing fullstops at the end of subtitles) Datasette?If we want to do this through datasette, we'd have to ingest pageview data, and put it behind auth. We could commit queries, as views, directly to datasette, but there's also an argument to store queries as links in Notion, as it would be easier for anyone to update it if they write a useful query. A downside of datasette is that it operates on a read-only, slightly edited, 24-hour-stale version of our DB. This means that you wouldn't be able to use the tool to immediately verify your cleaning efforts. We may be able to get 70% of the way with datasette and then still find that it would be nice to have an MySQL editor in the admin with write access (or equivalent functionality through the bulk chart editor) PywebioIf datasette doesn't work out, we could also look into building a tool with Pywebio as that would allow the data team to hack on features instead of requiring front-end developers. To proceed, we should list all the common tasks we know we'd like to be able to do (@edomt). We can then go through each one and see what it would be like to do via datasette, and if it would be awful/inconvenient, we can work out what the ideal experience would be. |
As Ike summarized above, we'll try Datasette now for the easy cases and wait for a bit more clarity on the more complex cases that would benefit from more immediate editing or a live connection to the MySQL DB before we scope this project more. |
Cleaning
tab in Grapher admin
@danyx23 @ikesau I've now added several views to Datasette. This indeed solves a lot of usecases for us already. unused_old_datasets The critical piece that's missing now is page views on charts. Ideally, our team would have access to this as a table in Datasette to create views on top of it. The following BQ query is basically what we'd need. (Total page views for each chart slug over the last 12 months.) SELECT
REPLACE(REGEXP_REPLACE(pagePath, "\\?.*", ""), "/grapher/", "") AS slug,
SUM(pageviews) AS pageviews
FROM
`owid-analytics.prod_google_analytics.page_metrics_by_page_day`
WHERE
date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 year)
AND pagePath LIKE "/grapher/%"
GROUP BY
slug
ORDER BY
pageviews DESC (The resulting table currently has 12K rows.) Is that something that could be added soon? |
Just my two cents: Even if that's a concern, though, it would be pretty straightforward to add auth to Datasette (owid/owid-datasette#6) - the way I suggested over there is to have most things available to the public, but there's a few tables that are private. They would only be accessible if someone logs in with a GH account that's associated with @owid. If that's a priority we can get working on that soon, shouldn't take long. |
Thanks, @marcelgerber! I checked the table in Grapher, and having it there is enough for now. We won't use this data often, so it's okay to connect to the DB to analyze it. Ideally, if possible, we'd have a |
Thanks for answering already, Marcel! I think we should add a private datasette instance that can have all data, including unpublished posts. I'll add some thoughts in the issue you mentioned. Let's try to set this up some time in the not too far future but it seems it's not super urgent rn. |
Now that we have a private Datasette instance + views_365d, I can close this! |
Core problem
As mentioned to @larsyencken and @mathisonian, we have an idea for a small project that would probably need a bit of work from our engineering and P&D teams.
The goal would be to give the data team a cleaning tool that would help us clean our database – both as a big one-off cleaning in the first half of 2023, but also smaller, regular cleanings in the long run.
Proposed solution
What I have in mind right now is a
Cleaning
section in the Grapher admin. On this new page, internal users would have access to a bunch of tabs, with for example:On our end, we would have to define the exact list of features we need.
@lucasrodes and I would most likely be the main contacts for this project on the data team.
The text was updated successfully, but these errors were encountered: