Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project: Build new tools to help data scientists clean our database #1837

Closed
edomt opened this issue Dec 26, 2022 · 9 comments
Closed

Project: Build new tools to help data scientists clean our database #1837

edomt opened this issue Dec 26, 2022 · 9 comments

Comments

@edomt
Copy link
Contributor

edomt commented Dec 26, 2022

Core problem

As mentioned to @larsyencken and @mathisonian, we have an idea for a small project that would probably need a bit of work from our engineering and P&D teams.

The goal would be to give the data team a cleaning tool that would help us clean our database – both as a big one-off cleaning in the first half of 2023, but also smaller, regular cleanings in the long run.

Proposed solution

What I have in mind right now is a Cleaning section in the Grapher admin. On this new page, internal users would have access to a bunch of tabs, with for example:

  • A list of datasets with 0 charts, uploaded more than 6 months ago
  • Charts that have 0 internal references and whose 30-day page views are below a certain threshold
  • Charts that are possible duplicates (= charts that use the exact same variables)
  • Charts that have potential issues (see Project: Improve the metadata (FASTT) of our charts #78)
  • Charts that don't have any topic tags
  • etc.

On our end, we would have to define the exact list of features we need.

@lucasrodes and I would most likely be the main contacts for this project on the data team.

@marcelgerber marcelgerber changed the title Project: Cleaning tab in Grapher Project: Cleaning tab in Grapher admin Dec 26, 2022
@ikesau
Copy link
Member

ikesau commented Jan 6, 2023

We should talk about this in a tech tea to better understand the issue. It seems that datasette could potentially get us x% of the way there (especially if we add pageview data to it)

But there's also a way forward that integrates more into the bulk chart editor (e.g. we could have a saved set of views which allow users to operate on charts that match each of the suggested filters in the OP), which could be worthwhile depending on how advanced the editing/cleaning tasks are.

@ikesau
Copy link
Member

ikesau commented Jan 10, 2023

Tech tea outcomes:

Some feature ideas that came up

Another common task is merging two charts. The charts have to be identified, the better of the two charts has to be picked, the "loser" deleted, and then a redirect created. Right now it's quite manual. It would be nice to have an automated flow that deletes one chart and adds a redirect from it to its equivalent.

Some of these issues could be prevented through better input validation in the chart editor (e.g. enforcing fullstops at the end of subtitles)

Datasette?

If we want to do this through datasette, we'd have to ingest pageview data, and put it behind auth.

We could commit queries, as views, directly to datasette, but there's also an argument to store queries as links in Notion, as it would be easier for anyone to update it if they write a useful query.

A downside of datasette is that it operates on a read-only, slightly edited, 24-hour-stale version of our DB. This means that you wouldn't be able to use the tool to immediately verify your cleaning efforts.

We may be able to get 70% of the way with datasette and then still find that it would be nice to have an MySQL editor in the admin with write access (or equivalent functionality through the bulk chart editor)

Pywebio

If datasette doesn't work out, we could also look into building a tool with Pywebio as that would allow the data team to hack on features instead of requiring front-end developers.


To proceed, we should list all the common tasks we know we'd like to be able to do (@edomt). We can then go through each one and see what it would be like to do via datasette, and if it would be awful/inconvenient, we can work out what the ideal experience would be.

@danyx23
Copy link
Contributor

danyx23 commented Jan 11, 2023

As Ike summarized above, we'll try Datasette now for the easy cases and wait for a bit more clarity on the more complex cases that would benefit from more immediate editing or a live connection to the MySQL DB before we scope this project more.

@edomt
Copy link
Contributor Author

edomt commented Jan 11, 2023

Thank you @ikesau @danyx23!

I'll most likely tackle the full list of use cases in the next cycle.

In the meantime, I've created this PR, to test out having extra views in Datasette. @danyx23 can you please give it a quick review and merge it when you get a chance? (not urgent)

@edomt edomt self-assigned this Jan 16, 2023
@edomt edomt changed the title Project: Cleaning tab in Grapher admin Project: Build new tools to hep data scientists clean our database Feb 21, 2023
@edomt edomt changed the title Project: Build new tools to hep data scientists clean our database Project: Build new tools to help data scientists clean our database Feb 21, 2023
@edomt
Copy link
Contributor Author

edomt commented Feb 21, 2023

@danyx23 @ikesau I've now added several views to Datasette. This indeed solves a lot of usecases for us already.

unused_old_datasets
charts_without_origin_url
charts_potential_duplicates
charts_broken_origin_url
charts_by_dataset
charts_with_manual_year
charts_without_tag

The critical piece that's missing now is page views on charts. Ideally, our team would have access to this as a table in Datasette to create views on top of it.

The following BQ query is basically what we'd need. (Total page views for each chart slug over the last 12 months.)

SELECT
  REPLACE(REGEXP_REPLACE(pagePath, "\\?.*", ""), "/grapher/", "") AS slug,
  SUM(pageviews) AS pageviews
FROM
  `owid-analytics.prod_google_analytics.page_metrics_by_page_day`
WHERE
  date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 year)
  AND pagePath LIKE "/grapher/%"
GROUP BY
  slug
ORDER BY
  pageviews DESC

(The resulting table currently has 12K rows.)

Is that something that could be added soon?

@edomt edomt assigned danyx23 and unassigned edomt Feb 21, 2023
@marcelgerber
Copy link
Member

Just my two cents:
Thanks to Mojmir, we already have pageview data (not yet quite as normalised as you suggest, but that's easy enough to do) in the grapher database.
The only reason why this is not available in Datasette at this point is, frankly, that we are not quite sure if we want to make all of this data available to the public; because as you will know, right now everything that's in Datasette is visible to the public.

Even if that's a concern, though, it would be pretty straightforward to add auth to Datasette (owid/owid-datasette#6) - the way I suggested over there is to have most things available to the public, but there's a few tables that are private. They would only be accessible if someone logs in with a GH account that's associated with @owid.

If that's a priority we can get working on that soon, shouldn't take long.

@edomt
Copy link
Contributor Author

edomt commented Feb 22, 2023

Thanks, @marcelgerber! I checked the table in Grapher, and having it there is enough for now. We won't use this data often, so it's okay to connect to the DB to analyze it.

Ideally, if possible, we'd have a views_365d variable on top of the 7d and 14d ones.

@danyx23
Copy link
Contributor

danyx23 commented Feb 23, 2023

Thanks for answering already, Marcel! I think we should add a private datasette instance that can have all data, including unpublished posts. I'll add some thoughts in the issue you mentioned.

Let's try to set this up some time in the not too far future but it seems it's not super urgent rn.

@edomt
Copy link
Contributor Author

edomt commented May 25, 2023

Now that we have a private Datasette instance + views_365d, I can close this!

@edomt edomt closed this as completed May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants