Project: Build new tools to help data scientists clean our database #1837

edomt · 2022-12-26T08:55:36Z

Core problem

As mentioned to @larsyencken and @mathisonian, we have an idea for a small project that would probably need a bit of work from our engineering and P&D teams.

The goal would be to give the data team a cleaning tool that would help us clean our database – both as a big one-off cleaning in the first half of 2023, but also smaller, regular cleanings in the long run.

Proposed solution

What I have in mind right now is a Cleaning section in the Grapher admin. On this new page, internal users would have access to a bunch of tabs, with for example:

A list of datasets with 0 charts, uploaded more than 6 months ago
Charts that have 0 internal references and whose 30-day page views are below a certain threshold
Charts that are possible duplicates (= charts that use the exact same variables)
Charts that have potential issues (see Project: Improve the metadata (FASTT) of our charts #78)
Charts that don't have any topic tags
etc.

On our end, we would have to define the exact list of features we need.

@lucasrodes and I would most likely be the main contacts for this project on the data team.

The text was updated successfully, but these errors were encountered:

ikesau · 2023-01-06T15:00:03Z

We should talk about this in a tech tea to better understand the issue. It seems that datasette could potentially get us x% of the way there (especially if we add pageview data to it)

But there's also a way forward that integrates more into the bulk chart editor (e.g. we could have a saved set of views which allow users to operate on charts that match each of the suggested filters in the OP), which could be worthwhile depending on how advanced the editing/cleaning tasks are.

ikesau · 2023-01-10T15:53:06Z

Tech tea outcomes:

Some feature ideas that came up

Another common task is merging two charts. The charts have to be identified, the better of the two charts has to be picked, the "loser" deleted, and then a redirect created. Right now it's quite manual. It would be nice to have an automated flow that deletes one chart and adds a redirect from it to its equivalent.

Some of these issues could be prevented through better input validation in the chart editor (e.g. enforcing fullstops at the end of subtitles)

Datasette?

If we want to do this through datasette, we'd have to ingest pageview data, and put it behind auth.

We could commit queries, as views, directly to datasette, but there's also an argument to store queries as links in Notion, as it would be easier for anyone to update it if they write a useful query.

A downside of datasette is that it operates on a read-only, slightly edited, 24-hour-stale version of our DB. This means that you wouldn't be able to use the tool to immediately verify your cleaning efforts.

We may be able to get 70% of the way with datasette and then still find that it would be nice to have an MySQL editor in the admin with write access (or equivalent functionality through the bulk chart editor)

Pywebio

If datasette doesn't work out, we could also look into building a tool with Pywebio as that would allow the data team to hack on features instead of requiring front-end developers.

To proceed, we should list all the common tasks we know we'd like to be able to do (@edomt). We can then go through each one and see what it would be like to do via datasette, and if it would be awful/inconvenient, we can work out what the ideal experience would be.

danyx23 · 2023-01-11T16:39:30Z

As Ike summarized above, we'll try Datasette now for the easy cases and wait for a bit more clarity on the more complex cases that would benefit from more immediate editing or a live connection to the MySQL DB before we scope this project more.

edomt · 2023-01-11T18:09:08Z

Thank you @ikesau @danyx23!

I'll most likely tackle the full list of use cases in the next cycle.

In the meantime, I've created this PR, to test out having extra views in Datasette. @danyx23 can you please give it a quick review and merge it when you get a chance? (not urgent)

edomt · 2023-02-21T13:13:40Z

@danyx23 @ikesau I've now added several views to Datasette. This indeed solves a lot of usecases for us already.

unused_old_datasets
charts_without_origin_url
charts_potential_duplicates
charts_broken_origin_url
charts_by_dataset
charts_with_manual_year
charts_without_tag

The critical piece that's missing now is page views on charts. Ideally, our team would have access to this as a table in Datasette to create views on top of it.

The following BQ query is basically what we'd need. (Total page views for each chart slug over the last 12 months.)

SELECT
  REPLACE(REGEXP_REPLACE(pagePath, "\\?.*", ""), "/grapher/", "") AS slug,
  SUM(pageviews) AS pageviews
FROM
  `owid-analytics.prod_google_analytics.page_metrics_by_page_day`
WHERE
  date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 year)
  AND pagePath LIKE "/grapher/%"
GROUP BY
  slug
ORDER BY
  pageviews DESC

(The resulting table currently has 12K rows.)

Is that something that could be added soon?

marcelgerber · 2023-02-21T16:40:00Z

Just my two cents:
Thanks to Mojmir, we already have pageview data (not yet quite as normalised as you suggest, but that's easy enough to do) in the grapher database.
The only reason why this is not available in Datasette at this point is, frankly, that we are not quite sure if we want to make all of this data available to the public; because as you will know, right now everything that's in Datasette is visible to the public.

Even if that's a concern, though, it would be pretty straightforward to add auth to Datasette (owid/owid-datasette#6) - the way I suggested over there is to have most things available to the public, but there's a few tables that are private. They would only be accessible if someone logs in with a GH account that's associated with @owid.

If that's a priority we can get working on that soon, shouldn't take long.

edomt · 2023-02-22T12:35:49Z

Thanks, @marcelgerber! I checked the table in Grapher, and having it there is enough for now. We won't use this data often, so it's okay to connect to the DB to analyze it.

Ideally, if possible, we'd have a views_365d variable on top of the 7d and 14d ones.

danyx23 · 2023-02-23T14:55:22Z

Thanks for answering already, Marcel! I think we should add a private datasette instance that can have all data, including unpublished posts. I'll add some thoughts in the issue you mentioned.

Let's try to set this up some time in the not too far future but it seems it's not super urgent rn.

edomt · 2023-05-25T09:22:57Z

Now that we have a private Datasette instance + views_365d, I can close this!

edomt added feature needs triage labels Dec 26, 2022

marcelgerber changed the title ~~Project: Cleaning tab in Grapher~~ Project: Cleaning tab in Grapher admin Dec 26, 2022

edomt mentioned this issue Jan 10, 2023

feat: add cleaning views owid/owid-datasette#4

Merged

danyx23 added priority 3 - nice to have needs examples and removed needs triage labels Jan 11, 2023

edomt self-assigned this Jan 16, 2023

edomt changed the title ~~Project: Cleaning tab in Grapher admin~~ Project: Build new tools to hep data scientists clean our database Feb 21, 2023

edomt changed the title ~~Project: Build new tools to hep data scientists clean our database~~ Project: Build new tools to help data scientists clean our database Feb 21, 2023

edomt assigned danyx23 and unassigned edomt Feb 21, 2023

marcelgerber mentioned this issue Feb 27, 2023

Set up a daily pipeline to update 30d pageviews for posts and charts #1885

Closed

edomt closed this as completed May 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Project: Build new tools to help data scientists clean our database #1837

Project: Build new tools to help data scientists clean our database #1837

edomt commented Dec 26, 2022

ikesau commented Jan 6, 2023

ikesau commented Jan 10, 2023

danyx23 commented Jan 11, 2023

edomt commented Jan 11, 2023 •

edited

Loading

edomt commented Feb 21, 2023 •

edited

Loading

marcelgerber commented Feb 21, 2023

edomt commented Feb 22, 2023

danyx23 commented Feb 23, 2023

edomt commented May 25, 2023

Project: Build new tools to help data scientists clean our database #1837

Project: Build new tools to help data scientists clean our database #1837

Comments

edomt commented Dec 26, 2022

Core problem

Proposed solution

ikesau commented Jan 6, 2023

ikesau commented Jan 10, 2023

Some feature ideas that came up

Datasette?

Pywebio

danyx23 commented Jan 11, 2023

edomt commented Jan 11, 2023 • edited Loading

edomt commented Feb 21, 2023 • edited Loading

marcelgerber commented Feb 21, 2023

edomt commented Feb 22, 2023

danyx23 commented Feb 23, 2023

edomt commented May 25, 2023

edomt commented Jan 11, 2023 •

edited

Loading

edomt commented Feb 21, 2023 •

edited

Loading