Add a script to download all (or most?) data, aka "exit plan" #95

rudokemper opened this issue May 12, 2025 · 0 comments
rudokemper commented May 12, 2025

Feature Request

We have committed to making it possible for users to "download all of their data" on GuardianConnector. The primary reasons are that a user may want to stop using the stack and retrieve their data before doing so, or simply take a backup at any given time. This is an important feature for guaranteeing data sovereignty.

In what follows, I will first provide some background on what we might mean, and what our users might expect, from downloading their data. I will then focus on the Windmill / gc-scripts-hub specific tasks.

Note

At the time of writing, this is not a priority issue. I am just documenting some ideas that came to me this morning. Thoughts and reflections on anything here are welcome.

Background

What data are we talking about?

What does “download all of their data” actually mean? Here’s a breakdown of the relevant data types:

  1. Tabular data stored in the warehouse Postgres database (e.g. data ingested by connector scripts like Kobo or CoMapeo, or uploaded manually)
  2. Files in /persistent-storage/datalake (e.g. attachments from connector scripts, file exports like GeoJSON, or files added via Filebrowser)
  3. GuardianConnector Explorer and Superset config (guardianconnector and superset-metastore DB tables)
  4. Windmill config for scheduled scripts (windmill DB tables)
  5. Windmill run history, stored in Windmill logs on the captain--windmill-worker-logs volume
  6. CapRover platform config (/captain/data/config-captain.json)
  7. CoMapeo Cloud raw data (stored in the captain--comapeo-data Docker volume)
  8. Auth0 authentication and session history (stored externally)
  9. ...am I forgetting anything? Probably!

(Note: Third-party resources like Mapbox or Planet are not considered here.)
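
As a rough illustration of the scope of (1), a script could discover which warehouse tables would need exporting via information_schema. A minimal sketch, assuming psycopg2 and a WAREHOUSE_DATABASE_URL environment variable (both placeholders, not existing conventions):

```python
# Sketch: enumerate user tables in the warehouse Postgres database (data type 1).
# The env var name is illustrative, not an actual gc-scripts-hub convention.
import os
import psycopg2

conn = psycopg2.connect(os.environ["WAREHOUSE_DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_schema, table_name
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
          AND table_schema NOT IN ('pg_catalog', 'information_schema')
        ORDER BY table_schema, table_name
        """
    )
    for schema, table in cur.fetchall():
        print(f"{schema}.{table}")
```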

User stories for downloading data

But what might our users actually want? I think there are potentially three user stories:

User story A: I’m done with GuardianConnector and want to take my project data with me.

This is likely the most obvious user story. For this, data types (1) and (2) are the most essential. But perhaps we could also offer the option to download (3) and (4) so that the GC connector config -- both upstream (ETL scripts) and downstream (front ends) -- is provided as a reference for how the data was ingested and visualized, in case it's helpful for other work.

User story B: I want to pause my use of GuardianConnector, but might return later.

This user story is relevant even for CMI. It has already happened that we set up GC for a potentially interested user, they never accessed it, and we took it down. Currently we just re-set everything up (VM, CapRover, services, etc.) from scratch. But maybe we could just download the whole VM (e.g. as a VHD) along with the file share content? I'm sure there are devils in the details, and the process would look different on Azure vs. DigitalOcean vs. other providers. But in theory, this would effectively back up data types (1)-(7), and just as importantly, the plumbing through which much of this config data is put to use. (Auth0 would have to be handled separately.)
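
For illustration, on Azure the snapshot-and-download flow might look roughly like the sketch below, driving the Azure CLI from Python. The resource group, VM, and snapshot names are placeholders, and DigitalOcean would need its own equivalent (e.g. via doctl):

```python
# Sketch: snapshot an Azure VM's OS disk and get a time-limited download URL
# for the VHD (user story B). All resource names are placeholders.
import json
import subprocess

def az(*args: str) -> dict:
    """Run an Azure CLI command and parse its JSON output."""
    out = subprocess.check_output(["az", *args, "--output", "json"])
    return json.loads(out)

vm = az("vm", "show", "--resource-group", "gc-rg", "--name", "gc-vm")
os_disk_id = vm["storageProfile"]["osDisk"]["managedDisk"]["id"]

az("snapshot", "create", "--resource-group", "gc-rg",
   "--name", "gc-vm-exit-snapshot", "--source", os_disk_id)

# Grant temporary read access; the returned SAS URL can be downloaded as a VHD.
sas = az("snapshot", "grant-access", "--resource-group", "gc-rg",
         "--name", "gc-vm-exit-snapshot", "--duration-in-seconds", "3600")
print(sas["accessSas"])
```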

User story C: I want to migrate a specific service off of GuardianConnector.

Take CoMapeo, for example. It's possible that after some time, a user wants to move their CoMapeo service to a different domain outside of GuardianConnector. (For example, Awana Digital has started to maintain their own CoMapeo Cloud fleet, and a user might prefer to start leveraging that to keep their GuardianConnector stack lightweight.)

All three user stories are valid but require distinct approaches.

For the 2025–2026 roadmap, I propose prioritizing User Stories A and B, as they are the most broadly applicable and technically feasible in the near term.

  • User Story B is an issue to file for gc-forge.
  • User Story A could be implemented here in gc-scripts-hub. Hence the title of this issue; what follows is scoped to what is achievable in Windmill.

Implementation Plan: Windmill script to download project (and optionally, config) data

We could do something like the following:

  1. User runs a "Download All Data" script in the Windmill UI.
  2. The script performs the following:
    • Exports the relevant database tables to CSV
    • Archives files from /persistent-storage/datalake
    • (Optional) Exports config database tables (guardianconnector, windmill, etc.)
  3. Outputs are bundled into a downloadable .zip archive
  4. The archive is stored in a location accessible to the user, e.g. via a download link
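
A minimal sketch of what such a script could look like, assuming Windmill's Python main() convention, psycopg2, and placeholder paths and environment variables (none of these are settled conventions):

```python
# Sketch of a "Download All Data" Windmill script. Paths and the connection
# string are assumptions, not actual gc-scripts-hub conventions.
import os
import zipfile
from pathlib import Path

import psycopg2

DATALAKE = Path("/persistent-storage/datalake")
ARCHIVE = DATALAKE / "exports" / "gc-export.zip"

def main(include_config: bool = False) -> str:
    conn = psycopg2.connect(os.environ["WAREHOUSE_DATABASE_URL"])
    ARCHIVE.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(ARCHIVE, "w", zipfile.ZIP_DEFLATED) as zf:
        # 1. Export warehouse tables to CSV using Postgres COPY.
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = 'public' AND table_type = 'BASE TABLE'"
            )
            tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            with conn.cursor() as cur, zf.open(f"warehouse/{table}.csv", "w") as out:
                cur.copy_expert(f'COPY "{table}" TO STDOUT WITH CSV HEADER', out)
        # 2. Archive the datalake files (skipping the export archive itself).
        for path in DATALAKE.rglob("*"):
            if path.is_file() and path != ARCHIVE:
                zf.write(path, Path("datalake") / path.relative_to(DATALAKE))
        # 3. (Optional) Export config DB tables (guardianconnector, windmill,
        #    etc.) the same way -- omitted here for brevity.
    return str(ARCHIVE)
```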

This workflow assumes the user's GuardianConnector instance is still running and they have access to Windmill. While this won't always be the case, it's a reasonable starting point. We should clearly communicate the expectation that users must export their data before shutting down their GuardianConnector instance.

Something to consider is storage impact: storing a large .zip archive in /persistent-storage/datalake will quickly eat up disk space, especially for media-heavy projects. Also, once GuardianConnector is shut down, the archive would no longer be accessible. Perhaps we could stream the zip file to an external object store (e.g. S3, Azure Blob)?
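
For example, the upload step could look like the boto3 sketch below (bucket and key are placeholders). Truly streaming the zip as it is built would need a streaming-zip library, but uploading the finished archive and then deleting the local copy would already address the disk-space concern:

```python
# Sketch: push the finished archive to S3 (bucket and key are placeholders).
# boto3's upload_fileobj uses multipart transfers, so large archives don't
# need to fit in memory.
import boto3

def upload_archive(archive_path: str, bucket: str = "gc-exports",
                   key: str = "gc-export.zip") -> str:
    s3 = boto3.client("s3")
    with open(archive_path, "rb") as f:
        s3.upload_fileobj(f, bucket, key)
    # A presigned URL gives the user a time-limited download link that
    # outlives the GuardianConnector instance itself.
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=86400
    )
```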

@rudokemper rudokemper added this to the Nia Tero 2025 milestone May 12, 2025
@rudokemper rudokemper added the feature New specs for new behavior label May 12, 2025