Feature Request
We have committed to making it possible for users to "download all of their data" on GuardianConnector. The primary reasons given for this include the possibility that they would like to stop using the stack and retrieve their data before doing so, or that they simply want to take a backup at any given time. This is an important feature towards guaranteeing data sovereignty.
In what follows, I will first provide some background on what we might mean, and what our users might expect, from downloading their data. I will then focus on the Windmill / `gc-scripts-hub`-specific tasks.
Note
At the time of writing, this is not a priority issue. I am just documenting some ideas that came to me this morning. Thoughts and reflections on anything here are welcome.
Background
What data are we talking about?
What does “download all of their data” actually mean? Here’s a breakdown of the relevant data types:
1. Tabular data stored in the `warehouse` Postgres database (e.g. data ingested by connector scripts like Kobo or CoMapeo, or uploaded manually)
2. Files in `/persistent-storage/datalake` (e.g. attachments from connector scripts, file exports like GeoJSON, or files added via Filebrowser)
3. GuardianConnector Explorer and Superset config (`guardianconnector` and `superset-metastore` DB tables)
4. Windmill config for scheduled scripts (`windmill` DB table)
5. Windmill run history, stored in Windmill logs on the `captain--windmill-worker-logs` volume
6. CapRover config (`/captain/data/config-captain.json`)
7. CoMapeo Cloud raw data (stored in the Docker volume `captain--comapeo-data`)
8. Auth0 authentication and session history (stored externally)

...am I forgetting anything? Probably!
(Note: Third-party resources like Mapbox or Planet are not considered here.)
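For concreteness, here is a minimal sketch of how the tabular and config data above could be exported with standard Postgres tooling. It assumes `pg_dump` is available on the machine running it and that credentials come from the standard libpq environment variables (`PGPASSWORD`, `.pgpass`, etc.); the host, user, and helper name are placeholders, not anything decided in this issue.

```python
# Hypothetical sketch: dumping the warehouse and config databases with pg_dump.
# Database names come from the inventory above; host/user are placeholders.
import subprocess
from pathlib import Path

DATABASES = ["warehouse", "guardianconnector", "superset-metastore", "windmill"]

def dump_databases(host: str, user: str, out_dir: Path) -> list[Path]:
    """Dump each database to a custom-format archive (restorable via pg_restore)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dumps = []
    for db in DATABASES:
        out_file = out_dir / f"{db}.dump"
        subprocess.run(
            ["pg_dump", "-h", host, "-U", user, "-d", db,
             "--format=custom", "--file", str(out_file)],
            check=True,  # raise if pg_dump fails
        )
        dumps.append(out_file)
    return dumps
```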
User stories for downloading data
But what might our users actually want? I think there are potentially three user stories:
User story A: I’m done with GuardianConnector and want to take my project data with me.
This is likely the most obvious user story. For this, data types (1) and (2) are the most essential. But perhaps we could also offer the option to download (3) and (4) so that the GC connector config -- both upstream (ETL scripts) and downstream (front ends) -- is provided as a source of reference for how the data was ingested and visualized, in case it's helpful for other work.
User story B: I want to pause my use of GuardianConnector, but might return later.
This user story is even relevant to CMI. It has already happened that we set up GC for a potentially interested user, but they did not access it at all, so we took it down. Currently we just set things up again (VM, CapRover, services, etc.) from scratch. But maybe we could just download the whole VM (e.g. as a VHD) and the file share content? I'm sure there are some devils in the details, and the process would look different on Azure vs. DigitalOcean vs. other providers. But in theory anyway, this would effectively back up data types (1)-(7), and just as importantly, the plumbing through which much of this config data is put to use. (Auth0 would have to be handled separately.)
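As a rough illustration of the Azure case, a managed-disk snapshot can be taken with the `az` CLI; here is a sketch wrapping that in Python. The resource group, disk, and snapshot names are placeholders, and other providers (e.g. DigitalOcean) would need their own equivalents.

```python
# Hypothetical sketch: snapshotting a VM's managed disk on Azure via the az CLI.
import subprocess

def snapshot_disk(resource_group: str, disk_name: str, snapshot_name: str) -> None:
    subprocess.run(
        ["az", "snapshot", "create",
         "--resource-group", resource_group,
         "--name", snapshot_name,
         "--source", disk_name],
        check=True,
    )
    # A time-limited download URL for the underlying VHD can then be issued:
    subprocess.run(
        ["az", "snapshot", "grant-access",
         "--resource-group", resource_group,
         "--name", snapshot_name,
         "--duration-in-seconds", "86400"],
        check=True,
    )
```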
User story C: I want to migrate a specific service off of GuardianConnector.
Take CoMapeo, for example. It's possible that after some time, a user wants to move their CoMapeo service to a different domain that is not GuardianConnector. (For example, Awana Digital has started to maintain their own CoMapeo cloud fleet, and a user might prefer to start leveraging that to keep their GuardianConnector stack lightweight.)
All three user stories are valid but require distinct approaches.
For the 2025–2026 roadmap, I propose prioritizing User Stories A and B, as they are the most broadly applicable and technically feasible in the near term.
Of the above, some of the work belongs at the infrastructure level, e.g. in gc-forge, whereas this repository is gc-scripts-hub. Hence, the title of this issue and what follows is scoped to what is achievable in Windmill.
Implementation Plan: Windmill script to download project (and optionally, config) data
We could do something like the following (sketched in code below):
1. Export the tabular project data and gather all files in `/persistent-storage/datalake`
2. Optionally, also dump the config databases (`guardianconnector`, `windmill`, etc.)
3. Compress everything into a `.zip` archive
4. Store the archive in a location accessible to the user, e.g. via download link
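A minimal sketch of what such a Windmill script's `main()` could look like, under the assumption that `pg_dump` is on the worker and connection details come from the standard libpq environment variables. The staging path, the `include_config` flag, and the result shape are illustrative, not a final design.

```python
# Hypothetical Windmill script sketch for the export flow above.
import subprocess
import shutil
from pathlib import Path

DATALAKE = Path("/persistent-storage/datalake")

def main(include_config: bool = False) -> dict:
    staging = Path("/tmp/gc-export")
    if staging.exists():
        shutil.rmtree(staging)
    staging.mkdir(parents=True)

    # 1. Export tabular project data from the warehouse database.
    subprocess.run(
        ["pg_dump", "-d", "warehouse", "--format=custom",
         "--file", str(staging / "warehouse.dump")],
        check=True,
    )

    # 2. Copy the datalake contents (attachments, GeoJSON exports, etc.).
    shutil.copytree(DATALAKE, staging / "datalake")

    # 3. Optionally include the config databases.
    if include_config:
        for db in ("guardianconnector", "windmill"):
            subprocess.run(
                ["pg_dump", "-d", db, "--format=custom",
                 "--file", str(staging / f"{db}.dump")],
                check=True,
            )

    # 4. Zip everything into the datalake, where Filebrowser can serve it.
    archive = shutil.make_archive(str(DATALAKE / "gc-export"), "zip", staging)
    shutil.rmtree(staging)
    return {"archive": archive}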
This workflow assumes the user's GuardianConnector instance is still running and they have access to Windmill. While this won't always be the case, it's a reasonable starting point. We should clearly communicate the expectation that users must export their data before shutting down their GuardianConnector instance.
Something to consider is storage impact: storing a large `.zip` archive in `/persistent-storage/datalake` will quickly eat up disk space, especially for media-heavy projects. Also, once GuardianConnector is shut down, the archive would no longer be accessible. Perhaps we could stream the zip file to a given external object store (e.g., S3, Azure Blob)?
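For the S3 flavor of that idea, here is a sketch using boto3; the bucket name and key are placeholders. Note this version stages the archive on local disk before uploading, so it sidesteps rather than solves the disk-space concern; true streaming would need a streaming zip writer (e.g. a library like zipstream) feeding an S3 multipart upload.

```python
# Hypothetical sketch: pushing the export archive to an external bucket so it
# outlives the GuardianConnector instance.
import boto3

def upload_archive(archive_path: str, bucket: str, key: str) -> str:
    s3 = boto3.client("s3")
    s3.upload_file(archive_path, bucket, key)  # handles multipart for large files
    # A presigned URL gives the user a time-limited download link.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=7 * 24 * 3600,  # one week
    )
```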