Add script to extract and transform uploaded Timelapse data #71


Open
rudokemper opened this issue Jan 30, 2025 · 8 comments · May be fixed by #94
Labels
connectors (Connector scripts for ETL from upstream data sources), feature (New specs for new behavior)

Comments

@rudokemper
Member

Feature Request

Add a script that takes a path for Timelapse data (in ZIP file format), transforms the data to write to a database, and copies attachments to the datalake.

The script should have an optional "delete?" boolean, similar to the Locus Map script.

One key question is what kind of tabular data we want to ingest: a CSV export, or the raw contents of the Timelapse SQLite database(s).
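
For orientation, here is a rough sketch of how the script's entry point could be shaped; all names here (main, timelapse_zip_path, db_table_name) are hypothetical placeholders, not a settled API:

```python
# Hypothetical sketch only: function signature and helper steps are placeholders,
# not the actual gc-scripts-hub API.
import tempfile
import zipfile
from pathlib import Path


def main(timelapse_zip_path: str, db_table_name: str, delete: bool = False):
    """Extract an uploaded Timelapse ZIP, write tabular data to the warehouse, copy media to the datalake."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(timelapse_zip_path) as zf:
            zf.extractall(tmp_dir)

        # TODO: read annotations from either the CSV export or the .ddb SQLite
        # database (the open question above), and write them to `db_table_name`
        # TODO: copy image/video attachments from tmp_dir into the datalake

    if delete:
        # Optional cleanup of the uploaded archive, mirroring the Locus Map script
        Path(timelapse_zip_path).unlink()
```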

Timelapse data structure

Note

Sample camera trap assets for a CMI partner (requires authorization to access) here: https://drive.google.com/drive/u/2/folders/1dWk2g3tK5tE5jM7hUfPuLBH7YHdtlHuy

In Timelapse, the .tdb database stores the project template and metadata, and the .ddb database stores the annotations as applied to camera trap media. They are both SQLite dbs.

Lastly, it is possible to export Timelapse annotation data as a CSV.

Let's dig into all three.

.tdb

Within the .tdb file, there are three tables: TemplateInfo, TemplateTable, and sqlite_sequence. Only the TemplateTable is of interest for ingestion; the other two contain file metadata that isn't relevant here.
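
For reference, a quick way to confirm this structure is to open the .tdb with Python's built-in sqlite3 module (the filename below is a placeholder):

```python
import sqlite3

# Placeholder filename; SQLite does not care about the .tdb extension
con = sqlite3.connect("TimelapseTemplate.tdb")

# Should print: TemplateInfo, TemplateTable, sqlite_sequence
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)

# Dump the template definitions (the only table relevant to ingestion)
cur = con.execute("SELECT * FROM TemplateTable")
print([d[0] for d in cur.description])  # the column names listed below
for row in cur.fetchall():
    print(row)

con.close()
```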

TemplateTable

The columns in the TemplateTable are standardized:

  • Id (int): primary key
  • ControlOrder (int): not 100% sure, but I'm guessing this is the order in which the fields appear in Timelapse.
  • SpreadsheetOrder (int): not 100% sure, but I'm guessing this is the order in which the fields appear in a CSV export.
  • Type (text): the type of template field e.g. File, RelativePath, DateTime, DeleteFlag, FixedChoice, Counter, Note, Flag
  • DefaultValue (text): default value for the field
  • Label (text): the user-facing label e.g. "Tënon_Soort" (type of animal in Wayana)
  • DataLabel (text): the system name for the field e.g. "Tenon_Soort"
  • Tooltip (text): user-facing tip about the data field e.g. "welke soort dier/tenon mekpale"
  • TXTBOXWIDTH: the width of the data field in Timelapse
  • Copyable (text): true/false value indicating whether the field's value can be copied.
  • Visible (text): true/false value indicating whether the field is visible in Timelapse.
  • List (text): if the field is a FixedChoice, the list of choices is stored here in the following format (see the parsing sketch after this list):
    {"IncludeEmptyChoice":true,"ChoiceListNonEmpty": ["value1", "value2", ...]}
    
  • ExportToCSV (text): true/false value indicating whether the field will be included in a CSV export
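
As a sketch, the List value for FixedChoice fields can be parsed with a plain json.loads; the keys come from the example above, and the handling of the empty choice is my assumption:

```python
import json


def parse_choice_list(list_value: str) -> list[str]:
    """Parse the List column of a FixedChoice row in TemplateTable.

    Expected shape (from the example above):
    {"IncludeEmptyChoice": true, "ChoiceListNonEmpty": ["value1", "value2", ...]}
    """
    if not list_value:
        return []
    parsed = json.loads(list_value)
    choices = parsed.get("ChoiceListNonEmpty", [])
    # Assumption: represent the optional empty choice as an empty string
    if parsed.get("IncludeEmptyChoice"):
        choices = [""] + choices
    return choices


print(parse_choice_list('{"IncludeEmptyChoice":true,"ChoiceListNonEmpty": ["value1", "value2"]}'))
# ['', 'value1', 'value2']
```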

.ddb

Within the .ddb file, there are five tables (I'm not sure if these are standard or specific to the CMI partner asset): DataTable, ImageSetTable, MarkersTable, TemplateTable, and sqlite_sequence.

TemplateTable

This table is actually a duplicate of the identically named table in the .tdb! So we may not need to work with the .tdb database at all.

ImageSetTable

This table seems to store metadata about the root folder of the project, and the number of media files annotated:

  • Id (int): primary key
  • RootFolder (text): the name of the folder where the template and media files are stored. For example, this might be "Recordings per month"
  • Log (text): unsure, it's empty in the partner asset
  • Row (text): a numeric value, but it doesn't seem to match the number of rows in DataTable, so I'm not sure what it refers to.
  • VersionCompatibily (text) (not a typo): the version of Timelapse used, for example, "2.3.1.0"
  • SortTerms (text): an array composed of terms used... for sorting.
    [
     {
       "DataLabel": "RelativePath",
       "DisplayLabel": "RelativePath",
       "ControlType": "RelativePath",
       "IsAscending": "true"
     },
     {
       "DataLabel": "DateTime",
       "DisplayLabel": "DateTime",
       "ControlType": "DateTime",
       "IsAscending": "true"
     }
    ]
    
  • SearchTerms (text): an object composed of a SearchTerms array specifying the search capabilities for each term (only one example entry is shown below), plus a number of key/value pairs whose purpose within Timelapse I'm not sure of, or whether they are useful to store. @abfleishman might know.
    {
      "SearchTerms": [
        ...,
        {
          "ControlType": "Counter",
          "DatabaseValue": "0",
          "DataLabel": "Tapsik_aantal",
          "Label": "Tapsik_aantal",
          "List": [],
          "Operator": ">",
          "UseForSearching": false
        },
      ],
      "RandomSample": 0,
      "UseTimeInsteadOfDate": false,
      "TermCombiningOperator": 0,
      "ShowMissingDetections": false,
      "DetectionSelections": {
        "Enabled": false,
        "UseRecognition": false,
        "InterpretAllDetectionsAsEmpty": false,
        "AllDetections": false,
        "RankByConfidence": false,
        "RecognitionType": 2,
        "DetectionCategory": "1",
        "ClassificationCategory": "1",
        "ConfidenceThreshold1ForUI": 0.800000011920929,
        "ConfidenceThreshold2ForUI": 1.0,
        "ConfidenceThresholdForSelect": {
          "Item1": 0.800000011920929,
          "Item2": 1.0
        },
        "CurrentDetectionThreshold": 0.800000011920929,
        "CurrentClassificationThreshold": 0.75
      },
      "EpisodeShowAllIfAnyMatch": false,
      "EpisodeNoteField": "",
      "DuplicatesRelativePathAndFileTuple": null,
      "GetRelativePathFolder": "02-24"
    }
    

MarkersTable

Not sure if this table has any value for ingestion, or what it does for that matter. For the partner asset, there are two columns, and empty rows:

  • id (integer): primary key
  • Tapsik_aantal (text): this is for a Counter field type, so perhaps this table is added by default for Counter fields, and the user did not leverage whatever functionality it offers in Timelapse.

DataTable

This table stores all of the annotations.

There is an id (integer) primary key column.

The rest of the columns are the values of the DataLabel column for each row in TemplateTable.

For example, the linked asset has the following columns. (The third column of this table represents the Timelapse field Type from TemplateTable; a sketch for recovering this mapping from the .ddb follows the table.)

| Column | Description | Type |
| --- | --- | --- |
| id (integer) | primary key | |
| File (text) | filename | File |
| RelativePath (text) | the relative path within the project root directory | RelativePath |
| DateTime (datetime) | timestamp of the media file | DateTime |
| DeleteFlag (text) | true/false boolean | DeleteFlag |
| Tenon_Soort (text) | type of species | FixedChoice |
| Tapsik_aantal (text) | number of animals observed | Counter |
| Takahe_eitop (text) | unsure, but it's a text / note field | Note |
| Ipokan_foto (text) | is this a "voucher" / top pick scene | Flag |
| Enijika_tenei (text) | unsure, but it's a text / note field | Note |
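
Since the DataTable columns are defined by TemplateTable, each column's Timelapse type can be recovered by reading both tables from the .ddb. A minimal sketch (the filename is a placeholder):

```python
import sqlite3

con = sqlite3.connect("TimelapseData.ddb")  # placeholder filename

# DataLabel -> Timelapse field type, straight from the duplicated TemplateTable
column_types = dict(con.execute("SELECT DataLabel, Type FROM TemplateTable"))

# Read the annotations themselves
cur = con.execute("SELECT * FROM DataTable")
data_columns = [d[0] for d in cur.description]
annotations = [dict(zip(data_columns, row)) for row in cur.fetchall()]

for name in data_columns:
    # The id primary key has no TemplateTable entry, hence the fallback
    print(name, "->", column_types.get(name, "no template entry"))

con.close()
```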

According to the Timelapse Template Guide, the following columns are mandatory, and their DataLabel cannot be changed:

  • File
  • RelativePath
  • DateTime
  • DeleteFlag

CSV export

In addition to accessing the raw .tdb and .ddb files, it is also possible to have the user export their data to a CSV file from within Timelapse. The menu option is called Export all data (folder data, image/ video data) to a CSV file.

This file yields only the annotations, and the root folder. In other words, the same data as the DataTable from the .ddb, supplemented by the RootFolder column from the ImageSetTable from the .ddb. For example (some names changed for privacy)...

RootFolder,File,RelativePath,DateTime,Tenon_Soort,Tapsik_aantal,Takahe_eitop,Ipokan_foto,Enikija_tenei,
Recordings per month,Cam1_1_9_2_2024_01am.mp4,02-24\My Location\Cam 1\Lokatie 1, 2024-02-13 08:45:28,Dog,1,,false,,
Recordings per month,Cam1_1_5_2_2024_03am.mp4,02-24\My Location\Cam 1\Lokatie 1, 2024-02-13 08:45:34,Cat,1,,false,,
Recordings per month,Cam1_1_7_2_2024_12am.mp4,02-24\My Location\Cam 1\Lokatie 1, 2024-02-13 08:45:42,Dog,1,,false,,
Recordings per month,Cam1_1_7_2_2024_12am1.mp4,02-24\My Location\Cam 1\Lokatie 1, 2024-02-13 08:45:48,Dog,1,,false,,
Recordings per month,Cam1_1_6_2_2024_10pm.mp4,02-24\My Location\Cam 1\Lokatie 1, 2024-02-13 08:45:56,Fox,1,,false,,
Recordings per month,Cam1_1_8_2_2024_12am1.mp4,02-24\My Location\Cam 1\Lokatie 1, 2024-02-13 08:46:10,Dog,1,,false,,
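
A minimal sketch of reading this CSV shape with the standard library (the filename is a placeholder, and normalizing the Windows-style backslashes in RelativePath is my assumption about how we'd want to store media paths):

```python
import csv
from pathlib import PureWindowsPath

# Placeholder filename; utf-8-sig also tolerates a plain UTF-8 file with no BOM
with open("TimelapseData.csv", newline="", encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    # RelativePath uses Windows separators, e.g. "02-24\My Location\Cam 1\Lokatie 1"
    rel = PureWindowsPath(row["RelativePath"].strip())
    media_path = "/".join(rel.parts + (row["File"],))
    print(row["DateTime"].strip(), media_path, row["Tenon_Soort"])
```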

What kind of data do we want to store for the partner?

The CSV seems to really capture the bulk of what a user might want out of their Timelapse annotations. But having dug into the contents of the .ddb above, the key question is: is there anything in the Timelapse SQLite databases that we might want to store for the partner that isn't already provided in the CSV export?

For example, I can imagine that the contents of ImageSetTable -> SearchTerms might be useful for analysis purposes. But maybe not. (Again, @abfleishman curious to your take on this.)

Furthermore, for reasons of provenance it might be good to store more of the Timelapse metadata with as much fidelity as possible. One really useful piece of metadata that is not included in this CSV is ImageSetTable -> VersionCompatibily (sp), which stores the actual version of Timelapse that was used for annotating. And perhaps even the field properties (like default values, or text box width?) might be useful someday... or they are just noise.

One non-trivial advantage of a CSV export is that it is an easier thing to ask for. Many users are familiar with the workflow of exporting a file from an application and then finding that file (e.g., in their Desktop or Downloads folder) to send or upload it somewhere. Getting the SQLite database files will require the partner to use Windows Explorer to find their Timelapse root folder. Or... do they need to do that anyway to get the images/videos?

rudokemper added the feature (New specs for new behavior) label Jan 30, 2025
rudokemper added this to the 11th Hour Project milestone Jan 30, 2025
rudokemper self-assigned this Jan 30, 2025
@rudokemper
Member Author

rudokemper commented Jan 30, 2025

There is one additional type of metadata missing from what we can get out of Timelapse (at least, for this partner), which is metadata about the camera traps / locations themselves: e.g. a camera ID, GPS coordinates...

In the data above, the only place to glean this is from the File and RelativePath fields, where mention is made of a "Cam x" and "Lokatie x". That is, there are no dedicated fields that store the camera ID and location ID.

So as it stands, we will need to join this Timelapse data with an external dataset. I'm not sure how the partner is currently storing this information, but I will ask. (The partner has since indicated that they maintain a spreadsheet with village, camera numbers, location, and GPS coordinates.)
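
In the meantime, the camera and location would have to be gleaned heuristically from the path. A fragile sketch, assuming the "Cam x" / "Lokatie x" naming seen in this partner's paths:

```python
import re


def glean_camera_and_location(relative_path: str) -> dict:
    """Best-effort extraction of camera/location IDs from a RelativePath value.

    Assumes path segments like "02-24\\My Location\\Cam 1\\Lokatie 1".
    """
    cam = re.search(r"Cam\s*(\d+)", relative_path, flags=re.IGNORECASE)
    loc = re.search(r"Lokatie\s*(\d+)", relative_path, flags=re.IGNORECASE)
    return {
        "camera_id": cam.group(1) if cam else None,
        "location_id": loc.group(1) if loc else None,
    }


print(glean_camera_and_location(r"02-24\My Location\Cam 1\Lokatie 1"))
# {'camera_id': '1', 'location_id': '1'}
```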

While working with this partner, to streamline this process, @mmckown set up a KoboToolbox form with the following questions for them:

  1. Date
  2. GPS location
  3. Village
  4. Who deployed?
  5. Picture of camera site
  6. Site name
  7. Camera
  8. Notes

As part of a gc-scripts-hub camera trap flow, we could request this kind of form submission data from the Kobo API and join it with incoming Timelapse data.
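
A rough sketch of what that join could look like; all column names here (camera_id, gps_latitude, etc.) are hypothetical and would depend on the form template we design:

```python
# Sketch of joining Timelapse annotations with Kobo deployment metadata.
import pandas as pd

annotations = pd.DataFrame(
    [{"File": "Cam1_1_9_2_2024_01am.mp4", "camera_id": "1", "Tenon_Soort": "Dog"}]
)
deployments = pd.DataFrame(
    [{"camera_id": "1", "gps_latitude": 3.123, "gps_longitude": -56.456, "village": "Example"}]
)

# Left join so annotations without a matching deployment record are kept
enriched = annotations.merge(deployments, on="camera_id", how="left")
print(enriched)
```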

Maybe we should come up with a generic template of this form, and recommend it as part of our GuardianConnector-supported workflow for working with camera trap data and Timelapse? (@abfleishman note that this mirrors what you were thinking about in the discussions about how to get bioacoustic sensor metadata from the client)

@abfleishman
Member

For example, I can imagine that the contents of ImageSetTable -> SearchTerms might be useful for analysis purposes. But maybe not. (Again, @abfleishman curious to your take on this.)

I think this "SearchTerms" and the entire ImageSetTable are specific to TimeLapse and may not be useful to capture separately.

Furthermore, for reasons of provenance it might be good to store more of the Timelapse metadata with as much fidelity as possible. One really useful piece of metadata that is not included in this CSV is ImageSetTable -> VersionCompatibily (sp), which stores the actual version of Timelapse that was used for annotating. And perhaps even the field properties (like default values, or text box width?) might be useful someday... or they are just noise.

One thought is around why we are ingesting these data: to store long term, to allow analysis/visualization, etc. If one of the things we want to support is allowing folks to retrieve something from GC that would allow them to open a "project" in TimeLapse again, then it seems valuable to know what is needed to "import" a project into timelapse. (All the important info might be in the template file + the CSV export. IDK.)

Maybe we should come up with a generic template of this form, and recommend it as part of our GuardianConnector-supported workflow for working with camera trap data and Timelapse? (@abfleishman note that this mirrors what you were thinking about in the discussions about how to get bioacoustic sensor metadata from the client)

Yes, that does seem interesting. I think there is probably other info that we want to capture beyond Matthew's short form, but that is from a science standpoint. And I bet there is a bunch of info that community members might want that I would not think of.

@rudokemper
Member Author

If one of the things we want to support is allowing folks to retrieve something from GC that would allow them to open a "project" in TimeLapse again, then it seems valuable to know what is needed to "import" a project into timelapse.

Good question. I think the answer is likely to be "yes", if we ask. The same question comes up about CoMapeo data all the time (where the answer is decisively "no").

Per page 91 of the Timelapse Reference Guide, it is possible to import a CSV, but only when already working with a .tdb template file which tells Timelapse what kind of data type each column represents.

So, if we decide to store the .tdb file for the purpose of allowing a user to re-set up their Timelapse project, then exporting the CSV file is superfluous: we can just ask the user to ZIP their entire root folder (which, if I'm not mistaken, they need to access anyway to get the images/videos) and extract the annotated data from the .ddb file instead of the CSV.
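
If we go that route, locating the .ddb inside the uploaded ZIP could look something like this sketch (the filename and the assumption of exactly one .ddb per project are placeholders):

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    with zipfile.ZipFile("timelapse_root_folder.zip") as zf:  # placeholder upload
        zf.extractall(tmp_dir)

    # Assumption: one .ddb per project root; fail loudly otherwise
    ddb_files = list(Path(tmp_dir).rglob("*.ddb"))
    if len(ddb_files) != 1:
        raise ValueError(f"Expected exactly one .ddb file, found {len(ddb_files)}")

    con = sqlite3.connect(ddb_files[0])
    row_count = con.execute("SELECT COUNT(*) FROM DataTable").fetchone()[0]
    print(f"{row_count} annotation rows in {ddb_files[0].name}")
    con.close()
```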

@rudokemper
Member Author

rudokemper commented May 5, 2025

Since Version 2.3.2.0 (released in June 2024), Timelapse now supports tracking of Project- and Station-level metadata (see: Timelapse Metadata Guide and release notes). This appears to allow recording much of the deployment metadata (e.g., coordinates, camera trap name, etc.) directly within Timelapse, enabling it to serve as a singular source of truth.

This could reduce redundancy, minimize potential points of failure, and help avoid edge cases associated with managing multiple upstream data sources. For example, one concern I have with using KoboToolbox for metadata collection is the need to customize forms to accommodate bespoke metadata needs. The KoboToolbox form builder UI can sometimes change field names during customization, and it's not always obvious to non-technical users that this has occurred.

However, this raises the question of how such metadata is actually inserted into a Timelapse project. In our experience, much of this information is collected in the field using pen and paper along with a handheld GPS. Is the intended workflow for users to enter this data into Timelapse manually afterward, when they get back from the field? If so, this introduces the possibility of errors, such as missing digits in lat/lon values. Is there a way to streamline or optimize this metadata collection workflow? Perhaps we could still use KoboToolbox (or similar) and merge the data into the Timelapse database using a script?

I'm planning to reach out to the Timelapse team to get their input.

@rudokemper
Member Author

Also, just documenting @abfleishman's suggestion (corroborated by a colleague) to utilize the GBIF recommendations for standardized camera trap metadata.

I am guessing that the most relevant section for our consideration is the list of column naming conventions in 3.4.1. Column naming.

@rudokemper
Member Author

rudokemper commented May 6, 2025

Zooming out: I think I’m leaning toward taking a different approach to this script, now that we know Timelapse supports Project, Station, and Deployment metadata—and that the columns for all of these (much like the image annotation data itself) can be customized at will.

We’re likely better off treating this script as not expecting any predefined structure—just as we do for all connector scripts except alerts-gcs, where the schema is rigid and well-defined.

We do know there are specific tables (or CSV exports) for ImageData, Project, Station, and Deployment, and we can ingest and write their contents to the database as-is—similar to what we do for CoMapeo, Kobo, Locus, etc.

If we try to enforce a rigid schema for Timelapse data, we’re likely to encounter mismatches—where users have unknowingly modified a structured Timelapse template to suit their use case—leading to ingestion errors.

One downside to this approach is the risk of missing data needed for front-end visualization: for instance, a station's latitude/longitude, which in Timelapse are stored in generic DecimalAny fields. But I think it's more appropriate to handle this in the front-end business logic and to manage expectations at the project level with field teams. For example, in gc-explorer, the Gallery view doesn't render if no image attachments are detected. EDIT: or even better, we could specify in the gc-explorer config which columns to use for which elements, e.g. if a Dutch-speaking user translated "latitude" to "breedtegraad", then we can specify that the values of that column should be used for latitude.

An alternative could be a semi-structured approach, as we use in e.g. the CoMapeo script. There, we do know that lat/lon fields exist under specific names. But in Timelapse, all fields are user-editable—including renaming latitude/longitude to localized translations—so there’s no reliable way to identify them without introspecting values heuristically, which is fraught.

Other opinions welcome.

@IamJeffG
Contributor

IamJeffG commented May 6, 2025

Yikes. If the upstream data format is known and consistent, then I see no need to accept open-ended data; however, if it's true that all fields are renamable, then we have no choice but to accept anything that's given.

Here's an edge case: What if today the data has a column "latitude" but tomorrow it has a column "breedtegraad"? It's one thing to allow a user-defined schema; but another to allow a dynamic schema.

One option might be to have the Windmill script take as another input the (user edited) names of the cols representing lon and lat. Editing the Timelapse schema requires also editing the script run params.
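
Something like this sketch, where the operator supplies the user-edited column names and the script normalizes them to canonical fields (parameter names and the normalization approach are assumptions, not a settled design):

```python
# Sketch: let the operator tell the script which user-defined Timelapse columns
# hold the coordinates, then copy them into canonical fields on ingestion.
def normalize_coordinates(rows: list[dict], lat_col: str, lon_col: str) -> list[dict]:
    """Copy the user-named lat/lon columns (e.g. "breedtegraad") into canonical fields."""
    for row in rows:
        row["latitude"] = row.get(lat_col)
        row["longitude"] = row.get(lon_col)
    return rows


# Example: a Dutch-language template where the station columns were renamed
stations = [{"breedtegraad": "4.1234", "lengtegraad": "-55.9876", "Station": "Lokatie 1"}]
print(normalize_coordinates(stations, lat_col="breedtegraad", lon_col="lengtegraad"))
```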

@rudokemper
Member Author

One option might be to have the Windmill script take as another input the (user edited) names of the cols representing lon and lat. Editing the Timelapse schema requires also editing the script run params.

Good idea, thanks. There might be quite a few params to define like this for the envisioned FE described in ConservationMetrics/gc-explorer#46. But I'll give this some thought.

rudokemper linked a pull request May 7, 2025 that will close this issue
rudokemper linked a pull request May 14, 2025 that will close this issue
rudokemper added the connectors (Connector scripts for ETL from upstream data sources) label May 22, 2025