Add seed file support to Browsertrix backend #2710

tw4l · 2025-07-02T19:24:10Z

Fixes #2673

Changes in this PR:

- Adds a new `file_uploads.py` module and corresponding `/files` API prefix with methods/endpoints for uploading, GETing, and deleting seed files (can be extended to other types of files moving forward)
- Seed files are supported via `CrawlConfig.config.seedFileId` on POST and PATCH endpoints. This `seedFileId` is replaced by a presigned URL when passed to the crawler by the operator
- Seed files are read when first uploaded to calculate `firstSeed` and `seedCount` and store them in the database, and these values are copied into the workflow and crawl documents when they are created.
- Logic is added to store `firstSeed` and `seedCount` for other workflows as well, and a migration is added to backfill data, to maintain consistency and fix some of the pymongo aggregations that previously assumed all workflows would have at least one `Seed` object in `CrawlConfig.seeds`
- Seed file and thumbnail storage stats are added to org stats
- Seed file and thumbnail uploads first check whether the org's storage quota has been exceeded and return a 400 if it has
- A cron background job (run weekly each Sunday at midnight by default, but configurable) is added to look for seed files at least x minutes old (1440 minutes, or 1 day, by default, but configurable) that are not in use in any workflows, and to delete them when they are found. The backend pods will ensure this k8s batch job exists when starting up and create it if it does not already exist. A database entry for each run of the job is created in the operator on job completion so that it'll appear in the `/jobs` API endpoints, but retrying of this type of regularly scheduled background job is not supported as we don't want to accidentally create multiple competing scheduled jobs.
- Adds a `min_seed_file_crawler_image` value to the Helm chart that, if set, is checked before creating a crawl from a workflow. If a workflow cannot be run, return the detail of the exception in `CrawlConfigAddedResponse.errorDetail` so that we can display the reason in the frontend
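
The list above says seed files are read once at upload time to derive `firstSeed` and `seedCount`. A minimal sketch of that calculation follows, assuming a plain-text file with one URL per line. The function name `summarize_seed_file` and the choice to skip blank lines are assumptions for illustration, not the code in `file_uploads.py`.

```python
from typing import Optional, Tuple


def summarize_seed_file(contents: bytes) -> Tuple[Optional[str], int]:
    """Return (first_seed, seed_count) for a newline-delimited URL list.

    Hypothetical helper: blank lines are ignored and surrounding
    whitespace is stripped from each URL (both are assumptions).
    """
    first_seed: Optional[str] = None
    seed_count = 0
    for raw_line in contents.decode("utf-8").splitlines():
        url = raw_line.strip()
        if not url:
            continue  # skip empty lines so they are not counted as seeds
        if first_seed is None:
            first_seed = url
        seed_count += 1
    return first_seed, seed_count


example = b"https://example.com/\n\nhttps://example.org/page\n"
first_seed, seed_count = summarize_seed_file(example)
```

Computing the two values once and storing them means workflow and crawl documents can copy them without re-reading the file from storage.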

~~## Todo~~

~~- Modify test chart crawler release back to latest once crawler with seed file support is properly released~~

tw4l · 2025-07-10T17:20:40Z

Some of the all-crawls tests toward the end of the test run in test_uploads.py are periodically returning 502. I'm not quite sure what's happening there but will keep looking into it. Otherwise this is now ready for review.

SuaYoo

Reviewing functionality of API changes against frontend-upload-seed-url-list, working well! I didn't review code or other functionality.

tw4l · 2025-07-16T21:58:09Z

Converted to draft while making sure the cron job to clean up unused seed files is working as expected. Getting close but there seems to be a permissions issue preventing the batch job from being created in the default k8s namespace:

future: <Task finished name='Task-4' coro=<BackgroundJobOps.ensure_cron_cleanup_jobs_exist() done, defined at /app/btrixcloud/background_jobs.py:478> exception=FailToCreateError([ApiException()])>
Traceback (most recent call last):
  File "/app/btrixcloud/background_jobs.py", line 480, in ensure_cron_cleanup_jobs_exist
    await self.crawl_manager.ensure_cleanup_seed_file_cron_job_exists()
  File "/app/btrixcloud/crawlmanager.py", line 257, in ensure_cleanup_seed_file_cron_job_exists
    await self.create_from_yaml(data, namespace=DEFAULT_NAMESPACE)
  File "/app/btrixcloud/k8sapi.py", line 190, in create_from_yaml
    created = await create_from_dict(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/utils/create_from_yaml.py", line 148, in create_from_dict
    raise FailToCreateError(api_exceptions)
kubernetes_asyncio.utils.create_from_yaml.FailToCreateError: Error from server (Forbidden): {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"cronjobs.batch is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"cronjobs\" in API group \"batch\" in the namespace \"default\"","reason":"Forbidden","details":{"group":"batch","kind":"cronjobs"},"code":403}

~~The job was previously in the crawlers namespace, but that meant it didn't have access to secrets it needed e.g. for database access.~~

~~Will continue to work on this tomorrow unless @ikreymer beats me to the punch~~

Edit: resolved
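
For context, a 403 like the one above generally means the service account has no RBAC rule covering the resource. The commit list mentions a rolebinding change for creating cronjobs in the default namespace, but the actual chart manifest is not shown here, so the following is only a sketch of what such a rule could look like, built as Python dicts and serialized to JSON. The name `cronjob-manager-sketch` and the verb list are assumptions; the subject is taken from the error message.

```python
import json

# Hypothetical Role allowing management of CronJobs in the "default" namespace.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "cronjob-manager-sketch", "namespace": "default"},
    "rules": [
        {
            "apiGroups": ["batch"],
            "resources": ["cronjobs"],
            # Assumed verb list; a real chart may grant fewer or more verbs.
            "verbs": ["get", "list", "create", "update", "patch", "delete"],
        }
    ],
}

# Bind the Role to the service account named in the Forbidden error.
role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "cronjob-manager-sketch", "namespace": "default"},
    "subjects": [
        {"kind": "ServiceAccount", "name": "default", "namespace": "default"}
    ],
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "Role",
        "name": role["metadata"]["name"],
    },
}

manifest_json = json.dumps([role, role_binding], indent=2)
```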

tw4l · 2025-07-17T15:47:42Z

@ikreymer Seed file cleanup is now working, tested locally, passing in CI, and more configurable via helm chart values. Ready for your eyes again.
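
To make the cleanup rule concrete, here is a small sketch of the selection step: pick seed files that are at least a configurable number of minutes old and not referenced by any workflow. The `SeedFileRecord` type, the function name, and the in-memory inputs are all assumptions for illustration; the real job queries the database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set


@dataclass
class SeedFileRecord:
    """Hypothetical stand-in for a stored seed file document."""
    id: str
    created: datetime


def find_files_to_clean_up(
    files: Iterable[SeedFileRecord],
    in_use_ids: Set[str],
    now: datetime,
    min_age_minutes: int = 1440,
) -> List[str]:
    """Return ids of files old enough to delete and unused by any workflow."""
    cutoff = now - timedelta(minutes=min_age_minutes)
    return [
        f.id for f in files
        if f.created <= cutoff and f.id not in in_use_ids
    ]


now = datetime(2025, 7, 17, 12, 0, tzinfo=timezone.utc)
files = [
    SeedFileRecord("a", now - timedelta(days=2)),      # old, unused
    SeedFileRecord("b", now - timedelta(days=2)),      # old, but in use
    SeedFileRecord("c", now - timedelta(minutes=10)),  # too new by default
]
default_result = find_files_to_clean_up(files, {"b"}, now)
nightly_result = find_files_to_clean_up(files, {"b"}, now, min_age_minutes=1)
```

Lowering the threshold to 1 minute, as done for the nightly tests, makes recently uploaded unused files eligible without changing the in-use check.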

Fixes #2673

Changes in this PR:

- Adds a new `file_uploads.py` module and corresponding `/files` API prefix with methods/endpoints for uploading, GETing, and deleting seed files (can be extended to other types of files moving forward)
- Seed files are supported via `CrawlConfig.config.seedFileId` on POST and PATCH endpoints. This `seedFileId` is replaced by a presigned URL when passed to the crawler by the operator
- Seed files are read when first uploaded to calculate `firstSeed` and `seedCount` and store them in the database, and these values are copied into the workflow and crawl documents when they are created.
- Logic is added to store `firstSeed` and `seedCount` for other workflows as well, and a migration is added to backfill data, to maintain consistency and fix some of the pymongo aggregations that previously assumed all workflows would have at least one `Seed` object in `CrawlConfig.seeds`
- Seed file and thumbnail storage stats are added to org stats
- Seed file and thumbnail uploads first check whether the org's storage quota has been exceeded and return a 400 if it has
- A cron background job (run weekly each Sunday at midnight by default, but configurable) is added to look for seed files at least x minutes old (1440 minutes, or 1 day, by default, but configurable) that are not in use in any workflows, and to delete them when they are found. The backend pods will ensure this k8s batch job exists when starting up and create it if it does not already exist. A database entry for each run of the job is created in the operator on job completion so that it'll appear in the `/jobs` API endpoints, but retrying of this type of regularly scheduled background job is not supported as we don't want to accidentally create multiple competing scheduled jobs.
- Adds a `min_seed_file_crawler_image` value to the Helm chart that, if set, is checked before creating a crawl from a workflow. If a workflow cannot be run, return the detail of the exception in `CrawlConfigAddedResponse.errorDetail` so that we can display the reason in the frontend
- Add `SeedFile` model from base `UserFile` (formerly `ImageFile`), ensure all APIs returning uploaded files return an absolute pre-signed URL (either with external origin or internal service origin)

---------

Co-authored-by: Ilya Kreymer <[email protected]>
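
The note about absolute pre-signed URLs can be illustrated with a short sketch: a relative path is joined to the request's origin when headers are available, and to an internal service origin otherwise. This is not the `resolve_relative_access_path()` implementation from the PR, whose signature is not shown here. The function name `to_absolute_url`, the `INTERNAL_ORIGIN` value, and the use of `x-forwarded-proto` with an `https` default are all assumptions.

```python
from typing import Mapping, Optional

# Assumed internal service origin, used when no request headers are given.
INTERNAL_ORIGIN = "http://backend.internal:8000"


def to_absolute_url(
    path_or_url: str, headers: Optional[Mapping[str, str]] = None
) -> str:
    """Return an absolute URL for a possibly relative presigned path."""
    # Already absolute: leave untouched.
    if path_or_url.startswith(("http://", "https://")):
        return path_or_url

    if headers and headers.get("host"):
        # Assumption: scheme comes from x-forwarded-proto, defaulting to https.
        scheme = headers.get("x-forwarded-proto", "https")
        origin = f"{scheme}://{headers['host']}"
    else:
        origin = INTERNAL_ORIGIN

    return origin + "/" + path_or_url.lstrip("/")


external = to_absolute_url("/data/seed.txt?sig=abc", {"host": "app.example.com"})
internal = to_absolute_url("/data/seed.txt?sig=abc")
```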

Resolves #2646
Depends on #2710

## Changes

(Copied from #2689)

- Allows users to specify URL list as file.
- Allow uploading a text file of URLs
- Allow specifying >100 URLs into URL list, where they will turn into an uploaded list automatically.

---------

Co-authored-by: sua yoo <[email protected]>

tw4l commented Jul 3, 2025

backend/btrixcloud/crawlconfigs.py (outdated)

tw4l force-pushed the issue-2673-seed-file-backend-support branch from d6cecdb to 690238b Compare July 3, 2025 14:45

tw4l commented Jul 3, 2025

backend/btrixcloud/models.py (outdated)

tw4l force-pushed the issue-2673-seed-file-backend-support branch 6 times, most recently from 9741bac to dc9efcf Compare July 10, 2025 16:20

tw4l changed the title ~~Work in progress: Add seed file support to Browsertrix backend~~ Add seed file support to Browsertrix backend Jul 10, 2025

tw4l marked this pull request as ready for review July 10, 2025 17:19

tw4l requested review from ikreymer and SuaYoo July 10, 2025 17:20

SuaYoo mentioned this pull request Jul 14, 2025

feat: Specify seed list as file in UI #2689

Closed

SuaYoo changed the base branch from main to issue-2673-seed-file July 14, 2025 19:41

tw4l mentioned this pull request Jul 15, 2025

[Task]: Include seed file size in org dashboard #2733

Open

tw4l force-pushed the issue-2673-seed-file-backend-support branch 2 times, most recently from a31fceb to 69a5c5a Compare July 16, 2025 15:22

SuaYoo approved these changes Jul 16, 2025

tw4l marked this pull request as draft July 16, 2025 21:56

tw4l commented Jul 16, 2025

.github/workflows/k3d-nightly-ci.yaml (outdated)

ikreymer reviewed Jul 17, 2025

backend/btrixcloud/file_uploads.py (outdated)

tw4l force-pushed the issue-2673-seed-file-backend-support branch from 5031352 to ac499f4 Compare July 17, 2025 15:13

tw4l marked this pull request as ready for review July 17, 2025 15:47

tw4l force-pushed the issue-2673-seed-file-backend-support branch from b321e35 to 34efaeb Compare July 17, 2025 16:07

tw4l and others added 24 commits July 22, 2025 18:36

Don't reuse match_query variable name

8c6693d

Add test for deleting in-use seed file

8f4e7b8

Make seed file cleanup job cron schedule configurable

df651a5

Add nightly test for seed file cleanup jobs

af93959

TEMP FOR TESTING: Run nightly tests on this branch

db8d212

Fix handling of seed cleanup jobs

4000f06

Fixup

d4fb246

Remove ops-configs volume from background cron job yaml

8e27e34

Try running job in default namespace

8292225

rolebinding: allow creating cronjobs in default namespace

e8ffca8

Add back ops-configs volume

b74f698

Cleanup cronjobs in default namespace on reset

aad591e

Fix cleanup cron job and tests for it

79aea71

Also makes delta for how old unused files need to be before getting cleaned up configurable via helm chart so that we can set it to 1 minute for nightly tests.

Rename ImageFile -> UserFile

9831001

This is probably a step forward but may still be confusing until we rework collection thumbnails to be stored in the new files db collection as well and rationalize into one set of models.

Extend seed file cleanup test to include file not cleaned up

425f2da

Remove extra space

13fbc0a

Remove reundant resolve_internal_access_path call

6a28aa1

Update .github/workflows/k3d-nightly-ci.yaml

aee2974

Co-authored-by: Tessa Walsh <[email protected]>

- add resolve_relative_access_path() which will resolve relative urls…

1943b87

… either to current origin or to internal origin, if no headers provided - use resolve_relative_access_path() for both file uploads and collection thumbnails

make useruploadfile derive from userfile

b3bbbc9

fix namespace on seed cleanup cronjob check

8dd5f00

pass headers for all collection ops

abe9d8e

further cleanup:

782ae28

- consolidate UserUploadFile into SeedFile, can split again if needed - add get_absolute_presigned_url() to UserFile for convenience in getting absolute url directly

tests: add host header to be sure of exact host

1b6b4e0

ikreymer force-pushed the issue-2673-seed-file-backend-support branch from a0d3679 to 1b6b4e0 Compare July 23, 2025 01:36

ikreymer merged commit 3d38669 into issue-2673-seed-file Jul 23, 2025
22 checks passed

ikreymer deleted the issue-2673-seed-file-backend-support branch July 23, 2025 02:07

ikreymer mentioned this pull request Jul 23, 2025

feat: Frontend upload seed url list #2761

Merged
