
Conversation


@tw4l tw4l commented Nov 18, 2025

Fixes #2957

Full backend and frontend implementation, with a new email notification to org admins when a crawl is paused because an org quota has been reached.

Backend changes

  • Modify the operator to auto-pause crawls when quotas are reached or archiving is disabled, rather than stopping them (see the sketch after this list)
  • Add new crawl states: paused_storage_quota_reached, paused_time_quota_reached, paused_org_readonly
  • Add uploaded WACZs to org storage totals immediately after upload so that auto-paused crawls will actually put the org's bytesStored above the storage quota
  • Send an email from a new template to all org admins when a crawl is auto-paused, with information about what to do
  • Fix datetime deprecation in tests
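
A minimal sketch of the auto-pause decision, using the new state names from this PR; the function shape, the parameter names, and the convention that a quota of 0 means "no quota set" are illustrative assumptions, not the actual operator code:

from typing import Optional

def auto_pause_state(
    bytes_stored: int,
    pending_size: int,
    storage_quota: int,
    exec_seconds: int,
    exec_seconds_quota: int,
    org_read_only: bool,
) -> Optional[str]:
    """Return the paused state a running crawl should transition to, or None.

    Pending (not-yet-uploaded) WACZ data counts toward the storage check,
    and a quota of 0 means no quota is set.
    """
    if org_read_only:
        return "paused_org_readonly"
    if storage_quota and bytes_stored + pending_size >= storage_quota:
        return "paused_storage_quota_reached"
    if exec_seconds_quota and exec_seconds >= exec_seconds_quota:
        return "paused_time_quota_reached"
    return None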

Updated nightly tests all pass: https://github.com/webrecorder/browsertrix/actions/runs/19684324914

Frontend changes

  • Add new paused crawl states
  • Update checks throughout the frontend for whether a crawl is paused to compare against all paused states

Dependencies

Relies on crawler changes introduced in webrecorder/browsertrix-crawler#919

Out of scope

Crawl workflow counts are a bit off: all crawls that complete are counted as successful regardless of state, and workflow storage counts are sometimes incremented incorrectly. I started trying to address that in this branch, but it's a bit involved and may require a migration, so it's best handled separately, I think. Issue: #3011

@tw4l tw4l force-pushed the issue-2957-pause-crawl-on-quota-reached branch 6 times, most recently from 4e5d015 to 6730c7f Compare November 25, 2025 17:03
@tw4l tw4l marked this pull request as ready for review November 25, 2025 20:14
Comment on lines +1410 to +1413

# sizes = await redis.hkeys(f"{crawl.id}:size")
# for size in sizes:
# await redis.hmset(f"{crawl.id}:size", {size: 0 for size in sizes})
Member Author

Suggested change
# sizes = await redis.hkeys(f"{crawl.id}:size")
# for size in sizes:
# await redis.hmset(f"{crawl.id}:size", {size: 0 for size in sizes})

Remove before merging

Comment on lines +1543 to +1551
print(f"pending size: {pending_size}", flush=True)
print(f"status.filesAdded: {status.filesAdded}", flush=True)
print(f"status.filesAddedSize: {status.filesAddedSize}", flush=True)
print(f"total: {total_size}", flush=True)
print(
f"org quota: {crawl.org.bytesStored + stats.size} <= {crawl.org.quotas.storageQuota}",
flush=True,
)

Member Author

Suggested change
print(f"pending size: {pending_size}", flush=True)
print(f"status.filesAdded: {status.filesAdded}", flush=True)
print(f"status.filesAddedSize: {status.filesAddedSize}", flush=True)
print(f"total: {total_size}", flush=True)
print(
f"org quota: {crawl.org.bytesStored + stats.size} <= {crawl.org.quotas.storageQuota}",
flush=True,
)

Remove before merging; useful for testing

@tw4l tw4l requested review from SuaYoo, emma-sg and ikreymer November 25, 2025 20:15
@tw4l
Member Author

tw4l commented Nov 25, 2025

Tagging @emma-sg @SuaYoo for review in addition to @ikreymer, with particular interest in getting your eyes on the frontend, email, and email copy parts of this. Thanks!

Member

@SuaYoo SuaYoo left a comment

Nice! Still doing manual testing; my initial impression is that it's probably worth adding an isPaused helper to utils/crawler.

export function isPaused({ state }: { state: string | null }) {
  return state && (PAUSED_STATES as readonly string[]).includes(state);
}

@ikreymer
Member

We want to send the e-mails multiple times if a crawl reaches quota, is resumed, then reaches quota again, right?
If so, we should also clear autoPausedEmailsSent when the crawl is running again
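
Something along those lines could look like this; a sketch only, treating autoPausedEmailsSent as a single boolean, with the send hook and function names made up for illustration:

AUTO_PAUSED_STATES = {
    "paused_storage_quota_reached",
    "paused_time_quota_reached",
    "paused_org_readonly",
}

def update_email_flag(state, emails_sent, send_admin_email):
    """Email org admins at most once per pause; re-arm on resume."""
    if state in AUTO_PAUSED_STATES and not emails_sent:
        send_admin_email()  # explain which quota paused the crawl
        return True
    if state == "running":
        return False  # cleared: a later auto-pause emails admins again
    return emails_sent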

@tw4l
Member Author

tw4l commented Nov 26, 2025

Nice! Still doing manual testing; my initial impression is that it's probably worth adding an isPaused helper to utils/crawler.

export function isPaused({ state }: { state: string | null }) {
  return state && (PAUSED_STATES as readonly string[]).includes(state);
}

I added a helper, but made it accept a string or null rather than an object with a state property, as none of the uses of this take an object with that key. Take a look and let me know what you think.

@tw4l
Member Author

tw4l commented Nov 26, 2025

We want to send the e-mails multiple times if a crawl reaches quota, is resumed, then reaches quota again, right? If so, we should also clear autoPausedEmailsSent when the crawl is running again

Done, and now storing this state in the db to be more reliable.
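
For reference, persisting that flag on the crawl document might look roughly like this; a sketch assuming a Mongo-style async collection (e.g. motor), with autoPausedEmailsSent stored as a boolean and the helper names made up for illustration:

async def mark_auto_paused_emails_sent(crawls, crawl_id):
    """Atomically set the flag; returns True only for the first caller,
    so the notification is sent at most once per pause."""
    res = await crawls.find_one_and_update(
        {"_id": crawl_id, "autoPausedEmailsSent": {"$ne": True}},
        {"$set": {"autoPausedEmailsSent": True}},
    )
    return res is not None

async def clear_auto_paused_emails_sent(crawls, crawl_id):
    """Re-arm notifications once the crawl is running again."""
    await crawls.update_one(
        {"_id": crawl_id}, {"$set": {"autoPausedEmailsSent": False}}
    )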

@SuaYoo SuaYoo self-requested a review November 26, 2025 19:19
Member

@SuaYoo SuaYoo left a comment

Frontend portion looks good!

@tw4l tw4l force-pushed the issue-2957-pause-crawl-on-quota-reached branch from 7726a59 to 0ad1644 Compare November 26, 2025 20:34
Member

@emma-sg emma-sg left a comment

Email language looks good! Left a few suggestions: one splits a sentence into two, and the rest just use curly quotes or remove unused code. Nice work!

I'll take another look at the frontend & backend changes; just wanted to get you some feedback on the email template now.

tw4l and others added 19 commits November 26, 2025 16:03
Needs to be tested, just pushing as-is so that I can pick it up
next week. There's an issue in local testing where crawls sometimes
appear to be twice as big as they really are, which is making
Browsertrix think the storage quota is reached prematurely. I
haven't yet pinned down the cause of this and it seems intermittent.
(#3013)

… pending, un-uploaded size

- use pending size to determine if quota reached
- also request pause to be set before assuming paused state
- also ensure data is actually committed before shutting down pods (in
case of any edge cases)
- clear paused flag in redis after crawler pods shutdown
- add OpCrawlStats to avoid adding unnecessary profile_update to public
API

this assumes crawler changes to support: clearing size after WACZ
upload, ensuring upload happens if a pod starts while the crawl is paused

---------

Co-authored-by: Tessa Walsh <[email protected]>
This is much more reliable, prevents duplicate emails as was
sometimes happening before, and makes it easier to clear the state
when a crawl is unpaused.
@SuaYoo SuaYoo force-pushed the issue-2957-pause-crawl-on-quota-reached branch from d2cba1b to 97dd148 Compare November 27, 2025 00:03
Development

Successfully merging this pull request may close these issues.

[Feature]: When a quota is reached, the crawl should be paused instead of stopped.