Pause crawls instead of stopping when quotas are reached or archiving is disabled #2997
Conversation
Force-pushed from 4e5d015 to 6730c7f
```python
# sizes = await redis.hkeys(f"{crawl.id}:size")
# for size in sizes:
#     await redis.hmset(f"{crawl.id}:size", {size: 0 for size in sizes})
```
Remove before merging
```python
print(f"pending size: {pending_size}", flush=True)
print(f"status.filesAdded: {status.filesAdded}", flush=True)
print(f"status.filesAddedSize: {status.filesAddedSize}", flush=True)
print(f"total: {total_size}", flush=True)
print(
    f"org quota: {crawl.org.bytesStored + stats.size} <= {crawl.org.quotas.storageQuota}",
    flush=True,
)
```
Remove before merging; useful for testing.
SuaYoo left a comment:
Nice! Still doing manual testing; my initial impression is it's probably worth adding an `isPaused` helper to `utils/crawler`:
```ts
export function isPaused({ state }: { state: string | null }) {
  return state && (PAUSED_STATES as readonly string[]).includes(state);
}
```
We want to send the e-mails multiple times if a crawl reaches quota, then is resumed, then reaches quota again, right?
I added a helper but made it accept a string or null rather than an object with `state`.
Done, and now storing this state in the db to be more reliable.
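For context on the approach: a minimal sketch of what storing that state in the db could look like, assuming a per-crawl flag that is set when the quota email goes out and cleared on unpause. The `quotaEmailSent` field and both helper names here are hypothetical, not the PR's actual code; the calls assume a MongoDB-style async driver like the one Browsertrix uses.

```python
# Sketch only: the PR stores some equivalent flag in the database so the
# quota email goes out once per pause and is re-armed on resume. The
# "quotaEmailSent" field and both helpers are hypothetical names.
async def maybe_send_quota_email(crawls_db, mailer, crawl):
    doc = await crawls_db.find_one({"_id": crawl.id})
    if doc and doc.get("quotaEmailSent"):
        return  # already notified org admins for this pause

    await mailer.send_quota_reached_email(crawl)
    await crawls_db.update_one(
        {"_id": crawl.id}, {"$set": {"quotaEmailSent": True}}
    )


async def clear_quota_email_state(crawls_db, crawl_id):
    # Called on unpause, so a later quota hit triggers the email again
    await crawls_db.update_one(
        {"_id": crawl_id}, {"$set": {"quotaEmailSent": False}}
    )
```

Presumably keeping this in the db rather than in memory is what makes it more reliable: the flag survives operator restarts, which in-memory state would not.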
SuaYoo left a comment:
Frontend portion looks good!
Force-pushed from 7726a59 to 0ad1644
emma-sg left a comment:
Email language looks good! Left a few suggestions: one splits a sentence into two, and a few just use curly quotes or remove unused code. Nice work!
I'll take another look at the frontend & backend changes; just wanted to get you some feedback on the email template now.
Needs to be tested; just pushing as-is so that I can pick it up next week. There's an issue in local testing where crawls sometimes appear to be twice as big as they really are, which makes Browsertrix think the storage quota has been reached prematurely. I haven't yet pinned down the cause, and it seems intermittent.
#3013) … pending, un-uploaded size

- use pending size to determine if quota reached (see the sketch after this list)
- also request pause to be set before assuming paused state
- also ensure data is actually committed before shutting down pods (in case of any edge cases)
- clear paused flag in redis after crawler pods shut down
- add OpCrawlStats to avoid adding unnecessary profile_update to public API

This assumes changes in crawler to support: clearing size after WACZ upload, ensuring upload happens if a pod starts when the crawl is paused.

---------

Co-authored-by: Tessa Walsh <[email protected]>
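To make the "use pending size to determine if quota reached" item concrete, here is a rough sketch of such a check, reusing the variable names from the debug prints earlier in this review. The function itself, the redis key layout, and the exact comparison are assumptions for illustration, not the PR's actual code.

```python
# Sketch only: variable names mirror the debug prints above; the helper,
# the redis key layout, and the comparison are assumptions.
async def storage_quota_reached(redis, crawl, status) -> bool:
    if not crawl.org.quotas.storageQuota:
        return False  # org has no storage quota set

    # Size of data crawler pods have written but not yet uploaded as
    # WACZ files, tracked per pod in a redis hash (assumed layout)
    sizes = await redis.hvals(f"{crawl.id}:size")
    pending_size = sum(int(size) for size in sizes)

    # Already-uploaded WACZ data for this crawl plus pending data
    total_size = status.filesAddedSize + pending_size

    # Pause (rather than stop) once the org's stored bytes plus this
    # crawl's total would meet or exceed the org storage quota
    return crawl.org.bytesStored + total_size >= crawl.org.quotas.storageQuota
```

Counting the pending, un-uploaded size alongside the committed size is what keeps the crawl from overshooting the quota between WACZ uploads.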
This is much more reliable, prevents duplicate emails as was sometimes happening before, and makes it easier to clear the state when a crawl is unpaused.
Co-authored-by: Emma Segal-Grossman <[email protected]>
Force-pushed from d2cba1b to 97dd148
Fixes #2957
Full backend and frontend implementation, with a new email notification to org admins when a crawl is paused because an org quota has been reached.
Backend changes
- New paused crawl states: `paused_storage_quota_reached`, `paused_time_quota_reached`, `paused_org_readonly` (a sketch follows below)
- Crawls are paused rather than stopped when the org's `bytesStored` goes above the storage quota
- Updated nightly tests all pass: https://github.com/webrecorder/browsertrix/actions/runs/19684324914
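As a rough illustration of the first item (and of the `isPaused` helper discussed above), the new states might be grouped along these lines. Only the three new state names come from the PR; the `PAUSED_STATES` tuple, the plain `paused` entry, and the `is_paused` helper are assumptions for illustration.

```python
# Sketch only: the three new state names come from the PR description;
# the tuple name, the plain "paused" entry, and the helper are assumed.
PAUSED_STATES = (
    "paused",
    "paused_storage_quota_reached",
    "paused_time_quota_reached",
    "paused_org_readonly",
)


def is_paused(state: str | None) -> bool:
    """Return True if a crawl is in any paused state."""
    return state in PAUSED_STATES
```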
Frontend changes
Dependencies
Relies on crawler changes introduced in webrecorder/browsertrix-crawler#919
Out of scope
Crawl workflow counts are a bit off: all crawls that complete are counted as successful regardless of state, and workflow storage counts are sometimes incremented incorrectly. I started trying to address that in this branch, but it's fairly involved and may require a migration, so it's best handled separately, I think. Issue: #3011