perf(scheduler): stream snapshot JSON to avoid OOM on large clusters #1564
Conversation
📝 Walkthrough

The snapshot endpoint refactoring replaces in-memory JSON marshaling with streaming JSON directly into a ZIP file. A new internal JSON streaming abstraction emits object fields and slices incrementally. The HTTP handler creates a ZIP entry for `snapshot.json` and streams the snapshot into it.

Changes: Snapshot Endpoint Streaming Refactor
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/scheduler/plugins/snapshot/snapshot.go (1)
Lines 100-118: ⚡ Quick win: Use `defer` for ZIP cleanup once the writer is created.

`zipWriter.Close()` is currently duplicated across the success and failure paths. Deferring the close right after `Create()` succeeds makes cleanup deterministic and reduces the chance of missing it on a future early return. As per coding guidelines, "Use `defer` for cleanup and state restoration".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/scheduler/plugins/snapshot/snapshot.go` around lines 100 - 118, After zipWriter is created and jsonWriter is assigned (after zipWriter.Create(SnapshotFileName) succeeds), add a deferred cleanup that closes zipWriter and logs any close error instead of calling zipWriter.Close() in both success and error paths; update the code around zipWriter, jsonWriter, SnapshotFileName and sp.writeSnapshot(...) to remove the duplicate explicit Close() calls and rely on the deferred func() { if err := zipWriter.Close(); err != nil { log.InfraLogger.Errorf("Error closing snapshot zip: %v", err) } } to ensure deterministic cleanup and error logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/scheduler/plugins/snapshot/snapshot.go`:
- Around line 285-290: The function writeListedSlice currently swallows listing
errors by turning them into empty slices; instead propagate the failure so
callers can detect it: in writeListedSlice (with signature writeListedSlice[T
any](object *jsonObjectWriter, fieldName string, list func() ([]T, error),
errorMessage string)) remove the line that sets values = []T{} and return the
original err (after logging) so the error bubbles up; alternatively, if you
intentionally want the JSON to show null, explicitly write a null field on
object (use the jsonObjectWriter null/SetNull/WriteNullField method) and then
return nil—do not silently replace errors with an empty array.
- Around line 188-191: The code currently clears discoverySnapshot.Resources on
any error from discoveryClient.ServerGroupsAndResources(), discarding partial
results; change this so you only set discoverySnapshot.Resources = nil if the
returned slice itself is nil—i.e., call
discoveryClient.ServerGroupsAndResources(), log the error as you do with
log.InfraLogger.V(2).Warnf, but do not overwrite discoverySnapshot.Resources
when it contains partial data (leave the returned slice intact), only set it to
nil when the function actually returned a nil slice.
📒 Files selected for processing (1)
pkg/scheduler/plugins/snapshot/snapshot.go
📊 Performance Benchmark Results
Comparing PR (
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
@yysindi Thanks for the fix!
The `/get-snapshot` endpoint was buffering the entire snapshot as a JSON `[]byte` via `json.Marshal`, then copying it again via a `string()` conversion before writing to the zip — three copies of multi-GB data in memory for large clusters (682 nodes, 15K pods). This caused OOM kills even at a 32Gi memory limit. Stream JSON per-element directly into the zip writer using a `jsonStream` helper that encodes one object at a time. Peak memory drops from ~3x the JSON payload to just the raw Go objects plus trivial streaming buffers (~1x).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: yysindi <[email protected]>
d338e7d to 40d7a4a (Compare)
1b591f4
Successfully created backport PR for
Description
The `/get-snapshot` endpoint buffers the entire cluster snapshot as a JSON `[]byte` via `json.Marshal`, then copies it again via a `string()` conversion before writing to the zip writer — three copies of multi-GB data in memory for large clusters. On a 682-node / 15K-pod cluster this causes OOM kills even at a 32Gi memory limit.

This PR streams JSON per-element directly into the zip writer using a `jsonStream`/`jsonObjectWriter` abstraction that encodes one K8s object at a time. Peak memory drops from ~3x the JSON payload to just the raw Go objects held by the informer cache plus trivial streaming buffers (~1x).

Before
After
Checklist
Breaking Changes
None. The HTTP response format is identical (a ZIP containing `snapshot.json`). Consumers require no changes.

Additional Notes
- The `jsonStream` helper uses `json.Encoder.Encode` per-element, which appends a `\n` after each value. This produces valid JSON with slightly different whitespace than `json.Marshal` — any JSON parser handles it correctly.
- OOM kills were observed on `/get-snapshot` even with a 32Gi memory limit. This change eliminates the serialization overhead entirely.