schedule, server: harden scheduler config input validation #10311
bufferflies wants to merge 1 commit into master
Conversation
Signed-off-by: tongjian <[email protected]>
📝 Walkthrough

This PR introduces comprehensive input validation and type-safe handling across multiple scheduler components. The changes add validation for engine/rule/alias/timeout in balance_range, type-checking for ranges and store-leader-id in grant_leader/grant_hot_region/evict_leader, nil-safety checks, and standardization of HTTP error responses from 500 to 400 for invalid inputs.
Actionable comments posted: 4
🧹 Nitpick comments (1)
server/api/scheduler.go (1)
Lines 128-133: Consider using a different HTTP status code. HTTP 406 (Not Acceptable) is typically used for content negotiation failures (Accept header mismatch). For "scheduler config handler is unavailable", HTTP 503 (Service Unavailable) or 500 might be more semantically appropriate.
However, this is a minor concern and doesn't affect functionality.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 121-134: The timeout parsing branch in BalanceRange handler
currently accepts "0s" and negative durations; after parsing timeoutStr with
time.ParseDuration (used to set job.Timeout) add a check that the parsed
duration is strictly positive and return a 400 via handler.rd.JSON with an
explanatory message (e.g., "timeout must be > 0") if timeout <= 0; ensure this
validation happens before assigning job.Timeout so invalid values are rejected
consistently.
In `@pkg/schedule/schedulers/evict_leader.go`:
- Around line 173-183: The current pauseLeaderTransferIfStoreNotExist returns
true for the case where the store exists (so callers cannot tell if this request
actually paused the store), causing updateConfig to unconditionally call
resumeLeaderTransferIfExist on 4xx early returns and clear an existing pause;
change the logic so pauseLeaderTransferIfStoreNotExist (and its callers in
updateConfig) track and return whether this particular request performed the
PauseLeaderTransfer call (true only when PauseLeaderTransfer was invoked
successfully), and update every early-return cleanup branch in updateConfig to
call resumeLeaderTransferIfExist(id) only when that returned flag is true;
ensure the same guard is applied to the other referenced range (lines ~431-450)
so resume is not called for stores you didn’t pause.
In `@pkg/schedule/schedulers/grant_leader.go`:
- Around line 268-296: The code consumes "ranges" before ensuring a valid
store_id; update the handler in grant_leader.go to first validate that
input["store_id"] exists and is the expected type (e.g., int/float64/string as
your API expects) and return the same ResumeLeaderTransfer + HTTP 400 path on
failure, then only proceed to parse rangesVal (refer to symbols rangesVal,
input, handler.config.ResumeLeaderTransfer, handler.config.getRanges) and append
to args; ensure you unlock the config and call handler.rd.JSON on the new error
path exactly like the existing ranges-type error branches so malformed payloads
cannot let a range token be misinterpreted as a store_id.
- Around line 276-291: The current update handler resumes leader transfer
unconditionally on error paths even for stores that were not paused by this
request; modify the logic in the grant leader update flow (inside the request
parsing loop that handles StoreIDWithRanges and the branch building ranges) to
track whether you called handler.config.cluster.PauseLeaderTransfer (e.g., a
boolean like pausedByReq per store or overall) and only call
handler.config.cluster.ResumeLeaderTransfer(id, constant.Out) when that flag is
true; ensure the flag is set immediately after calling PauseLeaderTransfer and
cleared on successful commit so all error returns (the JSON bad-request
responses) only resume transfers for stores paused by this request, while
preserving the existing handler.config.Lock()/Unlock() usage around those
cluster calls.
---
Nitpick comments:
In `@server/api/scheduler.go`:
- Around line 128-133: Replace the inappropriate HTTP 406 responses in the
scheduler config handler with a more semantically correct status (use
http.StatusServiceUnavailable (503) or http.StatusInternalServerError (500));
update both h.r.JSON(w, http.StatusNotAcceptable, err.Error()) and the
subsequent h.r.JSON(w, http.StatusNotAcceptable, "scheduler config handler is
unavailable") calls in the scheduler config handler to return the chosen 5xx
status so errors and the unavailable message use the correct HTTP code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cd95b201-f573-4109-8cc9-8fc69997e5f8
📒 Files selected for processing (12)

- pkg/schedule/schedulers/balance_range.go
- pkg/schedule/schedulers/balance_range_test.go
- pkg/schedule/schedulers/evict_leader.go
- pkg/schedule/schedulers/evict_leader_test.go
- pkg/schedule/schedulers/grant_hot_region.go
- pkg/schedule/schedulers/grant_hot_region_test.go
- pkg/schedule/schedulers/grant_leader.go
- pkg/schedule/schedulers/grant_leader_test.go
- pkg/schedule/schedulers/scheduler_controller.go
- pkg/schedule/schedulers/transfer_witness_leader.go
- pkg/schedule/schedulers/transfer_witness_leader_test.go
- server/api/scheduler.go
```go
if timeoutVal, exists := input["timeout"]; exists {
	timeoutStr, ok := timeoutVal.(string)
	if !ok {
		handler.rd.JSON(w, http.StatusBadRequest, "timeout must be a string")
		return
	}
	if len(timeoutStr) > 0 {
		timeout, err := time.ParseDuration(timeoutStr)
		if err != nil {
			handler.rd.JSON(w, http.StatusBadRequest, fmt.Sprintf("timeout:%s is invalid", timeoutStr))
			return
		}
		job.Timeout = timeout
	}
}
```
Reject non-positive timeout values too.
time.ParseDuration accepts "0s" and negative durations, so this still lets malformed payloads create jobs that immediately finish in shouldFinished(). Returning 400 unless timeout > 0 would keep the new validation behavior consistent.
Suggested fix:

```diff
 if timeoutVal, exists := input["timeout"]; exists {
 	timeoutStr, ok := timeoutVal.(string)
 	if !ok {
 		handler.rd.JSON(w, http.StatusBadRequest, "timeout must be a string")
 		return
 	}
 	if len(timeoutStr) > 0 {
 		timeout, err := time.ParseDuration(timeoutStr)
 		if err != nil {
 			handler.rd.JSON(w, http.StatusBadRequest, fmt.Sprintf("timeout:%s is invalid", timeoutStr))
 			return
 		}
+		if timeout <= 0 {
+			handler.rd.JSON(w, http.StatusBadRequest, "timeout must be greater than 0")
+			return
+		}
 		job.Timeout = timeout
 	}
 }
```

As per coding guidelines, "HTTP handlers validate payloads and return proper status codes; avoid panics."
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
if timeoutVal, exists := input["timeout"]; exists {
	timeoutStr, ok := timeoutVal.(string)
	if !ok {
		handler.rd.JSON(w, http.StatusBadRequest, "timeout must be a string")
		return
	}
	if len(timeoutStr) > 0 {
		timeout, err := time.ParseDuration(timeoutStr)
		if err != nil {
			handler.rd.JSON(w, http.StatusBadRequest, fmt.Sprintf("timeout:%s is invalid", timeoutStr))
			return
		}
		if timeout <= 0 {
			handler.rd.JSON(w, http.StatusBadRequest, "timeout must be greater than 0")
			return
		}
		job.Timeout = timeout
	}
}
```
```diff
 func (conf *evictLeaderSchedulerConfig) pauseLeaderTransferIfStoreNotExist(id uint64) (bool, error) {
 	conf.RLock()
 	defer conf.RUnlock()
-	if _, exist := conf.StoreIDWithRanges[id]; !exist {
-		if err := conf.cluster.PauseLeaderTransfer(id, constant.In); err != nil {
-			return exist, err
-		}
+	if _, exist := conf.StoreIDWithRanges[id]; exist {
+		return true, nil
 	}
-	return false, nil
+	if err := conf.cluster.PauseLeaderTransfer(id, constant.In); err != nil {
+		return false, err
+	}
+	return true, nil
 }
```
Track whether this request paused the store, not whether the store already exists.
pauseLeaderTransferIfStoreNotExist now returns true for an existing config, but all the 4xx cleanup paths still call resumeLeaderTransferIfExist(id), which resumes unconditionally. That means a bad update against an already-evicted store can clear the old pause and silently disable the live config.
Suggested fix:

```diff
 exist, err = handler.config.pauseLeaderTransferIfStoreNotExist(id)
 if err != nil {
 	handler.rd.JSON(w, http.StatusInternalServerError, err.Error())
 	return
 }
+pausedForUpdate := inputHasStoreID && !exist
 ...
-handler.config.resumeLeaderTransferIfExist(id)
+if pausedForUpdate {
+	handler.config.resumeLeaderTransferIfExist(id)
+}
```

Apply the same guard to the other early-return cleanup branches in updateConfig.
As per coding guidelines, "HTTP handlers validate payloads and return proper status codes; avoid panics."
Also applies to: 431-450
```go
rangesVal, hasRanges := input["ranges"]
if hasRanges {
	var ranges []string
	switch val := rangesVal.(type) {
	case []string:
		ranges = val
	case []any:
		ranges = make([]string, 0, len(val))
		for _, item := range val {
			s, ok := item.(string)
			if !ok {
				handler.config.Lock()
				handler.config.cluster.ResumeLeaderTransfer(id, constant.Out)
				handler.config.Unlock()
				handler.rd.JSON(w, http.StatusBadRequest, errs.ErrSchedulerConfig.FastGenByArgs("ranges"))
				return
			}
			ranges = append(ranges, s)
		}
	default:
		handler.config.Lock()
		handler.config.cluster.ResumeLeaderTransfer(id, constant.Out)
		handler.config.Unlock()
		handler.rd.JSON(w, http.StatusBadRequest, errs.ErrSchedulerConfig.FastGenByArgs("ranges"))
		return
	}
	args = append(args, ranges...)
} else if exists {
	args = append(args, handler.config.getRanges(id)...)
```
Require a valid store_id before consuming ranges.
If store_id is missing or wrong-typed, this block still appends ranges into args, so buildWithArgs can treat the first range token as the store ID instead of rejecting the payload. That can update the wrong store on malformed requests.
Suggested fix:

```diff
 rangesVal, hasRanges := input["ranges"]
 if hasRanges {
+	if !ok {
+		handler.rd.JSON(w, http.StatusBadRequest, errs.ErrSchedulerConfig.FastGenByArgs("id"))
+		return
+	}
 	var ranges []string
 	switch val := rangesVal.(type) {
```

As per coding guidelines, "HTTP handlers validate payloads and return proper status codes; avoid panics."
```go
for _, item := range val {
	s, ok := item.(string)
	if !ok {
		handler.config.Lock()
		handler.config.cluster.ResumeLeaderTransfer(id, constant.Out)
		handler.config.Unlock()
		handler.rd.JSON(w, http.StatusBadRequest, errs.ErrSchedulerConfig.FastGenByArgs("ranges"))
		return
	}
	ranges = append(ranges, s)
}
default:
	handler.config.Lock()
	handler.config.cluster.ResumeLeaderTransfer(id, constant.Out)
	handler.config.Unlock()
	handler.rd.JSON(w, http.StatusBadRequest, errs.ErrSchedulerConfig.FastGenByArgs("ranges"))
```
Only resume leader transfer when this request actually paused it.
For stores already present in StoreIDWithRanges, the earlier branch does not call PauseLeaderTransfer, but these new error paths still call ResumeLeaderTransfer. A bad update can therefore clear the existing pause and leave an active grant-leader config unenforced.
Suggested fix
- var exists bool
+ var exists bool
+ var pausedForUpdate bool
var id uint64
idFloat, ok := input["store_id"].(float64)
if ok {
id = (uint64)(idFloat)
handler.config.RLock()
if _, exists = handler.config.StoreIDWithRanges[id]; !exists {
if err := handler.config.cluster.PauseLeaderTransfer(id, constant.Out); err != nil {
handler.config.RUnlock()
handler.rd.JSON(w, http.StatusInternalServerError, err.Error())
return
}
+ pausedForUpdate = true
}
handler.config.RUnlock()
args = append(args, strconv.FormatUint(id, 10))
}- handler.config.cluster.ResumeLeaderTransfer(id, constant.Out)
+ if pausedForUpdate {
+ handler.config.cluster.ResumeLeaderTransfer(id, constant.Out)
+ }
...
- handler.config.cluster.ResumeLeaderTransfer(id, constant.Out)
+ if pausedForUpdate {
+ handler.config.cluster.ResumeLeaderTransfer(id, constant.Out)
+ }As per coding guidelines, "HTTP handlers validate payloads and return proper status codes; avoid panics."
@bufferflies: The following test failed.
What problem does this PR solve?
Issue Number: None
What is changed and how does it work?
Check List
Tests
Code changes
Release note
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests