Fault update sources#4971
Conversation
FIS network faults impair traffic between ECS tasks and domain-named
services. Before this change, the FIS resolves domains to IPs once at
fault start and hands them to the ECS Agent, which installs tc rules
against that fixed set and never revisits it. The fault is incomplete
for ALB, API Gateway, SQS, and third-party domains:
DNS resolvers do not always return all IPs per lookup, and DNS records
rotate the IPs during the test.
Add POST /fault/v1/{network-latency|network-packet-loss}/add-sources,
which installs additional tc filters against a running fault without
restarting it. Cap the combined IP count at 16 to keep calls inside
the shared 5s requestTimeoutSeconds budget.
Add 18 //go:build linux && sudo tests for the add-sources endpoint to handlers_sudo_integ_test.go. Tests issue real tc commands against the host's default network interface and require root. Run them with `make run-sudo-tests`. Reuse the file's existing startServer, cleanupLatencyAndPacketLossFaults, and skipForUnsupportedTc helpers. Add an addSourcesEndpoint URL format and two new routes in startServer for AddSourcesNetworkLatency and AddSourcesNetworkPacketLoss. Coverage: - Happy path on latency and packet loss, with Sources, SourcesToFilter, and CIDR inputs - 404 on no fault running, wrong fault type, and after stop - 400 on empty body, over cap, within-request overlap, invalid IP, malformed JSON, and missing body; 200 at the cap boundary - Duplicate IPs across requests, add concurrent with stop, two adds concurrent - Qdisc parameters unchanged after add-sources - Stop after add returns qdisc state to baseline
| ).Methods("POST") | ||
|
|
||
| // Setting up add-sources handler endpoints for network latency and packet | ||
| // loss faults. These let the FIS SSM script push newly resolved IPs into a |
There was a problem hiding this comment.
non-blocking:
These let the FIS SSM script push newly resolved IPs
Technically this will let any caller/client to push new IPs
There was a problem hiding this comment.
I will update the comment. Thank you!
| packetLossFaultAlreadyRunningError = "There is already one network packet loss fault running" | ||
| // add-sources 404 messages. We emit a distinct message per fault type so the | ||
| // SSM script's logs make it obvious which fault it was trying to extend. | ||
| latencyFaultNotRunningError = "no network-latency fault is currently running on the task's network namespace" |
There was a problem hiding this comment.
on the task's network namespace
In ECS, for awsvpc mode tasks, they will have dedicated network namespace and for host mode, they don't. Can we rewrite this a bit? like get rid of this.
| responseBody = types.NewNetworkFaultInjectionErrorResponse(internalError) | ||
| httpStatusCode = http.StatusInternalServerError | ||
| } else if !faultPresent(latencyExists, packetLossExists) { | ||
| // 404 is the signal to the SSM script that the fault is gone and |
There was a problem hiding this comment.
Same here. Can we make it generic by removing SSM script? this applies to all callers.
| httpStatusCode = http.StatusInternalServerError | ||
| } else if !faultPresent(latencyExists, packetLossExists) { | ||
| // 404 is the signal to the SSM script that the fault is gone and | ||
| // the refresh loop should stop. The design calls this out as the |
There was a problem hiding this comment.
Similar to the comment above, can we please keep this handler decoupled from FIS logic and generic in the comments.
| // stay in place; the SSM script retries the same IPs on the | ||
| // next refresh cycle, and tc filter add is idempotent enough | ||
| // across kernel versions that the retry either succeeds or | ||
| // fails silently without corrupting the qdisc hierarchy. |
There was a problem hiding this comment.
Same comment as above regarding a reference to SSM script.
| if err := h.addIPAddressesToFilter( | ||
| ctx, request.SourcesToFilter, taskMetadata, nsenterPrefix, | ||
| tcAllowlistIPCommandString, netInterface.DeviceName, | ||
| ); err != nil { |
There was a problem hiding this comment.
Do we get an error if the source to be added already exists? Like file exists or something like that? If so, I think that should not be treated as an error and should be ignored to allow later commands to be executed.
There was a problem hiding this comment.
Verified empirically. tc filter add does not error on duplicate u32 matches. It appends another filter at the same priority each time. Repro on Linux:
ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"LossPercent":50,"Sources":["1.1.1.1"]}' "$URL/start"; echo
{"Status":"running"}
ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["1.1.1.1"]}' "$URL/add-sources"; echo
{"Status":"running"}
ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["8.8.8.8"]}' "$URL/add-sources"; echo
{"Status":"running"}
ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["8.8.8.8"]}' "$URL/add-sources"; echo
{"Status":"running"}
ip-10-42-2-146:~#
ip-10-42-2-146:~# tc filter show dev eth0
filter parent 1: protocol all pref 2 u32 chain 0
filter parent 1: protocol all pref 2 u32 chain 0 fh 800: ht divisor 1
filter parent 1: protocol all pref 2 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 not_in_hw
match 01010101/ffffffff at 16
filter parent 1: protocol all pref 2 u32 chain 0 fh 800::801 order 2049 key ht 800 bkt 0 flowid 1:1 not_in_hw
match 01010101/ffffffff at 16
filter parent 1: protocol all pref 2 u32 chain 0 fh 800::802 order 2050 key ht 800 bkt 0 flowid 1:1 not_in_hw
match 08080808/ffffffff at 16
filter parent 1: protocol all pref 2 u32 chain 0 fh 800::803 order 2051 key ht 800 bkt 0 flowid 1:1 not_in_hw
match 08080808/ffffffff at 16
So the retry path is safe as-is, no file exists to swallow. Duplicates accumulate and get reaped at qdisc teardown on stop. I've rewritten the no-rollback comment to match what actually happens, replacing the "idempotent / fails silently" wording.
In-agent dedupe would mean introducing per-NS fault state, which FaultHandler doesn't carry today (it only holds a mutexMap. checkTCFault re-derives state from tc q show on every call). Happy to take that on as a follow-up if you want it, but it felt out of scope for this change.
There was a problem hiding this comment.
Ok thanks. I think it's good as is.
Address PR aws#4971 review feedback. The add-sources endpoint accepts any caller, not just the FIS SSM script, so the surrounding comments and the not-running error messages should not name a specific client. Drop "FIS SSM script" / "SSM script" references from: - the add-sources route registration block in task_server_setup.go - the 404 inline comment and the no-rollback retry comment in addSourcesHandler - the AddSourcesNetworkLatency godoc Reword the latency/packet-loss not-running error strings to drop "on the task's network namespace": host-mode tasks do not have a dedicated network namespace, so the phrasing was wrong for that case. Rewrite the no-rollback retry comment to match observed behavior: `tc filter add` accepts duplicate u32 matches without error, so retries are safe; duplicates are reaped when the qdisc is torn down on stop. The previous "idempotent enough" / "fails silently" wording was wrong. Mirror the ecs-agent edits into agent/vendor/ to keep the vendored copy in sync. NOSIM --- Prompt: fix the PR aws#4971 review comments I replied to
Address PR aws#4971 review feedback. The add-sources endpoint accepts any caller, not just the FIS SSM script, so the surrounding comments and the not-running error messages should not name a specific client. Drop "FIS SSM script" / "SSM script" references from: - the add-sources route registration block in task_server_setup.go - the 404 inline comment and the no-rollback retry comment in addSourcesHandler - the AddSourcesNetworkLatency godoc Reword the latency/packet-loss not-running error strings to drop "on the task's network namespace": host-mode tasks do not have a dedicated network namespace, so the phrasing was wrong for that case. Rewrite the no-rollback retry comment to match observed behavior: `tc filter add` accepts duplicate u32 matches without error, so retries are safe; duplicates are reaped when the qdisc is torn down on stop. The previous "idempotent enough" / "fails silently" wording was wrong. Mirror the ecs-agent edits into agent/vendor/ to keep the vendored copy in sync.
994a036 to
0fb6d9f
Compare
Summary
FIS network faults target traffic between an ECS task and a network destination. Customers identify the destination by IP, CIDR, or DNS name; for DNS names FIS resolves to IPs at fault start and hands the resulting set to the ECS Agent, which installs tc rules against it and never revisits the resolution. The set may go stale on long-running faults: resolvers may return only a subset of IPs per lookup, and DNS records rotate over time, so traffic to later-resolved IPs is unimpaired.
Implementation details
Add
POST /fault/v1/{network-latency|network-packet-loss}/add-sources. The handler installs additionaltcfilters against a running fault without touching theqdisc, returns 404 if no fault of the matching type is running, and caps combinedSources+SourcesToFilterat 16 IPs per call to fit the 5srequestTimeoutSecondsbudget.Touched files: request/response types in
ecs-agent/tmds/handlers/fault/v1/types/types.go, handlers inecs-agent/tmds/handlers/fault/v1/handlers/handlers.go(and itsagent/vendor/...mirror), and route registration inagent/handlers/task_server_setup.go.Testing
Unit tests in
handlers_test.goandtypes_test.gocover validation, the 16-IP cap, missing/wrong/stopped fault states, and routing.18 sudo-tagged integration tests issue real
tccommands on EC2 Linux (go test -tags sudo -run TestAddSources ./tmds/handlers/fault/v1/handlers/): happy path for latency and packet loss withSources/SourcesToFilter/CIDR; 404 on no fault, wrong type, after stop; 400 on empty/over-cap/overlap/invalid IP/malformed/missing body; 200 at the 16-IP cap; duplicate IPs idempotent; concurrent add+stop and add+add serialized;qdiscparams unchanged; stop after add cleans everything. All pass.Manual tests on EC2 against TMDS from a task container, all passing: latency and packet-loss happy paths with mid-fault
add-sources(verified viaping); IPv6 source installs v6 match keys; duplicate IP re-add returns 200 withoutqdisccorruption; add before start, wrong fault type, and add after stop return 404 with the expected error; empty body,Sources=[], and invalid IP return 400 with the expected error.New tests cover the changes: yes
Description for the changelog
Feature - Add FIS-driven DNS refresh add-sources endpoint for network-latency and network-packet-loss faults
Additional Information
Does this PR include breaking model changes? If so, Have you added transformation functions?
No. New endpoints and types only; existing types and routes are unchanged.
Does this PR include the addition of new environment variables in the README?
No.
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.