Fault update sources by ssedano · Pull Request #4971 · aws/amazon-ecs-agent

ssedano · 2026-05-22T12:41:46Z

Summary

FIS network faults target traffic between an ECS task and a network destination. Customers identify the destination by IP, CIDR, or DNS name; for DNS names FIS resolves to IPs at fault start and hands the resulting set to the ECS Agent, which installs tc rules against it and never revisits the resolution. The set may go stale on long-running faults: resolvers may return only a subset of IPs per lookup, and DNS records rotate over time, so traffic to later-resolved IPs is unimpaired.

Implementation details

Add POST /fault/v1/{network-latency|network-packet-loss}/add-sources. The handler installs additional tc filters against a running fault without touching the qdisc, returns 404 if no fault of the matching type is running, and caps combined Sources+SourcesToFilter at 16 IPs per call to fit the 5s requestTimeoutSeconds budget.

Touched files: request/response types in ecs-agent/tmds/handlers/fault/v1/types/types.go, handlers in ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go (and its agent/vendor/... mirror), and route registration in agent/handlers/task_server_setup.go.

Testing

Unit tests in handlers_test.go and types_test.go cover validation, the 16-IP cap, missing/wrong/stopped fault states, and routing.

18 sudo-tagged integration tests issue real tc commands on EC2 Linux (go test -tags sudo -run TestAddSources ./tmds/handlers/fault/v1/handlers/): happy path for latency and packet loss with Sources/SourcesToFilter/CIDR; 404 on no fault, wrong type, after stop; 400 on empty/over-cap/overlap/invalid IP/malformed/missing body; 200 at the 16-IP cap; duplicate IPs idempotent; concurrent add+stop and add+add serialized; qdisc params unchanged; stop after add cleans everything. All pass.

Manual tests on EC2 against TMDS from a task container, all passing: latency and packet-loss happy paths with mid-fault add-sources (verified via ping); IPv6 source installs v6 match keys; duplicate IP re-add returns 200 without qdisc corruption; add before start, wrong fault type, and add after stop return 404 with the expected error; empty body, Sources=[], and invalid IP return 400 with the expected error.

New tests cover the changes: yes

Description for the changelog

Feature - Add FIS-driven DNS refresh add-sources endpoint for network-latency and network-packet-loss faults

Additional Information

Does this PR include breaking model changes? If so, Have you added transformation functions?
No. New endpoints and types only; existing types and routes are unchanged.

Does this PR include the addition of new environment variables in the README?
No.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

FIS network faults impair traffic between ECS tasks and domain-named services. Before this change, the FIS resolves domains to IPs once at fault start and hands them to the ECS Agent, which installs tc rules against that fixed set and never revisits it. The fault is incomplete for ALB, API Gateway, SQS, and third-party domains: DNS resolvers do not always return all IPs per lookup, and DNS records rotate the IPs during the test. Add POST /fault/v1/{network-latency|network-packet-loss}/add-sources, which installs additional tc filters against a running fault without restarting it. Cap the combined IP count at 16 to keep calls inside the shared 5s requestTimeoutSeconds budget.

Add 18 //go:build linux && sudo tests for the add-sources endpoint to handlers_sudo_integ_test.go. Tests issue real tc commands against the host's default network interface and require root. Run them with `make run-sudo-tests`. Reuse the file's existing startServer, cleanupLatencyAndPacketLossFaults, and skipForUnsupportedTc helpers. Add an addSourcesEndpoint URL format and two new routes in startServer for AddSourcesNetworkLatency and AddSourcesNetworkPacketLoss. Coverage: - Happy path on latency and packet loss, with Sources, SourcesToFilter, and CIDR inputs - 404 on no fault running, wrong fault type, and after stop - 400 on empty body, over cap, within-request overlap, invalid IP, malformed JSON, and missing body; 200 at the cap boundary - Duplicate IPs across requests, add concurrent with stop, two adds concurrent - Qdisc parameters unchanged after add-sources - Stop after add returns qdisc state to baseline

xxx0624 · 2026-05-22T16:54:55Z

 	).Methods("POST")

+	// Setting up add-sources handler endpoints for network latency and packet
+	// loss faults. These let the FIS SSM script push newly resolved IPs into a


non-blocking:

These let the FIS SSM script push newly resolved IPs

Technically this will let any caller/client to push new IPs

I will update the comment. Thank you!

xxx0624 · 2026-05-22T17:00:38Z

 	packetLossFaultAlreadyRunningError = "There is already one network packet loss fault running"
+	// add-sources 404 messages. We emit a distinct message per fault type so the
+	// SSM script's logs make it obvious which fault it was trying to extend.
+	latencyFaultNotRunningError    = "no network-latency fault is currently running on the task's network namespace"


on the task's network namespace

In ECS, for awsvpc mode tasks, they will have dedicated network namespace and for host mode, they don't. Can we rewrite this a bit? like get rid of this.

Will do, thanks!

xxx0624 · 2026-05-22T17:07:52Z

+			responseBody = types.NewNetworkFaultInjectionErrorResponse(internalError)
+			httpStatusCode = http.StatusInternalServerError
+		} else if !faultPresent(latencyExists, packetLossExists) {
+			// 404 is the signal to the SSM script that the fault is gone and


Same here. Can we make it generic by removing SSM script? this applies to all callers.

Will do, thanks!

amogh09 · 2026-05-26T17:51:03Z

+			httpStatusCode = http.StatusInternalServerError
+		} else if !faultPresent(latencyExists, packetLossExists) {
+			// 404 is the signal to the SSM script that the fault is gone and
+			// the refresh loop should stop. The design calls this out as the


Similar to the comment above, can we please keep this handler decoupled from FIS logic and generic in the comments.

Will do, thanks!

amogh09 · 2026-05-26T17:55:03Z

+				// stay in place; the SSM script retries the same IPs on the
+				// next refresh cycle, and tc filter add is idempotent enough
+				// across kernel versions that the retry either succeeds or
+				// fails silently without corrupting the qdisc hierarchy.


Same comment as above regarding a reference to SSM script.

Will fix, thanks!

amogh09 · 2026-05-26T17:58:53Z

+		if err := h.addIPAddressesToFilter(
+			ctx, request.SourcesToFilter, taskMetadata, nsenterPrefix,
+			tcAllowlistIPCommandString, netInterface.DeviceName,
+		); err != nil {


Do we get an error if the source to be added already exists? Like file exists or something like that? If so, I think that should not be treated as an error and should be ignored to allow later commands to be executed.

Verified empirically. tc filter add does not error on duplicate u32 matches. It appends another filter at the same priority each time. Repro on Linux:

ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"LossPercent":50,"Sources":["1.1.1.1"]}' "$URL/start"; echo {"Status":"running"} ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["1.1.1.1"]}' "$URL/add-sources"; echo {"Status":"running"} ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["8.8.8.8"]}' "$URL/add-sources"; echo {"Status":"running"} ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["8.8.8.8"]}' "$URL/add-sources"; echo {"Status":"running"} ip-10-42-2-146:~# ip-10-42-2-146:~# tc filter show dev eth0 filter parent 1: protocol all pref 2 u32 chain 0 filter parent 1: protocol all pref 2 u32 chain 0 fh 800: ht divisor 1 filter parent 1: protocol all pref 2 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 not_in_hw match 01010101/ffffffff at 16 filter parent 1: protocol all pref 2 u32 chain 0 fh 800::801 order 2049 key ht 800 bkt 0 flowid 1:1 not_in_hw match 01010101/ffffffff at 16 filter parent 1: protocol all pref 2 u32 chain 0 fh 800::802 order 2050 key ht 800 bkt 0 flowid 1:1 not_in_hw match 08080808/ffffffff at 16 filter parent 1: protocol all pref 2 u32 chain 0 fh 800::803 order 2051 key ht 800 bkt 0 flowid 1:1 not_in_hw match 08080808/ffffffff at 16

So the retry path is safe as-is, no file exists to swallow. Duplicates accumulate and get reaped at qdisc teardown on stop. I've rewritten the no-rollback comment to match what actually happens, replacing the "idempotent / fails silently" wording.

In-agent dedupe would mean introducing per-NS fault state, which FaultHandler doesn't carry today (it only holds a mutexMap. checkTCFault re-derives state from tc q show on every call). Happy to take that on as a follow-up if you want it, but it felt out of scope for this change.

Ok thanks. I think it's good as is.

Address PR aws#4971 review feedback. The add-sources endpoint accepts any caller, not just the FIS SSM script, so the surrounding comments and the not-running error messages should not name a specific client. Drop "FIS SSM script" / "SSM script" references from: - the add-sources route registration block in task_server_setup.go - the 404 inline comment and the no-rollback retry comment in addSourcesHandler - the AddSourcesNetworkLatency godoc Reword the latency/packet-loss not-running error strings to drop "on the task's network namespace": host-mode tasks do not have a dedicated network namespace, so the phrasing was wrong for that case. Rewrite the no-rollback retry comment to match observed behavior: `tc filter add` accepts duplicate u32 matches without error, so retries are safe; duplicates are reaped when the qdisc is torn down on stop. The previous "idempotent enough" / "fails silently" wording was wrong. Mirror the ecs-agent edits into agent/vendor/ to keep the vendored copy in sync. NOSIM --- Prompt: fix the PR aws#4971 review comments I replied to

Address PR aws#4971 review feedback. The add-sources endpoint accepts any caller, not just the FIS SSM script, so the surrounding comments and the not-running error messages should not name a specific client. Drop "FIS SSM script" / "SSM script" references from: - the add-sources route registration block in task_server_setup.go - the 404 inline comment and the no-rollback retry comment in addSourcesHandler - the AddSourcesNetworkLatency godoc Reword the latency/packet-loss not-running error strings to drop "on the task's network namespace": host-mode tasks do not have a dedicated network namespace, so the phrasing was wrong for that case. Rewrite the no-rollback retry comment to match observed behavior: `tc filter add` accepts duplicate u32 matches without error, so retries are safe; duplicates are reaped when the qdisc is torn down on stop. The previous "idempotent enough" / "fails silently" wording was wrong. Mirror the ecs-agent edits into agent/vendor/ to keep the vendored copy in sync.

Serafin Sedano added 2 commits May 22, 2026 14:27

ssedano requested a review from a team as a code owner May 22, 2026 12:41

ssedano added the bot/test label May 22, 2026

amazon-ecs-bot removed the bot/test label May 22, 2026

xxx0624 reviewed May 22, 2026

View reviewed changes

amogh09 reviewed May 26, 2026

View reviewed changes

ssedano force-pushed the fault-update-sources branch from 994a036 to 0fb6d9f Compare May 27, 2026 18:30

amogh09 approved these changes May 27, 2026

View reviewed changes

xxx0624 approved these changes May 27, 2026

View reviewed changes

ssedano added the bot/test label May 27, 2026

amazon-ecs-bot removed the bot/test label May 27, 2026

Conversation

ssedano commented May 22, 2026

Summary

Implementation details

Testing

Description for the changelog

Additional Information

Licensing

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants