Skip to content

Fault update sources#4971

Open
ssedano wants to merge 3 commits into
aws:devfrom
ssedano:fault-update-sources
Open

Fault update sources#4971
ssedano wants to merge 3 commits into
aws:devfrom
ssedano:fault-update-sources

Conversation

@ssedano
Copy link
Copy Markdown

@ssedano ssedano commented May 22, 2026

Summary

FIS network faults target traffic between an ECS task and a network destination. Customers identify the destination by IP, CIDR, or DNS name; for DNS names FIS resolves to IPs at fault start and hands the resulting set to the ECS Agent, which installs tc rules against it and never revisits the resolution. The set may go stale on long-running faults: resolvers may return only a subset of IPs per lookup, and DNS records rotate over time, so traffic to later-resolved IPs is unimpaired.

Implementation details

Add POST /fault/v1/{network-latency|network-packet-loss}/add-sources. The handler installs additional tc filters against a running fault without touching the qdisc, returns 404 if no fault of the matching type is running, and caps combined Sources+SourcesToFilter at 16 IPs per call to fit the 5s requestTimeoutSeconds budget.

Touched files: request/response types in ecs-agent/tmds/handlers/fault/v1/types/types.go, handlers in ecs-agent/tmds/handlers/fault/v1/handlers/handlers.go (and its agent/vendor/... mirror), and route registration in agent/handlers/task_server_setup.go.

Testing

Unit tests in handlers_test.go and types_test.go cover validation, the 16-IP cap, missing/wrong/stopped fault states, and routing.

18 sudo-tagged integration tests issue real tc commands on EC2 Linux (go test -tags sudo -run TestAddSources ./tmds/handlers/fault/v1/handlers/): happy path for latency and packet loss with Sources/SourcesToFilter/CIDR; 404 on no fault, wrong type, after stop; 400 on empty/over-cap/overlap/invalid IP/malformed/missing body; 200 at the 16-IP cap; duplicate IPs idempotent; concurrent add+stop and add+add serialized; qdisc params unchanged; stop after add cleans everything. All pass.

Manual tests on EC2 against TMDS from a task container, all passing: latency and packet-loss happy paths with mid-fault add-sources (verified via ping); IPv6 source installs v6 match keys; duplicate IP re-add returns 200 without qdisc corruption; add before start, wrong fault type, and add after stop return 404 with the expected error; empty body, Sources=[], and invalid IP return 400 with the expected error.

New tests cover the changes: yes

Description for the changelog

Feature - Add FIS-driven DNS refresh add-sources endpoint for network-latency and network-packet-loss faults

Additional Information

Does this PR include breaking model changes? If so, Have you added transformation functions?
No. New endpoints and types only; existing types and routes are unchanged.

Does this PR include the addition of new environment variables in the README?
No.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Serafin Sedano added 2 commits May 22, 2026 14:27
FIS network faults impair traffic between ECS tasks and domain-named
services. Before this change, the FIS resolves domains to IPs once at
fault start and hands them to the ECS Agent, which installs tc rules
against that fixed set and never revisits it. The fault is incomplete
for ALB, API Gateway, SQS, and third-party domains:
DNS resolvers do not always return all IPs per lookup, and DNS records
rotate the IPs during the test.

Add POST /fault/v1/{network-latency|network-packet-loss}/add-sources,
which installs additional tc filters against a running fault without
restarting it. Cap the combined IP count at 16 to keep calls inside
the shared 5s requestTimeoutSeconds budget.
Add 18 //go:build linux && sudo tests for the add-sources endpoint to
handlers_sudo_integ_test.go. Tests issue real tc commands against the
host's default network interface and require root. Run them with
`make run-sudo-tests`.

Reuse the file's existing startServer,
cleanupLatencyAndPacketLossFaults, and skipForUnsupportedTc helpers.
Add an addSourcesEndpoint URL format and two new routes in startServer
for AddSourcesNetworkLatency and AddSourcesNetworkPacketLoss.

Coverage:
- Happy path on latency and packet loss, with Sources, SourcesToFilter,
  and CIDR inputs
- 404 on no fault running, wrong fault type, and after stop
- 400 on empty body, over cap, within-request overlap, invalid IP,
  malformed JSON, and missing body; 200 at the cap boundary
- Duplicate IPs across requests, add concurrent with stop, two adds
  concurrent
- Qdisc parameters unchanged after add-sources
- Stop after add returns qdisc state to baseline
@ssedano ssedano requested a review from a team as a code owner May 22, 2026 12:41
Comment thread agent/handlers/task_server_setup.go Outdated
).Methods("POST")

// Setting up add-sources handler endpoints for network latency and packet
// loss faults. These let the FIS SSM script push newly resolved IPs into a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking:

These let the FIS SSM script push newly resolved IPs

Technically this will let any caller/client to push new IPs

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the comment. Thank you!

packetLossFaultAlreadyRunningError = "There is already one network packet loss fault running"
// add-sources 404 messages. We emit a distinct message per fault type so the
// SSM script's logs make it obvious which fault it was trying to extend.
latencyFaultNotRunningError = "no network-latency fault is currently running on the task's network namespace"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the task's network namespace

In ECS, for awsvpc mode tasks, they will have dedicated network namespace and for host mode, they don't. Can we rewrite this a bit? like get rid of this.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks!

responseBody = types.NewNetworkFaultInjectionErrorResponse(internalError)
httpStatusCode = http.StatusInternalServerError
} else if !faultPresent(latencyExists, packetLossExists) {
// 404 is the signal to the SSM script that the fault is gone and
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Can we make it generic by removing SSM script? this applies to all callers.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks!

httpStatusCode = http.StatusInternalServerError
} else if !faultPresent(latencyExists, packetLossExists) {
// 404 is the signal to the SSM script that the fault is gone and
// the refresh loop should stop. The design calls this out as the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment above, can we please keep this handler decoupled from FIS logic and generic in the comments.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks!

Comment on lines +1198 to +1201
// stay in place; the SSM script retries the same IPs on the
// next refresh cycle, and tc filter add is idempotent enough
// across kernel versions that the retry either succeeds or
// fails silently without corrupting the qdisc hierarchy.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above regarding a reference to SSM script.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix, thanks!

if err := h.addIPAddressesToFilter(
ctx, request.SourcesToFilter, taskMetadata, nsenterPrefix,
tcAllowlistIPCommandString, netInterface.DeviceName,
); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we get an error if the source to be added already exists? Like file exists or something like that? If so, I think that should not be treated as an error and should be ignored to allow later commands to be executed.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified empirically. tc filter add does not error on duplicate u32 matches. It appends another filter at the same priority each time. Repro on Linux:

ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"LossPercent":50,"Sources":["1.1.1.1"]}' "$URL/start"; echo
{"Status":"running"}
ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["1.1.1.1"]}' "$URL/add-sources"; echo
{"Status":"running"}
ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["8.8.8.8"]}' "$URL/add-sources"; echo
{"Status":"running"}
ip-10-42-2-146:~# curl -sS -X POST -H 'Content-Type: application/json' --data '{"Sources":["8.8.8.8"]}' "$URL/add-sources"; echo
{"Status":"running"}
ip-10-42-2-146:~#
ip-10-42-2-146:~# tc filter show dev eth0
filter parent 1: protocol all pref 2 u32 chain 0
filter parent 1: protocol all pref 2 u32 chain 0 fh 800: ht divisor 1
filter parent 1: protocol all pref 2 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 not_in_hw
  match 01010101/ffffffff at 16
filter parent 1: protocol all pref 2 u32 chain 0 fh 800::801 order 2049 key ht 800 bkt 0 flowid 1:1 not_in_hw
  match 01010101/ffffffff at 16
filter parent 1: protocol all pref 2 u32 chain 0 fh 800::802 order 2050 key ht 800 bkt 0 flowid 1:1 not_in_hw
  match 08080808/ffffffff at 16
filter parent 1: protocol all pref 2 u32 chain 0 fh 800::803 order 2051 key ht 800 bkt 0 flowid 1:1 not_in_hw
  match 08080808/ffffffff at 16

So the retry path is safe as-is, no file exists to swallow. Duplicates accumulate and get reaped at qdisc teardown on stop. I've rewritten the no-rollback comment to match what actually happens, replacing the "idempotent / fails silently" wording.

In-agent dedupe would mean introducing per-NS fault state, which FaultHandler doesn't carry today (it only holds a mutexMap. checkTCFault re-derives state from tc q show on every call). Happy to take that on as a follow-up if you want it, but it felt out of scope for this change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks. I think it's good as is.

ssedano pushed a commit to ssedano/amazon-ecs-agent that referenced this pull request May 27, 2026
Address PR aws#4971 review feedback. The add-sources endpoint accepts any
caller, not just the FIS SSM script, so the surrounding comments and
the not-running error messages should not name a specific client.

Drop "FIS SSM script" / "SSM script" references from:
- the add-sources route registration block in task_server_setup.go
- the 404 inline comment and the no-rollback retry comment in
  addSourcesHandler
- the AddSourcesNetworkLatency godoc

Reword the latency/packet-loss not-running error strings to drop "on
the task's network namespace": host-mode tasks do not have a dedicated
network namespace, so the phrasing was wrong for that case.

Rewrite the no-rollback retry comment to match observed behavior:
`tc filter add` accepts duplicate u32 matches without error, so retries
are safe; duplicates are reaped when the qdisc is torn down on stop.
The previous "idempotent enough" / "fails silently" wording was wrong.

Mirror the ecs-agent edits into agent/vendor/ to keep the vendored copy
in sync.

NOSIM
---
Prompt: fix the PR aws#4971 review comments I replied to
Address PR aws#4971 review feedback. The add-sources endpoint accepts any
caller, not just the FIS SSM script, so the surrounding comments and
the not-running error messages should not name a specific client.

Drop "FIS SSM script" / "SSM script" references from:
- the add-sources route registration block in task_server_setup.go
- the 404 inline comment and the no-rollback retry comment in
  addSourcesHandler
- the AddSourcesNetworkLatency godoc

Reword the latency/packet-loss not-running error strings to drop "on
the task's network namespace": host-mode tasks do not have a dedicated
network namespace, so the phrasing was wrong for that case.

Rewrite the no-rollback retry comment to match observed behavior:
`tc filter add` accepts duplicate u32 matches without error, so retries
are safe; duplicates are reaped when the qdisc is torn down on stop.
The previous "idempotent enough" / "fails silently" wording was wrong.

Mirror the ecs-agent edits into agent/vendor/ to keep the vendored copy
in sync.
@ssedano ssedano force-pushed the fault-update-sources branch from 994a036 to 0fb6d9f Compare May 27, 2026 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants