
essential container is pending after enabling service connect for ECS on EC2 #4397

Open
zxkane opened this issue Oct 15, 2024 · 13 comments
Labels: more info needed, researching (actively looking into the issue)


zxkane commented Oct 15, 2024

Summary

Everything else works well; however, after enabling Service Connect for the ECS service on EC2, the essential container always stays in PENDING.

After switching the ECS service to Fargate, it works again with Service Connect enabled.

I found that the ECS agent loses the connection to the ECS backend when 'Sending state change to ECS'.

Description

I'm using the code snippet below to create the ECS service on EC2 with Service Connect via CDK:

        const ec2Service = new Ec2Service(this, 'DifyEcsClusterEC2SandboxService', {
            cluster: this.cluster,
            taskDefinition: this.sandboxTaskDefinition,
            desiredCount: 1,
            serviceName: 'ec2-dify-sandbox',
            vpcSubnets: this.cluster.vpc.selectSubnets({ subnetType: SubnetType.PRIVATE_WITH_EGRESS }),
            securityGroups: [this.securityGroup],
            capacityProviderStrategies: [{ capacityProvider: this.clusterDefaultCapacityProviderName, weight: 1 }]
        })

        ec2Service.addPlacementStrategies(
            PlacementStrategy.spreadAcross('attribute:ecs.availability-zone'),
            PlacementStrategy.spreadAcrossInstances(),
        )

        ec2Service.enableServiceConnect({
            namespace: this.cluster.defaultCloudMapNamespace?.namespaceArn,
            services: [{
                portMappingName: this.sandboxTaskDefinition.defaultContainer?.portMappings[0].name || "serverless-dify-sandbox-8194-tcp",
                dnsName: "dify-sandbox-on-ec2.serverless",
                discoveryName: "serverless-dify-sandbox-on-ec2",
            }],
            logDriver: LogDriver.awsLogs({
                streamPrefix: 'serverless-dify/service-connect/',
                logGroup: new LogGroup(this, 'service-connect-log-on-ec2', {
                    retention: RetentionDays.ONE_WEEK,
                }),
            })
        })
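
For reference, the portMappingName above resolves from the task definition's default container. A minimal sketch of how such a named port mapping might be declared in CDK (illustrative values and names, not the exact task definition):

        // Ec2TaskDefinition, ContainerImage, NetworkMode and AppProtocol come from aws-cdk-lib/aws-ecs.
        // The container exposes a *named* port mapping so that enableServiceConnect()
        // can reference it via portMappingName.
        const sandboxTaskDefinition = new Ec2TaskDefinition(this, 'SandboxTaskDef', {
            networkMode: NetworkMode.AWS_VPC,
        });

        sandboxTaskDefinition.addContainer('sandbox', {
            image: ContainerImage.fromRegistry('example/dify-sandbox'), // placeholder image
            memoryLimitMiB: 512,
            essential: true,
            portMappings: [{
                containerPort: 8194,
                name: 'serverless-dify-sandbox-8194-tcp', // the name referenced by portMappingName
                appProtocol: AppProtocol.http,
            }],
        });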

With Service Connect disabled, the container starts and runs well. However, it always stays in PENDING once Service Connect is enabled.

After inspecting the logs of the ecs-agent container on the EC2 instance, I found the output below:

level=warn time=2024-10-15T05:19:31Z msg="Received an unrecognized attachment property" attachmentProperty="{\n  Name: \"EcsTaskSetArn\",\n  Value: \"arn:aws:ecs:ap-northeast-1:845861764576:task-set/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/serverless-dify-sandbox/ecs-svc/7792931705797793743\"\n}"
level=info time=2024-10-15T05:19:31Z msg="Received task payload from ACS" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskVersion="3" desiredStatus=RUNNING
level=info time=2024-10-15T05:19:31Z msg="Found application credentials for task" taskVersion="3" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq" roleType="TaskApplication" credentialsID="50e0b19f-38e1-45c6-a580-07a7ab5c6cbb" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Found execution credentials for task" roleType="TaskExecution" credentialsID="85db1dd5-4c11-4e98-8345-7bbe668de81c" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskVersion="3" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq"
level=info time=2024-10-15T05:19:31Z msg="Resources successfully consumed, continue to task creation" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Host resources consumed, progressing task" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Digest resolution not required" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" containerName="~internal~ecs~pause" image="amazon/amazon-ecs-pause:0.1.0"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" knownStatus="NONE" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="" changeEventStatus="MANIFEST_PULLED"
level=info time=2024-10-15T05:19:31Z msg="Creating task cgroup taskARN=arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db cgroupPath=/ecs/2700c02483254aef95cbc43f497f16db cgroupV2=false" module=cgroup.go
level=info time=2024-10-15T05:19:31Z msg="Transitioned resource" status="CREATED" task="2700c02483254aef95cbc43f497f16db" resource="cgroup"
level=info time=2024-10-15T05:19:31Z msg="Managed task got resource" resource="cgroup" status="CREATED" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="" changeEventStatus="PULLED" knownStatus="MANIFEST_PULLED" desiredStatus="RESOURCES_PROVISIONED"
level=info time=2024-10-15T05:19:31Z msg="Creating container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Created container name mapping for task" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" dockerContainerName="ecs-serverless-dify-sandbox-3-internalecspause-b4edaff8a49797bb4500"
level=info time=2024-10-15T05:19:31Z msg="Created docker container for task" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" dockerId="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" elapsed=74.224358ms
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="CREATED" knownStatus="PULLED" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Starting container" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Started container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" elapsed=171.265595ms
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" changeEventStatus="RUNNING" knownStatus="CREATED" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9"
level=info time=2024-10-15T05:19:31Z msg="Start streaming metrics for container" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9"
level=info time=2024-10-15T05:19:31Z msg="Setting up container resources for container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Setting up CNI config for task" cniContainerNetNs="/host/proc/24031/ns/net" task="2700c02483254aef95cbc43f497f16db" cniContainerID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" cniPluginPath="" cniID="06:d5:89:0d:2f:2d" cniBridgeName=""
level=info time=2024-10-15T05:19:31Z msg="Task associated with ip address" task="2700c02483254aef95cbc43f497f16db" ip="169.254.172.4"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="RESOURCES_PROVISIONED" knownStatus="RUNNING" desiredStatus="RESOURCES_PROVISIONED"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" changeEventStatus="MANIFEST_PULLED" knownStatus="NONE" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID=""
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" container="sandbox" runtimeID="" changeEventStatus="MANIFEST_PULLED" knownStatus="NONE" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Container change also resulted in task change" runtimeID="" desiredStatus="RUNNING" knownStatus="MANIFEST_PULLED" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" container="ecs-service-connect-nconQf" runtimeID="" changeEventStatus="PULLED" knownStatus="MANIFEST_PULLED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Creating container" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:31Z msg="Applying execution role credentials to container log auth" awslogs-credentials-endpoint="/v2/credentials/85db1dd5-4c11-4e98-8345-7bbe668de81c" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" roleType="TaskExecution" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq" credentialsID="85db1dd5-4c11-4e98-8345-7bbe668de81c"
level=info time=2024-10-15T05:19:31Z msg="Created container name mapping for task" dockerContainerName="ecs-serverless-dify-sandbox-3-ecs-service-connect-nconQf-94b4e0ccf1d3fdbf9d01" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:32Z msg="Created docker container for task" container="ecs-service-connect-nconQf" dockerId="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" elapsed=70.437094ms task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:32Z msg="Handling container change event" knownStatus="PULLED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" changeEventStatus="CREATED"
level=info time=2024-10-15T05:19:32Z msg="Starting container" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
2024-10-15T05:19:32Z 200 10.0.169.54:59412 "/v2/credentials" "Go-http-client/1.1" arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db GetCredentialsExecutionRole 2 ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq arn:aws:ecs:ap-northeast-1:845861764576:container-instance/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/3929aad2e3374aa79d0480c7eb68919d
level=info time=2024-10-15T05:19:32Z msg="Processing credential request, credentialType=TaskExecution taskARN=arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" module=credentials_handler.go
level=info time=2024-10-15T05:19:32Z msg="Started container" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" elapsed=170.946791ms task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:32Z msg="Handling container change event" knownStatus="CREATED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" changeEventStatus="RUNNING"
level=info time=2024-10-15T05:19:32Z msg="Start streaming metrics for container" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 task metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 container metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" container="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 task metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"
level=info time=2024-10-15T05:21:12Z msg="Using cached DiscoverPollEndpoint" endpoint="https://ecs-a-5.ap-northeast-1.amazonaws.com/" telemetryEndpoint="https://ecs-t-5.ap-northeast-1.amazonaws.com/" serviceConnectEndpoint="https://ecs-sc.ap-northeast-1.api.aws" containerInstanceARN="arn:aws:ecs:ap-northeast-1:845861764576:container-instance/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/3929aad2e3374aa79d0480c7eb68919d"
level=info time=2024-10-15T05:21:12Z msg="Establishing a Websocket connection" url="https://ecs-t-5.ap-northeast-1.amazonaws.com/ws?agentHash=cf8c7a6b&agentVersion=1.87.0&cluster=ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq&containerInstance=arn%3Aaws%3Aecs%3Aap-northeast-1%3A845861764576%3Acontainer-instance%2FServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq%2F3929aad2e3374aa79d0480c7eb68919d&dockerVersion=25.0.6"
level=info time=2024-10-15T05:21:12Z msg="Websocket connection established." URL="https://ecs-t-5.ap-northeast-1.amazonaws.com/ws?agentHash=cf8c7a6b&agentVersion=1.87.0&cluster=ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq&containerInstance=arn%3Aaws%3Aecs%3Aap-northeast-1%3A845861764576%3Acontainer-instance%2FServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq%2F3929aad2e3374aa79d0480c7eb68919d&dockerVersion=25.0.6" ConnectTime="2024-10-15 05:21:12" ExpectedDisconnectTime="2024-10-15 05:51:12"
level=info time=2024-10-15T05:21:12Z msg="Connected to TCS endpoint"

You can see abnormal connectivity between ECS and the agent on EC2:

level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"

After updating the above ECS service to use Fargate, it works well:

        const service = new FargateService(this, 'DifyEcsClusterSandboxService', {
            cluster: this.cluster,
            taskDefinition: this.sandboxTaskDefinition,
            desiredCount: 1,
            serviceName: 'fargate-dify-sandbox',
            vpcSubnets: this.cluster.vpc.selectSubnets({ subnetType: SubnetType.PRIVATE_WITH_EGRESS }),
            securityGroups: [this.securityGroup],
        })

        service.enableServiceConnect({
            namespace: this.cluster.defaultCloudMapNamespace?.namespaceArn,
            services: [{
                portMappingName: this.sandboxTaskDefinition.defaultContainer?.portMappings[0].name || "serverless-dify-sandbox-8194-tcp",
                dnsName: "serverles-dify-sandbox",
                discoveryName: "serverless-dify-sandbox",
            }],
            logDriver: LogDriver.awsLogs({
                streamPrefix: 'serverless-dify/service-connect/',
                logGroup: new LogGroup(this, 'service-connect-on-fargate', {
                    retention: RetentionDays.ONE_WEEK,
                }),
            })
        })

Expected Behavior

The essential container should start and run.

Observed Behavior

The ECS agent does not start the essential container after the Service Connect container has started.

Environment Details

ECS agent: 1.87.0
EC2 AMI: amzn2-ami-ecs-hvm-2.0.20241010-x86_64-ebs

Supporting Log Snippets

See the description above.

zxkane changed the title from "essential container is pending after enabling service connect on ECS on EC2" to "essential container is pending after enabling service connect for ECS on EC2" on Oct 15, 2024
mye956 added the "researching" (actively looking into the issue) label on Oct 29, 2024
mye956 (Contributor) commented Oct 29, 2024

Hi @zxkane, thanks for opening up this issue.

You can see abnormal connectivity between ECS and the agent on EC2,

level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"

What you see here is actually expected behavior: the agent is disconnecting from the ECS telemetry (TCS) connection (see ref for where this log statement comes from). This is not the same ECS endpoint that state changes are sent over (that is the ACS endpoint); the TCS endpoint is where metrics are sent. The agent will periodically disconnect from and reconnect to the telemetry endpoint, and you should see a corresponding reconnect log statement a bit after.

Could you clarify a bit more what you mean by the essential container being stuck in PENDING? It looks like the container did transition to a running state.

level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="RESOURCES_PROVISIONED" knownStatus="RUNNING" desiredStatus="RESOURCES_PROVISIONED"

If possible, could you share a bit more about how the task definition is configured?

zxkane (Author) commented Nov 7, 2024

@mye956 thanks for looking at this.

I uploaded the task definition for your reference: serverless-dify-sandbox-revision1.json.

In my latest attempt, I configured both EC2-based and Fargate-based ECS services with service connect enabled, using the same task definition.

In the screenshot below, you can see the EC2-based service is always pending and never becomes ready.
(screenshot: EC2-based service stuck in a pending state)

Digging into the task of the EC2-based service, the essential container dify-sandbox is always pending, waiting to be started. This also causes the ecs-service-connect container to become unhealthy after running for a while.

(screenshot: essential container dify-sandbox pending; ecs-service-connect container unhealthy)

After disabling the service connect configuration for the EC2-based service, the task runs successfully.
(screenshot: task running successfully with Service Connect disabled)

Any suggestions for addressing the availability issue when enabling Service Connect for EC2-based services?

haphazardeous commented:

Hi @zxkane,

Are you still having this issue? We've been having a lot of ecs-service-connect issues on our ECS cluster, which is backed by EC2.

We have a schedule where the cluster is shut down over the weekend and brought back up on Monday morning, but since upgrading the agent it has been failing to come up because ecs-service-connect fails its health checks.


tinchogon34 commented Dec 10, 2024

Hi @zxkane, we're having exactly the same issue. By any chance, did you find any solution?

zxkane (Author) commented Dec 11, 2024

Hi @zxkane, we're having exactly the same issue. By any chance, did you find any solution?

No solution for EC2. Using Fargate works like a charm.

tinchogon34 commented:

I just realized today that EC2 with Amazon Linux instead of Bottlerocket works too.

amogh09 (Contributor) commented Dec 19, 2024

Hi @zxkane. We will need more logs from your container instance to investigate this issue. Can you please enable debug logs by adding ECS_LOGLEVEL=debug to your ECS configuration file (located at /etc/ecs/ecs.config by default)? Then can you please run a new task with the same configuration, collect logs from your container instance using the ECS Logs Collector, and send the log bundle to [email protected]? Please reference this GitHub issue in your email subject.
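
If the container instances are provisioned with CDK, one possible way to apply this setting to new instances is via user data (a sketch, assuming an AutoScalingGroup-backed capacity provider named autoScalingGroup; adjust to your own setup):

        // Sketch: append ECS_LOGLEVEL=debug to /etc/ecs/ecs.config through EC2 user data
        // so the ECS agent on newly launched instances starts with debug logging enabled.
        autoScalingGroup.addUserData(
            'echo "ECS_LOGLEVEL=debug" >> /etc/ecs/ecs.config',
        );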

A task container stuck in PENDING is usually due to an unresolved container dependency.
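
For context, Service Connect injects an Envoy-based sidecar into the task and, as I understand it, the application containers wait for that sidecar to become ready before starting. The same ECS dependency mechanism can be expressed explicitly in CDK roughly as below (illustrative sketch; appContainer and sidecarContainer are hypothetical names, not the injected Service Connect container):

        // Illustrative only: an explicit ECS container dependency. A container with a
        // dependency like this stays PENDING until sidecarContainer reports HEALTHY.
        // ContainerDependencyCondition comes from aws-cdk-lib/aws-ecs.
        appContainer.addContainerDependencies({
            container: sidecarContainer,
            condition: ContainerDependencyCondition.HEALTHY,
        });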

alessiogaldy commented:

@amogh09 We are also experiencing a similar issue.
I sent an email with debug logs to [email protected]; the subject is "Github Issue 4397 - ECS container instance DEBUG logs".


rwohleb commented Jan 8, 2025

We've started periodically experiencing the perpetual pending status issue on EC2-backed ECS clusters. I believe we first noticed the strange behaviour around November. Only services with Service Connect enabled are affected. When an instance starts showing the issue we set it to drain. New EC2 instances that are spun up by the ASG don't seem to exhibit the issue immediately. The age of the instance doesn't seem to be related, as we've seen instances newer than unaffected ones start to show the issue. The ECS agent version varies.


alessiogaldy commented Jan 9, 2025

What @rwohleb described is exactly what we are experiencing when using Service Connect on EC2 instances.

Every 2-3 days we find 1 or 2 instances in a broken state. Looking at the logs on the instance, all we can see is that the AppNet Agent failed with an error like the following:

[2024-12-17 19:55:26.759][7][error] [AppNet Agent] Envoy readiness check failed with: Get "http://127.0.0.1:9901/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy connectivity check failed with: Get "http://127.0.0.1:9901/stats?filter=control_plane.connected_state&format=json": EOF
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy readiness check failed with: Get "http://127.0.0.1:9901/ready": read unix @->/tmp/envoy_admin.sock: read: connection reset by peer
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy connectivity check failed with: Get "http://127.0.0.1:9901/stats?filter=control_plane.connected_state&format=json": dial unix /tmp/envoy_admin.sock: connect: connection refused
[2024-12-17 19:55:26.942][7][warning] [AppNet Agent] [Envoy process 15] Exited with code [-1]
[2024-12-17 19:55:26.943][7][warning] [AppNet Agent] [Envoy process 15] Additional Exit data: [Core Dump: false][Normal Exit: false][Process Signalled: true]

Initially we were also terminating the instances and letting the ASG create new ones, but recently we noticed that just rebooting the instance brings it back to a normal state.


timvw commented Jan 13, 2025

@tinchogon34 Is it correct to assume that you mean amazon-linux-2023?

We experience the same issue with the amzn2-ami-ecs-gpu-hvm-2.0.20250102-x86_64-ebs AMI for the EC2 instance...


nicolas-nannoni commented Jan 17, 2025

We are also affected by the issue. It started as soon as we updated from ami-05bc1b71d9fb7b17c (al2023-ami-ecs-hvm-2023.0.20240917-kernel-6.1-arm64 - ECS Agent: 1.86.3 - Docker version: 25.0.6) to ami-05c972f666f1ea926 (al2023-ami-ecs-hvm-2023.0.20250110-kernel-6.1-arm64 - ECS Agent: 1.89.2 - Docker version: 25.0.6) in eu-west-1. The AppNet Agent is killed, sometimes days after the instance started, and never restarts:

[2025-01-15 19:30:29.505][19][info][main] [source/server/server.cc:932] all clusters initialized. initializing init manager
[2025-01-15 19:30:29.505][19][info][config] [source/common/listener_manager/listener_manager_impl.cc:926] all dependencies initialized. starting workers
[2025-01-17 12:57:27.772][7][warning] [AppNet Agent] [Envoy process 19] Exited with code [-1]
[2025-01-17 12:57:27.772][7][warning] [AppNet Agent] [Envoy process 19] Additional Exit data: [Core Dump: false][Normal Exit: false][Process Signalled: true]

The Envoy process somehow dies at some point, in this case hours after the machine started (other times, it fails right as the first containers get scheduled on the instance). It appears to have been triggered by some containers being newly scheduled on the instance a minute before Envoy "abnormally ended".

Jan 17 12:56:44 ip-10-40-52-59 audit[1864076]: ANOM_ABEND auid=4294967295 uid=20000 gid=0 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=1864076 comm="wrk:worker_1" exe="/usr/bin/envoy" sig=11 res=1
Jan 17 12:56:45 ip-10-40-52-59 audit[1864077]: ANOM_ABEND auid=4294967295 uid=20000 gid=0 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=1864077 comm="wrk:worker_1" exe="/usr/bin/envoy" sig=11 res=1


Jan 17 12:57:27 ip-10-40-52-59 systemd-coredump[1864535]: Process 1864076 (envoy) of user 20000 dumped core.

Stack trace of thread 46:
#0  0x0000aaaac1c3cc8c n/a (/usr/bin/envoy + 0x14acc8c)
#1  0x0000aaaac30a85c8 n/a (/usr/bin/envoy + 0x29185c8)
#2  0x0000aaaac30a85c8 n/a (/usr/bin/envoy + 0x29185c8)
#3  0x0000aaaac30b5104 n/a (/usr/bin/envoy + 0x2925104)
#4  0x0000aaaac30b1e80 n/a (/usr/bin/envoy + 0x2921e80)
#5  0x0000aaaac3059098 n/a (/usr/bin/envoy + 0x28c9098)
#6  0x0000aaaac36bc87c n/a (/usr/bin/envoy + 0x2f2c87c)
#7  0x0000aaaac36bc7cc n/a (/usr/bin/envoy + 0x2f2c7cc)
#8  0x0000aaaac30774b0 n/a (/usr/bin/envoy + 0x28e74b0)
#9  0x0000aaaac38d556c n/a (/usr/bin/envoy + 0x314556c)
#10 0x0000ffff8f7cd230 n/a (/lib64/libpthread.so.0 + 0x7230)
#11 0x0000ffff8f71b7dc n/a (/lib64/libc.so.6 + 0xdb7dc)
ELF object binary architecture: AARCH64

Jan 17 12:57:27 ip-10-40-52-59 systemd-coredump[1864558]: Process 1864077 (envoy) of user 20000 dumped core.

Stack trace of thread 45:
#0  0x0000aaaacc85cc8c n/a (/usr/bin/envoy + 0x14acc8c)
#1  0x0000aaaacdcc85c8 n/a (/usr/bin/envoy + 0x29185c8)
#2  0x0000aaaacdcc85c8 n/a (/usr/bin/envoy + 0x29185c8)
#3  0x0000aaaacdcd5104 n/a (/usr/bin/envoy + 0x2925104)
#4  0x0000aaaacdcd1e80 n/a (/usr/bin/envoy + 0x2921e80)
#5  0x0000aaaacdc79098 n/a (/usr/bin/envoy + 0x28c9098)
#6  0x0000aaaace2dc87c n/a (/usr/bin/envoy + 0x2f2c87c)
#7  0x0000aaaace2dc7cc n/a (/usr/bin/envoy + 0x2f2c7cc)
#8  0x0000aaaacdc974b0 n/a (/usr/bin/envoy + 0x28e74b0)
#9  0x0000aaaace4f556c n/a (/usr/bin/envoy + 0x314556c)
#10 0x0000ffffa2e7d230 n/a (/lib64/libpthread.so.0 + 0x7230)
#11 0x0000ffffa2dcb7dc n/a (/lib64/libc.so.6 + 0xdb7dc)
ELF object binary architecture: AARCH64

We have 10 clusters (5 in eu-west-1, 5 in us-west-2) using EC2 instances only and ECS Service Connect. We started experiencing the problem as soon as we upgraded, although for now only on one of our 10 clusters. They are all provisioned in the same way using Terraform code and run similar services deployed in the same way. We are rolling back to the last known working AMI we were using, but I have logs from affected instances to share privately if needed.
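
For anyone managing the instances with CDK rather than Terraform, a rough sketch of pinning to a known-good AMI while this is investigated (the construct names, instance type and region mapping are illustrative; the AMI ID is the one that worked before the upgrade):

        // AutoScalingGroup comes from aws-cdk-lib/aws-autoscaling; MachineImage and
        // InstanceType from aws-cdk-lib/aws-ec2.
        // Sketch: pin the cluster's ASG to a specific AMI ID instead of tracking
        // the latest ECS-optimized image from SSM.
        const pinnedAmi = MachineImage.genericLinux({
            'eu-west-1': 'ami-05bc1b71d9fb7b17c', // last known-good al2023 ECS arm64 AMI
        });

        const asg = new AutoScalingGroup(this, 'EcsAsg', {
            vpc,
            instanceType: new InstanceType('m6g.large'), // illustrative arm64 instance type
            machineImage: pinnedAmi,
        });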
