
essential container is pending after enabling service connect for ECS on EC2 #4397

Open
zxkane opened this issue Oct 15, 2024 · 13 comments
Labels: more info needed, researching (actively looking into the issue)


zxkane commented Oct 15, 2024

Summary

Everything else works well; however, after enabling Service Connect for the ECS service on EC2, the essential container always stays in PENDING.

After switching the ECS service to Fargate, it works again with Service Connect enabled.

I found that the ECS agent loses the connection to the ECS backend when 'Sending state change to ECS'.

Description

I'm using the code snippet below to create the ECS service on EC2 with Service Connect via CDK:

        const ec2Service = new Ec2Service(this, 'DifyEcsClusterEC2SandboxService', {
            cluster: this.cluster,
            taskDefinition: this.sandboxTaskDefinition,
            desiredCount: 1,
            serviceName: 'ec2-dify-sandbox',
            vpcSubnets: this.cluster.vpc.selectSubnets({ subnetType: SubnetType.PRIVATE_WITH_EGRESS }),
            securityGroups: [this.securityGroup],
            capacityProviderStrategies: [{ capacityProvider: this.clusterDefaultCapacityProviderName, weight: 1 }]
        })

        ec2Service.addPlacementStrategies(
            PlacementStrategy.spreadAcross('attribute:ecs.availability-zone'),
            PlacementStrategy.spreadAcrossInstances(),
        )

        ec2Service.enableServiceConnect({
            namespace: this.cluster.defaultCloudMapNamespace?.namespaceArn,
            services: [{
                portMappingName: this.sandboxTaskDefinition.defaultContainer?.portMappings[0].name || "serverless-dify-sandbox-8194-tcp",
                dnsName: "dify-sandbox-on-ec2.serverless",
                discoveryName: "serverless-dify-sandbox-on-ec2",
            }],
            logDriver: LogDriver.awsLogs({
                streamPrefix: 'serverless-dify/service-connect/',
                logGroup: new LogGroup(this, 'service-connect-log-on-ec2', {
                    retention: RetentionDays.ONE_WEEK,
                }),
            })
        })
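
For reference, the portMappingName above resolves from the task definition's default container. A minimal sketch of how such a named port mapping might be declared in CDK (illustrative values and names, not the exact task definition):

        // Ec2TaskDefinition, ContainerImage, NetworkMode and AppProtocol come from aws-cdk-lib/aws-ecs.
        // The container exposes a *named* port mapping so that enableServiceConnect()
        // can reference it via portMappingName.
        const sandboxTaskDefinition = new Ec2TaskDefinition(this, 'SandboxTaskDef', {
            networkMode: NetworkMode.AWS_VPC,
        });

        sandboxTaskDefinition.addContainer('sandbox', {
            image: ContainerImage.fromRegistry('example/dify-sandbox'), // placeholder image
            memoryLimitMiB: 512,
            essential: true,
            portMappings: [{
                containerPort: 8194,
                name: 'serverless-dify-sandbox-8194-tcp', // the name referenced by portMappingName
                appProtocol: AppProtocol.http,
            }],
        });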

With Service Connect disabled, the container starts and runs well. However, it always stays in PENDING once Service Connect is enabled.

After inspecting the logs of the ecs-agent container on the EC2 instance, I found the output below:

level=warn time=2024-10-15T05:19:31Z msg="Received an unrecognized attachment property" attachmentProperty="{\n  Name: \"EcsTaskSetArn\",\n  Value: \"arn:aws:ecs:ap-northeast-1:845861764576:task-set/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/serverless-dify-sandbox/ecs-svc/7792931705797793743\"\n}"
level=info time=2024-10-15T05:19:31Z msg="Received task payload from ACS" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskVersion="3" desiredStatus=RUNNING
level=info time=2024-10-15T05:19:31Z msg="Found application credentials for task" taskVersion="3" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq" roleType="TaskApplication" credentialsID="50e0b19f-38e1-45c6-a580-07a7ab5c6cbb" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Found execution credentials for task" roleType="TaskExecution" credentialsID="85db1dd5-4c11-4e98-8345-7bbe668de81c" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskVersion="3" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq"
level=info time=2024-10-15T05:19:31Z msg="Resources successfully consumed, continue to task creation" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Host resources consumed, progressing task" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Digest resolution not required" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" containerName="~internal~ecs~pause" image="amazon/amazon-ecs-pause:0.1.0"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" knownStatus="NONE" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="" changeEventStatus="MANIFEST_PULLED"
level=info time=2024-10-15T05:19:31Z msg="Creating task cgroup taskARN=arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db cgroupPath=/ecs/2700c02483254aef95cbc43f497f16db cgroupV2=false" module=cgroup.go
level=info time=2024-10-15T05:19:31Z msg="Transitioned resource" status="CREATED" task="2700c02483254aef95cbc43f497f16db" resource="cgroup"
level=info time=2024-10-15T05:19:31Z msg="Managed task got resource" resource="cgroup" status="CREATED" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="" changeEventStatus="PULLED" knownStatus="MANIFEST_PULLED" desiredStatus="RESOURCES_PROVISIONED"
level=info time=2024-10-15T05:19:31Z msg="Creating container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Created container name mapping for task" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" dockerContainerName="ecs-serverless-dify-sandbox-3-internalecspause-b4edaff8a49797bb4500"
level=info time=2024-10-15T05:19:31Z msg="Created docker container for task" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" dockerId="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" elapsed=74.224358ms
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="CREATED" knownStatus="PULLED" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Starting container" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Started container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" elapsed=171.265595ms
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" changeEventStatus="RUNNING" knownStatus="CREATED" desiredStatus="RESOURCES_PROVISIONED" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9"
level=info time=2024-10-15T05:19:31Z msg="Start streaming metrics for container" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9"
level=info time=2024-10-15T05:19:31Z msg="Setting up container resources for container" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause"
level=info time=2024-10-15T05:19:31Z msg="Setting up CNI config for task" cniContainerNetNs="/host/proc/24031/ns/net" task="2700c02483254aef95cbc43f497f16db" cniContainerID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" cniPluginPath="" cniID="06:d5:89:0d:2f:2d" cniBridgeName=""
level=info time=2024-10-15T05:19:31Z msg="Task associated with ip address" task="2700c02483254aef95cbc43f497f16db" ip="169.254.172.4"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="RESOURCES_PROVISIONED" knownStatus="RUNNING" desiredStatus="RESOURCES_PROVISIONED"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" changeEventStatus="MANIFEST_PULLED" knownStatus="NONE" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID=""
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" container="sandbox" runtimeID="" changeEventStatus="MANIFEST_PULLED" knownStatus="NONE" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Container change also resulted in task change" runtimeID="" desiredStatus="RUNNING" knownStatus="MANIFEST_PULLED" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:31Z msg="Handling container change event" container="ecs-service-connect-nconQf" runtimeID="" changeEventStatus="PULLED" knownStatus="MANIFEST_PULLED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:31Z msg="Creating container" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:31Z msg="Applying execution role credentials to container log auth" awslogs-credentials-endpoint="/v2/credentials/85db1dd5-4c11-4e98-8345-7bbe668de81c" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" roleType="TaskExecution" roleARN="arn:aws:iam::845861764576:role/ServerlessDifyStack-EcsCl-ServerlessDifyClusterSand-UHkUiKUZcERq" credentialsID="85db1dd5-4c11-4e98-8345-7bbe668de81c"
level=info time=2024-10-15T05:19:31Z msg="Created container name mapping for task" dockerContainerName="ecs-serverless-dify-sandbox-3-ecs-service-connect-nconQf-94b4e0ccf1d3fdbf9d01" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:32Z msg="Created docker container for task" container="ecs-service-connect-nconQf" dockerId="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" elapsed=70.437094ms task="2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:32Z msg="Handling container change event" knownStatus="PULLED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" changeEventStatus="CREATED"
level=info time=2024-10-15T05:19:32Z msg="Starting container" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
2024-10-15T05:19:32Z 200 10.0.169.54:59412 "/v2/credentials" "Go-http-client/1.1" arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db GetCredentialsExecutionRole 2 ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq arn:aws:ecs:ap-northeast-1:845861764576:container-instance/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/3929aad2e3374aa79d0480c7eb68919d
level=info time=2024-10-15T05:19:32Z msg="Processing credential request, credentialType=TaskExecution taskARN=arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" module=credentials_handler.go
level=info time=2024-10-15T05:19:32Z msg="Started container" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" elapsed=170.946791ms task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf"
level=info time=2024-10-15T05:19:32Z msg="Handling container change event" knownStatus="CREATED" desiredStatus="RUNNING" task="2700c02483254aef95cbc43f497f16db" container="ecs-service-connect-nconQf" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b" changeEventStatus="RUNNING"
level=info time=2024-10-15T05:19:32Z msg="Start streaming metrics for container" runtimeID="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 task metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 container metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" container="fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b"
level=info time=2024-10-15T05:19:32Z msg="Writing response for v4 task metadata" tmdsEndpointContainerID="e1ab4a8e-3b73-4f70-9d72-85eba94ca012" taskARN="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db"
level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"
level=info time=2024-10-15T05:21:12Z msg="Using cached DiscoverPollEndpoint" endpoint="https://ecs-a-5.ap-northeast-1.amazonaws.com/" telemetryEndpoint="https://ecs-t-5.ap-northeast-1.amazonaws.com/" serviceConnectEndpoint="https://ecs-sc.ap-northeast-1.api.aws" containerInstanceARN="arn:aws:ecs:ap-northeast-1:845861764576:container-instance/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/3929aad2e3374aa79d0480c7eb68919d"
level=info time=2024-10-15T05:21:12Z msg="Establishing a Websocket connection" url="https://ecs-t-5.ap-northeast-1.amazonaws.com/ws?agentHash=cf8c7a6b&agentVersion=1.87.0&cluster=ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq&containerInstance=arn%3Aaws%3Aecs%3Aap-northeast-1%3A845861764576%3Acontainer-instance%2FServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq%2F3929aad2e3374aa79d0480c7eb68919d&dockerVersion=25.0.6"
level=info time=2024-10-15T05:21:12Z msg="Websocket connection established." URL="https://ecs-t-5.ap-northeast-1.amazonaws.com/ws?agentHash=cf8c7a6b&agentVersion=1.87.0&cluster=ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq&containerInstance=arn%3Aaws%3Aecs%3Aap-northeast-1%3A845861764576%3Acontainer-instance%2FServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq%2F3929aad2e3374aa79d0480c7eb68919d&dockerVersion=25.0.6" ConnectTime="2024-10-15 05:21:12" ExpectedDisconnectTime="2024-10-15 05:51:12"
level=info time=2024-10-15T05:21:12Z msg="Connected to TCS endpoint"

You can see abnormal connectivity between ECS and the agent on EC2:

level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"

After updating the above ECS service to use Fargate, it works well:

        const service = new FargateService(this, 'DifyEcsClusterSandboxService', {
            cluster: this.cluster,
            taskDefinition: this.sandboxTaskDefinition,
            desiredCount: 1,
            serviceName: 'fargate-dify-sandbox',
            vpcSubnets: this.cluster.vpc.selectSubnets({ subnetType: SubnetType.PRIVATE_WITH_EGRESS }),
            securityGroups: [this.securityGroup],
        })

        service.enableServiceConnect({
            namespace: this.cluster.defaultCloudMapNamespace?.namespaceArn,
            services: [{
                portMappingName: this.sandboxTaskDefinition.defaultContainer?.portMappings[0].name || "serverless-dify-sandbox-8194-tcp",
                dnsName: "serverles-dify-sandbox",
                discoveryName: "serverless-dify-sandbox",
            }],
            logDriver: LogDriver.awsLogs({
                streamPrefix: 'serverless-dify/service-connect/',
                logGroup: new LogGroup(this, 'service-connect-on-fargate', {
                    retention: RetentionDays.ONE_WEEK,
                }),
            })
        })

Expected Behavior

The essential container should start and run.

Observed Behavior

The ECS agent does not start the essential container after the Service Connect container has started.

Environment Details

ECS agent: 1.87.0
EC2 AMI: amzn2-ami-ecs-hvm-2.0.20241010-x86_64-ebs

Supporting Log Snippets

See the description above.

zxkane changed the title from "essential container is pending after enabling service connect on ECS on EC2" to "essential container is pending after enabling service connect for ECS on EC2" on Oct 15, 2024
mye956 added the "researching" (actively looking into the issue) label on Oct 29, 2024
mye956 (Contributor) commented Oct 29, 2024

Hi @zxkane, thanks for opening up this issue.

You can see abnormal connectivity between ECS and the agent on EC2,

level=info time=2024-10-15T05:19:50Z msg="Sending state change to ECS" eventType="TaskStateChange" taskArn="arn:aws:ecs:ap-northeast-1:845861764576:task/ServerlessDifyStack-EcsClusterStackNestedStackEcsClusterStackNestedStackResourceC1F1FB-8S3K0YS6ISZ-EcsClusterStackAFF371BA-PIgAAAsN8xTq/2700c02483254aef95cbc43f497f16db" taskStatus="MANIFEST_PULLED" taskReason="" taskPullStartedAt="0001-01-01T00:00:00Z" taskPullStoppedAt="0001-01-01T00:00:00Z" taskKnownSentStatus="NONE" taskExecutionStoppedAt="0001-01-01T00:00:00Z" containerChange-0="containerName=ecs-service-connect-nconQf containerStatus=RUNNING containerKnownSentStatus=NONE containerRuntimeID=fd2e9f9d1e405bd64f17b9afb9c41dbf8a97f4ca8cf1ffe465b3f4652782f11b containerIsEssential=true"
level=info time=2024-10-15T05:21:12Z msg="TCS Websocket connection closed for a valid reason"

What you see here is actually expected behavior: the agent is disconnecting from the ECS telemetry (TCS) connection (see ref for where this log statement comes from). This is not the same ECS endpoint that state changes are sent over (that is the ACS endpoint); the TCS endpoint is where metrics are sent. The agent will periodically disconnect from and reconnect to the telemetry endpoint, and you should see a corresponding reconnect log statement a bit after.

Could you clarify a bit more what you mean by the essential container being stuck in PENDING? It looks like the container did transition to a running state.

level=info time=2024-10-15T05:19:31Z msg="Handling container change event" task="2700c02483254aef95cbc43f497f16db" container="~internal~ecs~pause" runtimeID="f92cdfa8241c3bfca21afb8eec25ab7258ec249e88399c17fd6808a63f8a5ca9" changeEventStatus="RESOURCES_PROVISIONED" knownStatus="RUNNING" desiredStatus="RESOURCES_PROVISIONED"

If possible, could you share a bit more about how the task definition is configured?

zxkane (Author) commented Nov 7, 2024

@mye956 thanks for looking at this.

I uploaded the task definition for your reference: serverless-dify-sandbox-revision1.json.

In my latest attempt, I configured both EC2-based and Fargate-based ECS services with service connect enabled, using the same task definition.

In the screenshot below, you can see the EC2-based service is always pending and never becomes ready.
(screenshot: EC2-based service stuck in a pending state)

Digging into the task of the EC2-based service, the essential container dify-sandbox is always pending, waiting to be started. This also causes the ecs-service-connect container to become unhealthy after running for a while.

(screenshot: essential container dify-sandbox pending; ecs-service-connect container unhealthy)

After disabling the service connect configuration for the EC2-based service, the task runs successfully.
(screenshot: task running successfully with Service Connect disabled)

Any suggestions for addressing the availability issue when enabling Service Connect for EC2-based services?

haphazardeous commented:

Hi @zxkane,

Are you still having this issue? We've been having a lot of ecs-service-connect issues on our ECS cluster, which is backed by EC2.

We have a schedule where the cluster is shut down over the weekend and brought back up on Monday morning, but since upgrading the agent it has been failing to come up because ecs-service-connect fails its health checks.


tinchogon34 commented Dec 10, 2024

Hi @zxkane, we're having exactly the same issue. By any chance, did you find any solution?

zxkane (Author) commented Dec 11, 2024

Hi @zxkane, we're having exactly the same issue. By any chance, did you find any solution?

No solution for EC2. Using Fargate works like a charm.

tinchogon34 commented:

I just realized today that EC2 with Amazon Linux instead of Bottlerocket works too.

amogh09 (Contributor) commented Dec 19, 2024

Hi @zxkane. We will need more logs from your container instance to investigate this issue. Can you please enable debug logs by adding ECS_LOGLEVEL=debug to your ECS configuration file (located at /etc/ecs/ecs.config by default)? Then can you please run a new task with the same configuration, collect logs from your container instance using the ECS Logs Collector, and send the log bundle to [email protected]? Please reference this GitHub issue in your email subject.
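
If the container instances are provisioned with CDK, one possible way to apply this setting to new instances is via user data (a sketch, assuming an AutoScalingGroup-backed capacity provider named autoScalingGroup; adjust to your own setup):

        // Sketch: append ECS_LOGLEVEL=debug to /etc/ecs/ecs.config through EC2 user data
        // so the ECS agent on newly launched instances starts with debug logging enabled.
        autoScalingGroup.addUserData(
            'echo "ECS_LOGLEVEL=debug" >> /etc/ecs/ecs.config',
        );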

A task container stuck in PENDING is usually due to an unresolved container dependency.
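
For context, Service Connect injects an Envoy-based sidecar into the task and, as I understand it, the application containers wait for that sidecar to become ready before starting. The same ECS dependency mechanism can be expressed explicitly in CDK roughly as below (illustrative sketch; appContainer and sidecarContainer are hypothetical names, not the injected Service Connect container):

        // Illustrative only: an explicit ECS container dependency. A container with a
        // dependency like this stays PENDING until sidecarContainer reports HEALTHY.
        // ContainerDependencyCondition comes from aws-cdk-lib/aws-ecs.
        appContainer.addContainerDependencies({
            container: sidecarContainer,
            condition: ContainerDependencyCondition.HEALTHY,
        });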

alessiogaldy commented:

@amogh09 We are also experiencing a similar issue.
I sent an email with debug logs to [email protected]; the subject is "Github Issue 4397 - ECS container instance DEBUG logs".


rwohleb commented Jan 8, 2025

We've started periodically experiencing the perpetual pending status issue on EC2-backed ECS clusters. I believe we first noticed the strange behaviour around November. Only services with Service Connect enabled are affected. When an instance starts showing the issue we set it to drain. New EC2 instances that are spun up by the ASG don't seem to exhibit the issue immediately. The age of the instance doesn't seem to be related, as we've seen instances newer than unaffected ones start to show the issue. The ECS agent version varies.


alessiogaldy commented Jan 9, 2025

What @rwohleb described is exactly what we are experiencing when using Service Connect on EC2 instances.

Every 2-3 days we find 1 or 2 instances in a broken state. Looking at the logs on the instance, all we can see is that the AppNet Agent failed with an error like the following:

[2024-12-17 19:55:26.759][7][error] [AppNet Agent] Envoy readiness check failed with: Get "http://127.0.0.1:9901/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy connectivity check failed with: Get "http://127.0.0.1:9901/stats?filter=control_plane.connected_state&format=json": EOF
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy readiness check failed with: Get "http://127.0.0.1:9901/ready": read unix @->/tmp/envoy_admin.sock: read: connection reset by peer
[2024-12-17 19:55:26.909][7][error] [AppNet Agent] Envoy connectivity check failed with: Get "http://127.0.0.1:9901/stats?filter=control_plane.connected_state&format=json": dial unix /tmp/envoy_admin.sock: connect: connection refused
[2024-12-17 19:55:26.942][7][warning] [AppNet Agent] [Envoy process 15] Exited with code [-1]
[2024-12-17 19:55:26.943][7][warning] [AppNet Agent] [Envoy process 15] Additional Exit data: [Core Dump: false][Normal Exit: false][Process Signalled: true]

Initially we were also terminating the instances and letting the ASG create new ones, but recently we noticed that just rebooting the instance brings it back to a normal state.


timvw commented Jan 13, 2025

@tinchogon34 Is it correct to assume that you mean amazon-linux-2023?

We experience the same issue with the amzn2-ami-ecs-gpu-hvm-2.0.20250102-x86_64-ebs AMI for the EC2 instance...


nicolas-nannoni commented Jan 17, 2025

We are also affected by the issue. It started as soon as we updated from ami-05bc1b71d9fb7b17c (al2023-ami-ecs-hvm-2023.0.20240917-kernel-6.1-arm64 - ECS Agent: 1.86.3 - Docker version: 25.0.6) to ami-05c972f666f1ea926 (al2023-ami-ecs-hvm-2023.0.20250110-kernel-6.1-arm64 - ECS Agent: 1.89.2 - Docker version: 25.0.6) in eu-west-1. The AppNet Agent is killed, sometimes days after the instance started, and never restarts:

[2025-01-15 19:30:29.505][19][info][main] [source/server/server.cc:932] all clusters initialized. initializing init manager
[2025-01-15 19:30:29.505][19][info][config] [source/common/listener_manager/listener_manager_impl.cc:926] all dependencies initialized. starting workers
[2025-01-17 12:57:27.772][7][warning] [AppNet Agent] [Envoy process 19] Exited with code [-1]
[2025-01-17 12:57:27.772][7][warning] [AppNet Agent] [Envoy process 19] Additional Exit data: [Core Dump: false][Normal Exit: false][Process Signalled: true]

The Envoy process somehow dies at some point, in this case hours after the machine started (other times, it fails right as the first containers get scheduled on the instance). It appears to have been triggered by some containers being newly scheduled on the instance a minute before Envoy "abnormally ended".

Jan 17 12:56:44 ip-10-40-52-59 audit[1864076]: ANOM_ABEND auid=4294967295 uid=20000 gid=0 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=1864076 comm="wrk:worker_1" exe="/usr/bin/envoy" sig=11 res=1
Jan 17 12:56:45 ip-10-40-52-59 audit[1864077]: ANOM_ABEND auid=4294967295 uid=20000 gid=0 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=1864077 comm="wrk:worker_1" exe="/usr/bin/envoy" sig=11 res=1


Jan 17 12:57:27 ip-10-40-52-59 systemd-coredump[1864535]: Process 1864076 (envoy) of user 20000 dumped core.

Stack trace of thread 46:
#0  0x0000aaaac1c3cc8c n/a (/usr/bin/envoy + 0x14acc8c)
#1  0x0000aaaac30a85c8 n/a (/usr/bin/envoy + 0x29185c8)
#2  0x0000aaaac30a85c8 n/a (/usr/bin/envoy + 0x29185c8)
#3  0x0000aaaac30b5104 n/a (/usr/bin/envoy + 0x2925104)
#4  0x0000aaaac30b1e80 n/a (/usr/bin/envoy + 0x2921e80)
#5  0x0000aaaac3059098 n/a (/usr/bin/envoy + 0x28c9098)
#6  0x0000aaaac36bc87c n/a (/usr/bin/envoy + 0x2f2c87c)
#7  0x0000aaaac36bc7cc n/a (/usr/bin/envoy + 0x2f2c7cc)
#8  0x0000aaaac30774b0 n/a (/usr/bin/envoy + 0x28e74b0)
#9  0x0000aaaac38d556c n/a (/usr/bin/envoy + 0x314556c)
#10 0x0000ffff8f7cd230 n/a (/lib64/libpthread.so.0 + 0x7230)
#11 0x0000ffff8f71b7dc n/a (/lib64/libc.so.6 + 0xdb7dc)
ELF object binary architecture: AARCH64

Jan 17 12:57:27 ip-10-40-52-59 systemd-coredump[1864558]: Process 1864077 (envoy) of user 20000 dumped core.

Stack trace of thread 45:
#0  0x0000aaaacc85cc8c n/a (/usr/bin/envoy + 0x14acc8c)
#1  0x0000aaaacdcc85c8 n/a (/usr/bin/envoy + 0x29185c8)
#2  0x0000aaaacdcc85c8 n/a (/usr/bin/envoy + 0x29185c8)
#3  0x0000aaaacdcd5104 n/a (/usr/bin/envoy + 0x2925104)
#4  0x0000aaaacdcd1e80 n/a (/usr/bin/envoy + 0x2921e80)
#5  0x0000aaaacdc79098 n/a (/usr/bin/envoy + 0x28c9098)
#6  0x0000aaaace2dc87c n/a (/usr/bin/envoy + 0x2f2c87c)
#7  0x0000aaaace2dc7cc n/a (/usr/bin/envoy + 0x2f2c7cc)
#8  0x0000aaaacdc974b0 n/a (/usr/bin/envoy + 0x28e74b0)
#9  0x0000aaaace4f556c n/a (/usr/bin/envoy + 0x314556c)
#10 0x0000ffffa2e7d230 n/a (/lib64/libpthread.so.0 + 0x7230)
#11 0x0000ffffa2dcb7dc n/a (/lib64/libc.so.6 + 0xdb7dc)
ELF object binary architecture: AARCH64

We have 10 clusters (5 in eu-west-1, 5 in us-west-2) using EC2 instances only and ECS Service Connect. We started experiencing the problem as soon as we upgraded, although for now only on one of our 10 clusters. They are all provisioned in the same way using Terraform code and run similar services deployed in the same way. We are rolling back to the last known working AMI we were using, but I have logs from affected instances to share privately if needed.
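
For anyone managing the instances with CDK rather than Terraform, a rough sketch of pinning to a known-good AMI while this is investigated (the construct names, instance type and region mapping are illustrative; the AMI ID is the one that worked before the upgrade):

        // AutoScalingGroup comes from aws-cdk-lib/aws-autoscaling; MachineImage and
        // InstanceType from aws-cdk-lib/aws-ec2.
        // Sketch: pin the cluster's ASG to a specific AMI ID instead of tracking
        // the latest ECS-optimized image from SSM.
        const pinnedAmi = MachineImage.genericLinux({
            'eu-west-1': 'ami-05bc1b71d9fb7b17c', // last known-good al2023 ECS arm64 AMI
        });

        const asg = new AutoScalingGroup(this, 'EcsAsg', {
            vpc,
            instanceType: new InstanceType('m6g.large'), // illustrative arm64 instance type
            machineImage: pinnedAmi,
        });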
