essential container is pending after enabling service connect for ECS on EC2 #4397
Comments
Hi @zxkane, thanks for opening up this issue.
What you see here is actually expected behavior: the agent periodically disconnects from the ECS telemetry connection (see ref where this log statement is coming from). This is also not the same ECS endpoint we send state changes over (that's the ACS endpoint); the TCS endpoint is where we send metrics. The agent will periodically disconnect and then reconnect to the telemetry endpoint, and you should see a corresponding log statement a bit afterwards. Could you help clarify a bit more what you mean by the essential container being stuck in PENDING? It looks like the container did transition to a running state.
If possible, could you share a bit more about how the task definition is configured?
@mye956 thanks for looking at this. I uploaded the task definition for your reference: serverless-dify-sandbox-revision1.json. In my latest attempt, I configured both EC2-based and Fargate-based ECS services with service connect enabled, using the same task definition. From the screenshot below, you can see that the EC2-based service is always pending for ready. Digging into the task of the EC2-based service, the essential container stays in PENDING. After disabling the service connect configuration for the EC2-based service, the task runs successfully. Any suggestions for addressing the availability issue when enabling Service Connect for EC2-based services?
Hi @zxkane, are you still having this issue? We've been having a lot of ecs-service-connect issues on our ECS cluster, which is backed by EC2. We have a scheduled system where the cluster is shut down over the weekend and brought up on Monday morning, but it's been failing to come up due to ecs-service-connect failing health checks since upgrading the agent.
Hi @zxkane, we're having exactly the same issue. By any chance, did you find any solution?
No solution for EC2. Using Fargate works like a charm.
I just realized today that EC2 works too, but with Amazon Linux instead of Bottlerocket.
Hi @zxkane. We will need more logs from your container instance to investigate this issue. Can you please enable debug logging? A PENDING task container is usually due to an unresolved container dependency.
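(For illustration only, here is a minimal CDK sketch of such a container dependency; the construct IDs, images, and health check are placeholders, not taken from the task definition in this issue. If the dependency's condition is never satisfied, the dependent essential container stays in PENDING.)

```ts
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Hypothetical illustration (inside a CDK Stack): the essential 'app' container
// depends on a 'proxy' sidecar and stays in PENDING until the sidecar's
// health check reports HEALTHY.
const taskDef = new ecs.Ec2TaskDefinition(this, 'DependencyDemoTaskDef');

const sidecar = taskDef.addContainer('proxy', {
  image: ecs.ContainerImage.fromRegistry('public.ecr.aws/docker/library/nginx:latest'), // placeholder image
  memoryLimitMiB: 256,
  healthCheck: {
    command: ['CMD-SHELL', 'curl -f http://localhost/ || exit 1'], // placeholder health check
  },
});

const app = taskDef.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('public.ecr.aws/docker/library/httpd:latest'), // placeholder image
  memoryLimitMiB: 512,
  essential: true,
});

// If the sidecar never becomes HEALTHY, this dependency is never resolved and
// the essential container remains PENDING.
app.addContainerDependencies({
  container: sidecar,
  condition: ecs.ContainerDependencyCondition.HEALTHY,
});
```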
@amogh09 We are also experiencing a similar issue. |
We've started periodically experiencing the perpetual pending status issue on EC2-backed ECS clusters. I believe we first noticed the strange behaviour around November. Only services with Service Connect enabled are affected. When a container instance starts showing the issue, we set it to drain. New EC2 instances that are spun up by the ASG don't seem to exhibit the issue immediately. The age of the instance doesn't seem to be related, as we've seen instances start to show the issue that are newer than ones that don't. The ECS agent version varies.
What @rwohleb described is exactly what we are experiencing when using Service Connect on EC2 instances. Every 2-3 days we find 1 or 2 instances in a broken state; looking at the logs on the instance, all we can see is that the AppNet Agent failed with an error like the following:
Initially we were also terminating the instances and letting the ASG create new ones, but recently we noticed that just rebooting the instance brings it back to a normal state.
We're seeing this issue as well, and have a case open with AWS Support. Collecting what may be related issues:
@tinchogon34 Is it correct to assume that you mean amazon-linux-2023? We experience the same issue with the amzn2-ami-ecs-gpu-hvm-2.0.20250102-x86_64-ebs AMI for the EC2 instance...
We are also affected by the issue. It started as soon as we updated from
The Envoy process somehow dies at some point; in this case, hours after the machine started (other times, it fails right as the first containers get scheduled on the instance). It appears to have been triggered by some containers being newly scheduled on the instance a minute prior to Envoy "abnormally ending".
We have 10 clusters (5 in eu-west-1, 5 in us-west-2) using EC2 instances only and ECS Service Connect. We started experiencing the problem as soon as we upgraded, although for now only on one of our 10 clusters. They are all provisioned in the same way using Terraform code, though, and run similar services deployed in the same way. We are rolling back to the last known working AMI we were using, but I have logs to share privately if needed from affected instances.
Summary
Everything works well, except that the essential container is always pending after enabling service connect on ECS on EC2.
After updating the ECS service to Fargate, it works again with service connect enabled.
I found that the ECS agent lost the connection to the ECS service when 'sending status change to ECS'.
Description
I'm using the code snippet below to create the ECS service on EC2 with service connect via CDK.
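Roughly, it looks like the minimal sketch below (the VPC setup, instance type, container image, namespace name, port-mapping name, and port are placeholders rather than my exact values):

```ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

class SandboxStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });

    // EC2-backed cluster with a Cloud Map namespace used by Service Connect (placeholder name).
    const cluster = new ecs.Cluster(this, 'Cluster', {
      vpc,
      defaultCloudMapNamespace: { name: 'dify.local', useForServiceConnect: true },
    });
    cluster.addCapacity('Asg', {
      instanceType: new ec2.InstanceType('t3.large'),
    });

    // Task definition with a named port mapping, which Service Connect requires.
    const taskDefinition = new ecs.Ec2TaskDefinition(this, 'TaskDef');
    taskDefinition.addContainer('sandbox', {
      image: ecs.ContainerImage.fromRegistry('public.ecr.aws/docker/library/nginx:latest'), // placeholder image
      memoryLimitMiB: 512,
      portMappings: [{ containerPort: 8194, name: 'sandbox' }],
    });

    // EC2-based service with Service Connect enabled.
    new ecs.Ec2Service(this, 'SandboxService', {
      cluster,
      taskDefinition,
      serviceConnectConfiguration: {
        services: [{ portMappingName: 'sandbox', dnsName: 'sandbox', port: 8194 }],
      },
    });
  }
}
```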
If service connect is disabled, the container starts and runs well. However, it is always pending after enabling service connect.
After inspecting the logs of the ecs-agent container on EC2, I found the output below in the logs. You can see abnormal connectivity between ECS and the agent on EC2:
After updating the above ECS service to use Fargate, it works well.
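For comparison, a rough sketch of the Fargate variant is below (same placeholder values; a separate Fargate-compatible task definition is shown here only to keep the fragment self-contained):

```ts
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Inside the same Stack as the sketch above: a Fargate-compatible task
// definition with the same (placeholder) container and named port mapping.
const fargateTaskDef = new ecs.FargateTaskDefinition(this, 'FargateTaskDef', {
  cpu: 512,
  memoryLimitMiB: 1024,
});
fargateTaskDef.addContainer('sandbox', {
  image: ecs.ContainerImage.fromRegistry('public.ecr.aws/docker/library/nginx:latest'), // placeholder image
  portMappings: [{ containerPort: 8194, name: 'sandbox' }],
});

// Same Service Connect settings, but launched on Fargate.
new ecs.FargateService(this, 'SandboxFargateService', {
  cluster,
  taskDefinition: fargateTaskDef,
  serviceConnectConfiguration: {
    services: [{ portMappingName: 'sandbox', dnsName: 'sandbox', port: 8194 }],
  },
});
```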
Expected Behavior
The essential container starts successfully with service connect enabled.
Observed Behavior
The ECS agent does not start the essential container after the service connect container has started.
Environment Details
ECS agent: 1.87.0
EC2 AMI: amzn2-ami-ecs-hvm-2.0.20241010-x86_64-ebs
Supporting Log Snippets
See the description above.