[Pending Cluster Tasks][Allocation] Add context on 'shard-failed' #102606

Closed
stefnestor opened this issue Nov 24, 2023 · 4 comments · Fixed by #125520
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement stalled Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Distributed Coordination Meta label for Distributed Coordination team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@stefnestor
Contributor

Description

👋 howdy, team!

Would you kindly consider adding the index name to the Cluster Pending Tasks output for shard-failed (ballpark code), similar to what is already done for shard-started (code)?

Currently emitted examples where shard-started is informative but shard-failed is not:

{
  "tasks": [
    {
      "executing": false,
      "insert_order": 5862,
      "priority": "HIGH",
-      "source": "shard-failed",
      "time_in_queue": "201ms",
      "time_in_queue_millis": 201
    },
    {
      "executing": false,
      "insert_order": 5789,
      "priority": "URGENT",
+      "source": "shard-started StartedShardEntry{shardId [[MY_INDEX_NAME][0]], allocationId [SOME_UUID], primary term [4], message [after peer recovery]}",
      "time_in_queue": "1.1m",
      "time_in_queue_millis": 71104
    }
  ]
}
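
To illustrate why the extra context matters, here is a minimal sketch of how someone triaging a cluster might pull the index name and shard number out of a `source` string. It assumes the `shardId [[index][shard]]` shape shown in the shard-started example above; the function name `extract_shard` is hypothetical and not part of Elasticsearch.

```python
import re

# Assumption: the source string embeds "shardId [[<index>][<shard>]]",
# as in the shard-started example above. Other task sources may differ.
_SHARD_ID = re.compile(r"shardId \[\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\]")


def extract_shard(source):
    """Return (index_name, shard_number) if the source carries a shardId, else None."""
    match = _SHARD_ID.search(source)
    if match is None:
        return None
    return match.group("index"), int(match.group("shard"))
```

With the two tasks above, the shard-started entry yields an index name and shard number, while the bare `shard-failed` entry yields nothing, which is the gap this issue asks to close.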
@stefnestor stefnestor added >enhancement needs:triage Requires assignment of a team area label labels Nov 24, 2023
@mayya-sharipova mayya-sharipova added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed needs:triage Requires assignment of a team area label labels Nov 24, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Nov 24, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@stefnestor stefnestor added the Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. label Nov 24, 2023
@DaveCTurner
Contributor

This highlights a general problem with the descriptions for cluster tasks being a bit of a mess. Originally source was intended as a key for grouping similar tasks together (e.g. shard-failed) but many implementations put too much detail there for grouping to be meaningful. But then again that detail is useful and today there's nowhere else to put it. I sketched out an improvement that adds a better separation between source and detail at #102613 which would help with this kind of thing.
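
The tension between a grouping key and free-form detail can be sketched on the client side as follows. This is only an illustration, not the approach taken in #102613: it assumes the key is everything before the first space in `source`, which holds for the two examples in this issue but is not guaranteed for other task types. The helper names are hypothetical.

```python
from collections import defaultdict


def split_source(source):
    """Split a task source into (key, detail) at the first space.

    Assumption: the grouping key never contains a space.
    """
    key, _, detail = source.partition(" ")
    return key, detail


def group_by_key(tasks):
    """Group the detail strings of pending tasks under their source key."""
    groups = defaultdict(list)
    for task in tasks:
        key, detail = split_source(task["source"])
        groups[key].append(detail)
    return dict(groups)
```

Applied to the `tasks` array from a Cluster Pending Tasks response, this gives one bucket per task type while keeping whatever detail each entry carries, which is empty for a source that is only a bare key.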

JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue Mar 24, 2025
Appends the FailedShardEntry request to the 'shard-failed'
task source string in ShardFailedTransportHandler.messageReceived().
This information will now be available in the 'source' string for
shard failed task entries in the Cluster Pending Tasks API response.
This source string change matches what is done in the
ShardStartedTransportHandler.

Closes elastic#102606.
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue Mar 25, 2025
@JeremyDahlgren JeremyDahlgren self-assigned this Mar 26, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Mar 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-obsolete (Team:Distributed (Obsolete))

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

omricohenn pushed a commit to omricohenn/elasticsearch that referenced this issue Mar 28, 2025
…#125520)

Appends the FailedShardEntry request to the 'shard-failed'
task source string in ShardFailedTransportHandler.messageReceived().
This information will now be available in the 'source' string for
shard failed task entries in the Cluster Pending Tasks API response.
This source string change matches what is done in the
ShardStartedTransportHandler.

Closes elastic#102606.