@MrMarvin

Use case

Safely remove an agent node that may be running tasks of SDK-based data frameworks.

Idea

  1. Prepare a list of nodes to decommission (address/name and Mesos agent ID).
  2. For each node, perform steps 3 to 7:
  3. Get the list of SDK-based tasks running on the node and record it.
  4. Shut down the Mesos agent by sending it SIGUSR1 (kill -s SIGUSR1) and then running systemctl stop.
  5. Trigger a pod replace for each task found in step 3.
  6. Decommission the node.
  7. Sleep for a while before moving on to the next node, so that nodes are not replaced faster than the services can re-replicate their data.

The provided script implements steps 3 and 5.
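The remaining steps could be chained around the script roughly as follows. This is only a sketch that prints the commands for each node instead of running them, in the same spirit as the script writing pod_replaces.sh. The function names plan_node and plan_all, the nodes file format, the per-node file name pod_replaces_<agent id>.sh and the 600 second pause are assumptions made for illustration and are not part of this PR.

```shell
#!/usr/bin/env bash
# Pause between nodes, in seconds. 600 is an arbitrary placeholder (assumption),
# not a recommendation; pick a value that suits your replication times.
SLEEP_SECONDS="${SLEEP_SECONDS:-600}"

# plan_node PRIVATE_IP AGENT_ID
# Prints the commands for steps 3 to 7 for one node. Nothing is executed here.
plan_node() {
  ip="$1"
  agent_id="$2"
  cat <<EOF
bash pod_replace_drain_agent.sh ${agent_id} > pod_replaces_${agent_id}.sh
dcos node ssh --private-ip=${ip} --master-proxy "sudo systemctl kill -s SIGUSR1 dcos-mesos-slave"
dcos node ssh --private-ip=${ip} --master-proxy "sudo systemctl stop dcos-mesos-slave"
bash pod_replaces_${agent_id}.sh
dcos node decommission ${agent_id}
sleep ${SLEEP_SECONDS}
EOF
}

# plan_all NODES_FILE
# NODES_FILE holds one "PRIVATE_IP AGENT_ID" pair per line (hypothetical format).
plan_all() {
  while read -r ip agent_id; do
    if [ -n "$ip" ] && [ -n "$agent_id" ]; then
      plan_node "$ip" "$agent_id"
    fi
  done < "$1"
}
```

Calling plan_all nodes.txt > drain_plan.sh would give a plan that can be reviewed before anything is executed.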

Note: 'kafka' was chosen arbitrarily as the SDK-based package from which to install the SDK CLI tools, e.g. dcos package install kafka --cli. The pod replace command does not depend on which package's CLI it comes from: every SDK CLI supports pod replace for ANY SDK-based service, including K8s.
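As an illustration of the note above, a generated replace command might be built like this. The helper name replace_cmd and the CLI_PACKAGE variable are hypothetical, and the sketch assumes the service name can be passed to --name exactly as it is printed in the script output (for example data-services/hdfs).

```shell
#!/usr/bin/env bash
# CLI subcommand used to reach the SDK scheduler. 'kafka' is the arbitrary
# choice described in the note; the variable itself is an assumption.
CLI_PACKAGE="${CLI_PACKAGE:-kafka}"

# replace_cmd SERVICE_NAME POD_NAME
# Prints (does not run) the pod replace command for one pod of one service.
replace_cmd() {
  printf 'dcos %s --name=%s pod replace %s\n' "$CLI_PACKAGE" "$1" "$2"
}
```

For example, replace_cmd data-services/hdfs data-1 prints a command that goes through the kafka CLI even though the target service is HDFS.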

Example

grey:ops marv$ bash pod_replace_drain_agent.sh f497df44-5ddd-4807-813a-ddacef17e0d0-S9 > pod_replaces.sh
Getting the list of tasks on f497df44-5ddd-4807-813a-ddacef17e0d0-S9 ...
Framework data-services/confluent-kafka-kerberos has tasks on node.
Pod to issue replace for "kafka-2":
Framework data-services/hdfs has tasks on node.
Pod to issue replace for "data-1":
Framework mom-apps has tasks on node.
Framework kubernetes/kubernetes has tasks on node.
Pod to issue replace for "kube-controller-manager-1":
Framework /logs/elasticsearch has tasks on node.
Pod to issue replace for "data-1":
Pod to issue replace for "master-1":
Framework marathon has tasks on node.
Framework data-services/confluent-kafka has tasks on node.
Pod to issue replace for "kafka-2":
Framework data-services/confluent-zookeeper has tasks on node.
Pod to issue replace for "zookeeper-0":

grey:ops marv$ dcos node ssh --private-ip=172.31.10.48 --master-proxy "sudo systemctl kill -s SIGUSR1 dcos-mesos-slave"
Running `ssh -A -t  54.42.23.1 -- ssh -A -t  172.31.10.48 -- sudo systemctl kill -s SIGUSR1 dcos-mesos-slave`
grey:ops marv$ dcos node ssh --private-ip=172.31.10.48 --master-proxy "sudo systemctl stop dcos-mesos-slave"
[...]

grey:ops marv$ bash pod_replaces.sh
{
  "pod": "kafka-2",
  "tasks": ["kafka-2-broker"]
}

{
  "pod": "data-1",
  "tasks": ["data-1-node"]
}

{
  "pod": "kube-controller-manager-1",
  "tasks": ["kube-controller-manager-1-instance"]
}

{
  "pod": "data-1",
  "tasks": ["data-1-node"]
}

{
  "pod": "master-1",
  "tasks": ["master-1-node"]
}

{
  "pod": "kafka-2",
  "tasks": ["kafka-2-broker"]
}

{
  "pod": "zookeeper-0",
  "tasks": [
    "zookeeper-0-metrics",
    "zookeeper-0-server"
  ]
}

grey:ops marv$ dcos node decommission f497df44-5ddd-4807-813a-ddacef17e0d0-S9
Agent f497df44-5ddd-4807-813a-ddacef17e0d0-S9 has been marked as gone.
