Scheduler extension
Ravi Gadde committed Nov 25, 2015
1 parent ef84c57 commit cadc24e
Showing 20 changed files with 1,278 additions and 138 deletions.
117 changes: 117 additions & 0 deletions docs/design/scheduler_extender.md
@@ -0,0 +1,117 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.1/docs/design/scheduler_extender.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Scheduler extender

There are three ways to add new scheduling rules (predicates and priority functions) to Kubernetes:

1. adding the new rules to the scheduler and recompiling, as described in https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md;
2. implementing your own scheduler process that runs instead of, or alongside, the standard Kubernetes scheduler;
3. implementing a "scheduler extender" process that the standard Kubernetes scheduler calls out to as a final pass when making scheduling decisions.

This document describes the third approach. It is needed for use cases where scheduling decisions have to take into account resources that are not directly managed by the standard Kubernetes scheduler; the extender helps make scheduling decisions based on such resources. (Note that the three approaches are not mutually exclusive.)

When scheduling a pod, the extender allows an external process to filter and prioritize nodes. Two separate HTTP/HTTPS calls are issued to the extender, one for the "filter" action and one for the "prioritize" action. To use the extender, you must create a scheduler policy configuration file. The configuration specifies how to reach the extender, whether to use HTTP or HTTPS, and the timeout for each call.

```go
// Holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
// it is assumed that the extender chose not to provide that extension.
type ExtenderConfig struct {
	// URLPrefix at which the extender is available
	URLPrefix string `json:"urlPrefix"`
	// Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix
	// when issuing the filter call to the extender.
	FilterVerb string `json:"filterVerb,omitempty"`
	// Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix
	// when issuing the prioritize call to the extender.
	PrioritizeVerb string `json:"prioritizeVerb,omitempty"`
	// The numeric multiplier for the node scores that the prioritize call generates.
	// The weight should be a positive integer.
	Weight int `json:"weight,omitempty"`
	// EnableHttps specifies whether https should be used to communicate with the extender
	EnableHttps bool `json:"enableHttps,omitempty"`
	// TLSConfig specifies the transport layer security config
	TLSConfig *client.TLSClientConfig `json:"tlsConfig,omitempty"`
	// HTTPTimeout specifies the timeout duration for a call to the extender. A filter call timeout
	// fails the scheduling of the pod; a prioritize call timeout is ignored, and the k8s/other
	// extenders' priorities are used to select the node.
	HTTPTimeout time.Duration `json:"httpTimeout,omitempty"`
}
```

A sample scheduler policy file with extender configuration:

```json
{
  "predicates": [
    {
      "name": "HostName"
    },
    {
      "name": "MatchNodeSelector"
    },
    {
      "name": "PodFitsResources"
    }
  ],
  "priorities": [
    {
      "name": "LeastRequestedPriority",
      "weight": 1
    }
  ],
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:12345/api/scheduler",
      "filterVerb": "filter",
      "enableHttps": false
    }
  ]
}
```

The arguments passed to the extender's FilterVerb endpoint are the pod being scheduled and the list of nodes that passed the scheduler's built-in (k8s) predicates. The arguments passed to the PrioritizeVerb endpoint are the pod and the list of nodes that passed both the built-in predicates and the extender's filter.

```go
// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
// nodes for a pod.
type ExtenderArgs struct {
	// Pod being scheduled
	Pod api.Pod `json:"pod"`
	// List of candidate nodes where the pod can be scheduled
	Nodes api.NodeList `json:"nodes"`
}
```

The "filter" call returns a list of nodes (api.NodeList). The "prioritize" call returns priorities for each node (schedulerapi.HostPriorityList).

The "filter" call may prune the set of nodes based on its predicates. Scores returned by the "prioritize" call are added to the k8s scores (computed through its priority functions) and used for final host selection.

Multiple extenders can be configured in the scheduler policy.
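
For instance, a policy that combines a filter-only extender with a prioritize-only extender might look like the following sketch (illustrative; the addresses, verbs and weights are placeholders, not part of this change):

```json
{
  "predicates": [
    {"name": "PodFitsResources"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1}
  ],
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:12345/api/scheduler",
      "filterVerb": "filter",
      "enableHttps": false
    },
    {
      "urlPrefix": "http://127.0.0.1:12346/scheduler",
      "prioritizeVerb": "prioritize",
      "weight": 2,
      "enableHttps": false
    }
  ]
}
```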

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
5 changes: 3 additions & 2 deletions examples/examples_test.go
@@ -235,7 +235,8 @@ func TestExampleObjectSchemas(t *testing.T) {
 			"daemon": &extensions.DaemonSet{},
 		},
 		"../examples": {
-			"scheduler-policy-config": &schedulerapi.Policy{},
+			"scheduler-policy-config":               &schedulerapi.Policy{},
+			"scheduler-policy-config-with-extender": &schedulerapi.Policy{},
 		},
 		"../examples/rbd/secret": {
 			"ceph-secret": &api.Secret{},
@@ -409,7 +410,7 @@ func TestExampleObjectSchemas(t *testing.T) {
 			t.Logf("skipping : %s/%s\n", path, name)
 			return
 		}
-		if name == "scheduler-policy-config" {
+		if strings.Contains(name, "scheduler-policy-config") {
 			if err := schedulerapilatest.Codec.DecodeInto(data, expectedType); err != nil {
 				t.Errorf("%s did not decode correctly: %v\n%s", path, err, string(data))
 				return
25 changes: 25 additions & 0 deletions examples/scheduler-policy-config-with-extender.json
@@ -0,0 +1,25 @@
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ],
  "extenders" : [
    {
      "urlPrefix": "http://127.0.0.1:12346/scheduler",
      "filterVerb": "filter",
      "prioritizeVerb": "prioritize",
      "weight": 5,
      "enableHttps": false
    }
  ]
}
27 changes: 14 additions & 13 deletions plugin/pkg/scheduler/algorithm/priorities/priorities.go
@@ -25,6 +25,7 @@ import (
 	"k8s.io/kubernetes/pkg/labels"
 	"k8s.io/kubernetes/plugin/pkg/scheduler/algorithm"
 	"k8s.io/kubernetes/plugin/pkg/scheduler/algorithm/predicates"
+	schedulerapi "k8s.io/kubernetes/plugin/pkg/scheduler/api"
 )
 
 // the unused capacity is calculated on a scale of 0-10
@@ -73,7 +74,7 @@ func getNonzeroRequests(requests *api.ResourceList) (int64, int64) {
 
 // Calculate the resource occupancy on a node. 'node' has information about the resources on the node.
 // 'pods' is a list of pods currently scheduled on the node.
-func calculateResourceOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) algorithm.HostPriority {
+func calculateResourceOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) schedulerapi.HostPriority {
 	totalMilliCPU := int64(0)
 	totalMemory := int64(0)
 	capacityMilliCPU := node.Status.Capacity.Cpu().MilliValue()
@@ -104,7 +105,7 @@ func calculateResourceOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) al
 		cpuScore, memoryScore,
 	)
 
-	return algorithm.HostPriority{
+	return schedulerapi.HostPriority{
 		Host: node.Name,
 		Score: int((cpuScore + memoryScore) / 2),
 	}
@@ -114,14 +115,14 @@ func calculateResourceOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) al
 // It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes
 // based on the minimum of the average of the fraction of requested to capacity.
 // Details: cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity) / 2
-func LeastRequestedPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (algorithm.HostPriorityList, error) {
+func LeastRequestedPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (schedulerapi.HostPriorityList, error) {
 	nodes, err := nodeLister.List()
 	if err != nil {
-		return algorithm.HostPriorityList{}, err
+		return schedulerapi.HostPriorityList{}, err
 	}
 	podsToMachines, err := predicates.MapPodsToMachines(podLister)
 
-	list := algorithm.HostPriorityList{}
+	list := schedulerapi.HostPriorityList{}
 	for _, node := range nodes.Items {
 		list = append(list, calculateResourceOccupancy(pod, node, podsToMachines[node.Name]))
 	}
@@ -144,7 +145,7 @@ func NewNodeLabelPriority(label string, presence bool) algorithm.PriorityFuncti
 // CalculateNodeLabelPriority checks whether a particular label exists on a node or not, regardless of its value.
 // If presence is true, prioritizes nodes that have the specified label, regardless of value.
 // If presence is false, prioritizes nodes that do not have the specified label.
-func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (algorithm.HostPriorityList, error) {
+func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (schedulerapi.HostPriorityList, error) {
 	var score int
 	nodes, err := nodeLister.List()
 	if err != nil {
@@ -157,7 +158,7 @@ func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podListe
 		labeledNodes[node.Name] = (exists && n.presence) || (!exists && !n.presence)
 	}
 
-	result := []algorithm.HostPriority{}
+	result := []schedulerapi.HostPriority{}
 	//score int - scale of 0-10
 	// 0 being the lowest priority and 10 being the highest
 	for nodeName, success := range labeledNodes {
@@ -166,7 +167,7 @@ func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podListe
 		} else {
 			score = 0
 		}
-		result = append(result, algorithm.HostPriority{Host: nodeName, Score: score})
+		result = append(result, schedulerapi.HostPriority{Host: nodeName, Score: score})
 	}
 	return result, nil
 }
@@ -177,21 +178,21 @@ func (n *NodeLabelPrioritizer) CalculateNodeLabelPriority(pod *api.Pod, podListe
 // close the two metrics are to each other.
 // Detail: score = 10 - abs(cpuFraction-memoryFraction)*10. The algorithm is partly inspired by:
 // "Wei Huang et al. An Energy Efficient Virtual Machine Placement Algorithm with Balanced Resource Utilization"
-func BalancedResourceAllocation(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (algorithm.HostPriorityList, error) {
+func BalancedResourceAllocation(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (schedulerapi.HostPriorityList, error) {
 	nodes, err := nodeLister.List()
 	if err != nil {
-		return algorithm.HostPriorityList{}, err
+		return schedulerapi.HostPriorityList{}, err
 	}
 	podsToMachines, err := predicates.MapPodsToMachines(podLister)
 
-	list := algorithm.HostPriorityList{}
+	list := schedulerapi.HostPriorityList{}
 	for _, node := range nodes.Items {
 		list = append(list, calculateBalancedResourceAllocation(pod, node, podsToMachines[node.Name]))
 	}
 	return list, nil
 }
 
-func calculateBalancedResourceAllocation(pod *api.Pod, node api.Node, pods []*api.Pod) algorithm.HostPriority {
+func calculateBalancedResourceAllocation(pod *api.Pod, node api.Node, pods []*api.Pod) schedulerapi.HostPriority {
 	totalMilliCPU := int64(0)
 	totalMemory := int64(0)
 	score := int(0)
@@ -234,7 +235,7 @@ func calculateBalancedResourceAllocation(pod *api.Pod, node api.Node, pods []*ap
 		score,
 	)
 
-	return algorithm.HostPriority{
+	return schedulerapi.HostPriority{
 		Host: node.Name,
 		Score: score,
 	}