Multi-cri is a modular container runtime interface (CRI) for Kubernetes that manages the pod lifecycle and allows adapters to be configured for different container runtimes or resource managers, such as SLURM. In addition, it provides multi-CRI support for Kubernetes, so different CRIs can be selected by setting the RuntimeClass pod attribute.
Multi-cri is implemented against Kubernetes v1.13.0.
Multi-cri serves on a Unix socket. It can be configured with the following options:
  --adapter-name                     Adapter name. (default "slurm")
  --enable-pod-network               Enable pod network namespace
  --enable-pod-persistence           Enable pod and container persistence in cache file
  --network-bin-dir string           The directory for putting network binaries. (default "/opt/cni/bin")
  --network-conf-dir string          The directory for putting network plugin configuration files. (default "/etc/cni/net.d")
  --remote-runtime-endpoints         Remote runtime endpoints to support RuntimeClass. Add several by separating with comma. (default "default:/var/run/dockershim.sock")
  --resources-cache-path string      Path where image, container and sandbox information will be stored. It will also be the image pool path (default "/root/.multi-cri/")
  --root-dir string                  Root directory path for multi-cri managed files (metadata checkpoint etc). (default "/var/lib/multi-cri")
  --sandbox-image string             The image used by sandbox container. (default "gcr.io/google_containers/pause:3.0")
  --socket-path string               Path to the socket which multi-cri serves on. (default "/var/run/multicri.sock")
  --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  --stream-addr string               The ip address streaming server is listening on. Default host interface is used if this is empty.
  --stream-port string               The port streaming server is listening on. (default "10010")
Multi-cri can configure the pod network namespace through CNI. This is enabled with --enable-pod-network.
Pod and container metadata can be persisted to disk by enabling --enable-pod-persistence.
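For instance, both features could be enabled on the command line; this is only an illustrative invocation, and the socket path shown is simply the default listed above:
multi-cri --enable-pod-network \
          --enable-pod-persistence \
          --network-conf-dir /etc/cni/net.d \
          --socket-path /var/run/multicri.sock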
In order to provide multi-CRI support for Kubernetes, we support RuntimeClass, which lets several remote CRIs be configured and identified by name. The runtime is selected with the runtimeClassName pod spec attribute. Our implementation contemplates several scenarios:
- A default remote CRI is configured and used whenever the runtimeClassName attribute has no value. In this case, we set runtimeClassName: multicri in the pods we want to run in multi-cri.
- The default CRI is not configured, so multi-cri is used by default.
- Several remote CRI endpoints are configured, in this format:
--remote-runtime-endpoints default:/var/run/dockershim.sock
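Several endpoints are separated with commas. As a sketch, the following registers the default CRI plus a second one selectable through its own RuntimeClass; the second endpoint name and socket path are hypothetical:
--remote-runtime-endpoints default:/var/run/dockershim.sock,other:/var/run/other-cri.sock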
Because Kubernetes does not provide pod information to the image manager, we need to prefix the runtime class to the image name: image: multicri/perl
We also need to configure a full container runtime interface able to execute docker containers, for example dockershim.
In the following, we can see how to configure the Multicri runtimeClass:
# kubectl apply -f runtime_multicri.yaml
apiVersion: node.k8s.io/v1alpha1  # RuntimeClass is defined in the node.k8s.io API group
kind: RuntimeClass
metadata:
  name: multicri
  # The name the RuntimeClass will be referenced by
  # RuntimeClass is a non-namespaced resource
spec:
  runtimeHandler: multicri 
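As a minimal sketch, a pod can then select this RuntimeClass and use a multicri-prefixed image; the pod name is arbitrary and the image reference follows the multicri/docker.perl:latest convention used in the job examples later in this document:
apiVersion: v1
kind: Pod
metadata:
  name: perl-on-multicri   # hypothetical name
spec:
  runtimeClassName: multicri
  containers:
  - name: perl
    image: multicri/docker.perl:latest
    command: ["ls", "/"]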
The last section of this document shows an example of the full multi-cri setup.
Our CRI supports docker and Singularity repositories. In order to identify images handled by the multicri RuntimeClass, image names must always start with the selected CRI identification name, multicri.
This CRI supports the Singularity Hub repository, https://singularity-hub.org, which contains a collection of public Singularity images.
Because the Kubernetes image format does not allow ://, we need to specify the image without those characters. Therefore, we use singularity-repository to identify images from this kind of repository.
In the following, we show the image pull workflow for singularity hub images.
- The user specifies the image name, including the repository name (singularity-repository), in Kubernetes. For example: singularity-repository.jorgesece/singularity-nginx:latest.
- The CRI parses the image name to get the right Singularity image URL. For example, it will generate the URL shub://jorgesece/singularity-nginx from the aforementioned image name.
- The CRI builds the image with the Singularity client (singularity build shub://jorgesece/singularity-nginx), registers the image in the CRI image pool and stores it in the image storage folder.
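Roughly, the build step is equivalent to running the Singularity 3.0 client by hand; the output file name below is only illustrative:
singularity build singularity-nginx.sif shub://jorgesece/singularity-nginx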
The docker image repository is supported, at least for those adapters based on Singularity or docker.
In the following, we show the image pull workflow for docker repository images:
- The user specifies the image name, including the repository name (docker-repository). For example: docker-repository.perl.
- The CRI parses the image name to get the right docker image URL. For example, it will generate the URL docker://perl.
- The CRI builds the image with the Singularity client (singularity build docker://perl), registers the image in the CRI image pool and stores it in the image storage folder.
It supports private docker repositories by using ImageSecret credentials.
Users can upload their images to a container volume and specify the path to that image.
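Assuming these ImageSecret credentials map to standard Kubernetes image pull secrets, a sketch of wiring them up could look like the following; the secret name, registry credentials and private image reference are all placeholders:
# created beforehand with:
#   kubectl create secret docker-registry my-registry-secret \
#     --docker-username=<user> --docker-password=<password> --docker-server=<registry>
spec:
  runtimeClassName: multicri
  imagePullSecrets:
  - name: my-registry-secret                       # placeholder secret name
  containers:
  - name: private-job
    image: multicri/docker-repository.myorg/private-perl   # placeholder private image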
- The user specifies the image name, including the repository name (local-image). For example: local-image.perl.img.
- The CRI parses the image name to get the right image path in the volume. For example, it will generate the path volume/perl.img.
- The CRI copies the image to the user's working directory.
 
Multi-cri aims to be a generic CRI in which different runtimes are supported by implementing different adapters.
We can configure it by setting the --adapter-name variable.
At the moment, there is an adapter for the Slurm workload manager.
The Slurm adapter supports batch job submissions to Slurm clusters. It is configured with the following environment variables:
- CRI_SLURM_MOUNT_PATH: String environment variable. It is the working directory in the Slurm cluster ("multi-cri" by default). This path is relative to the $HOME directory.
 - CRI_SLURM_IMAGE_REMOTE_MOUNT: String environment variable. It is the path in which the images will be built (empty by default). They are built in the container persistent volume path by default.
 - CRI_SLURM_BUILD_IN_CLUSTER: Boolean environment variable which indicates whether to build images directly in the Slurm cluster (default false). Images are built on the CRI node by default.
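For instance, these variables could be exported before launching multi-cri; the values below are only examples:
export CRI_SLURM_MOUNT_PATH=multi-cri                  # relative to $HOME on the cluster
export CRI_SLURM_IMAGE_REMOTE_MOUNT=/scratch/images    # example build path
export CRI_SLURM_BUILD_IN_CLUSTER=true
multi-cri --adapter-name slurm --socket-path /var/run/multicri.sock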
 
- MPI jobs are supported and configured by environment variables.
 - Slurm cluster credentials are provided by environment variables.
 - Data transfer is supported by using NFS. Containers mount NFS volumes, which are linked to the proper Slurm NFS mount.
 - The local image repository uses images stored in the NFS container volume.
 
Container job execution is configured by the following environment variables:
- Slurm credentials:
  - CLUSTER_USERNAME: user name to access the cluster.
  - CLUSTER_PASSWORD: user password to access the cluster.
  - CLUSTER_HOST: host/IP of the cluster.
- Slurm prerun configuration:
  - CLUSTER_CONFIG: prerun script executed before the run script defined by the container command. It must be passed as text.
- Slurm job configuration:
  - JOB_QUEUE: queue in which to submit the job.
  - JOB_GPU: GPU configuration, in the format "gpu[[:type]:count]". For instance: gpu:kepler:2. More information in Slurm GRES.
  - JOB_NUM_NODES: number of nodes.
  - JOB_NUM_CORES_NODE: number of cores in each node.
  - JOB_NUM_CORES: number of cores to distribute across the nodes.
  - JOB_NUM_TASKS_NODE: number of tasks to allocate on one node.
  - JOB_CUSTOM_CONFIG: custom Slurm environment variables. More information in Slurm input environment variables.
- MPI configuration:
  - MPI_VERSION: MPI version. The job is treated as an MPI job when this variable has a value; if it is not set, the job won't be MPI.
  - MPI_FLAGS: MPI flags.
Note: container environment variables matching the CLUSTER_*, JOB_*, KUBERNETES_* and MPI_FLAGS patterns are reserved for the system.
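As a sketch, the job-sizing and MPI variables above are passed as plain container environment variables; the values here are arbitrary examples:
env:
- name: JOB_NUM_NODES
  value: "2"
- name: JOB_NUM_CORES_NODE
  value: "4"
- name: JOB_GPU
  value: "gpu:kepler:2"
- name: MPI_VERSION
  value: "1.10.2"
- name: MPI_FLAGS
  value: "-np 2"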
In order to work properly with SLURM, we must configure NFS in this way:
- K8s side:
  - Create an NFS PersistentVolume (PV) and PersistentVolumeClaim (PVC) for the NFS path (/<NFS PATH>).
  - Mount the volume in the container with mountPath: "multicri", so the CRI knows which volume is the Slurm volume.
- Slurm side:
  - Mount the NFS path, /<NFS PATH>, on $HOME/<CRI_SLURM_MOUNT_PATH>/<VOLUME CLAIM NAME>.
Pod results will be stored on the NFS server, specifically in the path /<NFS PATH>/<Sandbox ID>/<Container ID>. You can see the exact path in the pod logs.
Data can be recovered by mounting the NFS path on your computer, or by mounting the volume in a new pod (for example, busybox) and using kubectl cp <new pod>:/<Mount Point>/<Sandbox ID>/<Container ID> <Local Path>.
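A possible recovery pod, assuming the nfs-vol1 claim used later in this document; the pod name and sleep duration are arbitrary:
apiVersion: v1
kind: Pod
metadata:
  name: nfs-recover          # arbitrary name
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: results
      mountPath: /results
  volumes:
  - name: results
    persistentVolumeClaim:
      claimName: nfs-vol1
# then: kubectl cp nfs-recover:/results/<Sandbox ID>/<Container ID> <Local Path>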
- Slurm NFS configuration:
It is important to set up the mount point in Slurm as $HOME/<CRI_SLURM_MOUNT_PATH>/<VOLUME CLAIM NAME>, because the adapter will use it as the working directory. For example:
sudo mount <NFS server IP>:/mnt/storage/multicri-nfs /home/jorge/multi-cri/nfs-vol1
- K8s PV and PVC configuration:
 - Create the PV:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-vol1
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: <CLUSTER IP>
    path: "/mnt/storage/multicri-nfs"
 - Create the PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-vol1
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi
The following example:
- Configures credentials through environment variables; they can be set, for example, by using K8s secrets.
- Uses the docker image repository.
- Uses NFS data transfer. It mounts a PersistentVolumeClaim called nfs-vol1.
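The trujillo-secret referenced below could be created, for example, with kubectl; the literal values are placeholders to be filled in for your cluster:
kubectl create secret generic trujillo-secret \
  --from-literal=username=<cluster user> \
  --from-literal=password=<cluster password> \
  --from-literal=host=<cluster host> \
  --from-literal=port=<ssh port> \
  --from-literal=queue=<slurm partition> \
  --from-literal=config='<prerun script text>'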
apiVersion: batch/v1
kind: Job
metadata:
  name: job-perl-slurm-vol-pod
spec:
  backoffLimit: 1
  template:
    metadata:
      labels:
        name: job-slurm-template
    spec:
      runtimeClassName: multicri
      containers:
      - name: job-slurm-container
        image: multicri/docker.perl.img:latest
        command: ["sleep", "60", "&&", "ls", "/"]
        env:
        - name: CLUSTER_USERNAME
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: username
        - name: CLUSTER_PASSWORD
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: password
        - name: CLUSTER_HOST
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: host
        - name: CLUSTER_PORT
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: port
        - name: CLUSTER_CONFIG
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: config
        - name: JOB_QUEUE
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: queue
        volumeMounts:
        # name must match the volume name below
          - name: my-pvc-nfs
            mountPath: "multicri"
      restartPolicy: Never
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      volumes:
      - name: my-pvc-nfs
        persistentVolumeClaim:
          claimName: nfs-vol1
apiVersion: batch/v1
kind: Job
metadata:
  name: job-perl-slurm-pod
spec:
  backoffLimit: 1
  template:
    metadata:
      labels:
        name: job-slurm-template
    spec:
      runtimeClassName: multicri
      containers:
      - name: job-slurm-container
        image: multicri/docker.perl:latest
        command: ["ls", "/"]
        env:
        - name: MPI_VERSION
          value: "1.10.2"
        - name: MPI_FLAGS
          value: "-np 2"
        - name: CLUSTER_USERNAME
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: username
        - name: CLUSTER_PASSWORD
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: password
        - name: CLUSTER_HOST
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: host
        - name: CLUSTER_PORT
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: port
        - name: CLUSTER_CONFIG
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: config
        - name: JOB_QUEUE
          valueFrom:
            secretKeyRef:
              name: trujillo-secret
              key: queue
      restartPolicy: Never
      nodeSelector:
        beta.kubernetes.io/arch: amd64
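A sketch of submitting and inspecting one of these jobs; the manifest file name is whatever you saved it as:
kubectl apply -f job_perl_slurm.yaml       # hypothetical file name
kubectl get jobs
kubectl logs job/job-perl-slurm-pod        # the logs include the result path on NFS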
In the following, you can find an explanation of a full setup of this system.
First of all, we need to launch the docker CRI socket by using the kubelet command; this applies to both minikube and kubelet deployments. It can be done as a systemd service with the following configuration:
[Unit]
Description=dockershim for remote Multicri CRI
[Service]
ExecStart=/usr/bin/kubelet --experimental-dockershim --port 11250
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
RequiredBy=multi-cri.service
Second, we have to run and configure multi-cri with the aforementioned parameters, in addition to installing Singularity 3.0. For instance, we can run it as a systemd service with the following configuration:
[Unit]
Description=CRI Multicri
[Service]
Environment=CRI_SLURM_BUILD_IN_CLUSTER=true
ExecStart=/usr/local/bin/multi-cri -v 3 --socket-path /var/run/multi-cri.sock --remote-runtime-endpoints default:/var/run/dockershim.sock
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
# Use localkube.service instead in case of using minikube
WantedBy=kubelet.service
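Assuming the two unit files above are installed as dockershim.service and multi-cri.service (names chosen here only for illustration), they can be enabled in the usual systemd way:
sudo systemctl daemon-reload
sudo systemctl enable --now dockershim.service
sudo systemctl enable --now multi-cri.service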
If the CNI network plugin raises a "network configuration file not found" error, we can configure it by following the instructions at https://github.com/containernetworking/cni.
Third, we configure RuntimeClass, which is supported since Kubernetes 1.12.0.
Fourth, we need to create a RuntimeClass instance for each runtime we want to use, except for the default one.
# kubectl apply -f runtime_multicri.yaml
apiVersion: node.k8s.io/v1alpha1  # RuntimeClass is defined in the node.k8s.io API group
kind: RuntimeClass
metadata:
  name: multicri
  # The name the RuntimeClass will be referenced by
  # RuntimeClass is a non-namespaced resource
spec:
  runtimeHandler: multicri
Next, we configure Kubernetes to use multi-cri as the remote container runtime. The following command shows how to do it:
kubelet --container-runtime=remote --container-runtime-endpoint=/var/run/multi-cri.sock
If you use minikube, you can launch it using your local machine as the host and configuring the CRI parameters in this way:
minikube start --kubernetes-version=v1.13.0 --vm-driver=none --extra-config=kubelet.container-runtime=remote --extra-config=kubelet.container-runtime-endpoint=/var/run/multi-cri.sock
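To check that the kubelet is actually talking to multi-cri, the node's reported container runtime and the RuntimeClass objects can be inspected with standard kubectl commands:
kubectl get nodes -o wide      # the CONTAINER-RUNTIME column should report the remote CRI
kubectl get runtimeclasses     # lists the multicri RuntimeClass created earlier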