AKS - Run Kubernetes Pods on Specific VM Types

This article covers how to utilize node pools in Azure Kubernetes Service (AKS) to tailor your Kubernetes infrastructure to specific compute needs. It explains the creation of initial and additional node pools, assigning labels for node selection, and deploying pods to specific node pools using node selectors and node affinity. Further, it highlights the importance of node and pod affinity in ensuring high availability and fault tolerance by distributing pods across different availability zones. Practical examples of deploying applications using these techniques are presented.

When dealing with software that you run in Kubernetes, it is a common requirement to have your applications running on certain types of underlying infrastructure (virtual machines). Perhaps your software has a high memory requirement, or maybe needs GPU optimization. At any rate, you might find yourself saying “I need these pods to be running on this type of virtual machine”.

With Azure Kubernetes Service (AKS) this is possible through the use of node pools. Node pools are the compute representation of the underlying agent nodes for a Kubernetes cluster.

What is a node pool?

A node pool is a collection of VMs (usually backed by a Virtual Machine Scale Set (VMSS)) that can participate in a Kubernetes cluster.

Node pools diagram

The first type is the system node pool, which hosts the system-related pods (e.g. those in kube-system). This system node pool is created by default when you create your AKS cluster. It runs the system pods, but will also run user pods if it is the only node pool in the cluster. It is the only “required” node pool.

The second type is the user node pool. If you want to add compute or have different options for your agent nodes, you can create additional user node pools. All VMs in the same node pool share the same configuration (VM size, labels, disk size, etc.).

If you have some software that should be running on a specific type of compute, you can create a node pool to host that software. We’ll see how to do that and how to wire it all up.

Create the AKS cluster with user node pools

To create the AKS cluster, we’ll start out with the default setup (only the single system node pool):

$ az group create \
    --location eastus \
    --name rg
$ az aks create \
    --resource-group rg \
    --name aks \
    --node-count 2
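
Once the cluster is up, merge its credentials into your kubeconfig so kubectl can talk to it:

$ az aks get-credentials \
    --resource-group rg \
    --name aks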

After this cluster is created, we can see that we have our two agent nodes (system) ready:

$ kubectl get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-nodepool1-36584864-vmss000000   Ready    agent   69s   v1.17.13
aks-nodepool1-36584864-vmss000001   Ready    agent   73s   v1.17.13

A helpful way to see the node pool breakdown is to run az aks nodepool list:

$ az aks nodepool list --resource-group rg --cluster-name aks -o table
Name       OsType    VmSize           Count    MaxPods    ProvisioningState    Mode
---------  --------  ---------------  -------  ---------  -------------------  ------
nodepool1  Linux     Standard_DS2_v2  2        110        Succeeded            System

This shows that we only have our single system node pool with two VMs in it. Next, let’s add two user node pools with different VM sizes, labeling each one so we can target it later:

$ az aks nodepool add \
    --resource-group rg \
    --cluster-name aks \
    --name nodepool2 \
    --node-count 3 \
    --node-vm-size Standard_DS2_v2 \
    --labels "vmsize=small"

$ az aks nodepool add \
    --resource-group rg \
    --cluster-name aks \
    --name nodepool3 \
    --node-count 3 \
    --node-vm-size Standard_DS5_v2 \
    --labels "vmsize=large"

The important thing to note here is the --labels parameter, which will put a label on each of these nodes in the node pool. It is through this label that we’ll be able to specify which nodes our pods should land on.

Now we can see that we have these additional agent nodes:

$ kubectl get nodes --label-columns vmsize
NAME                                STATUS   ROLES   AGE     VERSION    VMSIZE
aks-nodepool1-36584864-vmss000000   Ready    agent   7m5s    v1.17.13
aks-nodepool1-36584864-vmss000001   Ready    agent   7m9s    v1.17.13
aks-nodepool2-36584864-vmss000000   Ready    agent   3m13s   v1.17.13   small
aks-nodepool2-36584864-vmss000001   Ready    agent   3m9s    v1.17.13   small
aks-nodepool2-36584864-vmss000002   Ready    agent   3m18s   v1.17.13   small
aks-nodepool3-36584864-vmss000000   Ready    agent   2m48s   v1.17.13   large
aks-nodepool3-36584864-vmss000001   Ready    agent   2m35s   v1.17.13   large
aks-nodepool3-36584864-vmss000002   Ready    agent   2m48s   v1.17.13   large

And the additional node pools:

$ az aks nodepool list --resource-group rg --cluster-name aks -o table
Name       OsType    VmSize           Count    MaxPods    ProvisioningState    Mode
---------  --------  ---------------  -------  ---------  -------------------  ------
nodepool1  Linux     Standard_DS2_v2  2        110        Succeeded            System
nodepool2  Linux     Standard_DS2_v2  3        110        Succeeded            User
nodepool3  Linux     Standard_DS5_v2  3        110        Succeeded            User
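
You can also filter the agent nodes directly by the label we applied, which is handy for checking a single pool:

$ kubectl get nodes -l vmsize=large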

Create pods on specific node pools

When we created our node pools, we had the option to specify --labels. These become Kubernetes node labels, which we can reference from the pod spec. If we were creating a deployment that should have pods scheduled only on large VMs, it could look like this:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: test-app1
spec:
  replicas: 16
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      containers:
        - name: app1
          image: debian:latest
          command: ["/bin/bash"]
          args: ["-c", "while true; do echo hello world; sleep 10; done"]
      nodeSelector:
        vmsize: large

As you can see, we set our nodeSelector to match the label we applied to the node pool, vmsize=large.

Kubernetes will only schedule pods onto the nodes matching the labels you specify.
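
Save the manifest (here as test-app1.yaml, an arbitrary filename) and apply it:

$ kubectl apply -f test-app1.yaml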

And now with these pods running, we can validate which nodes they are running on:

$ kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP           NODE                                NOMINATED NODE   READINESS GATES
test-app1-555d8bc87b-2gj9j   1/1     Running   0          23s   10.244.5.3   aks-nodepool3-36584864-vmss000000   <none>           <none>
test-app1-555d8bc87b-46p4m   1/1     Running   0          23s   10.244.7.4   aks-nodepool3-36584864-vmss000001   <none>           <none>
test-app1-555d8bc87b-6hvzw   1/1     Running   0          23s   10.244.7.3   aks-nodepool3-36584864-vmss000001   <none>           <none>
test-app1-555d8bc87b-6vj9b   1/1     Running   0          23s   10.244.6.5   aks-nodepool3-36584864-vmss000002   <none>           <none>
test-app1-555d8bc87b-9zkpt   1/1     Running   0          23s   10.244.5.2   aks-nodepool3-36584864-vmss000000   <none>           <none>
test-app1-555d8bc87b-b6bqm   1/1     Running   0          23s   10.244.5.6   aks-nodepool3-36584864-vmss000000   <none>           <none>
test-app1-555d8bc87b-grv77   1/1     Running   0          23s   10.244.7.2   aks-nodepool3-36584864-vmss000001   <none>           <none>
test-app1-555d8bc87b-hgjwz   1/1     Running   0          23s   10.244.6.6   aks-nodepool3-36584864-vmss000002   <none>           <none>
test-app1-555d8bc87b-m27cp   1/1     Running   0          23s   10.244.7.5   aks-nodepool3-36584864-vmss000001   <none>           <none>
test-app1-555d8bc87b-m4bck   1/1     Running   0          23s   10.244.6.2   aks-nodepool3-36584864-vmss000002   <none>           <none>
test-app1-555d8bc87b-msztk   1/1     Running   0          23s   10.244.5.5   aks-nodepool3-36584864-vmss000000   <none>           <none>
test-app1-555d8bc87b-mvz9k   1/1     Running   0          23s   10.244.5.4   aks-nodepool3-36584864-vmss000000   <none>           <none>
test-app1-555d8bc87b-mwm5v   1/1     Running   0          23s   10.244.6.3   aks-nodepool3-36584864-vmss000002   <none>           <none>
test-app1-555d8bc87b-q9p7b   1/1     Running   0          23s   10.244.6.4   aks-nodepool3-36584864-vmss000002   <none>           <none>
test-app1-555d8bc87b-wsbjk   1/1     Running   0          23s   10.244.6.7   aks-nodepool3-36584864-vmss000002   <none>           <none>
test-app1-555d8bc87b-xtdnr   1/1     Running   0          23s   10.244.7.6   aks-nodepool3-36584864-vmss000001   <none>           <none>

The NODE column shows that, as expected, these pods are running only on nodes in nodepool3, the pool backed by Standard_DS5_v2 VMs.
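
If you just want a count of pods per node rather than the full listing, a one-liner like this works (output omitted since it will vary):

$ kubectl get pods -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c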

Moving beyond the node selector with affinity

The node selector is sufficient for simple placement rules, but it is usually too limited for more complex scheduling requirements.

For example, you may have an app that needs to run across separate availability zones. Or you may want to keep the API and the database on separate nodes, for example when you don’t have many replicas.

This is where affinity and anti-affinity rules come in. Using them, you can create “hard” (required) and “soft” (preferred) rules, so Kubernetes can still schedule a pod even when there is no perfectly matching node. They also let you match against the labels of pods already running on a node, which gives you more precise control over where new pods land.

It’s essential to keep in mind that there are two types of affinity:

  • Node affinity influences how pods are matched to nodes, based on labels on the nodes.

  • Pod affinity influences how pods are scheduled based on the labels of pods already running on a node.

Node affinity: what is it, and how does it work?

Similar to the node selector, node affinity lets you use labels to tell the kube-scheduler which nodes your pods should land on.
Remember that if you specify both nodeSelector and nodeAffinity, both must be satisfied for the pod to be scheduled.

There are two kinds of node affinity rules:

  • requiredDuringSchedulingIgnoredDuringExecution: the scheduler will only schedule the pod if the node meets the rule.

  • preferredDuringSchedulingIgnoredDuringExecution: the scheduler will try to find a node matching the rule, but it will still schedule the pod even if it doesn’t find anything suitable.
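
As a sketch of the “soft” variant, here is a deployment that prefers, but does not require, the large VMs from earlier; the name test-app2 and the weight of 100 are arbitrary choices for illustration:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: test-app2
spec:
  replicas: 4
  selector:
    matchLabels:
      app: app2
  template:
    metadata:
      labels:
        app: app2
    spec:
      containers:
        - name: app2
          image: debian:latest
          command: ["/bin/bash"]
          args: ["-c", "while true; do echo hello world; sleep 10; done"]
      affinity:
        nodeAffinity:
          # Prefer, but do not require, nodes labeled vmsize=large.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: vmsize
                    operator: In
                    values:
                      - large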

What is pod affinity?

Working along similar lines, pod affinity influences the Kubernetes scheduler based on the labels of pods already running on a given node.
You specify it within the affinity section of the pod spec, using the podAffinity and podAntiAffinity fields.

  • Pod affinity assumes that a given pod can run in a specific location if there is already a pod there that meets particular conditions.

  • Pod anti-affinity offers the opposite functionality, preventing pods from running on the same node as pods matching particular criteria.
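
For instance, here is a sketch of a pod that refuses to land on a node already running a pod labeled app=app1 (the deployment from earlier); the pod name is arbitrary:

apiVersion: v1
kind: Pod
metadata:
  name: app1-anti-affinity-example
  labels:
    app: app1
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # Never share a node with another pod labeled app=app1.
        - labelSelector:
            matchLabels:
              app: app1
          topologyKey: kubernetes.io/hostname
  containers:
    - name: app1
      image: debian:latest
      command: ["/bin/bash"]
      args: ["-c", "while true; do echo hello world; sleep 10; done"]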

Node affinity in action: high availability and fault tolerance

By spreading pods across several different nodes, you can ensure that your application remains available even if one or more of those nodes fail.

With node affinity, you can instruct the Kubernetes scheduler to choose nodes in different availability zones, data centers, or regions. By doing so, your app can continue running even if a single availability zone or data center experiences an outage.

Here is an example deployment on Azure pinned to a single availability zone, eastus2-1, using a node selector:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-cross-single-az
  labels:
    app: nginx-cross-single-az
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx-single-az
  template:
    metadata:
      labels:
        app: nginx-single-az
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: "eastus2-1"
      containers:
      - name: nginx
        image: nginx:1.24.0
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 2

In this case, the node selector will only pick nodes with the label topology.kubernetes.io/zone set to eastus2-1.
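
To confirm which zone each agent node is in (assuming the node pools span availability zones), you can display the zone label:

$ kubectl get nodes --label-columns topology.kubernetes.io/zone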

For comparison, here’s an example of node affinity set for multi-zone pod scheduling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-cross-az
  labels:
    app: nginx-cross-az
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx-cross-az
  template:
    metadata:
      labels:
        app: nginx-cross-az
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "topology.kubernetes.io/zone"
                operator: In
                values:
                - eastus2-1
                - eastus2-2
                - eastus2-3
      containers:
      - name: nginx
        image: nginx:1.24.0
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 2
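
Node affinity gets the pods onto nodes in the allowed zones, but it does not by itself stop several replicas from landing in the same zone. One way to nudge the scheduler toward spreading them out is to combine the nodeAffinity above with a “soft” podAntiAffinity rule keyed on the zone label; here is a sketch of how the affinity section of the nginx-cross-az pod template could be extended (the weight of 100 is arbitrary):

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "topology.kubernetes.io/zone"
                operator: In
                values:
                - eastus2-1
                - eastus2-2
                - eastus2-3
        podAntiAffinity:
          # Prefer zones that do not already run a replica of this app.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: nginx-cross-az
              topologyKey: topology.kubernetes.io/zone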