Commands and Arguments
Let’s say we have a Dockerfile that looks like this:
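A minimal sketch of such a Dockerfile (the sleep-based image here is an assumed example, not a required one):

```Dockerfile
FROM ubuntu

# ENTRYPOINT sets the executable that runs when the container starts.
ENTRYPOINT ["sleep"]

# CMD supplies the default argument(s) to the ENTRYPOINT, so this image
# sleeps for 5 seconds and then exits.
CMD ["5"]
```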
What if we wanted to change this behavior when we spin up the container in a Pod? This is where the following come in:
- `command`: overrides the `ENTRYPOINT` of the Dockerfile when we run the container in a Pod.
- `args`: overrides the `CMD` of the Dockerfile when we run the container in a Pod.
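For example, a Pod spec roughly like this (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleeper-pod
spec:
  containers:
    - name: ubuntu-sleeper
      image: ubuntu-sleeper      # hypothetical image built from the Dockerfile above
      command: ["sleep"]         # replaces the image's ENTRYPOINT
      args: ["10"]               # replaces the image's CMD
```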
Environment Variables, ConfigMaps, and Secrets
We can pass environment variables into our container as key-value pairs like so:
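Something like the following (the variable names are arbitrary examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image        # placeholder image
      env:
        - name: APP_COLOR        # key
          value: blue            # value
        - name: APP_MODE
          value: prod
```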
A ConfigMap is an abstraction that makes it easy to manage a lot of environment variables at once. When we create a ConfigMap below, take note of the fact that there is no `spec` key, but instead, a `data` key.
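A sketch of such a ConfigMap (reusing the example variables from above):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:                            # note: data, not spec
  APP_COLOR: blue
  APP_MODE: prod
```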
You can use the following with the `kubectl` tool:
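For instance (the ConfigMap name, keys, and file name are assumed examples):

```sh
# Create a ConfigMap imperatively from literal key-value pairs
kubectl create configmap app-config \
  --from-literal=APP_COLOR=blue \
  --from-literal=APP_MODE=prod

# ...or from a file of key-value pairs
kubectl create configmap app-config --from-file=app_config.properties

# Inspect the ConfigMaps in the cluster
kubectl get configmaps
kubectl describe configmap app-config
```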
We can import the `data` from our ConfigMap like so:
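A sketch using `envFrom`, which injects every key of the ConfigMap as an environment variable (one possible approach; you can also pull individual keys with `valueFrom`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image
      envFrom:
        - configMapRef:
            name: app-config     # every key in app-config becomes an env var
```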
Secrets are very similar to ConfigMaps, except they are stored in an encoded format. The values under `data.YOUR_KEY` must be base64-encoded. To learn how to do this, see the comments in the config below.
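A sketch of a Secret (the keys and values are assumed examples):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
data:
  # Each value must be base64-encoded before it goes into the manifest, e.g.:
  #   echo -n 'mysql'  | base64   ->  bXlzcWw=
  #   echo -n 'root'   | base64   ->  cm9vdA==
  #   echo -n 'paswrd' | base64   ->  cGFzd3Jk
  DB_Host: bXlzcWw=
  DB_User: cm9vdA==
  DB_Password: cGFzd3Jk
```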
You can decode a base64 string like so:
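```sh
echo -n 'bXlzcWw=' | base64 --decode    # prints: mysql
```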
We can view the Secrets in our cluster using `kubectl get secrets`.
Now, let’s give a container access to our Secret.
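A sketch using `envFrom` with a `secretRef`, mirroring the ConfigMap example above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image
      envFrom:
        - secretRef:
            name: app-secret     # every key in app-secret becomes an env var
```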
For production: You may have noticed that base64 isn’t actually doing anything to keep our information safe, since it can be very easily decoded. Have a look at this article to learn more about the shortcomings of Secrets, as well as how to use secrets in a production environment. (tl;dr use AWS Secrets Manager, Google Cloud Key Management Service, or Azure Key Vault).
Security Contexts
Docker implements a set of security features which limits the abilities of the container’s root user. This means that, by default, the root user within a container has a lot of Linux capabilities disabled, such as `CHOWN`, `DAC`, `KILL`, `SETFCAP`, `SETPCAP`, `SETGID`, `SETUID`, `NETBIND`, `NET_RAW`, `MAC_ADMIN`, `BROADCAST`, `NET_ADMIN`, `SYS_ADMIN`, and `SYS_CHROOT`, to name a few. To view a full list of Linux capabilities, run the following:
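One way to do this (assuming a Linux host with the kernel headers and man pages installed):

```sh
# Full list of capability constants shipped with the kernel headers
cat /usr/include/linux/capability.h

# Or read the descriptions in the manual
man 7 capabilities
```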
If you want to make changes to the capabilities a Docker container has, you can run one of the following:
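For example (illustrative `docker run` invocations):

```sh
# Add a capability to the container
docker run --cap-add NET_ADMIN ubuntu

# Drop a capability from the container
docker run --cap-drop KILL ubuntu

# Run the container with all capabilities enabled (use sparingly)
docker run --privileged ubuntu
```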
In Kubernetes, defining a set of enabled and disabled Linux capabilities is called a Security Context. We can apply a security context like so:
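A sketch reconstructed from the notes below (the image names are placeholders; the container names `ubuntu` and `some-other-thing` come from those notes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:               # applies to every container in the Pod
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
    - name: ubuntu
      image: ubuntu
      command: ["sleep", "3600"]
      securityContext:           # applies to this container only
        capabilities:
          add: ["NET_ADMIN"]
    - name: some-other-thing
      image: some-other-image    # placeholder
      securityContext:
        allowPrivilegeEscalation: false
```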
Take note of the following:
- The Pod-level `securityContext` applies to all containers in the Pod:
  - `runAsUser`: specifies that for any containers in the Pod, all processes will run with user ID 1000.
  - `runAsGroup`: specifies the primary group ID of 3000 for all processes within any containers of the Pod. If this field is omitted, the primary group ID of the containers will be root (0).
  - `fsGroup`: specifies a supplementary group that the containers belong to. In this case, containers in this Pod belong to supplementary group ID 2000.
- We are adding the `NET_ADMIN` capability only to the `ubuntu` container.
- We are disabling privilege escalation only for the `some-other-thing` container.
As you might imagine, there’s actually quite a lot to security contexts, which you can read more about here, but this should do for a tl;dr article for now.
Service Accounts
Authentication, Authorization, RBAC, etc… There are two types of accounts in K8s. A user account is used by admins, developers and… well… users! A ServiceAccount is used by machines, such as a system monitoring tool like Prometheus or a build automation tool like Jenkins. Let’s create a ServiceAccount:
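(The account name below is inferred from the token name used later in this section.)

```sh
kubectl create serviceaccount my-serviceacct
```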
Let’s list the ServiceAccounts that exist within our Namespace:
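```sh
kubectl get serviceaccounts
```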
Notice that there are two ServiceAccounts when we run the above command. Some things to be aware of:

- For every Namespace in Kubernetes, a ServiceAccount named `default` is automatically created; each Namespace has its own `default` ServiceAccount.
- Whenever a Pod is created, the `default` ServiceAccount, and its token, are automagically mounted to that Pod as a VolumeMount.
- The `default` ServiceAccount is very limited, and only has permission to run basic K8s API queries.
Now, let’s get the details of the ServiceAccount that we just created:
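```sh
kubectl describe serviceaccount my-serviceacct
```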
Notice there is a `Tokens` field with a single element by the name `my-serviceacct-token-kbbdm`. This object is a Secret, which contains an API token which will be used to authenticate applications wishing to utilize the ServiceAccount. We can get the API token of `my-serviceacct-token-kbbdm` by running the following command:
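```sh
kubectl describe secret my-serviceacct-token-kbbdm
```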
We can attach this token as an authentication header to `curl`, or put this token into a 3rd party application that interacts with our K8s cluster.
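For example, something along these lines (the API server address and port are placeholders):

```sh
curl https://my-kube-cluster:6443/api --insecure \
  --header "Authorization: Bearer <token-from-the-previous-command>"
```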
As mentioned in one of the bullets above, whenever a Pod is created, the `default` ServiceAccount is mounted as a Volume. We can change the ServiceAccount that gets mounted to a Pod like so:
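A sketch (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  serviceAccountName: my-serviceacct   # defaults to "default" when omitted
  containers:
    - name: my-app
      image: my-app-image
```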
NOTE: You cannot change the ServiceAccount being used by an existing Pod. However, you can update the ServiceAccount used by the containers of a Deployment, thanks to the rolling update of the Pods associated with the Deployment.
Resource Requests and Limits
Worker nodes have a limited pool of resources (CPU, memory, and disk), so the K8s scheduler schedules Pods in such a way as to avoid resource starvation. We can include a resource request in a container spec, so that when the scheduler tries to place a Pod on a node, it can determine which nodes can support the Pod. If none of the nodes have enough resources to run the Pod, the Pod will maintain a status of `Pending`.
If you know that your Pod will need more than the defaults, you can modify these values. And since, by default, a Docker container has no limit on the amount of resources it can consume on a node, we can also impose limits on the container.
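A sketch combining both, matching the numbers described below (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image
      resources:
        requests:
          memory: "1Gi"          # 1 gibibyte
          cpu: 1                 # 1 vCPU core
        limits:
          memory: "2Gi"
          cpu: 2
```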
- We request 1 Gibibyte of memory (1024 mebibytes, as opposed to a Gigabyte, which is 1000 megabytes) and 1 vCPU core (equivalent to 1 AWS vCPU, 1 GCP Core, 1 Azure Core, or 1 Hyperthread).
- We restrict the container to 2 Gibibytes of memory and 2 vCPU cores.
- If a container attempts to consume more vCPU, Kubernetes throttles the vCPU.
- Containers are allowed to consume more memory than the limit, but if they make a habit of overconsuming memory, the Pod will be terminated and restarted.
We can declare a LimitRange within a Namespace to create default `resources.requests` and `resources.limits` for all Pods created within that Namespace. For example:
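A sketch of such a LimitRange (the object name is arbitrary; the namespace and values match the description below):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: dev
spec:
  limits:
    - type: Container
      defaultRequest:            # default resources.requests
        memory: 256Mi
      default:                   # default resources.limits
        memory: 512Mi
```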
The above yaml reads as follows:

- If a Container is created in the `dev` namespace without specifying its own requests or limits,
- then a default memory request of 256 mebibytes is created for that Container,
- along with a default memory limit of 512 mebibytes.
Taints and Tolerations
Taints and tolerations provide finer control over where the K8s scheduler places Pods. A taint is applied to a Node and essentially acts like “Pod repellent,” while tolerations are applied to Pods and act like “taint immunity.” The gif (pronounced /ɡɪf/) below should offer some insight.

Above, you can see we have Node_1, Node_2, and Node_3. We also have 4 Pods we wish to schedule across the three nodes: Pod_A, Pod_B, Pod_C, and Pod_D. We have applied a taint on Node_1, and a toleration on Pod_D. Pod_A, Pod_B, and Pod_C are intolerant of the taint on Node_1, so they get scheduled across Node_2 and Node_3. Finally, Pod_D tolerates the taint on Node_1, so Pod_D gets scheduled on Node_1.
To apply a toleration to a Pod, you can add this information to a Pod config:
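A sketch (the key, value, and effect here are assumed examples; they must match the taint applied to the Node):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image
  tolerations:
    - key: "app"
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"
```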
See the K8s docs for more yaml variants.
We can apply a taint to a node through the command line like so:
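(The placeholders are explained below; `node1` in the second command is just an example node name.)

```sh
kubectl taint nodes node-name key=value:taint-effect

# e.g., to repel Pods that do not tolerate app=blue:
kubectl taint nodes node1 app=blue:NoSchedule
```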
- `node-name` should be replaced with the name of the node you want to apply the taint to.
- `key=value` should be replaced with the key-value pair that a Pod’s toleration must match in order to ignore the taint. For example, if we only wanted Pods like the one above (which tolerates `app=blue`) to be scheduled on the node, we would specify `app=blue`.
- `taint-effect` defines what happens to Pods that do not tolerate the taint. Replace it with one of the following:
  - `NoSchedule`: the Pods will not be scheduled on the Node.
  - `PreferNoSchedule`: K8s will try to avoid placing a Pod on the Node, but this is not guaranteed.
  - `NoExecute`: new Pods will not be scheduled on the Node, and existing Pods on the Node (which may have been scheduled before the taint was applied) will be evicted if they do not tolerate the taint.
Fun Fact: A Kubernetes cluster is composed of a single Master Node, and n-many Worker Nodes. By default, no Pods can be scheduled on the Master Node, which is accomplished via a Taint. You can modify this behavior if you need (which I don’t recommend), but you can view this taint via `kubectl describe node kubemaster | grep Taint`.
Node Selectors and Affinity
We can assign a label to a Node from the command line like so:
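```sh
kubectl label nodes node-name label-key=label-value
```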
- `node-name` should be replaced with the name of the node you want to attach the label to.
- `label-key` is the key you want to assign to the node.
- `label-value` is the value you want to assign to the node.
Let’s say we have a single Node in our cluster that is a lot beefier than the rest, and we want to prefer that a particular Pod runs on it to avoid sucking up too many resources of the weaker nodes. We can apply a label on it like so:
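```sh
kubectl label nodes my-big-node size=Large
```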
Now, we can control where Pods are scheduled in another way, through the use of a nodeSelector.
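A sketch of such a Pod spec (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-big-app-pod
spec:
  containers:
    - name: my-big-app
      image: my-big-app-image
  nodeSelector:
    size: Large                  # matches the label we applied to my-big-node
```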
- The `nodeSelector` schedules this Pod on Nodes with the label `size=Large`. This Pod will get scheduled on `my-big-node` due to the `kubectl label` command we ran above.
There are limitations to nodeSelectors. For example, we can’t specify a Pod to run on either a Large OR Medium Node. Likewise, we can’t specify a Pod to run on any Node that IS NOT Small. This is where nodeAffinity comes into play.
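A sketch of the equivalent Pod spec (the numbered comments mark the sections referenced in the breakdown and replacements below):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-big-app-pod
spec:
  containers:
    - name: my-big-app
      image: my-big-app-image
  affinity:                                                # 1
    nodeAffinity:                                          # 2
      requiredDuringSchedulingIgnoredDuringExecution:      # 3
        nodeSelectorTerms:                                 # 4
          - matchExpressions:                              # 5
              - key: size
                operator: In
                values:
                  - Large
```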
This yaml file looks super complicated, but it provides the same functionality as the yaml in the nodeSelector example above. Let’s break this down:
- Define an `affinity` section at the same level as `containers`.
- Declare that we are going to be applying a `nodeAffinity`.
- What if there are no Nodes that this could be scheduled to? What if someone changes the labels on the Node later on? This is handled via the type of nodeAffinity. There are a few types of nodeAffinity that can be used:
  - `requiredDuringSchedulingIgnoredDuringExecution`: the scheduler can only place the Pod on matching Nodes, and if no matching Nodes exist, the Pod will not be scheduled (thus, required during scheduling). If the label associated with the Node changes such that the nodeAffinity is no longer satisfied, the Pods will continue to run normally (thus, ignored during execution).
  - `preferredDuringSchedulingIgnoredDuringExecution`: the scheduler will try to place the Pod on matching Nodes, and if no matching Nodes exist, the Pod will be scheduled on whatever Node can support it (thus, preferred during scheduling). If the label associated with the Node changes such that the nodeAffinity is no longer satisfied, the Pods will continue to run normally (thus, ignored during execution).
  - `requiredDuringSchedulingRequiredDuringExecution`: the scheduler can only place the Pod on matching Nodes, and if no matching Nodes exist, the Pod will not be scheduled (thus, required during scheduling). If the label associated with the Node changes such that the nodeAffinity is no longer satisfied, the Pod will be evicted from the Node (thus, required during execution).
- `matchExpressions` (section 5 above) are the labels we want to match on. Because we can match against multiple values, `values` is a list.
We can accomplish scheduling on a Large OR Medium Node by replacing section 5 with:
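```yaml
          - matchExpressions:                              # 5 (same indentation as the example above)
              - key: size
                operator: In
                values:
                  - Large
                  - Medium
```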
We can accomplish scheduling on any Node that is NOT Small by replacing section 5 with:
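```yaml
          - matchExpressions:                              # 5
              - key: size
                operator: NotIn
                values:
                  - Small
```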
We can choose to only schedule Pods on Nodes which have a particular key by replacing section 5 with:
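```yaml
          - matchExpressions:                              # 5
              - key: size
                operator: Exists   # no values needed; only checks that the key is present
```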
Using Node Affinity with Taints and Tolerations
Using Node Affinity, Taints, and Tolerations, we have finer control over where Pods get scheduled. Refer to the gif below:
There’s a lot going on here, so let’s break it down:
- We have 5 Pods we want to schedule across 5 Nodes.
- Three of the Pods and Nodes have metadata assigned to them which associates a color to them.
- We don’t care where the uncolored Pods get scheduled.
- We don’t care what gets scheduled on the uncolored Nodes.
- We want to make sure that exactly one Pod gets scheduled per Node.

Now let’s take some action to schedule things nicely:
We apply taints to the colored Nodes which looks something like this:
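Something like this (the node names are placeholders):

```sh
kubectl taint nodes node-green color=green:NoSchedule
kubectl taint nodes node-blue  color=blue:NoSchedule
kubectl taint nodes node-red   color=red:NoSchedule
```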
Next, we apply tolerations to the colored Pods (green-pod.yaml, blue-pod.yaml, and red-pod.yaml), each matching its Node’s taint:
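For example, the green Pod’s spec might contain (the container and image names are placeholders):

```yaml
# green-pod.yaml (blue-pod.yaml and red-pod.yaml are analogous, with their own colors)
apiVersion: v1
kind: Pod
metadata:
  name: green-pod
spec:
  containers:
    - name: green-app
      image: green-app-image     # placeholder
  tolerations:
    - key: "color"
      operator: "Equal"
      value: "green"
      effect: "NoSchedule"
```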
Then, we attach labels to the colored Nodes for use with NodeAffinity:
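Again with placeholder node names:

```sh
kubectl label nodes node-green color=green
kubectl label nodes node-blue  color=blue
kubectl label nodes node-red   color=red
```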
Finally, we add Node Affinities to the spec of the colored Pods (green-pod.yaml, blue-pod.yaml, and red-pod.yaml) like so:
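For example, the green Pod’s affinity (again, blue and red are analogous):

```yaml
# added under spec: in green-pod.yaml (blue-pod.yaml and red-pod.yaml are analogous)
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: color
                operator: In
                values:
                  - green
```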
What results is a way to guarantee:
- the green Pod is scheduled on the green Node.
- the blue Pod is scheduled on the blue Node.
- the red Pod is scheduled on the red Node.
- the two uncolored Pods are not scheduled on any of the colored Nodes, but instead, are scheduled arbitrarily across the “Other” Nodes.