Commands and Arguments
Let’s say we have a Dockerfile that looks like this:
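A minimal sketch of such a Dockerfile (the sleep-based image here is an assumed example, not a required one):

```Dockerfile
FROM ubuntu

# ENTRYPOINT sets the executable that runs when the container starts.
ENTRYPOINT ["sleep"]

# CMD supplies the default argument(s) to the ENTRYPOINT, so this image
# sleeps for 5 seconds and then exits.
CMD ["5"]
```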
What if we wanted to change this behavior when we spin up the container in a Pod? This is where the following come in:
- `command`: overrides the `ENTRYPOINT` of the Dockerfile when we run the container in a Pod.
- `args`: overrides the `CMD` of the Dockerfile when we run the container in a Pod.
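For example, a Pod spec roughly like this (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleeper-pod
spec:
  containers:
    - name: ubuntu-sleeper
      image: ubuntu-sleeper      # hypothetical image built from the Dockerfile above
      command: ["sleep"]         # replaces the image's ENTRYPOINT
      args: ["10"]               # replaces the image's CMD
```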
Environment Variables, ConfigMaps, and Secrets
We can pass environment variables into our container as key-value pairs like so:
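Something like the following (the variable names are arbitrary examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image        # placeholder image
      env:
        - name: APP_COLOR        # key
          value: blue            # value
        - name: APP_MODE
          value: prod
```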
A ConfigMap is an abstraction that makes it easy to manage a lot of environment variables at once. When we create a ConfigMap below, take note of the fact that there is no `spec` key, but instead, a `data` key.
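A sketch of such a ConfigMap (reusing the example variables from above):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:                            # note: data, not spec
  APP_COLOR: blue
  APP_MODE: prod
```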
You can use the following with the `kubectl` tool:
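For instance (the ConfigMap name, keys, and file name are assumed examples):

```sh
# Create a ConfigMap imperatively from literal key-value pairs
kubectl create configmap app-config \
  --from-literal=APP_COLOR=blue \
  --from-literal=APP_MODE=prod

# ...or from a file of key-value pairs
kubectl create configmap app-config --from-file=app_config.properties

# Inspect the ConfigMaps in the cluster
kubectl get configmaps
kubectl describe configmap app-config
```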
We can import the `data` from our ConfigMap like so:
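A sketch using `envFrom`, which injects every key of the ConfigMap as an environment variable (one possible approach; you can also pull individual keys with `valueFrom`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image
      envFrom:
        - configMapRef:
            name: app-config     # every key in app-config becomes an env var
```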
Secrets are very similar to ConfigMaps, except they are stored in an encoded format. The values under `data.YOUR_KEY` must be base64-encoded. To learn how to do this, see the comments in the config below.
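A sketch of a Secret (the keys and values are assumed examples):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
data:
  # Each value must be base64-encoded before it goes into the manifest, e.g.:
  #   echo -n 'mysql'  | base64   ->  bXlzcWw=
  #   echo -n 'root'   | base64   ->  cm9vdA==
  #   echo -n 'paswrd' | base64   ->  cGFzd3Jk
  DB_Host: bXlzcWw=
  DB_User: cm9vdA==
  DB_Password: cGFzd3Jk
```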
You can decode a base64 string like so:
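```sh
echo -n 'bXlzcWw=' | base64 --decode    # prints: mysql
```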
We can view the Secrets in our cluster using `kubectl get secrets`.
Now, let’s give a container access to our Secret.
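A sketch using `envFrom` with a `secretRef`, mirroring the ConfigMap example above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image
      envFrom:
        - secretRef:
            name: app-secret     # every key in app-secret becomes an env var
```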
For production: You may have noticed that base64 isn’t actually doing anything to keep our information safe, since it can be very easily decoded. Have a look at this article to learn more about the shortcomings of Secrets, as well as how to use secrets in a production environment. (tl;dr use AWS Secrets Manager, Google Cloud Key Management Service, or Azure Key Vault).
Security Contexts
Docker implements a set of security features which limits the abilities of the container’s root user. This means that, by default, the root user within a container has a lot of Linux capabilities disabled, such as `CHOWN`, `DAC`, `KILL`, `SETFCAP`, `SETPCAP`, `SETGID`, `SETUID`, `NETBIND`, `NET_RAW`, `MAC_ADMIN`, `BROADCAST`, `NET_ADMIN`, `SYS_ADMIN`, and `SYS_CHROOT`, to name a few. To view a full list of Linux capabilities, run the following:
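One way to do this (assuming a Linux host with the kernel headers and man pages installed):

```sh
# Full list of capability constants shipped with the kernel headers
cat /usr/include/linux/capability.h

# Or read the descriptions in the manual
man 7 capabilities
```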
If you want to make changes to the capabilities a Docker container has, you can run one of the following:
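For example (illustrative `docker run` invocations):

```sh
# Add a capability to the container
docker run --cap-add NET_ADMIN ubuntu

# Drop a capability from the container
docker run --cap-drop KILL ubuntu

# Run the container with all capabilities enabled (use sparingly)
docker run --privileged ubuntu
```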
In Kubernetes, defining a set of enabled and disabled Linux capabilities is called a Security Context. We can apply a security context like so:
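A sketch reconstructed from the notes below (the image names are placeholders; the container names `ubuntu` and `some-other-thing` come from those notes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:               # applies to every container in the Pod
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
    - name: ubuntu
      image: ubuntu
      command: ["sleep", "3600"]
      securityContext:           # applies to this container only
        capabilities:
          add: ["NET_ADMIN"]
    - name: some-other-thing
      image: some-other-image    # placeholder
      securityContext:
        allowPrivilegeEscalation: false
```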
Take note of the following:
- The Pod-level `securityContext` applies to all containers in the Pod:
  - `runAsUser`: specifies that for any containers in the Pod, all processes will run with user ID 1000.
  - `runAsGroup`: specifies the primary group ID of 3000 for all processes within any containers of the Pod. If this field is omitted, the primary group ID of the containers will be root (0).
  - `fsGroup`: specifies a supplementary group that the containers belong to. In this case, containers in this Pod belong to supplementary group ID 2000.
- We are adding the `NET_ADMIN` capability only to the `ubuntu` container.
- We are disabling privilege escalation only for the `some-other-thing` container.
As you might imagine, there’s actually quite a lot to security contexts, which you can read more about here, but this should do for a tl;dr article for now.
Service Accounts
Authentication, Authorization, RBAC, etc… There are two types of accounts in K8s. A user account is used by admins, developers and… well… users! A ServiceAccount is used by machines, such as a system monitoring tool like Prometheus or a build automation tool like Jenkins. Let’s create a ServiceAccount:
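(The account name below is inferred from the token name used later in this section.)

```sh
kubectl create serviceaccount my-serviceacct
```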
Let’s list the ServiceAccounts that exist within our Namespace:
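```sh
kubectl get serviceaccounts
```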
Notice that there are two ServiceAccounts when we run the above command. Some things to be aware of:

- For every Namespace in Kubernetes, a ServiceAccount named `default` is automatically created; each Namespace has its own `default` ServiceAccount.
- Whenever a Pod is created, the `default` ServiceAccount, and its token, are automagically mounted to that Pod as a VolumeMount.
- The `default` ServiceAccount is very limited, and only has permission to run basic K8s API queries.
Now, let’s get the details of the ServiceAccount that we just created:
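```sh
kubectl describe serviceaccount my-serviceacct
```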
Notice there is a `Tokens` field with a single element by the name `my-serviceacct-token-kbbdm`. This object is a Secret, which contains an API token which will be used to authenticate applications wishing to utilize the ServiceAccount. We can get the API token of `my-serviceacct-token-kbbdm` by running the following command:
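```sh
kubectl describe secret my-serviceacct-token-kbbdm
```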
We can attach this token as an authentication header to `curl`, or put this token into a 3rd party application that interacts with our K8s cluster.
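For example, something along these lines (the API server address and port are placeholders):

```sh
curl https://my-kube-cluster:6443/api --insecure \
  --header "Authorization: Bearer <token-from-the-previous-command>"
```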
As mentioned in one of the bullets above, whenever a Pod is created, the `default` ServiceAccount is mounted as a Volume. We can change the ServiceAccount that gets mounted to a Pod like so:
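A sketch (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  serviceAccountName: my-serviceacct   # defaults to "default" when omitted
  containers:
    - name: my-app
      image: my-app-image
```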
NOTE: You cannot change the ServiceAccount being used by an existing Pod. However, you can update the ServiceAccount used by the containers of a Deployment, thanks to the rolling update of the Pods associated with the Deployment.
Resource Requests and Limits
Worker nodes have a limited pool of resources (CPU, memory, and disk), so the K8s scheduler schedules Pods in such a way as to avoid resource starvation. We can include a resource request in a container spec, so that when the scheduler tries to place a Pod on a node, it can determine which nodes can support the Pod. If none of the nodes have enough resources to run the Pod, the Pod will maintain a status of `Pending`.
If you know that your Pod will need more than the defaults, you can modify these values. And since, by default, a Docker container has no limit on the amount of resources it can consume on a node, we can also impose limits on the container.
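A sketch combining both, matching the numbers described below (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image
      resources:
        requests:
          memory: "1Gi"          # 1 gibibyte
          cpu: 1                 # 1 vCPU core
        limits:
          memory: "2Gi"
          cpu: 2
```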
- We request 1 Gibibyte of memory (1024 mebibytes, as opposed to a Gigabyte, which is 1000 megabytes) and 1 vCPU core (equivalent to 1 AWS vCPU, 1 GCP Core, 1 Azure Core, or 1 Hyperthread).
- We restrict the container to 2 Gibibytes of memory and 2 vCPU cores.
- If a container attempts to consume more vCPU, Kubernetes throttles the vCPU.
- Containers are allowed to consume more memory than the limit, but if they make a habit of overconsuming memory, the Pod will be terminated and restarted.
We can declare a LimitRange within a Namespace to create default `resources.requests` and `resources.limits` for all Pods created within that Namespace. For example:
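A sketch of such a LimitRange (the object name is arbitrary; the namespace and values match the description below):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: dev
spec:
  limits:
    - type: Container
      defaultRequest:            # default resources.requests
        memory: 256Mi
      default:                   # default resources.limits
        memory: 512Mi
```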
The above yaml reads as follows:

- If a Container is created in the `dev` namespace without specifying its own requests or limits,
- then a default memory request of 256 mebibytes is created for that Container,
- along with a default memory limit of 512 mebibytes.
Taints and Tolerations
Taints and tolerations provide finer control over where the K8s scheduler places Pods. A taint is applied to a Node and essentially acts like “Pod repellent,” while tolerations are applied to Pods and act like “taint immunity.” The gif (pronounced /ɡɪf/) below should offer some insight.

Above, you can see we have Node_1, Node_2, and Node_3. We also have 4 Pods we wish to schedule across the three nodes: Pod_A, Pod_B, Pod_C, and Pod_D. We have applied a taint on Node_1, and a toleration on Pod_D. Pod_A, Pod_B, and Pod_C are intolerant of the taint on Node_1, so they get scheduled across Node_2 and Node_3. Finally, Pod_D tolerates the taint on Node_1, so Pod_D gets scheduled on Node_1.
To apply a toleration to a Pod, you can add this information to a Pod config:
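A sketch (the key, value, and effect here are assumed examples; they must match the taint applied to the Node):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
    - name: my-app
      image: my-app-image
  tolerations:
    - key: "app"
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"
```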
See the K8s docs for more yaml variants.
We can apply a taint to a node through the command line like so:
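(The placeholders are explained below; `node1` in the second command is just an example node name.)

```sh
kubectl taint nodes node-name key=value:taint-effect

# e.g., to repel Pods that do not tolerate app=blue:
kubectl taint nodes node1 app=blue:NoSchedule
```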
- `node-name` should be replaced with the name of the node you want to apply the taint to.
- `key=value` should be replaced with the key-value pair that a Pod’s toleration must match in order to ignore the taint. For example, if we only wanted Pods like the one above (which tolerates `app=blue`) to be scheduled on the node, we would specify `app=blue`.
- `taint-effect` defines what happens to Pods that do not tolerate the taint. Replace it with one of the following:
  - `NoSchedule`: the Pods will not be scheduled on the Node.
  - `PreferNoSchedule`: K8s will try to avoid placing a Pod on the Node, but this is not guaranteed.
  - `NoExecute`: new Pods will not be scheduled on the Node, and existing Pods on the Node (which may have been scheduled before the taint was applied) will be evicted if they do not tolerate the taint.
Fun Fact: A Kubernetes cluster is composed of a single Master Node, and n-many Worker Nodes. By default, no Pods can be scheduled on the Master Node, which is accomplished via a Taint. You can modify this behavior if you need (which I don’t recommend), but you can view this taint via `kubectl describe node kubemaster | grep Taint`.
Node Selectors and Affinity
We can assign a label to a Node from the command line like so:
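```sh
kubectl label nodes node-name label-key=label-value
```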
- `node-name` should be replaced with the name of the node you want to attach the label to.
- `label-key` is the key you want to assign to the node.
- `label-value` is the value you want to assign to the node.
Let’s say we have a single Node in our cluster that is a lot beefier than the rest, and we want to prefer that a particular Pod runs on it to avoid sucking up too many resources of the weaker nodes. We can apply a label on it like so:
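```sh
kubectl label nodes my-big-node size=Large
```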
Now, we can control where Pods are scheduled in another way, through the use of a nodeSelector.
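A sketch of such a Pod spec (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-big-app-pod
spec:
  containers:
    - name: my-big-app
      image: my-big-app-image
  nodeSelector:
    size: Large                  # matches the label we applied to my-big-node
```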
- The `nodeSelector` schedules this Pod on Nodes with the label `size=Large`. This Pod will get scheduled on `my-big-node` due to the `kubectl label` command we ran above.
There are limitations to nodeSelectors. For example, we can’t specify a Pod to run on either a Large OR Medium Node. Likewise, we can’t specify a Pod to run on any Node that IS NOT Small. This is where nodeAffinity comes into play.
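A sketch of the equivalent Pod spec (the numbered comments mark the sections referenced in the breakdown and replacements below):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-big-app-pod
spec:
  containers:
    - name: my-big-app
      image: my-big-app-image
  affinity:                                                # 1
    nodeAffinity:                                          # 2
      requiredDuringSchedulingIgnoredDuringExecution:      # 3
        nodeSelectorTerms:                                 # 4
          - matchExpressions:                              # 5
              - key: size
                operator: In
                values:
                  - Large
```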
This yaml file looks super complicated, but it provides the same functionality as the yaml in the nodeSelector example above. Let’s break this down:
- Define an `affinity` section at the same level as `containers`.
- Declare that we are going to be applying a `nodeAffinity`.
- What if there are no Nodes that this could be scheduled to? What if someone changes the labels on the Node later on? This is handled via the type of nodeAffinity. There are a few types of nodeAffinity that can be used:
  - `requiredDuringSchedulingIgnoredDuringExecution`: the scheduler can only place the Pod on matching Nodes, and if no matching Nodes exist, the Pod will not be scheduled (thus, required during scheduling). If the label associated with the Node changes such that the nodeAffinity is no longer satisfied, the Pods will continue to run normally (thus, ignored during execution).
  - `preferredDuringSchedulingIgnoredDuringExecution`: the scheduler will try to place the Pod on matching Nodes, and if no matching Nodes exist, the Pod will be scheduled on whatever Node can support it (thus, preferred during scheduling). If the label associated with the Node changes such that the nodeAffinity is no longer satisfied, the Pods will continue to run normally (thus, ignored during execution).
  - `requiredDuringSchedulingRequiredDuringExecution`: the scheduler can only place the Pod on matching Nodes, and if no matching Nodes exist, the Pod will not be scheduled (thus, required during scheduling). If the label associated with the Node changes such that the nodeAffinity is no longer satisfied, the Pod will be evicted from the Node (thus, required during execution).
- `matchExpressions` (section 5 above) are the labels we want to match on. Because we can match against multiple values, `values` is a list.
We can accomplish scheduling on a Large OR Medium Node by replacing section 5 with:
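```yaml
          - matchExpressions:                              # 5 (same indentation as the example above)
              - key: size
                operator: In
                values:
                  - Large
                  - Medium
```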
We can accomplish scheduling on any Node that is NOT Small by replacing section 5 with:
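```yaml
          - matchExpressions:                              # 5
              - key: size
                operator: NotIn
                values:
                  - Small
```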
We can choose to only schedule Pods on Nodes which have a particular key by replacing section 5 with:
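```yaml
          - matchExpressions:                              # 5
              - key: size
                operator: Exists   # no values needed; only checks that the key is present
```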
Using Node Affinity with Taints and Tolerations
Using Node Affinity, Taints, and Tolerations, we have finer control over where Pods get scheduled. Refer to the gif below:
There’s a lot going on here, so let’s break it down:
- We have 5 Pods we want to schedule across 5 Nodes.
- Three of the Pods and Nodes have metadata assigned to them which associates a color to them.
- We don’t care where the uncolored Pods get scheduled.
- We don’t care what gets scheduled on the uncolored Nodes.
- We want to make sure that exactly one Pod gets scheduled per Node.

Now let’s take some action to schedule things nicely:
We apply taints to the colored Nodes which looks something like this:
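Something like this (the node names are placeholders):

```sh
kubectl taint nodes node-green color=green:NoSchedule
kubectl taint nodes node-blue  color=blue:NoSchedule
kubectl taint nodes node-red   color=red:NoSchedule
```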
Next, we apply tolerations to the colored Pods (green-pod.yaml, blue-pod.yaml, and red-pod.yaml), each matching its Node’s taint:
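For example, the green Pod’s spec might contain (the container and image names are placeholders):

```yaml
# green-pod.yaml (blue-pod.yaml and red-pod.yaml are analogous, with their own colors)
apiVersion: v1
kind: Pod
metadata:
  name: green-pod
spec:
  containers:
    - name: green-app
      image: green-app-image     # placeholder
  tolerations:
    - key: "color"
      operator: "Equal"
      value: "green"
      effect: "NoSchedule"
```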
Then, we attach labels to the colored Nodes for use with NodeAffinity:
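Again with placeholder node names:

```sh
kubectl label nodes node-green color=green
kubectl label nodes node-blue  color=blue
kubectl label nodes node-red   color=red
```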
Finally, we add Node Affinities to the spec of the colored Pods (green-pod.yaml, blue-pod.yaml, and red-pod.yaml) like so:
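For example, the green Pod’s affinity (again, blue and red are analogous):

```yaml
# added under spec: in green-pod.yaml (blue-pod.yaml and red-pod.yaml are analogous)
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: color
                operator: In
                values:
                  - green
```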
What results is a way to guarantee:
- the green Pod is scheduled on the green Node.
- the blue Pod is scheduled on the blue Node.
- the red Pod is scheduled on the red Node.
- the two uncolored Pods are not scheduled on any of the colored Nodes, but instead, are scheduled arbitrarily across the “Other” Nodes.