Kubernetes Resources Management - QoS, Quota, and LimitRange
July 25, 2017
Container technology offers a good level of “built in” security, thanks to the isolated environment that containers run in and the security features that are integral to the Docker and Kubernetes frameworks. This doesn’t mean, however, that a default installation of these components provides an adequate level of security in itself, or that you can absolve yourself from thinking about security.
There are a lot of moving parts in a container cluster setup. In this guide we examine the major components, focus on the security issues associated with each, and provide methods to address and mitigate these risks.
Before we delve into the individual components it is useful to review the high level issues and attack vectors that we are trying to protect against:
If a container becomes compromised by an attacker this shouldn’t allow the host to be compromised or allow them to access or interfere with the operation of other containers.
If an attacker manages to get non-root access to an account, they should not be able to leverage a vulnerability to escalate that account to one with a higher privilege level.
Vulnerable or poisoned images can increase the attack surface of a container and provide easy pickings for an attacker. Ensuring that the images used to create containers come from a reliable source, are free from malware and known vulnerabilities and have not been tampered with will help to protect against this.
If an externally facing service suffers a DoS attack, will it lead to the host, or even the entire cluster, being disrupted? As all containers share the kernel resources on a host node, without appropriate limits in place one container can easily monopolise enough resources to starve out other containers and cause the entire node to grind to a halt. Horizontal autoscalers can further exacerbate this problem by reacting to the perceived increase in load and scheduling more containers on other nodes, thereby affecting the entire cluster.
As all running containers on a node share the same kernel, in contrast to virtual machine architecture, exploits on the kernel are much more serious as they will affect the entire node.
If an attacker can get access to authentication credentials (such as a password, key or token) used to gain access to services (such as a database) then they can use those credentials to extract and possibly modify the data held by that service.
Application layer attacks come in many different varieties. Probably the most well known example of an application layer attack is exploiting an SQL injection vulnerability (where an attacker is able to manipulate SQL queries in ways not intended by the developer to either leak, modify or delete data).
More severe attacks can provide the attacker with shell access within the container, which will often compromise the container and any secrets that it uses to access other resources within the system (e.g. access credentials to a database). It can also provide a stepping stone to further attacks. In most cases the attack strategy is not directly related to the use of container technology, but these attacks are nonetheless important to consider because externally exposed applications usually represent the first line of defence that an attacker targets.
Application layer attacks can be difficult to guard against as they occur in-situ without requiring container security to be compromised and usually require only a single vulnerability to be exploited to enact a successful attack. These attacks are also difficult to detect because they follow the same access path that an application uses during normal operation.
Defences against application layer attacks are discussed in the Hardening Images against Attack section.
One of the overarching principles in the field of security (not just IT security) is the concept of layered security as opposed to just perimeter security. Applied to the current topic, this means combining multiple mitigating strategies to protect resources and data. If an attacker is able to compromise one layer of security, it shouldn’t cause the entire system to be compromised. Many of the following sections refer to a particular layer of security. It is important that all layers are addressed and made as secure as possible in order to achieve the best overall outcome, rather than relying on just one layer for protection.
The second primary principle is that of least privilege. Pods/containers and even users should run with the minimum set of access rights and resources that they need to perform their function. That way, if a container becomes compromised or a user goes rogue, the attacker is still limited as much as possible in the actions that they can perform.
Probably the most important step in securing your cluster is to ensure the quality and resilience of your container images. Externally facing containers are your first line of defence and running containers with vulnerabilities opens your environment up to the risk of being easily compromised. Failing to properly harden images can exacerbate the problem if a vulnerability is successfully exploited.
Official images on Docker Hub provide a good foundation as base images to build upon or to utilise as is. They are designed to ensure that security updates are applied in a timely manner, and all have been scanned for known vulnerabilities with Docker Cloud’s Security Scanning service and have their scan results available for easy examination. However, new vulnerabilities are regularly discovered and can also be easily introduced through internally developed code and its dependencies, so an ongoing process is required.
There is a huge range of available tools and services both proprietary and open source that can assist in this area. Many of these tools can also be readily integrated into existing CI/CD pipelines. The following paragraphs examine some of the most popular tool offerings and hardening methods.
Many image repositories, in particular commercial SaaS offerings, come bundled with security scanners. For example, Quay.io uses CoreOS’s Clair scanner, and Docker Hub's private repository hosting service offers Docker Security Scanning (a.k.a. Docker Nautilus) as a paid add-on to their Docker Trusted Registry. Red Hat has a built-in scanner called Atomic Scan for Atomic Registry (an enterprise container registry solution run on-premise or in the cloud). Other popular commercial offerings include the Tenable Nessus vulnerability scanner, Twistlock, Black Duck Hub and the Aqua container security platform.
If you prefer open source software, CoreOS's Clair is also available as an open source project (https://github.com/coreos/clair), and OpenVAS (http://www.openvas.org) is a framework of several services and tools offering a comprehensive and powerful vulnerability scanning and vulnerability management solution.
Vulnerability scanners work by matching against the public CVE (Common Vulnerabilities and Exposures) database as well as various other vulnerability databases, depending on the tool. Some tools match on the Docker image layer hash while others determine matches by analysing at the Linux package level. Tools that check for vulnerabilities specific to particular development languages are also useful. For example, the victims project is a Red Hat initiative that checks for known vulnerabilities in Java jar files. It’s important to remember, however, that these tools won’t usually identify vulnerabilities within an organization’s internally developed code. An organization should still conduct internal code reviews and auditing of its software.
Automated penetration testing tools can be used as part of the development pipeline to test for common attack vulnerabilities. The Open Web Application Security Project Zed Attack Proxy (OWASP ZAP for short) is a popular open source project that can be easily integrated into CI workflow tools such as Jenkins and provides penetration testing for web based applications.
It is good practice to minimise the software included in your Docker images to only what is required for the resulting container to do its job.
If your image includes binaries that have the setuid or setgid bits enabled consider whether your application uses these binaries (most likely not) and either disable the bits or remove the binaries from the image.
Try to drop user permissions as early as possible in the Dockerfile with the USER instruction or use gosu to start the application processes as a particular user so that the processes run without admin access within the container.
Container technology relies on namespacing to create an isolated environment in which a container is executed but not everything is namespaced. Kernel modules and the kernel itself as well as devices and various other resources (like system time) are common to all containers and hence access should be configured on a “least required privilege” basis. At a coarse level Linux capabilities provide a method of locking down what classes of system calls a container is able to request. The following section about running a hardened kernel describes several more fine grained methods of controlling access.
As of Docker v1.10 user namespaces are supported directly by the Docker daemon. This feature allows for the root user (uid 0) within a container to be mapped to a non root uid user outside the container. The option is enabled by providing the --userns-remap=
Depending on your choice of host OS there are several technologies available that can harden the linux kernel either via kernel patches or via linux security modules. Linux security modules add another level of security checks on the access rights of processes and users beyond that provided by standard file-level access control. The following paragraphs provide a brief overview of the major technologies, how they work and how they affect the container environment.
GRSEC (Grsecurity) provides a collection of security features to the Linux kernel, including address space protection, enhanced auditing and process control.
PaX, included as part of grsec, protects against buffer overflow attacks by marking program code in memory as non writable and data as non executable. Additionally it randomly arranges memory to mitigate against attacks that attempt to reroute code to existing procedures such as system calls.
PaX and Grsecurity run host wide so don't require any special configuration within Docker or Kubernetes.
Security Enhanced Linux (SELinux) provides an implementation of mandatory access control (MAC) as a Linux security module. SELinux supplements the standard Linux file system permissions (read, write and execute for owner, group and world). Access controls are enforced by types, which are labels applied to processes and objects. The system is quite complex and flexible, but at a basic level, if an SELinux policy forbids a process of type A from accessing an object of type B, that access will be disallowed regardless of the file permissions on the object or the access privileges of the user (even if they are root). Policies also control transitioning of user classes and processes via roles, which can be used to prevent users from escalating their privileges even if they would ordinarily be able to via sudo.
SELinux policy checks occur after the normal file permission checks. Kernels that have the module loaded can run in one of three modes: enforcing, in which policies are strictly enforced; permissive, in which policy violations are logged but the operation is allowed to proceed; or disabled.
In Docker/Kubernetes the default behaviour is for the container runtime to allocate a random SELinux context for each container unless overridden in the security context for a pod or container via the seLinuxOptions field. The default policy enforces that containers are able to read and execute files only from very limited locations on the host. It also assigns a unique MCS (multi-category security) category number to each container, intended to prevent containers from being able to access files or resources written by other containers in the event of a breakout.
In most cases the default policy should not interfere with operations occurring within containers though volumes shared with containers from the host will need to be pre-prepared to give containers appropriate access rights.
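As a minimal sketch of overriding the SELinux context through the seLinuxOptions field mentioned above: the pod name, image and MCS level here are hypothetical, and a suitable level depends on the policy loaded on your hosts.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo                      # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      # Illustrative MCS level; set at pod level it applies to every
      # container in the pod and to the volumes the pod mounts.
      level: "s0:c123,c456"
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
```

Pinning the level like this gives up the per-container random category, so it is mainly useful when several pods genuinely need to share labelled files.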
AppArmor is also a Linux security module implementation of mandatory access control (MAC). It is a simpler system than SELinux and hence doesn't provide the same granularity of protection. The AppArmor policy model is process centric. It works by applying profiles to processes restricting which privileges they have at the level of Linux capabilities, network access and file access.
For systems that are configured with AppArmor, Docker will automatically apply an AppArmor profile to each launched container. The default profile provides a level of protection against rogue containers attempting to access various system resources. The default profile behaviour can be modified on a per-container basis in Docker via the --security-opt="apparmor:PROFILE" option. As of Kubernetes v1.4, selecting an AppArmor profile on a per-pod, per-container or per-cluster basis is supported as a beta feature.
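In Kubernetes, the beta AppArmor support is driven by a pod annotation keyed on the container name. The sketch below assumes a container called app and uses the runtime's default profile; the pod name and image are placeholders. Later Kubernetes releases moved this setting into the security context, so check which form your cluster version expects.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: apparmor-demo                     # hypothetical name
  annotations:
    # Key format: container.apparmor.security.beta.kubernetes.io/<container name>
    # Value: runtime/default, or localhost/<profile> for a profile
    # that is already loaded on the node.
    container.apparmor.security.beta.kubernetes.io/app: runtime/default
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
```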
Seccomp (secure computing mode) is a computer security facility in the Linux kernel that can be used to restrict the types of system calls that a process can make thereby providing a type of sandboxed environment. Seccomp-bpf is an extension to seccomp that allows filtering of system calls using a configurable policy implemented using Berkeley Packet Filter rules. Docker uses seccomp-bpf
By default docker restricts around 44 of the more than 300 system calls that can be configured thereby providing a moderately protective environment while providing good application compatibility. Most restrictions are for items that are not namespaced by the kernel. The default configuration can be overridden on a per container basis by using the --security-opt seccomp=
Note there is considerable overlap between seccomp and linux capabilities though seccomp offers more granular control.
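To illustrate how the two mechanisms sit side by side in Kubernetes, here is a sketch that applies the container runtime's default seccomp profile and also trims Linux capabilities. It assumes a cluster recent enough to support the seccompProfile field (which postdates this article); the names and the single retained capability are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo                      # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault                # use the runtime's default syscall filter
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    securityContext:
      capabilities:
        drop: ["ALL"]                     # coarse-grained: drop every capability...
        add: ["NET_BIND_SERVICE"]         # ...then add back only what the app needs
```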
By default any container within a Kubernetes cluster has the ability to connect to any other pod that has opened a listening port. This is regardless of whether a port specification has been declared in the pod definition or whether a Service has been associated with it. This is true even if the containers reside in different Kubernetes namespaces.
Given that pod IP addresses are often allocated sequentially, they are prone to being easily guessed. This means that by default a compromised container running within your cluster can potentially discover open ports on other containers and attack them (even if they reside within a different namespace and on a different node). To address this situation a two-layer security model is recommended. First, the network layer should be configured to isolate communication via sub-netting into multiple domains, or to an even finer degree by creating point-to-point connection rules. Second, the application layer should be configured to require appropriate authentication from the connecting party (e.g. via a password, token or key).
In Kubernetes, pods become isolated when a NetworkPolicy object that selects them is created. Once any NetworkPolicy in a namespace selects a particular pod, that pod will reject any connection that is not allowed by some NetworkPolicy. Depending on the explicitness of the network policy, you can loosely group pods into virtual subnets based on label match or, if required, tightly lock down inter-pod communication to explicit pods on explicit ports only.
A note of caution, network policies are implemented by the Kubernetes network plugin, so you must be using a networking solution which supports NetworkPolicy (examples include Calico, Romana, Weave Net). Simply creating the resource without a controller to implement it will have no effect on inter-pod routability.
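A minimal sketch of the tight "explicit pods on explicit ports" case is shown below. The namespace, labels and port are hypothetical, and it assumes both a cluster that serves the networking.k8s.io/v1 API and a network plugin that enforces NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-backend        # hypothetical name
  namespace: demo               # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: db                   # the pods this policy selects (and so isolates)
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: backend          # only pods carrying this label may connect
    ports:
    - protocol: TCP
      port: 5432                # and only on this port
```

Once applied, pods labelled app: db accept connections only from app: backend pods in the same namespace on TCP 5432; all other inbound traffic to them is dropped.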
Security contexts define the privilege level that pods, containers and volumes run under. Different configuration options exist that allow coarse and fine grain control of permissions. Objects should be configured with the least privilege principle in mind.
In Kubernetes security contexts are defined either via a SecurityContext (if defined at the container level) or PodSecurityContext (if defined at the pod level). SecurityContext takes precedence if both have been defined. Options include:
Obvious care should be taken when instantiating containers that require the privileged context as this is essentially equivalent to running as root on the host. If you need to run privileged containers consider enabling the DenyEscalatingExec admission control plug-in to deny exec and attach commands to pods that run with escalated privileges (see the Kubernetes Admission Controllers section below).
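By way of contrast with a privileged container, here is a sketch of a pod configured with least privilege in mind, showing both the pod-level and container-level security contexts. The uid/gid values, names and image are illustrative assumptions, and the allowPrivilegeEscalation field needs a Kubernetes release newer than the one current when this article was written.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-demo                     # hypothetical name
spec:
  securityContext:                        # PodSecurityContext: defaults for all containers
    runAsNonRoot: true                    # refuse to start containers that would run as uid 0
    runAsUser: 10001                      # illustrative uid
    fsGroup: 10001                        # group ownership applied to mounted volumes
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    securityContext:                      # SecurityContext: wins where both levels set a field
      runAsUser: 10002                    # overrides the pod-level runAsUser
      privileged: false
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false     # blocks setuid-style privilege gains
      capabilities:
        drop: ["ALL"]
```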
As of Kubernetes v1.6 (feature in beta) roles can be defined within a Kubernetes namespace or cluster-wide using the Role and ClusterRole objects respectively. Roles define a list of resources (such as pods, secrets, deployments etc. or specific instances of these) and associated verbs (e.g. get, watch, list, create, update, patch ...) which are permitted by the role.
RoleBinding and ClusterRoleBinding objects can then be used to associate users with roles. Roles for Service Accounts (the accounts that pods are executed under) can also be implemented via this mechanism to control the level of API access that they can exercise. To use RBAC for Service Accounts, first define a Role and a RoleBinding whose subject is the service account name, and then specify that name in the serviceAccountName option within the pod specification.
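The service account workflow just described might look like the following sketch. All names and the namespace are hypothetical. The manifests use the rbac.authorization.k8s.io/v1 API group version; on clusters from the v1.6 beta era the version was v1beta1.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reader-sa           # hypothetical service account
  namespace: demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: demo
rules:
- apiGroups: [""]               # "" is the core API group
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: demo
subjects:
- kind: ServiceAccount
  name: pod-reader-sa
  namespace: demo
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Pod
metadata:
  name: reader
  namespace: demo
spec:
  serviceAccountName: pod-reader-sa       # the pod's API calls are authorised as this account
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
```

A process in this pod can read and watch pods in the demo namespace but cannot, for example, read secrets or delete anything.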
Administrative boundaries can then be created between resources using Kubernetes namespaces to partition resources into logically named groups. Once defined namespaces can then be segregated as required (eg. to hide resources between namespaces) using Policy objects.
Users who only need to have access to the inside of containers can do so via the
command and hence don’t need SSH access to the Kubernetes nodes.
By default all pods are created with unbounded CPU and memory limits. Resource limits reduce the risk of a defective or compromised container bringing the entire node to a halt (e.g. in a DoS attack scenario). Container-based resource limits can be specified for CPU and memory within the resources.limits section of a container definition.
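A container-level example, as a sketch: the quantities are arbitrary illustrations rather than recommendations, and the names and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limited-demo                      # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    resources:
      requests:                 # what the scheduler reserves for the container
        cpu: 250m
        memory: 128Mi
      limits:                   # hard ceiling enforced at run time
        cpu: 500m
        memory: 256Mi
```

A container that exceeds its CPU limit is throttled, while one that exceeds its memory limit is liable to be killed.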
Quota policies can also be attached to overall Kubernetes namespaces to constrain what can be instantiated within that namespace and the resource range for individual objects within that namespace. A ResourceQuota object defined within a Kubernetes namespace is used to limit the aggregate resource consumption for the entire namespace and a LimitRange object is used to constrain the minimum and maximum resources that individual pods can request.
Note that the ResourceQuota admission controller plug-in must be in use to enforce ResourceQuota objects, and the LimitRanger plug-in is required to enforce LimitRange objects (see the Kubernetes Admission Controllers section below).
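The two namespace-level objects can be sketched as follows. The namespace name and every quantity are hypothetical and would need tuning to your workloads.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-quota
  namespace: demo               # hypothetical namespace
spec:
  hard:                         # aggregate ceilings for the whole namespace
    pods: "20"
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: demo-limits
  namespace: demo
spec:
  limits:
  - type: Container
    min:                        # smallest values a container may ask for
      cpu: 50m
      memory: 32Mi
    max:                        # largest values a container may ask for
      cpu: "1"
      memory: 1Gi
    default:                    # limit applied when a container sets none
      cpu: 500m
      memory: 256Mi
    defaultRequest:             # request applied when a container sets none
      cpu: 100m
      memory: 64Mi
```

The default and defaultRequest entries matter in combination with the quota: once a quota covers CPU or memory, pods that specify no values would otherwise be rejected.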
Secrets are the credentials that applications use to access data stores or other resources within a system. Secrets may consist of passwords, tokens, keys or other credentials. Generally it’s a bad idea to bake secrets into container images. In fact, if possible, secrets shouldn’t be stored in source control systems at all, as potentially many people would be able to access them (and source control systems have history!). If you must store secrets in source control, then encrypt them before placing them into the repository or store them in a separate repository with strictly controlled access. Some desirable features of secrets management are:
As of v1.7 Kubernetes supports encrypted secrets as an alpha feature. A single EncryptionConfig object per cluster is used to specify the symmetric encryption algorithm used to encrypt and decrypt secrets as well as the encryption key for this process. Obviously this object should be appropriately protected with RBAC restrictions, as its contents provide the information required to decrypt all the secrets within the cluster. Once the object is configured, secrets will be automatically encrypted before being written to etcd.
To use a secret inside a pod, define a named volume of type secret, specifying the name of the secret as a parameter. This will expose the secret via a file on a temporary file system within the pod (which is dynamically updated in case of key rotation). Secrets can also be injected into pods via environment variables (using the env.valueFrom.secretKeyRef construct within a pod definition), though in this case the values of the secret will only be set at pod creation time.
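Both ways of consuming a secret are sketched below. The example assumes a secret named db-credentials with a key called password already exists in the pod's namespace; those names, the mount path and the image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-demo                       # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    env:
    - name: DB_PASSWORD                   # fixed when the container starts
      valueFrom:
        secretKeyRef:
          name: db-credentials            # assumed to exist already
          key: password
    volumeMounts:
    - name: creds
      mountPath: /etc/creds               # each key appears as a file, e.g. /etc/creds/password
      readOnly: true
  volumes:
  - name: creds
    secret:
      secretName: db-credentials
```

In practice you would normally pick one of the two mechanisms; the volume form is the one that picks up rotated values without recreating the pod.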
If the Kubernetes built-in support for secrets management doesn’t meet your requirements, there is a large choice of third party add-ons available that offer various alternative solutions. Popular options include Red Hat OpenShift, Rancher, Nomad, Aqua and HashiCorp Vault. Cloud providers also offer possible solutions.
Keep in mind that secrets will usually be readily readable in raw unencrypted form within a container that uses the secret so users with permission to
into a running pod or start a new pod would be able to view the secret. Systems and processes that attempt to hide secrets from administrators may therefore be a largely fruitless exercise.
Kubernetes admission control is a plug-in architecture that provides additional control over the Kubernetes API access process. After an API request is authenticated and authorized, the admission controllers intercept the request. Each admission control plug-in is run in sequence before a request is accepted into the cluster. If any of the plug-ins in the sequence rejects the request, the entire request is rejected. Admission controllers can also mutate resources as part of the request. The following is a list of key admission control plug-ins:
Log management refers to the process of capturing and storing the output of log files and information written to standard output by pods and other elements within a cluster. An audit trail, by contrast, refers to the chronological history of the sequence of activities initiated by individual users, administrators or other components of the system that have affected the system (i.e. who did what, and when).
For log management a common open source pattern is to use the Fluentd/Elasticsearch/Kibana trio. Fluentd is configured to collect the individual log files and write them into Elasticsearch, which provides storage and indexing features. Kibana is used as a UI frontend. Alternative third party solutions can be used. Cloud providers also provide solutions in this space (e.g. Google Stackdriver Logging).
Kubernetes audit is implemented as part of the API server (kube-apiserver). It logs all requests processed by the server. Information that is persisted includes a unique id used for matching response and request metadata, the requesting user, the resource being requested, etc.
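In more recent Kubernetes releases, what the API server records is controlled by an audit policy file passed to kube-apiserver via the --audit-policy-file flag. The following is a sketch under the assumption that your cluster serves the audit.k8s.io/v1 policy format (older clusters used earlier versions of it); the choice of rules is purely illustrative.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Rules are evaluated in order and the first match wins.
# Record only metadata for secrets so that secret payloads
# never end up in the audit log.
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# Record the request body for changes to pods.
- level: Request
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: ""
    resources: ["pods"]
# Catch-all: record who did what, and when, for everything else.
- level: Metadata
```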
Run-time monitoring refers to the automated monitoring of the Kubernetes cluster in real-time to identify unusual or suspicious activity and alert administrators. Various methods are used to determine what constitutes unusual or suspicious behaviour with a common method being to record a baseline of normal activity and then to compare this to run-time behaviour.
Sysdig Falco is an open source behavioural activity monitor designed to detect anomalous activity by continuously monitoring container, application, host, and network activity. It is also available as a managed SaaS offering.
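Falco expresses what counts as suspicious in YAML rule files. As a small sketch, the rule below flags an interactive shell starting inside any container; the rule name, description and priority are illustrative choices, and a real deployment would start from the rule set that ships with Falco.

```yaml
- rule: Shell spawned in a container      # illustrative rule name
  desc: Detect a bash process starting inside any container
  # container.id is "host" for processes running outside a container
  condition: container.id != host and proc.name = bash
  output: "Shell in container (user=%user.name container=%container.id cmd=%proc.cmdline)"
  priority: WARNING
```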
Third party products from AquaSec and Twistlock both provide facilities to profile normal application behaviour and then monitor a system in real-time to determine usage outside the learned profile with the option to automatically block execution. The Aqua product monitors usage at the linux command level while the Twistlock product profiles based on kernel system calls.
An additional option is to deploy deception or decoy services to identify attacks automatically and more easily in real-time. For example, an internally open port that isn’t used by the application can serve as a tripwire: any connection to it is likely to come from an attacker exploring the network, and so signals an intrusion.
The Center for Internet Security (CIS) produces benchmark documents that define industry best practices for securing IT systems, including auditing procedures to verify compliance. Docker security compliance is covered by the CIS Docker Community Edition Benchmark and Kubernetes compliance is covered in the CIS Kubernetes Benchmark. Both benchmarks provide two levels of compliance: Level 1 (settings that offer a clear security benefit and do not inhibit utility beyond acceptable means) and Level 2 (intended for environments where security is paramount, but which may inhibit the utility or performance of the system).
Docker compliance checking with the CIS standard can be automated via the open-source Docker Bench project which provides a script that can test containers and their hosts’ security configurations against the CIS Benchmark.
Hopefully this article has provided some insight into the main focus areas to be considered when securing a Docker/Kubernetes cluster.
In concluding here is a list of some additional miscellaneous security do’s and don’ts:
© 2017-2020 Darumatic Pty Ltd. All Rights Reserved.