I have contributed support for a no-new-privileges option to docker. This flag has already been included in runc and the Open Container Initiative spec.

The new flag supports, in Docker, a security feature that was added to the Linux kernel back in 2012 under the name no_new_privs.

The kernel feature works as follows:

  • A process can set the no_new_privs bit in the kernel that persists across fork, clone, and exec.
  • The no_new_privs bit ensures that the process or its children processes do not gain any additional privileges.
  • A process isn’t allowed to unset the no_new_privs bit once it is set.
  • Processes with no_new_privs are not allowed to change uid/gid or gain any other capabilities, even if the process executes setuid binaries or executables with file capability bits set.
  • no_new_privs also prevents Linux Security Modules (LSMs) like SELinux from transitioning to process labels that have access not allowed to the current process. This means an SELinux process is only allowed to transition to a process type with less privileges.

For more details see the kernel documentation.

Here is an example showcasing how it helps in Docker.

Create a setuid binary that displays the effective uid:

[$ dockerfiles]# cat testnnp.c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
        printf("Effective uid: %d\n", geteuid());
        return 0;
}
[$ dockerfiles]# make testnnp
cc     testnnp.c   -o testnnp

Now we will add the binary to a docker image:

[$ dockerfiles]# cat Dockerfile
FROM fedora:latest
ADD testnnp /root/testnnp
RUN chmod +s /root/testnnp
ENTRYPOINT /root/testnnp

[$ dockerfiles]# docker build -t testnnp .
Sending build context to Docker daemon 12.29 kB
Step 1 : FROM fedora:latest
 ---> 760a896a323f
Step 2 : ADD testnnp /root/testnnp
 ---> 6c700f277948
Removing intermediate container 0981144fe404
Step 3 : RUN chmod +s /root/testnnp
 ---> Running in c1215bfbe825
 ---> f1f07d05a691
Removing intermediate container c1215bfbe825
Step 4 : ENTRYPOINT /root/testnnp
 ---> Running in 5a4d324d54fa
 ---> 44f767c67e30
Removing intermediate container 5a4d324d54fa
Successfully built 44f767c67e30

Now we will create and run a container without no-new-privileges:

[$ dockerfiles]# docker run -it --rm --user=1000  testnnp
Effective uid: 0

This shows that even though you requested a non-privileged user (UID=1000) to run your container, that user would be able to become root by executing the setuid app on the container image.

Running with no-new-privileges prevents the uid transition while running a setuid binary:

[$ dockerfiles]# docker run -it --rm --user=1000 --security-opt=no-new-privileges testnnp
Effective uid: 1000

As you can see above the container process is still running as UID=1000, meaning that even if the image has dangerous code in it, we can still prevent the user from escalating privileges.

Turning on no_new_privs actually stopped the SELinux transition from the docker daemon type docker_t to the container type, svirt_lxc_net_t. The no_new_privs option only allows SELinux transitions from one type to another if the target type as a complete subset of the source type. Dan Walsh worked on the SELinux policy for docker to fix this. With the latest policy in Fedora 24, no_new_privs and SELinux work well together. We will be back-porting these fixes to RHEL when we ship docker support for no_new_privs.

If you want to allow users to run images as a non-privileged UID, in most cases you would want to prevent them from becoming root. no_new_privileges is a great tool for guaranteeing this.