Having run FreeBSD for nearly two decades, I eventually got frustrated with it for various reasons and moved my workloads from "a lot of virtual machines on a FreeBSD bhyve host" to "a lot of OCI containers on a Linux host".
I try to run containers in as restricted and isolated a fashion as possible. Ultimately, I'm trying to limit the possible damage caused by a compromise of a single service. I've tried to take a layered approach starting from the kernel upwards. We don't have secure systems and probably never will, so all of my choices are based around making systems harder to exploit, and around limiting damage if and when something is compromised.
I'm running containers on RHEL derivatives. There are multiple reasons for this:
The distributions have a very up-to-date targeted
SELinux
policy that assigns each container a unique label. This
severely limits what compromised container processes can actually do:
Even if a container is completely compromised, the SELinux label effectively
stops the container's processes from interacting with anything on the host
system, and with any other containers.
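You can see these per-container labels from the host. A quick sketch (container names are illustrative, and the MCS category pairs will differ on every run):

```shell
# Start two containers; each is assigned a unique SELinux MCS category pair.
podman run -d --name web1 docker.io/library/nginx
podman run -d --name web2 docker.io/library/nginx

# Inspect the labels on the container processes from the host.
# Each container's processes carry a different c<N>,c<M> pair, so the
# policy prevents them from touching each other's files and processes.
ps -eZ | grep container_t
```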
The distribution userland is compiled with a lot of exploit mitigation
(stack protection, bounds-checked libc function replacements,
hardened allocators, read-only executable sections, control flow
integrity, etc). This makes it harder for any vulnerable processes to
be exploited successfully. There are distributions that enable even
more hardening, but these often require building from source
(Hardened Gentoo, etc)
and nobody has time for that.
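If you want to check which of these mitigations a particular distribution binary was actually built with, a tool such as checksec can report them (assuming it's installed; the output format varies between versions):

```shell
# Report exploit mitigations (RELRO, stack canary, NX, PIE, etc.)
# compiled into a binary.
checksec --file=/usr/sbin/sshd
```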
The distribution kernel typically has most of the various kernel hardening features enabled. I'm told that Fedora is an early adopter of most of these features, and they percolate down into RHEL derivatives when it's determined that they aren't terrible. This makes it harder for kernel exploits to actually work.
The distribution kernel tends to exclude a lot of old drivers that don't have any place on modern "enterprise" hardware. Old drivers can be sources of vulnerabilities.
The distributions have an incredibly boring and slow release schedule with
no incompatible changes between major releases. This keeps things stable.
It also means that it's safe to run an automatic package updater
(such as dnf-automatic) to keep the host kernel up-to-date
with the latest security patches.
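dnf-automatic needs very little configuration. A minimal sketch of /etc/dnf/automatic.conf that applies security updates unattended might look like this (option names are from the dnf-automatic documentation):

```ini
[commands]
# Only consider security updates, and actually apply them
# rather than merely downloading or notifying.
upgrade_type = security
apply_updates = yes

[emitters]
# How to report what was done; stdio, email, and motd are available.
emit_via = motd
```

The timer is then enabled with `systemctl enable --now dnf-automatic.timer`.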
I tried running CoreOS for a few years, but I found it too cumbersome. A one-line configuration file change requires a complete re-deploy of the entire virtual machine, and this can take 5-10 minutes depending on the VPS host. I understand what they're trying to do, but I just found it too annoying to work with in practice. Maybe if it was a lot faster...
I like Debian, but the lack of kernel hardening and the lack of a good container SELinux policy means that I've rejected it as a container host.
My requirement is that I want a userland and kernel that's hard to exploit.
Containers don't have their own virtualized kernels in the manner that bhyve
and other hypervisor guests do, so it's important to pick a kernel that's going
to be less subject to any local kernel exploits that might be launched inside
containers. RHEL derivatives seem to be about the best option that doesn't
involve paying someone to build custom kernels as their full-time job.
I'm running containers with podman. The set of different
ways one can run a podman container
could fill a book.
I've landed on the following set of configuration practices:
Containers can either be rootful (meaning the host's root user is the
user that starts the container) or rootless (meaning a non-root user on
the host starts the container).
I ran entirely rootless containers for over a year, with each container being run by a different user on the host. This sounds like it should be optimal from a security perspective; if a container never had root privileges, then it presumably can't be the victim of privilege escalation.
Unfortunately, this turned out to be bad. Rootless containers are convenient for allowing users to run containers locally for development work and so on, but they're horrible for production use:
Linux container networking is already obscenely complicated,
error-prone, and fragile. Couple this with the requirement to be able to
set up container networking without root privileges, and it eventually
becomes necessary to more or less implement an entire fake virtual network
stack. The rootless networking implementations are, as far as I'm aware,
already onto their third "userland network stack". It's
exactly as complex and error-prone as it looks, and when it fails, it
does so silently. With rootful containers, you can simply use macvlan
networking, assign each container an IP address, and sidestep most of the
horrible complexity.
Convenient commands like podman stats and podman ps only show the
containers from the current user (by design). If all of the containers
are being executed by different users, then the utility of these commands
is basically lost. With rootful containers, you can run these commands
as root and see all of the containers.
Rootless containers actually end up somehow being less secure in many cases, because there are sandboxing features that require some privileges to be present in the first place in order to enable the features. In other words, a process needs to be slightly more privileged in order to be able to use sandboxing features to drop itself to a less privileged position than an ordinary unprivileged process. I'll cover this in a bit.
The actual storage for container images ends up being spread over
all of the home directories of all of the "users" that run the containers.
With rootful containers, image data ends up in /var/lib/containers,
and that's it.
I'm now exclusively running rootful containers and using a set of configuration options that result in containers that are, as far as I'm aware, as isolated from each other as possible, and as isolated from the host as possible.
This is the first blunt instrument in the box. When a container is run
with --read-only, it's not possible for processes inside the container to
write to anything in the filesystem except for in explicitly read/write mounted
volumes. This has the following benefits:
Containers become stateless (other than any explicit state in volumes that the admin provides).
Compromised processes can't overwrite system binaries inside the
container. A compromised process attempting to replace /usr/sbin/sshd
will find that it can't.
It breaks any poorly designed images that run massive piles of setup scripts on startup that write all over the filesystem. This means you learn quite quickly that an image is incompetently packaged, and you can replace it with a better one.
I couple this with exclusively using --rm so that every time a
container is started, it's guaranteed to be starting from a pristine
"has never run before" state.
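Putting the two flags together, a typical invocation might look like this (image name and volume path are illustrative):

```shell
# The root filesystem is read-only; only the explicitly mounted
# volume is writable. --rm discards the container state on exit,
# so every start is from a pristine image.
podman run --rm --read-only \
  --volume /data/containers/example01/data:/data:rw,Z \
  docker.io/library/nginx
```

Note that, by default, podman still mounts writable tmpfs filesystems on locations such as /tmp (controlled by the --read-only-tmpfs option), so well-behaved images still have somewhere ephemeral to write.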
On Linux, the root user is simply an ordinary user that happens to have
more capabilities
by default. When a podman container is run by the root user, podman
does automatically drop some capabilities that it knows containers should
never have.
When the --cap-drop all command-line option is used, all of the possible
capabilities are dropped, turning all of the processes inside the container
into utterly unprivileged processes. A user that appears to be root inside the
container won't be able to do any of the things the root user can usually do,
such as chowning files it doesn't own.
The Linux kernel has a no new privileges
flag that can be set for processes. This simply prevents any process from
gaining any new privileges via an execve() call. This is essentially
paranoia that I use in combination with --cap-drop all.
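Both options are plain command-line flags. A sketch (image name illustrative):

```shell
# Drop every capability, and prevent the container's processes from
# regaining privileges via setuid or file-capability binaries on execve().
podman run --rm \
  --cap-drop all \
  --security-opt no-new-privileges \
  docker.io/library/alpine \
  chown nobody /etc/hostname
# The chown fails: CAP_CHOWN is not available, even though the
# process appears to be root inside the container.
```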
By default on Linux, a user with ID 0 inside a container is equivalent to a user
with ID 0 outside of the container. This means that, should a process
running under user ID 0 somehow get access to parts of the host filesystem,
then it can read and write files that are owned by user ID 0 on the host.
It can also send signals to other processes that have user ID 0.
The --userns auto option eliminates this problem. Instead of the container's
processes running under the same ID of the user that started the container
(root), podman consults /etc/subuid and /etc/subgid and then picks
a random region of 65536 UID and GID values that are guaranteed not to
overlap with any on the host system, and guaranteed not to overlap with any
in any other container. /etc/subuid and /etc/subgid tend to look like this:
# cat /etc/subuid
containers:2147483647:2147483647
# cat /etc/subgid
containers:2147483647:2147483647
The first number is the starting UID/GID, and the second number is the number
of ID values that are permitted to be used. This essentially means that all
container processes will be assigned UID/GID values in the range
[2147483647, 4294967293].
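The top of the range is just the starting ID plus the count, minus one (the starting ID itself is the first usable value), which is easy to check:

```shell
start=2147483647
count=2147483647
# Last usable ID in the allocated range (inclusive).
echo $((start + count - 1))
# prints 4294967293
```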
For example, inside a container, nginx believes it is running as UID 0.
However, on the host system, we can see that it isn't:
# ps -eo uid:20,pid,cmd | grep nginx | grep master
          2147479553  319065 nginx: master process nginx -p /nginx/ -c /nginx/etc/nginx.conf -e /nginx/var/error.log
It is actually running as UID 2147479553. The processes are also placed into
an isolated user namespace.
When I mentioned earlier that there are sandboxing features that require privileges in order to get to a less privileged position, this is what I was referring to.
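In practice this is a single extra flag (image name illustrative):

```shell
# Allocate a unique, non-overlapping UID/GID range for this container
# from /etc/subuid and /etc/subgid.
podman run --rm --userns auto docker.io/library/nginx

# The size of the allocated range can also be requested explicitly.
podman run --rm --userns auto:size=65536 docker.io/library/nginx
```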
Z flag
As mentioned, each container on a RHEL derivative gets a unique label. The
SELinux policy is written such that volumes mounted in the container need to be
assigned the same label or the container processes won't be able to read or write
anything on the volume. The Z flag, when used on a volume, assigns a label
that's unique to the specific container being executed, whilst the z flag
assigns a label that can be used by all containers (if containers need to share
data via the filesystem). For example:
# podman run --volume /data/containers/example01/data:/data:rw,Z ...
If another container was compromised and somehow got access to
/data/containers/example01/data, it still wouldn't be able to actually read
or write anything there because the SELinux label wouldn't match.
If a volume doesn't have a Z or z flag, the container simply won't be
able to read or write it at all. It follows that, as the host system files and
processes won't have these labels, compromised containers have no ability to
read/write to locations on the host filesystem.
If containers need to share data via the filesystem, volumes can use the
idmap flag. This performs a bit of magic that allows files to have the same
UID/GID between different containers - something the --userns auto option
would normally specifically prevent. The idmap flag requires a compatible
filesystem (ZFS and XFS are good candidates) and also requires rootful
containers, and so is another instance of having to start out with extra
privileges in order to end up with fewer privileges.
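As with the other volume options, idmap is set per volume. A sketch, combining it with the z flag so multiple containers can share the mount (paths and image illustrative):

```shell
# Each container keeps its own --userns auto mapping, but files on the
# shared volume appear with consistent UIDs/GIDs in both containers.
podman run --rm --userns auto \
  --volume /data/shared:/shared:rw,z,idmap \
  docker.io/library/alpine ls -ln /shared
```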
I don't apply any extra seccomp policies
because it's too much of a maintenance burden. I do want to point out that
podman applies its own restrictive seccomp policy to all containers by
default.
For example, containers are forbidden from calling any of the system calls that
are used to load kernel modules. This means that, even if you didn't apply any of
the other security mechanisms that are listed here, a process running as
root inside a container still wouldn't be quite as capable as a process
running as root on the host.
It also means that, if a particular system call has an exploitable bug in it, a process inside the container might not necessarily be able to make the call and exploit the bug.
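If you do want to inspect or tighten the defaults, podman's stock policy is just a JSON allowlist on disk, and an alternative policy can be supplied per container (the default path below is where the RHEL-family containers packages typically install it):

```shell
# Inspect the default seccomp policy shipped with the containers
# packages (location may vary by distribution).
less /usr/share/containers/seccomp.json

# Run a container with a custom policy instead of the default.
podman run --rm \
  --security-opt seccomp=/etc/containers/my-policy.json \
  docker.io/library/alpine true
```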
I run each container in its own systemd slice which, essentially, means that each container gets its own cgroup.
I only use CPU and memory accounting, and I tend to limit container resources on a somewhat trial-and-error basis. I use ZFS filesystem quotas, in combination with giving each container its own filesystem, to prevent runaway container processes from filling up disks.
I tend to set memory limits such that containers run at roughly 80% of their prescribed limits. For example:
ID            NAME       CPU %  MEM USAGE / LIMIT  MEM %
97c719b8ea55  objects02  0.56%  406MB / 512MB      79.30%
Containers that exceed limits are simply killed and restarted (and I have monitoring to indicate OOM conditions, and to notify me if a container is repeatedly restarting).
Frustratingly, there seems to be some kind of bug in podman, at the time
of writing, when running rootful containers. Sometimes resource limits are
not applied. For safety, I specify the limits redundantly in the
systemd unit
file for each container:
[Service]
Slice=services-objects02.slice
Restart=on-failure
RestartSec=10s
TimeoutStopSec=70
TimeoutStartSec=300
CPUAccounting=true
CPUQuota=400%
MemoryAccounting=true
MemoryHigh=512000000
MemoryMax=512000000

[Container]
ContainerName=objects02
PodmanArgs=--cpus=4
PodmanArgs=--memory=512000000b
This ensures that, if podman fails to apply resource limits, the systemd
service will still have those limits applied to the slice in which the
container is placed, and so it will still be subject to those limits.
I'm using macvlan. This seems to be the "simplest" option in terms of
moving parts; there's no NAT involved in the host kernel. You simply assign
each container its own IP address, and the container appears as if it was
a (real or virtual) machine on the LAN.
All Linux container networking schemes seem to have one or more fatal flaws
that make them unusable for one workload or another, but the macvlan scheme
seems to have the fewest of those, though it is only available to rootful
containers.
All of the schemes that are available to rootless containers have nightmarish lists of caveats and horrible failure modes that can result in excruciating amounts of debugging just to work out why packets aren't getting from one place to another. In my opinion, life is too short to bother with them.
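Setting macvlan up with podman amounts to creating a network tied to a host interface and then assigning addresses (the interface name, subnet, and addresses here are illustrative):

```shell
# Create a macvlan network bridged onto the host's eth0.
podman network create --driver macvlan \
  --opt parent=eth0 \
  --subnet 192.168.1.0/24 \
  --gateway 192.168.1.1 \
  lan

# Give the container a fixed address on the LAN.
podman run --rm --network lan --ip 192.168.1.50 docker.io/library/nginx
```

One well-known macvlan caveat is that the host itself can't reach the container addresses over the parent interface without additional configuration, so plan monitoring and health checks accordingly.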
I run extremely restrictive default-deny firewall rules on all hosts, but I won't go into detail on that here because it's not specific to containers.
At the time of writing, I believe this is about as "secure" as is practically possible. The biggest weakness of the system is kernel-level exploits; if someone in a container successfully exploits the kernel locally, then none of the security features matter. This is, however, also true of virtual machines, except that an attacker in a compromised virtual machine has to go one step further and compromise the host kernel as well.