crush depth

How I Run Containers (2025 Edition)

Having run FreeBSD for nearly two decades, I eventually got frustrated with it for various reasons and moved my workloads from "a lot of virtual machines on a FreeBSD bhyve host" to "a lot of OCI containers on a Linux host".

I try to run containers in as restricted and isolated a fashion as possible. Ultimately, I'm trying to limit the possible damage caused by a compromise of a single service. I've tried to take a layered approach starting from the kernel upwards. We don't have secure systems and probably never will, so all of my choices are based around making systems harder to exploit, and around limiting damage if and when something is compromised.

Distribution

I'm running containers on RHEL derivatives. There are multiple reasons for this:

  • The distributions have a very up-to-date targeted SELinux policy that assigns each container a unique label. This severely limits what compromised container processes can actually do: Even if a container is completely compromised, the SELinux label effectively stops the container's processes from interacting with anything on the host system, and with any other containers.

  • The distribution userland is compiled with a lot of exploit mitigation (stack protection, bounds-checked libc function replacements, hardened allocators, read-only executable sections, control flow integrity, etc). This makes it harder for any vulnerable processes to be exploited successfully. There are distributions that enable even more hardening, but these often require building from source (Hardened Gentoo, etc) and nobody has time for that.

  • The distribution kernel typically has most of the various kernel hardening features enabled. I'm told that Fedora is an early adopter of most of these features, and they percolate down into RHEL derivatives when it's determined that they aren't terrible. This makes it harder for kernel exploits to actually work.

  • The distribution kernel tends to exclude a lot of old drivers that don't have any place on modern "enterprise" hardware. Old drivers can be sources of vulnerabilities.

  • The distributions have an incredibly boring and slow release schedule with no incompatible changes between major releases. This keeps things stable. It also means that it's safe to run an automatic package updater (such as dnf-automatic) to keep the host kernel up-to-date with the latest security patches.
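
As an aside on that last point: a minimal dnf-automatic setup is roughly the following. This is a sketch; the configuration keys and timer name are the ones shipped on current RHEL derivatives, so check them against your own release, and note that the two values shown are set by hand in /etc/dnf/automatic.conf rather than being the defaults.

# dnf install -y dnf-automatic

# grep -E 'upgrade_type|apply_updates' /etc/dnf/automatic.conf
upgrade_type = security
apply_updates = yes

# systemctl enable --now dnf-automatic.timer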

I tried running CoreOS for a few years, but I found it too cumbersome. A one-line configuration file change requires a complete re-deploy of the entire virtual machine, and this can take 5-10 minutes depending on the VPS host. I understand what they're trying to do, but I just found it too annoying to work with in practice. Maybe if it was a lot faster...

I like Debian, but the lack of kernel hardening and the lack of a good container SELinux policy means that I've rejected it as a container host.

My requirement is that I want a userland and kernel that's hard to exploit. Containers don't have their own virtualized kernels in the manner that bhyve and other hypervisor guests do, so it's important to pick a kernel that's going to be less subject to any local kernel exploits that might be launched inside containers. RHEL derivatives seem to be about the best option that doesn't involve paying someone to build custom kernels as their full-time job.

Podman

I'm running containers with podman. The set of different ways one can run a podman container could fill a book. I've landed on the following set of configuration practices:

Rootful Containers

Containers can either be rootful (meaning the host's root user is the user that starts the container) or rootless (meaning a non-root user on the host starts the container).

I ran entirely rootless containers for over a year, with each container being run by a different user on the host. This sounds like it should be optimal from a security perspective; if a container never had root privileges, then it presumably can't be the victim of privilege escalation.

Unfortunately, this turned out to be bad. Rootless containers are convenient for allowing users to run containers locally for development work and so on, but they're horrible for production use:

  • Linux container networking is already obscenely complicated, error-prone, and fragile. Couple this with the requirement to be able to set up container networking without root privileges, and it eventually becomes necessary to more or less implement an entire fake virtual network stack. They are, as far as I'm aware, already onto their third "userland network stack". It's exactly as complex and error-prone as it looks, and when it fails, it does so silently. With rootful containers, you can simply use macvlan networking, assign each container an IP address, and sidestep most of the horrible complexity.

  • Convenient commands like podman stats and podman ps only show the containers from the current user (by design). If all of the containers are being executed by different users, then the utility of these commands is basically lost. With rootful containers, you can run these commands as root and see all of the containers.

  • Rootless containers actually end up somehow being less secure in many cases, because there are sandboxing features that require some privileges to be present in the first place in order to enable the features. In other words, a process needs to be slightly more privileged in order to be able to use sandboxing features to drop itself to a less privileged position than an ordinary unprivileged process. I'll cover this in a bit.

  • The actual storage for container images ends up being spread over all of the home directories of all of the "users" that run the containers. With rootful containers, image data ends up in /var/lib/containers, and that's it.

I'm now exclusively running rootful containers and using a set of configuration options that result in containers that are, as far as I'm aware, as isolated from each other as possible, and as isolated from the host as possible.

read-only

This is the first blunt instrument in the box. When a container is run with --read-only, it's not possible for processes inside the container to write to anything in the filesystem except in explicitly mounted read/write volumes. This has the following benefits:

  • Containers become stateless (other than any explicit state in volumes that the admin provides).

  • Compromised processes can't overwrite system binaries inside the container. A compromised process attempting to replace /usr/sbin/sshd will find that it can't.

  • It breaks any poorly designed images that run massive piles of setup scripts on startup that write all over the filesystem. This means you learn quite quickly that an image is incompetently packaged, and you can replace it with a better one.

I couple this with exclusively using --rm so that every time a container is started, it's guaranteed to be starting from a pristine "has never run before" state.
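
Putting those two together, a typical invocation ends up looking something like this (the image name and volume path are illustrative):

# podman run --rm --read-only \
    --volume /data/containers/example01/data:/data:rw \
    example.com/example01:latest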

cap-drop all

On Linux, the root user is simply an ordinary user that happens to have more capabilities by default. When a podman container is run by the root user, podman does automatically drop some capabilities that it knows containers should never have.

When the --cap-drop all command-line option is used, all of the possible capabilities are dropped, turning all of the processes inside the container into utterly unprivileged processes. A user that appears to be root inside the container won't be able to do any of the things the root user can usually do, such as chowning files it doesn't own.
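
A quick way to see the effect (the alpine image is used here purely as a convenient illustration):

# podman run --rm --cap-drop all docker.io/library/alpine \
    chown nobody /etc/passwd

The chown fails with an "Operation not permitted" error: the process is nominally root, but CAP_CHOWN is gone along with everything else.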

security-opt=no-new-privileges

The Linux kernel has a no new privileges flag that can be set for processes. This simply prevents any process from gaining any new privileges via an execve() call. This is essentially paranoia that I use in combination with --cap-drop all.
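
For completeness, the flag looks like this on the command line (the rest of the invocation is elided):

# podman run --rm --cap-drop all \
    --security-opt no-new-privileges \
    ...

With the flag set, not even a setuid binary inside the container can hand elevated privileges to a process across an execve() call.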

userns auto

On Linux, a user with ID 0 inside a container is equivalent to a user with ID 0 outside of the container. This means that, should a process running under user ID 0 somehow get access to parts of the host filesystem, then it can read and write files that are owned by user ID 0 on the host. It can also send signals to other processes that have user ID 0.

The --userns auto option eliminates this problem. Instead of the container's processes running under the same ID as the user that started the container (root), podman consults /etc/subuid and /etc/subgid and then picks a random region of 65536 UID and GID values that are guaranteed not to overlap with any on the host system, and guaranteed not to overlap with any in any other container. /etc/subuid and /etc/subgid tend to look like this:

# cat /etc/subuid
containers:2147483647:2147483647

# cat /etc/subgid
containers:2147483647:2147483647

The first number is the starting UID/GID, and the second number is the number of ID values that are permitted to be used. This essentially means that all container processes will be assigned UID/GID values in the range [2147483647, 4294967293].

For example, inside a container, nginx believes it is running as UID 0. However, on the host system, we can see that it isn't:

# ps -eo uid:20,pid,cmd | grep nginx | grep master
  2147479553  319065 nginx: master process nginx -p /nginx/ -c /nginx/etc/nginx.conf -e /nginx/var/error.log

It is actually running as UID 2147479553. The processes are also placed into an isolated user namespace.
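
A convenient way to check the mapping for a running container is podman top, which can print the container-side and host-side users next to each other (the container name is illustrative):

# podman top objects02 user huser

For the nginx example above, that shows root on the container side and 2147479553 on the host side.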

When I mentioned earlier that there are sandboxing features that require privileges in order to get to a less privileged position, this is what I was referring to.

Volume Z flag

As mentioned, each container on a RHEL derivative gets a unique label. The SELinux policy is written such that volumes mounted in the container need to be assigned the same label or the container processes won't be able to read or write anything on the volume. The Z flag, when used on a volume, assigns a label that's unique to the specific container being executed, whilst the z flag assigns a label that can be used by all containers (if containers need to share data via the filesystem). For example:

# podman run --volume /data/containers/example01/data:/data:rw,Z ...

If another container was compromised and somehow got access to /data/containers/example01/data, it still wouldn't be able to actually read or write anything there because the SELinux label wouldn't match.

If a volume doesn't have a Z or z flag, the containers simply won't be able to read or write to it at all. It follows that, as the host system's files and processes won't have these labels, compromised containers have no ability to read or write locations on the host filesystem.
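
You can see the labels directly on a volume that was mounted with the Z flag; the category pair (c123,c456 below) is the per-container part, and the values here are made up for illustration:

# ls -dZ /data/containers/example01/data
system_u:object_r:container_file_t:s0:c123,c456 /data/containers/example01/data

The container's processes run with a container_t label carrying the same category pair, which is what makes the labels match.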

idmap

If containers need to share data via the filesystem, volumes can use the idmap flag. This performs a bit of magic that allows files to have the same UID/GID between different containers - something the --userns auto option would normally specifically prevent. The idmap flag requires a compatible filesystem (ZFS and XFS are good candidates) and also requires rootful containers, and so is another instance of having to start out with extra privileges in order to end up with fewer privileges.
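
On the command line it's just another volume option (the paths are illustrative):

# podman run --volume /data/shared/example:/shared:rw,idmap ...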

seccomp

I don't apply any extra seccomp policies because it's too much of a maintenance burden. I do want to point out that podman applies its own restrictive seccomp policy to all containers by default.

For example, containers are forbidden from calling any of the system calls that are used to load kernel modules. This means that, even if you didn't apply any of the other security mechanisms that are listed here, a process running as root inside a container still wouldn't be quite as capable as a process running as root on the host.

It also means that, if a particular system call has an exploitable bug in it, a process inside the container might not necessarily be able to make the call and exploit the bug.
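
The default profile is shipped as a plain JSON file by the containers-common packaging; on RHEL derivatives it's typically at /usr/share/containers/seccomp.json (check your distribution). The same mechanism lets you point an individual container at a different policy file if you ever do need one:

# podman run --security-opt seccomp=/path/to/custom-seccomp.json ...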

Resource Limits

I run each container in its own systemd slice which, essentially, means that each container gets its own cgroup.

I only use CPU and memory accounting, and I tend to limit container resources on a somewhat trial-and-error basis. I use ZFS filesystem quotas, in combination with giving each container its own filesystem, to prevent runaway container processes from filling up disks.
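
The ZFS side of that is just a dataset per container with a quota set on it (pool and dataset names are illustrative):

# zfs create storage/containers/objects02
# zfs set quota=10G storage/containers/objects02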

I tend to set memory limits such that containers run at roughly 80% of their prescribed limits. For example:

ID            NAME        CPU %       MEM USAGE / LIMIT  MEM %
97c719b8ea55  objects02   0.56%       406MB / 512MB      79.30%

Containers that exceed limits are simply killed and restarted (and I have monitoring to indicate OOM conditions, and to notify me if a container is repeatedly restarting).

Frustratingly, there seems to be some kind of bug in podman, at the time of writing, when running rootful containers. Sometimes resource limits are not applied. For safety, I specify the limits redundantly in the systemd unit file for each container:

[Service]
Slice=services-objects02.slice
Restart=on-failure
RestartSec=10s
TimeoutStopSec=70
TimeoutStartSec=300

CPUAccounting=true
CPUQuota=400%
MemoryAccounting=true
MemoryHigh=512000000
MemoryMax=512000000

[Container]
ContainerName=objects02
PodmanArgs=--cpus=4
PodmanArgs=--memory=512000000b

This ensures that, if podman fails to apply resource limits, the systemd service will still have those limits applied to the slice in which the container is placed, and so it will still be subject to those limits.

Networking

I'm using macvlan. This seems to be the "simplest" option in terms of moving parts; there's no NAT involved in the host kernel. You simply assign each container its own IP address, and the container appears as if it was a (real or virtual) machine on the LAN.
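
Setting it up amounts to creating a macvlan network once and then handing each container a static address on it (the parent interface, subnet, and addresses are illustrative):

# podman network create --driver macvlan \
    --opt parent=eno1 \
    --subnet 192.168.1.0/24 \
    --gateway 192.168.1.1 \
    lan

# podman run --network lan --ip 192.168.1.50 ...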

All Linux container networking schemes seem to have one or more fatal flaws that make them unusable for one workload or another, but the macvlan scheme seems to have the fewest of those, and is only available to rootful containers.

All of the schemes that are available to rootless containers have nightmarish lists of caveats and horrible failure modes that can result in excruciating amounts of debugging just to work out why packets aren't getting from one place to another. In my opinion, life is too short to bother with them.

I run extremely restrictive default-deny firewall rules on all hosts, but I won't go into detail on that here because it's not specific to containers.

Summary

  • I run containers that start out life as rootful containers, but drop all privileges to the point that they're less privileged than rootless containers.
  • I run containers as random unique UID and GID values in isolated user namespaces.
  • I run containers with all capabilities dropped, and with process flags that prevent any acquisitions of new capabilities.
  • I run containers on read-only filesystems.
  • I set strict resource limits to quickly kill runaway processes (whether malicious or accidental).
  • I prevent containers from having access to the host or to each other with SELinux labels.
  • I run containers on reasonably hardened kernels.

At the time of writing, I believe this is about as "secure" as is practically possible. The biggest weakness of the system is kernel-level exploits; if someone in a container successfully exploits the kernel locally, then none of the security features matter. This is, however, also true of virtual machines, except that an attacker in a compromised virtual machine has to go one step further and also compromise the host kernel.

Bosch serial numbers II

After a lot more back and forth with Bosch support, I finally got an actual answer.

The serial number on a Dremel is written on what they refer to as the "data plate" on the side of the tool. The number doesn't match what's in any of the QR codes on the tool, and it isn't labelled as being a serial number.

It's a 3-4 digit number, followed by a date of the form MM/YYYY.

Dremel 8260

If the first number was 179 and the date was 02/2024, then the serial number of the tool would be, apparently, 0179022024.

A Confusion Of Wireguards On Hetzner

I'm in the process of migrating io7m.com to Hetzner.

My previous setup on Vultr consisted of a single bastion host through which it was necessary to tunnel with ssh in order to log in to any of the multitude of servers sitting behind it.

This has downsides. Anything that wants to access private services on the hidden servers to perform monitoring, backups, etc, has to know about all the magic required to tunnel through the bastion host. When I first set up those servers, Wireguard didn't exist. However, clearly Wireguard now exists, and so I've incorporated it into the new arrangement on Hetzner instead of having to painfully tunnel SSH all over the place.

The goal is to essentially have the servers that sit behind the bastion host appear as if they were ordinary hosts on my office LAN, whilst also not being accessible from the world outside my LAN. This is achievable with Wireguard, but, unfortunately, it requires a slightly complicated setup along with some easy-to-forget bits of NAT and forwarding. In my case, at least one of the NAT rules is only necessary because of what I assume is an undocumented security feature of Hetzner private networks, and we'll get to that later. There's also a bit of MTU unpleasantness that isn't required if you aren't using PPPoE to access the internet.

I'm documenting the entire arrangement here so that I have at least a fighting chance of remembering how it works in six months' time.

This is the basic arrangement:

The two LANs

On the left, workstation01 and backup01 are two machines in my office LAN that want to be able to reach the otherwise unroutable server01 and server02 machines hosted on the Hetzner cloud. The router01 machine is, unsurprisingly, the router for my office LAN, and router02 is the Hetzner-hosted bastion host. All IP addresses here are fictional, but 70.0.0.1 is supposed to represent the real publicly-routable IPv4 address of router02, and the other addresses indicated in blue are the private IPv4 addresses of the other machines. My office LAN is assumed to cover 10.20.0.0/16, and the Hetzner private network covers 10.10.0.0/16.

This is also intended to be a strictly one-way arrangement. Machines on my office LAN should be able to connect to server01, server02, etc, but no machine in the Hetzner private network will ever have any reason to connect in to my office LAN.

Wireguard (router02)

The first step is to set up Wireguard in a "server" style configuration on router02. The configuration file for Wireguard on router02 looks something like this:

[Interface]
Address    = 10.10.0.3/32, 70.0.0.1
ListenPort = 51820
MTU        = 1410
PrivateKey = ...

[Peer]
# Router01
AllowedIPs   = 10.10.100.1/32
PreSharedKey = ...
PublicKey    = ...

This configuration specifies that Wireguard listens on UDP port 51820 on 10.10.0.3 and 70.0.0.1 for incoming peers. The first strange thing in this configuration is that we only define one peer and specify that its allowed IP address is 10.10.100.1. Packets with a source address of anything else will be discarded. This address doesn't match the address of anything on the office LAN, so why are we doing this? This brings us to...

Wireguard & NAT (router01)

We specify a single peer on router02 for reasons of routing and configuration sanity; we want to treat all connections coming from any of workstation01, backup01, or router01 as if they were coming straight from router01, and we want router01 to appear as just another ordinary host on the 10.10.0.0/16 Hetzner private network. Unsurprisingly, we achieve this by performing NAT on router01. router01 is running BSD with pf, and so a pf rule like this suffices:

nat on $nic_wg1 from $nic_dmz:network to any -> ($nic_wg1)

That should be pretty straightforward to read: Any packet with a source address matching the network of the NIC connected to the office LAN (10.20.0.0/16) will have its source address translated to the IP address of the Wireguard interface (10.10.100.1). The address 10.10.100.1 is deliberately chosen to be one that we know won't conflict with anything we have defined in the Hetzner private network.

We then use a Wireguard configuration on router01 that looks like this:

[Interface]
PrivateKey = ...

[Peer]
AllowedIPs          = 10.10.0.0/16
Endpoint            = router02.io7m.com:51820
PreSharedKey        = ...
PublicKey           = ...
PersistentKeepalive = 1

We specify a client configuration that will attempt to connect to router02.io7m.com. We specify that the allowed IPs are 10.10.0.0/16. In this context, the AllowedIPs directive indicates that any packets that are placed onto the Wireguard interface that don't have a destination address in this range will simply be discarded. Because router01 is a router, and therefore will forward packets it receives, if either of workstation01 or backup01 attempt to connect to, say, server01, their packets will ultimately be sent down the Wireguard network interface prior to having their source addresses translated to 10.10.100.1 by the pf NAT rule we configured above.

At this point, any of the machines on the office LAN can try to send packets to server01, server02, etc, but those packets won't actually get there. The reason for this is that router02 isn't currently configured to actually route anything, and so those packets will be dropped.

Routing On router02

The first necessary bit of configuration is to set a sysctl on router02 to enable IPv4 forwarding:

# sysctl net.ipv4.ip_forward=1
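
To make that persist across reboots, it can go into a sysctl.d fragment (the file name is arbitrary):

# echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/90-ip-forward.conf
# sysctl --system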

At this point, any of the machines on the office LAN can try to send packets to server01, server02, etc, but those packets still won't actually get there. This is an entirely Hetzner-specific problem and/or feature depending on your point of view, and it took quite a bit of debugging to work out what was happening. It wouldn't happen on a physical LAN, to my knowledge.

Hetzner IP Security(?)

It seems that there's some kind of anti-spoofing system at work in Hetzner private networks that the above setup will trip. Consider what happens here:

  1. workstation01 sends a packet A to server01.
  2. Packet A has source address 10.20.0.1 and destination 10.10.0.1.
  3. Packet A reaches router01 and undergoes NAT. The source address is transformed to 10.10.100.1.
  4. Packet A goes through the Wireguard tunnel and emerges on router02.
  5. router02 sees the destination is 10.10.0.1 and so dutifully forwards the packet to server01.
  6. server01 mysteriously never sees packet A.

What I believe is happening is that an anti-spoofing system running somewhere behind Hetzner's cloud network is (correctly) noting that packet A's source address of 10.10.100.1 doesn't correspond to anything on the network. There are no servers defined in Hetzner cloud that have that address as it's a fiction we've created with NAT to make our Wireguard configuration less complex. The anti-spoofing system is then silently dropping the packet as it's obviously malicious.

To correct this, we simply apply a second NAT on router02 such that we transform packets to appear to be coming directly from router02. The following nft ruleset suffices:

table ip filter {
	chain FORWARD {
		type filter hook forward priority filter; policy accept;
		iifname "wg0" oifname "eth1" accept
		iifname "eth1" oifname "wg0" accept
	}
}
table ip nat {
	chain POSTROUTING {
		type nat hook postrouting priority srcnat; policy accept;
		ip saddr 10.10.100.1 oifname "eth1" snat to 10.10.0.3
	}
}

We set up forwarding between the wg0 (Wireguard) and eth1 (Hetzner private network) NICs, and we specify a NAT rule so packets with a source address of 10.10.100.1 are transformed such that they get a source address of 10.10.0.3.

After enabling these rules, workstation01 can send packets to server01 and get responses as expected. From server01's perspective, it is receiving packets that originated at router02.

It's slightly amusing that we introduce a fictional address on router01 to simplify the configuration, and then undo that fiction on router02 in order to satisfy whatever security system Hetzner is using. This is what running out of IPv4 address space gets us.

MTU Strikes Back: Wireguard

Years ago, I had to deal with some aggravation around IPv6. My connection to my ISP is such that I'm using PPPoE which means I have to use an MTU of 1492 instead of the ubiquitous 1500 that everyone else is using:

# ifconfig tun1
tun1: flags=1008051<UP,POINTOPOINT,RUNNING,MULTICAST,LOWER_UP> metric 0 mtu 1492

I'm using Wireguard in various places to link multiple networks. Wireguard packets have 80 bytes of overhead, so the virtual interfaces it creates have an MTU of 1420 by default.

You can probably guess where this is going.

On a connection such as mine, a packet of size 1420 plus the 80 bytes of overhead is 1500. This means we'll run into all of the same problems that occurred with IPv6 back in 2017, with all of the same symptoms.

The solution? Set an MTU of 1410 on the Wireguard interfaces.

On FreeBSD:

# ifconfig wg0 mtu 1410

On Linux, you can either use an MTU=1410 directive in the Wireguard configuration file, or:

# ip link set mtu 1410 up dev wg0

Bosch serial numbers

Like any good obsessive, I keep an inventory of computer parts, tools, and so on. The inventory keeps track of serial numbers so that I can answer questions like "Which machine did that PSU end up being installed into?" and "When those addicts from that failing machine learning company broke in, which GPUs did they take?".

An absolutely true and honest depiction of a GPU heist

I recently bought a Dremel 8260 to do some guitar body routing tasks, and other miscellaneous bits of cutting and drilling. I've got no complaints with it, although I ran into a pretty immediate problem when trying to check it into the inventory.

Dremel are a division of Bosch, and it seems like Bosch have gotten into the habit of not putting serial numbers onto tools, or at least not doing it in any obvious way.

I have a Bosch GSB 18V-45 here and there's no serial number printed anywhere on the case. The same goes for the Dremel 8260.

There is, however, a QR code on both. Scanning the QR code on the drill yields the following redacted text:

240516_80103601JK3300_xxxxxxxxxxxxx
                      ^^^^^^^^^^^^^

The underscore characters are actually group separator characters (ASCII GS, U+001D), which wouldn't be printable in most browsers, so I've replaced them with underscores here. The xxxxxxxxxxxxx string I've redacted because I believe it actually is a serial number. The 240516 string looks like a date, but it doesn't match up with anything date related on the tool itself (the tool is from 2023). It presumably has some internal meaning to Bosch. If you dump the string 80103601JK3300 into any search engine, the first result that comes up is service information for the drill, so that number is presumably a model number.

The Dremel has two QR codes. One QR code simply restates the model number, but the other QR code hidden inside the battery compartment yields:

240511_8010F013826077_xxxxxxxxxxxxx
                      ^^^^^^^^^^^^^

Again, dumping 8010F013826077 into a search engine yields service information for the Dremel 8260, so it is probably a model number. The 240511 string means something to someone somewhere at Bosch. The xxxxxxxxxxxxx string might be a serial number.

Neither of the xxxxxxxxxxxxx values are actually printed anywhere on the tools, and there's no documentation whatsoever online that I could find about how to locate serial numbers on Bosch tools. I'm not up for buying another instance of either tool just so that I can compare the xxxxxxxxxxxxx values. Searching for those values online yields nothing, which in itself is evidence that they might just be unique-and-otherwise-meaningless serial numbers.

I emailed Bosch to ask them where I can find the serial numbers on my tools. I got a supremely confusing message back saying that Bosch's legal team might be inspecting the message (?), followed a few days later by a message from a support team suggesting that I register the tool online. The message seemed to indicate that they hadn't read my initial message at all, or at least hadn't understood it. I did register the tools online and, predictably, this didn't result in the serial numbers being magically revealed (there's no way it could have; I wasn't even required to submit any kind of proof of purchase or scan anything on the tools, so presumably registration is just a way to get a bit of data out of me for marketing purposes).

I'm not sure what's so difficult about putting an unambiguous serial number somewhere visible on the case. Computer parts manufacturers seem to manage to do it just fine. Who benefits from keeping things obscure like this?