crush depth

That bin directory. No, not that one, the other one.

For about a week, I've been having DNS resolution issues on one server. The machine runs a tinydns server for publishing internal domain names, and it seemed that after roughly 24 hours of operation, the server would simply stop responding to DNS requests. After exhausting all of the obvious solutions, I restarted the jail that housed the daemon and everything mysteriously started working.

I checked the logs and suddenly realized that the newest message in the log was about a week old. I checked the process list for s6-log instances and noticed that no, there were no s6-log instances running in the jail. I checked /service/tinydns/log/run, which looked fine. I tried executing /service/tinydns/log/run and saw:

exec: /usr/local/sbin/s6-setuidgid: not found

OK. So...

# which s6-setuidgid
/usr/local/bin/s6-setuidgid

Apparently, at some point, the s6 binaries were moved from /usr/local/sbin to /usr/local/bin. This is not something I did! There was no indication of this in any recent port change entry, or in the s6 change log.

The "outage" was caused by the way that logging is handled. The tinydns binary logs to stderr instead of using something like syslog, with the messages being piped into a logging process in the manner of traditional UNIX pipes. This is normally a good thing, because syslog implementations haven't traditionally been very reliable. The problem occurs when the process that's supposed to be reading from the other end of the pipe stops reading (or, as in this case, never starts). Sooner or later, any attempt by the writing process to write to the pipe will block indefinitely. It doesn't happen immediately because the operating system kernel buffers a limited amount of data in the pipe, so the writer carries on happily until that buffer fills up. In a simple single-threaded design like that used by tinydns, this essentially means the process stops working: the write operation never completes and no other work can be performed in the meantime.
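The failure mode is easy to reproduce outside of tinydns. The following is a minimal sketch, not anything taken from tinydns or s6: the function names fill_pipe and drain are made up for illustration, and it assumes a UNIX-like system with Python 3.5 or newer. It writes into a pipe that nobody reads, with the write end in non-blocking mode so that the point at which a normal blocking writer would hang shows up as an exception instead. The number of bytes accepted depends on the operating system's pipe buffer size.

```python
import os


def fill_pipe(chunk_size=4096):
    """Write into an unread pipe until the kernel refuses more data.

    Returns (bytes_accepted, read_fd, write_fd).
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode turns "would block forever" into an exception,
    # so the demo can observe the condition instead of hanging.
    os.set_blocking(write_fd, False)
    accepted = 0
    chunk = b"x" * chunk_size
    while True:
        try:
            accepted += os.write(write_fd, chunk)
        except BlockingIOError:
            # A blocking, single-threaded writer would be stuck
            # right here until someone read from the pipe.
            return accepted, read_fd, write_fd


def drain(read_fd, count):
    """Read exactly `count` bytes from the pipe, as a live logger would."""
    remaining = count
    while remaining > 0:
        remaining -= len(os.read(read_fd, remaining))


if __name__ == "__main__":
    accepted, read_fd, write_fd = fill_pipe()
    print(f"pipe accepted {accepted} bytes before a write would block")
    drain(read_fd, accepted)
    # With the buffer emptied, writes succeed again.
    print("bytes written after draining:", os.write(write_fd, b"x"))
    os.close(read_fd)
    os.close(write_fd)
```

This would explain the roughly 24 hour delay: a quiet daemon takes a long time to fill the buffer with log lines, and then stops dead on the next one.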

I'm currently going through all of the service entries to see if anything else has quietly broken. Perhaps I need process supervision for my process supervision.
