crush depth

The Rube Goldberg Machine Of Metrics

The traditional starting point for monitoring is to install Prometheus and have it scrape metrics from the machines being monitored. Each monitored machine runs a node-exporter service that exposes metrics over HTTP, and the central Prometheus server scrapes each node-exporter every ten seconds or so.

The Prometheus server maintains its own persistent internal time series database, and can produce pretty graphs of metrics on demand. Additionally, the Prometheus server can be given alerting rules such that it will generate alert messages when a given rule evaluates to true for a specified time period. An example of this might be a rule such as node_memory_available_percent < 10.0. Intuitively, this can be read as "generate an alert if the amount of available memory on a node dips below ten percent for a configured period". When an alert rule is triggered, the Prometheus server sends an alert to an alert manager server. The alert manager is responsible for routing alert messages to the people and places that should receive them. For example, alert messages relating to the node_memory_available_percent signal might be configured to be routed to the email address of the hardware administration team.
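Written out as a Prometheus alerting rule file, the example above might look something like this (a sketch only; the rule name, duration, and labels are assumptions, and the metric name is taken from the example in the text):

```yaml
# alert-rules.yml: an example Prometheus alerting rule file.
groups:
  - name: node-memory
    rules:
      - alert: NodeMemoryLow
        expr: node_memory_available_percent < 10.0
        for: 5m                 # the condition must hold for five minutes before firing
        labels:
          severity: warning     # routing in the alert manager can match on this label
        annotations:
          summary: "Available memory on {{ $labels.instance }} is below ten percent"
```

The `for` duration is what provides the "for a configured period" behaviour: the rule has to evaluate to true continuously for that long before an alert is actually sent to the alert manager.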

The whole setup looks like this:

Metrics 0

This works fine, but at some point you come across an application that uses OpenTelemetry, which offers far richer monitoring signals: logs, metrics (a superset of Prometheus metrics), and traces.

Applications send OpenTelemetry signals to a collector, which then batches and forwards the signals to various servers for storage and analysis. In practice, because they are free, open source, and can be self-hosted, those servers will usually be the various offerings from Grafana Labs.

Specifically, metrics will be sent to Grafana Mimir, logs will be sent to Grafana Loki, and traces will be sent to Grafana Tempo. On top of this, a Grafana server is run to provide a dashboard and to provide alerting.
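A minimal collector configuration for this topology might look something like the following. Treat it as a sketch: the hostnames and ports are assumptions, and the choice of exporters depends on versions (Mimir ingests via Prometheus remote write, while recent Loki and Tempo versions accept OTLP directly).

```yaml
# otel-collector.yml: receive OTLP from applications, fan out to the Grafana stack.
receivers:
  otlp:
    protocols:
      grpc:                             # applications send OTLP over gRPC (port 4317)

processors:
  batch: {}                             # batch signals before forwarding

exporters:
  prometheusremotewrite:                # metrics -> Mimir
    endpoint: http://mimir:8080/api/v1/push
  otlphttp/loki:                        # logs -> Loki's OTLP ingestion endpoint
    endpoint: http://loki:3100/otlp
  otlp/tempo:                           # traces -> Tempo
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```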

The monitoring setup then looks like this:

Metrics 1

This is a problem, because now the metrics signals and the alerting rules are split between the Prometheus server and the Grafana server. Some systems are producing Prometheus metrics that go to the Prometheus server, and some systems are producing OpenTelemetry signals that end up in the Grafana server(s).

It turns out that Grafana Mimir provides an optional implementation of the Prometheus AlertManager, and Prometheus provides a remote write option that can push metrics directly to Grafana Mimir. This means that you can remove the Prometheus AlertManager entirely, put all the alerting rules into Grafana Mimir, and set up one dashboard in the Grafana server. At this point, the Prometheus server essentially just acts as a dumb server that scrapes metrics and sends them directly to Mimir for storage and analysis.
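The "dumb scraper" role amounts to a Prometheus configuration along these lines (the hostnames and the Mimir port are assumptions; the `/api/v1/push` path is Mimir's remote-write ingestion endpoint):

```yaml
# prometheus.yml (fragment): scrape the node-exporters, push everything to Mimir.
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['host-a:9100', 'host-b:9100']   # node-exporter endpoints

remote_write:
  - url: http://mimir:8080/api/v1/push            # Mimir ingests Prometheus remote write
```

With this in place, no alerting rules or alert manager configuration remain on the Prometheus side at all.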

The monitoring setup ends up looking like this:

Metrics 2

In practice, everything inside the dotted red line can run on the same physical machine. At some point, the volume of metrics will require each of the Grafana components to be moved to its own separate hardware.

Pork Vultures

I was using Vultr DNS to serve DNS records for my various VPS instances up until the point where I wrote and deployed certusine to manage ACME-issued certificates using DNS challenges.

The issue with Vultr DNS was that, at the time, using their API required an API key. Normally, this wouldn't be a problem, except that there was then no way to restrict the capabilities of an API key.

Consider this: You run a service on a Vultr-hosted VPS that has access to the API key. That VPS is compromised, somehow. The person compromising the VPS now has the capability to destroy all of your VPS instances with just a few API calls. Worse, the person compromising the VPS has the ability to start up and use any number of VPS instances (up to your account's configured spending limit).

Obviously, this was a level of risk that was unacceptable to me. I've been using Gandi as a domain registrar for a very long time now, and it turned out that they offered free DNS hosting along with an API to manipulate records. I implemented support for Gandi DNS in certusine and have been using it up until now.

Unfortunately, about a week ago, API calls started returning 500 error codes. I contacted Gandi's technical support and they still haven't bothered to respond.

It turns out that Gandi was bought by another company in early 2023; the new owners immediately jacked up prices and have apparently let the technical side of the operation fall apart.

After trying and failing to find any other good alternative for DNS hosting, I accidentally stumbled across an article about Vultr sub-accounts. It turns out that Vultr have added a mechanism to create new users within a given account. The permissions of those users can be restricted in a fairly fine-grained way, and those users can have their own API keys. Essentially, I can create a user that only has access to DNS records and no other account functionality, issue them an API key, and then use Vultr's DNS service without the risk of a leaked API key being utterly catastrophic.
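As a sketch of what the restricted key ends up being used for: the ACME DNS-01 flow only ever needs to create and delete TXT records. The following builds (but does not send) the record-creation request against Vultr's v2 API. The endpoint path and body fields reflect my understanding of Vultr's documented record-creation call, but treat the details as assumptions; the key, domain, and token values are obviously placeholders.

```python
# Build the Vultr v2 API request that creates the TXT record used
# for an ACME DNS-01 challenge. Nothing is sent over the network here;
# this just shows the shape of the call the restricted key authorizes.

def build_acme_txt_record_request(api_key: str, domain: str, token_value: str) -> dict:
    """Return the URL, headers, and JSON body for creating the challenge record."""
    return {
        "url": f"https://api.vultr.com/v2/domains/{domain}/records",
        "headers": {
            "Authorization": f"Bearer {api_key}",   # the sub-account's DNS-only key
            "Content-Type": "application/json",
        },
        "body": {
            "name": "_acme-challenge",  # relative to the domain
            "type": "TXT",
            "data": token_value,        # the ACME challenge value
            "ttl": 300,
        },
    }

request = build_acme_txt_record_request("sub-account-key", "example.com", "challenge-token")
print(request["url"])
```

If the key leaks, the blast radius is limited to DNS records: the sub-account has no capability to create, destroy, or access VPS instances.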

I'll be moving all of my DNS records back over to Vultr DNS, releasing a new certusine version with improved Vultr support, and switching to Porkbun as my domain registrar.

Being Organized

Following on from a previous episode, I've migrated any actively-maintained software projects to the new organization, and archived a lot of projects that weren't maintained and weren't relevant.

In the process, I set up automated snapshot and release deployment to Maven Central from GitHub Actions. I used to have these automated deployments from a server I ran locally, but having it happen from GitHub Actions means more infrastructure that I don't have to maintain.

The releases are all signed with a new PGP key:

| Fingerprint                                       | Comment                      |
|---------------------------------------------------|------------------------------|
| E288 E54A 25D3 F5A9 BF68 6BB4 E64D 38C0 2097 0A85 | 2024 github-ci-maven-rsa-key |

I'm intending for the new organization to hold only those projects that have reached some level of stability and are past the experimental stage.

Pennies Saved On Construction

In a previous episode, I moved the contents of a server into a new SC-316 case.

While the case seems to be of generally good quality, mine has turned out to have a serious fault. The case's drive bays are implemented in terms of four mounted PCB backplanes, each with its own power connector and SAS connector. Each backplane is individually powered via a standard, completely awful, AMP Mate-n-Lok 1-480424-0 power connector.

Anyone who's dealt with these connectors knows how bad they are. They're next to impossible to plug in, and next to impossible to unplug once plugged in. To quote that Wikipedia page:

Despite its widespread adoption, the connector does have problems. It is difficult to remove because it is held in place by friction instead of a latch, and some poorly constructed connectors may have one or more pins detach from the connector during mating or de-mating. There is also a tendency for the loosely inserted pins on the male connector to skew out of alignment. The female sockets can spread, making the connection imperfect and subject to arcing. Standard practice is to check for any sign of blackening or browning on the white plastic shell, which would indicate the need to replace the arcing connector. In extreme cases the whole connector can melt due to the heat from arcing.

To summarize:

  • The connector is difficult to plug in due to a friction fit.
  • The connector is difficult to unplug due to a friction fit.
  • The pins can be bent and damaged should the user do something as completely unforgivable as to plug in the connector.
  • The pins of the connector are subject to arcing, risking a fire.

And yet, somehow, we're fitting this ridiculous 1960s relic onto modern power supplies, despite far better connectors being available such as, frankly, any of the 15A and above rated JST connectors.

Back to the story. I connected the power supply to the top two backplanes, leaving the bottom two unpowered. I didn't have SAS connections for those backplanes anyway; I really only needed a 3U case because it had to hold a full-height PCI card, and there wasn't a reasonably priced 3U case that didn't also come with 16 drive bays.

I placed some disks in the second row of bays. The lights came on, the disks were accessible, no issues.

I placed a disk in the top row of bays... Nothing. I tried a disk in a different bay on the same row... Nothing.

I shut the machine down, dragged the machine out of the rack, opened it, and took a look at the backplanes. Nothing appeared to be wrong until I wiggled the power connector for the top backplane to check that it was correctly plugged in.

Uh oh.

The entire power connector came off in my hand.


It turns out that the power connectors are not through-hole soldered connectors. They're rather flimsy plastic connectors that are screwed onto the board. You can see from the image that there are two thin plastic ears on the connector, and the ears have simply ripped off. Real AMP Mate-n-Lok connectors are typically made from nylon; these connectors do not feel like nylon, and I don't think nylon would have torn this easily.

I inspected the rest of the connectors and discovered that the bottom connector (that I'd not even touched) was also cracked on one side:


This effectively leaves only two of four sets of drive bays functional.

I'm waiting to hear back from ServerCase support. This will be the third time I've had to use their support, and they have been excellent every time, but I'm not looking forward to the possibility of having to take everything out of the case and send the case back.

My ideal outcome for this is that they send me new backplanes with real connectors on them, properly soldered onto the board. I realize that's unlikely. I've already been looking at replacement connectors that I can solder onto the boards myself. There are some connectors from TE Connectivity that look like they would fit.

Becoming Organized

I have an idiotic number of GitHub repositories. There are, at the time of writing, 536 repositories in my account, of which I'm the "source" for 454. At last count, I'm maintaining 159 of my own projects, plus less-than-a-hundred-but-more-than-ten open source and closed source projects for third parties.

When I registered on GitHub back in something like 2011, there was no concept of an "organization". People had personal accounts and that was that.

GitHub exposes many configuration methods that can be applied on a per-organization basis. For example, if a CI process requires access to secrets, those secrets can be configured in a single place and shared across all builds in an organization. If the same CI process is used in a personal account, the secrets it uses have to be configured in every repository that uses the CI process. This is obviously unmaintainable.
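Concretely, a workflow references a secret by name, and with an organization the value is defined once at the organization level rather than in each repository (the workflow layout and secret name here are hypothetical):

```yaml
# .github/workflows/deploy.yml (fragment): secrets.MAVEN_CENTRAL_TOKEN resolves
# from organization-level secrets, so no per-repository configuration is needed.
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Maven Central
        env:
          MAVEN_CENTRAL_TOKEN: ${{ secrets.MAVEN_CENTRAL_TOKEN }}  # hypothetical org secret
        run: ./deploy.sh
```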

I've decided to finally set up an organization, and I'll be transferring actively maintained ongoing projects there.