003 / Case study · AgriTech · Linux · 2025

Northforge: 1,200 ranch IoT nodes, NixOS-bootstrapped

We built HerdNet — a NixOS-based fleet management platform for Northforge's 1,200 ranch sensor nodes across Montana and Wyoming. Sub-second OTA updates, atomic rollbacks, and zero-touch recovery replaced a fragile Ansible workflow that was costing 20 hours a week in manual intervention.

0.8s OTA update deploy
20→2 hrs/wk fleet maintenance
99.7% fleet uptime

The problem

Northforge builds environmental monitoring hardware for cattle ranches — soil moisture, water tank levels, ambient temperature, and GPS-based herd tracking. Their 1,200 sensor nodes run embedded Linux on ARM boards, deployed across ranches in Montana and Wyoming, most of them on cell-edge connectivity with no physical access for months at a time.

Their fleet management was Ansible. Every configuration change — a new data endpoint, an updated TLS certificate, a sensor calibration tweak — required an Ansible playbook run across the entire fleet. The problems were fundamental: Ansible is imperative and stateful. If a run was interrupted by a connectivity drop (which happened constantly on cell-edge nodes), the node was left in a partial state that required manual SSH intervention to recover. Rollbacks didn't exist — if a bad config shipped, infrastructure lead Sam Erikson's team had to write a new playbook to undo it and hope it reached every affected node.

They were losing 10–15 nodes per week to failed updates. Each recovery took 30–90 minutes of SSH debugging if the node was reachable at all. Some nodes in remote pastures went offline for weeks before anyone noticed. Sam’s three-person infrastructure team was spending 20 hours a week on fleet babysitting instead of building features.

The breaking point was a TLS certificate rotation that bricked 180 nodes simultaneously. It took two weeks and a road trip to recover them all. Northforge’s CEO told Sam to find a better way or hire two more ops engineers.

What we did

Week one we designed the NixOS migration path. Each sensor node got a minimal NixOS configuration defined in a Nix flake: the sensor daemon, WireGuard tunnel config, data upload endpoint, and a watchdog service. We built a custom update channel — a lightweight HTTP server that hosts the current fleet configuration as a Nix store path. Nodes poll the channel every 60 seconds; when a new generation is available, they download the closure diff (typically 2–8 MB), switch to the new generation atomically, and confirm health. If the health check fails, the node rolls back to the previous generation automatically. No partial states. No manual intervention.
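As a rough illustration, the per-node update client can be expressed as a systemd timer and oneshot service inside the node's NixOS configuration. The sketch below is minimal and not the production module: the channel URL and the herdnet-healthcheck command are placeholder names, and it assumes the channel host is also configured as a Nix binary cache so that nix-store --realise pulls only the closure diff.

```nix
# Sketch only: the channel URL and herdnet-healthcheck are placeholders.
{ config, pkgs, ... }:
{
  # Watchdog service: reboot if userspace wedges.
  systemd.watchdog.runtimeTime = "30s";

  # Poll the update channel every 60 seconds.
  systemd.timers.herdnet-update = {
    wantedBy = [ "timers.target" ];
    timerConfig = { OnBootSec = "60s"; OnUnitActiveSec = "60s"; };
  };

  systemd.services.herdnet-update = {
    path = [ pkgs.coreutils pkgs.curl pkgs.nix ];
    serviceConfig.Type = "oneshot";
    script = ''
      current=$(readlink -f /run/current-system)
      # The channel serves the store path of the current fleet generation.
      target=$(curl -fsSL https://channel.example.com/fleet/current)
      if [ "$target" = "$current" ]; then exit 0; fi

      nix-store --realise "$target"                  # fetches only the closure diff
      nix-env --profile /nix/var/nix/profiles/system --set "$target"
      "$target/bin/switch-to-configuration" switch   # atomic generation switch

      # Automatic rollback if the node is unhealthy after the switch.
      if ! herdnet-healthcheck; then                 # placeholder health check
        nix-env --profile /nix/var/nix/profiles/system --rollback
        "$current/bin/switch-to-configuration" switch
      fi
    '';
  };
}
```

Because the switch is a symlink flip between complete store closures, an interrupted download simply means the node stays on its current generation; there is no state in between for a connectivity drop to corrupt.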

We also built a migration image that Sam’s team could flash onto existing nodes via their existing SSH access. The image bootstraps NixOS, imports the node’s identity (WireGuard keys, ranch assignment, sensor calibration data) from a migration manifest, and joins the fleet. No physical touch required for nodes that are SSH-reachable.
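The identity import might look roughly like this in Nix, assuming an on-device (impure) evaluation and a hypothetical manifest shape; the path and field names below are illustrative, not Northforge's actual schema.

```nix
# Sketch only: manifest location and field names are hypothetical.
{ config, lib, ... }:
let
  manifest = builtins.fromJSON
    (builtins.readFile /var/lib/herdnet/migration-manifest.json);
in
{
  networking.hostName = manifest.nodeId;

  # Reuse the node's existing WireGuard identity rather than re-keying.
  networking.wireguard.interfaces.wg0.privateKeyFile = manifest.wireguardKeyFile;

  # Carry sensor calibration across the migration unchanged.
  environment.etc."herdnet/calibration.json".text =
    builtins.toJSON manifest.calibration;
}
```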

Week two was observability. We deployed the Prometheus node exporter on every node, scraped into a central Prometheus server and visualized in Grafana. Loki captures structured logs from the sensor daemon and the NixOS update process. We built three dashboards: fleet health (node count by generation, uptime, connectivity), update rollout (deploy progress, rollback events, closure sizes), and per-ranch drill-down (individual node metrics, sensor readings, last-seen timestamps). Alerts fire to Slack when a node misses three consecutive check-ins or when the rollback rate on any deploy exceeds 2%.
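The node-side half of this is small because both agents are stock NixOS modules. A sketch, with placeholder values (the WireGuard address and Loki endpoint are assumptions); the Grafana dashboards and alert rules live centrally.

```nix
# Sketch only: addresses and the Loki endpoint are placeholders.
{ config, ... }:
{
  # Prometheus node exporter, scraped over the WireGuard tunnel.
  services.prometheus.exporters.node = {
    enable = true;
    enabledCollectors = [ "systemd" ];
    listenAddress = "10.80.0.2";   # the node's WireGuard address
  };

  # Promtail ships journal logs (sensor daemon + update service) to Loki.
  services.promtail = {
    enable = true;
    configuration = {
      server.http_listen_port = 9080;
      positions.filename = "/var/lib/promtail/positions.yaml";
      clients = [ { url = "http://loki.example.com:3100/loki/api/v1/push"; } ];
      scrape_configs = [{
        job_name = "journal";
        journal.labels = { job = "systemd-journal"; };
      }];
    };
  };
}
```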

Week three was the production migration. We rolled out in cohorts of 200 nodes per day, monitoring rollback rates and connectivity stability at each stage. The migration took four days. Twelve nodes were unreachable by SSH — Sam’s team scheduled ranch visits for those. Every reachable node migrated without data loss. We wrote operational runbooks, trained the team, and handed off the fleet management repository as a single Git repo where every infrastructure change is a pull request with a Nix build check.
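The handed-off repo centers on a single flake. The sketch below uses hypothetical node and module names and a mkNode helper we invented for illustration; the point is that `nix flake check` in CI builds every node's system closure, so a pull request that breaks any node fails before merge.

```nix
# Sketch only: node and module names are hypothetical.
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.05";

  outputs = { self, nixpkgs }:
    let
      mkNode = path: nixpkgs.lib.nixosSystem {
        system = "aarch64-linux";          # the fleet's ARM boards
        modules = [ ./modules/base.nix path ];
      };
    in {
      nixosConfigurations = {
        node-0001 = mkNode ./nodes/node-0001.nix;
        node-0002 = mkNode ./nodes/node-0002.nix;
        # ...one entry per node, generated from the fleet inventory
      };

      # CI runs `nix flake check`: every node's closure must still build.
      # (Building aarch64 closures in CI assumes an ARM builder or binfmt.)
      checks.aarch64-linux = nixpkgs.lib.mapAttrs
        (_: node: node.config.system.build.toplevel)
        self.nixosConfigurations;
    };
}
```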

Results

OTA updates now deploy in 0.8 seconds per node — the time to download a typical closure diff and switch generations. The entire 1,200-node fleet updates in under five minutes. Rollbacks are atomic and automatic: in the three months since launch, 14 rollback events have occurred (all from intentional canary tests), and every node recovered without intervention.

Fleet uptime went from 94% to 99.7%. The 10–15 weekly node failures dropped to fewer than two, and those are hardware failures (dead SIM cards, rodent damage) rather than software state corruption. Sam’s team went from 20 hours a week on fleet maintenance to about two — mostly reviewing deploy diffs and checking dashboards.

The TLS certificate rotation that previously bricked 180 nodes? Sam ran it as a one-line flake update, pushed it to the channel, and watched 1,200 nodes pick it up in four minutes. Zero rollbacks. He sent the CEO a screenshot of the Grafana dashboard and a one-word Slack message: “Done.”

Northforge cancelled their plan to hire two additional ops engineers. The $28k engagement paid for itself in avoided headcount within the first month.

In their words

“We were losing a dozen nodes a week to botched Ansible runs. Now I push a config change and every node in the fleet picks it up in under a second. Rollbacks are one command. I sleep through the night again.”

Sam Erikson · Infrastructure Lead · Northforge