Twelve stores drilled their kill switch in April. Eleven stood down inside the 90-second SLA. The twelfth didn't — and the twelfth is the most useful story in this dispatch, because it's why we drill at all.
The websocket that didn't get the memo
On the twelfth store, one long-lived websocket connection to an ad platform kept a single executor alive for an extra 40 seconds after the switch tripped. No writes happened — the executor was blocked at the gate — but it hadn't fully stood down, and 'fully' is the only acceptable state for a kill switch.
We found it because they drilled. A store that never tests its kill switch would have discovered this during a real incident, which is the worst possible time. We shipped a fix that force-closes all agent connections on trip, and added a connection-drain assertion to the drill.
A switch you've never pressed is a hope. The drill is how hope becomes a guarantee.
