Control plane API server upgrade

Incident Report for Firezone Production

Postmortem

Connections to the Firezone control plane use the industry-standard WebSocket protocol, a bidirectional protocol that runs over TCP.

Normally, these connections are gracefully drained during infrastructure changes, so that connected entities (clients and gateways) receive the disconnect and immediately reconnect. This can happen several times per day, and reconnections take only a few hundred milliseconds under normal conditions.

In this incident, an API server upgrade initiated at 15:34 UTC did not trigger the graceful drain mechanism on all old API servers, leaving some control plane connections in a half-connected state. This state lasted from the time an affected VM was scheduled for deletion until the deletion completed, which took between 5 and 10 minutes depending on region. At that point, a final RST was sent to all remaining connected clients and gateways, allowing them to reconnect into a healthy state.

During this window, all control messages on the affected connections were backed up, including those used in NAT traversal signaling to establish new data plane connections. As a result, new data plane connections hung, and affected clients and gateways appeared offline in the admin portal UI. Data plane connections established before the incident were unaffected, because they remain connected peer-to-peer.

To prevent this issue from happening again, we’re making three fixes:

  1. Improve drain handling for infrastructure lifecycle operations to proactively send RSTs in all cases where old API servers are winding down.
  2. Implement shorter load balancer idle timeouts, such that if (1) fails, we will detect the stale connection at the load balancer and sever the connection with a RST.
  3. Implement matching application-layer heartbeat timeouts in clients and gateways, such that if (1) and (2) above fail, these will kick in to reset the connection as a last resort.
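
To illustrate the idea behind fix (3), here is a minimal sketch of an application-layer heartbeat watchdog. It is not Firezone's actual client or gateway code: the class name, method names, and the 30-second timeout are hypothetical, chosen only for illustration. The caller passes in a monotonic timestamp so the logic does not depend on any particular WebSocket library.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Tracks when the last message arrived on a control plane connection.

    Hypothetical helper for illustration; not part of any Firezone API.
    """

    timeout_s: float        # assumed value; should match the server-side timeout
    last_seen: float = 0.0  # monotonic timestamp of the last inbound message

    def on_message(self, now: float) -> None:
        # Any inbound frame (heartbeat reply or control message) proves liveness.
        self.last_seen = now

    def is_stale(self, now: float) -> bool:
        # True once nothing has been received for longer than the timeout.
        return now - self.last_seen > self.timeout_s


wd = HeartbeatWatchdog(timeout_s=30.0)
wd.on_message(now=100.0)
print(wd.is_stale(now=120.0))  # False: 20 s of silence is within the timeout
print(wd.is_stale(now=131.0))  # True: 31 s of silence exceeds the timeout
```

When `is_stale` returns true, the client or gateway would close the socket locally and reconnect, rather than waiting for an RST that may not arrive for several minutes.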

We expect the infrastructure fixes above to be applied within 72 hours, and the client and gateway fixes to land in their next releases.

– Firezone team

Posted Feb 20, 2026 - 07:25 UTC

Resolved

On February 19, 2026 at approximately 16:00 UTC, we completed a routine API server upgrade. For a brief period lasting around 5 minutes, some clients and gateways appeared offline in the admin portal UI. Existing data plane connections remained unaffected, while new connection requests may have been delayed.
Posted Feb 19, 2026 - 16:00 UTC