Connections to the Firezone control plane use the industry-standard WebSocket protocol, a bidirectional protocol that runs over TCP.
Normally, these connections are gracefully drained during infrastructure changes, so that any connected entities (clients/gateways) receive the disconnect and immediately reconnect. This can happen several times per day, and reconnections take only a few hundred milliseconds under normal conditions.
In this particular incident, an API server upgrade initiated at 15:34 UTC did not trigger the graceful drain mechanism on all old API servers, leaving some control plane connections in a half-connected state. This state persisted for between 5 and 10 minutes, depending on region, lasting from the time each affected VM was scheduled for deletion until its deletion completed. At that point, a final RST was sent to all remaining connected clients/gateways, allowing them to reconnect into a healthy state.
During this window, all control messages, including those used in NAT traversal signaling for establishing new data plane connections, were backed up. As a result, new data plane connections hung, and affected clients and gateways appeared offline in the admin portal UI. Existing data plane connections established before the incident were unaffected, as these remain connected in a peer-to-peer fashion.
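For readers unfamiliar with half-connected states: one general technique for noticing them from the client side is an application-level heartbeat with a timeout. The sketch below illustrates that idea only. It is not taken from the Firezone codebase, and the class name, interval, and miss threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Illustrative liveness tracker for one control plane connection.

    All names and defaults here are assumptions for this sketch,
    not Firezone's actual settings.
    """

    interval_s: float = 30.0    # hypothetical: seconds between heartbeats
    max_missed: int = 2         # hypothetical: silent intervals tolerated
    last_reply_at: float = 0.0  # monotonic timestamp of the last server reply

    def on_reply(self, now: float) -> None:
        # Any message from the server counts as proof the link is alive.
        self.last_reply_at = now

    def is_stale(self, now: float) -> bool:
        # The connection is treated as half-connected once the server has
        # been silent for longer than interval_s * max_missed seconds.
        return now - self.last_reply_at > self.interval_s * self.max_missed
```

With these illustrative values, a caller that polls `is_stale` with a monotonic clock would tear down and re-establish a silent connection after about a minute, rather than waiting for the remote side to send an RST.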
To prevent this issue from happening again, we’re making three fixes:
We expect the infrastructure fixes above to be applied within 72 hours, and the client/gateway fixes to land in their next releases.
– Firezone team