At April 24th 2025, 1:38 UTC we detected issues with our application servers coming back online after a recent deploy. Typically deploys cause zero downtime and go unnoticed by our customers.
This time however, the deploy caused an outage due to a hung migration which prevented other app servers from successfully issuing queries against the database.
After identifying the issue, we immediately began taking steps towards remediation, starting with reaping the problematic app servers that were timing out.
We then increased the healthcheck timeout on our app servers to allow the hung migration to resolve itself successfully.
At April 24th, 2:01 UTC, the migration successfully completed, and all app servers began returning to a healthy state, resolving the incident.