diff --git a/src/pages/docs/platform/architecture/infrastructure-operations.mdx b/src/pages/docs/platform/architecture/infrastructure-operations.mdx
index 8db7c54d0e..00054757a9 100644
--- a/src/pages/docs/platform/architecture/infrastructure-operations.mdx
+++ b/src/pages/docs/platform/architecture/infrastructure-operations.mdx
@@ -65,7 +65,7 @@ Behind CloudFront, a collection of Network Load Balancers (NLBs) distribute conn
 
 DNS-based routing with latency-based resolution directs clients to the nearest available region. This ensures optimal performance under normal conditions while maintaining the ability to route traffic away from problematic regions.
 
-If a region becomes unavailable, clients can connect to alternative regions through fallback mechanisms built into Ably's client libraries. This ensures service continuity even during regional outages.
+If a region becomes unavailable, clients can connect to alternative regions through fallback mechanisms built into Ably's SDKs. This ensures service continuity even during regional outages.
 
 ## Configuration management
 
@@ -155,7 +155,7 @@ When alerts are triggered, they are processed through an alert management system
 
 Critical alerts automatically page an on-call engineer, triggering immediate investigation regardless of time of day. Less urgent issues may be queued for normal business hours, allowing the team to focus emergency response on true service-impacting issues.
 
-Ably operates a public status page, enabling customers to subscribe directly to notifications relating to incidents or disruption.
+Ably operates a public [status page](https://status.ably.com/), enabling customers to subscribe directly to notifications relating to incidents or disruption.
 
 ## Operational management
 
@@ -181,6 +181,31 @@ For managing larger incidents, Ably maintains a defined incident management fram
 
 These practices ensure that on-call engineers can respond efficiently to incidents with the information and support they need, reducing the stress of on-call duties while improving response effectiveness.
 
+## Service status and incident communication
+
+Understanding how Ably communicates service status helps explain why you might experience disruption that isn't reflected on the [status page](https://status.ably.com/).
+
+### Why status pages don't always reflect your experience
+
+Service disruption is rarely binary; systems are seldom completely "up" or "down." Instead, disruption typically manifests as:
+
+* Higher than normal error rates for specific operations.
+* Increased latencies in particular regions.
+* Failures affecting specific channels or features.
+* Issues impacting certain accounts or applications.
+
+The [status page](https://status.ably.com/) reflects general service availability. When disruption is localized to specific regions, accounts, or operations, the service remains generally available even though some customers experience significant impact.
+
+### Regional vs. global disruption
+
+Internet-scale operations inevitably experience regional issues, whether from Ably's infrastructure or dependencies like AWS. Ably's multi-region architecture is designed to handle these scenarios:
+
+* SDKs automatically attempt connections to different regions when errors occur.
+* Traffic can be explicitly routed away from failing regions to minimize impact.
+* Regional issues are typically not indicated as service unavailability unless Ably continues routing traffic to the failing region.
+
+This approach ensures service continuity for most customers during regional disruptions, but may create a disconnect between the [status page](https://status.ably.com/) and individual customer experience.
+
 ## Fault tolerance and service continuity
 
 The Ably platform is designed to be fault tolerant across a wide range of potential failure conditions, from individual component failures to regional outages.
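
The added section says that SDKs automatically attempt connections to different regions when errors occur. The sketch below illustrates that general idea only; it is not Ably's SDK code or API. The function `connect_with_fallback`, the `AllHostsFailed` exception, the `fake_connect` transport, and the `example.com` host names are all hypothetical, chosen purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple


class AllHostsFailed(Exception):
    """Raised when the primary host and every fallback host fail (hypothetical)."""


def connect_with_fallback(
    primary: str,
    fallbacks: Sequence[str],
    connect: Callable[[str], str],
) -> str:
    """Try the primary host first, then each fallback host in order.

    `connect` is an assumed callable that returns a connection handle on
    success and raises ConnectionError on failure.
    """
    errors: List[Tuple[str, Exception]] = []
    for host in [primary, *fallbacks]:
        try:
            return connect(host)
        except ConnectionError as exc:
            # Record the failure and move on to the next candidate host.
            errors.append((host, exc))
    raise AllHostsFailed(errors)


# Simulated transport: the primary region is down, every other host is up.
def fake_connect(host: str) -> str:
    if host == "primary.example.com":
        raise ConnectionError("region unavailable")
    return f"connected:{host}"


result = connect_with_fallback(
    "primary.example.com",
    ["fallback-a.example.com", "fallback-b.example.com"],
    fake_connect,
)
```

In this simulation the client never surfaces the regional failure to the caller, which is one way an individual experience can differ from what a general availability indicator reports.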