From f9319d707cdd2153c81a546c21faf8703e4566c2 Mon Sep 17 00:00:00 2001
From: Francis Roberts <111994975+franrob-projects@users.noreply.github.com>
Date: Mon, 8 Sep 2025 16:39:34 +0200
Subject: [PATCH 1/2] Add service disruptions documentation explaining status site behavior
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This documentation explains why customers may experience disruptions that don't appear on the status site, covering technical reasons related to Ably's multi-region architecture and providing guidance for affected customers.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude
---
 .../infrastructure-operations.mdx | 27 ++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/pages/docs/platform/architecture/infrastructure-operations.mdx b/src/pages/docs/platform/architecture/infrastructure-operations.mdx
index 8db7c54d0e..7ce21fe274 100644
--- a/src/pages/docs/platform/architecture/infrastructure-operations.mdx
+++ b/src/pages/docs/platform/architecture/infrastructure-operations.mdx
@@ -155,7 +155,7 @@ When alerts are triggered, they are processed through an alert management system

 Critical alerts automatically page an on-call engineer, triggering immediate investigation regardless of time of day. Less urgent issues may be queued for normal business hours, allowing the team to focus emergency response on true service-impacting issues.

-Ably operates a public status page, enabling customers to subscribe directly to notifications relating to incidents or disruption.
+Ably operates a public [status page](https://status.ably.com/), enabling customers to subscribe directly to notifications relating to incidents or disruption.

 ## Operational management

@@ -181,6 +181,31 @@ For managing larger incidents, Ably maintains a defined incident management fram

 These practices ensure that on-call engineers can respond efficiently to incidents with the information and support they need, reducing the stress of on-call duties while improving response effectiveness.

+## Service status and incident communication
+
+Understanding how Ably communicates service status helps explain why you might experience disruption that isn't reflected on our [status page](https://status.ably.com/).
+
+### Why status pages don't always reflect your experience
+
+Service disruption is rarely binary, systems are seldom completely "up" or "down." Instead, disruption typically manifests as:
+
+* Higher than normal error rates for specific operations.
+* Increased latencies in particular regions.
+* Failures affecting specific channels or features.
+* Issues impacting certain accounts or applications.
+
+Our [status page](https://status.ably.com/) reflects general service availability. When disruption is localized to specific regions, accounts, or operations, the service remains generally available even though some customers experience significant impact.
+
+### Regional vs. global disruption
+
+Internet-scale operations inevitably experience regional issues, whether from our infrastructure or dependencies like AWS. Our multi-region architecture is designed to handle these scenarios:
+
+* Client libraries automatically attempt connections to different regions when errors occur.
+* Traffic can be explicitly routed away from failing regions to minimize impact.
+* Regional issues are typically not indicated as service unavailability unless we continue routing traffic to the failing region.
+
+This approach ensures service continuity for most customers during regional disruptions, but may create a disconnect between the [status page](https://status.ably.com/) and individual customer experience.
+
 ## Fault tolerance and service continuity

 The Ably platform is designed to be fault tolerant across a wide range of potential failure conditions, from individual component failures to regional outages.

From 4c5cc8bcb622f8e1c13788d5034d659069acb56d Mon Sep 17 00:00:00 2001
From: Francis Roberts <111994975+franrob-projects@users.noreply.github.com>
Date: Thu, 18 Sep 2025 10:43:54 +0200
Subject: [PATCH 2/2] fixup! Add service disruptions documentation explaining status site behavior

---
 .../architecture/infrastructure-operations.mdx | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/pages/docs/platform/architecture/infrastructure-operations.mdx b/src/pages/docs/platform/architecture/infrastructure-operations.mdx
index 7ce21fe274..00054757a9 100644
--- a/src/pages/docs/platform/architecture/infrastructure-operations.mdx
+++ b/src/pages/docs/platform/architecture/infrastructure-operations.mdx
@@ -65,7 +65,7 @@ Behind CloudFront, a collection of Network Load Balancers (NLBs) distribute conn

 DNS-based routing with latency-based resolution directs clients to the nearest available region. This ensures optimal performance under normal conditions while maintaining the ability to route traffic away from problematic regions.

-If a region becomes unavailable, clients can connect to alternative regions through fallback mechanisms built into Ably's client libraries. This ensures service continuity even during regional outages.
+If a region becomes unavailable, clients can connect to alternative regions through fallback mechanisms built into Ably's SDKs. This ensures service continuity even during regional outages.

 ## Configuration management

@@ -183,7 +183,7 @@ These practices ensure that on-call engineers can respond efficiently to inciden

 ## Service status and incident communication

-Understanding how Ably communicates service status helps explain why you might experience disruption that isn't reflected on our [status page](https://status.ably.com/).
+Understanding how Ably communicates service status helps explain why you might experience disruption that isn't reflected on the [status page](https://status.ably.com/).

 ### Why status pages don't always reflect your experience

@@ -194,15 +194,15 @@ Service disruption is rarely binary, systems are seldom completely "up" or "down
 * Failures affecting specific channels or features.
 * Issues impacting certain accounts or applications.

-Our [status page](https://status.ably.com/) reflects general service availability. When disruption is localized to specific regions, accounts, or operations, the service remains generally available even though some customers experience significant impact.
+The [status page](https://status.ably.com/) reflects general service availability. When disruption is localized to specific regions, accounts, or operations, the service remains generally available even though some customers experience significant impact.

 ### Regional vs. global disruption

-Internet-scale operations inevitably experience regional issues, whether from our infrastructure or dependencies like AWS. Our multi-region architecture is designed to handle these scenarios:
+Internet-scale operations inevitably experience regional issues, whether from Ably's infrastructure or dependencies like AWS. Ably's multi-region architecture is designed to handle these scenarios:

-* Client libraries automatically attempt connections to different regions when errors occur.
+* SDKs automatically attempt connections to different regions when errors occur.
 * Traffic can be explicitly routed away from failing regions to minimize impact.
-* Regional issues are typically not indicated as service unavailability unless we continue routing traffic to the failing region.
+* Regional issues are typically not indicated as service unavailability unless Ably continues routing traffic to the failing region.

 This approach ensures service continuity for most customers during regional disruptions, but may create a disconnect between the [status page](https://status.ably.com/) and individual customer experience.
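
The fallback behavior described in the added section ("SDKs automatically attempt connections to different regions when errors occur") can be illustrated with a minimal sketch. This is not Ably's SDK code: the host names, the `RegionUnavailable` exception, and the `connect_with_fallback` and `fake_connect` functions are all hypothetical and exist only for illustration. It assumes a client holds an ordered list of regional endpoints and moves to the next one when a connection attempt fails.

```python
from typing import Callable, Iterable, Optional


class RegionUnavailable(Exception):
    """Hypothetical error raised when a regional endpoint cannot be reached."""


def connect_with_fallback(
    hosts: Iterable[str],
    connect: Callable[[str], str],
) -> Optional[str]:
    """Try each host in order; return the first successful connection, else None."""
    for host in hosts:
        try:
            return connect(host)
        except RegionUnavailable:
            # This region failed, so move on to the next candidate.
            continue
    return None


def fake_connect(host: str) -> str:
    """Stand-in for a real connection attempt; the primary region is 'down'."""
    if host == "primary.example.com":
        raise RegionUnavailable(host)
    return f"connected:{host}"


# Hypothetical endpoints: one failing primary and two fallbacks.
result = connect_with_fallback(
    ["primary.example.com", "fallback-a.example.com", "fallback-b.example.com"],
    fake_connect,
)
print(result)
```

In this sketch the client ends up connected through a fallback endpoint even though its first-choice region is failing. That is one way a regional problem can stay invisible at the level of general availability while individual clients still see errors or added latency during the switch.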