29 changes: 27 additions & 2 deletions src/pages/docs/platform/architecture/infrastructure-operations.mdx
@@ -65,7 +65,7 @@ Behind CloudFront, a collection of Network Load Balancers (NLBs) distribute conn

DNS-based routing with latency-based resolution directs clients to the nearest available region. This ensures optimal performance under normal conditions while maintaining the ability to route traffic away from problematic regions.

If a region becomes unavailable, clients can connect to alternative regions through fallback mechanisms built into Ably's client libraries. This ensures service continuity even during regional outages.
If a region becomes unavailable, clients can connect to alternative regions through fallback mechanisms built into Ably's SDKs. This ensures service continuity even during regional outages.

## Configuration management

@@ -155,7 +155,7 @@ When alerts are triggered, they are processed through an alert management system

Critical alerts automatically page an on-call engineer, triggering immediate investigation regardless of time of day. Less urgent issues may be queued for normal business hours, allowing the team to focus emergency response on true service-impacting issues.
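As a rough illustration of the severity-based routing described above, the following minimal sketch sends critical alerts to a pager and everything else to a business-hours queue. The `Severity` levels, the `Alert` type, and the destination names are all hypothetical and are not taken from Ably's alert management system.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical severity levels, for illustration only.
    CRITICAL = "critical"
    WARNING = "warning"


@dataclass
class Alert:
    name: str
    severity: Severity


def route_alert(alert: Alert) -> str:
    """Return a hypothetical destination for an alert based on its severity."""
    if alert.severity is Severity.CRITICAL:
        # Critical alerts page the on-call engineer regardless of time of day.
        return "page_on_call"
    # Less urgent alerts wait for normal business hours.
    return "business_hours_queue"
```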

Ably operates a public status page, enabling customers to subscribe directly to notifications relating to incidents or disruption.
Ably operates a public [status page](https://status.ably.com/), enabling customers to subscribe directly to notifications relating to incidents or disruption.

## Operational management

@@ -181,6 +181,31 @@ For managing larger incidents, Ably maintains a defined incident management fram

These practices ensure that on-call engineers can respond efficiently to incidents with the information and support they need, reducing the stress of on-call duties while improving response effectiveness.

## Service status and incident communication

Understanding how Ably communicates service status helps explain why you might experience disruption that isn't reflected on the [status page](https://status.ably.com/).

### Why status pages don't always reflect your experience

Service disruption is rarely binary: systems are seldom completely "up" or "down." Instead, disruption typically manifests as:

* Higher than normal error rates for specific operations.
* Increased latencies in particular regions.
* Failures affecting specific channels or features.
* Issues impacting certain accounts or applications.

The [status page](https://status.ably.com/) reflects general service availability. When disruption is localized to specific regions, accounts, or operations, the service remains generally available even though some customers experience significant impact.
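A small worked example may help show why an aggregate view can look healthy while some customers see significant impact. The region names and request counts below are invented for illustration and are not Ably data.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed (0.0 when there is no traffic)."""
    return errors / requests if requests else 0.0


# Hypothetical per-region traffic during a localized disruption.
regions = {
    "region-a": {"requests": 900_000, "errors": 900},
    "region-b": {"requests": 100_000, "errors": 20_000},
}

total_requests = sum(r["requests"] for r in regions.values())
total_errors = sum(r["errors"] for r in regions.values())

# Aggregate view across all regions.
global_rate = error_rate(total_errors, total_requests)

# Per-region view, which is closer to what individual customers experience.
per_region = {
    name: error_rate(r["errors"], r["requests"]) for name, r in regions.items()
}
```

With these made-up numbers, the aggregate error rate is about 2%, while customers whose traffic lands in `region-b` see roughly one request in five fail.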

### Regional vs. global disruption

Internet-scale operations inevitably experience regional issues, whether from Ably's infrastructure or dependencies like AWS. Ably's multi-region architecture is designed to handle these scenarios:

* SDKs automatically attempt connections to different regions when errors occur.
* Traffic can be explicitly routed away from failing regions to minimize impact.
* Regional issues are typically not reported as service unavailability unless Ably continues routing traffic to the failing region.

This approach ensures service continuity for most customers during regional disruptions, but may create a disconnect between the [status page](https://status.ably.com/) and individual customer experience.
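The fallback behavior described in this section can be sketched generically as trying a list of endpoints in order until one accepts the connection. This is not Ably's SDK implementation: the function name, the exception type, and the `example.com` hostnames are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple


class AllEndpointsFailed(Exception):
    """Raised when no endpoint in the list accepts a connection."""


def connect_with_fallback(
    endpoints: Sequence[str],
    connect: Callable[[str], object],
) -> Tuple[str, object]:
    """Try each endpoint in order and return the first successful connection."""
    failures: List[Tuple[str, Exception]] = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except ConnectionError as exc:
            # Record the failure and move on to the next region.
            failures.append((endpoint, exc))
    raise AllEndpointsFailed(failures)


def fake_connect(endpoint: str) -> str:
    """Stand-in for a real connection attempt; the primary is 'down'."""
    if endpoint == "primary.example.com":
        raise ConnectionError("primary region unavailable")
    return f"connected:{endpoint}"


# Hypothetical hostnames: a primary endpoint followed by two fallbacks.
used_endpoint, connection = connect_with_fallback(
    ["primary.example.com", "fallback-a.example.com", "fallback-b.example.com"],
    fake_connect,
)
```

In this sketch the caller ends up connected through the first fallback and may never notice the primary region's failure, which is one way the disconnect between the status page and individual experience can arise in either direction.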

## Fault tolerance and service continuity

The Ably platform is designed to be fault tolerant across a wide range of potential failure conditions, from individual component failures to regional outages.