From cb94a9cd605c84946b3867c3c72c86108fb07d9a Mon Sep 17 00:00:00 2001
From: Divyansha Dubey
Date: Mon, 1 Sep 2025 00:30:00 +0400
Subject: [PATCH 1/2] Added public report for SUC-1640

---
 2025/08/24/readme.md | 70 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 2025/08/24/readme.md

diff --git a/2025/08/24/readme.md b/2025/08/24/readme.md
new file mode 100644
index 0000000..e0209a4
--- /dev/null
+++ b/2025/08/24/readme.md
@@ -0,0 +1,70 @@
+# Short title
+
+- **Incident Start:** 2025-08-24 07:26 UTC
+- **Incident End:** 2025-08-26 05:10 UTC
+- **Report Prepared By:** Divyansha
+
+## Summary
+
+Between 24–26 August 2025, some Appwrite Cloud users experienced intermittent issues with email delivery, SMS and email OTPs, and realtime events. The issue was caused by instability in our messaging system. The problem was resolved by stabilizing the messaging infrastructure, and all services are now fully operational.
+
+## Incident details
+
+### Initial detection
+
+The issue was first detected when users reported missing or delayed emails and OTPs.
+
+### Affected components
+
+Email delivery (invitations, password resets, notifications)
+OTP delivery via email and SMS
+Realtime events
+
+### User impact
+
+Some customers experienced delayed or failed OTPs and emails, which affected their ability to onboard new users or use realtime features.
+
+## Root cause analysis
+
+### Preliminary findings
+
+We noticed few errors in the messaging system responsible for handling message delivery.
+
+### Investigation
+
+The issue was linked to instability in the messaging system.
+
+### Root cause
+
+The root cause was overload in the underlying messaging queues, which led to dropped or delayed messages and intermittent realtime failures.
+
+## Resolution and recovery
+
+### Immediate Actions
+
+We restarted affected services across regions to temporarily restore delivery.
+
+### Resolution
+
+We stabilized the messaging infrastructure by migrating critical queues to a more reliable system and disabling task contributing to overload.
+
+## Lessons learned
+
+### What went well
+
+Quick detection of failures and temporary restarts reduced user impact.
+Migration to a more reliable system helped restore stability quickly.
+
+### What can be improved
+
+Better visibility into system dependencies is needed to speed up diagnosis.
+
+### Action items
+
+Improve monitoring and alerting for earlier detection of overloads.
+
+## Additional resources
+
+### Supporting documentation
+
+https://status.appwrite.online/incident/711585
\ No newline at end of file

From 5b034ba91fd5d66293e53682454c2f97dbd830e3 Mon Sep 17 00:00:00 2001
From: Divyansha Dubey
Date: Wed, 17 Sep 2025 17:38:48 +0400
Subject: [PATCH 2/2] updated the public report

---
 2025/08/24/readme.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/2025/08/24/readme.md b/2025/08/24/readme.md
index e0209a4..e4e8957 100644
--- a/2025/08/24/readme.md
+++ b/2025/08/24/readme.md
@@ -1,4 +1,4 @@
-# Short title
+# Email, SMS, and Realtime Delivery Issues
 
 - **Incident Start:** 2025-08-24 07:26 UTC
 - **Incident End:** 2025-08-26 05:10 UTC
@@ -16,9 +16,9 @@ The issue was first detected when users reported missing or delayed emails and O
 
 ### Affected components
 
-Email delivery (invitations, password resets, notifications)
-OTP delivery via email and SMS
-Realtime events
+- Email delivery (invitations, password resets, notifications)
+- OTP delivery via email and SMS
+- Realtime events
 
 ### User impact
 
@@ -52,8 +52,8 @@ We stabilized the messaging infrastructure by migrating critical queues to a mor
 
 ### What went well
 
-Quick detection of failures and temporary restarts reduced user impact.
-Migration to a more reliable system helped restore stability quickly.
+- Quick detection of failures and temporary restarts reduced user impact.
+- Migration to a more reliable system helped restore stability quickly.
 
 ### What can be improved
 