Added public incident report for Degradation in email, SMS, and realtime delivery #20
@@ -0,0 +1,70 @@
# Short title

- **Incident Start:** 2025-08-24 07:26 UTC
- **Incident End:** 2025-08-26 05:10 UTC
- **Report Prepared By:** Divyansha

## Summary

Between 24–26 August 2025, some Appwrite Cloud users experienced intermittent issues with email delivery, SMS and email OTPs, and realtime events. The issue was caused by instability in our messaging system. The problem was resolved by stabilizing the messaging infrastructure, and all services are now fully operational.

## Incident details

### Initial detection

The issue was first detected when users reported missing or delayed emails and OTPs.

### Affected components

Email delivery (invitations, password resets, notifications)
OTP delivery via email and SMS
Realtime events

### User impact

Some customers experienced delayed or failed OTPs and emails, which affected their ability to onboard new users or use realtime features.

## Root cause analysis

### Preliminary findings

We noticed few errors in the messaging system responsible for handling message delivery.

Review comment:
Grammar: “a few errors”. Minor clarity fix.
-We noticed few errors in the messaging system responsible for handling message delivery.
+We noticed a few errors in the messaging system responsible for handling message delivery.
Tools: LanguageTool [grammar] ~31-~31: There might be a mistake here. (QB_NEW_EN)

### Investigation

The issue was linked to instability in the messaging system.

### Root cause

The root cause was overload in the underlying messaging queues, which led to dropped or delayed messages and intermittent realtime failures.

## Resolution and recovery

### Immediate Actions

We restarted affected services across regions to temporarily restore delivery.

### Resolution

We stabilized the messaging infrastructure by migrating critical queues to a more reliable system and disabling task contributing to overload.

Review comment:
Grammar: pluralize “task” and add article. Also reads better with “the overload”.
-We stabilized the messaging infrastructure by migrating critical queues to a more reliable system and disabling task contributing to overload.
+We stabilized the messaging infrastructure by migrating critical queues to a more reliable system and disabling tasks contributing to the overload.

## Lessons learned

### What went well

Quick detection of failures and temporary restarts reduced user impact.
Migration to a more reliable system helped restore stability quickly.

### What can be improved

Better visibility into system dependencies is needed to speed up diagnosis.

### Action items

Improve monitoring and alerting for earlier detection of overloads.
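
The report does not say how overload would be detected, so the following is only a minimal sketch of one possible approach, not Appwrite's actual monitoring: raise an alert when a queue's depth stays above a threshold for several consecutive samples, so that a single spike does not page anyone. The function name, the `OverloadAlert` type, the queue name `mails`, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class OverloadAlert:
    queue: str               # hypothetical queue name
    first_breach_index: int  # index of the first sample in the breaching streak
    depth: int               # depth at the moment the alert fires


def detect_sustained_overload(
    queue: str,
    depths: Iterable[int],
    threshold: int,
    min_consecutive: int,
) -> List[OverloadAlert]:
    """Return one alert per streak of `min_consecutive` samples above `threshold`."""
    alerts: List[OverloadAlert] = []
    streak = 0
    for i, depth in enumerate(depths):
        if depth > threshold:
            streak += 1
            # Fire exactly once, when the streak first reaches the required length.
            if streak == min_consecutive:
                alerts.append(OverloadAlert(queue, i - min_consecutive + 1, depth))
        else:
            streak = 0
    return alerts


# Illustrative samples only; real depths would come from the queue's metrics.
samples = [10, 120, 130, 90, 150, 160, 170, 180, 20]
alerts = detect_sustained_overload("mails", samples, threshold=100, min_consecutive=3)
```

With these samples, the two-sample spike at the start is ignored and a single alert fires for the longer streak that follows. Requiring consecutive breaches trades a short detection delay for fewer false alarms; the right threshold and window would depend on the real traffic profile.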

## Additional resources

### Supporting documentation

https://status.appwrite.online/incident/711585

Review comment:
Fix bare URL (MD034) with descriptive markdown link. Resolves markdownlint warning.
-https://status.appwrite.online/incident/711585
+[Status incident 711585](https://status.appwrite.online/incident/711585)
Tools: markdownlint-cli2 (0.17.2) 70-70: Bare URL used (MD034, no-bare-urls)
🛠️ Refactor suggestion
Replace placeholder heading with a concrete incident title
Use a clear, user-facing title for indexing and readability.