70 changes: 70 additions & 0 deletions 2025/08/24/readme.md
@@ -0,0 +1,70 @@
# Short title

Comment on lines +1 to +2

🛠️ Refactor suggestion

Replace placeholder heading with a concrete incident title

Use a clear, user-facing title for indexing and readability.

-# Short title
+# Degradation: Email, SMS, and Realtime Delivery (Aug 24–26, 2025)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Short title
# Degradation: Email, SMS, and Realtime Delivery (Aug 24–26, 2025)
🤖 Prompt for AI Agents
In 2025/08/24/readme.md around lines 1 to 2, the placeholder heading "# Short
title" should be replaced with a concrete, user-facing incident title; update
the markdown H1 to a descriptive, searchable title that clearly summarizes the
incident (e.g., include service/component and brief issue summary) to improve
indexing and readability.

- **Incident Start:** 2025-08-24 07:26 UTC
- **Incident End:** 2025-08-26 05:10 UTC
- **Report Prepared By:** Divyansha

## Summary

Between 24–26 August 2025, some Appwrite Cloud users experienced intermittent issues with email delivery, SMS and email OTPs, and realtime events. The issue was caused by instability in our messaging system. The problem was resolved by stabilizing the messaging infrastructure, and all services are now fully operational.

## Incident details

### Initial detection

The issue was first detected when users reported missing or delayed emails and OTPs.

### Affected components

- Email delivery (invitations, password resets, notifications)
- OTP delivery via email and SMS
- Realtime events

### User impact

Some customers experienced delayed or failed OTPs and emails, which affected their ability to onboard new users or use realtime features.

## Root cause analysis

### Preliminary findings

We noticed few errors in the messaging system responsible for handling message delivery.

⚠️ Potential issue

Grammar: “a few errors”

Minor clarity fix.

-We noticed few errors in the messaging system responsible for handling message delivery.
+We noticed a few errors in the messaging system responsible for handling message delivery.
🧰 Tools
🪛 LanguageTool

[grammar] ~31-~31: There might be a mistake here.
Context: ... analysis ### Preliminary findings We noticed few errors in the messaging system resp...

(QB_NEW_EN)

🤖 Prompt for AI Agents
In 2025/08/24/readme.md around line 31, the phrase "We noticed few errors in the
messaging system responsible for handling message delivery." uses incorrect
grammar; change "few" to "a few" so the sentence reads "We noticed a few errors
in the messaging system responsible for handling message delivery." to improve
clarity and correctness.


### Investigation

The issue was linked to instability in the messaging system.

### Root cause

The root cause was overload in the underlying messaging queues, which led to dropped or delayed messages and intermittent realtime failures.

## Resolution and recovery

### Immediate actions

We restarted affected services across regions to temporarily restore delivery.

### Resolution

We stabilized the messaging infrastructure by migrating critical queues to a more reliable system and disabling task contributing to overload.

⚠️ Potential issue

Grammar: pluralize “task” and add article

Also reads better with “the overload”.

-We stabilized the messaging infrastructure by migrating critical queues to a more reliable system and disabling task contributing to overload.
+We stabilized the messaging infrastructure by migrating critical queues to a more reliable system and disabling tasks contributing to the overload.
🤖 Prompt for AI Agents
In 2025/08/24/readme.md around line 49, the sentence uses the singular “task”
and omits an article before “overload”; change it to pluralize “task” to “tasks”
and add “the” so the phrase reads, for example, “disabling the tasks
contributing to the overload,” resulting in: “We stabilized the messaging
infrastructure by migrating critical queues to a more reliable system and
disabling the tasks contributing to the overload.”


## Lessons learned

### What went well

- Quick detection of failures and temporary restarts reduced user impact.
- Migration to a more reliable system helped restore stability quickly.

### What can be improved

Better visibility into system dependencies is needed to speed up diagnosis.

### Action items

Improve monitoring and alerting for earlier detection of overloads.
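
The report does not describe how overload would be detected, so the following is only a minimal sketch of one common approach: alert when a queue's depth or the age of its oldest message crosses a threshold. The `QueueSample` type, the queue names, and both threshold values are hypothetical and are not taken from Appwrite's actual monitoring setup.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """One point-in-time reading for a queue (all fields are hypothetical)."""
    name: str
    depth: int           # messages currently waiting in the queue
    oldest_age_s: float  # age of the oldest waiting message, in seconds


def overload_alerts(samples, max_depth=10_000, max_age_s=60.0):
    """Return one alert string per threshold that a queue exceeds.

    The default thresholds are placeholders; real values would need to be
    tuned against normal traffic for each queue.
    """
    alerts = []
    for s in samples:
        if s.depth > max_depth:
            alerts.append(f"{s.name}: depth {s.depth} exceeds {max_depth}")
        if s.oldest_age_s > max_age_s:
            alerts.append(
                f"{s.name}: oldest message is {s.oldest_age_s:.0f}s old "
                f"(limit {max_age_s:.0f}s)"
            )
    return alerts


if __name__ == "__main__":
    # Illustrative readings only; the queue names are made up.
    readings = [
        QueueSample("mails", depth=12_500, oldest_age_s=5.0),
        QueueSample("realtime", depth=40, oldest_age_s=2.0),
    ]
    for alert in overload_alerts(readings):
        print(alert)
```

Checking message age alongside depth is a deliberate choice in this sketch: a short queue can still be stalled, and delayed OTPs or emails would show up as growing age before depth becomes large.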

## Additional resources

### Supporting documentation

https://status.appwrite.online/incident/711585

⚠️ Potential issue

Fix bare URL (MD034) with descriptive markdown link

Resolves markdownlint warning.

-https://status.appwrite.online/incident/711585
+[Status incident 711585](https://status.appwrite.online/incident/711585)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

70-70: Bare URL used

(MD034, no-bare-urls)

🤖 Prompt for AI Agents
In 2025/08/24/readme.md around line 70, replace the bare URL
"https://status.appwrite.online/incident/711585" with a descriptive markdown
link to resolve MD034; update it to use bracketed link text like "[Appwrite
status: Incident 711585]" (or another concise description) followed by the URL
in parentheses so the markdown is more descriptive and passes markdownlint.