151 changes: 145 additions & 6 deletions rust/src/core/chat_media.rs
@@ -1746,7 +1746,7 @@ impl AppCore {
self.last_outgoing_ts
};

-let (client, wrapper, relays, rumor_id_hex) = {
+let (client, relays, rumor, rumor_id_hex) = {
let Some(sess) = self.session.as_mut() else {
return;
};
@@ -1788,6 +1788,149 @@ impl AppCore {
},
);

let relays: Vec<RelayUrl> = if network_enabled {
sess.mdk
.get_relays(&group.mls_group_id)
.ok()
.map(|s| s.into_iter().collect())
.filter(|v: &Vec<RelayUrl>| !v.is_empty())
.unwrap_or_else(|| fallback_relays.clone())
} else {
vec![]
};

(sess.client.clone(), relays, rumor, rumor_id_hex)
};

self.prune_local_outbox(&chat_id);
self.refresh_chat_list_from_storage();
self.refresh_current_chat_if_open(&chat_id);

if !network_enabled {
let _ = self.core_sender.send(CoreMsg::Internal(Box::new(
InternalEvent::PublishMessageResult {
chat_id,
rumor_id: rumor_id_hex,
ok: false,
error: Some("offline".into()),
},
)));
return;
}

let tx = self.core_sender.clone();
let mut seen_ids = self.processed_group_event_ids.clone();
let chat_id_for_task = chat_id.clone();
self.runtime.spawn(async move {
let filter = Filter::new()
.kind(Kind::MlsGroupMessage)
.custom_tags(
SingleLetterTag::lowercase(Alphabet::H),
vec![chat_id_for_task.clone()],
)
.limit(200);
let (mut backlog_events, error) = match client
.fetch_events_from(relays.clone(), filter, std::time::Duration::from_secs(8))
.await
{
Ok(evts) => (evts.into_iter().collect::<Vec<Event>>(), None),
Err(e) => (Vec::new(), Some(format!("group backlog fetch failed: {e}"))),
Comment on lines +1832 to +1837
coderabbitai[bot], Mar 8, 2026
⚠️ Potential issue | 🟠 Major

Don't continue to create_message() after a failed catch-up.

Line 1837 converts backlog fetch failure into error, but Lines 1860-1864 only log it and continue. Lines 1899-1901 do the same when backlog ingestion still leaves events behind. That means publish_prepared_group_rumor() can still encrypt from a known-incomplete MLS state, which is the exact stale-epoch failure mode this PR is trying to avoid.

Suggested direction
 pub(super) fn handle_group_send_catchup_completed(
     &mut self,
     chat_id: String,
     rumor: UnsignedEvent,
     backlog_events: Vec<Event>,
     error: Option<String>,
 ) {
+    let rumor_id_hex = rumor.id().to_hex();
     if let Some(err) = error {
-        tracing::warn!(%chat_id, %err, "group send backlog fetch failed");
+        let delivery = MessageDeliveryState::Failed {
+            reason: format!("group backlog fetch failed: {err}"),
+        };
+        self.delivery_overrides
+            .entry(chat_id.clone())
+            .or_default()
+            .insert(rumor_id_hex.clone(), delivery.clone());
+        self.fail_delivery_or_refresh(&chat_id, &rumor_id_hex, delivery);
+        self.refresh_chat_list_from_storage();
+        return;
     }
-    self.ingest_group_backlog_events(backlog_events);
+    if !self.ingest_group_backlog_events(backlog_events) {
+        let delivery = MessageDeliveryState::Failed {
+            reason: "group backlog catch-up incomplete".into(),
+        };
+        self.delivery_overrides
+            .entry(chat_id.clone())
+            .or_default()
+            .insert(rumor_id_hex.clone(), delivery.clone());
+        self.fail_delivery_or_refresh(&chat_id, &rumor_id_hex, delivery);
+        self.refresh_chat_list_from_storage();
+        return;
+    }
     self.publish_prepared_group_rumor(chat_id, rumor);
 }
 
-fn ingest_group_backlog_events(&mut self, backlog_events: Vec<Event>) {
+fn ingest_group_backlog_events(&mut self, backlog_events: Vec<Event>) -> bool {
     let mut remaining = backlog_events;
     for _ in 0..3 {
         if remaining.is_empty() {
             break;
         }
@@
-                let Some(sess) = self.session.as_mut() else {
-                    return;
+                let Some(sess) = self.session.as_mut() else {
+                    return false;
                 };
                 sess.mdk.process_message(&event)
             };
@@
-    if !remaining.is_empty() {
-        tracing::warn!(remaining = remaining.len(), "failed to ingest some backlog events");
-    }
+    if !remaining.is_empty() {
+        tracing::warn!(remaining = remaining.len(), "failed to ingest some backlog events");
+        return false;
+    }
+    true
 }

Also applies to: 1860-1864, 1899-1901

Author
Good idea, but this PR is intentionally matching the CLI’s current best-effort catch-up behavior (parity) rather than introducing a new fail-closed policy.

But it could be a good follow-up.

coderabbitai[bot]
@reez, understood — keeping best-effort parity with the CLI makes sense as an incremental step. Would you like me to open a follow-up issue to track the fail-closed catch-up policy (failing the send when backlog fetch errors or ingestion is incomplete), so it doesn't get lost?


✏️ Learnings added
Learnt from: reez
Repo: sledtools/pika PR: 513
File: rust/src/core/chat_media.rs:1832-1837
Timestamp: 2026-03-08T21:14:08.184Z
Learning: In `rust/src/core/chat_media.rs`, the `handle_group_send_catchup_completed` / `ingest_group_backlog_events` flow intentionally uses best-effort (log-and-continue) catch-up behavior to match CLI parity. Failing the send on backlog fetch error or incomplete ingestion (fail-closed policy) is explicitly deferred to a future follow-up PR. Do not flag the log-and-continue pattern as a bug in this context.



🧠 Learnings used
Learnt from: futurepaul
Repo: sledtools/pika PR: 429
File: rust/src/core/chat_media.rs:1257-1262
Timestamp: 2026-03-05T00:37:01.166Z
Learning: In `rust/src/core/chat_media.rs`, within `handle_batch_upload_completed`, the early returns when `self.session` is unavailable (around the `let Some(sess) = self.session.as_ref() else { return; }` check) and when `sess.local_keys` is unavailable do not transition the batch to a Failed state, which can strand the batch in pending/uploading. The fix — marking the batch as Failed and cleaning up outbox/delivery_overrides — is intentionally deferred to a follow-up PR alongside broader batch error-handling hardening (same class as the missing-URL/hash early returns in the same function).

Learnt from: futurepaul
Repo: sledtools/pika PR: 429
File: rust/src/core/chat_media.rs:874-895
Timestamp: 2026-03-05T00:37:03.771Z
Learning: In rust/src/core/chat_media.rs, within send_chat_media_batch, when the MDK re-encodes media and the local preview copy to final_local_path fails, the code should fall back to pp.local_path instead of setting att.local_path to None. This avoids temporarily losing the thumbnail in the outbox entry due to a cosmetic issue since uploads use encrypted_data in memory. Apply this fallback pattern in similar Rust code paths where a file copy may fail but an existing local_path could still be valid. This is a known follow-up and should be implemented as a defensive fallback.

};
backlog_events.retain(|ev| seen_ids.insert(ev.id));
backlog_events.sort_by_key(|ev| ev.created_at.as_secs());
coderabbitai[bot] marked this conversation as resolved.

let _ = tx.send(CoreMsg::Internal(Box::new(
InternalEvent::GroupSendCatchupCompleted {
chat_id: chat_id_for_task,
rumor,
backlog_events,
error,
},
)));
});
Comment on lines +1821 to +1850
devin-ai-integration[bot]
🔴 Crash safety regression: encrypted wrapper not persisted until after 8-second async backlog fetch

In the old code, the rumor was encrypted via create_message and immediately persisted to pending_sends (SQLite-backed) before any async work began. On app crash, PendingSends::load at rust/src/core/mod.rs:881 would recover the wrapper for retry.

In the new code, publish_chat_message_with_tags spawns an async backlog fetch (up to 8 seconds, rust/src/core/chat_media.rs:1833) before encryption happens. The pending_sends.insert call has moved to publish_prepared_group_rumor at rust/src/core/chat_media.rs:1953, which only runs after the async catchup completes. If the app crashes during this window, the message is lost: local_outbox and delivery_overrides are in-memory only (lost on crash), and the wrapper was never persisted to pending_sends. The user's message silently vanishes with no way to recover or retry it.

Comparison of persistence timing

Old code persisted the encrypted wrapper synchronously before the async send:

create_message → pending_sends.insert (persisted) → spawn send

New code defers persistence past an 8-second async gap:

spawn backlog fetch (8s timeout) → [async gap] → create_message → pending_sends.insert

The crash-safety window expanded from near-zero to 8+ seconds.

Prompt for agents
In rust/src/core/chat_media.rs, the publish_chat_message_with_tags function (around line 1821-1850) now defers encryption and pending_sends persistence until after an 8-second async backlog fetch, creating a window where a crash would lose the user's message. To fix this, consider one of these approaches:

1. Persist the unencrypted rumor content to a new durable store (e.g. a "pending_outbox" SQLite table) before spawning the backlog fetch at line 1824. Then in publish_prepared_group_rumor (line 1909), remove it from the pending_outbox after successful pending_sends.insert. On app restart, re-create and re-encrypt any pending_outbox entries.

2. Alternatively, encrypt the message BEFORE the backlog fetch (as in the old code) and store the wrapper in pending_sends immediately. Then perform the backlog fetch as a fire-and-forget side-effect. This preserves crash safety but may mean the encrypted message uses a stale MLS epoch (which was presumably the reason for the refactor).

3. At minimum, document in a code comment that the backlog catchup window is intentionally not crash-safe, if this is an accepted tradeoff.
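As a sketch of approach 1 above, a minimal durable pending-outbox can be persisted synchronously before the backlog fetch is spawned, and drained after the encrypted wrapper reaches pending_sends. The `PendingOutbox` type, the `pending_outbox.tsv` file name, and the tab-separated format here are illustrative assumptions, not code from this PR:

```rust
use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical durable store for not-yet-encrypted rumors. The point of the
/// pattern: `insert` runs synchronously BEFORE spawning the async backlog
/// fetch, and `remove` runs only after the encrypted wrapper lands in
/// pending_sends, so a crash in between leaves a recoverable record on disk.
struct PendingOutbox {
    path: PathBuf,
    entries: HashMap<String, String>, // rumor_id_hex -> serialized rumor content
}

impl PendingOutbox {
    fn load(dir: &Path) -> std::io::Result<Self> {
        let path = dir.join("pending_outbox.tsv");
        let mut entries = HashMap::new();
        // A missing file just means an empty outbox.
        if let Ok(text) = fs::read_to_string(&path) {
            for line in text.lines() {
                if let Some((id, content)) = line.split_once('\t') {
                    entries.insert(id.to_string(), content.to_string());
                }
            }
        }
        Ok(Self { path, entries })
    }

    /// Persist synchronously; call before any async work begins.
    /// (Simplistic format: content must not contain tabs or newlines.)
    fn insert(&mut self, rumor_id: &str, content: &str) -> std::io::Result<()> {
        self.entries
            .insert(rumor_id.to_string(), content.to_string());
        self.flush()
    }

    /// Remove once the encrypted wrapper has been durably persisted.
    fn remove(&mut self, rumor_id: &str) -> std::io::Result<()> {
        self.entries.remove(rumor_id);
        self.flush()
    }

    fn flush(&self) -> std::io::Result<()> {
        let body: String = self
            .entries
            .iter()
            .map(|(id, content)| format!("{id}\t{content}\n"))
            .collect();
        fs::write(&self.path, body)
    }
}
```

On restart, `PendingOutbox::load` would feed any surviving entries back through re-encryption and publish, closing the crash window.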

}

pub(super) fn handle_group_send_catchup_completed(
&mut self,
chat_id: String,
rumor: UnsignedEvent,
backlog_events: Vec<Event>,
error: Option<String>,
) {
if let Some(err) = error {
tracing::warn!(%chat_id, %err, "group send backlog fetch failed");
}
self.ingest_group_backlog_events(backlog_events);
self.publish_prepared_group_rumor(chat_id, rumor);
}

fn ingest_group_backlog_events(&mut self, backlog_events: Vec<Event>) {
let mut remaining = backlog_events;
let mut processed_any = false;
for _ in 0..3 {
if remaining.is_empty() {
break;
}
let current = std::mem::take(&mut remaining);
let mut next = Vec::new();
for event in current.into_iter() {
let event_id = event.id;
let result = {
let Some(sess) = self.session.as_mut() else {
return;
};
sess.mdk.process_message(&event)
};
match result {
Ok(r) => {
self.note_processed_group_event_id_in_memory(event_id);
processed_any = true;
self.handle_message_processing_result(r);
}
Err(e) => {
tracing::debug!(event_id = %event.id.to_hex(), %e, "deferred backlog message");
next.push(event);
}
}
}
if next.is_empty() {
break;
}
remaining = next;
}
if !remaining.is_empty() {
tracing::warn!(remaining = remaining.len(), "failed to ingest some backlog events");
}
if processed_any {
self.persist_processed_group_event_ids_cache();
}
}
Comment on lines +1867 to +1907
devin-ai-integration[bot]
🟡 No tests added for new backlog catchup logic

The AGENTS.md rule states: "Always add tests for changes when possible." This PR introduces three new non-trivial functions — handle_group_send_catchup_completed (line 1853), ingest_group_backlog_events (line 1867), and publish_prepared_group_rumor (line 1909) — along with four new helper methods for processed_group_event_ids management (rust/src/core/mod.rs:4484-4520). None of these have tests. At minimum, ingest_group_backlog_events (the multi-pass retry loop) and the note_processed_group_event_id/clear_processed_group_event_ids_cache helpers are unit-testable without complex mocking.
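As the comment notes, the multi-pass retry loop is unit-testable in isolation. A hedged sketch, factoring the loop over a `process` callback so it can run without an MLS session; `ingest_with_retries` and the integer events below are a hypothetical test harness, not code from this PR:

```rust
/// Standalone sketch of the multi-pass ingest loop from
/// `ingest_group_backlog_events`: events that fail to process (for example,
/// because they arrived ahead of a message they depend on) are retried on
/// the next pass, up to three passes. The `process` callback stands in for
/// `sess.mdk.process_message`, returning true on success.
fn ingest_with_retries<E, F>(events: Vec<E>, mut process: F) -> Vec<E>
where
    F: FnMut(&E) -> bool,
{
    let mut remaining = events;
    for _ in 0..3 {
        if remaining.is_empty() {
            break;
        }
        let current = std::mem::take(&mut remaining);
        let mut next = Vec::new();
        for event in current {
            if process(&event) {
                // Success: the real code notes the event id and applies the
                // processing result here.
            } else {
                // Deferred: try again on the next pass.
                next.push(event);
            }
        }
        if next.is_empty() {
            break;
        }
        remaining = next;
    }
    remaining // whatever still failed after all passes
}
```

In a test, an event that depends on an earlier one succeeds on the second pass, while an unprocessable event survives all three passes, mirroring the deferred/remaining behavior of the real loop.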



fn publish_prepared_group_rumor(&mut self, chat_id: String, mut rumor: UnsignedEvent) {
let network_enabled = self.network_enabled();
let fallback_relays = self.default_relays();
let rumor_id_hex = rumor.id().to_hex();

let mark_publish_failed = |core: &mut Self, reason: &str| {
core.handle_publish_message_result(
chat_id.clone(),
rumor_id_hex.clone(),
false,
Some(reason.to_string()),
);
};

let (client, wrapper, relays) = {
let Some(sess) = self.session.as_mut() else {
mark_publish_failed(self, "session lost");
return;
};
let Some(group) = sess.groups.get(&chat_id).cloned() else {
self.toast("Chat not found");
mark_publish_failed(self, "chat not found");
return;
};

let wrapper = match sess.mdk.create_message(&group.mls_group_id, rumor) {
Ok(e) => e,
Err(e) => {
@@ -1821,13 +1964,9 @@ impl AppCore {
vec![]
};

-(sess.client.clone(), wrapper, relays, rumor_id_hex)
+(sess.client.clone(), wrapper, relays)
};

self.prune_local_outbox(&chat_id);
self.refresh_chat_list_from_storage();
self.refresh_current_chat_if_open(&chat_id);

if !network_enabled {
let _ = self.core_sender.send(CoreMsg::Internal(Box::new(
InternalEvent::PublishMessageResult {
52 changes: 52 additions & 0 deletions rust/src/core/mod.rs
@@ -771,6 +771,7 @@ pub struct AppCore {
// Nostr kind:0 profile cache (survives across session refreshes).
profiles: HashMap<String, ProfileCache>, // hex pubkey -> cached global profile
group_profiles: HashMap<String, HashMap<String, ProfileCache>>, // chat_id -> (pubkey -> profile)
processed_group_event_ids: HashSet<EventId>,
profile_db: Option<rusqlite::Connection>,
coderabbitai[bot] marked this conversation as resolved.
chat_media_db: Option<rusqlite::Connection>,

@@ -887,6 +888,8 @@ impl AppCore {

let push_device_id = Self::load_or_create_push_device_id(&data_dir);
let push_subscribed_chat_ids = Self::load_push_subscriptions(&data_dir);
let processed_group_event_ids =
pika_marmot_runtime::load_processed_mls_event_ids(std::path::Path::new(&data_dir));

let mut this = Self {
state,
Expand Down Expand Up @@ -919,6 +922,7 @@ impl AppCore {
local_outbox: HashMap::new(),
profiles,
group_profiles: HashMap::new(),
processed_group_event_ids,
profile_db,
typing_state: HashMap::new(),
last_typing_sent: HashMap::new(),
@@ -3139,6 +3143,7 @@ impl AppCore {
self.cancel_call_offer_timeout();
self.cancel_voice_recording_ticks();

self.clear_processed_group_event_ids_cache();
let root = std::path::Path::new(&self.data_dir);
match std::fs::read_dir(root) {
Ok(entries) => {
@@ -3327,6 +3332,12 @@ impl AppCore {
ok,
error,
} => self.handle_publish_message_result(chat_id, rumor_id, ok, error),
InternalEvent::GroupSendCatchupCompleted {
chat_id,
rumor,
backlog_events,
error,
} => self.handle_group_send_catchup_completed(chat_id, rumor, backlog_events, error),
InternalEvent::ChatMediaUploadCompleted {
request_id,
uploaded_url,
@@ -4451,6 +4462,7 @@ impl AppCore {
}

pub(crate) fn handle_group_message(&mut self, event: Event) {
let event_id = event.id;
let result = {
let Some(sess) = self.session.as_mut() else {
tracing::warn!("group_message but no session");
@@ -4465,9 +4477,48 @@ impl AppCore {
}
}
};
self.note_processed_group_event_id(event_id);
self.handle_message_processing_result(result);
}

fn note_processed_group_event_id(&mut self, event_id: EventId) {
if !self.processed_group_event_ids.insert(event_id) {
return;
}
self.persist_processed_group_event_ids_cache();
}

fn note_processed_group_event_id_in_memory(&mut self, event_id: EventId) {
self.processed_group_event_ids.insert(event_id);
}

fn persist_processed_group_event_ids_cache(&mut self) {
if let Err(e) = pika_marmot_runtime::persist_processed_mls_event_ids(
std::path::Path::new(&self.data_dir),
&self.processed_group_event_ids,
) {
tracing::warn!(%e, "failed to persist processed group event ids");
}
devin-ai-integration[bot] marked this conversation as resolved.
}
Comment on lines +4484 to +4502
coderabbitai[bot], Mar 8, 2026
⚠️ Potential issue | 🟠 Major

Bound the processed-event cache by recency in memory too.

persist_processed_mls_event_ids() only trims the serialized output, and it does it after lexically sorting IDs. This HashSet therefore grows without bound for the life of the process, while the on-disk file can evict arbitrary IDs instead of the most recent ones. On a long-lived session, every group message turns into a larger sort/write than the last and the duplicate filter gets weaker once the cap is hit. Please track insertion order (for example HashSet + VecDeque/LRU) and evict from both memory and disk consistently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `rust/src/core/mod.rs` around lines 4483-4493, the in-memory processed-event
cache (processed_group_event_ids) in note_processed_group_event_id grows
unbounded because persist_processed_mls_event_ids only truncates the serialized
file; change the in-memory structure to track recency (e.g. maintain a
HashSet<EventId> plus a VecDeque<EventId> or use an LRU structure) and enforce a
MAX_PROCESSED_IDS cap on insert: when inserting in
note_processed_group_event_id, if the ID already exists move it to the back of
the recency queue, otherwise push_back and insert into the set, and while the
queue length > MAX_PROCESSED_IDS pop_front and remove that ID from the set so
memory and disk eviction are consistent; after these bounded updates still call
persist_processed_mls_event_ids with self.data_dir and the trimmed set so both
in-memory and on-disk representations match.

Author
Sure, but I see this as follow-up hardening, not a blocker for this parity fix.

coderabbitai[bot]
@reez Got it — agreed to treat this as follow-up hardening and not block this parity fix.

If you’d like, I can open a tracking issue with the following scope so it’s queued up:

  • Problem: processed_group_event_ids (in-memory) is unbounded; disk pruning alone weakens duplicate filtering and increases write costs over long sessions.
  • Proposed approach:
    • Add a small recency structure (e.g., VecDeque) alongside the HashSet.
    • Define a cap like MAX_PROCESSED_IDS (e.g., 20_000; tunable via config/env).
    • On insert: if new, push_back into deque and insert into set; if existing, move it to the back; while len > cap, pop_front and remove from set.
    • Persist only the bounded set, keeping memory and disk in sync.
  • Acceptance criteria:
    • Memory stays bounded under heavy message load.
    • Duplicate suppression behavior remains correct across restarts.
    • Unit tests cover eviction order and load-from-disk trimming.

Preference on the default cap? I can proceed with opening the issue now if you confirm.
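The insert/evict behavior in the proposed scope can be sketched as follows. `BoundedIdSet`, the `u64` ids, and the tiny cap are illustrative assumptions; for simplicity this evicts in pure insertion (FIFO) order, and the move-to-back-on-duplicate refinement described above would make it a true LRU:

```rust
use std::collections::{HashSet, VecDeque};

/// Illustrative cap; a real value (e.g. 20_000) would be tuned separately.
const MAX_PROCESSED_IDS: usize = 4;

/// Sketch of a bounded processed-ID cache: a HashSet for O(1) duplicate
/// checks plus a VecDeque recording insertion order, so eviction drops the
/// oldest entries and memory and disk can stay consistent.
struct BoundedIdSet {
    seen: HashSet<u64>,
    order: VecDeque<u64>,
}

impl BoundedIdSet {
    fn new() -> Self {
        Self {
            seen: HashSet::new(),
            order: VecDeque::new(),
        }
    }

    /// Returns true if the id was newly inserted (not a duplicate).
    fn insert(&mut self, id: u64) -> bool {
        if !self.seen.insert(id) {
            return false; // duplicate: suppress reprocessing
        }
        self.order.push_back(id);
        while self.order.len() > MAX_PROCESSED_IDS {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        true
    }
}
```

With a cap of 4, inserting five ids evicts the first, while duplicate suppression still holds for every id inside the window.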


✏️ Learnings added
Learnt from: reez
Repo: sledtools/pika PR: 513
File: rust/src/core/mod.rs:4484-4494
Timestamp: 2026-03-08T21:35:07.708Z
Learning: Repo: sledtools/pika PR: 513
File: rust/src/core/mod.rs
Learning: Bounding the in-memory processed_group_event_ids cache (e.g., via LRU/recency queue with a MAX_PROCESSED_IDS cap) is acknowledged as follow-up hardening and is not a blocker for the MLS backlog parity fix in this PR.





fn clear_processed_group_event_ids_cache(&mut self) {
self.processed_group_event_ids.clear();
let path = pika_marmot_runtime::processed_mls_event_ids_path(
std::path::Path::new(&self.data_dir),
);
match std::fs::remove_file(&path) {
Ok(()) => {}
Err(e) if e.kind() == std::io::ErrorKind::NotFound => {}
Err(e) => {
tracing::warn!(
%e,
path = %path.display(),
"failed to delete processed group event id cache file"
);
}
}
}

fn handle_message_processing_result(&mut self, result: MessageProcessingResult) {
// Phase 1: Extract mls_group_id and optional app message.
let (mls_group_id, app_msg) = match result {
@@ -4729,6 +4780,7 @@ impl AppCore {
tracing::info!(path = %db_path.display(), "deleted mdk db on logout");
}
}
self.clear_processed_group_event_ids_cache();
self.clear_push_subscriptions();
self.stop_session();
self.state.auth = AuthState::LoggedOut;
6 changes: 6 additions & 0 deletions rust/src/updates.rs
@@ -55,6 +55,12 @@ pub enum InternalEvent {
ok: bool,
error: Option<String>,
},
GroupSendCatchupCompleted {
chat_id: String,
rumor: nostr_sdk::prelude::UnsignedEvent,
backlog_events: Vec<nostr_sdk::prelude::Event>,
error: Option<String>,
},
ChatMediaUploadCompleted {
request_id: String,
uploaded_url: Option<String>,