MBOX import: global cross-source deduplication causes incomplete per-mailbox archives and silent data omission #340

**Describe the bug**

When importing MBOX files from multiple separate Gmail accounts as individual ingestion sources, three related issues emerge that compromise archive integrity:

### Issue 1: Emails silently skipped during import

During MBOX import, duplicate emails are skipped with only an INFO-level log message (`Skipping duplicate email`). No summary, warning, or UI indication shows that emails were excluded from the import, so the user cannot tell that the archive is incomplete without manually comparing email counts against the source. For an archiving tool, particularly one marketed for legal compliance, silent data omission is a critical integrity problem.
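For anyone who wants to do that manual comparison, here is a minimal sketch using Python's standard `mailbox` module. It counts the messages and distinct `Message-ID` values in one source MBOX file. The function name `count_mbox` and the returned keys are made up for illustration and are not part of Open Archiver.

```python
import mailbox


def count_mbox(path: str) -> dict:
    """Count messages and distinct Message-ID values in one MBOX file."""
    seen_ids = set()
    total = 0
    missing = 0
    box = mailbox.mbox(path, create=False)  # create=False: fail if the file is absent
    try:
        for msg in box:
            total += 1
            message_id = msg.get("Message-ID")
            if message_id is None:
                missing += 1  # no Message-ID header at all
            else:
                seen_ids.add(str(message_id).strip())
    finally:
        box.close()
    return {
        "total": total,
        "unique_message_ids": len(seen_ids),
        "missing_message_id": missing,
        # copies beyond the first for each Message-ID (e.g. Gmail label copies)
        "repeated_message_ids": total - missing - len(seen_ids),
    }
```

Comparing `unique_message_ids` per file against the per-source count in the archive gives a rough idea of how many emails were skipped for that mailbox.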

### Issue 2: Global cross-source deduplication

The duplicate detection appears to operate globally across all ingestion sources rather than being scoped per-source. When the same `message_id_header` exists in multiple mailboxes (e.g., a company-wide email received by multiple accounts), only the first-imported copy is stored. Subsequent imports of other mailboxes skip the email entirely.

This breaks per-mailbox archive integrity. In a corporate archiving scenario, if both Employee A and Employee B receive the same company-wide email, the archive must show it in both mailboxes independently. Attributing it only to whichever mailbox was imported first is factually and legally incorrect.

The `provider_msg_source_idx` composite index on `(provider_message_id, ingestion_source_id)` suggests per-source dedup may have been intended, but the application-level dedup logic appears to check globally.
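To make the difference concrete, here is a small Python sketch of the two behaviours. It is not the Open Archiver implementation (which I have not traced in detail); `Email` and `import_emails` are hypothetical names, and the only thing that changes between the two modes is the key used for the duplicate check.

```python
from typing import Iterable, List, NamedTuple


class Email(NamedTuple):
    message_id: str  # value of the Message-ID header
    source_id: str   # ingestion source the message is imported into


def import_emails(emails: Iterable[Email], per_source: bool) -> List[Email]:
    """Return the emails that would be stored after deduplication."""
    seen = set()
    stored = []
    for email in emails:
        # Global dedup keys on the Message-ID alone; per-source dedup
        # keys on the (Message-ID, ingestion source) pair.
        key = (email.message_id, email.source_id) if per_source else email.message_id
        if key in seen:
            continue  # treated as a duplicate and skipped
        seen.add(key)
        stored.append(email)
    return stored


batch = [
    Email("<1@example.com>", "account-a"),
    Email("<1@example.com>", "account-a"),  # label copy within the same source
    Email("<1@example.com>", "account-b"),  # same email received by another mailbox
]
global_result = import_emails(batch, per_source=False)
scoped_result = import_emails(batch, per_source=True)
```

With the global key only the `account-a` copy is stored, which matches what I observe. With the scoped key the label copy is still dropped, but `account-b` keeps its own entry.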

### Issue 3: Search results do not indicate source inbox

When searching for an email that was deduplicated via Issue 2, the search results show the email with its To/From headers but do not indicate which ingestion source it belongs to. The result is that:

- An email addressed TO `account-b@example.com` appears in search results
- But navigating to Archive > Archived Emails filtered by the `account-b` ingestion source shows it as missing
- The email is actually attributed to `account-a`'s ingestion source (whichever was imported first)

This creates a confusing experience where search says the email exists, but the per-inbox archive view says it doesn't.

**To Reproduce**

1. Create ingestion source A (e.g., `account-a@example.com`) via MBOX import containing an email with a specific Message-ID
2. Wait for import to complete
3. Create ingestion source B (e.g., `account-b@example.com`) via MBOX import where the same email (same Message-ID) was also received
4. Wait for import to complete, then observe `Skipping duplicate email` in the logs with no summary or count
5. Navigate to Archive > Archived Emails and filter by ingestion source B: the email is missing
6. Search for the email by subject: it appears in results but is attributed to source A, not source B

**Expected behavior**

1. **Import transparency:** When emails are skipped during import, a summary should be provided (e.g., "Imported X emails, skipped Y duplicates"). Ideally, skipped emails should be logged with enough detail (Message-ID, subject) to allow the user to verify the skip was correct.

2. **Per-source deduplication:** The duplicate check should be scoped to `(message_id_header, ingestion_source_id)`. This prevents duplicates within a single source (important for Gmail Takeout where label copies produce multiple MBOX entries for the same email) while allowing the same email to exist across multiple sources as independent archive entries.

3. **Search source attribution:** Search results should indicate which ingestion source an email belongs to, so users can understand why an email appears in search but not in a specific inbox's archive view.
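As a rough illustration of the import summary in item 1, the sketch below tracks counters during an import and emits one summary line at the end. The class name `ImportSummary`, the logger name, and the exact message format are assumptions for illustration only, not existing Open Archiver code.

```python
import logging
from dataclasses import dataclass, field
from typing import List, Tuple

logger = logging.getLogger("mbox_import")  # hypothetical logger name


@dataclass
class ImportSummary:
    imported: int = 0
    skipped: int = 0
    errors: int = 0
    skipped_details: List[Tuple[str, str]] = field(default_factory=list)

    def record_skip(self, message_id: str, subject: str) -> None:
        """Count a skipped duplicate and log enough detail to verify it."""
        self.skipped += 1
        self.skipped_details.append((message_id, subject))
        logger.info("Skipping duplicate email message_id=%s subject=%r",
                    message_id, subject)

    def report(self) -> str:
        """Log and return a one-line summary of the whole import."""
        line = (f"Imported {self.imported} emails, "
                f"skipped {self.skipped} duplicates, {self.errors} errors")
        # Raise the level when anything was skipped or failed so the
        # summary is not buried among routine INFO lines.
        level = logging.WARNING if (self.skipped or self.errors) else logging.INFO
        logger.log(level, line)
        return line
```

The same three counters could back the UI status of the ingestion source, so the numbers in the logs and in the interface always agree.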

**Database evidence**

```sql
-- Email addressed TO account-b but attributed to account-a's ingestion source
SELECT ingestion_source_id, subject, sender_email, user_email, recipients 
FROM archived_emails 
WHERE subject LIKE '%specific subject%';

-- Result:
-- ingestion_source_id = (account-a's UUID)
-- user_email = account-a@example.com@mbox.local  
-- recipients = {"to": [{"address": "account-b@example.com"}]}
```

The email was sent TO account-b but attributed to account-a because account-a was imported first and the global dedup skipped it during account-b's import.

**Additional context: user_email derivation**

A related observation: `user_email` is derived from the MBOX filename or ingestion source name (resulting in values like `account-a@example.com@mbox.local`) rather than from actual email headers (`To`, `Delivered-To`, `X-Delivered-To`). This means the `user_email` field does not reflect the actual recipient of the email, only the import source. Combined with Issue 2, this causes emails to have incorrect recipient attribution.
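One possible way to derive the recipient from headers instead, sketched with Python's standard `email` package. The header priority order and the function name `derive_user_email` are my assumptions. Sent mail usually has no `Delivered-To` header, so a real implementation would still need a fallback to the ingestion source's own address.

```python
from email import message_from_string
from email.utils import getaddresses
from typing import Optional

# Assumed priority: delivery headers first, then the visible To header.
HEADER_PRIORITY = ("Delivered-To", "X-Delivered-To", "To")


def derive_user_email(raw_message: str) -> Optional[str]:
    """Return the first address found in the prioritized headers, lowercased."""
    msg = message_from_string(raw_message)
    for header in HEADER_PRIORITY:
        values = msg.get_all(header, [])
        for _display_name, address in getaddresses(values):
            if address:
                return address.lower()
    return None  # no usable header; caller should fall back to the source address
```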

**System:**

- Open Archiver Version: latest (pulled March 2026)
- Deployment: Docker Compose with PostgreSQL 17-alpine, Meilisearch v1.15, Valkey 8-alpine, Tika 3.2.2.0-full
- Import method: MBOX via localFilePath (Google Takeout files + Thunderbird ImportExportTools NG exports)
- Scale: 26 Gmail accounts, ~220k total emails after import, ~383k expected based on source MBOX email counts

**Relevant logs:**

```
[0] [0] [14:18:55.763] INFO (351): Skipping duplicate email
[0] [0] [14:18:55.830] INFO (351): Skipping duplicate email
[0] [0] [14:18:55.872] INFO (351): Skipping duplicate email
(repeated thousands of times during import of subsequent ingestion sources)
```

No error, warning, or summary is produced. The skips are logged only at INFO level, with no detail about which email was skipped or why.

**Suggested fix**

- **Issue 1:** Add a post-import summary (total imported, total skipped, total errors) visible in both logs and the UI ingestion source status.
- **Issue 2:** Scope the duplicate check to `(message_id_header, ingestion_source_id)` instead of `message_id_header` alone.
- **Issue 3:** Include ingestion source name/identifier in search result cards.
