MBOX import: global cross-source deduplication causes incomplete per-mailbox archives and silent data omission #340

@thisisamateurhour

Describe the bug

When importing MBOX files from multiple separate Gmail accounts as individual ingestion sources, three related issues emerge that compromise archive integrity:

Issue 1: Emails silently skipped during import

During MBOX import, emails are skipped with no notice beyond an INFO-level log message (Skipping duplicate email). There is no summary, warning, or UI indication that emails were excluded from the import. The user cannot tell that the archive is incomplete without manually comparing email counts against the source. For an archiving tool, particularly one marketed for legal compliance, data omission without user awareness is a critical integrity problem.

Issue 2: Global cross-source deduplication

The duplicate detection appears to operate globally across all ingestion sources rather than being scoped per-source. When the same message_id_header exists in multiple mailboxes (e.g., a company-wide email received by multiple accounts), only the first-imported copy is stored. Subsequent imports of other mailboxes skip the email entirely.

This breaks per-mailbox archive integrity. In a corporate archiving scenario, if both Employee A and Employee B receive the same company-wide email, the archive must show it in both mailboxes independently. Attributing it only to whichever mailbox was imported first is factually and legally incorrect.

The provider_msg_source_idx composite index on (provider_message_id, ingestion_source_id) suggests per-source dedup may have been intended, but the application-level dedup logic appears to check globally.
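
To make the difference between the two scopes concrete, here is a minimal Python sketch. It is not Open Archiver's actual code: `IncomingEmail`, `dedup_global`, and `dedup_per_source` are hypothetical names, and the only assumption carried over from this report is that the duplicate check is keyed on the Message-ID header.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class IncomingEmail(NamedTuple):
    """Hypothetical stand-in for one message seen during an import."""
    message_id_header: str
    ingestion_source_id: str


def dedup_global(emails: Iterable[IncomingEmail]) -> List[IncomingEmail]:
    """Key on Message-ID alone (the behavior this issue appears to observe)."""
    seen: Set[str] = set()
    stored: List[IncomingEmail] = []
    for email in emails:
        if email.message_id_header in seen:
            continue  # skipped even if it belongs to a different mailbox
        seen.add(email.message_id_header)
        stored.append(email)
    return stored


def dedup_per_source(emails: Iterable[IncomingEmail]) -> List[IncomingEmail]:
    """Key on (Message-ID, source), so each mailbox keeps its own copy."""
    seen: Set[Tuple[str, str]] = set()
    stored: List[IncomingEmail] = []
    for email in emails:
        key = (email.message_id_header, email.ingestion_source_id)
        if key in seen:
            continue  # true duplicate inside the same source
        seen.add(key)
        stored.append(email)
    return stored


if __name__ == "__main__":
    batch = [
        IncomingEmail("<all-hands@example.com>", "source-a"),
        IncomingEmail("<all-hands@example.com>", "source-a"),  # label copy
        IncomingEmail("<all-hands@example.com>", "source-b"),
    ]
    print(len(dedup_global(batch)), len(dedup_per_source(batch)))
```

With the per-source key, the repeated copy inside source-a is still dropped, but source-b keeps its own entry.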

Issue 3: Search results do not indicate source inbox

When searching for an email that was deduplicated via Issue 2, the search results show the email with its To/From headers but do not indicate which ingestion source it belongs to. The result is that:

  • An email addressed TO account-b's address appears in search results
  • But navigating to Archive > Archived Emails filtered by the account-b ingestion source shows it as missing
  • The email is actually attributed to account-a's ingestion source (whichever was imported first)

This creates a confusing experience where search says the email exists, but the per-inbox archive view says it doesn't.

To Reproduce

  1. Create ingestion source A (e.g., account-a's mailbox) via MBOX import containing an email with a specific Message-ID
  2. Wait for import to complete
  3. Create ingestion source B (e.g., account-b's mailbox) via MBOX import where the same email (same Message-ID) was also received
  4. Wait for import to complete — observe Skipping duplicate email in logs with no summary or count
  5. Navigate to Archive > Archived Emails and filter by ingestion source B — the email is missing
  6. Search for the email by subject — it appears in results but attributed to source A, not source B

Expected behavior

  1. Import transparency: When emails are skipped during import, a summary should be provided (e.g., "Imported X emails, skipped Y duplicates"). Ideally, skipped emails should be logged with enough detail (Message-ID, subject) to allow the user to verify the skip was correct.

  2. Per-source deduplication: The duplicate check should be scoped to (message_id_header, ingestion_source_id). This prevents duplicates within a single source (important for Gmail Takeout where label copies produce multiple MBOX entries for the same email) while allowing the same email to exist across multiple sources as independent archive entries.

  3. Search source attribution: Search results should indicate which ingestion source an email belongs to, so users can understand why an email appears in search but not in a specific inbox's archive view.
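
As an illustration of the import summary described in point 1, here is a hedged Python sketch. `ImportSummary`, `import_with_summary`, and the `is_duplicate` callback are hypothetical names invented for this example; the real importer's structure is not known from this report.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("mbox_import")  # hypothetical logger name


@dataclass
class ImportSummary:
    imported: int = 0
    skipped: int = 0
    # (Message-ID, subject) of every skipped email, so skips can be verified
    skipped_details: List[Tuple[str, str]] = field(default_factory=list)

    def __str__(self) -> str:
        return f"Imported {self.imported} emails, skipped {self.skipped} duplicates"


def import_with_summary(
    emails: Iterable[Tuple[str, str]],
    is_duplicate: Callable[[str], bool],
) -> ImportSummary:
    """Walk (message_id, subject) pairs and record what was skipped."""
    summary = ImportSummary()
    for message_id, subject in emails:
        if is_duplicate(message_id):
            summary.skipped += 1
            summary.skipped_details.append((message_id, subject))
            logger.info("Skipping duplicate email message_id=%s subject=%r",
                        message_id, subject)
            continue
        summary.imported += 1
    # Raise the level when anything was skipped so it is hard to miss.
    logger.log(logging.WARNING if summary.skipped else logging.INFO, "%s", summary)
    return summary
```

The returned object could back both the final log line and the ingestion source status in the UI; keeping the skipped Message-IDs makes it possible to audit each skip afterwards.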

Database evidence

-- Email addressed TO account-b but attributed to account-a's ingestion source
SELECT ingestion_source_id, subject, sender_email, user_email, recipients 
FROM archived_emails 
WHERE subject LIKE '%specific subject%';

-- Result:
-- ingestion_source_id = (account-a's UUID)
-- user_email = (account-a's address)@mbox.local
-- recipients = {"to": [{"address": "(account-b's address)"}]}

The email was sent TO account-b but attributed to account-a because account-a was imported first and the global dedup skipped it during account-b's import.

Additional context: user_email derivation

A related observation: user_email is derived from the MBOX filename or ingestion source name (resulting in values like (account address)@mbox.local) rather than from the actual email headers (To, Delivered-To, X-Delivered-To). As a result, the user_email field reflects only the import source, not the actual recipient of the email. Combined with Issue 2, this gives emails an incorrect recipient attribution.
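
For illustration, a header-based derivation could look like the following Python sketch, which uses only the standard `email` package. The function name `derive_user_email`, the header priority order, and the fallback argument are assumptions made for this example, not existing behavior.

```python
from email import message_from_string
from email.utils import getaddresses

# Assumed priority: delivery headers first, then the visible To header.
CANDIDATE_HEADERS = ("Delivered-To", "X-Delivered-To", "To")


def derive_user_email(raw_message: str, fallback: str) -> str:
    """Return the first address found in the candidate headers.

    Falls back to the given value (for example the source-derived
    address) when none of the headers contains an address.
    """
    msg = message_from_string(raw_message)
    for header in CANDIDATE_HEADERS:
        values = msg.get_all(header, [])
        for _display_name, address in getaddresses(values):
            if address:
                return address.lower()
    return fallback
```

Note that To can list several recipients or a mailing list, so this sketch only trusts it after the delivery headers; a real implementation would likely need to match the candidates against the mailbox owner.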

System:

  • Open Archiver Version: latest (pulled March 2026)
  • Deployment: Docker Compose with PostgreSQL 17-alpine, Meilisearch v1.15, Valkey 8-alpine, Tika 3.2.2.0-full
  • Import method: MBOX via localFilePath (Google Takeout files + Thunderbird ImportExportTools NG exports)
  • Scale: 26 Gmail accounts, ~220k total emails after import, ~383k expected based on source MBOX email counts (a gap of ~163k, about 43% of the expected total)
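
One way to obtain per-file source counts for this kind of comparison is Python's standard `mailbox` module. The sketch below is an assumption about how such a check could be done, not necessarily how the figures above were produced; `count_mbox` is a hypothetical helper name.

```python
import mailbox
from typing import Set, Tuple


def count_mbox(path: str) -> Tuple[int, int]:
    """Return (total messages, distinct Message-ID values) for one MBOX file.

    Messages without a Message-ID header count toward the total only.
    """
    box = mailbox.mbox(path, create=False)  # do not create a missing file
    total = 0
    message_ids: Set[str] = set()
    try:
        for message in box:
            total += 1
            message_id = message.get("Message-ID")
            if message_id:
                message_ids.add(message_id.strip())
    finally:
        box.close()
    return total, len(message_ids)
```

The distinct count is the number a per-source dedup would be expected to keep for that file, while the difference between the two numbers shows how many entries are repeats within the file (for example Gmail Takeout label copies).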

Relevant logs:

[0] [0] [14:18:55.763] INFO (351): Skipping duplicate email
[0] [0] [14:18:55.830] INFO (351): Skipping duplicate email
[0] [0] [14:18:55.872] INFO (351): Skipping duplicate email
(repeated thousands of times during import of subsequent ingestion sources)

No error, warning, or summary is produced — the skips are logged only at INFO level with no detail about which email was skipped or why.
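
Until a summary exists, the skips can at least be counted from a saved copy of the logs. This Python sketch assumes the line format shown above; `count_skips` is a hypothetical helper, and it cannot say which emails were skipped because the log lines carry no identifiers.

```python
import re
from typing import Iterable

# Matches lines such as:
# [0] [0] [14:18:55.763] INFO (351): Skipping duplicate email
SKIP_LINE = re.compile(r"\bINFO\b.*Skipping duplicate email")


def count_skips(lines: Iterable[str]) -> int:
    """Count log lines that report a skipped duplicate."""
    return sum(1 for line in lines if SKIP_LINE.search(line))
```

For example, `count_skips(open("import.log", encoding="utf-8"))` would return the number of skipped emails recorded in a hypothetical import.log file.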

Suggested fix

  • Issue 1: Add a post-import summary (total imported, total skipped, total errors) visible in both logs and the UI ingestion source status.
  • Issue 2: Scope the duplicate check to (message_id_header, ingestion_source_id) instead of message_id_header alone.
  • Issue 3: Include ingestion source name/identifier in search result cards.
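
For Issue 3, a rough Python sketch of attaching a source label to each search hit before rendering. It assumes that every search document already carries an `ingestion_source_id` field and that a mapping from source IDs to display names is available; both are assumptions, and `annotate_hits` is a hypothetical name.

```python
from typing import Dict, List


def annotate_hits(hits: List[dict], source_names: Dict[str, str]) -> List[dict]:
    """Return copies of the hits with a human-readable source name added."""
    annotated = []
    for hit in hits:
        source_id = hit.get("ingestion_source_id")
        name = source_names.get(source_id, "unknown source")
        annotated.append({**hit, "ingestion_source_name": name})
    return annotated
```

A result card could then show the source name next to the To/From headers, which would explain why an email appears in search but not in another inbox's archive view.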

Labels: bug