mailbox.mbox malformed 'From ' lines not being detected/handled #93376

Soddentrough · 2022-05-31T12:52:25Z

Bug report

mailbox.mbox (class mbox) builds the table of contents (_generate_toc() ) by matching on lines starting with b'From '.
This is RFC compliant, however, malicious emails/senders will sometimes intentionally break this causing unexpected behavior.

I suggest this be considered a bug as :

In these cases it is not possible to ask the sender to fix their MTA but we still need to parse the message
There are cases where good senders do this by mistake - e.g. poor line wrapping in quoted-printable content
Many common end-user email programs gracefully handle this scenario already

I propose exposing a custom 'From ' line delimiter with existing behavior maintained as a default :

diff of mailbox.py :

847c847
<     def __init__(self, path, factory=None, create=True, fromline=b'From '):
---
>     def __init__(self, path, factory=None, create=True):
850d849
<         self._fromline = fromline
865c864
<             if line.startswith(self._fromline):
---
>             if line.startswith(b'From '):

There are more sophisticated methods which could be explored; for example is_from() function in mutt, or a regex over a byte array.

Your environment

python 3.9.13

Linked PRs

gh-93376: Allow override of mbox From matching #131415

The text was updated successfully, but these errors were encountered:

AA-Turner · 2022-05-31T22:09:19Z

How would you use the proposed argument? Can you show an example where of a malicious header that your solution would not break on?

A

Soddentrough · 2022-06-01T01:53:33Z

Hi. It's not a malicious header, it's wayward '^From ' lines in the body of the message which confuses mailbox.mbox into falsely thinking there is a new message marker header. Message count is then wrong and data is missed.

Unanchored "From " lines (missing the required ">") can be intentional but is also a common enough problem in legitimate emails that most readers are forced to compensate. This is often by looking for '^From ' followed by a string resembling an email address / date field etc (e.g: https://github.com/muttmua/mutt/blob/master/from.c).

My solution would allow people to open a box and specify an alternative line to key on, for example :

inbox = mailbox.mbox(fname, fromline=b'From MAILER-DAEMON')

A regex would be the more flexible solution but this would be marginally faster.

Soddentrough · 2022-06-03T02:30:45Z

Digging around and there really doesn't appear to be any existing way of handling this. I've looked through the documentation and "mangle_from_" (https://docs.python.org/3/library/email.policy.html#email.policy.Policy.mangle_from_) is only relevant when creating a message object, not when reading an mbox.

And passing a custom email Policy object via "factory=" does not allow for a custom 'From ' line handle on message reading (as far as I'm able to tell).

The message separator of '^From ' is hardcoded in mailbox.py and will always break on mbox files where this is malformed.

This issue has also been raised here and seems unresolved.

Soddentrough · 2022-06-03T03:15:14Z

I have an updated patch which allows for passing either a byte string or re.Pattern making these all possible (see test mbox at the end):

>>> inbox = mailbox.mbox('test.mbox', fromline=re.compile(rb'^From MAILER-DAEMON \w{3} \w{3}'))
>>> [ m for m in inbox ]
[<mailbox.mboxMessage object at 0x7f8778e9de20>]

>>> inbox = mailbox.mbox('test.mbox', fromline=b'From MAILER')
>>> [ m for m in inbox ]
[<mailbox.mboxMessage object at 0x7f8778708370>]

# And the default which will normally work but fail in cases of unanchored "From " lines:
>>> inbox = mailbox.mbox('test.mbox')
>>> [ m for m in inbox ]
[<mailbox.mboxMessage object at 0x7f87787cc970>, <mailbox.mboxMessage object at 0x7f87787d6f10>, <mailbox.mboxMessage object at 0x7f87787d64c0>]

Only obvious downside is the added requirement of the 're' module which fractionally increases startup time.
lib/python3.9.13/lib/mailbox.py

22,23d21
< import re
<
849c847
<     def __init__(self, path, factory=None, create=True, fromline=b'From '):
---
>     def __init__(self, path, factory=None, create=True):
852d849
<         self._fromline = fromline
867,872c864
<             newline_sep = False
<             if isinstance(self._fromline, re.Pattern):
<                 newline_sep = re.match(self._fromline, line)
<             else:
<                 newline_sep = line.startswith(self._fromline)
<             if newline_sep:
---
>             if line.startswith(b'From '):

From MAILER-DAEMON Wed Jun  1 03:11:31 2022
Received: from host.host.com ([192.168.0.1])
 by host2.host.com
Date: Mon, 3 May 2022 15:40:25 +0000
From: "Friendly From" <[email protected]>
Subject: Subject
To: <[email protected]>
Message-id: <[email protected]>

Hey

So this is some text but I need to make sure
From lines are handled correctly.

Lines like this:
> From something
Aren't always anchored
Some html here
<https://links.and.content/path>

More text here

And from here on

Anyway,
From here it's more text

But not a new message.

sebbASF · 2022-08-04T16:40:02Z

All valid From lines should end with a date; in particular they should end with a 4 digit year.
Matching on /^From .+ \d\d\d\d$/ would eliminate a lot of spurious From lines .
The regex can be made more specific if necessary.

I agree that it would be very helpful to be able to override the check in some way.

An alternative might be to allow the user to provide a function which returns True/False.
This would allow more sophisticated matching, and would avoid the need to import re for the default case.

digitalsignalperson · 2023-07-01T03:38:46Z

From ' lines in the body of the message which confuses mailbox.mbox into falsely thinking there is a new message marker header. Message count is then wrong and data is missed.

Yesss this. I'm trying to convert mbox's from thunderbird to maildir and this is a funny bug to run into. There are really a LOT of cases where random emails have lines starting with "From ".

From the world famous...
From the historic...
From the great...

etc.

Here's a sample of what I'm dealing with:

with open(mbox_path, 'rb') as f:
    data = f.read()

from_lines = [line for line in data.splitlines() if line.startswith(b'From ')]
Counter([len(line) for line in from_lines])
Out[43]: 
Counter({31: 9558,
         75: 21,
         76: 20,
         74: 7,
         70: 4,
         69: 1,
         72: 1,
         24: 1,
         82: 1,
         68: 1,
         71: 1})

obviously the length 31 items are the desired header lines like b'From - Tue Jun 6 20:43:03 2023'

While we could tweak line.startswith(b'From ') with constraints for length, date format, that only reduces the probability of a random line in an email body containing something that looks like that.

I'm curious how thunderbird does it, given my mbox files are from there.

digitalsignalperson · 2023-07-01T04:56:26Z

some interesting bits from thunderbird

https://github.com/mozilla/releases-comm-central/blob/78245f927ad4c32e6896f5ceee7483083f90bf59/mailnews/base/src/converterWorker.js#L154

  // A regexp to match mbox separator lines. Separator lines in the wild can
  // have all sorts of forms, for example:
  //
  // "From "
  // "From MAILER-DAEMON Fri Jul  8 12:08:34 2011"
  // "From - Mon Jul 11 12:08:34 2011"
  // "From [email protected] Fri Jul  8 12:08:34 2011"
  //
  // So we accept any line beginning with "From " and ignore the rest of it.
  //
  // We also require a message header on the next line, in order
  // to better cope with unescaped "From " lines in the message body.
  // note: the first subexpression matches the separator line, so
  // that it can be removed from the input.
  let sepRE = /^(From (?:.*?)\r?\n)[\x21-\x7E]+:/gm;

https://github.com/mozilla/releases-comm-central/blob/78245f927ad4c32e6896f5ceee7483083f90bf59/mailnews/local/src/nsParseMailbox.cpp#L616

  /* The required format is
     From jwz  Fri Jul  1 09:13:09 1994
   But we should also allow at least:
     From jwz  Fri, Jul 01 09:13:09 1994
     From jwz  Fri Jul  1 09:13:09 1994 PST
     From jwz  Fri Jul  1 09:13:09 1994 (+0700)

   We can't easily call XP_ParseTimeString() because the string is not
   null terminated (ok, we could copy it after a quick check...) but
   XP_ParseTimeString() may be too lenient for our purposes.

   DANGER!!  The released version of 2.0b1 was (on some systems,
   some Unix, some NT, possibly others) writing out envelope lines
   like "From - 10/13/95 11:22:33" which STRICT_ENVELOPE will reject!
   */

digitalsignalperson · 2023-07-02T02:07:52Z

This is what worked for my purposes for a one-time thing

    def _generate_toc(self):
        """Generate key-to-(start, stop) table of contents."""
        starts, stops = [], []
        last_was_empty = False
        self._file.seek(0)
        while True:
            line_pos = self._file.tell()
            line = self._file.readline()
            envelope_found = False
            if line.startswith(b'From '):
                next_line_pos = self._file.tell()
                next_line = self._file.readline()
                candidates = [
                    b'X-Mozilla-Status:',
                    b'X-Mozilla-Status2:',
                    b'Delivered-To: ',
                    b'Subject: ',
                    b'Message-ID: ',
                    b'Return-Path: ',
                    b'To: ',
                    b'Content-Type: ',
                    b'MIME-Version: ',
                    b'FCC: ',
                    b'BCC: ',
                    b'X-Identity-Key: ',
                ]
                for x in candidates:
                    if next_line.startswith(x):
                        envelope_found = True
                        break
            if envelope_found:
                if len(stops) < len(starts):
                    if last_was_empty:
                        stops.append(line_pos - len(linesep))
                    else:
                        # The last line before the "From " line wasn't
                        # blank, but we consider it a start of a
                        # message anyway.
                        stops.append(line_pos)
                starts.append(line_pos)
                last_was_empty = False
            elif not line:
                if last_was_empty:
                    stops.append(line_pos - len(linesep))
                else:
                    stops.append(line_pos)
                break
            elif not next_line:
                if last_was_empty:
                    stops.append(next_line_pos - len(linesep))
                else:
                    stops.append(next_line_pos)
                break
            elif line == linesep:
                last_was_empty = True
            else:
                last_was_empty = False
        self._toc = dict(enumerate(zip(starts, stops)))
        self._next_key = len(self._toc)
        self._file_length = self._file.tell()

I tried the regular expression suggested in the above comment from thunderbird source code, but it was not specific enough (letting a "From " line be valid if the next line has a "Something: Blah"). So I just looked at the header fields after the "From " line in my mbox files and crafted a limited allow-list.

A better method should also include tracking multipart messages and when parts begin and end. Emails with git patches can be especially messy if you interpreted the stuff that looks like email headers in git commit messages.

sebbASF · 2023-07-02T09:07:53Z

Note that mail headers can (and do) use either case, so the samples above may not work in all instances.
Seems to me it would be better to be somewhat stricter when matching the From line, and more relaxed when checking for a following header line.

This fixes python#93376

Soddentrough added the type-bug An unexpected behavior, bug, or error label May 31, 2022

AA-Turner added topic-email type-feature A feature request or enhancement 3.11 only security fixes 3.10 only security fixes 3.12 bugs and security fixes and removed type-bug An unexpected behavior, bug, or error labels May 31, 2022

This was referenced Aug 4, 2022

gh-93376: Allow override of mbox From matching #95689

Closed

gh-93376: Allow override of mbox From matching #95774

Closed

erlend-aasland added 3.13 bugs and security fixes and removed 3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes labels Jan 5, 2024

sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 17, 2025

pythongh-93376 Allow override of mbox From matching

7f814c7

This fixes python#93376

sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 17, 2025

Merge branch 'python:main' into pythongh-93376

7723a02

sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 18, 2025

Merge branch 'python:main' into pythongh-93376

b13cafc

bedevere-app bot mentioned this issue Mar 18, 2025

gh-93376: Allow override of mbox From matching #131415

Open

picnixz added stdlib Python modules in the Lib dir and removed 3.13 bugs and security fixes labels Mar 18, 2025

sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 18, 2025

Merge branch 'main' into pythongh-93376

c53f4f8

sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 18, 2025

Merge branch 'main' into pythongh-93376

d9cffd9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mailbox.mbox malformed 'From ' lines not being detected/handled #93376

mailbox.mbox malformed 'From ' lines not being detected/handled #93376

Soddentrough commented May 31, 2022 •

edited by bedevere-app bot

Loading

AA-Turner commented May 31, 2022

Soddentrough commented Jun 1, 2022

Soddentrough commented Jun 3, 2022

Soddentrough commented Jun 3, 2022 •

edited

Loading

sebbASF commented Aug 4, 2022

digitalsignalperson commented Jul 1, 2023

digitalsignalperson commented Jul 1, 2023

digitalsignalperson commented Jul 2, 2023

sebbASF commented Jul 2, 2023

mailbox.mbox malformed 'From ' lines not being detected/handled #93376

mailbox.mbox malformed 'From ' lines not being detected/handled #93376

Comments

Soddentrough commented May 31, 2022 • edited by bedevere-app bot Loading

Linked PRs

AA-Turner commented May 31, 2022

Soddentrough commented Jun 1, 2022

Soddentrough commented Jun 3, 2022

Soddentrough commented Jun 3, 2022 • edited Loading

sebbASF commented Aug 4, 2022

digitalsignalperson commented Jul 1, 2023

digitalsignalperson commented Jul 1, 2023

digitalsignalperson commented Jul 2, 2023

sebbASF commented Jul 2, 2023

Soddentrough commented May 31, 2022 •

edited by bedevere-app bot

Loading

Soddentrough commented Jun 3, 2022 •

edited

Loading