Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mailbox.mbox malformed 'From ' lines not being detected/handled #93376

Open
Soddentrough opened this issue May 31, 2022 · 9 comments
Open

mailbox.mbox malformed 'From ' lines not being detected/handled #93376

Soddentrough opened this issue May 31, 2022 · 9 comments
Labels
stdlib Python modules in the Lib dir topic-email type-feature A feature request or enhancement

Comments

@Soddentrough
Copy link

Soddentrough commented May 31, 2022

Bug report

mailbox.mbox (class mbox) builds the table of contents (_generate_toc() ) by matching on lines starting with b'From '.
This is RFC compliant, however, malicious emails/senders will sometimes intentionally break this causing unexpected behavior.

I suggest this be considered a bug as :

  • In these cases it is not possible to ask the sender to fix their MTA but we still need to parse the message
  • There are cases where good senders do this by mistake - e.g. poor line wrapping in quoted-printable content
  • Many common end-user email programs gracefully handle this scenario already

I propose exposing a custom 'From ' line delimiter with existing behavior maintained as a default :

diff of mailbox.py :

847c847
<     def __init__(self, path, factory=None, create=True, fromline=b'From '):
---
>     def __init__(self, path, factory=None, create=True):
850d849
<         self._fromline = fromline
865c864
<             if line.startswith(self._fromline):
---
>             if line.startswith(b'From '):

There are more sophisticated methods which could be explored; for example is_from() function in mutt, or a regex over a byte array.

Your environment

python 3.9.13

Linked PRs

@Soddentrough Soddentrough added the type-bug An unexpected behavior, bug, or error label May 31, 2022
@AA-Turner AA-Turner added topic-email type-feature A feature request or enhancement 3.11 only security fixes 3.10 only security fixes 3.12 bugs and security fixes and removed type-bug An unexpected behavior, bug, or error labels May 31, 2022
@AA-Turner
Copy link
Member

How would you use the proposed argument? Can you show an example where of a malicious header that your solution would not break on?

A

@Soddentrough
Copy link
Author

Hi. It's not a malicious header, it's wayward '^From ' lines in the body of the message which confuses mailbox.mbox into falsely thinking there is a new message marker header. Message count is then wrong and data is missed.

Unanchored "From " lines (missing the required ">") can be intentional but is also a common enough problem in legitimate emails that most readers are forced to compensate. This is often by looking for '^From ' followed by a string resembling an email address / date field etc (e.g: https://github.com/muttmua/mutt/blob/master/from.c).

My solution would allow people to open a box and specify an alternative line to key on, for example :

inbox = mailbox.mbox(fname, fromline=b'From MAILER-DAEMON')

A regex would be the more flexible solution but this would be marginally faster.

@Soddentrough
Copy link
Author

Digging around and there really doesn't appear to be any existing way of handling this. I've looked through the documentation and "mangle_from_" (https://docs.python.org/3/library/email.policy.html#email.policy.Policy.mangle_from_) is only relevant when creating a message object, not when reading an mbox.

And passing a custom email Policy object via "factory=" does not allow for a custom 'From ' line handle on message reading (as far as I'm able to tell).

The message separator of '^From ' is hardcoded in mailbox.py and will always break on mbox files where this is malformed.

This issue has also been raised here and seems unresolved.

@Soddentrough
Copy link
Author

Soddentrough commented Jun 3, 2022

I have an updated patch which allows for passing either a byte string or re.Pattern making these all possible (see test mbox at the end):

>>> inbox = mailbox.mbox('test.mbox', fromline=re.compile(rb'^From MAILER-DAEMON \w{3} \w{3}'))
>>> [ m for m in inbox ]
[<mailbox.mboxMessage object at 0x7f8778e9de20>]

>>> inbox = mailbox.mbox('test.mbox', fromline=b'From MAILER')
>>> [ m for m in inbox ]
[<mailbox.mboxMessage object at 0x7f8778708370>]

# And the default which will normally work but fail in cases of unanchored "From " lines:
>>> inbox = mailbox.mbox('test.mbox')
>>> [ m for m in inbox ]
[<mailbox.mboxMessage object at 0x7f87787cc970>, <mailbox.mboxMessage object at 0x7f87787d6f10>, <mailbox.mboxMessage object at 0x7f87787d64c0>]

Only obvious downside is the added requirement of the 're' module which fractionally increases startup time.
lib/python3.9.13/lib/mailbox.py

22,23d21
< import re
<
849c847
<     def __init__(self, path, factory=None, create=True, fromline=b'From '):
---
>     def __init__(self, path, factory=None, create=True):
852d849
<         self._fromline = fromline
867,872c864
<             newline_sep = False
<             if isinstance(self._fromline, re.Pattern):
<                 newline_sep = re.match(self._fromline, line)
<             else:
<                 newline_sep = line.startswith(self._fromline)
<             if newline_sep:
---
>             if line.startswith(b'From '):
From MAILER-DAEMON Wed Jun  1 03:11:31 2022
Received: from host.host.com ([192.168.0.1])
 by host2.host.com
Date: Mon, 3 May 2022 15:40:25 +0000
From: "Friendly From" <[email protected]>
Subject: Subject
To: <[email protected]>
Message-id: <[email protected]>

Hey

So this is some text but I need to make sure
From lines are handled correctly.

Lines like this:
> From something
Aren't always anchored
Some html here
<https://links.and.content/path>

More text here

And from here on

Anyway,
From here it's more text

But not a new message.

@sebbASF
Copy link

sebbASF commented Aug 4, 2022

All valid From lines should end with a date; in particular they should end with a 4 digit year.
Matching on /^From .+ \d\d\d\d$/ would eliminate a lot of spurious From lines .
The regex can be made more specific if necessary.

I agree that it would be very helpful to be able to override the check in some way.

An alternative might be to allow the user to provide a function which returns True/False.
This would allow more sophisticated matching, and would avoid the need to import re for the default case.

@digitalsignalperson
Copy link

From ' lines in the body of the message which confuses mailbox.mbox into falsely thinking there is a new message marker header. Message count is then wrong and data is missed.

Yesss this. I'm trying to convert mbox's from thunderbird to maildir and this is a funny bug to run into. There are really a LOT of cases where random emails have lines starting with "From ".

  • From the world famous...
  • From the historic...
  • From the great...

etc.

Here's a sample of what I'm dealing with:

with open(mbox_path, 'rb') as f:
    data = f.read()

from_lines = [line for line in data.splitlines() if line.startswith(b'From ')]
Counter([len(line) for line in from_lines])
Out[43]: 
Counter({31: 9558,
         75: 21,
         76: 20,
         74: 7,
         70: 4,
         69: 1,
         72: 1,
         24: 1,
         82: 1,
         68: 1,
         71: 1})

obviously the length 31 items are the desired header lines like b'From - Tue Jun 6 20:43:03 2023'

While we could tweak line.startswith(b'From ') with constraints for length, date format, that only reduces the probability of a random line in an email body containing something that looks like that.

I'm curious how thunderbird does it, given my mbox files are from there.

@digitalsignalperson
Copy link

some interesting bits from thunderbird

https://github.com/mozilla/releases-comm-central/blob/78245f927ad4c32e6896f5ceee7483083f90bf59/mailnews/base/src/converterWorker.js#L154

  // A regexp to match mbox separator lines. Separator lines in the wild can
  // have all sorts of forms, for example:
  //
  // "From "
  // "From MAILER-DAEMON Fri Jul  8 12:08:34 2011"
  // "From - Mon Jul 11 12:08:34 2011"
  // "From [email protected] Fri Jul  8 12:08:34 2011"
  //
  // So we accept any line beginning with "From " and ignore the rest of it.
  //
  // We also require a message header on the next line, in order
  // to better cope with unescaped "From " lines in the message body.
  // note: the first subexpression matches the separator line, so
  // that it can be removed from the input.
  let sepRE = /^(From (?:.*?)\r?\n)[\x21-\x7E]+:/gm;

https://github.com/mozilla/releases-comm-central/blob/78245f927ad4c32e6896f5ceee7483083f90bf59/mailnews/local/src/nsParseMailbox.cpp#L616

  /* The required format is
     From jwz  Fri Jul  1 09:13:09 1994
   But we should also allow at least:
     From jwz  Fri, Jul 01 09:13:09 1994
     From jwz  Fri Jul  1 09:13:09 1994 PST
     From jwz  Fri Jul  1 09:13:09 1994 (+0700)

   We can't easily call XP_ParseTimeString() because the string is not
   null terminated (ok, we could copy it after a quick check...) but
   XP_ParseTimeString() may be too lenient for our purposes.

   DANGER!!  The released version of 2.0b1 was (on some systems,
   some Unix, some NT, possibly others) writing out envelope lines
   like "From - 10/13/95 11:22:33" which STRICT_ENVELOPE will reject!
   */

@digitalsignalperson
Copy link

This is what worked for my purposes for a one-time thing

    def _generate_toc(self):
        """Generate key-to-(start, stop) table of contents."""
        starts, stops = [], []
        last_was_empty = False
        self._file.seek(0)
        while True:
            line_pos = self._file.tell()
            line = self._file.readline()
            envelope_found = False
            if line.startswith(b'From '):
                next_line_pos = self._file.tell()
                next_line = self._file.readline()
                candidates = [
                    b'X-Mozilla-Status:',
                    b'X-Mozilla-Status2:',
                    b'Delivered-To: ',
                    b'Subject: ',
                    b'Message-ID: ',
                    b'Return-Path: ',
                    b'To: ',
                    b'Content-Type: ',
                    b'MIME-Version: ',
                    b'FCC: ',
                    b'BCC: ',
                    b'X-Identity-Key: ',
                ]
                for x in candidates:
                    if next_line.startswith(x):
                        envelope_found = True
                        break
            if envelope_found:
                if len(stops) < len(starts):
                    if last_was_empty:
                        stops.append(line_pos - len(linesep))
                    else:
                        # The last line before the "From " line wasn't
                        # blank, but we consider it a start of a
                        # message anyway.
                        stops.append(line_pos)
                starts.append(line_pos)
                last_was_empty = False
            elif not line:
                if last_was_empty:
                    stops.append(line_pos - len(linesep))
                else:
                    stops.append(line_pos)
                break
            elif not next_line:
                if last_was_empty:
                    stops.append(next_line_pos - len(linesep))
                else:
                    stops.append(next_line_pos)
                break
            elif line == linesep:
                last_was_empty = True
            else:
                last_was_empty = False
        self._toc = dict(enumerate(zip(starts, stops)))
        self._next_key = len(self._toc)
        self._file_length = self._file.tell()

I tried the regular expression suggested in the above comment from thunderbird source code, but it was not specific enough (letting a "From " line be valid if the next line has a "Something: Blah"). So I just looked at the header fields after the "From " line in my mbox files and crafted a limited allow-list.

A better method should also include tracking multipart messages and when parts begin and end. Emails with git patches can be especially messy if you interpreted the stuff that looks like email headers in git commit messages.

@sebbASF
Copy link

sebbASF commented Jul 2, 2023

Note that mail headers can (and do) use either case, so the samples above may not work in all instances.
Seems to me it would be better to be somewhat stricter when matching the From line, and more relaxed when checking for a following header line.

@erlend-aasland erlend-aasland added 3.13 bugs and security fixes and removed 3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes labels Jan 5, 2024
sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 17, 2025
sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 17, 2025
sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 18, 2025
@picnixz picnixz added stdlib Python modules in the Lib dir and removed 3.13 bugs and security fixes labels Mar 18, 2025
sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 18, 2025
sebbASF added a commit to sebbASF/cpython that referenced this issue Mar 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir topic-email type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants