mailbox.mbox malformed 'From ' lines not being detected/handled #93376
Comments
How would you use the proposed argument? Can you show an example of a malicious header that your solution would not break on? A |
Hi. It's not a malicious header; it's wayward '^From ' lines in the body of the message, which confuse mailbox.mbox into falsely detecting a new message marker. The message count is then wrong and data is missed. Unescaped "From " lines (missing the required ">") can be intentional, but they are also a common enough problem in legitimate emails that most readers are forced to compensate, often by looking for '^From ' followed by a string resembling an email address, a date field, etc. (e.g. https://github.com/muttmua/mutt/blob/master/from.c). My solution would allow people to open a box and specify an alternative line to key on, for example:
A regex would be the more flexible solution but this would be marginally faster. |
After digging around, there really doesn't appear to be any existing way of handling this. I've looked through the documentation, and "mangle_from_" (https://docs.python.org/3/library/email.policy.html#email.policy.Policy.mangle_from_) is only relevant when creating a message object, not when reading an mbox. Passing a custom email Policy object via "factory=" does not allow for a custom 'From ' line handler on message reading either (as far as I'm able to tell). The message separator of '^From ' is hardcoded in mailbox.py and will always break on mbox files where this is malformed. This issue has also been raised here and seems unresolved. |
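A minimal reproduction sketch of the miscount described in this thread. The sample message, the file name, and the helper `count_messages` are made up for illustration; only `mailbox.mbox` itself comes from the standard library. On the Python versions discussed here, the single message below is expected to be reported as two, because its body contains an unescaped line starting with "From ".

```python
import mailbox
import os
import tempfile

# One message whose body contains an unescaped line starting with "From ".
RAW = (
    b"From alice@example.com Tue Jun  6 20:43:03 2023\n"
    b"Subject: hello\n"
    b"\n"
    b"Line one of the body.\n"
    b"\n"
    b"From here on, this is still body text.\n"
    b"\n"
)


def count_messages(raw: bytes) -> int:
    """Write raw bytes to a temporary mbox file and return len(mailbox.mbox)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "wb") as f:
            f.write(raw)
        box = mailbox.mbox(path)
        try:
            return len(box)
        finally:
            box.close()


if __name__ == "__main__":
    # The file holds one message; the reader splits it at the body line.
    print(count_messages(RAW))
```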
I have an updated patch which allows for passing either a byte string or re.Pattern making these all possible (see test mbox at the end):
Only obvious downside is the added requirement of the 're' module which fractionally increases startup time.
|
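To illustrate the idea of accepting either a byte string or a compiled pattern, here is a small sketch. It is not the patch from this comment: the name `make_from_predicate` and its parameter `from_matcher` are hypothetical, and the default simply mirrors the existing b'From ' prefix check.

```python
import re
from typing import Callable, Union

# Hypothetical type for the argument: a literal prefix or a compiled bytes regex.
FromMatcher = Union[bytes, re.Pattern]


def make_from_predicate(from_matcher: FromMatcher = b"From ") -> Callable[[bytes], bool]:
    """Return a function that decides whether a line is a message separator."""
    if isinstance(from_matcher, bytes):
        prefix = from_matcher
        # Existing behavior: any line starting with the prefix is a separator.
        return lambda line: line.startswith(prefix)
    if isinstance(from_matcher, re.Pattern):
        pattern = from_matcher
        # Regex behavior: the pattern must match at the start of the line.
        return lambda line: pattern.match(line) is not None
    raise TypeError("from_matcher must be bytes or a compiled bytes pattern")
```

A caller could then pass `re.compile(rb"From \S+ .*\d{4}\s*$")` to require a trailing year, while the default keeps today's behavior.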
All valid From lines should end with a date; in particular they should end with a 4-digit year. I agree that it would be very helpful to be able to override the check in some way. An alternative might be to allow the user to provide a function which returns True/False. |
Yesss this. I'm trying to convert mbox files from Thunderbird to Maildir and this is a funny bug to run into. There are really a LOT of cases where random emails have lines starting with "From ".
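A sketch of what such a True/False function could look like, assuming the rule stated above (a sender token followed by text ending in a 4-digit year). The name `is_from_line` and the exact pattern are illustrative assumptions, not part of any existing API; a bare "From " line, which some writers emit, would be rejected by this check.

```python
import re

# Assumed rule: "From ", a non-empty sender token, then text ending in a 4-digit year.
_FROM_LINE = re.compile(rb"^From \S+ .*\b\d{4}\s*$")


def is_from_line(line: bytes) -> bool:
    """Return True if the line looks like an mbox separator ending in a year."""
    return _FROM_LINE.match(line) is not None
```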
etc. Here's a sample of what I'm dealing with:

from collections import Counter

with open(mbox_path, 'rb') as f:
    data = f.read()
from_lines = [line for line in data.splitlines() if line.startswith(b'From ')]
Counter([len(line) for line in from_lines])
Out[43]:
Counter({31: 9558,
         75: 21,
         76: 20,
         74: 7,
         70: 4,
         69: 1,
         72: 1,
         24: 1,
         82: 1,
         68: 1,
         71: 1})

Obviously the length-31 items are the desired header lines, like b'From - Tue Jun  6 20:43:03 2023'. While we could tweak I'm curious how Thunderbird does it, given my mbox files are from there. |
Some interesting bits from Thunderbird:
// A regexp to match mbox separator lines. Separator lines in the wild can
// have all sorts of forms, for example:
//
// "From "
// "From MAILER-DAEMON Fri Jul 8 12:08:34 2011"
// "From - Mon Jul 11 12:08:34 2011"
// "From [email protected] Fri Jul 8 12:08:34 2011"
//
// So we accept any line beginning with "From " and ignore the rest of it.
//
// We also require a message header on the next line, in order
// to better cope with unescaped "From " lines in the message body.
// note: the first subexpression matches the separator line, so
// that it can be removed from the input.
let sepRE = /^(From (?:.*?)\r?\n)[\x21-\x7E]+:/gm;
|
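For readers working in Python, the expression above can be translated to a bytes pattern roughly as follows. This is a sketch: `SEP_RE` and `separator_offsets` are names chosen here, and the translation assumes the whole mbox fits in memory as one bytes object. Note that a body line starting with "From " still matches if the line after it happens to look like "Name: value".

```python
import re

# Bytes translation of the quoted Thunderbird expression: a "From " line
# followed by something that looks like the start of a header field.
SEP_RE = re.compile(rb"^(From (?:.*?)\r?\n)[\x21-\x7E]+:", re.MULTILINE)


def separator_offsets(data: bytes) -> list:
    """Return byte offsets of "From " lines that are followed by a header-like line."""
    return [m.start(1) for m in SEP_RE.finditer(data)]
```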
This is what worked for my purposes as a one-time thing:

def _generate_toc(self):
    """Generate key-to-(start, stop) table of contents."""
    starts, stops = [], []
    last_was_empty = False
    self._file.seek(0)
    while True:
        line_pos = self._file.tell()
        line = self._file.readline()
        envelope_found = False
        if line.startswith(b'From '):
            # Peek at the following line: a real separator is expected to be
            # followed by one of the known header fields.
            next_line_pos = self._file.tell()
            next_line = self._file.readline()
            candidates = [
                b'X-Mozilla-Status:',
                b'X-Mozilla-Status2:',
                b'Delivered-To: ',
                b'Subject: ',
                b'Message-ID: ',
                b'Return-Path: ',
                b'To: ',
                b'Content-Type: ',
                b'MIME-Version: ',
                b'FCC: ',
                b'BCC: ',
                b'X-Identity-Key: ',
            ]
            for x in candidates:
                if next_line.startswith(x):
                    envelope_found = True
                    break
        if envelope_found:
            if len(stops) < len(starts):
                if last_was_empty:
                    stops.append(line_pos - len(linesep))
                else:
                    # The last line before the "From " line wasn't
                    # blank, but we consider it a start of a
                    # message anyway.
                    stops.append(line_pos)
            starts.append(line_pos)
            last_was_empty = False
        elif not line:
            if last_was_empty:
                stops.append(line_pos - len(linesep))
            else:
                stops.append(line_pos)
            break
        elif not next_line:
            if last_was_empty:
                stops.append(next_line_pos - len(linesep))
            else:
                stops.append(next_line_pos)
            break
        elif line == linesep:
            last_was_empty = True
        else:
            last_was_empty = False
    self._toc = dict(enumerate(zip(starts, stops)))
    self._next_key = len(self._toc)
    self._file_length = self._file.tell()

I tried the regular expression suggested in the above comment from the Thunderbird source code, but it was not specific enough (it lets a "From " line be valid if the next line has a "Something: Blah"). So I just looked at the header fields after the "From " line in my mbox files and crafted a limited allow-list. A better method should also track multipart messages and where parts begin and end. Emails with git patches can be especially messy if you interpret the stuff that looks like email headers in git commit messages. |
Note that mail headers can (and do) use either case, so the samples above may not work in all instances. |
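One way to make an allow-list check independent of case is to compare lowercased field names, as in this sketch. The set `KNOWN_HEADERS` and the function `starts_with_known_header` are hypothetical names, and the list of fields is only an example.

```python
# Example allow-list of header field names, lowercase and without the colon.
KNOWN_HEADERS = frozenset({
    b"x-mozilla-status", b"x-mozilla-status2", b"delivered-to", b"subject",
    b"message-id", b"return-path", b"to", b"content-type", b"mime-version",
})


def starts_with_known_header(line: bytes, known=KNOWN_HEADERS) -> bool:
    """True if line looks like 'Name: value' with Name in the allow-list, ignoring case."""
    name, sep, _ = line.partition(b":")
    # sep is empty when the line has no colon at all.
    return bool(sep) and name.lower() in known
```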
Bug report
mailbox.mbox (class mbox) builds the table of contents (_generate_toc()) by matching on lines starting with b'From '.
This is RFC compliant; however, malicious emails/senders will sometimes intentionally break this, causing unexpected behavior.
I suggest this be considered a bug as:
I propose exposing a custom 'From ' line delimiter, with existing behavior maintained as the default:
diff of mailbox.py:
There are more sophisticated methods which could be explored; for example is_from() function in mutt, or a regex over a byte array.
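As a rough illustration of the regex route, the sketch below subclasses mailbox.mbox and swaps only the separator test. It is not the proposed diff: `StrictMbox`, `FROM_LINE`, and `count_strict` are hypothetical names, the pattern assumes separators end in a 4-digit year, and the override relies on private attributes of mailbox.py (`_file`, `_toc`, `_next_key`, `_file_length`) that may change between Python versions.

```python
import mailbox
import os
import re
import tempfile

# Assumption: a real separator is "From <sender> ... <4-digit year>".
FROM_LINE = re.compile(rb"^From \S+ .*\b\d{4}\s*$")


class StrictMbox(mailbox.mbox):
    """mbox reader that only treats year-terminated 'From ' lines as separators."""

    def _generate_toc(self):
        # Same bookkeeping as a plain prefix scan; only the line test differs.
        starts, stops = [], []
        last_was_empty = False
        self._file.seek(0)
        while True:
            line_pos = self._file.tell()
            line = self._file.readline()
            if FROM_LINE.match(line):
                if len(stops) < len(starts):
                    if last_was_empty:
                        stops.append(line_pos - len(mailbox.linesep))
                    else:
                        stops.append(line_pos)
                starts.append(line_pos)
                last_was_empty = False
            elif not line:  # end of file
                if last_was_empty:
                    stops.append(line_pos - len(mailbox.linesep))
                else:
                    stops.append(line_pos)
                break
            elif line == mailbox.linesep:
                last_was_empty = True
            else:
                last_was_empty = False
        self._toc = dict(enumerate(zip(starts, stops)))
        self._next_key = len(self._toc)
        self._file_length = self._file.tell()


def count_strict(raw: bytes) -> int:
    """Write raw bytes to a temporary file and count messages with StrictMbox."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "wb") as f:
            f.write(raw)
        box = StrictMbox(path)
        try:
            return len(box)
        finally:
            box.close()
```

With a single message whose body contains a line such as "From here on, this is still body text.", `count_strict` should report one message, since that line does not end in a year.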
Your environment
python 3.9.13