Make HTML from SMS Backup & Restore XML files. #273

Open
sjevtic opened this issue Jan 18, 2025 · 45 comments
Labels
enhancement New feature or request

Comments

@sjevtic

sjevtic commented Jan 18, 2025

First I'd like to express my appreciation for this tool. The freedom it provides is just such a breath of fresh air in contrast to the opacity and rigidity that have unfortunately become synonymous with so many aspects of Signal. I've really enjoyed watching this project grow over the past few years. After nearly 20 years in the software industry, I can honestly say that the quality you have achieved while building a significant body of functionality is impressive.

One of the features I have really been enjoying is the HTML export feature: the output is beautiful and remarkably functional; the embedded media handling is especially nice. After reading through issue [#227], I started wondering: could I somehow use signalbackup-tools to generate gorgeous HTML archives of my SMS/MMS messages from SMS Backup & Restore's XML files? The idea here would be to either use --exportplaintextbackuphtml to directly generate HTML, or even better, perform multiple --importplaintextbackup operations starting from an empty backup to effectively integrate multiple XML backups. In the latter case, the goal would just be to ultimately run --exporthtml on the resulting backup; it wouldn't have to be a perfectly valid Signal backup file since it would never actually be restored in Signal.

So, I tried a few experiments with --importplaintextbackup, --exportplaintextbackuphtml, and --listxmlcontacts. All of them produced the same result, and I also had similar experiences with different XML files. Here is an example:

H:\sjevtic\testing>signalbackup-tools-20250112-1_win.exe --exportplaintextbackuphtml sms.xml .\html
*** Starting log: 2025-01-18 11:37:05 ***
signalbackup-tools (signalbackup-tools-20250112-1_win.exe) source version 20250112.125109 (Win) (SQlite: 3.47.2, OpenSSL: OpenSSL 3.4.0 22 Oct 2024)
[Error]: During sqlite3_prepare_v2(): near "m": syntax error
 -> Query: "INSERT INTO smses (date, type, read, body, contact_name, address) VALUES (1373919414774, 2, 1, 'I'm at the cinema', 'redacted-contact', 'redacted-number')"

Here's the corresponding line from the backup that caused the issue:

<sms protocol="0" address="redacted-number" date="1373919414774" type="2" subject="null" body="I'm at the cinema" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" date_sent="1373919414774" sub_id="-1" readable_date="Jul 15, 2013 4:16:54 PM" contact_name="redacted-contact" />

I don't have a lot of knowledge about XML, but the two highest rated responses to this post break down the escaping rules well, and it seems that a single quote (') enclosed in a double-quoted (") attribute is valid.
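That reading of the escaping rules can be checked with any conforming XML parser. A minimal sketch using Python's standard library (not part of signalbackup-tools; the element is a shortened copy of the redacted line above):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <sms> element above: the body attribute is delimited
# by double quotes and contains an unescaped apostrophe.
line = (
    '<sms protocol="0" address="redacted-number" date="1373919414774" '
    'type="2" body="I\'m at the cinema" read="1" '
    'contact_name="redacted-contact" />'
)

# fromstring() raises xml.etree.ElementTree.ParseError on malformed XML.
element = ET.fromstring(line)
body = element.get("body")
print(body)
```

The element parses without error, which suggests the XML itself is fine and the apostrophe only becomes a problem later, when the value is placed into the SQL statement.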

I have a few questions:

  1. Can I achieve what I wish to accomplish with signalbackup-tools?
  2. If yes to (1), is my approach correct, or can you provide guidance as to how to do this correctly?
  3. If yes to (1), do I need to preprocess my XML backups to change the quoting, or is this an issue in signalbackup-tools?

Thanks!

@bepaald
Owner

bepaald commented Jan 19, 2025

Hi!

The thing with #227 is, it is a bit stalled at the moment, but actual real testing is still very much needed. The feature is really not considered to be complete (which is also why you won't find it mentioned anywhere in the README or --help).

The proper dealing with the escaped strings (unescaping xml entities such as &amp; and escaping characters sqlite doesn't like, such as the ' you ran into) would be something found in testing. Though I have to admit I really should have thought of those myself, as I've had to do the same for nearly every other function of this tool :). Anyway, that one should be fixed now.
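The two steps described here (unescape XML entities, then keep the apostrophe from terminating the SQL string) can be sketched as follows. This is an illustration in Python, not the tool's actual C++ code; the table and column names are taken from the error message in the report, and the message body is made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustrative input: one escaped entity (&amp;) and one bare apostrophe.
xml_line = (
    '<sms date="1373919414774" type="2" read="1" '
    'body="Tom &amp; I\'m at the cinema" '
    'contact_name="redacted-contact" address="redacted-number" />'
)
sms = ET.fromstring(xml_line)  # the parser turns &amp; back into &

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE smses (date INTEGER, type INTEGER, read INTEGER, "
    "body TEXT, contact_name TEXT, address TEXT)"
)

# Bound parameters never become part of the SQL text, so no quoting is needed.
db.execute(
    "INSERT INTO smses (date, type, read, body, contact_name, address) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (
        int(sms.get("date")),
        int(sms.get("type")),
        int(sms.get("read")),
        sms.get("body"),
        sms.get("contact_name"),
        sms.get("address"),
    ),
)
stored_body = db.execute("SELECT body FROM smses").fetchone()[0]


def sql_quote(value):
    """Quote a string as an SQL literal by doubling embedded apostrophes."""
    return "'" + value.replace("'", "''") + "'"
```

Binding parameters avoids the quoting question entirely; `sql_quote` (a hypothetical helper) shows the alternative of doubling the apostrophe for cases where the statement has to be assembled as text.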

However:

Can I achieve what I wish to accomplish with signalbackup-tools?

No, not really. Not yet, at least. The --importplaintextbackup was really created to import XML files as exported by Signal Android, back when it still did that. While the actual format of this XML file is compatible (I think) with SMS Backup and Restore's XML file, it is definitely a subset. Also the tool makes some assumptions about which data is present. A few issues that I know of (there are likely more):

  • Since Signal never exported any <mms> messages (messages with media attached, or outgoing group messages), these are not currently supported by the import-function. These are probably quite important.
  • Signal never has the address attribute empty, and currently this tool relies on that to differentiate messages from different contacts. In my small (non-representative) SMS Backup & Restore XML file, most messages have no address set (they are from service numbers, for example Signal's own registration SMS), which are then all grouped together in a single thread.
  • SMS Backup and Restore encodes emoji (and other characters) differently: Signal encodes separate UTF-16 code points as HTML entities, while SMSB&R uses UTF32 (or encodes the codepoint directly). I have actually pushed a fix for this, but I'm not sure it works correctly in all cases.

I do believe all of these issues can be fixed, by the way.
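To illustrate the emoji point: one way to accept both encodings is to decode every numeric character reference to a single code unit and then let a UTF-16 round trip join any surrogate pairs. This is a sketch under that assumption, not the fix that was pushed; `decode_numeric_refs` is a hypothetical name. Python's own `html.unescape` is not used because it replaces surrogate references with U+FFFD.

```python
import re

# Matches decimal (&#128512;) and hexadecimal (&#x1F600;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Decode numeric references, joining UTF-16 surrogate pairs if present."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if value > 0x10FFFF:
            return "\ufffd"  # not a valid Unicode code point
        return chr(value)  # may be a lone surrogate at this stage

    with_units = _NUMERIC_REF.sub(replace, text)
    # Encoding with "surrogatepass" writes surrogates as plain UTF-16 code
    # units; decoding then combines each high+low pair into one character.
    # Unpaired surrogates are replaced during decoding.
    return with_units.encode("utf-16-le", "surrogatepass").decode(
        "utf-16-le", "replace"
    )


print(decode_numeric_refs("&#55357;&#56832;") == decode_numeric_refs("&#128512;"))
```

Both spellings of U+1F600 (a surrogate pair and a single code point) decode to the same character, so the two export styles can share one code path.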

If yes to (1), is my approach correct, or can you provide guidance as to how to do this correctly?

The approach is correct I think, at least just --exportplaintextbackuphtml should produce some html, but again with the mms's skipped, it will probably be very incomplete. Also, you might run into another parsing issue, or some other bug that I'm not aware of (maybe you could test and report back).

If yes to (1), do I need to preprocess my XML backups to change the quoting, or is this an issue in signalbackup-tools?

This was indeed a bug, that should be fixed now. Thanks for reporting.

I did plan on adding mms support at some point, but that is going to be a long process to get right I think. I should have some time this week though. I'll keep this issue updated if you are interested and willing to test and report on the status occasionally.

Thanks!

@bepaald bepaald added the enhancement New feature or request label Jan 19, 2025
@sjevtic
Author

sjevtic commented Jan 20, 2025

This sounds really promising, and I'm certainly happy to help test. With the latest build, I was able to run to completion on several different backup files. As you indicated, there is no MMS content of any sort included such as media or group chats. But, other than this and the service number issue you mentioned, the HTML looks good.

Thanks for taking this on!

@bepaald
Owner

bepaald commented Jan 22, 2025

Just a quick update: I have made an attempt to deal with the service numbers, as well as deal with mms messages (at least the message body and attached media). The next thing I need to do is try to deal with 'groups' in some way (even though mms and SMSB&R don't really support groups as such, I think), and possibly find some way to easily add names to contacts if they have none.

There are a lot of difficult, or impossible to get 100% correct things in these XML files. For example, I am now relying on SMSB&R to set contact_name="(Unknown)" in specific cases to be able to handle the service numbers, but of course, there is nothing stopping anyone from naming a contact "(Unknown)". Similarly, I am skipping message bodies where the XML contains text="null", but what happens if someone sends a message with the literal text "null"? A lot of that sort of stuff. Also I've been installing SMSB&R on various phones and parsing the XML and found some oddities, like an empty address field, even though this field is mandatory according to synctech's own spec.

Anyway, I think attempting to deal with groups is not going to be easy, and there are probably a ton of bugs in the function as it is now, so if you find some time, you could always try it out and let me know what works and what doesn't at this moment.

Thanks!

@sjevtic
Author

sjevtic commented Jan 23, 2025

I just tried your 20250122 Windows executable on my 2024 archive. It is ~1 GB and contains significant MMS media and group conversations. I see a lot of progress here, notably in that media embedded in MMS messages appears to be getting correctly processed. Some of the group conversation threads are there too, although there is nothing in the HTML to identify who the sender of a particular message was. I encountered several different and applicable warnings, the highlight of which was this very appropriate message:

[Warning]: Chat partner was not found in recipient-table. Attempting to create. NOTE THE RESULTING BACKUP CAN MOST LIKELY NOT BE RESTORED ON SIGNAL ANDROID. IT IS ONLY MEANT TO EXPORT TO HTML.

However, HTML output generation aborted during generation of HTML for thread 34 (out of 180). The error simply indicates [Error]: Failed to open '<filename>' though there is no further detail about what the error was. However, it was a group chat with 5 participants, so perhaps there could be some issue associated with the length of the file path or characters it contains?

Thanks again--this is some really impressive progress!

@bepaald
Owner

bepaald commented Jan 23, 2025

Thanks for testing.

However, HTML output generation aborted during generation of HTML for thread 34 (out of 180). The error simply indicates [Error]: Failed to open '<filename>' though there is no further detail about what the error was. However, it was a group chat with 5 participants, so perhaps there could be some issue associated with the length of the file path or characters it contains?

That is a strange one. Could you be more detailed about the actual error message? Just from grepping through the source code, this can't be it literally, there must at least be "for reading" or "for writing" behind that, but maybe it's something else? Does the filename refer to the XML file, or some attachment? Considering your last comment, was the file path very long, or did it contain any special characters? Did you add any options to the export command (like --originalfilenames)? I'm not sure what exactly is happening in this case.

It [...] contains significant [...] group conversations

I'm having a bit of a tough time dealing with group conversations properly. I actually don't have any in my SMS history, and can't create any because MMS has not been supported by any telecom provider in my country since 2019. The only group messages I can work with are ones I created myself by using my own tool to --exportxml.

Am I correct that I can detect group conversations simply by the address field? This field will always contain a list of phone numbers, concatenated with a ~ in case of a group chat (for both incoming and outgoing messages)?

Thanks!

@sjevtic
Author

sjevtic commented Jan 23, 2025

Just from grepping through the source code, this can't be it literally, there must at least be "for reading" or "for writing" behind that,

My apologies. In redacting the contact names for publishing, I omitted to copy "for writing". Also there is no closing single quote on the filename in the error message.

Does the filename refer to the XML file, or some attachment?

The path refers to the conversation HTML file in the thread directory. The directory had not been created yet.

Considering your last comment, was the file path very long, or did it contain any special characters?

I just tried creating the directory and file and it worked, so I guess the OS doesn't mind the name.

Did you add any options to the export command (like --originalfilenames)? I'm not sure what exactly is happening in this case.

In this case no additional options, just --exportplaintextbackuphtml.

I'm having a bit of a tough time dealing with group conversations properly. I actually don't have any in my SMS history, and can't create any because MMS has not been supported by any telecom provider in my country since 2019.

Interesting. What country, and what is the standard messaging protocol there? Things are quite a mess here in the US at the moment with device makers and carriers alike trying to force everyone to use Google Messages. It appears that this will soon be the only app supporting RCS messaging here. Interestingly though, RCS messages still seem to get stored in the telephony messaging database, which is very convenient since it makes it possible for them to be backed up by SMSBR. But as an aside, in recent times, Google has begun extending its crackdown on rooted devices, not only breaking RCS for them, but doing so silently. The RCS status in the app shows you as online, but you do not receive any RCS messages, nor does anyone receive your RCS messages.

Am I correct that I can detect group conversations simply by the address field? This field will always contain a list of phone numbers, concatenated with a ~ in case of a group chat (for both incoming and outgoing messages)?

By casual inspection, I believe that is correct, though I have not read the spec nor had any firsthand experience writing a parser for these files. Where contact information is available, the contact_name field also contains a comma + space delimited list of contacts that are parties to the thread.
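The convention described above can be sketched as a small helper. It assumes, as the thread does "by casual inspection", that group threads join phone numbers with `~` in `address` and join names with a comma and a space in `contact_name`; the function name and the sample values are hypothetical.

```python
def split_thread_parties(address, contact_name):
    """Split the address and contact_name attributes of one message.

    Assumption (not verified against the SMSB&R spec): a group thread lists
    its numbers separated by "~" and its names separated by ", ".
    """
    numbers = [part for part in address.split("~") if part]
    names = contact_name.split(", ") if contact_name else []
    return {
        "numbers": numbers,
        "names": names,
        "is_group": len(numbers) > 1,
    }


# Made-up values for illustration only.
group = split_thread_parties("+15550001~+15550002~+15550003", "Alice, Bob, Carol")
single = split_thread_parties("+15550001", "Alice")
print(group["is_group"], single["is_group"])
```

A contact whose own name contains ", " would be split incorrectly, and nothing here guarantees that the names appear in the same order as the numbers, so the two lists are kept separate rather than paired up.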

@bepaald
Owner

bepaald commented Jan 23, 2025

My apologies. In redacting the contact names for publishing, I omitted to copy "for writing". Also there is no closing single quote on the filename in the error message.

Thanks, with that and my helpful typo, it was easy to find the spot where it fails.

The path refers to the conversation HTML file in the thread directory. The directory had not been created yet.

It's strange that the directory had not been created yet. I now think there are two errors: it doesn't create the directory, and it doesn't show an error when it fails to do so. I have no idea what could be causing this. It might have something to do with the directory name it is trying to create. I understand there are contact names in that directory name, but is there anything you could consider special about it?

This may be too much trouble, but maybe if you delete all messages but one (the first one for that conversation) from the XML file, could you then try to edit the "contact_name" field to see if you can reproduce it with a more anonymous name? Maybe I can think of something myself, without you going through all that trouble, but I just wanted to suggest this before going to bed (it's late here).

Interesting. What country, and what is the standard messaging protocol there? Things are quite a mess here in the US at the moment with device makers and carriers alike trying to force everyone to use Google Messages. It appears that this will soon be the only app supporting RCS messaging here. Interestingly though, RCS messages still seem to get stored in the telephony messaging database, which is very convenient since it makes it possible for them to be backed up by SMSBR. But as an aside, in recent times, Google has begun extending its crackdown on rooted devices, not only breaking RCS for them, but doing so silently. The RCS status in the app shows you as online, but you do not receive any RCS messages, nor does anyone receive your RCS messages.

This is in The Netherlands. I think in reality 99% of people have WhatsApp installed, and use that for most of their messaging. Other than that all providers still support SMS, just MMS has been discontinued.

By casual inspection, I believe that is correct, though I have not read the spec nor had any firsthand experience writing a parser for these files. Where contact information is available, the contact_name field also contains a comma + space delimited list of contacts that are parties to the thread.

Thank you, that is useful information. I hope to continue working on this function over the next few days. The group stuff is going to get a little complicated, but it should be doable.

@sjevtic
Author

sjevtic commented Jan 26, 2025

It's strange that the directory had not been created yet.

The directory actually does get created. In my haste the other day, I simply confused it with another directory. Sorry about that.

This may be too much trouble, but maybe if you delete all messages but one (the first one for that conversation) from the XML file, could you then try to edit the "contact_name" field to see if you can reproduce it with a more anonymous name?

I did this and was able to reproduce the problem. Similarly, by taking just two letters off the name in the last contact ("er"), I was able to get the HTML output. Note that my testing thus far has been with your Windows binary.

Next, I built signalbackup-tools from source on Linux and had no issues generating HTML output with the original file. So, I think it's reasonable to conclude that you are hitting some kind of a path length limit on Windows. I know Windows has long defined MAX_PATH as 260 as discussed here and the path in question is right at that limit. However, my Windows tests have been on Windows 10 22H2 with the LongPathsEnabled registry value set to 1. That said, there appear to be some important considerations governing long path support, notably that only a subset of APIs actually support the long paths. I wonder if it is possible that a library you are linking against might be using an API that does not support long paths.
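For reference, the documented way around MAX_PATH for Win32 file APIs is the extended-length prefix `\\?\` on an absolute, fully normalized path. The sketch below only shows the string handling; the thread does not say whether signalbackup-tools uses this mechanism, and both function names are hypothetical.

```python
import ntpath

MAX_PATH = 260  # classic Win32 limit; the count includes the terminating NUL


def exceeds_max_path(path):
    """True if the path string does not fit in a MAX_PATH-sized buffer."""
    return len(path) >= MAX_PATH


def to_extended_length(path):
    """Prefix an absolute Windows path with the extended-length marker.

    Assumes the caller passes an absolute path. Extended-length paths are not
    normalized by Windows, so forward slashes and ".." are resolved here first.
    """
    if path.startswith("\\\\?\\"):
        return path  # already in extended-length form
    normalized = ntpath.normpath(path)
    if normalized.startswith("\\\\"):
        # UNC path: \\server\share\... becomes \\?\UNC\server\share\...
        return "\\\\?\\UNC\\" + normalized[2:]
    return "\\\\?\\" + normalized


print(to_extended_length("C:/export/html/thread"))
```

`ntpath` is used instead of `os.path` so the example behaves the same on any platform. Note that directory creation has historically had an even lower limit than file creation, so a check against 260 alone may be too optimistic.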

This is in The Netherlands. I think in reality 99% of people have WhatsApp installed, and use that for most of their messaging. Other than that all providers still support SMS, just MMS has been discontinued.

Interesting. WhatsApp is not especially popular in the US. But then again, Signal is even less popular, and I generally find it hard to get non-technical types interested in using Signal. Sadly, in the US the Android messaging ecosystem is very fragmented, probably as a consequence of how long it has taken for RCS to gain significant market share; Google's recent decision to prevent root users from using RCS certainly will not help this situation. Is RCS widely used in Europe?

I found something else interesting. It seems that in SMSBR, MMS messages also have an addrs section:

<addrs>
<addr address="[contact_num1]" type="130" charset="106" />
<addr address="[contact_num2]" type="130" charset="106" />
<addr address="[contact_num3]" type="130" charset="106" />
<addr address="[contact_num4]" type="130" charset="106" />
<addr address="[contact_num5]" type="130" charset="106" />
<addr address="[contact_num6]" type="130" charset="106" />
<addr address="[contact_num7]" type="130" charset="106" />
<addr address="[contact_num8]" type="137" charset="106" />
<addr address="[contact_num9]" type="151" charset="106" />
</addrs>

So maybe this is another feature of the SMSBR backups that can help you detect a group conversation.
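A sketch of how the addrs section could be read. The type values match the MMS header field codes used by Android (129 BCC, 130 CC, 137 From, 151 To); treating a message with more than two distinct addresses as a group message is a heuristic assumed here, not something the thread confirms. The sample element and its placeholder numbers are made up.

```python
import xml.etree.ElementTree as ET

# MMS header field codes as used by Android for address types.
ADDR_ROLES = {129: "bcc", 130: "cc", 137: "from", 151: "to"}


def participants_by_role(mms_element):
    """Group the <addr> entries of one <mms> element by their role."""
    roles = {}
    for addr in mms_element.iter("addr"):
        role = ADDR_ROLES.get(int(addr.get("type", "0")), "other")
        roles.setdefault(role, []).append(addr.get("address"))
    return roles


def looks_like_group(mms_element):
    """Heuristic (assumption): more than two distinct addresses means a group."""
    addresses = {addr.get("address") for addr in mms_element.iter("addr")}
    return len(addresses) > 2


sample = ET.fromstring(
    "<mms>"
    "<addrs>"
    '<addr address="[num1]" type="137" charset="106" />'
    '<addr address="[num2]" type="151" charset="106" />'
    '<addr address="[num3]" type="130" charset="106" />'
    "</addrs>"
    "</mms>"
)
roles = participants_by_role(sample)
print(roles)
```

Unlike the `~`-joined address attribute, this section also identifies the sender (the From entry), which may help with attributing individual messages in a group thread.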

A few other things I noticed now that I was able to process my 2024 archive.

  • Group conversations with a lot of participants do not display well on the index.html page: the names ultimately run out of the box intended to house the displayed content.
  • There are a number of characters that are not displayed correctly in message bodies (often showing up as a question mark in a diamond icon; I suspect many were emojis).
  • A small amount of media does not open in the HTML pages.

Thanks!

@bepaald
Owner

bepaald commented Jan 26, 2025

I did this and was able to reproduce the problem. Similarly, by taking just two letters off the name in the last contact ("er"), I was able to get the HTML output. Note that my testing thus far has been with your Windows binary.

Thank you, very helpful. I have done some testing and think I have a solution. It's going to take some time to implement properly though, because I think it will need to be added to all places where a file is created on Windows (which is going to be a lot of work). I have now included an attempt to deal with long paths on Windows (at least in the export-html-code). Please let me know if it works, or if the same still happens, or if it now fails at a different point.

I found something else interesting. It seems that in SMSBR, MMS messages also have an addrs section:

Yes, I did know about that one. I have just pushed an update to hopefully handle group conversations better.

  • Group conversations with a lot of participants do not display well on the index.html page: the names ultimately run out of the box intended to house the displayed content.

Thanks, should be fixed now.

  • There are a number of characters that are not displayed correctly in message bodies (often showing up as a question mark in a diamond icon; I suspect many were emojis).

I would love to have some samples of this (the way they are written in the XML file, and (if you know) what character they are supposed to represent)

  • A small amount of media does not open in the HTML pages.

Curious, is there anything special about the ones that don't open, are they all some specific file type? When you say "does not open" are they not displayed at all, or can they only not be clicked to enlarge or save?

Thanks for the feedback. I have no doubts there are more bugs or little things that need improving, so whatever you can find, let me know.

Cheers!

@sjevtic
Author

sjevtic commented Jan 26, 2025

I have now included an attempt to deal with long paths on Windows (at least in the export-html-code). Please let me know if it works, or if the same still happens, or if it now fails at a different point.

I was able to fully process my unmodified 2024 archive using your latest (20250126-2) Windows binary. Of course, long path support should really work everywhere, not just in HTML generation since there is no telling how long the path could be to a user's backup file or output.

Apparently though a lot of things in Windows are still pretty reluctant to work with long paths. The File -> Open dialog in both Firefox and Edge (which I believe is the common Windows open dialog) refuse to accept the entire path to one of these long file names that I had copied using the Explorer context menu "Copy as Path" item. It seems that the "File name" field itself has a length limit, so I had to instead browse through the directory tree using the dialog itself, one level at a time. While this isn't an issue with signalbackup-tools per se, it might be a useful option to be able to generate HTML output with short directory and file names (maybe just based on the thread IDs). Users employing a file manager to browse the HTML tree for conversations of interest wouldn't like this so much, but those using the generated index.html would not be inconvenienced by it.

I haven't yet tried putting my generated HTML tree on a web server and accessing it over http. I should probably try this at some point. I don't expect any path length problems there, but it would be useful to verify this as well as make sure that there are no URL-encoding issues in the links provided.
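On the URL-encoding concern: directory and file names that contain spaces, `&` or `#` need percent-encoding when they appear in an href. A minimal sketch of encoding each path segment separately, using made-up names (the actual naming scheme of the generated HTML tree is not assumed here):

```python
from urllib.parse import quote


def href_for(*segments):
    """Build a relative URL, percent-encoding every path segment on its own."""
    # safe="" makes quote() also encode "/" inside a segment, so a name that
    # contains a slash cannot be mistaken for a directory separator.
    return "/".join(quote(segment, safe="") for segment in segments)


link = href_for("Tom & Jerry", "chat #1.html")
print(link)
```

An unencoded `#` is the most likely troublemaker, since a browser treats everything after it as a fragment identifier rather than as part of the file name.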

On the latest build, I did receive a lot of these messages though:

[Error]: Failed to find source_recipient_id in contactmap. Should be present at this point. Skipping message

  • Group conversations with a lot of participants do not display well on the index.html page: the names ultimately run out of the box intended to house the displayed content.

Thanks, should be fixed now.

Confirmed; the ellipsis makes a lot of sense on the index page. I also noticed that there is an analogous issue on the conversation pages themselves. In one case, a conversation has 8 members and the result is that the browser window is quite wide with a large horizontal scrollbar. Wrapping these names would make a lot of sense. In this same thread, the conversation is incorrectly listed as having 3 members, and the "show details" drop down link when clicked only shows 3 of the 8 names, even though all 8 are shown at the top of the conversation page.

  • There are a number of characters that are not displayed correctly in message bodies (often showing up as a question mark in a diamond icon; I suspect many were emojis).

I would love to have some samples of this (the way they are written in the XML file, and (if you know) what character they are supposed to represent)

Here is one without redacted body about a particularly boisterous puppy, which is significant because there does not appear to be any odd content in the message (other than perhaps the quoting):

<sms protocol="0" address="[contact_number]" date="1734628440524" type="2" subject="proto:CjoKImNvbS5nb29nbGUuYW5kcm9pZC5hcHBzLm1lc3NhZ2luZy4SFCIAKhDK0fX3A2tHcpDu+4iPN6gt" body='he seems to be entering a really rowdy phase. he was trying to grab my yeti cup off the coffee table yesterday right before i left. wouldn&apos;t stop when i screamed "NO" at him either.' toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" date_sent="0" sub_id="4" readable_date="Dec 19, 2024 12:14:00 PM" contact_name="[contact_name]" />

This message is shown as a number of question marks in diamonds in the HTML output.

  • A small amount of media does not open in the HTML pages.

Curious, is there anything special about the ones that don't open, are they all some specific file type? When you say "does not open" are they not displayed at all, or can they only not be clicked to enlarge or save?

I tried saving one of the files out of Google Messages, and here's what I found:

{503 minilappy6 .../Movies/Messages} # file VID_20250126_120329.3gp
VID_20250126_120329.3gp: ISO Media, MPEG v4 system, 3GPP

Don't mind the name--Google Messages made the horrible choice of naming the file based on when I saved it (and doesn't offer me a choice on the name or location). Maybe this format isn't understood by the embedded video player?

I can look for more examples of these issues if you need.

Thanks!

@bepaald
Owner

bepaald commented Jan 26, 2025

Thank you for your thorough reply.

I was able to fully process my unmodified 2024 archive using your latest (20250126-2) Windows binary. Of course, long path support should really work everywhere, not just in HTML generation since there is no telling how long the path could be to a user's backup file or output.

Apparently though a lot of things in Windows are still pretty reluctant to work with long paths. The File -> Open dialog in both Firefox and Edge (which I believe is the common Windows open dialog) refuse to accept the entire path to one of these long file names that I had copied using the Explorer context menu "Copy as Path" item. It seems that the "File name" field itself has a length limit, so I had to instead browse through the directory tree using the dialog itself, one level at a time. While this isn't an issue with signalbackup-tools per se, it might be a useful option to be able to generate HTML output with short directory and file names (maybe just based on the thread IDs). Users employing a file manager to browse the HTML tree for conversations of interest wouldn't like this so much, but those using the generated index.html would not be inconvenienced by it.

Right, that is a problem. Also, since making this change I've also had a new issue where apparently it causes all paths to fail to open, for some reason (#277). So, I think I should probably revert that change and attempt to keep the filenames short. That is not going to be easy though. I can already foresee people wanting proper names for the output (as they wanted for attachments), but depending on how long the directory structure is where they output the file, even just id3\id3.html could be too long...

On the latest build, I did receive a lot of these messages though:

[Error]: Failed to find source_recipient_id in contactmap. Should be present at this point. Skipping message

Thanks, I'll look into that one, I'll probably have questions about this later. If you run signalbackup-tools --listxmlcontacts [input.xml], are all contacts listed?

Confirmed; the ellipsis makes a lot of sense on the index page. I also noticed that there is an analogous issue on the conversation pages themselves. In one case, a conversation has 8 members and the result is that the browser window is quite wide with a large horizontal scrollbar. Wrapping these names would make a lot of sense. In this same thread, the conversation is incorrectly listed as having 3 members, and the "show details" drop down link when clicked only shows 3 of the 8 names, even though all 8 are shown at the top of the conversation page.

Right, I should be able to deal with the conversation pages. I expect the missing members might be related to the previous problem...

Edit: I actually have an idea about this (and the previous), hopefully I'll have some time to deal with it tomorrow.

Here is one without redacted body about a particularly boisterous puppy, which is significant because there does not appear to be any odd content in the message (other than perhaps the quoting):

Curious, it seems to work fine here:
[Image: screenshot of the message in the HTML output]

You are sure this is the message? Is the entire message replaced by those diamonds or only certain characters?

Maybe this format isn't understood by the embedded video player?

I imagine this is the case, I'll try to look into it.

I have some work to do, thanks again, I'll let you know when I've made some changes or have more questions.

@bepaald
Owner

bepaald commented Jan 27, 2025

I've made some adjustments:

  • improved detecting all members of a group. This should at least fix the group-details issue you saw earlier. I'm curious if it also has an effect on the Failed to find source_recipient_id in contactmap. Should be present at this point messages. Honestly I couldn't quite understand how one could run into that error, but so much code around that has changed that there is no point investigating now.
  • hopefully fixed the long titles on conversations-pages.
  • a new approach to long filenames. I have limited the length of filenames (during --exporthtml) to 32 characters under Windows. I may change the actual value of the length if there is reason to. I think this should mostly work, unless the export directory is already very deep. 32 seemed like a good idea since Signal itself currently has a 32 character limit on group names, so this change should actually go unnoticed to most people (though in the past group titles were allowed to be longer, and the firstname + space + lastname combo for individual contacts can also be longer).
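The filename limit from the last item can be sketched roughly as follows. Only the 32-character figure comes from the comment above; the function name, the replacement of reserved characters and the trimming of trailing dots and spaces are assumptions for illustration, not the tool's actual behavior.

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
_RESERVED = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def limit_filename(name, limit=32):
    """Replace reserved characters and cut the name to at most `limit` characters."""
    cleaned = _RESERVED.sub("_", name)
    # Windows does not accept names ending in a space or a dot, and cutting a
    # name can expose one, so trim after truncating.
    shortened = cleaned[:limit].rstrip(" .")
    return shortened or "_"


print(limit_filename("A very long group title that goes on and on and on"))
```

Truncation can map two different titles to the same name, so a real implementation would still need something unique (such as a thread id) in the final path.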

Some other remarks:

  • I'm wondering if there is some character encoding issue with that message you posted, that gets lost when copying and pasting the message here...
  • I've made some barebones HTML pages with various video files as src for a <video> tag, and all 3gp files invariably fail to play. You could check the HTML source for the link to the actual file (somewhere in the media-subdirectory) and verify it is identical to the VID_20250126_120329.3gp you saved from Google messages, or at least that it plays correctly outside of the browser.
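To find out which exported attachments are 3GPP files without opening each one, the container brand can be read from the first bytes of the file. This is a sketch that assumes the file starts with an ISO base media `ftyp` box, which is typical for 3gp and mp4 files but not guaranteed; the function names are hypothetical.

```python
def major_brand(header):
    """Return the major brand of an ISO base media file, or None.

    `header` is the first 12 (or more) bytes of the file: a 4-byte box size,
    the box type "ftyp", then the 4-byte major brand.
    """
    if len(header) >= 12 and header[4:8] == b"ftyp":
        return header[8:12].decode("ascii", "replace")
    return None


def looks_like_3gp(header):
    """True for 3GPP/3GPP2 brands such as 3gp4, 3gp5 or 3g2a."""
    brand = major_brand(header)
    return brand is not None and brand.startswith("3g")


# Hand-made header for illustration: box size 24, type "ftyp", brand "3gp4".
example = b"\x00\x00\x00\x18ftyp3gp4" + b"\x00" * 12
print(major_brand(example), looks_like_3gp(example))
```

For a real file, pass the result of `open(path, "rb").read(12)`. Files flagged this way could, for example, be linked for download instead of being placed in a <video> tag.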

Let me know if these changes have made things somewhat better if you find the time.

Thanks!

@sjevtic
Author

sjevtic commented Jan 28, 2025

improved detecting all members of a group. This should at least fix the group-details issue you saw earlier. I'm curious if it also has an effect on the Failed to find source_recipient_id in contactmap. Should be present at this point messages. Honestly I couldn't quite understand how one could run into that error, but so much code around that has changed that there is no point investigating now.

My entire 2024 archive now has only one group chat (which happens to have 3 participants total). All the others are missing from the HTML output. I'm also still getting a ton of those errors, but I don't know what messages are causing them so it is hard for me to provide additional information.

hopefully fixed the long titles on conversations-pages.

I can't tell at the moment since I am missing almost all of my group conversation threads.

a new approach to long filenames. I have limited the length of filenames (during --exporthtml) to 32 characters under Windows. I may change the actual value of the length if there is reason to. I think this should mostly work, unless the export directory is already very deep. 32 seemed like a good idea since Signal itself currently has a 32 character limit on group names, so this change should actually go unnoticed by most people (though in the past group titles were allowed to be longer, and the firstname + space + lastname combo for individual contacts can also be longer).

I think it would be great to give the user two choices: one for the original scheme (which is very descriptive), or one for a very compact scheme (like id_nn_\id_nn_.html). This seems like a better scheme than producing different output just because it's on Windows. Your scheme for creating HTML with very long paths on Windows did seem to be working correctly after all.

I'm wondering if there is some character encoding issue with that message you posted, that gets lost when copying and pasting the message here...

I tried this myself today (using the latest build), but in this case, instead of copying and pasting, I simply deleted all the other lines in the XML file and used File -> Save As in Visual Studio Code to write a 1-message XML file. The HTML displayed all the characters in the message properly. Barring any strange issues with Visual Studio Code, I really have no idea what is causing this corruption.

It is also worth noting that in the failure case for this message, every single character was a question mark in a diamond box.

I've made some barebones HTML pages with various video files as src for a <video> tag, and all 3gp files invariably fail to play. You could check the HTML source for the link to the actual file (somewhere in the media-subdirectory) and verify it is identical to the VID_20250126_120329.3gp you saved from Google messages, or at least that it plays correctly outside of the browser.

All of this content is inside of group chats that I don't have access to in the current HTML. I will test this on the next build that works unless you would prefer that I test it on a previous build.

Thanks!

@bepaald
Owner

bepaald commented Jan 28, 2025

Ouch, it seems I have broken things thoroughly. I'm not sure how right now. I will try to see what is going wrong when I get back from work.

I'm also still getting a ton of those errors, but I don't know what messages are causing them so it is hard for me to provide additional information.

I've quickly added the actual address it fails to find to the error message, maybe something about them provides you with insight. These 'source_recipient_id's are supposed to be the senders of group messages (<addr address="[addresshere]" type="137" charset="106" />, note the type="137" that denotes the sender).
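To find which messages lack a usable sender, the senders can be pulled out of the XML directly. The following is only a rough Python sketch, not code from signalbackup-tools: it assumes each group message is an `<mms>` element whose `<addr>` children carry `address` and `type` attributes as in the snippet above, and the function name `mms_senders` and the sample data (including the `type="151"` recipient entry) are made up for illustration.

```python
import xml.etree.ElementTree as ET

SENDER_TYPE = "137"  # per the comment above, type="137" denotes the sender


def mms_senders(xml_text):
    """Return one entry per <mms> element: the sender address, or None if absent."""
    root = ET.fromstring(xml_text)
    senders = []
    for mms in root.iter("mms"):
        sender = None
        for addr in mms.iter("addr"):
            if addr.get("type") == SENDER_TYPE:
                sender = addr.get("address")
                break
        senders.append(sender)
    return senders


# Hypothetical, hand-written sample (all numbers are fake).
SAMPLE = """<smses>
  <mms address="+31611111111~+31622222222">
    <addrs>
      <addr address="+31611111111" type="137" charset="106" />
      <addr address="+31622222222" type="151" charset="106" />
    </addrs>
  </mms>
</smses>"""

print(mms_senders(SAMPLE))
```

Comparing these sender addresses with the addresses in the --listxmlcontacts output might show which ones differ only in formatting.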

I think it would be great to give the user two choices: one for the original scheme (which is very descriptive), or one for a very compact scheme (like id_nn_\id_nn_.html). This seems like a better scheme than producing different output just because it's on Windows. Your scheme for creating HTML with very long paths on Windows did seem to be working correctly after all.

Yes that would be nicer, I might look into that at some point in the future. Because you were having trouble with the original way, and #277 was having trouble with the new way, I wanted a quick solution. Offering different options will take some time to implement correctly and require updates to the argument-parser and README.

All of this content is inside of group chats that I don't have access to in the current HTML. I will test this on the next build that works unless you would prefer that I test it on a previous build.

Right, that's the priority when I get back from work. I hope I can think of something. I'm just going to show what I am seeing when I test this feature currently, maybe you see something noteworthy.

Note, I am passing both --mapxmlcontactnames to set some names for contacts (most don't have any), and --listxmlcontacts which prints the list in the beginning. The contacts marked with (*) are actual chats, the ones without only appear as members of a group. There are 14 *s, then 14 Failed to find matching thread for conversation, creating messages during import of the XML data, and lastly 14 Dealing with thread messages when exporting to html.

 $ bbuild && ./signalbackup-tools --listxmlcontacts SMSBRedit3.xml --mapxmlcontactnames "+31612345678=Faky McFakeman,+31631213121=XZ1C,+31647474977=Svd,+31666666666=Jonathan Dough" --exportplaintextbackuphtml SMSBRedit3.xml PTHTML/ --overwrite
Target "signalbackup-tools" is up to date, nothing to do.
 *** Starting log: 2025-01-28 08:23:03 ***
signalbackup-tools (./signalbackup-tools) source version 20250127.140828 (SQlite: 3.48.0, OpenSSL: OpenSSL 3.4.0 22 Oct 2024)
 is_chat    address             :  name
         +31611111111         : "+31611111111"
         +31612345678         : "Faky McFakeman"
         +31622222222         : "+31622222222"
   (*)   +31631213121         : "XZ1C"
   (*)   +31631213121~+31612345678 : "+31631213121~+31612345678"
   (*)   +31631213121~+31647474977~+31612345678 : "+31631213121~+31647474977~+31612345678"
   (*)   +31631213121~+31647474977~+31612345678~+31666666666~+31611111111~+31622222222~+31633333333~+31644444444~+31655555555~veryveryvreyvreyvreyvrvyevyrevyvryevyrelong group title : "+31631213121~+31647474977~+31612345678~+31666666666~+31611111111~+31622222222~+31633333333~+31644444444~+31655555555~veryveryvreyvreyvreyvrvyevyrevyvryevyrelong group title"
         +31633333333         : "+31633333333"
         +31644444444         : "+31644444444"
   (*)   +31647474977         : "Svd"
         +31655555555         : "+31655555555"
         +31666666666         : "Jonathan Dough"
   (*)   AUTHMSG              : "AUTHMSG"
   (*)   AUTHMSG&#13;         : "AUTHMSG&#13;"
   (*)   Amazon               : "Amazon"
   (*)   CloudOTP             : "CloudOTP"
   (*)   Google               : "Google"
   (*)   TKTMASTER            : "TKTMASTER"
   (*)   Thuisbzgd            : "Thuisbzgd"
   (*)   Vodafone             : "Vodafone"
   (*)   Vodafone             : "Vodafone "
Importing messages into backup... 0/78
[Warning]: Chat partner was not found in recipient-table. Attempting to create.
           NOTE THE RESULTING BACKUP CAN MOST LIKELY NOT BE RESTORED
           ON SIGNAL ANDROID. IT IS ONLY MEANT TO EXPORT TO HTML.
Failed to find matching thread for conversation, creating. (e164: AUTHMSG -> 2 -> thread_id: 1)
Failed to find matching thread for conversation, creating. (e164: AUTHMSG&#13; -> 3 -> thread_id: 2)
Failed to find matching thread for conversation, creating. (e164: Amazon -> 4 -> thread_id: 3)
Failed to find matching thread for conversation, creating. (e164: Thuisbzgd -> 5 -> thread_id: 4)
Failed to find matching thread for conversation, creating. (e164: Vodafone -> 6 -> thread_id: 5)
Failed to find matching thread for conversation, creating. (e164: Vodafone  -> 7 -> thread_id: 6)
Failed to find matching thread for conversation, creating. (e164: CloudOTP -> 8 -> thread_id: 7)
Failed to find matching thread for conversation, creating. (e164: Google -> 9 -> thread_id: 8)
Failed to find matching thread for conversation, creating. (e164: TKTMASTER -> 10 -> thread_id: 9)
Failed to find matching thread for conversation, creating. (e164: +31631213121 -> 1 -> thread_id: 10)
Failed to find matching thread for conversation, creating. (e164: +31647474977 -> 11 -> thread_id: 11)
Failed to find matching thread for conversation, creating. (e164: +31631213121~+31647474977~+31612345678~+31666666666~+31611111111~+31622222222~+31633333333~+31644444444~+31655555555~veryveryvreyvreyvreyvrvyevyrevyvryevyrelong group title -> 19 -> thread_id: 12)
Failed to find matching thread for conversation, creating. (e164: +31631213121~+31647474977~+31612345678 -> 20 -> thread_id: 13)
Failed to find matching thread for conversation, creating. (e164: +31631213121~+31612345678 -> 21 -> thread_id: 14)
Importing messages into backup... 78/78 done!
updateThreadsEntries
  Dealing with thread id: 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14
Checking foreign key constraints... ok
Checking database integrity (full)... ok
Starting HTML export to 'PTHTML/'
Clearing contents of directory 'PTHTML/'...
Dealing with thread 1
Dealing with thread 2
Dealing with thread 3
Dealing with thread 4
Dealing with thread 5
Dealing with thread 6
Dealing with thread 7
Dealing with thread 8
Dealing with thread 9
Dealing with thread 10
Dealing with thread 11
Dealing with thread 12
Dealing with thread 13
Dealing with thread 14
Writing index.html...
All done!

(note all phone numbers in there are fake)

Maybe that's useful. It's hard working on a feature such as this blindly. Maybe at some point it is possible for you to create a minimal example of an XML file where threads are skipped, by deleting all but a few messages and the actual message contents. But first, I'll investigate when I get back from work.

EDIT I think I managed to craft an XML file that shows that same error (Failed to find source_recipient_id in contactmap.), and is missing the group chat. So that should help fixing it.

@bepaald
Owner

bepaald commented Jan 28, 2025

I've made some changes that could have affected your issues, so please try again when you have time. Thanks!

@sjevtic
Author

sjevtic commented Jan 29, 2025

Just a quick response--unfortunately a bit short on time today. The latest build you published (20250128-2) is generating output for more but not all of my group conversations. Of note, the group thread with the problematic .3gp video file I referenced earlier this week is missing.

Still getting a lot of these:

[Error]: Failed to find source_recipient_id in contactmap (XXXXXXXXXX). Should be present at this point. Skipping message

Also started getting a bunch of these:

[Error]: Failed to set new groups membership.
[Error]: After sqlite3_step(): UNIQUE constraint failed: group_membership.group_id, group_membership.recipient_id -> Query: "INSERT INTO group_membership (group_id, recipient_id) VALUES ('__signal_group__fake__XXXXXXXXXX~+1XXXXXXXXXX~+1XXXXXXXXXX', 104)"

Plenty of these too:

[Warning]: Failed to create missing recipient. Skipping message.
[Error]: After sqlite3_step(): UNIQUE constraint failed: recipient.e164 -> Query: "INSERT INTO recipient (profile_given_name, profile_joined_name, avatar_color, e164) VALUES ('+1XXXXXXXXXX', '+1XXXXXXXXXX', 'A120', '+1XXXXXXXXXX') RETURNING _id"

Also some of these:

[Warning]: Unexpectedly got empty contact name for group recipient XXXXXXXXXX

On the conversation pages, I see a lot of senders referred to by number or "(unknown)", including in cases for which a contact name was displayed in the past. The member counts in the details section seem to be accurate now, notwithstanding the above. At the very top of the conversation page,

By sharp contrast, the --listxmlcontacts output actually looks pretty good--many contacts are listed in there that are either numbers or "unknown" on the conversation pages. Some of the contacts are actually groups with every contact inside correctly recognized, but for which the group conversation did not show up in the HTML.

@bepaald
Owner

bepaald commented Jan 29, 2025

Hm, I have to admit I'm quite confused at this point. I've tried long and hard to craft XML files that run into any of those errors, but I can't make it happen. I think I'm making some assumptions about the format of the XML file that may not hold. I do believe most of the errors are simply propagating a single bug and causing each other, so I hope it will turn out to be less bad than it looks, but we'll see.

I've included a little fix (hopefully) for the suddenly missing contact names. Other than that I mostly just added a bunch of verbose output to the process in the hope it will help me understand what is going on. I'd like to ask you to run the tool again and report back all of the output, from the start up until the first Importing messages into backup... N/M-message after the first hard error (or more).

I've attempted to censor all the phone numbers in the output (and there shouldn't be any names or anything else), but the data in the sqlite3_step()-errors will need manual editing. Also, if you feel the censoring is not good enough, of course feel free to change it more, but try to keep it in such a way that it's possible to tell when strings refer to the same id. Also, if you don't want to post it here, you could of course always email me directly.

I of course don't know what your XML file looks like, maybe I'm asking you to manually inspect and adjust thousands of lines of output before that first error, so don't feel obligated, I'll keep doing my best to get this working either way. But at this point this seems the quickest way to get some valuable info.

Thanks!

@sjevtic
Author

sjevtic commented Jan 30, 2025

Hm, I have to admit I'm quite confused at this point. I've tried long and hard to craft XML files that run into any of those errors, but I can't make it happen. I think I'm making some assumptions about the format of the XML file that may not hold. I do believe most of the errors are simply propagating a single bug and causing each other, so I hope it will turn out to be less bad than it looks, but we'll see.

I suspect part of what you're facing here is that there is a lot less consistency in SMSBR XML files than in anything you see in the Signal ecosystem, which is pretty closed. By contrast, the SMS/MMS ecosystem (and the resulting SMSBR backups) is subject to nuances associated with different carriers and messaging apps.

Do you have any logic for handling number normalization? For example, I store all of my contacts with a leading '+' followed by country code and the local number, like this: "+1 (847) 555-1212". This way my contacts "just work" when I am roaming internationally. Lots of users don't have this discipline because they don't have to--simply storing "8475551212" is sufficient for messaging and calls to work domestically. Somewhere in the stack, the unnecessary punctuation and spaces I use get stripped; SMSBR reports the number as +18475551212. The real challenge here is with incoming messages, since the formatting/completeness of the number is at the carrier's discretion. It's common for incoming messages to be lacking the + country code (e.g., "8475551212"), and when roaming internationally I have seen all sorts of odd prefixes attached to incoming numbers when I receive calls, many of which don't make a lot of sense.

I've included a little fix (hopefully) for the suddenly missing contact names. Other than that I mostly just added a bunch of verbose output to the process in the hope it will help me understand what is going on. I'd like to ask you to run the tool again and report back all of the output, from the start up until the first Importing messages into backup... N/M-message after the first hard error (or more).

I'm still seeing a lot of issues in my chats. I find this set of messages interesting:

Failed to find matching thread for conversation, creating. (e164: +xxxxxxx1717 -> 5 -> thread_id: 4)
[Error]: Failed to find source_recipient_id in contactmap (xxxxxx1717). Should be present at this point. Skipping message

So this appears to be related to a number normalization issue: somehow the two numbers are not recognized as belonging to the same contact. Yet strangely, the --listxmlcontacts output up above seems to have figured it out:

(*) +1XXXXXX1717 : "Mr. Faky McFakerson"
(*) XXXXXX1717 : "Mr. Faky McFakerson"

That said, are these somehow being handled as separate contacts? That wouldn't make sense.

I've attempted to censor all the phone numbers in the output (and there shouldn't be any names or anything else), but the data in the sqlite3_step()-errors will need manual editing. Also, if you feel the censoring is not good enough, of course feel free to change it more, but try to keep it in such a way that it's possible to tell when strings refer to the same id. Also, if you don't want to post it here, you could of course always email me directly.

I could redact numbers from all these SQL error messages with some regexes to produce output similar to what you are doing in the other messages. If that's useful, I'll work on it in the next day or so, and preferably send that via some means other than GitHub (Signal?)
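A regex-based redaction along these lines might look like the following Python sketch. It is an assumption about what would be useful, not an agreed format: it masks every run of 7 or more digits, keeps an optional leading '+' and the last 4 digits so that strings referring to the same number remain recognizable, and leaves short values such as recipient ids untouched. The name `redact_numbers` and its parameters are hypothetical.

```python
import re


def redact_numbers(text, keep=4, min_digits=7):
    """Mask phone-number-like digit runs, keeping a leading '+' and the last `keep` digits."""
    pattern = re.compile(r"\+?\d{%d,}" % min_digits)

    def mask(match):
        found = match.group(0)
        plus = "+" if found.startswith("+") else ""
        digits = found.lstrip("+")
        return plus + "X" * (len(digits) - keep) + digits[-keep:]

    return pattern.sub(mask, text)


# Fake numbers, shaped like the group id in the error message above.
line = "VALUES ('__signal_group__fake__8475551212~+18475551212', 104)"
print(redact_numbers(line))
```

Note that any other long digit run (a timestamp, for example) is masked as well, which is probably acceptable for log output.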

I of course don't know what your XML file looks like, maybe I'm asking you to manually inspect and adjust thousands of lines of output before that first error, so don't feel obligated, I'll keep doing my best to get this working either way. But at this point this seems the quickest way to get some valuable info.

If the above proposal is less than desirable, we could have a Teams meeting and I can show you the live output being produced by signalbackup-tools; I am willing to dig through the XML for corresponding data to try to get to the bottom of some of these issues.

Thanks!

@bepaald
Owner

bepaald commented Jan 30, 2025

Do you have any logic for handling number normalization?

Ugh, right! That is a big problem, and could very well be the cause of most (if not all) errors you are seeing. Though for numbers with a contact_name field filled in (and the same), it should already mostly work, since once the contact is created, it is matched by name first. So if +1XXXXXX1717 is created as "Mr. Faky McFakerson", then any subsequent message with that name is matched to that same contact, even if the address is XXXXXX1717. But when a name is not filled in, the literal address field is used for matching, which will split contacts into separate ones. Also, in group messages the <addr> elements that hold the individual group members do not have a name field at all... At least there is no point in testing further or supplying me any output until some of this is implemented.

I was really counting on the e164 format, the way Signal stores it. I don't really understand why messaging apps don't also normalize before storing the number, even for incoming messages. It has to be done anyway, to place a message in the correct conversation I assume.

Anyway, it is quite a tough problem. I remember having investigated this years ago and giving up on it. I don't think there exist implementations that actually do this 100% correctly, let alone that I can write one. The simplest thing I can do, which would probably catch a lot of cases, is to strip non-numerical characters (except '+') and replace any leading double '0' with a '+'. However, there are then still plenty of possible problems, probably most commonly when the country code is left off of a number in one message but not in another: I don't think there is any way to ever match that automatically without being told the country code to prepend (for each number). Unless I simply start matching only the last X digits of every number (maybe the last 6 or 7), but it's obvious that this could cause issues as well. Honestly, this might be best solved by the user preparing the XML file to normalize the numbers, before feeding it to this tool.
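The simple normalization described above could be sketched roughly as follows. This is only an illustration in Python, not the code in signalbackup-tools; the function names `normalize_number` and `suffix_key` and the `country_code` parameter are made up for the example, and addresses containing letters (such as "Amazon") are assumed to be sender ids that should be left alone.

```python
import re


def normalize_number(address, country_code=None):
    """Strip everything except digits and a leading '+', and turn a leading '00' into '+'."""
    if re.search(r"[A-Za-z]", address):
        return address  # alphanumeric sender id, not a phone number
    cleaned = re.sub(r"[^0-9+]", "", address)
    has_plus = cleaned.startswith("+")
    digits = cleaned.replace("+", "")
    if not digits:
        return address
    if has_plus:
        return "+" + digits
    if digits.startswith("00"):
        return "+" + digits[2:]
    if country_code:
        # Only possible when the caller says which country code to prepend.
        return country_code + digits
    return digits


def suffix_key(address, digits=7):
    """Alternative matching key: only the last `digits` digits (may merge distinct numbers)."""
    return re.sub(r"\D", "", address)[-digits:]


print(normalize_number("+1 (847) 555-1212"))
print(normalize_number("8475551212", country_code="+1"))
print(suffix_key("+18475551212") == suffix_key("8475551212"))
```

The suffix approach needs no country code, but two different numbers that share their last digits would end up in the same conversation.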

I'll do some more research, and have a think about this.

@bepaald
Owner

bepaald commented Jan 31, 2025

Having slept on it, I think there are several things I could do to help this situation. A couple of questions:

Does this same issue apply to group-addresses? That is, if one group message has an address list including +1XXXXXX1717, and another in the same group chat has XXXXXX1717, is this reflected in the group-address? That is, is one group addr1~addr2~+1XXXXXX1717~... and the other addr1~addr2~XXXXXX1717~...? I'm assuming so, but just to be sure.

When you look at the --listxmlcontacts output, do many (or any) of the contacts have no name? That is, is their name a copy of the phone number field? Or are there any that are completely empty?

Lastly, about the diamond characters: I noticed one of my contacts being split over two chats, despite the address being in the same format in both (+31612345678). After close inspection, it turned out that for a few of those messages a left-to-right mark (U+200E) was inserted before the number. This would probably go missing when copying and pasting here, just like in the example you posted. Maybe you could inspect that particular message in a hex editor and check for any unexpected characters (basically anything outside of 0x20-0x7E)? Also, I noticed that the encoding is specified at the top of the XML file (UTF-8). Is that the same in your case?
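As an alternative to a hex editor, a few lines of Python can list every character outside 0x20-0x7E in a string, for example an address or message body copied out of the XML file. This is just a sketch; the function name `unexpected_chars` is made up.

```python
import unicodedata


def unexpected_chars(text):
    """Return (index, code point, Unicode name) for each character outside 0x20-0x7E."""
    found = []
    for index, char in enumerate(text):
        if not 0x20 <= ord(char) <= 0x7E:
            # Control characters such as newlines have no Unicode name.
            name = unicodedata.name(char, "<unnamed>")
            found.append((index, "U+%04X" % ord(char), name))
    return found


print(unexpected_chars("\u200e+31612345678"))
# prints: [(0, 'U+200E', 'LEFT-TO-RIGHT MARK')]
```

Legitimate non-ASCII text (accented letters, emoji) is reported too, so the output still needs a human look.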

Thanks!

@bepaald
Owner

bepaald commented Jan 31, 2025

I've made some changes, both to normalize phone numbers and to make sure any unique contact name only points to a single address.

I no longer dare to guess whether this has made things better, worse, or changed nothing at all. But if you have the time you can try it out. You'll probably want to add the option --setcountrycode "+1". If you still see multiple entries for what is supposed to be a single contact in the --listxmlcontacts output, I'd be interested to know if there's any simple operation I can do on the addresses to merge them.

I will not have much time to do more work on this until Sunday probably.

@sjevtic
Author

sjevtic commented Feb 2, 2025

Wow, this build is a massive improvement, thank you! --listxmlcontacts output is much cleaner, though I did notice that a number with a contact that really begins with "+381" was listed as beginning with "+111381". I don't think there are any other international numbers in the 2024 archive I have been testing with so I don't really have a lot of data points for international numbers. It also looks like just about all of my group conversations loaded correctly. There are still a handful of issues with respect to contact names in group conversations. In one case I have a conversation with a total of 11 members including myself:

  • 8 names are listed at the top of the conversation page
  • 4 names are listed in details
  • 7 numbers are listed in details

This of course raises a couple questions:

  1. Why are fewer contacts named in the details section than at the top of the page?
  2. It seems like the listing at the top of the page should have something for unnamed participants (numbers would be alright).

So, at least on my data, it looks like the number normalization logic you introduced really helps. Strangely enough, the messaging stack on my phone must be really good at this because conversations are usually threaded correctly after I restore a SMSBR backup.

This build generated a lot of verbose output without me turning anything on. In other matters:

  • The example .3gp file that I saved out of Google Messages (and that can be played in Google messages) is identical to the file in the media directory of the conversation's HTML output.
  • The sample message that shows up as all question-mark-in-diamond characters appears to be properly represented in the SMSBR XML file as illustrated in the hex dump screenshot below.

Image

@bepaald
Owner

bepaald commented Feb 2, 2025

Good, finally a little progress.

I did notice that a number with a contact that really begins with "+381" was listed as beginning with "+111381"

Even with my very limited and simple phone number normalization, I'm running into the messiness of it. The + sign in the e164 standard is just a placeholder for the international call prefix. I thought this prefix was 00 almost everywhere, but in the USA and Canada, it's 011. So the tool did not recognize the prefix of this number (which undoubtedly was 011381...), and prepended +1 and removed the leading zero. I have now fixed it to also recognize 011 as a call prefix, so this should be fixed. But as I said before, this is a nearly impossible task (there are many more possible prefixes used in smaller territories around the world, some even very likely clash with valid local area codes), so for more exotic problems, I think users of this function will really have to manually normalize the numbers in their XML files (or I might add more options to set default prefixes, and such things, but even then I don't think it can ever be 100% correct).
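How "+111381" comes about can be reproduced with a small Python sketch. It is a reconstruction from the description above, not the tool's actual code: `add_country_code`, the `call_prefixes` parameter and the phone number are invented for the example.

```python
def add_country_code(number, country_code="+1", call_prefixes=("00",)):
    """Turn a digits-only number into '+...' form, given the known international call prefixes."""
    if number.startswith("+"):
        return number
    for prefix in call_prefixes:
        if number.startswith(prefix):
            return "+" + number[len(prefix):]
    # No known call prefix: treat as a local number, drop one leading zero, prepend country code.
    if number.startswith("0"):
        number = number[1:]
    return country_code + number


raw = "011381601234567"  # fake number, dialed from the USA with the 011 prefix

print(add_country_code(raw))
# prints: +111381601234567

print(add_country_code(raw, call_prefixes=("011", "00")))
# prints: +381601234567
```

Keeping the prefixes configurable rather than hard-coded matters because, as noted above, a prefix that is valid in one country can clash with a local area code in another.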

In one case I have a conversation with a total of 11 members including myself:

8 names are listed at the top of the conversation page
4 names are listed in details
7 numbers are listed in details

The names at the top of the conversation page are simply what is listed as the contact_name for this group in the XML file. Apparently on your phone, the contact_name is set to the names of the group members (though not all of them?). On my phone, all groups are simply (unknown). In some messenger apps, I have the ability to actually name groups myself, so I can only assume SMSB&R would then possibly use this as the contact_name. Either way, the tool does not parse this contact_name field at all currently, and I'm not sure it could, given that the number of names in there apparently does not necessarily match the number of group participants.

As for the names and numbers in the details, at least these add up to 11, so that seems right. It seems 7 of those numbers do not appear with an associated name in the XML file. Could this be correct? Does --listxmlcontacts show names for any of these 7 addresses?

This build generated a lot of verbose output without me turning anything on

True, in order to get more useful information on errors, I added a bunch of output. This entire function is in active development and undocumented, but if it ever gets completed and working somewhat reliably I will clean up the output. Are there still any of the errors you saw before? Any of these:

[Error]: Failed to find source_recipient_id in contactmap (XXXXXXXXXX). Should be present at this point. Skipping message
[Error]: Failed to set new groups membership.
[Error]: After sqlite3_step(): UNIQUE constraint failed: group_membership.group_id, group_membership.recipient_id -> Query: "INSERT INTO group_membership (group_id, recipient_id) VALUES ('__signal_group__fake__XXXXXXXXXX~+1XXXXXXXXXX~+1XXXXXXXXXX', 104)"
[Warning]: Failed to create missing recipient. Skipping message.
[Error]: After sqlite3_step(): UNIQUE constraint failed: recipient.e164 -> Query: "INSERT INTO recipient (profile_given_name, profile_joined_name, avatar_color, e164) VALUES ('+1XXXXXXXXXX', '+1XXXXXXXXXX', 'A120', '+1XXXXXXXXXX') RETURNING _id"
[Warning]: Unexpectedly got empty contact name for group recipient XXXXXXXXXX

The example .3gp file that I saved out of Google Messages (and that can be played in Google messages) is identical to the file in the media directory of the conversation's HTML output.

Ok, so in that regard I believe there is no problem in this function. As far as I know there is no good way to query the browser whether a video file is supported or not.

The sample message that shows up as all question-mark-in-diamond characters appears to be properly represented in the SMSBR XML file as illustrated in the hex dump screenshot below.

Indeed, that all looks good. This is a mysterious one. If you inspect the HTML output, what does this message look like, maybe there are HTML entities in there, any other weirdness, or does it look ok? In my test it looks correct:

          <!-- Message: _id:68,type:10485783 -->
          <div class="msg msg-outgoing msg-sender-1">
            <div>
              <pre>he seems to be entering a really rowdy phase. he was trying to grab my yeti cup off the coffee table yesterday right before i left. wouldn't stop when i screamed "NO" at him either.</pre>
            </div>
            <div class="footer">
              <span class="msg-data">Dec 19, 2024 18:14:00</span>
              <div class="footer-icons checkmarks-received">
              </div>
            </div>
          </div>

If there are diamonds in the HTML file as well, maybe you could hexdump that too, to see what the actual byte values are? Maybe that helps a little. Also, please double check the timestamp to be sure the message really corresponds to the diamonds-one (note it's localized in the HTML, so mine in the output above will be different from yours).

Thanks!

@sjevtic
Author

sjevtic commented Feb 2, 2025

I thought this prefix was 00 almost everywhere, but in the USA and Canada, it's 011.

Yes, it is 011 here. That said, the '+' notation is appealing both because it is compact and it "just works"--when all contacts are stored this way, in my experience the phone will correctly dial/send a message to any contact anywhere in the world from anywhere in the world (i.e., without regard for where the phone might be roaming or the country of the contact).

So the tool did not recognize the prefix of this number (which undoubtedly was 011381...), and prepended +1 and removed the leading zero. I have now fixed it to also recognize 011 as a call prefix, so this should be fixed.

This number is listed correctly in --listxmlcontacts output now (20250202 build). However, it looks like normalization is now otherwise broken in that many of my US contacts are no longer prefixed with +1 as they were in the 20250131-1 build.

But as I said before, this is a nearly impossible task (there are many more possible prefixes used in smaller territories around the world, some even very likely clash with valid local area codes), so for more exotic problems, I think users of this function will really have to manually normalize the numbers in their XML files (or I might add more options to set default prefixes, and such things, but even then I don't think it can ever be 100% correct).

100% effectiveness is probably not realistically achievable. Things get especially complicated with the fact that country codes aren't all the same length, different schemes are used for reporting numbers of incoming messages by carrier, etc. I thought about trying to normalize the numbers in my files at the beginning of all this, and honestly I'm not sure how practical it would be. From my perspective, if it is more than a couple of regexes, it's probably not a realistic expectation for users to do this.

The names at the top of the conversation page are simply what is listed as the contact_name for this group in the XML file. Apparently on your phone, the contact_name is set to the names of the group members (though not all of them?). On my phone, all groups are simply (unknown). In some messenger apps, I have the ability to actually name groups myself, so I can only assume SMSB&R would then possibly use this as the contact_name. Either way, the tool does not parse this contact_name field at all currently, and I'm not sure it could, given that the number of names in there apparently does not necessarily match the number of group participants.

Interesting. Google Messages definitely gives you the option to set group names. The default group name displayed in the UI is a list of comma-separated contact names, followed by comma-separated numbers for participants with no contact available. That doesn't match the names shown in the HTML export though, so maybe these don't get written to the system messaging database for SMSBR to grab. Perhaps it makes sense to offer an option to have the tool generate a name for group chats based on the actual participant data shown in the details section.
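Generating such a name from the participant data could be as simple as the following Python sketch, which mimics the Google Messages default described above (names first, then bare numbers). The function name `default_group_title` and the data layout are assumptions for illustration only.

```python
def default_group_title(members):
    """Build a title from (address, contact_name or None) pairs: names first, then numbers."""
    names = [name for _, name in members if name]
    numbers = [address for address, name in members if not name]
    return ", ".join(names + numbers)


# Fake participants, in the style of the test data earlier in this thread.
members = [
    ("+31612345678", "Faky McFakeman"),
    ("+31611111111", None),
    ("+31666666666", "Jonathan Dough"),
]

print(default_group_title(members))
# prints: Faky McFakeman, Jonathan Dough, +31611111111
```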

As for the names and numbers in the details, at least these add up to 11, so that seems right. It seems 7 of those numbers do not appear with an associated name in the XML file. Could this be correct? Does --listxmlcontacts show names for any of these 7 addresses?

It looks like there is a trend among these: all of the contacts with no names displayed in the details section of a group chat appear to be those with which I have no individual conversations. They don't show up as individuals in --listxmlcontacts. Is there anything that can even realistically be done about this on the signalbackup-tools side (especially given your suggestion that group names aren't reliable indicators of the contacts in a group)?

If support from the tool to correctly determine these contact names from SMSBR XML is not possible, it would be really neat to be able to pass in the contact export file from my phone (a .vcf file of sorts, at least in the Samsung ecosystem) and get contact names resolved that way. In the absence of that, I guess I could write a script that calls signalbackup-tools with an outrageously long --mapxmlcontactnames argument. I'm sure that will end in some other form of disappointment like exceeding the maximum command line length limit (which is apparently 8191 characters on Windows).

It's also a bit peculiar that my own number is listed in the group details section. It seems like it would make sense to put my name in there, just like it is in details from HTML exports of real Signal chats. I guess I'm not entirely surprised by this outcome though since I am not in --listxmlcontacts output.

True, in order to get more useful information on errors, I added a bunch of output. This entire function is in active development and undocumented, but if it ever gets completed and working somewhat reliably I will clean up the output. Are there still any of the errors you saw before? Any of these:

  • Failed to find source_recipient_id: 135 using 20250130-01 build, 49 using 20250202 build.
  • Failed to set new groups membership: 923 using 20250130-01 build, 0 using 20250202 build.
  • Failed to create missing recipient. Skipping message: 0 using 20250130-01 build, 0 using 20250202 build.
  • Unexpectedly got empty contact name for group recipient: 0 using 20250130-01 build, 0 using 20250202 build.

Ok, so in that regard I believe there is no problem in this function. As far as I know there is no good way to query the browser whether a video file is supported or not.

The .3gp files are pretty ubiquitous for low quality MMS video. Neither Firefox nor Edge appear to support them. I don't know what the answer here is, but it would be very challenging to replace them in preprocessing. If I could somehow grab the data without actually parsing the XML (i.e., use a regex or something simple), it might be doable (e.g., base64 decode, transcode with ffmpeg, base64 encode, replace). That's still far from "easy" though.

If there are diamonds in the HTML file as well, maybe you could hexdump that as well to see what the actual bytecodes of them are?

Here's a look at the hex dump of the body of that message in HTML:

Image

Also, please double check the timestamp to be sure the message really corresponds to the diamonds-one (note it's localized in the HTML, so mine in the output above will be different from yours).

I definitely double checked this. I noticed the timestamp localization too. That's a nice touch.

@bepaald
Owner

bepaald commented Feb 3, 2025

I've added a bunch of output to the process:

When you now run the program, with --exportplaintextbackuphtml, as you've been doing, you will first see a list of normalized addresses (in your case, possibly a long list):

 $ ./signalbackup-tools --exportplaintextbackuphtml SMSBRedit2.XML PTHTML/ --setcountrycode "+1" --append
 *** Starting log: 2025-02-03 14:04:58 ***
signalbackup-tools (./signalbackup-tools) source version 20250203.140245 (SQlite: 3.48.0, OpenSSL: OpenSSL 3.4.0 22 Oct 2024)
normalizePhoneNumber in:  00-1-202-688-5500
normalizePhoneNumber out: +12026885500
normalizePhoneNumber in:  (202)688-5500
normalizePhoneNumber out: +12026885500
normalizePhoneNumber in:  +12026885500
normalizePhoneNumber out: +12026885500
normalizePhoneNumber in:  011-1-202-688-5500
normalizePhoneNumber out: +12026885500
normalizePhoneNumber in:  011381688-5500
normalizePhoneNumber out: +3816885500
normalizePhoneNumber in:  00381688-5500
normalizePhoneNumber out: +3816885500
normalizePhoneNumber in:  +381688-5500
normalizePhoneNumber out: +3816885500
[etc...]

Maybe you could provide some examples of numbers that normalized correctly before, but don't anymore. The numbers I'm testing still seem to work, and I don't immediately see how my previous change could have broken it. If you see some obvious pattern in them, describing it may suffice. Otherwise, just post it, replacing the final digits with x's (the leading characters are probably important).


edit: nevermind this part, I don't think it's useful, I'll come up with some better debugging info later.

Following this you should see something like:

--------------------------------
| COUNT(DISTINCT numaddresses) |
--------------------------------
| 1                            |
[...]
--------------------------------

This is my attempt at finding out about the remaining Failed to find source_recipient_id errors. My working theory is now that there are groups (with a constant <mms address=), with a changing list of <addr>s. This should tell me if this is the case, please let me know what numbers you get in that table (basically, are there any > 1).


edit: nevermind this, this problem should be fixed.

Then you will see:

 === Row 1/1 ===
     body : he seems to be entering a really rowdy phase. he was trying to grab my yeti cup off the coffee table yesterday right before i left. wouldn&apos;t stop when i screamed "NO" at him either.
HEX(body) : 6865207365656D7320746F20626520656E746572696E672061207265616C6C7920726F7764792070686173652E2068652077617320747279696E6720746F2067726162206D79207965746920637570206F66662074686520636F66666565207461626C6520796573746572646179207269676874206265666F72652069206C6566742E20776F756C646E2661706F733B742073746F70207768656E20692073637265616D656420224E4F222061742068696D206569746865722E

This is me tracking where the body of that message turns into garbage. Note I made changes which may have actually fixed this, though I am not 100% sure how (I did figure this out, and it really should be fixed). I'd say check the HTML output and let me know. If it's still not fixed, the output above may be interesting. The lines printed here should be read right from the XML file.
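
A HEX(body) dump like the one above can also be checked outside the tool. The following is a minimal sketch, not part of signalbackup-tools (the helper name `decode_hex_body` is made up for illustration): it turns the hex string back into bytes and reports whether they are valid UTF-8, since invalid byte sequences are the usual source of "diamond" replacement characters.

```python
def decode_hex_body(hex_body):
    """Decode a HEX(body) string; return (text, is_valid_utf8)."""
    raw = bytes.fromhex(hex_body)
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Invalid bytes are rendered as U+FFFD, the "diamond" replacement character.
        return raw.decode("utf-8", errors="replace"), False
```

For example, the first eight bytes of the dump above, `6865207365656D73`, decode to `he seems` and are reported as valid.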


Then a few more tables are printed. In my case it looks like this:

------------
| COUNT(*) |
------------
| 10       |
------------
--------
| type |
--------
| 2    |
--------
----------------
| numaddresses |
----------------
| 1            |
----------------
------------
| COUNT(*) |
------------
| 0        |
------------
(no results)
(no results)

What do you see?


edit: nevermind this, again, this problem should be fixed.

Then starts the normal import process, which still generates a ton of output (I'll clean it up sometime soon I think). Somewhere in between there, you will see:

Body before XML unescape: "he seems to be entering a really rowdy phase. he was trying to grab my yeti cup off the coffee table yesterday right before i left. wouldn&apos;t stop when i screamed "NO" at him either."
Body after XML unescape:  "he seems to be entering a really rowdy phase. he was trying to grab my yeti cup off the coffee table yesterday right before i left. wouldn't stop when i screamed "NO" at him either."
Body after HTML escape :  "he seems to be entering a really rowdy phase. he was trying to grab my yeti cup off the coffee table yesterday right before i left. wouldn&apos;t stop when i screamed &quot;NO&quot; at him either."

Again, tracking when the message turns to diamond characters. If that issue was still not fixed, please let me know what these three lines say.
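
The three stages printed above can be reproduced in a few lines for anyone who wants to test a suspicious message body by hand. This is only a sketch using Python's standard library, not the tool's own code; the helper names `xml_unescape` and `html_escape` are illustrative.

```python
from xml.sax.saxutils import escape, unescape

# saxutils handles &amp;, &lt; and &gt; by default; quotes are passed explicitly.
_EXTRA_UNESCAPE = {"&apos;": "'", "&quot;": '"'}
_EXTRA_ESCAPE = {"'": "&apos;", '"': "&quot;"}


def xml_unescape(body):
    """Turn XML entities in an attribute value back into plain characters."""
    return unescape(body, _EXTRA_UNESCAPE)


def html_escape(body):
    """Escape plain text so it can be embedded safely in HTML."""
    return escape(body, _EXTRA_ESCAPE)
```

Feeding the "before XML unescape" line through `xml_unescape` and then `html_escape` should give the same text as the second and third lines of the output above.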


Perhaps it makes sense to offer an option to have the tool generate a name for group chats based on the actual participant data shown in the details section.

I don't mind implementing that when most other things are working properly. In the meantime, --mapxmlcontacts is the workaround.

Is there anything that can even realistically be done about this on the signalbackup-tools side (especially given your suggestion that group names aren't reliable indicators of the contacts in a group)?

If the name does not appear in the XML at all, or only in incomplete and out-of-order contact_name fields, I don't see what can be done automatically. Again, --mapxmlcontacts is the way to go. Same with your own name. I could probably add an option to read the map from a file at some point, if you run into more Windows limitations.

I am a bit surprised you are not listed by --listxmlcontacts. Is your own number never in any group message's <addr> list? Maybe the XML owner's number is simply always implied as the sender or a recipient (depending on whether the message is incoming or outgoing), just by the fact of the message being present in the XML. That too would be a change from the XML files I know.

Thanks!

@sjevtic
Author

sjevtic commented Feb 5, 2025

Sorry for the delayed response.

Maybe you could provide some examples of numbers that normalized correctly before, but don't anymore. The numbers I'm testing still seem to work, and I don't immediately see how my previous change could have broken it. If you see some obvious pattern in them, describing it may suffice. Otherwise, just post it, replacing the final digits with x's (the leading characters are probably important).

The 20250203-1 build seems to have fixed the number normalization regression I reported in my previous test. The +381 number is also correctly handled in this build.

This is the only normalization issue I found:

normalizePhoneNumber in: 0111001708XXXXXXX
normalizePhoneNumber out: +1001708XXXXXXX

In this case, I think this was one of those situations where a roaming carrier reported an incoming message from a US number in a strange format while I was out of the country. Interestingly, the normalization result is almost right--but the country code is listed a second time, with zero padding to 3 positions.

This also showed up in the normalization section:

[Warning]: Skipping message, missing required attribute 'address'

This is my attempt at finding out about the remaining Failed to find source_recipient_id errors. My working theory is now that there are groups (with a constant <mms address=), with a changing list of <addr>s. This should tell me if this is the case, please let me know what numbers you get in that table (basically, are there any > 1).

This is what I got:

--------------------------------
| COUNT(DISTINCT numaddresses) |
--------------------------------
| 1                            |
--------------------------------

This is me tracking where the body of that message turns into garbage. Note I made changes which may have actually fixed this, though I am not 100% sure how (I did figure this out, and it really should be fixed). I'd say check the HTML output and let me know. If it's still not fixed, the output above may be interesting. The lines printed here should be read right from the XML file.

I got this:

=== Row 1/1 ===
     body : he seems to be entering a really rowdy phase.  he was trying to grab my yeti cup off the coffee table yesterday right before i left.  wouldn&apos;t stop when i screamed "NO" at him either.
HEX(body) : 6865207365656D7320746F20626520656E746572696E672061207265616C6C7920726F7764792070686173652E202068652077617320747279696E6720746F2067726162206D79207965746920637570206F66662074686520636F66666565207461626C6520796573746572646179207269676874206265666F72652069206C6566742E2020776F756C646E2661706F733B742073746F70207768656E20692073637265616D656420224E4F222061742068696D206569746865722E

The rowdy puppy text was also correctly displayed in the HTML output starting in this build. I also didn't see any instances of this corruption elsewhere in output produced by this build. Nice work!

What do you see?

I see this:

------------
| COUNT(*) |
------------
| 0        |
------------
(no results)
(no results)
------------
| COUNT(*) |
------------
| 0        |
------------
(no results)
(no results)

Again, tracking when the message turns to diamond characters. If that issue was still not fixed, please let me know what these three lines say.

As noted above, this issue is resolved. But in the interest of completeness, here is what I got:

Body before XML unescape: "he seems to be entering a really rowdy phase.  he was trying to grab my yeti cup off the coffee table yesterday right before i left.  wouldn&apos;t stop when i screamed "NO" at him either."
Body after XML unescape:  "he seems to be entering a really rowdy phase.  he was trying to grab my yeti cup off the coffee table yesterday right before i left.  wouldn't stop when i screamed "NO" at him either."
Body after HTML escape :  "he seems to be entering a really rowdy phase.  he was trying to grab my yeti cup off the coffee table yesterday right before i left.  wouldn&apos;t stop when i screamed &quot;NO&quot; at him either."

I also found 49 of these in the import section:

[Error]: Failed to find source_recipient_id in contactmap (+xxxxxxx7393). Should be present at this point. Skipping message

Every one refers to this exact number. The contact has no name, but is listed in the --listxmlcontacts section as a number so I am perplexed why these messages could not be imported.

If the name does not appear in the XML at all, or only in incomplete and out-of-order contact_name fields, I don't see what can be done automatically. Again, --mapxmlcontacts is the way to go. Same with your own name. I could probably add an option to read the map from a file at some point, if you run into more Windows limitations.

I used several name assignments via --mapxmlcontacts in my most recent test, including for my own name, and the result was exactly as expected.

I have not done any experiments to see if I can create a backup with signalbackup-tools and use the SQL statement feature to import contacts into a backup ahead of importing the XML. This might be a nice approach to handling contact mapping too.

Any thoughts on approaches for handling .3gp videos?

@bepaald
Owner

bepaald commented Feb 7, 2025

Sorry for the delayed response.

Not a problem, I don't always have time myself. Thanks for your thorough response.

The 20250203-1 build seems to have fixed the number normalization regression I reported in my previous test.

That's curious. Just like I didn't do anything to break it in the previous build, I didn't do anything to fix it in this one; I only added a bunch of output. But I'm happy it's working.

This is the only normalization issue I found:

normalizePhoneNumber in: 0111001708XXXXXXX
normalizePhoneNumber out: +1001708XXXXXXX

That's a weird one, it has two international prefixes and countrycodes (011, 1, followed by 00, 1). The program deals with the first but doesn't consider there is a second one. I'm not sure I can (or should honestly) attempt to deal with this. I'll have a think about it. Note that when mapping this contact to the exact same name as the contact with address +1708XXXXXXX, the contacts should still be effectively merged into one (though which phone number they are listed under will be random).

This also showed up in the normalization section:

[Warning]: Skipping message, missing required attribute 'address'

Also very curious. I don't think I touched this part of the code in a long time, if there is a message without an address field in your XML file, I would have expected this warning to have been shown basically since the beginning.

I have made a change so that when such a message is found and skipped, the XML-node of that message is printed to screen (it will likely contain private info, so don't paste it here). Do you see anything special about it? Is there really no address attribute?

I also found 49 of these in the import section:

[Error]: Failed to find source_recipient_id in contactmap (+xxxxxxx7393). Should be present at this point. Skipping message

Every one refers to this exact number. The contact has no name, but is listed in the --listxmlcontacts section as a number so I am perplexed why these messages could not be imported.

Me too, and it has been bugging me for a while. I've added some extra output to this error. I'm not completely sure it will give me a hint, because I'm not sure what I'm looking for and I can't reproduce no matter what I try. But maybe you could inspect that output, censor a little if necessary and post it here. Also, any messages about creating threads or recipients directly preceding that error might be relevant.

I have not done any experiments to see if I can create a backup with signalbackup-tools and use the SQL statement feature to import contacts into a backup ahead of importing the XML. This might be a nice approach to handling contact mapping too.

I'm not entirely sure what you mean by this. If you want to use --runsqlquery simply to insert a recipient in the Android backup, it's a little more complicated than that. At least if you need a valid recipient. If it's just for exporting HTML I guess it could work (it's what this function already does internally), but I don't think it would change the mapping very much. The contacts in the XML file still need to be matched to those in the backup, either by name (which may not be exactly the same, and for group-only contacts is missing from the XML file entirely) or by phone number (which would have the same normalization issues we are dealing with here).

Any thoughts on approaches for handling .3gp videos?

No, like I said, there is no querying the browser whether it can play any specific video (as far as I know). I could simply exclude 3gp-type attachments from being handled as video types, but that would turn it into any generic attachment in the HTML (still might be an improvement though, at least there will then be a download-button). Also, while your browser (and mine) doesn't play the 3gp files, for all I know other users have setups where they just work, or a future update will add support for them. Not sure what to do, do you have any ideas?

Thanks!

@sjevtic
Author

sjevtic commented Feb 8, 2025

Note that when mapping this contact to the exact same name as the contact with address +1708XXXXXXX, the contacts should still be effectively merged into one (though which phone number they are listed under will be random).

Looking forward to seeing what you come up with.

I have made a change so that when such a message is found and skipped, the XML-node of that message is printed to screen (it will likely contain private info, so don't paste it here). Do you see anything special about it? Is there really no address attribute?

Wow, there really is no address attribute in the <mms> tag. So strange. However, the <addrs> sub-node is present and further contains <addr> sub-nodes with proper data including address attributes. Maybe these need to be used by signalbackup-tools as a fallback when the primary address attribute is missing?

Interestingly I see creator="com.google.android.apps.messaging" for this node.

I also found 49 of these in the import section:
[Error]: Failed to find source_recipient_id in contactmap (+xxxxxxx7393). Should be present at this point. Skipping message
Every one refers to this exact number. The contact has no name, but is listed in the --listxmlcontacts section as a number so I am perplexed why these messages could not be imported.

Me too, and it has been bugging me for a while. I've added some extra output to this error. I'm not completely sure it will give me a hint, because I'm not sure what I'm looking for and I can't reproduce no matter what I try. But maybe you could inspect that output, censor a little if necessary and post it here. Also, any messages about creating threads or recipients directly preceding that error might be relevant.

I got even more of these (60) using the 20250207 build including for some other numbers. Here is an example.

Creating recipient for address +xxxxxxx9406 (group: false)
Failed to find matching thread for conversation, creating. (e164: +xxxxxxx9406 -> 2 -> thread_id: 1)
[Error]: Failed to find source_recipient_id in contactmap (xxxxxx9406). Should be present at this point. Skipping message
         Extra info:
         Thread recipient: 2
         Thread is 1-on-1
         First message for this address was (probably):
--------------------------------------------------------------------------------------------------------------------------------------------------------
| rowid | date          | type | read | contact_name    | address      | numattachments | sourceaddress | targetaddresses               | ismms | skip |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 12146 | 1704092594000 | 1    | 1    | <contact_name>  | +1XXXXXX9406 | 1              | XXXXXX9406    | ["XXXXXX4542","+1XXXXXX4542"] | 1     | 0    |
--------------------------------------------------------------------------------------------------------------------------------------------------------
         Current message for this address was:
--------------------------------------------------------------------------------------------------------------------------------------------------------
| rowid | date          | type | read | contact_name    | address      | numattachments | sourceaddress | targetaddresses               | ismms | skip |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| 12146 | 1704092594000 | 1    | 1    | <contact_name>  | +1XXXXXX9406 | 1              | XXXXXX9406    | ["XXXXXX4542","+1XXXXXX4542"] | 1     | 0    |
--------------------------------------------------------------------------------------------------------------------------------------------------------

The only thing that really jumps out at me here is Failed to find source_recipient_id in contactmap (xxxxxx9406). Is there any significance to the number not being referred to in its normalized (i.e., prefixed with +1) form?

I'm not entirely sure what you mean by this. If you want to use --runsqlquery simply to insert a recipient in the Android backup, it's a little more complicated than that. At least if you need a valid recipient. If it's just for exporting HTML I guess it could work (it's what this function already does internally), but I don't think it would change the mapping very much. The contacts in the XML file still need to be matched to those in the backup, either by name (which may not be exactly the same, and for group-only contacts is missing from the XML file entirely) or by phone number (which would have the same normalization issues we are dealing with here).

The only point of these SMSBR imports is to produce HTML output. One idea I had was effectively creating an otherwise empty database and using --runsqlquery statements to pre-populate contacts with a script based on the content of my contacts.vcf exported from the contacts app. The goal would be to avoid the need for huge --mapxmlcontacts arguments to handle contacts that (a) only ever show up in group messages and thus don't have an entry in the contact map and (b) provide supplemental mappings for old numbers of contacts that are no longer current (but for which it would still be nice to have correct name display in HTML).

If you aren't keen on parsing contact export files, even a simple --mapxmlcontactsfromfile option would be a huge help here. The file format could be very simple, with each line containing a <number>=<contact_name> statement, just like an argument to --mapxmlcontacts.

Any thoughts on approaches for handling .3gp videos?

No, like I said, there is no querying the browser whether it can play any specific video (as far as I know). I could simply exclude 3gp-type attachments from being handled as video types, but that would turn it into any generic attachment in the HTML (still might be an improvement though, at least there will then be a download-button). Also, while your browser (and mine) doesn't play the 3gp files, for all I know other users have setups where they just work, or a future update will add support for them. Not sure what to do, do you have any ideas?

I won't call it a great idea, but I was imagining some sort of method to transcode offending attachments into something more browser friendly. If I were going to do this myself as a pre-processing step, I'd probably be looking for attributes like this:

ct="video/3gpp" name="IMG_1941.3gp"

Then I'd need to grab the value from the data attribute, base64 decode it, transcode to my desired format (e.g., using ffmpeg), base64-re-encode it, replace it in the XML, and update all of the associated metadata tags. This is definitely not simple; it is fully out of reach for any user who is not a developer him/herself.
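
As a rough illustration of the pre-processing steps just described, here is a hedged Python sketch. It assumes the video sits in `<part>` elements carrying the `ct`, `name`, and `data` attributes mentioned above; it only updates `ct` and `name`, so any other metadata attributes in a real file would still need attention. The function names are made up for this example, and `ffmpeg_to_mp4` assumes an `ffmpeg` binary on the PATH.

```python
import base64
import subprocess
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable


def replace_3gp_parts(root: ET.Element, transcode: Callable[[bytes], bytes]) -> int:
    """Re-encode every video/3gpp part in place; return how many were changed."""
    changed = 0
    for part in root.iter("part"):
        if part.get("ct") != "video/3gpp" or part.get("data") is None:
            continue
        original = base64.b64decode(part.get("data"))
        converted = transcode(original)
        part.set("data", base64.b64encode(converted).decode("ascii"))
        part.set("ct", "video/mp4")
        name = part.get("name")
        if name and name.lower().endswith(".3gp"):
            part.set("name", name[:-4] + ".mp4")
        changed += 1
    return changed


def ffmpeg_to_mp4(data: bytes) -> bytes:
    """Transcode raw 3gp bytes to mp4 via temporary files (requires ffmpeg)."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "in.3gp", Path(tmp) / "out.mp4"
        src.write_bytes(data)
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-i", str(src), str(dst)],
            check=True,
        )
        return dst.read_bytes()
```

Usage would be along the lines of `tree = ET.parse("backup.xml")`, then `replace_3gp_parts(tree.getroot(), ffmpeg_to_mp4)`, then `tree.write("backup-mp4.xml", encoding="utf-8", xml_declaration=True)`. Note that ElementTree loads the whole file into memory, which may be a problem for very large backups.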

@bepaald
Owner

bepaald commented Feb 8, 2025

Just very quickly, I don't have much time, I'll respond to the full message later.

I got even more of these (60) using the 20250207 build including for some other numbers. Here is an example.

[...]

The only thing that really jumps out at me here is Failed to find source_recipient_id in contactmap (xxxxxx9406). Is there any significance to the number not being referred to in its normalized (i.e., prefixed with +1) form?

I think there is significance to the number not being referenced in normalized form, as well as that it appears like that in both the sourceaddress and targetaddresses fields. I can easily reproduce this exact output when omitting the --setcountrycode option. That, in combination with the fact that there are suddenly more cases of this problem, is making me suspect maybe you ran this one without the --setcountrycode option this time? Could that be the case?

Thanks!

@sjevtic
Author

sjevtic commented Feb 9, 2025

That, in combination with the fact that there are suddenly more cases of this problem, is making me suspect maybe you ran this one without the --setcountrycode option this time? Could that be the case?

My apologies: I did in fact forget to set --setcountrycode. I've begun saving the command lines I use for the tests to avoid the problem going forward. I'm now back to 49 of these messages. Here are the details surrounding the first instance:

[Error]: Failed to find source_recipient_id in contactmap (+xxxxxxx7393). Should be present at this point. Skipping message
         Extra info:
         Thread recipient: 64
         Thread is 1-on-1
         First message for this address was (probably):
----------------------------------------------------------------------------------------------------------------------------------------------
| rowid | date          | type | read | contact_name        | address      | numattachments | sourceaddress | targetaddresses | ismms | skip |
----------------------------------------------------------------------------------------------------------------------------------------------
| 2478  | 1712169209612 | 1    | 1    | <contact_name>      | +1XXXXXX0113 | 0              | (NULL)        | (NULL)          | 0     | 0    |
----------------------------------------------------------------------------------------------------------------------------------------------
         Current message for this address was:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| rowid | date          | type | read | contact_name        | address      | numattachments | sourceaddress | targetaddresses                                | ismms | skip |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 14244 | 1721914314000 | 1    | 1    | <contact_name>      | +1XXXXXX0113 | 0              | +1XXXXXX7393  | ["+1XXXXXX0113","+1XXXXXX4542","+1XXXXXX4542"] | 1     | 0    |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

+xxxxxxx7393 is present in group chats only (no one-on-one conversations) and has no saved contact name. The contact is shown in --listxmlcontacts output as a number only.

The problem remains even when I explicitly assign a contact for +xxxxxxx7393 via --mapxmlcontactnames.

Thanks.

@bepaald
Owner

bepaald commented Feb 9, 2025

Note that when mapping this contact to the exact same name as the contact with address +1708XXXXXXX, the contacts should still be effectively merged into one (though which phone number they are listed under will be random).

Looking forward to seeing what you come up with.

I've attempted to handle this case. It depends on the number starting with two international call prefixes and countrycodes (as set by --setcountrycode) and being too many digits (though sources differ on what is too many for a valid number). A warning is still printed when the conditions are met, mostly to make sure no other numbers erroneously enter this code path.
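
To make the rule concrete, here is a small Python sketch of the idea. It is not the tool's actual normalizePhoneNumber code: the function name `normalize_number`, the fixed prefix list (011 and 00), and the `max_digits` threshold (11, which fits +1 numbers) are all assumptions chosen for illustration.

```python
import re
import sys

INTL_PREFIXES = ("011", "00")  # assumption: only these two call prefixes occur


def _strip_intl_prefix(digits):
    """Return digits without a leading call prefix, or None if there is none."""
    for prefix in INTL_PREFIXES:
        if digits.startswith(prefix):
            return digits[len(prefix):]
    return None


def normalize_number(raw, country_code="+1", max_digits=11):
    """Best-effort normalization to '+<countrycode><number>' form."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits              # already international
    rest = _strip_intl_prefix(digits)
    if rest is None:
        return country_code + digits     # treat as a local number
    cc = country_code.lstrip("+")
    # Doubled case: prefix + country code + prefix + country code + number,
    # and more digits than a valid number should have.
    if rest.startswith(cc) and len(rest) > max_digits:
        inner = _strip_intl_prefix(rest[len(cc):])
        if inner is not None and inner.startswith(cc):
            print(f"warning: doubled international prefix in {raw!r}", file=sys.stderr)
            return "+" + inner
    return "+" + rest
```

With these assumptions, an input shaped like `0111001708XXXXXXX` comes out as `+1708XXXXXXX`, while the ordinary inputs from the earlier log (for example `011381688-5500`) are unaffected by the extra branch.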

Wow, there really is no address attribute in the <mms> tag. So strange. However, the <addrs> sub-node is present and further contains <addr> sub-nodes with proper data including address attributes. Maybe these need to be used by signalbackup-tools as a fallback when the primary address attribute is missing?

Interestingly I see creator="com.google.android.apps.messaging" for this node.

I use the address field to determine what thread the message belongs in (as it is intended). Where do you expect this message to end up? Is there a group this belongs to? Or is it simply copied into all 1-on-1 conversations for each of those <addr>s?

I think this is a case where the XML file needs to be prepared by the user to be valid. That is, I think you should probably add the correct address field to this message.
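
Preparing the file could be scripted. The sketch below is one possible approach in Python, under assumptions that should be checked against the file itself: it rebuilds a missing `address` from the message's own `<addr>` entries, skips the owner's number(s), and joins the rest with `~`, which is assumed here to be the separator used for group addresses (compare with an existing group message before relying on it). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def fill_missing_mms_address(root, own_numbers=frozenset(), separator="~"):
    """Add an address attribute to <mms> nodes that lack one; return the count."""
    fixed = 0
    for mms in root.iter("mms"):
        if mms.get("address"):
            continue  # already has a usable address
        participants = []
        for addr in mms.iter("addr"):
            number = addr.get("address")
            # Skip empty entries, the owner's own number, and duplicates.
            if number and number not in own_numbers and number not in participants:
                participants.append(number)
        if participants:
            mms.set("address", separator.join(participants))
            fixed += 1
    return fixed
```

Whether the resulting address matches the thread the message really belongs to still has to be judged by the user; the script only removes the manual editing.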

The only point of these SMSBR imports is to produce HTML output. One idea I had was effectively creating an otherwise empty database and using --runsqlquery statements to pre-populate contacts with a script based on the content of my contacts.vcf exported from the contacts app. The goal would be to avoid the need for huge --mapxmlcontacts arguments to handle contacts that (a) only ever show up in group messages and thus don't have an entry in the contact map and (b) provide supplemental mappings for old numbers of contacts that are no longer current (but for which it would still be nice to have correct name display in HTML).

If you aren't keen on parsing contact export files, even a simple --mapxmlcontactsfromfile option would be a huge help here. The file format could be very simple, with each line containing a <number>=<contact_name> statement, just like an argument to --mapxmlcontacts.

Well, like I said, I don't think creating an empty backup helps with the mapping problem, but I suppose you could try. I have no problem implementing a --mapxmlcontactsfromfile option. But I'll do that when (most) other issues are solved. I think this file could be generated from your vcf-file with a script just as easily (or more so) as the SQL queries.
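
A script for that could look roughly like the Python sketch below. It assumes the proposed (not yet existing) file format of one `<number>=<contact_name>` entry per line, and a plain vCard export with unencoded `FN` and `TEL` lines; folded lines and quoted-printable names, which some exporters produce, are not handled. The function name is made up for illustration.

```python
import re


def vcf_to_mapping_lines(vcf_text):
    """Return '<number>=<contact_name>' lines for every TEL entry in a vCard export."""
    lines = []
    name, numbers = None, []
    for raw in vcf_text.splitlines():
        key, _, value = raw.strip().partition(":")
        # Drop parameters ("TEL;TYPE=CELL") and group prefixes ("item1.TEL").
        prop = key.split(";")[0].split(".")[-1].upper()
        if prop == "BEGIN":
            name, numbers = None, []
        elif prop == "FN":
            name = value.strip()
        elif prop == "TEL":
            # Keep digits and '+', drop spaces, dashes, and brackets.
            number = re.sub(r"[^\d+]", "", value)
            if number:
                numbers.append(number)
        elif prop == "END" and name:
            lines.extend(f"{number}={name}" for number in numbers)
    return lines
```

Contacts without a name are skipped, and numbers are written as they appear in the export, so they may still need the same normalization as the numbers in the XML file.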

I won't call it a great idea, but I was imagining some sort of method to transcode offending attachments into something more browser friendly. If I were going to do this myself as a pre-processing step, I'd probably be looking for attributes like this:

ct="video/3gpp" name="IMG_1941.3gp"

Then I'd need to grab the value from the data attribute, base64 decode it, transcode to my desired format (e.g., using ffmpeg), base64-re-encode it, replace it in the XML, and update all of the associated metadata tags. This is definitely not simple; it is fully out of reach for any user who is not a developer him/herself.
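
For illustration only, the pre-processing steps just described could look roughly like the Python sketch below. It is not part of signalbackup-tools. It assumes attachments are stored as `part` elements carrying the `ct`, `name`, and `data` attributes; `rewrite_3gp_parts` and its `transcode` callback are hypothetical names. The actual transcoding (e.g. a wrapper around ffmpeg) is left to the caller, and only `ct`, `name`, and `data` are updated, so any other metadata would still need attention.

```python
import base64
import xml.etree.ElementTree as ET


def rewrite_3gp_parts(xml_text, transcode, new_ct="video/mp4", new_ext=".mp4"):
    """Re-encode every video/3gpp attachment found in the XML text.

    `transcode` is a caller-supplied function (bytes -> bytes), for example
    a wrapper around ffmpeg; it is deliberately not implemented here.
    Returns the rewritten XML text and the number of parts changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    # Assumption: attachments are <part> elements with ct/name/data attributes.
    for part in root.iter("part"):
        if part.get("ct") != "video/3gpp" or part.get("data") is None:
            continue
        raw = base64.b64decode(part.get("data"))             # decode payload
        converted = transcode(raw)                            # caller's transcoder
        part.set("data", base64.b64encode(converted).decode("ascii"))
        part.set("ct", new_ct)
        name = part.get("name")
        if name and name.lower().endswith(".3gp"):
            part.set("name", name[:-4] + new_ext)             # swap the extension
        changed += 1
    return ET.tostring(root, encoding="unicode"), changed
```

Note that this loads the whole XML file into memory, which may be a problem for very large backups.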

So, from my research, it seems indeed that 3gp files could be supported by some setups, see for example here. Though currently, it seems no known setups do actually support it (see here). Although I did find some references of it working on macOS.

I could possibly simply not treat 3gp attachments as video, then they would appear in the HTML as any other non-media attachment. The other option is what you are thinking. I would personally opt for post-processing instead of preprocessing. That way you would not need to parse and edit the XML file and need no base64 encoding and decoding. You could just transcode the file directly and edit a single line in the HTML to point to the new file (if you change the name) and alter the type attribute.


Here are the details surrounding the first instance:

[Error]: Failed to find source_recipient_id in contactmap (+xxxxxxx7393). Should be present at this point. Skipping message
         Extra info:
         Thread recipient: 64
         Thread is 1-on-1
         First message for this address was (probably):
----------------------------------------------------------------------------------------------------------------------------------------------
| rowid | date          | type | read | contact_name        | address      | numattachments | sourceaddress | targetaddresses | ismms | skip |
----------------------------------------------------------------------------------------------------------------------------------------------
| 2478  | 1712169209612 | 1    | 1    | <contact_name>      | +1XXXXXX0113 | 0              | (NULL)        | (NULL)          | 0     | 0    |
----------------------------------------------------------------------------------------------------------------------------------------------
         Current message for this address was:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| rowid | date          | type | read | contact_name        | address      | numattachments | sourceaddress | targetaddresses                                | ismms | skip |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 14244 | 1721914314000 | 1    | 1    | <contact_name>      | +1XXXXXX0113 | 0              | +1XXXXXX7393  | ["+1XXXXXX0113","+1XXXXXX4542","+1XXXXXX4542"] | 1     | 0    |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

+xxxxxxx7393 is present in group chats only (no one-on-one conversations) and has no saved contact name. The contact is shown in --listxmlcontacts output as a number only.

So, that's unexpected again.... According to the address field (=+1XXXXXX0113), this message is not a group message, but a 1-on-1. For a group message, I (and this function) would expect the address to be +1XXXXXX0113~+1XXXXXX7393~+1XXXXXX4542 or something similar. Not sure what to do, do other group messages behave the same way? There is not much real documentation by Syntech on this, but I assumed from all other examples I had seen, group messages always had all numbers with the ~ in the address field.

I suppose I could use the list of recipients to determine if something is a group, and make up my own identifier and keep a list that way. But that is going to be quite a few steps backward, I would need to rework the logic of the function and start over from there... That would also solve the case of the message with the missing address (though the XML file is still technically invalid).

@sjevtic
Author

sjevtic commented Feb 9, 2025

I've attempted to handle this case. It depends on the number starting with two international call prefixes and countrycodes (as set by --setcountrycode) and being too many digits (though sources differ on what is too many for a valid number). A warning is still printed when the conditions are met, mostly to make sure no other numbers erroneously enter this code path.

This seems to be working, at least in my test data:

normalizePhoneNumber in:  0111001XXXXXX1717
[Warning]: Detected doubled prefix and countrycode in phone number (0111001XXXXXX1717)
normalizePhoneNumber out: +1XXXXXX1717
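
A rough Python sketch of the kind of check described above, purely for illustration and not the tool's actual implementation. The prefix list (`011`, `00`), the 15-digit cutoff, and the function name `strip_doubled_prefix` are all assumptions; the thread itself notes that sources differ on what counts as too many digits.

```python
def strip_doubled_prefix(number, country_code="1",
                         prefixes=("011", "00"), max_digits=15):
    """Return '+<cc><national number>' if `number` starts with two
    consecutive (international prefix + country code) pairs and is longer
    than `max_digits`; otherwise return None and leave the number alone."""
    digits = number.lstrip("+")
    if not digits.isdigit() or len(digits) <= max_digits:
        return None
    rest = digits
    for _ in range(2):                      # the pair must occur twice in a row
        for prefix in prefixes:
            lead = prefix + country_code
            if rest.startswith(lead):
                rest = rest[len(lead):]     # drop one prefix + country code
                break
        else:
            return None                     # no pair found: not this case
    return "+" + country_code + rest
```

With a made-up number of the same shape as the one in the log, `strip_doubled_prefix("01110015551231717")` first strips `0111`, then `001`, and returns `+15551231717`.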

I use the address field to determine what thread the message belongs in (as it is intended). Where do you expect this message to end up? Is there a group this belongs to? Or is it simply copied into all 1-on-1 conversations for each of those addresses?

There is only one instance of this error in my 2024 archive. There are only two addr nodes so it looks like it was a 1-on-1 conversation.

I think this is a case where the XML file needs to be prepared by the user to be valid. That is, I think you should probably add the correct address field to this message.

This does seem to be an example of a malformed XML file. Of course the challenge here is that preprocessing at this level is pretty hard as it would require full parsing of the XML to do this (generate address attributes from the addrs nodes) correctly.

I don't fully understand the point of the addrs nodes--they seem to duplicate the information already present in the address attributes (although arguably in a more parseable, more detailed manner).

My hope is that signalbackup-tools would ultimately support gathering address information from the addrs nodes.
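
To make the idea concrete, here is a minimal Python sketch of deriving a `~`-joined address from the addr nodes. It is only an illustration, not how signalbackup-tools works. It assumes each `addr` element has `address` and `type` attributes, that type `137` marks the sender (the Android PDU header code for "from"), and that numbers have already been normalized so they can be compared as strings; `addresses_from_addrs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

ADDR_FROM = "137"  # assumption: PDU header code for the sender


def addresses_from_addrs(mms, own_number):
    """Return (sender, thread_address) for one <mms> element.

    thread_address joins every participant except `own_number` with '~',
    in the order the addr nodes appear.
    """
    sender = None
    others = []
    for addr in mms.iter("addr"):
        number = addr.get("address")
        if not number:
            continue
        if addr.get("type") == ADDR_FROM:
            sender = number
        if number != own_number and number not in others:
            others.append(number)
    return sender, "~".join(others)
```

A message with only two addr nodes, one of which is the user's own number, yields a single address and would therefore be treated as 1-on-1.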

Well, like I said, I don't think creating an empty backup helps with the mapping problem, but I suppose you could try. I have no problem implementing a --mapxmlcontactsfromfile option. But I'll do that when (most) other issues are solved. I think this file could be generated from your vcf-file with a script just as easy (or more so) as the SQL queries.

If there was a --mapxmlcontactsfromfile option it would certainly be far easier than the --runsqlquery approach. Generating the contact mapping file from the exported .vcf file would be very straightforward--I could probably do this in bash right on my phone.
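
As a sketch of how simple that conversion could be (in Python rather than bash), assuming the one-entry-per-line `<number>=<contact_name>` format proposed earlier in this thread. The function name `vcf_to_contact_map` is hypothetical, and folded lines or quoted-printable names in the .vcf file are not handled.

```python
import re


def vcf_to_contact_map(vcf_text):
    """Return a list of '<number>=<name>' lines built from vCard text,
    using the FN (formatted name) and TEL properties of each card."""
    lines = []
    name, numbers = None, []
    for raw in vcf_text.splitlines():
        key, _, value = raw.strip().partition(":")
        # Drop parameters (";TYPE=CELL") and group prefixes ("item1.").
        prop = key.split(";")[0].split(".")[-1].upper()
        if prop == "BEGIN":
            name, numbers = None, []
        elif prop == "FN":
            name = value.strip()
        elif prop == "TEL":
            number = re.sub(r"[^\d+]", "", value)   # keep digits and '+'
            if number:
                numbers.append(number)
        elif prop == "END" and name:
            lines.extend(f"{number}={name}" for number in numbers)
    return lines
```

Writing the result out is then just `"\n".join(vcf_to_contact_map(text))`. The numbers are only stripped of punctuation here, not normalized to a country code.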

So, from my research, it seems indeed that 3gp files could be supported by some setups, see for example here. Though currently, it seems no known setups do actually support it (see here). Although I did find some references of it working on macOS.

My impression is that support is very scarce in the desktop browser space. But, .3gp video containers are very common in the mobile space, notably for video sent via legacy MMS.

I could possibly simply not treat 3gp attachments as video, then they would appear in the HTML as any other non-media attachment. The other option is what you are thinking. I would personally opt for post-processing instead of preprocessing. That way you would not need to parse and edit the XML file and need no base64 encoding and decoding. You could just transcode the file directly and edit a single line in the HTML to point to the new file (if you change the name) and alter the type attribute.

I'm not convinced that postprocessing is simpler. I think it is more a matter of trading one set of problems for another set of similar complexity.

+xxxxxxx7393 is present in group chats only (no one-on-one conversations) and has no saved contact name. The contact is shown in --listxmlcontacts output as a number only.

So, that's unexpected again.... According to the address field (=+1XXXXXX0113), this message is not a group message, but a 1-on-1. For a group message, I (and this function) would expect the address to be +1XXXXXX0113~+1XXXXXX7393~+1XXXXXX4542 or something similar. Not sure what to do, do other group messages behave the same way? There is not much real documentation by Syntech on this, but I assumed from all other examples I had seen, group messages always had all numbers with the ~ in the address field.

Most group messages do seem to have the address field set as you've suggested. I think that's the way it is supposed to work in theory. But as usual, practice deviates from theory. :(

I just went and dug through the XML a bit, and noticed a couple things:

  1. "First message for this address was (probably)" is wrong. It is actually pointing to a message from a 1-on-1 conversation with +1XXXXXX0113, but the message in question is part of a group conversation with +1XXXXXX7393 and +1XXXXXX0113.
  2. The offending message has this value in its address attribute: +1XXXXXX7393~+1XXXXXX0113. This is NOT present in the --listxmlcontacts output. It isn't clear why this is happening.

I suppose I could use the list of recipients to determine if something is a group, and make up my own identifier and keep a list that way. But that is going to be quite a few steps backward, I would need to rework the logic of the function and start over from there... That would also solve the case of the message with the missing address (though the XML file is still technically invalid).

In the case discussed up above, this is clearly an issue with the SMSBR output. I think this approach would be a good way of dealing with that.

In the second case, I think there is some other issue going on here in signalbackup-tools that is preventing the creation of the contact entry for the group conversation.

Also, I am wondering if this approach could enable better contact name identification for group conversations.

Thanks.

@bepaald
Owner

bepaald commented Feb 10, 2025

This seems to be working, at least in my test data:

Good!

There is only one instance of this error in my 2024 archive. There are only two addr nodes so it looks like it was a 1-on-1 conversation.

Is one of the addr nodes yourself? Otherwise it would still probably be a group.

This does seem to be an example of a malformed XML file. Of course the challenge here is that preprocessing at this level is pretty hard as it would require full parsing of the XML to do this (generate address attributes from the addrs nodes) correctly.

Well, yes that would be hard if it needed to be done for thousands of messages, but if there is just a single one adding it manually is definitely the more efficient option. Getting the correct address is non-trivial for this program as well.

I don't fully understand the point of the addrs nodes--they seem to duplicate the information already present in the address attributes (although arguably in a more parseable, more detailed manner).

For group messages, the addr nodes show which user sent the message and which received it. Also, it prevents having to tokenize the address field. That's my guess at least.

My hope is that signalbackup-tools would ultimately support gathering address information from the addrs nodes.

I'll consider it when most other problems are solved.

If there was a --mapxmlcontactsfromfile option it would certainly be far easier than the --runsqlquery approach. Generating the contact mapping file from the exported .vcf file would be very straightforward--I could probably do this in bash right on my phone.

Yes, well, that is also on the to-do list for when no other problems remain. As well as an --autogroupnames option to set group names from individual contact names.

I just went and dug through the XML a bit, and noticed a couple things:

  1. "First message for this address was (probably)" is wrong. It is actually pointing to a message from a 1-on-1 conversation with +1XXXXXX0113, but the message in question is part of a group conversation with +1XXXXXX7393 and +1XXXXXX0113.

  2. The offending message has this value in its address attribute: +1XXXXXX7393~+1XXXXXX0113. This is NOT present in the --listxmlcontacts output. It isn't clear why this is happening.

Ok, that is interesting. I agree in this case it is probably a bug in the tool (but I'm happy with that, I think it will be easier to solve than starting over without relying on the address fields).

If you find the two messages reported by this tool in the XML file (easiest probably to search for the date value), would the contact_name attributes happen to be the same?

I've added some more debugging output for this case, maybe you could run again. I'm expecting some output, directly after the normalizePhoneNumber output, four tables like this:

1
----------------------------------------------
| address                   | contact_name   |
----------------------------------------------
| +1XXXXXX7393~+1XXXXXX0113 | <contact name> |
----------------------------------------------

I'm expecting the address to change from the value you find in the XML file to the incorrect +1XXXXXX0113 value in the 2nd, 3rd, or 4th table. Possibly the contact name also changes at some point. Could you describe what you see?

Thanks!

@sjevtic
Author

sjevtic commented Feb 10, 2025

Is one of the addr nodes yourself? Otherwise it would still probably be a group.

Yes.

Well, yes that would be hard if it needed to be done for thousands of messages, but if there is just a single one adding it manually is definitely the more efficient option. Getting the correct address is non-trivial for this program as well.

If it is just one, that's fine. I haven't run signalbackup-tools on any of my older archives though yet to see if the condition exists in those.

For group messages, the addr nodes show which user sent the message and which received it. Also, it prevents having to tokenize the address field. That's my guess at least.

This is my impression too.

Yes, well, that is also on the to-do list for when no other problems remain. As well as an --autogroupnames option to set group names from individual contact names.

--autogroupnames sounds like a great idea. When used in conjunction with --mapxmlcontactnamesfromfile on a full contact export, it should be able to completely eliminate the issue of contacts in a group chat not being recognized because there is no 1:1 conversations with those contacts.

  1. "First message for this address was (probably)" is wrong. It is actually pointing to a message from a 1-on-1 conversation with +1XXXXXX0113, but the message in question is part of a group conversation with +1XXXXXX7393 and +1XXXXXX0113.
  2. The offending message has this value in its address attribute: +1XXXXXX7393~+1XXXXXX0113. This is NOT present in the --listxmlcontacts output. It isn't clear why this is happening.

Ok, that is interesting. I agree in this case it is probably a bug in the tool (but I'm happy with that, I think it will be easier to solve than starting over without relying on the address fields).

If you find the two messages reported by this tool in the XML file (easiest probably to search for the date value), would the contact_name attributes happen to be the same?

I looked at a few of these and it seems that the contact_name attribute always has the value of the name associated with +1XXXXXX0113.

I'm expecting the address to change from the value you find in the XML file to the incorrect +1XXXXXX0113 value in the 2nd, 3rd, or 4th table. Possibly the contact name also changes at some point. Could you describe what you see?

1
---------------------------------------------------
| address                   | contact_name        |
---------------------------------------------------
| +1XXXXXX7393~+1XXXXXX0113 | <contact_name>      |
---------------------------------------------------
2
---------------------------------------------------
| address                   | contact_name        |
---------------------------------------------------
| +1XXXXXX7393~+1XXXXXX0113 | <contact_name>      |
---------------------------------------------------
3
---------------------------------------------------
| address                   | contact_name        |
---------------------------------------------------
| +1XXXXXX7393~+1XXXXXX0113 | <contact_name>      |
---------------------------------------------------
4
--------------------------------------
| address      | contact_name        |
--------------------------------------
| +1XXXXXX0113 | <contact_name>      |
--------------------------------------

In each of these 4 tables, the value of <contact_name> is the same. It is the name associated with +1XXXXXX0113.

@bepaald
Owner

bepaald commented Feb 12, 2025

--autogroupnames sounds like a great idea. When used in conjunction with --mapxmlcontactnamesfromfile on a full contact export, it should be able to completely eliminate the issue of contacts in a group chat not being recognized because there is no 1:1 conversations with those contacts.

There are now --xmlautogroupnames and --mapxmlcontactnamesfromfile options available.

I looked at a few of these and it seems that the contact_name attribute always has the value of the name associated with +1XXXXXX0113.

In each of these 4 tables, the value of <contact_name> is the same. It is the name associated with +1XXXXXX0113.

Right, so this was the problem all along. To match messages with existing threads, it would first attempt to match on contact_name, and only by phone number second. The idea being, that no one would have two contacts with the exact same name in their database (even if there were two "John Doe"s, they would likely rename them to "John 1" and "John 2" or similar). Also, this meant that phone numbers did not have to match precisely: the incorrectly normalized +1001708XXXXXXX would still match correctly as long as the contact_name field was identical to +1708XXXXXXX. Lastly, this function is also used to import into existing Signal databases, where phone numbers need not be present at all anymore and group contacts do not have phone numbers either.

Of course, this doesn't work for this case, where the group gets the name of the only group member whose name is available, thus matching a 1-on-1 conversation with this person.

So, now the tool matches on phone number instead. This makes it rely heavily on the phone number normalization: if it fails conversations may be split. As I'm writing this, I'm thinking when --xmlautogroupnames is enabled maybe reverting to matching by name will work better again. That may be something to look into, if there are a lot of problems with normalization.
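
The difference between the two matching strategies can be shown with a small Python sketch. This is only an illustration of the reasoning above, not the tool's code; both function names are hypothetical, and the addresses are assumed to be normalized already.

```python
def group_by_name_first(messages):
    """Old behaviour (sketch): a thread is identified by contact_name,
    falling back to the address only when no name is available."""
    threads, ids = {}, []
    for address, contact_name in messages:
        key = contact_name or address
        ids.append(threads.setdefault(key, len(threads) + 1))
    return ids


def group_by_numbers(messages):
    """New behaviour (sketch): a thread is identified by the set of
    participant numbers taken from the '~'-separated address field."""
    threads, ids = {}, []
    for address, _contact_name in messages:
        key = frozenset(p for p in address.split("~") if p)
        ids.append(threads.setdefault(key, len(threads) + 1))
    return ids


messages = [
    ("+1XXXXXX0113", "Alice"),                 # 1-on-1 conversation
    ("+1XXXXXX7393~+1XXXXXX0113", "Alice"),    # group, named after Alice
    ("+1XXXXXX0113~+1XXXXXX7393", "Alice"),    # same group, other order
]
print(group_by_name_first(messages))  # [1, 1, 1]: group merged into the 1-on-1
print(group_by_numbers(messages))     # [1, 2, 2]: group kept separate
```

Keying on a set of numbers also makes the order of the numbers in the address field irrelevant, but any number that normalizes differently produces a different key and so splits the conversation.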

Thanks!

@sjevtic
Author

sjevtic commented Feb 12, 2025

WOW! This time I got exactly the expected output! This is exceptional.

There are now --xmlautogroupnames and --mapxmlcontactnamesfromfile options available.

I tried both of these and they worked as expected. My contact name file was still a fairly small one; I'll have to do some bigger tests in the coming days.

One small observation is that the group names for long conversations, both on index.html and at the top of group conversation pages, have the comma separating long member lists at the beginning of a new line, rather than at the end of the previous line (where I would argue it is customarily/grammatically expected).

Right, so this was the problem all along. To match messages with existing threads, it would first attempt to match on contact_name, and only by phone number second. The idea being, that no one would have two contacts with the exact same name in their database (even if there were two "John Doe"s, they would likely rename them to "John 1" and "John 2" or similar). Also, this meant that phone numbers did not have to match precisely: the incorrectly normalized +1001708XXXXXXX would still match correctly as long as the contact_name field was identical to +1708XXXXXXX. Lastly, this function is also used to import into existing Signal databases, where phone numbers need not be present at all anymore and group contacts do not have phone numbers either.

One of the things I will try to test soon is a contact map file in which a name appears many times, each time mapping it to a different number. I will be interested to see if the resulting conversations are threaded properly even though the sender has different numbers during different portions of the exchange. Apparently I have some shady contacts with a habit of using burner phones/numbers. :)

Of course, this doesn't work for this case, where the group gets the name of the only group member whose name is available, thus matching a 1-on-1 conversation with this person.

So, now the tool matches on phone number instead. This makes it rely heavily on the phone number normalization: if it fails conversations may be split. As I'm writing this, I'm thinking when --xmlautogroupnames is enabled maybe reverting to matching by name will work better again. That may be something to look into, if there are a lot of problems with normalization.

Thanks!

I didn't see any normalization issues in this build on my test file. Also, the only skipped message was the one missing the address field. I need to go dig through my other archives and see if I have a lot of these.

I guess that leaves only the following open points for possible further enhancement:

  1. Option to choose short/long directory/file names.
  2. How to handle (awful) .3gp files.
  3. What to do with <addrs> sub-node data.

Thanks for your AMAZING work! I know it's not much and doesn't come close to compensating you for your work, but I left a small donation at your sponsorship link. I've been looking for a mechanism to create beautiful browsable/printable archives of SMSBR backups for over a decade now and nothing has even come close to what you accomplished. Your passion for this project and attention to detail is unmatched.

@bepaald
Owner

bepaald commented Feb 12, 2025

Great that it's mostly working now, very happy with that.

One small observation is that the group names for long conversations, both on index.html and at the top of group conversation pages, have the comma separating long member lists at the beginning of a new line, rather than at the end of the previous line (where I would argue it is customarily/grammatically expected).

Yes that would be better, but should already be happening. I can't really reproduce this, but I know line-breaking has been a tough problem on the HTML side. I've tried increasing the size of one of the names in a group slowly (one 'i' at a time) and see the following:

Image

One more 'i':

Image

So that seems correct to me. This sort of thing is also likely influenced by the browser and possibly the fonts used... I'll try another browser tomorrow. You also mention this happening on the index page, but on the index page, the group name is never supposed to be more than 1 line:

Image

Is that different for you?

One of the things I will try to test soon is a contact map file in which a name appears many times, each time mapping it to a different number. I will be interested to see if the resulting conversations are threaded properly even though the sender has different numbers during different portions of the exchange. Apparently I have some shady contacts with a habit of using burner phones/numbers. :)

I am pretty sure it will create multiple conversations with contacts who happen to have the same name. If it is desired to merge some contacts (who have simply changed number), I think it shouldn't be too difficult to add another option to do so.

I didn't see any normalization issues in this build on my test file. Also, the only skipped message was the one missing the address field. I need to go dig through my other archives and see if I have a lot of these.

Yes, I'm curious to hear how well this works on other XML files, I hope results are just as good as this, but maybe we were lucky with this XML file...

I guess that leaves only the following open points for possible further enhancement:

  1. Option to choose short/long directory/file names.
  2. How to handle (awful) .3gp files.
  3. What to do with <addrs> sub-node data.

Point 1, was actually next on my to-do list.
Point 2, I really don't know. I think for this program the only two options are to:

  1. Do what I'm doing now, and rely on the user's setup to handle playback.
  2. Or decide, since (almost) no one is capable of playing these files in browser, to handle them like any other non-viewable attachments (pdf, zip, ...).

Anything else (like transcoding video) is well outside the scope of this program, and should be done by the user either before or after processing by this tool.
Point 3, as long as messages with missing address attributes are rare, I think nothing more needs to be done with those probably? Do you have any suggestions?

Thanks for your AMAZING work! I know it's not much and doesn't come close to compensating you for your work, but I left a small donation at your sponsorship link. I've been looking for a mechanism to create beautiful browsable/printable archives of SMSBR backups for over a decade now and nothing has even come close to what you accomplished. Your passion for this project and attention to detail is unmatched.

Well, thank you very much! That is not a small donation, I've helped quite a few people with their backups over the years, but this is by far the biggest donation I have received. It is very generous of you and greatly appreciated.

Thanks!

@bepaald
Owner

bepaald commented Feb 13, 2025

As per usual, after typing a response, I go to bed and as my head hits the pillow I think of two or three things that might cause an issue mentioned here.

The weird line breaking of group titles should be fixed now. (It is a Windows thing, in combination with me always opening files in binary mode. I do that specifically to work around this same Windows thing, but it was a bad combination in this case.)

@bepaald
Owner

bepaald commented Feb 13, 2025

There is now also an initial --compactfilenames option.

@sjevtic sjevtic closed this as completed Feb 15, 2025
@sjevtic
Copy link
Author

sjevtic commented Feb 15, 2025

First of all, I accidentally somehow hit "mark as completed" when I leaned over my keyboard and it doesn't appear that I can undo this action. That said, this activity does seem to be largely complete, so you can decide if it should stay closed or be re-opened.

The situation with conversation titles seems to be largely improved in 20250215-1. I tested in both Firefox and Chrome on Windows, and there are no lines in conversation title names beginning with commas any more. However in narrow windows, the conversation titles do not scale down in width to match the box that encloses conversation bubbles. I think this would be a nice touch; titles could just wrap as required to accommodate the width available.

I am pretty sure it will create multiple conversations with contacts who happen to have the same name. If it is desired to merge some contacts (who have simply changed number), I think it shouldn't be too difficult to add another option to do so.

It would probably be a nice option to have though I will admit I haven't thought about this a whole lot yet.

Anything else (like transcoding video) is well outside the scope of this program, and should be done by the user either before or after processing by this tool.

I tend to agree that this is out of scope. A reasonable compromise might be an option to make them links to files that can be downloaded or played in an external player. Having this as an option would be a nice choice so a user could elect to preprocess if desired.

Point 3, as long as messages with missing address attributes are rare, I think nothing more needs to be done with those probably? Do you have any suggestions?

That's my impression too. I think sticking with this decision makes sense unless user complaints prove otherwise.

There is now also an initial --compactfilenames option.

I ran my test without this option today initially, but still got file names with truncated contact names in the HTML files. This doesn't seem to be optimal behavior. I think it would be best to have the default be the old naming behavior.

I did a second test with --compactfilenames and the result was exactly what I would expect. Nicely done.

I need to do some bigger tests now. I have my archives stored in an XML file for each year, so I want to do some tests where I successively import multiple archive XML files into a backup and then export to HTML, paginating the results with --split or --split-by. It will be much nicer to have a single HTML directory tree with pagination handled by signalbackup-tools rather than a separate HTML tree for each year.

@bepaald bepaald reopened this Feb 16, 2025
@bepaald
Owner

bepaald commented Feb 16, 2025

However in narrow windows, the conversation titles do not scale down in width to match the box that encloses conversation bubbles. I think this would be a nice touch; titles could just wrap as required to accommodate the width available.

This has all been very complicated (for me at least), usually long titles would not break, or break at weird spots, or very short ones would break even though more width was still available. I ended up trying to just set a sensible default fixed width. But, I messed around with the CSS again and it may be working now.

I tend to agree that this is out of scope. A reasonable compromise might be an option to make them links to files that can be downloaded or played in an external player. Having this as an option would be a nice choice so a user could elect to preprocess if desired.

I think I can add an option to prevent given content-types from being treated as media. That way, they should show as other non-media attachments would, including a download button:

Image

Would that be an improvement?

I ran my test without this option today initially, but still got file names with truncated contact names in the HTML files. This doesn't seem to be optimal behavior. I think it would be best to have the default be the old naming behavior.

For now, I've upped the length limit to 54, which should be higher than any conversation title in Signal can naturally be. So, apart from you, nobody should actually see any truncated names. I'm undecided on what's best here. Having the program just fail would be a bad user experience. On the other hand, maybe the limit of 54 is too high to prevent that, making this option the worst, still truncating, and still (possibly) failing.

I will give it some thought, maybe I should just disable this truncating altogether and calculate the actual length of the full absolute path and issue a warning if it's too high (on Windows).
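
A minimal sketch of such a check, in Python for brevity and not taken from the tool. It assumes the classic Windows MAX_PATH limit of 260 characters (which counts the terminating NUL); `long_path_warning` is a hypothetical helper, and systems with long-path support enabled would need a different limit.

```python
import os

WINDOWS_MAX_PATH = 260  # classic limit, counting the terminating NUL


def long_path_warning(output_dir, file_name, limit=WINDOWS_MAX_PATH):
    """Return a warning message if the full absolute path would reach
    `limit` characters, or None if the path is short enough."""
    full = os.path.abspath(os.path.join(output_dir, file_name))
    if len(full) >= limit:
        return (f"path is {len(full)} characters long "
                f"(limit {limit - 1}): {full}")
    return None
```

Because the check uses the absolute path, the same conversation title can pass in one output directory and fail in a more deeply nested one, which a fixed cap on the title length cannot account for.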

I need to do some bigger tests now. I have my archives stored in an XML file for each year, so I want to do some tests where I successively import multiple archive XML files into a backup and then export to HTML, paginating the results with --split or --split-by. It will be much nicer to have a single HTML directory tree with pagination handled by signalbackup-tools rather than a separate HTML tree for each year.

Sounds like a plan. I believe the tool may behave slightly differently when importing into an existing backup as opposed to into an internal dummy-backup. It may give better results actually merging the XML files beforehand. I'll leave this issue open, as I'm not confident no new issues will arise. Also, from experience, by now I think the other XML files could contain surprises that would need to be dealt with.
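
For reference, a naive merge could look like the Python sketch below. It is only an illustration under stated assumptions: each file has a single `smses` root whose direct children are the messages, and `count` is the only root attribute that needs to be correct. `merge_smsbr_roots` is a hypothetical name. Everything is held in memory at once, so this may not suit very large archives.

```python
import xml.etree.ElementTree as ET


def merge_smsbr_roots(roots):
    """Combine the children of several <smses> roots into one new root
    and set its count attribute to the total number of messages."""
    merged = ET.Element("smses")
    for root in roots:
        for message in root:          # <sms> and <mms> children, in file order
            merged.append(message)
    merged.set("count", str(len(merged)))
    return merged
```

The roots would come from something like `ET.parse(path).getroot()` for each yearly file, and `ET.ElementTree(merged).write(out_path, encoding="utf-8", xml_declaration=True)` would save the result. No de-duplication or sorting by date is attempted.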

@sjevtic
Author

sjevtic commented Feb 16, 2025

This has all been very complicated (for me at least), usually long titles would not break, or break at weird spots, or very short ones would break even though more width was still available. I ended up trying to just set a sensible default fixed width. But, I messed around with the CSS again and it may be working now.

It looks reasonable in 20250216-1.

I think I can add an option to prevent given content-types from being treated as media. That way, they should show as other non-media attachments would, including a download button:

Image

Would that be an improvement?

Yes, this makes sense. It offers a reasonable level of usability when not pre/post-processing (to make these playable in-browser).

I will give it some thought. Maybe I should just disable this truncating altogether, calculate the actual length of the full absolute path, and issue a warning if it's too high (on Windows).

I think that would be the right approach. Windows' long filename support allows really long filenames. The only issue with these longer names seems to be working with the long paths in Windows GUI elements, and for people who are bothered by that there is --compactfilenames (which are REALLY compact); everyone else can enjoy long names.

Sounds like a plan. I believe the tool may behave slightly differently when importing into an existing backup as opposed to into an internal dummy-backup. Actually merging the XML files beforehand may give better results. I'll leave this issue open, as I'm not confident no new issues will arise. Also, from experience by now, I think the other XML files could contain surprises that would need to be dealt with.

Here's a naive question: how do I create an empty backup into which I can start my series of imports? I tried creating an empty directory to pass as the first argument, but that didn't go well:

Opening from dir!
Reading database...
[Error]: SQL: not an error
[Error]: Failed to open backup

I think this behavior actually makes sense--I just don't know how to properly accomplish what I am trying to do.

I could always go down the XML merge route, but that is going to involve moving a LOT of bytes around. I'm also curious what it will do with regard to memory utilization in signalbackup-tools.

BTW, backups are enshrouded with a tag like this:

<smses count="15997" backup_set="33c73946-4fb2-43f7-aea1-693685f52c61" backup_date="1735707629557" type="full">

Do any of these attributes matter (especially the message count) from the perspective of signalbackup-tools import? If I can just drop these, the XML merge would definitely be easier.

@bepaald
Owner

bepaald commented Feb 17, 2025

Yes, this makes sense. It offers a reasonable level of usability when not pre/post-processing (to make these playable in-browser).

I think that would be the right approach. Windows' long filename support allows really long filenames. The only issue with these longer names seems to be working with the long paths in Windows GUI elements, and for people who are bothered by that there is --compactfilenames (which are REALLY compact); everyone else can enjoy long names.

Yes, I agree with both these things. I'll start work on them.

Here's a naive question: how do I create an empty backup into which I can start my series of imports? I tried creating an empty directory to pass as the first argument, but that didn't go well:

Opening from dir!
Reading database...
[Error]: SQL: not an error
[Error]: Failed to open backup

I think this behavior actually makes sense--I just don't know how to properly accomplish what I am trying to do.

I think there actually is no way to do that currently. I thought you could just add an --output option to the command line to save the initial dummy-backup (with the first XML file imported), and then use that as input together with --importplaintextbackup. However, it seems I cleverly disabled --output for these dummy backups (to prevent people from trying to import them into Signal; they are not proper backup files after all).

Creating an 'empty' backup on the command line is not possible; I'm not even sure what an 'empty' backup is. There is more to the backup file than just the SQL database, and even that database, while it contains no messages, will still have a schema (tables with certain columns, etc.) which depends on other data (like the database version).

I think I have two options:

  • Add an option --savedummy, to still allow saving the (invalid) backup to be used as input again. This is the easier option to implement but is a bit messy. The first run would require --exportplaintextbackuphtml and --savedummy; the HTML generated can then be discarded. Then, for each subsequent XML file, you'd need to point the input to the saved dummy file and use --importplaintextbackup and --output to generate a new backup. On the last run, you'd then add --exporthtml to get the HTML.
  • Allow --exportplaintextbackuphtml to take multiple XML files as input. This is probably the cleaner solution, though I think it will require somewhat large changes to my argument parser.

In fact, the second option will also prevent the problem of the tool handling importing into a dummy differently than when importing into an existing backup. So I think I'll go with that option.

I could always go down the XML merge route, but that is going to involve moving a LOT of bytes around. I'm also curious what it will do with regard to memory utilization in signalbackup-tools.

Me too! But I think it'll be all right. The majority of the bytes in those XML files should be in the attachment data; the actual text content won't add up to much, I think. And the tool is actually pretty clever about the attachment data: it does not keep it in memory the entire time, but instead just saves the offset and size and reads it back from the XML file directly whenever it needs it. At least that's the intention; I hope it's working like that. Do you see high memory usage when running this (on the order of the size of the XML file)?
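The offset-and-size approach described above can be sketched in Python as follows. This only illustrates the general technique and is not how signalbackup-tools is actually implemented. The names `AttachmentRef` and `index_attachments` are invented for the example, and it assumes attachment payloads sit in a `data="..."` attribute, which may not hold for every file.

```python
import mmap
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AttachmentRef:
    """Remembers where an attachment lives instead of holding its bytes."""
    path: str
    offset: int  # byte offset of the payload inside the XML file
    size: int    # payload length in bytes

    def read(self) -> bytes:
        # Re-open the file and fetch only this attachment's bytes on demand.
        with open(self.path, "rb") as handle:
            handle.seek(self.offset)
            return handle.read(self.size)


def index_attachments(path: str, marker: bytes = b' data="') -> List[AttachmentRef]:
    """Scan the file once and record (offset, size) for each payload.

    Assumption for this sketch: payloads are attribute values introduced by
    `marker` and terminated by the next double quote.
    """
    refs: List[AttachmentRef] = []
    with open(path, "rb") as handle:
        # mmap lets us search the file without reading it all into memory.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            pos = view.find(marker)
            while pos != -1:
                start = pos + len(marker)
                end = view.find(b'"', start)
                if end == -1:
                    break  # unterminated attribute; stop indexing
                refs.append(AttachmentRef(path, start, end - start))
                pos = view.find(marker, end)
    return refs
```

With an index like this, memory use grows with the number of attachments rather than with their total size, since each payload is only read when `read()` is called.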

BTW, backups are enshrouded with a tag like this:

<smses count="15997" backup_set="33c73946-4fb2-43f7-aea1-693685f52c61" backup_date="1735707629557" type="full">

Do any of these attributes matter (especially the message count) from the perspective of signalbackup-tools import? If I can just drop these, the XML merge would definitely be easier.

No, they are all ignored. I don't see any use for them for any program that wants to parse the file, apart maybe from showing a progress bar during parsing.
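Since the root attributes are ignored, a manual merge can simply drop them. Below is a minimal Python sketch of such a merge, with a hypothetical helper name (`merge_backups`). It assumes each input file has a single `<smses>` root whose direct children are the messages. It also parses each file fully into memory, so it is likely a poor fit for archives with a lot of attachment data.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def merge_backups(input_paths: Iterable[str], output_path: str) -> int:
    """Concatenate the messages of several XML backups under one <smses> root.

    The new root deliberately carries no attributes (count, backup_set, ...),
    relying on the statement above that they are ignored on import.
    Returns the number of message elements written.
    """
    merged_root = ET.Element("smses")
    total = 0
    for path in input_paths:
        root = ET.parse(path).getroot()
        # Copy every direct child (messages) in file order.
        for message in list(root):
            merged_root.append(message)
            total += 1
    ET.ElementTree(merged_root).write(
        output_path, encoding="utf-8", xml_declaration=True
    )
    return total
```

ElementTree is a strict parser, so it will raise an error on any input that is not well-formed XML. Messages are written in the order the input files are given, with no sorting or de-duplication.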

@sjevtic
Author

sjevtic commented Feb 17, 2025

I think it is highly advantageous to be able to import these XML files into a backup database because it will streamline incremental processing. Ideally, I'd like to be able to create a database containing imports of all previous years' messages, then periodically have a scheduled task import the year-to-date backup into a copy of this database and then export the HTML. This will save significant processing by avoiding the need to effectively always re-import previous years' backups.

@bepaald
Owner

bepaald commented Feb 19, 2025

I think it is highly advantageous to be able to import these XML files into a backup database because it will streamline incremental processing. Ideally, I'd like to be able to create a database containing imports of all previous years' messages, then periodically have a scheduled task import the year-to-date backup into a copy of this database and then export the HTML. This will save significant processing by avoiding the need to effectively always re-import previous years' backups.

I had a few goes at getting something working to enable this, but so far no luck. I need to think about how this can be done, but I'm a bit short on time this week. Not giving up yet though.

I still think it isn't a very good idea, though. First, the tool behaves differently when importing into a user-provided backup (it has to), though an obvious solution is to also provide an option --targetisdummy to work around that. Second, making a dummy once and updating it for years to come could cause trouble when either the dummy format or the import function needs changes.

I'm also curious about the 'significant processing'. I did manage a speedup in the import function, so maybe test it out and see if you can work with that. With my biggest testing XML (which is admittedly small), importing 21,000 messages has gone from ~6 seconds to ~0.6 seconds (no attachments in that XML though). I may have another idea that might speed up the process, but that needs investigating.

Edit: I have one more possible optimization to go, but it may already be getting to the point where adding extra XML files to re-process is faster than reading an (also ever-growing) input backup and writing it out again after adding a single new XML.


Other changes I managed over the last few days:

  • The Windows path truncating is gone. It should now issue a warning when it detects a path that is 260 characters or longer.
  • All XML/plaintext options now take any number of XML files as arguments, so there is no need to add them one by one or merge them manually beforehand. Just use --exportplaintextbackuphtml [XML1] [XML2] [...] [OUTPUT]
  • There is now a --htmlignoremediatypes option that takes a list of mimetypes that will not be considered media-type attachments. For example: --htmlignoremediatypes "video/3gpp,video/3gpp2,audio/3gpp". I am planning to make this list handle wildcards (so you could use */3gp*), but haven't got around to it yet.
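The planned wildcard handling for the media-type list could work roughly like this Python sketch. It shows one possible matching approach using shell-style patterns and is not the tool's implementation. The function names `parse_ignore_list` and `is_ignored_media_type` are hypothetical, and treating mimetypes as case-insensitive is a choice made for the example.

```python
from fnmatch import fnmatchcase
from typing import List


def parse_ignore_list(option_value: str) -> List[str]:
    """Split a comma-separated option value into lowercase patterns."""
    return [part.strip().lower()
            for part in option_value.split(",")
            if part.strip()]


def is_ignored_media_type(mimetype: str, patterns: List[str]) -> bool:
    """True if `mimetype` matches any pattern; '*' and '?' act as wildcards."""
    candidate = mimetype.strip().lower()
    # fnmatchcase avoids platform-dependent case handling.
    return any(fnmatchcase(candidate, pattern) for pattern in patterns)


# Exact entries, as in the example option value above.
exact = parse_ignore_list("video/3gpp,video/3gpp2,audio/3gpp")
# A single wildcard pattern covering the same family of types.
wildcard = parse_ignore_list("*/3gp*")
```

With these definitions, `is_ignored_media_type("audio/3gpp", wildcard)` is true and `is_ignored_media_type("video/mp4", wildcard)` is false. Entries without wildcards keep behaving as exact matches.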
