Make HTML from SMS Backup & Restore XML files. #273
Hi! The thing with #227 is, it is a bit stalled at the moment, but actual real testing is still very much needed. The feature is really not considered to be complete (which is also why you won't find it mentioned anywhere in the README or The proper dealing with the escaped strings (unescaping xml entities such as However:
No, not really. Not yet, at least. The
I do believe all of these issues can be fixed, by the way.
The approach is correct I think, at least just
This was indeed a bug, that should be fixed now. Thanks for reporting. I did plan on adding mms support at some point, but that is going to be a long process to get right I think. I should have some time this week though. I'll keep this issue updated if you are interested and willing to test and report on the status occasionally. Thanks! |
This sounds really promising, and I'm certainly happy to help test. With the latest build, I was able to run to completion on several different backup files. As you indicated, there is no MMS content of any sort included such as media or group chats. But, other than this and the service number issue you mentioned, the HTML looks good. Thanks for taking this on! |
Just a quick update: I have made an attempt to deal with the service numbers, as well as deal with mms messages (at least the message body and attached media). The next thing I need to do is try to deal with 'groups' in some way (even though mms, and XMLB&R don't really support groups as such, I think), and possibly find some way to easily add names to contacts if they have none. There are a lot of difficult, or impossible to get 100% correct things in these XML files. For example, I am now relying on XMLB&R to set Anyway, I think attempting to deal with groups is not going to be easy, and there are probably a ton of bugs in the function as it is now, so if you find some time, you could always try it out and let me know what works and what doesn't at this moment. Thanks! |
I just tried your 20250122 Windows executable on my 2024 archive. It is ~1 GB and contains significant MMS media and group conversations. I see a lot of progress here, notably in that media embedded in MMS messages appears to be getting correctly processed. Some of the group conversation threads are there too, although there is nothing in the HTML to identify who the sender of a particular message was. I encountered several different and applicable warnings, the highlight of which was this very appropriate message:
However, HTML output generation aborted during generation of HTML for thread 34 (out of 180). The error simply indicates Thanks again--this is some really impressive progress! |
Thanks for testing.
That is a strange one. Could you be more detailed about the actual error message? Just from grepping through the source code, this can't be it literally, there must at least be
I'm having a bit of a tough time dealing with group conversations properly. I actually don't have any in my SMS history, and can't create any because MMS has not been supported by any telecom provider in my country since 2019. The only group messages I can work with are ones I created myself by using my own tool to Am I correct that I can detect group conversations simply by the Thanks! |
My apologies. In redacting the contact names for publishing, I omitted to copy
The path refers to the conversation HTML file in the thread directory. The directory had not been created yet.
I just tried creating the directory and file and it worked, so I guess the OS doesn't mind the name.
In this case no additional options, just
Interesting. What country, and what is the standard messaging protocol there? Things are quite a mess here in the US at the moment with device makers and carriers alike trying to force everyone to use Google Messages. It appears that this will soon be the only app supporting RCS messaging here. Interestingly though, RCS messages still seem to get stored in the telephony messaging database, which is very convenient since it makes it possible for them to be backed up by SMSBR. But as an aside, in recent times, Google has begun extending its crackdown on rooted devices, not only breaking RCS for them, but doing so silently. The RCS status in the app shows you as online, but you do not receive any RCS messages, nor does anyone receive your RCS messages.
By casual inspection, I believe that is correct, though I have not read the spec nor had any firsthand experience writing a parser for these files. Where contact information is available, the contact_name field also contains a comma + space delimited list of contacts that are parties to the thread. |
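As a rough illustration of the comma + space convention described above, here is a minimal Python sketch. The function names are hypothetical, and it assumes that an individual contact name never contains ", " itself (a name stored as "Smith, John" would be split incorrectly).

```python
def split_contact_names(contact_name: str) -> list[str]:
    """Split a thread's contact_name value on the ", " delimiter."""
    # Assumption: individual names do not themselves contain ", ".
    return [name.strip() for name in contact_name.split(", ") if name.strip()]


def looks_like_group(contact_name: str) -> bool:
    """Treat a thread that lists more than one contact as a group thread."""
    return len(split_contact_names(contact_name)) > 1
```

This only works when contact information is available in the backup, so it is better treated as a hint than as a reliable group detector.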
Thanks, with that and my helpful typo, it was easy to find the spot where it fails.
It's strange that the directory had not been created yet. I now think there are two errors: it doesn't create the directory, and it doesn't show an error when it fails to do so. I have no idea what could be causing this. It might have something to do with the directory name it is trying to create. I understand there are contact names in that directory name, but is there anything you could consider special about it? This may be too much trouble, but maybe if you delete all messages but one (the first one for that conversation) from the XML file, could you then try to edit the "contact_name" field to see if you can reproduce it with a more anonymous name? Maybe I can think of something myself, without you going through all that trouble, but I just wanted to suggest this before going to bed (it's late here).
This is in The Netherlands. I think in reality 99% of people have WhatsApp installed, and use that for most of their messaging. Other than that all providers still support SMS, just MMS has been discontinued.
Thank you, that is useful information. I hope to continue working on this function over the next few days. The group stuff is going to get a little complicated, but it should be doable. |
The directory actually does get created. In my haste the other day, I simply confused it with another directory. Sorry about that.
I did this and was able to reproduce the problem. Similarly, by taking just two letters off the name in the last contact ("er"), I was able to get the HTML output. Note that my testing thus far has been with your Windows binary. Next, I built signalbackup-tools from source on Linux and had no issues generating HTML output with the original file. So, I think it's reasonable to conclude that you are hitting some kind of a path length limit on Windows. I know Windows has long defined MAX_PATH as 260 as discussed here and the path in question is right at that limit. However, my Windows tests have been on Windows 10 22H2 with the LongPathsEnabled registry value set to 1. That said, there appear to be some important considerations governing long path support, notably that only a subset of APIs actually support the long paths. I wonder if it is possible that a library you are linking against might be using an API that does not support long paths.
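To make the limit concrete, here is a small Python sketch of a path length check. It is not taken from signalbackup-tools; the helper names are hypothetical. It assumes the classic MAX_PATH of 260 counts the terminating NUL, so the longest usable path string is 259 characters. The `\\?\` prefix shown is the Win32 extended-length path form, which only helps with APIs that accept it.

```python
from pathlib import PureWindowsPath

MAX_PATH = 260  # classic Windows limit, including the terminating NUL


def exceeds_max_path(path: str) -> bool:
    """Return True if the path string cannot fit in a MAX_PATH buffer."""
    normalized = str(PureWindowsPath(path))  # normalizes "/" to "\"
    return len(normalized) >= MAX_PATH


def to_extended_length(path: str) -> str:
    """Prefix an absolute drive path with \\\\?\\ (hypothetical helper).

    Assumption: `path` is an absolute drive-letter path, not a UNC path.
    """
    prefix = "\\\\?\\"
    normalized = str(PureWindowsPath(path))
    return normalized if normalized.startswith(prefix) else prefix + normalized
```

A check like this could be run on each output path before writing, which would at least turn a silent failure into a clear message.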
Interesting. WhatsApp is not especially popular in the US. But then again, Signal is even less popular, and I generally find it hard to get non-technical types interested in using Signal. Sadly, in the US the Android messaging ecosystem is very fragmented, probably as a consequence of how long it has taken for RCS to gain significant market share; Google's recent decision to prevent root users from using RCS certainly will not help this situation. Is RCS widely used in Europe? I found something else interesting. It seems that in SMSBR, MMS messages also have an addrs section:
So maybe this is another feature of the SMSBR backups that can help you detect a group conversation. A few other things I noticed now that I was able to process my 2024 archive.
Thanks! |
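For readers following along, here is a hedged Python sketch of how the addrs section mentioned above might be used to spot group messages. The `addr` element and `address` attribute names, and the sample XML, are assumptions for illustration (all numbers are fictional). It also assumes that a one-to-one MMS lists exactly two distinct addresses, the sender and the recipient.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element and attribute names are assumptions.
SAMPLE = """<smses>
  <mms date="1">
    <addrs>
      <addr address="+12025550101" />
      <addr address="+12025550102" />
      <addr address="+12025550103" />
    </addrs>
  </mms>
  <mms date="2">
    <addrs>
      <addr address="+12025550101" />
      <addr address="+12025550102" />
    </addrs>
  </mms>
</smses>"""


def mms_addresses(mms: ET.Element) -> list[str]:
    """Collect every address listed in the message's addrs section."""
    return [a.get("address") for a in mms.findall("./addrs/addr") if a.get("address")]


def is_group_message(mms: ET.Element) -> bool:
    """More than two distinct parties suggests a group message."""
    return len(set(mms_addresses(mms))) > 2


root = ET.fromstring(SAMPLE)
flags = [is_group_message(m) for m in root.findall("mms")]
```

This only gives the right answer if the same number is always written the same way, so differently formatted copies of one number would inflate the count.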
Thank you, very helpful. I have done some testing and think I have a solution.
Yes, I did know about that one. I have just pushed an update to hopefully handle group conversations better.
Thanks, should be fixed now.
I would love to have some samples of this (the way they are written in the XML file, and (if you know) what character they are supposed to represent)
Curious, is there anything special about the ones that don't open, are they all some specific file type? When you say "does not open" are they not displayed at all, or can they only not be clicked to enlarge or save? Thanks for the feedback. I have no doubts there are more bugs or little things that need improving, so whatever you can find, let me know. Cheers! |
I was able to fully process my unmodified 2024 archive using your latest (20250126-2) Windows binary. Of course, long path support should really work everywhere, not just in HTML generation since there is no telling how long the path could be to a user's backup file or output. Apparently though a lot of things in Windows are still pretty reluctant to work with long paths. The File -> Open dialog in both Firefox and Edge (which I believe is the common Windows open dialog) refuses to accept the entire path to one of these long file names that I had copied using the Explorer context menu "Copy as Path" item. It seems that the "File name" field itself has a length limit, so I had to instead browse through the directory tree using the dialog itself, one level at a time. While this isn't an issue with signalbackup-tools per se, it might be a useful option to be able to generate HTML output with short directory and file names (maybe just based on the thread IDs). Users employing a file manager to browse the HTML tree for conversations of interest wouldn't like this so much, but those using the generated index.html would not be inconvenienced by it. I haven't yet tried putting my generated HTML tree on a web server and accessing it over http. I should probably try this at some point. I don't expect any path length problems there, but it would be useful to verify this as well as make sure that there are no URL-encoding issues in the links provided. On the latest build, I did receive a lot of these messages though:
Confirmed; the ellipsis makes a lot of sense on the index page. I also noticed that there is an analogous issue on the conversation pages themselves. In one case, a conversation has 8 members and the result is that the browser window is quite wide with a large horizontal scrollbar. Wrapping these names would make a lot of sense. In this same thread, the conversation is incorrectly listed as having 3 members, and the "show details" drop down link when clicked only shows 3 of the 8 names, even though all 8 are shown at the top of the conversation page.
Here is one without redacted body about a particularly boisterous puppy, which is significant because there does not appear to be any odd content in the message (other than perhaps the quoting):
This message is shown as a number of question marks in diamonds in the HTML output.
I tried saving one of the files out of Google Messages, and here's what I found:
Don't mind the name--Google Messages made the horrible choice of naming the file based on when I saved it (and doesn't offer me a choice on the name or location). Maybe this format isn't understood by the embedded video player? I can look for more examples of these issues if you need. Thanks! |
Thank you for your thorough reply.
Right, that is a problem. Also, since making this change I've also had a new issue where apparently it causes all paths to fail to open, for some reason (#277). So, I think I should probably revert that change and attempt to keep the filenames short. That is not going to be easy though. I can already foresee people wanting proper names for the output (as they wanted for attachments), but depending on how long the directory structure is where they output the file, even just
Thanks, I'll look into that one, I'll probably have questions about this later. If you run
Right, I should be able to deal with the conversation pages. I expect the missing members might be related to the previous problem... edit I actually have an idea about this (and the previous), hopefully I'll have some time to deal with it tomorrow.
Curious, it seems to work fine here: You are sure this is the message? Is the entire message replaced by those diamonds or only certain characters?
I imagine this is the case, I'll try to look into it. I have some work to do, thanks again, I'll let you know when I've made some changes or have more questions. |
I've made some adjustments:
Some other remarks:
Let me know if these changes have made things somewhat better if you find the time. Thanks! |
My entire 2024 archive now has only one group chat (which happens to have 3 participants total). All the others are missing from the HTML output. I'm also still getting a ton of those errors, but I don't know what messages are causing them so it is hard for me to provide additional information.
I can't tell at the moment since I am missing almost all of my group conversation threads.
I think it would be great to give the user two choices: one for the original scheme (which is very descriptive), or one for a very compact scheme (like id_nn_\id_nn_.html). This seems like a better scheme than producing different output just because it's on Windows. Your scheme for creating HTML with very long paths on Windows did seem to be working correctly after all.
I tried this myself today (using the latest build), but in this case, instead of copy/paste, I simply deleted all the other lines in the XML file and used File -> Save As in Visual Studio Code to write a 1-message XML file. The HTML displayed all the characters in the message properly. Barring any strange issues with Visual Studio Code, I really have no idea what is causing this corruption. It is also worth noting that in the failure case for this message, every single character was a question mark in a diamond box.
All of this content is inside of group chats that I don't have access to in the current HTML. I will test this on the next build that works unless you would prefer that I test it on a previous build. Thanks! |
Ouch, it seems I have broken things thoroughly. I'm not sure how right now. I will try to see what is going wrong when I get back from work.
I've quickly added the actual address it fails to find to the error message, maybe something about them provides you with insight. These 'source_recipient_id's are supposed to be the senders of group messages (
Yes that would be nicer, I might look into that at some point in the future. Because you were having trouble with the original way, and #277 was having trouble with the new way, I wanted a quick solution. Offering different options will take some time to implement correctly and require updates to the argument-parser and README.
Right, that's the priority when I get back from work. I hope I can think of something. I'm just going to show what I am seeing when I test this feature currently, maybe you see something noteworthy. Note, I am passing both
(note all phone numbers in there are fake) Maybe that's useful. It's hard working on a feature such as this blindly. Maybe at some point it is possible for you to create a minimal example of an XML where threads are skipped, by deleting all but a few messages and the actual message contents. But first, I'll investigate when I get back from work. EDIT I think I managed to craft an XML file that shows that same error ( |
I've made some changes that could have affected your issues, so please try again when you have time. Thanks! |
Just a quick response--unfortunately a bit short on time today. The latest build you published (20250128-2) is generating output for more but not all of my group conversations. Of note, the group thread with the problematic .3gp video file I referenced earlier this week is missing. Still getting a lot of these:
Also started getting a bunch of these:
Plenty of these too:
Also some of these:
On the conversation pages, I see a lot of senders referred to by number or "(unknown)", including in cases for which a contact name was displayed in the past. The member counts in the details section seem to be accurate now, notwithstanding the above. At the very top of the conversation page, By sharp contrast, the |
Hm, I have to admit I'm quite confused at this point. I've tried long and hard to craft XML files that run into any of those errors, but I can't make it happen. I think I'm making some assumptions about the format of the XML file that may not hold. I do believe most of the errors are simply propagating a single bug and causing each other, so I hope it will turn out to be less bad than it looks, but we'll see. I've included a little fix (hopefully) for the suddenly missing contact names. Other than that I mostly just added a bunch of verbose output to the process in the hope it will help me understand what is going on. I'd like to ask you to run the tool again and report back all of the output, from the start up until the first I've attempted to censor all the phone numbers in the output (and there shouldn't be any names or anything else), but the data in the I of course don't know what your XML file looks like, maybe I'm asking you to manually inspect and adjust thousands of lines of output before that first error, so don't feel obligated, I'll keep doing my best to get this working either way. But at this point this seems the quickest way to get some valuable info. Thanks! |
I suspect part of what you're facing here is that there is a lot less consistency in SMSBR XML files than anything you see in the Signal ecosystem, which is pretty closed. By contrast, the SMS/MMS ecosystem (and the resulting SMSBR backups) are subject to nuances associated with different carriers and messaging apps. Do you have any logic for handling number normalization? For example, I store all of my contacts with a leading '+' followed by country code and the local number, like this: "+1 (847) 555-1212". This way my contacts "just work" when I am roaming internationally. Lots of users don't have this discipline because they don't have to--simply storing "8475551212" is sufficient for messaging and calls to work domestically. Somewhere in the stack, the unnecessary punctuation and spaces I use get stripped; SMSBR reports the number as +18475551212. The real challenge here is with incoming messages, since the formatting/completeness of the number is at the carrier's discretion. It's common for incoming messages to be lacking the + country code (e.g., "8475551212"), and when roaming internationally I have seen all sorts of odd prefixes attached to incoming numbers when I receive calls, many of which don't make a lot of sense.
I'm still seeing a lot of issues in my chats. I find this set of messages interesting:
So this appears to be somehow related to a number normalization issue--somehow the two numbers are not recognized to be associated with the same contacts. Yet strangely up above in the
That said, are these somehow being handled as separate contacts? That wouldn't make sense.
I could redact numbers from all these SQL error messages with some regexes to produce output similar to what you are doing in the other messages. If that's useful, I'll work on it in the next day or so, and preferably send that via some means other than GitHub (Signal?)
If the above proposal is less than desirable, we could have a Teams meeting and I can show you the live output being produced by signalbackup-tools; I am willing to dig through the XML for corresponding data to try to get to the bottom of some of these issues. Thanks! |
Ugh, right! That is a big problem, and could very well be the cause of most (if not all) errors you are seeing. Though for numbers with a I was really counting on the E.164 format, the way Signal stores it. I don't really understand why messaging apps don't also normalize before storing the number, even for incoming messages. It has to be done anyway, to place a message in the correct conversation I assume. Anyway, it is quite a tough problem, I remember having investigated this years ago and giving up on it. I don't think there exist implementations that actually do this 100% correctly, let alone that I can write one. The simplest thing I can do, that would probably catch a lot of cases is to simply strip non-numerical characters (except '+') and remove any leading double-'0' and replace that with a '+'. However there are then still plenty of possible problems, probably most commonly if the country code is left off of a number in one message, but not another: I don't think there is any way to automatically match that ever (without being told the country code to prepend (for each number)). Unless I simply start matching only the last X digits of every number (maybe the last 6 or 7), but it's obvious that this could cause issues as well. Honestly, this might be best solved by the user preparing the XML file to normalize the numbers, before feeding it to this tool. I'll do some more research, and have a think about this. |
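The two ideas in the comment above (strip everything but digits and '+', turn a leading "00" into "+", and optionally compare only the last few digits) can be sketched in Python roughly as follows. This is an illustration of the described rules, not the code in signalbackup-tools; the function names and the default of 7 digits are assumptions.

```python
import re


def normalize_simple(number: str) -> str:
    """Strip punctuation and turn a leading '00' into '+'."""
    cleaned = re.sub(r"[^0-9+]", "", number)
    # Keep a '+' only if it is the very first character.
    cleaned = cleaned[:1] + cleaned[1:].replace("+", "")
    if cleaned.startswith("00"):
        cleaned = "+" + cleaned[2:]
    return cleaned


def same_by_suffix(a: str, b: str, digits: int = 7) -> bool:
    """Compare two numbers by their last `digits` digits only."""
    da, db = re.sub(r"\D", "", a), re.sub(r"\D", "", b)
    if len(da) < digits or len(db) < digits:
        return da == db  # too short for suffix matching (e.g. service numbers)
    return da[-digits:] == db[-digits:]
```

With these rules, "+1 (847) 555-1212" becomes "+18475551212", but "8475551212" stays as it is, so the two are not recognized as the same contact. Suffix matching would pair them, but it would also pair "+13125551212" with "8475551212", which is the kind of false match the comment warns about.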
Having slept on it, I think there are several things I could do to help this situation. A couple of questions: Does this same issue apply to group-addresses? That is, if one group message has an address list including When you look at the Lastly, about the diamond characters: I noticed one of my contacts being split over two chats, despite the address being the same format in both ( Thanks! |
I've made some changes, both to normalize phone numbers and to make sure any unique contact name only points to a single address. I do not dare anymore to guess whether this has made things better or worse, or nothing at all. But if you have the time you can try it out. You'll probably want to add the option I will not have much time to do more work on this until Sunday probably. |
Wow, this build is a massive improvement, thank you!
This of course raises a couple questions:
So, at least on my data, it looks like the number normalization logic you introduced really helps. Strangely enough, the messaging stack on my phone must be really good at this because conversations are usually threaded correctly after I restore a SMSBR backup. This build generated a lot of verbose output without me turning anything on. In other matters:
Good, finally a little progress.
Even with my very limited and simple phone numbering normalization, I'm running into the messiness of it. The
The names at the top of the conversation page are simply what is listed as the As for the names and numbers in the details, at least these add up to 11, so that seems right. It seems 7 of those numbers do not appear with an associated name in the XML file. Could this be correct? Does
True, in order to get more useful information on errors, I added a bunch of output. This entire function is in active development and undocumented, but if it ever gets completed and working somewhat reliably I will clean up the output. Are there still any of the errors you saw before? Any of these:
Ok, so in that regard I believe there is no problem in this function. As far as I know there is no good way to query the browser whether a video file is supported or not.
Indeed, that all looks good. This is a mysterious one. If you inspect the HTML output, what does this message look like, maybe there are HTML entities in there, any other weirdness, or does it look ok? In my test it looks correct:
<!-- Message: _id:68,type:10485783 -->
<div class="msg msg-outgoing msg-sender-1">
<div>
<pre>he seems to be entering a really rowdy phase. he was trying to grab my yeti cup off the coffee table yesterday right before i left. wouldn't stop when i screamed "NO" at him either.</pre>
</div>
<div class="footer">
<span class="msg-data">Dec 19, 2024 18:14:00</span>
<div class="footer-icons checkmarks-received">
</div>
</div>
</div>
If there are diamonds in the HTML file as well, maybe you could hexdump that as well to see what the actual bytecodes of them are? Maybe that helps a little. Also, please double check the timestamp to be sure the message really corresponds to the diamonds-one (note it's localized in the HTML, so mine in the output above will be different from yours). Thanks! |
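A small Python sketch of the kind of byte-level check suggested above. It assumes the "question mark in a diamond" is how the browser draws U+FFFD, which can mean either that the file literally contains that character (bytes EF BF BD in UTF-8) or that the file contains bytes that are not valid UTF-8. The function names are hypothetical.

```python
REPLACEMENT = "\ufffd".encode("utf-8")  # b"\xef\xbf\xbd"


def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values and printable ASCII."""
    pad = width * 3
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}} {text_part}")
    return "\n".join(lines)


def diagnose(data: bytes) -> str:
    """Classify a message body as clean, invalid UTF-8, or containing U+FFFD."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid-utf8"
    if REPLACEMENT in data:
        return "contains-U+FFFD"
    return "clean"
```

Running `diagnose` on the bytes between the `<pre>` tags would show whether the corruption is already written into the HTML file or only appears when the browser decodes it.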
Yes, it is 011 here. That said, the '+' notation is appealing both because it is compact and it "just works"--when all contacts are stored this way, in my experience the phone will correctly dial/send a message to any contact anywhere in the world from anywhere in the world (i.e., without regard for where the phone might be roaming or the country of the contact).
This number is listed correctly in
100% effectiveness is probably not realistically achievable. Things get especially complicated with the fact that country codes aren't all the same length, different schemes are used for reporting numbers of incoming messages by carrier, etc. I thought about trying to normalize the numbers in my files at the beginning of all this, and honestly I'm not sure how practical it would be. From my perspective, if it is more than a couple of regexes, it's probably not a realistic expectation for users to do this.
Interesting. Google Messages definitely gives you the option to set group names. The default group name displayed in the UI is a list of comma-separated contact names, followed by comma-separated numbers for names with no contacts available. That doesn't match the names showed in HTML export though, so maybe these don't get written to the system messaging database for SMSBR to grab. Perhaps it makes sense to offer an option to have the tool generate a name for group chats based on the actual participant data shown in the details section.
It looks like there is a trend among these: all of the contacts with no names displayed in the details section of a group chat appear to be those with whom I have no individual conversations. They don't show up as individuals in If support from the tool to correctly determine these contact names from SMSBR XML is not possible, it would be really neat to be able to pass in the contact export file from my phone (a .vcf file of sorts, at least in the Samsung ecosystem) and get contact names resolved that way. In the absence of that, I guess I could write a script that calls signalbackup-tools with an outrageously long It's also a bit peculiar that my own number is listed in the group details section. It seems like it would make sense to put my name in there, just like it is in details from HTML exports of real Signal chats. I guess I'm not entirely surprised by this outcome though since I am not in
The .3gp files are pretty ubiquitous for low quality MMS video. Neither Firefox nor Edge appear to support them. I don't know what the answer here is, but it would be very challenging to replace them in preprocessing. If I could somehow grab the data without actually parsing the XML (i.e., use a regex or something simple), it might be doable (e.g., base64 decode, transcode with ffmpeg, base64 encode, replace). That's still far from "easy" though.
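The decode, transcode, re-encode idea above could look roughly like this in Python. It is only a sketch: it assumes an `ffmpeg` executable is on the PATH, lets ffmpeg pick codecs from the output extension, and leaves out the hard part (locating the base64 data in the XML and writing it back). The function names are hypothetical.

```python
import base64
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def ffmpeg_transcode(src: Path, dst: Path) -> None:
    """Convert src to dst with ffmpeg (assumed to be installed)."""
    subprocess.run(
        ["ffmpeg", "-y", "-loglevel", "error", "-i", str(src), str(dst)],
        check=True,
    )


def transcode_base64(
    data_b64: str,
    transcode: Callable[[Path, Path], None] = ffmpeg_transcode,
    src_suffix: str = ".3gp",
    dst_suffix: str = ".mp4",
) -> str:
    """Base64-decode, transcode via temporary files, base64-encode again."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"in{src_suffix}"
        dst = Path(tmp) / f"out{dst_suffix}"
        src.write_bytes(base64.b64decode(data_b64))
        transcode(src, dst)
        return base64.b64encode(dst.read_bytes()).decode("ascii")
```

Any attribute that records the attachment's content type would also have to be updated to match the new format, otherwise the HTML would still describe the file as 3gp.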
Here's a look at the hex dump of the body of that message in HTML:
I definitely double checked this. I noticed the timestamp localization too. That's a nice touch. |
I've added a bunch of output to the process: When you now run the program, with
$ ./signalbackup-tools --exportplaintextbackuphtml SMSBRedit2.XML PTHTML/ --setcountrycode "+1" --append
*** Starting log: 2025-02-03 14:04:58 ***
signalbackup-tools (./signalbackup-tools) source version 20250203.140245 (SQlite: 3.48.0, OpenSSL: OpenSSL 3.4.0 22 Oct 2024)
normalizePhoneNumber in: 00-1-202-688-5500
normalizePhoneNumber out: +12026885500
normalizePhoneNumber in: (202)688-5500
normalizePhoneNumber out: +12026885500
normalizePhoneNumber in: +12026885500
normalizePhoneNumber out: +12026885500
normalizePhoneNumber in: 011-1-202-688-5500
normalizePhoneNumber out: +12026885500
normalizePhoneNumber in: 011381688-5500
normalizePhoneNumber out: +3816885500
normalizePhoneNumber in: 00381688-5500
normalizePhoneNumber out: +3816885500
normalizePhoneNumber in: +381688-5500
normalizePhoneNumber out: +3816885500
[etc...]
Maybe you could provide some examples of numbers that normalized correctly before, but don't anymore. The numbers I'm testing still seem to work, and I don't immediately see how my previous change could have broken it. If you see some obvious pattern in them, describing it may suffice. Otherwise, just post it, replacing the final digits with x's (the leading characters are probably important). edit: nevermind this part, I don't think it's useful, I'll come up with some better debugging info later. Following this you should see something like:
This is my attempt at finding out about the remaining edit: nevermind this, this problem should be fixed. Then you will see:
This is me tracking where the body of that message turns into garbage. Note I made changes which may have actually fixed this Then a few more tables are printed. In my case it looks like this:
What do you see? edit: nevermind this, again, this problem should be fixed. Then starts the normal import process, which still generates a ton of output (I'll clean it up sometime soon I think). Somewhere in between there, you will see:
Again, tracking when the message turns to diamond characters. If that issue was still not fixed, please let me know what these three lines say.
I don't mind implementing that when most other things are working properly. In the meantime,
If the name does not appear in the XML at all, or only in incomplete and out-of-order I am a bit surprised you are not listed by Thanks! |
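For reference, the in/out pairs in the normalizePhoneNumber log earlier in this comment are consistent with a rule set like the following Python sketch. This is a guess at equivalent behaviour, not the tool's actual implementation: it assumes "011" and "00" are both treated as international dialing prefixes, and that the value passed to `--setcountrycode` is prepended to anything else. National trunk prefixes (a single leading "0") are not handled.

```python
import re


def normalize_phone_number(number: str, country_code: str = "+1") -> str:
    """Normalize a number toward '+<countrycode><number>' form."""
    digits = re.sub(r"\D", "", number)
    if number.strip().startswith("+"):
        return "+" + digits  # already international, just strip punctuation
    for prefix in ("011", "00"):  # assumed international dialing prefixes
        if digits.startswith(prefix):
            return "+" + digits[len(prefix):]
    return country_code + digits  # assume a domestic number


examples = [
    "00-1-202-688-5500",
    "(202)688-5500",
    "+12026885500",
    "011-1-202-688-5500",
    "011381688-5500",
    "00381688-5500",
    "+381688-5500",
]
normalized = [normalize_phone_number(n) for n in examples]
```

A number that a roaming carrier delivers with two stacked prefixes would still come out wrong under these rules, since only one prefix is removed.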
Sorry for the delayed response.
The 20250203-1 build seems to have fixed the number normalization regression I reported in my previous test. The +381 number is also correctly handled in this build. This is the only normalization issue I found:
In this case, I think this was one of those situations where a roaming carrier reported an incoming message from a US number in a strange format while I was out of the country. Interestingly, the normalization result is almost right--but the country code is listed a second time, with zero padding to 3 positions. This also showed up in the normalization section:
This is what I got:
I got this:
The rowdy puppy text was also correctly displayed in the HTML output starting in this build. I also didn't see any instances of this corruption elsewhere in output produced by this build. Nice work!
I see this:
As noted above, this issue is resolved. But in the interest of completeness, here is what I got:
I also found 49 of these in the import section:
Every one refers to this exact number. The contact has no name, but is listed in the
I used several name assignments via I have not done any experiments to see if I can create a backup with signalbackup-tools and use the SQL statement feature to import contacts into a backup ahead of importing the XML. This might be a nice approach to handling contact mapping too. Any thoughts on approaches for handling .3gp videos? |
Not a problem, I don't always have time myself. Thanks for your thorough response.
That's curious. Just like I didn't do anything to break it in the previous build, I didn't do anything to fix it in this one, I only added a bunch of output. But I'm happy it's working.
That's a weird one, it has two international prefixes and countrycodes (
Also very curious. I don't think I touched this part of the code in a long time, if there is a message without an I have made a change so that when such a message is found and skipped, the XML-node of that message is printed to screen (it will likely contain private info, so don't paste it here). Do you see anything special about it? Is there really no
Me too, and it has been bugging me for a while. I've added some extra output to this error. I'm not completely sure it will give me a hint, because I'm not sure what I'm looking for and I can't reproduce no matter what I try. But maybe you could inspect that output, censor a little if necessary and post it here. Also, any messages about creating threads or recipients directly preceding that error might be relevant.
I'm not entirely sure what you mean by this. If you want to use
No, like I said, there is no querying the browser whether it can play any specific video (as far as I know). I could simply exclude 3gp-type attachments from being handled as video types, but that would turn it into any generic attachment in the HTML (still might be an improvement though, at least there will then be a download-button). Also, while your browser (and mine) doesn't play the 3gp files, for all I know other users have setups where they just work, or a future update will add support for them. Not sure what to do, do you have any ideas? Thanks! |
Looking forward to seeing what you come up with.
Wow, there really is no address attribute in the Interestingly I see
I got even more of these (60) using the 20250207 build including for some other numbers. Here is an example.
The only thing that really jumps out at me here is
The only point of these SMSBR imports is to produce HTML output. One idea I had was effectively creating an otherwise empty database and using If you aren't keen on parsing contact export files, even a simple
I won't call it a great idea, but I was imagining some sort of method to transcode offending attachments into something more browser friendly. If I were going to do this myself as a pre-processing step, I'd probably be looking for attributes like this:
Then I'd need to grab the value from the |
Just very quickly, I don't have much time, I'll respond to the full message later.
I think there is significance to the number not being referenced in normalized form, as well as that it appears like that in both the Thanks! |
My apologies: I did in fact forget to set
+xxxxxxx7393 is present in group chats only (no one-on-one conversations) and has no saved contact name. The contact is shown in The problem remains even when I explicitly assign a contact for +xxxxxxx7393 via Thanks. |
I've attempted to handle this case. It depends on the number starting with two international call prefixes and countrycodes (as set by
I use the I think this is a case where the XML file needs to be prepared by the user to be valid. That is, I think you should probably add the correct
Well, like I said, I don't think creating an empty backup helps with the mapping problem, but I suppose you could try. I have no problem implementing a
So, from my research, it seems indeed that 3gp files could be supported by some setups, see for example here. Though currently, it seems no known setups do actually support it (see here). Although I did find some references to it working on macOS. I could possibly simply not treat 3gp attachments as video, then they would appear in the HTML as any other non-media attachment. The other option is what you are thinking. I would personally opt for post-processing instead of preprocessing. That way you would not need to parse and edit the XML file and need no base64 encoding and decoding. You could just transcode the file directly and edit a single line in the HTML to point to the new file (if you change the name) and alter the
So, that's unexpected again.... According to the I suppose I could use the list of recipients to determine if something is a group, and make up my own identifier and keep a list that way. But that is going to be quite a few steps backward, I would need to rework the logic of the function and start over from there... That would also solve the case of the message with the missing address (though the XML file is still technically invalid). |
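To illustrate the idea of deriving a made-up group identifier from the recipient list, here is a minimal Python sketch. It is not the tool's code (signalbackup-tools is written in C++). The function name `make_group_id`, the hashing scheme, and the assumption that the list holds only the other parties' normalized numbers are all hypothetical.

```python
import hashlib

def make_group_id(recipients):
    """Return a stable identifier for a set of recipients, or None for 1-on-1.

    Assumption: `recipients` holds already-normalized numbers of the other
    parties only (the backup owner's own number is not included).
    """
    unique = sorted(set(recipients))
    if len(unique) < 2:
        return None  # a single recipient is treated as a 1-on-1 conversation
    # Sorting first makes the identifier independent of recipient order.
    digest = hashlib.sha256("|".join(unique).encode("utf-8")).hexdigest()
    return digest[:16]

# Hypothetical numbers, for demonstration only.
id_a = make_group_id(["+15550100", "+15550101"])
id_b = make_group_id(["+15550101", "+15550100"])
id_single = make_group_id(["+15550100"])
```

Because the identifier depends only on the set of numbers, it is only as reliable as the number normalization that feeds it.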
This seems to be working, at least in my test data:
There is only one instance of this error in my 2024 archive. There are only two
This does seem to be an example of a malformed XML file. Of course the challenge here is that preprocessing at this level is pretty hard as it would require full parsing of the XML to do this (generate I don't fully understand the point of the My hope is that signalbackup-tools would ultimately support gathering address information from the
If there was a
My impression is that support is very scarce in the desktop browser space. But, .3gp video containers are very common in the mobile space, notably for video sent via legacy MMS.
I'm not convinced that postprocessing is simpler. I think it is more of trading one set of problems for another set of similar complexity problems.
Most group messages do seem to have the I just went and dug through the XML a bit, and noticed a couple things:
In the case discussed up above, this is clearly an issue with the SMSBR output. I think this approach would be a good way of dealing with that. In the second case, I think there is some other issue going on here in signalbackup-tools that is preventing the creation of the contact entry for the group conversation. Also, I am wondering if this approach could enable better contact name identification for group conversations. Thanks. |
Good!
Is one of the
Well, yes that would be hard if it needed to be done for thousands of messages, but if there is just a single one adding it manually is definitely the more efficient option. Getting the correct
For group messages, the
I'll consider it when most other problems are solved.
Yes, well, that is also on the to-do list for when no other problems remain. As well as an
Ok, that is interesting. I agree in this case it is probably a bug in the tool (but I'm happy with that, I think it will be easier to solve than starting over without relying on the If you find the two messages reported by this tool in the XML file (easiest probably to search for the I've added some more debugging output for this case, maybe you could run again. I'm expecting some output, directly after the
I'm expecting the Thanks! |
Yes.
If it is just one, that's fine. I haven't run signalbackup-tools on any of my older archives though yet to see if the condition exists in those.
This is my impression too.
I looked at a few of these and it seems that the contact_name attribute always has the value of the name associated with +1XXXXXX0113.
In each of these 4 tables, the value of |
There are now
Right, so this was the problem all along. To match messages with existing threads, it would first attempt to match on contact_name, and only by phone number second. The idea being, that no one would have two contacts with the exact same name in their database (even if there were two "John Doe"s, they would likely rename them to "John 1" and "John 2" or similar). Also, this meant that phone numbers did not have to match precisely: the incorrectly normalized Of course, this doesn't work for this case, where the group gets the name of the only group member whose name is available, thus matching a 1-on-1 conversation with this person. So, now the tool matches on phone number instead. This makes it rely heavily on the phone number normalization: if it fails conversations may be split. As I'm writing this, I'm thinking when Thanks! |
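The difference between the two matching strategies described above can be shown with a small Python sketch. This is an illustration only, not the tool's implementation. The message layout (`contact_name`, `addresses`, `body` keys) and the example numbers are hypothetical.

```python
def thread_key_by_name(message):
    # Old strategy as described: the contact name decides the thread.
    return message["contact_name"]

def thread_key_by_numbers(message):
    # New strategy as described: the set of (normalized) numbers decides.
    return frozenset(message["addresses"])

def group_into_threads(messages, key_func):
    threads = {}
    for m in messages:
        threads.setdefault(key_func(m), []).append(m["body"])
    return threads

# A 1-on-1 chat and a group that inherited the name of its only named member.
messages = [
    {"contact_name": "Alice", "addresses": ["+15550100"], "body": "1-on-1"},
    {"contact_name": "Alice", "addresses": ["+15550100", "+15550101"],
     "body": "group"},
]
by_name = group_into_threads(messages, thread_key_by_name)
by_numbers = group_into_threads(messages, thread_key_by_numbers)
```

Matching by name folds both messages into one thread, while matching by numbers keeps them apart. The same sketch also shows the trade-off: if normalization produced two spellings of one number, the number-based key would split a conversation.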
WOW! I got exactly the expected output! This is exceptional.
I tried both of these and they worked as expected. My contact name file was still a fairly small one; I'll have to do some bigger tests in the coming days. One small observation is that the group names for long conversations, both on index.html and at the top of group conversation pages, have the comma separating long member lists at the beginning of a new line, rather than at the end of the previous line (where I would argue it is customarily/grammatically expected).
One of the things I will try to test soon is a contact map file in which a name appears many times, each time mapping it to a different number. I will be interested to see if the resulting conversations are threaded properly even though the sender has different numbers during different portions of the exchange. Apparently I have some shady contacts with a habit of using burner phones/numbers. :)
I didn't see any normalization issues in this build on my test file. Also, the only skipped message was the one missing the address field. I need to go dig through my other archives and see if I have a lot of these. I guess that leaves only the following open points for possible further enhancement:
Thanks for your AMAZING work! I know it's not much and doesn't come close to compensating you for your work, but I left a small donation at your sponsorship link. I've been looking for a mechanism to create beautiful browsable/printable archives of SMSBR backups for over a decade now and nothing has even come close to what you accomplished. Your passion for this project and attention to detail is unmatched. |
Great that it's mostly working now, very happy with that.
Yes that would be better, but should already be happening. I can't really reproduce this, but I know line-breaking has been a tough problem on the HTML side. I've tried increasing the size of one of the names in a group slowly (one 'i' at a time) and see the following: One more 'i': So that seems correct to me. This sort of thing is also likely influenced by the browser and possibly the fonts used... I'll try another browser tomorrow. You also mention this happening on the index page, but on the index page, the group name is never supposed to be more than 1 line: Is that different for you?
I am pretty sure it will create multiple conversations with contacts who happen to have the same name. If it is desired to merge some contacts (who have simply changed number), I think it shouldn't be too difficult to add another option to do so.
Yes, I'm curious to hear how well this works on other XML files, I hope results are just as good as this, but maybe we were lucky with this XML file...
Point 1, was actually next on my to-do list.
Anything else (like transcoding video) is well outside the scope of this program, and should be done by the user either before or after processing by this tool.
Well, thank you very much! That is not a small donation, I've helped quite a few people with their backups over the years, but this is by far the biggest donation I have received. It is very generous of you and greatly appreciated. Thanks! |
As per usual, after typing a response, I go to bed and as my head hits the pillow I think of two or three things that might cause an issue mentioned here. The weird line breaking of group titles should be fixed now. (It is a Windows thing, in combination with me always opening files in binary mode, which I do specifically to work around this same Windows thing, but it was a bad combination in this case.) |
There is now also an initial |
First of all, I accidentally somehow hit "mark as completed" when I leaned over my keyboard and it doesn't appear that I can undo this action. That said, this activity does seem to be largely complete, so you can decide if it should stay closed or be re-opened. The situation with conversation titles seems to be largely improved in 20250215-1. I tested in both Firefox and Chrome on Windows, and there are no lines in conversation title names beginning with commas any more. However in narrow windows, the conversation titles do not scale down in width to match the box that encloses conversation bubbles. I think this would be a nice touch; titles could just wrap as required to accommodate the width available.
It would probably be a nice option to have though I will admit I haven't thought about this a whole lot yet.
I tend to agree that this is out of scope. A reasonable compromise might be an option to make them links to files that can be downloaded or played in an external player. Having this as an option would be a nice choice so a user could elect to preprocess if desired.
That's my impression too. I think sticking with this decision makes sense unless user complaints prove otherwise.
I ran my test without this option today initially, but still got file names with truncated contact names in the HTML files. This doesn't seem to be optimal behavior. I think it would be best to have the default be the old naming behavior. I did a second test with I need to do some bigger tests now. I have my archives stored in an XML file for each year, so I want to do some tests where I successively import multiple archive XML files into a backup and then export to HTML, paginating the results with |
This has all been very complicated (for me at least), usually long titles would not break, or break at weird spots, or very short ones would break even though more width was still available. I ended up trying to just set a sensible default fixed width. But, I messed around with the CSS again and it may be working now.
I think I can add an option to prevent given content-types from being treated as media. That way, they should show as other non-media attachments would, including a download button: Would that be an improvement?
For now, I've upped the length limit to 54, which should be higher than any conversation title in Signal can naturally be. So, apart from you, nobody should actually see any truncated names. I'm undecided on what's best here. Having the program just fail would be a bad user experience. On the other hand, maybe the limit of 54 is too high to prevent that, making this option the worst, still truncating, and still (possibly) failing. I will give it some thought, maybe I should just disable this truncating altogether and calculate the actual length of the full absolute path and issue a warning if it's too high (on Windows).
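The "calculate the full absolute path and warn" idea could look roughly like the following Python sketch. It is a hypothetical illustration, not the tool's code. The helper name `path_too_long` is made up, and the limit used is the classic Windows `MAX_PATH` of 260 characters, which includes the terminating NUL.

```python
import os

# Classic Windows limit: 260 characters including the terminating NUL.
WINDOWS_MAX_PATH = 260

def path_too_long(directory, filename, limit=WINDOWS_MAX_PATH):
    """Return True if the absolute path would exceed the given limit."""
    full = os.path.abspath(os.path.join(directory, filename))
    # +1 accounts for the terminating NUL that counts toward MAX_PATH.
    return len(full) + 1 > limit

short_ok = path_too_long(os.sep, "index.html")
long_bad = path_too_long(os.sep, "x" * 300 + ".html")
```

With a check like this, names would only need to be shortened (or a warning issued) when the output directory plus the conversation title actually exceeds the limit, instead of truncating every name at a fixed length.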
Sounds like a plan. I believe the tool may behave slightly differently when importing into an existing backup as opposed to into an internal dummy-backup. It may give better results actually merging the XML files beforehand. I'll leave this issue open, as I'm not confident no new issues will arise. Also, from experience, by now I think the other XML files could contain surprises that would need to be dealt with. |
It looks reasonable in 20250216-1.
Yes, this makes sense. It offers a reasonable level of usability when not pre/post-processing (to make these playable in-browser).
I think that would be the right approach. Windows long filename support supports really long filenames. The only issue with these longer names seems to be working with the long paths in Windows GUI elements, and for people that are bothered by that there is
Here's a naive question: how do I create an empty backup into which I can start my series of imports? I tried creating an empty directory to pass as the first argument, but that didn't go well:
I think this behavior actually makes sense--I just don't know how to properly accomplish what I am trying to do. I could always go down the XML merge route, but that is going to involve moving a LOT of bytes around. I'm also curious what it will do with regard to memory utilization in signalbackup-tools. BTW, backups are enshrouded with a tag like this:
Do any of these attributes matter (especially the message count) from the perspective of signalbackup-tools import? If I can just drop these, the XML merge would definitely be easier. |
Yes, I agree with both these things. I'll start work on them.
I think there actually is no way to do that currently. I thought you could actually just add an Creating an 'empty' backup on the command line is not possible, I'm not even sure what an 'empty' backup is. There is more to the backup file than just the SQL database, and even that database — while it contains no messages — will still have a schema (contain tables with certain columns etc) which depends on other data (like the database version). I think I have two options:
In fact, the second option will also prevent the problem of the tool handling importing into a dummy differently than when importing into an existing backup. So I think I'll go with that option.
Me too! But I think it'll be all right. The majority of the bytes in those XML files should be in the attachment data, the actual text-content won't add up to much I think. And the tool is actually pretty clever about the attachment data, not keeping it in memory the entire time, but instead just saving the offset and size and reading it back from the XML file directly whenever it needs it. At least that's the intention, I hope it's working like that. Do you see high memory usage when running this (in the order of the size of the XML file)?
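The "save the offset and size, read it back when needed" approach can be sketched in Python as follows. This is only an illustration of the technique, not the tool's C++ code. The class name `LazyAttachment`, the `<part data="...">` layout of the demo file, and the way the offset is obtained here are assumptions.

```python
import base64
import os
import tempfile

class LazyAttachment:
    """Remembers where the base64 data lives instead of holding it in memory."""

    def __init__(self, path, offset, size):
        self.path = path
        self.offset = offset  # byte offset of the base64 text in the file
        self.size = size      # length of the base64 text in bytes

    def read(self):
        # Only the requested slice of the file is read and decoded.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return base64.b64decode(f.read(self.size))

# Demo file: the offset is known by construction here; a real parser would
# record it while scanning the XML.
payload = base64.b64encode(b"fake video bytes")
prefix = b'<part data="'
with tempfile.NamedTemporaryFile(delete=False, suffix=".xml") as tmp:
    tmp.write(prefix + payload + b'" />')

attachment = LazyAttachment(tmp.name, len(prefix), len(payload))
decoded = attachment.read()
os.unlink(tmp.name)
```

Memory use then scales with the number of attachments (a few integers each) rather than with their total size, which is why a large XML file need not imply a large memory footprint.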
No, they are all ignored. I don't see any use for them for any program that wants to parse the file, apart maybe from showing a progress bar during parsing. |
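Since the root attributes are ignored, a merge of several XML backups can simply drop them. A minimal Python sketch of such a merge follows, using only the standard library. The function name `merge_backups` and the tiny example documents (tag and attribute names included) are hypothetical. This version also loads everything into memory, so it is not suited to very large files with embedded attachments.

```python
import xml.etree.ElementTree as ET

def merge_backups(xml_strings):
    """Combine the child elements of several backup documents under one root.

    The root tag is copied from the first document; root attributes
    (counts, dates, etc.) are intentionally dropped.
    """
    merged = None
    for text in xml_strings:
        root = ET.fromstring(text)
        if merged is None:
            merged = ET.Element(root.tag)
        merged.extend(list(root))
    return merged

# Hypothetical miniature backups, for demonstration only.
year_one = '<smses count="1"><sms body="first" /></smses>'
year_two = '<smses count="1"><sms body="second" /></smses>'
merged = merge_backups([year_one, year_two])
bodies = [child.get("body") for child in merged]
```

For multi-gigabyte backups a streaming approach (copying message elements from file to file without building a tree) would be the more realistic choice.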
I think it is highly advantageous to be able to import these XML files into a backup database because it will streamline incremental processing. Ideally, I'd like to be able to create a database containing imports of all previous years messages, then periodically have a scheduled task import the year-to-date backup into a copy of this database and then export the HTML. This will save significant processing by avoiding the need to effectively always re-import previous years' backups. |
I had a few goes at getting something working to enable this, but so far no luck. I need to think about how this can be done, but I'm a bit short on time this week. Not giving up yet though. Though I still think it isn't a very good idea. The tool behaves differently when importing into a user-provided backup, it has to, though an obvious solution is to also provide an option I'm also curious about the 'significant processing'. I did manage a speedup in the import-function, so maybe test it out and see if you can work with that. With my biggest testing XML (which is admittedly small), importing 21,000 messages has gone from ~6 seconds to ~0.6 seconds (no attachments in that XML though). I may have another idea that might speed up the process, but that needs investigating. Edit: I have one more possible optimization to go, but it may already be getting to the point where adding extra XML files to re-process is faster than reading an (also ever-growing) input backup and writing it out again after adding a single new XML. Other changes I managed the last few days:
|
First I'd like to express my appreciation for this tool. The freedom it provides is just such a breath of fresh air in contrast to the opacity and rigidity that have unfortunately become synonymous with so many aspects of Signal. I've really enjoyed watching this project grow over the past few years. After nearly 20 years in the software industry, I can honestly say that the quality you have achieved while building a significant body of functionality is impressive.
One of the features I have really been enjoying is the HTML export feature: the output is beautiful and remarkably functional; the embedded media handling is especially nice. After reading through issue [#227], I started wondering: could I somehow use signalbackup-tools to generate gorgeous HTML archives of my SMS/MMS messages from SMS Backup & Restore's XML files? The idea here would be to either use --exportplaintextbackuphtml to directly generate HTML, or even better, perform multiple --importplaintextbackup operations starting from an empty backup to effectively integrate multiple XML backups. In the latter case, the goal would just be to ultimately run --exporthtml on the resulting backup; it wouldn't have to be a perfectly valid Signal backup file since it would never actually be restored in Signal.
So, I tried a few experiments with --importplaintextbackup, --exportplaintextbackuphtml, and --listxmlcontacts. All of them produced the same result, and I also had similar experiences with different XML files. Here is an example:
H:\sjevtic\testing>signalbackup-tools-20250112-1_win.exe --exportplaintextbackuphtml sms.xml .\html
*** Starting log: 2025-01-18 11:37:05 ***
signalbackup-tools (signalbackup-tools-20250112-1_win.exe) source version 20250112.125109 (Win) (SQlite: 3.47.2, OpenSSL: OpenSSL 3.4.0 22 Oct 2024)
[Error]: During sqlite3_prepare_v2(): near "m": syntax error
-> Query: "INSERT INTO smses (date, type, read, body, contact_name, address) VALUES (1373919414774, 2, 1, 'I'm at the cinema', 'redacted-contact', 'redacted-number')"
Here's the corresponding line from the backup that caused the issue:
<sms protocol="0" address="redacted-number" date="1373919414774" type="2" subject="null" body="I'm at the cinema" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" date_sent="1373919414774" sub_id="-1" readable_date="Jul 15, 2013 4:16:54 PM" contact_name="redacted-contact" />
I don't have a lot of knowledge about XML, but the two highest rated responses to this post break down the escaping rules well, and it seems that a single quote (') enclosed in a double-quoted (") attribute is valid.
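The XML is indeed valid; the syntax error comes from the apostrophe ending the SQL string literal early in the generated query. As an illustration of the general remedy (bound parameters instead of pasting values into the SQL text), here is a minimal Python sketch using the standard `sqlite3` module. It is not the tool's code, which is C++. The table and column names are taken from the error message above, while the column types and the helper name `insert_sms` are assumptions.

```python
import sqlite3

def insert_sms(conn, date, msg_type, read, body, contact_name, address):
    # Placeholders (?) let SQLite bind the values, so quotes in the body
    # never become part of the SQL text.
    conn.execute(
        "INSERT INTO smses (date, type, read, body, contact_name, address) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (date, msg_type, read, body, contact_name, address),
    )

conn = sqlite3.connect(":memory:")
# Column types are assumed for this demo.
conn.execute(
    "CREATE TABLE smses (date INTEGER, type INTEGER, read INTEGER, "
    "body TEXT, contact_name TEXT, address TEXT)"
)
insert_sms(conn, 1373919414774, 2, 1, "I'm at the cinema",
           "redacted-contact", "redacted-number")
stored_body = conn.execute("SELECT body FROM smses").fetchone()[0]
```

The alternative, doubling every single quote before building the query string, also works in SQLite, but binding avoids having to escape at all.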
I have a few questions:
Thanks!