
Parse XML using streaming API instead of tree API. #134

Open · wants to merge 11 commits into base: main
Conversation

ephraimkunz

I saw #14, which looked fun, so I thought I'd take a crack at it. The goal was to reduce peak memory usage and execution time by using quick_xml for pull-based parsing, rather than roxmltree for tree-based parsing.

For now, I kept the old tree-based parser around in case we want to do further tests for performance / correctness. I don't think we'd want to keep both parsers long term, though, as it might be a maintenance burden, so I'm happy to remove the old parser in this PR or later if we decide to take the new one. The pull-based parser passes all tests, and the output XML it writes is identical to what the library wrote before any of these changes. It's possible it regresses other things that the tests don't catch, but I'd need some exotic XMLs to know for sure.

It looks like the worksheet xml files and the shared string xml files are the largest part of the xlsx and have the most to gain from stream parsing, so I only looked at those. Do we want to convert other uses of roxmltree to quick_xml? They don't show up in profiling so it would be just for consistency and not for performance reasons that we'd get rid of the remaining uses of roxmltree.

Here are the performance results. I tested the new and old parsers on all the files in xlsx/tests/calc_tests and verified no regressions in parsing speed were present on these small files. The majority of my testing was with a large file with the maximum number of rows Excel allows (1,048,576) and 39 columns (https://data.cityofnewyork.us/Social-Services/DEP-311-Service-Requests/vwtg-azuw/about_data). I tested reading in the Excel file (load_from_xlsx), calling evaluate on the model, then writing out the Excel file (save_to_xlsx).

Tree (original)
Peak RAM usage was 11.5 GB, took 51 seconds.

Stream (new)
Peak RAM usage was 3.3 GB, took 47 seconds.

Those are pretty good results: peak memory usage was reduced by almost 3.5x. That's a direct result of not pulling the whole XML file into memory at once. Time wasn't reduced by as much as I hoped, but I think that's because we still have to look at most of the XML tags in the file to parse it, and because we are mainly bottlenecked on unzipping the files.

Most of the remaining memory used in the stream parser is storing the nested HashMaps in sheetData. Reducing memory usage further here would involve changing the data layout or data structure of sheetData.
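Schematically, that layout looks like the following (illustrative types only; the real `Cell` carries far more than this stand-in):

```rust
use std::collections::HashMap;

// Stand-in for a parsed cell; the real type carries value, style, formula, etc.
#[derive(Debug, Clone, PartialEq)]
struct Cell {
    value: String,
}

// sheetData as a doubly-nested map: row index -> (column index -> cell).
// Every occupied row pays for its own HashMap allocation, which is where
// most of the stream parser's remaining memory goes.
type SheetData = HashMap<u32, HashMap<u32, Cell>>;

fn insert_cell(sheet: &mut SheetData, row: u32, col: u32, value: &str) {
    sheet.entry(row).or_default().insert(
        col,
        Cell { value: value.to_string() },
    );
}
```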

Most of the time is spent zipping and unzipping. Further improvements here would be around exploring unzipping the files completely as separate files on disk, then parsing them. I think this would speed up parsing and keep the memory footprint low, but would temporarily increase the disk footprint and may leave files behind if the parser crashes, which wouldn't be an expected side effect of the call to load_from_xlsx.

Anyway, happy to hear feedback here, it's my first time implementing a stream parser so I'm grateful for any suggestions.

@@ -90,8 +148,11 @@ pub(crate) fn get_worksheet_xml(
row_style_dict.insert(row.r, row.clone());
}

const ROW_STRING_COUNT_BEFORE_WRITE: usize = 10;
Author

Empirically determined batch size for calling write! on rows: we accumulate 10 rows' worth of data before writing it. With a smaller value I saw longer run times; with a larger one we used more memory for the same runtime.
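The batching this constant controls can be sketched as follows (a simplified sketch: the real rows are full `<row>` elements and the sink is the zip writer, but the flush pattern is the same):

```rust
use std::fmt::Write as _;

const ROW_STRING_COUNT_BEFORE_WRITE: usize = 10;

// Accumulate formatted rows in a string buffer and flush it to the output
// sink in batches, trading a small bounded buffer for fewer write calls.
fn write_rows(rows: &[&str], out: &mut Vec<u8>) {
    let mut buf = String::new();
    for (i, row) in rows.iter().enumerate() {
        let _ = write!(buf, "<row>{row}</row>");
        if (i + 1) % ROW_STRING_COUNT_BEFORE_WRITE == 0 {
            out.extend_from_slice(buf.as_bytes());
            buf.clear();
        }
    }
    out.extend_from_slice(buf.as_bytes()); // flush the final partial batch
}
```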

@nhatcher
Member

This is amazing @ephraimkunz ! Looking forward to testing this myself!

Did you do any comparison with (same author as quick-xml)
https://github.com/tafia/calamine/

Whatever magic Johann Tuffe is doing there I'm sure we can learn a lot.

Thanks!

@ephraimkunz
Author

I wasn't aware of that crate. Let me do some comparisons and report back here.

@nhatcher self-requested a review on November 14, 2024 at 17:45
@nhatcher
Member

Hi @ephraimkunz , had a quick glance at the PR. I don't think we should keep around the old parser. And definitely shouldn't keep roxmltree as a dependency. I realize that is a larger change, so we can go step by step.

@jaycarlton left a comment

Nice submission. I learned a lot going through it.

@ephraimkunz
Author

Sorry for the delay here - I just started a new job and it's taking a lot of my time recently.

Thanks for the helpful review. I'll take a look soon.
I am also still planning to do some benchmarking vs calamine, umya-spreadsheet and ooxml.
I may also look at the test coverage of this new parser and the possibilities around adding more tests or fuzzing it.

@ephraimkunz
Author

> Hi @ephraimkunz , had a quick glance at the PR. I don't think we should keep around the old parser. And definitely shouldn't keep roxmltree as a dependency. I realize that is a larger change, so we can go step by step.

Agreed. I removed the tree-based implementation in this last push, which had the unfortunate side effect of changing the diff by a lot: I combined everything back into one file, since we no longer have two different implementations.

Roxmltree is still used in a couple of places, but those should be easy to fix by following the model of shared_strings.rs, where I implemented a simple state machine to replace it.
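The state-machine idea can be sketched like this (the event enum below is a hand-rolled stand-in for illustration; the real code drives quick_xml events, and the state names are made up):

```rust
// Minimal stand-ins for XML parser events.
enum Ev<'a> {
    Start(&'a str),
    Text(&'a str),
    End(&'a str),
}

// Parser state: where we are inside the sharedStrings document.
#[derive(PartialEq)]
enum State {
    Outside,
    InSi,   // inside an <si> shared-string item
    InText, // inside a <t> text run within an <si>
}

// Collect shared strings without building a tree: a small state enum
// tracks position, and text is appended only while inside <si><t>.
fn collect_shared_strings(events: &[Ev]) -> Vec<String> {
    let mut state = State::Outside;
    let mut strings = Vec::new();
    let mut current = String::new();
    for ev in events {
        match ev {
            Ev::Start(name) => {
                if *name == "si" && state == State::Outside {
                    state = State::InSi;
                    current.clear();
                } else if *name == "t" && state == State::InSi {
                    state = State::InText;
                }
            }
            Ev::Text(t) => {
                if state == State::InText {
                    current.push_str(t);
                }
            }
            Ev::End(name) => {
                if *name == "t" && state == State::InText {
                    state = State::InSi;
                } else if *name == "si" && state == State::InSi {
                    strings.push(current.clone());
                    state = State::Outside;
                }
            }
        }
    }
    strings
}
```

Because only the current string accumulates in memory, this keeps the footprint constant regardless of how many shared strings the file contains.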

@ephraimkunz
Author

I also did a bunch of perf tests against other crates (what looked to be the top XLSX parsers on lib.rs) on my same large file with > 1 million rows and 39 columns. Here are the results:

OOXML was significantly worse than IronCalc. It doesn't do streaming, so it reads the whole file into memory. It was super slow as well; I gave up after it ran for 700 seconds.

Umya-spreadsheet wasn't great. It was the worst for RAM usage of any crate I tested, clocking in at a peak of 16 GB to read the file. It took 2 minutes, 30 seconds to complete.

Calamine was of course the crate to beat, and comparison to it was what kicked off this whole performance benchmark. It performed slightly worse in terms of time (17 seconds vs this PR's 14 seconds) and slightly better in terms of memory (2.7 GB peak vs this PR's 3.2 GB). However, that's not the full story. It's a lazy parser and doesn't cache the results of its lazy evaluation: you can grab cell data in one call, but if you try to grab all the cell data again, it re-parses everything. Furthermore, fetching formulas is a second call that also re-parses all the data. So if I wanted it to parse all the data in the xlsx file and return it in a single struct, the way this crate does, it would take around 30 seconds, although peak memory would still be slightly lower.

In conclusion, I think we're competitive with Calamine while supporting more xlsx features than that library does (for the most part; they support images, which we don't). Switching to Calamine would be a major change since it is lazy, and depending on our access patterns it might require multiple re-parses of the data. It would be interesting to explore how it maintains a lower memory footprint while parsing, which I think is mostly due to how the cell data is stored: a vector of (rowID, colID, data) tuples, rather than our doubly-nested hashmap of rowID to colID to Cell.
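The layout difference can be sketched like so (illustrative types, not either library's real ones):

```rust
use std::collections::HashMap;

// Nested-map layout (roughly what this crate uses): fast random access by
// (row, col), but each occupied row carries its own HashMap overhead.
type Nested = HashMap<u32, HashMap<u32, String>>;

// Flat triple layout (roughly the calamine approach): one contiguous Vec
// with minimal per-cell overhead, but random access needs a search.
type Flat = Vec<(u32, u32, String)>;

fn flatten(nested: &Nested) -> Flat {
    let mut flat: Flat = nested
        .iter()
        .flat_map(|(r, cols)| cols.iter().map(move |(c, v)| (*r, *c, v.clone())))
        .collect();
    flat.sort(); // row-major order for cache-friendly sequential scans
    flat
}
```

The flat layout saves memory by avoiding per-row map allocations and hash-table slack, at the cost of cheap point lookups.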

@ephraimkunz
Author

@jaycarlton, @nhatcher This is ready to be reviewed again.

@nhatcher
Member

Hi @ephraimkunz , this is great news!!!
Amazing job, really.

I will have a look as soon as possible.

In the meantime, why are we still using roxmltree in a couple of places? If it's a matter of time I can probably help out.

@ephraimkunz
Author

I had originally converted just the performance sensitive parts to use quick-xml. I should be able to look at converting the rest to quick-xml and removing roxmltree today.

@ephraimkunz
Author

@nhatcher I pushed changes to make a few more places use streaming parsers. It looks like there are still several places left to convert:

  • Comments
  • Sheet relationships
  • Workbook
  • Tables
  • Styles

Several of these have pretty complicated XML, meaning they will take longer. I wonder if we should consider reviewing and merging this PR as-is, then filing separate tasks for each of those, with the goal of removing the roxmltree dependency once they are all converted to streaming parsers.

I don't have a lot of time to work on this and I think there's a lot more work to remove the dependency completely in this PR.

@nhatcher
Member

> I don't have a lot of time to work on this and I think there's a lot more work to remove the dependency completely in this PR.

Hi @ephraimkunz , that's tough. The problem is that in situations like this it is not difficult to end up in a hybrid state where half of the work follows one framework (or library, or whatever) and the other half the other. Then a third party wanting to add something or fix a bug would find it difficult to know what to do. I think you have done more than half of the work already, and I am tempted to go ahead, bite the bullet, and not let this good work go to waste.

As for the more technical side: this is really amazing work and leaves us in an unbeatable situation. If we are in the same ballpark as Calamine, we are already winning. Remember that we actually parse formulas!

I want to give this a little bit more thought, but I will likely just merge this as-is by the end of the day and open tickets for the remaining tasks.

Once again, thank you for your work. This has an enormous impact. If someone a couple of months from now opens a beefy workbook and IronCalc handles it like a champ, it might be because of this PR. And they might choose the technology because of your work here.

@ephraimkunz
Author

No problem, it's been fun learning more. I should be able to help with some of those tasks as I get time.
