
Parse XML using streaming API instead of tree API. #134

Open · wants to merge 11 commits into base: main
Conversation

ephraimkunz

I saw #14, which looked fun, so I thought I'd take a crack at it. The goal was to reduce peak memory usage and execution time by using quick_xml for pull-based parsing, rather than roxmltree for tree-based parsing.

For now, I kept the old tree-based parser around in case we want to do further tests for performance / correctness. I don't think we'd want to keep both parsers long term, though, as it might be a maintenance burden, so I'm happy to remove the old parser in this PR or later if we decide to take the new one. The pull-based parser passes all tests, and the output XML it writes is identical to what the library wrote before any of these changes. It's possible it regresses other things that the tests don't catch, but I'd need some exotic XMLs to know for sure.

It looks like the worksheet xml files and the shared string xml files are the largest part of the xlsx and have the most to gain from stream parsing, so I only looked at those. Do we want to convert other uses of roxmltree to quick_xml? They don't show up in profiling so it would be just for consistency and not for performance reasons that we'd get rid of the remaining uses of roxmltree.

Here are the performance results. I tested the new and old parsers on all the files in xlsx/tests/calc_tests and verified no regressions in parsing speed were present on these small files. The majority of my testing was with a large file with the maximum number of rows Excel allows (1,048,576) and 39 columns (https://data.cityofnewyork.us/Social-Services/DEP-311-Service-Requests/vwtg-azuw/about_data). I tested reading in the Excel file (load_from_xlsx), calling evaluate on the model, then writing out the Excel file (save_to_xlsx).

Tree (original)
Peak RAM usage was 11.5 GB, took 51 seconds.

Stream (new)
Peak RAM usage was 3.3 GB, took 47 seconds.

Those are pretty good results: peak memory usage was reduced by almost 3.5x. That's a direct result of not pulling the whole XML file into memory at once. Time wasn't reduced by as much as I hoped, but I think that's because we still have to look at most of the XML tags in the file to parse it, and because we are mainly bottlenecked on unzipping the files.

Most of the remaining memory used in the stream parser is storing the nested HashMaps in sheetData. Reducing memory usage further here would involve changing the data layout or data structure of sheetData.
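Schematically, that layout looks like the following (illustrative types only; the real `Cell` carries far more than this stand-in):

```rust
use std::collections::HashMap;

// Stand-in for a parsed cell; the real type carries value, style, formula, etc.
#[derive(Debug, Clone, PartialEq)]
struct Cell {
    value: String,
}

// sheetData as a doubly-nested map: row index -> (column index -> cell).
// Every occupied row pays for its own HashMap allocation, which is where
// most of the stream parser's remaining memory goes.
type SheetData = HashMap<u32, HashMap<u32, Cell>>;

fn insert_cell(sheet: &mut SheetData, row: u32, col: u32, value: &str) {
    sheet.entry(row).or_default().insert(
        col,
        Cell { value: value.to_string() },
    );
}
```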

Most of the time is spent zipping and unzipping. Further improvements here would be around exploring unzipping the files completely as separate files on disk, then parsing them. I think this would speed up parsing and keep the memory footprint low, but would temporarily increase the disk footprint and may leave files behind if the parser crashes, which wouldn't be an expected side effect of the call to load_from_xlsx.

Anyway, happy to hear feedback here, it's my first time implementing a stream parser so I'm grateful for any suggestions.

@@ -90,8 +148,11 @@ pub(crate) fn get_worksheet_xml(
row_style_dict.insert(row.r, row.clone());
}

const ROW_STRING_COUNT_BEFORE_WRITE: usize = 10;
Author

Empirically determined batch size for calling write! on rows: we accumulate 10 rows' worth of data before writing it. With a smaller value I saw longer run times; with a larger one we used more memory for the same runtime.
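The batching this constant controls can be sketched as follows (a simplified sketch: the real rows are full `<row>` elements and the sink is the zip writer, but the flush pattern is the same):

```rust
use std::fmt::Write as _;

const ROW_STRING_COUNT_BEFORE_WRITE: usize = 10;

// Accumulate formatted rows in a string buffer and flush it to the output
// sink in batches, trading a small bounded buffer for fewer write calls.
fn write_rows(rows: &[&str], out: &mut Vec<u8>) {
    let mut buf = String::new();
    for (i, row) in rows.iter().enumerate() {
        let _ = write!(buf, "<row>{row}</row>");
        if (i + 1) % ROW_STRING_COUNT_BEFORE_WRITE == 0 {
            out.extend_from_slice(buf.as_bytes());
            buf.clear();
        }
    }
    out.extend_from_slice(buf.as_bytes()); // flush the final partial batch
}
```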

@nhatcher
Member

This is amazing @ephraimkunz ! Looking forward to testing this myself!

Did you do any comparison with (same author as quick-xml)
https://github.com/tafia/calamine/

Whatever magic Johann Tuffe is doing there I'm sure we can learn a lot.

Thanks!

@ephraimkunz
Author

I wasn't aware of that crate. Let me do some comparisons and report back here.

@nhatcher self-requested a review on November 14, 2024 at 17:45
@nhatcher
Member

Hi @ephraimkunz , had a quick glance at the PR. I don't think we should keep around the old parser. And definitely shouldn't keep roxmltree as a dependency. I realize that is a larger change, so we can go step by step.

@jaycarlton left a comment

Nice submission. I learned a lot going through it.

@ephraimkunz
Author

Sorry for the delay here - I just started a new job and it's taking a lot of my time recently.

Thanks for the helpful review. I'll take a look soon.
I am also still planning to do some benchmarking vs calamine, umya-spreadsheet and ooxml.
I may also look at the test coverage of this new parser and the possibilities around adding more tests or fuzzing it.

@ephraimkunz
Author

> Hi @ephraimkunz , had a quick glance at the PR. I don't think we should keep around the old parser. And definitely shouldn't keep roxmltree as a dependency. I realize that is a larger change, so we can go step by step.

Agreed. I removed the tree-based implementation in this last push, which had the unfortunate side effect of changing the diff by a lot: I combined everything back into one file, since we no longer have two different implementations.

Roxmltree is still used in a couple of places, but those should be easy to fix by following the model of shared_strings.rs, where I implemented a simple state machine to replace it.
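The state-machine idea can be sketched like this (the event enum below is a hand-rolled stand-in for illustration; the real code drives quick_xml events, and the state names are made up):

```rust
// Minimal stand-ins for XML parser events.
enum Ev<'a> {
    Start(&'a str),
    Text(&'a str),
    End(&'a str),
}

// Parser state: where we are inside the sharedStrings document.
#[derive(PartialEq)]
enum State {
    Outside,
    InSi,   // inside an <si> shared-string item
    InText, // inside a <t> text run within an <si>
}

// Collect shared strings without building a tree: a small state enum
// tracks position, and text is appended only while inside <si><t>.
fn collect_shared_strings(events: &[Ev]) -> Vec<String> {
    let mut state = State::Outside;
    let mut strings = Vec::new();
    let mut current = String::new();
    for ev in events {
        match ev {
            Ev::Start(name) => {
                if *name == "si" && state == State::Outside {
                    state = State::InSi;
                    current.clear();
                } else if *name == "t" && state == State::InSi {
                    state = State::InText;
                }
            }
            Ev::Text(t) => {
                if state == State::InText {
                    current.push_str(t);
                }
            }
            Ev::End(name) => {
                if *name == "t" && state == State::InText {
                    state = State::InSi;
                } else if *name == "si" && state == State::InSi {
                    strings.push(current.clone());
                    state = State::Outside;
                }
            }
        }
    }
    strings
}
```

Because only the current string accumulates in memory, this keeps the footprint constant regardless of how many shared strings the file contains.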

@ephraimkunz
Author

I also did a bunch of perf tests against other crates (what looked to be the top XLSX parsers on lib.rs) on my same large file with > 1 million rows and 39 columns. Here are the results:

OOXML was significantly worse than IronCalc. It doesn't do streaming, so it reads the whole file into memory. It was super slow as well; I gave up after it ran for 700 seconds.

Umya-spreadsheet wasn't great. It was the worst for RAM usage of any crate I tested, clocking in at a peak of 16 GB to read the file. It took 2 minutes, 30 seconds to complete.

Calamine was of course the crate to beat, and comparison to it was what kicked off this whole performance benchmark. It performed slightly worse in terms of time (17 seconds vs this PR's 14 seconds) and slightly better in terms of memory (2.7 GB peak vs this PR's 3.2 GB). However, that's not the full story. It's a lazy parser and doesn't cache the results of its lazy evaluation: you can grab cell data in one call, but if you try to grab all the cell data again, it re-parses everything. Furthermore, fetching formulas is a second call that also re-parses all the data. So if I wanted it to parse all the data in the xlsx file and return it in a single struct, the way this crate does, it would take around 30 seconds, although peak memory would still be slightly lower.

In conclusion, I think we're competitive with Calamine while supporting more xlsx features than that library does (for the most part; they support images, which we don't). Switching to Calamine would be a major change since it is lazy, and depending on our access patterns it might require multiple re-parses of the data. It would be interesting to explore how it maintains a lower memory footprint while parsing, which I think is mostly due to how the cell data is stored: a vector of (rowID, colID, data) tuples, rather than our doubly-nested hashmap of rowID to colID to Cell.
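The layout difference can be sketched like so (illustrative types, not either library's real ones):

```rust
use std::collections::HashMap;

// Nested-map layout (roughly what this crate uses): fast random access by
// (row, col), but each occupied row carries its own HashMap overhead.
type Nested = HashMap<u32, HashMap<u32, String>>;

// Flat triple layout (roughly the calamine approach): one contiguous Vec
// with minimal per-cell overhead, but random access needs a search.
type Flat = Vec<(u32, u32, String)>;

fn flatten(nested: &Nested) -> Flat {
    let mut flat: Flat = nested
        .iter()
        .flat_map(|(r, cols)| cols.iter().map(move |(c, v)| (*r, *c, v.clone())))
        .collect();
    flat.sort(); // row-major order for cache-friendly sequential scans
    flat
}
```

The flat layout saves memory by avoiding per-row map allocations and hash-table slack, at the cost of cheap point lookups.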

@ephraimkunz
Author

@jaycarlton, @nhatcher This is ready to be reviewed again.

@nhatcher
Member

Hi @ephraimkunz , this is great news!!!
Amazing job, really.

I will have a look as soon as possible.

In the meantime, why are we still using roxmltree in a couple of places? If it's a matter of time I can probably help out.

@ephraimkunz
Author

I had originally converted just the performance sensitive parts to use quick-xml. I should be able to look at converting the rest to quick-xml and removing roxmltree today.

@ephraimkunz
Author

@nhatcher I pushed changes to make a few more places use streaming parsers. It looks like there are still several places left to convert:

  • Comments
  • Sheet relationships
  • Workbook
  • Tables
  • Styles

Several of these have pretty complicated XML, meaning they will take longer. I wonder if we should consider reviewing and merging this PR as-is, then filing separate tasks for each of those, with the goal of removing the roxmltree dependency once they are all converted to streaming parsers.

I don't have a lot of time to work on this and I think there's a lot more work to remove the dependency completely in this PR.

@nhatcher
Member

> I don't have a lot of time to work on this and I think there's a lot more work to remove the dependency completely in this PR.

Hi @ephraimkunz , that's tough. The problem is that in situations like this it is not difficult to end up in a hybrid state where half of the work follows one framework (or library, or whatever) and the other half the other. Then a third party wanting to add something or fix a bug would find it difficult to know what to do. I think you have done more than half of the work already, and I am tempted to go ahead, bite the bullet, and not let this good work go to waste.

As for the more technical side: this is really amazing work and leaves us in an unbeatable situation. If we are in the same ballpark as Calamine, we are already winning. Remember that we actually parse formulas!

I want to give this a little bit more thought, but I will likely just merge this as-is by the end of the day and open tickets for the remaining tasks.

Once again, thank you for your work. This has an enormous impact. If someone a couple of months from now opens a beefy workbook and IronCalc handles it like a champ, it might be because of this PR. And they might choose the technology because of your work here.

@ephraimkunz
Author

No problem, it's been fun learning more. I should be able to help with some of those tasks as I get time.
