Does Substring have performance issues? #321

BjornRuud · 2023-11-05T13:51:10Z

Hi! I have written a small HTML lexer to convert any string to tokens representing whatever text and HTML elements it contains. The intended use case is to parse strings that use HTML for formatting, but it can also tokenize full HTML documents. I have previously written a handcrafted lexer using an API similar to Scanner, but which uses a custom collection wrapper instead of Scanner to gain some performance.

The code for both lexers is in the swift-parsing branch of this repo:
https://github.com/BjornRuud/HTMLLexer/tree/swift-parsing

When benchmarking the two lexers, the swift-parsing one is much slower than the handcrafted one. Tokenizing the HTML specification web page, a document that is 85 KB in size, takes 77 ms for the handcrafted lexer and 14.8 seconds for the swift-parsing lexer on my Mac. This is my first attempt at using swift-parsing, so I'm sure there is much that can be optimized in my code, but I profiled the code and found something surprising.

The code spends most of its runtime in Substring.distance(from:to:), which is being called by Collection.count. Is that normal behaviour for Substring? Does it calculate count on each Substring allocation? Can this be avoided?
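For context on the question above, here is a minimal sketch of why count can be expensive on a Substring. As far as I know, count is not stored or computed when a slice is created; it is computed on demand each time it is called, by walking the Character (grapheme cluster) boundaries. The sample input is made up for illustration and is not the benchmark document.

```swift
// Hypothetical sample input; not the HTML specification page from the benchmark.
let html = String(repeating: "<p>h\u{E9}llo</p>", count: 1_000)

// Slicing is cheap: a Substring shares storage with the original String.
let slice: Substring = html[...]

// Character count is computed on demand by walking grapheme clusters.
// This is the kind of call that surfaces as Substring.distance(from:to:) in a profile.
let characterCount = slice.count

// Counting UTF-8 code units is typically much cheaper for native Swift strings.
let byteCount = slice.utf8.count

// Checking for remaining input does not need count at all;
// isEmpty only compares startIndex and endIndex.
let hasInput = !slice.isEmpty
```

If count is being called once per parse step on a large remaining input, the total work grows roughly quadratically with document size, which would be consistent with the slowdown described.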

jessegrosjean · 2023-11-05T16:24:45Z

I think this might help to understand what is going on:

https://pointfreeco.github.io/swift-parsing/main/documentation/parsing/stringabstractions

6 replies

jessegrosjean Nov 5, 2023

I'm actually less sure now if the above link is what you need...

To start... I've only really done one project in Swift-Parsing, not an expert.

First thing is that swift-parsing does allow different string abstractions that have different performance characteristics. I saw your initial post, and that's what I thought of, so I sent the above link.

Taking a quick and superficial look at your code, I wonder where the hot spot is. If it's in the actual swift-parsing code, then my above link might help. On the other hand, this function looks a little suspicious to me. That while loop processing over input... I think that's generally what you want swift-parsing internals to be doing for you. It seems like you could replace that method with a new parser?

BjornRuud Nov 5, 2023
Author

Indeed, but I was unable to write a parser that did exactly what I wanted. I want parsing to never stop, so that it reads text until a potential tag is found, and if it is then a text token is emitted with the text found up to that point and then a tag token. If it wasn't a tag, the parsing continues from the current index as if only text was found. Any suggestion on how to solve this by making a parser is very welcome.
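The behaviour described above can be sketched with the standard library alone. This is not the HTMLLexer code and not a swift-parsing parser: the Token type, parseTag, and tokenize are hypothetical names, and the tag matcher is deliberately naive (no attributes, comments, or self-closing tags). It only illustrates the "never stop, fall back to text" loop.

```swift
// Hypothetical token type for illustration; not HTMLLexer's actual API.
enum Token: Equatable {
    case text(String)
    case tag(name: String, isClosing: Bool)
}

// Hypothetical, naive tag matcher: "<", optional "/", ASCII letters, ">".
// Returns nil when the input does not start with something tag-shaped.
func parseTag(_ input: Substring) -> (name: String, isClosing: Bool, rest: Substring)? {
    guard input.first == "<" else { return nil }
    var rest = input.dropFirst()
    let isClosing = rest.first == "/"
    if isClosing { rest = rest.dropFirst() }
    let name = rest.prefix(while: { $0.isASCII && $0.isLetter })
    guard !name.isEmpty else { return nil }
    rest = rest[name.endIndex...]
    guard rest.first == ">" else { return nil }
    return (String(name), isClosing, rest.dropFirst())
}

func tokenize(_ html: String) -> [Token] {
    var tokens: [Token] = []
    var input = html[...]
    var textStart = input.startIndex  // start of the pending text run

    while let open = input.firstIndex(of: "<") {
        if let tag = parseTag(input[open...]) {
            // Emit the text collected so far, then the tag.
            if textStart < open {
                tokens.append(.text(String(html[textStart..<open])))
            }
            tokens.append(.tag(name: tag.name, isClosing: tag.isClosing))
            input = tag.rest
            textStart = tag.rest.startIndex
        } else {
            // Not a tag: the "<" stays part of the pending text run.
            input = input[input.index(after: open)...]
        }
    }
    // Whatever is left after the last tag is text.
    if textStart < html.endIndex {
        tokens.append(.text(String(html[textStart...])))
    }
    return tokens
}

let tokens = tokenize("a < b <b>bold</b>!")
```

The key point is that a failed tag match does not emit anything; it only moves the search position forward, so the stray "<" ends up inside the next text token.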

BjornRuud Nov 5, 2023
Author

It is essentially the same loop in both the handcrafted version and the swift-parsing one, but the performance implications can of course be different since the parser code is different.

jessegrosjean Nov 5, 2023

I'm delaying mowing the lawn, but I need to do that now. I did just run things in profiler and it seems the core performance issue at the moment (99% of time) is:

try HTMLTokenParser.upToNextPotentialTag.parse(&input)

You have that in a Skip, but I think the internal PrefixUpTo is doing more work than you want. As a first hack you could replace that with something more like:

if let nextPossible = input.firstIndex(of: "<") {
    input = input[nextPossible...]
} else {
    textEndIndex = html.endIndex
    break
}
//try HTMLTokenParser.upToNextPotentialTag.parse(&input)
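A self-contained variant of the workaround above, in case it helps to try it in isolation. The helper name skipToNextPotentialTag is made up for this sketch. It searches the UTF-8 view instead of comparing Characters, which should be safe here because "<" is a single ASCII byte that never occurs inside a multi-byte UTF-8 sequence. Whether this is faster in practice is an assumption that needs to be measured.

```swift
// Hypothetical helper: advances `input` to the next "<" and returns the skipped text.
// If there is no "<", all remaining input is consumed and returned.
func skipToNextPotentialTag(_ input: inout Substring) -> Substring {
    let stop = input.utf8.firstIndex(of: UInt8(ascii: "<")) ?? input.endIndex
    let skipped = input[..<stop]
    input = input[stop...]
    return skipped
}

let source = "h\u{E9}llo <b>world"
var remaining = source[...]
let skippedText = skipToNextPotentialTag(&remaining)

let plain = "no tags here"
var plainRemaining = plain[...]
let plainSkipped = skipToNextPotentialTag(&plainRemaining)
```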

Sorry I can't help more at the moment, and maybe someone more experienced will point you in an even more correct direction. Good luck!

BjornRuud Nov 5, 2023
Author

Thanks a lot! I totally missed that from the trace. With that change it's down to 0.37 seconds. Still slower than the handcrafted parser but now it's in the ballpark where I expected it to be and I'm in a much better position to optimize.
