Any tips on how to speed up build of a _very_ large table? (~1500 rows) #1374

Closed
flynneva opened this issue Aug 11, 2023 · 13 comments

@flynneva

First off - thanks for the awesome project 🙏🏼

I use the mkdocs-material + python-markdown integration quite heavily with the docs I build and I am running into a corner case which I'm not quite sure how best to improve / optimize.

We have a super large markdown table (~1500 rows with 3 columns, with md links in each cell) that is auto-generated and when trying to build with mkdocs build, it hangs specifically on the python-markdown bit and takes ~30min just to build that one file 😅

The last debug prints I see before the ~30min wait for the next file are some "Successfully imported extension" and "Successfully loaded extension" messages from core.py within this package:

[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "markdown.extensions.toc.TocExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "markdown.extensions.tables.TableExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "markdown.extensions.fenced_code.FencedCodeExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "markdown.extensions.meta.MetaExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "markdown.extensions.admonition.AdmonitionExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "markdown.extensions.attr_list.AttrListExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "markdown.extensions.footnotes.FootnoteExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "markdown.extensions.md_in_html.MarkdownInHtmlExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.highlight".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.highlight.HighlightExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.snippets".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.snippets.SnippetExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.details".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.details.DetailsExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.tabbed".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.tabbed.TabbedExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.saneheaders".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.saneheaders.SaneHeadersExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.keys".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.keys.KeysExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.tasklist".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.tasklist.TasklistExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.arithmatex".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.arithmatex.ArithmatexExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.inlinehilite".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.highlight".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.highlight.HighlightExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.inlinehilite.InlineHiliteExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.superfences".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx._bypassnorm".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx._bypassnorm.BypassNormExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.highlight".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.highlight.HighlightExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.superfences.SuperFencesCodeExtension".
[ DEBUG    ][ MARKDOWN ][ core.py:163  ] Successfully imported extension module "pymdownx.emoji".
[ DEBUG    ][ MARKDOWN ][ core.py:126  ] Successfully loaded extension "pymdownx.emoji.EmojiExtension".

After this last print it hangs for ~30min before moving on to the next file.

Main questions

  1. I can't tell which extension is taking a long time... any tips on how to see more granular debug prints (or is the best way here to add my own to each extension?)
  2. Is there some trick to speed up build times here for super large tables?
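
One way to narrow down a slow stage without adding prints to every extension is to run the conversion under Python's built-in profiler. The sketch below is a minimal, hedged example: `profile_call` is a hypothetical helper name (not part of Python-Markdown) that wraps any callable in `cProfile` and returns a report sorted by cumulative time. A stand-in workload is used so the block runs with the standard library alone.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is a text table of the `top`
    entries sorted by cumulative time. Hypothetical helper for illustration.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Stand-in workload so the sketch runs without any third-party package.
result, report = profile_call(sorted, range(1000, 0, -1))
print(report)
```

With Python-Markdown installed, the same helper could be pointed at the real conversion, for example `profile_call(markdown.markdown, text, extensions=["tables"])`, where `text` is assumed to hold the contents of the slow file. The report should then show which module's functions dominate the run time.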
@flynneva
Author

flynneva commented Aug 11, 2023

With #1375 I'm able to see where it is getting stuck. It looks like it is stuck in markdown.treeprocessors:

[ DEBUG    ][ MARKDOWN ][ core.py:236  ] Converting markdown file to serialzed XHTML or HTML
[ DEBUG    ][ MARKDOWN ][ core.py:249  ] Split document into lines
[ DEBUG    ][ MARKDOWN ][ core.py:252  ] Running preprocessor prep for pymdownx._bypassnorm
[ DEBUG    ][ MARKDOWN ][ core.py:252  ] Running preprocessor prep for pymdownx.snippets
[ DEBUG    ][ MARKDOWN ][ core.py:252  ] Running preprocessor prep for pymdownx.superfences
[ DEBUG    ][ MARKDOWN ][ core.py:252  ] Running preprocessor prep for markdown.preprocessors
[ DEBUG    ][ MARKDOWN ][ core.py:252  ] Running preprocessor prep for pymdownx._bypassnorm
[ DEBUG    ][ MARKDOWN ][ core.py:252  ] Running preprocessor prep for markdown.extensions.meta
[ DEBUG    ][ MARKDOWN ][ core.py:252  ] Running preprocessor prep for pymdownx.superfences
[ DEBUG    ][ MARKDOWN ][ core.py:252  ] Running preprocessor prep for markdown.extensions.md_in_html
[ DEBUG    ][ MARKDOWN ][ core.py:255  ] Parse the high-level elements of the file
[ DEBUG    ][ MARKDOWN ][ core.py:258  ] Run the tree-processors
[ DEBUG    ][ MARKDOWN ][ core.py:260  ] Running treeprocessor for markdown.extensions.footnotes
[ DEBUG    ][ MARKDOWN ][ core.py:260  ] Running treeprocessor for pymdownx.highlight
[ DEBUG    ][ MARKDOWN ][ core.py:260  ] Running treeprocessor for pymdownx.tasklist
[ DEBUG    ][ MARKDOWN ][ core.py:260  ] Running treeprocessor for markdown.treeprocessors

Now to figure out which treeprocessor it is within markdown.treeprocessors.py 🙃

@flynneva
Author

Looks like it's the InlineProcessor tree processor:

[ DEBUG    ][ MARKDOWN ][ core.py:258  ] Run the tree-processors
[ DEBUG    ][ MARKDOWN ][ core.py:260  ] Running treeprocessor for markdown.extensions.footnotes:FootnoteTreeprocessor
[ DEBUG    ][ MARKDOWN ][ core.py:260  ] Running treeprocessor for pymdownx.highlight:HighlightTreeprocessor
[ DEBUG    ][ MARKDOWN ][ core.py:260  ] Running treeprocessor for pymdownx.tasklist:TasklistTreeprocessor
[ DEBUG    ][ MARKDOWN ][ core.py:260  ] Running treeprocessor for markdown.treeprocessors:InlineProcessor
[ DEBUG    ][ MARKDOWN ][ treeprocessors.py:345  ] Running inline treeprocessor

@flynneva
Author

flynneva commented Aug 11, 2023

Ah it looks like the InlineTreeprocessor skips AtomicStrings. I'll see if I can adjust our table generation logic to only use atomic strings then.

@waylan
Member

waylan commented Aug 11, 2023

Ah it looks like the InlineTreeprocessor skips AtomicStrings. I'll see if I can adjust our table generation logic to only use atomic strings then.

I doubt that will work. Atomic strings are simply instances of a custom Python class. They would not be retained in an external text file. They are intended to be used internally to tell the parser that a string is fully parsed and should be ignored in any future parsing.
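
To illustrate the point, the sketch below uses a hypothetical stand-in class, `AtomicMarker` (an assumption for illustration, not the real `markdown.util.AtomicString`), which like the real one is just a `str` subclass acting as a marker on an in-memory object. Writing the text to a file and reading it back keeps the characters but loses the marker type.

```python
import os
import tempfile


class AtomicMarker(str):
    """Stand-in for a marker string class; carries no extra data."""


def round_trip(text):
    """Write text to a temporary file and read it back."""
    fd, path = tempfile.mkstemp(suffix=".md")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    finally:
        os.remove(path)


cell = AtomicMarker("[text](link.html)")
restored = round_trip(cell)

# The characters survive, but the marker type does not.
print(type(cell).__name__, type(restored).__name__)
```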

@flynneva
Author

@waylan ok, thanks! Any suggestions on how to fix this issue then? Or speed up the InlineTreeprocessor?

@waylan
Member

waylan commented Aug 11, 2023

The InlineTreeProcessor is simply a wrapper around all inline processors (see inlinepatterns.py). It could be anything in there.

Although, I have to wonder if perhaps there is an issue with the syntax of your tables. If you have the table extension enabled, then the large table should have converted all of the table syntax already by the time we get to the InlineTreeProcessor and we would simply be parsing the contents of each cell at this point. Table cells usually tend to have pretty simple content, so this is not usually an issue. Unless you have some unusual cell content...

Or it could be that stepping through all of the many thousands of cells is what is slowing things down. And come to think of it, the InlineTreeProcessor does some extra shenanigans to support some sophisticated nesting (as explained here; in fact, it could be that the entire discussion in #798 is related to this - I don't have a way to know without some sample input). Running that for each cell on thousands of cells could add up. Although, I can't imagine ~30 minutes for that.

@flynneva
Author

Unless you have some unusual cell content...

@waylan inside each cell is either one link (e.g. [text](link.html)) or a list of links in one line (e.g. [text](link.html)*;* [other text](link2.html)*;*). The auto-generator tool we use inserts those funky *;* in between each link, could that trip up the inline processors somehow? I might be able to adjust it to just be a normal ;.

Not sure if it matters, but the VS Code Markdown preview renders the table just fine and very quickly.

@waylan
Member

waylan commented Aug 11, 2023

I don't see any obvious reason for the slowdown from what you have provided. Again, take a look at #798. Even if the specific issue there is not relevant, our general approach to performance issues and priorities is discussed there in detail.

It could be that you have hit an edge case in some regex which we could tweak, or it could be that a fix would require completely rewriting how the inline parser works. Python-Markdown is very old, and back when its structure was first designed very few extensions existed in any implementation. Therefore, it doesn't always work well for some newer syntax. We haven't rewritten it, as that would require us to abandon the rich ecosystem of existing third-party extensions.

@squidfunk

We have a super large markdown table (~1500 rows with 3 columns, with md links in each cell) that is auto-generated and when trying to build with mkdocs build, it hangs specifically on the python-markdown bit and takes ~30min just to build that one file 😅

Correct me if I'm wrong, and I definitely might be, as I don't have enough information on what you're trying to achieve, but if you're generating the table, why not just generate HTML and omit Markdown processing completely?

@flynneva
Author

why not just generate HTML and omit Markdown processing completely?

@squidfunk this was going to be my "plan b" if we can't figure it out here 😅

@squidfunk

@flynneva if you're dealing with a lot of data, it might be a scalable approach 😉 However, I understand that you might want to keep Markdown parsing inside of table cells, e.g. text formatting, icons, etc., which would require some extra work.

@flynneva
Author

flynneva commented Aug 11, 2023

@waylan just figured it out 🙃 it isn't an issue at all with this repo or the table itself... It's the generator we have.

The issue was that just before the md table, there is an HTML link in the file (e.g. <a></a>), without an empty line between it and the table, causing the parser here to not consider it a table...and instead treat it as one huge paragraph I think 🙃

Removing that link, or adding an empty line between it and the table, fixes my issue.
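
A generator-side fix along these lines could be sketched as follows. This is only an illustration under assumptions: `ensure_blank_line_before_tables` is a hypothetical helper, and it relies on a simple heuristic (a line starting with `|` that is followed by a separator row) to spot where a pipe table begins, so tables written without leading pipes would not be detected.

```python
import re

# Separator row such as "| --- | :---: |" (a leading pipe is assumed).
SEPARATOR = re.compile(r"^\s*\|(\s*:?-+:?\s*\|)+\s*$")


def ensure_blank_line_before_tables(text):
    """Insert an empty line before any table header that directly
    follows a non-empty line."""
    lines = text.split("\n")
    fixed = []
    for index, line in enumerate(lines):
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        starts_table = line.lstrip().startswith("|") and SEPARATOR.match(next_line)
        if starts_table and fixed and fixed[-1].strip():
            fixed.append("")  # separate the table from the preceding block
        fixed.append(line)
    return "\n".join(fixed)


source = "<a></a>\n| Name | Link |\n| --- | --- |\n| a | [a](a.html) |"
fixed_source = ensure_blank_line_before_tables(source)
print(fixed_source)
```

Running the helper a second time on its own output leaves the text unchanged, since the table header is then already preceded by an empty line.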

Closing this as it is resolved.

Big thanks for helping me find the root cause 🙏🏼 @waylan @squidfunk

@waylan
Member

waylan commented Aug 11, 2023

why not just generate HTML and omit Markdown processing completely?

That may or may not help; it depends on what the issue is. The Markdown parser will still parse all of the raw HTML (as HTML) only to find the end of the block of text it should ignore. However, it uses the HTML parser in the Python standard library which is generally fast enough but does have a few weird edge cases of its own. That said, all inline processing is avoided, so the current slowdown would be avoided.
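
If a generator were changed to emit HTML instead, a minimal sketch might look like the following. The data model (each row is a list of cells, each cell a list of `(text, href)` pairs) and the function name `render_table` are assumptions made for illustration; the real generator's data is not shown in this thread.

```python
from html import escape


def render_table(headers, rows):
    """Render headers and rows as an HTML table string.

    Each row is a list of cells; each cell is a list of (text, href) pairs.
    """
    parts = ["<table>", "<thead>", "<tr>"]
    parts += [f"<th>{escape(header)}</th>" for header in headers]
    parts += ["</tr>", "</thead>", "<tbody>"]
    for row in rows:
        parts.append("<tr>")
        for cell in row:
            # Join several links in one cell with a plain "; " separator.
            links = "; ".join(
                f'<a href="{escape(href, quote=True)}">{escape(text)}</a>'
                for text, href in cell
            )
            parts.append(f"<td>{links}</td>")
        parts.append("</tr>")
    parts += ["</tbody>", "</table>"]
    return "\n".join(parts)


html_table = render_table(
    ["Name", "Related"],
    [[[("text", "link.html")], [("other text", "link2.html"), ("A & B", "ab.html")]]],
)
print(html_table)
```

The output contains no empty lines, and when it is embedded in a Markdown file it should itself be separated from the surrounding content by empty lines so that it is treated as a single raw HTML block.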
