parsed mds by for chatbot parser script for account.md, connecting.md, compiling_your_software.md (FOR REVIEW ONLY, NO MERGE!) #664
Draft
EwDa291
wants to merge
148
commits into
hpcugent:main
from
EwDa291:chatbot_parser
Commits
1ebc363
initial commit
EwDa291 34df842
Merge branch 'hpcugent:main' into chatbot_parser
EwDa291 10edb20
some cleanup
EwDa291 85a93ec
used jinja to replace macros
EwDa291 dfff5fa
adapt if-mangler to accommodate for nested if-clauses
EwDa291 649ddec
adapt the parser to take all files as input, not all files get parsed…
EwDa291 2116d6e
adapt the parser to take all files as input, not all files get parsed…
EwDa291 159aa62
small update, not important
EwDa291 75765e5
change to the templates
EwDa291 57d9cfe
change to accommodate for more nested if-clauses
EwDa291 75d345b
Delete scripts/HPC chatbot preprocessor/start_checker.py
EwDa291 ff7a9fc
make sure files with duplicate names between normal files and linux-t…
EwDa291 47a33b7
Merge branch 'chatbot_parser' of https://github.com/EwDa291/vsc_user_…
EwDa291 7d279d6
fixed the problem of some files being written in reST instead of mark…
EwDa291 8047572
some small fixes
EwDa291 7d1c5ed
remove try-except-structure
EwDa291 984b0cd
collapse all code into one file
EwDa291 8f5eeaa
Rename file
EwDa291 2b97b7a
cleanup repository
EwDa291 b595301
Rename directory
EwDa291 90c8ab7
add a main function
EwDa291 b8ae706
make file paths non os-specific
EwDa291 b751497
use docstrings to document the functions
EwDa291 0f8eb5d
rewrite the if-mangler to make it more readable
EwDa291 9938e92
got rid of most global variables
EwDa291 508b22c
fixed some issues with if statements
EwDa291 a25ce2d
fixed some issues with if statements
EwDa291 80d0535
got rid of all global variables
EwDa291 9163a75
small changes to make file more readable
EwDa291 1dcffc1
codeblocks, tips, warnings and info reformatted
EwDa291 4d7fbdb
small optimisations
EwDa291 671f7f3
small optimisations
EwDa291 e5c39bd
initial commit
EwDa291 c6492fc
added requirements
EwDa291 aff8198
added requirements and usage info
EwDa291 a981002
minor changes to the print statements
EwDa291 1f3b343
reworked function to take care of html structures
EwDa291 b6388d3
Merge branch 'hpcugent:main' into chatbot_parser
EwDa291 48cad97
filter out images
EwDa291 df58f23
get rid of backquotes, asterisks, pluses and underscores used for for…
EwDa291 c423e07
dump to json files instead of txt files
EwDa291 2c333fe
cleaned up parser with macros
EwDa291 ce52352
cleaned up parser with macros
EwDa291 5db34af
cleaned up parser with macros
EwDa291 4226d28
Update README.md
EwDa291 d730a26
Update README.md
EwDa291 f3182e3
added section about restrictions on input files
EwDa291 aee54de
Merge branch 'hpcugent:main' into chatbot_parser
EwDa291 675bec5
adapted section about restrictions on input files
EwDa291 f1e58ef
adapted section about restrictions on input files
EwDa291 2bf1075
Merge branch 'chatbot_parser' of https://github.com/EwDa291/vsc_user_…
EwDa291 a168509
change variables to be lowercase
EwDa291 09b86c9
take out some copy pasting
EwDa291 f95b99e
added warning about long filepaths
EwDa291 06bb7b9
fixing typos
EwDa291 2f3e5b3
take out copy pasting
EwDa291 0c4dbe8
first draft version of the restructured script to accommodate for the…
EwDa291 38c4572
added support to filter out collapsable admonitions
EwDa291 5cbd653
attempt at fix for problems with jinja include, not working yet
EwDa291 0e6f8b2
fixed an issue with jinja templates
EwDa291 cd77837
added docstrings to new functions
EwDa291 98eb695
only add necessary if-statements in front of non-if-complete sections
EwDa291 27457e3
fixed some more jinja problems
EwDa291 bb72287
implemented extra test to make sure generic files dont accidentally g…
EwDa291 67cb19e
make sure empty os-specific files are not saved
EwDa291 cf9834a
clean up unused code
EwDa291 da32459
introduce more macros
EwDa291 093200b
reintroduce logic to remove unnecessary directories
EwDa291 5d0ffe9
added functionality to include links or leave them out
EwDa291 a3e34a9
added functionality to include links or leave them out
EwDa291 7c6154b
adapt filenames to allow for splitting on something other than subtitles
EwDa291 8d5b50d
making some changes to prepare to add paragraph level splitting tomorrow
EwDa291 0c10376
making some changes to prepare to add paragraph level splitting tomorrow
EwDa291 f8ee860
making some changes to prepare to add paragraph level splitting tomorrow
EwDa291 6533733
adapted the parsing script to allow for testing in a semi-efficient way
EwDa291 2e7a00f
added test for make_valid_title
EwDa291 f5e0579
removed useless lines from testscript
EwDa291 6757b4f
First attempt at splitting in paragraphs (need for other fixes for ti…
EwDa291 6d9558d
make two functions for different ways of dividing the text
EwDa291 2c7025a
added docstrings to new functions
EwDa291 ae99bb9
update test for valid titles
EwDa291 084b421
fixed problem with splitting os-specific text (metadata not fixed yet)
EwDa291 cf7f5f0
fix for metadata of os-specific sections
EwDa291 b7c10d3
clean up temporary version
EwDa291 4a441f3
added command line options for custom macros
EwDa291 662134f
small fix to macros
EwDa291 05eab4a
clean up test for valid title
EwDa291 b85a8fb
add a test for write_metadata
EwDa291 39a3c99
added functionality to split on paragraphs
EwDa291 af9e6cc
clean up
EwDa291 f4163a7
clean up
EwDa291 833f964
further clean up and added shebang
EwDa291 79b1a56
clean up
EwDa291 cec154c
added test for if mangler
EwDa291 2f4a277
clean up
EwDa291 cd0c8eb
clean up customizable options
EwDa291 3be262a
further adapt the script to be able to test it
EwDa291 1d32aab
make changes to usage in command line to be more intuitive
EwDa291 5902c96
first revised version of the README
EwDa291 6f97d5f
Merge branch 'hpcugent:main' into chatbot_parser
EwDa291 6e48800
added docstring to main function
EwDa291 0bc440b
include chatbot_prepprocessor
EwDa291 e6e6023
added options for source and destination directories
EwDa291 a6d99d9
cleanup
EwDa291 2be834f
cleanup
EwDa291 532543a
cleanup
EwDa291 107464e
relocate test files
EwDa291 dd64381
update arguments of if mangler
EwDa291 ef3fd58
relocate full test files
EwDa291 4d7db8f
Revert "update arguments of if mangler"
EwDa291 df9bac5
Revert "relocate full test files"
EwDa291 631d9e9
update test to adapt to new arguments in if mangler
EwDa291 c6e600d
relocated full test files
EwDa291 d1c6194
Rename test_paragraph_split_1.md to test_paragraph_split_1_input.md
EwDa291 695ffd6
Rename test_title_split_1.md to test_title_split_1_input.md
EwDa291 af4832b
smal fix
EwDa291 8805c8c
test text for paragraph split
EwDa291 a265ffd
start of a fix for double title problem, not done yet
EwDa291 6c2a61c
Fix for double title bug when splitting on paragraph
EwDa291 ed08879
Fix bug for empty linklist in metadata
EwDa291 176af13
fix bug where too many directories were sometimes created
EwDa291 d4ceac8
test of full script, test files not ready to be pushed yet
EwDa291 815a863
updated requirements.txt
EwDa291 d15469f
updated docstring in main function
EwDa291 daa6b36
add support for comments for the bot to be included in the source files
EwDa291 4c19f44
changed the default for min paragraph length
EwDa291 9a6ff58
added test files for full script test
EwDa291 56543f0
small fix for double title bug
EwDa291 52a3861
added examples of output of the script when splitting on paragraphs w…
EwDa291 692e77b
fix for issue with html links
EwDa291 7f493a1
fix for issue with html links
EwDa291 0e34396
fix for issue with relative links to the same document
EwDa291 fa00044
added test for replace_markdown_markers
EwDa291 b3952b2
fix to small inconsistency in metadata
EwDa291 73072bf
added test for insert_links
EwDa291 3161309
make sure paragraphs only include full lists
EwDa291 7d4d7f9
Merge branch 'hpcugent:main' into chatbot_parser
EwDa291 3407be3
adapted to the new source files
EwDa291 6d04bbc
add source-directory to metadata and verbose mode
EwDa291 f33cfb3
added verbose mode
EwDa291 1c389d7
Merge branch 'hpcugent:main' into chatbot_parser
EwDa291 3227f19
Added limitation on lists
EwDa291 67aed53
fix for non os-specific if-statement not being recognised
EwDa291 9e297b1
new test for links
EwDa291 b6b8610
new test to make sure lists are kept as one section
EwDa291 57a2139
updated test_file for list test
EwDa291 170a10c
dropped <> around links and started new function to calculate length …
EwDa291 f279701
strip out code & co so only parsed MDs for selected pages remain
boegel
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,196 @@ | ||
| # Chatbot parser | ||
|
|
||
| `chatbot_parser.py` is a script that transforms the markdown source files into a structured directory that serves as input for a chatbot. | ||
|
|
||
| ## Usage | ||
|
|
||
| The script can be run in a shell environment with the following command: | ||
|
|
||
| ```shell | ||
| python chatbot_parser.py | ||
| ``` | ||
|
|
||
| The command accepts the following options: | ||
|
|
||
| ```shell | ||
| chatbot_parser.py [-h] -src SOURCE -dst DESTINATION [-st] [-pl MIN_PARAGRAPH_LENGTH] [-td MAX_TITLE_DEPTH] [-l] [-dd] | ||
| ``` | ||
|
|
||
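As an illustration of the usage line above, the options could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual code: the function name `build_parser`, the double-dash long option spellings (such as `--source`) and the help texts are assumptions. Only the short flags and the defaults (683 and 4) come from this README.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented usage line (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="chatbot_parser.py")
    # Required options: source and destination directories.
    parser.add_argument("-src", "--source", required=True,
                        help="source directory of the input files")
    parser.add_argument("-dst", "--destination", required=True,
                        help="directory where the output is written")
    # Optional switches and settings, with the defaults documented in the README.
    parser.add_argument("-st", "--split_on_titles", action="store_true",
                        help="split on titles instead of on paragraphs")
    parser.add_argument("-pl", "--min_paragraph_length", type=int, default=683,
                        help="minimum paragraph length in characters")
    parser.add_argument("-td", "--max_title_depth", type=int, default=4,
                        help="maximum title depth used as section border")
    parser.add_argument("-l", "--links", action="store_true",
                        help="retain links in the plaintext")
    parser.add_argument("-dd", "--deep_directories", action="store_true",
                        help="generate one subdirectory per title")
    return parser


if __name__ == "__main__":
    example = ["-src", "docs", "-dst", "out", "-st", "-td", "3"]
    print(build_parser().parse_args(example))
```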
| ### Options | ||
|
|
||
| #### `h`/`help` | ||
|
|
||
| Display the help message. | ||
|
|
||
| #### `src`/`source` | ||
|
|
||
| This is a required option that specifies the source directory of the input files for the script. This location is also used to look for jinja templates when using jinja to parse the source files (such as the `macros` directory within `vsc_user_docs/mkdocs/docs/HPC`). | ||
|
|
||
| #### `dst`/`destination` | ||
|
|
||
| This is a required option that specifies where the output of the script should be written. The script also generates extra intermediate subdirectories, so subdirectories with the following names shouldn't be present in the destination directory: `parsed_mds`, `copies` and `if_mangled_files`. If any of these names poses a problem, the names of the intermediate subdirectories can be changed in the macros at the top of the script. | ||
|
|
||
| #### `st`/`split_on_titles` | ||
|
|
||
| Including this option will split the source files based on the titles and subtitles in the markdown text. Not including this option will split the text on paragraphs with a certain minimum length. | ||
|
|
||
| #### `pl`/`min_paragraph_length` | ||
|
|
||
| This option allows the user to configure the minimum length of a paragraph. Some deviations from this minimum length are possible (for example at the end of a file). The default value for this minimum paragraph length is 683 characters. This option only works if `split_on_titles` is not enabled. | ||
|
|
||
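The idea of a minimum paragraph length can be sketched as follows. This is an illustration under assumptions, not the script's implementation: the function name `split_on_paragraphs` is hypothetical, paragraphs are assumed to be separated by blank lines, and the special handling of titles and lists is left out.

```python
def split_on_paragraphs(text, min_paragraph_length=683):
    """Merge blank-line separated paragraphs until each chunk reaches the minimum length."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Append the paragraph to the chunk that is being built.
        current = f"{current}\n\n{paragraph}" if current else paragraph
        if len(current) >= min_paragraph_length:
            chunks.append(current)
            current = ""
    # The last chunk may be shorter than the minimum (end of file).
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "aaaa\n\nbbbb\n\ncccccccccccc\n\ndd"
    for chunk in split_on_paragraphs(sample, min_paragraph_length=10):
        print(repr(chunk))
```

The trailing chunk is kept even when it is too short, which matches the deviation at the end of a file mentioned above.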
| #### `td`/`max_title_depth` | ||
|
|
||
| This option allows the user to configure the maximum "title depth" (the number of `#` characters in front of a title) to be used as a border between sections if `split_on_titles` is enabled. The default value is 4. | ||
|
|
||
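A minimal sketch of splitting on titles up to a maximum depth is shown below. The function name `split_on_titles` and the returned tuple layout are assumptions made for this example. Titles deeper than `max_title_depth` stay part of the body of the enclosing section, and lines inside fenced code blocks are not treated as titles.

```python
import re


def split_on_titles(text, max_title_depth=4):
    """Return (title, depth, body) tuples, splitting only on titles up to max_title_depth."""
    title_re = re.compile(r"^(#{1,%d})\s+(.+?)\s*$" % max_title_depth)
    sections = []
    title, depth, body = None, 0, []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' inside code blocks is not a title
        match = None if in_fence else title_re.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title is not None or any(l.strip() for l in body):
                sections.append((title, depth, "\n".join(body).strip()))
            title, depth, body = match.group(2), len(match.group(1)), []
        else:
            body.append(line)
    if title is not None or any(l.strip() for l in body):
        sections.append((title, depth, "\n".join(body).strip()))
    return sections


if __name__ == "__main__":
    sample = "# A\nintro\n\n## B\ntext b\n### C\ntext c\n"
    for section in split_on_titles(sample, max_title_depth=2):
        print(section)
```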
| #### `l`/`links` | ||
|
|
||
| Some of the source files might contain links. Including this option will retain the links in the plaintext. If this option is not included, the links will be dropped from the plaintext. | ||
|
|
||
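The effect of this option can be illustrated with a small sketch. How the script actually renders a retained link is not documented here, so the `text (url)` form below is an assumption, as is the function name `render_links`. Images are left untouched by this sketch.

```python
import re

# Markdown link [text](target); the lookbehind skips images such as ![alt](src).
MARKDOWN_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def render_links(text, keep_links):
    """Replace markdown links by their text, optionally followed by the target."""
    if keep_links:
        return MARKDOWN_LINK.sub(r"\1 (\2)", text)
    return MARKDOWN_LINK.sub(r"\1", text)


if __name__ == "__main__":
    sample = "See [the docs](https://example.org/docs) now."
    print(render_links(sample, keep_links=True))
    print(render_links(sample, keep_links=False))
```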
| #### `dd`/`deep_directories` | ||
|
|
||
| Including this option will make the script generate a "deep directory" where every title encountered is made into a subdirectory of its parent title (for example, a title with three `#`s becomes a subdirectory of the most recent title with two `#`s). This option only works if `split_on_titles` is enabled. | ||
|
|
||
| ## Generated file structure | ||
|
|
||
| The generated directory structure is written as a subdirectory of `parsed_mds`. In `parsed_mds`, two subdirectories can be found: | ||
|
|
||
| - `generic` contains the parts of the markdown sources that were non-OS-specific | ||
| - `os_specific` contains the parts of the markdown sources that were OS-specific | ||
|
|
||
| Within `os_specific` a further distinction is made for each of the three possible operating systems included in the documentation. | ||
|
|
||
| Both the generic and each of the three os-specific directories then contain a directory for each source file. | ||
|
|
||
| If the option `deep_directories` is not enabled, all paragraphs of the source file and their corresponding metadata will be saved in this directory. The (processed) plaintext of the paragraph is written to a `.txt` file and the metadata is written to a `.json` file. | ||
|
|
||
| If the option `deep_directories` is enabled, the directory of each source file will contain a subdirectory structure corresponding to the structure of the subtitles at different levels in the source file. Each subtitle in the source file corresponds to a directory nested in the directory of its parent title (for example, a title with three `#`s becomes a subdirectory of the most recent title with two `#`s). | ||
|
|
||
| Finally, each of these subtitle-specific subdirectories contains a `.txt` file with the (processed) plaintext of that section and a `.json` file with the metadata of that section. | ||
|
|
||
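The layout described above can be sketched as a path computation. This is only an illustration: the function name `section_directory` and its parameters are hypothetical, the names of the OS subdirectories are passed in rather than assumed, and titles are used as directory names without any sanitising.

```python
from pathlib import PurePosixPath


def section_directory(destination, source_name, titles,
                      os_name=None, deep_directories=False):
    """Return the directory in which the .txt and .json files of a section would be stored."""
    root = PurePosixPath(destination) / "parsed_mds"
    # Generic content and OS-specific content live in separate subtrees.
    root = root / "os_specific" / os_name if os_name else root / "generic"
    directory = root / source_name
    if deep_directories:
        # One nested directory per title, from the top-level title down.
        for title in titles:
            directory = directory / title
    return directory


if __name__ == "__main__":
    print(section_directory("out", "account", ["Account", "Getting access"],
                            deep_directories=True))
    print(section_directory("out", "account", ["Account"], os_name="linux"))
```

Note how quickly the path grows with `deep_directories` enabled, which is the reason for the warning about long filepaths further down.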
| ## Requirements | ||
|
|
||
| - The required Python packages are listed in `requirements.txt` | ||
|
|
||
| ## Restrictions on source-files | ||
|
|
||
| Due to the nature of the script, the markdown files it takes as input must satisfy some restrictions. | ||
|
|
||
| ### Nested if structures | ||
|
|
||
| The script uses the if-structures in the source files to split the documentation into generic documentation and os-specific documentation. To do so, it needs to keep track of which type of if-structure (os-related or non-os-related) it is currently reading, and certain nested if-structures break this bookkeeping. The supported nested if-structures are determined by the macros `NON_OS_IF`, `NON_OS_IF_IN_OS_IF`, `OS_IF` and `OS_IF_IN_OS_IF`: respectively a non-os-related if-structure, a non-os-related if-structure nested in an os-related one, an os-related if-structure, and an os-related if-structure nested in another os-related one. Each of these may be nested in any number of non-os-related if-structures, but no non-os-related if-structure should be nested inside them. Nesting any of the allowed structures in additional os-related if-structures is not allowed either. | ||
|
|
||
| #### Examples of valid and invalid if-structures | ||
|
|
||
| ##### Allowed | ||
|
|
||
| ###### non-os-related in os-related | ||
|
|
||
| This is an example of one of the basic allowed if-structures (`NON_OS_IF_IN_OS_IF`). | ||
|
|
||
| ``` | ||
| if OS == windows: | ||
| if site == Gent: | ||
| ... | ||
| endif | ||
| endif | ||
| ``` | ||
|
|
||
| ###### os-related in os-related in non-os-related | ||
|
|
||
| This is an example of the basic allowed if-structure `OS_IF_IN_OS_IF` nested in a non-os-specific if. | ||
|
|
||
| ``` | ||
| if site == Gent: | ||
| if OS == windows: | ||
| ... | ||
| else: | ||
| if OS == Linux: | ||
| ... | ||
| endif | ||
| endif | ||
| endif | ||
| ``` | ||
|
|
||
| ##### Not allowed | ||
|
|
||
| ###### non-os-related in os-related in os-related | ||
|
|
||
| This is an example of a non-os-related if-structure nested in one of the basic allowed if-structures (`OS_IF_IN_OS_IF`). | ||
|
|
||
| ``` | ||
| if OS != windows: | ||
| if OS == Linux: | ||
| if site == Gent: | ||
| ... | ||
| endif | ||
| endif | ||
| endif | ||
| ``` | ||
|
|
||
| This will result in the parser "forgetting" that it opened an os-specific if-statement with `OS != windows` and not closing it properly. | ||
|
|
||
| ###### os-related in non-os-related in os-related | ||
|
|
||
| This is an example of the basic allowed if-structure `OS_IF` (indirectly) nested in an os-specific if-structure. | ||
|
|
||
| ``` | ||
| if OS != windows: | ||
| if site == Gent: | ||
| if OS == Linux: | ||
| ... | ||
| endif | ||
| endif | ||
| endif | ||
| ``` | ||
|
|
||
| This will also result in the parser "forgetting" that it opened an os-specific if-statement with `OS != windows` and not closing it properly. | ||
|
|
||
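One way to read the nesting rules and the four examples above is as a check on the stack of open if-structures, listed from outermost to innermost. The sketch below encodes that reading. It is an interpretation of this README, not code taken from the script, and the function name `is_supported_nesting` and the labels `"os"` and `"non_os"` are made up for the example.

```python
def is_supported_nesting(stack):
    """Check a stack of if-structure kinds, ordered from outermost to innermost.

    Each entry is "os" (os-related) or "non_os" (non-os-related).
    """
    if not stack:
        return False
    # Any number of enclosing non-os-related ifs is allowed: skip them.
    index = 0
    while index < len(stack) and stack[index] == "non_os":
        index += 1
    remainder = tuple(stack[index:])
    # What is left must be empty (NON_OS_IF), OS_IF, OS_IF_IN_OS_IF
    # or NON_OS_IF_IN_OS_IF.
    return remainder in {(), ("os",), ("os", "os"), ("os", "non_os")}


if __name__ == "__main__":
    print(is_supported_nesting(["os", "non_os"]))        # first allowed example
    print(is_supported_nesting(["non_os", "os", "os"]))  # second allowed example
    print(is_supported_nesting(["os", "os", "non_os"]))  # first invalid example
    print(is_supported_nesting(["os", "non_os", "os"]))  # second invalid example
```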
| ### Non OS-related if-statements | ||
|
|
||
| Due to the way jinja parses the source files, the script slightly alters non-os-specific if-statements as well. It expects if-statements of the following form: | ||
|
|
||
| ``` | ||
| {%- if site == gent %} | ||
| {% if site != (gent or brussel) %} | ||
| ``` | ||
|
|
||
| All spaces and the dash are optional. City names don't need to be fully lowercase since the parser will capitalize them properly anyway. | ||
|
|
||
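The expected form of these if-statements could be recognised with a regular expression along the following lines. This is a hedged sketch: the pattern, the function name `parse_site_if` and the return value are assumptions for illustration, and the sketch still requires whitespace between `if` and `site`.

```python
import re

# Optional dash, optional spaces, == or !=, and one or more names joined by "or".
SITE_IF = re.compile(
    r"\{%-?\s*if\s+site\s*(==|!=)\s*\(?\s*"
    r"([A-Za-z]+(?:\s+or\s+[A-Za-z]+)*)\s*\)?\s*%\}"
)


def parse_site_if(statement):
    """Return (operator, [capitalized site names]) or None if the form is not recognised."""
    match = SITE_IF.match(statement.strip())
    if match is None:
        return None
    sites = [site.capitalize() for site in re.split(r"\s+or\s+", match.group(2))]
    return match.group(1), sites


if __name__ == "__main__":
    print(parse_site_if("{%- if site == gent %}"))
    print(parse_site_if("{% if site != (gent or brussel) %}"))
```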
| ### html syntax | ||
|
|
||
| The input shouldn't contain any html syntax. While some failsafes are in place, the script isn't made with the use case of handling html syntax in mind. | ||
|
|
||
| ### Comments | ||
|
|
||
| Any comments within the markdown files (for example TODOs) should use the following syntax: | ||
|
|
||
| ``` | ||
| <!--your comment--> | ||
| ``` | ||
| and should be limited to one line. | ||
|
|
||
| Comments can be written in such a way that the script keeps them as input for the bot. To do that, the marker `INPUT_FOR_BOT` should be put in front of the content of the comment as follows: | ||
|
|
||
| ``` | ||
| <!--INPUT_FOR_BOT: your comment for the bot--> | ||
| ``` | ||
|
|
||
| This will be reworked to | ||
|
|
||
| ``` | ||
| your comment for the bot | ||
| ``` | ||
|
|
||
| in the final output. | ||
|
|
||
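The comment handling described above can be sketched with a single-line regular expression. The function name `process_comments` is hypothetical, and the sketch assumes that ordinary comments are simply removed from the text.

```python
import re

# '.' does not match a newline, so only single-line comments are recognised.
COMMENT = re.compile(r"<!--(.*?)-->")
BOT_MARKER = "INPUT_FOR_BOT:"


def process_comments(line):
    """Drop ordinary comments and unwrap comments marked as input for the bot."""
    def replace(match):
        content = match.group(1).strip()
        if content.startswith(BOT_MARKER):
            return content[len(BOT_MARKER):].strip()
        return ""  # assumption: ordinary comments are removed
    return COMMENT.sub(replace, line)


if __name__ == "__main__":
    print(process_comments("<!--INPUT_FOR_BOT: your comment for the bot-->"))
    print(repr(process_comments("<!--TODO: rewrite this section-->")))
```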
| ### Long filepaths | ||
|
|
||
| Due to the nature of this script, it can generate large directories with very long names if `deep_directories` is enabled. Depending on the operating system, this can cause problems with filepaths being too long, resulting in files that cannot be opened. A possible fix for this is to make sure the filepath to where the script is located is not too long. Another solution is lowering the `max_title_depth` or disabling `deep_directories`. | ||
|
|
||
| ### Markdown lists | ||
|
|
||
| The parser is designed to detect lists and not split them across multiple paragraphs. It can detect lists with the markers `-`, `+` and `*`, as well as lists indexed with numbers or letters (one letter per list entry). List entries may span multiple lines if the continuation lines are indented by at least two spaces. List entries consisting of multiple paragraphs are handled in the same way, as long as the indentation is kept. | ||
|
|
||
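A rough sketch of such list detection is given below. It is not the parser's actual logic: the function name `list_end` is hypothetical, and the delimiter after a number or letter (`.` or `)`) is an assumption, since this README does not specify it. A line such as `A. Smith wrote ...` would be mistaken for a list entry by this simple pattern.

```python
import re

# A list entry: -, + or *, or a number / single letter followed by '.' or ')'.
LIST_ITEM = re.compile(r"^\s*(?:[-+*]|\d+[.)]|[A-Za-z][.)])\s+\S")
# A continuation line: indented by at least two spaces.
CONTINUATION = re.compile(r"^ {2,}\S")


def list_end(lines, start):
    """Return the index one past the last line of the list starting at lines[start]."""
    end = start + 1
    index = start + 1
    while index < len(lines):
        line = lines[index]
        if LIST_ITEM.match(line) or CONTINUATION.match(line):
            end = index + 1  # still inside the list
        elif line.strip():
            break  # unindented regular text ends the list
        # Blank lines are skipped: the list may continue after them.
        index += 1
    return end


if __name__ == "__main__":
    sample = ["- first item", "  continued", "", "  second paragraph",
              "- second item", "", "Normal text"]
    print(sample[:list_end(sample, 0)])
```

A paragraph splitter could use the returned index to avoid placing a border between paragraphs in the middle of a list.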
| ### Links | ||
|
|
||
| Links are part of the metadata generated by the parser. For the links to be built correctly, links to external sites should always start with either `https://` or `http://`. | ||
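The distinction this restriction relies on can be expressed as a simple prefix check. The function name `is_external_link` is hypothetical, and how the script resolves non-external links is not covered by this sketch.

```python
def is_external_link(target):
    """Return True if the link target starts with one of the expected URL schemes."""
    return target.startswith(("https://", "http://"))


if __name__ == "__main__":
    for target in ("https://example.org/docs", "../account.md", "#usage"):
        kind = "external" if is_external_link(target) else "internal"
        print(f"{target}: {kind}")
```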
any chance you can also make a script (or an option) to dump the whole structure in a json list of dicts with eg following metadata:
perhaps also the title as separate metadata instead of part of text