My "CFG" representation of Wikitext
-------------------------------------
Well, not exactly context-free.
NOTE: I'm rather a newbie at this. If there's something
inappropriate or wrong with the following representation,
please let me know so that I can fix or formalize it.
NOTE: The following representation is subject to change.
CXuesong, 2016
Notation
----------
NAME Non-terminals
A/B If the derivation attempt on A fails, then try B.
"text" Text or regular expressions (depends on the context), where
. is a character except \n .
\s is a whitespace character including \n .
{n} matches the previous element exactly n times.
{n,} matches the previous element at least n times.
+ matches the previous element at least once.
+? matches the previous element at least once, as few times as possible.
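The A/B notation above behaves like an ordered choice: B is only tried when A fails.
A minimal Python sketch of that idea follows. The names Parser, regex and choice are
hypothetical helpers made up for illustration, and quoted "text" elements are assumed
to be regular expressions.

```python
import re
from typing import Callable, Optional, Tuple

# A parser takes (text, position) and returns (matched_text, new_position),
# or None when the derivation attempt fails.
Parser = Callable[[str, int], Optional[Tuple[str, int]]]


def regex(pattern: str) -> Parser:
    """Build a parser for a quoted "text" element, treated as a regex."""
    compiled = re.compile(pattern)

    def parse(text: str, pos: int) -> Optional[Tuple[str, int]]:
        m = compiled.match(text, pos)
        return (m.group(0), m.end()) if m else None

    return parse


def choice(*alternatives: Parser) -> Parser:
    """A/B: try each alternative in order; the first success wins."""

    def parse(text: str, pos: int) -> Optional[Tuple[str, int]]:
        for alternative in alternatives:
            result = alternative(text, pos)
            if result is not None:
                return result
        return None

    return parse


# "={2}" / "\w+" : the second alternative is only tried if the first fails.
heading_or_word = choice(regex(r"={2}"), regex(r"\w+"))
print(heading_or_word("==x", 0))   # ('==', 2)
print(heading_or_word("abc", 0))   # ('abc', 3)
print(regex(r".+?")("abc", 0))     # ('a', 1), +? matches as few times as possible
```

Python's re module happens to agree with the notation here: . does not match \n by
default, and \s does.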
Terminator Specifications
Terminators are used to terminate PLAIN_TEXT, WIKITEXT, and RUN.
By default, a derivation's terminators are inherited by its child derivations,
unless a child overrides them, in which case the child doesn't inherit terminators
from its parents.
EOF is always a terminator.
>| "text" Text or a regular expression used as a terminator. This also overrides the parent terminators.
>| There's no terminator for PLAIN_TEXT, so it can stretch to the end of the document.
+| "text" The same as >| , except that it also inherits the terminators from the parent derivation.
-| "text" Inherits the terminators from the parent derivation, but prevents "text" from terminating PLAIN_TEXT.
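The three terminator operators can be modelled as operations on a set of patterns.
This is only a sketch under assumptions: the function names (override, extend, exclude,
read_plain_text) are hypothetical, terminators are compared by their pattern string,
and the "suspected plain-text end" rule of PLAIN_TEXT is ignored.

```python
import re
from typing import FrozenSet


def override(parent: FrozenSet[str], *patterns: str) -> FrozenSet[str]:
    """>| : drop the parent's terminators and use only the given ones."""
    return frozenset(patterns)


def extend(parent: FrozenSet[str], *patterns: str) -> FrozenSet[str]:
    """+| : keep the parent's terminators and add the given ones."""
    return parent | frozenset(patterns)


def exclude(parent: FrozenSet[str], *patterns: str) -> FrozenSet[str]:
    """-| : keep the parent's terminators except the given ones."""
    return parent - frozenset(patterns)


def read_plain_text(text: str, pos: int, terminators: FrozenSet[str]) -> str:
    """Consume characters until a terminator matches or EOF is reached."""
    compiled = [re.compile(t) for t in terminators]
    end = pos
    while end < len(text) and not any(c.match(text, end) for c in compiled):
        end += 1
    return text[pos:end]


# LINE: +| "\n" ; WIKI_LINK: >| "\n" ; WIKI_LINK_TARGET: +| "]]" / "|"
line = extend(frozenset(), r"\n")
wiki_link = override(line, r"\n")
link_target = extend(wiki_link, r"\]\]", r"\|")

print(read_plain_text("Main Page|home]]", 0, link_target))        # Main Page
print(read_plain_text("a|b]]", 0, exclude(link_target, r"\|")))   # a|b
```

EOF is always a terminator in this sketch because the loop stops at the end of the text
regardless of the set passed in.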
# Basic
IDENTIFIER -> "[^>\s]+" # Tag identifier
PLAIN_TEXT -> ".+" # Repeats until a terminator or a suspected plain-text end is met
# Hierarchical & lexical syntax
WIKITEXT -> LINE "\n" WIKITEXT / LINE / empty # An empty WIKITEXT contains no children
LINE -> LIST_ITEM / HEADING / PARAGRAPH
+| "\n" # The last line has no extra trailing \n
# Known Issue: Cannot handle code like
# ; Term : Definition
LIST_ITEM -> "[*#:;]+|-{4,}| " RUN? # Flatten the list hierarchy
# Note that "---- abcd" is valid
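# A small Python sketch of how the LIST_ITEM prefix pattern above behaves.
# The regex is copied from the rule; the helper name split_list_item is hypothetical,
# and the remainder of the line stands in for the optional RUN.

```python
import re
from typing import Optional, Tuple

# Prefix of a LIST_ITEM: list markers, a horizontal rule, or a leading space.
LIST_ITEM_PREFIX = re.compile(r"[*#:;]+|-{4,}| ")


def split_list_item(line: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, rest) if the line starts a LIST_ITEM, else None."""
    m = LIST_ITEM_PREFIX.match(line)
    if m is None:
        return None
    return m.group(0), line[m.end():]


print(split_list_item("*# item"))    # ('*#', ' item')
print(split_list_item("---- abcd"))  # ('----', ' abcd')
print(split_list_item("--- abcd"))   # None, since three dashes are not enough
```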
HEADING -> "={6}" RUN "={6}" EXPANDABLE_TEXT
>| "={6}$"
/ "={5}" RUN "={5}" EXPANDABLE_TEXT
>| "={5}$"
/ ...
/ "={1}" RUN "={1}" EXPANDABLE_TEXT
>| "={1}$"
HORIZONTAL_RULER -> "-{4,}" RUN?
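# The HEADING rule above tries level 6 first and falls back to lower levels.
# A rough Python sketch of that ordering follows. parse_heading is a hypothetical name,
# the RUN is kept as raw text, and the trailing EXPANDABLE_TEXT is simplified to
# optional whitespace (an assumption, not part of the grammar).

```python
import re
from typing import Optional, Tuple


def parse_heading(line: str) -> Optional[Tuple[int, str]]:
    """Return (level, title) for a heading line, trying level 6 down to 1."""
    for level in range(6, 0, -1):
        # "={n}" RUN "={n}", where the closing marks must end the line.
        pattern = r"={%d}(.+?)={%d}\s*" % (level, level)
        m = re.fullmatch(pattern, line)
        if m:
            return level, m.group(1)
    return None


print(parse_heading("=== Title ==="))  # (3, ' Title ')
print(parse_heading("==== a =="))      # (2, '== a ')
print(parse_heading("plain text"))     # None
```

# With unbalanced marks, the first level that fits both ends wins, and the surplus "="
# characters stay inside the title.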
# A normal/closed PARAGRAPH ends with \n, while a compact/unclosed PARAGRAPH doesn't have an extra \n
# Note that there's another \n between each LINE
# A compact paragraph can exist only when the next token in the WIKITEXT derivation is LIST_ITEM / HEADING / a terminator
PARAGRAPH -> RUN "\n" PARAGRAPH #
/ RUN # (i.e. RUN \n RUN) Only when PARAGRAPH is compact
/ empty # (i.e. RUN \n RUN \n)
RUN -> INLINE+
INLINE -> EXPANDABLE / TAG / IMAGE_LINK / WIKI_LINK / EXTERNAL_LINK / FORMAT / PLAIN_TEXT
EXPANDABLE -> COMMENT / TEMPLATE / ARGUMENT_REF # COMMENT expands to ""
EXPANDABLE_TEXT -> EXPANDABLE EXPANDABLE_TEXT
/ PLAIN_TEXT EXPANDABLE_TEXT
/ EXPANDABLE
/ PLAIN_TEXT
/ empty
EXPANDABLE_URL -> EXPANDABLE EXPANDABLE_URL
/ EXPANDABLE
/ URL
/ empty
# Syntax for inserting an image or thumbnail
# c.f. https://www.mediawiki.org/wiki/Help:Images#Rendering_a_single_image
IMAGE_LINK -> "[[" IMAGE_LINK_TARGET IMAGE_LINK_ARGUMENT* "]]"
>| "]]|\|"
IMAGE_LINK_TARGET -> "[\s_]*(File|Image|<...>)[\s_]*:" EXPANDABLE_TEXT # <...> is customizable File namespace aliases.
# IMAGE_LINK_ARGUMENT can have \n inside.
# IMAGE_LINK_CAPTION is actually the last IMAGE_LINK_ARGUMENT
IMAGE_LINK_ARGUMENT -> "|" WIKITEXT "=" WIKITEXT
# Known issue: Current implementation will parse [[http://abc]] as WIKI_LINK,
# while actually it should be treated as "[" EXTERNAL_LINK "]"
WIKI_LINK -> "[[" WIKI_LINK_TARGET "]]" / "[[" WIKI_LINK_TARGET "|" WIKI_LINK_TEXT "]]"
>| "\n" # Link stops at new line
WIKI_LINK_TARGET -> EXPANDABLE_TEXT
+| "]]" / "|"
WIKI_LINK_TEXT -> RUN
+| "]]" # Pipes are allowed here
EXTERNAL_LINK -> "[" EXTERNAL_LINK_TARGET "[ \t]" EXTERNAL_LINK_TEXT "]" / "[" EXTERNAL_LINK_TARGET "]"
/ URL
EXTERNAL_LINK_TARGET -> EXPANDABLE_URL
+| "]" / "[ \t]"
EXTERNAL_LINK_TEXT -> RUN
+| "]" # Pipes are allowed here
FORMAT_SWITCH -> "'{5}(?!')" / "'{3}(?!')" / "'{2}(?!')" # Also handles '''ab''cd'''efg''
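# The negative lookahead (?!') in FORMAT_SWITCH makes a switch match only at the end of
# a run of apostrophes. A short Python sketch, using the three alternatives as written;
# split_format is a hypothetical helper that only tokenizes and does not pair switches.

```python
import re
from typing import List, Tuple

# Longest alternative first, each one not followed by another apostrophe.
FORMAT_SWITCH = re.compile(r"'{5}(?!')|'{3}(?!')|'{2}(?!')")


def split_format(text: str) -> List[Tuple[str, str]]:
    """Split text into ("switch", ...) and ("text", ...) pieces."""
    parts = []
    last = 0
    for m in FORMAT_SWITCH.finditer(text):
        if m.start() > last:
            parts.append(("text", text[last:m.start()]))
        parts.append(("switch", m.group(0)))
        last = m.end()
    if last < len(text):
        parts.append(("text", text[last:]))
    return parts


print([kind_and_text[1] for kind_and_text in split_format("'''ab''cd'''efg''")])
# ["'''", 'ab', "''", 'cd', "'''", 'efg', "''"]
print(split_format("''''a"))
# [('text', "'"), ('switch', "'''"), ('text', 'a')]
```

# In the second call the first of four apostrophes is left as plain text, because no
# alternative can match while another apostrophe follows.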
# Known Issue: The derivation is ambiguous for {{{{T}}}} .
# Current implementation will treat it as {{{ {T }}} }, where {T is rendered as normal text,
# while actually it should be treated as { {{{T}}} }.
# Note that there's no ambiguity for nested brackets delimited with spaces, e.g. {{ {{T}} }}
TEMPLATE -> "{{" EXPANDABLE_TEXT TEMPLATE_ARGS "}}"
>| "}}" / "|"
TEMPLATE_ARGS -> "|" TEMPLATE_ARG TEMPLATE_ARGS / empty
TEMPLATE_ARG -> WIKITEXT "=" WIKITEXT / WIKITEXT # [[abc|def]]={{def}} is valid!
ARGUMENT_REF -> "{{{" WIKITEXT "|" WIKITEXT "}}}" / "{{{" WIKITEXT "}}}"
+| "|"
TEMPLATE_MAGIC -> "{{" EXPANDABLE_TEXT_MAGIC TEMPLATE_MAGIC_ARG1 "}}"
EXPANDABLE_TEXT_MAGIC -> "#" EXPANDABLE_TEXT # Likely to be a parser function
/ WELL_KNOWN_VARIABLE_NAMES # Some well-known variable names
+| ":" # Colon will start the 1st argument
TEMPLATE_MAGIC_ARG1 -> ":" WIKITEXT TEMPLATE_ARGS # The first argument just cannot be named
/ empty
# You cannot transclude templates in parser TAGs
TAG -> LI_TAG / OTHER_TAG
LI_TAG -> "<li" TAG_ATTRS ">" WIKITEXT LI_TAG_TERMINATOR # Use WIKITEXT for parsing, but actually \n in <li> tag is not converted into <p>.
+| "<li\s+>" / "</li\s*>"
LI_TAG_TERMINATOR -> "</li\s*>"
/ empty # a <li> tag will end the previous <li> tag automatically
OTHER_TAG -> "<" IDENTIFIER TAG_ATTRS ">" WIKITEXT "</" IDENTIFIER "\s*>"
+| "</IDENTIFIER\s*>"
TAG_ATTRS -> "\s+" TAG_ATTR TAG_ATTRS / empty
TAG_ATTR -> IDENTIFIER "\s*=\s*\"" "[^"]*" "\"" # <tag attr="value">
/ IDENTIFIER "\s*=\s*'" "[^']*" "'" # <tag attr='value'>
/ IDENTIFIER "\s*=\s*" "[^\s]*" # <tag attr=value> or <tag attr= >
/ IDENTIFIER "\s*" # <tag attr>
COMMENT -> "<!--" ".*?" "-->"
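# The four TAG_ATTR alternatives above are tried in order, which can be sketched in
# Python as below. Assumptions, not taken from the grammar: attribute names exclude "="
# so that the name cannot swallow the value, the bare-attribute form does not consume
# trailing whitespace, and the input is only the text between the tag name and ">".
# parse_tag_attrs is a hypothetical helper name.

```python
import re
from typing import List, Optional, Tuple

NAME = r"(?P<name>[^>\s=]+)"
ATTR_ALTERNATIVES = [
    re.compile(NAME + r'\s*=\s*"(?P<value>[^"]*)"'),  # <tag attr="value">
    re.compile(NAME + r"\s*=\s*'(?P<value>[^']*)'"),  # <tag attr='value'>
    re.compile(NAME + r"\s*=\s*(?P<value>[^\s]*)"),   # <tag attr=value> or <tag attr= >
    re.compile(NAME),                                 # <tag attr>
]
WHITESPACE = re.compile(r"\s+")


def parse_tag_attrs(text: str) -> List[Tuple[str, Optional[str]]]:
    """TAG_ATTRS -> "\\s+" TAG_ATTR TAG_ATTRS / empty, as a loop."""
    attrs = []
    pos = 0
    while True:
        ws = WHITESPACE.match(text, pos)
        if ws is None:
            break
        for alternative in ATTR_ALTERNATIVES:
            m = alternative.match(text, ws.end())
            if m:
                # A bare attribute has no "value" group, so its value is None.
                attrs.append((m.group("name"), m.groupdict().get("value")))
                pos = m.end()
                break
        else:
            break  # no alternative matched: TAG_ATTRS -> empty
    return attrs


print(parse_tag_attrs(""" a="1" b='2' c=3 d"""))
# [('a', '1'), ('b', '2'), ('c', '3'), ('d', None)]
```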
WIKITEXT Corner Cases
-----------------------
TERM == Terminators
TERM Empty WIKITEXT
abc TERM A compact paragraph.
abc\n TERM A normal paragraph.
abc\n\n TERM A normal paragraph "abc\n" and a compact paragraph "", as the last line has no extra trailing \n .
abc\n\n\n TERM Two normal paragraphs: "abc\n", "\n", as the last line has no extra trailing \n .
abc\ndef\n TERM A normal paragraph "abc\ndef\n".
* abc TERM A WIKITEXT containing a LIST_ITEM.
* abc\n TERM A WIKITEXT containing a LIST_ITEM and a compact paragraph.
* abc\n\n TERM A WIKITEXT containing a LIST_ITEM and a normal paragraph.
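The rows above can be reproduced with a small line splitter. The following Python sketch
is written only to match this table and the compact-paragraph rule; it is not the actual
implementation. split_wikitext is a hypothetical name, HEADING is not handled, and EOF
is the only terminator considered. A paragraph is normal if its source ends with \n and
compact otherwise.

```python
import re
from typing import List, Tuple

LIST_ITEM_PREFIX = re.compile(r"[*#:;]+|-{4,}| ")


def split_wikitext(text: str) -> List[Tuple[str, str]]:
    """Return (kind, source) pairs for the top-level LINEs of text."""
    lines: List[Tuple[str, str]] = []
    pos = 0
    if not text:
        return lines  # an empty WIKITEXT contains no children
    while True:
        if LIST_ITEM_PREFIX.match(text, pos):
            end = text.find("\n", pos)
            end = len(text) if end == -1 else end
            lines.append(("LIST_ITEM", text[pos:end]))
            pos = end
        else:
            start = pos
            while True:
                end = text.find("\n", pos)
                if end == -1:
                    pos = len(text)  # RUN reaches EOF: compact paragraph
                    break
                if LIST_ITEM_PREFIX.match(text, end + 1):
                    pos = end  # leave "\n" as the LINE separator: compact
                    break
                pos = end + 1  # consume RUN "\n"
                if pos == len(text) or text[pos] == "\n":
                    break  # closed (normal) paragraph
            lines.append(("PARAGRAPH", text[start:pos]))
        # WIKITEXT -> LINE "\n" WIKITEXT / LINE
        if pos < len(text) and text[pos] == "\n":
            pos += 1
        else:
            break
    return lines


print(split_wikitext("abc\n\n"))    # [('PARAGRAPH', 'abc\n'), ('PARAGRAPH', '')]
print(split_wikitext("* abc\n\n"))  # [('LIST_ITEM', '* abc'), ('PARAGRAPH', '\n')]
```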