Skip to content

Commit d51d86d

Browse files
WIP
1 parent df7fc9c commit d51d86d

File tree

2 files changed

+284
-0
lines changed

2 files changed

+284
-0
lines changed

packages/xml/doc/using-desc.lhs

Lines changed: 263 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,263 @@
1+
---
2+
title: "Using Q4C12.XML.Desc: An Example"
3+
...
4+
5+
Have you ever written a printer and a parser, and thought that the code from one could be used to define the other? Well, you can! This package also allows you to generate a [RELAX NG] schema from the same single specification.
6+
7+
In this example, we will build one phase of a small document processing system. We will accept input that has references to [Digital Object Identifiers][DOI] (DOIs) in it without title, date or author information, and outputs full references. A second phase, which is not implemented in this example, would only accept references with filled-in metadata and then build a HTML document from that input.
8+
9+
This document is Literate Haskell and can be run as-is with `runhaskell using-desc.lhs` (or, if you're reading a HTML render, copy-and-paste the contents and then run with `runhaskell`).
10+
11+
[DOI]: https://en.wikipedia.org/wiki/Digital_object_identifier
12+
[RELAX NG]: https://en.wikipedia.org/wiki/RELAX_NG
13+
14+
Preliminaries
15+
=============
16+
17+
First, the usual preliminaries of imports and `LANGUAGE` pragmata.
18+
19+
> {-# LANGUAGE DataKinds, GADTs, OverloadedStrings, QuasiQuotes, TemplateHaskell #-}
20+
> module Main where
21+
> import Data.DList.NonEmpty (NonEmptyDList)
22+
> import qualified Data.DList.NonEmpty as NEDList
23+
> import Data.Either.Validation (Validation (Failure))
24+
> import Data.Text.Lazy (Text)
25+
> import Q4C12.TwoFinger (TwoFingerOddA)
26+
> import Q4C12.XML ()
27+
> import Q4C12.XML.Desc (Desc, El, OddFlow, rcons, rnil, rmany, elementMixed, makeRPlus, pat)
28+
> import Test.Tasty.HUnit (assertBool)
29+
30+
The abstract syntax tree
31+
========================
32+
33+
We will use GADTs to represent our AST to enforce the phase restriction on DOIs, which will also support sections, paragraphs, unordered lists, and italic and bold.
34+
35+
> data DOIResolution = DOIsResolved | DOIsUnresolved
36+
>
37+
> data Reference = Reference
38+
> { referenceTitle :: Text
39+
> , referenceDate :: Text
40+
> , referenceURL :: Text
41+
> , referenceAuthors :: [Text]
42+
> , referenceIdentifiers :: [Text]
43+
> }
44+
>
45+
> data InlineElement :: DOIResolution -> * where
46+
> InlineDOI :: Text -> InlineElement 'DOIsUnresolved
47+
> InlineReference :: Reference -> InlineElement doiResolution
48+
> InlineItalic :: Inline doiResolution -> InlineElement doiResolution
49+
> InlineBold :: Inline doiResolution -> InlineElement doiResolution
50+
51+
Notice the difference between the types of `InlineDOIReference` and `InlineReference`: `InlineDOIReference` requires its type argument to be `'DOIsUnresolved`, but `InlineReference` can accept either resolved or unresolved.
52+
53+
> newtype Inline doiResolution = Inline { getInline :: TwoFingerOddA (InlineElement doiResolution) Text }
54+
55+
`TwoFingerOddA` is from [q4c12-twofinger]. It's isomorphic to `([(Text, InlineElement doiResolution)], Text)`, and it has a useful `Monoid` instance which we don't use in this simple example, but which is very handy for more complicated document processing.
56+
57+
[q4c12-twofinger]: https://hackage.haskell.org/package/q4c12-twofinger
58+
59+
> data BlockElement :: DOIResolution -> * where
60+
> BlockParagraph :: Inline doiResolution -> BlockElement doiResolution
61+
> BlockSection :: Text -> Block doiResolution -> BlockElement doiResolution
62+
> BlockUnorderedList :: [Block doiResolution] -> BlockElement doiResolution
63+
>
64+
> newtype Block doiResolution = Block { getBlock :: [BlockElement doiResolution] }
65+
66+
Write a parser, get a printer and schema for free
67+
=================================================
68+
69+
Now for the fun part: defining the combined printer-parser-schema!
70+
71+
> blockDesc :: (Desc tag) => SingDOIResolution doiResolution -> OddFlow tag (Block doiResolution)
72+
> blockDesc doiResolution = rfmap (iso Block getBlock) $ nonTerminalOdd productionName
73+
> flowEvenPreWS $ rmany $ flowWSE $ blockElementDesc doiResolution
74+
> where
75+
> productionName :: Text
76+
> productionName = case doiResolution of
77+
> SingDOIsUnresolved -> "block-unresolved"
78+
> SingDOIsResolved -> "block-resolved"
79+
80+
TODO something about even/odd flows here, note that this is ignoring whitespace, and nonTerminalOdd and injectivity requirements there (also how RELAX NG needs an element in a definition cycle)
81+
82+
> blockElementDesc :: (Desc tag) => SingDOIResolution doiResolution -> El tag (BlockElement doiResolution)
83+
> blockElementDesc doiResolution =
84+
> $(makeRPlus $ do
85+
> pat [e| rfmap singleProd $ elementMixed (uname "p") $ inlineFlow doiResolution |]
86+
> $ \inlineP -> [p| BlockParagraph $inlineP |]
87+
> pat [e| elementMixed (uname "section") $ flowEvenPreWS
88+
> $ rcons (elementMixed (uname "title") oddTx)
89+
> $ rcons (inlineFlow doiResolution)
90+
> $ rnil
91+
> |]
92+
> $ \titleP bodyP -> [p| BlockSection $titleP $bodyP |]
93+
> pat [e| rfmap singleProd $ elementMixed (uname "ul") $
94+
> flowEvenPreWS $ rmany $ flowWSE $ elementMixed (uname "li") $
95+
> blockDesc doiResolution
96+
> |]
97+
> $ \itemsP -> [p| BlockUnorderedList $itemsP |]
98+
> )
99+
100+
TODO apologise for the TH, explain what's going on there, mention that this will get good warnings out of GHC if a case is uncovered or duplicated (can we actually test that??)---mention what it *can't* do (e.g., can't pick up non-injectivity in the AST->XML direction; round-trip tests help there though)
101+
102+
> inlineDesc :: (Desc tag) => SingDOIResolution doiResolution -> OddFlow tag (Inline doiResolution)
103+
> inlineDesc doiResolution = rfmap (iso Inline getInline) $ nonTerminalOdd productionName $
104+
> interleave (inlineElementDesc doiResolution) oddTx
105+
> where
106+
> productionName :: Text
107+
> productionName = case doiResolution of
108+
> SingDOIsUnresolved -> "inline-unresolved"
109+
> SingDOIsResolved -> "inline-resolved"
110+
>
111+
> inlineElementDesc :: (Desc tag) => SingDOIResolution doiResolution -> El tag (InlineElement doiResolution)
112+
> inlineElementDesc doiResolution =
113+
> $(makeRPlus $ do
114+
> case doiResolution of
115+
> SingDOIsResolved -> return ()
116+
> SingDOIsUnresolved ->
117+
> pat [e| rfmap singleProd $ elementMixed (uname "doi") oddTx |]
118+
> $ \identP -> [p| InlineDOI $identP |]
119+
120+
TODO something about singletons? Definitely call attention to the case there.
121+
122+
> pat [e| rfmap quintupleProd $ elementMixed (uname "ref")
123+
> $ rcons (uattrF "title" stringTokenDT)
124+
> $ rcons (uattrF "date" dateDT)
125+
> $ rcons (uattrF "href" urlDT)
126+
> $ rfmap collateReferenceFields $
127+
> flowEvenPreWS $ rmany $ flowWSE referenceFieldDesc
128+
> |]
129+
> $ \titleP dateP hrefP authorsP identsP -> [p| InlineReference (Reference $titleP $dateP $hrefP $authorsP $identsP) |]
130+
131+
TODO datatypes, talk about isomorphisms and quotients
132+
133+
> pat [e| rfmap singleProd $ elementMixed (uname "i") $ inlineDesc doiResolution |]
134+
> $ \inlineP -> [p| InlineItalic $inlineP |]
135+
> pat [e| rfmap singleProd $ elementMixed (uname "b") $ inlineDesc doiResolution |]
136+
> $ \inlineP -> [p| InlineBold $inlineP |]
137+
> )
138+
139+
Whew. And now all that's left is to put the whole thing together at the top level.
140+
141+
> documentDesc :: (Desc tag) => SingDOIResolution doiResolution -> El tag (Block doiResolution)
142+
> documentDesc doiResolution = elementMixed "document" $ blockDesc doiResolution
143+
144+
DOI resolution
145+
==============
146+
147+
This is not particularly enlightening: it's just walking the AST.
148+
149+
> lookupDOI :: Text -> Maybe Reference
150+
> lookupDOI "10.17487/RFC2324" = Just $ Reference
151+
> { referenceTitle = "Hyper Text Coffee Pot Control Protocol (HTCPCP/1.0)"
152+
> , referenceDate = "1998-03"
153+
> , referenceURL = "https://tools.ietf.org/html/rfc2324"
154+
> , referenceAuthors = ["L. Masinter"]
155+
> , referenceIdentifiers = ["RFC 2324", "doi:10.17487/RFC2324"]
156+
> }
157+
> lookupDOI _ = Nothing
158+
159+
For this example, we are only mocking the metadata retrieval: in a real application this would hit the network and query [CrossRef]'s API.
160+
161+
[CrossRef]: https://github.com/CrossRef/rest-api-doc
162+
163+
> data DOIResolutionError = UnknownDOI Text
164+
> deriving (Eq)
165+
>
166+
> type Resolution = Validation (NonEmptyDList DOIResolutionError)
167+
>
168+
> resolutionFailure :: DOIResolutionError -> Resolution a
169+
> resolutionFailure = Failure . NEDList.singleton
170+
>
171+
> resolveDOIsInlineElement :: InlineElement 'DOIsUnresolved -> Resolution (InlineElement 'DOIsResolved)
172+
> resolveDOIsInlineElement (InlineDOI doi) = case lookupDOI doi of
173+
> Nothing -> resolutionFailure $ UnknownDOI doi
174+
> Just reference -> pure $ InlineReference reference
175+
> resolveDOIsInlineElement (InlineReference reference) = pure $ InlineReference reference
176+
> resolveDOIsInlineElement (InlineItalic inline) = InlineItalic <$> resolveDOIsInline inline
177+
> resolveDOIsInlineElement (InlineBold inline) = InlineBold <$> resolveDOIsInline inline
178+
>
179+
> resolveDOIsInline :: Inline 'DOIsUnresolved -> Resolution (InlineElement 'DOIsResolved)
180+
> resolveDOIsInline = fmap Inline . bitraverse resolveDOIsInlineElement pure . getInline
181+
>
182+
> resolveDOIsBlockElement :: BlockElement 'DOIsUnresolved -> Resolution (BlockElement 'DOIsResolved)
183+
> resolveDOIsBlockElement (BlockParagraph inline) = BlockParagraph <$> resolveDOIsInline inline
184+
> resolveDOIsBlockElement (BlockSection title body) = BlockSection title <$> resolveDOIsBlock body
185+
> resolveDOIsBlockElement (BlockUnorderedList items) = BlockUnorderedList <$> traverse resolveDOIsBlock items
186+
>
187+
> resolveDOIsBlock :: Block 'DOIsUnresolved -> Resolution (Block 'DOIsResolved)
188+
> resolveDOIsBlock = fmap Block . traverse resolveDOIsBlockElement . getBlock
189+
190+
Note that the types here mean we can never forget to resolve a DOI: after this pass, it's statically guaranteed that there are no DOIs left!
191+
192+
Putting it all together
193+
=======================
194+
195+
Example 1: Resolving a DOI and dumping to XML
196+
---------------------------------------------
197+
198+
> example1 :: Bool
199+
> example1 = first toNonEmpty rendered == Right (pure expected)
200+
> where
201+
> input :: Element ()
202+
> input = [xmlqq| <document><p><doi>10.17487/RFC2324</doi></p></document> |]
203+
> parsed :: Either Text (Block 'DOIUnresolved)
204+
> parsed = parse (documentDesc SignDOIsUnresolved) input
205+
> resolved :: Either Text (Resolution (Block 'DOIResolved))
206+
> resolved = resolveDOIs <$> parsed
207+
> rendered :: Either Text (Resolution (Element ()))
208+
> rendered = fmap (fmap $ render $ documentDesc SignDOIsResolved) resolved
209+
> expected :: Element ()
210+
> expected =
211+
> [xmlqq| <document><p><ref date="1998-03"
212+
> href="https://tools.ietf.org/html/rfc2324"
213+
> name="Hyper Text Coffee Pot Control Protocol (HTCPCP/1.0)"
214+
> ><author>L. Masinter</author><ident>RFC 2324</ident><ident>doi:10.17487/RFC2324</ident></ref></document>
215+
> |]
216+
217+
Example 2: Failing to parse because of an unknown element
218+
---------------------------------------------------------
219+
220+
> example2 :: Bool
221+
> example2 = isLeft parsed
222+
> where
223+
> input :: Element ()
224+
> input = [xmlqq| <document><p><marquee>Whee!</marquee></p></document> |]
225+
> parsed :: Either Text (Block 'DOIUnresolved)
226+
> parsed = parse (documentDesc SignDOIsUnresolved) input
227+
228+
Example 3: Failing to parse because there is an unresolved DOI
229+
--------------------------------------------------------------
230+
231+
> example3 :: Bool
232+
> example3 = isLeft parsed
233+
> where
234+
> input :: Element ()
235+
> input = [xmlqq| <document><p><doi>10.17487/RFC2324</doi></p></document> |]
236+
> parsed :: Either Text (Block 'DOIResolved)
237+
> parsed = parse (documentDesc SignDOIsResolved) input
238+
239+
Testing all the examples
240+
------------------------
241+
242+
> main :: IO ()
243+
> main = do
244+
> assertBool "Example 1" example1
245+
> assertBool "Example 2" example2
246+
> assertBool "Example 3" example3
247+
248+
Some unrelated closing remarks
249+
==============================
250+
251+
XML is not very fashionable these days. Considering some of the crazy things that were stuffed into angle brackets, I'm not sure I can blame anyone for reacting that way and retreating to JSON to retain their sanity, but I think we need to be careful not to throw the baby out with the bathwater.
252+
253+
In my opinion, XML is still the best meta-format for *documents* out there (as opposed to being a meta-format for structured data), and RELAX NG is a very well-thought-out piece of technology that should appeal to any functional programmer. I can recommend the [RELAX NG tutorial] as a quick introduction and ['RELAX NG' by Erik van der Vliste][RELAX NG book] for further reading.
254+
255+
[nxml-mode], which can use RELAX NG schemas, is quite brilliant---unsurprisingly, since James Clark worked on both. If you're working with documents and you're able to avoid XML's rough edges, this is an excellent choice. `Q4C12.XML.Desc.Schema` is the module that will take a description of an element and turn it into a RELAX NG schema. This will be in the XML syntax; [trang] (another James Clark creation) can convert from that to the compact syntax needed by nxml-mode. Try it on this example! It comes out quite startlingly readable, I think.
256+
257+
What if you still hate XML, or your data fits better into JSON? Most of the ideas in `Q4C12.XML.Desc` can be applied just as well to parsing/printing JSON; some bits would probably work even better (e.g., no need to be so fussy about whitespace or namespaces). [JSON Schema] is not nearly as nice as RELAX NG, and, to my knowledge, it doesn't have any good editor integrations.
258+
259+
[JSON schema]: http://json-schema.org/
260+
[nxml-mode]: https://www.gnu.org/software/emacs/manual/html_mono/nxml-mode.html
261+
[RELAX NG book]: http://books.xmlschemata.org/relaxng/page2.html
262+
[RELAX NG tutorial]: http://relaxng.org/compact-tutorial-20030326.html
263+
[trang]: https://github.com/relaxng/jing-trang

packages/xml/package.yaml

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ license-file:
1010
- LICENSE.W3C
1111
extra-doc-files:
1212
- README.markdown
13+
- doc/using-desc.lhs
1314
extra-source-files:
1415
- data/html.xml
1516
- test/golden/**/*.in
@@ -21,6 +22,8 @@ defaults:
2122

2223
_common-dependencies:
2324
# Define version ranges once and only once across components.
25+
- &d-base
26+
"base >= 4.9.1.0 && < 4.11"
2427
- &d-base-noprelude
2528
"base-noprelude >= 4.9.1.0 && < 4.11"
2629
- &d-containers
@@ -31,10 +34,14 @@ _common-dependencies:
3134
"dlist >= 0.8.0.3 && < 0.9"
3235
- &d-dlist-nonempty
3336
"dlist-nonempty >= 0.1.1 && < 0.2"
37+
- &d-either
38+
"either >= 5 && < 5.1"
3439
- &d-filepath
3540
"filepath >= 1.4.1.1 && < 1.5"
3641
- &d-formatting
3742
"formatting >= 6.3.0 && < 6.4"
43+
- &d-lens
44+
"lens >= 4.15.4 && < 4.16"
3845
- &d-q4c12-mappend
3946
"q4c12-mappend >= 0 && < 0.1"
4047
- &d-q4c12-position
@@ -49,6 +56,8 @@ _common-dependencies:
4956
"tasty >= 0.12 && < 0.13 || >= 1.0 && < 1.1"
5057
- &d-tasty-expected-failure
5158
"tasty-expected-failure >= 0.11.0.4 && < 0.12"
59+
- &d-tasty-hunit
60+
"tasty-hunit >= 0.10.0.1 && < 0.11"
5261
- &d-tasty-golden
5362
"tasty-golden >= 2.3.1.1 && < 2.4"
5463
- &d-template-haskell
@@ -152,6 +161,18 @@ tests:
152161
- *d-tasty-golden
153162
- *d-text
154163
- *d-transformers
164+
using-desc:
165+
main: doc/using-desc.lhs
166+
other-modules: []
167+
dependencies:
168+
- q4c12-xml
169+
- *d-base
170+
- *d-dlist-nonempty
171+
- *d-either
172+
- *d-lens
173+
- *d-q4c12-twofinger
174+
- *d-tasty-hunit
175+
- *d-text
155176

156177
benchmarks:
157178
parse-xml:

0 commit comments

Comments
 (0)