|
| 1 | +--- |
| 2 | +title: "Using Q4C12.XML.Desc: An Example" |
| 3 | +... |
| 4 | + |
| 5 | +Have you ever written a printer and a parser, and thought that the code from one could be used to define the other? Well, you can! This package also allows you to generate a [RELAX NG] schema from the same single specification. |
| 6 | + |
| 7 | +In this example, we will build one phase of a small document processing system. We will accept input that has references to [Digital Object Identifiers][DOI] (DOIs) in it without title, date or author information, and outputs full references. A second phase, which is not implemented in this example, would only accept references with filled-in metadata and then build a HTML document from that input. |
| 8 | + |
| 9 | +This document is Literate Haskell and can be run as-is with `runhaskell using-desc.lhs` (or, if you're reading a HTML render, copy-and-paste the contents and then run with `runhaskell`). |
| 10 | + |
| 11 | +[DOI]: https://en.wikipedia.org/wiki/Digital_object_identifier |
| 12 | +[RELAX NG]: https://en.wikipedia.org/wiki/RELAX_NG |
| 13 | + |
| 14 | +Preliminaries |
| 15 | +============= |
| 16 | + |
| 17 | +First, the usual preliminaries of imports and `LANGUAGE` pragmata. |
| 18 | + |
| 19 | +> {-# LANGUAGE DataKinds, GADTs, OverloadedStrings, QuasiQuotes, TemplateHaskell #-} |
| 20 | +> module Main where |
| 21 | +> import Data.DList.NonEmpty (NonEmptyDList) |
| 22 | +> import qualified Data.DList.NonEmpty as NEDList |
| 23 | +> import Data.Either.Validation (Validation (Failure)) |
| 24 | +> import Data.Text.Lazy (Text) |
| 25 | +> import Q4C12.TwoFinger (TwoFingerOddA) |
| 26 | +> import Q4C12.XML () |
| 27 | +> import Q4C12.XML.Desc (Desc, El, OddFlow, rcons, rnil, rmany, elementMixed, makeRPlus, pat) |
| 28 | +> import Test.Tasty.HUnit (assertBool) |
| 29 | + |
| 30 | +The abstract syntax tree |
| 31 | +======================== |
| 32 | + |
| 33 | +We will use GADTs to represent our AST to enforce the phase restriction on DOIs, which will also support sections, paragraphs, unordered lists, and italic and bold. |
| 34 | + |
| 35 | +> data DOIResolution = DOIsResolved | DOIsUnresolved |
| 36 | +> |
| 37 | +> data Reference = Reference |
| 38 | +> { referenceTitle :: Text |
| 39 | +> , referenceDate :: Text |
| 40 | +> , referenceURL :: Text |
| 41 | +> , referenceAuthors :: [Text] |
| 42 | +> , referenceIdentifiers :: [Text] |
| 43 | +> } |
| 44 | +> |
| 45 | +> data InlineElement :: DOIResolution -> * where |
| 46 | +> InlineDOI :: Text -> InlineElement 'DOIsUnresolved |
| 47 | +> InlineReference :: Reference -> InlineElement doiResolution |
| 48 | +> InlineItalic :: Inline doiResolution -> InlineElement doiResolution |
| 49 | +> InlineBold :: Inline doiResolution -> InlineElement doiResolution |
| 50 | + |
| 51 | +Notice the difference between the types of `InlineDOIReference` and `InlineReference`: `InlineDOIReference` requires its type argument to be `'DOIsUnresolved`, but `InlineReference` can accept either resolved or unresolved. |
| 52 | + |
| 53 | +> newtype Inline doiResolution = Inline { getInline :: TwoFingerOddA (InlineElement doiResolution) Text } |
| 54 | + |
| 55 | +`TwoFingerOddA` is from [q4c12-twofinger]. It's isomorphic to `([(Text, InlineElement doiResolution)], Text)`, and it has a useful `Monoid` instance which we don't use in this simple example, but which is very handy for more complicated document processing. |
| 56 | + |
| 57 | +[q4c12-twofinger]: https://hackage.haskell.org/package/q4c12-twofinger |
| 58 | + |
| 59 | +> data BlockElement :: DOIResolution -> * where |
| 60 | +> BlockParagraph :: Inline doiResolution -> BlockElement doiResolution |
| 61 | +> BlockSection :: Text -> Block doiResolution -> BlockElement doiResolution |
| 62 | +> BlockUnorderedList :: [Block doiResolution] -> BlockElement doiResolution |
| 63 | +> |
| 64 | +> newtype Block doiResolution = Block { getBlock :: [BlockElement doiResolution] } |
| 65 | + |
| 66 | +Write a parser, get a printer and schema for free |
| 67 | +================================================= |
| 68 | + |
| 69 | +Now for the fun part: defining the combined printer-parser-schema! |
| 70 | + |
| 71 | +> blockDesc :: (Desc tag) => SingDOIResolution doiResolution -> OddFlow tag (Block doiResolution) |
| 72 | +> blockDesc doiResolution = rfmap (iso Block getBlock) $ nonTerminalOdd productionName |
| 73 | +> flowEvenPreWS $ rmany $ flowWSE $ blockElementDesc doiResolution |
| 74 | +> where |
| 75 | +> productionName :: Text |
| 76 | +> productionName = case doiResolution of |
| 77 | +> SingDOIsUnresolved -> "block-unresolved" |
| 78 | +> SingDOIsResolved -> "block-resolved" |
| 79 | + |
| 80 | +TODO something about even/odd flows here, note that this is ignoring whitespace, and nonTerminalOdd and injectivity requirements there (also how RELAX NG needs an element in a definition cycle) |
| 81 | + |
| 82 | +> blockElementDesc :: (Desc tag) => SingDOIResolution doiResolution -> El tag (BlockElement doiResolution) |
| 83 | +> blockElementDesc doiResolution = |
| 84 | +> $(makeRPlus $ do |
| 85 | +> pat [e| rfmap singleProd $ elementMixed (uname "p") $ inlineFlow doiResolution |] |
| 86 | +> $ \inlineP -> [p| BlockParagraph $inlineP |] |
| 87 | +> pat [e| elementMixed (uname "section") $ flowEvenPreWS |
| 88 | +> $ rcons (elementMixed (uname "title") oddTx) |
| 89 | +> $ rcons (inlineFlow doiResolution) |
| 90 | +> $ rnil |
| 91 | +> |] |
| 92 | +> $ \titleP bodyP -> [p| BlockSection $titleP $bodyP |] |
| 93 | +> pat [e| rfmap singleProd $ elementMixed (uname "ul") $ |
| 94 | +> flowEvenPreWS $ rmany $ flowWSE $ elementMixed (uname "li") $ |
| 95 | +> blockDesc doiResolution |
| 96 | +> |] |
| 97 | +> $ \itemsP -> [p| BlockUnorderedList $itemsP |] |
| 98 | +> ) |
| 99 | + |
| 100 | +TODO apologise for the TH, explain what's going on there, mention that this will get good warnings out of GHC if a case is uncovered or duplicated (can we actually test that??)---mention what it *can't* do (e.g., can't pick up non-injectivity in the AST->XML direction; round-trip tests help there though) |
| 101 | + |
| 102 | +> inlineDesc :: (Desc tag) => SingDOIResolution doiResolution -> OddFlow tag (Inline doiResolution) |
| 103 | +> inlineDesc doiResolution = rfmap (iso Inline getInline) $ nonTerminalOdd productionName $ |
| 104 | +> interleave (inlineElementDesc doiResolution) oddTx |
| 105 | +> where |
| 106 | +> productionName :: Text |
| 107 | +> productionName = case doiResolution of |
| 108 | +> SingDOIsUnresolved -> "inline-unresolved" |
| 109 | +> SingDOIsResolved -> "inline-resolved" |
| 110 | +> |
| 111 | +> inlineElementDesc :: (Desc tag) => SingDOIResolution doiResolution -> El tag (InlineElement doiResolution) |
| 112 | +> inlineElementDesc doiResolution = |
| 113 | +> $(makeRPlus $ do |
| 114 | +> case doiResolution of |
| 115 | +> SingDOIsResolved -> return () |
| 116 | +> SingDOIsUnresolved -> |
| 117 | +> pat [e| rfmap singleProd $ elementMixed (uname "doi") oddTx |] |
| 118 | +> $ \identP -> [p| InlineDOI $identP |] |
| 119 | + |
| 120 | +TODO something about singletons? Definitely call attention to the case there. |
| 121 | + |
| 122 | +> pat [e| rfmap quintupleProd $ elementMixed (uname "ref") |
| 123 | +> $ rcons (uattrF "title" stringTokenDT) |
| 124 | +> $ rcons (uattrF "date" dateDT) |
| 125 | +> $ rcons (uattrF "href" urlDT) |
| 126 | +> $ rfmap collateReferenceFields $ |
| 127 | +> flowEvenPreWS $ rmany $ flowWSE referenceFieldDesc |
| 128 | +> |] |
| 129 | +> $ \titleP dateP hrefP authorsP identsP -> [p| InlineReference (Reference $titleP $dateP $hrefP $authorsP $identsP) |] |
| 130 | + |
| 131 | +TODO datatypes, talk about isomorphisms and quotients |
| 132 | + |
| 133 | +> pat [e| rfmap singleProd $ elementMixed (uname "i") $ inlineDesc doiResolution |] |
| 134 | +> $ \inlineP -> [p| InlineItalic $inlineP |] |
| 135 | +> pat [e| rfmap singleProd $ elementMixed (uname "b") $ inlineDesc doiResolution |] |
| 136 | +> $ \inlineP -> [p| InlineBold $inlineP |] |
| 137 | +> ) |
| 138 | + |
| 139 | +Whew. And now all that's left is to put the whole thing together at the top level. |
| 140 | + |
| 141 | +> documentDesc :: (Desc tag) => SingDOIResolution doiResolution -> El tag (Block doiResolution) |
| 142 | +> documentDesc doiResolution = elementMixed "document" $ blockDesc doiResolution |
| 143 | + |
| 144 | +DOI resolution |
| 145 | +============== |
| 146 | + |
| 147 | +This is not particularly enlightening: it's just walking the AST. |
| 148 | + |
| 149 | +> lookupDOI :: Text -> Maybe Reference |
| 150 | +> lookupDOI "10.17487/RFC2324" = Just $ Reference |
| 151 | +> { referenceTitle = "Hyper Text Coffee Pot Control Protocol (HTCPCP/1.0)" |
| 152 | +> , referenceDate = "1998-03" |
| 153 | +> , referenceURL = "https://tools.ietf.org/html/rfc2324" |
| 154 | +> , referenceAuthors = ["L. Masinter"] |
| 155 | +> , referenceIdentifiers = ["RFC 2324", "doi:10.17487/RFC2324"] |
| 156 | +> } |
| 157 | +> lookupDOI _ = Nothing |
| 158 | + |
| 159 | +For this example, we are only mocking the metadata retrieval: in a real application this would hit the network and query [CrossRef]'s API. |
| 160 | + |
| 161 | +[CrossRef]: https://github.com/CrossRef/rest-api-doc |
| 162 | + |
| 163 | +> data DOIResolutionError = UnknownDOI Text |
| 164 | +> deriving (Eq) |
| 165 | +> |
| 166 | +> type Resolution = Validation (NonEmptyDList DOIResolutionError) |
| 167 | +> |
| 168 | +> resolutionFailure :: DOIResolutionError -> Resolution a |
| 169 | +> resolutionFailure = Failure . NEDList.singleton |
| 170 | +> |
| 171 | +> resolveDOIsInlineElement :: InlineElement 'DOIsUnresolved -> Resolution (InlineElement 'DOIsResolved) |
| 172 | +> resolveDOIsInlineElement (InlineDOI doi) = case lookupDOI doi of |
| 173 | +> Nothing -> resolutionFailure $ UnknownDOI doi |
| 174 | +> Just reference -> pure $ InlineReference reference |
| 175 | +> resolveDOIsInlineElement (InlineReference reference) = pure $ InlineReference reference |
| 176 | +> resolveDOIsInlineElement (InlineItalic inline) = InlineItalic <$> resolveDOIsInline inline |
| 177 | +> resolveDOIsInlineElement (InlineBold inline) = InlineBold <$> resolveDOIsInline inline |
| 178 | +> |
| 179 | +> resolveDOIsInline :: Inline 'DOIsUnresolved -> Resolution (InlineElement 'DOIsResolved) |
| 180 | +> resolveDOIsInline = fmap Inline . bitraverse resolveDOIsInlineElement pure . getInline |
| 181 | +> |
| 182 | +> resolveDOIsBlockElement :: BlockElement 'DOIsUnresolved -> Resolution (BlockElement 'DOIsResolved) |
| 183 | +> resolveDOIsBlockElement (BlockParagraph inline) = BlockParagraph <$> resolveDOIsInline inline |
| 184 | +> resolveDOIsBlockElement (BlockSection title body) = BlockSection title <$> resolveDOIsBlock body |
| 185 | +> resolveDOIsBlockElement (BlockUnorderedList items) = BlockUnorderedList <$> traverse resolveDOIsBlock items |
| 186 | +> |
| 187 | +> resolveDOIsBlock :: Block 'DOIsUnresolved -> Resolution (Block 'DOIsResolved) |
| 188 | +> resolveDOIsBlock = fmap Block . traverse resolveDOIsBlockElement . getBlock |
| 189 | + |
| 190 | +Note that the types here mean we can never forget to resolve a DOI: after this pass, it's statically guaranteed that there are no DOIs left! |
| 191 | + |
| 192 | +Putting it all together |
| 193 | +======================= |
| 194 | + |
| 195 | +Example 1: Resolving a DOI and dumping to XML |
| 196 | +--------------------------------------------- |
| 197 | + |
| 198 | +> example1 :: Bool |
| 199 | +> example1 = first toNonEmpty rendered == Right (pure expected) |
| 200 | +> where |
| 201 | +> input :: Element () |
| 202 | +> input = [xmlqq| <document><p><doi>10.17487/RFC2324</doi></p></document> |] |
| 203 | +> parsed :: Either Text (Block 'DOIUnresolved) |
| 204 | +> parsed = parse (documentDesc SignDOIsUnresolved) input |
| 205 | +> resolved :: Either Text (Resolution (Block 'DOIResolved)) |
| 206 | +> resolved = resolveDOIs <$> parsed |
| 207 | +> rendered :: Either Text (Resolution (Element ())) |
| 208 | +> rendered = fmap (fmap $ render $ documentDesc SignDOIsResolved) resolved |
| 209 | +> expected :: Element () |
| 210 | +> expected = |
| 211 | +> [xmlqq| <document><p><ref date="1998-03" |
| 212 | +> href="https://tools.ietf.org/html/rfc2324" |
| 213 | +> name="Hyper Text Coffee Pot Control Protocol (HTCPCP/1.0)" |
| 214 | +> ><author>L. Masinter</author><ident>RFC 2324</ident><ident>doi:10.17487/RFC2324</ident></ref></document> |
| 215 | +> |] |
| 216 | + |
| 217 | +Example 2: Failing to parse because of an unknown element |
| 218 | +--------------------------------------------------------- |
| 219 | + |
| 220 | +> example2 :: Bool |
| 221 | +> example2 = isLeft parsed |
| 222 | +> where |
| 223 | +> input :: Element () |
| 224 | +> input = [xmlqq| <document><p><marquee>Whee!</marquee></p></document> |] |
| 225 | +> parsed :: Either Text (Block 'DOIUnresolved) |
| 226 | +> parsed = parse (documentDesc SignDOIsUnresolved) input |
| 227 | + |
| 228 | +Example 3: Failing to parse because there is an unresolved DOI |
| 229 | +-------------------------------------------------------------- |
| 230 | + |
| 231 | +> example3 :: Bool |
| 232 | +> example3 = isLeft parsed |
| 233 | +> where |
| 234 | +> input :: Element () |
| 235 | +> input = [xmlqq| <document><p><doi>10.17487/RFC2324</doi></p></document> |] |
| 236 | +> parsed :: Either Text (Block 'DOIResolved) |
| 237 | +> parsed = parse (documentDesc SignDOIsResolved) input |
| 238 | + |
| 239 | +Testing all the examples |
| 240 | +------------------------ |
| 241 | + |
| 242 | +> main :: IO () |
| 243 | +> main = do |
| 244 | +> assertBool "Example 1" example1 |
| 245 | +> assertBool "Example 2" example2 |
| 246 | +> assertBool "Example 3" example3 |
| 247 | + |
| 248 | +Some unrelated closing remarks |
| 249 | +============================== |
| 250 | + |
| 251 | +XML is not very fashionable these days. Considering some of the crazy things that were stuffed into angle brackets, I'm not sure I can blame anyone for reacting that way and retreating to JSON to retain their sanity, but I think we need to be careful not to throw the baby out with the bathwater. |
| 252 | + |
| 253 | +In my opinion, XML is still the best meta-format for *documents* out there (as opposed to being a meta-format for structured data), and RELAX NG is a very well-thought-out piece of technology that should appeal to any functional programmer. I can recommend the [RELAX NG tutorial] as a quick introduction and ['RELAX NG' by Erik van der Vliste][RELAX NG book] for further reading. |
| 254 | + |
| 255 | +[nxml-mode], which can use RELAX NG schemas, is quite brilliant---unsurprisingly, since James Clark worked on both. If you're working with documents and you're able to avoid XML's rough edges, this is an excellent choice. `Q4C12.XML.Desc.Schema` is the module that will take a description of an element and turn it into a RELAX NG schema. This will be in the XML syntax; [trang] (another James Clark creation) can convert from that to the compact syntax needed by nxml-mode. Try it on this example! It comes out quite startlingly readable, I think. |
| 256 | + |
| 257 | +What if you still hate XML, or your data fits better into JSON? Most of the ideas in `Q4C12.XML.Desc` can be applied just as well to parsing/printing JSON; some bits would probably work even better (e.g., no need to be so fussy about whitespace or namespaces). [JSON Schema] is not nearly as nice as RELAX NG, and, to my knowledge, it doesn't have any good editor integrations. |
| 258 | + |
| 259 | +[JSON schema]: http://json-schema.org/ |
| 260 | +[nxml-mode]: https://www.gnu.org/software/emacs/manual/html_mono/nxml-mode.html |
| 261 | +[RELAX NG book]: http://books.xmlschemata.org/relaxng/page2.html |
| 262 | +[RELAX NG tutorial]: http://relaxng.org/compact-tutorial-20030326.html |
| 263 | +[trang]: https://github.com/relaxng/jing-trang |
0 commit comments