Description
There is a fix related to apostrophes and XMLEventReader in #72 that is slated to appear in version 1.1.0. However, that fix did not change how apostrophes are handled across the rest of the library. Fundamentally, this is represented by the behavior of the Utility.escape method:
scala> scala.xml.Utility.escape("'")
res0: String = '
Users are curious why the apostrophe isn't generally being escaped to become &apos;:
scala> scala.xml.Utility.escape("'")
res1: String = &apos;
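For anyone who needs the escaped form in the meantime, here is a minimal sketch of a standalone helper. It is not scala-xml's implementation and does not try to reproduce anything else Utility.escape does; the names AposEscape and escapeWithApos are hypothetical, chosen only for illustration. It simply maps the five entities predefined by XML 1.0, apostrophe included.

```scala
object AposEscape {
  // Hypothetical helper, not part of scala-xml: replaces the five
  // characters that have predefined entities in XML 1.0 and copies
  // every other character through unchanged.
  def escapeWithApos(text: String): String = {
    val sb = new StringBuilder(text.length)
    text.foreach {
      case '<'  => sb.append("&lt;")
      case '>'  => sb.append("&gt;")
      case '&'  => sb.append("&amp;")
      case '"'  => sb.append("&quot;")
      case '\'' => sb.append("&apos;")
      case c    => sb.append(c)
    }
    sb.toString
  }
}
```

With this sketch, AposEscape.escapeWithApos("'") returns the string &apos; that users were expecting from the library.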
The apostrophe isn't escaped because the scala-xml code has been around for a long time and, although it is primarily an XML library, it was meant for mixed use with HTML. The HTML 4.0 standard still does not define an apos entity, see https://www.w3.org/TR/html401/sgml/entities.html
The apos entity has been in XML 1.0 since the beginning, see https://www.w3.org/TR/1998/REC-xml-19980210#sec-predefined-ent
Most HTML these days is XHTML, and the apos entity was defined in XHTML 1.0, since it needed to conform to the XML standard, see https://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#a_dtd_Special_characters
It seems like apos was explicitly commented out because of some historical issue with HTML4 and a Web browser called Internet Explorer; see the enigmatic comment added in e1fadeb.
Ten years later, can we bring back XML escape support for apos?
There is some commentary on the entities representing special characters in XHTML at Wikipedia, but I couldn't find an analysis of browser support for apos.
Presumably, all browsers support apos if the document is properly declared as XHTML. So it seems that fixing this would require accepting that scala-xml no longer supports HTML4 or earlier?
If that's the case, then the only other issue is finding out the second-order consequences of "fixing" this and how it would affect users. Byte-for-byte, their XML would suddenly look a little different.
There are at least 4 different contexts for an apostrophe:
- In an element's text (PCDATA), e.g.
<p>The ' character</p>
- CDATA, e.g.
<![CDATA[the ' character]]>
- In an attribute value, e.g.
<a href="#the-'-character">
- As an attribute quote character, e.g.
<a href='foo.html'>
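To make these contexts concrete, here is a rough sketch of what XML 1.0 itself requires in each one. The Context type and the ApostropheRules object are hypothetical names for illustration only; nothing like them exists in scala-xml. The attribute-value case is split by delimiter because the rule depends on which quote character surrounds the value.

```scala
// Hypothetical model of the places an apostrophe can appear.
sealed trait Context
case object ElementText extends Context                           // <p>The ' character</p>
case object CDataSection extends Context                          // <![CDATA[the ' character]]>
final case class AttributeValue(delimiter: Char) extends Context  // inside href="..." or href='...'
case object AttributeDelimiter extends Context                    // the quotes around the value

object ApostropheRules {
  // True when a literal apostrophe would be ill-formed and must be
  // written as &apos; (or a character reference): only inside an
  // attribute value that is itself delimited by apostrophes.
  def mustEscape(ctx: Context): Boolean = ctx match {
    case AttributeValue('\'') => true
    case _                    => false
  }

  // True when &apos; is recognized as a reference and decodes to '.
  // CDATA sections do not recognize references, and the delimiter
  // is markup rather than character data.
  def mayEscape(ctx: Context): Boolean = ctx match {
    case ElementText | AttributeValue(_)   => true
    case CDataSection | AttributeDelimiter => false
  }
}
```

Read this way, only the apostrophe-delimited attribute value forces an escape, and CDATA must never be escaped, so any library-wide change would presumably need to treat CDATA separately.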
And then there are at least 3 different programming modes:
- XML literals
- XML files
- Programmatically-created XML, e.g.
Elem(null, "p", Null, TopScope, Text("The ' character"))
And then there are at least 3 different types of parsing and reading:
- scala.xml.factory.XMLLoader
- scala.xml.pull.XMLEventReader
- scala.xml.parsing.ConstructingParser