Escape apostrophes by default? #167

Open
@ashawley

There is a fix related to apostrophes and XMLEventReader in #72 that is slated to appear in version 1.1.0. However, that fix did not change how apostrophes are handled across the rest of the library. Fundamentally, this is represented by the behavior of the Utility.escape method:

scala> scala.xml.Utility.escape("'")
res0: String = '

Users are curious why the apostrophe isn't generally being escaped to become &apos;:

scala> scala.xml.Utility.escape("'")
res1: String = &apos;

The reason the apostrophe isn't escaped is that the scala-xml code has been around for a long time, and although it is primarily an XML library, it was intended for mixed use with HTML. The HTML 4.0 standard still does not define an apos entity, see https://www.w3.org/TR/html401/sgml/entities.html

The apos entity has been in XML 1.0 since the beginning, see https://www.w3.org/TR/1998/REC-xml-19980210#sec-predefined-ent
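
For illustration, here is a minimal sketch of an escaping function that covers all five entities predefined by XML 1.0, including apos. It is standalone and does not call scala-xml; the names `EscapeSketch` and `escapeWithApos` are hypothetical, and it does not try to reproduce anything else Utility.escape may do.

```scala
// Hypothetical helper, NOT part of scala-xml: escapes the five entities
// predefined by XML 1.0 (lt, gt, amp, quot, apos).
object EscapeSketch {
  def escapeWithApos(text: String): String = {
    val sb = new StringBuilder
    text.foreach {
      case '<'  => sb.append("&lt;")
      case '>'  => sb.append("&gt;")
      case '&'  => sb.append("&amp;")
      case '"'  => sb.append("&quot;")
      case '\'' => sb.append("&apos;") // the case this issue asks about
      case c    => sb.append(c)        // everything else passes through
    }
    sb.toString
  }
}
```

With this sketch, `EscapeSketch.escapeWithApos("'")` yields `&apos;`, which is the behavior users seem to expect from an XML library.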

Most HTML these days is XHTML, and the apos entity was defined in XHTML 1.0, since it needed to conform to the XML standard, see https://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#a_dtd_Special_characters

It seems like apos was explicitly commented out for some historical issue about HTML4 and a Web browser called Internet Explorer -- see the enigmatic comment added in e1fadeb.

Ten years later, can we bring back XML escape support for apos?

There is some commentary on the Entities representing special characters in XHTML at Wikipedia, but I couldn't find an analysis of browser support for apos.

Presumably, all browsers support apos if the document is properly declared as XHTML. So, it seems that fixing this would require accepting that scala-xml no longer supports HTML4 or earlier?

If that's the case, then the only other issue is finding out the second-order consequences of "fixing" this, and how this would affect users. Byte-for-byte, their XML would suddenly look a little different.

There are at least 4 different contexts for an apostrophe:

  1. In an element's text (PCDATA), e.g. <p>The ' character</p>
  2. CDATA, e.g. <![CDATA[the ' character]]>
  3. In an attribute value, e.g. <a href="#the-'-character">
  4. As an attribute quote character, e.g. <a href='foo.html'>
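
To make the four contexts concrete, here is a small sketch showing where `&apos;` is optional, where it is not expanded, and where it is required. As an assumption for the sake of a self-contained example, it uses the JDK's built-in DOM parser rather than scala-xml, so it only illustrates what XML itself allows, not how scala-xml behaves. The object and value names are hypothetical.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import org.xml.sax.InputSource

// Illustration only: uses the JDK DOM parser, not scala-xml.
object ApostropheContexts {
  private val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()

  // Parse a document from a string and return its root element.
  def root(xml: String): Element =
    builder.parse(new InputSource(new StringReader(xml))).getDocumentElement

  // 1. Element text: the literal and escaped forms parse to the same text.
  val textLiteral: String = root("<p>The ' character</p>").getTextContent
  val textEscaped: String = root("<p>The &apos; character</p>").getTextContent

  // 2. CDATA: entity references are not expanded, so escaping here would
  //    change the content.
  val cdata: String = root("<p><![CDATA[the &apos; character]]></p>").getTextContent

  // 3. Double-quoted attribute value: a literal apostrophe is allowed.
  val doubleQuoted: String = root("""<a href="#the-'-character"/>""").getAttribute("href")

  // 4. Single-quoted attribute value: an apostrophe inside the value has to
  //    be written as &apos; because a literal one would end the value.
  val singleQuoted: String = root("<a href='#the-&apos;-character'/>").getAttribute("href")
}
```

Only the fourth context strictly needs the escape, which is one reason a change to the default would mostly alter how output looks rather than what it means.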

And then there are at least 3 different programming modes:

  1. XML literals
  2. XML files
  3. Programmatically-created XML (e.g. Elem(null, "p", Null, TopScope, Text("The ' character")))

And then there are at least 3 different types of parsing and reading:

  1. scala.xml.factory.XMLLoader
  2. scala.xml.pull.XMLEventReader
  3. scala.xml.parsing.ConstructingParser
