ATN cannot be deserialized in PHP-runtime when using example XML-language #4075

@martinmolema

Hello, I found an issue with the PHP-target language using the supplied XML-example language. All info below.

The error seems to originate from the Unicode characters in the grammar (https://github.com/antlr/grammars-v4/blob/master/xml/XMLLexer.g4):

fragment
NameChar    :   NameStartChar
            |   '-' | '_' | '.' | DIGIT
            |   '\u00B7'
            |   '\u0300'..'\u036F'
            |   '\u203F'..'\u2040'
            ;

fragment
NameStartChar
            :   [:a-zA-Z]
            |   '\u2070'..'\u218F'
            |   '\u2C00'..'\u2FEF'
            |   '\u3001'..'\uD7FF'
            |   '\uF900'..'\uFDCF'
            |   '\uFDF0'..'\uFFFD'
            ;
  • ANTLR4 runtime using antlr-4.9.3-complete.jar
  • using ANTLR PHP Runtime version 0.5.0
  • I am locked into a Laravel/Lumen version that will not upgrade to PHP 8, so I am using PHP 7.4.

The error occurs in ATNDeserializer.php, line 175:
$characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);
This call returns false. The behavior is described under the u modifier: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
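For anyone who wants to reproduce the failure in isolation, here is a minimal sketch. The `$data` value is a made-up stand-in (an assumption, not the actual serialized ATN), chosen only because it contains a byte sequence that is not valid UTF-8:

```php
<?php
// Hypothetical stand-in for the serialized ATN string: "\xC3\x28" is not valid UTF-8.
$data = "ab\xC3\x28";

// Same call as in ATNDeserializer.php, line 175.
$characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);

// With the u modifier, an invalid UTF-8 subject makes the whole call fail
// instead of returning a partial result.
var_dump($characters === false);

// preg_last_error() reports why the last PCRE call failed.
var_dump(\preg_last_error() === \PREG_BAD_UTF8_ERROR);
```

Both checks should evaluate to true, which matches the symptom described above: the split fails silently unless the caller inspects the return value.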

Effect
The ATN cannot be deserialized. Because there is then no ATN data, the error only surfaces in a completely different part of the code.

composer.json:
{ "require": { "antlr/antlr4-php-runtime": "0.5.0" } }

My Test.php:

<?php

require_once 'vendor/autoload.php';
require_once './parser/XMLParserVisitor.php';
require_once './parser/XMLParserBaseVisitor.php';
require_once './parser/XMLLexer.php';
require_once './parser/XMLParser.php';

use Antlr\Antlr4\Runtime\CommonTokenStream;
use Antlr\Antlr4\Runtime\InputStream;

use parser\XMLLexer;
use parser\XMLParser;

$expression = '<html><p>Test</p></html>';

$stream = InputStream::fromString($expression);
$lexer  = new XMLLexer($stream);
$tokens = new CommonTokenStream($lexer);
$parser = new XMLParser($tokens);

$tree = $parser->document();

Solution
The simplest fix is to remove the Unicode characters from the example, but that would be too simple: these characters probably represent valid characters. Instead, a proper warning or catchable exception that points to this problem would have saved me **a lot of** time.
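To illustrate the kind of catchable exception I have in mind, here is a rough sketch. The function name `splitUtf8Characters` and the message text are hypothetical and not part of the ANTLR PHP runtime; it only wraps the existing `preg_split` call and checks its result:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the ANTLR PHP runtime): splits a UTF-8
 * string into characters and throws instead of silently returning false.
 */
function splitUtf8Characters(string $data): array
{
    $characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);

    if ($characters === false) {
        // Translate the PCRE error code into a readable reason.
        $reason = \preg_last_error() === \PREG_BAD_UTF8_ERROR
            ? 'input is not valid UTF-8'
            : 'PCRE error code ' . \preg_last_error();

        throw new \RuntimeException('Cannot split serialized ATN data: ' . $reason);
    }

    return $characters;
}

try {
    splitUtf8Characters("\xC3\x28"); // invalid UTF-8 on purpose
} catch (\RuntimeException $e) {
    echo $e->getMessage(), \PHP_EOL;
}
```

With a check like this the failure would show up at the point where the data is split, rather than later when the missing ATN data is used.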

The PHP manual says: "Five and six octet UTF-8 sequences are regarded as invalid." I can't quite understand what that means, but maybe there's a hint of a solution there.

In the meantime I removed these characters as I am only parsing HTML generated by CKEditor. Testing in progress....
