Description
Hello, I found an issue with the PHP target when using the supplied XML example grammar. All info below.
The origin of the error seems to be the Unicode characters in the lexer grammar (https://github.com/antlr/grammars-v4/blob/master/xml/XMLLexer.g4):
fragment
NameChar : NameStartChar
| '-' | '_' | '.' | DIGIT
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment
NameStartChar
: [:a-zA-Z]
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
- ANTLR4 runtime using antlr-4.9.3-complete.jar
- using ANTLR PHP Runtime version 0.5.0
- I am locked into a Laravel/Lumen version that will not upgrade to PHP 8, so I am using PHP 7.4.
The error occurs in ATNDeserializer.php, line 175:
$characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);
This call returns false. The behaviour is described under the u modifier: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
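To illustrate the failure mode outside of ANTLR, here is a minimal sketch. It assumes the serialized ATN ends up containing a byte sequence that PCRE does not accept as UTF-8; the `$data` value below is a made-up stand-in, not the real ATN string.

```php
<?php
// Hypothetical stand-in for the serialized ATN: "\xC3\x28" is a
// two-byte sequence whose second byte is not a valid continuation byte.
$data = "ab\xC3\x28";

// Same call as in ATNDeserializer.php.
$characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);

// With the u modifier the subject must be valid UTF-8, otherwise the
// call fails instead of returning an array of characters.
var_dump($characters === false);

// preg_last_error() tells us why the call failed.
var_dump(\preg_last_error() === \PREG_BAD_UTF8_ERROR);
```

No warning or exception is raised here, which is why the problem only becomes visible later.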
Effect
The ATN cannot be deserialized, and because there is no ATN data the error only surfaces in a completely different part of the code.
composer.json:
{ "require": { "antlr/antlr4-php-runtime": "0.5.0" } }
My Test.php:
<?php
require_once 'vendor/autoload.php';
require_once './parser/XMLParserVisitor.php';
require_once './parser/XMLParserBaseVisitor.php';
require_once './parser/XMLLexer.php';
require_once './parser/XMLParser.php';
use Antlr\Antlr4\Runtime\CommonTokenStream;
use Antlr\Antlr4\Runtime\InputStream;
use parser\XMLLexer;
use parser\XMLParser;
$expression = '<html><p>Test</p></html>';
$stream = InputStream::fromString($expression);
$lexer = new XMLLexer($stream);
$tokens = new CommonTokenStream($lexer);
$parser = new XMLParser($tokens);
$tree = $parser->document();
Solution
The simplest fix is to remove the Unicode characters from the example grammar, but that would be too simple: these characters probably represent valid characters. Instead, a proper warning or a catchable exception with an indication of this problem would have saved me **a lot of** time.
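Something along these lines is what I have in mind. This is only a sketch: `splitUtf8Characters()` is a hypothetical helper that I made up for illustration, it is not part of the ANTLR PHP runtime, and the message text is just a suggestion.

```php
<?php
/**
 * Hypothetical helper (not part of the ANTLR PHP runtime): splits a string
 * into UTF-8 characters and throws instead of silently returning false.
 *
 * @return string[]
 */
function splitUtf8Characters(string $data): array
{
    $characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);

    if ($characters === false) {
        // preg_last_error() is available on PHP 7.4;
        // preg_last_error_msg() would need PHP 8.
        throw new \RuntimeException(\sprintf(
            'Cannot split the data into characters (preg_last_error() = %d); '
            . 'it is probably not valid UTF-8.',
            \preg_last_error()
        ));
    }

    return $characters;
}

// Usage: the failure is now reported where it happens.
try {
    splitUtf8Characters("ab\xC3\x28"); // made-up invalid input
} catch (\RuntimeException $e) {
    echo $e->getMessage(), \PHP_EOL;
}
```

With a check like this the caller gets an exception at the point of failure rather than an unrelated error later on.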
The PHP manual says: "Five and six octet UTF-8 sequences are regarded as invalid." I don't fully understand what that means, but maybe there is a hint of a solution there.
In the meantime I have removed these characters, as I am only parsing HTML generated by CKEditor. Testing in progress...
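As far as I can tell, the sentence refers to the older UTF-8 definition that allowed encodings of up to six bytes, while the current one stops at four. A small sketch of how the u modifier treats such input; the byte strings are hand-written examples and I am not claiming they are what the ATN actually contains.

```php
<?php
// Euro sign: a well-formed three-octet UTF-8 sequence.
$threeOctets = "\xE2\x82\xAC";

// A five-octet form (lead byte 0xF8) from the older UTF-8 definition.
$fiveOctets = "\xF8\x88\x80\x80\x80";

// An empty pattern with the u modifier only has to validate the subject,
// so it is a cheap way to ask PCRE whether a string is acceptable UTF-8.
var_dump(\preg_match('//u', $threeOctets)); // accepted
var_dump(\preg_match('//u', $fiveOctets));  // rejected, returns false
```

So any string that PCRE does not consider valid UTF-8 makes the whole call fail, not just the offending character.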