Commit e5924a3

aivaloglou committed
Final corrections
1 parent 08401a8 commit e5924a3

2 files changed, +5 -5 lines changed
Binary file not shown.

JSEP submission/smrdoc.tex

+5-5
@@ -531,9 +531,9 @@ \section{Discussion and Limitations}
531531
\subsection{Alternative grammars and parsers}
532532
\label{sec:comparisonGrammars}
533533

534-
We have identified two alternatives to the proposed grammar and its parser implementation: The official, published grammar for Excel formulas \cite{ExcelOfficialGrammar} and the formula parser implementation within the Apache POI Java API for Microsoft Documents\footnote{\label{footnote:poi}\url{https://poi.apache.org/apidocs/org/apache/poi/ss/formula/FormulaParser.html}}. From the related works discussed in this paper, \cite{DouCACheck} and \cite{badame2012refactoring} utilize Apache POI for processing spreadsheets, but we are not aware of any related works utilizing either the official Excel formula grammar of Apache POI for formula parsing.
534+
We have identified two alternatives to the proposed grammar and its parser implementation: The official, published grammar for Excel formulas \cite{ExcelOfficialGrammar} and the formula parser implementation within the Apache POI Java API for Microsoft Documents\footnote{\label{footnote:poi}\url{https://poi.apache.org/apidocs/org/apache/poi/ss/formula/FormulaParser.html}}. Of the related works discussed in this paper, \cite{DouCACheck} and \cite{badame2012refactoring} utilize Apache POI for processing spreadsheets, but we are not aware of any related works utilizing either the official Excel formula grammar or Apache POI for formula parsing.
535535

536-
Of the requirements set in Section \ref{section:grammar}, the official Excel formula grammar naturally fulfills the first one, on compatibility.
536+
From the requirements set in Section \ref{section:grammar}, the official Excel formula grammar naturally fulfills the first one, on compatibility.
537537
However, it is too granular for our purpose ---it is over 30 pages long and contains hundreds of production rules. Because of its detail and the large number of production rules, the resulting parse trees are very complex and thus fail requirement 2.
538538
An example is given in Figure \ref{figure:parsetrees}(b): the relatively simple formula \texttt{SUM(B2,5)} results in a 37-node tree with a depth of 18 nodes.
539539
For our purpose of facilitating research on spreadsheet formulas, we need a grammar that provides a different level of detail, just-enough to satisfy requirement 3.
@@ -576,11 +576,11 @@ \subsection{Alternative grammars and parsers}
576576
\label{figure:parsetrees}
577577
\end{figure}
578578

579-
Examining the Apache POI formula parser, the current version of which is v3.15, the first issue that we encountered is the grammar specification: We found no published or defined grammar, apart from a high-level grammar composed of 4 BNF syntax rules in the comments of the FormulaParser class. The grammar specification can therefore only be retrieved or reverse-engineered through the implementation. Moreover, the FormulaParser is marked to be `for POI internal use only', both in the source code and in its documentation. The parse tree is not offered through an interface, while its root node is declared as a private property, not exposed outside the FormulaParser class.
579+
Examining the Apache POI formula parser (the current version of which is v3.15), the first issue that we encountered is the lack of grammar specification: We found no published or defined grammar, apart from a high-level grammar composed of 4 BNF syntax rules in the comments of the FormulaParser class. The grammar specification can therefore only be retrieved by reverse-engineering the implementation. Moreover, the FormulaParser class is marked to be `for POI internal use only', both in its source code and in its documentation. The produced parse tree is not offered through an interface while the root node, which is required for traversing it, is declared as a private property, not exposed outside the FormulaParser class.
580580

581-
The Apache POI formula parser fails our second requirement because of the structure of the produced parse trees. As demonstrated in the example in Figure \ref{figure:parsetrees}(c), the parse trees it produces are condensed. However, this is at the expense of defining many different types of edge nodes to represent different syntactical cases, each type with its own properties and methods (in the current version, there exist 66 types of edge nodes, counted as the members of the org.apache.poi.ss.formula.ptg package). For example, reference \texttt{A1} is represented as an edge node of type \texttt{RefPtg}, reference \texttt{A1:A3} as an \texttt{AreaPtg} and reference \texttt{Sheet3!B6} is an \texttt{Ref3DPxg}. For a simple task like finding the cell references of a formula, a researcher would therefore need to manually explore and handle all different types of edge nodes that might relate to references through various properties. Similar to the official Excel formula grammar specification, it is built for a different purpose than the proposed grammar, i.e., to facilitate the evaluation of formulas, and this makes it less suitable for the intended use.
581+
The Apache POI formula parser fails our second requirement because of the parse tree structure. As demonstrated in the example in Figure \ref{figure:parsetrees}(c), the parse trees it produces are condensed. However, this is at the expense of defining many different types of edge nodes to represent different syntactical cases, each type with its own properties and methods (in the current version, there exist 66 types of edge nodes, counted as the members of the org.apache.poi.ss.formula.ptg package). For example, reference \texttt{A1} would be represented as an edge node of type \texttt{RefPtg}, reference \texttt{A1:A3} as an \texttt{AreaPtg} and reference \texttt{Sheet3!B6} is an \texttt{Ref3DPxg}. For a simple task like finding the cell references of a formula, we would therefore need to explore and handle all different types of edge nodes that might relate to references through various properties. Similar to the official Excel formula grammar specification, it is built for a different purpose than the proposed grammar, i.e., to facilitate the evaluation of formulas, and this makes it less suitable for the intended use.
582582

583-
Finally, comparing it to the proposed grammar, Apache POI has not been tested against and improved based on the datasets discussed in this paper. To reach this conclusion, we compiled a list of the latest rare grammar cases that were found in the datasets and XLParser had to be enriched to support according to the process described in Section \ref{sec:designProcess}. We tested the Apache POI formula parser against those cases and we found 6 cases that caused it to generate parse errors and incorrect parse trees\footnote{More information on those cases can be found in the issues 60979 to 60984 that we opened in the Apache POI project, accessible through \url{https://bz.apache.org/bugzilla/show_bug.cgi?id=<issue-number>}}. Example grammatical cases that we found that Apache POI in its current version does not support are intersections between named ranges (e.g. \texttt{SUM(January Sales)}), ranges with error references (e.g. \texttt{SUM(#REF!:#REF!)}), quoted multiple sheet references (e.g. \texttt{SUM('sales 1:sales 10'!F9)}), which it incorrectly recognizes as single-sheet references (to non-existent worksheet \texttt{'sales 1:sales 10'}), and references to quoted sheets in external files (e.g. \texttt{‘[file.xlsx]final sales’!A20}), which it incorrectly recognizes as references to local worksheets.
583+
Finally, comparing it to the proposed grammar, Apache POI has not been tested against and improved based on the datasets discussed in this paper. To reach this conclusion, we compiled a list of the latest grammar cases that were found in the datasets and our parser had to be enriched to support according to the process described in Section \ref{sec:designProcess}. We tested the Apache POI formula parser against those cases and we found 6 cases that caused it to generate parse errors and incorrect parse trees\footnote{More information on those cases can be found in the issues 60979 to 60984 that we opened in the Apache POI project, accessible through \url{https://bz.apache.org/bugzilla/show_bug.cgi?id=<issue-number>}}. Example grammatical cases that we found that Apache POI in its current version does not support are intersections between named ranges (e.g. \texttt{SUM(January Sales)}), ranges with error references (e.g. \texttt{SUM(#REF!:#REF!)}), quoted multiple sheet references (e.g. \texttt{SUM('sales 1:sales 10'!F9)}), which it incorrectly recognizes as single-sheet references (to non-existent worksheet \texttt{'sales 1:sales 10'}), and references to quoted sheets in external files (e.g. \texttt{‘[file.xlsx]final sales’!A20}), which it incorrectly recognizes as references to local worksheets.
584584

585585
\subsection{Dialects}
586586
