JSEP submission/smrdoc.tex
+5-5
@@ -531,9 +531,9 @@ \section{Discussion and Limitations}
531
531
\subsection{Alternative grammars and parsers}
532
532
\label{sec:comparisonGrammars}
533
533
534
-
We have identified two alternatives to the proposed grammar and its parser implementation: The official, published grammar for Excel formulas \cite{ExcelOfficialGrammar} and the formula parser implementation within the Apache POI Java API for Microsoft Documents\footnote{\label{footnote:poi}\url{https://poi.apache.org/apidocs/org/apache/poi/ss/formula/FormulaParser.html}}. From the related works discussed in this paper, \cite{DouCACheck} and \cite{badame2012refactoring} utilize Apache POI for processing spreadsheets, but we are not aware of any related works utilizing either the official Excel formula grammar of Apache POI for formula parsing.
534
+
We have identified two alternatives to the proposed grammar and its parser implementation: The official, published grammar for Excel formulas \cite{ExcelOfficialGrammar} and the formula parser implementation within the Apache POI Java API for Microsoft Documents\footnote{\label{footnote:poi}\url{https://poi.apache.org/apidocs/org/apache/poi/ss/formula/FormulaParser.html}}. Of the related works discussed in this paper, \cite{DouCACheck} and \cite{badame2012refactoring} utilize Apache POI for processing spreadsheets, but we are not aware of any related works utilizing either the official Excel formula grammar or Apache POI for formula parsing.
535
535
536
-
Of the requirements set in Section \ref{section:grammar}, the official Excel formula grammar naturally fulfills the first one, on compatibility.
536
+
From the requirements set in Section \ref{section:grammar}, the official Excel formula grammar naturally fulfills the first one, on compatibility.
537
537
However, it is too granular for our purpose ---it is over 30 pages long and contains hundreds of production rules. Because of its detail and the large number of production rules, the resulting parse trees are very complex and thus fail requirement 2.
538
538
An example is given in Figure \ref{figure:parsetrees}(b): the relatively simple formula \texttt{SUM(B2,5)} results in a 37-node tree with a depth of 18 nodes.
539
539
For our purpose of facilitating research on spreadsheet formulas, we need a grammar that provides a different level of detail, just-enough to satisfy requirement 3.
@@ -576,11 +576,11 @@ \subsection{Alternative grammars and parsers}
576
576
\label{figure:parsetrees}
577
577
\end{figure}
578
578
579
-
Examining the Apache POI formula parser, the current version of which is v3.15, the first issue that we encountered is the grammar specification: We found no published or defined grammar, apart from a high-level grammar composed of 4 BNF syntax rules in the comments of the FormulaParser class. The grammar specification can therefore only be retrieved or reverse-engineered through the implementation. Moreover, the FormulaParser is marked to be `for POI internal use only', both in the source code and in its documentation. The parse tree is not offered through an interface, while its root node is declared as a private property, not exposed outside the FormulaParser class.
579
+
Examining the Apache POI formula parser (the current version of which is v3.15), the first issue that we encountered is the lack of grammar specification: We found no published or defined grammar, apart from a high-level grammar composed of 4 BNF syntax rules in the comments of the FormulaParser class. The grammar specification can therefore only be retrieved by reverse-engineering the implementation. Moreover, the FormulaParser class is marked to be `for POI internal use only', both in its source code and in its documentation. The produced parse tree is not offered through an interface while the root node, which is required for traversing it, is declared as a private property, not exposed outside the FormulaParser class.
580
580
581
-
The Apache POI formula parser fails our second requirement because of the structure of the produced parse trees. As demonstrated in the example in Figure \ref{figure:parsetrees}(c), the parse trees it produces are condensed. However, this is at the expense of defining many different types of edge nodes to represent different syntactical cases, each type with its own properties and methods (in the current version, there exist 66 types of edge nodes, counted as the members of the org.apache.poi.ss.formula.ptg package). For example, reference \texttt{A1} is represented as an edge node of type \texttt{RefPtg}, reference \texttt{A1:A3} as an \texttt{AreaPtg} and reference \texttt{Sheet3!B6} is an \texttt{Ref3DPxg}. For a simple task like finding the cell references of a formula, a researcher would therefore need to manually explore and handle all different types of edge nodes that might relate to references through various properties. Similar to the official Excel formula grammar specification, it is built for a different purpose than the proposed grammar, i.e., to facilitate the evaluation of formulas, and this makes it less suitable for the intended use.
581
+
The Apache POI formula parser fails our second requirement because of the parse tree structure. As demonstrated in the example in Figure \ref{figure:parsetrees}(c), the parse trees it produces are condensed. However, this is at the expense of defining many different types of edge nodes to represent different syntactical cases, each type with its own properties and methods (in the current version, there exist 66 types of edge nodes, counted as the members of the org.apache.poi.ss.formula.ptg package). For example, reference \texttt{A1} would be represented as an edge node of type \texttt{RefPtg}, reference \texttt{A1:A3} as an \texttt{AreaPtg} and reference \texttt{Sheet3!B6} is an \texttt{Ref3DPxg}. For a simple task like finding the cell references of a formula, we would therefore need to explore and handle all different types of edge nodes that might relate to references through various properties. Similar to the official Excel formula grammar specification, it is built for a different purpose than the proposed grammar, i.e., to facilitate the evaluation of formulas, and this makes it less suitable for the intended use.
582
582
583
-
Finally, comparing it to the proposed grammar, Apache POI has not been tested against and improved based on the datasets discussed in this paper. To reach this conclusion, we compiled a list of the latest rare grammar cases that were found in the datasets and XLParser had to be enriched to support according to the process described in Section \ref{sec:designProcess}. We tested the Apache POI formula parser against those cases and we found 6 cases that caused it to generate parse errors and incorrect parse trees\footnote{More information on those cases can be found in the issues 60979 to 60984 that we opened in the Apache POI project, accessible through \url{https://bz.apache.org/bugzilla/show_bug.cgi?id=<issue-number>}}. Example grammatical cases that we found that Apache POI in its current version does not support are intersections between named ranges (e.g. \texttt{SUM(January Sales)}), ranges with error references (e.g. \texttt{SUM(#REF!:#REF!)}), quoted multiple sheet references (e.g. \texttt{SUM('sales 1:sales 10'!F9)}), which it incorrectly recognizes as single-sheet references (to non-existent worksheet \texttt{'sales 1:sales 10'}), and references to quoted sheets in external files (e.g. \texttt{‘[file.xlsx]final sales’!A20}), which it incorrectly recognizes as references to local worksheets.
583
+
Finally, comparing it to the proposed grammar, Apache POI has not been tested against and improved based on the datasets discussed in this paper. To reach this conclusion, we compiled a list of the latest grammar cases that were found in the datasets and our parser had to be enriched to support according to the process described in Section \ref{sec:designProcess}. We tested the Apache POI formula parser against those cases and we found 6 cases that caused it to generate parse errors and incorrect parse trees\footnote{More information on those cases can be found in the issues 60979 to 60984 that we opened in the Apache POI project, accessible through \url{https://bz.apache.org/bugzilla/show_bug.cgi?id=<issue-number>}}. Example grammatical cases that we found that Apache POI in its current version does not support are intersections between named ranges (e.g. \texttt{SUM(January Sales)}), ranges with error references (e.g. \texttt{SUM(#REF!:#REF!)}), quoted multiple sheet references (e.g. \texttt{SUM('sales 1:sales 10'!F9)}), which it incorrectly recognizes as single-sheet references (to non-existent worksheet \texttt{'sales 1:sales 10'}), and references to quoted sheets in external files (e.g. \texttt{‘[file.xlsx]final sales’!A20}), which it incorrectly recognizes as references to local worksheets.