Some wrong cases #2

xiaobanni · 2025-01-28T10:53:02Z

Thank you for your excellent work on this project! I have been testing the evaluation functionality with various mathematical cases, but I noticed some discrepancies in the results.

Here is the test case I used:

eval_dict = [
    {"pred": "0.0833333333333333", "gt": "\\frac{1}{12}"},
    {"pred": "(1,4.5)", "gt": "(1,\\frac{9}{2})"},
    {"pred": "\\frac{x}{7}+\\frac{2}{7}",
        "gt": "\\frac{x+2}{7}"},
    {"pred": "\\sec^2(y)", "gt": "\\tan^2(y)+1"},
    {"pred": "\\begin{pmatrix}-\\frac{7}{4}&-2\\\\4&\\frac{1}{4}\\end{pmatrix}",
        "gt": "(\\begin{pmatrix}-\\frac{7}{4}&-2\\\\4&\\frac{1}{4}\\\\\\end{pmatrix})"},
    {"pred": '\\begin{pmatrix}\\frac{1}{3x^{2/3}}&0&0\\\\0&1&0\\\\-\\sin(x)&0&0\\end{pmatrix}',
    "gt": '(\\begin{pmatrix}\\frac{1}{3\\sqrt[3]{x}^2}&0&0\\\\0&1&0\\\\-\\sin(x)&0&0\\\\\\end{pmatrix})'},
    {"pred": '-\\frac{8x^2}{9(x^2-2)^{5/3}}+\\frac{2}{3(x^2-2)^{2/3}}',
    "gt": '-\\\frac{2(x^2+6)}{9(x^2-2)\\sqrt[3]{x^2-2}^2}'},
    {"pred": '-34x-45y+20z-100=0', "gt": '34x+45y-20z+100=0'},
    {"pred": '\\frac{100}{3}', "gt": '33.3'},
    {"pred": '\\begin{pmatrix}0.290243531202435\\\\0.196008371385084\\\\-0.186381278538813\\end{pmatrix}',
        "gt": '(\\begin{pmatrix}0.29\\\\0.196\\\\-0.186\\\\\\end{pmatrix})'},
    {"pred": '\\frac{\\sqrt{\\sqrt{11}+\\sqrt{194}}}{2\\sqrt{33}+15}',
        "gt": '\\frac{\\sqrt{\\sqrt{11}+\\sqrt{194}}}{15+2\\sqrt{33}}'},
    {"pred": '(+5)(b+2)', "gt": '(a+5)(b+2)'},
    {"pred": '\\frac{1+\\sqrt{5}}{2}', "gt": '2'},
    {"pred": '\\frac{34}{16}+\\frac{\\sqrt{1358}}{16}',
        "gt": '4'},
    {"pred": '1', "gt": '1\\\\sqrt{19}'},
    {"pred": "(0.6,2.6667]",
    "gt": "(\\\frac{3}{5},\\frac{8}{3}]"},
    {"pred": "x+2n+1", "gt": "x+1"},
    {"pred": "0.5", "gt":"2\\frac{1}{2}"}
]

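from math_verify import parse, verify

# Parse each pair and compare the prediction against the gold answer.
for idx, item in enumerate(eval_dict):
    gold = parse(item['gt'])
    pred = parse(item['pred'])
    print(
        f"[{idx}] pred: {item['pred']}, ground truth: {item['gt']}, result: {verify(gold, pred)}")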

The output I received is:

[0] pred: 0.0833333333333333, ground truth: \frac{1}{12}, result: True
[1] pred: (1,4.5), ground truth: (1,\frac{9}{2}), result: False
[2] pred: \frac{x}{7}+\frac{2}{7}, ground truth: \frac{x+2}{7}, result: False
[3] pred: \sec^2(y), ground truth: \tan^2(y)+1, result: False
[4] pred: \begin{pmatrix}-\frac{7}{4}&-2\\4&\frac{1}{4}\end{pmatrix}, ground truth: (\begin{pmatrix}-\frac{7}{4}&-2\\4&\frac{1}{4}\\\end{pmatrix}), result: True
[5] pred: \begin{pmatrix}\frac{1}{3x^{2/3}}&0&0\\0&1&0\\-\sin(x)&0&0\end{pmatrix}, ground truth: (\begin{pmatrix}\frac{1}{3\sqrt[3]{x}^2}&0&0\\0&1&0\\-\sin(x)&0&0\\\end{pmatrix}), result: False
[6] pred: -\frac{8x^2}{9(x^2-2)^{5/3}}+\frac{2}{3(x^2-2)^{2/3}}, ground truth: -\
                                                                                 rac{2(x^2+6)}{9(x^2-2)\sqrt[3]{x^2-2}^2}, result: False
[7] pred: -34x-45y+20z-100=0, ground truth: 34x+45y-20z+100=0, result: True
[8] pred: \frac{100}{3}, ground truth: 33.3, result: False
[9] pred: \begin{pmatrix}0.290243531202435\\0.196008371385084\\-0.186381278538813\end{pmatrix}, ground truth: (\begin{pmatrix}0.29\\0.196\\-0.186\\\end{pmatrix}), result: False
[10] pred: \frac{\sqrt{\sqrt{11}+\sqrt{194}}}{2\sqrt{33}+15}, ground truth: \frac{\sqrt{\sqrt{11}+\sqrt{194}}}{15+2\sqrt{33}}, result: False
[11] pred: (+5)(b+2), ground truth: (a+5)(b+2), result: False
[12] pred: \frac{1+\sqrt{5}}{2}, ground truth: 2, result: False
[13] pred: \frac{34}{16}+\frac{\sqrt{1358}}{16}, ground truth: 4, result: False
[14] pred: 1, ground truth: 1\\sqrt{19}, result: True
[15] pred: (0.6,2.6667], ground truth: (\
                                         rac{3}{5},\frac{8}{3}], result: False
[16] pred: x+2n+1, ground truth: x+1, result: False
[17] pred: 0.5, ground truth: 2\frac{1}{2}, result: True

The results for cases 1, 2, 3, 5, 6, 10, 14, 15, and 17 are all wrong.

Is there an issue in how I am using the API or the verify function?
Formatting Dependency: I noticed that wrapping answers in $$ seems to increase accuracy, but some models or formats do not prefer this notation. How can this issue be addressed to support plain formats effectively?
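
A side note on cases 6 and 15 in the first listing: their ground-truth strings contain `\\\frac`, which Python reads as an escaped backslash followed by a form feed (`\f`), so those two inputs never contain `\frac` at all. That is probably why the printed output breaks across lines there. The sketch below uses a shortened stand-in string (an assumption for illustration, not the exact expression from the test) to show the effect and how a raw string avoids it.

```python
# "\\\f" is an escaped backslash ("\\") followed by a form feed ("\f"),
# so the "f" of "frac" is swallowed by the escape sequence.
buggy = "-\\\frac{2(x^2+6)}{9}"   # shortened stand-in for the case 6 ground truth
fixed = "-\\frac{2(x^2+6)}{9}"    # two backslashes: one literal backslash, then "frac"
raw = r"-\frac{2(x^2+6)}{9}"      # raw string: no escape processing at all

print("\f" in buggy)      # a form feed character is present
print("frac" in buggy)    # the command name is no longer intact
print(fixed == raw)       # both spellings give the same LaTeX source
```

Writing LaTeX test inputs as raw strings (`r"..."`) sidesteps this class of mistake.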


hynky1999 · 2025-01-28T11:22:02Z

Formatting Dependency: I noticed that wrapping answers in $$ seems to increase accuracy, but some models or formats do not prefer this notation.

Hi, indeed: if you want to parse LaTeX, it needs to be in a LaTeX environment (so wrap it in any LaTeX environment notation such as $$, \[ \], or others). Simple expressions like 1/2 or 1.0222 don't need to be wrapped; they can be picked up from the string directly.
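
As a rough sketch of that advice, a bare LaTeX answer could be wrapped before it is handed to the parser. The helper name `ensure_latex_env` is hypothetical and not part of any library; it only adds `$...$` when no math delimiters are already present.

```python
def ensure_latex_env(answer: str) -> str:
    """Wrap a bare answer in $...$ unless it already has math delimiters.

    Hypothetical helper for illustration; not part of any library.
    """
    stripped = answer.strip()
    delimiters = (("$", "$"), ("\\[", "\\]"), ("\\(", "\\)"))
    for opening, closing in delimiters:
        long_enough = len(stripped) >= len(opening) + len(closing)
        if long_enough and stripped.startswith(opening) and stripped.endswith(closing):
            return stripped  # already delimited, leave untouched
    return f"${stripped}$"


print(ensure_latex_env("\\frac{1}{12}"))     # $\frac{1}{12}$
print(ensure_latex_env("$\\frac{x+2}{7}$"))  # already wrapped, returned as-is
```

The wrapped strings would then be passed to `parse` in place of the raw ones.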

Can you try rerunning with the $$ and show whether there are still any failure cases?

How can this issue be addressed to support plain formats effectively?

I don't see an easy way to address this. The way it works is that each target (latex or expr) has a set of regexes that are used to identify the answer. All latex regexes require the LaTeX environment to match, so if the model doesn't output it, no LaTeX parsing will be done. Why is it done this way? Because recovering the answer from free text is incredibly hard with rule-based parsing; here an LLM could probably be useful :). So yeah, the only way to fix this, in my opinion, is to use an LLM to recover the answer.
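
To make the mechanism concrete, here is a deliberately simplified sketch of regex-based extraction with two targets. The patterns and the `extract_candidates` function are assumptions made up for illustration; the library's actual regexes are more involved and are not reproduced here.

```python
import re

# Illustrative patterns only (assumed for this sketch).
# LaTeX candidates must sit inside $$...$$, $...$, or \[...\].
LATEX_ENV = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$|\\\[(.+?)\\\]", re.DOTALL)
# Plain expressions: integers, decimals, and simple a/b fractions.
PLAIN_EXPR = re.compile(r"-?\d+(?:\.\d+)?(?:\s*/\s*\d+)?")


def extract_candidates(text: str) -> dict:
    """Return the substrings each target would pick up from the text."""
    latex = [
        next(group for group in match.groups() if group is not None)
        for match in LATEX_ENV.finditer(text)
    ]
    plain = PLAIN_EXPR.findall(text)
    return {"latex": latex, "expr": plain}


# Without delimiters the LaTeX target finds nothing to parse.
print(extract_candidates("\\sec^2(y)")["latex"])
# With delimiters the same answer is picked up.
print(extract_candidates("$\\sec^2(y)$")["latex"])
# Simple expressions are found without any wrapping.
print(extract_candidates("so the result is 1/2")["expr"])
```

In this toy version, an unwrapped LaTeX string is invisible to the LaTeX target, which mirrors the behaviour described above.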

but some models or formats do not prefer this notation

Do you have an example? I tuned the settings based on popular models, and from what I have seen most of them can easily output the LaTeX environment. Frankly, if a model can't output a LaTeX environment I am very doubtful about its math abilities. There are two things that are parsable without a LaTeX environment: a simple \frac and the \boxed environment.

xiaobanni · 2025-01-28T11:44:31Z

Thank you for your prompt reply!

Let me first address the issue with the code snippet provided:

eval_dict = [
    {"pred": "$0.0833333333333333$", "gt": "$\\frac{1}{12}$"},
    {"pred": "$1,4.5$", "gt": "$1,\\frac{9}{2}$"},
    {"pred": "$\\frac{x}{7}+\\frac{2}{7}$",
        "gt": "$\\frac{x+2}{7}$", "timeout": True},
    {"pred": "$\\sec^2(y)$", "gt": "$\\tan^2(y)+1$", "timeout": True},
    {"pred": "$\\begin{pmatrix}-\\frac{7}{4}&-2\\\\4&\\frac{1}{4}\\end{pmatrix}$",
        "gt": "$(\\begin{pmatrix}-\\frac{7}{4}&-2\\\\4&\\frac{1}{4}\\\\\\end{pmatrix})$", "timeout": True},
    {"pred": '$\\begin{pmatrix}\\frac{1}{3x^{2/3}}&0&0\\\\0&1&0\\\\-\\sin(x)&0&0\\end{pmatrix}$',
     "gt": '$(\\begin{pmatrix}\\frac{1}{3\\sqrt[3]{x}^2}&0&0\\\\0&1&0\\\\-\\sin(x)&0&0\\\\\\end{pmatrix})$', "timeout": True},
    {"pred": '$-\\frac{8x^2}{9(x^2-2)^{5/3}}+\\frac{2}{3(x^2-2)^{2/3}}$',
     "gt": '$-\\frac{2(x^2+6)}{9(x^2-2)\\sqrt[3]{x^2-2}^2}$', "timeout": True},
    {"pred": '$-34x-45y+20z-100=0$', "gt": '$34x+45y-20z+100=0$'},
    {"pred": '$\\frac{100}{3}$', "gt": '$33.3$'},
    {"pred": '$\\begin{pmatrix}0.290243531202435\\\\0.196008371385084\\\\-0.186381278538813\\end{pmatrix}$',
        "gt": '$(\\begin{pmatrix}0.29\\\\0.196\\\\-0.186\\\\\\end{pmatrix})$', "timeout": True},
    {"pred": '$\\frac{\\sqrt{\\sqrt{11}+\\sqrt{194}}}{2\\sqrt{33}+15}$',
        "gt": '$\\frac{\\sqrt{\\sqrt{11}+\\sqrt{194}}}{15+2\\sqrt{33}}$', "timeout": True},
    {"pred": '$(+5)(b+2)$', "gt": '$(a+5)(b+2)$', "timeout": True},
    {"pred": '$\\frac{1+\\sqrt{5}}{2}$', "gt": '$2$', "timeout": True},
    {"pred": '$\\frac{34}{16}+\\frac{\\sqrt{1358}}{16}$',
        "gt": '$4$', "timeout": True},
    {"pred": '$1$', "gt": '$1\\\\sqrt{19}$', "timeout": True},
    {"pred": '$(0.6,2.6667]$',
     "gt": "$(\\frac{3}{5},\\frac{8}{3}]$", "timeout": True},
    {"pred": '$x+2n+1$', "gt": '$x+1$', "timeout": True},
    {"pred": "$1$", "gt": "$2\\frac{1}{2}$"}
]

And the output is:
[0] pred: $0.0833333333333333$, ground truth: $\frac{1}{12}$, result: True
[1] pred: $1,4.5$, ground truth: $1,\frac{9}{2}$, result: True
[2] pred: $\frac{x}{7}+\frac{2}{7}$, ground truth: $\frac{x+2}{7}$, result: True
[3] pred: $\sec^2(y)$, ground truth: $\tan^2(y)+1$, result: True
[4] pred: $\begin{pmatrix}-\frac{7}{4}&-2\\4&\frac{1}{4}\end{pmatrix}$, ground truth: $(\begin{pmatrix}-\frac{7}{4}&-2\\4&\frac{1}{4}\\\end{pmatrix})$, result: True
[5] pred: $\begin{pmatrix}\frac{1}{3x^{2/3}}&0&0\\0&1&0\\-\sin(x)&0&0\end{pmatrix}$, ground truth: $(\begin{pmatrix}\frac{1}{3\sqrt[3]{x}^2}&0&0\\0&1&0\\-\sin(x)&0&0\\\end{pmatrix})$, result: True
[6] pred: $-\frac{8x^2}{9(x^2-2)^{5/3}}+\frac{2}{3(x^2-2)^{2/3}}$, ground truth: $-\frac{2(x^2+6)}{9(x^2-2)\sqrt[3]{x^2-2}^2}$, result: True
[7] pred: $-34x-45y+20z-100=0$, ground truth: $34x+45y-20z+100=0$, result: True
[8] pred: $\frac{100}{3}$, ground truth: $33.3$, result: False
[9] pred: $\begin{pmatrix}0.290243531202435\\0.196008371385084\\-0.186381278538813\end{pmatrix}$, ground truth: $(\begin{pmatrix}0.29\\0.196\\-0.186\\\end{pmatrix})$, result: False
[10] pred: $\frac{\sqrt{\sqrt{11}+\sqrt{194}}}{2\sqrt{33}+15}$, ground truth: $\frac{\sqrt{\sqrt{11}+\sqrt{194}}}{15+2\sqrt{33}}$, result: True
[11] pred: $(+5)(b+2)$, ground truth: $(a+5)(b+2)$, result: False
[12] pred: $\frac{1+\sqrt{5}}{2}$, ground truth: $2$, result: False
[13] pred: $\frac{34}{16}+\frac{\sqrt{1358}}{16}$, ground truth: $4$, result: False
[14] pred: $1$, ground truth: $1\\sqrt{19}$, result: False
[15] pred: $(0.6,2.6667]$, ground truth: $(\frac{3}{5},\frac{8}{3}]$, result: False
[16] pred: $x+2n+1$, ground truth: $x+1$, result: False
[17] pred: $1$, ground truth: $2\frac{1}{2}$, result: True

Cases 8, 9, 15: I think these should be considered correct, since the differences are within numerical precision limits.
Case 17: The result is incorrect because $2\frac{1}{2}$ corresponds to a value of 2.5, which does not match $1$.

hynky1999 · 2025-01-28T11:54:49Z

17 is a good catch: it is parsed as 2*1/2. Will fix.
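
The arithmetic behind case 17, as a small sketch using Python's standard `fractions` module (the variable names are chosen here for illustration):

```python
from fractions import Fraction

whole, num, den = 2, 1, 2  # the pieces of 2\frac{1}{2}

as_mixed_number = whole + Fraction(num, den)  # intended reading: 2 + 1/2
as_product = whole * Fraction(num, den)       # implicit multiplication: 2 * 1/2

print(as_mixed_number)  # 5/2
print(as_product)       # 1
```

Read as a product, the ground truth collapses to 1, which is why it compared equal to the prediction $1$.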

8, 9, 15: there is a precision argument in verify, but when comparing a fraction with a float it would be possible to infer the precision from the number of decimals in the float. The same could be done for float-to-float comparison, where the precision would be taken from the gold. It certainly wouldn't be the default, though. For the fraction inference it's fine, but the issue is with rounding to the smaller precision: 0.33 would then be equal to 0.334 (not ideal), and likewise 0.1299 would be equal to 0.1 just because we took the smallest precision. Thinking about it again, I don't think it makes sense to adjust the precision to the smaller one! But for frac vs float it makes good sense 👍
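
A minimal sketch of the inference idea and its downside, using only the standard library. The function names are hypothetical and this is not how the library compares values; it only illustrates rounding the prediction to the number of decimals written in the gold.

```python
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction


def decimal_places(text: str) -> int:
    """Digits after the decimal point in a literal such as '33.3'."""
    return max(0, -Decimal(text).as_tuple().exponent)


def equal_at_gold_precision(pred: Fraction, gold_text: str) -> bool:
    """Round pred to the gold's decimal places, then compare (assumed rule)."""
    places = decimal_places(gold_text)
    quantum = Decimal(1).scaleb(-places)  # places=1 -> Decimal('0.1')
    pred_decimal = Decimal(pred.numerator) / Decimal(pred.denominator)
    rounded = pred_decimal.quantize(quantum, rounding=ROUND_HALF_UP)
    return rounded == Decimal(gold_text)


# Fraction vs float: 100/3 rounds to 33.3, so case 8 would pass.
print(equal_at_gold_precision(Fraction(100, 3), "33.3"))      # True
# Float vs float: the same rule also accepts less convincing pairs.
print(equal_at_gold_precision(Fraction("0.334"), "0.33"))     # True
print(equal_at_gold_precision(Fraction("0.1299"), "0.1"))     # True
```

The last two lines are the false positives mentioned above: rounding to the coarser precision discards digits that may well matter.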

hynky1999 · 2025-02-04T11:26:21Z

The mixed fractions should be working now :)
I decided not to implement the precision inference since, on second thought, it could easily introduce false positives.
#4

This was referenced Jan 28, 2025

Fix parsing of mixed fractions #3

Closed

Allow precission inference from gold #4

Closed

hynky1999 closed this as completed Feb 4, 2025
