Some wrong cases #2

Closed
xiaobanni opened this issue Jan 28, 2025 · 4 comments

@xiaobanni

Thank you for your excellent work on this project! I have been testing the evaluation functionality with various mathematical cases, but I noticed some discrepancies in the results.

Here is the test case I used:

eval_dict = [
    {"pred": "0.0833333333333333", "gt": "\\frac{1}{12}"},
    {"pred": "(1,4.5)", "gt": "(1,\\frac{9}{2})"},
    {"pred": "\\frac{x}{7}+\\frac{2}{7}",
        "gt": "\\frac{x+2}{7}"},
    {"pred": "\\sec^2(y)", "gt": "\\tan^2(y)+1"},
    {"pred": "\\begin{pmatrix}-\\frac{7}{4}&-2\\\\4&\\frac{1}{4}\\end{pmatrix}",
        "gt": "(\\begin{pmatrix}-\\frac{7}{4}&-2\\\\4&\\frac{1}{4}\\\\\\end{pmatrix})"},
    {"pred": '\\begin{pmatrix}\\frac{1}{3x^{2/3}}&0&0\\\\0&1&0\\\\-\\sin(x)&0&0\\end{pmatrix}',
    "gt": '(\\begin{pmatrix}\\frac{1}{3\\sqrt[3]{x}^2}&0&0\\\\0&1&0\\\\-\\sin(x)&0&0\\\\\\end{pmatrix})'},
    {"pred": '-\\frac{8x^2}{9(x^2-2)^{5/3}}+\\frac{2}{3(x^2-2)^{2/3}}',
    "gt": '-\\\frac{2(x^2+6)}{9(x^2-2)\\sqrt[3]{x^2-2}^2}'},
    {"pred": '-34x-45y+20z-100=0', "gt": '34x+45y-20z+100=0'},
    {"pred": '\\frac{100}{3}', "gt": '33.3'},
    {"pred": '\\begin{pmatrix}0.290243531202435\\\\0.196008371385084\\\\-0.186381278538813\\end{pmatrix}',
        "gt": '(\\begin{pmatrix}0.29\\\\0.196\\\\-0.186\\\\\\end{pmatrix})'},
    {"pred": '\\frac{\\sqrt{\\sqrt{11}+\\sqrt{194}}}{2\\sqrt{33}+15}',
        "gt": '\\frac{\\sqrt{\\sqrt{11}+\\sqrt{194}}}{15+2\\sqrt{33}}'},
    {"pred": '(+5)(b+2)', "gt": '(a+5)(b+2)'},
    {"pred": '\\frac{1+\\sqrt{5}}{2}', "gt": '2'},
    {"pred": '\\frac{34}{16}+\\frac{\\sqrt{1358}}{16}',
        "gt": '4'},
    {"pred": '1', "gt": '1\\\\sqrt{19}'},
    {"pred": "(0.6,2.6667]",
    "gt": "(\\\frac{3}{5},\\frac{8}{3}]"},
    {"pred": "x+2n+1", "gt": "x+1"},
    {"pred": "0.5", "gt":"2\\frac{1}{2}"}
]

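# Assumption: parse and verify are imported from the project's math_verify package.
from math_verify import parse, verify

for idx, item in enumerate(eval_dict):
    gold = parse(item['gt'])
    pred = parse(item['pred'])
    print(
        f"[{idx}] pred: {item['pred']}, ground truth: {item['gt']}, result: {verify(gold, pred)}")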

The output I received is:

[0] pred: 0.0833333333333333, ground truth: \frac{1}{12}, result: True
[1] pred: (1,4.5), ground truth: (1,\frac{9}{2}), result: False
[2] pred: \frac{x}{7}+\frac{2}{7}, ground truth: \frac{x+2}{7}, result: False
[3] pred: \sec^2(y), ground truth: \tan^2(y)+1, result: False
[4] pred: \begin{pmatrix}-\frac{7}{4}&-2\\4&\frac{1}{4}\end{pmatrix}, ground truth: (\begin{pmatrix}-\frac{7}{4}&-2\\4&\frac{1}{4}\\\end{pmatrix}), result: True
[5] pred: \begin{pmatrix}\frac{1}{3x^{2/3}}&0&0\\0&1&0\\-\sin(x)&0&0\end{pmatrix}, ground truth: (\begin{pmatrix}\frac{1}{3\sqrt[3]{x}^2}&0&0\\0&1&0\\-\sin(x)&0&0\\\end{pmatrix}), result: False
[6] pred: -\frac{8x^2}{9(x^2-2)^{5/3}}+\frac{2}{3(x^2-2)^{2/3}}, ground truth: -\
                                                                                 rac{2(x^2+6)}{9(x^2-2)\sqrt[3]{x^2-2}^2}, result: False
[7] pred: -34x-45y+20z-100=0, ground truth: 34x+45y-20z+100=0, result: True
[8] pred: \frac{100}{3}, ground truth: 33.3, result: False
[9] pred: \begin{pmatrix}0.290243531202435\\0.196008371385084\\-0.186381278538813\end{pmatrix}, ground truth: (\begin{pmatrix}0.29\\0.196\\-0.186\\\end{pmatrix}), result: False
[10] pred: \frac{\sqrt{\sqrt{11}+\sqrt{194}}}{2\sqrt{33}+15}, ground truth: \frac{\sqrt{\sqrt{11}+\sqrt{194}}}{15+2\sqrt{33}}, result: False
[11] pred: (+5)(b+2), ground truth: (a+5)(b+2), result: False
[12] pred: \frac{1+\sqrt{5}}{2}, ground truth: 2, result: False
[13] pred: \frac{34}{16}+\frac{\sqrt{1358}}{16}, ground truth: 4, result: False
[14] pred: 1, ground truth: 1\\sqrt{19}, result: True
[15] pred: (0.6,2.6667], ground truth: (\
                                         rac{3}{5},\frac{8}{3}], result: False
[16] pred: x+2n+1, ground truth: x+1, result: False
[17] pred: 0.5, ground truth: 2\frac{1}{2}, result: True

The evaluation results for cases 1, 2, 3, 5, 6, 10, 14, 15, and 17 are all wrong.

  1. Is there an issue in how I am using the API or the verify function?
  2. Formatting Dependency: I noticed that wrapping answers in $$ seems to increase accuracy, but some models or formats do not prefer this notation. How can this issue be addressed to support plain formats effectively?
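
One usage detail worth checking in the test list above: the ground truths for cases 6 and 15 are written with three backslashes before `frac`. In a regular Python string `\\` is a single backslash and `\f` is a form feed, which would explain the broken line in the printed output for those two cases. A minimal sketch using only the standard library (the expression is shortened for illustration):

```python
# In a regular (non-raw) Python string, "\\" is one backslash and "\f" is a form feed.
typo = "-\\\frac{2(x^2+6)}{9}"   # three backslashes before "frac", as in case 6
fixed = "-\\frac{2(x^2+6)}{9}"   # two backslashes: a literal \frac
raw = r"-\frac{2(x^2+6)}{9}"     # raw string: same text as `fixed`

print(repr(typo))                # shows the form feed as \x0c
print("\f" in typo, "\f" in fixed, fixed == raw)  # True False True
```

Writing LaTeX test strings as raw strings avoids this class of escape mistake.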
@hynky1999
Collaborator

hynky1999 commented Jan 28, 2025

> Formatting Dependency: I noticed that wrapping answers in $$ seems to increase accuracy, but some models or formats do not prefer this notation.

Hi, indeed: if you want to parse LaTeX, it needs to be in a LaTeX environment (so wrap it in any LaTeX environment notation such as $$, \[ \], or others). Simple expressions like 1/2 or 1.0222 don't need to be wrapped, since they can be picked up directly from the string.

Can you try rerunning with the $$ and show whether there are still any failure cases?
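
For anyone rerunning with plain strings, a small pre-processing step along these lines might help. This is a minimal sketch, not part of the library: the helper name `wrap_in_math_env` is hypothetical, and it only recognises `$...$`, `\[...\]`, and `\(...\)` as existing delimiters.

```python
def wrap_in_math_env(answer: str) -> str:
    """Wrap a bare LaTeX answer in $...$ unless it already has a math delimiter."""
    stripped = answer.strip()
    already_wrapped = (
        (len(stripped) >= 2 and stripped.startswith("$") and stripped.endswith("$"))
        or (stripped.startswith("\\[") and stripped.endswith("\\]"))
        or (stripped.startswith("\\(") and stripped.endswith("\\)"))
    )
    return stripped if already_wrapped else f"${stripped}$"


print(wrap_in_math_env(r"\frac{x+2}{7}"))    # $\frac{x+2}{7}$
print(wrap_in_math_env(r"$\tan^2(y)+1$"))    # unchanged
```

The wrapped string can then be passed to the parser in place of the bare one.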

> How can this issue be addressed to support plain formats effectively?

I don't see an easy way to address this. The way it works is that each target (latex or expr) has a set of regexes that are used to identify the answer. All latex regexes require the LaTeX environment to match, so if the model doesn't output it, no LaTeX parsing is done. Why is it done this way? Because recalling what the answer is from the text is incredibly hard with rule-based parsing; here an LLM could probably be useful :). So yeah, the only way to fix this, imo, is to use an LLM to recall what the answer is.
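
To make the per-target idea concrete, here is a rough sketch with two hypothetical patterns. These are not the library's actual regexes, which are considerably more involved; the names `LATEX_ENV`, `PLAIN_EXPR`, and `extract_candidates` are made up for illustration.

```python
import re

# Hypothetical patterns for illustration only; the library's real regexes differ.
LATEX_ENV = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$|\\\[(.+?)\\\]", re.DOTALL)
PLAIN_EXPR = re.compile(r"-?\d+(?:\.\d+)?(?:\s*/\s*\d+)?")


def extract_candidates(text: str) -> dict:
    """Collect LaTeX candidates (delimited only) and plain numeric expressions."""
    latex = [
        next(g for g in m.groups() if g is not None)
        for m in LATEX_ENV.finditer(text)
    ]
    return {"latex": latex, "expr": PLAIN_EXPR.findall(text)}


print(extract_candidates(r"\frac{x+2}{7}")["latex"])     # [] (no delimiters, no LaTeX match)
print(extract_candidates(r"$\frac{x+2}{7}$")["latex"])   # one LaTeX candidate
print(extract_candidates("The answer is 1/2")["expr"])   # ['1/2']
```

Without delimiters the LaTeX target finds nothing, and only the simple numeric pattern has a chance to match.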

> but some models or formats do not prefer this notation

Do you have an example? I tuned the settings based on popular models, and from what I have seen most of them can easily output the LaTeX environment. Frankly, if a model can't output a LaTeX environment, I am very doubtful about its math abilities. Two things are parsable without a LaTeX environment: a simple \frac and the \boxed env.

@xiaobanni
Author

Thank you for your prompt reply!

Let me first address the issue with the code snippet provided:

eval_dict = [
    {"pred": "$0.0833333333333333$", "gt": "$\\frac{1}{12}$"},
    {"pred": "$1,4.5$", "gt": "$1,\\frac{9}{2}$"},
    {"pred": "$\\frac{x}{7}+\\frac{2}{7}$",
        "gt": "$\\frac{x+2}{7}$", "timeout": True},
    {"pred": "$\\sec^2(y)$", "gt": "$\\tan^2(y)+1$", "timeout": True},
    {"pred": "$\\begin{pmatrix}-\\frac{7}{4}&-2\\\\4&\\frac{1}{4}\\end{pmatrix}$",
        "gt": "$(\\begin{pmatrix}-\\frac{7}{4}&-2\\\\4&\\frac{1}{4}\\\\\\end{pmatrix})$", "timeout": True},
    {"pred": '$\\begin{pmatrix}\\frac{1}{3x^{2/3}}&0&0\\\\0&1&0\\\\-\\sin(x)&0&0\\end{pmatrix}$',
     "gt": '$(\\begin{pmatrix}\\frac{1}{3\\sqrt[3]{x}^2}&0&0\\\\0&1&0\\\\-\\sin(x)&0&0\\\\\\end{pmatrix})$', "timeout": True},
    {"pred": '$-\\frac{8x^2}{9(x^2-2)^{5/3}}+\\frac{2}{3(x^2-2)^{2/3}}$',
     "gt": '$-\\frac{2(x^2+6)}{9(x^2-2)\\sqrt[3]{x^2-2}^2}$', "timeout": True},
    {"pred": '$-34x-45y+20z-100=0$', "gt": '$34x+45y-20z+100=0$'},
    {"pred": '$\\frac{100}{3}$', "gt": '$33.3$'},
    {"pred": '$\\begin{pmatrix}0.290243531202435\\\\0.196008371385084\\\\-0.186381278538813\\end{pmatrix}$',
        "gt": '$(\\begin{pmatrix}0.29\\\\0.196\\\\-0.186\\\\\\end{pmatrix})$', "timeout": True},
    {"pred": '$\\frac{\\sqrt{\\sqrt{11}+\\sqrt{194}}}{2\\sqrt{33}+15}$',
        "gt": '$\\frac{\\sqrt{\\sqrt{11}+\\sqrt{194}}}{15+2\\sqrt{33}}$', "timeout": True},
    {"pred": '$(+5)(b+2)$', "gt": '$(a+5)(b+2)$', "timeout": True},
    {"pred": '$\\frac{1+\\sqrt{5}}{2}$', "gt": '$2$', "timeout": True},
    {"pred": '$\\frac{34}{16}+\\frac{\\sqrt{1358}}{16}$',
        "gt": '$4$', "timeout": True},
    {"pred": '$1$', "gt": '$1\\\\sqrt{19}$', "timeout": True},
    {"pred": '$(0.6,2.6667]$',
     "gt": "$(\\frac{3}{5},\\frac{8}{3}]$", "timeout": True},
    {"pred": '$x+2n+1$', "gt": '$x+1$', "timeout": True},
    {"pred": "$1$", "gt": "$2\\frac{1}{2}$"}
]

And the output is:
[0] pred: $0.0833333333333333$, ground truth: $\frac{1}{12}$, result: True
[1] pred: $1,4.5$, ground truth: $1,\frac{9}{2}$, result: True
[2] pred: $\frac{x}{7}+\frac{2}{7}$, ground truth: $\frac{x+2}{7}$, result: True
[3] pred: $\sec^2(y)$, ground truth: $\tan^2(y)+1$, result: True
[4] pred: $\begin{pmatrix}-\frac{7}{4}&-2\\4&\frac{1}{4}\end{pmatrix}$, ground truth: $(\begin{pmatrix}-\frac{7}{4}&-2\\4&\frac{1}{4}\\\end{pmatrix})$, result: True
[5] pred: $\begin{pmatrix}\frac{1}{3x^{2/3}}&0&0\\0&1&0\\-\sin(x)&0&0\end{pmatrix}$, ground truth: $(\begin{pmatrix}\frac{1}{3\sqrt[3]{x}^2}&0&0\\0&1&0\\-\sin(x)&0&0\\\end{pmatrix})$, result: True
[6] pred: $-\frac{8x^2}{9(x^2-2)^{5/3}}+\frac{2}{3(x^2-2)^{2/3}}$, ground truth: $-\frac{2(x^2+6)}{9(x^2-2)\sqrt[3]{x^2-2}^2}$, result: True
[7] pred: $-34x-45y+20z-100=0$, ground truth: $34x+45y-20z+100=0$, result: True
[8] pred: $\frac{100}{3}$, ground truth: $33.3$, result: False
[9] pred: $\begin{pmatrix}0.290243531202435\\0.196008371385084\\-0.186381278538813\end{pmatrix}$, ground truth: $(\begin{pmatrix}0.29\\0.196\\-0.186\\\end{pmatrix})$, result: False
[10] pred: $\frac{\sqrt{\sqrt{11}+\sqrt{194}}}{2\sqrt{33}+15}$, ground truth: $\frac{\sqrt{\sqrt{11}+\sqrt{194}}}{15+2\sqrt{33}}$, result: True
[11] pred: $(+5)(b+2)$, ground truth: $(a+5)(b+2)$, result: False
[12] pred: $\frac{1+\sqrt{5}}{2}$, ground truth: $2$, result: False
[13] pred: $\frac{34}{16}+\frac{\sqrt{1358}}{16}$, ground truth: $4$, result: False
[14] pred: $1$, ground truth: $1\\sqrt{19}$, result: False
[15] pred: $(0.6,2.6667]$, ground truth: $(\frac{3}{5},\frac{8}{3}]$, result: False
[16] pred: $x+2n+1$, ground truth: $x+1$, result: False
[17] pred: $1$, ground truth: $2\frac{1}{2}$, result: True

  • Cases 8, 9, 15: I think these should be considered correct, since the differences are within numerical precision limits.
  • Case 17: The result is incorrect because $2\frac{1}{2}$ corresponds to a value of 2.5, which does not match $1$.

@hynky1999
Collaborator

hynky1999 commented Jan 28, 2025

17 is a good catch: it gets parsed as 2*1/2. Will fix.
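
The two readings of `2\frac{1}{2}` can be checked with plain arithmetic. A short sketch using the standard library only (it does not call the parser itself):

```python
from fractions import Fraction

whole, frac = 2, Fraction(1, 2)

as_mixed_number = whole + frac   # 2 1/2 read as a mixed number
as_product = whole * frac        # 2 * 1/2 read as implicit multiplication

print(as_mixed_number, float(as_mixed_number))  # 5/2 2.5
print(as_product)                               # 1
```

The product reading gives 1, which is why a prediction of `$1$` was reported as matching in case 17.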

8, 9, 15: there is a precision argument in verify, but when comparing a fraction with a float it would also be possible to infer the precision from the number of decimals in the float. The same could be done for float-to-float comparison, where the precision would be taken from the gold. It certainly wouldn't be the default, though. For the fraction inference it's fine, but the issue is with rounding to the smaller precision: 0.33 would then equal 0.334 (not ideal), and 0.1299 would equal 0.1 just because we took the smallest precision. Thinking about it again, I don't think it makes sense to adjust the precision to the smaller one! But for frac vs. float it makes good sense 👍
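
A rough sketch of the idea discussed here, using only the standard library. The function names are hypothetical and this is not how verify is implemented; it just illustrates inferring precision from the float's decimals, and the false positive that appears when two floats are rounded to the smaller precision.

```python
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction


def decimals_in(text: str) -> int:
    """Number of digits after the decimal point in a decimal literal."""
    return max(0, -Decimal(text).as_tuple().exponent)


def frac_matches_float(frac: Fraction, float_text: str) -> bool:
    """Round the fraction to the float's own number of decimals, then compare."""
    quantum = Decimal(10) ** -decimals_in(float_text)
    exact = Decimal(frac.numerator) / Decimal(frac.denominator)
    return exact.quantize(quantum, rounding=ROUND_HALF_UP) == Decimal(float_text)


def floats_match_at_smaller_precision(a: str, b: str) -> bool:
    """Round both floats to the smaller number of decimals, then compare."""
    quantum = Decimal(10) ** -min(decimals_in(a), decimals_in(b))
    return (Decimal(a).quantize(quantum, rounding=ROUND_HALF_UP)
            == Decimal(b).quantize(quantum, rounding=ROUND_HALF_UP))


print(frac_matches_float(Fraction(100, 3), "33.3"))          # True  (case 8)
print(frac_matches_float(Fraction(8, 3), "2.6667"))          # True  (upper bound in case 15)
print(floats_match_at_smaller_precision("0.33", "0.334"))    # True  (the unwanted match)
print(floats_match_at_smaller_precision("0.1299", "0.1"))    # True  (the unwanted match)
```

The fraction-vs-float check behaves sensibly on the cases from this thread, while the float-to-float variant reproduces the two false positives mentioned above.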

@hynky1999
Collaborator

The mixed fractions should be working now :)
I decided not to implement the precision inference because, on second thought, it could easily induce false positives.
#4
