from math_verify import parse, verify
import sympy
This is a TIL (“Today I Learned”) post. I expect it to be useful to my future self and maybe to others, but it is meant to be a quick, informal way to capture something I learned rather than a polished presentation.
Hugging Face’s Math-Verify library provides relatively robust tools to evaluate LLM performance on math problems. Its README demonstrates using it by calling its `parse` function on both the LLM output and the gold answer, and then passing those results to `verify`. My last post examined the `parse` function. This post examines the `verify` function.
Comparing Lists
We saw previously that `parse` returns a list which may contain both a `sympy` expression and a string:
parse("1/3")
[1/3, '1/3']
`verify` ostensibly compares everything in the first list with everything in the second list, and returns `True` if any of those combinations pass its equality check. However, its equality check always returns `False` for the combination of a `sympy` expression and a string, so in practice it just indicates whether either the two `sympy` expressions or the two strings are equal to each other.
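As a rough mental model (this is my own sketch, not Math-Verify’s actual code), the comparison behaves something like this:
def verify_sketch(gold, target):
    # Simplified stand-in for Math-Verify's per-item equality check:
    # only like types are compared, so a `sympy` expression never matches a string
    def items_equal(g, t):
        if isinstance(g, str) and isinstance(t, str):
            return g.strip() == t.strip() != ""
        if isinstance(g, sympy.Basic) and isinstance(t, sympy.Basic):
            return sympy.simplify(g - t) == 0
        return False
    # True if any gold/target pair passes the check
    return any(items_equal(g, t) for g in gold for t in target)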
zero = sympy.Number(0)
one = sympy.Number(1)
# Everything is equal
=[zero, "0"], target=[zero, "0"]) verify(gold
True
# `gold` and `target` are each internally consistent but are not equal to each other
=[zero, "0"], target=[one, "1"]) verify(gold
False
# `gold` and `target` sympy expressions are equal to each other while their strings are not
=[zero, "1"], target=[zero, "2"]) verify(gold
True
# `gold` and `target` strings are equal to each other while their `sympy` expressions are not
=[zero, "2"], target=[one, "2"]) verify(gold
True
# `gold` `sympy` expression is equal to `target` string and vice versa
=[zero, "1"], target=[one, "0"]) verify(gold
False
# `gold` and `target` indicate the same value, but one is a `sympy` expression and the other is a string
=[zero], target=["0"]) verify(gold
False
This last example might seem surprising. The thinking behind this behavior, as I understand this comment, is that a string should only be present without a corresponding `sympy` expression if parsing that string failed, and it is unlikely that parsing failed on one side and yet the string on that side is genuinely equal to the `sympy` expression on the other side. This rationale makes sense on the assumption that the input to `verify` came from `parse`, which is probably what we want but could be documented more explicitly, as I suggested in this issue.
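For what it’s worth, the mismatch above goes away when both sides come through `parse`, since parsing produces the `sympy` expression alongside the string (assuming `parse("0")` behaves like the `parse("1/3")` example above):
# With parsed inputs, the `sympy` expressions on both sides carry the comparison
verify(gold=parse("0"), target=parse("0"))
True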
Equality for Strings
Equality for strings is simply Python `==` after stripping whitespace off the ends and ensuring that the strings are not both empty. This approach is obviously imperfect, but it is meant only to catch some of the cases where `sympy` parsing fails.
=["1/3"], target=["1 / 3"]) verify(gold
False
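Whitespace at the ends, on the other hand, should not matter (if I am reading the stripping behavior correctly):
# Leading and trailing whitespace is stripped before comparing
verify(gold=["1/3 "], target=[" 1/3"])
True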
Equality for `sympy` Expressions
Equality for `sympy` expressions is complex. At its core, it uses `sympy` functionality such as `Eq` and `evalf` after applying various normalization steps, with support for a few options for strictness:
- `float_rounding`: Number of decimal places to round floats to. Defaults to 6.
- `numeric_precision`: Number of decimal places to consider for numeric comparisons. Defaults to 15.
  - If you know the evaluated expressions will be small, you should increase this. See: https://docs.sympy.org/latest/modules/evalf.html
- `strict`: Whether to enforce strict comparison mode. Defaults to True.
  - In strict mode: variables matter and sets are not comparable with tuples.
  - In non-strict mode: variables are matched by position and sets can be compared with tuples (see the example after this list).
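I have not tested strict mode myself, but based on that description I would expect a set and a tuple with the same elements to compare equal only when `strict=False`; something like this (an untested guess on my part):
# Hypothetical illustration of `strict`: a set vs. a tuple with the same elements
verify(gold=[sympy.FiniteSet(1, 2)], target=[sympy.Tuple(1, 2)], strict=False)
True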
The presence of both `numeric_precision` and `float_rounding` parameters could lead to confusion, as this issue notes: setting one of them to a high value will not have the result one might expect if the other is lower:
# `float_rounding` is 6 by default, so increasing `numeric_precision` has no effect in this case
"0.0000001"), parse("0.0000002"), numeric_precision=99999) verify(parse(
True
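Presumably raising `float_rounding` as well would let the difference at the seventh decimal place register; I expect something like the following, though I have not checked the exact interaction:
# Raising both parameters should keep the two values distinct
verify(parse("0.0000001"), parse("0.0000002"), float_rounding=10, numeric_precision=99999)
False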
Conclusion
I see two sharp edges in Math-Verify’s `verify` function: it assumes that its inputs have passed through `parse`, and it has `numeric_precision` and `float_rounding` parameters that need to be adjusted together to avoid unexpected behavior. Otherwise it seems like a smart approach to the difficult problem of comparing LLM outputs to gold answers on open-ended math problems without relying on an LLM judge.