TIL: How Math-Verify Verifies LLM Outputs


Greg Gandenberger


February 22, 2025


This is a TIL (“Today I Learned”) post. I expect it to be useful to my future self and maybe to others, but it is meant to be a quick, informal way to capture something I learned rather than a polished presentation.

Hugging Face’s Math-Verify library provides relatively robust tools to evaluate LLM performance on math problems. Its README demonstrates using it by calling its parse function on both the LLM output and the gold answer, and then passing those results to verify. My last post examined the parse function. This post examines the verify function.

Comparing Lists

We saw previously that parse returns a list which may contain both a sympy expression and a string:

from math_verify import parse, verify
import sympy
# Example of what parse returns (a sympy expression and a string): [1/3, '1/3']

verify ostensibly compares everything in the first list with everything in the second list, and returns True if any of those combinations pass its equality check. However, its equality check always returns False for the combination of a sympy expression and a string, so in practice it just indicates whether either the two sympy expressions or the two strings are equal to each other.

zero = sympy.Number(0)
one = sympy.Number(1)

# Everything is equal
verify(gold=[zero, "0"], target=[zero, "0"])  # True
# `gold` and `target` are each internally consistent but are not equal to each other
verify(gold=[zero, "0"], target=[one, "1"])  # False
# `gold` and `target` sympy expressions are equal to each other while their strings are not
verify(gold=[zero, "1"], target=[zero, "2"])  # True
# `gold` and `target` strings are equal to each other while their `sympy` expressions are not
verify(gold=[zero, "2"], target=[one, "2"])  # True
# `gold` `sympy` expression is equal to `target` string and vice versa
verify(gold=[zero, "1"], target=[one, "0"])  # False
# `gold` and `target` indicate the same value, but one is a `sympy` expression and the other is a string
verify(gold=[zero], target=["0"])  # False

This last example might seem surprising. The thinking behind this behavior, as I understand this comment, is that a string should only be present without a corresponding sympy expression if parsing that string failed, and it is unlikely that parsing failed on one side and yet the string on that side is genuinely equal to the sympy expression on the other side. This rationale makes sense on the assumption that the input to verify came from parse, which is probably what we want but could be documented more explicitly, as I suggested in this issue.
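The comparison rule described above can be sketched in plain Python. This is a toy model rather than Math-Verify's actual code: `items_match` and `any_pair_matches` are hypothetical names, and `fractions.Fraction` stands in for a sympy expression so that the example has no dependencies.

```python
from fractions import Fraction
from itertools import product


def items_match(gold_item, target_item):
    """Toy equality check: only items of the same kind can match."""
    if isinstance(gold_item, str) and isinstance(target_item, str):
        return gold_item.strip() == target_item.strip()
    if isinstance(gold_item, Fraction) and isinstance(target_item, Fraction):
        return gold_item == target_item
    # Mixed pair (expression stand-in vs. string): never considered equal.
    return False


def any_pair_matches(gold, target):
    """Return True if any (gold, target) combination passes the check."""
    return any(items_match(g, t) for g, t in product(gold, target))


# Fraction is an assumption used here in place of sympy.Number.
zero, one = Fraction(0), Fraction(1)
print(any_pair_matches([zero, "1"], [zero, "2"]))  # True: the expressions match
print(any_pair_matches([zero, "1"], [one, "0"]))   # False: only mixed pairs agree
print(any_pair_matches([zero], ["0"]))             # False: expression vs. string
```

Because mixed pairs always fail, the "any combination" loop reduces to checking expression against expression and string against string.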

Equality for Strings

Equality for strings is simply Python == after stripping whitespace off the ends and ensuring that the strings are not both empty. This approach is obviously imperfect, but it is meant only to catch some of the cases where sympy parsing fails.

verify(gold=["1/3"], target=["1 / 3"])  # False: only the ends are stripped, so the inner spaces still differ
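A minimal sketch of the string rule as described above, assuming it amounts to stripping both ends, rejecting empty strings, and comparing with `==`. The name `strings_match` is hypothetical and is not part of Math-Verify.

```python
def strings_match(gold: str, target: str) -> bool:
    """Compare two strings after trimming whitespace from their ends."""
    gold, target = gold.strip(), target.strip()
    # Empty strings never count as a match, even against each other.
    return len(gold) > 0 and len(target) > 0 and gold == target


print(strings_match(" 1/3 ", "1/3"))  # True: surrounding whitespace is ignored
print(strings_match("1/3", "1 / 3"))  # False: inner whitespace is not normalized
print(strings_match("", "  "))        # False: both are empty after stripping
```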

Equality for sympy Expressions

Equality for sympy expressions is complex. At its core it uses sympy functionality such as Eq and evalf after applying various normalization steps, with support for a few options for strictness:

float_rounding: Number of decimal places to round floats to. Defaults to 6.
numeric_precision: Number of decimal places to consider for numeric comparisons. Defaults to 15.
    - If you know the evaluated expressions will be small, you should increase this. See: https://docs.sympy.org/latest/modules/evalf.html
strict: Whether to enforce strict comparison mode. Defaults to True.
    - In strict mode: Variables matter and sets are not comparable with tuples
    - In non-strict mode: Variables are matched by position and sets can be compared with tuples

The presence of both numeric_precision and float_rounding parameters could lead to confusion, as this issue notes: setting one of them to a high value will not have the result one might expect if the other is lower:

# `float_rounding` is 6 by default, so increasing `numeric_precision` has no effect in this case
verify(parse("0.0000001"), parse("0.0000002"), numeric_precision=99999)
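The interaction can be illustrated with a toy model in plain Python. It assumes that values are rounded to `float_rounding` decimal places before being compared, which is a simplification and not Math-Verify's actual implementation. The helper `rounded_equal` is hypothetical.

```python
def rounded_equal(a: float, b: float, float_rounding: int = 6) -> bool:
    """Compare two floats after rounding both to the same number of decimals."""
    return round(a, float_rounding) == round(b, float_rounding)


# At 6 decimal places both values round to 0.0, so they look equal
# no matter how precisely they could otherwise be compared.
print(rounded_equal(0.0000001, 0.0000002))  # True

# Keeping a seventh decimal place preserves the difference.
print(rounded_equal(0.0000001, 0.0000002, float_rounding=7))  # False
```

In this model the difference between the two values is discarded by the rounding step, so any later step that compares at higher precision has nothing left to distinguish.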


I see two sharp edges in Math-Verify’s verify function: it assumes that its inputs have passed through parse, and it has numeric_precision and float_rounding parameters that need to be adjusted together to avoid unexpected behavior. Otherwise it seems like a smart approach to the difficult problem of comparing LLM outputs to gold answers on open-ended math problems without relying on an LLM judge.