TIL: How Math-Verify Parses LLM Outputs

ml
llms
evals
math-verify
Author

Greg Gandenberger

Published

February 20, 2025

Note

This is a TIL (“Today I Learned”) post. I expect it to be useful to my future self and maybe to others, but it is meant to be a quick, informal way to capture something I learned rather than a polished presentation.

In a previous post, we saw that LLM Foundry’s default evaluation procedure for open-ended math problems has some limitations. That post proposed to address this problem by using Hugging Face’s Math-Verify library.

This post is a first look at Math-Verify.

from math_verify import parse, verify, LatexExtractionConfig, ExprExtractionConfig

Applying Math-Verify to Our Examples

Naively following the Math-Verify README, we can apply it to one of the examples that the default LLM Foundry evaluation had trouble with as follows.

gold = parse("40")
answer = parse("20 + 20 = 40")

verify(gold, answer)
True

Here Math-Verify gets it right, correctly telling us that the model's answer is correct.

Let’s try another example.

gold = parse("15")
answer = parse("15 pounds x 1/4 pounds x 1/2 pounds = 15 pounds.")

verify(gold, answer)
True

In this case the model gets the right answer for the wrong reason. If we are expecting Math-Verify to evaluate the model’s answer and not its reasoning, then it is doing what we want here.

gold = parse("12")
answer = parse("12\nB: 16\nC: 24\nD: 32")

verify(gold, answer)
False

Here the model misunderstands its job: instead of giving a single answer, it lists multiple-choice options. The first option happens to be correct, but the model should not get credit for giving the right answer, and Math-Verify correctly rejects its response.

How Does Math-Verify’s parse Work?

So far, so good – Math-Verify is getting the right result in these three cases. But how? Is it getting the right results in ways that generalize?

Let’s dig one layer deeper by seeing how Math-Verify’s parse is working.

parse("12\nB: 16\nC: 24\nD: 32")
[32, '32']

Ah. Here parse is returning a list of two values, both of which are some version of “32”, the last multiple-choice option that the model gave. I suspect that if 32 had been the correct answer, then Math-Verify would have counted the model’s response as correct even though it was a set of options rather than a definite answer. Let’s check.

verify(parse("32"), parse("12\nB: 16\nC: 24\nD: 32"))
True

OK, so Math-Verify isn’t magic. In this case it is simply picking out the last number where LLM Foundry picked out the first number, which happens to give the correct result here but is not any better in principle.

fallback_mode

Why is parse("12\nB: 16\nC: 24\nD: 32") returning a list of two values?

parse_results = parse("12\nB: 16\nC: 24\nD: 32")
[type(item) for item in parse_results]
[sympy.core.numbers.Integer, str]

The first item is a sympy object, and the second is a string.

sympy objects have some advantages for our purposes. For instance, sympy can recognize expressions as equal even when they are written differently, as in this example:

import sympy as sp

sp.Eq(sp.sympify("1/2"), sp.sympify("0.5"))

True

The string result is meant to be a fallback option. You can turn it off:

parse(
    "12\nB: 16\nC: 24\nD: 32",
    fallback_mode="no_fallback",
)
[32]

parse works in two stages. First, it pulls out regex matches from the input text. Then it tries to cast each of those matches as a sympy object. With fallback_mode="first_match", parse returns the first regex match it pulls out as a string, independently of what happens with sympy. With fallback_mode="no_fallback", it does not return a string; it returns only the sympy object if sympy processing succeeds, or an empty list if sympy processing fails.

Here is an example where parse pulls out the string inside \\boxed{}, but the string is not a well-formed mathematical expression, so sympy cannot process it. parse then returns just the string with fallback_mode="first_match" (or by default) and an empty list with fallback_mode="no_fallback".

parse(
    "\\boxed{E=mc^}",
    fallback_mode="first_match",
)
['E=mc^']
parse(
    "\\boxed{E=mc^}",
)
['E=mc^']
parse(
    "\\boxed{E=mc^}",
    fallback_mode="no_fallback",
)
[]
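The two-stage flow described above can be sketched in plain Python. This is only an illustration of the behavior, not Math-Verify's actual implementation: the single \boxed{} regex and the int cast are stand-ins for Math-Verify's much more elaborate regexes and sympy conversion.

```python
import re

def toy_parse(text, fallback_mode="first_match"):
    """Toy illustration (not Math-Verify's code) of the two-stage flow:
    regex extraction, then an attempted cast, with the first regex match
    also kept as a string fallback."""
    # Stage 1: regex extraction -- here we only look inside \boxed{}.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    results = []
    for m in matches:
        try:
            # Stage 2: attempt a cast (a stand-in for sympy parsing).
            results.append(int(m))
            break
        except ValueError:
            pass
    # The first match is also returned as a string, independently of
    # whether the cast succeeded, unless the fallback is turned off.
    if matches and fallback_mode == "first_match":
        results.append(matches[0])
    return results

toy_parse(r"\boxed{40}")                                  # [40, '40']
toy_parse(r"\boxed{E=mc^}")                               # ['E=mc^']
toy_parse(r"\boxed{E=mc^}", fallback_mode="no_fallback")  # []
```

The toy version reproduces the shapes we saw above: a sympy-like value plus a string when casting succeeds, just the string with the default fallback when it fails, and an empty list with the fallback turned off.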

At this point it might be clearer for parse to have a Boolean parameter with a name like return_string_fallback rather than a string parameter with the name fallback_mode that simply controls whether or not parse returns the first match it tries to parse as a string. Perhaps the motivation for the current design is that it provides flexibility to add more “fallback modes” in the future without changing the function signature.

extraction_mode

parse has a second parameter extraction_mode that also affects what happens when casting to sympy fails. With extraction_mode="first_match", parse will only try casting to sympy once, and will not return a sympy object if it fails. With extraction_mode="any_match", parse will keep trying to cast matches to sympy until one succeeds.

For instance, if we have one invalid expression and one valid expression, parse will return the valid expression as a sympy object with extraction_mode="any_match", but with extraction_mode="first_match" it will not return any sympy objects if it processes the invalid expression before the valid one.

parse(
    "$x + y$ $E=mc^$",
    fallback_mode="no_fallback",  # do not return a string
    extraction_mode="first_match",  # give up on returning a `sympy` object if the first attempt fails
)
[]
parse(
    "$x + y$ $E=mc^$",
    fallback_mode="first_match",  # return a string from the first match regardless of whether casting to `sympy` succeeds
    extraction_mode="first_match",  # give up on returning a `sympy` object if the first attempt fails
)
['E=mc^']
parse(
    "$x + y$ $E=mc^$",
    fallback_mode="no_fallback",  # do not return a string
    extraction_mode="any_match",  # keep trying to return a `sympy` object until one attempt succeeds
)
[x + y]
parse(
    "$x + y$ $E=mc^$",
    fallback_mode="first_match",  # return a string from the first match regardless of whether casting to `sympy` succeeds
    extraction_mode="any_match",  # keep trying to return a `sympy` object until one attempt succeeds
)
[x + y, 'E=mc^']

In this last example, the two returned items come from different matches. The second item is the invalid expression, which we get as a string because it is processed first and we have fallback_mode="first_match". The first item is the valid expression, which we get as a sympy object because setting extraction_mode="any_match" caused parse to keep trying to cast to sympy until it succeeded.

In these examples parse finds two matches and prioritizes the second one. However, it does not always prioritize the last match. Let’s take a look at what it does instead.

extraction_config

parse has an additional extraction_config parameter that takes a sequence of ExtractionConfig objects. Each ExtractionConfig object specifies one procedure for finding and prioritizing matches. Math-Verify provides three ExtractionConfig classes: LatexExtractionConfig, StringExtractionConfig, and ExprExtractionConfig. Based on which of these classes is used, parse generates a list of regexes with associated priority levels that it uses to find and prioritize matches. Within a given priority level, parse processes matches in the reverse of the order in which they appear in the input text.

For instance, LatexExtractionConfig looks for expressions within various LaTeX delimiters, such as $$...$$ and \[...\]. It prioritizes matches that are “marked” as the final answer, for instance by being inside \boxed{} or coming after “final answer is”. The details are quite complicated, and I do not fully understand them, but let’s look at some examples.

With two expressions simply delimited by $, parse finds both matches and prioritizes the second one.

parse(
    "$x + y$ $E=mc^2$",
    extraction_config=[LatexExtractionConfig()],
    extraction_mode="any_match",
    fallback_mode="no_fallback",
)
[Eq(E, c**2*m)]

However, parse will prioritize the first match if it is marked as the final answer in a way that it recognizes.

parse(
    "the final answer is $x + y$ $E=mc^2$",
    extraction_config=[LatexExtractionConfig()],
    extraction_mode="any_match",
    fallback_mode="no_fallback",
)
[x + y]
parse(
    "\\boxed{x + y} $E=mc^2$",
    extraction_config=[LatexExtractionConfig()],
    extraction_mode="any_match",
    fallback_mode="no_fallback",
)
[x + y]

This approach makes sense, but it is complicated and currently not well documented. It rewards models for marking their final answers in certain specific ways, which introduces some amount of coupling between the evaluation procedure and the details of how the model is trained and prompted.

If I understand correctly, ExprExtractionConfig looks for numerical (rather than symbolic) mathematical expressions without relying on LaTeX delimiters.

parse(
    "1 + 2",
    extraction_config=[ExprExtractionConfig()],
    extraction_mode="any_match",
    fallback_mode="no_fallback",
)
[1 + 2]

It is prone to pulling out parts of larger expressions in ways that may or may not match our desires.

# Here we get `3`, which probably is what we want
parse(
    "1 + 2 = 3",
    extraction_config=[ExprExtractionConfig()],
    extraction_mode="any_match",
    fallback_mode="no_fallback",
)
[3]
# Here we get "2", which probably is not what we want
parse(
    "$1 + 2$",
    extraction_config=[ExprExtractionConfig()],
    extraction_mode="any_match",
    fallback_mode="no_fallback",
)
[2]

By default, extraction_config=[LatexExtractionConfig(), ExprExtractionConfig()], so parse will find both LaTeX expressions and numerical expressions. It combines their matches into one pool, prioritizing those matches by the priorities that the configs give them and breaking ties by working back to front.
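That selection logic can be pictured with a small sketch. The (text, priority, position) tuples and the numeric priority values below are my own invention for illustration, not Math-Verify's internal representation; the point is only that priority is compared first and position breaks ties back to front.

```python
# Hypothetical illustration of the match prioritization described above.
# Each match carries a priority (lower number = higher priority, in this
# sketch) and a character position in the input text.
matches = [
    ("16", 1, 5),      # plain-number match, early in the text
    ("32", 1, 20),     # plain-number match, later in the text
    ("x + y", 0, 10),  # LaTeX match, given higher priority
]

# Priority wins first; ties are broken back to front (larger position).
best = min(matches, key=lambda m: (m[1], -m[2]))
best  # ('x + y', 0, 10)

# With only the equal-priority number matches, the later one wins.
min(matches[:2], key=lambda m: (m[1], -m[2]))  # ('32', 1, 20)
```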

The final config class, StringExtractionConfig, to a first approximation simply looks for any of a fixed set of strings, by default “A”, “B”, “C”, and “D”. I take it that it is meant to be used for multiple-choice questions rather than open-ended math problems.
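As I understand it, the behavior is roughly like the following stdlib sketch. The regex, the word-boundary handling, and the choice to return every occurrence are my guesses, not StringExtractionConfig's actual logic.

```python
import re

# Rough approximation of looking for a fixed set of answer strings;
# this is my sketch, not StringExtractionConfig's implementation.
CHOICES = ["A", "B", "C", "D"]

def toy_choice_parse(text):
    # Match any of the choice letters as standalone tokens.
    pattern = r"\b(" + "|".join(map(re.escape, CHOICES)) + r")\b"
    return re.findall(pattern, text)

toy_choice_parse("The answer is B")  # ['B']
```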

Examples

Now that we have some general idea of how parse works, let’s look back at the examples from the previous post.

parse("12\nB: 16\nC: 24\nD: 32")
[32, '32']
parse(
    "12\nB: 16\nC: 24\nD: 32",
    extraction_config=[LatexExtractionConfig()],
    extraction_mode="any_match",
)
[]
parse(
    "12\nB: 16\nC: 24\nD: 32",
    extraction_config=[ExprExtractionConfig()],
    extraction_mode="any_match",
    fallback_mode="no_fallback",
)
[32]

In this case, LatexExtractionConfig does not find any matches because the model’s output is not formatted as LaTeX. ExprExtractionConfig presumably finds all four numbers and gives them the same priority, so it returns the last number, 32.

The fact that parse returns the same result here as it would if the model had simply returned “32” is a problem, because it will cause us to count the model as correct even though it did not give a definite answer. The ideal behavior depends on our larger system design, but it would involve recognizing that the model did not give a definite answer and returning something that indicates as much, such as perhaps an empty list.
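One way to get that behavior would be a pre-check on the raw output before calling parse. The heuristic below is my own suggestion, not a Math-Verify feature, and the specific pattern is just one plausible choice.

```python
import re

def looks_like_option_list(text):
    """Heuristic guard (my own idea, not part of Math-Verify): treat an
    output containing two or more lettered option lines as giving no
    definite answer."""
    option_lines = re.findall(r"^[A-D]\s*[:.)]", text, flags=re.MULTILINE)
    return len(option_lines) >= 2

looks_like_option_list("12\nB: 16\nC: 24\nD: 32")  # True
looks_like_option_list("20 + 20 = 40")             # False
```

A wrapper could return an empty list whenever this guard fires, so that verify counts the response as incorrect.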

In our other two examples, parse returns the rightmost number for the same reason, and that happens to be the right thing to do:

parse("20 + 20 = 40")
[40, '40']
parse("15 pounds x 1/4 pounds x 1/2 pounds = 15 pounds.")
[15, '15']

parse is simply picking out the last number in the model’s output here, rather than intelligently handling the = sign, as this example shows:

parse("20 + 20 = 40. By the way, my favorite number is 50.")
[50, '50']

Conclusion

Math-Verify is a library for evaluating LLM outputs on open-ended math problems. It provides a parse function that uses regexes to extract mathematical expressions and then attempts to cast them as sympy objects. It also provides a verify function that compares the parsed model output to the parsed gold answer. I will look at this verify function in a future post. The resulting evaluation process is not foolproof, but it is perhaps an improvement over LLM Foundry’s default evaluation procedure.