from math_verify import parse, verify, LatexExtractionConfig, ExprExtractionConfig
This is a TIL (“Today I Learned”) post. I expect it to be useful to my future self and maybe to others, but it is meant to be a quick, informal way to capture something I learned rather than a polished presentation.
In a previous post, we saw that LLM Foundry’s default evaluation procedure for open-ended math problems has some limitations. That post proposed addressing those limitations with Hugging Face’s Math-Verify library.
This post is a first look at Math-Verify.
Applying Math-Verify to Our Examples
Naively following the Math-Verify README, we can apply it to one of the examples that the default LLM Foundry evaluation had trouble with as follows.
gold = parse("40")
answer = parse("20 + 20 = 40")
verify(gold, answer)

True
Here Math-Verify gets the right answer, correctly telling us that the model is correct.
Let’s try another example.
gold = parse("15")
answer = parse("15 pounds x 1/4 pounds x 1/2 pounds = 15 pounds.")
verify(gold, answer)

True
In this case the model gets the right answer for the wrong reason. If we are expecting Math-Verify to evaluate the model’s answer and not its reasoning, then it is doing what we want here.
gold = parse("12")
answer = parse("12\nB: 16\nC: 24\nD: 32")
verify(gold, answer)

False
Here the model misunderstands its job – instead of giving an answer, it gives multiple-choice options. The first option happens to be correct, but the model should not get credit for giving the right answer, and Math-Verify correctly rejects its response.
How Does Math-Verify’s `parse` Work?
So far, so good – Math-Verify is getting the right result in these three cases. But how? Is it getting the right results in ways that generalize?
Let’s dig one layer deeper by seeing how Math-Verify’s `parse` works.
parse("12\nB: 16\nC: 24\nD: 32")

[32, '32']
Ah. Here `parse` is returning a list of two values, both of which are some version of “32”. “32” is the last multiple-choice option that the model gave. I suspect that if 32 had been the correct answer, then Math-Verify would have counted the model’s response as correct even though it was a set of options rather than a definite answer. Let’s check.
verify(parse("12\nB: 16\nC: 24\nD: 32"), parse("32"))

True
OK, so Math-Verify isn’t magic. Here it is simply picking out the last number where LLM Foundry picked out the first, which happens to give the correct result in this case but is not any better in principle.
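To make that contrast concrete, the two behaviors can be approximated with a single regex. This is only a rough sketch of what either tool does, not their actual extraction logic:

```python
import re

# The multiple-choice response from the example above
text = "12\nB: 16\nC: 24\nD: 32"
numbers = re.findall(r"\d+", text)

print(numbers[0])   # '12' -- roughly what LLM Foundry's default evaluation picks
print(numbers[-1])  # '32' -- roughly what Math-Verify picks on this input
```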
`fallback_mode`
Why is `parse("12\nB: 16\nC: 24\nD: 32")` returning a list of two values?
parse_results = parse("12\nB: 16\nC: 24\nD: 32")
[type(item) for item in parse_results]

[sympy.core.numbers.Integer, str]
The first item is a `sympy` object, and the second is a string. `sympy` objects have some advantages for our purposes. For instance, `sympy` can recognize two expressions as equal even when they are written differently, as in this example:
import sympy as sp

sp.Eq(sp.sympify("1/2"), sp.sympify("0.5"))

True
The string result is meant to be a fallback option. You can turn it off:
parse("12\nB: 16\nC: 24\nD: 32",
      fallback_mode="no_fallback",
)

[32]
`parse` works in two stages. First, it pulls out regex matches from the input text. Then it tries to cast each of those matches as a `sympy` object. With `fallback_mode="first_match"`, `parse` returns the first regex match it pulls out as a string, independently of what happens with `sympy`. With `fallback_mode="no_fallback"`, it does not return a string; it returns only the `sympy` object, if `sympy` processing succeeds, or an empty list if `sympy` processing fails.
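As a mental model, the two stages can be sketched in a few lines of plain Python. This toy version is my own illustration, not Math-Verify’s actual implementation: the regex and the use of `int()` as the casting step are invented stand-ins for the real extraction rules and `sympy` conversion.

```python
import re

def toy_parse(text, fallback_mode="first_match"):
    """Toy sketch of parse's two stages: regex extraction, then casting.

    Stage 1 pulls out candidate matches (scanned back to front, since
    Math-Verify prefers later matches among equals). Stage 2 tries to cast
    each candidate to a number, standing in for casting to sympy.
    """
    matches = list(reversed(re.findall(r"[\d.+^=a-zA-Z]+", text)))
    if not matches:
        return []
    results = []
    for m in matches:
        try:
            results.append(int(m))  # stand-in for casting to a sympy object
            break                   # stop once one cast succeeds
        except ValueError:
            continue
    if fallback_mode == "first_match":
        results.append(matches[0])  # also return the first match tried, as a raw string
    return results

print(toy_parse("20 + 20 = 40"))                        # [40, '40']
print(toy_parse("12\nB: 16\nC: 24\nD: 32"))             # [32, '32']
print(toy_parse("E=mc^"))                               # ['E=mc^']
print(toy_parse("E=mc^", fallback_mode="no_fallback"))  # []
```

Even this crude sketch reproduces the output shapes we saw above: a `[sympy-like, string]` pair when casting succeeds, a lone string fallback when it fails, and an empty list with `fallback_mode="no_fallback"`.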
Here is an example where `parse` pulls out the string inside `\boxed{}`, but the string is not a well-formed mathematical expression, so `sympy` cannot process it. `parse` then returns just the string with `fallback_mode="first_match"` (which is the default) and an empty list with `fallback_mode="no_fallback"`.
parse("\\boxed{E=mc^}",
      fallback_mode="first_match",
)

['E=mc^']

parse("\\boxed{E=mc^}")

['E=mc^']

parse("\\boxed{E=mc^}",
      fallback_mode="no_fallback",
)

[]
At this point it might be clearer for `parse` to have a Boolean parameter with a name like `return_string_fallback` rather than a string parameter named `fallback_mode` that simply controls whether or not `parse` returns the first match it tries to parse as a string. Perhaps the motivation for the current design is that it provides flexibility to add more “fallback modes” in the future without changing the function signature.
`extraction_mode`
`parse` has a second parameter, `extraction_mode`, that also affects what happens when casting to `sympy` fails. With `extraction_mode="first_match"`, `parse` will try casting to `sympy` only once, and will not return a `sympy` object if that attempt fails. With `extraction_mode="any_match"`, `parse` will keep trying to cast matches to `sympy` until one succeeds.
For instance, if we have one invalid expression and one valid expression, `parse` will return the valid expression as a `sympy` object with `extraction_mode="any_match"`, but with `extraction_mode="first_match"` it will not return any `sympy` objects if it processes the invalid expression before the valid one.
parse("$x + y$ $E=mc^$",
      fallback_mode="no_fallback",    # do not return a string
      extraction_mode="first_match",  # give up on returning a `sympy` object if the first attempt fails
)

[]

parse("$x + y$ $E=mc^$",
      fallback_mode="first_match",    # return a string from the first match regardless of whether casting to `sympy` succeeds
      extraction_mode="first_match",  # give up on returning a `sympy` object if the first attempt fails
)

['E=mc^']

parse("$x + y$ $E=mc^$",
      fallback_mode="no_fallback",    # do not return a string
      extraction_mode="any_match",    # keep trying to return a `sympy` object until one attempt succeeds
)

[x + y]

parse("$x + y$ $E=mc^$",
      fallback_mode="first_match",    # return a string from the first match regardless of whether casting to `sympy` succeeds
      extraction_mode="any_match",    # keep trying to return a `sympy` object until one attempt succeeds
)

[x + y, 'E=mc^']
In this last example, the two returned items come from different matches. The second item is the invalid expression, which we get as a string because it is processed first and we have `fallback_mode="first_match"`. The first item is the valid expression, which we get as a `sympy` object because setting `extraction_mode="any_match"` caused `parse` to keep trying to cast to `sympy` until it succeeded.
In these examples `parse` finds two matches and prioritizes the second one. However, it does not always prioritize the last match. Let’s take a look at what it does instead.
`extraction_config`
`parse` has an additional `extraction_config` parameter that takes a sequence of `ExtractionConfig` objects. Each `ExtractionConfig` object specifies one procedure for finding and prioritizing matches. Math-Verify supports three `ExtractionConfig` classes: `LatexExtractionConfig`, `StringExtractionConfig`, and `ExprExtractionConfig`. Based on which of these classes is used, `parse` generates a list of regexes with associated priority levels that it uses to find and prioritize matches. Within a given priority level, `parse` processes matches in the reverse of the order in which they appear in the input text.
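The pooling-and-prioritizing step can be sketched as a sort over `(priority, position)` pairs. The priority numbers below are made up for illustration; the real levels come from the config classes and their regexes:

```python
# Each candidate match: (priority_level, position_in_text, value).
# Lower priority numbers win; within a level, later matches win.
matches = [
    (1, 10, "x + y"),    # e.g. marked as a final answer: high priority
    (2, 0, "1 + 2"),     # an unmarked expression early in the text
    (2, 25, "E=mc^2"),   # an unmarked expression later in the text
]

# Sort by priority level first; within a level, prefer the later position.
ranked = sorted(matches, key=lambda m: (m[0], -m[1]))
print([value for _, _, value in ranked])  # ['x + y', 'E=mc^2', '1 + 2']
```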
For instance, `LatexExtractionConfig` looks for expressions within various LaTeX delimiters, such as `$$...$$` and `\[...\]`. It prioritizes matches that are “marked” as the final answer, for instance by being inside `\boxed{}` or coming after “final answer is”. The details are quite complicated, and I do not fully understand them, but let’s look at some examples.
With two expressions simply delimited by `$`, `parse` finds both matches and prioritizes the second one.
parse("$x + y$ $E=mc^2$",
      extraction_config=[LatexExtractionConfig()],
      extraction_mode="any_match",
      fallback_mode="no_fallback",
)

[Eq(E, c**2*m)]
However, `parse` will prioritize the first match if it is marked as the final answer in a way that it recognizes.
parse("the final answer is $x + y$ $E=mc^2$",
      extraction_config=[LatexExtractionConfig()],
      extraction_mode="any_match",
      fallback_mode="no_fallback",
)

[x + y]

parse("\\boxed{x + y} $E=mc^2$",
      extraction_config=[LatexExtractionConfig()],
      extraction_mode="any_match",
      fallback_mode="no_fallback",
)

[x + y]
This approach makes sense, but it is complicated and currently not well documented. It rewards models for marking their final answers in certain specific ways, which introduces some amount of coupling between the evaluation procedure and the details of how the model is trained and prompted.
If I understand correctly, `ExprExtractionConfig` looks for numerical (rather than symbolic) mathematical expressions without relying on LaTeX delimiters.
parse("1 + 2",
      extraction_config=[ExprExtractionConfig()],
      extraction_mode="any_match",
      fallback_mode="no_fallback",
)

[1 + 2]
It is prone to pulling out parts of larger expressions in ways that may or may not match our desires.
# Here we get `3`, which probably is what we want
parse("1 + 2 = 3",
      extraction_config=[ExprExtractionConfig()],
      extraction_mode="any_match",
      fallback_mode="no_fallback",
)

[3]

# Here we get "2", which probably is not what we want
parse("$1 + 2$",
      extraction_config=[ExprExtractionConfig()],
      extraction_mode="any_match",
      fallback_mode="no_fallback",
)

[2]
By default, `extraction_config=[LatexExtractionConfig(), ExprExtractionConfig()]`, so `parse` will find both LaTeX expressions and numerical expressions. It combines their matches into one pool, prioritizing matches by the priorities that the configs give them and breaking ties by working back to front.
The final config class, `StringExtractionConfig`, to a first approximation simply looks for any of a fixed set of strings, by default “A”, “B”, “C”, and “D”. I take it that it is meant to be used for multiple-choice questions rather than open-ended math problems.
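If that description is right, a rough stdlib approximation would look something like this (my own sketch, not Math-Verify’s code):

```python
import re

def toy_string_extract(text, choices=("A", "B", "C", "D")):
    # Find standalone occurrences of the allowed choice letters and return
    # the last one, echoing parse's back-to-front preference among ties.
    pattern = r"\b(" + "|".join(re.escape(c) for c in choices) + r")\b"
    found = re.findall(pattern, text)
    return [found[-1]] if found else []

print(toy_string_extract("The answer is B"))  # ['B']
print(toy_string_extract("20 + 20 = 40"))     # []
```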
Examples
Now that we have some general idea of how `parse` works, let’s look back at the examples from the previous post.
parse("12\nB: 16\nC: 24\nD: 32")

[32, '32']
parse("12\nB: 16\nC: 24\nD: 32",
      extraction_config=[LatexExtractionConfig()],
      extraction_mode="any_match",
)

[]

parse("12\nB: 16\nC: 24\nD: 32",
      extraction_config=[ExprExtractionConfig()],
      extraction_mode="any_match",
      fallback_mode="no_fallback",
)

[32]
In this case, `LatexExtractionConfig` does not find any matches because the model’s output is not formatted as LaTeX. `ExprExtractionConfig` presumably finds all four numbers and gives them the same priority, so it returns the last number, 32.
The fact that `parse` returns the same result here as it would if the model had simply answered “32” is a problem, because it will cause us to count the model as correct even though it did not give a definite answer. The ideal behavior depends on our larger system design, but it would involve recognizing that the model did not give a definite answer and returning something that indicates as much, such as an empty list.
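As a hedged sketch of what that recognition might look like (my own illustration; Math-Verify does not provide this):

```python
import re

def looks_like_option_list(text):
    # Heuristic: two or more lines shaped like "B: 16" suggest the model
    # offered a menu of options rather than a single definite answer.
    option_lines = re.findall(r"^[A-E]\s*[:.)]", text, flags=re.MULTILINE)
    return len(option_lines) >= 2

print(looks_like_option_list("12\nB: 16\nC: 24\nD: 32"))  # True
print(looks_like_option_list("20 + 20 = 40"))             # False
```

A wrapper around parsing could return an empty list whenever this check fires, so that downstream verification counts the response as incorrect.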
In our other two examples, `parse` returns the rightmost number for the same reason, and that happens to be the right thing to do:
parse("20 + 20 = 40")

[40, '40']

parse("15 pounds x 1/4 pounds x 1/2 pounds = 15 pounds.")

[15, '15']
`parse` is simply picking out the last number in the model’s output here, rather than intelligently handling the `=` sign, as this example shows:
parse("20 + 20 = 40. By the way, my favorite number is 50.")

[50, '50']
Conclusion
Math-Verify is a library for evaluating LLM outputs on open-ended math problems. It provides a `parse` function that uses regexes to extract mathematical expressions and then attempts to cast them as `sympy` objects. It also provides a `verify` function that compares the parsed model output to the parsed gold answer. I will look at the `verify` function in a future post. The resulting evaluation process is not foolproof, but it is perhaps an improvement over LLM Foundry’s default evaluation procedure.