Evaluating LLM answers to open-ended math questions is hard because LLMs can generate arbitrary text. They might surround their answer with additional information. They are not guaranteed to format their answer in any particular way. They might not answer the question at all! Comparing their answer to the ground truth requires first locating the answer and then making the comparison in a flexible yet discerning way.
This blog post describes some best practices for evaluating LLM answers to open-ended math questions. It illustrates these ideas in the context of using HuggingFace’s Math-Verify library with MosaicML’s LLM Foundry, but the principles apply more broadly. It does NOT address using an LLM as a judge, which is an alternative to the kind of approach described here.
1. Make Sure the Model Is Answering the Question
First, make sure that when you present the model with a question, it is attempting to answer that question at least most of the time. This point might sound trivial, but the wrong configuration can lead to responses that never actually answer the question. For example, you could get such responses from a zero-shot setup like this LLM Foundry eval configuration:
icl_tasks:
-
  label: gsm8k
  dataset_uri: eval/local_data/symbolic_problem_solving/gsm8k.jsonl
  num_fewshot: [0]
  icl_task_type: generation_task_with_answers
  cot_delimiter: "The answer is "
  continuation_delimiter: "\n\nA:"
  question_prelimiter: ""
  do_normalization: false
  early_stopping_criteria:
  - "\n\n"
  - "Question:"
1.a. Use an instruction-tuned model
Base language models are trained to generate more text rather than specifically to answer questions. If you are evaluating performance on math questions, you should probably be using an instruction-tuned model. This point is particularly easy to overlook if you are just grabbing a small model to get started with development. For instance, HuggingFaceTB/SmolLM2-135M-Instruct would be a better choice for this purpose than HuggingFaceTB/SmolLM2-135M.
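As a quick sanity check outside of LLM Foundry, you can prompt the instruction-tuned model directly through the transformers library and confirm that it at least attempts an answer. This is only a sketch; the question and generation settings are illustrative.

# Sketch: confirm the instruction-tuned model attempts to answer a question.
# Uses the Hugging Face transformers API directly, outside LLM Foundry;
# the question and generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"  # vs. the base SmolLM2-135M
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "What is 17 + 25?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))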
1.b. Use few-shot prompting
LLMs are fundamentally pattern-matching machines, so giving them examples of what you want them to do is generally effective.
In LLM Foundry, you can use few-shot prompting by setting e.g. num_fewshot: [5] in the eval configuration, or by using a dataset such as gsm8k_prepended_8shot.jsonl that has few-shot examples built into each question. The example below also uses question_prelimiter: "Question: ".
For instance, you can give them input that looks like this:
Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?
A: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.The answer is 18
Question: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?
A:It takes 2/2=<<2/2=1>>1 bolt of white fiber So the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric The answer is 3
…
Question: Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy. She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed. In the afternoon, she gives her chickens another 25 cups of feed. How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi’s flock is 20 chickens?
A:
The model will typically take the hint that it should be answering the last question.
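For illustration, here is a rough sketch of how a prompt like the one above can be assembled from a question prelimiter, a continuation delimiter, and a set of worked examples. This is not LLM Foundry's implementation (it builds the prompt for you from the config); the helper and its defaults are hypothetical.

# Hypothetical helper, for illustration only: assemble a few-shot prompt from
# (question, answer) pairs, mirroring the structure of the example above.
# LLM Foundry builds this kind of prompt internally from the eval config.
def build_few_shot_prompt(
    examples,                 # list of (question, answer) string pairs
    question,                 # the question the model should answer
    prelimiter="Question: ",  # cf. question_prelimiter
    delimiter="\n\nA:",       # cf. continuation_delimiter
):
    parts = [f"{prelimiter}{q}{delimiter}{a}" for q, a in examples]
    parts.append(f"{prelimiter}{question}{delimiter}")  # leave the final answer blank
    return "\n".join(parts)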
1.c. Cut it off at the question prelimiter
Few-shot prompting might lead to a further problem: the model might think that it should not only answer the last question but continue to generate more question/answer pairs, with responses like this:
He made 150% of the original price The answer is 150% Question: A snail is at the bottom of a 20-foot well. Each day, it climbs up 3 feet, but at night, it slips back 2 feet. How many days will it take for the snail to reach the top of the well? …
You can deal with this problem by ignoring everything in its response after the “question prelimiter”, in this case “Question:”.
In LLM Foundry, you can implement this approach by including “Question:” in the early_stopping_criteria.
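If your evaluation harness does not support early stopping, you can apply the same idea as a post-processing step. The helper below is a generic sketch, not LLM Foundry's implementation:

# Generic post-processing sketch (not LLM Foundry's implementation): truncate the
# model's raw output at the first occurrence of any stop string, e.g. "Question:".
def truncate_at_stops(text, stop_strings=("\n\n", "Question:")):
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stops("The answer is 150% Question: A snail is at the bottom..."))
# prints "The answer is 150% "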
2. Have the Model “Box” Its Answer In a Way That the Metric Recognizes
In the few-shot example above, each example answer ends with a statement of the form “The answer is X”, which encourages the model to give its final answer in the same format. You can then look for that pattern when extracting the model’s answer.
In LLM Foundry, you can use cot_delimiter for this purpose. It will formulate its few-shot prompts accordingly, and its default metric for the generation_task_with_answers task type (the appropriate one for open-ended math questions) will look for that formulation when extracting the model’s answer.
HuggingFace’s Math-Verify library uses regular expressions to extract the model’s answer. For LaTeX output, it optionally prioritizes matches that fall inside \boxed{}, and for general mathematical expressions it looks for variations on “the final answer is”. Using those formulations in your prompts will help Math-Verify find the model’s answer!
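For example, here is a minimal sketch of extracting and checking an answer with Math-Verify's parse and verify functions. The response text is made up; if the extraction details matter for your setup, check the library's documentation.

# Minimal sketch of answer extraction and checking with Math-Verify.
# parse() and verify() are Math-Verify's public API; the response string is made up.
from math_verify import parse, verify

response = (
    "Janet sells 9 eggs at $2 each, so she makes $18 per day. "
    "The final answer is $\\boxed{18}$."
)

gold = parse("$18$")    # the ground-truth answer
pred = parse(response)  # extraction favors \boxed{} and "final answer is" patterns

print(verify(gold, pred))  # expected: True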
3. Use a Smart Metric
Once your model is answering the intended question, and you are able to find its answer, you still need to compare that answer to the ground truth. Math-Verify has an extensive procedure for this purpose. It is thoughtfully designed, but you should still spot-check its results on your problem! By contrast, LLM Foundry’s default metric for the generation_task_with_answers task type simply checks whether the extracted answer starts with the ground truth answer (or with one of the ground truth answers, if there are multiple). That check can produce false positives: for instance, if the ground truth is “1” and the model answers “100”, the prefix check counts it as correct. In my testing so far, Math-Verify has given more reliable results.
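The difference is easy to demonstrate. In the snippet below, the prefix check is a stand-in for the behavior described above, not LLM Foundry's actual code:

# The prefix check is a stand-in for the behavior described above,
# not LLM Foundry's actual code; Math-Verify compares parsed values instead.
from math_verify import parse, verify

gold_text, model_text = "1", "100"

naive_match = model_text.startswith(gold_text)       # True -- a false positive
strict_match = verify(parse("$1$"), parse("$100$"))  # False -- the values differ

print(naive_match, strict_match)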
Conclusion
The practices in this post can help avoid pitfalls in evaluating LLMs on open-ended math questions. However, the most important practice of all is the one that got me to this point: look at your data! Check some inputs and outputs and dig into anything that is unclear or looks wrong. Do not blindly trust a library, especially in such a rapidly evolving field.