Some Best Practices for LLM Math Evaluation

Categories: ml, llms, evals, math

Author: Greg Gandenberger

Published: March 5, 2025

Image generated by ChatGPT-4o.

Evaluating LLM answers to open-ended math questions is hard because LLMs can generate arbitrary text. They might surround their answer with additional information. They are not guaranteed to format their answer in any particular way. They might not answer the question at all! Comparing their answer to ground truth requires first locating the answer in the response and then making the comparison in a flexible yet discerning way.

This blog post describes some best practices for evaluating LLM answers to open-ended math questions. It illustrates these ideas in the context of using HuggingFace’s Math-Verify library with MosaicML’s LLM Foundry, but the principles apply more broadly. It does NOT address using an LLM as a judge, which is an alternative to the kind of approach described here.

1. Make Sure the Model Is Answering the Question

First, make sure that when you present the model with a question, it is attempting to answer that question at least most of the time. This point might sound trivial, but the wrong configuration can lead to other types of responses.

Example

For instance, in my first attempt at a recent project developing a math evaluation procedure, the model was generating responses that looked like this:

15 cups

B: 25 cups

C: 30 cups

D: 40 cups

E: 50 cups

It was behaving that way because I had not configured the prompting procedure well, so it was getting inputs like this:

Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy. She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed. In the afternoon, she gives her chickens another 25 cups of feed. How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi’s flock is 20 chickens?

A:

The “A:” at the end of the prompt was meant to signify “answer”, but the model took it to signify “option A” in a multiple-choice question.

Note

You would get this result by using this LLM Foundry eval configuration:

icl_tasks:
- label: gsm8k
  dataset_uri: eval/local_data/symbolic_problem_solving/gsm8k.jsonl
  num_fewshot: [0]
  icl_task_type: generation_task_with_answers
  cot_delimiter: "The answer is "
  continuation_delimiter: "\n\nA:"
  question_prelimiter: ""
  do_normalization: false
  early_stopping_criteria:
  - "\n\n"
  - "Question:"

1.a. Use an instruction-tuned model

Base language models are trained to generate more text rather than specifically to answer questions. If you are evaluating performance on math questions, you should probably be using an instruction-tuned model. This point is particularly easy to overlook if you are just grabbing a small model to get started with development. For instance, HuggingFaceTB/SmolLM2-135M-Instruct would be a better choice for this purpose than HuggingFaceTB/SmolLM2-135M.
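For a concrete (if minimal) sketch, here is one way to prompt the instruction-tuned variant through its chat template rather than as a raw text completion, assuming the Hugging Face transformers library; this is a standalone illustration, not part of the LLM Foundry setup above:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the instruction-tuned model rather than the base model.
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ask the question through the chat template so the model sees it as an
# instruction to follow rather than text to continue.
messages = [{"role": "user", "content": "What is 17 * 24? End your response with 'The answer is X'."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=128)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))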

1.b. Use few-shot prompting

LLMs are fundamentally pattern-matching machines, so giving them examples of what you want them to do is generally effective.

Note

In LLM Foundry, you can use few-shot prompting by setting e.g. num_fewshot: [5] in the eval configuration, or by using a dataset such as gsm8k_prepended_8shot.jsonl that has few-shot examples built into each question. The example below also uses question_prelimiter: "Question: ".

For instance, you can give them input that looks like this:

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?

A: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.The answer is 18 Question: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?

A:It takes 2/2=<<2/2=1>>1 bolt of white fiber So the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric The answer is 3 … Question: Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy. She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed. In the afternoon, she gives her chickens another 25 cups of feed. How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi’s flock is 20 chickens?

A:

The model will then typically take the hint that it should answer the last question.
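Outside of LLM Foundry, you could assemble such a prompt with something like the following sketch. The example list and the build_prompt helper are hypothetical; the point is the repeated "Question: … / A: … The answer is X" layout:

# Hypothetical worked examples in the GSM8K style; in practice these would come
# from your few-shot split or a dataset with the shots already prepended.
FEW_SHOT_EXAMPLES = [
    {
        "question": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?",
        "solution": "It takes 2/2=1 bolt of white fiber So the total amount of fabric is 2+1=3 bolts of fabric The answer is 3",
    },
    # ... more worked examples ...
]


def build_prompt(test_question: str) -> str:
    """Prepend worked examples so the model imitates their answer format."""
    parts = [f"Question: {ex['question']}\nA: {ex['solution']}" for ex in FEW_SHOT_EXAMPLES]
    parts.append(f"Question: {test_question}\nA:")
    return "\n\n".join(parts)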

1.c. Cut the response off at the question delimiter

Few-shot prompting can introduce a further problem: the model might think that it should not only answer the last question but also continue generating more question/answer pairs, with responses like this:

He made 150% of the original price The answer is 150% Question: A snail is at the bottom of a 20-foot well. Each day, it climbs up 3 feet, but at night, it slips back 2 feet. How many days will it take for the snail to reach the top of the well? …

You can deal with this problem by ignoring everything in its response after the “question prelimiter”, in this case “Question:”.

Note

In LLM Foundry, you can implement this approach by including “Question:” in the early_stopping_criteria.
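If you are post-processing completions yourself rather than relying on LLM Foundry's stopping criteria, the same idea looks roughly like this sketch, where truncate_at_delimiters is a hypothetical helper rather than part of either library:

def truncate_at_delimiters(completion: str, delimiters=("\n\n", "Question:")) -> str:
    """Keep only the text before the model starts a new question/answer pair."""
    cutoff = len(completion)
    for delimiter in delimiters:
        index = completion.find(delimiter)
        if index != -1:
            cutoff = min(cutoff, index)
    return completion[:cutoff].strip()


completion = "He made 150% of the original price The answer is 150% Question: A snail is at the bottom of a 20-foot well. ..."
print(truncate_at_delimiters(completion))
# He made 150% of the original price The answer is 150%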

2. Have the Model “Box” Its Answer in a Way That the Metric Recognizes

In the few-shot example above, each answer example ends with a statement of the form “The answer is X”, which encourages the model to give its final answer in this same format. You can then look for this same pattern when extracting the model’s answer.

Note

In LLM Foundry, you can use cot_delimiter for this purpose. LLM Foundry will formulate its few-shot prompts accordingly, and its default metric for the generation_task_with_answers task type (the one appropriate for open-ended math questions) will look for that formulation when extracting the model’s answer.

HuggingFace’s Math-Verify library uses regular expressions to extract the model’s answer. For LaTeX output, it optionally prioritizes matches that fall inside \boxed{}, and for general mathematical expressions it looks for variations on “the final answer is”. Using those formulations in your prompts will help Math-Verify find the model’s answer!
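As a rough sketch of what that looks like in code, using the parse entry point from Math-Verify's documented interface (extraction details can vary by version, so spot check it on your own outputs):

from math_verify import parse

# LaTeX-style output: the value inside \boxed{} is the one we want extracted.
print(parse(r"She needs $20 \times 3 = 60$ cups in total, so the final meal is $\boxed{20}$ cups."))

# Plain-text output: stating the conclusion as "The final answer is ..." makes
# it easier for the extraction patterns to find the answer.
print(parse("She needs 60 - 40 = 20 more cups. The final answer is 20."))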

3. Use a Smart Metric

Once your model is answering the intended question and you are able to find its answer, you still need to compare that answer to the ground truth. Math-Verify has an extensive procedure for this purpose. It is thoughtfully designed, but you should still spot check its results on your problem! By contrast, LLM Foundry’s default metric for the generation_task_with_answers task type simply checks whether the extracted answer starts with the ground truth answer (or one of the ground truth answers, if there are multiple), which can lead to false positives, for instance when the ground truth answer is “1” and the model’s answer is “100”. In my testing so far, Math-Verify has given more reliable results.
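Here is a small sketch of the difference, using the parse and verify functions from Math-Verify's documented interface and the “1” versus “100” case from above:

from math_verify import parse, verify

ground_truth_text = "1"
extracted_answer_text = "100"

# A prefix check of the kind described above produces a false positive.
print(extracted_answer_text.startswith(ground_truth_text))  # True, even though 100 != 1

# Math-Verify compares the parsed mathematical values instead.
gold = parse(f"${ground_truth_text}$")
prediction = parse(f"${extracted_answer_text}$")
print(verify(gold, prediction))  # False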

Conclusion

The practices in this post can help avoid pitfalls in evaluating LLMs on open-ended math questions. However, the most important practice of all is the one that got me to this point: look at your data! Check some inputs and outputs and dig into anything that is unclear or looks wrong. Do not blindly trust a library, especially in such a rapidly evolving field.