This is a TIL (“Today I Learned”) post. I expect it to be useful to my future self and maybe to others, but it is meant to be a quick, informal way to capture something I learned rather than a polished presentation.
Context
I am contributing to a research project on performance-efficient fine-tuning of LLMs. That project involves fine-tuning LLMs on math problems.
Problem
Evaluating LLM performance on open-ended math problems is difficult because LLMs can generate arbitrary text. Checking an LLM’s answer to an open-ended (as opposed to multiple-choice, for instance) math problem against ground truth requires extracting its answer from the generated text and then accounting for issues such as rounding (e.g. 0.33 vs. 0.333) and alternative representations (e.g. 0.333 vs. 1/3). Hugging Face recently released the Math-Verify library to address these issues and showed that it makes a massive difference to LLM math leaderboards.
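For reference, the basic Math-Verify workflow looks roughly like the sketch below. It uses the library's `parse` and `verify` functions; the example strings are made up, and exactly how well `parse` extracts an answer from free-form text is my assumption rather than something I have tested here.

```python
# Minimal sketch of checking a generated answer with Math-Verify.
# Assumes the library is installed (pip install math-verify).
from math_verify import parse, verify

# Ground-truth answer and a (made-up) model generation.
gold = parse("$\\frac{1}{2}$")
answer = parse("The answer is 0.5")

# verify() compares the parsed expressions, handling alternative
# representations such as fractions vs. decimals.
print(verify(gold, answer))  # expected: True
```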
In the meantime, we have been using LLM Foundry for model development. LLM Foundry has its own procedure for evaluating LLM math performance. By default, it simply checks whether the LLM’s output starts with an exact match to one of a set of provided answers, after some light post-processing to “lower text and remove punctuation, articles and extra whitespace”.
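In effect, the check behaves something like the following. This is my paraphrase of the behavior described above, not LLM Foundry's actual code, and the helper names are made up.

```python
import re
import string


def normalize(text: str) -> str:
    """Lower text and remove punctuation, articles and extra whitespace
    (paraphrasing the normalization LLM Foundry describes)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def starts_with_answer(output: str, answers: list[str]) -> bool:
    """Count the output as correct if it starts with any reference answer."""
    cleaned = normalize(output)
    return any(cleaned.startswith(normalize(ans)) for ans in answers)


# Made-up illustration of the failure mode described below: the output
# merely begins with text matching the ground truth, so it is scored
# as correct regardless of what follows.
print(starts_with_answer("72\n80\n96", ["72"]))  # True
```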
This approach does not always give the desired result. For instance, in this case the model seems to be generating multiple-choice options rather than answering the question. The first of those options matches the ground truth answer, so this evaluation procedure counts it as correct even though the model did not commit to it as its actual answer. (All of the examples below come from SmolLM2-135M on GSM8K. SmolLM2-135M is a base language model and is tiny by LLM standards, so it is not expected to perform well on these problems.)
Another case leads to a comedy of errors: the model gets a numerically correct answer for entirely wrong reasons, and the evaluation procedure counts it as correct for the wrong reason.
In another case, the model gives the right answer, but the evaluation procedure counts it as incorrect because it does not put that answer at the start of its output.
Next Steps
We should be able to do better. LLM Foundry supports specifying a “metric” to use to evaluate the model’s performance. The default metric for “generation_task_with_answers” (which is at least how scripts/eval/yamls/tasks_v0.3.yaml treats GSM8K) is InContextLearningGenerationExactMatchAccuracy. We could define a custom metric that uses Math-Verify to extract the LLM’s answer and compare it to ground truth more accurately.
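A rough sketch of what such a metric could look like, written as a generic torchmetrics Metric. The class name and the update() signature are my assumptions; a real implementation would need to mirror the interface that InContextLearningGenerationExactMatchAccuracy exposes to LLM Foundry's eval loop.

```python
# Hypothetical Math-Verify-based metric (names and signature are assumptions).
import torch
from torchmetrics import Metric
from math_verify import parse, verify


class MathVerifyAccuracy(Metric):
    def __init__(self):
        super().__init__()
        self.add_state("correct", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, outputs: list[str], labels: list[list[str]]):
        # Count a generation as correct if Math-Verify judges it equivalent
        # to any of the reference answers for that example.
        for output, answers in zip(outputs, labels):
            predicted = parse(output)
            if any(verify(parse(ans), predicted) for ans in answers):
                self.correct += 1.0
            self.total += 1.0

    def compute(self):
        return self.correct / self.total
```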