Note
This is a TIL (“Today I Learned”) post. I expect it to be useful to my future self and maybe to others, but it is meant to be a quick, informal way to capture something I learned rather than a polished presentation.
My last post described limitations of LLM Foundry's default evaluation procedure for open-ended math problems and suggested that we could do better by creating a custom metric that uses Math-Verify.
I ran into a "gotcha" while creating that custom metric: the metric must have "Accuracy" in its name. Otherwise, LLM Foundry's evaluation script skips it!
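To make that concrete, here is a minimal sketch of the kind of metric I mean, written as a plain `torchmetrics.Metric` subclass that calls Math-Verify's `parse`/`verify` helpers. The class name `MathVerifyAccuracy` and the `update` signature are my own choices for illustration (the real metric has to plug into LLM Foundry's in-context-learning metric interface); the key point is simply that "Accuracy" appears in the name.

```python
import torch
from torchmetrics import Metric
from math_verify import parse, verify


class MathVerifyAccuracy(Metric):
    """Scores open-ended math answers with Math-Verify.

    The class name deliberately contains "Accuracy" so that LLM Foundry's
    evaluation script does not skip the metric (the gotcha described above).
    """

    def __init__(self):
        super().__init__()
        self.add_state("correct", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, generations: list[str], references: list[str]) -> None:
        # Parse each generated answer and gold answer, then let Math-Verify
        # decide whether they are mathematically equivalent.
        for generation, reference in zip(generations, references):
            gold = parse(reference)
            answer = parse(generation)
            self.correct += float(verify(gold, answer))
            self.total += 1.0

    def compute(self) -> torch.Tensor:
        return self.correct / self.total
```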
I created this GitHub issue to track the problem. I hope we can eliminate this behavior, or at least make it more obvious.