I have been evaluating a number of maths models. One simple question I ask is “what is 2 to the power of 4.1?” Almost every model butchers the answer; GPT-4 is the only one to get it correct out of the box. It looks like questions like this are just not what LLMs are built for. Without basic arithmetic, LLMs will not be particularly useful in highly numeric occupations. Has anyone managed to get a finetuned LLM to perform arithmetic reliably?
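For what it's worth, the expected answer is easy to check outside the model:

```python
# 2^4.1 = 2^4 * 2^0.1 ≈ 16 * 1.0718 ≈ 17.15
print(2 ** 4.1)  # ≈ 17.1484
```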
I am starting to think that the only way to do this is to outsource specific calculations to a mathematical expression parser.
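A minimal sketch of what I mean, assuming the model can be prompted to emit a bare expression string (the prompting and parsing side is hand-waved here):

```python
from sympy import sympify

def evaluate_expression(expr: str) -> str:
    """Hand the arithmetic to SymPy instead of letting the model guess."""
    return str(sympify(expr).evalf())

# e.g. the model emits "2**4.1" and we splice the real result back into its answer
print(evaluate_expression("2**4.1"))  # ≈ 17.1484
```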
I feel like verifiable math & physics simulation should be something every LLM just invokes as a tool, instead of trying to do it internally, slowly and unreliably.
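Roughly what I have in mind, sketched in a function-calling shape; the tool name and schema here are illustrative, not any particular vendor's API:

```python
from sympy import sympify

# Hypothetical calculator tool the model can request
CALCULATOR_TOOL = {
    "name": "calculate",
    "description": "Evaluate a mathematical expression exactly.",
    "parameters": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

def handle_tool_call(name: str, arguments: dict) -> str:
    # Route the model's tool call to a real evaluator instead of the model itself
    if name == "calculate":
        return str(sympify(arguments["expression"]).evalf())
    raise ValueError(f"unknown tool: {name}")
```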
I think I have to agree.