I did something similar a little while ago, but with a small 10M-parameter model trained from scratch. I didn’t get around to multiplication, but it could reliably add and subtract numbers of up to six digits. One thing I found was that getting it to show its working also improved its performance when it wasn’t showing its working.
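For anyone wanting to try this, here is a rough sketch of how a mixed dataset of "with working" and "without working" addition examples might be generated. This is not the setup described above: the text format, the function names (`addition_working`, `make_example`, `make_dataset`) and the 50/50 mix are all assumptions made up for illustration.

```python
import random


def addition_working(a: int, b: int) -> list[str]:
    """Column addition steps, least significant digit first (hypothetical format)."""
    steps = []
    carry = 0
    xs, ys = str(a)[::-1], str(b)[::-1]
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        # Each step records the two digits, the incoming carry and their sum.
        steps.append(f"{x}+{y}+{carry}={total}")
        carry = total // 10
    return steps


def make_example(a: int, b: int, show_working: bool) -> str:
    """Render one training string, with or without the intermediate steps."""
    prompt = f"{a}+{b}="
    if show_working:
        working = " ".join(addition_working(a, b))
        return f"{prompt}[{working}] {a + b}"
    return f"{prompt}{a + b}"


def make_dataset(n: int, max_digits: int = 6,
                 working_fraction: float = 0.5, seed: int = 0) -> list[str]:
    """Mix both formats so the model sees the same task with and without working."""
    rng = random.Random(seed)
    limit = 10 ** max_digits - 1
    return [
        make_example(rng.randint(0, limit), rng.randint(0, limit),
                     rng.random() < working_fraction)
        for _ in range(n)
    ]


if __name__ == "__main__":
    print(make_example(58, 67, show_working=True))
    print(make_example(58, 67, show_working=False))
```

Varying `working_fraction` would be one way to check how much the working examples help the answer-only ones, assuming the model is evaluated separately on each format.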