So Mistral-7b is a pretty impressive 7B param model … but why is it so capable? Do we have any insights into its dataset? Was it trained very far beyond the scaling limit? Any attempts at open reproductions or merges to scale up # of params?
So Mistral-7b is a pretty impressive 7B param model … but why is it so capable? Do we have any insights into its dataset? Was it trained very far beyond the scaling limit? Any attempts at open reproductions or merges to scale up # of params?
They matched parameters and tokens when training.
Podcast on Spotify “No Priors” has the CEO of Mistral on who discusses this.
I don’t know what this means but will listen to the podcast to find out