With the proof of concept done and users able to get over 180 GB/s on a PC with AMD's 3D V-Cache, it sure would be nice if we could figure out a way to use that bandwidth for CPU-based inference. I think it only worked on Windows, but if that's the case we should be able to come up with a way to do it under Linux too.
V-Cache only helps when you want to access lots of tiny chunks of data that fit inside the 96-128 MB cache.
During inference you have to read the entire multi-GB model for every token generated, so your bottleneck is still the RAM bandwidth.
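A rough sketch of the arithmetic behind this, assuming each generated token streams the whole model from memory exactly once: tokens per second can't exceed bandwidth divided by model size. The 4 GB model and the 50 GB/s RAM figure are made-up numbers for illustration only, and the function name is just something I picked. The 182 GB/s is the RAM disk headline figure from the Tom's Hardware article, not something anyone has shown to be sustainable over a multi-GB working set.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if every token needs one full pass over the model weights."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical model size, chosen only to make the arithmetic concrete.
model_size_gb = 4.0

scenarios = [
    ("system RAM (assumed 50 GB/s)", 50.0),          # assumption, not a measurement
    ("V-Cache RAM disk headline (182 GB/s)", 182.0),  # figure from the article title
]

for label, bandwidth in scenarios:
    bound = max_tokens_per_second(bandwidth, model_size_gb)
    print(f"{label}: at most {bound:.1f} tokens/s")
```

The second number only applies if the weights could actually be served at that rate, which is exactly the open question when the model is far bigger than the cache.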
The article said that is what was expected, but the gains showed up across the entire RAM drive, and the concept has been proven now. The test used a 500 MB+ block, so bigger than the cache alone.
https://www.tomshardware.com/news/amd-3d-v-cache-ram-disk-182-gbs-12x-faster-pcie-5-ssd