Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible. I've always restricted myself to models smaller than the VRAM I had.
MoE is not suited for paging because it’s essentially a random expert per token. It only improves throughput because you reduce the memory bandwidth requirements for generating a token since 1/n of the weights are accessed per token (but a different 1/n on each loop).
Shrinking them, sure, but I’ve seen nothing that indicates you can just page weights in and out without cratering your performance, the same way paging would with a non-MoE model.
Not entirely true: it’s random access within the relevant subset of experts, and since concepts are clustered, you actually have a much higher probability of hitting the same subset of experts repeatedly.
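To make the disagreement concrete, here is a toy sketch (not a measurement of any real model) that compares the hit rate of an LRU cache of experts under two made-up routing patterns: uniformly random experts, and a hypothetical "clustered" pattern where 90% of picks come from a small hot set. The expert count, cache size, and hot-set share are all assumptions chosen for illustration, and real models route to several experts per layer rather than one per token.

```python
import random
from collections import OrderedDict

def lru_hit_rate(expert_ids, cache_size):
    """Fraction of expert lookups served from an LRU cache of the given size."""
    cache = OrderedDict()
    hits = 0
    for e in expert_ids:
        if e in cache:
            hits += 1
            cache.move_to_end(e)           # mark as most recently used
        else:
            cache[e] = True                # "page in" the expert
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(expert_ids)

rng = random.Random(0)
n_experts, cache_size, n_tokens = 64, 16, 20_000  # illustrative values only

# Pattern 1: every token picks an expert uniformly at random.
uniform = [rng.randrange(n_experts) for _ in range(n_tokens)]

# Pattern 2 (assumed): 90% of picks land in a hot set of 8 experts.
hot = list(range(8))
clustered = [
    rng.choice(hot) if rng.random() < 0.9 else rng.randrange(n_experts)
    for _ in range(n_tokens)
]

uniform_rate = lru_hit_rate(uniform, cache_size)
clustered_rate = lru_hit_rate(clustered, cache_size)
print(f"uniform:   {uniform_rate:.2f}")
print(f"clustered: {clustered_rate:.2f}")
```

With uniform routing the hit rate sits near cache_size / n_experts, so most tokens trigger a page-in. With the skewed pattern most lookups hit the cache. Which pattern real routers resemble is exactly the open question here, and this sketch does not settle it.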
It’s called mixture of experts, but it’s not that concepts map cleanly or even roughly to different experts. Otherwise you wouldn’t get a new expert on every token. You have to remember these were designed to improve throughput in cloud deployments where different GPUs each load an expert. There you precisely want each expert to handle tokens at random to improve your GPU utilization rate. I have not heard of anyone training local MoE models to aid sharding.
That blog post was super interesting. It is neat that he can select experts and control the routing in the model. Not having played with the models in detail, I tended to assume the “mixing” in mixture of experts was more like a blender, haha. The models are still quite lumpy I guess!
llama.cpp is designed for partial offloading: the most important part of the model is loaded onto the GPU and the rest into system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.
Ideally you'd have (parameter count) * (bits per parameter) / 8 bytes of VRAM for the entire (presumably quantized, don't forget to account for that) model. So very approximately 16 GiB for a 34B model quantized to 4 bits per parameter.
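A rough sketch of that arithmetic, for anyone who wants to plug in their own numbers. It counts weights only (no KV cache or runtime overhead), and the function name is just made up for this example. Real 4-bit formats usually store scales alongside the weights, so the effective bits per parameter tend to be somewhat above 4.

```python
def model_size_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB: parameters * bits / 8 bytes."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30

# 34B parameters at 4 bits per parameter
print(f"{model_size_gib(34e9, 4):.1f} GiB")  # prints "15.8 GiB"
```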
You can spill to RAM, in which case you at least want enough VRAM for a single active expert, but really that's going to tank performance. If you're only "a bit" short of the full model the difference might not be all that large.
These things are memory-bandwidth limited, so if you compare RAM, VRAM, and PCIe bandwidth, what I wrote above should make sense.
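A back-of-the-envelope sketch of the bandwidth argument, assuming every active weight is read once per generated token and nothing else limits speed. The function and its parameters are hypothetical, and the bandwidth figures in the usage lines are placeholders, not measurements of any particular hardware. Here `slow_bw` stands for whichever of RAM or PCIe bandwidth applies to the spilled part.

```python
def tokens_per_second(active_bytes: float, frac_in_vram: float,
                      vram_bw: float, slow_bw: float) -> float:
    """Upper bound on decode speed when weight reads are the only cost.

    active_bytes: bytes of weights touched per token
    frac_in_vram: share of those bytes resident in VRAM (0.0 to 1.0)
    vram_bw, slow_bw: bandwidths in bytes per second
    """
    seconds = (active_bytes * frac_in_vram / vram_bw
               + active_bytes * (1.0 - frac_in_vram) / slow_bw)
    return 1.0 / seconds

# Placeholder numbers: 17 GB of active weights, 500 GB/s VRAM, 50 GB/s slow path.
for frac in (1.0, 0.9, 0.5):
    print(frac, round(tokens_per_second(17e9, frac, 500e9, 50e9), 1))
```

Put in your own bandwidth numbers and the share of the model that fits to see how much a given spill costs. For an MoE model, `active_bytes` would be the active parameters per token rather than the full model size.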
Also you should just ask your friendly local LLM these sorts of questions.
The A3B part in the name stands for `Active 3B`: during inference only about 3B parameters are active at a time, a shared core plus a subpart of the model picked based on the input (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models.