Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn...

Maxious · 2026-02-28T10:38:09 1772275089

Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task https://huggingface.co/blog/moe

There are some experiments with removing or merging experts post-training to shrink models even more https://bknyaz.github.io/blog/2026/moe/

vlovich123 · 2026-02-28T17:55:13 1772301313

MoE is not suited for paging because it’s essentially a random expert per token. It only improves throughput because you reduce the memory bandwidth requirements for generating a token since 1/n of the weights are accessed per token (but a different 1/n on each loop).

Now shrinking them, sure, but I’ve seen nothing that indicates you can just page weights in and out without cratering your performance, the same way you would with a non-MoE model.

FuckButtons · 2026-02-28T19:00:06 1772305206

Not entirely true. It’s random access within the relevant subset of experts, and since concepts are clustered, you actually have a much higher probability of repeatedly hitting the same subset of experts.
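
The two positions above can be turned into a toy experiment: keep a fixed number of experts resident in an LRU cache and count how often a token's experts are already there. This is only a sketch. The expert count, cache size, top-k, and especially the Zipf-style "clustered" routing weights are assumptions made up for illustration, not measurements from any real model, so it shows how much the answer depends on routing skew rather than which commenter is right.

```python
import random
from collections import OrderedDict

def simulate_hit_rate(weights, cache_size, top_k=2, tokens=20000, seed=0):
    """Fraction of expert lookups served from an LRU cache holding `cache_size` experts."""
    rng = random.Random(seed)
    experts = list(range(len(weights)))
    cache = OrderedDict()  # keys are resident expert ids, ordered oldest -> newest
    hits = lookups = 0
    for _ in range(tokens):
        # Draw top_k distinct experts for this token from the routing weights.
        chosen = set()
        while len(chosen) < top_k:
            chosen.add(rng.choices(experts, weights=weights)[0])
        for expert in sorted(chosen):
            lookups += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)      # mark as most recently used
            else:
                cache[expert] = True           # "page in" the expert
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used
    return hits / lookups

NUM_EXPERTS = 64   # hypothetical
CACHE_SIZE = 16    # hypothetical: a quarter of the experts fit in fast memory

uniform = [1.0] * NUM_EXPERTS                             # "random expert per token"
clustered = [1.0 / (i + 1) for i in range(NUM_EXPERTS)]   # assumed Zipf-like skew

print(f"uniform routing hit rate:   {simulate_hit_rate(uniform, CACHE_SIZE):.2f}")
print(f"clustered routing hit rate: {simulate_hit_rate(clustered, CACHE_SIZE):.2f}")
```

With uniform routing the hit rate lands near the fraction of experts that fit in the cache, so most tokens trigger a page-in. Any skew raises it, and whether real routers are skewed enough to matter is exactly the open question in this thread.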

vlovich123 · 2026-03-01T01:26:17 1772328377

It’s called mixture of experts, but it’s not that concepts map cleanly or even roughly to different experts. Otherwise you wouldn’t get a new expert on every token. You have to remember these were designed to improve throughput in cloud deployments where different GPUs each load an expert. There you precisely want each expert to handle tokens at random to improve your GPU utilization rate. I have not heard of anyone training local MoE models to aid sharding.

cagenut · 2026-03-01T02:10:04 1772331004

is there anywhere good to read/follow to get operational clarity on this stuff?

my current system of looking for 1 in 1000 posts on HN or 1 in 100 on r/LocalLLaMA is tedious.

p1esk · 2026-03-01T04:51:55 1772340715

Ask any of the models to explain this to you

bee_rider · 2026-02-28T16:10:18 1772295018

That blog post was super interesting. It is neat that he can select experts and control the routing in the model. Not having played with the models in detail, I tended to assume the “mixing” in mixture of experts was more like a blender, haha. The models are still quite lumpy I guess!

segmondy · 2026-02-28T10:00:32 1772272832

llama.cpp is designed for partial offloading: the most important part of the model will be loaded into the GPU and the rest into system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.

pyuser583 · 2026-03-01T05:33:24 1772343204

How much do you use?

I have lots of trouble figuring out what the limits are of a system with x amount of VRAM and y amount of RAM. How do you determine this?

fc417fc802 · 2026-03-01T09:00:19 1772355619

Ideally you'd have (parameter count) * (bits per parameter) / 8 bytes of VRAM, enough for the entire model (presumably quantized; don't forget to account for that). So very approximately 16 GiB for a 34B model quantized to 4 bits per parameter.
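
As a quick check of that arithmetic, here is a minimal sketch of the weights-only estimate. The function name is made up for illustration. It ignores the KV cache, activations, and the per-block scale data that quantized formats store alongside the weights, so treat the result as a floor rather than a real requirement.

```python
def model_size_gib(param_count, bits_per_param):
    """Weights-only size: parameters * bits, converted to bytes and then GiB."""
    total_bytes = param_count * bits_per_param / 8
    return total_bytes / 2**30

# The example from the comment: a 34B model at 4 bits per parameter.
print(f"{model_size_gib(34e9, 4):.1f} GiB")  # 15.8 GiB

# The same model at 16 bits per parameter, for comparison.
print(f"{model_size_gib(34e9, 16):.1f} GiB")  # 63.3 GiB
```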

You can spill to RAM, in which case you at least want enough for a single active expert, but really that's going to tank performance. If you're only "a bit" short of the full model the difference might not be all that large.

These things are memory bandwidth limited so if you check out RAM, VRAM, and PCIe bandwidth what I wrote above should make sense.
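
One way to see why bandwidth dominates: each generated token has to read every active weight once, so bandwidth divided by bytes read per token gives a rough ceiling on tokens per second. The sketch below assumes exactly that and nothing else. The bandwidth figures are round placeholder numbers, not specs for any particular GPU, RAM, or PCIe generation, and the function name is hypothetical.

```python
def tokens_per_second_bound(active_params, bits_per_param, bandwidth_gb_s):
    """Upper bound on decode speed if reading the active weights were the only cost."""
    bytes_per_token = active_params * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical bandwidths in GB/s; substitute the numbers for your own hardware.
ASSUMED_BANDWIDTH = {"VRAM": 500, "system RAM": 50, "PCIe link": 25}

# 3B active parameters at 4 bits each, as in an "A3B" model quantized to Q4.
for tier, gb_s in ASSUMED_BANDWIDTH.items():
    bound = tokens_per_second_bound(3e9, 4, gb_s)
    print(f"{tier:>10}: at most ~{bound:.0f} tokens/s")
```

Real throughput will be lower than these ceilings, but the ratio between tiers is the point: weights that have to come from a slower tier cap the whole token at that tier's speed.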

Also you should just ask your friendly local LLM these sorts of questions.

pyuser583 · 2026-03-02T11:56:49 1772452609

I usually do ask the LLM what parameters to use. But that’s why I know so little about parameters!

nurettin · 2026-02-28T11:43:15 1772278995

This is why they say "A3B" meaning only 3B is active at a time, limiting VRAM usage.

Koffiepoeder · 2026-02-28T11:44:21 1772279061

The A3B part in the name stands for `Active 3B`, so for the inference jobs a core 3B is used in conjunction with another subpart of the model, based on the task (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models.