Mixture of Experts Architectures: Scaling LLMs Without Linear Cost
How sparse routing, load-balancing auxiliary losses, and expert parallelism let Mixtral, Switch Transformer, and DeepSeek-MoE match or outperform dense models at a fraction of the per-token FLOPs.
Every parameter in a dense transformer pays rent on every token. A 70B-parameter Llama variant performs roughly 140 GFLOPs of compute per token at inference (about 2 FLOPs per parameter), and that cost grows linearly with model size. Mixture of Experts (MoE) breaks the linkage between parameter count and per-token FLOPs. Mixtral 8x7B, released by Mistral AI in late 2023 (Jiang et al., arXiv:2401.04088), has 46.7B total parameters but activates only around 12.9B for any given token. It matches or outperforms Llama 2 70B on most benchmarks while using roughly five times fewer active parameters per token. The architectural trick - sparse routing through a pool of specialized feed-forward experts - dates back to the early 1990s in spirit but has only recently become production-ready.
Dense feed-forward layers and the motivation for sparsity
In a standard transformer block, after multi-head attention, every token passes through a position-wise feed-forward network (FFN) with two linear layers and a nonlinearity. The FFN holds the majority of the parameters in modern LLMs - typically about 8d^2 weights per layer for hidden dimension d (two matrices of size d x 4d), against roughly 4d^2 for the four attention projections. Sparse MoE replaces this single FFN with N parallel FFNs called experts, plus a small router network that decides which experts handle each token. Crucially, only k experts (typically k=2) are activated per token, so the FLOPs per token stay near those of a dense model whose FFN is k times as wide as one expert, while total capacity scales with N.
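The parameter accounting can be checked with a few lines of arithmetic. The sketch below assumes a plain two-matrix FFN with inner width 4d, four d x d attention projections, and no biases; the function names are illustrative, not taken from any library.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a two-matrix FFN (up and down projections), biases ignored."""
    return 2 * d_model * d_ff


def attention_params(d_model: int) -> int:
    """Weights in the Q, K, V and output projections of multi-head attention."""
    return 4 * d_model * d_model


def moe_layer_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total, active) FFN-side weights for one MoE layer, router included."""
    router = d_model * n_experts
    per_expert = ffn_params(d_model, d_ff)
    total = n_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


d = 4096
print(ffn_params(d, 4 * d) // d**2)    # FFN weights in units of d^2
print(attention_params(d) // d**2)     # attention weights in units of d^2
total, active = moe_layer_params(d, 4 * d, n_experts=8, top_k=2)
print(f"stored/active ratio: {total / active:.2f}")
```

With 8 experts and top-2 routing the layer stores about four times the weights it touches per token, which is the decoupling the rest of the article relies on.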
Shazeer et al.'s Outrageously Large Neural Networks paper (arXiv:1701.06538) introduced the modern sparsely gated MoE on LSTMs in 2017. They proposed the noisy top-k gating function G(x) = softmax(TopK(x * W_g + noise, k)), where the TopK operator sets all but the largest k logits to -inf before the softmax, so the unselected experts receive a gate value of exactly zero, and the noise term encourages exploration during training. This formulation is differentiable through the selected experts and remains the basis of essentially every MoE architecture in production today.
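A minimal sketch of the gating function in plain Python may help make the masking step concrete. It assumes the router logits x * W_g have already been computed, and it uses a fixed noise standard deviation as a simplification; the original paper learns a per-expert noise scale.

```python
import math
import random


def noisy_top_k_gate(logits, k, noise_std=0.0, rng=None):
    """Return gate values that are non-zero only for the k largest noisy logits."""
    rng = rng or random.Random(0)
    noisy = [z + rng.gauss(0.0, noise_std) for z in logits]
    # Indices of the k largest noisy logits.
    keep = set(sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k])
    # Mask everything else to -inf so the softmax assigns it exactly zero.
    masked = [z if i in keep else -math.inf for i, z in enumerate(noisy)]
    m = max(masked)
    exps = [math.exp(z - m) for z in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


gates = noisy_top_k_gate([2.0, 1.0, 0.5, -1.0], k=2)
print(gates)
```

Only the two selected experts need to run for this token, and the gate values of the selected experts sum to 1.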
Routing, capacity, and the load-balancing problem
Sparse routing has a fundamental pathology. Early in training, the router will randomly favor a few experts, those experts will get more gradient signal, they will get better, the router will favor them more, and the system collapses into using two or three experts out of dozens. To prevent this, MoE layers add an auxiliary load-balancing loss. The standard form, introduced in the GShard paper from Google (Lepikhin et al., arXiv:2006.16668) and simplified in Switch Transformer, penalizes the dot product of f_i, the fraction of tokens routed to expert i, and P_i, the average router probability assigned to expert i: L_aux = alpha * N * sum_i (f_i * P_i). The sum is smallest when both f and P are uniform at 1/N, so adding this term to the main cross-entropy loss with alpha around 0.01 pushes the router toward uniform expert utilization.
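To see how the penalty separates balanced from collapsed routing, here is a small worked example in plain Python. It assumes top-1 assignments and takes the router probabilities as given; the helper name and the toy inputs are illustrative.

```python
def load_balance_loss(router_probs, assignments, n_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i for top-1 assignments.

    router_probs: one list of N probabilities per token
    assignments:  chosen expert index per token
    """
    n_tokens = len(router_probs)
    f = [0.0] * n_experts
    for e in assignments:
        f[e] += 1.0 / n_tokens  # fraction of tokens sent to expert e
    p = [sum(tok[i] for tok in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced: four tokens spread evenly over four experts.
uniform = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], n_experts=4)
# Collapsed: every token confidently routed to expert 0.
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], n_experts=4)
print(uniform, collapsed)
```

The balanced case evaluates to alpha itself, the floor of the loss, while the collapsed case is almost four times larger.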
Even with a balancing loss, individual tokens still need a place to go. MoE layers enforce a fixed expert capacity, the maximum number of tokens any one expert will accept per batch. Capacity is parameterized by a capacity factor c, with capacity = c * (tokens_per_batch / N) for N experts. Tokens routed to a full expert are dropped: the expert computation is skipped and the token passes to the next layer through the residual connection alone. A capacity factor of 1.0 is tight and risks dropping; 1.25 to 2.0 is typical. Switch Transformer (Fedus et al., arXiv:2101.03961) showed that c=1.25 with top-1 routing works at very large scale, and that routing each token to exactly one expert reduces router computation and communication cost.
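The capacity rule and its effect on dropped tokens can be sketched as follows. Two assumptions are made here that the text does not fix: capacity is rounded up to an integer, and experts fill in token order, so later tokens are the ones dropped. Real implementations differ on both points.

```python
import math


def expert_capacity(tokens_per_batch, n_experts, capacity_factor):
    """Maximum tokens one expert accepts per batch (rounding up is an assumption)."""
    return math.ceil(capacity_factor * tokens_per_batch / n_experts)


def dispatch_with_capacity(assignments, n_experts, capacity):
    """Return (kept, dropped) token indices, filling experts in token order."""
    load = [0] * n_experts
    kept, dropped = [], []
    for tok, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(tok)
        else:
            dropped.append(tok)  # this token skips the FFN via the residual path
    return kept, dropped


assignments = [0, 0, 0, 1, 1, 2, 3, 0]  # expert 0 is oversubscribed
for c in (1.0, 1.25):
    cap = expert_capacity(len(assignments), n_experts=4, capacity_factor=c)
    _, dropped = dispatch_with_capacity(assignments, 4, cap)
    print(f"c={c}: capacity={cap}, dropped tokens={dropped}")
```

Raising the capacity factor from 1.0 to 1.25 buys one extra slot per expert in this toy batch and halves the number of dropped tokens, at the cost of padding for the experts that stay under capacity.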
Switch Transformer and the case for top-1 routing
Switch Transformer made an aggressive bet: route each token to a single expert. This roughly halves communication compared to top-2 and simplifies the gating math. Fedus et al. trained a 1.6 trillion-parameter Switch model that reached the same perplexity as the T5-XXL baseline four times faster. The now-standard router z-loss came slightly later, in the follow-up ST-MoE paper (Zoph et al., arXiv:2202.08906): an auxiliary penalty on the squared log-sum-exp of router logits that keeps the router from producing extreme values which destabilize bfloat16 training. The z-loss with coefficient around 1e-3 became table stakes in subsequent work because it directly addresses the loss spikes that haunted early MoE runs.
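The z-loss itself is only a few lines. The sketch below is a plain-Python illustration of the definition given above, averaging over tokens and using a numerically stable log-sum-exp; the function name and toy logits are assumptions for the example.

```python
import math


def router_z_loss(router_logits, coeff=1e-3):
    """Mean squared log-sum-exp of each token's router logits, scaled by coeff."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(z - m) for z in logits))  # stable LSE
        total += lse ** 2
    return coeff * total / len(router_logits)


calm = router_z_loss([[0.0, 0.0, 0.0, 0.0]])
extreme = router_z_loss([[40.0, 0.0, 0.0, 0.0]])
print(calm, extreme)
```

Small logits cost almost nothing, while a single logit of 40 produces a penalty roughly three orders of magnitude larger, which is what discourages the router from drifting into ranges where low-precision softmax becomes unreliable.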
```python
# Top-k routing with auxiliary load-balancing loss (sketch)
import torch
import torch.nn.functional as F


def moe_route(x, W_g, k=2):
    # x: [tokens, d_model], W_g: [d_model, N_experts]
    logits = x @ W_g  # [tokens, N_experts]
    probs = F.softmax(logits, dim=-1)
    topk_vals, topk_idx = probs.topk(k, dim=-1)
    # Renormalize so the k chosen experts sum to 1
    weights = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
    # Load-balancing aux loss: f_i (share of tokens hitting expert i; sums to k)
    # times P_i (mean router probability), scaled by N. Multiply by alpha outside.
    fraction = torch.zeros_like(probs).scatter_(-1, topk_idx, 1.0).mean(0)
    importance = probs.mean(0)
    aux_loss = (fraction * importance).sum() * probs.size(-1)
    return topk_idx, weights, aux_loss
```

Expert parallelism and the all-to-all bottleneck
MoE training requires a parallelism strategy not seen in dense models: expert parallelism, where different experts live on different GPUs. Because the router decides which tokens go where dynamically, every MoE layer requires two all-to-all collective communications - one to dispatch tokens to their assigned experts across devices, and one to gather the results back. The all-to-all is bandwidth-heavy and latency-sensitive. GShard, Switch Transformer, and the open-source DeepSpeed-MoE all dedicate substantial engineering to overlapping computation and communication. NVLink-connected NVIDIA H100 nodes, with 900 GB/s of bidirectional bandwidth per GPU, make MoE training practical; commodity Ethernet does not.
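A back-of-envelope estimate shows why the interconnect matters. The sketch assumes tokens are routed uniformly across devices, so a fraction (n - 1) / n of every token's copies leaves its home device, and that activations travel in a 2-byte format such as bfloat16. The function and its parameters are illustrative, not part of any framework.

```python
def all_to_all_bytes_per_layer(tokens, d_model, top_k, bytes_per_elem=2, n_devices=8):
    """Approximate bytes crossing devices for dispatch plus combine in one MoE layer."""
    off_device_fraction = (n_devices - 1) / n_devices
    one_way = tokens * top_k * d_model * bytes_per_elem * off_device_fraction
    return 2 * one_way  # dispatch to the experts, then gather the results back


traffic = all_to_all_bytes_per_layer(tokens=8192, d_model=4096, top_k=2)
print(f"{traffic / 1e9:.2f} GB per MoE layer per step")
```

Multiplied by the number of MoE layers and doubled again for the backward pass, this traffic has to be hidden behind computation on every training step.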
Inference is its own puzzle. Activated parameters per token are small, but total memory footprint is enormous. Mixtral 8x7B needs roughly 93 GB for its weights at fp16, even though only 13B parameters do work per token. Serving stacks such as vLLM and TensorRT-LLM batch tokens per expert to amortize the cost, and some systems offload inactive experts to CPU memory. The variable compute per request - if a batch happens to route heavily to one expert, that expert becomes the bottleneck - makes latency prediction harder than for dense models.
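The memory figure follows directly from the parameter count. This sketch counts weights only, using the 46.7B total quoted earlier, and ignores the KV cache, activations, and framework overhead, all of which add to the real footprint.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


TOTAL_PARAMS = 46.7e9  # every expert must be resident, not just the active ones

for name, nbytes in [("fp16", 2), ("fp8", 1), ("int4", 0.5)]:
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, nbytes):.1f} GB")
```

At 16-bit precision the weights exceed a single 80 GB accelerator, while at 8-bit they fit with room left for the KV cache.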
Mixtral and the modern open-weights MoE recipe
Mixtral 8x7B is the model that brought MoE to mainstream open-weights practice. Its recipe is deliberately simple. Take a transformer with the same attention and norm choices as Mistral 7B. Replace each FFN with 8 expert FFNs and a top-2 router. Each expert is a SwiGLU FFN with the same hidden dimension as the dense Mistral 7B FFN. The gate applies a softmax over the top-2 router logits, and the two weighted expert outputs are summed. The paper does not detail its training losses, though an auxiliary load-balancing loss and a z-loss are the usual companions to this design. The result is a model that matches or beats GPT-3.5 on most reported benchmarks, including MMLU and GSM8K, and whose weights fit on a single 80GB H100 at 8-bit precision.
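The headline parameter counts can be reproduced from the architecture. The sketch below uses the hyperparameters reported in the Mixtral paper (dim 4096, 32 layers, FFN hidden dim 14336, 32 query heads, 8 key/value heads of size 128, vocabulary 32000) and assumes an untied output head and two RMSNorm weight vectors per layer; those last two details are assumptions of this example.

```python
def mixtral_param_count(n_experts_used):
    """Parameter count when n_experts_used of the 8 experts are counted per layer."""
    dim, n_layers, hidden_dim = 4096, 32, 14336
    n_heads, n_kv_heads, head_dim = 32, 8, 128
    vocab, n_experts = 32000, 8

    # Grouped-query attention: full-size Wq and Wo, smaller Wk and Wv.
    attn = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim
    expert = 3 * dim * hidden_dim   # SwiGLU uses gate, up and down matrices
    router = dim * n_experts        # the router always scores all 8 experts
    norms = 2 * dim                 # pre-attention and pre-FFN norm weights
    per_layer = attn + n_experts_used * expert + router + norms

    embeddings = 2 * vocab * dim    # input embedding plus untied output head
    final_norm = dim
    return n_layers * per_layer + embeddings + final_norm


print(f"total:  {mixtral_param_count(8) / 1e9:.1f}B")
print(f"active: {mixtral_param_count(2) / 1e9:.1f}B")
```

The name "8x7B" is therefore misleading: attention, embeddings, and norms are shared across experts, so the total is well under 56B and the active count is well under 14B.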
MoE is not free capacity. You pay for it in memory, in network bandwidth, and in engineering complexity. What you buy is a different point on the FLOPs-per-token versus capability frontier, and that point has turned out to be the right one for serving frontier-quality models on commodity hardware.
DeepSeek-MoE and the fine-grained, shared-expert design
DeepSeek-MoE (Dai et al., arXiv:2401.06066) introduced two refinements that have since influenced many later MoE designs. The first is fine-grained experts: instead of 8 large experts with FFN hidden dimension d, use 64 small experts with hidden dimension d/8 and route top-8 of them, which costs the same per-token compute as one large expert. The combinatorial routing space increases dramatically, allowing specialization on much narrower token distributions. The second is shared experts: a small number of experts (typically 1 or 2) that every token always activates, alongside the sparsely routed ones. Shared experts capture common patterns - grammar, frequent tokens, low-level syntax - while routed experts can specialize without re-learning these basics.
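The size of the routing space is easy to quantify. The sketch compares a coarse layout with the fine-grained one described above, using an illustrative FFN width of 8192; the width and helper names are assumptions for the example.

```python
import math


def routing_combinations(n_experts, top_k):
    """Number of distinct expert subsets a token can be routed to."""
    return math.comb(n_experts, top_k)


def active_width(top_k, expert_width):
    """Total FFN hidden width touched per token."""
    return top_k * expert_width


d_ff = 8192  # illustrative width of one large expert
print(routing_combinations(8, 1), active_width(1, d_ff))        # coarse, top-1
print(routing_combinations(64, 8), active_width(8, d_ff // 8))  # fine-grained, top-8
```

Both layouts touch the same hidden width per token, but the fine-grained one can choose among billions of expert subsets instead of eight.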
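DeepSeek-V3 extended this with auxiliary-loss-free balancing (Wang et al., arXiv:2408.15664), replacing most of the explicit load-balancing penalty with a per-expert bias term that is added to the router scores for top-k selection only. After each training step the bias is nudged down by a fixed step for overloaded experts and up for underloaded ones. DeepSeek-V2, by contrast, still relied on auxiliary balance losses. Moving the balancing out of the loss removes a conflicting gradient signal from the main objective and reportedly improves convergence.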
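A toy simulation can illustrate how a routing bias restores balance without touching the loss. Everything here is an assumption made for illustration: the router scores are a fixed per-expert affinity plus uniform noise, expert 0 starts strongly favored, and the bias moves by a fixed step gamma in the direction that evens out load. It is a sketch of the idea, not DeepSeek's implementation.

```python
import random


def route_top_k(scores, bias, k):
    """Select the k experts with the highest score + bias (bias affects selection only)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return order[:k]


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


def simulate(steps=200, tokens=512, n_experts=4, k=1, gamma=0.01, seed=0):
    rng = random.Random(seed)
    affinity = [1.0] + [0.0] * (n_experts - 1)  # expert 0 starts strongly favored
    bias = [0.0] * n_experts
    max_share = []
    for _ in range(steps):
        load = [0] * n_experts
        for _ in range(tokens):
            scores = [a + rng.uniform(-1.0, 1.0) for a in affinity]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        max_share.append(max(load) / (tokens * k))
        bias = update_bias(bias, load, gamma)
    return max_share, bias


shares, final_bias = simulate()
print(f"busiest expert share: first step {shares[0]:.2f}, last step {shares[-1]:.2f}")
```

In this toy setup the busiest expert starts out with well over half the tokens and ends near the uniform share of one quarter, with no balancing term in the training objective.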
When MoE beats dense, and when it does not
- MoE wins for inference-throughput-bound serving where memory is abundant but per-token latency matters - exactly the regime of API providers running fp8 H100 clusters.
- MoE wins at fixed training FLOPs: Switch Transformer reports up to 7x faster pre-training than dense T5 baselines at equal compute per token, and its largest model reaches T5-XXL quality about four times faster.
- MoE loses when memory is the binding constraint - on a 24GB consumer GPU, a 7B dense model beats a 47B MoE that simply will not fit.
- MoE loses on small training budgets where the router has not yet settled and the auxiliary losses still compete with the main objective; as a rough rule of thumb, below about 100B training tokens.
- MoE introduces serving complexity around expert placement, batch sizing, and dynamic load that dense models avoid entirely.
- Fine-tuning MoE is harder than dense fine-tuning because the router can collapse to a small expert set on narrow distributions, so methods like LoRA on shared parameters plus expert freezing are common workarounds.
Looking ahead, MoE design is converging on a few stable choices: fine-grained experts with shared experts, top-k routing with k between 2 and 8, z-loss for stability, auxiliary-loss-free balancing, and aggressive expert parallelism with overlapped all-to-all. As frontier labs scale toward trillion-parameter active counts and tens of trillions of total parameters, the gap between MoE and dense will likely widen. The boring news is that sparse models work; the exciting news is that we are still discovering how much further the sparsity dial can be turned.