How Custom Triton Kernels Achieved 212 tok/s on AMD — 223% of Stock llama.cpp
Conventional wisdom says you can't beat hand-optimized C++ with Python. We did it anyway. Stock llama.cpp achieves 95 tokens per second on our hardware. We developed custom Triton INT4 kernels that achieve 212 tok/s on an AMD 7900 XTX — 223% of the C++ baseline. This paper documents the optimization journey: the dead ends, the breakthroughs, and the key insight that memory layout matters more than algorithmic cleverness.
llama.cpp has become the de facto benchmark for local LLM inference. Its hand-tuned C++ kernels, vectorized assembly, and careful memory management deliver impressive throughput. On our AMD 7900 XTX with ROCm 7.1, stock llama.cpp achieves 95 tokens per second on a 20B parameter model.
Our goal was to match and exceed this performance using PyTorch and Triton — tools that offer portability and ease of modification at the cost of raw speed. Or so we thought.
The hardware and software stack:
- GPU: AMD Radeon RX 7900 XTX (24GB VRAM, ~960 GB/s memory bandwidth)
- Software: ROCm 7.1, PyTorch, Triton
- Model: 20B parameters
Our initial PyTorch implementation achieved 42 tok/s in eager mode. Adding torch.compile with max-autotune pushed this to 66 tok/s — a respectable 57% improvement from a single line of code, but still only about 70% of llama.cpp's throughput.
| Configuration | Throughput | % of Target |
|---|---|---|
| FP16 eager mode | 42 tok/s | 44% |
| FP16 + torch.compile | 66 tok/s | 69% |
| Stock llama.cpp (target) | 95 tok/s | 100% |
The problem was clear: we were memory-bandwidth bound. At FP16 precision, each forward pass reads approximately 40GB of weights. With 960 GB/s bandwidth, theoretical maximum throughput is around 24 tok/s. We exceeded this because grouped-query attention and mixture-of-experts reduce effective parameter count, but we were still firmly against the memory wall.
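For concreteness, the bound can be written out directly. The helper below is ours; it just restates the arithmetic above:

```python
# Roofline estimate for memory-bound decode: every generated token must stream
# the model's weights from VRAM once, so bandwidth / bytes-per-token caps tok/s.
def bandwidth_bound_tok_s(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(bandwidth_bound_tok_s(20, 2.0, 960))   # FP16: ~24 tok/s ceiling
print(bandwidth_bound_tok_s(20, 0.5, 960))   # INT4 (dense): ~96 tok/s; GQA/MoE read
                                             # fewer bytes per token, raising the ceiling
```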
The obvious solution was quantization. Reducing weight precision from 16 bits to 8 or 4 bits would proportionally reduce memory traffic. What followed was a month of experiments, most of which failed.
INT8 weight-only quantization achieved 96 tok/s — a 45% improvement over compiled FP16 and our best result for weeks. But profiling revealed the problem: 44% of execution time was spent converting INT8 weights back to FP16 for computation. The dequantization overhead consumed most of our bandwidth savings.
We tried a series of workarounds, but nothing eliminated the separate dequantization pass.
Key Insight #1: Quantization only helps if dequantization is fused with computation. Separate dequant passes just move the memory bottleneck.
The solution was to write custom Triton kernels that fuse INT4 dequantization directly into the matrix multiplication. No intermediate FP16 tensors, no separate passes — weights go from INT4 to output in a single kernel.
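A minimal sketch of the idea, assuming an offset-of-8 nibble encoding, one FP16 scale per output column, and the transposed [K//2, N] packing the next section arrives at. The kernel and wrapper names are ours, not the production kernels in the repo:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def int4_gemv_fused_kernel(x_ptr, wq_ptr, scale_ptr, y_ptr, K, N,
                           BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # One program computes BLOCK_N output features of y = x @ W for a single token.
    pid = tl.program_id(0)
    offs_n = pid * BLOCK_N + tl.arange(0, BLOCK_N)
    mask_n = offs_n < N
    acc = tl.zeros((BLOCK_N,), dtype=tl.float32)
    for k0 in range(0, K, BLOCK_K):
        # Packed weights: two nibbles per byte in a [K // 2, N] layout, so byte
        # row kb holds the weights for input rows 2*kb and 2*kb + 1.
        offs_kb = k0 // 2 + tl.arange(0, BLOCK_K // 2)
        mask_kb = offs_kb < K // 2
        packed = tl.load(wq_ptr + offs_kb[:, None] * N + offs_n[None, :],
                         mask=mask_kb[:, None] & mask_n[None, :], other=0)
        # Dequantize in registers: offset-of-8 encoding assumed (0..15 -> -8..7).
        w_even = (packed & 0x0F).to(tl.float32) - 8.0
        w_odd = ((packed >> 4) & 0x0F).to(tl.float32) - 8.0
        # Matching activations for the even / odd input rows of this block.
        offs_even = k0 + 2 * tl.arange(0, BLOCK_K // 2)
        x_even = tl.load(x_ptr + offs_even, mask=offs_even < K, other=0.0).to(tl.float32)
        x_odd = tl.load(x_ptr + offs_even + 1, mask=offs_even + 1 < K, other=0.0).to(tl.float32)
        acc += tl.sum(w_even * x_even[:, None], axis=0)
        acc += tl.sum(w_odd * x_odd[:, None], axis=0)
    # Per-output-column scale applied once at the end; result goes straight to y.
    scale = tl.load(scale_ptr + offs_n, mask=mask_n, other=0.0).to(tl.float32)
    tl.store(y_ptr + offs_n, acc * scale, mask=mask_n)

def int4_gemv_fused(x, wq, scale, BLOCK_N=64, BLOCK_K=256):
    K, N = x.numel(), scale.numel()
    y = torch.empty(N, device=x.device, dtype=torch.float32)
    int4_gemv_fused_kernel[(triton.cdiv(N, BLOCK_N),)](x, wq, scale, y, K, N,
                                                       BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
    return y
```

The important property is that the dequantized values exist only in registers; nothing is written back to global memory before the final output.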
Our initial Triton INT4 GEMM kernel achieved 15.5 tok/s. Not a typo: nearly three times slower than even eager FP16. Profiling revealed catastrophic memory access patterns: our weight layout of [N, K//2] caused strided memory access, achieving only 3% cache efficiency.
Transposing the weight layout from [N, K//2] to [K//2, N] transformed strided access into coalesced access. Throughput jumped to 51 tok/s — a 3.3x improvement from changing how we store data.
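A hypothetical host-side packing helper matching the layout the kernel sketch above assumes:

```python
import torch

# Packs a [K, N] matrix of 4-bit values (already shifted into 0..15) so that the
# nibbles for input rows k and k+1 share one byte, stored as [K // 2, N].
def pack_int4_transposed(w_int4: torch.Tensor) -> torch.Tensor:
    K, N = w_int4.shape
    lo = w_int4[0::2, :]                      # even rows -> low nibble
    hi = w_int4[1::2, :]                      # odd rows  -> high nibble
    return (lo | (hi << 4)).to(torch.uint8)   # [K // 2, N]
```

Consecutive output columns now sit next to each other in memory, so the per-column loads in the kernel coalesce instead of striding.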
Key Insight #2: Memory layout determines performance more than algorithmic sophistication. A well-laid-out naive algorithm beats a clever algorithm with poor memory access.
Standard matrix multiplication parallelizes over the output dimensions. For decode inference with batch size 1, this leaves most of the GPU idle. Split-K parallelism divides the reduction dimension across multiple thread blocks, then combines partial results with atomic operations. This pushed throughput to 77.7 tok/s.
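A sketch of the split-K pattern, shown on a plain FP16 GEMV so the grid change stays visible. In the real kernels the same change is applied to the fused INT4 path; the names and divisibility assumption are ours:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def splitk_gemv_kernel(x_ptr, w_ptr, y_ptr, K, N,
                       BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
                       SPLIT_K: tl.constexpr):
    pid_n = tl.program_id(0)            # which block of output features
    pid_k = tl.program_id(1)            # which slice of the reduction dimension
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    mask_n = offs_n < N
    acc = tl.zeros((BLOCK_N,), dtype=tl.float32)
    k_per_split = K // SPLIT_K          # assumes K is a multiple of SPLIT_K * BLOCK_K
    for k0 in range(pid_k * k_per_split, (pid_k + 1) * k_per_split, BLOCK_K):
        offs_k = k0 + tl.arange(0, BLOCK_K)
        w = tl.load(w_ptr + offs_k[:, None] * N + offs_n[None, :],
                    mask=(offs_k[:, None] < K) & mask_n[None, :], other=0.0).to(tl.float32)
        xk = tl.load(x_ptr + offs_k, mask=offs_k < K, other=0.0).to(tl.float32)
        acc += tl.sum(w * xk[:, None], axis=0)
    # Partial sums from the SPLIT_K programs are combined in global memory;
    # the caller must zero-initialize y before launch.
    tl.atomic_add(y_ptr + offs_n, acc, mask=mask_n)

def splitk_gemv(x, w, N, BLOCK_N=64, BLOCK_K=128, SPLIT_K=4):
    K = x.numel()
    y = torch.zeros(N, device=x.device, dtype=torch.float32)
    grid = (triton.cdiv(N, BLOCK_N), SPLIT_K)   # second grid axis = reduction slices
    splitk_gemv_kernel[grid](x, w, y, K, N,
                             BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K, SPLIT_K=SPLIT_K)
    return y
```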
Triton supports autotuning — testing multiple kernel configurations and selecting the fastest. We defined 11 configurations varying block sizes, split-K factors, warp counts, and pipeline stages:
- BLOCK_N: 16, 32, 64, 128
- BLOCK_K: 128, 256
- SPLIT_K: 1, 2, 4, 8, 16
- num_warps: 4, 8
- num_stages: 2, 3
The autotuner found configurations we never would have guessed. Small matrices preferred BLOCK_N=32, BLOCK_K=256, SPLIT_K=8. Larger matrices worked better with BLOCK_N=64, BLOCK_K=128, SPLIT_K=4. The difference between best and worst configurations exceeded 40%.
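In Triton this search space maps onto the @triton.autotune decorator. A sketch with the configuration list abridged; the exact 11 configurations live in the repo:

```python
import triton
import triton.language as tl

# Illustrative autotuning setup; the value combinations mirror the search space above.
configs = [
    triton.Config({'BLOCK_N': bn, 'BLOCK_K': bk, 'SPLIT_K': sk},
                  num_warps=w, num_stages=s)
    for bn, bk, sk, w, s in [
        (32, 256, 8, 4, 2),    # what the tuner ended up picking for small matrices
        (64, 128, 4, 8, 2),    # what it picked for larger ones
        (128, 128, 2, 8, 3),
        (16, 256, 16, 4, 2),
        # ... remaining candidates
    ]
]

@triton.autotune(configs=configs, key=['N', 'K'])   # re-benchmark whenever N or K changes
@triton.jit
def int4_gemv_tuned(x_ptr, wq_ptr, scale_ptr, y_ptr, K, N,
                    BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
                    SPLIT_K: tl.constexpr):
    # Body as in the split-K sketch above; meta-parameters now come from the tuner.
    pass

# The launch grid reads the winning config's meta-parameters:
# grid = lambda META: (triton.cdiv(N, META['BLOCK_N']), META['SPLIT_K'])
```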
INT4 values pack two per byte, eight per 32-bit word. Loading individual nibbles wastes memory bandwidth. We restructured the kernel to load 32-bit words and unpack eight INT4 values in registers. Combined with autotuning, this achieved 175.9 tok/s — already 185% of stock llama.cpp.
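The unpacking itself is just shifts and masks. A plain-Python illustration of one 32-bit word yielding eight values, offset-of-8 encoding assumed as above:

```python
# Eight INT4 values live in one 32-bit word; one vector load replaces eight byte loads.
def unpack_int4x8(word: int) -> list[int]:
    return [((word >> (4 * i)) & 0xF) - 8 for i in range(8)]

print(unpack_int4x8(0x76543210))   # [-8, -7, -6, -5, -4, -3, -2, -1]
```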
Profiling our 176 tok/s implementation revealed an unexpected bottleneck: KV cache cloning. Our benchmark was copying the entire key-value cache each iteration to preserve state for measurement. Real inference doesn't need this — you just append new tokens to existing cache.
Switching to pre-allocated KV cache with in-place updates eliminated this overhead:
| Configuration | Throughput | vs stock llama.cpp |
|---|---|---|
| With KV cache cloning | 175.9 tok/s | 185% |
| Pre-allocated (no clone) | 212.0 tok/s | 223% |
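The pattern itself is simple. A minimal sketch with illustrative shapes (the real engine's cache management is more involved):

```python
import torch

# The cache is allocated once at its maximum size; each decode step writes one
# new position in place instead of materializing a copy of the whole cache.
max_len, n_kv_heads, head_dim = 4096, 8, 128
k_cache = torch.empty(max_len, n_kv_heads, head_dim, device='cuda', dtype=torch.float16)
v_cache = torch.empty_like(k_cache)

def append_kv(pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
    # k_new, v_new: [n_kv_heads, head_dim] for the token just generated
    k_cache[pos].copy_(k_new)
    v_cache[pos].copy_(v_new)
    return k_cache[: pos + 1], v_cache[: pos + 1]   # views over existing memory, no clone
```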
Key Insight #3: Benchmark artifacts can hide significant performance. Always profile real inference patterns, not just convenient measurement setups.
From 42 tok/s to 212 tok/s — a 5x improvement through systematic optimization:
| Stage | Throughput | Cumulative Gain |
|---|---|---|
| FP16 baseline | 42 tok/s | — |
| + torch.compile | 66 tok/s | +57% |
| + INT8 quantization | 96 tok/s | +129% |
| + Triton INT4 (broken) | 15.5 tok/s | -63% |
| + Transposed weights | 51 tok/s | +21% |
| + Split-K parallelism | 77.7 tok/s | +85% |
| + Block size tuning | 97.5 tok/s | +132% |
| + Vectorized loads | 102.2 tok/s | +143% |
| + Extensive autotuning | 175.9 tok/s | +319% |
| + No-clone KV cache | 212.0 tok/s | +405% |
This result seems counterintuitive. How does Python-based Triton outperform hand-tuned C++? Several factors contributed: dequantization fused into the matmul so FP16 weights are never materialized, autotuned configurations matched to our exact matrix shapes, and memory layouts chosen for coalesced access.
The portability advantage is real. The same Triton code runs on NVIDIA hardware without modification. We sacrifice nothing for cross-platform support.
The optimized kernels are available at:
# Clone the Aegis repository
git clone https://github.com/tukluslabs/aegis
# Run the benchmark
cd aegis
python3 AEGIS/inference/benchmark_triton_int4_noclone.py
Requirements: ROCm 7.x or CUDA 12.x, PyTorch 2.10+, Triton 3.x
We set out to match stock llama.cpp's 95 tok/s. We achieved 212 tok/s -- more than double the C++ baseline. The key was not algorithmic brilliance but systematic attention to memory access patterns, aggressive operator fusion, and extensive autotuning.
The broader lesson: the performance ceiling for high-level tools is higher than commonly assumed. With careful optimization, PyTorch and Triton can compete with and exceed hand-tuned C++. The days of rewriting everything in low-level code for performance may be ending.
Following the 212 tok/s benchmark result, we built a production-oriented inference engine designed for integration with the AEGIS Pensive memory system. The key requirement: constant decode speed regardless of how long the conversation has been running.
Standard KV caches grow with sequence length. As conversations get longer, attention over the cache takes more time, and decode speed drops. For a memory system that ingests thousands of documents, this degradation makes interactive use impractical at scale.
The MXFP4 engine uses a fixed-window KV cache that holds only the most recent N tokens (default: 512). Older tokens are evicted, and the retrieval system (Pensive) handles long-term context. This architecture decouples decode speed from conversation length.
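A minimal sketch of the window as a ring buffer. Shapes and names are illustrative; the real engine also has to keep position handling consistent across evictions:

```python
import torch

# Fixed-window KV cache: once full, the oldest entry is overwritten, so attention
# cost and memory stay constant regardless of conversation length.
WINDOW, n_kv_heads, head_dim = 512, 8, 128
k_win = torch.zeros(WINDOW, n_kv_heads, head_dim, device='cuda', dtype=torch.float16)
v_win = torch.zeros_like(k_win)

def window_append(step: int, k_new: torch.Tensor, v_new: torch.Tensor) -> int:
    slot = step % WINDOW                  # evict-oldest once the window is full
    k_win[slot].copy_(k_new)
    v_win[slot].copy_(v_new)
    return min(step + 1, WINDOW)          # number of valid cache entries this step
```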
Memory impact: the KV cache drops from 192MB (growing) to 24MB (fixed) -- an 87.5% reduction. This frees VRAM for larger models or batch inference.
The fixed-window cache has a constant memory layout, which enables CUDA graph capture for the entire decode loop. This eliminates kernel launch overhead, Python interpreter overhead, and dynamic memory allocation -- achieving zero-overhead decode after the initial warmup.
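A sketch of the capture-and-replay pattern using PyTorch's torch.cuda.graph interface (which the engine uses on ROCm). decode_step here is a trivial stand-in for the engine's single-token forward, and the sizes are made up:

```python
import torch

# Stand-in for the real decode step against the fixed-window cache.
vocab, hidden = 32000, 4096
embed = torch.nn.Embedding(vocab, hidden, device='cuda', dtype=torch.float16)
lm_head = torch.nn.Linear(hidden, vocab, device='cuda', dtype=torch.float16)

def decode_step(token: torch.Tensor) -> torch.Tensor:
    return lm_head(embed(token))          # [1, vocab] logits

static_token = torch.zeros(1, dtype=torch.long, device='cuda')

# A few eager warm-up iterations before capture are recommended.
for _ in range(3):
    decode_step(static_token)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_logits = decode_step(static_token)   # work is recorded, not run eagerly

def decode(token_id: int) -> torch.Tensor:
    static_token.fill_(token_id)   # write into the captured input buffer
    graph.replay()                 # replay the whole step with no launch overhead
    return static_logits
```

Capture only works because every step sees the same shapes and the same memory addresses, which the fixed window guarantees.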
| Prefill Length | Growing Cache | Fixed Window (512) | Improvement |
|---|---|---|---|
| 100 tokens | 152.9 tok/s | 152.8 tok/s | 0% |
| 1,000 tokens | 143.9 tok/s | 150.2 tok/s | +4.4% |
| 2,000 tokens | 135.5 tok/s | 150.3 tok/s | +10.9% |
At 2,000 tokens of prefill, the growing cache has already degraded by 11.4%. The fixed window holds steady. The gap widens with longer conversations -- exactly the use case AEGIS was designed for.
The MXFP4 engine was designed for a custom MoE architecture: 32.8B total parameters, 10.1B active (30.9%), 20 layers, 16 experts with top-4 routing, GQA with 40 query heads and 8 KV heads. At INT4 quantization, the full model fits in 21.2GB on the 7900 XTX's 24GB VRAM, leaving headroom for the fixed-window KV cache and Pensive retrieval operations.
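For reference, the quoted architecture as a config sketch; the field names are ours:

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    total_params_b: float = 32.8
    active_params_b: float = 10.1
    n_layers: int = 20
    n_experts: int = 16
    experts_per_token: int = 4    # top-4 routing
    n_query_heads: int = 40       # GQA
    n_kv_heads: int = 8
```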
Key Insight #4: For memory-augmented systems, the KV cache should hold the working set, not the full history. Let the retrieval system handle long-term context. The inference engine handles immediate generation.
Questions or comments? Reach out at [email protected]