How Custom Triton Kernels Achieved 212 tok/s on AMD — 223% of Stock llama.cpp
Conventional wisdom says you can't beat hand-optimized C++ with Python. We did it anyway. Stock llama.cpp achieves 95 tokens per second on our hardware. We developed custom Triton INT4 kernels that achieve 212 tok/s on an AMD 7900 XTX — 223% of the C++ baseline. This paper documents the optimization journey: the dead ends, the breakthroughs, and the key insight that memory layout matters more than algorithmic cleverness.
llama.cpp has become the de facto benchmark for local LLM inference. Its hand-tuned C++ kernels, vectorized assembly, and careful memory management deliver impressive throughput. On our AMD 7900 XTX with ROCm 7.1, stock llama.cpp achieves 95 tokens per second on a 20B parameter model.
Our goal was to match and exceed this performance using PyTorch and Triton — tools that offer portability and ease of modification at the cost of raw speed. Or so we thought.
The hardware and software stack:
- GPU: AMD Radeon RX 7900 XTX (24GB VRAM, ~960 GB/s memory bandwidth)
- Software: ROCm 7.1, PyTorch, Triton
- Model: 20B parameters
Our initial PyTorch implementation achieved 42 tok/s in eager mode. Adding torch.compile with max-autotune pushed this to 66 tok/s — a respectable 57% improvement from a single line of code, but still only about 70% of llama.cpp's throughput.
| Configuration | Throughput | % of Target |
|---|---|---|
| FP16 eager mode | 42 tok/s | 44% |
| FP16 + torch.compile | 66 tok/s | 69% |
| Stock llama.cpp (target) | 95 tok/s | 100% |
The problem was clear: we were memory-bandwidth bound. At FP16 precision, each forward pass reads approximately 40GB of weights. With 960 GB/s bandwidth, theoretical maximum throughput is around 24 tok/s. We exceeded this because grouped-query attention and mixture-of-experts reduce effective parameter count, but we were still firmly against the memory wall.
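For concreteness, the bound can be written out directly. The helper below is ours; it just restates the arithmetic above:

```python
# Roofline estimate for memory-bound decode: every generated token must stream
# the model's weights from VRAM once, so bandwidth / bytes-per-token caps tok/s.
def bandwidth_bound_tok_s(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(bandwidth_bound_tok_s(20, 2.0, 960))   # FP16: ~24 tok/s ceiling
print(bandwidth_bound_tok_s(20, 0.5, 960))   # INT4 (dense): ~96 tok/s; GQA/MoE read
                                             # fewer bytes per token, raising the ceiling
```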
The obvious solution was quantization. Reducing weight precision from 16 bits to 8 or 4 bits would proportionally reduce memory traffic. What followed was a month of experiments, most of which failed.
INT8 weight-only quantization achieved 96 tok/s — a 45% improvement over compiled FP16 and our best result for weeks. But profiling revealed the problem: 44% of execution time was spent converting INT8 weights back to FP16 for computation. The dequantization overhead consumed most of our bandwidth savings.
We tried a series of workarounds, but nothing eliminated the separate dequantization pass.
Key Insight #1: Quantization only helps if dequantization is fused with computation. Separate dequant passes just move the memory bottleneck.
The solution was to write custom Triton kernels that fuse INT4 dequantization directly into the matrix multiplication. No intermediate FP16 tensors, no separate passes — weights go from INT4 to output in a single kernel.
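A minimal sketch of the idea, assuming an offset-of-8 nibble encoding, one FP16 scale per output column, and the transposed [K//2, N] packing the next section arrives at. The kernel and wrapper names are ours, not the production kernels in the repo:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def int4_gemv_fused_kernel(x_ptr, wq_ptr, scale_ptr, y_ptr, K, N,
                           BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # One program computes BLOCK_N output features of y = x @ W for a single token.
    pid = tl.program_id(0)
    offs_n = pid * BLOCK_N + tl.arange(0, BLOCK_N)
    mask_n = offs_n < N
    acc = tl.zeros((BLOCK_N,), dtype=tl.float32)
    for k0 in range(0, K, BLOCK_K):
        # Packed weights: two nibbles per byte in a [K // 2, N] layout, so byte
        # row kb holds the weights for input rows 2*kb and 2*kb + 1.
        offs_kb = k0 // 2 + tl.arange(0, BLOCK_K // 2)
        mask_kb = offs_kb < K // 2
        packed = tl.load(wq_ptr + offs_kb[:, None] * N + offs_n[None, :],
                         mask=mask_kb[:, None] & mask_n[None, :], other=0)
        # Dequantize in registers: offset-of-8 encoding assumed (0..15 -> -8..7).
        w_even = (packed & 0x0F).to(tl.float32) - 8.0
        w_odd = ((packed >> 4) & 0x0F).to(tl.float32) - 8.0
        # Matching activations for the even / odd input rows of this block.
        offs_even = k0 + 2 * tl.arange(0, BLOCK_K // 2)
        x_even = tl.load(x_ptr + offs_even, mask=offs_even < K, other=0.0).to(tl.float32)
        x_odd = tl.load(x_ptr + offs_even + 1, mask=offs_even + 1 < K, other=0.0).to(tl.float32)
        acc += tl.sum(w_even * x_even[:, None], axis=0)
        acc += tl.sum(w_odd * x_odd[:, None], axis=0)
    # Per-output-column scale applied once at the end; result goes straight to y.
    scale = tl.load(scale_ptr + offs_n, mask=mask_n, other=0.0).to(tl.float32)
    tl.store(y_ptr + offs_n, acc * scale, mask=mask_n)

def int4_gemv_fused(x, wq, scale, BLOCK_N=64, BLOCK_K=256):
    K, N = x.numel(), scale.numel()
    y = torch.empty(N, device=x.device, dtype=torch.float32)
    int4_gemv_fused_kernel[(triton.cdiv(N, BLOCK_N),)](x, wq, scale, y, K, N,
                                                       BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
    return y
```

The important property is that the dequantized values exist only in registers; nothing is written back to global memory before the final output.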
Our initial Triton INT4 GEMM kernel achieved 15.5 tok/s. Not a typo: nearly three times slower than even eager FP16. Profiling revealed catastrophic memory access patterns: our weight layout of [N, K//2] caused strided memory access, achieving only 3% cache efficiency.
Transposing the weight layout from [N, K//2] to [K//2, N] transformed strided access into coalesced access. Throughput jumped to 51 tok/s — a 3.3x improvement from changing how we store data.
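A hypothetical host-side packing helper matching the layout the kernel sketch above assumes:

```python
import torch

# Packs a [K, N] matrix of 4-bit values (already shifted into 0..15) so that the
# nibbles for input rows k and k+1 share one byte, stored as [K // 2, N].
def pack_int4_transposed(w_int4: torch.Tensor) -> torch.Tensor:
    K, N = w_int4.shape
    lo = w_int4[0::2, :]                      # even rows -> low nibble
    hi = w_int4[1::2, :]                      # odd rows  -> high nibble
    return (lo | (hi << 4)).to(torch.uint8)   # [K // 2, N]
```

Consecutive output columns now sit next to each other in memory, so the per-column loads in the kernel coalesce instead of striding.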
Key Insight #2: Memory layout determines performance more than algorithmic sophistication. A well-laid-out naive algorithm beats a clever algorithm with poor memory access.
Standard matrix multiplication parallelizes over the output dimensions. For decode inference with batch size 1, this leaves most of the GPU idle. Split-K parallelism divides the reduction dimension across multiple thread blocks, then combines partial results with atomic operations. This pushed throughput to 77.7 tok/s.
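A sketch of the split-K pattern, shown on a plain FP16 GEMV so the grid change stays visible. In the real kernels the same change is applied to the fused INT4 path; the names and divisibility assumption are ours:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def splitk_gemv_kernel(x_ptr, w_ptr, y_ptr, K, N,
                       BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
                       SPLIT_K: tl.constexpr):
    pid_n = tl.program_id(0)            # which block of output features
    pid_k = tl.program_id(1)            # which slice of the reduction dimension
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    mask_n = offs_n < N
    acc = tl.zeros((BLOCK_N,), dtype=tl.float32)
    k_per_split = K // SPLIT_K          # assumes K is a multiple of SPLIT_K * BLOCK_K
    for k0 in range(pid_k * k_per_split, (pid_k + 1) * k_per_split, BLOCK_K):
        offs_k = k0 + tl.arange(0, BLOCK_K)
        w = tl.load(w_ptr + offs_k[:, None] * N + offs_n[None, :],
                    mask=(offs_k[:, None] < K) & mask_n[None, :], other=0.0).to(tl.float32)
        xk = tl.load(x_ptr + offs_k, mask=offs_k < K, other=0.0).to(tl.float32)
        acc += tl.sum(w * xk[:, None], axis=0)
    # Partial sums from the SPLIT_K programs are combined in global memory;
    # the caller must zero-initialize y before launch.
    tl.atomic_add(y_ptr + offs_n, acc, mask=mask_n)

def splitk_gemv(x, w, N, BLOCK_N=64, BLOCK_K=128, SPLIT_K=4):
    K = x.numel()
    y = torch.zeros(N, device=x.device, dtype=torch.float32)
    grid = (triton.cdiv(N, BLOCK_N), SPLIT_K)   # second grid axis = reduction slices
    splitk_gemv_kernel[grid](x, w, y, K, N,
                             BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K, SPLIT_K=SPLIT_K)
    return y
```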
Triton supports autotuning — testing multiple kernel configurations and selecting the fastest. We defined 11 configurations varying block sizes, split-K factors, warp counts, and pipeline stages:
- BLOCK_N: 16, 32, 64, 128
- BLOCK_K: 128, 256
- SPLIT_K: 1, 2, 4, 8, 16
- num_warps: 4, 8
- num_stages: 2, 3
The autotuner found configurations we never would have guessed. Small matrices preferred BLOCK_N=32, BLOCK_K=256, SPLIT_K=8. Larger matrices worked better with BLOCK_N=64, BLOCK_K=128, SPLIT_K=4. The difference between best and worst configurations exceeded 40%.
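In Triton this search space maps onto the @triton.autotune decorator. A sketch with the configuration list abridged; the exact 11 configurations live in the repo:

```python
import triton
import triton.language as tl

# Illustrative autotuning setup; the value combinations mirror the search space above.
configs = [
    triton.Config({'BLOCK_N': bn, 'BLOCK_K': bk, 'SPLIT_K': sk},
                  num_warps=w, num_stages=s)
    for bn, bk, sk, w, s in [
        (32, 256, 8, 4, 2),    # what the tuner ended up picking for small matrices
        (64, 128, 4, 8, 2),    # what it picked for larger ones
        (128, 128, 2, 8, 3),
        (16, 256, 16, 4, 2),
        # ... remaining candidates
    ]
]

@triton.autotune(configs=configs, key=['N', 'K'])   # re-benchmark whenever N or K changes
@triton.jit
def int4_gemv_tuned(x_ptr, wq_ptr, scale_ptr, y_ptr, K, N,
                    BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
                    SPLIT_K: tl.constexpr):
    # Body as in the split-K sketch above; meta-parameters now come from the tuner.
    pass

# The launch grid reads the winning config's meta-parameters:
# grid = lambda META: (triton.cdiv(N, META['BLOCK_N']), META['SPLIT_K'])
```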
INT4 values pack two per byte, eight per 32-bit word. Loading individual nibbles wastes memory bandwidth. We restructured the kernel to load 32-bit words and unpack eight INT4 values in registers. Combined with autotuning, this achieved 175.9 tok/s — already 185% of stock llama.cpp.
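The unpacking itself is just shifts and masks. A plain-Python illustration of one 32-bit word yielding eight values, offset-of-8 encoding assumed as above:

```python
# Eight INT4 values live in one 32-bit word; one vector load replaces eight byte loads.
def unpack_int4x8(word: int) -> list[int]:
    return [((word >> (4 * i)) & 0xF) - 8 for i in range(8)]

print(unpack_int4x8(0x76543210))   # [-8, -7, -6, -5, -4, -3, -2, -1]
```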
Profiling our 176 tok/s implementation revealed an unexpected bottleneck: KV cache cloning. Our benchmark was copying the entire key-value cache each iteration to preserve state for measurement. Real inference doesn't need this — you just append new tokens to existing cache.
Switching to pre-allocated KV cache with in-place updates eliminated this overhead:
| Configuration | Throughput | vs stock llama.cpp |
|---|---|---|
| With KV cache cloning | 175.9 tok/s | 185% |
| Pre-allocated (no clone) | 212.0 tok/s | 223% |
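The pattern itself is simple. A minimal sketch with illustrative shapes (the real engine's cache management is more involved):

```python
import torch

# The cache is allocated once at its maximum size; each decode step writes one
# new position in place instead of materializing a copy of the whole cache.
max_len, n_kv_heads, head_dim = 4096, 8, 128
k_cache = torch.empty(max_len, n_kv_heads, head_dim, device='cuda', dtype=torch.float16)
v_cache = torch.empty_like(k_cache)

def append_kv(pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
    # k_new, v_new: [n_kv_heads, head_dim] for the token just generated
    k_cache[pos].copy_(k_new)
    v_cache[pos].copy_(v_new)
    return k_cache[: pos + 1], v_cache[: pos + 1]   # views over existing memory, no clone
```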
Key Insight #3: Benchmark artifacts can hide significant performance. Always profile real inference patterns, not just convenient measurement setups.
From 42 tok/s to 212 tok/s — a 5x improvement through systematic optimization:
| Stage | Throughput | Cumulative Gain |
|---|---|---|
| FP16 baseline | 42 tok/s | — |
| + torch.compile | 66 tok/s | +57% |
| + INT8 quantization | 96 tok/s | +129% |
| + Triton INT4 (broken) | 15.5 tok/s | -63% |
| + Transposed weights | 51 tok/s | +21% |
| + Split-K parallelism | 77.7 tok/s | +85% |
| + Block size tuning | 97.5 tok/s | +132% |
| + Vectorized loads | 102.2 tok/s | +143% |
| + Extensive autotuning | 175.9 tok/s | +319% |
| + No-clone KV cache | 212.0 tok/s | +405% |
This result seems counterintuitive. How does Python-based Triton outperform hand-tuned C++? Several factors contributed: dequantization fused into the matmul so FP16 weights are never materialized, autotuned configurations matched to our exact matrix shapes, and memory layouts chosen for coalesced access.
The portability advantage is real. The same Triton code runs on NVIDIA hardware without modification. We sacrifice nothing for cross-platform support.
The optimized kernels are available at:
# Clone the Aegis repository
git clone https://github.com/tukluslabs/aegis
# Run the benchmark
cd aegis
python3 AEGIS/inference/benchmark_triton_int4_noclone.py
Requirements: ROCm 7.x or CUDA 12.x, PyTorch 2.10+, Triton 3.x
We set out to match stock llama.cpp's 95 tok/s. We achieved 212 tok/s -- more than double the C++ baseline. The key was not algorithmic brilliance but systematic attention to memory access patterns, aggressive operator fusion, and extensive autotuning.
The broader lesson: the performance ceiling for high-level tools is higher than commonly assumed. With careful optimization, PyTorch and Triton can compete with and exceed hand-tuned C++. The days of rewriting everything in low-level code for performance may be ending.
Following the 212 tok/s benchmark result, we built a production-oriented inference engine designed for integration with the AEGIS Pensive memory system. The key requirement: constant decode speed regardless of how long the conversation has been running.
Standard KV caches grow with sequence length. As conversations get longer, attention over the cache takes more time, and decode speed drops. For a memory system that ingests thousands of documents, this degradation makes interactive use impractical at scale.
The MXFP4 engine uses a fixed-window KV cache that holds only the most recent N tokens (default: 512). Older tokens are evicted, and the retrieval system (Pensive) handles long-term context. This architecture decouples decode speed from conversation length.
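A minimal sketch of the window as a ring buffer. Shapes and names are illustrative; the real engine also has to keep position handling consistent across evictions:

```python
import torch

# Fixed-window KV cache: once full, the oldest entry is overwritten, so attention
# cost and memory stay constant regardless of conversation length.
WINDOW, n_kv_heads, head_dim = 512, 8, 128
k_win = torch.zeros(WINDOW, n_kv_heads, head_dim, device='cuda', dtype=torch.float16)
v_win = torch.zeros_like(k_win)

def window_append(step: int, k_new: torch.Tensor, v_new: torch.Tensor) -> int:
    slot = step % WINDOW                  # evict-oldest once the window is full
    k_win[slot].copy_(k_new)
    v_win[slot].copy_(v_new)
    return min(step + 1, WINDOW)          # number of valid cache entries this step
```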
Memory impact: the KV cache drops from 192MB (growing) to 24MB (fixed) -- an 87.5% reduction. This frees VRAM for larger models or batch inference.
The fixed-window cache has a constant memory layout, which enables CUDA graph capture for the entire decode loop. This eliminates kernel launch overhead, Python interpreter overhead, and dynamic memory allocation -- achieving zero-overhead decode after the initial warmup.
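A sketch of the capture-and-replay pattern using PyTorch's torch.cuda.graph interface (which the engine uses on ROCm). decode_step here is a trivial stand-in for the engine's single-token forward, and the sizes are made up:

```python
import torch

# Stand-in for the real decode step against the fixed-window cache.
vocab, hidden = 32000, 4096
embed = torch.nn.Embedding(vocab, hidden, device='cuda', dtype=torch.float16)
lm_head = torch.nn.Linear(hidden, vocab, device='cuda', dtype=torch.float16)

def decode_step(token: torch.Tensor) -> torch.Tensor:
    return lm_head(embed(token))          # [1, vocab] logits

static_token = torch.zeros(1, dtype=torch.long, device='cuda')

# A few eager warm-up iterations before capture are recommended.
for _ in range(3):
    decode_step(static_token)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_logits = decode_step(static_token)   # work is recorded, not run eagerly

def decode(token_id: int) -> torch.Tensor:
    static_token.fill_(token_id)   # write into the captured input buffer
    graph.replay()                 # replay the whole step with no launch overhead
    return static_logits
```

Capture only works because every step sees the same shapes and the same memory addresses, which the fixed window guarantees.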
| Prefill Length | Growing Cache | Fixed Window (512) | Improvement |
|---|---|---|---|
| 100 tokens | 152.9 tok/s | 152.8 tok/s | 0% |
| 1,000 tokens | 143.9 tok/s | 150.2 tok/s | +4.4% |
| 2,000 tokens | 135.5 tok/s | 150.3 tok/s | +10.9% |
At 2,000 tokens of prefill, the growing cache has already degraded by 11.4%. The fixed window holds steady. The gap widens with longer conversations -- exactly the use case AEGIS was designed for.
The MXFP4 engine was designed for a custom MoE architecture: 32.8B total parameters, 10.1B active (30.9%), 20 layers, 16 experts with top-4 routing, GQA with 40 query heads and 8 KV heads. At INT4 quantization, the full model fits in 21.2GB on the 7900 XTX's 24GB VRAM, leaving headroom for the fixed-window KV cache and Pensive retrieval operations.
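For reference, the quoted architecture as a config sketch; the field names are ours:

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    total_params_b: float = 32.8
    active_params_b: float = 10.1
    n_layers: int = 20
    n_experts: int = 16
    experts_per_token: int = 4    # top-4 routing
    n_query_heads: int = 40       # GQA
    n_kv_heads: int = 8
```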
Key Insight #4: For memory-augmented systems, the KV cache should hold the working set, not the full history. Let the retrieval system handle long-term context. The inference engine handles immediate generation.
Questions or comments? Reach out at [email protected]