
mud-puppy

A ROCm-First Framework for Fine-Tuning Large Language Models

Tuklus Labs | January 2026 | Version 0.3.0

Abstract

The rapid advancement of large language models has created significant demand for accessible fine-tuning tools. However, the majority of existing frameworks are optimized exclusively for NVIDIA hardware, leaving AMD GPU users without viable options. mud-puppy addresses this gap by providing a ROCm-first fine-tuning framework that maintains full compatibility with CUDA systems. Built on the HuggingFace ecosystem, mud-puppy supports supervised fine-tuning, parameter-efficient methods like LoRA and QLoRA, preference alignment algorithms, and reinforcement learning from human feedback. This paper presents the architecture, capabilities, and design philosophy behind mud-puppy.

1. Introduction

Fine-tuning pre-trained language models for specific tasks has become standard practice in applied machine learning. Organizations adapt foundation models to their domains, whether that involves medical documentation, legal analysis, customer support, or creative writing. The process typically requires substantial GPU resources and specialized software tooling.

The existing landscape presents a problem. Frameworks like Axolotl and LLaMA-Factory provide excellent fine-tuning capabilities, but they assume NVIDIA hardware. Key dependencies such as bitsandbytes, Flash Attention, and various CUDA kernels simply do not work on AMD GPUs. This forces AMD users to maintain complex workarounds or abandon their hardware entirely.

mud-puppy takes a different approach. Rather than treating ROCm support as an afterthought, it was designed from the ground up to work seamlessly on AMD hardware while remaining fully functional on NVIDIA systems. Every kernel and optimization uses portable PyTorch primitives, eliminating vendor lock-in without sacrificing performance.

2. Design Philosophy

Three principles guide mud-puppy's development:

2.1 Portability First

All quantization kernels, attention implementations, and memory optimizations use standard PyTorch operations. This means the same code runs on ROCm, CUDA, and even CPU backends without modification. We deliberately avoid inline assembly, vendor-specific extensions, and libraries that only support one platform.

2.2 Ecosystem Integration

mud-puppy builds on proven components from the HuggingFace ecosystem: transformers for model loading, datasets for data processing, PEFT for adapter methods, and TRL for preference and RL training. Users familiar with these libraries will find mud-puppy immediately accessible.

2.3 Sensible Defaults

Memory-constrained training is the norm, not the exception. Gradient checkpointing is enabled by default. BFloat16 precision is the standard choice. Token-based dynamic batching maximizes GPU utilization automatically. Users can override these settings, but the defaults work well for most scenarios.

3. Training Methods

3.1 Supervised Fine-Tuning

The most straightforward approach trains all model parameters on task-specific data. mud-puppy accepts datasets in multiple formats: chat messages with role annotations, instruction-response pairs, or raw text. The framework handles tokenization, padding, and loss masking automatically.
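To make the accepted formats concrete, here is a minimal sketch of one record in each style as JSONL. The exact field names (`messages`, `instruction`, `response`, `text`) are illustrative assumptions, not mud-puppy's documented schema:

```python
import json

# One illustrative record per accepted format. Field names are assumptions
# for illustration, not mud-puppy's documented schema.
chat_example = {
    "messages": [
        {"role": "user", "content": "Summarize ROCm in one sentence."},
        {"role": "assistant", "content": "ROCm is AMD's open GPU compute stack."},
    ]
}
instruction_example = {
    "instruction": "Translate 'good morning' to French.",
    "response": "Bonjour.",
}
text_example = {"text": "Raw free-form text for continued pre-training."}

# JSONL is simply one JSON object per line.
jsonl = "\n".join(
    json.dumps(r) for r in [chat_example, instruction_example, text_example]
)
```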

3.2 LoRA and QLoRA

Low-Rank Adaptation (LoRA) reduces trainable parameters by approximately 99% through the insertion of small adapter matrices. QLoRA extends this by loading the base model in 4-bit precision, dramatically reducing VRAM requirements. mud-puppy implements 4-bit quantization natively, without requiring bitsandbytes or any CUDA-specific libraries.
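The core of a portable 4-bit scheme is blockwise absmax quantization, which needs nothing vendor-specific. The following is a minimal pure-Python sketch of the idea; mud-puppy's actual kernels operate on PyTorch tensors, and the block layout here is an assumption for illustration:

```python
def quantize_4bit(block, levels=7):
    """Absmax quantization: scale a block of floats so its largest magnitude
    maps to `levels`, then round to signed integers in [-levels, levels]
    (representable in 4 bits). Returns the integers plus the per-block scale."""
    scale = (max(abs(x) for x in block) / levels) or 1.0  # avoid div-by-zero on all-zero blocks
    q = [max(-levels, min(levels, round(x / scale))) for x in block]
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate floats; error per element is at most scale / 2."""
    return [v * scale for v in q]
```

Because both functions are ordinary arithmetic, the same logic expressed as tensor ops runs unchanged on ROCm, CUDA, or CPU, which is the portability property the section describes.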

3.3 Preference Alignment

When training data includes human preferences (chosen versus rejected responses), mud-puppy supports several alignment algorithms, including Direct Preference Optimization (DPO).
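As a worked example of the preference-alignment objective, here is the DPO loss for a single preference pair, written as a standalone sketch. The inputs are summed log-probabilities of each response under the trained policy and a frozen reference model; this is the standard DPO formula, not mud-puppy's internal API:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where
    the margin is how much more the policy (relative to the reference)
    prefers the chosen response over the rejected one."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The loss shrinks as the policy increasingly prefers the chosen response, and `beta` controls how far the policy is allowed to drift from the reference.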

3.4 Reinforcement Learning

For full RLHF workflows, mud-puppy includes GRPO (Group Relative Policy Optimization) training. This approach uses reward signals, either from heuristics or dedicated reward models, to guide policy updates. The framework also supports training reward models themselves, including process reward models for step-level feedback.
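The "group relative" part of GRPO can be illustrated in a few lines: rewards for several sampled responses to the same prompt are normalized within the group, so each response's advantage is measured against its siblings rather than a learned value baseline. A sketch of that normalization (the surrounding policy-update loop is omitted):

```python
def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt:
    advantage_i = (r_i - mean) / std. Responses better than the group
    average get positive advantages, worse ones negative."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]
```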

4. Memory Optimization

Training large models on limited hardware requires careful memory management. mud-puppy provides several mechanisms:

4.1 Layer Streaming

The StreamWrapper utility loads model layers to GPU memory on demand during the forward pass, then offloads them afterward. This enables training models larger than available VRAM, though at the cost of increased training time due to host-device transfers.
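The control flow of layer streaming is simple even though the transfers are not. The sketch below shows the load-run-offload cycle with the device moves reduced to a recorded string; in a real wrapper the two assignments would be tensor `.to("cuda")` / `.to("cpu")` transfers, and this class is an illustration, not mud-puppy's StreamWrapper implementation:

```python
class StreamedLayer:
    """Sketch of on-demand layer streaming: weights live on the host until
    the layer is needed, are 'loaded' for the forward pass, and are
    offloaded again immediately afterward to bound peak VRAM."""

    def __init__(self, layer_fn):
        self.layer_fn = layer_fn
        self.device = "cpu"          # weights start offloaded in host memory

    def forward(self, x):
        self.device = "gpu"          # load weights to the accelerator on demand
        try:
            return self.layer_fn(x)  # run the layer while it is resident
        finally:
            self.device = "cpu"      # offload right away, even if forward raises
```

The `try/finally` matters: peak memory stays bounded by one layer plus activations, which is exactly why throughput drops while maximum model size grows.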

4.2 ZeRO Optimizer Offloading

Optimizer states can be offloaded to CPU memory, freeing GPU resources for larger batch sizes or model parameters. This technique is particularly effective for models with billions of parameters.

4.3 Dynamic Batching

Rather than fixing batch size, mud-puppy can batch by total token count. This maximizes GPU utilization when training examples vary significantly in length, avoiding the waste of padding short examples to match long ones.
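A greedy version of token-budget batching fits in a few lines. This is a sketch of the general technique, under the simplifying assumption of summed (not padded) token counts; mud-puppy's exact packing policy may differ:

```python
def token_batches(lengths, max_tokens):
    """Greedily pack consecutive example indices into batches so that the
    total token count of each batch stays within max_tokens. Long examples
    end up in small batches, short ones in large batches."""
    batches, current, total = [], [], 0
    for i, n in enumerate(lengths):
        if current and total + n > max_tokens:
            batches.append(current)   # flush: the next example would overflow
            current, total = [], 0
        current.append(i)
        total += n
    if current:
        batches.append(current)
    return batches
```

With `lengths = [10, 20, 30, 40]` and a 50-token budget, this yields three batches of roughly equal cost instead of four fixed-size batches padded to 40 tokens each.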

5. Quantization

Post-training quantization reduces model size for deployment. mud-puppy supports two approaches:

5.1 GPTQ

GPTQ compresses model weights to 4-bit integers using calibration data. The resulting models load faster and require less storage, with minimal accuracy degradation for most tasks.

5.2 Quantization-Aware Training

QAT simulates quantization effects during training, allowing the model to adapt to reduced precision. This typically produces better results than post-training quantization for aggressive compression targets.
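The mechanism behind QAT is "fake quantization": quantize-then-dequantize the weights in the forward pass so the loss sees the rounding error, while the backward pass (via a straight-through estimator in real implementations) treats the operation as the identity. A minimal sketch of the forward-pass transform, not mud-puppy's internal code:

```python
def fake_quantize(w, levels=7):
    """QAT forward pass: round weights to a 4-bit grid and map them back to
    floats, so training directly optimizes against the quantized values."""
    scale = (max(abs(x) for x in w) / levels) or 1.0
    return [max(-levels, min(levels, round(x / scale))) * scale for x in w]
```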

6. Usage

mud-puppy provides both command-line and Python interfaces:

# LoRA fine-tuning from the command line
mud-puppy meta-llama/Llama-3-8B data.jsonl --method lora --output ./outputs

# QLoRA with 4-bit base model
mud-puppy model data.jsonl --method qlora --output ./qlora-outputs

# DPO preference alignment
mud-puppy model preferences.jsonl --method preference --preference dpo

The Python API offers equivalent functionality:

from mud_puppy import TrainingConfig, run_training

config = TrainingConfig(
    model_name_or_path="meta-llama/Llama-3-8B",
    dataset_path="./data/chat.jsonl",
    output_dir="./outputs",
    finetuning_method="lora",
    precision="bf16",
)
run_training(config)

7. Performance Considerations

On AMD MI250X hardware, mud-puppy achieves throughput within 10-15% of optimized CUDA implementations on comparable NVIDIA hardware. The gap primarily stems from Flash Attention, where NVIDIA's optimized kernels outperform our portable scaled_dot_product_attention wrapper. For training runs where attention is not the bottleneck (small context lengths, heavy memory pressure), performance is essentially equivalent.

On NVIDIA hardware, mud-puppy performs identically to standard HuggingFace training, since it uses the same underlying libraries. Users can optionally enable CUDA-specific optimizations by installing additional dependencies.

8. Future Work

Several enhancements are planned for upcoming releases.

9. Conclusion

mud-puppy demonstrates that ROCm-first development is both practical and beneficial. By prioritizing portability without compromising capability, the framework serves users across hardware platforms while providing a particularly valuable option for the AMD ecosystem. The project is open source and welcomes contributions from the community.

© 2026 Tuklus Labs. Released under MIT License.