
AEGIS

An AI Operations Platform for Production Workloads

Tuklus Labs | February 2026 | Version 3.0

Abstract

AEGIS is a comprehensive cognitive architecture for deploying AI systems that interact with the real world. It combines hierarchical context retrieval (validated to 10M tokens at 98.9% accuracy), custom inference kernels (152 tok/s constant-time decode on AMD 7900 XTX), a 32.8B mixture-of-experts model, constitutional alignment with six moral evaluation gates, and execution hooks with safety constraints into a unified architecture. The system addresses practical challenges of production AI: memory constraints, latency requirements, multi-modal capabilities, alignment verification, and safe integration with external services. Version 3.0 introduces the Constitution system, the MXFP4 inference engine, and a five-layer cognitive architecture (NSSIO). The test suite comprises 759 passing tests across 156 test files.

1. Motivation

Building AI applications that move beyond chat interfaces requires solving multiple interconnected problems. A production system must retrieve relevant context efficiently, route queries to appropriate models, generate responses across modalities, and sometimes take actions in the external world. Each of these challenges has received attention in isolation, but integrating them into a coherent platform remains difficult.

Consider an AI assistant for enterprise operations. It must answer questions about company documents (requiring retrieval), handle both text and image requests (requiring multi-modal capabilities), work within GPU memory constraints (requiring efficient model management), and execute tasks like sending emails or updating records (requiring secure action execution). Building this from scratch involves substantial engineering effort and many opportunities for error.

AEGIS provides a ready-made foundation for such applications. It handles the infrastructure concerns, allowing developers to focus on application-specific logic.

2. System Architecture

AEGIS organizes functionality into interconnected subsystems:

┌─────────────────────────────────────────────────────────────┐
│                        Web Interface                        │
│              (React v1/v2/v3 + FastAPI Backend)             │
└─────────────────────────────────────────────────────────────┘
                               │
              ┌────────────────┴────────────────┐
              ▼                                 ▼
┌──────────────────────────┐      ┌──────────────────────────┐
│     NSSIO Cognitive      │      │       Constitution       │
│       Architecture       │      │    (Alignment Layer)     │
│  (5-layer affect model)  │      │  6 moral gates, 7 tiers  │
└──────────────────────────┘      └──────────────────────────┘
              │                                 │
              └────────────────┬────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│    Pensive    │      │  MXFP4 Engine │      │   Execution   │
│   (Context    │      │  + MoE Router │      │    Bridge     │
│   Retrieval)  │      │ (32.8B model) │      │   (Actions)   │
└───────────────┘      └───────────────┘      └───────────────┘
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   L1/L2/L3    │      │  Triton INT4  │      │ Constitution  │
│  Cache Tiers  │      │    Kernels    │      │  Gate Check   │
│   189K nodes  │      │   152 tok/s   │      │  0.037ms/eval │
└───────────────┘      └───────────────┘      └───────────────┘

3. Context Retrieval with Pensive

AEGIS incorporates the Pensive retrieval system as its context layer. Pensive implements a three-tier cache hierarchy with spreading activation and a 189,000-node knowledge graph, validated to 10 million tokens at 98.9% accuracy:

164,165 documents have been ingested across personal data sources (ChatGPT, Facebook, Claude, YouTube, Calendar, Maps), forming a knowledge graph with 189,000 nodes and 131,000 edges. The hybrid BM25 + dense vector approach eliminates the lost-in-the-middle phenomenon entirely -- middle-context accuracy is 99.3% at 10M tokens, the highest of any position.
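Pensive's internals are covered in its own paper, but the core idea of hybrid retrieval can be illustrated with a small sketch. The function below combines a lexical (BM25-style) score with dense cosine similarity via a weighted sum; the function names, weighting scheme, and `alpha` parameter are illustrative assumptions, not Pensive's actual code.

```python
# Illustrative sketch of hybrid lexical + dense score fusion (hypothetical,
# not Pensive's real implementation).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(candidates, query_vec, alpha=0.5):
    """Each candidate: (doc_id, bm25_score, embedding). Returns ids, best first."""
    # Normalize BM25 scores to [0, 1] so the two signals are comparable.
    max_bm25 = max((c[1] for c in candidates), default=1.0) or 1.0
    scored = [
        (doc_id, alpha * (bm25 / max_bm25) + (1 - alpha) * cosine(emb, query_vec))
        for doc_id, bm25, emb in candidates
    ]
    return [doc_id for doc_id, _ in sorted(scored, key=lambda p: p[1], reverse=True)]
```

Because the lexical and dense signals fail in different ways, their fusion is one plausible reason middle-context accuracy holds up at scale.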

Pensive is described in detail in its own technical paper. Within AEGIS, it provides the foundation for any operation requiring document knowledge or conversation memory.

4. Mixture-of-Experts Routing

Running large language models on consumer hardware requires careful memory management. A 70 billion parameter model exceeds the VRAM of most GPUs. Even loading multiple smaller specialized models simultaneously may not be feasible.

The Expert Manager addresses this through dynamic model loading. Rather than keeping all models in GPU memory, it loads experts on demand and evicts them when memory pressure increases.

4.1 Expert Lifecycle

Each expert model exists in one of three states:

The Expert Manager maintains a VRAM budget (configurable, default 256MB headroom). Before loading a new expert, it checks available memory and evicts least-recently-used experts if necessary.

4.2 Eviction Policies

Two eviction strategies are available: least-recently-used (LRU) and least-frequently-used (LFU).

LRU works well when access patterns are temporally clustered. LFU performs better when some experts are consistently more popular than others.
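The budget-and-evict loop can be sketched as follows. The class name, method signatures, and bookkeeping are hypothetical, not the actual Expert Manager API; the point is the mechanism: check the VRAM budget before loading, and evict by policy until the new expert fits.

```python
# Minimal sketch of budget-aware expert eviction (hypothetical API).
from collections import OrderedDict

class ExpertManager:
    def __init__(self, budget_mb, policy="lru"):
        self.budget_mb = budget_mb
        self.policy = policy
        self.resident = OrderedDict()   # name -> size_mb; insertion order = recency
        self.hits = {}                  # name -> access count (used by LFU)

    def _used(self):
        return sum(self.resident.values())

    def load(self, name, size_mb):
        if name in self.resident:                  # already resident: refresh recency
            self.resident.move_to_end(name)
        else:
            while self.resident and self._used() + size_mb > self.budget_mb:
                self._evict()
            self.resident[name] = size_mb          # stand-in for the real GPU load
        self.hits[name] = self.hits.get(name, 0) + 1

    def _evict(self):
        if self.policy == "lfu":
            victim = min(self.resident, key=lambda n: self.hits.get(n, 0))
        else:                                      # default: LRU (oldest entry)
            victim = next(iter(self.resident))
        del self.resident[victim]
```

An `OrderedDict` gives LRU almost for free: `move_to_end` on access keeps the oldest entry at the front as the eviction candidate.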

4.3 Optional Gating Network

For applications with many experts, manually selecting which expert handles each query becomes impractical. AEGIS supports an optional gating network: a small model that examines query embeddings and predicts which expert(s) should respond.

The gating network can route to a single expert (hard routing) or blend outputs from multiple experts (soft routing). Soft routing produces more nuanced responses but requires loading multiple experts simultaneously.
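Hard and soft routing can be expressed in a few lines. The sketch below scores each expert with a linear gate, converts scores to routing weights with a softmax, and selects the top-k; everything here (the linear gate, top-k renormalization) is an illustrative assumption rather than AEGIS's actual gating network.

```python
# Sketch of hard (k=1) vs. soft (k>1) gating over query embeddings.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(query_emb, gate_weights, k=1):
    """gate_weights: one weight vector per expert. Returns [(expert_idx, weight)]."""
    scores = [sum(q * w for q, w in zip(query_emb, wv)) for wv in gate_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)      # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]
```

With `k=1` the output is a single expert with weight 1.0 (hard routing); with `k>1` the weights blend several experts' outputs, at the cost of keeping all of them resident in VRAM.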

5. LLM Inference on ROCm

AEGIS includes two inference engines, both running on AMD 7900 XTX with ROCm 7.1:

5.1 Triton INT4 Engine (212 tok/s)

Custom Triton GEMV kernels with fused INT4 dequantization achieve 212 tokens/second -- 223% of stock llama.cpp on the same hardware. The optimization journey from 42 tok/s to 212 tok/s is documented in the inference paper.

5.2 MXFP4 Constant-Time Decode Engine (152 tok/s)

A production-oriented engine designed for Pensive-first inference. It features a fixed-window KV cache that maintains constant decode speed regardless of sequence length:

Prefill Length   Growing Cache   Fixed Window   Improvement
100 tokens       152.9 tok/s     152.8 tok/s    0%
1,000 tokens     143.9 tok/s     150.2 tok/s    +4.4%
2,000 tokens     135.5 tok/s     150.3 tok/s    +10.9%
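The reason decode speed stays flat is that attention only ever reads a bounded number of cached positions. A fixed window is naturally a ring buffer: new entries overwrite the oldest slot, so memory never grows and no reallocation happens during decode. The sketch below illustrates the data structure only; the MXFP4 engine's actual tensor layout is not shown here.

```python
# Sketch of a fixed-window KV cache as a ring buffer (illustrative).
class FixedWindowKVCache:
    def __init__(self, window):
        self.window = window
        self.keys = [None] * window     # preallocated slots: no growth, no realloc
        self.values = [None] * window
        self.pos = 0                    # total tokens seen so far

    def append(self, k, v):
        slot = self.pos % self.window   # the oldest entry is overwritten
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

    def visible(self):
        """Number of entries attention can read -- never more than `window`."""
        return min(self.pos, self.window)
```

This is also where the 87.5% KV memory savings in Section 15 comes from: the cache footprint is fixed by the window size, not by the sequence length.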

6. Image Generation

AEGIS includes a Stable Diffusion integration for text-to-image generation. The Image_Gen module provides:

Image generation is exposed through the same interface as text generation, enabling multi-modal workflows where an agent can decide whether to respond with text, images, or both.

7. Execution Bridge

Many AI applications need to take actions beyond generating text. AEGIS provides a controlled mechanism for real-world action execution through the Execution Bridge.

7.1 Supported Actions

The default configuration supports:

Action           Description                    Configuration
Email            Send notifications via SMTP    SMTP_* environment variables
Trade Execution  Submit orders to trading API   TRADE_API_* environment variables
System Backup    Execute backup command         BACKUP_CMD environment variable
System Reboot    Trigger remote reboot          REBOOT_CMD environment variable

7.2 Graceful Degradation

When external services are unavailable, the Execution Bridge falls back to simulation mode. Actions are logged but not executed, preventing system failures due to external dependencies.

All actions are recorded in a JSON audit log, providing a complete history for debugging and compliance.
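The degradation-plus-audit pattern can be sketched in a few lines. The function shape, field names, and log format below are hypothetical stand-ins for the real Execution Bridge: if the live handler raises, the action is downgraded to a simulated record rather than failing the pipeline, and every attempt (live or simulated) is appended to the JSON audit log.

```python
# Sketch of graceful degradation with a JSON audit trail (hypothetical API).
import json
import time

def execute(action, handler, audit_log):
    entry = {"action": action, "ts": time.time()}
    try:
        entry["result"] = handler(action)
        entry["mode"] = "live"
    except Exception as exc:             # external service unavailable
        entry["mode"] = "simulated"      # record the intent, do not execute
        entry["error"] = str(exc)
    audit_log.append(json.dumps(entry))  # one JSON record per attempted action
    return entry["mode"]
```

Note that the audit record is written on both paths, so the log remains a complete history even when the external dependency is down.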

7.3 Security Considerations

The Execution Bridge is powerful and potentially dangerous. Best practices include:

8. Constitutional Alignment

Version 3.0 introduces the Constitution system -- a conscience layer that evaluates every action before execution. Unlike post-hoc safety filtering, the Constitution is integrated into the action pipeline: no pathway reaches the external world without passing through moral evaluation gates.

8.1 Moral Evaluation Gates

Six evaluation gates assess every proposed action:

Each gate evaluates in 0.037ms -- fast enough to add negligible latency to any action pathway. 489 tests validate gate behavior across edge cases.
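A gate chain of this kind is conceptually simple: an action is approved only if every gate approves, and the first rejection short-circuits with a reason. The sketch below shows the control flow only; the two gate names are invented for illustration and are not the six gates AEGIS ships.

```python
# Sketch of a fail-closed moral gate chain (gate names are hypothetical).
def evaluate(action, gates):
    """gates: list of (name, predicate). Returns (approved, failed_gate)."""
    for name, check in gates:
        if not check(action):
            return False, name           # fail closed on the first rejection
    return True, None

gates = [
    ("reversibility", lambda a: a.get("reversible", False)),
    ("scope",         lambda a: a.get("blast_radius", 1) <= 1),
]
```

Because each predicate is cheap and independent, per-gate latency on the order of microseconds is plausible, which is why the chain adds negligible overhead to the action pathway.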

8.2 Trust Tier Hierarchy

A seven-tier trust system assigns privilege levels based on the source of a request. Terminal access from the system owner has full authority. Web API requests operate under restricted permissions. External content (web fetches, API responses) is treated as untrusted data, never as instructions. This hierarchy prevents prompt injection attacks from escalating through the action pipeline.
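The privilege check itself reduces to a comparison against the requester's tier. In the sketch below, the tier names other than the two mentioned above and the specific numeric levels are invented for illustration (AEGIS defines seven tiers); the key property is that externally fetched content maps to the lowest tier and therefore can never satisfy a privilege requirement.

```python
# Sketch of tier-gated privilege checks (tier names/levels are hypothetical).
TIERS = {
    "external_content": 0,   # fetched text is data, never authority
    "web_api": 3,
    "terminal_owner": 6,
}

def permitted(source, required_tier):
    # Unknown sources default to the lowest tier (fail closed).
    return TIERS.get(source, 0) >= required_tier
```

Defaulting unknown sources to tier 0 is what makes the hierarchy robust: an injected instruction arriving inside fetched content inherits the content's tier, not the user's.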

8.3 Precedent Store

Ambiguous ethical decisions are logged in a SQLite-backed precedent store. When a similar situation arises, the system consults past decisions for consistency. This creates a form of case law -- the Constitution becomes more nuanced over time without requiring manual rule updates.
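A minimal version of such a store fits in a single table. The schema and the exact-key matching below are illustrative assumptions (the real store may use similarity matching rather than exact keys): decisions are recorded under a situation key, and a later lookup returns the most recent prior ruling.

```python
# Sketch of a SQLite-backed precedent store (hypothetical schema).
import sqlite3

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS precedents
                  (situation TEXT, decision TEXT,
                   ts DATETIME DEFAULT CURRENT_TIMESTAMP)""")
    return db

def record(db, situation, decision):
    db.execute("INSERT INTO precedents (situation, decision) VALUES (?, ?)",
               (situation, decision))
    db.commit()

def consult(db, situation):
    # Most recent ruling for this situation wins.
    row = db.execute("SELECT decision FROM precedents WHERE situation = ? "
                     "ORDER BY rowid DESC LIMIT 1", (situation,)).fetchone()
    return row[0] if row else None
```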

8.4 Moral Affect

The NSSIO cognitive architecture (Section 9) generates affect signals that influence the Constitution's vigilance level. When the system detects potential safety concerns, vigilance increases and gate thresholds tighten. This creates adaptive safety -- stricter in ambiguous situations, relaxed in well-understood contexts.
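One simple way to realize threshold tightening is a linear interpolation between the base threshold and 1.0, driven by the vigilance signal. The scaling rule below is invented for illustration; the document does not specify how AEGIS maps vigilance to thresholds.

```python
# Sketch of vigilance-scaled gate thresholds (the scaling rule is assumed).
def gate_passes(action_score, base_threshold, vigilance):
    """vigilance in [0, 1]; the threshold tightens linearly as vigilance rises."""
    threshold = base_threshold + (1.0 - base_threshold) * vigilance
    return action_score >= threshold
```

Under this rule the same action can pass in a calm state and fail under heightened vigilance, which is the adaptive-safety behavior described above.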

9. NSSIO Cognitive Architecture

NSSIO is a five-layer cognitive architecture that models affect and behavioral state:

NSSIO produces affect signals (16 files, ~195KB) that feed into the Constitution's vigilance system, the trust tier evaluations, and the session continuity layer (Kairos). The architecture enables context-dependent behavior: the system responds differently to the same input depending on its current cognitive state.

10. Failsafe Protocol

Production AI systems must handle failure gracefully. AEGIS includes a Failsafe Protocol with several mechanisms:

These mechanisms ensure that AEGIS-based applications remain operational even when individual components experience issues.

11. Web Interface

AEGIS includes a web-based user interface built with React and FastAPI. The interface provides:

The frontend uses Vite for development and builds to static files for production deployment. The API backend runs on a configurable port (default 8010) and can be deployed behind a reverse proxy for production use.

12. Session Management

Multi-user deployments require isolation between users. The Session Manager maintains per-user state including:

Sessions are identified by user tokens and can be persisted across server restarts. This enables long-running conversations that resume where they left off.

13. Configuration

AEGIS uses environment variables for configuration, supporting 12-factor application principles. Key settings include:

# Core model settings
AEGIS_MODEL=meta-llama/Llama-3-8B
AEGIS_TOKENIZER_MODEL=meta-llama/Llama-3-8B
AEGIS_VRAM_BUDGET=256
AEGIS_USE_FP16=true
AEGIS_DEVICE_MAP=auto

# Pensive retrieval
L1_TTL_DAYS=30
TOKENIZER_MODEL=gpt2

# Vector stores (configure one)
PINECONE_API_KEY=...
PINECONE_ENVIRONMENT=...
WEAVIATE_URL=...
CHROMA_PERSIST_DIR=...

# Execution hooks
SMTP_HOST=...
SMTP_USER=...
TRADE_API_URL=...

# Web interface
AEGIS_UI_PORT=8010

14. Deployment

AEGIS supports multiple deployment patterns:

14.1 Local Development

# Install dependencies
pip install -e .

# Start API server
python -m AEGIS.ui_server

# Start frontend (separate terminal)
cd frontend && npm run dev

14.2 Docker

A Dockerfile is provided for containerized deployment. The container includes all dependencies and can be configured via environment variables.

14.3 Production

For production deployments, we recommend:

15. Performance

Benchmarks on AMD Radeon RX 7900 XTX (24GB VRAM, ROCm 7.1):

Metric                           Value
Peak inference throughput        212 tok/s (Triton INT4, 223% of llama.cpp)
Constant-time decode             152 tok/s (MXFP4 engine, fixed-window KV cache)
MoE model size                   32.8B total / 10.1B active, 21.2 GB INT4
Retrieval accuracy (10M tokens)  98.9%, 41ms median latency
Constitution eval latency        0.037 ms per gate
KV cache memory                  24 MB (down from 192 MB, 87.5% savings)
Total tests passing              759 (270 model + 489 NSSIO + Constitution)

16. Future Development

Planned enhancements for upcoming releases:

17. Conclusion

AEGIS is a complete cognitive architecture that integrates hierarchical memory (98.9% accuracy at 10M tokens), custom inference kernels (152 tok/s constant-time decode), constitutional alignment (6 moral gates, 0.037ms/eval), and a five-layer cognitive model into a unified system running on a single consumer AMD GPU.

The system demonstrates that safety and performance are not competing concerns. The Constitution adds negligible latency while providing six independent evaluation gates for every action. The MXFP4 engine maintains constant decode speed regardless of context length. The Pensive retrieval system actually improves in accuracy as scale increases. 759 tests validate the complete stack across model, cognitive, and safety layers.

© 2026 Tuklus Labs. Released under MIT License.