
AEGIS

An AI Operations Platform for Production Workloads

Tuklus Labs | February 2026 | Version 3.0

Abstract

AEGIS is a comprehensive cognitive architecture for deploying AI systems that interact with the real world. It combines hierarchical context retrieval (validated to 10M tokens at 98.9% accuracy), custom inference kernels (152 tok/s constant-time decode on AMD 7900 XTX), a 32.8B mixture-of-experts model, constitutional alignment with six moral evaluation gates, and execution hooks with safety constraints into a unified architecture. The system addresses practical challenges of production AI: memory constraints, latency requirements, multi-modal capabilities, alignment verification, and safe integration with external services. Version 3.0 introduces the Constitution system, the MXFP4 inference engine, and a five-layer cognitive architecture (NSSIO). The test suite comprises 759 passing tests across 156 test files.

1. Motivation

Building AI applications that move beyond chat interfaces requires solving multiple interconnected problems. A production system must retrieve relevant context efficiently, route queries to appropriate models, generate responses across modalities, and sometimes take actions in the external world. Each of these challenges has received attention in isolation, but integrating them into a coherent platform remains difficult.

Consider an AI assistant for enterprise operations. It must answer questions about company documents (requiring retrieval), handle both text and image requests (requiring multi-modal capabilities), work within GPU memory constraints (requiring efficient model management), and execute tasks like sending emails or updating records (requiring secure action execution). Building this from scratch involves substantial engineering effort and many opportunities for error.

AEGIS provides a ready-made foundation for such applications. It handles the infrastructure concerns, allowing developers to focus on application-specific logic.

2. System Architecture

AEGIS organizes functionality into interconnected subsystems:

┌─────────────────────────────────────────────────────────────┐
│                        Web Interface                        │
│              (React v1/v2/v3 + FastAPI Backend)             │
└─────────────────────────────────────────────────────────────┘
                               │
              ┌────────────────┴────────────────┐
              ▼                                 ▼
┌──────────────────────────┐      ┌──────────────────────────┐
│     NSSIO Cognitive      │      │       Constitution       │
│       Architecture       │      │    (Alignment Layer)     │
│  (5-layer affect model)  │      │  6 moral gates, 7 tiers  │
└──────────────────────────┘      └──────────────────────────┘
              │                                 │
              └────────────────┬────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│    Pensive    │      │  MXFP4 Engine │      │   Execution   │
│   (Context    │      │  + MoE Router │      │    Bridge     │
│   Retrieval)  │      │ (32.8B model) │      │   (Actions)   │
└───────────────┘      └───────────────┘      └───────────────┘
        │                      │                      │
        ▼                      ▼                      ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│   L1/L2/L3    │      │  Triton INT4  │      │ Constitution  │
│  Cache Tiers  │      │    Kernels    │      │  Gate Check   │
│   189K nodes  │      │   152 tok/s   │      │  0.037ms/eval │
└───────────────┘      └───────────────┘      └───────────────┘

3. Context Retrieval with Pensive

AEGIS incorporates the Pensive retrieval system as its context layer. Pensive implements a three-tier cache hierarchy with spreading activation and a 189,000-node knowledge graph, validated to 10 million tokens at 98.9% accuracy:

164,165 documents have been ingested across personal data sources (ChatGPT, Facebook, Claude, YouTube, Calendar, Maps), forming a knowledge graph with 189,000 nodes and 131,000 edges. The hybrid BM25 + dense vector approach eliminates the lost-in-the-middle phenomenon entirely -- middle-context accuracy is 99.3% at 10M tokens, the highest of any position.
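Pensive's internals are covered in its own paper, but the core idea of hybrid retrieval can be illustrated with a small sketch. The function below combines a lexical (BM25-style) score with dense cosine similarity via a weighted sum; the function names, weighting scheme, and `alpha` parameter are illustrative assumptions, not Pensive's actual code.

```python
# Illustrative sketch of hybrid lexical + dense score fusion (hypothetical,
# not Pensive's real implementation).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(candidates, query_vec, alpha=0.5):
    """Each candidate: (doc_id, bm25_score, embedding). Returns ids, best first."""
    # Normalize BM25 scores to [0, 1] so the two signals are comparable.
    max_bm25 = max((c[1] for c in candidates), default=1.0) or 1.0
    scored = [
        (doc_id, alpha * (bm25 / max_bm25) + (1 - alpha) * cosine(emb, query_vec))
        for doc_id, bm25, emb in candidates
    ]
    return [doc_id for doc_id, _ in sorted(scored, key=lambda p: p[1], reverse=True)]
```

Because the lexical and dense signals fail in different ways, their fusion is one plausible reason middle-context accuracy holds up at scale.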

Pensive is described in detail in its own technical paper. Within AEGIS, it provides the foundation for any operation requiring document knowledge or conversation memory.

4. Mixture-of-Experts Routing

Running large language models on consumer hardware requires careful memory management. A 70 billion parameter model exceeds the VRAM of most GPUs. Even loading multiple smaller specialized models simultaneously may not be feasible.

The Expert Manager addresses this through dynamic model loading. Rather than keeping all models in GPU memory, it loads experts on demand and evicts them when memory pressure increases.

4.1 Expert Lifecycle

Each expert model exists in one of three states:

The Expert Manager maintains a VRAM budget (configurable, default 256MB headroom). Before loading a new expert, it checks available memory and evicts least-recently-used experts if necessary.

4.2 Eviction Policies

Two eviction strategies are available: least-recently-used (LRU) and least-frequently-used (LFU).

LRU works well when access patterns are temporally clustered. LFU performs better when some experts are consistently more popular than others.
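The budget-and-evict loop can be sketched as follows. The class name, method signatures, and bookkeeping are hypothetical, not the actual Expert Manager API; the point is the mechanism: check the VRAM budget before loading, and evict by policy until the new expert fits.

```python
# Minimal sketch of budget-aware expert eviction (hypothetical API).
from collections import OrderedDict

class ExpertManager:
    def __init__(self, budget_mb, policy="lru"):
        self.budget_mb = budget_mb
        self.policy = policy
        self.resident = OrderedDict()   # name -> size_mb; insertion order = recency
        self.hits = {}                  # name -> access count (used by LFU)

    def _used(self):
        return sum(self.resident.values())

    def load(self, name, size_mb):
        if name in self.resident:                  # already resident: refresh recency
            self.resident.move_to_end(name)
        else:
            while self.resident and self._used() + size_mb > self.budget_mb:
                self._evict()
            self.resident[name] = size_mb          # stand-in for the real GPU load
        self.hits[name] = self.hits.get(name, 0) + 1

    def _evict(self):
        if self.policy == "lfu":
            victim = min(self.resident, key=lambda n: self.hits.get(n, 0))
        else:                                      # default: LRU (oldest entry)
            victim = next(iter(self.resident))
        del self.resident[victim]
```

An `OrderedDict` gives LRU almost for free: `move_to_end` on access keeps the oldest entry at the front as the eviction candidate.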

4.3 Optional Gating Network

For applications with many experts, manually selecting which expert handles each query becomes impractical. AEGIS supports an optional gating network: a small model that examines query embeddings and predicts which expert(s) should respond.

The gating network can route to a single expert (hard routing) or blend outputs from multiple experts (soft routing). Soft routing produces more nuanced responses but requires loading multiple experts simultaneously.
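Hard and soft routing can be expressed in a few lines. The sketch below scores each expert with a linear gate, converts scores to routing weights with a softmax, and selects the top-k; everything here (the linear gate, top-k renormalization) is an illustrative assumption rather than AEGIS's actual gating network.

```python
# Sketch of hard (k=1) vs. soft (k>1) gating over query embeddings.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(query_emb, gate_weights, k=1):
    """gate_weights: one weight vector per expert. Returns [(expert_idx, weight)]."""
    scores = [sum(q * w for q, w in zip(query_emb, wv)) for wv in gate_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)      # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]
```

With `k=1` the output is a single expert with weight 1.0 (hard routing); with `k>1` the weights blend several experts' outputs, at the cost of keeping all of them resident in VRAM.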

5. LLM Inference on ROCm

AEGIS includes two inference engines, both running on AMD 7900 XTX with ROCm 7.1:

5.1 Triton INT4 Engine (212 tok/s)

Custom Triton GEMV kernels with fused INT4 dequantization achieve 212 tokens/second -- 223% of stock llama.cpp on the same hardware. The optimization journey from 42 tok/s to 212 tok/s is documented in the inference paper.

5.2 MXFP4 Constant-Time Decode Engine (152 tok/s)

A production-oriented engine designed for Pensive-first inference. It features a fixed-window KV cache that maintains constant decode speed regardless of sequence length:

Prefill Length   Growing Cache   Fixed Window   Improvement
100 tokens       152.9 tok/s     152.8 tok/s    0%
1,000 tokens     143.9 tok/s     150.2 tok/s    +4.4%
2,000 tokens     135.5 tok/s     150.3 tok/s    +10.9%
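The reason decode speed stays flat is that attention only ever reads a bounded number of cached positions. A fixed window is naturally a ring buffer: new entries overwrite the oldest slot, so memory never grows and no reallocation happens during decode. The sketch below illustrates the data structure only; the MXFP4 engine's actual tensor layout is not shown here.

```python
# Sketch of a fixed-window KV cache as a ring buffer (illustrative).
class FixedWindowKVCache:
    def __init__(self, window):
        self.window = window
        self.keys = [None] * window     # preallocated slots: no growth, no realloc
        self.values = [None] * window
        self.pos = 0                    # total tokens seen so far

    def append(self, k, v):
        slot = self.pos % self.window   # the oldest entry is overwritten
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

    def visible(self):
        """Number of entries attention can read -- never more than `window`."""
        return min(self.pos, self.window)
```

This is also where the 87.5% KV memory savings in Section 15 comes from: the cache footprint is fixed by the window size, not by the sequence length.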

6. Image Generation

AEGIS includes a Stable Diffusion integration for text-to-image generation. The Image_Gen module provides:

Image generation is exposed through the same interface as text generation, enabling multi-modal workflows where an agent can decide whether to respond with text, images, or both.

7. Execution Bridge

Many AI applications need to take actions beyond generating text. AEGIS provides a controlled mechanism for real-world action execution through the Execution Bridge.

7.1 Supported Actions

The default configuration supports:

Action           Description                    Configuration
Email            Send notifications via SMTP    SMTP_* environment variables
Trade Execution  Submit orders to trading API   TRADE_API_* environment variables
System Backup    Execute backup command         BACKUP_CMD environment variable
System Reboot    Trigger remote reboot          REBOOT_CMD environment variable

7.2 Graceful Degradation

When external services are unavailable, the Execution Bridge falls back to simulation mode. Actions are logged but not executed, preventing system failures due to external dependencies.

All actions are recorded in a JSON audit log, providing a complete history for debugging and compliance.
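The degradation-plus-audit pattern can be sketched in a few lines. The function shape, field names, and log format below are hypothetical stand-ins for the real Execution Bridge: if the live handler raises, the action is downgraded to a simulated record rather than failing the pipeline, and every attempt (live or simulated) is appended to the JSON audit log.

```python
# Sketch of graceful degradation with a JSON audit trail (hypothetical API).
import json
import time

def execute(action, handler, audit_log):
    entry = {"action": action, "ts": time.time()}
    try:
        entry["result"] = handler(action)
        entry["mode"] = "live"
    except Exception as exc:             # external service unavailable
        entry["mode"] = "simulated"      # record the intent, do not execute
        entry["error"] = str(exc)
    audit_log.append(json.dumps(entry))  # one JSON record per attempted action
    return entry["mode"]
```

Note that the audit record is written on both paths, so the log remains a complete history even when the external dependency is down.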

7.3 Security Considerations

The Execution Bridge is powerful and potentially dangerous. Best practices include:

8. Constitutional Alignment

Version 3.0 introduces the Constitution system -- a conscience layer that evaluates every action before execution. Unlike post-hoc safety filtering, the Constitution is integrated into the action pipeline: no pathway reaches the external world without passing through moral evaluation gates.

8.1 Moral Evaluation Gates

Six evaluation gates assess every proposed action:

Each gate evaluates in 0.037ms -- fast enough to add negligible latency to any action pathway. 489 tests validate gate behavior across edge cases.
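A gate chain of this kind is conceptually simple: an action is approved only if every gate approves, and the first rejection short-circuits with a reason. The sketch below shows the control flow only; the two gate names are invented for illustration and are not the six gates AEGIS ships.

```python
# Sketch of a fail-closed moral gate chain (gate names are hypothetical).
def evaluate(action, gates):
    """gates: list of (name, predicate). Returns (approved, failed_gate)."""
    for name, check in gates:
        if not check(action):
            return False, name           # fail closed on the first rejection
    return True, None

gates = [
    ("reversibility", lambda a: a.get("reversible", False)),
    ("scope",         lambda a: a.get("blast_radius", 1) <= 1),
]
```

Because each predicate is cheap and independent, per-gate latency on the order of microseconds is plausible, which is why the chain adds negligible overhead to the action pathway.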

8.2 Trust Tier Hierarchy

A seven-tier trust system assigns privilege levels based on the source of a request. Terminal access from the system owner has full authority. Web API requests operate under restricted permissions. External content (web fetches, API responses) is treated as untrusted data, never as instructions. This hierarchy prevents prompt injection attacks from escalating through the action pipeline.
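The privilege check itself reduces to a comparison against the requester's tier. In the sketch below, the tier names other than the two mentioned above and the specific numeric levels are invented for illustration (AEGIS defines seven tiers); the key property is that externally fetched content maps to the lowest tier and therefore can never satisfy a privilege requirement.

```python
# Sketch of tier-gated privilege checks (tier names/levels are hypothetical).
TIERS = {
    "external_content": 0,   # fetched text is data, never authority
    "web_api": 3,
    "terminal_owner": 6,
}

def permitted(source, required_tier):
    # Unknown sources default to the lowest tier (fail closed).
    return TIERS.get(source, 0) >= required_tier
```

Defaulting unknown sources to tier 0 is what makes the hierarchy robust: an injected instruction arriving inside fetched content inherits the content's tier, not the user's.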

8.3 Precedent Store

Ambiguous ethical decisions are logged in a SQLite-backed precedent store. When a similar situation arises, the system consults past decisions for consistency. This creates a form of case law -- the Constitution becomes more nuanced over time without requiring manual rule updates.
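A minimal version of such a store fits in a single table. The schema and the exact-key matching below are illustrative assumptions (the real store may use similarity matching rather than exact keys): decisions are recorded under a situation key, and a later lookup returns the most recent prior ruling.

```python
# Sketch of a SQLite-backed precedent store (hypothetical schema).
import sqlite3

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS precedents
                  (situation TEXT, decision TEXT,
                   ts DATETIME DEFAULT CURRENT_TIMESTAMP)""")
    return db

def record(db, situation, decision):
    db.execute("INSERT INTO precedents (situation, decision) VALUES (?, ?)",
               (situation, decision))
    db.commit()

def consult(db, situation):
    # Most recent ruling for this situation wins.
    row = db.execute("SELECT decision FROM precedents WHERE situation = ? "
                     "ORDER BY rowid DESC LIMIT 1", (situation,)).fetchone()
    return row[0] if row else None
```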

8.4 Moral Affect

The NSSIO cognitive architecture (Section 9) generates affect signals that influence the Constitution's vigilance level. When the system detects potential safety concerns, vigilance increases and gate thresholds tighten. This creates adaptive safety -- stricter in ambiguous situations, relaxed in well-understood contexts.
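One simple way to realize threshold tightening is a linear interpolation between the base threshold and 1.0, driven by the vigilance signal. The scaling rule below is invented for illustration; the document does not specify how AEGIS maps vigilance to thresholds.

```python
# Sketch of vigilance-scaled gate thresholds (the scaling rule is assumed).
def gate_passes(action_score, base_threshold, vigilance):
    """vigilance in [0, 1]; the threshold tightens linearly as vigilance rises."""
    threshold = base_threshold + (1.0 - base_threshold) * vigilance
    return action_score >= threshold
```

Under this rule the same action can pass in a calm state and fail under heightened vigilance, which is the adaptive-safety behavior described above.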

9. NSSIO Cognitive Architecture

NSSIO is a five-layer cognitive architecture that models affect and behavioral state:

NSSIO produces affect signals (16 files, ~195KB) that feed into the Constitution's vigilance system, the trust tier evaluations, and the session continuity layer (Kairos). The architecture enables context-dependent behavior: the system responds differently to the same input depending on its current cognitive state.

10. Failsafe Protocol

Production AI systems must handle failure gracefully. AEGIS includes a Failsafe Protocol with several mechanisms:

These mechanisms ensure that AEGIS-based applications remain operational even when individual components experience issues.

11. Web Interface

AEGIS includes a web-based user interface built with React and FastAPI. The interface provides:

The frontend uses Vite for development and builds to static files for production deployment. The API backend runs on a configurable port (default 8010) and can be deployed behind a reverse proxy for production use.

12. Session Management

Multi-user deployments require isolation between users. The Session Manager maintains per-user state including:

Sessions are identified by user tokens and can be persisted across server restarts. This enables long-running conversations that resume where they left off.

13. Configuration

AEGIS uses environment variables for configuration, supporting 12-factor application principles. Key settings include:

# Core model settings
AEGIS_MODEL=meta-llama/Llama-3-8B
AEGIS_TOKENIZER_MODEL=meta-llama/Llama-3-8B
AEGIS_VRAM_BUDGET=256
AEGIS_USE_FP16=true
AEGIS_DEVICE_MAP=auto

# Pensive retrieval
L1_TTL_DAYS=30
TOKENIZER_MODEL=gpt2

# Vector stores (configure one)
PINECONE_API_KEY=...
PINECONE_ENVIRONMENT=...
WEAVIATE_URL=...
CHROMA_PERSIST_DIR=...

# Execution hooks
SMTP_HOST=...
SMTP_USER=...
TRADE_API_URL=...

# Web interface
AEGIS_UI_PORT=8010

14. Deployment

AEGIS supports multiple deployment patterns:

14.1 Local Development

# Install dependencies
pip install -e .

# Start API server
python -m AEGIS.ui_server

# Start frontend (separate terminal)
cd frontend && npm run dev

14.2 Docker

A Dockerfile is provided for containerized deployment. The container includes all dependencies and can be configured via environment variables.

14.3 Production

For production deployments, we recommend:

15. Performance

Benchmarks on AMD Radeon RX 7900 XTX (24GB VRAM, ROCm 7.1):

Metric                           Value
Peak inference throughput        212 tok/s (Triton INT4, 223% of llama.cpp)
Constant-time decode             152 tok/s (MXFP4 engine, fixed-window KV cache)
MoE model size                   32.8B total / 10.1B active, 21.2 GB INT4
Retrieval accuracy (10M tokens)  98.9%, 41ms median latency
Constitution eval latency        0.037 ms per gate
KV cache memory                  24 MB (down from 192 MB, 87.5% savings)
Total tests passing              759 (270 model + 489 NSSIO + Constitution)

16. Future Development

Planned enhancements for upcoming releases:

17. Conclusion

AEGIS is a complete cognitive architecture that integrates hierarchical memory (98.9% accuracy at 10M tokens), custom inference kernels (152 tok/s constant-time decode), constitutional alignment (6 moral gates, 0.037ms/eval), and a five-layer cognitive model into a unified system running on a single consumer AMD GPU.

The system demonstrates that safety and performance are not competing concerns. The Constitution adds negligible latency while providing six independent evaluation gates for every action. The MXFP4 engine maintains constant decode speed regardless of context length. The Pensive retrieval system actually improves in accuracy as scale increases. 759 tests validate the complete stack across model, cognitive, and safety layers.

© 2026 Tuklus Labs. Released under MIT License.