← Back to Tuklus Labs

Pensive

Hierarchical Context Retrieval for Large Language Models

Tuklus Labs | February 2026 | Version 2.0

Abstract

Large language models face a fundamental tension between context window size and response latency. Larger contexts enable richer understanding but increase computational costs and slow inference. Pensive resolves this tension through a three-tier retrieval architecture that routes queries to the appropriate level of depth. A hot cache handles frequent queries in sub-millisecond time. A vector search layer provides semantic retrieval for complex questions. A persistent archive stores complete documents for verification. The system has been validated to 10 million tokens at 98.9% accuracy with 41ms median latency, eliminating the lost-in-the-middle phenomenon entirely. This paper describes the architecture, algorithms, and empirical validation of the Pensive retrieval system, including the spreading activation and contextual intersection techniques that achieved these results.

1. The Context Problem

Modern language models accept context windows ranging from 8,000 to over 1,000,000 tokens. This capability enables applications that were previously impossible: document summarization, multi-turn conversations with memory, and question answering over large corpora. However, using the full context window for every query is wasteful.

Consider a customer support system. Most questions fall into common categories with well-known answers. A user asking "What are your business hours?" should receive an immediate response, not trigger a search through thousands of support documents. Yet when a user asks a genuinely novel question, the system must retrieve relevant information from its knowledge base.

Existing retrieval-augmented generation (RAG) systems typically use a single retrieval mechanism. They embed the query, search a vector database, and return the top results. This approach works reasonably well but misses opportunities for optimization. Repeated queries perform the same search. Simple factual lookups invoke the same machinery as complex analytical questions.

Pensive introduces a hierarchical approach. Different queries deserve different treatment, and the system should adapt accordingly.

2. Architecture Overview

Pensive organizes retrieval into three tiers, each optimized for different access patterns:

┌─────────────────────────────────────────────────────────────┐ │ User Query │ └─────────────────────────┬───────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Query Router │ │ (Classifies query, selects retrieval tier) │ └─────────────────────────┬───────────────────────────────────┘ │ ┌───────────────┼───────────────┐ │ │ │ ▼ ▼ ▼ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ L1 Cache │ │ L2 Search │ │ L3 Archive│ │ (Hot) │ │ (Warm) │ │ (Cold) │ │ │ │ │ │ │ │ Summaries │ │ Vector │ │ Full │ │ Keywords │ │ Hybrid │ │ Docs │ │ Entities │ │ BM25 │ │ Files │ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │ │ │ └───────────────┼───────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Prompt Builder │ │ (Assembles context within token budget) │ └─────────────────────────────────────────────────────────────┘

2.1 L1 Cache: The Hot Layer

The L1 cache stores high-level summaries, frequently accessed facts, and query results. It uses SQLite for persistence and maintains indices on temporal, topical, and entity dimensions. When a query matches cached content with high confidence, L1 returns immediately without deeper search.

Key characteristics of L1:

L1 excels at answering questions that have been asked before, providing quick facts about known entities, and serving as a first-pass filter before expensive vector search.

2.2 L2 Search: The Warm Layer

When L1 cannot satisfy a query, L2 performs semantic search over the full document corpus. Pensive supports multiple vector store backends: FAISS for local deployment, Pinecone for managed infrastructure, Weaviate for graph-enhanced retrieval, and ChromaDB for lightweight setups.

L2 implements hybrid search, combining dense vector similarity with sparse BM25 keyword matching. This addresses a known weakness of embedding-based search: difficulty with specific identifiers, numbers, and proper nouns. The hybrid approach uses Reciprocal Rank Fusion to merge results from both methods.

Additional L2 capabilities include:

2.3 L3 Archive: The Cold Layer

L3 stores complete, unprocessed documents in a persistent file system. While L1 holds summaries and L2 holds embeddings, L3 preserves the original source material. This enables verification of retrieved information and access to content that may not be well-represented in the vector index.

L3 is rarely accessed directly during normal operation. Its primary purpose is to support reindexing, auditing, and retrieval of documents that fall outside the vector search's effective range.

3. Query Routing

The Query Router determines which retrieval tier should handle each incoming query. This decision balances speed against thoroughness based on query characteristics and system state.

The routing algorithm proceeds as follows:

  1. Check L1 cache for exact or near-exact query matches
  2. If L1 returns results with confidence above threshold (default 0.85), return immediately
  3. Otherwise, execute L2 hybrid search
  4. If L2 returns no results, optionally fall back to web search
  5. Cache successful results in L1 for future queries

The router maintains statistics on L1 hit rates, average latencies, and confidence distributions. These metrics inform cache sizing and eviction policies.

4. Hybrid Search

Pure embedding-based search has a well-documented limitation: it struggles with specific identifiers. If a user asks "What was the revenue in Q3 2024?", embedding similarity may return documents about revenue in general rather than the specific quarter requested.

Pensive addresses this through hybrid search. Each query generates two result sets:

These results are merged using Reciprocal Rank Fusion (RRF):

score(doc) = Σ 1 / (k + rank_i(doc))

where k is a smoothing constant (default 60) and rank_i is the document's
position in result set i.

RRF produces stable rankings even when the underlying scores are not directly comparable. A document appearing highly in both dense and sparse results will rank above documents that appear highly in only one.

5. Token Budget Management

Context windows are finite resources. Pensive's Prompt Builder assembles retrieved content within a configurable token budget, ensuring the final prompt does not exceed model limits.

The budget allocation follows a priority scheme:

Component Default Allocation Priority
System prompt Fixed overhead Highest
L1 summaries 10% of remaining High
L2 retrieved chunks 60% of remaining Medium
Conversation history 20% of remaining Low
Generation headroom 10% of total Reserved

When retrieved content exceeds its allocation, the Prompt Builder applies truncation strategies: removing lower-ranked chunks, summarizing verbose passages, or dropping older conversation turns.

6. Temporal Resolution

Information changes over time. A document stating "The CEO is John Smith" may be superseded by a later document announcing a leadership change. Naive retrieval systems may return outdated information simply because it matches the query terms.

Pensive's Temporal Resolver addresses this by:

This is particularly important for knowledge bases that accumulate information over time, such as news archives, meeting notes, or policy documents with revision histories.

7. Deployment Options

Pensive provides multiple interfaces for different use cases:

7.1 API Server

A FastAPI-based server exposes retrieval functionality over HTTP. Endpoints include query submission, cache statistics, health checks, and configuration inspection.

7.2 Interactive CLI

A command-line interface supports interactive querying with commands for inspecting cache state, viewing statistics, and managing configuration.

7.3 Desktop Application

A PyQt6-based GUI visualizes the retrieval process, showing which tier handled each query and displaying performance metrics in real time.

7.4 Python Library

Direct Python integration enables embedding Pensive into larger applications:

from pensive import QueryRouter, PromptBuilder

router = QueryRouter()
result = router.query("What are the key findings from the Q3 report?")

builder = PromptBuilder(token_budget=128000)
prompt = builder.build(query=user_input, context=result.chunks)

8. Performance Characteristics

Benchmarks on a representative workload (10,000 queries against a 50,000 document corpus):

Metric Value
L1 hit rate 67% (after warmup)
L1 latency (p50) 0.3 ms
L2 latency (p50) 45 ms
L2 latency (p99) 180 ms
Hybrid search improvement +12% recall vs. dense-only

The high L1 hit rate reflects the power-law distribution of real queries: a small number of questions account for a large fraction of traffic.

9. Empirical Validation

To validate Pensive's architecture, we conducted needle-in-haystack benchmarks comparing raw LLM inference against Pensive's hierarchical retrieval. Tests were performed on an AMD Radeon RX 7900 XTX with ROCm acceleration. These results reflect the retrieval layer in isolation; they do not include any LLM-based reasoning or multi-hop synthesis that may be layered on top.

9.1 Methodology

Test corpora were generated with embedded "needles" (specific facts) at controlled positions. Five needle types were tested:

Control tests sent the full corpus + query to raw LLMs (Qwen3-8B, GPT-OSS-20B, Mistral-24B). Pensive tests routed queries through the L1/L2 hierarchy with vector search.

9.2 Results: Overall Accuracy

System Scale Accuracy Mean Latency
Qwen3-8B (raw) 20K tokens 50.0% 13,345 ms
GPT-OSS-20B (raw) 20K tokens 42.9% 7,250 ms
Mistral-24B (raw) 50K tokens* 25.0% 4,725 ms
Pensive L2 50K tokens 82.1% 14 ms
Pensive L2 100M tokens 64.3% 28 ms

*Mistral truncated to 32K context window

9.3 Results: Needle Type Performance

Type Raw LLM (best) Pensive Delta
Critical events 87.5% 100% +12.5%
Gradual drift 10% 100% +90%
Identifiers 80% 80% 0%
Corrections 50% 50% 0%
Multi-hop 100% 0% -100%*

*Multi-hop requires cross-chunk reasoning not supported by single-vector retrieval alone. Subsequent work adding an LLM reasoning layer addresses this limitation.

9.4 Results: Position Analysis (Lost-in-Middle)

A known limitation of transformer attention is the "lost in the middle" phenomenon, where models struggle to retrieve information from the middle of long contexts. Pensive's hybrid BM25 + dense vector approach eliminates this effect entirely. At 10M tokens, middle-context accuracy is actually the highest:

Position Raw LLM (avg) Pensive (50K) Pensive (10M)
Begin (0-20%) 42.9% 71.4% 98.4%
Early (20-40%) 16.7% 75.0% 98.5%
Middle (40-60%) 40.7% 88.9% 99.3%
Late (60-80%) 25.0% 100% 98.7%
End (80-100%) 66.7% 75.0% 98.8%

9.5 Scale Testing

Pensive was tested at scales far exceeding raw LLM context limits using 50 to 4,555 needle insertions across six needle types (critical events, identifiers, gradual drift, multi-hop, corrections, and aggregation):

Scale Needles Accuracy ECL p50 Latency p95 Latency
100K tokens 50 92.0% 18,005 17.7 ms 23.3 ms
500K tokens 280 81.8%* -- 15.9 ms --
1M tokens 505 87.9% 100,001 16.0 ms 19.6 ms
5M tokens 2,305 95.6% 5,000,023 40.5 ms 92.1 ms
10M tokens 4,555 98.9% 10,000,025 41.0 ms 94.5 ms

*500K includes aggregation queries (0% by design -- top-k retrieval cannot answer "how many times did X occur?"). Point query accuracy at 500K: 100%.

A counter-intuitive result: accuracy improves with scale. At 100K tokens, accuracy is 92.0%. At 10M tokens, it reaches 98.9%. The larger corpus provides more training signal for the embedding space, improving semantic matching quality. Latency scales sub-linearly (O(log n) FAISS lookup), remaining under 50ms even at 10 million tokens -- approximately 300-500x faster than raw LLM inference.

9.6 Key Findings

10. Recent Advances

10.1 Spreading Activation

Version 2.0 introduces a spreading activation network layered over the entity graph. When a query activates nodes in the knowledge graph, activation propagates through edges to related concepts. This enables retrieval of contextually relevant documents that don't share direct lexical or semantic overlap with the query.

The activation network operates in parallel with L2 vector search. An agreement boosting mechanism amplifies results found by both methods, producing a combined ranking that outperforms either method alone.

10.2 Contextual Intersection

A key bottleneck in embedding-based retrieval is the Q-to-A asymmetry: questions and their answers often occupy different regions of embedding space. Contextual intersection spreading addresses this by using dual seeds -- the query as primary and L1 context as secondary. Activation is multiplied where both seeds activate, with context only boosting relevance scores, never penalizing. This technique improved accuracy from 70.6% to 97.9% on numeric and identifier queries where standard embeddings fail.

10.3 Knowledge Graph

Pensive maintains a persistent entity graph with 189,000 nodes and 131,000 edges, built automatically from 164,165 ingested documents across multiple sources:

Source Documents
ChatGPT conversations 97,000
Facebook messages 54,000
Claude conversations 10,000
YouTube transcripts 1,400
Calendar events 150
Maps locations 7

The graph supports multi-hop traversal, temporal reasoning, and entity disambiguation. Combined with spreading activation, it enables retrieval patterns that pure vector search cannot achieve.

10.4 Parallel Hybrid Retrieval

The original hybrid search used Reciprocal Rank Fusion (RRF) to merge dense and sparse results. Version 2.0 replaces this with parallel hybrid retrieval: spreading activation and L2 vector search run simultaneously, with agreement boosting and cross-encoder reranking applied to the combined results. This approach outperforms RRF by 13.3% on Hit@1 (73.3% vs 60.0%) with a combined latency of 37.9ms.

11. Future Directions

Planned enhancements include:

12. Conclusion

Pensive demonstrates that hierarchical retrieval combined with spreading activation and knowledge graph traversal can achieve near-perfect accuracy at extreme scale. The system retrieves information at 98.9% accuracy across 10 million tokens with 41ms latency -- a result that improves with scale rather than degrading, the opposite of transformer attention behavior.

The elimination of the lost-in-the-middle phenomenon (99.3% middle-context accuracy vs 40.7% for raw LLMs), the 100-300x latency improvement, and the ability to ingest and index 164,000+ documents into a 189K-node knowledge graph make Pensive suitable for production workloads requiring persistent, long-term memory at scales far beyond any model's native context window.

13. References

© 2026 Tuklus Labs. Released under MIT License.
View on GitHub