Hierarchical Context Retrieval for Large Language Models
Large language models face a fundamental tension between context window size and response latency. Larger contexts enable richer understanding but increase computational costs and slow inference. Pensive resolves this tension through a three-tier retrieval architecture that routes queries to the appropriate level of depth. A hot cache handles frequent queries in sub-millisecond time. A vector search layer provides semantic retrieval for complex questions. A persistent archive stores complete documents for verification. The system has been validated to 10 million tokens at 98.9% accuracy with 41ms median latency, eliminating the lost-in-the-middle phenomenon entirely. This paper describes the architecture, algorithms, and empirical validation of the Pensive retrieval system, including the spreading activation and contextual intersection techniques that achieved these results.
Modern language models accept context windows ranging from 8,000 to over 1,000,000 tokens. This capability enables applications that were previously impossible: document summarization, multi-turn conversations with memory, and question answering over large corpora. However, using the full context window for every query is wasteful.
Consider a customer support system. Most questions fall into common categories with well-known answers. A user asking "What are your business hours?" should receive an immediate response, not trigger a search through thousands of support documents. Yet when a user asks a genuinely novel question, the system must retrieve relevant information from its knowledge base.
Existing retrieval-augmented generation (RAG) systems typically use a single retrieval mechanism. They embed the query, search a vector database, and return the top results. This approach works reasonably well but misses opportunities for optimization. Repeated queries perform the same search. Simple factual lookups invoke the same machinery as complex analytical questions.
Pensive introduces a hierarchical approach. Different queries deserve different treatment, and the system should adapt accordingly.
Pensive organizes retrieval into three tiers, each optimized for different access patterns:
The L1 cache stores high-level summaries, frequently accessed facts, and query results. It uses SQLite for persistence and maintains indices on temporal, topical, and entity dimensions. When a query matches cached content with high confidence, L1 returns immediately without deeper search.
Key characteristics of L1:
L1 excels at answering questions that have been asked before, providing quick facts about known entities, and serving as a first-pass filter before expensive vector search.
When L1 cannot satisfy a query, L2 performs semantic search over the full document corpus. Pensive supports multiple vector store backends: FAISS for local deployment, Pinecone for managed infrastructure, Weaviate for graph-enhanced retrieval, and ChromaDB for lightweight setups.
L2 implements hybrid search, combining dense vector similarity with sparse BM25 keyword matching. This addresses a known weakness of embedding-based search: difficulty with specific identifiers, numbers, and proper nouns. The hybrid approach uses Reciprocal Rank Fusion to merge results from both methods.
Additional L2 capabilities include:
L3 stores complete, unprocessed documents in a persistent file system. While L1 holds summaries and L2 holds embeddings, L3 preserves the original source material. This enables verification of retrieved information and access to content that may not be well-represented in the vector index.
L3 is rarely accessed directly during normal operation. Its primary purpose is to support reindexing, auditing, and retrieval of documents that fall outside the vector search's effective range.
The Query Router determines which retrieval tier should handle each incoming query. This decision balances speed against thoroughness based on query characteristics and system state.
The routing algorithm proceeds as follows:
The router maintains statistics on L1 hit rates, average latencies, and confidence distributions. These metrics inform cache sizing and eviction policies.
Pure embedding-based search has a well-documented limitation: it struggles with specific identifiers. If a user asks "What was the revenue in Q3 2024?", embedding similarity may return documents about revenue in general rather than the specific quarter requested.
Pensive addresses this through hybrid search. Each query generates two result sets:
These results are merged using Reciprocal Rank Fusion (RRF):
score(doc) = Σ 1 / (k + rank_i(doc))
where k is a smoothing constant (default 60) and rank_i is the document's
position in result set i.
RRF produces stable rankings even when the underlying scores are not directly comparable. A document appearing highly in both dense and sparse results will rank above documents that appear highly in only one.
Context windows are finite resources. Pensive's Prompt Builder assembles retrieved content within a configurable token budget, ensuring the final prompt does not exceed model limits.
The budget allocation follows a priority scheme:
| Component | Default Allocation | Priority |
|---|---|---|
| System prompt | Fixed overhead | Highest |
| L1 summaries | 10% of remaining | High |
| L2 retrieved chunks | 60% of remaining | Medium |
| Conversation history | 20% of remaining | Low |
| Generation headroom | 10% of total | Reserved |
When retrieved content exceeds its allocation, the Prompt Builder applies truncation strategies: removing lower-ranked chunks, summarizing verbose passages, or dropping older conversation turns.
Information changes over time. A document stating "The CEO is John Smith" may be superseded by a later document announcing a leadership change. Naive retrieval systems may return outdated information simply because it matches the query terms.
Pensive's Temporal Resolver addresses this by:
This is particularly important for knowledge bases that accumulate information over time, such as news archives, meeting notes, or policy documents with revision histories.
Pensive provides multiple interfaces for different use cases:
A FastAPI-based server exposes retrieval functionality over HTTP. Endpoints include query submission, cache statistics, health checks, and configuration inspection.
A command-line interface supports interactive querying with commands for inspecting cache state, viewing statistics, and managing configuration.
A PyQt6-based GUI visualizes the retrieval process, showing which tier handled each query and displaying performance metrics in real time.
Direct Python integration enables embedding Pensive into larger applications:
from pensive import QueryRouter, PromptBuilder
router = QueryRouter()
result = router.query("What are the key findings from the Q3 report?")
builder = PromptBuilder(token_budget=128000)
prompt = builder.build(query=user_input, context=result.chunks)
Benchmarks on a representative workload (10,000 queries against a 50,000 document corpus):
| Metric | Value |
|---|---|
| L1 hit rate | 67% (after warmup) |
| L1 latency (p50) | 0.3 ms |
| L2 latency (p50) | 45 ms |
| L2 latency (p99) | 180 ms |
| Hybrid search improvement | +12% recall vs. dense-only |
The high L1 hit rate reflects the power-law distribution of real queries: a small number of questions account for a large fraction of traffic.
To validate Pensive's architecture, we conducted needle-in-haystack benchmarks comparing raw LLM inference against Pensive's hierarchical retrieval. Tests were performed on an AMD Radeon RX 7900 XTX with ROCm acceleration. These results reflect the retrieval layer in isolation; they do not include any LLM-based reasoning or multi-hop synthesis that may be layered on top.
Test corpora were generated with embedded "needles" (specific facts) at controlled positions. Five needle types were tested:
Control tests sent the full corpus + query to raw LLMs (Qwen3-8B, GPT-OSS-20B, Mistral-24B). Pensive tests routed queries through the L1/L2 hierarchy with vector search.
| System | Scale | Accuracy | Mean Latency |
|---|---|---|---|
| Qwen3-8B (raw) | 20K tokens | 50.0% | 13,345 ms |
| GPT-OSS-20B (raw) | 20K tokens | 42.9% | 7,250 ms |
| Mistral-24B (raw) | 50K tokens* | 25.0% | 4,725 ms |
| Pensive L2 | 50K tokens | 82.1% | 14 ms |
| Pensive L2 | 100M tokens | 64.3% | 28 ms |
*Mistral truncated to 32K context window
| Type | Raw LLM (best) | Pensive | Delta |
|---|---|---|---|
| Critical events | 87.5% | 100% | +12.5% |
| Gradual drift | 10% | 100% | +90% |
| Identifiers | 80% | 80% | 0% |
| Corrections | 50% | 50% | 0% |
| Multi-hop | 100% | 0% | -100%* |
*Multi-hop requires cross-chunk reasoning not supported by single-vector retrieval alone. Subsequent work adding an LLM reasoning layer addresses this limitation.
A known limitation of transformer attention is the "lost in the middle" phenomenon, where models struggle to retrieve information from the middle of long contexts. Pensive's hybrid BM25 + dense vector approach eliminates this effect entirely. At 10M tokens, middle-context accuracy is actually the highest:
| Position | Raw LLM (avg) | Pensive (50K) | Pensive (10M) |
|---|---|---|---|
| Begin (0-20%) | 42.9% | 71.4% | 98.4% |
| Early (20-40%) | 16.7% | 75.0% | 98.5% |
| Middle (40-60%) | 40.7% | 88.9% | 99.3% |
| Late (60-80%) | 25.0% | 100% | 98.7% |
| End (80-100%) | 66.7% | 75.0% | 98.8% |
Pensive was tested at scales far exceeding raw LLM context limits using 50 to 4,555 needle insertions across six needle types (critical events, identifiers, gradual drift, multi-hop, corrections, and aggregation):
| Scale | Needles | Accuracy | ECL | p50 Latency | p95 Latency |
|---|---|---|---|---|---|
| 100K tokens | 50 | 92.0% | 18,005 | 17.7 ms | 23.3 ms |
| 500K tokens | 280 | 81.8%* | -- | 15.9 ms | -- |
| 1M tokens | 505 | 87.9% | 100,001 | 16.0 ms | 19.6 ms |
| 5M tokens | 2,305 | 95.6% | 5,000,023 | 40.5 ms | 92.1 ms |
| 10M tokens | 4,555 | 98.9% | 10,000,025 | 41.0 ms | 94.5 ms |
*500K includes aggregation queries (0% by design -- top-k retrieval cannot answer "how many times did X occur?"). Point query accuracy at 500K: 100%.
A counter-intuitive result: accuracy improves with scale. At 100K tokens, accuracy is 92.0%. At 10M tokens, it reaches 98.9%. The larger corpus provides more training signal for the embedding space, improving semantic matching quality. Latency scales sub-linearly (O(log n) FAISS lookup), remaining under 50ms even at 10 million tokens -- approximately 300-500x faster than raw LLM inference.
Version 2.0 introduces a spreading activation network layered over the entity graph. When a query activates nodes in the knowledge graph, activation propagates through edges to related concepts. This enables retrieval of contextually relevant documents that don't share direct lexical or semantic overlap with the query.
The activation network operates in parallel with L2 vector search. An agreement boosting mechanism amplifies results found by both methods, producing a combined ranking that outperforms either method alone.
A key bottleneck in embedding-based retrieval is the Q-to-A asymmetry: questions and their answers often occupy different regions of embedding space. Contextual intersection spreading addresses this by using dual seeds -- the query as primary and L1 context as secondary. Activation is multiplied where both seeds activate, with context only boosting relevance scores, never penalizing. This technique improved accuracy from 70.6% to 97.9% on numeric and identifier queries where standard embeddings fail.
Pensive maintains a persistent entity graph with 189,000 nodes and 131,000 edges, built automatically from 164,165 ingested documents across multiple sources:
| Source | Documents |
|---|---|
| ChatGPT conversations | 97,000 |
| Facebook messages | 54,000 |
| Claude conversations | 10,000 |
| YouTube transcripts | 1,400 |
| Calendar events | 150 |
| Maps locations | 7 |
The graph supports multi-hop traversal, temporal reasoning, and entity disambiguation. Combined with spreading activation, it enables retrieval patterns that pure vector search cannot achieve.
The original hybrid search used Reciprocal Rank Fusion (RRF) to merge dense and sparse results. Version 2.0 replaces this with parallel hybrid retrieval: spreading activation and L2 vector search run simultaneously, with agreement boosting and cross-encoder reranking applied to the combined results. This approach outperforms RRF by 13.3% on Hit@1 (73.3% vs 60.0%) with a combined latency of 37.9ms.
Planned enhancements include:
Pensive demonstrates that hierarchical retrieval combined with spreading activation and knowledge graph traversal can achieve near-perfect accuracy at extreme scale. The system retrieves information at 98.9% accuracy across 10 million tokens with 41ms latency -- a result that improves with scale rather than degrading, the opposite of transformer attention behavior.
The elimination of the lost-in-the-middle phenomenon (99.3% middle-context accuracy vs 40.7% for raw LLMs), the 100-300x latency improvement, and the ability to ingest and index 164,000+ documents into a 189K-node knowledge graph make Pensive suitable for production workloads requiring persistent, long-term memory at scales far beyond any model's native context window.
© 2026 Tuklus Labs. Released under MIT License.
View on GitHub