RFC: Document Router Layer for Large-Scale Retrieval
Summary
Vectorless currently sends all workspace documents to the Orchestrator's analysis phase, which uses a single LLM call to read every DocCard and select relevant ones. This works well for small workspaces (1–100 documents) but breaks down at scale — 1000+ DocCards exceed token budgets, increase latency, and make LLM selection unreliable.
This RFC proposes a Router layer that sits between Engine.query() and the Orchestrator, using compile-stage artifacts to pre-filter documents before dispatching Workers.
Problem
Current flow (workspace scope):
Engine.query(workspace)
→ resolve_scope() → all doc_ids (potentially 1000+)
→ load_documents() → load ALL (tree + nav_index + reasoning_index)
→ Orchestrator.analyze()
→ LLM reads ALL DocCards in one prompt ← bottleneck
→ selects docs → dispatch Workers
Issues at 1000+ documents:
- Token budget: 1000 DocCards × ~200 tokens = ~200K tokens — exceeds context windows
- Cost: One massive LLM call just for document selection
- Latency: Loading all document trees upfront is wasteful when only a fraction will be used
- Quality: LLM selection degrades when presented with too many options
Proposed Solution
Insert a Router that uses compile-stage artifacts (DocCard, ReasoningIndex, DocumentGraph) for coarse filtering, narrowing candidates to a manageable set before the Orchestrator's LLM-based analysis.
Proposed flow:
Engine.query(workspace)
→ resolve_scope() → all doc_ids
→ Router::route(query, doc_ids) → top-K candidates (10–20)
→ load_documents(top_K) → only K documents
→ Orchestrator.run() (reduced candidate set)
→ analyze: LLM reads K DocCards → precise selection
→ dispatch Workers → navigate trees
The Router does not replace the Orchestrator's analysis. It narrows the input so the LLM can make a better-informed, cheaper decision.
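The routing step above could expose a surface like the following sketch. The struct names follow the module structure later in this RFC (`DocumentRouter`, `RouteResult`, `ScoredCandidate`), but the fields, signatures, and the pass-through behavior below threshold are assumptions, not a final API:

```rust
/// Hypothetical sketch of the Router's public surface; fields and
/// signatures are assumptions, not the final API.
#[derive(Debug, Clone)]
pub struct ScoredCandidate {
    pub doc_id: String,
    pub score: f32,
}

#[derive(Debug)]
pub struct RouteResult {
    pub candidates: Vec<ScoredCandidate>, // top-K, sorted by score desc
    pub activated: bool,                  // false when below threshold
}

pub struct DocumentRouter {
    pub activate_threshold: usize,
    pub max_candidates: usize,
}

impl DocumentRouter {
    /// Pass through unchanged when the workspace is small; otherwise
    /// sort by score and truncate to the top-K candidates.
    pub fn route(&self, _query: &str, mut scored: Vec<ScoredCandidate>) -> RouteResult {
        if scored.len() <= self.activate_threshold {
            return RouteResult { candidates: scored, activated: false };
        }
        scored.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
        scored.truncate(self.max_candidates);
        RouteResult { candidates: scored, activated: true }
    }
}
```

In this sketch scoring happens before `route()` is called; in practice the Router would compute the scores itself from the compile-stage artifacts described below.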
Data Sources
The Router leverages artifacts already produced by the compile pipeline — no additional LLM calls at index time.
| Artifact | Fields Used by Router | Produced By |
|---|---|---|
| DocCard | title, overview, topic_tags, question_hints, sections | NavigationIndexStage |
| ReasoningIndex | topic_paths (keyword → node mappings) | ReasoningIndexStage |
| DocumentGraph | cross-document edges, shared keywords | DocumentGraphBuilder |
These are lightweight (no tree content), fast to load, and already persisted in the workspace.
Scoring Strategy
The Router combines three scoring signals:
1. BM25 on DocCard text (lexical match)
Build a BM25 index over each document's DocCard searchable text:
searchable_text = f"{title} {overview} {question_hints} {topic_tags} {section_descriptions}"
The existing Bm25Engine<K> in rust/src/scoring/bm25.rs supports per-field weighting. For Router use, we weight title and topic_tags higher than section descriptions.
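The pseudocode above could be realized in Rust roughly as follows. This is a throwaway sketch that approximates field weighting by repeating high-priority fields; the real implementation would use `Bm25Engine<K>`'s native per-field weights instead, and the parameter names are assumptions:

```rust
/// Sketch: assemble a DocCard's searchable text. High-priority fields
/// (title, topic_tags) are repeated so that plain BM25 weights them
/// higher; Bm25Engine<K>'s per-field weighting would replace this trick.
fn searchable_text(
    title: &str,
    overview: &str,
    topic_tags: &[&str],
    question_hints: &[&str],
    section_descriptions: &[&str],
) -> String {
    let boosted = format!("{t} {t} {tags} {tags}", t = title, tags = topic_tags.join(" "));
    format!(
        "{boosted} {overview} {} {}",
        question_hints.join(" "),
        section_descriptions.join(" ")
    )
}
```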
2. Keyword overlap (concept match)
Use QueryPlan.key_concepts (from query understanding) to match against each document's:
- DocCard.topic_tags
- ReasoningIndex.topic_paths keys
Score = Jaccard similarity between query concepts and document keywords.
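The Jaccard score is straightforward to compute over keyword sets; a minimal sketch (the function name is an assumption):

```rust
use std::collections::HashSet;

/// Jaccard similarity between query concepts and a document's keywords
/// (topic_tags plus ReasoningIndex topic_path keys):
/// |intersection| / |union|, with an empty union scored as 0.
fn keyword_overlap(query_concepts: &[&str], doc_keywords: &[&str]) -> f32 {
    let q: HashSet<&str> = query_concepts.iter().copied().collect();
    let d: HashSet<&str> = doc_keywords.iter().copied().collect();
    let inter = q.intersection(&d).count();
    let union = q.union(&d).count();
    if union == 0 { 0.0 } else { inter as f32 / union as f32 }
}
```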
3. Graph-based boost (contextual relevance)
If the DocumentGraph is available, documents connected to high-scoring candidates receive a boost. This captures the intuition that related documents may be co-relevant.
Fusion
final_score = w_bm25 * normalize(bm25_score)
+ w_keyword * keyword_overlap
+ w_graph * graph_boost
Default weights: w_bm25 = 0.5, w_keyword = 0.3, w_graph = 0.2
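A sketch of the fusion, assuming min-max normalization for BM25 as a placeholder (score calibration is an open question at the end of this RFC; keyword and graph signals are already in [0, 1] here by assumption):

```rust
/// Weighted fusion of the three per-document signals. BM25 scores are
/// min-max normalized across the candidate set; keyword_overlap and
/// graph_boost are assumed to already lie in [0, 1].
fn fuse(bm25: &[f32], keyword: &[f32], graph: &[f32]) -> Vec<f32> {
    let (w_bm25, w_keyword, w_graph) = (0.5_f32, 0.3_f32, 0.2_f32);
    let max = bm25.iter().cloned().fold(f32::MIN, f32::max);
    let min = bm25.iter().cloned().fold(f32::MAX, f32::min);
    let range = (max - min).max(f32::EPSILON); // avoid divide-by-zero
    bm25.iter()
        .zip(keyword)
        .zip(graph)
        .map(|((b, k), g)| w_bm25 * (b - min) / range + w_keyword * k + w_graph * g)
        .collect()
}
```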
LLM-assisted routing (optional)
When BM25 + keyword scores are ambiguous (e.g., top candidates have similar scores), the Router can optionally invoke the LLM to rank the top-M candidates. This is a lightweight call — the LLM sees only the top-M DocCard summaries (not full trees), making it orders of magnitude cheaper than the current all-DocCards approach.
RouteMode::Fast → BM25 + keyword + graph (no LLM)
RouteMode::Balanced → BM25 + keyword + graph, then LLM top-M if ambiguous
RouteMode::Precise → BM25 + keyword + graph + LLM top-M always
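The mode dispatch above might look like this sketch. The "ambiguous" test (score gap between the top two candidates below a margin) and the 0.05 margin are assumptions for illustration:

```rust
#[derive(Clone, Copy)]
enum RouteMode {
    Fast,
    Balanced,
    Precise,
}

/// Decide whether to invoke the LLM re-ranker on the top-M candidates.
/// `sorted_scores` must be sorted descending. In Balanced mode, scores
/// count as ambiguous when the gap between the top two candidates is
/// below a margin (0.05 here is a placeholder, not a tuned value).
fn needs_llm_rerank(mode: RouteMode, sorted_scores: &[f32]) -> bool {
    match mode {
        RouteMode::Fast => false,
        RouteMode::Precise => true,
        RouteMode::Balanced => match sorted_scores {
            [first, second, ..] => (first - second) < 0.05,
            _ => false, // zero or one candidate: nothing to disambiguate
        },
    }
}
```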
Activation Threshold
The Router is only activated when the workspace exceeds a configurable document count:
RouterConfig {
activate_threshold: 20, // only route when docs > 20
max_candidates: 15, // top-K to pass to Orchestrator
bm25_top_k: 50, // BM25 initial retrieval size
mode: RouteMode::Fast, // default: no LLM in Router
}
Below the threshold, the current flow (all DocCards → Orchestrator.analyze) is used unchanged.
Incremental Index Maintenance
The Router's BM25 index is updated incrementally alongside document indexing:
- Document indexed: Extract DocCard → router.upsert(doc_id, card, keywords)
- Document removed: router.remove(doc_id)
- Graph rebuilt: router.update_graph(new_graph)
No full re-index needed — the Router stays in sync with the workspace.
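The per-document update path could be backed by a simple doc_id → entry map, so a single-document change never touches the rest of the index. A minimal sketch (`RouterEntry` and the field names are assumptions; the real index would also feed the BM25 structures):

```rust
use std::collections::HashMap;

/// Hypothetical per-document entry; in practice this would also carry
/// BM25 postings rather than raw text.
struct RouterEntry {
    searchable_text: String,
    keywords: Vec<String>,
}

#[derive(Default)]
struct RouterIndex {
    entries: HashMap<String, RouterEntry>,
}

impl RouterIndex {
    /// Insert or replace a single document's entry.
    fn upsert(&mut self, doc_id: &str, text: String, keywords: Vec<String>) {
        self.entries.insert(
            doc_id.to_string(),
            RouterEntry { searchable_text: text, keywords },
        );
    }

    /// Drop a single document's entry.
    fn remove(&mut self, doc_id: &str) {
        self.entries.remove(doc_id);
    }
}
```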
Lazy Document Loading (future optimization)
Currently load_documents() loads full DocumentTree + NavigationIndex + ReasoningIndex for all candidates. With the Router, we can defer tree loading until Worker dispatch:
- Router + Orchestrator.analyze: only need DocCards (lightweight)
- Orchestrator.dispatch: load DocumentTree per-Worker, on demand
This reduces memory pressure when the Router selects 15 candidates from 1000 documents but the Orchestrator only dispatches 5 Workers.
Module Structure
rust/src/router/
├── mod.rs # DocumentRouter, RouteResult, ScoredCandidate
├── scorer.rs # BM25 + keyword + graph fusion scoring
└── config.rs # RouterConfig, RouteMode
Integration points:
- rust/src/client/engine.rs — insert Router::route() in query()
- rust/src/config/mod.rs — add router: RouterConfig
- python/src/lib.rs — expose RouterConfig to Python SDK
What This Is Not
- Not a replacement for the Orchestrator — Router is coarse filter, Orchestrator is precise selector
- Not an embedding layer — uses BM25 + keywords + graph, no vector similarity required
- Not a "vector backtrack" — this is a pragmatic engineering layer that happens to use lexical matching
Open Questions
- Score calibration: How to normalize BM25 scores across workspaces with different corpus sizes? Min-max normalization may not work well with very few documents. Consider quantile-based normalization.
- Cold start: New documents have no graph edges and no hot-node history. Should new docs get a freshness boost?
- Multi-hop routing: Should the Router consider re-routing after the first Orchestrator iteration finds nothing? Or is one-shot routing sufficient given the supervisor loop can replan?
- Thread safety: The Router holds a mutable BM25 index. Need to decide between RwLock<DocumentRouter> or rebuild-on-query from workspace data.