Roadmap: The Ultimate Document Retrieval Tool

Focus: structured document retrieval — precise, reliable, indispensable. The "jq of document retrieval".

Scope

Focus on the document retrieval vertical — no code retrieval, no general knowledge platform. Build a complete Python developer experience layer on top of the Rust core engine, with broader format support and finer-grained parsing.

Phase Overview

| Phase | Focus | Language |
| --- | --- | --- |
| A1 | Router Layer: support 1000+ document workspaces | Rust |
| A2 | Document Formats: HTML, DOCX, LaTeX | Rust |
| A3 | Parsing Precision: tables, figures, footnotes | Rust |
| A4 | Python Ecosystem: CLI, Pythonic API, framework integration | Python |
| A5 | Domain Optimization: legal, financial, technical documents | Rust |
| A6 | Performance & Reliability: lazy loading, caching, concurrency | Rust |

Dependencies:

A1 (Router) ────→ A6 (Lazy Loading) ────→ A2 (Formats)

A3 (Precision)

A4 (Python, can run in parallel)

A5 (Domain)

A1: Router Layer

Goal: Support retrieval across 1000+ document workspaces.

Full design: RFC: Document Router

Key ideas:

  • Insert a Router between Engine.query() and the Orchestrator
  • Use compile-stage artifacts (DocCard + ReasoningIndex + DocumentGraph) for coarse filtering
  • BM25 + keyword overlap + graph boost — three-signal scoring fusion
  • Optional LLM-assisted routing (LLM ranks top-M candidates when scores are ambiguous)
  • Only activates when document count exceeds a configurable threshold
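
The scoring fusion and the activation threshold described above could be sketched roughly as follows. This is a minimal illustration, not the design from the RFC: the `Signals` and `FusionWeights` types, the weighted-sum formula, and the assumption that each signal is already normalized to [0, 1] are all hypothetical.

```rust
/// Per-document signals from the coarse filter.
/// Assumption: each signal is already normalized to [0, 1].
#[derive(Debug, Clone)]
pub struct Signals {
    pub doc_id: String,
    pub bm25: f32,
    pub keyword_overlap: f32,
    pub graph_boost: f32,
}

/// Hypothetical fusion weights; real values would presumably live in RouterConfig.
#[derive(Debug, Clone, Copy)]
pub struct FusionWeights {
    pub bm25: f32,
    pub keyword: f32,
    pub graph: f32,
}

/// Weighted linear fusion of the three signals (one possible fusion rule).
pub fn fused_score(s: &Signals, w: FusionWeights) -> f32 {
    w.bm25 * s.bm25 + w.keyword * s.keyword_overlap + w.graph * s.graph_boost
}

/// Returns the ids of the top-k candidates, or None when the document count
/// does not exceed `threshold` and routing should be skipped entirely.
pub fn route(
    candidates: &[Signals],
    w: FusionWeights,
    threshold: usize,
    top_k: usize,
) -> Option<Vec<String>> {
    if candidates.len() <= threshold {
        return None; // small workspace: every document goes to the Orchestrator
    }
    let mut scored: Vec<(f32, &str)> = candidates
        .iter()
        .map(|s| (fused_score(s, w), s.doc_id.as_str()))
        .collect();
    // Highest score first; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    Some(
        scored
            .into_iter()
            .take(top_k)
            .map(|(_, id)| id.to_string())
            .collect(),
    )
}
```

Returning `None` below the threshold keeps the small-workspace path identical to the current behavior, so the Router adds no overhead until it is needed.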

Module structure:

rust/src/router/
├── mod.rs # DocumentRouter, RouteResult, ScoredCandidate
├── scorer.rs # BM25 + keyword + graph fusion scoring
└── config.rs # RouterConfig, RouteMode

Estimated: ~600 lines Rust, no new dependencies.


A2: Document Format Support

Goal: Support HTML, DOCX, LaTeX in addition to PDF and Markdown.

HTML Parsing

HTML DOM → hierarchical tree structure
<h1>–<h6> → depth-mapped nodes
<p>, <li>, <td> → content nodes
<table> → special handling (text + structure)
<code>, <pre> → preserve formatting

Challenge: HTML documents often have deep nesting (div > div > div) that doesn't represent semantic structure. Need heuristics to skip decorative containers.
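
One possible heuristic for the nesting problem, sketched under assumptions: the `Element` type below is a simplified stand-in (not the `scraper` crate's types), headings are mapped to zero-based depths, and a container counts as decorative when it is a `div` or `span` with no direct text and exactly one child.

```rust
/// Simplified stand-in for a parsed HTML element (hypothetical type).
#[derive(Debug, Clone)]
pub struct Element {
    pub tag: String,
    pub text: String, // text directly inside this element
    pub children: Vec<Element>,
}

/// Maps "h1".."h6" to a zero-based depth; any other tag returns None.
pub fn heading_depth(tag: &str) -> Option<usize> {
    if tag.len() != 2 {
        return None;
    }
    let level = tag.strip_prefix('h')?.parse::<usize>().ok()?;
    if (1..=6).contains(&level) {
        Some(level - 1)
    } else {
        None
    }
}

/// Heuristic: a <div> or <span> with no direct text and exactly one child
/// adds nesting but no semantic structure.
pub fn is_decorative(el: &Element) -> bool {
    matches!(el.tag.as_str(), "div" | "span")
        && el.text.trim().is_empty()
        && el.children.len() == 1
}

/// Follows a chain of decorative wrappers down to the first meaningful element.
pub fn unwrap_decorative(mut el: &Element) -> &Element {
    while is_decorative(el) {
        el = &el.children[0];
    }
    el
}
```

A real parser would likely need further rules, for example for layout containers that hold several children, but the single-child case already collapses the `div > div > div` pattern.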

DOCX Parsing

DOCX = ZIP archive
word/document.xml → paragraph extraction
<w:pStyle w:val="Heading1"/> → heading level
<w:p> → paragraph content
Style inheritance → heading/body classification
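
The heading classification above might look like the following sketch. It assumes the built-in style ids `Heading1` through `Heading9` and a pre-extracted map from each style id to its parent style; both function names are hypothetical, and XML parsing is left out.

```rust
use std::collections::HashMap;

/// Extracts the heading level from a w:pStyle value such as "Heading1".
/// Returns None for anything that is not a built-in heading style id.
pub fn heading_level(style_val: &str) -> Option<u8> {
    let digits = style_val.strip_prefix("Heading")?;
    let level: u8 = digits.parse().ok()?;
    (1..=9).contains(&level).then_some(level)
}

/// Walks the style inheritance chain until a heading style is found.
/// `based_on` maps a style id to its parent style id (assumed to be
/// extracted from the styles part of the archive beforehand).
pub fn resolve_level(style: &str, based_on: &HashMap<String, String>) -> Option<u8> {
    let mut current = style;
    // Bound the walk so a malformed, cyclic style table cannot loop forever.
    for _ in 0..=based_on.len() {
        if let Some(level) = heading_level(current) {
            return Some(level);
        }
        current = based_on.get(current)?.as_str();
    }
    None
}
```

Paragraphs for which `resolve_level` returns `None` would be treated as body text.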

LaTeX Parsing

Regex-based extraction:
\section{...} → depth-0 node
\subsection{...} → depth-1 node
\begin{...} environments → content blocks
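
To illustrate the depth mapping above, here is a dependency-free sketch. Note the substitution: it uses plain prefix matching from the standard library instead of regular expressions, handles only `\section` and `\subsection`, and does not support nested braces or starred variants. The function name is hypothetical.

```rust
/// A sectioning command found in LaTeX source: (depth, title).
pub type Heading = (usize, String);

/// Scans line by line for \section{...} and \subsection{...}.
pub fn extract_headings(src: &str) -> Vec<Heading> {
    // Command prefix paired with its node depth.
    const COMMANDS: [(&str, usize); 2] = [("\\section{", 0), ("\\subsection{", 1)];
    let mut out = Vec::new();
    for line in src.lines() {
        let line = line.trim_start();
        for (prefix, depth) in COMMANDS {
            if let Some(rest) = line.strip_prefix(prefix) {
                // Take everything up to the first closing brace.
                if let Some(end) = rest.find('}') {
                    out.push((depth, rest[..end].to_string()));
                }
            }
        }
    }
    out
}
```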

Tasks

| # | Task | File |
| --- | --- | --- |
| 1 | HTML parser | rust/src/index/parse/html.rs |
| 2 | DOCX parser | rust/src/index/parse/docx.rs |
| 3 | LaTeX parser | rust/src/index/parse/latex.rs |
| 4 | Format detection | extend detect_format_from_path() |
| 5 | IndexMode extension | rust/src/index/pipeline.rs |

New dependencies: scraper = "0.22", zip = "2"

Estimated: ~800 lines Rust.
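
Format detection for the new formats could be extended along these lines. The existing signature of `detect_format_from_path()` is not shown in this roadmap, so the `DocFormat` enum, the return type, and the extension list below are assumptions for illustration only.

```rust
use std::path::Path;

/// Hypothetical format enum; the real engine may model this differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocFormat {
    Pdf,
    Markdown,
    Html,
    Docx,
    Latex,
}

/// Detects the format from the file extension, case-insensitively.
/// Returns None for missing or unknown extensions.
pub fn detect_format_from_path(path: &Path) -> Option<DocFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(DocFormat::Pdf),
        "md" | "markdown" => Some(DocFormat::Markdown),
        "html" | "htm" => Some(DocFormat::Html),
        "docx" => Some(DocFormat::Docx),
        "tex" => Some(DocFormat::Latex),
        _ => None,
    }
}
```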


A3: Parsing Precision

Goal: Fine-grained extraction of tables, figures, and footnotes.

Current Limitations

pdf-extract produces flat text. Tables lose structure, figures are invisible, footnotes mix into body text.

Table Extraction (PDF)

Use lopdf low-level access to detect text blocks with (x, y) coordinates, group by row and column, output as Markdown table strings. Insert as dedicated TreeNodes with {type: "table"} metadata.
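
The grouping step could be sketched as below, assuming text blocks with coordinates have already been recovered from the PDF. The `TextBlock` type, the `row_tol` parameter, and the function name are hypothetical, and no `lopdf` calls are shown. Columns are inferred only from x order within a row, so rows with missing cells would need extra alignment logic.

```rust
/// A positioned text fragment (hypothetical type).
#[derive(Debug, Clone)]
pub struct TextBlock {
    pub x: f32,
    pub y: f32,
    pub text: String,
}

/// Groups blocks into rows by y (within `row_tol`), orders cells by x,
/// and renders a Markdown table with the first row as the header.
pub fn blocks_to_markdown(blocks: &[TextBlock], row_tol: f32) -> String {
    let mut sorted: Vec<&TextBlock> = blocks.iter().collect();
    // PDF user space has y growing upward, so larger y is higher on the page.
    sorted.sort_by(|a, b| b.y.total_cmp(&a.y));

    let mut rows: Vec<Vec<&TextBlock>> = Vec::new();
    for block in sorted {
        let same_row = rows
            .last()
            .map_or(false, |row| (row[0].y - block.y).abs() <= row_tol);
        if same_row {
            rows.last_mut().unwrap().push(block);
        } else {
            rows.push(vec![block]);
        }
    }

    let mut lines = Vec::new();
    for (i, row) in rows.iter_mut().enumerate() {
        row.sort_by(|a, b| a.x.total_cmp(&b.x));
        let cells: Vec<&str> = row.iter().map(|b| b.text.as_str()).collect();
        lines.push(format!("| {} |", cells.join(" | ")));
        if i == 0 {
            // Header separator with one column per header cell.
            lines.push(format!("|{}", " --- |".repeat(cells.len())));
        }
    }
    lines.join("\n")
}
```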

Figure Description (PDF)

Extract image streams via lopdf, send them to a vision-capable LLM, and insert each description as a TreeNode with {type: "figure"} metadata. This is the only new LLM call in indexing; it is justified because figures often contain critical information that text extraction cannot see.

Cross-Reference Resolution

Resolve "see Section 3.2", "refer to Figure 4", "as noted in Table 2" to target TreeNodes. Enhances NavigationIndex with cross-reference edges for Worker navigation.
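
A rough sketch of the two steps, detection and resolution. The `RefKind` enum, the whitespace-based scan, and the use of a plain node index as the target are assumptions; the actual reference module may work differently.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum RefKind {
    Section,
    Figure,
    Table,
}

/// Finds mentions such as "Section 3.2" and returns (kind, label) pairs.
pub fn find_references(text: &str) -> Vec<(RefKind, String)> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut refs = Vec::new();
    for pair in words.windows(2) {
        let kind = match pair[0] {
            "Section" => RefKind::Section,
            "Figure" => RefKind::Figure,
            "Table" => RefKind::Table,
            _ => continue,
        };
        // Strip trailing punctuation such as "3.2," or "4)." from the label.
        let label = pair[1].trim_end_matches(|c: char| !c.is_ascii_digit());
        let valid = !label.is_empty()
            && label.chars().all(|c| c.is_ascii_digit() || c == '.');
        if valid {
            refs.push((kind, label.to_string()));
        }
    }
    refs
}

/// Maps each reference to a target node index, dropping unresolved ones.
pub fn resolve(
    refs: &[(RefKind, String)],
    targets: &HashMap<(RefKind, String), usize>,
) -> Vec<usize> {
    refs.iter().filter_map(|r| targets.get(r).copied()).collect()
}
```

Each resolved pair would then become one cross-reference edge from the mentioning node to the target node.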

Tasks

| # | Task | File |
| --- | --- | --- |
| 1 | PDF table extraction | rust/src/index/parse/pdf_table.rs |
| 2 | PDF figure description | rust/src/index/parse/pdf_figure.rs |
| 3 | PDF footnote handling | rust/src/index/parse/pdf_footnote.rs |
| 4 | Markdown table parsing | rust/src/index/parse/md_table.rs |
| 5 | Cross-reference resolution | extend rust/src/document/reference.rs |

New dependency: image = "0.25"

Estimated: ~1000 lines Rust.


A4: Python Ecosystem

Goal: Complete Python developer experience.

See the Python ecosystem expansion plan for full details.

| Phase | Content | Deliverable |
| --- | --- | --- |
| 1 | CLI | vectorless init/add/query/list/remove/ask/tree/stats/config |
| 2 | Pythonic API | errors.py, _engine.py, _query.py, type stubs |
| 3 | High-level abstractions | BatchIndexer, DocumentWatcher |
| 4 | Framework integration | LangChain BaseRetriever, LlamaIndex adapter |
| 5 | Testing | Unit → Mock → E2E |

A4 runs in parallel with A1–A3 — the Python layer doesn't depend on new Rust features.


A5: Domain Optimization

Goal: Domain-specific optimizations for legal, financial, and technical documents.

Domain Template System

pub trait DomainTemplate: Send + Sync {
    fn name(&self) -> &str;
    fn detect(&self, tree: &DocumentTree, card: &DocCard) -> bool;
    fn enhance(&self, tree: &mut DocumentTree, card: &mut DocCard);
    fn domain_tags(&self, tree: &DocumentTree) -> Vec<String>;
}

| Domain | Optimizations |
| --- | --- |
| Legal | Contract clause identification, article reference resolution, defined term tracking |
| Financial | KPI extraction from tables, reporting period detection, currency normalization |
| Technical | Code block extraction with language tags, API endpoint identification, version-aware sectioning |

Templates hook into the compile pipeline after the Enhance stage.
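
As an illustration of how a template might plug in, the sketch below repeats the trait and adds a toy legal template. The `DocumentTree` and `DocCard` stand-ins carry only the fields the example needs, and `LegalTemplate`, its detection rule, and `apply_templates` are all hypothetical.

```rust
/// Stand-in types: the real DocumentTree and DocCard have more structure.
#[derive(Debug, Default)]
pub struct DocumentTree {
    pub headings: Vec<String>,
}

#[derive(Debug, Default)]
pub struct DocCard {
    pub tags: Vec<String>,
}

pub trait DomainTemplate: Send + Sync {
    fn name(&self) -> &str;
    fn detect(&self, tree: &DocumentTree, card: &DocCard) -> bool;
    fn enhance(&self, tree: &mut DocumentTree, card: &mut DocCard);
    fn domain_tags(&self, tree: &DocumentTree) -> Vec<String>;
}

/// Toy template: treats a document as legal if a heading looks like an
/// article or mentions a clause.
pub struct LegalTemplate;

impl DomainTemplate for LegalTemplate {
    fn name(&self) -> &str {
        "legal"
    }

    fn detect(&self, tree: &DocumentTree, _card: &DocCard) -> bool {
        tree.headings.iter().any(|h| {
            let h = h.to_lowercase();
            h.starts_with("article ") || h.contains("clause")
        })
    }

    fn enhance(&self, tree: &mut DocumentTree, card: &mut DocCard) {
        // Only tags are added here; a real template might also annotate nodes.
        for tag in self.domain_tags(tree) {
            if !card.tags.contains(&tag) {
                card.tags.push(tag);
            }
        }
    }

    fn domain_tags(&self, _tree: &DocumentTree) -> Vec<String> {
        vec!["legal".to_string()]
    }
}

/// Runs every template whose detect() matches and returns their names.
pub fn apply_templates(
    templates: &[Box<dyn DomainTemplate>],
    tree: &mut DocumentTree,
    card: &mut DocCard,
) -> Vec<String> {
    let mut applied = Vec::new();
    for t in templates {
        if t.detect(tree, card) {
            t.enhance(tree, card);
            applied.push(t.name().to_string());
        }
    }
    applied
}
```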

Estimated: ~500 lines Rust (framework + 2–3 built-in templates).


A6: Performance & Reliability

Goal: Optimize memory, latency, and observability.

Lazy Document Loading

Defer tree loading until Worker dispatch. The Router and Orchestrator.analyze need only the lightweight DocCards, while each DocumentTree is 10–100x larger than its DocCard.
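
A minimal sketch of deferred loading using `std::sync::OnceLock` from the standard library. The `LazyDocument` wrapper, the stand-in `DocCard` and `DocumentTree` types, and the closure-based loader are assumptions; the load counter illustrates how a lazy load count metric could be collected.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in types for illustration only.
pub struct DocCard {
    pub title: String,
}

pub struct DocumentTree {
    pub nodes: Vec<String>,
}

/// Holds the lightweight card eagerly and loads the tree on first access.
pub struct LazyDocument<F: Fn() -> DocumentTree> {
    pub card: DocCard,
    tree: OnceLock<DocumentTree>,
    loader: F,
    loads: AtomicUsize,
}

impl<F: Fn() -> DocumentTree> LazyDocument<F> {
    pub fn new(card: DocCard, loader: F) -> Self {
        Self {
            card,
            tree: OnceLock::new(),
            loader,
            loads: AtomicUsize::new(0),
        }
    }

    /// Loads the tree on the first call and returns the cached value afterwards.
    pub fn tree(&self) -> &DocumentTree {
        self.tree.get_or_init(|| {
            self.loads.fetch_add(1, Ordering::Relaxed);
            (self.loader)()
        })
    }

    /// Number of times the loader actually ran (0 or 1 per document).
    pub fn load_count(&self) -> usize {
        self.loads.load(Ordering::Relaxed)
    }
}
```

Routing can read `card` without ever triggering the loader, so only the documents that reach a Worker pay the cost of loading their tree.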

Caching

  • Router cache: Cache routing results keyed by (query_hash, doc_ids_hash). Invalidate on document add/remove.
  • Query cache: Same query + same documents = cached result. Useful for interactive mode.
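
The router cache keyed by (query_hash, doc_ids_hash) might be sketched like this. The `RouterCache` type and its methods are hypothetical, and `DefaultHasher` is used only because the cache lives in memory; a persisted cache would need a hash that is stable across builds.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

/// In-memory routing cache keyed by (query_hash, doc_ids_hash).
#[derive(Default)]
pub struct RouterCache {
    entries: HashMap<(u64, u64), Vec<String>>,
    doc_ids: Vec<String>, // kept sorted so the hash is order-independent
}

impl RouterCache {
    fn key(&self, query: &str) -> (u64, u64) {
        (hash_of(query), hash_of(&self.doc_ids))
    }

    pub fn get(&self, query: &str) -> Option<&Vec<String>> {
        self.entries.get(&self.key(query))
    }

    pub fn put(&mut self, query: &str, routed: Vec<String>) {
        let key = self.key(query);
        self.entries.insert(key, routed);
    }

    /// Adding a document changes the candidate set, so all entries are dropped.
    pub fn add_document(&mut self, id: &str) {
        if let Err(pos) = self.doc_ids.binary_search_by(|d| d.as_str().cmp(id)) {
            self.doc_ids.insert(pos, id.to_string());
            self.entries.clear();
        }
    }

    /// Removing a document invalidates the cache in the same way.
    pub fn remove_document(&mut self, id: &str) {
        if let Ok(pos) = self.doc_ids.binary_search_by(|d| d.as_str().cmp(id)) {
            self.doc_ids.remove(pos);
            self.entries.clear();
        }
    }
}
```

Clearing on every add or remove is the simplest invalidation policy. Because the document-set hash is part of the key, stale entries would miss even without the explicit clear; clearing just keeps memory bounded.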

Subtree-Level Incremental Updates

The current incremental update detects changes only at the file level. Refine it to diff the affected subtrees and re-compile only the changed portions, which is estimated to reduce re-indexing LLM calls by 50–80%.
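
One way to find the affected subtrees is to compare per-node content hashes between the old and new versions of a document. The sketch below assumes each subtree has a stable id and a precomputed content hash; the function name and the map representation are hypothetical.

```rust
use std::collections::HashMap;

/// Returns the ids of subtrees that are new or whose content hash changed,
/// sorted for deterministic output. Removed subtrees need no re-compilation.
pub fn changed_subtrees(
    old: &HashMap<String, u64>,
    new: &HashMap<String, u64>,
) -> Vec<String> {
    let mut changed: Vec<String> = new
        .iter()
        .filter(|(id, hash)| old.get(*id) != Some(*hash))
        .map(|(id, _)| id.clone())
        .collect();
    changed.sort();
    changed
}
```

Only the returned ids would go back through the LLM-dependent compile stages; unchanged subtrees keep their existing artifacts.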

Metrics

| Metric | Source | Use Case |
| --- | --- | --- |
| Router latency | router.route() | Monitor routing overhead |
| Router cache hit rate | Router cache | Tune cache size |
| Lazy load count | Worker dispatch | Verify memory savings |

Success Metrics

| Metric | Current | Target |
| --- | --- | --- |
| Max practical workspace size | ~100 docs | 10,000+ docs |
| Index time per doc (PDF, 50 pages) | ~30s | ~20s |
| Query latency (100 docs) | ~10s | ~8s |
| Query latency (1000 docs) | N/A | ~12s |
| Python install-to-query | Manual setup | < 5 minutes |
| Format support | PDF, Markdown | + HTML, DOCX, LaTeX |

Execution Priority

Sprint 1: A1 (Router) + A4 Phase 1 (CLI)
Sprint 2: A6 (Lazy Loading) + A4 Phase 2 (Pythonic API)
Sprint 3: A2 (HTML, DOCX, LaTeX)
Sprint 4: A3 (Table, Figure, Footnote)
Sprint 5: A5 (Domain Templates) + A4 Phase 4 (Framework Integration)

A1 is the most critical enabler — without it, large-scale scenarios are not viable. A4 (Python) runs in parallel throughout.