IR Specification

The Vectorless IR (Intermediate Representation) is the single artifact produced by the compile pipeline. It is a self-contained, serializable document that encodes everything an agent needs to reason about a document — tree structure, indexes, acceleration data, and metadata.

Overview

Document (PDF/MD)
↓ compile pipeline
Document IR (.bin)
↓ load
DocumentNavigator → agent traversal → evidence → answer

The IR is produced once at compile time and consumed many times at query time. Agents never re-compile — they only read and navigate.

Schema Versioning

| Field | Description |
| --- | --- |
| schema_version | u32, incremented on backward-incompatible changes |
| CURRENT_SCHEMA_VERSION | Currently 3 |

Old IRs are detected by checking schema_version < CURRENT_SCHEMA_VERSION. All new fields carry #[serde(default)], so a newer reader can still deserialize an older IR that lacks them.
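The version comparison above can be sketched as a small classifier. This is a minimal, dependency-free illustration: the SchemaStatus enum and classify_schema function are hypothetical names, and treating a missing field as version 0 is an assumption based on the version history rather than a documented rule.

```rust
use std::cmp::Ordering;

/// Mirrors the constant named in the spec (currently 3).
pub const CURRENT_SCHEMA_VERSION: u32 = 3;

/// Hypothetical result of comparing an IR's version with the reader's.
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaStatus {
    /// Written by an older compiler; newer fields fall back to defaults.
    Outdated { found: u32 },
    /// Written with the schema this reader was built for.
    Current,
    /// Written by a newer compiler than this reader knows about.
    Newer { found: u32 },
}

/// Classify an IR by its `schema_version`.
/// `None` models a pre-versioning IR with no field at all (assumed version 0).
pub fn classify_schema(schema_version: Option<u32>) -> SchemaStatus {
    let found = schema_version.unwrap_or(0);
    match found.cmp(&CURRENT_SCHEMA_VERSION) {
        Ordering::Less => SchemaStatus::Outdated { found },
        Ordering::Equal => SchemaStatus::Current,
        Ordering::Greater => SchemaStatus::Newer { found },
    }
}
```

A loader could call classify_schema right after parsing and decide whether to accept, upgrade, or reject the IR; which of those the real pipeline does is not stated here.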

Version History

| Version | Changes |
| --- | --- |
| 0 | Pre-versioning (no schema_version field) |
| 1 | Initial persisted format with PersistedWrapper envelope |
| 2 | Added query_routes, chain_index, content_overlap, evidence_scores |
| 3 | Unified IR: single Document type, embedded DocumentMeta, schema_version field |

Field Specification

Identity

| Field | Type | Description |
| --- | --- | --- |
| schema_version | u32 | IR format version |
| doc_id | String | Unique document identifier (UUID) |
| name | String | Document name/title |
| format | String | Source format: "pdf", "markdown", "docx" |
| source_path | Option<String> | Original file path (if compiled from file) |

Indexes

| Field | Type | Built by | Description |
| --- | --- | --- | --- |
| tree | DocumentTree | Build pass | Arena-based hierarchical tree with titled nodes |
| nav_index | NavigationIndex | Navigation pass | Child routes, overviews, doc cards for agent navigation |
| reasoning_index | ReasoningIndex | Reasoning pass | Keyword-to-node mappings, topic entries, section summaries |

Compile Results

| Field | Type | Built by | Description |
| --- | --- | --- | --- |
| summary | String | Enhance pass | Document-level summary |
| concepts | Vec<Concept> | Concept pass | Key concepts with section associations |

Agent Acceleration Data

| Field | Type | Built by | Description |
| --- | --- | --- | --- |
| query_routes | Option<QueryRoutingTable> | Route pass | Intent routes and concept routes for fast agent targeting |
| chain_index | Option<ChainIndex> | Chain pass | Reasoning chains connecting sections (elaboration, supporting) |
| content_overlap | Option<ContentOverlapMap> | Overlap pass | Jaccard similarity between overlapping nodes |
| evidence_scores | Option<EvidenceScoreMap> | Score pass | Per-node quality scores (density, richness, specificity) |

All acceleration fields are Option<_> with #[serde(default)] — they are absent in fast compilation mode (no LLM).
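Because every acceleration field is optional, a consumer has to check what is present before relying on it. The sketch below shows one way to do that. The Acceleration grouping struct and the present method are hypothetical, and the four field types are empty placeholders standing in for the real definitions, which this page does not give.

```rust
// Placeholder types: the real definitions are not part of this page.
#[derive(Debug)]
pub struct QueryRoutingTable;
#[derive(Debug)]
pub struct ChainIndex;
#[derive(Debug)]
pub struct ContentOverlapMap;
#[derive(Debug)]
pub struct EvidenceScoreMap;

/// Hypothetical grouping of the four optional acceleration fields.
/// `Default` yields `None` everywhere, matching a fast-mode IR.
#[derive(Debug, Default)]
pub struct Acceleration {
    pub query_routes: Option<QueryRoutingTable>,
    pub chain_index: Option<ChainIndex>,
    pub content_overlap: Option<ContentOverlapMap>,
    pub evidence_scores: Option<EvidenceScoreMap>,
}

impl Acceleration {
    /// Names of the acceleration fields that are actually populated.
    pub fn present(&self) -> Vec<&'static str> {
        let mut names = Vec::new();
        if self.query_routes.is_some() {
            names.push("query_routes");
        }
        if self.chain_index.is_some() {
            names.push("chain_index");
        }
        if self.content_overlap.is_some() {
            names.push("content_overlap");
        }
        if self.evidence_scores.is_some() {
            names.push("evidence_scores");
        }
        names
    }
}
```

An agent could use the returned list to pick a strategy, for example falling back to plain nav_index traversal when query_routes is missing.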

Metadata

| Field | Type | Description |
| --- | --- | --- |
| page_count | Option<usize> | Page count (PDF only) |
| meta | Option<DocumentMeta> | Processing metadata (see below) |

DocumentMeta

Processing metadata for incremental recompilation and diagnostics:

| Field | Type | Description |
| --- | --- | --- |
| created_at | DateTime<Utc> | IR creation timestamp |
| modified_at | DateTime<Utc> | Last modification timestamp |
| content_fingerprint | String | BLAKE2b hash of source content (hex-encoded) |
| logic_fingerprint | String | Hash of pipeline configuration |
| processing_version | u32 | Incremented when algorithm changes |
| node_count | usize | Number of nodes in tree |
| total_summary_tokens | usize | Total tokens in generated summaries |
| processing_model | Option<String> | LLM model used for processing |
| processing_duration_ms | u64 | Total compile time in milliseconds |
| line_count | Option<usize> | Line count (for text files) |
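One plausible use of these fields for incremental recompilation is to compare the stored fingerprints against freshly computed ones. The decision rule below is an assumption, since the spec only says the metadata supports incremental recompilation. MetaFingerprint, RecompileReason, and recompile_reason are hypothetical names.

```rust
/// Subset of DocumentMeta that is relevant to change detection.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MetaFingerprint {
    pub content_fingerprint: String,
    pub logic_fingerprint: String,
    pub processing_version: u32,
}

/// Why a stored IR is considered stale (hypothetical classification).
#[derive(Debug, PartialEq, Eq)]
pub enum RecompileReason {
    SourceChanged,
    ConfigChanged,
    AlgorithmChanged,
}

/// Return the first reason the stored IR is stale, or `None` if it can be reused.
/// Assumed order of checks: source content, pipeline configuration, algorithm version.
pub fn recompile_reason(
    stored: &MetaFingerprint,
    current: &MetaFingerprint,
) -> Option<RecompileReason> {
    if stored.content_fingerprint != current.content_fingerprint {
        Some(RecompileReason::SourceChanged)
    } else if stored.logic_fingerprint != current.logic_fingerprint {
        Some(RecompileReason::ConfigChanged)
    } else if stored.processing_version != current.processing_version {
        Some(RecompileReason::AlgorithmChanged)
    } else {
        None
    }
}
```

Here `current` would be built from the source file and pipeline settings at hand, and `stored` read from the meta field of an existing IR.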

Serialization Format

IR files use a JSON envelope with checksum verification:

PersistedWrapper
├── version: u32 (FORMAT_VERSION = 2)
├── checksum: String (SHA-256 of payload)
└── payload: Value (serialized Document as JSON)

The checksum ensures data integrity. On load, the wrapper verifies the checksum before deserializing the payload into Document.
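The verify-then-deserialize order can be sketched as follows. To stay dependency-free, the digest is passed in as a function; a real loader would supply SHA-256 (for example from a hashing crate). Modelling the payload as serialized JSON text and hashing its bytes are assumptions, and verify_payload and LoadError are hypothetical names.

```rust
/// Simplified envelope: the payload is kept as serialized JSON text (assumption).
#[derive(Debug)]
pub struct PersistedWrapper {
    pub version: u32,
    pub checksum: String,
    pub payload: String,
}

/// Hypothetical error returned when the envelope fails verification.
#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    ChecksumMismatch { expected: String, actual: String },
}

/// Check the stored checksum against a freshly computed digest of the payload.
/// Only on success is the payload handed back for deserialization into Document.
/// `digest` stands in for a hex-encoded SHA-256 function.
pub fn verify_payload<'a>(
    wrapper: &'a PersistedWrapper,
    digest: impl Fn(&[u8]) -> String,
) -> Result<&'a str, LoadError> {
    let actual = digest(wrapper.payload.as_bytes());
    if actual != wrapper.checksum {
        return Err(LoadError::ChecksumMismatch {
            expected: wrapper.checksum.clone(),
            actual,
        });
    }
    Ok(&wrapper.payload)
}
```

Returning the payload only after the comparison succeeds keeps a corrupted file from ever reaching the deserializer.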

Compilation Modes

| Mode | Passes | LLM Calls | Output |
| --- | --- | --- | --- |
| Fast | Parse → Build → Validate → Split → Navigation | 0 | Tree + nav index, no summaries or acceleration data |
| Standard | Fast + Enhance (selective) + Reasoning + Route + Score | Limited | Full IR with selective summaries |
| Deep | All 15 passes | Full | Complete IR with all acceleration data |

In all modes, the IR is a valid Document — agents can navigate any IR regardless of compilation depth.
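The mode table can be encoded as a small enum, shown here as an illustrative sketch. CompileMode and its methods are hypothetical names. The pass lists are copied from the table, and because the 15 passes of Deep mode are not enumerated on this page, the sketch returns None for it rather than guessing.

```rust
/// The three compilation modes from the table (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompileMode {
    Fast,
    Standard,
    Deep,
}

impl CompileMode {
    /// Passes the table names explicitly, in order.
    /// Deep runs "all 15 passes", which are not listed, so it yields `None`.
    pub fn listed_passes(self) -> Option<Vec<&'static str>> {
        const FAST: [&str; 5] = ["Parse", "Build", "Validate", "Split", "Navigation"];
        match self {
            CompileMode::Fast => Some(FAST.to_vec()),
            CompileMode::Standard => {
                let mut passes = FAST.to_vec();
                passes.extend_from_slice(&["Enhance", "Reasoning", "Route", "Score"]);
                Some(passes)
            }
            CompileMode::Deep => None,
        }
    }

    /// Fast mode makes zero LLM calls; the other modes make some.
    pub fn uses_llm(self) -> bool {
        !matches!(self, CompileMode::Fast)
    }
}
```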