RAG Pipeline Deep Dive¶
The heart of Vectorless RAG is a three-stage retrieval pipeline that replaces vector similarity search with LLM reasoning. This page walks through each stage in detail.
Pipeline Overview¶
```mermaid
graph TD
Q[User Question] --> S1
subgraph S1["Stage 1: Tree Search"]
TS1[Build lightweight tree JSON<br/>titles + summaries only]
TS2[Send to LLM with<br/>search prompt]
TS3[LLM returns node_ids<br/>+ reasoning]
TS1 --> TS2 --> TS3
end
S1 --> S2
subgraph S2["Stage 2: Context Assembly"]
CA1[Resolve node_ids to<br/>TreeNode objects]
CA2[Recursively collect text<br/>from nodes + children]
CA3[Collect images from<br/>selected sections]
CA4[Format with headers<br/>and page numbers]
CA1 --> CA2 --> CA3 --> CA4
end
S2 --> S3
subgraph S3["Stage 3: Answer Generation"]
AG1[Build prompt with<br/>context + query]
AG2[Include images if<br/>multimodal]
AG3[LLM generates grounded<br/>answer with citations]
AG1 --> AG2 --> AG3
end
S3 --> A[Final Answer + Metadata]
style S1 fill:#e8eaf6,stroke:#3f51b5
style S2 fill:#e8f5e9,stroke:#4caf50
style S3 fill:#fff3e0,stroke:#ff9800
```
Stage 1: Tree Search¶
Goal: Identify which sections of the document are relevant to the user's question, using only the tree structure -- not the full text.
What Gets Sent to the LLM¶
The `TreeSearcher` calls `tree.to_json(include_text=False)` to produce a lightweight JSON representation:
```json
{
  "node_id": "root",
  "title": "Technical Architecture Guide",
  "summary": "Comprehensive overview of system architecture...",
  "children": [
    {
      "node_id": "1",
      "title": "System Overview",
      "summary": "High-level architecture with microservices...",
      "pages": "1-5",
      "children": [
        {
          "node_id": "1.1",
          "title": "Authentication Service",
          "summary": "OAuth 2.0 implementation with JWT tokens...",
          "pages": "2-3"
        },
        {
          "node_id": "1.2",
          "title": "Data Layer",
          "summary": "PostgreSQL primary with Redis caching...",
          "pages": "4-5"
        }
      ]
    }
  ]
}
```
Token Efficiency
This lightweight JSON scales with the number of tree nodes, not the page count, so it typically stays in the hundreds of tokens. A 500-page document produces a search payload only modestly larger than a 5-page document's, because only titles and summaries are sent.
The Search Prompt¶
The system prompt instructs the LLM to:
- Read the question carefully and identify key concepts
- Walk through the tree evaluating each node's title and summary
- Prefer specific nodes over broad parents (but select parents for multi-aspect questions)
- Select 1 to 5 nodes maximum
- Return a JSON object with `node_ids` and `reasoning`
Response Parsing¶
Primary path: The LLM returns valid JSON:
```json
{
  "node_ids": ["1.1", "3.2.1"],
  "reasoning": "The question asks about authentication, which is directly covered in Section 1.1. Section 3.2.1 discusses security configurations related to auth."
}
```
Fallback path: If JSON parsing fails, the system:
- Sends a raw text request with the same prompt
- Extracts node IDs using the regex `\b(root|\d+(?:\.\d+)*)\b`
- Validates extracted IDs against actual nodes in the tree (prevents hallucinated IDs)
- Caps the selection at 5 nodes
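The fallback path can be sketched as follows; `fallback_parse` and its signature are illustrative names for this page, not the library's actual API, but the regex is the one the system uses:

```python
import re

# Matches "root" or dotted numeric IDs such as "1", "1.2", "3.2.1".
NODE_ID_RE = re.compile(r"\b(root|\d+(?:\.\d+)*)\b")

def fallback_parse(raw_text: str, valid_ids: set[str], cap: int = 5) -> list[str]:
    """Extract node IDs from free-form LLM output, keep only IDs that
    actually exist in the tree, and cap the selection at `cap` nodes."""
    selected: list[str] = []
    for candidate in NODE_ID_RE.findall(raw_text):
        if candidate in valid_ids and candidate not in selected:
            selected.append(candidate)
        if len(selected) == cap:
            break
    return selected
```

Validating against `valid_ids` is what prevents hallucinated IDs from reaching Stage 2.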
Stage 2: Context Assembly¶
Goal: Extract the full text from selected sections and format it for the answer-generation prompt.
Text Collection¶
For each selected node, the `ContextAssembler`:
- Resolves node IDs to `TreeNode` objects via `tree.find_nodes_by_ids()`
- Recursively collects all text from the node and its children (depth-first, preserving reading order)
- Formats each section with a Markdown header:
```markdown
### Authentication Service (Pages 2-3)

OAuth 2.0 is implemented using JWT tokens for stateless authentication.
The service handles token issuance, validation, and refresh flows...

[Full section text here]
```
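The collection step above can be sketched like this; `TreeNode` here is a simplified stand-in for the real class, and `collect_text` / `format_section` are illustrative names:

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:  # simplified stand-in for the real TreeNode
    title: str
    pages: str
    text: str
    children: list["TreeNode"] = field(default_factory=list)

def collect_text(node: TreeNode) -> str:
    """Depth-first walk: the node's own text first, then each child's,
    preserving the document's reading order."""
    parts = [node.text] if node.text else []
    parts += [collect_text(child) for child in node.children]
    return "\n\n".join(p for p in parts if p)

def format_section(node: TreeNode) -> str:
    """Prefix a selected section with its Markdown header and page range."""
    return f"### {node.title} (Pages {node.pages})\n\n{collect_text(node)}"
```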
Image Collection¶
For multimodal documents (PDFs with images):
- Collects images from selected nodes and their children
- Caps at `MAX_CONTEXT_IMAGES` (default: 10) images
- Returns them as `{"data": base64, "media_type": "image/png", "caption": "..."}` dicts
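A sketch of the capping logic, using plain dicts as stand-ins for parsed nodes (`collect_images` is an illustrative name, not the library's API):

```python
def collect_images(node: dict, budget: int = 10) -> list[dict]:
    """Gather image dicts from a node and its children, stopping once
    `budget` (MAX_CONTEXT_IMAGES) images have been collected.
    `node` is a plain-dict stand-in: {"images": [...], "children": [...]}."""
    found: list[dict] = []
    queue = [node]
    while queue and len(found) < budget:
        current = queue.pop(0)
        found.extend(current.get("images", [])[: budget - len(found)])
        queue.extend(current.get("children", []))
    return found
```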
Context Budget¶
The assembler respects `max_context_chars` (default: 15,000 characters):
- Each section is added in order until the budget is reached
- If a section would exceed the budget but more than 200 characters of budget remain, it's truncated at a word boundary with a `[... section truncated]` note
- If fewer than 200 characters remain, the section is skipped entirely
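The budget rules above can be sketched as follows; `assemble` is an illustrative name, and the real assembler's bookkeeping may differ in detail:

```python
TRUNCATION_NOTE = "[... section truncated]"
MIN_REMAINING = 200

def assemble(sections: list[str], max_context_chars: int = 15_000) -> str:
    """Add formatted sections in order until the character budget is spent.
    An overflowing section is truncated at a word boundary when at least
    MIN_REMAINING characters of budget remain; otherwise it is skipped."""
    out: list[str] = []
    remaining = max_context_chars
    for section in sections:
        if len(section) <= remaining:
            out.append(section)
            remaining -= len(section)
        elif remaining > MIN_REMAINING:
            cut = section[:remaining].rsplit(" ", 1)[0]  # back up to a word boundary
            out.append(cut + " " + TRUNCATION_NOTE)
            remaining = 0
        # else: too little budget left -- skip this section entirely
    return "\n\n".join(out)
```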
```mermaid
graph LR
A[Section 1<br/>3,200 chars] --> B[Section 2<br/>5,100 chars]
B --> C[Section 3<br/>4,800 chars]
C --> D[Section 4<br/>6,000 chars]
A -.->|Budget: 15,000| E[Included<br/>3,200]
B -.->|Remaining: 11,800| F[Included<br/>5,100]
C -.->|Remaining: 6,700| G[Included<br/>4,800]
D -.->|Remaining: 1,900| H[Truncated<br/>1,900 chars]
style H fill:#fff3e0,stroke:#ff9800
```
Stage 3: Answer Generation¶
Goal: Produce a grounded, cited answer using only the retrieved context.
The Answer Prompt¶
The system prompt enforces strict grounding rules:
- Ground every claim in the provided content -- no outside knowledge
- Cite sources with section titles: (Section 3.2: Security Architecture)
- Be precise -- include numbers, specifics, and details from the source
- Acknowledge limitations -- explicitly state what's not covered
- Structure clearly -- use paragraphs, bullets, or numbered lists
- Analyze images -- describe charts, diagrams, and tables when provided
Multimodal Path¶
When images are available, the pipeline uses `generate_multimodal()`:
```python
content_blocks = [
    {"type": "text", "text": user_message},  # Context + query
    {"type": "image", "data": "...", "media_type": "image/png"},  # Chart
    {"type": "image", "data": "...", "media_type": "image/png"},  # Diagram
]
```
This allows the LLM to reference visual content in its answer:
"As shown in the architecture diagram (Page 12), the system uses a three-tier design with..."
Error Handling¶
If answer generation fails, the pipeline returns a user-friendly fallback:
"An error occurred while generating the answer. The relevant document sections were retrieved successfully -- please review the context directly or try again."
Multi-Document Pipeline¶
When a workspace contains multiple documents, an additional routing stage runs before the per-document RAG pipeline.
```mermaid
graph TD
Q[User Question] --> R[Document Router]
R --> |LLM selects relevant docs| D1[Doc 1 RAG Pipeline]
R --> D2[Doc 3 RAG Pipeline]
D1 --> A1[Answer 1]
D2 --> A2[Answer 2]
A1 --> M[Answer Merger]
A2 --> M
M --> |LLM synthesizes| F[Final Merged Answer]
style R fill:#e3f2fd,stroke:#1565c0
style M fill:#fce4ec,stroke:#c62828
```
Document Routing¶
The `DocumentRouter` sends document summaries to the LLM:
```json
{
  "documents": [
    {"doc_id": 1, "title": "API Reference", "summary": "REST API endpoints for..."},
    {"doc_id": 2, "title": "User Guide", "summary": "Step-by-step instructions for..."},
    {"doc_id": 3, "title": "Architecture Guide", "summary": "System design decisions..."}
  ]
}
```
The LLM selects 1-3 documents most likely to answer the query. This avoids running the full tree search on every document.
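Parsing and validating the router's reply might look like this; the `doc_ids` response field and the `route_documents` helper are assumptions for illustration, not the confirmed response schema:

```python
import json

def route_documents(llm_response: str, known_doc_ids: set[int], max_docs: int = 3) -> list[int]:
    """Parse the router LLM's JSON reply, drop doc IDs that don't exist
    in the workspace, and cap the result at max_docs documents."""
    try:
        selected = json.loads(llm_response).get("doc_ids", [])
    except json.JSONDecodeError:
        return []  # unparseable reply: route nothing rather than guess
    picked = [d for d in selected if d in known_doc_ids]
    return picked[:max_docs]
```

As in Stage 1's fallback path, validating against known IDs keeps a hallucinated document out of the per-document pipelines.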
Answer Merging¶
If multiple documents produce answers, the merger LLM:
- Combines information from all sources
- Cites document names: (from "API Reference")
- Removes redundancy
- Produces a single coherent response
If only one document has useful content, it's returned directly without the merge step.
Pipeline Return Value¶
The `RAGPipeline.query()` method returns a rich dictionary:
```python
{
    "answer": "The system uses OAuth 2.0 with JWT tokens for...",
    "node_ids": ["1.1", "3.2.1"],  # Selected sections
    "reasoning": "Selected authentication and security sections...",
    "context": "### Authentication Service (Pages 2-3)\n\n...",
    "image_count": 2,
    "images": [{"data": "...", "media_type": "image/png", "caption": "..."}],
}
```
This metadata powers the RAG Explorer panel in the React UI, giving users full transparency into the retrieval process.
Performance Characteristics¶
| Operation | Typical Time | Depends On |
|---|---|---|
| Tree Search (Stage 1) | 1-3 seconds | LLM latency, tree size |
| Context Assembly (Stage 2) | < 100ms | Number of selected nodes |
| Answer Generation (Stage 3) | 2-5 seconds | Context size, LLM latency |
| Document Routing | 1-2 seconds | Number of documents |
| Total (single doc) | 3-8 seconds | LLM provider & model |
| Total (multi-doc) | 5-15 seconds | Number of routed documents |
Quick Index Mode
Use Quick Index during document upload to skip LLM-generated summaries. This makes indexing nearly instant but uses text snippets as summaries, which may slightly reduce search accuracy.