RAG Pipeline Deep Dive¶
The heart of Vectorless RAG is a three-stage retrieval pipeline that replaces vector similarity search with LLM reasoning. This page walks through each stage in detail.
Pipeline Overview¶
```mermaid
graph TD
Q[User Question] --> S1
subgraph S1["Stage 1: Tree Search"]
TS1[Build lightweight tree JSON<br/>titles + summaries only]
TS2[Send to LLM with<br/>search prompt]
TS3[LLM returns node_ids<br/>+ reasoning]
TS1 --> TS2 --> TS3
end
S1 --> S2
subgraph S2["Stage 2: Context Assembly"]
CA1[Resolve node_ids to<br/>TreeNode objects]
CA2[Recursively collect text<br/>from nodes + children]
CA3[Collect images from<br/>selected sections]
CA4[Format with headers<br/>and page numbers]
CA1 --> CA2 --> CA3 --> CA4
end
S2 --> S3
subgraph S3["Stage 3: Answer Generation"]
AG1[Build prompt with<br/>context + query]
AG2[Include images if<br/>multimodal]
AG3[LLM generates grounded<br/>answer with citations]
AG1 --> AG2 --> AG3
end
S3 --> A[Final Answer + Metadata]
style S1 fill:#e8eaf6,stroke:#3f51b5
style S2 fill:#e8f5e9,stroke:#4caf50
style S3 fill:#fff3e0,stroke:#ff9800
```
Stage 1: Tree Search¶
Goal: Identify which sections of the document are relevant to the user's question, using only the tree structure -- not the full text.
What Gets Sent to the LLM¶
The `TreeSearcher` calls `tree.to_json(include_text=False)` to produce a lightweight JSON representation:
```json
{
  "node_id": "root",
  "title": "Technical Architecture Guide",
  "summary": "Comprehensive overview of system architecture...",
  "children": [
    {
      "node_id": "1",
      "title": "System Overview",
      "summary": "High-level architecture with microservices...",
      "pages": "1-5",
      "children": [
        {
          "node_id": "1.1",
          "title": "Authentication Service",
          "summary": "OAuth 2.0 implementation with JWT tokens...",
          "pages": "2-3"
        },
        {
          "node_id": "1.2",
          "title": "Data Layer",
          "summary": "PostgreSQL primary with Redis caching...",
          "pages": "4-5"
        }
      ]
    }
  ]
}
```
Token Efficiency
This lightweight JSON scales with the number of tree nodes, not the page count, so it typically stays in the hundreds of tokens. A 500-page document produces a search payload only modestly larger than a 5-page document's, because only titles and summaries are sent.
The Search Prompt¶
The system prompt instructs the LLM to:
- Read the question carefully and identify key concepts
- Walk through the tree evaluating each node's title and summary
- Prefer specific nodes over broad parents (but select parents for multi-aspect questions)
- Select 1 to 5 nodes maximum
- Return a JSON object with `node_ids` and `reasoning`
Response Parsing¶
Primary path: The LLM returns valid JSON:
```json
{
  "node_ids": ["1.1", "3.2.1"],
  "reasoning": "The question asks about authentication, which is directly covered in Section 1.1. Section 3.2.1 discusses security configurations related to auth."
}
```
Fallback path: If JSON parsing fails, the system:
- Sends a raw text request with the same prompt
- Extracts node IDs using the regex `\b(root|\d+(?:\.\d+)*)\b`
- Validates extracted IDs against actual nodes in the tree (prevents hallucinated IDs)
- Caps the selection at 5 nodes
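The fallback path can be sketched as follows; `fallback_parse` and its signature are illustrative names for this page, not the library's actual API, but the regex is the one the system uses:

```python
import re

# Matches "root" or dotted numeric IDs such as "1", "1.2", "3.2.1".
NODE_ID_RE = re.compile(r"\b(root|\d+(?:\.\d+)*)\b")

def fallback_parse(raw_text: str, valid_ids: set[str], cap: int = 5) -> list[str]:
    """Extract node IDs from free-form LLM output, keep only IDs that
    actually exist in the tree, and cap the selection at `cap` nodes."""
    selected: list[str] = []
    for candidate in NODE_ID_RE.findall(raw_text):
        if candidate in valid_ids and candidate not in selected:
            selected.append(candidate)
        if len(selected) == cap:
            break
    return selected
```

Validating against `valid_ids` is what prevents hallucinated IDs from reaching Stage 2.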
Stage 2: Context Assembly¶
Goal: Extract the full text from selected sections and format it for the answer-generation prompt.
Text Collection¶
For each selected node, the `ContextAssembler`:
- Resolves node IDs to `TreeNode` objects via `tree.find_nodes_by_ids()`
- Recursively collects all text from the node and its children (depth-first, preserving reading order)
- Formats each section with a Markdown header:
```markdown
### Authentication Service (Pages 2-3)

OAuth 2.0 is implemented using JWT tokens for stateless authentication.
The service handles token issuance, validation, and refresh flows...

[Full section text here]
```
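The collection step above can be sketched like this; `TreeNode` here is a simplified stand-in for the real class, and `collect_text` / `format_section` are illustrative names:

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:  # simplified stand-in for the real TreeNode
    title: str
    pages: str
    text: str
    children: list["TreeNode"] = field(default_factory=list)

def collect_text(node: TreeNode) -> str:
    """Depth-first walk: the node's own text first, then each child's,
    preserving the document's reading order."""
    parts = [node.text] if node.text else []
    parts += [collect_text(child) for child in node.children]
    return "\n\n".join(p for p in parts if p)

def format_section(node: TreeNode) -> str:
    """Prefix a selected section with its Markdown header and page range."""
    return f"### {node.title} (Pages {node.pages})\n\n{collect_text(node)}"
```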
Image Collection¶
For multimodal documents (PDFs with images):
- Collects images from selected nodes and their children
- Caps at `MAX_CONTEXT_IMAGES` (default: 10) images
- Returns them as `{"data": base64, "media_type": "image/png", "caption": "..."}` dicts
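A sketch of the capping logic, using plain dicts as stand-ins for parsed nodes (`collect_images` is an illustrative name, not the library's API):

```python
def collect_images(node: dict, budget: int = 10) -> list[dict]:
    """Gather image dicts from a node and its children, stopping once
    `budget` (MAX_CONTEXT_IMAGES) images have been collected.
    `node` is a plain-dict stand-in: {"images": [...], "children": [...]}."""
    found: list[dict] = []
    queue = [node]
    while queue and len(found) < budget:
        current = queue.pop(0)
        found.extend(current.get("images", [])[: budget - len(found)])
        queue.extend(current.get("children", []))
    return found
```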
Context Budget¶
The assembler respects `max_context_chars` (default: 15,000 characters):
- Each section is added in order until the budget is reached
- If a section would exceed the budget but more than 200 characters of budget remain, it's truncated at a word boundary with a `[... section truncated]` note
- If fewer than 200 characters remain, the section is skipped entirely
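The budget rules above can be sketched as follows; `assemble` is an illustrative name, and the real assembler's bookkeeping may differ in detail:

```python
TRUNCATION_NOTE = "[... section truncated]"
MIN_REMAINING = 200

def assemble(sections: list[str], max_context_chars: int = 15_000) -> str:
    """Add formatted sections in order until the character budget is spent.
    An overflowing section is truncated at a word boundary when at least
    MIN_REMAINING characters of budget remain; otherwise it is skipped."""
    out: list[str] = []
    remaining = max_context_chars
    for section in sections:
        if len(section) <= remaining:
            out.append(section)
            remaining -= len(section)
        elif remaining > MIN_REMAINING:
            cut = section[:remaining].rsplit(" ", 1)[0]  # back up to a word boundary
            out.append(cut + " " + TRUNCATION_NOTE)
            remaining = 0
        # else: too little budget left -- skip this section entirely
    return "\n\n".join(out)
```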
```mermaid
graph LR
A[Section 1<br/>3,200 chars] --> B[Section 2<br/>5,100 chars]
B --> C[Section 3<br/>4,800 chars]
C --> D[Section 4<br/>6,000 chars]
A -.->|Budget: 15,000| E[Included<br/>3,200]
B -.->|Remaining: 11,800| F[Included<br/>5,100]
C -.->|Remaining: 6,700| G[Included<br/>4,800]
D -.->|Remaining: 1,900| H[Truncated<br/>1,900 chars]
style H fill:#fff3e0,stroke:#ff9800
```
Stage 3: Answer Generation¶
Goal: Produce a grounded, cited answer using only the retrieved context.
The Answer Prompt¶
The system prompt enforces strict grounding rules:
- Ground every claim in the provided content -- no outside knowledge
- Cite sources with section titles: (Section 3.2: Security Architecture)
- Be precise -- include numbers, specifics, and details from the source
- Acknowledge limitations -- explicitly state what's not covered
- Structure clearly -- use paragraphs, bullets, or numbered lists
- Analyze images -- describe charts, diagrams, and tables when provided
Multimodal Path¶
When images are available, the pipeline uses `generate_multimodal()`:
```python
content_blocks = [
    {"type": "text", "text": user_message},  # Context + query
    {"type": "image", "data": "...", "media_type": "image/png"},  # Chart
    {"type": "image", "data": "...", "media_type": "image/png"},  # Diagram
]
```
This allows the LLM to reference visual content in its answer:
"As shown in the architecture diagram (Page 12), the system uses a three-tier design with..."
Error Handling¶
If answer generation fails, the pipeline returns a user-friendly fallback:
"An error occurred while generating the answer. The relevant document sections were retrieved successfully -- please review the context directly or try again."
Multi-Document Pipeline¶
When a workspace contains multiple documents, an additional routing stage runs before the per-document RAG pipeline.
```mermaid
graph TD
Q[User Question] --> R[Document Router]
R --> |LLM selects relevant docs| D1[Doc 1 RAG Pipeline]
R --> D2[Doc 3 RAG Pipeline]
D1 --> A1[Answer 1]
D2 --> A2[Answer 2]
A1 --> M[Answer Merger]
A2 --> M
M --> |LLM synthesizes| F[Final Merged Answer]
style R fill:#e3f2fd,stroke:#1565c0
style M fill:#fce4ec,stroke:#c62828
```
Document Routing¶
The `DocumentRouter` sends document summaries to the LLM:
```json
{
  "documents": [
    {"doc_id": 1, "title": "API Reference", "summary": "REST API endpoints for..."},
    {"doc_id": 2, "title": "User Guide", "summary": "Step-by-step instructions for..."},
    {"doc_id": 3, "title": "Architecture Guide", "summary": "System design decisions..."}
  ]
}
```
The LLM selects 1-3 documents most likely to answer the query. This avoids running the full tree search on every document.
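Parsing and validating the router's reply might look like this; the `doc_ids` response field and the `route_documents` helper are assumptions for illustration, not the confirmed response schema:

```python
import json

def route_documents(llm_response: str, known_doc_ids: set[int], max_docs: int = 3) -> list[int]:
    """Parse the router LLM's JSON reply, drop doc IDs that don't exist
    in the workspace, and cap the result at max_docs documents."""
    try:
        selected = json.loads(llm_response).get("doc_ids", [])
    except json.JSONDecodeError:
        return []  # unparseable reply: route nothing rather than guess
    picked = [d for d in selected if d in known_doc_ids]
    return picked[:max_docs]
```

As in Stage 1's fallback path, validating against known IDs keeps a hallucinated document out of the per-document pipelines.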
Answer Merging¶
If multiple documents produce answers, the merger LLM:
- Combines information from all sources
- Cites document names: (from "API Reference")
- Removes redundancy
- Produces a single coherent response
If only one document has useful content, it's returned directly without the merge step.
Pipeline Return Value¶
The `RAGPipeline.query()` method returns a rich dictionary:
```python
{
    "answer": "The system uses OAuth 2.0 with JWT tokens for...",
    "node_ids": ["1.1", "3.2.1"],  # Selected sections
    "reasoning": "Selected authentication and security sections...",
    "context": "### Authentication Service (Pages 2-3)\n\n...",
    "image_count": 2,
    "images": [{"data": "...", "media_type": "image/png", "caption": "..."}],
}
```
This metadata powers the RAG Explorer panel in the React UI, giving users full transparency into the retrieval process.
Performance Characteristics¶
| Operation | Typical Time | Depends On |
|---|---|---|
| Tree Search (Stage 1) | 1-3 seconds | LLM latency, tree size |
| Context Assembly (Stage 2) | < 100ms | Number of selected nodes |
| Answer Generation (Stage 3) | 2-5 seconds | Context size, LLM latency |
| Document Routing | 1-2 seconds | Number of documents |
| Total (single doc) | 3-8 seconds | LLM provider & model |
| Total (multi-doc) | 5-15 seconds | Number of routed documents |
Quick Index Mode
Use Quick Index during document upload to skip LLM-generated summaries. This makes indexing nearly instant but uses text snippets as summaries, which may slightly reduce search accuracy.