VectifyAI

    VectifyAI/PageIndex

    📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG

    ai-agents
    ai
    llm
    agent
    agentic-ai
    context-engineering
    rag
    reasoning
    retrieval
    Python
    MIT
    18.6K stars
    1.3K forks
    18.6K watching
    Updated 2/27/2026
    View on GitHub
    Backblaze Advertisement

    Loading star history...

    Health Score

    5.6

    Weekly Growth

    +0

    +0.0% this week

    Contributors

    1

    Total contributors

    Open Issues

    85

    Generated Insights

    About PageIndex

    PageIndex Banner

    Reasoning-based RAG  ✧  No Vector DB  ✧  No Chunking  ✧  Human-like Retrieval

    🏠 Homepage  •   🖥️ Dashboard  •   📚 API Docs  •   💬 Discord  •   ✉️ Contact 


    📄 Introduction to PageIndex

    Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic similarity rather than true relevance. But similarity ≠ relevance — what we truly need in retrieval is relevance, and that requires reasoning. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.

    Inspired by AlphaGo, we propose PageIndex, a reasoning-based RAG system that simulates how human experts navigate and extract knowledge from long documents through tree search, enabling LLMs to think and reason their way to the most relevant document sections. It performs retrieval in two steps:

    1. Generate a "Table-of-Contents" tree structure index of documents
    2. Perform reasoning-based retrieval through tree search

    💡 Features

    Compared to traditional vector-based RAG, PageIndex features:

    • No Vectors Needed: Uses document structure and LLM reasoning for retrieval.
    • No Chunking Needed: Documents are organized into natural sections, not artificial chunks.
    • Human-like Retrieval: Simulates how human experts navigate and extract knowledge from complex documents.
    • Transparent Retrieval Process: Retrieval based on reasoning — say goodbye to approximate vector search ("vibe retrieval").

    PageIndex powers a reasoning-based RAG system that achieved 98.7% accuracy on FinanceBench, showing state-of-the-art performance in professional document analysis (see our blog post for details).

    🚀 Deployment Options

    • 🛠️ Self-host — run locally with this open-source repo
    • ☁️ Cloud Service — try instantly with our 🖥️ Dashboard or 🔌 API, no setup required

    ⚡ Quick Hands-on

    Check out this simple Vectorless RAG Notebook — a minimal, hands-on, reasoning-based RAG pipeline using PageIndex.

    Open in Colab


    📦 PageIndex Tree Structure

    PageIndex can transform lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

    Here is an example output. See more example documents and generated trees.

    ...
    {
      "title": "Financial Stability",
      "node_id": "0006",
      "start_index": 21,
      "end_index": 22,
      "summary": "The Federal Reserve ...",
      "nodes": [
        {
          "title": "Monitoring Financial Vulnerabilities",
          "node_id": "0007",
          "start_index": 22,
          "end_index": 28,
          "summary": "The Federal Reserve's monitoring ..."
        },
        {
          "title": "Domestic and International Cooperation and Coordination",
          "node_id": "0008",
          "start_index": 28,
          "end_index": 31,
          "summary": "In 2023, the Federal Reserve collaborated ..."
        }
      ]
    }
    ...
    

    You can either generate the PageIndex tree structure with this open-source repo or try our ☁️ Cloud Service — instantly accessible via our 🖥️ Dashboard or 🔌 API, with no setup required.


    🚀 Package Usage

    You can follow these steps to generate a PageIndex tree from a PDF document.

    1. Install dependencies

    pip3 install --upgrade -r requirements.txt
    

    2. Set your OpenAI API key

    Create a .env file in the root directory and add your API key:

    CHATGPT_API_KEY=your_openai_key_here
    

    3. Run PageIndex on your PDF

    python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
    
    Optional parameters

    You can customize the processing with additional optional arguments:

    --model                 OpenAI model to use (default: gpt-4o-2024-11-20)
    --toc-check-pages       Pages to check for table of contents (default: 20)
    --max-pages-per-node    Max pages per node (default: 10)
    --max-tokens-per-node   Max tokens per node (default: 20000)
    --if-add-node-id        Add node ID (yes/no, default: yes)
    --if-add-node-summary   Add node summary (yes/no, default: no)
    --if-add-doc-description Add doc description (yes/no, default: yes)
    

    ☁️ Improved Tree Generation with PageIndex OCR

    This repo is designed for generating PageIndex tree structure for simple PDFs, but many real-world use cases involve complex PDFs that are hard to parsed by classic python tools. However, extracting high-quality text from PDF documents remains a non-trivial challenge. Most OCR tools only extract page-level content, losing the broader document context and hierarchy.

    To address this, we introduced PageIndex OCR — the first long-context OCR model designed to preserve the global structure of documents. PageIndex OCR significantly outperforms other leading OCR tools, such as those from Mistral and Contextual AI, in recognizing true hierarchy and semantic relationships across document pages.

    • Experience next-level OCR quality with PageIndex OCR at our Dashboard.
    • Integrate seamlessly PageIndex OCR into your stack via our API.


    📈 Case Study: Mafin 2.5 on FinanceBench

    Mafin 2.5 is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Powered by PageIndex, it achieved a market-leading 98.7% accuracy on the FinanceBench benchmark — significantly outperforming traditional vector-based RAG systems.

    PageIndex's hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.

    👉 See the full benchmark results and our blog post for detailed comparisons and performance metrics.


    🔎 Learn More about PageIndex

    See the Tutorials for step-by-step guides, including Document Search and Tree Search.

    Check out the Cookbook for practical recipes and advanced use cases.

    Refer to the API Documentation for integration details and options.

    ⭐ Support Us

    Leave a star if you like our project — thank you!

    © 2025 Vectify AI

    Discover Repositories

    Search across tracked repositories by name or description