by VectifyAI
📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG
Reasoning-based RAG ◦ No Vector DB, No Chunking ◦ Context-Aware Retrieval ◦ Human-like
Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic similarity rather than true relevance. But similarity ≠ relevance — what we truly need in retrieval is relevance, and that requires reasoning. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short — missing what's relevant but not similar, and returning what's similar yet not relevant.
Inspired by AlphaGo, we propose PageIndex, a vectorless, reasoning-based RAG system that builds a hierarchical tree index from long documents and uses LLMs to reason over that index for agentic, context-aware retrieval. The retrieval is traceable and explainable, with no vector DBs or chunking. PageIndex simulates how human experts navigate and extract knowledge from complex documents through tree search, enabling LLMs to think and reason their way to the most relevant document sections. It performs retrieval in two steps: first it builds a tree-structured index of the document, then it runs reasoning-based retrieval over that index through tree search.
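Compared to traditional vector-based RAG, PageIndex features: no vector DB, no chunking, human-like retrieval through tree search, and traceable, explainable results.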
PageIndex powers a reasoning-based RAG system that achieved state-of-the-art 98.7% accuracy on FinanceBench, vastly outperforming vector-based RAG solutions on professional document analysis (blog post).
To learn more, please see a detailed introduction to the PageIndex framework. Check out our GitHub for open-source code, and the cookbooks, tutorials, and blog for more usage guides and examples.
The PageIndex service is available as a ChatGPT-style chat platform, or can be integrated via MCP or API, with enterprise deployment available.
PageIndex can transform lengthy PDF documents into a semantic tree structure, similar to a “table of contents” but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.
Below is an example PageIndex tree structure. Also see more example documents and generated tree structures.
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
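Since each node nests its children under "nodes", the tree can be walked with a few lines of recursion. The following is a minimal sketch, not part of the PageIndex codebase: the helper names `walk` and `outline` are made up for illustration, and `start_index` / `end_index` are treated as opaque integers copied from the example above.

```python
from typing import Iterator

# A copy of the example tree above; summaries stay truncated as in the source.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ...",
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ...",
        },
    ],
}


def walk(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(root: dict) -> list[str]:
    """Render the tree as an indented table of contents, one line per node."""
    return [
        f"{'  ' * depth}[{n['node_id']}] {n['title']} "
        f"({n['start_index']}-{n['end_index']})"
        for depth, n in walk(root)
    ]


for line in outline(tree):
    print(line)
```

Leaf nodes simply omit the "nodes" key, which is why the traversal uses `node.get("nodes", [])` rather than indexing directly.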
You can generate the PageIndex tree structure with this open-source repo, or use our API for higher-quality results powered by our enhanced OCR and tree-building pipeline.
Note: This package uses standard PDF parsing. For use cases with complex PDFs, our cloud service (via MCP and API) offers enhanced OCR, tree building, and retrieval.
You can follow these steps to generate a PageIndex tree from a PDF document.
Install the dependencies:
pip3 install --upgrade -r requirements.txt
Create a .env file in the root directory with your LLM API key. Multi-LLM is supported via LiteLLM:
OPENAI_API_KEY=your_openai_key_here
Run PageIndex on your PDF:
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
Optional arguments:
  --model                    LLM model to use (default: gpt-4o-2024-11-20)
  --toc-check-pages          Pages to check for table of contents (default: 20)
  --max-pages-per-node       Max pages per node (default: 10)
  --max-tokens-per-node      Max tokens per node (default: 20000)
  --if-add-node-id           Add node ID (yes/no, default: yes)
  --if-add-node-summary      Add node summary (yes/no, default: yes)
  --if-add-doc-description   Add doc description (yes/no, default: yes)
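When indexing many documents from a script, it can help to assemble the command line programmatically. The sketch below only uses the flags and defaults listed above. The helper `build_command` is hypothetical (it is not shipped with the repo), and the script is assumed to be invoked from the repository root.

```python
import shlex

# Defaults as listed in the options above; a flag is emitted only when overridden.
DEFAULTS = {
    "model": "gpt-4o-2024-11-20",
    "toc-check-pages": 20,
    "max-pages-per-node": 10,
    "max-tokens-per-node": 20000,
    "if-add-node-id": "yes",
    "if-add-node-summary": "yes",
    "if-add-doc-description": "yes",
}


def build_command(pdf_path: str, **overrides) -> list[str]:
    """Build an argv list for run_pageindex.py (hypothetical helper).

    Keyword names use underscores (max_pages_per_node) and are mapped to the
    hyphenated flags documented above. Unknown options raise ValueError.
    """
    cmd = ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
    for key, value in overrides.items():
        flag = key.replace("_", "-")
        if flag not in DEFAULTS:
            raise ValueError(f"unknown option: {key}")
        if value != DEFAULTS[flag]:
            cmd += [f"--{flag}", str(value)]
    return cmd


cmd = build_command(
    "/path/to/your/document.pdf",
    max_pages_per_node=5,
    if_add_node_summary="no",
)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.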
PageIndex also accepts Markdown input via --md_path:
python3 run_pageindex.py --md_path /path/to/your/document.md
Note: in this mode, "#" characters determine node headings and their levels: "#" is level 1, "##" is level 2, "###" is level 3, and so on. Make sure your Markdown file uses heading levels consistently. If your Markdown file was converted from a PDF or HTML, we don't recommend using this mode, since most existing conversion tools cannot preserve the original hierarchy. Instead, convert the PDF to Markdown with our PageIndex OCR, which is designed to preserve the hierarchy, and then use this mode.
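To see why consistent heading levels matter, here is a rough sketch of how "#" counts can be turned into a nested structure. It illustrates the general technique only and is not the parser used by run_pageindex.py: the function name `headings_to_tree` and the node fields are assumptions, and the heading pattern is deliberately simpler than full Markdown.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def headings_to_tree(markdown: str) -> list[dict]:
    """Nest headings by their '#' count; returns the list of top-level nodes."""
    roots: list[dict] = []
    stack: list[dict] = []  # currently open ancestors, shallowest first
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip '#' comment lines inside code fences
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        node = {"title": match.group(2), "level": len(match.group(1)), "nodes": []}
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots


sample = """# Annual Report
## Financial Stability
### Monitoring Financial Vulnerabilities
## Supervision
"""
md_tree = headings_to_tree(sample)
print([child["title"] for child in md_tree[0]["nodes"]])
```

If a converter flattens every heading to the same level, a scheme like this yields a flat list of siblings, which is the loss of hierarchy the note warns about.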
For a simple, end-to-end agentic vectorless RAG example using self-hosted PageIndex (with OpenAI Agents SDK), see examples/agentic_vectorless_rag_demo.py.
# Install optional dependency
pip3 install openai-agents
# Run the demo
python3 examples/agentic_vectorless_rag_demo.py
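The idea of retrieval by tree search can be sketched without any LLM dependency. The code below is an illustration under stated assumptions, not the demo's or PageIndex's implementation: `tree_search`, `keyword_judge`, and the result fields are hypothetical names, and the keyword matcher is only a stand-in for the step where an LLM would reason about which branches are relevant.

```python
from typing import Callable

Judge = Callable[[str, dict], bool]

# Trimmed copy of the example tree from this README.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28,
         "summary": "The Federal Reserve's monitoring ..."},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31,
         "summary": "In 2023, the Federal Reserve collaborated ..."},
    ],
}


def subtree_text(node: dict) -> str:
    """Lower-cased titles and summaries of a node and all of its descendants."""
    parts = [node["title"], node.get("summary", "")]
    parts += [subtree_text(child) for child in node.get("nodes", [])]
    return " ".join(parts).lower()


def keyword_judge(query: str, node: dict) -> bool:
    """Stand-in for an LLM decision: accept a branch that mentions a query word."""
    text = subtree_text(node)
    return any(word in text for word in query.lower().split())


def tree_search(query: str, node: dict, is_relevant: Judge, path: tuple = ()) -> list[dict]:
    """Descend only into accepted branches; return the deepest accepted nodes."""
    if not is_relevant(query, node):
        return []  # prune the whole branch
    path = path + (node["title"],)
    hits = []
    for child in node.get("nodes", []):
        hits.extend(tree_search(query, child, is_relevant, path))
    if hits:
        return hits
    # No child was accepted (or the node is a leaf): return this node itself.
    return [{"node_id": node["node_id"], "path": list(path),
             "span": (node["start_index"], node["end_index"])}]


results = tree_search("monitoring vulnerabilities", tree, keyword_judge)
print(results)
```

Each result carries the path of titles that led to it, which is what makes this style of retrieval traceable: the answer can cite the branch it came from rather than an opaque similarity score.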
Mafin 2.5 is a reasoning-based RAG system for financial document analysis, powered by PageIndex. It achieved a state-of-the-art 98.7% accuracy on the FinanceBench benchmark, significantly outperforming traditional vector-based RAG systems.
PageIndex's hierarchical indexing and reasoning-driven retrieval enable precise navigation and extraction of relevant context from complex financial reports, such as SEC filings and earnings disclosures.
Explore the full benchmark results and our blog post for detailed comparisons and performance metrics.
Leave us a star 🌟 if you like our project. Thank you!
Please cite this work as:
Mingtian Zhang, Yu Tang and PageIndex Team,
"PageIndex: Next-Generation Vectorless, Reasoning-based RAG",
PageIndex Blog, Sep 2025.
@article{zhang2025pageindex,
  author  = {Mingtian Zhang and Yu Tang and PageIndex Team},
  title   = {PageIndex: Next-Generation Vectorless, Reasoning-based RAG},
  journal = {PageIndex Blog},
  year    = {2025},
  month   = {September},
  note    = {https://pageindex.ai/blog/pageindex-intro},
}
PageIndex anchors a growing open-source ecosystem of long-context AI infrastructure. OpenKB is an LLM knowledge base that compiles documents into an interlinked wiki. ChatIndex provides tree indexing and retrieval for long conversational histories and memory. ConDB is a KV-cache native context database for tree-based retrieval at scale. PageIndex MCP is PageIndex's MCP server.
© 2026 Vectify AI