OPEN-XTRACT

"EXTRACT"

FROM RAW PDFS TO CLEAN, QUERYABLE TEXT

Open-source framework that extracts structured data from PDFs. Bring your own models and extend to any file type.

"CAPABILITIES"

MODEL AGNOSTIC

Bring your own OCR or LLM. open-xtract abstracts the extraction layer so you can swap engines at will.

  • WORKS WITH ANY MODEL
  • SIMPLE ADAPTER API
  • OPEN SOURCE MIT
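
As a rough illustration of what a swappable extraction layer can look like, here is a minimal sketch of the adapter idea. It is not open-xtract's actual API: the names ExtractionEngine, UppercaseEngine, WhitespaceEngine, and run_extraction are hypothetical, and the two engines are trivial stand-ins for a real OCR or LLM backend.

```python
from typing import Protocol


class ExtractionEngine(Protocol):
    """Hypothetical adapter interface: anything that turns document bytes into text."""

    def extract(self, document: bytes) -> str: ...


class UppercaseEngine:
    """Stand-in for one backend; a real adapter would call an OCR model here."""

    def extract(self, document: bytes) -> str:
        return document.decode("utf-8").upper()


class WhitespaceEngine:
    """Stand-in for a second backend that normalizes whitespace instead."""

    def extract(self, document: bytes) -> str:
        return " ".join(document.decode("utf-8").split())


def run_extraction(engine: ExtractionEngine, document: bytes) -> str:
    """Calling code depends only on the interface, so engines can be swapped freely."""
    return engine.extract(document)


if __name__ == "__main__":
    raw = b"quarterly   report\n2024"
    for engine in (UppercaseEngine(), WhitespaceEngine()):
        print(type(engine).__name__, "->", repr(run_extraction(engine, raw)))
```

Because the caller only sees the extract method, switching from one model to another is a one-line change at the call site.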

PDF-FIRST INGESTION

Drop in a PDF and receive clean, layout-aware text that's ready for embeddings.

  • LAYOUT-AWARE PARSING
  • TOKENIZATION INCLUDED
  • IMAGES & VIDEO SOON
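
To make "layout-aware" concrete, the sketch below shows one simple way positioned words from a page could be put back into reading order: group words whose vertical positions are close into lines, then sort each line left to right. This is an assumption for illustration, not a description of how open-xtract parses PDFs; the Word type, the to_reading_order function, and the line_tolerance parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical positioned word: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float  # distance from the top of the page; smaller means higher up


def to_reading_order(words: list[Word], line_tolerance: float = 2.0) -> str:
    """Group words into lines by vertical position, then read each line left to right."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # A word joins the current line if it sits close to that line's first word.
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines
    )


if __name__ == "__main__":
    page = [
        Word("world", 50, 10.5),
        Word("Hello", 10, 10.0),
        Word("line", 60, 30.4),
        Word("Second", 10, 30.0),
    ]
    print(to_reading_order(page))
```

Real pages with multiple columns, tables, or rotated text need more than a single tolerance value, so treat this as the simplest case only.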

CITED RETRIEVAL

Every chunk is embedded into a vector DB, reranked, and served via RAG with inline citations.

  • VECTOR SEARCH
  • RERANKED ANSWERS
  • INLINE CITATIONS
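
The embed, rerank, and cite flow described above can be sketched end to end with toy components. Everything here is an assumption for illustration: the bag-of-words embed function stands in for a real embedding model, an in-memory list stands in for the vector DB, and term overlap stands in for a real reranker. Chunk, retrieve, and with_citations are hypothetical names, not open-xtract's API.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str  # file the chunk came from
    page: int
    text: str


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Vector search for the top k chunks, then rerank them by shared query terms."""
    q = embed(query)
    candidates = sorted(chunks, key=lambda c: cosine(q, embed(c.text)), reverse=True)[:k]
    return sorted(candidates, key=lambda c: len(set(q) & set(embed(c.text))), reverse=True)


def with_citations(chunks: list[Chunk]) -> str:
    """Attach a numbered marker to each passage and list the sources underneath."""
    body = " ".join(f"{c.text} [{i}]" for i, c in enumerate(chunks, start=1))
    refs = "\n".join(f"[{i}] {c.source}, p. {c.page}" for i, c in enumerate(chunks, start=1))
    return f"{body}\n{refs}"


if __name__ == "__main__":
    store = [
        Chunk("a.pdf", 1, "invoices are due in thirty days"),
        Chunk("a.pdf", 2, "the office is closed on sunday"),
        Chunk("b.pdf", 4, "late invoices incur a fee"),
    ]
    print(with_citations(retrieve("when are invoices due", store)))
```

In a full RAG setup the retrieved passages would be handed to a language model along with their markers, so the generated answer can carry the same inline citations.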

MIT: OPEN SOURCE
PDF: FIRST FORMAT
FAST: EXTRACTION
OPEN: EXTENSIBLE

"GET STARTED"

JOIN INDUSTRY LEADERS WHO TRUST OPEN-XTRACT