GStars
    opendataloader-project

    opendataloader-project/opendataloader-pdf

    PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

    ai
    frontend
    documentation
    computer-vision
    llm
    a11y
    accessibility
    bounding-box
    document-parsing
    eaa
    html
    json
    markdown
    ocr
    ocr-recognition
    pdf
    pdf-accessibility
    pdf-converter
    pdf-extraction
    pdf-parser
    pdf-ua
    rag
    tables
    tagged-pdf
    Java
    Apache-2.0
    17.7K stars
    1.6K forks
    17.7K watching
    Updated 4/17/2026

    Health Score

    75

    Weekly Growth

    +0

    +0.0% this week

    Contributors

    1

    Total contributors

    Open Issues

    52

    Generated Insights

    About opendataloader-pdf

    OpenDataLoader PDF

    PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.


    ๐Ÿ” PDF parser for AI data extraction โ€” Extract Markdown, JSON (with bounding boxes), and HTML from any PDF. #1 in benchmarks (0.90 overall). Deterministic local mode + AI hybrid mode for complex pages.

    • How accurate is it? #1 in benchmarks: 0.90 overall, 0.93 table accuracy across 200 real-world PDFs including multi-column and scientific papers. Deterministic local mode + AI hybrid mode for complex pages (benchmarks)
    • Scanned PDFs and OCR? Yes. Built-in OCR (80+ languages) in hybrid mode. Works with poor-quality scans at 300 DPI+ (hybrid mode)
    • Tables, formulas, images, charts? Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode (hybrid mode)
    • How do I use this for RAG? pip install opendataloader-pdf, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs (quick start | LangChain)

    ♿ PDF accessibility automation: The same layout analysis engine also powers auto-tagging. First open-source tool to generate Tagged PDFs end-to-end (coming Q2 2026).

    • What's the problem? Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale (regulations)
    • What's free? Layout analysis + auto-tagging (Q2 2026, Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency (auto-tagging preview)
    • What about PDF/UA compliance? Converting Tagged PDF to PDF/UA-1 or PDF/UA-2 is an enterprise add-on. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step (pipeline)
    • Why trust this? Built in collaboration with PDF Association and Dual Lab (veraPDF developers). Auto-tagging follows the Well-Tagged PDF specification, validated with veraPDF (collaboration)

    Get Started in 30 Seconds

    Requires: Java 11+ and Python 3.10+ (Node.js | Java also available)

    Before you start: run java -version. If not found, install JDK 11+ from Adoptium.

    pip install -U opendataloader-pdf
    
    import opendataloader_pdf
    
    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        format="markdown,json"
    )
    

    Annotated PDF output: each element (heading, paragraph, table, image) is detected with a bounding box and semantic type.

    What Problems Does This Solve?

    Problem | Solution | Status
    PDF structure lost during parsing: wrong reading order, broken tables, no element coordinates | Deterministic local PDF to Markdown/JSON with bounding boxes, XY-Cut++ reading order | Shipped
    Complex tables, scanned PDFs, formulas, charts need AI-level understanding | Hybrid mode routes complex pages to AI backend (#1 in benchmarks) | Shipped
    PDF accessibility compliance: EAA, ADA, Section 508 enforced. Manual remediation $50–200/doc | Auto-tagging: layout analysis → Tagged PDF (free, Q2 2026). Built with PDF Association & veraPDF validation. PDF/UA export (enterprise add-on) | Auto-tag: Q2 2026

    Capability Matrix

    Capability | Supported | Tier
    Data extraction
    Extract text with correct reading order | Yes | Free
    Bounding boxes for every element | Yes | Free
    Table extraction (simple borders) | Yes | Free
    Table extraction (complex/borderless) | Yes | Free (Hybrid)
    Heading hierarchy detection | Yes | Free
    List detection (numbered, bulleted, nested) | Yes | Free
    Image extraction with coordinates | Yes | Free
    AI chart/image description | Yes | Free (Hybrid)
    OCR for scanned PDFs | Yes | Free (Hybrid)
    Formula extraction (LaTeX) | Yes | Free (Hybrid)
    Tagged PDF structure extraction | Yes | Free
    AI safety (prompt injection filtering) | Yes | Free
    Header/footer/watermark filtering | Yes | Free
    Accessibility
    Auto-tagging → Tagged PDF for untagged PDFs | Coming Q2 2026 | Free (Apache 2.0)
    PDF/UA-1, PDF/UA-2 export | 💼 Available | Enterprise
    Accessibility studio (visual editor) | 💼 Available | Enterprise
    Limitations
    Process Word/Excel/PPT | No | -
    GPU required | No | -

    Extraction Benchmarks

    opendataloader-pdf [hybrid] ranks #1 overall (0.90) across reading order, table, and heading extraction accuracy.

    Engine | Overall | Reading Order | Table | Heading | Speed (s/page)
    opendataloader [hybrid] | 0.90 | 0.94 | 0.93 | 0.83 | 0.43
    opendataloader | 0.72 | 0.91 | 0.49 | 0.76 | 0.05
    docling | 0.86 | 0.90 | 0.89 | 0.80 | 0.73
    marker | 0.83 | 0.89 | 0.81 | 0.80 | 53.93
    mineru | 0.82 | 0.86 | 0.87 | 0.74 | 5.96
    pymupdf4llm | 0.57 | 0.89 | 0.40 | 0.41 | 0.09
    markitdown | 0.29 | 0.88 | 0.00 | 0.00 | 0.04

    Scores normalized to [0, 1]. Higher is better for accuracy; lower is better for speed. Full benchmark details
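
The table above gives speed in seconds per page, while the FAQ quotes pages per second and relative gains. As a quick check, the sketch below derives those figures from the table values. The helper names are illustrative only and are not part of the library.

```python
# Values copied from the benchmark table above.
SECONDS_PER_PAGE = {
    "opendataloader [hybrid]": 0.43,
    "opendataloader": 0.05,
    "docling": 0.73,
    "marker": 53.93,
    "mineru": 5.96,
    "pymupdf4llm": 0.09,
    "markitdown": 0.04,
}
TABLE_SCORE = {"opendataloader [hybrid]": 0.93, "opendataloader": 0.49}


def pages_per_second(seconds_per_page: float) -> float:
    """Throughput is the reciprocal of the per-page latency."""
    return 1.0 / seconds_per_page


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


def slowdown(engine: str, baseline: str) -> float:
    """How many times slower `engine` is than `baseline`."""
    return SECONDS_PER_PAGE[engine] / SECONDS_PER_PAGE[baseline]


if __name__ == "__main__":
    local = pages_per_second(SECONDS_PER_PAGE["opendataloader"])
    hybrid = pages_per_second(SECONDS_PER_PAGE["opendataloader [hybrid]"])
    gain = relative_gain(TABLE_SCORE["opendataloader"], TABLE_SCORE["opendataloader [hybrid]"])
    print(f"local: {local:.1f} pages/s, hybrid: {hybrid:.1f} pages/s")
    print(f"table accuracy gain from hybrid mode: {gain:.0%}")
    print(f"marker vs hybrid: {slowdown('marker', 'opendataloader [hybrid]'):.0f}x slower")
```

This is where the "20+ pages per second" and "+90% table accuracy" figures quoted elsewhere in this README come from: 1 / 0.05 and (0.93 - 0.49) / 0.49.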


    Which Mode Should I Use?

    Your Document | Mode | Install | Server Command | Client Command
    Standard digital PDF | Fast (default) | pip install opendataloader-pdf | None needed | opendataloader-pdf file1.pdf file2.pdf folder/
    Complex or nested tables | Hybrid | pip install "opendataloader-pdf[hybrid]" | opendataloader-pdf-hybrid --port 5002 | opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
    Scanned / image-based PDF | Hybrid + OCR | pip install "opendataloader-pdf[hybrid]" | opendataloader-pdf-hybrid --port 5002 --force-ocr | opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
    Non-English scanned PDF | Hybrid + OCR | pip install "opendataloader-pdf[hybrid]" | opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en" | opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
    Mathematical formulas | Hybrid + formula | pip install "opendataloader-pdf[hybrid]" | opendataloader-pdf-hybrid --enrich-formula | opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
    Charts needing description | Hybrid + picture | pip install "opendataloader-pdf[hybrid]" | opendataloader-pdf-hybrid --enrich-picture-description | opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
    Untagged PDFs needing accessibility | Auto-tagging → Tagged PDF | Coming Q2 2026 | - | -
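
If you script these choices, the table above can be encoded as a small lookup. The sketch below only assembles the command strings shown in the table; the `kind` labels ("digital", "scanned", and so on) and the `commands_for` helper are hypothetical names chosen for illustration, not part of the CLI.

```python
import shlex
from typing import Dict, List, Optional, Tuple

# kind -> (server command or None, client command prefix), copied from the table above.
MODES: Dict[str, Tuple[Optional[str], str]] = {
    "digital": (None, "opendataloader-pdf"),
    "complex_tables": (
        "opendataloader-pdf-hybrid --port 5002",
        "opendataloader-pdf --hybrid docling-fast",
    ),
    "scanned": (
        "opendataloader-pdf-hybrid --port 5002 --force-ocr",
        "opendataloader-pdf --hybrid docling-fast",
    ),
    "formulas": (
        "opendataloader-pdf-hybrid --enrich-formula",
        "opendataloader-pdf --hybrid docling-fast --hybrid-mode full",
    ),
    "charts": (
        "opendataloader-pdf-hybrid --enrich-picture-description",
        "opendataloader-pdf --hybrid docling-fast --hybrid-mode full",
    ),
}


def commands_for(kind: str, files: List[str],
                 ocr_lang: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Return (server_command, client_command) for one row of the table."""
    server, client = MODES[kind]
    if kind == "scanned" and ocr_lang:
        # Non-English scans add --ocr-lang on the server side.
        server = f"{server} --ocr-lang {shlex.quote(ocr_lang)}"
    # All inputs go into one client invocation, as the table recommends.
    return server, f"{client} {shlex.join(files)}"


if __name__ == "__main__":
    server, client = commands_for("scanned", ["file1.pdf", "folder/"], ocr_lang="ko,en")
    print("Terminal 1:", server)
    print("Terminal 2:", client)
```

A `None` server command means no backend is needed (fast local mode).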

    Quick Start

    Python

    pip install -U opendataloader-pdf
    
    import opendataloader_pdf
    
    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        format="markdown,json"
    )
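
Because each convert() call spawns a JVM process, it usually pays to gather every input first and convert once. The following is a minimal sketch of that pattern. `collect_inputs` and `convert_once` are hypothetical helpers, and the converter is passed in as a parameter so the snippet does not assume anything about the library beyond the keyword arguments shown above.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def collect_inputs(paths: Iterable[str]) -> List[str]:
    """Return the inputs with duplicates removed, keeping the original order.

    Folders are passed through as-is, since convert() accepts them directly.
    """
    seen = set()
    batch: List[str] = []
    for p in paths:
        key = str(Path(p))  # normalizes e.g. "folder/" to "folder"
        if key not in seen:
            seen.add(key)
            batch.append(key)
    return batch


def convert_once(paths: Iterable[str], output_dir: str,
                 convert: Callable[..., object]) -> int:
    """Call `convert` a single time for the whole batch; return the batch size."""
    batch = collect_inputs(paths)
    if batch:
        convert(input_path=batch, output_dir=output_dir, format="markdown,json")
    return len(batch)
```

In real use you would pass `opendataloader_pdf.convert` as the `convert` argument, for example `convert_once(my_paths, "output/", opendataloader_pdf.convert)`.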
    

    Node.js

    npm install @opendataloader/pdf
    
    import { convert } from '@opendataloader/pdf';
    
    await convert(['file1.pdf', 'file2.pdf', 'folder/'], {
      outputDir: 'output/',
      format: 'markdown,json'
    });
    

    Java

    <dependency>
      <groupId>org.opendataloader</groupId>
      <artifactId>opendataloader-pdf-core</artifactId>
    </dependency>
    

    Python Quick Start | Node.js Quick Start | Java Quick Start

    Hybrid Mode: #1 Accuracy for Complex PDFs

    Hybrid mode combines fast local Java processing with AI backends. Simple pages stay local (0.05s/page); complex pages are routed to the AI backend, which raises table accuracy by about 90% relative to local mode (0.49 to 0.93).

    pip install -U "opendataloader-pdf[hybrid]"
    

    Terminal 1 (start the backend server):

    opendataloader-pdf-hybrid --port 5002
    

    Terminal 2 (process PDFs):

    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
    

    Python:

    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        hybrid="docling-fast"
    )
    

    OCR for Scanned PDFs

    Start the backend with --force-ocr for image-based PDFs with no selectable text:

    opendataloader-pdf-hybrid --port 5002 --force-ocr
    

    For non-English documents, specify the language:

    opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"
    

    Supported languages: en, ko, ja, ch_sim, ch_tra, de, fr, ar, and more.

    Formula Extraction (LaTeX)

    Extract mathematical formulas as LaTeX from scientific PDFs:

    # Server: enable formula enrichment
    opendataloader-pdf-hybrid --enrich-formula
    
    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
    

    Output in JSON:

    {
      "type": "formula",
      "page number": 1,
      "bounding box": [226.2, 144.7, 377.1, 168.7],
      "content": "\\frac{f(x+h) - f(x)}{h}"
    }
    

    Note: Formula and picture description enrichments require --hybrid-mode full on the client side.
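
Once the JSON is on disk, pulling out the formulas is a matter of filtering on "type". The sketch below walks any nested dicts and lists rather than assuming a particular document layout. The wrapping "elements" key in the sample is an assumption made for illustration, so check the Full JSON Schema for the real nesting.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_elements(node: Any) -> Iterator[dict]:
    """Yield every dict that carries a "type" key, searching nested containers."""
    if isinstance(node, dict):
        if "type" in node:
            yield node
        for value in node.values():
            yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def formulas_by_page(document: Any) -> Dict[int, List[str]]:
    """Map each page number to the LaTeX strings found on that page."""
    result: Dict[int, List[str]] = {}
    for el in iter_elements(document):
        if el.get("type") == "formula":
            result.setdefault(el["page number"], []).append(el["content"])
    return result


# Sample shaped like the formula output above; the "elements" wrapper is hypothetical.
sample = json.loads(r'''
{"elements": [
  {"type": "paragraph", "page number": 1, "content": "The difference quotient is"},
  {"type": "formula", "page number": 1,
   "bounding box": [226.2, 144.7, 377.1, 168.7],
   "content": "\\frac{f(x+h) - f(x)}{h}"}
]}
''')

if __name__ == "__main__":
    for page, latex_items in formulas_by_page(sample).items():
        for latex in latex_items:
            print(f"page {page}: ${latex}$")
```

The same filter works for picture descriptions by matching "type" == "picture" and reading "description" instead of "content".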

    Chart & Image Description

    Generate AI descriptions for charts and images, useful for RAG search and accessibility alt text:

    # Server
    opendataloader-pdf-hybrid --enrich-picture-description
    
    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
    

    Output in JSON:

    {
      "type": "picture",
      "page number": 1,
      "bounding box": [72.0, 400.0, 540.0, 650.0],
      "description": "A bar chart showing waste generation by region from 2016 to 2030..."
    }
    

    Uses SmolVLM (256M), a lightweight vision model. Custom prompts supported via --picture-description-prompt.

    Hancom Data Loader Integration (Coming Soon)

    Enterprise-grade AI document analysis via Hancom Data Loader, with customer-customized models trained on your domain-specific documents. 30+ element types (tables, charts, formulas, captions, footnotes, etc.), VLM-based image/chart understanding, complex table extraction (merged cells, nested tables), SLA-backed OCR for scanned documents, and native HWP/HWPX support. Supports PDF, DOCX, XLSX, PPTX, HWP, PNG, JPG. Live demo

    Hybrid Mode Guide

    Output Formats

    Format | Use Case
    JSON | Structured data with bounding boxes, semantic types
    Markdown | Clean text for LLM context, RAG chunks
    HTML | Web display with styling
    Annotated PDF | Visual debugging: see detected structures (sample)
    Text | Plain text extraction

    Combine formats: format="json,markdown"

    JSON Output Example

    {
      "type": "heading",
      "id": 42,
      "level": "Title",
      "page number": 1,
      "bounding box": [72.0, 700.0, 540.0, 730.0],
      "heading level": 1,
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "text color": "[0.0]",
      "content": "Introduction"
    }
    
    Field | Description
    type | Element type: heading, paragraph, table, list, image, caption, formula
    id | Unique identifier for cross-referencing
    page number | 1-indexed page reference
    bounding box | [left, bottom, right, top] in PDF points (72pt = 1 inch)
    heading level | Heading depth (1+)
    content | Extracted text
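
Bounding boxes are given in PDF points, so using them in an image viewer needs a unit conversion and usually a flipped y-axis. The sketch below assumes the usual PDF convention of an origin at the bottom-left of the page, which matches the [left, bottom, right, top] ordering above. The page height is not part of the element and must be supplied by the caller; both function names are illustrative.

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72.0  # 72pt = 1 inch, as stated in the field table


def bbox_size_inches(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return (width, height) of a [left, bottom, right, top] box in inches."""
    left, bottom, right, top = bbox
    return ((right - left) / POINTS_PER_INCH, (top - bottom) / POINTS_PER_INCH)


def bbox_to_pixels(bbox: Sequence[float], page_height_pt: float,
                   dpi: float = 144.0) -> Tuple[float, float, float, float]:
    """Convert to (x0, y0, x1, y1) pixels with the origin at the top-left.

    Assumes the PDF origin is the bottom-left corner of the page, so the
    y-axis is flipped using the caller-supplied page height.
    """
    left, bottom, right, top = bbox
    scale = dpi / POINTS_PER_INCH
    return (left * scale,
            (page_height_pt - top) * scale,
            right * scale,
            (page_height_pt - bottom) * scale)


if __name__ == "__main__":
    heading_box = [72.0, 700.0, 540.0, 730.0]  # from the JSON example above
    print(bbox_size_inches(heading_box))
    # 792pt is the height of a US Letter page, used here as an assumed page size.
    print(bbox_to_pixels(heading_box, page_height_pt=792.0))
```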

    Full JSON Schema

    Advanced Features

    Tagged PDF Support

    When a PDF has structure tags, OpenDataLoader extracts the exact layout the author intended: no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.

    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        use_struct_tree=True           # Use native PDF structure tags
    )
    

    Most PDF parsers ignore structure tags entirely. Learn more

    AI Safety: Prompt Injection Protection

    PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:

    • Hidden text (transparent, zero-size fonts)
    • Off-page content
    • Suspicious invisible layers
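
To make the "off-page content" idea concrete, here is a simplified geometric check. It is a sketch only and is not the library's actual filter: it assumes the page's visible area starts at (0, 0) in the bottom-left corner, and `is_off_page` is a hypothetical name.

```python
from typing import Sequence


def is_off_page(bbox: Sequence[float], page_width_pt: float,
                page_height_pt: float) -> bool:
    """Return True when a [left, bottom, right, top] box lies fully outside the page.

    Assumes the visible page spans (0, 0) to (page_width_pt, page_height_pt).
    """
    left, bottom, right, top = bbox
    return (right <= 0 or top <= 0
            or left >= page_width_pt or bottom >= page_height_pt)


if __name__ == "__main__":
    # 612 x 792 points (US Letter) is an assumed page size for the example.
    print(is_off_page([72.0, 700.0, 540.0, 730.0], 612.0, 792.0))    # visible heading
    print(is_off_page([-500.0, 100.0, -100.0, 120.0], 612.0, 792.0)) # left of the page
```

Text placed entirely outside the visible area is invisible to a human reader but still present in the content stream, which is why it is a common carrier for injected instructions.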

    To sanitize sensitive data (emails, URLs, and phone numbers are replaced with placeholders), enable it explicitly:

    # Batch all files in one call: each invocation spawns a JVM process, so repeated calls are slow
    opendataloader-pdf file1.pdf file2.pdf folder/ --sanitize
    

    AI Safety Guide

    LangChain Integration

    pip install -U langchain-opendataloader-pdf
    
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
    
    loader = OpenDataLoaderPDFLoader(
        file_path=["file1.pdf", "file2.pdf", "folder/"],
        format="text"
    )
    documents = loader.load()
    

    LangChain Docs | GitHub | PyPI

    Advanced Options

    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        format="json,markdown,pdf",
        image_output="embedded",        # "off", "embedded" (Base64), or "external" (default)
        image_format="jpeg",            # "png" or "jpeg"
        use_struct_tree=True,           # Use native PDF structure
    )
    

    Full CLI Options Reference

    PDF Accessibility & PDF/UA Conversion

    Problem: Millions of existing PDFs lack structure tags, failing accessibility regulations (EAA, ADA/Section 508, Korea Digital Inclusion Act). Manual remediation costs $50–200 per document and doesn't scale.

    OpenDataLoader's approach: Built in collaboration with PDF Association and Dual Lab (developers of veraPDF, the industry-reference open-source PDF/A and PDF/UA validator). Auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF: automated conformance checks against PDF accessibility standards, not manual review. No existing open-source tool generates Tagged PDFs end-to-end; most rely on proprietary SDKs for the tag-writing step. OpenDataLoader does it all under Apache 2.0. (collaboration details)

    Regulation | Deadline | Requirement
    European Accessibility Act (EAA) | June 28, 2025 | Accessible digital products across the EU
    ADA & Section 508 | In effect | U.S. federal agencies and public accommodations
    Digital Inclusion Act | In effect | South Korea digital service accessibility

    Standards & Validation

    Aspect | Detail
    Specification | Well-Tagged PDF by PDF Association
    Validation | veraPDF, the industry-reference open-source PDF/A & PDF/UA validator
    Collaboration | PDF Association + Dual Lab (veraPDF developers) co-develop tagging and validation
    License | Auto-tagging → Tagged PDF: Apache 2.0 (free). PDF/UA export: Enterprise

    Accessibility Pipeline

    Step | Feature | Status | Tier
    1. Audit | Read existing PDF tags, detect untagged PDFs | Shipped | Free
    2. Auto-tag → Tagged PDF | Generate structure tags for untagged PDFs | Coming Q2 2026 | Free (Apache 2.0)
    3. Export PDF/UA | Convert to PDF/UA-1 or PDF/UA-2 compliant files | 💼 Available | Enterprise
    4. Visual editing | Accessibility studio: review and fix tags | 💼 Available | Enterprise

    💼 Enterprise features are available on request. Contact us to get started.

    Auto-Tagging Preview (Coming Q2 2026)

    # API shape preview, available Q2 2026
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        auto_tag=True                   # Generate structure tags for untagged PDFs
    )
    

    End-to-End Compliance Workflow

    Existing PDFs (untagged)
        │
        ▼
    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │  1. Audit       │───>│  2. Auto-Tag    │───>│  3. Export      │───>│  4. Studio      │
    │  (check tags)   │    │  (→ Tagged PDF) │    │  (PDF/UA)       │    │  (visual editor)│
    └─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
            │                      │                      │                      │
            ▼                      ▼                      ▼                      ▼
      use_struct_tree         auto_tag              PDF/UA export       Accessibility Studio
      (Available now)    (Q2 2026, Apache 2.0)    (Enterprise)          (Enterprise)
    

    PDF Accessibility Guide

    Roadmap

    Feature | Timeline | Tier
    Auto-tagging → Tagged PDF: Generate Tagged PDFs from untagged PDFs | Q2 2026 | Free
    Hancom Data Loader: Enterprise AI document analysis, customer-customized models, VLM-based chart/image understanding, production-grade OCR | Q2-Q3 2026 | Free
    Structure validation: Verify PDF tag trees | Q2 2026 | Planned

    Full Roadmap

    Frequently Asked Questions

    What is the best PDF parser for RAG?

    For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this: it outputs structured JSON with bounding boxes, handles multi-column layouts with XY-Cut++, and runs locally without GPU. In hybrid mode, it ranks #1 overall (0.90) in benchmarks.

    What is the best open-source PDF parser?

    OpenDataLoader PDF is the only open-source parser that combines: rule-based deterministic extraction (no GPU), bounding boxes for every element, XY-Cut++ reading order, built-in AI safety filters, native Tagged PDF support, and hybrid AI mode for complex documents. It ranks #1 in overall accuracy (0.90) while running locally on CPU.

    How do I extract tables from PDF for LLM?

    OpenDataLoader detects tables using border analysis and text clustering, preserving row/column structure. For complex tables, enable hybrid mode for +90% accuracy improvement (0.49 to 0.93 TEDS score):

    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        format="json",
        hybrid="docling-fast"           # For complex tables
    )
    

    How does it compare to docling, marker, or pymupdf4llm?

    OpenDataLoader [hybrid] ranks #1 overall (0.90) across reading order, table, and heading accuracy. Key differences: docling (0.86) is strong but lacks bounding boxes and AI safety filters. marker (0.83) requires GPU and is over 100x slower than hybrid mode (53.93s/page vs. 0.43s/page). pymupdf4llm (0.57) is fast but has poor table (0.40) and heading (0.41) accuracy. OpenDataLoader is the only parser that combines deterministic local extraction, bounding boxes for every element, and built-in prompt injection protection. See full benchmark.

    Can I use this without sending data to the cloud?

    Yes. OpenDataLoader runs 100% locally. No API calls, no data transmission: your documents never leave your environment. The hybrid mode backend also runs locally on your machine. Ideal for legal, healthcare, and financial documents.

    Does it support OCR for scanned PDFs?

    Yes, via hybrid mode. Install with pip install "opendataloader-pdf[hybrid]", start the backend with --force-ocr, then process as usual. Supports multiple languages including Korean, Japanese, Chinese, Arabic, and more via --ocr-lang.

    Does it work with Korean, Japanese, or Chinese documents?

    Yes. For digital PDFs, text extraction works out of the box. For scanned PDFs, use hybrid mode with --force-ocr --ocr-lang "ko,en" (or ja, ch_sim, ch_tra). Coming soon: Hancom Data Loader integration, which adds enterprise-grade AI document analysis with built-in production-grade OCR and customer-customized models optimized for your specific document types and workflows.

    How fast is it?

    Local mode processes 20+ pages per second on CPU (0.05s/page). Hybrid mode processes 2+ pages per second (0.43s/page) with significantly higher accuracy for complex documents. No GPU required. Benchmarked on Apple M4. Full benchmark details. With multi-process batch processing, throughput exceeds 100 pages per second on 8+ core machines.

    Does it handle multi-column layouts?

    Yes. OpenDataLoader uses XY-Cut++ reading order analysis to correctly sequence text across multi-column pages, sidebars, and mixed layouts. This works in both local and hybrid modes without any configuration.

    What is hybrid mode?

    Hybrid mode combines fast local Java processing with an AI backend. Simple pages are processed locally (0.05s/page); complex pages (tables, scanned content, formulas, charts) are automatically routed to the AI backend for higher accuracy. The backend runs locally on your machine, so no cloud is required. See Which Mode Should I Use? and Hybrid Mode Guide.

    Does it work with LangChain?

    Yes. Install langchain-opendataloader-pdf for an official LangChain document loader integration. See LangChain docs.

    How do I chunk PDFs for RAG?

    OpenDataLoader outputs structured Markdown with headings, tables, and lists preserved, which is ideal input for semantic chunking. Each element in JSON output includes type, heading level, and page number, so you can split by section or page boundary. For most RAG pipelines: parse with format="markdown" for text chunks, or format="json" when you need element-level control. Pair with LangChain's RecursiveCharacterTextSplitter or your own heading-based splitter for best results.
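
A heading-based splitter over the JSON elements can be quite small. The sketch below assumes you have already flattened the output into a list of element dicts in reading order (the real output may be nested, so a flattening step may be needed first). `chunk_by_heading` and its `max_level` parameter are hypothetical names for illustration.

```python
from typing import Dict, List


def chunk_by_heading(elements: List[dict], max_level: int = 2) -> List[Dict]:
    """Start a new chunk at every heading whose level is <= max_level.

    Deeper headings are kept as text inside the current chunk.
    Each chunk records its title, the pages it touches, and its text pieces.
    """
    chunks: List[Dict] = []
    current: Dict = {"title": None, "pages": [], "text": []}

    for el in elements:
        is_split = (el.get("type") == "heading"
                    and el.get("heading level", 1) <= max_level)
        if is_split:
            if current["text"] or current["title"]:
                chunks.append(current)
            current = {"title": el.get("content"), "pages": [], "text": []}
        elif el.get("content"):
            current["text"].append(el["content"])

        page = el.get("page number")
        if page is not None and page not in current["pages"]:
            current["pages"].append(page)

    if current["text"] or current["title"]:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = [
        {"type": "heading", "heading level": 1, "page number": 1, "content": "Introduction"},
        {"type": "paragraph", "page number": 1, "content": "First paragraph."},
        {"type": "heading", "heading level": 1, "page number": 2, "content": "Methods"},
        {"type": "paragraph", "page number": 2, "content": "Second paragraph."},
    ]
    for chunk in chunk_by_heading(demo):
        print(chunk["title"], chunk["pages"], " ".join(chunk["text"]))
```

Keeping the page list on each chunk makes it easy to attach page references later, even when a section spans a page break.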

    How do I cite PDF sources in RAG answers?

    Every element in JSON output includes a bounding box ([left, bottom, right, top] in PDF points) and page number. When your RAG pipeline returns an answer, map the source chunk back to its bounding box to highlight the exact location in the original PDF. This enables "click to source" UX: users see which paragraph, table, or figure the answer came from. No other open-source parser provides bounding boxes for every element by default.
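
One way to turn a retrieved chunk into highlight regions is to merge the boxes of its source elements, page by page. This is a minimal sketch under the assumption that you kept the original element dicts alongside each chunk; `citation_regions` is a hypothetical helper, not a library function.

```python
from typing import Dict, List


def citation_regions(chunk_elements: List[dict]) -> List[Dict]:
    """Return one merged [left, bottom, right, top] box per page for a chunk.

    Elements without a bounding box or page number are skipped.
    """
    per_page: Dict[int, List[float]] = {}
    for el in chunk_elements:
        box = el.get("bounding box")
        page = el.get("page number")
        if box is None or page is None:
            continue
        left, bottom, right, top = box
        if page in per_page:
            cur = per_page[page]
            # Grow the existing box so it covers both regions.
            per_page[page] = [min(cur[0], left), min(cur[1], bottom),
                              max(cur[2], right), max(cur[3], top)]
        else:
            per_page[page] = [left, bottom, right, top]
    return [{"page": page, "bbox": per_page[page]} for page in sorted(per_page)]


if __name__ == "__main__":
    source = [
        {"type": "heading", "page number": 1, "bounding box": [72.0, 700.0, 540.0, 730.0]},
        {"type": "paragraph", "page number": 1, "bounding box": [72.0, 600.0, 300.0, 690.0]},
        {"type": "paragraph", "page number": 2, "bounding box": [100.0, 100.0, 200.0, 200.0]},
    ]
    for region in citation_regions(source):
        print(region)
```

A viewer can then draw one rectangle per page. If per-element precision matters more than a tidy highlight, skip the merge and return each element's box individually.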

    How do I convert PDF to Markdown for LLM?

    import opendataloader_pdf
    
    # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
    opendataloader_pdf.convert(
        input_path=["file1.pdf", "file2.pdf", "folder/"],
        output_dir="output/",
        format="markdown"
    )
    

    OpenDataLoader preserves heading hierarchy, table structure, and reading order in the Markdown output. For complex documents with borderless tables or scanned pages, use hybrid mode (hybrid="docling-fast") for higher accuracy. The output is clean enough to feed directly into LLM context windows or RAG chunking pipelines.

    Is there an automated PDF accessibility remediation tool?

    Yes. OpenDataLoader is the first open-source tool that automates PDF accessibility end-to-end. Built in collaboration with PDF Association and Dual Lab (veraPDF developers), auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF. The layout analysis engine detects document structure (headings, tables, lists, reading order) and generates accessibility tags automatically. Auto-tagging (Q2 2026) converts untagged PDFs into Tagged PDFs under Apache 2.0, with no proprietary SDK dependency. For organizations needing full PDF/UA compliance, enterprise add-ons provide PDF/UA export and a visual tag editor. This replaces manual remediation workflows that typically cost $50–200+ per document.

    Is this really the first open-source PDF auto-tagging tool?

    Yes. Existing tools either depend on proprietary SDKs for writing structure tags, only output non-PDF formats (e.g., Docling outputs Markdown/JSON but cannot produce Tagged PDFs), or require manual intervention. OpenDataLoader is the first to do layout analysis → tag generation → Tagged PDF output entirely under an open-source license (Apache 2.0), with no proprietary dependency. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, the industry-reference open-source PDF/A and PDF/UA validator.

    How do I convert existing PDFs to PDF/UA?

    OpenDataLoader provides an end-to-end pipeline: audit existing PDFs for tags (use_struct_tree=True), auto-tag untagged PDFs into Tagged PDFs (Q2 2026, free under Apache 2.0), and export as PDF/UA-1 or PDF/UA-2 (enterprise add-on). Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step. Contact us for enterprise integration.

    How do I make my PDFs accessible for EAA compliance?

    The European Accessibility Act requires accessible digital products by June 28, 2025. OpenDataLoader supports the full remediation workflow: audit → auto-tag → Tagged PDF → PDF/UA export. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, ensuring standards-compliant output. Auto-tagging to Tagged PDF will be open-sourced under Apache 2.0 (Q2 2026). PDF/UA export and accessibility studio are enterprise add-ons. See our Accessibility Guide.

    Is OpenDataLoader PDF free?

    The core library is open-source under Apache 2.0 and free for commercial use. This includes all extraction features (text, tables, images, OCR, formulas, charts via hybrid mode), AI safety filters, Tagged PDF support, and auto-tagging to Tagged PDF (Q2 2026). We are committed to keeping the core accessibility pipeline (layout analysis → auto-tagging → Tagged PDF) free and open-source. Enterprise add-ons (PDF/UA export, accessibility studio) are available for organizations needing end-to-end regulatory compliance.

    Why did the license change from MPL 2.0 to Apache 2.0?

    MPL 2.0 requires file-level copyleft, which often triggers legal review before enterprise adoption. Apache 2.0 is fully permissive: no copyleft obligations, and easier to integrate into commercial projects. If you are using a pre-2.0 version, it remains under MPL 2.0 and you can continue using it. Upgrading to 2.0+ means your project follows Apache 2.0 terms, which are strictly more permissive: no additional obligations, no action needed on your side.

    Documentation

    Contributing

    We welcome contributions! See CONTRIBUTING.md for guidelines.

    License

    Apache License 2.0

    Note: Versions prior to 2.0 are licensed under the Mozilla Public License 2.0.


    Found this useful? Give us a star to help others discover OpenDataLoader.
