this repo has no description
0
fork

Configure Feed

Select the types of activity you want to include in your feed.

Python 100.0%
14 1 0

Clone this repository

https://tangled.org/alice.mosphere.at/claude-skill-pdf-to-markdown https://tangled.org/did:plc:by3jhwdqgbtrcc7q4tkkv3cf/claude-skill-pdf-to-markdown
git@tangled.org:alice.mosphere.at/claude-skill-pdf-to-markdown git@tangled.org:did:plc:by3jhwdqgbtrcc7q4tkkv3cf/claude-skill-pdf-to-markdown

For self-hosted knots, clone URLs may differ based on your setup.

Download tar.gz
README.md

PDF to Markdown Converter#

Convert PDF documents to clean, structured Markdown with table and image extraction.

Features#

  • Text extraction with formatting preservation (headers, bold, italic, lists)
  • Table extraction with two modes:
    • Fast mode: PyMuPDF (good for simple tables)
    • Accurate mode: IBM Docling AI (better for complex/borderless tables)
  • Image extraction to cache directory with paths in output
  • Aggressive caching - extract once, reuse forever

Installation#

cd ~/.claude/skills/pdf-to-markdown
uv venv .venv

# For fast mode (default):
uv pip install --python .venv/bin/python pymupdf pymupdf4llm

# For --docling mode (high-accuracy tables):
uv pip install --python .venv/bin/python pymupdf docling docling-core

Usage#

# Basic conversion (outputs to document.md)
.venv/bin/python scripts/pdf_to_md.py document.pdf

# High-accuracy tables (slower)
.venv/bin/python scripts/pdf_to_md.py document.pdf --docling

# Custom output path
.venv/bin/python scripts/pdf_to_md.py document.pdf output.md

Options#

Option Description
--docling Use Docling AI for high-accuracy tables
--no-progress Disable progress indicator
--clear-cache Clear cache for this PDF and re-extract
--clear-all-cache Clear entire cache
--cache-stats Show cache statistics

Project Structure#

scripts/
  pdf_to_md.py    # Main CLI tool
  extractor.py    # PDF extraction library (fast + accurate modes)

Cache#

PDFs are cached in ~/.cache/pdf-to-markdown/. Cache is invalidated when:

  • Source PDF is modified
  • Extractor version changes
  • Explicitly cleared with --clear-cache

License#

MIT