this repo has no description
0
fork

Configure Feed

Select the types of activity you want to include in your feed.

Fix code review issues, refactor, and add README

Code review fixes:
- Fix --no-images flag to properly suppress cached images
- Fix image path joining bug (handle 'images/' prefix from Docling)
- Extend dependency check to cover all required packages
- Deduplicate images by xref in extract_images()
- Add fallback metadata lookup for cache clearing deleted PDFs
- Fix tempfile handle issue for Windows compatibility
- Fix mkdir to use parents=True

Refactoring:
- Delete extract_pdf_accurate() (was dead code duplicating extract_pdf_docling)
- Delete test_docling.py (functionality covered by main CLI)
- Rename docling_extractor.py to extractor.py (more generic)
- Extract _save_docling_images() helper function

Documentation:
- Add README.md
- Update SKILL.md with correct dependencies
- Remove references to non-existent --chunked and --workers options
- Remove unverified accuracy percentage claims

alice 40869a92 518d27d6

+568 -359
+1
.gitignore
··· 29 29 *.pdf 30 30 *.md 31 31 !SKILL.md 32 + !README.md 32 33 /tmp/
+73
README.md
··· 1 + # PDF to Markdown Converter 2 + 3 + Convert PDF documents to clean, structured Markdown with table and image extraction. 4 + 5 + ## Features 6 + 7 + - **Text extraction** with formatting preservation (headers, bold, italic, lists) 8 + - **Table extraction** with two modes: 9 + - Fast mode: PyMuPDF (good for simple tables) 10 + - Accurate mode: IBM Docling AI (better for complex/borderless tables) 11 + - **Image extraction** to cache directory with paths in output 12 + - **Aggressive caching** - extract once, reuse forever 13 + - **Page slicing** - request specific pages from cached full extraction 14 + 15 + ## Installation 16 + 17 + ```bash 18 + cd ~/.claude/skills/pdf-to-markdown 19 + uv venv .venv 20 + uv pip install --python .venv/bin/python pymupdf pymupdf4llm docling docling-core 21 + ``` 22 + 23 + ## Usage 24 + 25 + ```bash 26 + # Basic conversion 27 + .venv/bin/python scripts/pdf_to_md.py document.pdf --stdout 28 + 29 + # High-accuracy tables (slower) 30 + .venv/bin/python scripts/pdf_to_md.py document.pdf --docling --stdout 31 + 32 + # Specific pages 33 + .venv/bin/python scripts/pdf_to_md.py document.pdf --pages 1-10 --stdout 34 + 35 + # Skip images (faster) 36 + .venv/bin/python scripts/pdf_to_md.py document.pdf --no-images --stdout 37 + 38 + # Save to file 39 + .venv/bin/python scripts/pdf_to_md.py document.pdf output.md 40 + ``` 41 + 42 + ## Options 43 + 44 + | Option | Description | 45 + |--------|-------------| 46 + | `--stdout` | Print to stdout instead of file | 47 + | `--pages RANGE` | Page range (e.g., "1-5" or "1,3,5-7") | 48 + | `--docling` | Use Docling AI for high-accuracy tables | 49 + | `--no-images` | Skip image extraction | 50 + | `--no-metadata` | Skip metadata header | 51 + | `--no-cache` | Bypass cache (still updates it) | 52 + | `--clear-cache` | Clear cache for this PDF | 53 + | `--clear-all-cache` | Clear entire cache | 54 + | `--cache-stats` | Show cache statistics | 55 + 56 + ## Project Structure 57 + 58 + ``` 59 + scripts/ 60 + pdf_to_md.py # Main CLI tool 61 + extractor.py # PDF extraction library (fast + accurate modes) 62 + ``` 63 + 64 + ## Cache 65 + 66 + PDFs are cached in `~/.cache/pdf-to-markdown/`. Cache is invalidated when: 67 + - Source PDF is modified 68 + - Extractor version changes 69 + - Explicitly cleared with `--clear-cache` 70 + 71 + ## License 72 + 73 + MIT
+5 -12
SKILL.md
··· 31 31 32 32 ### First-Time Setup (if .venv doesn't exist) 33 33 ```bash 34 - cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf4llm docling 34 + cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf pymupdf4llm docling docling-core 35 35 ``` 36 36 37 37 ### Verify Installation 38 38 ```bash 39 - ~/.claude/skills/pdf-to-markdown/.venv/bin/python -c "import pymupdf4llm; import docling; print('OK')" 39 + ~/.claude/skills/pdf-to-markdown/.venv/bin/python -c "import pymupdf; import pymupdf4llm; import docling; import docling_core; print('OK')" 40 40 ``` 41 41 42 42 ## Quick Start ··· 62 62 63 63 ### Step 1: Ensure the skill venv exists 64 64 ```bash 65 - test -d ~/.claude/skills/pdf-to-markdown/.venv || (cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf4llm docling) 65 + test -d ~/.claude/skills/pdf-to-markdown/.venv || (cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf pymupdf4llm docling docling-core) 66 66 ``` 67 67 68 68 ### Step 2: Convert PDF to Markdown ··· 219 219 --cache-stats Show cache statistics and exit 220 220 ``` 221 221 222 - **Performance:** For PDFs with 100+ pages, the script automatically uses parallel processing across all CPU cores. This provides 3-6x speedup on large documents. 222 + **Performance:** First extraction is cached, so subsequent requests for the same PDF are instant. 223 223 224 224 ## Advanced Usage 225 225 ··· 228 228 ~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --pages 1-10 --stdout 229 229 ``` 230 230 231 - ### Get Page-by-Page Chunks with Metadata 232 - ```bash 233 - ~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --chunked --stdout 234 - ``` 235 - 236 231 ### Handle Scanned PDFs (OCR) 237 232 For scanned PDFs without extractable text, pymupdf4llm will attempt OCR automatically if Tesseract is available: 238 233 ```bash ··· 248 243 ### "No module named pymupdf4llm" or venv doesn't exist 249 244 Recreate the skill's virtual environment: 250 245 ```bash 251 - cd ~/.claude/skills/pdf-to-markdown && rm -rf .venv && uv venv .venv && uv pip install --python .venv/bin/python pymupdf4llm docling 246 + cd ~/.claude/skills/pdf-to-markdown && rm -rf .venv && uv venv .venv && uv pip install --python .venv/bin/python pymupdf pymupdf4llm docling docling-core 252 247 ``` 253 248 254 249 ### Poor extraction quality ··· 259 254 - For scanned PDFs, ensure Tesseract OCR is installed: `brew install tesseract` 260 255 261 256 ### Very large PDFs 262 - - Parallel processing is automatic for 100+ pages (uses all CPU cores) 263 257 - Use `--pages` to extract only needed sections 264 - - Use `--workers N` to limit CPU usage if needed 265 258 - Use `--no-images` to skip image extraction (faster) 266 259 267 260 ### Tables not formatting correctly
-267
scripts/docling_extractor.py
··· 1 - """ 2 - PDF extraction with multiple backends: 3 - - Fast mode: PyMuPDF with multi-strategy table detection (~70-80% accuracy) 4 - - Accurate mode: IBM Docling with TableFormer AI (~93.6% accuracy, slower) 5 - """ 6 - 7 - import sys 8 - from pathlib import Path 9 - 10 - # Version for cache invalidation - increment when extraction logic changes 11 - # Format: major.minor.patch 12 - EXTRACTOR_VERSION = "3.0.0" 13 - 14 - 15 - def check_docling_models(): 16 - """Check if Docling models are downloaded.""" 17 - try: 18 - from huggingface_hub import scan_cache_dir 19 - cache_info = scan_cache_dir() 20 - # Check for docling models in HF cache 21 - docling_repos = [r for r in cache_info.repos if 'docling' in r.repo_id.lower()] 22 - return len(docling_repos) > 0 23 - except Exception: 24 - return False 25 - 26 - 27 - def extract_pdf_fast(pdf_path: str, show_progress: bool = False) -> str: 28 - """ 29 - Fast PDF extraction using PyMuPDF with multi-strategy table detection. 30 - 31 - Tries multiple table detection strategies for better coverage: 32 - - lines_strict: Best for bordered tables 33 - - text: For borderless tables (whitespace-based) 34 - 35 - Args: 36 - pdf_path: Path to the PDF file 37 - show_progress: Whether to show progress output 38 - 39 - Returns: 40 - Markdown string of the PDF content 41 - """ 42 - import pymupdf4llm 43 - 44 - if show_progress: 45 - print("Extracting with PyMuPDF (fast mode)...", file=sys.stderr) 46 - 47 - # Use text strategy which handles borderless tables better 48 - # than the default lines_strict 49 - markdown = pymupdf4llm.to_markdown( 50 - pdf_path, 51 - show_progress=show_progress, 52 - table_strategy="text" # Better for mixed table types 53 - ) 54 - 55 - return markdown 56 - 57 - 58 - def extract_pdf_accurate(pdf_path: str, show_progress: bool = False) -> str: 59 - """ 60 - Extract PDF to markdown using Docling with accurate table mode. 61 - 62 - Uses IBM's TableFormer AI model for ~93.6% table extraction accuracy. 63 - Much slower than fast mode (~2-3 sec/page). 64 - 65 - Args: 66 - pdf_path: Path to the PDF file 67 - show_progress: Whether to show progress output 68 - 69 - Returns: 70 - Markdown string of the PDF content 71 - """ 72 - from docling.document_converter import DocumentConverter 73 - from docling.datamodel.pipeline_options import PdfPipelineOptions 74 - from docling.datamodel.base_models import InputFormat 75 - from docling.document_converter import PdfFormatOption 76 - 77 - # Check if this is first run (models need downloading) 78 - if not check_docling_models(): 79 - print("First run: downloading Docling AI models (one-time setup, ~2-3 minutes)...", 80 - file=sys.stderr) 81 - 82 - # Configure pipeline for accurate table extraction 83 - pipeline_options = PdfPipelineOptions() 84 - pipeline_options.do_table_structure = True 85 - 86 - # Use ACCURATE mode for best table extraction 87 - try: 88 - from docling.datamodel.pipeline_options import TableFormerMode 89 - pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE 90 - except (ImportError, AttributeError): 91 - # Fallback if TableFormerMode not available 92 - pass 93 - 94 - # Create converter with PDF options 95 - converter = DocumentConverter( 96 - format_options={ 97 - InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) 98 - } 99 - ) 100 - 101 - if show_progress: 102 - print(f"Processing PDF with Docling (accurate mode, ~2-3 sec/page)...", file=sys.stderr) 103 - 104 - # Convert the document 105 - result = converter.convert(pdf_path) 106 - 107 - # Export to markdown 108 - markdown = result.document.export_to_markdown() 109 - 110 - return markdown 111 - 112 - 113 - def extract_pdf_docling( 114 - pdf_path: str, 115 - output_dir: str = None, 116 - images_scale: float = 4.0, 117 - show_progress: bool = False 118 - ) -> tuple: 119 - """ 120 - Extract PDF using Docling with accurate tables + high-res images. 121 - 122 - Uses IBM's TableFormer AI model for ~93.6% table extraction accuracy. 123 - Also extracts images at configurable resolution (default 4x for crisp images). 124 - 125 - Args: 126 - pdf_path: Path to the PDF file 127 - output_dir: Directory to save extracted images (None = skip images) 128 - images_scale: Image resolution multiplier (default: 4.0 for high-res) 129 - show_progress: Whether to show progress output 130 - 131 - Returns: 132 - tuple: (markdown: str, image_paths: list[str]) 133 - """ 134 - from docling.document_converter import DocumentConverter, PdfFormatOption 135 - from docling.datamodel.base_models import InputFormat 136 - from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode 137 - from docling_core.types.doc.base import ImageRefMode 138 - 139 - # Check if this is first run (models need downloading) 140 - if not check_docling_models(): 141 - print("First run: downloading Docling AI models (one-time setup, ~2-3 minutes)...", 142 - file=sys.stderr) 143 - 144 - if show_progress: 145 - print(f"Processing PDF with Docling (accurate mode, ~1 sec/page)...", file=sys.stderr) 146 - 147 - # Configure pipeline for accurate tables + image extraction 148 - pipeline_options = PdfPipelineOptions( 149 - do_table_structure=True, 150 - generate_picture_images=output_dir is not None, # Only extract images if we have output dir 151 - images_scale=images_scale 152 - ) 153 - pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE 154 - 155 - converter = DocumentConverter( 156 - format_options={ 157 - InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) 158 - } 159 - ) 160 - 161 - # Convert the document 162 - result = converter.convert(pdf_path) 163 - 164 - # Save images to output directory 165 - image_paths = [] 166 - if output_dir: 167 - output_path = Path(output_dir) 168 - output_path.mkdir(parents=True, exist_ok=True) 169 - 170 - for i, (element, _level) in enumerate(result.document.iterate_items()): 171 - if hasattr(element, 'image') and element.image is not None: 172 - img_path = output_path / f"figure_{i:04d}.png" 173 - element.image.pil_image.save(str(img_path)) 174 - image_paths.append(str(img_path)) 175 - 176 - if show_progress and image_paths: 177 - print(f"Extracted {len(image_paths)} images at {images_scale}x resolution", file=sys.stderr) 178 - 179 - # Export markdown with placeholders 180 - md = result.document.export_to_markdown(image_mode=ImageRefMode.PLACEHOLDER) 181 - 182 - # Replace placeholders with actual image references 183 - for img_path in image_paths: 184 - md = md.replace( 185 - "<!-- image -->", 186 - f"![Figure](images/{Path(img_path).name})", 187 - 1 188 - ) 189 - 190 - return md, image_paths 191 - 192 - 193 - def extract_pdf_to_markdown(pdf_path: str, accurate: bool = False, show_progress: bool = False) -> str: 194 - """ 195 - Extract PDF to markdown with configurable accuracy/speed trade-off. 196 - 197 - Args: 198 - pdf_path: Path to the PDF file 199 - accurate: If True, use Docling AI (~93.6% accuracy, slow). 200 - If False, use PyMuPDF (~70-80% accuracy, fast). 201 - show_progress: Whether to show progress output 202 - 203 - Returns: 204 - Markdown string of the PDF content 205 - """ 206 - if accurate: 207 - return extract_pdf_accurate(pdf_path, show_progress) 208 - else: 209 - return extract_pdf_fast(pdf_path, show_progress) 210 - 211 - 212 - def get_page_count(pdf_path: str) -> int: 213 - """Get the number of pages in a PDF using pymupdf (faster than Docling for this).""" 214 - import pymupdf 215 - doc = pymupdf.open(pdf_path) 216 - count = len(doc) 217 - doc.close() 218 - return count 219 - 220 - 221 - def extract_images(pdf_path: str, output_dir: str, show_progress: bool = False) -> list: 222 - """ 223 - Extract images from PDF to output directory. 224 - 225 - Uses pymupdf for image extraction since Docling focuses on document structure. 226 - 227 - Returns: 228 - List of extracted image paths 229 - """ 230 - import pymupdf 231 - 232 - output_path = Path(output_dir) 233 - output_path.mkdir(parents=True, exist_ok=True) 234 - 235 - doc = pymupdf.open(pdf_path) 236 - extracted = [] 237 - image_count = 0 238 - 239 - for page_num in range(len(doc)): 240 - page = doc[page_num] 241 - images = page.get_images() 242 - 243 - for img_index, img in enumerate(images): 244 - try: 245 - xref = img[0] 246 - pix = pymupdf.Pixmap(doc, xref) 247 - 248 - # Convert CMYK to RGB if necessary 249 - if pix.n - pix.alpha > 3: 250 - pix = pymupdf.Pixmap(pymupdf.csRGB, pix) 251 - 252 - image_count += 1 253 - img_filename = f"image_{image_count:04d}.png" 254 - img_path = output_path / img_filename 255 - pix.save(str(img_path)) 256 - extracted.append(str(img_path)) 257 - 258 - pix = None 259 - except Exception: 260 - continue 261 - 262 - doc.close() 263 - 264 - if show_progress and extracted: 265 - print(f"Extracted {len(extracted)} images", file=sys.stderr) 266 - 267 - return extracted
+251
scripts/extractor.py
··· 1 + """ 2 + PDF extraction with multiple backends: 3 + - Fast mode: PyMuPDF with multi-strategy table detection (good for simple tables) 4 + - Accurate mode: IBM Docling with TableFormer AI (better for complex/borderless tables) 5 + """ 6 + 7 + import sys 8 + from pathlib import Path 9 + 10 + # Version for cache invalidation - increment when extraction logic changes 11 + # Format: major.minor.patch 12 + EXTRACTOR_VERSION = "3.0.0" 13 + 14 + 15 + def check_docling_models(): 16 + """Check if Docling models are downloaded.""" 17 + try: 18 + from huggingface_hub import scan_cache_dir 19 + 20 + cache_info = scan_cache_dir() 21 + # Check for docling models in HF cache 22 + docling_repos = [r for r in cache_info.repos if "docling" in r.repo_id.lower()] 23 + return len(docling_repos) > 0 24 + except Exception: 25 + return False 26 + 27 + 28 + def extract_pdf_fast(pdf_path: str, show_progress: bool = False) -> str: 29 + """ 30 + Fast PDF extraction using PyMuPDF with multi-strategy table detection. 31 + 32 + Tries multiple table detection strategies for better coverage: 33 + - lines_strict: Best for bordered tables 34 + - text: For borderless tables (whitespace-based) 35 + 36 + Args: 37 + pdf_path: Path to the PDF file 38 + show_progress: Whether to show progress output 39 + 40 + Returns: 41 + Markdown string of the PDF content 42 + """ 43 + import pymupdf4llm 44 + 45 + if show_progress: 46 + print("Extracting with PyMuPDF (fast mode)...", file=sys.stderr) 47 + 48 + # Use text strategy which handles borderless tables better 49 + # than the default lines_strict 50 + markdown = pymupdf4llm.to_markdown( 51 + pdf_path, 52 + show_progress=show_progress, 53 + table_strategy="text", # Better for mixed table types 54 + ) 55 + 56 + return markdown 57 + 58 + 59 + def _save_docling_images(result, output_dir: Path) -> list: 60 + """ 61 + Save images from a Docling conversion result to output directory. 62 + 63 + Images are saved in iteration order, which matches the order of 64 + <!-- image --> placeholders in the exported markdown. 65 + 66 + Args: 67 + result: Docling ConversionResult object 68 + output_dir: Directory to save images to 69 + 70 + Returns: 71 + List of saved image paths (in iteration order) 72 + """ 73 + output_dir.mkdir(parents=True, exist_ok=True) 74 + image_paths = [] 75 + 76 + for i, (element, _level) in enumerate(result.document.iterate_items()): 77 + if hasattr(element, "image") and element.image is not None: 78 + img_path = output_dir / f"figure_{i:04d}.png" 79 + element.image.pil_image.save(str(img_path)) 80 + image_paths.append(str(img_path)) 81 + 82 + return image_paths 83 + 84 + 85 + def extract_pdf_docling( 86 + pdf_path: str, 87 + output_dir: str = None, 88 + images_scale: float = 4.0, 89 + show_progress: bool = False, 90 + ) -> tuple: 91 + """ 92 + Extract PDF using Docling with accurate tables + high-res images. 93 + 94 + Uses IBM's TableFormer AI model for ~93.6% table extraction accuracy. 95 + Also extracts images at configurable resolution (default 4x for crisp images). 96 + 97 + Args: 98 + pdf_path: Path to the PDF file 99 + output_dir: Directory to save extracted images (None = skip images) 100 + images_scale: Image resolution multiplier (default: 4.0 for high-res) 101 + show_progress: Whether to show progress output 102 + 103 + Returns: 104 + tuple: (markdown: str, image_paths: list[str]) 105 + """ 106 + from docling.document_converter import DocumentConverter, PdfFormatOption 107 + from docling.datamodel.base_models import InputFormat 108 + from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode 109 + from docling_core.types.doc.base import ImageRefMode 110 + 111 + # Check if this is first run (models need downloading) 112 + if not check_docling_models(): 113 + print( 114 + "First run: downloading Docling AI models (one-time setup, ~2-3 minutes)...", 115 + file=sys.stderr, 116 + ) 117 + 118 + if show_progress: 119 + print( 120 + f"Processing PDF with Docling (accurate mode, ~1 sec/page)...", 121 + file=sys.stderr, 122 + ) 123 + 124 + # Configure pipeline for accurate tables + image extraction 125 + pipeline_options = PdfPipelineOptions( 126 + do_table_structure=True, 127 + generate_picture_images=output_dir is not None, 128 + images_scale=images_scale, 129 + ) 130 + pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE 131 + 132 + converter = DocumentConverter( 133 + format_options={ 134 + InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) 135 + } 136 + ) 137 + 138 + # Convert the document 139 + result = converter.convert(pdf_path) 140 + 141 + # Save images to output directory (order matters for placeholder replacement) 142 + image_paths = [] 143 + if output_dir: 144 + image_paths = _save_docling_images(result, Path(output_dir)) 145 + if show_progress and image_paths: 146 + print( 147 + f"Extracted {len(image_paths)} images at {images_scale}x resolution", 148 + file=sys.stderr, 149 + ) 150 + 151 + # Export markdown with placeholders 152 + md = result.document.export_to_markdown(image_mode=ImageRefMode.PLACEHOLDER) 153 + 154 + # Replace placeholders with actual image references (order must match iteration order) 155 + for img_path in image_paths: 156 + md = md.replace("<!-- image -->", f"![Figure](images/{Path(img_path).name})", 1) 157 + 158 + return md, image_paths 159 + 160 + 161 + def extract_pdf_to_markdown( 162 + pdf_path: str, accurate: bool = False, show_progress: bool = False 163 + ) -> str: 164 + """ 165 + Extract PDF to markdown with configurable accuracy/speed trade-off. 166 + 167 + Args: 168 + pdf_path: Path to the PDF file 169 + accurate: If True, use Docling AI (better for complex tables, slower). 170 + If False, use PyMuPDF (fast, good for simple tables). 171 + show_progress: Whether to show progress output 172 + 173 + Returns: 174 + Markdown string of the PDF content 175 + """ 176 + if accurate: 177 + # Use Docling without image extraction 178 + md, _ = extract_pdf_docling( 179 + pdf_path, output_dir=None, show_progress=show_progress 180 + ) 181 + return md 182 + else: 183 + return extract_pdf_fast(pdf_path, show_progress) 184 + 185 + 186 + def get_page_count(pdf_path: str) -> int: 187 + """Get the number of pages in a PDF using pymupdf (faster than Docling for this).""" 188 + import pymupdf 189 + 190 + doc = pymupdf.open(pdf_path) 191 + count = len(doc) 192 + doc.close() 193 + return count 194 + 195 + 196 + def extract_images(pdf_path: str, output_dir: str, show_progress: bool = False) -> list: 197 + """ 198 + Extract images from PDF to output directory. 199 + 200 + Uses pymupdf for image extraction since Docling focuses on document structure. 201 + Deduplicates by xref to avoid extracting the same image multiple times 202 + (e.g., icons/logos reused across pages). 203 + 204 + Returns: 205 + List of extracted image paths 206 + """ 207 + import pymupdf 208 + 209 + output_path = Path(output_dir) 210 + output_path.mkdir(parents=True, exist_ok=True) 211 + 212 + doc = pymupdf.open(pdf_path) 213 + extracted = [] 214 + image_count = 0 215 + seen_xrefs = set() # Track already-extracted images by xref 216 + 217 + for page_num in range(len(doc)): 218 + page = doc[page_num] 219 + images = page.get_images() 220 + 221 + for img_index, img in enumerate(images): 222 + try: 223 + xref = img[0] 224 + 225 + # Skip if we've already extracted this image 226 + if xref in seen_xrefs: 227 + continue 228 + seen_xrefs.add(xref) 229 + 230 + pix = pymupdf.Pixmap(doc, xref) 231 + 232 + # Convert CMYK to RGB if necessary 233 + if pix.n - pix.alpha > 3: 234 + pix = pymupdf.Pixmap(pymupdf.csRGB, pix) 235 + 236 + image_count += 1 237 + img_filename = f"image_{image_count:04d}.png" 238 + img_path = output_path / img_filename 239 + pix.save(str(img_path)) 240 + extracted.append(str(img_path)) 241 + 242 + pix = None 243 + except Exception: 244 + continue 245 + 246 + doc.close() 247 + 248 + if show_progress and extracted: 249 + print(f"Extracted {len(extracted)} unique images", file=sys.stderr) 250 + 251 + return extracted
+238 -80
scripts/pdf_to_md.py
··· 39 39 # CACHING FUNCTIONS 40 40 # ============================================================================= 41 41 42 + 42 43 def get_cache_key(pdf_path: str, docling: bool = False) -> str: 43 44 """Generate cache key from file content + size + mode (path-independent).""" 44 45 p = Path(pdf_path).resolve() ··· 49 50 chunk_size = 65536 # 64KB 50 51 hasher = hashlib.sha256() 51 52 52 - with open(p, 'rb') as f: 53 + with open(p, "rb") as f: 53 54 # Read first chunk 54 55 hasher.update(f.read(chunk_size)) 55 56 ··· 75 76 Returns: 76 77 (is_valid: bool, cache_key: str) 77 78 """ 78 - from docling_extractor import EXTRACTOR_VERSION 79 + from extractor import EXTRACTOR_VERSION 79 80 80 81 try: 81 82 cache_key = get_cache_key(pdf_path, docling=docling) ··· 97 98 p = Path(pdf_path).resolve() 98 99 stat = p.stat() 99 100 100 - if (metadata.get("source_size") != stat.st_size or 101 - metadata.get("source_mtime") != stat.st_mtime): 101 + if ( 102 + metadata.get("source_size") != stat.st_size 103 + or metadata.get("source_mtime") != stat.st_mtime 104 + ): 102 105 return False, cache_key 103 106 104 107 # Check extractor version - invalidate if extraction logic changed ··· 110 113 return False, cache_key 111 114 112 115 113 - def load_from_cache(cache_key: str, pages: list = None) -> tuple: 116 + def load_from_cache( 117 + cache_key: str, pages: list = None, no_images: bool = False 118 + ) -> tuple: 114 119 """ 115 120 Load markdown from cache, optionally slice specific pages. 121 + 122 + Args: 123 + cache_key: The cache key to load from 124 + pages: Optional list of page numbers to slice 125 + no_images: If True, skip loading image directory even if cached 116 126 117 127 Returns: 118 128 (markdown: str, image_dir: Path or None, total_pages: int) ··· 127 137 metadata = json.load(f) 128 138 total_pages = metadata.get("total_pages", 0) 129 139 130 - # Check for cached images 131 - image_dir = cache_dir / "images" 132 - if not image_dir.exists() or not any(image_dir.iterdir()): 133 - image_dir = None 140 + # Check for cached images (skip if no_images flag is set) 141 + image_dir = None 142 + if not no_images: 143 + cached_image_dir = cache_dir / "images" 144 + if cached_image_dir.exists() and any(cached_image_dir.iterdir()): 145 + image_dir = cached_image_dir 134 146 135 147 # Slice pages if requested 136 148 if pages: ··· 139 151 return full_md, image_dir, total_pages 140 152 141 153 142 - def save_to_cache(cache_key: str, markdown: str, image_dir: Path, pdf_path: str, total_pages: int): 154 + def save_to_cache( 155 + cache_key: str, markdown: str, image_dir: Path, pdf_path: str, total_pages: int 156 + ): 143 157 """Save full extraction to cache.""" 144 - from docling_extractor import EXTRACTOR_VERSION 158 + from extractor import EXTRACTOR_VERSION 145 159 146 160 cache_dir = get_cache_dir(cache_key) 147 161 cache_dir.mkdir(parents=True, exist_ok=True) ··· 182 196 """ 183 197 # Try to split on page separator pattern (horizontal rules) 184 198 # pymupdf4llm uses "-----" as page separator 185 - page_pattern = r'\n-----\n' 199 + page_pattern = r"\n-----\n" 186 200 parts = re.split(page_pattern, full_md) 187 201 188 202 if len(parts) <= 1: ··· 201 215 return "\n-----\n".join(selected_parts) 202 216 203 217 218 + def find_cache_by_source_path(pdf_path: str) -> list: 219 + """ 220 + Find cache entries by source path in metadata. 221 + 222 + Used as fallback when the source PDF no longer exists (can't compute hash). 223 + 224 + Returns: 225 + List of cache directories that match the source path 226 + """ 227 + if not CACHE_DIR.exists(): 228 + return [] 229 + 230 + pdf_path_resolved = str(Path(pdf_path).resolve()) 231 + matching = [] 232 + 233 + for entry in CACHE_DIR.iterdir(): 234 + if not entry.is_dir(): 235 + continue 236 + metadata_file = entry / "metadata.json" 237 + if not metadata_file.exists(): 238 + continue 239 + try: 240 + with open(metadata_file) as f: 241 + metadata = json.load(f) 242 + if metadata.get("source_path") == pdf_path_resolved: 243 + matching.append(entry) 244 + except (json.JSONDecodeError, OSError): 245 + continue 246 + 247 + return matching 248 + 249 + 204 250 def clear_cache(pdf_path: str = None): 205 251 """Clear cache for specific PDF (both fast and docling) or entire cache.""" 206 252 if pdf_path: 207 253 cleared = False 208 - # Clear both fast and docling caches for this PDF 254 + # First try: clear by computing cache key (requires file to exist) 209 255 for docling in [False, True]: 210 256 try: 211 257 cache_key = get_cache_key(pdf_path, docling=docling) ··· 215 261 cleared = True 216 262 except (FileNotFoundError, OSError): 217 263 pass 264 + 265 + # Fallback: if file doesn't exist, search by source_path in metadata 266 + if not cleared: 267 + matching_caches = find_cache_by_source_path(pdf_path) 268 + for cache_dir in matching_caches: 269 + shutil.rmtree(cache_dir) 270 + cleared = True 271 + 218 272 return cleared 219 273 else: 220 274 # Clear all cache ··· 250 304 # PDF PROCESSING FUNCTIONS 251 305 # ============================================================================= 252 306 307 + 253 308 def check_dependencies(): 254 309 """Check if required packages are installed.""" 310 + missing = [] 311 + 312 + # Core dependencies 255 313 try: 256 314 import docling 315 + except ImportError: 316 + missing.append("docling") 317 + 318 + try: 257 319 import pymupdf 258 - return True 259 - except ImportError as e: 260 - print(f"ERROR: Missing dependency: {e}", file=sys.stderr) 261 - print("Install with: uv pip install docling pymupdf", file=sys.stderr) 320 + except ImportError: 321 + missing.append("pymupdf") 322 + 323 + try: 324 + import pymupdf4llm 325 + except ImportError: 326 + missing.append("pymupdf4llm") 327 + 328 + try: 329 + import docling_core 330 + except ImportError: 331 + missing.append("docling-core") 332 + 333 + # Optional but recommended 334 + try: 335 + import huggingface_hub 336 + except ImportError: 337 + # huggingface_hub is optional (used for model cache checking) 338 + pass 339 + 340 + if missing: 341 + print(f"ERROR: Missing dependencies: {', '.join(missing)}", file=sys.stderr) 342 + print( 343 + "Install with: uv pip install docling pymupdf pymupdf4llm docling-core", 344 + file=sys.stderr, 345 + ) 262 346 return False 347 + 348 + return True 263 349 264 350 265 351 def parse_page_range(page_str, total_pages): ··· 268 354 return None 269 355 270 356 pages = [] 271 - for part in page_str.split(','): 357 + for part in page_str.split(","): 272 358 part = part.strip() 273 - if '-' in part: 274 - start, end = part.split('-', 1) 359 + if "-" in part: 360 + start, end = part.split("-", 1) 275 361 start = int(start) - 1 # Convert to 0-indexed 276 362 end = int(end) # End is inclusive, so no -1 277 363 pages.extend(range(start, min(end, total_pages))) ··· 297 383 298 384 images = [] 299 385 for img_path in sorted(image_dir.glob("*")): 300 - if img_path.suffix.lower() in ('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp'): 386 + if img_path.suffix.lower() in ( 387 + ".png", 388 + ".jpg", 389 + ".jpeg", 390 + ".gif", 391 + ".bmp", 392 + ".webp", 393 + ): 301 394 try: 302 395 # Get file size 303 396 size_bytes = img_path.stat().st_size ··· 306 399 # Try to get dimensions using pymupdf 307 400 try: 308 401 import pymupdf 402 + 309 403 pix = pymupdf.Pixmap(str(img_path)) 310 404 dimensions = f"{pix.width}x{pix.height}" 311 405 pix = None 312 406 except: 313 407 dimensions = "unknown" 314 408 315 - images.append({ 316 - 'filename': img_path.name, 317 - 'path': str(img_path), 318 - 'size_kb': round(size_kb, 1), 319 - 'dimensions': dimensions, 320 - }) 409 + images.append( 410 + { 411 + "filename": img_path.name, 412 + "path": str(img_path), 413 + "size_kb": round(size_kb, 1), 414 + "dimensions": dimensions, 415 + } 416 + ) 321 417 except Exception: 322 418 pass 323 419 ··· 335 431 336 432 def replace_image_ref(match): 337 433 alt_text = match.group(1) 338 - filename = match.group(2) 434 + filename_raw = match.group(2) 435 + # Strip any directory components (e.g., "images/figure_0001.png" -> "figure_0001.png") 436 + # This handles Docling's output which includes "images/" prefix 437 + filename = Path(filename_raw).name 339 438 full_path = image_dir / filename 340 439 341 440 if full_path.exists(): ··· 343 442 size_kb = round(full_path.stat().st_size / 1024, 1) 344 443 try: 345 444 import pymupdf 445 + 346 446 pix = pymupdf.Pixmap(str(full_path)) 347 447 dims = f"{pix.width}x{pix.height}" 348 448 pix = None 349 449 except: 350 450 dims = "?" 351 451 352 - return f"![{alt_text}]({filename})\n\n**[Image: {filename} ({dims}, {size_kb}KB) → {full_path}]**" 452 + return f"![{alt_text}]({filename_raw})\n\n**[Image: {filename} ({dims}, {size_kb}KB) → {full_path}]**" 353 453 except: 354 - return f"![{alt_text}]({filename})\n\n**[Image: {filename} → {full_path}]**" 454 + return f"![{alt_text}]({filename_raw})\n\n**[Image: {filename} → {full_path}]**" 355 455 356 456 return match.group(0) 357 457 358 - pattern = r'!\[([^\]]*)\]\(([^)]+)\)' 458 + pattern = r"!\[([^\]]*)\]\(([^)]+)\)" 359 459 return re.sub(pattern, replace_image_ref, markdown) 360 460 361 461 ··· 383 483 return "\n".join(lines) 384 484 385 485 386 - def convert_pdf(pdf_path, image_dir=None, no_images=False, show_progress=False, 387 - docling=False, images_scale=4.0): 486 + def convert_pdf( 487 + pdf_path, 488 + image_dir=None, 489 + no_images=False, 490 + show_progress=False, 491 + docling=False, 492 + images_scale=4.0, 493 + ): 388 494 """ 389 495 Convert PDF to markdown. 390 496 ··· 397 503 images_scale: Image resolution multiplier for Docling mode (default: 4.0) 398 504 """ 399 505 if docling: 400 - from docling_extractor import extract_pdf_docling 506 + from extractor import extract_pdf_docling 507 + 401 508 # Docling extracts both text and images together 402 - markdown, image_paths = extract_pdf_docling( 509 + markdown, _image_paths = extract_pdf_docling( 403 510 pdf_path, 404 511 output_dir=image_dir if not no_images else None, 405 512 images_scale=images_scale, 406 - show_progress=show_progress 513 + show_progress=show_progress, 407 514 ) 408 515 return markdown 409 516 else: 410 - from docling_extractor import extract_pdf_to_markdown, extract_images 517 + from extractor import extract_pdf_to_markdown, extract_images 518 + 411 519 # Fast mode: separate text and image extraction 412 - markdown = extract_pdf_to_markdown(pdf_path, accurate=False, show_progress=show_progress) 520 + markdown = extract_pdf_to_markdown( 521 + pdf_path, accurate=False, show_progress=show_progress 522 + ) 413 523 414 524 if not no_images and image_dir: 415 525 extract_images(pdf_path, image_dir, show_progress=show_progress) ··· 417 527 return markdown 418 528 419 529 420 - def add_metadata_header(markdown, pdf_path, total_pages, pages_extracted, image_dir=None, cached=False): 530 + def add_metadata_header( 531 + markdown, pdf_path, total_pages, pages_extracted, image_dir=None, cached=False 532 + ): 421 533 """Add metadata header to markdown output.""" 422 534 filename = os.path.basename(pdf_path) 423 535 ··· 443 555 def setup_temp_image_dir(pdf_path): 444 556 """Create temporary image directory for extraction.""" 445 557 pdf_name = Path(pdf_path).stem 446 - safe_name = re.sub(r'[^\w\-_]', '_', pdf_name) 558 + safe_name = re.sub(r"[^\w\-_]", "_", pdf_name) 447 559 image_dir = Path("/tmp/pdf_images") / safe_name 448 560 449 561 if image_dir.exists(): ··· 457 569 # MAIN 458 570 # ============================================================================= 459 571 572 + 460 573 def main(): 461 574 parser = argparse.ArgumentParser( 462 - description='Convert PDF to Markdown for LLM context (with persistent caching)', 575 + description="Convert PDF to Markdown for LLM context (with persistent caching)", 463 576 formatter_class=argparse.RawDescriptionHelpFormatter, 464 577 epilog=""" 465 578 Examples: ··· 475 588 Cache is keyed by file path + size + modification time. 476 589 Full PDF is always extracted and cached; --pages slices from cache. 477 590 Cache persists until explicitly cleared or source PDF changes. 478 - """ 591 + """, 479 592 ) 480 593 481 - parser.add_argument('input', nargs='?', help='Input PDF file path') 482 - parser.add_argument('output', nargs='?', help='Output markdown file path (default: <input>.md)') 483 - parser.add_argument('--stdout', action='store_true', help='Print to stdout instead of file') 484 - parser.add_argument('--pages', help='Page range to extract (e.g., "1-5" or "1,3,5-7")') 485 - parser.add_argument('--docling', '--accurate', action='store_true', dest='docling', 486 - help='Use Docling AI for ~93.6%% table accuracy (slower, ~1 sec/page)') 487 - parser.add_argument('--images-scale', type=float, default=4.0, 488 - help='Image resolution multiplier for Docling mode (default: 4.0)') 489 - parser.add_argument('--no-images', action='store_true', help='Skip image extraction (faster)') 490 - parser.add_argument('--no-metadata', action='store_true', help='Skip metadata header') 491 - parser.add_argument('--no-progress', action='store_true', help='Disable progress indicator') 594 + parser.add_argument("input", nargs="?", help="Input PDF file path") 595 + parser.add_argument( 596 + "output", nargs="?", help="Output markdown file path (default: <input>.md)" 597 + ) 598 + parser.add_argument( 599 + "--stdout", action="store_true", help="Print to stdout instead of file" 600 + ) 601 + parser.add_argument( 602 + "--pages", help='Page range to extract (e.g., "1-5" or "1,3,5-7")' 603 + ) 604 + parser.add_argument( 605 + "--docling", 606 + "--accurate", 607 + action="store_true", 608 + dest="docling", 609 + help="Use Docling AI for complex/borderless tables (slower, ~1 sec/page)", 610 + ) 611 + parser.add_argument( 612 + "--images-scale", 613 + type=float, 614 + default=4.0, 615 + help="Image resolution multiplier for Docling mode (default: 4.0)", 616 + ) 617 + parser.add_argument( 618 + "--no-images", action="store_true", help="Skip image extraction (faster)" 619 + ) 620 + parser.add_argument( 621 + "--no-metadata", action="store_true", help="Skip metadata header" 622 + ) 623 + parser.add_argument( 624 + "--no-progress", action="store_true", help="Disable progress indicator" 625 + ) 492 626 493 627 # Cache options 494 - parser.add_argument('--no-cache', action='store_true', 495 - help='Bypass cache, process fresh (still updates cache)') 496 - parser.add_argument('--clear-cache', action='store_true', 497 - help='Clear cache for this PDF before processing') 498 - parser.add_argument('--clear-all-cache', action='store_true', 499 - help='Clear entire cache directory and exit') 500 - parser.add_argument('--cache-stats', action='store_true', 501 - help='Show cache statistics and exit') 628 + parser.add_argument( 629 + "--no-cache", 630 + action="store_true", 631 + help="Bypass cache, process fresh (still updates cache)", 632 + ) 633 + parser.add_argument( 634 + "--clear-cache", 635 + action="store_true", 636 + help="Clear cache for this PDF before processing", 637 + ) 638 + parser.add_argument( 639 + "--clear-all-cache", 640 + action="store_true", 641 + help="Clear entire cache directory and exit", 642 + ) 643 + parser.add_argument( 644 + "--cache-stats", action="store_true", help="Show cache statistics and exit" 645 + ) 502 646 503 647 args = parser.parse_args() 504 648 ··· 526 670 print(f"ERROR: File not found: {args.input}", file=sys.stderr) 527 671 sys.exit(1) 528 672 529 - if not args.input.lower().endswith('.pdf'): 673 + if not args.input.lower().endswith(".pdf"): 530 674 print(f"WARNING: File may not be a PDF: {args.input}", file=sys.stderr) 531 675 532 676 # Check dependencies ··· 542 686 543 687 # Get total pages 544 688 import pymupdf 689 + 545 690 doc = pymupdf.open(args.input) 546 691 total_pages = len(doc) 547 692 doc.close() ··· 565 710 if show_progress: 566 711 mode = "docling" if args.docling else "fast" 567 712 print(f"Loading from cache ({mode} mode)...", file=sys.stderr) 568 - result, image_dir, cached_total = load_from_cache(cache_key, requested_pages) 713 + result, image_dir, cached_total = load_from_cache( 714 + cache_key, requested_pages, no_images=args.no_images 715 + ) 569 716 cache_hit = True 570 717 571 718 # If no cache hit, extract full PDF ··· 583 730 try: 584 731 if show_progress: 585 732 if args.docling: 586 - print(f"Extracting {total_pages} pages with Docling AI (~1 sec/page)...", file=sys.stderr) 733 + print( 734 + f"Extracting {total_pages} pages with Docling AI (~1 sec/page)...", 735 + file=sys.stderr, 736 + ) 587 737 else: 588 - print(f"Extracting {total_pages} pages with PyMuPDF (fast mode)...", file=sys.stderr) 738 + print( 739 + f"Extracting {total_pages} pages with PyMuPDF (fast mode)...", 740 + file=sys.stderr, 741 + ) 589 742 590 743 result = convert_pdf( 591 744 args.input, ··· 593 746 no_images=args.no_images, 594 747 show_progress=show_progress, 595 748 docling=args.docling, 596 - images_scale=args.images_scale 749 + images_scale=args.images_scale, 597 750 ) 598 751 except Exception as e: 599 752 print(f"ERROR: Conversion failed: {e}", file=sys.stderr) ··· 605 758 if show_progress: 606 759 print(f"Cached: {get_cache_dir(cache_key)}", file=sys.stderr) 607 760 608 - # Set image_dir to cached location 609 - cached_image_dir = get_cache_dir(cache_key) / "images" 610 - if cached_image_dir.exists() and any(cached_image_dir.iterdir()): 611 - image_dir = cached_image_dir 612 - else: 613 - image_dir = temp_image_dir 761 + # Set image_dir to cached location (unless no_images is set) 762 + if not args.no_images: 763 + cached_image_dir = get_cache_dir(cache_key) / "images" 764 + if cached_image_dir.exists() and any(cached_image_dir.iterdir()): 765 + image_dir = cached_image_dir 766 + else: 767 + image_dir = temp_image_dir 614 768 615 769 # Slice pages if requested (after caching full result) 616 770 if requested_pages: ··· 619 773 # Format output 620 774 output = result 621 775 622 - # Enhance image references with full paths 623 - if image_dir: 776 + # Enhance image references with full paths (skip if --no-images) 777 + if image_dir and not args.no_images: 624 778 output = enhance_markdown_with_image_paths(output, image_dir) 625 779 626 780 # Add image summary table at the end ··· 630 784 631 785 if not args.no_metadata: 632 786 output = add_metadata_header( 633 - output, args.input, total_pages, pages_to_output, 634 - image_dir, cached=cache_hit 787 + output, 788 + args.input, 789 + total_pages, 790 + pages_to_output, 791 + image_dir, 792 + cached=cache_hit, 635 793 ) 636 794 637 795 # Write output 638 796 if args.stdout: 639 797 print(output) 640 798 else: 641 - output_path = args.output or os.path.splitext(args.input)[0] + '.md' 642 - with open(output_path, 'w', encoding='utf-8') as f: 799 + output_path = args.output or os.path.splitext(args.input)[0] + ".md" 800 + with open(output_path, "w", encoding="utf-8") as f: 643 801 f.write(output) 644 802 645 803 msg = f"Converted {pages_to_output} pages to: {output_path}" 646 804 if cache_hit: 647 805 msg += " (from cache)" 648 - if image_dir: 806 + if image_dir and not args.no_images: 649 807 images = get_image_info(image_dir) 650 808 if images: 651 809 msg += f" ({len(images)} images)" 652 810 print(msg, file=sys.stderr) 653 811 654 812 655 - if __name__ == '__main__': 813 + if __name__ == "__main__": 656 814 main()