Simplify pdf_to_md.py: remove rarely-used features causing bugs · alice.mosphere.at/claude-skill-pdf-to-markdown@22002e0

+5 -19

README.md

···
 - Accurate mode: IBM Docling AI (better for complex/borderless tables)
 - **Image extraction** to cache directory with paths in output
 - **Aggressive caching** - extract once, reuse forever
-- **Page slicing** - request specific pages from cached full extraction

 ## Installation

···
 ## Usage

 ```bash
-# Basic conversion
-.venv/bin/python scripts/pdf_to_md.py document.pdf --stdout
+# Basic conversion (outputs to document.md)
+.venv/bin/python scripts/pdf_to_md.py document.pdf

 # High-accuracy tables (slower)
-.venv/bin/python scripts/pdf_to_md.py document.pdf --docling --stdout
-
-# Specific pages
-.venv/bin/python scripts/pdf_to_md.py document.pdf --pages 1-10 --stdout
-
-# Skip images (faster)
-.venv/bin/python scripts/pdf_to_md.py document.pdf --no-images --stdout
+.venv/bin/python scripts/pdf_to_md.py document.pdf --docling

-# Save to file
+# Custom output path
 .venv/bin/python scripts/pdf_to_md.py document.pdf output.md
 ```

···

 | Option | Description |
 |--------|-------------|
-| `--stdout` | Print to stdout instead of file |
-| `--pages RANGE` | Page range (e.g., "1-5" or "1,3,5-7") |
 | `--docling` | Use Docling AI for high-accuracy tables |
-| `--images-scale N` | Image resolution multiplier for Docling mode (default: 4.0) |
-| `--no-images` | Skip image extraction |
-| `--no-metadata` | Skip metadata header |
 | `--no-progress` | Disable progress indicator |
-| `--no-cache` | Bypass cache entirely (no read or write) |
-| `--clear-cache` | Clear cache for this PDF (works even if PDF deleted) |
+| `--clear-cache` | Clear cache for this PDF and re-extract |
 | `--clear-all-cache` | Clear entire cache |
 | `--cache-stats` | Show cache statistics |
-| `--force-stale-cache` | Use cached extraction even if version differs (when PDF missing) |

 ## Project Structure

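For scripted use, the invocations kept in the README can be assembled programmatically. The following is a minimal sketch and not part of the commit: `build_command` and `default_output_path` are hypothetical helper names, and the default output location (input path with its extension replaced by `.md`) is an assumption based on the README comment "outputs to document.md".

```python
from pathlib import Path

# Paths as they appear in the README examples; adjust for your checkout.
PYTHON = ".venv/bin/python"
SCRIPT = "scripts/pdf_to_md.py"


def default_output_path(pdf_path):
    """Assumed default: same location as the PDF, with a .md extension."""
    return Path(pdf_path).with_suffix(".md")


def build_command(pdf_path, output_path=None, docling=False):
    """Build an argv list using only arguments that survive this commit."""
    cmd = [PYTHON, SCRIPT, str(pdf_path)]
    if output_path is not None:
        # Optional positional argument: custom output path
        cmd.append(str(output_path))
    if docling:
        # High-accuracy table mode (slower)
        cmd.append("--docling")
    return cmd


if __name__ == "__main__":
    print(build_command("document.pdf", docling=True))
    print(default_output_path("reports/document.pdf"))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Since `--stdout` is removed, the caller would then read the markdown from the output file.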

+36 -118

SKILL.md

···
 - Lists (ordered and unordered)
 - Multi-column layouts (correct reading order)
 - Code blocks
-- **Images** (extracted to files with paths in output)
+- **Images** (always extracted to cache with paths in output)

 ## When to Use This Skill

···
 - User says "load", "read", "bring in", "extract" a PDF
 - Grepping/searching would miss context or structure
 - PDF has tables, formatting, or structure to preserve
-
-**USE `--pages`** when user only needs specific pages (faster, less output).

 ## Environment Setup

···

 ## Quick Start

-### Using the Script (Recommended)
-
-The script automatically uses the skill's dedicated venv:
-
 ```bash
-# Convert PDF to markdown (images extracted by default)
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --stdout
-
-# Skip image extraction (faster, smaller output)
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --no-images --stdout
+# Convert PDF to markdown (always extracts images)
+~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf

-# Specific pages only
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --pages 1-10 --stdout
+# Output: document.md + images in cache
 ```

 ## Standard Workflow
···

 ### Step 1: Ensure the skill venv exists
 ```bash
-# For fast mode (default):
 test -d ~/.claude/skills/pdf-to-markdown/.venv || (cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf pymupdf4llm)
 ```

 ### Step 2: Convert PDF to Markdown
 ```bash
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py "/path/to/document.pdf" --stdout
+~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py "/path/to/document.pdf"
 ```

-### Step 3: Load into context
-The markdown output is now available. Display it or use it directly.
+### Step 3: Read the output
+```bash
+# Output is written to document.md in the same directory as the PDF
+cat /path/to/document.md
+```

 ## Caching

···

 ### How It Works
 - **Cache location**: `~/.cache/pdf-to-markdown/<cache_key>/`
-- **Cache key**: Based on file path + size + modification time
-- **Full PDF cached**: Even if you request `--pages 1-10`, the full PDF is extracted and cached. Page slicing happens from the cached result.
+- **Cache key**: Based on file content hash + extraction mode
 - **Invalidation**: Cache is invalidated when:
   - Source PDF is modified (size or mtime changes)
   - Extractor version changes (automatic re-extraction)
···

 # Show cache statistics
 ~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py --cache-stats
-
-# Bypass cache entirely (no read or write)
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --no-cache --stdout
 ```

 ### Cache Contents
···

 ## Image Handling

-By default, images are:
+Images are always extracted. They are:
 1. **Extracted** to cache directory `~/.cache/pdf-to-markdown/<cache_key>/images/`
-2. **Referenced** in the markdown with full paths like:
-   ```
-   ![alt text](image.png)
-
-   **[Image: image.png (800x600, 45.2KB) → ~/.cache/pdf-to-markdown/<key>/images/image.png]**
-   ```
+2. **Referenced** in the markdown with full paths
 3. **Summarized** in a table at the end of the document

 ### Auto-View Behavior for Images

 **IMPORTANT:** When the extracted markdown contains image references like:
 ```
-**[Image: figure_1.png (1200x800, 125.3KB) → /Users/.../.cache/pdf-to-markdown/abc123/images/figure_1.png]**
+**[Image: figure_1.png (1200x800, 125.3KB)]**
 ```

 And the user asks about something that might be visual (charts, graphs, diagrams, figures, screenshots, layouts, designs, plots, illustrations), **automatically use the Read tool** to view the relevant image file(s) before answering. Don't ask the user - just look at it.
···
 - User: "Describe the architecture shown" → Read the image file
 - User: "What are the results?" (and there's a results figure) → Read it

-**When NOT to auto-view:**
-- User only asks about text content
-- User explicitly says they don't need images
-- No images were extracted (--no-images was used)
-
-### Image Options
-
-```bash
-# Default: extract images to cache directory
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py doc.pdf --stdout
-
-# Skip images entirely (faster)
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py doc.pdf --no-images --stdout
-```
-
 ## Output Format

 The markdown output includes:
···
 ---
 source: document.pdf
 total_pages: 42
-pages_extracted: 42
 extracted_at: 2025-01-15T10:30:00
 from_cache: true
 images_dir: /Users/.../.cache/pdf-to-markdown/abc123/images
···

 Regular paragraph text with **bold**, *italic*, and `code` formatting.

-![Figure 1](figure_1.png)
+![Figure 1](/Users/.../.cache/pdf-to-markdown/abc123/images/figure_1.png)

-**[Image: figure_1.png (800x600, 45.2KB) → ~/.cache/pdf-to-markdown/abc123/images/figure_1.png]**
+**[Image: figure_1.png (800x600, 45.2KB)]**

 | Column A | Column B |
 |----------|----------|
···

 | # | File | Dimensions | Size | Path |
 |---|------|------------|------|------|
-| 1 | figure_1.png | 800x600 | 45.2KB | `~/.cache/pdf-to-markdown/abc123/images/figure_1.png` |
-| 2 | chart_2.png | 1200x800 | 89.1KB | `~/.cache/pdf-to-markdown/abc123/images/chart_2.png` |
+| 1 | figure_1.png | 800x600 | 45.2KB | `~/.cache/.../images/figure_1.png` |
+| 2 | chart_2.png | 1200x800 | 89.1KB | `~/.cache/.../images/chart_2.png` |
 ```

 ## Script Reference
···
 Usage: pdf_to_md.py <input.pdf> [output.md] [options]

 Options:
-  --stdout              Print to stdout instead of file
-  --pages RANGE         Page range (e.g., "1-5" or "1,3,5-7")
   --docling             Use Docling AI for high-accuracy tables (~1 sec/page)
-  --images-scale N      Image resolution for Docling mode (default: 4.0)
-  --no-images           Skip image extraction (faster)
-  --no-metadata         Skip metadata header in output
   --no-progress         Disable progress indicator

 Cache Options:
-  --no-cache            Bypass cache entirely (no read or write)
-  --clear-cache         Clear cache for this PDF (works even if PDF was deleted)
+  --clear-cache         Clear cache for this PDF and re-extract
   --clear-all-cache     Clear entire cache directory and exit
   --cache-stats         Show cache statistics and exit
-  --force-stale-cache   Use cached extraction even if version differs (when PDF missing)
 ```

-**Performance:** First extraction is cached, so subsequent requests for the same PDF are instant.
+## High-Accuracy Mode (Docling)

-## Advanced Usage
+For PDFs with complex tables that need high accuracy, use the `--docling` flag:

-### Extract Specific Pages
 ```bash
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --pages 1-10 --stdout
+~/.claude/skills/pdf-to-markdown/.venv/bin/python \
+  ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py \
+  document.pdf --docling
 ```

-### Handle Scanned PDFs (OCR)
-For scanned PDFs without extractable text, pymupdf4llm will attempt OCR automatically if Tesseract is available:
-```bash
-# Install Tesseract first (macOS)
-brew install tesseract
+**When to use `--docling`:**
+- PDF has complex tables (borderless, merged cells, multi-column)
+- Table accuracy is critical (medical data, financial reports)
+- You're seeing garbled table output in default mode
+
+**Trade-offs:**
+- ~1 second per page (vs instant for fast mode)
+- First run downloads AI models (~500MB one-time)
+- Higher-resolution images (4x default)

-# Then convert - OCR happens automatically for image-based pages
-~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py scanned.pdf --stdout
-```
+**Note:** `--accurate` is an alias for `--docling`.

 ## Troubleshooting

···
 ```

 ### Poor extraction quality
-- Try `marker-pdf` for complex layouts (install into the skill venv):
-  ```bash
-  uv pip install --python ~/.claude/skills/pdf-to-markdown/.venv/bin/python marker-pdf
-  ```
+- Try `--docling` for complex tables
 - For scanned PDFs, ensure Tesseract OCR is installed: `brew install tesseract`

-### Very large PDFs
-- Use `--pages` to extract only needed sections
-- Use `--no-images` to skip image extraction (faster)
-
 ### Tables not formatting correctly
-pymupdf4llm handles most tables well. For complex tables:
-```bash
-~/.claude/skills/pdf-to-markdown/.venv/bin/python -c "
-import pymupdf4llm
-md_text = pymupdf4llm.to_markdown('doc.pdf', table_strategy='lines_strict')
-print(md_text)
-"
-```
-
-## High-Accuracy Mode (Docling)
-
-For PDFs with complex tables that need high accuracy, use the `--docling` flag:
-
-```bash
-~/.claude/skills/pdf-to-markdown/.venv/bin/python \
-  ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py \
-  document.pdf --docling --stdout
-```
-
-**When to use `--docling`:**
-- PDF has complex tables (borderless, merged cells, multi-column)
-- Table accuracy is critical (medical data, financial reports)
-- You're seeing garbled table output in default mode
-
-**Trade-offs:**
-- ~1 second per page (vs instant for fast mode)
-- First run downloads AI models (~500MB one-time)
-- Higher-resolution images (4x default)
-
-**Image resolution:**
-```bash
-# Default: 4x resolution (crisp images)
-... --docling --stdout
-
-# Custom resolution (2x for smaller files)
-... --docling --images-scale 2.0 --stdout
-```
-
-**Note:** `--accurate` is an alias for `--docling` for backwards compatibility.
+For complex tables, use `--docling` mode which uses IBM's TableFormer AI model.

 ## Comparison with Other Approaches

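The SKILL.md change describes the cache key as "file content hash + extraction mode". A minimal sketch of how such a key could be derived follows. It mirrors what is visible of `get_key` in the script diff (64KB chunks, SHA-256, a `size|digest|mode` string truncated to 16 hex characters). The whole-file branch for small files is an assumption reconstructed from a comment this commit removes, and `sketch_cache_key` is a hypothetical name, not the project's function.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 65536  # 64KB, as in the script diff


def sketch_cache_key(pdf_path, docling=False, images_scale=4.0):
    """Derive a path-independent cache key from file content and mode."""
    p = Path(pdf_path).resolve()
    file_size = p.stat().st_size
    hasher = hashlib.sha256()

    with open(p, "rb") as f:
        if file_size <= 2 * CHUNK_SIZE:
            # Assumption: small files are hashed in full
            hasher.update(f.read())
        else:
            # Larger files: hash only the first and last 64KB for speed
            hasher.update(f.read(CHUNK_SIZE))
            f.seek(-CHUNK_SIZE, 2)  # 2 = seek relative to end of file
            hasher.update(f.read(CHUNK_SIZE))

    # The extraction mode is part of the key, so fast and Docling
    # results for the same PDF are cached separately
    mode = f"docling_{images_scale}" if docling else "fast"
    raw = f"{file_size}|{hasher.hexdigest()}|{mode}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
```

Because the key depends only on size, sampled content, and mode, a copied or renamed PDF maps to the same cache entry. In this sketch, two large files of equal size that differ only in the middle would also share a key, which is the trade-off of sampling the head and tail.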

+128 -634

scripts/pdf_to_md.py

··· 1 1 #!/usr/bin/env python3 2 2 """ 3 3 PDF to Markdown Converter for LLM Context 4 + 4 5 Extracts entire PDF content as clean, structured markdown. 5 - Images are extracted to cache directory by default. 6 + Images are extracted to cache directory and copied to output location. 6 7 7 8 Features: 8 9 - High-accuracy table extraction using IBM Docling (TableFormer AI model) 9 10 - Aggressive persistent caching (extracts once, reuses forever) 10 - - Full PDF cached, pages sliced on demand 11 11 - Cache only cleared on explicit request or source file change 12 12 13 13 Usage: 14 14 python pdf_to_md.py <input.pdf> [output.md] 15 - python pdf_to_md.py <input.pdf> --stdout 16 - python pdf_to_md.py <input.pdf> --pages 1-5 17 - python pdf_to_md.py <input.pdf> --clear-cache 18 - python pdf_to_md.py --clear-all-cache 15 + python pdf_to_md.py <input.pdf> --docling # Accurate tables (slower) 16 + python pdf_to_md.py <input.pdf> --clear-cache # Re-extract 17 + python pdf_to_md.py --clear-all-cache # Clear entire cache 19 18 20 19 Dependencies: 21 - uv pip install docling pymupdf 20 + uv pip install pymupdf pymupdf4llm # Fast mode 21 + uv pip install docling docling-core # Docling mode (optional) 22 22 """ 23 23 24 24 import argparse ··· 46 46 pdf_path: str 47 47 docling: bool = False 48 48 images_scale: float = 4.0 49 - no_images: bool = False 50 49 51 50 52 51 @dataclass ··· 57 56 image_dir: Path | None 58 57 total_pages: int 59 58 from_cache: bool = False 59 + 60 60 61 61 # Suppress PyMuPDF's "Consider using pymupdf_layout" recommendation 62 - # This must be set before any pymupdf imports to take effect 63 62 os.environ.setdefault("PYMUPDF_SUGGEST_LAYOUT_ANALYZER", "0") 64 63 65 64 # Default cache directory ··· 72 71 73 72 74 73 class CacheManager: 75 - """Manages PDF extraction cache with clear ownership.""" 74 + """Manages PDF extraction cache.""" 76 75 77 76 def __init__(self, cache_dir: Path = None): 78 77 self.cache_dir = cache_dir or DEFAULT_CACHE_DIR 79 78 80 79 def 
get_key(self, config: ExtractionConfig) -> str: 81 - """Generate cache key from file content + size + mode (path-independent).""" 80 + """Generate cache key from file content + size + mode.""" 82 81 p = Path(config.pdf_path).resolve() 83 82 stat = p.stat() 84 83 file_size = stat.st_size 85 84 86 - # Hash content for cache key identity 87 - # For files <= 128KB: hash entire content (avoids collision for similar templates) 88 - # For larger files: hash first 64KB + last 64KB for speed 89 85 chunk_size = 65536 # 64KB 90 86 hasher = hashlib.sha256() 91 87 ··· 97 93 f.seek(-chunk_size, 2) 98 94 hasher.update(f.read(chunk_size)) 99 95 100 - # Include images_scale in mode for Docling (affects extracted image resolution) 101 - if config.docling: 102 - mode = f"docling_{config.images_scale}" 103 - else: 104 - mode = "fast" 105 - 106 - # Include no_images flag to avoid cache contamination 107 - if config.no_images: 108 - mode += "_noimages" 109 - 96 + mode = f"docling_{config.images_scale}" if config.docling else "fast" 110 97 raw = f"{file_size}|{hasher.hexdigest()}|{mode}" 111 98 return hashlib.sha256(raw.encode()).hexdigest()[:16] 112 99 ··· 114 101 """Get cache directory for a given cache key.""" 115 102 return self.cache_dir / cache_key 116 103 117 - def _get_cached_total_pages(self, cache_key: str) -> int: 118 - """Get total_pages from cache metadata without loading full content.""" 119 - cache_dir = self._get_dir(cache_key) 120 - metadata_file = cache_dir / "metadata.json" 121 - try: 122 - with open(metadata_file) as f: 123 - metadata = json.load(f) 124 - return metadata.get("total_pages", 0) 125 - except (FileNotFoundError, json.JSONDecodeError, OSError): 126 - return 0 127 - 128 104 def is_valid(self, config: ExtractionConfig) -> tuple[bool, str]: 129 - """Check if valid cache exists for this PDF. 
130 - 131 - Returns: 132 - (is_valid: bool, cache_key: str) 133 - """ 105 + """Check if valid cache exists for this PDF.""" 134 106 from extractor import EXTRACTOR_VERSION 135 107 136 108 try: ··· 165 137 except (json.JSONDecodeError, KeyError, OSError): 166 138 return False, cache_key 167 139 168 - def load( 169 - self, cache_key: str, pages: list = None, no_images: bool = False 170 - ) -> ExtractionResult | None: 171 - """Load markdown from cache, optionally slice specific pages. 172 - 173 - Returns ExtractionResult or None if cache is corrupted. 174 - """ 140 + def load(self, cache_key: str) -> ExtractionResult | None: 141 + """Load markdown from cache.""" 175 142 cache_dir = self._get_dir(cache_key) 176 143 177 144 try: ··· 191 158 pass 192 159 return None 193 160 194 - # Check for cached images (skip if no_images flag is set) 161 + # Get cached images directory 195 162 image_dir = None 196 - if not no_images: 197 - cached_image_dir = cache_dir / "images" 198 - if cached_image_dir.exists() and any(cached_image_dir.iterdir()): 199 - image_dir = cached_image_dir 200 - else: 201 - # Strip image references when --no-images requested but cache has them 202 - # This prevents broken links when loading from a cache with images 203 - full_md = self._strip_image_references(full_md) 204 - 205 - # Slice pages if requested 206 - if pages: 207 - full_md = slice_pages_from_markdown(full_md, pages, total_pages) 163 + cached_image_dir = cache_dir / "images" 164 + if cached_image_dir.exists() and any(cached_image_dir.iterdir()): 165 + image_dir = cached_image_dir 208 166 209 167 return ExtractionResult( 210 168 markdown=full_md, ··· 213 171 from_cache=True, 214 172 ) 215 173 216 - def _strip_image_references(self, markdown: str) -> str: 217 - """Remove image references from markdown, leaving alt text as placeholder.""" 218 - 219 - def replace_image(match): 220 - alt_text = match.group(1) 221 - if alt_text: 222 - return f"[Image: {alt_text}]" 223 - return "[Image]" 224 - 225 - 
pattern = r"!\[([^\]]*)\]\([^)]+\)" 226 - return re.sub(pattern, replace_image, markdown) 227 - 228 174 def _normalize_image_paths(self, markdown: str, source_image_dir: Path) -> str: 229 175 """Normalize image paths in markdown to use relative 'images/' prefix.""" 230 176 if not source_image_dir: ··· 243 189 pattern = r"!\[([^\]]*)\]\(([^)]+)\)" 244 190 return re.sub(pattern, normalize_ref, markdown) 245 191 246 - def save( 247 - self, 248 - cache_key: str, 249 - result: ExtractionResult, 250 - config: ExtractionConfig, 251 - ): 192 + def save(self, cache_key: str, result: ExtractionResult, config: ExtractionConfig): 252 193 """Save full extraction to cache using atomic writes.""" 253 194 from extractor import EXTRACTOR_VERSION 254 195 ··· 259 200 if result.image_dir: 260 201 markdown = self._normalize_image_paths(markdown, result.image_dir) 261 202 262 - # Build metadata 263 203 p = Path(config.pdf_path).resolve() 264 204 stat = p.stat() 265 205 mode = f"docling_{config.images_scale}" if config.docling else "fast" 266 - if config.no_images: 267 - mode += "_noimages" 268 206 269 207 metadata = { 270 208 "source_path": str(p), ··· 276 214 "extractor_version": EXTRACTOR_VERSION, 277 215 "mode": mode, 278 216 "images_scale": config.images_scale if config.docling else None, 279 - "no_images": config.no_images, 280 217 } 281 218 282 219 temp_md = None ··· 322 259 if temp_json and os.path.exists(temp_json): 323 260 os.unlink(temp_json) 324 261 325 - def find_by_source( 326 - self, pdf_path: str, docling: bool = None, images_scale: float = None 327 - ) -> list: 328 - """Find cache entries by source path in metadata. 329 - 330 - Used as fallback when the source PDF no longer exists. 331 - Returns list of (cache_dir, metadata) tuples sorted by cached_at (freshest first). 
332 - """ 333 - if not self.cache_dir.exists(): 334 - return [] 335 - 336 - pdf_path_resolved = str(Path(pdf_path).resolve()) 337 - matching = [] 338 - 339 - for entry in self.cache_dir.iterdir(): 340 - if not entry.is_dir(): 341 - continue 342 - metadata_file = entry / "metadata.json" 343 - if not metadata_file.exists(): 344 - continue 345 - try: 346 - with open(metadata_file) as f: 347 - metadata = json.load(f) 348 - 349 - if metadata.get("source_path") != pdf_path_resolved: 350 - continue 351 - 352 - if docling is not None: 353 - cached_mode = metadata.get("mode") 354 - if cached_mode is None: 355 - continue 356 - if docling and not cached_mode.startswith("docling"): 357 - continue 358 - # Match fast mode including fast_noimages variant 359 - if not docling and not cached_mode.startswith("fast"): 360 - continue 361 - 362 - if docling and images_scale is not None: 363 - cached_scale = metadata.get("images_scale") 364 - if cached_scale is not None and cached_scale != images_scale: 365 - continue 366 - 367 - matching.append((entry, metadata)) 368 - except (json.JSONDecodeError, OSError): 369 - continue 370 - 371 - matching.sort(key=lambda x: x[1].get("cached_at", ""), reverse=True) 372 - return matching 373 - 374 262 def clear(self, pdf_path: str = None) -> bool: 375 - """Clear cache for specific PDF (all modes and scale variants) or entire cache.""" 263 + """Clear cache for specific PDF or entire cache.""" 376 264 if pdf_path: 377 - cleared = False 378 - 379 265 try: 380 266 config = ExtractionConfig(pdf_path=pdf_path) 381 267 cache_key = self.get_key(config) 382 268 cache_dir = self._get_dir(cache_key) 383 269 if cache_dir.exists(): 384 270 shutil.rmtree(cache_dir) 385 - cleared = True 271 + return True 386 272 except (FileNotFoundError, OSError): 387 273 pass 388 - 389 - matching_caches = self.find_by_source(pdf_path) 390 - for cache_dir, _metadata in matching_caches: 391 - if cache_dir.exists(): 392 - shutil.rmtree(cache_dir) 393 - cleared = True 394 - 395 - 
return cleared 274 + return False 396 275 else: 397 276 if self.cache_dir.exists(): 398 277 shutil.rmtree(self.cache_dir) ··· 427 306 428 307 429 308 class ImageManager: 430 - """Manages image extraction and cleanup with proper lifecycle.""" 309 + """Manages image extraction and cleanup.""" 431 310 432 311 def __init__(self): 433 312 self._temp_dirs: list[Path] = [] ··· 447 326 shutil.rmtree(temp_dir) 448 327 self._temp_dirs.clear() 449 328 450 - def __enter__(self): 451 - return self 452 - 453 - def __exit__(self, *args): 454 - self.cleanup() 455 - 456 329 def extract_references(self, markdown: str) -> set: 457 330 """Extract the set of image filenames referenced in markdown.""" 458 331 pattern = r"!\[[^\]]*\]\(([^)]+)\)" ··· 461 334 462 335 def get_info(self, image_dir: Path, referenced_only: set = None) -> list: 463 336 """Get information about extracted images.""" 464 - if not image_dir: 337 + if not image_dir or not Path(image_dir).exists(): 465 338 return [] 466 339 467 340 image_dir = Path(image_dir) 468 - if not image_dir.exists(): 469 - return [] 470 - 471 341 images = [] 342 + 472 343 for img_path in sorted(image_dir.glob("*")): 473 - if img_path.suffix.lower() in ( 474 - ".png", 475 - ".jpg", 476 - ".jpeg", 477 - ".gif", 478 - ".bmp", 479 - ".webp", 480 - ): 344 + if img_path.suffix.lower() in (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp"): 481 345 if referenced_only is not None and img_path.name not in referenced_only: 482 346 continue 483 347 ··· 487 351 488 352 try: 489 353 import pymupdf 490 - 491 354 pix = pymupdf.Pixmap(str(img_path)) 492 355 dimensions = f"{pix.width}x{pix.height}" 493 356 pix = None 494 357 except Exception: 495 358 dimensions = "unknown" 496 359 497 - images.append( 498 - { 499 - "filename": img_path.name, 500 - "path": str(img_path), 501 - "size_kb": round(size_kb, 1), 502 - "dimensions": dimensions, 503 - } 504 - ) 360 + images.append({ 361 + "filename": img_path.name, 362 + "path": str(img_path), 363 + "size_kb": 
round(size_kb, 1), 364 + "dimensions": dimensions, 365 + }) 505 366 except Exception: 506 367 pass 507 368 ··· 525 386 size_kb = round(full_path.stat().st_size / 1024, 1) 526 387 try: 527 388 import pymupdf 528 - 529 389 pix = pymupdf.Pixmap(str(full_path)) 530 390 dims = f"{pix.width}x{pix.height}" 531 391 pix = None ··· 565 425 return "\n".join(lines) 566 426 567 427 def finalize_images( 568 - self, 569 - temp_dir: Path, 570 - cache_dir: Path = None, 571 - output_dir: Path = None, 572 - no_cache: bool = False, 573 - show_progress: bool = False, 428 + self, temp_dir: Path, cache_dir: Path, output_dir: Path, show_progress: bool = False 574 429 ) -> Path | None: 575 430 """Finalize image directory after extraction. 576 431 577 - Handles: 578 - - Copying to output location when --no-cache 579 - - Returning cached location when caching enabled 580 - - Cleaning up temp directories 432 + Copies images from cache to output location. 433 + Cleans up temp directories. 581 434 582 435 Returns the final image directory to use for output. 
583 436 """ ··· 585 438 return None 586 439 587 440 temp_dir = Path(temp_dir) 441 + 442 + # Clean up empty temp directories 588 443 if not temp_dir.exists() or not any(temp_dir.iterdir()): 444 + if temp_dir.exists(): 445 + shutil.rmtree(temp_dir) 446 + if temp_dir in self._temp_dirs: 447 + self._temp_dirs.remove(temp_dir) 589 448 return None 590 449 591 - if no_cache: 592 - if output_dir: 593 - # Copy images to output location 594 - output_images_dir = Path(str(output_dir).rsplit(".", 1)[0] + "_images") 595 - if output_images_dir.exists(): 596 - shutil.rmtree(output_images_dir) 597 - shutil.copytree(temp_dir, output_images_dir) 598 - # Clean up temp directory 599 - if temp_dir.exists(): 600 - shutil.rmtree(temp_dir) 601 - # Remove from tracking since we cleaned it up 602 - if temp_dir in self._temp_dirs: 603 - self._temp_dirs.remove(temp_dir) 604 - if show_progress: 605 - print(f"Images copied to: {output_images_dir}", file=sys.stderr) 606 - return output_images_dir 607 - else: 608 - # Outputting to stdout with --no-cache: clean up temp dir 609 - # (--no-cache contract: don't leave any files behind) 610 - print( 611 - "WARNING: --no-cache with stdout: images not available (would require temp files).", 612 - file=sys.stderr, 613 - ) 614 - if temp_dir.exists(): 615 - shutil.rmtree(temp_dir) 616 - if temp_dir in self._temp_dirs: 617 - self._temp_dirs.remove(temp_dir) 618 - return None 619 - else: 620 - # Caching enabled - use cached location if available 621 - if cache_dir: 622 - cached_image_dir = cache_dir / "images" 623 - if cached_image_dir.exists() and any(cached_image_dir.iterdir()): 624 - # Clean up temp directory 625 - if temp_dir.exists(): 626 - shutil.rmtree(temp_dir) 627 - if temp_dir in self._temp_dirs: 628 - self._temp_dirs.remove(temp_dir) 629 - return cached_image_dir 630 - return None 631 - 632 - 633 - # ============================================================================= 634 - # HELPER FUNCTIONS 635 - # 
============================================================================= 636 - 637 - 638 - def slice_pages_from_markdown(full_md: str, pages: list, total_pages: int) -> str: 639 - """Extract specific pages from full markdown. 640 - 641 - Uses explicit  sentinels inserted during extraction. 642 - """ 643 - page_pattern = r"\n\n" 644 - parts = re.split(page_pattern, full_md) 645 - 646 - if len(parts) <= 1: 647 - return full_md 648 - 649 - selected_parts = [] 650 - for page_num in pages: 651 - if 0 <= page_num < len(parts): 652 - selected_parts.append(parts[page_num]) 450 + # Clean up temp directory (images are saved to cache) 451 + if temp_dir.exists(): 452 + shutil.rmtree(temp_dir) 453 + if temp_dir in self._temp_dirs: 454 + self._temp_dirs.remove(temp_dir) 653 455 654 - if not selected_parts: 655 - return full_md 456 + # Use cached images 457 + if cache_dir: 458 + cached_image_dir = cache_dir / "images" 459 + if cached_image_dir.exists() and any(cached_image_dir.iterdir()): 460 + return cached_image_dir 656 461 657 - return "\n\n".join(selected_parts) 462 + return None 658 463 659 464 660 465 # ============================================================================= 661 - # PDF PROCESSING FUNCTIONS 466 + # PDF PROCESSING 662 467 # ============================================================================= 663 468 664 469 665 470 def check_dependencies(docling_mode: bool = False): 666 - """ 667 - Check if required packages are installed for the requested mode. 668 - 669 - Args: 670 - docling_mode: If True, check for Docling dependencies. 671 - If False, check for fast mode (PyMuPDF) dependencies. 
672 - """ 471 + """Check if required packages are installed.""" 673 472 missing = [] 674 473 675 - # pymupdf is always needed (for page count, image extraction in fast mode) 676 474 try: 677 475 import pymupdf 678 476 except ImportError: 679 477 missing.append("pymupdf") 680 478 681 479 if docling_mode: 682 - # Docling mode requires docling + docling_core 683 480 try: 684 481 import docling 685 482 except ImportError: ··· 692 489 693 490 install_cmd = "uv pip install pymupdf docling docling-core" 694 491 else: 695 - # Fast mode requires pymupdf4llm 696 492 try: 697 493 import pymupdf4llm 698 494 except ImportError: ··· 708 504 return True 709 505 710 506 711 - class PageRangeError(ValueError): 712 - """Error raised when page range string is invalid.""" 713 - 714 - pass 715 - 716 - 717 - def parse_page_range(page_str, total_pages): 718 - """Parse page range string like '1-5' or '1,3,5-7'. 719 - 720 - Raises: 721 - PageRangeError: If the page range string is invalid (non-numeric, invalid range, etc.) 722 - """ 723 - if not page_str: 724 - return None 725 - 726 - pages = [] 727 - requested_any = False # Track if user specified at least one page token 728 - 729 - for part in page_str.split(","): 730 - part = part.strip() 731 - if not part: 732 - continue 733 - 734 - if "-" in part: 735 - parts = part.split("-", 1) 736 - if len(parts) != 2 or not parts[0] or not parts[1]: 737 - raise PageRangeError( 738 - f"Invalid range '{part}'. Expected format: start-end (e.g., '1-5')" 739 - ) 740 - try: 741 - start = int(parts[0]) 742 - end = int(parts[1]) 743 - except ValueError: 744 - raise PageRangeError( 745 - f"Invalid range '{part}'. Page numbers must be integers." 746 - ) 747 - 748 - if start > end: 749 - raise PageRangeError( 750 - f"Invalid range '{part}'. Start page ({start}) cannot be greater than end page ({end})." 751 - ) 752 - if start < 1: 753 - raise PageRangeError( 754 - f"Invalid range '{part}'. Page numbers must be >= 1." 
755 - ) 756 - 757 - requested_any = True # User specified a valid range 758 - # Convert to 0-indexed 759 - start_idx = start - 1 760 - # end is inclusive, so no -1 for end 761 - pages.extend(range(start_idx, min(end, total_pages))) 762 - else: 763 - try: 764 - page = int(part) 765 - except ValueError: 766 - raise PageRangeError( 767 - f"Invalid page number '{part}'. Page numbers must be integers." 768 - ) 769 - if page < 1: 770 - raise PageRangeError( 771 - f"Invalid page number '{part}'. Page numbers must be >= 1." 772 - ) 773 - requested_any = True # User specified a valid page number 774 - pages.append(page - 1) 775 - 776 - result = sorted(set(p for p in pages if 0 <= p < total_pages)) 777 - 778 - if not result and requested_any: 779 - # User specified pages but all are out of range 780 - raise PageRangeError( 781 - f"All requested pages are out of range. Document has {total_pages} pages." 782 - ) 783 - 784 - return result 785 - 786 - 787 - def convert_pdf( 788 - pdf_path, 789 - image_dir=None, 790 - no_images=False, 791 - show_progress=False, 792 - docling=False, 793 - images_scale=4.0, 794 - ): 795 - """ 796 - Convert PDF to markdown. 
797 - 798 - Args: 799 - pdf_path: Path to PDF file 800 - image_dir: Directory to extract images to 801 - no_images: Skip image extraction 802 - show_progress: Show progress output 803 - docling: Use Docling AI for high-accuracy tables (slower) 804 - images_scale: Image resolution multiplier for Docling mode (default: 4.0) 805 - """ 507 + def convert_pdf(pdf_path, image_dir, show_progress=False, docling=False, images_scale=4.0): 508 + """Convert PDF to markdown.""" 806 509 if docling: 807 510 from extractor import extract_pdf_docling 808 511 809 - # Docling extracts both text and images together 810 512 markdown, _image_paths = extract_pdf_docling( 811 513 pdf_path, 812 - output_dir=image_dir if not no_images else None, 514 + output_dir=image_dir, 813 515 images_scale=images_scale, 814 516 show_progress=show_progress, 815 517 ) ··· 817 519 else: 818 520 from extractor import extract_pdf_fast 819 521 820 - # Fast mode: pymupdf4llm handles both text and image extraction 821 522 markdown = extract_pdf_fast( 822 523 pdf_path, 823 - image_dir=image_dir if not no_images else None, 524 + image_dir=image_dir, 824 525 show_progress=show_progress, 825 526 ) 826 - 827 527 return markdown 828 528 829 529 830 - def add_metadata_header( 831 - markdown, pdf_path, total_pages, pages_extracted, image_dir=None, cached=False 832 - ): 530 + def add_metadata_header(markdown, pdf_path, total_pages, image_dir=None, cached=False): 833 531 """Add metadata header to markdown output.""" 834 532 filename = os.path.basename(pdf_path) 835 533 ··· 837 535 "---", 838 536 f"source: {filename}", 839 537 f"total_pages: {total_pages}", 840 - f"pages_extracted: {pages_extracted}", 841 538 f"extracted_at: {datetime.now().isoformat()}", 842 539 ] 843 540 ··· 864 561 epilog=""" 865 562 Examples: 866 563 python pdf_to_md.py document.pdf # Output to document.md (cached) 867 - python pdf_to_md.py document.pdf --stdout # Print to stdout 868 - python pdf_to_md.py document.pdf --pages 1-10 # Only pages 1-10 
(from cache) 869 - python pdf_to_md.py document.pdf --no-cache # Bypass cache 870 - python pdf_to_md.py document.pdf --clear-cache # Clear cache for this PDF 564 + python pdf_to_md.py document.pdf output.md # Custom output path 565 + python pdf_to_md.py document.pdf --docling # Accurate tables (slower) 566 + python pdf_to_md.py document.pdf --clear-cache # Clear cache and re-extract 871 567 python pdf_to_md.py --clear-all-cache # Clear entire cache 872 568 873 569 Caching: 874 570 PDFs are cached in ~/.cache/pdf-to-markdown/ 875 - Cache is keyed by file path + size + modification time. 876 - Full PDF is always extracted and cached; --pages slices from cache. 571 + Cache is keyed by file content hash + extraction mode. 877 572 Cache persists until explicitly cleared or source PDF changes. 878 573 """, 879 574 ) 880 575 881 576 parser.add_argument("input", nargs="?", help="Input PDF file path") 882 - parser.add_argument( 883 - "output", nargs="?", help="Output markdown file path (default: <input>.md)" 884 - ) 885 - parser.add_argument( 886 - "--stdout", action="store_true", help="Print to stdout instead of file" 887 - ) 888 - parser.add_argument( 889 - "--pages", 890 - help='Page range to extract (e.g., "1-5" or "1,3,5-7"). 
Note: only effective with fast mode (pymupdf4llm); Docling mode always extracts full document.', 891 - ) 577 + parser.add_argument("output", nargs="?", help="Output markdown file path (default: <input>.md)") 892 578 parser.add_argument( 893 579 "--docling", 894 580 "--accurate", ··· 896 582 dest="docling", 897 583 help="Use Docling AI for complex/borderless tables (slower, ~1 sec/page)", 898 584 ) 899 - parser.add_argument( 900 - "--images-scale", 901 - type=float, 902 - default=4.0, 903 - help="Image resolution multiplier for Docling mode (default: 4.0)", 904 - ) 905 - parser.add_argument( 906 - "--no-images", action="store_true", help="Skip image extraction (faster)" 907 - ) 908 - parser.add_argument( 909 - "--no-metadata", action="store_true", help="Skip metadata header" 910 - ) 911 - parser.add_argument( 912 - "--no-progress", action="store_true", help="Disable progress indicator" 913 - ) 585 + parser.add_argument("--no-progress", action="store_true", help="Disable progress indicator") 914 586 915 587 # Cache options 916 588 parser.add_argument( 917 - "--no-cache", 918 - action="store_true", 919 - help="Bypass cache entirely (no read or write)", 920 - ) 921 - parser.add_argument( 922 589 "--clear-cache", 923 590 action="store_true", 924 591 help="Clear cache for this PDF before processing", ··· 928 595 action="store_true", 929 596 help="Clear entire cache directory and exit", 930 597 ) 931 - parser.add_argument( 932 - "--cache-stats", action="store_true", help="Show cache statistics and exit" 933 - ) 934 - parser.add_argument( 935 - "--force-stale-cache", 936 - action="store_true", 937 - help="Use cached extraction even if extractor version differs (when PDF missing)", 938 - ) 598 + parser.add_argument("--cache-stats", action="store_true", help="Show cache statistics and exit") 939 599 940 600 args = parser.parse_args() 941 601 942 - # Initialize cache manager 943 602 cache_mgr = CacheManager() 944 603 945 - # Handle cache management commands first 604 + # 
Handle cache management commands 946 605 if args.clear_all_cache: 947 606 if cache_mgr.clear(): 948 607 print(f"Cache cleared: {cache_mgr.cache_dir}", file=sys.stderr) ··· 961 620 if not args.input: 962 621 parser.error("the following arguments are required: input") 963 622 964 - # Handle --clear-cache before existence check (allows clearing cache for deleted PDFs) 623 + # Handle --clear-cache 965 624 if args.clear_cache: 966 625 if cache_mgr.clear(args.input): 967 626 print(f"Cache cleared for: {args.input}", file=sys.stderr) 968 627 else: 969 628 print(f"No cache found for: {args.input}", file=sys.stderr) 970 - # If only clearing cache and file doesn't exist, exit successfully 971 - if not os.path.exists(args.input): 972 - sys.exit(0) 973 629 974 - # Validate input and check for cache fallback 975 - pdf_exists = os.path.exists(args.input) 976 - cache_fallback = False 977 - fallback_cache_dir = None 978 - 979 - if not pdf_exists: 980 - # Try to find cached extraction by source path, filtered by requested mode 981 - if not args.no_cache: 982 - # First try to find cache matching the requested mode/scale 983 - matching_caches = cache_mgr.find_by_source( 984 - args.input, 985 - docling=args.docling, 986 - images_scale=args.images_scale if args.docling else None, 987 - ) 988 - # If no exact match, try any cache for this file 989 - if not matching_caches: 990 - matching_caches = cache_mgr.find_by_source(args.input) 991 - 992 - if matching_caches: 993 - # Use the freshest matching cache (already sorted by cached_at desc) 994 - fallback_cache_dir, fallback_metadata = matching_caches[0] 995 - 996 - # Check extractor version compatibility 997 - from extractor import EXTRACTOR_VERSION 998 - 999 - cached_version = fallback_metadata.get("extractor_version") 1000 - if cached_version != EXTRACTOR_VERSION and not args.force_stale_cache: 1001 - print( 1002 - f"ERROR: Cached extraction version mismatch", 1003 - file=sys.stderr, 1004 - ) 1005 - print( 1006 - f" Cached: 
{cached_version}, Current: {EXTRACTOR_VERSION}", 1007 - file=sys.stderr, 1008 - ) 1009 - print( 1010 - f" Re-extract with original PDF or use --force-stale-cache", 1011 - file=sys.stderr, 1012 - ) 1013 - sys.exit(1) 630 + # Validate input exists 631 + if not os.path.exists(args.input): 632 + print(f"ERROR: File not found: {args.input}", file=sys.stderr) 633 + sys.exit(1) 1014 634 1015 - cache_fallback = True 1016 - cached_mode = fallback_metadata.get("mode", "unknown") 1017 - version_warning = "" 1018 - if cached_version != EXTRACTOR_VERSION: 1019 - version_warning = f" [version {cached_version}, current is {EXTRACTOR_VERSION}]" 1020 - print( 1021 - f"WARNING: Source PDF not found, using cached extraction ({cached_mode} mode){version_warning}", 1022 - file=sys.stderr, 1023 - ) 1024 - else: 1025 - print(f"ERROR: File not found: {args.input}", file=sys.stderr) 1026 - print( 1027 - " (No cached extraction available either)", 1028 - file=sys.stderr, 1029 - ) 1030 - sys.exit(1) 1031 - else: 1032 - print(f"ERROR: File not found: {args.input}", file=sys.stderr) 1033 - sys.exit(1) 1034 - 1035 - if pdf_exists and not args.input.lower().endswith(".pdf"): 635 + if not args.input.lower().endswith(".pdf"): 1036 636 print(f"WARNING: File may not be a PDF: {args.input}", file=sys.stderr) 1037 637 1038 - # Error if --pages used with --docling (page slicing not supported) 1039 - # Check early before any expensive operations 1040 - if args.pages and args.docling: 1041 - print( 1042 - "ERROR: --pages is not supported in Docling mode (no page delimiters in output). 
" 1043 - "Use fast mode (default) for page slicing, or omit --pages.", 1044 - file=sys.stderr, 1045 - ) 1046 - sys.exit(1) 638 + show_progress = sys.stderr.isatty() and not args.no_progress 1047 639 1048 - # Determine if progress should be shown 1049 - show_progress = sys.stderr.isatty() and not args.no_progress and not args.stdout 640 + # Check cache 641 + config = ExtractionConfig(pdf_path=args.input, docling=args.docling) 642 + valid, cache_key = cache_mgr.is_valid(config) 1050 643 1051 - # Check cache FIRST (before dependency check) 1052 - # This allows using cached results even without extraction dependencies installed 1053 - cache_hit = False 1054 - cache_key = "" 1055 644 result = None 1056 645 image_dir = None 1057 - total_pages = 0 646 + cache_hit = False 647 + 648 + if valid: 649 + if show_progress: 650 + mode = "docling" if args.docling else "fast" 651 + print(f"Loading from cache ({mode} mode)...", file=sys.stderr) 1058 652 1059 - if cache_fallback: 1060 - # PDF missing, but we found a cached extraction 1061 - cache_key = fallback_cache_dir.name 1062 - total_pages = fallback_metadata.get("total_pages", 0) 1063 - cache_hit = True # Will load below after page range parsing 1064 - elif not args.no_cache: 1065 - # Check if we have a valid cache for this PDF 1066 - config = ExtractionConfig( 1067 - pdf_path=args.input, 1068 - docling=args.docling, 1069 - images_scale=args.images_scale, 1070 - no_images=args.no_images, 1071 - ) 1072 - valid, cache_key = cache_mgr.is_valid(config) 1073 - if valid: 1074 - # Get total_pages from cache metadata (doesn't load full content) 1075 - total_pages = cache_mgr._get_cached_total_pages(cache_key) 1076 - if total_pages > 0: 1077 - cache_hit = True # Will load below after page range parsing 653 + cache_result = cache_mgr.load(cache_key) 654 + if cache_result: 655 + result = cache_result.markdown 656 + image_dir = cache_result.image_dir 657 + total_pages = cache_result.total_pages 658 + cache_hit = True 1078 659 1079 - # If 
no cache hit, we need dependencies and page count from PDF 660 + # Extract if no cache hit 1080 661 if not cache_hit: 1081 - # Check dependencies only when extraction is needed 1082 662 if not check_dependencies(docling_mode=args.docling): 1083 663 sys.exit(1) 1084 664 ··· 1086 666 1087 667 total_pages = get_page_count(args.input) 1088 668 1089 - # Parse page range (for output slicing, not extraction) 1090 - try: 1091 - requested_pages = ( 1092 - parse_page_range(args.pages, total_pages) if args.pages else None 1093 - ) 1094 - except PageRangeError as e: 1095 - print(f"ERROR: {e}", file=sys.stderr) 1096 - print( 1097 - " Expected format: 1-5 or 1,3,5-7 (page numbers start at 1)", 1098 - file=sys.stderr, 1099 - ) 1100 - sys.exit(1) 1101 - 1102 - pages_to_output = len(requested_pages) if requested_pages else total_pages 1103 - 1104 - # Now load from cache if we had a cache hit 1105 - if cache_hit: 1106 - if show_progress: 1107 - mode = "docling" if args.docling else "fast" 1108 - print(f"Loading from cache ({mode} mode)...", file=sys.stderr) 1109 - cache_result = cache_mgr.load( 1110 - cache_key, requested_pages, no_images=args.no_images 1111 - ) 1112 - if cache_result is None: 1113 - if cache_fallback: 1114 - # Cache was corrupted and PDF doesn't exist - can't recover 1115 - print( 1116 - "ERROR: Cache was corrupted and source PDF is not available.", 1117 - file=sys.stderr, 1118 - ) 1119 - sys.exit(1) 1120 - else: 1121 - # Cache was corrupted, treat as cache miss 1122 - cache_hit = False 1123 - # Need to check dependencies now since we're going to extract 1124 - if not check_dependencies(docling_mode=args.docling): 1125 - sys.exit(1) 1126 - else: 1127 - result = cache_result.markdown 1128 - image_dir = cache_result.image_dir 1129 - 1130 - # Determine output path early (needed for image handling) 1131 - output_path = None 1132 - if not args.stdout: 1133 - output_path = args.output or os.path.splitext(args.input)[0] + ".md" 1134 - 1135 - # Use ImageManager for 
extraction with automatic cleanup 1136 - img_mgr = ImageManager() 1137 - 1138 - # If no cache hit, extract full PDF 1139 - if not cache_hit: 1140 - # Build config if we don't have it 1141 - if "config" not in dir() or config is None: 1142 - config = ExtractionConfig( 1143 - pdf_path=args.input, 1144 - docling=args.docling, 1145 - images_scale=args.images_scale, 1146 - no_images=args.no_images, 1147 - ) 1148 - 1149 - # Get cache key if we don't have it 1150 669 if not cache_key: 1151 670 cache_key = cache_mgr.get_key(config) 1152 671 1153 - # Setup image directory for extraction (temporary) 1154 - temp_image_dir = None 1155 - if not args.no_images: 1156 - temp_image_dir = img_mgr.create_temp_dir(args.input) 672 + img_mgr = ImageManager() 673 + temp_image_dir = img_mgr.create_temp_dir(args.input) 1157 674 1158 - # Extract FULL PDF 1159 675 try: 1160 676 if show_progress: 1161 677 if args.docling: ··· 1172 688 result = convert_pdf( 1173 689 args.input, 1174 690 image_dir=temp_image_dir, 1175 - no_images=args.no_images, 1176 691 show_progress=show_progress, 1177 692 docling=args.docling, 1178 - images_scale=args.images_scale, 1179 693 ) 1180 694 except Exception as e: 1181 - img_mgr.cleanup() # Clean up on error 695 + img_mgr.cleanup() 1182 696 print(f"ERROR: Conversion failed: {e}", file=sys.stderr) 1183 697 sys.exit(1) 1184 698 1185 - # Save to cache (full result) 1186 - if not args.no_cache: 1187 - extraction_result = ExtractionResult( 1188 - markdown=result, 1189 - image_dir=temp_image_dir, 1190 - total_pages=total_pages, 1191 - ) 1192 - cache_mgr.save(cache_key, extraction_result, config) 1193 - if show_progress: 1194 - print(f"Cached: {cache_mgr._get_dir(cache_key)}", file=sys.stderr) 1195 - 1196 - # Finalize image directory 1197 - if not args.no_images: 1198 - image_dir = img_mgr.finalize_images( 1199 - temp_dir=temp_image_dir, 1200 - cache_dir=cache_mgr._get_dir(cache_key) if not args.no_cache else None, 1201 - output_dir=output_path, 1202 - 
no_cache=args.no_cache, 1203 - show_progress=show_progress, 1204 - ) 699 + # Save to cache 700 + extraction_result = ExtractionResult( 701 + markdown=result, 702 + image_dir=temp_image_dir, 703 + total_pages=total_pages, 704 + ) 705 + cache_mgr.save(cache_key, extraction_result, config) 706 + if show_progress: 707 + print(f"Cached: {cache_mgr._get_dir(cache_key)}", file=sys.stderr) 1205 708 1206 - # Slice pages if requested (after caching full result) 1207 - if requested_pages: 1208 - result = slice_pages_from_markdown(result, requested_pages, total_pages) 709 + # Finalize images 710 + output_path = args.output or os.path.splitext(args.input)[0] + ".md" 711 + image_dir = img_mgr.finalize_images( 712 + temp_dir=temp_image_dir, 713 + cache_dir=cache_mgr._get_dir(cache_key), 714 + output_dir=output_path, 715 + show_progress=show_progress, 716 + ) 1209 717 1210 718 # Format output 1211 719 output = result 720 + img_mgr_for_output = ImageManager() # Fresh instance for output processing 1212 721 1213 - # Extract referenced images before enhancement (for filtering summary) 1214 - referenced_images = img_mgr.extract_references(result) if result else set() 722 + referenced_images = img_mgr_for_output.extract_references(result) if result else set() 1215 723 1216 - # Enhance image references with full paths (skip if --no-images) 1217 - if image_dir and not args.no_images: 1218 - output = img_mgr.enhance_markdown(output, image_dir) 1219 - 1220 - # Add image summary table at the end (filtered to referenced images only) 1221 - images = img_mgr.get_info(image_dir, referenced_only=referenced_images) 724 + if image_dir: 725 + output = img_mgr_for_output.enhance_markdown(output, image_dir) 726 + images = img_mgr_for_output.get_info(image_dir, referenced_only=referenced_images) 1222 727 if images: 1223 - output += img_mgr.create_summary(images) 728 + output += img_mgr_for_output.create_summary(images) 1224 729 1225 - if not args.no_metadata: 1226 - output = add_metadata_header( 1227 
- output, 1228 - args.input, 1229 - total_pages, 1230 - pages_to_output, 1231 - image_dir, 1232 - cached=cache_hit, 1233 - ) 730 + output = add_metadata_header(output, args.input, total_pages, image_dir, cached=cache_hit) 1234 731 1235 732 # Write output 1236 - if args.stdout: 1237 - print(output) 1238 - else: 1239 - with open(output_path, "w", encoding="utf-8") as f: 1240 - f.write(output) 733 + output_path = args.output or os.path.splitext(args.input)[0] + ".md" 734 + with open(output_path, "w", encoding="utf-8") as f: 735 + f.write(output) 1241 736 1242 - msg = f"Converted {pages_to_output} pages to: {output_path}" 1243 - if cache_hit: 1244 - msg += " (from cache)" 1245 - if image_dir and not args.no_images: 1246 - # Use the same filtered image set for consistency 1247 - images = img_mgr.get_info(image_dir, referenced_only=referenced_images) 1248 - if images: 1249 - msg += f" ({len(images)} images)" 1250 - print(msg, file=sys.stderr) 737 + msg = f"Converted {total_pages} pages to: {output_path}" 738 + if cache_hit: 739 + msg += " (from cache)" 740 + if image_dir: 741 + images = img_mgr_for_output.get_info(image_dir, referenced_only=referenced_images) 742 + if images: 743 + msg += f" ({len(images)} images)" 744 + print(msg, file=sys.stderr) 1251 745 1252 746 1253 747 if __name__ == "__main__":
