Release v1.18.1: Enhance markdown-tools with PDF image extraction

- Add extract_pdf_images.py script using PyMuPDF
- Refactor SKILL.md for clearer workflow documentation
- Update installation to use markitdown[pdf] extra
- Update marketplace version to 1.18.1
- Update markdown-tools version to 1.1.0
- Update README/README.zh-CN with new features
- Update QUICKSTART docs with in-app install instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 18:46:15 +08:00
parent 515514b058
commit 8233430cf2
9 changed files with 264 additions and 127 deletions
--- a/markdown-tools/SKILL.md
+++ b/markdown-tools/SKILL.md
@@ -1,146 +1,93 @@
 ---
 name: markdown-tools
-description: Converts documents to markdown (PDFs, Word docs, PowerPoint, Confluence exports) with Windows/WSL path handling. Activates when converting .doc/.docx/PDF/PPTX files to markdown, processing Confluence exports, handling Windows/WSL path conversions, or working with markitdown utility.
+description: Converts documents to markdown (PDFs, Word docs, PowerPoint, Confluence exports) with Windows/WSL path handling. Activates when converting .doc/.docx/PDF/PPTX files to markdown, processing Confluence exports, handling Windows/WSL path conversions, extracting images from PDFs, or working with markitdown utility.
 ---

 # Markdown Tools

-## Overview
-
-This skill provides document conversion to markdown with Windows/WSL path handling support. It helps convert various document formats to markdown and handles path conversions between Windows and WSL environments.
-
-## Core Capabilities
-
-### 1. Markdown Conversion
-Convert documents to markdown format with automatic Windows/WSL path handling.
-
-### 2. Confluence Export Processing
-Handle Confluence .doc exports with special characters for knowledge base integration.
+Convert documents to markdown with image extraction and Windows/WSL path handling.

 ## Quick Start

-### Convert Any Document to Markdown
+### Install markitdown with PDF Support

 ```bash
-# Basic conversion
-markitdown "path/to/document.pdf" > output.md
+# IMPORTANT: Use [pdf] extra for PDF support
+uv tool install "markitdown[pdf]"

-# WSL path example
-markitdown "/mnt/c/Users/username/Documents/file.docx" > output.md
+# Or via pip
+pip install "markitdown[pdf]"
 ```

-See `references/conversion-examples.md` for detailed examples of various conversion scenarios.
-
-### Convert Confluence Export
+### Basic Conversion

 ```bash
-# Direct conversion for simple exports
-markitdown "confluence-export.doc" > output.md
-
-# For exports with special characters, see references/
+markitdown "document.pdf" -o output.md
+# Or redirect: markitdown "document.pdf" > output.md
 ```

-## Path Conversion
+## PDF Conversion with Images

-### Windows to WSL Path Format
+markitdown extracts text only. For PDFs with images, use this workflow:

-Windows paths must be converted to WSL format before use in bash commands.
-
-**Conversion rules:**
- Replace `C:\` with `/mnt/c/`
- Replace `\` with `/`
- Preserve spaces and special characters
- Use quotes for paths with spaces
-
-**Example conversions:**
-```bash
-# Windows path
-C:\Users\username\Documents\file.doc
-
-# WSL path
-/mnt/c/Users/username/Documents/file.doc
-```
-
-**Helper script:** Use `scripts/convert_path.py` to automate conversion:
+### Step 1: Convert Text

 ```bash
-python scripts/convert_path.py "C:\Users\username\Downloads\document.doc"
+markitdown "document.pdf" -o output.md
 ```

-See `references/conversion-examples.md` for detailed path conversion examples.
+### Step 2: Extract Images

-## Document Conversion Workflows
-
-### Workflow 1: Simple Markdown Conversion
-
-For straightforward document conversions (PDF, .docx without special characters):
-
-1. Convert Windows path to WSL format (if needed)
-2. Run markitdown
-3. Redirect output to .md file
-
-See `references/conversion-examples.md` for detailed examples.
-
-### Workflow 2: Confluence Export with Special Characters
-
-For Confluence .doc exports that contain special characters or complex formatting:
-
-1. Save .doc file to accessible location
-2. Use appropriate conversion method (see references)
-3. Verify output formatting
-
-See `references/conversion-examples.md` for step-by-step command examples.
-
-## Error Handling
-
-### Common Issues and Solutions
-
-**markitdown not found:**
 ```bash
-# Install markitdown via pip
-pip install markitdown
+# Create assets directory alongside the markdown
+mkdir -p assets

-# Or via uv tools
-uv tool install markitdown
+# Extract images using PyMuPDF
+uv run --with pymupdf python scripts/extract_pdf_images.py "document.pdf" ./assets
 ```

-**Path not found:**
+### Step 3: Add Image References
+
+Insert image references in the markdown where needed:
+
+```markdown
+![Description](assets/img_page1_1.png)
+```
+
+### Step 4: Format Cleanup
+
+markitdown output often needs manual fixes:
+- Add proper heading levels (`#`, `##`, `###`)
+- Reconstruct tables in markdown format
+- Fix broken line breaks
+- Restore indentation structure
+
+## Path Conversion (Windows/WSL)
+
 ```bash
-# Verify path exists
-ls -la "/mnt/c/Users/username/Documents/file.doc"
+# Windows → WSL conversion:
+#   C:\Users\name\file.pdf → /mnt/c/Users/name/file.pdf

-# Use convert_path.py helper
-python scripts/convert_path.py "C:\Users\username\Documents\file.doc"
+# Use helper script
+python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
 ```

-**Encoding issues:**
- Ensure files are UTF-8 encoded
- Check for special characters in filenames
- Use quotes around paths with spaces
+## Common Issues
+
+**"dependencies needed to read .pdf files"**
+```bash
+# Install with PDF support
+uv tool install "markitdown[pdf]" --force
+```
+
+**FontBBox warnings during PDF conversion**
+- These are harmless font parsing warnings; the output is still correct
+
+**Images missing from output**
+- Use `scripts/extract_pdf_images.py` to extract images separately

 ## Resources

-### references/conversion-examples.md
-Comprehensive examples for all conversion scenarios including:
- Simple document conversions (PDF, Word, PowerPoint)
- Confluence export handling
- Path conversion examples for Windows/WSL
- Batch conversion operations
- Error recovery and troubleshooting examples
-
-Load this reference when users need specific command examples or encounter conversion issues.
-
-### scripts/convert_path.py
-Python script to automate Windows to WSL path conversion. Handles:
- Drive letter conversion (C:\ → /mnt/c/)
- Backslash to forward slash
- Special characters and spaces
-
-## Best Practices
-
-1. **Convert Windows paths to WSL format** before bash operations
-2. **Verify paths exist** before operations using ls or test commands
-3. **Check output quality** after conversion
-4. **Use markitdown directly** for simple conversions
-5. **Test incrementally** - Verify each conversion step before proceeding
-6. **Preserve directory structure** when doing batch conversions
+- `scripts/extract_pdf_images.py` - Extract images from PDF using PyMuPDF
+- `scripts/convert_path.py` - Windows to WSL path converter
+- `references/conversion-examples.md` - Detailed examples for batch operations
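Note on the path conversion documented in SKILL.md: `scripts/convert_path.py` is not part of this diff, so the following is only a minimal sketch of the rule the skill describes (drive letter to `/mnt/<letter>/`, backslashes to forward slashes). The function name `windows_to_wsl` is hypothetical and is not taken from the repository.

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a Windows drive-letter path to its WSL /mnt/<drive>/ form."""
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if match is None:
        # Not a drive-letter path: only normalise the separators.
        return path.replace("\\", "/")
    drive, rest = match.groups()
    # WSL mounts drives under /mnt using the lowercase drive letter.
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


if __name__ == "__main__":
    print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
```

Spaces and other characters are passed through unchanged, so the result still needs quoting when used in a shell command.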
--- a/markdown-tools/scripts/extract_pdf_images.py
+++ b/markdown-tools/scripts/extract_pdf_images.py
@@ -0,0 +1,95 @@
+#!/usr/bin/env python3
+"""
+Extract images from PDF files using PyMuPDF.
+
+Usage:
+    uv run --with pymupdf python extract_pdf_images.py <pdf_path> [output_dir]
+
+Examples:
+    uv run --with pymupdf python extract_pdf_images.py document.pdf
+    uv run --with pymupdf python extract_pdf_images.py document.pdf ./assets
+
+Output:
+    Images are saved to output_dir (default: ./assets) in their embedded format, e.g.:
+    - img_page1_1.png
+    - img_page2_1.jpeg
+"""
+
+import sys
+import os
+
+def extract_images(pdf_path: str, output_dir: str = "assets") -> list[str]:
+    """
+    Extract all images from a PDF file.
+
+    Args:
+        pdf_path: Path to the PDF file
+        output_dir: Directory to save extracted images
+
+    Returns:
+        List of extracted image file paths
+    """
+    try:
+        import fitz  # PyMuPDF
+    except ImportError:
+        print("Error: PyMuPDF not installed. Run with:", file=sys.stderr)
+        print('  uv run --with pymupdf python extract_pdf_images.py <pdf_path>', file=sys.stderr)
+        sys.exit(1)
+
+    os.makedirs(output_dir, exist_ok=True)
+
+    doc = fitz.open(pdf_path)
+    extracted_files = []
+
+    for page_num in range(len(doc)):
+        page = doc[page_num]
+        image_list = page.get_images()
+
+        for img_index, img in enumerate(image_list):
+            xref = img[0]
+            base_image = doc.extract_image(xref)
+            image_bytes = base_image["image"]
+            image_ext = base_image["ext"]
+
+            # Create descriptive filename
+            img_filename = f"img_page{page_num + 1}_{img_index + 1}.{image_ext}"
+            img_path = os.path.join(output_dir, img_filename)
+
+            with open(img_path, "wb") as f:
+                f.write(image_bytes)
+
+            extracted_files.append(img_path)
+            print(f"Extracted: {img_filename} ({len(image_bytes):,} bytes)")
+
+    doc.close()
+
+    print(f"\nTotal: {len(extracted_files)} images extracted to {output_dir}/")
+    return extracted_files
+
+
+def main():
+    if len(sys.argv) < 2 or sys.argv[1] in ("-h", "--help"):
+        print("Extract images from PDF files using PyMuPDF.")
+        print()
+        print("Usage: python extract_pdf_images.py <pdf_path> [output_dir]")
+        print()
+        print("Arguments:")
+        print("  pdf_path    Path to the PDF file")
+        print("  output_dir  Directory to save images (default: ./assets)")
+        print()
+        print("Example:")
+        print("  uv run --with pymupdf python extract_pdf_images.py document.pdf ./assets")
+        sys.exit(0 if "--help" in sys.argv or "-h" in sys.argv else 1)
+
+    pdf_path = sys.argv[1]
+    output_dir = sys.argv[2] if len(sys.argv) > 2 else "assets"
+
+    if not os.path.exists(pdf_path):
+        print(f"Error: File not found: {pdf_path}", file=sys.stderr)
+        sys.exit(1)
+
+    extract_images(pdf_path, output_dir)
+
+
+if __name__ == "__main__":
+    main()
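Note on Step 3 of the SKILL.md workflow ("Add Image References"): the script returns the list of written paths, named `img_page<N>_<M>.<ext>`. One possible way to turn that list into markdown image lines is sketched below. The helper `image_references` and its alt-text format are assumptions for illustration and are not part of this commit.

```python
import re

# Matches the naming scheme used by extract_pdf_images.py: img_page<N>_<M>.<ext>
PATTERN = re.compile(r"img_page(\d+)_(\d+)\.[A-Za-z0-9]+$")


def image_references(image_paths: list[str]) -> list[str]:
    """Return one markdown image line per extracted file, ordered by page, then index."""
    keyed = []
    for path in image_paths:
        posix = path.replace("\\", "/")  # markdown links use forward slashes
        match = PATTERN.search(posix)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        page, index = int(match.group(1)), int(match.group(2))
        keyed.append((page, index, posix))
    # Sorting on integers keeps page 10 after page 2.
    return [f"![Page {page}, image {index}]({posix})" for page, index, posix in sorted(keyed)]


if __name__ == "__main__":
    for line in image_references(["assets/img_page2_1.jpeg", "assets/img_page1_1.png"]):
        print(line)
```

The generated lines still have to be placed by hand at the right spots in the converted markdown, since markitdown's text output carries no image positions.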