Release v1.18.1: Enhance markdown-tools with PDF image extraction

- Add extract_pdf_images.py script using PyMuPDF
- Refactor SKILL.md for clearer workflow documentation
- Update installation to use markitdown[pdf] extra
- Update marketplace version to 1.18.1
- Update markdown-tools version to 1.1.0
- Update README/README.zh-CN with new features
- Update QUICKSTART docs with in-app install instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
daymade
2025-12-28 18:46:15 +08:00
parent 515514b058
commit 8233430cf2
9 changed files with 264 additions and 127 deletions


@@ -1,146 +1,93 @@
---
name: markdown-tools
description: Converts documents to markdown (PDFs, Word docs, PowerPoint, Confluence exports) with Windows/WSL path handling. Activates when converting .doc/.docx/PDF/PPTX files to markdown, processing Confluence exports, handling Windows/WSL path conversions, or working with markitdown utility.
description: Converts documents to markdown (PDFs, Word docs, PowerPoint, Confluence exports) with Windows/WSL path handling. Activates when converting .doc/.docx/PDF/PPTX files to markdown, processing Confluence exports, handling Windows/WSL path conversions, extracting images from PDFs, or working with markitdown utility.
---
# Markdown Tools
## Overview
This skill provides document conversion to markdown with Windows/WSL path handling support. It helps convert various document formats to markdown and handles path conversions between Windows and WSL environments.
## Core Capabilities
### 1. Markdown Conversion
Convert documents to markdown format with automatic Windows/WSL path handling.
### 2. Confluence Export Processing
Handle Confluence .doc exports with special characters for knowledge base integration.
Convert documents to markdown with image extraction and Windows/WSL path handling.
## Quick Start
### Convert Any Document to Markdown
### Install markitdown with PDF Support
```bash
# Basic conversion
markitdown "path/to/document.pdf" > output.md
# IMPORTANT: Use [pdf] extra for PDF support
uv tool install "markitdown[pdf]"
# WSL path example
markitdown "/mnt/c/Users/username/Documents/file.docx" > output.md
# Or via pip
pip install "markitdown[pdf]"
```
See `references/conversion-examples.md` for detailed examples of various conversion scenarios.
### Convert Confluence Export
### Basic Conversion
```bash
# Direct conversion for simple exports
markitdown "confluence-export.doc" > output.md
# For exports with special characters, see references/
markitdown "document.pdf" -o output.md
# Or redirect: markitdown "document.pdf" > output.md
```
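The basic conversion can also be driven from a script. The following is a minimal sketch, assuming the `markitdown` CLI is on `PATH` and accepts `-o` as shown above; `markitdown_command` is a hypothetical helper name and is not part of this skill.

```python
from pathlib import Path
from typing import List, Optional


def markitdown_command(source: str, output: Optional[str] = None) -> List[str]:
    """Build the argument list for one markitdown conversion.

    Hypothetical helper: if no output path is given, the source
    file's extension is swapped for ".md".
    """
    src = Path(source)
    out = Path(output) if output else src.with_suffix(".md")
    # Passing a list (not a shell string) keeps paths with spaces intact.
    return ["markitdown", str(src), "-o", str(out)]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems for paths that contain spaces.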
## Path Conversion
## PDF Conversion with Images
### Windows to WSL Path Format
markitdown extracts text only. For PDFs with images, use this workflow:
Windows paths must be converted to WSL format before use in bash commands.
**Conversion rules:**
- Replace `C:\` with `/mnt/c/`
- Replace `\` with `/`
- Preserve spaces and special characters
- Use quotes for paths with spaces
**Example conversions:**
```bash
# Windows path
C:\Users\username\Documents\file.doc
# WSL path
/mnt/c/Users/username/Documents/file.doc
```
**Helper script:** Use `scripts/convert_path.py` to automate conversion:
### Step 1: Convert Text
```bash
python scripts/convert_path.py "C:\Users\username\Downloads\document.doc"
markitdown "document.pdf" -o output.md
```
See `references/conversion-examples.md` for detailed path conversion examples.
### Step 2: Extract Images
## Document Conversion Workflows
### Workflow 1: Simple Markdown Conversion
For straightforward document conversions (PDF, .docx without special characters):
1. Convert Windows path to WSL format (if needed)
2. Run markitdown
3. Redirect output to .md file
See `references/conversion-examples.md` for detailed examples.
### Workflow 2: Confluence Export with Special Characters
For Confluence .doc exports that contain special characters or complex formatting:
1. Save .doc file to accessible location
2. Use appropriate conversion method (see references)
3. Verify output formatting
See `references/conversion-examples.md` for step-by-step command examples.
## Error Handling
### Common Issues and Solutions
**markitdown not found:**
```bash
# Install markitdown via pip
pip install markitdown
# Create assets directory alongside the markdown
mkdir -p assets
# Or via uv tools
uv tool install markitdown
# Extract images using PyMuPDF
uv run --with pymupdf python scripts/extract_pdf_images.py "document.pdf" ./assets
```
**Path not found:**
### Step 3: Add Image References
Insert image references in the markdown where needed:
```markdown
![Description](assets/img_page1_1.png)
```
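When a PDF yields many images, the references can be generated instead of typed by hand. The sketch below assumes the `img_page<N>_<M>.<ext>` naming scheme produced by `scripts/extract_pdf_images.py`; `image_references` is a hypothetical helper, and the generated alt text is only a placeholder to be replaced with a real description.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Matches names such as img_page3_2.png (page 3, second image on that page).
_IMG_NAME = re.compile(r"img_page(\d+)_(\d+)\.\w+$")


def image_references(paths: Iterable[str], assets_dir: str = "assets") -> List[str]:
    """Return markdown image lines ordered by page, then by image index."""

    def sort_key(path: str):
        match = _IMG_NAME.search(Path(path).name)
        # Files that do not follow the naming scheme sort last.
        return (int(match.group(1)), int(match.group(2))) if match else (float("inf"), 0)

    lines = []
    for path in sorted(paths, key=sort_key):
        name = Path(path).name
        match = _IMG_NAME.search(name)
        alt = f"Page {match.group(1)}, image {match.group(2)}" if match else name
        lines.append(f"![{alt}]({assets_dir}/{name})")
    return lines
```

Sorting on the numeric page and index, rather than on the raw filename, keeps `img_page10_1.png` after `img_page2_1.png`.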
### Step 4: Format Cleanup
markitdown output often needs manual fixes:
- Add proper heading levels (`#`, `##`, `###`)
- Reconstruct tables in markdown format
- Fix broken line breaks
- Restore indentation structure
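Part of the line-break cleanup can be scripted. The following is a rough sketch, not something shipped with this skill: `join_wrapped_lines` is a hypothetical helper that merges consecutive plain-text lines into one paragraph line. It assumes that headings, list items, table rows, blockquotes, and fenced code should be left alone, so the result still needs a manual review.

```python
import re

# Lines that begin a markdown structure and must not be merged into a paragraph.
_STRUCTURAL = re.compile(r"^\s*(#{1,6}\s|[-*+]\s|\d+\.\s|\||>)")


def join_wrapped_lines(text: str) -> str:
    """Join hard-wrapped paragraph lines; keep structural lines and code fences."""
    out = []
    in_fence = False
    prev_plain = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            # Toggle fence state and copy the fence line through unchanged.
            in_fence = not in_fence
            out.append(line)
            prev_plain = False
            continue
        plain = bool(stripped) and not in_fence and not _STRUCTURAL.match(line)
        if plain and prev_plain:
            # Continuation of the previous paragraph line.
            out[-1] = out[-1].rstrip() + " " + stripped
        else:
            out.append(line)
        prev_plain = plain
    return "\n".join(out)
```

Blank lines still separate paragraphs, so only lines that were split mid-paragraph are merged.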
## Path Conversion (Windows/WSL)
```bash
# Verify path exists
ls -la "/mnt/c/Users/username/Documents/file.doc"
# Windows → WSL conversion
C:\Users\name\file.pdf → /mnt/c/Users/name/file.pdf
# Use convert_path.py helper
python scripts/convert_path.py "C:\Users\username\Documents\file.doc"
# Use helper script
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
```
**Encoding issues:**
- Ensure files are UTF-8 encoded
- Check for special characters in filenames
- Use quotes around paths with spaces
## Common Issues
**"dependencies needed to read .pdf files"**
```bash
# Install with PDF support
uv tool install "markitdown[pdf]" --force
```
**FontBBox warnings during PDF conversion**
- These are harmless font-parsing warnings; the output is still correct
**Images missing from output**
- Use `scripts/extract_pdf_images.py` to extract images separately
## Resources
### references/conversion-examples.md
Comprehensive examples for all conversion scenarios including:
- Simple document conversions (PDF, Word, PowerPoint)
- Confluence export handling
- Path conversion examples for Windows/WSL
- Batch conversion operations
- Error recovery and troubleshooting examples
Load this reference when users need specific command examples or encounter conversion issues.
### scripts/convert_path.py
Python script to automate Windows to WSL path conversion. Handles:
- Drive letter conversion (C:\ → /mnt/c/)
- Backslash to forward slash
- Special characters and spaces
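The conversion rules listed above can be sketched in a few lines. This is an illustration only and not the contents of `scripts/convert_path.py`; the function name `windows_to_wsl` and the lower-casing of the drive letter are assumptions.

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a Windows path such as C:\\Users\\name to /mnt/c/Users/name.

    Paths without a drive letter only have their backslashes replaced.
    Spaces and other characters are preserved as-is.
    """
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not match:
        return path.replace("\\", "/")
    drive = match.group(1).lower()
    rest = match.group(2).replace("\\", "/")
    return f"/mnt/{drive}/{rest}"
```

Because spaces are preserved, the converted path still needs quoting when it is used in a bash command.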
## Best Practices
1. **Convert Windows paths to WSL format** before bash operations
2. **Verify paths exist** before operations using ls or test commands
3. **Check output quality** after conversion
4. **Use markitdown directly** for simple conversions
5. **Test incrementally** - Verify each conversion step before proceeding
6. **Preserve directory structure** when doing batch conversions
- `scripts/extract_pdf_images.py` - Extract images from PDF using PyMuPDF
- `scripts/convert_path.py` - Windows to WSL path converter
- `references/conversion-examples.md` - Detailed examples for batch operations


@@ -0,0 +1,95 @@
#!/usr/bin/env python3
"""
Extract images from PDF files using PyMuPDF.

Usage:
    uv run --with pymupdf python extract_pdf_images.py <pdf_path> [output_dir]

Examples:
    uv run --with pymupdf python extract_pdf_images.py document.pdf
    uv run --with pymupdf python extract_pdf_images.py document.pdf ./assets

Output:
    Images are saved to output_dir (default: ./assets) with names like:
    - img_page1_1.png
    - img_page2_1.png
"""

import sys
import os


def extract_images(pdf_path: str, output_dir: str = "assets") -> list[str]:
    """
    Extract all images from a PDF file.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Directory to save extracted images

    Returns:
        List of extracted image file paths
    """
    try:
        import fitz  # PyMuPDF
    except ImportError:
        print("Error: PyMuPDF not installed. Run with:")
        print('  uv run --with pymupdf python extract_pdf_images.py <pdf_path>')
        sys.exit(1)

    os.makedirs(output_dir, exist_ok=True)

    doc = fitz.open(pdf_path)
    extracted_files = []

    for page_num in range(len(doc)):
        page = doc[page_num]
        image_list = page.get_images()

        for img_index, img in enumerate(image_list):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]

            # Create descriptive filename
            img_filename = f"img_page{page_num + 1}_{img_index + 1}.{image_ext}"
            img_path = os.path.join(output_dir, img_filename)

            with open(img_path, "wb") as f:
                f.write(image_bytes)

            extracted_files.append(img_path)
            print(f"Extracted: {img_filename} ({len(image_bytes):,} bytes)")

    doc.close()

    print(f"\nTotal: {len(extracted_files)} images extracted to {output_dir}/")
    return extracted_files


def main():
    if len(sys.argv) < 2 or sys.argv[1] in ("-h", "--help"):
        print("Extract images from PDF files using PyMuPDF.")
        print()
        print("Usage: python extract_pdf_images.py <pdf_path> [output_dir]")
        print()
        print("Arguments:")
        print("  pdf_path     Path to the PDF file")
        print("  output_dir   Directory to save images (default: ./assets)")
        print()
        print("Example:")
        print("  uv run --with pymupdf python extract_pdf_images.py document.pdf ./assets")
        sys.exit(0 if "--help" in sys.argv or "-h" in sys.argv else 1)

    pdf_path = sys.argv[1]
    output_dir = sys.argv[2] if len(sys.argv) > 2 else "assets"

    if not os.path.exists(pdf_path):
        print(f"Error: File not found: {pdf_path}")
        sys.exit(1)

    extract_images(pdf_path, output_dir)


if __name__ == "__main__":
    main()