Release v1.18.1: Enhance markdown-tools with PDF image extraction

- Add extract_pdf_images.py script using PyMuPDF
- Refactor SKILL.md for clearer workflow documentation
- Update installation to use markitdown[pdf] extra
- Update marketplace version to 1.18.1
- Update markdown-tools version to 1.1.0
- Update README/README.zh-CN with new features
- Update QUICKSTART docs with in-app install instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -1,146 +1,93 @@
 ---
 name: markdown-tools
-description: Converts documents to markdown (PDFs, Word docs, PowerPoint, Confluence exports) with Windows/WSL path handling. Activates when converting .doc/.docx/PDF/PPTX files to markdown, processing Confluence exports, handling Windows/WSL path conversions, or working with markitdown utility.
+description: Converts documents to markdown (PDFs, Word docs, PowerPoint, Confluence exports) with Windows/WSL path handling. Activates when converting .doc/.docx/PDF/PPTX files to markdown, processing Confluence exports, handling Windows/WSL path conversions, extracting images from PDFs, or working with markitdown utility.
 ---
 
 # Markdown Tools
 
-## Overview
-
-This skill provides document conversion to markdown with Windows/WSL path handling support. It helps convert various document formats to markdown and handles path conversions between Windows and WSL environments.
-
-## Core Capabilities
-
-### 1. Markdown Conversion
-Convert documents to markdown format with automatic Windows/WSL path handling.
-
-### 2. Confluence Export Processing
-Handle Confluence .doc exports with special characters for knowledge base integration.
+Convert documents to markdown with image extraction and Windows/WSL path handling.
 
 ## Quick Start
 
-### Convert Any Document to Markdown
+### Install markitdown with PDF Support
 
 ```bash
-# Basic conversion
-markitdown "path/to/document.pdf" > output.md
+# IMPORTANT: Use [pdf] extra for PDF support
+uv tool install "markitdown[pdf]"
 
-# WSL path example
-markitdown "/mnt/c/Users/username/Documents/file.docx" > output.md
+# Or via pip
+pip install "markitdown[pdf]"
 ```
 
-See `references/conversion-examples.md` for detailed examples of various conversion scenarios.
-
-### Convert Confluence Export
+### Basic Conversion
 
 ```bash
-# Direct conversion for simple exports
-markitdown "confluence-export.doc" > output.md
-
-# For exports with special characters, see references/
+markitdown "document.pdf" -o output.md
+# Or redirect: markitdown "document.pdf" > output.md
 ```
 
-## Path Conversion
+## PDF Conversion with Images
 
-### Windows to WSL Path Format
+markitdown extracts text only. For PDFs with images, use this workflow:
 
-Windows paths must be converted to WSL format before use in bash commands.
-
-**Conversion rules:**
-- Replace `C:\` with `/mnt/c/`
-- Replace `\` with `/`
-- Preserve spaces and special characters
-- Use quotes for paths with spaces
-
-**Example conversions:**
-```bash
-# Windows path
-C:\Users\username\Documents\file.doc
-
-# WSL path
-/mnt/c/Users/username/Documents/file.doc
-```
-
-**Helper script:** Use `scripts/convert_path.py` to automate conversion:
+### Step 1: Convert Text
 
 ```bash
-python scripts/convert_path.py "C:\Users\username\Downloads\document.doc"
+markitdown "document.pdf" -o output.md
 ```
 
-See `references/conversion-examples.md` for detailed path conversion examples.
+### Step 2: Extract Images
 
-## Document Conversion Workflows
-
-### Workflow 1: Simple Markdown Conversion
-
-For straightforward document conversions (PDF, .docx without special characters):
-
-1. Convert Windows path to WSL format (if needed)
-2. Run markitdown
-3. Redirect output to .md file
-
-See `references/conversion-examples.md` for detailed examples.
-
-### Workflow 2: Confluence Export with Special Characters
-
-For Confluence .doc exports that contain special characters or complex formatting:
-
-1. Save .doc file to accessible location
-2. Use appropriate conversion method (see references)
-3. Verify output formatting
-
-See `references/conversion-examples.md` for step-by-step command examples.
-
-## Error Handling
-
-### Common Issues and Solutions
-
-**markitdown not found:**
 ```bash
-# Install markitdown via pip
-pip install markitdown
+# Create assets directory alongside the markdown
+mkdir -p assets
 
-# Or via uv tools
-uv tool install markitdown
+# Extract images using PyMuPDF
+uv run --with pymupdf python scripts/extract_pdf_images.py "document.pdf" ./assets
 ```
 
-**Path not found:**
+### Step 3: Add Image References
+
+Insert image references in the markdown where needed:
+
+```markdown
+
+```
+
+### Step 4: Format Cleanup
+
+markitdown output often needs manual fixes:
+- Add proper heading levels (`#`, `##`, `###`)
+- Reconstruct tables in markdown format
+- Fix broken line breaks
+- Restore indentation structure
+
+## Path Conversion (Windows/WSL)
+
 ```bash
-# Verify path exists
-ls -la "/mnt/c/Users/username/Documents/file.doc"
+# Windows → WSL conversion
+C:\Users\name\file.pdf → /mnt/c/Users/name/file.pdf
 
-# Use convert_path.py helper
-python scripts/convert_path.py "C:\Users\username\Documents\file.doc"
+# Use helper script
+python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
 ```
 
-**Encoding issues:**
-- Ensure files are UTF-8 encoded
-- Check for special characters in filenames
-- Use quotes around paths with spaces
+## Common Issues
+
+**"dependencies needed to read .pdf files"**
+```bash
+# Install with PDF support
+uv tool install "markitdown[pdf]" --force
+```
+
+**FontBBox warnings during PDF conversion**
+- These are harmless font parsing warnings, output is still correct
+
+**Images missing from output**
+- Use `scripts/extract_pdf_images.py` to extract images separately
 
 ## Resources
 
-### references/conversion-examples.md
-Comprehensive examples for all conversion scenarios including:
-- Simple document conversions (PDF, Word, PowerPoint)
-- Confluence export handling
-- Path conversion examples for Windows/WSL
-- Batch conversion operations
-- Error recovery and troubleshooting examples
-
-Load this reference when users need specific command examples or encounter conversion issues.
-
-### scripts/convert_path.py
-Python script to automate Windows to WSL path conversion. Handles:
-- Drive letter conversion (C:\ → /mnt/c/)
-- Backslash to forward slash
-- Special characters and spaces
-
-## Best Practices
-
-1. **Convert Windows paths to WSL format** before bash operations
-2. **Verify paths exist** before operations using ls or test commands
-3. **Check output quality** after conversion
-4. **Use markitdown directly** for simple conversions
-5. **Test incrementally** - Verify each conversion step before proceeding
-6. **Preserve directory structure** when doing batch conversions
+- `scripts/extract_pdf_images.py` - Extract images from PDF using PyMuPDF
+- `scripts/convert_path.py` - Windows to WSL path converter
+- `references/conversion-examples.md` - Detailed examples for batch operations
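
Note on the path rules: `scripts/convert_path.py` itself is not shown in this diff. The rules listed in the removed section (drive letter to `/mnt/<letter>/`, backslashes to forward slashes, spaces preserved) could be sketched roughly as below. `windows_to_wsl` is a hypothetical name used only for illustration, and the real helper may behave differently.

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a Windows drive path to its WSL mount path (illustrative sketch)."""
    # Match a drive letter, the colon, and any separators that follow it.
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", path)
    if not match:
        # Not a drive-letter path: only normalize the separators.
        return path.replace("\\", "/")
    drive, rest = match.groups()
    # Spaces and other characters are preserved; quoting is left to the shell.
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


if __name__ == "__main__":
    print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
```

The result still needs quoting when it is passed to a shell command, since spaces in the path are kept as-is.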
95
markdown-tools/scripts/extract_pdf_images.py
Normal file
@@ -0,0 +1,95 @@
+#!/usr/bin/env python3
+"""
+Extract images from PDF files using PyMuPDF.
+
+Usage:
+    uv run --with pymupdf python extract_pdf_images.py <pdf_path> [output_dir]
+
+Examples:
+    uv run --with pymupdf python extract_pdf_images.py document.pdf
+    uv run --with pymupdf python extract_pdf_images.py document.pdf ./assets
+
+Output:
+    Images are saved to output_dir (default: ./assets) with names like:
+    - img_page1_1.png
+    - img_page2_1.png
+"""
+
+import sys
+import os
+
+def extract_images(pdf_path: str, output_dir: str = "assets") -> list[str]:
+    """
+    Extract all images from a PDF file.
+
+    Args:
+        pdf_path: Path to the PDF file
+        output_dir: Directory to save extracted images
+
+    Returns:
+        List of extracted image file paths
+    """
+    try:
+        import fitz  # PyMuPDF
+    except ImportError:
+        print("Error: PyMuPDF not installed. Run with:")
+        print(' uv run --with pymupdf python extract_pdf_images.py <pdf_path>')
+        sys.exit(1)
+
+    os.makedirs(output_dir, exist_ok=True)
+
+    doc = fitz.open(pdf_path)
+    extracted_files = []
+
+    for page_num in range(len(doc)):
+        page = doc[page_num]
+        image_list = page.get_images()
+
+        for img_index, img in enumerate(image_list):
+            xref = img[0]
+            base_image = doc.extract_image(xref)
+            image_bytes = base_image["image"]
+            image_ext = base_image["ext"]
+
+            # Create descriptive filename
+            img_filename = f"img_page{page_num + 1}_{img_index + 1}.{image_ext}"
+            img_path = os.path.join(output_dir, img_filename)
+
+            with open(img_path, "wb") as f:
+                f.write(image_bytes)
+
+            extracted_files.append(img_path)
+            print(f"Extracted: {img_filename} ({len(image_bytes):,} bytes)")
+
+    doc.close()
+
+    print(f"\nTotal: {len(extracted_files)} images extracted to {output_dir}/")
+    return extracted_files
+
+
+def main():
+    if len(sys.argv) < 2 or sys.argv[1] in ("-h", "--help"):
+        print("Extract images from PDF files using PyMuPDF.")
+        print()
+        print("Usage: python extract_pdf_images.py <pdf_path> [output_dir]")
+        print()
+        print("Arguments:")
+        print(" pdf_path Path to the PDF file")
+        print(" output_dir Directory to save images (default: ./assets)")
+        print()
+        print("Example:")
+        print(" uv run --with pymupdf python extract_pdf_images.py document.pdf ./assets")
+        sys.exit(0 if "--help" in sys.argv or "-h" in sys.argv else 1)
+
+    pdf_path = sys.argv[1]
+    output_dir = sys.argv[2] if len(sys.argv) > 2 else "assets"
+
+    if not os.path.exists(pdf_path):
+        print(f"Error: File not found: {pdf_path}")
+        sys.exit(1)
+
+    extract_images(pdf_path, output_dir)
+
+
+if __name__ == "__main__":
+    main()
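
Step 3 of the new SKILL.md workflow leaves image references to be added by hand. As a rough sketch, the list of paths returned by `extract_images` could be turned into markdown image lines as shown here. `image_references` and its `markdown_dir` parameter are hypothetical and not part of this commit; the alt text is simply the file name without its extension.

```python
import os


def image_references(image_paths: list[str], markdown_dir: str = ".") -> str:
    """Build one markdown image line per extracted file (illustrative sketch)."""
    lines = []
    for path in image_paths:  # keep the page order produced by the extractor
        # Make the link relative to the directory holding the markdown file.
        rel = os.path.relpath(path, start=markdown_dir).replace(os.sep, "/")
        alt = os.path.splitext(os.path.basename(path))[0]
        lines.append(f"![{alt}]({rel})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(image_references(["assets/img_page1_1.png", "assets/img_page2_1.jpeg"]))
```

The generated lines still have to be moved to the right place in `output.md`, since markitdown's text output carries no marker for where each image originally appeared.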