feat(twitter-reader): add fetch_article.py for X Articles with images

- Use twitter-cli for structured metadata (likes, retweets, bookmarks)
- Use Jina API for content with images
- Auto-download all images to attachments/
- Generate Markdown with YAML frontmatter and local image references
- Security scan passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
daymade
2026-04-06 16:31:33 +08:00
parent 673980639b
commit 22ec9f0d59
3 changed files with 377 additions and 46 deletions


@@ -1,4 +1,4 @@
Security scan passed
Scanned at: 2026-04-06T16:24:46.159857
Tool: gitleaks + pattern-based validation
Content hash: bfcc70ffa4bfda5e6308942605ad8dc3cf62e8aeade5b00c6d9b774bebb190fb


@@ -1,72 +1,156 @@
---
name: twitter-reader
description: Fetch Twitter/X post content including long-form Articles with full images and metadata. Use when Claude needs to retrieve tweet/article content, author info, engagement metrics, and embedded media. Supports individual posts and X Articles (long-form content). Automatically downloads all images to local attachments folder and generates complete Markdown with proper image references. Preferred over Jina for X Articles with images.
---
# Twitter Reader
Fetch Twitter/X post and article content with full media support.
## Quick Start (Recommended)
For X Articles with images, use the new fetch_article.py script:
```bash
uv run --with pyyaml python scripts/fetch_article.py <article_url> [output_dir]
```
Example:
```bash
uv run --with pyyaml python scripts/fetch_article.py \
https://x.com/HiTw93/status/2040047268221608281 \
./Clippings
```
This will:
- Fetch structured data via `twitter-cli` (likes, retweets, bookmarks)
- Fetch content with images via `jina.ai` API
- Download all images to `attachments/YYYY-MM-DD-AUTHOR-TITLE/`
- Generate complete Markdown with embedded image references
- Include YAML frontmatter with metadata
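The image-download step depends on how image links are collected. The script's `extract_image_urls` logic boils down to: find every `pbs.twimg.com` media URL, deduplicate by base URL, and normalize to the `large` variant:

```python
import re

def extract_image_urls(text: str) -> list:
    """Collect pbs.twimg.com media URLs, dedupe by base, force large size."""
    matches = re.findall(r'https://pbs\.twimg\.com/media/[^\s\)"\']+', text)
    seen, urls = set(), []
    for url in matches:
        base = url.split('?')[0]
        if base not in seen:
            seen.add(base)
            urls.append(f"{base}?format=jpg&name=large")
    return urls

sample = ('![](https://pbs.twimg.com/media/AbC123?format=webp) '
          'https://pbs.twimg.com/media/AbC123?name=small')
# Both links share a base URL, so only one normalized entry survives.
```

Two size variants of the same image collapse into a single download, so each embedded image is fetched exactly once.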
### Example Output
```
Fetching: https://x.com/HiTw93/status/2040047268221608281
--------------------------------------------------
Getting metadata...
Title: 你不知道的大模型训练:原理、路径与新实践
Author: Tw93
Likes: 1648
Getting content and images...
Images: 15
Downloading 15 images...
✓ 01-image.jpg
✓ 02-image.jpg
...
✓ Saved: ./Clippings/2026-04-03-文章标题.md
✓ Images: ./Clippings/attachments/2026-04-03-HiTw93-.../ (15 downloaded)
```
## Alternative: Jina API (Text-only)
For simple text-only fetching without authentication:
```bash
# Single tweet
curl "https://r.jina.ai/https://x.com/USER/status/TWEET_ID" \
-H "Authorization: Bearer ${JINA_API_KEY}"
# Batch fetching
scripts/fetch_tweets.sh url1 url2 url3
```
## Features
### Full Article Mode (fetch_article.py)
- ✅ Structured metadata (author, date, engagement metrics)
- ✅ Automatic image download (all embedded media)
- ✅ Complete Markdown with local image references
- ✅ YAML frontmatter for PKM systems
- ✅ Handles X Articles (long-form content)
### Simple Mode (Jina API)
- Text-only content
- No authentication required beyond Jina API key
- Good for quick text extraction
## Prerequisites
### For Full Article Mode
- `uv` (Python package manager)
- No additional setup (twitter-cli auto-installed)
### For Simple Mode (Jina)
```bash
export JINA_API_KEY="your_api_key_here"
# Get from https://jina.ai/
```
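Under the hood, Simple Mode is just an HTTP GET against the `r.jina.ai` reader prefix. A minimal Python equivalent of the curl call (the `build_jina_request` helper is illustrative, not part of the bundled scripts):

```python
import os
import urllib.request

def build_jina_request(tweet_url: str, api_key: str) -> urllib.request.Request:
    """Build the reader request: prefix the tweet URL with r.jina.ai."""
    return urllib.request.Request(
        f"https://r.jina.ai/{tweet_url}",
        headers={"Authorization": f"Bearer {api_key}"},
    )

req = build_jina_request("https://x.com/USER/status/123", os.getenv("JINA_API_KEY", ""))
# urllib.request.urlopen(req).read() would return the markdown body
```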
## Output Structure
```
output_dir/
├── YYYY-MM-DD-article-title.md # Main Markdown file
└── attachments/
└── YYYY-MM-DD-author-title/
├── 01-image.jpg
├── 02-image.jpg
└── ...
```
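The attachments folder name is derived from the article metadata (date, author handle, truncated title). The naming rule, distilled from the script's `sanitize_filename` logic:

```python
import re

def attachment_dir_name(date_str: str, author: str, title: str) -> str:
    """date-author-title, with unsafe characters stripped (CJK is kept)."""
    def safe(s: str) -> str:
        s = re.sub(r'[^\w\s\-\u4e00-\u9fff]', '', s)   # drop special chars
        return re.sub(r'\s+', '-', s.strip())[:60]     # spaces -> dashes
    return f"{date_str}-{safe(author)}-{safe(title)[:30]}"

# e.g. attachment_dir_name("2026-04-03", "HiTw93", "My Article Title!")
#      -> "2026-04-03-HiTw93-My-Article-Title"
```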
## What Gets Returned
### Full Article Mode
- **YAML Frontmatter**: source, author, date, likes, retweets, bookmarks
- **Markdown Content**: Full article text with local image references
- **Attachments**: All downloaded images in dedicated folder
### Simple Mode
- **Title**: Post author and content preview
- **URL Source**: Original tweet link
- **Published Time**: GMT timestamp
- **Markdown Content**: Text with remote media URLs
## URL Formats Supported
- `https://x.com/USER/status/ID` (posts)
- `https://x.com/USER/article/ID` (long-form articles)
- `https://twitter.com/USER/status/ID` (legacy)
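The bundled script only validates the domain prefix; a stricter matcher for the three shapes above (illustrative, not part of the skill) could look like:

```python
import re

# Hypothetical strict matcher for the supported URL shapes
SUPPORTED = re.compile(r'^https://(?:x|twitter)\.com/[^/]+/(?:status|article)/\d+/?$')

def is_supported(url: str) -> bool:
    return SUPPORTED.match(url) is not None
```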
## Scripts
### fetch_article.py
Full-featured article fetcher with image download:
```bash
uv run --with pyyaml python scripts/fetch_article.py <url> [output_dir]
```
### fetch_tweet.py
Simple text-only fetcher using Jina API:
```bash
python scripts/fetch_tweet.py <tweet_url> [output_file]
```
### fetch_tweets.sh
Batch fetch multiple tweets (Jina API):
```bash
scripts/fetch_tweets.sh <url1> <url2> ...
```
## Migration from Jina API
Old workflow:
```bash
curl "https://r.jina.ai/https://x.com/..."
# Manual image extraction and download
```
New workflow:
```bash
uv run --with pyyaml python scripts/fetch_article.py <url>
# Automatic image download, complete Markdown
```


@@ -0,0 +1,247 @@
#!/usr/bin/env python3
"""
Fetch Twitter/X Article with images using twitter-cli.

Usage:
    python fetch_article.py <article_url> [output_dir]

Example:
    python fetch_article.py https://x.com/HiTw93/status/2040047268221608281 ./Clippings

Features:
- Fetches structured data via twitter-cli
- Downloads all images to attachments folder
- Generates Markdown with embedded image references
"""
import sys
import os
import re
import subprocess
import argparse
from pathlib import Path
from datetime import datetime


def run_twitter_cli(url: str) -> dict:
    """Fetch article data using twitter-cli via uv run."""
    cmd = ["uv", "run", "--with", "twitter-cli", "twitter", "article", url]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Error fetching article: {result.stderr}", file=sys.stderr)
        sys.exit(1)
    return parse_yaml_output(result.stdout)


def run_jina_api(url: str) -> str:
    """Fetch article text with images using Jina API."""
    api_key = os.getenv("JINA_API_KEY", "")
    jina_url = f"https://r.jina.ai/{url}"
    cmd = ["curl", "-s", jina_url]
    if api_key:
        cmd.extend(["-H", f"Authorization: Bearer {api_key}"])
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Warning: Jina API failed: {result.stderr}", file=sys.stderr)
        return ""
    return result.stdout


def parse_yaml_output(output: str) -> dict:
    """Parse twitter-cli YAML output into a dict."""
    try:
        import yaml
        data = yaml.safe_load(output)
        if isinstance(data, dict) and data.get("ok") and "data" in data:
            return data["data"]
        return data
    except ImportError:
        print("Error: PyYAML required. Install with: uv pip install pyyaml", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error parsing YAML: {e}", file=sys.stderr)
        sys.exit(1)


def extract_image_urls(text: str) -> list:
    """Extract image URLs from markdown text."""
    # Extract all pbs.twimg.com URLs (note: twimg, not twitter)
    pattern = r'https://pbs\.twimg\.com/media/[^\s\)"\']+'
    matches = re.findall(pattern, text)
    # Deduplicate and normalize to large size
    seen = set()
    urls = []
    for url in matches:
        base_url = url.split('?')[0]
        if base_url not in seen:
            seen.add(base_url)
            urls.append(f"{base_url}?format=jpg&name=large")
    return urls


def download_images(image_urls: list, attachments_dir: Path) -> list:
    """Download images and return a list of local paths."""
    attachments_dir.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for i, url in enumerate(image_urls, 1):
        filename = f"{i:02d}-image.jpg"
        filepath = attachments_dir / filename
        cmd = ["curl", "-sL", url, "-o", str(filepath)]
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0 and filepath.exists() and filepath.stat().st_size > 0:
            local_paths.append(f"attachments/{attachments_dir.name}/{filename}")
            print(f"  ✓ {filename}")
        else:
            print(f"  ✗ Failed: {filename}")
    return local_paths


def replace_image_urls(text: str, image_urls: list, local_paths: list) -> str:
    """Replace remote image URLs with local paths in markdown text."""
    for remote_url, local_path in zip(image_urls, local_paths):
        # Match the base URL with or without a query string
        base_url = remote_url.split('?')[0]
        pattern = re.escape(base_url) + r'(\?[^\)]*)?'
        text = re.sub(pattern, local_path, text)
    return text


def sanitize_filename(name: str) -> str:
    """Sanitize a string for use in a filename."""
    # Remove special chars; keep alphanumeric, CJK, and some safe chars
    name = re.sub(r'[^\w\s\-\u4e00-\u9fff]', '', name)
    name = re.sub(r'\s+', '-', name.strip())
    return name[:60]  # Limit length


def generate_markdown(data: dict, text: str, image_urls: list, local_paths: list, source_url: str) -> str:
    """Generate the complete Markdown document."""
    # Parse date
    created = data.get("createdAtLocal", "")
    if created:
        date_str = created[:10]
    else:
        date_str = datetime.now().strftime("%Y-%m-%d")
    author = data.get("author", {})
    metrics = data.get("metrics", {})
    title = data.get("articleTitle", "Untitled")
    # Build frontmatter
    md = f"""---
source: {source_url}
author: {author.get("name", "")}
date: {date_str}
likes: {metrics.get("likes", 0)}
retweets: {metrics.get("retweets", 0)}
bookmarks: {metrics.get("bookmarks", 0)}
---

# {title}

"""
    # Replace image URLs with local paths
    if image_urls and local_paths:
        text = replace_image_urls(text, image_urls, local_paths)
    md += text
    return md


def main():
    parser = argparse.ArgumentParser(description="Fetch Twitter/X Article with images")
    parser.add_argument("url", help="Twitter/X article URL")
    parser.add_argument("output_dir", nargs="?", default=".", help="Output directory (default: current)")
    args = parser.parse_args()
    if not args.url.startswith(("https://x.com/", "https://twitter.com/")):
        print("Error: URL must be from x.com or twitter.com", file=sys.stderr)
        sys.exit(1)

    print(f"Fetching: {args.url}")
    print("-" * 50)

    # Fetch metadata from twitter-cli
    print("Getting metadata...")
    data = run_twitter_cli(args.url)
    title = data.get("articleTitle", "")
    if not title:
        print("Error: Could not fetch article data", file=sys.stderr)
        sys.exit(1)
    author = data.get("author", {})
    print(f"Title: {title}")
    print(f"Author: {author.get('name', 'Unknown')}")
    print(f"Likes: {data.get('metrics', {}).get('likes', 0)}")

    # Fetch content with images from Jina API
    print("\nGetting content and images...")
    jina_content = run_jina_api(args.url)
    # Use Jina content if available, otherwise fall back to twitter-cli text
    if jina_content:
        text = jina_content
        # Strip the Jina header: keep everything after "Markdown Content:"
        marker = "Markdown Content:"
        idx = text.find(marker)
        if idx != -1:
            text = text[idx + len(marker):].lstrip()
    else:
        text = data.get("articleText", "")

    # Extract image URLs
    image_urls = extract_image_urls(text)
    print(f"Images: {len(image_urls)}")

    # Set up output paths
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Build the attachments folder name: date-author-title
    date_str = data.get("createdAtLocal", "")[:10] if data.get("createdAtLocal") else datetime.now().strftime("%Y-%m-%d")
    safe_author = sanitize_filename(author.get("screenName", "unknown"))
    safe_title = sanitize_filename(title)
    attachments_name = f"{date_str}-{safe_author}-{safe_title[:30]}"
    attachments_dir = output_dir / "attachments" / attachments_name

    # Download images
    local_paths = []
    if image_urls:
        print(f"\nDownloading {len(image_urls)} images...")
        local_paths = download_images(image_urls, attachments_dir)

    # Generate Markdown
    md_content = generate_markdown(data, text, image_urls, local_paths, args.url)

    # Save Markdown
    md_filename = f"{date_str}-{safe_title}.md"
    md_path = output_dir / md_filename
    md_path.write_text(md_content, encoding="utf-8")
    print(f"\n✓ Saved: {md_path}")
    if local_paths:
        print(f"✓ Images: {attachments_dir} ({len(local_paths)} downloaded)")
    return md_path


if __name__ == "__main__":
    main()