feat(twitter-reader): add fetch_article.py for X Articles with images

- Use twitter-cli for structured metadata (likes, retweets, bookmarks)
- Use Jina API for content with images
- Auto-download all images to attachments/
- Generate Markdown with YAML frontmatter and local image references
- Security scan passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
daymade
2026-04-06 16:31:33 +08:00
parent 673980639b
commit 22ec9f0d59
3 changed files with 377 additions and 46 deletions


@@ -1,4 +1,4 @@
Security scan passed
Scanned at: 2026-04-06T16:24:46.159857
Tool: gitleaks + pattern-based validation
Content hash: bfcc70ffa4bfda5e6308942605ad8dc3cf62e8aeade5b00c6d9b774bebb190fb


@@ -1,72 +1,156 @@
---
name: twitter-reader
description: Fetch Twitter/X post content including long-form Articles with full images and metadata. Use when Claude needs to retrieve tweet/article content, author info, engagement metrics, and embedded media. Supports individual posts and X Articles (long-form content). Automatically downloads all images to local attachments folder and generates complete Markdown with proper image references. Preferred over Jina for X Articles with images.
---
# Twitter Reader
Fetch Twitter/X post and article content with full media support.
## Quick Start (Recommended)
For X Articles with images, use the new fetch_article.py script:
```bash
uv run --with pyyaml python scripts/fetch_article.py <article_url> [output_dir]
```
Example:
```bash
uv run --with pyyaml python scripts/fetch_article.py \
https://x.com/HiTw93/status/2040047268221608281 \
./Clippings
```
This will:
- Fetch structured data via `twitter-cli` (likes, retweets, bookmarks)
- Fetch content with images via `jina.ai` API
- Download all images to `attachments/YYYY-MM-DD-AUTHOR-TITLE/`
- Generate complete Markdown with embedded image references
- Include YAML frontmatter with metadata
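The image-download step depends on how image links are collected. The script's `extract_image_urls` logic boils down to: find every `pbs.twimg.com` media URL, deduplicate by base URL, and normalize to the `large` variant:

```python
import re

def extract_image_urls(text: str) -> list:
    """Collect pbs.twimg.com media URLs, dedupe by base, force large size."""
    matches = re.findall(r'https://pbs\.twimg\.com/media/[^\s\)"\']+', text)
    seen, urls = set(), []
    for url in matches:
        base = url.split('?')[0]
        if base not in seen:
            seen.add(base)
            urls.append(f"{base}?format=jpg&name=large")
    return urls

sample = ('![](https://pbs.twimg.com/media/AbC123?format=webp) '
          'https://pbs.twimg.com/media/AbC123?name=small')
# Both links share a base URL, so only one normalized entry survives.
```

Two size variants of the same image collapse into a single download, so each embedded image is fetched exactly once.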
### Example Output
```
Fetching: https://x.com/HiTw93/status/2040047268221608281
--------------------------------------------------
Getting metadata...
Title: 你不知道的大模型训练:原理、路径与新实践
Author: Tw93
Likes: 1648
Getting content and images...
Images: 15
Downloading 15 images...
✓ 01-image.jpg
✓ 02-image.jpg
...
✓ Saved: ./Clippings/2026-04-03-文章标题.md
✓ Images: ./Clippings/attachments/2026-04-03-HiTw93-.../ (15 downloaded)
```
## Alternative: Jina API (Text-only)
For simple text-only fetching without authentication:
```bash
# Single tweet
curl "https://r.jina.ai/https://x.com/USER/status/TWEET_ID" \
-H "Authorization: Bearer ${JINA_API_KEY}"
# Batch fetching
scripts/fetch_tweets.sh url1 url2 url3
```
## Features
### Full Article Mode (fetch_article.py)
- ✅ Structured metadata (author, date, engagement metrics)
- ✅ Automatic image download (all embedded media)
- ✅ Complete Markdown with local image references
- ✅ YAML frontmatter for PKM systems
- ✅ Handles X Articles (long-form content)
### Simple Mode (Jina API)
- Text-only content
- No authentication required beyond Jina API key
- Good for quick text extraction
## Prerequisites
### For Full Article Mode
- `uv` (Python package manager)
- No additional setup (twitter-cli auto-installed)
### For Simple Mode (Jina)
```bash
export JINA_API_KEY="your_api_key_here"
# Get from https://jina.ai/
```
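Under the hood, Simple Mode is just an HTTP GET against the `r.jina.ai` reader prefix. A minimal Python equivalent of the curl call (the `build_jina_request` helper is illustrative, not part of the bundled scripts):

```python
import os
import urllib.request

def build_jina_request(tweet_url: str, api_key: str) -> urllib.request.Request:
    """Build the reader request: prefix the tweet URL with r.jina.ai."""
    return urllib.request.Request(
        f"https://r.jina.ai/{tweet_url}",
        headers={"Authorization": f"Bearer {api_key}"},
    )

req = build_jina_request("https://x.com/USER/status/123", os.getenv("JINA_API_KEY", ""))
# urllib.request.urlopen(req).read() would return the markdown body
```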
## Output Structure
```
output_dir/
├── YYYY-MM-DD-article-title.md # Main Markdown file
└── attachments/
└── YYYY-MM-DD-author-title/
├── 01-image.jpg
├── 02-image.jpg
└── ...
```
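The attachments folder name is derived from the article metadata (date, author handle, truncated title). The naming rule, distilled from the script's `sanitize_filename` logic:

```python
import re

def attachment_dir_name(date_str: str, author: str, title: str) -> str:
    """date-author-title, with unsafe characters stripped (CJK is kept)."""
    def safe(s: str) -> str:
        s = re.sub(r'[^\w\s\-\u4e00-\u9fff]', '', s)   # drop special chars
        return re.sub(r'\s+', '-', s.strip())[:60]     # spaces -> dashes
    return f"{date_str}-{safe(author)}-{safe(title)[:30]}"

# e.g. attachment_dir_name("2026-04-03", "HiTw93", "My Article Title!")
#      -> "2026-04-03-HiTw93-My-Article-Title"
```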
## What Gets Returned
### Full Article Mode
- **YAML Frontmatter**: source, author, date, likes, retweets, bookmarks
- **Markdown Content**: Full article text with local image references
- **Attachments**: All downloaded images in dedicated folder
### Simple Mode
- **Title**: Post author and content preview
- **URL Source**: Original tweet link
- **Published Time**: GMT timestamp
- **Markdown Content**: Text with remote media URLs
## URL Formats Supported
- `https://x.com/USER/status/ID` (posts)
- `https://x.com/USER/article/ID` (long-form articles)
- `https://twitter.com/USER/status/ID` (legacy)
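The bundled script only validates the domain prefix; a stricter matcher for the three shapes above (illustrative, not part of the skill) could look like:

```python
import re

# Hypothetical strict matcher for the supported URL shapes
SUPPORTED = re.compile(r'^https://(?:x|twitter)\.com/[^/]+/(?:status|article)/\d+/?$')

def is_supported(url: str) -> bool:
    return SUPPORTED.match(url) is not None
```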
## Scripts
### fetch_article.py
Full-featured article fetcher with image download:
```bash
uv run --with pyyaml python scripts/fetch_article.py <url> [output_dir]
```
### fetch_tweet.py
Simple text-only fetcher using Jina API:
```bash
python scripts/fetch_tweet.py <tweet_url> [output_file]
```
### fetch_tweets.sh
Batch fetch multiple tweets (Jina API):
```bash
scripts/fetch_tweets.sh <url1> <url2> ...
```
## Migration from Jina API
Old workflow:
```bash
curl "https://r.jina.ai/https://x.com/..."
# Manual image extraction and download
```
New workflow:
```bash
uv run --with pyyaml python scripts/fetch_article.py <url>
# Automatic image download, complete Markdown
```


@@ -0,0 +1,247 @@
#!/usr/bin/env python3
"""
Fetch Twitter/X Article with images using twitter-cli.

Usage:
    python fetch_article.py <article_url> [output_dir]

Example:
    python fetch_article.py https://x.com/HiTw93/status/2040047268221608281 ./Clippings

Features:
- Fetches structured data via twitter-cli
- Downloads all images to attachments folder
- Generates Markdown with embedded image references
"""
import sys
import os
import re
import subprocess
import argparse
from pathlib import Path
from datetime import datetime


def run_twitter_cli(url: str) -> dict:
    """Fetch article data using twitter-cli via uv run."""
    cmd = ["uv", "run", "--with", "twitter-cli", "twitter", "article", url]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Error fetching article: {result.stderr}", file=sys.stderr)
        sys.exit(1)
    return parse_yaml_output(result.stdout)


def run_jina_api(url: str) -> str:
    """Fetch article text with images using Jina API."""
    api_key = os.getenv("JINA_API_KEY", "")
    jina_url = f"https://r.jina.ai/{url}"
    cmd = ["curl", "-s", jina_url]
    if api_key:
        cmd.extend(["-H", f"Authorization: Bearer {api_key}"])
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Warning: Jina API failed: {result.stderr}", file=sys.stderr)
        return ""
    return result.stdout


def parse_yaml_output(output: str) -> dict:
    """Parse twitter-cli YAML output into a dict."""
    try:
        import yaml
        data = yaml.safe_load(output)
        if isinstance(data, dict) and data.get("ok") and "data" in data:
            return data["data"]
        return data
    except ImportError:
        print("Error: PyYAML required. Install with: uv pip install pyyaml", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error parsing YAML: {e}", file=sys.stderr)
        sys.exit(1)


def extract_image_urls(text: str) -> list:
    """Extract image URLs from markdown text."""
    # Extract all pbs.twimg.com URLs (note: twimg, not twitter)
    pattern = r'https://pbs\.twimg\.com/media/[^\s\)"\']+'
    matches = re.findall(pattern, text)
    # Deduplicate and normalize to large size
    seen = set()
    urls = []
    for url in matches:
        base_url = url.split('?')[0]
        if base_url not in seen:
            seen.add(base_url)
            urls.append(f"{base_url}?format=jpg&name=large")
    return urls


def download_images(image_urls: list, attachments_dir: Path) -> list:
    """Download images and return a list of local paths."""
    attachments_dir.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for i, url in enumerate(image_urls, 1):
        filename = f"{i:02d}-image.jpg"
        filepath = attachments_dir / filename
        cmd = ["curl", "-sL", url, "-o", str(filepath)]
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0 and filepath.exists() and filepath.stat().st_size > 0:
            local_paths.append(f"attachments/{attachments_dir.name}/{filename}")
            print(f"  ✓ {filename}")
        else:
            print(f"  ✗ Failed: {filename}")
    return local_paths


def replace_image_urls(text: str, image_urls: list, local_paths: list) -> str:
    """Replace remote image URLs with local paths in markdown text."""
    for remote_url, local_path in zip(image_urls, local_paths):
        # Match the base URL with or without a query string
        base_url = remote_url.split('?')[0]
        pattern = re.escape(base_url) + r'(\?[^\)]*)?'
        text = re.sub(pattern, local_path, text)
    return text


def sanitize_filename(name: str) -> str:
    """Sanitize a string for use in a filename."""
    # Remove special chars; keep alphanumeric, CJK, and some safe chars
    name = re.sub(r'[^\w\s\-\u4e00-\u9fff]', '', name)
    name = re.sub(r'\s+', '-', name.strip())
    return name[:60]  # Limit length


def generate_markdown(data: dict, text: str, image_urls: list, local_paths: list, source_url: str) -> str:
    """Generate the complete Markdown document."""
    # Parse date
    created = data.get("createdAtLocal", "")
    if created:
        date_str = created[:10]
    else:
        date_str = datetime.now().strftime("%Y-%m-%d")
    author = data.get("author", {})
    metrics = data.get("metrics", {})
    title = data.get("articleTitle", "Untitled")
    # Build frontmatter
    md = f"""---
source: {source_url}
author: {author.get("name", "")}
date: {date_str}
likes: {metrics.get("likes", 0)}
retweets: {metrics.get("retweets", 0)}
bookmarks: {metrics.get("bookmarks", 0)}
---

# {title}

"""
    # Replace image URLs with local paths
    if image_urls and local_paths:
        text = replace_image_urls(text, image_urls, local_paths)
    md += text
    return md


def main():
    parser = argparse.ArgumentParser(description="Fetch Twitter/X Article with images")
    parser.add_argument("url", help="Twitter/X article URL")
    parser.add_argument("output_dir", nargs="?", default=".", help="Output directory (default: current)")
    args = parser.parse_args()
    if not args.url.startswith(("https://x.com/", "https://twitter.com/")):
        print("Error: URL must be from x.com or twitter.com", file=sys.stderr)
        sys.exit(1)

    print(f"Fetching: {args.url}")
    print("-" * 50)

    # Fetch metadata from twitter-cli
    print("Getting metadata...")
    data = run_twitter_cli(args.url)
    title = data.get("articleTitle", "")
    if not title:
        print("Error: Could not fetch article data", file=sys.stderr)
        sys.exit(1)
    author = data.get("author", {})
    print(f"Title: {title}")
    print(f"Author: {author.get('name', 'Unknown')}")
    print(f"Likes: {data.get('metrics', {}).get('likes', 0)}")

    # Fetch content with images from Jina API
    print("\nGetting content and images...")
    jina_content = run_jina_api(args.url)
    # Use Jina content if available, otherwise fall back to twitter-cli text
    if jina_content:
        text = jina_content
        # Strip the Jina header: keep everything after "Markdown Content:"
        marker = "Markdown Content:"
        idx = text.find(marker)
        if idx != -1:
            text = text[idx + len(marker):].lstrip()
    else:
        text = data.get("articleText", "")

    # Extract image URLs
    image_urls = extract_image_urls(text)
    print(f"Images: {len(image_urls)}")

    # Set up output paths
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Build the attachments folder name: date-author-title
    date_str = data.get("createdAtLocal", "")[:10] if data.get("createdAtLocal") else datetime.now().strftime("%Y-%m-%d")
    safe_author = sanitize_filename(author.get("screenName", "unknown"))
    safe_title = sanitize_filename(title)
    attachments_name = f"{date_str}-{safe_author}-{safe_title[:30]}"
    attachments_dir = output_dir / "attachments" / attachments_name

    # Download images
    local_paths = []
    if image_urls:
        print(f"\nDownloading {len(image_urls)} images...")
        local_paths = download_images(image_urls, attachments_dir)

    # Generate Markdown
    md_content = generate_markdown(data, text, image_urls, local_paths, args.url)

    # Save Markdown
    md_filename = f"{date_str}-{safe_title}.md"
    md_path = output_dir / md_filename
    md_path.write_text(md_content, encoding="utf-8")
    print(f"\n✓ Saved: {md_path}")
    if local_paths:
        print(f"✓ Images: {attachments_dir} ({len(local_paths)} downloaded)")
    return md_path


if __name__ == "__main__":
    main()