Files
skill-seekers-reference/docs/TROUBLESHOOTING.md
yusyus 37cb307455 docs: update all documentation for 17 source types
Update 32 documentation files across English and Chinese (zh-CN) docs
to reflect the 10 new source types added in the previous commit.

Updated files:
- README.md, README.zh-CN.md — taglines, feature lists, examples, install extras
- docs/reference/ — CLI_REFERENCE, FEATURE_MATRIX, MCP_REFERENCE, CONFIG_FORMAT, API_REFERENCE
- docs/features/ — UNIFIED_SCRAPING with generic merge docs
- docs/advanced/ — multi-source guide, MCP server guide
- docs/getting-started/ — installation extras, quick-start examples
- docs/user-guide/ — core-concepts, scraping, packaging, workflows (complex-merge)
- docs/ — FAQ, TROUBLESHOOTING, BEST_PRACTICES, ARCHITECTURE, UNIFIED_PARSERS, README
- Root — BULLETPROOF_QUICKSTART, CONTRIBUTING, ROADMAP
- docs/zh-CN/ — Chinese translations for all of the above

32 files changed, +3,016 lines, -245 lines
2026-03-15 15:56:04 +03:00

1095 lines
21 KiB
Markdown

# Troubleshooting Guide
Comprehensive guide for diagnosing and resolving common issues with Skill Seekers.
## Table of Contents
- [Installation Issues](#installation-issues)
- [Configuration Issues](#configuration-issues)
- [Scraping Issues](#scraping-issues)
- [GitHub API Issues](#github-api-issues)
- [API & Enhancement Issues](#api--enhancement-issues)
- [Docker & Kubernetes Issues](#docker--kubernetes-issues)
- [Performance Issues](#performance-issues)
- [Storage Issues](#storage-issues)
- [Network Issues](#network-issues)
- [General Debug Techniques](#general-debug-techniques)
- [Source-Type-Specific Issues](#source-type-specific-issues)
## Installation Issues
### Issue: Package Installation Fails
**Symptoms:**
```
ERROR: Could not build wheels for...
ERROR: Failed building wheel for...
```
**Solutions:**
```bash
# Update pip and setuptools
python -m pip install --upgrade pip setuptools wheel
# Install build dependencies (Ubuntu/Debian)
sudo apt install python3-dev build-essential libssl-dev
# Install build dependencies (RHEL/CentOS)
sudo yum install python3-devel gcc gcc-c++ openssl-devel
# Retry installation
pip install skill-seekers
```
### Issue: Command Not Found After Installation
**Symptoms:**
```bash
$ skill-seekers --version
bash: skill-seekers: command not found
```
**Solutions:**
```bash
# Check if installed
pip show skill-seekers
# Add to PATH
export PATH="$HOME/.local/bin:$PATH"
# Or reinstall with --user flag
pip install --user skill-seekers
# Verify
which skill-seekers
```
### Issue: Python Version Mismatch
**Symptoms:**
```
ERROR: Package requires Python >=3.10 but you are running 3.9
```
**Solutions:**
```bash
# Check Python version
python --version
python3 --version
# Use specific Python version
python3.12 -m pip install skill-seekers
# Create alias
alias python=python3.12
# Or use pyenv
pyenv install 3.12
pyenv global 3.12
```
### Issue: Video Visual Dependencies Missing
**Symptoms:**
```
Missing video dependencies: easyocr
RuntimeError: Required video visual dependencies not installed
```
**Solutions:**
```bash
# Run the GPU-aware setup command
skill-seekers video --setup
# This auto-detects your GPU and installs:
# - PyTorch (correct CUDA/ROCm/CPU variant)
# - easyocr, opencv, pytesseract, scenedetect, faster-whisper
# - yt-dlp, youtube-transcript-api
# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
python -c "import easyocr; print('easyocr OK')"
```
**Common issues:**
- Running outside a virtual environment → `--setup` will warn you; create a venv first
- Missing system packages → Install `tesseract-ocr` and `ffmpeg` for your OS
- AMD GPU without ROCm → Install ROCm first, then re-run `--setup`
## Configuration Issues
### Issue: API Keys Not Recognized
**Symptoms:**
```
Error: ANTHROPIC_API_KEY not found
401 Unauthorized
```
**Solutions:**
```bash
# Check environment variables
env | grep API_KEY
# Set in current session
export ANTHROPIC_API_KEY=sk-ant-...
# Set permanently (~/.bashrc or ~/.zshrc)
echo 'export ANTHROPIC_API_KEY=sk-ant-...' >> ~/.bashrc
source ~/.bashrc
# Or use .env file
cat > .env <<EOF
ANTHROPIC_API_KEY=sk-ant-...
EOF
# Load .env
set -a
source .env
set +a
# Verify
skill-seekers config --test
```
### Issue: Configuration File Not Found
**Symptoms:**
```
Error: Config file not found: configs/react.json
FileNotFoundError: [Errno 2] No such file or directory
```
**Solutions:**
```bash
# Check file exists
ls -la configs/react.json
# Use absolute path
skill-seekers scrape --config /full/path/to/configs/react.json
# Create config directory
mkdir -p ~/.config/skill-seekers/configs
# Copy config
cp configs/react.json ~/.config/skill-seekers/configs/
# List available configs
skill-seekers-config list
```
### Issue: Invalid Configuration Format
**Symptoms:**
```
json.decoder.JSONDecodeError: Expecting value: line 1 column 1
ValidationError: 1 validation error for Config
```
**Solutions:**
```bash
# Validate JSON syntax
python -m json.tool configs/myconfig.json
# Check required fields
skill-seekers-validate configs/myconfig.json
# Example valid config
cat > configs/test.json <<EOF
{
"name": "test",
"base_url": "https://docs.example.com/",
"selectors": {
"main_content": "article"
}
}
EOF
```
## Scraping Issues
### Issue: No Content Extracted
**Symptoms:**
```
Warning: No content found for URL
0 pages scraped
Empty SKILL.md generated
```
**Solutions:**
```bash
# Enable debug mode
export LOG_LEVEL=DEBUG
skill-seekers scrape --config config.json --verbose
# Test selectors manually
python -c "
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get('URL').content, 'html.parser')
print(soup.select_one('article')) # Test selector
"
# Adjust selectors in config
{
"selectors": {
"main_content": "main", # Try different selectors
"title": "h1",
"code_blocks": "pre"
}
}
# Use fallback selectors
{
"selectors": {
"main_content": ["article", "main", ".content", "#content"]
}
}
```
### Issue: Scraping Takes Too Long
**Symptoms:**
```
Scraping has been running for 2 hours...
Progress: 50/500 pages (10%)
```
**Solutions:**
```bash
# Enable async scraping (2-3x faster)
skill-seekers scrape --config config.json --async
# Reduce max pages
skill-seekers scrape --config config.json --max-pages 100
# Increase concurrency
# Edit config.json:
{
"concurrency": 20, # Default: 10
"rate_limit": 0.2 # Faster (0.2s delay)
}
# Use caching for re-runs
skill-seekers scrape --config config.json --use-cache
```
### Issue: Pages Not Being Discovered
**Symptoms:**
```
Only 5 pages found
Expected 100+ pages
```
**Solutions:**
```bash
# Check URL patterns
{
"url_patterns": {
"include": ["/docs"], # Make sure this matches
"exclude": [] # Remove restrictive patterns
}
}
# Enable breadth-first search
{
"crawl_strategy": "bfs", # vs "dfs"
"max_depth": 10 # Increase depth
}
# Debug URL discovery
skill-seekers scrape --config config.json --dry-run --verbose
```
## GitHub API Issues
### Issue: Rate Limit Exceeded
**Symptoms:**
```
403 Forbidden
API rate limit exceeded for user
X-RateLimit-Remaining: 0
```
**Solutions:**
```bash
# Check current rate limit
curl -H "Authorization: token $GITHUB_TOKEN" \
https://api.github.com/rate_limit
# Use multiple tokens
skill-seekers config --github
# Follow wizard to add multiple profiles
# Wait for reset
# Check X-RateLimit-Reset header for timestamp
# Use non-interactive mode in CI/CD
skill-seekers github --repo owner/repo --non-interactive
# Configure rate limit strategy
skill-seekers config --github
# Choose: prompt / wait / switch / fail
```
### Issue: Invalid GitHub Token
**Symptoms:**
```
401 Unauthorized
Bad credentials
```
**Solutions:**
```bash
# Verify token
curl -H "Authorization: token $GITHUB_TOKEN" \
https://api.github.com/user
# Generate new token
# Visit: https://github.com/settings/tokens
# Scopes needed: repo, read:org
# Update token
skill-seekers config --github
# Test token
skill-seekers config --test
```
### Issue: Repository Not Found
**Symptoms:**
```
404 Not Found
Repository not found: owner/repo
```
**Solutions:**
```bash
# Check repository name (case-sensitive)
skill-seekers github --repo facebook/react # Correct
skill-seekers github --repo Facebook/React # Wrong
# Check if repo is private (requires token)
export GITHUB_TOKEN=ghp_...
skill-seekers github --repo private/repo
# Verify repo exists
curl https://api.github.com/repos/owner/repo
```
## API & Enhancement Issues
### Issue: Enhancement Fails
**Symptoms:**
```
Error: SKILL.md enhancement failed
AuthenticationError: Invalid API key
```
**Solutions:**
```bash
# Verify API key
skill-seekers config --test
# Try LOCAL mode (free, uses Claude Code Max)
skill-seekers enhance output/react/ --mode LOCAL
# Check API key format
# Claude: sk-ant-...
# OpenAI: sk-...
# Gemini: AIza...
# Test API directly
curl https://api.anthropic.com/v1/messages \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "content-type: application/json" \
-d '{"model":"claude-sonnet-4.5","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'
```
### Issue: Enhancement Hangs/Timeouts
**Symptoms:**
```
Enhancement process not responding
Timeout after 300 seconds
```
**Solutions:**
```bash
# Increase timeout
skill-seekers enhance output/react/ --timeout 600
# Run in background
skill-seekers enhance output/react/ --background
# Monitor status
skill-seekers enhance-status output/react/ --watch
# Kill hung process
ps aux | grep enhance
kill -9 <PID>
# Check system resources
htop
df -h
```
### Issue: API Cost Concerns
**Symptoms:**
```
Worried about API costs for enhancement
Need free alternative
```
**Solutions:**
```bash
# Use LOCAL mode (free!)
skill-seekers enhance output/react/ --mode LOCAL
# Skip enhancement entirely
skill-seekers scrape --config config.json --skip-enhance
# Estimate cost before enhancing
# Claude API: ~$0.15-$0.30 per skill
# Check usage: https://console.anthropic.com/
# Use batch processing
for dir in output/*/; do
skill-seekers enhance "$dir" --mode LOCAL --background
done
```
## Docker & Kubernetes Issues
### Issue: Container Won't Start
**Symptoms:**
```
Error response from daemon: Container ... is not running
Container exits immediately
```
**Solutions:**
```bash
# Check logs
docker logs skillseekers-mcp
# Common issues:
# 1. Missing environment variables
docker run -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ...
# 2. Port already in use
sudo lsof -i :8765
docker run -p 8766:8765 ...
# 3. Permission issues
docker run --user $(id -u):$(id -g) ...
# Run interactively to debug
docker run -it --entrypoint /bin/bash skillseekers:latest
```
### Issue: Kubernetes Pod CrashLoopBackOff
**Symptoms:**
```
NAME READY STATUS RESTARTS
skillseekers-mcp-xxx 0/1 CrashLoopBackOff 5
```
**Solutions:**
```bash
# Check pod logs
kubectl logs -n skillseekers skillseekers-mcp-xxx
# Describe pod
kubectl describe pod -n skillseekers skillseekers-mcp-xxx
# Check events
kubectl get events -n skillseekers --sort-by='.lastTimestamp'
# Common issues:
# 1. Missing secrets
kubectl get secrets -n skillseekers
# 2. Resource constraints
kubectl top nodes
kubectl edit deployment skillseekers-mcp -n skillseekers
# 3. Liveness probe failing
# Increase initialDelaySeconds in deployment
```
### Issue: Image Pull Errors
**Symptoms:**
```
ErrImagePull
ImagePullBackOff
Failed to pull image
```
**Solutions:**
```bash
# Check image exists
docker pull skillseekers:latest
# Create image pull secret
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=user \
--docker-password=pass \
-n skillseekers
# Add to deployment
spec:
imagePullSecrets:
- name: regcred
# Use public image (if available)
image: docker.io/skillseekers/skillseekers:latest
```
## Performance Issues
### Issue: High Memory Usage
**Symptoms:**
```
Process killed (OOM)
Memory usage: 8GB+
System swapping
```
**Solutions:**
```bash
# Check memory usage
ps aux --sort=-%mem | head -10
htop
# Reduce batch size
skill-seekers scrape --config config.json --batch-size 10
# Enable memory limits
# Docker:
docker run --memory=4g skillseekers:latest
# Kubernetes:
resources:
limits:
memory: 4Gi
# Clear cache
rm -rf ~/.cache/skill-seekers/
# Use streaming for large files
# (automatically handled by library)
```
### Issue: Slow Performance
**Symptoms:**
```
Operations taking much longer than expected
High CPU usage
Disk I/O bottleneck
```
**Solutions:**
```bash
# Enable async operations
skill-seekers scrape --config config.json --async
# Increase concurrency
{
"concurrency": 20 # Adjust based on resources
}
# Use SSD for storage
# Move output to SSD:
mv output/ /mnt/ssd/output/
# Monitor performance
# CPU:
mpstat 1
# Disk I/O:
iostat -x 1
# Network:
iftop
# Profile code
python -m cProfile -o profile.stats \
-m skill_seekers.cli.doc_scraper --config config.json
```
### Issue: Disk Space Issues
**Symptoms:**
```
No space left on device
Disk full
Cannot create file
```
**Solutions:**
```bash
# Check disk usage
df -h
du -sh output/*
# Clean up old skills
find output/ -type d -mtime +30 -exec rm -rf {} \;
# Compress old benchmarks
tar czf benchmarks-archive.tar.gz benchmarks/
rm -rf benchmarks/*.json
# Use cloud storage
skill-seekers scrape --config config.json \
--storage s3 \
--bucket my-skills-bucket
# Clear cache
skill-seekers cache --clear
```
## Storage Issues
### Issue: S3 Upload Fails
**Symptoms:**
```
botocore.exceptions.NoCredentialsError
AccessDenied
```
**Solutions:**
```bash
# Check credentials
aws sts get-caller-identity
# Configure AWS CLI
aws configure
# Set environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
# Check bucket permissions
aws s3 ls s3://my-bucket/
# Test upload
echo "test" > test.txt
aws s3 cp test.txt s3://my-bucket/
```
### Issue: GCS Authentication Failed
**Symptoms:**
```
google.auth.exceptions.DefaultCredentialsError
Permission denied
```
**Solutions:**
```bash
# Set credentials file
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
# Or use gcloud auth
gcloud auth application-default login
# Verify permissions
gsutil ls gs://my-bucket/
# Test upload
echo "test" > test.txt
gsutil cp test.txt gs://my-bucket/
```
## Network Issues
### Issue: Connection Timeouts
**Symptoms:**
```
requests.exceptions.ConnectionError
ReadTimeout
Connection refused
```
**Solutions:**
```bash
# Check network connectivity
ping google.com
curl https://docs.example.com/
# Increase timeout
{
"timeout": 60 # seconds
}
# Use proxy if behind firewall
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
# Check DNS resolution
nslookup docs.example.com
dig docs.example.com
# Test with curl
curl -v https://docs.example.com/
```
### Issue: SSL/TLS Errors
**Symptoms:**
```
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED]
SSLCertVerificationError
```
**Solutions:**
```bash
# Update certificates
# Ubuntu/Debian:
sudo apt update && sudo apt install --reinstall ca-certificates
# RHEL/CentOS:
sudo yum reinstall ca-certificates
# As last resort (not recommended for production):
export PYTHONHTTPSVERIFY=0
# Or in code:
skill-seekers scrape --config config.json --no-verify-ssl
```
## General Debug Techniques
### Enable Debug Logging
```bash
# Set debug level
export LOG_LEVEL=DEBUG
# Run with verbose output
skill-seekers scrape --config config.json --verbose
# Save logs to file
skill-seekers scrape --config config.json 2>&1 | tee debug.log
```
### Collect Diagnostic Information
```bash
# System info
uname -a
python --version
pip --version
# Package info
pip show skill-seekers
pip list | grep skill
# Environment
env | grep -E '(API_KEY|TOKEN|PATH)'
# Recent errors
grep -i error /var/log/skillseekers/*.log | tail -20
# Package all diagnostics
tar czf diagnostics.tar.gz \
debug.log \
~/.config/skill-seekers/ \
/var/log/skillseekers/
```
### Test Individual Components
```bash
# Test scraper
python -c "
from skill_seekers.cli.doc_scraper import scrape_all
pages = scrape_all('configs/test.json')
print(f'Scraped {len(pages)} pages')
"
# Test GitHub API
python -c "
from skill_seekers.cli.github_fetcher import GitHubFetcher
fetcher = GitHubFetcher()
repo = fetcher.fetch('facebook/react')
print(repo['full_name'])
"
# Test embeddings
python -c "
from skill_seekers.embedding.generator import EmbeddingGenerator
gen = EmbeddingGenerator()
emb = gen.generate('test', model='text-embedding-3-small')
print(f'Embedding dimension: {len(emb)}')
"
```
### Interactive Debugging
```python
# Add breakpoint
import pdb; pdb.set_trace()
# Or use ipdb
import ipdb; ipdb.set_trace()
# Debug with IPython
ipython -i script.py
```
## Getting More Help
If you're still experiencing issues:
1. **Search existing issues:** https://github.com/yusufkaraaslan/Skill_Seekers/issues
2. **Check documentation:** https://skillseekersweb.com/
3. **Ask on GitHub Discussions:** https://github.com/yusufkaraaslan/Skill_Seekers/discussions
4. **Open a new issue:** Include:
- Skill Seekers version (`skill-seekers --version`)
- Python version (`python --version`)
- Operating system
- Complete error message
- Steps to reproduce
- Diagnostic information (see above)
## Source-Type-Specific Issues
### Issue: Missing Optional Dependencies for New Source Types
**Symptoms:**
```
ModuleNotFoundError: No module named 'ebooklib'
ModuleNotFoundError: No module named 'python-docx'
ModuleNotFoundError: No module named 'python-pptx'
ImportError: Missing dependency for jupyter extraction
```
**Solutions:**
```bash
# Install all optional dependencies at once
pip install skill-seekers[all]
# Or install per source type
pip install python-docx # Word (.docx) support
pip install ebooklib # EPUB support
pip install python-pptx # PowerPoint (.pptx) support
pip install nbformat nbconvert # Jupyter Notebook support
pip install pyyaml jsonschema # OpenAPI/Swagger support
pip install asciidoctor # AsciiDoc support (or install system asciidoctor)
pip install feedparser # RSS/Atom feed support
pip install groff # Man page support (system package)
# Video support (GPU-aware)
skill-seekers video --setup
```
### Issue: Confluence API Authentication Fails
**Symptoms:**
```
401 Unauthorized: Confluence API rejected credentials
Error: CONFLUENCE_TOKEN not found
```
**Solutions:**
```bash
# Set Confluence Cloud credentials
export CONFLUENCE_URL=https://yourorg.atlassian.net
export CONFLUENCE_EMAIL=your-email@example.com
export CONFLUENCE_TOKEN=your-api-token
# Generate API token at:
# https://id.atlassian.com/manage-profile/security/api-tokens
# Test connection
skill-seekers confluence --space MYSPACE --dry-run
# For Confluence Server/Data Center, use personal access token:
export CONFLUENCE_TOKEN=your-pat
```
### Issue: Notion API Authentication Fails
**Symptoms:**
```
401 Unauthorized: Notion API rejected credentials
Error: NOTION_TOKEN not found
```
**Solutions:**
```bash
# Set Notion integration token
export NOTION_TOKEN=secret_...
# Create an integration at:
# https://www.notion.so/my-integrations
# IMPORTANT: Share the target database/page with your integration
# (click "..." menu on page → "Add connections" → select your integration)
# Test connection
skill-seekers notion --database DATABASE_ID --dry-run
```
### Issue: Jupyter Notebook Extraction Fails
**Symptoms:**
```
Error: Cannot read notebook format
nbformat.reader.NotJSONError
```
**Solutions:**
```bash
# Ensure notebook is valid JSON
python -c "import json; json.load(open('notebook.ipynb'))"
# Install required deps
pip install nbformat nbconvert
# Try with explicit format version
skill-seekers jupyter notebook.ipynb --nbformat 4
```
### Issue: OpenAPI Spec Parsing Fails
**Symptoms:**
```
Error: Not a valid OpenAPI specification
Error: Missing 'openapi' or 'swagger' field
```
**Solutions:**
```bash
# Validate your spec first
pip install openapi-spec-validator
python -c "
from openapi_spec_validator import validate
validate({'openapi': '3.0.0', ...})
"
# Ensure the file has the 'openapi' or 'swagger' top-level key
# Supported: OpenAPI 3.x and Swagger 2.0
# For remote specs
skill-seekers openapi https://api.example.com/openapi.json --name my-api
```
### Issue: EPUB Extraction Produces Empty Output
**Symptoms:**
```
Warning: No content found in EPUB
0 chapters extracted
```
**Solutions:**
```bash
# Check EPUB is valid
pip install epubcheck
epubcheck book.epub
# Try with different content extraction
skill-seekers epub book.epub --extract-images --verbose
# Some DRM-protected EPUBs cannot be extracted
# Ensure your EPUB is DRM-free
```
### Issue: Slack/Discord Export Not Recognized
**Symptoms:**
```
Error: Cannot detect chat platform from export directory
Error: No messages found in export
```
**Solutions:**
```bash
# Specify platform explicitly
skill-seekers chat --platform slack --export-dir ./slack-export
skill-seekers chat --platform discord --export-dir ./discord-export
# For Slack: Export from Workspace Settings → Import/Export
# For Discord: Use DiscordChatExporter or similar tool
# Check export directory structure
ls ./slack-export/
# Should contain: channels/, users.json, etc.
```
---
## Common Error Messages Reference
| Error | Cause | Solution |
|-------|-------|----------|
| `ModuleNotFoundError` | Package not installed | `pip install skill-seekers` |
| `401 Unauthorized` | Invalid API key | Check API key format |
| `403 Forbidden` | Rate limit exceeded | Add more GitHub tokens |
| `404 Not Found` | Invalid URL/repo | Verify URL is correct |
| `429 Too Many Requests` | API rate limit | Wait or use multiple keys |
| `ConnectionError` | Network issue | Check internet connection |
| `TimeoutError` | Request too slow | Increase timeout |
| `MemoryError` | Out of memory | Reduce batch size |
| `PermissionError` | Access denied | Check file permissions |
| `FileNotFoundError` | Missing file | Verify file path |
| `No module named 'ebooklib'` | EPUB dep missing | `pip install ebooklib` |
| `No module named 'python-docx'` | Word dep missing | `pip install python-docx` |
| `No module named 'python-pptx'` | PPTX dep missing | `pip install python-pptx` |
| `CONFLUENCE_TOKEN not found` | Confluence auth missing | Set env vars (see above) |
| `NOTION_TOKEN not found` | Notion auth missing | Set env vars (see above) |
---
**Still stuck?** Open an issue with the "help wanted" label and we'll assist you!