feat(B2): add Microsoft Word (.docx) support

Implements ROADMAP task B2 - full .docx scraping support via mammoth + python-docx, producing SKILL.md + references/ output identical to other source types.

New files:
- src/skill_seekers/cli/word_scraper.py - WordToSkillConverter class + main() entry point (~600 lines); mammoth → BeautifulSoup pipeline; handles headings, code detection (incl. monospace <p><br> blocks), tables, images, metadata extraction
- src/skill_seekers/cli/arguments/word.py - add_word_arguments() + WORD_ARGUMENTS dict
- src/skill_seekers/cli/parsers/word_parser.py - WordParser for unified CLI parser registry
- tests/test_word_scraper.py - comprehensive test suite (~300 lines)

Modified files:
- src/skill_seekers/cli/main.py - registered "word" command module
- src/skill_seekers/cli/source_detector.py - .docx auto-detection + _detect_word() classmethod
- src/skill_seekers/cli/create_command.py - _route_word() + --help-word
- src/skill_seekers/cli/arguments/create.py - WORD_ARGUMENTS + routing
- src/skill_seekers/cli/arguments/__init__.py - export word args
- src/skill_seekers/cli/parsers/__init__.py - register WordParser
- src/skill_seekers/cli/unified_scraper.py - _scrape_word() integration
- src/skill_seekers/cli/pdf_scraper.py - fix: real enhancement instead of stub; remove [:3] reference file limit; capture run_workflows return
- src/skill_seekers/cli/github_scraper.py - fix: remove arbitrary open_issues[:20] / closed_issues[:10] reference file limits
- pyproject.toml - skill-seekers-word entry point + docx optional dep
- tests/test_cli_parsers.py - update parser count 21→22

Bug fixes applied during real-world testing:
- Code detection: detect monospace <p><br> blocks as code (mammoth renders Courier paragraphs this way, not as <pre>/<code>)
- Language detector: fix wrong method name detect_from_text → detect_from_code
- Description inference: pass None from main() so extract_docx() can infer description from Word document subject/title metadata
- Bullet-point guard: exclude prose starting with •/-/* from code scoring
- Enhancement: implement real API/LOCAL enhancement (was stub)
- pip install message: add quotes around skill-seekers[docx]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
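The code-detection and bullet-point-guard fixes described in the commit message can be illustrated with a minimal sketch. This is not the code in word_scraper.py: the function name `looks_like_code`, the `is_monospace` flag, the regular expression, and the score threshold are all hypothetical, chosen only to show the idea of scoring a paragraph as code while refusing to score list-style prose.

```python
import re

# Hypothetical helper names and thresholds; not taken from word_scraper.py.
BULLET_PREFIXES = ("•", "-", "*")
CODE_HINTS = re.compile(
    r"[{}();=]|^\s*(?:def|class|import|from|return)\b", re.MULTILINE
)


def looks_like_code(text: str, is_monospace: bool = False) -> bool:
    """Return True when a paragraph is more likely code than prose."""
    stripped = text.strip()
    if not stripped:
        return False
    # Bullet-point guard: prose that starts with a list marker is never
    # scored as code, even if it contains code-like punctuation.
    if stripped.startswith(BULLET_PREFIXES):
        return False
    score = 0
    if is_monospace:
        # Assumption: a monospace (e.g. Courier) paragraph is a strong hint.
        score += 2
    # Count code-like tokens, capped so long prose cannot win on volume alone.
    score += min(len(CODE_HINTS.findall(stripped)), 3)
    return score >= 3


if __name__ == "__main__":
    print(looks_like_code("def add(a, b):\n    return a + b", is_monospace=True))
    print(looks_like_code("- run the installer; then reboot", is_monospace=True))
```

In this sketch the guard runs before any scoring, so a bulleted sentence in a monospace font is rejected regardless of how much punctuation it contains.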
AGENTS.md
@@ -12,10 +12,12 @@ This file provides essential guidance for AI coding agents working with the Skil
| Attribute | Value |
|-----------|-------|
| **Current Version** | 3.0.0 |
| **Current Version** | 3.1.3 |
| **Python Version** | 3.10+ (tested on 3.10, 3.11, 3.12, 3.13) |
| **License** | MIT |
| **Package Name** | `skill-seekers` (PyPI) |
| **Source Files** | 169 Python files |
| **Test Files** | 101 test files |
| **Website** | https://skillseekersweb.com/ |
| **Repository** | https://github.com/yusufkaraaslan/Skill_Seekers |
@@ -55,7 +57,7 @@ This file provides essential guidance for AI coding agents working with the Skil
```
/mnt/1ece809a-2821-4f10-aecb-fcdf34760c0b/Git/Skill_Seekers/
├── src/skill_seekers/ # Main source code (src/ layout)
│ ├── cli/ # CLI tools and commands (~42k lines)
│ ├── cli/ # CLI tools and commands (~70 modules)
│ │ ├── adaptors/ # Platform adaptors (Strategy pattern)
│ │ │ ├── base.py # Abstract base class (SkillAdaptor)
│ │ │ ├── claude.py # Claude AI adaptor
@@ -70,12 +72,6 @@ This file provides essential guidance for AI coding agents working with the Skil
│ │ │ ├── qdrant.py # Qdrant vector DB adaptor
│ │ │ ├── weaviate.py # Weaviate vector DB adaptor
│ │ │ └── streaming_adaptor.py # Streaming output adaptor
│ │ ├── storage/ # Cloud storage backends
│ │ │ ├── base_storage.py # Storage interface
│ │ │ ├── s3_storage.py # AWS S3 support
│ │ │ ├── gcs_storage.py # Google Cloud Storage
│ │ │ └── azure_storage.py # Azure Blob Storage
│ │ ├── parsers/ # CLI argument parsers
│ │ ├── arguments/ # CLI argument definitions
│ │ ├── presets/ # Preset configuration management
│ │ ├── main.py # Unified CLI entry point
@@ -85,6 +81,7 @@ This file provides essential guidance for AI coding agents working with the Skil
│ │ ├── pdf_scraper.py # PDF extraction
│ │ ├── unified_scraper.py # Multi-source scraping
│ │ ├── codebase_scraper.py # Local codebase analysis
│ │ ├── enhance_command.py # AI enhancement command
│ │ ├── enhance_skill_local.py # AI enhancement (local mode)
│ │ ├── package_skill.py # Skill packager
│ │ ├── upload_skill.py # Upload to platforms
@@ -101,8 +98,8 @@ This file provides essential guidance for AI coding agents working with the Skil
│ │ ├── source_manager.py # Config source management
│ │ └── tools/ # MCP tool implementations
│ │ ├── config_tools.py # Configuration tools
│ │ ├── scraping_tools.py # Scraping tools
│ │ ├── packaging_tools.py # Packaging tools
│ │ ├── scraping_tools.py # Scraping tools
│ │ ├── source_tools.py # Source management tools
│ │ ├── splitting_tools.py # Config splitting tools
│ │ ├── vector_db_tools.py # Vector database tools
@@ -124,7 +121,7 @@ This file provides essential guidance for AI coding agents working with the Skil
│ ├── workflows/ # YAML workflow presets
│ ├── _version.py # Version information (reads from pyproject.toml)
│ └── __init__.py # Package init
├── tests/ # Test suite (98 test files)
├── tests/ # Test suite (101 test files)
├── configs/ # Preset configuration files
├── docs/ # Documentation (80+ markdown files)
│ ├── integrations/ # Platform integration guides
@@ -134,17 +131,6 @@ This file provides essential guidance for AI coding agents working with the Skil
│ ├── blog/ # Blog posts
│ └── roadmap/ # Roadmap documents
├── examples/ # Usage examples
│ ├── langchain-rag-pipeline/ # LangChain example
│ ├── llama-index-query-engine/ # LlamaIndex example
│ ├── pinecone-upsert/ # Pinecone example
│ ├── chroma-example/ # Chroma example
│ ├── weaviate-example/ # Weaviate example
│ ├── qdrant-example/ # Qdrant example
│ ├── faiss-example/ # FAISS example
│ ├── haystack-pipeline/ # Haystack example
│ ├── cursor-react-skill/ # Cursor IDE example
│ ├── windsurf-fastapi-context/ # Windsurf example
│ └── continue-dev-universal/ # Continue.dev example
├── .github/workflows/ # CI/CD workflows
├── pyproject.toml # Main project configuration
├── requirements.txt # Pinned dependencies
@@ -259,7 +245,7 @@ pytest tests/ -v -m "not slow and not integration"
### Test Architecture

- **98 test files** covering all features
- **101 test files** covering all features
- **1880+ tests** passing
- CI Matrix: Ubuntu + macOS, Python 3.10-3.12
- Test markers defined in `pyproject.toml`:
@@ -316,22 +302,19 @@ mypy src/skill_seekers --show-error-codes --pretty
- **Ignored rules:** E501, F541, ARG002, B007, I001, SIM114
- **Import sorting:** isort style with `skill_seekers` as first-party

### MyPy Configuration (from mypy.ini)
### MyPy Configuration (from pyproject.toml)

```ini
[mypy]
python_version = 3.10
warn_return_any = False
warn_unused_configs = True
disallow_untyped_defs = False
check_untyped_defs = True
ignore_missing_imports = True
no_implicit_optional = True
show_error_codes = True

# Gradual typing - be lenient for now
disallow_incomplete_defs = False
disallow_untyped_calls = False
```toml
[tool.mypy]
python_version = "3.10"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = false
disallow_incomplete_defs = false
check_untyped_defs = true
ignore_missing_imports = true
show_error_codes = true
pretty = true
```

### Code Conventions
@@ -662,17 +645,6 @@ Preset configs are in `configs/` directory:
- `astrovalley_unified.json` - Astrovalley
- `configs/integrations/` - Integration-specific configs

### Configuration Documentation

Preset configs are in `configs/` directory:
- `godot.json` - Godot Engine
- `blender.json` / `blender-unified.json` - Blender Engine
- `claude-code.json` - Claude Code
- `httpx_comprehensive.json` - HTTPX library
- `medusa-mercurjs.json` - Medusa/MercurJS
- `astrovalley_unified.json` - Astrovalley
- `configs/integrations/` - Integration-specific configs

---

## Key Dependencies
@@ -700,6 +672,8 @@ Preset configs are in `configs/` directory:
| `python-dotenv` | >=1.1.1 | Environment variables |
| `jsonschema` | >=4.25.1 | JSON validation |
| `PyYAML` | >=6.0 | YAML parsing |
| `langchain` | >=1.2.10 | LangChain integration |
| `llama-index` | >=0.14.15 | LlamaIndex integration |

### Optional Dependencies
@@ -852,4 +826,4 @@ Skill Seekers uses JSON configuration files to define scraping targets. Example
*This document is maintained for AI coding agents. For human contributors, see README.md and CONTRIBUTING.md.*

*Last updated: 2026-02-16*
*Last updated: 2026-02-24*