yusyus
b81d55fda0
feat(B2): add Microsoft Word (.docx) support
Implements ROADMAP task B2: full .docx scraping support via mammoth +
python-docx, producing SKILL.md + references/ output in the same layout
as the other source types.
New files:
- src/skill_seekers/cli/word_scraper.py — WordToSkillConverter class +
main() entry point (~600 lines); mammoth → BeautifulSoup pipeline;
handles headings, code detection (incl. monospace <p><br> blocks),
tables, images, metadata extraction
- src/skill_seekers/cli/arguments/word.py — add_word_arguments() +
WORD_ARGUMENTS dict
- src/skill_seekers/cli/parsers/word_parser.py — WordParser for unified
CLI parser registry
- tests/test_word_scraper.py — comprehensive test suite (~300 lines)
Modified files:
- src/skill_seekers/cli/main.py — registered "word" command module
- src/skill_seekers/cli/source_detector.py — .docx auto-detection +
_detect_word() classmethod
- src/skill_seekers/cli/create_command.py — _route_word() + --help-word
- src/skill_seekers/cli/arguments/create.py — WORD_ARGUMENTS + routing
- src/skill_seekers/cli/arguments/__init__.py — export word args
- src/skill_seekers/cli/parsers/__init__.py — register WordParser
- src/skill_seekers/cli/unified_scraper.py — _scrape_word() integration
- src/skill_seekers/cli/pdf_scraper.py — fix: real enhancement instead
of stub; remove [:3] reference file limit; capture run_workflows return
- src/skill_seekers/cli/github_scraper.py — fix: remove arbitrary
open_issues[:20] / closed_issues[:10] reference file limits
- pyproject.toml — skill-seekers-word entry point + docx optional dep
- tests/test_cli_parsers.py — update parser count 21→22
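The .docx auto-detection listed above can be pictured roughly as follows. This is a minimal sketch, not the code in source_detector.py: the function names detect_source_type() and looks_like_docx() are hypothetical, and the content check relies only on the fact that a .docx file is a ZIP archive containing word/document.xml.

```python
import io
import zipfile
from pathlib import Path
from typing import BinaryIO, Optional, Union


def detect_source_type(source: str) -> Optional[str]:
    """Hypothetical helper: map a path to a source type by its extension."""
    suffix = Path(source).suffix.lower()
    if suffix == ".docx":
        return "word"
    return None


def looks_like_docx(file: Union[str, BinaryIO]) -> bool:
    """Hypothetical helper: check that the file is a ZIP with a Word body part."""
    if not zipfile.is_zipfile(file):
        return False
    with zipfile.ZipFile(file) as archive:
        return "word/document.xml" in archive.namelist()


if __name__ == "__main__":
    # Build a tiny in-memory archive that has the part a real .docx contains.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
    buffer.seek(0)
    print(detect_source_type("Report.docx"), looks_like_docx(buffer))
```

Checking the archive contents in addition to the extension is an optional extra; the commit message only states that .docx sources are auto-detected.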
Bug fixes applied during real-world testing:
- Code detection: detect monospace <p><br> blocks as code (mammoth
renders Courier paragraphs this way, not as <pre>/<code>)
- Language detector: fix wrong method name detect_from_text →
detect_from_code
- Description inference: pass None from main() so extract_docx() can
infer description from Word document subject/title metadata
- Bullet-point guard: exclude prose starting with •/-/* from code scoring
- Enhancement: implement real API/LOCAL enhancement (was stub)
- pip install message: add quotes around skill-seekers[docx]
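The code-detection and bullet-point fixes above can be illustrated with a small sketch. It is an assumption-laden approximation, not the logic in word_scraper.py: the real pipeline parses mammoth's HTML with BeautifulSoup, while this version uses only the standard library, and the names looks_like_code(), CODE_HINT_RE, and the 0.5 threshold are hypothetical.

```python
import html
import re

BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
# Lines that start with a bullet marker (•, -, *) followed by whitespace.
BULLET_RE = re.compile(r"^\s*[•\-*]\s+")
# Hypothetical code hints: common punctuation or a leading keyword.
CODE_HINT_RE = re.compile(
    r"[{}();=]|^\s*(?:def|class|import|from|return|if|for|while)\b"
)


def paragraph_lines(p_html: str) -> list:
    """Split one <p> element on <br> and strip the remaining tags."""
    parts = BR_RE.split(p_html)
    lines = [html.unescape(TAG_RE.sub("", part)).rstrip() for part in parts]
    return [line for line in lines if line.strip()]


def looks_like_code(p_html: str, min_lines: int = 2, threshold: float = 0.5) -> bool:
    """Score a <p><br> block; bullet lines are excluded from scoring."""
    lines = paragraph_lines(p_html)
    if len(lines) < min_lines:
        return False
    # Bullet-point guard: prose list items never count toward the code score.
    scored = [line for line in lines if not BULLET_RE.match(line)]
    if not scored:
        return False
    hits = sum(1 for line in scored if CODE_HINT_RE.search(line))
    return hits / len(scored) >= threshold


if __name__ == "__main__":
    print(looks_like_code("<p>def add(a, b):<br />    return a + b</p>"))
    print(looks_like_code("<p>• Install it; then run it<br />• Open the file (any editor)</p>"))
```

Without the guard, the second paragraph would score as code because of the `;` and `(` characters, which is the kind of false positive the bullet-point fix addresses.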
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 21:47:30 +03:00