Unified Document Parsers Architecture
Overview
The Unified Document Parser system provides a standardized interface for extracting structured content from multiple document formats. As of v3.2.0, the system supports 17 source types through registered parsers and scraper modules. It replaces format-specific extraction logic with a common data model and extensible parser framework.
Architecture Goals
- Standardization: All parsers output the same Document structure
- Extensibility: Easy to add new formats via the scraper pattern (17 source types and growing)
- Quality: Built-in quality scoring for extracted content
- Backward Compatibility: Legacy parsers remain functional during migration
Core Components
1. Data Model Layer
File: src/skill_seekers/cli/parsers/extractors/unified_structure.py
┌─────────────────────────────────────────────────────────────┐
│ Document │
├─────────────────────────────────────────────────────────────┤
│ title: str │
│ format: str │
│ source_path: str │
├─────────────────────────────────────────────────────────────┤
│ blocks: List[ContentBlock] # All content blocks │
│ headings: List[Heading] # Extracted from blocks │
│ code_blocks: List[CodeBlock] # Extracted from blocks │
│ tables: List[Table] # Extracted from blocks │
│ images: List[Image] # Extracted from blocks │
├─────────────────────────────────────────────────────────────┤
│ internal_links: List[CrossReference] # :ref:, #anchor │
│ external_links: List[CrossReference] # URLs │
├─────────────────────────────────────────────────────────────┤
│ meta: Dict[str, Any] # Frontmatter, metadata │
│ stats: ExtractionStats # Processing metrics │
└─────────────────────────────────────────────────────────────┘
ContentBlock
The universal content container:
@dataclass
class ContentBlock:
    type: ContentBlockType          # HEADING, PARAGRAPH, CODE_BLOCK, etc.
    content: str                    # Raw text content
    metadata: Dict[str, Any]        # Type-specific data
    source_line: Optional[int]      # Line number in source
    quality_score: Optional[float]  # 0-10 quality rating
ContentBlockType Enum:
- HEADING - Section titles
- PARAGRAPH - Text content
- CODE_BLOCK - Code snippets
- TABLE - Tabular data
- LIST - Bullet/numbered lists
- IMAGE - Image references
- CROSS_REFERENCE - Internal links
- DIRECTIVE - RST directives
- FIELD_LIST - Parameter documentation
- DEFINITION_LIST - Term/definition pairs
- ADMONITION - Notes, warnings, tips
- META - Metadata fields
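As an illustration of how the enum and dataclass fit together, here is a minimal sketch. The declarations below are a trimmed re-statement for demonstration, not an import of the real module; field values are invented:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional

class ContentBlockType(Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    CODE_BLOCK = "code_block"

@dataclass
class ContentBlock:
    type: ContentBlockType
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    source_line: Optional[int] = None
    quality_score: Optional[float] = None

# A parser emits one flat list of blocks; specialized lists
# (headings, code_blocks, tables, ...) are derived from it afterwards.
blocks = [
    ContentBlock(ContentBlockType.HEADING, "Installation", {"level": 2}, source_line=1),
    ContentBlock(ContentBlockType.CODE_BLOCK, "pip install skill-seekers",
                 {"language": "bash"}, source_line=3, quality_score=8.5),
]
headings = [b for b in blocks if b.type is ContentBlockType.HEADING]
```

Keeping one flat, ordered block list as the source of truth is what lets every format share the same downstream extraction and formatting code.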
Specialized Data Classes
Table:
@dataclass
class Table:
    rows: List[List[str]]           # 2D cell array
    headers: Optional[List[str]]
    caption: Optional[str]
    source_format: str              # 'simple', 'grid', 'list-table'
CodeBlock:
@dataclass
class CodeBlock:
    code: str
    language: Optional[str]
    quality_score: Optional[float]
    confidence: Optional[float]     # Language detection confidence
    is_valid: Optional[bool]        # Syntax validation
CrossReference:
@dataclass
class CrossReference:
    ref_type: CrossRefType          # REF, DOC, CLASS, METH, etc.
    target: str                     # Target ID/URL
    text: Optional[str]             # Display text
2. Parser Interface Layer
File: src/skill_seekers/cli/parsers/extractors/base_parser.py
┌─────────────────────────────────────────────────────────────┐
│ BaseParser (Abstract) │
├─────────────────────────────────────────────────────────────┤
│ + format_name: str │
│ + supported_extensions: List[str] │
├─────────────────────────────────────────────────────────────┤
│ + parse(source) -> ParseResult │
│ + parse_file(path) -> ParseResult │
│ + parse_string(content) -> ParseResult │
│ # _parse_content(content, path) -> Document │
│ # _detect_format(content) -> bool │
└─────────────────────────────────────────────────────────────┘
ParseResult:
@dataclass
class ParseResult:
    document: Optional[Document]
    success: bool
    errors: List[str]
    warnings: List[str]
3. Parser Implementations
RST Parser
File: src/skill_seekers/cli/parsers/extractors/rst_parser.py
Supported Constructs:
- Headers (underline style: ====, ----)
- Code blocks (.. code-block:: language)
- Tables (simple, grid, list-table)
- Cross-references (:ref:, :class:, :meth:, :func:, :attr:)
- Directives (.. note::, .. warning::, .. deprecated::)
- Field lists (:param:, :returns:, :type:)
- Definition lists
- Substitutions (|name|)
- Toctree (.. toctree::)
Parsing Strategy:
- First pass: Collect substitution definitions
- Second pass: Parse block-level constructs
- Post-process: Extract specialized content lists
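Substitutions are why the first pass exists: a |name| reference may appear before its definition, so definitions must be collected up front. A minimal sketch of that two-pass idea, assuming nothing about the real parser's internals:

```python
import re

def collect_substitutions(lines):
    """First pass: gather `.. |name| replace:: value` definitions."""
    subs = {}
    for line in lines:
        m = re.match(r"\.\.\s+\|(\w+)\|\s+replace::\s+(.*)", line)
        if m:
            subs[m.group(1)] = m.group(2).strip()
    return subs

def expand_substitutions(text, subs):
    """Second pass helper: expand |name| references in body text."""
    return re.sub(r"\|(\w+)\|", lambda m: subs.get(m.group(1), m.group(0)), text)

src = [".. |proj| replace:: Skill Seekers", "Welcome to |proj|!"]
subs = collect_substitutions(src)
expanded = expand_substitutions(src[1], subs)
```

Unknown substitution names are left untouched so the second pass never loses text it cannot resolve.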
Markdown Parser
File: src/skill_seekers/cli/parsers/extractors/markdown_parser.py
Supported Constructs:
- Headers (ATX: #, Setext: underline)
- Code blocks (fenced: ```)
- Tables (GitHub-flavored)
- Lists (bullet, numbered)
- Admonitions (GitHub-style: > [!NOTE])
- Images and links
- Frontmatter (YAML metadata)
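Frontmatter handling boils down to splitting a leading `---`-delimited YAML block from the Markdown body before block-level parsing begins. A self-contained sketch of that split (the real parser would then hand the YAML text to a YAML loader):

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Separate a leading YAML frontmatter block from the Markdown body."""
    if text.startswith("---\n"):
        end = text.find("\n---\n", 4)
        if end != -1:
            # Raw YAML text between the two delimiter lines, then the body.
            return text[4:end], text[end + 5:]
    return "", text

doc = "---\ntitle: Parsers\ntags: [docs]\n---\n# Heading\nBody text.\n"
meta, body = split_frontmatter(doc)
```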
PDF Parser
File: src/skill_seekers/cli/pdf_scraper.py
Status: Integrated. Extracts text, tables, images, and code blocks from PDF files. Supports OCR for scanned documents.
Additional Registered Parsers (v3.2.0)
The following source types each have a dedicated scraper module registered in parsers/__init__.py (PARSERS list), main.py (COMMAND_MODULES dict), and config_validator.py (VALID_SOURCE_TYPES set):
| # | Source Type | Scraper Module | Parser Registration |
|---|---|---|---|
| 1 | Documentation (web) | doc_scraper.py | documentation |
| 2 | GitHub repo | github_scraper.py | github |
| 3 | PDF | pdf_scraper.py | pdf |
| 4 | Word (.docx) | word_scraper.py | word |
| 5 | EPUB | epub_scraper.py | epub |
| 6 | Video | video_scraper.py | video |
| 7 | Local codebase | codebase_scraper.py | local |
| 8 | Jupyter Notebook | jupyter_scraper.py | jupyter |
| 9 | Local HTML | html_scraper.py | html |
| 10 | OpenAPI/Swagger | openapi_scraper.py | openapi |
| 11 | AsciiDoc | asciidoc_scraper.py | asciidoc |
| 12 | PowerPoint | pptx_scraper.py | pptx |
| 13 | RSS/Atom | rss_scraper.py | rss |
| 14 | Man pages | manpage_scraper.py | manpage |
| 15 | Confluence | confluence_scraper.py | confluence |
| 16 | Notion | notion_scraper.py | notion |
| 17 | Slack/Discord | chat_scraper.py | chat |
Each scraper follows the same pattern: a <Type>ToSkillConverter class with a main() function, registered in three places (see CONTRIBUTING.md for the full scraper pattern).
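The scraper pattern can be sketched as follows. This is a hypothetical skeleton for illustration only; the real modules live under src/skill_seekers/cli/, and the exact converter interface is documented in CONTRIBUTING.md:

```python
import argparse

class RssToSkillConverter:
    """<Type>ToSkillConverter sketch: turn one source into scraped_data.

    The class name, method names, and output keys here are illustrative
    assumptions, not the project's actual API.
    """

    def __init__(self, source: str):
        self.source = source

    def convert(self) -> dict:
        # A real scraper would fetch and parse the feed here;
        # this stub returns only the output shape.
        return {"source_type": "rss", "source": self.source, "sections": []}

def main(argv=None):
    parser = argparse.ArgumentParser(description="RSS/Atom scraper (sketch)")
    parser.add_argument("source", help="Feed URL or file path")
    args = parser.parse_args(argv)
    return RssToSkillConverter(args.source).convert()

data = main(["https://example.com/feed.xml"])
```

The three registration points (PARSERS, COMMAND_MODULES, VALID_SOURCE_TYPES) then only need the module name and the source-type string.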
Generic Merge System
File: src/skill_seekers/cli/unified_skill_builder.py
The unified_skill_builder.py handles multi-source merging:
- Pairwise synthesis: Optimized merge for common combos (docs+github, docs+pdf, github+pdf)
- Generic merge (_generic_merge()): Handles all other source type combinations (e.g., docs+jupyter+confluence) by normalizing each source's scraped_data into a common structure and merging sections
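The normalize-then-merge idea can be sketched like this. The function and key names below are illustrative stand-ins, not the actual _generic_merge() implementation:

```python
def _normalize(scraped: dict) -> dict:
    """Coerce one source's scraped_data into a common shape (sketch)."""
    return {
        "source_type": scraped.get("source_type", "unknown"),
        # Different scrapers may call their content "sections" or "pages";
        # normalization hides that difference from the merge step.
        "sections": scraped.get("sections") or scraped.get("pages") or [],
    }

def generic_merge(sources: list[dict]) -> dict:
    """Illustrative merge: concatenate normalized sections across sources."""
    merged = {"source_types": [], "sections": []}
    for scraped in sources:
        norm = _normalize(scraped)
        merged["source_types"].append(norm["source_type"])
        merged["sections"].extend(norm["sections"])
    return merged

combo = generic_merge([
    {"source_type": "jupyter", "sections": [{"title": "Notebook 1"}]},
    {"source_type": "confluence", "pages": [{"title": "Team wiki"}]},
])
```

Because merging operates only on the normalized shape, any new source type that produces scraped_data participates in multi-source builds for free.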
4. Quality Scoring Layer
File: src/skill_seekers/cli/parsers/extractors/quality_scorer.py
Code Quality Factors:
- Language detection confidence
- Code length appropriateness
- Line count
- Keyword density
- Syntax pattern matching
- Bracket balance
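A toy heuristic shows how a few of these factors can combine into a 0-10 score. The weights and thresholds below are invented for illustration; the real scorer's formula may differ:

```python
def score_code_block(code: str, language_confidence: float) -> float:
    """Toy heuristic combining detection confidence, length, and bracket balance."""
    score = 5.0
    score += 2.0 * language_confidence          # language detection confidence
    lines = code.strip().splitlines()
    if 2 <= len(lines) <= 200:                  # appropriate length
        score += 1.0
    for open_ch, close_ch in ("()", "[]", "{}"):
        if code.count(open_ch) != code.count(close_ch):  # bracket balance
            score -= 2.0
    return max(0.0, min(10.0, score))

good = score_code_block("def f(x):\n    return x + 1\n", language_confidence=0.9)
bad = score_code_block("def f(x:\n", language_confidence=0.2)  # unbalanced paren
```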
Table Quality Factors:
- Has headers
- Consistent column count
- Reasonable size
- Non-empty cells
- Has caption
5. Output Formatter Layer
File: src/skill_seekers/cli/parsers/extractors/formatters.py
MarkdownFormatter:
- Converts Document to Markdown
- Handles all ContentBlockType variants
- Configurable options (TOC, max heading level, etc.)
SkillFormatter:
- Converts Document to skill-seekers internal format
- Compatible with existing skill pipelines
Integration Points
1. Codebase Scraper
File: src/skill_seekers/cli/codebase_scraper.py
# Enhanced RST extraction
def extract_rst_structure(content: str) -> dict:
    parser = RstParser()
    result = parser.parse_string(content)
    if result.success:
        return result.document.to_legacy_format()
    # Fallback to legacy parser
2. Doc Scraper
File: src/skill_seekers/cli/doc_scraper.py
# Enhanced Markdown extraction
def _extract_markdown_content(self, content, url):
    parser = MarkdownParser()
    result = parser.parse_string(content, url)
    if result.success:
        doc = result.document
        return {
            "title": doc.title,
            "headings": [...],
            "code_samples": [...],
            "_enhanced": True,
        }
    # Fallback to legacy extraction
Usage Patterns
Basic Parsing
from skill_seekers.cli.parsers.extractors import RstParser

parser = RstParser()
result = parser.parse_file("docs/class_node.rst")
if result.success:
    doc = result.document
    print(f"Title: {doc.title}")
    print(f"Tables: {len(doc.tables)}")
Auto-Detection
from skill_seekers.cli.parsers.extractors import parse_document
result = parse_document("file.rst") # Auto-detects format
# or
result = parse_document(content, format_hint="rst")
Format Conversion
# To Markdown
markdown = doc.to_markdown()
# To Skill format
skill_data = doc.to_skill_format()
# To legacy format (backward compatibility)
legacy = doc.to_legacy_format()  # Compatible with old structure
API Documentation Extraction
# Extract structured API info
api_summary = doc.get_api_summary()
# Returns:
# {
# "properties": [{"name": "position", "type": "Vector2", ...}],
# "methods": [{"name": "_ready", "returns": "void", ...}],
# "signals": [{"name": "ready", ...}]
# }
Extending the System
Adding a New Parser
- Create parser class:
class HtmlParser(BaseParser):
    @property
    def format_name(self) -> str:
        return "html"

    @property
    def supported_extensions(self) -> list[str]:
        return [".html", ".htm"]

    def _parse_content(self, content: str, source_path: str) -> Document:
        # Parse HTML to Document
        pass
- Register in __init__.py:
from .html_parser import HtmlParser
__all__ = [..., "HtmlParser"]
- Add tests:
def test_html_parser():
    parser = HtmlParser()
    result = parser.parse_string("<h1>Title</h1>")
    assert result.document.title == "Title"
Testing Strategy
Unit Tests
Test individual parsers with various constructs:
- test_rst_parser.py - RST-specific features
- test_markdown_parser.py - Markdown-specific features
- test_quality_scorer.py - Quality scoring
Integration Tests
Test integration with existing scrapers:
- test_codebase_scraper.py - RST file processing
- test_doc_scraper.py - Markdown web content
Backward Compatibility Tests
Verify new parsers match old output:
- Same field names in output dicts
- Same content extraction (plus more)
- Legacy fallback works
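The "same field names, plus more" guarantee reduces to a key-superset check. A sketch of such a test, using stand-in extractor stubs (the real tests compare actual legacy and unified outputs):

```python
# Hypothetical stubs standing in for the legacy and unified extraction paths.
def legacy_extract(content: str) -> dict:
    return {"title": "Node", "headings": ["Methods"], "code_samples": []}

def unified_extract(content: str) -> dict:
    # The new path may add fields (e.g. "_enhanced") but must keep legacy keys.
    return {"title": "Node", "headings": ["Methods"], "code_samples": [],
            "_enhanced": True}

old = legacy_extract("...")
new = unified_extract("...")
missing = set(old) - set(new)  # must be empty for backward compatibility
```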
Performance Considerations
Current Performance
- RST Parser: ~1-2ms per 1000 lines
- Markdown Parser: ~1ms per 1000 lines
- Quality Scoring: Adds ~10% overhead
Optimization Opportunities
- Caching: Cache parsed documents by hash
- Parallel Processing: Parse multiple files concurrently
- Lazy Evaluation: Only extract requested content types
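Caching by content hash is the simplest of these to sketch: hash the raw text, and reuse the parsed result when the same content is seen again. A minimal in-memory version (the project might instead use Redis or a disk cache, as noted under Future Enhancements):

```python
import hashlib

_cache: dict[str, dict] = {}

def parse_with_cache(content: str, parse_fn) -> dict:
    """Memoize parse results keyed by a SHA-256 of the content."""
    key = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = parse_fn(content)
    return _cache[key]

calls = []
def fake_parse(content: str) -> dict:
    calls.append(content)  # record how often the expensive path runs
    return {"title": content.split("\n", 1)[0]}

first = parse_with_cache("Title\nbody", fake_parse)
second = parse_with_cache("Title\nbody", fake_parse)  # served from cache
```

Hash-keying means the cache stays correct even when the same file path is re-read with changed content.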
Migration Guide
From Legacy Parsers
Before:
from skill_seekers.cli.codebase_scraper import extract_rst_structure
structure = extract_rst_structure(content)
After:
from skill_seekers.cli.parsers.extractors import RstParser
parser = RstParser()
result = parser.parse_string(content)
structure = result.document.to_skill_format()
Backward Compatibility
The enhanced extract_rst_structure() function:
- Tries unified parser first
- Falls back to legacy parser on failure
- Returns same dict structure
Future Enhancements
- Caching Layer: Redis/disk cache for parsed docs
- Streaming: Parse large files incrementally
- Validation: JSON Schema validation for output
- Additional formats: As new source types are added, they follow the same parser registration pattern
Last Updated: 2026-03-15
Version: 2.0.0 (updated for 17 source types)