docs: remove stale UNIFIED_PARSERS.md superseded by UML architecture
The parsers architecture is now fully documented in the StarUML project (Docs/UML/skill_seekers.mdj) with the Parsers class diagram showing all 28 SubcommandParser subclasses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,434 +0,0 @@
# Unified Document Parsers Architecture

## Overview

The Unified Document Parser system provides a standardized interface for extracting structured content from multiple document formats. As of v3.2.0, the system supports **17 source types** through registered parsers and scraper modules. It replaces format-specific extraction logic with a common data model and extensible parser framework.

## Architecture Goals

1. **Standardization**: All parsers output the same `Document` structure
2. **Extensibility**: Easy to add new formats via the scraper pattern (17 source types and growing)
3. **Quality**: Built-in quality scoring for extracted content
4. **Backward Compatibility**: Legacy parsers remain functional during migration

## Core Components

### 1. Data Model Layer

**File**: `src/skill_seekers/cli/parsers/extractors/unified_structure.py`

```
┌─────────────────────────────────────────────────────────────┐
│                          Document                           │
├─────────────────────────────────────────────────────────────┤
│ title: str                                                  │
│ format: str                                                 │
│ source_path: str                                            │
├─────────────────────────────────────────────────────────────┤
│ blocks: List[ContentBlock]    # All content blocks          │
│ headings: List[Heading]       # Extracted from blocks       │
│ code_blocks: List[CodeBlock]  # Extracted from blocks       │
│ tables: List[Table]           # Extracted from blocks       │
│ images: List[Image]           # Extracted from blocks       │
├─────────────────────────────────────────────────────────────┤
│ internal_links: List[CrossReference]  # :ref:, #anchor      │
│ external_links: List[CrossReference]  # URLs                │
├─────────────────────────────────────────────────────────────┤
│ meta: Dict[str, Any]    # Frontmatter, metadata             │
│ stats: ExtractionStats  # Processing metrics                │
└─────────────────────────────────────────────────────────────┘
```

#### ContentBlock

The universal content container:

```python
@dataclass
class ContentBlock:
    type: ContentBlockType          # HEADING, PARAGRAPH, CODE_BLOCK, etc.
    content: str                    # Raw text content
    metadata: Dict[str, Any]        # Type-specific data
    source_line: Optional[int]      # Line number in source
    quality_score: Optional[float]  # 0-10 quality rating
```

**ContentBlockType Enum**:
- `HEADING` - Section titles
- `PARAGRAPH` - Text content
- `CODE_BLOCK` - Code snippets
- `TABLE` - Tabular data
- `LIST` - Bullet/numbered lists
- `IMAGE` - Image references
- `CROSS_REFERENCE` - Internal links
- `DIRECTIVE` - RST directives
- `FIELD_LIST` - Parameter documentation
- `DEFINITION_LIST` - Term/definition pairs
- `ADMONITION` - Notes, warnings, tips
- `META` - Metadata fields

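
As a rough illustration of how the block types relate to the derived lists on `Document`, the sketch below spells out the enum listed above and filters a block list by type. The helper `blocks_of_type` and the trimmed-down `ContentBlock` are assumptions made for this example; the real module may build `headings`, `code_blocks`, and the other lists differently.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Dict, List


class ContentBlockType(Enum):
    """Block kinds listed in this document; member values are arbitrary."""
    HEADING = auto()
    PARAGRAPH = auto()
    CODE_BLOCK = auto()
    TABLE = auto()
    LIST = auto()
    IMAGE = auto()
    CROSS_REFERENCE = auto()
    DIRECTIVE = auto()
    FIELD_LIST = auto()
    DEFINITION_LIST = auto()
    ADMONITION = auto()
    META = auto()


@dataclass
class ContentBlock:
    """Trimmed-down stand-in for the documented dataclass."""
    type: ContentBlockType
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def blocks_of_type(blocks: List[ContentBlock], kind: ContentBlockType) -> List[ContentBlock]:
    """Hypothetical helper: keep only the blocks of one kind, in document order."""
    return [block for block in blocks if block.type is kind]


blocks = [
    ContentBlock(ContentBlockType.HEADING, "Overview", {"level": 1}),
    ContentBlock(ContentBlockType.PARAGRAPH, "Some text."),
    ContentBlock(ContentBlockType.CODE_BLOCK, "print('hi')", {"language": "python"}),
]
headings = blocks_of_type(blocks, ContentBlockType.HEADING)
```

Keeping `blocks` as the single source of truth and deriving the typed lists from it means the lists cannot drift out of sync with the document order.
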
#### Specialized Data Classes

**Table**:
```python
@dataclass
class Table:
    rows: List[List[str]]  # 2D cell array
    headers: Optional[List[str]]
    caption: Optional[str]
    source_format: str     # 'simple', 'grid', 'list-table'
```

**CodeBlock**:
```python
@dataclass
class CodeBlock:
    code: str
    language: Optional[str]
    quality_score: Optional[float]
    confidence: Optional[float]  # Language detection confidence
    is_valid: Optional[bool]     # Syntax validation
```

**CrossReference**:
```python
@dataclass
class CrossReference:
    ref_type: CrossRefType  # REF, DOC, CLASS, METH, etc.
    target: str             # Target ID/URL
    text: Optional[str]     # Display text
```

### 2. Parser Interface Layer

**File**: `src/skill_seekers/cli/parsers/extractors/base_parser.py`

```
┌─────────────────────────────────────────────────────────────┐
│                    BaseParser (Abstract)                    │
├─────────────────────────────────────────────────────────────┤
│ + format_name: str                                          │
│ + supported_extensions: List[str]                           │
├─────────────────────────────────────────────────────────────┤
│ + parse(source) -> ParseResult                              │
│ + parse_file(path) -> ParseResult                           │
│ + parse_string(content) -> ParseResult                      │
│ # _parse_content(content, path) -> Document                 │
│ # _detect_format(content) -> bool                           │
└─────────────────────────────────────────────────────────────┘
```

**ParseResult**:
```python
@dataclass
class ParseResult:
    document: Optional[Document]
    success: bool
    errors: List[str]
    warnings: List[str]
```

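
The interface above reads like a template method: public `parse_*` entry points wrap a format-specific `_parse_content` hook and report the outcome in a `ParseResult`. The sketch below shows one way that could look. The error-handling policy (catching exceptions and returning them in `errors`), the optional `source_path` argument, the reduced `Document`, and the toy `FirstLineParser` are all assumptions for illustration, not the project's actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class Document:
    """Minimal stand-in for the documented Document model."""
    title: str = ""
    format: str = ""
    source_path: str = ""


@dataclass
class ParseResult:
    document: Optional[Document] = None
    success: bool = False
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)


class BaseParser(ABC):
    @property
    @abstractmethod
    def format_name(self) -> str: ...

    @abstractmethod
    def _parse_content(self, content: str, source_path: str) -> Document: ...

    def parse_string(self, content: str, source_path: str = "<string>") -> ParseResult:
        # Assumed policy: failures are reported in the result, not raised.
        try:
            return ParseResult(document=self._parse_content(content, source_path), success=True)
        except Exception as exc:
            return ParseResult(errors=[f"{self.format_name}: {exc}"])

    def parse_file(self, path: str) -> ParseResult:
        try:
            content = Path(path).read_text(encoding="utf-8")
        except OSError as exc:
            return ParseResult(errors=[str(exc)])
        return self.parse_string(content, str(path))


class FirstLineParser(BaseParser):
    """Hypothetical toy parser: the first non-empty line becomes the title."""

    @property
    def format_name(self) -> str:
        return "toy"

    def _parse_content(self, content: str, source_path: str) -> Document:
        lines = [line.strip() for line in content.splitlines() if line.strip()]
        if not lines:
            raise ValueError("empty document")
        return Document(title=lines[0], format=self.format_name, source_path=source_path)


ok = FirstLineParser().parse_string("Hello\nworld")
failed = FirstLineParser().parse_string("   ")
```

With this shape, callers only ever check `result.success`, which is what the integration snippets later in this document rely on.
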
### 3. Parser Implementations

#### RST Parser

**File**: `src/skill_seekers/cli/parsers/extractors/rst_parser.py`

**Supported Constructs**:
- Headers (underline style: `====`, `----`)
- Code blocks (`.. code-block:: language`)
- Tables (simple, grid, list-table)
- Cross-references (`:ref:`, `:class:`, `:meth:`, `:func:`, `:attr:`)
- Directives (`.. note::`, `.. warning::`, `.. deprecated::`)
- Field lists (`:param:`, `:returns:`, `:type:`)
- Definition lists
- Substitutions (`|name|`)
- Toctree (`.. toctree::`)

**Parsing Strategy**:
1. First pass: Collect substitution definitions
2. Second pass: Parse block-level constructs
3. Post-process: Extract specialized content lists

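
The reason for two passes is that a substitution can be referenced before it is defined, so definitions have to be known before any text is expanded. A minimal sketch of that idea follows. It handles only `replace::` substitutions and plain line-by-line expansion; the function names and regular expressions are illustrative assumptions, and the block-level parsing and post-processing steps are left out.

```python
import re
from typing import Dict, List

# Matches a definition such as: .. |name| replace:: replacement text
SUBSTITUTION_DEF = re.compile(r"^\.\.\s+\|([^|]+)\|\s+replace::\s*(.*)$")
# Matches a reference such as: |name|
SUBSTITUTION_REF = re.compile(r"\|([^|\s][^|]*)\|")


def collect_substitutions(lines: List[str]) -> Dict[str, str]:
    """First pass: map each substitution name to its replacement text."""
    definitions = {}
    for line in lines:
        match = SUBSTITUTION_DEF.match(line)
        if match:
            definitions[match.group(1)] = match.group(2).strip()
    return definitions


def apply_substitutions(lines: List[str], definitions: Dict[str, str]) -> List[str]:
    """Second pass: drop definition lines and expand known references."""
    expanded = []
    for line in lines:
        if SUBSTITUTION_DEF.match(line):
            continue
        expanded.append(
            SUBSTITUTION_REF.sub(lambda m: definitions.get(m.group(1), m.group(0)), line)
        )
    return expanded


source = [
    "Welcome to |project|.",
    ".. |project| replace:: Skill Seekers",
    "Unknown |other| stays as written.",
]
definitions = collect_substitutions(source)
result = apply_substitutions(source, definitions)
```

Note that the reference on the first line is resolved even though its definition appears on the second line, which a single pass could not do.
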
#### Markdown Parser

**File**: `src/skill_seekers/cli/parsers/extractors/markdown_parser.py`

**Supported Constructs**:
- Headers (ATX: `#`, Setext: underline)
- Code blocks (fenced: ```` ``` ````)
- Tables (GitHub-flavored)
- Lists (bullet, numbered)
- Admonitions (GitHub-style: `> [!NOTE]`)
- Images and links
- Frontmatter (YAML metadata)

#### PDF Parser

**File**: `src/skill_seekers/cli/pdf_scraper.py`

**Status**: Integrated. Extracts text, tables, images, and code blocks from PDF files. Supports OCR for scanned documents.

#### Additional Registered Parsers (v3.2.0)

The following source types each have a dedicated scraper module registered in `parsers/__init__.py` (PARSERS list), `main.py` (COMMAND_MODULES dict), and `config_validator.py` (VALID_SOURCE_TYPES set):

| # | Source Type | Scraper Module | Parser Registration |
|---|------------|---------------|---------------------|
| 1 | Documentation (web) | `doc_scraper.py` | `documentation` |
| 2 | GitHub repo | `github_scraper.py` | `github` |
| 3 | PDF | `pdf_scraper.py` | `pdf` |
| 4 | Word (.docx) | `word_scraper.py` | `word` |
| 5 | EPUB | `epub_scraper.py` | `epub` |
| 6 | Video | `video_scraper.py` | `video` |
| 7 | Local codebase | `codebase_scraper.py` | `local` |
| 8 | Jupyter Notebook | `jupyter_scraper.py` | `jupyter` |
| 9 | Local HTML | `html_scraper.py` | `html` |
| 10 | OpenAPI/Swagger | `openapi_scraper.py` | `openapi` |
| 11 | AsciiDoc | `asciidoc_scraper.py` | `asciidoc` |
| 12 | PowerPoint | `pptx_scraper.py` | `pptx` |
| 13 | RSS/Atom | `rss_scraper.py` | `rss` |
| 14 | Man pages | `manpage_scraper.py` | `manpage` |
| 15 | Confluence | `confluence_scraper.py` | `confluence` |
| 16 | Notion | `notion_scraper.py` | `notion` |
| 17 | Slack/Discord | `chat_scraper.py` | `chat` |

Each scraper follows the same pattern: a `<Type>ToSkillConverter` class with a `main()` function, registered in three places (see [CONTRIBUTING.md](../../CONTRIBUTING.md) for the full scraper pattern).

#### Generic Merge System

**File**: `src/skill_seekers/cli/unified_skill_builder.py`

The `unified_skill_builder.py` handles multi-source merging:
- **Pairwise synthesis**: Optimized merge for common combos (docs+github, docs+pdf, github+pdf)
- **Generic merge** (`_generic_merge()`): Handles all other source type combinations (e.g., docs+jupyter+confluence) by normalizing each source's `scraped_data` into a common structure and merging sections

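
To make the "normalize, then merge sections" idea concrete, here is a small sketch. The shape of `scraped_data` (a `pages` list of dicts with `title` and `content`), the section dict, and both function names are assumptions invented for this example; the actual `_generic_merge()` may normalize and combine sources quite differently.

```python
from typing import Any, Dict, List


def normalize_source(source_type: str, scraped_data: Dict[str, Any]) -> List[Dict[str, str]]:
    """Reduce one source's data to a flat list of sections (assumed shape)."""
    sections = []
    for page in scraped_data.get("pages", []):
        sections.append({
            "title": page.get("title", "Untitled"),
            "content": page.get("content", ""),
            "source": source_type,
        })
    return sections


def generic_merge(sources: Dict[str, Dict[str, Any]]) -> Dict[str, List[Dict[str, str]]]:
    """Group normalized sections by title so that overlapping topics sit together."""
    merged: Dict[str, List[Dict[str, str]]] = {}
    for source_type, scraped_data in sources.items():
        for section in normalize_source(source_type, scraped_data):
            merged.setdefault(section["title"], []).append(section)
    return merged


merged = generic_merge({
    "documentation": {"pages": [{"title": "Install", "content": "pip install ..."}]},
    "jupyter": {"pages": [{"title": "Install", "content": "%pip install ..."},
                          {"title": "Demo", "content": "Run all cells."}]},
})
```

Because every source is reduced to the same section shape first, the merge step does not need to know which combination of source types it was given.
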
### 4. Quality Scoring Layer

**File**: `src/skill_seekers/cli/parsers/extractors/quality_scorer.py`

**Code Quality Factors**:
- Language detection confidence
- Code length appropriateness
- Line count
- Keyword density
- Syntax pattern matching
- Bracket balance

**Table Quality Factors**:
- Has headers
- Consistent column count
- Reasonable size
- Non-empty cells
- Has caption

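
Two of these factors are easy to show in isolation. The sketch below checks bracket balance for a code snippet and combines several of the table factors into a 0-10 score. The weights are made up for illustration, the "reasonable size" factor is omitted, and neither function is taken from `quality_scorer.py`.

```python
from typing import List, Optional


def brackets_balanced(code: str) -> bool:
    """Return True when (), [] and {} are properly nested.

    Brackets inside string literals are counted too, so this is only a heuristic.
    """
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for char in code:
        if char in "([{":
            stack.append(char)
        elif char in pairs:
            if not stack or stack.pop() != pairs[char]:
                return False
    return not stack


def score_table(rows: List[List[str]], headers: Optional[List[str]] = None,
                caption: Optional[str] = None) -> float:
    """Score a table from 0 to 10 using illustrative, unofficial weights."""
    if not rows:
        return 0.0
    score = 0.0
    if headers:
        score += 3.0                        # has headers
    if len({len(row) for row in rows}) == 1:
        score += 3.0                        # consistent column count
    cells = [cell for row in rows for cell in row]
    if cells:
        filled = sum(1 for cell in cells if cell.strip())
        score += 3.0 * filled / len(cells)  # share of non-empty cells
    if caption:
        score += 1.0                        # has caption
    return round(score, 2)
```

A complete, captioned table with headers reaches the top of the scale, while a ragged table with empty cells scores low, which matches the intent of the factor list above.
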
### 5. Output Formatter Layer

**File**: `src/skill_seekers/cli/parsers/extractors/formatters.py`

**MarkdownFormatter**:
- Converts Document to Markdown
- Handles all ContentBlockType variants
- Configurable options (TOC, max heading level, etc.)

**SkillFormatter**:
- Converts Document to skill-seekers internal format
- Compatible with existing skill pipelines

## Integration Points

### 1. Codebase Scraper

**File**: `src/skill_seekers/cli/codebase_scraper.py`

```python
# Enhanced RST extraction
def extract_rst_structure(content: str) -> dict:
    parser = RstParser()
    result = parser.parse_string(content)
    if result.success:
        return result.document.to_legacy_format()
    # Fallback to legacy parser
```

### 2. Doc Scraper

**File**: `src/skill_seekers/cli/doc_scraper.py`

```python
# Enhanced Markdown extraction
def _extract_markdown_content(self, content, url):
    parser = MarkdownParser()
    result = parser.parse_string(content, url)
    if result.success:
        doc = result.document
        return {
            "title": doc.title,
            "headings": [...],
            "code_samples": [...],
            "_enhanced": True,
        }
    # Fallback to legacy extraction
```

## Usage Patterns

### Basic Parsing

```python
from skill_seekers.cli.parsers.extractors import RstParser

parser = RstParser()
result = parser.parse_file("docs/class_node.rst")

if result.success:
    doc = result.document
    print(f"Title: {doc.title}")
    print(f"Tables: {len(doc.tables)}")
```

### Auto-Detection

```python
from skill_seekers.cli.parsers.extractors import parse_document

result = parse_document("file.rst")  # Auto-detects format
# or
result = parse_document(content, format_hint="rst")
```

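
One plausible way such auto-detection could choose a parser is to honor an explicit hint first and fall back to the file extension. The sketch below illustrates only that selection step. The `EXTENSION_TO_FORMAT` table and `detect_format` function are hypothetical, and content sniffing along the lines of `_detect_format(content)` is not shown.

```python
from pathlib import Path
from typing import Dict, Optional

# Hypothetical extension table; the real mapping lives in each parser's
# supported_extensions and may differ.
EXTENSION_TO_FORMAT: Dict[str, str] = {
    ".rst": "rst",
    ".md": "markdown",
    ".markdown": "markdown",
}


def detect_format(source: str, format_hint: Optional[str] = None) -> Optional[str]:
    """Pick a format name from an explicit hint, else from the file extension."""
    if format_hint:
        return format_hint.lower()
    return EXTENSION_TO_FORMAT.get(Path(source).suffix.lower())
```

Returning `None` for an unknown extension leaves the caller free to try content-based detection or report an error.
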
### Format Conversion

```python
# To Markdown
markdown = doc.to_markdown()

# To Skill format
skill_data = doc.to_skill_format()

# To legacy format (backward compatibility)
legacy = doc.to_legacy_format()  # Compatible with old structure
```

### API Documentation Extraction

```python
# Extract structured API info
api_summary = doc.get_api_summary()
# Returns:
# {
#     "properties": [{"name": "position", "type": "Vector2", ...}],
#     "methods": [{"name": "_ready", "returns": "void", ...}],
#     "signals": [{"name": "ready", ...}]
# }
```

## Extending the System

### Adding a New Parser

1. **Create parser class**:
```python
class HtmlParser(BaseParser):
    @property
    def format_name(self) -> str:
        return "html"

    @property
    def supported_extensions(self) -> list[str]:
        return [".html", ".htm"]

    def _parse_content(self, content: str, source_path: str) -> Document:
        # Parse HTML to Document
        pass
```

2. **Register in `__init__.py`**:
```python
from .html_parser import HtmlParser

__all__ = [..., "HtmlParser"]
```

3. **Add tests**:
```python
def test_html_parser():
    parser = HtmlParser()
    result = parser.parse_string("<h1>Title</h1>")
    assert result.document.title == "Title"
```

## Testing Strategy

### Unit Tests

Test individual parsers with various constructs:
- `test_rst_parser.py` - RST-specific features
- `test_markdown_parser.py` - Markdown-specific features
- `test_quality_scorer.py` - Quality scoring

### Integration Tests

Test integration with existing scrapers:
- `test_codebase_scraper.py` - RST file processing
- `test_doc_scraper.py` - Markdown web content

### Backward Compatibility Tests

Verify new parsers match old output:
- Same field names in output dicts
- Same content extraction (plus more)
- Legacy fallback works

## Performance Considerations

### Current Performance

- RST Parser: ~1-2ms per 1000 lines
- Markdown Parser: ~1ms per 1000 lines
- Quality Scoring: Adds ~10% overhead

### Optimization Opportunities

1. **Caching**: Cache parsed documents by hash
2. **Parallel Processing**: Parse multiple files concurrently
3. **Lazy Evaluation**: Only extract requested content types

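
The first opportunity, caching by hash, can be sketched in a few lines. This is an illustration of the idea only: `ParseCache` is a hypothetical in-memory wrapper around any parse callable, and nothing like it is confirmed to exist in the codebase.

```python
import hashlib
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class ParseCache(Generic[T]):
    """In-memory cache keyed by the SHA-256 digest of the source text."""

    def __init__(self, parse: Callable[[str], T]) -> None:
        self._parse = parse
        self._results: Dict[str, T] = {}
        self.misses = 0

    def get(self, content: str) -> T:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._parse(content)
        return self._results[key]


cache = ParseCache(lambda text: text.splitlines())
first = cache.get("a\nb")
second = cache.get("a\nb")  # served from the cache; the parser is not called again
```

Keying on the content digest rather than the file path means unchanged files are skipped even after a rename, and edited files are re-parsed automatically.
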
## Migration Guide

### From Legacy Parsers

**Before**:
```python
from skill_seekers.cli.codebase_scraper import extract_rst_structure

structure = extract_rst_structure(content)
```

**After**:
```python
from skill_seekers.cli.parsers.extractors import RstParser

parser = RstParser()
result = parser.parse_string(content)
structure = result.document.to_skill_format()
```

### Backward Compatibility

The enhanced `extract_rst_structure()` function:
1. Tries unified parser first
2. Falls back to legacy parser on failure
3. Returns same dict structure

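
The try-then-fall-back behavior listed above can be expressed as a small wrapper. In this sketch failure is signalled by an exception, whereas the snippets earlier in this document check `result.success`; both extractor functions and the wrapper name are hypothetical stand-ins, not the project's code.

```python
from typing import Callable, Dict


def with_legacy_fallback(unified: Callable[[str], Dict],
                         legacy: Callable[[str], Dict]) -> Callable[[str], Dict]:
    """Wrap two extractors: try `unified` first, use `legacy` if it fails."""
    def extract(content: str) -> Dict:
        try:
            return unified(content)
        except Exception:
            # Any failure in the new path falls back to the old behavior.
            return legacy(content)
    return extract


def unified_extract(content: str) -> Dict:
    """Hypothetical new-style extractor that rejects empty input."""
    if not content.strip():
        raise ValueError("nothing to parse")
    return {"title": content.splitlines()[0], "_enhanced": True}


def legacy_extract(content: str) -> Dict:
    """Hypothetical legacy extractor that always returns the old dict shape."""
    return {"title": content.strip()}


extract = with_legacy_fallback(unified_extract, legacy_extract)
```

Callers keep using a single function and receive a dict either way, so the migration can proceed without touching call sites.
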
## Future Enhancements

1. **Caching Layer**: Redis/disk cache for parsed docs
2. **Streaming**: Parse large files incrementally
3. **Validation**: JSON Schema validation for output
4. **Additional formats**: As new source types are added, they follow the same parser registration pattern

---

**Last Updated**: 2026-03-15
**Version**: 2.0.0 (updated for 17 source types)