docs: remove stale UNIFIED_PARSERS.md superseded by UML architecture

The parsers architecture is now fully documented in the StarUML project (Docs/UML/skill_seekers.mdj) with the Parsers class diagram showing all 28 SubcommandParser subclasses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 12:31:17 +03:00
parent 6b54988db5
commit 40603a3cf6
1 changed files with 0 additions and 434 deletions
--- a/docs/architecture/UNIFIED_PARSERS.md
+++ b/docs/architecture/UNIFIED_PARSERS.md
@@ -1,434 +0,0 @@
-# Unified Document Parsers Architecture
-
-## Overview
-
-The Unified Document Parser system provides a standardized interface for extracting structured content from multiple document formats. As of v3.2.0, the system supports **17 source types** through registered parsers and scraper modules. It replaces format-specific extraction logic with a common data model and extensible parser framework.
-
-## Architecture Goals
-
-1. **Standardization**: All parsers output the same `Document` structure
-2. **Extensibility**: Easy to add new formats via the scraper pattern (17 source types and growing)
-3. **Quality**: Built-in quality scoring for extracted content
-4. **Backward Compatibility**: Legacy parsers remain functional during migration
-
-## Core Components
-
-### 1. Data Model Layer
-
-**File**: `src/skill_seekers/cli/parsers/extractors/unified_structure.py`
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│                      Document                                │
-├─────────────────────────────────────────────────────────────┤
-│  title: str                                                  │
-│  format: str                                                 │
-│  source_path: str                                            │
-├─────────────────────────────────────────────────────────────┤
-│  blocks: List[ContentBlock]         # All content blocks    │
-│  headings: List[Heading]            # Extracted from blocks │
-│  code_blocks: List[CodeBlock]       # Extracted from blocks │
-│  tables: List[Table]                # Extracted from blocks │
-│  images: List[Image]                # Extracted from blocks │
-├─────────────────────────────────────────────────────────────┤
-│  internal_links: List[CrossReference]  # :ref:, #anchor     │
-│  external_links: List[CrossReference]  # URLs               │
-├─────────────────────────────────────────────────────────────┤
-│  meta: Dict[str, Any]               # Frontmatter, metadata │
-│  stats: ExtractionStats             # Processing metrics    │
-└─────────────────────────────────────────────────────────────┘
-```
-
-#### ContentBlock
-
-The universal content container:
-
-```python
-@dataclass
-class ContentBlock:
-    type: ContentBlockType      # HEADING, PARAGRAPH, CODE_BLOCK, etc.
-    content: str                # Raw text content
-    metadata: Dict[str, Any]    # Type-specific data
-    source_line: Optional[int]  # Line number in source
-    quality_score: Optional[float]  # 0-10 quality rating
-```
-
-**ContentBlockType Enum**:
-- `HEADING` - Section titles
-- `PARAGRAPH` - Text content
-- `CODE_BLOCK` - Code snippets
-- `TABLE` - Tabular data
-- `LIST` - Bullet/numbered lists
-- `IMAGE` - Image references
-- `CROSS_REFERENCE` - Internal links
-- `DIRECTIVE` - RST directives
-- `FIELD_LIST` - Parameter documentation
-- `DEFINITION_LIST` - Term/definition pairs
-- `ADMONITION` - Notes, warnings, tips
-- `META` - Metadata fields
-
-#### Specialized Data Classes
-
-**Table**:
-```python
-@dataclass
-class Table:
-    rows: List[List[str]]       # 2D cell array
-    headers: Optional[List[str]]
-    caption: Optional[str]
-    source_format: str          # 'simple', 'grid', 'list-table'
-```
-
-**CodeBlock**:
-```python
-@dataclass
-class CodeBlock:
-    code: str
-    language: Optional[str]
-    quality_score: Optional[float]
-    confidence: Optional[float]  # Language detection confidence
-    is_valid: Optional[bool]     # Syntax validation
-```
-
-**CrossReference**:
-```python
-@dataclass
-class CrossReference:
-    ref_type: CrossRefType      # REF, DOC, CLASS, METH, etc.
-    target: str                 # Target ID/URL
-    text: Optional[str]         # Display text
-```
-
-### 2. Parser Interface Layer
-
-**File**: `src/skill_seekers/cli/parsers/extractors/base_parser.py`
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│                    BaseParser (Abstract)                     │
-├─────────────────────────────────────────────────────────────┤
-│  + format_name: str                                          │
-│  + supported_extensions: List[str]                           │
-├─────────────────────────────────────────────────────────────┤
-│  + parse(source) -> ParseResult                              │
-│  + parse_file(path) -> ParseResult                           │
-│  + parse_string(content) -> ParseResult                      │
-│  # _parse_content(content, path) -> Document                 │
-│  # _detect_format(content) -> bool                           │
-└─────────────────────────────────────────────────────────────┘
-```
-
-**ParseResult**:
-```python
-@dataclass
-class ParseResult:
-    document: Optional[Document]
-    success: bool
-    errors: List[str]
-    warnings: List[str]
-```
-
-### 3. Parser Implementations
-
-#### RST Parser
-
-**File**: `src/skill_seekers/cli/parsers/extractors/rst_parser.py`
-
-**Supported Constructs**:
-- Headers (underline style: `====`, `----`)
-- Code blocks (`.. code-block:: language`)
-- Tables (simple, grid, list-table)
-- Cross-references (`:ref:`, `:class:`, `:meth:`, `:func:`, `:attr:`)
-- Directives (`.. note::`, `.. warning::`, `.. deprecated::`)
-- Field lists (`:param:`, `:returns:`, `:type:`)
-- Definition lists
-- Substitutions (`|name|`)
-- Toctree (`.. toctree::`)
-
-**Parsing Strategy**:
-1. First pass: Collect substitution definitions
-2. Second pass: Parse block-level constructs
-3. Post-process: Extract specialized content lists
-
-#### Markdown Parser
-
-**File**: `src/skill_seekers/cli/parsers/extractors/markdown_parser.py`
-
-**Supported Constructs**:
-- Headers (ATX: `#`, Setext: underline)
-- Code blocks (fenced: ```` ``` ````)
-- Tables (GitHub-flavored)
-- Lists (bullet, numbered)
-- Admonitions (GitHub-style: `> [!NOTE]`)
-- Images and links
-- Frontmatter (YAML metadata)
-
-#### PDF Parser
-
-**File**: `src/skill_seekers/cli/pdf_scraper.py`
-
-**Status**: Integrated. Extracts text, tables, images, and code blocks from PDF files. Supports OCR for scanned documents.
-
-#### Additional Registered Parsers (v3.2.0)
-
-The following source types each have a dedicated scraper module registered in `parsers/__init__.py` (PARSERS list), `main.py` (COMMAND_MODULES dict), and `config_validator.py` (VALID_SOURCE_TYPES set):
-
-| # | Source Type | Scraper Module | Parser Registration |
-|---|------------|---------------|---------------------|
-| 1 | Documentation (web) | `doc_scraper.py` | `documentation` |
-| 2 | GitHub repo | `github_scraper.py` | `github` |
-| 3 | PDF | `pdf_scraper.py` | `pdf` |
-| 4 | Word (.docx) | `word_scraper.py` | `word` |
-| 5 | EPUB | `epub_scraper.py` | `epub` |
-| 6 | Video | `video_scraper.py` | `video` |
-| 7 | Local codebase | `codebase_scraper.py` | `local` |
-| 8 | Jupyter Notebook | `jupyter_scraper.py` | `jupyter` |
-| 9 | Local HTML | `html_scraper.py` | `html` |
-| 10 | OpenAPI/Swagger | `openapi_scraper.py` | `openapi` |
-| 11 | AsciiDoc | `asciidoc_scraper.py` | `asciidoc` |
-| 12 | PowerPoint | `pptx_scraper.py` | `pptx` |
-| 13 | RSS/Atom | `rss_scraper.py` | `rss` |
-| 14 | Man pages | `manpage_scraper.py` | `manpage` |
-| 15 | Confluence | `confluence_scraper.py` | `confluence` |
-| 16 | Notion | `notion_scraper.py` | `notion` |
-| 17 | Slack/Discord | `chat_scraper.py` | `chat` |
-
-Each scraper follows the same pattern: a `<Type>ToSkillConverter` class with a `main()` function, registered in three places (see [CONTRIBUTING.md](../../CONTRIBUTING.md) for the full scraper pattern).
-
-#### Generic Merge System
-
-**File**: `src/skill_seekers/cli/unified_skill_builder.py`
-
-The `unified_skill_builder.py` handles multi-source merging:
-- **Pairwise synthesis**: Optimized merge for common combos (docs+github, docs+pdf, github+pdf)
-- **Generic merge** (`_generic_merge()`): Handles all other source type combinations (e.g., docs+jupyter+confluence) by normalizing each source's `scraped_data` into a common structure and merging sections
-
-### 4. Quality Scoring Layer
-
-**File**: `src/skill_seekers/cli/parsers/extractors/quality_scorer.py`
-
-**Code Quality Factors**:
-- Language detection confidence
-- Code length appropriateness
-- Line count
-- Keyword density
-- Syntax pattern matching
-- Bracket balance
-
-**Table Quality Factors**:
-- Has headers
-- Consistent column count
-- Reasonable size
-- Non-empty cells
-- Has caption
-
-### 5. Output Formatter Layer
-
-**File**: `src/skill_seekers/cli/parsers/extractors/formatters.py`
-
-**MarkdownFormatter**:
-- Converts Document to Markdown
-- Handles all ContentBlockType variants
-- Configurable options (TOC, max heading level, etc.)
-
-**SkillFormatter**:
-- Converts Document to skill-seekers internal format
-- Compatible with existing skill pipelines
-
-## Integration Points
-
-### 1. Codebase Scraper
-
-**File**: `src/skill_seekers/cli/codebase_scraper.py`
-
-```python
-# Enhanced RST extraction
-def extract_rst_structure(content: str) -> dict:
-    parser = RstParser()
-    result = parser.parse_string(content)
-    if result.success:
-        return result.document.to_legacy_format()
-    # Fallback to legacy parser
-```
-
-### 2. Doc Scraper
-
-**File**: `src/skill_seekers/cli/doc_scraper.py`
-
-```python
-# Enhanced Markdown extraction
-def _extract_markdown_content(self, content, url):
-    parser = MarkdownParser()
-    result = parser.parse_string(content, url)
-    if result.success:
-        doc = result.document
-        return {
-            "title": doc.title,
-            "headings": [...],
-            "code_samples": [...],
-            "_enhanced": True,
-        }
-    # Fallback to legacy extraction
-```
-
-## Usage Patterns
-
-### Basic Parsing
-
-```python
-from skill_seekers.cli.parsers.extractors import RstParser
-
-parser = RstParser()
-result = parser.parse_file("docs/class_node.rst")
-
-if result.success:
-    doc = result.document
-    print(f"Title: {doc.title}")
-    print(f"Tables: {len(doc.tables)}")
-```
-
-### Auto-Detection
-
-```python
-from skill_seekers.cli.parsers.extractors import parse_document
-
-result = parse_document("file.rst")  # Auto-detects format
-# or
-result = parse_document(content, format_hint="rst")
-```
-
-### Format Conversion
-
-```python
-# To Markdown
-markdown = doc.to_markdown()
-
-# To Skill format
-skill_data = doc.to_skill_format()
-
-# To legacy format (backward compatibility)
-legacy = doc.to_legacy_format()  # Compatible with old structure
-```
-
-### API Documentation Extraction
-
-```python
-# Extract structured API info
-api_summary = doc.get_api_summary()
-# Returns:
-# {
-#   "properties": [{"name": "position", "type": "Vector2", ...}],
-#   "methods": [{"name": "_ready", "returns": "void", ...}],
-#   "signals": [{"name": "ready", ...}]
-# }
-```
-
-## Extending the System
-
-### Adding a New Parser
-
-1. **Create parser class**:
-```python
-class HtmlParser(BaseParser):
-    @property
-    def format_name(self) -> str:
-        return "html"
-    
-    @property
-    def supported_extensions(self) -> list[str]:
-        return [".html", ".htm"]
-    
-    def _parse_content(self, content: str, source_path: str) -> Document:
-        # Parse HTML to Document
-        pass
-```
-
-2. **Register in `__init__.py`**:
-```python
-from .html_parser import HtmlParser
-
-__all__ = [..., "HtmlParser"]
-```
-
-3. **Add tests**:
-```python
-def test_html_parser():
-    parser = HtmlParser()
-    result = parser.parse_string("<h1>Title</h1>")
-    assert result.document.title == "Title"
-```
-
-## Testing Strategy
-
-### Unit Tests
-
-Test individual parsers with various constructs:
-- `test_rst_parser.py` - RST-specific features
-- `test_markdown_parser.py` - Markdown-specific features
-- `test_quality_scorer.py` - Quality scoring
-
-### Integration Tests
-
-Test integration with existing scrapers:
-- `test_codebase_scraper.py` - RST file processing
-- `test_doc_scraper.py` - Markdown web content
-
-### Backward Compatibility Tests
-
-Verify new parsers match old output:
-- Same field names in output dicts
-- Same content extraction (plus more)
-- Legacy fallback works
-
-## Performance Considerations
-
-### Current Performance
-
-- RST Parser: ~1-2ms per 1000 lines
-- Markdown Parser: ~1ms per 1000 lines
-- Quality Scoring: Adds ~10% overhead
-
-### Optimization Opportunities
-
-1. **Caching**: Cache parsed documents by hash
-2. **Parallel Processing**: Parse multiple files concurrently
-3. **Lazy Evaluation**: Only extract requested content types
-
-## Migration Guide
-
-### From Legacy Parsers
-
-**Before**:
-```python
-from skill_seekers.cli.codebase_scraper import extract_rst_structure
-
-structure = extract_rst_structure(content)
-```
-
-**After**:
-```python
-from skill_seekers.cli.parsers.extractors import RstParser
-
-parser = RstParser()
-result = parser.parse_string(content)
-structure = result.document.to_skill_format()
-```
-
-### Backward Compatibility
-
-The enhanced `extract_rst_structure()` function:
-1. Tries unified parser first
-2. Falls back to legacy parser on failure
-3. Returns same dict structure
-
-## Future Enhancements
-
-1. **Caching Layer**: Redis/disk cache for parsed docs
-2. **Streaming**: Parse large files incrementally
-3. **Validation**: JSON Schema validation for output
-4. **Additional formats**: As new source types are added, they follow the same parser registration pattern
-
----
-
-**Last Updated**: 2026-03-15
-**Version**: 2.0.0 (updated for 17 source types)