fix: filter h1 headings and short paragraphs in _extract_markdown_content

The unified MarkdownParser returns all headings (h1-h6) and all paragraphs
without length filtering. Apply the documented behaviour at the call site:
- Exclude h1 from the headings list (return h2-h6 only)
- Filter out paragraphs shorter than 20 characters (measured after
  stripping surrounding whitespace) from content

Fixes test_extract_headings_h2_to_h6 and test_extract_content_paragraphs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
yusyus
2026-02-18 21:53:14 +03:00
parent 637bb0a602
commit a1bdcd037b

@@ -425,10 +425,14 @@ class DocToSkillConverter:
         return {
             "url": url,
             "title": doc.title or "",
-            "content": doc._extract_content_text(),
+            "content": "\n\n".join(
+                p for p in doc._extract_content_text().split("\n\n")
+                if len(p.strip()) >= 20
+            ),
             "headings": [
                 {"level": f"h{h.level}", "text": h.text, "id": h.id or ""}
                 for h in doc.headings
+                if h.level > 1
             ],
             "code_samples": [
                 {"code": cb.code, "language": cb.language or "unknown"}
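
The two filters in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: the `Heading` dataclass and the helper names `filter_paragraphs` and `headings_h2_to_h6` are hypothetical stand-ins introduced here for illustration. Only the 20-character threshold, the `"\n\n"` paragraph separator, and the `level > 1` condition come from the diff.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Threshold taken from the diff; the constant name itself is an assumption.
MIN_PARAGRAPH_CHARS = 20


@dataclass
class Heading:
    """Hypothetical stand-in for the parser's heading objects."""
    level: int
    text: str
    id: Optional[str] = None


def filter_paragraphs(text: str, min_chars: int = MIN_PARAGRAPH_CHARS) -> str:
    """Keep blank-line-separated paragraphs whose stripped length is >= min_chars."""
    return "\n\n".join(
        p for p in text.split("\n\n") if len(p.strip()) >= min_chars
    )


def headings_h2_to_h6(headings: List[Heading]) -> List[Dict[str, str]]:
    """Serialize headings, skipping level 1 (h1)."""
    return [
        {"level": f"h{h.level}", "text": h.text, "id": h.id or ""}
        for h in headings
        if h.level > 1
    ]


if __name__ == "__main__":
    sample = "Short.\n\nThis paragraph is long enough to keep.\n\nok"
    print(filter_paragraphs(sample))
    print(headings_h2_to_h6([Heading(1, "Title"), Heading(2, "Intro", "intro")]))
```

Because the length is measured on `p.strip()`, a paragraph of exactly 20 non-whitespace-padded characters is kept and one of 19 is dropped; empty fragments produced by runs of more than two newlines are dropped as well.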