claude-code-skills-reference/transcript-fixer/scripts/utils/diff_formats/text_splitter.py
daymade bd0aa12004 Release v1.8.0: Add transcript-fixer skill
## New Skill: transcript-fixer v1.0.0

Correct speech-to-text (ASR/STT) transcription errors through dictionary-based rules and AI-powered corrections with automatic pattern learning.

**Features:**
- Two-stage correction pipeline (dictionary + AI)
- Automatic pattern detection and learning
- Domain-specific dictionaries (general, embodied_ai, finance, medical)
- SQLite-based correction repository
- Team collaboration with import/export
- GLM API integration for AI corrections
- Cost optimization through dictionary promotion

**Use cases:**
- Correcting meeting notes, lecture recordings, or interview transcripts
- Fixing Chinese/English homophone errors and technical terminology
- Building domain-specific correction dictionaries
- Improving transcript accuracy through iterative learning

**Documentation:**
- Complete workflow guides in references/
- SQL query templates
- Troubleshooting guide
- Team collaboration patterns
- API setup instructions

**Marketplace updates:**
- Updated marketplace to v1.8.0
- Added transcript-fixer plugin (category: productivity)
- Updated README.md with skill description and use cases
- Updated CLAUDE.md with skill listing and counts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-28 13:16:37 +08:00

34 lines
885 B
Python

#!/usr/bin/env python3
"""
Text splitter utility for word-level diff generation

SINGLE RESPONSIBILITY: Split text into words while preserving structure
"""

from __future__ import annotations

import re


def split_into_words(text: str) -> list[str]:
    """
    Split text into words, preserving whitespace and punctuation

    This enables word-level diff generation for Chinese and English text

    Args:
        text: Input text to split

    Returns:
        List of tokens: runs of Chinese characters, English words, numbers,
        and single whitespace or punctuation characters
    """
    # Pattern: Chinese chars, English words, numbers, any other single char
    pattern = r'[\u4e00-\u9fff]+|[a-zA-Z]+|[0-9]+|[^\u4e00-\u9fffa-zA-Z0-9]'
    return re.findall(pattern, text)


def read_file(file_path: str) -> str:
    """Read file contents"""
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()
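
A minimal usage sketch of how the tokens from `split_into_words` could feed a word-level diff. The `word_diff` helper is hypothetical and not part of this file; it assumes the standard-library `difflib.SequenceMatcher` as the comparison engine, which may differ from what the rest of the skill uses. The splitter is repeated so the snippet runs on its own.

```python
from __future__ import annotations

import difflib
import re


def split_into_words(text: str) -> list[str]:
    # Same pattern as the module above, repeated so this sketch is self-contained.
    pattern = r'[\u4e00-\u9fff]+|[a-zA-Z]+|[0-9]+|[^\u4e00-\u9fffa-zA-Z0-9]'
    return re.findall(pattern, text)


def word_diff(original: str, corrected: str) -> list[tuple[str, str, str]]:
    """Hypothetical helper: return (tag, old_text, new_text) for each changed span."""
    a = split_into_words(original)
    b = split_into_words(corrected)
    # autojunk=False keeps frequent tokens (such as spaces) from being ignored
    # as junk when the token lists are long.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, "".join(a[i1:i2]), "".join(b[j1:j2])))
    return changes


if __name__ == "__main__":
    print(split_into_words("Hello 世界 42!"))
    print(word_diff("use the clod code tool", "use the claude code tool"))
```

Because every character matches exactly one alternative in the pattern, joining the tokens reproduces the input text, so changed spans can be mapped back onto the original transcript without losing whitespace.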