# Data Transforms Reference

Patterns for cleaning, normalizing, deduplicating, and enriching
extracted web data. Apply these transforms in Phase 5 (Transform),
between extraction and validation.

---

## Automatic Transforms

Always apply these to every extraction result.

### Whitespace Cleanup

```python
import re

# Remove zero-width characters (ZWSP, ZWNJ, ZWJ, BOM) before collapsing,
# so they cannot leave doubled spaces behind
value = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', value)

# Remove leading/trailing whitespace, collapse internal whitespace.
# str.split() also splits on the non-breaking space \u00a0
value = ' '.join(value.split())
```

Patterns to handle:

- `\n`, `\r`, `\t` inside cell values -> single space
- Multiple consecutive spaces -> single space
- Non-breaking spaces (`&nbsp;`, `\u00a0`) -> regular space
- Zero-width characters -> remove

### HTML Entity Decode

| Entity    | Character | Entity    | Character |
|:----------|:----------|:----------|:----------|
| `&amp;`   | `&`       | `&quot;`  | `"`       |
| `&lt;`    | `<`       | `&#39;`   | `'`       |
| `&gt;`    | `>`       | `&apos;`  | `'`       |
| `&nbsp;`  | ` `       | `&rsquo;` | (curly ') |
| `&mdash;` | `--`      | `&#8212;` | `--`      |

```python
import html
value = html.unescape(value)
```

### Unicode Normalization

```python
import unicodedata
value = unicodedata.normalize('NFKC', value)
```

This handles:

- Ligatures -> separate characters (e.g. `ﬁ` -> `fi`)
- Full-width characters -> standard (e.g. `Ａ` -> `A`)
- Superscript/subscript numbers -> regular numbers
- Fancy quotes -> not covered by NFKC; map them to standard quotes in a separate step

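A short check of what NFKC does and does not change, as a sketch. `QUOTE_MAP` and `normalize_text` are hypothetical names, not part of this reference, and the explicit quote mapping is an assumption for pipelines that need plain ASCII quotes.

```python
import unicodedata

# Hypothetical mapping: NFKC leaves curly quotes alone, so map them by hand
QUOTE_MAP = str.maketrans({
    '\u2018': "'", '\u2019': "'",   # single curly quotes
    '\u201c': '"', '\u201d': '"',   # double curly quotes
})

def normalize_text(value):
    """Apply NFKC, then replace curly quotes with straight quotes."""
    return unicodedata.normalize('NFKC', value).translate(QUOTE_MAP)

print(normalize_text('\ufb01le'))          # ligature is split into "f" + "i"
print(normalize_text('\uff21\uff11'))      # full-width letters and digits become ASCII
print(normalize_text('x\u00b2'))           # superscript two becomes a regular 2
print(normalize_text('\u201cquote\u201d')) # curly quotes become straight quotes
```
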

### Empty Value Standardization

| Input                     | Markdown Output | JSON Output |
|:--------------------------|:----------------|:------------|
| `""` (empty string)       | `N/A`           | `null`      |
| `"-"` or `"--"`           | `N/A`           | `null`      |
| `"N/A"`, `"n/a"`, `"NA"`  | `N/A`           | `null`      |
| `"None"`, `"null"`        | `N/A`           | `null`      |
| `"TBD"`, `"TBA"`          | `TBD`           | `"TBD"`     |

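The table above can be sketched as a small lookup. `standardize_empty` and its `output` parameter are hypothetical names; matching the placeholders case-insensitively is an assumption that goes slightly beyond the spellings listed.

```python
# Placeholder strings treated as missing (compared in lowercase)
EMPTY_MARKERS = {'', '-', '--', 'n/a', 'na', 'none', 'null'}
# Placeholders kept as a literal "TBD" rather than nulled
PENDING_MARKERS = {'tbd', 'tba'}

def standardize_empty(value, output='json'):
    """Map placeholder strings to None (JSON) or 'N/A' (Markdown)."""
    missing = None if output == 'json' else 'N/A'
    if value is None:
        return missing
    text = str(value).strip().lower()
    if text in EMPTY_MARKERS:
        return missing
    if text in PENDING_MARKERS:
        return 'TBD'
    return value
```

In JSON mode the Python `None` serializes to `null` through `json.dumps`.
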

---

## Price Normalization

Apply when extracting product, pricing, or financial data.

### Extraction Pattern

```python
import re

def normalize_price(raw):
    if not raw:
        return None
    # Remove currency codes and prefixed dollar signs
    cleaned = re.sub(r'(?i)(USD|EUR|GBP|BRL|R\$|US\$)', '', raw)
    # Extract numeric value (handles 1,234.56 and 1.234,56 formats);
    # the match must start with a digit so stray punctuation is skipped
    match = re.search(r'\d[\d.,]*', cleaned)
    if not match:
        return None
    num_str = match.group().rstrip('.,')
    # Detect format: a final comma followed by 1-2 digits is a decimal comma
    if re.search(r',\d{1,2}$', num_str):
        num_str = num_str.replace('.', '').replace(',', '.')
    else:
        num_str = num_str.replace(',', '')
    try:
        return float(num_str)
    except ValueError:
        # e.g. "1.234.567": dot-separated thousands without a decimal part
        return None
```

### Currency Detection

| Symbol/Code       | Currency        | Symbol/Code | Currency          |
|:------------------|:----------------|:------------|:------------------|
| `$`, `US$`, `USD` | US Dollar       | `R$`, `BRL` | Brazilian Real    |
| `€`, `EUR`        | Euro            | `£`, `GBP`  | British Pound     |
| `¥`, `JPY`        | Yen             | `₹`, `INR`  | Indian Rupee      |
| `C$`, `CAD`       | Canadian Dollar | `A$`, `AUD` | Australian Dollar |

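One possible way to turn the table above into a lookup. `detect_currency`, `SYMBOLS`, and `CODE_RE` are hypothetical names; treating a bare `$` as US Dollar and `¥` as Yen follows the table but is still an assumption for any given site.

```python
import re

# Symbols are checked in order, so prefixed dollar signs win over the bare "$"
SYMBOLS = [('R$', 'BRL'), ('C$', 'CAD'), ('A$', 'AUD'), ('US$', 'USD'),
           ('€', 'EUR'), ('£', 'GBP'), ('¥', 'JPY'), ('₹', 'INR'),
           ('$', 'USD')]  # assumption: a bare "$" means US Dollar

# Three-letter codes, not embedded in a longer word
CODE_RE = re.compile(
    r'(?<![A-Za-z])(USD|BRL|EUR|GBP|JPY|INR|CAD|AUD)(?![A-Za-z])',
    re.IGNORECASE,
)

def detect_currency(raw):
    """Return a currency code for the first code or symbol found, else None."""
    match = CODE_RE.search(raw)
    if match:
        return match.group(1).upper()
    upper = raw.upper()
    for symbol, code in SYMBOLS:
        if symbol in upper:
            return code
    return None
```
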

### Output Format

```json
{
  "price": 29.99,
  "currency": "USD",
  "rawPrice": "$29.99"
}
```

For Markdown, show formatted: `$29.99` (right-aligned in table).

---

## Date Normalization

Normalize all dates to ISO-8601 format.

### Common Formats to Handle

| Input Format | Example              | Normalized         |
|:-------------|:---------------------|:-------------------|
| Full text    | February 25, 2026    | 2026-02-25         |
| Short text   | Feb 25, 2026         | 2026-02-25         |
| US numeric   | 02/25/2026           | 2026-02-25         |
| EU numeric   | 25/02/2026           | 2026-02-25         |
| ISO already  | 2026-02-25           | 2026-02-25         |
| Relative     | 3 days ago           | (compute from now) |
| Relative     | Yesterday            | (compute from now) |
| Timestamp    | 1740441600           | 2025-02-25         |
| With time    | 2026-02-25T14:30:00Z | 2026-02-25 14:30   |

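A minimal sketch of the date-only rows in the table above, using only the standard library. `normalize_date` and `DATE_FORMATS` are hypothetical names. Two assumptions: a 10-digit number is a Unix timestamp in seconds, read as UTC, and month names are English (`%B` and `%b` depend on the active locale).

```python
from datetime import datetime, timezone

# Tried in order; US numeric comes before EU numeric, so ambiguous
# inputs such as 03/04/2026 resolve as MM/DD/YYYY
DATE_FORMATS = ['%Y-%m-%d', '%B %d, %Y', '%b %d, %Y', '%m/%d/%Y', '%d/%m/%Y']

def normalize_date(raw):
    """Return an ISO-8601 date string (YYYY-MM-DD), or None if unparseable."""
    text = raw.strip()
    if text.isdigit() and len(text) == 10:
        # Assumption: 10 digits means a Unix timestamp in seconds (UTC)
        stamp = datetime.fromtimestamp(int(text), tz=timezone.utc)
        return stamp.strftime('%Y-%m-%d')
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    return None
```

Relative inputs and values with a time component are left out of this sketch.
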

### Ambiguous Dates

When format is ambiguous (e.g. `03/04/2026`):

- Default to US format (MM/DD/YYYY) unless site is clearly non-US
- Check page `lang` attribute or URL TLD for locale hints
- Note ambiguity in delivery notes

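The rules above could look like the following sketch. `parse_numeric_date` and its `day_first` flag are hypothetical; the caller is assumed to derive `day_first` from the page `lang` attribute or the URL TLD. The returned flag lets the caller record the ambiguity in delivery notes.

```python
from datetime import datetime

def parse_numeric_date(text, day_first=False):
    """Parse NN/NN/YYYY and return (iso_date, was_ambiguous).

    Raises ValueError if neither reading is a valid calendar date.
    """
    first, second, year = (int(part) for part in text.strip().split('/'))
    # If one of the first two numbers exceeds 12, only one reading is valid
    if first > 12 and second <= 12:
        day, month, ambiguous = first, second, False
    elif second > 12 and first <= 12:
        day, month, ambiguous = second, first, False
    else:
        # Both readings are possible: fall back to the locale hint
        day, month = (first, second) if day_first else (second, first)
        ambiguous = first != second
    return datetime(year, month, day).strftime('%Y-%m-%d'), ambiguous
```
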

### Relative Date Resolution

```python
from datetime import datetime, timedelta
import re

def resolve_relative_date(text):
    normalized = text.lower().strip()
    today = datetime.now()

    if 'today' in normalized:
        return today.strftime('%Y-%m-%d')
    if 'yesterday' in normalized:
        return (today - timedelta(days=1)).strftime('%Y-%m-%d')

    match = re.search(r'(\d+)\s*(hour|day|week|month|year)s?\s*ago', normalized)
    if match:
        n, unit = int(match.group(1)), match.group(2)
        # Approximations: hours map to today, a month is 30 days, a year is 365
        deltas = {'hour': 0, 'day': n, 'week': n*7, 'month': n*30, 'year': n*365}
        return (today - timedelta(days=deltas.get(unit, 0))).strftime('%Y-%m-%d')

    return text  # Return the original input as-is if it can't be parsed
```

---

## URL Resolution

Convert relative URLs to absolute.

### Patterns

| Input                    | Base URL                    | Resolved                              |
|:-------------------------|:----------------------------|:--------------------------------------|
| `/products/item-1`       | `https://example.com/shop`  | `https://example.com/products/item-1` |
| `item-1`                 | `https://example.com/shop/` | `https://example.com/shop/item-1`     |
| `//cdn.example.com/img`  | `https://example.com`       | `https://cdn.example.com/img`         |
| `https://other.com/page` | (any)                       | `https://other.com/page` (absolute)   |

### JavaScript Resolution

```javascript
function resolveUrl(relative, base) {
  try { return new URL(relative, base || window.location.href).href; }
  catch { return relative; }
}
```

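When resolution happens in the Python transform step rather than in the browser, the standard library's `urljoin` covers the same cases. `resolve_url` is a hypothetical wrapper name; unlike the JavaScript version it has no `window.location` fallback, so the base URL must be passed in.

```python
from urllib.parse import urljoin

def resolve_url(relative, base):
    """Resolve a possibly relative URL against the page's base URL."""
    return urljoin(base, relative)

# The four rows of the Patterns table
print(resolve_url('/products/item-1', 'https://example.com/shop'))
print(resolve_url('item-1', 'https://example.com/shop/'))
print(resolve_url('//cdn.example.com/img', 'https://example.com'))
print(resolve_url('https://other.com/page', 'https://example.com'))
```
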

---

## Phone Normalization

For contact mode extraction.

### Pattern

```python
import re

def normalize_phone(raw):
    if not raw:
        return None
    # Remember a leading +, then remove all non-digit chars
    has_plus = raw.strip().startswith('+')
    digits = re.sub(r'\D', '', raw)
    if len(digits) < 7:
        return None
    # Add + prefix if the input had one or the number looks international
    if has_plus or len(digits) >= 11:
        return '+' + digits
    return digits
```

### Format by Context

| Context        | Format Example      |
|:---------------|:--------------------|
| JSON output    | `"+5511999998888"`  |
| Markdown table | `+55 11 99999-8888` |
| CSV output     | `"+5511999998888"`  |

---

## Deduplication

### Exact Deduplication

```python
def deduplicate(records, key_fields=None):
    """Remove exact duplicate records.
    If key_fields provided, deduplicate by those fields only.
    """
    seen = set()
    unique = []
    for record in records:
        if key_fields:
            key = tuple(record.get(f) for f in key_fields)
        else:
            key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique, len(records) - len(unique)  # returns (unique_list, removed_count)
```

### Near-Duplicate Detection

When records share key fields but differ in details:

1. Group by key fields (e.g. product name + source)
2. For each group, keep the record with fewest null values
3. If tie, keep the first occurrence
4. Report in notes: "Merged N near-duplicate records"

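The steps above can be sketched as follows. `merge_near_duplicates` is a hypothetical helper, and only `None` values are assumed to count as nulls; run empty value standardization first so that placeholders such as `"N/A"` are counted too.

```python
def merge_near_duplicates(records, key_fields):
    """Keep one record per key: the one with the fewest null values.

    Returns (kept_records, merged_count).
    """
    best = {}  # key -> (null_count, record); dicts preserve insertion order
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        nulls = sum(1 for v in record.values() if v is None)
        # Strict "<" keeps the first occurrence when null counts tie
        if key not in best or nulls < best[key][0]:
            best[key] = (nulls, record)
    kept = [record for _, record in best.values()]
    return kept, len(records) - len(kept)

records = [
    {'name': 'Lamp', 'source': 'shop-a', 'price': None},
    {'name': 'Lamp', 'source': 'shop-a', 'price': 9.5},
    {'name': 'Desk', 'source': 'shop-a', 'price': None},
]
kept, merged = merge_near_duplicates(records, ['name', 'source'])
print(f"Merged {merged} near-duplicate records")
```
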

### Dedup Key Selection by Mode

| Mode    | Key Fields                      |
|:--------|:--------------------------------|
| product | name + source (or name + brand) |
| contact | name + email (or name + org)    |
| jobs    | title + company + location      |
| events  | title + date + location         |
| table   | all fields (exact match)        |
| list    | first 2-3 identifying fields    |

---

## Text Cleaning

### Remove Noise

Common noise patterns to strip from extracted text:

| Pattern                           | Action                      |
|:----------------------------------|:----------------------------|
| `\[edit\]`, `\[citation needed\]` | Remove (Wikipedia)          |
| `Read more...`, `See more`        | Remove (truncation markers) |
| `Sponsored`, `Ad`, `Promoted`     | Remove or flag              |
| Cookie consent text               | Remove                      |
| Navigation breadcrumbs            | Remove                      |
| Footer boilerplate                | Remove                      |

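A minimal sketch for the first two rows of the table above, which are the ones that reduce to fixed strings. `strip_noise` and `NOISE_PATTERNS` are hypothetical names. Ad labels are deliberately left out, since a bare `Ad` also matches ordinary words; cookie banners, breadcrumbs, and footers are assumed to be easier to exclude by selector at extraction time.

```python
import re

# Assumed patterns; extend per site as needed
NOISE_PATTERNS = [
    r'\[edit\]',
    r'\[citation needed\]',
    r'\bRead more(?:\.{3})?',
    r'\bSee more\b',
]
NOISE_RE = re.compile('|'.join(NOISE_PATTERNS), flags=re.IGNORECASE)

def strip_noise(text):
    """Remove known noise markers, then collapse the leftover whitespace."""
    return ' '.join(NOISE_RE.sub('', text).split())
```
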

### Sentence Case Normalization

When extracting ALL-CAPS or inconsistent-case text:

```python
def normalize_case(text):
    if text.isupper() and len(text) > 3:
        # ALL CAPS -> Title Case. Note: str.title() also capitalizes
        # the letter after an apostrophe ("DON'T" -> "Don'T")
        return text.title()
    return text
```

Apply only when the field is clearly ALL-CAPS input (common on older sites),
the user requests it, or the data reads better normalized.

---

## Data Type Coercion

### Automatic Type Detection

| Raw Value      | Detected Type | Coerced Value      |
|:---------------|:--------------|:-------------------|
| `"123"`        | integer       | `123`              |
| `"12.99"`      | float         | `12.99`            |
| `"true"`       | boolean       | `true`             |
| `"false"`      | boolean       | `false`            |
| `"2026-02-25"` | date string   | `"2026-02-25"`     |
| `"$29.99"`     | price         | `29.99` + currency |
| `"4.5/5"`      | rating        | `4.5`              |
| `"1,234"`      | integer       | `1234`             |

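The boolean, integer, and float rows above might be detected as in this sketch. `coerce_value` is a hypothetical name; prices, ratings, and dates are returned unchanged here on the assumption that their dedicated normalizers handle them.

```python
import re

def coerce_value(raw):
    """Coerce a raw string to bool, int, or float; return it unchanged otherwise."""
    text = raw.strip()
    if text.lower() in ('true', 'false'):
        return text.lower() == 'true'
    # Integers, optionally with comma thousands separators: "123", "1,234"
    if re.fullmatch(r'-?\d{1,3}(?:,\d{3})+|-?\d+', text):
        return int(text.replace(',', ''))
    # Plain decimals: "12.99"
    if re.fullmatch(r'-?\d+\.\d+', text):
        return float(text)
    return raw
```

Identifiers with leading zeros (for example postal codes) would lose them under this rule, so restrict coercion to fields known to be numeric.
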

### Rating Normalization

```python
import re

def normalize_rating(raw):
    if not raw:
        return None
    # Score, optionally followed by "/ max"; each number needs a leading digit
    match = re.search(r'(\d+(?:\.\d+)?)\s*(?:/\s*(\d+(?:\.\d+)?))?', str(raw))
    if match:
        score = float(match.group(1))
        max_score = float(match.group(2)) if match.group(2) else 5.0
        if max_score <= 0:
            return None  # Avoid division by zero on input like "3/0"
        return round(score / max_score * 5, 1)  # Normalize to /5 scale
    return None
```

---

## Enrichment Patterns

### Domain Extraction

Add domain from full URLs:

```python
from urllib.parse import urlparse

def extract_domain(url):
    if not isinstance(url, str):
        return None
    try:
        domain = urlparse(url).netloc.lower()
    except ValueError:
        return None
    # Strip only a leading "www."; str.replace would also hit "www." mid-host
    if domain.startswith('www.'):
        domain = domain[4:]
    return domain or None
```

### Word Count

For article mode:

```python
def word_count(text):
    return len(text.split()) if text else 0
```

### Relative Time

Add human-readable time since date:

```python
def time_since(date_str):
    from datetime import datetime
    try:
        dt = datetime.fromisoformat(date_str)
        # Use dt's timezone for "now" so naive and aware inputs both work
        delta = datetime.now(dt.tzinfo) - dt
    except (ValueError, TypeError):
        return None
    if delta.days < 0: return None  # Date is in the future
    if delta.days == 0: return "Today"
    if delta.days == 1: return "Yesterday"
    if delta.days < 7: return f"{delta.days} days ago"
    if delta.days < 30: return f"{delta.days // 7} weeks ago"
    if delta.days < 365: return f"{delta.days // 30} months ago"
    return f"{delta.days // 365} years ago"
```

---

## Transform Pipeline Order

Apply transforms in this sequence:

1. **HTML entity decode** - raw text cleanup
2. **Unicode normalization** - character standardization
3. **Whitespace cleanup** - spacing normalization
4. **Empty value standardization** - null/N/A handling
5. **URL resolution** - relative to absolute
6. **Data type coercion** - strings to numbers/dates
7. **Price normalization** - if applicable
8. **Date normalization** - if applicable
9. **Phone normalization** - if applicable
10. **Text cleaning** - noise removal
11. **Deduplication** - remove duplicates
12. **Sorting** - user-requested order
13. **Enrichment** - domain, word count, etc.

Not all steps apply to every extraction. Apply only what's relevant
to the data type and extraction mode.

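As an illustration of applying steps in a fixed order, the first three steps can be chained as plain string functions. `TEXT_STEPS` and `apply_pipeline` are hypothetical names; this is a sketch of the ordering idea, not a complete pipeline.

```python
import html
import unicodedata

# Steps 1-3 of the sequence, each a str -> str function
TEXT_STEPS = [
    html.unescape,                               # 1. HTML entity decode
    lambda v: unicodedata.normalize('NFKC', v),  # 2. Unicode normalization
    lambda v: ' '.join(v.split()),               # 3. Whitespace cleanup
]

def apply_pipeline(record, steps=TEXT_STEPS):
    """Run each step, in order, over every string field of a record."""
    result = dict(record)
    for step in steps:
        for field, value in result.items():
            # Non-string values (numbers, None) pass through untouched
            if isinstance(value, str):
                result[field] = step(value)
    return result

print(apply_pipeline({'name': ' Caf&eacute;&nbsp; Menu ', 'price': 12}))
```

Decoding entities first matters here: `&nbsp;` only becomes a space that the whitespace step can collapse after it has been unescaped.
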