# Data Transforms Reference
Patterns for cleaning, normalizing, deduplicating, and enriching extracted web data. Apply these transforms in Phase 5 (Transform) between extraction and validation.
## Automatic Transforms
Always apply these to every extraction result.
### Whitespace Cleanup
```python
# Remove leading/trailing whitespace, collapse internal whitespace
# (str.split() also covers \n, \r, \t, and non-breaking spaces)
value = ' '.join(value.split())

# Remove zero-width characters entirely (they are not whitespace, so split() misses them)
import re
value = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', value).strip()
```
Patterns to handle:

- `\n`, `\r`, `\t` inside cell values -> single space
- Multiple consecutive spaces -> single space
- Non-breaking spaces (`&nbsp;`, `\u00a0`) -> regular space
- Zero-width characters -> remove
### HTML Entity Decode
| Entity | Character | Entity | Character |
|---|---|---|---|
| `&amp;` | `&` | `&quot;` | `"` |
| `&lt;` | `<` | `&#39;` | `'` |
| `&gt;` | `>` | `&apos;` | `'` |
| `&nbsp;` | (space) | `&rsquo;` | `’` (curly ') |
| `&mdash;` | `--` | `&#8212;` | `--` |
```python
import html
value = html.unescape(value)
```
### Unicode Normalization
```python
import unicodedata
value = unicodedata.normalize('NFKC', value)
```
This handles:

- Fancy quotes -> standard quotes
- Ligatures -> separate characters (e.g. `ﬁ` -> `fi`)
- Full-width characters -> standard (e.g. `Ａ` -> `A`)
- Superscript/subscript numbers -> regular numbers
### Empty Value Standardization
| Input | Markdown Output | JSON Output |
|---|---|---|
| `""` (empty string) | N/A | `null` |
| `"-"` or `"--"` | N/A | `null` |
| `"N/A"`, `"n/a"`, `"NA"` | N/A | `null` |
| `"None"`, `"null"` | N/A | `null` |
| `"TBD"`, `"TBA"` | TBD | `"TBD"` |
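The mapping above can be sketched as a small helper. The `EMPTY_VALUES` set and the `standardize_empty` name are illustrative, not from any library:

```python
# Values treated as "empty" per the table above (compared case-insensitively)
EMPTY_VALUES = {"", "-", "--", "n/a", "na", "none", "null"}

def standardize_empty(value, json_mode=True):
    """Return None (JSON) or 'N/A' (Markdown) for empty-ish inputs; pass through otherwise."""
    if value is None or value.strip().lower() in EMPTY_VALUES:
        return None if json_mode else "N/A"
    if value.strip().upper() in {"TBD", "TBA"}:
        return "TBD"
    return value
```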
## Price Normalization
Apply when extracting product, pricing, or financial data.
### Extraction Pattern
```python
import re

def normalize_price(raw):
    if not raw:
        return None
    # Remove currency words and symbols
    cleaned = re.sub(r'(?i)(USD|EUR|GBP|BRL|R\$|US\$)', '', raw)
    # Extract numeric value (handles 1,234.56 and 1.234,56 formats)
    match = re.search(r'[\d.,]+', cleaned)
    if not match:
        return None
    num_str = match.group()
    # Detect format: if the last separator is a comma with 2 digits after, it's the decimal mark
    if re.search(r',\d{2}$', num_str):
        num_str = num_str.replace('.', '').replace(',', '.')
    else:
        num_str = num_str.replace(',', '')
    return float(num_str)
```
### Currency Detection
| Symbol/Code | Currency | Symbol/Code | Currency |
|---|---|---|---|
| `$`, `US$`, `USD` | US Dollar | `R$`, `BRL` | Brazilian Real |
| `€`, `EUR` | Euro | `£`, `GBP` | British Pound |
| `¥`, `JPY` | Yen | `₹`, `INR` | Indian Rupee |
| `C$`, `CAD` | Canadian Dollar | `A$`, `AUD` | Australian Dollar |
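Detection can be sketched as an ordered scan over the table. The `CURRENCY_MARKERS` list and `detect_currency` name are illustrative; multi-character symbols must be checked before the bare `$`:

```python
# Ordered: multi-char symbols (US$, R$, C$, A$) before bare "$" to avoid false USD matches
CURRENCY_MARKERS = [
    ("US$", "USD"), ("R$", "BRL"), ("C$", "CAD"), ("A$", "AUD"),
    ("$", "USD"), ("€", "EUR"), ("£", "GBP"), ("¥", "JPY"), ("₹", "INR"),
    ("USD", "USD"), ("EUR", "EUR"), ("GBP", "GBP"), ("BRL", "BRL"),
]

def detect_currency(raw):
    for marker, code in CURRENCY_MARKERS:
        if marker in raw:
            return code
    return None
```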
### Output Format
```json
{
  "price": 29.99,
  "currency": "USD",
  "rawPrice": "$29.99"
}
```
For Markdown, show formatted: $29.99 (right-aligned in table).
## Date Normalization
Normalize all dates to ISO-8601 format.
### Common Formats to Handle
| Input Format | Example | Normalized |
|---|---|---|
| Full text | February 25, 2026 | 2026-02-25 |
| Short text | Feb 25, 2026 | 2026-02-25 |
| US numeric | 02/25/2026 | 2026-02-25 |
| EU numeric | 25/02/2026 | 2026-02-25 |
| ISO already | 2026-02-25 | 2026-02-25 |
| Relative | 3 days ago | (compute from now) |
| Relative | Yesterday | (compute from now) |
| Timestamp | 1740441600 | 2025-02-25 |
| With time | 2026-02-25T14:30:00Z | 2026-02-25 14:30 |
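The fixed formats in the table can be handled by trying `datetime.strptime` patterns in order. This is a sketch (the `DATE_FORMATS` list is illustrative); month-name parsing assumes an English locale, and placing `%m/%d/%Y` before `%d/%m/%Y` encodes the US-first default described under Ambiguous Dates:

```python
from datetime import datetime

DATE_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%b %d, %Y", "%m/%d/%Y", "%d/%m/%Y"]

def normalize_date(raw):
    raw = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # relative dates and timestamps need separate handling
```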
### Ambiguous Dates
When format is ambiguous (e.g. `03/04/2026`):

- Default to US format (MM/DD/YYYY) unless site is clearly non-US
- Check the page `lang` attribute or URL TLD for locale hints
- Note the ambiguity in delivery notes
### Relative Date Resolution
```python
from datetime import datetime, timedelta
import re

def resolve_relative_date(text):
    text = text.lower().strip()
    today = datetime.now()
    if 'today' in text:
        return today.strftime('%Y-%m-%d')
    if 'yesterday' in text:
        return (today - timedelta(days=1)).strftime('%Y-%m-%d')
    match = re.search(r'(\d+)\s*(hour|day|week|month|year)s?\s*ago', text)
    if match:
        n, unit = int(match.group(1)), match.group(2)
        # "N hours ago" still falls on today; months/years are approximated
        deltas = {'hour': 0, 'day': n, 'week': n * 7, 'month': n * 30, 'year': n * 365}
        return (today - timedelta(days=deltas.get(unit, 0))).strftime('%Y-%m-%d')
    return text  # Return as-is if it can't be parsed
```
## URL Resolution
Convert relative URLs to absolute.
### Patterns
| Input | Base URL | Resolved |
|---|---|---|
| `/products/item-1` | `https://example.com/shop` | `https://example.com/products/item-1` |
| `item-1` | `https://example.com/shop/` | `https://example.com/shop/item-1` |
| `//cdn.example.com/img` | `https://example.com` | `https://cdn.example.com/img` |
| `https://other.com/page` | (any) | `https://other.com/page` (already absolute) |
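In Python, the rules in this table are implemented by the standard library's `urllib.parse.urljoin` (the `resolve_url` wrapper name is illustrative):

```python
from urllib.parse import urljoin

def resolve_url(relative, base):
    # urljoin handles absolute paths, relative paths, protocol-relative
    # URLs (//cdn...), and already-absolute URLs
    return urljoin(base, relative)
```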
### JavaScript Resolution
```javascript
function resolveUrl(relative, base) {
  try { return new URL(relative, base || window.location.href).href; }
  catch { return relative; }
}
```
## Phone Normalization
For contact mode extraction.
### Pattern
```python
import re

def normalize_phone(raw):
    if not raw:
        return None
    # Strip every non-digit; remember whether a leading + was present
    has_plus = raw.strip().startswith('+')
    digits = re.sub(r'\D', '', raw)
    if not digits or len(digits) < 7:
        return None
    # Add + prefix if the input had one or the number looks international
    if has_plus or len(digits) >= 11:
        digits = '+' + digits
    return digits
```
### Format by Context
| Context | Format Example |
|---|---|
| JSON output | "+5511999998888" |
| Markdown table | +55 11 99999-8888 |
| CSV output | "+5511999998888" |
## Deduplication
### Exact Deduplication
```python
def deduplicate(records, key_fields=None):
    """Remove exact duplicate records.

    If key_fields is provided, deduplicate by those fields only.
    """
    seen = set()
    unique = []
    for record in records:
        if key_fields:
            key = tuple(record.get(f) for f in key_fields)
        else:
            key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique, len(records) - len(unique)  # (unique_list, removed_count)
```
### Near-Duplicate Detection
When records share key fields but differ in details:
- Group by key fields (e.g. product name + source)
- For each group, keep the record with fewest null values
- If tie, keep the first occurrence
- Report in notes: "Merged N near-duplicate records"
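The group-and-keep strategy above can be sketched as follows (the `merge_near_duplicates` name is illustrative):

```python
def merge_near_duplicates(records, key_fields):
    # Group records by their key fields
    groups = {}
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        groups.setdefault(key, []).append(record)
    merged = []
    for group in groups.values():
        # Keep the record with the fewest nulls; min() is stable, so ties keep the first occurrence
        best = min(group, key=lambda r: sum(1 for v in r.values() if v is None))
        merged.append(best)
    return merged
```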
### Dedup Key Selection by Mode
| Mode | Key Fields |
|---|---|
| product | name + source (or name + brand) |
| contact | name + email (or name + org) |
| jobs | title + company + location |
| events | title + date + location |
| table | all fields (exact match) |
| list | first 2-3 identifying fields |
## Text Cleaning
### Remove Noise
Common noise patterns to strip from extracted text:
| Pattern | Action |
|---|---|
| `\[edit\]`, `\[citation needed\]` | Remove (Wikipedia) |
| `Read more...`, `See more` | Remove (truncation markers) |
| `Sponsored`, `Ad`, `Promoted` | Remove or flag |
| Cookie consent text | Remove |
| Navigation breadcrumbs | Remove |
| Footer boilerplate | Remove |
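A minimal sketch of stripping the regex-expressible patterns from the table (the `NOISE_PATTERNS` list is illustrative and incomplete; cookie banners and breadcrumbs usually need DOM-level removal, not regex):

```python
import re

NOISE_PATTERNS = [
    r'\[edit\]', r'\[citation needed\]',
    r'Read more\.{3}', r'See more',
    r'^\s*(Sponsored|Ad|Promoted)\s*$',
]

def strip_noise(text):
    for pattern in NOISE_PATTERNS:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE | re.MULTILINE)
    return ' '.join(text.split())  # re-run whitespace cleanup afterwards
```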
### Sentence Case Normalization
When extracting ALL-CAPS or inconsistent-case text:
```python
def normalize_case(text):
    if text.isupper() and len(text) > 3:
        return text.title()  # ALL CAPS -> Title Case
    return text
```
Only apply when: field is clearly ALL-CAPS input (common in older sites), user requests it, or data looks better normalized.
## Data Type Coercion
### Automatic Type Detection
| Raw Value | Detected Type | Coerced Value |
|---|---|---|
| `"123"` | integer | `123` |
| `"12.99"` | float | `12.99` |
| `"true"` | boolean | `true` |
| `"false"` | boolean | `false` |
| `"2026-02-25"` | date string | `"2026-02-25"` |
| `"$29.99"` | price | `29.99` + currency |
| `"4.5/5"` | rating | `4.5` |
| `"1,234"` | integer | `1234` |
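The detection rules can be sketched as a cascade of checks (the `coerce_type` name is illustrative; prices and ratings are left to their dedicated normalizers):

```python
import re

def coerce_type(raw):
    s = raw.strip()
    if s.lower() in ("true", "false"):
        return s.lower() == "true"
    if re.fullmatch(r'\d{4}-\d{2}-\d{2}', s):
        return s  # keep ISO dates as strings
    no_commas = s.replace(',', '')  # "1,234" -> "1234"
    if re.fullmatch(r'-?\d+', no_commas):
        return int(no_commas)
    if re.fullmatch(r'-?\d+\.\d+', no_commas):
        return float(no_commas)
    return raw  # prices/ratings fall through to their own normalizers
```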
### Rating Normalization
```python
import re

def normalize_rating(raw):
    if not raw:
        return None
    match = re.search(r'([\d.]+)\s*(?:/\s*([\d.]+))?', str(raw))
    if match:
        score = float(match.group(1))
        max_score = float(match.group(2)) if match.group(2) else 5.0
        return round(score / max_score * 5, 1)  # Normalize to a /5 scale
    return None
```
## Enrichment Patterns
### Domain Extraction
Add domain from full URLs:
```python
from urllib.parse import urlparse

def extract_domain(url):
    try:
        parsed = urlparse(url)
        return parsed.netloc.replace('www.', '')
    except (ValueError, AttributeError):
        return None
```
### Word Count
For article mode:
```python
def word_count(text):
    return len(text.split()) if text else 0
```
### Relative Time
Add human-readable time since date:
```python
from datetime import datetime

def time_since(date_str):
    try:
        dt = datetime.fromisoformat(date_str)
        delta = datetime.now() - dt
        if delta.days == 0: return "Today"
        if delta.days == 1: return "Yesterday"
        if delta.days < 7: return f"{delta.days} days ago"
        if delta.days < 30: return f"{delta.days // 7} weeks ago"
        if delta.days < 365: return f"{delta.days // 30} months ago"
        return f"{delta.days // 365} years ago"
    except ValueError:
        return None
```
## Transform Pipeline Order
Apply transforms in this sequence:
1. HTML entity decode - raw text cleanup
2. Unicode normalization - character standardization
3. Whitespace cleanup - spacing normalization
4. Empty value standardization - null/N/A handling
5. URL resolution - relative to absolute
6. Data type coercion - strings to numbers/dates
7. Price normalization - if applicable
8. Date normalization - if applicable
9. Phone normalization - if applicable
10. Text cleaning - noise removal
11. Deduplication - remove duplicates
12. Sorting - user-requested order
13. Enrichment - domain, word count, etc.
Not all steps apply to every extraction. Apply only what's relevant to the data type and extraction mode.
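A minimal driver for the first three text-level steps might look like this (names are illustrative; the later, mode-specific steps would plug in the functions defined earlier in this document):

```python
import html
import unicodedata

def clean_value(value):
    value = html.unescape(value)                  # 1. HTML entity decode
    value = unicodedata.normalize('NFKC', value)  # 2. Unicode normalization
    return ' '.join(value.split())                # 3. Whitespace cleanup

def transform_records(records):
    # Apply the text-level transforms to every string field of every record
    return [
        {k: clean_value(v) if isinstance(v, str) else v for k, v in r.items()}
        for r in records
    ]
```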