Files
antigravity-skills-reference/skills/web-scraper/references/data-transforms.md
ProgramadorBrasil 61ec71c5c7 feat: add 52 specialized AI agent skills (#217)
New skills covering 10 categories:

**Security & Audit**: 007 (STRIDE/PASTA/OWASP), cred-omega (secrets management)
**AI Personas**: Karpathy, Hinton, Sutskever, LeCun (4 sub-skills), Altman, Musk, Gates, Jobs, Buffett
**Multi-agent Orchestration**: agent-orchestrator, task-intelligence, multi-advisor
**Code Analysis**: matematico-tao (Terence Tao-inspired mathematical code analysis)
**Social & Messaging**: Instagram Graph API, Telegram Bot, WhatsApp Cloud API, social-orchestrator
**Image Generation**: AI Studio (Gemini), Stability AI, ComfyUI Gateway, image-studio router
**Brazilian Domain**: 6 auction specialist modules, 2 legal advisors, auctioneers data scraper
**Product & Growth**: design, invention, monetization, analytics, growth engine
**DevOps & LLM Ops**: Docker/CI-CD/AWS, RAG/embeddings/fine-tuning
**Skill Governance**: installer, sentinel auditor, context management

Each skill includes:
- Standardized YAML frontmatter (name, description, risk, source, tags, tools)
- Structured sections (Overview, When to Use, How it Works, Best Practices)
- Python scripts and reference documentation where applicable
- Cross-platform compatibility (Claude Code, Antigravity, Cursor, Gemini CLI, Codex CLI)

Co-authored-by: ProgramadorBrasil <214873561+ProgramadorBrasil@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 10:04:07 +01:00

398 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Data Transforms Reference
Patterns for cleaning, normalizing, deduplicating, and enriching
extracted web data. Apply these transforms in Phase 5 (Transform)
between extraction and validation.
---
## Automatic Transforms
Always apply these to every extraction result.
### Whitespace Cleanup
```python
# Remove leading/trailing whitespace, collapse internal whitespace
value = ' '.join(value.split())
# Remove zero-width characters
import re
value = re.sub(r'[\u200b\u200c\u200d\ufeff\u00a0]', ' ', value).strip()
```
Patterns to handle:
- `\n`, `\r`, `\t` inside cell values -> single space
- Multiple consecutive spaces -> single space
- Non-breaking spaces (`&nbsp;`, `\u00a0`) -> regular space
- Zero-width characters -> remove
### HTML Entity Decode
| Entity | Character | Entity | Character |
|:------------|:----------|:-----------|:----------|
| `&amp;` | `&` | `&quot;` | `"` |
| `&lt;` | `<` | `&apos;` | `'` |
| `&gt;` | `>` | `&#39;` | `'` |
| `&nbsp;` | ` ` | `&#8217;` | (curly ') |
| `&mdash;` | `--` | `&#8212;` | `--` |
```python
import html
value = html.unescape(value)
```
### Unicode Normalization
```python
import unicodedata
value = unicodedata.normalize('NFKC', value)
```
This handles:
- Fancy quotes -> standard quotes
- Ligatures -> separate characters (e.g. `fi` -> `fi`)
- Full-width characters -> standard (e.g. `` -> `A`)
- Superscript/subscript numbers -> regular numbers
### Empty Value Standardization
| Input | Markdown Output | JSON Output |
|:------------------------|:----------------|:------------|
| `""` (empty string) | `N/A` | `null` |
| `"-"` or `"--"` | `N/A` | `null` |
| `"N/A"`, `"n/a"`, `"NA"`| `N/A` | `null` |
| `"None"`, `"null"` | `N/A` | `null` |
| `"TBD"`, `"TBA"` | `TBD` | `"TBD"` |
---
## Price Normalization
Apply when extracting product, pricing, or financial data.
### Extraction Pattern
```python
import re
def normalize_price(raw):
if not raw:
return None
# Remove currency words
cleaned = re.sub(r'(?i)(USD|EUR|GBP|BRL|R\$|US\$)', '', raw)
# Extract numeric value (handles 1,234.56 and 1.234,56 formats)
match = re.search(r'[\d.,]+', cleaned)
if not match:
return None
num_str = match.group()
# Detect format: if last separator is comma with 2 digits after, it's decimal
if re.search(r',\d{2}$', num_str):
num_str = num_str.replace('.', '').replace(',', '.')
else:
num_str = num_str.replace(',', '')
return float(num_str)
```
### Currency Detection
| Symbol/Code | Currency | Symbol/Code | Currency |
|:------------|:---------|:------------|:---------|
| `$`, `US$`, `USD` | US Dollar | `R$`, `BRL` | Brazilian Real |
| `€`, `EUR` | Euro | `£`, `GBP` | British Pound |
| `¥`, `JPY` | Yen | `₹`, `INR` | Indian Rupee |
| `C$`, `CAD` | Canadian Dollar | `A$`, `AUD` | Australian Dollar |
### Output Format
```json
{
"price": 29.99,
"currency": "USD",
"rawPrice": "$29.99"
}
```
For Markdown, show formatted: `$29.99` (right-aligned in table).
---
## Date Normalization
Normalize all dates to ISO-8601 format.
### Common Formats to Handle
| Input Format | Example | Normalized |
|:------------------------|:---------------------|:-------------------|
| Full text | February 25, 2026 | 2026-02-25 |
| Short text | Feb 25, 2026 | 2026-02-25 |
| US numeric | 02/25/2026 | 2026-02-25 |
| EU numeric | 25/02/2026 | 2026-02-25 |
| ISO already | 2026-02-25 | 2026-02-25 |
| Relative | 3 days ago | (compute from now) |
| Relative | Yesterday | (compute from now) |
| Timestamp | 1740441600 | 2025-02-25 |
| With time | 2026-02-25T14:30:00Z | 2026-02-25 14:30 |
### Ambiguous Dates
When format is ambiguous (e.g. `03/04/2026`):
- Default to US format (MM/DD/YYYY) unless site is clearly non-US
- Check page `lang` attribute or URL TLD for locale hints
- Note ambiguity in delivery notes
### Relative Date Resolution
```python
from datetime import datetime, timedelta
import re
def resolve_relative_date(text):
text = text.lower().strip()
today = datetime.now()
if 'today' in text: return today.strftime('%Y-%m-%d')
if 'yesterday' in text: return (today - timedelta(days=1)).strftime('%Y-%m-%d')
match = re.search(r'(\d+)\s*(hour|day|week|month|year)s?\s*ago', text)
if match:
n, unit = int(match.group(1)), match.group(2)
deltas = {'hour': 0, 'day': n, 'week': n*7, 'month': n*30, 'year': n*365}
return (today - timedelta(days=deltas.get(unit, 0))).strftime('%Y-%m-%d')
return text # Return as-is if can't parse
```
---
## URL Resolution
Convert relative URLs to absolute.
### Patterns
| Input | Base URL | Resolved |
|:-------------------------|:----------------------------|:--------------------------------------|
| `/products/item-1` | `https://example.com/shop` | `https://example.com/products/item-1` |
| `item-1` | `https://example.com/shop/` | `https://example.com/shop/item-1` |
| `//cdn.example.com/img` | `https://example.com` | `https://cdn.example.com/img` |
| `https://other.com/page` | (any) | `https://other.com/page` (absolute) |
### JavaScript Resolution
```javascript
function resolveUrl(relative, base) {
try { return new URL(relative, base || window.location.href).href; }
catch { return relative; }
}
```
---
## Phone Normalization
For contact mode extraction.
### Pattern
```python
import re
def normalize_phone(raw):
if not raw:
return None
# Remove all non-digit chars except leading +
digits = re.sub(r'[^\d+]', '', raw)
if not digits or len(digits) < 7:
return None
# Add + prefix if looks international
if len(digits) >= 11 and not digits.startswith('+'):
digits = '+' + digits
return digits
```
### Format by Context
| Context | Format Example |
|:-----------------|:---------------------|
| JSON output | `"+5511999998888"` |
| Markdown table | `+55 11 99999-8888` |
| CSV output | `"+5511999998888"` |
---
## Deduplication
### Exact Deduplication
```python
def deduplicate(records, key_fields=None):
"""Remove exact duplicate records.
If key_fields provided, deduplicate by those fields only.
"""
seen = set()
unique = []
for record in records:
if key_fields:
key = tuple(record.get(f) for f in key_fields)
else:
key = tuple(sorted(record.items()))
if key not in seen:
seen.add(key)
unique.append(record)
return unique, len(records) - len(unique) # returns (unique_list, removed_count)
```
### Near-Duplicate Detection
When records share key fields but differ in details:
1. Group by key fields (e.g. product name + source)
2. For each group, keep the record with fewest null values
3. If tie, keep the first occurrence
4. Report in notes: "Merged N near-duplicate records"
### Dedup Key Selection by Mode
| Mode | Key Fields |
|:---------|:----------------------------------|
| product | name + source (or name + brand) |
| contact | name + email (or name + org) |
| jobs | title + company + location |
| events | title + date + location |
| table | all fields (exact match) |
| list | first 2-3 identifying fields |
---
## Text Cleaning
### Remove Noise
Common noise patterns to strip from extracted text:
| Pattern | Action |
|:-----------------------------------|:--------------------------|
| `\[edit\]`, `\[citation needed\]` | Remove (Wikipedia) |
| `Read more...`, `See more` | Remove (truncation markers)|
| `Sponsored`, `Ad`, `Promoted` | Remove or flag |
| Cookie consent text | Remove |
| Navigation breadcrumbs | Remove |
| Footer boilerplate | Remove |
### Sentence Case Normalization
When extracting ALL-CAPS or inconsistent-case text:
```python
def normalize_case(text):
if text.isupper() and len(text) > 3:
return text.title() # ALL CAPS -> Title Case
return text
```
Only apply when: field is clearly ALL-CAPS input (common in older sites),
user requests it, or data looks better normalized.
---
## Data Type Coercion
### Automatic Type Detection
| Raw Value | Detected Type | Coerced Value |
|:--------------|:--------------|:------------------|
| `"123"` | integer | `123` |
| `"12.99"` | float | `12.99` |
| `"true"` | boolean | `true` |
| `"false"` | boolean | `false` |
| `"2026-02-25"`| date string | `"2026-02-25"` |
| `"$29.99"` | price | `29.99` + currency|
| `"4.5/5"` | rating | `4.5` |
| `"1,234"` | integer | `1234` |
### Rating Normalization
```python
import re
def normalize_rating(raw):
if not raw:
return None
match = re.search(r'([\d.]+)\s*(?:/\s*([\d.]+))?', str(raw))
if match:
score = float(match.group(1))
max_score = float(match.group(2)) if match.group(2) else 5.0
return round(score / max_score * 5, 1) # Normalize to /5 scale
return None
```
---
## Enrichment Patterns
### Domain Extraction
Add domain from full URLs:
```python
from urllib.parse import urlparse
def extract_domain(url):
try:
parsed = urlparse(url)
domain = parsed.netloc.replace('www.', '')
return domain
except:
return None
```
### Word Count
For article mode:
```python
def word_count(text):
return len(text.split()) if text else 0
```
### Relative Time
Add human-readable time since date:
```python
def time_since(date_str):
from datetime import datetime
try:
dt = datetime.fromisoformat(date_str)
delta = datetime.now() - dt
if delta.days == 0: return "Today"
if delta.days == 1: return "Yesterday"
if delta.days < 7: return f"{delta.days} days ago"
if delta.days < 30: return f"{delta.days // 7} weeks ago"
if delta.days < 365: return f"{delta.days // 30} months ago"
return f"{delta.days // 365} years ago"
except:
return None
```
---
## Transform Pipeline Order
Apply transforms in this sequence:
1. **HTML entity decode** - raw text cleanup
2. **Unicode normalization** - character standardization
3. **Whitespace cleanup** - spacing normalization
4. **Empty value standardization** - null/N/A handling
5. **URL resolution** - relative to absolute
6. **Data type coercion** - strings to numbers/dates
7. **Price normalization** - if applicable
8. **Date normalization** - if applicable
9. **Phone normalization** - if applicable
10. **Text cleaning** - noise removal
11. **Deduplication** - remove duplicates
12. **Sorting** - user-requested order
13. **Enrichment** - domain, word count, etc.
Not all steps apply to every extraction. Apply only what's relevant
to the data type and extraction mode.