# Data Transforms Reference Patterns for cleaning, normalizing, deduplicating, and enriching extracted web data. Apply these transforms in Phase 5 (Transform) between extraction and validation. --- ## Automatic Transforms Always apply these to every extraction result. ### Whitespace Cleanup ```python # Remove leading/trailing whitespace, collapse internal whitespace value = ' '.join(value.split()) # Remove zero-width characters import re value = re.sub(r'[\u200b\u200c\u200d\ufeff\u00a0]', ' ', value).strip() ``` Patterns to handle: - `\n`, `\r`, `\t` inside cell values -> single space - Multiple consecutive spaces -> single space - Non-breaking spaces (` `, `\u00a0`) -> regular space - Zero-width characters -> remove ### HTML Entity Decode | Entity | Character | Entity | Character | |:------------|:----------|:-----------|:----------| | `&` | `&` | `"` | `"` | | `<` | `<` | `'` | `'` | | `>` | `>` | `'` | `'` | | ` ` | ` ` | `’` | (curly ') | | `—` | `--` | `—` | `--` | ```python import html value = html.unescape(value) ``` ### Unicode Normalization ```python import unicodedata value = unicodedata.normalize('NFKC', value) ``` This handles: - Fancy quotes -> standard quotes - Ligatures -> separate characters (e.g. `fi` -> `fi`) - Full-width characters -> standard (e.g. `A` -> `A`) - Superscript/subscript numbers -> regular numbers ### Empty Value Standardization | Input | Markdown Output | JSON Output | |:------------------------|:----------------|:------------| | `""` (empty string) | `N/A` | `null` | | `"-"` or `"--"` | `N/A` | `null` | | `"N/A"`, `"n/a"`, `"NA"`| `N/A` | `null` | | `"None"`, `"null"` | `N/A` | `null` | | `"TBD"`, `"TBA"` | `TBD` | `"TBD"` | --- ## Price Normalization Apply when extracting product, pricing, or financial data. ### Extraction Pattern ```python import re def normalize_price(raw): if not raw: return None # Remove currency words cleaned = re.sub(r'(?i)(USD|EUR|GBP|BRL|R\$|US\$)', '', raw) # Extract numeric value (handles 1,234.56 and 1.234,56 formats) match = re.search(r'[\d.,]+', cleaned) if not match: return None num_str = match.group() # Detect format: if last separator is comma with 2 digits after, it's decimal if re.search(r',\d{2}$', num_str): num_str = num_str.replace('.', '').replace(',', '.') else: num_str = num_str.replace(',', '') return float(num_str) ``` ### Currency Detection | Symbol/Code | Currency | Symbol/Code | Currency | |:------------|:---------|:------------|:---------| | `$`, `US$`, `USD` | US Dollar | `R$`, `BRL` | Brazilian Real | | `€`, `EUR` | Euro | `£`, `GBP` | British Pound | | `¥`, `JPY` | Yen | `₹`, `INR` | Indian Rupee | | `C$`, `CAD` | Canadian Dollar | `A$`, `AUD` | Australian Dollar | ### Output Format ```json { "price": 29.99, "currency": "USD", "rawPrice": "$29.99" } ``` For Markdown, show formatted: `$29.99` (right-aligned in table). --- ## Date Normalization Normalize all dates to ISO-8601 format. ### Common Formats to Handle | Input Format | Example | Normalized | |:------------------------|:---------------------|:-------------------| | Full text | February 25, 2026 | 2026-02-25 | | Short text | Feb 25, 2026 | 2026-02-25 | | US numeric | 02/25/2026 | 2026-02-25 | | EU numeric | 25/02/2026 | 2026-02-25 | | ISO already | 2026-02-25 | 2026-02-25 | | Relative | 3 days ago | (compute from now) | | Relative | Yesterday | (compute from now) | | Timestamp | 1740441600 | 2025-02-25 | | With time | 2026-02-25T14:30:00Z | 2026-02-25 14:30 | ### Ambiguous Dates When format is ambiguous (e.g. `03/04/2026`): - Default to US format (MM/DD/YYYY) unless site is clearly non-US - Check page `lang` attribute or URL TLD for locale hints - Note ambiguity in delivery notes ### Relative Date Resolution ```python from datetime import datetime, timedelta import re def resolve_relative_date(text): text = text.lower().strip() today = datetime.now() if 'today' in text: return today.strftime('%Y-%m-%d') if 'yesterday' in text: return (today - timedelta(days=1)).strftime('%Y-%m-%d') match = re.search(r'(\d+)\s*(hour|day|week|month|year)s?\s*ago', text) if match: n, unit = int(match.group(1)), match.group(2) deltas = {'hour': 0, 'day': n, 'week': n*7, 'month': n*30, 'year': n*365} return (today - timedelta(days=deltas.get(unit, 0))).strftime('%Y-%m-%d') return text # Return as-is if can't parse ``` --- ## URL Resolution Convert relative URLs to absolute. ### Patterns | Input | Base URL | Resolved | |:-------------------------|:----------------------------|:--------------------------------------| | `/products/item-1` | `https://example.com/shop` | `https://example.com/products/item-1` | | `item-1` | `https://example.com/shop/` | `https://example.com/shop/item-1` | | `//cdn.example.com/img` | `https://example.com` | `https://cdn.example.com/img` | | `https://other.com/page` | (any) | `https://other.com/page` (absolute) | ### JavaScript Resolution ```javascript function resolveUrl(relative, base) { try { return new URL(relative, base || window.location.href).href; } catch { return relative; } } ``` --- ## Phone Normalization For contact mode extraction. ### Pattern ```python import re def normalize_phone(raw): if not raw: return None # Remove all non-digit chars except leading + digits = re.sub(r'[^\d+]', '', raw) if not digits or len(digits) < 7: return None # Add + prefix if looks international if len(digits) >= 11 and not digits.startswith('+'): digits = '+' + digits return digits ``` ### Format by Context | Context | Format Example | |:-----------------|:---------------------| | JSON output | `"+5511999998888"` | | Markdown table | `+55 11 99999-8888` | | CSV output | `"+5511999998888"` | --- ## Deduplication ### Exact Deduplication ```python def deduplicate(records, key_fields=None): """Remove exact duplicate records. If key_fields provided, deduplicate by those fields only. """ seen = set() unique = [] for record in records: if key_fields: key = tuple(record.get(f) for f in key_fields) else: key = tuple(sorted(record.items())) if key not in seen: seen.add(key) unique.append(record) return unique, len(records) - len(unique) # returns (unique_list, removed_count) ``` ### Near-Duplicate Detection When records share key fields but differ in details: 1. Group by key fields (e.g. product name + source) 2. For each group, keep the record with fewest null values 3. If tie, keep the first occurrence 4. Report in notes: "Merged N near-duplicate records" ### Dedup Key Selection by Mode | Mode | Key Fields | |:---------|:----------------------------------| | product | name + source (or name + brand) | | contact | name + email (or name + org) | | jobs | title + company + location | | events | title + date + location | | table | all fields (exact match) | | list | first 2-3 identifying fields | --- ## Text Cleaning ### Remove Noise Common noise patterns to strip from extracted text: | Pattern | Action | |:-----------------------------------|:--------------------------| | `\[edit\]`, `\[citation needed\]` | Remove (Wikipedia) | | `Read more...`, `See more` | Remove (truncation markers)| | `Sponsored`, `Ad`, `Promoted` | Remove or flag | | Cookie consent text | Remove | | Navigation breadcrumbs | Remove | | Footer boilerplate | Remove | ### Sentence Case Normalization When extracting ALL-CAPS or inconsistent-case text: ```python def normalize_case(text): if text.isupper() and len(text) > 3: return text.title() # ALL CAPS -> Title Case return text ``` Only apply when: field is clearly ALL-CAPS input (common in older sites), user requests it, or data looks better normalized. --- ## Data Type Coercion ### Automatic Type Detection | Raw Value | Detected Type | Coerced Value | |:--------------|:--------------|:------------------| | `"123"` | integer | `123` | | `"12.99"` | float | `12.99` | | `"true"` | boolean | `true` | | `"false"` | boolean | `false` | | `"2026-02-25"`| date string | `"2026-02-25"` | | `"$29.99"` | price | `29.99` + currency| | `"4.5/5"` | rating | `4.5` | | `"1,234"` | integer | `1234` | ### Rating Normalization ```python import re def normalize_rating(raw): if not raw: return None match = re.search(r'([\d.]+)\s*(?:/\s*([\d.]+))?', str(raw)) if match: score = float(match.group(1)) max_score = float(match.group(2)) if match.group(2) else 5.0 return round(score / max_score * 5, 1) # Normalize to /5 scale return None ``` --- ## Enrichment Patterns ### Domain Extraction Add domain from full URLs: ```python from urllib.parse import urlparse def extract_domain(url): try: parsed = urlparse(url) domain = parsed.netloc.replace('www.', '') return domain except: return None ``` ### Word Count For article mode: ```python def word_count(text): return len(text.split()) if text else 0 ``` ### Relative Time Add human-readable time since date: ```python def time_since(date_str): from datetime import datetime try: dt = datetime.fromisoformat(date_str) delta = datetime.now() - dt if delta.days == 0: return "Today" if delta.days == 1: return "Yesterday" if delta.days < 7: return f"{delta.days} days ago" if delta.days < 30: return f"{delta.days // 7} weeks ago" if delta.days < 365: return f"{delta.days // 30} months ago" return f"{delta.days // 365} years ago" except: return None ``` --- ## Transform Pipeline Order Apply transforms in this sequence: 1. **HTML entity decode** - raw text cleanup 2. **Unicode normalization** - character standardization 3. **Whitespace cleanup** - spacing normalization 4. **Empty value standardization** - null/N/A handling 5. **URL resolution** - relative to absolute 6. **Data type coercion** - strings to numbers/dates 7. **Price normalization** - if applicable 8. **Date normalization** - if applicable 9. **Phone normalization** - if applicable 10. **Text cleaning** - noise removal 11. **Deduplication** - remove duplicates 12. **Sorting** - user-requested order 13. **Enrichment** - domain, word count, etc. Not all steps apply to every extraction. Apply only what's relevant to the data type and extraction mode.