antigravity-skills-reference/skills/web-scraper/references/extraction-patterns.md
ProgramadorBrasil 61ec71c5c7 feat: add 52 specialized AI agent skills (#217)
Co-authored-by: ProgramadorBrasil <214873561+ProgramadorBrasil@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 10:04:07 +01:00

Extraction Patterns Reference

CSS selectors, JavaScript snippets, and domain-specific tips for common web scraping scenarios.


CSS Selector Patterns

Tables

/* Standard HTML tables */
table                               /* All tables */
table.data-table                    /* Class-based */
table[id*="result"]                 /* ID contains "result" */
table thead th                      /* Header cells */
table tbody tr                      /* Data rows */
table tbody tr td                   /* Data cells */
table tbody tr td:nth-child(2)      /* Specific column (2nd) */

/* Grid layouts acting as tables */
[role="table"], [role="grid"]       /* ARIA table / grid role */
[role="row"]                        /* ARIA row */
[role="cell"], [role="gridcell"]    /* ARIA cell (in a table) / grid cell (in a grid) */
.table-responsive table             /* Bootstrap responsive wrapper */

Product Listings

/* E-commerce product grids */
.product-card, .product-item, .product-tile
[data-product-id]                   /* Data attribute markers */
.product-name, .product-title, h2.title
.price, .product-price, [data-price]
.price--sale, .price--original      /* Sale vs original price */
.rating, .stars, [data-rating]
.availability, .stock-status
.product-image img, .product-thumb img

/* Common e-commerce patterns */
.search-results .result-item
.catalog-grid .catalog-item
.listing .listing-item

Search Results

/* Generic search result patterns */
.search-result, .result-item, .search-entry
.result-title a, .result-link
.result-snippet, .result-description
.result-url, .result-source
.result-date, .result-timestamp
.pagination a, .page-numbers a, [aria-label="Next"]

Contact / Directory

/* People and contact cards */
.team-member, .staff-card, .person, .contact-card
.member-name, .person-name, h3.name
.member-title, .job-title, .role
.member-email a[href^="mailto:"]
.member-phone a[href^="tel:"]
.member-bio, .person-description
.vcard                              /* hCard microformat */

FAQ / Accordion

/* FAQ and accordion patterns */
.faq-item, .accordion-item, [itemtype*="FAQPage"] [itemprop="mainEntity"]
.faq-question, .accordion-header, [itemprop="name"], summary
.faq-answer, .accordion-body, .accordion-content, [itemprop="acceptedAnswer"]
details, details > summary          /* Native HTML accordion */
[role="tabpanel"]                   /* Tab-based FAQ */

Pricing Tables

/* SaaS pricing page patterns */
.pricing-table, .pricing-card, .plan-card, .pricing-tier
.plan-name, .tier-name, .pricing-title
.plan-price, .pricing-amount, .price-value
.plan-period, .billing-cycle        /* monthly/annually */
.plan-features li, .feature-list li
.plan-cta, .pricing-button
[class*="popular"], [class*="recommended"], [class*="featured"]  /* highlighted plan */

Job Listings

/* Job board patterns */
.job-listing, .job-card, .job-posting, [itemtype*="JobPosting"]
.job-title, [itemprop="title"]
.company-name, [itemprop="hiringOrganization"]
.job-location, [itemprop="jobLocation"]
.job-salary, [itemprop="baseSalary"]
.job-type, .employment-type
.job-date, [itemprop="datePosted"]

Events

/* Event listing patterns */
.event-card, .event-item, [itemtype*="Event"]
.event-title, [itemprop="name"]
.event-date, [itemprop="startDate"], time[datetime]
.event-location, [itemprop="location"]
.event-description, [itemprop="description"]
.event-speaker, .speaker-name

Navigation / Pagination

/* Pagination controls */
.pagination, .pager, nav[aria-label*="pagination"]
.pagination .next, a[rel="next"]
.pagination .prev, a[rel="prev"]
.page-numbers, .page-link
button[data-page], a[data-page]
.load-more, button.show-more

Articles / Blog Posts

/* Article content */
article, .post, .entry, .article-content
article h1, .post-title, .entry-title
.author, .byline, [rel="author"]
time, .date, .published, .post-date
.post-content, .entry-content, .article-body
.tags a, .categories a, .post-tags a
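
The comma-separated alternatives in the lists above are candidates, not guarantees: any given site will typically use only one of them, if any. A minimal sketch of trying candidates in order and reporting which one matched is shown below. `firstMatchingSelector` and its `minCount` parameter are hypothetical names introduced for illustration; `root` can be `document` or any element.

```javascript
// Hypothetical helper: try candidate selectors in order and report the first
// one that matches at least `minCount` elements under `root`.
function firstMatchingSelector(root, candidates, minCount = 1) {
  for (const selector of candidates) {
    let matches;
    try {
      matches = root.querySelectorAll(selector);
    } catch {
      continue; // selector not valid in this engine: skip it
    }
    if (matches.length >= minCount) {
      return { selector, count: matches.length };
    }
  }
  return null; // nothing matched; inspect the page manually
}

// In a browser console (candidates taken from the Product Listings list):
// firstMatchingSelector(document, ['.product-card', '.product-item', '.product-tile']);
```

Returning the winning selector together with its match count makes it easy to spot a candidate that matches only a stray element rather than the whole listing.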

JavaScript Extraction Snippets

Generic Table Extractor

function extractTable(selector) {
  const table = document.querySelector(selector || 'table');
  if (!table) return { error: 'No table found' };

  const text = el => el.textContent.trim();
  // table.rows covers <thead>, <tbody> and <tfoot> rows, but not nested tables
  const allRows = Array.from(table.rows);
  // Header row: first <thead> row if present, otherwise the table's first row
  const headerRow = table.tHead?.rows[0] || allRows[0];
  const headers = headerRow ? Array.from(headerRow.cells).map(text) : [];

  // Data rows: everything outside <thead>/<tfoot> except the header row
  const rows = allRows
    .filter(tr => tr !== headerRow
      && tr.parentNode !== table.tHead && tr.parentNode !== table.tFoot)
    .map(tr => Array.from(tr.cells).filter(c => c.tagName === 'TD').map(text))
    .filter(cells => cells.length > 0);

  return { headers, rows, rowCount: rows.length };
}
JSON.stringify(extractTable());

Multi-Table Extractor

function extractAllTables() {
  const tables = document.querySelectorAll('table');
  return Array.from(tables).map((table, idx) => {
    const caption = table.querySelector('caption')?.textContent?.trim()
      || table.getAttribute('aria-label') || `Table ${idx + 1}`;
    const headers = Array.from(
      table.querySelectorAll('thead th, tr:first-child th')
    ).map(el => el.textContent.trim());
    const rows = Array.from(table.querySelectorAll('tbody tr'))
      .map(tr => Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim()))
      .filter(r => r.length > 0);
    return { caption, headers, rows, rowCount: rows.length };
  });
}
JSON.stringify(extractAllTables());

Generic List Extractor

function extractList(containerSelector, itemSelector, fieldMap) {
  // fieldMap: { fieldName: { selector: 'CSS', attr: 'href'|'src'|null } }
  const container = document.querySelector(containerSelector);
  if (!container) return { error: 'Container not found' };

  const items = Array.from(container.querySelectorAll(itemSelector));
  const data = items.map(item => {
    const record = {};
    for (const [key, config] of Object.entries(fieldMap)) {
      const sel = typeof config === 'string' ? config : config.selector;
      const attr = typeof config === 'object' ? config.attr : null;
      const el = item.querySelector(sel);
      if (!el) { record[key] = null; continue; }
      record[key] = attr ? el.getAttribute(attr) : el.textContent.trim();
    }
    return record;
  });
  return { data, itemCount: data.length };
}

// Example usage:
JSON.stringify(extractList('.results', '.result-item', {
  title: '.result-title',
  description: '.result-snippet',
  url: { selector: '.result-title a', attr: 'href' },
  date: '.result-date'
}));

JSON-LD Structured Data Extractor

Many pages embed structured data that is easier to parse than the DOM:

function extractJsonLd(targetType) {
  const scripts = document.querySelectorAll('script[type="application/ld+json"]');
  const allData = Array.from(scripts).map(s => {
    try { return JSON.parse(s.textContent); } catch { return null; }
  }).filter(Boolean);

  // Flatten top-level arrays (valid JSON-LD), then @graph arrays
  const flat = allData.flat().flatMap(d => d['@graph'] || [d]);

  if (targetType) {
    return flat.filter(d =>
      d['@type'] === targetType ||
      (Array.isArray(d['@type']) && d['@type'].includes(targetType))
    );
  }
  return flat;
}
// Extract products: extractJsonLd('Product')
// Extract articles: extractJsonLd('Article')
// Extract all: extractJsonLd()
JSON.stringify(extractJsonLd());

Common JSON-LD types and their useful fields:

  • Product: name, offers.price, offers.priceCurrency, aggregateRating, brand.name
  • Article: headline, author.name, datePublished, description, wordCount
  • Organization: name, address, telephone, email, url
  • BreadcrumbList: itemListElement[].name (navigation path)
  • FAQPage: mainEntity[].name (question), mainEntity[].acceptedAnswer.text
  • JobPosting: title, hiringOrganization.name, jobLocation, baseSalary
  • Event: name, startDate, endDate, location, performer
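
As an illustration of reading those fields defensively, the sketch below flattens one Product node into a simple record. It assumes the node came from `extractJsonLd('Product')`; `summarizeProduct` is a hypothetical name, and `ratingValue` is the schema.org property of `aggregateRating` chosen here. Real pages vary (for example, some publish an `AggregateOffer` with `lowPrice` instead of `price`), so treat missing values as normal.

```javascript
// Sketch: flatten one schema.org Product node into a simple record.
// `summarizeProduct` is a hypothetical helper, not part of any library.
function summarizeProduct(node) {
  // `offers` may be a single Offer object or an array of them.
  const offer = Array.isArray(node.offers) ? node.offers[0] : node.offers;
  // `brand` may be a plain string or an object with a `name`.
  const brand = typeof node.brand === 'string' ? node.brand : node.brand?.name;
  return {
    name: node.name ?? null,
    price: offer?.price ?? null,
    currency: offer?.priceCurrency ?? null,
    rating: node.aggregateRating?.ratingValue ?? null,
    brand: brand ?? null,
  };
}

// Usage in a browser console: extractJsonLd('Product').map(summarizeProduct)
```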

OpenGraph / Meta Tag Extractor

function extractMeta() {
  const meta = {};
  document.querySelectorAll('meta[property^="og:"], meta[name^="twitter:"]')
    .forEach(el => {
      const key = el.getAttribute('property') || el.getAttribute('name');
      meta[key] = el.getAttribute('content');
    });
  meta.title = document.title;
  meta.description = document.querySelector('meta[name="description"]')
    ?.getAttribute('content');
  meta.canonical = document.querySelector('link[rel="canonical"]')
    ?.getAttribute('href');
  return meta;
}
JSON.stringify(extractMeta());

Pricing Plan Extractor

function extractPricingPlans() {
  const cards = document.querySelectorAll(
    '.pricing-card, .plan-card, .pricing-tier, [class*="pricing"] [class*="card"]'
  );
  return Array.from(cards).map(card => ({
    name: card.querySelector('[class*="name"], [class*="title"], h2, h3')
      ?.textContent?.trim() || null,
    price: card.querySelector('[class*="price"], [class*="amount"]')
      ?.textContent?.trim() || null,
    period: card.querySelector('[class*="period"], [class*="billing"]')
      ?.textContent?.trim() || null,
    features: Array.from(card.querySelectorAll('[class*="feature"] li, ul li'))
      .map(li => li.textContent.trim()),
    highlighted: card.matches('[class*="popular"], [class*="recommended"], [class*="featured"]'),
    ctaText: card.querySelector('a, button')?.textContent?.trim() || null,
    ctaUrl: card.querySelector('a')?.href || null,
  }));
}
JSON.stringify(extractPricingPlans());

FAQ Extractor

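function extractFAQ() {
  // Try JSON-LD first (requires extractJsonLd() from the section above)
  const ldFaq = extractJsonLd('FAQPage');
  if (ldFaq.length > 0 && ldFaq[0].mainEntity) {
    // mainEntity may be a single Question object rather than an array
    return [].concat(ldFaq[0].mainEntity).map(q => ({
      question: q.name,
      answer: q.acceptedAnswer?.text || null
    }));
  }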

  // Try <details>/<summary> pattern
  const details = document.querySelectorAll('details');
  if (details.length > 0) {
    return Array.from(details).map(d => ({
      question: d.querySelector('summary')?.textContent?.trim() || null,
      answer: Array.from(d.children).filter(c => c.tagName !== 'SUMMARY')
        .map(c => c.textContent.trim()).join(' ')
    }));
  }

  // Try accordion pattern
  const items = document.querySelectorAll(
    '.faq-item, .accordion-item, [class*="faq"] [class*="item"]'
  );
  return Array.from(items).map(item => ({
    question: item.querySelector(
      '[class*="question"], [class*="header"], [class*="title"], h3, h4'
    )?.textContent?.trim() || null,
    answer: item.querySelector(
      '[class*="answer"], [class*="body"], [class*="content"], p'
    )?.textContent?.trim() || null
  }));
}
JSON.stringify(extractFAQ());

Link Extractor

function extractLinks(scope) {
  const container = scope ? document.querySelector(scope) : document;
  if (!container) return { error: 'Scope not found' };
  const links = Array.from(container.querySelectorAll('a[href]'))
    .map(a => ({
      text: a.textContent.trim(),
      href: a.href,
      title: a.title || null
    }))
    .filter(l => l.text && l.href && !l.href.startsWith('javascript:'));
  return { links, count: links.length };
}
JSON.stringify(extractLinks());

Image Extractor

function extractImages(scope) {
  const container = scope ? document.querySelector(scope) : document;
  if (!container) return { error: 'Scope not found' };
  const images = Array.from(container.querySelectorAll('img'))
    .map(img => ({
      src: img.src,
      alt: img.alt || null,
      width: img.naturalWidth,
      height: img.naturalHeight
    }))
    // Drop images without a src and inline GIF data URIs (often lazy-load placeholders)
    .filter(i => i.src && !i.src.startsWith('data:image/gif'));
  return { images, count: images.length };
}
JSON.stringify(extractImages());

Scroll-and-Collect Pattern

For pages with lazy-loaded content, use this pattern with Browser automation:

// Count items before scroll
function countItems(selector) {
  return document.querySelectorAll(selector).length;
}

Then in the workflow:

  1. javascript_tool: countItems('.item') -> get initial count
  2. computer(action="scroll", scroll_direction="down")
  3. computer(action="wait", duration=2)
  4. javascript_tool: countItems('.item') -> get new count
  5. If new count > old count, repeat from step 2
  6. If the count is unchanged after 2 consecutive scrolls, assume all items are loaded
  7. Extract all items at once
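
The stopping rule in the workflow above can be sketched as a small loop with the three tool calls passed in as functions. All names here (`scrollUntilStable`, `scrollDown`, `wait`, `maxScrolls`, `maxItems`) are hypothetical, and the two limits are an added safeguard for endless feeds rather than part of the steps above.

```javascript
// Sketch of the scroll-and-collect loop. The caller supplies:
//   countItems(): current number of matching items
//   scrollDown(): performs one scroll
//   wait(seconds): pauses for lazy content to load
async function scrollUntilStable({
  countItems, scrollDown, wait,
  maxScrolls = 50,       // assumption: hard cap on scroll attempts
  maxItems = Infinity,   // assumption: optional item limit
}) {
  let count = await countItems();
  let unchanged = 0; // consecutive scrolls that added nothing
  let scrolls = 0;
  while (unchanged < 2 && scrolls < maxScrolls && count < maxItems) {
    await scrollDown();
    await wait(2); // seconds, as in step 3
    scrolls += 1;
    const next = await countItems();
    unchanged = next > count ? 0 : unchanged + 1;
    count = next;
  }
  return { count, scrolls };
}

// In-page variant (browser only), assuming items match '.item':
// scrollUntilStable({
//   countItems: () => document.querySelectorAll('.item').length,
//   scrollDown: () => window.scrollBy(0, window.innerHeight),
//   wait: s => new Promise(r => setTimeout(r, s * 1000)),
// });
```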

Domain-Specific Tips

E-Commerce Sites

  • Check for JSON-LD Product schema first - often has cleaner data than DOM
  • Prices may have hidden original/sale price elements
  • Availability often encoded in data attributes (data-available="true")
  • Product variants (size, color) may require click interactions
  • Review data often loaded lazily - scroll to reviews section first
  • Some sites load product data from internal JSON APIs (for example, paths like /api/products) - check the Network tab

Wikipedia

  • Tables use class .wikitable - always prefer this selector
  • Infoboxes use class .infobox
  • References in <sup class="reference"> - exclude from text extraction
  • Table cells may contain complex nested HTML - use .textContent.trim()
  • Sortable tables have class .sortable with sort buttons in headers
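
One way to apply the tip about references is to remove the `sup.reference` nodes from a copy of the element before reading its text, so the live page is left untouched. This is a minimal sketch; `textWithoutReferences` is a hypothetical helper name.

```javascript
// Sketch: read an element's text without Wikipedia citation markers like [1].
function textWithoutReferences(el) {
  const clone = el.cloneNode(true); // deep copy; the page itself is not modified
  clone.querySelectorAll('sup.reference').forEach(sup => sup.remove());
  // Collapse the whitespace left behind by nested markup
  return clone.textContent.replace(/\s+/g, ' ').trim();
}

// Browser usage on the first cell of the first .wikitable:
// textWithoutReferences(document.querySelector('.wikitable td'));
```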

News Sites

  • Article body often in <article> or [itemprop="articleBody"]
  • Paywall indicators: .paywall, .subscribe-wall, truncated with "Read more"
  • Publication date in <time> element or [itemprop="datePublished"]
  • Author in [itemprop="author"] or .byline
  • JSON-LD NewsArticle often has complete metadata

Government / Data Portals

  • Often use HTML tables without JavaScript
  • May have download links for CSV/Excel - check for .csv, .xlsx links
  • Data dictionaries may be on separate pages
  • Look for API endpoints in page source (/api/, .json links)
  • CORS may block API requests made from the page context; CORS is enforced only by browsers, so fetch the endpoint with curl from Bash instead
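
Checking for download links can be sketched as a filter over the hrefs collected from the page. `findDataDownloads` is a hypothetical helper; the extension list mirrors the ones mentioned above and should be adjusted per portal.

```javascript
// Sketch: keep only links whose path ends in a data-file extension.
function findDataDownloads(hrefs, baseUrl, extensions = ['.csv', '.xlsx', '.json']) {
  const found = [];
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, baseUrl); // resolves relative links against the page URL
    } catch {
      continue; // skip malformed hrefs
    }
    // Compare on the path only, so query strings do not hide the extension
    const path = url.pathname.toLowerCase();
    if (extensions.some(ext => path.endsWith(ext))) found.push(url.href);
  }
  return found;
}

// Browser usage, reusing extractLinks() from above:
// findDataDownloads(extractLinks().links.map(l => l.href), location.href);
```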

Social Media (Public Profiles)

  • Content is almost always JS-rendered - use Browser automation
  • Rate limiting is aggressive - keep requests minimal
  • Infinite scroll is the norm - set clear item limits
  • Structure changes frequently - prefer text extraction over selectors

SaaS Pricing Pages

  • Pricing often changes dynamically (monthly vs annual toggle)
  • May need to click "Annual" toggle to see annual prices
  • Feature comparison tables often use checkmarks (Unicode or SVG)
  • Check for hidden elements toggled by billing period selector

Job Boards

  • Many use the JSON-LD JobPosting schema - check for it before parsing the DOM
  • Salary ranges often hidden behind "View salary" buttons
  • Location may include remote/hybrid indicators
  • Filters are URL-parameter based - useful for pagination
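
When filters live in the query string, page URLs can be generated by changing a single parameter and leaving the rest intact. A minimal sketch follows; the parameter name `page` is an assumption, so check the site's own URLs for the real one (it might be `p`, `start`, or `offset`).

```javascript
// Sketch: build page URLs by varying one query parameter while keeping
// the existing filters. `pageUrls` is a hypothetical helper.
function pageUrls(searchUrl, pages, param = 'page') {
  return pages.map(n => {
    const url = new URL(searchUrl);
    url.searchParams.set(param, String(n)); // replaces the value if already present
    return url.href;
  });
}

// Example with a made-up job board URL:
// pageUrls('https://jobs.example.com/search?q=engineer&remote=true', [1, 2, 3]);
```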

Anti-Patterns to Avoid

| Anti-Pattern | Why It Fails | Better Approach |
| --- | --- | --- |
| Selectors with generated hashes (.css-1a2b3c) | Change on every deploy | Use semantic selectors, ARIA roles, data attributes |
| Deeply nested paths (div > div > div > span) | Fragile on layout changes | Use closest meaningful class or attribute |
| Index-based (:nth-child(3)) for dynamic lists | Order may change | Use content-based identification |
| Selecting by inline styles | Presentation, not semantics | Use classes, IDs, or data attributes |
| Hardcoded wait times for JS content | Too short or too long | Check for content presence in a loop |
| Single selector for variant pages | Different pages differ | Test selector on multiple pages first |
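
The alternative to hardcoded waits, checking for content presence in a loop, can be sketched as a small polling helper. `waitFor` and its `timeoutMs` / `intervalMs` options are hypothetical names with arbitrary defaults.

```javascript
// Sketch: poll a condition until it holds or a timeout expires,
// instead of sleeping for a fixed time.
async function waitFor(check, { timeoutMs = 10000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = await check();
    if (result) return result;               // content is present
    if (Date.now() >= deadline) return null; // give up; the caller decides what to do
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}

// Browser usage: resolves with the element once it exists, or null on timeout
// const first = await waitFor(() => document.querySelector('.result-item'));
```

Returning `null` on timeout rather than throwing keeps the decision (retry, fall back to another selector, report an error) with the caller.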

Robust Selector Priority

Prefer selectors in this order (most stable to least):

  1. [data-testid="..."], [data-id="..."] - test/data attributes
  2. #unique-id - unique IDs
  3. [role="..."], [aria-label="..."] - ARIA attributes
  4. [itemprop="..."], [itemtype="..."] - microdata / schema.org
  5. .semantic-class - meaningful class names
  6. tag.class - element type + class
  7. Structural selectors - last resort
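
The ordering above can be approximated in code when several candidate selectors are available and the most stable one should be tried first. This is a rough sketch: the regular expressions are heuristics rather than a CSS parser, and `selectorPriority` and `sortByStability` are hypothetical names.

```javascript
// Sketch: rank a selector by the priority list above (1 = most stable).
const PRIORITY_RULES = [
  [1, /\[data-[\w-]+/],               // test/data attributes
  [2, /#[\w-]+/],                     // unique IDs
  [3, /\[(role|aria-[\w-]+)/],        // ARIA attributes
  [4, /\[(itemprop|itemtype)/],       // microdata / schema.org
  [5, /^\.[\w-]+$/],                  // a single meaningful class
  [6, /^[a-z][\w-]*\.[\w-]+$/i],      // element type + class
];

function selectorPriority(selector) {
  const s = selector.trim();
  for (const [rank, pattern] of PRIORITY_RULES) {
    if (pattern.test(s)) return rank;
  }
  return 7; // structural selectors: last resort
}

// Returns a new array, most stable selector first
function sortByStability(selectors) {
  return [...selectors].sort((a, b) => selectorPriority(a) - selectorPriority(b));
}

// sortByStability(['div > span', '.price', '[data-price]'])
```

A ranking like this does not detect generated class names such as .css-1a2b3c, which still need to be filtered out by eye or with a separate check.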