# Schemas and Output Configuration

## Input Schema

Map your application's inputs to `.actor/input_schema.json`. Validate against the JSON Schema from the `@apify/json_schemas` npm package (`input.schema.json`).

```json
{
    "title": "My Actor Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrl": {
            "title": "Start URL",
            "type": "string",
            "description": "The URL to start processing from",
            "editor": "textfield",
            "prefill": "https://example.com"
        },
        "maxItems": {
            "title": "Max Items",
            "type": "integer",
            "description": "Maximum number of items to process",
            "default": 100,
            "minimum": 1
        }
    },
    "required": ["startUrl"]
}
```

### Mapping Guidelines

- Command-line arguments → input schema properties
- Environment variables → input schema or Actor env vars in actor.json
- Config files → input schema with object/array types
- Flatten deeply nested structures for better UX

## Output Schema

Define output structure in `.actor/output_schema.json`. Validate against the JSON Schema from the `@apify/json_schemas` npm package (`output.schema.json`).

### For Table-Like Data (Multiple Items)

- Use `Actor.pushData()` (JS) or `Actor.push_data()` (Python)
- Each item becomes a row in the dataset

### For Single Files or Blobs

- Use key-value store: `Actor.setValue()` / `Actor.set_value()`
- Get the public URL and include it in the dataset:

```javascript
// Store file with public access
await Actor.setValue('report.pdf', pdfBuffer, { contentType: 'application/pdf' });

// Get the public URL
const storeInfo = await Actor.openKeyValueStore();
const publicUrl = `https://api.apify.com/v2/key-value-stores/${storeInfo.id}/records/report.pdf`;

// Include URL in dataset output
await Actor.pushData({ reportUrl: publicUrl });
```

### For Multiple Files with a Common Prefix (Collections)

```javascript
// Store multiple files with a prefix
for (const [name, data] of files) {
    await Actor.setValue(`screenshots/${name}`, data, { contentType: 'image/png' });
}
// Files are accessible at: .../records/screenshots%2F{name}
```

## Actor Configuration (actor.json)

Configure `.actor/actor.json`. Validate against the JSON Schema from the `@apify/json_schemas` npm package (`actor.schema.json`).

```json
{
    "actorSpecification": 1,
    "name": "my-actor",
    "title": "My Actor",
    "description": "Brief description of what the actor does",
    "version": "1.0.0",
    "meta": {
        "templateId": "ts_empty",
        "generatedBy": "Claude Code with Claude Opus 4.5"
    },
    "input": "./input_schema.json",
    "dockerfile": "../Dockerfile"
}
```

**Important:** Fill in the `generatedBy` property with the tool/model used.
## State Management
|
|
|
|
### Request Queue - For Pausable Task Processing
|
|
|
|
The request queue works for any task processing, not just web scraping. Use a dummy URL with custom `uniqueKey` and `userData` for non-URL tasks:
|
|
|
|
```javascript
|
|
const requestQueue = await Actor.openRequestQueue();
|
|
|
|
// Add tasks to the queue (works for any processing, not just URLs)
|
|
await requestQueue.addRequest({
|
|
url: 'https://placeholder.local', // Dummy URL for non-scraping tasks
|
|
uniqueKey: `task-${taskId}`, // Unique identifier for deduplication
|
|
userData: { itemId: 123, action: 'process' }, // Your custom task data
|
|
});
|
|
|
|
// Process tasks from the queue (with Crawlee)
|
|
const crawler = new BasicCrawler({
|
|
requestQueue,
|
|
requestHandler: async ({ request }) => {
|
|
const { itemId, action } = request.userData;
|
|
// Process your task using userData
|
|
await processTask(itemId, action);
|
|
},
|
|
});
|
|
await crawler.run();
|
|
|
|
// Or manually consume without Crawlee:
|
|
let request;
|
|
while ((request = await requestQueue.fetchNextRequest())) {
|
|
await processTask(request.userData);
|
|
await requestQueue.markRequestHandled(request);
|
|
}
|
|
```

### Key-Value Store - For Checkpoint State

```javascript
// Save state
await Actor.setValue('STATE', { processedCount: 100 });

// Restore state on restart
const state = await Actor.getValue('STATE') || { processedCount: 0 };
```