# Dataset Schema Reference

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab of the Apify Console.

## Examples

### JavaScript and TypeScript

Consider an example Actor that calls `Actor.pushData()` to store data in the dataset:

```javascript
import { Actor } from 'apify';
// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code
 */
await Actor.pushData({
    numericField: 10,
    pictureUrl: 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
    linkUrl: 'https://google.com',
    textField: 'Google',
    booleanField: true,
    dateField: new Date(),
    arrayField: ['#hello', '#world'],
    objectField: {},
});

// Exit successfully
await Actor.exit();
```

### Python

Consider an example Actor that calls `Actor.push_data()` to store data in the dataset:

```python
# Dataset push example (Python)
import asyncio
from datetime import datetime
from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.push_data({
        'numericField': 10,
        'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
        'linkUrl': 'https://google.com',
        'textField': 'Google',
        'booleanField': True,
        'dateField': datetime.now().isoformat(),
        'arrayField': ['#hello', '#world'],
        'objectField': {},
    })

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```

## Configuration

To configure the Output tab UI for your Actor, reference a dataset schema file in `.actor/actor.json`:

```json
{
    "actorSpecification": 1,
    "name": "book-library-scraper",
    "title": "Book Library Scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
```

Then create the dataset schema in `.actor/dataset_schema.json`:

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": [
                    "pictureUrl",
                    "linkUrl",
                    "textField",
                    "booleanField",
                    "arrayField",
                    "objectField",
                    "dateField",
                    "numericField"
                ]
            },
            "display": {
                "component": "table",
                "properties": {
                    "pictureUrl": {
                        "label": "Image",
                        "format": "image"
                    },
                    "linkUrl": {
                        "label": "Link",
                        "format": "link"
                    },
                    "textField": {
                        "label": "Text",
                        "format": "text"
                    },
                    "booleanField": {
                        "label": "Boolean",
                        "format": "boolean"
                    },
                    "arrayField": {
                        "label": "Array",
                        "format": "array"
                    },
                    "objectField": {
                        "label": "Object",
                        "format": "object"
                    },
                    "dateField": {
                        "label": "Date",
                        "format": "date"
                    },
                    "numericField": {
                        "label": "Number",
                        "format": "number"
                    }
                }
            }
        }
    }
}
```
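
For schemas with many columns, writing the JSON by hand gets repetitive. The following is a minimal sketch, not part of the Apify SDK, that builds a single-view schema from a list of columns. The helper name `build_dataset_schema` and its parameters are hypothetical and chosen only for illustration; the output mirrors the structure of the example schema.

```python
import json

# Display formats listed in this reference.
ALLOWED_FORMATS = {"text", "number", "date", "link", "boolean", "image", "array", "object"}


def build_dataset_schema(columns, view_name="overview", title="Overview"):
    """Build a single-view dataset schema (hypothetical helper).

    `columns` is a list of (field_name, label, format) tuples; their order
    becomes the order of `transformation.fields`.
    """
    properties = {}
    for name, label, fmt in columns:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"Unsupported format {fmt!r} for field {name!r}")
        properties[name] = {"label": label, "format": fmt}

    return {
        "actorSpecification": 1,
        "fields": {},
        "views": {
            view_name: {
                "title": title,
                "transformation": {"fields": [name for name, _, _ in columns]},
                "display": {"component": "table", "properties": properties},
            }
        },
    }


columns = [
    ("pictureUrl", "Image", "image"),
    ("linkUrl", "Link", "link"),
    ("textField", "Text", "text"),
    ("numericField", "Number", "number"),
]
schema = build_dataset_schema(columns)

# Print the JSON that would go into .actor/dataset_schema.json
print(json.dumps(schema, indent=4))
```

Rejecting unknown formats at build time is a design choice of this sketch, so that typos surface before the schema is deployed.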

## Structure

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "<VIEW_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "transformation": {
                "fields": ["string (required)"],
                "unwind": ["string (optional)"],
                "flatten": ["string (optional)"],
                "omit": ["string (optional)"],
                "limit": "integer (optional)",
                "desc": "boolean (optional)"
            },
            "display": {
                "component": "table (required)",
                "properties": {
                    "<FIELD_NAME>": {
                        "label": "string (optional)",
                        "format": "text|number|date|link|boolean|image|array|object (optional)"
                    }
                }
            }
        }
    }
}
```

## Properties

### Dataset Schema Properties

- `actorSpecification` (integer, required) - Version of the dataset schema structure document (currently only version 1 is available)
- `fields` (JSON Schema object, required) - Schema of a single dataset item (use JSON Schema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object describing the API and UI views

### DatasetView Properties

- `title` (string, required) - Visible in the UI Output tab and in the API
- `description` (string, optional) - Only available in the API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading data from the Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

### ViewTransformation Properties

- `fields` (string[], required) - Fields to present in the output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into the parent object
- `flatten` (string[], optional) - Transforms a nested object into a flat structure
- `omit` (string[], optional) - Removes the specified fields from the output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (`true` = newest first)

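
To make the transformation options more concrete, here is a rough local sketch of how `fields`, `flatten`, `omit`, `limit`, and `desc` might combine. It is an illustration only, not the platform's implementation: the function names are hypothetical, the order in which the options are applied is an assumption, `flatten` is assumed to produce dot-notation keys one level deep, items are assumed to be stored oldest first, and `unwind` is left out.

```python
def flatten_field(item, field):
    """Replace a nested dict under `field` with dot-notation keys (assumed behaviour)."""
    value = item.get(field)
    if not isinstance(value, dict):
        return dict(item)
    flat = {key: val for key, val in item.items() if key != field}
    for key, nested in value.items():
        flat[f"{field}.{key}"] = nested
    return flat


def apply_view(items, fields, flatten=(), omit=(), limit=None, desc=False):
    """Apply a simplified view transformation to a list of dataset items."""
    # Assumption: items arrive oldest first, so desc=True means reversed order.
    rows = list(reversed(items)) if desc else list(items)
    if limit is not None:
        rows = rows[:limit]

    result = []
    for row in rows:
        for field in flatten:
            row = flatten_field(row, field)
        # Drop omitted fields, then keep only the requested fields, in order.
        row = {key: val for key, val in row.items() if key not in omit}
        result.append({name: row[name] for name in fields if name in row})
    return result


items = [
    {"textField": "first", "objectField": {"a": 1}, "numericField": 10},
    {"textField": "second", "objectField": {"a": 2}, "numericField": 20},
]

rows = apply_view(
    items,
    fields=["textField", "objectField.a", "numericField"],
    flatten=["objectField"],
    omit=["numericField"],
    limit=1,
    desc=True,
)
print(rows)  # [{'textField': 'second', 'objectField.a': 2}]
```

In this sketch `omit` wins over `fields`, which is why `numericField` is absent from the result even though it is requested.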

### ViewDisplay Properties

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values

### ViewDisplayProperty Properties

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`
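
The required and optional properties listed above can be checked locally before deploying. The following is a minimal sketch of such a check, assuming only the rules stated in this reference; `validate_dataset_schema` is a hypothetical helper, not an Apify API, and it does not validate the JSON Schema inside `fields`.

```python
ALLOWED_FORMATS = {"text", "number", "date", "link", "boolean", "image", "array", "object"}


def validate_dataset_schema(schema):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if schema.get("actorSpecification") != 1:
        problems.append("actorSpecification must be 1")
    if not isinstance(schema.get("fields"), dict):
        problems.append("fields must be an object")

    views = schema.get("views")
    if not isinstance(views, dict):
        problems.append("views must be an object")
        return problems

    for name, view in views.items():
        if not isinstance(view, dict):
            problems.append(f"{name}: view must be an object")
            continue
        if not isinstance(view.get("title"), str):
            problems.append(f"{name}: title is required")
        transformation = view.get("transformation") or {}
        if not isinstance(transformation.get("fields"), list):
            problems.append(f"{name}: transformation.fields is required")
        display = view.get("display") or {}
        if display.get("component") != "table":
            problems.append(f"{name}: display.component must be 'table'")
        for field, prop in (display.get("properties") or {}).items():
            fmt = prop.get("format") if isinstance(prop, dict) else None
            if fmt is not None and fmt not in ALLOWED_FORMATS:
                problems.append(f"{name}: unknown format {fmt!r} for field {field!r}")
    return problems


# Example: "url" is not one of the listed formats, so it is reported.
schema = {
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {"fields": ["linkUrl"]},
            "display": {
                "component": "table",
                "properties": {"linkUrl": {"label": "Link", "format": "url"}},
            },
        }
    },
}
problems = validate_dataset_schema(schema)
print(problems)
```

Collecting all problems in a list, rather than raising on the first one, makes it easier to fix a schema in a single pass.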