feat(engineering-team): add snowflake-development skill

Snowflake SQL, data pipelines (Dynamic Tables, Streams+Tasks), Cortex AI, Snowpark Python, dbt integration. Includes 3 practical workflows, 9 anti-patterns, cross-references, and troubleshooting guide.

- SKILL.md: 294 lines (colon-prefix rule, MERGE, DTs, Cortex AI, Snowpark)
- Script: snowflake_query_helper.py (MERGE, DT, RBAC generators)
- References: 3 files (SQL patterns, Cortex AI/agents, troubleshooting)

Based on PR #416 by James Cha-Earley, enhanced with practical workflows, anti-patterns section, cross-references, and normalized frontmatter.

Co-Authored-By: James Cha-Earley <jamescha-earley@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 09:38:57 +01:00
parent c6206efc49
commit 0e97512a42
8 changed files with 1557 additions and 2 deletions
--- a/docs/skills/engineering-team/index.md
+++ b/docs/skills/engineering-team/index.md
@@ -1,13 +1,13 @@
 ---
 title: "Engineering - Core Skills — Agent Skills & Codex Plugins"
-description: "44 engineering - core skills — engineering agent skill and Claude Code plugin for code generation, DevOps, architecture, and testing. Works with Claude Code, Codex CLI, Gemini CLI, and OpenClaw."
+description: "45 engineering - core skills — engineering agent skill and Claude Code plugin for code generation, DevOps, architecture, and testing. Works with Claude Code, Codex CLI, Gemini CLI, and OpenClaw."
 ---

 <div class="domain-header" markdown>

 # :material-code-braces: Engineering - Core

-<p class="domain-count">44 skills in this domain</p>
+<p class="domain-count">45 skills in this domain</p>

 </div>

@@ -179,6 +179,12 @@ description: "44 engineering - core skills — engineering agent skill and Claud

    Security engineering tools for threat modeling, vulnerability analysis, secure architecture design, and penetration t...

+-   **[Snowflake Development](snowflake-development.md)**
+
+    ---
+
+    Snowflake SQL, data pipelines, Cortex AI, and Snowpark Python development. Covers the colon-prefix rule, semi-structu...
+
 -   **[Stripe Integration Expert](stripe-integration-expert.md)**

    ---
--- a/docs/skills/engineering-team/snowflake-development.md
+++ b/docs/skills/engineering-team/snowflake-development.md
@@ -0,0 +1,305 @@
+---
+title: "Snowflake Development — Agent Skill & Codex Plugin"
+description: "Use when writing Snowflake SQL, building data pipelines with Dynamic Tables or Streams/Tasks, using Cortex AI functions, or creating Cortex Agents. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw."
+---
+
+# Snowflake Development
+
+<div class="page-meta" markdown>
+<span class="meta-badge">:material-code-braces: Engineering - Core</span>
+<span class="meta-badge">:material-identifier: `snowflake-development`</span>
+<span class="meta-badge">:material-github: <a href="https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/snowflake-development/SKILL.md">Source</a></span>
+</div>
+
+<div class="install-banner" markdown>
+<span class="install-label">Install:</span> <code>claude /plugin install engineering-skills</code>
+</div>
+
+
+Snowflake SQL, data pipelines, Cortex AI, and Snowpark Python development. Covers the colon-prefix rule, semi-structured data, MERGE upserts, Dynamic Tables, Streams+Tasks, Cortex AI functions, agent specs, performance tuning, and security hardening.
+
+> Originally contributed by [James Cha-Earley](https://github.com/jamescha-earley) — enhanced and integrated by the claude-skills team.
+
+## Quick Start
+
+```bash
+# Generate a MERGE upsert template
+python scripts/snowflake_query_helper.py merge --target customers --source staging_customers --key customer_id --columns name,email,updated_at
+
+# Generate a Dynamic Table template
+python scripts/snowflake_query_helper.py dynamic-table --name cleaned_events --warehouse transform_wh --lag "5 minutes"
+
+# Generate RBAC grant statements
+python scripts/snowflake_query_helper.py grant --role analyst_role --database analytics --schemas public,staging --privileges SELECT,USAGE
+```
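
The helper script's internals are not shown on this page. As a rough illustration of what a MERGE generator of this kind might do, here is a minimal sketch; `build_merge_sql` and its identifier check are hypothetical names, not the script's actual API.

```python
import re

# Hypothetical check: accept only plain unquoted identifiers, optionally dotted.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*)*$")


def build_merge_sql(target: str, source: str, key: str, columns: list[str]) -> str:
    """Return a MERGE upsert statement for the given tables and columns."""
    for name in [target, source, key, *columns]:
        if not _IDENT.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    # Matched rows: overwrite every non-key column from the source row.
    set_clause = ", ".join(f"t.{col} = s.{col}" for col in columns)
    # Unmatched rows: insert the key plus all listed columns.
    insert_cols = ", ".join([key, *columns])
    insert_vals = ", ".join(f"s.{col}" for col in [key, *columns])
    return (
        f"MERGE INTO {target} t USING {source} s ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )


sql = build_merge_sql(
    "customers", "staging_customers", "customer_id", ["name", "email", "updated_at"]
)
print(sql)
```

Rejecting anything that is not a plain identifier keeps the template from being used to splice arbitrary SQL into the statement; the real script may handle this differently.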
+
+---
+
+## SQL Best Practices
+
+### Naming and Style
+
+- Use `snake_case` for all identifiers. Avoid double-quoted identifiers -- they force case-sensitive names that require constant quoting.
+- Use CTEs (`WITH` clauses) over nested subqueries.
+- Use `CREATE OR REPLACE` for idempotent DDL.
+- Use explicit column lists -- never `SELECT *` in production. Snowflake's columnar storage scans only referenced columns, so explicit lists reduce I/O.
+
+### Stored Procedures -- Colon Prefix Rule
+
+In SQL stored procedures (BEGIN...END blocks), variables and parameters **must** use the colon `:` prefix inside SQL statements. Without it, Snowflake treats them as column identifiers and raises "invalid identifier" errors.
+
+```sql
+-- WRONG: missing colon prefix
+SELECT name INTO result FROM users WHERE id = p_id;
+
+-- CORRECT: colon prefix on both variable and parameter
+SELECT name INTO :result FROM users WHERE id = :p_id;
+```
+
+This applies to DECLARE variables, LET variables, and procedure parameters when used inside SELECT, INSERT, UPDATE, DELETE, or MERGE.
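
A quick way to catch this before deployment is a text check over the procedure body. The sketch below is a heuristic only: `find_unprefixed` is a hypothetical helper, it needs the variable names supplied by the caller, and it does not understand string literals or comments.

```python
import re


def find_unprefixed(sql: str, variables: list[str]) -> list[str]:
    """Return the variable names that appear in `sql` without a leading colon."""
    missing = []
    for name in variables:
        # Match the bare name when it is not preceded by ':', '.', or a word character.
        pattern = rf"(?<![:\w.]){re.escape(name)}\b"
        if re.search(pattern, sql, flags=re.IGNORECASE):
            missing.append(name)
    return missing


wrong = "SELECT name INTO result FROM users WHERE id = p_id;"
right = "SELECT name INTO :result FROM users WHERE id = :p_id;"

print(find_unprefixed(wrong, ["result", "p_id"]))  # ['result', 'p_id']
print(find_unprefixed(right, ["result", "p_id"]))  # []
```

A variable that shares its name with a real column will be reported as well, so treat the output as a list of places to review, not as definite errors.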
+
+### Semi-Structured Data
+
+- VARIANT, OBJECT, ARRAY for JSON/Avro/Parquet/ORC.
+- Access nested fields: `src:customer.name::STRING`. Always cast with `::TYPE`.
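+- VARIANT null vs SQL NULL: JSON `null` is stored as a VARIANT (JSON) null, which is distinct from SQL NULL and can surface as the string `"null"` when converted to text. Set `STRIP_NULL_VALUES = TRUE` in the JSON file format to drop null-valued fields on load, or apply `STRIP_NULL_VALUE()` in a query to turn a JSON null into SQL NULL.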
+- Flatten arrays: `SELECT f.value:name::STRING FROM my_table, LATERAL FLATTEN(input => src:items) f;`
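
To make the effect of null-stripping on load concrete, the following sketch approximates it client-side in plain Python. It is an illustration under the assumption that stripping removes null-valued object fields and array elements; `strip_nulls` is a hypothetical helper, and the exact Snowflake behavior should be confirmed against the file format documentation.

```python
import json


def strip_nulls(value):
    """Recursively drop object fields and array elements that are JSON null."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value if v is not None]
    return value


raw = json.loads('{"customer": {"name": "Ada", "phone": null}, "items": [1, null, 2]}')
cleaned = strip_nulls(raw)
print(json.dumps(cleaned))  # {"customer": {"name": "Ada"}, "items": [1, 2]}
```

After stripping, a missing field reads as absent instead of as a JSON null, which is what lets an `IS NULL` test behave the way most queries expect.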
+
+### MERGE for Upserts
+
+```sql
+MERGE INTO target t USING source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET t.name = s.name, t.updated_at = CURRENT_TIMESTAMP()
+WHEN NOT MATCHED THEN INSERT (id, name, updated_at) VALUES (s.id, s.name, CURRENT_TIMESTAMP());
+```
+
+> See `references/snowflake_sql_and_pipelines.md` for deeper SQL patterns and anti-patterns.
+
+---
+
+## Data Pipelines
+
+### Choosing Your Approach
+
+| Approach | When to Use |
+|----------|-------------|
+| Dynamic Tables | Declarative transformations. **Default choice.** Define the query, Snowflake handles refresh. |
+| Streams + Tasks | Imperative CDC. Use for procedural logic, stored procedure calls, complex branching. |
+| Snowpipe | Continuous file loading from cloud storage (S3, GCS, Azure). |
+
+### Dynamic Tables
+
+```sql
+CREATE OR REPLACE DYNAMIC TABLE cleaned_events
+    TARGET_LAG = '5 minutes'
+    WAREHOUSE = transform_wh
+    AS
+    SELECT event_id, event_type, user_id, event_timestamp
+    FROM raw_events
+    WHERE event_type IS NOT NULL;
+```
+
+Key rules:
+- Set `TARGET_LAG` progressively: tighter at the top of the DAG, looser downstream.
+- Incremental DTs cannot depend on Full-refresh DTs.
+- `SELECT *` breaks on upstream schema changes -- use explicit column lists.
+- Views cannot sit between two Dynamic Tables in the DAG.
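
The first rule can be checked mechanically once the DAG is written down. The sketch below is a hypothetical pre-deployment check: `lag_seconds` and `lag_violations` are illustrative names, the table names other than `cleaned_events` are made up, and only lags written as a number plus a unit are handled (not `DOWNSTREAM`).

```python
import re

_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def lag_seconds(lag: str) -> int:
    """Convert a lag such as '5 minutes' or '1 hour' to seconds."""
    match = re.fullmatch(
        r"\s*(\d+)\s*(second|minute|hour|day)s?\s*", lag, flags=re.IGNORECASE
    )
    if not match:
        raise ValueError(f"unsupported lag: {lag!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2).lower()]


def lag_violations(lags: dict[str, str], edges: list[tuple[str, str]]) -> list[str]:
    """Report (upstream, downstream) edges where the downstream lag is tighter."""
    problems = []
    for upstream, downstream in edges:
        if lag_seconds(lags[downstream]) < lag_seconds(lags[upstream]):
            problems.append(
                f"{downstream} ({lags[downstream]}) is tighter than "
                f"upstream {upstream} ({lags[upstream]})"
            )
    return problems


lags = {"cleaned_events": "5 minutes", "daily_metrics": "1 hour", "live_alerts": "1 minute"}
edges = [("cleaned_events", "daily_metrics"), ("cleaned_events", "live_alerts")]
for problem in lag_violations(lags, edges):
    print(problem)
```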
+
+### Streams and Tasks
+
+```sql
+CREATE OR REPLACE STREAM raw_stream ON TABLE raw_events;
+
+CREATE OR REPLACE TASK process_events
+    WAREHOUSE = transform_wh
+    SCHEDULE = 'USING CRON 0 */1 * * * America/Los_Angeles'
+    WHEN SYSTEM$STREAM_HAS_DATA('raw_stream')
+    AS INSERT INTO cleaned_events SELECT ... FROM raw_stream;
+
+-- Tasks start SUSPENDED. You MUST resume them.
+ALTER TASK process_events RESUME;
+```
+
+> See `references/snowflake_sql_and_pipelines.md` for DT debugging queries and Snowpipe patterns.
+
+---
+
+## Cortex AI
+
+### Function Reference
+
+| Function | Purpose |
+|----------|---------|
+| `AI_COMPLETE` | LLM completion (text, images, documents) |
+| `AI_CLASSIFY` | Classify text into categories (up to 500 labels) |
+| `AI_FILTER` | Boolean filter on text or images |
+| `AI_EXTRACT` | Structured extraction from text/images/documents |
+| `AI_SENTIMENT` | Sentiment score (-1 to 1) |
+| `AI_PARSE_DOCUMENT` | OCR or layout extraction from documents |
+| `AI_REDACT` | PII removal from text |
+
+**Deprecated names (do NOT use):** `COMPLETE`, `CLASSIFY_TEXT`, `EXTRACT_ANSWER`, `PARSE_DOCUMENT`, `SUMMARIZE`, `TRANSLATE`, `SENTIMENT`, `EMBED_TEXT_768`.
+
+### TO_FILE -- Common Pitfall
+
+Stage path and filename are **separate** arguments:
+
+```sql
+-- WRONG: single combined argument
+TO_FILE('@stage/file.pdf')
+
+-- CORRECT: two arguments
+TO_FILE('@db.schema.mystage', 'invoice.pdf')
+```
+
+### Cortex Agents
+
+Agent specs use a JSON structure with top-level keys: `models`, `instructions`, `tools`, `tool_resources`.
+
+- Use `$spec$` delimiter (not `$$`).
+- `models` must be an object, not an array.
+- `tool_resources` is a separate top-level key, not nested inside `tools`.
+- Tool descriptions are the single biggest factor in agent quality.
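
The two structural rules above are easy to lint before submitting a spec. This is a minimal sketch that checks only those rules; `check_agent_spec` is a hypothetical helper and the spec contents are placeholder values, not a real agent definition.

```python
def check_agent_spec(spec: dict) -> list[str]:
    """Return a list of rule violations for an agent spec given as a dict."""
    problems = []
    # Rule: `models` must be an object (dict), not an array (list).
    if "models" in spec and not isinstance(spec["models"], dict):
        problems.append("models must be an object, not an array")
    # Rule: `tool_resources` belongs at the top level, never inside a tool entry.
    for tool in spec.get("tools") or []:
        if isinstance(tool, dict) and "tool_resources" in tool:
            problems.append("tool_resources must be a top-level key, not nested inside tools")
    return problems


# Placeholder values only; the real key contents depend on the agent being built.
good_spec = {
    "models": {"example_key": "example_value"},
    "instructions": {"example_key": "example_value"},
    "tools": [{"example_tool_field": "example_value"}],
    "tool_resources": {"example_tool": {}},
}
bad_spec = {
    "models": ["example_value"],
    "tools": [{"example_tool_field": "example_value", "tool_resources": {}}],
}

print(check_agent_spec(good_spec))
print(check_agent_spec(bad_spec))
```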
+
+> See `references/cortex_ai_and_agents.md` for full agent spec examples and Cortex Search patterns.
+
+---
+
+## Snowpark Python
+
+```python
+from snowflake.snowpark import Session
+import os
+
+session = Session.builder.configs({
+    "account": os.environ["SNOWFLAKE_ACCOUNT"],
+    "user": os.environ["SNOWFLAKE_USER"],
+    "password": os.environ["SNOWFLAKE_PASSWORD"],
+    "role": "my_role", "warehouse": "my_wh",
+    "database": "my_db", "schema": "my_schema"
+}).create()
+```
+
+- Never hardcode credentials. Use environment variables or key pair auth.
+- DataFrames are lazy -- executed on `collect()` / `show()`.
+- Do NOT call `collect()` on large DataFrames. Process server-side with DataFrame operations.
+- Use **vectorized UDFs** (10-100x faster) for batch and ML workloads.
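
One way to keep credentials out of code while still failing with a clear message is to validate the environment before building the session. The sketch below uses only the standard library; `connection_config` is a hypothetical helper, and the variable names are the ones used in the session example above.

```python
import os

REQUIRED_VARS = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")


def connection_config(env=None) -> dict:
    """Build connection settings from environment variables, failing early if any are unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "account": env["SNOWFLAKE_ACCOUNT"],
        "user": env["SNOWFLAKE_USER"],
        "password": env["SNOWFLAKE_PASSWORD"],
    }


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {
    "SNOWFLAKE_ACCOUNT": "example_account",
    "SNOWFLAKE_USER": "example_user",
    "SNOWFLAKE_PASSWORD": "example_password",
}
config = connection_config(demo_env)
print(sorted(config))  # ['account', 'password', 'user']
```

The returned dict can be merged with role, warehouse, database, and schema settings and passed to `Session.builder.configs(...)` as in the example above.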
+
+## dbt on Snowflake
+
+```sql
+-- Dynamic table materialization (streaming/near-real-time marts):
+{{ config(materialized='dynamic_table', snowflake_warehouse='transforming', target_lag='1 hour') }}
+
+-- Incremental materialization (large fact tables):
+{{ config(materialized='incremental', unique_key='event_id') }}
+
+-- Snowflake-specific configs (combine with any materialization):
+{{ config(transient=true, copy_grants=true, query_tag='team_daily') }}
+```
+
+- Do NOT use `{{ this }}` without `{% if is_incremental() %}` guard.
+- Use `dynamic_table` materialization for streaming or near-real-time marts.
+
+## Performance
+
+- **Cluster keys**: Only for multi-TB tables. Apply on WHERE / JOIN / GROUP BY columns.
+- **Search Optimization**: `ALTER TABLE t ADD SEARCH OPTIMIZATION ON EQUALITY(col);`
+- **Warehouse sizing**: Start X-Small, scale up. Set `AUTO_SUSPEND = 60`, `AUTO_RESUME = TRUE`.
+- **Separate warehouses** per workload (load, transform, query).
+
+## Security
+
+- Follow least-privilege RBAC. Use database roles for object-level grants.
+- Audit ACCOUNTADMIN regularly: `SHOW GRANTS OF ROLE ACCOUNTADMIN;`
+- Use network policies for IP allowlisting.
+- Use masking policies for PII columns and row access policies for multi-tenant isolation.
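
As an illustration of least-privilege grants for a read-only role, the sketch below generates the usual chain of statements: usage on the database, usage on each schema, then table privileges. `build_grants` is a hypothetical helper and is not the output format of `snowflake_query_helper.py`.

```python
def build_grants(
    role: str, database: str, schemas: list[str], table_privileges: list[str]
) -> list[str]:
    """Return GRANT statements giving `role` the listed table privileges in each schema."""
    statements = [f"GRANT USAGE ON DATABASE {database} TO ROLE {role};"]
    privileges = ", ".join(table_privileges)
    for schema in schemas:
        qualified = f"{database}.{schema}"
        # The role needs USAGE on the schema before table privileges are usable.
        statements.append(f"GRANT USAGE ON SCHEMA {qualified} TO ROLE {role};")
        statements.append(
            f"GRANT {privileges} ON ALL TABLES IN SCHEMA {qualified} TO ROLE {role};"
        )
    return statements


grants = build_grants("analyst_role", "analytics", ["public", "staging"], ["SELECT"])
for statement in grants:
    print(statement)
```

Review generated statements before running them; this sketch covers existing tables only and does not add future grants.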
+
+---
+
+## Proactive Triggers
+
+Surface these issues without being asked when you notice them in context:
+
+- **Missing colon prefix** in SQL stored procedures -- flag immediately, this causes "invalid identifier" at runtime.
+- **`SELECT *` in Dynamic Tables** -- flag as a schema-change time bomb.
+- **Deprecated Cortex function names** (`CLASSIFY_TEXT`, `SUMMARIZE`, etc.) -- suggest the current `AI_*` equivalents.
+- **Task not resumed** after creation -- remind that tasks start SUSPENDED.
+- **Hardcoded credentials** in Snowpark code -- flag as a security risk.
+
+---
+
+## Common Errors
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| "Object does not exist" | Wrong database/schema context or missing grants | Fully qualify names (`db.schema.table`), check grants |
+| "Invalid identifier" in procedure | Missing colon prefix on variable | Use `:variable_name` inside SQL statements |
+| "Numeric value not recognized" | VARIANT field not cast | Cast explicitly: `src:field::NUMBER(10,2)` |
+| Task not running | Forgot to resume after creation | `ALTER TASK task_name RESUME;` |
+| DT refresh failing | Schema change upstream or tracking disabled | Use explicit columns, verify change tracking |
+| TO_FILE error | Combined path as single argument | Split into two args: `TO_FILE('@stage', 'file.pdf')` |
+
+---
+
+## Practical Workflows
+
+### Workflow 1: Build a Reporting Pipeline (30 min)
+
+1. **Stage raw data**: Create external stage pointing to S3/GCS/Azure, set up Snowpipe for auto-ingest
+2. **Clean with Dynamic Table**: Create DT with `TARGET_LAG = '5 minutes'` that filters nulls, casts types, deduplicates
+3. **Aggregate with downstream DT**: Second DT that joins cleaned data with dimension tables, computes metrics
+4. **Expose via Secure View**: Create `SECURE VIEW` for the BI tool / API layer
+5. **Grant access**: Use `snowflake_query_helper.py grant` to generate RBAC statements
+
+### Workflow 2: Add AI Classification to Existing Data
+
+1. **Identify the column**: Find the text column to classify (e.g., support tickets, reviews)
+2. **Test with AI_CLASSIFY**: `SELECT AI_CLASSIFY(text_col, ['bug', 'feature', 'question']) FROM my_table LIMIT 10;`
+3. **Create enrichment DT**: Dynamic Table that runs `AI_CLASSIFY` on new rows automatically
+4. **Monitor costs**: Cortex AI is billed per token — sample before running on full tables
+
+### Workflow 3: Debug a Failing Pipeline
+
+1. **Check task history**: `SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY()) WHERE STATE = 'FAILED' ORDER BY SCHEDULED_TIME DESC;`
+2. **Check DT refresh**: `SELECT * FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(NAME => 'my_dt')) ORDER BY REFRESH_END_TIME DESC;`
+3. **Check stream staleness**: `SHOW STREAMS; -- check stale_after column`
+4. **Consult troubleshooting reference**: See `references/troubleshooting.md` for error-specific fixes
+
+---
+
+## Anti-Patterns
+
+| Anti-Pattern | Why It Fails | Better Approach |
+|---|---|---|
+| `SELECT *` in Dynamic Tables | Schema changes upstream break the DT silently | Use explicit column lists |
+| Missing colon prefix in procedures | "Invalid identifier" runtime error | Always use `:variable_name` in SQL blocks |
+| Single warehouse for all workloads | Contention between load, transform, and query | Separate warehouses per workload type |
+| Hardcoded credentials in Snowpark | Security risk, breaks in CI/CD | Use `os.environ[]` or key pair auth |
+| `collect()` on large DataFrames | Pulls entire result set to client memory | Process server-side with DataFrame operations |
+| Nested subqueries instead of CTEs | Hard to read, hard to debug, and hard to reuse intermediate results | Use `WITH` clauses |
+| Using deprecated Cortex functions | `CLASSIFY_TEXT`, `SUMMARIZE` etc. will be removed | Use `AI_CLASSIFY`, `AI_COMPLETE` etc. |
+| Tasks without `WHEN SYSTEM$STREAM_HAS_DATA` | Task runs on schedule even with no new data, wasting credits | Add the WHEN clause for stream-driven tasks |
+| Double-quoted identifiers | Forces case-sensitive names across all queries | Use `snake_case` unquoted identifiers |
+
+---
+
+## Cross-References
+
+| Skill | Relationship |
+|-------|-------------|
+| `engineering/sql-database-assistant` | General SQL patterns — use for non-Snowflake databases |
+| `engineering/database-designer` | Schema design — use for data modeling before Snowflake implementation |
+| `engineering-team/senior-data-engineer` | Broader data engineering — pipelines, Spark, Airflow, data quality |
+| `engineering-team/senior-data-scientist` | Analytics and ML — use alongside Snowpark for feature engineering |
+| `engineering-team/senior-devops` | CI/CD for Snowflake deployments (Terraform, GitHub Actions) |
+
+---
+
+## Reference Documentation
+
+| Document | Contents |
+|----------|----------|
+| `references/snowflake_sql_and_pipelines.md` | SQL patterns, MERGE templates, Dynamic Table debugging, Snowpipe, anti-patterns |
+| `references/cortex_ai_and_agents.md` | Cortex AI functions, agent spec structure, Cortex Search, Snowpark |
+| `references/troubleshooting.md` | Error reference, debugging queries, common fixes |