feat(engineering-team): add snowflake-development skill
Snowflake SQL, data pipelines (Dynamic Tables, Streams+Tasks), Cortex AI, Snowpark Python, dbt integration. Includes 3 practical workflows, 9 anti-patterns, cross-references, and troubleshooting guide.

- SKILL.md: 294 lines (colon-prefix rule, MERGE, DTs, Cortex AI, Snowpark)
- Script: snowflake_query_helper.py (MERGE, DT, RBAC generators)
- References: 3 files (SQL patterns, Cortex AI/agents, troubleshooting)

Based on PR #416 by James Cha-Earley — enhanced with practical workflows, anti-patterns section, cross-references, and normalized frontmatter.

Co-Authored-By: James Cha-Earley <jamescha-earley@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,13 +1,13 @@
---
title: "Engineering - Core Skills — Agent Skills & Codex Plugins"
-description: "44 engineering - core skills — engineering agent skill and Claude Code plugin for code generation, DevOps, architecture, and testing. Works with Claude Code, Codex CLI, Gemini CLI, and OpenClaw."
+description: "45 engineering - core skills — engineering agent skill and Claude Code plugin for code generation, DevOps, architecture, and testing. Works with Claude Code, Codex CLI, Gemini CLI, and OpenClaw."
---

<div class="domain-header" markdown>

# :material-code-braces: Engineering - Core

-<p class="domain-count">44 skills in this domain</p>
+<p class="domain-count">45 skills in this domain</p>

</div>
@@ -179,6 +179,12 @@ description: "44 engineering - core skills — engineering agent skill and Claud

Security engineering tools for threat modeling, vulnerability analysis, secure architecture design, and penetration t...

- **[Snowflake Development](snowflake-development.md)**

---

Snowflake SQL, data pipelines, Cortex AI, and Snowpark Python development. Covers the colon-prefix rule, semi-structu...

- **[Stripe Integration Expert](stripe-integration-expert.md)**

---
305 docs/skills/engineering-team/snowflake-development.md Normal file
@@ -0,0 +1,305 @@
---
title: "Snowflake Development — Agent Skill & Codex Plugin"
description: "Use when writing Snowflake SQL, building data pipelines with Dynamic Tables or Streams/Tasks, using Cortex AI functions, creating Cortex Agents. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw."
---

# Snowflake Development

<div class="page-meta" markdown>
<span class="meta-badge">:material-code-braces: Engineering - Core</span>
<span class="meta-badge">:material-identifier: `snowflake-development`</span>
<span class="meta-badge">:material-github: <a href="https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/snowflake-development/SKILL.md">Source</a></span>
</div>

<div class="install-banner" markdown>
<span class="install-label">Install:</span> <code>claude /plugin install engineering-skills</code>
</div>

Snowflake SQL, data pipelines, Cortex AI, and Snowpark Python development. Covers the colon-prefix rule, semi-structured data, MERGE upserts, Dynamic Tables, Streams+Tasks, Cortex AI functions, agent specs, performance tuning, and security hardening.

> Originally contributed by [James Cha-Earley](https://github.com/jamescha-earley) — enhanced and integrated by the claude-skills team.

## Quick Start

```bash
# Generate a MERGE upsert template
python scripts/snowflake_query_helper.py merge --target customers --source staging_customers --key customer_id --columns name,email,updated_at

# Generate a Dynamic Table template
python scripts/snowflake_query_helper.py dynamic-table --name cleaned_events --warehouse transform_wh --lag "5 minutes"

# Generate RBAC grant statements
python scripts/snowflake_query_helper.py grant --role analyst_role --database analytics --schemas public,staging --privileges SELECT,USAGE
```

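For context, a minimal standalone sketch of the kind of MERGE template the `merge` subcommand can emit. The function name and exact output shape are illustrative, not the bundled script:

```python
def render_merge(target: str, source: str, key: str, columns: list[str]) -> str:
    """Render a MERGE upsert statement for the given target/source tables."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    insert_cols = ", ".join([key] + columns)
    insert_vals = ", ".join(f"s.{c}" for c in [key] + columns)
    return (
        f"MERGE INTO {target} t USING {source} s ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )

print(render_merge("customers", "staging_customers", "customer_id", ["name", "email"]))
```

Generating statements this way keeps column lists explicit, in line with the `SELECT *` guidance below.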
---

## SQL Best Practices

### Naming and Style

- Use `snake_case` for all identifiers. Avoid double-quoted identifiers -- they force case-sensitive names that require constant quoting.
- Use CTEs (`WITH` clauses) over nested subqueries.
- Use `CREATE OR REPLACE` for idempotent DDL.
- Use explicit column lists -- never `SELECT *` in production. Snowflake's columnar storage scans only referenced columns, so explicit lists reduce I/O.

### Stored Procedures -- Colon Prefix Rule

In SQL stored procedures (BEGIN...END blocks), variables and parameters **must** use the colon `:` prefix inside SQL statements. Without it, Snowflake treats them as column identifiers and raises "invalid identifier" errors.

```sql
-- WRONG: missing colon prefix
SELECT name INTO result FROM users WHERE id = p_id;

-- CORRECT: colon prefix on both variable and parameter
SELECT name INTO :result FROM users WHERE id = :p_id;
```

This applies to DECLARE variables, LET variables, and procedure parameters when used inside SELECT, INSERT, UPDATE, DELETE, or MERGE.

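Putting the rule together in a complete, illustrative procedure (table and column names are placeholders):

```sql
CREATE OR REPLACE PROCEDURE get_user_name(p_id NUMBER)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    result STRING;
BEGIN
    -- Colon prefix required inside the SQL statement:
    SELECT name INTO :result FROM users WHERE id = :p_id;
    -- ...but not in Snowflake Scripting statements such as RETURN:
    RETURN result;
END;
$$;
```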
### Semi-Structured Data

- VARIANT, OBJECT, ARRAY for JSON/Avro/Parquet/ORC.
- Access nested fields: `src:customer.name::STRING`. Always cast with `::TYPE`.
- VARIANT null vs SQL NULL: JSON `null` is stored as a VARIANT null, distinct from SQL `NULL`. Use `STRIP_NULL_VALUES = TRUE` in the file format on load.
- Flatten arrays: `SELECT f.value:name::STRING FROM my_table, LATERAL FLATTEN(input => src:items) f;`

### MERGE for Upserts

```sql
MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (id, name, updated_at) VALUES (s.id, s.name, CURRENT_TIMESTAMP());
```

> See `references/snowflake_sql_and_pipelines.md` for deeper SQL patterns and anti-patterns.

---

## Data Pipelines

### Choosing Your Approach

| Approach | When to Use |
|----------|-------------|
| Dynamic Tables | Declarative transformations. **Default choice.** Define the query, Snowflake handles refresh. |
| Streams + Tasks | Imperative CDC. Use for procedural logic, stored procedure calls, complex branching. |
| Snowpipe | Continuous file loading from cloud storage (S3, GCS, Azure). |

### Dynamic Tables

```sql
CREATE OR REPLACE DYNAMIC TABLE cleaned_events
TARGET_LAG = '5 minutes'
WAREHOUSE = transform_wh
AS
SELECT event_id, event_type, user_id, event_timestamp
FROM raw_events
WHERE event_type IS NOT NULL;
```

Key rules:
- Set `TARGET_LAG` progressively: tighter at the top of the DAG, looser downstream.
- Incremental DTs cannot depend on full-refresh DTs.
- `SELECT *` breaks on upstream schema changes -- use explicit column lists.
- Views cannot sit between two Dynamic Tables in the DAG.

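For intermediate DTs that only feed other DTs, Snowflake also accepts `TARGET_LAG = 'DOWNSTREAM'`, which lets consumers drive the refresh schedule. A sketch (table and column names are illustrative):

```sql
CREATE OR REPLACE DYNAMIC TABLE enriched_events
TARGET_LAG = 'DOWNSTREAM'
WAREHOUSE = transform_wh
AS
SELECT e.event_id, e.user_id, u.plan
FROM cleaned_events e
JOIN users u ON u.user_id = e.user_id;
```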
### Streams and Tasks

```sql
CREATE OR REPLACE STREAM raw_stream ON TABLE raw_events;

CREATE OR REPLACE TASK process_events
WAREHOUSE = transform_wh
SCHEDULE = 'USING CRON 0 */1 * * * America/Los_Angeles'
WHEN SYSTEM$STREAM_HAS_DATA('raw_stream')
AS INSERT INTO cleaned_events SELECT ... FROM raw_stream;

-- Tasks start SUSPENDED. You MUST resume them.
ALTER TASK process_events RESUME;
```

> See `references/snowflake_sql_and_pipelines.md` for DT debugging queries and Snowpipe patterns.

---

## Cortex AI

### Function Reference

| Function | Purpose |
|----------|---------|
| `AI_COMPLETE` | LLM completion (text, images, documents) |
| `AI_CLASSIFY` | Classify text into categories (up to 500 labels) |
| `AI_FILTER` | Boolean filter on text or images |
| `AI_EXTRACT` | Structured extraction from text/images/documents |
| `AI_SENTIMENT` | Sentiment score (-1 to 1) |
| `AI_PARSE_DOCUMENT` | OCR or layout extraction from documents |
| `AI_REDACT` | PII removal from text |

**Deprecated names (do NOT use):** `COMPLETE`, `CLASSIFY_TEXT`, `EXTRACT_ANSWER`, `PARSE_DOCUMENT`, `SUMMARIZE`, `TRANSLATE`, `SENTIMENT`, `EMBED_TEXT_768`.

### TO_FILE -- Common Pitfall

Stage path and filename are **separate** arguments:

```sql
-- WRONG: single combined argument
TO_FILE('@stage/file.pdf')

-- CORRECT: two arguments
TO_FILE('@db.schema.mystage', 'invoice.pdf')
```

### Cortex Agents

Agent specs use a JSON structure with top-level keys: `models`, `instructions`, `tools`, `tool_resources`.

- Use `$spec$` delimiter (not `$$`).
- `models` must be an object, not an array.
- `tool_resources` is a separate top-level key, not nested inside `tools`.
- Tool descriptions are the single biggest factor in agent quality.

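A hedged sketch of the overall shape those rules imply -- only the four top-level keys come from the list above; every value here is a placeholder, not real spec content:

```json
{
  "models": { "orchestration": "auto" },
  "instructions": { "response": "Answer questions about the sales data concisely." },
  "tools": [],
  "tool_resources": {}
}
```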
> See `references/cortex_ai_and_agents.md` for full agent spec examples and Cortex Search patterns.

---

## Snowpark Python

```python
from snowflake.snowpark import Session
import os

session = Session.builder.configs({
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_PASSWORD"],
    "role": "my_role", "warehouse": "my_wh",
    "database": "my_db", "schema": "my_schema"
}).create()
```

- Never hardcode credentials. Use environment variables or key pair auth.
- DataFrames are lazy -- executed on `collect()` / `show()`.
- Do NOT call `collect()` on large DataFrames. Process server-side with DataFrame operations.
- Use **vectorized UDFs** (10-100x faster) for batch and ML workloads.

## dbt on Snowflake

```sql
-- Dynamic table materialization (streaming/near-real-time marts):
{{ config(materialized='dynamic_table', snowflake_warehouse='transforming', target_lag='1 hour') }}

-- Incremental materialization (large fact tables):
{{ config(materialized='incremental', unique_key='event_id') }}

-- Snowflake-specific configs (combine with any materialization):
{{ config(transient=true, copy_grants=true, query_tag='team_daily') }}
```

- Do NOT use `{{ this }}` without `{% if is_incremental() %}` guard.
- Use `dynamic_table` materialization for streaming or near-real-time marts.

## Performance

- **Cluster keys**: Only for multi-TB tables. Apply on WHERE / JOIN / GROUP BY columns.
- **Search Optimization**: `ALTER TABLE t ADD SEARCH OPTIMIZATION ON EQUALITY(col);`
- **Warehouse sizing**: Start X-Small, scale up. Set `AUTO_SUSPEND = 60`, `AUTO_RESUME = TRUE`.
- **Separate warehouses** per workload (load, transform, query).

## Security

- Follow least-privilege RBAC. Use database roles for object-level grants.
- Audit ACCOUNTADMIN regularly: `SHOW GRANTS OF ROLE ACCOUNTADMIN;`
- Use network policies for IP allowlisting.
- Use masking policies for PII columns and row access policies for multi-tenant isolation.

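As with the MERGE helper, a standalone sketch of a least-privilege grant generator -- illustrative only, not the bundled script. Note that `USAGE` applies at the database/schema level, not to tables:

```python
def render_grants(role: str, database: str, schemas: list[str], privileges: list[str]) -> list[str]:
    """Render GRANT statements for a role over the given schemas."""
    # USAGE goes on the database and each schema; remaining privileges
    # apply to the tables within each schema.
    table_privs = [p.upper() for p in privileges if p.upper() != "USAGE"]
    stmts = [f"GRANT USAGE ON DATABASE {database} TO ROLE {role};"]
    for schema in schemas:
        stmts.append(f"GRANT USAGE ON SCHEMA {database}.{schema} TO ROLE {role};")
        if table_privs:
            stmts.append(
                f"GRANT {', '.join(table_privs)} ON ALL TABLES "
                f"IN SCHEMA {database}.{schema} TO ROLE {role};"
            )
    return stmts

for stmt in render_grants("analyst_role", "analytics", ["public", "staging"], ["SELECT", "USAGE"]):
    print(stmt)
```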
---

## Proactive Triggers

Surface these issues without being asked when you notice them in context:

- **Missing colon prefix** in SQL stored procedures -- flag immediately, this causes "invalid identifier" at runtime.
- **`SELECT *` in Dynamic Tables** -- flag as a schema-change time bomb.
- **Deprecated Cortex function names** (`CLASSIFY_TEXT`, `SUMMARIZE`, etc.) -- suggest the current `AI_*` equivalents.
- **Task not resumed** after creation -- remind that tasks start SUSPENDED.
- **Hardcoded credentials** in Snowpark code -- flag as a security risk.

---

## Common Errors

| Error | Cause | Fix |
|-------|-------|-----|
| "Object does not exist" | Wrong database/schema context or missing grants | Fully qualify names (`db.schema.table`), check grants |
| "Invalid identifier" in procedure | Missing colon prefix on variable | Use `:variable_name` inside SQL statements |
| "Numeric value not recognized" | VARIANT field not cast | Cast explicitly: `src:field::NUMBER(10,2)` |
| Task not running | Forgot to resume after creation | `ALTER TASK task_name RESUME;` |
| DT refresh failing | Schema change upstream or tracking disabled | Use explicit columns, verify change tracking |
| TO_FILE error | Combined path as single argument | Split into two args: `TO_FILE('@stage', 'file.pdf')` |

---

## Practical Workflows

### Workflow 1: Build a Reporting Pipeline (30 min)

1. **Stage raw data**: Create external stage pointing to S3/GCS/Azure, set up Snowpipe for auto-ingest
2. **Clean with Dynamic Table**: Create DT with `TARGET_LAG = '5 minutes'` that filters nulls, casts types, deduplicates
3. **Aggregate with downstream DT**: Second DT that joins cleaned data with dimension tables, computes metrics
4. **Expose via Secure View**: Create `SECURE VIEW` for the BI tool / API layer
5. **Grant access**: Use `snowflake_query_helper.py grant` to generate RBAC statements

### Workflow 2: Add AI Classification to Existing Data

1. **Identify the column**: Find the text column to classify (e.g., support tickets, reviews)
2. **Test with AI_CLASSIFY**: `SELECT AI_CLASSIFY(text_col, ['bug', 'feature', 'question']) FROM table LIMIT 10;`
3. **Create enrichment DT**: Dynamic Table that runs `AI_CLASSIFY` on new rows automatically
4. **Monitor costs**: Cortex AI is billed per token — sample before running on full tables

### Workflow 3: Debug a Failing Pipeline

1. **Check task history**: `SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY()) WHERE STATE = 'FAILED' ORDER BY SCHEDULED_TIME DESC;`
2. **Check DT refresh**: `SELECT * FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY('my_dt')) ORDER BY REFRESH_END_TIME DESC;`
3. **Check stream staleness**: `SHOW STREAMS; -- check stale_after column`
4. **Consult troubleshooting reference**: See `references/troubleshooting.md` for error-specific fixes

---

## Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| `SELECT *` in Dynamic Tables | Schema changes upstream break the DT silently | Use explicit column lists |
| Missing colon prefix in procedures | "Invalid identifier" runtime error | Always use `:variable_name` in SQL blocks |
| Single warehouse for all workloads | Contention between load, transform, and query | Separate warehouses per workload type |
| Hardcoded credentials in Snowpark | Security risk, breaks in CI/CD | Use `os.environ[]` or key pair auth |
| `collect()` on large DataFrames | Pulls entire result set to client memory | Process server-side with DataFrame operations |
| Nested subqueries instead of CTEs | Unreadable, hard to debug, Snowflake optimizes CTEs better | Use `WITH` clauses |
| Using deprecated Cortex functions | `CLASSIFY_TEXT`, `SUMMARIZE` etc. will be removed | Use `AI_CLASSIFY`, `AI_COMPLETE` etc. |
| Tasks without `WHEN SYSTEM$STREAM_HAS_DATA` | Task runs on schedule even with no new data, wasting credits | Add the WHEN clause for stream-driven tasks |
| Double-quoted identifiers | Forces case-sensitive names across all queries | Use `snake_case` unquoted identifiers |

---

## Cross-References

| Skill | Relationship |
|-------|-------------|
| `engineering/sql-database-assistant` | General SQL patterns — use for non-Snowflake databases |
| `engineering/database-designer` | Schema design — use for data modeling before Snowflake implementation |
| `engineering-team/senior-data-engineer` | Broader data engineering — pipelines, Spark, Airflow, data quality |
| `engineering-team/senior-data-scientist` | Analytics and ML — use alongside Snowpark for feature engineering |
| `engineering-team/senior-devops` | CI/CD for Snowflake deployments (Terraform, GitHub Actions) |

---

## Reference Documentation

| Document | Contents |
|----------|----------|
| `references/snowflake_sql_and_pipelines.md` | SQL patterns, MERGE templates, Dynamic Table debugging, Snowpipe, anti-patterns |
| `references/cortex_ai_and_agents.md` | Cortex AI functions, agent spec structure, Cortex Search, Snowpark |
| `references/troubleshooting.md` | Error reference, debugging queries, common fixes |

294 engineering-team/snowflake-development/SKILL.md Normal file
@@ -0,0 +1,294 @@
---
name: "snowflake-development"
description: "Use when writing Snowflake SQL, building data pipelines with Dynamic Tables or Streams/Tasks, using Cortex AI functions, creating Cortex Agents, writing Snowpark Python, configuring dbt for Snowflake, or troubleshooting Snowflake errors."
---

# Snowflake Development

Snowflake SQL, data pipelines, Cortex AI, and Snowpark Python development. Covers the colon-prefix rule, semi-structured data, MERGE upserts, Dynamic Tables, Streams+Tasks, Cortex AI functions, agent specs, performance tuning, and security hardening.

> Originally contributed by [James Cha-Earley](https://github.com/jamescha-earley) — enhanced and integrated by the claude-skills team.

## Quick Start

```bash
# Generate a MERGE upsert template
python scripts/snowflake_query_helper.py merge --target customers --source staging_customers --key customer_id --columns name,email,updated_at

# Generate a Dynamic Table template
python scripts/snowflake_query_helper.py dynamic-table --name cleaned_events --warehouse transform_wh --lag "5 minutes"

# Generate RBAC grant statements
python scripts/snowflake_query_helper.py grant --role analyst_role --database analytics --schemas public,staging --privileges SELECT,USAGE
```

---

## SQL Best Practices

### Naming and Style

- Use `snake_case` for all identifiers. Avoid double-quoted identifiers -- they force case-sensitive names that require constant quoting.
- Use CTEs (`WITH` clauses) over nested subqueries.
- Use `CREATE OR REPLACE` for idempotent DDL.
- Use explicit column lists -- never `SELECT *` in production. Snowflake's columnar storage scans only referenced columns, so explicit lists reduce I/O.

### Stored Procedures -- Colon Prefix Rule

In SQL stored procedures (BEGIN...END blocks), variables and parameters **must** use the colon `:` prefix inside SQL statements. Without it, Snowflake treats them as column identifiers and raises "invalid identifier" errors.

```sql
-- WRONG: missing colon prefix
SELECT name INTO result FROM users WHERE id = p_id;

-- CORRECT: colon prefix on both variable and parameter
SELECT name INTO :result FROM users WHERE id = :p_id;
```

This applies to DECLARE variables, LET variables, and procedure parameters when used inside SELECT, INSERT, UPDATE, DELETE, or MERGE.

### Semi-Structured Data

- VARIANT, OBJECT, ARRAY for JSON/Avro/Parquet/ORC.
- Access nested fields: `src:customer.name::STRING`. Always cast with `::TYPE`.
- VARIANT null vs SQL NULL: JSON `null` is stored as a VARIANT null, distinct from SQL `NULL`. Use `STRIP_NULL_VALUES = TRUE` in the file format on load.
- Flatten arrays: `SELECT f.value:name::STRING FROM my_table, LATERAL FLATTEN(input => src:items) f;`

### MERGE for Upserts

```sql
MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (id, name, updated_at) VALUES (s.id, s.name, CURRENT_TIMESTAMP());
```

> See `references/snowflake_sql_and_pipelines.md` for deeper SQL patterns and anti-patterns.

---

## Data Pipelines

### Choosing Your Approach

| Approach | When to Use |
|----------|-------------|
| Dynamic Tables | Declarative transformations. **Default choice.** Define the query, Snowflake handles refresh. |
| Streams + Tasks | Imperative CDC. Use for procedural logic, stored procedure calls, complex branching. |
| Snowpipe | Continuous file loading from cloud storage (S3, GCS, Azure). |

### Dynamic Tables

```sql
CREATE OR REPLACE DYNAMIC TABLE cleaned_events
TARGET_LAG = '5 minutes'
WAREHOUSE = transform_wh
AS
SELECT event_id, event_type, user_id, event_timestamp
FROM raw_events
WHERE event_type IS NOT NULL;
```

Key rules:
- Set `TARGET_LAG` progressively: tighter at the top of the DAG, looser downstream.
- Incremental DTs cannot depend on full-refresh DTs.
- `SELECT *` breaks on upstream schema changes -- use explicit column lists.
- Views cannot sit between two Dynamic Tables in the DAG.

### Streams and Tasks

```sql
CREATE OR REPLACE STREAM raw_stream ON TABLE raw_events;

CREATE OR REPLACE TASK process_events
WAREHOUSE = transform_wh
SCHEDULE = 'USING CRON 0 */1 * * * America/Los_Angeles'
WHEN SYSTEM$STREAM_HAS_DATA('raw_stream')
AS INSERT INTO cleaned_events SELECT ... FROM raw_stream;

-- Tasks start SUSPENDED. You MUST resume them.
ALTER TASK process_events RESUME;
```

> See `references/snowflake_sql_and_pipelines.md` for DT debugging queries and Snowpipe patterns.

---

## Cortex AI

### Function Reference

| Function | Purpose |
|----------|---------|
| `AI_COMPLETE` | LLM completion (text, images, documents) |
| `AI_CLASSIFY` | Classify text into categories (up to 500 labels) |
| `AI_FILTER` | Boolean filter on text or images |
| `AI_EXTRACT` | Structured extraction from text/images/documents |
| `AI_SENTIMENT` | Sentiment score (-1 to 1) |
| `AI_PARSE_DOCUMENT` | OCR or layout extraction from documents |
| `AI_REDACT` | PII removal from text |

**Deprecated names (do NOT use):** `COMPLETE`, `CLASSIFY_TEXT`, `EXTRACT_ANSWER`, `PARSE_DOCUMENT`, `SUMMARIZE`, `TRANSLATE`, `SENTIMENT`, `EMBED_TEXT_768`.

### TO_FILE -- Common Pitfall

Stage path and filename are **separate** arguments:

```sql
-- WRONG: single combined argument
TO_FILE('@stage/file.pdf')

-- CORRECT: two arguments
TO_FILE('@db.schema.mystage', 'invoice.pdf')
```

### Cortex Agents

Agent specs use a JSON structure with top-level keys: `models`, `instructions`, `tools`, `tool_resources`.

- Use `$spec$` delimiter (not `$$`).
- `models` must be an object, not an array.
- `tool_resources` is a separate top-level key, not nested inside `tools`.
- Tool descriptions are the single biggest factor in agent quality.

> See `references/cortex_ai_and_agents.md` for full agent spec examples and Cortex Search patterns.

---

## Snowpark Python

```python
from snowflake.snowpark import Session
import os

session = Session.builder.configs({
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_PASSWORD"],
    "role": "my_role", "warehouse": "my_wh",
    "database": "my_db", "schema": "my_schema"
}).create()
```

- Never hardcode credentials. Use environment variables or key pair auth.
- DataFrames are lazy -- executed on `collect()` / `show()`.
- Do NOT call `collect()` on large DataFrames. Process server-side with DataFrame operations.
- Use **vectorized UDFs** (10-100x faster) for batch and ML workloads.

## dbt on Snowflake

```sql
-- Dynamic table materialization (streaming/near-real-time marts):
{{ config(materialized='dynamic_table', snowflake_warehouse='transforming', target_lag='1 hour') }}

-- Incremental materialization (large fact tables):
{{ config(materialized='incremental', unique_key='event_id') }}

-- Snowflake-specific configs (combine with any materialization):
{{ config(transient=true, copy_grants=true, query_tag='team_daily') }}
```

- Do NOT use `{{ this }}` without `{% if is_incremental() %}` guard.
- Use `dynamic_table` materialization for streaming or near-real-time marts.

## Performance

- **Cluster keys**: Only for multi-TB tables. Apply on WHERE / JOIN / GROUP BY columns.
- **Search Optimization**: `ALTER TABLE t ADD SEARCH OPTIMIZATION ON EQUALITY(col);`
- **Warehouse sizing**: Start X-Small, scale up. Set `AUTO_SUSPEND = 60`, `AUTO_RESUME = TRUE`.
- **Separate warehouses** per workload (load, transform, query).

## Security

- Follow least-privilege RBAC. Use database roles for object-level grants.
- Audit ACCOUNTADMIN regularly: `SHOW GRANTS OF ROLE ACCOUNTADMIN;`
- Use network policies for IP allowlisting.
- Use masking policies for PII columns and row access policies for multi-tenant isolation.

---

## Proactive Triggers

Surface these issues without being asked when you notice them in context:

- **Missing colon prefix** in SQL stored procedures -- flag immediately, this causes "invalid identifier" at runtime.
- **`SELECT *` in Dynamic Tables** -- flag as a schema-change time bomb.
- **Deprecated Cortex function names** (`CLASSIFY_TEXT`, `SUMMARIZE`, etc.) -- suggest the current `AI_*` equivalents.
- **Task not resumed** after creation -- remind that tasks start SUSPENDED.
- **Hardcoded credentials** in Snowpark code -- flag as a security risk.

---

## Common Errors

| Error | Cause | Fix |
|-------|-------|-----|
| "Object does not exist" | Wrong database/schema context or missing grants | Fully qualify names (`db.schema.table`), check grants |
| "Invalid identifier" in procedure | Missing colon prefix on variable | Use `:variable_name` inside SQL statements |
| "Numeric value not recognized" | VARIANT field not cast | Cast explicitly: `src:field::NUMBER(10,2)` |
| Task not running | Forgot to resume after creation | `ALTER TASK task_name RESUME;` |
| DT refresh failing | Schema change upstream or tracking disabled | Use explicit columns, verify change tracking |
| TO_FILE error | Combined path as single argument | Split into two args: `TO_FILE('@stage', 'file.pdf')` |

---

## Practical Workflows

### Workflow 1: Build a Reporting Pipeline (30 min)

1. **Stage raw data**: Create external stage pointing to S3/GCS/Azure, set up Snowpipe for auto-ingest
2. **Clean with Dynamic Table**: Create DT with `TARGET_LAG = '5 minutes'` that filters nulls, casts types, deduplicates
3. **Aggregate with downstream DT**: Second DT that joins cleaned data with dimension tables, computes metrics
4. **Expose via Secure View**: Create `SECURE VIEW` for the BI tool / API layer
5. **Grant access**: Use `snowflake_query_helper.py grant` to generate RBAC statements

### Workflow 2: Add AI Classification to Existing Data

1. **Identify the column**: Find the text column to classify (e.g., support tickets, reviews)
2. **Test with AI_CLASSIFY**: `SELECT AI_CLASSIFY(text_col, ['bug', 'feature', 'question']) FROM table LIMIT 10;`
3. **Create enrichment DT**: Dynamic Table that runs `AI_CLASSIFY` on new rows automatically
4. **Monitor costs**: Cortex AI is billed per token — sample before running on full tables

### Workflow 3: Debug a Failing Pipeline

1. **Check task history**: `SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY()) WHERE STATE = 'FAILED' ORDER BY SCHEDULED_TIME DESC;`
2. **Check DT refresh**: `SELECT * FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY('my_dt')) ORDER BY REFRESH_END_TIME DESC;`
3. **Check stream staleness**: `SHOW STREAMS; -- check stale_after column`
4. **Consult troubleshooting reference**: See `references/troubleshooting.md` for error-specific fixes

---

## Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| `SELECT *` in Dynamic Tables | Schema changes upstream break the DT silently | Use explicit column lists |
| Missing colon prefix in procedures | "Invalid identifier" runtime error | Always use `:variable_name` in SQL blocks |
| Single warehouse for all workloads | Contention between load, transform, and query | Separate warehouses per workload type |
| Hardcoded credentials in Snowpark | Security risk, breaks in CI/CD | Use `os.environ[]` or key pair auth |
| `collect()` on large DataFrames | Pulls entire result set to client memory | Process server-side with DataFrame operations |
| Nested subqueries instead of CTEs | Unreadable, hard to debug, Snowflake optimizes CTEs better | Use `WITH` clauses |
| Using deprecated Cortex functions | `CLASSIFY_TEXT`, `SUMMARIZE` etc. will be removed | Use `AI_CLASSIFY`, `AI_COMPLETE` etc. |
| Tasks without `WHEN SYSTEM$STREAM_HAS_DATA` | Task runs on schedule even with no new data, wasting credits | Add the WHEN clause for stream-driven tasks |
| Double-quoted identifiers | Forces case-sensitive names across all queries | Use `snake_case` unquoted identifiers |

---

## Cross-References

| Skill | Relationship |
|-------|--------------|
| `engineering/sql-database-assistant` | General SQL patterns — use for non-Snowflake databases |
| `engineering/database-designer` | Schema design — use for data modeling before Snowflake implementation |
| `engineering-team/senior-data-engineer` | Broader data engineering — pipelines, Spark, Airflow, data quality |
| `engineering-team/senior-data-scientist` | Analytics and ML — use alongside Snowpark for feature engineering |
| `engineering-team/senior-devops` | CI/CD for Snowflake deployments (Terraform, GitHub Actions) |

---

## Reference Documentation

| Document | Contents |
|----------|----------|
| `references/snowflake_sql_and_pipelines.md` | SQL patterns, MERGE templates, Dynamic Table debugging, Snowpipe, anti-patterns |
| `references/cortex_ai_and_agents.md` | Cortex AI functions, agent spec structure, Cortex Search, Snowpark |
| `references/troubleshooting.md` | Error reference, debugging queries, common fixes |

@@ -0,0 +1,280 @@
# Cortex AI and Agents Reference

Complete reference for Snowflake Cortex AI functions, Cortex Agents, Cortex Search, and Snowpark Python patterns.

## Table of Contents

1. [Cortex AI Functions](#cortex-ai-functions)
2. [Cortex Agents](#cortex-agents)
3. [Cortex Search](#cortex-search)
4. [Snowpark Python](#snowpark-python)

---

## Cortex AI Functions

### Complete Function Reference

| Function | Signature | Returns |
|----------|-----------|---------|
| `AI_COMPLETE` | `AI_COMPLETE(model, prompt)` or `AI_COMPLETE(model, conversation, options)` | STRING or OBJECT |
| `AI_CLASSIFY` | `AI_CLASSIFY(input, categories)` | OBJECT with `labels` array |
| `AI_EXTRACT` | `AI_EXTRACT(input, fields)` | OBJECT with extracted fields |
| `AI_FILTER` | `AI_FILTER(input, condition)` | BOOLEAN |
| `AI_SENTIMENT` | `AI_SENTIMENT(text)` | FLOAT (-1 to 1) |
| `AI_SUMMARIZE` | `AI_SUMMARIZE(text)` | STRING |
| `AI_TRANSLATE` | `AI_TRANSLATE(text, source_lang, target_lang)` | STRING |
| `AI_PARSE_DOCUMENT` | `AI_PARSE_DOCUMENT(file, options)` | OBJECT |
| `AI_REDACT` | `AI_REDACT(text)` | STRING |
| `AI_EMBED` | `AI_EMBED(model, text)` | ARRAY (vector) |
| `AI_AGG` | `AI_AGG(column, instruction)` | STRING |

### Deprecated Function Mapping

| Old Name (Do NOT Use) | New Name |
|-----------------------|----------|
| `COMPLETE` | `AI_COMPLETE` |
| `CLASSIFY_TEXT` | `AI_CLASSIFY` |
| `EXTRACT_ANSWER` | `AI_EXTRACT` |
| `SUMMARIZE` | `AI_SUMMARIZE` |
| `TRANSLATE` | `AI_TRANSLATE` |
| `SENTIMENT` | `AI_SENTIMENT` |
| `EMBED_TEXT_768` | `AI_EMBED` |
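
When migrating older SQL, the mapping above can be applied mechanically. A minimal stdlib sketch (a name rewrite only, not a full SQL parser -- the `modernize` helper is hypothetical, not part of the skill's script):

```python
import re

# Deprecated Cortex function names -> current AI_* equivalents (from the table above)
DEPRECATED = {
    "CLASSIFY_TEXT": "AI_CLASSIFY",
    "EXTRACT_ANSWER": "AI_EXTRACT",
    "SUMMARIZE": "AI_SUMMARIZE",
    "TRANSLATE": "AI_TRANSLATE",
    "SENTIMENT": "AI_SENTIMENT",
    "EMBED_TEXT_768": "AI_EMBED",
    "COMPLETE": "AI_COMPLETE",
}

def modernize(sql: str) -> str:
    """Rewrite deprecated Cortex function calls to their AI_* names."""
    for old, new in DEPRECATED.items():
        # Match the bare name or the SNOWFLAKE.CORTEX.-qualified form, before '('
        # \b keeps AI_SUMMARIZE etc. from being rewritten a second time
        sql = re.sub(rf"\b(SNOWFLAKE\.CORTEX\.)?{old}\s*\(", f"{new}(", sql)
    return sql

print(modernize("SELECT SNOWFLAKE.CORTEX.SUMMARIZE(body) FROM docs;"))
# SELECT AI_SUMMARIZE(body) FROM docs;
```

Because `\b` does not match after an underscore, already-migrated `AI_*` calls pass through unchanged.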

### AI_COMPLETE Patterns

**Simple completion:**
```sql
SELECT AI_COMPLETE('claude-4-sonnet', 'Summarize this text: ' || article_text) AS summary
FROM articles;
```

**With system prompt (conversation format):**
```sql
SELECT AI_COMPLETE(
    'claude-4-sonnet',
    [
        {'role': 'system', 'content': 'You are a data quality analyst. Be concise.'},
        {'role': 'user', 'content': 'Analyze this record: ' || record::STRING}
    ]
) AS analysis
FROM flagged_records;
```

**With document input (TO_FILE):**
```sql
SELECT AI_COMPLETE(
    'claude-4-sonnet',
    'Extract the invoice total from this document',
    TO_FILE('@docs_stage', 'invoice.pdf')
) AS invoice_total;
```

### AI_CLASSIFY Patterns

Use AI_CLASSIFY instead of AI_COMPLETE for classification tasks -- it is purpose-built, cheaper, and returns structured output.

```sql
SELECT
    ticket_text,
    AI_CLASSIFY(ticket_text, ['billing', 'technical', 'account', 'feature_request']):labels[0]::VARCHAR AS category
FROM support_tickets;
```

### AI_EXTRACT Patterns

```sql
SELECT
    AI_EXTRACT(email_body, ['sender_name', 'action_requested', 'deadline'])::OBJECT AS extracted
FROM emails;
```

### Cost Awareness

Estimate token costs before running AI functions on large tables:

```sql
-- Count tokens first
SELECT
    COUNT(*) AS row_count,
    SUM(AI_COUNT_TOKENS('claude-4-sonnet', text_column)) AS total_tokens
FROM my_table;

-- Process a sample first
SELECT AI_COMPLETE('claude-4-sonnet', text_column) FROM my_table SAMPLE (100 ROWS);
```

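To turn the token count into a rough budget figure, a stdlib sketch (the per-million-token price here is a placeholder assumption -- check the actual credit pricing for your model and region):

```python
def estimate_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Back-of-envelope spend estimate for running an AI function over a table."""
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical numbers: 2M tokens at a placeholder $3.00 per million tokens
print(f"${estimate_cost(2_000_000, 3.00):.2f}")  # $6.00
```

Run the sample query first, then scale the sample's token count by the full row count before committing to the whole table.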
---

## Cortex Agents

### Agent Spec Structure

```sql
CREATE OR REPLACE AGENT my_db.my_schema.sales_agent
FROM SPECIFICATION $spec$
{
  "models": {
    "orchestration": "auto"
  },
  "instructions": {
    "orchestration": "You are SalesBot. Help users query sales data.",
    "response": "Be concise. Use tables for numeric data."
  },
  "tools": [
    {
      "tool_spec": {
        "type": "cortex_analyst_text_to_sql",
        "name": "SalesQuery",
        "description": "Query sales metrics including revenue, orders, and customer data. Use for questions about sales performance, trends, and comparisons."
      }
    },
    {
      "tool_spec": {
        "type": "cortex_search",
        "name": "PolicySearch",
        "description": "Search company sales policies and procedures."
      }
    }
  ],
  "tool_resources": {
    "SalesQuery": {
      "semantic_model_file": "@my_db.my_schema.models/sales_model.yaml"
    },
    "PolicySearch": {
      "cortex_search_service": "my_db.my_schema.policy_search_service"
    }
  }
}
$spec$;
```

### Agent Rules

- **Delimiter**: Use `$spec$`, not `$$`, to avoid conflicts with SQL dollar-quoting.
- **models**: Must be an object (`{"orchestration": "auto"}`), not an array.
- **tool_resources**: A separate top-level key, not nested inside individual tool entries.
- **Empty values in edit specs**: Do NOT include `null` or empty string values when editing -- they clear existing values.
- **Tool descriptions**: The single biggest quality factor. Be specific about what data each tool accesses and what questions it answers.
- **Testing**: Never modify production agents directly. Clone first, test, then swap.

### Calling an Agent

```sql
SELECT SNOWFLAKE.CORTEX.AGENT(
    'my_db.my_schema.sales_agent',
    'What was total revenue last quarter?'
);
```

---

## Cortex Search

### Creating a Search Service

```sql
CREATE OR REPLACE CORTEX SEARCH SERVICE my_db.my_schema.docs_search
    ON text_column
    ATTRIBUTES category, department
    WAREHOUSE = search_wh
    TARGET_LAG = '1 hour'
AS (
    SELECT text_column, category, department, doc_id
    FROM documents
);
```

### Querying a Search Service

```sql
SELECT PARSE_JSON(
    SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
        'my_db.my_schema.docs_search',
        '{
            "query": "return policy for electronics",
            "columns": ["text_column", "category"],
            "filter": {"@eq": {"department": "retail"}},
            "limit": 5
        }'
    )
) AS results;
```

---

## Snowpark Python

### Session Setup

```python
from snowflake.snowpark import Session
import os

session = Session.builder.configs({
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_PASSWORD"],
    "role": "my_role",
    "warehouse": "my_wh",
    "database": "my_db",
    "schema": "my_schema"
}).create()
```

### DataFrame Operations

```python
from snowflake.snowpark import functions as F

# Lazy operations -- nothing executes until collect()/show()
df = session.table("events")
result = (
    df.filter(df["event_type"] == "purchase")
    .group_by("user_id")
    .agg(F.sum("amount").alias("total_spent"))
    .sort(F.col("total_spent").desc())
)
result.show()  # Execution happens here
```

### Vectorized UDFs (10-100x Faster)

```python
from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import StringType
import pandas as pd

@pandas_udf(
    name="normalize_email",
    is_permanent=True,
    stage_location="@udf_stage",
    replace=True,
    return_type=StringType(),
    input_types=[StringType()],
)
def normalize_email(emails: pd.Series) -> pd.Series:
    return emails.str.lower().str.strip()
```

### Stored Procedures in Python

```python
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

def process_batch(session: Session, batch_date: str) -> str:
    df = session.table("raw_events").filter(F.col("event_date") == batch_date)
    df.write.mode("overwrite").save_as_table("processed_events")
    return f"Processed {df.count()} rows for {batch_date}"

session.sproc.register(
    func=process_batch,
    name="process_batch",
    is_permanent=True,
    stage_location="@sproc_stage",
    replace=True
)
```

### Key Rules

- Never hardcode credentials. Use environment variables, key pair auth, or Snowflake's built-in connection config.
- DataFrames are lazy. Calling `.collect()` pulls all data to the client -- avoid it on large datasets.
- Use vectorized UDFs over scalar UDFs for batch processing (10-100x performance improvement).
- Close sessions when done: `session.close()`.
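
The first rule can be enforced with a small fail-fast helper; a minimal stdlib sketch (the `SNOWFLAKE_*` variable names follow the convention used in the Session Setup example above, and `connection_config` is illustrative, not a Snowpark API):

```python
import os

REQUIRED = ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD"]

def connection_config() -> dict:
    """Build a Session.builder.configs() dict from the environment, failing fast."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing credentials in environment: {', '.join(missing)}")
    return {
        "account": os.environ["SNOWFLAKE_ACCOUNT"],
        "user": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],
    }
```

Failing at startup with an explicit list of missing variables is far easier to debug in CI/CD than an authentication error from deep inside the connector.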
@@ -0,0 +1,281 @@
# Snowflake SQL and Pipelines Reference

Detailed patterns and anti-patterns for Snowflake SQL development and data pipeline design.

## Table of Contents

1. [SQL Patterns](#sql-patterns)
2. [Dynamic Table Deep Dive](#dynamic-table-deep-dive)
3. [Streams and Tasks Patterns](#streams-and-tasks-patterns)
4. [Snowpipe](#snowpipe)
5. [Anti-Patterns](#anti-patterns)

---

## SQL Patterns

### CTE-Based Transformations

```sql
WITH raw AS (
    SELECT * FROM raw_events WHERE event_date = CURRENT_DATE()
),
cleaned AS (
    SELECT
        event_id,
        TRIM(LOWER(event_type)) AS event_type,
        user_id,
        event_timestamp,
        src:metadata::VARIANT AS metadata
    FROM raw
    WHERE event_type IS NOT NULL
),
enriched AS (
    SELECT
        c.*,
        u.name AS user_name,
        u.segment
    FROM cleaned c
    JOIN dim_users u ON c.user_id = u.user_id
)
SELECT * FROM enriched;
```

### MERGE with Multiple Match Conditions

Deduplicate inside the `USING` subquery; filtering with `AND s.rn = 1` in the `ON` clause would route older duplicates into the `NOT MATCHED` branch and insert them.

```sql
MERGE INTO dim_customers t
USING (
    SELECT customer_id, name, email, updated_at
    FROM staging_customers
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1
) s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET t.name = s.name, t.email = s.email, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, updated_at)
    VALUES (s.customer_id, s.name, s.email, s.updated_at);
```

### Semi-Structured Data Patterns

**Flatten nested arrays:**
```sql
SELECT
    o.order_id,
    f.value:product_id::STRING AS product_id,
    f.value:quantity::NUMBER AS quantity,
    f.value:price::NUMBER(10,2) AS price
FROM orders o,
     LATERAL FLATTEN(input => o.line_items) f;
```

**Nested flatten (array of arrays):**
```sql
SELECT
    f1.value:category::STRING AS category,
    f2.value:tag::STRING AS tag
FROM catalog,
     LATERAL FLATTEN(input => data:categories) f1,
     LATERAL FLATTEN(input => f1.value:tags) f2;
```

**OBJECT_CONSTRUCT for building JSON:**
```sql
SELECT OBJECT_CONSTRUCT(
    'id', c.customer_id,
    'name', c.name,
    'orders', ARRAY_AGG(OBJECT_CONSTRUCT('order_id', o.order_id, 'total', o.total))
) AS customer_json
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;
```
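
Before writing the LATERAL FLATTEN query, the same flattening can be prototyped locally on a JSON sample; a stdlib sketch with made-up sample data:

```python
import json

# Sample order document in the shape the FLATTEN example assumes
order = json.loads("""
{"order_id": "o1",
 "line_items": [{"product_id": "p1", "quantity": 2, "price": 9.99},
                {"product_id": "p2", "quantity": 1, "price": 4.50}]}
""")

# Equivalent of: FROM orders o, LATERAL FLATTEN(input => o.line_items) f
rows = [
    (order["order_id"], item["product_id"], item["quantity"], item["price"])
    for item in order["line_items"]
]
print(rows)  # [('o1', 'p1', 2, 9.99), ('o1', 'p2', 1, 4.5)]
```

One row per array element, with the parent's keys repeated -- exactly what `LATERAL FLATTEN` produces.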

### Window Functions

```sql
-- Running total with partitions
SELECT
    department,
    employee,
    salary,
    SUM(salary) OVER (PARTITION BY department ORDER BY hire_date) AS dept_running_total
FROM employees;

-- Detect gaps in sequences (window functions cannot appear in HAVING; use QUALIFY)
SELECT id, seq_num,
       seq_num - LAG(seq_num) OVER (ORDER BY seq_num) AS gap
FROM records
QUALIFY seq_num - LAG(seq_num) OVER (ORDER BY seq_num) > 1;
```

### Time Travel

```sql
-- Query data as of a specific timestamp
SELECT * FROM my_table AT(TIMESTAMP => '2026-03-20 10:00:00'::TIMESTAMP);

-- Query data before a specific statement
SELECT * FROM my_table BEFORE(STATEMENT => '<query_id>');

-- Restore a dropped table
UNDROP TABLE accidentally_dropped_table;
```

Default retention: 1 day (standard edition), up to 90 days (enterprise+). Set per table: `DATA_RETENTION_TIME_IN_DAYS = 7`.

---

## Dynamic Table Deep Dive

### TARGET_LAG Strategy

Design your DT DAG with progressive lag -- tighter upstream, looser downstream:

```
raw_events (base table)
    |
    v
cleaned_events (DT, TARGET_LAG = '1 minute')
    |
    v
enriched_events (DT, TARGET_LAG = '5 minutes')
    |
    v
daily_aggregates (DT, TARGET_LAG = '1 hour')
```
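
The progressive-lag rule can be checked mechanically before deploying a DT DAG; a minimal stdlib sketch (the `lag_minutes` helper and the example edges are illustrative, not part of the skill's script):

```python
import re

def lag_minutes(target_lag: str) -> int:
    """Parse a TARGET_LAG string like '5 minutes' or '1 hour' into minutes."""
    value, unit = re.match(r"(\d+)\s*(minute|hour|day)s?", target_lag).groups()
    return int(value) * {"minute": 1, "hour": 60, "day": 1440}[unit]

# TARGET_LAG per DT and (upstream, downstream) DAG edges -- example data from the diagram
lags = {"cleaned_events": "1 minute", "enriched_events": "5 minutes",
        "daily_aggregates": "1 hour"}
edges = [("cleaned_events", "enriched_events"), ("enriched_events", "daily_aggregates")]

for up, down in edges:
    # Each downstream DT should be no fresher than its upstream
    assert lag_minutes(lags[down]) >= lag_minutes(lags[up]), f"{down} is tighter than {up}"
print("lag progression OK")
```

A downstream DT with a tighter lag than its source cannot actually be fresher than that source; it only burns refresh credits.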

### Refresh Mode Rules

| Refresh Mode | Condition |
|--------------|-----------|
| Incremental | DTs with simple SELECT, JOIN, WHERE, GROUP BY, UNION ALL on change-tracked sources |
| Full | DTs using non-deterministic functions, LIMIT, or depending on full-refresh DTs |

**Check refresh mode:**
```sql
SELECT name, refresh_mode, refresh_mode_reason
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLES())
WHERE name = 'MY_DT';
```

### DT Debugging Queries

```sql
-- Check DT health and lag
SELECT name, scheduling_state, last_completed_refresh_state,
       data_timestamp, DATEDIFF('minute', data_timestamp, CURRENT_TIMESTAMP()) AS lag_minutes
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLES());

-- Check refresh history for failures
SELECT name, state, state_message, refresh_trigger
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY())
WHERE state = 'FAILED'
ORDER BY refresh_end_time DESC
LIMIT 10;

-- Examine graph dependencies
SELECT name, qualified_name, refresh_mode
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_GRAPH_HISTORY());
```

### DT Constraints

- No views between two DTs in the DAG.
- `SELECT *` breaks on upstream schema changes.
- Cannot use non-deterministic functions (e.g., `CURRENT_TIMESTAMP()`) -- use a column from the source instead.
- Change tracking must be enabled on source tables: `ALTER TABLE src SET CHANGE_TRACKING = TRUE;`

---

## Streams and Tasks Patterns

### Task Trees (Parent-Child)

```sql
CREATE OR REPLACE TASK parent_task
    WAREHOUSE = transform_wh
    SCHEDULE = 'USING CRON 0 */1 * * * America/Los_Angeles'
AS CALL process_stage_1();

CREATE OR REPLACE TASK child_task
    WAREHOUSE = transform_wh
    AFTER parent_task
AS CALL process_stage_2();

-- Resume in reverse order: children first, then parent
ALTER TASK child_task RESUME;
ALTER TASK parent_task RESUME;
```

### Stream Types

| Stream Type | Use Case |
|-------------|----------|
| Standard (default) | Track all DML changes (INSERT, UPDATE, DELETE) |
| Append-only | Only track INSERTs. More efficient for insert-heavy tables. |
| Insert-only (external tables) | Track new files loaded via external tables. |

```sql
-- Append-only stream for event log tables
CREATE STREAM event_stream ON TABLE events APPEND_ONLY = TRUE;
```

### Serverless Tasks

```sql
-- No warehouse needed. Snowflake manages compute automatically.
CREATE OR REPLACE TASK lightweight_task
    USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
    SCHEDULE = '5 MINUTE'
AS INSERT INTO audit_log SELECT CURRENT_TIMESTAMP(), 'heartbeat';
```

---

## Snowpipe

### Auto-Ingest Setup (S3)

```sql
CREATE OR REPLACE PIPE my_pipe
    AUTO_INGEST = TRUE
AS COPY INTO raw_table
FROM @my_s3_stage
FILE_FORMAT = (TYPE = 'JSON', STRIP_NULL_VALUES = TRUE);
```

Configure the S3 event notification to point to the pipe's SQS queue:
```sql
SHOW PIPES LIKE 'my_pipe';
-- Use the notification_channel value for S3 event config
```

### Snowpipe Monitoring

```sql
-- Check pipe status
SELECT SYSTEM$PIPE_STATUS('my_pipe');

-- Recent load history
SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'raw_table',
    START_TIME => DATEADD(HOUR, -24, CURRENT_TIMESTAMP())
));
```

---

## Anti-Patterns

| Anti-Pattern | Why It's Bad | Fix |
|--------------|--------------|-----|
| `SELECT *` in production | Scans all columns, breaks on schema changes | Explicit column list |
| Double-quoted identifiers | Creates case-sensitive names requiring constant quoting | Use `snake_case` without quotes |
| `ORDER BY` without `LIMIT` | Sorts the entire result set for no reason | Add `LIMIT` or remove `ORDER BY` |
| Single warehouse for everything | Workloads compete for resources | Separate warehouses per workload |
| `FLOAT` for money | Rounding errors | `NUMBER(19,4)` or integer cents |
| Missing `RESUME` after task creation | Task never runs | Always `ALTER TASK ... RESUME` |
| `CURRENT_TIMESTAMP()` in DT query | Forces full refresh mode | Use a timestamp column from the source |
| Scanning VARIANT without casting | "Numeric value not recognized" errors | Always cast: `col:field::TYPE` |
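
The `FLOAT`-for-money failure mode is easy to reproduce locally -- the same binary rounding that bites in SQL shows up in any IEEE float:

```python
from decimal import Decimal

# Binary floats accumulate rounding error on currency arithmetic
total_float = sum([0.10] * 3)
print(total_float)             # 0.30000000000000004

# Fixed-point arithmetic (the NUMBER(19,4) analog) stays exact
total_dec = sum([Decimal("0.10")] * 3, Decimal("0"))
print(total_dec)               # 0.30
```

`NUMBER(19,4)` gives Snowflake the same exact decimal behavior; integer cents work too if you prefer plain integers.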
@@ -0,0 +1,155 @@
# Snowflake Troubleshooting Reference

Common errors, debugging queries, and resolution patterns for Snowflake development.

## Table of Contents

1. [Error Reference](#error-reference)
2. [Debugging Queries](#debugging-queries)
3. [Performance Diagnostics](#performance-diagnostics)

---

## Error Reference

### SQL Errors

| Error | Cause | Fix |
|-------|-------|-----|
| "Object 'X' does not exist or not authorized" | Wrong database/schema context, missing grants, or typo | Fully qualify: `db.schema.table`. Check `SHOW GRANTS ON TABLE`. |
| "Invalid identifier 'VAR'" in procedure | Missing colon prefix on variable in SQL procedure | Use `:var_name` inside SELECT/INSERT/UPDATE/DELETE/MERGE |
| "Numeric value 'X' is not recognized" | VARIANT field accessed without type cast | Always cast: `src:field::NUMBER(10,2)` |
| "SQL compilation error: ambiguous column name" | Same column name in multiple joined tables | Use table aliases: `t.id`, `s.id` |
| "Number of columns in insert does not match" | INSERT column count mismatch with VALUES | Verify column list matches value list exactly |
| "Division by zero" | Dividing by a column that contains 0 | Use `NULLIF(divisor, 0)` or `IFF(divisor = 0, NULL, ...)` |

### Pipeline Errors

| Error | Cause | Fix |
|-------|-------|-----|
| Task not running | Created but not resumed | `ALTER TASK task_name RESUME;` |
| DT stuck in FAILED state | Query error or upstream dependency issue | Check `DYNAMIC_TABLE_REFRESH_HISTORY()` for error messages |
| DT shows full refresh instead of incremental | Non-deterministic function or unsupported pattern | Check `refresh_mode_reason` in `INFORMATION_SCHEMA.DYNAMIC_TABLES()` |
| Stream shows no data | Stream was consumed or table was recreated | Verify stream is on the correct table, check `STALE_AFTER` |
| Snowpipe not loading files | SQS notification misconfigured or file format mismatch | Check `SYSTEM$PIPE_STATUS()`, verify notification channel |
| "UPSTREAM_FAILED" on DT | A DT dependency upstream has a refresh failure | Fix the upstream DT first, then downstream will recover |

### Cortex AI Errors

| Error | Cause | Fix |
|-------|-------|-----|
| "Function X does not exist" | Using deprecated function name | Use new `AI_*` names (e.g., `AI_CLASSIFY` not `CLASSIFY_TEXT`) |
| TO_FILE error | Single argument instead of two | `TO_FILE('@stage', 'file.pdf')` -- two separate arguments |
| Agent returns empty or wrong results | Poor tool descriptions or wrong semantic model | Improve tool descriptions, verify the semantic model covers the question |
| "Invalid specification" on agent | JSON structure error in spec | Check: `models` is an object not an array, `tool_resources` is top-level, no trailing commas |

---

## Debugging Queries

### Query History

```sql
-- Find slow queries in the last 24 hours
SELECT query_id, query_text, execution_status,
       total_elapsed_time / 1000 AS elapsed_sec,
       bytes_scanned / (1024*1024*1024) AS gb_scanned,
       rows_produced, warehouse_name
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(
    END_TIME_RANGE_START => DATEADD(HOUR, -24, CURRENT_TIMESTAMP()),
    RESULT_LIMIT => 50
))
WHERE total_elapsed_time > 30000  -- > 30 seconds
ORDER BY total_elapsed_time DESC;
```

### Dynamic Table Health

```sql
-- Overall DT status
SELECT name, scheduling_state, last_completed_refresh_state,
       data_timestamp,
       DATEDIFF('minute', data_timestamp, CURRENT_TIMESTAMP()) AS lag_minutes
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLES())
ORDER BY lag_minutes DESC;

-- Recent failures
SELECT name, state, state_message, refresh_trigger,
       DATEDIFF('second', refresh_start_time, refresh_end_time) AS duration_sec
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY())
WHERE state = 'FAILED'
ORDER BY refresh_end_time DESC
LIMIT 20;
```

### Stream Status

```sql
-- Check stream freshness
SHOW STREAMS;

-- Check if a stream has data
SELECT SYSTEM$STREAM_HAS_DATA('my_stream');
```

### Task Monitoring

```sql
-- Check task run history
SELECT name, state, error_message,
       scheduled_time, completed_time,
       DATEDIFF('second', scheduled_time, completed_time) AS duration_sec
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE name = 'MY_TASK'
ORDER BY scheduled_time DESC
LIMIT 20;
```

### Grants Debugging

```sql
-- What grants does a role have?
SHOW GRANTS TO ROLE my_role;

-- What grants exist on an object?
SHOW GRANTS ON TABLE my_db.my_schema.my_table;

-- Who has ACCOUNTADMIN?
SHOW GRANTS OF ROLE ACCOUNTADMIN;
```

---

## Performance Diagnostics

### Warehouse Utilization

```sql
-- Warehouse load over time
SELECT start_time, warehouse_name,
       avg_running, avg_queued_load, avg_blocked
FROM TABLE(INFORMATION_SCHEMA.WAREHOUSE_LOAD_HISTORY(
    DATE_RANGE_START => DATEADD(HOUR, -24, CURRENT_TIMESTAMP())
))
WHERE warehouse_name = 'MY_WH'
ORDER BY start_time DESC;
```

### Clustering Health

```sql
-- Check clustering depth (lower is better)
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(date_col, region)');
```

### Storage Costs

```sql
-- Table storage usage
SELECT table_name, active_bytes / (1024*1024*1024) AS active_gb,
       time_travel_bytes / (1024*1024*1024) AS time_travel_gb,
       failsafe_bytes / (1024*1024*1024) AS failsafe_gb
FROM INFORMATION_SCHEMA.TABLE_STORAGE_METRICS
WHERE table_schema = 'MY_SCHEMA'
ORDER BY active_bytes DESC;
```

@@ -0,0 +1,233 @@
#!/usr/bin/env python3
"""
Snowflake Query Helper

Generate common Snowflake SQL patterns: MERGE upserts, Dynamic Table DDL,
and RBAC grant statements. Outputs ready-to-use SQL that follows Snowflake
best practices.

Usage:
    python snowflake_query_helper.py merge --target customers --source stg_customers --key id --columns name,email
    python snowflake_query_helper.py dynamic-table --name cleaned_events --warehouse transform_wh --lag "5 minutes"
    python snowflake_query_helper.py grant --role analyst --database analytics --schemas public --privileges SELECT,USAGE
    python snowflake_query_helper.py merge --target t --source s --key id --columns a,b --json
"""

import argparse
import json
import sys
import textwrap
from typing import List, Optional

def generate_merge(
    target: str,
    source: str,
    key: str,
    columns: List[str],
    schema: Optional[str] = None,
) -> str:
    """Generate a MERGE (upsert) statement following Snowflake best practices."""
    prefix = f"{schema}." if schema else ""
    t = f"{prefix}{target}"
    s = f"{prefix}{source}"

    # Filter out updated_at from user columns to avoid duplicates
    merge_cols = [col for col in columns if col != "updated_at"]

    update_sets = ",\n            ".join(f"t.{col} = s.{col}" for col in merge_cols)
    update_sets += ",\n            t.updated_at = CURRENT_TIMESTAMP()"

    insert_cols = ", ".join([key] + merge_cols + ["updated_at"])
    insert_vals = ", ".join(
        [f"s.{key}"] + [f"s.{col}" for col in merge_cols] + ["CURRENT_TIMESTAMP()"]
    )

    return textwrap.dedent(f"""\
        MERGE INTO {t} t
        USING {s} s
        ON t.{key} = s.{key}
        WHEN MATCHED THEN
            UPDATE SET
            {update_sets}
        WHEN NOT MATCHED THEN
            INSERT ({insert_cols})
            VALUES ({insert_vals});""")

def generate_dynamic_table(
    name: str,
    warehouse: str,
    lag: str,
    source: Optional[str] = None,
    columns: Optional[List[str]] = None,
    schema: Optional[str] = None,
) -> str:
    """Generate a Dynamic Table DDL with best-practice defaults."""
    prefix = f"{schema}." if schema else ""
    full_name = f"{prefix}{name}"
    src = source or "<source_table>"
    col_list = ", ".join(columns) if columns else "<col1>, <col2>, <col3>"

    return textwrap.dedent(f"""\
        CREATE OR REPLACE DYNAMIC TABLE {full_name}
            TARGET_LAG = '{lag}'
            WAREHOUSE = {warehouse}
        AS
        SELECT {col_list}
        FROM {src}
        WHERE 1=1;  -- Add your filter conditions

        -- Verify refresh mode (incremental is preferred):
        -- SELECT name, refresh_mode, refresh_mode_reason
        -- FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLES())
        -- WHERE name = '{name.upper()}';""")

def generate_grants(
    role: str,
    database: str,
    schemas: List[str],
    privileges: List[str],
) -> str:
    """Generate RBAC grant statements following least-privilege principles."""
    lines = [f"-- RBAC grants for role: {role}"]
    lines.append("-- Generated following least-privilege principles")
    lines.append("")

    # Database-level
    lines.append(f"GRANT USAGE ON DATABASE {database} TO ROLE {role};")
    lines.append("")

    for schema in schemas:
        fq_schema = f"{database}.{schema}"
        lines.append(f"-- Schema: {fq_schema}")
        lines.append(f"GRANT USAGE ON SCHEMA {fq_schema} TO ROLE {role};")

        for priv in privileges:
            p = priv.strip().upper()
            if p == "USAGE":
                continue  # Already granted above
            elif p == "SELECT":
                for obj in ("ALL TABLES", "FUTURE TABLES", "ALL VIEWS", "FUTURE VIEWS"):
                    lines.append(
                        f"GRANT SELECT ON {obj} IN SCHEMA {fq_schema} TO ROLE {role};"
                    )
            elif p in ("INSERT", "UPDATE", "DELETE", "TRUNCATE"):
                for obj in ("ALL TABLES", "FUTURE TABLES"):
                    lines.append(
                        f"GRANT {p} ON {obj} IN SCHEMA {fq_schema} TO ROLE {role};"
                    )
            else:
                # CREATE TABLE, CREATE VIEW, and any other schema-level privilege
                lines.append(f"GRANT {p} ON SCHEMA {fq_schema} TO ROLE {role};")
        lines.append("")

    return "\n".join(lines)

def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Generate common Snowflake SQL patterns",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog=textwrap.dedent("""\
|
||||
Examples:
|
||||
%(prog)s merge --target customers --source stg --key id --columns name,email
|
||||
%(prog)s dynamic-table --name clean_events --warehouse wh --lag "5 min"
|
||||
%(prog)s grant --role analyst --database db --schemas public --privileges SELECT
|
||||
"""),
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json", action="store_true", help="Output as JSON instead of raw SQL"
|
||||
)
|
||||
|
||||
subparsers = parser.add_subparsers(dest="command", help="SQL pattern to generate")
|
||||
|
||||
# MERGE subcommand
|
||||
merge_p = subparsers.add_parser("merge", help="Generate MERGE (upsert) statement")
|
||||
merge_p.add_argument("--target", required=True, help="Target table name")
|
||||
merge_p.add_argument("--source", required=True, help="Source table name")
|
||||
merge_p.add_argument("--key", required=True, help="Join key column")
|
||||
merge_p.add_argument(
|
||||
"--columns", required=True, help="Comma-separated columns to merge"
|
||||
)
|
||||
merge_p.add_argument("--schema", help="Schema prefix (e.g., my_db.my_schema)")
|
||||
|
||||
# Dynamic Table subcommand
|
||||
dt_p = subparsers.add_parser(
|
||||
"dynamic-table", help="Generate Dynamic Table DDL"
|
||||
)
|
||||
dt_p.add_argument("--name", required=True, help="Dynamic Table name")
|
||||
dt_p.add_argument("--warehouse", required=True, help="Warehouse for refresh")
|
||||
dt_p.add_argument(
|
||||
"--lag", required=True, help="Target lag (e.g., '5 minutes', '1 hour')"
|
||||
)
|
||||
dt_p.add_argument("--source", help="Source table name")
|
||||
dt_p.add_argument("--columns", help="Comma-separated column list")
|
||||
dt_p.add_argument("--schema", help="Schema prefix")
|
||||
|
||||
# Grant subcommand
|
||||
grant_p = subparsers.add_parser("grant", help="Generate RBAC grant statements")
|
||||
grant_p.add_argument("--role", required=True, help="Role to grant to")
|
||||
grant_p.add_argument("--database", required=True, help="Database name")
|
||||
grant_p.add_argument(
|
||||
"--schemas", required=True, help="Comma-separated schema names"
|
||||
)
|
||||
grant_p.add_argument(
|
||||
"--privileges",
|
||||
required=True,
|
||||
help="Comma-separated privileges (SELECT, INSERT, UPDATE, DELETE, CREATE TABLE, etc.)",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.command:
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
if args.command == "merge":
|
||||
cols = [c.strip() for c in args.columns.split(",")]
|
||||
sql = generate_merge(args.target, args.source, args.key, cols, args.schema)
|
||||
elif args.command == "dynamic-table":
|
||||
cols = [c.strip() for c in args.columns.split(",")] if args.columns else None
|
||||
sql = generate_dynamic_table(
|
||||
args.name, args.warehouse, args.lag, args.source, cols, args.schema
|
||||
)
|
||||
elif args.command == "grant":
|
||||
schemas = [s.strip() for s in args.schemas.split(",")]
|
||||
privs = [p.strip() for p in args.privileges.split(",")]
|
||||
sql = generate_grants(args.role, args.database, schemas, privs)
|
||||
else:
|
||||
parser.print_help()
|
||||
sys.exit(1)
|
||||
|
||||
if args.json:
|
||||
output = {"command": args.command, "sql": sql}
|
||||
print(json.dumps(output, indent=2))
|
||||
else:
|
||||
print(sql)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
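The helper above wires three subcommands to generator functions whose bodies mostly fall outside this hunk. Below is a hedged sketch of what two of them plausibly emit. Only the `generate_grants` privilege loop appears verbatim above; the `USAGE` grants, the function shapes, and the Dynamic Table DDL layout are assumptions based on standard Snowflake syntax, not the script's exact output.

```python
# Hedged sketches of the SQL the helper's subcommands likely produce.
# Names and shapes here are illustrative, not the script's verbatim code.

def sketch_grants(role, database, schemas, privileges):
    """Mirror of the generate_grants loop visible in the hunk above.

    The database/schema USAGE grants are assumptions; only the inner
    per-privilege loop appears verbatim in the diff.
    """
    lines = [f"GRANT USAGE ON DATABASE {database} TO ROLE {role};"]
    for schema in schemas:
        fq_schema = f"{database}.{schema}"
        lines.append(f"GRANT USAGE ON SCHEMA {fq_schema} TO ROLE {role};")
        for p in privileges:
            lines.append(f"GRANT {p} ON SCHEMA {fq_schema} TO ROLE {role};")
        lines.append("")
    return "\n".join(lines)


def sketch_dynamic_table(name, warehouse, lag, source, columns=None):
    """Dynamic Table DDL in Snowflake's documented TARGET_LAG/WAREHOUSE form."""
    select_list = ", ".join(columns) if columns else "*"
    return (
        f"CREATE OR REPLACE DYNAMIC TABLE {name}\n"
        f"  TARGET_LAG = '{lag}'\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"AS\n"
        f"SELECT {select_list}\n"
        f"FROM {source};"
    )


print(sketch_grants("analyst", "db", ["public"], ["SELECT"]))
print(sketch_dynamic_table("clean_events", "wh", "5 minutes", "raw_events"))
```

The real script additionally supports `--json`, which wraps the generated SQL as `{"command": ..., "sql": ...}` for programmatic consumption.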
@@ -161,6 +161,7 @@ nav:
       - "Senior SecOps Engineer": skills/engineering-team/senior-secops.md
       - "Senior Security Engineer": skills/engineering-team/senior-security.md
       - "Security Pen Testing": skills/engineering-team/security-pen-testing.md
+      - "Snowflake Development": skills/engineering-team/snowflake-development.md
       - "Stripe Integration Expert": skills/engineering-team/stripe-integration-expert.md
       - "TDD Guide": skills/engineering-team/tdd-guide.md
       - "Tech Stack Evaluator": skills/engineering-team/tech-stack-evaluator.md