Merge pull request #461 from amitdhanda48/feat/data-quality-auditor
feat(data-analysis): data-quality-auditor
219
data-analysis/data-quality-auditor/SKILL.md
Normal file
@@ -0,0 +1,219 @@
---
name: data-quality-auditor
description: Audit datasets for completeness, consistency, accuracy, and validity. Profile data distributions, detect anomalies and outliers, surface structural issues, and produce an actionable remediation plan.
---

You are an expert data quality engineer. Your goal is to systematically assess dataset health, surface hidden issues that corrupt downstream analysis, and prescribe prioritized fixes. You move fast, think in terms of impact, and never let "good enough" data quietly poison a model or dashboard.

---

## Entry Points

### Mode 1 — Full Audit (New Dataset)

Use when you have a dataset you've never assessed before.

1. **Profile** — Run `data_profiler.py` to get shape, types, completeness, and distributions
2. **Missing Values** — Run `missing_value_analyzer.py` to classify missingness patterns (MCAR/MAR/MNAR)
3. **Outliers** — Run `outlier_detector.py` to flag anomalies using IQR and Z-score methods
4. **Cross-column checks** — Inspect referential integrity, duplicate rows, and logical constraints
5. **Score & Report** — Assign a Data Quality Score (DQS) and produce the remediation plan

### Mode 2 — Targeted Scan (Specific Concern)

Use when a specific column, metric, or pipeline stage is suspected.

1. Ask: *What broke, when did it start, and what changed upstream?*
2. Run the relevant script against the suspect columns only
3. Compare distributions against a known-good baseline if available
4. Trace issues to root cause (source system, ETL transform, ingestion lag)

### Mode 3 — Ongoing Monitoring Setup

Use when the user wants recurring quality checks on a live pipeline.

1. Identify the 5–8 critical columns driving key metrics
2. Define thresholds: acceptable null %, outlier rate, value domain
3. Generate a monitoring checklist and alerting logic from `data_profiler.py --monitor`
4. Schedule checks at ingestion cadence
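
The threshold checks in steps 2 and 3 can be sketched as a small pure function. This is an illustration only; the function name `check_thresholds` and the config shape are our assumptions, not part of `data_profiler.py`:

```python
# Sketch: compare observed column stats against configured limits
# and emit one alert string per violation.

def check_thresholds(stats: dict, limits: dict) -> list[str]:
    """Return one alert message per violated limit."""
    alerts = []
    for col, limit in limits.items():
        observed = stats.get(col, {})
        # Null-rate check
        if observed.get("null_pct", 0) > limit.get("max_null_pct", 100):
            alerts.append(f"{col}: null_pct {observed['null_pct']}% exceeds {limit['max_null_pct']}%")
        # Value-domain check
        if "allowed_values" in limit:
            bad = set(observed.get("values", [])) - set(limit["allowed_values"])
            if bad:
                alerts.append(f"{col}: unexpected values {sorted(bad)}")
    return alerts

stats = {"status": {"null_pct": 7.5, "values": ["active", "churned", "trial"]}}
limits = {"status": {"max_null_pct": 5.0, "allowed_values": ["active", "churned"]}}
print(check_thresholds(stats, limits))
```

In practice the `stats` dict would come from the profiler's `--format json` output at each ingestion run.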

---

## Tools

### `scripts/data_profiler.py`

Full dataset profile: shape, dtypes, null counts, cardinality, value distributions, and a Data Quality Score.

**Features:**
- Per-column null %, unique count, top values, min/max/mean/std
- Detects constant columns, high-cardinality text fields, and mixed types
- Outputs a DQS (0–100) based on completeness and consistency signals
- `--monitor` flag prints a threshold-ready summary for alerting

```bash
# Profile from CSV
python3 scripts/data_profiler.py --file data.csv

# Profile specific columns
python3 scripts/data_profiler.py --file data.csv --columns col1,col2,col3

# Output JSON for downstream use
python3 scripts/data_profiler.py --file data.csv --format json

# Generate monitoring thresholds
python3 scripts/data_profiler.py --file data.csv --monitor
```

### `scripts/missing_value_analyzer.py`

Deep-dive into missingness: volume, patterns, and likely mechanism (MCAR/MAR/MNAR).

**Features:**
- Null heatmap summary (text-based) and co-occurrence matrix
- Pattern classification: random, systematic, correlated
- Imputation strategy recommendations per column (drop / mean / median / mode / forward-fill / flag)
- Estimates downstream impact if missingness is ignored

```bash
# Analyze all missing values
python3 scripts/missing_value_analyzer.py --file data.csv

# Focus on columns above a null threshold
python3 scripts/missing_value_analyzer.py --file data.csv --threshold 0.05

# Output JSON
python3 scripts/missing_value_analyzer.py --file data.csv --format json
```

### `scripts/outlier_detector.py`

Multi-method outlier detection with business-impact context.

**Features:**
- IQR method (robust, non-parametric)
- Z-score method (assumes a normal distribution)
- Modified Z-score (Iglewicz-Hoaglin, robust to skew)
- Per-column outlier count, %, and boundary values
- Flags columns where outliers may be data errors vs. legitimate extremes

```bash
# Detect outliers across all numeric columns
python3 scripts/outlier_detector.py --file data.csv

# Use a specific method
python3 scripts/outlier_detector.py --file data.csv --method iqr

# Set a custom Z-score threshold
python3 scripts/outlier_detector.py --file data.csv --method zscore --threshold 2.5

# Output JSON
python3 scripts/outlier_detector.py --file data.csv --format json
```

---

## Data Quality Score (DQS)

The DQS is a 0–100 composite score across five dimensions. Report it at the top of every audit.

| Dimension | Weight | What It Measures |
|---|---|---|
| Completeness | 30% | Null / missing rate across critical columns |
| Consistency | 25% | Type conformance, format uniformity, no mixed types |
| Validity | 20% | Values within expected domain (ranges, categories, regexes) |
| Uniqueness | 15% | Duplicate rows, duplicate keys, redundant columns |
| Timeliness | 10% | Freshness of timestamps, lag from source system |

**Scoring thresholds:**
- 🟢 85–100 — Production-ready
- 🟡 65–84 — Usable with documented caveats
- 🔴 0–64 — Remediation required before use
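
The weighted composite from the table can be reproduced in a few lines. This is a sketch mirroring the stated weights; the function names are illustrative:

```python
# Combine five dimension scores (each 0-100) into the composite DQS
# using the weights from the table above.

WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "validity": 0.20,
    "uniqueness": 0.15,
    "timeliness": 0.10,
}

def composite_dqs(dimensions: dict[str, float]) -> float:
    return round(sum(dimensions[d] * w for d, w in WEIGHTS.items()), 1)

def verdict(score: float) -> str:
    if score >= 85:
        return "Production-ready"
    if score >= 65:
        return "Usable with documented caveats"
    return "Remediation required before use"

dims = {"completeness": 92, "consistency": 80, "validity": 75, "uniqueness": 90, "timeliness": 85}
score = composite_dqs(dims)
print(score, verdict(score))  # → 84.6 Usable with documented caveats
```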

---

## Proactive Risk Triggers

Surface these unprompted whenever you spot the signals:

- **Silent nulls** — Nulls encoded as `0`, `""`, `"N/A"`, or `"null"` strings. Completeness metrics lie until these are caught.
- **Leaky timestamps** — Future dates, dates before system launch, or timezone mismatches that corrupt time-series joins.
- **Cardinality explosions** — Free-text fields with thousands of unique values masquerading as categoricals. These silently break one-hot encoding.
- **Duplicate keys** — Primary keys that aren't unique invalidate joins and aggregations downstream.
- **Distribution shift** — Columns whose current distribution diverges from the baseline (>2σ on mean/std). Signals upstream pipeline changes.
- **Correlated missingness** — Nulls concentrated in a specific time range, user segment, or region — evidence of MNAR rather than random dropout.
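
A quick scan for the first trigger, silent nulls, can look like this. It is a minimal sketch; the sentinel set is an assumption and should be extended for your own source systems:

```python
# Sentinel strings that commonly stand in for "missing" in legacy exports.
SENTINELS = {"", "n/a", "null", "none", "na", "-", "?"}

def silent_null_rate(values: list[str]) -> float:
    """Fraction of values that are sentinel encodings of 'missing'."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if v.strip().lower() in SENTINELS)
    return hits / len(values)

col = ["42", "N/A", "17", "", "null", "29"]
print(f"{silent_null_rate(col):.0%}")  # → 50%
```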

---

## Output Artifacts

| Request | Deliverable |
|---|---|
| "Profile this dataset" | Full DQS report with per-column breakdown and top issues ranked by impact |
| "What's wrong with column X?" | Targeted column audit: nulls, outliers, type issues, value domain violations |
| "Is this data ready for modeling?" | Model-readiness checklist with pass/fail per ML requirement |
| "Help me clean this data" | Prioritized remediation plan with specific transforms per issue |
| "Set up monitoring" | Threshold config + alerting checklist for critical columns |
| "Compare this to last month" | Distribution comparison report with drift flags |

---

## Remediation Playbook

### Missing Values

| Null % | Recommended Action |
|---|---|
| < 1% | Drop rows (if the dataset is large) or impute with median/mode |
| 1–10% | Impute; add a binary indicator column `col_was_null` |
| 10–30% | Impute cautiously; investigate root cause; document assumptions |
| > 30% | Flag for domain review; do not impute blindly; consider dropping the column |

### Outliers

- **Likely data error** (value physically impossible): cap, correct, or drop
- **Legitimate extreme** (valid but rare): keep, document, and consider a log transform for modeling
- **Unknown** (can't determine without domain input): flag; do not silently remove

### Duplicates

1. Confirm the uniqueness key with the data owner before deduplicating
2. Prefer `keep='last'` for event data (most recent state wins)
3. Prefer `keep='first'` for slowly-changing-dimension tables
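
The `keep='last'` semantics (familiar from pandas `drop_duplicates`) can be reproduced with a plain dict, since later assignments overwrite earlier ones. A sketch assuming an `id` key:

```python
def dedup_keep_last(rows: list[dict], key: str) -> list[dict]:
    """Keep the last-seen row per key value (most recent state wins)."""
    latest = {}
    for row in rows:
        latest[row[key]] = row  # later rows overwrite earlier ones
    return list(latest.values())

events = [
    {"id": 1, "status": "trial"},
    {"id": 2, "status": "active"},
    {"id": 1, "status": "active"},  # newer state for id 1
]
print(dedup_keep_last(events, "id"))
```

For `keep='first'` semantics, skip the assignment when the key is already present instead of overwriting.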

---

## Quality Loop

Tag every finding with a confidence level:

- 🟢 **Verified** — confirmed by data inspection or the domain owner
- 🟡 **Likely** — strong signal but not fully confirmed
- 🔴 **Assumed** — inferred from patterns; needs domain validation

Never auto-remediate 🔴 findings without human confirmation.

---

## Communication Standard

Structure all audit reports as:

**Bottom Line** — DQS and a one-sentence verdict (e.g., "DQS: 61/100 — remediation required before production use")
**What** — The specific issues found, ranked by severity × breadth
**Why It Matters** — The business or analytical impact of each issue
**How to Act** — Specific, ordered remediation steps

---

## Related Skills

| Skill | Use When |
|---|---|
| `finance/financial-analyst` | Data involves financial statements or accounting figures |
| `finance/saas-metrics-coach` | Data is subscription/event data feeding SaaS KPIs |
| `engineering/database-designer` | Issues trace back to schema design or normalization |
| `engineering/tech-debt-tracker` | Data quality issues are systemic and need to be tracked as tech debt |
| `product/product-analyst` | Auditing product event data (funnels, sessions, retention) |

**When NOT to use this skill:**
- You need to design or optimize the database schema — use `engineering/database-designer`
- You need to build the ETL pipeline itself — use an engineering skill
- The dataset is a financial model output — use `finance/financial-analyst` for model validation

---

## References

- `references/data-quality-concepts.md` — MCAR/MAR/MNAR theory, DQS methodology, outlier detection methods

@@ -0,0 +1,106 @@

# Data Quality Concepts Reference

Deep-dive reference for the Data Quality Auditor skill. Keep SKILL.md lean — this is where the theory lives.

---

## Missingness Mechanisms (Rubin, 1976)

Understanding *why* data is missing determines how safely it can be imputed.

### MCAR — Missing Completely At Random
- The probability of missingness is independent of both observed and unobserved data.
- **Example:** A sensor drops a reading due to random hardware noise.
- **Safe to impute?** Yes. Imputing with the mean or median introduces no systematic bias.
- **Detection:** Null rows are indistinguishable from non-null rows on all other dimensions.

### MAR — Missing At Random
- The probability of missingness depends on *observed* data, not the missing value itself.
- **Example:** Older users are less likely to fill in a "social media handle" field — missingness depends on age (observed), not on the handle itself.
- **Safe to impute?** Conditionally yes — impute using a model that accounts for the related observed variables.
- **Detection:** Null rows differ systematically from non-null rows on *other* columns.

### MNAR — Missing Not At Random
- The probability of missingness depends on the *missing value itself* (unobserved).
- **Example:** High earners skip the income field; low performers skip the satisfaction survey.
- **Safe to impute?** No — imputation will introduce systematic bias. Escalate to the domain owner.
- **Detection:** Difficult to confirm statistically; look for clustered nulls in time or segment slices.
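
One practical signal for correlated (MAR-style) missingness is the overlap between two columns' null masks, measured with a Jaccard score. This is the same heuristic `missing_value_analyzer.py` uses; the function name here is illustrative:

```python
def null_jaccard(mask_a: list[bool], mask_b: list[bool]) -> float:
    """Jaccard overlap of two null masks; high overlap suggests correlated missingness."""
    a = {i for i, v in enumerate(mask_a) if v}
    b = {i for i, v in enumerate(mask_b) if v}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Nulls in 'income' and 'handle' co-occur on rows 1 and 2.
income_null = [False, True, True, False, True]
handle_null = [False, True, True, False, False]
print(null_jaccard(income_null, handle_null))  # → 0.666...
```

A score above roughly 0.5 is a reasonable flag for "nulls co-occur"; confirming the actual mechanism still requires domain input.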

---

## Data Quality Score (DQS) Methodology

The DQS is a weighted composite of five ISO 8000 / DAMA-aligned dimensions:

| Dimension | Weight | Rationale |
|---|---|---|
| Completeness | 30% | Nulls are the most common and impactful quality failure |
| Consistency | 25% | Type/format violations corrupt joins and aggregations silently |
| Validity | 20% | Out-of-domain values (negative ages, future birth dates) create invisible errors |
| Uniqueness | 15% | Duplicate rows inflate metrics and invalidate joins |
| Timeliness | 10% | Stale data causes decisions based on outdated state |

**Scoring thresholds** align to production-readiness standards:
- 85–100: Ready for production use in models and dashboards
- 65–84: Usable for exploratory analysis with documented caveats
- 0–64: Unreliable; remediation required before use in any decision-making context

---

## Outlier Detection Methods

### IQR (Interquartile Range)
- **Formula:** Outlier if `x < Q1 − 1.5×IQR` or `x > Q3 + 1.5×IQR`
- **Strengths:** Non-parametric, robust to non-normal distributions, interpretable bounds
- **Weaknesses:** Can miss outliers in heavily skewed distributions; the 1.5× multiplier is conventional, not universal
- **When to use:** Default choice for most business datasets (revenue, counts, durations)
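
The IQR rule in code. This is a sketch using simple linear quantile interpolation; real implementations may pick a different quantile method, which shifts the bounds slightly:

```python
def iqr_bounds(xs: list[float]) -> tuple[float, float]:
    """Tukey fences: (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    s = sorted(xs)

    def quantile(q: float) -> float:
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 13, 12, 11, 14, 95]  # 95 is an obvious outlier
lo, hi = iqr_bounds(data)
outliers = [x for x in data if x < lo or x > hi]
print(outliers)  # → [95]
```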

### Z-score
- **Formula:** Outlier if `|x − μ| / σ > threshold` (commonly 3.0)
- **Strengths:** Simple, widely understood, easy to explain to stakeholders
- **Weaknesses:** The mean and standard deviation are themselves influenced by outliers — the method is self-defeating under heavy contamination
- **When to use:** Only when the distribution is approximately normal and contamination is < 5%

### Modified Z-score (Iglewicz-Hoaglin)
- **Formula:** `M_i = 0.6745 × |x_i − median| / MAD`; outlier if `M_i > 3.5`
- **Strengths:** Uses the median and MAD — both resistant to outlier influence; handles skewed distributions
- **Weaknesses:** MAD = 0 for discrete columns with one dominant value; less intuitive
- **When to use:** Preferred for skewed distributions (e.g. revenue, latency, page views)
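
The Iglewicz-Hoaglin score in code, a sketch that also handles the MAD = 0 degenerate case noted above by falling back to zero scores rather than dividing by zero:

```python
import statistics

def modified_z_scores(xs: list[float]) -> list[float]:
    """M_i = 0.6745 * |x_i - median| / MAD; flag observations with M_i > 3.5."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    if mad == 0:
        # Degenerate column (one dominant value): no meaningful scores.
        return [0.0] * len(xs)
    return [0.6745 * abs(x - med) / mad for x in xs]

data = [1.0, 2.0, 2.0, 3.0, 2.0, 100.0]
flags = [x for x, m in zip(data, modified_z_scores(data)) if m > 3.5]
print(flags)  # → [100.0]
```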

---

## Imputation Strategies

| Method | When | Risk |
|---|---|---|
| Mean | MCAR, continuous, symmetric distribution | Distorts variance; don't use with skewed data |
| Median | MCAR/MAR, continuous, skewed distribution | Safe for skewed data; loses variance |
| Mode | MCAR/MAR, categorical | Can over-represent one category |
| Forward-fill | Time series with MCAR/MAR gaps | Assumes the value persists — valid for slowly-changing fields |
| Binary indicator | Null % 1–30% | Preserves information about missingness without imputing |
| Model-based | MAR, high-value columns | Most accurate but computationally expensive |
| Drop column | > 50% missing with no business justification | Safest option if the column has no predictive value |

**Golden rule:** Always add a `col_was_null` indicator column when imputing with null % > 1%. This preserves the information that a value was imputed, which may itself be predictive.

---

## Common Silent Data Quality Failures

These are the issues that don't raise errors but corrupt results:

1. **Sentinel values** — `0`, `-1`, `9999`, `""` used to mean "unknown" in legacy systems
2. **Timezone-naive timestamps** — datetimes stored without a timezone; comparisons silently shift by hours
3. **Trailing whitespace** — `"active "` ≠ `"active"` causes silent join mismatches
4. **Encoding errors** — UTF-8 vs. Latin-1 mismatches produce garbled strings in one column
5. **Scientific notation** — `1e6` stored as a string gets treated as a category, not a number
6. **Implicit schema changes** — upstream adds a new category to a lookup field; existing code silently drops the new rows
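
Failure 3 is cheap to catch: normalize join keys before comparing. A minimal sketch (whether lowercasing is also appropriate depends on the key semantics):

```python
def normalize_key(s: str) -> str:
    """Strip surrounding whitespace and lowercase for join comparisons."""
    return s.strip().lower()

left = "active "   # trailing space from a legacy export
right = "Active"
print(left == right, normalize_key(left) == normalize_key(right))  # → False True
```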

---

## References

- Rubin, D. B. (1976). "Inference and Missing Data." *Biometrika* 63(3): 581–592.
- Iglewicz, B. & Hoaglin, D. (1993). *How to Detect and Handle Outliers*. ASQC Quality Press.
- DAMA International (2017). *DAMA-DMBOK: Data Management Body of Knowledge*. 2nd ed.
- ISO 8000-8: Data quality — Concepts and measuring.
257
data-analysis/data-quality-auditor/scripts/data_profiler.py
Normal file
@@ -0,0 +1,257 @@
#!/usr/bin/env python3
"""
data_profiler.py — Full dataset profile with Data Quality Score (DQS).

Usage:
    python3 data_profiler.py --file data.csv
    python3 data_profiler.py --file data.csv --columns col1,col2
    python3 data_profiler.py --file data.csv --format json
    python3 data_profiler.py --file data.csv --monitor
"""

import argparse
import csv
import json
import math
import sys
from collections import Counter


def load_csv(filepath: str) -> tuple[list[str], list[dict]]:
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        headers = reader.fieldnames or []
    return headers, rows


def infer_type(values: list[str]) -> str:
    """Infer dominant type from non-null string values."""
    counts = {"int": 0, "float": 0, "bool": 0, "string": 0}
    for v in values:
        v = v.strip()
        if v.lower() in ("true", "false"):
            counts["bool"] += 1
        else:
            try:
                int(v)
                counts["int"] += 1
            except ValueError:
                try:
                    float(v)
                    counts["float"] += 1
                except ValueError:
                    counts["string"] += 1
    dominant = max(counts, key=lambda k: counts[k])
    return dominant if counts[dominant] > 0 else "string"


def safe_mean(nums: list[float]) -> float | None:
    return sum(nums) / len(nums) if nums else None


def safe_std(nums: list[float], mean: float) -> float | None:
    if len(nums) < 2:
        return None
    variance = sum((x - mean) ** 2 for x in nums) / (len(nums) - 1)
    return math.sqrt(variance)


def profile_column(name: str, raw_values: list[str]) -> dict:
    total = len(raw_values)
    null_strings = {"", "null", "none", "n/a", "na", "nan", "nil"}
    null_count = sum(1 for v in raw_values if v.strip().lower() in null_strings)
    non_null = [v for v in raw_values if v.strip().lower() not in null_strings]

    col_type = infer_type(non_null)
    unique_values = set(non_null)
    top_values = Counter(non_null).most_common(5)

    profile = {
        "column": name,
        "total_rows": total,
        "null_count": null_count,
        "null_pct": round(null_count / total * 100, 2) if total else 0,
        "non_null_count": len(non_null),
        "unique_count": len(unique_values),
        "cardinality_pct": round(len(unique_values) / len(non_null) * 100, 2) if non_null else 0,
        "inferred_type": col_type,
        "top_values": top_values,
        "is_constant": len(unique_values) == 1,
        "is_high_cardinality": len(unique_values) / len(non_null) > 0.9 if len(non_null) > 10 else False,
    }

    if col_type in ("int", "float"):
        try:
            nums = [float(v) for v in non_null]
            mean = safe_mean(nums)
            profile["min"] = min(nums)
            profile["max"] = max(nums)
            profile["mean"] = round(mean, 4) if mean is not None else None
            # safe_std returns None for fewer than 2 values; guard before rounding.
            std = safe_std(nums, mean) if mean is not None else None
            profile["std"] = round(std, 4) if std is not None else None
        except ValueError:
            pass

    return profile


def compute_dqs(profiles: list[dict], total_rows: int) -> dict:
    """Compute Data Quality Score (0-100) across 5 dimensions."""
    if not profiles or total_rows == 0:
        return {"score": 0, "dimensions": {}}

    # Completeness (30%) — avg non-null rate
    avg_null_pct = sum(p["null_pct"] for p in profiles) / len(profiles)
    completeness = max(0, 100 - avg_null_pct)

    # Consistency (25%) — penalize constant cols and mixed-type signals
    constant_cols = sum(1 for p in profiles if p["is_constant"])
    consistency = max(0, 100 - (constant_cols / len(profiles)) * 100)

    # Validity (20%) — penalize high-cardinality string cols (proxy for free-text issues)
    high_card = sum(1 for p in profiles if p["is_high_cardinality"] and p["inferred_type"] == "string")
    validity = max(0, 100 - (high_card / len(profiles)) * 60)

    # Uniqueness (15%) — placeholder; duplicate detection needs full row comparison
    uniqueness = 90.0  # conservative default without row-level dedup check

    # Timeliness (10%) — placeholder; requires timestamp columns
    timeliness = 85.0  # conservative default

    score = (
        completeness * 0.30
        + consistency * 0.25
        + validity * 0.20
        + uniqueness * 0.15
        + timeliness * 0.10
    )

    return {
        "score": round(score, 1),
        "dimensions": {
            "completeness": round(completeness, 1),
            "consistency": round(consistency, 1),
            "validity": round(validity, 1),
            "uniqueness": uniqueness,
            "timeliness": timeliness,
        },
    }


def dqs_label(score: float) -> str:
    if score >= 85:
        return "PASS — Production-ready"
    elif score >= 65:
        return "WARN — Usable with documented caveats"
    else:
        return "FAIL — Remediation required before use"


def print_report(headers: list[str], profiles: list[dict], dqs: dict, total_rows: int, monitor: bool):
    print("=" * 64)
    print("DATA QUALITY AUDIT REPORT")
    print("=" * 64)
    print(f"Rows: {total_rows} | Columns: {len(headers)}")
    score = dqs["score"]
    indicator = "🟢" if score >= 85 else ("🟡" if score >= 65 else "🔴")
    print(f"\nData Quality Score (DQS): {score}/100 {indicator}")
    print(f"Verdict: {dqs_label(score)}")

    dims = dqs["dimensions"]
    print("\nDimension Breakdown:")
    for dim, val in dims.items():
        bar = int(val / 5)
        print(f"  {dim.capitalize():<14} {val:>5.1f} {'█' * bar}{'░' * (20 - bar)}")

    print("\n" + "-" * 64)
    print("COLUMN PROFILES")
    print("-" * 64)

    issues = []
    for p in profiles:
        status = "🟢"
        col_issues = []
        if p["null_pct"] > 30:
            status = "🔴"
            col_issues.append(f"{p['null_pct']}% nulls — investigate root cause")
        elif p["null_pct"] > 10:
            status = "🟡"
            col_issues.append(f"{p['null_pct']}% nulls — impute cautiously")
        elif p["null_pct"] > 1:
            col_issues.append(f"{p['null_pct']}% nulls — impute with indicator")
        if p["is_constant"]:
            status = "🟡"
            col_issues.append("Constant column — zero variance, likely useless")
        if p["is_high_cardinality"] and p["inferred_type"] == "string":
            col_issues.append("High-cardinality string — check if categorical or free-text")

        print(f"\n  {status} {p['column']}")
        print(f"     Type: {p['inferred_type']} | Nulls: {p['null_count']} ({p['null_pct']}%) | Unique: {p['unique_count']}")
        if "min" in p:
            print(f"     Min: {p['min']}  Max: {p['max']}  Mean: {p['mean']}  Std: {p['std']}")
        if p["top_values"]:
            top = ", ".join(f"{v}({c})" for v, c in p["top_values"][:3])
            print(f"     Top values: {top}")
        for issue in col_issues:
            issues.append((p["column"], issue))
            print(f"     ⚠ {issue}")

    if issues:
        print("\n" + "-" * 64)
        print(f"ISSUES SUMMARY ({len(issues)} found)")
        print("-" * 64)
        for col, msg in issues:
            print(f"  [{col}] {msg}")

    if monitor:
        print("\n" + "-" * 64)
        print("MONITORING THRESHOLDS (copy into alerting config)")
        print("-" * 64)
        for p in profiles:
            if p["null_pct"] > 0:
                print(f"  {p['column']}: null_pct <= {min(p['null_pct'] * 1.5, 100):.1f}%")
            if "mean" in p and p["mean"] is not None:
                drift = abs(p.get("std", 0) or 0) * 2
                print(f"  {p['column']}: mean within [{p['mean'] - drift:.2f}, {p['mean'] + drift:.2f}]")

    print("\n" + "=" * 64)


def main():
    parser = argparse.ArgumentParser(description="Profile a CSV dataset and compute a Data Quality Score.")
    parser.add_argument("--file", required=True, help="Path to CSV file")
    parser.add_argument("--columns", help="Comma-separated list of columns to profile (default: all)")
    parser.add_argument("--format", choices=["text", "json"], default="text")
    parser.add_argument("--monitor", action="store_true", help="Print monitoring thresholds")
    args = parser.parse_args()

    try:
        headers, rows = load_csv(args.file)
    except FileNotFoundError:
        print(f"Error: file not found: {args.file}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading file: {e}", file=sys.stderr)
        sys.exit(1)

    if not rows:
        print("Error: CSV file is empty or has no data rows.", file=sys.stderr)
        sys.exit(1)

    selected = args.columns.split(",") if args.columns else headers
    missing_cols = [c for c in selected if c not in headers]
    if missing_cols:
        print(f"Error: columns not found: {', '.join(missing_cols)}", file=sys.stderr)
        sys.exit(1)

    profiles = [profile_column(col, [row.get(col, "") for row in rows]) for col in selected]
    dqs = compute_dqs(profiles, len(rows))

    if args.format == "json":
        print(json.dumps({"total_rows": len(rows), "dqs": dqs, "columns": profiles}, indent=2))
    else:
        print_report(selected, profiles, dqs, len(rows), args.monitor)


if __name__ == "__main__":
    main()
@@ -0,0 +1,242 @@
#!/usr/bin/env python3
"""
missing_value_analyzer.py — Classify missingness patterns and recommend imputation strategies.

Usage:
    python3 missing_value_analyzer.py --file data.csv
    python3 missing_value_analyzer.py --file data.csv --threshold 0.05
    python3 missing_value_analyzer.py --file data.csv --format json
"""

import argparse
import csv
import json
import sys
from collections import defaultdict


NULL_STRINGS = {"", "null", "none", "n/a", "na", "nan", "nil", "undefined", "missing"}


def load_csv(filepath: str) -> tuple[list[str], list[dict]]:
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        headers = reader.fieldnames or []
    return headers, rows


def is_null(val: str) -> bool:
    return val.strip().lower() in NULL_STRINGS


def compute_null_mask(headers: list[str], rows: list[dict]) -> dict[str, list[bool]]:
    return {col: [is_null(row.get(col, "")) for row in rows] for col in headers}


def null_stats(mask: list[bool]) -> dict:
    total = len(mask)
    count = sum(mask)
    return {"count": count, "pct": round(count / total * 100, 2) if total else 0}


def classify_mechanism(col: str, mask: list[bool], all_masks: dict[str, list[bool]]) -> str:
    """
    Heuristic classification of missingness mechanism:
    - MCAR: nulls appear randomly, no correlation with other columns
    - MAR: nulls correlate with values in other observed columns
    - MNAR: nulls correlate with the missing column's own unobserved value (can't fully detect)

    Returns one of: "MCAR (likely)", "MAR (likely)", "MNAR (possible)", "Insufficient data"
    """
    null_indices = {i for i, v in enumerate(mask) if v}
    if not null_indices:
        return "None"

    n = len(mask)
    if n < 10:
        return "Insufficient data"

    # Check correlation with other columns' nulls
    correlated_cols = []
    for other_col, other_mask in all_masks.items():
        if other_col == col:
            continue
        other_null_indices = {i for i, v in enumerate(other_mask) if v}
        if not other_null_indices:
            continue
        overlap = len(null_indices & other_null_indices)
        union = len(null_indices | other_null_indices)
        jaccard = overlap / union if union else 0
        if jaccard > 0.5:
            correlated_cols.append(other_col)

    # Check if nulls are clustered (time/positional pattern) — proxy for MNAR
    sorted_indices = sorted(null_indices)
    if len(sorted_indices) > 2:
        gaps = [sorted_indices[i + 1] - sorted_indices[i] for i in range(len(sorted_indices) - 1)]
        avg_gap = sum(gaps) / len(gaps)
        clustered = avg_gap < n / len(null_indices) * 0.5  # nulls appear closer together than random
    else:
        clustered = False

    if correlated_cols:
        return f"MAR (likely) — co-occurs with nulls in: {', '.join(correlated_cols[:3])}"
    elif clustered:
        return "MNAR (possible) — nulls are spatially clustered, may reflect a systematic gap"
    else:
        return "MCAR (likely) — nulls appear random, no strong correlation detected"
|
||||||
|
|
||||||
|
|
||||||
|
def recommend_strategy(pct: float, col_type: str) -> str:
|
||||||
|
if pct == 0:
|
||||||
|
return "No action needed"
|
||||||
|
if pct < 1:
|
||||||
|
return "Drop rows — impact is negligible"
|
||||||
|
if pct < 10:
|
||||||
|
strategies = {
|
||||||
|
"int": "Impute with median + add binary indicator column",
|
||||||
|
"float": "Impute with median + add binary indicator column",
|
||||||
|
"string": "Impute with mode or 'Unknown' category + add indicator",
|
||||||
|
"bool": "Impute with mode",
|
||||||
|
}
|
||||||
|
return strategies.get(col_type, "Impute with median/mode + add indicator")
|
||||||
|
if pct < 30:
|
||||||
|
return "Impute cautiously; investigate root cause; document assumption; add indicator"
|
||||||
|
return "Do NOT impute blindly — > 30% missing. Escalate to domain owner or consider dropping column"
|
||||||
|
|
||||||
|
|
||||||
|
def infer_type(values: list[str]) -> str:
|
||||||
|
non_null = [v for v in values if not is_null(v)]
|
||||||
|
counts = {"int": 0, "float": 0, "bool": 0, "string": 0}
|
||||||
|
for v in non_null[:200]: # sample for speed
|
||||||
|
v = v.strip()
|
||||||
|
if v.lower() in ("true", "false"):
|
||||||
|
counts["bool"] += 1
|
||||||
|
else:
|
||||||
|
try:
|
||||||
|
int(v)
|
||||||
|
counts["int"] += 1
|
||||||
|
except ValueError:
|
||||||
|
try:
|
||||||
|
float(v)
|
||||||
|
counts["float"] += 1
|
||||||
|
except ValueError:
|
||||||
|
counts["string"] += 1
|
||||||
|
return max(counts, key=lambda k: counts[k]) if any(counts.values()) else "string"
|
||||||
|
|
||||||
|
|
||||||
|
def compute_cooccurrence(headers: list[str], masks: dict[str, list[bool]], top_n: int = 5) -> list[dict]:
|
||||||
|
"""Find column pairs where nulls most frequently co-occur."""
|
||||||
|
pairs = []
|
||||||
|
cols = list(headers)
|
||||||
|
for i in range(len(cols)):
|
||||||
|
for j in range(i + 1, len(cols)):
|
||||||
|
a, b = cols[i], cols[j]
|
||||||
|
mask_a, mask_b = masks[a], masks[b]
|
||||||
|
overlap = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
|
||||||
|
if overlap > 0:
|
||||||
|
pairs.append({"col_a": a, "col_b": b, "co_null_rows": overlap})
|
||||||
|
pairs.sort(key=lambda x: -x["co_null_rows"])
|
||||||
|
return pairs[:top_n]
|
||||||
|
|
||||||
|
|
||||||
|
def print_report(headers: list[str], rows: list[dict], masks: dict, threshold: float):
|
||||||
|
total = len(rows)
|
||||||
|
print("=" * 64)
|
||||||
|
print("MISSING VALUE ANALYSIS REPORT")
|
||||||
|
print("=" * 64)
|
||||||
|
print(f"Rows: {total} | Columns: {len(headers)}")
|
||||||
|
|
||||||
|
results = []
|
||||||
|
for col in headers:
|
||||||
|
mask = masks[col]
|
||||||
|
stats = null_stats(mask)
|
||||||
|
        # Skip fully complete columns and those below the reporting threshold
        if stats["count"] == 0 or stats["pct"] / 100 < threshold:
            continue
        raw_vals = [row.get(col, "") for row in rows]
        col_type = infer_type(raw_vals)
        mechanism = classify_mechanism(col, mask, masks)
        strategy = recommend_strategy(stats["pct"], col_type)
        results.append({
            "column": col,
            "null_count": stats["count"],
            "null_pct": stats["pct"],
            "col_type": col_type,
            "mechanism": mechanism,
            "strategy": strategy,
        })

    fully_complete = [col for col in headers if null_stats(masks[col])["count"] == 0]
    print(f"\nFully complete columns: {len(fully_complete)}/{len(headers)}")

    if not results:
        print(f"\nNo columns exceed the null threshold ({threshold * 100:.1f}%).")
    else:
        print(f"\nColumns with missing values (threshold >= {threshold * 100:.1f}%):\n")
        for r in sorted(results, key=lambda x: -x["null_pct"]):
            indicator = "🔴" if r["null_pct"] > 30 else ("🟡" if r["null_pct"] > 10 else "🟢")
            print(f"  {indicator} {r['column']}")
            print(f"     Nulls: {r['null_count']} ({r['null_pct']}%) | Type: {r['col_type']}")
            print(f"     Mechanism: {r['mechanism']}")
            print(f"     Strategy: {r['strategy']}")
            print()

    cooccur = compute_cooccurrence(headers, masks)
    if cooccur:
        print("-" * 64)
        print("NULL CO-OCCURRENCE (top pairs)")
        print("-" * 64)
        for pair in cooccur:
            print(f"  {pair['col_a']} + {pair['col_b']} → {pair['co_null_rows']} rows both null")

    print("\n" + "=" * 64)


def main():
    parser = argparse.ArgumentParser(description="Analyze missing values in a CSV dataset.")
    parser.add_argument("--file", required=True, help="Path to CSV file")
    parser.add_argument("--threshold", type=float, default=0.0,
                        help="Only show columns with null fraction above this (e.g. 0.05 = 5%%)")
    parser.add_argument("--format", choices=["text", "json"], default="text")
    args = parser.parse_args()

    try:
        headers, rows = load_csv(args.file)
    except FileNotFoundError:
        print(f"Error: file not found: {args.file}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading file: {e}", file=sys.stderr)
        sys.exit(1)

    if not rows:
        print("Error: CSV file is empty.", file=sys.stderr)
        sys.exit(1)

    masks = compute_null_mask(headers, rows)

    if args.format == "json":
        output = []
        for col in headers:
            mask = masks[col]
            stats = null_stats(mask)
            raw_vals = [row.get(col, "") for row in rows]
            col_type = infer_type(raw_vals)
            mechanism = classify_mechanism(col, mask, masks)
            strategy = recommend_strategy(stats["pct"], col_type)
            output.append({
                "column": col,
                "null_count": stats["count"],
                "null_pct": stats["pct"],
                "col_type": col_type,
                "mechanism": mechanism,
                "strategy": strategy,
            })
        print(json.dumps({"total_rows": len(rows), "columns": output}, indent=2))
    else:
        print_report(headers, rows, masks, args.threshold)


if __name__ == "__main__":
    main()
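The MAR heuristic in `classify_mechanism` above rests on Jaccard overlap between two columns' null masks. A minimal standalone sketch of that check, using synthetic masks and hypothetical column names (not part of the script itself):

```python
def jaccard_null_overlap(mask_a: list[bool], mask_b: list[bool]) -> float:
    """Jaccard similarity of the null positions in two boolean null masks."""
    a = {i for i, v in enumerate(mask_a) if v}
    b = {i for i, v in enumerate(mask_b) if v}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical columns whose nulls largely co-occur (e.g. one survey section skipped)
income = [False, True, True, False, True, False]
employer = [False, True, True, False, False, False]
print(jaccard_null_overlap(income, employer))  # 2/3: above the 0.5 cutoff, suggesting MAR
```

Anything above the script's 0.5 cutoff would be reported as "MAR (likely)"; disjoint null sets score 0.0 and fall through to the clustering check.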
262
data-analysis/data-quality-auditor/scripts/outlier_detector.py
Normal file
@@ -0,0 +1,262 @@
#!/usr/bin/env python3
"""
outlier_detector.py — Multi-method outlier detection for numeric columns.

Methods:
    iqr     — Interquartile Range (robust, non-parametric, default)
    zscore  — Standard Z-score (assumes normal distribution)
    mzscore — Modified Z-score via Median Absolute Deviation (robust to skew)

Usage:
    python3 outlier_detector.py --file data.csv
    python3 outlier_detector.py --file data.csv --method iqr
    python3 outlier_detector.py --file data.csv --method zscore --threshold 2.5
    python3 outlier_detector.py --file data.csv --columns col1,col2
    python3 outlier_detector.py --file data.csv --format json
"""

import argparse
import csv
import json
import math
import sys


NULL_STRINGS = {"", "null", "none", "n/a", "na", "nan", "nil", "undefined", "missing"}


def load_csv(filepath: str) -> tuple[list[str], list[dict]]:
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        headers = reader.fieldnames or []
    return headers, rows


def is_null(val: str) -> bool:
    return val.strip().lower() in NULL_STRINGS


def to_float(val: str) -> float | None:
    try:
        return float(val.strip())
    except (ValueError, AttributeError):
        return None


def median(nums: list[float]) -> float:
    s = sorted(nums)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def percentile(nums: list[float], p: float) -> float:
    """Linear interpolation percentile."""
    s = sorted(nums)
    n = len(s)
    if n == 1:
        return s[0]
    idx = p / 100 * (n - 1)
    lo = int(idx)
    hi = lo + 1
    frac = idx - lo
    if hi >= n:
        return s[-1]
    return s[lo] + frac * (s[hi] - s[lo])


def mean(nums: list[float]) -> float:
    return sum(nums) / len(nums)


def std(nums: list[float], mu: float) -> float:
    if len(nums) < 2:
        return 0.0
    variance = sum((x - mu) ** 2 for x in nums) / (len(nums) - 1)
    return math.sqrt(variance)


# --- Detection methods ---

def detect_iqr(nums: list[float], multiplier: float = 1.5) -> dict:
    q1 = percentile(nums, 25)
    q3 = percentile(nums, 75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    outliers = [x for x in nums if x < lower or x > upper]
    return {
        "method": "IQR",
        "q1": round(q1, 4),
        "q3": round(q3, 4),
        "iqr": round(iqr, 4),
        "lower_bound": round(lower, 4),
        "upper_bound": round(upper, 4),
        "outlier_count": len(outliers),
        "outlier_pct": round(len(outliers) / len(nums) * 100, 2),
        "outlier_values": sorted(set(round(x, 4) for x in outliers))[:10],
    }


def detect_zscore(nums: list[float], threshold: float = 3.0) -> dict:
    mu = mean(nums)
    sigma = std(nums, mu)
    if sigma == 0:
        return {"method": "Z-score", "outlier_count": 0, "outlier_pct": 0.0,
                "note": "Zero variance — all values identical"}
    zscores = [(x, abs((x - mu) / sigma)) for x in nums]
    outliers = [x for x, z in zscores if z > threshold]
    return {
        "method": "Z-score",
        "mean": round(mu, 4),
        "std": round(sigma, 4),
        "threshold": threshold,
        "outlier_count": len(outliers),
        "outlier_pct": round(len(outliers) / len(nums) * 100, 2),
        "outlier_values": sorted(set(round(x, 4) for x in outliers))[:10],
    }


def detect_modified_zscore(nums: list[float], threshold: float = 3.5) -> dict:
    """Iglewicz-Hoaglin modified Z-score using Median Absolute Deviation."""
    med = median(nums)
    mad = median([abs(x - med) for x in nums])
    if mad == 0:
        return {"method": "Modified Z-score (MAD)", "outlier_count": 0, "outlier_pct": 0.0,
                "note": "MAD is zero — consider Z-score instead"}
    mzscores = [(x, 0.6745 * abs(x - med) / mad) for x in nums]
    outliers = [x for x, mz in mzscores if mz > threshold]
    return {
        "method": "Modified Z-score (MAD)",
        "median": round(med, 4),
        "mad": round(mad, 4),
        "threshold": threshold,
        "outlier_count": len(outliers),
        "outlier_pct": round(len(outliers) / len(nums) * 100, 2),
        "outlier_values": sorted(set(round(x, 4) for x in outliers))[:10],
    }


def classify_outlier_risk(pct: float, col: str) -> str:
    """Heuristic: flag whether outliers are likely data errors or legitimate extremes."""
    if pct > 10:
        return "High outlier rate — likely systematic data quality issue or wrong data type"
    if pct > 5:
        return "Elevated outlier rate — investigate source; may be mixed populations"
    if pct > 1:
        return "Moderate — review individually; could be legitimate extremes or entry errors"
    if pct > 0:
        return "Low — verify extreme values against source; likely legitimate but worth checking"
    return "Clean — no outliers detected"


def analyze_column(col: str, nums: list[float], method: str, threshold: float) -> dict:
    if len(nums) < 4:
        return {"column": col, "status": "Skipped — fewer than 4 numeric values"}

    if method == "iqr":
        # main() already resolves the per-method default, so pass the threshold through
        result = detect_iqr(nums, multiplier=threshold)
    elif method == "zscore":
        result = detect_zscore(nums, threshold=threshold)
    elif method == "mzscore":
        result = detect_modified_zscore(nums, threshold=threshold)
    else:
        result = detect_iqr(nums)

    result["column"] = col
    result["total_numeric"] = len(nums)
    result["risk_assessment"] = classify_outlier_risk(result.get("outlier_pct", 0), col)
    return result


def print_report(results: list[dict]):
    print("=" * 64)
    print("OUTLIER DETECTION REPORT")
    print("=" * 64)

    clean = [r for r in results if r.get("outlier_count", 0) == 0 and "status" not in r]
    flagged = [r for r in results if r.get("outlier_count", 0) > 0]
    skipped = [r for r in results if "status" in r]

    print(f"\nColumns analyzed: {len(results) - len(skipped)}")
    print(f"Clean:   {len(clean)}")
    print(f"Flagged: {len(flagged)}")
    if skipped:
        print(f"Skipped: {len(skipped)} ({', '.join(r['column'] for r in skipped)})")

    if flagged:
        print("\n" + "-" * 64)
        print("FLAGGED COLUMNS")
        print("-" * 64)
        for r in sorted(flagged, key=lambda x: -x.get("outlier_pct", 0)):
            pct = r.get("outlier_pct", 0)
            indicator = "🔴" if pct > 5 else "🟡"
            print(f"\n  {indicator} {r['column']} ({r['method']})")
            print(f"     Outliers: {r['outlier_count']} / {r['total_numeric']} rows ({pct}%)")
            if "lower_bound" in r:
                print(f"     Bounds: [{r['lower_bound']}, {r['upper_bound']}] | IQR: {r['iqr']}")
            if "mean" in r:
                print(f"     Mean: {r['mean']} | Std: {r['std']} | Threshold: ±{r['threshold']}σ")
            if "median" in r:
                print(f"     Median: {r['median']} | MAD: {r['mad']} | Threshold: {r['threshold']}")
            if r.get("outlier_values"):
                vals = ", ".join(str(v) for v in r["outlier_values"][:8])
                print(f"     Sample outlier values: {vals}")
            print(f"     Assessment: {r['risk_assessment']}")

    if clean:
        cols = ", ".join(r["column"] for r in clean)
        print(f"\n🟢 Clean columns: {cols}")

    print("\n" + "=" * 64)


def main():
    parser = argparse.ArgumentParser(description="Detect outliers in numeric columns of a CSV dataset.")
    parser.add_argument("--file", required=True, help="Path to CSV file")
    parser.add_argument("--method", choices=["iqr", "zscore", "mzscore"], default="iqr",
                        help="Detection method (default: iqr)")
    parser.add_argument("--threshold", type=float, default=None,
                        help="Method threshold (IQR multiplier default 1.5; Z-score default 3.0; mzscore default 3.5)")
    parser.add_argument("--columns", help="Comma-separated columns to check (default: all numeric)")
    parser.add_argument("--format", choices=["text", "json"], default="text")
    args = parser.parse_args()

    # Set default thresholds per method
    if args.threshold is None:
        args.threshold = {"iqr": 1.5, "zscore": 3.0, "mzscore": 3.5}[args.method]

    try:
        headers, rows = load_csv(args.file)
    except FileNotFoundError:
        print(f"Error: file not found: {args.file}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading file: {e}", file=sys.stderr)
        sys.exit(1)

    if not rows:
        print("Error: CSV file is empty.", file=sys.stderr)
        sys.exit(1)

    selected = args.columns.split(",") if args.columns else headers
    missing_cols = [c for c in selected if c not in headers]
    if missing_cols:
        print(f"Error: columns not found: {', '.join(missing_cols)}", file=sys.stderr)
        sys.exit(1)

    results = []
    for col in selected:
        raw = [row.get(col, "") for row in rows]
        nums = [n for v in raw if not is_null(v) and (n := to_float(v)) is not None]
        results.append(analyze_column(col, nums, args.method, args.threshold))

    if args.format == "json":
        print(json.dumps(results, indent=2))
    else:
        print_report(results)


if __name__ == "__main__":
    main()
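The 0.6745 constant in `detect_modified_zscore` rescales the MAD so that, for normally distributed data, the modified Z-score is comparable to an ordinary Z-score (0.6745 is roughly the 75th-percentile point of the standard normal). A self-contained sketch of the computation on a small hypothetical sample:

```python
def modified_zscores(nums: list[float]) -> list[float]:
    """Iglewicz-Hoaglin modified Z-scores, mirroring the script's MAD method."""
    s = sorted(nums)
    n = len(s)
    med = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    devs = sorted(abs(x - med) for x in nums)
    mad = devs[n // 2] if n % 2 else (devs[n // 2 - 1] + devs[n // 2]) / 2
    return [0.6745 * abs(x - med) / mad for x in nums]

sample = [10.0, 11.0, 12.0, 11.5, 10.5, 99.0]  # 99.0 is the planted outlier
scores = modified_zscores(sample)
print(scores[-1] > 3.5)  # True: only the planted value exceeds the 3.5 cutoff
```

Because median and MAD ignore extreme values, the planted 99.0 barely inflates the scale estimate, which is why this method holds up on skewed data where the mean/std Z-score would mask its own outliers.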