Adds a new data-quality-auditor skill with three stdlib-only Python tools: - data_profiler.py: full dataset profile with DQS (0-100) across 5 dimensions - missing_value_analyzer.py: MCAR/MAR/MNAR classification + imputation strategies - outlier_detector.py: IQR, Z-score, and Modified Z-score (MAD) outlier detection Validator: 86.4/100 (GOOD). Security audit: PASS (0 critical/high). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5.5 KiB
5.5 KiB
Data Quality Concepts Reference
Deep-dive reference for the Data Quality Auditor skill. Keep SKILL.md lean — this is where the theory lives.
Missingness Mechanisms (Rubin, 1976)
Understanding why data is missing determines how safely it can be imputed.
MCAR — Missing Completely At Random
- The probability of missingness is independent of both observed and unobserved data.
- Example: A sensor drops a reading due to random hardware noise.
- Safe to impute? Yes. Imputing with mean/median introduces no systematic bias.
- Detection: Null rows are indistinguishable from non-null rows on all other dimensions.
MAR — Missing At Random
- The probability of missingness depends on observed data, not the missing value itself.
- Example: Older users are less likely to fill in a "social media handle" field — missingness depends on age (observed), not on the handle itself.
- Safe to impute? Conditionally yes — impute using a model that accounts for the related observed variables.
- Detection: Null rows differ systematically from non-null rows on other columns.
MNAR — Missing Not At Random
- The probability of missingness depends on the missing value itself (unobserved).
- Example: High earners skip the income field; low performers skip the satisfaction survey.
- Safe to impute? No — imputation will introduce systematic bias. Escalate to domain owner.
- Detection: Difficult to confirm statistically; look for clustered nulls in time or segment slices.
Data Quality Score (DQS) Methodology
The DQS is a weighted composite of five ISO 8000 / DAMA-aligned dimensions:
| Dimension | Weight | Rationale |
|---|---|---|
| Completeness | 30% | Nulls are the most common and impactful quality failure |
| Consistency | 25% | Type/format violations corrupt joins and aggregations silently |
| Validity | 20% | Out-of-domain values (negative ages, future birth dates) create invisible errors |
| Uniqueness | 15% | Duplicate rows inflate metrics and invalidate joins |
| Timeliness | 10% | Stale data causes decisions based on outdated state |
Scoring thresholds align to production-readiness standards:
- 85–100: Ready for production use in models and dashboards
- 65–84: Usable for exploratory analysis with documented caveats
- 0–64: Unreliable; remediation required before use in any decision-making context
Outlier Detection Methods
IQR (Interquartile Range)
- Formula: Outlier if
x < Q1 − 1.5×IQRorx > Q3 + 1.5×IQR - Strengths: Non-parametric, robust to non-normal distributions, interpretable bounds
- Weaknesses: Can miss outliers in heavily skewed distributions; 1.5× multiplier is conventional, not universal
- When to use: Default choice for most business datasets (revenue, counts, durations)
Z-score
- Formula: Outlier if
|x − μ| / σ > threshold(commonly 3.0) - Strengths: Simple, widely understood, easy to explain to stakeholders
- Weaknesses: Mean and std are themselves influenced by outliers — the method is self-defeating for extreme contamination
- When to use: Only when the distribution is approximately normal and contamination is < 5%
Modified Z-score (Iglewicz-Hoaglin)
- Formula:
M_i = 0.6745 × |x_i − median| / MAD; outlier ifM_i > 3.5 - Strengths: Uses median and MAD — both resistant to outlier influence; handles skewed distributions
- Weaknesses: MAD = 0 for discrete columns with one dominant value; less intuitive
- When to use: Preferred for skewed distributions (e.g. revenue, latency, page views)
Imputation Strategies
| Method | When | Risk |
|---|---|---|
| Mean | MCAR, continuous, symmetric distribution | Distorts variance; don't use with skewed data |
| Median | MCAR/MAR, continuous, skewed distribution | Safe for skewed; loses variance |
| Mode | MCAR/MAR, categorical | Can over-represent one category |
| Forward-fill | Time series with MCAR/MAR gaps | Assumes value persists — valid for slowly-changing fields |
| Binary indicator | Null % 1–30% | Preserves information about missingness without imputing |
| Model-based | MAR, high-value columns | Most accurate but computationally expensive |
| Drop column | > 50% missing with no business justification | Safest option if column has no predictive value |
Golden rule: Always add a col_was_null indicator column when imputing with null% > 1%. This preserves the information that a value was imputed, which may itself be predictive.
Common Silent Data Quality Failures
These are the issues that don't raise errors but corrupt results:
- Sentinel values —
0,-1,9999,""used to mean "unknown" in legacy systems - Timezone naive timestamps — datetimes stored without timezone; comparisons silently shift by hours
- Trailing whitespace —
"active "≠"active"causes silent join mismatches - Encoding errors — UTF-8 vs Latin-1 mismatches produce garbled strings in one column
- Scientific notation —
1e6stored as string gets treated as a category not a number - Implicit schema changes — upstream adds a new category to a lookup field; existing code silently drops new rows
References
- Rubin, D.B. (1976). "Inference and Missing Data." Biometrika 63(3): 581–592.
- Iglewicz, B. & Hoaglin, D. (1993). How to Detect and Handle Outliers. ASQC Quality Press.
- DAMA International (2017). DAMA-DMBOK: Data Management Body of Knowledge. 2nd ed.
- ISO 8000-8: Data quality — Concepts and measuring.