
How I Built a PII Detection Engine with LLMs

The architecture behind an ML-powered system that continuously scans codebases, databases, and SaaS integrations for sensitive data.


The Problem

Sensitive data ends up everywhere. Customer emails in log files. Social security numbers in staging databases. API keys in Slack messages. If you’ve worked at any company handling user data, you know this isn’t a hypothetical — it’s Tuesday.

My goal was to build a system that could continuously scan codebases, databases, and SaaS integrations to find PII before it became a compliance liability. The hard part wasn’t finding obvious patterns like 123-45-6789. The hard part was everything else: names that look like city names, addresses embedded in free-text fields, medical record numbers that look like order IDs.

Manual audits don’t scale. You can’t ask engineers to tag every field correctly — they won’t, and even when they try, data drifts. I needed something automated, accurate, and fast enough to run continuously.

Stage 1: Regex and Heuristics

The first pass was straightforward. Pattern matching catches the low-hanging fruit: SSNs, credit card numbers, phone numbers, email addresses. These have predictable formats, and regex handles them well.

I built a rule engine with about 40 patterns covering common PII formats across US, EU, and APAC regions. It was fast — scanning a million records in under a minute — and caught roughly 70% of true PII.

The problem was the other 30%. And the false positive rate was brutal. Phone numbers matched order IDs. ZIP codes matched internal codes. The string “John Smith” in a comment about a fictional user triggered alerts just like a real customer name in a database column.

A 30% miss rate and a 15% false positive rate is not a detection engine. It’s a suggestion engine.

Stage 2: Named Entity Recognition

To handle contextual entities — names, addresses, organizations — I added a Named Entity Recognition layer using spaCy with a fine-tuned model. NER understands that “Springfield” after “123 Main St” is probably an address, while “Springfield” in a column called server_region is not.

I trained the model on a labeled dataset of about 50,000 examples drawn from anonymized production data. The key insight was including the surrounding context: column names, neighboring fields, file paths. A value of “M” means nothing alone, but in a column called gender next to patient_name, it’s clearly PII.

This brought recall up to about 88% and dropped false positives to around 8%. Good, but not good enough for a system that would generate alerts engineers actually needed to act on.
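A sketch of that context enrichment, with assumed context keys (`table`, `column_name`, `neighbors`) rather than the actual production schema:

```python
def enrich(value: str, context: dict) -> str:
    # Serialize schema context into the text the NER model sees, so a bare
    # "M" becomes "[patients.gender] M | patient_name=J. Doe".
    schema = ".".join(
        p for p in (context.get("table"), context.get("column_name")) if p
    )
    pieces = [f"[{schema}]" if schema else "", value]
    if context.get("neighbors"):
        pieces.append(
            "| " + " ".join(f"{k}={v}" for k, v in context["neighbors"].items())
        )
    return " ".join(p for p in pieces if p)
```

The same serialization has to be applied at both training and inference time, otherwise the model never sees the context it was trained on.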

Stage 3: LLM Review for Ambiguous Cases

The final stage handles the hard cases — the ones where regex says “maybe” and NER says “I’m not sure.” These get routed to an LLM with a structured prompt that includes the value, its context (schema, neighboring data, source system), and a request for a classification with confidence score.

from dataclasses import dataclass
from enum import Enum
import json
import re

import openai
import spacy

class PIICategory(Enum):
    SSN = "ssn"
    EMAIL = "email"
    PHONE = "phone"
    NAME = "name"
    ADDRESS = "address"
    MEDICAL_ID = "medical_id"
    NONE = "none"

@dataclass
class DetectionResult:
    value: str
    category: PIICategory
    confidence: float
    stage: str
    context: dict

class PIIDetectionPipeline:
    def __init__(self):
        self.nlp = spacy.load("en_pii_custom_v3")
        self.regex_patterns = self._load_patterns()
        self.llm_threshold = 0.6  # route to LLM below this confidence

    def scan(self, value: str, context: dict) -> DetectionResult:
        # Stage 1: Regex -- fast, high precision for structured PII
        regex_result = self._regex_scan(value)
        if regex_result and regex_result.confidence > 0.95:
            return regex_result

        # Stage 2: NER -- contextual entity detection
        ner_result = self._ner_scan(value, context)
        if ner_result and ner_result.confidence > self.llm_threshold:
            return ner_result

        # Stage 3: LLM -- ambiguous cases only
        if regex_result or ner_result:
            best_guess = regex_result or ner_result
            if best_guess.confidence > 0.3:
                return self._llm_review(value, context, best_guess)

        return DetectionResult(value, PIICategory.NONE, 0.0, "none", context)

    def _regex_scan(self, value: str) -> DetectionResult | None:
        for pattern_name, pattern in self.regex_patterns.items():
            if match := re.search(pattern, value):
                return DetectionResult(
                    value=match.group(),
                    category=PIICategory(pattern_name),
                    confidence=0.97,
                    stage="regex",
                    context={}
                )
        return None

    def _ner_scan(self, value: str, context: dict) -> DetectionResult | None:
        enriched = f"[{context.get('column_name', '')}] {value}"
        doc = self.nlp(enriched)
        if doc.ents:
            top = max(doc.ents, key=lambda e: e._.confidence)
            return DetectionResult(
                value=top.text,
                category=PIICategory(top.label_.lower()),
                confidence=top._.confidence,
                stage="ner",
                context=context
            )
        return None

    def _llm_review(
        self, value: str, context: dict, prior: DetectionResult
    ) -> DetectionResult:
        prompt = f"""Classify whether this value is PII.
Value: {value}
Source: {context.get('source', 'unknown')}
Column/Field: {context.get('column_name', 'unknown')}
Neighboring fields: {context.get('neighbors', [])}
Prior classification: {prior.category.value} (confidence: {prior.confidence:.2f})

Respond with JSON: {{"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}}"""

        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0
        )
        result = response.choices[0].message.content
        parsed = json.loads(result)  # response_format guarantees a JSON object
        return DetectionResult(
            value=value,
            category=PIICategory(parsed["category"]),
            confidence=parsed["confidence"],
            stage="llm",
            context={**context, "reasoning": parsed["reasoning"]}
        )

The critical design decision: the LLM only sees cases that the cheaper stages couldn’t resolve confidently. In production, roughly 5% of scanned values reach Stage 3. This keeps costs manageable and latency low — the median scan takes 12ms, with LLM-routed cases adding ~800ms.
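A back-of-envelope check with the numbers above shows why the routing works: the LLM's latency penalty only applies to the 5% of values that reach it.

```python
# Back-of-envelope using the article's numbers: 5% of values routed to
# the LLM at ~800 ms extra, on top of a 12 ms base scan.
p_llm = 0.05
base_ms = 12
llm_ms = 800
mean_ms = (1 - p_llm) * base_ms + p_llm * (base_ms + llm_ms)
# mean_ms works out to ~52 ms, while the median stays at the 12 ms base
```

The mean is dominated by the rare LLM calls, which is why the median, not the mean, is the number that reflects typical scan latency.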

Pipeline Architecture

The system runs as a set of workers pulling from a job queue. Connectors for each data source (PostgreSQL, S3, GitHub, Slack) emit scan jobs. Each job contains the value, its context, and metadata about the source. Workers run the three-stage pipeline and write results to a findings database.
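The worker loop itself is not shown above; here is a minimal single-process sketch, assuming an in-memory queue and a pluggable scan function (the production system used a distributed job queue and persisted to a findings database):

```python
import queue
from dataclasses import dataclass, field

@dataclass
class ScanJob:
    source: str               # e.g. "postgres", "s3", "github", "slack"
    value: str
    context: dict = field(default_factory=dict)

def drain(jobs: queue.Queue, scan, findings: list) -> None:
    """Run `scan` over every queued job and keep non-empty results."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        result = scan(job.value, job.context)
        if result is not None:
            findings.append((job.source, result))
```

In production, each connector enqueues jobs and multiple workers run this loop concurrently, so a slow LLM-routed case never blocks the rest of the scan.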

A scheduler triggers full scans weekly and incremental scans on every commit or database migration. Alerts go to Slack with enough context for engineers to triage without switching tools.
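What "enough context" means in practice: the alert names the category, the exact location, and a masked preview, never the raw value. A hypothetical formatter (the field names are illustrative, not the production schema):

```python
def format_alert(finding: dict) -> str:
    # Render a triage-ready alert line. Masks the raw value so the
    # alert itself doesn't leak PII into Slack.
    masked = finding["value"][:2] + "***"
    return (
        f":rotating_light: {finding['category'].upper()} in {finding['source']} "
        f"({finding['location']}), value `{masked}`, "
        f"confidence {finding['confidence']:.0%}"
    )
```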

What Worked

The multi-stage approach was the right call. Regex handles roughly 70% of detections at near-zero cost, and NER picks up another 20%. The LLM resolves the ambiguous remainder (about 10% of candidate detections, the ~5% of scanned values that reach Stage 3), where context truly matters. Overall precision landed at 94% with 96% recall, good enough that engineers trusted the alerts and actually fixed findings.

What Didn’t

Fine-tuning the NER model was a bigger time investment than I expected. Labeling 50,000 examples took three weeks, and the model needed retraining every quarter as new data patterns emerged. If I were starting over, I’d invest earlier in an active learning loop that surfaces uncertain cases for human labeling.
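An uncertainty-sampling loop along those lines might look like the sketch below; the confidence band and batch size are made-up defaults, not tuned values:

```python
def select_for_labeling(results: list, band=(0.35, 0.65), k: int = 50) -> list:
    # Uncertainty sampling: keep mid-confidence detections and surface
    # the most ambiguous ones (closest to 0.5) first for human labeling.
    lo, hi = band
    uncertain = [r for r in results if lo <= r["confidence"] <= hi]
    return sorted(uncertain, key=lambda r: abs(r["confidence"] - 0.5))[:k]
```

Labels collected this way concentrate annotation effort exactly where the model is weakest, which is cheaper than relabeling a uniform sample every quarter.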

The LLM stage also introduced a dependency I wasn’t thrilled about. API rate limits, model version changes, and cost unpredictability are real operational concerns. I’m exploring local models as a replacement for this stage, but the accuracy gap is still meaningful as of late 2024.

The system has been running in production for eight months. It’s caught PII in places nobody expected — debug logs, analytics event payloads, even README files. The lesson: sensitive data doesn’t stay where you put it, so you need a system that doesn’t assume it will.

Interested in this kind of work? Let's talk
