---
title: "Building a Privacy Scanner: A Step-by-Step Implementation Guide"
author: "AI Tools Suite"
date: "2024-12-23"
categories: [tutorial, privacy, pii-detection, python, svelte]
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: false
    code-line-numbers: true
---

## Introduction

In this tutorial, we'll build a production-grade Privacy Scanner from scratch. By the end, you'll have a tool that detects 40+ types of Personally Identifiable Information (PII) using an eight-layer detection pipeline, complete with a modern web interface.

Our stack: **FastAPI** for the backend API, **SvelteKit** for the frontend, and **Python regex** with validation logic for detection.

## Step 1: Project Structure

First, create the project scaffolding:

```bash
mkdir -p ai_tools_suite/{backend/routers,frontend/src/routes/privacy-scanner}
cd ai_tools_suite
```

Once you add the files from the following steps, your directory structure should look like:

```
ai_tools_suite/
├── backend/
│   ├── main.py
│   └── routers/
│       └── privacy.py
└── frontend/
    └── src/
        └── routes/
            └── privacy-scanner/
                └── +page.svelte
```

## Step 2: Define PII Patterns

The foundation of any PII scanner is its pattern library. Create `backend/routers/privacy.py` and start with the core patterns:

```python
import re
from typing import List, Dict, Any
from pydantic import BaseModel

class PIIEntity(BaseModel):
    type: str
    value: str
    start: int
    end: int
    confidence: float
    context: str = ""

PII_PATTERNS = {
    # Identity Documents
    "SSN": {
        "pattern": r'\b\d{3}-\d{2}-\d{4}\b',
        "description": "US Social Security Number",
        "category": "identity"
    },
    "PASSPORT": {
        "pattern": r'\b[A-Z]{1,2}\d{6,9}\b',
        "description": "Passport Number",
        "category": "identity"
    },

    # Financial Information
    "CREDIT_CARD": {
        "pattern": r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b',
        "description": "Credit Card Number (Visa, MC, Amex)",
        "category": "financial"
    },
    "IBAN": {
        "pattern": r'\b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b',
        "description": "International Bank Account Number",
        "category": "financial"
    },

    # Contact Information
    "EMAIL": {
        "pattern": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        "description": "Email Address",
        "category": "contact"
    },
    "PHONE_US": {
        # (?<!\w) instead of \b so a leading "+1" or "(" is part of the match
        "pattern": r'(?<!\w)(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
        "description": "US Phone Number",
        "category": "contact"
    },

    # Add more patterns as needed...
}
```

Each pattern includes a regex, human-readable description, and category for risk classification.

## Step 3: Build the Basic Detection Engine

Add the core detection function:

```python
def detect_pii_basic(text: str) -> List[PIIEntity]:
    """Layer 1: Standard regex pattern matching."""
    entities = []

    for pii_type, config in PII_PATTERNS.items():
        pattern = re.compile(config["pattern"], re.IGNORECASE)

        for match in pattern.finditer(text):
            entity = PIIEntity(
                type=pii_type,
                value=match.group(),
                start=match.start(),
                end=match.end(),
                confidence=0.8,  # Base confidence
                context=text[max(0, match.start()-20):match.end()+20]
            )
            entities.append(entity)

    return entities
```

This gives us working PII detection, but it's easily fooled by obfuscation.
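
To make that limitation concrete, here is a small self-contained sketch. `SAMPLE_PATTERNS` and `find_matches` are illustrative names rather than part of the scanner; they reuse two regexes from Step 2 to show a clean hit and then a miss once spaces replace the expected formatting.

```python
import re

# Two regexes copied from PII_PATTERNS in Step 2
SAMPLE_PATTERNS = {
    "SSN": r'\b\d{3}-\d{2}-\d{4}\b',
    "CREDIT_CARD": r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b',
}

def find_matches(text: str) -> list[tuple[str, str]]:
    """Return (type, matched value) pairs for every pattern hit."""
    return [
        (pii_type, match.group())
        for pii_type, pattern in SAMPLE_PATTERNS.items()
        for match in re.finditer(pattern, text)
    ]

plain_hits = find_matches("SSN 123-45-6789, card 4111111111111111")
spaced_hits = find_matches("SSN 123 45 6789, card 4111 1111 1111 1111")

print(plain_hits)   # both values are found
print(spaced_hits)  # empty: neither regex tolerates the extra spaces
```

The second call is the case the next layer is meant to handle.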

## Step 4: Add Text Normalization (Layer 2)

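Attackers often hide PII using separators, leetspeak, or Unicode tricks. Add a normalization pass for the first two:

```python
def normalize_text(text: str) -> tuple[str, dict]:
    """Layer 2: Remove obfuscation techniques."""
    mappings = {}

    # Leetspeak conversion, limited to digits sandwiched between letters
    # ("j0hn" -> "john") so genuine digit runs such as card numbers survive
    leet_map = {'0': 'o', '1': 'i', '3': 'e', '4': 'a', '5': 's', '7': 't'}
    normalized = re.sub(
        r'(?<=[A-Za-z])[013457](?=[A-Za-z])',
        lambda m: leet_map[m.group()],
        text,
    )

    # Remove common separators, but only between digits, so words stay apart
    normalized = re.sub(r'(?<=\d)[\s\-\.\(\)]+(?=\d)', '', normalized)

    # Track position mappings for accurate reporting
    # (simplified - production code needs full position tracking)

    return normalized, mappings
```

Now `4-5-6-7-8-9-0-1-2-3` collapses to `4567890123`, which the `PHONE_US` pattern from Step 2 can match on a second pass. A separator-free SSN such as `123456789` still needs its own pattern, because the `SSN` regex requires hyphens.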
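
The position tracking that the code comment defers could look something like the sketch below. It is only one possible approach, and `strip_separators_with_map` and `to_original_span` are hypothetical helper names: while dropping separator characters, it records the original index of every character it keeps, so a match in the normalized text can be translated back to a span in the input.

```python
import re

# Same separator class as the normalization regex
SEPARATORS = re.compile(r'[\s\-\.\(\)]')

def strip_separators_with_map(text: str) -> tuple[str, list[int]]:
    """Drop separator characters and remember where each kept character came from."""
    kept_chars = []
    index_map = []  # index_map[i] = position in `text` of normalized character i
    for position, char in enumerate(text):
        if SEPARATORS.match(char):
            continue
        kept_chars.append(char)
        index_map.append(position)
    return "".join(kept_chars), index_map

def to_original_span(index_map: list[int], start: int, end: int) -> tuple[int, int]:
    """Translate a [start, end) span in the normalized text back to the original."""
    return index_map[start], index_map[end - 1] + 1

text = "id: 4-5-6 7.8.9"
normalized, index_map = strip_separators_with_map(text)
match = re.search(r'\d{6}', normalized)
original_start, original_end = to_original_span(index_map, match.start(), match.end())

print(normalized)                          # id:456789
print(text[original_start:original_end])   # 4-5-6 7.8.9
```

The index list costs one integer per kept character, which is usually acceptable for text pasted into a scanner.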

## Step 5: Implement Checksum Validation (Layer 4)

Not every number sequence is valid PII. Add validation logic:

```python
def luhn_checksum(card_number: str) -> bool:
    """Validate credit card using Luhn algorithm."""
    digits = [int(d) for d in card_number if d.isdigit()]
    odd_digits = digits[-1::-2]
    even_digits = digits[-2::-2]

    total = sum(odd_digits)
    for d in even_digits:
        total += sum(divmod(d * 2, 10))

    return total % 10 == 0

def validate_iban(iban: str) -> bool:
    """Validate IBAN using MOD-97 algorithm."""
    iban = iban.replace(' ', '').upper()

    # Move first 4 chars to end
    rearranged = iban[4:] + iban[:4]

    # Convert letters to numbers (A=10, B=11, etc.)
    numeric = ''
    for char in rearranged:
        if char.isdigit():
            numeric += char
        else:
            numeric += str(ord(char) - 55)

    return int(numeric) % 97 == 1
```

With validation, we can boost confidence for valid numbers and flag invalid ones as `POSSIBLE_CARD_PATTERN`.
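
As a quick sanity check, the snippet below runs both algorithms against widely used test values: the Visa test number `4111111111111111` and the example IBAN `GB82 WEST 1234 5698 7654 32`. `luhn_ok` and `iban_ok` are compact restatements of the two functions above so the snippet runs on its own, and `score_card` is a hypothetical helper that applies the confidence values used later in Step 8.

```python
def luhn_ok(number: str) -> bool:
    """Same Luhn check as luhn_checksum above, restated compactly."""
    digits = [int(d) for d in number if d.isdigit()]
    total = sum(digits[-1::-2])
    total += sum(sum(divmod(d * 2, 10)) for d in digits[-2::-2])
    return total % 10 == 0

def iban_ok(iban: str) -> bool:
    """Same MOD-97 check as validate_iban above, restated compactly."""
    iban = iban.replace(" ", "").upper()
    rearranged = iban[4:] + iban[:4]
    # int(char, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35
    numeric = "".join(str(int(char, 36)) for char in rearranged)
    return int(numeric) % 97 == 1

def score_card(number: str) -> tuple[str, float]:
    """Map the Luhn result to a (type, confidence) pair (hypothetical helper)."""
    if luhn_ok(number):
        return "CREDIT_CARD", 0.95
    return "POSSIBLE_CARD_PATTERN", 0.5

print(score_card("4111111111111111"))          # passes Luhn
print(score_card("4111111111111112"))          # last digit changed, fails Luhn
print(iban_ok("GB82 WEST 1234 5698 7654 32"))  # standard example IBAN
print(iban_ok("GB82 WEST 1234 5698 7654 33"))  # last digit changed
```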

## Step 6: JSON Blob Extraction (Layer 2.5)

PII often hides in JSON payloads within logs or messages:

```python
import json

def extract_json_strings(text: str) -> list[tuple[str, int, int]]:
    """Find and extract JSON objects from text."""
    json_objects = []

    # Find potential JSON starts
    for i, char in enumerate(text):
        if char == '{':
            depth = 0
            for j in range(i, len(text)):
                if text[j] == '{':
                    depth += 1
                elif text[j] == '}':
                    depth -= 1
                    if depth == 0:
                        try:
                            candidate = text[i:j+1]
                            json.loads(candidate)  # Validate
                            json_objects.append((candidate, i, j+1))
                        except json.JSONDecodeError:
                            pass
                        break

    return json_objects

def deep_scan_json(json_str: str) -> list[str]:
    """Recursively extract all string values from JSON."""
    values = []

    def extract(obj):
        if isinstance(obj, str):
            values.append(obj)
        elif isinstance(obj, dict):
            for v in obj.values():
                extract(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item)

    try:
        extract(json.loads(json_str))
    except (json.JSONDecodeError, RecursionError):
        pass

    return values
```
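
One caveat with brace counting is that a `}` inside a string value ends the candidate early, so that object is skipped. A possible alternative, sketched below, lets the standard-library decoder find the end of each object via `json.JSONDecoder.raw_decode`. `find_json_objects` is a hypothetical name, and unlike the function above it returns only top-level objects (nested values are still reachable through `deep_scan_json`-style recursion).

```python
import json

def find_json_objects(text: str) -> list[tuple[dict, int, int]]:
    """Locate JSON objects embedded in text using the standard-library decoder."""
    decoder = json.JSONDecoder()
    found = []
    position = 0
    while True:
        position = text.find("{", position)
        if position == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, position)
        except json.JSONDecodeError:
            position += 1  # not valid JSON here, try the next brace
            continue
        found.append((obj, position, end))
        position = end  # skip past the object that was just parsed
    return found

log_line = 'event=signup body={"note": "uses } in text", "email": "jane@example.com"} ok'
objects = find_json_objects(log_line)
for obj, start, end in objects:
    print(start, end, obj["email"])
```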

## Step 7: Base64 Auto-Decoding (Layer 2.6)

Encoded PII is common in API responses and logs:

```python
import base64

def is_valid_base64(s: str) -> bool:
    """Check if string is valid base64."""
    if len(s) < 20 or len(s) % 4 != 0:
        return False
    try:
        decoded = base64.b64decode(s, validate=True)
        decoded.decode('utf-8')  # Must be valid UTF-8
        return True
    except ValueError:  # binascii.Error and UnicodeDecodeError are both ValueErrors
        return False

def decode_base64_strings(text: str) -> list[tuple[str, str, int, int]]:
    """Find and decode base64 strings."""
    results = []
    pattern = r'[A-Za-z0-9+/]{20,}={0,2}'

    for match in re.finditer(pattern, text):
        candidate = match.group()
        if is_valid_base64(candidate):
            try:
                decoded = base64.b64decode(candidate).decode('utf-8')
                results.append((candidate, decoded, match.start(), match.end()))
            except ValueError:
                pass

    return results
```
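
A round trip is an easy way to see this layer working. The sketch below encodes a sample string, embeds it in a made-up log line, and recovers it with the same regex; `decode_if_text` is a hypothetical helper that folds the validity check and the decode into one step.

```python
import base64
import re
from typing import Optional

BASE64_RUN = re.compile(r'[A-Za-z0-9+/]{20,}={0,2}')

def decode_if_text(candidate: str) -> Optional[str]:
    """Return the decoded text, or None when the candidate is not base64-encoded UTF-8."""
    if len(candidate) % 4 != 0:
        return None
    try:
        return base64.b64decode(candidate, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None

encoded = base64.b64encode(b"email: jane@example.com").decode("ascii")
log_line = f"payload={encoded} status=ok"

decoded_values = []
for match in BASE64_RUN.finditer(log_line):
    decoded = decode_if_text(match.group())
    if decoded is not None:
        decoded_values.append(decoded)

print(decoded_values)  # the original sample string is recovered
```

The decoded strings can then be passed to `detect_pii_basic` like any other text.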

## Step 8: Build the FastAPI Endpoint

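Wire everything together in an API endpoint. FastAPI's `Form(...)` parameters require the `python-multipart` package to be installed, and the endpoint calls three helpers (`deduplicate_entities`, `mask_pii`, and `generate_summary`) whose implementations are left out of this guide: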

```python
from fastapi import APIRouter, Form

router = APIRouter(prefix="/api/privacy", tags=["privacy"])

@router.post("/scan-text")
async def scan_text(
    text: str = Form(...),
    sensitivity: str = Form("medium")
):
    """Main PII scanning endpoint."""

    # Layer 1: Basic pattern matching
    entities = detect_pii_basic(text)

    # Layer 2: Normalized text scan
    normalized, mappings = normalize_text(text)
    normalized_entities = detect_pii_basic(normalized)
    # ... map positions back to original

    # Layer 2.5: JSON extraction
    for json_str, start, end in extract_json_strings(text):
        for value in deep_scan_json(json_str):
            entities.extend(detect_pii_basic(value))

    # Layer 2.6: Base64 decoding
    for original, decoded, start, end in decode_base64_strings(text):
        decoded_entities = detect_pii_basic(decoded)
        for e in decoded_entities:
            e.type = f"{e.type}_BASE64_ENCODED"
        entities.extend(decoded_entities)

    # Layer 4: Validation
    for entity in entities:
        if entity.type == "CREDIT_CARD":
            if luhn_checksum(entity.value):
                entity.confidence = 0.95
            else:
                entity.type = "POSSIBLE_CARD_PATTERN"
                entity.confidence = 0.5

    # Deduplicate and sort
    entities = deduplicate_entities(entities)

    # Generate masked preview
    redacted = mask_pii(text, entities)

    return {
        "entities": [e.dict() for e in entities],
        "redacted_preview": redacted,
        "summary": generate_summary(entities)
    }
```
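
The endpoint relies on `deduplicate_entities`, `mask_pii`, and `generate_summary`, which are not defined in this tutorial. The following is one possible sketch of them, not the project's actual implementation. The `Entity` dataclass is a stand-in for `PIIEntity` so the snippet runs without Pydantic, and the `[TYPE]` placeholder format and summary shape are assumptions.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Entity:
    """Stand-in exposing the PIIEntity fields that the helpers read."""
    type: str
    value: str
    start: int
    end: int
    confidence: float

def deduplicate_entities(entities):
    """Drop exact duplicates, keeping the highest confidence, sorted by position."""
    best = {}
    for entity in entities:
        key = (entity.type, entity.value, entity.start, entity.end)
        if key not in best or entity.confidence > best[key].confidence:
            best[key] = entity
    return sorted(best.values(), key=lambda e: (e.start, e.end))

def mask_pii(text, entities):
    """Replace each non-overlapping span with a [TYPE] placeholder."""
    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e.start):
        if entity.start < cursor:
            continue  # overlaps a span that was already masked
        pieces.append(text[cursor:entity.start])
        pieces.append(f"[{entity.type}]")
        cursor = entity.end
    pieces.append(text[cursor:])
    return "".join(pieces)

def generate_summary(entities):
    """Count entities overall and per type."""
    counts = Counter(entity.type for entity in entities)
    return {"total": len(entities), "by_type": dict(counts)}

text = "SSN 123-45-6789 and a@b.co"
found = [
    Entity("SSN", "123-45-6789", 4, 15, 0.8),
    Entity("SSN", "123-45-6789", 4, 15, 0.9),  # duplicate from another layer
    Entity("EMAIL", "a@b.co", 20, 26, 0.8),
]
unique = deduplicate_entities(found)
print(mask_pii(text, unique))
print(generate_summary(unique))
```

`mask_pii` assumes `start` and `end` index into the original text, so any entity whose offsets refer to an extracted or decoded string has to be mapped back first.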

## Step 9: Create the SvelteKit Frontend

Build an interactive UI in `frontend/src/routes/privacy-scanner/+page.svelte`:

```svelte
<script lang="ts">
    let inputText = '';
    let results: any = null;
    let loading = false;

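    async function scanText() {
        loading = true;
        try {
            const formData = new FormData();
            formData.append('text', inputText);

            const response = await fetch('/api/privacy/scan-text', {
                method: 'POST',
                body: formData
            });
            if (!response.ok) {
                throw new Error(`Scan failed with status ${response.status}`);
            }

            results = await response.json();
        } catch (err) {
            console.error('Scan request failed', err);
        } finally {
            // Re-enable the button even when the request fails
            loading = false;
        }
    }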
</script>

<div class="container mx-auto p-6">
    <h1 class="text-2xl font-bold mb-4">Privacy Scanner</h1>

    <textarea
        bind:value={inputText}
        class="w-full h-48 p-4 border rounded"
        placeholder="Paste text to scan for PII..."
    ></textarea>

    <button
        on:click={scanText}
        disabled={loading}
        class="mt-4 px-6 py-2 bg-blue-600 text-white rounded"
    >
        {loading ? 'Scanning...' : 'Scan for PII'}
    </button>

    {#if results}
        <div class="mt-6">
            <h2 class="text-xl font-semibold">Results</h2>

            <!-- Entity badges -->
            <div class="flex flex-wrap gap-2 mt-4">
                {#each results.entities as entity}
                    <span class="px-3 py-1 rounded-full bg-red-100 text-red-800">
                        {entity.type}: {entity.value}
                    </span>
                {/each}
            </div>

            <!-- Redacted preview -->
            <div class="mt-4 p-4 bg-gray-100 rounded font-mono">
                {results.redacted_preview}
            </div>
        </div>
    {/if}
</div>
```

## Step 10: Add Security Features

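For production deployment, keep processing ephemeral, starting with a logging filter that drops records that might carry request text:

```python
# In main.py - ensure no PII logging
import logging

class PIIFilter(logging.Filter):
    def filter(self, record):
        # Never log request bodies that might contain PII
        return 'text=' not in record.getMessage()

# Filters on a logger are not applied to records propagated from child
# loggers, so attach the filter to the root logger's handlers instead
logging.basicConfig()  # make sure the root logger has a handler
for handler in logging.getLogger().handlers:
    handler.addFilter(PIIFilter())
```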
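
To check the filter in isolation, a small harness like the one below can help. It is only a sketch: `ListHandler` and the `privacy_demo` logger name are made up for the demonstration, and the filter is attached to the handler so that it sees every record the handler would emit.

```python
import logging

class PIIFilter(logging.Filter):
    def filter(self, record):
        # Drop any record whose formatted message carries a text= field
        return 'text=' not in record.getMessage()

class ListHandler(logging.Handler):
    """Collects formatted messages in memory instead of writing them out."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

handler = ListHandler()
handler.addFilter(PIIFilter())

demo_logger = logging.getLogger("privacy_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output away from other handlers
demo_logger.addHandler(handler)

demo_logger.info("scan completed in %d ms", 12)
demo_logger.info("received text=%s", "123-45-6789")

print(handler.messages)  # only the first message survives
```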

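Then extend the Step 8 endpoint with a coordinates-only mode for ultra-sensitive clients. Here `detect_pii_multilayer` is assumed to wrap the Step 8 detection layers in a single function: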

```python
@router.post("/scan-text")
async def scan_text(
    text: str = Form(...),
    coordinates_only: bool = Form(False)  # Client-side redaction mode
):
    entities = detect_pii_multilayer(text)

    if coordinates_only:
        # Return only positions, not actual values
        return {
            "entities": [
                {"type": e.type, "start": e.start, "end": e.end, "length": e.end - e.start}
                for e in entities
            ],
            "coordinates_only": True
        }

    # Normal response with values (add redacted_preview and summary as in Step 8)
    return {"entities": [e.dict() for e in entities]}
```
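
On the client side, the returned coordinates are enough to redact the text locally without the values ever coming back over the wire. A minimal sketch, assuming the response shape shown above; `redact_locally` and the choice of `*` as the mask character are illustrative only.

```python
def redact_locally(text: str, coordinates: list[dict]) -> str:
    """Mask each reported span with '*' characters, keeping the text length unchanged."""
    chars = list(text)
    for item in coordinates:
        for index in range(item["start"], min(item["end"], len(chars))):
            chars[index] = "*"
    return "".join(chars)

# Example response in coordinates-only mode
response = {
    "entities": [{"type": "SSN", "start": 7, "end": 18, "length": 11}],
    "coordinates_only": True,
}
original = "My SSN 123-45-6789 is private"
redacted = redact_locally(original, response["entities"])
print(redacted)  # My SSN *********** is private
```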

## Conclusion

You've now built a multi-layer Privacy Scanner that can:

- Detect 40+ PII types using regex patterns
- Defeat obfuscation through text normalization
- Extract PII from JSON payloads and Base64 encodings
- Validate checksums to reduce false positives
- Provide a clean web interface for interactive scanning
- Operate in secure, coordinates-only mode

**Next steps** to enhance your scanner:

1. Add machine learning for name/address detection
2. Implement region-specific patterns (EU VAT, UK NI numbers)
3. Build pre-commit and CI/CD integration for automated scanning
4. Add PDF and document parsing capabilities

The complete source code is available in the AI Tools Suite repository. Happy scanning!