---
title: "Building a Privacy Scanner: A Step-by-Step Implementation Guide"
author: "AI Tools Suite"
date: "2024-12-23"
categories: [tutorial, privacy, pii-detection, python, svelte]
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: false
    code-line-numbers: true
---

## Introduction

In this tutorial, we'll build a production-grade Privacy Scanner from scratch. By the end, you'll have a tool that detects 40+ types of Personally Identifiable Information (PII) using an eight-layer detection pipeline, complete with a modern web interface.

Our stack: **FastAPI** for the backend API, **SvelteKit** for the frontend, and **Python regex** with validation logic for detection.

## Step 1: Project Structure

First, create the project scaffolding:
```bash
mkdir -p ai_tools_suite/{backend/routers,frontend/src/routes/privacy-scanner}
cd ai_tools_suite
```
Once you add the files from the steps below, your directory structure should look like this:
```
ai_tools_suite/
├── backend/
│   ├── main.py
│   └── routers/
│       └── privacy.py
└── frontend/
    └── src/
        └── routes/
            └── privacy-scanner/
                └── +page.svelte
```

## Step 2: Define PII Patterns

The foundation of any PII scanner is its pattern library. Create `backend/routers/privacy.py` and start with the core patterns:

```python
import re
from typing import List, Dict, Any

from pydantic import BaseModel


class PIIEntity(BaseModel):
    type: str
    value: str
    start: int
    end: int
    confidence: float
    context: str = ""


PII_PATTERNS = {
    # Identity Documents
    "SSN": {
        "pattern": r'\b\d{3}-\d{2}-\d{4}\b',
        "description": "US Social Security Number",
        "category": "identity"
    },
    "PASSPORT": {
        "pattern": r'\b[A-Z]{1,2}\d{6,9}\b',
        "description": "Passport Number",
        "category": "identity"
    },
    # Financial Information
    "CREDIT_CARD": {
        "pattern": r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b',
        "description": "Credit Card Number (Visa, MC, Amex)",
        "category": "financial"
    },
    "IBAN": {
        "pattern": r'\b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b',
        "description": "International Bank Account Number",
        "category": "financial"
    },
    # Contact Information
    "EMAIL": {
        "pattern": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        "description": "Email Address",
        "category": "contact"
    },
    "PHONE_US": {
        "pattern": r'\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
        "description": "US Phone Number",
        "category": "contact"
    },
    # Add more patterns as needed...
}
```
Each pattern includes a regex, human-readable description, and category for risk classification.
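
Before building on these patterns, it can help to spot-check a few of them against known inputs. The sketch below copies three of the regexes into a standalone script; `SAMPLE_PATTERNS`, `matching_types`, and the sample strings are illustrative names and made-up data, not part of the scanner. The `EMAIL` copy writes the top-level-domain class as `[A-Za-z]`, because a `|` inside a character class is a literal pipe rather than alternation.

```python
import re

# Standalone copies of three patterns, kept separate from PII_PATTERNS
SAMPLE_PATTERNS = {
    "SSN": r'\b\d{3}-\d{2}-\d{4}\b',
    "EMAIL": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "PHONE_US": r'\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
}


def matching_types(text: str) -> list[str]:
    """Return the sorted names of all sample patterns that match somewhere in text."""
    return sorted(
        name for name, pattern in SAMPLE_PATTERNS.items() if re.search(pattern, text)
    )


print(matching_types("SSN 123-45-6789"))        # ['SSN']
print(matching_types("mail jane@example.com"))  # ['EMAIL']
print(matching_types("call 555-867-5309"))      # ['PHONE_US']
print(matching_types("nothing here"))           # []
```

A table of positive and negative samples like this is cheap to extend each time a pattern is added to the library.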

## Step 3: Build the Basic Detection Engine

Add the core detection function:

```python
def detect_pii_basic(text: str) -> List[PIIEntity]:
    """Layer 1: Standard regex pattern matching."""
    entities = []
    for pii_type, config in PII_PATTERNS.items():
        pattern = re.compile(config["pattern"], re.IGNORECASE)
        for match in pattern.finditer(text):
            entity = PIIEntity(
                type=pii_type,
                value=match.group(),
                start=match.start(),
                end=match.end(),
                confidence=0.8,  # Base confidence
                context=text[max(0, match.start()-20):match.end()+20]
            )
            entities.append(entity)
    return entities
```
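This gives us working PII detection, but it's easily fooled by obfuscation.

## Step 4: Add Text Normalization (Layer 2)

Attackers often hide PII using separators, leetspeak, or Unicode tricks. Add normalization for the first two:

```python
def normalize_text(text: str) -> tuple[str, dict]:
    """Layer 2: Remove obfuscation techniques."""
    mappings = {}

    # Remove common separators
    normalized = re.sub(r'[\s\-\.\(\)]+', '', text)

    # Leetspeak conversion, applied only to digits that sit directly between
    # letters so that genuine digit runs (SSNs, card numbers) stay intact
    leet_map = {'0': 'o', '1': 'i', '3': 'e', '4': 'a', '5': 's', '7': 't'}
    normalized = re.sub(
        r'(?<=[A-Za-z])[013457](?=[A-Za-z])',
        lambda m: leet_map[m.group()],
        normalized,
    )

    # Track position mappings for accurate reporting
    # (simplified - production code needs full position tracking)
    return normalized, mappings
```

Now `4-5-6-7-8-9-0-1-2-3` collapses to `4567890123`, which the `PHONE_US` pattern can match on a second pass. Two limits are worth knowing: patterns that require separators, such as the hyphenated `SSN` regex from Step 2, need a separator-free variant to match normalized text, and because whitespace is stripped too, the `\b` anchors only match when the digits are not glued to neighbouring letters.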
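
The position tracking that the comment in `normalize_text` leaves open could be sketched as follows, together with one way of folding Unicode look-alikes. This is a minimal illustration rather than the scanner's actual implementation: `normalize_with_map` and `to_original_span` are hypothetical helpers, and applying NFKC one character at a time is a simplifying assumption that keeps the offset bookkeeping trivial.

```python
import re
import unicodedata

SEPARATOR_CHARS = set("-.()")


def normalize_with_map(text: str) -> tuple[str, list[int]]:
    """Fold compatibility characters, drop separators, and record source offsets."""
    kept = []
    index_map = []  # index_map[i] is the offset in `text` that produced normalized[i]
    for pos, ch in enumerate(text):
        # Per-character NFKC only approximates whole-string NFKC,
        # but every output character keeps a single, known source offset.
        for out in unicodedata.normalize("NFKC", ch):
            if out.isspace() or out in SEPARATOR_CHARS:
                continue
            kept.append(out)
            index_map.append(pos)
    return "".join(kept), index_map


def to_original_span(index_map: list[int], start: int, end: int) -> tuple[int, int]:
    """Translate a [start, end) span in normalized text back to the original text."""
    return index_map[start], index_map[end - 1] + 1


text = "call 4-5-6 7-8-9 \uff10-1-2-3 now"  # \uff10 is a fullwidth zero
normalized, index_map = normalize_with_map(text)
match = re.search(r"\d{10}", normalized)
orig_start, orig_end = to_original_span(index_map, match.start(), match.end())

print(normalized)                 # call4567890123now
print((orig_start, orig_end))     # (5, 24)
```

With the span translated back, the redacted preview can mask the obfuscated sequence in the text the user actually submitted instead of in the normalized copy.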

## Step 5: Implement Checksum Validation (Layer 4)

Not every number sequence is valid PII. Add validation logic:

```python
def luhn_checksum(card_number: str) -> bool:
    """Validate credit card using Luhn algorithm."""
    digits = [int(d) for d in card_number if d.isdigit()]
    odd_digits = digits[-1::-2]
    even_digits = digits[-2::-2]
    total = sum(odd_digits)
    for d in even_digits:
        total += sum(divmod(d * 2, 10))
    return total % 10 == 0


def validate_iban(iban: str) -> bool:
    """Validate IBAN using MOD-97 algorithm."""
    iban = iban.replace(' ', '').upper()
    # Move first 4 chars to end
    rearranged = iban[4:] + iban[:4]
    # Convert letters to numbers (A=10, B=11, etc.)
    numeric = ''
    for char in rearranged:
        if char.isdigit():
            numeric += char
        else:
            numeric += str(ord(char) - 55)
    return int(numeric) % 97 == 1
```
With validation, we can boost confidence for valid numbers and flag invalid ones as `POSSIBLE_CARD_PATTERN`.
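
To make the arithmetic behind both checks concrete, here is a small standalone trace. `luhn_total` and `iban_numeric` are illustrative helpers written for this example; the card number is the widely used Visa test number and the IBAN is the commonly cited example from IBAN documentation, neither of which belongs to a real account.

```python
def luhn_total(number: str) -> int:
    """Sum the Luhn contribution of each digit, starting from the rightmost."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # every second digit from the right is doubled
            d *= 2
            if d > 9:
                d -= 9      # equivalent to adding the two digits of the product
        total += d
    return total


def iban_numeric(iban: str) -> int:
    """Build the integer that the MOD-97 check operates on."""
    compact = iban.replace(" ", "").upper()
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, matching ord(ch) - 55 for letters
    return int("".join(str(int(ch, 36)) for ch in rearranged))


print(luhn_total("4111 1111 1111 1111"))                  # 30, divisible by 10
print(luhn_total("4111 1111 1111 1112"))                  # 31, fails the check
print(iban_numeric("GB82 WEST 1234 5698 7654 32") % 97)   # 1
```

Changing a single digit shifts the Luhn total off a multiple of 10, which is why the check removes most random 16-digit sequences from the results.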

## Step 6: JSON Blob Extraction (Layer 2.5)

PII often hides in JSON payloads within logs or messages:

```python
import json


def extract_json_strings(text: str) -> list[tuple[str, int, int]]:
    """Find and extract JSON objects from text."""
    json_objects = []
    # Find potential JSON starts
    for i, char in enumerate(text):
        if char == '{':
            depth = 0
            # Count braces until the object opened at i is closed
            for j in range(i, len(text)):
                if text[j] == '{':
                    depth += 1
                elif text[j] == '}':
                    depth -= 1
                    if depth == 0:
                        try:
                            candidate = text[i:j+1]
                            json.loads(candidate)  # Validate
                            json_objects.append((candidate, i, j+1))
                        except json.JSONDecodeError:
                            pass
                        break
    return json_objects


def deep_scan_json(json_str: str) -> list[str]:
    """Recursively extract all string values from JSON."""
    values = []

    def extract(obj):
        if isinstance(obj, str):
            values.append(obj)
        elif isinstance(obj, dict):
            for v in obj.values():
                extract(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item)

    try:
        extract(json.loads(json_str))
    except json.JSONDecodeError:
        pass
    return values
```
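
One limitation of plain brace counting is that a `}` inside a string value closes the candidate early. An alternative sketch, offered as an option rather than as the tutorial's method, lets the standard library's `json.JSONDecoder.raw_decode` find where each object ends; `find_json_objects` and the sample log line are hypothetical.

```python
import json


def find_json_objects(text: str) -> list[tuple[object, int, int]]:
    """Locate top-level JSON objects embedded in free text."""
    decoder = json.JSONDecoder()
    found = []
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and reports where it ends
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            pos = text.find("{", pos + 1)
            continue
        found.append((obj, pos, end))
        pos = text.find("{", end)   # resume after the object just parsed
    return found


log_line = 'user={"name": "a}b", "contact": {"email": "jane@example.com"}} status=ok'
objects = find_json_objects(log_line)
parsed, start, end = objects[0]

print(len(objects))                  # 1
print(parsed["contact"]["email"])    # jane@example.com
print(log_line[start:end])
```

Because the parser skips past each decoded object, nested objects are reported once as part of their parent rather than again on their own.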

## Step 7: Base64 Auto-Decoding (Layer 2.6)

Encoded PII is common in API responses and logs:

```python
import base64


def is_valid_base64(s: str) -> bool:
    """Check if string is valid base64."""
    if len(s) < 20 or len(s) % 4 != 0:
        return False
    try:
        decoded = base64.b64decode(s, validate=True)
        decoded.decode('utf-8')  # Must be valid UTF-8
        return True
    except ValueError:
        # binascii.Error and UnicodeDecodeError are both ValueError subclasses
        return False


def decode_base64_strings(text: str) -> list[tuple[str, str, int, int]]:
    """Find and decode base64 strings."""
    results = []
    pattern = r'[A-Za-z0-9+/]{20,}={0,2}'
    for match in re.finditer(pattern, text):
        candidate = match.group()
        if is_valid_base64(candidate):
            try:
                decoded = base64.b64decode(candidate).decode('utf-8')
                results.append((candidate, decoded, match.start(), match.end()))
            except ValueError:
                pass
    return results
```
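
A quick round trip shows what this layer is meant to catch. The snippet below is a standalone illustration with made-up data: it encodes a sample string, embeds it in a fake log line, and recovers it with the same candidate regex used above.

```python
import base64
import re

secret = "contact: jane.doe@example.com"   # made-up sample value
token = base64.b64encode(secret.encode("utf-8")).decode("ascii")
log_line = f"payload={token} status=200"

# Same candidate pattern as decode_base64_strings: 20+ base64 characters plus padding
candidates = re.findall(r'[A-Za-z0-9+/]{20,}={0,2}', log_line)
decoded = [base64.b64decode(c, validate=True).decode("utf-8") for c in candidates]

print(len(token))   # 40
print(decoded)      # ['contact: jane.doe@example.com']
```

The 20-character minimum keeps short words such as `payload` and `status` out of the candidate list, at the cost of missing very short encoded values.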

## Step 8: Build the FastAPI Endpoint

Wire everything together in an API endpoint. The helpers `deduplicate_entities`, `mask_pii`, and `generate_summary` are called here but not defined in this tutorial, so you need to supply them:

```python
from fastapi import APIRouter, Form  # Form fields need the python-multipart package

router = APIRouter(prefix="/api/privacy", tags=["privacy"])


@router.post("/scan-text")
async def scan_text(
    text: str = Form(...),
    sensitivity: str = Form("medium")  # accepted but not used below
):
    """Main PII scanning endpoint."""
    # Layer 1: Basic pattern matching
    entities = detect_pii_basic(text)

    # Layer 2: Normalized text scan
    normalized, mappings = normalize_text(text)
    normalized_entities = detect_pii_basic(normalized)
    # ... map positions back to original

    # Layer 2.5: JSON extraction
    # (offsets of these entities are relative to each extracted value, not to text)
    for json_str, start, end in extract_json_strings(text):
        for value in deep_scan_json(json_str):
            entities.extend(detect_pii_basic(value))

    # Layer 2.6: Base64 decoding
    for original, decoded, start, end in decode_base64_strings(text):
        decoded_entities = detect_pii_basic(decoded)
        for e in decoded_entities:
            e.type = f"{e.type}_BASE64_ENCODED"
        entities.extend(decoded_entities)

    # Layer 4: Validation
    for entity in entities:
        if entity.type == "CREDIT_CARD":
            if luhn_checksum(entity.value):
                entity.confidence = 0.95
            else:
                entity.type = "POSSIBLE_CARD_PATTERN"
                entity.confidence = 0.5

    # Deduplicate and sort
    entities = deduplicate_entities(entities)

    # Generate masked preview
    redacted = mask_pii(text, entities)

    return {
        "entities": [e.dict() for e in entities],  # Pydantic v2: prefer e.model_dump()
        "redacted_preview": redacted,
        "summary": generate_summary(entities)
    }
```
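
The endpoint calls `deduplicate_entities`, `mask_pii`, and `generate_summary` without showing them. The following is one possible sketch under stated assumptions, not the repository's implementation: entities are deduplicated by exact span, spans do not overlap after deduplication, offsets refer to the original text, and the summary is a simple count by type. The `Entity` dataclass is a stand-in for `PIIEntity` so the example runs without Pydantic.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entity:
    """Stand-in for PIIEntity with only the fields these helpers need."""
    type: str
    value: str
    start: int
    end: int
    confidence: float


def deduplicate_entities(entities):
    """Keep the highest-confidence entity per (start, end) span, sorted by position."""
    best = {}
    for e in entities:
        key = (e.start, e.end)
        if key not in best or e.confidence > best[key].confidence:
            best[key] = e
    return sorted(best.values(), key=lambda e: (e.start, e.end))


def mask_pii(text, entities):
    """Replace each span with a [TYPE] placeholder, working right to left
    so earlier offsets stay valid while later spans are rewritten."""
    redacted = text
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        redacted = redacted[:e.start] + f"[{e.type}]" + redacted[e.end:]
    return redacted


def generate_summary(entities):
    """Count entities overall and by type."""
    counts = Counter(e.type for e in entities)
    return {"total": len(entities), "by_type": dict(counts)}


text = "SSN 123-45-6789, mail jane@example.com"
ssn_at = text.index("123-45-6789")
mail_at = text.index("jane@example.com")
found = [
    Entity("SSN", "123-45-6789", ssn_at, ssn_at + 11, 0.8),
    Entity("SSN", "123-45-6789", ssn_at, ssn_at + 11, 0.95),  # same span, another layer
    Entity("EMAIL", "jane@example.com", mail_at, mail_at + 16, 0.8),
]
unique = deduplicate_entities(found)

print(mask_pii(text, unique))     # SSN [SSN], mail [EMAIL]
print(generate_summary(unique))   # {'total': 2, 'by_type': {'SSN': 1, 'EMAIL': 1}}
```

Entities found inside decoded JSON values or Base64 payloads carry offsets relative to the decoded string, so they would need to be remapped or excluded before being passed to a masking helper like this one.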

## Step 9: Create the SvelteKit Frontend

Build an interactive UI in `frontend/src/routes/privacy-scanner/+page.svelte`:

```svelte
<script lang="ts">
  let inputText = '';
  let results: any = null;
  let loading = false;

  async function scanText() {
    loading = true;
    try {
      const formData = new FormData();
      formData.append('text', inputText);

      const response = await fetch('/api/privacy/scan-text', {
        method: 'POST',
        body: formData
      });
      results = await response.json();
    } finally {
      // Re-enable the button even if the request fails
      loading = false;
    }
  }
</script>

<div class="container mx-auto p-6">
  <h1 class="text-2xl font-bold mb-4">Privacy Scanner</h1>

  <textarea
    bind:value={inputText}
    class="w-full h-48 p-4 border rounded"
    placeholder="Paste text to scan for PII..."
  ></textarea>

  <button
    on:click={scanText}
    disabled={loading}
    class="mt-4 px-6 py-2 bg-blue-600 text-white rounded"
  >
    {loading ? 'Scanning...' : 'Scan for PII'}
  </button>

  {#if results}
    <div class="mt-6">
      <h2 class="text-xl font-semibold">Results</h2>

      <!-- Entity badges -->
      <div class="flex flex-wrap gap-2 mt-4">
        {#each results.entities as entity}
          <span class="px-3 py-1 rounded-full bg-red-100 text-red-800">
            {entity.type}: {entity.value}
          </span>
        {/each}
      </div>

      <!-- Redacted preview -->
      <div class="mt-4 p-4 bg-gray-100 rounded font-mono">
        {results.redacted_preview}
      </div>
    </div>
  {/if}
</div>
```
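
## Step 10: Add Security Features

For production deployment, implement ephemeral processing, starting with a filter that keeps request bodies out of the logs:

```python
# In main.py - ensure no PII logging
import logging


class PIIFilter(logging.Filter):
    def filter(self, record):
        # Never log request bodies that might contain PII
        return 'text=' not in record.getMessage()


root_logger = logging.getLogger()
root_logger.addFilter(PIIFilter())
# Logger-level filters only see records logged directly on that logger,
# so also attach the filter to the root handlers to cover child loggers
for handler in root_logger.handlers:
    handler.addFilter(PIIFilter())
```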
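
To check that a filter like `PIIFilter` really drops records, one option is to attach it to a handler that writes to an in-memory stream. The logger name `privacy_demo` and the log messages are made up for this sketch. The filter here inspects `record.getMessage()`, so values passed as formatting arguments are checked as well.

```python
import io
import logging


class PIIFilter(logging.Filter):
    def filter(self, record):
        # Returning False drops the record
        return 'text=' not in record.getMessage()


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(PIIFilter())   # handler-level filters see every record the handler receives

logger = logging.getLogger("privacy_demo")   # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False   # keep the demo output out of the root logger

logger.info("scan finished: %d entities", 3)
logger.info("request body text=%s", "123-45-6789")

output = stream.getvalue()
print(output, end="")   # scan finished: 3 entities
```

Matching on the substring `text=` is a coarse heuristic; it only helps when request bodies are logged in that exact form, so it complements rather than replaces not logging bodies at all.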
And add a coordinates-only mode for ultra-sensitive clients. Because it registers the same `POST /scan-text` route, treat it as a revision of the Step 8 handler rather than a second endpoint:

```python
@router.post("/scan-text")
async def scan_text(
    text: str = Form(...),
    coordinates_only: bool = Form(False)  # Client-side redaction mode
):
    # detect_pii_multilayer is not defined in this tutorial; presumably it
    # wraps the layered pipeline assembled in Step 8
    entities = detect_pii_multilayer(text)

    if coordinates_only:
        # Return only positions, not actual values
        return {
            "entities": [
                {"type": e.type, "start": e.start, "end": e.end, "length": e.end - e.start}
                for e in entities
            ],
            "coordinates_only": True
        }

    # Normal response with values
    return {"entities": [e.dict() for e in entities], ...}
```
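
On the client side, the coordinates are enough to redact locally without the server ever echoing the values back. The sketch below assumes the response shape shown above and non-overlapping spans; `redact_with_coordinates` and the sample response are illustrative.

```python
def redact_with_coordinates(text: str, entities: list[dict]) -> str:
    """Mask spans locally using only the offsets returned by the server."""
    redacted = text
    # Work right to left so earlier offsets are unaffected by later replacements
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        redacted = redacted[:e["start"]] + "*" * e["length"] + redacted[e["end"]:]
    return redacted


text = "My SSN is 123-45-6789."
response = {
    "entities": [{"type": "SSN", "start": 10, "end": 21, "length": 11}],
    "coordinates_only": True,
}

masked = redact_with_coordinates(text, response["entities"])
print(masked)   # My SSN is ***********.
```

Note that these offsets are Python string indices, which count Unicode code points. A JavaScript client indexes strings by UTF-16 code units, so text containing characters outside the Basic Multilingual Plane would need the offsets converted before slicing.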

## Conclusion

You've now built a multi-layer Privacy Scanner that can:

- Detect 40+ PII types using regex patterns
- Defeat obfuscation through text normalization
- Extract PII from JSON payloads and Base64 encodings
- Validate checksums to reduce false positives
- Provide a clean web interface for interactive scanning
- Operate in secure, coordinates-only mode

**Next steps** to enhance your scanner:

1. Add machine learning for name/address detection
2. Implement language-specific patterns (EU VAT, UK NI numbers)
3. Build CI/CD integration for automated pre-commit scanning
4. Add PDF and document parsing capabilities

The complete source code is available in the AI Tools Suite repository. Happy scanning!