---
title: "Building a Privacy Scanner: A Step-by-Step Implementation Guide"
author: "AI Tools Suite"
date: "2024-12-23"
categories: [tutorial, privacy, pii-detection, python, svelte]
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: false
    code-line-numbers: true
---

## Introduction

In this tutorial, we'll build a production-grade Privacy Scanner from scratch. By the end, you'll have a tool that detects 40+ types of Personally Identifiable Information (PII) using an eight-layer detection pipeline, complete with a modern web interface.

Our stack: **FastAPI** for the backend API, **SvelteKit** for the frontend, and **Python regex** with validation logic for detection.

## Step 1: Project Structure

First, create the project scaffolding:

```bash
mkdir -p ai_tools_suite/{backend/routers,frontend/src/routes/privacy-scanner}
cd ai_tools_suite
```

Once you add the files from the following steps, the directory structure will look like this:

```
ai_tools_suite/
├── backend/
│   ├── main.py
│   └── routers/
│       └── privacy.py
└── frontend/
    └── src/
        └── routes/
            └── privacy-scanner/
                └── +page.svelte
```

## Step 2: Define PII Patterns

The foundation of any PII scanner is its pattern library. Create `backend/routers/privacy.py` and start with the core patterns:

```python
import re
from typing import List, Dict, Any
from pydantic import BaseModel

class PIIEntity(BaseModel):
    type: str
    value: str
    start: int
    end: int
    confidence: float
    context: str = ""

PII_PATTERNS = {
    # Identity Documents
    "SSN": {
        "pattern": r'\b\d{3}-\d{2}-\d{4}\b',
        "description": "US Social Security Number",
        "category": "identity"
    },
    "PASSPORT": {
        "pattern": r'\b[A-Z]{1,2}\d{6,9}\b',
        "description": "Passport Number",
        "category": "identity"
    },

    # Financial Information
    "CREDIT_CARD": {
        "pattern": r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b',
        "description": "Credit Card Number (Visa, MC, Amex)",
        "category": "financial"
    },
    "IBAN": {
        "pattern": r'\b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b',
        "description": "International Bank Account Number",
        "category": "financial"
    },

    # Contact Information
    "EMAIL": {
        "pattern": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        "description": "Email Address",
        "category": "contact"
    },
    "PHONE_US": {
        "pattern": r'\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
        "description": "US Phone Number",
        "category": "contact"
    },

    # Add more patterns as needed...
}
```

Each entry pairs a regex with a human-readable description and a category used for risk classification.

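
Before wiring patterns into a detector, it can help to check each one against known positives and negatives. The sketch below is one way to do that. The `check_patterns` helper, the `PATTERN_SAMPLES` table, and all sample values are made up for illustration; the two patterns are written in the style of `PII_PATTERNS`, with the email pattern using a plain `[A-Za-z]` class for the top-level domain.

```python
import re

# Hypothetical test table: pattern, strings that should match, strings that should not.
PATTERN_SAMPLES = {
    "SSN": (
        r'\b\d{3}-\d{2}-\d{4}\b',
        ["123-45-6789"],
        ["123-456-789", "1234-56-7890"],
    ),
    "EMAIL": (
        r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        ["jane.doe@example.com"],
        ["jane.doe@example"],
    ),
}


def check_patterns(samples: dict) -> list[str]:
    """Return a description of every sample that behaves unexpectedly."""
    failures = []
    for name, (pattern, positives, negatives) in samples.items():
        compiled = re.compile(pattern)
        for sample in positives:
            if not compiled.search(sample):
                failures.append(f"{name} missed {sample!r}")
        for sample in negatives:
            if compiled.search(sample):
                failures.append(f"{name} wrongly matched {sample!r}")
    return failures


failures = check_patterns(PATTERN_SAMPLES)
print(failures or "all samples behave as expected")
```

Keeping a table like this next to the pattern library makes it cheap to add a regression sample whenever a pattern is tightened or loosened.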
## Step 3: Build the Basic Detection Engine

Add the core detection function:

```python
def detect_pii_basic(text: str) -> List[PIIEntity]:
    """Layer 1: Standard regex pattern matching."""
    entities = []

    for pii_type, config in PII_PATTERNS.items():
        pattern = re.compile(config["pattern"], re.IGNORECASE)

        for match in pattern.finditer(text):
            entity = PIIEntity(
                type=pii_type,
                value=match.group(),
                start=match.start(),
                end=match.end(),
                confidence=0.8,  # Base confidence
                context=text[max(0, match.start()-20):match.end()+20]
            )
            entities.append(entity)

    return entities
```

This gives us working PII detection, but it's easily fooled by obfuscation.

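
To see what this layer returns, here is a minimal standalone run on one sentence. It is a sketch rather than the function above: `Finding` is a dataclass stand-in for `PIIEntity` (so the example needs no third-party imports), `scan` and `DEMO_PATTERNS` are hypothetical names, and the sample text is invented. Results are sorted by position, which the original loop does not do.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    """Stand-in for PIIEntity, limited to the fields this demo prints."""
    type: str
    value: str
    start: int
    end: int
    context: str


# Two patterns in the style of PII_PATTERNS.
DEMO_PATTERNS = {
    "SSN": r'\b\d{3}-\d{2}-\d{4}\b',
    "PHONE_US": r'\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
}


def scan(text: str, window: int = 20) -> list[Finding]:
    """Run every pattern over the text and return findings ordered by position."""
    findings = []
    for pii_type, pattern in DEMO_PATTERNS.items():
        for match in re.finditer(pattern, text):
            findings.append(Finding(
                type=pii_type,
                value=match.group(),
                start=match.start(),
                end=match.end(),
                # Up to `window` characters of surrounding text on each side
                context=text[max(0, match.start() - window):match.end() + window],
            ))
    return sorted(findings, key=lambda f: f.start)


sample = "Call 555-123-4567 about claim 123-45-6789 today."
for finding in scan(sample):
    print(finding.type, finding.value, (finding.start, finding.end))
```

The `start` and `end` offsets index directly into the scanned string, which is what later steps rely on for masking.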
## Step 4: Add Text Normalization (Layer 2)

Attackers often hide PII using separators, leetspeak, or Unicode tricks. Add normalization:

```python
def normalize_text(text: str) -> tuple[str, dict]:
    """Layer 2: Remove obfuscation techniques."""
    mappings = {}

    # Leetspeak conversion table (digit -> letter)
    leet_table = str.maketrans(
        {'0': 'o', '1': 'i', '3': 'e', '4': 'a', '5': 's', '7': 't'}
    )

    # Split on common separators; joining the pieces again removes them
    tokens = re.split(r'[\s\-\.\(\)]+', text)

    # Only de-leet tokens that contain letters, so pure digit runs
    # (card numbers, SSNs) keep their digits
    normalized = ''.join(
        t.translate(leet_table) if any(c.isalpha() for c in t) else t
        for t in tokens
    )

    # Track position mappings for accurate reporting
    # (simplified - production code needs full position tracking)

    return normalized, mappings
```

Now an obfuscated card number such as `4111-1111-1111-1111` normalizes to `4111111111111111` and matches the `CREDIT_CARD` pattern. Dashed patterns such as `SSN` cannot match once the separators are stripped, so this pass needs separator-free variants of them (nine consecutive digits for an SSN).

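
The position tracking mentioned in the code comment can be sketched as follows. This is one possible approach, not the guide's implementation: `normalize_with_offsets` and `map_span_to_original` are hypothetical helpers, and only separator stripping is covered (a one-to-one character substitution such as leetspeak conversion does not shift positions).

```python
import re

SEPARATOR = re.compile(r'[\s\-\.\(\)]')


def normalize_with_offsets(text: str) -> tuple[str, list[int]]:
    """Strip separators and remember where each kept character came from."""
    kept_chars = []
    offsets = []  # offsets[i] is the index in `text` of normalized[i]
    for index, char in enumerate(text):
        if SEPARATOR.match(char):
            continue
        kept_chars.append(char)
        offsets.append(index)
    return ''.join(kept_chars), offsets


def map_span_to_original(offsets: list[int], start: int, end: int) -> tuple[int, int]:
    """Translate a [start, end) span in the normalized text to the original text."""
    return offsets[start], offsets[end - 1] + 1


text = "card: 4111-1111-1111-1111!"
normalized, offsets = normalize_with_offsets(text)
match = re.search(r'\d{16}', normalized)
orig_start, orig_end = map_span_to_original(offsets, match.start(), match.end())
print(text[orig_start:orig_end])  # 4111-1111-1111-1111
```

The mapped span covers the separators inside the match, so masking the original text with it hides the whole obfuscated number.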
## Step 5: Implement Checksum Validation (Layer 4)

Not every number sequence is valid PII. Add validation logic:

```python
def luhn_checksum(card_number: str) -> bool:
    """Validate credit card using Luhn algorithm."""
    digits = [int(d) for d in card_number if d.isdigit()]
    odd_digits = digits[-1::-2]
    even_digits = digits[-2::-2]

    total = sum(odd_digits)
    for d in even_digits:
        total += sum(divmod(d * 2, 10))

    return total % 10 == 0

def validate_iban(iban: str) -> bool:
    """Validate IBAN using MOD-97 algorithm."""
    iban = iban.replace(' ', '').upper()

    # Move first 4 chars to end
    rearranged = iban[4:] + iban[:4]

    # Convert letters to numbers (A=10, B=11, etc.)
    numeric = ''
    for char in rearranged:
        if char.isdigit():
            numeric += char
        else:
            numeric += str(ord(char) - 55)

    return int(numeric) % 97 == 1
```

With validation, we can boost confidence for valid numbers and flag invalid ones as `POSSIBLE_CARD_PATTERN`.

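
A worked example makes the two checks concrete. The block below uses compact standalone variants, `luhn_valid` and `iban_valid` (hypothetical names, not the functions above), with basic input guards added as an assumption. The inputs are the widely published Visa test number and the standard example IBAN, each followed by a copy with one digit changed.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            # Doubling can give two digits; add them together (e.g. 14 -> 1 + 4)
            digit = sum(divmod(digit * 2, 10))
        total += digit
    return bool(digits) and total % 10 == 0


def iban_valid(iban: str) -> bool:
    """MOD-97 check: move the first four characters to the end, map letters, test mod 97."""
    compact = iban.replace(' ', '').upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(char, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35
    numeric = ''.join(str(int(char, 36)) for char in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))          # True
print(luhn_valid("4111 1111 1111 1112"))          # False
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_valid("GB83 WEST 1234 5698 7654 32"))  # False
```

A single changed digit fails either check, which is why a checksum pass removes many random digit runs from the results.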
## Step 6: JSON Blob Extraction (Layer 2.5)

PII often hides in JSON payloads within logs or messages:

```python
import json

def extract_json_strings(text: str) -> list[tuple[str, int, int]]:
    """Find and extract JSON objects from text."""
    json_objects = []

    # Find potential JSON starts
    for i, char in enumerate(text):
        if char == '{':
            depth = 0
            for j in range(i, len(text)):
                if text[j] == '{':
                    depth += 1
                elif text[j] == '}':
                    depth -= 1
                    if depth == 0:
                        try:
                            candidate = text[i:j+1]
                            json.loads(candidate)  # Validate
                            json_objects.append((candidate, i, j+1))
                        except json.JSONDecodeError:
                            pass
                        break

    return json_objects

def deep_scan_json(json_str: str) -> list[str]:
    """Recursively extract all string values from JSON."""
    values = []

    def extract(obj):
        if isinstance(obj, str):
            values.append(obj)
        elif isinstance(obj, dict):
            for v in obj.values():
                extract(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item)

    try:
        extract(json.loads(json_str))
    except json.JSONDecodeError:
        pass

    return values
```

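
The brace counter above treats a `}` inside a string value as structural and reports nested objects a second time. An alternative sketch, assuming the standard library's `json.JSONDecoder.raw_decode` is acceptable here, lets the JSON parser find the end of each object instead. `find_json_objects` is a hypothetical name and the log line is invented.

```python
import json


def find_json_objects(text: str) -> list[tuple[str, int, int]]:
    """Return (json_text, start, end) for each top-level JSON object found in text."""
    decoder = json.JSONDecoder()
    found = []
    position = 0
    while True:
        start = text.find('{', position)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON document starting at `start`
            # and returns the index just past its end
            _, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            position = start + 1  # not JSON here; try the next brace
            continue
        found.append((text[start:end], start, end))
        position = end  # skip the parsed object so nested objects are not reported again
    return found


log_line = 'INFO payload={"user": {"email": "jane@example.com"}, "note": "brace } in text"} done'
for json_text, start, end in find_json_objects(log_line):
    print(start, end, json_text)
```

Because the parser tracks string boundaries itself, the brace inside `"note"` does not cut the object short.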
## Step 7: Base64 Auto-Decoding (Layer 2.6)

Encoded PII is common in API responses and logs:

```python
import base64

def is_valid_base64(s: str) -> bool:
    """Check if string is valid base64."""
    if len(s) < 20 or len(s) % 4 != 0:
        return False
    try:
        decoded = base64.b64decode(s, validate=True)
        decoded.decode('utf-8')  # Must be valid UTF-8
        return True
    except ValueError:
        # binascii.Error and UnicodeDecodeError are both ValueError subclasses
        return False

def decode_base64_strings(text: str) -> list[tuple[str, str, int, int]]:
    """Find and decode base64 strings."""
    results = []
    pattern = r'[A-Za-z0-9+/]{20,}={0,2}'

    for match in re.finditer(pattern, text):
        candidate = match.group()
        if is_valid_base64(candidate):
            try:
                decoded = base64.b64decode(candidate).decode('utf-8')
                results.append((candidate, decoded, match.start(), match.end()))
            except ValueError:
                pass

    return results
```

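
A round trip shows the layer at work: encode a value, embed it in a log line, and recover it. This is a standalone sketch with invented sample data; `decode_candidates` is a hypothetical condensed variant of the two functions above.

```python
import base64
import re

BASE64_RUN = re.compile(r'[A-Za-z0-9+/]{20,}={0,2}')


def decode_candidates(text: str) -> list[tuple[str, int, int]]:
    """Return (decoded_text, start, end) for each base64 run that decodes to UTF-8."""
    decoded_values = []
    for match in BASE64_RUN.finditer(text):
        candidate = match.group()
        if len(candidate) % 4 != 0:
            continue  # properly padded base64 always has a length divisible by 4
        try:
            decoded = base64.b64decode(candidate, validate=True).decode('utf-8')
        except ValueError:
            continue  # not base64, or not text once decoded
        decoded_values.append((decoded, match.start(), match.end()))
    return decoded_values


secret = "contact: jane.doe@example.com"
encoded = base64.b64encode(secret.encode('utf-8')).decode('ascii')
log_line = f"token={encoded} status=ok"

for decoded, start, end in decode_candidates(log_line):
    print(decoded, (start, end))
```

The returned offsets point at the encoded run in the original text, so a redactor can mask the whole token rather than the decoded value.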
## Step 8: Build the FastAPI Endpoint

Wire everything together in an API endpoint:

```python
from fastapi import APIRouter, Form

router = APIRouter(prefix="/api/privacy", tags=["privacy"])

@router.post("/scan-text")
async def scan_text(
    text: str = Form(...),
    sensitivity: str = Form("medium")
):
    """Main PII scanning endpoint."""

    # Layer 1: Basic pattern matching
    entities = detect_pii_basic(text)

    # Layer 2: Normalized text scan
    normalized, mappings = normalize_text(text)
    normalized_entities = detect_pii_basic(normalized)
    # ... map positions back to original

    # Layer 2.5: JSON extraction
    # (start/end of these entities are relative to each extracted value)
    for json_str, start, end in extract_json_strings(text):
        for value in deep_scan_json(json_str):
            entities.extend(detect_pii_basic(value))

    # Layer 2.6: Base64 decoding
    # (start/end of these entities are relative to the decoded string)
    for original, decoded, start, end in decode_base64_strings(text):
        decoded_entities = detect_pii_basic(decoded)
        for e in decoded_entities:
            e.type = f"{e.type}_BASE64_ENCODED"
        entities.extend(decoded_entities)

    # Layer 4: Validation
    for entity in entities:
        if entity.type == "CREDIT_CARD":
            if luhn_checksum(entity.value):
                entity.confidence = 0.95
            else:
                entity.type = "POSSIBLE_CARD_PATTERN"
                entity.confidence = 0.5

    # Deduplicate and sort
    entities = deduplicate_entities(entities)

    # Generate masked preview
    redacted = mask_pii(text, entities)

    return {
        "entities": [e.dict() for e in entities],  # e.model_dump() on Pydantic v2
        "redacted_preview": redacted,
        "summary": generate_summary(entities)
    }
```

`deduplicate_entities`, `mask_pii`, and `generate_summary` are helpers that this guide does not define. FastAPI also needs the `python-multipart` package installed to parse `Form(...)` fields.

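
The endpoint calls `deduplicate_entities`, `mask_pii`, and `generate_summary` without showing them. The sketch below is one plausible shape for these helpers, not the author's implementation. It assumes a dataclass `Entity` stand-in for `PIIEntity`, non-overlapping spans, and offsets that refer to the text being masked; the `[TYPE]` placeholder format and the summary layout are also assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entity:
    """Stand-in for PIIEntity with the fields the helpers need."""
    type: str
    value: str
    start: int
    end: int
    confidence: float


def deduplicate_entities(entities: list[Entity]) -> list[Entity]:
    """Keep the highest-confidence entity per span, ordered by position."""
    best = {}
    for entity in entities:
        key = (entity.start, entity.end)
        if key not in best or entity.confidence > best[key].confidence:
            best[key] = entity
    return sorted(best.values(), key=lambda e: (e.start, e.end))


def mask_pii(text: str, entities: list[Entity]) -> str:
    """Replace each span with a [TYPE] placeholder, working right to left."""
    masked = text
    # Going from the end keeps earlier offsets valid while the string changes length
    for entity in sorted(entities, key=lambda e: e.start, reverse=True):
        masked = masked[:entity.start] + f"[{entity.type}]" + masked[entity.end:]
    return masked


def generate_summary(entities: list[Entity]) -> dict:
    """Count findings overall and per type."""
    counts = Counter(entity.type for entity in entities)
    return {"total": len(entities), "by_type": dict(counts)}


text = "SSN 123-45-6789, mail jane@example.com"
ssn, email = "123-45-6789", "jane@example.com"
raw = [
    Entity("SSN", ssn, text.index(ssn), text.index(ssn) + len(ssn), 0.8),
    Entity("SSN", ssn, text.index(ssn), text.index(ssn) + len(ssn), 0.95),
    Entity("EMAIL", email, text.index(email), text.index(email) + len(email), 0.8),
]
unique = deduplicate_entities(raw)
print(mask_pii(text, unique))    # SSN [SSN], mail [EMAIL]
print(generate_summary(unique))  # {'total': 2, 'by_type': {'SSN': 1, 'EMAIL': 1}}
```

Entities found inside JSON values or decoded Base64 carry offsets relative to the extracted string, so they would need remapping before being passed to a masker like this one.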
## Step 9: Create the SvelteKit Frontend

Build an interactive UI in `frontend/src/routes/privacy-scanner/+page.svelte`:

```svelte
<script lang="ts">
  let inputText = '';
  let results: any = null;
  let loading = false;

  async function scanText() {
    loading = true;
    const formData = new FormData();
    formData.append('text', inputText);

    try {
      const response = await fetch('/api/privacy/scan-text', {
        method: 'POST',
        body: formData
      });

      results = await response.json();
    } finally {
      // Re-enable the button even if the request fails
      loading = false;
    }
  }
</script>

<div class="container mx-auto p-6">
  <h1 class="text-2xl font-bold mb-4">Privacy Scanner</h1>

  <textarea
    bind:value={inputText}
    class="w-full h-48 p-4 border rounded"
    placeholder="Paste text to scan for PII..."
  ></textarea>

  <button
    on:click={scanText}
    disabled={loading}
    class="mt-4 px-6 py-2 bg-blue-600 text-white rounded"
  >
    {loading ? 'Scanning...' : 'Scan for PII'}
  </button>

  {#if results}
    <div class="mt-6">
      <h2 class="text-xl font-semibold">Results</h2>

      <!-- Entity badges -->
      <div class="flex flex-wrap gap-2 mt-4">
        {#each results.entities as entity}
          <span class="px-3 py-1 rounded-full bg-red-100 text-red-800">
            {entity.type}: {entity.value}
          </span>
        {/each}
      </div>

      <!-- Redacted preview -->
      <div class="mt-4 p-4 bg-gray-100 rounded font-mono">
        {results.redacted_preview}
      </div>
    </div>
  {/if}
</div>
```

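## Step 10: Add Security Features

For production deployment, process text ephemerally and keep it out of the logs:

```python
# In main.py - ensure no PII logging
import logging

class PIIFilter(logging.Filter):
    def filter(self, record):
        # Never log request bodies that might contain PII
        # (getMessage() includes the formatted arguments, unlike record.msg)
        return 'text=' not in record.getMessage()

# Logger-level filters are skipped for records propagated from child loggers,
# so attach the filter to the root logger's handlers (configure them first)
pii_filter = PIIFilter()
for handler in logging.getLogger().handlers:
    handler.addFilter(pii_filter)
```
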
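
To check that a filter of this kind drops what it should, it helps to attach it to a handler that collects messages in memory. The sketch below does that with a hypothetical `ListHandler` and a logger named `privacy_demo`, both invented for the demonstration. The filter sits on the handler because the standard `logging` module does not apply a logger's own filters to records that arrive from its child loggers.

```python
import logging


class PIIFilter(logging.Filter):
    """Drop any record whose formatted message contains 'text='."""

    def filter(self, record: logging.LogRecord) -> bool:
        return 'text=' not in record.getMessage()


class ListHandler(logging.Handler):
    """Hypothetical in-memory handler used only to observe what gets through."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
handler.addFilter(PIIFilter())

demo_logger = logging.getLogger("privacy_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output away from the root logger
demo_logger.addHandler(handler)

# Records from a child logger still pass through the parent's handler and its filter
child = logging.getLogger("privacy_demo.router")
child.info("scan finished in %d ms", 12)
child.info("request body text=%s", "123-45-6789")

print(handler.messages)  # ['scan finished in 12 ms']
```

A substring test such as `'text='` is a coarse safeguard; it only catches messages that happen to use that exact form.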
Then add a coordinates-only mode for clients that must never receive detected values back:

```python
@router.post("/scan-text")
async def scan_text(
    text: str = Form(...),
    coordinates_only: bool = Form(False)  # Client-side redaction mode
):
    # detect_pii_multilayer is assumed to run the layers wired together in Step 8
    entities = detect_pii_multilayer(text)

    if coordinates_only:
        # Return only positions, not actual values
        return {
            "entities": [
                {"type": e.type, "start": e.start, "end": e.end, "length": e.end - e.start}
                for e in entities
            ],
            "coordinates_only": True
        }

    # Normal response with values
    return {
        "entities": [e.dict() for e in entities],
        # ... remaining fields as in Step 8
    }
```

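
On the client side, the coordinates are enough to redact the text locally. A minimal sketch, assuming the response shape shown above; `redact_locally`, the `*` fill character, and the sample data are illustrative choices.

```python
def redact_locally(text: str, coordinates: list[dict], fill: str = "*") -> str:
    """Overwrite each reported span with a fill character of the same length."""
    chars = list(text)
    for item in coordinates:
        # Clamp to the text bounds in case the text changed since it was scanned
        start = max(0, item["start"])
        end = min(len(chars), item["end"])
        for index in range(start, end):
            chars[index] = fill
    return ''.join(chars)


text = "Call 555-123-4567 now"
response = {
    "entities": [{"type": "PHONE_US", "start": 5, "end": 17, "length": 12}],
    "coordinates_only": True,
}
print(redact_locally(text, response["entities"]))  # Call ************ now
```

Replacing characters in place keeps the text length unchanged, so overlapping or repeated spans cannot shift each other's offsets.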
## Conclusion

You've now built a multi-layer Privacy Scanner that can:

- Detect 40+ PII types using regex patterns
- Defeat obfuscation through text normalization
- Extract PII from JSON payloads and Base64 encodings
- Validate checksums to reduce false positives
- Provide a clean web interface for interactive scanning
- Operate in secure, coordinates-only mode

**Next steps** to enhance your scanner:

1. Add machine learning for name/address detection
2. Implement region-specific patterns (EU VAT, UK NI numbers)
3. Build CI/CD integration for automated pre-commit scanning
4. Add PDF and document parsing capabilities

The complete source code is available in the AI Tools Suite repository. Happy scanning!