---
title: "Building a Privacy Scanner: A Step-by-Step Implementation Guide"
author: "AI Tools Suite"
date: "2024-12-23"
categories: [tutorial, privacy, pii-detection, python, svelte]
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: false
    code-line-numbers: true
---

## Introduction

In this tutorial, we'll build a production-grade Privacy Scanner from scratch. By the end, you'll have a tool that detects 40+ types of Personally Identifiable Information (PII) using an eight-layer detection pipeline, complete with a modern web interface.

Our stack: **FastAPI** for the backend API, **SvelteKit** for the frontend, and **Python regex** with validation logic for detection.

## Step 1: Project Structure

First, create the project scaffolding:

```bash
mkdir -p ai_tools_suite/{backend/routers,frontend/src/routes/privacy-scanner}
cd ai_tools_suite
```

Your directory structure should look like:

```
ai_tools_suite/
├── backend/
│   ├── main.py
│   └── routers/
│       └── privacy.py
└── frontend/
    └── src/
        └── routes/
            └── privacy-scanner/
                └── +page.svelte
```

## Step 2: Define PII Patterns

The foundation of any PII scanner is its pattern library.
Create `backend/routers/privacy.py` and start with the core patterns:

```python
import re
from typing import List, Dict, Any
from pydantic import BaseModel

class PIIEntity(BaseModel):
    type: str
    value: str
    start: int
    end: int
    confidence: float
    context: str = ""

PII_PATTERNS = {
    # Identity Documents
    "SSN": {
        "pattern": r'\b\d{3}-\d{2}-\d{4}\b',
        "description": "US Social Security Number",
        "category": "identity"
    },
    "PASSPORT": {
        "pattern": r'\b[A-Z]{1,2}\d{6,9}\b',
        "description": "Passport Number",
        "category": "identity"
    },

    # Financial Information
    "CREDIT_CARD": {
        "pattern": r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b',
        "description": "Credit Card Number (Visa, MC, Amex)",
        "category": "financial"
    },
    "IBAN": {
        "pattern": r'\b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b',
        "description": "International Bank Account Number",
        "category": "financial"
    },

    # Contact Information
    "EMAIL": {
        # [A-Za-z], not [A-Z|a-z]: inside a character class, | is a literal pipe
        "pattern": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        "description": "Email Address",
        "category": "contact"
    },
    "PHONE_US": {
        "pattern": r'\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
        "description": "US Phone Number",
        "category": "contact"
    },
    # Add more patterns as needed...
}
```

Each pattern includes a regex, a human-readable description, and a category for risk classification.

## Step 3: Build the Basic Detection Engine

Add the core detection function:

```python
def detect_pii_basic(text: str) -> List[PIIEntity]:
    """Layer 1: Standard regex pattern matching."""
    entities = []

    for pii_type, config in PII_PATTERNS.items():
        pattern = re.compile(config["pattern"], re.IGNORECASE)
        for match in pattern.finditer(text):
            entity = PIIEntity(
                type=pii_type,
                value=match.group(),
                start=match.start(),
                end=match.end(),
                confidence=0.8,  # Base confidence
                # Up to 20 characters on either side of the match
                context=text[max(0, match.start()-20):match.end()+20]
            )
            entities.append(entity)

    return entities
```

This gives us working PII detection, but it's easily fooled by obfuscation.
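To see both the detection and its weakness, here is a small standalone sketch. It copies two patterns from `PII_PATTERNS` and uses a simplified `find_pii` helper (a hypothetical stand-in for `detect_pii_basic`, so it runs without pydantic); the sample strings are invented for illustration.

```python
import re

# Two patterns taken from PII_PATTERNS in Step 2; the sample strings are invented.
PATTERNS = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "EMAIL": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Hypothetical stand-in for detect_pii_basic: (type, value) pairs in pattern order."""
    hits = []
    for pii_type, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((pii_type, match.group()))
    return hits


plain = "Contact jane@example.com, SSN 123-45-6789."
obfuscated = "Contact jane [at] example [dot] com, SSN 1 2 3 - 4 5 - 6 7 8 9."

plain_hits = find_pii(plain)            # both values are found
obfuscated_hits = find_pii(obfuscated)  # spacing and [at]/[dot] defeat both patterns

print(plain_hits)
print(obfuscated_hits)
```

The second call returns an empty list, which is the gap the normalization layer is meant to close.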
## Step 4: Add Text Normalization (Layer 2)

Attackers often hide PII using separators, leetspeak, or Unicode tricks. Add normalization:

```python
def normalize_text(text: str) -> tuple[str, dict]:
    """Layer 2: Remove obfuscation techniques."""
    mappings = {}

    # Leetspeak conversion, limited to digits that touch a letter so that
    # purely numeric runs (SSNs, card numbers) keep their digits
    leet_map = {'0': 'o', '1': 'i', '3': 'e', '4': 'a', '5': 's', '7': 't'}
    normalized = re.sub(
        r'(?<=[A-Za-z])[013457]|[013457](?=[A-Za-z])',
        lambda m: leet_map[m.group()],
        text,
    )

    # Remove common separators
    normalized = re.sub(r'[\s\-\.\(\)]+', '', normalized)

    # Track position mappings for accurate reporting
    # (simplified - production code needs full position tracking)
    return normalized, mappings
```

Now `1-2-3-4-5-6-7-8-9` normalizes to `123456789`. The dashed `SSN` pattern from Step 2 no longer matches that form, so flagging it as a potential SSN needs an extra separator-free pattern such as `\b\d{9}\b` in the normalized pass.

## Step 5: Implement Checksum Validation (Layer 4)

Not every number sequence is valid PII. Add validation logic:

```python
def luhn_checksum(card_number: str) -> bool:
    """Validate credit card using Luhn algorithm."""
    digits = [int(d) for d in card_number if d.isdigit()]
    odd_digits = digits[-1::-2]   # Rightmost digit and every second digit before it
    even_digits = digits[-2::-2]  # The digits in between, which get doubled

    total = sum(odd_digits)
    for d in even_digits:
        total += sum(divmod(d * 2, 10))  # Add the digits of the doubled value

    return total % 10 == 0

def validate_iban(iban: str) -> bool:
    """Validate IBAN using MOD-97 algorithm."""
    iban = iban.replace(' ', '').upper()
    if len(iban) < 5 or not iban.isalnum():
        return False

    # Move first 4 chars to end
    rearranged = iban[4:] + iban[:4]

    # Convert letters to numbers (A=10, B=11, etc.)
    numeric = ''
    for char in rearranged:
        if char.isdigit():
            numeric += char
        else:
            numeric += str(ord(char) - 55)

    return int(numeric) % 97 == 1
```

With validation, we can boost confidence for valid numbers and flag invalid ones as `POSSIBLE_CARD_PATTERN`.
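As a quick check of that idea, the sketch below repeats the Luhn function so it runs on its own and wraps it in a hypothetical `classify_card` helper. The confidence values 0.95 and 0.5 are the ones the endpoint in Step 8 uses; the card numbers are a widely used Visa test number and the same number with its last digit changed.

```python
def luhn_checksum(card_number: str) -> bool:
    """Same Luhn check as above, repeated so the sketch runs on its own."""
    digits = [int(d) for d in card_number if d.isdigit()]
    total = sum(digits[-1::-2])
    for d in digits[-2::-2]:
        total += sum(divmod(d * 2, 10))
    return total % 10 == 0


def classify_card(value: str) -> tuple[str, float]:
    """Hypothetical helper: map a regex hit to (type, confidence)."""
    if luhn_checksum(value):
        return "CREDIT_CARD", 0.95
    return "POSSIBLE_CARD_PATTERN", 0.5


valid = classify_card("4111111111111111")    # common Visa test number, passes Luhn
invalid = classify_card("4111111111111112")  # last digit changed, fails Luhn

print(valid)
print(invalid)
```

Both strings match the `CREDIT_CARD` regex, so only the checksum separates a plausible card number from a random 16-digit run.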
## Step 6: JSON Blob Extraction (Layer 2.5)

PII often hides in JSON payloads within logs or messages:

```python
import json

def extract_json_strings(text: str) -> list[tuple[str, int, int]]:
    """Find and extract JSON objects from text."""
    json_objects = []

    # Find potential JSON starts
    for i, char in enumerate(text):
        if char == '{':
            depth = 0
            # Walk forward to the matching closing brace
            for j in range(i, len(text)):
                if text[j] == '{':
                    depth += 1
                elif text[j] == '}':
                    depth -= 1
                    if depth == 0:
                        try:
                            candidate = text[i:j+1]
                            json.loads(candidate)  # Validate
                            json_objects.append((candidate, i, j+1))
                        except json.JSONDecodeError:
                            pass
                        break

    return json_objects

def deep_scan_json(json_str: str) -> list[str]:
    """Recursively extract all string values from JSON."""
    values = []

    def extract(obj):
        if isinstance(obj, str):
            values.append(obj)
        elif isinstance(obj, dict):
            for v in obj.values():
                extract(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item)

    try:
        extract(json.loads(json_str))
    except json.JSONDecodeError:
        pass

    return values
```

## Step 7: Base64 Auto-Decoding (Layer 2.6)

Encoded PII is common in API responses and logs:

```python
import base64

def is_valid_base64(s: str) -> bool:
    """Check if string is valid base64."""
    if len(s) < 20 or len(s) % 4 != 0:
        return False
    try:
        decoded = base64.b64decode(s, validate=True)
        decoded.decode('utf-8')  # Must be valid UTF-8
        return True
    except ValueError:
        # Covers binascii.Error and UnicodeDecodeError
        return False

def decode_base64_strings(text: str) -> list[tuple[str, str, int, int]]:
    """Find and decode base64 strings."""
    results = []
    pattern = r'[A-Za-z0-9+/]{20,}={0,2}'

    for match in re.finditer(pattern, text):
        candidate = match.group()
        if is_valid_base64(candidate):
            try:
                decoded = base64.b64decode(candidate).decode('utf-8')
                results.append((candidate, decoded, match.start(), match.end()))
            except ValueError:
                pass

    return results
```

## Step 8: Build the FastAPI Endpoint

Wire everything together in an API endpoint:

```python
from fastapi import APIRouter, Form

router = APIRouter(prefix="/api/privacy", tags=["privacy"])

@router.post("/scan-text")
async def scan_text(
    text: str = Form(...),
    sensitivity: str = Form("medium")
):
    """Main PII scanning endpoint."""
    # Layer 1: Basic pattern matching
    entities = detect_pii_basic(text)

    # Layer 2: Normalized text scan
    normalized, mappings = normalize_text(text)
    normalized_entities = detect_pii_basic(normalized)
    # ... map positions back to original

    # Layer 2.5: JSON extraction
    for json_str, start, end in extract_json_strings(text):
        for value in deep_scan_json(json_str):
            entities.extend(detect_pii_basic(value))

    # Layer 2.6: Base64 decoding
    for original, decoded, start, end in decode_base64_strings(text):
        decoded_entities = detect_pii_basic(decoded)
        for e in decoded_entities:
            e.type = f"{e.type}_BASE64_ENCODED"
        entities.extend(decoded_entities)

    # Layer 4: Validation
    for entity in entities:
        if entity.type == "CREDIT_CARD":
            if luhn_checksum(entity.value):
                entity.confidence = 0.95
            else:
                entity.type = "POSSIBLE_CARD_PATTERN"
                entity.confidence = 0.5

    # Deduplicate and sort
    entities = deduplicate_entities(entities)

    # Generate masked preview
    redacted = mask_pii(text, entities)

    return {
        "entities": [e.dict() for e in entities],
        "redacted_preview": redacted,
        "summary": generate_summary(entities)
    }
```

## Step 9: Create the SvelteKit Frontend

Build an interactive UI in `frontend/src/routes/privacy-scanner/+page.svelte`:

```svelte

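<script>
  // Assumed reconstruction: posts the text as form data to the Step 8 endpoint
  let text = '';
  let results = null;
  async function scan() {
    const body = new FormData();
    body.append('text', text);
    const res = await fetch('/api/privacy/scan-text', { method: 'POST', body });
    results = await res.json();
  }
</script>

<h1>Privacy Scanner</h1>
<textarea bind:value={text}></textarea>
<button on:click={scan}>Scan</button>
{#if results}
  <h2>Results</h2>
  {#each results.entities as entity}
    <p>{entity.type}: {entity.value}</p>
  {/each}
  <pre>{results.redacted_preview}</pre>
{/if}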
```

## Step 10: Add Security Features

For production deployment, implement ephemeral processing:

```python
# In main.py - ensure no PII logging
import logging

class PIIFilter(logging.Filter):
    def filter(self, record):
        # Never log request bodies that might contain PII
        return 'text=' not in record.getMessage()

# Logger-level filters are skipped for records propagated from child loggers,
# so attach the filter to the root handlers as well
logging.getLogger().addFilter(PIIFilter())
for handler in logging.getLogger().handlers:
    handler.addFilter(PIIFilter())
```

And add a coordinates-only mode to the `/scan-text` endpoint for ultra-sensitive clients:

```python
@router.post("/scan-text")
async def scan_text(
    text: str = Form(...),
    coordinates_only: bool = Form(False)  # Client-side redaction mode
):
    # Assumes the Step 8 pipeline has been factored into detect_pii_multilayer()
    entities = detect_pii_multilayer(text)

    if coordinates_only:
        # Return only positions, not actual values
        return {
            "entities": [
                {"type": e.type, "start": e.start, "end": e.end, "length": e.end - e.start}
                for e in entities
            ],
            "coordinates_only": True
        }

    # Normal response with values
    return {"entities": [e.dict() for e in entities]}  # ... plus the other fields
```

## Conclusion

You've now built a multi-layer Privacy Scanner that can:

- Detect 40+ PII types using regex patterns
- Defeat obfuscation through text normalization
- Extract PII from JSON payloads and Base64 encodings
- Validate checksums to reduce false positives
- Provide a clean web interface for interactive scanning
- Operate in secure, coordinates-only mode

**Next steps** to enhance your scanner:

1. Add machine learning for name/address detection
2. Implement language-specific patterns (EU VAT, UK NI numbers)
3. Build CI/CD integration for automated pre-commit scanning
4. Add PDF and document parsing capabilities

The complete source code is available in the AI Tools Suite repository. Happy scanning!
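
## Appendix: Helper Sketches for Step 8

The Step 8 endpoint calls `deduplicate_entities`, `mask_pii`, and `generate_summary`, which this tutorial does not define. The sketch below shows one possible shape for them; it is an assumption, not the repository's implementation. The `Entity` dataclass is a stand-in for `PIIEntity` so the example runs without pydantic, and the `[TYPE]` mask format and summary layout are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Stand-in for PIIEntity so the sketch runs without pydantic."""
    type: str
    value: str
    start: int
    end: int
    confidence: float


def deduplicate_entities(entities: list[Entity]) -> list[Entity]:
    """Keep the highest-confidence entity per (start, end) span, sorted by position."""
    best: dict[tuple[int, int], Entity] = {}
    for e in entities:
        key = (e.start, e.end)
        if key not in best or e.confidence > best[key].confidence:
            best[key] = e
    return sorted(best.values(), key=lambda e: (e.start, e.end))


def mask_pii(text: str, entities: list[Entity]) -> str:
    """Replace each span with a [TYPE] tag, right to left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:e.start] + f"[{e.type}]" + text[e.end:]
    return text


def generate_summary(entities: list[Entity]) -> dict:
    """Count entities overall and by type."""
    by_type: dict[str, int] = {}
    for e in entities:
        by_type[e.type] = by_type.get(e.type, 0) + 1
    return {"total": len(entities), "by_type": by_type}


text = "SSN 123-45-6789, mail jane@example.com"
found = [
    Entity("EMAIL", "jane@example.com", 22, 38, 0.8),
    Entity("SSN", "123-45-6789", 4, 15, 0.8),
    Entity("SSN", "123-45-6789", 4, 15, 0.95),  # same span, higher confidence
]

unique = deduplicate_entities(found)
redacted = mask_pii(text, unique)
summary = generate_summary(unique)

print(redacted)
print(summary)
```

Two limits of this sketch are worth keeping in mind: `mask_pii` assumes non-overlapping spans, and it assumes every offset refers to the original text. Entities found inside decoded JSON values or Base64 payloads carry offsets relative to the decoded string, so they would need remapping before masking.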