Privacy Scanner: Security & Compliance White Paper

Enterprise-Grade PII Detection with Zero-Trust Architecture

security
compliance
enterprise
privacy
whitepaper
Author

AI Tools Suite

Published

December 23, 2024

1 Executive Summary

1.1 Value Realization

| Stakeholder | Primary Benefit |
|---|---|
| Developer | Prevents secrets/keys from ever reaching GitHub |
| Data Engineer | Automates PII scrubbing before data enters the warehouse |
| Compliance Officer | Provides proof of “Privacy by Design” for GDPR/SOC2 audits |
| CISO | Reduces the overall blast radius of a potential data breach |
| Legal/DPO | Supports DSAR (Data Subject Access Request) fulfillment |
| DevOps/SRE | Sanitizes logs before shipping to centralized observability |

The Privacy Scanner is an enterprise-grade Personally Identifiable Information (PII) detection and redaction solution designed with security-first principles. This white paper details the security architecture, compliance capabilities, and technical safeguards that make the Privacy Scanner suitable for organizations with stringent data protection requirements.

Key Highlights:

  • 40+ PII Types Detected across identity, financial, contact, medical, secret, and cryptocurrency categories
  • 8-Layer Detection Pipeline for comprehensive coverage including obfuscation bypass
  • Zero-Trust Architecture with optional client-side redaction mode
  • Ephemeral Processing - no data persistence, no logging of sensitive content
  • Supports Compliance Programs - technical controls aligned with GDPR, HIPAA, PCI-DSS, SOC 2, and CCPA requirements (tool assists compliance efforts; does not guarantee compliance)

2 Security Architecture

2.1 Defense in Depth

The Privacy Scanner implements multiple layers of security controls:

┌─────────────────────────────────────────────────────────────┐
│                    CLIENT BROWSER                           │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Client-Side Redaction Mode (Optional)              │   │
│  │  • Redaction performed locally in the browser       │   │
│  │  • Only coordinates returned from backend           │   │
│  │  • Maximum privacy guarantee                        │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    TRANSPORT LAYER                          │
│  • TLS 1.3 encryption in transit                           │
│  • Certificate pinning (recommended)                        │
│  • No sensitive data in URL parameters                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                        │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  FastAPI Backend                                    │   │
│  │  • Request validation via Pydantic                  │   │
│  │  • No database connections for scan operations      │   │
│  │  • Stateless processing                             │   │
│  │  • PII-filtered logging                             │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    PROCESSING LAYER                         │
│  • In-memory only - no disk writes                         │
│  • Automatic garbage collection post-response              │
│  • No caching of scanned content                           │
│  • Deterministic regex patterns (no ML model storage)      │
└─────────────────────────────────────────────────────────────┘

2.2 Ephemeral Processing Model

The Privacy Scanner operates on a strict ephemeral processing model:

| Aspect | Implementation |
|---|---|
| Data Retention | Zero - content exists only during request processing |
| Disk Writes | None - all processing in-memory |
| Database Storage | None - stateless architecture |
| Log Sanitization | PII-filtered logging prevents accidental exposure |
| Session State | None - each request is independent |

# Example: PII-Safe Logging Filter
import logging

class PIIFilter(logging.Filter):
    def filter(self, record):
        # Block any log message containing request body content.
        # getMessage() also covers values passed as formatting arguments.
        sensitive_patterns = ['text=', 'content=', 'body=']
        message = record.getMessage()
        return not any(p in message for p in sensitive_patterns)

2.3 Client-Side Redaction Mode

For organizations with ultra-sensitive data, the Privacy Scanner offers Coordinates-Only Mode:

Standard Mode:

Client → Server: "John's SSN is 123-45-6789"
Server → Client: {type: "SSN", value: "123-45-6789", masked: "[SSN:***-**-6789]"}

Client-Side Redaction Mode:

Client → Server: "John's SSN is 123-45-6789"
Server → Client: {type: "SSN", start: 14, end: 25, length: 11}
Client performs local redaction - actual PII value never returned

This mode ensures:

  • Backend never echoes PII values back to the client
  • Redaction occurs entirely in the browser
  • Suitable for air-gapped environments with strict data egress policies
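  • No PII values in server responses; the input text is still sent to the backend for scanning, so the transport and processing controls described above continue to apply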
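
The client-side step can be sketched as follows. This is a minimal illustration in Python (a browser client would perform the same slicing in JavaScript). The function name `redact_locally`, the `[TYPE]` placeholder format, and the assumption that `start` and `end` are zero-based with an exclusive end are hypothetical and not taken from the product documentation.

```python
from typing import TypedDict


class Coordinate(TypedDict):
    type: str
    start: int  # assumed zero-based, inclusive
    end: int    # assumed exclusive


def redact_locally(text: str, entities: list[Coordinate]) -> str:
    """Replace each reported span with a type placeholder such as [SSN]."""
    redacted = text
    # Work from the last span to the first so earlier offsets stay valid
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['type']}]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


text = "John's SSN is 123-45-6789"
coords: list[Coordinate] = [{"type": "SSN", "start": 14, "end": 25}]
print(redact_locally(text, coords))  # John's SSN is [SSN]
```

Replacing spans from right to left avoids having to recompute offsets after each substitution, which matters when a placeholder is shorter or longer than the text it replaces.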

3 Detection Capabilities

3.1 PII Categories and Types

The Privacy Scanner detects 40+ distinct PII types across six categories:

3.1.1 Identity Documents

| Type | Pattern | Validation |
|---|---|---|
| US Social Security Number (SSN) | XXX-XX-XXXX | Format + Area validation |
| US Medicare ID (MBI) | XAXX-XXX-XXXX | Format validation |
| US Driver’s License | State-specific | Context-aware |
| UK National Insurance | AB123456C | Format + prefix validation |
| Canadian SIN | XXX-XXX-XXX | Luhn checksum |
| India Aadhaar | 12 digits | Verhoeff checksum |
| India PAN | ABCDE1234F | Format validation |
| Australia TFN | 8-9 digits | Checksum validation |
| Brazil CPF | XXX.XXX.XXX-XX | MOD-11 checksum |
| Mexico CURP | 18 chars | Format validation |
| South Africa ID | 13 digits | Luhn checksum |
| Passport Numbers | Country-specific | Format validation |
| German Personalausweis | 10 chars | Context-aware |
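
As an illustration of what a checksum-backed identity check involves, the sketch below validates the two MOD-11 check digits of a Brazilian CPF. It follows the publicly documented CPF algorithm; the function name `cpf_is_valid` is hypothetical, and this is not presented as the scanner's own implementation.

```python
import re


def cpf_is_valid(candidate: str) -> bool:
    """Validate a Brazilian CPF using its two MOD-11 check digits."""
    digits = [int(ch) for ch in re.sub(r"\D", "", candidate)]
    # Reject wrong lengths and trivial repeats such as 111.111.111-11
    if len(digits) != 11 or len(set(digits)) == 1:
        return False

    def check_digit(prefix: list[int]) -> int:
        # Weights run from len(prefix) + 1 down to 2
        weights = range(len(prefix) + 1, 1, -1)
        total = sum(d * w for d, w in zip(prefix, weights))
        # A remainder of 10 is written as check digit 0
        return (total * 10) % 11 % 10

    return (
        digits[9] == check_digit(digits[:9])
        and digits[10] == check_digit(digits[:10])
    )


print(cpf_is_valid("529.982.247-25"))  # True
print(cpf_is_valid("529.982.247-26"))  # False
```

A pattern match alone would accept any eleven digits in the right layout; the check digits are what let a detector discard most random digit strings.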

3.1.2 Financial Information

| Type | Pattern | Validation |
|---|---|---|
| Credit Card (Visa/MC/Amex/Discover) | 13-19 digits | Luhn Algorithm |
| IBAN | Country + check digits + BBAN | MOD-97 Algorithm |
| SWIFT/BIC | 8 or 11 chars | Format + context |
| Bank Account Numbers | 8-17 digits | Context-aware |
| Routing/ABA Numbers | 9 digits | Context-aware |
| CUSIP | 9 chars | Check digit |
| ISIN | 12 chars | Luhn checksum |
| SEDOL | 7 chars | Checksum |
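
The two most common financial checks named above, Luhn and MOD-97, can be sketched in a few lines. The function names are hypothetical and the code shows the standard algorithms only; it does not include per-country IBAN length rules or card issuer prefixes, which a production validator would likely add.

```python
def luhn_is_valid(number: str) -> bool:
    """Luhn check for card numbers; separators such as '-' and ' ' are ignored."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Double every second digit, counting from the rightmost digit
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_is_valid(iban: str) -> bool:
    """ISO 13616 MOD-97 check: rearranged IBAN, read as a number, mod 97 == 1."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34) or not (compact.isascii() and compact.isalnum()):
        return False
    # Move country code and check digits to the end
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_is_valid("4111-1111-1111-1111"))         # True
print(iban_is_valid("GB82 WEST 1234 5698 7654 32"))  # True
```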

3.1.3 Contact Information

| Type | Pattern | Validation |
|---|---|---|
| Email Addresses | RFC 5322 compliant | Domain validation |
| Obfuscated Emails | [at], (dot) variants | TLD validation |
| US Phone Numbers | Multiple formats | Area code validation |
| International Phone | 30+ country codes | Country-specific |
| Physical Addresses | US format | Context-aware |

3.1.4 Secrets and API Keys

| Type | Pattern | Example |
|---|---|---|
| AWS Access Key | AKIA[A-Z0-9]{16} | AKIAIOSFODNN7EXAMPLE |
| AWS Secret Key | 40-char base64 | wJalrXUtnFEMI/K7MDENG... |
| GitHub Token | gh[pousr]_[A-Za-z0-9]{36,} | ghp_xxxxxxxxxxxx... |
| Slack Token | xox[baprs]-... | xoxb-123456-789012-... |
| Stripe Key | sk_live_... / pk_test_... | sk_live_abc123... |
| JWT Token | Base64.Base64.Base64 | eyJhbGci... |
| OpenAI API Key | sk-[A-Za-z0-9]{48} | sk-abc123... |
| Anthropic API Key | sk-ant-... | sk-ant-api03-... |
| Discord Token | Base64 format | Token pattern |
| Private Keys | PEM headers | -----BEGIN PRIVATE KEY----- |
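
A minimal sketch of how patterns like these could be applied is shown below. It covers only the three rows whose patterns are fully specified in the table; the "36 or more" quantifier is written as `{36,}`, which is the valid regex form. The names `SECRET_PATTERNS` and `find_secrets`, the word-boundary anchors, and the optional algorithm label in the PEM header are assumptions for illustration.

```python
import re

# Patterns transcribed from the table; \b anchors are an added assumption
SECRET_PATTERNS = {
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[A-Z0-9]{16}\b"),
    "GITHUB_TOKEN": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    # Also accepts headers such as "BEGIN RSA PRIVATE KEY"
    "PRIVATE_KEY": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[tuple[str, int, int]]:
    """Return (type, start, end) for every match, ordered by position."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])


print(find_secrets("key=AKIAIOSFODNN7EXAMPLE"))
# [('AWS_ACCESS_KEY', 4, 24)]
```

Returning positions rather than matched values keeps the helper compatible with a coordinates-only workflow, where the secret itself is never copied into a report.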

3.1.5 Medical Information

| Type | Pattern | Validation |
|---|---|---|
| Medical Record Number | 6-10 digits | Context-aware |
| NPI (Provider ID) | 10 digits | Luhn checksum |
| DEA Number | 2 letters + 7 digits | Checksum |
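
The NPI and DEA checks can be sketched as below. The NPI check applies Luhn to the number with the standard `80840` prefix prepended, and the DEA check uses the published weighted-sum rule. Function names are hypothetical, and the DEA function follows the "2 letters + 7 digits" shape from the table without validating registrant-type letters.

```python
def luhn_checksum_ok(digits: str) -> bool:
    """Plain Luhn check over a string of decimal digits."""
    total = 0
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def npi_is_valid(npi: str) -> bool:
    # The 10-digit NPI is checked as a 15-digit number with prefix 80840
    return len(npi) == 10 and npi.isdecimal() and luhn_checksum_ok("80840" + npi)


def dea_is_valid(dea: str) -> bool:
    if len(dea) != 9 or not dea[:2].isalpha() or not dea[2:].isdecimal():
        return False
    d = [int(ch) for ch in dea[2:]]
    # Digits 1, 3, 5 plus twice digits 2, 4, 6; last digit of the sum is the check digit
    total = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]


print(npi_is_valid("1234567893"))  # True
print(dea_is_valid("AB1234563"))   # True
```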

3.1.6 Cryptocurrency

| Type | Pattern | Validation |
|---|---|---|
| Bitcoin Address | 1, 3, or bc1 prefix | Base58Check / Bech32 |
| Ethereum Address | 0x + 40 hex | Checksum optional |
| Monero Address | 4 prefix, 95 chars | Format validation |
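
For legacy Bitcoin addresses, Base58Check validation amounts to decoding the address and comparing its trailing four bytes with a double SHA-256 of the rest. The sketch below uses only the standard library; the function name is hypothetical, and it assumes the common 25-byte layout, so it does not cover `bc1` (Bech32) addresses.

```python
import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58check_is_valid(address: str) -> bool:
    """Check a legacy (1... or 3...) Bitcoin address."""
    value = 0
    for ch in address:
        index = BASE58_ALPHABET.find(ch)
        if index < 0:
            return False  # character outside the Base58 alphabet
        value = value * 58 + index
    try:
        # 1 version byte + 20-byte hash + 4-byte checksum
        raw = value.to_bytes(25, "big")
    except OverflowError:
        return False
    payload, checksum = raw[:21], raw[21:]
    expected = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    return checksum == expected


# Address of the Bitcoin genesis block, widely published
print(base58check_is_valid("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"))  # True
```

A stricter validator would also confirm that the number of leading `1` characters matches the number of leading zero bytes; that step is omitted here for brevity.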

3.2 Eight-Layer Detection Pipeline

┌────────────────────────────────────────────────────────────────┐
│ INPUT TEXT                                                     │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 1: Unicode Normalization (NFKC)                         │
│ • Converts fullwidth chars: ｅｍａｉｌ → email                │
│ • Normalizes homoglyphs: е (Cyrillic) → e (Latin)             │
│ • Decodes HTML entities: &#64; → @                            │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2: Text Normalization                                   │
│ • Defanging reversal: [dot] → ., [at] → @                     │
│ • Smart "at" detection (TLD validation, false trigger filter) │
│ • Separator removal: 123-45-6789 → 123456789                  │
│ • Character unspacing: t-e-s-t → test                         │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2.5: Structured Data Extraction                         │
│ • JSON blob detection and deep value extraction               │
│ • Recursive scanning of nested objects/arrays                 │
│ • Key-value pair analysis                                     │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2.6: Encoding Detection                                 │
│ • Base64 auto-detection and decoding                          │
│ • UTF-8 validation of decoded content                         │
│ • Recursive PII scan on decoded payloads                      │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 3: Pattern Matching                                     │
│ • 40+ regex patterns with category classification             │
│ • Context-aware matching (lookbehind/lookahead)               │
│ • Multi-format support per PII type                           │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 4: Checksum Validation                                  │
│ • Luhn algorithm (credit cards, Canadian SIN)                 │
│ • MOD-97 (IBAN)                                               │
│ • Verhoeff (Aadhaar)                                          │
│ • Custom checksums (DEA, NPI)                                 │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 5: Context Analysis                                     │
│ • Surrounding text analysis for disambiguation                │
│ • False positive filtering (connection strings, UUIDs)        │
│ • Confidence adjustment based on context                      │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 6: Deduplication & Scoring                              │
│ • Overlapping entity resolution                               │
│ • Confidence score aggregation                                │
│ • Risk level classification                                   │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ OUTPUT: Structured PII Report                                  │
│ • Entity list with types, values, positions, confidence       │
│ • Redacted text preview                                       │
│ • Risk assessment summary                                     │
└────────────────────────────────────────────────────────────────┘
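
The first two layers can be sketched with the standard library. Note that NFKC normalization by itself folds fullwidth and other compatibility characters but does not convert cross-script look-alikes such as Cyrillic "е" to Latin "e", and it does not decode HTML entities, so this sketch assumes those are handled by separate steps. The tiny `HOMOGLYPHS` table and the function names are hypothetical.

```python
import html
import re
import unicodedata

# Assumed, deliberately small confusables table (Cyrillic -> Latin);
# NFKC alone leaves these characters untouched
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
})


def normalize_unicode(text: str) -> str:
    """Layer 1 sketch: entity decoding, NFKC folding, homoglyph mapping."""
    text = html.unescape(text)                  # &#64; -> @
    text = unicodedata.normalize("NFKC", text)  # fullwidth -> ASCII
    return text.translate(HOMOGLYPHS)           # look-alikes -> Latin


def refang(text: str) -> str:
    """Layer 2 sketch: reverse bracketed defanging such as [at] and (dot)."""
    text = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", text, flags=re.IGNORECASE)
    return text


print(refang(normalize_unicode("john[at]gmail[dot]com")))  # john@gmail.com
```

Only bracketed forms are reversed here. Handling a bare "at" between words needs the TLD validation and false-trigger filtering that Layer 2 describes, which this sketch does not attempt.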

3.3 Anti-Evasion Capabilities

The Privacy Scanner is designed to detect PII even when intentionally obfuscated:

| Evasion Technique | Example | Detection Method |
|---|---|---|
| Defanging | john[at]gmail[dot]com | Layer 2 normalization |
| Spacing | j-o-h-n @ g-m-a-i-l | Character joining |
| Leetspeak | j0hn@gm4il.c0m | Leetspeak reversal |
| Unicode tricks | ｊｏｈｎ@ｇｍａｉｌ.ｃｏｍ | NFKC normalization |
| HTML encoding | john&#64;gmail.com | Entity decoding |
| Base64 hiding | am9obkBnbWFpbC5jb20= | Auto-decode + scan |
| JSON embedding | {"email":"john@gmail.com"} | Deep extraction |
| Number formatting | 123.45.6789 (SSN with dots) | Multi-separator support |
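
The Base64 and JSON rows can be illustrated with a short sketch. It assumes a candidate token is any run of 16 or more Base64 characters and that decoded bytes are only worth scanning when they form valid UTF-8, in line with the Layer 2.6 description. The threshold, regex, and function names are hypothetical.

```python
import base64
import binascii
import json
import re
from typing import Any, Iterator

# Assumed heuristic: at least 16 Base64 characters plus optional padding
BASE64_TOKEN = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")


def decoded_payloads(text: str) -> Iterator[str]:
    """Yield UTF-8 text recovered from Base64-looking tokens."""
    for match in BASE64_TOKEN.finditer(text):
        try:
            raw = base64.b64decode(match.group(), validate=True)
            yield raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text once decoded


def json_string_values(node: Any) -> Iterator[str]:
    """Walk nested dicts and lists, yielding every string value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from json_string_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from json_string_values(item)


print(list(decoded_payloads("blob am9obkBnbWFpbC5jb20= end")))
# ['john@gmail.com']
print(list(json_string_values(json.loads('{"user": {"emails": ["john@gmail.com"]}}'))))
# ['john@gmail.com']
```

Each recovered string would then be passed back through the normal pattern-matching layers, which is what makes the scan recursive.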

4 Compliance Mapping

4.1 GDPR (General Data Protection Regulation)

| GDPR Requirement | Privacy Scanner Capability |
|---|---|
| Art. 5(1)(c) - Data Minimization | Client-side redaction mode ensures minimal data processing |
| Art. 5(1)(e) - Storage Limitation | Zero data retention - ephemeral processing only |
| Art. 25 - Privacy by Design | Built-in PII detection before data enters downstream systems |
| Art. 32 - Security of Processing | TLS encryption, no persistent storage, PII-filtered logs |
| Art. 33/34 - Breach Notification | Detection of exposed PII in logs/documents aids breach assessment |

GDPR PII Types Detected:

  • Names (via context analysis)
  • Email addresses
  • Phone numbers (EU formats)
  • National IDs (UK NI, German Ausweis)
  • Financial identifiers (IBAN, EU VAT)
  • IP addresses
  • Physical addresses

4.2 HIPAA (Health Insurance Portability and Accountability Act)

| HIPAA Requirement | Privacy Scanner Capability |
|---|---|
| §164.502 - Minimum Necessary | Detects PHI before transmission to reduce exposure |
| §164.312(a)(1) - Access Control | Coordinates-only mode prevents PHI echo |
| §164.312(c)(1) - Integrity | Immutable detection - no modification of source data |
| §164.312(e)(1) - Transmission Security | TLS 1.3 for all communications |
| §164.530(c) - Safeguards | Multi-layer detection prevents PHI leakage |

HIPAA PHI Types Detected:

  • Social Security Numbers
  • Medicare Beneficiary Identifiers (MBI)
  • Medical Record Numbers
  • NPI (National Provider Identifier)
  • DEA Numbers
  • Dates of Birth
  • Phone Numbers
  • Email Addresses
  • Physical Addresses

4.3 PCI-DSS (Payment Card Industry Data Security Standard)

| PCI-DSS Requirement | Privacy Scanner Capability |
|---|---|
| Req. 3.4 - Render PAN Unreadable | Automatic credit card detection and masking |
| Req. 4.1 - Encrypt Transmission | TLS 1.3 encryption |
| Req. 6.5 - Secure Development | Input validation, no SQL/command injection vectors |
| Req. 10.2 - Audit Trails | PII-safe logging with detection events |
| Req. 12.3 - Usage Policies | Supports policy enforcement via API integration |

PCI-DSS Data Types Detected:

  • Primary Account Numbers (PAN)
      • Visa, Mastercard, Amex, Discover
      • Luhn validation reduces false positives
      • Detects formatted (4111-1111-1111-1111) and unformatted (4111111111111111)
  • Bank routing numbers
  • IBAN/SWIFT codes

4.4 SOC 2 (System and Organization Controls)

| SOC 2 Criteria | Privacy Scanner Capability |
|---|---|
| CC6.1 - Logical Access | API-based access with optional authentication |
| CC6.6 - System Boundaries | Clear input/output contracts via OpenAPI spec |
| CC6.7 - Transmission Integrity | TLS encryption, request validation |
| CC7.2 - System Monitoring | Structured detection logs (without PII content) |
| P1.1 - Privacy Notice | Transparent processing - documented detection categories |

4.5 CCPA (California Consumer Privacy Act)

| CCPA Requirement | Privacy Scanner Capability |
|---|---|
| §1798.100 - Right to Know | Identifies all PII categories in documents |
| §1798.105 - Right to Delete | Supports identification for deletion workflows |
| §1798.110 - Disclosure | Structured output for compliance reporting |

5 Integration Patterns

5.1 Pre-Commit Hook (Developer Workflow)

#!/bin/bash
# .git/hooks/pre-commit

# Scan staged files for PII (added, copied, or modified; deleted files are skipped)
while IFS= read -r file; do
    # "text=<file" makes curl send the file's contents as the field value
    response=$(curl -s -X POST http://localhost:8000/api/privacy/scan-text \
        -F "text=<$file" \
        -F "coordinates_only=true")

    count=$(echo "$response" | jq '.entities | length')
    if [ "${count:-0}" -gt 0 ]; then
        echo "PII detected in $file - commit blocked"
        exit 1
    fi
done < <(git diff --cached --name-only --diff-filter=ACM)

5.2 CI/CD Pipeline Integration

# GitHub Actions example
- name: PII Scan
  run: |
    # -print0 with read -d '' keeps file names containing spaces intact
    while IFS= read -r -d '' file; do
      result=$(curl -s -X POST "$PII_SCANNER_URL/api/privacy/scan-text" \
        -F "text=<$file")
      if echo "$result" | jq -e '.entities | length > 0' > /dev/null; then
        echo "::error::PII detected in $file"
        exit 1
      fi
    done < <(find . \( -name "*.log" -o -name "*.json" \) -print0)

5.3 Data Pipeline Integration

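# Apache Airflow DAG example
import os

import requests
from airflow.decorators import task

# Scanner base URL; the localhost default mirrors the pre-commit example
PII_SCANNER_URL = os.environ.get("PII_SCANNER_URL", "http://localhost:8000")

@task
def scan_for_pii(data: str, coordinates_only: bool = True) -> dict:
    """Scan data for PII before loading to data warehouse"""
    response = requests.post(
        f"{PII_SCANNER_URL}/api/privacy/scan-text",
        data={
            "text": data,
            # Send an explicit lowercase form value rather than Python's "True"
            "coordinates_only": str(coordinates_only).lower(),
        },
        timeout=30,
    )
    response.raise_for_status()  # fail the task on HTTP errors
    result = response.json()

    if result.get("entities"):
        raise ValueError(f"PII detected: {len(result['entities'])} entities")

    return {"status": "clean", "data": data}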

5.4 Log Sanitization Service

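# Real-time log sanitization
import asyncio
import os

import aiohttp

# Scanner base URL; the localhost default mirrors the pre-commit example
PII_SCANNER_URL = os.environ.get("PII_SCANNER_URL", "http://localhost:8000")

async def sanitize_line(session: aiohttp.ClientSession, line: str) -> str:
    """Return the redacted form of one log line"""
    async with session.post(
        f"{PII_SCANNER_URL}/api/privacy/scan-text",
        data={"text": line},
    ) as resp:
        resp.raise_for_status()
        result = await resp.json()
    # Fall back to the original line if the response has no redacted preview
    return result.get("redacted_preview", line)

async def sanitize_log_stream(log_lines: list[str]) -> list[str]:
    """Sanitize logs before shipping to centralized logging"""
    async with aiohttp.ClientSession() as session:
        # gather() preserves input order, so results line up with log_lines
        return await asyncio.gather(
            *(sanitize_line(session, line) for line in log_lines)
        )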

6 Performance Characteristics

6.1 Benchmarks

| Metric | Value | Conditions |
|---|---|---|
| Throughput | ~10,000 chars/sec | Single-threaded, all layers enabled |
| Latency (P50) | <50ms | 1KB text input |
| Latency (P99) | <200ms | 10KB text input |
| Memory Usage | <100MB | Per-request peak |
| Startup Time | <2 seconds | Cold start with pattern compilation |

6.2 Scalability

The Privacy Scanner is designed for horizontal scalability:

  • Stateless Architecture: Any instance can handle any request
  • No Shared State: No database or cache dependencies for scan operations
  • Container-Ready: Single-process model ideal for Kubernetes
  • Load Balancer Compatible: Round-robin distribution works optimally
# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: privacy-scanner
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: privacy-scanner
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

7 Deployment Options

7.1 On-Premises

For maximum data sovereignty:

# Docker deployment
docker run -d \
  --name privacy-scanner \
  -p 8000:8000 \
  --memory=512m \
  --cpus=1 \
  privacy-scanner:latest

Benefits:

  • Data never leaves your network
  • Full control over infrastructure
  • No external dependencies

7.2 Private Cloud (VPC)

# AWS VPC deployment example
resource "aws_ecs_service" "privacy_scanner" {
  name            = "privacy-scanner"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.privacy_scanner.arn
  desired_count   = 2

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.privacy_scanner.id]
    assign_public_ip = false  # No public access
  }
}

Benefits:

  • Network isolation via VPC
  • Integration with cloud IAM
  • Auto-scaling capabilities

7.3 Air-Gapped Deployment

For highly restricted environments:

  1. Client-Side Redaction Mode: Backend only returns coordinates
  2. No Outbound Connections: Zero external API calls
  3. Offline Pattern Updates: Manual pattern file updates
  4. Local-Only Logging: No telemetry or metrics export

8 Security Hardening Checklist

8.1 Pre-Deployment

8.2 Runtime

8.3 Audit


9 Appendix A: API Reference

9.1 Scan Text Endpoint

POST /api/privacy/scan-text
Content-Type: multipart/form-data

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text content to scan |
| coordinates_only | boolean | No | Return only positions (default: false) |
| detect_emails | boolean | No | Enable email detection (default: true) |
| detect_phones | boolean | No | Enable phone detection (default: true) |
| detect_ssn | boolean | No | Enable SSN detection (default: true) |
| detect_credit_cards | boolean | No | Enable credit card detection (default: true) |
| detect_secrets | boolean | No | Enable secrets detection (default: true) |

Response (Standard Mode):

{
  "entities": [
    {
      "type": "EMAIL",
      "value": "john@example.com",
      "masked_value": "[EMAIL:j***@example.com]",
      "start": 15,
      "end": 31,
      "confidence": 0.95,
      "category": "pii"
    }
  ],
  "redacted_preview": "Contact: [EMAIL:j***@example.com] for info",
  "summary": {
    "total_entities": 1,
    "by_category": {"pii": 1},
    "risk_level": "medium"
  }
}

Response (Coordinates-Only Mode):

{
  "entities": [
    {
      "type": "EMAIL",
      "start": 15,
      "end": 31,
      "length": 16
    }
  ],
  "coordinates_only": true
}

10 Appendix B: Confidence Scoring

| Confidence Level | Score Range | Meaning |
|---|---|---|
| Very High | 0.95 - 1.00 | Checksum validated (Luhn, MOD-97) |
| High | 0.85 - 0.94 | Strong pattern match with context |
| Medium | 0.70 - 0.84 | Pattern match, limited context |
| Low | 0.50 - 0.69 | Possible match, needs review |
| Uncertain | < 0.50 | Flagged for manual review |

Confidence Adjustments:

  • +15%: Checksum validation passed
  • +10%: Contextual keywords present (e.g., “SSN:”, “card number”)
  • -30%: Anti-context detected (e.g., “order number”, “reference ID”)
  • -20%: Common false positive pattern (UUID format, connection string)

11 Appendix C: Version History

| Version | Date | Changes |
|---|---|---|
| 1.1 | 2024-12-23 | Added international IDs (UK NI, Canadian SIN, India Aadhaar/PAN, etc.), cloud tokens (OpenAI, Anthropic, Discord), crypto addresses, financial identifiers (CUSIP, ISIN), improved false positive filtering |
| 1.0 | 2024-12-20 | Initial release with 30+ PII types, 8-layer detection pipeline |

12 Contact & Support

For enterprise licensing, custom integrations, or security assessments:

  • Documentation: See privacy-scanner-overview.qmd and building-privacy-scanner.qmd
  • Issues: Report via your organization’s support channel
  • Updates: Pattern updates released quarterly

This document is intended for enterprise security and compliance teams evaluating the Privacy Scanner for production deployment. All technical specifications are subject to change. Please refer to the latest documentation for current capabilities.