Privacy Scanner: Security & Compliance White Paper

Enterprise-Grade PII Detection with Zero-Trust Architecture

Keywords: security, compliance, enterprise, privacy, whitepaper

Author: AI Tools Suite
Published: December 23, 2024

1 Executive Summary

1.1 Value Realization

Stakeholder	Primary Benefit
Developer	Prevents secrets/keys from ever reaching GitHub
Data Engineer	Automates PII scrubbing before data enters the warehouse
Compliance Officer	Provides proof of “Privacy by Design” for GDPR/SOC2 audits
CISO	Reduces the overall blast radius of a potential data breach
Legal/DPO	Supports DSAR (Data Subject Access Request) fulfillment
DevOps/SRE	Sanitizes logs before shipping to centralized observability

The Privacy Scanner is an enterprise-grade Personally Identifiable Information (PII) detection and redaction solution designed with security-first principles. This white paper details the security architecture, compliance capabilities, and technical safeguards that make the Privacy Scanner suitable for organizations with stringent data protection requirements.

Key Highlights:

40+ PII Types Detected across identity, financial, contact, medical, secret, and cryptocurrency categories
8-Layer Detection Pipeline for comprehensive coverage including obfuscation bypass
Zero-Trust Architecture with optional client-side redaction mode
Ephemeral Processing - no data persistence, no logging of sensitive content
Supports Compliance Programs - technical controls aligned with GDPR, HIPAA, PCI-DSS, SOC 2, and CCPA requirements (tool assists compliance efforts; does not guarantee compliance)

2 Security Architecture

2.1 Defense in Depth

The Privacy Scanner implements multiple layers of security controls:

┌─────────────────────────────────────────────────────────────┐
│                    CLIENT BROWSER                           │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Client-Side Redaction Mode (Optional)              │   │
│  │  • PII values not echoed in responses               │   │
│  │  • Only coordinates returned from backend           │   │
│  │  • Maximum privacy guarantee                        │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    TRANSPORT LAYER                          │
│  • TLS 1.3 encryption in transit                           │
│  • Certificate pinning (recommended)                        │
│  • No sensitive data in URL parameters                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                        │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  FastAPI Backend                                    │   │
│  │  • Request validation via Pydantic                  │   │
│  │  • No database connections for scan operations      │   │
│  │  • Stateless processing                             │   │
│  │  • PII-filtered logging                             │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    PROCESSING LAYER                         │
│  • In-memory only - no disk writes                         │
│  • Automatic garbage collection post-response              │
│  • No caching of scanned content                           │
│  • Deterministic regex patterns (no ML model storage)      │
└─────────────────────────────────────────────────────────────┘

2.2 Ephemeral Processing Model

The Privacy Scanner operates on a strict ephemeral processing model:

Aspect	Implementation
Data Retention	Zero - content exists only during request processing
Disk Writes	None - all processing in-memory
Database Storage	None - stateless architecture
Log Sanitization	PII-filtered logging prevents accidental exposure
Session State	None - each request is independent

# Example: PII-Safe Logging Filter
import logging

class PIIFilter(logging.Filter):
    # Markers suggesting request body content was written into the message
    SENSITIVE_MARKERS = ('text=', 'content=', 'body=')

    def filter(self, record):
        # getMessage() applies %-style args, so interpolated values are checked too
        message = record.getMessage()
        return not any(m in message for m in self.SENSITIVE_MARKERS)

2.3 Client-Side Redaction Mode

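For organizations with ultra-sensitive data, the Privacy Scanner offers a coordinates-only mode (the `coordinates_only` request parameter), referred to in this paper as Client-Side Redaction Mode: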

Standard Mode:

Client → Server: "John's SSN is 123-45-6789"
Server → Client: {type: "SSN", value: "123-45-6789", masked: "[SSN:***-**-6789]"}

Client-Side Redaction Mode:

Client → Server: "John's SSN is 123-45-6789"
Server → Client: {type: "SSN", start: 14, end: 25, length: 11}
Client performs local redaction - actual PII value never returned

This mode ensures:

Backend never echoes PII values back to the client
Redaction occurs entirely in the browser
Suitable for air-gapped environments with strict data egress policies
Lower leakage risk on the return path: the server still receives the input text for scanning, but nothing it returns contains PII values
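
The local redaction step can be sketched as follows. This is a minimal illustration in Python (a browser client would do the equivalent in JavaScript), not the scanner's actual client code. It assumes offsets are 0-based with an exclusive `end`, which this paper does not state; the function name `redact_with_coordinates` and the `[TYPE]` placeholder format are hypothetical.

```python
from typing import Dict, List


def redact_with_coordinates(text: str, entities: List[Dict]) -> str:
    """Replace each reported span with a [TYPE] placeholder, locally.

    Assumes 0-based offsets with an exclusive end (an assumption, not a
    documented guarantee of the Privacy Scanner API).
    """
    redacted = text
    # Apply replacements right-to-left so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['type']}]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


if __name__ == "__main__":
    sample = "John's SSN is 123-45-6789"
    coords = [{"type": "SSN", "start": 14, "end": 25}]
    print(redact_with_coordinates(sample, coords))  # John's SSN is [SSN]
```

Working from the end of the string toward the start means a placeholder of a different length than the original value never shifts the offsets of spans that are still to be replaced.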

3 Detection Capabilities

3.1 PII Categories and Types

The Privacy Scanner detects 40+ distinct PII types across six categories:

3.1.1 Identity Documents

Type	Pattern	Validation
US Social Security Number (SSN)	`XXX-XX-XXXX`	Format + Area validation
US Medicare ID (MBI)	`XAXX-XXX-XXXX`	Format validation
US Driver’s License	State-specific	Context-aware
UK National Insurance	`AB123456C`	Format + prefix validation
Canadian SIN	`XXX-XXX-XXX`	Luhn checksum
India Aadhaar	12 digits	Verhoeff checksum
India PAN	`ABCDE1234F`	Format validation
Australia TFN	8-9 digits	Checksum validation
Brazil CPF	`XXX.XXX.XXX-XX`	MOD-11 checksum
Mexico CURP	18 chars	Format validation
South Africa ID	13 digits	Luhn checksum
Passport Numbers	Country-specific	Format validation
German Personalausweis	10 chars	Context-aware

3.1.2 Financial Information

Type	Pattern	Validation
Credit Card (Visa/MC/Amex/Discover)	13-19 digits	Luhn Algorithm
IBAN	Country + check digits + BBAN	MOD-97 Algorithm
SWIFT/BIC	8 or 11 chars	Format + context
Bank Account Numbers	8-17 digits	Context-aware
Routing/ABA Numbers	9 digits	Context-aware
CUSIP	9 chars	Check digit
ISIN	12 chars	Luhn checksum
SEDOL	7 chars	Checksum

3.1.3 Contact Information

Type	Pattern	Validation
Email Addresses	RFC 5322 compliant	Domain validation
Obfuscated Emails	`[at]`, `(dot)` variants	TLD validation
US Phone Numbers	Multiple formats	Area code validation
International Phone	30+ country codes	Country-specific
Physical Addresses	US format	Context-aware

3.1.4 Secrets and API Keys

Type	Pattern	Example
AWS Access Key	`AKIA[A-Z0-9]{16}`	`AKIAIOSFODNN7EXAMPLE`
AWS Secret Key	40-char base64	`wJalrXUtnFEMI/K7MDENG...`
GitHub Token	`gh[pousr]_[A-Za-z0-9]{36,}`	`ghp_xxxxxxxxxxxx...`
Slack Token	`xox[baprs]-...`	`xoxb-123456-789012-...`
Stripe Key	`sk_live_...` / `pk_test_...`	`sk_live_abc123...`
JWT Token	Base64.Base64.Base64	`eyJhbGci...`
OpenAI API Key	`sk-[A-Za-z0-9]{48}`	`sk-abc123...`
Anthropic API Key	`sk-ant-...`	`sk-ant-api03-...`
Discord Token	Base64 format	Token pattern
Private Keys	PEM headers	`-----BEGIN PRIVATE KEY-----`
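
To show how patterns like these can be applied, here is a small sketch that scans text with three of the listed formats. The dictionary `SECRET_PATTERNS`, the labels, and `find_secrets` are hypothetical names chosen for illustration; the word boundaries and the optional key-type word in the PEM header are assumptions added here, and the GitHub quantifier is written `{36,}`, the regex spelling of "36 or more".

```python
import re
from typing import List, Tuple

# Illustrative subset of the table; not the scanner's real pattern set.
SECRET_PATTERNS = {
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[A-Z0-9]{16}\b"),
    "GITHUB_TOKEN": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "PRIVATE_KEY": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) for every match, ordered by position."""
    hits = []
    for label, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    print(find_secrets("key=AKIAIOSFODNN7EXAMPLE"))
    # [('AWS_ACCESS_KEY', 4, 24)]
```

Returning positions rather than matched values keeps the sketch compatible with the coordinates-only idea described earlier in the paper.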

3.1.5 Medical Information

Type	Pattern	Validation
Medical Record Number	6-10 digits	Context-aware
NPI (Provider ID)	10 digits	Luhn checksum
DEA Number	2 letters + 7 digits	Checksum
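
The two medical checksums can be sketched as follows, using the publicly documented rules: a DEA number's seventh digit is the last digit of (d1 + d3 + d5) + 2 × (d2 + d4 + d6), and an NPI passes Luhn once the constant prefix 80840 is prepended. The function names are hypothetical, the DEA format check follows the "2 letters + 7 digits" pattern from the table, and the scanner's own implementation may differ.

```python
import re


def dea_valid(dea: str) -> bool:
    """Check the DEA registration number check digit."""
    candidate = dea.strip().upper()
    if not re.fullmatch(r"[A-Z]{2}\d{7}", candidate):
        return False
    d = [int(ch) for ch in candidate[2:]]
    total = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]


def _luhn_ok(digits: str) -> bool:
    """Luhn checksum over a string that contains digits only."""
    total = 0
    for index, ch in enumerate(reversed(digits)):
        value = int(ch)
        if index % 2 == 1:
            value = value * 2 - 9 if value > 4 else value * 2
        total += value
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """Check a 10-digit NPI using Luhn over '80840' + NPI."""
    if not re.fullmatch(r"\d{10}", npi):
        return False
    return _luhn_ok("80840" + npi)


if __name__ == "__main__":
    print(dea_valid("AB1234563"))   # True
    print(npi_valid("1234567893"))  # True
```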

3.1.6 Cryptocurrency

Type	Pattern	Validation
Bitcoin Address	`1`, `3`, or `bc1` prefix	Base58Check / Bech32
Ethereum Address	`0x` + 40 hex	Checksum optional
Monero Address	`4` prefix, 95 chars	Format validation

3.2 Eight-Layer Detection Pipeline

┌────────────────────────────────────────────────────────────────┐
│ INPUT TEXT                                                     │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 1: Unicode Normalization (NFKC)                         │
│ • Converts fullwidth chars: ｅｍａｉｌ → email                │
│ • Homoglyph mapping (beyond NFKC): е (Cyrillic) → e (Latin)   │
│ • Decodes HTML entities: &#64; → @                            │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2: Text Normalization                                   │
│ • Defanging reversal: [dot] → ., [at] → @                     │
│ • Smart "at" detection (TLD validation, false trigger filter) │
│ • Separator removal: 123-45-6789 → 123456789                  │
│ • Character unspacing: t-e-s-t → test                         │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2.5: Structured Data Extraction                         │
│ • JSON blob detection and deep value extraction               │
│ • Recursive scanning of nested objects/arrays                 │
│ • Key-value pair analysis                                     │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2.6: Encoding Detection                                 │
│ • Base64 auto-detection and decoding                          │
│ • UTF-8 validation of decoded content                         │
│ • Recursive PII scan on decoded payloads                      │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 3: Pattern Matching                                     │
│ • 40+ regex patterns with category classification             │
│ • Context-aware matching (lookbehind/lookahead)               │
│ • Multi-format support per PII type                           │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 4: Checksum Validation                                  │
│ • Luhn algorithm (credit cards, Canadian SIN)                 │
│ • MOD-97 (IBAN)                                               │
│ • Verhoeff (Aadhaar)                                          │
│ • Custom checksums (DEA, NPI)                                 │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 5: Context Analysis                                     │
│ • Surrounding text analysis for disambiguation                │
│ • False positive filtering (connection strings, UUIDs)        │
│ • Confidence adjustment based on context                      │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 6: Deduplication & Scoring                              │
│ • Overlapping entity resolution                               │
│ • Confidence score aggregation                                │
│ • Risk level classification                                   │
└────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ OUTPUT: Structured PII Report                                  │
│ • Entity list with types, values, positions, confidence       │
│ • Redacted text preview                                       │
│ • Risk assessment summary                                     │
└────────────────────────────────────────────────────────────────┘
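
Layer 1 can be illustrated with Python's standard library. This is a sketch under stated assumptions, not the scanner's code: `normalize_layer1` and the `HOMOGLYPHS` table are hypothetical, and the table holds only three Cyrillic letters for demonstration. Note that NFKC on its own folds fullwidth characters but leaves cross-script look-alikes such as Cyrillic е untouched, so a separate mapping is needed for those.

```python
import html
import unicodedata

# Tiny illustrative look-alike table (Cyrillic -> Latin). A real deployment
# would need a far larger mapping; this one is an assumption for the demo.
HOMOGLYPHS = str.maketrans({"\u0435": "e", "\u0430": "a", "\u043e": "o"})


def normalize_layer1(text: str) -> str:
    """Decode HTML entities, apply NFKC, then map known homoglyphs."""
    decoded = html.unescape(text)                    # &#64; -> @
    folded = unicodedata.normalize("NFKC", decoded)  # fullwidth -> ASCII
    return folded.translate(HOMOGLYPHS)              # Cyrillic е -> Latin e


if __name__ == "__main__":
    print(normalize_layer1("ｊｏｈｎ＠ｇｍａｉｌ．ｃｏｍ"))  # john@gmail.com
    print(normalize_layer1("john&#64;gmail.com"))  # john@gmail.com
```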

3.3 Anti-Evasion Capabilities

The Privacy Scanner is designed to detect PII even when intentionally obfuscated:

Evasion Technique	Example	Detection Method
Defanging	`john[at]gmail[dot]com`	Layer 2 normalization
Spacing	`j-o-h-n @ g-m-a-i-l`	Character joining
Leetspeak	`j0hn@gm4il.c0m`	Leetspeak reversal
Unicode tricks	`ｊｏｈｎ＠ｇｍａｉｌ．ｃｏｍ`	NFKC normalization
HTML encoding	`john&#64;gmail.com`	Entity decoding
Base64 hiding	`am9obkBnbWFpbC5jb20=`	Auto-decode + scan
JSON embedding	`{"email":"john@gmail.com"}`	Deep extraction
Number formatting	`123.45.6789` (SSN with dots)	Multi-separator support
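
Two of the techniques in the table, defanging reversal and Base64 auto-decoding, can be sketched as below. The function names `refang` and `try_base64` and the exact regular expressions are assumptions made for illustration; the scanner's "smart" handling of a bare word "at" and its heuristics for deciding which tokens to try decoding are not reproduced here.

```python
import base64
import re
from typing import Optional

# Bracketed or parenthesised "at"/"dot" markers, with optional spaces around.
_DEFANG_RULES = [
    (re.compile(r"\s*[\[\(]\s*at\s*[\]\)]\s*", re.IGNORECASE), "@"),
    (re.compile(r"\s*[\[\(]\s*dot\s*[\]\)]\s*", re.IGNORECASE), "."),
]


def refang(text: str) -> str:
    """Turn john[at]gmail[dot]com style text back into a plain address."""
    for pattern, replacement in _DEFANG_RULES:
        text = pattern.sub(replacement, text)
    return text


def try_base64(token: str) -> Optional[str]:
    """Return the decoded UTF-8 text, or None if the token is not valid."""
    try:
        raw = base64.b64decode(token, validate=True)
        return raw.decode("utf-8")
    except ValueError:
        # Covers binascii.Error (bad alphabet/padding) and UnicodeDecodeError.
        return None


if __name__ == "__main__":
    print(refang("john[at]gmail[dot]com"))     # john@gmail.com
    print(try_base64("am9obkBnbWFpbC5jb20="))  # john@gmail.com
```

The decoded text would then be passed back through the pattern-matching layers, which is what the table calls "Auto-decode + scan".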

4 Compliance Mapping

4.1 GDPR (General Data Protection Regulation)

GDPR Requirement	Privacy Scanner Capability
Art. 5(1)(c) - Data Minimization	Client-side redaction mode ensures minimal data processing
Art. 5(1)(e) - Storage Limitation	Zero data retention - ephemeral processing only
Art. 25 - Privacy by Design	Built-in PII detection before data enters downstream systems
Art. 32 - Security of Processing	TLS encryption, no persistent storage, PII-filtered logs
Art. 33/34 - Breach Notification	Detection of exposed PII in logs/documents aids breach assessment

GDPR PII Types Detected:

Names (via context analysis)
Email addresses
Phone numbers (EU formats)
National IDs (UK NI, German Ausweis)
Financial identifiers (IBAN, EU VAT)
IP addresses
Physical addresses

4.2 HIPAA (Health Insurance Portability and Accountability Act)

HIPAA Requirement	Privacy Scanner Capability
§164.502 - Minimum Necessary	Detects PHI before transmission to reduce exposure
§164.312(a)(1) - Access Control	Coordinates-only mode prevents PHI echo
§164.312(c)(1) - Integrity	Immutable detection - no modification of source data
§164.312(e)(1) - Transmission Security	TLS 1.3 for all communications
§164.530(c) - Safeguards	Multi-layer detection prevents PHI leakage

HIPAA PHI Types Detected:

Social Security Numbers
Medicare Beneficiary Identifiers (MBI)
Medical Record Numbers
NPI (National Provider Identifier)
DEA Numbers
Dates of Birth
Phone Numbers
Email Addresses
Physical Addresses

4.3 PCI-DSS (Payment Card Industry Data Security Standard)

PCI-DSS Requirement	Privacy Scanner Capability
Req. 3.4 - Render PAN Unreadable	Automatic credit card detection and masking
Req. 4.1 - Encrypt Transmission	TLS 1.3 encryption
Req. 6.5 - Secure Development	Input validation, no SQL/command injection vectors
Req. 10.2 - Audit Trails	PII-safe logging with detection events
Req. 12.3 - Usage Policies	Supports policy enforcement via API integration

PCI-DSS Data Types Detected:

Primary Account Numbers (PAN): Visa, Mastercard, Amex, Discover; Luhn validation reduces false positives; detects formatted (4111-1111-1111-1111) and unformatted (4111111111111111) numbers
Bank routing numbers
IBAN/SWIFT codes

4.4 SOC 2 (System and Organization Controls)

SOC 2 Criteria	Privacy Scanner Capability
CC6.1 - Logical Access	API-based access with optional authentication
CC6.6 - System Boundaries	Clear input/output contracts via OpenAPI spec
CC6.7 - Transmission Integrity	TLS encryption, request validation
CC7.2 - System Monitoring	Structured detection logs (without PII content)
P1.1 - Privacy Notice	Transparent processing - documented detection categories

4.5 CCPA (California Consumer Privacy Act)

CCPA Requirement	Privacy Scanner Capability
§1798.100 - Right to Know	Identifies all PII categories in documents
§1798.105 - Right to Delete	Supports identification for deletion workflows
§1798.110 - Disclosure	Structured output for compliance reporting

5 Integration Patterns

5.1 Pre-Commit Hook (Developer Workflow)

#!/bin/bash
# .git/hooks/pre-commit

# Scan staged files (added, copied, or modified) for PII
while IFS= read -r file; do
    # "text=<file" tells curl to read the field value from the file
    response=$(curl -s -X POST http://localhost:8000/api/privacy/scan-text \
        -F "text=<$file" \
        -F "coordinates_only=true")

    count=$(echo "$response" | jq '.entities | length')
    if [ "${count:-0}" -gt 0 ]; then
        echo "PII detected in $file - commit blocked"
        exit 1
    fi
done < <(git diff --cached --name-only --diff-filter=ACM)

5.2 CI/CD Pipeline Integration

# GitHub Actions example
- name: PII Scan
  run: |
    # -print0 / read -d '' keeps file names with spaces intact
    find . \( -name "*.log" -o -name "*.json" \) -print0 |
    while IFS= read -r -d '' file; do
      result=$(curl -s -X POST "$PII_SCANNER_URL/api/privacy/scan-text" \
        -F "text=<$file")
      if echo "$result" | jq -e '.entities | length > 0' > /dev/null; then
        echo "::error::PII detected in $file"
        exit 1
      fi
    done

5.3 Data Pipeline Integration

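# Apache Airflow DAG example
import os

import requests
from airflow.decorators import task

# Base URL of the scanner service, read from the environment
PII_SCANNER_URL = os.environ["PII_SCANNER_URL"]

@task
def scan_for_pii(data: str, coordinates_only: bool = True) -> dict:
    """Scan data for PII before loading to data warehouse"""
    response = requests.post(
        f"{PII_SCANNER_URL}/api/privacy/scan-text",
        data={
            "text": data,
            # Send an explicit lowercase string for the boolean form field
            "coordinates_only": str(coordinates_only).lower(),
        },
        timeout=30,
    )
    response.raise_for_status()  # Fail the task on HTTP errors
    result = response.json()

    if result.get("entities"):
        raise ValueError(f"PII detected: {len(result['entities'])} entities")

    return {"status": "clean", "data": data}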

5.4 Log Sanitization Service

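# Real-time log sanitization
import asyncio
import os

import aiohttp

# Base URL of the scanner service, read from the environment
PII_SCANNER_URL = os.environ["PII_SCANNER_URL"]

async def _sanitize_line(session: aiohttp.ClientSession, line: str) -> str:
    """Scan one line and return its redacted preview (or the original line)."""
    # "async with" releases the connection once the body has been read
    async with session.post(
        f"{PII_SCANNER_URL}/api/privacy/scan-text",
        data={"text": line},
    ) as resp:
        resp.raise_for_status()
        result = await resp.json()
    return result.get("redacted_preview", line)

async def sanitize_log_stream(log_lines: list[str]) -> list[str]:
    """Sanitize logs before shipping to centralized logging"""
    async with aiohttp.ClientSession() as session:
        # gather() preserves input order, so results align with log_lines
        return await asyncio.gather(
            *(_sanitize_line(session, line) for line in log_lines)
        )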

6 Performance Characteristics

6.1 Benchmarks

Metric	Value	Conditions
Throughput	~10,000 chars/sec	Single-threaded, all layers enabled
Latency (P50)	<50ms	1KB text input
Latency (P99)	<200ms	10KB text input
Memory Usage	<100MB	Per-request peak
Startup Time	<2 seconds	Cold start with pattern compilation

6.2 Scalability

The Privacy Scanner is designed for horizontal scalability:

Stateless Architecture: Any instance can handle any request
No Shared State: No database or cache dependencies for scan operations
Container-Ready: Single-process model ideal for Kubernetes
Load Balancer Compatible: Round-robin distribution works optimally

# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: privacy-scanner
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: privacy-scanner
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

7 Deployment Options

7.1 On-Premises

For maximum data sovereignty:

# Docker deployment
docker run -d \
  --name privacy-scanner \
  -p 8000:8000 \
  --memory=512m \
  --cpus=1 \
  privacy-scanner:latest

Benefits:

Data never leaves your network
Full control over infrastructure
No external dependencies

7.2 Private Cloud (VPC)

# AWS VPC deployment example
resource "aws_ecs_service" "privacy_scanner" {
  name            = "privacy-scanner"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.privacy_scanner.arn
  desired_count   = 2

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.privacy_scanner.id]
    assign_public_ip = false  # No public access
  }
}

Benefits:

Network isolation via VPC
Integration with cloud IAM
Auto-scaling capabilities

7.3 Air-Gapped Deployment

For highly restricted environments:

Client-Side Redaction Mode: Backend only returns coordinates
No Outbound Connections: Zero external API calls
Offline Pattern Updates: Manual pattern file updates
Local-Only Logging: No telemetry or metrics export

9 Appendix A: API Reference

9.1 Scan Text Endpoint

POST /api/privacy/scan-text
Content-Type: multipart/form-data

Parameters:

Parameter	Type	Required	Description
`text`	string	Yes	Text content to scan
`coordinates_only`	boolean	No	Return only positions (default: false)
`detect_emails`	boolean	No	Enable email detection (default: true)
`detect_phones`	boolean	No	Enable phone detection (default: true)
`detect_ssn`	boolean	No	Enable SSN detection (default: true)
`detect_credit_cards`	boolean	No	Enable credit card detection (default: true)
`detect_secrets`	boolean	No	Enable secrets detection (default: true)

Response (Standard Mode):

{
  "entities": [
    {
      "type": "EMAIL",
      "value": "john@example.com",
      "masked_value": "[EMAIL:j***@example.com]",
      "start": 9,
      "end": 25,
      "confidence": 0.95,
      "category": "pii"
    }
  ],
  "redacted_preview": "Contact: [EMAIL:j***@example.com] for info",
  "summary": {
    "total_entities": 1,
    "by_category": {"pii": 1},
    "risk_level": "medium"
  }
}

Response (Coordinates-Only Mode):

{
  "entities": [
    {
      "type": "EMAIL",
      "start": 15,
      "end": 31,
      "length": 16
    }
  ],
  "coordinates_only": true
}

10 Appendix B: Confidence Scoring

Confidence Level	Score Range	Meaning
Very High	0.95 - 1.00	Checksum validated (Luhn, MOD-97)
High	0.85 - 0.94	Strong pattern match with context
Medium	0.70 - 0.84	Pattern match, limited context
Low	0.50 - 0.69	Possible match, needs review
Uncertain	< 0.50	Flagged for manual review

Confidence Adjustments:

+15%: Checksum validation passed
+10%: Contextual keywords present (e.g., “SSN:”, “card number”)
-30%: Anti-context detected (e.g., “order number”, “reference ID”)
-20%: Common false positive pattern (UUID format, connection string)

11 Appendix C: Version History

Version	Date	Changes
1.1	2024-12-23	Added international IDs (UK NI, Canadian SIN, India Aadhaar/PAN, etc.), cloud tokens (OpenAI, Anthropic, Discord), crypto addresses, financial identifiers (CUSIP, ISIN), improved false positive filtering
1.0	2024-12-20	Initial release with 30+ PII types, 8-layer detection pipeline

12 Contact & Support

For enterprise licensing, custom integrations, or security assessments:

Documentation: See privacy-scanner-overview.qmd and building-privacy-scanner.qmd
Issues: Report via your organization’s support channel
Updates: Pattern updates released quarterly

This document is intended for enterprise security and compliance teams evaluating the Privacy Scanner for production deployment. All technical specifications are subject to change. Please refer to the latest documentation for current capabilities.