---
title: "Privacy Scanner: Security & Compliance White Paper"
subtitle: "Enterprise-Grade PII Detection with Zero-Trust Architecture"
author: "AI Tools Suite"
date: "2024-12-23"
version: "1.1"
categories: [security, compliance, enterprise, privacy, whitepaper]
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: true
    number-sections: true
---

## Executive Summary

### Value Realization

| Stakeholder | Primary Benefit |
|-------------|-----------------|
| **Developer** | Blocks commits containing secrets or keys before they reach GitHub |
| **Data Engineer** | Automates PII scrubbing before data enters the warehouse |
| **Compliance Officer** | Provides evidence of "Privacy by Design" for GDPR/SOC 2 audits |
| **CISO** | Reduces the overall blast radius of a potential data breach |
| **Legal/DPO** | Supports DSAR (Data Subject Access Request) fulfillment |
| **DevOps/SRE** | Sanitizes logs before shipping to centralized observability |

---

The Privacy Scanner is an enterprise-grade Personally Identifiable Information (PII) detection and redaction solution designed with security-first principles. This white paper details the security architecture, compliance capabilities, and technical safeguards that make the Privacy Scanner suitable for organizations with stringent data protection requirements.

**Key Highlights:**

- **40+ PII Types Detected** across identity, financial, contact, medical, secret, and cryptocurrency categories
- **8-Layer Detection Pipeline** for comprehensive coverage including obfuscation bypass
- **Zero-Trust Architecture** with optional client-side redaction mode
- **Ephemeral Processing** - no data persistence, no logging of sensitive content
- **Supports Compliance Programs** - technical controls aligned with GDPR, HIPAA, PCI-DSS, SOC 2, and CCPA requirements (the tool assists compliance efforts; it does not guarantee compliance)

---

## Security Architecture

### 2.1 Defense in Depth

The Privacy Scanner implements multiple layers of security controls:

```
┌─────────────────────────────────────────────────────────────┐
│                       CLIENT BROWSER                        │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Client-Side Redaction Mode (Optional)                │  │
│  │  • PII values never echoed back to the browser        │  │
│  │  • Only coordinates returned from backend             │  │
│  │  • Maximum privacy guarantee                          │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       TRANSPORT LAYER                       │
│  • TLS 1.3 encryption in transit                            │
│  • Certificate pinning (recommended)                        │
│  • No sensitive data in URL parameters                      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      APPLICATION LAYER                      │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  FastAPI Backend                                      │  │
│  │  • Request validation via Pydantic                    │  │
│  │  • No database connections for scan operations        │  │
│  │  • Stateless processing                               │  │
│  │  • PII-filtered logging                               │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      PROCESSING LAYER                       │
│  • In-memory only - no disk writes                          │
│  • Automatic garbage collection post-response               │
│  • No caching of scanned content                            │
│  • Deterministic regex patterns (no ML model storage)       │
└─────────────────────────────────────────────────────────────┘
```

### 2.2 Ephemeral Processing Model

The Privacy Scanner operates on a strict ephemeral processing model:

| Aspect | Implementation |
|--------|----------------|
| **Data Retention** | Zero - content exists only during request processing |
| **Disk Writes** | None - all processing in-memory |
| **Database Storage** | None - stateless architecture |
| **Log Sanitization** | PII-filtered logging prevents accidental exposure |
| **Session State** | None - each request is independent |

```python
import logging


# Example: PII-safe logging filter
class PIIFilter(logging.Filter):
    """Drop log records that appear to carry request body content."""

    SENSITIVE_MARKERS = ("text=", "content=", "body=")

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() applies %-style args, so values passed as
        # logging arguments are checked along with the format string.
        message = record.getMessage()
        return not any(marker in message for marker in self.SENSITIVE_MARKERS)
```

### 2.3 Client-Side Redaction Mode

For organizations with ultra-sensitive data, the Privacy Scanner offers **Coordinates-Only Mode**:

**Standard Mode:**
```
Client → Server: "John's SSN is 123-45-6789"
Server → Client: {type: "SSN", value: "123-45-6789", masked: "[SSN:***-**-6789]"}
```

**Client-Side Redaction Mode:**
```
Client → Server: "John's SSN is 123-45-6789"
Server → Client: {type: "SSN", start: 14, end: 25, length: 11}
Client performs local redaction - actual PII value never returned
```

The input text is still sent to the backend for scanning; only the response is stripped of PII values. This mode ensures:

- Backend **never echoes PII values** back to the client
- Redaction occurs **entirely in the browser**
- Suitable for **air-gapped environments** with strict data egress policies
- **Lower leakage risk**: responses carry positions only, so response logs, proxies, and caches never see PII values

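The local redaction step can be sketched as follows. This is a minimal Python illustration of the logic, not the scanner's browser code: the function name `redact_locally`, the `[TYPE]` placeholder format, and the assumption that offsets are 0-based with an exclusive `end` are all illustrative.

```python
from typing import Iterable, Mapping


def redact_locally(text: str, entities: Iterable[Mapping]) -> str:
    """Replace each [start, end) span with a type placeholder.

    Assumes 0-based offsets with an exclusive end (hypothetical convention).
    """
    redacted = text
    # Work from the last span to the first so earlier offsets stay valid
    # even though placeholders change the string length.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['type']}]"
        redacted = redacted[: ent["start"]] + placeholder + redacted[ent["end"]:]
    return redacted


text = "John's SSN is 123-45-6789"
entities = [{"type": "SSN", "start": 14, "end": 25, "length": 11}]
print(redact_locally(text, entities))  # John's SSN is [SSN]
```

Processing spans right to left is the design choice that matters here: replacing from the left would shift every later offset.
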
---

## Detection Capabilities

### 3.1 PII Categories and Types

The Privacy Scanner detects **40+ distinct PII types** across six categories:

#### Identity Documents

| Type | Pattern | Validation |
|------|---------|------------|
| US Social Security Number (SSN) | `XXX-XX-XXXX` | Format + Area validation |
| US Medicare ID (MBI) | `XAXX-XXX-XXXX` | Format validation |
| US Driver's License | State-specific | Context-aware |
| UK National Insurance | `AB123456C` | Format + prefix validation |
| Canadian SIN | `XXX-XXX-XXX` | Luhn checksum |
| India Aadhaar | 12 digits | Verhoeff checksum |
| India PAN | `ABCDE1234F` | Format validation |
| Australia TFN | 8-9 digits | Checksum validation |
| Brazil CPF | `XXX.XXX.XXX-XX` | MOD-11 checksum |
| Mexico CURP | 18 chars | Format validation |
| South Africa ID | 13 digits | Luhn checksum |
| Passport Numbers | Country-specific | Format validation |
| German Personalausweis | 10 chars | Context-aware |

#### Financial Information

| Type | Pattern | Validation |
|------|---------|------------|
| Credit Card (Visa/MC/Amex/Discover) | 13-19 digits | **Luhn Algorithm** |
| IBAN | Country + check digits + BBAN | **MOD-97 Algorithm** |
| SWIFT/BIC | 8 or 11 chars | Format + context |
| Bank Account Numbers | 8-17 digits | Context-aware |
| Routing/ABA Numbers | 9 digits | Context-aware |
| CUSIP | 9 chars | Check digit |
| ISIN | 12 chars | Luhn checksum |
| SEDOL | 7 chars | Checksum |

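The two checksums highlighted in the table can be illustrated with a short sketch. The function names `luhn_valid` and `iban_valid` are hypothetical and the code follows the published Luhn and ISO 13616 MOD-97 procedures; it is not taken from the scanner's source.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]  # ignores dashes and spaces
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """MOD-97 check: move the first four chars to the end, map A-Z to 10-35."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111-1111-1111-1111"))          # True (standard test number)
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True (standard example IBAN)
```

A passing checksum only shows that a digit string is well-formed, which is why the table pairs checksums with format and context checks.
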
#### Contact Information

| Type | Pattern | Validation |
|------|---------|------------|
| Email Addresses | RFC 5322 compliant | Domain validation |
| Obfuscated Emails | `[at]`, `(dot)` variants | TLD validation |
| US Phone Numbers | Multiple formats | Area code validation |
| International Phone | 30+ country codes | Country-specific |
| Physical Addresses | US format | Context-aware |

#### Secrets and API Keys

| Type | Pattern | Example |
|------|---------|---------|
| AWS Access Key | `AKIA[A-Z0-9]{16}` | `AKIAIOSFODNN7EXAMPLE` |
| AWS Secret Key | 40-char base64 | `wJalrXUtnFEMI/K7MDENG...` |
| GitHub Token | `gh[pousr]_[A-Za-z0-9]{36,}` | `ghp_xxxxxxxxxxxx...` |
| Slack Token | `xox[baprs]-...` | `xoxb-123456-789012-...` |
| Stripe Key | `sk_live_...` / `pk_test_...` | `sk_live_abc123...` |
| JWT Token | Base64.Base64.Base64 | `eyJhbGci...` |
| OpenAI API Key | `sk-[A-Za-z0-9]{48}` | `sk-abc123...` |
| Anthropic API Key | `sk-ant-...` | `sk-ant-api03-...` |
| Discord Token | Base64 format | Token pattern |
| Private Keys | PEM headers | `-----BEGIN PRIVATE KEY-----` |

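As a rough illustration of how a few of these patterns could be applied, the sketch below compiles three of them and reports match positions. The names `SECRET_PATTERNS` and `find_secrets` are hypothetical, the word-boundary anchors and the optional key-type word in the PEM header are additions for the example, and the GitHub pattern uses the standard `{36,}` quantifier for "36 or more".

```python
import re

# Illustrative subset of the table above; not the scanner's actual pattern set.
SECRET_PATTERNS = {
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[A-Z0-9]{16}\b"),
    "GITHUB_TOKEN": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "PRIVATE_KEY": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[dict]:
    """Return type and [start, end) offsets for every pattern match."""
    hits = []
    for label, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append({"type": label, "start": match.start(), "end": match.end()})
    return sorted(hits, key=lambda h: h["start"])


sample = "aws_key=AKIAIOSFODNN7EXAMPLE\n-----BEGIN RSA PRIVATE KEY-----"
for hit in find_secrets(sample):
    print(hit["type"], hit["start"], hit["end"])
```

Only offsets are returned, never the matched value, in keeping with the coordinates-only idea described earlier.
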
#### Medical Information

| Type | Pattern | Validation |
|------|---------|------------|
| Medical Record Number | 6-10 digits | Context-aware |
| NPI (Provider ID) | 10 digits | Luhn checksum |
| DEA Number | 2 letters + 7 digits | Checksum |

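For reference, the NPI and DEA checks in the table can be sketched as below. The function names are hypothetical. The NPI check applies Luhn to the number prefixed with `80840`, as in the published NPI check-digit scheme, and the DEA check uses the commonly documented formula (sum of digits 1, 3, 5 plus twice the sum of digits 2, 4, 6, compared against digit 7). The two-letter prefix follows the table, not the full registrant-code rules.

```python
import re


def luhn_valid(digits: str) -> bool:
    """Plain Luhn check over a string of ASCII digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """NPI: 10 digits, Luhn-checked after prepending the 80840 prefix."""
    return bool(re.fullmatch(r"[0-9]{10}", npi)) and luhn_valid("80840" + npi)


def dea_valid(dea: str) -> bool:
    """DEA: 2 letters + 7 digits, where the last digit is a checksum."""
    match = re.fullmatch(r"[A-Za-z]{2}([0-9]{7})", dea)
    if not match:
        return False
    d = [int(c) for c in match.group(1)]
    total = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]


print(npi_valid("1234567893"))  # True
print(dea_valid("AB1234563"))   # True
```
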
#### Cryptocurrency

| Type | Pattern | Validation |
|------|---------|------------|
| Bitcoin Address | `1`, `3`, or `bc1` prefix | Base58Check / Bech32 |
| Ethereum Address | `0x` + 40 hex | Checksum optional |
| Monero Address | `4` prefix, 95 chars | Format validation |

### 3.2 Eight-Layer Detection Pipeline

```
┌────────────────────────────────────────────────────────────────┐
│                           INPUT TEXT                           │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 1: Unicode Normalization (NFKC)                          │
│ • Converts fullwidth chars: ｅｍａｉｌ → email                 │
│ • Normalizes homoglyphs: е (Cyrillic) → e (Latin)              │
│ • Decodes HTML entities: &#64; → @                             │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2: Text Normalization                                    │
│ • Defanging reversal: [dot] → ., [at] → @                      │
│ • Smart "at" detection (TLD validation, false trigger filter)  │
│ • Separator removal: 123-45-6789 → 123456789                   │
│ • Character unspacing: t-e-s-t → test                          │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2.5: Structured Data Extraction                          │
│ • JSON blob detection and deep value extraction                │
│ • Recursive scanning of nested objects/arrays                  │
│ • Key-value pair analysis                                      │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 2.6: Encoding Detection                                  │
│ • Base64 auto-detection and decoding                           │
│ • UTF-8 validation of decoded content                          │
│ • Recursive PII scan on decoded payloads                       │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 3: Pattern Matching                                      │
│ • 40+ regex patterns with category classification              │
│ • Context-aware matching (lookbehind/lookahead)                │
│ • Multi-format support per PII type                            │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 4: Checksum Validation                                   │
│ • Luhn algorithm (credit cards, Canadian SIN)                  │
│ • MOD-97 (IBAN)                                                │
│ • Verhoeff (Aadhaar)                                           │
│ • Custom checksums (DEA, NPI)                                  │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 5: Context Analysis                                      │
│ • Surrounding text analysis for disambiguation                 │
│ • False positive filtering (connection strings, UUIDs)         │
│ • Confidence adjustment based on context                       │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ LAYER 6: Deduplication & Scoring                               │
│ • Overlapping entity resolution                                │
│ • Confidence score aggregation                                 │
│ • Risk level classification                                    │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ OUTPUT: Structured PII Report                                  │
│ • Entity list with types, values, positions, confidence        │
│ • Redacted text preview                                        │
│ • Risk assessment summary                                      │
└────────────────────────────────────────────────────────────────┘
```

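Layer 6's overlapping entity resolution could work along the following lines. This is a sketch under stated assumptions: the greedy "highest confidence first, then longest span" rule and the name `resolve_overlaps` are illustrative choices, since the white paper does not specify the actual tie-breaking policy.

```python
def resolve_overlaps(entities: list[dict]) -> list[dict]:
    """Keep a non-overlapping subset, preferring higher confidence, then longer spans.

    Spans are treated as half-open [start, end) intervals (assumed convention).
    """
    ranked = sorted(
        entities,
        key=lambda e: (-e["confidence"], -(e["end"] - e["start"])),
    )
    kept: list[dict] = []
    for ent in ranked:
        # Accept the candidate only if it is disjoint from every accepted span.
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"] for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


candidates = [
    {"type": "PHONE", "start": 10, "end": 20, "confidence": 0.70},
    {"type": "SSN", "start": 10, "end": 21, "confidence": 0.95},
    {"type": "EMAIL", "start": 30, "end": 46, "confidence": 0.90},
]
print([e["type"] for e in resolve_overlaps(candidates)])  # ['SSN', 'EMAIL']
```
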
### 3.3 Anti-Evasion Capabilities

The Privacy Scanner is designed to detect PII even when intentionally obfuscated:

| Evasion Technique | Example | Detection Method |
|-------------------|---------|------------------|
| **Defanging** | `john[at]gmail[dot]com` | Layer 2 normalization |
| **Spacing** | `j-o-h-n @ g-m-a-i-l` | Character joining |
| **Leetspeak** | `j0hn@gm4il.c0m` | Leetspeak reversal |
| **Unicode tricks** | `ｊｏｈｎ＠ｇｍａｉｌ．ｃｏｍ` | NFKC normalization |
| **HTML encoding** | `john&#64;gmail.com` | Entity decoding |
| **Base64 hiding** | `am9obkBnbWFpbC5jb20=` | Auto-decode + scan |
| **JSON embedding** | `{"email":"john@gmail.com"}` | Deep extraction |
| **Number formatting** | `123.45.6789` (SSN with dots) | Multi-separator support |

---

## Compliance Mapping

### 4.1 GDPR (General Data Protection Regulation)

| GDPR Requirement | Privacy Scanner Capability |
|------------------|---------------------------|
| **Art. 5(1)(c)** - Data Minimization | Client-side redaction mode ensures minimal data processing |
| **Art. 5(1)(e)** - Storage Limitation | Zero data retention - ephemeral processing only |
| **Art. 25** - Privacy by Design | Built-in PII detection before data enters downstream systems |
| **Art. 32** - Security of Processing | TLS encryption, no persistent storage, PII-filtered logs |
| **Art. 33/34** - Breach Notification | Detection of exposed PII in logs/documents aids breach assessment |

**GDPR PII Types Detected:**

- Names (via context analysis)
- Email addresses
- Phone numbers (EU formats)
- National IDs (UK NI, German Ausweis)
- Financial identifiers (IBAN, EU VAT)
- IP addresses
- Physical addresses

### 4.2 HIPAA (Health Insurance Portability and Accountability Act)

| HIPAA Requirement | Privacy Scanner Capability |
|------------------|---------------------------|
| **§164.502** - Minimum Necessary | Detects PHI before transmission to reduce exposure |
| **§164.312(a)(1)** - Access Control | Coordinates-only mode prevents PHI echo |
| **§164.312(c)(1)** - Integrity | Immutable detection - no modification of source data |
| **§164.312(e)(1)** - Transmission Security | TLS 1.3 for all communications |
| **§164.530(c)** - Safeguards | Multi-layer detection prevents PHI leakage |

**HIPAA PHI Types Detected:**

- Social Security Numbers
- Medicare Beneficiary Identifiers (MBI)
- Medical Record Numbers
- NPI (National Provider Identifier)
- DEA Numbers
- Dates of Birth
- Phone Numbers
- Email Addresses
- Physical Addresses

### 4.3 PCI-DSS (Payment Card Industry Data Security Standard)

| PCI-DSS Requirement | Privacy Scanner Capability |
|--------------------|---------------------------|
| **Req. 3.4** - Render PAN Unreadable | Automatic credit card detection and masking |
| **Req. 4.1** - Encrypt Transmission | TLS 1.3 encryption |
| **Req. 6.5** - Secure Development | Input validation, no SQL/command injection vectors |
| **Req. 10.2** - Audit Trails | PII-safe logging with detection events |
| **Req. 12.3** - Usage Policies | Supports policy enforcement via API integration |

**PCI-DSS Data Types Detected:**

- Primary Account Numbers (PAN) - Visa, Mastercard, Amex, Discover
  - **Luhn validation** reduces false positives
  - Detects formatted (`4111-1111-1111-1111`) and unformatted (`4111111111111111`)
- Bank routing numbers
- IBAN/SWIFT codes

### 4.4 SOC 2 (System and Organization Controls)

| SOC 2 Criteria | Privacy Scanner Capability |
|----------------|---------------------------|
| **CC6.1** - Logical Access | API-based access with optional authentication |
| **CC6.6** - System Boundaries | Clear input/output contracts via OpenAPI spec |
| **CC6.7** - Transmission Integrity | TLS encryption, request validation |
| **CC7.2** - System Monitoring | Structured detection logs (without PII content) |
| **PI1.1** - Privacy Notice | Transparent processing - documented detection categories |

### 4.5 CCPA (California Consumer Privacy Act)

| CCPA Requirement | Privacy Scanner Capability |
|-----------------|---------------------------|
| **§1798.100** - Right to Know | Identifies all PII categories in documents |
| **§1798.105** - Right to Delete | Supports identification for deletion workflows |
| **§1798.110** - Disclosure | Structured output for compliance reporting |

---

## Integration Patterns

### 5.1 Pre-Commit Hook (Developer Workflow)

```bash
#!/bin/bash
# .git/hooks/pre-commit

# Scan staged files (added, copied, or modified) for PII.
# Reading names line by line keeps paths with spaces intact.
while IFS= read -r file; do
    # "text=<$file" tells curl to send the file's contents as the field value
    response=$(curl -s -X POST http://localhost:8000/api/privacy/scan-text \
        -F "text=<$file" \
        -F "coordinates_only=true")

    count=$(echo "$response" | jq '.entities | length')
    if [ "${count:-0}" -gt 0 ]; then
        echo "PII detected in $file - commit blocked"
        exit 1
    fi
done < <(git diff --cached --name-only --diff-filter=ACM)
```

### 5.2 CI/CD Pipeline Integration

```yaml
# GitHub Actions example
- name: PII Scan
  run: |
    # -print0 / read -d '' keeps file names with spaces intact
    find . \( -name "*.log" -o -name "*.json" \) -type f -print0 |
    while IFS= read -r -d '' file; do
      result=$(curl -s -X POST "$PII_SCANNER_URL/api/privacy/scan-text" \
        -F "text=<$file")
      if echo "$result" | jq -e '.entities | length > 0' > /dev/null; then
        echo "::error::PII detected in $file"
        exit 1
      fi
    done
```

### 5.3 Data Pipeline Integration

```python
# Apache Airflow DAG example
import os

import requests
from airflow.decorators import task

# Base URL of the scanner service; the default matches the local examples above
PII_SCANNER_URL = os.environ.get("PII_SCANNER_URL", "http://localhost:8000")


@task
def scan_for_pii(data: str, coordinates_only: bool = True) -> dict:
    """Scan data for PII before loading to the data warehouse."""
    response = requests.post(
        f"{PII_SCANNER_URL}/api/privacy/scan-text",
        data={
            "text": data,
            # Form fields are strings, so send an explicit lowercase boolean
            "coordinates_only": str(coordinates_only).lower(),
        },
        timeout=30,
    )
    response.raise_for_status()  # fail the task on HTTP errors
    result = response.json()

    if result.get("entities"):
        raise ValueError(f"PII detected: {len(result['entities'])} entities")

    return {"status": "clean", "data": data}
```

### 5.4 Log Sanitization Service

```python
# Real-time log sanitization
import asyncio
import os

import aiohttp

PII_SCANNER_URL = os.environ.get("PII_SCANNER_URL", "http://localhost:8000")


async def _scan_line(session: aiohttp.ClientSession, line: str) -> str:
    """Return the redacted form of one line, or the line itself if none is returned."""
    async with session.post(
        f"{PII_SCANNER_URL}/api/privacy/scan-text",
        data={"text": line},
    ) as resp:
        resp.raise_for_status()
        result = await resp.json()
    return result.get("redacted_preview", line)


async def sanitize_log_stream(log_lines: list[str]) -> list[str]:
    """Sanitize logs before shipping to centralized logging."""
    async with aiohttp.ClientSession() as session:
        # gather() preserves input order, so results line up with log_lines
        return await asyncio.gather(
            *(_scan_line(session, line) for line in log_lines)
        )
```

---

## Performance Characteristics

### 6.1 Benchmarks

| Metric | Value | Conditions |
|--------|-------|------------|
| **Throughput** | ~10,000 chars/sec | Single-threaded, all layers enabled |
| **Latency (P50)** | <50ms | 1KB text input |
| **Latency (P99)** | <200ms | 10KB text input |
| **Memory Usage** | <100MB | Per-request peak |
| **Startup Time** | <2 seconds | Cold start with pattern compilation |

### 6.2 Scalability

The Privacy Scanner is designed for horizontal scalability:

- **Stateless Architecture**: Any instance can handle any request
- **No Shared State**: No database or cache dependencies for scan operations
- **Container-Ready**: Single-process model suited to Kubernetes
- **Load Balancer Compatible**: Round-robin distribution works well because requests are independent

```yaml
# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: privacy-scanner
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: privacy-scanner
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

---

## Deployment Options

### 7.1 On-Premises

For maximum data sovereignty:

```bash
# Docker deployment
docker run -d \
  --name privacy-scanner \
  -p 8000:8000 \
  --memory=512m \
  --cpus=1 \
  privacy-scanner:latest
```

**Benefits:**

- Data never leaves your network
- Full control over infrastructure
- No external dependencies

### 7.2 Private Cloud (VPC)

```terraform
# AWS VPC deployment example
resource "aws_ecs_service" "privacy_scanner" {
  name            = "privacy-scanner"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.privacy_scanner.arn
  desired_count   = 2

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.privacy_scanner.id]
    assign_public_ip = false # No public access
  }
}
```

**Benefits:**

- Network isolation via VPC
- Integration with cloud IAM
- Auto-scaling capabilities

### 7.3 Air-Gapped Deployment

For highly restricted environments:

1. **Client-Side Redaction Mode**: Backend only returns coordinates
2. **No Outbound Connections**: Zero external API calls
3. **Offline Pattern Updates**: Manual pattern file updates
4. **Local-Only Logging**: No telemetry or metrics export

---

## Security Hardening Checklist

### Pre-Deployment

- [ ] Enable TLS 1.3 with strong cipher suites
- [ ] Configure rate limiting (recommend: 100 req/min per IP)
- [ ] Set up authentication (API keys or OAuth 2.0)
- [ ] Review and customize PII patterns for your use case
- [ ] Configure PII-safe logging
- [ ] Set appropriate request size limits (default: 10MB)

### Runtime

- [ ] Monitor for unusual request patterns
- [ ] Set up alerting on high PII detection rates
- [ ] Implement request timeout (default: 30 seconds)
- [ ] Enable health check endpoints for orchestration
- [ ] Configure graceful shutdown handling

### Audit

- [ ] Log detection events (without PII content)
- [ ] Track API usage metrics
- [ ] Periodic pattern effectiveness review
- [ ] Regular security scanning of container images

---

## Appendix A: API Reference

### Scan Text Endpoint

```
POST /api/privacy/scan-text
Content-Type: multipart/form-data
```

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text content to scan |
| `coordinates_only` | boolean | No | Return only positions (default: false) |
| `detect_emails` | boolean | No | Enable email detection (default: true) |
| `detect_phones` | boolean | No | Enable phone detection (default: true) |
| `detect_ssn` | boolean | No | Enable SSN detection (default: true) |
| `detect_credit_cards` | boolean | No | Enable credit card detection (default: true) |
| `detect_secrets` | boolean | No | Enable secrets detection (default: true) |

**Response (Standard Mode):**

```json
{
  "entities": [
    {
      "type": "EMAIL",
      "value": "john@example.com",
      "masked_value": "[EMAIL:j***@example.com]",
      "start": 9,
      "end": 25,
      "confidence": 0.95,
      "category": "pii"
    }
  ],
  "redacted_preview": "Contact: [EMAIL:j***@example.com] for info",
  "summary": {
    "total_entities": 1,
    "by_category": {"pii": 1},
    "risk_level": "medium"
  }
}
```

**Response (Coordinates-Only Mode):**

```json
{
  "entities": [
    {
      "type": "EMAIL",
      "start": 9,
      "end": 25,
      "length": 16
    }
  ],
  "coordinates_only": true
}
```

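The relationship between the two response shapes can be illustrated with a small projection. This is a sketch only: `to_coordinates_only` is a hypothetical helper, not part of the documented API, and it assumes `length` is simply `end - start`.

```python
def to_coordinates_only(standard_response: dict) -> dict:
    """Project a standard-mode response onto the coordinates-only shape.

    Drops every field that could carry PII (value, masked_value, preview).
    """
    return {
        "entities": [
            {
                "type": ent["type"],
                "start": ent["start"],
                "end": ent["end"],
                "length": ent["end"] - ent["start"],
            }
            for ent in standard_response.get("entities", [])
        ],
        "coordinates_only": True,
    }


standard = {
    "entities": [
        {"type": "EMAIL", "value": "john@example.com", "start": 9, "end": 25}
    ],
    "redacted_preview": "Contact: [EMAIL:j***@example.com] for info",
}
print(to_coordinates_only(standard))
```
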

---

## Appendix B: Confidence Scoring

| Confidence Level | Score Range | Meaning |
|-----------------|-------------|---------|
| **Very High** | 0.95 - 1.00 | Checksum validated (Luhn, MOD-97) |
| **High** | 0.85 - 0.94 | Strong pattern match with context |
| **Medium** | 0.70 - 0.84 | Pattern match, limited context |
| **Low** | 0.50 - 0.69 | Possible match, needs review |
| **Uncertain** | < 0.50 | Flagged for manual review |

**Confidence Adjustments:**

- **+15%**: Checksum validation passed
- **+10%**: Contextual keywords present (e.g., "SSN:", "card number")
- **-30%**: Anti-context detected (e.g., "order number", "reference ID")
- **-20%**: Common false positive pattern (UUID format, connection string)

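A worked example may help show how the adjustments and levels interact. The sketch assumes the percentages are additive points on a 0 to 1 scale and that the result is clamped to that range; the white paper does not state whether adjustments are additive or multiplicative, and the names below are hypothetical.

```python
# Adjustment values from the list above, expressed as additive points (assumption).
ADJUSTMENTS = {
    "checksum_passed": 0.15,
    "context_keyword": 0.10,
    "anti_context": -0.30,
    "false_positive_pattern": -0.20,
}


def adjust_confidence(base: float, signals: list[str]) -> float:
    """Apply each signal's adjustment to the base score and clamp to [0, 1]."""
    score = base + sum(ADJUSTMENTS[s] for s in signals)
    return round(min(1.0, max(0.0, score)), 2)


def confidence_level(score: float) -> str:
    """Map a score to the level names in the table above."""
    if score >= 0.95:
        return "Very High"
    if score >= 0.85:
        return "High"
    if score >= 0.70:
        return "Medium"
    if score >= 0.50:
        return "Low"
    return "Uncertain"


boosted = adjust_confidence(0.75, ["checksum_passed", "context_keyword"])
penalized = adjust_confidence(0.75, ["anti_context"])
print(boosted, confidence_level(boosted))      # 1.0 Very High
print(penalized, confidence_level(penalized))  # 0.45 Uncertain
```
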
---

## Appendix C: Version History

| Version | Date | Changes |
|---------|------|---------|
| **1.1** | 2024-12-23 | Added international IDs (UK NI, Canadian SIN, India Aadhaar/PAN, etc.), cloud tokens (OpenAI, Anthropic, Discord), crypto addresses, financial identifiers (CUSIP, ISIN), improved false positive filtering |
| **1.0** | 2024-12-20 | Initial release with 30+ PII types, 8-layer detection pipeline |

---

## Contact & Support

For enterprise licensing, custom integrations, or security assessments:

- **Documentation**: See `privacy-scanner-overview.qmd` and `building-privacy-scanner.qmd`
- **Issues**: Report via your organization's support channel
- **Updates**: Pattern updates released quarterly

---

*This document is intended for enterprise security and compliance teams evaluating the Privacy Scanner for production deployment. All technical specifications are subject to change. Please refer to the latest documentation for current capabilities.*