Data Classification and DLP
Data classification is the process of organising data into categories based on sensitivity, so that appropriate controls, access policies, and handling procedures can be applied consistently. Data Loss Prevention (DLP) is the set of tools and processes that prevent sensitive data from leaving authorised boundaries, whether by exfiltration, accidental sharing, or leakage to the wrong parties.
Data Classification Tiers
Most organisations use 3–4 tiers. Map your existing data to a tier before designing controls.
| Tier | Also called | Examples | Who can see it |
|---|---|---|---|
| Public | Open | Press releases, marketing, public documentation, open source code | Anyone |
| Internal | General, Limited | Employee handbook, org charts, project names, internal blog | Employees (all) |
| Confidential | Sensitive, Business-sensitive | Business strategy, customer lists, source code, contract terms, financial forecasts | Need-to-know employees only |
| Restricted | Secret, Highly Confidential | PII (SSN, email, address), PHI (health records), PAN (credit cards), authentication secrets, employee salaries | Named individuals; access-controlled; logged |
Classification Criteria
Classify based on:

1. Impact of exposure (financial, legal, reputational)
2. Regulatory obligation (GDPR, HIPAA, PCI DSS)
3. Contractual requirement (NDA, customer data handling agreement)
4. Strategic value (M&A plans, product roadmap, source code)
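The four criteria can each suggest a different tier for the same dataset, so you need a resolution rule: treat the tiers as an ordered scale and take the most sensitive suggestion. A minimal Python sketch (the tier names follow the table above; the function name is illustrative):

```python
# Tiers in ascending order of sensitivity, matching the classification table
TIERS = ["Public", "Internal", "Confidential", "Restricted"]

def classify(*candidate_tiers: str) -> str:
    """When different criteria suggest different tiers, take the highest."""
    return max(candidate_tiers, key=TIERS.index)

# Strategic value says Confidential, regulatory obligation says Restricted:
print(classify("Confidential", "Restricted"))  # Restricted
```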
Default rule: when uncertain, classify at the HIGHER tier. It's easier to declassify later than to undo a breach.

Data Inventory and Mapping
Before you can protect data, you must find it:
```bash
# Find files matching PII patterns (SSN, credit card, email)
# WARNING: run as a privilege-appropriate user; log findings securely

# Social Security Numbers (US)
grep -rE '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b' /var/data/ 2>/dev/null

# Credit card numbers (Luhn pattern - not exact but broad)
grep -rE '\b([0-9]{4}[ -]?){3}[0-9]{4}\b' /var/data/ 2>/dev/null

# Email addresses
grep -rE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /var/data/ 2>/dev/null

# Passport numbers (common formats)
grep -rE '\b[A-Z]{2}[0-9]{7}\b|\b[A-Z][0-9]{8}\b' /var/data/ 2>/dev/null

# Find all databases (data stores to inventory)
find / -name "*.db" -o -name "*.sqlite" -o -name "*.csv" 2>/dev/null
find / -name "*.sql" -newer /etc/passwd 2>/dev/null  # recently modified SQL files
```
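The credit-card regex above deliberately over-matches; candidate hits can be filtered through a Luhn checksum before reporting, which removes most random 16-digit strings. A minimal Python sketch:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:          # PANs are 13-19 digits
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:            # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True (well-known test PAN)
print(luhn_valid("1234567890123456"))  # False (regex hit, checksum fails)
```

Running regex matches through this filter before writing findings keeps the inventory report focused on plausible card numbers rather than invoice IDs and timestamps.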
```bash
# AWS: find S3 buckets with public access
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  xargs -I{} aws s3api get-bucket-acl --bucket {} \
    --query 'Grants[?Grantee.URI==`http://acs.amazonaws.com/groups/global/AllUsers`]'
```

Data Flow Mapping
Data flow map elements:

- Source → where data originates (user input, external API, IoT sensor)
- Storage → where data at rest lives (S3, RDS, file server)
- Processing → where data is computed on (app server, Lambda, ML pipeline)
- Transfers → where data moves to (third-party APIs, email, reporting tools)
- Disposal → how data is destroyed when its retention period ends
Each flow must answer:

- What classification is this data?
- Is this transfer authorised?
- Is it encrypted in transit?
- Is the receiving party authorised and covered by a DPA/BAA?
- How long is it retained at the destination?

Data Handling Requirements by Tier
| Requirement | Public | Internal | Confidential | Restricted |
|---|---|---|---|---|
| Encryption at rest | Optional | Optional | Required | Required |
| Encryption in transit | Optional | Recommended | Required | Required (TLS 1.2+) |
| Access logging | No | Optional | Recommended | Required; individual audit trail |
| Access controls | None | Authentication | Need-to-know ACL | Named individual ACL + approval |
| Sharing externally | Freely | With approval | NDA required | Prohibited unless specific exception |
| Storage on personal devices | OK | With MDM | Prohibited | Prohibited |
| Retention | As needed | Per policy | As needed | Per regulatory obligation |
| Destruction | Standard delete | Standard delete | Certified wipe | Physical destruction or certified wipe |
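The handling matrix is most useful when services can query it programmatically, e.g. in a deployment check or policy-as-code gate. A minimal sketch of one possible encoding (the dictionary and function names are illustrative, not from any standard):

```python
# Per-tier controls, transcribed from the handling matrix above.
# Values: "no", "optional", "recommended", "required" (illustrative encoding)
HANDLING = {
    "Public":       {"at_rest": "optional", "in_transit": "optional",    "access_logging": "no"},
    "Internal":     {"at_rest": "optional", "in_transit": "recommended", "access_logging": "optional"},
    "Confidential": {"at_rest": "required", "in_transit": "required",    "access_logging": "recommended"},
    "Restricted":   {"at_rest": "required", "in_transit": "required",    "access_logging": "required"},
}

def must_encrypt_at_rest(tier: str) -> bool:
    """True when the matrix mandates at-rest encryption for this tier."""
    return HANDLING[tier]["at_rest"] == "required"

print(must_encrypt_at_rest("Confidential"))  # True
print(must_encrypt_at_rest("Internal"))      # False
```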
Labelling and Tagging
Classification is only useful if data is labelled so systems and humans know how to handle it.
Document Labelling
```powershell
# Microsoft Purview Information Protection (formerly AIP)
# Sensitivity labels applied to Word/Excel/PDF/email:
#   Public → Internal → Confidential → Highly Confidential

# A label applies:
#   - Visual marking (header/footer: "CONFIDENTIAL")
#   - Encryption (only authorised users can decrypt)
#   - DLP policy triggers (based on label)
#   - Retention policies
#   - Access control (can't copy/forward/print if restricted)

# PowerShell: check label on a file
Get-MIPLabelStatus -File "contract.docx"

# List labels in your tenant
Connect-IPPSSession
Get-Label | Select-Object DisplayName, Priority, Guid
```

Cloud Resource Tagging
```bash
# AWS: tag S3 buckets by data classification
aws s3api put-bucket-tagging --bucket mybucket --tagging '{
  "TagSet": [
    {"Key": "DataClassification", "Value": "Confidential"},
    {"Key": "DataOwner", "Value": "finance-team"},
    {"Key": "RetentionYears", "Value": "7"}
  ]
}'

# Find untagged S3 buckets (no classification tag)
# (tr splits the tab-separated bucket list onto separate lines)
aws s3api list-buckets --query 'Buckets[].Name' --output text | tr '\t' '\n' | \
while read bucket; do
  tags=$(aws s3api get-bucket-tagging --bucket "$bucket" 2>/dev/null | grep DataClassification)
  [ -z "$tags" ] && echo "UNTAGGED: $bucket"
done

# Azure: tag resources
az resource tag --ids /subscriptions/.../myresource \
  --tags "DataClassification=Restricted" "DataOwner=security-team"

# Enforce tagging via Azure Policy (deny creation without classification tag)
az policy definition create --name "require-data-classification" \
  --rules '{"if":{"field":"tags[DataClassification]","exists":"false"},"then":{"effect":"deny"}}'
```

DLP - Data Loss Prevention
DLP tools monitor, detect, and block sensitive data leaving authorised channels.
DLP Enforcement Points
ENDPOINT DLP → agent on workstations that monitors:

- File copies to USB drives
- Uploads to personal cloud storage (Dropbox, personal Google Drive)
- Printing of sensitive documents
- Screenshots of restricted content
- Email attachments from the managed email client

NETWORK DLP → inline proxy or CASB that inspects:

- HTTP/HTTPS traffic (decrypting TLS to inspect)
- Outbound SMTP email
- Cloud app uploads (Box, Dropbox, SharePoint)
- FTP and SCP transfers

EMAIL DLP → scans email bodies and attachments:

- Detect credit card numbers in attachments
- Block sending PAN data outside the domain
- Quarantine emails containing health record data
- Enforce encryption for sensitive outbound email

CLOUD DLP → scans cloud storage (S3, GCS, Azure Blob, SharePoint):

- Finds PII/PHI/PAN-like patterns in stored files
- Alerts or automatically reclassifies discovered files

Microsoft Purview DLP
```powershell
# PowerShell: view DLP policies
Connect-IPPSSession
Get-DlpCompliancePolicy | Select-Object Name, IsEnabled, Mode

# DLP policy states:
#   Mode = TestWithoutNotifications → audit only; no blocking
#   Mode = TestWithNotifications    → audit + notify user
#   Mode = Enable                   → enforce (block + notify)

# View DLP rules in a policy
Get-DlpComplianceRule -Policy "PCI DSS Policy" |
  Select-Object Name, ContentContainsSensitiveInformation

# Create a DLP rule (block sharing of credit card numbers externally)
New-DlpComplianceRule -Name "Block PAN external sharing" `
  -Policy "PCI DSS Policy" `
  -ContentContainsSensitiveInformation @{name="Credit Card Number"; minCount=1} `
  -AccessScope NotInOrganization `
  -BlockAccess $true `
  -NotifyUser Owner

# Check DLP incidents
Get-DlpDetailReport -StartDate (Get-Date).AddDays(-7) -EndDate (Get-Date) |
  Select-Object Date, Subject, Policy, Action
```

Google Cloud DLP (Cloud DLP API)
```bash
# Install gcloud CLI and enable the DLP API
gcloud services enable dlp.googleapis.com

# Scan a GCS bucket for PII
python3 << 'EOF'
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
project = "my-project"

# Define what to find
inspect_config = dlp_v2.InspectConfig(
    info_types=[
        dlp_v2.InfoType(name="EMAIL_ADDRESS"),
        dlp_v2.InfoType(name="PHONE_NUMBER"),
        dlp_v2.InfoType(name="CREDIT_CARD_NUMBER"),
        dlp_v2.InfoType(name="US_SOCIAL_SECURITY_NUMBER"),
    ],
    likelihood_threshold=dlp_v2.Likelihood.LIKELY,
    include_quote=False,
)

# Storage to scan
storage_config = dlp_v2.StorageConfig(
    cloud_storage_options=dlp_v2.CloudStorageOptions(
        file_set=dlp_v2.CloudStorageOptions.FileSet(url="gs://my-bucket/**")
    )
)

# Create an inspection job
job = dlp_v2.InspectJobConfig(
    inspect_config=inspect_config,
    storage_config=storage_config,
)
response = client.create_dlp_job(parent=f"projects/{project}", inspect_job=job)
print(f"Job created: {response.name}")
EOF
```

AWS Macie - S3 Data Discovery
```bash
# Enable Macie (automated PII scanning for S3)
aws macie2 enable-macie

# Create a scan job
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "PII-Scan-Q1" \
  --s3-job-definition '{
    "bucketDefinitions": [
      {"accountId": "123456789012", "buckets": ["my-data-bucket"]}
    ]
  }'

# List findings
aws macie2 list-findings \
  --finding-criteria '{"criterion": {"severity.description": {"eq": ["High","Critical"]}}}' \
  | jq '.findingIds'

# Get finding details
aws macie2 get-findings --finding-ids <finding-id> | jq '.findings[].description'
```

Data Retention and Destruction
Retention Policy Framework
| Data type | Minimum retention | Maximum (typical) | Destruction method |
|---|---|---|---|
| Financial records | 7 years (tax law) | 7 years | Certified destruction |
| HR / employee records | Duration of employment + 7 years | As above | Certified destruction |
| HIPAA ePHI | 6 years (HIPAA) | Per state law | Certified destruction |
| PCI card data | Only as long as needed | Immediate post-auth (CVV must never be stored) | Secure wipe or tokenise |
| GDPR personal data | As long as processing purpose exists | Per purpose limitation | Anonymisation or deletion |
| Security logs | 90 days accessible | 12 months (PCI); varies | Secure deletion |
| Backup tapes | Per recovery RTO/RPO | Aligned with data retention of contained data | Degauss + shred |
```bash
# Automate data retention with S3 Lifecycle rules
aws s3api put-bucket-lifecycle-configuration --bucket my-logs-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "SecurityLogRetention",
        "Status": "Enabled",
        "Filter": {"Prefix": "security-logs/"},
        "Transitions": [
          {"Days": 30, "StorageClass": "STANDARD_IA"},
          {"Days": 90, "StorageClass": "GLACIER"}
        ],
        "Expiration": {"Days": 365}
      }
    ]
  }'

# Check for expired objects that haven't been deleted
aws s3api list-objects-v2 --bucket my-bucket \
  --query 'Contents[?LastModified<`2025-01-01`].[Key,LastModified]'
```

GDPR Data Subject Rights (Technical Implementation)
If you process personal data of EU residents, GDPR grants data subjects rights you must technically support:
| Right | Technical requirement |
|---|---|
| Right of access | Ability to export all data about a person (by email/ID) |
| Right to erasure (right to be forgotten) | Ability to delete all data about a person across all systems |
| Right to portability | Export data in machine-readable format (JSON/CSV) |
| Right to rectification | Ability to update incorrect personal data |
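The access and portability rights both reduce to a per-user export across systems in a machine-readable format. A minimal sketch against a relational store (the table and column names are illustrative, and the function assumes the user exists):

```python
import json
import sqlite3

def export_user_data(conn: sqlite3.Connection, user_id: int) -> str:
    """Gather a user's records into machine-readable JSON (access/portability)."""
    user = conn.execute(
        "SELECT id, email, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    orders = conn.execute(
        "SELECT id, total, created_at FROM orders WHERE user_id = ?", (user_id,)
    ).fetchall()
    return json.dumps({
        "user": {"id": user[0], "email": user[1], "name": user[2]},
        "orders": [
            {"id": o[0], "total": o[1], "created_at": o[2]} for o in orders
        ],
    }, indent=2)
```

In practice the export must cover every system holding the person's data (databases, CRM, support tickets, analytics), not just the primary store.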
```python
# Example: data subject erasure request handler
# (db, add_to_erasure_list, notify_processors, log_erasure are application helpers)
from datetime import datetime

def handle_erasure_request(user_id: str):
    """Delete all personal data for a given user across systems."""

    # 1. Database: anonymise or delete records
    db.execute("""
        UPDATE users
        SET email = 'deleted@example.com',  -- placeholder address
            name = 'Deleted User', phone = NULL, address = NULL
        WHERE id = %s
    """, [user_id])

    # 2. Logs: pseudonymise (replace email with hash in logs)
    #    Note: full deletion from logs may violate retention requirements
    #    - consult legal before deleting audit logs

    # 3. Backups: add to "deleted users" list; filter on restore
    add_to_erasure_list(user_id)

    # 4. Third parties: notify processors (email provider, analytics, etc.)
    notify_processors(user_id)

    # 5. Document: record the erasure request and completion date
    log_erasure(user_id, completed_at=datetime.now())
```
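The log-pseudonymisation step can use a keyed hash (HMAC) so the same subject always maps to the same token while the mapping stays non-reversible without the key. A minimal sketch, with an illustrative key that should really live in a secrets manager and be rotated on a schedule:

```python
import hashlib
import hmac

PSEUDONYM_KEY = b"rotate-me-regularly"  # illustrative; load from a secrets manager

def pseudonymise(email: str) -> str:
    """Deterministic, non-reversible token standing in for an email address."""
    mac = hmac.new(PSEUDONYM_KEY, email.lower().encode(), hashlib.sha256)
    return "user-" + mac.hexdigest()[:16]

# Case variants of the same address collapse to one token:
print(pseudonymise("alice@example.com") == pseudonymise("Alice@Example.COM"))  # True
```

A keyed hash is preferable to a plain hash here because, without the key, an attacker cannot confirm a guessed email by hashing it themselves.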