
Data Classification and DLP

Data classification is the process of organising data into categories based on sensitivity, so that appropriate controls, access policies, and handling procedures can be applied consistently. Data Loss Prevention (DLP) is the set of tools and processes that prevent sensitive data from leaving authorised boundaries, whether through exfiltration, accidental sharing, or leakage to the wrong parties.


Most organisations use 3–4 tiers. Map your existing data to a tier before designing controls.

| Tier | Also called | Examples | Who can see it |
|---|---|---|---|
| Public | Open | Press releases, marketing, public documentation, open source code | Anyone |
| Internal | General, Limited | Employee handbook, org charts, project names, internal blog | Employees (all) |
| Confidential | Sensitive, Business-sensitive | Business strategy, customer lists, source code, contract terms, financial forecasts | Need-to-know employees only |
| Restricted | Secret, Highly Confidential | PII (SSN, email, address), PHI (health records), PAN (credit cards), authentication secrets, employee salaries | Named individuals; access-controlled; logged |
Classify based on:
1. Impact of exposure (financial, legal, reputational)
2. Regulatory obligation (GDPR, HIPAA, PCI DSS)
3. Contractual requirement (NDA, customer data handling agreement)
4. Strategic value (M&A plans, product roadmap, source code)
Default rule:
When uncertain → classify at the HIGHER tier
It's easier to declassify later than to undo a breach
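The criteria and the default rule can be sketched as a small helper that always resolves to the higher tier when multiple factors apply. This is a hypothetical illustration using the tier names from the table; adapt the mapping to your own scheme:

```python
# Tiers in ascending sensitivity, so max() naturally applies the
# "when uncertain, classify at the HIGHER tier" rule.
TIERS = ["Public", "Internal", "Confidential", "Restricted"]

def classify(impact_high: bool, regulated: bool,
             contractual: bool, strategic: bool) -> str:
    """Return the highest tier implied by the four criteria."""
    candidates = ["Internal"]  # anything not deliberately published starts here
    if strategic or contractual:
        candidates.append("Confidential")
    if impact_high or regulated:
        candidates.append("Restricted")
    return max(candidates, key=TIERS.index)

print(classify(impact_high=False, regulated=True,
               contractual=False, strategic=False))  # Restricted
```

Regulated data lands in Restricted even when no other factor applies, which matches the default rule: over-classifying is recoverable, a breach is not.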

Before you can protect data, you must find it:

```shell
# Find files matching PII patterns (SSN, credit card, email)
# WARNING: run as a privilege-appropriate user; log findings securely

# Social Security Numbers (US)
grep -rE '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b' /var/data/ 2>/dev/null

# Credit card numbers (broad 16-digit pattern; does NOT validate the Luhn checksum)
grep -rE '\b([0-9]{4}[ -]?){3}[0-9]{4}\b' /var/data/ 2>/dev/null

# Email addresses
grep -rE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /var/data/ 2>/dev/null

# Passport numbers (common formats)
grep -rE '\b[A-Z]{2}[0-9]{7}\b|\b[A-Z][0-9]{8}\b' /var/data/ 2>/dev/null

# Find data stores to inventory
find / \( -name "*.db" -o -name "*.sqlite" -o -name "*.csv" \) 2>/dev/null
find / -name "*.sql" -newer /etc/passwd 2>/dev/null   # recently modified SQL files

# AWS: find S3 buckets that grant access to AllUsers (public)
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  xargs -I{} aws s3api get-bucket-acl --bucket {} \
    --query 'Grants[?Grantee.URI==`http://acs.amazonaws.com/groups/global/AllUsers`]'
```
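The credit-card regex above matches any 16-digit group, so it produces many false positives (order IDs, timestamps, phone numbers). Running candidates through the standard Luhn checksum cuts those down substantially. A minimal sketch, not tied to any particular DLP tool:

```python
def luhn_valid(candidate: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) < 13:              # real PANs are 13-19 digits
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:                # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (well-known test PAN)
print(luhn_valid("1234 5678 9012 3456"))  # False
```

Passing the Luhn check still does not prove a match is a card number, but failing it proves a 16-digit match is not one, which is usually the bulk of the noise.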
Data flow map elements:
Source → where data originates (user input, external API, IoT sensor)
Storage → where data at rest lives (S3, RDS, file server)
Processing → where data is computed on (app server, Lambda, ML pipeline)
Transfers → where data moves to (third-party APIs, email, reporting tools)
Disposal → how data is destroyed when retention period ends
Each flow must answer:
- What classification is this data?
- Is this transfer authorised?
- Is it encrypted in transit?
- Is the receiving party authorised and covered by a DPA/BAA?
- How long is it retained at the destination?
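One way to make those questions auditable is to record each flow as structured data, so gaps (unencrypted transfer, missing DPA) can be queried rather than hunted for. A sketch with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class DataFlow:
    """One edge in the data flow map, answering the checklist above."""
    source: str
    destination: str
    classification: str        # Public / Internal / Confidential / Restricted
    authorised: bool
    encrypted_in_transit: bool
    dpa_in_place: bool         # a DPA/BAA covers the receiving party
    retention_days: int

def flow_violations(flow: DataFlow) -> list:
    """Return the checklist items this flow fails."""
    issues = []
    if not flow.authorised:
        issues.append("transfer not authorised")
    if flow.classification in ("Confidential", "Restricted"):
        if not flow.encrypted_in_transit:
            issues.append("not encrypted in transit")
        if not flow.dpa_in_place:
            issues.append("no DPA/BAA with receiving party")
    return issues

f = DataFlow("app-server", "analytics-vendor", "Restricted",
             authorised=True, encrypted_in_transit=True,
             dpa_in_place=False, retention_days=30)
print(flow_violations(f))  # ['no DPA/BAA with receiving party']
```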

| Requirement | Public | Internal | Confidential | Restricted |
|---|---|---|---|---|
| Encryption at rest | Optional | Optional | Required | Required |
| Encryption in transit | Optional | Recommended | Required | Required (TLS 1.2+) |
| Access logging | No | Optional | Recommended | Required; individual audit trail |
| Access controls | None | Authentication | Need-to-know ACL | Named individual ACL + approval |
| Sharing externally | Freely | With approval | NDA required | Prohibited unless specific exception |
| Storage on personal devices | OK | With MDM | Prohibited | Prohibited |
| Retention | As needed | Per policy | As needed | Per regulatory obligation |
| Destruction | Standard delete | Standard delete | Certified wipe | Physical destruction or certified wipe |
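Once the matrix is fixed, it is worth encoding it once so tooling (CI checks, deployment validation) can query it instead of humans re-reading the table. A sketch using a few of the matrix's own values, trimmed for brevity:

```python
# Encoded from the handling matrix above (subset of rows shown).
HANDLING = {
    "encryption_at_rest": {
        "Public": "Optional", "Internal": "Optional",
        "Confidential": "Required", "Restricted": "Required",
    },
    "encryption_in_transit": {
        "Public": "Optional", "Internal": "Recommended",
        "Confidential": "Required", "Restricted": "Required (TLS 1.2+)",
    },
    "access_logging": {
        "Public": "No", "Internal": "Optional",
        "Confidential": "Recommended",
        "Restricted": "Required; individual audit trail",
    },
}

def requirement(control: str, tier: str) -> str:
    """Look up what the matrix demands for a control at a tier."""
    return HANDLING[control][tier]

print(requirement("encryption_at_rest", "Confidential"))  # Required
```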

Classification is only useful if data is labelled so systems and humans know how to handle it.

```powershell
# Microsoft Purview Information Protection (formerly Azure Information Protection)
# Sensitivity labels applied to Word/Excel/PDF/email:
#   Public → Internal → Confidential → Highly Confidential
# A label can apply:
#   - Visual marking (header/footer: "CONFIDENTIAL")
#   - Encryption (only authorised users can decrypt)
#   - DLP policy triggers (based on label)
#   - Retention policies
#   - Access control (can't copy/forward/print if restricted)

# Check the label on a file (AzureInformationProtection client module)
Get-AIPFileStatus -Path "contract.docx"

# List labels in your tenant
Connect-IPPSSession
Get-Label | Select-Object DisplayName, Priority, Guid
```
```shell
# AWS: tag S3 buckets by data classification
aws s3api put-bucket-tagging --bucket mybucket --tagging '{
  "TagSet": [
    {"Key": "DataClassification", "Value": "Confidential"},
    {"Key": "DataOwner", "Value": "finance-team"},
    {"Key": "RetentionYears", "Value": "7"}
  ]
}'

# Find untagged S3 buckets (no classification tag)
aws s3api list-buckets --query 'Buckets[].Name' --output text | tr '\t' '\n' | while read -r bucket; do
  tags=$(aws s3api get-bucket-tagging --bucket "$bucket" 2>/dev/null | grep DataClassification)
  [ -z "$tags" ] && echo "UNTAGGED: $bucket"
done

# Azure: tag resources
az resource tag --ids /subscriptions/.../myresource \
  --tags "DataClassification=Restricted" "DataOwner=security-team"

# Enforce tagging via Azure Policy (deny creation without classification tag)
az policy definition create --name "require-data-classification" \
  --rules '{"if":{"field":"tags[DataClassification]","exists":"false"},"then":{"effect":"deny"}}'
```

DLP tools monitor, detect, and block sensitive data leaving authorised channels.

ENDPOINT DLP
→ Agent on workstations that monitors:
- File copies to USB drives
- Uploads to personal cloud (Dropbox, personal Google Drive)
- Printing of sensitive documents
- Screenshots of restricted content
- Email attachments from managed email client
NETWORK DLP
→ Inline proxy or CASB that inspects:
- HTTP/HTTPS traffic (decrypt TLS to inspect)
- SMTP email outbound
- Cloud app uploads (Box, Dropbox, SharePoint)
- FTP, SCP transfers
EMAIL DLP
→ Scans email body and attachments:
- Detect credit card numbers in attachments
- Block sending PAN data outside the domain
- Quarantine emails with health record data
- Enforce encryption for sensitive outbound email
CLOUD DLP
→ Scans cloud storage (S3, GCS, Azure Blob, SharePoint):
- Finds PII/PHI/PAN-like patterns in stored files
- Alerts or automatically reclassifies discovered files
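As an illustration of the detection step all four DLP types share, a minimal content filter might combine pattern matching with a destination check. This is a toy sketch with illustrative patterns only; production DLP engines use validated detectors, document fingerprinting, and exact data matching:

```python
import re

# Illustrative patterns (same shapes as the discovery greps earlier)
PAN_RE = re.compile(r"\b(?:[0-9]{4}[ -]?){3}[0-9]{4}\b")
SSN_RE = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")

def should_block(body: str, recipient_domain: str,
                 internal_domain: str = "example.com") -> bool:
    """Block when sensitive patterns would leave the internal domain."""
    external = recipient_domain != internal_domain
    sensitive = bool(PAN_RE.search(body) or SSN_RE.search(body))
    return external and sensitive

print(should_block("card: 4111 1111 1111 1111", "gmail.com"))    # True
print(should_block("card: 4111 1111 1111 1111", "example.com"))  # False
```

The same predicate, with better detectors, sits behind endpoint agents (file copy events), network proxies (HTTP uploads), and mail gateways (outbound SMTP).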
```powershell
# View DLP policies
Connect-IPPSSession
Get-DlpCompliancePolicy | Select-Object Name, IsEnabled, Mode

# DLP policy modes:
#   TestWithoutNotifications → audit only; no blocking
#   TestWithNotifications    → audit + notify user
#   Enable                   → enforce (block + notify)

# View DLP rules in a policy
Get-DlpComplianceRule -Policy "PCI DSS Policy" |
  Select-Object Name, ContentContainsSensitiveInformation

# Create a DLP rule (block external sharing of credit card numbers)
New-DlpComplianceRule -Name "Block PAN external sharing" `
  -Policy "PCI DSS Policy" `
  -ContentContainsSensitiveInformation @{name="Credit Card Number"; minCount=1} `
  -AccessScope NotInOrganization `
  -BlockAccess $true `
  -NotifyUser Owner

# Check DLP incidents over the last 7 days
Get-DlpDetailReport -StartDate (Get-Date).AddDays(-7) -EndDate (Get-Date) |
  Select-Object Date, Subject, Policy, Action
```
```shell
# Install the gcloud CLI and enable the DLP API
gcloud services enable dlp.googleapis.com

# Scan a GCS bucket for PII
python3 << 'EOF'
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
project = "my-project"

# Define what to find
inspect_config = dlp_v2.InspectConfig(
    info_types=[
        dlp_v2.InfoType(name="EMAIL_ADDRESS"),
        dlp_v2.InfoType(name="PHONE_NUMBER"),
        dlp_v2.InfoType(name="CREDIT_CARD_NUMBER"),
        dlp_v2.InfoType(name="US_SOCIAL_SECURITY_NUMBER"),
    ],
    min_likelihood=dlp_v2.Likelihood.LIKELY,
    include_quote=False,
)

# Storage to scan
storage_config = dlp_v2.StorageConfig(
    cloud_storage_options=dlp_v2.CloudStorageOptions(
        file_set=dlp_v2.CloudStorageOptions.FileSet(url="gs://my-bucket/**")
    )
)

# Create the inspection job
job = dlp_v2.InspectJobConfig(
    inspect_config=inspect_config, storage_config=storage_config
)
response = client.create_dlp_job(parent=f"projects/{project}", inspect_job=job)
print(f"Job created: {response.name}")
EOF
```
```shell
# Enable Macie (automated sensitive-data scanning for S3)
aws macie2 enable-macie

# Create a one-time classification job
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "PII-Scan-Q1" \
  --s3-job-definition '{
    "bucketDefinitions": [
      {"accountId": "123456789012", "buckets": ["my-data-bucket"]}
    ]
  }'

# List high/critical findings
aws macie2 list-findings \
  --finding-criteria '{"criterion": {"severity.description": {"eq": ["High","Critical"]}}}' \
  | jq '.findingIds'

# Get finding details
aws macie2 get-findings --finding-ids <finding-id> | jq '.findings[].description'
```

| Data type | Minimum retention | Maximum (typical) | Destruction method |
|---|---|---|---|
| Financial records | 7 years (tax law) | 7 years | Certified destruction |
| HR / employee records | Duration of employment + 7 years | As above | Certified destruction |
| HIPAA ePHI | 6 years (HIPAA) | Per state law | Certified destruction |
| PCI card data | Only as long as needed | Immediate post-auth (CVV: never) | Secure wipe or tokenise |
| GDPR personal data | As long as processing purpose exists | Per purpose limitation | Anonymisation or deletion |
| Security logs | 90 days accessible | 12 months (PCI); varies | Secure deletion |
| Backup tapes | Per recovery RTO/RPO | Aligned with retention of contained data | Degauss + shred |
```shell
# Automate data retention with S3 Lifecycle rules
aws s3api put-bucket-lifecycle-configuration --bucket my-logs-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "SecurityLogRetention",
        "Status": "Enabled",
        "Filter": {"Prefix": "security-logs/"},
        "Transitions": [
          {"Days": 30, "StorageClass": "STANDARD_IA"},
          {"Days": 90, "StorageClass": "GLACIER"}
        ],
        "Expiration": {"Days": 365}
      }
    ]
  }'

# Spot-check for old objects that should already have expired
aws s3api list-objects-v2 --bucket my-bucket \
  --query 'Contents[?LastModified<`2025-01-01`].[Key,LastModified]'
```
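Lifecycle rules can silently fail to apply (a prefix filter that no longer matches, a suspended rule), so it is worth verifying expiry independently of the date baked into a query. A minimal helper for the date arithmetic, which can be fed object listings from any store:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_expired(last_modified: datetime, retention_days: int,
               now: Optional[datetime] = None) -> bool:
    """True if an object has outlived its retention period."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified > timedelta(days=retention_days)

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
print(is_expired(datetime(2024, 1, 1, tzinfo=timezone.utc), 365, now))  # True
print(is_expired(datetime(2025, 5, 1, tzinfo=timezone.utc), 365, now))  # False
```

Anything flagged here but still present in the bucket indicates the lifecycle rule is not doing its job.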

GDPR Data Subject Rights (Technical Implementation)


If you process personal data of EU residents, GDPR grants data subjects rights you must technically support:

| Right | Technical requirement |
|---|---|
| Right of access | Ability to export all data about a person (by email/ID) |
| Right to erasure (right to be forgotten) | Ability to delete all data about a person across all systems |
| Right to portability | Export data in machine-readable format (JSON/CSV) |
| Right to rectification | Ability to update incorrect personal data |
```python
from datetime import datetime

# Example: data subject erasure request handler
def handle_erasure_request(user_id: str):
    """Delete all personal data for a given user across systems."""
    # 1. Database: anonymise or delete records
    db.execute("""
        UPDATE users SET
            email = '[email protected]',
            name = 'Deleted User',
            phone = NULL,
            address = NULL
        WHERE id = %s
    """, [user_id])

    # 2. Logs: pseudonymise (replace email with a hash in logs).
    #    Full deletion from logs may violate retention requirements --
    #    consult legal before deleting audit logs.

    # 3. Backups: add to a "deleted users" list; filter on restore
    add_to_erasure_list(user_id)

    # 4. Third parties: notify processors (email provider, analytics, etc.)
    notify_processors(user_id)

    # 5. Document: record the erasure request and completion date
    log_erasure(user_id, completed_at=datetime.now())
```
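The right to portability can follow the same shape: gather a user's records from each system and emit one machine-readable bundle. A sketch where the record sources are hypothetical (fetching is system-specific, so the caller passes the rows in):

```python
import json

def handle_portability_request(user_id: str, records: dict) -> str:
    """Export all personal data for a user as JSON.

    `records` maps a system name (e.g. "crm", "orders") to the rows
    already fetched for this user from that system.
    """
    bundle = {"user_id": user_id, "systems": records}
    # default=str covers dates and other non-JSON-native values
    return json.dumps(bundle, indent=2, default=str)

export = handle_portability_request(
    "u-123",
    {"crm": [{"name": "Deleted User"}], "orders": [{"order_id": 42}]},
)
print(export)
```

Keeping the per-system fetchers behind one registry means access, portability, and erasure requests all iterate the same list of systems, so a newly added data store cannot be forgotten by one handler but not the others.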