
Data Classification and DLP

Data classification is the process of organising data into categories based on sensitivity, so that appropriate controls, access policies, and handling procedures can be applied consistently. Data Loss Prevention (DLP) is the set of tools and processes that prevent sensitive data from leaving authorised boundaries, whether through exfiltration, accidental sharing, or leakage to the wrong parties.


Most organisations use 3–4 tiers. Map your existing data to a tier before designing controls.

| Tier | Also called | Examples | Who can see it |
|---|---|---|---|
| Public | Open | Press releases, marketing, public documentation, open source code | Anyone |
| Internal | General, Limited | Employee handbook, org charts, project names, internal blog | Employees (all) |
| Confidential | Sensitive, Business-sensitive | Business strategy, customer lists, source code, contract terms, financial forecasts | Need-to-know employees only |
| Restricted | Secret, Highly Confidential | PII (SSN, email, address), PHI (health records), PAN (credit cards), authentication secrets, employee salaries | Named individuals; access-controlled; logged |
Classify based on:
1. Impact of exposure (financial, legal, reputational)
2. Regulatory obligation (GDPR, HIPAA, PCI DSS)
3. Contractual requirement (NDA, customer data handling agreement)
4. Strategic value (M&A plans, product roadmap, source code)
Default rule:
When uncertain → classify at the HIGHER tier
It's easier to declassify later than to undo a breach
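The criteria and the default rule can be sketched as a small helper that always resolves to the higher tier when multiple factors apply. This is a hypothetical illustration using the tier names from the table; adapt the mapping to your own scheme:

```python
# Tiers in ascending sensitivity, so max() naturally applies the
# "when uncertain, classify at the HIGHER tier" rule.
TIERS = ["Public", "Internal", "Confidential", "Restricted"]

def classify(impact_high: bool, regulated: bool,
             contractual: bool, strategic: bool) -> str:
    """Return the highest tier implied by the four criteria."""
    candidates = ["Internal"]  # anything not deliberately published starts here
    if strategic or contractual:
        candidates.append("Confidential")
    if impact_high or regulated:
        candidates.append("Restricted")
    return max(candidates, key=TIERS.index)

print(classify(impact_high=False, regulated=True,
               contractual=False, strategic=False))  # Restricted
```

Regulated data lands in Restricted even when no other factor applies, which matches the default rule: over-classifying is recoverable, a breach is not.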

Before you can protect data, you must find it:

```shell
# Find files matching PII patterns (SSN, credit card, email)
# WARNING: run as a privilege-appropriate user; log findings securely

# Social Security Numbers (US)
grep -rE '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b' /var/data/ 2>/dev/null

# Credit card numbers (broad 16-digit pattern; does NOT validate the Luhn checksum)
grep -rE '\b([0-9]{4}[ -]?){3}[0-9]{4}\b' /var/data/ 2>/dev/null

# Email addresses
grep -rE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /var/data/ 2>/dev/null

# Passport numbers (common formats)
grep -rE '\b[A-Z]{2}[0-9]{7}\b|\b[A-Z][0-9]{8}\b' /var/data/ 2>/dev/null

# Find data stores to inventory
find / \( -name "*.db" -o -name "*.sqlite" -o -name "*.csv" \) 2>/dev/null
find / -name "*.sql" -newer /etc/passwd 2>/dev/null   # recently modified SQL files

# AWS: find S3 buckets that grant access to AllUsers (public)
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  xargs -I{} aws s3api get-bucket-acl --bucket {} \
    --query 'Grants[?Grantee.URI==`http://acs.amazonaws.com/groups/global/AllUsers`]'
```
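The credit-card regex above matches any 16-digit group, so it produces many false positives (order IDs, timestamps, phone numbers). Running candidates through the standard Luhn checksum cuts those down substantially. A minimal sketch, not tied to any particular DLP tool:

```python
def luhn_valid(candidate: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) < 13:              # real PANs are 13-19 digits
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:                # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (well-known test PAN)
print(luhn_valid("1234 5678 9012 3456"))  # False
```

Passing the Luhn check still does not prove a match is a card number, but failing it proves a 16-digit match is not one, which is usually the bulk of the noise.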
Data flow map elements:
Source → where data originates (user input, external API, IoT sensor)
Storage → where data at rest lives (S3, RDS, file server)
Processing → where data is computed on (app server, Lambda, ML pipeline)
Transfers → where data moves to (third-party APIs, email, reporting tools)
Disposal → how data is destroyed when retention period ends
Each flow must answer:
- What classification is this data?
- Is this transfer authorised?
- Is it encrypted in transit?
- Is the receiving party authorised and covered by a DPA/BAA?
- How long is it retained at the destination?
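One way to make those questions auditable is to record each flow as structured data, so gaps (unencrypted transfer, missing DPA) can be queried rather than hunted for. A sketch with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class DataFlow:
    """One edge in the data flow map, answering the checklist above."""
    source: str
    destination: str
    classification: str        # Public / Internal / Confidential / Restricted
    authorised: bool
    encrypted_in_transit: bool
    dpa_in_place: bool         # a DPA/BAA covers the receiving party
    retention_days: int

def flow_violations(flow: DataFlow) -> list:
    """Return the checklist items this flow fails."""
    issues = []
    if not flow.authorised:
        issues.append("transfer not authorised")
    if flow.classification in ("Confidential", "Restricted"):
        if not flow.encrypted_in_transit:
            issues.append("not encrypted in transit")
        if not flow.dpa_in_place:
            issues.append("no DPA/BAA with receiving party")
    return issues

f = DataFlow("app-server", "analytics-vendor", "Restricted",
             authorised=True, encrypted_in_transit=True,
             dpa_in_place=False, retention_days=30)
print(flow_violations(f))  # ['no DPA/BAA with receiving party']
```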

| Requirement | Public | Internal | Confidential | Restricted |
|---|---|---|---|---|
| Encryption at rest | Optional | Optional | Required | Required |
| Encryption in transit | Optional | Recommended | Required | Required (TLS 1.2+) |
| Access logging | No | Optional | Recommended | Required; individual audit trail |
| Access controls | None | Authentication | Need-to-know ACL | Named individual ACL + approval |
| Sharing externally | Freely | With approval | NDA required | Prohibited unless specific exception |
| Storage on personal devices | OK | With MDM | Prohibited | Prohibited |
| Retention | As needed | Per policy | As needed | Per regulatory obligation |
| Destruction | Standard delete | Standard delete | Certified wipe | Physical destruction or certified wipe |
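Once the matrix is fixed, it is worth encoding it once so tooling (CI checks, deployment validation) can query it instead of humans re-reading the table. A sketch using a few of the matrix's own values, trimmed for brevity:

```python
# Encoded from the handling matrix above (subset of rows shown).
HANDLING = {
    "encryption_at_rest": {
        "Public": "Optional", "Internal": "Optional",
        "Confidential": "Required", "Restricted": "Required",
    },
    "encryption_in_transit": {
        "Public": "Optional", "Internal": "Recommended",
        "Confidential": "Required", "Restricted": "Required (TLS 1.2+)",
    },
    "access_logging": {
        "Public": "No", "Internal": "Optional",
        "Confidential": "Recommended",
        "Restricted": "Required; individual audit trail",
    },
}

def requirement(control: str, tier: str) -> str:
    """Look up what the matrix demands for a control at a tier."""
    return HANDLING[control][tier]

print(requirement("encryption_at_rest", "Confidential"))  # Required
```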

Classification is only useful if data is labelled so systems and humans know how to handle it.

```powershell
# Microsoft Purview Information Protection (formerly Azure Information Protection)
# Sensitivity labels applied to Word/Excel/PDF/email:
#   Public → Internal → Confidential → Highly Confidential
# A label can apply:
#   - Visual marking (header/footer: "CONFIDENTIAL")
#   - Encryption (only authorised users can decrypt)
#   - DLP policy triggers (based on label)
#   - Retention policies
#   - Access control (can't copy/forward/print if restricted)

# Check the label on a file (AzureInformationProtection client module)
Get-AIPFileStatus -Path "contract.docx"

# List labels in your tenant
Connect-IPPSSession
Get-Label | Select-Object DisplayName, Priority, Guid
```
```shell
# AWS: tag S3 buckets by data classification
aws s3api put-bucket-tagging --bucket mybucket --tagging '{
  "TagSet": [
    {"Key": "DataClassification", "Value": "Confidential"},
    {"Key": "DataOwner", "Value": "finance-team"},
    {"Key": "RetentionYears", "Value": "7"}
  ]
}'

# Find untagged S3 buckets (no classification tag)
aws s3api list-buckets --query 'Buckets[].Name' --output text | tr '\t' '\n' | while read -r bucket; do
  tags=$(aws s3api get-bucket-tagging --bucket "$bucket" 2>/dev/null | grep DataClassification)
  [ -z "$tags" ] && echo "UNTAGGED: $bucket"
done

# Azure: tag resources
az resource tag --ids /subscriptions/.../myresource \
  --tags "DataClassification=Restricted" "DataOwner=security-team"

# Enforce tagging via Azure Policy (deny creation without classification tag)
az policy definition create --name "require-data-classification" \
  --rules '{"if":{"field":"tags[DataClassification]","exists":"false"},"then":{"effect":"deny"}}'
```

DLP tools monitor, detect, and block sensitive data leaving authorised channels.

ENDPOINT DLP
→ Agent on workstations that monitors:
- File copies to USB drives
- Uploads to personal cloud (Dropbox, personal Google Drive)
- Printing of sensitive documents
- Screenshots of restricted content
- Email attachments from managed email client
NETWORK DLP
→ Inline proxy or CASB that inspects:
- HTTP/HTTPS traffic (decrypt TLS to inspect)
- SMTP email outbound
- Cloud app uploads (Box, Dropbox, SharePoint)
- FTP, SCP transfers
EMAIL DLP
→ Scans email body and attachments:
- Detect credit card numbers in attachments
- Block sending PAN data outside the domain
- Quarantine emails with health record data
- Enforce encryption for sensitive outbound email
CLOUD DLP
→ Scans cloud storage (S3, GCS, Azure Blob, SharePoint):
- Finds PII/PHI/PAN-like patterns in stored files
- Alerts or automatically reclassifies discovered files
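As an illustration of the detection step all four DLP types share, a minimal content filter might combine pattern matching with a destination check. This is a toy sketch with illustrative patterns only; production DLP engines use validated detectors, document fingerprinting, and exact data matching:

```python
import re

# Illustrative patterns (same shapes as the discovery greps earlier)
PAN_RE = re.compile(r"\b(?:[0-9]{4}[ -]?){3}[0-9]{4}\b")
SSN_RE = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")

def should_block(body: str, recipient_domain: str,
                 internal_domain: str = "example.com") -> bool:
    """Block when sensitive patterns would leave the internal domain."""
    external = recipient_domain != internal_domain
    sensitive = bool(PAN_RE.search(body) or SSN_RE.search(body))
    return external and sensitive

print(should_block("card: 4111 1111 1111 1111", "gmail.com"))    # True
print(should_block("card: 4111 1111 1111 1111", "example.com"))  # False
```

The same predicate, with better detectors, sits behind endpoint agents (file copy events), network proxies (HTTP uploads), and mail gateways (outbound SMTP).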
```powershell
# View DLP policies
Connect-IPPSSession
Get-DlpCompliancePolicy | Select-Object Name, IsEnabled, Mode

# DLP policy modes:
#   TestWithoutNotifications → audit only; no blocking
#   TestWithNotifications    → audit + notify user
#   Enable                   → enforce (block + notify)

# View DLP rules in a policy
Get-DlpComplianceRule -Policy "PCI DSS Policy" |
  Select-Object Name, ContentContainsSensitiveInformation

# Create a DLP rule (block external sharing of credit card numbers)
New-DlpComplianceRule -Name "Block PAN external sharing" `
  -Policy "PCI DSS Policy" `
  -ContentContainsSensitiveInformation @{name="Credit Card Number"; minCount=1} `
  -AccessScope NotInOrganization `
  -BlockAccess $true `
  -NotifyUser Owner

# Check DLP incidents over the last 7 days
Get-DlpDetailReport -StartDate (Get-Date).AddDays(-7) -EndDate (Get-Date) |
  Select-Object Date, Subject, Policy, Action
```
```shell
# Install the gcloud CLI and enable the DLP API
gcloud services enable dlp.googleapis.com

# Scan a GCS bucket for PII
python3 << 'EOF'
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
project = "my-project"

# Define what to find
inspect_config = dlp_v2.InspectConfig(
    info_types=[
        dlp_v2.InfoType(name="EMAIL_ADDRESS"),
        dlp_v2.InfoType(name="PHONE_NUMBER"),
        dlp_v2.InfoType(name="CREDIT_CARD_NUMBER"),
        dlp_v2.InfoType(name="US_SOCIAL_SECURITY_NUMBER"),
    ],
    min_likelihood=dlp_v2.Likelihood.LIKELY,
    include_quote=False,
)

# Storage to scan
storage_config = dlp_v2.StorageConfig(
    cloud_storage_options=dlp_v2.CloudStorageOptions(
        file_set=dlp_v2.CloudStorageOptions.FileSet(url="gs://my-bucket/**")
    )
)

# Create the inspection job
job = dlp_v2.InspectJobConfig(
    inspect_config=inspect_config, storage_config=storage_config
)
response = client.create_dlp_job(parent=f"projects/{project}", inspect_job=job)
print(f"Job created: {response.name}")
EOF
```
```shell
# Enable Macie (automated sensitive-data scanning for S3)
aws macie2 enable-macie

# Create a one-time classification job
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "PII-Scan-Q1" \
  --s3-job-definition '{
    "bucketDefinitions": [
      {"accountId": "123456789012", "buckets": ["my-data-bucket"]}
    ]
  }'

# List high/critical findings
aws macie2 list-findings \
  --finding-criteria '{"criterion": {"severity.description": {"eq": ["High","Critical"]}}}' \
  | jq '.findingIds'

# Get finding details
aws macie2 get-findings --finding-ids <finding-id> | jq '.findings[].description'
```

| Data type | Minimum retention | Maximum (typical) | Destruction method |
|---|---|---|---|
| Financial records | 7 years (tax law) | 7 years | Certified destruction |
| HR / employee records | Duration of employment + 7 years | As above | Certified destruction |
| HIPAA ePHI | 6 years (HIPAA) | Per state law | Certified destruction |
| PCI card data | Only as long as needed | Immediate post-auth (CVV: never) | Secure wipe or tokenise |
| GDPR personal data | As long as processing purpose exists | Per purpose limitation | Anonymisation or deletion |
| Security logs | 90 days accessible | 12 months (PCI); varies | Secure deletion |
| Backup tapes | Per recovery RTO/RPO | Aligned with retention of contained data | Degauss + shred |
```shell
# Automate data retention with S3 Lifecycle rules
aws s3api put-bucket-lifecycle-configuration --bucket my-logs-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "SecurityLogRetention",
        "Status": "Enabled",
        "Filter": {"Prefix": "security-logs/"},
        "Transitions": [
          {"Days": 30, "StorageClass": "STANDARD_IA"},
          {"Days": 90, "StorageClass": "GLACIER"}
        ],
        "Expiration": {"Days": 365}
      }
    ]
  }'

# Spot-check for old objects that should already have expired
aws s3api list-objects-v2 --bucket my-bucket \
  --query 'Contents[?LastModified<`2025-01-01`].[Key,LastModified]'
```
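Lifecycle rules can silently fail to apply (a prefix filter that no longer matches, a suspended rule), so it is worth verifying expiry independently of the date baked into a query. A minimal helper for the date arithmetic, which can be fed object listings from any store:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_expired(last_modified: datetime, retention_days: int,
               now: Optional[datetime] = None) -> bool:
    """True if an object has outlived its retention period."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified > timedelta(days=retention_days)

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
print(is_expired(datetime(2024, 1, 1, tzinfo=timezone.utc), 365, now))  # True
print(is_expired(datetime(2025, 5, 1, tzinfo=timezone.utc), 365, now))  # False
```

Anything flagged here but still present in the bucket indicates the lifecycle rule is not doing its job.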

GDPR Data Subject Rights (Technical Implementation)


If you process personal data of EU residents, GDPR grants data subjects rights you must technically support:

| Right | Technical requirement |
|---|---|
| Right of access | Ability to export all data about a person (by email/ID) |
| Right to erasure (right to be forgotten) | Ability to delete all data about a person across all systems |
| Right to portability | Export data in machine-readable format (JSON/CSV) |
| Right to rectification | Ability to update incorrect personal data |
```python
from datetime import datetime

# Example: data subject erasure request handler
def handle_erasure_request(user_id: str):
    """Delete all personal data for a given user across systems."""
    # 1. Database: anonymise or delete records
    db.execute("""
        UPDATE users SET
            email = '[email protected]',
            name = 'Deleted User',
            phone = NULL,
            address = NULL
        WHERE id = %s
    """, [user_id])

    # 2. Logs: pseudonymise (replace email with a hash in logs).
    #    Full deletion from logs may violate retention requirements --
    #    consult legal before deleting audit logs.

    # 3. Backups: add to a "deleted users" list; filter on restore
    add_to_erasure_list(user_id)

    # 4. Third parties: notify processors (email provider, analytics, etc.)
    notify_processors(user_id)

    # 5. Document: record the erasure request and completion date
    log_erasure(user_id, completed_at=datetime.now())
```
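The right to portability can follow the same shape: gather a user's records from each system and emit one machine-readable bundle. A sketch where the record sources are hypothetical (fetching is system-specific, so the caller passes the rows in):

```python
import json

def handle_portability_request(user_id: str, records: dict) -> str:
    """Export all personal data for a user as JSON.

    `records` maps a system name (e.g. "crm", "orders") to the rows
    already fetched for this user from that system.
    """
    bundle = {"user_id": user_id, "systems": records}
    # default=str covers dates and other non-JSON-native values
    return json.dumps(bundle, indent=2, default=str)

export = handle_portability_request(
    "u-123",
    {"crm": [{"name": "Deleted User"}], "orders": [{"order_id": 42}]},
)
print(export)
```

Keeping the per-system fetchers behind one registry means access, portability, and erasure requests all iterate the same list of systems, so a newly added data store cannot be forgotten by one handler but not the others.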