
Cloud Cost Optimization

  • Cloud cost optimization is the practice of reducing cloud spend without reducing the business value delivered - not just cutting costs, but eliminating waste while keeping the system performant and reliable.
  • Unmanaged cloud costs are the norm, not the exception. The pay-as-you-go model makes it trivially easy to overspend - idle resources, over-provisioned instances, and forgotten services accumulate silently.
  • The discipline is called FinOps (Financial Operations) - a collaboration between engineering, finance, and product teams to make cost a shared responsibility, not Finance’s problem.
  • Everyone is accountable for their cloud usage. Costs should be visible to the teams generating them, not just reviewed by finance at end of month.
  • Cost visibility comes before cost reduction. You can’t optimize what you can’t see. Start by understanding where money is going before making changes.
  • Iterate, don’t batch. Small continuous improvements beat a one-time annual optimization project.
  • Business value trumps raw cost. A $10k/month service generating $1M in revenue is not a problem. A $500/month forgotten test environment that should have been deleted 6 months ago is.

Before optimizing, understand the structure of a cloud bill:

  • Compute: The largest cost driver for most workloads. Charges for CPU and memory by the hour/second.
  • Storage: Typically cheap per GB but can accumulate. Watch for snapshot sprawl, orphaned volumes, and unaccessed S3/Blob objects.
  • Data Transfer (Egress): Often the most underestimated cost. Moving data out of cloud regions or to the internet is charged; moving data in is usually free. Egress between AZs within a region also carries fees.
  • Managed Services: Databases, queues, caches - convenience costs more per unit than running it yourself, but operational cost savings often make it worth it.
  • Licensing: Windows instances, SQL Server, and commercial software carry additional licensing costs on top of compute.

Rightsizing

  • Rightsizing is the process of matching instance size to actual workload demands - neither over-provisioned (wasting money) nor under-provisioned (degrading performance).
  • Most teams default to “larger than we need, just in case” and never revisit it. This is the most common source of waste.
  1. Collect utilization data - use your cloud provider’s monitoring (CloudWatch, Azure Monitor, GCP Cloud Monitoring) to measure average and peak CPU, memory, and network utilization over 1–4 weeks.
  2. Identify candidates - instances consistently under 20–30% CPU utilization are prime candidates.
  3. Downsize in stages - drop one size tier, monitor for 1–2 weeks, then continue if stable. Don’t go straight from an XL to a small.
  4. Use provider recommendations - AWS Compute Optimizer, Azure Advisor, and GCP Recommender all surface rightsizing recommendations automatically.
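Steps 1–2 above reduce to a simple filter over collected metrics. A minimal sketch, assuming utilization has already been exported from your monitoring tool - the data shape and thresholds here are illustrative, not a provider API:

```python
# Flag rightsizing candidates: instances whose CPU stays well under capacity.
# The input shape is hypothetical; in practice these numbers would come from
# CloudWatch / Azure Monitor / Cloud Monitoring exports over 1-4 weeks.

CPU_THRESHOLD = 25.0  # percent; the 20-30% band suggested above

def rightsizing_candidates(metrics, threshold=CPU_THRESHOLD):
    """metrics: list of dicts with instance id, type, and avg/peak CPU %.
    Require low average AND a modest peak, so bursty workloads are spared."""
    return [
        m["id"]
        for m in metrics
        if m["avg_cpu"] < threshold and m["peak_cpu"] < 2 * threshold
    ]

observed = [
    {"id": "i-0a1", "type": "m5.2xlarge", "avg_cpu": 8.4,  "peak_cpu": 21.0},
    {"id": "i-0b2", "type": "m5.xlarge",  "avg_cpu": 62.0, "peak_cpu": 91.0},
    {"id": "i-0c3", "type": "c5.4xlarge", "avg_cpu": 14.1, "peak_cpu": 38.5},
]
print(rightsizing_candidates(observed))  # → ['i-0a1', 'i-0c3']
```

Checking the peak as well as the average is the design point: a 10% average with 95% peaks is a bursty workload, not a downsizing candidate.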

Cloud providers offer multiple ways to pay for compute. Choosing the right model is one of the highest-leverage optimizations.

On-Demand

  • Pay for compute by the hour or second, no commitment.
  • Use for: Unpredictable workloads, new services where sizing is unknown, short-term projects.
  • Cost: Highest per-unit price. Baseline for comparing other models.

Reserved Instances & Savings Plans

  • Commit to a specific instance type (and sometimes region) for 1 or 3 years in exchange for a significant discount (typically 30–60% off on-demand).
  • Use for: Stable, predictable baseline workloads that you know will run continuously.
  • Watch out for: Committing to an instance type before you’ve rightsized. Lock in at the wrong size and you’re stuck paying for waste.
  • AWS variants: Standard RIs (least flexible, deepest discount), Convertible RIs (can change instance family, smaller discount), Savings Plans (most flexible - the commitment is to $/hr of spend rather than a specific instance type).
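The commitment trade-off comes down to a break-even calculation: a reservation is billed every hour whether used or not, so it only wins above a certain uptime. A sketch with placeholder rates, not real prices:

```python
# Break-even analysis: 1-year reservation vs on-demand.
# All prices are illustrative placeholders, not real quotes.

on_demand_hourly = 0.40   # $/hr, pay-as-you-go
reserved_hourly = 0.24    # $/hr effective, 1-year commitment (40% off)
hours_per_year = 8760

# The reservation is charged for 100% of hours, so it beats on-demand only
# when the instance actually runs more than this fraction of the time.
break_even_utilization = reserved_hourly / on_demand_hourly  # 0.6 → 60%

annual_on_demand = on_demand_hourly * hours_per_year * 0.70  # runs 70% of hours
annual_reserved = reserved_hourly * hours_per_year           # billed for all hours

print(f"break-even at {break_even_utilization:.0%} uptime")
print(f"on-demand @70% uptime: ${annual_on_demand:,.0f}/yr")
print(f"reserved:              ${annual_reserved:,.0f}/yr")
```

At a 40% discount the break-even is 60% uptime - which is why reservations fit stable baseline load but not spiky or experimental workloads.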

Spot / Preemptible Instances

  • Spare cloud capacity offered at 60–90% off on-demand. The provider can reclaim instances on short notice - about 2 minutes on AWS, 30 seconds on Azure and GCP.
  • Use for: Fault-tolerant, stateless, or batch workloads - CI/CD workers, data processing pipelines, rendering farms, ML training jobs.
  • Not suitable for: Stateful services, databases, anything requiring stable uptime.
  • Best practice: Mix spot with a small on-demand or reserved baseline. Use Spot Instance diversification across multiple instance types and AZs to reduce interruption probability.

Serverless

  • Pay strictly per invocation and duration, not for idle time.
  • Cost advantage: Zero cost when idle; scales to zero automatically.
  • Watch out for: High-volume, long-running functions can cost more than a reserved instance at scale. Always model the cost before going serverless on a high-throughput path.

Autoscaling

  • Autoscaling is not just an availability tool - it’s a cost tool. Scaling down during off-peak hours directly reduces the compute bill.
  • Schedule-based scaling: If traffic patterns are predictable (workday peaks, weekend drops), schedule scale-in events rather than waiting for metrics to trigger them.
  • Target tracking: Set autoscaling to maintain a target utilization (e.g., 70% CPU) instead of a fixed instance count. This ensures both adequate headroom and no wasted capacity.
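Target tracking is, at its core, a proportional rule: pick the capacity that would bring utilization back to the target. A minimal sketch of that rule (the provider's actual controller adds cooldowns and smoothing on top):

```python
import math

def desired_capacity(current_count, current_util, target_util):
    """Proportional rule behind target tracking: solve for the instance
    count that would bring utilization back to the target, rounding up
    so the fleet errs on the side of headroom."""
    return max(1, math.ceil(current_count * current_util / target_util))

print(desired_capacity(10, 85.0, 70.0))  # scale out: 13 instances
print(desired_capacity(10, 35.0, 70.0))  # scale in: 5 instances
```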

Storage Optimization

  • S3/Blob lifecycle policies: Automatically transition objects to cheaper storage tiers (Infrequent Access, Glacier/Archive) after a set period, and delete them once their retention period expires. Set this on every bucket - without it, data accumulates forever.
  • Storage tiers:
    • Hot/Standard: Frequent access. Highest storage cost, lowest retrieval cost.
    • Cool/Infrequent Access: Occasional access. Lower storage cost, higher retrieval cost.
    • Archive/Glacier: Rare access. Lowest storage cost, retrieval takes minutes to hours and charges apply.
  • Snapshot cleanup: Automated snapshots accumulate quickly. Define a retention policy (e.g., keep 7 daily, 4 weekly, 12 monthly) and enforce it.
  • Orphaned volumes: Disks detached from terminated instances continue to incur charges. Audit and delete regularly.
  • Compress and deduplicate before storing large datasets, especially in data lakes.

Data Transfer Optimization

  • Move filtering and processing close to the data to reduce egress. Run queries in the same region as your data.
  • Use CDNs (CloudFront, Azure CDN, Cloud CDN) to serve static assets from edge locations, reducing origin egress costs and improving latency.
  • VPC endpoints / Private Link: Keep traffic between your services and cloud provider APIs private (e.g., S3 via VPC endpoint). This avoids NAT Gateway egress charges and is also a security win.
  • Consolidate microservices that talk to each other frequently into the same AZ to avoid inter-AZ data transfer fees.
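The snapshot retention policy suggested earlier (keep 7 daily, 4 weekly, 12 monthly) reduces to a selection function over snapshot dates. A sketch - the grandfather-father-son scheme here is one common interpretation, not a provider feature:

```python
from datetime import date, timedelta

def snapshots_to_keep(snap_dates, daily=7, weekly=4, monthly=12):
    """Grandfather-father-son retention: keep the newest `daily` snapshots,
    the newest snapshot in each of the last `weekly` ISO weeks, and the
    newest snapshot in each of the last `monthly` calendar months."""
    snaps = sorted(set(snap_dates), reverse=True)  # newest first
    keep = set(snaps[:daily])
    seen_weeks, seen_months = [], []
    for d in snaps:
        wk = d.isocalendar()[:2]  # (ISO year, ISO week)
        if wk not in seen_weeks and len(seen_weeks) < weekly:
            seen_weeks.append(wk)
            keep.add(d)
        mo = (d.year, d.month)
        if mo not in seen_months and len(seen_months) < monthly:
            seen_months.append(mo)
            keep.add(d)
    return keep

# Example: 60 consecutive daily snapshots ending 2024-06-30
today = date(2024, 6, 30)
snaps = [today - timedelta(days=i) for i in range(60)]
kept = snapshots_to_keep(snaps)
```

Everything not returned by the function is a deletion candidate; running this on a schedule is what "enforce it" means in practice.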

Tagging & Cost Allocation

  • Tags are the foundation of cost visibility. Without them, you can’t answer “how much does feature X cost?” or “which team is responsible for this bill?”
  • Enforce a tagging policy at the organization level:
    • Environment: prod / staging / dev
    • Team: platform / data / frontend
    • Project: <project-name>
    • Owner: <email>
  • Use tag-based cost allocation in your provider’s billing console (AWS Cost Explorer, Azure Cost Management, GCP Billing) to build dashboards per team/project.
  • Enforce tags via policy engines (AWS Service Control Policies, Azure Policy) to prevent untagged resources from being created.
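Once tags flow through to the billing export, per-team attribution is a simple group-by. A sketch with made-up line items (real exports, e.g. AWS Cost and Usage Reports, carry far more columns):

```python
from collections import defaultdict

# Aggregate billing line items by a tag key. The data below is made up
# to show the shape; an untagged resource falls into an explicit bucket
# so the gap in coverage stays visible.
line_items = [
    {"service": "EC2", "cost": 1240.50, "tags": {"Team": "platform", "Environment": "prod"}},
    {"service": "S3",  "cost": 310.20,  "tags": {"Team": "data", "Environment": "prod"}},
    {"service": "EC2", "cost": 95.00,   "tags": {}},  # untagged → unattributable
]

def cost_by_tag(items, key):
    totals = defaultdict(float)
    for item in items:
        totals[item["tags"].get(key, "(untagged)")] += item["cost"]
    return dict(totals)

print(cost_by_tag(line_items, "Team"))
# → {'platform': 1240.5, 'data': 310.2, '(untagged)': 95.0}
```

Tracking the size of the "(untagged)" bucket over time is a useful proxy for how well the tagging policy is actually enforced.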

Tool                    Provider      Purpose
AWS Cost Explorer       AWS           Visualize and analyze spend trends, reservation coverage
AWS Compute Optimizer   AWS           Rightsizing recommendations for EC2, RDS, Lambda
AWS Trusted Advisor     AWS           Cost, security, and performance checks
Azure Cost Management   Azure         Budgets, alerts, cost analysis by resource group/tag
Azure Advisor           Azure         Rightsizing and reservation recommendations
Infracost               Multi-cloud   Cost estimates in CI/CD pipelines - catch expensive changes before merge
Kubecost                Kubernetes    Pod and namespace-level cost attribution in K8s clusters
OpenCost                Kubernetes    Open-source Kubernetes cost monitoring (CNCF project)

Quick-Win Checklist

  • Delete unused/stopped instances that have been idle for 30+ days
  • Delete orphaned EBS volumes / managed disks not attached to any instance
  • Set S3/Blob lifecycle policies on all buckets
  • Delete old unused snapshots
  • Identify and purchase Reserved Instances for stable production workloads
  • Enable autoscaling on all stateless services with predictable traffic patterns
  • Set budget alerts at 80% and 100% of expected monthly spend
  • Audit NAT Gateway usage - often a hidden cost driver
  • Review data transfer costs in billing console - investigate any unexpected egress spikes
  • Enforce mandatory tags on all new resources via policy