Cloud Cost Optimization
- Cloud cost optimization is the practice of reducing cloud spend without reducing the business value delivered - not just cutting costs, but eliminating waste while keeping the system performing and reliable.
- Unmanaged cloud costs are the norm, not the exception. The pay-as-you-go model makes it trivially easy to overspend - idle resources, over-provisioned instances, and forgotten services accumulate silently.
- The discipline is called FinOps (Financial Operations) - a collaboration between engineering, finance, and product teams to make cost a shared responsibility, not Finance’s problem.
FinOps Core Principles
- Everyone is accountable for their cloud usage. Costs should be visible to the teams generating them, not just reviewed by finance at end of month.
- Cost visibility comes before cost reduction. You can’t optimize what you can’t see. Start by understanding where money is going before making changes.
- Iterate, don’t batch. Small continuous improvements beat a one-time annual optimization project.
- Business value trumps raw cost. A $10k/month service generating $1M in revenue is not a problem. A $500/month forgotten test environment that should have been deleted 6 months ago is.
Understanding Your Bill
Before optimizing, understand the structure:
- Compute: The largest cost driver for most workloads. Charges for CPU and memory by the hour/second.
- Storage: Typically cheap per GB but can accumulate. Watch for snapshot sprawl, orphaned volumes, and rarely accessed S3/Blob objects.
- Data Transfer (Egress): Often the most underestimated cost. Moving data out of cloud regions or to the internet is charged; moving data in is usually free. Egress between AZs within a region also carries fees.
- Managed Services: Databases, queues, caches - convenience that costs more per unit than self-managing, but the operational savings often justify it.
- Licensing: Windows instances, SQL Server, and commercial software carry additional licensing costs on top of compute.
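To make the structure concrete, a bill can be broken down by these categories and ranked by share. A minimal sketch with hypothetical figures - in practice you would pull these from your provider's billing export:

```python
# Illustrative breakdown of a monthly cloud bill by category.
# All dollar figures are hypothetical placeholders.
bill = {
    "compute": 6200.00,          # instance hours - usually the largest driver
    "storage": 900.00,           # volumes, snapshots, object storage
    "data_transfer": 1100.00,    # egress and inter-AZ traffic
    "managed_services": 1500.00, # databases, queues, caches
    "licensing": 300.00,         # Windows, SQL Server, commercial software
}

total = sum(bill.values())
# Rank categories by spend so the biggest optimization targets surface first
for category, cost in sorted(bill.items(), key=lambda kv: -kv[1]):
    print(f"{category:<17} ${cost:>9,.2f}  ({cost / total:5.1%})")
print(f"{'total':<17} ${total:>9,.2f}")
```

Even this simple ranking reflects the usual pattern: compute dominates, while data transfer is larger than most teams expect.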
Rightsizing
- Rightsizing is the process of matching instance size to actual workload demands - neither over-provisioned (wasting money) nor under-provisioned (degrading performance).
- Most teams default to “larger than we need, just in case” and never revisit it. This is the most common source of waste.
How to Rightsize
- Collect utilization data - use your cloud provider’s monitoring (CloudWatch, Azure Monitor, GCP Cloud Monitoring) to measure average and peak CPU, memory, and network utilization over 1–4 weeks.
- Identify candidates - instances consistently under 20–30% CPU utilization are prime candidates.
- Downsize in stages - drop one size tier, monitor for 1–2 weeks, then continue if stable. Don’t go straight from an XL to a small.
- Use provider recommendations - AWS Compute Optimizer, Azure Advisor, and GCP Recommender all surface rightsizing recommendations automatically.
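The candidate-identification step can be sketched as a simple filter over utilization data. The instance records and thresholds below are hypothetical; real numbers would come from the monitoring sources listed above:

```python
# Sketch of step 2: flag rightsizing candidates from utilization data.
# Instance records are hypothetical; in practice these averages come from
# CloudWatch / Azure Monitor / GCP Cloud Monitoring over 1-4 weeks.
instances = [
    {"id": "i-0aaa", "avg_cpu": 12.0, "peak_cpu": 35.0, "size": "m5.2xlarge"},
    {"id": "i-0bbb", "avg_cpu": 55.0, "peak_cpu": 85.0, "size": "m5.xlarge"},
    {"id": "i-0ccc", "avg_cpu": 8.0,  "peak_cpu": 22.0, "size": "c5.4xlarge"},
]

def rightsizing_candidates(instances, avg_threshold=25.0, peak_threshold=50.0):
    """Flag instances whose average AND peak CPU leave room to drop a tier."""
    return [
        inst for inst in instances
        if inst["avg_cpu"] < avg_threshold and inst["peak_cpu"] < peak_threshold
    ]

for inst in rightsizing_candidates(instances):
    print(f"{inst['id']} ({inst['size']}): avg {inst['avg_cpu']}% CPU - consider downsizing")
```

Checking peak as well as average matters: an instance averaging 12% CPU but spiking to 95% during batch jobs is not a safe downsize.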
Pricing Models
Cloud providers offer multiple ways to pay for compute. Choosing the right model is one of the highest-leverage optimizations.
On-Demand
- Pay for compute by the hour or second, no commitment.
- Use for: Unpredictable workloads, new services where sizing is unknown, short-term projects.
- Cost: Highest per-unit price. Baseline for comparing other models.
Reserved Instances / Committed Use
- Commit to a specific instance type (and sometimes region) for 1 or 3 years in exchange for a significant discount (typically 30–60% off on-demand).
- Use for: Stable, predictable baseline workloads that you know will run continuously.
- Watch out for: Committing to an instance type before you’ve rightsized. Lock in at the wrong size and you’re stuck paying for waste.
- AWS variants: Standard RIs (least flexible, deepest discount), Convertible RIs (can change instance family, lower discount), Savings Plans (more flexible, commitment is in $/hr spend not instance type).
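A back-of-envelope comparison shows both the discount and the break-even point. The hourly rates here are hypothetical stand-ins; check your provider's current pricing:

```python
# On-demand vs a 1-year reservation, back of the envelope.
# Both hourly rates are hypothetical; use your provider's price sheet.
HOURS_PER_YEAR = 8760

on_demand_rate = 0.192  # $/hr, hypothetical on-demand price
reserved_rate = 0.121   # $/hr effective, hypothetical 1-yr no-upfront rate

on_demand_annual = on_demand_rate * HOURS_PER_YEAR
reserved_annual = reserved_rate * HOURS_PER_YEAR
savings = on_demand_annual - reserved_annual

print(f"on-demand: ${on_demand_annual:,.0f}/yr")
print(f"reserved:  ${reserved_annual:,.0f}/yr")
print(f"savings:   ${savings:,.0f}/yr ({savings / on_demand_annual:.0%})")

# The reservation only pays off if the instance runs most of the year.
# Below this many on-demand hours, you'd have been better off not committing:
break_even_hours = reserved_annual / on_demand_rate
print(f"break-even: {break_even_hours:,.0f} hours/yr "
      f"({break_even_hours / HOURS_PER_YEAR:.0%} utilization)")
```

This is why rightsizing must come first: the commitment locks in the per-hour rate of whatever size you reserve, waste included.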
Spot / Preemptible Instances
- Spare cloud capacity offered at a 60–90% discount off on-demand. The provider can reclaim instances with 2 minutes’ notice (AWS/Azure) or 30 seconds (GCP).
- Use for: Fault-tolerant, stateless, or batch workloads - CI/CD workers, data processing pipelines, rendering farms, ML training jobs.
- Not suitable for: Stateful services, databases, anything requiring stable uptime.
- Best practice: Mix spot with a small on-demand or reserved baseline. Use Spot Instance diversification across multiple instance types and AZs to reduce interruption probability.
Serverless Pricing
- Pay strictly per invocation and duration, not for idle time.
- Cost advantage: Zero cost when idle; scales to zero automatically.
- Watch out for: High-volume, long-running functions can cost more than a reserved instance at scale. Always model the cost before going serverless on a high-throughput path.
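The crossover can be modeled before committing. The rates below are illustrative stand-ins shaped like a Lambda-style price sheet (a per-request fee plus GB-seconds of duration), and the reserved baseline is hypothetical; verify both against your provider:

```python
# Rough model of the serverless cost crossover described above.
# Rates are illustrative (per-request fee + GB-seconds of duration);
# verify against your provider's actual price sheet.
PER_MILLION_REQUESTS = 0.20   # $ per 1M invocations (illustrative)
PER_GB_SECOND = 0.0000166667  # $ per GB-second of duration (illustrative)

def serverless_monthly_cost(requests_per_month, avg_duration_s, memory_gb):
    request_cost = requests_per_month / 1_000_000 * PER_MILLION_REQUESTS
    compute_cost = requests_per_month * avg_duration_s * memory_gb * PER_GB_SECOND
    return request_cost + compute_cost

reserved_monthly = 90.0  # hypothetical reserved instance serving the same load

for req in (1_000_000, 50_000_000, 500_000_000):
    cost = serverless_monthly_cost(req, avg_duration_s=0.2, memory_gb=0.5)
    cheaper = "serverless" if cost < reserved_monthly else "reserved"
    print(f"{req:>11,} req/mo -> ${cost:9,.2f} ({cheaper} wins)")
```

At low volume serverless is nearly free; past some throughput the per-invocation charges overtake a flat reserved rate, which is exactly the trap on high-throughput paths.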
Autoscaling for Cost
- Autoscaling is not just an availability tool - it’s a cost tool. Scaling down during off-peak hours directly reduces the compute bill.
- Schedule-based scaling: If traffic patterns are predictable (workday peaks, weekend drops), schedule scale-in events rather than waiting for metrics to trigger them.
- Target tracking: Set autoscaling to maintain a target utilization (e.g., 70% CPU) instead of a fixed instance count. This ensures both adequate headroom and no wasted capacity.
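Schedule-based scaling amounts to picking capacity from the clock instead of waiting for metrics. A minimal sketch, assuming a workday-peak pattern; the hours and instance counts are hypothetical:

```python
# Sketch of schedule-based scaling: choose desired capacity from the clock
# rather than reacting to metrics. Windows and counts are hypothetical and
# assume a workday-peak, weekend-trough traffic pattern.
def desired_capacity(hour_utc, weekday):
    """weekday: 0 = Monday ... 6 = Sunday"""
    if weekday >= 5:          # weekend: minimal baseline
        return 2
    if 8 <= hour_utc < 20:    # weekday peak window
        return 10
    return 4                  # weekday off-peak

print(desired_capacity(14, weekday=1))  # Tuesday afternoon: full capacity
print(desired_capacity(3, weekday=6))   # Sunday night: baseline
```

In practice this logic lives in the provider's scheduled-action feature (e.g., EC2 Auto Scaling scheduled actions) rather than your own code, but the shape of the decision is the same.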
Storage Optimization
- S3/Blob lifecycle policies: Automatically transition objects to cheaper storage tiers (Infrequent Access, Glacier/Archive) after a set period, and delete them after a retention policy expires. Set this on every bucket - without it, data accumulates forever.
- Storage tiers:
- Hot/Standard: Frequent access. Highest storage cost, lowest retrieval cost.
- Cool/Infrequent Access: Occasional access. Lower storage cost, higher retrieval cost.
- Archive/Glacier: Rare access. Lowest storage cost, retrieval takes minutes to hours and charges apply.
- Snapshot cleanup: Automated snapshots accumulate quickly. Define a retention policy (e.g., keep 7 daily, 4 weekly, 12 monthly) and enforce it.
- Orphaned volumes: Disks detached from terminated instances continue to incur charges. Audit and delete regularly.
- Compress and deduplicate before storing large datasets, especially in data lakes.
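A lifecycle policy matching the tiering above can be expressed as a rule set: transition to Infrequent Access at 30 days, Glacier at 90, delete at 365. The day counts are hypothetical; adjust to your retention requirements:

```python
# Sketch of an S3 lifecycle policy implementing the tiering above.
# Day thresholds are hypothetical; tune them to your access patterns.
lifecycle_policy = {
    "Rules": [
        {
            "ID": "tier-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = every object in the bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # cool tier
                {"Days": 90, "StorageClass": "GLACIER"},      # archive tier
            ],
            "Expiration": {"Days": 365},  # delete after retention expires
        }
    ]
}

# Applied with boto3 (requires AWS credentials; shown for illustration):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_policy,
# )
```

The same shape works in Terraform or the console; the point is that the policy exists on every bucket, not which tool applies it.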
Data Transfer Cost Reduction
- Move filtering and processing close to the data to reduce egress. Run queries in the same region as your data.
- Use CDNs (CloudFront, Azure CDN, Cloud CDN) to serve static assets from edge locations, reducing origin egress costs and improving latency.
- VPC endpoints / Private Link: Keep traffic between your services and cloud provider APIs private (e.g., S3 via VPC endpoint). This avoids NAT Gateway egress charges and is also a security win.
- Consolidate microservices that talk to each other frequently into the same AZ to avoid inter-AZ data transfer fees.
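The NAT Gateway point is worth quantifying: it charges both per hour and per GB processed, while an S3 gateway endpoint carries no data processing fee. A rough comparison with illustrative rates (roughly us-east-1-shaped; check current pricing):

```python
# Rough cost of routing S3 traffic through a NAT Gateway vs a gateway
# VPC endpoint. Rates are illustrative; verify against current pricing.
HOURS_PER_MONTH = 730
nat_hourly = 0.045  # $/hr per NAT Gateway (illustrative)
nat_per_gb = 0.045  # $/GB processed through the gateway (illustrative)

def nat_monthly_cost(gb_per_month):
    return nat_hourly * HOURS_PER_MONTH + nat_per_gb * gb_per_month

for gb in (100, 1_000, 10_000):
    print(f"{gb:>6,} GB/mo via NAT: ${nat_monthly_cost(gb):9,.2f} "
          f"(vs no processing charge via S3 gateway endpoint)")
```

The per-GB component is what makes NAT Gateways a hidden cost driver: a data pipeline pulling from S3 through one can quietly dominate the networking bill.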
Tagging and Cost Allocation
- Tags are the foundation of cost visibility. Without them, you can’t answer “how much does feature X cost?” or “which team is responsible for this bill?”
- Enforce a tagging policy at the organization level:
  - Environment: prod / staging / dev
  - Team: platform / data / frontend
  - Project: <project-name>
  - Owner: <email>
- Use tag-based cost allocation in your provider’s billing console (AWS Cost Explorer, Azure Cost Management, GCP Billing) to build dashboards per team/project.
- Enforce tags via policy engines (AWS Service Control Policies, Azure Policy) to prevent untagged resources from being created.
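The enforcement check itself is simple: compare a resource's tags against the required set. A minimal sketch, where the required keys mirror the policy above and the resource records are hypothetical:

```python
# Minimal sketch of tag-policy enforcement: report required tag keys
# that a resource is missing. Resource records are hypothetical.
REQUIRED_TAGS = {"Environment", "Team", "Project", "Owner"}

def missing_tags(resource_tags):
    """Return the set of required tag keys absent from a resource's tags."""
    return REQUIRED_TAGS - resource_tags.keys()

good = {"Environment": "prod", "Team": "platform",
        "Project": "billing-api", "Owner": "alice@example.com"}
bad = {"Environment": "dev"}

print(sorted(missing_tags(good)))  # compliant: nothing missing
print(sorted(missing_tags(bad)))   # non-compliant: list what to fix
```

In production this check belongs in the policy engine (SCPs, Azure Policy) so untagged resources are rejected at creation time, but the same logic is useful in audit scripts that sweep existing resources.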
Practical Tooling
| Tool | Provider | Purpose |
|---|---|---|
| AWS Cost Explorer | AWS | Visualize and analyze spend trends, reservation coverage |
| AWS Compute Optimizer | AWS | Rightsizing recommendations for EC2, RDS, Lambda |
| AWS Trusted Advisor | AWS | Cost, security, and performance checks |
| Azure Cost Management | Azure | Budgets, alerts, cost analysis by resource group/tag |
| Azure Advisor | Azure | Rightsizing and reservation recommendations |
| Infracost | Multi-cloud | Cost estimates in CI/CD pipelines - catch expensive changes before merge |
| Kubecost | Kubernetes | Pod and namespace-level cost attribution in K8s clusters |
| OpenCost | Kubernetes | Open-source Kubernetes cost monitoring (CNCF project) |
Quick Wins Checklist
- Delete unused/stopped instances that have been idle for 30+ days
- Delete orphaned EBS volumes / managed disks not attached to any instance
- Set S3/Blob lifecycle policies on all buckets
- Delete old unused snapshots
- Identify and purchase Reserved Instances for stable production workloads
- Enable autoscaling on all stateless services with predictable traffic patterns
- Set budget alerts at 80% and 100% of expected monthly spend
- Audit NAT Gateway usage - often a hidden cost driver
- Review data transfer costs in billing console - investigate any unexpected egress spikes
- Enforce mandatory tags on all new resources via policy