The Math Behind VM Right-Sizing (Stop guessing your Azure SKU)

Tags: #finops #azure #infrastructure #cloud

We have all done this at some point. You are deploying a new application, and the manager asks, "What size VM do we need?"

You don't want to be the person whose production server crashed because it ran out of RAM. So, what do you do? You take the estimated requirement and multiply it by 2 or 4. "Just to be safe."

If the load test hit 60% CPU on 4 vCPUs, we request 8 vCPUs. The VM goes live, runs at 12% utilization, and we never look at it again.

This "Safety-margin culture" is the single biggest reason for cloud waste.

I am currently building CloudSavvy.io to automate this, but today I want to share the core engineering logic and the math you need to implement right-sizing yourself without breaking production.

Problem Statement: The Cost of Static Sizing

Most organizations size VMs at deployment time and never revisit the decision. This is a structural issue.

Consider a D8s_v5 (8 vCPU, 32 GiB) in East US.

  • Cost: ~$280/month.
  • Actual Usage: 11% CPU, 22% Memory.

A D4s_v5 (4 vCPU, 16 GiB) costs ~$140/month and would handle that load with plenty of buffer. With 200 VMs like this, the waste is roughly 200 × $140 × 12 ≈ $336,000 a year, well into six figures.

The problem is not that engineers over-provision deliberately. The problem is that right-sizing requires continuous, metrics-driven evaluation—and most teams lack the instrumentation to do it systematically.

Core Metrics Required (CPU is not enough)

Many scripts just look at "Average CPU" and suggest a downsize. This is dangerous. You need to analyze four resource dimensions over a 30-day window.

1. CPU Utilization

Raw average is insufficient. You need three statistical views:

  • Average: If it is below 20% sustained for 30 days, it is a downsizing candidate.
  • P95 (95th Percentile): This captures the realistic peak. If P95 is below 50%, you are definitely over-provisioned.
  • Peak (P99/Max): If Peak is high (90%+) but P95 is low, the workload is "bursty." Do not switch to a smaller fixed SKU; consider a B-series (Burstable) instead.
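
As a minimal sketch (assuming you already have a list of hourly CPU samples as percentages), these three views plus the burstiness signal can be computed directly:

```python
import numpy as np

def cpu_statistics(cpu_samples_pct):
    """Summarize hourly CPU samples (0-100) into the views discussed above."""
    samples = np.asarray(cpu_samples_pct, dtype=float)
    mean = samples.mean()
    return {
        "avg": mean,
        "p95": np.percentile(samples, 95),
        "p99": np.percentile(samples, 99),
        "max": samples.max(),
        # Coefficient of variation; > 0.6 suggests a bursty workload (see the decision logic below).
        "cv": samples.std() / mean if mean > 0 else 0.0,
    }
```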

2. Memory Utilization

This is the most neglected metric. A VM can run at 10% CPU while using 85% of available memory (common for databases and caching workloads).

Formula:
memory_utilization_pct = ((total_memory - available_memory) / total_memory) * 100

If average memory utilization exceeds 80% sustained, the VM is a candidate for upsizing or a family change (e.g., to E-series), regardless of CPU. If you ignore this, you risk out-of-memory (OOM) crashes.
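
A minimal sketch of the same formula in code, with the 80% upsizing rule attached (function and parameter names are illustrative):

```python
def memory_utilization_pct(total_memory_bytes, available_memory_bytes):
    """Percentage of memory actually in use, per the formula above."""
    return (total_memory_bytes - available_memory_bytes) / total_memory_bytes * 100

def needs_more_memory(avg_memory_util_pct, threshold_pct=80.0):
    """Sustained utilization above ~80% flags an upsize or an E-series family change."""
    return avg_memory_util_pct >= threshold_pct
```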

3. Disk IOPS and Throughput

Disk performance constrains VM sizing independently of CPU. Azure VM SKUs have hard ceilings.

  • Standard_D4s_v5: Max 6,400 IOPS.
  • Standard_D2s_v5: Max 3,200 IOPS.

If your workload sustains 5,800 IOPS and you downsize to a D2s because "CPU is low," you will hit I/O throttling and the application will lag. Always compare P95 IOPS against the target SKU limit.

4. Network Throughput

Similar to disk, network bandwidth is SKU-dependent. If sustained network throughput exceeds 60% of the target SKU's ceiling, block the downsize. Network-bound workloads (like API gateways) often have low CPU but cannot tolerate bandwidth reduction.
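
Here is a hedged sketch of both I/O guardrails together. The SKU limits below are illustrative placeholders; in practice you would pull them from Azure's published VM size documentation or the Resource SKUs API for your region:

```python
# Illustrative limits only; source real numbers from Azure's VM size docs
# or the Resource SKUs API for your region.
SKU_LIMITS = {
    "Standard_D2s_v5": {"max_iops": 3200, "max_network_mbps": 12500},
    "Standard_D4s_v5": {"max_iops": 6400, "max_network_mbps": 12500},
}

def io_blocks_downsize(target_sku, disk_iops_p95, network_mbps_sustained):
    """True if the target SKU cannot absorb observed disk or network load with headroom."""
    limits = SKU_LIMITS[target_sku]
    iops_too_tight = limits["max_iops"] < disk_iops_p95 * 1.2                   # 20% IOPS headroom
    net_too_tight = network_mbps_sustained > limits["max_network_mbps"] * 0.6   # 60% ceiling rule
    return iops_too_tight or net_too_tight
```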

Sizing Decision Logic

You cannot rely on simple thresholds. You need a decision framework.

Here is the logic flow:

Step 1: Coverage Gate
If cpu_hours < 648 (90% of the 720 hours in a 30-day window), BLOCK. Do not guess with insufficient data.

Step 2: Classification

  • cpu_sustained_low = (cpu_p95 < 20%) AND (cpu_avg < 15%)
  • memory_low = (memory_p95 < 40%)
  • memory_high = (memory_p95 >= 75%)

Step 3: Action Determination

  • IF cpu_sustained_low AND memory_low:

    • Action: DOWNSIZE within same family.
    • Example: D8s_v5 → D4s_v5.
  • IF cpu_sustained_low AND memory_high:

    • Action: SWITCH FAMILY to memory-optimized (E-series).
    • Example: D8s_v5 → E4s_v5 (fewer vCPUs, same memory).
  • IF cpu_high AND memory_low:

    • Action: SWITCH FAMILY to compute-optimized (F-series).
    • Example: D8s_v5 → F8s_v2 (same vCPU, less memory, higher clock speed).
  • IF CPU variability is high (stddev/mean > 0.6):

    • Action: RECOMMEND BURSTABLE (B-series).
    • Example: D4s_v5 → B4ms.

Step 4: Guardrails

  • IOPS Safety: IF target_sku_max_iops < current_disk_iops_p95 * 1.2 → BLOCK.
  • Production Tag: IF resource is tagged "Production" → apply 30% stricter headroom margins.
  • Compliance: IF tagged "PCI-DSS" or "HIPAA" → BLOCK automated resize.
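
Putting the four steps together, here is a minimal sketch of the decision function. Thresholds mirror the rules above; the cpu_high cutoff and the metrics dict layout are assumptions for illustration:

```python
def recommend_action(m):
    """Apply the coverage gate, classification, action mapping, and guardrails.

    `m` is a dict of 30-day metrics, e.g.:
    {"cpu_hours": 720, "cpu_avg": 12, "cpu_p95": 25, "cpu_stddev": 4,
     "memory_p95": 89, "disk_iops_p95": 1800, "target_sku_max_iops": 6400,
     "tags": ["Production"]}
    """
    # Step 1: coverage gate -- never recommend on thin data.
    if m["cpu_hours"] < 648:
        return "BLOCK: insufficient metric coverage"

    # Step 4 guardrails (checked up front): compliance and IOPS safety.
    if {"PCI-DSS", "HIPAA"} & set(m.get("tags", [])):
        return "BLOCK: compliance tag, no automated resize"
    if m["target_sku_max_iops"] < m["disk_iops_p95"] * 1.2:
        return "BLOCK: target SKU cannot absorb P95 IOPS with 20% headroom"

    # Step 2: classification.
    cpu_sustained_low = m["cpu_p95"] < 20 and m["cpu_avg"] < 15
    cpu_high = m["cpu_p95"] >= 70            # cutoff not defined above; illustrative
    memory_low = m["memory_p95"] < 40
    memory_high = m["memory_p95"] >= 75
    cpu_bursty = m["cpu_avg"] > 0 and (m["cpu_stddev"] / m["cpu_avg"]) > 0.6

    # Step 3: action determination.
    if cpu_sustained_low and memory_low:
        return "DOWNSIZE within the same family (e.g., D8s_v5 -> D4s_v5)"
    if cpu_sustained_low and memory_high:
        return "SWITCH FAMILY to memory-optimized (e.g., D8s_v5 -> E4s_v5)"
    if cpu_high and memory_low:
        return "SWITCH FAMILY to compute-optimized (e.g., D8s_v5 -> F8s_v2)"
    if cpu_bursty:
        return "RECOMMEND BURSTABLE (e.g., D4s_v5 -> B4ms)"
    return "KEEP: current SKU matches the workload"
```

The Production-tag rule (30% stricter headroom) would tighten the thresholds above before classification; it is omitted here for brevity.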

Example Scenarios

Scenario A: The Memory-Bound Database

  • Current: Standard_D8s_v5 (8 vCPU, 32 GiB) — USD 280/month
  • Metrics: CPU avg 12%, P95 25% | Memory avg 78%, P95 89%
  • Analysis: CPU is underutilized, but memory is near capacity. Downsizing D-series reduces RAM, risking OOM.
  • Recommendation: Switch to Standard_E4s_v5 (4 vCPU, 32 GiB).
  • Savings: USD 85/month. Memory preserved, CPU reduced to match actual utilization.

Scenario B: The GPU Mistake

  • Current: Standard_NC24s_v3 (24 vCPU, 4x V100 GPUs) — USD 9,204/month
  • Metrics: GPU utilization avg 22% (single GPU active).
  • Analysis: Only 1 of 4 GPUs is active. The workload is a single-model inference service that does not parallelize.
  • Recommendation: Downsize to Standard_NC6s_v3 (6 vCPU, 1x V100).
  • Savings: USD 6,903/month.

Data Engineering Considerations

If you are implementing this, keep in mind:

  1. Telemetry: Use Azure Monitor Metrics API (Microsoft.Compute/virtualMachines), not Resource Graph. Resource Graph provides metadata, not performance history.
  2. Sampling Window: 30 days is the mandatory minimum to capture monthly batch jobs. 7 days is too risky.
  3. Missing Data: Missing metric hours are not zero-utilization hours. If the agent was down, do not interpolate. Block the recommendation.
  4. ROI Check: Calculate the exact monthly cost delta. If savings < USD 5/month, skip it. It's not worth the engineering effort.
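
As a starting point for item 1, here is a hedged sketch of pulling 30 days of hourly CPU averages with the azure-monitor-query SDK. The resource ID is a placeholder, and error handling plus iteration over many VMs are left out:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID -- substitute your subscription, resource group and VM name.
VM_RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Compute/virtualMachines/<vm-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

response = client.query_resource(
    VM_RESOURCE_ID,
    metric_names=["Percentage CPU"],
    timespan=timedelta(days=30),           # 30-day minimum window from item 2
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.AVERAGE, MetricAggregationType.MAXIMUM],
)

# Flatten hourly samples; missing hours simply do not appear (item 3: never treat them as zero).
samples = [
    point.average
    for metric in response.metrics
    for series in metric.timeseries
    for point in series.data
    if point.average is not None
]
print(f"Collected {len(samples)} hourly CPU samples")
```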

Conclusion

Right-sizing is not just about cost minimization—it is cost-to-performance optimization. The goal is to eliminate waste without introducing performance risk.

A one-time audit is not enough because workloads change. If you automate this logic effectively, you can maintain performance while significantly reducing your Azure bill.

If you are looking for a tool that automates this entire decision framework, do check out CloudSavvy.io.

Let me know in the comments if you have faced issues with IOPS throttling after resizing!