How to Autoscale Kubernetes
Autoscaling in Kubernetes is a foundational capability that enables applications to dynamically adjust their resource allocation based on real-time demand. In today’s cloud-native environments, where traffic patterns are unpredictable and performance expectations are high, manual scaling is no longer viable. Autoscaling ensures optimal resource utilization, cost efficiency, and service reliability by automatically adding or removing compute resources—whether pods, nodes, or clusters—without human intervention. This tutorial provides a comprehensive, step-by-step guide to implementing autoscaling across all layers of a Kubernetes cluster, from pod-level horizontal and vertical scaling to node-level and cluster-level automation. Whether you’re managing microservices on public clouds, hybrid infrastructures, or on-premises data centers, mastering Kubernetes autoscaling is essential for building resilient, scalable, and cost-effective systems.
Step-by-Step Guide
Understanding Kubernetes Autoscaling Components
Kubernetes autoscaling operates at three distinct levels: Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Cluster Autoscaler (CA). Each serves a unique purpose and must be configured appropriately to achieve full automation.
The Horizontal Pod Autoscaler (HPA) adjusts the number of pod replicas based on observed metrics such as CPU utilization, memory usage, or custom application-specific metrics. It works by querying the Kubernetes Metrics Server and scaling the associated Deployment, StatefulSet, or ReplicaSet up or down within defined limits.
The Vertical Pod Autoscaler (VPA) modifies the CPU and memory requests and limits of individual pods. Unlike HPA, which adds or removes pods, VPA changes the resource allocation of existing pods, requiring them to be restarted. VPA is ideal for applications with irregular or long-term resource usage patterns that don’t respond well to horizontal scaling.
The Cluster Autoscaler (CA) operates at the infrastructure layer. It monitors for pods that cannot be scheduled due to insufficient node resources and automatically provisions new worker nodes from the cloud provider’s node pool. Conversely, when nodes are underutilized for extended periods, CA terminates them to reduce costs.
Together, these three components form a complete autoscaling ecosystem. HPA handles application-level demand, VPA optimizes per-pod resource efficiency, and CA ensures the underlying infrastructure scales in sync.
Prerequisites for Autoscaling
Before implementing autoscaling, ensure your Kubernetes cluster meets the following prerequisites:
- A running Kubernetes cluster (version 1.19 or higher recommended)
- The Kubernetes Metrics Server installed and operational
- Appropriate RBAC permissions for autoscaling components
- Cloud provider integration (if using Cluster Autoscaler on AWS, GCP, Azure, etc.)
- Resource requests and limits defined in all pod specifications
The Metrics Server is critical—it collects resource usage data from kubelets and exposes it via the Kubernetes API. Without it, HPA and VPA cannot function. To install it, use:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Verify its status:
kubectl get pods -n kube-system | grep metrics-server
Ensure the pods are in a Running state. If not, check logs with kubectl logs -n kube-system <metrics-server-pod-name>.
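Once the pod is Running, confirm that metrics are actually flowing:

kubectl top nodes
kubectl top pods --all-namespaces

Both commands should return current CPU and memory usage within a minute or so of installation; if they error, HPA and VPA will not work.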
Configuring Horizontal Pod Autoscaler (HPA)
HPA is the most commonly used autoscaling mechanism. It scales the number of pod replicas based on metrics.
Start by deploying a sample application with defined resource requests and limits. Here’s an example Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
Apply it:
kubectl apply -f web-app-deployment.yaml
Now create the HPA to scale between 2 and 10 replicas when average CPU utilization exceeds 70%:
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10
Alternatively, define it in YAML for version control:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Apply the HPA:
kubectl apply -f web-app-hpa.yaml
Monitor scaling events:
kubectl get hpa
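To see why the HPA last scaled, or why it refused to, describe it and read the Events section:

kubectl describe hpa web-app-hpa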
For more granular control, use custom metrics from Prometheus or other monitoring tools. For example, to scale based on HTTP requests per second:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
This requires the Prometheus Adapter to expose custom metrics to the Kubernetes API. Install it via Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter
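The adapter only serves metrics you map in its rules configuration, supplied as Helm values (helm install ... -f values.yaml). A sketch, assuming your application exports a counter named http_requests_total with namespace and pod labels, and that Prometheus runs at the address shown (both are assumptions to adjust):

# values.yaml for prometheus-adapter (sketch)
prometheus:
  url: http://prometheus-server.monitoring.svc   # assumption: your Prometheus service
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^http_requests_total$"
        as: "http_requests_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'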
Configuring Vertical Pod Autoscaler (VPA)
VPA adjusts CPU and memory requests and limits of pods automatically. Unlike HPA, it does not scale replicas—it changes the resource profile of existing pods, which requires pod restarts.
Install VPA with the installation script shipped in the kubernetes/autoscaler repository:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Verify installation:
kubectl get pods -n kube-system | grep vpa
Now, create a VPA object targeting your Deployment. Note: VPA must be configured in Recommendation mode first to avoid unintended restarts.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"  # Start in "Off" mode to observe recommendations
Apply it:
kubectl apply -f web-app-vpa.yaml
Check recommendations:
kubectl get vpa web-app-vpa -o yaml
Look under status.recommendation.containerRecommendations for suggested CPU and memory values. Once validated, switch updateMode to Auto to enable automatic updates:
updatePolicy:
  updateMode: "Auto"
Important: VPA cannot manage pods that are not backed by a controller (such as static pods), and it should not be combined with an HPA that scales on the same CPU or memory metrics, as the two will fight. Use it for stateless, replicable workloads like web servers, APIs, and background workers.
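Once in Auto mode, it also helps to bound what VPA may recommend so a noisy signal cannot shrink or balloon a pod unreasonably; a sketch using the VPA resourcePolicy field (the bounds are illustrative):

spec:
  resourcePolicy:
    containerPolicies:
      - containerName: "*"        # apply to all containers in the pod
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "1"
          memory: "1Gi"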
Configuring Cluster Autoscaler
Cluster Autoscaler is provider-specific. Below are examples for AWS EKS, GCP GKE, and Azure AKS.
AWS EKS
For EKS, ensure your node group is backed by an Auto Scaling Group (ASG). Then deploy the Cluster Autoscaler using the official Helm chart:
helm repo add eks https://aws.github.io/eks-charts
helm install cluster-autoscaler eks/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=your-eks-cluster-name \
--set awsRegion=us-east-1 \
--set rbac.serviceAccount.create=true \
--set rbac.serviceAccount.name=cluster-autoscaler
Alternatively, use the YAML manifest with IAM permissions attached to the node role:
# Abridged manifest; the full official example also defines the ServiceAccount and RBAC objects.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0  # pin to your cluster's Kubernetes minor version
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/your-eks-cluster-name
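Auto-discovery only finds ASGs that carry the two tags referenced in the flag above. A sketch of adding them with the AWS CLI (the ASG name is a placeholder):

aws autoscaling create-or-update-tags --tags \
  "ResourceId=your-asg-name,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=your-asg-name,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/your-eks-cluster-name,Value=owned,PropagateAtLaunch=true"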
GCP GKE
GKE runs Cluster Autoscaler as part of the managed control plane; you only need to enable autoscaling on a node pool via the CLI:
gcloud container node-pools update your-node-pool \
--cluster=your-cluster \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10
Cluster Autoscaler runs automatically in the background. Because GKE hosts it on the managed control plane, you won't find its pod in kube-system; check the status configmap it publishes instead:
kubectl describe configmap cluster-autoscaler-status -n kube-system
Azure AKS
Enable autoscaling on an AKS node pool:
az aks nodepool update \
--resource-group your-resource-group \
--cluster-name your-aks-cluster \
--name nodepool1 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10
Verify that autoscaling is enabled on the pool:
az aks nodepool show \
  --resource-group your-resource-group \
  --cluster-name your-aks-cluster \
  --name nodepool1 \
  --query enableAutoScaling
Cluster Autoscaler will now respond to unschedulable pods by adding nodes from the configured node pool. It removes nodes only after 10 minutes of consistent underutilization.
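The 10-minute scale-down delay and related thresholds are tunable through the cluster autoscaler profile, for example:

az aks update \
  --resource-group your-resource-group \
  --name your-aks-cluster \
  --cluster-autoscaler-profile scale-down-unneeded-time=5m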
Testing Autoscaling Behavior
To validate your autoscaling setup, simulate load on your application.
Deploy a simple load generator:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-generator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: load-generator
  template:
    metadata:
      labels:
        app: load-generator
    spec:
      containers:
        - name: loader
          image: busybox
          # busybox ships wget, not curl
          command: ['sh', '-c', 'while true; do wget -q -O- http://web-app.default.svc.cluster.local > /dev/null; sleep 0.1; done']
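The loop targets web-app.default.svc.cluster.local, which assumes a Service in front of the Deployment; if you haven't created one yet, a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 80

Apply it (e.g., kubectl apply -f web-app-service.yaml) before starting the generator.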
Apply and monitor:
kubectl apply -f load-generator.yaml
kubectl get hpa -w
kubectl get pods -w
Within a minute or two, you should see HPA increase the replica count as CPU usage rises (the HPA controller evaluates metrics every 15 seconds by default). After the load stops, HPA scales back down once its stabilization window elapses. Cluster Autoscaler may add nodes if pods remain unschedulable due to resource constraints.
Debugging Autoscaling Issues
Common issues and how to resolve them:
- HPA not scaling: Check that the Metrics Server is running and that resource requests/limits are defined. Use kubectl describe hpa <name> to view events.
- VPA not updating pods: Ensure updateMode is set to "Auto" and that the pod is managed by a Deployment or StatefulSet.
- Cluster Autoscaler not adding nodes: Verify cloud provider permissions, node group configuration, and that the pod's resource requests exceed available capacity.
- Pods stuck in Pending: Use kubectl describe pod <pod-name> to check for "Insufficient cpu" or "Insufficient memory" events.
Enable verbose logging for Cluster Autoscaler:
--v=5
Review logs:
kubectl logs -n kube-system <cluster-autoscaler-pod-name>
Best Practices
Define Realistic Resource Requests and Limits
Autoscaling relies on accurate resource definitions. Under-provisioning causes performance degradation; over-provisioning wastes money and prevents efficient scheduling.
Use tools like kubectl top pods and kubectl top nodes to observe actual usage. Then set requests to 70–80% of average usage and limits to 150–200% of peak usage.
Avoid setting identical limits across all containers. Different services have different resource profiles—API gateways may need more CPU, while background workers may need more memory.
Use Coordinated Scaling Policies
HPA, VPA, and CA should work in harmony. For example, if VPA increases a pod’s memory request beyond the node’s capacity, Cluster Autoscaler must respond by adding a larger node.
Use node taints and tolerations to group workloads by resource needs. For example, memory-intensive workloads can be scheduled on nodes with high RAM, while CPU-heavy workloads run on compute-optimized instances.
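As a sketch, taint and label the high-memory nodes, then give memory-intensive pods a matching toleration (the node name, taint key, and label are illustrative):

kubectl taint nodes mem-node-1 workload=memory-intensive:NoSchedule
kubectl label nodes mem-node-1 node-class=high-memory

Then in the pod spec:

spec:
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "memory-intensive"
      effect: "NoSchedule"
  nodeSelector:
    node-class: high-memory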
Set Appropriate Scaling Cooldown Periods
Scaling too frequently causes instability. By default, HPA scales up without delay but waits through a 5-minute (300-second) stabilization window before scaling down. Customize both through the behavior field of the autoscaling/v2 HorizontalPodAutoscaler.
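A sketch of that behavior block, keeping scale-up immediate while limiting how fast replicas are removed:

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react to load immediately
      policies:
        - type: Percent
          value: 100                  # at most double the replica count per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 minutes before shrinking
      policies:
        - type: Pods
          value: 2                    # remove at most 2 pods per minute
          periodSeconds: 60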
For workloads with bursty traffic (e.g., batch jobs), use KEDA (Kubernetes Event-Driven Autoscaling) to trigger scaling based on events like queue depth, rather than periodic metrics.
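As a sketch, a KEDA ScaledObject that scales a worker Deployment on RabbitMQ queue depth; the Deployment name, queue name, and RABBITMQ_URL environment variable are illustrative assumptions:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker                  # hypothetical Deployment to scale
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        queueName: work-queue
        mode: QueueLength         # scale on messages waiting in the queue
        value: "20"               # target backlog per replica
        hostFromEnv: RABBITMQ_URL # connection string read from the container env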
Avoid Scaling on Custom Metrics Without Validation
Custom metrics (e.g., requests per second, database latency) can be powerful but risky. Ensure the metric is stable, measurable, and directly tied to user experience. Avoid using metrics that fluctuate rapidly or are influenced by external factors like network latency.
Use alerting and monitoring to validate scaling behavior. If HPA scales up because of a spike in error rates, it may be reacting to a bug, not load.
Use Pod Disruption Budgets (PDBs)
When Cluster Autoscaler or VPA terminates pods, ensure applications remain available. Define a PodDisruptionBudget to guarantee minimum available pods during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web-app
This ensures at least one pod remains running during scaling events.
Monitor and Alert on Scaling Events
Track autoscaling activity with observability tools like Prometheus, Grafana, or cloud-native monitoring. Create dashboards showing:
- Number of replicas over time
- Node count and utilization
- HPA target vs. actual utilization
- Cluster Autoscaler scale-up/scale-down events
Set alerts for:
- HPA reaching max replicas for more than 10 minutes (a sample Prometheus rule for this follows the list)
- Cluster Autoscaler unable to provision nodes
- Pods pending for more than 5 minutes
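The first alert can be expressed as a Prometheus rule; a sketch assuming kube-state-metrics is installed (its metric names vary across versions, so verify them against your deployment):

groups:
  - name: autoscaling-alerts
    rules:
      - alert: HPAAtMaxReplicas
        expr: kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at max replicas for 10 minutes"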
Test Scaling Under Realistic Load
Don’t rely on synthetic benchmarks. Use load testing tools like Locust, k6, or Artillery to simulate real user behavior. Test during peak hours, after deployments, and during traffic spikes.
Document your scaling thresholds and response times. This becomes part of your system’s SLA and incident response playbook.
Use Cost Optimization Tools
Autoscaling reduces waste, but further savings come from:
- Using spot/preemptible instances for non-critical workloads
- Enabling node auto-provisioning (GKE) or node pool auto-scaling (EKS)
- Applying resource quotas and limits at the namespace level (see the sketch after this list)
- Using tools like Kubecost or Prometheus + Grafana for cost attribution
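A minimal ResourceQuota sketch that caps a namespace's total requests and limits (the namespace and values are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi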
Tools and Resources
Core Kubernetes Tools
- Kubernetes Metrics Server – Required for HPA and VPA. Collects resource usage data from kubelets.
- Horizontal Pod Autoscaler (HPA) – Built-in Kubernetes component for replica scaling.
- Vertical Pod Autoscaler (VPA) – Official Kubernetes project for adjusting pod resources.
- Cluster Autoscaler – Official project for adding/removing nodes based on scheduling constraints.
Advanced Autoscaling Tools
- KEDA (Kubernetes Event-Driven Autoscaling) – Enables scaling based on event sources like Kafka, RabbitMQ, Azure Queues, or Prometheus metrics. Ideal for event-driven architectures.
- Prometheus + Prometheus Adapter – Exposes custom metrics to HPA. Essential for application-specific scaling rules.
- OpenKruise – Alibaba’s extended Kubernetes controller suite, offering advanced autoscaling and workload management features.
- Argo Rollouts – Combines canary deployments with autoscaling for intelligent traffic shifting during scaling events.
Monitoring and Observability
- Prometheus – Open-source monitoring and alerting toolkit.
- Grafana – Visualization platform for metrics dashboards.
- Kubecost – Cost monitoring and optimization for Kubernetes clusters.
- Datadog / New Relic / Dynatrace – Commercial APM tools with Kubernetes integration.
Cloud Provider Resources
- AWS EKS – Cluster Autoscaler Documentation
- GCP GKE – Autoscaling in GKE
- Azure AKS – AKS Cluster Autoscaler Guide
Learning Resources
- “Kubernetes in Action” by Marko Luksa – Comprehensive guide covering autoscaling in depth.
- Kubernetes.io Official Docs – HPA, VPA, CA
- KEDA Documentation – keda.sh/docs
Real Examples
Example 1: E-commerce Website on AWS EKS
An e-commerce platform experiences traffic spikes during Black Friday sales. The frontend is served by a Deployment with 3 replicas. The HPA is configured to scale between 3 and 50 replicas based on CPU usage above 65%.
During a sale, traffic surges. HPA scales to 48 replicas within 90 seconds. However, the existing nodes are at 90% capacity. Cluster Autoscaler detects unschedulable pods and provisions 4 new m5.large instances from the ASG. After the sale ends, traffic drops. HPA scales back to 3 replicas. Cluster Autoscaler waits 15 minutes, then terminates the 4 extra nodes, saving $120 in cloud costs.
Custom metrics from CloudWatch (requests per minute) are fed into Prometheus via the CloudWatch Exporter. A second HPA triggers scaling if the API error rate exceeds 5%, ensuring user experience is maintained even if CPU is not overloaded.
Example 2: Data Processing Pipeline on GKE
A data ingestion pipeline processes incoming sensor data from IoT devices. Each job is a pod that reads from a Pub/Sub topic. The workload is highly variable—sometimes 10 jobs per hour, sometimes 500.
Instead of using HPA with CPU metrics, KEDA is configured to scale based on Pub/Sub backlog. When messages accumulate, KEDA triggers pod creation. When the backlog drops below 100, pods are terminated.
VPA is applied to optimize memory usage. Each pod requests 512Mi and is limited to 2Gi. VPA recommends 1Gi after analyzing 7 days of data. The update mode is switched to Auto.
Cluster Autoscaler uses a node pool of n1-standard-4 instances with autoscaling from 2 to 20 nodes. During peak hours, 18 nodes are provisioned. At night, only 2 remain. Monthly savings exceed $2,000.
Example 3: On-Premises Kubernetes with Mixed Workloads
A financial institution runs Kubernetes on bare-metal servers with limited hardware. They use HPA for web services and VPA for batch jobs. Cluster Autoscaler is replaced with a custom script that triggers VM provisioning via Ansible when node capacity is exceeded.
They use resource quotas to prevent any namespace from consuming more than 40% of cluster capacity. This prevents one team’s workload from starving others.
Monitoring is done with Prometheus and Alertmanager. Alerts trigger Slack notifications when HPA reaches max replicas or when VPA recommendations change by more than 50%.
FAQs
Can I use HPA and VPA together?
Yes, but with caution. HPA scales replicas; VPA changes resource requests. If VPA increases a pod’s memory request beyond the node’s capacity, the pod may become unschedulable. Always validate VPA recommendations before enabling Auto mode, and ensure Cluster Autoscaler is active to handle node provisioning.
Does autoscaling work with StatefulSets?
Yes, HPA and VPA both support StatefulSets. However, Cluster Autoscaler only helps if the StatefulSet’s pods require more resources than available nodes. StatefulSets with persistent storage must ensure new nodes can mount volumes—use node affinity or storage classes compatible with dynamic provisioning.
Why isn’t my HPA scaling up even though CPU is high?
Check these common causes: (1) Resource requests are not defined in the pod spec, (2) Metrics Server is not running or unreachable, (3) The HPA target utilization is set too high (e.g., 95%), (4) The pod is in a CrashLoopBackOff state, (5) The HPA is misconfigured with incorrect target resource name.
How long does Cluster Autoscaler take to add a node?
Typically 1–5 minutes, depending on cloud provider provisioning speed. AWS EKS may take longer if ASG launch templates require image builds. Use node pools with pre-warmed AMIs or container-optimized OS to reduce latency.
Is autoscaling expensive?
No—it reduces costs by eliminating over-provisioning. Well-tuned autoscaling is commonly reported to cut cloud infrastructure costs by 30–60% compared to statically provisioned clusters. The key is combining HPA, VPA, and CA to match supply with demand precisely.
Can I autoscale based on memory usage?
Yes. HPA supports memory-based scaling. Define a metric with type: Resource and name: memory. Use averageUtilization or averageValue to set thresholds. Memory scaling is less common than CPU because memory is harder to reclaim, but it’s essential for memory-intensive applications like databases or caches.
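The metrics entry mirrors the CPU example from earlier, with memory substituted (the threshold is illustrative):

metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75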
What happens if Cluster Autoscaler can’t find a suitable node type?
If no node type in the pool can satisfy a pod’s resource request, the pod remains unschedulable, and Cluster Autoscaler logs a warning. Ensure your node pools include a range of instance types (e.g., small, medium, large) and consider using node affinity or taints to direct workloads appropriately.
Should I use autoscaling for databases?
Generally, no. Databases like PostgreSQL or MySQL are stateful and don’t scale horizontally well. Use vertical scaling (VPA) cautiously, and only if the database supports live resizing. Prefer dedicated, sized instances with replication for high availability.
Conclusion
Autoscaling in Kubernetes is not a single feature—it’s a system of coordinated components that work together to ensure applications are always performing optimally while minimizing resource waste. By mastering Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and Cluster Autoscaler, you gain the ability to build systems that respond intelligently to real-world traffic patterns, from quiet nights to viral product launches.
The key to success lies in thoughtful configuration: define accurate resource requests, validate scaling triggers, integrate observability, and test under realistic conditions. Avoid the temptation to enable autoscaling without understanding its implications. Use custom metrics wisely, combine tools like KEDA and Prometheus for advanced scenarios, and always monitor the outcomes.
As Kubernetes continues to dominate cloud-native infrastructure, the ability to autoscale effectively will separate reactive teams from proactive, resilient engineering organizations. Start small—enable HPA on one deployment. Observe, measure, refine. Then expand to VPA and Cluster Autoscaler. With each layer you add, your system becomes more intelligent, more efficient, and more capable of handling the unpredictable nature of modern applications.