How to Autoscale Kubernetes
Autoscaling in Kubernetes is a foundational capability that enables applications to dynamically adjust their resource allocation based on real-time demand. In today’s cloud-native environments, where traffic patterns are unpredictable and performance expectations are high, manual scaling is no longer viable. Autoscaling ensures optimal resource utilization, cost efficiency, and service reliability by automatically adding or removing compute resources—whether pods, nodes, or clusters—without human intervention. This tutorial provides a comprehensive, step-by-step guide to implementing autoscaling across all layers of a Kubernetes cluster, from pod-level horizontal and vertical scaling to node-level and cluster-level automation. Whether you’re managing microservices on public clouds, hybrid infrastructures, or on-premises data centers, mastering Kubernetes autoscaling is essential for building resilient, scalable, and cost-effective systems.
Step-by-Step Guide
Understanding Kubernetes Autoscaling Components
Kubernetes autoscaling operates at three distinct levels: Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Cluster Autoscaler (CA). Each serves a unique purpose and must be configured appropriately to achieve full automation.
The Horizontal Pod Autoscaler (HPA) adjusts the number of pod replicas based on observed metrics such as CPU utilization, memory usage, or custom application-specific metrics. It works by querying the Kubernetes Metrics Server and scaling the associated Deployment, StatefulSet, or ReplicaSet up or down within defined limits.
The Vertical Pod Autoscaler (VPA) modifies the CPU and memory requests and limits of individual pods. Unlike HPA, which adds or removes pods, VPA changes the resource allocation of existing pods, requiring them to be restarted. VPA is ideal for applications with irregular or long-term resource usage patterns that don’t respond well to horizontal scaling.
The Cluster Autoscaler (CA) operates at the infrastructure layer. It monitors for pods that cannot be scheduled due to insufficient node resources and automatically provisions new worker nodes from the cloud provider’s node pool. Conversely, when nodes are underutilized for extended periods, CA terminates them to reduce costs.
Together, these three components form a complete autoscaling ecosystem. HPA handles application-level demand, VPA optimizes per-pod resource efficiency, and CA ensures the underlying infrastructure scales in sync.
Prerequisites for Autoscaling
Before implementing autoscaling, ensure your Kubernetes cluster meets the following prerequisites:
- A running Kubernetes cluster (version 1.19 or higher recommended)
- The Kubernetes Metrics Server installed and operational
- Appropriate RBAC permissions for autoscaling components
- Cloud provider integration (if using Cluster Autoscaler on AWS, GCP, Azure, etc.)
- Resource requests and limits defined in all pod specifications
The Metrics Server is critical—it collects resource usage data from kubelets and exposes it via the Kubernetes API. Without it, HPA and VPA cannot function. To install it, use:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Verify its status:
kubectl get pods -n kube-system | grep metrics-server
Ensure the pods are in a Running state. If not, check logs with kubectl logs -n kube-system <metrics-server-pod-name>.
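Once the pod is Running, confirm that metrics are actually flowing:

kubectl top nodes
kubectl top pods --all-namespaces

Both commands should return current CPU and memory usage within a minute or so of installation; if they error, HPA and VPA will not work.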
Configuring Horizontal Pod Autoscaler (HPA)
HPA is the most commonly used autoscaling mechanism. It scales the number of pod replicas based on metrics.
Start by deploying a sample application with defined resource requests and limits. Here’s an example Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
Apply it:
kubectl apply -f web-app-deployment.yaml
Now create the HPA to scale between 2 and 10 replicas when average CPU utilization exceeds 70%:
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10
Alternatively, define it in YAML for version control:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Apply the HPA:
kubectl apply -f web-app-hpa.yaml
Monitor scaling events:
kubectl get hpa
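To see why the HPA last scaled, or why it refused to, describe it and read the Events section:

kubectl describe hpa web-app-hpa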
For more granular control, use custom metrics from Prometheus or other monitoring tools. For example, to scale based on HTTP requests per second:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
This requires the Prometheus Adapter to expose custom metrics to the Kubernetes API. Install it via Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter
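The adapter only serves metrics you map in its rules configuration, supplied as Helm values (helm install ... -f values.yaml). A sketch, assuming your application exports a counter named http_requests_total with namespace and pod labels, and that Prometheus runs at the address shown (both are assumptions to adjust):

# values.yaml for prometheus-adapter (sketch)
prometheus:
  url: http://prometheus-server.monitoring.svc   # assumption: your Prometheus service
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^http_requests_total$"
        as: "http_requests_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'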
Configuring Vertical Pod Autoscaler (VPA)
VPA adjusts CPU and memory requests and limits of pods automatically. Unlike HPA, it does not scale replicas—it changes the resource profile of existing pods, which requires pod restarts.
Install VPA with the installation script shipped in the kubernetes/autoscaler repository:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Verify installation:
kubectl get pods -n kube-system | grep vpa
Now, create a VPA object targeting your Deployment. Note: VPA must be configured in Recommendation mode first to avoid unintended restarts.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"  # Start in "Off" mode to observe recommendations
Apply it:
kubectl apply -f web-app-vpa.yaml
Check recommendations:
kubectl get vpa web-app-vpa -o yaml
Look under status.recommendation.containerRecommendations for suggested CPU and memory values. Once validated, switch updateMode to Auto to enable automatic updates:
updatePolicy:
  updateMode: "Auto"
Important: VPA cannot manage pods that are not backed by a controller (such as static pods), and it should not be combined with an HPA that scales on the same CPU or memory metrics, as the two will fight. Use it for stateless, replicable workloads like web servers, APIs, and background workers.
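Once in Auto mode, it also helps to bound what VPA may recommend so a noisy signal cannot shrink or balloon a pod unreasonably; a sketch using the VPA resourcePolicy field (the bounds are illustrative):

spec:
  resourcePolicy:
    containerPolicies:
      - containerName: "*"        # apply to all containers in the pod
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "1"
          memory: "1Gi"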
Configuring Cluster Autoscaler
Cluster Autoscaler is provider-specific. Below are examples for AWS EKS, GCP GKE, and Azure AKS.
AWS EKS
For EKS, ensure your node group is backed by an Auto Scaling Group (ASG). Then deploy the Cluster Autoscaler using the official Helm chart:
helm repo add eks https://aws.github.io/eks-charts
helm install cluster-autoscaler eks/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=your-eks-cluster-name \
--set awsRegion=us-east-1 \
--set rbac.serviceAccount.create=true \
--set rbac.serviceAccount.name=cluster-autoscaler
Alternatively, use the YAML manifest with IAM permissions attached to the node role:
# Abridged manifest; the full official example also defines the ServiceAccount and RBAC objects.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0  # pin to your cluster's Kubernetes minor version
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/your-eks-cluster-name
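Auto-discovery only finds ASGs that carry the two tags referenced in the flag above. A sketch of adding them with the AWS CLI (the ASG name is a placeholder):

aws autoscaling create-or-update-tags --tags \
  "ResourceId=your-asg-name,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=your-asg-name,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/your-eks-cluster-name,Value=owned,PropagateAtLaunch=true"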
GCP GKE
GKE runs Cluster Autoscaler as part of the managed control plane; you only need to enable autoscaling on a node pool via the CLI:
gcloud container node-pools update your-node-pool \
--cluster=your-cluster \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10
Cluster Autoscaler runs automatically in the background. Because GKE hosts it on the managed control plane, you won't find its pod in kube-system; check the status configmap it publishes instead:
kubectl describe configmap cluster-autoscaler-status -n kube-system
Azure AKS
Enable autoscaling on an AKS node pool:
az aks nodepool update \
--resource-group your-resource-group \
--cluster-name your-aks-cluster \
--name nodepool1 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10
Verify that autoscaling is enabled on the pool:
az aks nodepool show \
  --resource-group your-resource-group \
  --cluster-name your-aks-cluster \
  --name nodepool1 \
  --query enableAutoScaling
Cluster Autoscaler will now respond to unschedulable pods by adding nodes from the configured node pool. It removes nodes only after 10 minutes of consistent underutilization.
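The 10-minute scale-down delay and related thresholds are tunable through the cluster autoscaler profile, for example:

az aks update \
  --resource-group your-resource-group \
  --name your-aks-cluster \
  --cluster-autoscaler-profile scale-down-unneeded-time=5m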
Testing Autoscaling Behavior
To validate your autoscaling setup, simulate load on your application.
Deploy a simple load generator:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-generator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: load-generator
  template:
    metadata:
      labels:
        app: load-generator
    spec:
      containers:
        - name: loader
          image: busybox
          # busybox ships wget, not curl
          command: ['sh', '-c', 'while true; do wget -q -O- http://web-app.default.svc.cluster.local > /dev/null; sleep 0.1; done']
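The loop targets web-app.default.svc.cluster.local, which assumes a Service in front of the Deployment; if you haven't created one yet, a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 80

Apply it (e.g., kubectl apply -f web-app-service.yaml) before starting the generator.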
Apply and monitor:
kubectl apply -f load-generator.yaml
kubectl get hpa -w
kubectl get pods -w
Within a minute or two, you should see HPA increase the replica count as CPU usage rises (the HPA controller evaluates metrics every 15 seconds by default). After the load stops, HPA scales back down once its stabilization window elapses. Cluster Autoscaler may add nodes if pods remain unschedulable due to resource constraints.
Debugging Autoscaling Issues
Common issues and how to resolve them:
- HPA not scaling: Check that the Metrics Server is running and that resource requests/limits are defined. Use kubectl describe hpa <name> to view events.
- VPA not updating pods: Ensure updateMode is set to "Auto" and that the pod is managed by a Deployment or StatefulSet.
- Cluster Autoscaler not adding nodes: Verify cloud provider permissions, node group configuration, and that the pod's resource requests exceed available capacity.
- Pods stuck in Pending: Use kubectl describe pod <pod-name> to check for "Insufficient cpu" or "Insufficient memory" events.
Enable verbose logging for Cluster Autoscaler:
--v=5
Review logs:
kubectl logs -n kube-system <cluster-autoscaler-pod-name>
Best Practices
Define Realistic Resource Requests and Limits
Autoscaling relies on accurate resource definitions. Under-provisioning causes performance degradation; over-provisioning wastes money and prevents efficient scheduling.
Use tools like kubectl top pods and kubectl top nodes to observe actual usage. Then set requests to 70–80% of average usage and limits to 150–200% of peak usage.
Avoid setting identical limits across all containers. Different services have different resource profiles—API gateways may need more CPU, while background workers may need more memory.
Use Coordinated Scaling Policies
HPA, VPA, and CA should work in harmony. For example, if VPA increases a pod’s memory request beyond the node’s capacity, Cluster Autoscaler must respond by adding a larger node.
Use node taints and tolerations to group workloads by resource needs. For example, memory-intensive workloads can be scheduled on nodes with high RAM, while CPU-heavy workloads run on compute-optimized instances.
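As a sketch, taint and label the high-memory nodes, then give memory-intensive pods a matching toleration (the node name, taint key, and label are illustrative):

kubectl taint nodes mem-node-1 workload=memory-intensive:NoSchedule
kubectl label nodes mem-node-1 node-class=high-memory

Then in the pod spec:

spec:
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "memory-intensive"
      effect: "NoSchedule"
  nodeSelector:
    node-class: high-memory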
Set Appropriate Scaling Cooldown Periods
Scaling too frequently causes instability. By default, HPA scales up without delay but waits through a 5-minute (300-second) stabilization window before scaling down. Customize both through the behavior field of the autoscaling/v2 HorizontalPodAutoscaler.
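A sketch of that behavior block, keeping scale-up immediate while limiting how fast replicas are removed:

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react to load immediately
      policies:
        - type: Percent
          value: 100                  # at most double the replica count per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 minutes before shrinking
      policies:
        - type: Pods
          value: 2                    # remove at most 2 pods per minute
          periodSeconds: 60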
For workloads with bursty traffic (e.g., batch jobs), use KEDA (Kubernetes Event-Driven Autoscaling) to trigger scaling based on events like queue depth, rather than periodic metrics.
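As a sketch, a KEDA ScaledObject that scales a worker Deployment on RabbitMQ queue depth; the Deployment name, queue name, and RABBITMQ_URL environment variable are illustrative assumptions:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker                  # hypothetical Deployment to scale
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        queueName: work-queue
        mode: QueueLength         # scale on messages waiting in the queue
        value: "20"               # target backlog per replica
        hostFromEnv: RABBITMQ_URL # connection string read from the container env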
Avoid Scaling on Custom Metrics Without Validation
Custom metrics (e.g., requests per second, database latency) can be powerful but risky. Ensure the metric is stable, measurable, and directly tied to user experience. Avoid using metrics that fluctuate rapidly or are influenced by external factors like network latency.
Use alerting and monitoring to validate scaling behavior. If HPA scales up because of a spike in error rates, it may be reacting to a bug, not load.
Use Pod Disruption Budgets (PDBs)
When Cluster Autoscaler or VPA terminates pods, ensure applications remain available. Define a PodDisruptionBudget to guarantee minimum available pods during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web-app
This ensures at least one pod remains running during scaling events.
Monitor and Alert on Scaling Events
Track autoscaling activity with observability tools like Prometheus, Grafana, or cloud-native monitoring. Create dashboards showing:
- Number of replicas over time
- Node count and utilization
- HPA target vs. actual utilization
- Cluster Autoscaler scale-up/scale-down events
Set alerts for:
- HPA reaching max replicas for more than 10 minutes (a sample Prometheus rule for this follows the list)
- Cluster Autoscaler unable to provision nodes
- Pods pending for more than 5 minutes
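The first alert can be expressed as a Prometheus rule; a sketch assuming kube-state-metrics is installed (its metric names vary across versions, so verify them against your deployment):

groups:
  - name: autoscaling-alerts
    rules:
      - alert: HPAAtMaxReplicas
        expr: kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at max replicas for 10 minutes"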
Test Scaling Under Realistic Load
Don’t rely on synthetic benchmarks. Use load testing tools like Locust, k6, or Artillery to simulate real user behavior. Test during peak hours, after deployments, and during traffic spikes.
Document your scaling thresholds and response times. This becomes part of your system’s SLA and incident response playbook.
Use Cost Optimization Tools
Autoscaling reduces waste, but further savings come from:
- Using spot/preemptible instances for non-critical workloads
- Enabling node auto-provisioning (GKE) or node pool auto-scaling (EKS)
- Applying resource quotas and limits at the namespace level (see the sketch after this list)
- Using tools like Kubecost or Prometheus + Grafana for cost attribution
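A minimal ResourceQuota sketch that caps a namespace's total requests and limits (the namespace and values are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi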
Tools and Resources
Core Kubernetes Tools
- Kubernetes Metrics Server – Required for HPA and VPA. Collects resource usage data from kubelets.
- Horizontal Pod Autoscaler (HPA) – Built-in Kubernetes component for replica scaling.
- Vertical Pod Autoscaler (VPA) – Official Kubernetes project for adjusting pod resources.
- Cluster Autoscaler – Official project for adding/removing nodes based on scheduling constraints.
Advanced Autoscaling Tools
- KEDA (Kubernetes Event-Driven Autoscaling) – Enables scaling based on event sources like Kafka, RabbitMQ, Azure Queues, or Prometheus metrics. Ideal for event-driven architectures.
- Prometheus + Prometheus Adapter – Exposes custom metrics to HPA. Essential for application-specific scaling rules.
- OpenKruise – Alibaba’s extended Kubernetes controller suite, offering advanced autoscaling and workload management features.
- Argo Rollouts – Combines canary deployments with autoscaling for intelligent traffic shifting during scaling events.
Monitoring and Observability
- Prometheus – Open-source monitoring and alerting toolkit.
- Grafana – Visualization platform for metrics dashboards.
- Kubecost – Cost monitoring and optimization for Kubernetes clusters.
- Datadog / New Relic / Dynatrace – Commercial APM tools with Kubernetes integration.
Cloud Provider Resources
- AWS EKS – Cluster Autoscaler Documentation
- GCP GKE – Autoscaling in GKE
- Azure AKS – AKS Cluster Autoscaler Guide
Learning Resources
- “Kubernetes in Action” by Marko Luksa – Comprehensive guide covering autoscaling in depth.
- Kubernetes.io Official Docs – HPA, VPA, CA
- KEDA Documentation – keda.sh/docs
Real Examples
Example 1: E-commerce Website on AWS EKS
An e-commerce platform experiences traffic spikes during Black Friday sales. The frontend is served by a Deployment with 3 replicas. The HPA is configured to scale between 3 and 50 replicas based on CPU usage above 65%.
During a sale, traffic surges. HPA scales to 48 replicas within 90 seconds. However, the existing nodes are at 90% capacity. Cluster Autoscaler detects unschedulable pods and provisions 4 new m5.large instances from the ASG. After the sale ends, traffic drops. HPA scales back to 3 replicas. Cluster Autoscaler waits 15 minutes, then terminates the 4 extra nodes, saving $120 in cloud costs.
Custom metrics from CloudWatch (requests per minute) are fed into Prometheus via the CloudWatch Exporter. A second HPA triggers scaling if the API error rate exceeds 5%, ensuring user experience is maintained even if CPU is not overloaded.
Example 2: Data Processing Pipeline on GKE
A data ingestion pipeline processes incoming sensor data from IoT devices. Each job is a pod that reads from a Pub/Sub topic. The workload is highly variable—sometimes 10 jobs per hour, sometimes 500.
Instead of using HPA with CPU metrics, KEDA is configured to scale based on Pub/Sub backlog. When messages accumulate, KEDA triggers pod creation. When the backlog drops below 100, pods are terminated.
VPA is applied to optimize memory usage. Each pod requests 512Mi and is limited to 2Gi. VPA recommends 1Gi after analyzing 7 days of data. The update mode is switched to Auto.
Cluster Autoscaler uses a node pool of n1-standard-4 instances with autoscaling from 2 to 20 nodes. During peak hours, 18 nodes are provisioned. At night, only 2 remain. Monthly savings exceed $2,000.
Example 3: On-Premises Kubernetes with Mixed Workloads
A financial institution runs Kubernetes on bare-metal servers with limited hardware. They use HPA for web services and VPA for batch jobs. Cluster Autoscaler is replaced with a custom script that triggers VM provisioning via Ansible when node capacity is exceeded.
They use resource quotas to prevent any namespace from consuming more than 40% of cluster capacity. This prevents one team’s workload from starving others.
Monitoring is done with Prometheus and Alertmanager. Alerts trigger Slack notifications when HPA reaches max replicas or when VPA recommendations change by more than 50%.
FAQs
Can I use HPA and VPA together?
Yes, but with caution. HPA scales replicas; VPA changes resource requests. If VPA increases a pod’s memory request beyond the node’s capacity, the pod may become unschedulable. Always validate VPA recommendations before enabling Auto mode, and ensure Cluster Autoscaler is active to handle node provisioning.
Does autoscaling work with StatefulSets?
Yes, HPA and VPA both support StatefulSets. However, Cluster Autoscaler only helps if the StatefulSet’s pods require more resources than available nodes. StatefulSets with persistent storage must ensure new nodes can mount volumes—use node affinity or storage classes compatible with dynamic provisioning.
Why isn’t my HPA scaling up even though CPU is high?
Check these common causes: (1) Resource requests are not defined in the pod spec, (2) Metrics Server is not running or unreachable, (3) The HPA target utilization is set too high (e.g., 95%), (4) The pod is in a CrashLoopBackOff state, (5) The HPA is misconfigured with incorrect target resource name.
How long does Cluster Autoscaler take to add a node?
Typically 1–5 minutes, depending on cloud provider provisioning speed. AWS EKS may take longer if ASG launch templates require image builds. Use node pools with pre-warmed AMIs or container-optimized OS to reduce latency.
Is autoscaling expensive?
No—it reduces costs by eliminating over-provisioning. Well-tuned autoscaling is commonly reported to cut cloud infrastructure costs by 30–60% compared to statically provisioned clusters. The key is combining HPA, VPA, and CA to match supply with demand precisely.
Can I autoscale based on memory usage?
Yes. HPA supports memory-based scaling. Define a metric with type: Resource and name: memory. Use averageUtilization or averageValue to set thresholds. Memory scaling is less common than CPU because memory is harder to reclaim, but it’s essential for memory-intensive applications like databases or caches.
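The metrics entry mirrors the CPU example from earlier, with memory substituted (the threshold is illustrative):

metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75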
What happens if Cluster Autoscaler can’t find a suitable node type?
If no node type in the pool can satisfy a pod’s resource request, the pod remains unschedulable, and Cluster Autoscaler logs a warning. Ensure your node pools include a range of instance types (e.g., small, medium, large) and consider using node affinity or taints to direct workloads appropriately.
Should I use autoscaling for databases?
Generally, no. Databases like PostgreSQL or MySQL are stateful and don’t scale horizontally well. Use vertical scaling (VPA) cautiously, and only if the database supports live resizing. Prefer dedicated, sized instances with replication for high availability.
Conclusion
Autoscaling in Kubernetes is not a single feature—it’s a system of coordinated components that work together to ensure applications are always performing optimally while minimizing resource waste. By mastering Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and Cluster Autoscaler, you gain the ability to build systems that respond intelligently to real-world traffic patterns, from quiet nights to viral product launches.
The key to success lies in thoughtful configuration: define accurate resource requests, validate scaling triggers, integrate observability, and test under realistic conditions. Avoid the temptation to enable autoscaling without understanding its implications. Use custom metrics wisely, combine tools like KEDA and Prometheus for advanced scenarios, and always monitor the outcomes.
As Kubernetes continues to dominate cloud-native infrastructure, the ability to autoscale effectively will separate reactive teams from proactive, resilient engineering organizations. Start small—enable HPA on one deployment. Observe, measure, refine. Then expand to VPA and Cluster Autoscaler. With each layer you add, your system becomes more intelligent, more efficient, and more capable of handling the unpredictable nature of modern applications.