How to Scale Elasticsearch Nodes
Elasticsearch is a powerful, distributed search and analytics engine built on Apache Lucene. Its ability to handle massive volumes of data in near real-time makes it a cornerstone of modern search applications, logging systems, and business intelligence platforms. However, as data volume, query complexity, and user demand grow, a single-node or small cluster can quickly become a bottleneck. Scaling Elasticsearch nodes is not merely about adding more hardware—it’s a strategic process that involves architectural planning, resource allocation, performance tuning, and operational discipline.
Scaling Elasticsearch nodes effectively ensures high availability, low latency, fault tolerance, and sustained performance under load. Whether you’re managing a growing e-commerce product catalog, processing millions of log events per minute, or serving real-time analytics dashboards, understanding how to scale your Elasticsearch cluster is critical to maintaining reliability and user satisfaction.
This comprehensive guide walks you through the entire process of scaling Elasticsearch nodes—from foundational concepts to advanced configurations, best practices, real-world examples, and essential tools. By the end, you’ll have a clear, actionable roadmap to expand your cluster efficiently and avoid common pitfalls that lead to degraded performance or system instability.
Step-by-Step Guide
1. Assess Your Current Cluster Health
Before adding or reconfiguring nodes, you must understand your current state. Use Elasticsearch’s built-in monitoring tools to evaluate performance bottlenecks and resource utilization.
Run the following API calls to gather essential metrics:
- GET /_cluster/health – Check cluster status (green, yellow, red), number of nodes, and unassigned shards.
- GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,started_shards,store.size – View per-node resource usage.
- GET /_cat/shards?v – Identify oversized shards, unbalanced distributions, or too many shards per node.
- GET /_cat/allocation?v – See how shards are distributed across nodes and whether disk usage is uneven.
Look for red flags such as:
- High heap usage (>80%) on multiple nodes
- Excessive GC activity (check logs for frequent Full GC events)
- High CPU load (>70% sustained)
- Unassigned shards indicating allocation failures
- Nodes with significantly more shards than others
Use Kibana’s Stack Monitoring (if available) to visualize trends over time. Identify whether the bottleneck is CPU-bound, memory-bound, disk I/O-bound, or network-bound. This assessment determines whether you need to scale vertically (upgrading existing nodes) or horizontally (adding more nodes).
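If the numbers point to a CPU-bound cluster, two quick read-only checks can help confirm where the time is going. The thread pools and thread count below are only examples of what to look at:
GET /_nodes/hot_threads?threads=3
GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected
High active and queue counts (or any rejections) on the write or search pools usually point to saturated data nodes rather than a network or disk problem.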
2. Define Your Scaling Goals
Scaling without clear objectives leads to over-provisioning or under-performance. Define measurable goals based on your use case:
- Throughput: Increase indexing rate from 5,000 to 20,000 documents per second.
- Latency: Reduce average search response time from 800ms to under 200ms.
- Availability: Achieve 99.95% uptime with no single point of failure.
- Storage: Support 50TB of data with 30-day retention.
Align these goals with business KPIs. For example, if your search results impact conversion rates, latency improvements directly affect revenue. Document these targets so you can validate success after scaling.
3. Choose Your Scaling Strategy: Horizontal vs. Vertical
Elasticsearch scales primarily through horizontal expansion—adding more nodes—rather than vertical scaling (upgrading existing nodes). While vertical scaling can help temporarily, it has physical and economic limits.
Horizontal Scaling (Recommended):
- Add more nodes to distribute load and shards.
- Improves fault tolerance—failure of one node doesn’t bring down the cluster.
- Allows independent scaling of data, query, and ingest roles.
- More cost-effective at scale due to commodity hardware.
Vertical Scaling (Limited Use):
- Upgrade CPU, RAM, or disk on existing nodes.
- Only viable if nodes are under-resourced (e.g., 8GB heap on a 64GB machine).
- Risk: Larger heaps increase GC pressure and pause times.
- Single point of failure remains.
Best practice: Use horizontal scaling as your primary strategy. Reserve vertical scaling for short-term fixes or when hardware constraints prevent adding nodes.
4. Plan Your Node Roles
Elasticsearch 7.9 and later use the explicit node.roles setting to assign dedicated node roles, which improves cluster stability and performance. Assign roles explicitly to avoid resource contention.
Define three core node types:
Data Nodes
Handle indexing and search requests. Store shards. These are your workhorses.
- Allocate sufficient RAM (keep heap size ≤ 31GB so compressed object pointers stay enabled).
- Use fast SSDs with high IOPS.
- Set node.roles (e.g., [data_hot], [data_warm], or [data_cold]) according to your data tiering strategy.
Ingest Nodes
Process data before indexing using ingest pipelines (e.g., parsing logs, enriching fields, removing sensitive data).
- Can be separate from data nodes to offload CPU-intensive preprocessing.
- Use moderate RAM (8–16GB heap).
- Set node.roles: [ingest].
Master-Eligible Nodes
Manage cluster state and coordination. Critical for stability.
- Minimum of 3 nodes for quorum (avoid single master).
- Low resource requirements (4–8GB heap).
- Set node.roles: [master].
- Do not assign data or ingest roles to these nodes.
Optional: Use coordinating nodes (also called client nodes) to handle client requests and distribute queries. Useful in large clusters to reduce load on data nodes. Set node.roles: [] to make them pure coordinators.
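For reference, here is a minimal elasticsearch.yml sketch of how these roles might be declared on each node type; the node names and tier choices are illustrative, not prescriptive:
# Hot-tier data node
node.name: data-hot-01
node.roles: [data_hot]

# Dedicated ingest node
node.name: ingest-01
node.roles: [ingest]

# Dedicated master-eligible node
node.name: master-01
node.roles: [master]

# Coordinating-only node
node.name: coord-01
node.roles: []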
5. Configure Shard Allocation and Index Design
Shards are the building blocks of Elasticsearch’s scalability. Poor shard design is the number one cause of scaling failures.
Shard Size
Keep shard sizes between 10GB and 50GB. Smaller shards increase overhead; larger shards slow recovery and rebalancing.
Use the following formula to estimate shards per index:
Shards = Total Index Size (GB) / Target Shard Size (GB)
Example: 200GB index → 200 / 25 = 8 shards
Shard Count
Do not over-shard. More than 1,000 shards per node can degrade performance. Aim for fewer than 20 shards per GB of heap.
Index Lifecycle Management (ILM)
Automate shard allocation and rollover using ILM policies:
- Hot phase: High-performance nodes, frequent writes.
- Warm phase: Lower-cost nodes, infrequent queries.
- Cold phase: Archive to slower storage.
- Delete phase: Remove old indices automatically.
Example ILM policy:
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "include": {
              "data": "warm"
            }
          }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "allocate": {
            "include": {
              "data": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
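To put the policy into effect, it would typically be stored under a name (for example with PUT _ilm/policy/logs-policy using the body above) and then referenced from an index template so new indices pick it up automatically. A minimal sketch, where the template name logs-template, the pattern logs-*, and the rollover alias logs are illustrative assumptions:
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}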
6. Add New Nodes to the Cluster
Once your plan is ready, begin adding nodes:
- Provision new servers with identical OS, Java version, and Elasticsearch version.
- Install Elasticsearch and configure elasticsearch.yml with the correct settings:
cluster.name: my-production-cluster
node.name: node-04
node.roles: [data_hot]
network.host: 0.0.0.0
discovery.seed_hosts: ["node-01", "node-02", "node-03"]
cluster.initial_master_nodes: ["node-01", "node-02", "node-03"]
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
xpack.security.enabled: false
Ensure discovery.seed_hosts includes all master-eligible nodes. Do not include the new node in cluster.initial_master_nodes unless it’s also master-eligible.
- Start Elasticsearch on the new node: systemctl start elasticsearch
- Verify the node joined: GET /_cat/nodes
- Monitor shard rebalancing: GET /_cluster/allocation/explain
Shards will automatically redistribute across the cluster. This may take minutes to hours depending on data size and network bandwidth. Monitor the cluster health and avoid adding multiple nodes simultaneously.
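While shards relocate, in-flight recoveries can be watched directly; for example:
GET /_cat/recovery?v&active_only=true
Each row shows the source and target node, the recovery stage, and the percentage of files and bytes transferred so far.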
7. Optimize for Rebalancing
When new nodes join, Elasticsearch triggers shard rebalancing. This can strain network and disk I/O.
To control rebalancing speed:
- Set cluster.routing.allocation.cluster_concurrent_rebalance: 2 (the default is 2; increase to 4–6 on high-bandwidth networks).
- Limit per-node relocations: cluster.routing.allocation.node_concurrent_recoveries: 3
- Throttle recovery if the network is saturated: indices.recovery.max_bytes_per_sec: "100mb"
Use the PUT /_cluster/settings API to adjust these dynamically without restarts.
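For example, a dynamic update along the following lines (the values are illustrative and should match your hardware and network) takes effect immediately without restarting any node:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
    "cluster.routing.allocation.node_concurrent_recoveries": 3,
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}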
8. Scale Search and Indexing Throughput
After adding nodes, optimize query and ingestion performance:
For Indexing:
- Use the bulk API with optimal batch sizes (5–15MB per request).
- Disable the refresh interval during bulk loads: "refresh_interval": "-1"
- Set "number_of_replicas": 0 during initial ingestion, then raise it after the load completes (see the first sketch at the end of this step).
- Use ingest nodes to offload transformations.
For Searching:
- Use filter context instead of query context where possible (caches results).
- Limit the size and from parameters; use search_after for deep pagination (see the second sketch at the end of this step).
- Use index aliases to route queries to optimal indices.
- Cache frequently used aggregations by enabling the shard request cache (for example, ?request_cache=true on aggregation-only requests with "size": 0).
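Two brief sketches of the settings and query shapes described above. First, the bulk-load settings; the index name my-index is a placeholder:
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}
Once the load completes, restore them:
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}
Second, a hedged sketch of a paginated, filter-context search. The alias logs and the fields service.keyword and @timestamp are assumptions, and the search_after values are simply the sort values returned with the last hit of the previous page (omit search_after on the first page):
GET /logs/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "term": { "service.keyword": "checkout" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "asc" },
    { "_doc": "asc" }
  ],
  "search_after": [1700000000000, 12345]
}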
9. Validate Performance After Scaling
After scaling, run load tests to confirm improvements:
- Use esrally (Elasticsearch Rally) to simulate real-world workloads (see the sample command after this list).
- Compare pre- and post-scaling metrics: indexing rate, query latency, CPU/memory usage.
- Test failure scenarios: Kill a data node and observe recovery time.
- Verify shard distribution is balanced:
GET /_cat/shards?v&h=index,shard,prirep,state,docs,store,node
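A hedged example of such a Rally run, assuming Rally is installed and the built-in http_logs track is a reasonable stand-in for your workload; point --target-hosts at your own nodes:
esrally race --track=http_logs --pipeline=benchmark-only --target-hosts=es-node-01:9200,es-node-02:9200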
If performance hasn’t improved, investigate:
- Over-sharding
- Insufficient heap on data nodes
- Network latency between nodes
- Slow disk I/O (check with iostat)
Best Practices
1. Avoid the “Monolith Cluster” Trap
Do not run all roles (master, data, ingest, coordinating) on every node. Dedicated roles improve stability. A single node handling all functions becomes a bottleneck and a single point of failure.
2. Maintain an Odd Number of Master-Eligible Nodes
Use 3, 5, or 7 master-eligible nodes to ensure quorum during network partitions. Never use 2—split-brain risk increases. A 3-node master quorum allows one node to fail without disruption.
3. Never Exceed 32GB Heap Size
Elasticsearch relies on Java’s compressed object pointers (compressed oops). Above roughly 32GB of heap, compressed oops are disabled, so object references take more space and the larger heap holds proportionally less data. Use a 26–30GB heap for optimal performance, set via -Xms26g -Xmx26g in jvm.options.
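On recent versions this is usually done with a small override file under jvm.options.d rather than editing jvm.options itself; the filename heap.options below is arbitrary:
# /etc/elasticsearch/jvm.options.d/heap.options
-Xms26g
-Xmx26g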
4. Use SSDs for Data Nodes
Hard drives are too slow for modern Elasticsearch workloads. SSDs reduce shard recovery time, improve search latency, and increase indexing throughput. NVMe drives offer further gains for high-throughput scenarios.
5. Monitor Heap Usage and GC
Heap pressure is the leading cause of node crashes. Set up alerts for:
- Heap usage > 80%
- GC duration > 10 seconds
- Old-generation (full) GC occurring more than once per minute
Use tools like Prometheus + Grafana or Elasticsearch’s built-in monitoring to track these metrics.
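For a quick manual check, the node stats API exposes the same numbers these alerts are built on; for example:
GET /_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors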
6. Limit Indexes per Node
Each index consumes metadata memory. Avoid creating thousands of small indices. Use ILM and rollover to consolidate data into fewer, larger indices.
7. Enable Index Sorting for Time-Series Data
For log and metrics data, sort segments by timestamp to speed up range queries. Index sorting must be configured when the index is created, and the sort field must exist in the mappings:
"settings": {
  "index.sort.field": "@timestamp",
  "index.sort.order": "desc"
}
This enables faster searches and reduces disk I/O by co-locating related data.
8. Use Index Aliases for Zero-Downtime Operations
Always query via aliases, not direct index names. This allows you to:
- Rollover to new indices without changing application code.
- Reindex data without downtime.
- Route queries to specific data tiers.
Example:
POST /_aliases
{
  "actions": [
    { "add": { "index": "logs-000001", "alias": "logs", "is_write_index": true } }
  ]
}
9. Secure Your Cluster
Even in internal networks, enable TLS encryption and role-based access control (RBAC) via X-Pack Security. Use certificates for node-to-node communication and API keys for applications.
10. Test Scaling in Staging First
Never scale production without testing the exact configuration in a staging environment that mirrors production traffic, data volume, and network topology.
Tools and Resources
Elasticsearch Built-in Tools
- Kibana Stack Monitoring – Real-time cluster health, node metrics, and slow log analysis.
- Elasticsearch API – /_cat, /_cluster/health, /_nodes/stats for diagnostics.
- Index Lifecycle Management (ILM) – Automate index rollover, tiering, and deletion.
- Index Templates – Enforce consistent settings and mappings across new indices.
Third-Party Monitoring Tools
- Prometheus + Elasticsearch Exporter – Collect metrics for alerting and dashboards.
- Grafana – Visualize cluster performance with pre-built Elasticsearch dashboards.
- Datadog – Full-stack observability with Elasticsearch integration.
- New Relic – Application performance monitoring with Elasticsearch tracing.
Load Testing Tools
- Elasticsearch Rally – Official benchmarking tool for simulating production workloads.
- JMeter – Custom HTTP-based search and indexing load tests.
- Locust – Python-based distributed load testing framework.
Automation and Infrastructure Tools
- Terraform – Provision Elasticsearch nodes on AWS, GCP, or Azure.
- Ansible – Configure Elasticsearch nodes consistently across environments.
- Docker & Kubernetes – Run Elasticsearch in containers using Helm charts or custom operators.
- Elastic Cloud – Managed Elasticsearch service by Elastic—ideal for teams without dedicated DevOps.
Documentation and Learning Resources
- Elasticsearch Official Documentation
- Elastic Blog – Real-world scaling case studies.
- Elastic YouTube Channel – Tutorials and webinars.
- “Elasticsearch: The Definitive Guide” – Book by Clinton Gormley and Zachary Tong.
Real Examples
Example 1: E-Commerce Search Scaling
A retail company had a 5-node Elasticsearch cluster handling 1.2 million product SKUs. Search latency exceeded 1.2 seconds during peak hours (Black Friday).
Diagnosis:
- 5 data nodes with 32GB heap each, all running master and ingest roles.
- 1 index with 120 shards (avg. 10GB each).
- Heavy use of expensive aggregations on product categories.
Solution:
- Added 3 new data-only nodes (64GB RAM, SSDs).
- Reduced shards from 120 to 48 (target: 25GB/shard).
- Created dedicated ingest nodes to handle product enrichment pipelines.
- Added 3 master-eligible nodes (previously only 2).
- Implemented ILM: hot (30d), warm (90d), cold (1y).
- Used index sorting on product_id and cached frequent category aggregations.
Result: Latency dropped to 180ms. Indexing throughput increased by 300%. Cluster remained stable during 200% traffic spikes.
Example 2: Log Aggregation for 500+ Microservices
A fintech startup ingested 8TB of logs daily from 500+ microservices. Nodes were crashing daily due to GC pressure.
Diagnosis:
- 4 nodes, 32GB heap, 80% heap usage.
- 200+ daily indices, each with 5 shards.
- No ILM; indices never deleted.
- Single node handling all ingest and data roles.
Solution:
- Added 6 new data nodes (26GB heap, SSDs).
- Created dedicated ingest nodes with 16GB heap.
- Implemented ILM: delete logs older than 30 days.
- Reduced shard count to 10 per daily index (50GB target).
- Used index aliases and rollover to automate daily index creation.
- Enabled index sorting by @timestamp for faster time-range queries.
Result: GC pauses reduced from 45 seconds to under 2 seconds. Cluster uptime improved from 92% to 99.9%. Storage costs dropped 40% due to automated deletion.
Example 3: Global Real-Time Analytics Platform
A SaaS company needed sub-100ms search latency across 12 regions with 200M+ documents.
Challenge: Network latency between regions made centralized clustering ineffective.
Solution:
- Deployed 3 independent clusters (North America, Europe, Asia-Pacific).
- Each cluster: 5 data nodes, 3 master nodes, 2 coordinating nodes.
- Used Elasticsearch Cross-Cluster Search (CCS) to federate queries from a global gateway.
- Replicated only aggregated metrics (not raw data) between clusters for global dashboards.
- Used geo-aware routing to direct users to nearest cluster.
Result: Global median latency reduced from 850ms to 85ms. Regional failures had no global impact.
FAQs
How many nodes do I need for Elasticsearch?
There’s no fixed number. Start with 3 master-eligible nodes and 2–3 data nodes for small workloads. Scale horizontally as data or query volume grows. A typical production cluster has 5–15 nodes. Large enterprises run clusters with 50+ nodes.
Can I scale Elasticsearch without downtime?
Yes. Add new nodes one at a time while the cluster remains operational. Use index aliases to reroute queries during reindexing. Avoid restarting multiple nodes simultaneously.
What happens if I add too many shards?
Too many shards increase memory usage (each shard consumes ~10–20MB of heap), slow cluster state updates, and increase overhead during recovery. Keep shard count under 1,000 per node and aim for 10–50GB per shard.
Should I use SSDs or HDDs for Elasticsearch?
Always use SSDs for data nodes. HDDs cause unacceptable latency for search and recovery. For archival cold data, consider hybrid storage (SSD for hot, HDD for cold), but never use HDDs for active shards.
How do I know if I need more RAM or more nodes?
If heap usage is consistently above 80% and GC is frequent, you need more RAM or more nodes. If CPU or disk I/O is saturated, add more nodes to distribute load. If network bandwidth is maxed, add nodes with higher network capacity.
Can I downgrade Elasticsearch nodes after scaling up?
Technically yes, but it’s risky. Removing nodes triggers shard relocation, which can overload remaining nodes. Only remove nodes if you’ve reduced data volume or consolidated indices. Always plan removals during low-traffic windows.
Is Kubernetes a good platform for scaling Elasticsearch?
Yes, but with caution. Kubernetes simplifies deployment and scaling but introduces complexity in managing persistent storage, network policies, and stateful workloads. Use the official Elastic Helm chart or a certified operator like Elastic Cloud on Kubernetes (ECK).
How often should I rebalance shards?
Elasticsearch rebalances automatically. You should not manually trigger rebalancing unless a node fails or is removed. Monitor shard distribution weekly to ensure balance.
What’s the impact of replication on scaling?
Replicas improve availability and search throughput but double storage requirements. For high-availability clusters, use 1 replica. For non-critical data, use 0 replicas during ingestion. Never use more than 2 replicas—it increases overhead without meaningful benefit.
How do I handle node failures during scaling?
Ensure you have at least 3 master-eligible nodes. If a data node fails, Elasticsearch automatically reallocates its shards to other nodes. Monitor cluster.health and _cat/shards to confirm recovery. Avoid adding new nodes during a failure event.
Conclusion
Scaling Elasticsearch nodes is a strategic, multi-phase process that requires careful planning, continuous monitoring, and adherence to best practices. It’s not simply about adding hardware—it’s about designing a resilient, performant, and maintainable architecture that evolves with your data and user demands.
By following the step-by-step guide in this tutorial, you’ve learned how to assess your cluster, define clear goals, assign dedicated roles, optimize shard allocation, add nodes safely, and validate performance. You’ve explored real-world examples that demonstrate how companies have overcome scaling challenges—and you now understand the tools, pitfalls, and strategies that separate successful clusters from failing ones.
Remember: The most scalable Elasticsearch clusters are those designed with simplicity, observability, and automation in mind. Avoid over-engineering. Monitor relentlessly. Test everything. And never underestimate the power of proper shard sizing and index lifecycle management.
As your data grows, so should your understanding of Elasticsearch internals. Stay updated with Elastic’s releases, participate in the community, and continuously refine your scaling strategy. With the right approach, your Elasticsearch cluster won’t just handle growth—it will thrive on it.