How to Scale Elasticsearch Nodes
Elasticsearch is a powerful, distributed search and analytics engine built on Apache Lucene. Its ability to handle massive volumes of data in near real-time makes it a cornerstone of modern search applications, logging systems, and business intelligence platforms. However, as data volume, query complexity, and user demand grow, a single-node or small cluster can quickly become a bottleneck. Scaling Elasticsearch nodes is not merely about adding more hardware—it’s a strategic process that involves architectural planning, resource allocation, performance tuning, and operational discipline.
Scaling Elasticsearch nodes effectively ensures high availability, low latency, fault tolerance, and sustained performance under load. Whether you’re managing a growing e-commerce product catalog, processing millions of log events per minute, or serving real-time analytics dashboards, understanding how to scale your Elasticsearch cluster is critical to maintaining reliability and user satisfaction.
This comprehensive guide walks you through the entire process of scaling Elasticsearch nodes—from foundational concepts to advanced configurations, best practices, real-world examples, and essential tools. By the end, you’ll have a clear, actionable roadmap to expand your cluster efficiently and avoid common pitfalls that lead to degraded performance or system instability.
Step-by-Step Guide
1. Assess Your Current Cluster Health
Before adding or reconfiguring nodes, you must understand your current state. Use Elasticsearch’s built-in monitoring tools to evaluate performance bottlenecks and resource utilization.
Run the following API calls to gather essential metrics:
- GET /_cluster/health – Check cluster status (green, yellow, red), number of nodes, and unassigned shards.
- GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,started_shards,store.size – View per-node resource usage.
- GET /_cat/shards?v – Identify oversized shards, unbalanced distributions, or too many shards per node.
- GET /_cat/allocation?v – See how shards are distributed across nodes and whether disk usage is uneven.
Look for red flags such as:
- High heap usage (>80%) on multiple nodes
- Excessive GC activity (check logs for frequent Full GC events)
- High CPU load (>70% sustained)
- Unassigned shards indicating allocation failures
- Nodes with significantly more shards than others
Use Kibana’s Stack Monitoring (if available) to visualize trends over time. Identify whether the bottleneck is CPU-bound, memory-bound, disk I/O-bound, or network-bound. This assessment determines whether you need to scale vertically (upgrading existing nodes) or horizontally (adding more nodes).
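If the numbers point to a CPU-bound cluster, two quick read-only checks can help confirm where the time is going. The thread pools and thread count below are only examples of what to look at:
GET /_nodes/hot_threads?threads=3
GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected
High active and queue counts (or any rejections) on the write or search pools usually point to saturated data nodes rather than a network or disk problem.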
2. Define Your Scaling Goals
Scaling without clear objectives leads to over-provisioning or under-performance. Define measurable goals based on your use case:
- Throughput: Increase indexing rate from 5,000 to 20,000 documents per second.
- Latency: Reduce average search response time from 800ms to under 200ms.
- Availability: Achieve 99.95% uptime with no single point of failure.
- Storage: Support 50TB of data with 30-day retention.
Align these goals with business KPIs. For example, if your search results impact conversion rates, latency improvements directly affect revenue. Document these targets so you can validate success after scaling.
3. Choose Your Scaling Strategy: Horizontal vs. Vertical
Elasticsearch scales primarily through horizontal expansion—adding more nodes—rather than vertical scaling (upgrading existing nodes). While vertical scaling can help temporarily, it has physical and economic limits.
Horizontal Scaling (Recommended):
- Add more nodes to distribute load and shards.
- Improves fault tolerance—failure of one node doesn’t bring down the cluster.
- Allows independent scaling of data, query, and ingest roles.
- More cost-effective at scale due to commodity hardware.
Vertical Scaling (Limited Use):
- Upgrade CPU, RAM, or disk on existing nodes.
- Only viable if nodes are under-resourced (e.g., 8GB heap on a 64GB machine).
- Risk: Larger heaps increase GC pressure and pause times.
- Single point of failure remains.
Best practice: Use horizontal scaling as your primary strategy. Reserve vertical scaling for short-term fixes or when hardware constraints prevent adding nodes.
4. Plan Your Node Roles
Elasticsearch 7.9 and later use the explicit node.roles setting to assign dedicated node roles, which improves cluster stability and performance. Assign roles explicitly to avoid resource contention.
Define three core node types:
Data Nodes
Handle indexing and search requests. Store shards. These are your workhorses.
- Allocate sufficient RAM (keep heap size ≤ 31GB so compressed object pointers stay enabled).
- Use fast SSDs with high IOPS.
- Set node.roles (e.g., [data_hot], [data_warm], or [data_cold]) according to your data tiering strategy.
Ingest Nodes
Process data before indexing using ingest pipelines (e.g., parsing logs, enriching fields, removing sensitive data).
- Can be separate from data nodes to offload CPU-intensive preprocessing.
- Use moderate RAM (8–16GB heap).
- Set node.roles: [ingest].
Master-Eligible Nodes
Manage cluster state and coordination. Critical for stability.
- Minimum of 3 nodes for quorum (avoid single master).
- Low resource requirements (4–8GB heap).
- Set node.roles: [master].
- Do not assign data or ingest roles to these nodes.
Optional: Use coordinating nodes (also called client nodes) to handle client requests and distribute queries. Useful in large clusters to reduce load on data nodes. Set node.roles: [] to make them pure coordinators.
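For reference, here is a minimal elasticsearch.yml sketch of how these roles might be declared on each node type; the node names and tier choices are illustrative, not prescriptive:
# Hot-tier data node
node.name: data-hot-01
node.roles: [data_hot]

# Dedicated ingest node
node.name: ingest-01
node.roles: [ingest]

# Dedicated master-eligible node
node.name: master-01
node.roles: [master]

# Coordinating-only node
node.name: coord-01
node.roles: []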
5. Configure Shard Allocation and Index Design
Shards are the building blocks of Elasticsearch’s scalability. Poor shard design is the number one cause of scaling failures.
Shard Size
Keep shard sizes between 10GB and 50GB. Smaller shards increase overhead; larger shards slow recovery and rebalancing.
Use the following formula to estimate shards per index:
Shards = Total Index Size (GB) / Target Shard Size (GB)
Example: 200GB index → 200 / 25 = 8 shards
Shard Count
Do not over-shard. More than 1,000 shards per node can degrade performance. Aim for fewer than 20 shards per GB of heap.
Index Lifecycle Management (ILM)
Automate shard allocation and rollover using ILM policies:
- Hot phase: High-performance nodes, frequent writes.
- Warm phase: Lower-cost nodes, infrequent queries.
- Cold phase: Archive to slower storage.
- Delete phase: Remove old indices automatically.
Example ILM policy:
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "include": {
              "data": "warm"
            }
          }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "allocate": {
            "include": {
              "data": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
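To put the policy into effect, it would typically be stored under a name (for example with PUT _ilm/policy/logs-policy using the body above) and then referenced from an index template so new indices pick it up automatically. A minimal sketch, where the template name logs-template, the pattern logs-*, and the rollover alias logs are illustrative assumptions:
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}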
6. Add New Nodes to the Cluster
Once your plan is ready, begin adding nodes:
- Provision new servers with identical OS, Java version, and Elasticsearch version.
- Install Elasticsearch and configure elasticsearch.yml with the correct settings:
cluster.name: my-production-cluster
node.name: node-04
node.roles: [data_hot]
network.host: 0.0.0.0
discovery.seed_hosts: ["node-01", "node-02", "node-03"]
cluster.initial_master_nodes: ["node-01", "node-02", "node-03"]
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
xpack.security.enabled: false
Ensure discovery.seed_hosts includes all master-eligible nodes. Do not include the new node in cluster.initial_master_nodes unless it’s also master-eligible.
- Start Elasticsearch on the new node: systemctl start elasticsearch
- Verify the node joined: GET /_cat/nodes
- Monitor shard rebalancing: GET /_cluster/allocation/explain
Shards will automatically redistribute across the cluster. This may take minutes to hours depending on data size and network bandwidth. Monitor the cluster health and avoid adding multiple nodes simultaneously.
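While shards relocate, in-flight recoveries can be watched directly; for example:
GET /_cat/recovery?v&active_only=true
Each row shows the source and target node, the recovery stage, and the percentage of files and bytes transferred so far.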
7. Optimize for Rebalancing
When new nodes join, Elasticsearch triggers shard rebalancing. This can strain network and disk I/O.
To control rebalancing speed:
- Set cluster.routing.allocation.cluster_concurrent_rebalance: 2 (the default is 2; increase to 4–6 on high-bandwidth networks).
- Limit per-node relocations: cluster.routing.allocation.node_concurrent_recoveries: 3
- Throttle recovery if the network is saturated: indices.recovery.max_bytes_per_sec: "100mb"
Use the PUT /_cluster/settings API to adjust these dynamically without restarts.
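For example, a dynamic update along the following lines (the values are illustrative and should match your hardware and network) takes effect immediately without restarting any node:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
    "cluster.routing.allocation.node_concurrent_recoveries": 3,
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}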
8. Scale Search and Indexing Throughput
After adding nodes, optimize query and ingestion performance:
For Indexing:
- Use the bulk API with optimal batch sizes (5–15MB per request).
- Disable the refresh interval during bulk loads: "refresh_interval": "-1"
- Set "number_of_replicas": 0 during initial ingestion, then raise it after the load completes (see the first sketch at the end of this step).
- Use ingest nodes to offload transformations.
For Searching:
- Use filter context instead of query context where possible (caches results).
- Limit the size and from parameters; use search_after for deep pagination (see the second sketch at the end of this step).
- Use index aliases to route queries to optimal indices.
- Cache frequently used aggregations by enabling the shard request cache (for example, ?request_cache=true on aggregation-only requests with "size": 0).
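Two brief sketches of the settings and query shapes described above. First, the bulk-load settings; the index name my-index is a placeholder:
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}
Once the load completes, restore them:
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}
Second, a hedged sketch of a paginated, filter-context search. The alias logs and the fields service.keyword and @timestamp are assumptions, and the search_after values are simply the sort values returned with the last hit of the previous page (omit search_after on the first page):
GET /logs/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "term": { "service.keyword": "checkout" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "asc" },
    { "_doc": "asc" }
  ],
  "search_after": [1700000000000, 12345]
}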
9. Validate Performance After Scaling
After scaling, run load tests to confirm improvements:
- Use esrally (Elasticsearch Rally) to simulate real-world workloads (see the sample command after this list).
- Compare pre- and post-scaling metrics: indexing rate, query latency, CPU/memory usage.
- Test failure scenarios: Kill a data node and observe recovery time.
- Verify shard distribution is balanced:
GET /_cat/shards?v&h=index,shard,prirep,state,docs,store,node
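A hedged example of such a Rally run, assuming Rally is installed and the built-in http_logs track is a reasonable stand-in for your workload; point --target-hosts at your own nodes:
esrally race --track=http_logs --pipeline=benchmark-only --target-hosts=es-node-01:9200,es-node-02:9200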
If performance hasn’t improved, investigate:
- Over-sharding
- Insufficient heap on data nodes
- Network latency between nodes
- Slow disk I/O (check with iostat)
Best Practices
1. Avoid the “Monolith Cluster” Trap
Do not run all roles (master, data, ingest, coordinating) on every node. Dedicated roles improve stability. A single node handling all functions becomes a bottleneck and a single point of failure.
2. Maintain an Odd Number of Master-Eligible Nodes
Use 3, 5, or 7 master-eligible nodes to ensure quorum during network partitions. Never use 2—split-brain risk increases. A 3-node master quorum allows one node to fail without disruption.
3. Never Exceed 32GB Heap Size
Elasticsearch relies on Java’s compressed object pointers (compressed oops). Above roughly 32GB of heap, compressed oops are disabled, so object references take more space and the larger heap holds proportionally less data. Use a 26–30GB heap for optimal performance, set via -Xms26g -Xmx26g in jvm.options.
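On recent versions this is usually done with a small override file under jvm.options.d rather than editing jvm.options itself; the filename heap.options below is arbitrary:
# /etc/elasticsearch/jvm.options.d/heap.options
-Xms26g
-Xmx26g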
4. Use SSDs for Data Nodes
Hard drives are too slow for modern Elasticsearch workloads. SSDs reduce shard recovery time, improve search latency, and increase indexing throughput. NVMe drives offer further gains for high-throughput scenarios.
5. Monitor Heap Usage and GC
Heap pressure is the leading cause of node crashes. Set up alerts for:
- Heap usage > 80%
- GC duration > 10 seconds
- Old-generation (full) GC occurring more than once per minute
Use tools like Prometheus + Grafana or Elasticsearch’s built-in monitoring to track these metrics.
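For a quick manual check, the node stats API exposes the same numbers these alerts are built on; for example:
GET /_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors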
6. Limit Indexes per Node
Each index consumes metadata memory. Avoid creating thousands of small indices. Use ILM and rollover to consolidate data into fewer, larger indices.
7. Enable Index Sorting for Time-Series Data
For log and metrics data, sort segments by timestamp to speed up range queries. Index sorting must be configured when the index is created, and the sort field must exist in the mappings:
"settings": {
  "index.sort.field": "@timestamp",
  "index.sort.order": "desc"
}
This enables faster searches and reduces disk I/O by co-locating related data.
8. Use Index Aliases for Zero-Downtime Operations
Always query via aliases, not direct index names. This allows you to:
- Rollover to new indices without changing application code.
- Reindex data without downtime.
- Route queries to specific data tiers.
Example:
POST /_aliases
{
  "actions": [
    { "add": { "index": "logs-000001", "alias": "logs", "is_write_index": true } }
  ]
}
9. Secure Your Cluster
Even in internal networks, enable TLS encryption and role-based access control (RBAC) via X-Pack Security. Use certificates for node-to-node communication and API keys for applications.
10. Test Scaling in Staging First
Never scale production without testing the exact configuration in a staging environment that mirrors production traffic, data volume, and network topology.
Tools and Resources
Elasticsearch Built-in Tools
- Kibana Stack Monitoring – Real-time cluster health, node metrics, and slow log analysis.
- Elasticsearch API – /_cat, /_cluster/health, /_nodes/stats for diagnostics.
- Index Lifecycle Management (ILM) – Automate index rollover, tiering, and deletion.
- Index Templates – Enforce consistent settings and mappings across new indices.
Third-Party Monitoring Tools
- Prometheus + Elasticsearch Exporter – Collect metrics for alerting and dashboards.
- Grafana – Visualize cluster performance with pre-built Elasticsearch dashboards.
- Datadog – Full-stack observability with Elasticsearch integration.
- New Relic – Application performance monitoring with Elasticsearch tracing.
Load Testing Tools
- Elasticsearch Rally – Official benchmarking tool for simulating production workloads.
- JMeter – Custom HTTP-based search and indexing load tests.
- Locust – Python-based distributed load testing framework.
Automation and Infrastructure Tools
- Terraform – Provision Elasticsearch nodes on AWS, GCP, or Azure.
- Ansible – Configure Elasticsearch nodes consistently across environments.
- Docker & Kubernetes – Run Elasticsearch in containers using Helm charts or custom operators.
- Elastic Cloud – Managed Elasticsearch service by Elastic—ideal for teams without dedicated DevOps.
Documentation and Learning Resources
- Elasticsearch Official Documentation
- Elastic Blog – Real-world scaling case studies.
- Elastic YouTube Channel – Tutorials and webinars.
- “Elasticsearch: The Definitive Guide” – Book by Clinton Gormley and Zachary Tong.
Real Examples
Example 1: E-Commerce Search Scaling
A retail company had a 5-node Elasticsearch cluster handling 1.2 million product SKUs. Search latency exceeded 1.2 seconds during peak hours (Black Friday).
Diagnosis:
- 5 data nodes with 32GB heap each, all running master and ingest roles.
- 1 index with 120 shards (avg. 10GB each).
- Heavy use of expensive aggregations on product categories.
Solution:
- Added 3 new data-only nodes (64GB RAM, SSDs).
- Reduced shards from 120 to 48 (target: 25GB/shard).
- Created dedicated ingest nodes to handle product enrichment pipelines.
- Added 3 master-eligible nodes (previously only 2).
- Implemented ILM: hot (30d), warm (90d), cold (1y).
- Used index sorting on product_id and cached frequent category aggregations.
Result: Latency dropped to 180ms. Indexing throughput increased by 300%. Cluster remained stable during 200% traffic spikes.
Example 2: Log Aggregation for 500+ Microservices
A fintech startup ingested 8TB of logs daily from 500+ microservices. Nodes were crashing daily due to GC pressure.
Diagnosis:
- 4 nodes, 32GB heap, 80% heap usage.
- 200+ daily indices, each with 5 shards.
- No ILM; indices never deleted.
- Single node handling all ingest and data roles.
Solution:
- Added 6 new data nodes (26GB heap, SSDs).
- Created dedicated ingest nodes with 16GB heap.
- Implemented ILM: delete logs older than 30 days.
- Reduced shard count to 10 per daily index (50GB target).
- Used index aliases and rollover to automate daily index creation.
- Enabled index sorting by @timestamp for faster time-range queries.
Result: GC pauses reduced from 45 seconds to under 2 seconds. Cluster uptime improved from 92% to 99.9%. Storage costs dropped 40% due to automated deletion.
Example 3: Global Real-Time Analytics Platform
A SaaS company needed sub-100ms search latency across 12 regions with 200M+ documents.
Challenge: Network latency between regions made centralized clustering ineffective.
Solution:
- Deployed 3 independent clusters (North America, Europe, Asia-Pacific).
- Each cluster: 5 data nodes, 3 master nodes, 2 coordinating nodes.
- Used Elasticsearch Cross-Cluster Search (CCS) to federate queries from a global gateway.
- Replicated only aggregated metrics (not raw data) between clusters for global dashboards.
- Used geo-aware routing to direct users to nearest cluster.
Result: Global median latency reduced from 850ms to 85ms. Regional failures had no global impact.
FAQs
How many nodes do I need for Elasticsearch?
There’s no fixed number. Start with 3 master-eligible nodes and 2–3 data nodes for small workloads. Scale horizontally as data or query volume grows. A typical production cluster has 5–15 nodes. Large enterprises run clusters with 50+ nodes.
Can I scale Elasticsearch without downtime?
Yes. Add new nodes one at a time while the cluster remains operational. Use index aliases to reroute queries during reindexing. Avoid restarting multiple nodes simultaneously.
What happens if I add too many shards?
Too many shards increase memory usage (each shard consumes ~10–20MB of heap), slow cluster state updates, and increase overhead during recovery. Keep shard count under 1,000 per node and aim for 10–50GB per shard.
Should I use SSDs or HDDs for Elasticsearch?
Always use SSDs for data nodes. HDDs cause unacceptable latency for search and recovery. For archival cold data, consider hybrid storage (SSD for hot, HDD for cold), but never use HDDs for active shards.
How do I know if I need more RAM or more nodes?
If heap usage is consistently above 80% and GC is frequent, you need more RAM or more nodes. If CPU or disk I/O is saturated, add more nodes to distribute load. If network bandwidth is maxed, add nodes with higher network capacity.
Can I downgrade Elasticsearch nodes after scaling up?
Technically yes, but it’s risky. Removing nodes triggers shard relocation, which can overload remaining nodes. Only remove nodes if you’ve reduced data volume or consolidated indices. Always plan removals during low-traffic windows.
Is Kubernetes a good platform for scaling Elasticsearch?
Yes, but with caution. Kubernetes simplifies deployment and scaling but introduces complexity in managing persistent storage, network policies, and stateful workloads. Use the official Elastic Helm chart or a certified operator like Elastic Cloud on Kubernetes (ECK).
How often should I rebalance shards?
Elasticsearch rebalances automatically. You should not manually trigger rebalancing unless a node fails or is removed. Monitor shard distribution weekly to ensure balance.
What’s the impact of replication on scaling?
Replicas improve availability and search throughput but double storage requirements. For high-availability clusters, use 1 replica. For non-critical data, use 0 replicas during ingestion. Never use more than 2 replicas—it increases overhead without meaningful benefit.
How do I handle node failures during scaling?
Ensure you have at least 3 master-eligible nodes. If a data node fails, Elasticsearch automatically reallocates its shards to other nodes. Monitor cluster.health and _cat/shards to confirm recovery. Avoid adding new nodes during a failure event.
Conclusion
Scaling Elasticsearch nodes is a strategic, multi-phase process that requires careful planning, continuous monitoring, and adherence to best practices. It’s not simply about adding hardware—it’s about designing a resilient, performant, and maintainable architecture that evolves with your data and user demands.
By following the step-by-step guide in this tutorial, you’ve learned how to assess your cluster, define clear goals, assign dedicated roles, optimize shard allocation, add nodes safely, and validate performance. You’ve explored real-world examples that demonstrate how companies have overcome scaling challenges—and you now understand the tools, pitfalls, and strategies that separate successful clusters from failing ones.
Remember: The most scalable Elasticsearch clusters are those designed with simplicity, observability, and automation in mind. Avoid over-engineering. Monitor relentlessly. Test everything. And never underestimate the power of proper shard sizing and index lifecycle management.
As your data grows, so should your understanding of Elasticsearch internals. Stay updated with Elastic’s releases, participate in the community, and continuously refine your scaling strategy. With the right approach, your Elasticsearch cluster won’t just handle growth—it will thrive on it.