How to Backup Elasticsearch Data
Elasticsearch is a powerful, distributed search and analytics engine used by organizations worldwide to store, search, and analyze large volumes of data in near real time. From e-commerce product catalogs to log monitoring systems and cybersecurity threat detection, Elasticsearch powers mission-critical applications. Yet, despite its robust architecture, Elasticsearch is not immune to data loss: hardware failures, human error, software bugs, misconfigurations, and even malicious attacks can all destroy data irreversibly. That’s why implementing a reliable and repeatable Elasticsearch backup strategy is not optional—it’s essential.
Backing up Elasticsearch data ensures business continuity, supports compliance requirements, and enables rapid recovery in the event of system failure. Whether you’re managing a small cluster or a large-scale production environment, understanding how to properly back up and restore your data is a core competency for any DevOps engineer, SRE, or data platform administrator.
This comprehensive guide walks you through every aspect of Elasticsearch data backup—from the foundational concepts to advanced automation techniques. You’ll learn step-by-step procedures, industry best practices, recommended tools, real-world examples, and answers to frequently asked questions. By the end of this tutorial, you’ll have the knowledge and confidence to implement a resilient backup strategy tailored to your infrastructure.
Step-by-Step Guide
Understand Elasticsearch Snapshot Architecture
Before initiating any backup, it’s critical to understand how Elasticsearch handles data persistence through its snapshot feature. Unlike traditional database backups that copy raw files, Elasticsearch uses snapshots to create point-in-time copies of indices and cluster metadata. These snapshots are stored in a shared repository—such as a network file system, Amazon S3, Azure Blob Storage, Google Cloud Storage, or HDFS.
Snapshotting is incremental by design. The first snapshot contains all data, but subsequent snapshots only store changes since the last snapshot. This significantly reduces storage overhead and backup time. Snapshots are also consistent across the cluster, meaning they capture the state of all shards at the same moment, even if the cluster is actively indexing new data.
It’s important to note that snapshots are not direct file copies. They are managed by Elasticsearch’s snapshot service, which coordinates with nodes to read data from shards and write it to the repository. This ensures data integrity and avoids corruption during the backup process.
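The incremental behavior can be modeled in a few lines of Python. This is an illustrative simplification, not Elasticsearch’s actual implementation: each shard is treated as a set of immutable segment files, and a snapshot uploads only the segments the repository has not already stored.

```python
# Illustrative model of incremental snapshotting: segments already present
# in the repository are reused; only new segments are uploaded.

def take_snapshot(repository: set, shard_segments: dict) -> dict:
    """Record which segment files this snapshot uploads vs. reuses."""
    uploaded, reused = [], []
    for shard, segments in shard_segments.items():
        for seg in segments:
            if seg in repository:
                reused.append(seg)       # already stored by an earlier snapshot
            else:
                repository.add(seg)      # new data since the last snapshot
                uploaded.append(seg)
    return {"uploaded": sorted(uploaded), "reused": sorted(reused)}

repo = set()
# First snapshot: everything is new, so everything is uploaded.
snap1 = take_snapshot(repo, {"shard-0": ["seg_a", "seg_b"], "shard-1": ["seg_c"]})
# Second snapshot: only the newly written segment is uploaded.
snap2 = take_snapshot(repo, {"shard-0": ["seg_a", "seg_b", "seg_d"], "shard-1": ["seg_c"]})

print(snap1["uploaded"])  # ['seg_a', 'seg_b', 'seg_c']
print(snap2["uploaded"])  # ['seg_d']
```

This is why the first snapshot of a large cluster is slow while subsequent ones are fast: the cost is proportional to new segment data, not total data.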
Step 1: Choose a Snapshot Repository Type
Elasticsearch supports multiple repository types for storing snapshots. Your choice depends on your infrastructure, scalability needs, and cloud provider.
- File System Repository: Stores snapshots on a shared network file system (e.g., NFS, SMB). Suitable for on-premises deployments with shared storage.
- S3 Repository: Uses Amazon S3 for durable, scalable storage. Ideal for cloud-native environments.
- Azure Repository: Integrates with Azure Blob Storage for Microsoft Azure users.
- Google Cloud Repository: Leverages Google Cloud Storage for GCP-based deployments.
- HDFS Repository: For organizations using Hadoop Distributed File System.
For most modern deployments, cloud-based repositories like S3 are preferred due to their durability, availability, and integration with automated backup workflows.
Step 2: Register a Snapshot Repository
To begin backing up, you must first register a repository with your Elasticsearch cluster. This is done via the REST API using a PUT request.
For an S3 repository, you’ll need the S3 repository integration (bundled with recent Elasticsearch versions) and AWS credentials. In current versions, credentials must live in the Elasticsearch keystore (s3.client.default.access_key and s3.client.default.secret_key) or come from an IAM role; the old inline access_key/secret_key repository settings are insecure and have been removed. A minimal registration request:
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "base_path": "snapshots"
  }
}
For a shared file system repository, the location must first be whitelisted in elasticsearch.yml on every node (for example, path.repo: ["/mnt/backups"]), or registration will fail:
PUT _snapshot/my_fs_repository
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/elasticsearch",
    "compress": true
  }
}
After registration, confirm the stored settings with:
GET _snapshot/my_s3_repository
and verify that every node can actually write to the repository with:
POST _snapshot/my_s3_repository/_verify
If successful, the verify call lists the nodes that wrote to the repository. If there’s a misconfiguration, Elasticsearch returns an error (such as permission denied or an invalid bucket name), so ensure your storage backend is accessible and properly configured.
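Most registration failures come down to a missing required setting or unreachable storage. As an illustration, here is a small Python helper that sanity-checks a repository request body before it is sent. It is hypothetical (not part of any Elasticsearch client), and the required-settings table is an assumption based on the repository types described above.

```python
# Hypothetical pre-flight check: catch obvious repository misconfigurations
# before sending the PUT _snapshot request. Not an Elasticsearch API.

REQUIRED_SETTINGS = {
    "s3": {"bucket"},        # credentials come from the keystore or an IAM role
    "fs": {"location"},      # the path must also appear in path.repo on every node
    "azure": {"container"},
    "gcs": {"bucket"},
}

def validate_repository(body: dict) -> list:
    """Return a list of human-readable problems; an empty list means plausible."""
    problems = []
    repo_type = body.get("type")
    if repo_type not in REQUIRED_SETTINGS:
        problems.append(f"unknown repository type: {repo_type!r}")
        return problems
    settings = body.get("settings", {})
    for key in REQUIRED_SETTINGS[repo_type] - settings.keys():
        problems.append(f"missing required setting: {key!r}")
    return problems

print(validate_repository({"type": "s3", "settings": {"bucket": "my-backups"}}))  # []
print(validate_repository({"type": "fs", "settings": {}}))  # ["missing required setting: 'location'"]
```

A check like this belongs in provisioning scripts, where a typo would otherwise surface only as a cryptic repository exception at snapshot time.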
Step 3: Create a Snapshot
Once the repository is registered, you can create your first snapshot. Snapshots can include one or more indices, or the entire cluster.
To back up a specific index:
PUT _snapshot/my_s3_repository/snapshot_2024_06_15
{
  "indices": "logs-2024.06.15,users-index",
  "ignore_unavailable": true,
  "include_global_state": false
}
To back up all indices and cluster state:
PUT _snapshot/my_s3_repository/full_cluster_backup_2024_06_15
{
  "indices": "*",
  "include_global_state": true
}
Key parameters:
- indices: Specifies which indices to include. Use * for all.
- ignore_unavailable: If set to true, the snapshot will proceed even if some indices are missing or closed.
- include_global_state: If true, cluster-wide settings, templates, and lifecycle policies are saved. Use this for full cluster recovery.
By default, snapshots are created asynchronously. You can monitor progress using:
GET _snapshot/my_s3_repository/snapshot_2024_06_15
The response includes the snapshot state: IN_PROGRESS, SUCCESS, PARTIAL (the snapshot completed but some shards failed), or FAILED. Successful snapshots return metadata including the version, UUID, and number of shards backed up.
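In a monitoring script, that state field drives what happens next. A minimal sketch of the decision logic in Python: the state names (IN_PROGRESS, SUCCESS, PARTIAL, FAILED) are Elasticsearch’s, while the mapping to actions is an example policy, not a prescribed one.

```python
# Sketch of a monitoring loop's decision logic: given the JSON returned by
# GET _snapshot/<repo>/<name>, classify the snapshot and pick a next action.

def classify_snapshot(response: dict) -> str:
    snapshot = response["snapshots"][0]
    state = snapshot["state"]
    if state == "IN_PROGRESS":
        return "wait"        # poll again later
    if state == "SUCCESS":
        return "ok"
    if state == "PARTIAL":
        return "alert"       # some shards failed; investigate before trusting it
    return "retry"           # FAILED: fix the underlying cause, then retry

in_progress = {"snapshots": [{"snapshot": "snapshot_2024_06_15", "state": "IN_PROGRESS"}]}
done = {"snapshots": [{"snapshot": "snapshot_2024_06_15", "state": "SUCCESS"}]}

print(classify_snapshot(in_progress))  # wait
print(classify_snapshot(done))         # ok
```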
Step 4: Automate Snapshot Creation
Manually creating snapshots is impractical for production environments. Automation ensures consistency and reduces human error.
Elasticsearch offers two primary methods for automation:
- Elasticsearch Curator: A Python-based tool for managing indices and snapshots. Install via pip:
pip install elasticsearch-curator
Create a configuration file (curator.yml) and a snapshot action file (snapshot_action.yml):
actions:
  1:
    action: snapshot
    description: "Create snapshot of daily indices"
    options:
      repository: my_s3_repository
      name: "daily_snapshot_%Y.%m.%d-%H.%M.%S"
      ignore_unavailable: true
      include_global_state: false
      wait_for_completion: true
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-
      - filtertype: age
        source: name
        direction: older
        unit: days
        unit_count: 1
Schedule it via cron:
0 2 * * * /usr/bin/curator --config /etc/curator/curator.yml /etc/curator/snapshot_action.yml
- Index Lifecycle Management (ILM) with searchable snapshots: In Elasticsearch 7.10+, ILM’s searchable_snapshot action automatically snapshots an index into a repository and mounts it from there when it enters the “cold” or “frozen” phase. Note that ILM is not a general-purpose backup scheduler; for scheduled cluster backups, use snapshot lifecycle management (SLM, covered in Step 6).
PUT _ilm/policy/logs_snapshot_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "cold": {
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my_s3_repository"
          }
        }
      }
    }
  }
}
Then apply this policy to your index template.
Step 5: Verify and Test Your Snapshot
Creating a snapshot is only half the battle. You must verify its integrity and test the restore process.
To list all snapshots in a repository:
GET _snapshot/my_s3_repository/_all
To inspect a specific snapshot’s contents:
GET _snapshot/my_s3_repository/snapshot_2024_06_15/_status
Test the restore by creating a new index from the snapshot:
POST _snapshot/my_s3_repository/snapshot_2024_06_15/_restore
{
  "indices": "logs-2024.06.15",
  "rename_pattern": "logs-(.+)",
  "rename_replacement": "restored_logs_$1",
  "include_global_state": false
}
After restoration, verify data integrity by querying the restored index:
GET restored_logs_2024.06.15/_search
Always test restores in a non-production environment before relying on them in an emergency.
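One simple integrity check after a restore is comparing per-index document counts captured before the snapshot against their restored copies. A sketch, assuming the rename_pattern/rename_replacement convention used above; the helper function and its name are illustrative, not an Elasticsearch API.

```python
# Sketch of a post-restore integrity check: compare document counts recorded
# before the snapshot against the restored copies. The rename convention
# mirrors the rename_replacement used in the restore request above.

def verify_restore(source_counts: dict, restored_counts: dict,
                   prefix: str = "restored_logs_") -> list:
    """Return source indices whose restored copy is missing or mismatched."""
    mismatches = []
    for index, count in source_counts.items():
        # logs-2024.06.15 -> restored_logs_2024.06.15, per the rename pattern
        restored_name = prefix + index.removeprefix("logs-")
        if restored_counts.get(restored_name) != count:
            mismatches.append(index)
    return mismatches

before = {"logs-2024.06.15": 1_204_332}
after = {"restored_logs_2024.06.15": 1_204_332}
print(verify_restore(before, after))  # [] -> counts match
```

Document counts alone don’t prove mappings or field values survived, so pair a check like this with a handful of known-answer queries against the restored index.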
Step 6: Manage Snapshot Retention
Unlimited snapshot retention leads to storage bloat and increased costs. Implement a retention policy to automatically delete old snapshots.
Using Curator, you can delete snapshots older than a specified age:
actions:
  1:
    action: delete_snapshots
    description: "Delete snapshots older than 30 days"
    options:
      repository: my_s3_repository
      ignore_empty_list: true
    filters:
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 30
Alternatively, use Elasticsearch’s built-in snapshot lifecycle management (SLM) feature (introduced in Elasticsearch 7.4, with automatic retention from 7.5). Create a policy:
PUT _slm/policy/daily-retention
{
  "schedule": "0 30 2 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my_s3_repository",
  "config": {
    "indices": "*",
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 100
  }
}
This policy takes a snapshot at 02:30 each day and automatically deletes snapshots older than 30 days, while always keeping at least 5 and at most 100. SLM is the recommended approach for modern Elasticsearch deployments.
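The three retention rules interact: expired snapshots are only removed while min_count is respected, and max_count caps the total. Here is a Python model of those semantics; it illustrates how the policy behaves, and is not Elasticsearch’s actual SLM implementation.

```python
from datetime import datetime, timedelta

# Illustrative model of SLM-style retention (expire_after / min_count /
# max_count): decide which snapshots to delete, never dropping below
# min_count even when snapshots have expired.

def snapshots_to_delete(snapshot_dates, now, expire_after_days=30,
                        min_count=5, max_count=100):
    ordered = sorted(snapshot_dates)                     # oldest first
    cutoff = now - timedelta(days=expire_after_days)
    expired = [d for d in ordered if d < cutoff]
    # Delete expired snapshots, but keep at least min_count overall.
    deletable = expired[:max(0, len(ordered) - min_count)]
    # If still over max_count, delete the oldest survivors too.
    overflow = len(ordered) - len(deletable) - max_count
    if overflow > 0:
        survivors = [d for d in ordered if d not in deletable]
        deletable += survivors[:overflow]
    return deletable

now = datetime(2024, 6, 15)
dates = [now - timedelta(days=d) for d in (1, 2, 3, 4, 5, 40, 50)]
doomed = snapshots_to_delete(dates, now)
print(len(doomed))  # 2 -> only the snapshots from 40 and 50 days ago
```

Note that with only five snapshots total, the two expired ones would survive: min_count wins over expire_after, which is exactly the safety behavior you want from retention.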
Best Practices
1. Always Use External Repositories
Never store snapshots on the same nodes or disks where your Elasticsearch data resides. If a node fails or a disk corrupts, your snapshots may be lost along with the original data. Use a separate, highly durable storage system—preferably cloud-based—with redundancy and versioning enabled.
2. Enable Compression
When configuring your repository, always enable compression ("compress": true). This reduces storage costs and speeds up network transfers, especially for large datasets. Compression has minimal impact on CPU usage during snapshot creation and is negligible compared to the storage savings.
3. Snapshot During Low Traffic Periods
While snapshots are designed to be non-disruptive, they do consume I/O and network bandwidth. Schedule snapshots during maintenance windows or off-peak hours to minimize performance impact on search and indexing operations.
4. Monitor Snapshot Health
Set up alerts for failed snapshots. Use Elasticsearch’s monitoring tools (such as Kibana’s Monitoring UI or Prometheus + Grafana) to track snapshot success rates, durations, and sizes. A sudden drop in snapshot completion rate may indicate storage issues, permission changes, or network instability.
5. Include Global State for Full Recovery
When backing up critical systems, always set include_global_state: true. This ensures that cluster settings, index templates, ingest pipelines, and security roles are preserved. Without global state, restoring data may require manually reconfiguring these components—a time-consuming and error-prone process.
6. Test Restores Regularly
Never assume your backups work. Schedule quarterly restore drills in a staging environment. Validate that data is complete, mappings are preserved, and queries return expected results. Document the restore procedure and train team members on execution.
7. Secure Your Snapshot Repository
Snapshot repositories often contain sensitive data. Restrict access using IAM policies (for S3), Azure RBAC, or filesystem permissions. Avoid hardcoding credentials in configuration files. Instead, use AWS IAM roles, Azure Managed Identities, or Kubernetes secrets for dynamic credential injection.
8. Avoid Snapshotting Too Frequently
While it’s tempting to create hourly snapshots, this can overwhelm your storage system and increase costs. Balance frequency with recovery point objectives (RPO). For most applications, daily snapshots with hourly index rollovers are sufficient. For high-transaction systems, consider combining snapshotting with log shipping (e.g., Kafka + Logstash) for finer-grained recovery.
9. Use Versioned Storage
Enable versioning on your object storage (S3, Azure Blob, etc.). This protects against accidental deletion or overwrites. Even if a snapshot is deleted or corrupted, you can recover the previous version.
10. Document Your Backup Strategy
Document the repository configuration, retention policy, automation scripts, and restore procedures. Include contact information for team members responsible for backups and recovery. Keep this documentation version-controlled and accessible to all relevant engineers.
Tools and Resources
Elasticsearch Native Tools
- Snapshot and Restore API: The core mechanism for creating, listing, and restoring snapshots. Fully integrated into Elasticsearch and available in all versions since 1.0.
- Index Lifecycle Management (ILM): Automates index rollover and snapshotting based on age, size, or phase. Reduces manual intervention.
- Snapshot Lifecycle Management (SLM): Introduced in Elasticsearch 7.4 (with retention from 7.5), SLM automates the creation and deletion of snapshots according to defined policies. Recommended for production use.
- Kibana Snapshot UI: Provides a graphical interface to manage repositories and snapshots. Useful for ad-hoc backups and monitoring.
Third-Party Tools
- Elasticsearch Curator: A mature, Python-based tool for managing indices and snapshots. Highly customizable and widely adopted in enterprise environments.
- Elastic Cloud (Elasticsearch Service): If you’re using Elastic’s managed service, snapshots are automated and stored in secure, durable cloud storage. You can configure retention and restore via the UI or API.
- Velero: A Kubernetes backup tool that can back up Elasticsearch stateful sets along with their PVCs. Useful for Helm-deployed Elasticsearch clusters.
- Stash by AppsCode: A Kubernetes-native backup solution that supports Elasticsearch via plugins. Integrates with S3, GCS, and MinIO.
- OpenSearch: The open-source fork of Elasticsearch (driven by AWS) provides equivalent snapshot functionality, so the tools and strategies described here largely carry over.
Monitoring and Alerting
- Kibana Monitoring: Built-in dashboard for tracking snapshot success, duration, and repository health.
- Prometheus + Elasticsearch Exporter: Exposes snapshot metrics (e.g., es_snapshot_count, es_snapshot_duration_seconds) for alerting.
- Graylog / Datadog / New Relic: Third-party platforms with Elasticsearch plugins to monitor backup health and trigger alerts on failure.
Storage Recommendations
- Amazon S3: Highly durable (11 nines), scalable, and cost-effective. Enable versioning and lifecycle policies.
- Azure Blob Storage: Enterprise-grade storage with geo-redundancy and access tiers.
- Google Cloud Storage: Excellent performance and integration with GKE and other GCP services.
- MinIO: Open-source, S3-compatible object storage. Ideal for on-premises or hybrid deployments.
- NFS with RAID: For on-premises setups, use enterprise-grade NAS with replication and snapshots.
Documentation and Learning Resources
- Elasticsearch Official Snapshot Documentation
- Snapshot Lifecycle Management (SLM)
- Elasticsearch Curator Guide
- Curator GitHub Repository
- Elastic Blog: Backup and Restore Best Practices
Real Examples
Example 1: E-Commerce Platform with Daily Snapshots
A mid-sized e-commerce company runs Elasticsearch to power product search and recommendation engines. Their cluster handles 500 million documents across 15 indices, with 2TB of data.
Strategy:
- Uses S3 as the snapshot repository with versioning enabled.
- Creates a full cluster snapshot every night at 2 AM using SLM.
- Includes global state to preserve index templates and security roles.
- Retains 30 daily snapshots and 12 monthly snapshots.
- Automatically deletes snapshots older than 1 year.
Outcome: After a misconfigured index template caused data corruption, the team restored the cluster from the previous day’s snapshot. Search functionality was restored within 15 minutes, with zero data loss beyond the 24-hour window.
Example 2: Log Aggregation System with Hourly Index Rollovers
A fintech firm ingests 10GB/hour of application and security logs into Elasticsearch. They use index rollovers every 24 hours or when the index reaches 50GB.
Strategy:
- Uses Curator to trigger a snapshot of the previous day’s logs every morning at 3 AM.
- Only snapshots the “cold” phase indices (older than 7 days) to reduce overhead.
- Stores snapshots in a separate S3 bucket with lifecycle rules to move to Glacier after 30 days.
- Alerts are configured via Slack if any snapshot fails for three consecutive days.
Outcome: When a storage node failed unexpectedly, the team restored the last 30 days of logs from snapshots. No data was lost, and compliance audits were unaffected.
Example 3: On-Premises Healthcare System with NFS Repository
A hospital uses Elasticsearch to store patient monitoring data. Due to regulatory requirements, they cannot use public cloud storage.
Strategy:
- Deploys a dedicated NFS server with RAID-6 and daily backups to tape.
- Registers the NFS share as a filesystem repository.
- Creates snapshots every 6 hours using a cron job.
- Uses encrypted file system (LUKS) and restricts NFS access to Elasticsearch nodes only.
- Performs quarterly restore drills with a standby cluster.
Outcome: During a power outage, the primary cluster went offline. The standby cluster was restored from the most recent snapshot and brought online within 20 minutes, ensuring continuity of care.
FAQs
Can I backup Elasticsearch while it’s running?
Yes. Elasticsearch snapshots are designed to be created while the cluster is actively indexing and serving queries. The process is non-blocking and uses a consistent point-in-time view of the data. However, heavy snapshot activity during peak load may impact performance, so schedule backups during off-peak hours.
Do snapshots include all types of data?
Yes—snapshots include index data, mappings, settings, and (if configured) global cluster state such as index templates, ingest pipelines, security roles, and watch configurations. However, they do not include external resources like Kibana dashboards, saved searches, or machine learning jobs. These must be backed up separately using Kibana’s export/import features or API calls.
How much storage do snapshots require?
Snapshots are incremental, so storage usage depends on data churn. The first snapshot of a 1TB cluster may require 1TB of storage. Subsequent snapshots may only require 5–20GB if only a small portion of data changes. Compression reduces this further. Always monitor repository usage and set retention policies to avoid runaway costs.
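A quick back-of-envelope model makes the incremental math concrete. The figures below are assumptions for illustration: a 1 TB (about 1024 GB) first snapshot, roughly 10 GB of changed segment data per day, and 30 retained daily snapshots.

```python
# Back-of-envelope repository sizing: one full copy plus one incremental
# layer per additional retained snapshot. Real usage varies with merge
# activity, churn, and compression.

def repository_size_gb(full_gb, daily_delta_gb, retained_snapshots):
    """Approximate repository size for one full plus incremental layers."""
    return full_gb + daily_delta_gb * (retained_snapshots - 1)

size = repository_size_gb(full_gb=1024, daily_delta_gb=10, retained_snapshots=30)
print(size)  # 1314 -> roughly 1.3x the cluster size, not 30x
```

Keep in mind that deleting an old snapshot only reclaims segments no newer snapshot still references, so actual space reclamation lags behind this simple model.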
Can I restore a snapshot to a different cluster version?
Elasticsearch supports restoring snapshots to a cluster of the same major version or the next major version up (e.g., a 7.x snapshot into an 8.x cluster), but never to older versions, and you cannot skip majors (a 6.x snapshot cannot be restored directly into 8.x). Always test restores across versions in a staging environment. Minor version upgrades (e.g., 8.1 → 8.5) are fully compatible.
What happens if a snapshot fails?
If a snapshot fails, Elasticsearch marks it as FAILED and does not corrupt existing snapshots. You can retry the snapshot after resolving the underlying issue—such as insufficient disk space, network timeouts, or permission errors. Failed snapshots do not consume additional storage.
Is it possible to backup only specific documents or fields?
No. Elasticsearch snapshots operate at the index level. You cannot selectively back up individual documents or fields. To achieve granular backup, export data using the Scroll API or reindex into a separate index with filtered data, then snapshot that index.
How long does a snapshot take to create?
Snapshot duration depends on data size, network bandwidth, and storage performance. A 100GB index may take 10–30 minutes on a fast network and SSD-backed storage. Large clusters (multi-terabyte) may take hours. Monitor progress via the _status endpoint and consider splitting large indices into smaller ones for faster backups.
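As a rough sanity check on those numbers: snapshot time is approximately the data volume to upload divided by effective throughput. A sketch, assuming throughput (network or storage) is the bottleneck; the 100 MB/s figure is an assumption, not a measured value.

```python
# Rough snapshot-duration estimate: new segment data divided by the
# effective throughput of the network/storage path.

def snapshot_minutes(data_gb: float, throughput_mb_per_s: float) -> float:
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return round(seconds / 60, 1)

print(snapshot_minutes(100, 100))   # 17.1 -> within the 10-30 minute range above
print(snapshot_minutes(2048, 100))  # 349.5 -> hours for a 2 TB first snapshot
```

Because later snapshots are incremental, plug in the expected daily delta rather than the total index size when estimating routine backup windows.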
Can I use snapshots for disaster recovery across regions?
Yes. You can copy snapshots between repositories using tools like aws s3 sync or cloud-native replication (e.g., S3 Cross-Region Replication). This enables geographic redundancy. However, restoring from a cross-region snapshot may take longer due to network latency. Always test cross-region recovery procedures.
Are snapshots encrypted?
Elasticsearch does not encrypt snapshots at rest by default. However, you can enable encryption at the storage layer: use S3 server-side encryption (SSE-S3 or SSE-KMS), Azure encryption, or filesystem-level encryption (e.g., LUKS, ZFS). Never store unencrypted sensitive data in snapshot repositories.
What’s the difference between a snapshot and a clone?
A snapshot is a read-only, point-in-time copy stored externally. A clone is a live, writable copy of an index within the same cluster. Clones are useful for testing or temporary copies but are not a substitute for backups. Snapshots are durable and can be restored to any cluster.
Conclusion
Backing up Elasticsearch data is not a one-time task—it’s an ongoing discipline that requires planning, automation, testing, and documentation. In today’s data-driven world, the cost of losing critical data far exceeds the investment required to implement a robust backup strategy.
By following the steps outlined in this guide—registering a secure repository, creating incremental snapshots, automating retention, and regularly testing restores—you can ensure your Elasticsearch clusters remain resilient against failure. Whether you’re running on-premises or in the cloud, the principles remain the same: store backups externally, verify their integrity, and treat them as mission-critical assets.
Remember: the best time to implement a backup strategy was yesterday. The second-best time is now. Start by registering your first repository today, schedule your first snapshot, and test a restore within the next week. Your future self—and your organization—will thank you.