How to Index Data in Elasticsearch
Elasticsearch is a powerful, distributed search and analytics engine built on Apache Lucene. It enables real-time indexing, searching, and analysis of large volumes of structured and unstructured data. At the heart of Elasticsearch’s functionality lies the process of indexing data—the act of storing and organizing documents so they can be efficiently retrieved during queries. Whether you're logging application events, analyzing user behavior, or building a full-text search engine, mastering how to index data in Elasticsearch is essential for performance, scalability, and reliability.
Indexing is not merely about inserting data into a database. It involves defining mappings, choosing appropriate settings, managing document IDs, handling errors, and optimizing for query speed. Poorly indexed data leads to slow searches, high resource consumption, and difficult maintenance. Conversely, well-structured indexing ensures fast response times, seamless scalability, and accurate results—even across petabytes of data.
This guide provides a comprehensive, step-by-step walkthrough of how to index data in Elasticsearch, from basic operations to advanced techniques. You’ll learn best practices, explore real-world examples, and discover tools that streamline the process. By the end, you’ll have the knowledge to confidently index data in production environments with optimal efficiency.
Step-by-Step Guide
Prerequisites
Before you begin indexing data, ensure you have the following components in place:
- Elasticsearch installed and running—You can install Elasticsearch locally using Docker, download it from the official website, or use a managed service like Elastic Cloud.
- A working HTTP client—Use tools like cURL, Postman, or Kibana’s Dev Tools to send requests to Elasticsearch’s REST API.
- Basic understanding of JSON—Elasticsearch stores data as JSON documents, so familiarity with JSON syntax is required.
- Permissions and network access—Ensure your Elasticsearch instance is accessible and authentication (if enabled) is configured.
Verify your Elasticsearch instance is running by sending a GET request to the root endpoint:
curl -X GET "localhost:9200"
You should receive a JSON response containing cluster name, version, and node information.
Step 1: Understand Indexes and Documents
In Elasticsearch, data is stored in indexes, which are similar to databases in relational systems. Each index contains one or more documents, which are JSON objects representing individual records. For example, an index named products might contain documents representing individual items in an e-commerce catalog.
Each document is identified by a unique document ID. Supplying one is optional; if you don't, Elasticsearch auto-generates an ID. Documents are composed of fields, which are key-value pairs. For instance:
{
  "name": "Wireless Headphones",
  "price": 129.99,
  "category": "Electronics",
  "in_stock": true,
  "tags": ["audio", "wireless", "bluetooth"]
}
Indexes are further divided into shards (for horizontal scaling) and replicas (for high availability). Understanding this architecture helps you plan indexing strategies for performance and resilience.
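Once an index exists, you can inspect its shard layout directly. A minimal sketch with the official Python client (assuming a local, unsecured cluster; adjust host and auth for your environment):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One row per shard copy: primary or replica, its state, and the node it lives on
print(es.cat.shards(index="products", format="json"))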
Step 2: Create an Index with Custom Mapping
By default, Elasticsearch uses dynamic mapping to infer field types when a document is indexed. While convenient for prototyping, this can lead to unintended data types (e.g., a numeric field being indexed as a string). For production use, define an explicit mapping when creating the index.
Use the PUT method to create an index with a custom mapping:
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard"
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "in_stock": {
        "type": "boolean"
      },
      "tags": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}
Key mapping types explained:
- text: Used for full-text search. Analyzed (broken into tokens) for relevance scoring.
- keyword: Used for exact matches, aggregations, and sorting. Not analyzed.
- float, integer, boolean: Numeric and boolean types for precise calculations.
- date: Stores timestamps in ISO 8601 format or custom formats.
Always define mappings before indexing large datasets to avoid mapping conflicts and ensure consistent data handling.
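The same index can be created from application code. A sketch using the official Python client (8.x style, where settings and mappings are top-level parameters):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the products index with explicit settings and mappings
es.indices.create(
    index="products",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
    mappings={
        "properties": {
            "name": {"type": "text", "analyzer": "standard"},
            "price": {"type": "float"},
            "category": {"type": "keyword"},
            "in_stock": {"type": "boolean"},
            "tags": {"type": "keyword"},
            "created_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
        }
    },
)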
Step 3: Index a Single Document
Once the index is created, use the PUT or POST method to index a document.
To index with a specific ID:
PUT /products/_doc/1
{
  "name": "Wireless Headphones",
  "price": 129.99,
  "category": "Electronics",
  "in_stock": true,
  "tags": ["audio", "wireless", "bluetooth"],
  "created_at": "2024-06-15 10:30:00"
}
To let Elasticsearch auto-generate the ID:
POST /products/_doc
{
  "name": "Smart Watch",
  "price": 199.99,
  "category": "Electronics",
  "in_stock": false,
  "tags": ["wearable", "fitness", "smart"],
  "created_at": "2024-06-15 11:15:00"
}
Successful indexing returns a JSON response with the document’s _id, _version, and result (e.g., “created” or “updated”).
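For application code, the equivalent calls with the Python client look like this (a sketch; client setup mirrors the earlier examples):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Explicit ID: creates document 1, or replaces it if it already exists
resp = es.index(index="products", id=1, document={
    "name": "Wireless Headphones",
    "price": 129.99,
    "category": "Electronics",
    "in_stock": True,
    "tags": ["audio", "wireless", "bluetooth"],
    "created_at": "2024-06-15 10:30:00"
})
print(resp["result"])  # "created" or "updated"

# No ID: Elasticsearch generates one and returns it in the response
resp = es.index(index="products", document={"name": "Smart Watch", "price": 199.99, "category": "Electronics", "in_stock": False, "tags": ["wearable", "fitness", "smart"], "created_at": "2024-06-15 11:15:00"})
print(resp["_id"])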
Step 4: Index Multiple Documents in Bulk
Indexing documents one at a time is inefficient for large datasets. Use the Bulk API to index multiple documents in a single request, reducing network overhead and improving throughput.
The Bulk API requires a newline-delimited JSON (NDJSON) format. Each document is preceded by a metadata line specifying the action and target index, and the request body must end with a trailing newline:
POST /products/_bulk
{ "index": { "_id": "2" } }
{ "name": "Bluetooth Speaker", "price": 89.99, "category": "Electronics", "in_stock": true, "tags": ["audio", "portable"], "created_at": "2024-06-15 12:00:00" }
{ "index": { "_id": "3" } }
{ "name": "Laptop", "price": 999.99, "category": "Electronics", "in_stock": true, "tags": ["computing", "mobile"], "created_at": "2024-06-15 12:05:00" }
{ "delete": { "_id": "1" } }
{ "create": { "_id": "4" } }
{ "name": "Wireless Mouse", "price": 49.99, "category": "Electronics", "in_stock": true, "tags": ["input", "gaming"], "created_at": "2024-06-15 12:10:00" }
Actions available:
- index: Index a document (creates or replaces).
- create: Index only if the document doesn’t exist.
- update: Partially update a document.
- delete: Remove a document.
Bulk requests can handle thousands of documents per request. For optimal performance, keep bulk request sizes between 5–15 MB and limit to 1,000–5,000 documents per request.
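From Python, the bulk helper builds the NDJSON body for you and splits large inputs into chunks. A sketch (chunk_size and the example documents are illustrative):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

actions = [
    {"_index": "products", "_id": "2", "_source": {"name": "Bluetooth Speaker", "price": 89.99, "category": "Electronics", "in_stock": True}},
    {"_index": "products", "_id": "3", "_source": {"name": "Laptop", "price": 999.99, "category": "Electronics", "in_stock": True}},
]

# Sends one _bulk request per chunk; returns a success count and a list of per-item errors
success, errors = helpers.bulk(es, actions, chunk_size=1000, raise_on_error=False)
print(success, errors)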
Step 5: Monitor Indexing Performance
After indexing, monitor your cluster’s health and performance using the following endpoints:
- GET /_cluster/health — Check cluster status (green, yellow, red).
- GET /products/_stats — View indexing statistics for the index.
- GET /_cat/nodes?v — Monitor node load and resource usage.
- GET /_cat/indices?v — See index size, document count, and shard distribution.
Use Kibana’s Monitoring dashboard for real-time visual insights into indexing rate, latency, and errors.
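These checks are easy to automate. A sketch that pulls the key numbers with the Python client:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Cluster-wide status (green/yellow/red) and unassigned shard count
health = es.cluster.health()
print(health["status"], health["unassigned_shards"])

# Per-index indexing stats: document totals and time spent indexing
stats = es.indices.stats(index="products")
print(stats["indices"]["products"]["primaries"]["indexing"])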
Step 6: Handle Errors and Conflicts
Indexing operations can fail due to various reasons:
- Mapping conflicts: Field type mismatch (e.g., trying to index a string where a number is expected).
- Document ID conflicts: Using PUT on an existing document without version control.
- Shard allocation failures: Insufficient nodes or disk space.
- Timeouts: Network latency or heavy cluster load.
Always check the response body for errors. For example:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [price] of type [float] in document with id '1'. Preview of field's value: 'abc'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [price] of type [float] in document with id '1'. Preview of field's value: 'abc'"
  },
  "status": 400
}
To prevent conflicts, use the if_seq_no and if_primary_term parameters for optimistic concurrency control:
PUT /products/_doc/1?if_seq_no=10&if_primary_term=3
{
  "name": "Updated Headphones",
  "price": 119.99
}
This ensures the update only proceeds if the document hasn’t been modified since the last read.
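In the Python client, the same parameters are keyword arguments, and a failed precondition surfaces as a ConflictError you can catch and retry (a sketch):
from elasticsearch import ConflictError, Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Read the document to capture its current sequence number and primary term
current = es.get(index="products", id=1)

try:
    es.index(
        index="products",
        id=1,
        document={"name": "Updated Headphones", "price": 119.99},
        if_seq_no=current["_seq_no"],
        if_primary_term=current["_primary_term"],
    )
except ConflictError:
    # The document changed between our read and write; re-read and retry
    print("Version conflict - reloading document")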
Step 7: Refresh and Flush Indexes
Elasticsearch uses a near-real-time (NRT) model. Documents are indexed in memory and made searchable after a refresh interval (default: 1 second). For immediate searchability after indexing, manually trigger a refresh:
POST /products/_refresh
However, frequent refreshes impact performance. In batch ingestion scenarios, disable automatic refresh during bulk indexing and enable it afterward:
PUT /products/_settings
{
  "index.refresh_interval": "-1"
}

# ... perform your bulk indexing here ...

PUT /products/_settings
{
  "index.refresh_interval": "1s"
}
POST /products/_refresh
Use flush to commit in-memory segments to disk and clear the transaction log. Elasticsearch flushes automatically based on translog size and age, so manual flushes are rarely needed:
POST /products/_flush
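The disable/bulk/restore pattern above is easy to wrap in code. A sketch with the Python client:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Turn off automatic refresh while loading
es.indices.put_settings(index="products", settings={"index": {"refresh_interval": "-1"}})

# ... run your bulk indexing here ...

# Restore the default interval and make the new documents searchable immediately
es.indices.put_settings(index="products", settings={"index": {"refresh_interval": "1s"}})
es.indices.refresh(index="products")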
Best Practices
Design Indexes for Search Patterns
Index design should align with your query patterns. If you frequently filter by category and sort by price, ensure those fields are mapped as keyword and float respectively. Avoid using text fields for filtering or sorting—they are analyzed and unsuitable for exact matches.
Consider using aliasing to decouple application logic from physical index names. For example, create an alias products_current pointing to products_v1. When upgrading, create products_v2, reindex data, then switch the alias. This enables zero-downtime updates.
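The switch is atomic if both alias actions go into a single update_aliases call, so queries never observe a missing alias. A sketch (index and alias names follow the example above):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Detach products_v1 and attach products_v2 in one atomic operation
es.indices.update_aliases(actions=[
    {"remove": {"index": "products_v1", "alias": "products_current"}},
    {"add": {"index": "products_v2", "alias": "products_current"}},
])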
Use Appropriate Data Types
Choosing the right field type is critical:
- Use keyword for IDs, tags, statuses, and categorical data.
- Use text only for fields requiring full-text search (e.g., product descriptions).
- Use date for timestamps—never store as strings.
- Use ip for IP addresses to enable range queries.
- Use nested for arrays of objects that need independent querying (e.g., product variants).
Avoid dynamic mapping in production. Always define mappings explicitly to prevent schema drift.
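As an illustration of the less common types above, a hypothetical mapping with ip and nested fields might look like this (the index and field names are made up for the example):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="example-events", mappings={
    "properties": {
        "client_ip": {"type": "ip"},   # enables IP range and CIDR queries
        "variants": {                  # objects that must be queried independently
            "type": "nested",
            "properties": {
                "sku": {"type": "keyword"},
                "price": {"type": "float"}
            }
        }
    }
})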
Optimize for Bulk Operations
When indexing large volumes of data:
- Batch documents into requests of 5–15 MB.
- Use the Bulk API instead of individual index calls.
- Disable refresh intervals during bulk ingestion.
- Use multiple concurrent bulk threads if your cluster has sufficient resources.
- Monitor heap usage and avoid overwhelming nodes.
Tools like Logstash or Filebeat are optimized for high-volume ingestion and should be preferred over custom scripts for log data.
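If you do script the ingestion yourself, the parallel bulk helper spreads chunks across threads. A sketch (thread_count, chunk_size, and the generated documents are illustrative; size them to your cluster):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def generate_actions():
    # Yield documents lazily so the full dataset never sits in memory
    for i in range(100_000):
        yield {"_index": "products", "_source": {"name": f"Item {i}", "price": 9.99}}

# Four worker threads, each sending chunks of 2,000 documents
for ok, item in helpers.parallel_bulk(es, generate_actions(), thread_count=4, chunk_size=2000):
    if not ok:
        print("Failed:", item)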
Manage Index Lifecycle
Indexes grow over time. Implement an Index Lifecycle Management (ILM) policy to automate rollover, shrink, and delete operations:
- Hot phase: Active indexing and querying (high-performance nodes).
- Warm phase: Read-only, lower-cost storage.
- Cold phase: Archived, rarely accessed data.
- Delete phase: Automatic removal after retention period.
ILM policies reduce storage costs and maintain performance by moving older data to cheaper hardware.
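A policy can be created through the ILM API. A sketch of a simple hot-to-delete policy (the policy name and thresholds are illustrative):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Roll over hot indexes at 50 GB or one day; delete data after 30 days
es.ilm.put_lifecycle(name="logs-30d", policy={
    "phases": {
        "hot": {"actions": {"rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}}},
        "delete": {"min_age": "30d", "actions": {"delete": {}}}
    }
})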
Enable Replication Strategically
Replicas improve search performance and availability but consume additional disk and memory. For write-heavy indexes, use fewer replicas (e.g., 1) during ingestion. Increase replicas (e.g., 2) after indexing completes for better query scalability.
Keep number_of_replicas at or below the number of data nodes minus one; each copy of a shard must live on a different node, so higher values leave replicas unassigned.
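Scaling replicas up after ingestion is a single settings call (a sketch):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Ingest with one replica, then add a second for query throughput
es.indices.put_settings(index="products", settings={"index": {"number_of_replicas": 2}})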
Secure Your Indexes
Apply security controls:
- Use Elasticsearch’s built-in security (X-Pack) to restrict index access by role.
- Encrypt data at rest and in transit using TLS.
- Limit indexing permissions to trusted services or users.
- Audit index creation and modification via Elasticsearch logs.
Monitor and Alert
Set up monitoring for:
- Indexing rate (docs/sec)
- Latency of bulk requests
- Shard failures and unassigned shards
- Heap memory usage
- Disk space utilization
Use tools like Prometheus + Grafana or Elastic Observability to visualize metrics and trigger alerts for anomalies.
Tools and Resources
Elasticsearch REST API
The primary interface for indexing. All operations (create, update, delete, bulk) are performed via HTTP requests. Documentation is available at elastic.co/guide.
Kibana Dev Tools
Kibana’s Dev Tools console provides a web-based interface to interact with Elasticsearch. It supports syntax highlighting, autocomplete, and real-time response visualization. Ideal for testing mappings, queries, and bulk operations.
Logstash
A server-side data processing pipeline that ingests data from multiple sources, transforms it, and sends it to Elasticsearch. Useful for logs, metrics, and structured data from databases or files.
Filebeat
A lightweight shipper for forwarding and centralizing log files. Integrates seamlessly with Logstash and Elasticsearch for real-time log indexing.
Python Elasticsearch Client
A Python library for interacting with Elasticsearch. Example:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "name": "Python Book",
    "price": 45.99,
    "category": "Books",
    "in_stock": True
}

es.index(index="products", document=doc)
Install via pip: pip install elasticsearch
Java High-Level REST Client (Deprecated) / Java API Client
For Java applications, use the new Java API Client (replacing the deprecated High-Level Client). It provides type-safe, async, and reactive interfaces.
Apache NiFi
A dataflow automation tool that can ingest, route, and transform data before sending it to Elasticsearch. Useful for complex data pipelines involving multiple sources and transformations.
OpenSearch
An open-source fork of Elasticsearch with similar APIs. If you prefer community-driven development or need to avoid Elastic’s licensing changes, OpenSearch is a viable alternative.
Postman and cURL
Essential for manual testing and scripting. cURL is lightweight and available on all systems. Postman offers GUI-based request building and environment management.
Elastic Cloud
Elastic’s fully managed service. Eliminates infrastructure management and provides auto-scaling, backups, and monitoring out of the box. Ideal for teams without dedicated DevOps resources.
Documentation and Community
- Official Elasticsearch Documentation
- Elastic Community Forum
- Stack Overflow (elasticsearch tag)
- GitHub Repository
Real Examples
Example 1: E-Commerce Product Catalog
Scenario: Index 10,000 products from a CSV file into Elasticsearch.
Step 1: Define the index mapping:
PUT /products
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "english" },
      "description": { "type": "text", "analyzer": "english" },
      "price": { "type": "float" },
      "category": { "type": "keyword" },
      "brand": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "tags": { "type": "keyword" },
      "created_at": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
    }
  }
}
Step 2: Convert CSV to NDJSON using Python:
import csv
import json
with open('products.csv', 'r') as f:
    reader = csv.DictReader(f)
    bulk_data = []
    for row in reader:
        # Metadata line: the target document ID
        bulk_data.append(json.dumps({"index": {"_id": row['product_id']}}))
        # Source line: the document body, with fields coerced to the mapped types
        bulk_data.append(json.dumps({
            "name": row['name'],
            "description": row['description'],
            "price": float(row['price']),
            "category": row['category'],
            "brand": row['brand'],
            "in_stock": row['in_stock'] == 'true',
            "tags": row['tags'].split(','),
            "created_at": row['created_at']
        }))

with open('products_bulk.ndjson', 'w') as f:
    # The Bulk API requires a trailing newline after the last line
    f.write('\n'.join(bulk_data) + '\n')
Step 3: Bulk index using cURL:
curl -X POST "localhost:9200/products/_bulk" \
-H "Content-Type: application/json" \
--data-binary "@products_bulk.ndjson"
Step 4: Verify indexing:
GET /products/_count
GET /products/_search?q=wireless&pretty
Example 2: Application Log Indexing
Scenario: Ingest application logs from a Spring Boot app into Elasticsearch.
Use Filebeat to tail the log file (application.log):
# filebeat.yml
filebeat.inputs:
  - type: filestream
    enabled: true
    paths:
      - /var/log/myapp/application.log

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "app-logs-%{+yyyy.MM.dd}"

processors:
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true
Filebeat automatically creates daily indexes (e.g., app-logs-2024.06.15) and parses JSON logs. Use ILM to retain logs for 30 days and then delete.
Example 3: Real-Time Sensor Data
Scenario: Index temperature readings from IoT devices every 5 seconds.
Use a lightweight Python script with the Elasticsearch client:
import time
import random
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

while True:
    doc = {
        "device_id": f"sensor-{random.randint(1, 100)}",
        "temperature": round(random.uniform(20.0, 35.0), 2),
        "humidity": random.randint(30, 90),
        # Use UTC so the trailing "Z" is accurate
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    }
    es.index(index="sensor-readings", document=doc)
    print(f"Indexed: {doc}")
    time.sleep(5)
Use an ILM policy to roll over the index daily and keep only 7 days of data.
FAQs
What is the difference between indexing and searching in Elasticsearch?
Indexing is the process of storing and organizing documents so they can be retrieved efficiently. Searching is the act of querying those indexed documents to find matches based on criteria like keywords, filters, or ranges. Indexing happens once (or periodically); searching happens repeatedly.
Can I change the mapping of an existing index?
Existing field mappings cannot be changed after an index is created; you can add new fields to a mapping, but not alter or remove existing ones. To change an existing field's mapping, create a new index with the desired schema and reindex the data using the Reindex API:
POST /_reindex
{
  "source": { "index": "products_old" },
  "dest": { "index": "products_new" }
}
How do I handle large datasets that won’t fit in memory?
Use the Bulk API with small batches (1,000–5,000 docs per request). Stream data from disk or a database using a producer-consumer pattern. Tools like Logstash or custom Python scripts with generators can process data in chunks without loading everything into memory.
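The streaming bulk helper pairs naturally with a generator, keeping only one chunk in memory at a time. A sketch that streams an NDJSON file (the file path is illustrative):
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def read_docs(path):
    # Yield one document per line; nothing is buffered beyond the current chunk
    with open(path) as f:
        for line in f:
            yield {"_index": "products", "_source": json.loads(line)}

for ok, item in helpers.streaming_bulk(es, read_docs("products.ndjson"), chunk_size=2000):
    if not ok:
        print("Failed:", item)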
Does indexing affect search performance?
Yes. Heavy indexing can consume CPU, memory, and I/O, which may slow down concurrent searches. To minimize impact, schedule bulk indexing during off-peak hours, use separate nodes for ingestion, and monitor cluster health.
What happens if I index a document with the same ID twice?
Elasticsearch will update the existing document and increment its version number. The old version is marked for deletion and removed during segment merging. Use create instead of index if you want to prevent overwrites.
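With the Python client, create raises a ConflictError instead of silently overwriting (a sketch):
from elasticsearch import ConflictError, Elasticsearch

es = Elasticsearch("http://localhost:9200")

try:
    # Succeeds only if document 1 does not exist yet
    es.create(index="products", id=1, document={"name": "Wireless Headphones"})
except ConflictError:
    print("Document 1 already exists; not overwritten")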
Is it better to use one large index or many small indexes?
It depends. Use one index for related data with similar query patterns. Use multiple indexes for time-series data (e.g., daily logs) or when you need different retention policies. Too many small indexes can overwhelm the cluster’s metadata management.
How can I index data from a relational database?
Use Logstash with the JDBC input plugin, or write a script that queries the database and sends results to Elasticsearch via the Bulk API. For CDC (Change Data Capture), use tools like Debezium to stream database changes in real time.
What is the maximum size of a document in Elasticsearch?
By default, Elasticsearch's HTTP layer rejects request bodies larger than 100 MB, which effectively caps document size. This limit is controlled by the http.max_content_length setting and can be raised, but large documents are discouraged due to memory and performance implications.
How do I know if my index is optimized?
Check for:
- Low segment count (use GET /_cat/segments)
- High indexing rate with low latency
- Low refresh frequency
- Minimal shard failures
- Efficient query response times
Run the force merge API (the successor to the old optimize API) to reduce segments; only force-merge indexes that are no longer being written to:
POST /products/_forcemerge?max_num_segments=1
Can I index data without creating an index first?
Yes. Elasticsearch will create an index automatically using dynamic mapping. However, this is not recommended for production due to unpredictable field types and lack of control over settings.
Conclusion
Indexing data in Elasticsearch is a foundational skill for anyone working with search, analytics, or logging systems. From defining precise mappings to executing bulk operations and managing index lifecycles, each step plays a critical role in ensuring data is stored efficiently and retrieved quickly. The examples and best practices outlined in this guide provide a robust framework for implementing indexing strategies that scale with your data and meet performance expectations.
Remember: indexing is not a one-time task—it’s an ongoing process that requires monitoring, optimization, and adaptation. As your data grows and query patterns evolve, revisit your mappings, refresh intervals, and shard configurations. Leverage tools like Kibana, Logstash, and ILM to automate routine tasks and reduce operational overhead.
By following the principles in this guide—explicit mapping, bulk ingestion, strategic replication, and proactive monitoring—you’ll build Elasticsearch indexes that are fast, reliable, and maintainable. Whether you’re indexing product catalogs, application logs, or sensor data, mastering indexing transforms Elasticsearch from a tool into a powerful data engine that drives real business value.