How to Index Data in Elasticsearch
Elasticsearch is a powerful, distributed search and analytics engine built on Apache Lucene. It enables real-time indexing, searching, and analysis of large volumes of structured and unstructured data. At the heart of Elasticsearch’s functionality lies the process of indexing data—the act of storing and organizing documents so they can be efficiently retrieved during queries. Whether you're logging application events, analyzing user behavior, or building a full-text search engine, mastering how to index data in Elasticsearch is essential for performance, scalability, and reliability.
Indexing is not merely about inserting data into a database. It involves defining mappings, choosing appropriate settings, managing document IDs, handling errors, and optimizing for query speed. Poorly indexed data leads to slow searches, high resource consumption, and difficult maintenance. Conversely, well-structured indexing ensures fast response times, seamless scalability, and accurate results—even across petabytes of data.
This guide provides a comprehensive, step-by-step walkthrough of how to index data in Elasticsearch, from basic operations to advanced techniques. You’ll learn best practices, explore real-world examples, and discover tools that streamline the process. By the end, you’ll have the knowledge to confidently index data in production environments with optimal efficiency.
Step-by-Step Guide
Prerequisites
Before you begin indexing data, ensure you have the following components in place:
- Elasticsearch installed and running—You can install Elasticsearch locally using Docker, download it from the official website, or use a managed service like Elastic Cloud.
- A working HTTP client—Use tools like cURL, Postman, or Kibana’s Dev Tools to send requests to Elasticsearch’s REST API.
- Basic understanding of JSON—Elasticsearch stores data as JSON documents, so familiarity with JSON syntax is required.
- Permissions and network access—Ensure your Elasticsearch instance is accessible and authentication (if enabled) is configured.
Verify your Elasticsearch instance is running by sending a GET request to the root endpoint:
curl -X GET "localhost:9200"
You should receive a JSON response containing cluster name, version, and node information.
Step 1: Understand Indexes and Documents
In Elasticsearch, data is stored in indexes, which are similar to databases in relational systems. Each index contains one or more documents, which are JSON objects representing individual records. For example, an index named products might contain documents representing individual items in an e-commerce catalog.
Each document is identified by a unique document ID. Supplying one is optional; if you don't, Elasticsearch auto-generates an ID. Documents are composed of fields, which are key-value pairs. For instance:
{
  "name": "Wireless Headphones",
  "price": 129.99,
  "category": "Electronics",
  "in_stock": true,
  "tags": ["audio", "wireless", "bluetooth"]
}
Indexes are further divided into shards (for horizontal scaling) and replicas (for high availability). Understanding this architecture helps you plan indexing strategies for performance and resilience.
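Once an index exists, you can inspect its shard layout directly. A minimal sketch with the official Python client (assuming a local, unsecured cluster; adjust host and auth for your environment):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One row per shard copy: primary or replica, its state, and the node it lives on
print(es.cat.shards(index="products", format="json"))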
Step 2: Create an Index with Custom Mapping
By default, Elasticsearch uses dynamic mapping to infer field types when a document is indexed. While convenient for prototyping, this can lead to unintended data types (e.g., a numeric field being indexed as a string). For production use, define an explicit mapping when creating the index.
Use the PUT method to create an index with a custom mapping:
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard"
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "in_stock": {
        "type": "boolean"
      },
      "tags": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}
Key mapping types explained:
- text: Used for full-text search. Analyzed (broken into tokens) for relevance scoring.
- keyword: Used for exact matches, aggregations, and sorting. Not analyzed.
- float, integer, boolean: Numeric and boolean types for precise calculations.
- date: Stores timestamps in ISO 8601 format or custom formats.
Always define mappings before indexing large datasets to avoid mapping conflicts and ensure consistent data handling.
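The same index can be created from application code. A sketch using the official Python client (8.x style, where settings and mappings are top-level parameters):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the products index with explicit settings and mappings
es.indices.create(
    index="products",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
    mappings={
        "properties": {
            "name": {"type": "text", "analyzer": "standard"},
            "price": {"type": "float"},
            "category": {"type": "keyword"},
            "in_stock": {"type": "boolean"},
            "tags": {"type": "keyword"},
            "created_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
        }
    },
)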
Step 3: Index a Single Document
Once the index is created, use the PUT or POST method to index a document.
To index with a specific ID:
PUT /products/_doc/1
{
  "name": "Wireless Headphones",
  "price": 129.99,
  "category": "Electronics",
  "in_stock": true,
  "tags": ["audio", "wireless", "bluetooth"],
  "created_at": "2024-06-15 10:30:00"
}
To let Elasticsearch auto-generate the ID:
POST /products/_doc
{
  "name": "Smart Watch",
  "price": 199.99,
  "category": "Electronics",
  "in_stock": false,
  "tags": ["wearable", "fitness", "smart"],
  "created_at": "2024-06-15 11:15:00"
}
Successful indexing returns a JSON response with the document’s _id, _version, and result (e.g., “created” or “updated”).
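For application code, the equivalent calls with the Python client look like this (a sketch; client setup mirrors the earlier examples):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Explicit ID: creates document 1, or replaces it if it already exists
resp = es.index(index="products", id=1, document={
    "name": "Wireless Headphones",
    "price": 129.99,
    "category": "Electronics",
    "in_stock": True,
    "tags": ["audio", "wireless", "bluetooth"],
    "created_at": "2024-06-15 10:30:00"
})
print(resp["result"])  # "created" or "updated"

# No ID: Elasticsearch generates one and returns it in the response
resp = es.index(index="products", document={"name": "Smart Watch", "price": 199.99, "category": "Electronics", "in_stock": False, "tags": ["wearable", "fitness", "smart"], "created_at": "2024-06-15 11:15:00"})
print(resp["_id"])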
Step 4: Index Multiple Documents in Bulk
Indexing documents one at a time is inefficient for large datasets. Use the Bulk API to index multiple documents in a single request, reducing network overhead and improving throughput.
The Bulk API requires a newline-delimited JSON (NDJSON) format. Each document is preceded by a metadata line specifying the action and target index, and the request body must end with a trailing newline:
POST /products/_bulk
{ "index": { "_id": "2" } }
{ "name": "Bluetooth Speaker", "price": 89.99, "category": "Electronics", "in_stock": true, "tags": ["audio", "portable"], "created_at": "2024-06-15 12:00:00" }
{ "index": { "_id": "3" } }
{ "name": "Laptop", "price": 999.99, "category": "Electronics", "in_stock": true, "tags": ["computing", "mobile"], "created_at": "2024-06-15 12:05:00" }
{ "delete": { "_id": "1" } }
{ "create": { "_id": "4" } }
{ "name": "Wireless Mouse", "price": 49.99, "category": "Electronics", "in_stock": true, "tags": ["input", "gaming"], "created_at": "2024-06-15 12:10:00" }
Actions available:
- index: Index a document (creates or replaces).
- create: Index only if the document doesn’t exist.
- update: Partially update a document.
- delete: Remove a document.
Bulk requests can handle thousands of documents per request. For optimal performance, keep bulk request sizes between 5–15 MB and limit to 1,000–5,000 documents per request.
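From Python, the bulk helper builds the NDJSON body for you and splits large inputs into chunks. A sketch (chunk_size and the example documents are illustrative):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

actions = [
    {"_index": "products", "_id": "2", "_source": {"name": "Bluetooth Speaker", "price": 89.99, "category": "Electronics", "in_stock": True}},
    {"_index": "products", "_id": "3", "_source": {"name": "Laptop", "price": 999.99, "category": "Electronics", "in_stock": True}},
]

# Sends one _bulk request per chunk; returns a success count and a list of per-item errors
success, errors = helpers.bulk(es, actions, chunk_size=1000, raise_on_error=False)
print(success, errors)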
Step 5: Monitor Indexing Performance
After indexing, monitor your cluster’s health and performance using the following endpoints:
- GET /_cluster/health — Check cluster status (green, yellow, red).
- GET /products/_stats — View indexing statistics for the index.
- GET /_cat/nodes?v — Monitor node load and resource usage.
- GET /_cat/indices?v — See index size, document count, and shard distribution.
Use Kibana’s Monitoring dashboard for real-time visual insights into indexing rate, latency, and errors.
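These checks are easy to automate. A sketch that pulls the key numbers with the Python client:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Cluster-wide status (green/yellow/red) and unassigned shard count
health = es.cluster.health()
print(health["status"], health["unassigned_shards"])

# Per-index indexing stats: document totals and time spent indexing
stats = es.indices.stats(index="products")
print(stats["indices"]["products"]["primaries"]["indexing"])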
Step 6: Handle Errors and Conflicts
Indexing operations can fail due to various reasons:
- Mapping conflicts: Field type mismatch (e.g., trying to index a string where a number is expected).
- Document ID conflicts: Using PUT on an existing document without version control.
- Shard allocation failures: Insufficient nodes or disk space.
- Timeouts: Network latency or heavy cluster load.
Always check the response body for errors. For example:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [price] of type [float] in document with id '1'. Preview of field's value: 'abc'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [price] of type [float] in document with id '1'. Preview of field's value: 'abc'"
  },
  "status": 400
}
To prevent conflicts, use the if_seq_no and if_primary_term parameters for optimistic concurrency control:
PUT /products/_doc/1?if_seq_no=10&if_primary_term=3
{
  "name": "Updated Headphones",
  "price": 119.99
}
This ensures the update only proceeds if the document hasn’t been modified since the last read.
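In the Python client, the same parameters are keyword arguments, and a failed precondition surfaces as a ConflictError you can catch and retry (a sketch):
from elasticsearch import ConflictError, Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Read the document to capture its current sequence number and primary term
current = es.get(index="products", id=1)

try:
    es.index(
        index="products",
        id=1,
        document={"name": "Updated Headphones", "price": 119.99},
        if_seq_no=current["_seq_no"],
        if_primary_term=current["_primary_term"],
    )
except ConflictError:
    # The document changed between our read and write; re-read and retry
    print("Version conflict - reloading document")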
Step 7: Refresh and Flush Indexes
Elasticsearch uses a near-real-time (NRT) model. Documents are indexed in memory and made searchable after a refresh interval (default: 1 second). For immediate searchability after indexing, manually trigger a refresh:
POST /products/_refresh
However, frequent refreshes impact performance. In batch ingestion scenarios, disable automatic refresh during bulk indexing and enable it afterward:
PUT /products/_settings
{
  "index.refresh_interval": "-1"
}

# ... perform your bulk indexing here ...

PUT /products/_settings
{
  "index.refresh_interval": "1s"
}
POST /products/_refresh
Use flush to commit in-memory segments to disk and clear the transaction log. Elasticsearch flushes automatically based on translog size and age, so manual flushes are rarely needed:
POST /products/_flush
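The disable/bulk/restore pattern above is easy to wrap in code. A sketch with the Python client:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Turn off automatic refresh while loading
es.indices.put_settings(index="products", settings={"index": {"refresh_interval": "-1"}})

# ... run your bulk indexing here ...

# Restore the default interval and make the new documents searchable immediately
es.indices.put_settings(index="products", settings={"index": {"refresh_interval": "1s"}})
es.indices.refresh(index="products")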
Best Practices
Design Indexes for Search Patterns
Index design should align with your query patterns. If you frequently filter by category and sort by price, ensure those fields are mapped as keyword and float respectively. Avoid using text fields for filtering or sorting—they are analyzed and unsuitable for exact matches.
Consider using aliasing to decouple application logic from physical index names. For example, create an alias products_current pointing to products_v1. When upgrading, create products_v2, reindex data, then switch the alias. This enables zero-downtime updates.
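The switch is atomic if both alias actions go into a single update_aliases call, so queries never observe a missing alias. A sketch (index and alias names follow the example above):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Detach products_v1 and attach products_v2 in one atomic operation
es.indices.update_aliases(actions=[
    {"remove": {"index": "products_v1", "alias": "products_current"}},
    {"add": {"index": "products_v2", "alias": "products_current"}},
])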
Use Appropriate Data Types
Choosing the right field type is critical:
- Use keyword for IDs, tags, statuses, and categorical data.
- Use text only for fields requiring full-text search (e.g., product descriptions).
- Use date for timestamps—never store as strings.
- Use ip for IP addresses to enable range queries.
- Use nested for arrays of objects that need independent querying (e.g., product variants).
Avoid dynamic mapping in production. Always define mappings explicitly to prevent schema drift.
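As an illustration of the less common types above, a hypothetical mapping with ip and nested fields might look like this (the index and field names are made up for the example):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="example-events", mappings={
    "properties": {
        "client_ip": {"type": "ip"},   # enables IP range and CIDR queries
        "variants": {                  # objects that must be queried independently
            "type": "nested",
            "properties": {
                "sku": {"type": "keyword"},
                "price": {"type": "float"}
            }
        }
    }
})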
Optimize for Bulk Operations
When indexing large volumes of data:
- Batch documents into requests of 5–15 MB.
- Use the Bulk API instead of individual index calls.
- Disable refresh intervals during bulk ingestion.
- Use multiple concurrent bulk threads if your cluster has sufficient resources.
- Monitor heap usage and avoid overwhelming nodes.
Tools like Logstash or Filebeat are optimized for high-volume ingestion and should be preferred over custom scripts for log data.
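If you do script the ingestion yourself, the parallel bulk helper spreads chunks across threads. A sketch (thread_count, chunk_size, and the generated documents are illustrative; size them to your cluster):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def generate_actions():
    # Yield documents lazily so the full dataset never sits in memory
    for i in range(100_000):
        yield {"_index": "products", "_source": {"name": f"Item {i}", "price": 9.99}}

# Four worker threads, each sending chunks of 2,000 documents
for ok, item in helpers.parallel_bulk(es, generate_actions(), thread_count=4, chunk_size=2000):
    if not ok:
        print("Failed:", item)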
Manage Index Lifecycle
Indexes grow over time. Implement an Index Lifecycle Management (ILM) policy to automate rollover, shrink, and delete operations:
- Hot phase: Active indexing and querying (high-performance nodes).
- Warm phase: Read-only, lower-cost storage.
- Cold phase: Archived, rarely accessed data.
- Delete phase: Automatic removal after retention period.
ILM policies reduce storage costs and maintain performance by moving older data to cheaper hardware.
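A policy can be created through the ILM API. A sketch of a simple hot-to-delete policy (the policy name and thresholds are illustrative):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Roll over hot indexes at 50 GB or one day; delete data after 30 days
es.ilm.put_lifecycle(name="logs-30d", policy={
    "phases": {
        "hot": {"actions": {"rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}}},
        "delete": {"min_age": "30d", "actions": {"delete": {}}}
    }
})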
Enable Replication Strategically
Replicas improve search performance and availability but consume additional disk and memory. For write-heavy indexes, use fewer replicas (e.g., 1) during ingestion. Increase replicas (e.g., 2) after indexing completes for better query scalability.
Keep number_of_replicas at or below the number of data nodes minus one; each copy of a shard must live on a different node, so higher values leave replicas unassigned.
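Scaling replicas up after ingestion is a single settings call (a sketch):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Ingest with one replica, then add a second for query throughput
es.indices.put_settings(index="products", settings={"index": {"number_of_replicas": 2}})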
Secure Your Indexes
Apply security controls:
- Use Elasticsearch’s built-in security (X-Pack) to restrict index access by role.
- Encrypt data at rest and in transit using TLS.
- Limit indexing permissions to trusted services or users.
- Audit index creation and modification via Elasticsearch logs.
Monitor and Alert
Set up monitoring for:
- Indexing rate (docs/sec)
- Latency of bulk requests
- Shard failures and unassigned shards
- Heap memory usage
- Disk space utilization
Use tools like Prometheus + Grafana or Elastic Observability to visualize metrics and trigger alerts for anomalies.
Tools and Resources
Elasticsearch REST API
The primary interface for indexing. All operations (create, update, delete, bulk) are performed via HTTP requests. Documentation is available at elastic.co/guide.
Kibana Dev Tools
Kibana’s Dev Tools console provides a web-based interface to interact with Elasticsearch. It supports syntax highlighting, autocomplete, and real-time response visualization. Ideal for testing mappings, queries, and bulk operations.
Logstash
A server-side data processing pipeline that ingests data from multiple sources, transforms it, and sends it to Elasticsearch. Useful for logs, metrics, and structured data from databases or files.
Filebeat
A lightweight shipper for forwarding and centralizing log files. Integrates seamlessly with Logstash and Elasticsearch for real-time log indexing.
Python Elasticsearch Client
A Python library for interacting with Elasticsearch. Example:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "name": "Python Book",
    "price": 45.99,
    "category": "Books",
    "in_stock": True
}

es.index(index="products", document=doc)
Install via pip: pip install elasticsearch
Java High-Level REST Client (Deprecated) / Java API Client
For Java applications, use the new Java API Client (replacing the deprecated High-Level Client). It provides type-safe, async, and reactive interfaces.
Apache NiFi
A dataflow automation tool that can ingest, route, and transform data before sending it to Elasticsearch. Useful for complex data pipelines involving multiple sources and transformations.
OpenSearch
An open-source fork of Elasticsearch with similar APIs. If you prefer community-driven development or need to avoid Elastic’s licensing changes, OpenSearch is a viable alternative.
Postman and cURL
Essential for manual testing and scripting. cURL is lightweight and available on all systems. Postman offers GUI-based request building and environment management.
Elastic Cloud
Elastic’s fully managed service. Eliminates infrastructure management and provides auto-scaling, backups, and monitoring out of the box. Ideal for teams without dedicated DevOps resources.
Documentation and Community
- Official Elasticsearch Documentation
- Elastic Community Forum
- Stack Overflow (elasticsearch tag)
- GitHub Repository
Real Examples
Example 1: E-Commerce Product Catalog
Scenario: Index 10,000 products from a CSV file into Elasticsearch.
Step 1: Define the index mapping:
PUT /products
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "english" },
      "description": { "type": "text", "analyzer": "english" },
      "price": { "type": "float" },
      "category": { "type": "keyword" },
      "brand": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "tags": { "type": "keyword" },
      "created_at": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
    }
  }
}
Step 2: Convert CSV to NDJSON using Python:
import csv
import json
with open('products.csv', 'r') as f:
    reader = csv.DictReader(f)
    bulk_data = []
    for row in reader:
        # Metadata line: the target document ID
        bulk_data.append(json.dumps({"index": {"_id": row['product_id']}}))
        # Source line: the document body, with fields coerced to the mapped types
        bulk_data.append(json.dumps({
            "name": row['name'],
            "description": row['description'],
            "price": float(row['price']),
            "category": row['category'],
            "brand": row['brand'],
            "in_stock": row['in_stock'] == 'true',
            "tags": row['tags'].split(','),
            "created_at": row['created_at']
        }))

with open('products_bulk.ndjson', 'w') as f:
    # The Bulk API requires a trailing newline after the last line
    f.write('\n'.join(bulk_data) + '\n')
Step 3: Bulk index using cURL:
curl -X POST "localhost:9200/products/_bulk" \
-H "Content-Type: application/json" \
--data-binary "@products_bulk.ndjson"
Step 4: Verify indexing:
GET /products/_count
GET /products/_search?q=wireless&pretty
Example 2: Application Log Indexing
Scenario: Ingest application logs from a Spring Boot app into Elasticsearch.
Use Filebeat to tail the log file (application.log):
# filebeat.yml
filebeat.inputs:
  - type: filestream
    enabled: true
    paths:
      - /var/log/myapp/application.log

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "app-logs-%{+yyyy.MM.dd}"

processors:
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true
Filebeat automatically creates daily indexes (e.g., app-logs-2024.06.15) and parses JSON logs. Use ILM to retain logs for 30 days and then delete.
Example 3: Real-Time Sensor Data
Scenario: Index temperature readings from IoT devices every 5 seconds.
Use a lightweight Python script with the Elasticsearch client:
import time
import random
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

while True:
    doc = {
        "device_id": f"sensor-{random.randint(1, 100)}",
        "temperature": round(random.uniform(20.0, 35.0), 2),
        "humidity": random.randint(30, 90),
        # Use UTC so the trailing "Z" is accurate
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    }
    es.index(index="sensor-readings", document=doc)
    print(f"Indexed: {doc}")
    time.sleep(5)
Use an ILM policy to roll over the index daily and keep only 7 days of data.
FAQs
What is the difference between indexing and searching in Elasticsearch?
Indexing is the process of storing and organizing documents so they can be retrieved efficiently. Searching is the act of querying those indexed documents to find matches based on criteria like keywords, filters, or ranges. Indexing happens once (or periodically); searching happens repeatedly.
Can I change the mapping of an existing index?
Existing field mappings cannot be changed after an index is created; you can add new fields to a mapping, but not alter or remove existing ones. To change an existing field's mapping, create a new index with the desired schema and reindex the data using the Reindex API:
POST /_reindex
{
  "source": { "index": "products_old" },
  "dest": { "index": "products_new" }
}
How do I handle large datasets that won’t fit in memory?
Use the Bulk API with small batches (1,000–5,000 docs per request). Stream data from disk or a database using a producer-consumer pattern. Tools like Logstash or custom Python scripts with generators can process data in chunks without loading everything into memory.
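The streaming bulk helper pairs naturally with a generator, keeping only one chunk in memory at a time. A sketch that streams an NDJSON file (the file path is illustrative):
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def read_docs(path):
    # Yield one document per line; nothing is buffered beyond the current chunk
    with open(path) as f:
        for line in f:
            yield {"_index": "products", "_source": json.loads(line)}

for ok, item in helpers.streaming_bulk(es, read_docs("products.ndjson"), chunk_size=2000):
    if not ok:
        print("Failed:", item)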
Does indexing affect search performance?
Yes. Heavy indexing can consume CPU, memory, and I/O, which may slow down concurrent searches. To minimize impact, schedule bulk indexing during off-peak hours, use separate nodes for ingestion, and monitor cluster health.
What happens if I index a document with the same ID twice?
Elasticsearch will update the existing document and increment its version number. The old version is marked for deletion and removed during segment merging. Use create instead of index if you want to prevent overwrites.
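With the Python client, create raises a ConflictError instead of silently overwriting (a sketch):
from elasticsearch import ConflictError, Elasticsearch

es = Elasticsearch("http://localhost:9200")

try:
    # Succeeds only if document 1 does not exist yet
    es.create(index="products", id=1, document={"name": "Wireless Headphones"})
except ConflictError:
    print("Document 1 already exists; not overwritten")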
Is it better to use one large index or many small indexes?
It depends. Use one index for related data with similar query patterns. Use multiple indexes for time-series data (e.g., daily logs) or when you need different retention policies. Too many small indexes can overwhelm the cluster’s metadata management.
How can I index data from a relational database?
Use Logstash with the JDBC input plugin, or write a script that queries the database and sends results to Elasticsearch via the Bulk API. For CDC (Change Data Capture), use tools like Debezium to stream database changes in real time.
What is the maximum size of a document in Elasticsearch?
By default, Elasticsearch's HTTP layer rejects request bodies larger than 100 MB, which effectively caps document size. This limit is controlled by the http.max_content_length setting and can be raised, but large documents are discouraged due to memory and performance implications.
How do I know if my index is optimized?
Check for:
- Low segment count (use GET /_cat/segments)
- High indexing rate with low latency
- Low refresh frequency
- Minimal shard failures
- Efficient query response times
Run the force merge API (the successor to the old optimize API) to reduce segments; only force-merge indexes that are no longer being written to:
POST /products/_forcemerge?max_num_segments=1
Can I index data without creating an index first?
Yes. Elasticsearch will create an index automatically using dynamic mapping. However, this is not recommended for production due to unpredictable field types and lack of control over settings.
Conclusion
Indexing data in Elasticsearch is a foundational skill for anyone working with search, analytics, or logging systems. From defining precise mappings to executing bulk operations and managing index lifecycles, each step plays a critical role in ensuring data is stored efficiently and retrieved quickly. The examples and best practices outlined in this guide provide a robust framework for implementing indexing strategies that scale with your data and meet performance expectations.
Remember: indexing is not a one-time task—it’s an ongoing process that requires monitoring, optimization, and adaptation. As your data grows and query patterns evolve, revisit your mappings, refresh intervals, and shard configurations. Leverage tools like Kibana, Logstash, and ILM to automate routine tasks and reduce operational overhead.
By following the principles in this guide—explicit mapping, bulk ingestion, strategic replication, and proactive monitoring—you’ll build Elasticsearch indexes that are fast, reliable, and maintainable. Whether you’re indexing product catalogs, application logs, or sensor data, mastering indexing transforms Elasticsearch from a tool into a powerful data engine that drives real business value.