
Time-Based Indexing in Elasticsearch Using Java


Anybody who uses Elasticsearch for indexing time-based data, such as log
events, is accustomed to the index-per-day pattern: the index name is
derived from the timestamp of the logging event, rounded to the nearest
day, and new indices pop into existence as soon as they are required. It's a
classic use case.
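
As a quick illustration, here is a minimal sketch of how an application might derive such an index name from an event timestamp. The "logs-" prefix and the date pattern are assumptions for the example; any stable convention works.

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DailyIndexName {

    // Hypothetical naming convention: "logs-" plus the event day in UTC.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy.MM.dd").withZone(ZoneOffset.UTC);

    // Rounds the event timestamp down to the day, e.g. "logs-2009.11.15".
    static String indexFor(Instant eventTimestamp) {
        return "logs-" + DAY.format(eventTimestamp);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(Instant.parse("2009-11-15T14:12:12Z")));
    }
}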

Need for Time-Based Indexing


Most traditional use cases for search engines involve a relatively static
collection of documents that grows slowly. Searches look for the most
relevant documents, regardless of when they were created.

Time-based data such as log events behaves differently. The number of
documents in the index grows rapidly, often accelerating with time.
Documents are almost never updated, and searches mostly target the most
recent documents. As documents age, they lose value.

If we were to keep one big index for documents of this type, we would soon
run out of space; logging events just keep on coming without pause or
interruption. We could delete the old events with a scroll query and a bulk
delete, but this approach is very inefficient. When you delete a document,
it is only marked as deleted; it is not physically removed until the
segment containing it is merged away.

Purging old data with time-based indexing is easy: just delete the old indices.
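For example, dropping an entire day of data becomes a single, cheap request (the index name below is hypothetical):

DELETE localhost:9200/logs-2019.03.21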

Rollover API
Elasticsearch provides support for time-based indexing through its Rollover
API. It is offered in two forms that I found particularly interesting:

1. REST-based APIs
2. Java APIs
For testing and playing around with how rollover actually works, the REST
endpoint is the easiest to set up and run. We will cover both approaches in
this blog.
The Rollover API follows the rollover pattern, which essentially works as
follows:

- There is one alias used for indexing that points to the active index.
- Another alias points to both active and inactive indices and is used for
searching.
- The active index can have as many shards as you have hot nodes, to take
advantage of the indexing resources of all your expensive hardware.
- When the active index is too full or too old, it is rolled over: a new
index is created, and the indexing alias switches atomically from the
old index to the new one.
- The old index can be moved to a cold node and shrunk down to one
shard, which can also be force-merged and compressed. However, this is
not covered in this blog.

REST-Based Method
We're going to create two aliases: logs-search for searches and logs-write
for indexing.
1. First, we create an index template with a search alias. From now on, we
will use this alias, and only this alias, to refer to the indices when
searching.
PUT localhost:9200/_template/logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "aliases": {
    "logs-search": {}
  }
}
2. Next, we create the first index, logs-000001, and attach the write alias
to it. Note that the rollover conditions are not stored on the alias; they
are passed to the rollover endpoint later, in step 4.
PUT localhost:9200/logs-000001
{
  "aliases": {
    "logs-write": {}
  }
}

3. We index some data using the alias (this is the alias, not the actual
index name).
POST localhost:9200/logs-write/_doc/861233345
{
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}
You may see a response similar to this:
{
  "_index": "logs-000001",
  "_type": "_doc",
  "_id": "861233345",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}

4. You need to keep hitting the rollover endpoint; the rollover is
performed only when at least one of the specified conditions is met. When
that happens, a new index named logs-000002 is created, and the write alias
switches to point at this new active index. The Rollover API is smart
enough to detect numeric and date-based naming patterns and increment to
the next value.
POST localhost:9200/logs-write/_rollover
{
  "conditions": {
    "max_age": "5s",
    "max_docs": 5,
    "max_size": "5mb"
  }
}
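
The response tells you whether the index was actually rolled over and which condition triggered it. The shape below follows the documented rollover response format; the values shown are illustrative:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "old_index": "logs-000001",
  "new_index": "logs-000002",
  "rolled_over": true,
  "dry_run": false,
  "conditions": {
    "[max_age: 5s]": false,
    "[max_docs: 5]": true,
    "[max_size: 5mb]": false
  }
}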

5. To verify that the rollover did indeed happen, try writing some new
data to the index (again, using the alias):
POST localhost:9200/logs-write/_doc/1233
The response shows that the document was written to logs-000002, the
rolled-over index:
{
  "_index": "logs-000002",
  "_type": "_doc",
  "_id": "1233",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

6. For searches, however, you would use the search alias, which keeps
pointing to all the logs-* indices because of the index template we defined
in step 1. If we were to use the logs-write alias for searching, it would
point only to the rolled-over index, and we would not get the documents
from the previous indices.
GET localhost:9200/logs-search/_search
{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 20,
    "successful": 20,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 1,
    "hits": [
      {
        "_index": "logs-000001",
        "_type": "_doc",
        "_id": "8611234677862",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "post_date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch"
        }
      },
      {
        "_index": "logs-000002",
        "_type": "_doc",
        "_id": "861123467jahd",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "post_date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch"
        }
      },
      {
        "_index": "logs-000003",
        "_type": "_doc",
        "_id": "861123467jahd",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "post_date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch"
        }
      },
      {
        "_index": "logs-000004",
        "_type": "_doc",
        "_id": "861123467jahd",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "post_date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch"
        }
      },
      {
        "_index": "logs-000001",
        "_type": "_doc",
        "_id": "8611234677",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "post_date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch"
        }
      },
      {
        "_index": "logs-000002",
        "_type": "_doc",
        "_id": "8611234677",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "post_date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch"
        }
      }
    ]
  }
}

As you can see, the search result contains data from different indices
(logs-000001 through logs-000004).
7. Fetching from multiple indices was possible because the logs-search
alias points to multiple indices. To verify this, list the aliases with the
cat aliases API (GET localhost:9200/_cat/aliases?v):

alias        index        filter  routing.index  routing.search
logs-search  logs-000002  -       -              -
logs-write   logs-000002  -       -              -
logs-search  logs-000001  -       -              -

Also, notice that logs-write points to just one index at a time, which is
exactly what we want.

Rollover Java API


For the Java API, refer to the code here.
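
In the meantime, here is a minimal sketch of the same flow using the Java High Level REST Client. This is an assumption-laden example rather than the code linked above: it assumes the 6.x client (in 7.x, RolloverRequest lives under org.elasticsearch.client.indices.rollover), a node on localhost:9200, and the logs-write alias created in the REST walkthrough.

import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.rollover.RolloverRequest;
import org.elasticsearch.action.admin.indices.rollover.RolloverResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;

public class RolloverDemo {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Index through the write alias, never through the concrete index name.
            IndexRequest doc = new IndexRequest("logs-write", "_doc")
                    .source("{\"user\":\"kimchy\",\"post_date\":\"2009-11-15T14:12:12\","
                            + "\"message\":\"trying out Elasticsearch\"}", XContentType.JSON);
            client.index(doc, RequestOptions.DEFAULT);

            // Ask for a rollover; it happens only if at least one condition is met.
            // Passing null as the new index name lets Elasticsearch derive the
            // next one from the pattern (logs-000002, logs-000003, ...).
            RolloverRequest rollover = new RolloverRequest("logs-write", null);
            rollover.addMaxIndexAgeCondition(TimeValue.timeValueSeconds(5));
            rollover.addMaxIndexDocsCondition(5);
            rollover.addMaxIndexSizeCondition(new ByteSizeValue(5, ByteSizeUnit.MB));

            RolloverResponse response =
                    client.indices().rollover(rollover, RequestOptions.DEFAULT);
            System.out.println("old index:   " + response.getOldIndex());
            System.out.println("new index:   " + response.getNewIndex());
            System.out.println("rolled over: " + response.isRolledOver());
        }
    }
}

As with the REST version, calling the rollover periodically is harmless: when no condition matches, isRolledOver() simply returns false and the write alias stays where it is.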
