Guide to Refresh and Flush Operations in Elasticsearch

In this guide we explore the Refresh and Flush operations in Elasticsearch and clarify the differences between the two. We also cover the underlying Lucene functionality, namely reopen and commit, which helps in understanding the refresh and flush operations.

Refresh and Flush

At first glance, the Refresh and Flush operations seem to serve the same purpose: both are associated with making documents available for search. When new documents are added to Elasticsearch, we can call either the "refresh" or the "flush" operation on the index. To understand how the two differ, you must be familiar with segments, reopen, and commits in Lucene, the underlying engine of Elasticsearch.

Segments in Lucene

In Elasticsearch, the most basic unit of data storage is the shard. Looking through the Lucene lens, things are a bit different. Each Elasticsearch shard is a Lucene index, and each Lucene index consists of several Lucene segments. A segment is a self-contained inverted index that maps terms to the documents containing those terms.

The concept of segments, and how it applies to an Elasticsearch index and its shards, is shown in the diagram below:

lucene1.png#asset:1574

The idea behind this segmentation is that whenever new documents are created, they are written to a new segment, so there is no need to modify the previous segments. If a document has to be deleted, it is only flagged as deleted in its original segment. This means it is not physically removed from that segment.

The same applies to updates: the previous version of the document is marked as deleted in its original segment, and the updated version is stored under the same document id in the current segment.
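The append-only behaviour described above can be sketched with a toy model. This is a simplified illustration and not Lucene's actual data structures: the `Segment` and `ToyIndex` classes and their methods are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for a Lucene segment (hypothetical, for illustration)."""
    docs: dict                                  # doc_id -> document body
    deleted: set = field(default_factory=set)   # ids flagged as deleted


class ToyIndex:
    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        # New or updated documents always land in a brand-new segment.
        for doc_id in docs:
            self._flag_deleted(doc_id)   # an update supersedes older copies
        self.segments.append(Segment(dict(docs)))

    def delete(self, doc_id):
        self._flag_deleted(doc_id)

    def _flag_deleted(self, doc_id):
        # Existing segment contents are never rewritten; the id is only flagged.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def get(self, doc_id):
        # Return the newest copy that has not been flagged as deleted.
        for seg in reversed(self.segments):
            if doc_id in seg.docs and doc_id not in seg.deleted:
                return seg.docs[doc_id]
        return None


index = ToyIndex()
index.add_segment({"1": "first draft", "2": "other doc"})
index.add_segment({"1": "second draft"})   # update of document 1
index.delete("2")
```

After these calls, `index.get("1")` returns the updated body and `index.get("2")` returns `None`, while the first segment still physically holds both original documents.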

Lucene Reopen

Lucene Reopen, when called, makes the accumulated data available for search. Although the latest data becomes searchable, this does not guarantee that the data is persisted, that is, written to disk. We can call reopen any number of times to make the latest data searchable, but we cannot be sure that the data is present on disk.

Commits in Lucene

Lucene commits make the data safe. On each commit, the segments are written and synced to disk, making the data persistent. Although a committed state is the ideal state for our data, each commit operation is expensive in terms of resources, because it involves its own disk I/O. This is exactly why Lucene-based systems prefer to call reopen again and again to make new data searchable, and commit far less often.

Translog

Elasticsearch addresses the issue of persistence in a different way. It introduces a translog (transaction log) in every shard. Newly indexed documents are passed both to this transaction log and to an in-memory buffer. This process is shown in the figure below:

lucene2.png#asset:1575

Refresh in Elasticsearch

The "refresh" operation runs every second by default. On a refresh, the contents of the in-memory buffer are written to a newly created in-memory segment, as shown in the diagram below. This makes the new data available for search.

lucene3.png#asset:1576
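As a minimal sketch of how a refresh can be requested or tuned over the REST API, the snippet below only builds the HTTP requests and does not send them. The `_refresh` endpoint and the `index.refresh_interval` setting are part of Elasticsearch; the node URL, the index name `my-index`, and the helper function names are assumptions made for this example.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"   # assumption: a local, unsecured node
INDEX = "my-index"                 # hypothetical index name


def refresh_request(index):
    # POST /<index>/_refresh asks Elasticsearch to refresh right away.
    return Request(f"{ES_URL}/{index}/_refresh", method="POST")


def refresh_interval_request(index, interval):
    # PUT /<index>/_settings changes how often automatic refreshes run.
    body = json.dumps({"index": {"refresh_interval": interval}}).encode()
    return Request(
        f"{ES_URL}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


manual = refresh_request(INDEX)
slower = refresh_interval_request(INDEX, "30s")
# To send one against a running cluster: urllib.request.urlopen(manual)
```

Raising the interval, for example to 30s during heavy bulk indexing, trades search freshness for indexing throughput, since fewer small segments are created.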

Translog and Persistence

How does the translog work around the problem of persistence? The translog is kept on each shard and is written to the physical disk. Even if there is a failure, the data in the translog survives at the shard level. When the node restarts, Elasticsearch is able to recover the data from the translog.

The translog is synced to disk either at a set interval or on completion of a successful Index, Bulk, Delete, or Update request.
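The two sync modes correspond to the index settings `index.translog.durability` (`request` or `async`) and `index.translog.sync_interval`. The sketch below only builds a settings body that could be sent to the index settings API. The helper name `translog_settings` is hypothetical, and whether each setting can be changed on a live index depends on the Elasticsearch version, so check the documentation for yours.

```python
import json

VALID_DURABILITY = ("request", "async")


def translog_settings(durability="request", sync_interval="5s"):
    """Build a settings body for translog syncing (hypothetical helper)."""
    if durability not in VALID_DURABILITY:
        raise ValueError(f"durability must be one of {VALID_DURABILITY}")
    return {
        # "request": sync the translog before acknowledging each request.
        # "async": sync in the background every sync_interval.
        "index.translog.durability": durability,
        "index.translog.sync_interval": sync_interval,
    }


safe = translog_settings()
fast = translog_settings(durability="async", sync_interval="10s")
payload = json.dumps(fast)   # body for PUT /<index>/_settings
```

With `async`, operations acknowledged since the last sync can be lost if the node crashes, which is the price paid for fewer disk syncs.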

Flush in Elasticsearch

Flush essentially means that all the documents in the in-memory buffer are written to new segments, as in a refresh. These, along with all the existing in-memory segments, are then committed to disk, after which the translog is cleared, as shown in the diagram below. This commit is a Lucene commit.

lucene4.png#asset:1577

A flush is triggered either periodically or whenever the translog reaches a specific size. These settings keep the cost of Lucene commits under control.
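To tie the pieces together, here is a toy simulation of a single shard with an in-memory buffer, a translog, a refresh, and a flush. It is a deliberately simplified sketch and not how Elasticsearch is implemented: the `ToyShard` class is hypothetical, and the flush threshold counts operations, whereas Elasticsearch measures translog size in bytes (the `index.translog.flush_threshold_size` setting).

```python
class ToyShard:
    """Toy model of the buffer / translog / segment flow (illustration only)."""

    def __init__(self, flush_threshold=3):
        self.buffer = []            # in-memory indexing buffer, not searchable
        self.translog = []          # stand-in for the on-disk transaction log
        self.memory_segments = []   # searchable, not yet committed
        self.disk_segments = []     # searchable and committed
        self.flush_threshold = flush_threshold

    def index(self, doc):
        # Every operation goes to both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)
        if len(self.translog) >= self.flush_threshold:
            self.flush()

    def refresh(self):
        # Buffer contents become a new searchable segment; translog is kept.
        if self.buffer:
            self.memory_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Empty the buffer, commit all segments, then clear the translog.
        self.refresh()
        self.disk_segments.extend(self.memory_segments)
        self.memory_segments.clear()
        self.translog.clear()

    def search(self):
        segments = self.disk_segments + self.memory_segments
        return [doc for seg in segments for doc in seg]


shard = ToyShard(flush_threshold=3)
shard.index("a")
visible_before_refresh = shard.search()          # buffer is not searchable
shard.refresh()
visible_after_refresh = shard.search()           # "a" is now searchable
translog_after_refresh = list(shard.translog)    # refresh keeps the translog
shard.index("b")
shard.index("c")                                 # threshold reached, flush runs
```

In this model a refresh changes what `search()` returns but leaves the translog alone, while a flush moves segments to the committed list and empties the translog.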

Conclusion

In this guide we explored two closely related operations, flush and refresh. We also touched upon the underlying Lucene functionality, namely reopen and commit, which helps in understanding the refresh and flush operations.

In short, refresh is used to make new documents visible to search. Flush is used to persist in-memory segments to disk. Flush does not affect the visibility of documents in Elasticsearch, because search already runs against the in-memory segments. Refresh is what affects the visibility of indexed documents.