02 - Getting Started

Node

A node is a running instance of Elasticsearch, which can run on a physical or virtual machine. Nodes are where the data resides in Elasticsearch. In production, each VM or container is typically a dedicated node.

Node Roles

A node can have one or more roles.

Master Node - Makes the node eligible to be elected master, which is responsible for performing cluster-wide actions. This mainly includes creating and deleting indices. A node with this role will not automatically become the master node (unless there are no other master-eligible nodes).

node.master: true | false

Data Node - This role enables a node to store data and to perform queries on that data.

node.data: true | false

Ingest Node - Enables a node to run ingest pipelines. Ingest pipelines are a series of steps that are performed when indexing documents.

node.ingest: true | false

Coordination Role - Coordination refers to the distribution of queries and the aggregation of results.
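As a rough illustration of how the three role flags above combine, the sketch below derives a node's effective roles from a settings dictionary. The helper name effective_roles is hypothetical, and the sketch assumes that each flag counts as true when omitted and that a node with every flag disabled still coordinates requests.

```python
# Role settings as listed in these notes. Treating an omitted flag as true
# is an assumption of this sketch.
ROLE_SETTINGS = ("node.master", "node.data", "node.ingest")


def effective_roles(settings: dict) -> list:
    """Return the roles a node would hold under the given settings."""
    roles = [
        name.split(".", 1)[1]           # "node.data" -> "data"
        for name in ROLE_SETTINGS
        if settings.get(name, True)
    ]
    # With every role disabled, the node only distributes queries and
    # aggregates results, i.e. it acts as a coordinating-only node.
    return roles or ["coordinating-only"]


if __name__ == "__main__":
    print(effective_roles({}))
    print(effective_roles(
        {"node.master": False, "node.data": False, "node.ingest": False}
    ))
```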

Cluster

When a node starts up, it either creates its own cluster or joins an existing one. An Elasticsearch node is always part of a cluster. An Elasticsearch cluster is a collection of nodes, which are responsible for storing data.

Document

Each unit of data you store in your cluster is called a document. Documents are JSON objects containing fields and values. When you index a document, the original JSON object that you sent to Elasticsearch is stored along with some metadata that Elasticsearch uses internally.
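To make the relationship between the original JSON object and its metadata concrete, here is a minimal sketch that wraps a document the way a retrieved document is presented. The field names in source, the index name "products", and the helper as_stored are hypothetical, and only a simplified subset of the metadata fields is shown.

```python
import json

# Hypothetical document; the field names are illustrative only.
source = {"name": "Coffee Maker", "price": 64, "in_stock": 10}


def as_stored(index: str, doc_id: str, source: dict, version: int = 1) -> dict:
    """Wrap a JSON document together with a simplified set of metadata."""
    return {
        "_index": index,       # index the document belongs to
        "_id": doc_id,         # identifier of the document within the index
        "_version": version,   # version number of the document
        "_source": source,     # the original JSON object, unchanged
    }


stored = as_stored("products", "1", source)
print(json.dumps(stored, indent=2))
```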

Indexes

Every document in Elasticsearch is stored within an index. An index groups documents together logically and provides configuration options related to scalability and availability.

Sharding

Sharding is a way to subdivide indices into smaller pieces. Each piece is referred to as a shard. Sharding is configured at the index level. In terms of storage, each shard is an independent index: each shard is an Apache Lucene index.

The main purpose of sharding is to scale the data volume horizontally so that an index can store more documents. Sharding also improves performance: queries can run in parallel across shards, which increases the throughput of an index. An index defaults to 1 shard.
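One way to picture how documents end up spread across shards is to hash a routing value (by default the document ID) and take it modulo the number of primary shards. The sketch below is a simplification under stated assumptions: zlib.crc32 is a stand-in hash, the helper pick_shard and the document IDs are hypothetical, and the formula Elasticsearch actually applies is somewhat more involved.

```python
import zlib
from collections import Counter


def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number in [0, num_primary_shards)."""
    # zlib.crc32 is a stand-in hash for this sketch, not the hash
    # function Elasticsearch uses internally.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# Distribute 1,000 hypothetical document IDs over 4 shards and count
# how many documents land on each shard.
counts = Counter(pick_shard(f"doc-{i}", 4) for i in range(1000))
print(sorted(counts.items()))
```

Because the shard number depends on the shard count, the same document would map to a different shard if the count changed, which is one reason the number of shards is fixed when the index is created.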

How many Shards are Optimal?

It depends on various factors:

  • Number of nodes and their capacity
  • Number of indices and their size
  • Number of queries

Replication

Elasticsearch supports replication for fault tolerance. Replication is configured at the index level. Replication works by creating copies of shards, referred to as replica shards. A shard that has been replicated is called a primary shard. A primary shard and its replica shards are referred to as a replication group. A replica shard is a complete copy of its primary shard and can serve search requests exactly like the primary. Replication increases the availability of data if nodes go down.

Replication can also help increase throughput. The shards of a replication group can serve different search requests simultaneously, which increases the number of requests that can be handled at the same time. Elasticsearch intelligently routes each request to the best available shard. When a node stores multiple shards, CPU parallelization lets it search them at the same time, which further improves performance.

By default, an index is created with 1 primary shard and 1 replica shard.
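The defaults above correspond to the index settings number_of_shards and number_of_replicas. As a small sketch, the helpers below (index_settings and total_shards, both hypothetical names) build a settings body and count how many shard copies it implies; nothing is sent to a cluster.

```python
def index_settings(primaries: int = 1, replicas: int = 1) -> dict:
    """Build an index settings body with explicit shard counts."""
    return {
        "settings": {
            "number_of_shards": primaries,     # primary shards
            "number_of_replicas": replicas,    # replica copies per primary
        }
    }


def total_shards(body: dict) -> int:
    """Total shard copies: each primary plus all of its replicas."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


print(total_shards(index_settings()))      # defaults: 1 primary + 1 replica -> 2
print(total_shards(index_settings(3, 2)))  # 3 primaries, 2 replicas each -> 9
```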

Choosing the Number of Replica Shards

  • Replicate shards once if data loss is not a disaster
  • For critical systems, data should be replicated at least twice
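The guidance above can be checked with a little arithmetic. The sketch below assumes that copies of the same shard are placed on different nodes, so a replica only adds protection when there is a spare node to hold it. The helper name replication_plan is hypothetical.

```python
def replication_plan(replicas: int, nodes: int) -> dict:
    """Estimate how a replication group fares on a cluster of a given size."""
    if replicas < 0 or nodes < 1:
        raise ValueError("replicas must be >= 0 and nodes must be >= 1")
    # Assumption: one copy per node, and the primary already occupies a node.
    assigned = min(replicas, nodes - 1)
    return {
        "assigned_replicas": assigned,
        "unassigned_replicas": replicas - assigned,
        # Every assigned replica is a full copy, so the group can lose that
        # many nodes and still have one copy of the data left.
        "node_failures_tolerated": assigned,
    }


print(replication_plan(replicas=1, nodes=1))  # a single node cannot hold the replica
print(replication_plan(replicas=2, nodes=3))  # critical system: survives two node losses
```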

Snapshots

Snapshots can be used to restore data to a given point in time. Snapshots can be taken at the index level or for the entire cluster. Snapshots are commonly used for daily backups, whereas replication ensures that indices can recover from node failure.
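For a daily backup, a snapshot request could be assembled roughly as follows. This sketch only builds the HTTP method, path, and body and does not contact a cluster. The repository name "my_backups", the "daily-" naming scheme, and the helper snapshot_request are assumptions for illustration, and a snapshot repository would need to be registered beforehand.

```python
from datetime import date
from typing import Optional, Sequence, Tuple


def snapshot_request(
    repository: str,
    day: date,
    indices: Optional[Sequence[str]] = None,
) -> Tuple[str, str, dict]:
    """Build the method, path, and body of a create-snapshot request."""
    # Hypothetical naming scheme: one snapshot per calendar day.
    path = f"/_snapshot/{repository}/daily-{day:%Y.%m.%d}"
    body: dict = {}
    if indices:
        # Index-level snapshot: restrict it to the named indices and
        # leave out the cluster-wide state.
        body["indices"] = ",".join(indices)
        body["include_global_state"] = False
    # An empty body leaves the scope to the API defaults (assumed here to
    # cover the whole cluster).
    return "PUT", path, body


print(snapshot_request("my_backups", date(2024, 1, 15), ["products"]))
print(snapshot_request("my_backups", date(2024, 1, 15)))
```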