A node is a running instance of Elasticsearch, which can run on a physical or virtual machine. Nodes are where the data resides in Elasticsearch. In production, each VM or container will be a dedicated node.
A node can have one or more roles.
Master Node - Responsible for performing cluster-wide actions, mainly creating and deleting indices. A node with this role will not automatically become the master node (unless there are no other master-eligible nodes).
node.master: true | false
Data Node - This role enables a node to store data and to perform queries on that data.
node.data: true | false
Ingest Node - Enables a node to run ingest pipelines. Ingest pipelines are a series of steps that are performed when indexing documents.
node.ingest: true | false
Coordination Role - Coordination refers to the distribution of queries and the aggregation of results.
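As a rough illustration of how the role settings above combine, the following Python sketch derives a node's roles from the three flags. The helper name `roles_from_settings` is hypothetical and not part of Elasticsearch. It assumes that each flag counts as true when omitted, and that a node with all three flags disabled only coordinates requests.

```python
def roles_from_settings(settings):
    """Return the roles implied by the node.* flags (illustrative helper only)."""
    flags = {"node.master": "master", "node.data": "data", "node.ingest": "ingest"}
    # Assumption: a flag that is not set explicitly is treated as true.
    roles = [role for key, role in flags.items() if settings.get(key, True)]
    # Every node can coordinate requests; with no other role it is coordinating-only.
    return roles or ["coordinating-only"]


dedicated_master = {"node.master": True, "node.data": False, "node.ingest": False}
coordinating_only = {"node.master": False, "node.data": False, "node.ingest": False}

print(roles_from_settings(dedicated_master))
print(roles_from_settings(coordinating_only))
```

A dedicated master-eligible node, for example, would set `node.master: true` and switch the other two flags off.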
When a node starts up, it will either create its own cluster or join an existing one. An Elasticsearch node will always be part of a cluster. An Elasticsearch cluster is a collection of nodes, which are responsible for storing data.
Each unit of data you store in your cluster is called a document. Documents are JSON objects containing fields and values. When you index a document, the original JSON object that you sent to Elasticsearch is stored along with some metadata that Elasticsearch uses internally.
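To make the relationship between the original JSON object and its metadata concrete, here is a small Python sketch of the simplified shape of a stored document. The index name `products`, the ID `1` and the field names are made up for illustration, and real responses carry additional metadata fields.

```python
import json

# The JSON object the client sends for indexing (hypothetical example data).
original = {"name": "Coffee Maker", "price": 64, "in_stock": 10}

# Simplified shape of a stored document: the original object sits under
# _source, next to metadata fields that Elasticsearch maintains.
stored = {
    "_index": "products",  # hypothetical index name
    "_id": "1",            # hypothetical document ID
    "_version": 1,
    "_source": original,
}

print(json.dumps(stored, indent=2))
```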
Every document within Elasticsearch is stored within an index. An index groups documents together logically and provides configuration options related to scalability and availability.
Sharding is a way to sub-divide indices into smaller pieces. Each piece is referred to as a shard. Sharding is configured at the index level. In terms of storage, each shard is an independent index: each shard is an Apache Lucene index.
The main purpose of sharding is to scale the data volume horizontally so that more documents can be stored. Sharding also improves performance: queries can be parallelized across shards, which increases the throughput of an index. By default, an index has 1 shard.
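The way documents spread across shards can be sketched in a few lines of Python. This is a simplification under stated assumptions, not Elasticsearch's actual routing code: `pick_shard` is a hypothetical helper, CRC32 stands in for the hash function Elasticsearch really uses, and the document ID is used as the routing value.

```python
import zlib
from collections import Counter


def pick_shard(doc_id: str, number_of_shards: int) -> int:
    """Map a document ID to a shard number (simplified illustration)."""
    # Assumption: CRC32 is only a stand-in for the real hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# Spread 9000 made-up document IDs over an index with 3 shards.
counts = Counter(pick_shard(f"doc-{i}", 3) for i in range(9000))
print(dict(counts))
```

Because the same ID always hashes to the same shard, a document can be found again without searching every shard, while different IDs end up on different shards.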
How many shards an index should have depends on various factors.
Elasticsearch supports replication for fault tolerance. Replication is configured at the index level. Replication works by creating copies of shards, referred to as replica shards. A shard that has been replicated is called a primary shard. A primary shard and its replica shards are referred to as a replication group. Replica shards are a complete copy of a shard. A replica shard can serve search requests, exactly like its primary shard. Replication helps increase the availability of data if nodes go down.
Replication can also help increase throughput. The replica shards of a replication group can serve different search requests simultaneously, which increases the number of requests that can be handled at the same time. Elasticsearch intelligently routes requests to the best shards. CPU parallelization improves performance if multiple replica shards are stored on the same node.
By default, an index is created with 1 primary shard and 1 replica shard.
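The shard and replica counts are ordinary index settings. As a minimal sketch, the Python below builds the body of a create-index request with `number_of_shards` and `number_of_replicas`, and counts how many shards result in total. The helper names `index_settings` and `total_shards` are hypothetical, and nothing is sent to a cluster.

```python
def index_settings(number_of_shards=1, number_of_replicas=1):
    """Build the body for a create-index request (PUT /<index-name>)."""
    return {
        "settings": {
            "number_of_shards": number_of_shards,      # primary shards
            "number_of_replicas": number_of_replicas,  # replicas per primary shard
        }
    }


def total_shards(body):
    """Count all shards: each primary plus its replicas is one replication group."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


defaults = index_settings()
print(total_shards(defaults))              # 1 primary + 1 replica = 2
print(total_shards(index_settings(2, 2)))  # 2 groups of 3 copies each = 6
```

Note that the replica count applies per primary shard, so raising either setting multiplies the total number of shards the cluster has to host.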
Snapshots can be used to restore data to a given point in time. Snapshots can be taken at the index level or for the entire cluster. Snapshots are commonly used for daily backups, whereas replication ensures that indices can recover from node failures.
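The difference between an index-level and a cluster-wide snapshot shows up in the request itself. The Python sketch below only describes the REST requests and does not send them. The helper names, the repository name `backups` and the snapshot and index names are made up for illustration, and the repository is assumed to have been registered beforehand.

```python
def snapshot_request(repository, snapshot, indices=None):
    """Describe a create-snapshot request; nothing is sent over the network."""
    request = {"method": "PUT", "path": f"/_snapshot/{repository}/{snapshot}"}
    if indices:
        # Restrict the snapshot to the listed indices; without a body,
        # the snapshot is not limited to specific indices.
        request["body"] = {"indices": ",".join(indices)}
    return request


def restore_request(repository, snapshot):
    """Describe the request that restores a previously taken snapshot."""
    return {"method": "POST", "path": f"/_snapshot/{repository}/{snapshot}/_restore"}


print(snapshot_request("backups", "daily-2024-01-01"))
print(snapshot_request("backups", "products-only", indices=["products"]))
print(restore_request("backups", "daily-2024-01-01"))
```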