title: 03 - Managing Documents

Creating and Deleting Indexes

PUT /pages

DELETE /pages

# Specify Index Settings while creating Index
PUT /products
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  }
}

Indexing Documents

POST /products/_doc
{
  "name": "Cell Phone",
  "price": 100,
  "inStock": 10
}

PUT /products/_doc/100
{
  "name": "Toaster",
  "price": 20,
  "inStock": 100
}

GET /products/_doc/100

POST /products/_update/100
{
  "doc": {
    "inStock": 3
  }
}

POST /products/_update/100
{
  "doc": {
    "isAvailable": "true"
  }
}

How the Update API works

  • The current document is retrieved
  • The field values are changed
  • The existing document is replaced with the modified document
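
The three steps above can be sketched in Python against a plain in-memory dictionary. The `store` dictionary and the `partial_update` helper are hypothetical stand-ins, not Elasticsearch APIs; the sketch only illustrates that a partial update amounts to a read, a merge, and a full replacement of the stored document.

```python
import copy

# Hypothetical in-memory stand-in for an index: document ID -> document source.
store = {"100": {"name": "Toaster", "price": 20, "inStock": 100}}


def partial_update(store, doc_id, partial_doc):
    """Mimic a partial update: retrieve, change fields, replace the whole document."""
    current = copy.deepcopy(store[doc_id])  # 1. the current document is retrieved
    current.update(partial_doc)             # 2. the field values are changed (shallow merge, a simplification)
    store[doc_id] = current                 # 3. the existing document is replaced
    return current


updated = partial_update(store, "100", {"inStock": 3})
print(updated)
```

Fields that are not mentioned in the partial document (here `name` and `price`) survive because the merged document, not the partial one, is what gets stored.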

Introduction to routing

  • Routing is the process of resolving a shard for a document
  • A formula is used when indexing, retrieving and updating documents
  • Routing may be customized
  • The default routing strategy distributes documents evenly
  • One of the reasons the number of shards in an index cannot be changed is that the routing formula would then yield a different shard number for existing documents
shard_num = hash(_routing) % num_primary_shards
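
A minimal Python sketch of the formula above, showing why the shard count is fixed. It assumes `zlib.crc32` as a stand-in hash purely because it is deterministic and in the standard library; Elasticsearch uses its own hash function, so the shard numbers printed here will not match a real cluster. The `shard_num` helper and the sample IDs are illustrative.

```python
import zlib


def shard_num(routing, num_primary_shards):
    """shard_num = hash(_routing) % num_primary_shards, with CRC32 as a stand-in hash."""
    # Assumption: CRC32 replaces Elasticsearch's real hash function for illustration only.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# By default the routing value is the document ID.
doc_ids = [str(i) for i in range(1, 11)]
with_two_shards = {doc_id: shard_num(doc_id, 2) for doc_id in doc_ids}
with_three_shards = {doc_id: shard_num(doc_id, 3) for doc_id in doc_ids}

# Documents that would be looked up on a different shard if the count changed from 2 to 3.
moved = [d for d in doc_ids if with_two_shards[d] != with_three_shards[d]]
print(with_two_shards)
print(with_three_shards)
print(moved)
```

Any document listed in `moved` was indexed on one shard but would be searched for on another after the change, which is why the number of primary shards cannot simply be edited on an existing index.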

How Elasticsearch reads data

  • A read request is received and handled by a coordinating node
  • Routing is used to resolve the document’s replication group
  • Adaptive Replica Selection (ARS) is used to send the query to the best available shard copy
    • ARS helps reduce query response times
    • ARS is essentially an intelligent load balancer
  • The coordinating node collects the response and sends it to the client

How Elasticsearch writes data

  • Write operations are sent to primary shards
  • The primary shard forwards the operation to its replica shards
  • Primary terms and sequence numbers are used to recover from failures
  • Global and local checkpoints help speed up the recovery process
  • Primary terms and sequence numbers are available within responses
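
The write path above can be modelled with a small Python sketch. The `ShardCopy` and `ReplicationGroup` classes are hypothetical and heavily simplified (no failover, no in-sync set management); they only illustrate how sequence numbers are assigned by the primary and how local and global checkpoints narrow down which operations a lagging copy has to replay.

```python
class ShardCopy:
    """Hypothetical shard copy that records which sequence numbers it has applied."""

    def __init__(self, name):
        self.name = name
        self.applied = {}  # seq_no -> (primary_term, document)

    def apply(self, seq_no, primary_term, doc):
        self.applied[seq_no] = (primary_term, doc)

    @property
    def local_checkpoint(self):
        # Highest seq_no such that it and every lower seq_no have been applied.
        checkpoint = -1
        while checkpoint + 1 in self.applied:
            checkpoint += 1
        return checkpoint


class ReplicationGroup:
    """Hypothetical replication group: one primary plus its replicas."""

    def __init__(self, replica_names):
        self.primary_term = 1
        self.next_seq_no = 0
        self.primary = ShardCopy("primary")
        self.replicas = [ShardCopy(name) for name in replica_names]

    def write(self, doc, reachable=None):
        """Apply on the primary, then forward to the replicas (all of them by default)."""
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.primary.apply(seq_no, self.primary_term, doc)
        for replica in self.replicas:
            if reachable is None or replica.name in reachable:
                replica.apply(seq_no, self.primary_term, doc)
        # Both values are part of the write response.
        return {"_seq_no": seq_no, "_primary_term": self.primary_term}

    @property
    def global_checkpoint(self):
        # Lowest local checkpoint across all copies in this simplified model.
        copies = [self.primary] + self.replicas
        return min(copy.local_checkpoint for copy in copies)


group = ReplicationGroup(["replica-1", "replica-2"])
first = group.write({"name": "Toaster"})
second = group.write({"name": "Kettle"}, reachable={"replica-1"})  # replica-2 misses this one
print(first, second, group.global_checkpoint)
```

In this model `replica-2` only needs the operations above its local checkpoint replayed, rather than a full copy of the shard, which is the speed-up the checkpoints provide.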

Optimistic Concurrency Control

  • Sending write requests to Elasticsearch concurrently may overwrite changes made by other concurrent processes
  • The _primary_term and _seq_no fields are used for optimistic concurrency control
  • Elasticsearch will reject a write operation if it contains the wrong primary term or sequence number
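
The rejection rule can be illustrated with an in-memory Python sketch. The `Store` class and `VersionConflict` exception are hypothetical; the `if_seq_no` and `if_primary_term` argument names mirror the query parameters Elasticsearch accepts on write requests. Two clients read the same document, and only the first conditional write succeeds.

```python
class VersionConflict(Exception):
    """Raised when the caller's sequence number or primary term is stale."""


class Store:
    """Hypothetical single-shard store tracking _seq_no and _primary_term per document."""

    def __init__(self):
        self.primary_term = 1
        self.next_seq_no = 0
        self.docs = {}

    def get(self, doc_id):
        return dict(self.docs[doc_id])

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        if if_seq_no is not None or if_primary_term is not None:
            current = self.docs.get(doc_id)
            actual = (current["_seq_no"], current["_primary_term"]) if current else None
            if actual != (if_seq_no, if_primary_term):
                raise VersionConflict(f"{doc_id}: document changed since it was read")
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.docs[doc_id] = {
            "_source": dict(source),
            "_seq_no": seq_no,
            "_primary_term": self.primary_term,
        }
        return {"_seq_no": seq_no, "_primary_term": self.primary_term}


store = Store()
store.index("100", {"name": "Toaster", "inStock": 100})

# Two clients read the same version of the document.
seen_by_a = store.get("100")
seen_by_b = store.get("100")

# Client A writes first and succeeds.
store.index("100", {"name": "Toaster", "inStock": 99},
            if_seq_no=seen_by_a["_seq_no"], if_primary_term=seen_by_a["_primary_term"])

# Client B still holds the old sequence number, so its write is rejected.
try:
    store.index("100", {"name": "Toaster", "inStock": 99},
                if_seq_no=seen_by_b["_seq_no"], if_primary_term=seen_by_b["_primary_term"])
    conflict = False
except VersionConflict:
    conflict = True

# Client B re-reads the document and retries with fresh values.
latest = store.get("100")
store.index("100", {"name": "Toaster", "inStock": latest["_source"]["inStock"] - 1},
            if_seq_no=latest["_seq_no"], if_primary_term=latest["_primary_term"])
print(conflict, store.get("100")["_source"]["inStock"])
```

Without the condition, both clients would have written 99 and one sale would have been lost; with it, the second client is forced to re-read and ends up writing 98.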

Update By Query / Delete By Query

  • A snapshot of the index is taken when the query starts; it is used for optimistic concurrency control
  • Search queries and bulk requests are sent to replication groups sequentially
    • Elasticsearch retries these requests up to 10 times
    • If the requests still fail, the whole query is aborted
    • Any changes already made to documents are not rolled back
  • The API returns information about failures
  • If a document has been modified since the snapshot was taken, the query is aborted. This is checked with the document’s primary term and sequence number
  • To count version conflicts instead of aborting the query, the conflicts option can be set to proceed
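
The snapshot and conflict behaviour described above can be sketched as a small Python simulation. The `update_by_query` function, the `after_snapshot` hook (used to simulate a concurrent writer), and the `aborted` key are hypothetical; `updated` and `version_conflicts` mirror counters found in the real response. Primary terms are left out to keep the sketch short.

```python
def make_index():
    return {
        "1": {"_seq_no": 0, "_source": {"inStock": 5}},
        "2": {"_seq_no": 1, "_source": {"inStock": 7}},
        "3": {"_seq_no": 2, "_source": {"inStock": 9}},
    }


def update_by_query(index, predicate, script, conflicts="abort", after_snapshot=None):
    # Snapshot: remember the sequence number of every matching document.
    snapshot = {doc_id: doc["_seq_no"] for doc_id, doc in index.items()
                if predicate(doc["_source"])}
    if after_snapshot is not None:
        after_snapshot(index)  # simulate another process writing in the meantime
    result = {"updated": 0, "version_conflicts": 0, "aborted": False}
    next_seq_no = max(doc["_seq_no"] for doc in index.values()) + 1
    for doc_id, seen_seq_no in snapshot.items():
        doc = index[doc_id]
        if doc["_seq_no"] != seen_seq_no:      # modified since the snapshot
            result["version_conflicts"] += 1
            if conflicts != "proceed":
                result["aborted"] = True       # earlier updates are NOT rolled back
                break
            continue
        script(doc["_source"])
        doc["_seq_no"] = next_seq_no
        next_seq_no += 1
        result["updated"] += 1
    return result


def concurrent_writer(index):
    index["2"]["_source"]["inStock"] = 0
    index["2"]["_seq_no"] = 99


def decrement(source):
    source["inStock"] -= 1


aborting = make_index()
abort_result = update_by_query(aborting, lambda s: True, decrement,
                               after_snapshot=concurrent_writer)

proceeding = make_index()
proceed_result = update_by_query(proceeding, lambda s: True, decrement,
                                 conflicts="proceed", after_snapshot=concurrent_writer)
print(abort_result)
print(proceed_result)
```

In the aborting run, document 1 keeps its change even though the query stopped at document 2; in the proceeding run the conflict is only counted and document 3 is still updated.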

Bulk API

  • The HTTP Content-Type header should be set to
    • Content-Type: application/x-ndjson
  • A failed action does NOT affect other actions, nor is the bulk request as a whole aborted
  • The Bulk API returns detailed information about each action
  • Bulk API use cases
    • When you need to perform lots of write operations at the same time
    • Bulk API is more efficient than sending individual write requests
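
As a sketch of what an NDJSON bulk body looks like, the Python snippet below serializes a few actions into the line-per-object format. The `build_bulk_body` helper and the sample products are hypothetical; the action names (`index`, `create`, `update`, `delete`) and the action-line-then-source-line layout follow the Bulk API format.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) tuples into an NDJSON bulk body."""
    lines = []
    for action, metadata, source in actions:
        lines.append(json.dumps({action: metadata}))  # action and metadata line
        if source is not None:                        # delete actions have no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"                    # the body must end with a newline


# Illustrative sample data, not taken from a real data set.
actions = [
    ("index", {"_id": "200"}, {"name": "Espresso Machine", "price": 199, "inStock": 5}),
    ("create", {"_id": "201"}, {"name": "Milk Frother", "price": 149, "inStock": 14}),
    ("update", {"_id": "201"}, {"doc": {"price": 129}}),
    ("delete", {"_id": "200"}, None),
]
body = build_bulk_body(actions)
print(body, end="")
```

Written to a file, a body like this is the kind of payload that can be sent to the `_bulk` endpoint with the `application/x-ndjson` content type.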

Importing Data with Curl

curl -k -u username:password -H 'Content-Type: application/x-ndjson' -XPOST https://localhost:9200/products/_bulk --data-binary @products-bulk.json

Working Document