04 - Mapping & Analysis

Inverted Index

  • Values for a text field are analyzed, and the resulting terms are stored within an inverted index
  • Each text field has its own dedicated inverted index
  • An inverted index is a mapping between terms and the documents that contain them
  • Terms are sorted alphabetically for performance reasons
  • Inverted indices enable fast searches
  • Inverted indices contain other data as well, e.g. data used for relevance scoring
  • Elasticsearch uses other data structures as well, e.g. BKD trees for numeric values, dates, and geospatial data
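
As a rough illustration of where the terms come from, the Analyze API shows the tokens that the standard analyzer would produce for a value. The two sentences are made-up values for the same text field in two hypothetical documents. This is a sketch for inspection only and does not index anything.

```json
# Hypothetical value of a text field in document 1
POST /_analyze
{
  "analyzer": "standard",
  "text": "2 guys walk into a bar"
}

# Hypothetical value of the same field in document 2
POST /_analyze
{
  "analyzer": "standard",
  "text": "2 guys went into a bar"
}
```

Both requests return the tokens 2, guys, into, a, and bar, so in the field's inverted index those terms would point to both documents, while walk would point only to the first document and went only to the second.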

Introduction to Mapping

What is Mapping?

Mapping defines the structure of documents, e.g. fields and their data types. Mapping is also used to configure how values are indexed. Similar to a table’s schema in a relational database, a mapping provides the schema for the documents stored in Elasticsearch.

  • Explicit Mapping - We define field mapping ourselves
  • Dynamic Mapping - Elasticsearch generates field mappings for us

Overview of Data Types

Object

  • Used for any JSON Object
  • Objects may be nested
  • Mapped using the properties parameter
  • Objects are not stored as objects in Apache Lucene
    • Objects are transformed (flattened) into dot-notation field names, e.g. author.first_name

Nested

  • Similar to the object data type, but maintains object relationships
  • Enables us to query objects independently
    • Must use the nested query
  • Nested objects are stored as hidden documents
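
A minimal sketch of a nested mapping and the matching nested query. The index name products and its fields are hypothetical and only serve as an illustration.

```json
# Map "reviews" as nested so that each review object keeps its own field values together
PUT /products
{
  "mappings": {
    "properties": {
      "name": {"type": "text"},
      "reviews": {
        "type": "nested",
        "properties": {
          "author": {"type": "keyword"},
          "rating": {"type": "float"}
        }
      }
    }
  }
}

# Both conditions must be satisfied by the same review object
GET /products/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            {"term": {"reviews.author": "John Doe"}},
            {"range": {"reviews.rating": {"gte": 4}}}
          ]
        }
      }
    }
  }
}
```

With the plain object data type, the same bool query could match a product where one review was written by John Doe and a different review has a rating of 4 or more, because the relationship between the values is lost.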

Keyword

  • Used for exact matching of values
  • Typically used for filtering, aggregations, and sorting
  • E.g. searching for articles with a status of PUBLISHED
  • For full-text searches, use the text data type instead
    • E.g. searching the body text of an article
  • keyword fields are analyzed with the keyword analyzer
  • The keyword analyzer is a no-op analyzer. It outputs the unmodified string as a single token. This token is then placed into the inverted index
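
The no-op behavior can be checked with the Analyze API. The sample text is made up for illustration.

```json
POST /_analyze
{
  "analyzer": "keyword",
  "text": "Status: PUBLISHED, reviewed."
}
```

The response should contain a single token holding the whole string, with casing and punctuation left untouched, which is why only exact values match on keyword fields.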

Adding Explicit Mapping

PUT /reviews
{
  "mappings": {
    "properties": {
      "rating": {"type": "float"},
      "content": {"type": "text"},
      "product_id": {"type": "integer"},
      "author": {
        "properties": {
          "first_name": { "type": "text"},
          "last_name": { "type": "text"},
          "email": { "type": "keyword"}
        }
      }
    }
  }
}

PUT /reviews/_doc/1
{
  "rating": 4.5,
  "content": "Best Review",
  "product_id": 123,
  "author": {
    "first_name": "John",
    "last_name": "Doe",
    "email": "john.doe@gmail.com"
  }
}

GET /reviews/_mapping

GET /reviews/_search
{
  "query": {"match_all": {}}
}
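
Because the author object is flattened, its inner fields are addressed with dot notation in queries. A short sketch against the reviews index and the document indexed above:

```json
# author.first_name is a text field, so a full-text match query is used
GET /reviews/_search
{
  "query": {
    "match": {
      "author.first_name": "John"
    }
  }
}

# author.email is a keyword field, so a term query matches the exact value
GET /reviews/_search
{
  "query": {
    "term": {
      "author.email": "john.doe@gmail.com"
    }
  }
}
```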

Date

  • Dates are specified in one of three ways:
    • Specially formatted strings (defaults to ISO 8601)
    • Milliseconds since the epoch (long)
    • Seconds since the epoch (integer)
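  • Dates are stored as long values internally (milliseconds since the epoch)
  • Don’t provide UNIX timestamps (seconds since the epoch) for default date fields; they would be interpreted as milliseconds since the epoch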
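
A sketch of how values for a default date field could be supplied. The created_at field and the document IDs are assumptions added for illustration; they are not part of the reviews mapping shown earlier.

```json
# Add a date field with the default format to the existing index
PUT /reviews/_mapping
{
  "properties": {
    "created_at": {"type": "date"}
  }
}

# Date only
PUT /reviews/_doc/2
{
  "rating": 4.5,
  "created_at": "2015-03-27"
}

# Date and time in UTC
PUT /reviews/_doc/3
{
  "rating": 3.5,
  "created_at": "2015-04-15T13:07:41Z"
}

# Date and time with a UTC offset
PUT /reviews/_doc/4
{
  "rating": 5.0,
  "created_at": "2015-01-28T09:21:51+01:00"
}

# Milliseconds since the epoch (2015-07-04T12:01:24Z)
PUT /reviews/_doc/5
{
  "rating": 4.0,
  "created_at": 1436011284000
}
```

Supplying 1436011284 (seconds) instead of 1436011284000 would be accepted, but it would be read as a moment in January 1970.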

Missing fields

  • All fields in Elasticsearch are optional
  • You can leave out a field when indexing documents, unlike relational databases where you need to allow NULL values
  • Some integrity checks need to be done at the application level, e.g. having required fields
  • Adding a field mapping does not make a field required
  • Search automatically handles missing fields

Mapping Parameters

  • format - Used to customize the format for date fields

  • properties - Defines nested fields for object and nested fields

  • coerce - Used to enable or disable coercion of values (enabled by default)

  • doc_values - An on-disk, column-oriented data structure used for sorting, aggregations, and scripting. Can be disabled to save disk space

  • norms - Normalization factors used for relevance scoring. Can be disabled for fields that won’t be used for relevance scoring

  • index - Disables indexing for a field. Values are still stored within _source

  • null_value - NULL values cannot be indexed or searched
    • Use this parameter to replace NULL values with another value

  • copy_to - Used to copy multiple field values into a “group field”
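
The parameters above could be combined in a single mapping roughly as follows. The index name mapping_params_demo and all field names are hypothetical; each field demonstrates one parameter.

```json
PUT /mapping_params_demo
{
  "mappings": {
    "properties": {
      "purchased_at": {"type": "date", "format": "dd/MM/yyyy"},
      "amount": {"type": "float", "coerce": false},
      "session_id": {"type": "keyword", "doc_values": false},
      "tags": {"type": "text", "norms": false},
      "server_id": {"type": "integer", "index": false},
      "partner": {"type": "keyword", "null_value": "NULL"},
      "first_name": {"type": "text", "copy_to": "full_name"},
      "last_name": {"type": "text", "copy_to": "full_name"},
      "full_name": {"type": "text"}
    }
  }
}

# "amount" is rejected here because coercion of the string "7.4" is disabled
PUT /mapping_params_demo/_doc/1
{
  "amount": "7.4"
}
```

In this sketch session_id can still be searched but not sorted or aggregated on, server_id is only available in _source, and a document with "partner": null can be found by searching for the value NULL.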

Updating Existing Mappings

Changing the data type of an existing field is not possible in place. It requires reindexing the documents into a new index that has the new mapping.

POST /_reindex
{
  "source": {
    "index": "reviews"
  },
  "dest": {
    "index": "reviews_new"
  }
}
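
The reindex request above only copies documents, so the destination index needs its mapping before the request is run. A sketch, assuming for illustration that product_id should become a keyword field:

```json
# Create the destination index first, with the changed data type
PUT /reviews_new
{
  "mappings": {
    "properties": {
      "rating": {"type": "float"},
      "content": {"type": "text"},
      "product_id": {"type": "keyword"},
      "author": {
        "properties": {
          "first_name": {"type": "text"},
          "last_name": {"type": "text"},
          "email": {"type": "keyword"}
        }
      }
    }
  }
}
```

If reviews_new does not exist when the reindex request runs, it is created automatically and its fields are mapped dynamically, which would bring back the integer type for product_id.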

Field Aliases

  • Field names can be changed when reindexing documents
  • An alternative is to use field aliases
    • Doesn’t require documents to be reindexed
    • Aliases can be used within queries
    • Aliases are defined with a field mapping
    • Let’s add one pointing from comment to content

PUT /reviews/_mapping
{
  "properties": {
    "comment": {
      "type": "alias",
      "path": "content"
    }
  }
}
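
Once the alias exists, it can be used in a query in place of the original field name. A short sketch against the document indexed earlier:

```json
# Searches the content field through its alias
GET /reviews/_search
{
  "query": {
    "match": {
      "comment": "review"
    }
  }
}
```

The same query with content instead of comment should return the same hits, since the alias only redirects to the target field.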

Index Templates

  • Index templates specify settings and mappings for indices whose names match a pattern

PUT /_template/access-logs
{
  "index_patterns": ["access-logs-*"],
  "settings": {
    "number_of_shards": 2,
    "index.mapping.coerce": false
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "url.original": {
        "type": "keyword"
      },
      "http.request.referrer": {
        "type": "keyword"
      },
      "http.response.status_code": {
        "type": "long"
      }
    }
  }
}

PUT /access-logs-2020-01-01
GET /access-logs-2020-01-01/

Elastic Common Schema (ECS)

  • A specification of common fields and how they should be mapped
  • In ECS, documents are referred to as events
  • Mostly useful for standard events
  • ECS is automatically handled by Elastic Stack products

Dynamic Mapping

Combining Explicit + Dynamic Mapping

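Explicit and dynamic mapping can be combined. Fields that are mapped explicitly keep their mapping, and any other field is mapped dynamically the first time it is encountered.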
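
A minimal sketch of the combination. The people index and its fields are hypothetical.

```json
# Only first_name is mapped explicitly
PUT /people
{
  "mappings": {
    "properties": {
      "first_name": {"type": "text"}
    }
  }
}

# last_name is not part of the mapping yet
POST /people/_doc
{
  "first_name": "Jane",
  "last_name": "Doe"
}

# Inspect the resulting mapping
GET /people/_mapping
```

The returned mapping should keep first_name as text and contain a new last_name field added by dynamic mapping, which by default maps strings as text with a keyword sub-field.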

Dynamic Templates

The match and unmatch parameters are useful for applying field mappings based on naming conventions.

  • path_match parameter - Like match, but evaluated against the full dotted path of the field, e.g. author.first_name

  • path_unmatch parameter - Like unmatch, but evaluated against the full dotted path of the field

  • Index templates apply mappings and index settings to matching indices. This happens when indices are created and their names match a pattern

  • Dynamic templates are evaluated when new fields are encountered. The specified field mapping is added if the template’s conditions match

  • Index templates define fixed mappings; dynamic templates are... dynamic
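
A sketch of dynamic templates that use match, unmatch, path_match, and path_unmatch. The index name, template names, and field names are all made up for illustration.

```json
PUT /dynamic_template_test
{
  "mappings": {
    "dynamic_templates": [
      {
        "text_fields": {
          "match_mapping_type": "string",
          "match": "text_*",
          "unmatch": "*_keyword",
          "mapping": {"type": "text"}
        }
      },
      {
        "keyword_fields": {
          "match_mapping_type": "string",
          "match": "*_keyword",
          "mapping": {"type": "keyword"}
        }
      },
      {
        "name_parts": {
          "match_mapping_type": "string",
          "path_match": "employer.name.*",
          "path_unmatch": "*.middle_name",
          "mapping": {"type": "text", "copy_to": "full_name"}
        }
      }
    ]
  }
}

# Each new string field is checked against the templates in order
POST /dynamic_template_test/_doc
{
  "text_product_description": "A description.",
  "text_product_id_keyword": "ABC-123",
  "employer": {
    "name": {
      "first_name": "John",
      "middle_name": "Edward",
      "last_name": "Doe"
    }
  }
}
```

Here text_product_description should be mapped as text, text_product_id_keyword as keyword (the first template is skipped because of unmatch), and the first and last name as text fields copied into full_name. The middle name is excluded by path_unmatch and falls back to the default dynamic mapping.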

Mapping recommendations

  • Dynamic mapping is convenient, but often not a good idea in production
  • Save disk space with optimized mappings when storing many documents
  • Set dynamic to strict, not false, to avoid surprises and unexpected results
  • Don’t always map strings as both text and keyword
    • text mapping - If you need full-text searches
    • keyword mapping - Aggregations, sorting, or filtering on exact values
  • Set doc_values to false if you don’t need sorting, aggregations, and scripting
  • Set norms to false if you don’t need relevance scoring
  • Set index to false if you don’t need to filter on values
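
A short sketch of the strict setting from the recommendations above. The strict_demo index and its fields are hypothetical.

```json
PUT /strict_demo
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": {"type": "text"}
    }
  }
}

# Rejected: "views" is not part of the mapping
POST /strict_demo/_doc
{
  "title": "Hello",
  "views": 10
}
```

The second request should fail with a strict_dynamic_mapping_exception. With "dynamic": false the document would be accepted instead, but views would only be kept in _source and could not be searched, which is the kind of surprise the recommendation is meant to avoid.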

Stop Words

Stop words are words that are filtered out during text analysis, typically common words such as “a”, “the”, “at”, etc. They provide little to no value for relevance scoring.
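
Stop word removal can be tried out with a standard analyzer that has a stop word list configured. The index name stop_words_demo and the analyzer name standard_with_stop are assumptions made for this sketch.

```json
# A standard analyzer with the predefined English stop word list
PUT /stop_words_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_with_stop": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST /stop_words_demo/_analyze
{
  "analyzer": "standard_with_stop",
  "text": "The cat sat at the door"
}
```

The response should contain only the tokens cat, sat, and door, because the and at are part of the English stop word list.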

Analyzer and Search Queries

  • Standard analyzer - Splits text at word boundaries, removes most punctuation, lowercases terms, and supports removing stop words (disabled by default)
  • Simple analyzer - Divides text into terms whenever it encounters a character that is not a letter, and lowercases all terms
  • Whitespace analyzer - Divides text into terms whenever it encounters a whitespace character
  • Keyword analyzer - A “no-op” analyzer that outputs the text it is given, unmodified, as a single term
  • Pattern analyzer - Uses a regular expression to split text into terms
  • Custom analyzer - Combines character filters, a tokenizer, and token filters of our own choosing
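
A sketch of what a custom analyzer could look like. The index name analyzer_test, the analyzer name my_custom_analyzer, and the choice of filters are assumptions for illustration; the character filter, tokenizer, and token filters themselves are built in.

```json
PUT /analyzer_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "asciifolding"]
        }
      }
    }
  }
}

# Try the analyzer on a sample text containing HTML and mixed casing
POST /analyzer_test/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am in a <em>GOOD</em> mood, and I <strong>love</strong> coffee!"
}
```

The response should show the text with HTML tags stripped, terms lowercased, and English stop words such as in, a, and and removed.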

Built-in Analyzers Reference