Elasticsearch - Why and How

What is Elasticsearch

A distributed search engine
Built on Apache Lucene
Highly performant
Easy to cluster
Multi-tenant
Multi-typed
RESTful

Why Elasticsearch


Indexing with SOLR


   $ curl http://localhost:8983/solr/collection1/update -H 'Content-type:application/json' -d '
   [
     {
      "id"        : "TestDoc1",
      "title"     : {"set":"test1"},
      "revision"  : {"inc":3},
      "publisher" : {"add":"TestPublisher"}
     },
     {
      "id"        : "TestDoc2",
      "publisher" : {"add":"TestPublisher"}
     }
   ]'

In SolrCloud, the collection must already exist, along with a configset that describes it.
Updates use operations (set, inc, add) rather than treating the document as a plain POST.
Commands are used to update the index.

Searching with SOLR


   $ curl -XGET 'http://localhost:8983/solr/collection1/select?q=solr&wt=json'

Querying is done by hitting the select endpoint on a collection.
GET request parameters define the query.

Clustering With Elasticsearch

Step 1

Clustering with Elasticsearch

Step 2

Clustering with Elasticsearch

Step 3

Elasticsearch is fault-tolerant

Data is distributed to nodes and shards
One node or shard dying won’t kill the index
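
To make the distribution idea concrete, here is a minimal sketch of how a document can be assigned to a shard. Elasticsearch routes a document by hashing a routing value (the document ID by default) and taking it modulo the number of primary shards. The object and method names below are hypothetical, and String.hashCode is only a stand-in for the hash function Elasticsearch actually uses.

```scala
// Sketch only: ShardRoutingSketch, shardFor and distribute are illustrative names,
// and String.hashCode stands in for the real routing hash.
object ShardRoutingSketch {
  // Pick a shard in the range [0, primaryShards) for a routing value.
  def shardFor(routing: String, primaryShards: Int): Int =
    Math.floorMod(routing.hashCode, primaryShards)

  // Group document IDs by the shard they would land on.
  def distribute(ids: Seq[String], primaryShards: Int): Map[Int, Seq[String]] =
    ids.groupBy(id => shardFor(id, primaryShards))
}
```

Because each shard holds only a slice of the documents, and shards can be replicated onto other nodes, losing one node does not have to mean losing the index.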

Elasticsearch has an excellent JSON-BASED Query and Filtering DSL

Fuzzy
Term-Matching
Relationship-based
Geographical queries

Elasticsearch index creation is easy


   $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
    }'

Index creation Broken down

Index NODE

Index creation broken down

Index NAME

Index creation broken down

Index TYPE

Index creation broken down

Document ID

Index creation broken down

DOCUMENT


    {
      "user" : "kimchy",
      "post_date" : "2009-11-15T14:12:12",
      "message" : "trying out Elasticsearch"
    }

Searching Elasticsearch is easy


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
     "query" : {
       "term" : { "user" : "kimchy" }
     }
   }'

Search broken down

Search NODE

Search broken down

OPTIONAL Index NAME

Search broken down

OPTIONAL Index TYPE

Search broken down

Resource: _search

SEARCH broken down


    {
     "query" : {
       "term" : { "user" : "kimchy" }
     }
    }

Searching Elasticsearch

Searching in Elasticsearch is called querying
Can be done via GET parameters, like Lucene and Solr
Can also use a GET with a request body, which allows a richer query language
Query results are scored and sorted by best match

Query DSL: match_all


    $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
       "query" : {
         "match_all" : {}
       }
    }'

match_all matches all documents in an index or index type
Useful when listing lots of results
Filters can be used to pare down the results

Query DSL: fuzzy


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
      "query" : {
        "fuzzy" : {
          "user" : {
            "value" :         "ki",
            "boost" :         1.0,
            "fuzziness" :     2,
            "prefix_length" : 0,
            "max_expansions": 100
          }
        }
      }
    }'

Matches terms that are close to the given value, not just exact matches
Specify the field to search
fuzziness: the maximum edit distance allowed between the value and a matching term
prefix_length: how many leading characters must match exactly
Computationally expensive; use max_expansions to limit the number of candidate terms
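
The fuzziness and prefix_length settings can be illustrated with a small sketch based on edit distance. This is not how Elasticsearch implements fuzzy matching internally; it uses plain Levenshtein distance as a simplification, and the names FuzzinessSketch, editDistance, and fuzzyMatches are made up for this example.

```scala
// Sketch only: a simplified model of fuzziness and prefix_length.
object FuzzinessSketch {
  // Classic dynamic-programming Levenshtein distance, one row at a time.
  def editDistance(a: String, b: String): Int = {
    val initial = (0 to b.length).toVector
    a.foldLeft(initial) { (prev, ca) =>
      b.zipWithIndex.foldLeft(Vector(prev.head + 1)) { case (row, (cb, j)) =>
        val cost = if (ca == cb) 0 else 1
        // minimum of insertion, deletion, and substitution
        row :+ math.min(math.min(row.last + 1, prev(j + 1) + 1), prev(j) + cost)
      }
    }.last
  }

  // A term matches when its first prefixLength characters equal those of the
  // value and the whole term is within `fuzziness` edits of the value.
  def fuzzyMatches(term: String, value: String, fuzziness: Int, prefixLength: Int): Boolean =
    term.take(prefixLength) == value.take(prefixLength) &&
      editDistance(term, value) <= fuzziness
}
```

Under this model, the value "ki" is four edits away from "kimchy", so a fuzziness of 2 would not be enough to match it, while "kimchi" is a single edit away.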

Query DSL: match


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
     "query" : {
       "match" : {
         "message" : {
           "query" : "this is a test",
           "operator" : "and",
           "boost" :         1.0,
           "fuzziness" :     2,
           "prefix_length" : 0,
           "max_expansions": 100
         }
       }
     }
   }'

Matches single terms, phrases, and numeric or date values
Specify the field
Can be fuzzy; the same ideas apply
Fuzziness in this case applies to individual words (terms)
match_phrase queries match a sequence of terms in order
Can target multiple fields; each has its own options, and they are combined with a logical OR by default
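
The effect of the operator setting in the match example can be sketched as follows. The tokenizer here is a very rough stand-in for a real analyzer, and MatchOperatorSketch and its methods are hypothetical names, not part of any Elasticsearch API.

```scala
// Sketch only: models how "and" versus "or" combines the terms of a match query.
object MatchOperatorSketch {
  // Rough stand-in for analysis: lowercase and split on anything that is not a letter or digit.
  def tokens(text: String): Set[String] =
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSet

  // "and" requires every query term to appear in the field; anything else behaves like "or".
  def matches(fieldValue: String, query: String, operator: String): Boolean = {
    val fieldTerms = tokens(fieldValue)
    val queryTerms = tokens(query)
    operator.toLowerCase match {
      case "and" => queryTerms.forall(fieldTerms.contains)
      case _     => queryTerms.exists(fieldTerms.contains)
    }
  }
}
```

With "and", the query "this is a test" only matches a message that contains all four terms; with "or", one shared term is enough.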

Query DSL: span_first


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
     "query" : {
       "span_first" : {
         "match" : {
           "span_term" : { "user" : "kimchy" }
         },
         "end" : 3
       }
     }
   }'

Matches a span (here, the term “kimchy” in the user field) only when it occurs near the beginning of the field
Specify the field and value as an object inside span_term
end is the maximum end position of the match, counted in terms, not characters
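
As a rough illustration of span_first with a single span_term, the sketch below treats a field as a sequence of whitespace-separated terms and checks whether the term ends within the first `end` positions. Positions are counted in terms. SpanFirstSketch is a hypothetical name, and the lowercasing and whitespace splitting are simplifying assumptions rather than real analysis.

```scala
// Sketch only: models span_first for a single-term span over whitespace tokens.
object SpanFirstSketch {
  def spanFirstMatches(fieldValue: String, term: String, end: Int): Boolean = {
    val tokens = fieldValue.toLowerCase.split("\\s+").filter(_.nonEmpty)
    // A single-term span at position p ends at p + 1, so it matches when p + 1 <= end,
    // which is the same as the term appearing among the first `end` tokens.
    tokens.take(end).contains(term.toLowerCase)
  }
}
```

With end set to 3, "kimchy" matches when it is one of the first three terms of the field, and a partial value such as "kim" does not match at all.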

IMPORTANT: Do as little as possible with search queries.

Query only enough to roughly find what you want.
Use the FILTER DSL to project/transform/pare down search results.

Filtering in es is easy:


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
     "query" : {
       "match_all" : { }
     },
     "filter" : {
       "term" : { "user" : "kimchy" }
     }
   }'

Query DSL and Filter DSL Docs

Query DSL http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-queries.html
Filter DSL http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filters.html

How Assets uses Elasticsearch

HTTP and JSON are great
But there are decent Java/Scala wrappers out there for es

Elastic4s

https://github.com/sksamuel/elastic4s
A nice asynchronous DSL over the Java Elasticsearch API


  import com.sksamuel.elastic4s.ElasticClient
  import com.sksamuel.elastic4s.ElasticDsl._

  object Test extends App {
    val client = ElasticClient.local
    client execute { index into "bands/singers" fields "name"->"chris martin" }
  }

Elastic4s Connecting in ASSETS


    lazy val settings = ImmutableSettings
      .settingsBuilder()
      .put(
        "cluster.name",
        getConfigString("elasticsearch.clustername"))
      .put("client.transport.ping_timeout",
        getConfigInt("elasticsearch.pingTimeout") +
          getConfigString("elasticsearch.pingTimeoutTimeUnit"))
      .put(
        "client.transport.ignore_cluster_name",
        getConfigBoolean("elasticsearch.ignoreClusterName"))
      .build()

    lazy val client = ElasticClient
      .remote(
        settings,
        (getConfigString("elasticsearch.host"),getConfigInt("elasticsearch.port")))

Elastic4s Indexing


    client.execute {
      index into indexName -> institutionId fields (em.mapped(asset)) id asset.id
    } map arc.convert

em stands for “ElasticMapper”.
arc stands for “ActionResponseConvertable”.
All of the requests made to Elasticsearch return an ActionResponse.

Elastic4s Searching: Common operations

Common operations are split into their own local definitions for reuse.

Elastic4s Searching: Common operations


     // wrap up the client execution and return conversion logic
     lazy val executeSearch: (SearchDefinition) =>
       (ActionResponseConvertable[SearchResponse, AssetSearch]) =>
         Future[AssetSearch] = (s) =>
           (arc) => client.execute { s } map arc.convert

Everything needs client.execute, and to have its response converted.
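
The shape of executeSearch (a curried function that runs a request and then converts the response inside a Future) can be tried out without Elasticsearch. In the sketch below, Query, RawResponse, Summary, Converter, and fakeExecute are hypothetical stand-ins for SearchDefinition, SearchResponse, AssetSearch, ActionResponseConvertable, and client.execute; none of them are elastic4s types.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch only: every type here is a made-up stand-in, not an elastic4s API.
object ExecuteSketch {
  final case class Query(text: String)
  final case class RawResponse(hits: Seq[String])
  final case class Summary(total: Int)

  // Plays the role of the response converter passed around as `arc`.
  trait Converter[A, B] { def convert(a: A): B }

  // Plays the role of client.execute: returns a canned response asynchronously.
  def fakeExecute(q: Query): Future[RawResponse] =
    Future(RawResponse(Seq("doc-1", "doc-2", "doc-3")))

  // Curried like executeSearch: first the query, then the converter.
  val execute: Query => Converter[RawResponse, Summary] => Future[Summary] =
    q => conv => fakeExecute(q).map(conv.convert)

  val countHits: Converter[RawResponse, Summary] =
    new Converter[RawResponse, Summary] {
      def convert(r: RawResponse): Summary = Summary(r.hits.size)
    }
}
```

Calling ExecuteSketch.execute(ExecuteSketch.Query("kimchy"))(ExecuteSketch.countHits) yields a Future[Summary], so callers never handle the raw response directly.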

Elastic4s Searching: Common operations


    // take an institutionId if we have one or none if we don't and return the
    // correct starting search definition for both possibilities
    lazy val maybeSearchInstitution: (Option[String]) => SearchDefinition =
      (institutionId) => institutionId map { i =>
        search in indexName -> i
      } getOrElse (search in indexName)

Institution IDs are used as index types within the index
Elasticsearch doesn’t require an index type to search, so this allows searching with or without one
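
The same Option handling can be sketched against the REST paths used in the earlier curl examples, where an optional segment after the index name narrows the search. SearchTargetSketch and searchPath are hypothetical names, and "assets" and "inst1" in the usage note are made-up values.

```scala
// Sketch only: mirrors maybeSearchInstitution using plain REST paths instead of elastic4s.
object SearchTargetSketch {
  // With an institution ID, search a narrower target; without one, search the whole index.
  def searchPath(indexName: String, institutionId: Option[String]): String =
    institutionId
      .map(id => s"/$indexName/$id/_search")
      .getOrElse(s"/$indexName/_search")
}
```

For example, searchPath("assets", Some("inst1")) gives "/assets/inst1/_search", and searchPath("assets", None) gives "/assets/_search".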

Elastic4s Searching: Common operations


    // common query -- perform a fuzzy search on some field with some given value,
    // with the optional institution id
    lazy val fuzzySearch: (Option[String]) =>
      (String) => (String) => SearchDefinition = (maybeInstitution) =>
      (field) => (value) =>
        maybeSearchInstitution(maybeInstitution) query
          fuzzyQuery(field)(value)

    // take a field and a value and return a fuzzy query definition with our common settings
    lazy val fuzzyQuery: (String) => (String) => MatchQueryDefinition = (field) =>
      (value) => matchQuery(field, value).boost(4)
        .maxExpansions(10)
        .prefixLength(3)

Handling typos is common. MatchQuery does that with some limits on expansions and prefix length.
Querying fuzzily is supported on multiple fields.

Elastic4s Searching: Example


    def searchFuzzyTitle(
      institutionId: Option[String],
      title: String,
      size: Int,
      offset: Int, sortField: String, sortDirection: String)(
        implicit arc: ActionResponseConvertable[SearchResponse, AssetSearch]):
        Future[AssetSearch] =
      executeSearch(fuzzySearch(institutionId)("filename")(title)
          start offset limit size sortByScoreAnd
            (sortField, sortDirection))(arc)

Search for a fuzzy filename, start at the given record, limit it to n results, and sort the return by query score and the given fields and direction.
fuzzySearch includes the call to maybeSearchInstitution.

DONTS

Don’t create connections to the client by accident. We did that. It eats your memory.
Don’t query to reduce the results when you could filter. Filters are cached and (in general) don’t compute scores. You have to use them when things get big.
Elasticsearch is for search. Don’t use it as a primary datastore.
Don’t rely on something being indexed and available immediately after you queue it to be indexed. ES is async and eventually consistent.
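
One way accidental client creation can happen in Scala (an assumption for illustration, not necessarily what happened in Assets) is defining the client with def instead of lazy val, so every access builds a new one. The sketch below uses a hypothetical FakeClient that counts how many instances are constructed.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Sketch only: FakeClient stands in for a real Elasticsearch client.
object ClientHolderSketch {
  val created = new AtomicInteger(0)

  // Each construction bumps the counter, standing in for an opened connection.
  final class FakeClient { created.incrementAndGet() }

  // A def builds a brand new client on every access (the accident).
  def leakyClient: FakeClient = new FakeClient

  // A lazy val builds the client once, on first access, and reuses it afterwards.
  lazy val sharedClient: FakeClient = new FakeClient
}
```

Reading sharedClient repeatedly constructs one client in total, while each read of leakyClient constructs another.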

DOS

Rebuild your index if you have to. Elasticsearch takes care of it, so if things start looking weird just run a job to pull from your primary data store and reindex everything.
Pool your Elasticsearch clients. We are moving to that.
Separate indexing from searching. Your search should not wait on some indexing operation. Separate concerns.

More fun stuff that can’t be covered in one talk!

Elastic4s: https://github.com/sksamuel/elastic4s
Plugins: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
Analyzers: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-analyzers.html#default-analyzers
Mappings: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html#all-mapping-types
Percolation: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html#_percolate_api
