Elasticsearch - Why and How

What is Elasticsearch

  • A distributed search engine
  • Built on Apache Lucene
  • Highly performant
  • Easy to cluster
  • Multi-tenant
  • Multi-typed
  • RESTful

Why Elasticsearch

-

Indexing with SOLR


   $ curl http://localhost:8983/solr/collection1/update -H 'Content-type:application/json' -d '
   [
     {
      "id"        : "TestDoc1",
      "title"     : {"set":"test1"},
      "revision"  : {"inc":3},
      "publisher" : {"add":"TestPublisher"}
     },
     {
      "id"        : "TestDoc2",
      "publisher" : {"add":"TestPublisher"}
     }
   ]'
   
  • The collection must already exist; in SolrCloud, a configset describes it.
  • Updates use per-field operations (set, inc, add) rather than treating the document as a strict post.
  • Uses commands to update the index.

Searching with SOLR


   $ curl -XGET 'http://localhost:8983/solr/collection1/select?q=solr&wt=json'
   
  • Querying is done by hitting the SELECT endpoint on a collection.
  • GET request parameters are used to define the query syntax.

Clustering with Elasticsearch

  • Step 1

Clustering with Elasticsearch

  • Step 2

Clustering with Elasticsearch

  • Step 3

Elasticsearch is fault-tolerant

  • Data is distributed to nodes and shards
  • One node or shard dying won’t kill the index

Elasticsearch has an excellent JSON-BASED Query and Filtering DSL

  • Fuzzy
  • Term-Matching
  • Relationship based
  • Geographical queries

Elasticsearch index creation is easy


   $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
    }'
    

Index creation broken down

  • Index NODE

Index creation broken down

  • Index NAME

Index creation broken down

  • Index TYPE

Index creation broken down

  • Document ID

Index creation broken down

  • DOCUMENT

    {
      "user" : "kimchy",
      "post_date" : "2009-11-15T14:12:12",
      "message" : "trying out Elasticsearch"
    }
    

Searching Elasticsearch is easy


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
     "query" : {
       "term" : { "user" : "kimchy" }
     }
   }'
   

Search broken down

  • Search NODE

Search broken down

  • OPTIONAL Index NAME

Search broken down

  • OPTIONAL Index TYPE

Search broken down

  • Resource: _search

Search broken down


    {
     "query" : {
       "term" : { "user" : "kimchy" }
     }
    }
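
The breakdown above treats the index name and index type as optional path segments in front of _search. A minimal sketch of that rule as a shell helper; search_url and ES_NODE are hypothetical names introduced here, not part of Elasticsearch:

```shell
# Hypothetical helper: assemble a search URL from the pieces named above.
ES_NODE="http://localhost:9200"

search_url() {
  # $1 = optional index name, $2 = optional index type
  url="$ES_NODE"
  if [ -n "$1" ]; then
    url="$url/$1"
    if [ -n "$2" ]; then
      url="$url/$2"
    fi
  fi
  printf '%s/_search\n' "$url"
}

search_url                 # no index name: search across indices
search_url twitter         # one index, any type
search_url twitter tweet   # one index, one type
```

The printed URL can be passed straight to curl, for example curl -XGET "$(search_url twitter tweet)" -d '...'.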
   

Searching Elasticsearch

  • Searching in Elasticsearch is called querying
  • Can be done via GET parameters, as with Solr, using Lucene query syntax
  • Also accepts a GET with a request body, which allows richer query languages
  • Query results are scored, and sorted by best match
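
For comparison with the body-based requests in this talk, here is a sketch of the GET-parameter form, where the q parameter carries a Lucene-style field:value query. uri_search_url is a hypothetical helper, and it assumes the field and value need no URL encoding:

```shell
# Hypothetical helper: build a URI search against the twitter/tweet example.
uri_search_url() {
  # $1 = field, $2 = value (assumed to need no URL encoding)
  printf 'http://localhost:9200/twitter/tweet/_search?q=%s:%s\n' "$1" "$2"
}

uri_search_url user kimchy
```

Against a running node this would be sent as curl -XGET "$(uri_search_url user kimchy)". The body form is still the better fit once the query needs nesting or options.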

Query DSL: match_all


    $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
       "query" : {
         "match_all" : {}
       }
    }'
  • match_all matches all documents in an index or index type
  • Useful when listing lots of results
  • Filters can be used to pare down the results

Query DSL: fuzzy


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
      "query" : {
        "fuzzy" : {
          "user" : {
            "value" :         "ki",
            "boost" :         1.0,
            "fuzziness" :     2,
            "prefix_length" : 0,
            "max_expansions": 100
          }
        }
      }
    }'
  • Allows you to search for terms that are close to a given value, not only exact matches
  • Specify field
  • fuzziness: the maximum edit distance allowed between the value and a matching term
  • prefix_length: how many leading characters must match exactly
  • Computationally expensive, so use max_expansions to limit matches

Query DSL: match


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
     "query" : {
       "match" : {
         "message" : {
           "query" :         "this is a test",
           "operator" :      "and",
           "boost" :         1.0,
           "fuzziness" :     2,
           "prefix_length" : 0,
           "max_expansions": 100
         }
       }
     }
   }'
  • Accepts text, numbers, and dates; the input is analyzed and turned into a query
  • Specify field
  • Can be fuzzy - same ideas apply
  • Fuzziness in this case matches individual words (terms)
  • match_phrase queries match several terms together as an ordered phrase
  • Can be multiple fields – each has its own options and is combined with logical OR by default

Query DSL: span_first


     $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
       "query" : {
         "span_first" : {
           "match" : {
             "span_term" : { "user" : "kimchy" }
           },
           "end" : 3
         }
       }
     }'
  • Matches spans near the beginning of a field (here, the term “kimchy” within the first positions of the user field)
  • Specify field and value as an object
  • End is the maximum end position a match may have, counted in term positions rather than characters

IMPORTANT: Do as little as possible with search queries.

  • Query only enough to roughly find what you want.
  • Use the FILTER DSL to project/transform/pare down search results.

Filtering in Elasticsearch is easy:


   $ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
     "query" : {
       "match_all" : { }
     },
     "filter" : {
       "term" : { "user" : "kimchy" }
     }
   }'
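
The request above puts the filter at the top level of the search body. A minimal sketch of an alternative, assuming an Elasticsearch 1.x cluster where the filtered query is available (later major versions express this with a bool query and a filter clause instead, so check against your version). filtered_body is a hypothetical helper, and it assumes the field and value contain nothing that needs JSON escaping:

```shell
# Hypothetical helper: print a filtered-query body that wraps match_all
# in a single term filter.
filtered_body() {
  # $1 = field, $2 = value (assumed to need no JSON escaping)
  cat <<EOF
{
  "query" : {
    "filtered" : {
      "query"  : { "match_all" : { } },
      "filter" : { "term" : { "$1" : "$2" } }
    }
  }
}
EOF
}

filtered_body user kimchy
```

To send it: curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d "$(filtered_body user kimchy)". Keeping the filter inside the query means the restriction is part of the query itself rather than a separate step.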

Query DSL and Filter DSL Docs

How Assets uses Elasticsearch

  • HTTP and JSON are great
  • But there are decent Java/Scala wrappers out there for es

Elastic4s


  import com.sksamuel.elastic4s.ElasticClient
  import com.sksamuel.elastic4s.ElasticDsl._

  object Test extends App {
    val client = ElasticClient.local
    client execute { index into "bands/singers" fields "name"->"chris martin" }
  }

Elastic4s Connecting in ASSETS


    lazy val settings = ImmutableSettings
      .settingsBuilder()
      .put(
        "cluster.name",
        getConfigString("elasticsearch.clustername"))
      .put("client.transport.ping_timeout",
        getConfigInt("elasticsearch.pingTimeout") +
          getConfigString("elasticsearch.pingTimeoutTimeUnit"))
      .put(
        "client.transport.ignore_cluster_name",
        getConfigBoolean("elasticsearch.ignoreClusterName"))
      .build()

    lazy val client = ElasticClient
      .remote(
        settings,
        (getConfigString("elasticsearch.host"),getConfigInt("elasticsearch.port")))

Elastic4s Indexing


    client.execute {
      index into indexName -> institutionId fields (em.mapped(asset)) id asset.id
    } map arc.convert
  • em stands for “ElasticMapper”.
  • arc stands for “ActionResponseConvertable”.
  • All of the requests made to Elasticsearch return an ActionResponse.

Elastic4s Searching: Common operations

  • Common operations are split into their own local definitions for reuse.

Elastic4s Searching: Common operations


     // wrap up the client execution and return conversion logic
     lazy val executeSearch: (SearchDefinition) =>
       (ActionResponseConvertable[SearchResponse, AssetSearch]) =>
         Future[AssetSearch] = (s) =>
           (arc) => client.execute { s } map arc.convert
  • Everything needs client.execute, and to have its response converted.

Elastic4s Searching: Common operations


    // take an institutionId if we have one or none if we don't and return the
    // correct starting search definition for both possibilities
    lazy val maybeSearchInstitution: (Option[String]) => SearchDefinition =
      (institutionId) => institutionId map { i =>
        search in indexName -> i
      } getOrElse (search in indexName)
  • Institution ids are used as index types (the second half of indexName -> i)
  • Elasticsearch doesn’t require an index type to search, so we allow searching with or without one

Elastic4s Searching: Common operations


   // common query -- perform a fuzzy search on some field with some given value,
   // with the optional institution id
   lazy val fuzzySearch: (Option[String]) =>
     (String) => (String) => SearchDefinition = (maybeInstitution) =>
     (field) => (value) =>
       maybeSearchInstitution(maybeInstitution) query
         fuzzyQuery(field)(value)

   // take a field and a value and return a fuzzy query definition with our common settings
   lazy val fuzzyQuery: (String) => (String) => MatchQueryDefinition = (field) =>
     (value) => matchQuery(field, value).boost(4)
       .maxExpansions(10)
       .prefixLength(3)
  • Handling typos is common. MatchQuery does that with some limits on expansions and prefix length.
  • Querying fuzzily is supported on multiple fields.

Elastic4s Searching: Example


    def searchFuzzyTitle(
      institutionId: Option[String],
      title: String,
      size: Int,
      offset: Int, sortField: String, sortDirection: String)(
        implicit arc: ActionResponseConvertable[SearchResponse, AssetSearch]):
        Future[AssetSearch] =
      executeSearch(fuzzySearch(institutionId)("filename")(title)
          start offset limit size sortByScoreAnd
            (sortField, sortDirection))(arc)
  • Search for a fuzzy filename, start at the given record, limit it to n results, and sort the return by query score and the given fields and direction.
  • fuzzySearch includes the call to maybeSearchInstitution.

DONTS

  • Don’t create connections to the client by accident. We did that. It eats your memory.
  • Don’t query to reduce the results when you could filter. Filters are cached and don’t (in general) compute scores. You have to use them when things get big.
  • Elasticsearch is for search. Don’t use it as a primary datastore.
  • Don’t rely on something being indexed and available immediately after you queue it to be indexed. ES is async and eventually consistent.
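
Because a freshly indexed document may not be searchable right away, code that has to read its own write can poll instead of assuming. A minimal sketch of such a loop; wait_until and visible_on_third_try are hypothetical names, and the check is a stand-in so the sketch runs without a cluster:

```shell
# Hypothetical helper: run a check until it succeeds or attempts run out.
wait_until() {
  # $1 = max attempts, $2 = seconds between attempts, rest = check command
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Stand-in check: pretends the document becomes visible on the third look.
# A real check would run a search and inspect whether the document came back.
calls=0
visible_on_third_try() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}

if wait_until 5 0 visible_on_third_try; then
  echo "visible after $calls checks"
else
  echo "gave up after $calls checks"
fi
```

Elasticsearch also exposes a _refresh endpoint on an index (for example POST to http://localhost:9200/twitter/_refresh) that makes recent writes searchable. Calling it after every write costs indexing throughput, so polling or simply tolerating the delay is usually the gentler option.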

DOS

  • Rebuild your index if you have to. Elasticsearch takes care of it, so if things start looking weird just run a job to pull from your primary data store and reindex everything.

  • Pool your Elasticsearch clients. We are moving to that.
  • Separate indexing from searching. Your search should not wait on some indexing operation. Separate concerns.

More fun stuff that can’t be covered in one talk!