- A distributed search engine
- Built on Apache Lucene
- Highly performant
- Easy to cluster
- Multi-tenant
- Multi-typed
- RESTful
$ curl http://localhost:8983/solr/collection1/update -H 'Content-type:application/json' -d '
[
  {
    "id" : "TestDoc1",
    "title" : {"set":"test1"},
    "revision" : {"inc":3},
    "publisher" : {"add":"TestPublisher"}
  },
  {
    "id" : "TestDoc2",
    "publisher" : {"add":"TestPublisher"}
  }
]'
- The collection must already exist. In SolrCloud, a configset describes the collection.
- Uses operations rather than treating the document as a strict post.
- Uses commands to update the index.
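The "operations, not a strict post" idea can be modelled in a few lines. This is only a sketch of the semantics of the `set`, `inc`, and `add` modifiers on an in-memory map; the names `AtomicUpdateSketch`, `SetTo`, `Inc`, `Add`, and `applyUpdate` are made up for illustration and are not part of Solr.

```scala
object AtomicUpdateSketch {
  // A stored document: field name -> value (String, Int, or List[String] in this sketch).
  type Doc = Map[String, Any]

  // One operation per field, mirroring the "set", "inc", and "add" modifiers.
  sealed trait Op
  final case class SetTo(value: Any) extends Op
  final case class Inc(by: Int) extends Op
  final case class Add(value: String) extends Op

  // Apply each operation to its field and leave every other field untouched.
  def applyUpdate(doc: Doc, ops: Map[String, Op]): Doc =
    ops.foldLeft(doc) { case (d, (field, op)) =>
      op match {
        case SetTo(v) => d.updated(field, v)
        case Inc(by) =>
          // A missing or non-numeric field is treated as 0 (an assumption of this sketch).
          val current = d.get(field).collect { case n: Int => n }.getOrElse(0)
          d.updated(field, current + by)
        case Add(v) =>
          // A missing field is treated as an empty list (an assumption of this sketch).
          val current = d.get(field).collect { case xs: List[_] => xs.map(_.toString) }.getOrElse(Nil)
          d.updated(field, current :+ v)
      }
    }
}
```

Only the fields named in the update change; a field such as `id` that carries no operation is passed through as-is.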
$ curl -XGET 'http://localhost:8983/solr/collection1/select?q=solr&wt=json'
- Querying is done by hitting the SELECT endpoint on a collection.
- GET request parameters are used to define the query syntax.
- Step 1
- Step 2
- Step 3
- Data is distributed to nodes and shards
- One node or shard dying won’t kill the index
- Fuzzy
- Term-Matching
- Relationship based
- Geographical queries
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}'
- Index NODE
- Index NAME
- Index TYPE
- Document ID
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}'
- Search NODE
- Resource: _search
- Searching in Elasticsearch is called querying
- Can be done via GET parameters, as in Lucene and Solr
- Can also send a GET request with a JSON body, which allows a richer query language
- Query results are scored, and sorted by best match
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query" : {
    "match_all" : {}
  }
}'
- match_all matches all documents in an index or type
- Useful when listing lots of results
- Filters can be used to pare down the results
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query" : {
    "fuzzy" : {
      "user" : {
        "value" : "ki",
        "boost" : 1.0,
        "fuzziness" : 2,
        "prefix_length" : 0,
        "max_expansions": 100
      }
    }
  }
}'
- Matches terms that are within a small edit distance of the value, which tolerates typos
- Specify the field
- fuzziness: the maximum edit distance allowed between the value and a matching term
- prefix_length: how many leading characters must match exactly
- Computationally expensive; use max_expansions to limit matches
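To make the three knobs concrete, here is a rough sketch of fuzzy term matching over an in-memory list of terms. It uses a plain Levenshtein distance and a linear scan, which is an assumption made for readability and not how Lucene implements fuzzy queries; `FuzzySketch`, `editDistance`, and `fuzzyMatches` are illustrative names.

```scala
object FuzzySketch {
  // Dynamic-programming Levenshtein distance (insert, delete, substitute), one row at a time.
  def editDistance(a: String, b: String): Int = {
    var row = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val next = new Array[Int](b.length + 1)
      next(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        next(j) = math.min(math.min(next(j - 1) + 1, row(j) + 1), row(j - 1) + cost)
      }
      row = next
    }
    row(b.length)
  }

  // Keep terms that share the first `prefixLength` characters with `value` and are
  // within `fuzziness` edits of it, stopping after `maxExpansions` matches.
  def fuzzyMatches(terms: Seq[String], value: String, fuzziness: Int,
                   prefixLength: Int, maxExpansions: Int): Seq[String] = {
    val prefix = value.take(prefixLength)
    terms
      .filter(t => t.startsWith(prefix) && editDistance(t, value) <= fuzziness)
      .take(maxExpansions)
  }
}
```

Under this model, `editDistance("ki", "kimchy")` is 4, so a fuzziness of 2 would not be enough to reach `kimchy` from `ki`; a value such as `kimchi` is one edit away and would match.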
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query" : {
    "match" : {
      "message" : {
        "query" : "this is a test",
        "operator" : "and",
        "boost" : 1.0,
        "fuzziness" : 2,
        "prefix_length" : 0,
        "max_expansions": 100
      }
    }
  }
}'
- Accepts text, numbers, and dates; the input is analyzed before matching
- Specify the field
- Can be fuzzy; the same ideas apply
- Fuzziness in this case applies to individual words (terms)
- match_phrase queries match the terms together as a phrase
- Can span multiple fields; each has its own options, and they are combined with logical OR by default
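The effect of `operator` on an analyzed query string can be sketched as follows. The analyzer here is a crude lowercase-and-split stand-in, not one of Elasticsearch's analyzers, and `MatchSketch`, `analyze`, and `matches` are hypothetical names.

```scala
object MatchSketch {
  // Crude stand-in for an analyzer: lowercase, then split on anything that is not a letter or digit.
  def analyze(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // "and" requires every query term to occur in the field; anything else is treated as "or",
  // which requires at least one query term to occur.
  def matches(fieldText: String, query: String, operator: String): Boolean = {
    val fieldTerms = analyze(fieldText).toSet
    val queryTerms = analyze(query)
    operator match {
      case "and" => queryTerms.forall(fieldTerms.contains)
      case _     => queryTerms.exists(fieldTerms.contains)
    }
  }
}
```

With `"and"`, the message "trying out Elasticsearch" does not match "this is a test" because none of the terms occur; with `"or"`, a message only needs to share one term with the query.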
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query" : {
    "span_first" : {
      "match" : {
        "span_term" : { "user" : "kimchy" }
      },
      "end" : 3
    }
  }
}'
- Matches spans near the beginning of a field (here, the term “kimchy” within the first three positions of the user field)
- Specify the field and value as an object
- end is the maximum end position of the match, counted in terms rather than characters
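A small sketch of the position check behind `span_first`, assuming whitespace tokenization and a single-term span. Positions are counted in terms, so `end` bounds where in the token sequence the match may finish. `SpanFirstSketch` and `spanFirst` are illustrative names only.

```scala
object SpanFirstSketch {
  // True when `term` occurs as a token whose span ends within the first `end` positions.
  def spanFirst(fieldText: String, term: String, end: Int): Boolean = {
    val tokens = fieldText.toLowerCase.split("\\s+").filter(_.nonEmpty)
    // A single-term span at index i covers positions [i, i + 1), so it qualifies when i + 1 <= end,
    // which is the same as the term appearing among the first `end` tokens.
    tokens.take(end).contains(term.toLowerCase)
  }
}
```

For example, with `end` set to 3 the term must be the first, second, or third token of the field; a fourth-position occurrence is rejected.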
- Query only enough to roughly find what you want.
- Use the FILTER DSL to project/transform/pare down search results.
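The split between querying and filtering can be pictured with an in-memory collection: the filter is a cheap yes/no test with no score, and only the survivors are scored and ranked. The `Tweet` type and the word-count score below are assumptions for illustration, not Elasticsearch's scoring.

```scala
object FilterSketch {
  final case class Tweet(user: String, message: String)

  // Toy relevance score: how many message tokens appear in the set of query words.
  def score(t: Tweet, queryWords: Set[String]): Int =
    t.message.toLowerCase.split("\\s+").count(queryWords.contains)

  // Filter first (no scoring involved), then score and rank what is left, best match first.
  def search(tweets: Seq[Tweet], user: String, queryWords: Set[String]): Seq[(Tweet, Int)] =
    tweets
      .filter(_.user == user)
      .map(t => t -> score(t, queryWords))
      .sortBy { case (_, s) => -s }
}
```

Because the filter step never computes a score, its result for a given user could be cached and reused across different queries.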
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
  "query" : {
    "match_all" : { }
  },
  "filter" : {
    "term" : { "user" : "kimchy" }
  }
}'
- Query DSL http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-queries.html
- Filter DSL http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filters.html
- HTTP and JSON are great
- But there are decent Java/Scala wrappers out there for Elasticsearch
- https://github.com/sksamuel/elastic4s
- Nice asynchronous DSL over the Java Elasticsearch API
import com.sksamuel.elastic4s.ElasticClient
import com.sksamuel.elastic4s.ElasticDsl._

object Test extends App {
  // get a client for a local node
  val client = ElasticClient.local
  // index one document with a "name" field into bands/singers
  client execute { index into "bands/singers" fields "name"->"chris martin" }
}
lazy val settings = ImmutableSettings
getConfigInt("elasticsearch.pingTimeout") +
lazy val client = ElasticClient
client.execute {
index into indexName -> institutionId fields (em.mapped(asset)) id asset.id
} map arc.convert
- em stands for “ElasticMapper”.
- arc stands for “ActionResponseConvertable”.
- All of the requests made to Elasticsearch return an ActionResponse.
- Common operations are split into their own local definitions for reuse.
// wrap up the client execution and return conversion logic
lazy val executeSearch: (SearchDefinition) =>
(ActionResponseConvertable[SearchResponse, AssetSearch]) =>
Future[AssetSearch] = (s) =>
(arc) => client.execute { s } map arc.convert
- Everything needs client.execute, and to have its response converted.
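The shape of that helper can be tried out without Elasticsearch. The sketch below swaps in hypothetical stand-ins: `SearchDefinition`, `SearchResponse`, and `AssetSearch` are plain case classes here, `ResponseConverter` plays the role of `ActionResponseConvertable`, and `fakeExecute` replaces `client.execute` with canned results.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ExecuteSketch {
  // Hypothetical stand-ins for the real request and response types.
  final case class SearchDefinition(index: String, query: String)
  final case class SearchResponse(hits: Seq[String])
  final case class AssetSearch(total: Int, ids: Seq[String])

  // Hypothetical converter, standing in for ActionResponseConvertable.
  trait ResponseConverter[A, B] { def convert(a: A): B }

  // Fake client call: returns two canned hits asynchronously instead of talking to a cluster.
  def fakeExecute(s: SearchDefinition)(implicit ec: ExecutionContext): Future[SearchResponse] =
    Future(SearchResponse(Seq(s"${s.index}-1", s"${s.index}-2")))

  // Curried helper: the search comes first, the converter second, and the result is
  // the converted response inside a Future.
  def executeSearch(s: SearchDefinition)(arc: ResponseConverter[SearchResponse, AssetSearch])(
      implicit ec: ExecutionContext): Future[AssetSearch] =
    fakeExecute(s).map(arc.convert)
}
```

Keeping execution and conversion in one place means every caller only has to supply a definition and a converter.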
// take an institutionId if we have one or none if we don't and return the
// correct starting search definition for both possibilities
lazy val maybeSearchInstitution: (Option[String]) => SearchDefinition =
(institutionId) => institutionId map { i =>
search in indexName -> i
} getOrElse (search in indexName)
- Institution ids are used as type names within the index
- Elasticsearch doesn’t require a type name to search, so this lets us search either one institution or the whole index
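The `map` / `getOrElse` pattern is easier to see when the result is something printable. This sketch assumes the `index -> type` reading of `indexName -> i` and renders the REST path each branch would hit; the index name `assets` and the names `MaybeSearchSketch` and `searchPath` are made up for the example.

```scala
object MaybeSearchSketch {
  // Assumed index name, for illustration only.
  val indexName = "assets"

  // With an institution id, narrow the search to that type; without one, search the whole index.
  def searchPath(institutionId: Option[String]): String =
    institutionId
      .map(i => s"/$indexName/$i/_search")
      .getOrElse(s"/$indexName/_search")
}
```

Callers pass `Some(id)` or `None` and never branch themselves; the choice of starting point lives in one function.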
// common query -- perform a fuzzy search on some field with some given value,
// with the optional institution id
lazy val fuzzySearch: (Option[String]) =>
(String) => (String) => SearchDefinition = (maybeInstitution) =>
(field) => (value) =>
maybeSearchInstitution(maybeInstitution) query
// take a field and a value and return a fuzzy query definition with our common settings
lazy val fuzzyQuery: (String) => (String) => MatchQueryDefinition = (field) =>
(value) => matchQuery(field, value).boost(4)
- Handling typos is common. MatchQuery does that with some limits on expansions and prefix length.
- Querying fuzzily is supported on multiple fields.
def searchFuzzyTitle(
institutionId: Option[String],
title: String,
size: Int,
offset: Int, sortField: String, sortDirection: String)(
implicit arc: ActionResponseConvertable[SearchResponse, AssetSearch]):
Future[AssetSearch] =
start offset limit size sortByScoreAnd
(sortField, sortDirection))(arc)
- Search for a fuzzy filename, start at the given record, limit it to n results, and sort the return by query score and the given fields and direction.
- fuzzySearch includes the call to maybeSearchInstitution.
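What "start at the given record, limit to n results, sort by score and then by a field" amounts to can be sketched on an in-memory list. `Hit`, `page`, and the `ascending` flag are hypothetical; in the real code the sorting and paging happen inside Elasticsearch.

```scala
object PagingSketch {
  final case class Hit(title: String, score: Double)

  // Order by score (best first), break ties by title in the requested direction,
  // then skip `offset` hits and keep at most `size` of them.
  def page(hits: Seq[Hit], offset: Int, size: Int, ascending: Boolean): Seq[Hit] =
    hits
      .sortWith { (a, b) =>
        if (a.score != b.score) a.score > b.score
        else if (ascending) a.title < b.title
        else a.title > b.title
      }
      .slice(offset, offset + size)
}
```

The secondary sort field only matters among hits with equal scores, which is why the score comparison comes first.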
- Don’t create connections to the client by accident. We did that. It eats your memory.
- Don’t query to reduce the results when a filter will do. Filters are cached and (in general) don’t compute scores. You have to use them when things get big.
- Elasticsearch is for search. Don’t use it as a primary datastore.
- Don’t rely on something being indexed and available immediately after you queue it to be indexed. ES is async and eventually consistent.
- Rebuild your index if you have to. Elasticsearch takes care of it, so if things start looking weird, just run a job to pull from your primary data store and reindex everything.
- Pool your Elasticsearch clients. We are moving to that.
- Separate indexing from searching. Your search should not wait on some indexing operation. Separate concerns.
- Elastic4s: https://github.com/sksamuel/elastic4s
- Plugins: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
- Analyzers: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-analyzers.html#default-analyzers
- Mappings: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html#all-mapping-types
- Percolation: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html#_percolate_api