class: center, middle

# Elasticsearch
## Filtering and Advanced Topics

---

## Queries Review: Simple Matching

Find documents with a given field containing a value.

```json
GET /library/_search
{
  "query": {
    "match": { "title": "overstory" }
  }
}
```

---

## Queries Review: Aggregations

Perform calculations on groups of documents.

```json
GET twitter/_search
{
  "aggs": {
    "avg_likes": {
      "avg": { "field": "likes" }
    }
  }
}
```

---

## Queries Review: Pagination

`size` limits the number of results returned; `from` sets the offset of the first result, which allows easy pagination.

```json
GET twitter/_search
{
  "query": { "match_all": {} },
  "from": 0,
  "size": 50
}
```

---

## Filtering

We can filter results using a `bool` query, which combines queries and filters:

```
GET twitter/_search
{
  "query": {
    "bool": {
      "must": { "match": { "name": "smith" } },
      "filter": {
        "range": { "likes": { "gte": 4 } }
      }
    }
  }
}
```

---

## Filtering

To apply only a filter, omit the `must` clause from the `bool` object:

```
GET twitter/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": { "likes": { "gte": 4 } }
      }
    }
  }
}
```

---

## Query Types

- `must` - results are required to satisfy the condition

--

- `must_not` - results must not satisfy the condition

--

- `should` - results that _do_ satisfy the condition are scored higher

---

## Query Types

These may be combined in complex ways:

```
GET twitter/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "smith" } },
        { "match": { "likes": 4 } }
      ],
      "must_not": {
        "match": { "tweet": "elastic" }
      }
    }
  }
}
```

---

## Filter Types

- `range` - specifies a range of values that a numeric field may have; we can specify `gt`, `gte`, `lt`, `lte`
- `terms` - specifies an array of values; a document is included in the result set if the field matches any of them

```
GET twitter/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": { "name": ["john", "mary"] }
      }
    }
  }
}
```

---

## Fuzzy Queries

**Fuzzy Queries** allow you to match misspellings

--

```
GET /twitter/_search
{
  "query": {
    "match": { "tweet": "relational" }
  }
}
```

Returns no results, even though there's
a tweet with a misspelling of "relational" ("rational").

---

## Fuzzy Queries

Closeness to the search term is calculated by **edit distance**: the number of single-character deletions, insertions, substitutions, or transpositions needed to turn one term into the other.

80% of human misspellings have an edit distance of only 1!

```
GET /twitter/_search
{
  "query": {
    "fuzzy": {
      "tweet": {
        "value": "relational",
        "fuzziness": 2
      }
    }
  }
}
```

---

## Fuzzy Queries

By default, the fuzziness for a query is `AUTO`, which generates a fuzziness value based on the length of the term (longer terms allow for more fuzziness). The largest edit distance Elasticsearch accepts is 2.

---

# Field Types

* Field datatypes include:
  * `text`, `keyword`, `date`, `boolean`, `ip`, `long`
  * Object or nested data, which support nesting JSON
  * Specialized types like `geo_point` and `geo_shape`

---

## Document Mapping

* A mapping defines how a document and its fields are stored and indexed

--

* You can think of this like a schema in SQL

--

* A field can be indexed multiple times in different ways. For example, a username might be treated as a text field for full-text search but also as a keyword field for sorting and aggregation.

---

## Working With Mappings

To view mappings for an index:

```
$ curl 'localhost:9200/twitter/_mapping?pretty'
```

---

## Creating A Mapping

```
PUT /twitter
{
  "mappings" : {
    "tweets" : {
      "properties" : {
        "date" : { "type" : "date" },
        "likes" : { "type" : "long" },
        "location" : { "type" : "geo_point" },
        [content omitted]
```

---

## Mapping Explosions

You can map data dynamically! Hey, we’ve been doing it!

* With dynamic mapping, new field names are added to the mapping automatically. If you don’t define a mapping for a field, ES infers one from the data you send.

--

* Some field types won’t be mapped correctly automatically (like geopoints).
--

* This is very powerful but can get out of hand, leading to resource issues, referred to as **mapping explosions**

--

* You can limit the number of fields and how deeply they can be nested to prevent such issues

---

## When Data Changes

You can add new fields to an existing mapping, but to change how an existing field is mapped you must **reindex**

--

The `_reindex` API copies documents into a new index that has the new mapping:

```
POST _reindex
{
  "source": { "index": "book" },
  "dest": { "index": "book_" }
}
```

---

## Reindexing Notes

* `_reindex` doesn’t actually change data in place. **The mapping of an existing field is immutable.**
* It copies data from an old index into a new one quickly. You still have to set up the new index, with its new mapping, first.

---

## Geopoints

A field of type `geo_point` can be represented in one of three ways:

1. A string representation, with `"lat,lon"`.
1. An object representation with `lat` and `lon` explicitly named.
1. An array representation with `[lon,lat]`.

_Geopoints are not automatically recognized by dynamic mapping, and must be mapped explicitly_

---

## Geo Queries

* There are several ways to calculate physical distance in searches

--

* Save latitude and longitude on the object (`geo_point` field)
* Calculate precisely based on lat/long of a second point - `geo_distance` query
* Geo-bounding, faster - divide the map into shaped areas (`geo_shape` field) and determine if your object is inside an area - `geo_bounding_box` or `geo_polygon` query

---

## Geodistance Example

```
GET /twitter/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "500km",
          "location": {
            "lat": 34.052235,
            "lon": -118.243683
          }
        }
      }
    }
  }
}
```

---

## Geo Aggregations

* `geo_bounds`: computes a bounding box containing all geo points for a field (e.g. give me an area containing all the concert venues in this collection of businesses)
* `geo_centroid`: calculates the centroid (geometric center) of all geo points for a field (e.g. what's the centroid among a collection of restaurants serving ice cream?)
* `geo_distance`: you define an origin point and a set of distance ranges, and it groups documents into buckets by their distance from the origin (e.g. what restaurants are within 10 miles of me? 25 miles?)

---

## Other Features

* You can **boost** one field over another
  * For example, you might specify that location matters more than price range, so increase the score on location matches
  * The default `boost` value is 1. To prioritize a given field, give it a boost value greater than 1

--

* **Highlighting** can be applied using the `highlight` parameter. This adds a `highlight` field to each matching hit, with the matching text wrapped in `<em>` tags by default

---

## Parent/Child Relationships

* Documents that are related should be stored in a parent/child relationship.
* There will be one document that has nested JSON inside of it. For example:
  * For each airport, there are many routes.
  * Instead of two tables, each airport will contain the JSON for all of its routes.

---

## Parent/Child Example

```json
{
  "_source": {
    "name": "Port Moresby Jacksons International Airport",
    "airport_id": 5,
    "elevation_ft": 146,
    "routes": [
      {
        "src": "STL",
        "dst": "LAX",
        "airline": "NZ"
      }
    ]
  }
}
```

---

## What about Joins?

* Elasticsearch is composed of only documents, no tables.
* Joining with other documents is technically possible, but not recommended.

---

## _id

* Lucene/ES uses a key `_id`, just like many other databases
* Auto-generated IDs look like a hash, not sequential numbers
  * This is to avoid collisions: two records accidentally being assigned the same `_id` because two nodes tried to write at the same time
  * Since data is potentially distributed to many servers, managing unique IDs is complicated
* We can assign an ID to an object upon creation, but it's usually best to let ES handle ID assignment

---

## Further Into ES

We've only presented the most commonly used features of ES.

There is much more that we haven't covered, but with an understanding of these core features you can work with ES effectively while adding to your knowledge.
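---

## Appendix: Edit Distance Sketch

The edit distance behind fuzzy queries can be illustrated in a few lines of Python. This is a minimal sketch of one common variant (optimal string alignment, where an adjacent transposition counts as one edit). The function name `edit_distance` is made up for this slide, and Elasticsearch's own implementation is different and far more optimized.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, substitutions,
    and adjacent transpositions needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # adjacent transposition, e.g. "mi" <-> "im"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("relational", "rational"))  # 2
print(edit_distance("smith", "simth"))          # 1
```

"rational" is two deletions away from "relational", so a fuzzy query needs a fuzziness of at least 2 to match it.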