WHERE/IN query in Elasticsearch

Sep 9, 2019

Introduction

People coming to Elasticsearch from the SQL world have questions regarding how to accomplish something in Elasticsearch that can be done by an SQL statement in a relational database. One such question is the WHERE/IN query. In the SQL world, this query looks like this:

SELECT * FROM titles WHERE title IN ('Engineer', 'Senior Engineer');

Of course, there can be more entries in the IN list.

Alternative 1 - Using the terms query

There are several ways to achieve this in Elastic search. One is the terms query which is specifically meant for these kinds of searches. It looks like this:

{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
              "Country Name.keyword": ["Afghanistan", "Albania", "Algeria"]
          }
        }
      ]
    }
  }
}

Which finds all three countries as requested.

...
"hits": [
  {
    "_index": "diabetes-prevalence.csv",
    "_type": "doc",
    "_id": "2",
    "_score": 0.0,
    "_source": {
      "Country Name": "Afghanistan",
      "Country Code": "AFG",
      "Indicator Name": "Diabetes prevalence (% of population ages 20 to 79)",
      "Indicator Code": "SH.STA.DIAB.ZS",
      "1960": "",
      "2017": 9.59
    }
  },
  {
    "_index": "diabetes-prevalence.csv",
    "_type": "doc",
    "_id": "4",
    "_score": 0.0,
    "_source": {
      "Country Name": "Albania",
      "Country Code": "ALB",
      "Indicator Name": "Diabetes prevalence (% of population ages 20 to 79)",
      "Indicator Code": "SH.STA.DIAB.ZS",
      "1960": "",
      "2017": 10.08
    }
  },
  {
    "_index": "diabetes-prevalence.csv",
    "_type": "doc",
    "_id": "59",
    "_score": 0.0,
    "_source": {
      "Country Name": "Algeria",
      "Country Code": "DZA",
      "Indicator Name": "Diabetes prevalence (% of population ages 20 to 79)",
      "Indicator Code": "SH.STA.DIAB.ZS",
      "1960": "",
      "2017": 6.73
    }
  }
]
...

Note that there can be a maximum of 65,536 entries in the array of items. To increase this value, you can change the Elastic searches parameter index.max_terms_count setting.

Searching integer fields

You can also use the terms query with integer fields. Just drop the .keyword from the field name.

{
  "query": {
    "bool": {
      "filter": [
        {
            "terms": {
                "emp_no": [10001, 10002, 10003]
            }
        }
      ]
    }
  }
}

As expected, the specified employees are located.

"hits": [
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "1",
    "_score": 0.0,
    "_source": {
      "emp_no": 10001,
      "birth_date": "1953-09-01T18:30:00.000Z",
      "first_name": "Georgi",
      "last_name": "Facello",
      "gender": null,
      "hire_date": "1986-06-25T18:30:00.000Z"
    }
  },
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "2",
    "_score": 0.0,
    "_source": {
      "emp_no": 10002,
      "birth_date": "1964-06-01T18:30:00.000Z",
      "first_name": "Bezalel",
      "last_name": "Simmel",
      "gender": false,
      "hire_date": "1985-11-20T18:30:00.000Z"
    }
  },
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "3",
    "_score": 0.0,
    "_source": {
      "emp_no": 10003,
      "birth_date": "1959-12-02T18:30:00.000Z",
      "first_name": "Parto",
      "last_name": "Bamford",
      "gender": null,
      "hire_date": "1986-08-27T18:30:00.000Z"
    }
  }
]

Alternative 2 - Using the SHOULD clause of BOOL query

Another method to perform a WHERE/IN type of search is to use the SHOULD clause of a BOOL query. Here is an example.

{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "last_name.keyword": "Brizzi"
                }
              },
              {
                "term": {
                  "last_name.keyword": "Danner"
                }
              },
              {
                "term": {
                  "last_name.keyword": "Krone"
                }
              }
            ]
          }
        }
      ]
    }
  }
}

And the results:

"hits": [
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "1",
    "_score": 0.0,
    "_source": {
      "emp_no": 10001,
      "birth_date": "1953-09-01T18:30:00.000Z",
      "first_name": "Georgi",
      "last_name": "Facello",
      "gender": null,
      "hire_date": "1986-06-25T18:30:00.000Z"
    }
  },
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "2",
    "_score": 0.0,
    "_source": {
      "emp_no": 10002,
      "birth_date": "1964-06-01T18:30:00.000Z",
      "first_name": "Bezalel",
      "last_name": "Simmel",
      "gender": false,
      "hire_date": "1985-11-20T18:30:00.000Z"
    }
  },
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "3",
    "_score": 0.0,
    "_source": {
      "emp_no": 10003,
      "birth_date": "1959-12-02T18:30:00.000Z",
      "first_name": "Parto",
      "last_name": "Bamford",
      "gender": null,
      "hire_date": "1986-08-27T18:30:00.000Z"
    }
  }
]

The advantage of this approach is that you can add other conditions to the query to get the exact results you need by combining with other clauses of the BOOL query such as MUST_NOT, FILTER, etc.

Searching date fields

To perform an SQL-type WHERE/IN query on date fields, you can use the SHOULD clause as above, but with RANGE queries.

  • To simulate an equals conditon, you can use gte and lte with the same value.

  • To drop the time-part of the date, you can specify a format without the time. Here is an example:

{
  "query": {
    "bool": {
      "should": [
        {
          "range": {
            "hire_date": {
              "format": "yyyy-MM-dd",
              "lte": "1993-03-13",
              "gte": "1993-03-13"
            }
          }
        },
        {
          "range": {
            "hire_date": {
              "format": "yyyy-MM-dd",
              "lte": "1986-10-25",
              "gte": "1986-10-25"
            }
          }
        },
        {
          "range": {
            "hire_date": {
              "format": "yyyy-MM-dd",
              "lte": "1991-11-04",
              "gte": "1991-11-04"
            }
          }
        }
      ]
    }
  }
}

And the (partial) results:

...
"hits": [
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "21926",
    "_score": 1.0,
    "_source": {
      "emp_no": 31926,
      "birth_date": "1960-02-09T18:30:00.000Z",
      "first_name": "Aamod",
      "last_name": "Cheshire",
      "gender": null,
      "hire_date": "1993-03-13T18:30:00.000Z"
    }
  },
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "22276",
    "_score": 1.0,
    "_source": {
      "emp_no": 32276,
      "birth_date": "1959-08-18T18:30:00.000Z",
      "first_name": "Lillian",
      "last_name": "Seuren",
      "gender": false,
      "hire_date": "1986-10-25T18:30:00.000Z"
    }
  },
  {
    "_index": "employees.csv",
    "_type": "doc",
    "_id": "22372",
    "_score": 1.0,
    "_source": {
      "emp_no": 32372,
      "birth_date": "1955-04-02T18:30:00.000Z",
      "first_name": "Randy",
      "last_name": "Nitsch",
      "gender": false,
      "hire_date": "1986-10-25T18:30:00.000Z"
    }
  },
...

WHERE/NOT IN query

To exclude matches from query results, SQL provides the WHERE/NOT IN clause which looks something like this:

SELECT count(*)
FROM   employees
WHERE  first_name NOT IN ('Randy', 'Nathan');

You can do the same in Elasticsearch using a query of the form (for non-string fields, remove the .keyword):

{
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "first_name.keyword": "Randy"
          }
        },
        {
          "term": {
            "first_name.keyword": "Nathan"
          }
        }
      ]
    }
  }
}

Same thing can also be done using the following query:

{
  "query": {
    "bool": {
      "must_not": [
        {
          "terms": {
            "first_name.keyword": ["Randy", "Nathan"]
          }
        }
      ]
    }
  }
}

Multivalued search in Argon

Argon provides the MultiSelect dialog which allows you to easily perform WHERE/IN and WHERE/NOT IN_ searches.

1. Right-click for context menu on a field value. Choose Select Multiple from the menu.

(click for larger image)

2. Select values from the left pane, and click the arrow button to add to the current selection.

(click for larger image)

3. Toggle between Include Selection and Exclude Selection. Click Apply.

(click for larger image)