Elasticsearch Aggregations

Sep 24, 2019

Introduction

An aggregation is a summary of raw data for the purpose of obtaining insights from the data. When you have many bits of raw data (for example, time spent by each driver at a traffic signal) it is difficult to get meaningful insights from any one piece of data. In such cases, it is more relevant to look at the data as a whole, and to derive insights from summarized data. That is what data aggregation hopes to achieve.

Elasticsearch has extensive support for aggregations, some of which we cover in the following sections.

Bucket and Metric aggregations

There are two commonly used types of aggregations in Elasticsearch: bucket aggregations and metric aggregations.

A bucket aggregation is used to group data into one or more buckets based on a set of criteria. For example, suppose you have a list of car models along with their types, such as sedan, hatchback, truck, etc. To group these car models by type, you would use a bucket aggregation.

A metric aggregation is used to compute a metric over a set of documents. For example, to compute the average weight of a group of students, you would use the average metric aggregation.

It is possible and very common to use these two types of aggregations in the same query. For example, suppose you want to group students by ethnicity, and then compute a metric for each group. You would use a nested aggregation: the outer bucket aggregation groups the data into buckets, and the inner metric aggregation computes a metric for each bucket.
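To make the nesting concrete, here is a minimal sketch in Python that builds such a request body. The field names (ethnicity.keyword, weight) and the aggregation names (by_group, average_value) are assumptions chosen for illustration; they do not come from the dataset used in the rest of this article.

```python
import json


def grouped_average_request(group_field: str, metric_field: str) -> dict:
    """Build a request body that buckets documents by group_field and
    computes the average of metric_field inside each bucket."""
    return {
        "size": 0,  # return only the aggregation results, no hits
        "aggs": {
            "by_group": {  # hypothetical aggregation name
                "terms": {"field": group_field},  # bucket aggregation
                "aggs": {
                    "average_value": {  # hypothetical aggregation name
                        "avg": {"field": metric_field}  # metric aggregation
                    }
                },
            }
        },
    }


# Field names below are assumptions for illustration only.
body = grouped_average_request("ethnicity.keyword", "weight")
print(json.dumps(body, indent=2))
```

The inner aggs object sits next to the terms definition, which is how Elasticsearch knows the metric should be computed once per bucket rather than once over all documents.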

Format of an aggregation request

An aggregation can appear alone without an associated query, in which case the aggregation is computed for all documents in the index. It can also be combined with a query, in which case the aggregation is applied only to the documents matching the query.

The following is the format of an aggregation request without a query. The aggregations object (aggs for short) defines an aggregation named count, which applies the value_count metric to the field first_name.keyword. We also set size to 0 to indicate that we are not interested in the hits themselves, only in the aggregation results.

{
  "aggs": {
    "count": {
      "value_count": {
        "field": "first_name.keyword"
      }
    }
  },
  "size": 0
}

To compute an aggregation over the results of a query, the query is also included like this:

{
  "query": {
    [Body of the query]
  },
  "aggs": {
    "value_count": {
      "value_count": {
        "field": "first_name.keyword"
      }
    }
  },
  "size": 0
}
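If you build these requests from code, the two formats differ only in whether a query key is present. The following is a minimal sketch of a helper that covers both cases; the function name value_count_request is hypothetical, and the prefix query is just one example of a query body.

```python
import json
from typing import Optional


def value_count_request(field: str, query: Optional[dict] = None) -> dict:
    """Build a value_count aggregation request, optionally restricted by a query."""
    body = {
        "aggs": {"count": {"value_count": {"field": field}}},
        "size": 0,  # we only want the aggregation, not the hits
    }
    if query is not None:
        body["query"] = query  # aggregation then runs only over matching documents
    return body


# Aggregation over all documents in the index.
all_docs = value_count_request("first_name.keyword")

# Aggregation over documents matching a query.
filtered = value_count_request(
    "first_name.keyword", {"prefix": {"first_name": "ann"}}
)
print(json.dumps(filtered, indent=2))
```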

Counting the number of values in a field

The value_count is a metric aggregation which counts the number of values of the specified field.

In the example below, we have a value_count aggregation which counts the number of values of the field first_name.

{
  "aggs": {
    "value_count": {
      "value_count": {
        "field": "first_name.keyword"
      }
    }
  },
  "size": 0
}

The result tells us there are 300024 rows with a value present for the field first_name.

...
  "aggregations": {
    "value_count": {
      "value": 300024
    }
  }
...

The following example adds a query so that the aggregation is restricted to the hits from the query. Here, we count how many people in the employees.csv index have a first_name starting with ann.

{
  "query": {
    "bool": {
      "must": [
        {
          "prefix": {
            "first_name": "ann"
          }
        }
      ]
    }
  },
  "aggs": {
    "value_count": {
      "value_count": {
        "field": "first_name.keyword"
      }
    }
  },
  "size": 0
}

And we find the answer to be 679.

...
  "aggregations": {
    "value_count": {
      "value": 679
    }
  }
...

Number of unique values in a field

While the total number of values in a field is certainly useful, sometimes we need to know how many unique values there are in a field. You can use the cardinality metric aggregation for this purpose.

{
  "aggs": {
    "unique_count": {
      "cardinality": {
        "field": "first_name.keyword"
      }
    }
  },
  "size": 0
}

This tells us there are 1275 unique values in the field.

...
  "aggregations": {
    "unique_count": {
      "value": 1275
    }
  }
...

Suppose we want to know how many unique names are present in this field with names starting with a prefix such as ann.

{
  "query": {
    "bool": {
      "must": [
        {
          "prefix": {
            "first_name": "ann"
          }
        }
      ]
    }
  },
  "aggs": {
    "unique_count": {
      "cardinality": {
        "field": "first_name.keyword"
      }
    }
  },
  "size": 0
}

The following response tells us that out of a total of 679 hits which matched the query (first_name starting with ann), Elasticsearch found a total of 3 unique values.

  "hits": {
    "total": 679,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "unique_count": {
      "value": 3
    }
  }
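The difference between value_count and cardinality can be illustrated locally with a few lines of Python. This is only an analogy on made-up data, not how Elasticsearch computes the result: in particular, Elasticsearch's cardinality aggregation returns an approximate count on fields with many distinct values, whereas the set-based count below is exact.

```python
# Made-up sample of first names, for illustration only.
names = ["Ann", "Anna", "Ann", "Anne", "Anna", "Ann"]

value_count = len(names)       # counts every value, duplicates included
cardinality = len(set(names))  # counts distinct values only

print(value_count, cardinality)  # prints "6 3"
```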

Unique values in a field

Counting the number of unique values in a field, while interesting, is frequently insufficient. What we also need is to find the actual unique values in a field.

To find the top few unique values in a field (by occurrence count), we use the terms bucket aggregation. It requires a property named field.

{
  "aggs": {
    "unique_names": {
      "terms": {
        "field": "first_name.keyword"
      }
    }
  },
  "size": 0
}

And here we have the unique values in the field, sorted by occurrence count.

...
"aggregations": {
  "unique_names": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 297245,
    "buckets": [
      {
        "key": "Shahab",
        "doc_count": 295
      },
      {
        "key": "Tetsushi",
        "doc_count": 291
      },
      {
        "key": "Elgin",
        "doc_count": 279
      },
      {
        "key": "Anyuan",
        "doc_count": 278
      },
      {
        "key": "Huican",
        "doc_count": 276
      },
      {
        "key": "Make",
        "doc_count": 275
      },
      {
        "key": "Panayotis",
        "doc_count": 272
      },
      {
        "key": "Sreekrishna",
        "doc_count": 272
      },
      {
        "key": "Hatem",
        "doc_count": 271
      },
      {
        "key": "Giri",
        "doc_count": 270
      }
    ]
  }
}
...

Since only the top 10 unique values are returned by default, there are likely many more values that are not shown. While the doc_count property gives the number of records with that specific value, the field sum_other_doc_count gives the number of hits that had values other than the ones returned.
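As a quick sanity check, the doc_count values of the returned buckets plus sum_other_doc_count should add up to the total number of values counted earlier. The sketch below does this arithmetic on the response shown above, assuming each document holds exactly one first_name value and that doc_count_error_upper_bound is 0, as it is here.

```python
# The terms aggregation response from above, copied into a Python dict.
response = {
    "aggregations": {
        "unique_names": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 297245,
            "buckets": [
                {"key": "Shahab", "doc_count": 295},
                {"key": "Tetsushi", "doc_count": 291},
                {"key": "Elgin", "doc_count": 279},
                {"key": "Anyuan", "doc_count": 278},
                {"key": "Huican", "doc_count": 276},
                {"key": "Make", "doc_count": 275},
                {"key": "Panayotis", "doc_count": 272},
                {"key": "Sreekrishna", "doc_count": 272},
                {"key": "Hatem", "doc_count": 271},
                {"key": "Giri", "doc_count": 270},
            ],
        }
    }
}

agg = response["aggregations"]["unique_names"]
returned = sum(bucket["doc_count"] for bucket in agg["buckets"])
total = returned + agg["sum_other_doc_count"]

print(returned, total)  # prints "2779 300024"
```

The total of 300024 matches the result of the value_count aggregation on the same field.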

You can also use the property size to indicate how many top terms you would like. As mentioned above, it defaults to 10. You can, of course, specify a larger number. Here we have specified 20.

{
  "aggs": {
    "unique_names": {
      "terms": {
        "field": "first_name.keyword",
        "size": 20
      }
    }
  },
  "size": 0
}

And we get up to 20 names back.

...
  "aggregations": {
    "unique_names": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 294554,
      "buckets": [
        {
          "key": "Shahab",
          "doc_count": 295
        },
        {
          "key": "Tetsushi",
          "doc_count": 291
        },
        {
          "key": "Elgin",
          "doc_count": 279
        },
        {
          "key": "Anyuan",
          "doc_count": 278
        },
...

But what if you want all the unique values? Should you just specify an arbitrarily large number for the size?

Absolutely not! You should ask only for whatever you need, and if you happen to require all the unique values, you should use a different aggregation type (composite aggregation, see below).

Obtaining all unique values in a field

To obtain all the unique values in a field, Elasticsearch recommends that you use a composite aggregation. It works by wrapping the terms aggregation inside a composite aggregation. The procedure for reading all values is as follows:

Step 1. You start the request specifying how many unique values you can handle in one batch. Here, we have specified a value of 5 for the size to illustrate how things work.

{
  "aggs": {
    "keys": {
      "composite": {
        "size": 5,
        "sources": [
          {
            "last_name.keyword": {
              "terms": {
                "field": "last_name.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}

The response to this request is shown below.

...
  "aggregations": {
    "keys": {
      "after_key": {
        "last_name.keyword": "Akaboshi"
      },
      "buckets": [
        {
          "key": {
            "last_name.keyword": "Aamodt"
          },
          "doc_count": 205
        },
        {
          "key": {
            "last_name.keyword": "Acton"
          },
          "doc_count": 189
        },
        {
          "key": {
            "last_name.keyword": "Adachi"
          },
          "doc_count": 221
        },
        {
          "key": {
            "last_name.keyword": "Aingworth"
          },
          "doc_count": 172
        },
        {
          "key": {
            "last_name.keyword": "Akaboshi"
          },
          "doc_count": 199
        }
      ]
    }
  }
...

Step 2. To get the next batch of unique values, you take the after_key from the previous response (here, the key of the last bucket returned) and include it in the next request under the after property as shown below.

{
  "aggs": {
    "keys": {
      "composite": {
        "sources": [
          {
            "last_name.keyword": {
              "terms": {
                "field": "last_name.keyword"
              }
            }
          }
        ],
        "after": {
          "last_name.keyword": "Akaboshi"
        },
        "size": 5
      }
    }
  },
  "size": 0
}

And you have the next batch of results. You then process this batch and repeat the procedure until no more names are returned (buckets is empty).

...
  "aggregations": {
    "keys": {
      "after_key": {
        "last_name.keyword": "Alblas"
      },
      "buckets": [
        {
          "key": {
            "last_name.keyword": "Akazan"
          },
          "doc_count": 207
        },
        {
          "key": {
            "last_name.keyword": "Akiyama"
          },
          "doc_count": 170
        },
        {
          "key": {
            "last_name.keyword": "Alameldin"
          },
          "doc_count": 195
        },
        {
          "key": {
            "last_name.keyword": "Albarhamtoshy"
          },
          "doc_count": 201
        },
        {
          "key": {
            "last_name.keyword": "Alblas"
          },
          "doc_count": 193
        }
      ]
    }
  }
...
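The two steps above can be wrapped in a loop. The following is a minimal sketch in Python, not a definitive implementation: the search parameter is a hypothetical callable that sends a request body to Elasticsearch and returns the parsed JSON response, and fake_search is a made-up stand-in that serves a small in-memory sample so the loop can run without a cluster.

```python
from typing import Callable, Iterator, Optional, Tuple


def composite_request(field: str, size: int, after: Optional[dict] = None) -> dict:
    """Build one page of the composite aggregation request."""
    composite = {"size": size, "sources": [{field: {"terms": {"field": field}}}]}
    if after is not None:
        composite["after"] = after  # resume after the last key already seen
    return {"aggs": {"keys": {"composite": composite}}, "size": 0}


def iter_unique_values(
    search: Callable[[dict], dict], field: str, size: int = 5
) -> Iterator[Tuple[str, int]]:
    """Yield (value, doc_count) pairs, one batch at a time, until exhausted."""
    after = None
    while True:
        response = search(composite_request(field, size, after))
        agg = response["aggregations"]["keys"]
        for bucket in agg["buckets"]:
            yield bucket["key"][field], bucket["doc_count"]
        after = agg.get("after_key")
        if not agg["buckets"] or after is None:
            break  # no more unique values to read


def fake_search(body: dict) -> dict:
    """Hypothetical stand-in for a real search call, backed by sample data."""
    data = {"Aamodt": 205, "Acton": 189, "Adachi": 221, "Aingworth": 172,
            "Akaboshi": 199, "Akazan": 207, "Akiyama": 170}
    composite = body["aggs"]["keys"]["composite"]
    field = next(iter(composite["sources"][0]))
    after = composite.get("after", {}).get(field)
    keys = sorted(k for k in data if after is None or k > after)[: composite["size"]]
    agg = {"buckets": [{"key": {field: k}, "doc_count": data[k]} for k in keys]}
    if keys:
        agg["after_key"] = {field: keys[-1]}
    return {"aggregations": {"keys": agg}}


values = list(iter_unique_values(fake_search, "last_name.keyword", size=5))
print(values)
```

Because iter_unique_values is a generator, only one batch is held in memory at a time, which is the point of paging through the values instead of asking for them all at once.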

Top unique field values using Argon

Argon allows you to easily obtain the top unique values in a field with a few clicks. Let us see how to do it.

1. In the Explorer View, right-click on the field you want to analyze and choose Select Multiple.


2. Click the Top/Bottom Values tab.

You will get a list of the top values of the field. You can choose how many unique values you wish to view by selecting the dropdown at the bottom.


3. Browse all values from the All Values tab.

Select the All Values tab to browse all the unique values, along with the counts.

  • You can use the navigator at the bottom to page through the values.

  • You can also enter a value in the search box to search for a value.
