Histogram aggregation in Elasticsearch

Sep 22, 2019

Introduction

A histogram is a representation of numerical data grouped into contiguous groups based on the frequency of occurence. It is used to indicate the distribution of data. For example, height of subjects in a population is a continuous variable (meaning it can take all values within its range), and it can be grouped into ranges such as: less than 5 feet, 5 feet to 5 feet 30 inches, and so on. The number of instances in each group is a histogram of its distribution.

The histogram aggregation

Let us look at how to generate a histogram using Elasticsearch. An example histogram request looks like this:

{
  "aggs": {
    "histogram": {
        "histogram": {
        "field": "Value",
        "interval": 200
        }
    }
  }
}

Using the stats aggregation to determine the interval

As shown above, the histogram aggregation requires an interval parameter which determines how many classes the histogram represents.

To determine a suitable interval, we need to know the minimum and maximum values on the field. These can be determined using the stats aggregation as follows:

{
  "aggs": {
    "stats": {
      "stats": {
        "field": "Value"
      }
    }
  },
  "size": 0
}

The stats aggregation is returned for the field Value as follows:

"aggregations": {
  "stats": {
    "count": 209,
    "min": 1879.0,
    "max": 3768.0,
    "avg": 2862.8803827751194,
    "sum": 598342.0
  }
}

Assuming we want about 10 buckets in the histogram, we choose 200 for the interval parameter.

A histogram example

Le us now look at an example. For our example, we have FAO data presenting the calories consumed by each person per day in each countries. The following query extracts the raw data for the year 2013.

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Year": "2013"
          }
        },
        {
          "term": {
            "Element.keyword": "Food supply (kcal/capita/day)"
          }
        },
        {
          "term": {
            "Item.keyword": "Grand Total"
          }
        }
      ]
    }
  }
}

Which returns data as follows:

"hits": [
  {
    "_index": "foodsupply_crops_e_all_data_(normalized).csv",
    "_type": "doc",
    "_id": "491193",
    "_score": 7.691229,
    "_source": {
      "Area Code": 27,
      "Area": "Bulgaria",
      "Item Code": 2901,
      "Item": "Grand Total",
      "Element Code": 664,
      "Element": "Food supply (kcal/capita/day)",
      "Year Code": 2013,
      "Year": 2013,
      "Unit": "kcal/capita/day",
      "Value": 2829.0,
      "Flag": "Fc"
    }
  },
  {
    "_index": "foodsupply_crops_e_all_data_(normalized).csv",
    "_type": "doc",
    "_id": "514831",
    "_score": 7.691229,
    "_source": {
      "Area Code": 233,
      "Area": "Burkina Faso",
      "Item Code": 2901,
      "Item": "Grand Total",
      "Element Code": 664,
      "Element": "Food supply (kcal/capita/day)",
      "Year Code": 2013,
      "Year": 2013,
      "Unit": "kcal/capita/day",
      "Value": 2720.0,
      "Flag": "Fc"
  }
...

Combining this query with the histogram aggregation as shown above, we have the following request:

{
    "query": {
    "bool": {
        "must": [
        {
            "term": {
            "Year": "2013"
            }
        }, 
        {
            "term": {
            "Element.keyword": "Food supply (kcal/capita/day)"
            }
        }, 
        {
            "term": {
            "Item.keyword": "Grand Total"
            }
        }
        ]
    }
    },
    "aggs": {
    "histo": {
        "histogram": {
        "field": "Value",
        "interval": 200
        }
    }
    },
    "size": 0
}

We have turned off the hits comprising the aggregation by setting size to 0. Executing this aggregation results in the following response:

"aggregations": {
  "histo": {
    "buckets": [
      {
        "key": 1800.0,
        "doc_count": 2
      },
      {
        "key": 2000.0,
        "doc_count": 11
      },
      {
        "key": 2200.0,
        "doc_count": 18
      },
      {
        "key": 2400.0,
        "doc_count": 29
      },
      {
        "key": 2600.0,
        "doc_count": 37
      },
      {
        "key": 2800.0,
        "doc_count": 30
      },
      {
        "key": 3000.0,
        "doc_count": 28
      },
      {
        "key": 3200.0,
        "doc_count": 29
      },
      {
        "key": 3400.0,
        "doc_count": 18
      },
      {
        "key": 3600.0,
        "doc_count": 7
      }
    ]
  }
}

Displaying a histogram in Argon

Argon provides an easy way to view the histogram distribution of a field.

1. After loading the data in the Explorer View, right click on the column to pull up the context menu.

Choose Distribution from the context menu.

(click for larger image)

2. The Distribution box shows the histogram.

  • Minimum and maximum values applicable are shown.

  • You can change the number of classes to divide the range into.

  • The interval changes appropriately when the number of classes are changed.

  • The histogram is updated automatically. For each range, it shows the number of records than fall within that range.

(click for larger image)

3. Resize the Distribution box window to update the display.

(click for larger image)

Some Example Histograms

In the following sections we present some example histograms using data from the Worldbank Development Indicators.

Alcohol Consumption across countries

The following is the distribution of per-capita alcohol consumption across countries for the year 2016.

What it means: 38 countries had an alcohol consumption rate of upto 2 litres per year. 36 countries consumed between 6 and 8 litres of pure alcohol per year. The highest was 2 countries consumed between 16 and 18 litres of alcohol per person per year. All numbers for 2016.

(click for larger image)

Mortality rate due to air pollution

This chart represents the death rate (per 100,000 population) due to household and ambient air pollution in 2016.

What it means: Upto 40 people per 100,000 population died in 65 countries due to household and ambient air pollution. In 3 countries the death was between 280 and 360 per 100,000 population.

(click for larger image)

Tax revenue as percent of GDP

Tax revenue refers to compulsory transfers to the central government for public purposes. Certain compulsory transfers such as fines, penalties, and most social security contributions are excluded. Refunds and corrections of erroneously collected tax revenue are treated as negative revenue.

What it means: 34 countries have a tax revenue between 12 and 15% of GDP. 3 countries have the lowest rates of 0-4% of GDP and 4 countries can be considered heavily taxed at between 28-32% of GDP.

(click for larger image)

Suicide mortality rate

Presented below are the number of suicide deaths in a year per 100,000 population.

What it means: 32 countries have a suicide death rate of 0-4 persons per 100,000 population, while 62 countries have a rate of 4-8 persons per 100,000 population. 3 countries have the highest rates of 28-32 persons.

(click for larger image)

DPT immunication rates of children

Next up is the rate of Immunization of children against DPT. The measure is the percent of children ages 12-23 months.

What it means: Thankfully there are no countries which do not immunize against DPT due to whatever misguided notion.The highest rates are 67 and 66 countries which cover between 88% to 100% of the children.

(click for larger image)

Diabetes prevalence rates

Raes of diabetes prevalence among people ages 20-79 who have type 1 or type 2 diabetes.

What it means: 19 countries have the lowest diabetes prevalence rates between 0-3% while 3 countries have the highest prevalence rates between 24-33%. The largest number of countries (66) have a rate of between 6-9%.

(click for larger image)

Losses due to power failure

This measure represents loss of sales due to electrical outages as a percentage of total annual sales, and represent average losses for all firms which reported outages.

What it means: 6 countries reported loss of sales between 0-2% in 2017, while 1 country reported a loss of 14-16% of the total sales.

(click for larger image)

Using data from global land temperatures tracked from 1750 through 2015, we prepared the following histograms for each 50 years period. If you observe these 50-year trends of the temperature, you can see that histogram moves slowly towards higher temperatures with each period, indicating a rise in global warming.

1. 1751-1800

(click for larger image)

2. 1801-1850

(click for larger image)

3. 1851-1900

(click for larger image)

4. 1901-1950

(click for larger image)

5. 1951-2000

(click for larger image)

6. 2001-2015

(click for larger image)