ES - Aggregations

beomsic 2022. 9. 22. 17:23

Aggregations

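Elasticsearch is used not only as a search engine but also as a data system for many other purposes, log analysis among them.

This is possible because, beyond simply searching data, Elasticsearch can run various computations over it through its aggregation feature.

Kibana relies on aggregations when it visualizes data as bar charts, pie charts, and so on.

In Korean, "aggregation" translates as 집계, but in the context of ES the original term is usually kept, either as "aggregation" or transliterated as 애그리게이션.

There are three main kinds: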

Metrics Aggregations
Bucket Aggregations
Pipeline Aggregations

Metrics Aggregations

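Aggregations that compute a metric, such as a sum or an average, from field values.

ex) min, max, sum, avg, stats, cardinality, etc.

min, max, sum, avg

Aggregations that return the minimum, maximum, sum, and average of a field.

These are the most commonly used metrics aggregations.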

Example - sum

POST /sales/_search?size=0
{
  "query": {
    "constant_score": {
      "filter": {
        "match": { "type": "hat" }
      }
    }
  },
  "aggs": {
    "hat_prices": { "sum": { "field": "price" } }
  }
}

// curl
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "match": { "type": "hat" }
      }
    }
  },
  "aggs": {
    "hat_prices": { "sum": { "field": "price" } }
  }
}
'

// response
{
  ...
  "aggregations": {
    "hat_prices": {
      "value": 450.0
    }
  }
}

This sums the price field over the documents in the sales index whose type matches hat, and returns the result under the name hat_prices.
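To show how the request and response fit together from client code, here is a minimal Python sketch. It only builds the request body as a dict and reads the value out of a response shaped like the one in the example; `build_sum_request` and `extract_value` are hypothetical helper names, and actually sending the request to a cluster is left out.

```python
import json


def build_sum_request(filter_field, filter_value, sum_field, agg_name):
    """Build a search body that filters documents and sums one numeric field."""
    return {
        "query": {
            "constant_score": {
                "filter": {"match": {filter_field: filter_value}}
            }
        },
        "aggs": {agg_name: {"sum": {"field": sum_field}}},
    }


def extract_value(response, agg_name):
    """Return the single-value metric stored at aggregations.<agg_name>.value."""
    return response["aggregations"][agg_name]["value"]


body = build_sum_request("type", "hat", "price", "hat_prices")
print(json.dumps(body, indent=2))

# Response trimmed to the part shown in the example (other keys omitted).
sample_response = {"aggregations": {"hat_prices": {"value": 450.0}}}
print(extract_value(sample_response, "hat_prices"))
```

The aggregation name (`hat_prices` here) is chosen by the caller, and the same name is the key under `aggregations` in the response.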

Stats

If you need min, max, sum, and avg all together, the stats aggregation returns all four plus count in a single request.

Example

// aggs that returns the count, min, max, avg, and sum of the grade field
POST /exams/_search?size=0
{
  "aggs": {
    "grades_stats": { "stats": { "field": "grade" } }
  }
}

// curl
curl -X POST "localhost:9200/exams/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "grades_stats": { "stats": { "field": "grade" } }
  }
}
'

// response
{
  ...

  "aggregations": {
    "grades_stats": {
      "count": 2,
      "min": 50.0,
      "max": 100.0,
      "avg": 75.0,
      "sum": 150.0
    }
  }
}
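As a rough local equivalent of what the stats aggregation reports, the following Python sketch computes the same five numbers over a plain list. The grade values are an assumption, chosen so that the output matches the sample response; the real aggregation runs inside Elasticsearch, not like this.

```python
def stats(values):
    """Compute the five numbers that a stats aggregation reports."""
    values = list(values)
    if not values:
        raise ValueError("stats() needs at least one value")
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


# Assumed grades, picked to reproduce the sample response.
grades = [50.0, 100.0]
print(stats(grades))
```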

What if you only want the aggregation results? ❓

By default, a search that includes aggregations returns both the aggregation results and the search hits.

To see only the aggregation results, add "size": 0 to the request body.

GET /students/_search
{
  "size": 0,
  "aggs": ...
}

cardinality

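Used to find out how many distinct values a field has. The result is an approximate count.

It generally cannot be used on text fields, but it can be used on fields such as

numeric
keyword
ip

and similar types.

A typical use is estimating how many distinct users actually connected, based on the IP address field in user access logs.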

Example

// aggs that returns how many distinct values the type field has
POST /sales/_search?size=0
{
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "type"
      }
    }
  }
}

// curl 
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "type"
      }
    }
  }
}
'

// response
{
  ...
  "aggregations": {
    "type_count": {
      "value": 3
    }
  }
}
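Conceptually, cardinality is a distinct count. The Python sketch below computes an exact distinct count over a handful of hypothetical documents (the `sales` list and `exact_cardinality` are made up for illustration). Elasticsearch itself returns an approximate count based on HyperLogLog++, so on fields with very many distinct values its answer can differ slightly from an exact count like this one.

```python
def exact_cardinality(docs, field):
    """Count distinct values of `field`, skipping documents that lack it."""
    return len({doc[field] for doc in docs if field in doc})


# Hypothetical documents; only the `type` field matters here.
sales = [
    {"type": "hat"},
    {"type": "t-shirt"},
    {"type": "bag"},
    {"type": "hat"},
    {"type": "t-shirt"},
    {"price": 10},  # no `type` field, so it is ignored
]
print(exact_cardinality(sales, "type"))
```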

Bucket Aggregations

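Create buckets according to a given criterion and group the documents that fall into each bucket.

The number of documents in each bucket is reported in doc_count by default.
Other calculations can also be run inside each bucket using metrics aggregations.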

range

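An aggregation that defines ranges over a numeric field and creates one bucket per range. Each range includes its from value and excludes its to value.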

Example

// split the values of the price field into buckets with a range agg
GET sales/_search
{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100.0 },
          { "from": 100.0, "to": 200.0 },
          { "from": 200.0 }
        ]
      }
    }
  }
}

// curl
curl -X GET "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100.0 },
          { "from": 100.0, "to": 200.0 },
          { "from": 200.0 }
        ]
      }
    }
  }
}
'

// response 
{
  ...
  "aggregations": {
    "price_ranges": {
      "buckets": [
        {
          "key": "*-100.0",
          "to": 100.0,
          "doc_count": 2
        },
        {
          "key": "100.0-200.0",
          "from": 100.0,
          "to": 200.0,
          "doc_count": 2
        },
        {
          "key": "200.0-*",
          "from": 200.0,
          "doc_count": 3
        }
      ]
    }
  }
}
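The bucketing rule can be sketched locally in Python. This is only an illustration of the from-inclusive, to-exclusive rule, not how Elasticsearch implements it; `range_buckets` is a hypothetical helper and the `prices` list is assumed data chosen to reproduce the doc_count values in the sample response.

```python
import math


def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = []
    for r in ranges:
        lo = r.get("from", -math.inf)  # open lower bound when `from` is absent
        hi = r.get("to", math.inf)     # open upper bound when `to` is absent
        count = sum(1 for v in values if lo <= v < hi)
        buckets.append({**r, "doc_count": count})
    return buckets


# Assumed prices, picked to match the sample response.
prices = [10, 50, 150, 175, 200, 200, 200]
ranges = [{"to": 100.0}, {"from": 100.0, "to": 200.0}, {"from": 200.0}]
for bucket in range_buckets(prices, ranges):
    print(bucket)
```

Note that a price of exactly 200 lands in the last bucket, not the middle one, because the upper bound is exclusive.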

histogram

Like range, an aggregation that splits a numeric field into ranges.

range specifies the bounds of each bucket with from / to,

whereas histogram uses the interval option to split buckets at a fixed width.

Example

// split the values of the price field into buckets with a histogram agg
POST /sales/_search?size=0
{
  "aggs": {
    "prices": {
      "histogram": {
        "field": "price",
        "interval": 50
      }
    }
  }
}

// curl
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "prices": {
      "histogram": {
        "field": "price",
        "interval": 50
      }
    }
  }
}
'

// response
{
  ...
  "aggregations": {
    "prices": {
      "buckets": [
        {
          "key": 0.0,
          "doc_count": 1
        },
        {
          "key": 50.0,
          "doc_count": 1
        },
        {
          "key": 100.0,
          "doc_count": 0
        },
        {
          "key": 150.0,
          "doc_count": 2
        },
        {
          "key": 200.0,
          "doc_count": 3
        }
      ]
    }
  }
}
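How a value is mapped to a bucket can be sketched in Python: each value goes to the bucket whose key is floor(value / interval) * interval, and empty buckets between the smallest and largest key are filled with a count of 0, as in the response. The `histogram` helper and the `prices` list are assumptions for illustration; the data is chosen to reproduce the sample response.

```python
import math


def histogram(values, interval):
    """Bucket values by key = floor(value / interval) * interval, filling gaps."""
    counts = {}
    for v in values:
        key = math.floor(v / interval) * interval
        counts[key] = counts.get(key, 0) + 1
    if not counts:
        return []
    buckets = []
    key = min(counts)
    last = max(counts)
    while key <= last:
        # Buckets with no documents still appear, with doc_count 0.
        buckets.append({"key": float(key), "doc_count": counts.get(key, 0)})
        key += interval
    return buckets


# Assumed prices, picked to match the sample response.
prices = [10, 50, 150, 175, 200, 200, 200]
for bucket in histogram(prices, 50):
    print(bucket)
```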

terms

An aggregation that creates one bucket per distinct string value of a keyword field.

Example

// create a bucket for each genre value
GET /_search
{
  "aggs": {
    "genres": {
      "terms": { "field": "genre" }
    }
  }
}

// curl
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "genres": {
      "terms": { "field": "genre" }
    }
  }
}
'

// response
{
  ...
  "aggregations": {
    "genres": {
      "doc_count_error_upper_bound": 0,   
      "sum_other_doc_count": 0,           
      "buckets": [                        
        {
          "key": "electronic",
          "doc_count": 6
        },
        {
          "key": "rock",
          "doc_count": 3
        },
        {
          "key": "jazz",
          "doc_count": 2
        }
      ]
    }
  }
}

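⚠️ Using a terms aggregation on a text field ⚠️

⇒ An error occurs 💣

Text fields are not optimized for operations that need per-document field data, such as aggregations and sorting ❌

By default, these operations are disabled on text fields.

Use a keyword field instead.

In other words, a text field is analyzed and indexed as separate terms, so it is not suited to splitting into buckets.

Use a keyword field, which stores the input string as a single token!!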
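The difference can be illustrated with a small Python sketch. It buckets the same hypothetical documents twice: once on the whole string (as a keyword field would) and once on tokens (roughly what bucketing on a text field would look like if fielddata were enabled). Lowercasing and splitting on whitespace is only a crude stand-in for a real analyzer, and `terms_buckets` and the `city` data are made up for this example.

```python
from collections import Counter


def terms_buckets(docs, field, analyzed):
    """Count documents per term; analyzed=True mimics a tokenized text field."""
    counts = Counter()
    for doc in docs:
        value = doc[field]
        if analyzed:
            # Each token counts at most once per document, like doc_count.
            counts.update(set(value.lower().split()))
        else:
            counts[value] += 1
    return dict(counts)


docs = [{"city": "New York"}, {"city": "New York"}, {"city": "New Delhi"}]
print(terms_buckets(docs, "city", analyzed=False))
print(terms_buckets(docs, "city", analyzed=True))
```

With whole strings the buckets are the two cities; with tokens, "new" becomes a bucket of its own that mixes both cities, which is rarely what a report needs.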

Pipeline Aggregations

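Aggregations that take the results of other metrics aggregations as their input.

They run further calculations over the results of other buckets.

min_bucket
max_bucket
avg_bucket
sum_bucket
stats_bucket
moving_avg - moving average (deprecated and later removed in favor of moving_fn)
derivative - derivative
cumulative_sum - cumulative sum

A pipeline aggregation uses the "buckets_path": "<bucket name>" option to specify which aggregation's output it takes as input.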

Example

// run a cumulative_sum agg that takes the monthly sum of price (the sales agg) as input
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        },
        "cumulative_sales": {
          "cumulative_sum": {
            "buckets_path": "sales" 
          }
        }
      }
    }
  }
}

// curl
curl -X POST "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        },
        "cumulative_sales": {
          "cumulative_sum": {
            "buckets_path": "sales" 
          }
        }
      }
    }
  }
}
'

// response
{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               },
               "cumulative_sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "cumulative_sales": {
                  "value": 610.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               },
               "cumulative_sales": {
                  "value": 985.0
               }
            }
         ]
      }
   }
}
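What cumulative_sum does to the per-bucket values can be checked with a few lines of Python. The input list is the sequence of monthly sales values from the response; `cumulative_sum` here is a local illustration, not the Elasticsearch implementation.

```python
from itertools import accumulate


def cumulative_sum(bucket_values):
    """Running total over bucket values, in bucket order."""
    return list(accumulate(bucket_values))


# Monthly `sales` values taken from the response, in order.
monthly_sales = [550.0, 60.0, 375.0]
print(cumulative_sum(monthly_sales))  # [550.0, 610.0, 985.0]
```

Each output element lines up with the cumulative_sales value of the corresponding month's bucket.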

Sub Aggregation

Declaring "aggs": {} again inside the buckets created by a bucket aggregation

lets you create further buckets or compute metrics aggregations within each bucket.

Example

// for each stations bucket, compute the average of the passangers field with an avg agg
GET my_stations/_search
{
  "size": 0,
  "aggs": {
    "stations": {
      "terms": {
        "field": "station.keyword"
      },
      "aggs": {
        "avg_psg_per_st": {
          "avg": {
            "field": "passangers"
          }
        }
      }
    }
  }
}

// response
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "stations" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "강남",
          "doc_count" : 5,
          "avg_psg_per_st" : {
            "value" : 5931.2
          }
        },
        {
          "key" : "불광",
          "doc_count" : 1,
          "avg_psg_per_st" : {
            "value" : 971.0
          }
        },
        {
          "key" : "신촌",
          "doc_count" : 1,
          "avg_psg_per_st" : {
            "value" : 3912.0
          }
        },
        {
          "key" : "양재",
          "doc_count" : 1,
          "avg_psg_per_st" : {
            "value" : 4121.0
          }
        },
        {
          "key" : "종각",
          "doc_count" : 1,
          "avg_psg_per_st" : {
            "value" : 2314.0
          }
        },
        {
          "key" : "홍제",
          "doc_count" : 1,
          "avg_psg_per_st" : {
            "value" : 1021.0
          }
        }
      ]
    }
  }
}
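Because each bucket carries its sub-aggregation result under the sub-aggregation's name, reading the response from client code is a matter of walking the buckets. A minimal Python sketch, assuming the response has already been parsed into a dict; `metric_per_bucket` is a hypothetical helper and the sample keeps only the first two buckets of the response.

```python
def metric_per_bucket(response, bucket_agg, metric_agg):
    """Map each bucket key to the value of a single-value metric nested in it."""
    buckets = response["aggregations"][bucket_agg]["buckets"]
    return {b["key"]: b[metric_agg]["value"] for b in buckets}


# First two buckets of the response; the rest are left out for brevity.
sample = {
    "aggregations": {
        "stations": {
            "buckets": [
                {"key": "강남", "doc_count": 5, "avg_psg_per_st": {"value": 5931.2}},
                {"key": "불광", "doc_count": 1, "avg_psg_per_st": {"value": 971.0}},
            ]
        }
    }
}
print(metric_per_bucket(sample, "stations", "avg_psg_per_st"))
```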

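Caution ❗

The deeper the sub-buckets are nested, the more the work and memory ES needs can grow exponentially

→ which can cause unexpected errors.

As a rule, it is best not to nest buckets more than two levels deep.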
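A quick back-of-the-envelope calculation shows why depth is costly. Assuming, purely hypothetically, that each nesting level produces 100 buckets and that every combination actually occurs, the number of leaf buckets is the product of the per-level counts:

```python
from math import prod


def max_leaf_buckets(buckets_per_level):
    """Upper bound on leaf buckets when bucket aggregations are nested."""
    return prod(buckets_per_level)


# Hypothetical: 100 buckets at every level.
for depth in range(1, 4):
    print(depth, max_leaf_buckets([100] * depth))
```

Under these assumptions, three levels already mean up to a million buckets, each with its own counters and any nested metrics. Real counts are usually lower because not every combination occurs.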

References

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

https://esbook.kimjmin.net/08-aggregations

https://velog.io/@soyeon207/ES-7.-aggregations-집계

License: Attribution, NonCommercial, NoDerivatives