TL;DR
The terms aggregation does not support pagination. A bucket_sort sub-aggregation can slice the buckets into pages, but it only sees the buckets that the terms aggregation's size parameter already returned.
Prepare Data
- run opensearch container [ref]
- generate flight sample data
- check the schema with the _mapping API
A Scenario for Terms Aggregation
We aim to calculate the total number of flights for each origin city, find the earliest departure time (min_flight_time) for each city, and then sort the results by min_flight_time. A terms aggregation on the origin city with a min sub-aggregation covers this:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "group_by_origin_city": {
      "terms": {
        "field": "OriginCityName",
        "order": {
          "min_flight_time": "asc"
        }
      },
      "aggs": {
        "min_flight_time": {
          "min": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}
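When I need pagination together with sorting, the terms aggregation first has to return enough buckets to cover every group (hence a very large size, 1000000 here) so that the sort order is correct. A bucket_sort sub-aggregation then cuts out the requested page with from and size.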
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "group_by_origin_city": {
      "terms": {
        "field": "OriginCityName",
        "size": 1000000,
        "order": {
          "min_flight_time": "desc"
        }
      },
      "aggs": {
        "min_flight_time": {
          "min": {
            "field": "timestamp"
          }
        },
        "custom_paginate": {
          "bucket_sort": {
            "from": 0,
            "size": 5
          }
        }
      }
    }
  }
}
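To request an arbitrary page, only the from value of the bucket_sort changes. The following Python sketch builds the same request body for a given page; the helper name build_page_query and its parameters are assumptions made for illustration, and sending the body to the cluster is left out.

```python
import json


def build_page_query(page, page_size, max_buckets=1_000_000, order="desc"):
    """Build the terms + bucket_sort request body for one page (pages are 1-based).

    Hypothetical helper: the function name and parameters are illustrative,
    only the field and aggregation names come from the query above.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "group_by_origin_city": {
                "terms": {
                    "field": "OriginCityName",
                    # Must cover every bucket, otherwise later pages are cut off.
                    "size": max_buckets,
                    "order": {"min_flight_time": order},
                },
                "aggs": {
                    "min_flight_time": {"min": {"field": "timestamp"}},
                    "custom_paginate": {
                        "bucket_sort": {
                            # Skip the buckets that belong to earlier pages.
                            "from": (page - 1) * page_size,
                            "size": page_size,
                        }
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the body for page 3 with 5 buckets per page.
    print(json.dumps(build_page_query(page=3, page_size=5), indent=2))
```

Every page still makes the cluster build and sort all buckets, so the cost per request does not shrink as the page number changes.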
Potential Issue
Setting size to such a large value may overload memory, degrade performance, and potentially trigger document count errors.
Solutions
- Address it from a business-logic perspective.
Use an additional index to record the earliest departure time and the total number of flights for each origin city.
- Address it from an implementation standpoint.
Use filters, such as time intervals or specific conditions, to reduce the number of documents.
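Both options can be sketched as request bodies built in Python. This is a minimal sketch under assumptions: the summary index is assumed to hold one document per origin city with a min_flight_time field, and the function names, the date strings, and the max_buckets default are hypothetical.

```python
def summary_page_query(page, page_size):
    """Page through a pre-aggregated summary index with plain from/size.

    Assumes one document per origin city that already stores
    min_flight_time (hypothetical field in a hypothetical index).
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {
        "from": (page - 1) * page_size,
        "size": page_size,
        "query": {"match_all": {}},
        "sort": [{"min_flight_time": {"order": "asc"}}],
    }


def filtered_terms_query(start, end, max_buckets=1000):
    """Run the terms aggregation over a time window only.

    The range filter reduces the documents, and usually the buckets,
    that the aggregation has to process.
    """
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"range": {"timestamp": {"gte": start, "lt": end}}}
                ]
            }
        },
        "aggs": {
            "group_by_origin_city": {
                "terms": {
                    "field": "OriginCityName",
                    "size": max_buckets,  # assumed bound, tune to the data
                    "order": {"min_flight_time": "asc"},
                },
                "aggs": {
                    "min_flight_time": {"min": {"field": "timestamp"}}
                },
            }
        },
    }
```

With the summary index, sorting and pagination become an ordinary search, at the price of keeping that index up to date whenever flights are added.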
NOTE/Others
- Why not use a composite aggregation? Because it pages through buckets ordered by their keys and doesn’t allow sorting by sub-aggregations.
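For comparison, here is a sketch of what a composite aggregation request would look like. It does page through all buckets with an after key, but the only ordering it offers is by the source key (the city name here), not by min_flight_time. The function name, the source name origin_city, and the example key value are assumptions made for illustration.

```python
def composite_page_query(page_size, after_key=None):
    """Build a composite aggregation request for one page of buckets.

    after_key is the after_key object returned with the previous page,
    or None for the first page.
    """
    composite = {
        "size": page_size,
        "sources": [
            # Buckets can only be ordered by this key, asc or desc.
            {"origin_city": {"terms": {"field": "OriginCityName", "order": "asc"}}}
        ],
    }
    if after_key is not None:
        composite["after"] = after_key
    return {
        "size": 0,
        "aggs": {
            "group_by_origin_city": {
                "composite": composite,
                "aggs": {
                    # Computed per bucket, but not usable as a sort key.
                    "min_flight_time": {"min": {"field": "timestamp"}}
                },
            }
        },
    }


first_page = composite_page_query(5)
# "Berlin" stands in for the key taken from the previous response.
next_page = composite_page_query(5, after_key={"origin_city": "Berlin"})
```

This fits a full export of all buckets well, but not a "top N by earliest departure" page.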