Min/max/sort one array by another#
A common task in data analysis is to select items from one array that minimizes or maximizes another, or to sort one array by the values of another.
import awkward as ak
Naive attempt goes wrong#
For instance, in
data = ak.Array([
[
{"title": "zero", "x": 0, "y": 0},
{"title": "two", "x": 2, "y": 2.2},
{"title": "one", "x": 1, "y": 1.1},
],
[],
[
{"title": "four", "x": 4, "y": 4.4},
{"title": "three", "x": 3, "y": 3.3},
],
[
{"title": "five", "x": 5, "y": 5.5},
],
[
{"title": "eight", "x": 8, "y": 8.8},
{"title": "six", "x": 6, "y": 6.6},
{"title": "nine", "x": 9, "y": 9.9},
{"title": "seven", "x": 7, "y": 7.7},
],
])
you may want to score each record with a computed value, such as x**2 + y**2
, and then select the record with the highest score from each list.
score = data.x**2 + data.y**2
score
[[0, 8.84, 2.21], [], [35.4, 19.9], [55.2], [141, 79.6, 179, 108]] ----------------------- backend: cpu nbytes: 128 B type: 5 * var * float64
At first, it would seem that ak.argmax()
is what you need to identify the item with the highest score from each list and select it from data
.
best_index = ak.argmax(score, axis=1)
best_index
[1, None, 0, 0, 2] ---------------- backend: cpu nbytes: 45 B type: 5 * ?int64
However, if you attempt to slice the data
with this, you’ll either get an indexing error or lists instead of records:
data[best_index]
[[], None, [{title: 'zero', x: 0, y: 0}, {...}, {title: 'one', x: 1, y: 1.1}], [{title: 'zero', x: 0, y: 0}, {...}, {title: 'one', x: 1, y: 1.1}], [{title: 'four', x: 4, y: 4.4}, {title: 'three', x: 3, y: 3.3}]] --------------------------------------------------------------------------- backend: cpu nbytes: 392 B type: 5 * option[var * { title: string, x: int64, y: float64 }]
What happend?#
Following the logic for reducers, the ak.argmin()
function returns an array with one fewer dimension than the input: the data
is an array of lists of records, but best_index
is an array of integers. We want an array of lists of integers.
The keepdims=True
parameter can ensure that the output has the same number of dimensions as the input:
best_index = ak.argmax(score, axis=1, keepdims=True)
best_index
[[1], [None], [0], [0], [2]] -------------------- backend: cpu nbytes: 45 B type: 5 * 1 * ?int64
Now these integers are at the same level of depth as the records that we want to select:
result = data[best_index]
result
[[{title: 'two', x: 2, y: 2.2}], [None], [{title: 'four', x: 4, y: 4.4}], [{title: 'five', x: 5, y: 5.5}], [{title: 'nine', x: 9, y: 9.9}]] -------------------------------------------------------------------- backend: cpu nbytes: 376 B type: 5 * var * ?{ title: string, x: int64, y: float64 }
In the above, each length-1 list contains the record with the highest score
. Even the empty list, for which the ak.argmax()
is missing (None
), is now a length-1 list containing None
. We can remove this length-1 list structure with a slice:
result[:, 0]
[{title: 'two', x: 2, y: 2.2}, None, {title: 'four', x: 4, y: 4.4}, {title: 'five', x: 5, y: 5.5}, {title: 'nine', x: 9, y: 9.9}] -------------------------------------------------------------- backend: cpu nbytes: 328 B type: 5 * ?{ title: string, x: int64, y: float64 }
To summarize this as a handy idiom, the way to get the record with maximum data.x**2 + data.y**2
from an array of lists of records named data
is
data[ak.argmax(data.x**2 + data.y**2, axis=1, keepdims=True)][:, 0]
[{title: 'two', x: 2, y: 2.2}, None, {title: 'four', x: 4, y: 4.4}, {title: 'five', x: 5, y: 5.5}, {title: 'nine', x: 9, y: 9.9}] -------------------------------------------------------------- backend: cpu nbytes: 328 B type: 5 * ?{ title: string, x: int64, y: float64 }
For an array of lists of lists of records, axis=2
and the final slice would be [:, :, 0]
, and so on.
Sorting by another array#
In addition to selecting items corresponding to the minimum or maximum of some other array, we may want to sort by another array. Just as ak.argmin()
and ak.argmax()
are the functions that would convey indexes from one array to another, ak.argsort()
conveys sorted indexes from one array to another array. However, ak.argsort()
always maintains the total number of dimensions, so we don’t need to worry about keepdims
.
sorted_indexes = ak.argsort(score)
sorted_indexes
[[0, 2, 1], [], [1, 0], [0], [1, 3, 0, 2]] --------------------- backend: cpu nbytes: 128 B type: 5 * var * int64
data[sorted_indexes]
[[{title: 'zero', x: 0, y: 0}, {...}, {title: 'two', x: 2, y: 2.2}], [], [{title: 'three', x: 3, y: 3.3}, {title: 'four', x: 4, y: 4.4}], [{title: 'five', x: 5, y: 5.5}], [{title: 'six', x: 6, y: 6.6}, {...}, ..., {title: 'nine', x: 9, y: 9.9}]] --------------------------------------------------------------------------- backend: cpu nbytes: 416 B type: 5 * var * { title: string, x: int64, y: float64 }
This sorted data has the same type as data
:
data.type.show()
5 * var * {
title: string,
x: int64,
y: float64
}
It’s exactly what we want. ak.argsort()
is easier to use than ak.argmin()
and ak.argmax()
.
Getting the top n items#
The ak.min()
, ak.max()
, ak.argmin()
, and ak.argmax()
functions select one extreme value. If you want the top n items (with n ≠ 1), you can use ak.sort()
or ak.argsort()
, followed by a slice:
top2 = data[ak.argsort(score)][:, :2]
top2
[[{title: 'zero', x: 0, y: 0}, {title: 'one', x: 1, y: 1.1}], [], [{title: 'three', x: 3, y: 3.3}, {title: 'four', x: 4, y: 4.4}], [{title: 'five', x: 5, y: 5.5}], [{title: 'six', x: 6, y: 6.6}, {title: 'seven', x: 7, y: 7.7}]] ------------------------------------------------------------------- backend: cpu nbytes: 392 B type: 5 * var * { title: string, x: int64, y: float64 }
Notice, though, that not all of these lists have length 2. The lists with 0 or 1 input items have 0 or 1 output items: these lists have up to length 2. That may be fine, but the example with ak.argmax()
, above, resulted in None
for an empty list. We could emulate that with ak.pad_none()
.
padded = ak.pad_none(top2, 2, axis=1)
padded
[[{title: 'zero', x: 0, y: 0}, {title: 'one', x: 1, y: 1.1}], [None, None], [{title: 'three', x: 3, y: 3.3}, {title: 'four', x: 4, y: 4.4}], [{title: 'five', x: 5, y: 5.5}, None], [{title: 'six', x: 6, y: 6.6}, {title: 'seven', x: 7, y: 7.7}]] -------------------------------------------------------------------- backend: cpu nbytes: 416 B type: 5 * var * ?{ title: string, x: int64, y: float64 }
The data type still says “var *
”, meaning that the lists are allowed to be variable-length, even though they happen to all have length 2. At this point, we might not care because that’s all we need in order to convert these fields into NumPy arrays (e.g. for some machine learning process):
ak.to_numpy(padded.x)
masked_array(
data=[[0, 1],
[--, --],
[3, 4],
[5, --],
[6, 7]],
mask=[[False, False],
[ True, True],
[False, False],
[False, True],
[False, False]],
fill_value=999999)
ak.to_numpy(padded.y)
masked_array(
data=[[0.0, 1.1],
[--, --],
[3.3, 4.4],
[5.5, --],
[6.6, 7.7]],
mask=[[False, False],
[ True, True],
[False, False],
[False, True],
[False, False]],
fill_value=1e+20)
Or we might want to force the data type to ensure that the lists have length 2, using ak.to_regular()
, ak.enforce_type()
, or just by passing clip=True
in the original ak.pad_none()
.
ak.to_regular(padded, axis=1)
[[{title: 'zero', x: 0, y: 0}, {title: 'one', x: 1, y: 1.1}], [None, None], [{title: 'three', x: 3, y: 3.3}, {title: 'four', x: 4, y: 4.4}], [{title: 'five', x: 5, y: 5.5}, None], [{title: 'six', x: 6, y: 6.6}, {title: 'seven', x: 7, y: 7.7}]] ------------------------------------------------------------------ backend: cpu nbytes: 368 B type: 5 * 2 * ?{ title: string, x: int64, y: float64 }
(Now the list lengths are “2 *
”, rather than “var *
”.)