How to filter arrays by number of items
import awkward as ak
In general, arrays are filtered using NumPy-like slicing. Numerical values can be filtered by numerical expressions in a way that is very similar to NumPy:
array = ak.Array([
    [[0, 1.1, 2.2], []], [[3.3, 4.4]], [], [[5.5], [6.6, 7.7, 8.8, 9.9]]
])
array[array > 4]
[[[], []], [[4.4]], [], [[5.5], [6.6, 7.7, 8.8, 9.9]]]
-------------------------------
backend: cpu
nbytes: 136 B
type: 4 * var * var * float64
But it is also common to want to filter arrays by the number of items in each list, for two reasons:
- to exclude empty lists, so that subsequent slices can select the item at index 0,
- to make the list lengths rectangular for computational steps that require rectangular arrays (such as most forms of machine learning).
There are two functions that provide the lengths of lists: ak.num() and ak.count(). To filter arrays, you’ll most likely want ak.num().
Use ak.num
ak.num() can be applied at any axis, and it returns the number of items in lists at that axis with the same shape for all levels above that axis.
ak.num(array, axis=0)
array(4)
ak.num(array, axis=1) # default
[2, 1, 0, 2]
---
backend: cpu
nbytes: 32 B
type: 4 * int64
ak.num(array, axis=2)
[[3, 0], [2], [], [1, 4]]
--------
backend: cpu
nbytes: 80 B
type: 4 * var * int64
Thus, if you want to select outer lists of array with length 2, you would use axis=1:
array[ak.num(array) == 2]
[[[0, 1.1, 2.2], []], [[5.5], [6.6, 7.7, 8.8, 9.9]]]
-------------------------------
backend: cpu
nbytes: 160 B
type: 2 * var * var * float64
And if you want to select inner lists of array with length greater than 2, you would use axis=2:
array[ak.num(array, axis=2) > 2]
[[[0, 1.1, 2.2]], [], [], [[6.6, 7.7, 8.8, 9.9]]]
------------------------
backend: cpu
nbytes: 152 B
type: 4 * var * var * float64
The ragged array of booleans that you get from comparing ak.num() with a number is exactly what is needed to slice the array.
Don’t use ak.count
By contrast, ak.count() returns structures that you can’t use this way (for all but axis=-1):
ak.count(array, axis=None) # default
np.int64(10)
ak.count(array, axis=0)
[[3, 2, 1], [1, 1, 1, 1]]
--------------
backend: cpu
nbytes: 96 B
type: 2 * var * int64
ak.count(array, axis=1)
[[1, 1, 1], [1, 1], [], [2, 1, 1, 1]]
--------------
backend: cpu
nbytes: 192 B
type: 4 * var * int64
ak.count(array, axis=2) # equivalent to axis=-1 for this array
[[3, 0], [2], [], [1, 4]]
--------
backend: cpu
nbytes: 80 B
type: 4 * var * int64
Also, ak.num() can be used on arrays that contain records, whereas ak.count(), like other reducers, can’t.
As a reducer, ak.count() is intended to be used in a mathematical formula with other reducers, like ak.sum(), ak.max(), etc. (usually as a denominator). Its axis behavior matches that of other reducers, which is important for the shapes of nested lists to align.