How to filter arrays by number of items
import awkward as ak
In general, arrays are filtered using NumPy-like slicing. Numerical values can be filtered by numerical expressions in a way that is very similar to NumPy:
array = ak.Array([
    [[0, 1.1, 2.2], []], [[3.3, 4.4]], [], [[5.5], [6.6, 7.7, 8.8, 9.9]]
])
array[array > 4]
[[[], []], [[4.4]], [], [[5.5], [6.6, 7.7, 8.8, 9.9]]]
-------------------------------
backend: cpu
nbytes: 136 B
type: 4 * var * var * float64
But it’s also common to want to filter arrays by the number of items in each list, for two reasons: to exclude empty lists so that subsequent slices can select the item at index 0, and to make the list lengths rectangular for computational steps that require rectangular arrays (such as most forms of machine learning).
There are two functions that provide the lengths of lists: ak.num() and ak.count(). To filter arrays, you’ll most likely want ak.num().
Use ak.num
ak.num() can be applied at any axis. It returns the number of items in each list at that axis, and the result has the same shape as the original array for all levels above that axis.
ak.num(array, axis=0)
array(4)
ak.num(array, axis=1) # default
[2, 1, 0, 2]
---------------
backend: cpu
nbytes: 32 B
type: 4 * int64
ak.num(array, axis=2)
[[3, 0], [2], [], [1, 4]]
---------------------
backend: cpu
nbytes: 80 B
type: 4 * var * int64
Thus, if you want to select outer lists of array with length 2, you would use axis=1 (the default):
array[ak.num(array) == 2]
[[[0, 1.1, 2.2], []], [[5.5], [6.6, 7.7, 8.8, 9.9]]]
-------------------------------
backend: cpu
nbytes: 160 B
type: 2 * var * var * float64
And if you want to select inner lists of array with length greater than 2, you would use axis=2:
array[ak.num(array, axis=2) > 2]
[[[0, 1.1, 2.2]], [], [], [[6.6, 7.7, 8.8, 9.9]]]
-----------------------------
backend: cpu
nbytes: 152 B
type: 4 * var * var * float64
The ragged array of booleans that you get from comparing ak.num() with a number is exactly what is needed to slice the array.
Don’t use ak.count
By contrast, ak.count() returns structures that you can’t use to slice the array this way (for all but axis=-1):
ak.count(array, axis=None) # default
np.int64(10)
ak.count(array, axis=0)
[[3, 2, 1], [1, 1, 1, 1]]
---------------------
backend: cpu
nbytes: 96 B
type: 2 * var * int64
ak.count(array, axis=1)
[[1, 1, 1], [1, 1], [], [2, 1, 1, 1]]
---------------------
backend: cpu
nbytes: 192 B
type: 4 * var * int64
ak.count(array, axis=2) # equivalent to axis=-1 for this array
[[3, 0], [2], [], [1, 4]]
---------------------
backend: cpu
nbytes: 80 B
type: 4 * var * int64
Also, ak.num() can be used on arrays that contain records, whereas ak.count() (like other reducers) can’t.
As a reducer, ak.count() is intended to be used in a mathematical formula with other reducers, like ak.sum(), ak.max(), etc. (usually as a denominator). Its axis behavior matches that of other reducers, which is important for the shapes of nested lists to align.