How to filter arrays by number of items

import awkward as ak

In general, arrays are filtered using NumPy-like slicing. Numerical values can be filtered by numerical expressions in a way that is very similar to NumPy:

array = ak.Array([
    [[0, 1.1, 2.2], []], [[3.3, 4.4]], [], [[5.5], [6.6, 7.7, 8.8, 9.9]]
])
array[array > 4]
[[[], []],
 [[4.4]],
 [],
 [[5.5], [6.6, 7.7, 8.8, 9.9]]]
-------------------------------
backend: cpu
nbytes: 136 B
type: 4 * var * var * float64

However, it's also common to want to filter arrays by the number of items in each list, for two reasons:

  • to exclude empty lists so that subsequent slices can select the item at index 0,

  • to make the list lengths rectangular for computational steps that require rectangular arrays (such as most forms of machine learning).

There are two functions that provide the lengths of lists: ak.num() and ak.count(). To filter arrays, you’ll most likely want ak.num().

Use ak.num

ak.num() can be applied at any axis. It returns the number of items in each list at that axis, and the result keeps the nesting structure of all levels above that axis.

ak.num(array, axis=0)
array(4)
ak.num(array, axis=1)   # default
[2,
 1,
 0,
 2]
---------------
backend: cpu
nbytes: 32 B
type: 4 * int64
ak.num(array, axis=2)
[[3, 0],
 [2],
 [],
 [1, 4]]
---------------------
backend: cpu
nbytes: 80 B
type: 4 * var * int64

Thus, if you want to select outer lists of array with length 2, you would use axis=1:

array[ak.num(array) == 2]
[[[0, 1.1, 2.2], []],
 [[5.5], [6.6, 7.7, 8.8, 9.9]]]
-------------------------------
backend: cpu
nbytes: 160 B
type: 2 * var * var * float64

And if you want to select inner lists of array with length greater than 2, you would use axis=2:

array[ak.num(array, axis=2) > 2]
[[[0, 1.1, 2.2]],
 [],
 [],
 [[6.6, 7.7, 8.8, 9.9]]]
-----------------------------
backend: cpu
nbytes: 152 B
type: 4 * var * var * float64

The ragged array of booleans that you get from comparing ak.num() with a number is exactly what is needed to slice the array.

Don’t use ak.count

By contrast, ak.count() returns structures that you can’t use this way, except at axis=-1:

ak.count(array, axis=None)   # default
np.int64(10)
ak.count(array, axis=0)
[[3, 2, 1],
 [1, 1, 1, 1]]
---------------------
backend: cpu
nbytes: 96 B
type: 2 * var * int64
ak.count(array, axis=1)
[[1, 1, 1],
 [1, 1],
 [],
 [2, 1, 1, 1]]
---------------------
backend: cpu
nbytes: 192 B
type: 4 * var * int64
ak.count(array, axis=2)   # equivalent to axis=-1 for this array
[[3, 0],
 [2],
 [],
 [1, 4]]
---------------------
backend: cpu
nbytes: 80 B
type: 4 * var * int64

Also, ak.num() can be used on arrays that contain records, whereas ak.count(), like other reducers, can’t.

As a reducer, ak.count() is intended to be used in a mathematical formula with other reducers, like ak.sum(), ak.max(), etc. (usually as a denominator). Its axis behavior matches that of other reducers, which is important for the shapes of nested lists to align.