How to filter arrays: cutting vs. masking

import awkward as ak
import numpy as np

The problem with slicing

When you write a mathematical formula using binary operators like + and *, or NumPy universal functions (ufuncs) like np.sqrt, the shapes of nested lists must align. If the arrays in an expression were derived from a single array, this is often automatic. For instance,

original_array = ak.Array([
    [
        {"title": "zero", "x": 0, "y": 0},
        {"title": "one", "x": 1, "y": 1.1},
        {"title": "two", "x": 2, "y": 2.2},
    ],
    [],
    [
        {"title": "three", "x": 3, "y": 3.3},
        {"title": "four", "x": 4, "y": 4.4},
    ],
    [
        {"title": "five", "x": 5, "y": 5.5},
    ],
    [
        {"title": "six", "x": 6, "y": 6.6},
        {"title": "seven", "x": 7, "y": 7.7},
        {"title": "eight", "x": 8, "y": 8.8},
        {"title": "nine", "x": 9, "y": 9.9},
    ],
])

array_x = original_array.x
array_y = original_array.y

array_x and array_y have the same number of lists and the same number of items in each list because both were sliced from original_array.

array_x
[[0, 1, 2],
 [],
 [3, 4],
 [5],
 [6, 7, 8, 9]]
type: 5 * var * int64
array_y
[[0, 1.1, 2.2],
 [],
 [3.3, 4.4],
 [5.5],
 [6.6, 7.7, 8.8, 9.9]]
type: 5 * var * float64

Thus, they can be used together in a mathematical formula.

array_x**2 + array_y**2
[[0, 2.21, 8.84],
 [],
 [19.9, 35.4],
 [55.2],
 [79.6, 108, 141, 179]]
type: 5 * var * float64

However, if one array is sliced, or if the two arrays are sliced by different criteria, they would no longer line up:

sliced_x = array_x[array_x > 3]
sliced_y = array_y[array_y > 3]
sliced_x
[[],
 [],
 [4],
 [5],
 [6, 7, 8, 9]]
type: 5 * var * int64
sliced_y
[[],
 [],
 [3.3, 4.4],
 [5.5],
 [6.6, 7.7, 8.8, 9.9]]
type: 5 * var * float64

Notice that the first was sliced with array_x > 3 and the second was sliced with array_y > 3, and as a result, the third list differs in length between the two arrays:

sliced_x[2], sliced_y[2]
(<Array [4] type='1 * int64'>, <Array [3.3, 4.4] type='2 * float64'>)

If we try to use these together, we get a ValueError:

sliced_x**2 + sliced_y**2
This error occurred while calling

        <Array [[], [], ..., [25], [36, 49, 64, 81]] type='5 * var * int64'>
        <Array [[], [], ..., [43.6, 59.3, 77.4, 98]] type='5 * var * float64'>

Sometimes, these misalignments are overt, but sometimes they’re subtle and embedded deep within a very large array. You can start investigating a problem like this with ak.num():

ak.num(sliced_x) != ak.num(sliced_y)
[False,
 False,
 True,
 False,
 False]
type: 5 * bool
np.nonzero(ak.to_numpy(ak.num(sliced_x) != ak.num(sliced_y)))
(array([2]),)

But it’s also possible to avoid them in the first place.

Masking with missing values

The problem was that the two arrays’ shapes changed differently; instead, we’ll slice them in such a way that their shapes don’t change at all.

The ak.mask() function uses a boolean array like a slice, but instead of removing the values that line up with False, it replaces them with None.

ak.mask(array_x, array_x > 3)
[[None, None, None],
 [],
 [None, 4],
 [5],
 [6, 7, 8, 9]]
type: 5 * var * ?int64

It can also be accessed as an array property, with square brackets, so that it resembles a slice:

masked_x = array_x.mask[array_x > 3]
masked_y = array_y.mask[array_y > 3]
masked_x
[[None, None, None],
 [],
 [None, 4],
 [5],
 [6, 7, 8, 9]]
type: 5 * var * ?int64
masked_y
[[None, None, None],
 [],
 [3.3, 4.4],
 [5.5],
 [6.6, 7.7, 8.8, 9.9]]
type: 5 * var * ?float64

The results of these two masks can be used in a mathematical expression because they line up:

result = masked_x**2 + masked_y**2
result
[[None, None, None],
 [],
 [None, 35.4],
 [55.2],
 [79.6, 108, 141, 179]]
type: 5 * var * ?float64

Now only one problem remains: the None (missing) values might be undesirable in the output. There are several ways to get rid of them:

  • ak.drop_none() eliminates None, like a slice, but it can be done once at the end of a calculation,

  • ak.fill_none() replaces None with a chosen value,

  • ak.flatten() removes list structure, and if the None values are at the level of a list (the ones in result aren’t), they’ll be removed too,

  • ak.singletons() replaces None with [] and any other value x with [x]. The resulting lists all have length 0 or length 1.

ak.drop_none(result, axis=1)
[[],
 [],
 [35.4],
 [55.2],
 [79.6, 108, 141, 179]]
type: 5 * var * float64
ak.fill_none(result, -1, axis=1)
[[-1, -1, -1],
 [],
 [-1, 35.4],
 [55.2],
 [79.6, 108, 141, 179]]
type: 5 * var * float64
ak.singletons(result, axis=1)
[[[], [], []],
 [],
 [[], [35.4]],
 [[55.2]],
 [[79.6], [108], [141], [179]]]
type: 5 * var * var * float64

As a final note, the difference between using ak.drop_none() and slicing with the result of ak.is_none() is that ak.drop_none() also removes “missingness” from the data type; a slice does not.

result[~ak.is_none(result, axis=1)]
[[],
 [],
 [35.4],
 [55.2],
 [79.6, 108, 141, 179]]
type: 5 * var * ?float64

(Note the ? for “option-type” before float64. This could have consequences, good or bad, at a later stage in processing.)