How to filter with arrays containing missing values

import awkward as ak
import numpy as np

Indexing with missing values

In Building an awkward index, we looked at building arrays of integers to perform awkward indexing using ak.argmin() and ak.argmax(). In particular, the keepdims argument of ak.argmin() and ak.argmax() is very useful for creating arrays that can be used to index into the original array. However, reducers such as ak.argmax() behave differently when they are asked to operate upon empty lists.

Let’s first create an array that contains empty sublists:

array = ak.Array(
    [
        [],
        [10, 3, 2, 9],
        [4, 5, 5, 12, 6],
        [],
        [8, 9, -1],
    ]
)
array
[[],
 [10, 3, 2, 9],
 [4, 5, 5, 12, 6],
 [],
 [8, 9, -1]]
------------------
backend: cpu
nbytes: 144 B
type: 5 * var * int64

Awkward reducers accept a mask_identity argument, which changes the ak.Array.type and the values of the result:

ak.argmax(array, keepdims=True, axis=-1, mask_identity=False)
[[-1],
 [0],
 [3],
 [-1],
 [1]]
------
backend: cpu
nbytes: 40 B
type: 5 * 1 * int64
ak.argmax(array, keepdims=True, axis=-1, mask_identity=True)
[[None],
 [0],
 [3],
 [None],
 [1]]
--------
backend: cpu
nbytes: 45 B
type: 5 * 1 * ?int64

Setting mask_identity=False yields the identity value for the reducer instead of None when reducing empty lists. From the above examples of ak.argmax(), we can see that the identity for ak.argmax() is -1.

What happens if we try to use the array produced with mask_identity=False to index into array?

As discussed in Indexing with argmin and argmax, we first need to convert at least one dimension to a ragged dimension:

index = ak.from_regular(
    ak.argmax(array, keepdims=True, axis=-1, mask_identity=False)
)

Now, if we try to index into array with index, it raises an exception:

array[index]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 array[index]

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1106, in Array.__getitem__(self, where)
    677 def __getitem__(self, where):
    678     """
    679     Args:
    680         where (many types supported; see below): Index of positions to
   (...)   1104     have the same dimension as the array being indexed.
   1105     """
-> 1106     with ak._errors.SlicingErrorContext(self, where):
   1107         # Handle named axis
   1108         (_, ndim) = self._layout.minmax_depth
   1109         named_axis = _get_named_axis(self)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
     78     self._slate.__dict__.clear()
     79     # Handle caught exception
---> 80     raise self.decorate_exception(exception_type, exception_value)
     81 else:
     82     # Step out of the way so that another ErrorContext can become primary.
     83     if self.primary() is self:

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1114, in Array.__getitem__(self, where)
   1110 where = _normalize_named_slice(named_axis, where, ndim)
   1112 NamedAxis.mapping = named_axis
-> 1114 indexed_layout = prepare_layout(self._layout._getitem(where, NamedAxis))
   1116 if NamedAxis.mapping:
   1117     return ak.operations.ak_with_named_axis._impl(
   1118         indexed_layout,
   1119         named_axis=NamedAxis.mapping,
   (...)   1122         attrs=self._attrs,
   1123     )

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:659, in Content._getitem(self, where, named_axis)
    656         return out._getitem_at(0)
    658 elif isinstance(where, ak.highlevel.Array):
--> 659     return self._getitem(where.layout, named_axis)
    661 # Convert between nplikes of different backends
    662 elif (
    663     isinstance(where, ak.contents.Content)
    664     and where.backend is not self._backend
    665 ):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:739, in Content._getitem(self, where, named_axis)
    733     return self._carry(
    734         Index64.empty(0, self._backend.nplike),
    735         allow_lazy=True,
    736     )
    738 elif isinstance(where, Content):
--> 739     return self._getitem((where,), named_axis)
    741 elif is_sized_iterable(where):
    742     # Do we have an array
    743     nplike = nplike_of_obj(where, default=None)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:651, in Content._getitem(self, where, named_axis)
    642 named_axis.mapping = _named_axis
    644 next = ak.contents.RegularArray(
    645     this,
    646     this.length,
    647     1,
    648     parameters=None,
    649 )
--> 651 out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
    653 if out.length is not unknown_length and out.length == 0:
    654     return out._getitem_nothing()

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/regulararray.py:723, in RegularArray._getitem_next(self, head, tail, advanced)
    707     assert head.offsets.nplike is nplike
    708     self._maybe_index_error(
    709         self._backend[
    710             "awkward_RegularArray_getitem_jagged_expand",
   (...)    721         slicer=head,
    722     )
--> 723     down = self._content._getitem_next_jagged(
    724         multistarts, multistops, head._content, tail
    725     )
    727     return RegularArray(
    728         down, headlength, self.length, parameters=self._parameters
    729     )
    731 elif isinstance(head, ak.contents.IndexedOptionArray):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listoffsetarray.py:447, in ListOffsetArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    441 def _getitem_next_jagged(
    442     self, slicestarts: Index, slicestops: Index, slicecontent: Content, tail
    443 ) -> Content:
    444     out = ak.contents.ListArray(
    445         self.starts, self.stops, self._content, parameters=self._parameters
    446     )
--> 447     return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listarray.py:584, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    573 nextcarry = ak.index.Index64.empty(carrylen, self._backend.nplike)
    575 assert (
    576     outoffsets.nplike is self._backend.nplike
    577     and nextcarry.nplike is self._backend.nplike
   (...)    582     and self._stops.nplike is self._backend.nplike
    583 )
--> 584 self._maybe_index_error(
    585     self._backend[
    586         "awkward_ListArray_getitem_jagged_apply",
    587         outoffsets.dtype.type,
    588         nextcarry.dtype.type,
    589         slicestarts.dtype.type,
    590         slicestops.dtype.type,
    591         sliceindex.dtype.type,
    592         self._starts.dtype.type,
    593         self._stops.dtype.type,
    594     ](
    595         outoffsets.data,
    596         nextcarry.data,
    597         slicestarts.data,
    598         slicestops.data,
    599         slicestarts.length,
    600         sliceindex.data,
    601         sliceindex.length,
    602         self._starts.data,
    603         self._stops.data,
    604         self._content.length,
    605     ),
    606     slicer=ak.contents.ListArray(slicestarts, slicestops, slicecontent),
    607 )
    608 nextcontent = self._content._carry(nextcarry, True)
    609 nexthead, nexttail = ak._slicing.head_tail(tail)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:297, in Content._maybe_index_error(self, error, slicer)
    295 else:
    296     message = self._backend.format_kernel_error(error)
--> 297     raise ak._errors.index_error(self, slicer, message)

IndexError: cannot slice ListArray (of length 5) with [[-1], [0], [3], [-1], [1]]: index out of range while attempting to get index -1 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-51/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)

This error occurred while attempting to slice

    <Array [[], [10, 3, 2, 9], ..., [], [8, 9, -1]] type='5 * var * int64'>

with

    <Array [[-1], [0], [3], [-1], [1]] type='5 * var * int64'>

From the error message, it is clear that for some sublist(s) the index -1 is out of range. This makes sense; some of our sublists are empty, meaning that there is no valid integer to index into them.

Now let’s look at the result of indexing with mask_identity=True.

index = ak.argmax(array, keepdims=True, axis=-1, mask_identity=True)

Because it contains an option type, index already satisfies rule (2) in Building an awkward index, and we do not need to convert it to a ragged array. We can see that indexing with it succeeds:

array[index]
[[None],
 [10],
 [12],
 [None],
 [9]]
--------
backend: cpu
nbytes: 112 B
type: 5 * var * ?int64

Here, the missing values in the index array correspond to missing values in the output array.

Indexing with missing sublists

Ragged indexing also supports using None in place of empty sublists within an index. For example, given the following array

array = ak.Array(
    [
        [10, 3, 2, 9],
        [4, 5, 5, 12, 6],
        [],
        [8, 9, -1],
    ]
)
array
[[10, 3, 2, 9],
 [4, 5, 5, 12, 6],
 [],
 [8, 9, -1]]
------------------
backend: cpu
nbytes: 136 B
type: 4 * var * int64

let’s build a ragged index to pull out some particular values. Rather than using empty lists, we can use None to mask out sublists that we don’t care about:

array[
    [
        [0, 1],
        None,
        [],
        [2],
    ],
]
[[10, 3],
 None,
 [],
 [-1]]
---------
backend: cpu
nbytes: 96 B
type: 4 * option[var * int64]

If we compare this with simply providing an empty sublist,

array[
    [
        [0, 1],
        [],
        [],
        [2],
    ],
]
[[10, 3],
 [],
 [],
 [-1]]
---------
backend: cpu
nbytes: 64 B
type: 4 * var * int64

we can see that the None value introduces an option-type into the final result. None values can be used at any level in the index array to introduce an option-type at that depth in the result.