How to filter with arrays containing missing values

import awkward as ak
import numpy as np

Indexing with missing values

In Building an awkward index, we looked at building arrays of integers to perform awkward indexing using ak.argmin() and ak.argmax(). In particular, the keepdims argument of ak.argmin() and ak.argmax() is very useful for creating arrays that can be used to index into the original array. However, reducers such as ak.argmax() need special handling when they are asked to operate upon empty lists, because an empty list has no maximum whose position could be returned.

Let’s first create an array that contains empty sublists:

array = ak.Array(
    [
        [],
        [10, 3, 2, 9],
        [4, 5, 5, 12, 6],
        [],
        [8, 9, -1],
    ]
)
array
[[],
 [10, 3, 2, 9],
 [4, 5, 5, 12, 6],
 [],
 [8, 9, -1]]
---------------------
type: 5 * var * int64

Awkward reducers accept a mask_identity argument, which changes the ak.Array.type and the values of the result:

ak.argmax(array, keepdims=True, axis=-1, mask_identity=False)
[[-1],
 [0],
 [3],
 [-1],
 [1]]
-------------------
type: 5 * 1 * int64
ak.argmax(array, keepdims=True, axis=-1, mask_identity=True)
[[None],
 [0],
 [3],
 [None],
 [1]]
--------------------
type: 5 * 1 * ?int64

Setting mask_identity=False yields the identity value for the reducer instead of None when reducing empty lists. From the above examples of ak.argmax(), we can see that the identity for ak.argmax() is -1. What happens if we try to use the array produced with mask_identity=False to index into array?

As discussed in Indexing with argmin and argmax, we first need to convert at least one dimension to a ragged dimension:

index = ak.from_regular(
    ak.argmax(array, keepdims=True, axis=-1, mask_identity=False)
)

Now, if we try to index into array with index, it will raise an exception:

array[index]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 array[index]

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1103, in Array.__getitem__(self, where)
   1099 where = _normalize_named_slice(named_axis, where, ndim)
   1101 NamedAxis.mapping = named_axis
-> 1103 indexed_layout = prepare_layout(self._layout._getitem(where, NamedAxis))
   1105 if NamedAxis.mapping:
   1106     return ak.operations.ak_with_named_axis._impl(
   1107         indexed_layout,
   1108         named_axis=NamedAxis.mapping,
   (...)
   1111         attrs=self._attrs,
   1112     )

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:651, in Content._getitem(self, where, named_axis)
    648         return out._getitem_at(0)
    650 elif isinstance(where, ak.highlevel.Array):
--> 651     return self._getitem(where.layout, named_axis)
    653 # Convert between nplikes of different backends
    654 elif (
    655     isinstance(where, ak.contents.Content)
    656     and where.backend is not self._backend
    657 ):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:728, in Content._getitem(self, where, named_axis)
    725     return where.to_NumpyArray(np.int64)
    727 elif isinstance(where, Content):
--> 728     return self._getitem((where,), named_axis)
    730 elif is_sized_iterable(where):
    731     # Do we have an array
    732     nplike = nplike_of_obj(where, default=None)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:643, in Content._getitem(self, where, named_axis)
    634 named_axis.mapping = _named_axis
    636 next = ak.contents.RegularArray(
    637     this,
    638     this.length,
    639     1,
    640     parameters=None,
    641 )
--> 643 out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
    645 if out.length is not unknown_length and out.length == 0:
    646     return out._getitem_nothing()

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/regulararray.py:698, in RegularArray._getitem_next(self, head, tail, advanced)
    682     assert head.offsets.nplike is index_nplike
    683     self._maybe_index_error(
    684         self._backend[
    685             "awkward_RegularArray_getitem_jagged_expand",
   (...)
    696         slicer=head,
    697     )
--> 698     down = self._content._getitem_next_jagged(
    699         multistarts, multistops, head._content, tail
    700     )
    702     return RegularArray(
    703         down, headlength, self._length, parameters=self._parameters
    704     )
    706 elif isinstance(head, ak.contents.IndexedOptionArray):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listoffsetarray.py:424, in ListOffsetArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    418 def _getitem_next_jagged(
    419     self, slicestarts: Index, slicestops: Index, slicecontent: Content, tail
    420 ) -> Content:
    421     out = ak.contents.ListArray(
    422         self.starts, self.stops, self._content, parameters=self._parameters
    423     )
--> 424     return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listarray.py:545, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    534 nextcarry = ak.index.Index64.empty(carrylen, self._backend.index_nplike)
    536 assert (
    537     outoffsets.nplike is self._backend.index_nplike
    538     and nextcarry.nplike is self._backend.index_nplike
   (...)
    543     and self._stops.nplike is self._backend.index_nplike
    544 )
--> 545 self._maybe_index_error(
    546     self._backend[
    547         "awkward_ListArray_getitem_jagged_apply",
    548         outoffsets.dtype.type,
    549         nextcarry.dtype.type,
    550         slicestarts.dtype.type,
    551         slicestops.dtype.type,
    552         sliceindex.dtype.type,
    553         self._starts.dtype.type,
    554         self._stops.dtype.type,
    555     ](
    556         outoffsets.data,
    557         nextcarry.data,
    558         slicestarts.data,
    559         slicestops.data,
    560         slicestarts.length,
    561         sliceindex.data,
    562         sliceindex.length,
    563         self._starts.data,
    564         self._stops.data,
    565         self._content.length,
    566     ),
    567     slicer=ak.contents.ListArray(slicestarts, slicestops, slicecontent),
    568 )
    569 nextcontent = self._content._carry(nextcarry, True)
    570 nexthead, nexttail = ak._slicing.head_tail(tail)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:289, in Content._maybe_index_error(self, error, slicer)
    287 else:
    288     message = self._backend.format_kernel_error(error)
--> 289     raise ak._errors.index_error(self, slicer, message)

IndexError: cannot slice ListArray (of length 5) with [[-1], [0], [3], [-1], [1]]: index out of range while attempting to get index -1 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-39/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)

This error occurred while attempting to slice

    <Array [[], [10, 3, 2, 9], ..., [], [8, 9, -1]] type='5 * var * int64'>

with

    <Array [[-1], [0], [3], [-1], [1]] type='5 * var * int64'>

From the error message, it is clear that for some sublist(s) the index -1 is out of range. This makes sense; some of our sublists are empty, meaning that there is no valid integer to index into them.

Now let’s look at the result of indexing with mask_identity=True.

index = ak.argmax(array, keepdims=True, axis=-1, mask_identity=True)

Because it contains an option type, index already satisfies rule (2) in Building an awkward index, and we do not need to convert it to a ragged array. We can see that indexing with it succeeds:

array[index]
[[None],
 [10],
 [12],
 [None],
 [9]]
----------------------
type: 5 * var * ?int64

Here, the missing values in the index array correspond to missing values in the output array.

Indexing with missing sublists

Ragged indexing also supports using None in place of empty sublists within an index. For example, given the following array:

array = ak.Array(
    [
        [10, 3, 2, 9],
        [4, 5, 5, 12, 6],
        [],
        [8, 9, -1],
    ]
)
array
[[10, 3, 2, 9],
 [4, 5, 5, 12, 6],
 [],
 [8, 9, -1]]
---------------------
type: 4 * var * int64

let’s build a ragged index to pull out some particular values. Rather than using empty lists, we can use None to mask out sublists that we don’t care about:

array[
    [
        [0, 1],
        None,
        [],
        [2],
    ],
]
[[10, 3],
 None,
 [],
 [-1]]
-----------------------------
type: 4 * option[var * int64]

If we compare this with simply providing an empty sublist,

array[
    [
        [0, 1],
        [],
        [],
        [2],
    ],
]
[[10, 3],
 [],
 [],
 [-1]]
---------------------
type: 4 * var * int64

we can see that the None value introduces an option-type into the final result. None values can be used at any level in the index array to introduce an option-type at that depth in the result.