How to filter with arrays containing missing values
import awkward as ak
import numpy as np
Indexing with missing values
In Building an awkward index, we looked at building arrays of integers to perform awkward indexing using ak.argmin() and ak.argmax(). In particular, the keepdims argument of ak.argmin() and ak.argmax() is very useful for creating arrays that can be used to index into the original array. However, reducers such as ak.argmax() behave differently when they are asked to operate upon empty lists.
Let’s first create an array that contains empty sublists:
array = ak.Array(
[
[],
[10, 3, 2, 9],
[4, 5, 5, 12, 6],
[],
[8, 9, -1],
]
)
array
[[], [10, 3, 2, 9], [4, 5, 5, 12, 6], [], [8, 9, -1]] --------------------- type: 5 * var * int64
Awkward reducers accept a mask_identity argument, which changes the ak.Array.type and the values of the result:
ak.argmax(array, keepdims=True, axis=-1, mask_identity=False)
[[-1], [0], [3], [-1], [1]] ------------------- type: 5 * 1 * int64
ak.argmax(array, keepdims=True, axis=-1, mask_identity=True)
[[None], [0], [3], [None], [1]] -------------------- type: 5 * 1 * ?int64
Setting mask_identity=False yields the identity value for the reducer instead of None when reducing empty lists. From the above examples of ak.argmax(), we can see that the identity for ak.argmax() is -1. What happens if we try to use the array produced with mask_identity=False to index into array?
As discussed in Indexing with argmin and argmax, we first need to convert at least one dimension to a ragged dimension:
index = ak.from_regular(
ak.argmax(array, keepdims=True, axis=-1, mask_identity=False)
)
Now, if we try to index into array with index, it will raise an exception:
array[index]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[6], line 1
----> 1 array[index]
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1103, in Array.__getitem__(self, where)
1099 where = _normalize_named_slice(named_axis, where, ndim)
1101 NamedAxis.mapping = named_axis
-> 1103 indexed_layout = prepare_layout(self._layout._getitem(where, NamedAxis))
1105 if NamedAxis.mapping:
1106 return ak.operations.ak_with_named_axis._impl(
1107 indexed_layout,
1108 named_axis=NamedAxis.mapping,
(...)
1111 attrs=self._attrs,
1112 )
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:651, in Content._getitem(self, where, named_axis)
648 return out._getitem_at(0)
650 elif isinstance(where, ak.highlevel.Array):
--> 651 return self._getitem(where.layout, named_axis)
653 # Convert between nplikes of different backends
654 elif (
655 isinstance(where, ak.contents.Content)
656 and where.backend is not self._backend
657 ):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:728, in Content._getitem(self, where, named_axis)
725 return where.to_NumpyArray(np.int64)
727 elif isinstance(where, Content):
--> 728 return self._getitem((where,), named_axis)
730 elif is_sized_iterable(where):
731 # Do we have an array
732 nplike = nplike_of_obj(where, default=None)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:643, in Content._getitem(self, where, named_axis)
634 named_axis.mapping = _named_axis
636 next = ak.contents.RegularArray(
637 this,
638 this.length,
639 1,
640 parameters=None,
641 )
--> 643 out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
645 if out.length is not unknown_length and out.length == 0:
646 return out._getitem_nothing()
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/regulararray.py:698, in RegularArray._getitem_next(self, head, tail, advanced)
682 assert head.offsets.nplike is index_nplike
683 self._maybe_index_error(
684 self._backend[
685 "awkward_RegularArray_getitem_jagged_expand",
(...)
696 slicer=head,
697 )
--> 698 down = self._content._getitem_next_jagged(
699 multistarts, multistops, head._content, tail
700 )
702 return RegularArray(
703 down, headlength, self._length, parameters=self._parameters
704 )
706 elif isinstance(head, ak.contents.IndexedOptionArray):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listoffsetarray.py:424, in ListOffsetArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
418 def _getitem_next_jagged(
419 self, slicestarts: Index, slicestops: Index, slicecontent: Content, tail
420 ) -> Content:
421 out = ak.contents.ListArray(
422 self.starts, self.stops, self._content, parameters=self._parameters
423 )
--> 424 return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/listarray.py:545, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
534 nextcarry = ak.index.Index64.empty(carrylen, self._backend.index_nplike)
536 assert (
537 outoffsets.nplike is self._backend.index_nplike
538 and nextcarry.nplike is self._backend.index_nplike
(...)
543 and self._stops.nplike is self._backend.index_nplike
544 )
--> 545 self._maybe_index_error(
546 self._backend[
547 "awkward_ListArray_getitem_jagged_apply",
548 outoffsets.dtype.type,
549 nextcarry.dtype.type,
550 slicestarts.dtype.type,
551 slicestops.dtype.type,
552 sliceindex.dtype.type,
553 self._starts.dtype.type,
554 self._stops.dtype.type,
555 ](
556 outoffsets.data,
557 nextcarry.data,
558 slicestarts.data,
559 slicestops.data,
560 slicestarts.length,
561 sliceindex.data,
562 sliceindex.length,
563 self._starts.data,
564 self._stops.data,
565 self._content.length,
566 ),
567 slicer=ak.contents.ListArray(slicestarts, slicestops, slicecontent),
568 )
569 nextcontent = self._content._carry(nextcarry, True)
570 nexthead, nexttail = ak._slicing.head_tail(tail)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:289, in Content._maybe_index_error(self, error, slicer)
287 else:
288 message = self._backend.format_kernel_error(error)
--> 289 raise ak._errors.index_error(self, slicer, message)
IndexError: cannot slice ListArray (of length 5) with [[-1], [0], [3], [-1], [1]]: index out of range while attempting to get index -1 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-39/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)
This error occurred while attempting to slice
<Array [[], [10, 3, 2, 9], ..., [], [8, 9, -1]] type='5 * var * int64'>
with
<Array [[-1], [0], [3], [-1], [1]] type='5 * var * int64'>
From the error message, it is clear that for some sublist(s) the index -1 is out of range. This makes sense; some of our sublists are empty, meaning that there is no valid integer to index into them.
Now let’s look at the result of indexing with mask_identity=True.
index = ak.argmax(array, keepdims=True, axis=-1, mask_identity=True)
Because it contains an option type, index already satisfies rule (2) in Building an awkward index, and we do not need to convert it to a ragged array. We can see that this indexing succeeds:
array[index]
[[None], [10], [12], [None], [9]] ---------------------- type: 5 * var * ?int64
Here, the missing values in the index array correspond to missing values in the output array.
Indexing with missing sublists
Ragged indexing also supports using None in place of empty sublists within an index. For example, given the following array:
array = ak.Array(
[
[10, 3, 2, 9],
[4, 5, 5, 12, 6],
[],
[8, 9, -1],
]
)
array
[[10, 3, 2, 9], [4, 5, 5, 12, 6], [], [8, 9, -1]] --------------------- type: 4 * var * int64
let’s build a ragged index to pull out some particular values. Rather than using empty lists, we can use None to mask out sublists that we don’t care about:
array[
[
[0, 1],
None,
[],
[2],
],
]
[[10, 3], None, [], [-1]] ----------------------------- type: 4 * option[var * int64]
If we compare this with simply providing an empty sublist,
array[
[
[0, 1],
[],
[],
[2],
],
]
[[10, 3], [], [], [-1]] --------------------- type: 4 * var * int64
we can see that the None value introduces an option-type into the final result. None values can be used at any level in the index array to introduce an option-type at that depth in the result.