How to create arrays of missing data
Data at any level of an Awkward Array can be “missing,” represented by None in Python.
This functionality is somewhat like NumPy’s masked arrays, but masked arrays can only declare numerical values to be missing (not, for instance, a row of a 2-dimensional array) and they represent missing data with an np.ma.masked object instead of None.
Pandas also handles missing data, but in several different ways. For floating point columns, NaN (not a number) is used to mean “missing,” and as of version 1.0, Pandas has a pd.NA object for missing data in other data types.
In Awkward Array, floating point NaN and a missing value are clearly distinct. Missing data, like all data in Awkward Arrays, are not stored as Python objects; they are converted to and from None by ak.to_list() and ak.from_iter().
import awkward as ak
import numpy as np
From Python None
The ak.Array constructor and ak.from_iter() interpret None as a missing value, and ak.to_list() converts missing values back into None.
ak.Array([1, 2, 3, None, 4, 5])
[1, 2, 3, None, 4, 5] ---------------- type: 6 * ?int64
The missing values can be deeply nested (missing integers):
ak.Array([[[[], [1, 2, None]]], [[[3]]], []])
[[[[], [1, 2, None]]], [[[3]]], []] ---------------------------------- type: 3 * var * var * var * ?int64
They can be shallow (missing lists):
ak.Array([[[[], [1, 2]]], None, [[[3]]], []])
[[[[], [1, 2]]], None, [[[3]]], []] ----------------------------------------- type: 4 * option[var * var * var * int64]
Or both:
ak.Array([[[[], [3]]], None, [[[None]]], []])
[[[[], [3]]], None, [[[None]]], []] ------------------------------------------ type: 4 * option[var * var * var * ?int64]
Records can also be missing:
ak.Array([{"x": 1, "y": 1}, None, {"x": 2, "y": 2}])
[{x: 1, y: 1}, None, {x: 2, y: 2}] -------------- type: 3 * ?{ x: int64, y: int64 }
Potentially missing values are represented in the type string as “?” or as “option[...]” (the latter when the nested type is a list, which needs to be bracketed for clarity).
From NumPy arrays
Normal NumPy arrays can’t represent missing data, but masked arrays can. Here is how one is constructed in NumPy:
numpy_array = np.ma.MaskedArray([1, 2, 3, 4, 5], [False, False, True, True, False])
numpy_array
masked_array(data=[1, 2, --, --, 5],
mask=[False, False, True, True, False],
fill_value=999999)
It returns np.ma.masked objects if you try to access missing values:
numpy_array[0], numpy_array[1], numpy_array[2], numpy_array[3], numpy_array[4]
(np.int64(1), np.int64(2), masked, masked, np.int64(5))
But it uses None for missing values in tolist:
numpy_array.tolist()
[1, 2, None, None, 5]
The ak.from_numpy() function converts masked arrays into Awkward Arrays with missing values, as does the ak.Array constructor.
awkward_array = ak.Array(numpy_array)
awkward_array
[1, 2, None, None, 5] ---------------- type: 5 * ?int64
The reverse, ak.to_numpy(), returns masked arrays if the Awkward Array has missing data.
ak.to_numpy(awkward_array)
masked_array(data=[1, 2, --, --, 5],
mask=[False, False, True, True, False],
fill_value=999999)
But np.asarray, the usual way of casting data as NumPy arrays, does not: it raises a ValueError instead. (np.asarray is supposed to return a plain np.ndarray, which np.ma.masked_array is not.)
np.asarray(awkward_array)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[12], line 1
----> 1 np.asarray(awkward_array)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1517, in Array.__array__(self, dtype)
1512 with ak._errors.OperationErrorContext(
1513 "numpy.asarray", (self,), {"dtype": dtype}
1514 ):
1515 from awkward._connect.numpy import convert_to_array
-> 1517 return convert_to_array(self._layout, dtype=dtype)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:511, in convert_to_array(layout, dtype)
510 def convert_to_array(layout, dtype=None):
--> 511 out = ak.operations.to_numpy(layout, allow_missing=False)
512 if dtype is None:
513 return out
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_dispatch.py:64, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
62 # Failed to find a custom overload, so resume the original function
63 try:
---> 64 next(gen_or_result)
65 except StopIteration as err:
66 return err.value
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_to_numpy.py:48, in to_numpy(array, allow_missing)
45 yield (array,)
47 # Implementation
---> 48 return _impl(array, allow_missing)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_to_numpy.py:60, in _impl(array, allow_missing)
57 backend = NumpyBackend.instance()
58 numpy_layout = layout.to_backend(backend)
---> 60 return numpy_layout.to_backend_array(allow_missing=allow_missing)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:1112, in Content.to_backend_array(self, allow_missing, backend)
1110 else:
1111 backend = regularize_backend(backend)
-> 1112 return self._to_backend_array(allow_missing, backend)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/bytemaskedarray.py:1068, in ByteMaskedArray._to_backend_array(self, allow_missing, backend)
1067 def _to_backend_array(self, allow_missing, backend):
-> 1068 return self.to_IndexedOptionArray64()._to_backend_array(allow_missing, backend)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/indexedoptionarray.py:1615, in IndexedOptionArray._to_backend_array(self, allow_missing, backend)
1613 return nplike.ma.MaskedArray(data, mask)
1614 else:
-> 1615 raise ValueError(
1616 "Content.to_nplike cannot convert 'None' values to "
1617 "np.ma.MaskedArray unless the "
1618 "'allow_missing' parameter is set to True"
1619 )
1620 else:
1621 if allow_missing:
ValueError: Content.to_nplike cannot convert 'None' values to np.ma.MaskedArray unless the 'allow_missing' parameter is set to True
This error occurred while calling
numpy.asarray(
<Array [1, 2, None, None, 5] type='5 * ?int64'>
dtype = None
)
Missing rows vs missing numbers
In Awkward Array, a missing list is a different thing from a list whose values are missing. NumPy masked arrays can only mask individual values, so ak.to_numpy() converts a missing list into a row in which every value is masked.
missing_row = ak.Array([[1, 2, 3], None, [4, 5, 6]])
missing_row
[[1, 2, 3], None, [4, 5, 6]] ----------------------------- type: 3 * option[var * int64]
ak.to_numpy(missing_row)
masked_array(
data=[[1, 2, 3],
[--, --, --],
[4, 5, 6]],
mask=[[False, False, False],
[ True, True, True],
[False, False, False]],
fill_value=999999)
NaN is not missing
Floating point NaN values are simply unrelated to missing values, in both Awkward Array and NumPy.
missing_with_nan = ak.Array([1.1, 2.2, np.nan, None, 3.3])
missing_with_nan
[1.1, 2.2, nan, None, 3.3] ------------------ type: 5 * ?float64
ak.to_numpy(missing_with_nan)
masked_array(data=[1.1, 2.2, nan, --, 3.3],
mask=[False, False, False, True, False],
fill_value=1e+20)
Missing values as empty lists
Sometimes, it’s useful to think about a potentially missing value as a length-1 list if it is not missing and a length-0 list if it is. (Some languages define the option type as a kind of list.)
The Awkward functions ak.singletons() and ak.firsts() convert between “None form” and “lists form”: ak.singletons() goes from None form to lists form, and ak.firsts() goes back.
none_form = ak.Array([1, 2, 3, None, None, 5])
none_form
[1, 2, 3, None, None, 5] ---------------- type: 6 * ?int64
lists_form = ak.singletons(none_form)
lists_form
[[1], [2], [3], [], [], [5]] --------------------- type: 6 * var * int64
ak.firsts(lists_form)
[1, 2, 3, None, None, 5] ---------------- type: 6 * ?int64
Masking instead of slicing
The most common way of filtering data is to slice it with an array of booleans (usually the result of a calculation).
array = ak.Array([1, 2, 3, 4, 5])
array
[1, 2, 3, 4, 5] --------------- type: 5 * int64
booleans = ak.Array([True, True, False, False, True])
booleans
[True, True, False, False, True] -------------- type: 5 * bool
array[booleans]
[1, 2, 5] --------------- type: 3 * int64
The data can also be effectively filtered by replacing values with None. The following syntax does that:
array.mask[booleans]
[1, 2, None, None, 5] ---------------- type: 5 * ?int64
(Or use the ak.mask() function.)
An advantage of masking is that the length and nesting structure of the masked array are the same as those of the original array, so anything that broadcasts with one broadcasts with the other (so that unfiltered data can be used interchangeably with filtered data).
array + array.mask[booleans]
[2, 4, None, None, 10] ---------------- type: 5 * ?int64
whereas the sliced array has a different length and fails to broadcast:
array + array[booleans]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[25], line 1
----> 1 array + array[booleans]
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_operators.py:54, in _binary_method.<locals>.func(self, other)
51 if _disables_array_ufunc(other):
52 return NotImplemented
---> 54 return ufunc(self, other)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1594, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
1592 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
1593 with ak._errors.OperationErrorContext(name, inputs, kwargs):
-> 1594 return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:469, in array_ufunc(ufunc, method, inputs, kwargs)
461 raise TypeError(
462 "no {}.{} overloads for custom types: {}".format(
463 type(ufunc).__module__, ufunc.__name__, ", ".join(error_message)
464 )
465 )
467 return None
--> 469 out = ak._broadcasting.broadcast_and_apply(
470 inputs,
471 action,
472 depth_context=depth_context,
473 lateral_context=lateral_context,
474 allow_records=False,
475 function_name=ufunc.__name__,
476 )
478 out_named_axis = functools.reduce(
479 _unify_named_axis, lateral_context[NAMED_AXIS_KEY].named_axis
480 )
481 if len(out) == 1:
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1200, in broadcast_and_apply(inputs, action, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
1198 backend = backend_of(*inputs, coerce_to_common=False)
1199 isscalar = []
-> 1200 out = apply_step(
1201 backend,
1202 broadcast_pack(inputs, isscalar),
1203 action,
1204 0,
1205 depth_context,
1206 lateral_context,
1207 {
1208 "allow_records": allow_records,
1209 "left_broadcast": left_broadcast,
1210 "right_broadcast": right_broadcast,
1211 "numpy_to_regular": numpy_to_regular,
1212 "regular_to_jagged": regular_to_jagged,
1213 "function_name": function_name,
1214 "broadcast_parameters_rule": broadcast_parameters_rule,
1215 },
1216 )
1217 assert isinstance(out, tuple)
1218 return tuple(broadcast_unpack(x, isscalar) for x in out)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1178, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
1176 return result
1177 elif result is None:
-> 1178 return continuation()
1179 else:
1180 raise AssertionError(result)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1147, in apply_step.<locals>.continuation()
1145 # Any non-string list-types?
1146 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1147 return broadcast_any_list()
1149 # Any RecordArrays?
1150 elif any(x.is_record for x in contents):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:663, in apply_step.<locals>.broadcast_any_list()
661 nextparameters.append(x._parameters)
662 else:
--> 663 raise ValueError(
664 "cannot broadcast RegularArray of size "
665 f"{x.size} with RegularArray of size {dim_size}{in_function(options)}"
666 )
667 else:
668 nextinputs.append(x)
ValueError: cannot broadcast RegularArray of size 3 with RegularArray of size 5 in add
This error occurred while calling
numpy.add.__call__(
<Array [1, 2, 3, 4, 5] type='5 * int64'>
<Array [1, 2, 5] type='3 * int64'>
)
With ArrayBuilder
ak.ArrayBuilder is described in more detail in this tutorial, but you can add missing values to an array using the null method or by appending None.
(This is what ak.from_iter() uses internally to accumulate data.)
builder = ak.ArrayBuilder()
builder.append(1)
builder.append(2)
builder.null()
builder.append(None)
builder.append(3)
array = builder.snapshot()
array
[1, 2, None, None, 3] ---------------- type: 5 * ?int64
In Numba
Functions that Numba Just-In-Time (JIT) compiles can use ak.ArrayBuilder or construct a boolean array for ak.mask().
(ak.ArrayBuilder can’t be constructed or converted to an array using snapshot inside a JIT-compiled function, but can be outside the compiled context.)
import numba as nb

@nb.jit
def example(builder):
    builder.append(1)
    builder.append(2)
    builder.null()
    builder.append(None)
    builder.append(3)
    return builder

builder = example(ak.ArrayBuilder())
array = builder.snapshot()
array
[1, 2, None, None, 3] ---------------- type: 5 * ?int64
@nb.jit
def faster_example():
    data = np.empty(5, np.int64)
    mask = np.empty(5, np.bool_)
    data[0] = 1
    mask[0] = True
    data[1] = 2
    mask[1] = True
    # data[2] and data[3] are left uninitialized; they are masked out below
    mask[2] = False
    mask[3] = False
    data[4] = 5
    mask[4] = True
    return data, mask

data, mask = faster_example()
array = ak.Array(data).mask[mask]
array
[1, 2, None, None, 5] ---------------- type: 5 * ?int64