How to create arrays of missing data

Data at any level of an Awkward Array can be “missing,” represented by None in Python.

This functionality is somewhat like NumPy’s masked arrays, but masked arrays can only declare numerical values to be missing (not, for instance, a row of a 2-dimensional array) and they represent missing data with an np.ma.masked object instead of None.

Pandas also handles missing data, but in several different ways. For floating point columns, NaN (not a number) is used to mean “missing,” and as of version 1.0, Pandas has a pd.NA object for missing data in other data types.

In Awkward Array, floating point NaN and a missing value are clearly distinct. Missing data, like all data in Awkward Arrays, are not stored as Python objects; ak.from_iter() converts None into a missing value, and ak.to_list() converts a missing value back into None.

import awkward as ak
import numpy as np

From Python None

The ak.Array constructor and ak.from_iter() interpret None as a missing value, and ak.to_list() converts missing values back into None.

ak.Array([1, 2, 3, None, 4, 5])
[1,
 2,
 3,
 None,
 4,
 5]
----------------
backend: cpu
nbytes: 88 B
type: 6 * ?int64

The missing values can be deeply nested (missing integers):

ak.Array([[[[], [1, 2, None]]], [[[3]]], []])
[[[[], [1, 2, None]]],
 [[[3]]],
 []]
----------------------------------
backend: cpu
nbytes: 144 B
type: 3 * var * var * var * ?int64

They can be shallow (missing lists):

ak.Array([[[[], [1, 2]]], None, [[[3]]], []])
[[[[], [1, 2]]],
 None,
 [[[3]]],
 []]
-----------------------------------------
backend: cpu
nbytes: 144 B
type: 4 * option[var * var * var * int64]

Or both:

ak.Array([[[[], [3]]], None, [[[None]]], []])
[[[[], [3]]],
 None,
 [[[None]]],
 []]
------------------------------------------
backend: cpu
nbytes: 144 B
type: 4 * option[var * var * var * ?int64]

Records can also be missing:

ak.Array([{"x": 1, "y": 1}, None, {"x": 2, "y": 2}])
[{x: 1, y: 1},
 None,
 {x: 2, y: 2}]
-----------------------------------------
backend: cpu
nbytes: 56 B
type: 3 * ?{
    x: int64,
    y: int64
}

Potentially missing values are represented in the type string as “?” or, when the nested type is a list that needs to be bracketed for clarity, as “option[...]”.

From NumPy arrays

Normal NumPy arrays can’t represent missing data, but masked arrays can. Here is how one is constructed in NumPy:

numpy_array = np.ma.MaskedArray([1, 2, 3, 4, 5], [False, False, True, True, False])
numpy_array
masked_array(data=[1, 2, --, --, 5],
             mask=[False, False,  True,  True, False],
       fill_value=999999)

It returns np.ma.masked objects if you try to access missing values:

numpy_array[0], numpy_array[1], numpy_array[2], numpy_array[3], numpy_array[4]
(np.int64(1), np.int64(2), masked, masked, np.int64(5))

But it uses None for missing values in tolist:

numpy_array.tolist()
[1, 2, None, None, 5]

The ak.from_numpy() function converts masked arrays into Awkward Arrays with missing values, as does the ak.Array constructor.

awkward_array = ak.Array(numpy_array)
awkward_array
[1,
 2,
 None,
 None,
 5]
----------------
backend: cpu
nbytes: 45 B
type: 5 * ?int64

The reverse, ak.to_numpy(), returns masked arrays if the Awkward Array has missing data.

ak.to_numpy(awkward_array)
masked_array(data=[1, 2, --, --, 5],
             mask=[False, False,  True,  True, False],
       fill_value=999999)

But np.asarray, the usual way of casting data as NumPy arrays, raises an error instead of returning a masked array. (np.asarray is supposed to return a plain np.ndarray, which np.ma.MaskedArray is not.)

np.asarray(awkward_array)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 1
----> 1 np.asarray(awkward_array)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1535, in Array.__array__(self, dtype)
   1510 def __array__(self, dtype=None):
   1511     """
   1512     Intercepts attempts to convert this Array into a NumPy array and
   1513     either performs a zero-copy conversion or raises an error.
   (...)
   1533     cannot be sliced as dimensions.
   1534     """
-> 1535     with ak._errors.OperationErrorContext(
   1536         "numpy.asarray", (self,), {"dtype": dtype}
   1537     ):
   1538         from awkward._connect.numpy import convert_to_array
   1540         return convert_to_array(self._layout, dtype=dtype)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
     78     self._slate.__dict__.clear()
     79     # Handle caught exception
---> 80     raise self.decorate_exception(exception_type, exception_value)
     81 else:
     82     # Step out of the way so that another ErrorContext can become primary.
     83     if self.primary() is self:

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1540, in Array.__array__(self, dtype)
   1535 with ak._errors.OperationErrorContext(
   1536     "numpy.asarray", (self,), {"dtype": dtype}
   1537 ):
   1538     from awkward._connect.numpy import convert_to_array
-> 1540     return convert_to_array(self._layout, dtype=dtype)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:511, in convert_to_array(layout, dtype)
    510 def convert_to_array(layout, dtype=None):
--> 511     out = ak.operations.to_numpy(layout, allow_missing=False)
    512     if dtype is None:
    513         return out

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_dispatch.py:64, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
     62 # Failed to find a custom overload, so resume the original function
     63 try:
---> 64     next(gen_or_result)
     65 except StopIteration as err:
     66     return err.value

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_to_numpy.py:48, in to_numpy(array, allow_missing)
     45 yield (array,)
     47 # Implementation
---> 48 return _impl(array, allow_missing)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_to_numpy.py:60, in _impl(array, allow_missing)
     57 backend = NumpyBackend.instance()
     58 numpy_layout = layout.to_backend(backend)
---> 60 return numpy_layout.to_backend_array(allow_missing=allow_missing)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/content.py:1121, in Content.to_backend_array(self, allow_missing, backend)
   1119 else:
   1120     backend = regularize_backend(backend)
-> 1121 return self._to_backend_array(allow_missing, backend)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/bytemaskedarray.py:1077, in ByteMaskedArray._to_backend_array(self, allow_missing, backend)
   1076 def _to_backend_array(self, allow_missing, backend):
-> 1077     return self.to_IndexedOptionArray64()._to_backend_array(allow_missing, backend)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/indexedoptionarray.py:1623, in IndexedOptionArray._to_backend_array(self, allow_missing, backend)
   1621         return nplike.ma.MaskedArray(data, mask)
   1622     else:
-> 1623         raise ValueError(
   1624             "Content.to_nplike cannot convert 'None' values to "
   1625             "np.ma.MaskedArray unless the "
   1626             "'allow_missing' parameter is set to True"
   1627         )
   1628 else:
   1629     if allow_missing:

ValueError: Content.to_nplike cannot convert 'None' values to np.ma.MaskedArray unless the 'allow_missing' parameter is set to True

This error occurred while calling

    numpy.asarray(
        <Array [1, 2, None, None, 5] type='5 * ?int64'>
        dtype = None
    )

Missing rows vs missing numbers

In Awkward Array, a missing list is a different thing from a list whose values are missing. However, ak.to_numpy() converts a missing list into a row of masked values for you.

missing_row = ak.Array([[1, 2, 3], None, [4, 5, 6]])
missing_row
[[1, 2, 3],
 None,
 [4, 5, 6]]
-----------------------------
backend: cpu
nbytes: 96 B
type: 3 * option[var * int64]

ak.to_numpy(missing_row)
masked_array(
  data=[[1, 2, 3],
        [--, --, --],
        [4, 5, 6]],
  mask=[[False, False, False],
        [ True,  True,  True],
        [False, False, False]],
  fill_value=999999)

NaN is not missing

Floating point NaN values are simply unrelated to missing values, in both Awkward Array and NumPy.

missing_with_nan = ak.Array([1.1, 2.2, np.nan, None, 3.3])
missing_with_nan
[1.1,
 2.2,
 nan,
 None,
 3.3]
------------------
backend: cpu
nbytes: 72 B
type: 5 * ?float64

ak.to_numpy(missing_with_nan)
masked_array(data=[1.1, 2.2, nan, --, 3.3],
             mask=[False, False, False,  True, False],
       fill_value=1e+20)

Missing values as empty lists

Sometimes, it’s useful to think about a potentially missing value as a length-1 list if it is not missing and a length-0 list if it is. (Some languages define the option type as a kind of list.)

The Awkward function ak.singletons() converts from “None form” to “lists form,” and ak.firsts() converts back.

none_form = ak.Array([1, 2, 3, None, None, 5])
none_form
[1,
 2,
 3,
 None,
 None,
 5]
----------------
backend: cpu
nbytes: 80 B
type: 6 * ?int64

lists_form = ak.singletons(none_form)
lists_form
[[1],
 [2],
 [3],
 [],
 [],
 [5]]
---------------------
backend: cpu
nbytes: 88 B
type: 6 * var * int64

ak.firsts(lists_form)
[1,
 2,
 3,
 None,
 None,
 5]
----------------
backend: cpu
nbytes: 80 B
type: 6 * ?int64

Masking instead of slicing

The most common way of filtering data is to slice it with an array of booleans (usually the result of a calculation).

array = ak.Array([1, 2, 3, 4, 5])
array
[1,
 2,
 3,
 4,
 5]
---------------
backend: cpu
nbytes: 40 B
type: 5 * int64

booleans = ak.Array([True, True, False, False, True])
booleans
[True,
 True,
 False,
 False,
 True]
--------------
backend: cpu
nbytes: 5 B
type: 5 * bool

array[booleans]
[1,
 2,
 5]
---------------
backend: cpu
nbytes: 24 B
type: 3 * int64

The data can also be effectively filtered by replacing values with None. The following syntax does that:

array.mask[booleans]
[1,
 2,
 None,
 None,
 5]
----------------
backend: cpu
nbytes: 45 B
type: 5 * ?int64

(Or use the ak.mask() function.)

An advantage of masking is that the length and nesting structure of the masked array are the same as those of the original array, so anything that broadcasts with one broadcasts with the other (so that unfiltered data can be used interchangeably with filtered data).

array + array.mask[booleans]
[2,
 4,
 None,
 None,
 10]
----------------
backend: cpu
nbytes: 64 B
type: 5 * ?int64

whereas adding the sliced array fails, because the two arrays no longer have the same length:

array + array[booleans]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[25], line 1
----> 1 array + array[booleans]

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_operators.py:54, in _binary_method.<locals>.func(self, other)
     51 if _disables_array_ufunc(other):
     52     return NotImplemented
---> 54 return ufunc(self, other)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1616, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1551 """
   1552 Intercepts attempts to pass this Array to a NumPy
   1553 [universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
   (...)
   1613 See also #__array_function__.
   1614 """
   1615 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
-> 1616 with ak._errors.OperationErrorContext(name, inputs, kwargs):
   1617     return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
     78     self._slate.__dict__.clear()
     79     # Handle caught exception
---> 80     raise self.decorate_exception(exception_type, exception_value)
     81 else:
     82     # Step out of the way so that another ErrorContext can become primary.
     83     if self.primary() is self:

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1617, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1615 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
   1616 with ak._errors.OperationErrorContext(name, inputs, kwargs):
-> 1617     return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:469, in array_ufunc(ufunc, method, inputs, kwargs)
    461         raise TypeError(
    462             "no {}.{} overloads for custom types: {}".format(
    463                 type(ufunc).__module__, ufunc.__name__, ", ".join(error_message)
    464             )
    465         )
    467     return None
--> 469 out = ak._broadcasting.broadcast_and_apply(
    470     inputs,
    471     action,
    472     depth_context=depth_context,
    473     lateral_context=lateral_context,
    474     allow_records=False,
    475     function_name=ufunc.__name__,
    476 )
    478 out_named_axis = functools.reduce(
    479     _unify_named_axis, lateral_context[NAMED_AXIS_KEY].named_axis
    480 )
    481 if len(out) == 1:

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1200, in broadcast_and_apply(inputs, action, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
   1198 backend = backend_of(*inputs, coerce_to_common=False)
   1199 isscalar = []
-> 1200 out = apply_step(
   1201     backend,
   1202     broadcast_pack(inputs, isscalar),
   1203     action,
   1204     0,
   1205     depth_context,
   1206     lateral_context,
   1207     {
   1208         "allow_records": allow_records,
   1209         "left_broadcast": left_broadcast,
   1210         "right_broadcast": right_broadcast,
   1211         "numpy_to_regular": numpy_to_regular,
   1212         "regular_to_jagged": regular_to_jagged,
   1213         "function_name": function_name,
   1214         "broadcast_parameters_rule": broadcast_parameters_rule,
   1215     },
   1216 )
   1217 assert isinstance(out, tuple)
   1218 return tuple(broadcast_unpack(x, isscalar) for x in out)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1178, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   1176     return result
   1177 elif result is None:
-> 1178     return continuation()
   1179 else:
   1180     raise AssertionError(result)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1147, in apply_step.<locals>.continuation()
   1145 # Any non-string list-types?
   1146 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1147     return broadcast_any_list()
   1149 # Any RecordArrays?
   1150 elif any(x.is_record for x in contents):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:659, in apply_step.<locals>.broadcast_any_list()
    657         nextparameters.append(x._parameters)
    658     else:
--> 659         raise ValueError(
    660             "cannot broadcast RegularArray of size "
    661             f"{x.size} with RegularArray of size {dim_size}{in_function(options)}"
    662         )
    663 else:
    664     nextinputs.append(x)

ValueError: cannot broadcast RegularArray of size 3 with RegularArray of size 5 in add

This error occurred while calling

    numpy.add.__call__(
        <Array [1, 2, 3, 4, 5] type='5 * int64'>
        <Array [1, 2, 5] type='3 * int64'>
    )

With ArrayBuilder

ak.ArrayBuilder is described in more detail in this tutorial, but you can add missing values to an array by calling the null method or by appending None.

(This is what ak.from_iter() uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.append(1)
builder.append(2)
builder.null()
builder.append(None)
builder.append(3)

array = builder.snapshot()
array
[1,
 2,
 None,
 None,
 3]
----------------
backend: cpu
nbytes: 64 B
type: 5 * ?int64

In Numba

Functions that Numba Just-In-Time (JIT) compiles can use ak.ArrayBuilder or construct a boolean array for ak.mask().

(An ak.ArrayBuilder can’t be constructed, or converted to an array with snapshot, inside a JIT-compiled function; both steps have to happen outside the compiled context.)

import numba as nb

@nb.jit
def example(builder):
    builder.append(1)
    builder.append(2)
    builder.null()
    builder.append(None)
    builder.append(3)
    return builder


builder = example(ak.ArrayBuilder())

array = builder.snapshot()
array
[1,
 2,
 None,
 None,
 3]
----------------
backend: cpu
nbytes: 64 B
type: 5 * ?int64

@nb.jit
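def faster_example():
    # Fill plain NumPy arrays; True in mask marks a valid (non-missing) value.
    data = np.empty(5, np.int64)
    mask = np.empty(5, np.bool_)
    data[0] = 1
    mask[0] = True
    data[1] = 2
    mask[1] = True
    # data[2] and data[3] are left uninitialized because they are masked out.
    mask[2] = False
    mask[3] = False
    data[4] = 5
    mask[4] = True
    return data, mask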


data, mask = faster_example()

array = ak.mask(data, mask)
array
[1,
 2,
 None,
 None,
 5]
----------------
backend: cpu
nbytes: 45 B
type: 5 * ?int64