---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.1
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

How to create arrays of missing data
====================================

Data at any level of an Awkward Array can be "missing," represented by `None` in Python.

This functionality is somewhat like NumPy's [masked arrays](https://numpy.org/doc/stable/reference/maskedarray.html), but masked arrays can only declare numerical values to be missing (not, for instance, a row of a 2-dimensional array) and they represent missing data with an `np.ma.masked` object instead of `None`.

Pandas also handles missing data, but in several different ways. For floating point columns, `NaN` (not a number) is used to mean "missing," and [as of version 1.0](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na), Pandas has a `pd.NA` object for missing data in other data types.

In Awkward Array, floating point `NaN` and a missing value are clearly distinct. Missing data, like all data in Awkward Arrays, are also not represented by any Python object; they are converted _to_ and _from_ `None` by {func}`ak.to_list` and {func}`ak.from_iter`.

```{code-cell} ipython3
import awkward as ak
import numpy as np
```

From Python None
----------------

The {class}`ak.Array` constructor and {func}`ak.from_iter` interpret `None` as a missing value, and {func}`ak.to_list` converts them back into `None`.

```{code-cell} ipython3
ak.Array([1, 2, 3, None, 4, 5])
```

The missing values can be deeply nested (missing integers):

```{code-cell} ipython3
ak.Array([[[[], [1, 2, None]]], [[[3]]], []])
```

They can be shallow (missing lists):

```{code-cell} ipython3
ak.Array([[[[], [1, 2]]], None, [[[3]]], []])
```

Or both:

```{code-cell} ipython3
ak.Array([[[[], [3]]], None, [[[None]]], []])
```

Records can also be missing:

```{code-cell} ipython3
ak.Array([{"x": 1, "y": 1}, None, {"x": 2, "y": 2}])
```

Potentially missing values are represented in the type string as "`?`" or "`option[...]`" (if the nested type is a list, which needs to be bracketed for clarity).

+++

From NumPy arrays
-----------------

Normal NumPy arrays can't represent missing data, but masked arrays can. Here is how one is constructed in NumPy:

```{code-cell} ipython3
numpy_array = np.ma.MaskedArray([1, 2, 3, 4, 5], [False, False, True, True, False])
numpy_array
```

It returns `np.ma.masked` objects if you try to access missing values:

```{code-cell} ipython3
numpy_array[0], numpy_array[1], numpy_array[2], numpy_array[3], numpy_array[4]
```

But it uses `None` for missing values in `tolist`:

```{code-cell} ipython3
numpy_array.tolist()
```

The {func}`ak.from_numpy` function converts masked arrays into Awkward Arrays with missing values, as does the {class}`ak.Array` constructor.

```{code-cell} ipython3
awkward_array = ak.Array(numpy_array)
awkward_array
```

The reverse, {func}`ak.to_numpy`, returns masked arrays if the Awkward Array has missing data.

```{code-cell} ipython3
ak.to_numpy(awkward_array)
```

But [np.asarray](https://numpy.org/doc/stable/reference/generated/numpy.asarray.html), the usual way of casting data as NumPy arrays, does not.
([np.asarray](https://numpy.org/doc/stable/reference/generated/numpy.asarray.html) is supposed to return a plain [np.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html), which [np.ma.masked_array](https://numpy.org/doc/stable/reference/generated/numpy.ma.masked_array.html) is not.)

```{code-cell} ipython3
:tags: [raises-exception]

np.asarray(awkward_array)
```

Missing rows vs missing numbers
-------------------------------

In Awkward Array, a missing list is a different thing from a list whose values are missing. However, {func}`ak.to_numpy` converts a missing list into a row of masked values for you.

```{code-cell} ipython3
missing_row = ak.Array([[1, 2, 3], None, [4, 5, 6]])
missing_row
```

```{code-cell} ipython3
ak.to_numpy(missing_row)
```

NaN is not missing
------------------

Floating point `NaN` values are simply unrelated to missing values, in both Awkward Array and NumPy.

```{code-cell} ipython3
missing_with_nan = ak.Array([1.1, 2.2, np.nan, None, 3.3])
missing_with_nan
```

```{code-cell} ipython3
ak.to_numpy(missing_with_nan)
```

Missing values as empty lists
-----------------------------

Sometimes, it's useful to think about a potentially missing value as a length-1 list if it is not missing and a length-0 list if it is. (Some languages define the [option type as a kind of list](https://www.scala-lang.org/api/2.13.3/scala/Option.html).)

The Awkward functions {func}`ak.singletons` and {func}`ak.firsts` convert between "`None` form" and "lists form."

```{code-cell} ipython3
none_form = ak.Array([1, 2, 3, None, None, 5])
none_form
```

```{code-cell} ipython3
lists_form = ak.singletons(none_form)
lists_form
```

```{code-cell} ipython3
ak.firsts(lists_form)
```

Masking instead of slicing
--------------------------

The most common way of filtering data is to slice it with an array of booleans (usually the result of a calculation).

```{code-cell} ipython3
array = ak.Array([1, 2, 3, 4, 5])
array
```

```{code-cell} ipython3
booleans = ak.Array([True, True, False, False, True])
booleans
```

```{code-cell} ipython3
array[booleans]
```

The data can also be effectively filtered by replacing values with `None`. The following syntax does that:

```{code-cell} ipython3
array.mask[booleans]
```

(Or use the {func}`ak.mask` function.)

+++

An advantage of masking is that the length and nesting structure of the masked array are the same as those of the original array, so anything that broadcasts with one broadcasts with the other (so that unfiltered data can be used interchangeably with filtered data).

```{code-cell} ipython3
array + array.mask[booleans]
```

whereas

```{code-cell} ipython3
:tags: [raises-exception]

array + array[booleans]
```

With ArrayBuilder
-----------------

{class}`ak.ArrayBuilder` is described in more detail [in this tutorial](how-to-create-arraybuilder), but you can add missing values to an array using the `null` method or by appending `None`.

(This is what {func}`ak.from_iter` uses internally to accumulate data.)

```{code-cell} ipython3
builder = ak.ArrayBuilder()

builder.append(1)
builder.append(2)
builder.null()
builder.append(None)
builder.append(3)

array = builder.snapshot()
array
```

In Numba
--------

Functions that Numba Just-In-Time (JIT) compiles can use {class}`ak.ArrayBuilder` or construct a boolean array for {func}`ak.mask`.

({class}`ak.ArrayBuilder` can't be constructed or converted to an array using `snapshot` inside a JIT-compiled function, but can be outside the compiled context.)

```{code-cell} ipython3
import numba as nb
```

```{code-cell} ipython3
@nb.jit
def example(builder):
    builder.append(1)
    builder.append(2)
    builder.null()        # a missing value via the null method
    builder.append(None)  # a missing value via appending None
    builder.append(3)
    return builder


# The builder is constructed and snapshotted outside the compiled function.
builder = example(ak.ArrayBuilder())

array = builder.snapshot()
array
```

```{code-cell} ipython3
@nb.jit
def faster_example():
    data = np.empty(5, np.int64)
    mask = np.empty(5, np.bool_)
    data[0] = 1
    mask[0] = True
    data[1] = 2
    mask[1] = True
    # Positions 2 and 3 are missing, so only the mask is set;
    # the uninitialized data values there are never exposed.
    mask[2] = False
    mask[3] = False
    data[4] = 5
    mask[4] = True
    return data, mask


data, mask = faster_example()

array = ak.mask(data, mask)
array
```