Using Awkward Array with Numba

Why Numba?

The array-oriented (NumPy-like) interface that Awkward Array provides is often more convenient than imperative code, and it’s generally faster than pure Python loops. But sometimes it’s less convenient than imperative code, and it’s generally slower than custom code written in C, C++, Julia, Rust, or another compiled language.

  • The matching problem described in How to find the best match between two collections using Cartesian (cross) product is already rather complex. If a problem is more intricate than that, consider writing it as imperative code, so that neither you nor anyone reading your code gets lost in indices.

  • Although all iterations over arrays in Awkward Array are precompiled, most operations involve several passes over the data, which are not cache-friendly and might exceed your working memory budget.

For these reasons, Awkward Arrays were made to be usable in Numba, a JIT compiler for Python. More recently, support for JIT-compiled C++ and Julia has been added as well. Our intention is not to make you choose up front between array-oriented syntax and JIT-compiled code, but to let you mix them in whatever way is most convenient for each task.

Small example

import awkward as ak
import numpy as np
import numba as nb
array = ak.Array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}],
    [{"x": 6.6, "y": [1, 2, 3, 4, 5, 6]}],
])[np.tile([0, 1, 2, 3], 250000)]
array
[[{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 ...,
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}]]
----------------------------------------------------------------
backend: cpu
nbytes: 16.0 MB
type: 1000000 * var * {
    x: float64,
    y: var * int64
}

Suppose we want to compute the sum of all y values in each of the million entries above. We can do that with a simple Awkward expression:

ak.sum(ak.sum(array.y, axis=-1), axis=-1)
[10,
 0,
 25,
 21,
 10,
 0,
 25,
 21,
 10,
 0,
 ...,
 21,
 10,
 0,
 25,
 21,
 10,
 0,
 25,
 21]
---------------------
backend: cpu
nbytes: 8.0 MB
type: 1000000 * int64

Although this is faster than iterating in pure Python loops, it creates intermediate arrays that the final result doesn’t need: the inner ak.sum produces a full array of per-record sums before the outer ak.sum reduces it. Allocating those intermediates and iterating over them slows down the Awkward Array expression relative to compiled code.

%%timeit

ak.sum(ak.sum(array.y, axis=-1), axis=-1)
54.6 ms ± 150 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@nb.jit
def sum_of_y(array):
    # One output slot per top-level entry, filled in a single pass over the data.
    out = np.zeros(len(array), dtype=np.int64)

    for i, list_of_records in enumerate(array):
        for record in list_of_records:
            for y in record.y:
                out[i] += y

    return out

ak.Array(sum_of_y(array))
[10,
 0,
 25,
 21,
 10,
 0,
 25,
 21,
 10,
 0,
 ...,
 21,
 10,
 0,
 25,
 21,
 10,
 0,
 25,
 21]
---------------------
backend: cpu
nbytes: 8.0 MB
type: 1000000 * int64

The JIT-compiled function is faster: roughly 7 to 8 times in this measurement (7.12 ms versus 54.6 ms).

%%timeit

ak.Array(sum_of_y(array))
7.12 ms ± 41.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Combining features of Awkward Array and Numba

Even on a per-task level, Awkward Array’s array-oriented functions and Numba’s JIT-compilation don’t need to be exclusive. Numba can be used to prepare steps of an array-oriented process, such as generating boolean or integer-valued arrays to use as slices for an Awkward Array.

@nb.jit
def sum_of_y_is_more_than_10(array):
    # Boolean mask with one element per top-level entry.
    out = np.zeros(len(array), dtype=np.bool_)

    for i, list_of_records in enumerate(array):
        total = 0
        for record in list_of_records:
            for y in record.y:
                total += y
        if total > 10:
            out[i] = True

    return out

array[sum_of_y_is_more_than_10(array)]
[[{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 ...,
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}],
 [{x: 6.6, y: [1, 2, 3, 4, 5, 6]}]]
-----------------------------------------------------------
backend: cpu
nbytes: 8.0 MB
type: 500000 * var * {
    x: float64,
    y: var * int64
}

Relative strengths and weaknesses

Awkward Array’s array-oriented interface is

  • good for reading and writing data to and from columnar file formats like Parquet,

  • good for interactive exploration in Jupyter, applying a sequence of simple operations to a whole dataset and observing its effects after each operation,

  • good for speed and memory use, relative to pure Python,

  • bad for very intricate calculations with many indices,

  • bad for large intermediate arrays,

  • bad for speed and memory use, relative to custom-compiled code.

Numba’s JIT-compilation is

  • good for writing understandable algorithms with many moving parts,

  • good for speed and memory use, on par with other compiled languages,

  • bad for interactive exploration of data and iterative data analysis, since you have to write whole functions,

  • bad for working through type errors, which arise here as they would in any compiled language (unlike pure Python),

  • bad for unboxing and boxing large non-array data when entering and exiting a compiled function.

The next section lists what you can and can’t do with Awkward Arrays in Numba-compiled code.