How to reduce dimensions (sum/min/any/all)

After elementwise functions, dimension-reducer functions are the most commonly used. These functions replace a list of numbers with a single, scalar number by adding, multiplying, minimizing, maximizing, or performing logical-or (“any”) or logical-and (“all”).

These are also called aggregation functions; in relational databases, SQL, and data-frames, aggregations are applied after a “group by” operation. Awkward Array doesn’t have “group by” operations; lists are already grouped.

import awkward as ak
import numpy as np

First reducer: ak.sum

To illustrate all of these functions, let’s consider addition. Given an array:

array = ak.Array([[1, 2, 3], [4, 5], [], [6]])

ak.sum() with no axis argument (the default, axis=None) adds all of the values in the nested lists, just like np.sum.

ak.sum(array)
np.int64(21)

With Awkward Arrays, it’s usually more useful to supply an axis argument to reduce one dimension, rather than all dimensions.

For reasons that will be explained below, axis=-1 is the most frequently useful.

ak.sum(array, axis=-1)
[6,
 9,
 0,
 6]
---------------
type: 4 * int64

The axis argument

Before getting deeper into the axis argument, let’s consider a NumPy array with more dimensions.

array3d = np.array([
    [
        [    1,     2,     3,     4,     5],
        [   10,    20,    30,    40,    50],
        [  100,   200,   300,   400,   500],
    ],
    [
        [0.1  , 0.2  , 0.3  , 0.4  , 0.5  ],
        [0.01 , 0.02 , 0.03 , 0.04 , 0.05 ],
        [0.001, 0.002, 0.003, 0.004, 0.005],
    ],
])

with np.printoptions(suppress=True):
    print(array3d)
[[[  1.      2.      3.      4.      5.   ]
  [ 10.     20.     30.     40.     50.   ]
  [100.    200.    300.    400.    500.   ]]

 [[  0.1     0.2     0.3     0.4     0.5  ]
  [  0.01    0.02    0.03    0.04    0.05 ]
  [  0.001   0.002   0.003   0.004   0.005]]]

This array has 3 dimensions, so in addition to axis=None (reduce everything to a scalar), there are 3 possible axis values.

The first case, axis=0, adds the first 3×5 block to the second 3×5 block, i.e. summing over the first (length-2) dimension. Thus, the 1 is added to 0.1, the 2 is added to 0.2, and so on until the 500 is added to 0.005.

with np.printoptions(suppress=True):
    print(np.sum(array3d, axis=0))
[[  1.1     2.2     3.3     4.4     5.5  ]
 [ 10.01   20.02   30.03   40.04   50.05 ]
 [100.001 200.002 300.003 400.004 500.005]]

The second case, axis=1, adds vertically within each 3×5 block, i.e. summing over the second (length-3) dimension. What’s left are two lists of length 5.

with np.printoptions(suppress=True):
    print(np.sum(array3d, axis=1))
[[111.    222.    333.    444.    555.   ]
 [  0.111   0.222   0.333   0.444   0.555]]

The third case, axis=2, adds horizontally within each 3×5 block, i.e. summing over the third (length-5) dimension. What’s left are two lists of length 3.

with np.printoptions(suppress=True):
    print(np.sum(array3d, axis=2))
[[  15.     150.    1500.   ]
 [   1.5      0.15     0.015]]

Since a negative axis counts backward from the innermost dimension,

  • axis=0 is equivalent to axis=-3

  • axis=1 is equivalent to axis=-2

  • axis=2 is equivalent to axis=-1.
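The equivalence between positive and negative axis values can be checked directly. The following is a minimal sketch; small3d is a hypothetical stand-in with the same shape (2, 3, 5) as array3d, not an array from the text above.

```python
import numpy as np

# Hypothetical array with the same shape as array3d: 2 blocks of 3 rows by 5 columns.
small3d = np.arange(30).reshape(2, 3, 5)

shapes = {}
for positive, negative in [(0, -3), (1, -2), (2, -1)]:
    from_front = np.sum(small3d, axis=positive)
    from_back = np.sum(small3d, axis=negative)
    # Both spellings name the same dimension, so the results are identical.
    assert np.array_equal(from_front, from_back)
    # The reduced dimension disappears from the shape.
    shapes[positive] = from_front.shape

print(shapes)
```

Each reduction removes exactly one entry from the shape: the length-2, length-3, or length-5 dimension, respectively.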

The axis argument with ragged lists

Awkward Arrays allow the lengths of lists in an array to differ, so we can have

array_ragged = ak.Array([
    [  1,   2,   3     ],
    [ 10,  20          ],
    [100, 200, 300, 400],
])
array_ragged
[[1, 2, 3],
 [10, 20],
 [100, 200, 300, 400]]
----------------------
type: 3 * var * int64

As before, axis=-1 sums over the innermost lists, replacing each of the 3 horizontal rows with a sum.

ak.sum(array_ragged, axis=-1)
[6,
 30,
 1000]
---------------
type: 3 * int64

And axis=-2 (which is the same as axis=0 for this two-dimensional array) sums vertically, replacing each of the 4 vertical columns with a sum. Since the list lengths differ, some of the places where we might expect to see a value are empty gaps, which contribute nothing to the result.

ak.sum(array_ragged, axis=0)
[111,
 222,
 303,
 400]
---------------
type: 4 * int64

We also have to choose a convention: should the values be left-aligned or right-aligned within their lists? Awkward Array chooses left-aligned.

In ragged data from real datasets, summing over whole lists usually has more meaning than summing over parts of different lists, so axis=-1 is usually the most meaningful choice of axis.

The axis argument with missing data

Just as empty gaps contribute nothing to the sum, missing values (None) don’t contribute anything, either.

array_ragged = ak.Array([
    [None, None,    3,    4],
    [  10, None,   30      ],
    [ 100,  200,  300,  400],
])
array_ragged
[[None, None, 3, 4],
 [10, None, 30],
 [100, 200, 300, 400]]
----------------------
type: 3 * var * ?int64

axis=-1 sums over each inner list, horizontally, replacing it with a scalar.

ak.sum(array_ragged, axis=-1)
[7,
 40,
 1000]
---------------
type: 3 * int64

And axis=-2 sums over the outer dimension, vertically.

ak.sum(array_ragged, axis=-2)
[110,
 200,
 333,
 404]
---------------
type: 4 * int64

For ak.sum(), each None has the same effect as a 0 value; for ak.prod() (multiplication), each None has the same effect as a 1 value; and so on for the other reducers.

The keepdims argument

Sometimes, you want to replace each list with a length-1 list, rather than a scalar. keepdims=True does that.

ak.sum(array_ragged, axis=-1, keepdims=True)
[[7],
 [40],
 [1000]]
-------------------
type: 3 * 1 * int64

ak.sum(array_ragged, axis=-2, keepdims=True)
[[110, 200, 333, 404]]
----------------------
type: 1 * var * int64

The keepdims argument is particularly useful for ak.argmin() and ak.argmax(), which return positions in a list where the value is minimized or maximized. Those positions can only be used as slice indexes if they’re at the right nesting level, which keepdims=True maintains.

Other reducers

  • The ak.prod() reducer multiplies, rather than adding.

  • ak.min() and ak.max() minimize and maximize, returning None for empty lists.

  • ak.argmin() and ak.argmax() return the index positions of the minimum or maximum value, with None for empty lists.

  • ak.nansum(), ak.nanprod(), ak.nanmin(), ak.nanmax(), ak.nanargmin(), and ak.nanargmax() ignore floating-point nan values before operating, the way that all reducers ignore None values before operating.

  • ak.count_nonzero() counts non-zero values.

  • ak.count() simply counts values. In NumPy, there’s no need for such a function because it would return constants (drawn from the NumPy array’s shape), but for ragged arrays, it counts the number of values that enter into a reduction. ak.num() also returns lengths of lists, but in a way that’s more useful for slicing; ak.count() is useful as the denominator of expressions in which another reducer (with the same axis and keepdims choices) is in the numerator.

  • ak.any() and ak.all() reduce like logical-or and logical-and, which makes them particularly useful in slices (below).

Reducing over “any” and “all”

ak.any() and ak.all() reduce boolean arrays, asking if a predicate is satisfied by “any” item or “all” items, respectively.

array_bool = ak.Array([
    [False, False,  True,  True],
    [False,  True, False,  True],
    [False,  True,  True,  True],
])
array_bool
[[False, False, True, True],
 [False, True, False, True],
 [False, True, True, True]]
----------------------------
type: 3 * var * bool

ak.any(array_bool, axis=-1)
[True,
 True,
 True]
--------------
type: 3 * bool

ak.any(array_bool, axis=-2)
[False,
 True,
 True,
 True]
--------------
type: 4 * bool

ak.all(array_bool, axis=-1)
[False,
 False,
 False]
--------------
type: 3 * bool

ak.all(array_bool, axis=-2)
[False,
 False,
 False,
 True]
--------------
type: 4 * bool

Since logical-or is like addition of booleans and logical-and is like multiplication, these reducers could have been replaced with ak.sum() and ak.prod(), but they’re very useful to have because they make some boolean-array slices easier to read.

array = ak.Array([[0, 1, 2], [], [-3, 4], [-5], [-6, -7, -8, -9]])
array
[[0, 1, 2],
 [],
 [-3, 4],
 [-5],
 [-6, -7, -8, -9]]
---------------------
type: 5 * var * int64

Select whole lists if any of their values are negative:

array[ak.any(array < 0, axis=-1)]
[[-3, 4],
 [-5],
 [-6, -7, -8, -9]]
---------------------
type: 3 * var * int64

Select whole lists if all of their values are negative:

array[ak.all(array < 0, axis=-1)]
[[],
 [-5],
 [-6, -7, -8, -9]]
---------------------
type: 3 * var * int64

(The empty list is selected because “all” is vacuously true: a list with no elements has no element that violates the constraint.)

In both cases above, the selection can be read like an English sentence, “select lists if any…” or “select lists if all…”.

Heterogeneous data and records cannot be reduced

Neither of these data types is reducible. Heterogeneous data (a union type) lets a single array contain items with different numbers of dimensions, so the question of which dimension to reduce is ill-posed:

ak.sum(ak.Array([[1.1, 2.2, 3.3], [], 4.4, 5.5]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[25], line 1
----> 1 ak.sum(ak.Array([[1.1, 2.2, 3.3], [], 4.4, 5.5]))

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_dispatch.py:64, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
     62 # Failed to find a custom overload, so resume the original function
     63 try:
---> 64     next(gen_or_result)
     65 except StopIteration as err:
     66     return err.value

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_sum.py:216, in sum(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
    213 yield (array,)
    215 # Implementation
--> 216 return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_sum.py:300, in _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
    296 axis = regularize_axis(axis, none_allowed=True)
    298 reducer = ak._reducers.Sum()
--> 300 out = ak._do.reduce(
    301     layout,
    302     reducer,
    303     axis=axis,
    304     mask=mask_identity,
    305     keepdims=keepdims,
    306     behavior=ctx.behavior,
    307 )
    309 wrapped_out = ctx.wrap(
    310     out,
    311     highlevel=highlevel,
    312     allow_other=True,
    313 )
    315 # propagate named axis to output

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_do.py:244, in reduce(layout, reducer, axis, mask, keepdims, behavior)
    232 parts = remove_structure(
    233     layout,
    234     flatten_records=False,
   (...)
    238     list_to_regular=True,
    239 )
    241 if len(parts) > 1:
    242     # We know that `flatten_records` must fail, so the only other type
    243     # that can return multiple parts here is the union array
--> 244     raise ValueError(
    245         "cannot use axis=None on an array containing irreducible unions"
    246     )
    247 elif len(parts) == 0:
    248     layout = ak.contents.EmptyArray()

ValueError: cannot use axis=None on an array containing irreducible unions

This error occurred while calling

    ak.sum(
        <Array [[1.1, 2.2, 3.3], [], 4.4, 5.5] type='4 * union[var * float6...'>
    )

Records are not reducible, either. They are sometimes used to represent data with coordinates, and applying ak.sum() to non-Cartesian coordinates would be a subtle error.

ak.sum(ak.Array([{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}]), axis=-1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[26], line 1
----> 1 ak.sum(ak.Array([{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}]), axis=-1)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_dispatch.py:64, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
     62 # Failed to find a custom overload, so resume the original function
     63 try:
---> 64     next(gen_or_result)
     65 except StopIteration as err:
     66     return err.value

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_sum.py:216, in sum(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
    213 yield (array,)
    215 # Implementation
--> 216 return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/operations/ak_sum.py:300, in _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
    296 axis = regularize_axis(axis, none_allowed=True)
    298 reducer = ak._reducers.Sum()
--> 300 out = ak._do.reduce(
    301     layout,
    302     reducer,
    303     axis=axis,
    304     mask=mask_identity,
    305     keepdims=keepdims,
    306     behavior=ctx.behavior,
    307 )
    309 wrapped_out = ctx.wrap(
    310     out,
    311     highlevel=highlevel,
    312     allow_other=True,
    313 )
    315 # propagate named axis to output

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_do.py:296, in reduce(layout, reducer, axis, mask, keepdims, behavior)
    294 parents = ak.index.Index64.zeros(layout.length, layout.backend.index_nplike)
    295 shifts = None
--> 296 next = layout._reduce_next(
    297     reducer,
    298     negaxis,
    299     starts,
    300     shifts,
    301     parents,
    302     1,
    303     mask,
    304     keepdims,
    305     behavior,
    306 )
    308 return next[0]

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/contents/recordarray.py:907, in RecordArray._reduce_next(self, reducer, negaxis, starts, shifts, parents, outlength, mask, keepdims, behavior)
    905 reducer_recordclass = find_record_reducer(reducer, self, behavior)
    906 if reducer_recordclass is None:
--> 907     raise TypeError(
    908         "no ak.{} overloads for custom types: {}".format(
    909             reducer.name, ", ".join(self.fields)
    910         )
    911     )
    912 else:
    913     # Positional reducers ultimately need to do more work when rebuilding the result
    914     # so asking for a mask doesn't help us!
    915     reducer_should_mask = mask and not reducer.needs_position

TypeError: no ak.sum overloads for custom types: x, y

This error occurred while calling

    ak.sum(
        <Array [{x: 1.1, y: [1]}, {...}] type='2 * {x: float64, y: var * in...'>
        axis = -1
    )