How to find the best match between two collections using Cartesian (cross) product#
In high energy physics (HEP), ak.combinations()
is often needed to find particles whose trajectories are close to each other, separately in many high-energy collision events (axis=1
). In some applications, the two collections that need to be matched are simulated particles and reconstructed versions of those particles (“gen-reco matching”), and in other applications, the two collections are different types of particles, such as electrons and jets.
I’ll describe how to solve such a problem on this page, but avoid domain-specific jargon by casting it as a problem of finding the distance between bunnies and foxes—if a bunny is too close to a fox, it will get eaten!
import awkward as ak
import numpy as np
Setting up the problem#
In 1000 separate yards (big suburb), there’s a random number of bunnies and a random number of foxes, each with random x, y positions. We’re making ragged arrays of records using ak.unflatten()
and ak.zip()
.
np.random.seed(12345)
number_of_bunnies = np.random.poisson(3.5, 1000) # average of 3.5 bunnies/yard
number_of_foxes = np.random.poisson(1.5, 1000) # average of 1.5 foxes/yard
bunny_xy = np.random.normal(0, 1, (number_of_bunnies.sum(), 2))
fox_xy = np.random.normal(0, 1, (number_of_foxes.sum(), 2))
bunnies = ak.unflatten(ak.zip({"x": bunny_xy[:, 0], "y": bunny_xy[:, 1]}), number_of_bunnies)
foxes = ak.unflatten(ak.zip({"x": fox_xy[:, 0], "y": fox_xy[:, 1]}), number_of_foxes)
bunnies
[[{x: -1, y: -0.727}, {x: 0.735, y: -0.594}, {x: -0.588, y: 0.718}], [{x: -0.449, y: -1.61}, {x: 1.59, y: -0.627}, ..., {...}, {x: 1.49, y: -1.42}], [{x: 0.441, y: -0.705}, {x: -1.96, y: 0.31}], [{x: -0.121, y: -0.232}, {x: -0.44, y: -2.03}, ..., {x: 0.0405, y: 1.24}], [{x: -0.703, y: -0.049}, {x: -0.209, y: ..., ...}, ..., {x: 0.525, y: 0.265}], [{x: -0.958, y: 0.119}, {x: 0.885, y: -0.266}], [{x: -1.28, y: 0.235}, {x: 0.625, y: ..., ...}, {x: -0.748, y: -0.336}], [{x: 0.37, y: 0.582}, {x: -0.594, y: -2.68}, ..., {...}, {x: 0.412, y: 0.32}], [{x: 0.57, y: 0.342}, {x: -0.416, y: -0.973}], [{x: -0.0416, y: 0.802}, {x: -0.144, y: ..., ...}, ..., {x: -0.635, y: -1.71}], ..., [{x: 1.7, y: -0.564}], [{x: 1.11, y: -0.309}, {x: 0.155, y: -0.458}, ..., {x: 0.352, y: -1.18}], [{x: -0.534, y: 0.555}, {x: -1.07, y: -1.37}, {x: -0.771, y: 0.699}], [{x: -0.586, y: 1.86}], [{x: 0.0432, y: -1.22}, {x: -1.15, y: 0.45}], [{x: -0.868, y: 1.79}, {x: 0.999, y: -1.16}, ..., {x: -0.877, y: 0.203}], [{x: 0.73, y: 0.335}, {x: -0.372, y: -0.659}, {x: 0.946, y: -1.1}], [{x: -0.622, y: -1.26}], [{x: 1.47, y: 1.5}, {x: 1.94, y: -2.37}]] -------------------------------------------------------------------------------- backend: cpu nbytes: 64.4 kB type: 1000 * var * { x: float64, y: float64 }
foxes
[[{x: 0.183, y: -1.17}], [{x: 0.497, y: 0.761}], [{x: -1.06, y: 0.0902}], [{x: -0.106, y: -1.28}], [{x: 0.338, y: 0.642}], [{x: -0.833, y: -1.11}, {x: -0.163, y: ..., ...}, {x: -0.548, y: -0.546}], [{x: 0.89, y: 0.67}, {x: -1.24, y: 0.543}, {...}, {x: 0.211, y: -1.03}], [{x: 0.411, y: 1.23}, {x: -0.51, y: 0.931}, {...}, {x: 0.953, y: 1.18}], [{x: -3.47, y: -0.273}], [{x: -0.553, y: 0.339}], ..., [{x: 0.856, y: 1.06}, {x: 1.12, y: -0.0269}], [], [], [], [{x: -1.51, y: 0.912}], [{x: -0.665, y: -0.175}], [{x: -0.401, y: -1.2}, {x: -1.33, y: -1.06}], [], [{x: -0.0623, y: -0.596}, {x: 1.35, y: ..., ...}, ..., {x: 0.052, y: -0.242}]] ------------------------------------------------------------------------------- backend: cpu nbytes: 32.5 kB type: 1000 * var * { x: float64, y: float64 }
Find all combinations#
In each yard, we find all bunny-fox pairs, regardless of whether they’re close or not using ak.cartesian()
, and then unpacking the pairs with ak.unzip()
.
pair_bunnies, pair_foxes = ak.unzip(ak.cartesian([bunnies, foxes]))
These two arrays, pair_bunnies
and pair_foxes
, have the same type as bunnies
and foxes
, but different numbers of items in each list because now they’re paired to match each other. Both kinds of animals are duplicated to enable this match.
pair_bunnies
[[{x: -1, y: -0.727}, {x: 0.735, y: -0.594}, {x: -0.588, y: 0.718}], [{x: -0.449, y: -1.61}, {x: 1.59, y: -0.627}, ..., {...}, {x: 1.49, y: -1.42}], [{x: 0.441, y: -0.705}, {x: -1.96, y: 0.31}], [{x: -0.121, y: -0.232}, {x: -0.44, y: -2.03}, ..., {x: 0.0405, y: 1.24}], [{x: -0.703, y: -0.049}, {x: -0.209, y: ..., ...}, ..., {x: 0.525, y: 0.265}], [{x: -0.958, y: 0.119}, {x: -0.958, y: ..., ...}, ..., {x: 0.885, y: -0.266}], [{x: -1.28, y: 0.235}, {x: -1.28, y: 0.235}, ..., {x: -0.748, y: -0.336}], [{x: 0.37, y: 0.582}, {x: 0.37, y: 0.582}, ..., {...}, {x: 0.412, y: 0.32}], [{x: 0.57, y: 0.342}, {x: -0.416, y: -0.973}], [{x: -0.0416, y: 0.802}, {x: -0.144, y: ..., ...}, ..., {x: -0.635, y: -1.71}], ..., [{x: 1.7, y: -0.564}, {x: 1.7, y: -0.564}], [], [], [], [{x: 0.0432, y: -1.22}, {x: -1.15, y: 0.45}], [{x: -0.868, y: 1.79}, {x: 0.999, y: -1.16}, ..., {x: -0.877, y: 0.203}], [{x: 0.73, y: 0.335}, {x: 0.73, y: 0.335}, ..., {...}, {x: 0.946, y: -1.1}], [], [{x: 1.47, y: 1.5}, {x: 1.47, y: 1.5}, {...}, ..., {...}, {x: 1.94, y: -2.37}]] -------------------------------------------------------------------------------- backend: cpu nbytes: 108.4 kB type: 1000 * var * { x: float64, y: float64 }
pair_foxes
[[{x: 0.183, y: -1.17}, {x: 0.183, y: -1.17}, {x: 0.183, y: -1.17}], [{x: 0.497, y: 0.761}, {x: 0.497, y: 0.761}, ..., {...}, {x: 0.497, y: 0.761}], [{x: -1.06, y: 0.0902}, {x: -1.06, y: 0.0902}], [{x: -0.106, y: -1.28}, {x: -0.106, y: ..., ...}, ..., {x: -0.106, y: -1.28}], [{x: 0.338, y: 0.642}, {x: 0.338, y: 0.642}, ..., {...}, {x: 0.338, y: 0.642}], [{x: -0.833, y: -1.11}, {x: -0.163, y: ..., ...}, ..., {x: -0.548, y: -0.546}], [{x: 0.89, y: 0.67}, {x: -1.24, y: 0.543}, ..., {...}, {x: 0.211, y: -1.03}], [{x: 0.411, y: 1.23}, {x: -0.51, y: 0.931}, ..., {...}, {x: 0.953, y: 1.18}], [{x: -3.47, y: -0.273}, {x: -3.47, y: -0.273}], [{x: -0.553, y: 0.339}, {x: -0.553, y: ..., ...}, ..., {x: -0.553, y: 0.339}], ..., [{x: 0.856, y: 1.06}, {x: 1.12, y: -0.0269}], [], [], [], [{x: -1.51, y: 0.912}, {x: -1.51, y: 0.912}], [{x: -0.665, y: -0.175}, {x: -0.665, ...}, {...}, {x: -0.665, y: -0.175}], [{x: -0.401, y: -1.2}, {x: -1.33, y: -1.06}, ..., {...}, {x: -1.33, y: -1.06}], [], [{x: -0.0623, y: -0.596}, {x: 1.35, y: ..., ...}, ..., {x: 0.052, y: -0.242}]] -------------------------------------------------------------------------------- backend: cpu nbytes: 76.5 kB type: 1000 * var * { x: float64, y: float64 }
The two arrays have the same list lengths as each other because they came from the same ak.unzip()
.
ak.num(pair_bunnies), ak.num(pair_foxes)
(<Array [3, 8, 2, 8, 5, 6, 12, 20, ..., 0, 0, 2, 4, 6, 0, 8] type='1000 * int64'>,
<Array [3, 8, 2, 8, 5, 6, 12, 20, ..., 0, 0, 2, 4, 6, 0, 8] type='1000 * int64'>)
Calculating distances#
Since the arrays have the same shapes, they can be used in the same mathematical formula. Here’s the formula for distance:
distances = np.sqrt((pair_bunnies.x - pair_foxes.x)**2 + (pair_bunnies.y - pair_foxes.y)**2)
distances
[[1.27, 0.799, 2.04], [2.55, 1.77, 3.37, 1.57, 0.352, 2, 0.758, 2.39], [1.7, 0.924], [1.05, 0.819, 1.38, 3.03, 1.44, 2.22, 2.89, 2.53], [1.25, 2.61, 0.953, 0.685, 0.421], [1.24, 0.88, 0.781, 1.92, 1.3, 1.46], [2.21, 0.31, 1.36, 1.95, 0.656, 1.93, ..., 1.17, 1.92, 1.01, 0.788, 1.18], [0.645, 0.947, 1.42, 0.834, 4.03, 3.61, ..., 3.12, 0.906, 1.11, 1.48, 1.02], [4.08, 3.13], [0.69, 2.84, 0.312, 2.05], ..., [1.83, 0.789], [], [], [], [2.64, 0.585], [1.97, 1.94, 2.51, 0.434], [1.91, 2.49, 0.545, 1.04, 1.35, 2.28], [], [2.6, 0.607, 2.03, 2.25, 2.67, 3.33, 2.26, 2.84]] ----------------------------------------------------------------------------- backend: cpu nbytes: 52.0 kB type: 1000 * var * float64
Let’s say that 1 unit is close enough for a bunny to be eaten.
eaten = (distances < 1)
eaten
[[False, True, False], [False, False, False, False, True, False, True, False], [False, True], [False, True, False, False, False, False, False, False], [False, False, True, True, True], [False, True, True, False, False, False], [False, True, False, False, True, ..., False, False, False, True, False], [True, True, False, True, False, False, ..., False, True, False, False, False], [False, False], [True, False, True, False], ..., [False, True], [], [], [], [False, True], [False, False, False, True], [False, False, True, False, False, False], [], [False, True, False, False, False, False, False, False]] -------------------------------------------------------------------------------- backend: cpu nbytes: 13.5 kB type: 1000 * var * bool
This is great (not for the bunnies, but perhaps for the foxes). However, if we want to use this information on the original arrays, we’re stuck: this array has a different shape from the original bunnies
(and the original foxes
).
Perhaps the question we really wanted to ask is, “For each bunny, is there any fox that can eat it?”
Combinations with nested=True
#
Asking a question about any fox means performing a reducer, ak.any()
, over lists, one list per bunny. The list would be all of the foxes in its yard. For that, we’ll need to pass nested=True
to ak.cartesian()
.
pair_bunnies, pair_foxes = ak.unzip(ak.cartesian([bunnies, foxes], nested=True))
Now pair_bunnies
and pair_foxes
are one list-depth deeper than the original bunnies
and foxes
.
pair_bunnies
[[[{x: -1, y: -0.727}], [{x: 0.735, ...}], [{x: -0.588, y: 0.718}]], [[{x: -0.449, y: -1.61}], [{x: 1.59, ...}], ..., [{x: 1.49, y: -1.42}]], [[{x: 0.441, y: -0.705}], [{x: -1.96, y: 0.31}]], [[{x: -0.121, y: -0.232}], [{x: -0.44, ...}], ..., [{x: 0.0405, y: 1.24}]], [[{x: -0.703, y: -0.049}], [{x: -0.209, ...}], ..., [{x: 0.525, y: 0.265}]], [[{x: -0.958, y: 0.119}, {x: -0.958, ...}, {x: -0.958, y: 0.119}], [...]], [[{x: -1.28, y: 0.235}, {x: -1.28, ...}, ..., {x: -1.28, y: 0.235}], ...], [[{x: 0.37, y: 0.582}, {x: 0.37, y: ..., ...}, ..., {x: 0.37, y: 0.582}], ...], [[{x: 0.57, y: 0.342}], [{x: -0.416, y: -0.973}]], [[{x: -0.0416, y: 0.802}], [{x: -0.144, ...}], ..., [{x: -0.635, y: -1.71}]], ..., [[{x: 1.7, y: -0.564}, {x: 1.7, y: -0.564}]], [[], [], [], []], [[], [], []], [[]], [[{x: 0.0432, y: -1.22}], [{x: -1.15, y: 0.45}]], [[{x: -0.868, y: 1.79}], [{x: 0.999, ...}], ..., [{x: -0.877, y: 0.203}]], [[{x: 0.73, y: 0.335}, {x: 0.73, y: 0.335}], ..., [{x: 0.946, ...}, ...]], [[]], [[{x: 1.47, y: 1.5}, {x: 1.47, y: 1.5}, {...}, {x: 1.47, y: 1.5}], [...]]] -------------------------------------------------------------------------------- backend: cpu nbytes: 136.7 kB type: 1000 * var * var * { x: float64, y: float64 }
pair_foxes
[[[{x: 0.183, y: -1.17}], [{x: 0.183, ...}], [{x: 0.183, y: -1.17}]], [[{x: 0.497, y: 0.761}], [{x: 0.497, ...}], ..., [{x: 0.497, y: 0.761}]], [[{x: -1.06, y: 0.0902}], [{x: -1.06, y: 0.0902}]], [[{x: -0.106, y: -1.28}], [{x: -0.106, ...}], ..., [{x: -0.106, y: -1.28}]], [[{x: 0.338, y: 0.642}], [{x: 0.338, ...}], ..., [{x: 0.338, y: 0.642}]], [[{x: -0.833, y: -1.11}, {x: -0.163, ...}, {x: -0.548, y: -0.546}], ...], [[{x: 0.89, y: 0.67}, {x: -1.24, ...}, {...}, {x: 0.211, y: -1.03}], ...], [[{x: 0.411, y: 1.23}, {x: -0.51, ...}, {...}, {x: 0.953, y: 1.18}], ...], [[{x: -3.47, y: -0.273}], [{x: -3.47, y: -0.273}]], [[{x: -0.553, y: 0.339}], [{x: -0.553, ...}], ..., [{x: -0.553, y: 0.339}]], ..., [[{x: 0.856, y: 1.06}, {x: 1.12, y: -0.0269}]], [[], [], [], []], [[], [], []], [[]], [[{x: -1.51, y: 0.912}], [{x: -1.51, y: 0.912}]], [[{x: -0.665, y: -0.175}], [{x: -0.665, ...}], ..., [{x: -0.665, y: -0.175}]], [[{x: -0.401, y: -1.2}, {x: -1.33, y: -1.06}], ..., [{x: -0.401, ...}, ...]], [[]], [[{x: -0.0623, y: -0.596}, {x: 1.35, ...}, ..., {x: 0.052, y: -0.242}], ...]] ------------------------------------------------------------------------------- backend: cpu nbytes: 104.7 kB type: 1000 * var * var * { x: float64, y: float64 }
We can compute distances
in the same way, though it’s also one list-depth deeper.
distances = np.sqrt((pair_bunnies.x - pair_foxes.x)**2 + (pair_bunnies.y - pair_foxes.y)**2)
distances
[[[1.27], [0.799], [2.04]], [[2.55], [1.77], [3.37], [1.57], [0.352], [2], [0.758], [2.39]], [[1.7], [0.924]], [[1.05], [0.819], [1.38], [3.03], [1.44], [2.22], [2.89], [2.53]], [[1.25], [2.61], [0.953], [0.685], [0.421]], [[1.24, 0.88, 0.781], [1.92, 1.3, 1.46]], [[2.21, 0.31, 1.36, 1.95], [0.656, ..., 1.17], [1.92, 1.01, 0.788, 1.18]], [[0.645, 0.947, 1.42, 0.834], [4.03, ...], ..., [0.906, 1.11, 1.48, 1.02]], [[4.08], [3.13]], [[0.69], [2.84], [0.312], [2.05]], ..., [[1.83, 0.789]], [[], [], [], []], [[], [], []], [[]], [[2.64], [0.585]], [[1.97], [1.94], [2.51], [0.434]], [[1.91, 2.49], [0.545, 1.04], [1.35, 2.28]], [[]], [[2.6, 0.607, 2.03, 2.25], [2.67, 3.33, 2.26, 2.84]]] ---------------------------------------------------------------------------- backend: cpu nbytes: 80.2 kB type: 1000 * var * var * float64
Similarly for eaten
.
eaten = (distances < 1)
eaten
[[[False], [True], [False]], [[False], [False], [False], [False], [True], [False], [True], [False]], [[False], [True]], [[False], [True], [False], [False], [False], [False], [False], [False]], [[False], [False], [True], [True], [True]], [[False, True, True], [False, False, False]], [[False, True, False, False], [True, ...], [False, False, True, False]], [[True, True, False, True], [False, ...], ..., [True, False, False, False]], [[False], [False]], [[True], [False], [True], [False]], ..., [[False, True]], [[], [], [], []], [[], [], []], [[]], [[False], [True]], [[False], [False], [False], [True]], [[False, False], [True, False], [False, False]], [[]], [[False, True, False, False], [False, False, False, False]]] ----------------------------------------------------------------------------- backend: cpu nbytes: 41.7 kB type: 1000 * var * var * bool
Now each inner list of booleans is answering the questions, “Can fox 0 eat me?”, “Can fox 1 eat me?”, …, “Can fox n eat me?” and there are exactly as many of these lists as there are bunnies. Applying ak.any()
over the innermost lists (axis=-1
),
bunny_eaten = ak.any(eaten, axis=-1)
bunny_eaten
[[False, True, False], [False, False, False, False, True, False, True, False], [False, True], [False, True, False, False, False, False, False, False], [False, False, True, True, True], [True, False], [True, True, True], [True, False, False, False, True], [False, False], [True, False, True, False], ..., [True], [False, False, False, False], [False, False, False], [False], [False, True], [False, False, False, True], [False, True, False], [False], [True, False]] --------------------------------------------------------- backend: cpu nbytes: 11.5 kB type: 1000 * var * bool
We’ve now answered the question, “Can any fox eat me?” for each bunny. After the mayhem, these are the bunnies we have left:
bunnies[~bunny_eaten]
[[{x: -1, y: -0.727}, {x: -0.588, y: 0.718}], [{x: -0.449, y: -1.61}, {x: 1.59, y: -0.627}, ..., {...}, {x: 1.49, y: -1.42}], [{x: 0.441, y: -0.705}], [{x: -0.121, y: -0.232}, {x: 0.23, y: 0.0559}, ..., {x: 0.0405, y: 1.24}], [{x: -0.703, y: -0.049}, {x: -0.209, y: -1.91}], [{x: 0.885, y: -0.266}], [], [{x: -0.594, y: -2.68}, {x: 0.235, y: -1.13}, {x: 0.878, y: -1.94}], [{x: 0.57, y: 0.342}, {x: -0.416, y: -0.973}], [{x: -0.144, y: -2.47}, {x: -0.635, y: -1.71}], ..., [], [{x: 1.11, y: -0.309}, {x: 0.155, y: -0.458}, ..., {x: 0.352, y: -1.18}], [{x: -0.534, y: 0.555}, {x: -1.07, y: -1.37}, {x: -0.771, y: 0.699}], [{x: -0.586, y: 1.86}], [{x: 0.0432, y: -1.22}], [{x: -0.868, y: 1.79}, {x: 0.999, y: -1.16}, {x: -0.389, y: 2.32}], [{x: 0.73, y: 0.335}, {x: 0.946, y: -1.1}], [{x: -0.622, y: -1.26}], [{x: 1.94, y: -2.37}]] -------------------------------------------------------------------------------- backend: cpu nbytes: 84.9 kB type: 1000 * var * { x: float64, y: float64 }
Whereas there was originally an average of 3.5 bunnies per yard, by construction,
ak.mean(ak.num(bunnies, axis=1))
np.float64(3.527)
Now there’s only
ak.mean(ak.num(bunnies[~bunny_eaten], axis=1))
np.float64(2.557)
left.
Asymmetry in the problem#
The way we performed this calculation was asymmetric: for each bunny, we asked if it was eaten. We could have performed a similar, but different, calculation to ask, which foxes get to eat? To do that, we must reverse the order of arguments because nested=True
groups from the left.
pair_foxes, pair_bunnies = ak.unzip(ak.cartesian([foxes, bunnies], nested=True))
distances = np.sqrt((pair_foxes.x - pair_bunnies.x)**2 + (pair_foxes.y - pair_bunnies.y)**2)
eating = (distances < 1)
fox_eats = ak.any(eating, axis=-1)
foxes[fox_eats]
[[{x: 0.183, y: -1.17}], [{x: 0.497, y: 0.761}], [{x: -1.06, y: 0.0902}], [{x: -0.106, y: -1.28}], [{x: 0.338, y: 0.642}], [{x: -0.163, y: 0.496}, {x: -0.548, y: -0.546}], [{x: 0.89, y: 0.67}, {x: -1.24, y: 0.543}, {x: 0.0186, y: -0.154}], [{x: 0.411, y: 1.23}, {x: -0.51, y: 0.931}, {x: 0.953, y: 1.18}], [], [{x: -0.553, y: 0.339}], ..., [{x: 1.12, y: -0.0269}], [], [], [], [{x: -1.51, y: 0.912}], [{x: -0.665, y: -0.175}], [{x: -0.401, y: -1.2}], [], [{x: 1.35, y: 0.905}]] -------------------------------------------------------------------- backend: cpu nbytes: 38.7 kB type: 1000 * var * { x: float64, y: float64 }