How to create arrays by “unflattening” or “grouping”
import awkward as ak
import pandas as pd
import numpy as np
from urllib.request import urlopen
Finding runs in an array
We often have an array of data that we wish to subdivide into groups that share a common label. Let’s imagine that we’re looking at NASA’s Earth Meteorite Landings dataset, and that we wish to find the largest meteorite in each classification. This is known as a groupby operation, followed by a reduction.
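As a point of reference, the same idea can be sketched in plain Python: collect the values under each label (the groupby), then collapse each collection to a single value (the reduction). The classifications and masses below are made-up illustrative values, not entries from the dataset.

```python
from collections import defaultdict

# Hypothetical (classification, mass) pairs, for illustration only.
samples = [("H5", 120.0), ("L6", 40.0), ("H5", 300.0), ("L6", 75.5), ("Aubrite", 9.0)]

# Groupby: collect the masses that share a classification.
groups = defaultdict(list)
for recclass, mass in samples:
    groups[recclass].append(mass)

# Reduction: collapse each group to a single value, here the maximum.
largest = {recclass: max(masses) for recclass, masses in groups.items()}
print(largest)
```

The rest of this guide builds the same operation out of array-at-a-time functions instead of a Python loop.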
First, we should load the data:
with open("../data/y77d-th95.json", "rb") as f:
    landing = ak.from_json(f)
landing.fields
['name',
'id',
'nametype',
'recclass',
'mass',
'fall',
'year',
'reclat',
'reclong',
'geolocation',
':@computed_region_cbhk_fwbd',
':@computed_region_nnqa_25f4']
In order to find the largest meteorite in each category, we must first group the entries by category. This is the groupby step, in which we arrange the entire array into subgroups that share a particular label. To perform a groupby in Awkward Array, we must first sort the array by the category:
landing_sorted_class = landing[ak.argsort(landing.recclass)]
landing_sorted_class
[{name: 'Acapulco', id: '10', nametype: 'Valid', recclass: 'Acapulcoite', ...},
 {name: 'Silistra', id: '55584', nametype: 'Valid', recclass: ..., ...},
 {name: 'Angra dos Reis (stone)', id: '2302', nametype: 'Valid', ...},
 {name: 'Aubres', id: '4893', nametype: 'Valid', recclass: 'Aubrite', ...},
 {name: 'Bishopville', id: '5059', nametype: 'Valid', recclass: 'Aubrite', ...},
 {name: 'Bustee', id: '5181', nametype: 'Valid', recclass: 'Aubrite', ...},
 {name: 'Cumberland Falls', id: '5496', nametype: 'Valid', recclass: ..., ...},
 {name: 'Khor Temiki', id: '12299', nametype: 'Valid', recclass: ..., ...},
 {name: 'Mayo Belwa', id: '15451', nametype: 'Valid', recclass: 'Aubrite', ...},
 {name: 'Norton County', id: '17922', nametype: 'Valid', recclass: ..., ...},
 ...,
 {name: 'Aire-sur-la-Lys', id: '425', nametype: 'Valid', recclass: ..., ...},
 {name: 'Lusaka', id: '14759', nametype: 'Valid', recclass: 'Unknown', ...},
 {name: 'Dyalpur', id: '7757', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Haverö', id: '11859', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Jalanash', id: '12068', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Lahrauli', id: '12433', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Novo-Urei', id: '17933', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Almahata Sitta', id: '48915', nametype: 'Valid', recclass: ..., ...},
 {name: 'Pontlyfni', id: '18865', nametype: 'Valid', recclass: ..., ...}]
backend: cpu
nbytes: 246.5 kB
type: 1000 * { name: string, id: string, nametype: string, recclass: string, mass: ?string, fall: string, year: ?string, reclat: ?string, reclong: ?string, geolocation: ?{ type: string, coordinates: var * float64 }, ":@computed_region_cbhk_fwbd": ?string, ":@computed_region_nnqa_25f4": ?string }
This sorted array can be subdivided into sublists of the same category. To determine how long each of these sublists must be, Awkward Array provides the function ak.run_lengths() which, as the name implies, finds the lengths of consecutive runs of equal values in an array, e.g.
ak.run_lengths([1, 1, 1, 3, 3, 2, 4, 4, 4])
[3, 2, 1, 3]
backend: cpu
nbytes: 32 B
type: 4 * int64
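Note that a run ends as soon as the value changes, so equal values that are not adjacent count as separate runs; this is why the array is sorted before the runs are measured. The plain-Python sketch below mimics what ak.run_lengths() computes for a flat list. It is a stand-in built on itertools.groupby, not how Awkward Array implements the function, and `run_lengths_sketch` is a hypothetical helper name.

```python
from itertools import groupby


def run_lengths_sketch(values):
    """Return the length of each run of consecutive equal values."""
    return [sum(1 for _ in run) for _, run in groupby(values)]


# Same input as the example above.
print(run_lengths_sketch([1, 1, 1, 3, 3, 2, 4, 4, 4]))

unsorted_labels = ["H5", "L6", "H5", "H5"]
# Unsorted: the first "H5" and the last two "H5" are separate runs.
print(run_lengths_sketch(unsorted_labels))
# Sorted: each label forms exactly one run.
print(run_lengths_sketch(sorted(unsorted_labels)))
```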
The function does not accept an axis argument; Awkward Array only supports finding runs in the innermost (axis=-1) axis of the array. Let’s find the length of each category sublist using ak.run_lengths():
lengths = ak.run_lengths(landing_sorted_class.recclass)
lengths
[1, 1, 1, 9, 1, 3, 1, 1, 4, 2, ..., 21, 2, 2, 1, 34, 2, 5, 1, 1]
backend: cpu
nbytes: 944 B
type: 118 * int64
Dividing an array into sublists
Awkward Array provides an ak.unflatten() operation that adds a new dimension to an array, using either a single integer denoting the (regular) size of the dimension, or a list of integers giving the lengths of the sublists to create, e.g.
ak.unflatten(
    ["Do", "re", "mi", "fa", "so", "la"],
    [1, 2, 2, 1]
)
[['Do'], ['re', 'mi'], ['fa', 'so'], ['la']]
backend: cpu
nbytes: 108 B
type: 4 * var * string
If we pass an integer instead of a list of lengths, we get a regular array:
ak.unflatten(
    ["Do", "re", "mi", "fa", "so", "la"],
    2
)
[['Do', 're'], ['mi', 'fa'], ['so', 'la']]
backend: cpu
nbytes: 68 B
type: 3 * 2 * string
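To make the role of the lengths concrete, here is a plain-Python sketch of splitting a flat list into consecutive sublists. `unflatten_sketch` is a hypothetical helper for illustration only, not part of Awkward Array, and it assumes that the lengths must account for every element of the flat list.

```python
def unflatten_sketch(flat, counts):
    """Split `flat` into consecutive sublists whose lengths are given by `counts`."""
    if sum(counts) != len(flat):
        raise ValueError("counts must add up to the length of the flat list")
    result, start = [], 0
    for count in counts:
        result.append(flat[start:start + count])
        start += count
    return result


notes = ["Do", "re", "mi", "fa", "so", "la"]
print(unflatten_sketch(notes, [1, 2, 2, 1]))
# A single integer n behaves like repeating n for every sublist.
print(unflatten_sketch(notes, [2] * (len(notes) // 2)))
```

Because run lengths computed from an array always add up to the length of that array, they are a natural source of such counts.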
We can unflatten our sorted array using the run lengths of each classification, in order to finalise our groupby operation:
landing_by_class = ak.unflatten(
    landing_sorted_class,
    lengths
)
landing_by_class
[[{name: 'Acapulco', id: '10', nametype: 'Valid', recclass: ..., ...}],
 [{name: 'Silistra', id: '55584', nametype: 'Valid', recclass: ..., ...}],
 [{name: 'Angra dos Reis (stone)', id: '2302', nametype: 'Valid', ...}],
 [{name: 'Aubres', id: '4893', nametype: 'Valid', recclass: ..., ...}, ...],
 [{name: "Sutter's Mill", id: '55529', nametype: 'Valid', recclass: 'C', ...}],
 [{name: 'Bells', id: '5005', nametype: 'Valid', recclass: 'C2-ung', ...}, ...],
 [{name: 'Ningqiang', id: '16981', nametype: 'Valid', recclass: 'C3-ung', ...}],
 [{name: 'Gujba', id: '11449', nametype: 'Valid', recclass: 'CBa', ...}],
 [{name: 'Alais', id: '448', nametype: 'Valid', recclass: 'CI1', ...}, ...],
 [{name: 'Karoonda', id: '12264', nametype: 'Valid', recclass: ..., ...}, ...],
 ...,
 [{name: 'Barcelona (stone)', id: '4944', nametype: 'Valid', ...}, ..., {...}],
 [{name: 'Cumulus Hills 04075', id: '32531', nametype: 'Valid', ...}, ...],
 [{name: 'Marjalahti', id: '15426', nametype: 'Valid', ...}, {...}],
 [{name: 'Rumuruti', id: '22782', nametype: 'Valid', recclass: 'R3.8-6', ...}],
 [{name: 'Andhara', id: '2294', nametype: 'Valid', recclass: ..., ...}, ...],
 [{name: 'Aire-sur-la-Lys', id: '425', nametype: 'Valid', ...}, {...}],
 [{name: 'Dyalpur', id: '7757', nametype: 'Valid', recclass: ..., ...}, ...],
 [{name: 'Almahata Sitta', id: '48915', nametype: 'Valid', recclass: ..., ...}],
 [{name: 'Pontlyfni', id: '18865', nametype: 'Valid', recclass: ..., ...}]]
backend: cpu
nbytes: 247.4 kB
type: 118 * var * { name: string, id: string, nametype: string, recclass: string, mass: ?string, fall: string, year: ?string, reclat: ?string, reclong: ?string, geolocation: ?{ type: string, coordinates: var * float64 }, ":@computed_region_cbhk_fwbd": ?string, ":@computed_region_nnqa_25f4": ?string }
We can see the categories of this grouped array by pulling out the first item of each sublist:
landing_by_class.recclass[..., 0]
['Acapulcoite', 'Achondrite-ung', 'Angrite', 'Aubrite', 'C', 'C2-ung', 'C3-ung', 'CBa', 'CI1', 'CK4', ..., 'OC', 'Pallasite', 'Pallasite, PMG', 'R3.8-6', 'Stone-uncl', 'Unknown', 'Ureilite', 'Ureilite-an', 'Winonaite']
backend: cpu
nbytes: 12.7 kB
type: 118 * string
The above three steps:
1. Sort the array
2. Compute the lengths of the runs within the sorted array
3. Unflatten the sorted array by the run lengths
together form a groupby operation.
Computing the mass of the largest meteorites
Now that we have grouped our meteorite landings by classification, we can find the meteorite with the largest mass in each group. If we look at the type of the array, we can see that the mass field is actually a string:
landing_by_class.type.show()
118 * var * {
    name: string,
    id: string,
    nametype: string,
    recclass: string,
    mass: ?string,
    fall: string,
    year: ?string,
    reclat: ?string,
    reclong: ?string,
    geolocation: ?{
        type: string,
        coordinates: var * float64
    },
    ":@computed_region_cbhk_fwbd": ?string,
    ":@computed_region_nnqa_25f4": ?string
}
Let’s convert it to a floating point number:
landing_by_class['mass'] = ak.strings_astype(landing_by_class.mass, np.float64)
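The mass field has type ?string, so some entries are missing, and the ?float64 type shown in the later output indicates that they stay missing after the conversion. The per-element effect can be sketched in plain Python. The values are made up, and `to_float_or_none` is a hypothetical helper, not an Awkward Array function.

```python
def to_float_or_none(text):
    """Parse a numeric string, passing missing values (None) through unchanged."""
    return None if text is None else float(text)


# Hypothetical mass strings grouped into sublists, with one missing entry.
mass_strings = [["21", "107000"], [None], ["1.5e3"]]
mass_floats = [[to_float_or_none(m) for m in group] for group in mass_strings]
print(mass_floats)
```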
Now we can find the index of the largest mass in each sublist. We’ll use keepdims=True in order to be able to use this array to index landing_by_class and pull out the corresponding record.
i_largest_mass = ak.argmax(landing_by_class.mass, axis=-1, keepdims=True)
Finding the largest meteorite is then a simple case of using i_largest_mass as an index, and flattening the result to drop the unneeded dimension:
largest_meteorite = ak.flatten(
    landing_by_class[i_largest_mass],
    axis=1,
)
largest_meteorite
[{name: 'Acapulco', id: '10', nametype: 'Valid', recclass: 'Acapulcoite', ...},
 {name: 'Silistra', id: '55584', nametype: 'Valid', recclass: ..., ...},
 {name: 'Angra dos Reis (stone)', id: '2302', nametype: 'Valid', ...},
 {name: 'Norton County', id: '17922', nametype: 'Valid', recclass: ..., ...},
 {name: "Sutter's Mill", id: '55529', nametype: 'Valid', recclass: 'C', ...},
 {name: 'Tagish Lake', id: '23782', nametype: 'Valid', recclass: 'C2-ung', ...},
 {name: 'Ningqiang', id: '16981', nametype: 'Valid', recclass: 'C3-ung', ...},
 {name: 'Gujba', id: '11449', nametype: 'Valid', recclass: 'CBa', ...},
 {name: 'Orgueil', id: '18026', nametype: 'Valid', recclass: 'CI1', ...},
 {name: 'Karoonda', id: '12264', nametype: 'Valid', recclass: 'CK4', ...},
 ...,
 {name: 'Kushiike', id: '12381', nametype: 'Valid', recclass: 'OC', ...},
 {name: 'Mineo', id: '16696', nametype: 'Valid', recclass: 'Pallasite', ...},
 {name: 'Omolon', id: '18019', nametype: 'Valid', recclass: ..., ...},
 {name: 'Rumuruti', id: '22782', nametype: 'Valid', recclass: 'R3.8-6', ...},
 {name: 'Hatford', id: '11855', nametype: 'Valid', recclass: 'Stone-uncl', ...},
 None,
 {name: 'Novo-Urei', id: '17933', nametype: 'Valid', recclass: 'Ureilite', ...},
 {name: 'Almahata Sitta', id: '48915', nametype: 'Valid', recclass: ..., ...},
 {name: 'Pontlyfni', id: '18865', nametype: 'Valid', recclass: ..., ...}]
backend: cpu
nbytes: 275.3 kB
type: 118 * ?{ name: string, id: string, nametype: string, recclass: string, fall: string, year: ?string, reclat: ?string, reclong: ?string, geolocation: ?{ type: string, coordinates: var * float64 }, ":@computed_region_cbhk_fwbd": ?string, ":@computed_region_nnqa_25f4": ?string, mass: ?float64 }
Here are their names!
largest_meteorite.name
['Acapulco', 'Silistra', 'Angra dos Reis (stone)', 'Norton County', "Sutter's Mill", 'Tagish Lake', 'Ningqiang', 'Gujba', 'Orgueil', 'Karoonda', ..., 'Kushiike', 'Mineo', 'Omolon', 'Rumuruti', 'Hatford', None, 'Novo-Urei', 'Almahata Sitta', 'Pontlyfni']
backend: cpu
nbytes: 17.6 kB
type: 118 * ?string