Generic buffers
Most of the conversion functions target a particular library: NumPy, Arrow, Pandas, or Python itself. As a catch-all for other storage formats, Awkward Arrays can be converted to and from sets of named buffers. The buffers are not (usually) intelligible on their own; the length of the array and a JSON document are needed to reconstitute the original structure. This section will demonstrate how an array-set can be used to store an Awkward Array in an HDF5 file, which ordinarily wouldn’t be able to represent nested, irregular data structures.
import awkward as ak
import numpy as np
import h5py
import json
From Awkward to buffers
Consider the following complex array:
ak_array = ak.Array(
    [
        [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
        [],
        [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}],
    ]
)
ak_array
[[{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}]]
----------------------------------------------------------------
backend: cpu
nbytes: 240 B
type: 3 * var * {
    x: float64,
    y: var * int64
}
The ak.to_buffers() function decomposes it into a set of one-dimensional arrays (a zero-copy operation).
form, length, container = ak.to_buffers(ak_array)
The pieces needed to reconstitute this array are:
the ak.forms.Form, which defines how structure is built from one-dimensional arrays,
the length of the original array,
the one-dimensional arrays in the container (a collections.abc.MutableMapping).
The ak.forms.Form is like an Awkward ak.types.Type in that it describes how the data are structured, but with more detail: it includes distinctions such as the difference between ak.contents.ListArray and ak.contents.ListOffsetArray, as well as the integer types of structural ak.index.Index.
It is usually presented as JSON, and ak.forms.Form.to_json() returns it as a compact JSON string.
form
ListOffsetForm('i64', RecordForm([NumpyForm('float64', form_key='node2'), ListOffsetForm('i64', NumpyForm('int64', form_key='node4'), form_key='node3')], ['x', 'y'], form_key='node1'), form_key='node0')
In this case, the length is just an integer. It would be a list of integers if ak_array were partitioned.
length
3
This container is a new dict, but it could have been a user-specified collections.abc.MutableMapping if passed into ak.to_buffers() as an argument.
container
{'node0-offsets': array([0, 3, 3, 5]),
'node2-data': array([1.1, 2.2, 3.3, 4.4, 5.5]),
'node3-offsets': array([ 0, 1, 3, 6, 10, 15]),
'node4-data': array([1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5])}
From buffers to Awkward
The function that reverses ak.to_buffers() is ak.from_buffers(). Its first three arguments are form, length, and container.
ak.from_buffers(form, length, container)
[[{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}]]
----------------------------------------------------------------
backend: cpu
nbytes: 240 B
type: 3 * var * {
    x: float64,
    y: var * int64
}
Minimizing the size of the output buffers
The ak.to_buffers()/ak.from_buffers() functions exactly preserve an array, warts and all. Often, you’ll want to write only arrays that have been passed through ak.to_packed(). “Packing” replaces an array structure with an equivalent structure that has no unreachable elements: data that you can’t see as part of the array, and therefore probably don’t want to write.
Here is an example of an array in need of packing:
unpacked = ak.Array(
    ak.contents.ListArray(
        ak.index.Index64(np.array([4, 10, 1])),
        ak.index.Index64(np.array([7, 10, 3])),
        ak.contents.NumpyArray(np.array([999, 4.4, 5.5, 999, 1.1, 2.2, 3.3, 999])),
    )
)
unpacked
[[1.1, 2.2, 3.3],
 [],
 [4.4, 5.5]]
-----------------------
backend: cpu
nbytes: 112 B
type: 3 * var * float64
This ak.contents.ListArray is in a strange order and the 999 values are unreachable. (Also, using starts[1] == stops[1] == 10 to represent an empty list is a little odd, though allowed by the specification.)
The ak.to_buffers() function dutifully writes the 999 values into the output, even though they’re not visible in the array.
ak.to_buffers(unpacked)
(ListForm('i64', 'i64', NumpyForm('float64', form_key='node1'), form_key='node0'),
3,
{'node0-starts': array([ 4, 10, 1]),
'node0-stops': array([ 7, 10, 3]),
'node1-data': array([999. , 4.4, 5.5, 999. , 1.1, 2.2, 3.3, 999. ])})
If the intended purpose of calling ak.to_buffers() is to write to a file or send data over a network, this is wasted space. It can be trimmed by calling the ak.to_packed() function.
packed = ak.to_packed(unpacked)
packed
[[1.1, 2.2, 3.3],
 [],
 [4.4, 5.5]]
-----------------------
backend: cpu
nbytes: 72 B
type: 3 * var * float64
At a high level, the array appears to be the same, but its low-level structure is quite different:
unpacked.layout
<ListArray len='3'>
<starts><Index dtype='int64' len='3'>[ 4 10 1]</Index></starts>
<stops><Index dtype='int64' len='3'>[ 7 10 3]</Index></stops>
<content><NumpyArray dtype='float64' len='8'>
[999. 4.4 5.5 999. 1.1 2.2 3.3 999. ]
</NumpyArray></content>
</ListArray>
packed.layout
<ListOffsetArray len='3'>
<offsets><Index dtype='int64' len='4'>[0 3 3 5]</Index></offsets>
<content><NumpyArray dtype='float64' len='5'>[1.1 2.2 3.3 4.4 5.5]</NumpyArray></content>
</ListOffsetArray>
This version of the array is more concise when written with ak.to_buffers():
ak.to_buffers(packed)
(ListOffsetForm('i64', NumpyForm('float64', form_key='node1'), form_key='node0'),
3,
{'node0-offsets': array([0, 3, 3, 5]),
'node1-data': array([1.1, 2.2, 3.3, 4.4, 5.5])})
Saving Awkward Arrays to HDF5
The h5py library presents each group in an HDF5 file as a collections.abc.MutableMapping, which we can use as a container for an array-set. We must also save the form and length as metadata for the array to be retrievable.
file = h5py.File("/tmp/example.hdf5", "w")
group = file.create_group("awkward")
group
<HDF5 group "/awkward" (0 members)>
We can fill this group as a container by passing it to ak.to_buffers(). (See the previous section for more on ak.to_packed().)
form, length, container = ak.to_buffers(ak.to_packed(ak_array), container=group)
container
<HDF5 group "/awkward" (4 members)>
Now the HDF5 group has been filled with array pieces.
container.keys()
<KeysViewHDF5 ['node0-offsets', 'node2-data', 'node3-offsets', 'node4-data']>
Here’s one.
np.asarray(container["node0-offsets"])
array([0, 3, 3, 5])
Now we need to add the other information to the group as metadata. HDF5 attributes accept strings and numbers, so the form can be stored as a JSON string and the length as an integer.
group.attrs["form"] = form.to_json()
group.attrs["form"]
'{"class": "ListOffsetArray", "offsets": "i64", "content": {"class": "RecordArray", "fields": ["x", "y"], "contents": [{"class": "NumpyArray", "primitive": "float64", "inner_shape": [], "parameters": {}, "form_key": "node2"}, {"class": "ListOffsetArray", "offsets": "i64", "content": {"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {}, "form_key": "node4"}, "parameters": {}, "form_key": "node3"}], "parameters": {}, "form_key": "node1"}, "parameters": {}, "form_key": "node0"}'
group.attrs["length"] = length
group.attrs["length"]
np.int64(3)
Reading Awkward Arrays from HDF5
With that, we can reconstitute the array by supplying ak.from_buffers() the right arguments from the group and metadata.
The group can’t be used as a container as-is, since subscripting it returns h5py.Dataset objects, rather than arrays.
reconstituted = ak.from_buffers(
    ak.forms.from_json(group.attrs["form"]),
    group.attrs["length"],
    {k: np.asarray(v) for k, v in group.items()},
)
reconstituted
[[{x: 1.1, y: [1]}, {x: 2.2, y: [...]}, {x: 3.3, y: [1, 2, 3]}],
 [],
 [{x: 4.4, y: [1, 2, 3, 4]}, {x: 5.5, y: [1, ..., 5]}]]
----------------------------------------------------------------
backend: cpu
nbytes: 240 B
type: 3 * var * {
    x: float64,
    y: var * int64
}