How to convert to Pandas#

Pandas is a data analysis library for ordered time-series and relational data. In general, Pandas does not define operations for manipulating nested data structures, but in some cases, MultiIndex/advanced indexing can do equivalent things.

import awkward as ak
import pandas as pd
import pyarrow as pa
import urllib.request

From Pandas to Awkward#

At the time of writing, there is no ak.from_dataframe function, but such a thing could be useful.

However, Apache Arrow can be converted to and from Awkward Arrays, and Arrow can be converted to and from Pandas (sometimes zero-copy). See below for more on conversion through Arrow.

From Awkward to Pandas#

The function for Awkward → Pandas conversion is ak.to_dataframe().

ak_array = ak.Array(
    [
        {"x": 1.1, "y": 1, "z": "one"},
        {"x": 2.2, "y": 2, "z": "two"},
        {"x": 3.3, "y": 3, "z": "three"},
        {"x": 4.4, "y": 4, "z": "four"},
        {"x": 5.5, "y": 5, "z": "five"},
    ]
)
ak_array
[{x: 1.1, y: 1, z: 'one'},
 {x: 2.2, y: 2, z: 'two'},
 {x: 3.3, y: 3, z: 'three'},
 {x: 4.4, y: 4, z: 'four'},
 {x: 5.5, y: 5, z: 'five'}]
---------------------------------------------------------
backend: cpu
nbytes: 147 B
type: 5 * {
    x: float64,
    y: int64,
    z: string
}
ak.to_dataframe(ak_array)
x y z
entry
0 1.1 1 one
1 2.2 2 two
2 3.3 3 three
3 4.4 4 four
4 5.5 5 five

Awkward record field names are converted into Pandas column names, even if nested within lists.

ak_array = ak.Array(
    [
        [
            {"x": 1.1, "y": 1, "z": "one"},
            {"x": 2.2, "y": 2, "z": "two"},
            {"x": 3.3, "y": 3, "z": "three"},
        ],
        [],
        [{"x": 4.4, "y": 4, "z": "four"}, {"x": 5.5, "y": 5, "z": "five"}],
    ]
)
ak_array
[[{x: 1.1, y: 1, z: 'one'}, {x: 2.2, ...}, {x: 3.3, y: 3, z: 'three'}],
 [],
 [{x: 4.4, y: 4, z: 'four'}, {x: 5.5, y: 5, z: 'five'}]]
-----------------------------------------------------------------------
backend: cpu
nbytes: 179 B
type: 3 * var * {
    x: float64,
    y: int64,
    z: string
}
ak.to_dataframe(ak_array)
x y z
entry subentry
0 0 1.1 1 one
1 2.2 2 two
2 3.3 3 three
2 0 4.4 4 four
1 5.5 5 five

In this case, we see that the "x", "y", and "z" fields are separate columns, but also that the index is now hierarchical, a MultiIndex. Nested lists become MultiIndex rows and nested records become MultiIndex columns.

Here is an example with three levels of depth:

ak_array = ak.Array(
    [
        [[1.1, 2.2], [], [3.3]],
        [],
        [[4.4], [5.5, 6.6]],
        [[7.7]],
        [[8.8]],
    ]
)
ak_array
[[[1.1, 2.2], [], [3.3]],
 [],
 [[4.4], [5.5, 6.6]],
 [[7.7]],
 [[8.8]]]
-----------------------------
backend: cpu
nbytes: 176 B
type: 5 * var * var * float64
ak.to_dataframe(ak_array)
values
entry subentry subsubentry
0 0 0 1.1
1 2.2
2 0 3.3
2 0 0 4.4
1 0 5.5
1 6.6
3 0 0 7.7
4 0 0 8.8

And here is an example with nested records/hierarchical columns:

ak_array = ak.Array(
    [
        {"I": {"a": _, "b": {"i": _}}, "II": {"x": {"y": {"z": _}}}}
        for _ in range(0, 50, 10)
    ]
)
ak_array
[{I: {a: 0, b: {...}}, II: {x: {y: ..., ...}}},
 {I: {a: 10, b: {...}}, II: {x: {y: ..., ...}}},
 {I: {a: 20, b: {...}}, II: {x: {y: ..., ...}}},
 {I: {a: 30, b: {...}}, II: {x: {y: ..., ...}}},
 {I: {a: 40, b: {...}}, II: {x: {y: ..., ...}}}]
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
backend: cpu
nbytes: 120 B
type: 5 * {
    I: {
        a: int64,
        b: {
            i: int64
        }
    },
    II: {
        x: {
            y: {
                z: int64
            }
        }
    }
}
ak.to_dataframe(ak_array)
I II
a b x
i y
z
entry
0 0 0 0
1 10 10 10
2 20 20 20
3 30 30 30
4 40 40 40

Although nested lists and records can be represented using Pandas’s MultiIndex, different-length lists in the same data structure can only be translated without loss into multiple DataFrames. This is because a DataFrame can have only one MultiIndex, but lists of different lengths require different MultiIndexes.

ak_array = ak.Array(
    [
        {"x": [], "y": [4.4, 3.3, 2.2, 1.1]},
        {"x": [1], "y": [3.3, 2.2, 1.1]},
        {"x": [1, 2], "y": [2.2, 1.1]},
        {"x": [1, 2, 3], "y": [1.1]},
        {"x": [1, 2, 3, 4], "y": []},
    ]
)
ak_array
[{x: [], y: [4.4, 3.3, 2.2, 1.1]},
 {x: [1], y: [3.3, 2.2, 1.1]},
 {x: [1, 2], y: [2.2, 1.1]},
 {x: [1, 2, 3], y: [1.1]},
 {x: [1, 2, 3, 4], y: []}]
------------------------------------------------------
backend: cpu
nbytes: 256 B
type: 5 * {
    x: var * int64,
    y: var * float64
}

To avoid losing any data, ak.to_dataframe() can be used with how=None (the default is how="inner") to return a list of the minimum number of DataFrames needed to encode the data.

In how=None mode, ak.to_dataframe() always returns a list (sometimes with only one item).

ak.to_dataframe(ak_array, how=None)
[                x
 entry subentry   
 1     0         1
 2     0         1
       1         2
 3     0         1
       1         2
       2         3
 4     0         1
       1         2
       2         3
       3         4,
                   y
 entry subentry     
 0     0         4.4
       1         3.3
       2         2.2
       3         1.1
 1     0         3.3
       1         2.2
       2         1.1
 2     0         2.2
       1         1.1
 3     0         1.1]

The default how="inner" combines the above into a single DataFrame using pd.merge. This operation is lossy.

ak.to_dataframe(ak_array, how="inner")
x y
entry subentry
1 0 1 3.3
2 0 1 2.2
1 2 1.1
3 0 1 1.1

The value of how is passed to pd.merge, so outer joins are possible as well.

ak.to_dataframe(ak_array, how="outer")
x y
entry subentry
0 0 NaN 4.4
1 NaN 3.3
2 NaN 2.2
3 NaN 1.1
1 0 1.0 3.3
1 NaN 2.2
2 NaN 1.1
2 0 1.0 2.2
1 2.0 1.1
3 0 1.0 1.1
1 2.0 NaN
2 3.0 NaN
4 0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 4.0 NaN

Conversion through Apache Arrow#

Since Apache Arrow can be converted to and from Awkward Arrays and Pandas, Arrow can connect Awkward and Pandas in both directions. This is an alternative to ak.to_pandas() (described above) with different behavior.

As described in the tutorial on Arrow, the ak.to_arrow() function returns a pyarrow.lib.Arrow object. Arrow’s conversion to Pandas requires a pyarrow.lib.Table.

ak_array = ak.Array(
    [
        [
            {"x": 1.1, "y": 1, "z": "one"},
            {"x": 2.2, "y": 2, "z": "two"},
            {"x": 3.3, "y": 3, "z": "three"},
        ],
        [],
        [{"x": 4.4, "y": 4, "z": "four"}, {"x": 5.5, "y": 5, "z": "five"}],
    ]
)
ak_array
[[{x: 1.1, y: 1, z: 'one'}, {x: 2.2, ...}, {x: 3.3, y: 3, z: 'three'}],
 [],
 [{x: 4.4, y: 4, z: 'four'}, {x: 5.5, y: 5, z: 'five'}]]
-----------------------------------------------------------------------
backend: cpu
nbytes: 179 B
type: 3 * var * {
    x: float64,
    y: int64,
    z: string
}
pa_array = ak.to_arrow(ak_array)
pa_array
<awkward._connect.pyarrow.extn_types.AwkwardArrowArray object at 0x7f72eaca0910>
[
  -- is_valid: all not null
  -- child 0 type: extension<awkward<AwkwardArrowType>>
    [
      1.1,
      2.2,
      3.3
    ]
  -- child 1 type: extension<awkward<AwkwardArrowType>>
    [
      1,
      2,
      3
    ]
  -- child 2 type: extension<awkward<AwkwardArrowType>>
    [
      "one",
      "two",
      "three"
    ],
  -- is_valid: all not null
  -- child 0 type: extension<awkward<AwkwardArrowType>>
    []
  -- child 1 type: extension<awkward<AwkwardArrowType>>
    []
  -- child 2 type: extension<awkward<AwkwardArrowType>>
    [],
  -- is_valid: all not null
  -- child 0 type: extension<awkward<AwkwardArrowType>>
    [
      4.4,
      5.5
    ]
  -- child 1 type: extension<awkward<AwkwardArrowType>>
    [
      4,
      5
    ]
  -- child 2 type: extension<awkward<AwkwardArrowType>>
    [
      "four",
      "five"
    ]
]

We can build a Table manually, ensuring that we set extensionarray=False. The extensionarray flag is normally True, and enables Awkward to preserve metadata through Arrow transformations. However, tools like Arrow’s Pandas conversion do not recognise Awkward’s special extension type, so we must take care to provide Arrow with native types:

pa_table = pa.Table.from_batches(
    [
        pa.RecordBatch.from_arrays(
            [
                ak.to_arrow(ak_array.x, extensionarray=False),
                ak.to_arrow(ak_array.y, extensionarray=False),
                ak.to_arrow(ak_array.z, extensionarray=False),
            ],
            ["x", "y", "z"],
        )
    ]
)
pa_table
pyarrow.Table
x: large_list<item: double not null>
  child 0, item: double not null
y: large_list<item: int64 not null>
  child 0, item: int64 not null
z: large_list<item: large_string not null>
  child 0, item: large_string not null
----
x: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
y: [[[1,2,3],[],[4,5]]]
z: [[["one","two","three"],[],["four","five"]]]
pa_table.to_pandas()
x y z
0 [1.1, 2.2, 3.3] [1, 2, 3] [one, two, three]
1 [] [] []
2 [4.4, 5.5] [4, 5] [four, five]

Note that this is different from the output of ak.to_pandas():

ak.to_dataframe(ak_array)
x y z
entry subentry
0 0 1.1 1 one
1 2.2 2 two
2 3.3 3 three
2 0 4.4 4 four
1 5.5 5 five

The Awkward → Arrow → Pandas route leaves the lists as nested data within each cell, whereas ak.to_dataframe() encodes the nested structure with a MultiIndex/advanced indexing and puts simple values in each cell. Depending on your needs, one or the other may be desirable.

Finally, the Pandas → Arrow → Awkward is currently the only means of turning Pandas DataFrames into Awkward Arrays.

pokemon = urllib.request.urlopen(
    "https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv"
)
df = pd.read_csv(pokemon)
df
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True

800 rows × 13 columns

ak_array = ak.from_arrow(pa.Table.from_pandas(df))
ak_array
[{'#': 1, Name: 'Bulbasaur', 'Type 1': 'Grass', 'Type 2': 'Poison', ...},
 {'#': 2, Name: 'Ivysaur', 'Type 1': 'Grass', 'Type 2': 'Poison', ...},
 {'#': 3, Name: 'Venusaur', 'Type 1': 'Grass', 'Type 2': 'Poison', ...},
 {'#': 3, Name: 'VenusaurMega Venusaur', 'Type 1': 'Grass', 'Type 2': ..., ...},
 {'#': 4, Name: 'Charmander', 'Type 1': 'Fire', 'Type 2': None, ...},
 {'#': 5, Name: 'Charmeleon', 'Type 1': 'Fire', 'Type 2': None, ...},
 {'#': 6, Name: 'Charizard', 'Type 1': 'Fire', 'Type 2': 'Flying', ...},
 {'#': 6, Name: 'CharizardMega Charizard X', 'Type 1': 'Fire', ...},
 {'#': 6, Name: 'CharizardMega Charizard Y', 'Type 1': 'Fire', ...},
 {'#': 7, Name: 'Squirtle', 'Type 1': 'Water', 'Type 2': None, Total: 314, ...},
 ...,
 {'#': 715, Name: 'Noivern', 'Type 1': 'Flying', 'Type 2': 'Dragon', ...},
 {'#': 716, Name: 'Xerneas', 'Type 1': 'Fairy', 'Type 2': None, ...},
 {'#': 717, Name: 'Yveltal', 'Type 1': 'Dark', 'Type 2': 'Flying', ...},
 {'#': 718, Name: 'Zygarde50% Forme', 'Type 1': 'Dragon', 'Type 2': ..., ...},
 {'#': 719, Name: 'Diancie', 'Type 1': 'Rock', 'Type 2': 'Fairy', ...},
 {'#': 719, Name: 'DiancieMega Diancie', 'Type 1': 'Rock', 'Type 2': ..., ...},
 {'#': 720, Name: 'HoopaHoopa Confined', 'Type 1': 'Psychic', ...},
 {'#': 720, Name: 'HoopaHoopa Unbound', 'Type 1': 'Psychic', ...},
 {'#': 721, Name: 'Volcanion', 'Type 1': 'Fire', 'Type 2': 'Water', ...}]
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
backend: cpu
nbytes: 81.7 kB
type: 800 * {
    "#": ?int64,
    Name: ?string,
    "Type 1": ?string,
    "Type 2": ?string,
    Total: ?int64,
    HP: ?int64,
    Attack: ?int64,
    Defense: ?int64,
    "Sp. Atk": ?int64,
    "Sp. Def": ?int64,
    Speed: ?int64,
    Generation: ?int64,
    Legendary: ?bool
}
ak.type(ak_array)
ArrayType(RecordType([OptionType(NumpyType('int64')), OptionType(ListType(NumpyType('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'})), OptionType(ListType(NumpyType('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'})), OptionType(ListType(NumpyType('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'})), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('bool'))], ['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']), 800, None)
ak.to_list(ak_array[0])
{'#': 1,
 'Name': 'Bulbasaur',
 'Type 1': 'Grass',
 'Type 2': 'Poison',
 'Total': 318,
 'HP': 45,
 'Attack': 49,
 'Defense': 49,
 'Sp. Atk': 65,
 'Sp. Def': 65,
 'Speed': 45,
 'Generation': 1,
 'Legendary': False}

This array is ready for data analysis.

ak_array[ak_array.Legendary].Attack - ak_array[ak_array.Legendary].Defense
[-15,
 5,
 10,
 20,
 90,
 80,
 10,
 30,
 -40,
 -40,
 ...,
 30,
 36,
 36,
 -21,
 -50,
 50,
 50,
 100,
 -10]
-----------------
backend: cpu
nbytes: 1.0 kB
type: 65 * ?int64