How to convert to Pandas#

Pandas is a data analysis library for ordered time-series and relational data. In general, Pandas does not define operations for manipulating nested data structures, but in some cases, MultiIndex/advanced indexing can do equivalent things.

import awkward as ak
import pandas as pd
import pyarrow as pa
import urllib.request

From Pandas to Awkward#

At the time of writing, there is no ak.from_dataframe function, but such a thing could be useful.

However, Apache Arrow can be converted to and from Awkward Arrays, and Arrow can be converted to and from Pandas (sometimes zero-copy). See below for more on conversion through Arrow.

From Awkward to Pandas#

The function for Awkward → Pandas conversion is ak.to_dataframe().

ak_array = ak.Array(
    [
        {"x": 1.1, "y": 1, "z": "one"},
        {"x": 2.2, "y": 2, "z": "two"},
        {"x": 3.3, "y": 3, "z": "three"},
        {"x": 4.4, "y": 4, "z": "four"},
        {"x": 5.5, "y": 5, "z": "five"},
    ]
)
ak_array

[{x: 1.1, y: 1, z: 'one'},
 {x: 2.2, y: 2, z: 'two'},
 {x: 3.3, y: 3, z: 'three'},
 {x: 4.4, y: 4, z: 'four'},
 {x: 5.5, y: 5, z: 'five'}]
----------------------------
type: 5 * {
    x: float64,
    y: int64,
    z: string
}

ak.to_dataframe(ak_array)

	x	y	z
entry
0	1.1	1	one
1	2.2	2	two
2	3.3	3	three
3	4.4	4	four
4	5.5	5	five

Awkward record field names are converted into Pandas column names, even if nested within lists.

ak_array = ak.Array(
    [
        [
            {"x": 1.1, "y": 1, "z": "one"},
            {"x": 2.2, "y": 2, "z": "two"},
            {"x": 3.3, "y": 3, "z": "three"},
        ],
        [],
        [{"x": 4.4, "y": 4, "z": "four"}, {"x": 5.5, "y": 5, "z": "five"}],
    ]
)
ak_array

[[{x: 1.1, y: 1, z: 'one'}, {x: 2.2, ...}, {x: 3.3, y: 3, z: 'three'}],
 [],
 [{x: 4.4, y: 4, z: 'four'}, {x: 5.5, y: 5, z: 'five'}]]
-----------------------------------------------------------------------
type: 3 * var * {
    x: float64,
    y: int64,
    z: string
}

ak.to_dataframe(ak_array)

		x	y	z
entry	subentry
0	0	1.1	1	one
	1	2.2	2	two
	2	3.3	3	three
2	0	4.4	4	four
2	1	5.5	5	five

In this case, we see that the "x", "y", and "z" fields are separate columns, but also that the index is now hierarchical, a MultiIndex. Nested lists become MultiIndex rows and nested records become MultiIndex columns.

Here is an example with three levels of depth:

ak_array = ak.Array(
    [
        [[1.1, 2.2], [], [3.3]],
        [],
        [[4.4], [5.5, 6.6]],
        [[7.7]],
        [[8.8]],
    ]
)
ak_array

[[[1.1, 2.2], [], [3.3]],
 [],
 [[4.4], [5.5, 6.6]],
 [[7.7]],
 [[8.8]]]
-----------------------------
type: 5 * var * var * float64

ak.to_dataframe(ak_array)

			values
entry	subentry	subsubentry
0	0	0	1.1
	0	1	2.2
	2	0	3.3
2	0	0	4.4
	1	0	5.5
	1	1	6.6
3	0	0	7.7
4	0	0	8.8

And here is an example with nested records/hierarchical columns:

ak_array = ak.Array(
    [
        {"I": {"a": _, "b": {"i": _}}, "II": {"x": {"y": {"z": _}}}}
        for _ in range(0, 50, 10)
    ]
)
ak_array

[{I: {a: 0, b: {...}}, II: {x: {y: ..., ...}}},
 {I: {a: 10, b: {...}}, II: {x: {y: ..., ...}}},
 {I: {a: 20, b: {...}}, II: {x: {y: ..., ...}}},
 {I: {a: 30, b: {...}}, II: {x: {y: ..., ...}}},
 {I: {a: 40, b: {...}}, II: {x: {y: ..., ...}}}]
------------------------------------------------
type: 5 * {
    I: {
        a: int64,
        b: {
            i: int64
        }
    },
    II: {
        x: {
            y: {
                z: int64
            }
        }
    }
}

ak.to_dataframe(ak_array)

	I		II
	a	b	x
		i	y
			z
entry
0	0	0	0
1	10	10	10
2	20	20	20
3	30	30	30
4	40	40	40

Although nested lists and records can be represented using Pandas’s MultiIndex, different-length lists in the same data structure can only be translated without loss into multiple DataFrames. This is because a DataFrame can have only one MultiIndex, but lists of different lengths require different MultiIndexes.

ak_array = ak.Array(
    [
        {"x": [], "y": [4.4, 3.3, 2.2, 1.1]},
        {"x": [1], "y": [3.3, 2.2, 1.1]},
        {"x": [1, 2], "y": [2.2, 1.1]},
        {"x": [1, 2, 3], "y": [1.1]},
        {"x": [1, 2, 3, 4], "y": []},
    ]
)
ak_array

[{x: [], y: [4.4, 3.3, 2.2, 1.1]},
 {x: [1], y: [3.3, 2.2, 1.1]},
 {x: [1, 2], y: [2.2, 1.1]},
 {x: [1, 2, 3], y: [1.1]},
 {x: [1, 2, 3, 4], y: []}]
----------------------------------
type: 5 * {
    x: var * int64,
    y: var * float64
}

To avoid losing any data, ak.to_dataframe() can be used with how=None (the default is how="inner") to return a list of the minimum number of DataFrames needed to encode the data.

In how=None mode, ak.to_dataframe() always returns a list (sometimes with only one item).

ak.to_dataframe(ak_array, how=None)

[                x
 entry subentry   
 1     0         1
 2     0         1
       1         2
 3     0         1
       1         2
       2         3
 4     0         1
       1         2
       2         3
       3         4,
                   y
 entry subentry     
 0     0         4.4
       1         3.3
       2         2.2
       3         1.1
 1     0         3.3
       1         2.2
       2         1.1
 2     0         2.2
       1         1.1
 3     0         1.1]

The default how="inner" combines the above into a single DataFrame using pd.merge. This operation is lossy.

ak.to_dataframe(ak_array, how="inner")

		x	y
entry	subentry
1	0	1	3.3
2	0	1	2.2
2	1	2	1.1
3	0	1	1.1

The value of how is passed to pd.merge, so outer joins are possible as well.

ak.to_dataframe(ak_array, how="outer")

		x	y
entry	subentry
0	0	NaN	4.4
	1	NaN	3.3
	2	NaN	2.2
	3	NaN	1.1
1	0	1.0	3.3
	1	NaN	2.2
	2	NaN	1.1
2	0	1.0	2.2
2	1	2.0	1.1
3	0	1.0	1.1
	1	2.0	NaN
	2	3.0	NaN
4	0	1.0	NaN
	1	2.0	NaN
	2	3.0	NaN
	3	4.0	NaN

Conversion through Apache Arrow#

Since Apache Arrow can be converted to and from Awkward Arrays and Pandas, Arrow can connect Awkward and Pandas in both directions. This is an alternative to ak.to_pandas() (described above) with different behavior.

As described in the tutorial on Arrow, the ak.to_arrow() function returns a pyarrow.lib.Arrow object. Arrow’s conversion to Pandas requires a pyarrow.lib.Table.

ak_array = ak.Array(
    [
        [
            {"x": 1.1, "y": 1, "z": "one"},
            {"x": 2.2, "y": 2, "z": "two"},
            {"x": 3.3, "y": 3, "z": "three"},
        ],
        [],
        [{"x": 4.4, "y": 4, "z": "four"}, {"x": 5.5, "y": 5, "z": "five"}],
    ]
)
ak_array

[[{x: 1.1, y: 1, z: 'one'}, {x: 2.2, ...}, {x: 3.3, y: 3, z: 'three'}],
 [],
 [{x: 4.4, y: 4, z: 'four'}, {x: 5.5, y: 5, z: 'five'}]]
-----------------------------------------------------------------------
type: 3 * var * {
    x: float64,
    y: int64,
    z: string
}

pa_array = ak.to_arrow(ak_array)
pa_array

<awkward._connect.pyarrow.extn_types.AwkwardArrowArray object at 0x7f707ac3c360>
[
  -- is_valid: all not null
  -- child 0 type: extension<awkward<AwkwardArrowType>>
    [
      1.1,
      2.2,
      3.3
    ]
  -- child 1 type: extension<awkward<AwkwardArrowType>>
    [
      1,
      2,
      3
    ]
  -- child 2 type: extension<awkward<AwkwardArrowType>>
    [
      "one",
      "two",
      "three"
    ],
  -- is_valid: all not null
  -- child 0 type: extension<awkward<AwkwardArrowType>>
    []
  -- child 1 type: extension<awkward<AwkwardArrowType>>
    []
  -- child 2 type: extension<awkward<AwkwardArrowType>>
    [],
  -- is_valid: all not null
  -- child 0 type: extension<awkward<AwkwardArrowType>>
    [
      4.4,
      5.5
    ]
  -- child 1 type: extension<awkward<AwkwardArrowType>>
    [
      4,
      5
    ]
  -- child 2 type: extension<awkward<AwkwardArrowType>>
    [
      "four",
      "five"
    ]
]

We can build a Table manually, ensuring that we set extensionarray=False. The extensionarray flag is normally True, and enables Awkward to preserve metadata through Arrow transformations. However, tools like Arrow’s Pandas conversion do not recognise Awkward’s special extension type, so we must take care to provide Arrow with native types:

pa_table = pa.Table.from_batches(
    [
        pa.RecordBatch.from_arrays(
            [
                ak.to_arrow(ak_array.x, extensionarray=False),
                ak.to_arrow(ak_array.y, extensionarray=False),
                ak.to_arrow(ak_array.z, extensionarray=False),
            ],
            ["x", "y", "z"],
        )
    ]
)
pa_table

pyarrow.Table
x: large_list<item: double not null>
  child 0, item: double not null
y: large_list<item: int64 not null>
  child 0, item: int64 not null
z: large_list<item: large_string not null>
  child 0, item: large_string not null
----
x: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
y: [[[1,2,3],[],[4,5]]]
z: [[["one","two","three"],[],["four","five"]]]

pa_table.to_pandas()

	x	y	z
0	[1.1, 2.2, 3.3]	[1, 2, 3]	[one, two, three]
1	[]	[]	[]
2	[4.4, 5.5]	[4, 5]	[four, five]

Note that this is different from the output of ak.to_pandas():

ak.to_dataframe(ak_array)

		x	y	z
entry	subentry
0	0	1.1	1	one
	1	2.2	2	two
	2	3.3	3	three
2	0	4.4	4	four
2	1	5.5	5	five

The Awkward → Arrow → Pandas route leaves the lists as nested data within each cell, whereas ak.to_dataframe() encodes the nested structure with a MultiIndex/advanced indexing and puts simple values in each cell. Depending on your needs, one or the other may be desirable.

Finally, the Pandas → Arrow → Awkward is currently the only means of turning Pandas DataFrames into Awkward Arrays.

pokemon = urllib.request.urlopen(
    "https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv"
)
df = pd.read_csv(pokemon)
df

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
3	3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	False
4	4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...
795	719	Diancie	Rock	Fairy	600	50	100	150	100	150	50	6	True
796	719	DiancieMega Diancie	Rock	Fairy	700	50	160	110	160	110	110	6	True
797	720	HoopaHoopa Confined	Psychic	Ghost	600	80	110	60	150	130	70	6	True
798	720	HoopaHoopa Unbound	Psychic	Dark	680	80	160	60	170	130	80	6	True
799	721	Volcanion	Fire	Water	600	80	110	120	130	90	70	6	True

800 rows × 13 columns

ak_array = ak.from_arrow(pa.Table.from_pandas(df))
ak_array

[{'#': 1, Name: 'Bulbasaur', 'Type 1': 'Grass', 'Type 2': 'Poison', ...},
 {'#': 2, Name: 'Ivysaur', 'Type 1': 'Grass', 'Type 2': 'Poison', ...},
 {'#': 3, Name: 'Venusaur', 'Type 1': 'Grass', 'Type 2': 'Poison', ...},
 {'#': 3, Name: 'VenusaurMega Venusaur', 'Type 1': 'Grass', 'Type 2': ..., ...},
 {'#': 4, Name: 'Charmander', 'Type 1': 'Fire', 'Type 2': None, ...},
 {'#': 5, Name: 'Charmeleon', 'Type 1': 'Fire', 'Type 2': None, ...},
 {'#': 6, Name: 'Charizard', 'Type 1': 'Fire', 'Type 2': 'Flying', ...},
 {'#': 6, Name: 'CharizardMega Charizard X', 'Type 1': 'Fire', ...},
 {'#': 6, Name: 'CharizardMega Charizard Y', 'Type 1': 'Fire', ...},
 {'#': 7, Name: 'Squirtle', 'Type 1': 'Water', 'Type 2': None, Total: 314, ...},
 ...,
 {'#': 715, Name: 'Noivern', 'Type 1': 'Flying', 'Type 2': 'Dragon', ...},
 {'#': 716, Name: 'Xerneas', 'Type 1': 'Fairy', 'Type 2': None, ...},
 {'#': 717, Name: 'Yveltal', 'Type 1': 'Dark', 'Type 2': 'Flying', ...},
 {'#': 718, Name: 'Zygarde50% Forme', 'Type 1': 'Dragon', 'Type 2': ..., ...},
 {'#': 719, Name: 'Diancie', 'Type 1': 'Rock', 'Type 2': 'Fairy', ...},
 {'#': 719, Name: 'DiancieMega Diancie', 'Type 1': 'Rock', 'Type 2': ..., ...},
 {'#': 720, Name: 'HoopaHoopa Confined', 'Type 1': 'Psychic', ...},
 {'#': 720, Name: 'HoopaHoopa Unbound', 'Type 1': 'Psychic', ...},
 {'#': 721, Name: 'Volcanion', 'Type 1': 'Fire', 'Type 2': 'Water', ...}]
--------------------------------------------------------------------------------
type: 800 * {
    "#": ?int64,
    Name: ?string,
    "Type 1": ?string,
    "Type 2": ?string,
    Total: ?int64,
    HP: ?int64,
    Attack: ?int64,
    Defense: ?int64,
    "Sp. Atk": ?int64,
    "Sp. Def": ?int64,
    Speed: ?int64,
    Generation: ?int64,
    Legendary: ?bool
}

ak.type(ak_array)

ArrayType(RecordType([OptionType(NumpyType('int64')), OptionType(ListType(NumpyType('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'})), OptionType(ListType(NumpyType('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'})), OptionType(ListType(NumpyType('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'})), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('int64')), OptionType(NumpyType('bool'))], ['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']), 800, None)

ak.to_list(ak_array[0])

{'#': 1,
 'Name': 'Bulbasaur',
 'Type 1': 'Grass',
 'Type 2': 'Poison',
 'Total': 318,
 'HP': 45,
 'Attack': 49,
 'Defense': 49,
 'Sp. Atk': 65,
 'Sp. Def': 65,
 'Speed': 45,
 'Generation': 1,
 'Legendary': False}

This array is ready for data analysis.

ak_array[ak_array.Legendary].Attack - ak_array[ak_array.Legendary].Defense

[-15,
 5,
 10,
 20,
 90,
 80,
 10,
 30,
 -40,
 -40,
 ...,
 30,
 36,
 36,
 -21,
 -50,
 50,
 50,
 100,
 -10]
-----------------
type: 65 * ?int64