How to examine a single item in detail

It’s often useful to pull out a single item from an array to inspect its contents, particularly in the early stages of a data analysis, to get a sense of the data’s structure. This tutorial shows how to extract one item from an Awkward Array and examine it in different ways.

For this example, we’ll use the Chicago taxi trips dataset from “10 minutes to Awkward Array.” Recall that this dataset includes information about trips by various taxis collected over a few years, enriched with GPS path data.

Loading the dataset

First, let’s load the dataset using the ak.from_parquet() function. We will only load the first row group, for the sake of this demonstration:

import awkward as ak

url = "https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet"
taxi = ak.from_parquet(
    url,
    row_groups=[0],
    columns=["trip.km", "trip.begin.l*", "trip.end.l*", "trip.path.*"],
)

What is a single item?

An “item” in this dataset could mean several things. It could be a single taxi, which comprises many trips.

single_taxi = taxi[5]
single_taxi
[{trip: {km: 29.6, begin: {lon: -87.9, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 2, begin: {lon: -87.6, ...}, end: {...}, path: [...]}},
 {trip: {km: 2.46, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 50.8, begin: {lon: -87.9, ...}, end: {...}, path: []}},
 {trip: {km: 27, begin: {lon: -87.8, ...}, end: {...}, path: [...]}},
 {trip: {km: 18.6, begin: {lon: -87.9, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 20.2, begin: {lon: -87.9, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 2.27, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 0.724, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 1.05, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 ...,
 {trip: {km: 2.77, begin: {lon: -87.7, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 1.37, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 1.74, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 0.612, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 14.4, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 24.1, begin: {lon: -87.9, ...}, end: {...}, path: []}},
 {trip: {km: 33.4, begin: {lon: -87.9, ...}, end: {...}, path: []}},
 {trip: {km: 24.3, begin: {lon: -87.9, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 1.83, begin: {lon: -87.7, ...}, end: {...}, path: ..., ...}}]
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
backend: cpu
nbytes: 84.5 MB
type: 2403 * ?{
    trip: {
        km: ?float32,
        begin: {
            lon: ?float64,
            lat: ?float64
        },
        end: {
            lon: ?float64,
            lat: ?float64
        },
        path: var * {
            londiff: float32,
            latdiff: float32
        }
    }
}

Or it could be a single trip.

single_trip = single_taxi.trip[5]
single_trip
<Record {km: 18.6, begin: {...}, end: ..., ...} type='{km: ?float32, begin:...'>

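Or it could be a single (longitude, latitude) position along the path. Each path entry is stored as an offset (londiff, latdiff) from the trip’s starting point, so the absolute position is the starting point plus the offset.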

single_trip.path
[{londiff: 0.00768, latdiff: -0.001},
 {londiff: 0.00765, latdiff: -0.00116},
 {londiff: 0.00774, latdiff: -0.00154},
 {londiff: 0.00757, latdiff: -0.0022},
 {londiff: 0.0128, latdiff: -0.0028},
 {londiff: 0.0146, latdiff: 0.000311},
 {londiff: 0.0184, latdiff: 0.000353},
 {londiff: 0.0281, latdiff: -0.00218},
 {londiff: 0.0378, latdiff: 0.00184},
 {londiff: 0.0464, latdiff: 0.00296},
 ...,
 {londiff: 0.143, latdiff: -0.00576},
 {londiff: 0.152, latdiff: -0.00935},
 {londiff: 0.16, latdiff: -0.0138},
 {londiff: 0.168, latdiff: -0.0179},
 {londiff: 0.176, latdiff: -0.0231},
 {londiff: 0.181, latdiff: -0.026},
 {londiff: 0.182, latdiff: -0.0266},
 {londiff: 0.19, latdiff: -0.0266},
 {londiff: 0.19, latdiff: -0.0266}]
---------------------------------------------------------
backend: cpu
nbytes: 208 B
type: 26 * {
    londiff: float32,
    latdiff: float32
}
single_point = single_trip.path[5]
single_point
<Record {londiff: 0.0146, latdiff: ..., ...} type='{londiff: float32, latdi...'>
print(f"longitude: {single_trip.begin.lon + single_point.londiff:.3f}")
print(f"latitude:  {single_trip.begin.lat + single_point.latdiff:.3f}")
longitude: -87.899
latitude:  41.981

In Jupyter notebooks (and this documentation), the array contents are presented in a multi-line format with the data type below a dashed line.

Standard Python repr

In a Python prompt, the format is more concise:

print(f"{single_taxi!r}")
<Array [{trip: {km: 29.6, ...}}, ..., {...}] type='2403 * ?{trip: {km: ?flo...'>
print(f"{single_trip!r}")
<Record {km: 18.6, begin: {...}, end: ..., ...} type='{km: ?float32, begin:...'>
print(f"{single_point!r}")
<Record {londiff: 0.0146, latdiff: ..., ...} type='{londiff: float32, latdi...'>

The long form can be obtained in a Python prompt with the show method:

single_taxi.show()
[{trip: {km: 29.6, begin: {lon: -87.9, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 2, begin: {lon: -87.6, ...}, end: {...}, path: [...]}},
 {trip: {km: 2.46, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 50.8, begin: {lon: -87.9, ...}, end: {...}, path: []}},
 {trip: {km: 27, begin: {lon: -87.8, ...}, end: {...}, path: [...]}},
 {trip: {km: 18.6, begin: {lon: -87.9, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 20.2, begin: {lon: -87.9, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 2.27, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 0.724, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 1.05, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 ...,
 {trip: {km: 2.77, begin: {lon: -87.7, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 1.37, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 1.74, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 0.612, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 14.4, begin: {lon: -87.6, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 24.1, begin: {lon: -87.9, ...}, end: {...}, path: []}},
 {trip: {km: 33.4, begin: {lon: -87.9, ...}, end: {...}, path: []}},
 {trip: {km: 24.3, begin: {lon: -87.9, ...}, end: {...}, path: ..., ...}},
 {trip: {km: 1.83, begin: {lon: -87.7, ...}, end: {...}, path: ..., ...}}]
single_trip.show()
{km: 18.6,
 begin: {lon: -87.9, lat: 42},
 end: {lon: -87.7, lat: 42},
 path: [{londiff: 0.00768, latdiff: -0.001}, {...}, ..., {londiff: 0.19, ...}]}
single_point.show()
{londiff: 0.0146,
 latdiff: 0.000311}

The show method

The show method accepts a type=True argument to include the type as well. In a terminal, the type is printed above the values, because the bottom of a print-out is the “most valuable real estate” there and is reserved for the values; in a Jupyter notebook, the top is the most visible place, so the values come first and the type follows.

single_point.show(type=True)
type: {
    londiff: float32,
    latdiff: float32
}
{londiff: 0.0146,
 latdiff: 0.000311}

Types also have a show method, so if you only want the type, you can do

single_trip.type.show()
{
    km: ?float32,
    begin: {
        lon: ?float64,
        lat: ?float64
    },
    end: {
        lon: ?float64,
        lat: ?float64
    },
    path: var * {
        londiff: float32,
        latdiff: float32
    }
}

If you need the output as a string, or want to send it somewhere other than sys.stdout, use the stream parameter. Passing stream=None returns the text as a string instead of printing it.

single_point.show(stream=None)
'{londiff: 0.0146,\n latdiff: 0.000311}'

Using to_list and Python’s pprint for a detailed view

The repr and show representations print into a restricted space: 1 line of 80 characters for repr, and 20 lines of 80 characters each for show without type=True. To fit, they replace data with ellipses (...) until the output is small enough.

You might want to ensure that you see everything. One way to do that is to turn the data into Python objects with ak.to_list() (also available as the to_list or tolist method) and pretty-print them with Python’s pprint.

import pprint

trip_list = ak.to_list(single_trip)
pprint.pprint(trip_list)
{'begin': {'lat': 41.980264315, 'lon': -87.913624596},
 'end': {'lat': 41.953582125, 'lon': -87.72345239},
 'km': 18.60401725769043,
 'path': [{'latdiff': -0.00100231496617198, 'londiff': 0.007684595882892609},
          {'latdiff': -0.0011563149746507406, 'londiff': 0.007650596089661121},
          {'latdiff': -0.0015393149806186557, 'londiff': 0.00774059584364295},
          {'latdiff': -0.002198314992710948, 'londiff': 0.007572595961391926},
          {'latdiff': -0.0027953151147812605, 'londiff': 0.012805595993995667},
          {'latdiff': 0.0003106849908363074, 'londiff': 0.014558595605194569},
          {'latdiff': 0.0003526849905028939, 'londiff': 0.01837259531021118},
          {'latdiff': -0.002175315050408244, 'londiff': 0.02808559685945511},
          {'latdiff': 0.001839685020968318, 'londiff': 0.03775859624147415},
          {'latdiff': 0.002964684972539544, 'londiff': 0.04636159539222717},
          {'latdiff': 0.003171684918925166, 'londiff': 0.056837595999240875},
          {'latdiff': 0.0036336849443614483, 'londiff': 0.07005459815263748},
          {'latdiff': 0.004173684865236282, 'londiff': 0.0837985947728157},
          {'latdiff': 0.0029606849420815706, 'londiff': 0.09554759413003922},
          {'latdiff': 0.0019616850186139345, 'londiff': 0.1069725975394249},
          {'latdiff': 0.002038684906437993, 'londiff': 0.11698559671640396},
          {'latdiff': 0.0017906849971041083, 'londiff': 0.12975460290908813},
          {'latdiff': -0.005761315114796162, 'londiff': 0.1434255987405777},
          {'latdiff': -0.009354314766824245, 'londiff': 0.15160559117794037},
          {'latdiff': -0.013837315142154694, 'londiff': 0.16011258959770203},
          {'latdiff': -0.01785031519830227, 'londiff': 0.16788259148597717},
          {'latdiff': -0.023128315806388855, 'londiff': 0.17631658911705017},
          {'latdiff': -0.026048315688967705, 'londiff': 0.18113559484481812},
          {'latdiff': -0.026647314429283142, 'londiff': 0.18232059478759766},
          {'latdiff': -0.026622315868735313, 'londiff': 0.18967759609222412},
          {'latdiff': -0.026618314906954765, 'londiff': 0.19017159938812256}]}

Keep in mind that if you don’t slice a small enough section of data, your terminal or Jupyter notebook may be overwhelmed with output!
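
One hedge against runaway output, besides slicing, is pprint’s own depth argument, which collapses anything nested deeper than the given level. The sketch below uses a plain dictionary shaped like the result of ak.to_list(single_trip), with the path cut to two points, so it runs without Awkward Array:

```python
import pprint

# Shaped like ak.to_list(single_trip), with the path cut to two points.
trip_list = {
    "km": 18.60401725769043,
    "begin": {"lon": -87.913624596, "lat": 41.980264315},
    "end": {"lon": -87.72345239, "lat": 41.953582125},
    "path": [
        {"londiff": 0.007684595882892609, "latdiff": -0.00100231496617198},
        {"londiff": 0.007650596089661121, "latdiff": -0.0011563149746507406},
    ],
}

# depth=1 keeps only the top level; nested containers print as {...} or [...].
summary = pprint.pformat(trip_list, depth=1)
print(summary)

# depth=2 reveals begin/end and shows one placeholder per path point.
pprint.pprint(trip_list, depth=2, width=60)
```

Starting shallow and increasing depth step by step gives an overview of the structure before committing to a full dump.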

Viewing data as JSON

Another way you can dump everything is to convert the data to JSON with ak.to_json().

print(ak.to_json(single_trip))
{"km":18.60401725769043,"begin":{"lon":-87.913624596,"lat":41.980264315},"end":{"lon":-87.72345239,"lat":41.953582125},"path":[{"londiff":0.007684595882892609,"latdiff":-0.00100231496617198},{"londiff":0.007650596089661121,"latdiff":-0.0011563149746507406},{"londiff":0.00774059584364295,"latdiff":-0.0015393149806186557},{"londiff":0.007572595961391926,"latdiff":-0.002198314992710948},{"londiff":0.012805595993995667,"latdiff":-0.0027953151147812605},{"londiff":0.014558595605194569,"latdiff":0.0003106849908363074},{"londiff":0.01837259531021118,"latdiff":0.0003526849905028939},{"londiff":0.02808559685945511,"latdiff":-0.002175315050408244},{"londiff":0.03775859624147415,"latdiff":0.001839685020968318},{"londiff":0.04636159539222717,"latdiff":0.002964684972539544},{"londiff":0.056837595999240875,"latdiff":0.003171684918925166},{"londiff":0.07005459815263748,"latdiff":0.0036336849443614483},{"londiff":0.0837985947728157,"latdiff":0.004173684865236282},{"londiff":0.09554759413003922,"latdiff":0.0029606849420815706},{"londiff":0.1069725975394249,"latdiff":0.0019616850186139345},{"londiff":0.11698559671640396,"latdiff":0.002038684906437993},{"londiff":0.12975460290908813,"latdiff":0.0017906849971041083},{"londiff":0.1434255987405777,"latdiff":-0.005761315114796162},{"londiff":0.15160559117794037,"latdiff":-0.009354314766824245},{"londiff":0.16011258959770203,"latdiff":-0.013837315142154694},{"londiff":0.16788259148597717,"latdiff":-0.01785031519830227},{"londiff":0.17631658911705017,"latdiff":-0.023128315806388855},{"londiff":0.18113559484481812,"latdiff":-0.026048315688967705},{"londiff":0.18232059478759766,"latdiff":-0.026647314429283142},{"londiff":0.18967759609222412,"latdiff":-0.026622315868735313},{"londiff":0.19017159938812256,"latdiff":-0.026618314906954765}]}

That’s not very readable, so we’ll pass num_indent_spaces=4 to add newlines and indentation, and num_readability_spaces=1 to add spaces after commas (,) and colons (:).

print(ak.to_json(single_trip, num_indent_spaces=4, num_readability_spaces=1))
{
    "km": 18.60401725769043, 
    "begin": {
        "lon": -87.913624596, 
        "lat": 41.980264315
    }, 
    "end": {
        "lon": -87.72345239, 
        "lat": 41.953582125
    }, 
    "path": [
        {
            "londiff": 0.007684595882892609, 
            "latdiff": -0.00100231496617198
        }, 
        {
            "londiff": 0.007650596089661121, 
            "latdiff": -0.0011563149746507406
        }, 
        {
            "londiff": 0.00774059584364295, 
            "latdiff": -0.0015393149806186557
        }, 
        {
            "londiff": 0.007572595961391926, 
            "latdiff": -0.002198314992710948
        }, 
        {
            "londiff": 0.012805595993995667, 
            "latdiff": -0.0027953151147812605
        }, 
        {
            "londiff": 0.014558595605194569, 
            "latdiff": 0.0003106849908363074
        }, 
        {
            "londiff": 0.01837259531021118, 
            "latdiff": 0.0003526849905028939
        }, 
        {
            "londiff": 0.02808559685945511, 
            "latdiff": -0.002175315050408244
        }, 
        {
            "londiff": 0.03775859624147415, 
            "latdiff": 0.001839685020968318
        }, 
        {
            "londiff": 0.04636159539222717, 
            "latdiff": 0.002964684972539544
        }, 
        {
            "londiff": 0.056837595999240875, 
            "latdiff": 0.003171684918925166
        }, 
        {
            "londiff": 0.07005459815263748, 
            "latdiff": 0.0036336849443614483
        }, 
        {
            "londiff": 0.0837985947728157, 
            "latdiff": 0.004173684865236282
        }, 
        {
            "londiff": 0.09554759413003922, 
            "latdiff": 0.0029606849420815706
        }, 
        {
            "londiff": 0.1069725975394249, 
            "latdiff": 0.0019616850186139345
        }, 
        {
            "londiff": 0.11698559671640396, 
            "latdiff": 0.002038684906437993
        }, 
        {
            "londiff": 0.12975460290908813, 
            "latdiff": 0.0017906849971041083
        }, 
        {
            "londiff": 0.1434255987405777, 
            "latdiff": -0.005761315114796162
        }, 
        {
            "londiff": 0.15160559117794037, 
            "latdiff": -0.009354314766824245
        }, 
        {
            "londiff": 0.16011258959770203, 
            "latdiff": -0.013837315142154694
        }, 
        {
            "londiff": 0.16788259148597717, 
            "latdiff": -0.01785031519830227
        }, 
        {
            "londiff": 0.17631658911705017, 
            "latdiff": -0.023128315806388855
        }, 
        {
            "londiff": 0.18113559484481812, 
            "latdiff": -0.026048315688967705
        }, 
        {
            "londiff": 0.18232059478759766, 
            "latdiff": -0.026647314429283142
        }, 
        {
            "londiff": 0.18967759609222412, 
            "latdiff": -0.026622315868735313
        }, 
        {
            "londiff": 0.19017159938812256, 
            "latdiff": -0.026618314906954765
        }
    ]
}

ak.to_json() is also one of the bulk output methods, so it can write data to a file, either as a single JSON document or, with line_delimited=True, as line-delimited JSON.