10 minutes to Awkward Array#

This is a short, tutorial-style introduction to Awkward Array, aimed towards new users. For details of how to perform specific operations in Awkward Array, e.g. filtering data, see the user-guide index, or use the search tool to identify relevant pages.

The City of Chicago has a Data Portal with lots of interesting datasets. This guide uses a dataset of Chicago taxi trips taken from 2019 through 2021 (3 years).

The dataset that the Data Portal provides has trip start and stop points as longitude, latitude pairs, as well as start and end times (date-stamps), payment details, and the name of each taxi company. To make the example more interesting, an estimated route of each taxi trip has been computed by Open Source Routing Machine (OSRM) and added to the dataset.

In this guide, we’ll look at how to manipulate a jagged dataset to plot taxi routes in Chicago.

Loading the dataset#

Our dataset is formatted as a 611 MB Apache Parquet file, provided here. Alongside JSON and raw buffers, Awkward can also read Parquet files and Arrow tables.

Given that this file is so large, let’s first look at the metadata with ak.metadata_from_parquet to see what we’re working with:

%config InteractiveShell.ast_node_interactivity = "last_expr_or_assign"
import numpy as np
import awkward as ak

metadata = ak.metadata_from_parquet(
    "https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet"
)
{'form': ListOffsetForm('i32', RecordForm([RecordForm([BitMaskedForm('u8', NumpyForm('float32'), True, True), BitMaskedForm('u8', NumpyForm('float32'), True, True), RecordForm([BitMaskedForm('u8', NumpyForm('float64'), True, True), BitMaskedForm('u8', NumpyForm('float64'), True, True), BitMaskedForm('u8', NumpyForm('datetime64[ms]'), True, True)], ['lon', 'lat', 'time']), RecordForm([BitMaskedForm('u8', NumpyForm('float64'), True, True), BitMaskedForm('u8', NumpyForm('float64'), True, True), BitMaskedForm('u8', NumpyForm('datetime64[ms]'), True, True)], ['lon', 'lat', 'time']), ListOffsetForm('i64', RecordForm([NumpyForm('float32'), NumpyForm('float32')], ['londiff', 'latdiff']))], ['sec', 'km', 'begin', 'end', 'path']), RecordForm([BitMaskedForm('u8', NumpyForm('float32'), True, True), BitMaskedForm('u8', NumpyForm('float32'), True, True), BitMaskedForm('u8', NumpyForm('float32'), True, True), IndexedForm('i32', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'}), parameters={'__array__': 'categorical'})], ['fare', 'tips', 'total', 'type']), IndexedForm('i32', ListOffsetForm('i32', NumpyForm('uint8', parameters={'__array__': 'char'}), parameters={'__array__': 'string'}), parameters={'__array__': 'categorical'})], ['trip', 'payment', 'company'])),
 'fs': <fsspec.implementations.http.HTTPFileSystem at 0x7fef2c096010>,
 'paths': ['https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet'],
 'col_counts': [353,
  351,
  309,
  316,
  300,
  322,
  315,
  305,
  329,
  320,
  309,
  322,
  301,
  334,
  338,
  294,
  303,
  300,
  285,
  318,
  312,
  334,
  347,
  327,
  84],
 'columns': ['.list.item.trip.sec',
  '.list.item.trip.km',
  '.list.item.trip.begin.lon',
  '.list.item.trip.begin.lat',
  '.list.item.trip.begin.time',
  '.list.item.trip.end.lon',
  '.list.item.trip.end.lat',
  '.list.item.trip.end.time',
  '.list.item.trip.path.list.item.londiff',
  '.list.item.trip.path.list.item.latdiff',
  '.list.item.payment.fare',
  '.list.item.payment.tips',
  '.list.item.payment.total',
  '.list.item.payment.type',
  '.list.item.company'],
 'num_rows': 7728,
 'num_row_groups': 25}

Of particular interest here is the num_row_groups value. Parquet files are divided into row groups: contiguous blocks of rows that form the smallest unit that can be read from the file. This file has 25 of them, so we can load a small part of it without downloading all 611 MB.
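
As a quick sanity check, the row-group bookkeeping in the metadata above can be verified by hand. The sketch below copies the values from the output (no file access is needed) and assumes, as the numbers suggest, that each entry of col_counts is the number of rows in one row group.

```python
# Values copied from the metadata output above.
col_counts = [353, 351, 309, 316, 300, 322, 315, 305, 329, 320, 309, 322, 301,
              334, 338, 294, 303, 300, 285, 318, 312, 334, 347, 327, 84]
num_rows = 7728
num_row_groups = 25

# One entry per row group, and the entries add up to the total number of rows.
assert len(col_counts) == num_row_groups
assert sum(col_counts) == num_rows

# Under the assumption above, the first row group alone holds 353 rows.
rows_in_first_group = col_counts[0]
print(rows_in_first_group)
```

Under that reading, a single row group covers only a few percent of the rows in the file.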

We can also look at the type of the data to see the structure of the dataset:

metadata["form"].type.show()
var * {
    trip: {
        sec: ?float32,
        km: ?float32,
        begin: {
            lon: ?float64,
            lat: ?float64,
            time: ?datetime64[ms]
        },
        end: {
            lon: ?float64,
            lat: ?float64,
            time: ?datetime64[ms]
        },
        path: var * {
            londiff: float32,
            latdiff: float32
        }
    },
    payment: {
        fare: ?float32,
        tips: ?float32,
        total: ?float32,
        type: categorical[type=string]
    },
    company: categorical[type=string]
}

There are a lot of different columns here (trip.sec, trip.begin.lon, payment.fare, etc.). For this example, we only want a small subset of them. Additionally, we don’t need to load all of the data, as we are only interested in a representative sample. Let’s use ak.from_parquet with the row_groups argument to read (download) only a single row group, and the columns argument to read only the necessary columns.

taxi = ak.from_parquet(
    "https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet",
    row_groups=[0],
    columns=["trip.km", "trip.begin.l*", "trip.end.l*", "trip.path.*"],
)
[[{trip: {km: 0, begin: {lon: -87.7, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 0, begin: {lon: -87.7, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 0.966, begin: {lon: -87.6, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 1.29, begin: {lon: -87.6, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 0, begin: {lon: -87.7, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 29.6, begin: {lon: -87.9, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 29.1, begin: {lon: -87.9, ...}, end: {...}, ...}}, ..., {...}],
 [],
 [{trip: {km: 2.74, begin: {lon: -87.6, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 0, begin: {lon: -87.7, ...}, end: {...}, ...}}, ..., {...}],
 ...,
 [{trip: {km: 0.966, begin: {lon: -87.6, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 0, begin: {lon: None, ...}, end: {...}, ...}}, {...}, ..., {...}],
 [{trip: {km: 0, begin: {lon: None, ...}, end: {...}, ...}}, {...}, ..., {...}],
 [{trip: {km: 0, begin: {lon: None, ...}, end: {...}, ...}}, {...}, ..., {...}],
 [{trip: {km: 0, begin: {lon: None, ...}, end: {...}, ...}}, {...}, ..., {...}],
 [{trip: {km: 0.483, begin: {lon: -87.9, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 0, begin: {lon: None, ...}, end: {...}, ...}}, {...}, ..., {...}],
 [{trip: {km: 1.38, begin: {lon: -87.6, ...}, end: {...}, ...}}, ..., {...}],
 [{trip: {km: 0, begin: {lon: -87.7, ...}, end: {...}, ...}}, ..., {...}]]
--------------------------------------------------------------------------------
type: 353 * var * ?{
    trip: {
        km: ?float32,
        begin: {
            lon: ?float64,
            lat: ?float64
        },
        end: {
            lon: ?float64,
            lat: ?float64
        },
        path: var * {
            londiff: float32,
            latdiff: float32
        }
    }
}
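
To see why the four column patterns select exactly the fields shown in the type above, the wildcard matching can be imitated with Python's standard fnmatch module. This is only an illustration of the glob semantics: it is an assumption that Awkward's own column matching behaves the same way for these patterns, and the dotted names below are written out by hand from the dataset's type.

```python
from fnmatch import fnmatchcase

# Dotted field names, transcribed from the type of the full dataset.
all_columns = [
    "trip.sec", "trip.km",
    "trip.begin.lon", "trip.begin.lat", "trip.begin.time",
    "trip.end.lon", "trip.end.lat", "trip.end.time",
    "trip.path.londiff", "trip.path.latdiff",
    "payment.fare", "payment.tips", "payment.total", "payment.type",
    "company",
]
patterns = ["trip.km", "trip.begin.l*", "trip.end.l*", "trip.path.*"]

# Keep every column that matches at least one pattern.
selected = [c for c in all_columns if any(fnmatchcase(c, p) for p in patterns)]
print(selected)
```

The "l*" suffix keeps lon and lat but skips time, which is why the begin and end records in the loaded array have only two fields each.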

We can look at the type of the array to see its structure:

taxi.type.show()
353 * var * ?{
    trip: {
        km: ?float32,
        begin: {
            lon: ?float64,
            lat: ?float64
        },
        end: {
            lon: ?float64,
            lat: ?float64
        },
        path: var * {
            londiff: float32,
            latdiff: float32
        }
    }
}

According to the above, this is an array of 353 elements, and each element is a variable-length list (var) of records. Each list represents one taxi, and each record in a list is one taxi trip. A ? in the type marks values that may be missing (None).

The trip field contains a record with:

  • km: distance traveled in kilometers

  • begin: record containing longitude and latitude of the trip start point

  • end: record containing longitude and latitude of the trip end point

  • path: list of records containing the relative longitude and latitude of the trip waypoints

Reconstructing the routes#

In order to plot the taxi routes, we can use the waypoints given by the path field. However, these waypoints are relative to the trip start point; we need to add the starting position to these relative positions in order to plot them on a map.

The fields of a record can be accessed with attribute notation:

taxi.trip
[[{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0.966, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 1.29, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 29.6, begin: {lon: -87.9, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 29.1, begin: {lon: -87.9, ...}, end: {...}, path: [...]}, ..., {...}],
 [],
 [{km: 2.74, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}],
 ...,
 [{km: 0.966, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 0, ...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 10.1, ...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 0, ...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 0, ...}],
 [{km: 0.483, begin: {lon: -87.9, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 0, ...}],
 [{km: 1.38, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}]]
--------------------------------------------------------------------------------
type: 353 * var * ?{
    km: ?float32,
    begin: {
        lon: ?float64,
        lat: ?float64
    },
    end: {
        lon: ?float64,
        lat: ?float64
    },
    path: var * {
        londiff: float32,
        latdiff: float32
    }
}

or using subscript notation:

taxi["trip"]
[[{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0.966, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 1.29, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 29.6, begin: {lon: -87.9, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 29.1, begin: {lon: -87.9, ...}, end: {...}, path: [...]}, ..., {...}],
 [],
 [{km: 2.74, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}],
 ...,
 [{km: 0.966, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 0, ...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 10.1, ...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 0, ...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 0, ...}],
 [{km: 0.483, begin: {lon: -87.9, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: None, ...}, end: {...}, path: []}, ..., {km: 0, ...}],
 [{km: 1.38, begin: {lon: -87.6, ...}, end: {...}, path: [...]}, ..., {...}],
 [{km: 0, begin: {lon: -87.7, ...}, end: {...}, path: [...]}, ..., {...}]]
--------------------------------------------------------------------------------
type: 353 * var * ?{
    km: ?float32,
    begin: {
        lon: ?float64,
        lat: ?float64
    },
    end: {
        lon: ?float64,
        lat: ?float64
    },
    path: var * {
        londiff: float32,
        latdiff: float32
    }
}

Field lookup can be nested, e.g. with attribute notation:

taxi.trip.path
[[[{londiff: -2.41e-05, latdiff: -3.03e-07}, {londiff: ..., ...}], ..., [...]],
 [[{londiff: -2.41e-05, latdiff: -3.03e-07}, {londiff: ..., ...}], ..., []],
 [[{londiff: -5e-08, latdiff: -2.41e-05}, ..., {londiff: 0.0122, ...}], ...],
 [[{londiff: -2.13e-06, latdiff: 0.000103}, ..., {londiff: ..., ...}], ...],
 [[{londiff: -3.74e-05, latdiff: -1.96e-07}, {londiff: ..., ...}], ..., [...]],
 [[{londiff: 2.97e-05, latdiff: 3.32e-05}, ..., {londiff: 0.282, ...}], ...],
 [[{londiff: 2.97e-05, latdiff: 3.32e-05}, ..., {londiff: 0.277, ...}], ...],
 [],
 [[{londiff: -4e-05, latdiff: 0.000186}, ..., {londiff: -0.0433, ...}], ...],
 [[{londiff: -2.41e-05, latdiff: -3.03e-07}, {londiff: ..., ...}], ..., [...]],
 ...,
 [[{londiff: -1.47e-06, latdiff: 1.84e-05}, ..., {londiff: 0.0065, ...}], ...],
 [[], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [{...}, ...], ..., [], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [], [], ..., [], [], [], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []],
 [[{londiff: 2.97e-05, latdiff: 3.32e-05}, {londiff: 2.97e-05, ...}], ...],
 [[], [{londiff: -3.74e-05, latdiff: -1.96e-07}, {...}], ..., [{...}, ...], []],
 [[{londiff: 6.05e-05, latdiff: 9.78e-07}, ..., {londiff: 0.0101, ...}], ...],
 [[{londiff: 0.000149, latdiff: 1.97e-06}, {londiff: 0.000149, ...}], ...]]
--------------------------------------------------------------------------------
type: 353 * var * option[var * {
    londiff: float32,
    latdiff: float32
}]

or with subscript notation:

taxi["trip", "path"]
[[[{londiff: -2.41e-05, latdiff: -3.03e-07}, {londiff: ..., ...}], ..., [...]],
 [[{londiff: -2.41e-05, latdiff: -3.03e-07}, {londiff: ..., ...}], ..., []],
 [[{londiff: -5e-08, latdiff: -2.41e-05}, ..., {londiff: 0.0122, ...}], ...],
 [[{londiff: -2.13e-06, latdiff: 0.000103}, ..., {londiff: ..., ...}], ...],
 [[{londiff: -3.74e-05, latdiff: -1.96e-07}, {londiff: ..., ...}], ..., [...]],
 [[{londiff: 2.97e-05, latdiff: 3.32e-05}, ..., {londiff: 0.282, ...}], ...],
 [[{londiff: 2.97e-05, latdiff: 3.32e-05}, ..., {londiff: 0.277, ...}], ...],
 [],
 [[{londiff: -4e-05, latdiff: 0.000186}, ..., {londiff: -0.0433, ...}], ...],
 [[{londiff: -2.41e-05, latdiff: -3.03e-07}, {londiff: ..., ...}], ..., [...]],
 ...,
 [[{londiff: -1.47e-06, latdiff: 1.84e-05}, ..., {londiff: 0.0065, ...}], ...],
 [[], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [{...}, ...], ..., [], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [], [], ..., [], [], [], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []],
 [[{londiff: 2.97e-05, latdiff: 3.32e-05}, {londiff: 2.97e-05, ...}], ...],
 [[], [{londiff: -3.74e-05, latdiff: -1.96e-07}, {...}], ..., [{...}, ...], []],
 [[{londiff: 6.05e-05, latdiff: 9.78e-07}, ..., {londiff: 0.0101, ...}], ...],
 [[{londiff: 0.000149, latdiff: 1.97e-06}, {londiff: 0.000149, ...}], ...]]
--------------------------------------------------------------------------------
type: 353 * var * option[var * {
    londiff: float32,
    latdiff: float32
}]

Let’s look at two fields of interest, path.latdiff, and begin.lat, and their types:

taxi.trip.path.latdiff
[[[-3.03e-07, -3.03e-07], [-6.01e-07, ..., -0.0702], ..., [9.72e-05, 9.72e-05]],
 [[-3.03e-07, -3.03e-07], [-3.03e-07, -3.03e-07], ..., [-0.001, ...], []],
 [[-2.41e-05, -2.81e-05, 0.000381, ..., 0.00059, 0.003, 0.00297], ..., []],
 [[0.000103, 5.72e-05, -0.00501, -0.00504, -0.00583, -0.00583], ..., [...]],
 [[-1.96e-07, -1.96e-07], ..., [0.000336, 0.000681, ..., 0.0209, 0.0209]],
 [[3.32e-05, -0.000256, 0.00181, 0.00268, ..., -0.0923, -0.0922, -0.0941], ...],
 [[3.32e-05, -0.000256, 0.00181, 0.00268, ..., -0.0867, -0.0866, -0.0866], ...],
 [],
 [[0.000186, 0.000173, -0.00301, -0.00311, ..., 0.00177, 0.00178, 0.0016], ...],
 [[-3.03e-07, -3.03e-07], ..., [0.000186, 0.000173, 0.00426, ..., 0.108, 0.11]],
 ...,
 [[1.84e-05, 5.24e-05, -0.00502, -0.00491, -0.00411, -0.00414], ..., [...]],
 [[], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [3.32e-05, ...], ..., [], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [], [], ..., [], [], [], [], [], [], [], [], []],
 [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []],
 [[3.32e-05, 3.32e-05], [-0.001, ..., 0.00457], ..., [0.000102, 0.000102]],
 [[], [-1.96e-07, -1.96e-07], [3.37e-05, ...], ..., [3.32e-05, 3.32e-05], []],
 [[9.78e-07, 0.000411, 0.000433, 0.000435, -0.004, -0.004], [...], ..., [...]],
 [[1.97e-06, 1.97e-06], [1.97e-06, 1.97e-06], ..., [...], [0.000186, 0.000186]]]
--------------------------------------------------------------------------------
type: 353 * var * option[var * float32]
taxi.trip.path.latdiff.type.show()
353 * var * option[var * float32]
taxi.trip.begin.lat
[[41.9, 41.9, 41.9, 41.9, 41.9, 41.9, ..., 41.9, 41.9, 41.9, 41.9, 41.9, 42],
 [41.9, 41.9, 41.9, 41.9, 41.9, 41.9, ..., 41.9, 41.9, 41.9, 41.9, 42, 42],
 [41.9, 41.9, 41.9, 41.9, 42, 41.9, 41.9, ..., 42, 42, 42, 41.9, 42, 42, None],
 [41.9, 41.9, 41.9, 41.9, 41.9, 41.9, ..., 41.9, 41.9, 41.9, 41.9, 41.9, 41.9],
 [41.9, 41.9, 41.9, 41.9, 41.9, 41.9, 42, ..., 42, 42, 42, 42, 42, 41.8, 41.9],
 [42, 41.9, 41.9, 42, 41.8, 42, 42, ..., 41.9, 41.9, 41.9, 42, 42, 42, 41.9],
 [42, 41.9, 41.9, 41.9, 41.9, 41.9, ..., 41.9, 41.9, 42, None, 41.9, 41.9],
 [],
 [41.9, 41.9, 41.9, 41.9, 41.9, 41.8, ..., 41.8, None, 41.8, 41.8, 41.8, 41.8],
 [41.9, 41.9, 41.9, 42, 41.8, 41.8, ..., 41.8, 41.9, 42, 41.8, 41.7, 41.9],
 ...,
 [41.9, 41.9, 41.9, 41.9, 41.9, 41.9, ..., 41.9, 41.9, 41.9, 42, 41.9, 41.9],
 [None, None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [42, 42, 42, 42, 42, 42, 42, 42, ..., 42, 41.9, 41.9, 41.9, 41.9, 41.9, 41.9],
 [None, 41.9, 41.8, 41.8, 41.9, 41.9, 42, ..., 41.9, 41.8, 41.9, 42, 42, None],
 [41.9, 41.9, 41.9, 41.9, 41.9, 41.9, ..., 41.9, 41.9, 41.9, 41.9, 41.9, 42],
 [42, 42, 42, 41.9, 41.9, 41.9, 41.9, ..., 41.9, 41.9, 41.9, 41.9, 41.9, 41.9]]
-------------------------------------------------------------------------------
type: 353 * var * ?float64
taxi.trip.begin.lat.type.show()
353 * var * ?float64

Clearly, these two arrays have different numbers of dimensions: path.latdiff has one more level of nesting than begin.lat. When we add them together, we want the operation to broadcast: each trip has one starting point, but multiple waypoints. In NumPy, broadcasting aligns to the right, which means that arrays with differing dimensions are made compatible against one another by adding length-1 dimensions to the front of the shape:

x = np.array([1, 2, 3])
y = np.array(
    [
        [4, 5, 6],
        [7, 8, 9],
    ]
)
np.broadcast_arrays(x, y)
(array([[1, 2, 3],
        [1, 2, 3]]),
 array([[4, 5, 6],
        [7, 8, 9]]))

In Awkward, broadcasting aligns to the left by default, which means that length-1 dimensions are added to the end of the shape:

x = ak.Array([1, 2])  # note the missing 3!
y = ak.Array(
    [
        [4, 5, 6],
        [7, 8, 9],
    ]
)
ak.broadcast_arrays(x, y)
[<Array [[1, 1, 1], [2, 2, 2]] type='2 * var * int64'>,
 <Array [[4, 5, 6], [7, 8, 9]] type='2 * var * int64'>]
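
For comparison, NumPy can produce the same left-aligned result, but only if the trailing length-1 dimension is added by hand. A minimal sketch using the same values:

```python
import numpy as np

x = np.array([1, 2])
y = np.array([[4, 5, 6], [7, 8, 9]])

# x has shape (2,); adding a trailing axis gives shape (2, 1), which NumPy
# can then stretch across the 3 columns of y.
x_b, y_b = np.broadcast_arrays(x[:, np.newaxis], y)
print(x_b)
```

Without the explicit np.newaxis, NumPy would try to align the length-2 array with the last axis of y (length 3) and raise an error.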

In this instance, we also want broadcasting to align to the left: we want a single starting point to broadcast against multiple waypoints. We can simply add our two arrays together, and Awkward will broadcast them correctly.

taxi_trip_lat = taxi.trip.begin.lat + taxi.trip.path.latdiff
taxi_trip_lon = taxi.trip.begin.lon + taxi.trip.path.londiff
[[[-87.7, -87.7], [-87.7, -87.7, ..., -87.7, -87.7], ..., [-87.7, -87.7]],
 [[-87.7, -87.7], [-87.7, -87.7], ..., [-87.9, -87.9, ..., -87.7, -87.7], []],
 [[-87.6, -87.6, -87.6, -87.6, -87.6, -87.6, -87.6], [-87.6, ...], ..., None],
 [[-87.6, -87.6, -87.6, -87.6, -87.6, -87.6], ..., [-87.6, -87.6, ..., -87.7]],
 [[-87.7, -87.7], ..., [-87.6, -87.6, -87.6, -87.6, -87.6, -87.6, -87.6]],
 [[-87.9, -87.9, -87.9, -87.9, -87.9, ..., -87.6, -87.6, -87.6, -87.6], ...],
 [[-87.9, -87.9, -87.9, -87.9, -87.9, ..., -87.6, -87.6, -87.6, -87.6], ...],
 [],
 [[-87.6, -87.6, -87.6, -87.6, -87.6, ..., -87.7, -87.7, -87.7, -87.7], ...],
 [[-87.7, -87.7], ..., [-87.6, -87.6, -87.6, -87.6, ..., -87.7, -87.7, -87.7]],
 ...,
 [[-87.6, -87.6, -87.6, -87.6, -87.6, -87.6], ..., [-87.6, -87.6, ..., -87.6]],
 [None, None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [[-87.9, -87.9], [-87.9, -87.9, ..., -87.8, -87.8], ..., [-87.6, -87.6]],
 [None, [-87.7, -87.7], [-87.7, -87.7], ..., [...], [-87.9, -87.9], None],
 [[-87.6, -87.6, -87.6, -87.6, -87.6, -87.6], ..., [-87.7, -87.7, ..., -87.7]],
 [[-87.7, -87.7], [-87.7, -87.7], ..., [-87.6, ..., -87.7], [-87.6, -87.6]]]
-------------------------------------------------------------------------------
type: 353 * var * option[var * float64]

Storing the routes#

Having computed taxi_trip_lat and taxi_trip_lon, we might wish to add these as fields to our taxi dataset, so that later manipulations are also applied to these values. Here, we can use the subscript operator [] to add new fields to our dataset.

taxi[("trip", "path", "lat")] = taxi_trip_lat
taxi[("trip", "path", "lon")] = taxi_trip_lon

Note that fields cannot be set using attribute notation.

We can see the result of adding these fields:

taxi.type.show()
353 * var * ?{
    trip: {
        km: ?float32,
        begin: {
            lon: ?float64,
            lat: ?float64
        },
        end: {
            lon: ?float64,
            lat: ?float64
        },
        path: var * {
            londiff: float32,
            latdiff: float32,
            lat: float64,
            lon: float64
        }
    }
}

In Awkward, records are lightweight because they can be composed of existing arrays. We can see this with ak.fields, which returns a list containing the field names of a record:

ak.fields(taxi.trip.path)
['londiff', 'latdiff', 'lat', 'lon']

and ak.unzip, which decomposes an array into the corresponding field values:

ak.unzip(taxi.trip.path)
(<Array [[[-2.41e-05, ...], ..., [...]], ...] type='353 * var * option[var *...'>,
 <Array [[[-3.03e-07, ...], ..., [...]], ...] type='353 * var * option[var *...'>,
 <Array [[[41.9, 41.9], ..., [42, ...]], ...] type='353 * var * option[var *...'>,
 <Array [[[-87.7, -87.7], ..., [...]], ...] type='353 * var * option[var * f...'>)

Finding the longest routes#

Let’s imagine that we want to plot the three longest routes taken by any taxi in the city. The distance of each trip is given by taxi.trip.km:

taxi.trip.km
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, None, None, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0.966, 4.83, 3.38, 1.77, 5.31, 10.5, ..., 2.09, 2.25, 30.6, 0, 27.5, None],
 [1.29, 1.77, 1.77, 8.69, 0, 0, 2.57, ..., 1.61, 0, 0, 1.77, 0, 10.6, 3.38],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 1.13, 1.29, 0, 0.322, 1.13, 1.45, 1.45, 0],
 [29.6, 2, 2.46, 50.8, 27, 18.6, ..., 0.612, 14.4, 24.1, 33.4, 24.3, 1.83],
 [29.1, 1.29, 2.11, 1.34, 1.66, 0.595, ..., 5.84, 26.4, None, 0.756, 1.56],
 [],
 [2.74, 5.47, 1.61, 4.35, 0.805, 21.1, ..., 20, None, 26.2, 51, 15.6, 22.5],
 [0, 0, 0, 0, 0, 0, 0, 0, ..., 1.13, 0.966, 0.483, 0.805, 1.13, 0.966, 0.805],
 ...,
 [0.966, 1.45, 6.76, 7.24, 3.22, 1.61, ..., 2.74, 2.09, 28.3, 2.25, 5.15, 2.41],
 [None, None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [None, None, None, None, None, None, ..., None, None, None, None, None, None],
 [0.483, 11.3, 0.161, 13.4, 0.161, 0.161, ..., 2.03, 1.56, 2.4, 2.72, 4.76],
 [None, 0, 0, 0, 0, 1.61, 0, None, 0, ..., 4.02, 3.06, 0, 0, 0, 19.5, 0, None],
 [1.38, 28.1, 1.21, 2.32, 1.13, 0.821, ..., 0.869, 2.32, 1.87, 2.91, 4.78],
 [0, 0, 15.1, 28.2, 2.74, 1.45, 1.61, ..., 7.24, 12.7, 2.25, 1.93, 5.47, 0.966]]
--------------------------------------------------------------------------------
type: 353 * var * ?float32

This array has two dimensions: 353, indicating the number of taxis, and var, indicating the variable number of trips for each taxi. Because we want to find the longest trips among all taxis, we can flatten one of the dimensions using ak.flatten to produce a single list of trips.

trip = ak.flatten(taxi.trip, axis=1)
[{km: 0, begin: {lon: -87.7, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.7, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.6, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.6, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.6, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.7, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.6, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.6, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.6, lat: 41.9}, end: {...}, path: [...]},
 {km: 0, begin: {lon: -87.6, lat: 41.9}, end: {...}, path: [...]},
 ...,
 {km: 6.12, begin: {lon: -87.6, lat: ..., ...}, end: {...}, path: [...]},
 {km: 8.37, begin: {lon: -87.6, lat: ..., ...}, end: {...}, path: [...]},
 {km: 2.9, begin: {lon: -87.6, lat: 41.9}, end: {...}, path: [...]},
 {km: 7.24, begin: {lon: -87.6, lat: ..., ...}, end: {...}, path: [...]},
 {km: 12.7, begin: {lon: -87.6, lat: ..., ...}, end: {...}, path: [...]},
 {km: 2.25, begin: {lon: -87.6, lat: ..., ...}, end: {...}, path: [...]},
 {km: 1.93, begin: {lon: -87.6, lat: ..., ...}, end: {...}, path: [...]},
 {km: 5.47, begin: {lon: -87.6, lat: ..., ...}, end: {...}, path: [...]},
 {km: 0.966, begin: {lon: -87.6, lat: ..., ...}, end: {...}, path: [...]}]
--------------------------------------------------------------------------
type: 1003517 * ?{
    km: ?float32,
    begin: {
        lon: ?float64,
        lat: ?float64
    },
    end: {
        lon: ?float64,
        lat: ?float64
    },
    path: var * {
        londiff: float32,
        latdiff: float32,
        lat: float64,
        lon: float64
    }
}

ak.argsort returns the indices that would sort this list of 1003517 trips by distance; with ascending=False, the longest trips come first.

ix_length = ak.argsort(trip.km, ascending=False)
[754734,
 939745,
 384624,
 554705,
 864957,
 550830,
 432129,
 677351,
 553068,
 863489,
 ...,
 1003425,
 1003442,
 1003465,
 1003481,
 1003487,
 1003488,
 1003489,
 1003492,
 1003493]
---------------------
type: 1003517 * int64

Now let’s take only the three longest trips.

trip_longest = trip[ix_length[:3]]
[{km: 1.57e+03, begin: {lon: -87.8, ...}, end: {...}, path: [...]},
 {km: 1.52e+03, begin: {lon: -87.8, ...}, end: {...}, path: [...]},
 {km: 1.26e+03, begin: {lon: -87.9, ...}, end: {...}, path: [...]}]
-------------------------------------------------------------------
type: 3 * ?{
    km: ?float32,
    begin: {
        lon: ?float64,
        lat: ?float64
    },
    end: {
        lon: ?float64,
        lat: ?float64
    },
    path: var * {
        londiff: float32,
        latdiff: float32,
        lat: float64,
        lon: float64
    }
}

Plotting the longest routes#

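ipyleaflet requires a list of coordinate tuples in order to plot a path. Let’s stack these two arrays together so that each route becomes a list of (latitude, longitude) pairs, i.e. an array of shape (n, 2), where n is the number of waypoints in that route.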

lat_lon_taxi_75 = ak.concatenate(
    (trip_longest.path.lat[..., np.newaxis], trip_longest.path.lon[..., np.newaxis]),
    axis=-1,
)
[[[41.8, -87.8], [41.8, -87.8], [41.8, ...], ..., [41.9, -87.6], [41.9, -87.6]],
 [[41.8, -87.8], [41.8, -87.8], [41.8, ...], ..., [41.9, -87.7], [41.9, -87.7]],
 [[42, -87.9], [42, -87.9], [42, ...], ..., [41.9, -87.6], [41.9, -87.6]]]
--------------------------------------------------------------------------------
type: 3 * option[var * 2 * float64]

We can convert this to a list with to_list():

lat_lon_taxi_75.to_list()
[[[41.791755000016295, -87.75306800000835],
  [41.79289000004401, -87.75248199997004],
  [41.793049999865346, -87.74275300016161],
  [41.79308499989902, -87.74206800030079],
  [41.81677700055634, -87.74330599994865],
  [41.81921600026881, -87.73607600030955],
  [41.82489299995696, -87.71802799885627],
  [41.82925399986898, -87.70418600102421],
  [41.832074998299234, -87.6951960000908],
  [41.835063999871366, -87.68566399922129],
  [41.83892299828922, -87.66454599907156],
  [41.845413001606815, -87.64945799966809],
  [41.847186999748104, -87.63850099702832],
  [41.84722199885046, -87.6363319989061],
  [41.84696199966466, -87.62759100010153],
  [41.84792600048577, -87.62020599951502],
  [41.849916996786945, -87.61222300042864],
  [41.85385500116145, -87.61418999840971],
  [41.863579998145454, -87.61869900038],
  [41.86497699690854, -87.61940300574061],
  [41.862404996881835, -87.6187029938912],
  [41.859335998425834, -87.61741200198885]],
 [[41.79259599999993, -87.76961500000002],
  [41.79274700000171, -87.76230300011375],
  [41.79571900010562, -87.76239400010817],
  [41.796845000064806, -87.75508699964993],
  [41.79795999994023, -87.74779599931352],
  [41.79906799999519, -87.7404859999966],
  [41.800207999828146, -87.7331679987561],
  [41.80132699973403, -87.72584199931734],
  [41.805292000356005, -87.71861000012868],
  [41.80988599948107, -87.71114999827259],
  [41.814656000877605, -87.70336799722426],
  [41.82004899933456, -87.69456500213855],
  [41.824459998230324, -87.68734100114816],
  [41.82593700132786, -87.68489899885171],
  [41.8648240029019, -87.68591899823659],
  [41.87145499728142, -87.67796899777167],
  [41.87273699633299, -87.67646199863665],
  [41.90137299738346, -87.67721099805348],
  [41.90138099930702, -87.67659199636691],
  [41.90120299748598, -87.67658699702733]],
 [[41.97910400000017, -87.90301000000035],
  [41.978815000007835, -87.90244999999085],
  [41.98088300003854, -87.90076199988506],
  [41.981751999903445, -87.89977900008343],
  [41.981772000055734, -87.89789700018845],
  [41.980616999992144, -87.89525199983872],
  [41.97808899995123, -87.88553899968646],
  [41.98210400002261, -87.87586600030444],
  [41.98322900020701, -87.86726300115369],
  [41.9834360001534, -87.85678700054667],
  [41.983897999946, -87.84356999839328],
  [41.98443799986688, -87.8298260017731],
  [41.98322500017655, -87.81807700241588],
  [41.982226000020255, -87.80665199900649],
  [41.98230299990808, -87.79663899982951],
  [41.98205500011516, -87.78387000108741],
  [41.974502999886845, -87.77019899780534],
  [41.970909999769155, -87.76201900536797],
  [41.96642700032515, -87.75351200694821],
  [41.962414000269, -87.74574200505994],
  [41.95713599966091, -87.73730800742887],
  [41.9519589991234, -87.72796399588368],
  [41.94322799899799, -87.71683600241921],
  [41.939785998400275, -87.70925099844716],
  [41.934360001321856, -87.69995799357675],
  [41.92964499824864, -87.6915850012996],
  [41.92317600184304, -87.67992999846719],
  [41.91535499700529, -87.66810999507688],
  [41.91309399911386, -87.66672099823735],
  [41.910667002286736, -87.66567200119279],
  [41.910835996355836, -87.65409299428246],
  [41.91096600153667, -87.64585199649117],
  [41.91103900232536, -87.63857999141],
  [41.911170996811215, -87.63322299415849],
  [41.91222999998552, -87.62764799291871],
  [41.91230499752981, -87.6268979876735],
  [41.9103690014146, -87.62596898968003],
  [41.89875599750501, -87.61816700989984],
  [41.89370399722082, -87.61473700101159],
  [41.892390996303384, -87.61441999370835],
  [41.891889996912305, -87.61466398532174],
  [41.891802996482674, -87.62024700338624],
  [41.89099799850208, -87.6202070086696],
  [41.89102400102836, -87.61887201363824]]]

What do our routes look like?

import ipyleaflet as ipl

map_taxi_75 = ipl.Map(
    basemap=ipl.basemap_to_tiles(ipl.basemaps.CartoDB.Voyager, "2022-04-08"),
    center=(41.8921, -87.6623),
    zoom=11,
)
for route in lat_lon_taxi_75:
    path = ipl.AntPath(locations=route.to_list(), delay=1000)
    map_taxi_75.add_layer(path)
map_taxi_75