{
"cells": [
{
"cell_type": "markdown",
"id": "b9607e38-b639-4670-91f3-7f9a3c767402",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"# Thinking in arrays"
]
},
{
"cell_type": "markdown",
"id": "df28b3b5-449d-4ff5-8352-66f3d9f5ec1c",
"metadata": {},
"source": [
"_Originally presented as [part](https://github.com/jpivarski-talks/2023-12-18-hsf-india-tutorial-bhubaneswar/blob/main/lesson-3-awkward/lecture-slides.ipynb) of [HSF-India training on December 18, 2023](https://indico.cern.ch/event/1328624/)._"
]
},
{
"cell_type": "markdown",
"id": "81f1b565-ecc3-44be-a885-01010182f7ad",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "bf8f76ed-e777-4e58-83c9-a37698afdc81",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"So far, all the arrays we've dealt with have been rectangular (in $n$ dimensions; \"rectilinear\").\n",
"\n",
"![](8-layer_cube.jpg)"
]
},
{
"cell_type": "markdown",
"id": "f109c5fc-f23a-434a-b5ed-27b5ce7a1d17",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"What if we had data like this?\n",
"\n",
"```json\n",
"[\n",
" [[1.84, 0.324]],\n",
" [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],\n",
" [[0.459, -1.517, 1.545], [0.33, 0.292]],\n",
" [[-0.376, -1.46, -0.206], [0.65, 1.278]],\n",
" [[], [], [1.617]],\n",
" []\n",
"]\n",
"[\n",
" [[-0.106, 0.611]],\n",
" [[0.118, -1.788, 0.794, 0.658], [-0.105]]\n",
"]\n",
"[\n",
" [[-0.384], [0.697, -0.856]],\n",
" [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]\n",
"]\n",
"[\n",
" [[0.205, -0.355], [-0.265], [1.042]],\n",
" [[-0.004], [-1.167, -0.054, 0.726, 0.213]],\n",
" [[1.741, -0.199, 0.827]]\n",
"]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "8ee85cb4-c8fe-457b-99b4-b07ccf60e6bb",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"What if we had data like this?\n",
"\n",
"```json\n",
"[\n",
" {\"fill\": \"#b1b1b1\", \"stroke\": \"none\", \"points\": [{\"x\": 5.27453, \"y\": 1.03276},\n",
" {\"x\": -3.51280, \"y\": 1.74849}]},\n",
" {\"fill\": \"#b1b1b1\", \"stroke\": \"none\", \"points\": [{\"x\": 8.21630, \"y\": 4.07844},\n",
" {\"x\": -0.79157, \"y\": 3.49478}, {\"x\": 16.38932, \"y\": 5.29399},\n",
" {\"x\": 10.38641, \"y\": 0.10832}, {\"x\": -2.07070, \"y\": 14.07140},\n",
" {\"x\": 9.57021, \"y\": -0.94823}, {\"x\": 1.97332, \"y\": 3.62380},\n",
" {\"x\": 5.66760, \"y\": 11.38001}, {\"x\": 0.25497, \"y\": 3.39276},\n",
" {\"x\": 3.86585, \"y\": 6.22051}, {\"x\": -0.67393, \"y\": 2.20572}]},\n",
" {\"fill\": \"#d0d0ff\", \"stroke\": \"none\", \"points\": [{\"x\": 3.59528, \"y\": 7.37191},\n",
" {\"x\": 0.59192, \"y\": 2.91503}, {\"x\": 4.02932, \"y\": -1.13601},\n",
" {\"x\": -1.01593, \"y\": 1.95894}, {\"x\": 1.03666, \"y\": 0.05251}]},\n",
" {\"fill\": \"#d0d0ff\", \"stroke\": \"none\", \"points\": [{\"x\": -8.78510, \"y\": -0.00497},\n",
" {\"x\": -15.22688, \"y\": 3.90244}, {\"x\": 5.74593, \"y\": 4.12718}]},\n",
" {\"fill\": \"none\", \"stroke\": \"#000000\", \"points\": [{\"x\": 4.40625, \"y\": -6.953125},\n",
" {\"x\": 4.34375, \"y\": -7.09375}, {\"x\": 4.3125, \"y\": -7.140625},\n",
" {\"x\": 4.140625, \"y\": -7.140625}]},\n",
" {\"fill\": \"none\", \"stroke\": \"#808080\", \"points\": [{\"x\": 0.46875, \"y\": -0.09375},\n",
" {\"x\": 0.46875, \"y\": -0.078125}, {\"x\": 0.46875, \"y\": 0.53125}]}\n",
"]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "a4a85956-6430-497f-8357-9741d602a6df",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"What if we had data like this?\n",
"\n",
"```json\n",
"[\n",
" {\"movie\": \"Evil Dead\", \"year\": 1981, \"actors\":\n",
" [\"Bruce Campbell\", \"Ellen Sandweiss\", \"Richard DeManincor\", \"Betsy Baker\"]\n",
" },\n",
" {\"movie\": \"Darkman\", \"year\": 1990, \"actors\":\n",
" [\"Liam Neeson\", \"Frances McDormand\", \"Larry Drake\", \"Bruce Campbell\"]\n",
" },\n",
" {\"movie\": \"Army of Darkness\", \"year\": 1992, \"actors\":\n",
" [\"Bruce Campbell\", \"Embeth Davidtz\", \"Marcus Gilbert\", \"Bridget Fonda\",\n",
" \"Ted Raimi\", \"Patricia Tallman\"]\n",
" },\n",
" {\"movie\": \"A Simple Plan\", \"year\": 1998, \"actors\":\n",
" [\"Bill Paxton\", \"Billy Bob Thornton\", \"Bridget Fonda\", \"Brent Briscoe\"]\n",
" },\n",
" {\"movie\": \"Spider-Man 2\", \"year\": 2004, \"actors\":\n",
" [\"Tobey Maguire\", \"Kirsten Dunst\", \"Alfred Molina\", \"James Franco\",\n",
" \"Rosemary Harris\", \"J.K. Simmons\", \"Stan Lee\", \"Bruce Campbell\"]\n",
" },\n",
" {\"movie\": \"Drag Me to Hell\", \"year\": 2009, \"actors\":\n",
" [\"Alison Lohman\", \"Justin Long\", \"Lorna Raver\", \"Dileep Rao\", \"David Paymer\"]\n",
" }\n",
"]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "704a1603-85f5-4680-a9b3-28137f2e7cea",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"What if we had data like this?\n",
"\n",
"```json\n",
"[\n",
" {\"run\": 1, \"luminosityBlock\": 156, \"event\": 46501,\n",
" \"PV\": {\"x\": 0.243, \"y\": 0.393, \"z\": 1.451},\n",
" \"electron\": [],\n",
" \"muon\": [\n",
" {\"pt\": 63.043, \"eta\": -0.718, \"phi\": 2.968, \"mass\": 0.105, \"charge\": 1},\n",
" {\"pt\": 38.120, \"eta\": -0.879, \"phi\": -1.032, \"mass\": 0.105, \"charge\": -1},\n",
" {\"pt\": 4.048, \"eta\": -0.320, \"phi\": 1.038, \"mass\": 0.105, \"charge\": 1}\n",
" ],\n",
" \"MET\": {\"pt\": 21.929, \"phi\": -2.730}\n",
" },\n",
" {\"run\": 1, \"luminosityBlock\": 156, \"event\": 46502,\n",
" \"PV\": {\"x\": 0.244, \"y\": 0.395, \"z\": -2.879},\n",
" \"electron\": [\n",
" {\"pt\": 21.902, \"eta\": -0.702, \"phi\": 0.133, \"mass\": 0.005, \"charge\": 1},\n",
" {\"pt\": 42.632, \"eta\": -0.979, \"phi\": -1.863, \"mass\": 0.008, \"charge\": 1},\n",
" {\"pt\": 78.012, \"eta\": -0.933, \"phi\": -2.207, \"mass\": 0.018, \"charge\": -1},\n",
" {\"pt\": 23.835, \"eta\": -1.362, \"phi\": -0.621, \"mass\": 0.008, \"charge\": -1}\n",
" ],\n",
" \"muon\": [],\n",
" \"MET\": {\"pt\": 16.972, \"phi\": 2.866}},\n",
" ...\n",
"]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "bcf2ffb7-dba0-477d-933f-ce8d178db643",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"It might be possible to turn these datasets into tabular form using surrogate keys and database normalization, but\n",
"\n",
" * they could be inconvenient or less efficient in that form, depending on what we want to do,\n",
" * they were very likely _given_ in a ragged/untidy form. You can't ignore the data-cleaning step!"
]
},
{
"cell_type": "markdown",
"id": "22e1411d-4ed6-4fe3-8be6-a63e2125162a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "308210c5-eeda-4a3b-a420-a097a25d3248",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
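},
"source": [
"Dealing with these datasets as JSON or Python objects is inefficient for the same reason that Python lists of numbers are inefficient: every value is a separate, dynamically typed object that has to be reached through a pointer."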
]
},
{
"cell_type": "markdown",
"id": "44c2b36f-3cba-4cef-a821-b2647c497070",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "9871edb0-31fe-485a-a8e5-922a6dc1e187",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"We want arbitrary data structures with an array-oriented interface and performance..."
]
},
{
"cell_type": "markdown",
"id": "c579226d-b2b7-4519-8b6a-c160597db348",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"![](awkward-motivation-venn-diagram.svg)"
]
},
{
"cell_type": "markdown",
"id": "a6590eff-e876-4c50-811d-51b870541911",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Libraries for irregular arrays"
]
},
{
"cell_type": "markdown",
"id": "f3df31d7-0185-414f-9c31-7e08a88fdce3",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "c7aa6b97-c9ff-4b3c-8237-4888eb0a5f35",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"![](logo-arrow.svg)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "853cd417-4bd2-413c-83d2-4bfb6b15d1c7",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import pyarrow as pa"
]
},
{
"cell_type": "markdown",
"id": "7c75130d-065b-4ed7-a8d3-df4b83eb6e8f",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2233d1b-d680-4ee0-9d84-351422128a77",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"arrow_array = pa.array([\n",
" [{\"x\": 1.1, \"y\": [1]}, {\"x\": 2.2, \"y\": [1, 2]}, {\"x\": 3.3, \"y\": [1, 2, 3]}],\n",
" [],\n",
" [{\"x\": 4.4, \"y\": [1, 2, 3, 4]}, {\"x\": 5.5, \"y\": [1, 2, 3, 4, 5]}]\n",
"])"
]
},
{
"cell_type": "markdown",
"id": "f4aa8cb3-e350-4fbc-b5a2-76ad1d409c00",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc80bf50-aea5-41df-b5f0-9a35d2bae75f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"arrow_array.type"
]
},
{
"cell_type": "markdown",
"id": "c796e4b8-aa92-44aa-bb2e-8742550dc88a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e77c680-bd02-45c2-a899-e8af53bfcafc",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"arrow_array"
]
},
{
"cell_type": "markdown",
"id": "830211c5-e747-4bdb-9beb-d90a07895f10",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "6e5b9cd8-cce1-427a-a97d-3707f9e2016a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"![](logo-awkward.svg)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77739aec-06d9-4164-aa9b-86075c86874e",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import awkward as ak"
]
},
{
"cell_type": "markdown",
"id": "c6212da5-dbdc-43ac-9859-fba6b7c66827",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87d9dff3-3107-45b0-b636-ae840e4ed638",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"awkward_array = ak.from_arrow(arrow_array)\n",
"awkward_array"
]
},
{
"cell_type": "markdown",
"id": "5ec486a3-44fd-436e-ac1d-eab6ddbe8c08",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "fdd4ce8c-3e17-44c3-9f7b-b18ac5e2449d",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"![](logo-parquet.svg)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "698e3514-659e-4392-a0cd-79ca4741449e",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"ak.to_parquet(awkward_array, \"/tmp/file.parquet\")"
]
},
{
"cell_type": "markdown",
"id": "5e9be1a7-0ad7-4f25-a082-fd6ece2de7e7",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa676558-b980-42ef-bdc8-f7a2d9248684",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ak.from_parquet(\"/tmp/file.parquet\")"
]
},
{
"cell_type": "markdown",
"id": "b49f0d23-016b-440c-8b25-f0fdd859ba5a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Awkward Array"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ed60a0e-315f-4f39-917e-b23f998fffe1",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"ragged = ak.Array([\n",
" [\n",
" [[1.84, 0.324]],\n",
" [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],\n",
" [[0.459, -1.517, 1.545], [0.33, 0.292]],\n",
" [[-0.376, -1.46, -0.206], [0.65, 1.278]],\n",
" [[], [], [1.617]],\n",
" []\n",
" ],\n",
" [\n",
" [[-0.106, 0.611]],\n",
" [[0.118, -1.788, 0.794, 0.658], [-0.105]]\n",
" ],\n",
" [\n",
" [[-0.384], [0.697, -0.856]],\n",
" [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]\n",
" ],\n",
" [\n",
" [[0.205, -0.355], [-0.265], [1.042]],\n",
" [[-0.004], [-1.167, -0.054, 0.726, 0.213]],\n",
" [[1.741, -0.199, 0.827]]\n",
" ]\n",
"])"
]
},
{
"cell_type": "markdown",
"id": "e2bc442c-dedb-4aa6-be20-bc7c5e47bab3",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"**Multidimensional indexing**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a7165ea-fff7-4035-b18d-c05a2c2ca97d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ragged[3, 1, -1, 2]"
]
},
{
"cell_type": "markdown",
"id": "935b1dba-34ef-4e07-b175-be8b754010ef",
"metadata": {},
"source": [
"<br>\n",
"\n",
"**Basic slicing**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ac1c3e3-741d-4855-9636-9e98f6690fa5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ragged[3, 1:, -1, 1:3]"
]
},
{
"cell_type": "markdown",
"id": "4e382ae9-81db-486f-9078-22295781c739",
"metadata": {},
"source": [
"<br>\n",
"\n",
"**Advanced slicing**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6322cb65-eb27-45e6-abc5-9f35f81661cf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ragged[[False, False, True, True], [0, -1, 0, -1], 0, -1]"
]
},
{
"cell_type": "markdown",
"id": "0adeb6c9-65c4-4d35-85f4-91f986b7291d",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"**Awkward slicing**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13105929-1f22-472a-9138-44ec73d52231",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ragged > 0"
]
},
{
"cell_type": "markdown",
"id": "dd809703-6d16-42b6-9d89-63d4e72bc40c",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55ea9417-62f6-4223-be2b-26e1abc4c42a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ragged[ragged > 0]"
]
},
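{
"cell_type": "markdown",
"id": "sketch-ragged-mask",
"metadata": {},
"source": [
"As a rough mental model only, the comparison builds a structure of booleans with the same nesting as the data, and slicing with it keeps the values whose flag is true while preserving every list boundary. The pure-Python sketch below is not how Awkward Array is implemented, and `ragged_greater_than` and `ragged_mask` are hypothetical helper names introduced for illustration.\n",
"\n",
"```python\n",
"def ragged_greater_than(data, threshold):\n",
"    # Recurse through nested lists; compare numbers at the leaves.\n",
"    if isinstance(data, list):\n",
"        return [ragged_greater_than(item, threshold) for item in data]\n",
"    return data > threshold\n",
"\n",
"\n",
"def ragged_mask(data, mask):\n",
"    # Innermost lists: keep the values whose flag is True.\n",
"    if all(isinstance(flag, bool) for flag in mask):\n",
"        return [value for value, flag in zip(data, mask) if flag]\n",
"    # Outer lists: recurse, so that list boundaries are preserved.\n",
"    return [ragged_mask(d, m) for d, m in zip(data, mask)]\n",
"\n",
"\n",
"example = [[[1.84, 0.324]], [[-1.609, -0.713, 0.005], []], []]\n",
"flags = ragged_greater_than(example, 0)\n",
"kept = ragged_mask(example, flags)\n",
"print(flags)\n",
"print(kept)\n",
"```\n",
"\n",
"Lists that lose all of their values become empty lists rather than disappearing, so the result still has one entry per original list."
]
},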
{
"cell_type": "markdown",
"id": "9f9ee115-1343-469a-93ff-2cf50a432b7c",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"**Reductions**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d105c3cf-5b55-44cd-a163-b5d27767d227",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"ak.sum(ragged)"
]
},
{
"cell_type": "markdown",
"id": "9f003781-8876-4d4e-ad14-851f7d40fd9e",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f7d467e-1fc2-4aed-8721-32eb411a1e6a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ak.sum(ragged, axis=-1)"
]
},
{
"cell_type": "markdown",
"id": "acdc2d38-2f6b-4531-893d-fbc189d54c64",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7b7c68f-a1d3-41d0-9ca1-58e607753186",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"ak.sum(ragged, axis=0)"
]
},
{
"cell_type": "markdown",
"id": "8f37d2af-c67e-4853-b672-793eee0763c1",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"How do we even define reductions on an array with variable-length lists?"
]
},
{
"cell_type": "markdown",
"id": "82441d9a-3242-4b21-b486-6ef36e35a69f",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"![](example-reducer-2d.svg)"
]
},
{
"cell_type": "markdown",
"id": "d11a1d81-20f2-45fe-8767-09d95fa9e1c9",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"How do we even define reductions on an array with variable-length lists?"
]
},
{
"cell_type": "markdown",
"id": "f1377dfb-acd8-4a6f-aa3b-47b544c90645",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"![](example-reducer-ragged.svg)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fdff4345-b3a2-453a-9d8b-d47d03aca1e9",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"outputs": [],
"source": [
"array = ak.Array([[ 1, 2, 3, 4],\n",
" [ 10, None, 30 ],\n",
" [ 100, 200 ]])"
]
},
{
"cell_type": "markdown",
"id": "7474589a-4b79-4baa-98de-7d6c457a43d1",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac7395e3-dbe9-479f-b595-a37fcb8cc977",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"ak.sum(array, axis=0).tolist()"
]
},
{
"cell_type": "markdown",
"id": "ad53c2cb-b550-4158-8e44-11303e085050",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36727ca1-d85d-42aa-968e-04a4f7eb98ce",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"ak.sum(array, axis=1).tolist()"
]
},
{
"cell_type": "markdown",
"id": "473f41e3-2c52-48de-8536-33210b5bca13",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "72050082-e01c-4eb2-84b7-220a85fff881",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"(You almost always want the deepest/maximum `axis`, which you can get with `axis=-1`.)"
]
},
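{
"cell_type": "markdown",
"id": "sketch-ragged-sum",
"metadata": {},
"source": [
"One way to picture the two reductions on the small example: `axis=1` sums inside each list, while `axis=0` left-aligns the lists and sums down each column, so shorter lists simply contribute nothing to the later columns. The sketch below is plain Python written for illustration (`sum_axis0` and `sum_axis1` are hypothetical names), and it assumes that missing values (`None`) are skipped.\n",
"\n",
"```python\n",
"from itertools import zip_longest\n",
"\n",
"rows = [[1, 2, 3, 4], [10, None, 30], [100, 200]]\n",
"\n",
"\n",
"def sum_axis1(lists):\n",
"    # Sum within each list, skipping missing values (None).\n",
"    return [sum(x for x in row if x is not None) for row in lists]\n",
"\n",
"\n",
"def sum_axis0(lists):\n",
"    # Left-align the lists, pad short ones with None, and sum each column.\n",
"    columns = zip_longest(*lists, fillvalue=None)\n",
"    return [sum(x for x in column if x is not None) for column in columns]\n",
"\n",
"\n",
"print(sum_axis0(rows))  # [111, 202, 33, 4]\n",
"print(sum_axis1(rows))  # [10, 40, 300]\n",
"```\n",
"\n",
"The `axis=0` result is as long as the longest list, and the `axis=1` result has one value per list."
]
},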
{
"cell_type": "markdown",
"id": "3f335dcf-b128-4b4d-bbdb-dae52c43c03b",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "bd72b617-a540-4ff8-9faa-cfc24befda51",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"### Awkward Arrays in particle physics"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c8b9fb5-d73d-4689-b8c1-46e11bb73f5f",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import uproot\n",
"\n",
"file = uproot.open(\"https://github.com/jpivarski-talks/2023-12-18-hsf-india-tutorial-bhubaneswar/raw/main/data/SMHiggsToZZTo4L.root\")\n",
"file"
]
},
{
"cell_type": "markdown",
"id": "c1a51aef-3bb9-4c25-8af7-388f0732bdef",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b25cb34b-ff02-4726-88b3-b33b51cc3a81",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"tree = file[\"Events\"]\n",
"tree"
]
},
{
"cell_type": "markdown",
"id": "42fec548-3363-4eb1-8297-5a7e130815d1",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9bb09f4f-eb47-4f61-9614-a17fd89afa61",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"tree.arrays(entry_stop=100)"
]
},
{
"cell_type": "markdown",
"id": "4b5fe20f-b93c-4e99-9251-9728a20cb767",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"The same data fits into Parquet files (a little more easily)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "296ae255-f894-4430-932c-a3ec67863897",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"events = ak.from_parquet(\"https://github.com/jpivarski-talks/2023-12-18-hsf-india-tutorial-bhubaneswar/raw/main/data/SMHiggsToZZTo4L.parquet\")\n",
"events"
]
},
{
"cell_type": "markdown",
"id": "85dc100c-200a-455a-9827-e76b92a0b752",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"View the first event as Python lists and dicts (like JSON)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30661cff-00f8-4556-8a6e-b25642f13170",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"events[0].to_list()"
]
},
{
"cell_type": "markdown",
"id": "6663c8f3-f84f-4185-bd79-a9c7f7b4f034",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"Get one numeric field (also known as \"column\")."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68a0c07c-e7f3-4e98-a368-4c8899c833e2",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"events.electron.pt"
]
},
{
"cell_type": "markdown",
"id": "f21e3de6-8631-4cbf-bb39-f91f6e9abfb5",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"Compute something ($p_z = p_T \\sinh\\eta$)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b97c4bcf-58da-4e0a-a47d-c1c4ae1c9da8",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"events.electron.pt * np.sinh(events.electron.eta)"
]
},
{
"cell_type": "markdown",
"id": "f3873822-10c9-4cc9-bbd9-7b7986695900",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"The Vector library works with Awkward Arrays once its Awkward behaviors have been registered:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79f1de55-90cd-4a61-9627-9a4c110855ec",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"import vector\n",
"vector.register_awkward()"
]
},
{
"cell_type": "markdown",
"id": "c032df1d-bc42-40ed-ab64-fa51726e5730",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "5c343480-9a4a-428d-b695-f9748e0e4641",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"Records with `name=\"Momentum4D\"` and fields with coordinate names (`px`, `py`, `pz`, `E` or `pt`, `phi`, `eta`, `m`) automatically get Vector properties and methods."
]
},
{
"cell_type": "markdown",
"id": "a91c9d45-ecce-4b87-91db-0539b233a76f",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f8eb191-b606-4809-bfdf-ed4c690cbc6a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"events.electron.type.show()"
]
},
{
"cell_type": "markdown",
"id": "311f608b-66f4-4b12-9426-f24a16bc8525",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13be5a96-b4f3-4b95-afe9-c2941778aefb",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"# implicitly computes pz = pt * sinh(eta)\n",
"events.electron.pz"
]
},
{
"cell_type": "markdown",
"id": "7a992469-b5e0-402e-bef1-ee46ed234627",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"To make histograms or other plots, we need numbers without structure, so {func}`ak.flatten` the array."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c24df813-578a-4f81-8934-46d657c11024",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"from hist import Hist\n",
"\n",
"Hist.new.Regular(100, 0, 100, name=\" \").Double().fill(\n",
" ak.flatten(events.electron.pt)\n",
").plot();"
]
},
{
"cell_type": "markdown",
"id": "d573f2e8-c0c2-4e23-b5c1-4cc4c4d87541",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"Each event has a different number of electrons and muons ({func}`ak.num` to check)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf7a2d4a-2386-4f6b-bde0-c563c162b5f0",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"ak.num(events.electron), ak.num(events.muon)"
]
},
{
"cell_type": "markdown",
"id": "81d4d4c4-7235-493a-a293-ff4d45c76ba7",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"<br>\n",
"\n",
"So what happens if we try to compute something with the electrons' $p_T$ and the muons' $\\eta$?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6efa2427-2732-45af-928e-0d649b7f92c2",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"raises-exception"
]
},
"outputs": [],
"source": [
"events.electron.pt * np.sinh(events.muon.eta)"
]
},
{
"cell_type": "markdown",
"id": "ad8d0d1a-722d-4eb7-b341-839be6919cf6",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"This is data structure-aware, array-oriented programming."
]
},
{
"cell_type": "markdown",
"id": "c31d978c-f829-4e30-8558-01e14bab641a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"**Application:** Filtering events with an array of booleans."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbd40cdb-3764-4201-8384-714103194ca7",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"events.MET.pt, events.MET.pt > 20"
]
},
{
"cell_type": "markdown",
"id": "3a886516-e4ba-4fa1-9119-7a450a84562a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fa99f53-9b84-44ad-b0c7-39ac39fc15b3",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"len(events), len(events[events.MET.pt > 20])"
]
},
{
"cell_type": "markdown",
"id": "20b44fe8-9311-4c47-b7de-4bcdff4d079f",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"<br>\n",
"\n",
"**Application:** Filtering particles with an array of lists of booleans."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2319e94e-9bd7-490e-b6c8-c5f560dacebf",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"events.electron.pt, events.electron.pt > 30"
]
},
{
"cell_type": "markdown",
"id": "4ff7bfda-1752-4f3e-9085-08f3e9f64c1b",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f7b75c8-248e-43da-97be-c6525c36db62",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"ak.num(events.electron), ak.num(events.electron[events.electron.pt > 30])"
]
},
{
"cell_type": "markdown",
"id": "4d48aa4f-49bb-455c-b664-3cb405064ab0",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"**Quizlet:** Using the reducer {func}`ak.any`, how would we select _events_ in which any electron has $p_T > 30$ GeV/$c$?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85339038-b580-4eb7-9421-fb32d2a402d1",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"events.electron[events.electron.pt > 30]"
]
},
{
"cell_type": "markdown",
"id": "e5bd3b3f-ef4b-48d4-873d-ff831a85a768",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"id": "28a196d8-56d8-4ee7-ae13-9622865a8076",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"Awkward Array has two combinatorial primitives:"
]
},
{
"cell_type": "markdown",
"id": "e143afb0-4a0f-4188-8aa6-6385a1f7a816",
"metadata": {},
"source": [
"{func}`ak.cartesian` takes a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of lists from $N$ different arrays, producing an array of lists of $N$-tuples.\n",
"\n",
"{func}`ak.combinations` takes $N$ [samples without replacement](http://prob140.org/sp18/textbook/notebooks-md/5_04_Sampling_Without_Replacement.html) of lists from a single array, producing an array of lists of $N$-tuples."
]
},
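{
"cell_type": "markdown",
"id": "sketch-cartesian-combinations",
"metadata": {},
"source": [
"Both primitives act separately on each entry of the array. A minimal pure-Python sketch of that per-list behavior, using `itertools` as a stand-in (the names `cartesian_per_list` and `combinations_per_list` are hypothetical, and this is not how Awkward Array computes them):\n",
"\n",
"```python\n",
"from itertools import combinations, product\n",
"\n",
"\n",
"def cartesian_per_list(*arrays):\n",
"    # For each entry, pair every item of one list with every item of the others.\n",
"    return [list(product(*lists)) for lists in zip(*arrays)]\n",
"\n",
"\n",
"def combinations_per_list(array, n):\n",
"    # For each entry, choose n distinct items from the same list (order ignored).\n",
"    return [list(combinations(items, n)) for items in array]\n",
"\n",
"\n",
"pairs = cartesian_per_list([[1, 2, 3], [], [4]], [['a', 'b'], ['c'], ['d', 'e']])\n",
"choices = combinations_per_list([[1.1, 2.2, 3.3, 4.4], [], [5.5, 6.6]], 2)\n",
"print(pairs)\n",
"print(choices)\n",
"```\n",
"\n",
"An empty list on either side of a Cartesian product yields an empty list of pairs, and a list with fewer than $N$ items yields no combinations."
]
},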
{
"cell_type": "code",
"execution_count": null,
"id": "c3a4d8d1-d210-4fe6-9b2e-6fe3c6c064af",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"outputs": [],
"source": [
"numbers = ak.Array([[1, 2, 3], [], [4]])\n",
"letters = ak.Array([[\"a\", \"b\"], [\"c\"], [\"d\", \"e\"]])"
]
},
{
"cell_type": "markdown",
"id": "c27d876b-cd7f-4be5-9a88-5f75f4b5bc80",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a188c11-5af4-4bac-9d3d-273b98e62d44",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"ak.cartesian([numbers, letters])"
]
},
{
"cell_type": "markdown",
"id": "f60ba73e-1b00-4b2f-9367-318b52f251b3",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af07b963-cc1d-4367-ba48-ffe4da4303e5",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"values = ak.Array([[1.1, 2.2, 3.3, 4.4], [], [5.5, 6.6]])"
]
},
{
"cell_type": "markdown",
"id": "18a0a5f3-3f49-4f77-a09c-fd5b5748b1df",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5713cc98-4ba3-4087-b9a3-506838b6b020",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"ak.combinations(values, 2)"
]
},
{
"cell_type": "markdown",
"id": "917c2815-6508-474e-b58b-39b57b37aa7e",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"Often, it's useful to separate the left-hand sides and right-hand sides of these pairs with {func}`ak.unzip`, so they can be used in mathematical expressions."
]
},
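{
"cell_type": "markdown",
"id": "sketch-unzip-pairs",
"metadata": {},
"source": [
"Conceptually, unzipping turns one array of lists of pairs into two arrays of lists with identical list lengths, so that item $i$ of the left array lines up with item $i$ of the right array. A plain-Python sketch of that idea (`unzip_pairs` is a hypothetical helper, not part of any library):\n",
"\n",
"```python\n",
"def unzip_pairs(array_of_pairs):\n",
"    # Split each list of (left, right) tuples into two lists of equal length.\n",
"    lefts = [[left for left, _ in pairs] for pairs in array_of_pairs]\n",
"    rights = [[right for _, right in pairs] for pairs in array_of_pairs]\n",
"    return lefts, rights\n",
"\n",
"\n",
"pairs = [[(1, 'a'), (1, 'b'), (2, 'a')], [], [(4, 'd')]]\n",
"lefts, rights = unzip_pairs(pairs)\n",
"print(lefts)   # [[1, 1, 2], [], [4]]\n",
"print(rights)  # [['a', 'b', 'a'], [], ['d']]\n",
"```\n",
"\n",
"Because the two results have matching shapes, they can be combined element by element in a formula."
]
},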
{
"cell_type": "markdown",
"id": "dc11ada3-92b0-4f7f-9118-a32ba41dc0d2",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7bac853-59b1-4a52-887a-83c5ec845e32",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"electron_muon_pairs = ak.cartesian([events.electron, events.muon])\n",
"electron_muon_pairs.type.show()"
]
},
{
"cell_type": "markdown",
"id": "1a247a4c-01a6-4739-a828-00ec20ba2e7d",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51960f1b-eeef-444c-860d-473ad7083396",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"electron_in_pair, muon_in_pair = ak.unzip(electron_muon_pairs)\n",
"electron_in_pair.type.show()"
]
},
{
"cell_type": "markdown",
"id": "8403df4a-0a79-4220-ab3b-b538e818aa7d",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4d696a8-a5a3-47b5-bdd9-a2a2f21b4403",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"electron_in_pair.pt, muon_in_pair.pt"
]
},
{
"cell_type": "markdown",
"id": "832690e2-7c09-4c18-989e-3fcdc6c90358",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "886f1267-6b6b-4a37-ba8c-2f2680c91e4e",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"outputs": [],
"source": [
"ak.num(electron_in_pair), ak.num(muon_in_pair)"
]
},
{
"cell_type": "markdown",
"id": "eceaad8c-2fb3-4e32-bad7-12b1a6e45034",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"To use Vector's `deltaR` method ($\\Delta R = \\sqrt{\\Delta\\phi^2 + \\Delta\\eta^2}$), we need to have the electrons and muons in separate arrays."
]
},
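{
"cell_type": "markdown",
"id": "sketch-delta-r",
"metadata": {},
"source": [
"For a single pair of particles, the formula can be written out by hand. The sketch below is a standalone illustration (`delta_r` is a hypothetical function, not Vector's implementation); it assumes the usual convention that the azimuthal difference is wrapped into $[-\\pi, \\pi)$ before it is squared.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def delta_r(eta1, phi1, eta2, phi2):\n",
"    # Wrap the azimuthal difference into [-pi, pi) before squaring it.\n",
"    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi\n",
"    deta = eta1 - eta2\n",
"    return math.hypot(dphi, deta)\n",
"\n",
"\n",
"# Two directions on either side of the phi = +/-pi boundary are close together.\n",
"print(delta_r(0.0, 3.0, 0.0, -3.0))\n",
"# With equal phi, only the difference in eta contributes.\n",
"print(delta_r(1.0, 0.5, -1.0, 0.5))\n",
"```\n",
"\n",
"Without the wrapping step, the first pair would appear to be separated by 6 radians instead of about 0.28."
]
},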
{
"cell_type": "code",
"execution_count": null,
"id": "32eecc20-2397-40c8-a6b1-b03b6211906b",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"electron_in_pair, muon_in_pair = ak.unzip(ak.cartesian([events.electron, events.muon]))"
]
},
{
"cell_type": "markdown",
"id": "dd3a02fa-1b23-40cd-923c-9fb92341eb8e",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d193dfa0-2027-420a-a583-7f6fe35826b7",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"electron_in_pair.deltaR(muon_in_pair)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "464996c0-0a00-4c9f-b3d9-dc6baf70d96a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"outputs": [],
"source": [
"first_electron_in_pair, second_electron_in_pair = ak.unzip(ak.combinations(events.electron, 2))"
]
},
{
"cell_type": "markdown",
"id": "c05fe323-5350-44b0-858f-849ca4ac4e76",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "de6d4ddc-8410-443f-8c7c-ba669d30b233",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"first_electron_in_pair.deltaR(second_electron_in_pair)"
]
},
{
"cell_type": "markdown",
"id": "076cb4a3-c407-4d50-a4d4-3cdcc6b977dd",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"**Quizlet:** What's this?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5fcbc46-1bd9-4944-9321-df47e9e2845d",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"(first_electron_in_pair + second_electron_in_pair).mass"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c28f5f07-70eb-4827-99a5-cf5811f6bf68",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"outputs": [],
"source": [
"Hist.new.Reg(120, 0, 120, name=\"mass (GeV)\").Double().fill(\n",
" ak.flatten((first_electron_in_pair + second_electron_in_pair).mass, axis=-1)\n",
").plot();"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}