{ "cells": [ { "cell_type": "markdown", "id": "b9607e38-b639-4670-91f3-7f9a3c767402", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Thinking in arrays" ] }, { "cell_type": "markdown", "id": "df28b3b5-449d-4ff5-8352-66f3d9f5ec1c", "metadata": {}, "source": [ "_Originally presented as [part](https://github.com/jpivarski-talks/2023-12-18-hsf-india-tutorial-bhubaneswar/blob/main/lesson-3-awkward/lecture-slides.ipynb) of [HSF-India training on December 18, 2023](https://indico.cern.ch/event/1328624/)._" ] }, { "cell_type": "markdown", "id": "81f1b565-ecc3-44be-a885-01010182f7ad", "metadata": {}, "source": [ "


" ] }, { "cell_type": "markdown", "id": "bf8f76ed-e777-4e58-83c9-a37698afdc81", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "So far, all the arrays we've dealt with have been rectangular (in $n$ dimensions; \"rectilinear\").\n", "\n", "![](8-layer_cube.jpg\")" ] }, { "cell_type": "markdown", "id": "f109c5fc-f23a-434a-b5ed-27b5ce7a1d17", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "What if we had data like this?\n", "\n", "```json\n", "[\n", " [[1.84, 0.324]],\n", " [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],\n", " [[0.459, -1.517, 1.545], [0.33, 0.292]],\n", " [[-0.376, -1.46, -0.206], [0.65, 1.278]],\n", " [[], [], [1.617]],\n", " []\n", "]\n", "[\n", " [[-0.106, 0.611]],\n", " [[0.118, -1.788, 0.794, 0.658], [-0.105]]\n", "]\n", "[\n", " [[-0.384], [0.697, -0.856]],\n", " [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]\n", "]\n", "[\n", " [[0.205, -0.355], [-0.265], [1.042]],\n", " [[-0.004], [-1.167, -0.054, 0.726, 0.213]],\n", " [[1.741, -0.199, 0.827]]\n", "]\n", "```" ] }, { "cell_type": "markdown", "id": "8ee85cb4-c8fe-457b-99b4-b07ccf60e6bb", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "What if we had data like this?\n", "\n", "```json\n", "[\n", " {\"fill\": \"#b1b1b1\", \"stroke\": \"none\", \"points\": [{\"x\": 5.27453, \"y\": 1.03276},\n", " {\"x\": -3.51280, \"y\": 1.74849}]},\n", " {\"fill\": \"#b1b1b1\", \"stroke\": \"none\", \"points\": [{\"x\": 8.21630, \"y\": 4.07844},\n", " {\"x\": -0.79157, \"y\": 3.49478}, {\"x\": 16.38932, \"y\": 5.29399},\n", " {\"x\": 10.38641, \"y\": 0.10832}, {\"x\": -2.07070, \"y\": 14.07140},\n", " {\"x\": 9.57021, \"y\": -0.94823}, {\"x\": 1.97332, \"y\": 3.62380},\n", " {\"x\": 5.66760, \"y\": 11.38001}, {\"x\": 0.25497, \"y\": 3.39276},\n", " {\"x\": 3.86585, \"y\": 6.22051}, {\"x\": -0.67393, \"y\": 2.20572}]},\n", " {\"fill\": 
\"#d0d0ff\", \"stroke\": \"none\", \"points\": [{\"x\": 3.59528, \"y\": 7.37191},\n", " {\"x\": 0.59192, \"y\": 2.91503}, {\"x\": 4.02932, \"y\": -1.13601},\n", " {\"x\": -1.01593, \"y\": 1.95894}, {\"x\": 1.03666, \"y\": 0.05251}]},\n", " {\"fill\": \"#d0d0ff\", \"stroke\": \"none\", \"points\": [{\"x\": -8.78510, \"y\": -0.00497},\n", " {\"x\": -15.22688, \"y\": 3.90244}, {\"x\": 5.74593, \"y\": 4.12718}]},\n", " {\"fill\": \"none\", \"stroke\": \"#000000\", \"points\": [{\"x\": 4.40625, \"y\": -6.953125},\n", " {\"x\": 4.34375, \"y\": -7.09375}, {\"x\": 4.3125, \"y\": -7.140625},\n", " {\"x\": 4.140625, \"y\": -7.140625}]},\n", " {\"fill\": \"none\", \"stroke\": \"#808080\", \"points\": [{\"x\": 0.46875, \"y\": -0.09375},\n", " {\"x\": 0.46875, \"y\": -0.078125}, {\"x\": 0.46875, \"y\": 0.53125}]}\n", "]\n", "```" ] }, { "cell_type": "markdown", "id": "a4a85956-6430-497f-8357-9741d602a6df", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "What if we had data like this?\n", "\n", "```json\n", "[\n", " {\"movie\": \"Evil Dead\", \"year\": 1981, \"actors\":\n", " [\"Bruce Campbell\", \"Ellen Sandweiss\", \"Richard DeManincor\", \"Betsy Baker\"]\n", " },\n", " {\"movie\": \"Darkman\", \"year\": 1900, \"actors\":\n", " [\"Liam Neeson\", \"Frances McDormand\", \"Larry Drake\", \"Bruce Campbell\"]\n", " },\n", " {\"movie\": \"Army of Darkness\", \"year\": 1992, \"actors\":\n", " [\"Bruce Campbell\", \"Embeth Davidtz\", \"Marcus Gilbert\", \"Bridget Fonda\",\n", " \"Ted Raimi\", \"Patricia Tallman\"]\n", " },\n", " {\"movie\": \"A Simple Plan\", \"year\": 1998, \"actors\":\n", " [\"Bill Paxton\", \"Billy Bob Thornton\", \"Bridget Fonda\", \"Brent Briscoe\"]\n", " },\n", " {\"movie\": \"Spider-Man 2\", \"year\": 2004, \"actors\":\n", " [\"Tobey Maguire\", \"Kristen Dunst\", \"Alfred Molina\", \"James Franco\",\n", " \"Rosemary Harris\", \"J.K. 
Simmons\", \"Stan Lee\", \"Bruce Campbell\"]\n", " },\n", " {\"movie\": \"Drag Me to Hell\", \"year\": 2009, \"actors\":\n", " [\"Alison Lohman\", \"Justin Long\", \"Lorna Raver\", \"Dileep Rao\", \"David Paymer\"]\n", " }\n", "]\n", "```" ] }, { "cell_type": "markdown", "id": "704a1603-85f5-4680-a9b3-28137f2e7cea", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "What if we had data like this?\n", "\n", "```json\n", "[\n", " {\"run\": 1, \"luminosityBlock\": 156, \"event\": 46501,\n", " \"PV\": {\"x\": 0.243, \"y\": 0.393, \"z\": 1.451},\n", " \"electron\": [],\n", " \"muon\": [\n", " {\"pt\": 63.043, \"eta\": -0.718, \"phi\": 2.968, \"mass\": 0.105, \"charge\": 1},\n", " {\"pt\": 38.120, \"eta\": -0.879, \"phi\": -1.032, \"mass\": 0.105, \"charge\": -1},\n", " {\"pt\": 4.048, \"eta\": -0.320, \"phi\": 1.038, \"mass\": 0.105, \"charge\": 1}\n", " ],\n", " \"MET\": {\"pt\": 21.929, \"phi\": -2.730}\n", " },\n", " {\"run\": 1, \"luminosityBlock\": 156, \"event\": 46502,\n", " \"PV\": {\"x\": 0.244, \"y\": 0.395, \"z\": -2.879},\n", " \"electron\": [\n", " {\"pt\": 21.902, \"eta\": -0.702, \"phi\": 0.133, \"mass\": 0.005, \"charge\": 1},\n", " {\"pt\": 42.632, \"eta\": -0.979, \"phi\": -1.863, \"mass\": 0.008, \"charge\": 1},\n", " {\"pt\": 78.012, \"eta\": -0.933, \"phi\": -2.207, \"mass\": 0.018, \"charge\": -1},\n", " {\"pt\": 23.835, \"eta\": -1.362, \"phi\": -0.621, \"mass\": 0.008, \"charge\": -1}\n", " ],\n", " \"muon\": [],\n", " \"MET\": {\"pt\": 16.972, \"phi\": 2.866}},\n", " ...\n", "]\n", "```" ] }, { "cell_type": "markdown", "id": "bcf2ffb7-dba0-477d-933f-ce8d178db643", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "It might be possible to turn these datasets into tabular form using surrogate keys and database normalization, but\n", "\n", " * they could be inconvenient or less efficient in that form, depending on what we want to do,\n", " * they were 
very likely _given_ in a ragged/untidy form. You can't ignore the data-cleaning step!" ] }, { "cell_type": "markdown", "id": "22e1411d-4ed6-4fe3-8be6-a63e2125162a", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "
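To make the tabular alternative concrete, here is a minimal sketch in plain Python of normalizing a movie list like the one above into two flat tables linked by a surrogate key. The names `movie_id`, `movies_table`, and `cast_table` are illustrative assumptions, not part of any library.

```python
# Two records abbreviated from the movie example above.
movies = [
    {'movie': 'Evil Dead', 'year': 1981,
     'actors': ['Bruce Campbell', 'Ellen Sandweiss']},
    {'movie': 'A Simple Plan', 'year': 1998,
     'actors': ['Bill Paxton', 'Billy Bob Thornton', 'Bridget Fonda']},
]

movies_table = []  # one row per movie: (movie_id, movie, year)
cast_table = []    # one row per (movie, actor) pair: (movie_id, actor)

for movie_id, record in enumerate(movies):
    # movie_id is a surrogate key: it has no meaning beyond linking the tables.
    movies_table.append((movie_id, record['movie'], record['year']))
    for actor in record['actors']:
        cast_table.append((movie_id, actor))

print(movies_table)
print(cast_table)
```

Both tables are rectangular, but any question about one movie's cast now requires a join on `movie_id`.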
" ] }, { "cell_type": "markdown", "id": "308210c5-eeda-4a3b-a420-a097a25d3248", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Dealing with these datasets as JSON or Python objects is inefficient for the same reason as for lists of numbers." ] }, { "cell_type": "markdown", "id": "44c2b36f-3cba-4cef-a821-b2647c497070", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "markdown", "id": "9871edb0-31fe-485a-a8e5-922a6dc1e187", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "We want arbitrary data structure with array-oriented interface and performance..." ] }, { "cell_type": "markdown", "id": "c579226d-b2b7-4519-8b6a-c160597db348", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "![](awkward-motivation-venn-diagram.svg)" ] }, { "cell_type": "markdown", "id": "a6590eff-e876-4c50-811d-51b870541911", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Libraries for irregular arrays" ] }, { "cell_type": "markdown", "id": "f3df31d7-0185-414f-9c31-7e08a88fdce3", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "markdown", "id": "c7aa6b97-c9ff-4b3c-8237-4888eb0a5f35", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "![](logo-arrow.svg)" ] }, { "cell_type": "code", "execution_count": null, "id": "853cd417-4bd2-413c-83d2-4bfb6b15d1c7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import pyarrow as pa" ] }, { "cell_type": "markdown", "id": "7c75130d-065b-4ed7-a8d3-df4b83eb6e8f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "b2233d1b-d680-4ee0-9d84-351422128a77", "metadata": { "tags": [] }, "outputs": [], "source": [ "arrow_array = pa.array([\n", " [{\"x\": 1.1, \"y\": [1]}, {\"x\": 2.2, \"y\": [1, 2]}, {\"x\": 3.3, \"y\": [1, 2, 3]}],\n", " [],\n", " [{\"x\": 4.4, \"y\": [1, 2, 3, 4]}, {\"x\": 5.5, \"y\": [1, 2, 3, 4, 5]}]\n", "])" ] }, { "cell_type": "markdown", "id": "f4aa8cb3-e350-4fbc-b5a2-76ad1d409c00", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "cc80bf50-aea5-41df-b5f0-9a35d2bae75f", "metadata": { "tags": [] }, "outputs": [], "source": [ "arrow_array.type" ] }, { "cell_type": "markdown", "id": "c796e4b8-aa92-44aa-bb2e-8742550dc88a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "0e77c680-bd02-45c2-a899-e8af53bfcafc", "metadata": { "tags": [] }, "outputs": [], "source": [ "arrow_array" ] }, { "cell_type": "markdown", "id": "830211c5-e747-4bdb-9beb-d90a07895f10", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "markdown", "id": "6e5b9cd8-cce1-427a-a97d-3707f9e2016a", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "![](logo-awkward.svg)" ] }, { "cell_type": "code", "execution_count": null, "id": "77739aec-06d9-4164-aa9b-86075c86874e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import awkward as ak" ] }, { "cell_type": "markdown", "id": "c6212da5-dbdc-43ac-9859-fba6b7c66827", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "87d9dff3-3107-45b0-b636-ae840e4ed638", "metadata": { "tags": [] }, "outputs": [], "source": [ "awkward_array = ak.from_arrow(arrow_array)\n", "awkward_array" ] }, { "cell_type": "markdown", "id": "5ec486a3-44fd-436e-ac1d-eab6ddbe8c08", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "markdown", "id": "fdd4ce8c-3e17-44c3-9f7b-b18ac5e2449d", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "![](logo-parquet.svg)" ] }, { "cell_type": "code", "execution_count": null, "id": "698e3514-659e-4392-a0cd-79ca4741449e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "ak.to_parquet(awkward_array, \"/tmp/file.parquet\")" ] }, { "cell_type": "markdown", "id": "5e9be1a7-0ad7-4f25-a082-fd6ece2de7e7", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "aa676558-b980-42ef-bdc8-f7a2d9248684", "metadata": { "tags": [] }, "outputs": [], "source": [ "ak.from_parquet(\"/tmp/file.parquet\")" ] }, { "cell_type": "markdown", "id": "b49f0d23-016b-440c-8b25-f0fdd859ba5a", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Awkward Array" ] }, { "cell_type": "code", "execution_count": null, "id": "6ed60a0e-315f-4f39-917e-b23f998fffe1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "ragged = ak.Array([\n", " [\n", " [[1.84, 0.324]],\n", " [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],\n", " [[0.459, -1.517, 1.545], [0.33, 0.292]],\n", " [[-0.376, -1.46, -0.206], [0.65, 1.278]],\n", " [[], [], [1.617]],\n", " []\n", " ],\n", " [\n", " [[-0.106, 0.611]],\n", " [[0.118, -1.788, 0.794, 0.658], [-0.105]]\n", " ],\n", " [\n", " [[-0.384], [0.697, -0.856]],\n", " [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]\n", " ],\n", " [\n", " [[0.205, -0.355], [-0.265], [1.042]],\n", " [[-0.004], [-1.167, -0.054, 0.726, 0.213]],\n", " [[1.741, -0.199, 0.827]]\n", " ]\n", "])" ] }, { "cell_type": "markdown", "id": "e2bc442c-dedb-4aa6-be20-bc7c5e47bab3", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Multidimensional indexing**" ] }, { "cell_type": "code", "execution_count": null, "id": "6a7165ea-fff7-4035-b18d-c05a2c2ca97d", "metadata": { "tags": [] }, "outputs": [], "source": [ "ragged[3, 1, -1, 2]" ] }, { "cell_type": "markdown", "id": "935b1dba-34ef-4e07-b175-be8b754010ef", "metadata": {}, "source": [ "
\n", "\n", "**Basic slicing**" ] }, { "cell_type": "code", "execution_count": null, "id": "6ac1c3e3-741d-4855-9636-9e98f6690fa5", "metadata": { "tags": [] }, "outputs": [], "source": [ "ragged[3, 1:, -1, 1:3]" ] }, { "cell_type": "markdown", "id": "4e382ae9-81db-486f-9078-22295781c739", "metadata": {}, "source": [ "
\n", "\n", "**Advanced slicing**" ] }, { "cell_type": "code", "execution_count": null, "id": "6322cb65-eb27-45e6-abc5-9f35f81661cf", "metadata": { "tags": [] }, "outputs": [], "source": [ "ragged[[False, False, True, True], [0, -1, 0, -1], 0, -1]" ] }, { "cell_type": "markdown", "id": "0adeb6c9-65c4-4d35-85f4-91f986b7291d", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Awkward slicing**" ] }, { "cell_type": "code", "execution_count": null, "id": "13105929-1f22-472a-9138-44ec73d52231", "metadata": { "tags": [] }, "outputs": [], "source": [ "ragged > 0" ] }, { "cell_type": "markdown", "id": "dd809703-6d16-42b6-9d89-63d4e72bc40c", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "55ea9417-62f6-4223-be2b-26e1abc4c42a", "metadata": { "tags": [] }, "outputs": [], "source": [ "ragged[ragged > 0]" ] }, { "cell_type": "markdown", "id": "9f9ee115-1343-469a-93ff-2cf50a432b7c", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Reductions**" ] }, { "cell_type": "code", "execution_count": null, "id": "d105c3cf-5b55-44cd-a163-b5d27767d227", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "ak.sum(ragged)" ] }, { "cell_type": "markdown", "id": "9f003781-8876-4d4e-ad14-851f7d40fd9e", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "4f7d467e-1fc2-4aed-8721-32eb411a1e6a", "metadata": { "tags": [] }, "outputs": [], "source": [ "ak.sum(ragged, axis=-1)" ] }, { "cell_type": "markdown", "id": "acdc2d38-2f6b-4531-893d-fbc189d54c64", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "b7b7c68f-a1d3-41d0-9ca1-58e607753186", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "ak.sum(ragged, axis=0)" ] }, { "cell_type": "markdown", "id": "8f37d2af-c67e-4853-b672-793eee0763c1", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "How do we even define reductions on an array with variable length lists?" ] }, { "cell_type": "markdown", "id": "82441d9a-3242-4b21-b486-6ef36e35a69f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "![](example-reducer-2d.svg)" ] }, { "cell_type": "markdown", "id": "d11a1d81-20f2-45fe-8767-09d95fa9e1c9", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "How do we even define reductions on an array with variable length lists?" ] }, { "cell_type": "markdown", "id": "f1377dfb-acd8-4a6f-aa3b-47b544c90645", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "![](example-reducer-ragged.svg)" ] }, { "cell_type": "code", "execution_count": null, "id": "fdff4345-b3a2-453a-9d8b-d47d03aca1e9", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "array = ak.Array([[ 1, 2, 3, 4],\n", " [ 10, None, 30 ],\n", " [ 100, 200 ]])" ] }, { "cell_type": "markdown", "id": "7474589a-4b79-4baa-98de-7d6c457a43d1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "ac7395e3-dbe9-479f-b595-a37fcb8cc977", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "ak.sum(array, axis=0).tolist()" ] }, { "cell_type": "markdown", "id": "ad53c2cb-b550-4158-8e44-11303e085050", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "36727ca1-d85d-42aa-968e-04a4f7eb98ce", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "ak.sum(array, axis=1).tolist()" ] }, { "cell_type": "markdown", "id": "473f41e3-2c52-48de-8536-33210b5bca13", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "markdown", "id": "72050082-e01c-4eb2-84b7-220a85fff881", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "(You almost always want the deepest/maximum `axis`, which you can get with `axis=-1`.)" ] }, { "cell_type": "markdown", "id": "3f335dcf-b128-4b4d-bbdb-dae52c43c03b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "markdown", "id": "bd72b617-a540-4ff8-9faa-cfc24befda51", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Awkward Arrays in particle physics" ] }, { "cell_type": "code", "execution_count": null, "id": "8c8b9fb5-d73d-4689-b8c1-46e11bb73f5f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import uproot\n", "\n", "file = uproot.open(\"https://github.com/jpivarski-talks/2023-12-18-hsf-india-tutorial-bhubaneswar/raw/main/data/SMHiggsToZZTo4L.root\")\n", "file" ] }, { "cell_type": "markdown", "id": "c1a51aef-3bb9-4c25-8af7-388f0732bdef", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "b25cb34b-ff02-4726-88b3-b33b51cc3a81", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "tree = file[\"Events\"]\n", "tree" ] }, { "cell_type": "markdown", "id": "42fec548-3363-4eb1-8297-5a7e130815d1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "9bb09f4f-eb47-4f61-9614-a17fd89afa61", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "tree.arrays(entry_stop=100)" ] }, { "cell_type": "markdown", "id": "4b5fe20f-b93c-4e99-9251-9728a20cb767", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "The same data fits into Parquet files (a little more easily)." ] }, { "cell_type": "code", "execution_count": null, "id": "296ae255-f894-4430-932c-a3ec67863897", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "events = ak.from_parquet(\"https://github.com/jpivarski-talks/2023-12-18-hsf-india-tutorial-bhubaneswar/raw/main/data/SMHiggsToZZTo4L.parquet\")\n", "events" ] }, { "cell_type": "markdown", "id": "85dc100c-200a-455a-9827-e76b92a0b752", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "View the first event as Python lists and dicts (like JSON)." ] }, { "cell_type": "code", "execution_count": null, "id": "30661cff-00f8-4556-8a6e-b25642f13170", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "events[0].to_list()" ] }, { "cell_type": "markdown", "id": "6663c8f3-f84f-4185-bd79-a9c7f7b4f034", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "Get one numeric field (also known as \"column\")." ] }, { "cell_type": "code", "execution_count": null, "id": "68a0c07c-e7f3-4e98-a368-4c8899c833e2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "events.electron.pt" ] }, { "cell_type": "markdown", "id": "f21e3de6-8631-4cbf-bb39-f91f6e9abfb5", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "Compute something ($p_z = p_T \\sinh\\eta$)." 
] }, { "cell_type": "code", "execution_count": null, "id": "b97c4bcf-58da-4e0a-a47d-c1c4ae1c9da8", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "\n", "events.electron.pt * np.sinh(events.electron.eta)" ] }, { "cell_type": "markdown", "id": "f3873822-10c9-4cc9-bbd9-7b7986695900", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "Note that the Vector library works with Awkward Arrays, if it is imported this way:" ] }, { "cell_type": "code", "execution_count": null, "id": "79f1de55-90cd-4a61-9627-9a4c110855ec", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import vector\n", "vector.register_awkward()" ] }, { "cell_type": "markdown", "id": "c032df1d-bc42-40ed-ab64-fa51726e5730", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "markdown", "id": "5c343480-9a4a-428d-b695-f9748e0e4641", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Records with `name=\"Momentum4D\"` and fields with coordinate names (`px`, `py`, `pz`, `E` or `pt`, `phi`, `eta`, `m`) automatically get Vector properties and methods." ] }, { "cell_type": "markdown", "id": "a91c9d45-ecce-4b87-91db-0539b233a76f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "3f8eb191-b606-4809-bfdf-ed4c690cbc6a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "events.electron.type.show()" ] }, { "cell_type": "markdown", "id": "311f608b-66f4-4b12-9426-f24a16bc8525", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "13be5a96-b4f3-4b95-afe9-c2941778aefb", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# implicitly computes pz = pt * sinh(eta)\n", "events.electron.pz" ] }, { "cell_type": "markdown", "id": "7a992469-b5e0-402e-bef1-ee46ed234627", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "To make histograms or other plots, we need numbers without structure, so {func}`ak.flatten` the array." ] }, { "cell_type": "code", "execution_count": null, "id": "c24df813-578a-4f81-8934-46d657c11024", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from hist import Hist\n", "\n", "Hist.new.Regular(100, 0, 100, name=\" \").Double().fill(\n", " ak.flatten(events.electron.pt)\n", ").plot();" ] }, { "cell_type": "markdown", "id": "d573f2e8-c0c2-4e23-b5c1-4cc4c4d87541", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "Each event has a different number of electrons and muons ({func}`ak.num` to check)." ] }, { "cell_type": "code", "execution_count": null, "id": "bf7a2d4a-2386-4f6b-bde0-c563c162b5f0", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "ak.num(events.electron), ak.num(events.muon)" ] }, { "cell_type": "markdown", "id": "81d4d4c4-7235-493a-a293-ff4d45c76ba7", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "
\n", "\n", "So what happens if we try to compute something with the electrons' $p_T$ and the muons' $\\eta$?" ] }, { "cell_type": "code", "execution_count": null, "id": "6efa2427-2732-45af-928e-0d649b7f92c2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "events.electron.pt * np.sinh(events.muon.eta)" ] }, { "cell_type": "markdown", "id": "ad8d0d1a-722d-4eb7-b341-839be6919cf6", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "This is data structure-aware, array-oriented programming." ] }, { "cell_type": "markdown", "id": "c31d978c-f829-4e30-8558-01e14bab641a", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Application:** Filtering events with an array of booleans." ] }, { "cell_type": "code", "execution_count": null, "id": "dbd40cdb-3764-4201-8384-714103194ca7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "events.MET.pt, events.MET.pt > 20" ] }, { "cell_type": "markdown", "id": "3a886516-e4ba-4fa1-9119-7a450a84562a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "5fa99f53-9b84-44ad-b0c7-39ac39fc15b3", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "len(events), len(events[events.MET.pt > 20])" ] }, { "cell_type": "markdown", "id": "20b44fe8-9311-4c47-b7de-4bcdff4d079f", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "
\n", "\n", "**Application:** Filtering particles with an array of lists of booleans." ] }, { "cell_type": "code", "execution_count": null, "id": "2319e94e-9bd7-490e-b6c8-c5f560dacebf", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "events.electron.pt, events.electron.pt > 30" ] }, { "cell_type": "markdown", "id": "4ff7bfda-1752-4f3e-9085-08f3e9f64c1b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "5f7b75c8-248e-43da-97be-c6525c36db62", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "ak.num(events.electron), ak.num(events.electron[events.electron.pt > 30])" ] }, { "cell_type": "markdown", "id": "4d48aa4f-49bb-455c-b664-3cb405064ab0", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Quizlet:** Using the reducer {func}`ak.any`, how would we select _events_ in which any electron has $p_T > 30$ GeV/c$^2$?" ] }, { "cell_type": "code", "execution_count": null, "id": "85339038-b580-4eb7-9421-fb32d2a402d1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "events.electron[events.electron.pt > 30]" ] }, { "cell_type": "markdown", "id": "e5bd3b3f-ef4b-48d4-873d-ff831a85a768", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "markdown", "id": "28a196d8-56d8-4ee7-ae13-9622865a8076", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "Awkward Array has two combinatorial primitives:" ] }, { "cell_type": "markdown", "id": "e143afb0-4a0f-4188-8aa6-6385a1f7a816", "metadata": {}, "source": [ "{func}`ak.cartesian` takes a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of lists from $N$ different arrays, producing an array of lists of $N$-tuples.\n", "\n", "{func}`ak.combinations` takes $N$ [samples without replacement](http://prob140.org/sp18/textbook/notebooks-md/5_04_Sampling_Without_Replacement.html) of lists from a single array, producing an array of lists of $N$-tuples." ] }, { "cell_type": "code", "execution_count": null, "id": "c3a4d8d1-d210-4fe6-9b2e-6fe3c6c064af", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "numbers = ak.Array([[1, 2, 3], [], [4]])\n", "letters = ak.Array([[\"a\", \"b\"], [\"c\"], [\"d\", \"e\"]])" ] }, { "cell_type": "markdown", "id": "c27d876b-cd7f-4be5-9a88-5f75f4b5bc80", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "4a188c11-5af4-4bac-9d3d-273b98e62d44", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "ak.cartesian([numbers, letters])" ] }, { "cell_type": "markdown", "id": "f60ba73e-1b00-4b2f-9367-318b52f251b3", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "af07b963-cc1d-4367-ba48-ffe4da4303e5", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "values = ak.Array([[1.1, 2.2, 3.3, 4.4], [], [5.5, 6.6]])" ] }, { "cell_type": "markdown", "id": "18a0a5f3-3f49-4f77-a09c-fd5b5748b1df", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "5713cc98-4ba3-4087-b9a3-506838b6b020", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "ak.combinations(values, 2)" ] }, { "cell_type": "markdown", "id": "917c2815-6508-474e-b58b-39b57b37aa7e", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "Often, it's useful to separate the separate the left-hand sides and right-hand sides of these pairs with {func}`ak.unzip`, so they can be used in mathematical expressions." ] }, { "cell_type": "markdown", "id": "dc11ada3-92b0-4f7f-9118-a32ba41dc0d2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "c7bac853-59b1-4a52-887a-83c5ec845e32", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "electron_muon_pairs = ak.cartesian([events.electron, events.muon])\n", "electron_muon_pairs.type.show()" ] }, { "cell_type": "markdown", "id": "1a247a4c-01a6-4739-a828-00ec20ba2e7d", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "51960f1b-eeef-444c-860d-473ad7083396", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "electron_in_pair, muon_in_pair = ak.unzip(electron_muon_pairs)\n", "electron_in_pair.type.show()" ] }, { "cell_type": "markdown", "id": "8403df4a-0a79-4220-ab3b-b538e818aa7d", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "d4d696a8-a5a3-47b5-bdd9-a2a2f21b4403", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "electron_in_pair.pt, muon_in_pair.pt" ] }, { "cell_type": "markdown", "id": "832690e2-7c09-4c18-989e-3fcdc6c90358", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "886f1267-6b6b-4a37-ba8c-2f2680c91e4e", "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "ak.num(electron_in_pair), ak.num(muon_in_pair)" ] }, { "cell_type": "markdown", "id": "eceaad8c-2fb3-4e32-bad7-12b1a6e45034", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "To use Vector's `deltaR` method ($\\Delta R = \\sqrt{\\Delta\\phi^2 + \\Delta\\eta^2}$), we need to have the electrons and muons in separate arrays." ] }, { "cell_type": "code", "execution_count": null, "id": "32eecc20-2397-40c8-a6b1-b03b6211906b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "electron_in_pair, muon_in_pair = ak.unzip(ak.cartesian([events.electron, events.muon]))" ] }, { "cell_type": "markdown", "id": "dd3a02fa-1b23-40cd-923c-9fb92341eb8e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "d193dfa0-2027-420a-a583-7f6fe35826b7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "electron_in_pair.deltaR(muon_in_pair)" ] }, { "cell_type": "code", "execution_count": null, "id": "464996c0-0a00-4c9f-b3d9-dc6baf70d96a", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "first_electron_in_pair, second_electron_in_pair = ak.unzip(ak.combinations(events.electron, 2))" ] }, { "cell_type": "markdown", "id": "c05fe323-5350-44b0-858f-849ca4ac4e76", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "
" ] }, { "cell_type": "code", "execution_count": null, "id": "de6d4ddc-8410-443f-8c7c-ba669d30b233", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "first_electron_in_pair.deltaR(second_electron_in_pair)" ] }, { "cell_type": "markdown", "id": "076cb4a3-c407-4d50-a4d4-3cdcc6b977dd", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Quizlet:** What's this?" ] }, { "cell_type": "code", "execution_count": null, "id": "a5fcbc46-1bd9-4944-9321-df47e9e2845d", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "(first_electron_in_pair + second_electron_in_pair).mass" ] }, { "cell_type": "code", "execution_count": null, "id": "c28f5f07-70eb-4827-99a5-cf5811f6bf68", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "Hist.new.Reg(120, 0, 120, name=\"mass (GeV)\").Double().fill(\n", " ak.flatten((first_electron_in_pair + second_electron_in_pair).mass, axis=-1)\n", ").plot();" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }