Read strings from a binary stream

Awkward Array represents strings as ragged lists of code units. Successive strings are therefore packed contiguously in memory, which makes vectorised operations on them efficient.

Suppose that we want to read some logging output that is stored in a gzip-compressed text file, for example a subset of logs from the Android application framework.

import gzip
import itertools
import pathlib

# Preview logs
log_path = pathlib.Path("..", "samples", "Android.head.log.gz")
with gzip.open(log_path, "rt") as f:
    for line in itertools.islice(f, 8):
        print(line, end="")

12-17 19:31:36.263  1795  1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0
12-17 19:31:36.263  5224  5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40
12-17 19:31:36.264  1795  1825 I PowerManager_screenOn: DisplayPowerController updatePowerState mPendingRequestLocked=policy=BRIGHT, useProximitySensor=true, useProximitySensorbyPhone=true, screenBrightness=33, screenAutoBrightnessAdjustment=0.0, brightnessSetByUser=true, useAutoBrightness=true, blockScreenOn=false, lowPowerMode=false, boostScreenBrightness=false, dozeScreenBrightness=-1, dozeScreenState=UNKNOWN, useTwilight=false, useSmartBacklight=true, brightnessWaitMode=false, brightnessWaitRet=true, screenAutoBrightness=-1, userId=0
12-17 19:31:36.264  1795  2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823
12-17 19:31:36.264  1795  2750 I HwLightsService: back light level before map = 823
12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40
12-17 19:31:36.264  1795  1825 V KeyguardServiceDelegate: onScreenTurnedOn()

To begin with, we can read the decompressed log file into a NumPy array with dtype np.uint8, and convert the result to an Awkward Array:

import awkward as ak
import numpy as np

with gzip.open(log_path, "rb") as f:
    # `gzip.open` doesn't return a true file descriptor that NumPy can ingest
    # So, instead we read into memory.
    arr = np.frombuffer(f.read(), dtype=np.uint8)

raw_bytes = ak.from_numpy(arr)
raw_bytes.type.show()

1150841 * uint8

Awkward Array doesn’t support scalar values, so we can’t treat these characters as a single string. Instead, we need at least one dimension. Let’s unflatten our array of characters to form a length-1 array containing one list of characters:

array_of_chars = ak.unflatten(raw_bytes, len(raw_bytes))
array_of_chars

[[49, 50, 45, 49, 55, 32, 49, 57, 58, ..., 99, 114, 105, 109, 40, 41, 13, 10]]
------------------------------------------------------------------------------
type: 1 * 1150841 * uint8
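
The integers printed above are UTF-8 code units. As a quick sanity check, the following minimal sketch uses only the Python standard library (it does not involve Awkward Array) to decode the leading and trailing values shown in the output. The variable names are illustrative.

```python
# Code units copied from the start and end of the printed array above.
leading = bytes([49, 50, 45, 49, 55, 32, 49, 57, 58])
trailing = bytes([99, 114, 105, 109, 40, 41, 13, 10])

# Decoding them as UTF-8 recovers the text of the log file.
leading_text = leading.decode("utf-8")
trailing_text = trailing.decode("utf-8")

print(repr(leading_text))
print(repr(trailing_text))
```

The final pair of values, 13 and 10, is a carriage return followed by a line feed.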

We can then ask Awkward Array to treat this array of lists of characters as an array of strings, using ak.enforce_type():

string = ak.enforce_type(array_of_chars, "string")
string.type.show()

1 * string

The underlying mechanism for implementing strings as lists of code units can be seen if we inspect the low-level layout that builds the array:

string.layout

<ListOffsetArray len='1'>
    <parameter name='__array__'>'string'</parameter>
    <offsets><Index dtype='int64' len='2'>[      0 1150841]</Index></offsets>
    <content><NumpyArray dtype='uint8' len='1150841'>
        <parameter name='__array__'>'char'</parameter>
        [49 50 45 ... 41 13 10]
    </NumpyArray></content>
</ListOffsetArray>

The __array__ parameter is special. It is reserved by Awkward Array, and signals that the layout is a special, pre-understood built-in type. In this case, the type of the outer ak.contents.ListOffsetArray is “string”. The inner ak.contents.NumpyArray also has an __array__ parameter, this time with a value of “char”. In Awkward Array, an array of strings must look like this layout: a list with the __array__="string" parameter wrapping an ak.contents.NumpyArray with the __array__="char" parameter.
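
To make the roles of the offsets and content buffers concrete, here is a conceptual sketch in plain Python. It is not Awkward Array's implementation, and the names `strings`, `content`, `offsets`, and `get_string` are hypothetical; it only mimics how one contiguous buffer of code units plus a list of offsets can describe several strings.

```python
# Conceptual sketch only: plain Python stand-ins for the `offsets` and
# `content` buffers of a string-typed list layout.
strings = ["one", "", "three"]

# All code units are stored back-to-back in a single buffer.
content = b"".join(s.encode("utf-8") for s in strings)

# offsets[i] and offsets[i + 1] delimit the i-th string within `content`.
offsets = [0]
for s in strings:
    offsets.append(offsets[-1] + len(s.encode("utf-8")))


def get_string(i):
    """Slice the i-th string out of the shared buffer and decode it."""
    return content[offsets[i]:offsets[i + 1]].decode("utf-8")


decoded = [get_string(i) for i in range(len(offsets) - 1)]
print(offsets)
print(decoded)
```

An empty string appears as two equal consecutive offsets, and there is always one more offset than there are strings.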

A single (very long) string isn’t much use. Let’s split this string at the line boundaries:

split_at_newlines = ak.str.split_pattern(string, "\n")
split_at_newlines

[[...]]
----------------------
type: 1 * var * string

Now we can remove the temporary length-1 outer dimension that was required to treat the data as a string:

lines = split_at_newlines[0]
lines

['12-17 19:31:36.263  1795  1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0\r',
 '12-17 19:31:36.263  5224  5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0\r',
 '12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r',
 '12-17 19:31:36.264  1795  1825 I PowerManager_screenOn: DisplayPowerController updatePowerState mPendingRequestLocked=policy=BRIGHT, useProximitySensor=true, useProximitySensorbyPhone=true, screenBrightness=33, screenAutoBrightnessAdjustment=0.0, brightnessSetByUser=true, useAutoBrightness=true, blockScreenOn=false, lowPowerMode=false, boostScreenBrightness=false, dozeScreenBrightness=-1, dozeScreenState=UNKNOWN, useTwilight=false, useSmartBacklight=true, brightnessWaitMode=false, brightnessWaitRet=true, screenAutoBrightness=-1, userId=0\r',
 '12-17 19:31:36.264  1795  2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823\r',
 '12-17 19:31:36.264  1795  2750 I HwLightsService: back light level before map = 823\r',
 '12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r',
 '12-17 19:31:36.264  1795  1825 V KeyguardServiceDelegate: onScreenTurnedOn()\r',
 '12-17 19:31:36.264  1795  1825 I WindowManger_keyguard: onScreenTurnedOn()\r',
 '12-17 19:31:36.264  1795  1825 D DisplayPowerController: Display ready!\r',
 ...,
 '12-17 19:31:56.801  2852  2866 D KeyguardService: KGSvcCall onScreenTurningOn.\r',
 '12-17 19:31:56.801  2852  2866 D KeyguardViewMediator: notifyScreenOn\r',
 '12-17 19:31:56.801  2852  2852 D KeyguardViewMediator: handleNotifyScreenTurningOn\r',
 '12-17 19:31:56.801  2852  2852 I PhoneStatusBar: onScreenTurningOn\r',
 '12-17 19:31:56.801  2852  2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn End with : 110\r',
 '12-17 19:31:56.801  1795 15987 I WindowManger_keyguard: **** SHOWN CALLED ****\r',
 '12-17 19:31:56.801  1795 15987 D WindowManager: mKeyguardDelegate.ShowListener.onDrawn.\r',
 '12-17 19:31:56.801  1795 15987 I WindowManger_keyguard: hideScrim()\r',
 '']
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: 10001 * string

In the low-level layout, we can see that these lines are still just variable-length lists:

lines.layout

<ListOffsetArray len='10001'>
    <parameter name='__array__'>'string'</parameter>
    <offsets><Index dtype='int64' len='10002'>
        [      0     102     228 ... 1140773 1140841 1140841]
    </Index></offsets>
    <content><NumpyArray dtype='uint8' len='1140841'>
        <parameter name='__array__'>'char'</parameter>
        [49 50 45 ... 40 41 13]
    </NumpyArray></content>
</ListOffsetArray>
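
The trailing '\r' on each line and the final empty string both follow from splitting "\r\n"-terminated text at "\n". The sketch below reproduces this on a tiny, made-up input using Python's built-in bytes.split as a stand-in for ak.str.split_pattern; the cleanup step at the end is one plain-Python option, not something the steps above perform.

```python
# A tiny, made-up stand-in for the log file, using "\r\n" line endings.
raw = b"first line\r\nsecond line\r\n"

# Splitting at "\n" removes only the line feed, so each piece keeps its "\r",
# and the terminator after the last line yields a final empty piece.
pieces = raw.split(b"\n")
print(pieces)

# Offsets delimiting each piece in a contiguous buffer.
offsets = [0]
for piece in pieces:
    offsets.append(offsets[-1] + len(piece))
print(offsets)

# One way to tidy up in plain Python: drop carriage returns and empty pieces.
cleaned = [piece.rstrip(b"\r") for piece in pieces if piece]
print(cleaned)
```

As in the layout above, the last two offsets are equal because the final piece is empty. The content buffer there is also 10000 code units shorter than the original 1150841, one for each removed line feed.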

Bytestrings vs strings

Although strings can fundamentally be described as lists of bytes (code units), many string operations do not operate at the byte level. The ak.str submodule provides a suite of vectorised operations, such as computing the string length, that operate at the code-point (not code-unit) level. Consider the following simple string:

large_code_point = ak.Array(["Å"])

In Awkward Array, strings are UTF-8 encoded, meaning that a single code point may comprise up to four code units (bytes). Although this looks like a single character, the layout shows that it is in fact stored as two code units:

large_code_point.layout

<ListOffsetArray len='1'>
    <parameter name='__array__'>'string'</parameter>
    <offsets><Index dtype='int64' len='2'>
        [0 2]
    </Index></offsets>
    <content><NumpyArray dtype='uint8' len='2'>
        <parameter name='__array__'>'char'</parameter>
        [195 133]
    </NumpyArray></content>
</ListOffsetArray>

This is reflected in the result of ak.num(), which counts the code units in each list:

ak.num(large_code_point)

[2]
---------------
type: 1 * int64

The ak.str module provides a function for computing the length of a string:

ak.str.length(large_code_point)

[1]
---------------
type: 1 * int64

Clearly this function is code-point aware.

If one wants to drop the UTF-8 string abstraction and instead deal with strings as raw byte arrays, there is the bytes type:

large_code_point_bytes = ak.enforce_type(large_code_point, "bytes")
large_code_point_bytes

[b'\xc3\x85']
---------------
type: 1 * bytes

The layout of this array has different parameters, "bytestring" and "byte":

large_code_point_bytes.layout

<ListOffsetArray len='1'>
    <parameter name='__array__'>'bytestring'</parameter>
    <offsets><Index dtype='int64' len='2'>
        [0 2]
    </Index></offsets>
    <content><NumpyArray dtype='uint8' len='2'>
        <parameter name='__array__'>'byte'</parameter>
        [195 133]
    </NumpyArray></content>
</ListOffsetArray>

Many of the functions in the ak.str module treat bytestrings and strings differently: strings are often manipulated in terms of code points, whereas bytestrings are manipulated in terms of code units. Consider ak.str.length() for this array:

ak.str.length(large_code_point_bytes)

[2]
---------------
type: 1 * int64

This is clearly counting the bytes (code-units), not code-points.
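
The two counts can be reproduced with the Python standard library alone. The following is a minimal sketch of the distinction, not a description of how ak.str.length() is implemented; the variable names are illustrative.

```python
# "\u00c5" is the precomposed character Å (U+00C5).
text = "\u00c5"
encoded = text.encode("utf-8")

# Code units: the number of bytes in the UTF-8 encoding.
n_code_units = len(encoded)

# Code points: UTF-8 continuation bytes have the bit pattern 10xxxxxx, so
# counting every byte that is NOT a continuation byte counts code points.
n_code_points = sum(1 for byte in encoded if (byte & 0xC0) != 0x80)

print(list(encoded))
print(n_code_units, n_code_points)
```

The encoded bytes match the content buffer shown in the layouts above, and the two counts correspond to the bytestring and string results respectively.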