Read strings from a binary stream
Awkward Array implements strings as ragged lists of code-units. Successive strings are therefore packed contiguously in memory, which allows string operations to be performed efficiently.
Let’s imagine that we want to read some logging output that is stored in a text file, for example a subset of logs from the Android application framework.
import gzip
import itertools
import pathlib
# Preview logs
log_path = pathlib.Path("..", "samples", "Android.head.log.gz")
with gzip.open(log_path, "rt") as f:
    for line in itertools.islice(f, 8):
        print(line, end="")
12-17 19:31:36.263 1795 1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0
12-17 19:31:36.263 5224 5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0
12-17 19:31:36.264 1795 1825 D DisplayPowerController: Animating brightness: target=21, rate=40
12-17 19:31:36.264 1795 1825 I PowerManager_screenOn: DisplayPowerController updatePowerState mPendingRequestLocked=policy=BRIGHT, useProximitySensor=true, useProximitySensorbyPhone=true, screenBrightness=33, screenAutoBrightnessAdjustment=0.0, brightnessSetByUser=true, useAutoBrightness=true, blockScreenOn=false, lowPowerMode=false, boostScreenBrightness=false, dozeScreenBrightness=-1, dozeScreenState=UNKNOWN, useTwilight=false, useSmartBacklight=true, brightnessWaitMode=false, brightnessWaitRet=true, screenAutoBrightness=-1, userId=0
12-17 19:31:36.264 1795 2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823
12-17 19:31:36.264 1795 2750 I HwLightsService: back light level before map = 823
12-17 19:31:36.264 1795 1825 D DisplayPowerController: Animating brightness: target=21, rate=40
12-17 19:31:36.264 1795 1825 V KeyguardServiceDelegate: onScreenTurnedOn()
To begin with, we can read the decompressed log file as an array of np.uint8 dtype using NumPy, and convert the resulting array to an Awkward Array
import awkward as ak
import numpy as np
with gzip.open(log_path, "rb") as f:
    # `gzip.open` doesn't return a true file descriptor that NumPy can ingest
    # So, instead we read into memory.
    arr = np.frombuffer(f.read(), dtype=np.uint8)
raw_bytes = ak.from_numpy(arr)
raw_bytes.type.show()
1150841 * uint8
Awkward Array doesn’t support scalar values, so we can’t treat these characters as a single string. Instead, we need at least one dimension. Let’s unflatten our array of characters to form a length-1 array of lists of characters.
array_of_chars = ak.unflatten(raw_bytes, len(raw_bytes))
array_of_chars
[[49, 50, 45, 49, 55, 32, 49, 57, 58, ..., 99, 114, 105, 109, 40, 41, 13, 10]]
------------------------------------------------------------------------------
backend: cpu
nbytes: 1.2 MB
type: 1 * 1150841 * uint8
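As a quick sanity check, the leading and trailing values printed above can be decoded by hand with plain Python, without involving Awkward Array. This is only an illustrative sketch: the two lists are copied from the printed output, and the variable names are hypothetical.

```python
# Values copied from the start and end of the array printed above.
head = [49, 50, 45, 49, 55, 32, 49, 57, 58]
tail = [99, 114, 105, 109, 40, 41, 13, 10]

# Each value is one UTF-8 code-unit (byte); ASCII text uses one byte per character.
head_text = bytes(head).decode("utf-8")
tail_text = bytes(tail).decode("utf-8")

print(repr(head_text))  # '12-17 19:'
print(repr(tail_text))  # 'crim()\r\n'
```

The final 13, 10 pair is a carriage return followed by a line feed, which suggests that the log file uses "\r\n" line endings.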
We can then ask Awkward Array to treat this array of lists of characters as an array of strings, using ak.enforce_type()
string = ak.enforce_type(array_of_chars, "string")
string.type.show()
1 * string
The underlying mechanism for implementing strings as lists of code-units can be seen if we inspect the low-level layout that builds the array
string.layout
<ListOffsetArray len='1'>
<parameter name='__array__'>'string'</parameter>
<offsets><Index dtype='int64' len='2'>[ 0 1150841]</Index></offsets>
<content><NumpyArray dtype='uint8' len='1150841'>
<parameter name='__array__'>'char'</parameter>
[49 50 45 ... 41 13 10]
</NumpyArray></content>
</ListOffsetArray>
The __array__ parameter is special. It is reserved by Awkward Array, and signals that the layout is a special, pre-understood built-in type. In this case, the type of the outer ak.contents.ListOffsetArray is “string”. It can also be seen that the inner ak.contents.NumpyArray has an __array__ parameter too, this time with a value of “char”. In Awkward Array, an array of strings must have this layout: a list with the __array__="string" parameter wrapping an ak.contents.NumpyArray with the __array__="char" parameter.
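To make the offsets-plus-content idea concrete, here is a minimal plain-Python analogue of such a layout. It is a sketch under stated assumptions, not Awkward Array's implementation: the names words, content, offsets and get_string are hypothetical, and an ordinary bytes object stands in for the uint8 buffer.

```python
# Plain-Python analogue of a list-of-strings layout:
# one contiguous buffer of UTF-8 bytes plus a list of offsets into it.
words = ["one", "", "three"]

content = b"".join(w.encode("utf-8") for w in words)

offsets = [0]
for w in words:
    offsets.append(offsets[-1] + len(w.encode("utf-8")))


def get_string(i):
    """Return the i-th string by slicing the shared buffer."""
    return content[offsets[i]:offsets[i + 1]].decode("utf-8")


recovered = [get_string(i) for i in range(len(offsets) - 1)]
print(offsets)    # [0, 3, 3, 8]
print(recovered)  # ['one', '', 'three']
```

There is always one more offset than there are strings, and an empty string appears as two equal consecutive offsets.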
A single (very long) string isn’t much use. Let’s split this string at the line boundaries
split_at_newlines = ak.str.split_pattern(string, "\n")
split_at_newlines
[[...]]
----------------------
backend: cpu
nbytes: 1.2 MB
type: 1 * var * string
Now we can remove the temporary length-1 outer dimension that was required to treat the data as a string
lines = split_at_newlines[0]
lines
['12-17 19:31:36.263 1795 1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0\r', '12-17 19:31:36.263 5224 5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0\r', '12-17 19:31:36.264 1795 1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r', '12-17 19:31:36.264 1795 1825 I PowerManager_screenOn: DisplayPowerController updatePowerState mPendingRequestLocked=policy=BRIGHT, useProximitySensor=true, useProximitySensorbyPhone=true, screenBrightness=33, screenAutoBrightnessAdjustment=0.0, brightnessSetByUser=true, useAutoBrightness=true, blockScreenOn=false, lowPowerMode=false, boostScreenBrightness=false, dozeScreenBrightness=-1, dozeScreenState=UNKNOWN, useTwilight=false, useSmartBacklight=true, brightnessWaitMode=false, brightnessWaitRet=true, screenAutoBrightness=-1, userId=0\r', '12-17 19:31:36.264 1795 2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823\r', '12-17 19:31:36.264 1795 2750 I HwLightsService: back light level before map = 823\r', '12-17 19:31:36.264 1795 1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r', '12-17 19:31:36.264 1795 1825 V KeyguardServiceDelegate: onScreenTurnedOn()\r', '12-17 19:31:36.264 1795 1825 I WindowManger_keyguard: onScreenTurnedOn()\r', '12-17 19:31:36.264 1795 1825 D DisplayPowerController: Display ready!\r', ..., '12-17 19:31:56.801 2852 2866 D KeyguardService: KGSvcCall onScreenTurningOn.\r', '12-17 19:31:56.801 2852 2866 D KeyguardViewMediator: notifyScreenOn\r', '12-17 19:31:56.801 2852 2852 D KeyguardViewMediator: handleNotifyScreenTurningOn\r', '12-17 19:31:56.801 2852 2852 I PhoneStatusBar: onScreenTurningOn\r', '12-17 19:31:56.801 2852 2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn End with : 110\r', '12-17 19:31:56.801 1795 15987 I WindowManger_keyguard: **** SHOWN CALLED ****\r', '12-17 19:31:56.801 1795 15987 D WindowManager: 
mKeyguardDelegate.ShowListener.onDrawn.\r', '12-17 19:31:56.801 1795 15987 I WindowManger_keyguard: hideScrim()\r', '']
--------------------------------------------------------------------------------
backend: cpu
nbytes: 1.2 MB
type: 10001 * string
In the low-level layout, we can see that these lines are still just variable-length lists
lines.layout
<ListOffsetArray len='10001'>
<parameter name='__array__'>'string'</parameter>
<offsets><Index dtype='int64' len='10002'>
[ 0 102 228 ... 1140773 1140841 1140841]
</Index></offsets>
<content><NumpyArray dtype='uint8' len='1140841'>
<parameter name='__array__'>'char'</parameter>
[49 50 45 ... 40 41 13]
</NumpyArray></content>
</ListOffsetArray>
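Two details of this layout are worth noting. The last two offsets are equal, so the final string is empty, and the content buffer is 10000 bytes shorter than the original 1150841 bytes, one byte for each "\n" that was split on. The same behaviour can be reproduced with plain Python bytes. This is a small sketch with a made-up two-line input, not the actual log data.

```python
# A tiny stand-in for the log file, using the same "\r\n" line endings.
data = b"first\r\nsecond\r\n"

# Splitting on "\n" removes the separator but keeps the "\r".
parts = data.split(b"\n")
print(parts)  # [b'first\r', b'second\r', b'']

# Build offsets the way a list-of-strings layout would store them.
offsets = [0]
for part in parts:
    offsets.append(offsets[-1] + len(part))
print(offsets)  # [0, 6, 13, 13]

# One byte is dropped for each separator that was split on.
removed = len(data) - offsets[-1]
print(removed)  # 2
```

Because the input ends with a separator, the split yields a trailing empty string, which is why the array above has 10001 entries rather than 10000.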
Bytestrings vs strings
In general, whilst strings can fundamentally be described as lists of bytes (code-units), many string operations do not operate at the byte level. The ak.str submodule provides a suite of vectorised operations that operate at the code-point (not code-unit) level, such as computing the string length. Consider the following simple string
large_code_point = ak.Array(["Å"])
In Awkward Array, strings are UTF-8 encoded, meaning that a single code-point may comprise up to four code-units (bytes). Although this looks like a single character, the layout makes it clear that it is stored as two code-units
large_code_point.layout
<ListOffsetArray len='1'>
<parameter name='__array__'>'string'</parameter>
<offsets><Index dtype='int64' len='2'>
[0 2]
</Index></offsets>
<content><NumpyArray dtype='uint8' len='2'>
<parameter name='__array__'>'char'</parameter>
[195 133]
</NumpyArray></content>
</ListOffsetArray>
This is reflected in the result of the ak.num() function, which counts code-units
ak.num(large_code_point)
[2]
---------------
backend: cpu
nbytes: 8 B
type: 1 * int64
The ak.str module provides a function for computing the length of a string
ak.str.length(large_code_point)
[1]
---------------
backend: cpu
nbytes: 8 B
type: 1 * int64
Clearly this function is code-point aware.
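The distinction between code-points and code-units can also be checked with Python's built-in str and bytes types, independently of Awkward Array. In this sketch the sample characters are chosen purely for illustration, and are written as escapes so that each one is a single, unambiguous code-point.

```python
# Compare code-point counts with UTF-8 code-unit (byte) counts.
# "A", U+00C5 (A with ring above), U+20AC (euro sign), U+1F600 (an emoji)
samples = ["A", "\u00c5", "\u20ac", "\U0001F600"]

# len() of a Python str counts code-points.
code_points = [len(s) for s in samples]

# len() of the UTF-8 encoded bytes counts code-units.
code_units = [len(s.encode("utf-8")) for s in samples]

print(code_points)  # [1, 1, 1, 1]
print(code_units)   # [1, 2, 3, 4]

# The two bytes of U+00C5 match the content buffer shown in the layout above.
print(list("\u00c5".encode("utf-8")))  # [195, 133]
```

Every sample is a single code-point, yet the encoded length ranges from one to four bytes, which is the full range that UTF-8 allows.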
If one wants to drop the UTF-8 string abstraction, and instead deal with strings as raw byte arrays, there is the bytes type
large_code_point_bytes = ak.enforce_type(large_code_point, "bytes")
large_code_point_bytes
[b'\xc3\x85']
---------------
backend: cpu
nbytes: 18 B
type: 1 * bytes
The layout of this array carries different parameters: "bytestring" and "byte" in place of "string" and "char"
large_code_point_bytes.layout
<ListOffsetArray len='1'>
<parameter name='__array__'>'bytestring'</parameter>
<offsets><Index dtype='int64' len='2'>
[0 2]
</Index></offsets>
<content><NumpyArray dtype='uint8' len='2'>
<parameter name='__array__'>'byte'</parameter>
[195 133]
</NumpyArray></content>
</ListOffsetArray>
Many of the functions in the ak.str module treat bytestrings and strings differently; strings are often manipulated in terms of code-points, whereas bytestrings are manipulated in terms of code-units. Consider ak.str.length() for this array
ak.str.length(large_code_point_bytes)
[2]
---------------
backend: cpu
nbytes: 8 B
type: 1 * int64
This is clearly counting the bytes (code-units), not code-points.