How to create arrays of strings#
Awkward Arrays can contain strings, although these strings are just a special view of lists of uint8
numbers. As such, the variable-length data are efficiently stored.
NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. Awkward Array doesn’t have nearly as many functions for manipulating arrays of strings as NumPy and Pandas, though.
import awkward as ak
import numpy as np
From Python strings#
The ak.Array
constructor and ak.from_iter()
recognize strings, and strings are returned by ak.to_list()
.
ak.Array(["one", "two", "three"])
['one', 'two', 'three'] ---------------- type: 3 * string
They may be nested within anything.
ak.Array([["one", "two"], [], ["three"]])
[['one', 'two'], [], ['three']] ---------------------- type: 3 * var * string
From NumPy arrays#
NumPy strings are also recognized by ak.from_numpy()
and ak.to_numpy()
.
numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
['one', 'two', 'three', 'four'] ---------------- type: 4 * string
Operations with strings#
Since strings are really just lists, some of the list operations “just work” on strings.
ak.num(awkward_array)
[3, 3, 5, 4] --------------- type: 4 * int64
awkward_array[:, 1:]
['ne', 'wo', 'hree', 'our'] ---------------- type: 4 * string
Others had to be specially overloaded for the string case, such as string-equality. The default meaning for ==
would be to descend to the lowest level and compare numbers (characters, in this case).
awkward_array == "three"
[False, False, True, False] -------------- type: 4 * bool
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
[False, False, True, True] -------------- type: 4 * bool
Similarly, ak.sort()
and ak.argsort()
sort strings lexicographically, not individual characters.
ak.sort(awkward_array)
['four', 'one', 'three', 'two'] ---------------- type: 4 * string
Still other operations had to be inhibited, since they wouldn’t make sense for strings.
np.sqrt(awkward_array)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.sqrt(awkward_array)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1511, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
1509 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
1510 with ak._errors.OperationErrorContext(name, inputs, kwargs):
-> 1511 return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:466, in array_ufunc(ufunc, method, inputs, kwargs)
458 raise TypeError(
459 "no {}.{} overloads for custom types: {}".format(
460 type(ufunc).__module__, ufunc.__name__, ", ".join(error_message)
461 )
462 )
464 return None
--> 466 out = ak._broadcasting.broadcast_and_apply(
467 inputs, action, allow_records=False, function_name=ufunc.__name__
468 )
470 if len(out) == 1:
471 return wrap_layout(out[0], behavior=behavior, attrs=attrs)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1140, in broadcast_and_apply(inputs, action, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
1138 backend = backend_of(*inputs, coerce_to_common=False)
1139 isscalar = []
-> 1140 out = apply_step(
1141 backend,
1142 broadcast_pack(inputs, isscalar),
1143 action,
1144 0,
1145 depth_context,
1146 lateral_context,
1147 {
1148 "allow_records": allow_records,
1149 "left_broadcast": left_broadcast,
1150 "right_broadcast": right_broadcast,
1151 "numpy_to_regular": numpy_to_regular,
1152 "regular_to_jagged": regular_to_jagged,
1153 "function_name": function_name,
1154 "broadcast_parameters_rule": broadcast_parameters_rule,
1155 },
1156 )
1157 assert isinstance(out, tuple)
1158 return tuple(broadcast_unpack(x, isscalar) for x in out)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1118, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
1116 return result
1117 elif result is None:
-> 1118 return continuation()
1119 else:
1120 raise AssertionError(result)
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1087, in apply_step.<locals>.continuation()
1085 # Any non-string list-types?
1086 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1087 return broadcast_any_list()
1089 # Any RecordArrays?
1090 elif any(x.is_record for x in contents):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:623, in apply_step.<locals>.broadcast_any_list()
620 nextinputs.append(x)
621 nextparameters.append(NO_PARAMETERS)
--> 623 outcontent = apply_step(
624 backend,
625 nextinputs,
626 action,
627 depth + 1,
628 copy.copy(depth_context),
629 lateral_context,
630 options,
631 )
632 assert isinstance(outcontent, tuple)
633 parameters = parameters_factory(nextparameters, len(outcontent))
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1100, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
1093 else:
1094 raise ValueError(
1095 "cannot broadcast: {}{}".format(
1096 ", ".join(repr(type(x)) for x in inputs), in_function(options)
1097 )
1098 )
-> 1100 result = action(
1101 inputs,
1102 depth=depth,
1103 depth_context=depth_context,
1104 lateral_context=lateral_context,
1105 continuation=continuation,
1106 backend=backend,
1107 options=options,
1108 )
1110 if isinstance(result, tuple) and all(isinstance(x, Content) for x in result):
1111 if any(content.backend is not backend for content in result):
File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:402, in array_ufunc.<locals>.action(inputs, **ignore)
397 # Do we have all-strings? If so, we can't proceed
398 if all(
399 x.is_list and x.parameter("__array__") in ("string", "bytestring")
400 for x in contents
401 ):
--> 402 raise TypeError(
403 f"{type(ufunc).__module__}.{ufunc.__name__} is not implemented for string types. "
404 "To register an implementation, add a name to these string(s) and register a behavior overload"
405 )
407 if ufunc is numpy.matmul:
408 raise NotImplementedError(
409 "matrix multiplication (`@` or `np.matmul`) is not yet implemented for Awkward Arrays"
410 )
TypeError: numpy.sqrt is not implemented for string types. To register an implementation, add a name to these string(s) and register a behavior overload
This error occurred while calling
numpy.sqrt.__call__(
<Array ['one', 'two', 'three', 'four'] type='4 * string'>
)
Categorical strings#
A large set of strings with few unique values are more efficiently manipulated as integers than as strings. In Pandas, this is categorical data, in R, it’s called a factor, and in Arrow and Parquet, it’s dictionary encoding.
The ak.str.to_categorical()
(requires PyArrow) function makes Awkward Arrays categorical in this sense. ak.to_arrow()
and ak.to_parquet()
recognize categorical data and convert it to the corresponding Arrow and Parquet types.
uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
['three', 'one', 'two', 'two', 'three', 'one', 'one', 'one'] ---------------- type: 8 * string
categorized = ak.str.to_categorical(uncategorized)
categorized
['three', 'one', 'two', 'two', 'three', 'one', 'one', 'one'] ---------------------------------- type: 8 * categorical[type=string]
Internally, the data now have an index that selects from a set of unique strings.
categorized.layout.index
<Index dtype='int64' len='8'>[0 1 2 2 0 1 1 1]</Index>
ak.Array(categorized.layout.content)
['three', 'one', 'two'] ---------------- type: 3 * string
The main advantage to Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is performed using the index integers.
categorized == "one"
[False, True, False, False, False, True, True, True] -------------- type: 8 * bool
With ArrayBuilder#
ak.ArrayBuilder()
is described in more detail in this tutorial, but you can add strings by calling the string
method or simply appending them.
(This is what ak.from_iter()
uses internally to accumulate data.)
builder = ak.ArrayBuilder()
builder.string("one")
builder.append("two")
builder.append("three")
array = builder.snapshot()
array
['one', 'two', 'three'] ---------------- type: 3 * string