How to create arrays of strings
Awkward Arrays can contain strings, which are implemented as a special view of variable-length lists of uint8 numbers (the UTF-8 bytes of each string). As a result, strings of different lengths are stored compactly, without padding.
By contrast, NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. Awkward Array has far fewer functions for manipulating arrays of strings than NumPy and Pandas do, though.
import awkward as ak
import numpy as np
From Python strings
The ak.Array constructor and ak.from_iter() recognize strings, and strings are returned by ak.to_list().
ak.Array(["one", "two", "three"])
['one', 'two', 'three'] --------- backend: cpu nbytes: 43 B type: 3 * string
They may be nested within anything.
ak.Array([["one", "two"], [], ["three"]])
[['one', 'two'], [], ['three']] ---------------- backend: cpu nbytes: 75 B type: 3 * var * string
From NumPy arrays
NumPy arrays of strings are recognized by ak.from_numpy() and returned by ak.to_numpy(). The ak.Array constructor accepts them as well.
numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
['one', 'two', 'three', 'four'] --------- backend: cpu nbytes: 84 B type: 4 * string
Operations with strings
Since strings are really just lists, some of the list operations “just work” on strings.
ak.num(awkward_array)
[3, 3, 5, 4] --- backend: cpu nbytes: 32 B type: 4 * int64
awkward_array[:, 1:]
['ne', 'wo', 'hree', 'our'] -------- backend: cpu nbytes: 51 B type: 4 * string
Other operations had to be specially overloaded for strings, such as equality. Without the overload, == would descend to the lowest level and compare numbers (individual characters, in this case).
awkward_array == "three"
[False, False, True, False] ------- backend: cpu nbytes: 4 B type: 4 * bool
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
[False, False, True, True] ------- backend: cpu nbytes: 4 B type: 4 * bool
Similarly, ak.sort() and ak.argsort() sort whole strings lexicographically, rather than sorting the characters within each string.
ak.sort(awkward_array)
['four', 'one', 'three', 'two'] --------- backend: cpu nbytes: 79 B type: 4 * string
Still other operations had to be inhibited, since they wouldn’t make sense for strings.
np.sqrt(awkward_array)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.sqrt(awkward_array)
TypeError: numpy.sqrt is not implemented for string types. To register an implementation, add a name to these string(s) and register a behavior overload
This error occurred while calling
numpy.sqrt.__call__(
<Array ['one', 'two', 'three', 'four'] type='4 * string'>
)
Categorical strings
A large set of strings with few unique values is more efficiently manipulated as integers than as strings. Pandas calls this categorical data, R calls it a factor, and Arrow and Parquet call it dictionary encoding.
The ak.str.to_categorical() function (which requires PyArrow) makes Awkward Arrays categorical in this sense. ak.to_arrow() and ak.to_parquet() recognize categorical data and convert it to the corresponding Arrow and Parquet types.
uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
['three', 'one', 'two', 'two', 'three', 'one', 'one', 'one'] --------- backend: cpu nbytes: 100 B type: 8 * string
categorized = ak.str.to_categorical(uncategorized)
categorized
['three', 'one', 'two', 'two', 'three', 'one', 'one', 'one'] --------- backend: cpu nbytes: 107 B type: 8 * categorical[type=string]
Internally, the data now have an index that selects from a set of unique strings.
categorized.layout.index
<Index dtype='int64' len='8'>[0 1 2 2 0 1 1 1]</Index>
ak.Array(categorized.layout.content)
['three', 'one', 'two'] --------- backend: cpu nbytes: 43 B type: 3 * string
The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is computed by comparing the index integers rather than the strings themselves.
categorized == "one"
[False, True, False, False, False, True, True, True] ------- backend: cpu nbytes: 8 B type: 8 * bool
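The idea behind index-based equality can be illustrated with plain NumPy. This is a conceptual sketch of dictionary encoding, not Awkward's implementation: np.unique sorts the unique values, whereas the Awkward index shown above follows order of first appearance.

```python
import numpy as np

values = np.array(["three", "one", "two", "two", "three", "one", "one", "one"])

# np.unique returns the sorted unique strings and, for each element,
# the position of its value in that sorted list.
categories, index = np.unique(values, return_inverse=True)

# Look up the integer code for "one" once, then compare integers only.
code_for_one = int(np.flatnonzero(categories == "one")[0])
is_one = index == code_for_one

# The original strings can be rebuilt from the index and the categories.
rebuilt = categories[index]
print(is_one.tolist())
```

Only one string comparison per unique value is needed; the rest of the work is integer comparison, which is the saving that categorical data provides.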
With ArrayBuilder
ak.ArrayBuilder() is described in more detail in this tutorial, but you can add strings by calling the string method or simply appending them.
(This is what ak.from_iter() uses internally to accumulate data.)
builder = ak.ArrayBuilder()
builder.string("one")
builder.append("two")
builder.append("three")
array = builder.snapshot()
array
['one', 'two', 'three'] --------- backend: cpu nbytes: 43 B type: 3 * string