Quick Start
Import the packages that we’ll use:
[1]:
import awkward as ak
import awkward_pandas as akpd
import numpy as np
import pandas as pd
Check what versions we have:
[2]:
for entry in (ak, akpd, np, pd):
    print(f"{entry.__name__:15} {entry.__version__}")
awkward 2.3.2
awkward_pandas 2023.8.1.dev2+g7abb99c
numpy 1.23.5
pandas 2.0.3
Make a simple awkward array:
[3]:
a = ak.from_iter([[1, 2, 3], [4, 5], [6]] * 5)
[4]:
a
[4]:
[[1, 2, 3], [4, 5], [6], [1, 2, 3], [4, 5], [6], [1, 2, 3], [4, 5], [6], [1, 2, 3], [4, 5], [6], [1, 2, 3], [4, 5], [6]] ---------------------- type: 15 * var * int64
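The type string `15 * var * int64` reads as: 15 entries, each a variable-length list of 64-bit integers. As a rough illustration, the plain-Python sketch below (no awkward needed; `nested` and `inner_lengths` are illustrative names) checks the same two properties on the equivalent nested list:

```python
# The same nested data that was passed to ak.from_iter.
nested = [[1, 2, 3], [4, 5], [6]] * 5

# Outer dimension: the leading "15" in the type string.
print(len(nested))  # 15

# Inner dimension: the lengths differ from row to row, hence "var".
inner_lengths = sorted({len(inner) for inner in nested})
print(inner_lengths)  # [1, 2, 3]
```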
We get a Series representation of the awkward array by using the awkward-pandas from_awkward function:
[5]:
s = akpd.from_awkward(a, name="a")
[6]:
s
[6]:
0 [1, 2, 3]
1 [4, 5]
2 [6]
3 [1, 2, 3]
4 [4, 5]
5 [6]
6 [1, 2, 3]
7 [4, 5]
8 [6]
9 [1, 2, 3]
10 [4, 5]
11 [6]
12 [1, 2, 3]
13 [4, 5]
14 [6]
Name: a, dtype: awkward
We can put the series in a DataFrame with another built-in pandas type, e.g. a column of integers:
[7]:
df = pd.DataFrame({"integers": np.arange(42, 42 + len(s)), "awkwardstuff": s})
[8]:
df
[8]:
| | integers | awkwardstuff |
---|---|---|
0 | 42 | [1, 2, 3] |
1 | 43 | [4, 5] |
2 | 44 | [6] |
3 | 45 | [1, 2, 3] |
4 | 46 | [4, 5] |
5 | 47 | [6] |
6 | 48 | [1, 2, 3] |
7 | 49 | [4, 5] |
8 | 50 | [6] |
9 | 51 | [1, 2, 3] |
10 | 52 | [4, 5] |
11 | 53 | [6] |
12 | 54 | [1, 2, 3] |
13 | 55 | [4, 5] |
14 | 56 | [6] |
With the DataFrame we can use the usual pandas operations. Here we query it on the column of integers, selecting rows where the integer is even:
[9]:
df.query("integers%2 == 0")
[9]:
| | integers | awkwardstuff |
---|---|---|
0 | 42 | [1, 2, 3] |
2 | 44 | [6] |
4 | 46 | [4, 5] |
6 | 48 | [1, 2, 3] |
8 | 50 | [6] |
10 | 52 | [4, 5] |
12 | 54 | [1, 2, 3] |
14 | 56 | [6] |
We can use DataFrame and Series methods:
[10]:
df.max()
[10]:
integers 56
awkwardstuff 6
dtype: int64
[11]:
df.mean()
[11]:
integers 49.0
awkwardstuff 3.5
dtype: float64
[12]:
df.awkwardstuff.min()
[12]:
1
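The three results above are consistent with each reduction running over every nested value in the awkwardstuff column at once, rather than row by row. A plain-Python sketch (no awkward or pandas needed; `rows` and `flat` are illustrative names) that reproduces the same numbers:

```python
# Same nested data as the `awkwardstuff` column: 15 variable-length lists.
rows = [[1, 2, 3], [4, 5], [6]] * 5

# Flatten one level so every integer is visible to the reduction.
flat = [value for row in rows for value in row]

print(max(flat))              # 6
print(sum(flat) / len(flat))  # 3.5
print(min(flat))              # 1
```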
To use functions from the awkward library, or to access the underlying awkward array directly, we use the ak accessor on the Series object:
[13]:
df.awkwardstuff.ak
[13]:
<awkward_pandas.accessor.AwkwardAccessor at 0x13ff18cd0>
Here we’ll use the accessor to show two different paths that provide the same numerical result represented with different objects:
[14]:
df.awkwardstuff.ak.min(axis=1)
[14]:
0 1
1 4
2 6
3 1
4 4
5 6
6 1
7 4
8 6
9 1
10 4
11 6
12 1
13 4
14 6
dtype: awkward
[15]:
ak.min(df.awkwardstuff.ak.array, axis=1)
[15]:
[1, 4, 6, 1, 4, 6, 1, 4, 6, 1, 4, 6, 1, 4, 6] ----------------- type: 15 * ?int64
In both cases we are calling the ak.min function with the argument axis=1. The difference:
In the first call we use the accessor on the pd.Series, and therefore a pd.Series is returned.
In the second call we access the underlying array directly, still via the ak accessor, but we then call the ak.min function directly, so an awkward Array object is returned.
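To make the axis=1 reduction concrete, here is a plain-Python sketch of a per-row minimum. The helper `row_min` is hypothetical and is not part of awkward or awkward-pandas. It returns `None` for an empty list, which mirrors why the awkward result type is the optional `?int64`: a row with no elements has no minimum.

```python
def row_min(row):
    """Minimum of one inner list, or None when the list is empty."""
    return min(row) if row else None


rows = [[1, 2, 3], [4, 5], [6]] * 5

# One result per inner list: this is what reducing along axis=1 means.
per_row = [row_min(row) for row in rows]
print(per_row)      # [1, 4, 6, 1, 4, 6, 1, 4, 6, 1, 4, 6, 1, 4, 6]
print(row_min([]))  # None
```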
The second path should be fairly rare when using awkward-pandas. The purpose of awkward-pandas is to plug Awkward Arrays into Pandas-like workflows. If you find yourself reaching for the second type of call, consider whether you actually need Pandas at all; you may be fine using Awkward Array alone. Of course, there will be occasional reasons to reach down to the underlying array, which is why we provide that interface.
In general, the ak accessor on a Series of awkward dtype can be used to leverage the awkward library while continuing to work with Series objects.
Let’s take a look at another small dataset, which contains some players’ names, their teams, and how many goals they’ve scored in each of the games they’ve appeared in. The number of games varies from player to player.
The raw data:
[16]:
data = """
- name: Bob\n  team: tigers\n  goals: [0, 0, 0, 1, 2, 0, 1]\n\n- name: Alice\n  team: bears\n  goals: [3, 2, 1, 0, 1]\n\n- name: Jack\n  team: bears\n  goals: [0, 0, 0, 0, 0, 0, 0, 0, 1]\n\n- name: Jill\n  team: bears\n  goals: [3, 0, 2]\n\n- name: Ted\n  team: tigers\n  goals: [0, 0, 0, 0, 0]\n\n- name: Ellen\n  team: tigers\n  goals: [1, 0, 0, 0, 2, 0, 1]\n\n- name: Dan\n  team: bears\n  goals: [0, 0, 3, 1, 0, 2, 0, 0]\n\n- name: Brad\n  team: bears\n  goals: [0, 0, 4, 0, 0, 1]\n\n- name: Nancy\n  team: tigers\n  goals: [0, 0, 1, 1, 1, 1, 0]\n\n- name: Lance\n  team: bears\n  goals: [1, 1, 1, 1, 1]\n\n- name: Sara\n  team: tigers\n  goals: [0, 1, 0, 2, 0, 3]\n\n- name: Ryan\n  team: tigers\n  goals: [1, 2, 3, 0, 0, 0, 0]\n
"""
The data in YAML format:
[17]:
print(data)
- name: Bob
  team: tigers
  goals: [0, 0, 0, 1, 2, 0, 1]

- name: Alice
  team: bears
  goals: [3, 2, 1, 0, 1]

- name: Jack
  team: bears
  goals: [0, 0, 0, 0, 0, 0, 0, 0, 1]

- name: Jill
  team: bears
  goals: [3, 0, 2]

- name: Ted
  team: tigers
  goals: [0, 0, 0, 0, 0]

- name: Ellen
  team: tigers
  goals: [1, 0, 0, 0, 2, 0, 1]

- name: Dan
  team: bears
  goals: [0, 0, 3, 1, 0, 2, 0, 0]

- name: Brad
  team: bears
  goals: [0, 0, 4, 0, 0, 1]

- name: Nancy
  team: tigers
  goals: [0, 0, 1, 1, 1, 1, 0]

- name: Lance
  team: bears
  goals: [1, 1, 1, 1, 1]

- name: Sara
  team: tigers
  goals: [0, 1, 0, 2, 0, 3]

- name: Ryan
  team: tigers
  goals: [1, 2, 3, 0, 0, 0, 0]
We’ll load it into a list of dictionaries and then convert it into an Awkward Array:
[18]:
import yaml
data = yaml.load(data, Loader=yaml.SafeLoader)
data = ak.Array(data)
[19]:
data
[19]:
[{name: 'Bob', team: 'tigers', goals: [0, 0, ..., 0, 1]}, {name: 'Alice', team: 'bears', goals: [3, 2, ..., 0, 1]}, {name: 'Jack', team: 'bears', goals: [0, 0, ..., 0, 1]}, {name: 'Jill', team: 'bears', goals: [3, 0, 2]}, {name: 'Ted', team: 'tigers', goals: [0, 0, ..., 0, 0]}, {name: 'Ellen', team: 'tigers', goals: [1, 0, ..., 0, 1]}, {name: 'Dan', team: 'bears', goals: [0, 0, ..., 0, 0]}, {name: 'Brad', team: 'bears', goals: [0, 0, ..., 0, 1]}, {name: 'Nancy', team: 'tigers', goals: [0, 0, ..., 1, 0]}, {name: 'Lance', team: 'bears', goals: [1, 1, ..., 1, 1]}, {name: 'Sara', team: 'tigers', goals: [0, 1, ..., 0, 3]}, {name: 'Ryan', team: 'tigers', goals: [1, 2, ..., 0, 0]}] ----------------------------------------------------------- type: 12 * { name: string, team: string, goals: var * int64 }
[20]:
s = akpd.from_awkward(data)
The dataset in Awkward Array form has three fields:
[21]:
data.fields
[21]:
['name', 'team', 'goals']
We can expand the Series into a DataFrame using the accessor’s to_columns method, where simple types (neither nested nor variable-length) are given their own columns:
[22]:
s.ak.to_columns()
[22]:
| | name | team | awkward-data |
---|---|---|---|
0 | Bob | tigers | {'goals': [0, 0, 0, 1, 2, 0, 1]} |
1 | Alice | bears | {'goals': [3, 2, 1, 0, 1]} |
2 | Jack | bears | {'goals': [0, 0, 0, 0, 0, 0, 0, 0, 1]} |
3 | Jill | bears | {'goals': [3, 0, 2]} |
4 | Ted | tigers | {'goals': [0, 0, 0, 0, 0]} |
5 | Ellen | tigers | {'goals': [1, 0, 0, 0, 2, 0, 1]} |
6 | Dan | bears | {'goals': [0, 0, 3, 1, 0, 2, 0, 0]} |
7 | Brad | bears | {'goals': [0, 0, 4, 0, 0, 1]} |
8 | Nancy | tigers | {'goals': [0, 0, 1, 1, 1, 1, 0]} |
9 | Lance | bears | {'goals': [1, 1, 1, 1, 1]} |
10 | Sara | tigers | {'goals': [0, 1, 0, 2, 0, 3]} |
11 | Ryan | tigers | {'goals': [1, 2, 3, 0, 0, 0, 0]} |
Notice that the name and team fields were just strings, one entry per element of the array. These have been turned into their own individual columns. The goals field was a variable-length list, so it remained an awkward type and is stored in a column with the default name “awkward-data”.
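The default splitting rule can be pictured with a plain-Python sketch. This is not how awkward-pandas implements to_columns; `split_fields` is a hypothetical helper that only separates fields holding one scalar per record from fields holding lists or nested records.

```python
def split_fields(records):
    """Return (simple, nested) field names for a list of dict records."""
    simple, nested = [], []
    for field in records[0]:
        values = [record[field] for record in records]
        # A field counts as nested if any record holds a list or dict for it.
        if any(isinstance(value, (list, dict)) for value in values):
            nested.append(field)
        else:
            simple.append(field)
    return simple, nested


# A three-record excerpt of the dataset, for illustration only.
records = [
    {"name": "Bob", "team": "tigers", "goals": [0, 0, 0, 1, 2, 0, 1]},
    {"name": "Alice", "team": "bears", "goals": [3, 2, 1, 0, 1]},
    {"name": "Jill", "team": "bears", "goals": [3, 0, 2]},
]

simple, nested = split_fields(records)
print(simple)  # ['name', 'team']
print(nested)  # ['goals']
```

In this picture, the simple fields become ordinary columns and the nested ones stay together in the remaining awkward column.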
to_columns has an extract_all argument that is False by default. If we set the argument to True, then all fields are extracted into columns:
[23]:
df = s.ak.to_columns(extract_all=True)
[24]:
df
[24]:
| | name | team | goals |
---|---|---|---|
0 | Bob | tigers | [0, 0, 0, 1, 2, 0, 1] |
1 | Alice | bears | [3, 2, 1, 0, 1] |
2 | Jack | bears | [0, 0, 0, 0, 0, 0, 0, 0, 1] |
3 | Jill | bears | [3, 0, 2] |
4 | Ted | tigers | [0, 0, 0, 0, 0] |
5 | Ellen | tigers | [1, 0, 0, 0, 2, 0, 1] |
6 | Dan | bears | [0, 0, 3, 1, 0, 2, 0, 0] |
7 | Brad | bears | [0, 0, 4, 0, 0, 1] |
8 | Nancy | tigers | [0, 0, 1, 1, 1, 1, 0] |
9 | Lance | bears | [1, 1, 1, 1, 1] |
10 | Sara | tigers | [0, 1, 0, 2, 0, 3] |
11 | Ryan | tigers | [1, 2, 3, 0, 0, 0, 0] |
Notice that the goals column is of type awkward:
[25]:
df.goals
[25]:
0 [0, 0, 0, 1, 2, 0, 1]
1 [3, 2, 1, 0, 1]
2 [0, 0, 0, 0, 0, 0, 0, 0, 1]
3 [3, 0, 2]
4 [0, 0, 0, 0, 0]
5 [1, 0, 0, 0, 2, 0, 1]
6 [0, 0, 3, 1, 0, 2, 0, 0]
7 [0, 0, 4, 0, 0, 1]
8 [0, 0, 1, 1, 1, 1, 0]
9 [1, 1, 1, 1, 1]
10 [0, 1, 0, 2, 0, 3]
11 [1, 2, 3, 0, 0, 0, 0]
Name: goals, dtype: awkward
We can use pure Pandas to investigate the dataset, but since Pandas doesn’t have a built-in ability to handle the nested structure of our goals column, we’re limited to some coarse information.
For example, we can group by team and see the average number of goals scored per game, pooled over all of each team’s players:
[26]:
df.set_index("name") \
.groupby("team", group_keys=True) \
.mean(numeric_only=True)
[26]:
team | goals |
---|---|
bears | 0.805556 |
tigers | 0.615385 |
But with awkward, we can group by the team name and see the average number of goals per game scored by each player:
[27]:
df.set_index("name") \
.groupby("team", group_keys=True) \
.apply(lambda x: x.goals.ak.mean(axis=1)) \
.sort_values(ascending=False)
[27]:
team name
bears Jill 1.666667
Alice 1.4
Lance 1.0
tigers Sara 1.0
Ryan 0.857143
bears Brad 0.833333
Dan 0.75
tigers Bob 0.571429
Ellen 0.571429
Nancy 0.571429
bears Jack 0.111111
tigers Ted 0.0
dtype: awkward
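The two group-by results answer different questions: the per-team figure pools every game played by a team’s players, while the per-player figure takes one mean per goals list. The plain-Python sketch below (no awkward or pandas needed; all variable names are illustrative) reproduces both from the raw data:

```python
# name -> (team, goals per game), copied from the dataset above.
goals = {
    "Bob": ("tigers", [0, 0, 0, 1, 2, 0, 1]),
    "Alice": ("bears", [3, 2, 1, 0, 1]),
    "Jack": ("bears", [0, 0, 0, 0, 0, 0, 0, 0, 1]),
    "Jill": ("bears", [3, 0, 2]),
    "Ted": ("tigers", [0, 0, 0, 0, 0]),
    "Ellen": ("tigers", [1, 0, 0, 0, 2, 0, 1]),
    "Dan": ("bears", [0, 0, 3, 1, 0, 2, 0, 0]),
    "Brad": ("bears", [0, 0, 4, 0, 0, 1]),
    "Nancy": ("tigers", [0, 0, 1, 1, 1, 1, 0]),
    "Lance": ("bears", [1, 1, 1, 1, 1]),
    "Sara": ("tigers", [0, 1, 0, 2, 0, 3]),
    "Ryan": ("tigers", [1, 2, 3, 0, 0, 0, 0]),
}

# Per-team average: pool every game played by the team's players.
pooled = {}
for team, games in goals.values():
    pooled.setdefault(team, []).extend(games)
team_mean = {team: sum(g) / len(g) for team, g in pooled.items()}

# Per-player average: one mean per inner list (the axis=1 reduction).
player_mean = {name: sum(g) / len(g) for name, (_, g) in goals.items()}
ranking = sorted(player_mean, key=player_mean.get, reverse=True)

print(round(team_mean["bears"], 6), round(team_mean["tigers"], 6))  # 0.805556 0.615385
print(ranking[0], ranking[-1])  # Jill Ted
```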
We can use the awkward data to determine how many games each player has appeared in:
[28]:
df["n_games"] = df.goals.ak.num(axis=1)
[29]:
df
[29]:
| | name | team | goals | n_games |
---|---|---|---|---|
0 | Bob | tigers | [0, 0, 0, 1, 2, 0, 1] | 7 |
1 | Alice | bears | [3, 2, 1, 0, 1] | 5 |
2 | Jack | bears | [0, 0, 0, 0, 0, 0, 0, 0, 1] | 9 |
3 | Jill | bears | [3, 0, 2] | 3 |
4 | Ted | tigers | [0, 0, 0, 0, 0] | 5 |
5 | Ellen | tigers | [1, 0, 0, 0, 2, 0, 1] | 7 |
6 | Dan | bears | [0, 0, 3, 1, 0, 2, 0, 0] | 8 |
7 | Brad | bears | [0, 0, 4, 0, 0, 1] | 6 |
8 | Nancy | tigers | [0, 0, 1, 1, 1, 1, 0] | 7 |
9 | Lance | bears | [1, 1, 1, 1, 1] | 5 |
10 | Sara | tigers | [0, 1, 0, 2, 0, 3] | 6 |
11 | Ryan | tigers | [1, 2, 3, 0, 0, 0, 0] | 7 |
We can convert the entire DataFrame back to a Series of type awkward with the merge function:
[30]:
s = akpd.merge(df)
[31]:
s
[31]:
0 {'name': 'Bob', 'team': 'tigers', 'goals': [0,...
1 {'name': 'Alice', 'team': 'bears', 'goals': [3...
2 {'name': 'Jack', 'team': 'bears', 'goals': [0,...
3 {'name': 'Jill', 'team': 'bears', 'goals': [3,...
4 {'name': 'Ted', 'team': 'tigers', 'goals': [0,...
5 {'name': 'Ellen', 'team': 'tigers', 'goals': [...
6 {'name': 'Dan', 'team': 'bears', 'goals': [0, ...
7 {'name': 'Brad', 'team': 'bears', 'goals': [0,...
8 {'name': 'Nancy', 'team': 'tigers', 'goals': [...
9 {'name': 'Lance', 'team': 'bears', 'goals': [1...
10 {'name': 'Sara', 'team': 'tigers', 'goals': [0...
11 {'name': 'Ryan', 'team': 'tigers', 'goals': [1...
dtype: awkward
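Conceptually, merging turns each DataFrame row into one record whose fields are the column names. The plain-Python sketch below illustrates that idea on a three-row excerpt; it is only an analogy for the record layout, not the awkward-pandas implementation, and `columns` and `records` are illustrative names.

```python
# Column-oriented data, as in a DataFrame (three-row excerpt).
columns = {
    "name": ["Bob", "Alice", "Jill"],
    "team": ["tigers", "bears", "bears"],
    "goals": [[0, 0, 0, 1, 2, 0, 1], [3, 2, 1, 0, 1], [3, 0, 2]],
}

# Derived column: the number of games is the length of each goals list.
columns["n_games"] = [len(games) for games in columns["goals"]]

# Row-oriented records: one dict per row, one field per column.
records = [dict(zip(columns, row)) for row in zip(*columns.values())]
print(records[0])
# {'name': 'Bob', 'team': 'tigers', 'goals': [0, 0, 0, 1, 2, 0, 1], 'n_games': 7}
```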
And go back to pure awkward (now with our new n_games column) using the accessor:
[32]:
s.ak.array
[32]:
[{name: 'Bob', team: 'tigers', goals: [0, 0, ..., 0, 1], n_games: 7}, {name: 'Alice', team: 'bears', goals: [3, 2, ..., 0, 1], n_games: 5}, {name: 'Jack', team: 'bears', goals: [0, 0, ..., 0, 1], n_games: 9}, {name: 'Jill', team: 'bears', goals: [3, 0, 2], n_games: 3}, {name: 'Ted', team: 'tigers', goals: [0, 0, ..., 0, 0], n_games: 5}, {name: 'Ellen', team: 'tigers', goals: [1, 0, ..., 0, 1], n_games: 7}, {name: 'Dan', team: 'bears', goals: [0, 0, ..., 0, 0], n_games: 8}, {name: 'Brad', team: 'bears', goals: [0, 0, ..., 0, 1], n_games: 6}, {name: 'Nancy', team: 'tigers', goals: [0, 0, ..., 1, 0], n_games: 7}, {name: 'Lance', team: 'bears', goals: [1, 1, ..., 1, 1], n_games: 5}, {name: 'Sara', team: 'tigers', goals: [0, 1, ..., 0, 3], n_games: 6}, {name: 'Ryan', team: 'tigers', goals: [1, 2, ..., 0, 0], n_games: 7}] ----------------------------------------------------------------------- type: 12 * { name: string, team: string, goals: var * int64, n_games: int64 }
[33]:
s.ak.array.fields
[33]:
['name', 'team', 'goals', 'n_games']
[34]:
s.ak.array.n_games
[34]:
[7, 5, 9, 3, 5, 7, 8, 6, 7, 5, 6, 7] ---------------- type: 12 * int64