Quick Start

Import the packages that we’ll use:

[1]:
import awkward as ak
import awkward_pandas as akpd
import numpy as np
import pandas as pd

Check what versions we have:

[2]:
for entry in (ak, akpd, np, pd):
    print(f"{entry.__name__:15} {entry.__version__}")
awkward         2.3.2
awkward_pandas  2023.8.1.dev2+g7abb99c
numpy           1.23.5
pandas          2.0.3

Make a simple awkward array:

[3]:
a = ak.from_iter([[1, 2, 3], [4, 5], [6]] * 5)
[4]:
a
[4]:
[[1, 2, 3],
 [4, 5],
 [6],
 [1, 2, 3],
 [4, 5],
 [6],
 [1, 2, 3],
 [4, 5],
 [6],
 [1, 2, 3],
 [4, 5],
 [6],
 [1, 2, 3],
 [4, 5],
 [6]]
----------------------
type: 15 * var * int64

We get a Series representation of the awkward array by using the awkward-pandas from_awkward function:

[5]:
s = akpd.from_awkward(a, name="a")
[6]:
s
[6]:
0     [1, 2, 3]
1        [4, 5]
2           [6]
3     [1, 2, 3]
4        [4, 5]
5           [6]
6     [1, 2, 3]
7        [4, 5]
8           [6]
9     [1, 2, 3]
10       [4, 5]
11          [6]
12    [1, 2, 3]
13       [4, 5]
14          [6]
Name: a, dtype: awkward

We can put the Series in a DataFrame alongside a column of a built-in pandas type, e.g. a column of integers:

[7]:
df = pd.DataFrame({"integers": np.arange(42, 42 + len(s)), "awkwardstuff": s})
[8]:
df
[8]:
    integers  awkwardstuff
0   42        [1, 2, 3]
1   43        [4, 5]
2   44        [6]
3   45        [1, 2, 3]
4   46        [4, 5]
5   47        [6]
6   48        [1, 2, 3]
7   49        [4, 5]
8   50        [6]
9   51        [1, 2, 3]
10  52        [4, 5]
11  53        [6]
12  54        [1, 2, 3]
13  55        [4, 5]
14  56        [6]

With the DataFrame in hand, we can perform the usual pandas operations. Here we query the DataFrame on the column of integers, selecting the rows where the integer is even:

[9]:
df.query("integers%2 == 0")
[9]:
    integers  awkwardstuff
0   42        [1, 2, 3]
2   44        [6]
4   46        [4, 5]
6   48        [1, 2, 3]
8   50        [6]
10  52        [4, 5]
12  54        [1, 2, 3]
14  56        [6]
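The query above can also be written as a boolean mask. The sketch below uses pandas and NumPy only, with plain Python lists standing in for the awkward column (an assumption made to keep the example self-contained; the column name `stuff` is illustrative).

```python
import numpy as np
import pandas as pd

# Plain Python lists stand in for the awkward column here (object dtype).
rows = [[1, 2, 3], [4, 5], [6]] * 5
df = pd.DataFrame({"integers": np.arange(42, 42 + len(rows)), "stuff": rows})

# The query string and the boolean mask select the same rows.
by_query = df.query("integers % 2 == 0")
by_mask = df[df["integers"] % 2 == 0]

print(by_query.index.tolist())
print(by_mask.index.tolist())
```

Either form leaves the nested column untouched; only the integer column takes part in the selection.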

We can use DataFrame and Series methods:

[10]:
df.max()
[10]:
integers        56
awkwardstuff     6
dtype: int64
[11]:
df.mean()
[11]:
integers        49.0
awkwardstuff     3.5
dtype: float64
[12]:
df.awkwardstuff.min()
[12]:
1

To use functions from the awkward library, or to access the underlying awkward array directly, we use the ak accessor on the Series object:

[13]:
df.awkwardstuff.ak
[13]:
<awkward_pandas.accessor.AwkwardAccessor at 0x13ff18cd0>

Here we’ll use the accessor to show two different paths that give the same numerical result, represented by different objects:

[14]:
df.awkwardstuff.ak.min(axis=1)
[14]:
0     1
1     4
2     6
3     1
4     4
5     6
6     1
7     4
8     6
9     1
10    4
11    6
12    1
13    4
14    6
dtype: awkward
[15]:
ak.min(df.awkwardstuff.ak.array, axis=1)
[15]:
[1,
 4,
 6,
 1,
 4,
 6,
 1,
 4,
 6,
 1,
 4,
 6,
 1,
 4,
 6]
-----------------
type: 15 * ?int64

In both cases we are calling the ak.min function with the argument axis=1. The difference is the type of object returned:

  1. In the first call we use the accessor on the pd.Series, and therefore a pd.Series is returned.

  2. In the second call we access the underlying array, still via the ak accessor, and then call the ak.min function on it directly, so an awkward Array object is returned.

The second path should be somewhat rare when using awkward-pandas. The purpose of awkward-pandas is to plug awkward arrays into Pandas-like workflows. If you find yourself reaching for the second type of call, consider whether you actually need Pandas at all; you may be fine using awkward-array on its own. Of course, there will be occasional reasons to reach down to the underlying array, which is why we provide that interface.

In general, the ak accessor on a Series of awkward dtype can be used to leverage the awkward library while continuing to work with Series objects.

Let’s take a look at another small dataset, which contains some players’ names, their teams, and the number of goals they scored in each of a variable number of games.

The raw data:

[16]:
data = """
- name: Bob\n  team: tigers\n  goals: [0, 0, 0, 1, 2, 0, 1]\n\n- name: Alice\n  team: bears\n  goals: [3, 2, 1, 0, 1]\n\n- name: Jack\n  team: bears\n  goals: [0, 0, 0, 0, 0, 0, 0, 0, 1]\n\n- name: Jill\n  team: bears\n  goals: [3, 0, 2]\n\n- name: Ted\n  team: tigers\n  goals: [0, 0, 0, 0, 0]\n\n- name: Ellen\n  team: tigers\n  goals: [1, 0, 0, 0, 2, 0, 1]\n\n- name: Dan\n  team: bears\n  goals: [0, 0, 3, 1, 0, 2, 0, 0]\n\n- name: Brad\n  team: bears\n  goals: [0, 0, 4, 0, 0, 1]\n\n- name: Nancy\n  team: tigers\n  goals: [0, 0, 1, 1, 1, 1, 0]\n\n- name: Lance\n  team: bears\n  goals: [1, 1, 1, 1, 1]\n\n- name: Sara\n  team: tigers\n  goals: [0, 1, 0, 2, 0, 3]\n\n- name: Ryan\n  team: tigers\n  goals: [1, 2, 3, 0, 0, 0, 0]\n
"""

The data in YAML format:

[17]:
print(data)
- name: Bob
  team: tigers
  goals: [0, 0, 0, 1, 2, 0, 1]

- name: Alice
  team: bears
  goals: [3, 2, 1, 0, 1]

- name: Jack
  team: bears
  goals: [0, 0, 0, 0, 0, 0, 0, 0, 1]

- name: Jill
  team: bears
  goals: [3, 0, 2]

- name: Ted
  team: tigers
  goals: [0, 0, 0, 0, 0]

- name: Ellen
  team: tigers
  goals: [1, 0, 0, 0, 2, 0, 1]

- name: Dan
  team: bears
  goals: [0, 0, 3, 1, 0, 2, 0, 0]

- name: Brad
  team: bears
  goals: [0, 0, 4, 0, 0, 1]

- name: Nancy
  team: tigers
  goals: [0, 0, 1, 1, 1, 1, 0]

- name: Lance
  team: bears
  goals: [1, 1, 1, 1, 1]

- name: Sara
  team: tigers
  goals: [0, 1, 0, 2, 0, 3]

- name: Ryan
  team: tigers
  goals: [1, 2, 3, 0, 0, 0, 0]


We’ll load it into a list of dictionaries and then convert it into an Awkward Array:

[18]:
import yaml

data = yaml.load(data, Loader=yaml.SafeLoader)
data = ak.Array(data)
[19]:
data
[19]:
[{name: 'Bob', team: 'tigers', goals: [0, 0, ..., 0, 1]},
 {name: 'Alice', team: 'bears', goals: [3, 2, ..., 0, 1]},
 {name: 'Jack', team: 'bears', goals: [0, 0, ..., 0, 1]},
 {name: 'Jill', team: 'bears', goals: [3, 0, 2]},
 {name: 'Ted', team: 'tigers', goals: [0, 0, ..., 0, 0]},
 {name: 'Ellen', team: 'tigers', goals: [1, 0, ..., 0, 1]},
 {name: 'Dan', team: 'bears', goals: [0, 0, ..., 0, 0]},
 {name: 'Brad', team: 'bears', goals: [0, 0, ..., 0, 1]},
 {name: 'Nancy', team: 'tigers', goals: [0, 0, ..., 1, 0]},
 {name: 'Lance', team: 'bears', goals: [1, 1, ..., 1, 1]},
 {name: 'Sara', team: 'tigers', goals: [0, 1, ..., 0, 3]},
 {name: 'Ryan', team: 'tigers', goals: [1, 2, ..., 0, 0]}]
-----------------------------------------------------------
type: 12 * {
    name: string,
    team: string,
    goals: var * int64
}
[20]:
s = akpd.from_awkward(data)

The dataset in Awkward Array form has three fields:

[21]:
data.fields
[21]:
['name', 'team', 'goals']

We can expand the Series into a DataFrame using the accessor’s to_columns method, where fields of simple type (neither nested nor variable-length) are given their own column:

[22]:
s.ak.to_columns()
[22]:
    name   team    awkward-data
0   Bob    tigers  {'goals': [0, 0, 0, 1, 2, 0, 1]}
1   Alice  bears   {'goals': [3, 2, 1, 0, 1]}
2   Jack   bears   {'goals': [0, 0, 0, 0, 0, 0, 0, 0, 1]}
3   Jill   bears   {'goals': [3, 0, 2]}
4   Ted    tigers  {'goals': [0, 0, 0, 0, 0]}
5   Ellen  tigers  {'goals': [1, 0, 0, 0, 2, 0, 1]}
6   Dan    bears   {'goals': [0, 0, 3, 1, 0, 2, 0, 0]}
7   Brad   bears   {'goals': [0, 0, 4, 0, 0, 1]}
8   Nancy  tigers  {'goals': [0, 0, 1, 1, 1, 1, 0]}
9   Lance  bears   {'goals': [1, 1, 1, 1, 1]}
10  Sara   tigers  {'goals': [0, 1, 0, 2, 0, 3]}
11  Ryan   tigers  {'goals': [1, 2, 3, 0, 0, 0, 0]}

Notice that the name and team fields were just strings, one entry per element of the array, so each has been turned into its own column. The goals field was a variable-length list, so it remained an awkward type and is stored in a column with the default name “awkward-data”.

to_columns has an extract_all argument that is False by default. If we set the argument to True, then all fields are extracted into their own columns:

[23]:
df = s.ak.to_columns(extract_all=True)
[24]:
df
[24]:
    name   team    goals
0   Bob    tigers  [0, 0, 0, 1, 2, 0, 1]
1   Alice  bears   [3, 2, 1, 0, 1]
2   Jack   bears   [0, 0, 0, 0, 0, 0, 0, 0, 1]
3   Jill   bears   [3, 0, 2]
4   Ted    tigers  [0, 0, 0, 0, 0]
5   Ellen  tigers  [1, 0, 0, 0, 2, 0, 1]
6   Dan    bears   [0, 0, 3, 1, 0, 2, 0, 0]
7   Brad   bears   [0, 0, 4, 0, 0, 1]
8   Nancy  tigers  [0, 0, 1, 1, 1, 1, 0]
9   Lance  bears   [1, 1, 1, 1, 1]
10  Sara   tigers  [0, 1, 0, 2, 0, 3]
11  Ryan   tigers  [1, 2, 3, 0, 0, 0, 0]

Notice that the goals column is of type awkward:

[25]:
df.goals
[25]:
0           [0, 0, 0, 1, 2, 0, 1]
1                 [3, 2, 1, 0, 1]
2     [0, 0, 0, 0, 0, 0, 0, 0, 1]
3                       [3, 0, 2]
4                 [0, 0, 0, 0, 0]
5           [1, 0, 0, 0, 2, 0, 1]
6        [0, 0, 3, 1, 0, 2, 0, 0]
7              [0, 0, 4, 0, 0, 1]
8           [0, 0, 1, 1, 1, 1, 0]
9                 [1, 1, 1, 1, 1]
10             [0, 1, 0, 2, 0, 3]
11          [1, 2, 3, 0, 0, 0, 0]
Name: goals, dtype: awkward

We can use pure Pandas to investigate the dataset, but since Pandas doesn’t have a built-in ability to handle the nested structure of our goals column, we’re limited to some coarse information.

For example, we can group by team and see the average number of goals scored per game by each team:

[26]:
df.set_index("name") \
  .groupby("team", group_keys=True) \
  .mean(numeric_only=True)
[26]:
        goals
team
bears   0.805556
tigers  0.615385
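The bears figure above can be checked by hand: it equals the team’s total goals divided by the team’s total number of games played. A pure-Python check using the bears rows copied from the dataset:

```python
# Goals per game for each bears player, copied from the dataset above.
bears = {
    "Alice": [3, 2, 1, 0, 1],
    "Jack": [0, 0, 0, 0, 0, 0, 0, 0, 1],
    "Jill": [3, 0, 2],
    "Dan": [0, 0, 3, 1, 0, 2, 0, 0],
    "Brad": [0, 0, 4, 0, 0, 1],
    "Lance": [1, 1, 1, 1, 1],
}

# Pool every game played by the team, then average.
total_goals = sum(sum(g) for g in bears.values())
total_games = sum(len(g) for g in bears.values())
team_mean = total_goals / total_games

print(total_goals, total_games, round(team_mean, 6))  # 29 36 0.805556
```

So the grouped mean weights every game equally, rather than averaging the per-player averages.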

But with awkward, we can group by the team name and see the average number of goals scored per game by each player:

[27]:
df.set_index("name") \
  .groupby("team", group_keys=True) \
  .apply(lambda x: x.goals.ak.mean(axis=1)) \
  .sort_values(ascending=False)
[27]:
team    name
bears   Jill     1.666667
        Alice         1.4
        Lance         1.0
tigers  Sara          1.0
        Ryan     0.857143
bears   Brad     0.833333
        Dan          0.75
tigers  Bob      0.571429
        Ellen    0.571429
        Nancy    0.571429
bears   Jack     0.111111
tigers  Ted           0.0
dtype: awkward

We can use the awkward data to determine how many games each player has appeared in:

[28]:
df["n_games"] = df.goals.ak.num(axis=1)
[29]:
df
[29]:
    name   team    goals                        n_games
0   Bob    tigers  [0, 0, 0, 1, 2, 0, 1]        7
1   Alice  bears   [3, 2, 1, 0, 1]              5
2   Jack   bears   [0, 0, 0, 0, 0, 0, 0, 0, 1]  9
3   Jill   bears   [3, 0, 2]                    3
4   Ted    tigers  [0, 0, 0, 0, 0]              5
5   Ellen  tigers  [1, 0, 0, 0, 2, 0, 1]        7
6   Dan    bears   [0, 0, 3, 1, 0, 2, 0, 0]     8
7   Brad   bears   [0, 0, 4, 0, 0, 1]           6
8   Nancy  tigers  [0, 0, 1, 1, 1, 1, 0]        7
9   Lance  bears   [1, 1, 1, 1, 1]              5
10  Sara   tigers  [0, 1, 0, 2, 0, 3]           6
11  Ryan   tigers  [1, 2, 3, 0, 0, 0, 0]        7

We can convert the entire dataframe back to a Series of type awkward with the merge function:

[30]:
s = akpd.merge(df)
[31]:
s
[31]:
0     {'name': 'Bob', 'team': 'tigers', 'goals': [0,...
1     {'name': 'Alice', 'team': 'bears', 'goals': [3...
2     {'name': 'Jack', 'team': 'bears', 'goals': [0,...
3     {'name': 'Jill', 'team': 'bears', 'goals': [3,...
4     {'name': 'Ted', 'team': 'tigers', 'goals': [0,...
5     {'name': 'Ellen', 'team': 'tigers', 'goals': [...
6     {'name': 'Dan', 'team': 'bears', 'goals': [0, ...
7     {'name': 'Brad', 'team': 'bears', 'goals': [0,...
8     {'name': 'Nancy', 'team': 'tigers', 'goals': [...
9     {'name': 'Lance', 'team': 'bears', 'goals': [1...
10    {'name': 'Sara', 'team': 'tigers', 'goals': [0...
11    {'name': 'Ryan', 'team': 'tigers', 'goals': [1...
dtype: awkward

And we can go back to pure awkward (now with our new n_games column) using the accessor:

[32]:
s.ak.array
[32]:
[{name: 'Bob', team: 'tigers', goals: [0, 0, ..., 0, 1], n_games: 7},
 {name: 'Alice', team: 'bears', goals: [3, 2, ..., 0, 1], n_games: 5},
 {name: 'Jack', team: 'bears', goals: [0, 0, ..., 0, 1], n_games: 9},
 {name: 'Jill', team: 'bears', goals: [3, 0, 2], n_games: 3},
 {name: 'Ted', team: 'tigers', goals: [0, 0, ..., 0, 0], n_games: 5},
 {name: 'Ellen', team: 'tigers', goals: [1, 0, ..., 0, 1], n_games: 7},
 {name: 'Dan', team: 'bears', goals: [0, 0, ..., 0, 0], n_games: 8},
 {name: 'Brad', team: 'bears', goals: [0, 0, ..., 0, 1], n_games: 6},
 {name: 'Nancy', team: 'tigers', goals: [0, 0, ..., 1, 0], n_games: 7},
 {name: 'Lance', team: 'bears', goals: [1, 1, ..., 1, 1], n_games: 5},
 {name: 'Sara', team: 'tigers', goals: [0, 1, ..., 0, 3], n_games: 6},
 {name: 'Ryan', team: 'tigers', goals: [1, 2, ..., 0, 0], n_games: 7}]
-----------------------------------------------------------------------
type: 12 * {
    name: string,
    team: string,
    goals: var * int64,
    n_games: int64
}
[33]:
s.ak.array.fields
[33]:
['name', 'team', 'goals', 'n_games']
[34]:
s.ak.array.n_games
[34]:
[7,
 5,
 9,
 3,
 5,
 7,
 8,
 6,
 7,
 5,
 6,
 7]
----------------
type: 12 * int64