Pandas Basics

Much like the Numpy Basics Overview, this will be a quite terse overview of the package. The focus will be on simple and common use cases such as importing the package, creating and reading data files, shaping it, transforming it, etc. Pandas has a very simply written documentation that’s easy to read and access (especially on a Jupyter notebook).

Import Pandas And Read Data

import pandas as pd

nba_players = pd.read_csv('nba_players.csv')
nba_players.head()
player team pos number
0 K. Durant GS Warriors SF 35
1 J. Harden Houston Rockets SG 13
2 K.A Towns Minnesota Timberwolves C 32
3 Kyrie Irving Boston Celtics PG 11
4 Anthony Davis NO Pelicans PF 23

Voilá! Look how it pretty it looks (on this notebook, but hopefully my css table stylings do it justice). Everything is arranged in a nicely formatted table with rows, columns, and headings.

Attributes & Methods For Simple Exploring

In most scenarios when we first load data we take explore it. This is to get a sense of what analyses we can perform on it as well as to see if it’s tidy or requires further cleaning on our part. In this very simple dataset there’s not much to explore and nothing to tidy, but regardless of that, let’s see some methods.

first = nba_players.head(1)
sample = nba_players.sample(1)
print(first)
print(sample)
      player         team pos  number
0  K. Durant  GS Warriors  SF      35
      player         team pos  number
0  K. Durant  GS Warriors  SF      35
nba_players.columns
Index(['player', 'team', 'pos', 'number'], dtype='object')
nba_players.describe
<bound method NDFrame.describe of           player                    team pos  number
0      K. Durant             GS Warriors  SF      35
1      J. Harden         Houston Rockets  SG      13
2      K.A Towns  Minnesota Timberwolves   C      32
3   Kyrie Irving          Boston Celtics  PG      11
4  Anthony Davis             NO Pelicans  PF      23>
# General info about column names, type of data, length of data, and memory usage
nba_players.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
player    5 non-null object
team      5 non-null object
pos       5 non-null object
number    5 non-null int64
dtypes: int64(1), object(3)
memory usage: 240.0+ bytes

Add New Data

Merge DataFrames

# New DataFrame with same heading and type of data as initial nba_players 
nba_2 = [
    {'player': 'Kawhi Leonard', 'team': 'SA Spurs', 'pos': 'SF', 'number': 2},
    {'player': 'Damian Lilard', 'team': 'Portland Trailblazers', 'pos': 'PG', 'number': 0},
    {'player': 'Klay Thompson', 'team': 'GS Warriors', 'pos': 'SG', 'number': 11},
    {'player': 'Rudy Gobert', 'team': 'Utah Jazz', 'pos': 'C', 'number': 27},
    {'player': 'Kevin Love', 'team': 'Cleveland Cavaliers', 'pos': 'PF', 'number': 0},
]

nba_players_2 = pd.DataFrame(nba_2)
nba_players_2
number player pos team
0 2 Kawhi Leonard SF SA Spurs
1 0 Damian Lilard PG Portland Trailblazers
2 11 Klay Thompson SG GS Warriors
3 27 Rudy Gobert C Utah Jazz
4 0 Kevin Love PF Cleveland Cavaliers

The merging process is similar to SQL JOINs.

# right returns only the new dataframe
nba_players.merge(nba_players_2, how='right')
player team pos number
0 Kawhi Leonard SA Spurs SF 2
1 Damian Lilard Portland Trailblazers PG 0
2 Klay Thompson GS Warriors SG 11
3 Rudy Gobert Utah Jazz C 27
4 Kevin Love Cleveland Cavaliers PF 0
# left returns only the initial dataframe
nba_players.merge(nba_players_2, how='left')
player team pos number
0 K. Durant GS Warriors SF 35
1 J. Harden Houston Rockets SG 13
2 K.A Towns Minnesota Timberwolves C 32
3 Kyrie Irving Boston Celtics PG 11
4 Anthony Davis NO Pelicans PF 23
# outer returns new dataframe of combined initial and new dataframe
nba_players = nba_players.merge(nba_players_2, how='outer')
nba_players
player team pos number
0 K. Durant GS Warriors SF 35
1 J. Harden Houston Rockets SG 13
2 K.A Towns Minnesota Timberwolves C 32
3 Kyrie Irving Boston Celtics PG 11
4 Anthony Davis NO Pelicans PF 23
5 Kawhi Leonard SA Spurs SF 2
6 Damian Lilard Portland Trailblazers PG 0
7 Klay Thompson GS Warriors SG 11
8 Rudy Gobert Utah Jazz C 27
9 Kevin Love Cleveland Cavaliers PF 0

Concat DataFrames

nba_misc_details = [
    {'coach': 'Steve Kerr', 'arena': 'Oracle Arena', 'championships': 5},
    {'coach': 'Mike D\'Antoni', 'arena': 'Toyota Center', 'championships': 2},
    {'coach': 'Tom Thibodeau', 'arena': 'Target Center', 'championships': 0},
    {'coach': 'Brad Stevens', 'arena': 'TD Center', 'championships': 17},
    {'coach': 'Alvin Gentry', 'arena': 'Smoothie King Center', 'championships': 0},
    {'coach': 'Gregg Popovich', 'arena':'AT&T Center', 'championships': 5},
    {'coach': 'Terry Stotts', 'arena': 'Moda Center', 'championships': 1},
    {'coach': 'Steve Kerr', 'arena': 'Oracle Arena', 'championships': 5},
    {'coach': 'Quin Snyder', 'arena': 'Vivint Smart Home Arena', 'championships': 0},
    {'coach': 'Tyronn Lue', 'arena': 'Quicken Loans Arena', 'championships': 1},
]

nba_misc = pd.DataFrame(nba_misc_details)

# concat behaves like an append, attaching the new dataframe either as row data or column data
nba_players_details = pd.concat([nba_players, nba_misc], axis=1)
nba_players_details
player team pos number arena championships coach
0 K. Durant GS Warriors SF 35 Oracle Arena 5 Steve Kerr
1 J. Harden Houston Rockets SG 13 Toyota Center 2 Mike D'Antoni
2 K.A Towns Minnesota Timberwolves C 32 Target Center 0 Tom Thibodeau
3 Kyrie Irving Boston Celtics PG 11 TD Center 17 Brad Stevens
4 Anthony Davis NO Pelicans PF 23 Smoothie King Center 0 Alvin Gentry
5 Kawhi Leonard SA Spurs SF 2 AT&T Center 5 Gregg Popovich
6 Damian Lilard Portland Trailblazers PG 0 Moda Center 1 Terry Stotts
7 Klay Thompson GS Warriors SG 11 Oracle Arena 5 Steve Kerr
8 Rudy Gobert Utah Jazz C 27 Vivint Smart Home Arena 0 Quin Snyder
9 Kevin Love Cleveland Cavaliers PF 0 Quicken Loans Arena 1 Tyronn Lue

Edit Entire Column

It seems that the we have attributed to the players the wrong championships information. Those values are actually the total championships for the teams. We would actually like that column to be for individual championships.

nba_players_details['championships'] = [1, 0, 0, 1, 0, 1, 0, 2, 0, 1]
nba_players_details
player team pos number arena championships coach
0 K. Durant GS Warriors SF 35 Oracle Arena 1 Steve Kerr
1 J. Harden Houston Rockets SG 13 Toyota Center 0 Mike D'Antoni
2 K.A Towns Minnesota Timberwolves C 32 Target Center 0 Tom Thibodeau
3 Kyrie Irving Boston Celtics PG 11 TD Center 1 Brad Stevens
4 Anthony Davis NO Pelicans PF 23 Smoothie King Center 0 Alvin Gentry
5 Kawhi Leonard SA Spurs SF 2 AT&T Center 1 Gregg Popovich
6 Damian Lilard Portland Trailblazers PG 0 Moda Center 0 Terry Stotts
7 Klay Thompson GS Warriors SG 11 Oracle Arena 2 Steve Kerr
8 Rudy Gobert Utah Jazz C 27 Vivint Smart Home Arena 0 Quin Snyder
9 Kevin Love Cleveland Cavaliers PF 0 Quicken Loans Arena 1 Tyronn Lue

Rearrange Columns

nba_players_details.columns
Index(['player', 'team', 'pos', 'number', 'arena', 'championships', 'coach'], dtype='object')
nba_players_details = nba_players_details[['player', 'number', 'team', 'pos', 'championships', 'coach', 'arena']]
nba_players_details
player number team pos championships coach arena
0 K. Durant 35 GS Warriors SF 1 Steve Kerr Oracle Arena
1 J. Harden 13 Houston Rockets SG 0 Mike D'Antoni Toyota Center
2 K.A Towns 32 Minnesota Timberwolves C 0 Tom Thibodeau Target Center
3 Kyrie Irving 11 Boston Celtics PG 1 Brad Stevens TD Center
4 Anthony Davis 23 NO Pelicans PF 0 Alvin Gentry Smoothie King Center
5 Kawhi Leonard 2 SA Spurs SF 1 Gregg Popovich AT&T Center
6 Damian Lilard 0 Portland Trailblazers PG 0 Terry Stotts Moda Center
7 Klay Thompson 11 GS Warriors SG 2 Steve Kerr Oracle Arena
8 Rudy Gobert 27 Utah Jazz C 0 Quin Snyder Vivint Smart Home Arena
9 Kevin Love 0 Cleveland Cavaliers PF 1 Tyronn Lue Quicken Loans Arena

Rename Columns

nba_players_details.columns = ['Player', 'Number', 'Team', 'Position', 'Rings', 'Coach', 'Arena']
nba_players_details
Player Number Team Position Rings Coach Arena
0 K. Durant 35 GS Warriors SF 1 Steve Kerr Oracle Arena
1 J. Harden 13 Houston Rockets SG 0 Mike D'Antoni Toyota Center
2 K.A Towns 32 Minnesota Timberwolves C 0 Tom Thibodeau Target Center
3 Kyrie Irving 11 Boston Celtics PG 1 Brad Stevens TD Center
4 Anthony Davis 23 NO Pelicans PF 0 Alvin Gentry Smoothie King Center
5 Kawhi Leonard 2 SA Spurs SF 1 Gregg Popovich AT&T Center
6 Damian Lilard 0 Portland Trailblazers PG 0 Terry Stotts Moda Center
7 Klay Thompson 11 GS Warriors SG 2 Steve Kerr Oracle Arena
8 Rudy Gobert 27 Utah Jazz C 0 Quin Snyder Vivint Smart Home Arena
9 Kevin Love 0 Cleveland Cavaliers PF 1 Tyronn Lue Quicken Loans Arena

Edit Single Column Data

nba_players_details.iloc[3]
Player        Kyrie Irving
Number                  11
Team        Boston Celtics
Position                PG
Rings                    1
Coach         Brad Stevens
Arena            TD Center
Name: 3, dtype: object
nba_players_details.loc[3, 'Arena']
'TD Center'
nba_players_details.loc[3, 'Arena'] = 'Boston Garden'
nba_players_details
Player Number Team Position Rings Coach Arena
0 K. Durant 35 GS Warriors SF 1 Steve Kerr Oracle Arena
1 J. Harden 13 Houston Rockets SG 0 Mike D'Antoni Toyota Center
2 K.A Towns 32 Minnesota Timberwolves C 0 Tom Thibodeau Target Center
3 Kyrie Irving 11 Boston Celtics PG 1 Brad Stevens Boston Garden
4 Anthony Davis 23 NO Pelicans PF 0 Alvin Gentry Smoothie King Center
5 Kawhi Leonard 2 SA Spurs SF 1 Gregg Popovich AT&T Center
6 Damian Lilard 0 Portland Trailblazers PG 0 Terry Stotts Moda Center
7 Klay Thompson 11 GS Warriors SG 2 Steve Kerr Oracle Arena
8 Rudy Gobert 27 Utah Jazz C 0 Quin Snyder Vivint Smart Home Arena
9 Kevin Love 0 Cleveland Cavaliers PF 1 Tyronn Lue Quicken Loans Arena

Total Number of Championships

total_rings = nba_players_details['Rings'].sum()
total_rings
6
nba_players_details.groupby('Team', axis=0)['Player'].count()
Team
Boston Celtics            1
Cleveland Cavaliers       1
GS Warriors               2
Houston Rockets           1
Minnesota Timberwolves    1
NO Pelicans               1
Portland Trailblazers     1
SA Spurs                  1
Utah Jazz                 1
Name: Player, dtype: int64

Of the teams represented in the table only the GS Warriors have two players, whereas all the other teams have one player apiece.

nba_players_details.groupby('Rings', axis=0)['Player'].count()
Rings
0    5
1    4
2    1
Name: Player, dtype: int64

Five players have no rings, 4 have one, and only one player has two rings.

That’s where this post will be wrapped up. We have looked at some common ways to work with data using the Pandas package. There are a multitude of even more specific uses that we didn’t delve into, but in the upcoming posts we’ll reiterate some of the above operations as well as some new ones.