Pandas Basics
Much like the Numpy Basics Overview, this will be a quite terse overview of the package. The focus will be on simple and common use cases such as importing the package, creating and reading data files, shaping it, transforming it, etc. Pandas has a very simply written documentation that’s easy to read and access (especially on a Jupyter notebook).
Import Pandas And Read Data
import pandas as pd
nba_players = pd.read_csv('nba_players.csv')
nba_players.head()
|
player |
team |
pos |
number |
0 |
K. Durant |
GS Warriors |
SF |
35 |
1 |
J. Harden |
Houston Rockets |
SG |
13 |
2 |
K.A Towns |
Minnesota Timberwolves |
C |
32 |
3 |
Kyrie Irving |
Boston Celtics |
PG |
11 |
4 |
Anthony Davis |
NO Pelicans |
PF |
23 |
Voilá! Look how it pretty it looks (on this notebook, but hopefully my css table stylings do it justice). Everything is arranged in a nicely formatted table with rows, columns, and headings.
Attributes & Methods For Simple Exploring
In most scenarios when we first load data we take explore it. This is to get a sense of what analyses we can perform on it as well as to see if it’s tidy or requires further cleaning on our part. In this very simple dataset there’s not much to explore and nothing to tidy, but regardless of that, let’s see some methods.
first = nba_players.head(1)
sample = nba_players.sample(1)
print(first)
print(sample)
player team pos number
0 K. Durant GS Warriors SF 35
player team pos number
0 K. Durant GS Warriors SF 35
Index(['player', 'team', 'pos', 'number'], dtype='object')
<bound method NDFrame.describe of player team pos number
0 K. Durant GS Warriors SF 35
1 J. Harden Houston Rockets SG 13
2 K.A Towns Minnesota Timberwolves C 32
3 Kyrie Irving Boston Celtics PG 11
4 Anthony Davis NO Pelicans PF 23>
# General info about column names, type of data, length of data, and memory usage
nba_players.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
player 5 non-null object
team 5 non-null object
pos 5 non-null object
number 5 non-null int64
dtypes: int64(1), object(3)
memory usage: 240.0+ bytes
Add New Data
Merge DataFrames
# New DataFrame with same heading and type of data as initial nba_players
nba_2 = [
{'player': 'Kawhi Leonard', 'team': 'SA Spurs', 'pos': 'SF', 'number': 2},
{'player': 'Damian Lilard', 'team': 'Portland Trailblazers', 'pos': 'PG', 'number': 0},
{'player': 'Klay Thompson', 'team': 'GS Warriors', 'pos': 'SG', 'number': 11},
{'player': 'Rudy Gobert', 'team': 'Utah Jazz', 'pos': 'C', 'number': 27},
{'player': 'Kevin Love', 'team': 'Cleveland Cavaliers', 'pos': 'PF', 'number': 0},
]
nba_players_2 = pd.DataFrame(nba_2)
nba_players_2
|
number |
player |
pos |
team |
0 |
2 |
Kawhi Leonard |
SF |
SA Spurs |
1 |
0 |
Damian Lilard |
PG |
Portland Trailblazers |
2 |
11 |
Klay Thompson |
SG |
GS Warriors |
3 |
27 |
Rudy Gobert |
C |
Utah Jazz |
4 |
0 |
Kevin Love |
PF |
Cleveland Cavaliers |
The merging process is similar to SQL JOINs.
# right returns only the new dataframe
nba_players.merge(nba_players_2, how='right')
|
player |
team |
pos |
number |
0 |
Kawhi Leonard |
SA Spurs |
SF |
2 |
1 |
Damian Lilard |
Portland Trailblazers |
PG |
0 |
2 |
Klay Thompson |
GS Warriors |
SG |
11 |
3 |
Rudy Gobert |
Utah Jazz |
C |
27 |
4 |
Kevin Love |
Cleveland Cavaliers |
PF |
0 |
# left returns only the initial dataframe
nba_players.merge(nba_players_2, how='left')
|
player |
team |
pos |
number |
0 |
K. Durant |
GS Warriors |
SF |
35 |
1 |
J. Harden |
Houston Rockets |
SG |
13 |
2 |
K.A Towns |
Minnesota Timberwolves |
C |
32 |
3 |
Kyrie Irving |
Boston Celtics |
PG |
11 |
4 |
Anthony Davis |
NO Pelicans |
PF |
23 |
# outer returns new dataframe of combined initial and new dataframe
nba_players = nba_players.merge(nba_players_2, how='outer')
nba_players
|
player |
team |
pos |
number |
0 |
K. Durant |
GS Warriors |
SF |
35 |
1 |
J. Harden |
Houston Rockets |
SG |
13 |
2 |
K.A Towns |
Minnesota Timberwolves |
C |
32 |
3 |
Kyrie Irving |
Boston Celtics |
PG |
11 |
4 |
Anthony Davis |
NO Pelicans |
PF |
23 |
5 |
Kawhi Leonard |
SA Spurs |
SF |
2 |
6 |
Damian Lilard |
Portland Trailblazers |
PG |
0 |
7 |
Klay Thompson |
GS Warriors |
SG |
11 |
8 |
Rudy Gobert |
Utah Jazz |
C |
27 |
9 |
Kevin Love |
Cleveland Cavaliers |
PF |
0 |
Concat DataFrames
nba_misc_details = [
{'coach': 'Steve Kerr', 'arena': 'Oracle Arena', 'championships': 5},
{'coach': 'Mike D\'Antoni', 'arena': 'Toyota Center', 'championships': 2},
{'coach': 'Tom Thibodeau', 'arena': 'Target Center', 'championships': 0},
{'coach': 'Brad Stevens', 'arena': 'TD Center', 'championships': 17},
{'coach': 'Alvin Gentry', 'arena': 'Smoothie King Center', 'championships': 0},
{'coach': 'Gregg Popovich', 'arena':'AT&T Center', 'championships': 5},
{'coach': 'Terry Stotts', 'arena': 'Moda Center', 'championships': 1},
{'coach': 'Steve Kerr', 'arena': 'Oracle Arena', 'championships': 5},
{'coach': 'Quin Snyder', 'arena': 'Vivint Smart Home Arena', 'championships': 0},
{'coach': 'Tyronn Lue', 'arena': 'Quicken Loans Arena', 'championships': 1},
]
nba_misc = pd.DataFrame(nba_misc_details)
# concat behaves like an append, attaching the new dataframe either as row data or column data
nba_players_details = pd.concat([nba_players, nba_misc], axis=1)
nba_players_details
|
player |
team |
pos |
number |
arena |
championships |
coach |
0 |
K. Durant |
GS Warriors |
SF |
35 |
Oracle Arena |
5 |
Steve Kerr |
1 |
J. Harden |
Houston Rockets |
SG |
13 |
Toyota Center |
2 |
Mike D'Antoni |
2 |
K.A Towns |
Minnesota Timberwolves |
C |
32 |
Target Center |
0 |
Tom Thibodeau |
3 |
Kyrie Irving |
Boston Celtics |
PG |
11 |
TD Center |
17 |
Brad Stevens |
4 |
Anthony Davis |
NO Pelicans |
PF |
23 |
Smoothie King Center |
0 |
Alvin Gentry |
5 |
Kawhi Leonard |
SA Spurs |
SF |
2 |
AT&T Center |
5 |
Gregg Popovich |
6 |
Damian Lilard |
Portland Trailblazers |
PG |
0 |
Moda Center |
1 |
Terry Stotts |
7 |
Klay Thompson |
GS Warriors |
SG |
11 |
Oracle Arena |
5 |
Steve Kerr |
8 |
Rudy Gobert |
Utah Jazz |
C |
27 |
Vivint Smart Home Arena |
0 |
Quin Snyder |
9 |
Kevin Love |
Cleveland Cavaliers |
PF |
0 |
Quicken Loans Arena |
1 |
Tyronn Lue |
Edit Entire Column
It seems that the we have attributed to the players the wrong championships information. Those values are actually the total championships for the teams. We would actually like that column to be for individual championships.
nba_players_details['championships'] = [1, 0, 0, 1, 0, 1, 0, 2, 0, 1]
nba_players_details
|
player |
team |
pos |
number |
arena |
championships |
coach |
0 |
K. Durant |
GS Warriors |
SF |
35 |
Oracle Arena |
1 |
Steve Kerr |
1 |
J. Harden |
Houston Rockets |
SG |
13 |
Toyota Center |
0 |
Mike D'Antoni |
2 |
K.A Towns |
Minnesota Timberwolves |
C |
32 |
Target Center |
0 |
Tom Thibodeau |
3 |
Kyrie Irving |
Boston Celtics |
PG |
11 |
TD Center |
1 |
Brad Stevens |
4 |
Anthony Davis |
NO Pelicans |
PF |
23 |
Smoothie King Center |
0 |
Alvin Gentry |
5 |
Kawhi Leonard |
SA Spurs |
SF |
2 |
AT&T Center |
1 |
Gregg Popovich |
6 |
Damian Lilard |
Portland Trailblazers |
PG |
0 |
Moda Center |
0 |
Terry Stotts |
7 |
Klay Thompson |
GS Warriors |
SG |
11 |
Oracle Arena |
2 |
Steve Kerr |
8 |
Rudy Gobert |
Utah Jazz |
C |
27 |
Vivint Smart Home Arena |
0 |
Quin Snyder |
9 |
Kevin Love |
Cleveland Cavaliers |
PF |
0 |
Quicken Loans Arena |
1 |
Tyronn Lue |
Rearrange Columns
nba_players_details.columns
Index(['player', 'team', 'pos', 'number', 'arena', 'championships', 'coach'], dtype='object')
nba_players_details = nba_players_details[['player', 'number', 'team', 'pos', 'championships', 'coach', 'arena']]
nba_players_details
|
player |
number |
team |
pos |
championships |
coach |
arena |
0 |
K. Durant |
35 |
GS Warriors |
SF |
1 |
Steve Kerr |
Oracle Arena |
1 |
J. Harden |
13 |
Houston Rockets |
SG |
0 |
Mike D'Antoni |
Toyota Center |
2 |
K.A Towns |
32 |
Minnesota Timberwolves |
C |
0 |
Tom Thibodeau |
Target Center |
3 |
Kyrie Irving |
11 |
Boston Celtics |
PG |
1 |
Brad Stevens |
TD Center |
4 |
Anthony Davis |
23 |
NO Pelicans |
PF |
0 |
Alvin Gentry |
Smoothie King Center |
5 |
Kawhi Leonard |
2 |
SA Spurs |
SF |
1 |
Gregg Popovich |
AT&T Center |
6 |
Damian Lilard |
0 |
Portland Trailblazers |
PG |
0 |
Terry Stotts |
Moda Center |
7 |
Klay Thompson |
11 |
GS Warriors |
SG |
2 |
Steve Kerr |
Oracle Arena |
8 |
Rudy Gobert |
27 |
Utah Jazz |
C |
0 |
Quin Snyder |
Vivint Smart Home Arena |
9 |
Kevin Love |
0 |
Cleveland Cavaliers |
PF |
1 |
Tyronn Lue |
Quicken Loans Arena |
Rename Columns
nba_players_details.columns = ['Player', 'Number', 'Team', 'Position', 'Rings', 'Coach', 'Arena']
nba_players_details
|
Player |
Number |
Team |
Position |
Rings |
Coach |
Arena |
0 |
K. Durant |
35 |
GS Warriors |
SF |
1 |
Steve Kerr |
Oracle Arena |
1 |
J. Harden |
13 |
Houston Rockets |
SG |
0 |
Mike D'Antoni |
Toyota Center |
2 |
K.A Towns |
32 |
Minnesota Timberwolves |
C |
0 |
Tom Thibodeau |
Target Center |
3 |
Kyrie Irving |
11 |
Boston Celtics |
PG |
1 |
Brad Stevens |
TD Center |
4 |
Anthony Davis |
23 |
NO Pelicans |
PF |
0 |
Alvin Gentry |
Smoothie King Center |
5 |
Kawhi Leonard |
2 |
SA Spurs |
SF |
1 |
Gregg Popovich |
AT&T Center |
6 |
Damian Lilard |
0 |
Portland Trailblazers |
PG |
0 |
Terry Stotts |
Moda Center |
7 |
Klay Thompson |
11 |
GS Warriors |
SG |
2 |
Steve Kerr |
Oracle Arena |
8 |
Rudy Gobert |
27 |
Utah Jazz |
C |
0 |
Quin Snyder |
Vivint Smart Home Arena |
9 |
Kevin Love |
0 |
Cleveland Cavaliers |
PF |
1 |
Tyronn Lue |
Quicken Loans Arena |
Edit Single Column Data
nba_players_details.iloc[3]
Player Kyrie Irving
Number 11
Team Boston Celtics
Position PG
Rings 1
Coach Brad Stevens
Arena TD Center
Name: 3, dtype: object
nba_players_details.loc[3, 'Arena']
nba_players_details.loc[3, 'Arena'] = 'Boston Garden'
nba_players_details
|
Player |
Number |
Team |
Position |
Rings |
Coach |
Arena |
0 |
K. Durant |
35 |
GS Warriors |
SF |
1 |
Steve Kerr |
Oracle Arena |
1 |
J. Harden |
13 |
Houston Rockets |
SG |
0 |
Mike D'Antoni |
Toyota Center |
2 |
K.A Towns |
32 |
Minnesota Timberwolves |
C |
0 |
Tom Thibodeau |
Target Center |
3 |
Kyrie Irving |
11 |
Boston Celtics |
PG |
1 |
Brad Stevens |
Boston Garden |
4 |
Anthony Davis |
23 |
NO Pelicans |
PF |
0 |
Alvin Gentry |
Smoothie King Center |
5 |
Kawhi Leonard |
2 |
SA Spurs |
SF |
1 |
Gregg Popovich |
AT&T Center |
6 |
Damian Lilard |
0 |
Portland Trailblazers |
PG |
0 |
Terry Stotts |
Moda Center |
7 |
Klay Thompson |
11 |
GS Warriors |
SG |
2 |
Steve Kerr |
Oracle Arena |
8 |
Rudy Gobert |
27 |
Utah Jazz |
C |
0 |
Quin Snyder |
Vivint Smart Home Arena |
9 |
Kevin Love |
0 |
Cleveland Cavaliers |
PF |
1 |
Tyronn Lue |
Quicken Loans Arena |
Total Number of Championships
total_rings = nba_players_details['Rings'].sum()
total_rings
nba_players_details.groupby('Team', axis=0)['Player'].count()
Team
Boston Celtics 1
Cleveland Cavaliers 1
GS Warriors 2
Houston Rockets 1
Minnesota Timberwolves 1
NO Pelicans 1
Portland Trailblazers 1
SA Spurs 1
Utah Jazz 1
Name: Player, dtype: int64
Of the teams represented in the table only the GS Warriors have two players, whereas all the other teams have one player apiece.
nba_players_details.groupby('Rings', axis=0)['Player'].count()
Rings
0 5
1 4
2 1
Name: Player, dtype: int64
Five players have no rings, 4 have one, and only one player has two rings.
That’s where this post will be wrapped up. We have looked at some common ways to work with data using the Pandas package. There are a multitude of even more specific uses that we didn’t delve into, but in the upcoming posts we’ll reiterate some of the above operations as well as some new ones.