Pandas basics
Creating dataframes
Let’s start with the basics: how to create a dataframe.
Loading from CSV
Dataframes can be created from CSV files with the `pd.read_csv()` function:
import pandas as pd
pd.read_csv('../../data/mpg.csv')
 | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
393 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
394 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
395 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
396 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |
397 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 | chevy s-10 |
398 rows × 9 columns
If you want to bypass the column-name detection (from the CSV header row), you can supply the column names directly:
columns = [f"field_{i}" for i in range(9)]
pd.read_csv('../../data/mpg.csv', names=columns, skiprows=1, header=None)
 | field_0 | field_1 | field_2 | field_3 | field_4 | field_5 | field_6 | field_7 | field_8 |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
393 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
394 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
395 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
396 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |
397 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 | chevy s-10 |
398 rows × 9 columns
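`pd.read_csv()` also accepts options to control what gets parsed. The sketch below (using an inline CSV via `io.StringIO` as a stand-in for `mpg.csv`, since the file path above is environment-specific) shows three commonly useful parameters: `usecols`, `dtype`, and `nrows`.

```python
import io

import pandas as pd

# A small inline CSV stands in for mpg.csv here.
csv_data = io.StringIO(
    "mpg,cylinders,name\n"
    "18.0,8,chevrolet chevelle malibu\n"
    "15.0,8,buick skylark 320\n"
    "32.0,4,dodge rampage\n"
)

# usecols restricts which columns are parsed,
# dtype fixes column types up front,
# nrows limits how many data rows are read.
df_small = pd.read_csv(
    csv_data,
    usecols=["mpg", "name"],
    dtype={"mpg": float},
    nrows=2,
)
print(df_small.shape)          # (2, 2)
print(list(df_small.columns))  # ['mpg', 'name']
```

Note that the result keeps the columns in file order, regardless of the order they appear in `usecols`.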
Random dataframes
import numpy as np

N = 100
df = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([5, 7, np.nan], N),
    'c': np.random.choice(['foo', 'bar', 'baz'], N),
})
df.head()
a | b | c | |
---|---|---|---|
0 | -1.663535 | 5.0 | baz |
1 | 0.408342 | NaN | foo |
2 | 0.424239 | NaN | bar |
3 | 1.512528 | 5.0 | foo |
4 | -2.062024 | 5.0 | foo |
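Since the values are random, the dataframe above will differ on every run. If reproducibility matters, a seeded `np.random.default_rng` generator can be used instead, a minimal sketch:

```python
import numpy as np
import pandas as pd

# Seeding the generator makes the "random" dataframe reproducible.
rng = np.random.default_rng(42)
N = 5

df_seeded = pd.DataFrame({
    'a': rng.standard_normal(N),
    'b': rng.choice([5, 7, np.nan], N),
    'c': rng.choice(['foo', 'bar', 'baz'], N),
})
print(df_seeded.shape)  # (5, 3)
```

Re-running this with the same seed yields identical values each time.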
Concatenate dataframes
Row-wise
To concatenate dataframes row-wise (i.e. to append more rows to dataframes with the same structure) we can use the `pd.concat()` function. For instance, if we create a new random dataframe:
df_extra = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([11, 12, 13], N),
    'c': np.random.choice(['zombie', 'woof', 'nite'], N),
})
df_extra.head()
a | b | c | |
---|---|---|---|
0 | -0.308607 | 13 | nite |
1 | 1.241705 | 11 | nite |
2 | 1.449900 | 11 | zombie |
3 | 0.741891 | 12 | nite |
4 | -1.586658 | 12 | zombie |
We can now concatenate an arbitrary number of dataframes by passing them as a list:
df_all = pd.concat([df, df_extra])
df_all.sample(9)
a | b | c | |
---|---|---|---|
98 | -0.412014 | 11.0 | zombie |
38 | -0.656568 | 5.0 | foo |
15 | 0.263778 | 13.0 | zombie |
53 | -0.133630 | NaN | baz |
21 | -0.106806 | NaN | bar |
34 | -0.675752 | 7.0 | foo |
13 | 0.181684 | 12.0 | zombie |
8 | -1.189809 | 12.0 | zombie |
0 | -0.308607 | 13.0 | nite |
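Notice in the sample above that the original index labels are kept, so the same label can appear twice (index `0` from `df_extra` shows up after `98` from `df`). A small sketch of two related `pd.concat()` options: `ignore_index=True` to renumber, and `axis=1` for column-wise concatenation.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# By default the original indices are kept, so labels repeat.
stacked = pd.concat([df1, df2])
print(list(stacked.index))          # [0, 1, 0, 1]

# ignore_index=True builds a fresh 0..n-1 index instead.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(list(renumbered.index))       # [0, 1, 2, 3]

# axis=1 concatenates column-wise, aligning on the index.
side_by_side = pd.concat([df1, df2.rename(columns={'a': 'b'})], axis=1)
print(list(side_by_side.columns))   # ['a', 'b']
```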
Column operations
Check column existence
The `in` keyword can be used directly to check column existence.
'b' in df
True
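For checking several columns at once, one option is to compare a set of names against `df.columns` with `set.issubset`, a minimal sketch:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

# `in` checks a single column label.
print('b' in df_demo)                        # True
print('z' in df_demo)                        # False

# issubset checks several labels in one go.
print({'a', 'b'}.issubset(df_demo.columns))  # True
print({'a', 'z'}.issubset(df_demo.columns))  # False
```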
Renaming columns
df.rename(columns={"a": "new_name"}, inplace=True)
df.columns
Index(['new_name', 'b', 'c'], dtype='object')
Using a mapping function, in this case `str.upper()`:
df.rename(columns=str.upper, inplace=True)
df.columns
Index(['NEW_NAME', 'B', 'C'], dtype='object')
We can also use a lambda. For instance, using `lambda x: x.capitalize()` would result in:
df.rename(columns=lambda x: x.capitalize(), inplace=True)
df.columns
Index(['New_name', 'B', 'C'], dtype='object')
A list of column names can also be assigned directly to `df.columns`:
df.columns = ["first", "second", "third"]
df.columns
Index(['first', 'second', 'third'], dtype='object')
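For the common case of tagging every column the same way, pandas also provides `add_prefix()` and `add_suffix()`, which return a new dataframe rather than modifying in place, a quick sketch:

```python
import pandas as pd

df_tags = pd.DataFrame({'a': [1], 'b': [2]})

# add_prefix / add_suffix return new dataframes with transformed labels.
print(list(df_tags.add_prefix('col_').columns))  # ['col_a', 'col_b']
print(list(df_tags.add_suffix('_raw').columns))  # ['a_raw', 'b_raw']
```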
Dropping columns
A column can be dropped using the `.drop()` method along with the `columns` keyword. For instance, in the dataframe `df`, we can drop the `second` column using:
df.drop(columns='second')
first | third | |
---|---|---|
0 | -1.663535 | baz |
1 | 0.408342 | foo |
2 | 0.424239 | bar |
3 | 1.512528 | foo |
4 | -2.062024 | foo |
... | ... | ... |
95 | -0.856850 | foo |
96 | 0.850722 | foo |
97 | -1.076690 | bar |
98 | 2.074288 | baz |
99 | -1.176129 | baz |
100 rows × 2 columns
The `del` keyword is also a possibility. However, `del` changes the dataframe in place, so we will make a copy of the dataframe first.
df_copy = df.copy()
df_copy
first | second | third | |
---|---|---|---|
0 | -1.663535 | 5.0 | baz |
1 | 0.408342 | NaN | foo |
2 | 0.424239 | NaN | bar |
3 | 1.512528 | 5.0 | foo |
4 | -2.062024 | 5.0 | foo |
... | ... | ... | ... |
95 | -0.856850 | NaN | foo |
96 | 0.850722 | NaN | foo |
97 | -1.076690 | 7.0 | bar |
98 | 2.074288 | 7.0 | baz |
99 | -1.176129 | 7.0 | baz |
100 rows × 3 columns
del df_copy['second']
df_copy
first | third | |
---|---|---|
0 | -1.663535 | baz |
1 | 0.408342 | foo |
2 | 0.424239 | bar |
3 | 1.512528 | foo |
4 | -2.062024 | foo |
... | ... | ... |
95 | -0.856850 | foo |
96 | 0.850722 | foo |
97 | -1.076690 | bar |
98 | 2.074288 | baz |
99 | -1.176129 | baz |
100 rows × 2 columns
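A close relative of `del` is `.pop()`, which also removes the column in place but additionally hands it back as a Series (mirroring `dict.pop`). A minimal sketch on a small hypothetical dataframe:

```python
import pandas as pd

df_pop = pd.DataFrame({'first': [1.0], 'second': [5.0], 'third': ['baz']})

# .pop() removes the column in place and returns it as a Series.
second = df_pop.pop('second')
print(list(df_pop.columns))  # ['first', 'third']
print(second.iloc[0])        # 5.0
```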
Yet another possibility is to drop the column by index. For instance:
df.drop(columns=df.columns[1])
first | third | |
---|---|---|
0 | -1.663535 | baz |
1 | 0.408342 | foo |
2 | 0.424239 | bar |
3 | 1.512528 | foo |
4 | -2.062024 | foo |
... | ... | ... |
95 | -0.856850 | foo |
96 | 0.850722 | foo |
97 | -1.076690 | bar |
98 | 2.074288 | baz |
99 | -1.176129 | baz |
100 rows × 2 columns
Or we could use ranges, for instance:
df.drop(columns=df.columns[0:2])
third | |
---|---|
0 | baz |
1 | foo |
2 | bar |
3 | foo |
4 | foo |
... | ... |
95 | foo |
96 | foo |
97 | bar |
98 | baz |
99 | baz |
100 rows × 1 columns
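One caveat worth knowing: by default `.drop()` raises a `KeyError` when a label does not exist. Passing `errors="ignore"` makes it skip missing labels silently, a short sketch:

```python
import pandas as pd

df_err = pd.DataFrame({'first': [1], 'second': [2], 'third': [3]})

# By default .drop() raises a KeyError for unknown labels...
try:
    df_err.drop(columns='fourth')
except KeyError:
    print('KeyError raised')

# ...but errors="ignore" silently skips labels that are absent.
out = df_err.drop(columns=['second', 'fourth'], errors='ignore')
print(list(out.columns))  # ['first', 'third']
```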
Subsetting and indexing
Indexing performance
Let’s assume the case where you have a column `BOOL` with values `Y` or `N` that you want to replace with an integer `1` or `0` value. The initial[^1] instinct would be to do something like:
df["BOOL"] = df["BOOL"].eq("Y").mul(1)
This can result in the warning
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The pandas documentation recommends the following idiom instead, since it can be considerably faster:
df.loc[:, ("BOOL")] = df.loc[:, ("BOOL")].eq("Y").mul(1)
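As an alternative to the `.eq().mul()` chain, the same `Y`/`N` replacement can be expressed with `Series.map` and an explicit dict, which makes the mapping self-documenting (values absent from the dict would become `NaN`). A sketch on a small hypothetical dataframe:

```python
import pandas as pd

df_bool = pd.DataFrame({'BOOL': ['Y', 'N', 'Y']})

# Series.map with an explicit dict does the Y/N -> 1/0 replacement;
# values missing from the dict would become NaN.
df_bool.loc[:, 'BOOL'] = df_bool.loc[:, 'BOOL'].map({'Y': 1, 'N': 0})
print(df_bool['BOOL'].tolist())  # [1, 0, 1]
```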
[^1]: and Pythonic?