Pandas basics

Creating dataframes

Let’s start with the basics: how to create a dataframe.

Loading from CSV

Dataframes can be created from CSV files using the pd.read_csv() function:

import pandas as pd

pd.read_csv('../../data/mpg.csv')
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86 2790 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52 2130 24.6 82 2 vw pickup
395 32.0 4 135.0 84 2295 11.6 82 1 dodge rampage
396 28.0 4 120.0 79 2625 18.6 82 1 ford ranger
397 31.0 4 119.0 82 2720 19.4 82 1 chevy s-10

398 rows × 9 columns

If you want to bypass the column name detection (from the header row), you can supply the column names directly:

columns = [f"field_{i}" for i in range(9)]
pd.read_csv('../../data/mpg.csv', names=columns, skiprows=1, header=None)
field_0 field_1 field_2 field_3 field_4 field_5 field_6 field_7 field_8
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86 2790 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52 2130 24.6 82 2 vw pickup
395 32.0 4 135.0 84 2295 11.6 82 1 dodge rampage
396 28.0 4 120.0 79 2625 18.6 82 1 ford ranger
397 31.0 4 119.0 82 2720 19.4 82 1 chevy s-10

398 rows × 9 columns
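
The same header-skipping behaviour can be reproduced on a small in-memory CSV via io.StringIO, which makes it easy to experiment without depending on the mpg.csv file path (a minimal sketch; the sample rows are made up):

```python
import io
import pandas as pd

# A tiny CSV with a header row, mimicking the structure above
csv_text = (
    "mpg,cylinders,name\n"
    "18.0,8,chevrolet chevelle malibu\n"
    "15.0,8,buick skylark 320\n"
)

# Skip the header row and supply our own column names
columns = [f"field_{i}" for i in range(3)]
df_demo = pd.read_csv(io.StringIO(csv_text), names=columns, skiprows=1, header=None)
print(df_demo.columns.tolist())  # the supplied names, not the file's header
```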

Random dataframes

import numpy as np

N=100

df = pd.DataFrame({
    'a':np.random.randn(N),
    'b':np.random.choice( [5,7,np.nan], N),
    'c':np.random.choice( ['foo','bar','baz'], N),
    })
df.head()
a b c
0 -1.663535 5.0 baz
1 0.408342 NaN foo
2 0.424239 NaN bar
3 1.512528 5.0 foo
4 -2.062024 5.0 foo
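
Since these values are random, each run produces different numbers. For reproducible examples you can seed NumPy's generator before drawing (the seed value below is arbitrary):

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # any fixed seed makes the draws repeatable
N = 100

df = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([5, 7, np.nan], N),
    'c': np.random.choice(['foo', 'bar', 'baz'], N),
})
```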

Concatenate dataframes

Row-wise

To concatenate dataframes row-wise (i.e. to append more rows to dataframes with the same structure) we can use the pd.concat() function. For instance, if we create a new random dataframe:

df_extra = pd.DataFrame({
    'a':np.random.randn(N),
    'b':np.random.choice( [11,12,13], N),
    'c':np.random.choice( ['zombie','woof','nite'], N),
    })
df_extra.head()
a b c
0 -0.308607 13 nite
1 1.241705 11 nite
2 1.449900 11 zombie
3 0.741891 12 nite
4 -1.586658 12 zombie

We can now concatenate an arbitrary number of dataframes by passing them as a list:

df_all = pd.concat([df, df_extra])
df_all.sample(9)
a b c
98 -0.412014 11.0 zombie
38 -0.656568 5.0 foo
15 0.263778 13.0 zombie
53 -0.133630 NaN baz
21 -0.106806 NaN bar
34 -0.675752 7.0 foo
13 0.181684 12.0 zombie
8 -1.189809 12.0 zombie
0 -0.308607 13.0 nite
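
Note in the sample above that row labels repeat (index 0 appears twice), because pd.concat() keeps each dataframe's original index. If a fresh 0..N-1 index is wanted, pass ignore_index=True — a sketch with two tiny frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Default: the original row labels 0, 1 repeat
print(pd.concat([df1, df2]).index.tolist())                     # [0, 1, 0, 1]

# ignore_index=True builds a fresh RangeIndex instead
print(pd.concat([df1, df2], ignore_index=True).index.tolist())  # [0, 1, 2, 3]
```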

Column operations

Check column existence

The in keyword can be used directly to check column existence.

'b' in df
True
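
Note that in tests against the column labels, not the cell values. To look for a value inside a column, check that column's values explicitly:

```python
import pandas as pd

df = pd.DataFrame({'b': [5, 7]})

print('b' in df)            # True  -- 'b' is a column label
print(5 in df)              # False -- cell values are not checked
print(5 in df['b'].values)  # True  -- membership in the column's values
```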

Renaming columns

df.rename(columns={"a": "new_name"}, inplace=True)
df.columns
Index(['new_name', 'b', 'c'], dtype='object')

We can also use a mapping function, in this case str.upper():

df.rename(columns=str.upper, inplace=True)
df.columns
Index(['NEW_NAME', 'B', 'C'], dtype='object')

We can also use a lambda. For instance, using lambda x: x.capitalize() results in:

df.rename(columns=lambda x: x.capitalize(), inplace=True)
df.columns
Index(['New_name', 'B', 'C'], dtype='object')
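
By default, rename() silently ignores mapping keys that do not match any existing label; passing errors="raise" turns a typo into an explicit KeyError (a sketch on a throwaway frame):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2]})

# Misspelled label 'A': silently ignored by default
print(df.rename(columns={'A': 'x'}).columns.tolist())  # ['a', 'b']

# errors="raise" surfaces the mistake instead
try:
    df.rename(columns={'A': 'x'}, errors='raise')
except KeyError as exc:
    print('KeyError:', exc)
```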

A list of column names can also be assigned directly to the columns attribute.

df.columns = ["first", "second", "third"]
df.columns
Index(['first', 'second', 'third'], dtype='object')

Dropping columns

A column can be dropped using the .drop() method along with the columns keyword. For instance, in the dataframe df we can drop the second column using:

df.drop(columns='second')
first third
0 -1.663535 baz
1 0.408342 foo
2 0.424239 bar
3 1.512528 foo
4 -2.062024 foo
... ... ...
95 -0.856850 foo
96 0.850722 foo
97 -1.076690 bar
98 2.074288 baz
99 -1.176129 baz

100 rows × 2 columns

The del keyword is also a possibility. However, del changes the dataframe in place, so we will make a copy of the dataframe first.

df_copy = df.copy()
df_copy
first second third
0 -1.663535 5.0 baz
1 0.408342 NaN foo
2 0.424239 NaN bar
3 1.512528 5.0 foo
4 -2.062024 5.0 foo
... ... ... ...
95 -0.856850 NaN foo
96 0.850722 NaN foo
97 -1.076690 7.0 bar
98 2.074288 7.0 baz
99 -1.176129 7.0 baz

100 rows × 3 columns

del df_copy['second']
df_copy
first third
0 -1.663535 baz
1 0.408342 foo
2 0.424239 bar
3 1.512528 foo
4 -2.062024 foo
... ... ...
95 -0.856850 foo
96 0.850722 foo
97 -1.076690 bar
98 2.074288 baz
99 -1.176129 baz

100 rows × 2 columns

Yet another possibility is to drop the column by index. For instance:

df.drop(columns=df.columns[1])
first third
0 -1.663535 baz
1 0.408342 foo
2 0.424239 bar
3 1.512528 foo
4 -2.062024 foo
... ... ...
95 -0.856850 foo
96 0.850722 foo
97 -1.076690 bar
98 2.074288 baz
99 -1.176129 baz

100 rows × 2 columns

Or we could use ranges, for instance:

df.drop(columns=df.columns[0:2])
third
0 baz
1 foo
2 bar
3 foo
4 foo
... ...
95 foo
96 foo
97 bar
98 baz
99 baz

100 rows × 1 columns

Subsetting and indexing

Indexing performance

Let’s assume the case where you have a column BOOL with values Y or N that you want to replace with an integer 1 or 0 value. The initial¹ instinct would be to do something like:

df["BOOL"] = df["BOOL"].eq("Y").mul(1)

This will result in the warning:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The pandas documentation recommends the following idiom instead, which avoids the warning and can also be considerably faster:

df.loc[:, ("BOOL")] = df.loc[:, ("BOOL")].eq("Y").mul(1)
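
A self-contained illustration of the idiom (the BOOL column below is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"BOOL": ["Y", "N", "Y"]})

# Whole-column assignment through .loc: no SettingWithCopyWarning,
# and the comparison and multiplication are vectorised
df.loc[:, "BOOL"] = df.loc[:, "BOOL"].eq("Y").mul(1)
print(df["BOOL"].tolist())  # [1, 0, 1]
```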

Footnotes

  1. and Pythonic?↩︎