Pandas basics

Creating dataframes

Let’s start with the basics: how to create a dataframe.

Loading from CSV

Dataframes can be created from CSV files by using the following method:

import pandas as pd

pd.read_csv('../../data/mpg.csv')

      mpg  cylinders  displacement horsepower  weight  acceleration  model_year  origin                       name
0    18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
1    15.0          8         350.0        165    3693          11.5          70       1          buick skylark 320
2    18.0          8         318.0        150    3436          11.0          70       1         plymouth satellite
3    16.0          8         304.0        150    3433          12.0          70       1              amc rebel sst
4    17.0          8         302.0        140    3449          10.5          70       1                ford torino
..    ...        ...           ...        ...     ...           ...         ...     ...                        ...
393  27.0          4         140.0         86    2790          15.6          82       1            ford mustang gl
394  44.0          4          97.0         52    2130          24.6          82       2                  vw pickup
395  32.0          4         135.0         84    2295          11.6          82       1              dodge rampage
396  28.0          4         120.0         79    2625          18.6          82       1                ford ranger
397  31.0          4         119.0         82    2720          19.4          82       1                 chevy s-10

398 rows × 9 columns
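read_csv also accepts parameters that control what gets parsed. A minimal sketch, using a small made-up in-memory CSV in place of the mpg file (the column subset and dtypes below are illustrative assumptions, not part of the original example):

```python
import io

import pandas as pd

# A tiny stand-in for ../../data/mpg.csv.
csv_text = """mpg,cylinders,name
18.0,8,chevrolet chevelle malibu
15.0,8,buick skylark 320
"""

# usecols limits which columns are read, dtype fixes types up front,
# and nrows caps how many rows are parsed.
df = pd.read_csv(io.StringIO(csv_text), usecols=["mpg", "name"],
                 dtype={"mpg": "float64"}, nrows=2)
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['mpg', 'name']
```

These options are handy for large files, where parsing only the columns and rows you need can save substantial time and memory.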

If you want to bypass column-name detection (reading the names from the header row), you can supply the column names directly:

columns = [f"field_{i}" for i in range(9)]
pd.read_csv('../../data/mpg.csv', names=columns, skiprows=1, header=None)

     field_0  field_1  field_2 field_3  field_4  field_5  field_6  field_7                    field_8
0       18.0        8    307.0     130     3504     12.0       70        1  chevrolet chevelle malibu
1       15.0        8    350.0     165     3693     11.5       70        1          buick skylark 320
2       18.0        8    318.0     150     3436     11.0       70        1         plymouth satellite
3       16.0        8    304.0     150     3433     12.0       70        1              amc rebel sst
4       17.0        8    302.0     140     3449     10.5       70        1                ford torino
..       ...      ...      ...     ...      ...      ...      ...      ...                        ...
393     27.0        4    140.0      86     2790     15.6       82        1            ford mustang gl
394     44.0        4     97.0      52     2130     24.6       82        2                  vw pickup
395     32.0        4    135.0      84     2295     11.6       82        1              dodge rampage
396     28.0        4    120.0      79     2625     18.6       82        1                ford ranger
397     31.0        4    119.0      82     2720     19.4       82        1                 chevy s-10

398 rows × 9 columns

Random dataframes

import numpy as np

N = 100

df = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([5, 7, np.nan], N),
    'c': np.random.choice(['foo', 'bar', 'baz'], N),
})
df.head()

          a    b    c
0 -1.663535  5.0  baz
1  0.408342  NaN  foo
2  0.424239  NaN  bar
3  1.512528  5.0  foo
4 -2.062024  5.0  foo
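Since the values are random, the dataframe changes on every run. If you need reproducible output, a seeded generator is one option (a sketch using NumPy's Generator API; the seed value is arbitrary):

```python
import numpy as np
import pandas as pd

# A fixed seed makes the "random" dataframe reproducible.
rng = np.random.default_rng(42)

N = 5
df_seeded = pd.DataFrame({
    'a': rng.standard_normal(N),
    'b': rng.choice([5, 7, np.nan], N),
    'c': rng.choice(['foo', 'bar', 'baz'], N),
})

# The same seed always yields the same first draw.
same = pd.DataFrame({'a': np.random.default_rng(42).standard_normal(N)})
print(df_seeded['a'].equals(same['a']))  # True
```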

Concatenate dataframes

Row-wise

To concatenate dataframes row-wise (i.e. to append more rows to dataframes with the same structure) we can use the pd.concat() function. For instance, if we create a new random dataframe:

df_extra = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([11, 12, 13], N),
    'c': np.random.choice(['zombie', 'woof', 'nite'], N),
})
df_extra.head()

          a   b       c
0 -0.308607  13    nite
1  1.241705  11    nite
2  1.449900  11  zombie
3  0.741891  12    nite
4 -1.586658  12  zombie

We can now concatenate an arbitrary number of dataframes by passing them as a list:

df_all = pd.concat([df, df_extra])
df_all.sample(9)

           a     b       c
98 -0.412014  11.0  zombie
38 -0.656568   5.0     foo
15  0.263778  13.0  zombie
53 -0.133630   NaN     baz
21 -0.106806   NaN     bar
34 -0.675752   7.0     foo
13  0.181684  12.0  zombie
8  -1.189809  12.0  zombie
0  -0.308607  13.0    nite
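Note that pd.concat keeps each input frame's original row labels, so the result can contain duplicate index values (both df and df_extra contribute labels 0 through 99). To renumber the rows instead, pass ignore_index=True. A small sketch with toy frames (not the random ones above):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Default: row labels are kept, so each frame contributes an index 0 and 1.
kept = pd.concat([df1, df2])
print(list(kept.index))        # [0, 1, 0, 1]

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```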

Column operations

Check column existence

The in keyword can be used directly to check column existence (it tests membership among the column labels, not the values).

'b' in df
True
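To make the label-vs-value distinction concrete, a small sketch (toy dataframe, assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

print('b' in df)          # True  -- 'b' is a column label
print(3 in df)            # False -- cell values are not searched
print('b' in df.columns)  # True  -- equivalent, but more explicit
```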

Renaming columns

df.rename(columns={"a": "new_name"}, inplace=True)
df.columns
Index(['new_name', 'b', 'c'], dtype='object')

We can also use a mapping function, in this case str.upper():

df.rename(columns=str.upper, inplace=True)
df.columns
Index(['NEW_NAME', 'B', 'C'], dtype='object')

We can also use a lambda. For instance, using lambda x: x.capitalize() would result in:

df.rename(columns=lambda x: x.capitalize(), inplace=True)
df.columns
Index(['New_name', 'B', 'C'], dtype='object')

A list of new column names can also be assigned directly to df.columns.

df.columns = ["first", "second", "third"]
df.columns
Index(['first', 'second', 'third'], dtype='object')

Dropping columns

A column can be dropped using the .drop() method along with the columns keyword. For instance, we can drop the second column of df using:

df.drop(columns='second')

        first third
0   -1.663535   baz
1    0.408342   foo
2    0.424239   bar
3    1.512528   foo
4   -2.062024   foo
..        ...   ...
95  -0.856850   foo
96   0.850722   foo
97  -1.076690   bar
98   2.074288   baz
99  -1.176129   baz

100 rows × 2 columns
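Note that .drop() returns a new dataframe by default and leaves the original untouched; pass inplace=True to modify the dataframe itself. A sketch with a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'first': [1], 'second': [2], 'third': [3]})

dropped = df.drop(columns='second')      # returns a new dataframe
print(list(df.columns))       # ['first', 'second', 'third'] -- unchanged
print(list(dropped.columns))  # ['first', 'third']

df.drop(columns='second', inplace=True)  # modifies df itself
print(list(df.columns))       # ['first', 'third']
```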

The del keyword is also a possibility. However, del changes the dataframe in-place, therefore we will make a copy of the dataframe first.

df_copy = df.copy()
df_copy

        first  second third
0   -1.663535     5.0   baz
1    0.408342     NaN   foo
2    0.424239     NaN   bar
3    1.512528     5.0   foo
4   -2.062024     5.0   foo
..        ...     ...   ...
95  -0.856850     NaN   foo
96   0.850722     NaN   foo
97  -1.076690     7.0   bar
98   2.074288     7.0   baz
99  -1.176129     7.0   baz

100 rows × 3 columns

del df_copy['second']
df_copy

        first third
0   -1.663535   baz
1    0.408342   foo
2    0.424239   bar
3    1.512528   foo
4   -2.062024   foo
..        ...   ...
95  -0.856850   foo
96   0.850722   foo
97  -1.076690   bar
98   2.074288   baz
99  -1.176129   baz

100 rows × 2 columns

Yet another possibility is to drop the column by index. For instance:

df.drop(columns=df.columns[1])

        first third
0   -1.663535   baz
1    0.408342   foo
2    0.424239   bar
3    1.512528   foo
4   -2.062024   foo
..        ...   ...
95  -0.856850   foo
96   0.850722   foo
97  -1.076690   bar
98   2.074288   baz
99  -1.176129   baz

100 rows × 2 columns

Or we could use ranges, for instance:

df.drop(columns=df.columns[0:2])

   third
0    baz
1    foo
2    bar
3    foo
4    foo
..   ...
95   foo
96   foo
97   bar
98   baz
99   baz

100 rows × 1 columns

Subsetting and indexing

Indexing performance

Let’s assume you have a column BOOL with values Y or N that you want to replace with an integer 1 or 0 value. The initial¹ instinct would be to do something like:

df["BOOL"] = df["BOOL"].eq("Y").mul(1)

When df is itself a slice of another dataframe, this results in the warning

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The pandas documentation recommends the following .loc idiom instead, which avoids the ambiguous chained assignment (and can also be faster):

df.loc[:, "BOOL"] = df.loc[:, "BOOL"].eq("Y").mul(1)
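As an aside (not from the original text), an explicit mapping is another common way to express the same conversion, and it makes the intent obvious:

```python
import pandas as pd

df = pd.DataFrame({"BOOL": ["Y", "N", "Y"]})

# map with an explicit dictionary: unknown values become NaN
# rather than silently turning into 0, which .eq("Y") would do.
df["BOOL"] = df["BOOL"].map({"Y": 1, "N": 0})
print(df["BOOL"].tolist())  # [1, 0, 1]
```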

  1. and Pythonic? ↩︎