Pandas basics

Creating dataframes

Let’s start with the basics: how to create a dataframe.

Loading from CSV

Dataframes can be created from CSV files using the pd.read_csv() function:

import pandas as pd

pd.read_csv('../../data/mpg.csv')
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86 2790 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52 2130 24.6 82 2 vw pickup
395 32.0 4 135.0 84 2295 11.6 82 1 dodge rampage
396 28.0 4 120.0 79 2625 18.6 82 1 ford ranger
397 31.0 4 119.0 82 2720 19.4 82 1 chevy s-10

398 rows × 9 columns

If you want to bypass the column name detection (from the header row), you can supply the column names directly:

columns = [f"field_{i}" for i in range(9)]
pd.read_csv('../../data/mpg.csv', names=columns, skiprows=1, header=None)
field_0 field_1 field_2 field_3 field_4 field_5 field_6 field_7 field_8
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86 2790 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52 2130 24.6 82 2 vw pickup
395 32.0 4 135.0 84 2295 11.6 82 1 dodge rampage
396 28.0 4 120.0 79 2625 18.6 82 1 ford ranger
397 31.0 4 119.0 82 2720 19.4 82 1 chevy s-10

398 rows × 9 columns
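
The same header-skipping behaviour can be reproduced on a small in-memory CSV via io.StringIO, which makes it easy to experiment without depending on the mpg.csv file path (a minimal sketch; the sample rows are made up):

```python
import io
import pandas as pd

# A tiny CSV with a header row, mimicking the structure above
csv_text = (
    "mpg,cylinders,name\n"
    "18.0,8,chevrolet chevelle malibu\n"
    "15.0,8,buick skylark 320\n"
)

# Skip the header row and supply our own column names
columns = [f"field_{i}" for i in range(3)]
df_demo = pd.read_csv(io.StringIO(csv_text), names=columns, skiprows=1, header=None)
print(df_demo.columns.tolist())  # the supplied names, not the file's header
```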

Random dataframes

import numpy as np

N=100

df = pd.DataFrame({
    'a':np.random.randn(N),
    'b':np.random.choice( [5,7,np.nan], N),
    'c':np.random.choice( ['foo','bar','baz'], N),
    })
df.head()
a b c
0 -1.663535 5.0 baz
1 0.408342 NaN foo
2 0.424239 NaN bar
3 1.512528 5.0 foo
4 -2.062024 5.0 foo
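
Since these values are random, each run produces different numbers. For reproducible examples you can seed NumPy's generator before drawing (the seed value below is arbitrary):

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # any fixed seed makes the draws repeatable
N = 100

df = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([5, 7, np.nan], N),
    'c': np.random.choice(['foo', 'bar', 'baz'], N),
})
```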

Concatenate dataframes

Row-wise

To concatenate dataframes row-wise (i.e. to append more rows to dataframes with the same structure) we can use the pd.concat() function. For instance, if we create a new random dataframe:

df_extra = pd.DataFrame({
    'a':np.random.randn(N),
    'b':np.random.choice( [11,12,13], N),
    'c':np.random.choice( ['zombie','woof','nite'], N),
    })
df_extra.head()
a b c
0 -0.308607 13 nite
1 1.241705 11 nite
2 1.449900 11 zombie
3 0.741891 12 nite
4 -1.586658 12 zombie

We can now concatenate an arbitrary number of dataframes by passing them as a list:

df_all = pd.concat([df, df_extra])
df_all.sample(9)
a b c
98 -0.412014 11.0 zombie
38 -0.656568 5.0 foo
15 0.263778 13.0 zombie
53 -0.133630 NaN baz
21 -0.106806 NaN bar
34 -0.675752 7.0 foo
13 0.181684 12.0 zombie
8 -1.189809 12.0 zombie
0 -0.308607 13.0 nite
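
Note in the sample above that row labels repeat (index 0 appears twice), because pd.concat() keeps each dataframe's original index. If a fresh 0..N-1 index is wanted, pass ignore_index=True — a sketch with two tiny frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Default: the original row labels 0, 1 repeat
print(pd.concat([df1, df2]).index.tolist())                     # [0, 1, 0, 1]

# ignore_index=True builds a fresh RangeIndex instead
print(pd.concat([df1, df2], ignore_index=True).index.tolist())  # [0, 1, 2, 3]
```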

Column operations

Check column existence

The in keyword can be used directly to check column existence.

'b' in df
True
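
Note that in tests against the column labels, not the cell values. To look for a value inside a column, check that column's values explicitly:

```python
import pandas as pd

df = pd.DataFrame({'b': [5, 7]})

print('b' in df)            # True  -- 'b' is a column label
print(5 in df)              # False -- cell values are not checked
print(5 in df['b'].values)  # True  -- membership in the column's values
```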

Renaming columns

df.rename(columns={"a": "new_name"}, inplace=True)
df.columns
Index(['new_name', 'b', 'c'], dtype='object')

We can also use a mapping function, in this case str.upper():

df.rename(columns=str.upper, inplace=True)
df.columns
Index(['NEW_NAME', 'B', 'C'], dtype='object')

We can also use a lambda. For instance, using lambda x: x.capitalize() results in:

df.rename(columns=lambda x: x.capitalize(), inplace=True)
df.columns
Index(['New_name', 'B', 'C'], dtype='object')
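
By default, rename() silently ignores mapping keys that do not match any existing label; passing errors="raise" turns a typo into an explicit KeyError (a sketch on a throwaway frame):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2]})

# Misspelled label 'A': silently ignored by default
print(df.rename(columns={'A': 'x'}).columns.tolist())  # ['a', 'b']

# errors="raise" surfaces the mistake instead
try:
    df.rename(columns={'A': 'x'}, errors='raise')
except KeyError as exc:
    print('KeyError:', exc)
```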

A list of column names can also be assigned directly to the columns attribute.

df.columns = ["first", "second", "third"]
df.columns
Index(['first', 'second', 'third'], dtype='object')

Dropping columns

A column can be dropped using the .drop() method along with the columns keyword. For instance, in the dataframe df we can drop the second column using:

df.drop(columns='second')
first third
0 -1.663535 baz
1 0.408342 foo
2 0.424239 bar
3 1.512528 foo
4 -2.062024 foo
... ... ...
95 -0.856850 foo
96 0.850722 foo
97 -1.076690 bar
98 2.074288 baz
99 -1.176129 baz

100 rows × 2 columns

The del keyword is also a possibility. However, del changes the dataframe in place, so we will make a copy of the dataframe first.

df_copy = df.copy()
df_copy
first second third
0 -1.663535 5.0 baz
1 0.408342 NaN foo
2 0.424239 NaN bar
3 1.512528 5.0 foo
4 -2.062024 5.0 foo
... ... ... ...
95 -0.856850 NaN foo
96 0.850722 NaN foo
97 -1.076690 7.0 bar
98 2.074288 7.0 baz
99 -1.176129 7.0 baz

100 rows × 3 columns

del df_copy['second']
df_copy
first third
0 -1.663535 baz
1 0.408342 foo
2 0.424239 bar
3 1.512528 foo
4 -2.062024 foo
... ... ...
95 -0.856850 foo
96 0.850722 foo
97 -1.076690 bar
98 2.074288 baz
99 -1.176129 baz

100 rows × 2 columns

Yet another possibility is to drop the column by index. For instance:

df.drop(columns=df.columns[1])
first third
0 -1.663535 baz
1 0.408342 foo
2 0.424239 bar
3 1.512528 foo
4 -2.062024 foo
... ... ...
95 -0.856850 foo
96 0.850722 foo
97 -1.076690 bar
98 2.074288 baz
99 -1.176129 baz

100 rows × 2 columns

Or we could use ranges, for instance:

df.drop(columns=df.columns[0:2])
third
0 baz
1 foo
2 bar
3 foo
4 foo
... ...
95 foo
96 foo
97 bar
98 baz
99 baz

100 rows × 1 columns

Subsetting and indexing

Indexing performance

Let’s assume the case where you have a column BOOL with values Y or N that you want to replace with an integer 1 or 0 value. The initial¹ instinct would be to do something like:

df["BOOL"] = df["BOOL"].eq("Y").mul(1)

This will result in the warning:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The pandas documentation recommends the following idiom instead, which avoids the warning and can also be considerably faster:

df.loc[:, ("BOOL")] = df.loc[:, ("BOOL")].eq("Y").mul(1)
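
A self-contained illustration of the idiom (the BOOL column below is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"BOOL": ["Y", "N", "Y"]})

# Whole-column assignment through .loc: no SettingWithCopyWarning,
# and the comparison and multiplication are vectorised
df.loc[:, "BOOL"] = df.loc[:, "BOOL"].eq("Y").mul(1)
print(df["BOOL"].tolist())  # [1, 0, 1]
```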

Footnotes

  1. and Pythonic?↩︎