Pandas basics

Dataframe operations

Create a random dataframe

import pandas as pd
import numpy as np

N = 100  # number of rows

# Three columns: random floats, numbers with missing values, and strings
df = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([5, 7, np.nan], N),
    'c': np.random.choice(['foo', 'bar', 'baz'], N),
})
df.head()

           a    b    c
0  -1.917057  7.0  baz
1  -1.867614  5.0  foo
2   0.298297  7.0  foo
3   0.498242  5.0  bar
4   0.390240  5.0  bar

Concatenate dataframes

Row-wise

To concatenate dataframes row-wise (i.e. to append more rows to dataframes with the same structure) we can use the pd.concat() function. For instance, if we create a new random dataframe:

# Same columns as df, but with different values in 'b' and 'c'
df_extra = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([11, 12, 13], N),
    'c': np.random.choice(['zombie', 'woof', 'nite'], N),
})
df_extra.head()

           a   b       c
0   0.863633  13    woof
1  -0.819802  12    woof
2  -0.202597  12  zombie
3   0.808857  11    nite
4  -2.089671  13    woof

We can now concatenate an arbitrary number of dataframes by passing them as a list:

df_all = pd.concat([df, df_extra])
df_all.sample(9)

            a     b       c
38  -0.135197   NaN     foo
3    0.498242   5.0     bar
57  -0.520117  13.0  zombie
88   0.636065  12.0  zombie
33  -1.530864  13.0  zombie
44   0.596794   NaN     baz
17  -0.131911   NaN     baz
70   1.050289   5.0     baz
80   0.223156   NaN     foo
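
By default pd.concat keeps the index labels of each input, so after the concatenation above the labels 0 to 99 each appear twice in df_all. A minimal sketch, assuming two small hypothetical frames `left` and `right` (not part of the original example), of how the `ignore_index=True` option renumbers the rows instead:

```python
import pandas as pd

# Hypothetical toy frames, used only for illustration.
left = pd.DataFrame({'a': [1.0, 2.0], 'c': ['foo', 'bar']})
right = pd.DataFrame({'a': [3.0, 4.0], 'c': ['woof', 'nite']})

# Default behaviour: each frame keeps its own index labels, so 0 and 1 repeat.
kept = pd.concat([left, right])
print(kept.index.tolist())        # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([left, right], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Renumbering is usually the safer choice when the original labels carry no meaning, since label-based lookups on a repeated index return several rows.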

Column operations

Check column existence

The in operator can be used directly on a dataframe to check whether a column exists.

'b' in df
True
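
The same check extends to several columns at once. A minimal sketch, assuming a hypothetical toy frame `demo` and an illustrative list `required` (neither is part of the original example); it relies on `in` testing column labels rather than cell values:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
demo = pd.DataFrame({'a': [1, 2], 'b': [5.0, 7.0], 'c': ['foo', 'bar']})

# `in` looks at column labels, not at the values stored in the frame.
print('b' in demo)      # True
print('foo' in demo)    # False: 'foo' is a value, not a column label

# Check a whole list of required columns in one go.
required = ['a', 'b', 'z']
all_present = set(required).issubset(demo.columns)
missing = [col for col in required if col not in demo.columns]
print(all_present)      # False
print(missing)          # ['z']
```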

Renaming columns

df.rename(columns={"a": "new_name"}, inplace=True)
df.columns
Index(['new_name', 'b', 'c'], dtype='object')

rename also accepts a mapping function, in this case str.upper:

df.rename(columns=str.upper, inplace=True)
df.columns
Index(['NEW_NAME', 'B', 'C'], dtype='object')

We can also use a lambda. For instance, lambda x: x.capitalize() results in:

df.rename(columns=lambda x: x.capitalize(), inplace=True)
df.columns
Index(['New_name', 'B', 'C'], dtype='object')

A list of column names can also be assigned directly to df.columns; it must contain exactly one name per column.

df.columns = ["first", "second", "third"]
df.columns
Index(['first', 'second', 'third'], dtype='object')
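
The rename calls above use inplace=True, which modifies df itself. A sketch of the non-mutating form, assuming a hypothetical toy frame `demo` (not part of the original example): without inplace, rename returns a new dataframe and leaves the original untouched, and mapping keys that match no column are ignored by default.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
demo = pd.DataFrame({'a': [1, 2], 'b': [5.0, 7.0], 'c': ['foo', 'bar']})

# Without inplace=True, rename returns a renamed copy.
# 'zzz' matches no column and is ignored (the default is errors='ignore').
renamed = demo.rename(columns={'a': 'new_name', 'zzz': 'unused'})

print(list(renamed.columns))  # ['new_name', 'b', 'c']
print(list(demo.columns))     # ['a', 'b', 'c']
```

Passing errors='raise' instead turns an unmatched key into a KeyError, which can help catch typos in column names.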

Dropping columns

A column can be dropped using the .drop() method along with the columns keyword. For instance, in the dataframe df we can drop the second column using:

df.drop(columns='second')

        first  third
0   -1.917057    baz
1   -1.867614    foo
2    0.298297    foo
3    0.498242    bar
4    0.390240    bar
..        ...    ...
95  -0.848204    bar
96  -0.552840    baz
97   2.051078    foo
98   0.770107    baz
99   1.837310    bar

100 rows × 2 columns
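
Note that .drop() returns a new dataframe; unless the result is assigned, df keeps all its columns. A minimal sketch, assuming a hypothetical toy frame `demo` with the same column names (not the random data used above); it also shows that a list of names drops several columns at once, and that errors='ignore' skips names that are not present:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
demo = pd.DataFrame({
    'first': [0.1, 0.2],
    'second': [5.0, 7.0],
    'third': ['foo', 'bar'],
})

# drop returns a new frame; demo itself is unchanged.
trimmed = demo.drop(columns=['second', 'third'])
print(list(trimmed.columns))  # ['first']
print(list(demo.columns))     # ['first', 'second', 'third']

# A missing label raises KeyError by default; errors='ignore' skips it.
safe = demo.drop(columns=['second', 'missing'], errors='ignore')
print(list(safe.columns))     # ['first', 'third']
```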

The del keyword is another possibility. However, del modifies the dataframe in place, so we will make a copy of the dataframe first.

df_copy = df.copy()
df_copy

        first  second  third
0   -1.917057     7.0    baz
1   -1.867614     5.0    foo
2    0.298297     7.0    foo
3    0.498242     5.0    bar
4    0.390240     5.0    bar
..        ...     ...    ...
95  -0.848204     7.0    bar
96  -0.552840     5.0    baz
97   2.051078     7.0    foo
98   0.770107     NaN    baz
99   1.837310     7.0    bar

100 rows × 3 columns

del df_copy['second']
df_copy

        first  third
0   -1.917057    baz
1   -1.867614    foo
2    0.298297    foo
3    0.498242    bar
4    0.390240    bar
..        ...    ...
95  -0.848204    bar
96  -0.552840    baz
97   2.051078    foo
98   0.770107    baz
99   1.837310    bar

100 rows × 2 columns
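
When the removed column is still needed afterwards, DataFrame.pop is a related in-place option: it deletes the column like del does, but returns it as a Series. A minimal sketch on a hypothetical toy frame `demo` (not the random data used above):

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
demo = pd.DataFrame({
    'first': [0.1, 0.2],
    'second': [5.0, 7.0],
    'third': ['foo', 'bar'],
})
work = demo.copy()            # work on a copy so that demo stays intact

second = work.pop('second')   # removes the column in place and returns it
print(second.tolist())        # [5.0, 7.0]
print(list(work.columns))     # ['first', 'third']
print(list(demo.columns))     # ['first', 'second', 'third']
```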

Yet another possibility is to drop the column by position, looking up its name in df.columns. For instance:

df.drop(columns=df.columns[1])

        first  third
0   -1.917057    baz
1   -1.867614    foo
2    0.298297    foo
3    0.498242    bar
4    0.390240    bar
..        ...    ...
95  -0.848204    bar
96  -0.552840    baz
97   2.051078    foo
98   0.770107    baz
99   1.837310    bar

100 rows × 2 columns

Or we could use a slice to drop several adjacent columns, for instance:

df.drop(columns=df.columns[0:2])

    third
0     baz
1     foo
2     foo
3     bar
4     bar
..    ...
95    bar
96    baz
97    foo
98    baz
99    bar

100 rows × 1 columns
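
Positions do not have to be contiguous. A sketch, again on a hypothetical toy frame `demo` (not the random data used above), that indexes columns with a list of positions, followed by an equivalent formulation that selects the columns to keep with .iloc:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
demo = pd.DataFrame({
    'first': [0.1, 0.2],
    'second': [5.0, 7.0],
    'third': ['foo', 'bar'],
})

# Drop the first and third columns by position.
by_position = demo.drop(columns=demo.columns[[0, 2]])
print(list(by_position.columns))  # ['second']

# Same result, expressed as "keep position 1" with .iloc.
kept = demo.iloc[:, [1]]
print(list(kept.columns))         # ['second']
```

Selecting what to keep tends to read more clearly when only a few columns survive, while drop is more compact when only a few are removed.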