Pandas basics
Dataframe operations
Create a random dataframe
import pandas as pd
import numpy as np
N=100
df = pd.DataFrame({
'a':np.random.randn(N),
'b':np.random.choice( [5,7,np.nan], N),
'c':np.random.choice( ['foo','bar','baz'], N),
})
df.head()
a | b | c | |
---|---|---|---|
0 | -1.917057 | 7.0 | baz |
1 | -1.867614 | 5.0 | foo |
2 | 0.298297 | 7.0 | foo |
3 | 0.498242 | 5.0 | bar |
4 | 0.390240 | 5.0 | bar |
Concatenate dataframes
Row-wise
To concatenate dataframes row-wise (i.e. to append more rows to dataframes with the same structure) we can use the .concat()
method. For instance, if we create a new random dataframe:
df_extra = pd.DataFrame({
'a':np.random.randn(N),
'b':np.random.choice( [11,12,13], N),
'c':np.random.choice( ['zombie','woof','nite'], N),
})
df_extra.head()
a | b | c | |
---|---|---|---|
0 | 0.863633 | 13 | woof |
1 | -0.819802 | 12 | woof |
2 | -0.202597 | 12 | zombie |
3 | 0.808857 | 11 | nite |
4 | -2.089671 | 13 | woof |
We can now concatenate an arbitray number of dataframes by passing them as a list:
df_all = pd.concat([df, df_extra])
df_all.sample(9)
a | b | c | |
---|---|---|---|
38 | -0.135197 | NaN | foo |
3 | 0.498242 | 5.0 | bar |
57 | -0.520117 | 13.0 | zombie |
88 | 0.636065 | 12.0 | zombie |
33 | -1.530864 | 13.0 | zombie |
44 | 0.596794 | NaN | baz |
17 | -0.131911 | NaN | baz |
70 | 1.050289 | 5.0 | baz |
80 | 0.223156 | NaN | foo |
Column operations
Check column existence
The in
keyword can be used directly to check column existence.
'b' in df
True
Renaming columns
df.rename(columns={"a": "new_name"}, inplace=True)
df.columns
Index(['new_name', 'b', 'c'], dtype='object')
Using a mapping function. In this case str.upper()
:
df.rename(columns=str.upper, inplace=True)
df.columns
Index(['NEW_NAME', 'B', 'C'], dtype='object')
We can also use a lambda. For instance, using lambda x: x.capitalize()
would result:
df.rename(columns=lambda x: x.capitalize(), inplace=True)
df.columns
Index(['New_name', 'B', 'C'], dtype='object')
A list of column names can be passed directly to columns.
df.columns = ["first", "second", "third"]
df.columns
Index(['first', 'second', 'third'], dtype='object')
Dropping columns
A column can be dropped using the .drop()
method along with the column
keyword. For instance in the dataframe df
:
We can drop the second
column using:
df.drop(columns='second')
first | third | |
---|---|---|
0 | -1.917057 | baz |
1 | -1.867614 | foo |
2 | 0.298297 | foo |
3 | 0.498242 | bar |
4 | 0.390240 | bar |
... | ... | ... |
95 | -0.848204 | bar |
96 | -0.552840 | baz |
97 | 2.051078 | foo |
98 | 0.770107 | baz |
99 | 1.837310 | bar |
100 rows × 2 columns
The del
keyword is also a possibility. However, del
changes the dataframe in-place, therefore we will make a copy of the dataframe first.
df_copy = df.copy()
df_copy
first | second | third | |
---|---|---|---|
0 | -1.917057 | 7.0 | baz |
1 | -1.867614 | 5.0 | foo |
2 | 0.298297 | 7.0 | foo |
3 | 0.498242 | 5.0 | bar |
4 | 0.390240 | 5.0 | bar |
... | ... | ... | ... |
95 | -0.848204 | 7.0 | bar |
96 | -0.552840 | 5.0 | baz |
97 | 2.051078 | 7.0 | foo |
98 | 0.770107 | NaN | baz |
99 | 1.837310 | 7.0 | bar |
100 rows × 3 columns
del df_copy['second']
df_copy
first | third | |
---|---|---|
0 | -1.917057 | baz |
1 | -1.867614 | foo |
2 | 0.298297 | foo |
3 | 0.498242 | bar |
4 | 0.390240 | bar |
... | ... | ... |
95 | -0.848204 | bar |
96 | -0.552840 | baz |
97 | 2.051078 | foo |
98 | 0.770107 | baz |
99 | 1.837310 | bar |
100 rows × 2 columns
Yet another possibility is to drop the column by index. For instance:
df.drop(columns=df.columns[1])
first | third | |
---|---|---|
0 | -1.917057 | baz |
1 | -1.867614 | foo |
2 | 0.298297 | foo |
3 | 0.498242 | bar |
4 | 0.390240 | bar |
... | ... | ... |
95 | -0.848204 | bar |
96 | -0.552840 | baz |
97 | 2.051078 | foo |
98 | 0.770107 | baz |
99 | 1.837310 | bar |
100 rows × 2 columns
Or we could use ranges, for instance:
df.drop(columns=df.columns[0:2])
third | |
---|---|
0 | baz |
1 | foo |
2 | foo |
3 | bar |
4 | bar |
... | ... |
95 | bar |
96 | baz |
97 | foo |
98 | baz |
99 | bar |
100 rows × 1 columns