Pandas basics
Creating dataframes
Let’s start with the basics. How to create a dataframe.
Loading from CSV
Dataframes can be created from CSV files by using the following method:
import pandas as pd
pd.read_csv('../../data/mpg.csv')

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| … | … | … | … | … | … | … | … | … | … |
| 393 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
| 394 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
| 395 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
| 396 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |
| 397 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 | chevy s-10 |
398 rows × 9 columns
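read_csv also lets you load only a subset of columns with the usecols keyword. A minimal sketch, using a small inline CSV sample in place of the mpg.csv file (the values are illustrative only):

```python
import io

import pandas as pd

# A tiny inline sample standing in for mpg.csv (illustrative values only)
csv_data = io.StringIO(
    "mpg,cylinders,name\n"
    "18.0,8,chevrolet chevelle malibu\n"
    "15.0,8,buick skylark 320\n"
)

# usecols keeps only the requested columns, saving memory on wide files
df_subset = pd.read_csv(csv_data, usecols=["mpg", "name"])
print(df_subset.columns.tolist())  # ['mpg', 'name']
```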
If you want to bypass column-name detection (from the CSV header row), you can supply the column names directly:
columns = [f"field_{i}" for i in range(9)]
pd.read_csv('../../data/mpg.csv', names=columns, skiprows=1, header=None)

| | field_0 | field_1 | field_2 | field_3 | field_4 | field_5 | field_6 | field_7 | field_8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| … | … | … | … | … | … | … | … | … | … |
| 393 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
| 394 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
| 395 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
| 396 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |
| 397 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 | chevy s-10 |
398 rows × 9 columns
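Note that skiprows=1 is only needed because mpg.csv already carries a header row that must be discarded. For a file with no header row at all, header=None plus names is enough. A minimal sketch with an inline headerless sample (hypothetical values):

```python
import io

import pandas as pd

# A headerless CSV fragment (hypothetical values)
raw = io.StringIO("18.0,8,307.0\n15.0,8,350.0\n")

# header=None tells pandas the first line is data, not column names
df_nohdr = pd.read_csv(raw, header=None, names=["mpg", "cylinders", "displacement"])
print(df_nohdr.columns.tolist())  # ['mpg', 'cylinders', 'displacement']
```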
Random dataframes
import numpy as np

N = 100
df = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([5, 7, np.nan], N),
    'c': np.random.choice(['foo', 'bar', 'baz'], N),
})
df.head()

| | a | b | c |
|---|---|---|---|
| 0 | -1.663535 | 5.0 | baz |
| 1 | 0.408342 | NaN | foo |
| 2 | 0.424239 | NaN | bar |
| 3 | 1.512528 | 5.0 | foo |
| 4 | -2.062024 | 5.0 | foo |
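Because the values are random, the dataframe changes on every run. If you need reproducible output, seed the generator; a short sketch using NumPy's newer Generator API:

```python
import numpy as np
import pandas as pd

# Seeding the generator makes the "random" dataframe reproducible across runs
rng = np.random.default_rng(42)
N = 5
df_seeded = pd.DataFrame({
    'a': rng.standard_normal(N),
    'b': rng.choice([5, 7, np.nan], N),
    'c': rng.choice(['foo', 'bar', 'baz'], N),
})
print(df_seeded.shape)  # (5, 3)
```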
Concatenate dataframes
Row-wise
To concatenate dataframes row-wise (i.e. to append more rows to dataframes with the same structure) we can use the pd.concat() function. For instance, if we create a new random dataframe:
df_extra = pd.DataFrame({
    'a': np.random.randn(N),
    'b': np.random.choice([11, 12, 13], N),
    'c': np.random.choice(['zombie', 'woof', 'nite'], N),
})
df_extra.head()

| | a | b | c |
|---|---|---|---|
| 0 | -0.308607 | 13 | nite |
| 1 | 1.241705 | 11 | nite |
| 2 | 1.449900 | 11 | zombie |
| 3 | 0.741891 | 12 | nite |
| 4 | -1.586658 | 12 | zombie |
We can now concatenate an arbitrary number of dataframes by passing them as a list:
df_all = pd.concat([df, df_extra])
df_all.sample(9)

| | a | b | c |
|---|---|---|---|
| 98 | -0.412014 | 11.0 | zombie |
| 38 | -0.656568 | 5.0 | foo |
| 15 | 0.263778 | 13.0 | zombie |
| 53 | -0.133630 | NaN | baz |
| 21 | -0.106806 | NaN | bar |
| 34 | -0.675752 | 7.0 | foo |
| 13 | 0.181684 | 12.0 | zombie |
| 8 | -1.189809 | 12.0 | zombie |
| 0 | -0.308607 | 13.0 | nite |
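Notice that both inputs carried index labels 0..99, so df_all ends up with duplicate labels (98 and 0 appear twice in the sample above). Passing ignore_index=True builds a fresh RangeIndex instead; a small sketch with hypothetical two-row frames:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
right = pd.DataFrame({"a": [3, 4], "b": ["z", "w"]})

# Without ignore_index the original labels are kept (0, 1, 0, 1);
# ignore_index=True relabels the result with a fresh RangeIndex
combined = pd.concat([left, right], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
```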
Column operations
Check column existence
The in keyword can be used directly to check column existence.
'b' in df

True
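To check several columns at once, compare a set of names against df.columns; a minimal sketch with a hypothetical dataframe:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# `in` checks one label at a time; for several, use set.issubset
has_all = {"a", "b"}.issubset(df_demo.columns)
print(has_all)  # True
```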
Renaming columns
df.rename(columns={"a": "new_name"}, inplace=True)
df.columns

Index(['new_name', 'b', 'c'], dtype='object')
Using a mapping function. In this case str.upper():
df.rename(columns=str.upper, inplace=True)
df.columns

Index(['NEW_NAME', 'B', 'C'], dtype='object')
We can also use a lambda. For instance, lambda x: x.capitalize() would give:
df.rename(columns=lambda x: x.capitalize(), inplace=True)
df.columns

Index(['New_name', 'B', 'C'], dtype='object')
A list of column names can be passed directly to columns.
df.columns = ["first", "second", "third"]
df.columns

Index(['first', 'second', 'third'], dtype='object')
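For the common case of prepending or appending a fixed string to every column name, pandas also offers add_prefix and add_suffix, which return renamed copies. A short sketch with a hypothetical dataframe:

```python
import pandas as pd

df_p = pd.DataFrame({"x": [1], "y": [2]})

# add_prefix returns a renamed copy; df_p itself is left untouched
print(df_p.add_prefix("col_").columns.tolist())  # ['col_x', 'col_y']
print(df_p.columns.tolist())  # ['x', 'y']
```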
Dropping columns
A column can be dropped using the .drop() method along with the columns keyword. For instance, we can drop the second column of df using:
df.drop(columns='second')

| | first | third |
|---|---|---|
| 0 | -1.663535 | baz |
| 1 | 0.408342 | foo |
| 2 | 0.424239 | bar |
| 3 | 1.512528 | foo |
| 4 | -2.062024 | foo |
| … | … | … |
| 95 | -0.856850 | foo |
| 96 | 0.850722 | foo |
| 97 | -1.076690 | bar |
| 98 | 2.074288 | baz |
| 99 | -1.176129 | baz |
100 rows × 2 columns
The del keyword is also a possibility. However, del changes the dataframe in-place, therefore we will make a copy of the dataframe first.
df_copy = df.copy()
df_copy

| | first | second | third |
|---|---|---|---|
| 0 | -1.663535 | 5.0 | baz |
| 1 | 0.408342 | NaN | foo |
| 2 | 0.424239 | NaN | bar |
| 3 | 1.512528 | 5.0 | foo |
| 4 | -2.062024 | 5.0 | foo |
| … | … | … | … |
| 95 | -0.856850 | NaN | foo |
| 96 | 0.850722 | NaN | foo |
| 97 | -1.076690 | 7.0 | bar |
| 98 | 2.074288 | 7.0 | baz |
| 99 | -1.176129 | 7.0 | baz |
100 rows × 3 columns
del df_copy['second']
df_copy

| | first | third |
|---|---|---|
| 0 | -1.663535 | baz |
| 1 | 0.408342 | foo |
| 2 | 0.424239 | bar |
| 3 | 1.512528 | foo |
| 4 | -2.062024 | foo |
| … | … | … |
| 95 | -0.856850 | foo |
| 96 | 0.850722 | foo |
| 97 | -1.076690 | bar |
| 98 | 2.074288 | baz |
| 99 | -1.176129 | baz |
100 rows × 2 columns
Yet another possibility is to drop the column by index. For instance:
df.drop(columns=df.columns[1])

| | first | third |
|---|---|---|
| 0 | -1.663535 | baz |
| 1 | 0.408342 | foo |
| 2 | 0.424239 | bar |
| 3 | 1.512528 | foo |
| 4 | -2.062024 | foo |
| … | … | … |
| 95 | -0.856850 | foo |
| 96 | 0.850722 | foo |
| 97 | -1.076690 | bar |
| 98 | 2.074288 | baz |
| 99 | -1.176129 | baz |
100 rows × 2 columns
Or we could use ranges, for instance:
df.drop(columns=df.columns[0:2])

| | third |
|---|---|
| 0 | baz |
| 1 | foo |
| 2 | bar |
| 3 | foo |
| 4 | foo |
| … | … |
| 95 | foo |
| 96 | foo |
| 97 | bar |
| 98 | baz |
| 99 | baz |
100 rows × 1 columns
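Several columns can also be dropped at once by passing a list of names, and errors="ignore" skips any names that are absent instead of raising a KeyError. A minimal sketch with a hypothetical dataframe:

```python
import pandas as pd

df_d = pd.DataFrame({"first": [1], "second": [2], "third": [3]})

# Drop a list of columns; errors="ignore" silently skips the missing name
result = df_d.drop(columns=["second", "third", "missing"], errors="ignore")
print(result.columns.tolist())  # ['first']
```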
Subsetting and indexing
Indexing performance
Let’s assume the case where you have a column BOOL with values Y or N that you want to replace with an integer 1 or 0 value. The initial1 instinct would be to do something like:
df["BOOL"] = df["BOOL"].eq("Y").mul(1)

This will result in the warning:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The pandas documentation recommends the following idiom instead, which can also be considerably faster:
df.loc[:, "BOOL"] = df.loc[:, "BOOL"].eq("Y").mul(1)

Footnotes

and Pythonic?↩︎