6  Working with DataFrames

Now that we have learned how to create DataFrames and perform some basic operations, let’s deepen our understanding with a few more tricks that help us work with DataFrames more efficiently. We’ll start off by importing the necessary libraries.

import numpy as np
import pandas as pd

Additionally, we will use a sample dataset to demonstrate the operations. Let’s load the dataset from the seaborn library.

Note

Seaborn is a commonly used library for data visualization in Python. It comes with a few sample datasets that we can use for practice. We’ll use the classic iris dataset for this tutorial. We’ll learn more about Seaborn in the upcoming chapters.

import seaborn as sns

# Load the dataset
df = sns.load_dataset('iris')
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

As we can see, the dataset contains four features: sepal_length, sepal_width, petal_length, and petal_width. The target variable is species, which is the type of iris flower. We also used the head() method to display the first five rows of the dataset. Let’s learn more about the dataset using some common methods found in Pandas.

6.1 Common methods in Pandas

We already saw that the head() method displays the first five rows of the dataset. Five is the default value, but we can change it using the n parameter, as the short sketch below shows. Similarly, we can use the tail() method to display the last rows of the dataset.
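
As a quick sketch, passing a different value for n asks for only that many rows; the same parameter works for tail().

# display only the first three rows
df.head(3)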

# Display the last nine rows
df.tail(9)
sepal_length sepal_width petal_length petal_width species
141 6.9 3.1 5.1 2.3 virginica
142 5.8 2.7 5.1 1.9 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

6.1.1 General information about the dataset

The info() method provides a concise summary of the dataset. It displays the number of non-null values in each column, the data type of each column, and the memory usage of the dataset.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Summary statistics can be obtained using the describe() method. It provides the count, mean, standard deviation, minimum, maximum, and the quartiles of the dataset. This provides a handy way of getting an overall impression of the dataset.

df.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

If we just want to know the number of rows and columns in the dataset, we can use the shape attribute.

df.shape
(150, 5)
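
Since shape is a regular Python tuple, we can also unpack it directly if we only need one of the values; a small sketch:

# unpack the number of rows and columns
n_rows, n_cols = df.shape
print(n_rows, n_cols)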

The columns attribute provides the names of the columns in the dataset.

df.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

6.1.2 Unique values in the dataset

There are a few methods that help us get unique values in the dataset. The unique() method provides the unique values in a column. Let’s see how this works by getting the unique values in the species column.

unique_species = df['species'].unique()
unique_species
array(['setosa', 'versicolor', 'virginica'], dtype=object)

We can see that the dataset contains three unique species of iris flowers. This is easy to see directly, but if there were many more species, we could also check the number of unique values like this:

len(unique_species)
3

There is actually a dedicated method for obtaining the number of unique values. The nunique() method provides the number of unique values in each column of the dataset.

df.nunique()
sepal_length    35
sepal_width     23
petal_length    43
petal_width     22
species          3
dtype: int64

Finally, the value_counts() method provides the frequency of each unique value in a column. Let’s see how this works by getting the frequency of each species in the species column.

df['species'].value_counts()
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

6.1.3 Sorting values

The sort_values() method allows us to sort the dataset based on one or more columns. Let’s see how this works by sorting the dataset based on the sepal_length column.

df.sort_values('sepal_length').head()
sepal_length sepal_width petal_length petal_width species
13 4.3 3.0 1.1 0.1 setosa
42 4.4 3.2 1.3 0.2 setosa
38 4.4 3.0 1.3 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
41 4.5 2.3 1.3 0.3 setosa
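
sort_values() also accepts a list of columns and an ascending flag, so we can mix sort directions; a minimal sketch:

# sort by species alphabetically, then by sepal_length from largest to smallest
df.sort_values(['species', 'sepal_length'], ascending=[True, False]).head()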

6.1.4 Applying functions to columns

The apply() method allows us to apply a function to columns in the dataset. For example, if we take a look at the first few values for the species column, we can see that the values are strings. Let’s apply the upper() method to the species column to convert the strings to uppercase.

df['species'].apply(lambda x: x.upper()).head()
0    SETOSA
1    SETOSA
2    SETOSA
3    SETOSA
4    SETOSA
Name: species, dtype: object

For simple functions like finding the length of a string, we can use the len function directly.

df['species'].apply(len).head()
0    6
1    6
2    6
3    6
4    6
Name: species, dtype: int64

The apply() method coupled with lambda expressions grants us the ability to easily craft custom modifications on existing columns. This will be useful when we need to clean or transform data.

# Make species uppercase and add a suffix to the end
df['species'].apply(lambda x: x.upper() + '_SPECIES').head()
0    SETOSA_SPECIES
1    SETOSA_SPECIES
2    SETOSA_SPECIES
3    SETOSA_SPECIES
4    SETOSA_SPECIES
Name: species, dtype: object

6.2 Grouping data

If you are familiar with SQL queries, you might have used the GROUP BY clause to group data based on a particular column. Pandas provides similar functionality with the groupby() method. Grouping data allows you to calculate statistics, such as the mean, for distinct groups in the dataset. Let’s take a look at how this works by grouping the iris data based on the species column and calculating the mean of the other columns.

df.groupby('species').mean()
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026

We can also target specific columns. Let’s see how this works by taking the sum of the sepal_length for each species.

df.groupby('species')['sepal_length'].sum()
species
setosa        250.3
versicolor    296.8
virginica     329.4
Name: sepal_length, dtype: float64

Some commonly used aggregation functions to chain after a groupby clause are listed below (a short sketch follows the list):

  • mean()
  • sum()
  • count()
  • min()
  • max()
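
For instance, the largest petal_length per species could be obtained like this (a minimal sketch):

# maximum petal_length for each species
df.groupby('species')['petal_length'].max()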

We can also use the agg() method to apply multiple aggregation functions at once. Let’s try this by taking the count, min, and max of the sepal_length and sepal_width columns for each species.

df.groupby('species')[['sepal_length', 'sepal_width']].agg(['count', 'min', 'max'])
sepal_length sepal_width
count min max count min max
species
setosa 50 4.3 5.8 50 2.3 4.4
versicolor 50 4.9 7.0 50 2.0 3.4
virginica 50 4.9 7.9 50 2.2 3.8

If you want to get summary statistics per group, you can also use the describe() method.

# let's get the summary statistics for all columns
# and show results for the petal_length column
df.groupby('species').describe()['petal_length']
count mean std min 25% 50% 75% max
species
setosa 50.0 1.462 0.173664 1.0 1.4 1.50 1.575 1.9
versicolor 50.0 4.260 0.469911 3.0 4.0 4.35 4.600 5.1
virginica 50.0 5.552 0.551895 4.5 5.1 5.55 5.875 6.9

6.3 Handling missing values

In a perfect world, we would always work with complete data tables without any missing values. However, out in the wild we are often confronted with incomplete data, which is why dealing with missing values is an important skill to have. Pandas provides a few methods to handle missing values.

Let’s look at our iris dataset and introduce some missing values.

# introduce a missing value in the third row (index 2) of the sepal_length column
df.loc[2, 'sepal_length'] = np.nan
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 NaN 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

There are a couple of methods which help us determine if missing values are present. The following methods are commonly used:

  • isnull(): returns a DataFrame of the same shape as the original dataset, where each cell is either True or False depending on whether the value is missing or not.
  • notnull(): returns the opposite of isnull(). It returns True if the value is not missing, and False otherwise.
  • isna(): an alias for isnull().
  • notna(): an alias for notnull().

Let’s start with the isnull() method:

df.head().isnull()
sepal_length sepal_width petal_length petal_width species
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False False False
4 False False False False False

The isna() method does exactly the same thing as isnull(), which we can see in the example below.

df.head().isna()
sepal_length sepal_width petal_length petal_width species
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False False False
4 False False False False False

The notnull() method works the other way around; applied to the sepal_length column, it returns False only for the missing value:

df.head()['sepal_length'].notnull()
0     True
1     True
2    False
3     True
4     True
Name: sepal_length, dtype: bool

To get rid of missing values, we can drop the rows that contain them using the dropna() method. By default, it drops rows where at least one element is missing.

df.dropna().head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa

The dropna() method also has a how parameter, which can be set to 'all' to drop only rows where all elements are missing (a short sketch of this follows the next example). It is also possible to drop columns with missing values by setting the axis parameter to 1. Let’s see how this works.

# drop columns with missing values
df.dropna(axis=1).head()
sepal_width petal_length petal_width species
0 3.5 1.4 0.2 setosa
1 3.0 1.4 0.2 setosa
2 3.2 1.3 0.2 setosa
3 3.1 1.5 0.2 setosa
4 3.6 1.4 0.2 setosa
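
For completeness, here is a sketch of the how='all' variant mentioned above; in our dataset no row is entirely empty, so nothing would actually be dropped:

# only drop rows where every value is missing
df.dropna(how='all').head()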

6.3.1 Filling missing values

Instead of dropping rows with missing values, we can also fill them in, which is often referred to as imputation. The fillna() method allows us to replace missing values with a specific value. Let’s fill the missing value in the sepal_length column with the mean of the column.

# use the mean of sepal_length column to fill in for missing values
df['sepal_length'] = df['sepal_length'].fillna(df['sepal_length'].mean())
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.100000 3.5 1.4 0.2 setosa
1 4.900000 3.0 1.4 0.2 setosa
2 5.851007 3.2 1.3 0.2 setosa
3 4.600000 3.1 1.5 0.2 setosa
4 5.000000 3.6 1.4 0.2 setosa

6.3.2 Missing values for other data types

So far, we have only looked at missing values in numerical columns. However, missing values can also occur in categorical columns. Let’s introduce a missing value in the species column.

# introduce a missing value in the species column
df.loc[2, 'species'] = np.nan
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.100000 3.5 1.4 0.2 setosa
1 4.900000 3.0 1.4 0.2 setosa
2 5.851007 3.2 1.3 0.2 NaN
3 4.600000 3.1 1.5 0.2 setosa
4 5.000000 3.6 1.4 0.2 setosa

Let’s inspect the datatypes of the columns in the dataset.

df.dtypes
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

We can now try to impute the most common value in the species column for the missing value. Before we do that, let’s see how the mode() method returns the most common value(s) in a column.

df['species'].mode()
0    versicolor
1     virginica
Name: species, dtype: object

We can see that mode() returns a Series object; here it contains two values, because after removing one setosa label, versicolor and virginica are tied as the most common species with 50 occurrences each. To get a single value we can use iloc, or simply index the result with mode()[0]. Let’s use this to fill the missing value in the species column.

# fill missing values in the species column with the most common value
df['species'] = df['species'].fillna(df['species'].mode()[0])
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.100000 3.5 1.4 0.2 setosa
1 4.900000 3.0 1.4 0.2 setosa
2 5.851007 3.2 1.3 0.2 versicolor
3 4.600000 3.1 1.5 0.2 setosa
4 5.000000 3.6 1.4 0.2 setosa

Obviously, here we have imputed the incorrect species in place of the missing value. In practice, you would need to be more careful and consider the context of the data before imputing missing values. However, this example shows how you can impute missing values in categorical columns. There is more to learn about missing values, but for now, we will leave it at this.

6.4 Joining DataFrames

Quite often we find ourselves working with multiple datasets that we need to combine. Pandas provides a few methods to join DataFrames. Let’s start by creating two DataFrames to demonstrate how this works.

sales_regions = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    'manager': ['John', 'Sara', 'Tom', 'Alice']
})

sales_regions
region manager
0 North John
1 South Sara
2 East Tom
3 West Alice
# sales for three years by region
sales_results = pd.DataFrame({
    'year': [2019, 2020, 2021, 2019, 2020, 2021, 
    2019, 2020, 2021, 2019, 2020, 2021],
    'region': ['North', 'North', 'North', 'South', 'South', 'South', 
    'East', 'East', 'East', 'West', 'West', 'West'],
    'sales': [1000, 1200, 1500, 800, 900, 1000, 
    700, 800, 900, 600, 700, 800]
})

sales_results
year region sales
0 2019 North 1000
1 2020 North 1200
2 2021 North 1500
3 2019 South 800
4 2020 South 900
5 2021 South 1000
6 2019 East 700
7 2020 East 800
8 2021 East 900
9 2019 West 600
10 2020 West 700
11 2021 West 800

Given these DataFrames, we might want to join them based on the region column. We could, for example, calculate the total sales per region and add that information to the sales_regions DataFrame. We can do this using the merge() method. Let’s see how this works.

# let's first calculate the total sales per region
sales_per_region = sales_results.groupby('region')['sales'].sum().reset_index()
sales_per_region
region sales
0 East 2400
1 North 3700
2 South 2700
3 West 2100

We used the groupby() method we learned earlier to calculate the total sales per region. We then used the reset_index() method to convert the grouped data back to a DataFrame. Now we can merge the sales_regions DataFrame with the sales_per_region DataFrame.

# merge the sales_regions DataFrame with the sales_per_region DataFrame
sales_regions.merge(sales_per_region, on='region', how='left')
region manager sales
0 North John 3700
1 South Sara 2700
2 East Tom 2400
3 West Alice 2100

If you have ever used SQL, you might have noticed that this feels very similar to a SQL join. The on parameter specifies the column to join on, and the how parameter specifies the type of join. The most common choices for the how parameter are left, right, inner, and outer; the default is inner.
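
To illustrate the difference, here is a small sketch using a hypothetical Central region that has a manager but no sales data:

# 'Central' has no match in sales_per_region
extra_region = pd.DataFrame({'region': ['Central'], 'manager': ['Eve']})
regions_extended = pd.concat([sales_regions, extra_region], ignore_index=True)

# an inner join drops the unmatched 'Central' row,
# while an outer join keeps it with NaN in the sales column
regions_extended.merge(sales_per_region, on='region', how='inner')
regions_extended.merge(sales_per_region, on='region', how='outer')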

There is more functionality when it comes to using merge for joining DataFrames, but for the time being we will leave it at this.

6.4.1 Concatenating DataFrames

Sometimes we have DataFrames with the same columns but different rows. In this case we might want to concatenate the DataFrames. Let’s assume we have sales data from a previous year that we want to add to the sales_results DataFrame.

sales_results_2018 = pd.DataFrame({
    'year': [2018, 2018, 2018, 2018],
    'region': ['North', 'South', 'East', 'West'],
    'sales': [500, 400, 300, 200]
})

sales_results_2018
year region sales
0 2018 North 500
1 2018 South 400
2 2018 East 300
3 2018 West 200

We can see that the format of this DataFrame is the same as the sales_results DataFrame. We can concatenate the two DataFrames using the concat() method, which takes a list of DataFrames as input.

# concatenate the sales results from different years
sales_all = pd.concat([sales_results_2018, sales_results])
sales_all
year region sales
0 2018 North 500
1 2018 South 400
2 2018 East 300
3 2018 West 200
0 2019 North 1000
1 2020 North 1200
2 2021 North 1500
3 2019 South 800
4 2020 South 900
5 2021 South 1000
6 2019 East 700
7 2020 East 800
8 2021 East 900
9 2019 West 600
10 2020 West 700
11 2021 West 800

All looks good, except that the original index values were kept, so the index now contains duplicates. We can reset the index using the reset_index() method.

sales_all.reset_index(drop=True, inplace=True)
sales_all
year region sales
0 2018 North 500
1 2018 South 400
2 2018 East 300
3 2018 West 200
4 2019 North 1000
5 2020 North 1200
6 2021 North 1500
7 2019 South 800
8 2020 South 900
9 2021 South 1000
10 2019 East 700
11 2020 East 800
12 2021 East 900
13 2019 West 600
14 2020 West 700
15 2021 West 800

The drop=True parameter means that we want to get rid of the old index. The inplace=True parameter means that we want to modify the DataFrame in place, instead of creating a new DataFrame. This will make the index start from 0, and the changes are applied to the DataFrame.
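
Alternatively, concat() can renumber the rows for us in one step via the ignore_index parameter; a minimal sketch that gives the same result:

# concatenate and build a fresh 0-based index at the same time
sales_all = pd.concat([sales_results_2018, sales_results], ignore_index=True)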

6.4.2 Concatenating along columns

Sometimes we want to add new columns to the end of an existing DataFrame. We can do this using the concat() method with the axis parameter set to 1. Let’s look at an example.

sales_by_year = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    '2018': [500, 400, 300, 200],   
    '2019': [1000, 800, 700, 600],
    '2020': [1200, 900, 800, 700]
})

sales_by_year
region 2018 2019 2020
0 North 500 1000 1200
1 South 400 800 900
2 East 300 700 800
3 West 200 600 700
year_2021_2022 = pd.DataFrame({
    '2021': [1500, 1000, 900, 800],
    '2022': [1600, 1100, 1000, 900]
})

year_2021_2022
2021 2022
0 1500 1600
1 1000 1100
2 900 1000
3 800 900

We can see that the year_2021_2022 DataFrame is a natural extension to the sales_by_year DataFrame. Let’s combine the two.

sales_by_year = pd.concat([sales_by_year, year_2021_2022], axis=1)
sales_by_year
region 2018 2019 2020 2021 2022
0 North 500 1000 1200 1500 1600
1 South 400 800 900 1000 1100
2 East 300 700 800 900 1000
3 West 200 600 700 800 900

6.4.3 Joining DataFrames on the index values

So far we have joined DataFrames based on columns. However, we can also join DataFrames based on the index values. Let’s see how this works by creating two DataFrames with some overlapping index values.

df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]},
    index=[1, 2, 3])
df1
A B
1 1 4
2 2 5
3 3 6
df2 = pd.DataFrame({
    'C': [7, 8, 9],
    'D': [10, 11, 12]},
    index=[2, 3, 4])

df2
C D
2 7 10
3 8 11
4 9 12

We can now join the two DataFrames based on the index values using the join() method. Let’s take a look at a left join.

df1.join(df2, how='left')
A B C D
1 1 4 NaN NaN
2 2 5 7.0 10.0
3 3 6 8.0 11.0

We can see that df1 is the left DataFrame, which is preserved in the join operation, while df2 is the right DataFrame, which is joined to it based on the index values. If an index value from the left DataFrame has no match in the right DataFrame, the joined columns are filled with NaN.

A right join will preserve the right DataFrame.

df1.join(df2, how='right')
A B C D
2 2.0 5.0 7 10
3 3.0 6.0 8 11
4 NaN NaN 9 12
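
The how parameter of join() also accepts 'inner' and 'outer'. An outer join, for example, keeps the index values from both DataFrames; a small sketch:

# keep all index values from both DataFrames
df1.join(df2, how='outer')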

6.5 Reading in data

In the real world, we aren’t usually working with example datasets like the iris dataset. Instead, we are burdened with the task of reading in the data. Luckily, Pandas provides a variety of methods to read in data from different sources. Let’s take a look at probably the most common method: read_csv(). As the name suggests, this method reads in data from a CSV file. We are going to read in a dataset called pets, which is saved as a CSV file under a folder called data.

# read in the pets dataset
pets = pd.read_csv('data/pets.csv')
pets
Pet Name Age Owner
0 Parrot Polly 15 Cathy
1 Cat Fluffy 3 John
2 Dog Spot 5 Amanda
3 Guinea pig Squeeky 1 Mark

We can see that the read_csv() method reads in the data from the CSV file and creates a DataFrame. As mentioned earlier, Pandas provides several methods for reading in many file types such as Excel files, SQL databases, and more. You can browse the available methods by typing pd.read_ and pressing Tab for autocomplete suggestions. For saving data to a file you can use the to_csv() method. Let’s save the pets DataFrame to a CSV file.

# save the pets DataFrame to a CSV file
pets.to_csv('data/pets_saved.csv', index=False)

Here we used the index=False parameter to avoid saving the index values to the CSV file. This is usually preferred, unless you have a specific reason for creating an extra column for the index values in the saved file. Next we will turn our attention to visualizing data, which is one of the most important aspects of data analysis.