6  Working with DataFrames

Now that we have learned how to create DataFrames and perform some basic operations, let’s deepen our understanding with a few more tricks that help us work with DataFrames more efficiently. We’ll start off by importing the necessary libraries.

import numpy as np
import pandas as pd

Additionally, we will use a sample dataset to demonstrate the operations. Let’s load the dataset from the seaborn library.

Note

Seaborn is a commonly used library for data visualization in Python. It comes with a few sample datasets that we can use for practice. We’ll use the classic iris dataset for this tutorial. We’ll learn more about Seaborn in the upcoming chapters.

import seaborn as sns

# Load the dataset
df = sns.load_dataset('iris')
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

As we can see, the dataset contains four features: sepal_length, sepal_width, petal_length, and petal_width. The target variable is species, which is the type of iris flower. We also used the head() method to display the first five rows of the dataset. Let’s learn more about the dataset using some common methods found in Pandas.

6.1 Common methods in Pandas

We already saw that the head() method displays the first five rows of the dataset. Five is the default value, but we can change it using the n parameter, as the short sketch below shows. Similarly, we can use the tail() method to display the last rows of the dataset.
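
As a quick sketch, passing a different value for n asks for only that many rows; the same parameter works for tail().

# display only the first three rows
df.head(3)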

# Display the last nine rows
df.tail(9)
sepal_length sepal_width petal_length petal_width species
141 6.9 3.1 5.1 2.3 virginica
142 5.8 2.7 5.1 1.9 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

6.1.1 General information about the dataset

The info() method provides a concise summary of the dataset. It displays the number of non-null values in each column, the data type of each column, and the memory usage of the dataset.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Summary statistics can be obtained using the describe() method. It provides the count, mean, standard deviation, minimum, maximum, and the quartiles of the dataset. This provides a handy way of getting an overall impression of the dataset.

df.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

If we just want to know the number of rows and columns in the dataset, we can use the shape attribute.

df.shape
(150, 5)
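
Since shape is a regular Python tuple, we can also unpack it directly if we only need one of the values; a small sketch:

# unpack the number of rows and columns
n_rows, n_cols = df.shape
print(n_rows, n_cols)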

The columns attribute provides the names of the columns in the dataset.

df.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

6.1.2 Unique values in the dataset

There are a few methods that help us get unique values in the dataset. The unique() method provides the unique values in a column. Let’s see how this works by getting the unique values in the species column.

unique_species = df['species'].unique()
unique_species
array(['setosa', 'versicolor', 'virginica'], dtype=object)

We can see that the dataset contains three unique species of iris flowers. This is easy to see directly, but if there were many more species, we could also check the number of unique values like this:

len(unique_species)
3

There is actually a dedicated method for obtaining the number of unique values. The nunique() method provides the number of unique values in each column of the dataset.

df.nunique()
sepal_length    35
sepal_width     23
petal_length    43
petal_width     22
species          3
dtype: int64

Finally, the value_counts() method provides the frequency of each unique value in a column. Let’s see how this works by getting the frequency of each species in the species column.

df['species'].value_counts()
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

6.1.3 Sorting values

The sort_values() method allows us to sort the dataset based on one or more columns. Let’s see how this works by sorting the dataset based on the sepal_length column.

df.sort_values('sepal_length').head()
sepal_length sepal_width petal_length petal_width species
13 4.3 3.0 1.1 0.1 setosa
42 4.4 3.2 1.3 0.2 setosa
38 4.4 3.0 1.3 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
41 4.5 2.3 1.3 0.3 setosa
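
sort_values() also accepts a list of columns and an ascending flag, so we can mix sort directions; a minimal sketch:

# sort by species alphabetically, then by sepal_length from largest to smallest
df.sort_values(['species', 'sepal_length'], ascending=[True, False]).head()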

6.1.4 Applying functions to columns

The apply() method allows us to apply a function to columns in the dataset. For example, if we take a look at the first few values for the species column, we can see that the values are strings. Let’s apply the upper() method to the species column to convert the strings to uppercase.

df['species'].apply(lambda x: x.upper()).head()
0    SETOSA
1    SETOSA
2    SETOSA
3    SETOSA
4    SETOSA
Name: species, dtype: object

For simple functions like finding the length of a string, we can use the len function directly.

df['species'].apply(len).head()
0    6
1    6
2    6
3    6
4    6
Name: species, dtype: int64

The apply() method coupled with lambda expressions grants us the ability to easily craft custom modifications on existing columns. This will be useful when we need to clean or transform data.

# Make species uppercase and add a suffix to the end
df['species'].apply(lambda x: x.upper() + '_SPECIES').head()
0    SETOSA_SPECIES
1    SETOSA_SPECIES
2    SETOSA_SPECIES
3    SETOSA_SPECIES
4    SETOSA_SPECIES
Name: species, dtype: object

6.2 Grouping data

If you are familiar with SQL queries, you might have used the GROUP BY clause to group data based on a particular column. Pandas provides similar functionality with the groupby() method. Grouping data allows you to calculate statistics, such as the mean, for distinct groups in the dataset. Let’s take a look at how this works by grouping the iris data based on the species column and calculating the mean of the other columns.

df.groupby('species').mean()
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026

We can also target specific columns. Let’s see how this works by taking the sum of the sepal_length for each species.

df.groupby('species')['sepal_length'].sum()
species
setosa        250.3
versicolor    296.8
virginica     329.4
Name: sepal_length, dtype: float64

Some commonly used aggregation functions to chain after a groupby clause are listed below (a short sketch follows the list):

  • mean()
  • sum()
  • count()
  • min()
  • max()
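
For instance, the largest petal_length per species could be obtained like this (a minimal sketch):

# maximum petal_length for each species
df.groupby('species')['petal_length'].max()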

We can also use the agg() method to apply multiple aggregation functions at once. Let’s try this by taking the count, min, and max of the sepal_length and sepal_width columns for each species.

df.groupby('species')[['sepal_length', 'sepal_width']].agg(['count', 'min', 'max'])
sepal_length sepal_width
count min max count min max
species
setosa 50 4.3 5.8 50 2.3 4.4
versicolor 50 4.9 7.0 50 2.0 3.4
virginica 50 4.9 7.9 50 2.2 3.8

If you want to get summary statistics per group, you can also use the describe() method.

# let's get the summary statistics for all columns
# and show results for the petal_length column
df.groupby('species').describe()['petal_length']
count mean std min 25% 50% 75% max
species
setosa 50.0 1.462 0.173664 1.0 1.4 1.50 1.575 1.9
versicolor 50.0 4.260 0.469911 3.0 4.0 4.35 4.600 5.1
virginica 50.0 5.552 0.551895 4.5 5.1 5.55 5.875 6.9

6.3 Handling missing values

In a perfect world, we would always work with complete data tables without any missing values. However, out in the wild we are often confronted with incomplete data, which is why dealing with missing values is an important skill to have. Pandas provides a few methods to handle missing values.

Let’s look at our iris dataset and introduce some missing values.

# introduce a missing value in the third row (index 2) of the sepal_length column
df.loc[2, 'sepal_length'] = np.nan
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 NaN 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

There are a couple of methods which help us determine if missing values are present. The following methods are commonly used:

  • isnull(): returns a DataFrame of the same shape as the original dataset, where each cell is either True or False depending on whether the value is missing or not.
  • notnull(): returns the opposite of isnull(). It returns True if the value is not missing, and False otherwise.
  • isna(): an alias for isnull().
  • notna(): an alias for notnull().

Let’s start with the isnull() method:

df.head().isnull()
sepal_length sepal_width petal_length petal_width species
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False False False
4 False False False False False

The isna() method does exactly the same thing as isnull(), which we can see in the example below.

df.head().isna()
sepal_length sepal_width petal_length petal_width species
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False False False
4 False False False False False

The notnull() method works the other way around; applied to the sepal_length column, it returns False only for the missing value:

df.head()['sepal_length'].notnull()
0     True
1     True
2    False
3     True
4     True
Name: sepal_length, dtype: bool

To get rid of missing values, we can drop the rows that contain them using the dropna() method. By default, it drops rows where at least one element is missing.

df.dropna().head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa

The dropna() method also has a how parameter, which can be set to 'all' to drop only rows where all elements are missing (a short sketch of this follows the next example). It is also possible to drop columns with missing values by setting the axis parameter to 1. Let’s see how this works.

# drop columns with missing values
df.dropna(axis=1).head()
sepal_width petal_length petal_width species
0 3.5 1.4 0.2 setosa
1 3.0 1.4 0.2 setosa
2 3.2 1.3 0.2 setosa
3 3.1 1.5 0.2 setosa
4 3.6 1.4 0.2 setosa
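
For completeness, here is a sketch of the how='all' variant mentioned above; in our dataset no row is entirely empty, so nothing would actually be dropped:

# only drop rows where every value is missing
df.dropna(how='all').head()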

6.3.1 Filling missing values

Instead of dropping rows with missing values, we can also fill them in, which is often referred to as imputation. The fillna() method allows us to replace missing values with a specific value. Let’s fill the missing value in the sepal_length column with the mean of the column.

# use the mean of sepal_length column to fill in for missing values
df['sepal_length'] = df['sepal_length'].fillna(df['sepal_length'].mean())
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.100000 3.5 1.4 0.2 setosa
1 4.900000 3.0 1.4 0.2 setosa
2 5.851007 3.2 1.3 0.2 setosa
3 4.600000 3.1 1.5 0.2 setosa
4 5.000000 3.6 1.4 0.2 setosa

6.3.2 Missing values for other data types

So far, we have only looked at missing values in numerical columns. However, missing values can also occur in categorical columns. Let’s introduce a missing value in the species column.

# introduce a missing value in the species column
df.loc[2, 'species'] = np.nan
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.100000 3.5 1.4 0.2 setosa
1 4.900000 3.0 1.4 0.2 setosa
2 5.851007 3.2 1.3 0.2 NaN
3 4.600000 3.1 1.5 0.2 setosa
4 5.000000 3.6 1.4 0.2 setosa

Let’s inspect the datatypes of the columns in the dataset.

df.dtypes
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

We can now try to impute the most common value in the species column for the missing value. Before we do that, let’s see how the mode() method returns the most common value(s) in a column.

df['species'].mode()
0    versicolor
1     virginica
Name: species, dtype: object

We can see that mode() returns a Series object; here it contains two values, because after removing one setosa label, versicolor and virginica are tied as the most common species with 50 occurrences each. To get a single value we can use iloc, or simply index the result with mode()[0]. Let’s use this to fill the missing value in the species column.

# fill missing values in the species column with the most common value
df['species'] = df['species'].fillna(df['species'].mode()[0])
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.100000 3.5 1.4 0.2 setosa
1 4.900000 3.0 1.4 0.2 setosa
2 5.851007 3.2 1.3 0.2 versicolor
3 4.600000 3.1 1.5 0.2 setosa
4 5.000000 3.6 1.4 0.2 setosa

Obviously, here we have imputed the incorrect species in place of the missing value. In practice, you would need to be more careful and consider the context of the data before imputing missing values. However, this example shows how you can impute missing values in categorical columns. There is more to learn about missing values, but for now, we will leave it at this.

6.4 Joining DataFrames

Quite often we find ourselves working with multiple datasets that we need to combine. Pandas provides a few methods to join DataFrames. Let’s start by creating two DataFrames to demonstrate how this works.

sales_regions = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    'manager': ['John', 'Sara', 'Tom', 'Alice']
})

sales_regions
region manager
0 North John
1 South Sara
2 East Tom
3 West Alice
# sales for three years by region
sales_results = pd.DataFrame({
    'year': [2019, 2020, 2021, 2019, 2020, 2021, 
    2019, 2020, 2021, 2019, 2020, 2021],
    'region': ['North', 'North', 'North', 'South', 'South', 'South', 
    'East', 'East', 'East', 'West', 'West', 'West'],
    'sales': [1000, 1200, 1500, 800, 900, 1000, 
    700, 800, 900, 600, 700, 800]
})

sales_results
year region sales
0 2019 North 1000
1 2020 North 1200
2 2021 North 1500
3 2019 South 800
4 2020 South 900
5 2021 South 1000
6 2019 East 700
7 2020 East 800
8 2021 East 900
9 2019 West 600
10 2020 West 700
11 2021 West 800

Given these DataFrames, we might want to join them based on the region column. We could, for example, calculate the total sales per region and add that information to the sales_regions DataFrame. We can do this using the merge() method. Let’s see how this works.

# let's first calculate the total sales per region
sales_per_region = sales_results.groupby('region')['sales'].sum().reset_index()
sales_per_region
region sales
0 East 2400
1 North 3700
2 South 2700
3 West 2100

We used the groupby() method we learned earlier to calculate the total sales per region. We then used the reset_index() method to convert the grouped data back to a DataFrame. Now we can merge the sales_regions DataFrame with the sales_per_region DataFrame.

# merge the sales_regions DataFrame with the sales_per_region DataFrame
sales_regions.merge(sales_per_region, on='region', how='left')
region manager sales
0 North John 3700
1 South Sara 2700
2 East Tom 2400
3 West Alice 2100

If you have ever used SQL, you might have noticed that this feels very similar to a SQL join. The on parameter specifies the column to join on, and the how parameter specifies the type of join. The most common choices for the how parameter are left, right, inner, and outer; the default is inner.
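
To illustrate the difference, here is a small sketch using a hypothetical Central region that has a manager but no sales data:

# 'Central' has no match in sales_per_region
extra_region = pd.DataFrame({'region': ['Central'], 'manager': ['Eve']})
regions_extended = pd.concat([sales_regions, extra_region], ignore_index=True)

# an inner join drops the unmatched 'Central' row,
# while an outer join keeps it with NaN in the sales column
regions_extended.merge(sales_per_region, on='region', how='inner')
regions_extended.merge(sales_per_region, on='region', how='outer')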

There is more functionality when it comes to using merge for joining DataFrames, but for the time being we will leave it at this.

6.4.1 Concatenating DataFrames

Sometimes we have DataFrames with the same columns but different rows. In this case we might want to concatenate the DataFrames. Let’s assume we have sales data from a previous year that we want to add to the sales_results DataFrame.

sales_results_2018 = pd.DataFrame({
    'year': [2018, 2018, 2018, 2018],
    'region': ['North', 'South', 'East', 'West'],
    'sales': [500, 400, 300, 200]
})

sales_results_2018
year region sales
0 2018 North 500
1 2018 South 400
2 2018 East 300
3 2018 West 200

We can see that the format of this DataFrame is the same as the sales_results DataFrame. We can concatenate the two DataFrames using the concat() method, which takes a list of DataFrames as input.

# concatenate the sales results from different years
sales_all = pd.concat([sales_results_2018, sales_results])
sales_all
year region sales
0 2018 North 500
1 2018 South 400
2 2018 East 300
3 2018 West 200
0 2019 North 1000
1 2020 North 1200
2 2021 North 1500
3 2019 South 800
4 2020 South 900
5 2021 South 1000
6 2019 East 700
7 2020 East 800
8 2021 East 900
9 2019 West 600
10 2020 West 700
11 2021 West 800

All looks good, except that the original index values were kept, so the index now contains duplicates. We can reset the index using the reset_index() method.

sales_all.reset_index(drop=True, inplace=True)
sales_all
year region sales
0 2018 North 500
1 2018 South 400
2 2018 East 300
3 2018 West 200
4 2019 North 1000
5 2020 North 1200
6 2021 North 1500
7 2019 South 800
8 2020 South 900
9 2021 South 1000
10 2019 East 700
11 2020 East 800
12 2021 East 900
13 2019 West 600
14 2020 West 700
15 2021 West 800

The drop=True parameter means that we want to get rid of the old index. The inplace=True parameter means that we want to modify the DataFrame in place, instead of creating a new DataFrame. This will make the index start from 0, and the changes are applied to the DataFrame.
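
Alternatively, concat() can renumber the rows for us in one step via the ignore_index parameter; a minimal sketch that gives the same result:

# concatenate and build a fresh 0-based index at the same time
sales_all = pd.concat([sales_results_2018, sales_results], ignore_index=True)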

6.4.2 Concatenating along columns

Sometimes we want to add new columns to the end of an existing DataFrame. We can do this using the concat() method with the axis parameter set to 1. Let’s look at an example.

sales_by_year = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    '2018': [500, 400, 300, 200],   
    '2019': [1000, 800, 700, 600],
    '2020': [1200, 900, 800, 700]
})

sales_by_year
region 2018 2019 2020
0 North 500 1000 1200
1 South 400 800 900
2 East 300 700 800
3 West 200 600 700
year_2021_2022 = pd.DataFrame({
    '2021': [1500, 1000, 900, 800],
    '2022': [1600, 1100, 1000, 900]
})

year_2021_2022
2021 2022
0 1500 1600
1 1000 1100
2 900 1000
3 800 900

We can see that the year_2021_2022 DataFrame is a natural extension to the sales_by_year DataFrame. Let’s combine the two.

sales_by_year = pd.concat([sales_by_year, year_2021_2022], axis=1)
sales_by_year
region 2018 2019 2020 2021 2022
0 North 500 1000 1200 1500 1600
1 South 400 800 900 1000 1100
2 East 300 700 800 900 1000
3 West 200 600 700 800 900

6.4.3 Joining DataFrames on the index values

So far we have joined DataFrames based on columns. However, we can also join DataFrames based on the index values. Let’s see how this works by creating two DataFrames with some overlapping index values.

df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]},
    index=[1, 2, 3])
df1
A B
1 1 4
2 2 5
3 3 6
df2 = pd.DataFrame({
    'C': [7, 8, 9],
    'D': [10, 11, 12]},
    index=[2, 3, 4])

df2
C D
2 7 10
3 8 11
4 9 12

We can now join the two DataFrames based on the index values using the join() method. Let’s take a look at a left join.

df1.join(df2, how='left')
A B C D
1 1 4 NaN NaN
2 2 5 7.0 10.0
3 3 6 8.0 11.0

We can see that df1 is the left DataFrame, which is preserved in the join operation, while df2 is the right DataFrame, which is joined to it based on the index values. If an index value from the left DataFrame has no match in the right DataFrame, the joined columns are filled with NaN.

A right join will preserve the right DataFrame.

df1.join(df2, how='right')
A B C D
2 2.0 5.0 7 10
3 3.0 6.0 8 11
4 NaN NaN 9 12
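
The how parameter of join() also accepts 'inner' and 'outer'. An outer join, for example, keeps the index values from both DataFrames; a small sketch:

# keep all index values from both DataFrames
df1.join(df2, how='outer')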

6.5 Reading in data

In the real world, we aren’t usually working with example datasets like the iris dataset. Instead, we are burdened with the task of reading in the data. Luckily, Pandas provides a variety of methods to read in data from different sources. Let’s take a look at probably the most common method: read_csv(). As the name suggests, this method reads in data from a CSV file. We are going to read in a dataset called pets, which is saved as a CSV file under a folder called data.

# read in the pets dataset
pets = pd.read_csv('data/pets.csv')
pets
Pet Name Age Owner
0 Parrot Polly 15 Cathy
1 Cat Fluffy 3 John
2 Dog Spot 5 Amanda
3 Guinea pig Squeeky 1 Mark

We can see that the read_csv() method reads in the data from the CSV file and creates a DataFrame. As mentioned earlier, Pandas provides several methods for reading in many file types such as Excel files, SQL databases, and more. You can browse the available methods by typing pd.read_ and pressing Tab for autocomplete suggestions. For saving data to a file you can use the to_csv() method. Let’s save the pets DataFrame to a CSV file.

# save the pets DataFrame to a CSV file
pets.to_csv('data/pets_saved.csv', index=False)

Here we used the index=False parameter to avoid saving the index values to the CSV file. This is usually preferred, unless you have a specific reason for creating an extra column for the index values in the saved file. Next we will turn our attention to visualizing data, which is one of the most important aspects of data analysis.