6 Working with DataFrames
Now that we have learned how to create DataFrames and perform some basic operations, let’s deepen our understanding with a few more tricks that help us work with DataFrames more efficiently. We’ll start off by importing the necessary libraries.
import numpy as np
import pandas as pd
Additionally, we will use a sample dataset to demonstrate the operations. Let’s load the dataset from the seaborn library.
Seaborn is a commonly used library for data visualization in Python. It comes with a few sample datasets that we can use for practice. We’ll use the classic iris
dataset for this tutorial. We’ll learn more about Seaborn in the upcoming chapters.
import seaborn as sns
# Load the dataset
df = sns.load_dataset('iris')
df.head()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
As we can see, the dataset contains four features: sepal_length
, sepal_width
, petal_length
, and petal_width
. The target variable is species
, which is the type of iris flower. We also used the head()
method to display the first five rows of the dataset. Let’s learn more about the dataset using some common methods found in Pandas.
6.1 Common methods in Pandas
We already saw that the head()
method displays the first five rows of the dataset. Five is the default value, but we can change it using the n
parameter. Similarly, we can use the tail()
method to display the last rows of the dataset.
# Display the last nine rows
df.tail(9)
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
141 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
142 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
143 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
144 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
6.1.1 General information about the dataset
The info()
method provides a concise summary of the dataset. It displays the number of non-null values in each column, the data type of each column, and the memory usage of the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Summary statistics can be obtained using the describe()
method. It provides the count, mean, standard deviation, minimum, maximum, and the quartiles of the dataset. This provides a handy way of getting an overall impression of the dataset.
df.describe()
sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
If we just want to know the number of rows and columns in the dataset, we can use the shape
attribute.
df.shape
(150, 5)
The columns
attribute provides the names of the columns in the dataset.
df.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
6.1.2 Unique values in the dataset
There are a few methods that help us get unique values in the dataset. The unique()
method provides the unique values in a column. Let’s see how this works by getting the unique values in the species
column.
unique_species = df['species'].unique()
unique_species
array(['setosa', 'versicolor', 'virginica'], dtype=object)
We can see that the dataset contains three unique species of iris flowers. This is pretty simple to see directly, but if we had more than three species, we could also check the number of unique values like this:
len(unique_species)
3
There is actually a dedicated method for obtaining the number of unique values. The nunique()
method provides the number of unique values in each column of the dataset.
df.nunique()
sepal_length 35
sepal_width 23
petal_length 43
petal_width 22
species 3
dtype: int64
Finally, the value_counts()
method provides the frequency of each unique value in a column. Let’s see how this works by getting the frequency of each species in the species
column.
df['species'].value_counts()
species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
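If we are interested in relative frequencies instead of raw counts, value_counts() also accepts a normalize parameter. A minimal sketch (output omitted; with this dataset each species makes up a third of the rows):
# proportions instead of counts
df['species'].value_counts(normalize=True)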
6.1.3 Sorting values
The sort_values()
method allows us to sort the dataset based on one or more columns. Let’s see how this works by sorting the dataset based on the sepal_length
column.
df.sort_values('sepal_length').head()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
42 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
38 | 4.4 | 3.0 | 1.3 | 0.2 | setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
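The sort_values() method can also sort in descending order or by several columns at once; a small sketch on the same data (output omitted):
# sort by sepal_length descending, breaking ties with sepal_width ascending
df.sort_values(['sepal_length', 'sepal_width'], ascending=[False, True]).head()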
6.1.4 Applying functions to columns
The apply()
method allows us to apply a function to columns in the dataset. For example, if we take a look at the first few values for the species
column, we can see that the values are strings. Let’s apply the upper()
method to the species
column to convert the strings to uppercase.
df['species'].apply(lambda x: x.upper()).head()
0 SETOSA
1 SETOSA
2 SETOSA
3 SETOSA
4 SETOSA
Name: species, dtype: object
For simple functions like finding the length of a string, we can use the len
function directly.
df['species'].apply(len).head()
0 6
1 6
2 6
3 6
4 6
Name: species, dtype: int64
The apply()
method coupled with lambda expressions grants us the ability to easily craft custom modifications on existing columns. This will be useful when we need to clean or transform data.
# Make species uppercase and add a suffix to the end
df['species'].apply(lambda x: x.upper() + '_SPECIES').head()
0 SETOSA_SPECIES
1 SETOSA_SPECIES
2 SETOSA_SPECIES
3 SETOSA_SPECIES
4 SETOSA_SPECIES
Name: species, dtype: object
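Lambdas are convenient for one-liners, but apply() works just as well with a regular named function, which can be easier to read for longer transformations. A minimal sketch (output omitted):
# a named function passed to apply(): keep the first three letters of each species name
def shorten(name):
    return name[:3]

df['species'].apply(shorten).head()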
6.2 Grouping data
If you are familiar with SQL queries, you might have used the GROUP BY
clause to group data based on a particular column. Pandas provides a similar functionality using the groupby()
method. Grouping data allows you to calculate statistics, such as the mean, for distinct groups in the dataset. Let’s take a look at how this works by grouping the iris
data based on the species
column and calculating the mean of the other columns.
df.groupby('species').mean()
sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|
species | ||||
setosa | 5.006 | 3.428 | 1.462 | 0.246 |
versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
virginica | 6.588 | 2.974 | 5.552 | 2.026 |
We can also target specific columns. Let’s see how this works by taking the sum of the sepal_length
for each species.
df.groupby('species')['sepal_length'].sum()
species
setosa 250.3
versicolor 296.8
virginica 329.4
Name: sepal_length, dtype: float64
Some commonly used aggregation functions to chain after a groupby clause are:
mean()
sum()
count()
min()
max()
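For example, chaining max() after the groupby gives the largest value of each numeric column per species; a minimal sketch (output omitted):
# largest value of each numeric column per species
df.groupby('species').max()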
We can also use the agg()
method to apply multiple aggregation functions at once. Let’s see how this works by taking the count, min, and max of the sepal_length
and sepal_width
columns for each species.
df.groupby('species')[['sepal_length', 'sepal_width']].agg(['count', 'min', 'max'])
sepal_length | sepal_width | |||||
---|---|---|---|---|---|---|
count | min | max | count | min | max | |
species | ||||||
setosa | 50 | 4.3 | 5.8 | 50 | 2.3 | 4.4 |
versicolor | 50 | 4.9 | 7.0 | 50 | 2.0 | 3.4 |
virginica | 50 | 4.9 | 7.9 | 50 | 2.2 | 3.8 |
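The agg() method also supports named aggregations, where the keyword you choose becomes the name of the output column; a sketch on the same grouping (the output column names here are made up for illustration):
# named aggregations: the keyword becomes the output column name
df.groupby('species').agg(
    mean_petal_length=('petal_length', 'mean'),
    max_sepal_width=('sepal_width', 'max')
)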
If you want to get summary statistics per group, you can also use the describe()
method.
# let's get the summary statistics for all columns
# and show results for the petal_length column
df.groupby('species').describe()['petal_length']
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
species | ||||||||
setosa | 50.0 | 1.462 | 0.173664 | 1.0 | 1.4 | 1.50 | 1.575 | 1.9 |
versicolor | 50.0 | 4.260 | 0.469911 | 3.0 | 4.0 | 4.35 | 4.600 | 5.1 |
virginica | 50.0 | 5.552 | 0.551895 | 4.5 | 5.1 | 5.55 | 5.875 | 6.9 |
6.3 Handling missing values
In a perfect world, we would always work with complete data tables without any missing values. However, out in the wild we are often confronted with incomplete data, which is why dealing with missing values is an important skill to have. Pandas provides a few methods to handle missing values.
Let’s look at our iris dataset and introduce some missing values.
# introduce one missing value in the sepal_length column (row with index 2)
df.loc[2, 'sepal_length'] = np.nan
df.head()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | NaN | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
There are a couple of methods which help us determine if missing values are present. The following methods are commonly used:
isnull(): returns a DataFrame of the same shape as the original dataset, where each cell is either True or False depending on whether the value is missing or not.
notnull(): returns the opposite of isnull(). It returns True if the value is not missing, and False otherwise.
isna(): an alias for isnull().
notna(): an alias for notnull().
The isnull()
method returns a DataFrame of the same shape as the original dataset, where each cell is either True
or False
depending on whether the value is missing or not.
df.head().isnull()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | False | False | False | False | False |
1 | False | False | False | False | False |
2 | True | False | False | False | False |
3 | False | False | False | False | False |
4 | False | False | False | False | False |
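In practice, isnull() is often chained with sum() to count the missing values in each column, since True counts as 1 when summing. A minimal sketch (output omitted; only sepal_length should report one missing value at this point):
# number of missing values per column
df.isnull().sum()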
The isna()
method does exactly the same thing as isnull()
, which we can see in the example below.
df.head().isna()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | False | False | False | False | False |
1 | False | False | False | False | False |
2 | True | False | False | False | False |
3 | False | False | False | False | False |
4 | False | False | False | False | False |
The notnull()
method returns the opposite of isnull()
. It returns True
if the value is not missing, and False
otherwise.
df.head()['sepal_length'].notnull()
0 True
1 True
2 False
3 True
4 True
Name: sepal_length, dtype: bool
When missing values are present, we can easily drop the affected rows using the dropna()
method. By default, it drops rows where at least one element is missing.
df.dropna().head()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
The dropna()
method also has a parameter how
which can be set to all
. This will only drop rows where all elements are missing. It is also possible to drop columns with missing values by setting the axis
parameter to 1
. Let’s see how this works.
# drop columns with missing values
df.dropna(axis=1).head()
sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|
0 | 3.5 | 1.4 | 0.2 | setosa |
1 | 3.0 | 1.4 | 0.2 | setosa |
2 | 3.2 | 1.3 | 0.2 | setosa |
3 | 3.1 | 1.5 | 0.2 | setosa |
4 | 3.6 | 1.4 | 0.2 | setosa |
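The how='all' variant mentioned above would look like the sketch below; with our single missing value it drops nothing, since no row is entirely empty (output omitted):
# only drop rows where every value is missing
df.dropna(how='all').head()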
6.3.1 Filling missing values
Instead of dropping rows with missing values, we can also fill them with a specific value. This is often referred to as imputation. The fillna()
method allows us to fill missing values with a specific value. Let’s fill the missing values in the sepal_length
column with the mean of the column.
# use the mean of the sepal_length column to fill in missing values
df['sepal_length'] = df['sepal_length'].fillna(df['sepal_length'].mean())
df.head()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.100000 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.900000 | 3.0 | 1.4 | 0.2 | setosa |
2 | 5.851007 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.600000 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.000000 | 3.6 | 1.4 | 0.2 | setosa |
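The mean is only one possible strategy. fillna() also accepts a dictionary that maps column names to fill values, which is handy when different columns need different defaults; a minimal sketch (the fill values here are arbitrary and purely for illustration):
# per-column fill values; the placeholders are arbitrary
df.fillna({'sepal_length': 0.0, 'species': 'unknown'}).head()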
6.3.2 Missing values for other data types
So far, we have only looked at missing values in numerical columns. However, missing values can also occur in categorical columns. Let’s introduce a missing value in the species
column.
# introduce a missing value in the species column
df.loc[2, 'species'] = np.nan
df.head()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.100000 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.900000 | 3.0 | 1.4 | 0.2 | setosa |
2 | 5.851007 | 3.2 | 1.3 | 0.2 | NaN |
3 | 4.600000 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.000000 | 3.6 | 1.4 | 0.2 | setosa |
Let’s inspect the datatypes of the columns in the dataset.
df.dtypes
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species object
dtype: object
We can now try to impute the most common value in the species
column for the missing value. Before we do that, let’s see how to use the mode()
method to get the most common value in a column.
df['species'].mode()
0 versicolor
1 virginica
Name: species, dtype: object
We can see that mode() returns a Series object. To get the actual value we can use iloc, or simply index the result with mode()[0]. Let’s use this to fill the missing value in the species
column.
# fill missing values in the species column with the most common value
df['species'] = df['species'].fillna(df['species'].mode()[0])
df.head()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.100000 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.900000 | 3.0 | 1.4 | 0.2 | setosa |
2 | 5.851007 | 3.2 | 1.3 | 0.2 | versicolor |
3 | 4.600000 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.000000 | 3.6 | 1.4 | 0.2 | setosa |
Obviously, here we have imputed the incorrect species in place of the missing value. In practice, you would need to be more careful and consider the context of the data before imputing missing values. However, this example shows how you can impute missing values in categorical columns. There is more to learn about missing values, but for now, we will leave it at this.
6.4 Joining DataFrames
Quite often we find ourselves working with multiple datasets that we need to combine. Pandas provides a few methods to join DataFrames. Let’s start by creating two DataFrames to demonstrate how this works.
sales_regions = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    'manager': ['John', 'Sara', 'Tom', 'Alice']
})
sales_regions
region | manager | |
---|---|---|
0 | North | John |
1 | South | Sara |
2 | East | Tom |
3 | West | Alice |
# sales for three years by region
sales_results = pd.DataFrame({
    'year': [2019, 2020, 2021, 2019, 2020, 2021,
             2019, 2020, 2021, 2019, 2020, 2021],
    'region': ['North', 'North', 'North', 'South', 'South', 'South',
               'East', 'East', 'East', 'West', 'West', 'West'],
    'sales': [1000, 1200, 1500, 800, 900, 1000,
              700, 800, 900, 600, 700, 800]
})
sales_results
year | region | sales | |
---|---|---|---|
0 | 2019 | North | 1000 |
1 | 2020 | North | 1200 |
2 | 2021 | North | 1500 |
3 | 2019 | South | 800 |
4 | 2020 | South | 900 |
5 | 2021 | South | 1000 |
6 | 2019 | East | 700 |
7 | 2020 | East | 800 |
8 | 2021 | East | 900 |
9 | 2019 | West | 600 |
10 | 2020 | West | 700 |
11 | 2021 | West | 800 |
Given these DataFrames, we might want to join them based on the region
column. We could for example calculate the total sales per region, and add the information to the sales_regions
DataFrame. We can do this using the merge()
method. Let’s see how this works.
# let's first calculate the total sales per region
sales_per_region = sales_results.groupby('region')['sales'].sum().reset_index()
sales_per_region
region | sales | |
---|---|---|
0 | East | 2400 |
1 | North | 3700 |
2 | South | 2700 |
3 | West | 2100 |
We used the groupby()
method we learned earlier to calculate the total sales per region. We then used the reset_index()
method to convert the grouped data back to a DataFrame. Now we can merge the sales_regions
DataFrame with the sales_per_region
DataFrame.
# merge the sales_regions DataFrame with the sales_per_region DataFrame
sales_regions.merge(sales_per_region, on='region', how='left')
region | manager | sales | |
---|---|---|---|
0 | North | John | 3700 |
1 | South | Sara | 2700 |
2 | East | Tom | 2400 |
3 | West | Alice | 2100 |
If you have ever used SQL, you might have noticed that this feels very similar to a SQL join. The on
parameter specifies the column to join on, and the how
parameter specifies the type of join. Most common choices for the how
parameter are probably left
, right
, inner
, and outer
. The default value is inner
.
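For instance, an outer join keeps all regions from both DataFrames and fills missing matches with NaN; a minimal sketch using the same DataFrames (with this data every region matches, so the result equals the left join above):
# keep all rows from both DataFrames
sales_regions.merge(sales_per_region, on='region', how='outer')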
There is more functionality when it comes to using merge
for joining DataFrames, but for the time being we will leave it at this.
6.4.1 Concatenating DataFrames
Now sometimes we have DataFrames where the content is different but the columns are the same. In this case we might want to concatenate the DataFrames. Let’s assume we have sales data from previous years that we want to add to the sales_results
DataFrame.
sales_results_2018 = pd.DataFrame({
    'year': [2018, 2018, 2018, 2018],
    'region': ['North', 'South', 'East', 'West'],
    'sales': [500, 400, 300, 200]
})
sales_results_2018
year | region | sales | |
---|---|---|---|
0 | 2018 | North | 500 |
1 | 2018 | South | 400 |
2 | 2018 | East | 300 |
3 | 2018 | West | 200 |
We can see that the format of this DataFrame is the same as the sales_results
DataFrame. We can concatenate the two DataFrames using the concat()
method, which takes a list of DataFrames as input.
# concatenate the sales results from different years
sales_all = pd.concat([sales_results_2018, sales_results])
sales_all
year | region | sales | |
---|---|---|---|
0 | 2018 | North | 500 |
1 | 2018 | South | 400 |
2 | 2018 | East | 300 |
3 | 2018 | West | 200 |
0 | 2019 | North | 1000 |
1 | 2020 | North | 1200 |
2 | 2021 | North | 1500 |
3 | 2019 | South | 800 |
4 | 2020 | South | 900 |
5 | 2021 | South | 1000 |
6 | 2019 | East | 700 |
7 | 2020 | East | 800 |
8 | 2021 | East | 900 |
9 | 2019 | West | 600 |
10 | 2020 | West | 700 |
11 | 2021 | West | 800 |
All looks good, but we can see that the index is not in order. We can reset the index using the reset_index()
method.
sales_all.reset_index(drop=True, inplace=True)
sales_all
year | region | sales | |
---|---|---|---|
0 | 2018 | North | 500 |
1 | 2018 | South | 400 |
2 | 2018 | East | 300 |
3 | 2018 | West | 200 |
4 | 2019 | North | 1000 |
5 | 2020 | North | 1200 |
6 | 2021 | North | 1500 |
7 | 2019 | South | 800 |
8 | 2020 | South | 900 |
9 | 2021 | South | 1000 |
10 | 2019 | East | 700 |
11 | 2020 | East | 800 |
12 | 2021 | East | 900 |
13 | 2019 | West | 600 |
14 | 2020 | West | 700 |
15 | 2021 | West | 800 |
The drop=True
parameter means that we want to get rid of the old index. The inplace=True
parameter means that we want to modify the DataFrame in place, instead of creating a new DataFrame. This will make the index start from 0, and the changes are applied to the DataFrame.
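As an aside, the same result can be reached in one step by letting concat() renumber the rows during concatenation; a minimal sketch (output omitted):
# ignore_index=True builds a fresh 0..n-1 index while concatenating
pd.concat([sales_results_2018, sales_results], ignore_index=True)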
6.4.2 Concatenating along columns
Sometimes we want to add new columns to the end of an existing DataFrame. We can do this using the concat()
method with the axis
parameter set to 1. Let’s look at an example.
sales_by_year = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    '2018': [500, 400, 300, 200],
    '2019': [1000, 800, 700, 600],
    '2020': [1200, 900, 800, 700]
})
sales_by_year
region | 2018 | 2019 | 2020 | |
---|---|---|---|---|
0 | North | 500 | 1000 | 1200 |
1 | South | 400 | 800 | 900 |
2 | East | 300 | 700 | 800 |
3 | West | 200 | 600 | 700 |
year_2021_2022 = pd.DataFrame({
    '2021': [1500, 1000, 900, 800],
    '2022': [1600, 1100, 1000, 900]
})
year_2021_2022
2021 | 2022 | |
---|---|---|
0 | 1500 | 1600 |
1 | 1000 | 1100 |
2 | 900 | 1000 |
3 | 800 | 900 |
We can see that the year_2021_2022
DataFrame is a natural extension to the sales_by_year
DataFrame. Let’s combine the two.
sales_by_year = pd.concat([sales_by_year, year_2021_2022], axis=1)
sales_by_year
region | 2018 | 2019 | 2020 | 2021 | 2022 | |
---|---|---|---|---|---|---|
0 | North | 500 | 1000 | 1200 | 1500 | 1600 |
1 | South | 400 | 800 | 900 | 1000 | 1100 |
2 | East | 300 | 700 | 800 | 900 | 1000 |
3 | West | 200 | 600 | 700 | 800 | 900 |
6.4.3 Joining DataFrames on the index values
So far we have joined DataFrames based on columns. However, we can also join DataFrames based on the index values. Let’s see how this works by creating two DataFrames with some overlapping index values.
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]},
    index=[1, 2, 3])
df1
A | B | |
---|---|---|
1 | 1 | 4 |
2 | 2 | 5 |
3 | 3 | 6 |
df2 = pd.DataFrame({
    'C': [7, 8, 9],
    'D': [10, 11, 12]},
    index=[2, 3, 4])
df2
C | D | |
---|---|---|
2 | 7 | 10 |
3 | 8 | 11 |
4 | 9 | 12 |
We can now join the two DataFrames based on the index values using the join()
method. Let’s take a look at a left join.
df1.join(df2, how='left')
A | B | C | D | |
---|---|---|---|---|
1 | 1 | 4 | NaN | NaN |
2 | 2 | 5 | 7.0 | 10.0 |
3 | 3 | 6 | 8.0 | 11.0 |
We can see that the df1
DataFrame is the left DataFrame, which will be preserved in the join operation. The df2
DataFrame is the right DataFrame, which will be joined to the left DataFrame based on the index values. If a value is missing in the right DataFrame, it will be filled with NaN
.
A right join will preserve the right DataFrame.
df1.join(df2, how='right')
A | B | C | D | |
---|---|---|---|---|
2 | 2.0 | 5.0 | 7 | 10 |
3 | 3.0 | 6.0 | 8 | 11 |
4 | NaN | NaN | 9 | 12 |
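The how parameter of join() accepts the same common options as merge(). A minimal sketch of the remaining two (output omitted): an inner join keeps only the index values present in both DataFrames, while an outer join keeps all of them and fills the gaps with NaN.
# keep only index values 2 and 3, which appear in both DataFrames
df1.join(df2, how='inner')
# keep index values 1 through 4
df1.join(df2, how='outer')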
6.5 Reading in data
In the real world, we aren’t usually working with example datasets like the iris
dataset. Instead, we are burdened with the task of reading in the data. Luckily, Pandas provides a variety of methods to read in data from different sources. Let’s take a look at probably the most common method: read_csv()
. As the name suggests, this method reads in data from a CSV file. We are going to read in a dataset called pets
, which is saved as a CSV file under a folder called data.
# read in the pets dataset
pets = pd.read_csv('data/pets.csv')
pets
Pet | Name | Age | Owner | |
---|---|---|---|---|
0 | Parrot | Polly | 15 | Cathy |
1 | Cat | Fluffy | 3 | John |
2 | Dog | Spot | 5 | Amanda |
3 | Guinea pig | Squeeky | 1 | Mark |
We can see that the read_csv()
method reads in the data from the CSV file and creates a DataFrame. As mentioned earlier, Pandas provides several methods for reading in many file types such as Excel files, SQL databases, and more. You can browse the available methods by typing pd.read_
and pressing Tab
for autocomplete suggestions. For saving data to a file you can use the to_csv()
method. Let’s save the pets
DataFrame to a CSV file.
# save the pets DataFrame to a CSV file
pets.to_csv('data/pets_saved.csv', index=False)
Here we used the index=False
parameter to avoid saving the index values to the CSV file. This is usually preferred, unless you have a specific reason for creating an extra column for the index values in the saved file. Next we will turn our attention to visualizing data, which is one of the most important aspects of data analysis.