Pandas is a Python library for data manipulation and analysis. It provides data structures like Series and DataFrame that are designed to make working with structured data easy and efficient, which is why Pandas is so widely used in data science. It is built on top of NumPy, so many aspects should feel familiar. You can start by installing Pandas from the terminal using pip:
pip install pandas
Now let’s learn about the two main data structures in Pandas: Series and DataFrame.
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call the Series constructor:
# import numpy and pandas libraries
import numpy as np
import pandas as pd
# create a series from a list
pd.Series(data = [1, 2, 3, 4])
0 1
1 2
2 3
3 4
dtype: int64
We can see that the Series looks very much like a list or a NumPy array. The Series also has an index that can be used to access its elements, but the difference is that we can specify the index values for our Series:
# create a series with custom index
my_series = pd.Series(data = [1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
my_series
a 1
b 2
c 3
d 4
dtype: int64
We can access the elements of the series using the index values:
# access the elements of the series
my_series['a']
1
You can also use the dot notation:
# access the elements of the series using the dot notation
my_series.b
2
A list is not the only data type that can be used to create a Series. You can also use a dictionary:
# creating a series from a dictionary
my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
pd.Series(my_dict)
a 1
b 2
c 3
d 4
dtype: int64
You can also create a Series from a NumPy array:
# creating a series from a NumPy array
my_array = np.array([1, 2, 3, 4])
pd.Series(my_array)
0 1
1 2
2 3
3 4
dtype: int32
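Note that the dtype shown here (int32) depends on the platform and on the NumPy array's own dtype. If a specific dtype matters, it can be requested explicitly; a minimal sketch:

```python
import numpy as np
import pandas as pd

# the Series inherits the array's dtype, which may be int32 or int64
# depending on the platform
s_default = pd.Series(np.array([1, 2, 3, 4]))

# requesting int64 explicitly makes the result platform-independent
s_int64 = pd.Series(np.array([1, 2, 3, 4]), dtype='int64')
print(s_int64.dtype)  # int64
```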
We already saw the bracket + index and the dot + index notation in action. In addition to using the index labels, we can use the iloc attribute to access the elements of the series by position:
# access the first element of the series by using the positional index
my_series.iloc[0]
1
You can also use the loc attribute to access the elements of the series by their index. You can pass a list of index values to access multiple elements:
# access the elements of the series by their index
my_series.loc[['a', 'c']]
a 1
c 3
dtype: int64
We can also use the : operator to access a range of elements (note that, unlike positional slicing, label-based slices include the endpoint):
# access a range of elements
my_series.loc['a':'c']
a 1
b 2
c 3
dtype: int64
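The difference between label-based and positional slicing is worth seeing side by side: loc includes the stop label, while iloc follows the usual Python convention and excludes the stop position. A small sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# loc slices by label and INCLUDES the stop label 'c'
print(list(s.loc['a':'c']))  # [1, 2, 3]

# iloc slices by position and EXCLUDES the stop position,
# just like slicing a list: positions 0, 1, 2
print(list(s.iloc[0:3]))     # [1, 2, 3]
```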
Assigning values to the elements of the series is also possible and works in the same way as with NumPy arrays:
# assign a value to an element of the series
my_series['a'] = 100
my_series
a 100
b 2
c 3
d 4
dtype: int64
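Arithmetic on a Series also works element-wise, just as with NumPy arrays, and the index labels carry through to the result; a minimal sketch using a fresh example Series:

```python
import pandas as pd

# a small example Series with a custom index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# element-wise arithmetic returns a new Series with the same index
doubled = s * 2
print(list(doubled))        # [2, 4, 6, 8]
print(list(doubled.index))  # ['a', 'b', 'c', 'd']
```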
A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). You can think of it as being similar to a spreadsheet. It is generally the most commonly used Pandas object. Like the Series object we learned about earlier, the DataFrame also accepts many different kinds of input. Let’s see what a DataFrame looks like:
# create a DataFrame from a dictionary
data = {'name': ['John', 'Anna', 'Peter', 'Linda'],
        'age': [23, 36, 32, 45],
        'city': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
df
 | name | age | city |
---|---|---|---|
0 | John | 23 | New York |
1 | Anna | 36 | Paris |
2 | Peter | 32 | Berlin |
3 | Linda | 45 | London |
We can see that the DataFrame has a default index that starts from 0, and that the data is displayed in a tabular format which makes it easy to read. We can also specify the index values:
# create a DataFrame with custom index
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd'])
df
 | name | age | city |
---|---|---|---|
a | John | 23 | New York |
b | Anna | 36 | Paris |
c | Peter | 32 | Berlin |
d | Linda | 45 | London |
The columns are actually Pandas Series objects. We can access the columns of the DataFrame using the column name and the bracket notation. Let’s access the ‘name’ column of the DataFrame:
# access the first column of the DataFrame
df['name']
a John
b Anna
c Peter
d Linda
Name: name, dtype: object
Let’s inspect the type of the column:
type(df['name'])
pandas.core.series.Series
We can see that the column is a Pandas Series object. We can also access the columns using the dot notation:
df.name
a John
b Anna
c Peter
d Linda
Name: name, dtype: object
The issue we may run into with the dot notation is that it does not work if the column we are trying to access has the same name as a DataFrame attribute or method. For example, if we have a column named ‘count’, we cannot access it using the dot notation because count is a DataFrame method. Let’s add such a column to demonstrate:
# add count column to the DataFrame
df['count'] = [1, 2, 3, 4]
df
 | name | age | city | count |
---|---|---|---|---|
a | John | 23 | New York | 1 |
b | Anna | 36 | Paris | 2 |
c | Peter | 32 | Berlin | 3 |
d | Linda | 45 | London | 4 |
Instead, we can access the column using the loc attribute:
# access the columns of the DataFrame using the loc attribute
df.loc[:, 'count']
a 1
b 2
c 3
d 4
Name: count, dtype: int64
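To see the name clash directly: the dot notation resolves to the built-in count method rather than the column, while the bracket and loc notations are unambiguous. A small sketch with a hypothetical miniature DataFrame:

```python
import pandas as pd

# hypothetical two-row version of the chapter's DataFrame
people = pd.DataFrame({'name': ['John', 'Anna'], 'age': [23, 36]},
                      index=['a', 'b'])
people['count'] = [1, 2]

# dot notation resolves to the built-in count method, not our column
print(callable(people.count))         # True

# bracket and loc notation unambiguously return the column
print(list(people['count']))          # [1, 2]
print(list(people.loc[:, 'count']))   # [1, 2]
```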
As we saw above, we can create a new column by assigning a list to a new column name. We can also create a new column by using the existing columns:
# create a new column by using the existing columns
df['age_plus_count'] = df['age'] + df['count']
df
 | name | age | city | count | age_plus_count |
---|---|---|---|---|---|
a | John | 23 | New York | 1 | 24 |
b | Anna | 36 | Paris | 2 | 38 |
c | Peter | 32 | Berlin | 3 | 35 |
d | Linda | 45 | London | 4 | 49 |
If we now want to access all the new columns we created, we can use the bracket notation:
# access the new columns
df[['count', 'age_plus_count']]
 | count | age_plus_count |
---|---|---|
a | 1 | 24 |
b | 2 | 38 |
c | 3 | 35 |
d | 4 | 49 |
Now you might be wondering how to delete a column. We can use the drop method:
# deleting the new columns
df.drop(['count', 'age_plus_count'], axis=1)
 | name | age | city |
---|---|---|---|
a | John | 23 | New York |
b | Anna | 36 | Paris |
c | Peter | 32 | Berlin |
d | Linda | 45 | London |
This prints out the DataFrame without the columns. However, you should note that the drop method does not modify the original DataFrame by default. We can confirm this by printing the original DataFrame:
df
 | name | age | city | count | age_plus_count |
---|---|---|---|---|---|
a | John | 23 | New York | 1 | 24 |
b | Anna | 36 | Paris | 2 | 38 |
c | Peter | 32 | Berlin | 3 | 35 |
d | Linda | 45 | London | 4 | 49 |
If you want to remove the columns permanently, you need to use the inplace parameter:
# deleting the new columns permanently
df.drop(['count', 'age_plus_count'], axis=1, inplace=True)
df
 | name | age | city |
---|---|---|---|
a | John | 23 | New York |
b | Anna | 36 | Paris |
c | Peter | 32 | Berlin |
d | Linda | 45 | London |
This gives us the original DataFrame without the columns we deleted. You might have noticed how we specified axis=1 in the drop method. This is used to refer to dropping columns. Rows are dropped by specifying axis=0. The terminology comes from NumPy, where the first axis is the rows and the second axis is the columns.
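The axis convention in code, using a hypothetical two-row DataFrame for illustration:

```python
import pandas as pd

# hypothetical two-row DataFrame for illustration
people = pd.DataFrame({'name': ['John', 'Anna'], 'age': [23, 36]},
                      index=['a', 'b'])

# axis=0 (the default) drops by row label: one row remains
print(people.drop('a', axis=0).shape)    # (1, 2)

# axis=1 drops by column label: one column remains
print(people.drop('age', axis=1).shape)  # (2, 1)
```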
Ok, so we know how to access the columns of a DataFrame. How do we access the rows? There are a number of ways to do this. For example, we can use the loc attribute to access the rows of a DataFrame by their index:
# access the rows of the DataFrame by their index
row_a = df.loc['a']
row_a
name John
age 23
city New York
Name: a, dtype: object
If we inspect the type of row_a, we can see that it is also a Pandas Series object, just like the individual columns were:
type(row_a)
pandas.core.series.Series
How do we access multiple rows? We can use the loc attribute and pass a list of index values:
# access multiple rows
df.loc[['a', 'c']]
 | name | age | city |
---|---|---|---|
a | John | 23 | New York |
c | Peter | 32 | Berlin |
This, in turn, returns a DataFrame. We can also use the iloc attribute to access rows by their numerical index:
# get the last two rows of the DataFrame
df.iloc[2:]
 | name | age | city |
---|---|---|---|
c | Peter | 32 | Berlin |
d | Linda | 45 | London |
We can also use the slicing notation for DataFrames. For example, to get the first two rows of the DataFrame:
df[:2]
 | name | age | city |
---|---|---|---|
a | John | 23 | New York |
b | Anna | 36 | Paris |
Do you still remember the drop method we used to delete columns? As we already discussed, we can also use it to delete rows. Let’s delete the first row:
# delete the first row
df.drop('a', axis=0)
 | name | age | city |
---|---|---|---|
b | Anna | 36 | Paris |
c | Peter | 32 | Berlin |
d | Linda | 45 | London |
Since we did not add inplace=True, the original DataFrame is not modified. We can confirm this by printing the original DataFrame:
df
 | name | age | city |
---|---|---|---|
a | John | 23 | New York |
b | Anna | 36 | Paris |
c | Peter | 32 | Berlin |
d | Linda | 45 | London |
We are now familiar with some common ways of accessing specific rows and columns of a DataFrame, and we are ready to combine what we already know to access individual elements. We can use the loc attribute and pass the row and column index values:
# access an element of the DataFrame
df.loc['a', 'name']
'John'
If we know the numerical index of the row and column, we can use the iloc attribute:
# get element from row 2 column 3
df.iloc[1, 2]
'Paris'
By combining the techniques we learned, we can access multiple elements:
# get multiple elements
df.loc[['a', 'c'], ['name', 'city']]
 | name | city |
---|---|---|
a | John | New York |
c | Peter | Berlin |
# first two rows with selected columns
df[0:2][['name', 'age']]
 | name | age |
---|---|---|
a | John | 23 |
b | Anna | 36 |
It is quite common in data science to filter data based on some condition. With Pandas, we can use the bracket notation to filter our data. Let’s first create a DataFrame with random numbers so that we can better illustrate this point:
# creating a numeric-only DataFrame
from numpy.random import randn
np.random.seed(101)
df2 = pd.DataFrame(randn(3, 4), index = ['A', 'B', 'C'], columns = ['col1', 'col2', 'col3', 'col4'])
df2
 | col1 | col2 | col3 | col4 |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
Now, we can see that our DataFrame contains a bunch of numbers. Let’s say we want to find all the elements that are greater than 0. We can use the bracket notation to create a condition that returns a boolean DataFrame:
df2 > 0
 | col1 | col2 | col3 | col4 |
---|---|---|---|---|
A | True | True | True | True |
B | True | False | False | True |
C | False | True | True | False |
As a result, we get a DataFrame with the same dimensions as the original, but the numeric values have been replaced with True or False based on the condition. We can use this boolean DataFrame to filter the original DataFrame:
df2[df2 > 0]
 | col1 | col2 | col3 | col4 |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | NaN | NaN | 0.605965 |
C | NaN | 0.740122 | 0.528813 | NaN |
Now we see that only the elements that satisfy the condition are displayed.
It is actually more common to select a subset of a DataFrame based on a condition applied to a specific column. The idea is basically the same as above. For example, let’s say we want to find all the elements in the col1 column that are greater than 0:
df2['col1'] > 0
A True
B True
C False
Name: col1, dtype: bool
This returns a Pandas Series object with boolean values. We can use this Series object to filter the DataFrame:
df2[df2['col1'] > 0]
 | col1 | col2 | col3 | col4 |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
Similarly, if we go back to our original DataFrame with the people data, we could filter the DataFrame based on the age column. Let’s say we want to find people over the age of 25:
df[df['age'] > 25]
 | name | age | city |
---|---|---|---|
b | Anna | 36 | Paris |
c | Peter | 32 | Berlin |
d | Linda | 45 | London |
Very convenient.
Filtering using multiple conditions is also possible. We can use the & (and) operator to combine the conditions. Let’s say we want to find people over the age of 25 who live in Paris:
df[(df['age'] > 25) & (df['city'] == 'Paris')]
 | name | age | city |
---|---|---|---|
b | Anna | 36 | Paris |
Please note the syntax: we need parentheses around each condition, and we use the & operator, not the and keyword. This brings us to the next point: we can use the | (or) operator to combine conditions. Let’s say we want to find people younger than 25 or people who live in Paris:
df[(df['age'] < 25) | (df['city'] == 'Paris')]
 | name | age | city |
---|---|---|---|
a | John | 23 | New York |
b | Anna | 36 | Paris |
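The warning about & versus the plain and keyword is worth demonstrating: a boolean Series has no single truth value, so and raises a ValueError, while the element-wise & works. A small sketch with hypothetical data:

```python
import pandas as pd

# hypothetical data resembling the chapter's example DataFrame
people = pd.DataFrame({'age': [23, 36, 32],
                       'city': ['New York', 'Paris', 'Berlin']})

# parentheses + the element-wise & operator: works as expected
mask = (people['age'] > 25) & (people['city'] == 'Paris')
print(people[mask]['city'].tolist())  # ['Paris']

# the plain `and` keyword raises a ValueError, because a boolean
# Series cannot be reduced to a single True/False value
try:
    people[(people['age'] > 25) and (people['city'] == 'Paris')]
except ValueError as err:
    print('using `and` fails:', err)
```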
Pretty neat. We can also use the ~ (not) operator to negate a condition. Let’s say we want to find people who do not live in Paris:
df[~(df['city'] == 'Paris')]
 | name | age | city |
---|---|---|---|
a | John | 23 | New York |
c | Peter | 32 | Berlin |
d | Linda | 45 | London |
Since the result is also a DataFrame, we can interrogate the results as we would with any other DataFrame. For example, getting the names of people who do not live in Paris is easy:
df[~(df['city'] == 'Paris')]['name']
a John
c Peter
d Linda
Name: name, dtype: object
In this chapter, we learned about the basics of the Pandas library. In the next chapter, we will learn a little bit more about working with Pandas DataFrames, including how to handle missing data and how to group data.