5  Pandas Basics

Pandas is a Python library commonly used for data manipulation and analysis. It provides data structures such as Series and DataFrame that are designed to make working with structured data easy and efficient, which is why Pandas is used so widely in data science work. It is built on top of NumPy, so many aspects should feel familiar. You can start by installing Pandas from the terminal using pip:

pip install pandas
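
If you want to confirm that the installation worked, one quick check is to print the installed version from Python:

# print the installed Pandas version
import pandas as pd
print(pd.__version__)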

Now let’s learn about the two main data structures in Pandas: Series and DataFrame.

5.1 Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call the Series constructor:

# import numpy and pandas libraries
import numpy as np
import pandas as pd

# create a series from a list
pd.Series(data = [1, 2, 3, 4])
0    1
1    2
2    3
3    4
dtype: int64

We can see that the Series looks very much like a list or a NumPy array. The Series also has an index that can be used to access its elements, but the difference is that we can specify the index values ourselves:

# create a series with custom index
my_series = pd.Series(data = [1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
my_series
a    1
b    2
c    3
d    4
dtype: int64

We can access the elements of the series using the index values:

# access the elements of the series
my_series['a']
1

You can also use the dot notation:

# access the elements of the series using the dot notation
my_series.b
2
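
Keep in mind that the dot notation only works when the index label is a valid Python variable name and does not clash with an existing Series attribute or method. As a small made-up example, a label containing a space forces us back to the bracket notation:

# dot notation cannot be used when the label is not a valid Python name
s = pd.Series(data = [10, 20], index = ['first value', 'b'])
s['first value']
10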

5.1.1 Creating a Series from other data types

A list is not the only data type that can be used to create a Series. You can also use a dictionary, in which case the dictionary keys become the index:

# creating a series from a dictionary
my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
pd.Series(my_dict)
a    1
b    2
c    3
d    4
dtype: int64

You can also create a Series from a NumPy array:

# creating a series from a NumPy array
my_array = np.array([1, 2, 3, 4])
pd.Series(my_array)
0    1
1    2
2    3
3    4
dtype: int32

5.1.2 Accessing elements of a Series

We already saw the bracket notation and the dot notation in action. In addition to using the index labels, we can use the iloc attribute to access the elements of the Series by their numerical position:

# access the first element of the series by using the positional index
my_series.iloc[0]
1

You can also use the loc attribute to access the elements of the Series by their index labels. You can pass multiple index values to access multiple elements:

# access the elements of the series by their index
my_series.loc[['a', 'c']]
a    1
c    3
dtype: int64

We can also use the : operator to access a range of elements. Note that, unlike regular Python slicing, label-based slicing with loc includes the endpoint:

# access a range of elements
my_series.loc['a':'c']
a    1
b    2
c    3
dtype: int64
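
For comparison, positional slicing with iloc follows the usual Python convention and excludes the endpoint, so getting the same three elements requires a slice that stops at position 3:

# positional slicing excludes the endpoint
my_series.iloc[0:3]
a    1
b    2
c    3
dtype: int64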

Assigning values to the elements of the series is also possible and works in the same way as with NumPy arrays:

# assign a value to an element of the series
my_series['a'] = 100
my_series
a    100
b      2
c      3
d      4
dtype: int64
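
We can also assign a value to several elements at once, for example by combining assignment with loc and a label slice. A small sketch continuing with the same Series:

# assign a value to a range of elements
my_series.loc['b':'c'] = 0
my_series
a    100
b      0
c      0
d      4
dtype: int64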

5.2 DataFrame

A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). You can think of it as being similar to a spreadsheet. It is generally the most commonly used Pandas object. Like the Series object we learned about earlier, the DataFrame also accepts many different kinds of input. Let’s see what a DataFrame looks like:

# create a DataFrame from a dictionary
data = {'name': ['John', 'Anna', 'Peter', 'Linda'],
        'age': [23, 36, 32, 45],
        'city': ['New York', 'Paris', 'Berlin', 'London']}

df = pd.DataFrame(data)
df
    name  age      city
0   John   23  New York
1   Anna   36     Paris
2  Peter   32    Berlin
3  Linda   45    London

We can see that the DataFrame has a default index that starts from 0, and that the data is displayed in a tabular format which makes it easy to read. We can also specify the index values:

# create a DataFrame with custom index
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd'])
df
    name  age      city
a   John   23  New York
b   Anna   36     Paris
c  Peter   32    Berlin
d  Linda   45    London
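
A dictionary of lists is not the only accepted input. As a small sketch, a list of dictionaries also works, with each dictionary becoming one row:

# create a DataFrame from a list of dictionaries
records = [{'name': 'John', 'age': 23}, {'name': 'Anna', 'age': 36}]
pd.DataFrame(records)
   name  age
0  John   23
1  Anna   36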

5.2.1 Accessing Columns of a DataFrame

The columns are actually Pandas Series objects. We can access the columns of the DataFrame using the column name and the bracket notation. Let’s access the ‘name’ column of the DataFrame:

# access the first column of the DataFrame
df['name']
a     John
b     Anna
c    Peter
d    Linda
Name: name, dtype: object

Let’s inspect the type of the column:

type(df['name'])
pandas.core.series.Series

We can see that the column is a Pandas Series object. We can also access the columns using the dot notation:

df.name
a     John
b     Anna
c    Peter
d    Linda
Name: name, dtype: object

One issue we may run into with the dot notation is that it does not work if the column we are trying to access has the same name as a DataFrame attribute or method. For example, if we have a column named ‘count’, df.count refers to the DataFrame’s count method rather than the column. Let’s add such a column to demonstrate:

# add count column to the DataFrame
df['count'] = [1, 2, 3, 4]
df
    name  age      city  count
a   John   23  New York      1
b   Anna   36     Paris      2
c  Peter   32    Berlin      3
d  Linda   45    London      4

Instead of the dot notation, we can access the column using the loc attribute:

# access the columns of the DataFrame using the loc attribute
df.loc[:, 'count']
a    1
b    2
c    3
d    4
Name: count, dtype: int64

As we saw above, we can create a new column by assigning a list to a new column name. We can also create a new column by using the existing columns:

# create a new column by using the existing columns
df['age_plus_count'] = df['age'] + df['count']
df
    name  age      city  count  age_plus_count
a   John   23  New York      1              24
b   Anna   36     Paris      2              38
c  Peter   32    Berlin      3              35
d  Linda   45    London      4              49

If we now want to access both of the new columns we created, we can pass a list of column names inside the brackets:

# access the new columns
df[['count', 'age_plus_count']]
   count  age_plus_count
a      1              24
b      2              38
c      3              35
d      4              49

Now you might be wondering how to delete a column. We can use the drop method to delete a column:

# deleting the new columns
df.drop(['count', 'age_plus_count'], axis=1)
    name  age      city
a   John   23  New York
b   Anna   36     Paris
c  Peter   32    Berlin
d  Linda   45    London

This prints out the DataFrame without the columns. However, you should note that the drop method does not modify the original DataFrame by default. We can confirm this by printing the original DataFrame:

df
    name  age      city  count  age_plus_count
a   John   23  New York      1              24
b   Anna   36     Paris      2              38
c  Peter   32    Berlin      3              35
d  Linda   45    London      4              49

If you want to remove the columns permanently, you need to use the inplace parameter:

# deleting the new columns permanently
df.drop(['count', 'age_plus_count'], axis=1, inplace=True)
df
    name  age      city
a   John   23  New York
b   Anna   36     Paris
c  Peter   32    Berlin
d  Linda   45    London

This gives us the original DataFrame without the columns we deleted. You might have noticed how we specified axis = 1 in the drop method. This is used to refer to dropping columns. Rows are dropped by specifying axis = 0. The terminology comes from NumPy, where the first axis is the rows and the second axis is the columns.
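
As a side note, the drop method also accepts columns and index keyword arguments, which let you skip the axis parameter entirely. A small sketch using our current DataFrame (neither call modifies df, since inplace is not set):

# drop a column and a row without specifying the axis
df.drop(columns = ['city'])
df.drop(index = ['a'])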

5.2.2 Accessing Rows of a DataFrame

Ok, so we know how to access the columns of a DataFrame. How do we access the rows? There are a number of ways to do this. For example, we can use the loc attribute to access the rows of a DataFrame by their index:

# access the rows of the DataFrame by their index
row_a = df.loc['a']

row_a
name        John
age           23
city    New York
Name: a, dtype: object

If we inspect the type of row_a, we can see that it is also a Pandas Series object, just like the individual columns were Series objects as well:

type(row_a)
pandas.core.series.Series
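
If we prefer positional indexing, we can pull out the same row with the iloc attribute:

# access the first row by its position
df.iloc[0]
name        John
age           23
city    New York
Name: a, dtype: object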

How do we access multiple rows? We can use the loc attribute and pass a list of index values:

# access multiple rows
df.loc[['a', 'c']]
    name  age      city
a   John   23  New York
c  Peter   32    Berlin

This, in turn, returns a DataFrame. We can also use the iloc attribute to access the rows by their numerical index:

# get the last two rows of the DataFrame
df.iloc[2:]
    name  age    city
c  Peter   32  Berlin
d  Linda   45  London

We can also use the slicing notation for DataFrames. For example, to get the first two rows of the DataFrame:

df[:2]
   name  age      city
a  John   23  New York
b  Anna   36     Paris

Do you still remember the drop method we used to delete columns? As we already discussed, we can also use it to delete rows. Let’s delete the first row:

# delete the first row
df.drop('a', axis=0)
    name  age    city
b   Anna   36   Paris
c  Peter   32  Berlin
d  Linda   45  London

Since we did not add inplace=True, the original DataFrame is not modified. We can confirm this by printing the original DataFrame:

df
    name  age      city
a   John   23  New York
b   Anna   36     Paris
c  Peter   32    Berlin
d  Linda   45    London

5.2.3 Accessing Elements of a DataFrame

We are now familiar with some common ways of accessing specific rows and columns of a DataFrame, so we are ready to combine what we know to access individual elements. We can use the loc attribute and pass the row and column index values:

# access an element of the DataFrame
df.loc['a', 'name']
'John'

If we know the numerical index of the row and column, we can use the iloc attribute:

# get element from row 2 column 3
df.iloc[1, 2]
'Paris'

By combining the techniques we learned, we can access multiple elements:

# get multiple elements
df.loc[['a', 'c'], ['name', 'city']]
    name      city
a   John  New York
c  Peter    Berlin

# first two rows
df[0:2][['name', 'age']]
   name  age
a  John   23
b  Anna   36
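
The same kind of combination also works with iloc and positional indices. For example, here is a small sketch selecting the first two rows and the first two columns:

# first two rows and first two columns by position
df.iloc[0:2, 0:2]
   name  age
a  John   23
b  Anna   36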

5.3 Conditional selection

It is quite common in data science to filter data based on some condition. With Pandas, we can use the bracket notation to filter our data. Let’s first create a DataFrame with random numbers so that we can better illustrate this:

# creating a numeric only DataFrame
from numpy.random import randn
np.random.seed(101)

df2 = pd.DataFrame(randn(3, 4), index = ['A', 'B', 'C'], columns = ['col1', 'col2', 'col3', 'col4'])

df2
       col1      col2      col3      col4
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001

Now we can see that our DataFrame contains a bunch of numbers. Let’s say we want to find all the elements that are greater than 0. We can write a comparison that returns a boolean DataFrame:

df2 > 0
    col1   col2   col3   col4
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False

As a result, we get a DataFrame with the same dimensions as the original, but the numeric values have been replaced with True or False values based on the condition. We can use this boolean DataFrame to filter the original DataFrame:

df2[df2 > 0]
       col1      col2      col3      col4
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN

Now we see that only the elements that satisfy the condition are displayed; the rest have been replaced with NaN.

5.3.1 Column based filtering

It is actually more common to select a subset of a DataFrame based on a condition applied to a specific column. The idea is basically the same as above. For example, let’s say we want to find all the rows where the col1 value is greater than 0:

df2['col1'] > 0
A     True
B     True
C    False
Name: col1, dtype: bool

This returns a Pandas Series object with boolean values. We can use this Series object to filter the DataFrame:

df2[df2['col1'] > 0]
       col1      col2      col3      col4
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965

Similarly, if we go back to our original DataFrame with the people data, we could filter the DataFrame based on the age column. Let’s say we want to find people over the age of 25:

df[df['age'] > 25]
    name  age    city
b   Anna   36   Paris
c  Peter   32  Berlin
d  Linda   45  London

Very convenient.

5.3.2 Multiple conditions

Filtering using multiple conditions is also possible. We can use the & (and) operator to combine the conditions. Let’s say we want to find people over the age of 25 who live in Paris:

df[(df['age'] > 25) & (df['city'] == 'Paris')]
   name  age   city
b  Anna   36  Paris

Please note the syntax: we need to use parentheses around each condition, and we use the & operator rather than the and keyword. Similarly, we can use the | (or) operator to combine conditions. Let’s say we want to find people younger than 25 or people who live in Paris:

df[(df['age'] < 25) | (df['city'] == 'Paris')]
   name  age      city
a  John   23  New York
b  Anna   36     Paris

Pretty neat. We can also use the ~ (not) operator to negate a condition. Let’s say we want to find people who do not live in Paris:

df[~(df['city'] == 'Paris')]
    name  age      city
a   John   23  New York
c  Peter   32    Berlin
d  Linda   45    London

Since the result is also a DataFrame, we can interrogate it as we would any other DataFrame. For example, getting the names of people who do not live in Paris is easy:

df[~(df['city'] == 'Paris')]['name']
a     John
c    Peter
d    Linda
Name: name, dtype: object
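
An equivalent way to do this in a single step is to pass the condition as the row selector and the column name as the column selector to loc:

# combine the row filter and the column selection with loc
df.loc[~(df['city'] == 'Paris'), 'name']
a     John
c    Peter
d    Linda
Name: name, dtype: object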

5.4 Summary

In this chapter, we learned about the basics of the Pandas library. In the next chapter, we will learn a little bit more about working with Pandas DataFrames, including how to handle missing data and how to group data.