Joins: Part 1

Joe Ramirez
3 min readApr 6, 2021

--

As you embark on your data science journey, one of the first things you will have to learn in Pandas is how to join different datasets. This will be an absolutely essential skill to have as you will find that it’s extremely rare that all the data you will need for data analysis and machine learning will be contained in a single dataset.

As such, you will be required to combine information from many different datasets into a single readable dataset before you begin your exploratory data analysis. This blog will better help you understand how you can accomplish that task using Pandas in Python.

To start, let’s say we have two DataFrames:

df1

df2

What if we just want to combine these two DataFrames together keeping all the rows and columns? We can accomplish this using pd.concat():

df_join = pd.concat([df1, df2])

Our resulting DataFrame, df_join, looks like this:

As you can see, all the rows and columns from both DataFrames have been combined into one large DataFrame that makes it much easier to see all the data together. Unfortunately, since the data for the rows and columns was different in the two DataFrames, we have NaN values. NaN stands for “Not a Number” and it represents a missing value. For example, John was not in df2, therefore, he does not have any values in the rows Gender, Eye Color, and Hair Color. As a result, he has NaN values in those rows.

However, Lucas was in both DataFrames and because of this, he does not have a single NaN value in his column. For the others, we could either fill the NaN values later or we could even choose to drop the columns with missing values.

I’m using very small DataFrames in order to drive home the concept of joins, but what if both of these DataFrames were extremely large and we quickly wanted to see only the columns that were in both DataFrames joined together?

We can accomplish this using an inner join:

df_join = pd.concat([df1, df2], join ='inner')

The result looks like this:

As you can see, only Lucas was returned in df_join because he was the only column that appeared in both DataFrames.

When you’re working with very large DataFrames and want to see only the parts in common between both DataFrames, inner joins are extremely useful.

--

--