To perform mutating joins in pandas similar to {dplyr} joins in R, you can use the .merge() method and specify the appropriate how parameter. Below is a guide to achieve the inner_join, left_join, and full_join functionality in pandas.
# Inner join on key1 and key2inner_join = pd.merge(x, y, on=['key1', 'key2'], how='inner')inner_join
key1
key2
value_x
value_y
0
A
1
apple
red
1
B
2
banana
yellow
2
A
4
date
green
17.1.6.2 Different Names
# DataFrames with different column names for keysx = pd.DataFrame({'key1_x': ['A', 'B', 'C'], 'key2_x': [1, 2, 3], 'value_x': ['apple', 'banana', 'cherry']})y = pd.DataFrame({'key1_y': ['A', 'B', 'A'], 'key2_y': [1, 2, 4], 'value_y': ['red', 'yellow', 'green']})# Merge on different column namesmerged = pd.merge(x, y, left_on=['key1_x', 'key2_x'], right_on=['key1_y', 'key2_y'], how='inner')merged
key1_x
key2_x
value_x
key1_y
key2_y
value_y
0
A
1
apple
A
1
red
1
B
2
banana
B
2
yellow
17.2 Filtering Joins
In pandas, filtering joins like semi_join() and anti_join() can be implemented using methods such as .merge(), boolean indexing, and pandas operations like .isin(). Below is how to achieve the functionality of {dplyr} filtering joins in pandas.
import pandas as pd# DataFrame x (left table)x = pd.DataFrame({'key': [1, 2, 3, 4],'value_x': ['a', 'b', 'c', 'd']})# DataFrame y (right table)y = pd.DataFrame({'key': [2, 3, 5],'value_y': ['e', 'f', 'g']})
x
key
value_x
0
1
a
1
2
b
2
3
c
3
4
d
y
key
value_y
0
2
e
1
3
f
2
5
g
17.2.1semi_join()
Description: Keeps all rows in x where there is a matching key in y. The result only includes columns from x.
You can use .isin() to filter rows in x where the key exists in y.
# Semi-join: Keep rows in x that have a match in ysemi_join = x[x['key'].isin(y['key'])]semi_join
key
value_x
1
2
b
2
3
c
y['key'] provides the keys in y.
x['key'].isin(y['key']) creates a boolean mask for rows in x where key exists in y.
17.2.2anti_join()
Description: Keeps all rows in x where there is no matching key in y. The result only includes columns from x.
You can use .isin() with the ~ operator (logical NOT) to filter rows in x where the key does not exist in y.
# Anti-join: Drop rows in x that have a match in yanti_join = x[~x['key'].isin(y['key'])]anti_join
key
value_x
0
1
a
3
4
d
~x['key'].isin(y['key']) creates a boolean mask for rows in x where key is not in y.
17.2.3 Summary of Filtering Joins
Join Type
Description
Implementation
semi_join(x, y)
Keeps rows in x where there is a match in y.
x[x['key'].isin(y['key'])]
anti_join(x, y)
Keeps rows in x where there is no match in y.
x[~x['key'].isin(y['key'])]
17.2.3.1 Example of Combined Filtering Joins
If you want to perform both filtering joins and combine their outputs for analysis, you can do so in pandas:
# Semi-join: Rows in x with a match in ysemi_join = x[x['key'].isin(y['key'])]semi_join# Anti-join: Rows in x without a match in yanti_join = x[~x['key'].isin(y['key'])]anti_join