17 Joining DataFrames

import pandas as pd

17.1 Mutating Joins

To perform mutating joins in pandas similar to {dplyr} joins in R, you can use the .merge() method and specify the appropriate how parameter. Below is a guide to achieve the inner_join, left_join, and full_join functionality in pandas.

# DataFrame x (left table)
x = pd.DataFrame({
    'key': [1, 2, 3],
    'value_x': ['a', 'b', 'c']
})

# DataFrame y (right table)
y = pd.DataFrame({
    'key': [2, 3, 4],
    'value_y': ['d', 'e', 'f']
})

	key	value_x
0	1	a
1	2	b
2	3	c

	key	value_y
0	2	d
1	3	e
2	4	f

17.1.1 `inner_join()`

Description: Includes only rows where the key matches in both x and y.

# Perform an inner join
inner_join = pd.merge(x, y, on='key', how='inner')
inner_join

	key	value_x	value_y
0	2	b	d
1	3	c	e

on='key' specifies the column to join on.
how='inner' keeps only rows with matching keys in both DataFrames.

17.1.2 `left_join()`

Description: Includes all rows from x, regardless of whether there is a matching row in y.

# Perform a left join
left_join = pd.merge(x, y, on='key', how='left')
left_join

	key	value_x	value_y
0	1	a	NaN
1	2	b	d
2	3	c	e

how='left' ensures all rows from x are retained.
Rows in x with no match in y have NaN values for columns from y.

17.1.3 `full_join()`

Description: Includes all rows from both x and y. Rows with no match in either DataFrame are filled with NaN.

# Perform a full join
full_join = pd.merge(x, y, on='key', how='outer')
full_join

	key	value_x	value_y
0	1	a	NaN
1	2	b	d
2	3	c	e
3	4	NaN	f

how='outer' includes all rows from both x and y.
Rows with no match are filled with NaN.

17.1.4 Summary of `pandas.merge()` `how` Parameter:

Join Type	`how` Parameter	Description
`inner_join()`	`'inner'`	Keeps only rows with keys in both `x` and `y`.
`left_join()`	`'left'`	Keeps all rows from `x` and adds matching rows from `y` (or `NaN` if no match).
`full_join()`	`'outer'`	Keeps all rows from both `x` and `y`, filling unmatched rows with `NaN`.

17.1.5 Joining on Columns with Different Names

If the key column names differ between x and y, use the left_on and right_on parameters.

# DataFrames with different key column names
x = pd.DataFrame({'id_x': [1, 2, 3], 'value_x': ['a', 'b', 'c']})
y = pd.DataFrame({'id_y': [2, 3, 4], 'value_y': ['d', 'e', 'f']})

# Perform an inner join with different key column names
inner_join = pd.merge(x, y, left_on='id_x', right_on='id_y', how='inner')
inner_join

	id_x	value_x	id_y	value_y
0	2	b	2	d
1	3	c	3	e

17.1.6 Joining on Multiple Columns

17.1.6.1 Same Names

import pandas as pd

# DataFrame x
x = pd.DataFrame({
    'key1': ['A', 'B', 'C', 'A'],
    'key2': [1, 2, 3, 4],
    'value_x': ['apple', 'banana', 'cherry', 'date']
})
x

# DataFrame y
y = pd.DataFrame({
    'key1': ['A', 'B', 'A', 'D'],
    'key2': [1, 2, 4, 5],
    'value_y': ['red', 'yellow', 'green', 'blue']
})
y

	key1	key2	value_y
0	A	1	red
1	B	2	yellow
2	A	4	green
3	D	5	blue

# Inner join on key1 and key2
inner_join = pd.merge(x, y, on=['key1', 'key2'], how='inner')
inner_join

	key1	key2	value_x	value_y
0	A	1	apple	red
1	B	2	banana	yellow
2	A	4	date	green

17.1.6.2 Different Names

# DataFrames with different column names for keys
x = pd.DataFrame({'key1_x': ['A', 'B', 'C'], 'key2_x': [1, 2, 3], 'value_x': ['apple', 'banana', 'cherry']})
y = pd.DataFrame({'key1_y': ['A', 'B', 'A'], 'key2_y': [1, 2, 4], 'value_y': ['red', 'yellow', 'green']})

# Merge on different column names
merged = pd.merge(x, y, left_on=['key1_x', 'key2_x'], right_on=['key1_y', 'key2_y'], how='inner')
merged

	key1_x	key2_x	value_x	key1_y	key2_y	value_y
0	A	1	apple	A	1	red
1	B	2	banana	B	2	yellow

17.2 Filtering Joins

In pandas, filtering joins like semi_join() and anti_join() can be implemented using methods such as .merge(), boolean indexing, and pandas operations like .isin(). Below is how to achieve the functionality of {dplyr} filtering joins in pandas.

import pandas as pd

# DataFrame x (left table)
x = pd.DataFrame({
    'key': [1, 2, 3, 4],
    'value_x': ['a', 'b', 'c', 'd']
})

# DataFrame y (right table)
y = pd.DataFrame({
    'key': [2, 3, 5],
    'value_y': ['e', 'f', 'g']
})

	key	value_x
0	1	a
1	2	b
2	3	c
3	4	d

	key	value_y
0	2	e
1	3	f
2	5	g

17.2.1 `semi_join()`

Description: Keeps all rows in x where there is a matching key in y. The result only includes columns from x.

You can use .isin() to filter rows in x where the key exists in y.

# Semi-join: Keep rows in x that have a match in y
semi_join = x[x['key'].isin(y['key'])]
semi_join

	key	value_x
1	2	b
2	3	c

y['key'] provides the keys in y.
x['key'].isin(y['key']) creates a boolean mask for rows in x where key exists in y.

17.2.2 `anti_join()`

Description: Keeps all rows in x where there is no matching key in y. The result only includes columns from x.

You can use .isin() with the ~ operator (logical NOT) to filter rows in x where the key does not exist in y.

# Anti-join: Drop rows in x that have a match in y
anti_join = x[~x['key'].isin(y['key'])]
anti_join

	key	value_x
0	1	a
3	4	d

~x['key'].isin(y['key']) creates a boolean mask for rows in x where key is not in y.

17.2.3 Summary of Filtering Joins

Join Type	Description	Implementation
`semi_join(x, y)`	Keeps rows in `x` where there is a match in `y`.	`x[x['key'].isin(y['key'])]`
`anti_join(x, y)`	Keeps rows in `x` where there is no match in `y`.	`x[~x['key'].isin(y['key'])]`

17.2.3.1 Example of Combined Filtering Joins

If you want to perform both filtering joins and combine their outputs for analysis, you can do so in pandas:

# Semi-join: Rows in x with a match in y
semi_join = x[x['key'].isin(y['key'])]
semi_join
# Anti-join: Rows in x without a match in y
anti_join = x[~x['key'].isin(y['key'])]
anti_join

	key	value_x
0	1	a
3	4	d