17  Joining DataFrames

import pandas as pd

17.1 Mutating Joins

To perform mutating joins in pandas similar to {dplyr} joins in R, you can use the .merge() method and specify the appropriate how parameter. Below is a guide to achieve the inner_join, left_join, and full_join functionality in pandas.

# DataFrame x (left table)
x = pd.DataFrame({
    'key': [1, 2, 3],
    'value_x': ['a', 'b', 'c']
})

# DataFrame y (right table)
y = pd.DataFrame({
    'key': [2, 3, 4],
    'value_y': ['d', 'e', 'f']
})
x
key value_x
0 1 a
1 2 b
2 3 c
y
key value_y
0 2 d
1 3 e
2 4 f

17.1.1 inner_join()

  • Description: Includes only rows where the key matches in both x and y.
# Perform an inner join
inner_join = pd.merge(x, y, on='key', how='inner')
inner_join
key value_x value_y
0 2 b d
1 3 c e
  • on='key' specifies the column to join on.
  • how='inner' keeps only rows with matching keys in both DataFrames.

17.1.2 left_join()

  • Description: Includes all rows from x, regardless of whether there is a matching row in y.
# Perform a left join
left_join = pd.merge(x, y, on='key', how='left')
left_join
key value_x value_y
0 1 a NaN
1 2 b d
2 3 c e
  • how='left' ensures all rows from x are retained.
  • Rows in x with no match in y have NaN values for columns from y.

17.1.3 full_join()

  • Description: Includes all rows from both x and y. Rows with no match in either DataFrame are filled with NaN.
# Perform a full join
full_join = pd.merge(x, y, on='key', how='outer')
full_join
key value_x value_y
0 1 a NaN
1 2 b d
2 3 c e
3 4 NaN f
  • how='outer' includes all rows from both x and y.
  • Rows with no match are filled with NaN.

17.1.4 Summary of pandas.merge() how Parameter:

Join Type how Parameter Description
inner_join() 'inner' Keeps only rows with keys in both x and y.
left_join() 'left' Keeps all rows from x and adds matching rows from y (or NaN if no match).
full_join() 'outer' Keeps all rows from both x and y, filling unmatched rows with NaN.

17.1.5 Joining on Columns with Different Names

If the key column names differ between x and y, use the left_on and right_on parameters.

# DataFrames with different key column names
x = pd.DataFrame({'id_x': [1, 2, 3], 'value_x': ['a', 'b', 'c']})
y = pd.DataFrame({'id_y': [2, 3, 4], 'value_y': ['d', 'e', 'f']})

# Perform an inner join with different key column names
inner_join = pd.merge(x, y, left_on='id_x', right_on='id_y', how='inner')
inner_join
id_x value_x id_y value_y
0 2 b 2 d
1 3 c 3 e

17.1.6 Joining on Multiple Columns

17.1.6.1 Same Names

import pandas as pd

# DataFrame x
x = pd.DataFrame({
    'key1': ['A', 'B', 'C', 'A'],
    'key2': [1, 2, 3, 4],
    'value_x': ['apple', 'banana', 'cherry', 'date']
})
x

# DataFrame y
y = pd.DataFrame({
    'key1': ['A', 'B', 'A', 'D'],
    'key2': [1, 2, 4, 5],
    'value_y': ['red', 'yellow', 'green', 'blue']
})
y
key1 key2 value_y
0 A 1 red
1 B 2 yellow
2 A 4 green
3 D 5 blue
# Inner join on key1 and key2
inner_join = pd.merge(x, y, on=['key1', 'key2'], how='inner')
inner_join
key1 key2 value_x value_y
0 A 1 apple red
1 B 2 banana yellow
2 A 4 date green

17.1.6.2 Different Names

# DataFrames with different column names for keys
x = pd.DataFrame({'key1_x': ['A', 'B', 'C'], 'key2_x': [1, 2, 3], 'value_x': ['apple', 'banana', 'cherry']})
y = pd.DataFrame({'key1_y': ['A', 'B', 'A'], 'key2_y': [1, 2, 4], 'value_y': ['red', 'yellow', 'green']})

# Merge on different column names
merged = pd.merge(x, y, left_on=['key1_x', 'key2_x'], right_on=['key1_y', 'key2_y'], how='inner')
merged
key1_x key2_x value_x key1_y key2_y value_y
0 A 1 apple A 1 red
1 B 2 banana B 2 yellow

17.2 Filtering Joins

In pandas, filtering joins like semi_join() and anti_join() can be implemented using methods such as .merge(), boolean indexing, and pandas operations like .isin(). Below is how to achieve the functionality of {dplyr} filtering joins in pandas.

import pandas as pd

# DataFrame x (left table)
x = pd.DataFrame({
    'key': [1, 2, 3, 4],
    'value_x': ['a', 'b', 'c', 'd']
})

# DataFrame y (right table)
y = pd.DataFrame({
    'key': [2, 3, 5],
    'value_y': ['e', 'f', 'g']
})
x
key value_x
0 1 a
1 2 b
2 3 c
3 4 d
y
key value_y
0 2 e
1 3 f
2 5 g

17.2.1 semi_join()

  • Description: Keeps all rows in x where there is a matching key in y. The result only includes columns from x.

You can use .isin() to filter rows in x where the key exists in y.

# Semi-join: Keep rows in x that have a match in y
semi_join = x[x['key'].isin(y['key'])]
semi_join
key value_x
1 2 b
2 3 c
  • y['key'] provides the keys in y.
  • x['key'].isin(y['key']) creates a boolean mask for rows in x where key exists in y.

17.2.2 anti_join()

  • Description: Keeps all rows in x where there is no matching key in y. The result only includes columns from x.

You can use .isin() with the ~ operator (logical NOT) to filter rows in x where the key does not exist in y.

# Anti-join: Drop rows in x that have a match in y
anti_join = x[~x['key'].isin(y['key'])]
anti_join
key value_x
0 1 a
3 4 d
  • ~x['key'].isin(y['key']) creates a boolean mask for rows in x where key is not in y.

17.2.3 Summary of Filtering Joins

Join Type Description Implementation
semi_join(x, y) Keeps rows in x where there is a match in y. x[x['key'].isin(y['key'])]
anti_join(x, y) Keeps rows in x where there is no match in y. x[~x['key'].isin(y['key'])]

17.2.3.1 Example of Combined Filtering Joins

If you want to perform both filtering joins and combine their outputs for analysis, you can do so in pandas:

# Semi-join: Rows in x with a match in y
semi_join = x[x['key'].isin(y['key'])]
semi_join
# Anti-join: Rows in x without a match in y
anti_join = x[~x['key'].isin(y['key'])]
anti_join
key value_x
0 1 a
3 4 d