Replace missing value from other columns using coalesce join in dplyr

dplyr

data

2023

A simple solution to handle NA value.

Author

Aster Hu

Published

May 11, 2023

The original coalesce function

When cleaning and aggregating data using dplyr package in R, coalesce() is extremely handy to replace the NA value with values from other columns.

A simple example

How does coalesce() work originally? Below is a simple example to showcase one of the common usage.

# Get some simple data
df1 <- data.frame(id  = c(1:3),
                  col = c('A', NA, NA))

df2 <- data.frame(id = c(1:3),
                  col = c(NA, 'B', NA))

df1

# A tibble: 3 × 2
     id col  
  <int> <chr>
1     1 A    
2     2 <NA> 
3     3 <NA>

df2

# A tibble: 3 × 2
     id col  
  <int> <chr>
1     1 <NA> 
2     2 B    
3     3 <NA>

The desired result is to combine these two tables and replace all NA values when the key id matches, like this:

  id  col
1  1    A
2  2    B
3  3 <NA>

Below is how we usually do it using coalesce(), by joining the tables first and then coalescing from the identical vectors col from two data frames.

library(dplyr)

# The usual dplyr way to coalesce
full_join(df1, df2, by = "id") %>% 
  mutate(
    col = coalesce(col.x, col.y),
    .keep = "unused" # Remove temporary columns ended with .x and .y
  )

# A tibble: 3 × 2
     id col  
  <int> <chr>
1     1 A    
2     2 B    
3     3 <NA>

What’s the pain point of `coalesce()`?

Looks great so far, right? But remember, this is the most simple case. In the real world, it’s common to have more than one variable that need to be coalesced. Applying the same method multiple times can quickly become chores.

# A more complicated example
# Both colA and colB need to be coalesced
df1 <- data.frame(id  = c(1:3),
                  colA = c('A', NA, NA),
                  colB = c('X', NA, NA))

df2 <- data.frame(id = c(1:3),
                  colA = c(NA, 'B', NA),
                  colB = c(NA, 'Y', NA))

df1

# A tibble: 3 × 3
     id colA  colB 
  <int> <chr> <chr>
1     1 A     X    
2     2 <NA>  <NA> 
3     3 <NA>  <NA>

df2

# A tibble: 3 × 3
     id colA  colB 
  <int> <chr> <chr>
1     1 <NA>  <NA> 
2     2 B     Y    
3     3 <NA>  <NA>

If we use the same way as the simple example, it looks like this:

# The usual dplyr way to coalesce
full_join(df1, df2, by = "id") %>% 
  mutate(
    # Coalesce becomes more redundant when multiple columns are involved
    colA = coalesce(colA.x, colA.y),
    colB = coalesce(colB.x, colB.y),
    .keep = "unused" # Remove temporary columns ended with .x and .y
  )

# A tibble: 3 × 3
     id colA  colB 
  <int> <chr> <chr>
1     1 A     X    
2     2 B     Y    
3     3 <NA>  <NA>

We get the desired result, but the code looks quite cumbersome. Image if we have more variables like colC, colD, …, etc. You get the idea.

Unfortunately, it’s usually a wide data set with numerous columns whenever I need to use coalesce(). Not only that, there are other scenarios that coalesce cannot handle.

By default, it always override values from .x over .y and cannot do the other way due to its origin from SQL COALESCE. This could be a problem if different non-missing value exists in both columns. For example:

# A more complicated example
# Both colA and colB need to be coalesced
# For colB, df1 has 'X1' while df2 has `X2`
df1 <- data.frame(id  = c(1:3),
                  colA = c('A', NA, NA),
                  colB = c('X1', NA, NA))

df2 <- data.frame(id = c(1:3),
                  colA = c(NA, 'B', NA),
                  colB = c('X2', 'Y', NA))

df1

# A tibble: 3 × 3
     id colA  colB 
  <int> <chr> <chr>
1     1 A     X1   
2     2 <NA>  <NA> 
3     3 <NA>  <NA>

df2

# A tibble: 3 × 3
     id colA  colB 
  <int> <chr> <chr>
1     1 <NA>  X2   
2     2 B     Y    
3     3 <NA>  <NA>

# The usual dplyr way to coalesce
full_join(df1, df2, by = "id") %>% 
  mutate(
    # Coalesce becomes more redundant when multiple columns are involved
    colA = coalesce(colA.x, colA.y),
    colB = coalesce(colB.x, colB.y),
    .keep = "unused" # Remove temporary columns ended with .x and .y
  ) #  And it always takes the left value over the right

# A tibble: 3 × 3
     id colA  colB 
  <int> <chr> <chr>
1     1 A     X1   
2     2 B     Y    
3     3 <NA>  <NA>

As the above result shows, coalesce() takes the left value X1 in df1, but what if I want X2 from df2 to be shown in the result? In reality, the default option might not meet the expectation.

Build our own coalesce join function

To simplify the workflow of coalesce(), I wrote a coalesce_join() function with my partner Rick to resolve all the pain points we mentioned. The code is shown below and it is also available on gist.

require(dplyr)
require(stringr)

coalesce_join <- function(x, 
                          y, 
                          by = NULL, 
                          keep = c("left", "right"), # "left" means keep value from left table if values exist in both tables.
                          suffix = c(".x", ".y"), # Same as the suffix argument in dplyr joins. 
                          join = c("full_join","left_join", "right_join", "inner_join") # Choose a join type from the list. The default is full_join.
                          ) { 
  keep = match.arg(keep) 
  join = match.arg(join) 
  join = match.fun(join) # Confirm the join argument is in the list and match the string to the function
  
  # Depends on the keep argument, overwrite the duplicate value
  # If keep = "left", the value from the left table will be kept, vice versa.
  if (keep == "left") suffix_ = suffix else suffix_ = rev(suffix)
  
  join(x, y, by = by, suffix = suffix) %>% 
    mutate(
      across( # Apply the coalesce function to all overlapped columns
        ends_with(suffix_[1]), # Select columns ended with .x if keep = "left"; or .y if keep = "right"
        ~coalesce(.,
                  get(str_replace(cur_column(), suffix_[1], suffix_[2])) # Replace .x in var.x with .y to generate var.y, if keep = "left"; or vice versa.
        ),
        .names = "{str_remove(.col, suffix_[1])}" # Remove the suffix from the combined columns
      ),
      .keep = "unused") # Remove the temporary columns ended with suffix
  }

Usage and arguments

coalesce_join(x, 
              y, 
              by = NULL, 
              keep = c("left", "right"),
              suffix = c(".x", ".y"),
              join = c("full_join","left_join", "right_join", "inner_join")
              )

keep: The default value is left.

If left, it keeps value from the left table when values exist in both tables
If right, it keeps value from the right table when values exist in both tables

join: Choose a join type from the following:

full_join
left_join
right_join
inner_join

The default is full_join.

Example

Let’s try applying it to the same data:

coalesce_join(df1, df2, by = "id")

# A tibble: 3 × 3
     id colA  colB 
  <int> <chr> <chr>
1     1 A     X1   
2     2 B     Y    
3     3 <NA>  <NA>

So it worked and we get the same result, with few more advantages:

Eliminate the chores of writing the chain(s) of coalesce(.x, .y)
Simplify the code by combining coalesce() and dplyr joins. The join argument is where we select the join type, from full_join, left_join, right_join, inner_join. anti_join is not in the list, obviously, because coalesce() will not be applicable.
It can be customized to take value from right table, if different non-missing values exists. In the example above, if we want to take the right value X2, we can simply do so by setting the argument keep = "right":

# Take the value from right table when non-missing value exists in both
coalesce_join(df1, df2, by = "id", keep = "right")

# A tibble: 3 × 3
     id colA  colB 
  <int> <chr> <chr>
1     1 A     X2   
2     2 B     Y    
3     3 <NA>  <NA>