Replace missing value from other columns using coalesce join in dplyr
R
dplyr
data
2023
A simple solution to handle NA value.
Author
Aster Hu
Published
May 11, 2023
The original coalesce function
When cleaning and aggregating data using dplyr package in R, coalesce() is extremely handy to replace the NA value with values from other columns.
A simple example
How does coalesce() work originally? Below is a simple example to showcase one of the common usage.
# Get some simple datadf1 <-data.frame(id =c(1:3),col =c('A', NA, NA))df2 <-data.frame(id =c(1:3),col =c(NA, 'B', NA))df1
# A tibble: 3 × 2
id col
<int> <chr>
1 1 A
2 2 <NA>
3 3 <NA>
df2
# A tibble: 3 × 2
id col
<int> <chr>
1 1 <NA>
2 2 B
3 3 <NA>
The desired result is to combine these two tables and replace all NA values when the key id matches, like this:
id col
1 1 A
2 2 B
3 3 <NA>
Below is how we usually do it using coalesce(), by joining the tables first and then coalescing from the identical vectors col from two data frames.
library(dplyr)# The usual dplyr way to coalescefull_join(df1, df2, by ="id") %>%mutate(col =coalesce(col.x, col.y),.keep ="unused"# Remove temporary columns ended with .x and .y )
# A tibble: 3 × 2
id col
<int> <chr>
1 1 A
2 2 B
3 3 <NA>
What’s the pain point of coalesce()?
Looks great so far, right? But remember, this is the most simple case. In the real world, it’s common to have more than one variable that need to be coalesced. Applying the same method multiple times can quickly become chores.
# A more complicated example# Both colA and colB need to be coalesceddf1 <-data.frame(id =c(1:3),colA =c('A', NA, NA),colB =c('X', NA, NA))df2 <-data.frame(id =c(1:3),colA =c(NA, 'B', NA),colB =c(NA, 'Y', NA))df1
# A tibble: 3 × 3
id colA colB
<int> <chr> <chr>
1 1 A X
2 2 <NA> <NA>
3 3 <NA> <NA>
df2
# A tibble: 3 × 3
id colA colB
<int> <chr> <chr>
1 1 <NA> <NA>
2 2 B Y
3 3 <NA> <NA>
If we use the same way as the simple example, it looks like this:
# The usual dplyr way to coalescefull_join(df1, df2, by ="id") %>%mutate(# Coalesce becomes more redundant when multiple columns are involvedcolA =coalesce(colA.x, colA.y),colB =coalesce(colB.x, colB.y),.keep ="unused"# Remove temporary columns ended with .x and .y )
# A tibble: 3 × 3
id colA colB
<int> <chr> <chr>
1 1 A X
2 2 B Y
3 3 <NA> <NA>
We get the desired result, but the code looks quite cumbersome. Image if we have more variables like colC, colD, …, etc. You get the idea.
Unfortunately, it’s usually a wide data set with numerous columns whenever I need to use coalesce(). Not only that, there are other scenarios that coalesce cannot handle.
By default, it always override values from .x over .y and cannot do the other way due to its origin from SQL COALESCE. This could be a problem if different non-missing value exists in both columns. For example:
# A more complicated example# Both colA and colB need to be coalesced# For colB, df1 has 'X1' while df2 has `X2`df1 <-data.frame(id =c(1:3),colA =c('A', NA, NA),colB =c('X1', NA, NA))df2 <-data.frame(id =c(1:3),colA =c(NA, 'B', NA),colB =c('X2', 'Y', NA))df1
# A tibble: 3 × 3
id colA colB
<int> <chr> <chr>
1 1 A X1
2 2 <NA> <NA>
3 3 <NA> <NA>
df2
# A tibble: 3 × 3
id colA colB
<int> <chr> <chr>
1 1 <NA> X2
2 2 B Y
3 3 <NA> <NA>
# The usual dplyr way to coalescefull_join(df1, df2, by ="id") %>%mutate(# Coalesce becomes more redundant when multiple columns are involvedcolA =coalesce(colA.x, colA.y),colB =coalesce(colB.x, colB.y),.keep ="unused"# Remove temporary columns ended with .x and .y ) # And it always takes the left value over the right
# A tibble: 3 × 3
id colA colB
<int> <chr> <chr>
1 1 A X1
2 2 B Y
3 3 <NA> <NA>
As the above result shows, coalesce() takes the left value X1 in df1, but what if I want X2 from df2 to be shown in the result? In reality, the default option might not meet the expectation.
Build our own coalesce join function
To simplify the workflow of coalesce(), I wrote a coalesce_join() function with my partner Rick to resolve all the pain points we mentioned. The code is shown below and it is also available on gist.
require(dplyr)require(stringr)coalesce_join <-function(x, y, by =NULL, keep =c("left", "right"), # "left" means keep value from left table if values exist in both tables.suffix =c(".x", ".y"), # Same as the suffix argument in dplyr joins. join =c("full_join","left_join", "right_join", "inner_join") # Choose a join type from the list. The default is full_join. ) { keep =match.arg(keep) join =match.arg(join) join =match.fun(join) # Confirm the join argument is in the list and match the string to the function# Depends on the keep argument, overwrite the duplicate value# If keep = "left", the value from the left table will be kept, vice versa.if (keep =="left") suffix_ = suffix else suffix_ =rev(suffix)join(x, y, by = by, suffix = suffix) %>%mutate(across( # Apply the coalesce function to all overlapped columnsends_with(suffix_[1]), # Select columns ended with .x if keep = "left"; or .y if keep = "right"~coalesce(.,get(str_replace(cur_column(), suffix_[1], suffix_[2])) # Replace .x in var.x with .y to generate var.y, if keep = "left"; or vice versa. ),.names ="{str_remove(.col, suffix_[1])}"# Remove the suffix from the combined columns ),.keep ="unused") # Remove the temporary columns ended with suffix }
Usage and arguments
coalesce_join(x, y, by =NULL, keep =c("left", "right"),suffix =c(".x", ".y"),join =c("full_join","left_join", "right_join", "inner_join") )
keep: The default value is left.
If left, it keeps value from the left table when values exist in both tables
If right, it keeps value from the right table when values exist in both tables
join: Choose a join type from the following:
full_join
left_join
right_join
inner_join
The default is full_join.
Example
Let’s try applying it to the same data:
coalesce_join(df1, df2, by ="id")
# A tibble: 3 × 3
id colA colB
<int> <chr> <chr>
1 1 A X1
2 2 B Y
3 3 <NA> <NA>
So it worked and we get the same result, with few more advantages:
Eliminate the chores of writing the chain(s) of coalesce(.x, .y)
Simplify the code by combining coalesce() and dplyr joins. The join argument is where we select the join type, from full_join, left_join, right_join, inner_join. anti_join is not in the list, obviously, because coalesce() will not be applicable.
It can be customized to take value from right table, if different non-missing values exists. In the example above, if we want to take the right value X2, we can simply do so by setting the argument keep = "right":
# Take the value from right table when non-missing value exists in bothcoalesce_join(df1, df2, by ="id", keep ="right")
# A tibble: 3 × 3
id colA colB
<int> <chr> <chr>
1 1 A X2
2 2 B Y
3 3 <NA> <NA>