How the two languages handle NA in joins differently
Author
Aster Hu
Published
June 26, 2023
Recently, I encountered a situation where I needed to translate an Access SQL query to R, and I noticed the contrasting behaviors of these two languages when it comes to handling NA/NULL values in left joins.
The impact of NA/NULL values on joins
When performing a join operation between two tables, it is essential to consider the presence of NA/NULL values. Let’s take a left join as an example, where all records from the left table are retained and are attempted to be matched with corresponding rows in the right table. However, the presence of NA/NULL values raises questions about whether they should be matched with any corresponding values in the right table. As a result, this can lead to unexpected results due to the uncertainty inherent in the logic of the chosen tool.
SQL’s approach to handle NULL in joins
Let’s look at a simple example by creating two tables named fname and lname.
# fname
id firstname
1 1 Ada
2 2 Bob
3 NULL Unknown
# lname
id lastname
1 1 Smith
2 NULL To be decided
In the fname table, the 3rd observation contains a NULL value in id, while still having a placeholder value of “Unknown” in the firstname column. The same situation also occurs in the lname table, where the 2nd observation has a NULL value in id and it’s being labeled as “To be decided” in the lastname column.
It may seem unusual, but this type of data is not uncommom in business settings, where it often occurs due to low-quality data or when businesses require a placeholder for empty observations that can be analyzed later.
If we apply a LEFT JOIN in SQL, the code would be:
SELECT*FROM fnameLEFTJOIN lname ON fname.id= lname.id;
And this is the outcome we will get.
id firstname id lastname11 Ada 1 Smith22 Bob NULLNULL3NULL Unknown NULLNULL
As we can see, the placeholder value “To be decided” in the lname table has disappeared. This occurs because in SQL, NULL values in tables or views being joined never match each other1, resulting in SQL not returning a matched result.
The question is, is this the expected outcome? Based on above result, it’s impossible to tell whether a match couldn’t be found, or if it represent a NULL value. Besides, what if we want to match these two NULL values and have “To be decided” correspond with “Unknown”? These questions remain until we have a clear mind of how we wish to handle NULL values.
How R handles NA in joins
More precisely, we are referring to dplyr::left_join. Let’s create the same tables in R and perform a left join.
library(dplyr)fname <-tibble(id =c(1:2, NA_character_),firstname =c("Ada", "Bob", "Unknown"))lname <-tibble(id =c(1, NA_character_),lastname =c("Smith", "To be decided"))left_join(fname, lname, by ="id")
# A tibble: 3 × 3
id firstname lastname
<chr> <chr> <chr>
1 1 Ada Smith
2 2 Bob <NA>
3 <NA> Unknown To be decided
Note
If you see a warning message saying “Each row in x is expected to match at most 1 row in y”, this is a result of the multiple matches warning in dplyr 1.1.0. Tidyverse has modified the behaviour of multiple matches warning since dplyr 1.1.1, significantly reducing the number of warnings. For more information, please refer to the release note of dplyr 1.1.1
From the above table, it actually returned a matched result “To be decided” for the NA value! This is because in R, by default two NA or NaN values are considered as equal, like %in%, match(), and merge().2
The good news is that we can modify this behaviour in R by utilizing the na_matches argument. When we set na_matches = "never", it will behave the same way as in SQL.
left_join(fname, lname, by ="id", na_matches ="never")
# A tibble: 3 × 3
id firstname lastname
<chr> <chr> <chr>
1 1 Ada Smith
2 2 Bob <NA>
3 <NA> Unknown <NA>
What if there are multiple NAs?
Generally speaking, having multiple NAs in key variable in joins is not considered a best practice and it is usually preferable to omit them. However, we may come across such situations from time to time.
Assuming we have another NA in the lname table:
lname <-data.frame(id =c(1, NA_character_, NA_character_),lastname =c("Smith", "To be decided", "Pending"))lname
id lastname
1 1 Smith
2 <NA> To be decided
3 <NA> Pending
What would happen if we join the two tables again?
left_join(fname, lname, by ="id")
# A tibble: 4 × 3
id firstname lastname
<chr> <chr> <chr>
1 1 Ada Smith
2 2 Bob <NA>
3 <NA> Unknown To be decided
4 <NA> Unknown Pending
Since R treats two NAs as equal, similar to handling duplicates in key variables, it will return all the information from the right table due to multiple matches.
Conclusion and takeaway
In SQL, NULL values in tables being joined are not considered as equal. When both table have NA/NULL value in the key variable, SQL treats them as unmatched, which differs from the default behaviour of dplyr::left_join, where NA is treated as a a match. When working with both SQL and R for data analysis, it is crucial to be mindful of the divergent handling of NULL values and adjust the join behaviour accordingly in order to deliver consistent outcomes.