- The problem of openxlsx
- Solution 1: Resolve it in excel
- Solution 2: Use other packages
- Looking foward
At work, I rely on the R package
openxlsx heavily and use
openxlsx::read.xlsx to import excel spreadsheet, which is the most common format in business settings.
Now here is the issue I discovered recently. I have a spreadsheet that contains excel formula which generates value with special character
&. However when importing the file using
Here is the sample of spreadsheet sample_data.xlsx. This is what it looks like in excel:
|colB is text||US & Canada|
|colB is formula (=B2)||US & Canada|
Note that the 2nd observation in colB is a formula “=B2”. Therefore the expected result should be the same as the 1st observation.
# Import spreadshet to R df <- openxlsx::read.xlsx("sample_data.xlsx") df colA colB 1 colB is text US & Canada 2 colB is formula (=B2) US & Canada
However, after importing to R, the value
& instead. This is what the result shows in R:
|colB is text|| |
|colB is formula (=B2)|| |
The obvious solution is to get rid of the problem at the beginning. There are many ways can achieve the desired outcome - in the end, what we want is the value in R matches the value in the spreadsheet.
- Convert to csv. Csv is my preferred format as it eliminates lots of format issues in xlsx. The problem is that this will creates multiple files , and people can get confused on which files to use. Besides, it might not be a good option if the file is meant to be a living dataset that need continuous edits.
- Paste as value. Another way to do it is to transform the formula to plain text in excel by pasting as value. If the spreadsheet is okay to be edited and is not too big, it would be an easy solution.
Sometimes you probably don’t want to touch the raw data. Also, if it’s a large dataset, it might be unrealistic to perform such functionality in excel as the solution 1 suggested. Since this issue only occurs when using
openxlsx package, we can choose other packages as a substitute.
readxl is one of the packages belong to tidyverse collections. It’s designed to read data out of excel into R for both xls and xlsx, and does not have other dependencies. It’s a dependent package outside of tidyverse library so we would need to specifically call it out in R.
readxl::read_excel is the substitute we’re looking for and it does the job well.
library(readxl) # Import df <- read_excel("sample_data.xlsx") df # A tibble: 4 × 2 colA colB <chr> <chr> 1 colB is text US & Canada 2 colB is formula (=B2) US & Canada
readxl has its limitation. Comparing with
readxl::read_excel has less arguments and therefore will possibly generates other format issue. For example,
detectDates is the argument I use often that
readxl::read_excel does not have, so I have to recognize dates and perform conversion later. Also, if I need to save the output to spreadsheet, I would need to call out another library
writexl and it’s just not as convenient as the all-in-one package
I still prefer
openxlsx to other as I like its arguments and it fits the business world well. There’s an antecedent package
openxlsx2 on the way - Let’s wait and see if it resolves the issue.