From Excel to R: Beginner’s Guide on Data Manipulation
What is R and why?
R is a powerful tool to conduct complicated data analysis. In short, R is an open source programming environment, and RStudio is a common used user interface (IDE) that makes using R easier.
Many people would compare R with Python, to my understanding, both has pros and cons. I chose R simply because it fits my goal better, and I felt the language is easier to read thanks to R’s packages.
I am a heavily Excel user, so this section will be focus on the basic functions that I used to perform in Excel.
Installation and Setup R
Download R from CRAN official website, then install RStudio from RStudio website.
R has many powerful packages and can be installed from RStudio directly. There’re only two packages need to be installed for now: tidyverse
and openxlsx
. To do it, go to menu bar - Tool
- Install Packages
, put the name of the package and click Install.
Import csv/xlsx Files
There’s some prep needed to be done before any “actual” code. It is essential for every R script.
First is to import the packages that we just installed.
# Import packages
library(tidyverse)
library(openxlsx)
Second step is to setup the working directory, which is the default folder to import/export documents. R only recognizes forward slash, which is contrary to Windows’ backslash, so you would need to make changes if you are on Windows system.
setwd("/Users/aster/myworkingdir")
Then we can import our raw data. Ideally, csv
is preferred if it is available.
<- read.csv('Sales.csv')
sale # prod/sale is the assigned name of the imported dataframe
If the file is xlsx
, the function would be read.xlsx
instead, and we need to specify the name of worksheet when importing.
<- read.xlsx('sample.xlsx', # Name of the workbook
df sheet = 'Sheet 1', # Name of the worksheet
na.strings = "" # Convert blank cell to NA in R
)
Let’s take a glimpse on the imported dataframe.
Filter Columns
Filtering in Excel is not difficult when there’re only a few columns. What if there’re more than 30? Even finding the right column takes time. In such cases, R becomes a handy tool. We put names in the script instead of searching through endless columns, and R has the autocomplete feature when typing the name, making the process even smoother.
# filter Region = Asia
<- sale %>% filter(Region == 'Asia') asiaonly
Note that it should be double equal
if it relates to specific value (e.g. strings, integer).
To apply multiple filters, simply add more conditions inside the bracket.
# multiple filters:
# filter 1: Region = Asia
# filter 2: Sales Channel = Online
<- sale %>% filter(Region == 'Asia',
asiaonly == 'Online') Sales.Channel
Remove Unwanted Columns
For example, if we only need to review the first and second columns:
# Show only first two clumns, and remove other columns
<- sale %>% select(1:2) regioncountry
Usually, it’s easier to use column names (aka variables in R) when dealing with non-consecutive columns.
# Show Regions and Channel only, and remove other columns
<- sale %>% select(Region, Sales.Channel) regionchannel
If we want to remove a certain column while keep everything else, apply a minus sign in front of variables.
# Remove Channel column
<- sale %>% select(-Sales.Channel) nochannel
Those are basic functions that I’ve been used all the time. In the future I’ll introduce more advanced functions that Excel normally cannot handle.