R

Hello folks!!

This page contains some simple tips and tricks for R users. These are commonly used steps in many data science engagements. I am not sure about others, but I tend to forget these constructs often and spend time looking up examples. So I thought, why not put them all in one place, where they could certainly benefit me, but hopefully some others as well.

A good way of getting and setting the working directory path in your projects

path = getwd()
setwd(path)
#path

Once your projects get larger you end up using a lot of packages. The awesome ‘pacman’ package makes it easy for you to remember and load all needed packages at once. Give it a shot: http://trinker.github.io/pacman/

if (!require("pacman")) install.packages("pacman")
pacman::p_load(package1, package2, etc.)

Reading delimited files (here, comma-separated text files)

DelimFile = read.delim(file = "testData.txt", header = TRUE, sep = ",", na.strings = "NA")
# Getting a summary of the file
summary(DelimFile)
# What is the number of rows
nrow(DelimFile)
# What is the number of columns
ncol(DelimFile)
# Look at the top few rows
head(DelimFile)
# Look at the bottom few rows
tail(DelimFile)

How to remove rows with NAs from the above file

No_NAS = DelimFile[complete.cases(DelimFile),]
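Base R also offers na.omit(), which does the same job in one call. A minimal self-contained sketch (the small data frame here is made up for illustration):

```r
# na.omit() drops every row that contains at least one NA
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
No_NAS_2 <- na.omit(df)   # only the first row survives
nrow(No_NAS_2)            # 1
```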

Converting a date column to the required date format

file$colname <- as.Date(file$colname, "%Y%m%d")



Check the class of that column in the dataframe

class(file$colname)

Using the versatile lubridate package to extract information from ‘date’ format columns

library(lubridate)
# Let's say you have a dataframe with a date of birth column called dob and a date
# of payment column called paymentDate. You can extract some information using the 
# Lubridate package as follows
DF$DOB_year = year(DF$dob)
DF$Payment_Day = wday(DF$paymentDate, label = TRUE)
DF$Payment_Month = month(DF$paymentDate, label = TRUE)
DF$Payment_year = year(DF$paymentDate)
DF$Payment_week = week(DF$paymentDate)
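lubridate also makes it easy to parse character columns into Date objects in the first place. A small hedged sketch with made-up date strings:

```r
library(lubridate)
# Hypothetical example: dates stored as "year-month-day" character strings
dob_strings <- c("1985-03-14", "1990-11-02")
dob <- ymd(dob_strings)   # parse the strings into Date objects
year(dob)                 # 1985 1990
wday(dob, label = TRUE)   # day of the week as an ordered factor
```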

In some cases you may want to unfactor a factor variable and make it numeric for further analysis. The varhandle package comes in handy in such cases.

library(varhandle)
DF$paymentAmount <- unfactor(DF$paymentAmount)
DF$paymentAMT = as.numeric(DF$paymentAmount)
str(DF)

Subtracting one column from another and creating a new column

DF$PendingAmount = DF$owedAmount - DF$paymentAMT

Using the cut function to create buckets; in this case we bucket age into groups of 10 years from 0 to 120

# Creating age buckets of width 10 from 0 to 120
DF$age_group <- cut(DF$Age, breaks=seq(0,120,10))

Using an ifelse statement to check a condition and create a resulting column

DF$Status <- ifelse(DF$variable1 >50, "High", "Low")
This will create a new column called Status based on whether variable1's value exceeds 50.

Inner Joins using the awesome dplyr package

NewDataFrame = inner_join(DF1, DF2, by = c("DF1_ID" = "DF2_ID"))
# You can use this form when the IDs you are joining on have different names in each dataframe

Joining dataframes using merge

MergedDF = merge(DF1, DF2, by.x = "ID1", by.y = "ID2", all.x=TRUE)
#all.x=TRUE means that all elements from DF1 will be kept.

Removing specific columns of a dataframe by column number

DF <- DF[,-(1:3)]

Measuring the time taken by a chunk of code. There are fancier ways of doing this, but the following works pretty well.

# Start the clock!
MyClock <- proc.time()
############Put your code here#############
# Stop the clock
proc.time() - MyClock
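One of those fancier ways is base R's system.time(), which wraps a single expression and reports user, system, and elapsed seconds. A minimal sketch (the timed expression is just a placeholder workload):

```r
# system.time() times one expression and returns a proc_time object
timing <- system.time({
  x <- replicate(100, sum(runif(1000)))   # placeholder workload
})
timing["elapsed"]   # wall-clock seconds taken by the block above
```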

Connecting to a database and pulling data into R

library(dplyr)
# dplyr provides a nice way to connect to databases
# (mydb is an existing database connection object)
TableFromDB = tbl(mydb, sql("SELECT * from XYZ"))
# The above is a lazy query; we still need to pull the data into R
TableFromDB_df = as.data.frame(TableFromDB, n = -1) # n = -1 gets all the rows
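The snippet above assumes the connection object mydb already exists. One way to create it is with the DBI package; the sketch below uses an in-memory SQLite database via RSQLite purely for illustration (the table name XYZ and its contents are made up) — for a real database you would swap in the appropriate driver and connection details:

```r
library(DBI)
library(RSQLite)  # assumption: SQLite for the demo; use your own DB driver instead
mydb <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(mydb, "XYZ", data.frame(id = 1:3, value = c("a", "b", "c")))
TableFromDB_df <- dbGetQuery(mydb, "SELECT * FROM XYZ")  # plain data frame
dbDisconnect(mydb)
```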

Your standard for loops… Yes, I said it, I like them a lot as they make your code quite readable

for (i in 1:n)
{
  # put the loopy stuff here
}

An amazingly fast way in R to do what loops do is to use the apply family of functions, such as lapply, sapply, mapply, and tapply. A quick Google search will get you tons of examples, but here’s a basic way to use lapply to do the same thing as the ‘for’ loop above. The key difference is that lapply iterates over a vector or list and returns a list; the resulting lists can later be bound together using do.call.

dflist = lapply(1:nrow(df_on_which_you_want_to_operate), function(i) {
  # put the function body here, using index 'i'
})
# this will give you a list of results which can be bound as follows to create a dataframe
do.call(rbind, dflist)
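As a concrete (made-up) instance of the lapply + do.call pattern, here each iteration builds a one-row data frame and the pieces are bound back together at the end:

```r
df <- data.frame(a = 1:3, b = 4:6)
# One single-row data frame per input row...
dflist <- lapply(1:nrow(df), function(i) {
  data.frame(row = i, total = df$a[i] + df$b[i])
})
# ...then bind them into one 3-row data frame with columns row and total
result <- do.call(rbind, dflist)
```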

This is one of my favorite ones… don’t you like to see a progress bar in the R command window when you run your code? Use the progress bar. The GitHub page for the “progress” library is very cool, with lots of examples

library(progress)
# See this link for more examples: https://github.com/gaborcsardi/progress
pb <- progress_bar$new(
  format = "  downloading [:bar] :percent eta: :eta",
  total = 100, clear = FALSE, width= 60)
for (i in 1:how_long_you_want_your_for_loop_to_run) {
  pb$tick()
  Sys.sleep(1 / (how_long_you_want_your_for_loop_to_run))
}

Subset your dataframe based on a pattern match over column names. This becomes very useful when you have a lot of columns.

NewDF = Old_DF[, grepl("XYZ_", colnames(Old_DF))]
# grepl returns a logical vector over the column names, which then selects the matching columns
# The above will get you all columns whose names contain the pattern "XYZ_"

Using the paste function to add and modify elements.

MyMatrix[1,2]= paste(MyMatrix[1,2], "New Element", sep = "--")
# this uses the paste function to modify the element at position [1,2] by appending the string "New Element", separated by "--"

There are multiple functions from different packages for writing out csv files. As you start to deal with larger files in data science projects, you will end up using these often. In increasing order of speed, based on my experience, they are:

  • write.csv in base R
  • write_csv from the readr package [install using install.packages("readr")]
  • write.csv.raw from the iotools package [this has some challenges and needs some formatting, but is crazy fast]
  • There is also the recently released feather package, but I have not tried it yet.
# THE iotools package and write.csv.raw function
library(iotools)
# bear in mind this package only writes out dataframes, so if you have matrices
# you will want to convert them to dataframes first
cat(noquote(paste0(paste0(names(DFyouwanttowrite), collapse = ","), "\n")), file = "DFyouwanttowrite.csv")
# The above step is important to write out the column headers first, else this package does not format it well
write.csv.raw(DFyouwanttowrite, "DFyouwanttowrite.csv", append = TRUE)

# THE readr package and write_csv or write_tsv function
write_csv(dataframe,"dataframe.csv", na = "NA", col_names = TRUE) # comma separated
write_tsv(dataframe,"dataframe.txt", na = "NA", col_names = TRUE) # tab separated
# the readr package by default does not write the rownames which is a good thing

# The standard write.csv function
write.csv(dataframe, "dataframe.csv", row.names = FALSE)
# write.csv is always comma separated and always writes the column headers;
# set row.names = FALSE if you do not want the rownames written

# A way to directly write xls files is with the WriteXLS package
library(WriteXLS)
WriteXLS("DF", "DF.xls", na = "NA")

Aggregations are some of the most important tasks in any data science activity. You can use base R or make use of some very nifty packages, especially the data.table, plyr, and dplyr packages. All of them are fantastic!

# Let's say we want to aggregate credit card bills based on a unique id of a 
# person and the year of payment, we can simply use the aggregate function to do that as below
MyAggregation1 = aggregate(payment ~ UID  + Payment_year, mydataframe, sum)
# Whatever you want to aggregate goes on the left of the tilde, 
# and whatever you want to aggregate on goes on the right. Pretty simple. 
# You can use the + sign to get aggregations for multiple levels. 
# We can use functions such as "sum", "mean", and "length."
# Length simply gives you a count; the other two are self-explanatory.
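Since dplyr was mentioned above, here is a hedged sketch of the same aggregation done with group_by and summarise (the small data frame stands in for mydataframe, and the column names mirror the aggregate() example):

```r
library(dplyr)
# Made-up stand-in for mydataframe with the same columns as the example above
mydataframe <- data.frame(
  UID = c(1, 1, 2),
  Payment_year = c(2020, 2020, 2021),
  payment = c(10, 5, 7)
)
# Total payment per UID per Payment_year, as one row per group
MyAggregation2 <- mydataframe %>%
  group_by(UID, Payment_year) %>%
  summarise(payment = sum(payment), .groups = "drop")
```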