Here are a bunch of introductory functions that I use all the time.
If I haven’t said it before, I’ll say it now: the ‘#’ symbol marks the start of my comments. Anything to the right of the ‘#’ symbol is not considered part of the code and is therefore not necessary. For example, if you copy a command like this
x<-seq(1,10,1) #x is a sequence of 1 to 10 by 1
and paste the entire statement into R, you will not see an error message because ‘#’ was used to make the comment. However, if I were to write a command like this
x<-seq(1,10,1) x is a sequence of 1 to 10 by 1
you would get the error message: Error: unexpected symbol in "x<-seq(1,10,1) x"
We only need to create a couple of small variables to learn several frequently used functions. Let’s do that here:
x <- c(1,2,3,4,5)
y <- c(6,7,8,9,10)
We will cover the following functions: sum(), mean(), median(), length(), table(), and max().
Let’s first figure out what length() does:
length(x)
length(y)
You should get 5 for both of these statements. length() returns the number of elements (records) in a vector, which is handy for figuring out the n of different variables in a data frame. So there are 5 records in both x and y. What happens if there are missing data in one of these? NOTE: missing data are marked “NA” in R.
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)
Now rerun the length() function on each of these variables
length(x1)
length(y1)
You should again get 5 as the result for each of these statements. So we know that length() counts every element in a vector, even the ones that are missing.
Computing a mean in R is as simple as, or simpler than, computing a mean in any other computer program. Let’s try it out:
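If what you actually want is the number of non-missing values, one option is to combine is.na() with sum(). This is a minimal sketch rather than the only way to do it; the object names n_missing_x1 and n_valid_x1 are just illustrative.

```r
x1 <- c(1,2,NA,4,5)

# is.na() returns TRUE for each missing element; sum() counts the TRUEs
n_missing_x1 <- sum(is.na(x1))

# Negating with ! counts the elements that are NOT missing
n_valid_x1 <- sum(!is.na(x1))

n_missing_x1 #number of NAs in x1
n_valid_x1 #number of usable values in x1
```

With x1 as defined here, this should report 1 missing value and 4 non-missing values, while length(x1) still reports 5.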
sum(x)/length(x) #Just take the sum of all values in x divided by the total number of cases (n)
sum(y)/length(y)
This yields a mean of 3 for x and 8 for y, which is the same as simply typing out
mean(x)
mean(y)
Let’s see what happens when we “manually” compute the means for the data with missing values (NAs).
sum(x1)/length(x1)
sum(y1)/length(y1)
We run into trouble here, getting NA back for both statements. Because these variables contain missing values, sum() returns NA, and length() would still count the missing records anyway. We therefore have to tell R how to handle the NAs in both the sum() and length() commands. Let’s omit them with the following
sum(na.omit(x1))/length(na.omit(x1))
sum(na.omit(y1))/length(na.omit(y1))
You should see a mean of 3 for x1 and 7.75 for y1 when we omit missing data. These examples should sufficiently show the utility of sum() and length(), but when we’re interested in computing a mean, the mean() function is going to be the most efficient. If we want to omit missing data using mean() we can use
mean(na.omit(x1))
mean(na.omit(y1))
or
mean(x1,na.rm=T) #na.rm is a logical argument; setting it to TRUE (or T for short) tells mean() to remove missing cases first
mean(y1,na.rm=T)
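As a quick check that these approaches agree, the sketch below compares a manual calculation with both versions of mean(). The objects manual_mean, omit_mean, and narm_mean are names I made up for illustration. I also spell out TRUE rather than T here, since T is an ordinary variable that can be overwritten while TRUE cannot.

```r
y1 <- c(6,7,8,NA,10)

manual_mean <- sum(y1, na.rm=TRUE)/sum(!is.na(y1)) #sum of non-missing values over their count
omit_mean <- mean(na.omit(y1)) #drop the NAs first, then average
narm_mean <- mean(y1, na.rm=TRUE) #let mean() drop the NAs itself

c(manual_mean, omit_mean, narm_mean) #all three should be 7.75
```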
Similarly, computing the median of a variable is easily implemented with
median(x)
median(y)
which should produce 3 and 8, respectively. Missing data are handled with na.rm=T or na.omit() in the same way as with mean().
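For example, the median of the variables with missing values could be computed as follows. This simply mirrors the mean() calls above.

```r
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)

median(x1) #returns NA because of the missing value
median(x1, na.rm=TRUE) #3, the midpoint of 2 and 4
median(na.omit(y1)) #7.5, the midpoint of 7 and 8
```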
We’ll modify our x and y variables here (by creating new objects in R) for the next set of examples
x2<-c(1,2,3,4,5,5)
y2<-c(1,1,2,3,4,5)
Finding the mode can be a little more laborious than finding the median or mean, because R’s built-in mode() function reports an object’s storage type rather than the statistical mode. It is still very doable with the following
names(table(x2))[table(x2)==max(table(x2))] #names() returns the names attached to an object. For a frequency table, these are the distinct values that were counted.
names(table(y2))[table(y2)==max(table(y2))]
A two-step approach is also possible with
tx<-table(x2) #This stores the frequency table of x2 values in the object tx which can be referenced later on
names(tx[tx==max(tx)]) #This says to return the name of the value of tx which is equal to the maximum value of tx
ty<-table(y2)
names(ty[ty==max(ty)])
Both of these approaches should have shown "5" as the modal value for x2 and "1" as the modal value for y2. The values appear in quotes because names() returns character strings.
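If you find yourself computing modes often, the same logic can be wrapped in a small function. This is only a sketch: get_mode is a hypothetical name (it is not a built-in R function), and x3 is an extra example vector I added to show what happens with ties.

```r
# get_mode() is a hypothetical helper, not part of base R
get_mode <- function(v) {
  tv <- table(v) #frequency table; NAs are dropped by default
  names(tv)[tv == max(tv)] #value(s) with the highest count, as character strings
}

x2 <- c(1,2,3,4,5,5)
x3 <- c(1,1,2,2,3) #1 and 2 are tied for most frequent

get_mode(x2) #"5"
get_mode(x3) #"1" "2"
```

Because the result is a character vector, you could wrap the call in as.numeric() if you need the mode as a number for later calculations.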