Here are a bunch of introductory functions that I use all the time.
If I haven’t said it before, I’ll say it now: the ‘#’ symbol marks the start of my comments. Anything to the right of the ‘#’ symbol is not considered part of the code and is therefore not necessary. For example, if you copy a command like this
x<-seq(1,10,1) #x is a sequence of 1 to 10 by 1
and paste the entire statement into R, you will not see an error message because ‘#’ was used to mark the comment. However, if I were to write a command like this
x<-seq(1,10,1) x is a sequence of 1 to 10 by 1
you would get the error message: Error: unexpected symbol in "x<-seq(1,10,1) x". Without the ‘#’, R tries to read the words after the command as more code.
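If you want to see the difference without typing a broken command at the prompt, one option is to hand both versions to R's parser as text. This is only a sketch; the names with_comment, without_comment, parsed, and msg are mine and are used purely for illustration.

```r
# Hypothetical variable names, used only for illustration
with_comment    <- "x<-seq(1,10,1) #x is a sequence of 1 to 10 by 1"
without_comment <- "x<-seq(1,10,1) x is a sequence of 1 to 10 by 1"

# parse() checks the syntax of the text without running it
parsed <- parse(text = with_comment)  # succeeds: the comment is ignored
length(parsed)                        # one expression was found

# The version without '#' fails to parse; capture the message instead of stopping
msg <- tryCatch(
  {
    parse(text = without_comment)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg  # the message mentions an unexpected symbol
```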
We only need to create a couple of small variables to learn several frequently used functions. Let’s do that here:
x <- c(1,2,3,4,5)
y <- c(6,7,8,9,10)
We will cover the following functions: sum(), mean(), median(), length(), table(), and max()
Let’s first figure out what length() means
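The statements meant here are presumably the two calls below (my assumption, based on the description that follows). The definitions of x and y are repeated so the snippet runs on its own.

```r
x <- c(1,2,3,4,5)
y <- c(6,7,8,9,10)

length(x) # number of elements in x
length(y) # number of elements in y
```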
You should get 5 for both of these statements. length() computes the number of records in a vector. This is handy in figuring out the n of different variables in a data frame. So there are 5 records in both x and y. What happens if there are missing data in one of these? NOTE: missing data are marked “NA” in R.
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)
Now rerun the length() function on each of these variables
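Presumably that means the same two calls as before, now applied to x1 and y1 (again my assumption; the vectors are repeated so the snippet is self-contained):

```r
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)

length(x1) # the NA still counts as an element
length(y1)
```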
You should again get 5 as the result for each of these statements. So we know that length() counts every element of a vector, even the ones that are missing.
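If what you want is the number of non-missing values, one common approach (a sketch, not the only way to do it) is to count the entries for which is.na() is FALSE:

```r
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)

is.na(x1)       # TRUE where a value is missing, FALSE otherwise
sum(!is.na(x1)) # counts the non-missing values: 4
sum(!is.na(y1)) # also 4
```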
Computing a mean in R is pretty much as simple as, or simpler than, computing a mean in any other computer program. Let’s try it out
sum(x)/length(x) #Just take the sum of all values in x divided by the total number of cases (n)
This yields a mean of 3 for x and 8 for y which is the same as simply typing out
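The shortcut meant here is presumably the built-in mean() function, which the following paragraphs go on to use. A minimal sketch comparing it with the manual calculation:

```r
x <- c(1,2,3,4,5)
y <- c(6,7,8,9,10)

mean(x)          # 3
mean(y)          # 8
sum(y)/length(y) # 8, the manual version for y
```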
Let’s see what happens when we “manually” compute the means for the data with missing values (NAs).
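Presumably this is the same manual formula applied to x1 and y1 (my reconstruction, with the vectors repeated so it runs on its own):

```r
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)

sum(x1)/length(x1) # NA, because sum() returns NA when any value is missing
sum(y1)/length(y1) # NA for the same reason
```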
We run into trouble here, getting NAs back for both statements. Since there are missing values in these variables, we have to tell R how to handle them with both the sum() and length() commands. Let’s omit them with the following
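One way to do this, and my best guess at the kind of statement intended, is to wrap each vector in na.omit() before passing it to sum() and length(). Note that sum() also accepts na.rm=TRUE, but length() has no such argument, so for length() the NAs have to be dropped from the vector itself.

```r
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)

# na.omit() drops the NAs before sum() and length() see the data
sum(na.omit(x1))/length(na.omit(x1)) # 12/4 = 3
sum(na.omit(y1))/length(na.omit(y1)) # 31/4 = 7.75

# An equivalent alternative: na.rm=TRUE for sum(), and a count of non-missing values
sum(x1, na.rm=TRUE)/sum(!is.na(x1))  # 3
```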
You should see a mean of 3 for x1 and 7.75 for y1 when we omit missing data. These examples should sufficiently show the utility of sum() and length() but when we’re interested in computing a mean, using the mean() function is going to be the most efficient. If we want to omit missing data using mean() we can use
mean(x1,na.rm=TRUE) #na.rm=TRUE tells mean() to remove missing cases before computing; T is shorthand for TRUE, but spelling it out is safer because T can be overwritten
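As a quick check, here is a small sketch showing what mean() returns with and without na.rm on the two vectors that contain missing values:

```r
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)

mean(x1)             # NA: by default the missing value propagates
mean(x1, na.rm=TRUE) # 3
mean(y1, na.rm=TRUE) # 7.75
mean(na.omit(y1))    # 7.75, same result using na.omit()
```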
Similarly, computing the median of a variable is easily implemented with
which should produce 3 and 8, respectively. Specifying na.rm=T or na.omit() works the same way as it does with mean().
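The median calls described above are presumably along these lines (my assumption). The sketch also applies both ways of handling missing data to x1 and y1:

```r
x  <- c(1,2,3,4,5)
y  <- c(6,7,8,9,10)
x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)

median(x)               # 3
median(y)               # 8
median(x1, na.rm=TRUE)  # 3, the middle of 1, 2, 4, 5
median(na.omit(y1))     # 7.5, the middle of 6, 7, 8, 10
```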
We’ll modify our x and y variables here (by creating new objects in R) for the next set of examples
Finding the mode can be a little more laborious than the median or mean but is still very doable with the following
names(table(x2))[table(x2)==max(table(x2))] #names() returns the names of an object; for a frequency table these are the distinct values that were counted
A two-step approach is also possible with
tx<-table(x2) #This stores the frequency table of x2 values in the object tx which can be referenced later on
names(tx[tx==max(tx)]) #This returns the name(s) of the entries in tx whose count equals the highest count in tx
Both of these approaches should have shown 5 as the modal value for x2 and 1 as the modal value for y2.
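Two details of this approach are worth seeing in action. The sketch below uses made-up vectors, v and w, rather than x2 and y2; those names and values are hypothetical and only there for illustration. First, names() returns character strings, so the mode comes back as "5" rather than the number 5. Second, if two values tie for the highest count, both are returned.

```r
v <- c(5, 3, 5, 1, 5, 3) # hypothetical data, not the x2 used above

tv <- table(v)                       # frequency table of the values in v
mode_v <- names(tv)[tv == max(tv)]   # name(s) of the most frequent value(s)
mode_v                               # "5", a character string
as.numeric(mode_v)                   # 5, converted back to a number

w <- c(1, 1, 2, 2, 3)    # hypothetical data with a tie
tw <- table(w)
mode_w <- names(tw)[tw == max(tw)]
mode_w                               # "1" "2": both tied values are returned
```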