experimenting with the updated ggplot2 package

ggplot2 (v 0.9.2) was recently released. To see what’s new use the following:

news(Version == "0.9.2", package = "ggplot2")

I was playing around with it when I was making some regression discontinuity plots in R. I wrote a function for this and here it is:

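rdd_plot <- function(x, y, t, data){
  library(ggplot2)
  # aes() looks up x, y, and t as columns of `data` first, so the data
  # frame needs columns with those names
  p = ggplot(data)
  p1 = p + geom_point(aes(x, y, colour = t))
  # one linear fit per value of t, i.e. one on each side of the cutoff
  p2 = p1 + geom_smooth(aes(x, y, group = t), method = "lm", 
            se = FALSE, lwd = 1.2, colour = "red1")
  print(p2)
}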

Test it out with some fake data:

library(ggplot2)
r <- .8
x <- rnorm(1000)
t <- x
t[t < mean(x)] <- 0
t[t >= mean(x)] <- 1
table(t)
y <- rnorm(1000)
y <- x * r + y * sqrt(1 - r^2)
cor(x, y)
qplot(x, y)
fake <- data.frame(x, y, t)
fake$y[fake$t == 1] <- fake$y[fake$t == 1] + sd(fake$y[fake$t == 1])
rdd_plot(x = fake$x, y = fake$y, t = fake$t, data = fake)

This should give you a scatterplot of y against x, coloured by t, with a separate red regression line fitted on each side of the cutoff.
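
As a rough numerical companion to the plot, the size of the jump at the cutoff can be read off a simple linear model. This is only a sketch: it rebuilds the same kind of fake data with an arbitrary seed (the seed and the name bump are my additions, not part of the code above) and assumes a common slope on both sides of the cutoff.

```r
set.seed(123)                        # arbitrary seed so the sketch is reproducible
r <- .8
x <- rnorm(1000)
t <- as.numeric(x >= mean(x))        # 1 at or above the cutoff, 0 below
y <- x * r + rnorm(1000) * sqrt(1 - r^2)
fake <- data.frame(x, y, t)
bump <- sd(fake$y[fake$t == 1])      # the jump added to the treated group
fake$y[fake$t == 1] <- fake$y[fake$t == 1] + bump

fit <- lm(y ~ x + t, data = fake)    # common slope, shift at the cutoff
coef(fit)["t"]                       # estimated jump
bump                                 # the jump that was built in
```

The coefficient on t should land in the neighbourhood of bump, which is the vertical gap between the two red lines in the plot.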

Posted in Uncategorized | Leave a comment

experimenting with maps in R

A number of resources exist for making elegant maps in R, especially from Hadley Wickham (http://had.co.nz/ggplot2/). The maps package and ggplot2 work well together to draw national, state, or county borders, but I need school district borders for a specific exercise. I’m no expert in mapping software or formats (e.g. TIGER files, shapefiles, etc.), but I know these are standard file formats for most mapping datasets. The Census website (http://www.census.gov/cgi-bin/geo/shapefiles2010/main) houses publicly available shapefiles for school districts, so I used ggplot2 to play around with these files. Here is the code and output:

library(maptools)
library(gpclib)
library(ggplot2)
d1 <- readShapeSpatial("tl_2010_39_unsd10.shp")
plot(d1)
meta <- as.data.frame(d1)
gpclibPermit()
d1ddf <- fortify(d1, region="NAME10")
id <- d1ddf$id
id <- id[!duplicated(id)]
y <- rnorm(length(id))
perf <- data.frame(id, y)
d2ddf <- merge(d1ddf, perf, by="id")
d2ddf <- d2ddf[order(d2ddf$id, d2ddf$order), ]   #merge() can shuffle rows; restore the vertex order
qs <- quantile(d2ddf$y, c(.2, .4, .6, .8), na.rm=TRUE)   #compute the cut points once
q1 <- qs[1]
q2 <- qs[2]
q3 <- qs[3]
q4 <- qs[4]
d2ddf$y_cat[d2ddf$y < q1] <- "20th Percentile or Below"
d2ddf$y_cat[d2ddf$y >= q1 & d2ddf$y<=q2] <- "21st-40th Percentile"
d2ddf$y_cat[d2ddf$y > q2 & d2ddf$y<=q3] <- "41st-60th Percentile"
d2ddf$y_cat[d2ddf$y > q3 & d2ddf$y<=q4] <- "61st-80th Percentile"
d2ddf$y_cat[d2ddf$y > q4] <- "Above 80th Percentile"
d2ddf$y_cat <- ordered(factor(d2ddf$y_cat, levels=c("20th Percentile or Below",
                                                    "21st-40th Percentile",
                                                    "41st-60th Percentile",
                                                    "61st-80th Percentile",
                                                    "Above 80th Percentile")))
p <- ggplot(d2ddf)
p1 <- p + geom_polygon(aes(long, lat, group=group))
p2 <- p1 + geom_polygon(aes(fill=y_cat,long,lat,group=group)) +
           geom_polygon(data = d1, colour = alpha("white", 1/2),
           size = 0.2, fill = NA) +
           scale_fill_brewer(palette="PuRd", name="Outcome") +
           geom_path(aes(long,lat,group=group),colour="white") +
           ggtitle("OH Outcomes") +
           xlab("Longitude") +
           ylab("Latitude")
p2

This is probably not the most efficient way to produce maps like this (especially since this outcome is arbitrary) but I’m pleased with the output:
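
The five percentile bins built by hand above could also be produced in one step with cut(). The following is a minimal sketch on made-up data (the vector y and the name bin_labels are stand-ins, not the map data). Note that cut() closes each bin on the right, so a value exactly equal to a cut point lands in the lower bin, which differs slightly from the comparisons in the code above.

```r
set.seed(1)                                    # arbitrary seed
y <- rnorm(200)                                # stand-in for the district outcome
bin_labels <- c("20th Percentile or Below", "21st-40th Percentile",
                "41st-60th Percentile", "61st-80th Percentile",
                "Above 80th Percentile")
breaks <- quantile(y, c(0, .2, .4, .6, .8, 1)) # bin edges, including min and max
y_cat <- cut(y, breaks = breaks, labels = bin_labels,
             include.lowest = TRUE,            # keep the minimum in the first bin
             ordered_result = TRUE)            # ordered factor, like ordered(factor())
table(y_cat)
```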


playing with ggplot2

I have been playing around with ggplot2 a lot recently.  One thing I like about ggplot2 is the ability to condense a lot of information into graphical summaries.  Sometimes, I need a quick and easy way to look at survey response frequencies across different groups.  I came across this posting and wanted to try it myself on some of my own data.  Here is what I did:

library(ggplot2)
data_qa1 <- read.csv("data_qa1.csv")
p <- ggplot(data_qa1)
p1 <- p + geom_bar(aes(group, adj.freq, colour=group),
                   stat="identity")
p2 <- p1 + geom_bar(aes(group, adj.freq, fill=item),
                    stat="identity", position="dodge")
p2

And this is what came out:

I am pleased with it.  Then, following the directions in the ggplot2 book, I was able to arrange several item sets on one page by creating a viewport function:

library(grid)   #grid.layout(), viewport(), and unit() live in the grid package
layout <- grid.layout(nrow = 2, ncol = 2,
                      widths = unit(c(1,1), c("null", "null")),
                      heights = unit(c(1,2), c("null", "null")))
vplayout <- function (...) {
  grid.newpage()
  pushViewport(viewport(layout = layout))
}
subplot <- function(x, y)
  viewport(layout.pos.row=x, layout.pos.col=y)
vplayout()

data_qa1 <- read.csv("data_qa1.csv")
data_qa2 <- read.csv("data_qa2.csv")
data_qa3 <- read.csv("data_qa3.csv")
data_qa4 <- read.csv("data_qa4.csv")  

p <- ggplot(data_qa1)
p1 <- p + geom_bar(aes(group, adj.freq, colour=group),
                   stat="identity")
p2 <- p1 + geom_bar(aes(group, adj.freq, fill=item),
                    stat="identity", position="dodge")
print(p2,vp=subplot(1, 1))

p <- ggplot(data_qa2)
p1 <- p + geom_bar(aes(group, adj.freq, colour=group),
                   stat="identity")
p2 <- p1 + geom_bar(aes(group, adj.freq, fill=item),
                    stat="identity", position="dodge")
print(p2,vp=subplot(1, 2))

p <- ggplot(data_qa3)
p1 <- p + geom_bar(aes(group, adj.freq, colour=group),
                   stat="identity")
p2 <- p1 + geom_bar(aes(group, adj.freq, fill=item),
                    stat="identity", position="dodge")
print(p2,vp=subplot(2, 1))

p <- ggplot(data_qa4)
p1 <- p + geom_bar(aes(group, adj.freq, colour=group),
                   stat="identity")
p2 <- p1 + geom_bar(aes(group, adj.freq, fill=item),
                    stat="identity", position="dodge")
print(p2,vp=subplot(2, 2))

Which gave me this:
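
The four near-identical blocks above could also be written as one helper plus a loop. This is a sketch rather than the code I actually ran: make_fake(), item_plot(), item_sets, and positions are hypothetical names, and the random data frames only stand in for data_qa1.csv through data_qa4.csv.

```r
library(ggplot2)
library(grid)

# Hypothetical stand-in for one survey file: two groups, three items
make_fake <- function() {
  data.frame(group    = rep(c("A", "B"), each = 3),
             item     = rep(c("q1", "q2", "q3"), times = 2),
             adj.freq = runif(6))
}
set.seed(1)                                   # arbitrary seed
item_sets <- list(make_fake(), make_fake(), make_fake(), make_fake())

# Same two layers as in the blocks above, wrapped in a function
item_plot <- function(d) {
  ggplot(d) +
    geom_bar(aes(group, adj.freq, colour = group), stat = "identity") +
    geom_bar(aes(group, adj.freq, fill = item),
             stat = "identity", position = "dodge")
}

# Grid cells in reading order: (1,1), (1,2), (2,1), (2,2)
positions <- expand.grid(col = 1:2, row = 1:2)

grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow = 2, ncol = 2)))
for (i in seq_along(item_sets)) {
  print(item_plot(item_sets[[i]]),
        vp = viewport(layout.pos.row = positions$row[i],
                      layout.pos.col = positions$col[i]))
}
```

With real files, item_sets could be filled with lapply() over the file names and read.csv().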


Formatting Character Strings in R

Formatting characters from lower to upper case:

x<-c("a","b","c","d")
toupper(x)

From upper to lower case:

x<-c("A","B","C")
tolower(x)     #all lower case

Capitalize words in R:

#install.packages("Hmisc") if you don't have it already
library(Hmisc)
x<-c("arizona","california","indiana","illinois")
capitalize(x)
#[1] "Arizona"    "California" "Indiana"    "Illinois"

Remove trailing or leading “white” space:

x<-c("a      ","       b","     c      ","a","b "," c")
nlevels(factor(x)) #a, b, and c have multiple factor levels
#[1] 6
library(gdata)     #install.packages("gdata") if not installed
trim(x)            #trims leading and trailing white spaces with a simple
                   #function

Extracting portions of a character string:

x<-c("test","this","string")
substr(x,1,2)    #extract the characters in x starting with 1 and
#stopping with 2 (takes the first two characters)
substr(x,1,100) #take 1 to 100 characters in x (should get them all!)
#could also specify substr(x,1,nchar(x))
x<-c(1,200,3000)
x1<-paste("AIR",x,sep="")       #concatenate "AIR" to the front of the
                                #value of x
x2<-paste(x1,"00000",sep="")    #add 5 trailing zeros to x1
x3<-substr(x2,1,8)              #keep the 1st through 8th character in
                                #x2

Nested one-line version of the above:

substr(paste(paste("AIR",x,sep=""),"00000",sep=""),1,8)
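
If you would rather not load extra packages, base R can handle the trimming and capitalizing steps too. This is a small sketch, assuming R 3.2.0 or later for trimws(); the object names trimmed, states, and capped are just for illustration.

```r
x <- c("a      ", "       b", "     c      ", "a", "b ", " c")
trimmed <- trimws(x)             # strips leading and trailing white space
nlevels(factor(trimmed))         # a, b, and c now collapse to three levels

states <- c("arizona", "california", "indiana", "illinois")
# upper-case the first character, then glue the rest of the word back on
capped <- paste0(toupper(substr(states, 1, 1)), substring(states, 2))
capped
```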

Flat Tables

This is a nice little function that I learned about in John Verzani’s book.  It makes table() output look a little nicer:

data<-data.frame(x=rbinom(100,1,.3),y=rbinom(100,1,.5))
#Mock up a quick toy for the example
ftable(table(data$x,data$y))
    0  1
0  40 29
1  13 18
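
ftable() earns its keep once there are three or more variables, where plain table() prints one slice at a time. A quick sketch, with a made-up third variable z and an arbitrary seed:

```r
set.seed(10)                                   # arbitrary seed
data <- data.frame(x = rbinom(100, 1, .3),
                   y = rbinom(100, 1, .5),
                   z = rbinom(100, 1, .5))     # z is a hypothetical third variable
ft <- ftable(table(data$x, data$y, data$z))    # x and y down the side, z across the top
ft
ftable(xtabs(~ x + y + z, data = data))        # same counts, with the variable names kept
```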

Indexing Vectors and Data Frames in R

First create a vector containing some values, like this:

x<-c(6,7,8,9,10,NA,NA)

Now try the following:

x[1]                       #gives the first value of x
x[c(1,3,5)]                #gives the first, third, and fifth values of x
x[-4]                      #gives all values excluding the fourth
x[1:2]                     #gives the first and second values (same as x[c(1,2)])
#note: each comparison on the next six lines also returns an NA for every missing value in x
x[x==10]                   #gives x where x is equal to 10
x[x<10]                    #gives x where x is less than 10
x[x<=10]                   #gives all values less than/equal to 10
x[x>=7]                    #gives all values greater than or equal to 7
x[x==6|x==9]               #gives all values where x is 6 or 9
x[x!=8 & x!=9]             #gives all values where x is not equal to 8 and is not equal to 9
x[is.na(x)]                #gives all missing values of x
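
One thing to watch with the comparisons above: because x contains NAs, a logical test returns NA for those positions, and indexing with NA gives NA back. A short sketch of a few ways around that:

```r
x <- c(6, 7, 8, 9, 10, NA, NA)
x[x == 10]                 # 10 NA NA: the NAs come along
x[which(x == 10)]          # 10: which() keeps only the TRUE positions
x[!is.na(x) & x < 10]      # 6 7 8 9
x[x %in% c(6, 9)]          # 6 9: %in% never returns NA
```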

Take advantage of R’s object-oriented framework and store new values into new objects (variables):

x_new<-x[x==6|x==9]        #all values where x is 6 or 9

Accessing specific rows in a data frame follows the same logic.  Just remember that indexing data frames follows this structure [Rows, Columns] or [R, C].  Try these out:

kids<-read.csv("data.csv")    
#or library(UsingR); kids<-kid.weights #[install.packages("UsingR") if not installed]
kids[1,]                      #What is the value of the first row vector of data?
kids[,1]                      #What is the value of the first column vector of data?

Subsetting a data frame:

new<- kids[kids$gender=="M",]  
#give me the rows in the data frame where gender is equal to M
nrow(new)    #see how many rows are in the new (subsetted) data frame

new1<-kids[kids$gender=="M" & kids$age <=40,]

new2<-kids[kids$gender=="M" & (kids$age <=40 | kids$age >=50),] 
#Give me all rows in the data frame where gender equals male and age
#is less than or equal to 40 months or greater than or equal to 50 months.

Exporting Data

Exporting data from R to some other format is just as simple as getting the data in.  I like working with comma-separated value files (.csv) but sometimes I want to export directly into Stata or SPSS.  Let’s create a (correlated) dataframe to export:

x<-rnorm(100)
y<-rnorm(100)+10
r<-.4
y<-x*r+y*sqrt(1-r^2)
g<-c(rep("m",50),rep("f",50))
data<-data.frame(x,y,g)

Exporting to a .csv or .txt file is very simple.  For .csv files use the following:

write.csv(data,"test1.csv")  #Export the R object "data" to a file called "test1.csv"

NOTE: R will save export files to the current working directory.  If you want to save to a different location, specify the full path of the desired location in the file name.  For example,

write.csv(data,"C:/mydata/test1.csv")  #Save "data" to a folder called "mydata" on the C drive
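
One detail worth knowing: by default write.csv() also writes the row names as an unnamed first column, which comes back as a column called X when the file is re-read. The sketch below writes to a temporary folder (tempdir() and the small data frame are stand-ins so nothing lands in your working directory) and shows both behaviours.

```r
set.seed(42)                                   # arbitrary seed
data <- data.frame(x = rnorm(5), g = c("m", "m", "f", "f", "m"))

path_default <- file.path(tempdir(), "test1_default.csv")
write.csv(data, path_default)                  # row names written as a first column
names(read.csv(path_default))                  # "X" "x" "g"

path_clean <- file.path(tempdir(), "test1_clean.csv")
write.csv(data, path_clean, row.names = FALSE) # leave the row names out
back <- read.csv(path_clean)
names(back)                                    # "x" "g"
all.equal(back$x, data$x)                      # values survive the round trip
```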

You can also export your data to a .txt file.  When exporting, you can specify how the data are separated.  If we want to export to a .txt file with comma separation we would use the following:

write.table(data,"test2.txt",sep=",")  #sep= tells R how you want the data separated

If we don’t specify how the data will be separated, write.table() separates the values with a single space by default.  Look at the results of the following:

write.table(data,"test3.txt")

Exporting our data from R to Stata, SPSS, and SAS is not that much more difficult than exporting to simple .csv or .txt files.  The first thing we have to do is load the ‘foreign’ package (or install it if you haven’t already).

install.packages("foreign")  #Do this only if you don't have the foreign package already
library(foreign)

Exporting to Stata is very straightforward.  Try out the following code:

write.dta(data,"test4.dta")

Exporting to SPSS files takes just a bit more work.  To create an SPSS version of the data we’re using we can use the following:

write.foreign(data,"test5.txt","test5.sps","SPSS")

What this does is store the data you created in R in a plain .txt file and create a syntax file (here with the usual .sps syntax extension) to read in the data.  When you open the newly created "test5.sps" file a syntax window will open.  Highlight and run that syntax and you should see the same data (x,y,g) that you created in R.  You might notice, however, that the variable ‘g’ is no longer composed of characters ("m","f") but is now numeric (2, 1).

The write.foreign() function requires us to specify four different elements: the data frame (df), the data file name (datafile), the syntax file to accompany the data (codefile), and the stats package we’re exporting to (package). write.foreign() can export to SPSS, Stata, and SAS.  To learn more about the write.foreign() function type

?write.foreign

Let’s use write.foreign() to export to SAS.  The code file that gets written is a SAS program that reads the .txt file, so it makes sense to give it a .sas extension:

write.foreign(data,"test6.txt","test6.sas","SAS")

Update: I have had mixed results exporting data from R to SAS using the write.foreign() function from the ‘foreign’ package.  As a reliable alternative, I typically export to a .csv file using write.csv() and import the data from within SAS.  In fact, I find myself almost always exporting my data from R to a .csv file because of its reliability and portability to other packages; SPSS, Stata, and SAS all read in .csv files easily.


Installing Packages

Packages contain collections of functions.  The R community has generated over 1,000 different packages.  R comes with a base set of packages including the ‘stats’ package.  You will probably at some point want to install additional packages that can handle unique tasks.  For example, maybe you want to install the ‘lattice’ package to improve the graphics you generate.  The following code will get ‘lattice’ onto your computer:

install.packages("lattice")

You will then be prompted to select a mirror (the location you will download your package from).  Choose one near you (I live in Chicago and always use CA1 – Berkeley) and then click OK.  The download should begin.

Alternatively, you could use the drop-down menus.  Select Packages–>Install Packages.  Now you can look up the package you are interested in, select a mirror from which to download, and begin downloading.
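
In a script it can be handy to install a package only when it is missing, so the download is not repeated on every run. A minimal sketch, assuming the CRAN cloud mirror (https://cloud.r-project.org) is an acceptable choice of mirror; pkg is just an illustrative variable name.

```r
pkg <- "lattice"
if (!requireNamespace(pkg, quietly = TRUE)) {      # TRUE if the package is already installed
  install.packages(pkg, repos = "https://cloud.r-project.org")
}
library(pkg, character.only = TRUE)                # load it for this session
```

Supplying repos in the call skips the mirror prompt described above.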


More on Introductory Functions

Here are a bunch of introductory functions that I use all the time.

If I haven’t said it before, I’ll say it now: the ‘#’ symbol is the start of my comments.  Anything to the right of the ‘#’ symbol is not considered part of the code and is therefore not necessary.  So if you take a command like this

x<-seq(1,10,1)  #x is a sequence of 1 to 10 by 1

and copy and paste the entire statement into R, you will not see an error message because ‘#’ was used to mark the comment.  However, if I were to write a command like this

x<-seq(1,10,1)  x is a sequence of 1 to 10 by 1

you would get an error message saying: Error: unexpected symbol in "x<-seq(1,10,1)  x"

We only need to create a couple of small variables to learn several frequently implemented functions.  Let’s do that here

x <- c(1,2,3,4,5)
y <- c(6,7,8,9,10)

We will cover the following functions: sum(), mean(), median(), length(), table(), and max()

Let’s first figure out what length() means

length(x)
length(y)

You should get 5 for both of these statements.  length() computes the number of records in a vector.  This is handy in figuring out the n of different variables in a data frame.  So there are 5 records in both x and y.  What happens if there are missing data in one of these?  NOTE: missing data are marked “NA” in R.

x1 <- c(1,2,NA,4,5)
y1 <- c(6,7,8,NA,10)

Now rerun the length() function on each of these variables

length(x1)
length(y1)

You should again be getting 5 as the result for each of these statements.  So we know that length() counts every element in a vector, even the ones that are missing.
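
If what you want is the number of non-missing values, is.na() can be combined with sum(). A small sketch:

```r
x1 <- c(1, 2, NA, 4, 5)
length(x1)            # 5: counts every element, missing or not
sum(is.na(x1))        # 1: number of missing values
sum(!is.na(x1))       # 4: number of non-missing values
```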

Computing a mean in R is pretty much as simple as, or simpler than, computing a mean in any other computer program.  Let’s try it out

sum(x)/length(x)  #Just take the sum of all values in x divided by the total number of cases (n)
sum(y)/length(y)

This yields a mean of 3 for x and 8 for y which is the same as simply typing out

mean(x)
mean(y)

Let’s see what happens when we “manually” compute the means for the data with missing values (NAs).

sum(x1)/length(x1)
sum(y1)/length(y1)

We run into trouble here, getting NAs back for both statements.  Since there are missing values in these variables, we have to tell R how to handle them with both the sum() and length() commands.  Let’s omit them with the following

sum(na.omit(x1))/length(na.omit(x1))
sum(na.omit(y1))/length(na.omit(y1))

You should see a mean of 3 for x1 and 7.75 for y1 when we omit missing data.  These examples should sufficiently show the utility of sum() and length() but when we’re interested in computing a mean, using the mean() function is going to be the most efficient.  If we want to omit missing data using mean() we can use

mean(na.omit(x1))
mean(na.omit(y1))

or

mean(x1,na.rm=T)  #this is a logical statement saying remove missing cases equals TRUE or T for short
mean(y1,na.rm=T)

Similarly, computing the median of a variable is easily implemented with

median(x)
median(y)

which should produce 3 and 8, respectively.  Specifying na.rm=T or na.omit() is done in the same way as mean().

We’ll modify our x and y variables here (by creating new objects in R) for the next set of examples

x2<-c(1,2,3,4,5,5)
y2<-c(1,1,2,3,4,5)

Finding the mode can be a little more laborious than the median or mean but is still very doable with the following

names(table(x2))[table(x2)==max(table(x2))]  #names() is a function that will return column names of a matrix or data frame. Tables fall in this category.
names(table(y2))[table(y2)==max(table(y2))]

A two step approach is also possible with

tx<-table(x2)  #This stores the frequency table of x2 values in the object tx which can be referenced later on
names(tx[tx==max(tx)])  #This says to return the name of the value of tx which is equal to the maximum value of tx
ty<-table(y2)
names(ty[ty==max(ty)])

Both of these approaches should have shown 5 as the modal value for x2 and 1 as the modal value for y2.
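
Since R has no built-in function for the statistical mode, the one-liner above can be wrapped in a small function for reuse. This is a sketch; stat_mode is a made-up name, and like the code above it returns the modal value as a character string.

```r
stat_mode <- function(v) {
  tab <- table(v)                   # frequency table; NAs are dropped by default
  names(tab)[tab == max(tab)]       # name(s) of the most frequent value(s)
}
stat_mode(c(1, 2, 3, 4, 5, 5))      # "5"
stat_mode(c(1, 1, 2, 3, 4, 5))      # "1"
stat_mode(c(1, 1, 2, 2, 3))         # "1" "2": ties are all returned
```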


Quick Numerical and Graphical Summaries of Data

Let’s generate a small dataset to work with

x<-rnorm(100)
y<-rnorm(100)+10 #The mean of y will be 10 units higher than x

Let’s correlate these data using a Pearson correlation of .4

r<-.4
y<-x*r+y*sqrt(1-r^2)
g<-c(rep("m",50),rep("f",50))

We can put each of these variables (x,y,g) into a common data frame for usage with:

data<-data.frame(x,y,g)

We can numerically scan and summarize our data quickly with:

summary(data)
head(data)
tail(data)
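
Why does y <- x*r + y*sqrt(1-r^2) give a correlation of r? When x and the noise are independent with variance 1, the new y has variance r^2 + (1 - r^2) = 1 and covariance r with x, so the correlation is r; adding 10 only shifts the mean. With n = 100 the sample correlation will wobble around .4, so the sketch below uses a much larger n (my choice, along with the seed and the name e) to show it settling down.

```r
set.seed(2024)                        # arbitrary seed
n <- 100000                           # large n so the sample correlation settles down
r <- .4
x <- rnorm(n)
e <- rnorm(n)                         # independent noise with the same variance as x
y <- x * r + e * sqrt(1 - r^2)
cor(x, y)                             # close to r
var(y)                                # close to 1
```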

Since there are two continuous variables in these data (x, y), we can generate histograms to examine their distributions with:

hist(data$x,col="cyan")   #filling in the histogram with col="cyan" is optional

which should look like this (though slightly different since you generated your own random data):

And we can look at the distribution of y with:

hist(data$y,col="cyan")

which should look close to this:

Producing simple scatterplots between continuous variables is easily done with the plot() command.

plot(data$y~data$x,cex=1.3,pch=21,bg="cyan")  #cex, pch, and bg are formatting options

or

plot(data$x,data$y,cex=1.3,pch=21,bg="cyan")

Should produce something looking like this:

We can add a best fitting linear regression line to this plot with:

abline(lm(data$y~data$x),lwd=2,lty=2)  #lwd and lty are formatting options

Which should look like this:

We have one categorical variable (g) which we can also examine graphically with the boxplot() command.

boxplot(data$x~data$g,col="cyan")  #col is a formatting option (not necessary)

Should look something like this:

We can also look at the distribution of y for both levels of g using

boxplot(data$y~data$g,col="cyan")

Produces this:


More lme Examples

http://www.maths.anu.edu.au/~johnm/r-book/xtras/mlm-lme.pdf


lmer from Gelman

http://www.stat.columbia.edu/~cook/movabletype/archives/2006/01/fitting_multile.html


lme from University of Michigan

http://www.stat.columbia.edu/~cook/movabletype/archives/2006/01/fitting_multile.html


Interesting ggplot2 Trick

http://www.r-bloggers.com/a-quick-ggplot2-hack-multiple-dataframes/


Some Basic Functions

Here are a couple of functions written from scratch. The basic form is always the same:

x<-function(){ #open the function here
  #body of function
  return() #value, list, or vector to be returned
} #close the function here
#Try out the functions below and try writing one of your own
#Building Functions
#1. Recreating the mean function
mean1<-function(x){
  m<-sum(x)/length(x)
  return(m)
}
x<-1:5
mean1(x)
mean(x)
#2. A function to return z-scores
z.score<-function(x){ #x is a vector of length n
  z<-(x-mean(x))/sd(x) #subtract the mean first, then divide by the sd
  z
}
z.score(x)
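
A z-score function, written with the subtraction grouped before the division, can be checked against R’s built-in scale(), which centers and scales a vector the same way (using the sample standard deviation). A quick sketch:

```r
z.score <- function(x) (x - mean(x)) / sd(x)   # center first, then scale
x <- 1:5
z <- z.score(x)
z
all.equal(z, as.vector(scale(x)))              # TRUE: matches the built-in
c(mean(z), sd(z))                              # mean 0 and sd 1, up to rounding
```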

Quick R’s Take on Power Functions

http://www.statmethods.net/stats/power.html


Chi Square GOF Test

The following chunk of code will create a function to conduct a chi-square goodness of fit test in R.

#Professor Example - Goodness of fit
#Do the observed favorability proportions depart markedly from their
#expected value?
enroll<-c(32,25,10)
expected<-c(22.3,22.3,22.3)
rbind(enroll,expected)
chi<-sum(((enroll-expected)^2)/expected)
#Here is a small function to use for GOF in R
chi.GOF<-function(o,e){
k<-length(o)-1
chi<-sum((o-e)^2/e)
p.chi<-1-pchisq(chi,k)
return(matrix(c(chi,p.chi),
nrow=1,ncol=2,dimnames=list("",
c("Chi-Square","p"))))
}
chi.GOF(enroll,expected) #enroll=o(observed), expected=e(expected)
#Smoking Example - Test of Association
smoke<-matrix(c(29,16,55,198,107,181),byrow=T,nrow=2,
dimnames=list(c("Smokers","Nonsmokers"),c("1Cycle","2Cycle","3Cycle")))
chisq.test(smoke)
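
The hand-rolled goodness of fit statistic can be checked against R’s built-in chisq.test(), which runs the same test when given a vector of counts and expected proportions. In this sketch the expected counts are the exact equal split (67/3 each) rather than the rounded 22.3 used above, which is my assumption about where 22.3 came from.

```r
enroll <- c(32, 25, 10)
expected <- rep(sum(enroll) / 3, 3)            # equal split across the three groups
chi <- sum((enroll - expected)^2 / expected)   # statistic by hand
p <- 1 - pchisq(chi, df = length(enroll) - 1)  # upper-tail p-value, k - 1 df

builtin <- chisq.test(enroll, p = rep(1/3, 3))
all.equal(chi, unname(builtin$statistic))      # TRUE
all.equal(p, builtin$p.value)                  # TRUE
```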

Quick R’s Take on ANOVA (aov)

http://www.statmethods.net/stats/anova.html

I also posted some other notes on BB for ANOVA.


Rattle

http://cran.r-project.org/web/packages/rattle/index.html


Haven’t found any great solutions to the missing data issue in SPSS files…

I came across this forum thread (http://r.789695.n4.nabble.com/R-for-Windows-GUI-closes-when-I-try-to-read-spss-td862748.html), and it arrives at a solution we had considered: exporting the data to a .csv and then reading it into R. But that isn’t helpful if you have a lot of variables to recode and multiple values to set to missing.

If anyone has come across a great way to handle these missing data when using read.spss(), let me know. use.missings=T doesn’t seem to do it for me.
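
One possible workaround, sketched here on a toy data frame rather than a real SPSS file: if the missing-value codes are known in advance, they can be recoded to NA across every column after the import. The data frame survey and the codes -9 and 99 are made up for illustration, and the sketch assumes the same codes mean "missing" in every column, which will not always be true.

```r
survey <- data.frame(age    = c(34, -9, 51, 99),
                     income = c(-9, 42000, 99, 58000))
missing_codes <- c(-9, 99)                     # hypothetical user-defined missing codes

# replace the codes with NA, one column at a time
survey[] <- lapply(survey, function(col) {
  col[col %in% missing_codes] <- NA
  col
})
survey
colSums(is.na(survey))                         # missing values per column
```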


R cheat sheet

Josh sent this link (http://www.personality-project.org/R/r.commands.html) along which contains a cheat sheet of common R commands. I like it and if anyone else has found some, send them along. Paolo’s blog (“One R Tip A Day”) might have something like this, too.

Posted in Getting Started | Leave a comment

dchisq and dnorm Plots

I tried this for plotting out some chi-square distributions with various df and it seemed to work out:

s<-seq(0,25,.01)
plot(s,dchisq(s,2),type="l",lwd=2,col="dodgerblue4")
lines(s,dchisq(s,5),type="l",lwd=2,col="green1")
lines(s,dchisq(s,10),type="l",lwd=2,col="orange1")

This should look something like this:

Here’s something similar for a normal distribution (notice that I use slightly different code for plotting the curves):

n=15
curve(dnorm(x,mean=0,sd=1/sqrt(n)),-5,5,xlab="x",
ylab="Density",lwd=4,col="dodgerblue4")
n=5
curve(dnorm(x,mean=0,sd=1/sqrt(n)),-5,5,xlab="x",
ylab="Density",lwd=4,col="green1",add=T)
n=1
curve(dnorm(x,mean=0,sd=1/sqrt(n)),-5,5,xlab="x",
ylab="Density",lwd=4,col="orange1",add=T)

This should end up looking like this:


The R Journal

I just found this (http://journal.r-project.org/) and found some great packages in the (few) issues that the group has published. It’s definitely worth checking out.


Simple Simulation for Helping to Learn R

I have found it useful to create quick and simple vectors and data frames within R since I know that much of the real world data I might be working with may have certain peculiarities that sometimes impede the learning process for me.  Here are some of the common functions I use:

Note: placing a ? before R functions in the console will bring up their help page, which for the following functions will be useful to know.

rnorm()

This function draws a sample of size n from a normal distribution with a specified mean (mu) and standard deviation (sd); if not specified, these are set to 0 and 1, respectively.  Here’s an example of using this function to create a vector:

x<-rnorm(1000,25,12)

This is a vector of length 1000 drawn with mu = 25 and sd = 12

rbinom()

This function will draw a sample of size n from the binomial distribution with a specified number of trials (n1) and probability of success (p).  This is often useful for simulating categorical response variables.  Here’s an example:

y<-rbinom(100,1,.5)

I’ve simulated 100 cases from a binomial distribution with n1=1 trial and a .5 probability of a success (1).
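
Because these functions draw random numbers, the values change every time the code is run. Setting a seed first makes a simulation repeatable. A small sketch (the seed 2010 is arbitrary):

```r
set.seed(2010)                 # any integer works
a <- rnorm(5, 25, 12)
set.seed(2010)
b <- rnorm(5, 25, 12)
identical(a, b)                # TRUE: same seed, same draws

x <- rnorm(1000, 25, 12)
c(mean(x), sd(x))              # near 25 and 12, but not exactly
```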

These two do a lot for me but there are many, many others that may be more useful to you in your work, such as rchisq(), runif(), rt(), rgamma(), rhyper(), rgeom(), and rpois().

Putting the simulated vectors into a data frame is relatively simple:

data<-data.frame(x,y)
attach(data)

Now it’s ready to use.


edit() and fix()

So I’ve discovered that the fix() command has been more helpful to me when I want to edit a data frame and potentially multiple vectors, while edit() looks like it is best used with single vectors.  So far I haven’t found anything that edit() can do and fix() can’t.  So, I’m a fix() fan for editing vectors and data frames in R.  Example:

data<-read.csv("data.csv")
fix(data)

Getting SPSS and Stata Data Files Into R

I use SPSS and Stata quite a bit and want to know how to get .sav and .dta files into R.

Getting generic data files into R (e.g. .csv & .txt) is fairly simple and handled by the packages that ship with R.  Getting these pre-formatted data files into R takes a bit more work (primarily the use of an additional package).

The foreign package is apparently the key to doing this.  Packages are compact and efficient code repositories that the R community has generated and maintained over the years.  I think we’re going to find these quite useful throughout the semester.

So back to the task – getting an SPSS file into R.  There are some working SPSS files in Blackboard under “Course Documents –> R –> R Data”.  Feel free to use your own or download the course data files.  Either way, we either 1) have to know where this data is stored once we download it (or where your data is)  or 2) have R shortcuts mapped to the files where this data is stored (see previous post).

If you haven’t created shortcuts mapped to your working directory you can always set your working directory once you’re in R:

setwd("C:/mydata")

This is a generic folder and path name that you may use but you may also have something more elaborate like:

setwd("C:/Documents and Settings/Rwilliams/My Documents/R Data")

If you don’t set your working directory, you can still get data into R, but you’ll have to enter the entire path name each time (e.g. “C:/Documents and Settings/Rwilliams/My Documents/R Data/data.sav”).

You can also check the contents of your working directory with:

dir()

If you see your data in there, then you’re ready to start calling it.

The following code should get the SPSS data file into R:

library(foreign)

This calls the foreign package from your R library so we can read in foreign datasets

data<-read.spss("NELS88_student.sav",use.value.labels=T, max.value.labels=Inf, to.data.frame=T)

Since SPSS files often have labels that represent categorical variables we tell R to treat these labels as such (may not always be appropriate).  We also don’t put a limit on factor levels by specifying “Inf”.  Finally, we tell R that this is to become a data frame.

Your data should now be stored in the R object “data”.  Depending on the size of the data you just imported, it may not be useful to take a look at all of the data at one time.   A couple of things I typically do to make sure my data made it in correctly are below.

summary(data)

This gives a summary of each vector in the data frame.  Sometimes this can be cumbersome but if you know what you’re looking for it can help.

head(data)

This provides the first 6 rows of the data frame.

tail(data)

This provides the last six rows of the data.

nrow(data)

This lets us know how many rows are in the data

ncol(data)

This lets us know how many columns are in the data
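
These checks can be bundled into one small helper so they run in a single call after every import. This is only a sketch: check_import is a made-up name, and the built-in mtcars data frame stands in for an imported file.

```r
check_import <- function(d) {                  # hypothetical helper
  stopifnot(is.data.frame(d))                  # fail early if the import did not give a data frame
  list(rows      = nrow(d),
       cols      = ncol(d),
       classes   = sapply(d, class),           # how each column was read
       n_missing = colSums(is.na(d)))          # NAs per column
}
res <- check_import(mtcars)                    # mtcars ships with R
res$rows
res$cols
```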

Datasets from Stata can easily be imported using the following command:

library(foreign)
data<-read.dta("data.dta")

Importing Data (.csv and .txt)

The first task in getting data into R is finding where your default working directory is.  Start R and type the following command:

dir()

This will show the files in your current working directory.  If you installed R using default settings, you’re probably looking at your “My Documents” folder if you’re using Windows and maybe your “Home” folder if you’re using Ubuntu Linux or a Mac.  Regardless of where your current working directory is, you can easily move it wherever you’d like using a simple command and a path name.  You might want to consider the following:  Do you want to store all of your R data and activity in the same location or will you have multiple folders with different data associated with different projects?  If you’re like me and you have multiple projects going, you may consider making R shortcuts that map directly to certain project folders.  Here’s how:

1. Create a project folder (say “Project X”)

2. Copy an R icon and paste it inside the Project X folder (or wherever you want it to be)

3. Copy the path name of the Project X folder (maybe C:\Project X)

4. Right click on the newly created R shortcut icon. In the “Start In:” field, paste the Project X path name inside the double quotation marks

5. Click Apply.  Click OK.  Now you’re all set.  Every time you start R from that shortcut icon, it will automatically look to the Project X folder when pulling data in (which beats typing out long path names over and over again during an R session!).

Another option is to start R and change the working directory to map to a folder of your choice during your session using setwd().  This command is simple and only requires you to fill in the path name of the folder you’re interested in working with (i.e. the folder that has the data you want to use).  Say we launched R without having mapped our working directory like we did above and “My Documents” was the working directory.  We have multiple files in “C:/Project X” that we’d like to use.  Instead of having to type out the path name every time we import a data file from Project X, we will set the working directory in this session (not permanently) to the Project X folder.

setwd("C:/Project X")

You can follow this up by

dir()

to make sure you see all the files you expected to see (not essential).
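
If you only want to switch folders temporarily, it helps to know that setwd() hands back the folder you just left, so you can return to it later. A small sketch, using tempdir() as a stand-in for a project folder:

```r
start <- getwd()            # where R is looking right now
old <- setwd(tempdir())     # move; setwd() invisibly returns the previous directory
getwd()                     # now the temporary folder
dir()                       # files in the new working directory
setwd(old)                  # go back to where we started
```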

If you don’t like either of these two options, you’ll need to type out the full path name to each of your data files each time you pull them in. For example,

data1<-read.csv("C:/Project X/data1.csv")
data2<-read.csv("C:/Project X/data2.csv")

I typically use comma-delimited files with R and will go over those in this post. Comma-separated files are generally Excel .csv files for me but can also be comma-separated text files. In their raw form they end up looking something like this:

variable1,variable2,variable3
12,7,18
0,4,9

If you copy and paste these three lines into a text editor and save them as test_file.txt, you can read the file into R with the following command:

data<-read.table("test_file.txt",header=T,sep=",")

We are creating an R object called “data”. When we call the object we should see the same values as above.

data
  variable1 variable2 variable3
1        12         7        18
2         0         4         9
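
For a quick test you do not even need to save a file: read.csv() and read.table() accept the raw text directly through their text argument. A small sketch (the object name raw is just for illustration):

```r
raw <- "variable1,variable2,variable3
12,7,18
0,4,9"
data <- read.csv(text = raw)   # same result as reading the saved file
data
str(data)                      # three integer columns, two rows
```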

The 1 and 2 in the first column are the row numbers in R.  I don’t typically keep my data in .txt files but rather work in spreadsheets like Excel that keep my data in an easy-to-see format. Consequently, I use .csv files most often. These files can be created in Excel very easily by going to “Save As” -> “Save as Type” -> select “CSV (comma delimited)”. So it’s no problem to get any of the data you already have into a comma delimited format for R. Try opening up Excel (or OpenOffice), typing in those same values I just used for the .txt file, and saving it as a .csv file in your working directory. I’ve called the file “test.csv”. The following command fetches this data and brings it into R for analysis:

data<-read.csv("test.csv")
data
  variable1 variable2 variable3
1        12         7        18
2         0         4         9

The command for bringing in .csv files is a little more efficient and that’s why I like it. If you have other methods for getting these kinds of files in that are even more efficient, please let me know. I will talk about getting SPSS, STATA, and SAS data into R shortly (at least the way I’ve done it) and also how to get data OUT of R into a preferred format.


Installing R

Installing R has always been fairly straightforward for me.  Just visit http://cran.r-project.org/ to install R on Windows, Mac, or Linux operating systems. Once it is installed, I like to create at least a couple of shortcuts (maybe a desktop and quick-launch icon). It is also useful to create an R data folder.  Maybe you already have a folder or folders where you save all your data files – that’s fine.

TIP: One thing that can be frustrating is telling R where your data folder (also known as a “working directory”) is every time you run R.  So, one thing I’ve found helpful is linking my shortcuts to specific data folders (I only know how to do this in Windows, so Linux and Mac users please chime in):

  1. Copy the file location of your working directory
  2. Right click on an R shortcut you’ve created (maybe your desktop shortcut)
  3. Paste the file location (e.g. “C:/Users/Me/My Documents/My R Folder”) into the “Start In” field
  4. Select “Apply” and close that window

Now every time you open R from that specific shortcut, R knows where to begin looking for your data when you start feeding it files; you don’t have to map out the location of your data every time.

Posted in Getting Started | Leave a comment

Welcome

Welcome to the Experimenting with R blog.
