R365 – Day 3 – Graphics pt. 1 – Scatterplots
I was grading exams today and wanted to see whether I graded any harder or easier than my fellow TA. So I set up the gradebook so that exams she graded were marked with one number and exams I graded were marked with another. I set out to plot this in Excel, coloring each point by grader, and was shocked that I could not. I have not felt that embarrassed in a long while. After looking through a couple of tutorials online, I found a StackOverflow question that went through some stupidly complicated version of the solution (here: http://stackoverflow.com/questions/15124103/excel-how-can-i-make-a-scatter-plot-which-colors-by-a-third-column). Their solution basically involved splitting the original data series into three separate columns, which is fine when you have like 8 data points, but gets really inefficient really quickly.
Troubled by the lack of help in the Excel world, I searched through some tutorials for R and quickly found a solution with my old frenemy, ifelse(). ifelse() is a logical function that is really nice and useful until you start nesting calls inside one another. Excel is nice because it color-codes all of the nested layers, but my basic R console does not color anything. Moving to RStudio (http://www.rstudio.com/) was amazingly helpful for problems like this, and you could also use a text editor like Tinn-R.
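For anyone who has not met ifelse() yet, here is a minimal sketch of what I mean by nesting. The scores and cutoffs are made up for illustration and are not from my gradebook.

```r
# Hypothetical exam scores, invented for this example
scores <- c(55, 72, 88, 91, 64)

# A single ifelse(): test, value if TRUE, value if FALSE
pass <- ifelse(scores >= 60, "pass", "fail")

# Nested ifelse(): every extra level adds another set of parentheses to track
grade <- ifelse(scores >= 90, "A",
         ifelse(scores >= 80, "B",
         ifelse(scores >= 70, "C", "other")))

# cut() is a base R alternative that gives the same labels without nesting
grade2 <- as.character(cut(scores,
                           breaks = c(-Inf, 70, 80, 90, Inf),
                           labels = c("other", "C", "B", "A"),
                           right = FALSE))
```

With only two groups (me vs. her) a single ifelse() is plenty; the nesting only starts to hurt once there are three or more categories.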
Armed with the knowledge that others had succeeded in plotting different colors for a scatter plot, I went about looking for code I could adapt to my needs. Some people view coding as massive and furious amounts of typing followed by a quick solution. I think this makes coding daunting for people, because they don’t feel like instant computer whizzes right off the bat. A better view of coding is looking around for similar problems, taking the code and adapting it to your purpose, and building your perfect code out of a patchwork quilt of loosely modified or blatantly stolen code from other people. I think that once people realize that coding is very approachable, more people will get into it, which will undoubtedly result in more code that I can steal…er…modify.
This StackOverflow exchange (http://stackoverflow.com/questions/17551193/r-color-scatter-plot-points-based-on-values) had a similar problem of coloring a scatter plot based on a different column. After working on the problem for a few minutes, I realized that I am really bad at importing data into R, so one of my future posts will focus on how to import data. After staring at the data a little while longer, I realized that I would need to modify the code a bit more, because the person in the example used their y-value data to decide what color each point should be. I have a separate column (filled with me vs. her) that I need to compare against. My first thought was to turn the dataset into a matrix using the function rbind(). This did not seem to help any, and I noticed that a lot of the code examples were using 'data frames', which I was unfamiliar with. After checking what a data frame is (?data.frame), I found that it is basically a list of equal-length columns that you can index like a matrix, and that each column can hold its own type of data. The data frames in the examples also had an x column that simply counted the observations, so I added one.
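Looking back, my guess at why rbind() did not help is that a matrix can only hold one type of data. Here is a small sketch of the difference, using made-up scores and grader labels (the names score, grader, and exams are just for illustration):

```r
# Made-up scores and grader labels, just for illustration
score  <- c(78, 85, 92, 67)
grader <- c("me", "her", "me", "her")

# rbind() builds a matrix, and a matrix holds one type only,
# so the numeric scores are converted to text
m <- rbind(score, grader)
class(m[1, 1])

# A data frame keeps each column's own type,
# and x is the simple count column mentioned above
exams <- data.frame(x = seq_along(score), score = score, grader = grader)
str(exams)
```

The data frame is the friendlier shape here because the scores stay numeric for plotting while the grader column stays available for picking colors.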
After much effort and a LOT of help from this discussion (http://stackoverflow.com/questions/7466023/how-to-give-color-to-each-class-in-scatter-plot-in-r?rq=1), I was able to make this graph.
I will work on making an appropriate example set for how I did it, but for now you can use:
##code for making scatterplots with different colors for different treatments
##making the dataset
#the length of the dataset will be 197
#enter the y variable
##Make a set of “treatments”
##Treatments (how you’re going to split up the scatter plot) need to be characters for some reason
## This turns any number above 0 into a, and everything else into b
#attach your data
#plot, colors need to be concatenated for some reason
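Since the steps above are easier to follow with something you can run, here is a minimal sketch of the same idea. It is not my exact code: the y values are simulated with rnorm() as a stand-in for the gradebook, the names dat, cols, and point_cols and the two colors are my own choices for this example, and I refer to columns with dat$ rather than attach().

```r
# Simulated stand-in for the real gradebook (assumption: the real data is not used here)
set.seed(1)
n <- 197                  # same length as the dataset described above
x <- seq_len(n)           # simple count for the x axis
y <- rnorm(n)             # placeholder y values

# Any y value above 0 becomes "a", everything else becomes "b"
treatment <- ifelse(y > 0, "a", "b")

dat <- data.frame(x = x, y = y, treatment = treatment)

# c() combines the colors into one named vector; indexing that vector
# by treatment returns one color per point
cols <- c(a = "blue", b = "red")
point_cols <- unname(cols[as.character(dat$treatment)])

plot(dat$x, dat$y, col = point_cols, pch = 19, xlab = "x", ylab = "y")
legend("topright", legend = names(cols), col = cols, pch = 19)
```

As far as I can tell, the colors have to go through c() because plot() expects col to be a single vector, and looking colors up by name works with character labels like "a" and "b", which would explain both of the "for some reason" comments above.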