Month: January 2014

R365: Day 4 – poibin AND pnn!

Today I did field work for most of the day and could not get to coding until the evening. I searched through the comprehensive list of R packages and randomly picked out a package, "poibin", that sounded interesting. The problem (and wonderful thing!) about just floating through the list of packages is that it is easy to spot other packages that sound like they might be interesting. Some packages have obtuse names, but the descriptions are usually good enough to help you figure out what they do. Because I was naturally looking to see what I had missed, I noticed a package called "pnn" that also looked cool, so today's post is a Daily Double!
poibin is a package that helps you calculate exact values for the Poisson binomial distribution. For a second, before I read the vignette for the package, I thought it calculated BOTH the Poisson and the binomial, but it actually calculates the Poisson binomial, a distribution I am fairly unfamiliar with. As a refresher: Poisson distributions model counts of events, and approximate the binomial when there are many trials and successes are rare. Binomial distributions are useful for binary data, like presence/absence data, but they assume every trial has the same probability of success. The Poisson binomial distribution (I learned from Wikipedia) is the generalized form of the binomial distribution in which the trial probabilities are NOT assumed to be equal to one another. I have been meaning to learn what the different distributions are and what they best represent for a long time, and this was a perfect opportunity.
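To convince myself of that "generalized binomial" idea, here is a quick sanity check. This is a minimal sketch based on my reading of the vignette (I am assuming dpoibin() takes kk for the counts and pp for the per-trial probabilities): with all probabilities equal, the Poisson binomial should collapse back to the plain binomial.

library(poibin)
#five trials, all with the same success probability 0.3
dpoibin(kk = 0:5, pp = rep(0.3, 5))
#the plain binomial gives the same answers when the probabilities are equal
dbinom(0:5, size = 5, prob = 0.3)
#five trials with unequal probabilities: this is where poibin earns its keep
dpoibin(kk = 0:5, pp = c(0.1, 0.2, 0.3, 0.6, 0.9))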
poibin has one main group of functions that calculates the probabilities. Like the other distributions I have seen in the past (normal and uniform), you can adjust the functions to suit your purpose: the d, p, q, and r prefixes give you density, cumulative probability, quantiles, and random draws. Mostly I like being able to generate a random series of numbers from a distribution, which in this case is done with rpoibin(). All of the examples in the short vignette for the package use a weight of 2 for every case. Removing the weights caused the resulting numbers to go down; as far as I can tell, the weights replicate each trial, so a weight of 2 doubles the number of trials, which is why it functionally increases the value of the resulting random numbers. I am not sure if I will use poibin; I think it is often more useful to use the specific binomial or beta-binomial distributions and leave the generalized formula for other things. Still, it was really cool to find out more about distributions and to test drive a new package.
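Here is roughly what the weights question looks like in code (again a sketch based on my reading of the vignette; the rpoibin(m, pp, wts) signature and the weights-as-replication reading are my assumptions):

library(poibin)
pp <- c(0.2, 0.5, 0.8)
#ten random draws from the Poisson binomial with three unequal trials
rpoibin(m = 10, pp = pp)
#the same but with every trial weighted 2, i.e. six trials in total,
#so the draws come out roughly twice as large
rpoibin(m = 10, pp = pp, wts = rep(2, length(pp)))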
The pnn package was less straightforward, but I think more rewarding, than the poibin package. pnn helps create and solve probabilistic neural networks. Neural networks mimic human learning by taking a set of data, analyzing it, and deciding whether new data falls into one of the categories described in the original set. They are used widely in machine learning, and research on neural networks has been going strong since at least the 1970s. pnn has several functions for building and adapting neural networks from sets of data. Creating a functioning neural network comes in four steps (sketched in code after the list):
1. Learn – given a set of data, create or update a neural network
2. Smooth – set a smoothing parameter, which defines how closely you want your neural network to follow the training data. Set it very tight and the network begins to treat random error and noise in the training data as valid signal; set it too loose and the network smooths over patterns it should be learning from.
3. Performance – measure how well your neural network matches against a known set of data, giving you an exact success rate for accurate calls.
4. Guess – guess the category of a new data point or vector, giving a probability for each possible option and choosing the category with the highest probability.
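Here is a compact sketch of those four steps, based on my reading of the pnn documentation (the learn/smooth/perf/guess functions and the bundled norms dataset; the exact argument names and the sigma value are my assumptions, so check the vignette before copying):

library(pnn)
data(norms)                    #example data: two categories of 2-D points
nn <- learn(norms)             #1. Learn: build the network from the data
nn <- smooth(nn, sigma = 0.8)  #2. Smooth: set the smoothing parameter
nn <- perf(nn)                 #3. Performance: test against known data
nn$success_rate                #proportion of correct calls
guess(nn, c(1, 1))             #4. Guess: classify a new point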
I really liked the documentation and the examples in the package vignette; they were very clear, and the code worked every time. I like that the authors included the line library(pnn) at the top of every example so that there wouldn't be weird issues with loading the package before using it, something I always forget to do.
Given the relatively short amount of time that I devote to each of these posts (~1-2 hours), I generally stay away from larger packages, which might have many, many functions and would be difficult to cover with any sort of depth or quality. Even testing out all of the functions might be relatively difficult for some of the larger packages. To compensate, I will be splitting up big packages in the future so that I can cover them in more depth. I am very interested in looking at the ggplot2 and VGAM packages, and I might devote a whole week to each of them; we'll see.


R365: Day 3 – Graphics pt. 1 – Scatterplots

I was grading exams today and wanted to see whether I graded any harder or easier than my fellow TA, so I set up the gradebook so that exams she graded were marked with one number and exams I graded were marked with another. I set out to plot this in Excel and was shocked that I could not. I have not felt that embarrassed in a long while. After looking through a couple of tutorials online, I found a StackOverflow article that went through a stupidly complicated version of the solution (here: http://stackoverflow.com/questions/15124103/excel-how-can-i-make-a-scatter-plot-which-colors-by-a-third-column). Their solution basically involved splitting the original data series into three separate columns, which is fine when you have like 8 data points, but gets really inefficient really quickly. Troubled by the lack of help in the Excel world, I searched through some tutorials for R and quickly found a solution with my old frienemy, ifelse(). ifelse() is a useful logical function that starts out really nice until you start nesting calls. Excel is nice because it color-codes all of the nested layers, but the basic R console does not color anything. Moving to RStudio (http://www.rstudio.com/) was amazingly helpful for problems like this, and you could also use a text editor like Tinn-R.
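To show what I mean about nesting (a made-up example; the grader codes here are hypothetical, not my actual gradebook):

#one level of ifelse() is perfectly readable...
grader <- c(0, 1, 2, 1, 0)
ifelse(grader == 0, "me", "her")
#...but each nested level gets harder to follow without color-coding
ifelse(grader == 0, "me",
       ifelse(grader == 1, "her", "someone else"))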

Armed with the knowledge that others had succeeded in plotting different colors for a scatter plot, I went about looking for code I could adapt to my needs. Some people view coding as massive and furious amounts of typing followed by a quick solution. I think this makes coding daunting for people, because they don’t feel like instant computer whizzes right off the bat. A better view of coding is looking around for similar problems, taking the code and adapting it to your purpose, and building your perfect code out of a patchwork quilt of loosely modified or blatantly stolen code from other people. I think that once people realize that coding is very approachable, more people will get into it, which will undoubtedly result in more code that I can steal…er…modify.

This StackOverflow exchange (http://stackoverflow.com/questions/17551193/r-color-scatter-plot-points-based-on-values) dealt with a similar problem of coloring a scatter plot based on a different column. After working on the problem for a few minutes, I realized that I am really bad at importing data into R, so one of my future posts will focus on how to import data. After staring at the data a little while longer, I realized that I would need to modify the code a bit more, because the person in the example used their Y-value data to decide what color things should be, whereas I have a separate column (filled with me vs. her) that I need to compare against. My first thought was to turn the dataset into a matrix using the function rbind(). This did not seem to help any, and I noticed that a lot of the code used 'data frames', which I was unfamiliar with. After checking what a data frame is (?data.frame), I found that they are essentially lists of equal-length columns that you can treat like a matrix, except the columns can hold different types. The data frames from the examples also had an x index column, so I added one.
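The difference is easier to see in a toy example (made-up numbers, not my gradebook):

#rbind() builds a matrix, and a matrix forces everything to one type,
#so the grader labels turn the grades into strings
rbind(grades = c(15, 27, 35), grader = c("me", "her", "me"))
#a data frame keeps numeric grades and character labels side by side
df <- data.frame(x = 1:3, y = c(15, 27, 35), z = c("me", "her", "me"))
str(df)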

After much effort and a LOT of help from this discussion (http://stackoverflow.com/questions/7466023/how-to-give-color-to-each-class-in-scatter-plot-in-r?rq=1), I was able to make this graph:

[Figure: scatter plot of the 197 exam grades, colored by grader]

I will work on making a proper example set to show how I did it, but for now you can use:

##code for making scatterplots with different colors for different treatments

##making the dataset

#the length of the dataset will be 197

Length=1:197

#enter the y variable

Grades=c(15,27,35,25,21,27,13,19,18,22,27,27,23,29,26,26,23,24,23,31,27,28,24,18,25,23,18,26,29,15,22,2,4,17,24,21,23,19,18,23,14,11,22,8,19,26,22,15,18,23,19,29,19,20,29,23,13,22,21,29,25,22,18,25,26,96,26,17,29,21,18,20,27,20,24,20,60,21,23,24,24,24,22,28,17,27,24,23,22,23,70,22,25,26,8,15,61,21,21,14,17,25,23,9,28,24,6,25,26,25,15,13,21,23,20,18,17,22,92,15,5,26,9,26,16,40,22,25,17,27,16,18,29,28,25,63,19,27,26,22,9,19,28,18,6,24,28,4,20,15,4,26,42,5,11,21,25,20,25,23,14,26,29,22,23,24,27,14,27,10,24,26,6,22,13,22,16,18,21,99,22,22,46,27,27,24,98,19,23,49,23,6,23,15,24,22,15)

##Make a set of "treatments"

Grader=c(1,1,0,0,1,0,1,0,1,1,1,1,0,1,1,0,1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0,1,0,1,1,0,0,1,0,1,1,1,1,0,1,1,0,1,1,1,1,1,1,0,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,1,0,1,1,1,0,1,0,0,1,1,1,1,1,0,1,0,1,1,0,1,1,1,1,0,1,1,0,1,0,1,0,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,1,1,1,0,0,0,0,0,1,1,0,1,1,1,1,1,1,0,0,1,0,1,1,0,0,1,0,1,1,1,1,1,1,0,1,0,0,0,0,1,1,1,0,0,0,1,0,1,0)

##Treatments (how you're going to split up the scatter plot) need to be characters, which data.frame() will turn into a factor that can index the color vector

## This turns every 1 in Grader into "a" and every 0 into "b"

Z=ifelse(Grader>0,letters[1],letters[2])

data=data.frame(x=Length,y=Grades,z=Z)

#attach your data

attach(data)

#plot; the concatenated color vector is indexed by the factor z, so "a" points are red and "b" points are blue

plot(x,y,col=c("red","blue")[z])

R365: Day 2 – sperich

For my second post, I went back to the long list of R packages and flipped through the pages with my eyes closed and picked out … sperich! I swear I will try to have a more formal way to pick a random package out next time. In fact, I might use tomorrow as a non-random day to look at what kind of sampling packages R has available.

sperich is a package designed to help estimate centers of species biodiversity and species ranges based on occurrence data. The package has been out since about 2012, but while I was running it I ran across a couple of functions that would not work without installing new packages; normally those come shipped in with the dependencies. One interesting function that was included as part of the example set was image(), which lives in the {graphics} package and creates a colored grid. The examples used the function add.Edges() to illustrate how to draw borders between two points on a grid.

While working through the examples for this package, I realized that most of the outputs from the examples were matrices, but I was (naively) hoping for something along the lines of a heat map. The matrices would be pretty easy to translate into a heat map, but I was kind of hoping the package would do it for me (I'm lazy). I think this package would work well for its intended purpose (extrapolating the edges of species boundaries), and it would be amazing if I knew a bit more about how to plot stuff and make it look pretty. Tomorrow I think I will look over the {graphics} package a little more closely; learning how to properly graph things would be very useful.
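For future me, here is the kind of translation I had in mind (a minimal sketch with a made-up matrix standing in for a sperich output):

#a stand-in matrix of occurrence/diversity scores
m <- matrix(runif(100), nrow = 10)
#graphics::image() draws the colored grid the sperich examples use
image(m)
#stats::heatmap() is the quick-and-dirty heat map version
heatmap(m, Rowv = NA, Colv = NA)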

R365: Day 1 – likert
My home laptop runs Linux and does not display all of the available packages (?). I am not sure if this means the other packages are not supported in Linux or if I am just currently unable to view them. The list of packages I can see consists of about 30 of the most powerful and useful packages (from what I can judge), rather than the 5,000+ that R supports, presumably on Windows and/or Mac.
That being said, I scrolled down the list, closed my eyes, and picked the first package: mgcv. mgcv? Turns out it stands for "Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation", and it took me a minute to translate the acronyms, long enough to go "aaaaaahhhh" and rethink this whole blog. OR cheat and find a new package!
After a brief search, it turns out you can install most packages (if you have root access in Linux) by just using install.packages("[package name]"). This article from r-bloggers helped a lot (http://www.r-bloggers.com/installing-r-packages/). With this in hand, I searched through the comprehensive list of R packages and found the package "likert", which helps visualize Likert-type items. What in the world are Likert items? After some quick Wikipedia-ing (a word which should be a verb, so I will treat it as such, suck it grammar nazis), I found out that I was already very familiar with the concepts behind Likert scales. If you have ever done a survey where the format is a statement ("I like rabbits") followed by five options (strongly agree to strongly disagree), then you too have seen Likert items. Rensis Likert (pronounced "LICK-urt") was a psychologist from Michigan who worked on management theory. He developed his scale system as part of his PhD thesis in 1932 (always mind-boggling and a little depressing to find groundbreaking PhD theses). He found that the scales allowed for more exact results with fewer questions. A Likert item is an individual statement and rating, whereas a Likert scale is the sum of all of the ratings and is used to compare across different groups.
So what does the likert package do? One of the major functions in the package is the function likert (go figure), which tabulates and summarizes sets of Likert items. Their example:
    library(likert)
    data(pisaitems)
    items29 <- pisaitems[, substr(names(pisaitems), 1, 5) == "ST25Q"]
    names(items29) <- c("Magazines", "Comic books", "Fiction",
                        "Non-fiction books", "Newspapers")
    l29 <- likert(items29)
    summary(l29)
    plot(l29)

plots responses to a survey about how often people read books and magazines. The summary statistics from likert describe the percentages of people in the different response groups and the overall means (out of 5) for the different Likert items. The package allows Likert items to be visualized as bar plots, heat maps, and histograms. The package is fairly new (October 2013) and seems to still be a work in progress, but it would probably be a good tool for people who perform surveys and like data.
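For the curious, the other visualizations look something like this (hedged: the type values here are from my skim of the likert docs and may change as the package evolves):

    #continuing from the example above
    plot(l29, type = "bar")      #the default centered bar plot
    plot(l29, type = "heat")     #heat-map style summary of the same items
    plot(l29, type = "density")  #density curves per item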