Month: February 2014

R365: Day 23 – Neural Networks

I’ve wanted to explore neural networks for a while and today seems like a pretty good day to do it! Rainy, boring, and with a pint of pale ale, anything in R is entertaining. Artificial neural networks (ANNs) are a general class of analyses for finding connections between datasets. The math behind neural networks can be a bit over my head, but the gist is that given a set of inputs, a specified number of “layers” in the model, and a (typically) non-linear activation function, you arrive at an output. This underlying process mimics how our own minds process data and come to (sometimes erroneous) causal relationships. Because the artificial neural network is fairly naïve to what sorts of inputs are being used, you can end up with some cock-eyed results if you don’t screen them yourself. Think of when you were a kid and some technology “magically” performed a task and your mind came to the simplest (and often most fanciful) conclusions. This screening may be especially relevant in biology, where lots of different factors, both biotic and abiotic, can impact growth of a single species, but some interactions do not make sense (eg – the number of grasshoppers does not meaningfully impact precipitation). For a really good / not completely dumbed-down visualization of ANNs, this page was pretty useful even for someone with little to no formal mathematical training.

I first heard of ANNs about 2 years ago when I was analyzing a set of data looking for a causal relationship between temperature and growth of a powdery mildew fungus. I had run a factorial experiment, and my committee member (and current advisor) Neil McRoberts suggested that I throw all of the data into an ANN and see what happens. I ended up going in another direction with the analysis, but I have always been interested. I recently re-read a review article in the journal Phytopathology about ANNs and wanted to re-visit them.
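To make that hand-wavy description a little more concrete, here is a toy sketch of my own (not from any package) of what a single artificial neuron does: take a weighted sum of its inputs and push it through a non-linear activation function.

#A toy neuron: a weighted sum of made-up inputs through a logistic activation
inputs <- c(0.5, 1.2, -0.3)   #hypothetical input values
weights <- c(0.8, -0.4, 1.1)  #hypothetical connection weights
bias <- 0.1
logistic <- function(z) 1 / (1 + exp(-z))
logistic(sum(inputs * weights) + bias) #the neuron's output

Stack enough of these together in layers and you get a network that can approximate surprisingly complicated functions.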

This article gives a lucid and well-annotated example of how to use {neuralnet} to calculate square roots of an input.

I changed up the script a little bit so that it computes cubes instead, but really, props to the original coder for such nice annotation.

#Going to create a neural network to perform cubing
#Type ?neuralnet for more information on the neuralnet library
library(neuralnet)

#Generate 50 random numbers uniformly distributed between 0 and 100
#And store them as a dataframe (runif, not rnorm, for a uniform distribution)
traininginput <- as.data.frame(runif(50, min=0, max=100))
trainingoutput <- (traininginput)^3

#Column bind the data into one variable
trainingdata <- cbind(traininginput,trainingoutput)
colnames(trainingdata) <- c("Input","Output")

#Train the neural network
#Going to have 10 neurons in a single hidden layer (hidden=10)
#Threshold is a numeric value specifying the threshold for the partial
#derivatives of the error function as stopping criteria.
net.cube <- neuralnet(Output~Input, trainingdata, hidden=10, threshold=0.01)
print(net.cube)

#Plot the neural network
plot(net.cube)

#Test the neural network on some test data
testdata <- as.data.frame((1:10)^(1/3)) #Cube roots, so the expected outputs are 1 through 10
net.results <- compute(net.cube, testdata) #Run them through the neural network

#Lets see what properties net.results has
ls(net.results)

#Lets see the results
print(net.results$net.result)

#Lets display a better version of the results
cleanoutput <- cbind(testdata,(testdata)^3,
                     as.data.frame(net.results$net.result))
colnames(cleanoutput) <- c("Input","Expected Output","Neural Net Output")
print(cleanoutput)
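As a quick sanity check of my own (not part of the original script), you can feed the trained network one new value and see how close it gets:

#The cube of 4 should come out near 64; 4 is inside the 0-100 training range
compute(net.cube, data.frame(Input=4))$net.result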

R365: Day 25 – attach() and detach()

I have been going through various packages and analyses, and sometimes it’s good to just step back and ask “why in the world is this not working?”. I explored the different datasets that are available in R in the {datasets} package, and we have seen several other datasets strewn about in various stages of usefulness (KMsurv from my most recent post comes to mind), but sometimes you want to work with just one single dataset and you don’t want to specify the column with the “$” symbol every time. attach() lets you call a dataset’s columns by name directly, which is also useful if you just want to hold on to a particular portion of a dataset as a vector.

library(survival)
library(KMsurv)
data(aids)
attach(aids)
plot(infect~induct)

[Plot: AIDS infection time vs. induction time]

Now you can call different columns from the {aids} dataset by name, like the infection (infect) or induction (induct) time for AIDS. This is pretty cool, but one of the interesting (and depressing) things about this dataset is that it is split up by whether the patients were adults or not. Maybe we can separate out the children and adult data to see what kind of differences there are. After searching for a bit, I gave up on doing this with the base package, and decided to do it using {ggplot2} with help from this post.

library(ggplot2)
#GGPLOT
qplot(x=infect, y=induct,
      data=aids,
      colour=adult,
      main="AIDS infect~induction") +
  geom_point() + xlab("infection") + ylab("induction")

[Plot: AIDS infection vs. induction time, colored by adult status]

Which is pretty nice! Much nicer than I thought it would be; I was honestly getting kind of worried. After you’re all done, remember to detach() your data, otherwise R will get confused if you work with another package or dataset with similarly titled columns.
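In this case, that is just:

detach(aids) #take aids back off the search path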

R365: Day 24 – Survival Analysis

My wife has recently been working on persistence data that she collected during her entomology master’s degree, and her advisors suggested that she use survival analysis to analyze it. I wanted to try to help her out, but I previously did all of my survival analysis work in JMP, so I wanted to see what it is like in R. Survival analysis is often used in biology, especially entomology, and describes “time-to-event” for different treatments. Most often in entomology, this sort of analysis is used to compare the effects of two treatments (eg – insecticide vs. control) on the survival of the insect target. Data for survival analysis is typically binary (dead or alive), so it’s not super useful for comparing sub-lethal effects, but it is really robust. Data is also collected at the individual level, and you need to keep track of which individuals had an event occur. It’s not traditionally used at the group scale (% of group at event), although I would imagine that the statistics would work. Although entomologists use this analysis a lot, survival analysis can also be used in plant pathology; during my master’s degree, I used it to compare time until sporulation between treatments. Technically, you can sometimes get away with more than two treatments for comparison, but you need to adjust your p-value accordingly (because all of a sudden you’re making several pair-wise comparisons, and a p-value of 0.05 might not be appropriate). Because survival analysis is just a subset of time-to-event analysis, it is applicable to a lot of other fields like engineering (time until something breaks) and economics (time until something makes money). The data are binary, so the event has to have a clear threshold: something cannot be “sort-of dead”.

For most of this post, I will be using this page (http://anson.ucdavis.edu/~hiwang/teaching/10fall/R_tutorial%201.pdf) from UC Davis statistical researcher Hui Wang (http://anson.ucdavis.edu/~hiwang/). The article assumes that you know some stuff about survival analysis and how to use R, but it’s really straightforward and well written. Some portions towards the end of the page require access to an online resource, but I was unable to connect to the site. Maybe more server problems like I saw with {R1magic}?

Using the {survival} package and the {KMsurv} data package, you can explore all sorts of medical survival datasets about AIDS, kidney transplants, STDs, and more! It’s actually kind of morbid looking through all of these datasets about diseased people. I love epidemiology, but I think if I had to do human health I would be super depressed.
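If you want to browse those datasets yourself, here is a minimal sketch (note that the kidney data I use below actually ships with {survival}, not {KMsurv}):

library(survival)
library(KMsurv)
data(package="KMsurv") #pops up an index of the available datasets
data(kidtran) #e.g., the kidney transplant data
head(kidtran)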

The first thing I want to do is compare survival between two different sets of data. The function survdiff() can help with this using the G-rho family of tests. Setting the rho value to 0 or 1 adjusts whether the analysis is performed as a log-rank test or a Peto test. This article (http://stats.stackexchange.com/questions/23323/how-to-perform-a-wilcoxon-signed-rank-test-for-survival-data-in-r) gave an AWESOME explanation of the difference between the two. TL;DR – rho=0 weights all events equally, and rho=1 weights earlier events more heavily. This matters if you care whether lots of people die early in the study or deaths occur steadily throughout it. Let’s try this out, first with rho=0.

compare1 <- survdiff(Surv(time,status) ~ sex, data=kidney, rho=0)
compare1

Call:
survdiff(formula = Surv(time, status) ~ sex, data = kidney, rho = 0)

       N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 20       18     10.2      5.99      8.31
sex=2 56       40     47.8      1.28      8.31

 Chisq= 8.3  on 1 degrees of freedom, p= 0.00395

Now let’s see the same comparison with rho=1:

compare1 <- survdiff(Surv(time,status) ~ sex, data=kidney, rho=1)
compare1

Call:
survdiff(formula = Surv(time, status) ~ sex, data = kidney, rho = 1)

       N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 20     13.1     5.37     11.24      18.7
sex=2 56     19.0    26.78      2.25      18.7

 Chisq= 18.7  on 1 degrees of freedom, p= 1.53e-05

So both tests have significant results, with big differences by sex in the recurrence time of kidney infections for people who received catheters. While both Chi-Squared tests are significant, the timing of events clearly matters: weighting early events more heavily (rho=1) gives a much smaller p-value, 0.0000153 versus 0.00395. Unfortunately, I don’t think that this test explicitly tells you which way the difference goes (male vs. female infection). So let’s look at the same data using the Kaplan-Meier estimator, which estimates the survival curve for each group. We can do this using the survfit() function.

compare <- survfit(Surv(time,status) ~ sex, data=kidney)
compare

Call: survfit(formula = Surv(time, status) ~ sex, data = kidney)

      records n.max n.start events median 0.95LCL 0.95UCL
sex=1      20    20      20     18     22      12      63
sex=2      56    56      56     40    130      66     190

plot(compare, lty=2:3)
legend(400, 0.8, c("male","female"), lty=2:3)

[Plot: Kaplan-Meier curves for time to kidney infection, by sex]

So it looks like men have a higher rate of kidney infection, with lots of early-stage infections, compared with women. Bummer. ‘Til next time, see you space cowboy!

R365: Day 22 – {nws}

I wanted to write a relatively short post, so I decided to “randomly” select another package and look through it. In the near future I will be covering fuzzy logic, neural networks, and more detailed work on mapping, but those will each be multi-day posts. I am still not sure how to break them up; one of the kinks with this blog is that it is easy to do short, simple posts about random stuff, but if I want to fully explore entire topics or even big packages (eg – ggplot2) I will need a good way to break up the work.
I was kind of hoping that {nws} would be about weather forecasting or something like that, but the package is actually about … using the NetWorkSpaces server to do cross-language data exchange. If I was going to gripe about handling large topics, {nws} was the wrong package to draw: the vignette is 54 pages long, and I have no idea what its description sentence actually means. It seems like you can use the package to develop code in both Python (an awesome object-oriented language I wish I knew more about) and R. It requires that you download the NetWorkSpaces server for the package to be operational, and I do not know how to set up a connection to a server on my computer. It seems like you need to use the function socketConnection() and set server=TRUE, but that did not seem to work for me. Even things that should have worked without the server, like a dataset on German credit (germandata), did not seem to want to load.
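For what it’s worth, this is the sort of thing I was attempting (a sketch only; the port is a placeholder, and with server=TRUE R just sits and listens until a client actually connects):

#Open a listening socket with base R; this blocks until something connects
con <- socketConnection(host="localhost", port=8765, server=TRUE, blocking=TRUE, open="r+")
close(con)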
I went through the examples in the package and none of them seemed to want to work without the server. I think that this package helps you make your code more efficient by running calculations in parallel. I am honestly not at the point in any of my models where I need parallel structure in my code, although it is probably a good idea to know how to design code that way from the get-go. Today was kind of a bust; I am sorry, folks. I do not think I will re-visit this package, but it is cool to know it exists, and frustrating to know that I have NO idea how servers work. See you, space cowboy!

R365: Day 21 – {R1magic}

I was feeling kind of uninspired; a couple of the possible posts that I was hoping to do were either too big (“Lets explore all of Fuzzy Logic!”) or the functions that I thought existed apparently don’t (?) (“I could have sworn there was an ifthenelse function…”). So to get over the hump I chose a package at “random” (no, I still have not made an algorithm to choose packages at random; I scrolled through the package list and picked one without looking at it). {R1magic} seemed like it might have some promise, and I liked the name (“maybe it really does magic? Or mimics D&D game setups?”).
{R1magic} supposedly helps with “compressive sampling”, also known as “compressive sensing”, which is used to reconstruct a signal when only a few measurements have been taken. It sounds like it involves interpolating between data points to find intermediary points, and one of the listed applications in Wikipedia is MRI and imaging (a lot of the examples involve compression of .jpg images). If you would like to know more, Rice University has a nice page on the subject with LOTS of links (http://dsp.rice.edu/cs). The {R1magic} package lets you do a lot of things (many of which I don’t think I understand). One of the few that I do understand is the generation of random Gaussian matrices using the function GaussianMatrix(m,n), with m/n being the row and column numbers. These could be useful for generating random matrices for simulations. While looking through example code, I ran across the set.seed() function, which sets the “starting point” for the pseudo-random number generator. As far as I can tell, if you set a particular seed, the generator will produce exactly the same sequence of “random” numbers every time, which makes a stochastic analysis reproducible; change the seed and you get a different (but equally random-looking) sequence. #fakestatisticianproblems
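A quick demonstration of that with a toy example:

set.seed(123)
rnorm(3) #-0.560 -0.230 1.559
set.seed(123)
rnorm(3) #the exact same three numbers again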
One cool function in the {R1magic} package is sparseSignal(), which lets you randomly generate a dataset where the data cruise along until they spike. You set the number of data points and the number of spikes and receive a random dataset as output, like below.
[Plot: sparseSignal(2000, 10) output, mostly flat with occasional spikes]
a <- sparseSignal(2000, 10)
plot(a)
lines(a)

I think that this sort of random dataset generation would be useful for testing detection methods against a random baseline. So being able to test, say, my epidemiological model against a random dataset would be pretty useful.
Honestly, I have no idea what is going on in most of this package and feel like I am missing the more interesting portions of what it is trying to do. I do think portions of it could be useful in simulations or for testing models against randomly generated ones, but I doubt those were the intended purposes of the package. Still, it is interesting to know what’s going on in signal-to-noise interpretation.

R365: Day 20 – AncestryMapper

While looking over the ancestry dataset in {latticeExtra}, I found a cool, short package called {AncestryMapper}. It seems to have genetic data on where people are from and how closely they are related. Although it is a standalone package, its vignette is really short, only 3 pages long. The package basically has 3 functions:

1) AncestryMapper – basically demos a SNP dataset on different tribes and their relationships (see below)

2) calculateAMids – looks like it calculates the ancestry mapping IDs (AMids), but only does so for the entered dataset

3) plotAMids – which just makes pretty graphs

This package is based on a PLoS ONE paper (found here: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0049438) that used this tool to map a lot of human SNP data.

This package really makes me want to sequence my genome, maybe I’ll finally pony up the money for it after the wedding.

[Plot: plotAMids demo output from AncestryMapper]

R365: Day 19 – Jitter and Rug

Every once in a while you want to make your data just a little bit off, for example to add some stochasticity to a model. jitter() is a function that adds a uniformly distributed amount of variation to data; it uses runif() to add or subtract a little bit from each value. You can control how much “jitter” is in your data by altering ‘factor’, the second argument of jitter(x, factor, amount). You can also set the absolute amount of noise directly with the ‘amount’ argument. I talked about using jitter in other ways earlier on, when I mentioned that jitter can be used to mimic the variation that is typically seen in hand-drawn lines, like in xkcd.com comic strips.

[Plots: jitter with factors 1, 10, and 100]

a <- 1:100
plot(jitter(a, 1), main="Jitter with Factor 1")
plot(jitter(a, 10), main="Jitter with Factor 10")
plot(jitter(a, 100), main="Jitter with Factor 100")

At the bottom of the jitter() help page, they suggest checking out rug(), which I had never heard of. The description of the function is a bit weird; it says, “Adds a rug representation (1-d plot) of the data to the plot”. Are we talking Persian here? Maybe something more modern?

After puttering around with rug() for a while, it seems to show where your data is condensed along one axis, by default the x-axis. This could be useful if you wanted a visualization of when things typically occur. The example in the help files uses the {faithful} dataset, which records the waiting and eruption times for the Old Faithful geyser in Yellowstone National Park, and shows that most of the eruptions fall into one of two groups. I am not sure I really like rug() because while it does condense data, it also loses information. It also is not any sort of statistical tool (as far as I can tell), so it won’t give you useful summary statistics. Also, Vernon, if you are reading this, Mode 4 Lyfe!
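Here is roughly what that looks like (a minimal sketch of my own using base hist() plus rug()):

data(faithful)
hist(faithful$eruptions, breaks=20, main="Old Faithful eruption times", xlab="Eruption time (min)")
rug(faithful$eruptions) #tick marks show where each raw observation falls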

[Plot: rug() example with the faithful dataset]

R365: Day 18 – XKCD!

I am a HUGE fan of the internet comic XKCD (xkcd.com), which is “a webcomic about romance, sarcasm, math, and language”. Reading that description, I can’t help but think of how perfectly it matches the ambiance of the comic strip. I have never felt such emotional attachment and investment as I have for those silly little stick figures. Of course, I am not alone; lots and lots of people LOVE XKCD, and it only figures that at least one person decided to make the distinctive XKCD font available to the unwashed masses and build a theme for R. You can download the XKCD font (“Humor Sans” .ttf) from several places (Dropbox, Github, Sourceforge, etc). Using the {extrafont} package described in the last post, you can load in the font. Its official name in R is “Humor Sans”, and I did not need an underscore in between the words to get it to work. A really awesome example of the magic that Randall Munroe makes happen is this strip (http://xkcd.com/1319/), which highlights both his insightful graphs and my own problem when deciding whether to automate code or not. In honor of the wonderful comic strip, let’s see if I can make some absolutely absurd correlations happen (in the next 10 minutes).
…What has happened in the last 45 minutes:
Step 1: try to come up with clever/funny graphs.
Step 2: realize I am neither clever nor funny.
Step 3: cry.

But let’s look at {swiss} data on the relationship between education, fertility, and child mortality.
[Plots: swiss fertility and education data with the XKCD theme]
library(ggplot2)
#Assumes Humor Sans has already been imported via extrafont
ggplot(swiss, aes(x=Fertility, y=Education)) + geom_point() +
  ggtitle("Fertility by Education") +
  xlab("Fertility") + ylab("Education") +
  theme(text=element_text(size=16, family="Humor Sans"))

One user on a Stackoverflow site suggested using jitter() to replicate the hand-drawn look for most of the graphs. I got it to work, but it did not look right with the dataset I was using.

The {xkcd} package also lets you draw stick figures on all your graphs, but it seems to require quite a bit of tweaking to make it look good. This is their example:

[Plot: example stick figures from the {xkcd} package]

Or after a little playing around, you can make them look pretty stupid.

[Plot: stick figures after some unfortunate tweaking]

Have Fun!

library(ggplot2)
library(xkcd) #provides xkcdman()

datascaled <- data.frame(x=c(-3,3), y=c(-30,30))
p <- ggplot(data=datascaled, aes(x=x, y=y)) + geom_point()
xrange <- range(datascaled$x)
yrange <- range(datascaled$y)
ratioxy <- diff(xrange) / diff(yrange)

mapping <- aes(x=x,
               y=y,
               scale=scale,
               ratioxy=ratioxy,
               angleofspine = angleofspine,
               anglerighthumerus = anglerighthumerus,
               anglelefthumerus = anglelefthumerus,
               anglerightradius = anglerightradius,
               angleleftradius = angleleftradius,
               anglerightleg = anglerightleg,
               angleleftleg = angleleftleg,
               angleofneck = angleofneck,
               color = color)

dataman <- data.frame(x = c(-1,0,1), y = c(-10,0,10),
                      scale = c(10,7,5),
                      ratioxy = ratioxy,
                      angleofspine = seq(-pi/2, -pi/2 + pi/8, l=3),
                      anglerighthumerus = -pi/6,
                      anglelefthumerus = pi + pi/6,
                      anglerightradius = 0,
                      angleleftradius = rnorm(3, -pi/4, pi/4),
                      angleleftleg = 6*pi/2 + pi/12,
                      anglerightleg = 8*pi/2 - pi/12,
                      angleofneck = runif(3, min = 3*pi/2 - pi/10, max = 3*pi/2 + pi/10),
                      color = c("A","B","C"))

p + xkcdman(mapping, dataman)

A big shoutout to Erin H-B for pointing out this awesome package! Thanks Erin!

R365: Day 17 – Extra Fonts

I am not sure that I understand why people hate Comic Sans so much, but in case you DON’T, you can easily make graphs using all sorts of fonts. The {extrafont} package gives you access to lots of different fonts, from things you would find on most computers (Arial) to things seemingly designed for the purposes of programming languages (Humor Sans, which mimics XKCD!). The computer-standard ones are TrueType fonts, the format Apple and Microsoft adopted as a standard. These include things like Helvetica, Times Roman, and Courier.

To import most of the TrueType fonts on your system, use font_import() from {extrafont}. BEWARE! This process takes a long time (~5-10 minutes on my computer). You can import the whole library of fonts or just a few. EVEN WINGDINGS! Oh god, I find this so funny; it’s actually disturbing my lab mate a bit. Okay, so you can get lots of standard fonts. But what about non-standard fonts?…
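The basic workflow looks something like this (a sketch; the device argument for loadfonts() depends on your platform):

library(extrafont)
font_import() #scan the system font folders; this is the slow step
fonts() #list the font families extrafont now knows about
loadfonts(device="win") #register them with the graphics device (Windows shown)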

Okay, hang on for a minute, where did Wingdings come from? Like, originally? It seems that back in the day, typesetters could issue non-standard type pieces like the Star of David, diamonds, and circles, and called them ‘dingbats’, which is what my mom used to call me when I was being a joker. Wingdings was developed by Microsoft back in the early 1990s for Windows 3.1, and has been available ever since.

Sorry for the aside.

If you find a non-standard font, you can download the .ttf file and install the font the same way that you would anything else.
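For example (a sketch: font_import() has a pattern argument, so once the .ttf is in a scanned folder you can import just that file):

font_import(pattern="Humor") #import only font files whose names match "Humor"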

So let’s modify an example set that I found here (https://github.com/wch/extrafont); we will look at how temperature affects the vapor pressure of mercury using the {pressure} dataset. For fun, we will plot our titles in Jokerman.

[Plot: pressure vs. temperature with Jokerman titles]

library(ggplot2)
#Assumes Jokerman has already been imported via extrafont
ggplot(pressure, aes(x=temperature, y=pressure)) + geom_point() +
  ggtitle("Effect of Temp on Vapor Pressure of Mercury") +
  xlab("Temp") + ylab("Pressure") +
  theme(text=element_text(size=16, family="Jokerman"))

So that is kind of cool. I had previously seen people plot stuff using numbers instead of dots or symbols, and I thought it would be cool to try that with some symbols from Wingdings (I swear I am not obsessed…).

[Plot: random data plotted with Wingdings-style characters]

windowsFonts(wingdings=windowsFont("Wingdings")) #register the family (Windows only)
a <- rnorm(100)
plot(a, pch=c("a","s","d","f","z","x","c","v","5"), family="wingdings")

As far as what to do if you want normal-looking x and y labels, I have not figured that out yet. But it is kind of a cool way to integrate new symbols beyond the regular pch crowd.

R365: Day 16 – Cities in Maps

I wanted to revisit the {maps} package again; it has a lot of data inside that I was completely unaware of. R keeps a list of major (40K+ population) cities in each of the US states, as well as major world cities, in the us.cities and world.cities datasets in {maps}.

For example, here are all of the cities in California with a population over 150K:
[Map: California cities with population over 150K]
library(maps)
data(us.cities)
map("state", "CA")
map.axes()
map.scale()
map.cities(us.cities, country="CA", minpop=150000)
and here are the US cities with populations over 1M:

[Map: US cities with population over 1,000,000]

map("state", interior=FALSE)
map("state", boundary=FALSE, lty=5, add=TRUE)
map.cities(us.cities, minpop=1000000, label=TRUE)

What I really want to do is color in the states using if/then/else statements, but I can’t figure out how to do that right now. Maybe tomorrow! A sketch of one approach that looks promising is below.
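(This uses ifelse() with the fill and col arguments to map(); the three highlighted states are arbitrary.)

library(maps)
state.names <- map("state", plot=FALSE)$names #polygon names in plotting order
cols <- ifelse(grepl("california|nevada|oregon", state.names), "lightblue", "gray90")
map("state", fill=TRUE, col=cols) #one color per polygon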