R365: Day 24 – Survival Analysis

My wife is recently working on persistence data that she collected during her entomology master’s degree and her advisors suggested that she use survival analysis to analyze the data. I wanted to try to help her out, but I previously did all of my survival analysis work in JMP, so I wanted to see what it is like in R. Survival analysis is often used in biology, especially entomology, and describes “time-to-event” for different treatments. Most often in entomology, this sort of analysis will be used to compare the effects of two treatments (eg- insecticide vs. control) on the survival of the insect target. Data for survival analysis is often binary (dead or alive) so its not super useful for comparing sub-lethal effects, but it is really robust. Data is also collected on the individual level, and you need to keep track as far as which individuals had an event occur. Its not traditionally used on the group scale (% of group at event), although I would imagine that the statistics would work. Although entomologists use this analysis a lot, survival analysis can also be used in plant pathology. During my master’s degree, I used survival analysis to compare time until sporulation between treatments. Technically, you can sometimes get away with more than two treatments for comparison, but you need to adjust your p-value accordingly (because all of a sudden you’re making several pair-wise comparisons, and a p-value of 0.05 might not be appropriate). Because survival analysis is just a subset of time-to-event analysis, it is applicable to a lot of other fields like engineering (time until something breaks) and economics (time until something is making money). Data is binary, so it has to have a clear threshold, it cannot be “sort-of dead”.

For most of this post, I will be using this page (http://anson.ucdavis.edu/~hiwang/teaching/10fall/R_tutorial%201.pdf) from UC Davis statistical researcher Hui Wang (http://anson.ucdavis.edu/~hiwang/). The article assumes that you know some stuff about survival analysis and how to use R, but its really straightforward and well written. Some portions towards the end of the page require you to have access to an online resource, but I was unable to link up to the site. Maybe more server problems like I saw with {R1magic}?

Using the {survival} package and {KMsurv} dataset, you can explore all sorts of cool medical survival datasets about AIDS, kidney transplants, STDs, and more! Its actually kind of morbid looking through all of these datasets about diseased people. I love epidemiology but I think if I had to do human health I would be super depressed.

The first thing I want to do is be able to compare survival between two different sets of data. The function survdiff() can help with this using the G-rho family of tests. By setting the rho value to 0 or 1, you adjust whether the analysis is performed as a log-rank test or a Peto test. This article (http://stats.stackexchange.com/questions/23323/how-to-perform-a-wilcoxon-signed-rank-test-for-survival-data-in-r) gave an AWESOME explanation for the difference between the two. TL;DR – rho=0 will result in equal weighting for all events, and rho=1 will result in heavier weighting for earlier events. This would matter if you think that it is important if lots of people die early in the study, or just die continuously across the study. Let’s try this out, first with rho=0.

compare1<-survdiff(Surv(time,status) ~sex,data=kidney, rho=0)
 survdiff(formula = Surv(time, status) ~ sex, data = kidney, rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 20 18 10.2 5.99 8.31
sex=2 56 40 47.8 1.28 8.31
Chisq= 8.3 on 1 degrees of freedom, p= 0.00395

Now lets see the same comparison with rho=1

compare1<-survdiff(Surv(time,status) ~sex,data=kidney, rho=1)
 survdiff(formula = Surv(time, status) ~ sex, data = kidney, rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 20 13.1 5.37 11.24 18.7
sex=2 56 19.0 26.78 2.25 18.7
Chisq= 18.7 on 1 degrees of freedom, p= 1.53e-05

So both tests seem to have significant results, with big differences by sex in the recurrence time until kidney infections for people who received catheters. While both Chi-Squared tests seem to have significant results, there seems to be a noticeable effect from when people died based on rho = 0 or 1, with p-values of 0.00395 and 0.0000153. Unfortunately, I don’t think that this test explicitly tells you exactly which way the skew is going (male vs. female infection). So lets look at the same data using the Kaplan-Meier estimator, which estimates the survival curve for different groups. We can do this using the survfit() function.

compare<-survfit(Surv(time,status) ~sex,data=kidney)
Call: survfit(formula = Surv(time, status) ~ sex, data = kidney)
records n.max n.start events median 0.95LCL 0.95UCL
sex=1 20 20 20 18 22 12 63
sex=2 56 56 56 40 130 66 190


So it looks like men have a higher rate of kidney infection, with lots of early-stage infections, compared with women. Bummer. ‘Til next time, see you space cowboy!


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s