R365: Day 47 – fapply() in {timeSeries}

I need to be able to average a value over a set time period. There are a couple of options; I was mostly considering period.apply() in {xts}, because most of my data is in xts format, but its indexing does not seem very flexible. While puttering around StackOverflow, I ran into a site that described basically everything I needed to do.
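
For reference, the {xts} route I was considering looks something like this. This is just a minimal sketch on made-up hourly data (period.apply() and endpoints() are real {xts} functions; the series is invented), and the fixed period names are exactly the inflexibility I was running into:

library(xts)
#made-up hourly series, just to show the shape of the call
x <- xts(rnorm(24 * 30), order.by = seq(Sys.time(), by = "hour", length.out = 24 * 30))
#average over calendar days; endpoints() only understands standard periods like "days" or "months"
period.apply(x, INDEX = endpoints(x, on = "days"), FUN = mean)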

fapply() takes a time series, a beginning and end date, and a function to apply. Bam. Exactly what I needed.

I edited some of the base example code to fit my needs:

library(timeSeries)
library(ggplot2)
## Percentual returns of the Swiss Bond Index (SBI) and Swiss Performance Index (SPI)
LPP <- 100 * LPP2005REC[, c("SBI", "SPI")]
head(LPP, 20)
## Aggregate monthly returns and plot them
plot(applySeries(LPP, by = "monthly", FUN = colSums), main = "SBI and SPI Index Over Time")

[Plot: monthly aggregated SBI and SPI returns]
Next I tried to create my own time series and test the flexibility of the function:

> x=timeSeries(runif(100), seq(Sys.Date(), length.out = 100, by = "days"))
> plot(x)
[Plot: the random daily time series x]
> fapply(x,from=Sys.timeDate()+10,to=Sys.timeDate()+15, FUN = mean)

 TS.1
[1,] 0.6262633

Error in applySeries(x = x, from = from, to = to, FUN = FUN, ...) : 
 outsite of range

What is going on here? After about 20 minutes of Googling the aggravatingly misspelled “outsite of range”, I could not find a solution. Eventually I noticed, lower down in the package vignette, that the dates were quoted, i.e. passed as strings. Note also that dates have to be in “standard unambiguous format”: Year-Month-Day. And so…

> x=timeSeries(runif(100), seq(Sys.Date(), length.out = 100, by = "days"))
> plot(x)
> fapply(x,Sys.timeDate()+10,to=Sys.timeDate()+15, FUN = mean) 
      TS.1
[1,] 0.6262633

After working with the code a bit further, I wanted to pull values from multiple time intervals without making a bunch of separate calls. Luckily, fapply() is pretty flexible and accepts vectors of start and end dates, returning one row per interval:

LPP <- 100 * LPP2005REC[, c("SBI", "SPI")]
from=c("2005-11-11","2005-11-17")
to = c("2005-12-11","2005-12-17")
fapply(LPP,from=from,to=to, FUN = colMeans)
GMT
              SBI         SPI
2005-12-11 0.032217205 0.1924305
2005-12-17 0.008718791 0.1292479

But what if we just want the mean over the week following each start date? It's annoying to re-enter the dates over and over again. Saying

LPP <- 100 * LPP2005REC[, c("SBI", "SPI")]
from=c("2005-11-11","2005-11-17")
to = from + 7
fapply(LPP,from=from,to=to, FUN = colMeans)

gives us the error

Error in from + 7 : non-numeric argument to binary operator

BUT, if we wrap the from dates in as.Date(), we get…

LPP <- 100 * LPP2005REC[, c("SBI", "SPI")]
from=as.Date(c("2005-11-11","2005-11-17"))
to = from + 7
fapply(LPP,from=from,to=to, FUN = colMeans)
GMT
                   SBI       SPI
2005-11-18 0.015421400 0.3817752
2005-11-24 0.002559783 0.4332508
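
If I end up needing a lot of these rolling week-long windows, I suspect the cleanest thing is to generate the start dates with seq() and keep the from + 7 trick. A quick sketch of that idea (untested beyond this toy case, same LPP data as above):

from = seq(as.Date("2005-11-11"), by = "week", length.out = 4)
to = from + 7
fapply(LPP, from = from, to = to, FUN = colMeans)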

R365: Day 46 – logistic regression in R

I am mostly using this post (http://www.ats.ucla.edu/stat/r/dae/logit.htm) as an example.

#get data from server
#data is an example from what factors result in admission (binary) to grad school: GPA, school rank, and GRE score
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
## view the first few rows of the data
head(mydata)
#look at a summary of the data
summary(mydata)
#I like boxplots along with summaries, they help me
boxplot(mydata$rank)
#use sapply() to get the standard deviation of the data
#sapply() works on a list or vector of data and returns a vector
#apply() did not work on this data frame, but lapply() did, though it returns a list instead of a vector
#lapply(mydata, sd)
sapply(mydata, sd)
#now they want us to do a two-way contingency table for categorical outcomes
#contingency tables are useful, but they can sometimes get cumbersome if there are lots of variables
#the help reference manual for xtabs() suggests using ftable() if you have lots of variables you want to look at
xtabs(~admit + rank, data = mydata)
#First they switch rank to be a 'factor'
mydata$rank <- factor(mydata$rank)
#Then they use a generalized linear model to run the logistic regression
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
#now you can call mylogit or a summary of mylogit
summary(mylogit)
##The output shows that all of the coefficients are significant. Each estimate is the change in the log odds of admission for a one-unit change in that predictor
##log odds are how logistic regressions express the odds of an outcome
##odds ratios are the odds of success over the odds of failure (e.g. 80% success: .8/.2, or 4 to 1)
##log odds allow this to be expressed from a range -inf to +inf
#see http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm for better explanation
#lets look at the confidence interval for the analysis
confint.default(mylogit)
#if you have a categorical term (like school rank here) you can use a Wald test to see whether rank is significant overall
#wald.test() comes from the {aod} package
library(aod)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
#now we can express the coefficients as odds ratios by exponentiating them
#for example, a one-unit increase in GPA multiplies the odds of admission by about 2.23
exp(coef(mylogit))
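#side note (my addition): if I read the UCLA page right, you can also get the odds ratios
#together with their (Wald) confidence intervals in one call
exp(cbind(OR = coef(mylogit), confint.default(mylogit)))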
#what if you want to focus in on the effect of one coefficient?
#you can hold the other predictors at their means and test out the different ranks
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
#here we see that, with GPA and GRE held at their averages, a student from a rank 1 university has a 52% chance of getting in, whereas a rank 4 student has an 18% chance
#we can look at other combinations as well, like GRE score and rank
#first create 100 GRE values between 200 and 800 for each of the four ranks
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
 4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
#now estimate the predicted probability based on GRE and rank
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link",
 se = TRUE))
newdata3 <- within(newdata3, {
 PredictedProb <- plogis(fit)
 LL <- plogis(fit - (1.96 * se.fit))
 UL <- plogis(fit + (1.96 * se.fit))
})
## view first few rows of final dataset
head(newdata3)
#now plot using ggplot2
library(ggplot2)
ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
 ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
 size = 1)

[Plot: predicted probability of admission vs. GRE score, by rank, with 95% confidence ribbons]
#now we can test whether our three-predictor model fits better than an empty (null) model
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
#with a chi-square of 41.46 and a p-value of 7.6e-8, this model fits better than the empty model

R365: Day 45 – {xts} applying functions daily/weekly/monthly

If you have time series data, you may want to reduce the data from hourly to daily/weekly/monthly data. Reasons to do this:

  • easy to summarize noisy data
  • see overall patterns across time
  • c’mon, it looks cooler

library(xts)
xts.ts <- xts(rnorm(231),as.Date(13514:13744,origin="1970-01-01"))
plot(xts.ts)
plot(apply.monthly(xts.ts,mean))
par(mfrow=c(3,1))
plot(xts.ts, main="Original Data")
plot((apply.weekly(xts.ts,mean)), main="Weekly Averages")
plot((apply.monthly(xts.ts,mean)), main="Monthly Averages")

[Plot: original data, weekly averages, and monthly averages]

The downside is that some of those peaks that seem really important when you zoom out to weekly or monthly data are not super noticeable at the daily level. It depends on what you want to do with the analysis.
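
Since the title promises “daily” too: with truly hourly data the same trick works with apply.daily(). A rough sketch on made-up hourly data (not my real dataset):

library(xts)
#fake hourly data for one month
hourly <- xts(rnorm(24 * 30), order.by = seq(as.POSIXct("2014-01-01"), by = "hour", length.out = 24 * 30))
plot(apply.daily(hourly, mean), main = "Daily Averages")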

 

R365: Day 44 – Cumulative Index with Upper and Lower Boundaries

This topic sounds SUPER specific; when I was writing that title I was like, “no one is going to read this, oh well.”

BUT! What this is useful for is setting up an index where you know that the index should not go above or below a certain point.

Yeah…that's pretty specific…

BUT it is useful in plant pathology for calculating the Gubler-Thomas (GT) powdery mildew risk index. I need to program the index for work.

Here is some example script for how to set upper and lower thresholds using ifelse() statements:

#set up a vector with data
a=rep(1:10)
#set up a function, name it whatever
GTA=function(a){
 #have the first step do whatever you want;
 #this one takes sqrt(a) if a is NOT divisible by 3, and a^4 if it is
 b=ifelse(a%%3>0,sqrt(a),a^(4))
 #cap anything over 10 at 10
 c=ifelse(b>10,10,b)
 #floor anything under -10 at -10
 d=ifelse(c<(-10),-10,c)
 print(d)
}
#run the function
q=GTA(a)
#...and plot it out
plot(q)
lines(q)

Then see the beautiful plot:

[Plot: the clamped GTA(a) output]
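
As a side note, I think the same clamping can be written a bit more compactly with pmax() and pmin(), which take element-wise maxima and minima. A sketch using the same toy function (I haven't tried this against the real GT index yet):

GTA2=function(a){
 b=ifelse(a%%3>0,sqrt(a),a^(4))
 #clamp b to the interval [-10, 10] in one step
 pmax(pmin(b,10),-10)
}
plot(GTA2(a))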

 

R365: Day 43 – counting cumulative events

So I need to modify a model that I built for my master’s degree so that it is all in R. The advantage is that more people can then read, use, and check over my work. The downside is that Excel is really easy (though not necessarily efficient) to program in once you get the hang of it. Part of the code requires that I know how many consecutive hours in a day meet a temperature requirement.

This post from StackOverflow was super helpful for checking how many consecutive values are true, which is step one of my problem. It was also a really cool thread because it compared the computation time of a single line of code using lapply() against setting up your own function, timing each approach with system.time().

Generally, apply() and its relatives are really nice for applying a function over a range, but they are not always (!) more efficient (i.e. faster) than writing your own function or for-loop, supposedly because they run their own for-loop inside the apply() call. One StackOverflow article talked about how sapply(), tapply(), and lapply() are implemented using C, making them efficient, while apply() is a wrapper function (a wrapper function just calls other functions). Another article gives a good summary of how to use the apply() family if you’re interested.

x <- sample(0:1,1000,T)
#define a function called cumul_zeros
cumul_zeros <- function(x) {
 #tell it to count 0's
 x <- !x
 #compute the run lengths for 0's
 rl <- rle(x)
 #tell R to only look at the lengths, not the values (you have already told it to only count lengths of 0s)
 len <- rl$lengths
 #set a variable so that it knows which values were of what length
 v <- rl$values
 #cumulative length, so that you know where things are in the series
 cumLen <- cumsum(len)
 #start z as a copy of x
 z <- x
 # replace the 0 at the end of each zero-block in z by the 
 # negative of the length of the preceding 1-block...
 iDrops <- c(0, diff(v)) < 0
 z[ cumLen[ iDrops ] ] <- -len[ c(iDrops[-1],FALSE) ]
 # ... to ensure that the cumsum below does the right thing.
 # We zap the cumsum with x so only the cumsums for the 1-blocks survive:
 x*cumsum(z)
}

But I have no idea what is going on in the last three lines of live code; it looks like… they replace something with a negative value?

The one-liner that is less efficient seems easier to understand:

(!x) * unlist(lapply(rle(x)$lengths, seq_len))

The negation of x picks out the zeros, and seq_len() gives each run a running count up to its length; multiplying by (!x) then zeroes out the counts for the runs of ones, so only consecutive zeros keep a count.
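
Out of curiosity, this is roughly how I would check the timing claim myself (assuming cumul_zeros() from above is already defined; the numbers will obviously vary by machine):

x <- sample(0:1, 1e6, TRUE)
system.time(cumul_zeros(x))
system.time((!x) * unlist(lapply(rle(x)$lengths, seq_len)))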

Now we want to see if we can get this thing to return the maximum consecutive run over a 24-hour period. From reading a few StackOverflow articles, it looks like tapply() might work for our problem. tapply() applies a summary function to a vector, grouped by a factor. So in our case, I can tell R to compute the maximum value for each day. I have another problem in that my day needs to start at 6AM, not midnight, which means I cannot simply use the Julian calendar date as the factor, which would have been nice and easy. For this example I will ignore that problem (though I sketch one idea after the output below).

x <- sample(0:1,1000,T)
#the (!x) logical flips the series so that only the runs of zeros get counted
w=(!x) * unlist(lapply(rle(x)$lengths, seq_len))
#set up a vector to hold hours
y=rep_len(1:24,length.out=1000)
#a vector for the Julian calendar date
d=rep(1:365, each=24, len = 1000)
#create a dataframe to hold both hours and the consecutives
z=data.frame(d,y,w)
##try using tapply? see http://www.r-bloggers.com/r-function-of-the-day-tapply/ and 
##http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega
r=tapply(z$w,z$d,max)
r

and the output looks like

> r
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 
 5 3 3 3 3 6 3 2 4 2 5 7 6 6 4 5 3 1 3 3 2 3 3 6 4 9 4 4 2 5 4 3 3 5 4 2 7 3 6 6 4 3

which is awesome! Exactly what I need. Next up: creating a cumulative index based on a rule about the output.
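
For the 6AM problem I glossed over: my current thinking is to just shift the hours before building the day factor. A rough, untested sketch (hrs, d6, and r6 are names I made up here; w is the run-count vector from above):

#hours since the start of the series, starting at 0
hrs <- seq_along(w) - 1
#day bins that roll over at hour 6 instead of midnight
d6 <- floor((hrs - 6)/24)
r6 <- tapply(w, d6, max)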

 

R365: Day 42 – {rworldmap}

{rworldmap} is a cool tool that can be used to map global data on either a country or gridded scale.

Below is the population density of Latin America:

[Plot: gridded population density of Latin America]

require(rworldmap)
mapGriddedData(mapRegion="latin america")

Lots of cool datasets, including land use, productivity, GDP, population (circa 2005), and biodiversity.
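
The country-scale side works (if I'm remembering the vignette right) through joinCountryData2Map() plus mapCountryData(). Something like this should map the biodiversity column from the package's bundled example data; the column names here are taken from that example data, so double-check them:

require(rworldmap)
data(countryExData)
#attach the example country-level data to the map polygons by ISO3 code
sPDF <- joinCountryData2Map(countryExData, joinCode = "ISO3", nameJoinColumn = "ISO3V10")
mapCountryData(sPDF, nameColumnToPlot = "BIODIVERSITY", mapRegion = "latin america")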

 

R365: Day 41 – {AMOEBA}

Hand still busted. I ran across {AMOEBA} while looking for spatial analysis packages. {AMOEBA} is a package with a single function (!), AMOEBA(). AMOEBA stands for A Multidirectional Optimum Ecotope-Based Algorithm, which lets you calculate spatial clusters and has an awesome name. Using their base example, we can plot crime across Columbus, OH.

[Plot: AMOEBA clustering of crime in Columbus, OH]

#looking at the AMOEBA help example
###########################################
########## Example of boundaries ##########
###########################################
require(AMOEBA)
### Columbus OH spatial analysis data set (spdep package)
require(spdep)
data(columbus)
require(maptools)
map<-readShapePoly(system.file('etc/shapes/columbus.shp',package='spdep')[1])
### Application of AMOEBA
res<-AMOEBA(columbus$CRIME,col.gal.nb,1,1)
### Plot
color<-rev(rainbow(length(table(res)),start=0,end=2/6))
plot(map,col=color[as.factor(res)])
title('Clustering of crimes at Columbus (OH)')
names<-c('\nLow','\nMedium','\nHigh')
require(plotrix)
color.legend(map@bbox[1,1],map@bbox[2,1]-0.2,map@bbox[1,2],map@bbox[2,1]-0.4,names,color,align='rb')

Or % houses without plumbing

[Plot: AMOEBA clustering of % houses without plumbing]

 

or housing value

[Plot: AMOEBA clustering of housing value]
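
For those I believe you just swap in a different column from the columbus data (PLUMB for plumbing, HOVAL for housing value, assuming I've remembered the column names right) and rerun the plotting lines, e.g.:

res<-AMOEBA(columbus$HOVAL,col.gal.nb,1,1)
color<-rev(rainbow(length(table(res)),start=0,end=2/6))
plot(map,col=color[as.factor(res)])
title('Clustering of housing values at Columbus (OH)')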

 

I don't know why they didn't just build AMOEBA into {spdep}, the mother package, but it's still cool.

R365: Day 40 – if else statements in for loops

My hand is still busted, so I can't really type so well.

Modifying the code that we used in the for-loops post, let's try adding an if-else statement:

#based on code developed by Carol Hom
#a for loop to monitor the progress of a gut parasite
#pretend I had more than you initially, but I am fairly resilient against the parasites
#set initial parameters for parasite populations
me.init=800;
you.init=10;
#set a vector to record the populations as the for-loop progresses
me=rep(1,100);
you=rep(1,100);
#set initial population. me is a vector so me[] tells you which point in the vector to set as the initial
me[1]=me.init;
you[1]=you.init;
#the for loop has to be of length me. One way to get around that is to set it to length(me) instead of 30
for (t in 2:length(me)){
 #set how the populations will grow; you could also use me.init instead of me[1]
#insert an if statement that does an even (50/50) coin flip
 if(rbinom(1,1,(1/2))>=1){
 me[t]=me[1]*(0.9)^t
 you[t]=you[1]*(1.1)^t
 } else {
#in this branch (roughly half the time steps, chosen at random) you are more resistant than I am
 me[t]=me[1]*(1.2)^t
 you[t]=you[1]*(0.9)^t
}
#the if/else braces are closed above; now report progress
 #concatenate time, pop me, pop you, and end the line
 cat(t,me[t],you[t],"\n");
 #don't forget to close out the for loop
}
#create a matrix called data by binding the two columns called me and you
data=cbind(log(me),log(you));
data;
#plot the log of the parasite loads in me and you
matplot(1:100,data,pch=20,xlab="time",ylab="log parasite load",type="b");

This leads to oscillating dynamics in the model:

[Plot: log parasite load over time with the even coin flip]

Because, on net, you are more resistant, you end up with a lower overall parasite load.

What happens if it is not an even flip? Say 1 in 10 events are bad days for me (changing the probability in rbinom(…) from 0.5 to 0.1)?

[Plot: log parasite load over time with a 1-in-10 flip]

Similar dynamics! The frequency changes but the overall trend looks similar. Very cool.

 

R365: Day 39 – For loops

[Plot: log parasite load for me and you over time]

Busted my hand yesterday, so I can't really type.

For loops are useful for iterating a process a given number of times.

This example outlines how to set up a basic for loop.

The cat() call (haha) might not be necessary; there are probably ways around it, but I can't figure them out right now.

 

#based on code developed by Carol Hom
#a for loop to monitor the progress of a gut parasite
#pretend I had more than you initially, but I am fairly resilient against the parasites
#set initial parameters for parasite populations
me.init=800;
you.init=10;
#set a vector to record the populations as the for-loop progresses
me=rep(1,100);
you=rep(1,100);
#set initial population. me is a vector so me[] tells you which point in the vector to set as the initial
me[1]=me.init;
you[1]=you.init;
#the for loop has to be of length me. One way to get around that is to set it to length(me) instead of 30
for (t in 2:length(me)){
 #set how the populations will grow; you could also use me.init instead of me[1]
 me[t]=me[1]*(0.99)^t
 you[t]=you[1]*(1.01)^t
 #concatenate time, pop me, pop you, and end the line
 cat(t,me[t],you[t],"\n");
 #don't forget to close out the loop
}
#create a matrix called data by binding the two columns called me and you
data=cbind(log(me),log(you));
data;
#plot the log of the parasite loads in me and you
matplot(1:100,data,pch=20,xlab="time",ylab="log parasite load",type="b");

R365: Day 36 – {xts}

So as we saw in the previous post, {zoo} is a useful and powerful tool for handling and analyzing time series data. Unfortunately, the datasets that I am working with are 1) irregularly spaced and 2) irregularly timed. This means that some of the ordinary things I could do with time series data (lags, ACFs, CCFs, etc.) I cannot do without some manipulation and coercion. The R package {xts} can help by letting you merge two time series datasets without lining them up or organizing them beforehand. An article I found gave me the idea to use {xts}, as it answered my problem of how to plot series that sit on different time indices. Therefore I will stea- er… borrow their example (with a few modifications) to see how the package and function work.

library(xts)
##set up the dates for the example system
date.a<-seq(as.Date('2014-01-01'),as.Date('2014-02-01'),by = 2)
date.b<-seq(as.Date('2014-01-01'),as.Date('2014-02-15'),by = 3)
date.c<-seq(as.Date('2014-01-01'),as.Date('2014-02-01'),by = 1)
##create data frames for the time series
df.a <- data.frame(time=date.a, A=sin((1:16)*pi/8))
df.b <- data.frame(time=date.b, B=cos((1:16)*pi/8))
df.c <- data.frame(time=date.c, C=cos((1:16)*pi/16))
##Merge the time series as xts() time series using merge()
my.ts <- merge(xts(df.a$A,df.a$time),xts(df.b$B,df.b$time),xts(df.c$C,df.c$time))
##look at what the time series looks like
my.ts
##treat my.ts as a zoo object, and use na.approx to replace NAs with interpolated values
z <- na.approx(as.zoo(my.ts))
# plot.zoo
plot(z, screen = 1, col = 1:3, ylab = "Y", xlab = "")
# xyplot.zoo
library(lattice)
xyplot(z, screen = 1, col = 1:3, ylab = "Y", xlab = "")
# autoplot.zoo
library(ggplot2)
autoplot(z, facet = NULL) + ylab("Y")

You even get to pick your poison as far as {lattice}, {ggplot2}, or base graphics are concerned:

[Plot: the three merged, interpolated series on one set of axes]

merge() is a really cool and powerful data-organization function in {zoo}/{xts} that lets you line up data points along a common time index. I know I have spent good chunks of time organizing things in Excel that I probably could have done in five minutes here.

In the above example, we chose to interpolate between NA values using the na.approx() function. We can also do other things with NA values, such as omit them (na.omit() from {stats}), aggregate them (na.aggregate() from {zoo}), or trim off some of the values (na.trim() from {zoo}). Beware that some of these have consequences! For instance, na.omit() drops rows that have ANY NAs, so the resulting graph looks like this:

[Plot: the result after na.omit() drops incomplete rows]

because two of our time series did not have dates after the 31st! Bummer.

What happens if you don’t address the NAs and you just leave it as an as.zoo() object?

[Plot: the result with NAs left in place]

Also a bummer. It looks like you have to address the NAs somehow, but I have no good ideas for those of you who are squeamish about aggregating or interpolating data, other than “just plot the points, idiot”.
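
For completeness, here is roughly what the other NA strategies mentioned above would look like (na.aggregate() fills NAs with the column mean by default, and na.trim() only drops NA rows at the start and end of the series):

library(zoo)
z.omit <- na.omit(as.zoo(my.ts)) #drops every row that contains any NA
z.trim <- na.trim(as.zoo(my.ts)) #only trims NA rows off the ends
z.agg <- na.aggregate(as.zoo(my.ts)) #replaces NAs with the column mean
plot(z.agg, screen = 1, col = 1:3, ylab = "Y", xlab = "")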