fapply() takes a time series, a beginning and end date, and a function to apply. Bam. Exactly what I needed.

I edited some of the base example code to fit my needs:

library(timeSeries)
library(ggplot2)

## Percentual Returns of Swiss Bond Index and Performance Index
LPP <- 100 * LPP2005REC[, c("SBI", "SPI")]
head(LPP, 20)

## Aggregate Monthly Returns
## plot(applySeries(LPP, by = "monthly", FUN = colSums), main = "SBI and SPI Index Over Time")

Next I tried to create my own time series and use the flexibility of the function:

> x <- timeSeries(runif(100), seq(Sys.Date(), length.out = 100, by = "days"))
> plot(x)
> fapply(x, from = Sys.timeDate() + 10, to = Sys.timeDate() + 15, FUN = mean)
          TS.1
[1,] 0.6262633
Error in applySeries(x = x, from = from, to = to, FUN = FUN, ...) :
  outsite of range

What is going on here? After about 20 minutes of searching Google, hampered by the aggravating misspelling "outsite of range", I could not find a solution. Eventually I saw, lower down in the package vignette, that the dates were quoted, i.e. made into strings. Note also that dates have to be in "standard unambiguous format": Year-Month-Day. And so…

> x <- timeSeries(runif(100), seq(Sys.Date(), length.out = 100, by = "days"))
> plot(x)
> fapply(x, Sys.timeDate() + 10, to = Sys.timeDate() + 15, FUN = mean)
          TS.1
[1,] 0.6262633

After working with the code a bit further, I wanted to be able to get the data from multiple time intervals without having to make a bunch of different calls. Luckily, fapply() is pretty flexible and can be coerced into accepting multiple calls for multiple rows:

LPP <- 100 * LPP2005REC[, c("SBI", "SPI")]
from <- c("2005-11-11", "2005-11-17")
to <- c("2005-12-11", "2005-12-17")
fapply(LPP, from = from, to = to, FUN = colMeans)
GMT
                   SBI       SPI
2005-12-11 0.032217205 0.1924305
2005-12-17 0.008718791 0.1292479

But what if we just want to know the mean from exactly a week after the start date? It's annoying to re-enter the dates over and over again. Saying

LPP <- 100 * LPP2005REC[, c("SBI", "SPI")]
from <- c("2005-11-11", "2005-11-17")
to <- from + 7
fapply(LPP, from = from, to = to, FUN = colMeans)

Gives us the error

Error in from + 7 : non-numeric argument to binary operator

BUT, if we wrap the from dates in as.Date(), we get…

LPP <- 100 * LPP2005REC[, c("SBI", "SPI")]
from <- as.Date(c("2005-11-11", "2005-11-17"))
to <- from + 7
fapply(LPP, from = from, to = to, FUN = colMeans)
GMT
                   SBI       SPI
2005-11-18 0.015421400 0.3817752
2005-11-24 0.002559783 0.4332508


# get data from server
# data is an example of what factors result in admission (binary) to grad school: GPA, school rank, and GRE score
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
## view the first few rows of the data
head(mydata)
# look at a summary of the data
summary(mydata)
# I like boxplots along with summaries, they help me
boxplot(mydata$rank)
# use sapply() to get the standard deviation of the data
# sapply() works on a list or vector of data and returns a vector
# apply() did not work on this dataset, but lapply() did, though it returned a list
# lapply(mydata, sd)
sapply(mydata, sd)
# now they want us to do a two-way contingency table for categorical outcomes
# contingency tables are useful, but they can sometimes get cumbersome if there are lots of variables
# the help reference manual for xtabs() suggests using ftable() if you have lots of variables you want to look at
xtabs(~admit + rank, data = mydata)
# First they switch rank to be a 'factor'
mydata$rank <- factor(mydata$rank)
# Then they use a generalized linear model to run the logistic regression
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
# now you can call mylogit or a summary of mylogit
summary(mylogit)
## The output shows that all of the coefficients are significant. The estimate shows the log odds that a single unit change has on the outcome
## log odds are how logistic regressions express the odds of an outcome
## odds ratios are the odds of success over the odds of failure (e.g. 80% success, .8/.2, or 4 to 1)
## log odds allow this to be expressed over a range of -Inf to +Inf
# see http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm for a better explanation
# let's look at the confidence intervals for the analysis
confint.default(mylogit)
# if you have a ranked item (like school rank in this example) you can use a Wald test to see if rank is significant overall
library(aod)  # wald.test() comes from the {aod} package
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
# now we can look at the coefficients as true odds ratios by exponentiating them
# for example, a one-unit increase in GPA makes admission 2.23x more likely
exp(coef(mylogit))
# what if you want to focus in on the effects of one coefficient?
# you can hold the other coefficients at their mean and test out the different ranks
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
# here we see that, GPA and GRE being average, a student from a rank 1 university has a 52% chance of getting in, whereas a rank 4 student has an 18% chance
# we can look at other factors as well, like GRE and rank
# first create 100 subsets between 200 and 800 GRE score
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100), 4),
                                    gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
# now estimate the predicted probability based on GRE and rank
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link", se = TRUE))
newdata3 <- within(newdata3, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})
## view first few rows of final dataset
head(newdata3)
# now plot using ggplot2
library(ggplot2)
ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
  geom_line(aes(colour = rank), size = 1)
# now we can look at how effective our 3-coefficient model was at predicting, and whether it was better than an empty model
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
# with a chi square of 41.46 and a p-value of 7.6e-8, this model is better than an empty model


- easy to summarize noisy data
- see overall patterns across time
- c’mon, it looks cooler

library(xts)  # xts() and apply.weekly()/apply.monthly() come from {xts}
xts.ts <- xts(rnorm(231), as.Date(13514:13744, origin = "1970-01-01"))
plot(xts.ts)
plot(apply.monthly(xts.ts, mean))
par(mfrow = c(3, 1))
plot(xts.ts, main = "Original Data")
plot(apply.weekly(xts.ts, mean), main = "Weekly Averages")
plot(apply.monthly(xts.ts, mean), main = "Monthly Averages")

The downside is that some of those peaks that seem really important when you zoom out to weekly or monthly data are not super noticeable at the daily level. It depends on what you want to do with the analysis.


BUT! What this is useful for is setting up an index where you know that the index should not go above or below a certain point.

Yeah… that's pretty specific…

BUT it is useful in plant pathology for calculating the Gubler-Thomas (GT) powdery mildew risk index. I need to program the index for work.

Here is some example script for how to set upper and lower thresholds using ifelse() statements:

# set up a vector with data
a <- rep(1:10)
# set up a function, name it whatever
GTA <- function(a) {
  # have one of your functions do what you want;
  # this one checks whether a is divisible by 3: if not, take sqrt(a); if it is, take a^4
  b <- ifelse(a %% 3 > 0, sqrt(a), a^4)
  # this checks if the value is over 10, and caps it at 10
  c <- ifelse(b > 10, 10, b)
  # this checks if the value is under -10, and floors it at -10
  d <- ifelse(c < (-10), -10, c)
  print(d)
}
# run the function
q <- GTA(a)
# ...and plot it out
plot(q)
lines(q)

Then see the beautiful plot:
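As an aside, the same clamping can be written without ifelse() at all, using base R's pmin() and pmax(). This is just an alternative sketch on a made-up vector, not the GT index itself:

```r
# clamp a toy vector to the range [-10, 10]:
# pmin() caps the top, pmax() floors the bottom
v <- c(-25, -3, 0, 7, 42)
clamped <- pmax(pmin(v, 10), -10)
clamped
# → -10 -3 0 7 10
```

Both approaches vectorize over the whole input, so it mostly comes down to which reads more clearly to you.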


This post from StackOverflow was super helpful on how to check how many consecutive values are true, which is step one of my problem. It was also a really cool article because it compared the computation time of a single line of code using lapply() versus setting up your own function, using system.time() to measure each approach.

Generally, apply() and family are really nice for applying a function over a range, but they are not always (!) more efficient (i.e. faster) than writing your own function or a for-loop, supposedly because some of them run their own for-loop internally. This StackOverflow article talked about how sapply(), tapply(), and lapply() are implemented in C, making them efficient, while apply() is a wrapper function; wrapper functions call other functions. This article gives a good summary of how to use the apply() family if you're interested.
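A quick way to see this for yourself is a toy benchmark (my own sketch, not from the articles): time a vectorized call, sapply(), and an explicit for-loop all doing the same work with system.time():

```r
x <- as.numeric(1:1e5)

# vectorized: one call straight into C code
system.time(y1 <- x^2)

# sapply(): applies the anonymous function element by element
system.time(y2 <- sapply(x, function(i) i^2))

# explicit for-loop with a pre-allocated result vector
system.time({
  y3 <- numeric(length(x))
  for (i in seq_along(x)) y3[i] <- x[i]^2
})

# all three produce the same answer; only the timings differ
```

On my understanding, the vectorized version should win comfortably, with sapply() and the loop trading places depending on the task.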

x <- sample(0:1, 1000, TRUE)
# define a function called cumul_zeros
cumul_zeros <- function(x) {
  # tell it to count 0's
  x <- !x
  # compute the run lengths for 0's
  rl <- rle(x)
  # tell R to only look at the lengths, not the values
  # (you have already told it to only count lengths of 0s)
  len <- rl$lengths
  # set a variable so that it knows which values were of what length
  v <- rl$values
  # cumulative length, so that you know where things are in the series
  cumLen <- cumsum(len)
  # define x as z
  z <- x
  # replace the 0 at the end of each zero-block in z by the
  # negative of the length of the preceding 1-block...
  iDrops <- c(0, diff(v)) < 0
  z[cumLen[iDrops]] <- -len[c(iDrops[-1], FALSE)]
  # ... to ensure that the cumsum below does the right thing.
  # We zap the cumsum with x so only the cumsums for the 1-blocks survive:
  x * cumsum(z)
}

But I have no idea what is going on in the last three lines of live code; it looks like… they replace something with a negative value?

The one line of code that is less efficient seems easier to understand:

(!x) * unlist(lapply(rle(x)$lengths, seq_len))

The negation of x picks out the zeros, and each run of zeros gets counted up by its length.
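To see what the one-liner actually produces, here is a small worked example on a toy vector (the input values are made up):

```r
x <- c(1, 0, 0, 1, 0, 0, 0)
# rle(x)$lengths is 1 2 1 3; seq_len() expands each run into 1..length,
# and multiplying by (!x) zeroes out the counts that fall on the 1-runs
runs <- (!x) * unlist(lapply(rle(x)$lengths, seq_len))
runs
# → 0 1 2 0 1 2 3
```

Each run of zeros counts up from 1, and every 1 in the input resets the count to 0.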

Now we want to see if we can get this thing to count the maximum consecutive run over a 24-hour period. From reading a few StackOverflow articles, it looks like tapply() might work for our problem. tapply() applies a summary function to a vector, grouped by a given factor. So in our case, I can tell R to compute the maximum value for each day. I have another problem in that I need to calculate my function at 6 AM, not midnight. This means that I cannot set the grouping factor to be the Julian calendar date, which would have been nice and easy. For this example I will ignore that problem.
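For what it's worth, one possible workaround for the 6 AM problem (my own sketch, not part of this example) is to shift the hour index back by 6 before computing the day grouping, so each "day" runs from 6 AM to 6 AM:

```r
# toy hourly index: 48 hourly records, hour 0 = midnight of day one
hrs <- 0:47

# day factor with days starting at midnight
day.midnight <- hrs %/% 24

# day factor with days starting at 6 AM: shift by 6 hours before dividing
day.6am <- (hrs - 6) %/% 24
table(day.6am)
# hours 0-5 land in day -1, i.e. the tail of the previous 6AM-to-6AM window
```

The shifted factor could then be handed to tapply() in place of the calendar date.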

x <- sample(0:1, 1000, TRUE)
# the logical tells the operation only to count runs of zeros
w <- (!x) * unlist(lapply(rle(x)$lengths, seq_len))
# set up a vector to hold hours
y <- rep_len(1:24, length.out = 1000)
# a vector to describe Julian calendar date
d <- rep(1:365, each = 24, len = 1000)
# create a data frame to hold the dates, hours, and the consecutive counts
z <- data.frame(d, y, w)
## try using tapply? see http://www.r-bloggers.com/r-function-of-the-day-tapply/ and
## http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega
r <- tapply(z$w, z$d, max)
r

and the output looks like

> r
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
 5  3  3  3  3  6  3  2  4  2  5  7  6  6  4  5  3  1  3  3  2
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
 3  3  6  4  9  4  4  2  5  4  3  3  5  4  2  7  3  6  6  4  3

Which is awesome! Exactly what I need. Next up: creating a cumulative index based on a rule about the output.


Below is the population density of Latin America:

require(rworldmap)
mapGriddedData(mapRegion = "latin america")

Lots of cool datasets, including land use, productivity, gdp, population (circa 2005), and biodiversity.


# looking at the AMOEBA help example
###########################################
########## Example of boundaries ##########
###########################################
require(AMOEBA)
### Columbus OH spatial analysis data set (spdep package)
data(columbus)
require(maptools)
map <- readShapePoly(system.file('etc/shapes/columbus.shp', package = 'spdep')[1])
### Application of AMOEBA
res <- AMOEBA(columbus$CRIME, col.gal.nb, 1, 1)
### Plot
color <- rev(rainbow(length(table(res)), start = 0, end = 2/6))
plot(map, col = color[as.factor(res)])
title('Clustering of crimes at Columbus (OH)')
names <- c('\nLow', '\nMedium', '\nHigh')
require(plotrix)
color.legend(map@bbox[1,1], map@bbox[2,1] - 0.2, map@bbox[1,2], map@bbox[2,1] - 0.4,
             names, color, align = 'rb')

Or the % of houses without plumbing

Or housing value

I don't know why they didn't just build AMOEBA into {spdep}, the mother package, but it's still cool.


Modifying the code that we used for for-loops, let's try to add in an if-else statement:

# based on code developed by Carol Hom
# a for-loop to monitor the progress of a gut parasite
# pretend I had more parasites than you initially, but I am fairly resilient against them
# set initial parameters for parasite populations
me.init <- 800
you.init <- 10
# set a vector to record the populations as the for-loop progresses
me <- rep(1, 100)
you <- rep(1, 100)
# set the initial populations. me is a vector, so me[1] sets the first entry
me[1] <- me.init
you[1] <- you.init
# the for-loop has to run the length of me; setting it to length(me) avoids hard-coding a number
for (t in 2:length(me)) {
  # set how the pop will grow. You could also use me[1] instead of me.init
  # insert an if statement that does an even coin flip
  if (rbinom(1, 1, 1/2) >= 1) {
    me[t] <- me[1] * (0.9)^t
    you[t] <- you[1] * (1.1)^t
  } else {
    # in this scenario, every other day (chosen at random) you are more resistant than I am
    me[t] <- me[1] * (1.2)^t
    you[t] <- you[1] * (0.9)^t
  }
  # concatenate time, pop me, pop you, and end the line
  cat(t, me[t], you[t], "\n")
# don't forget to close out the loop
}
# create a matrix called data by binding the two log-transformed population vectors
data <- cbind(log(me), log(you))
data
# plot the log of the parasite loads in me and you
matplot(1:100, data, pch = 20, xlab = "time", ylab = "log parasite load", type = "b")

This leads to oscillating dynamics in the model

Because, on net, you are more resistant, you end up with a lower overall number of parasites.

What happens if it is not an even flip? Say 1 in 10 events are bad days for me? (Changing rbinom(…, prob = 0.1).)

Similar dynamics! The frequency changes, but the overall trend looks similar. Very cool.


Busted my hand yesterday, so I can't really type.

For-loops are useful for iterating a process a given number of times.

This example outlines how to set up a basic for-loop.

The cat() call (haha) might not be necessary; there might be ways around it, but I can't figure them out right now.

# based on code developed by Carol Hom
# a for-loop to monitor the progress of a gut parasite
# pretend I had more parasites than you initially, but I am fairly resilient against them
# set initial parameters for parasite populations
me.init <- 800
you.init <- 10
# set a vector to record the populations as the for-loop progresses
me <- rep(1, 100)
you <- rep(1, 100)
# set the initial populations. me is a vector, so me[1] sets the first entry
me[1] <- me.init
you[1] <- you.init
# the for-loop has to run the length of me; setting it to length(me) avoids hard-coding a number
for (t in 2:length(me)) {
  # set how the pop will grow. You could also use me[1] instead of me.init
  me[t] <- me[1] * (0.99)^t
  you[t] <- you[1] * (1.01)^t
  # concatenate time, pop me, pop you, and end the line
  cat(t, me[t], you[t], "\n")
# don't forget to close out the loop
}
# create a matrix called data by binding the two log-transformed population vectors
data <- cbind(log(me), log(you))
data
# plot the log of the parasite loads in me and you
matplot(1:100, data, pch = 20, xlab = "time", ylab = "log parasite load", type = "b")


library(xts)  # {xts} loads {zoo}, which provides na.approx() and the plotting methods
library(zoo)

## set up the dates for the example system
date.a <- seq(as.Date('2014-01-01'), as.Date('2014-02-01'), by = 2)
date.b <- seq(as.Date('2014-01-01'), as.Date('2014-02-15'), by = 3)
date.c <- seq(as.Date('2014-01-01'), as.Date('2014-02-01'), by = 1)
## create data frames for the time series
df.a <- data.frame(time = date.a, A = sin((1:16)*pi/8))
df.b <- data.frame(time = date.b, B = cos((1:16)*pi/8))
df.c <- data.frame(time = date.c, C = cos((1:16)*pi/16))
## Merge the time series as xts() time series using merge()
my.ts <- merge(xts(df.a$A, df.a$time), xts(df.b$B, df.b$time), xts(df.c$C, df.c$time))
## look at what the time series looks like
my.ts
## treat my.ts as a zoo object, and use na.approx to replace NAs with interpolated values
z <- na.approx(as.zoo(my.ts))
# plot.zoo
plot(z, screen = 1, col = 1:3, ylab = "Y", xlab = "")
# xyplot.zoo
library(lattice)
xyplot(z, screen = 1, col = 1:3, ylab = "Y", xlab = "")
# autoplot.zoo
library(ggplot2)
autoplot(z, facet = NULL) + ylab("Y")

You even get to pick your poison as far as {lattice}, {ggplot2}, or {base} graphics are concerned.

merge() is a really cool and powerful data organization function in R {zoo} that lets you merge data points along a common time series. I know that I have spent good chunks of time organizing things in Excel that I probably could have done in five minutes in {zoo}.
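As a minimal sketch of what merge() is doing (toy values of my own, assuming {zoo} is loaded): two series with partially overlapping dates get aligned on the union of their indexes, with NAs wherever a series has no observation:

```r
library(zoo)

# two tiny zoo series on partially overlapping dates
a <- zoo(c(1, 2), as.Date(c("2014-01-01", "2014-01-03")))
b <- zoo(c(10, 20), as.Date(c("2014-01-02", "2014-01-03")))

# merge() aligns them on the union of the two date indexes
m <- merge(a, b)
m  # three rows; NA where a series is missing that date
```

This is exactly the alignment step you would otherwise do by hand in a spreadsheet.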

In the above example, we chose to interpolate between NA values using the na.approx() function. We can also do other things with NA values, such as omit them (using na.omit() from {stats}), aggregate them (na.aggregate() from {zoo}), or trim off some of the values (na.trim() from {zoo}). Beware that some of these can have consequences! For instance, na.omit() drops rows that have ANY NAs, so the resulting graph looks like this:

because two of our time series did not have dates after the 31st! Bummer.
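A tiny illustration of the difference between these NA handlers (toy series of my own, assuming {zoo} is loaded):

```r
library(zoo)

# a short daily series with an NA in the middle
s <- zoo(c(1, NA, 3), as.Date("2014-01-01") + 0:2)

na.approx(s)  # interpolates the middle value: 1 2 3
na.omit(s)    # drops the NA date entirely, leaving just 1 and 3
na.trim(zoo(c(NA, 1, 2, NA), as.Date("2014-01-01") + 0:3))  # trims NAs off the ends
```

So na.approx() keeps all the dates, while na.omit() and na.trim() shorten the series.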

What happens if you don’t address the NAs and you just leave it as an as.zoo() object?

Also a bummer. It looks like you have to address the NAs somehow, but I have no good ideas for those of you who are squeamish about aggregating or reducing data, other than "just plot the points, idiot".
