R365: Day 43 – counting cumulative events

So I need to modify a model that I built for my master’s degree so that it is all in R. The advantages are that way, more people can read, use, and check over my work. The downside is that Excel is really easy (though not necessarily efficient) to program in once you get the hang of it. Part of the code requires that I know how many consecutive hours in a day fit a temperature requirement.

This post from StackOverflow was super helpful on how to check for how many consecutive values are true, which is step one of my problem. It was also a really cool article because it compared the computation time between a single line of code using lapply() versus setting up your own function. It compared the efficiency using system.time() for each function.

Generally, apply() and family are really nice at applying a function over a range, but they are not always (!) more efficient (i.e. faster) than using a function or a for-loop, supposedly because they run their own for-loop within the apply() function. This article in StackOverflow talked about how sapply(), tapply(), and lapply() are implemented using C, making them efficient, while apply() is a wrapper function; wrapper functions call other functions. This article gives a good summary of how to use the apply() family if you’re interested.

x <- sample(0:1,1000,T)
#define a function called cumul_zeros
cumul_zeros <- function(x) {
 #tell it to count 0's
 x <- !x
 #compute the run lengths for 0's
 rl <- rle(x)
 #tell R to only look at the lengths, not the values (you have already told it to only count lengths of 0s)
 len <- rl$lengths
 #set a variable so that it knows which values were of what length
 v <- rl$values
 #cumulative length, so that you know where things are in the series
 cumLen <- cumsum(len)
 #define x as z
 z <- x
 # replace the 0 at the end of each zero-block in z by the 
 # negative of the length of the preceding 1-block...
 iDrops <- c(0, diff(v)) < 0
 z[ cumLen[ iDrops ] ] <- -len[ c(iDrops[-1],FALSE) ]
 # ... to ensure that the cumsum below does the right thing.
 # We zap the cumsum with x so only the cumsums for the 1-blocks survive:

But I have no idea what is going on in hte last three lines of live code, it looks like … they replace something with a negative value?

The one line of code that is more inefficient seems easier to understand

(!x) * unlist(lapply(rle(x)$lengths, seq_len))

The negation of x (the zeros) are counted by their length.

Now we want to see if we can get this thing to count the maximum consecutive value over a 24 hour period. From reading a few StackOverflow articles, it looks like tapply might work for our problem. tapply() applies a summary function to a vector if it is given a certain factor. So in our case, I can tell R to compute the maximum value for a day. I have another problem in that I need to calculate my function at 6AM, not midnight. This means that I cannot set the factor to be Julian calendar date, which would have been nice and easy. For this example I will ignore that problem.

x <- sample(0:1,1000,T)
#the logical seems to tell the operation only to count stuff that is zero
w=(!x) * unlist(lapply(rle(x)$lengths, seq_len))
#set up a vector to hold hours
#A vector to describe julian calendar date
d=rep(1:365, each=24, len = 1000)
#create a dataframe to hold both hours and the consecutives
##try using tapply? see http://www.r-bloggers.com/r-function-of-the-day-tapply/ and 

and the output looks like

> r
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 
 5 3 3 3 3 6 3 2 4 2 5 7 6 6 4 5 3 1 3 3 2 3 3 6 4 9 4 4 2 5 4 3 3 5 4 2 7 3 6 6 4 3

which is awesome! exactly what I need. Next up – creating a cumulative index based on a rule about the output.