Month: April 2014

R365: Day 44 – Cumulative Index with Upper and Lower Boundaries

This topic sounds SUPER specific; when I was writing that title I was like “no one is going to read this, oh well.”

BUT! What this is useful for is setting up an index where you know that the index should not go above or below a certain point.

Yeah… that’s pretty specific…

BUT it is useful in plant pathology for calculating the Gubler-Thomas (GT) powdery mildew risk index. I need to program the index for work.

Here is some example script for how to set upper and lower thresholds using ifelse() statements:

#set up a vector with data
a=1:10
#set up a function, name it whatever
GTA=function(a){
 #have one of your functions do what you want;
 #this function checks whether a is divisible by 3: if it is NOT, take sqrt(a); if it is, take a^4
 b=ifelse(a%%3>0,sqrt(a),a^4)
 #this checks if the value is over 10
 c=ifelse(b>10,10,b)
 #this checks if the value is under -10
 d=ifelse(c<(-10),-10,c)
 print(d)
}
#run the function
q=GTA(a)
#...and plot it out
plot(q)
lines(q)

Then see the beautiful plot:

Rplot14
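As an aside, the clamping part (the second and third ifelse() steps in GTA()) can also be written with base R’s pmin() and pmax(). This is just a sketch of an equivalent approach, not part of the GT index itself:

```r
#clamp a vector to [-10, 10] with pmin()/pmax()
#pmax() raises anything below the floor; pmin() lowers anything above the ceiling
clamp=function(b, lower=-10, upper=10){
 pmin(pmax(b, lower), upper)
}
clamp(c(-50, -3, 0, 7, 42))
#returns -10 -3 0 7 10
```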

 


R365: Day 43 – counting cumulative events

So I need to modify a model that I built for my master’s degree so that it is all in R. The advantage is that, that way, more people can read, use, and check over my work. The downside is that Excel is really easy (though not necessarily efficient) to program in once you get the hang of it. Part of the code requires that I know how many consecutive hours in a day meet a temperature requirement.

This post from StackOverflow was super helpful for checking how many consecutive values are true, which is step one of my problem. It was also a really cool read because it compared the computation time of a single line of code using lapply() versus setting up your own function, measuring the efficiency of each with system.time().

Generally, apply() and family are really nice for applying a function over a range, but they are not always (!) more efficient (i.e. faster) than a function or a for-loop, supposedly because they run their own for-loop internally. This article on StackOverflow talked about how sapply(), tapply(), and lapply() are implemented in C, making them efficient, while apply() is a wrapper function; wrapper functions call other functions. This article gives a good summary of how to use the apply() family if you’re interested.
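As a rough illustration of that point (timings are machine-dependent, and this is my own sketch, not the StackOverflow benchmark itself), you can race a pre-allocated for-loop against sapply() with system.time():

```r
#square 100,000 numbers two ways and time each
n <- 1e5
xx <- runif(n)
#for-loop version, with the result vector pre-allocated
t.loop <- system.time({
 out1 <- numeric(n)
 for (i in seq_len(n)) out1[i] <- xx[i]^2
})
#sapply() version
t.sapply <- system.time(out2 <- sapply(xx, function(v) v^2))
t.loop
t.sapply
all.equal(out1, out2)
#returns TRUE - both give the same answer
```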

x <- sample(0:1,1000,T)
#define a function called cumul_zeros
cumul_zeros <- function(x) {
 #tell it to count 0's
 x <- !x
 #compute the run lengths for 0's
 rl <- rle(x)
 #tell R to only look at the lengths, not the values (you have already told it to only count lengths of 0s)
 len <- rl$lengths
 #set a variable so that it knows which values were of what length
 v <- rl$values
 #cumulative length, so that you know where things are in the series
 cumLen <- cumsum(len)
 #define x as z
 z <- x
 # replace the 0 at the end of each zero-block in z by the 
 # negative of the length of the preceding 1-block...
 iDrops <- c(0, diff(v)) < 0
 z[ cumLen[ iDrops ] ] <- -len[ c(iDrops[-1],FALSE) ]
 # ... to ensure that the cumsum below does the right thing.
 # We zap the cumsum with x so only the cumsums for the 1-blocks survive:
 x*cumsum(z)
}

But I have no idea what is going on in the last three lines of live code; it looks like… they replace something with a negative value?

The one line of code that is less efficient seems easier to understand:

(!x) * unlist(lapply(rle(x)$lengths, seq_len))

The negation of x (the zeros) are counted by their length.
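To make that concrete, here is a tiny worked example of the one-liner:

```r
x <- c(1, 0, 0, 1, 0)
#runs of x: 1 (length 1), 0 (length 2), 1 (length 1), 0 (length 1)
#seq_len() restarts a counter inside every run, and (!x) zeroes out the 1-runs
w <- (!x) * unlist(lapply(rle(x)$lengths, seq_len))
w
#returns 0 1 2 0 1
```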

Now we want to see if we can get this thing to count the maximum consecutive value over a 24-hour period. From reading a few StackOverflow articles, it looks like tapply() might work for our problem. tapply() applies a summary function to a vector, grouped by a given factor. So in our case, I can tell R to compute the maximum value for each day. I have another problem in that I need to calculate my function at 6 AM, not midnight. This means that I cannot set the factor to be Julian calendar date, which would have been nice and easy. For this example I will ignore that problem.
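For what it’s worth, one possible way around the 6 AM problem (just a sketch with made-up variable names, not something I have tested against real data) is to shift the hour index back six hours before computing the day factor, so each “day” runs 6 AM to 6 AM:

```r
#hypothetical hourly index where hour 1 is midnight of day 1
hour.index <- 1:1000
#shift the clock back 6 hours, then integer-divide by 24 to get a 6 AM-to-6 AM day factor
day.6am <- (hour.index - 7) %/% 24 + 1
#the first six hours (midnight to 6 AM) fall into "day 0", the tail of the previous day
head(day.6am, 8)
#returns 0 0 0 0 0 0 1 1
```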

x <- sample(0:1,1000,T)
#the logical seems to tell the operation only to count stuff that is zero
w=(!x) * unlist(lapply(rle(x)$lengths, seq_len))
#set up a vector to hold hours
y=rep_len(1:24,length.out=1000)
#A vector to describe julian calendar date
d=rep(1:365, each=24, len = 1000)
#create a dataframe to hold both hours and the consecutives
z=data.frame(d,y,w)
##try using tapply? see http://www.r-bloggers.com/r-function-of-the-day-tapply/ and 
##http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega
r=tapply(z$w,z$d,max)
r

and the output looks like

> r
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 
 5 3 3 3 3 6 3 2 4 2 5 7 6 6 4 5 3 1 3 3 2 3 3 6 4 9 4 4 2 5 4 3 3 5 4 2 7 3 6 6 4 3

which is awesome! Exactly what I need. Next up: creating a cumulative index based on a rule about the output.

 

R365: Day 42 {rworldmap}

{rworldmap} is a cool tool that can be used to map global data on either a country or gridded scale.

Below is the population density of Latin America:

Rplot13

require(rworldmap)
mapGriddedData(mapRegion="latin america")

Lots of cool datasets, including land use, productivity, GDP, population (circa 2005), and biodiversity.

 

R365: Day 41 – {AMOEBA}

Hand still busted. I ran across {AMOEBA} while looking for spatial analysis packages. {AMOEBA} is a package with a single function (!), AMOEBA(). AMOEBA stands for A Multidirectional Optimum Ecotope-Based Algorithm, which lets you calculate spatial clusters and has an awesome name. Using their base example, we can plot crime across Columbus, OH.

Rplot08

#looking at the AMOEBA help example
###########################################
########## Example of boundaries ##########
###########################################
require(AMOEBA)
### Columbus OH spatial analysis data set (spdep package)
data(columbus)
require(maptools)
map<-readShapePoly(system.file('etc/shapes/columbus.shp',package='spdep')[1])
### Application of AMOEBA
res<-AMOEBA(columbus$CRIME,col.gal.nb,1,1)
### Plot
color<-rev(rainbow(length(table(res)),start=0,end=2/6))
plot(map,col=color[as.factor(res)])
title('Clustering of crimes at Columbus (OH)')
names<-c('\nLow','\nMedium','\nHigh')
require(plotrix)
color.legend(map@bbox[1,1],map@bbox[2,1]-0.2,map@bbox[1,2],map@bbox[2,1]-0.4,names,color,align='rb')

Or % houses without plumbing

Rplot11

 

or housing value

Rplot12

 

I don’t know why they didn’t just build AMOEBA into {spdep}, the mother package, but it’s still cool.

R365: Day 40 – if else statements in for loops

My hand is still busted so I can’t really type so well.

Modifying the code that we used in for loops, let’s try to add an if-else statement:

#based on code developed by Carol Hom
#a for loop to monitor the progress of a gut parasite
#pretend I had more than you initially, but I am fairly resilient against the parasites
#set initial parameters for parasite populations
me.init=800;
you.init=10;
#set a vector to record the populations as the for-loop progresses
me=rep(1,100);
you=rep(1,100);
#set initial population. me is a vector so me[] tells you which point in the vector to set as the initial
me[1]=me.init;
you[1]=you.init;
#the for loop has to be of length me. One way to get around that is to set it to length(me) instead of 30
for (t in 2:length(me)){
 #set how the pop will grow. you could also use me.init instead of me[1]
#insert an if statement that does an even coin flip
 if(rbinom(1,1,(1/2))>=1){
 me[t]=me[1]*(0.9)^t
 you[t]=you[1]*(1.1)^t
 } else {
#in this scenario, every other day (chosen at random) you are more resistant than I am
 me[t]=me[1]*(1.2)^t
 you[t]=you[1]*(0.9)^t
}
#always close the loops
 #concatenate time, pop me, pop you, and end the line
 cat(t,me[t],you[t],"\n");
 #don't forget to close out
}
#create a matrix called data by binding the two columns called me and you
data=cbind(log(me),log(you));
data;
#plot the log of the parasite loads in me and you
matplot(1:100,data,pch=c(20,25),xlab="time",ylab="log parasite load",type="b");

This leads to oscillating dynamics in the model:

Rplot09

because (net) you are more resistant, you end up having a lower overall number of parasites.

What happens if it is not an even flip? Say 1 in 10 events are bad days for me? (changing to rbinom(…, prob=0.1))

Rplot10

Similar dynamics! The frequency changes but the overall trend looks similar! Very cool.

 

R365: Day 39- For loops

Rplot08

Busted my hand yesterday so I can’t really type.

For loops are useful for iterating a process a given number of times.

This example outlines how to set up a basic for loop.

The cat() call (haha) might not be necessary; there might be ways around it, but I can’t figure them out right now.
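For what it’s worth, here is a minimal sketch of one way around it: skip cat() inside the loop entirely and just print the whole vector once the loop is done.

```r
#a stripped-down version of the loop below, with no cat() call
me <- rep(1, 10)
me[1] <- 800
for (t in 2:length(me)) {
 me[t] <- me[1] * (0.99)^t
}
#inspect everything at once after the loop finishes
round(me, 2)
```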

 

#based on code developed by Carol Hom
#a for loop to monitor the progress of a gut parasite
#pretend I had more than you initially, but I am fairly resilient against the parasites
#set initial parameters for parasite populations
me.init=800;
you.init=10;
#set a vector to record the populations as the for-loop progresses
me=rep(1,100);
you=rep(1,100);
#set initial population. me is a vector so me[] tells you which point in the vector to set as the initial
me[1]=me.init;
you[1]=you.init;
#the for loop has to be of length me. One way to get around that is to set it to length(me) instead of 30
for (t in 2:length(me)){
 #set how the pop will grow. you could also use me.init instead of me[1]
 me[t]=me[1]*(0.99)^t
 you[t]=you[1]*(1.01)^t
 #concatenate time, pop me, pop you, and end the line
 cat(t,me[t],you[t],"\n");
 #don't forget to close out
}
#create a matrix called data by binding the two columns called me and you
data=cbind(log(me),log(you));
data;
#plot the log of the parasite loads in me and you
matplot(1:100,data,pch=c(20,25),xlab="time",ylab="log parasite load",type="b");

R365: Day 36- {xts}

So as we saw in the previous post, {zoo} is a useful and powerful tool for handling and analyzing time series data. Unfortunately, the datasets that I am working with are 1) irregularly spaced and 2) irregularly timed. This means that some of the ordinary things that I could do with time series data (lags, ACFs, CCFs, etc.) I cannot do without some manipulation and coercion. The R package {xts} can help in some ways by allowing you to merge two time series datasets without lining them up or organizing them beforehand. This article gave me the idea to use {xts}, as it answered my problem of how to plot things with different time series. Therefore I will stea- er… borrow their example (with a few modifications) to see how this package and function work.

##set up the dates for the example system
date.a<-seq(as.Date('2014-01-01'),as.Date('2014-02-01'),by = 2)
date.b<-seq(as.Date('2014-01-01'),as.Date('2014-02-15'),by = 3)
date.c<-seq(as.Date('2014-01-01'),as.Date('2014-02-01'),by = 1)
##create data frames for the time series
df.a <- data.frame(time=date.a, A=sin((1:16)*pi/8))
df.b <- data.frame(time=date.b, B=cos((1:16)*pi/8))
df.c <- data.frame(time=date.c, C=cos((1:16)*pi/16))
##Merge the time series as xts() objects using merge() (this needs the {xts} package loaded)
require(xts)
my.ts <- merge(xts(df.a$A,df.a$time),xts(df.b$B,df.b$time),xts(df.c$C,df.c$time))
##look at what the time series looks like
my.ts
##treat my.ts as a zoo object, and use na.approx to replace NAs with interpolated values
z <- na.approx(as.zoo(my.ts))
# plot.zoo
plot(z, screen = 1, col = 1:3, ylab = "Y", xlab = "")
# xyplot.zoo
library(lattice)
xyplot(z, screen = 1, col = 1:3, ylab = "Y", xlab = "")
# autoplot.zoo
library(ggplot2)
autoplot(z, facet = NULL) + ylab("Y")

You even get to pick your poison as far as {lattice}, {ggplot2}, or {base} is concerned:

Rplot

merge() is a really cool and powerful data-organization function in {zoo} that lets you merge data points along a common time series. I know that I have spent good chunks of time organizing things in Excel that I probably could have done in five minutes in {zoo}.

In the above example, we chose to interpolate between NA values using the na.approx() function. We can also do other things with NA values, such as omit them (using na.omit() from {stats}), aggregate them (na.aggregate() from {zoo}), or trim off some of the values (na.trim() from {zoo}). Beware that some of these can have consequences! For instance, na.omit() omits rows that have ANY NAs, so the resulting graph looks like this:

Rplot05

because two of our time series did not have dates after the 31st! Bummer.
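The difference between the two strategies is easy to see on a toy vector, using only base R (approx() from {stats} does essentially what na.approx() does for the interpolation step):

```r
y <- c(1, NA, 3, NA, 5)
#interpolate across the gaps, like na.approx(): approx() drops the NA pairs and fills xout
approx(seq_along(y), y, xout = seq_along(y))$y
#returns 1 2 3 4 5
#or drop the NAs entirely, like na.omit():
as.vector(na.omit(y))
#returns 1 3 5
```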

What happens if you don’t address the NAs and you just leave it as an as.zoo() object?

Rplot06

Also a bummer. It looks like you have to address the NAs somehow, but I have no good ideas for those of you who are squeamish about aggregating or reducing data, outside of “just plot the points, idiot”.

R365: Day 38 – If Else statements in R

One way to logically organize code is to perform if-then-else statements. You can do these sorts of statements in most coding languages and even in Excel (!), which is a really easy format if you just need to do something quickly. The basic gist of if-then-else statements goes something like this:

If (some event or situation is true)

Then (do something)

Else (in all other cases, do this other thing)

These sorts of statements are useful for organizing and quickly analyzing data. In R, you can use these to summarize data or to help add stochasticity to a model. This is probably not the most efficient way to add stochasticity, but it is effective.

Let’s pretend that it rains about three days a week. We could use historic data to back this up, but those data only go so far; eventually you need to start forecasting. You can fit a model to previous data to try to predict when rain will occur, but these models don’t always end up working so well. Depending on your goal, it can be more effective just to model hypothetical weather.

if ( rbinom(1,1,(4/7))>=1 ){ 
 "Sunny!" 
} else { 
 "Rainy!" 
}

The output from this will give you back either “Sunny!” or “Rainy!” depending on the results of the rbinom() statement.

If (haha!) that code up there looks really cluttered and you don’t feel like dealing with it, then you can use ifelse() as one statement to do the same thing

ifelse((rbinom(1,1,(4/7))>=1),"Sunny!","Rainy!")

I have no idea why people seem to prefer the if statement up there; it seems like way more work and less intuitive. One real difference is that if() tests a single condition (and can run whole blocks of code), while ifelse() works element-by-element across a whole vector. It might also be that ifelse() is less efficient, so if you have a big chunk of code it will take longer, but that doesn’t really affect me now. Another cool thing is that you can nest if-then-else statements to create some reasonably complex tasks.

 ifelse((rbinom(1,1,(4/7))>=1),ifelse((rbinom(1,1,(13/14))>=1),"Sunny!","Thunderstorms!"),"Rainy!")

Here we nested one ifelse() within the other to say that some of the time there will be thunderstorms instead of sun.
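A handy side effect of ifelse() being a single vectorized statement is that you can simulate a whole stretch of weather in one call; a quick sketch:

```r
#draw 7 coin flips at once and label each one
week <- ifelse(rbinom(7, 1, (4/7)) >= 1, "Sunny!", "Rainy!")
week
table(week)
```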

R365: Day 37 – Random numbers

One of my friends is working on stochastic and deterministic modeling, so I thought that I would go a bit into stochastic modeling for the next few posts.

What is stochastic/chaotic/deterministic? Stochastic means truly random, where events at one time point do not affect events in the future. There are lots of ways to generate random numbers, including some really awesome books, or using atmospheric noise. Chaotic systems are not random! Chaotic systems are deterministic, but you need to know the initial conditions to make accurate predictions. Weather is considered a chaotic system. Deterministic means that the model progresses in a predictable fashion. That does not necessarily mean linear, but these models tend to be stable and (over a very long time period) tend to come to some sort of equilibrium, even if that means oscillating into infinity.

I first ran across these terms while taking a population biology class. We would analyze populations of endangered species and determine if their populations were likely to be deterministic (which in that context meant doomed to extinction). (bummer). After entering grad school I took a class on dynamic models in biology using this book as a textbook and working our way through life tables, matrices, ODEs and attractors.

The heart of any stochastic modeling is the ability to generate randomness. There are lots of different ways to create random numbers, but some of the simplest are the use of runif() or sample(). Both of them will give you a random number from a uniform distribution. If you are interested in another distribution, there are lots of probability distributions, one of them might suit you better. For instance, if you just want to generate a coin tossing random integer, you can use

rbinom(100,1,prob=0.5)

Which will generate 100 random digits of either 0 or 1. If you are interested in an uneven split, you can easily adjust the prob= argument to suit your needs. If what you need is an actual number, then sample() is a good tool. sample() can be used with or without replacement, so if you wanted to sample something from a group only once, you could. sample() also lets you sample from either a vector or a positive integer. So if you had a vector of things like names (like a list of states), then you could sample randomly from those as well. For instance,

a=sample(state.name, 10, replace=FALSE)
 [1] "New Mexico" "Nevada" "Colorado" "Nebraska" "Hawaii" "California" "Tennessee" "Delaware" 
 [9] "West Virginia" "Connecticut"

or you could combine this with {maps} to create

Rplot08

require(maps)
a=sample(state.name, 10, replace=FALSE)
map('state', interior = FALSE)
map('state', boundary = FALSE, lty = 5, add = TRUE)
map('state',region=a,fill=TRUE,col=heat.colors(10),add=TRUE)

 This post was pretty useful for looking into random numbers.
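One last caveat: all of these generators give different numbers every time you run them. If you need a “random” result that other people can reproduce (say, for a paper), set.seed() pins the generator first:

```r
#same seed, same draw, every time
set.seed(42)
a <- sample(state.name, 10, replace = FALSE)
set.seed(42)
b <- sample(state.name, 10, replace = FALSE)
identical(a, b)
#returns TRUE
```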

R365: Day 35 – {zoo}

Most people in science have to deal with time series at some point or another: you measured glucose levels in blood over a period of weeks, you recorded the number of birds in a habitat over a summer, you monitored the efficiency of an enzyme as it was overwhelmed with substrate. Beyond science, people like economists monitor for trends in markets over time (e.g. the number of new home buyers in the US from 2006-2012). Whatever your poison, time series are inescapable. We have already covered some of the tools that you can use to view and analyse time series, like ts() and acf() and detrend(). While these tools are useful, they don’t always work if your time series is irregular (irregular data intervals, time data in different units, etc.). {zoo} is a useful package if you do anything with time series.

Something that {zoo} does not like to do is handle duplicate time data. So if you have two data points from the same date in a {zoo} time series, many of the zoo functions will not work. One way to handle that is to average the replicated time points using aggregate():

> z <- suppressWarnings(zoo(1:8, c(1, 2, 2, 2, 3, 4, 5, 5)))
> z
1 2 2 2 3 4 5 5 
1 2 3 4 5 6 7 8 
> aggregate(z, identity, mean)
 1 2 3 4 5 
1.0 3.0 5.0 6.0 7.5

or just taking the last data point in a series of repeated data points:

> aggregate(z, identity, tail, 1)
1 2 3 4 5 
1 4 5 6 8

and you can even interpolate between time points if that is useful to you

> time(z) <- na.approx(ifelse(duplicated(time(z)), NA, time(z)), na.rm = FALSE)
> z[!is.na(time(z))]
 1 2 2.3333 2.6667 3 4 5 
 1 2 3 4 5 6 7

{zoo} also does not like to display log scales for the x axis, although you can circumvent that using the suppressWarnings() function. You can also create your own script to get around that

> z <- zoo(1:100)
> plot(z, log = "y", panel = function(..., log) lines(...))

Rplot08

{zoo} can also read directly from .txt and .csv files, but you have to have them formatted correctly. I had an issue when I had multiple different time series (e.g. sites A, B, C, D, and E) all next to each other in one file. {zoo} read the first column as a date, but it did not interpret the other columns as dates.
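If the file has the date in the first column and one series per remaining column, read.zoo() should (in my reading of the docs; I haven’t battle-tested this against my own files) pull them all in as one multivariate zoo object. A sketch using the text= argument instead of a real file, with made-up sites A and B:

```r
library(zoo)
#three days of data for two made-up sites, A and B
txt <- "date A B
2014-01-01 1 10
2014-01-02 2 20
2014-01-03 3 30"
#format= tells read.zoo how to parse the first column as dates
z <- read.zoo(text = txt, header = TRUE, format = "%Y-%m-%d")
z
#a two-column zoo series indexed by Date
```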

 

I will talk more about {zoo} as things come up.