R365: Day 4: poibin and pnn
Today I did field work for most of the day and could not get to coding until the evening. I searched through the comprehensive list of R packages and randomly picked out a package, “poibin”, that sounded interesting. The problem (and wonderful thing!) about just floating through the list of packages is that it is easy to spot other packages that sound like they might be interesting. Some packages have obscure names, but the descriptions are usually good enough to help you figure out what they do. Because I was naturally looking to see what I had missed, I noticed a package, “pnn”, that also looked cool, so today’s post is a Daily Double!
Poibin is a package that calculates exact values for the Poisson binomial distribution. For a second, before I read the vignette for the package, I thought it calculated BOTH the Poisson and the binomial, but it actually handles the Poisson binomial, a distribution I am fairly unfamiliar with. To my general knowledge, Poisson distributions are useful for counts of events in a fixed interval of time or space, and they start to look like normal distributions once the mean gets large. Binomial distributions are useful for binary outcomes, like presence/absence data, and they count the successes in a fixed number of independent trials that all share the same probability. The Poisson binomial distribution (I learned from Wikipedia) is the generalized form of the binomial distribution: the trials are still independent, but their probabilities are NOT assumed to be equal to one another. I have been meaning for a long time to learn what the different distributions are and what they best represent, and this was a perfect opportunity.
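To make the definition concrete, here is a minimal base R sketch of the exact Poisson binomial probabilities. It does not use poibin at all; the function name poisson_binomial_pmf and the example probabilities are made up for illustration, and the one-trial-at-a-time approach is just a simple way to get the answer, not necessarily how the package does it.

```r
# Exact PMF of a Poisson binomial: the number of successes in independent
# trials whose success probabilities may differ.
# poisson_binomial_pmf is a hypothetical helper, not a poibin function.
poisson_binomial_pmf <- function(probs) {
  pmf <- 1  # with zero trials, P(0 successes) = 1
  for (p in probs) {
    # Add one trial: it either fails (count unchanged) or succeeds (count + 1)
    pmf <- c(pmf * (1 - p), 0) + c(0, pmf * p)
  }
  pmf  # pmf[k + 1] is P(exactly k successes)
}

equal_probs <- rep(0.3, 5)
mixed_probs <- c(0.1, 0.2, 0.3, 0.4, 0.5)

# With equal probabilities the result matches the ordinary binomial
matches_binomial <- all.equal(poisson_binomial_pmf(equal_probs),
                              dbinom(0:5, size = 5, prob = 0.3))

# With unequal probabilities it is still a proper distribution,
# and its mean is the sum of the individual probabilities
pmf_mixed  <- poisson_binomial_pmf(mixed_probs)
total_prob <- sum(pmf_mixed)
mean_count <- sum(0:5 * pmf_mixed)
```

The check against dbinom() is the part I find reassuring: when every trial has the same probability, the "generalized" distribution is just the binomial again.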
Poibin has one main group of functions for working with the distribution. Like other distributions I have used in the past (normal and uniform), you pick the variant of the function that suits your purpose. Mostly I like to be able to generate a random series of numbers from a distribution, which in this case is done with rpoibin(). All of the examples given in the short vignette for the package use a weight of 2 for every case. Removing the weights caused the resulting numbers to go down, but I am not really sure what the role of the weights was or what they did besides functionally increasing the value of the resulting random numbers. I am not sure if I will use poibin. I think it’s often more useful to use the specific distributions, binomial or beta binomial, and leave the generalized formula for other things. Still, it was really cool to find out more about distributions and to test drive a new package.
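Here is a rough base R sketch of what drawing random numbers from this distribution amounts to. The function r_poisson_binomial is my own stand-in, not rpoibin(), and the treatment of weights is an assumption: I am guessing that a weight of 2 means "count this probability twice", which would explain why the numbers went down when the weights were removed.

```r
set.seed(4)  # make the draws reproducible

# Hypothetical stand-in for a Poisson binomial sampler (not rpoibin itself):
# simulate every trial as a 0/1 coin flip and count successes per draw.
r_poisson_binomial <- function(n_draws, probs) {
  flips <- rbinom(n_draws * length(probs), size = 1,
                  prob = rep(probs, each = n_draws))
  # one row per draw, one column per trial
  rowSums(matrix(flips, nrow = n_draws))
}

probs <- c(0.2, 0.5, 0.8)
plain <- r_poisson_binomial(10000, probs)

# ASSUMPTION: a weight acts as a repeat count, so weights of 2 double
# the number of trials at each probability.
wts     <- c(2, 2, 2)
doubled <- r_poisson_binomial(10000, rep(probs, times = wts))

mean(plain)    # close to sum(probs)
mean(doubled)  # close to 2 * sum(probs) under the assumption above
```

If that reading of the weights is right, they change the number of trials rather than rescaling the output, so larger draws are exactly what you would expect.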
The pnn package was less straightforward but, I think, more rewarding than the poibin package. Pnn helps you create and use probabilistic neural networks. Neural networks mimic human learning by taking a set of data, analyzing it, and deciding whether new data falls into one of the categories described in the original set. They are used widely in machine learning, and research on neural networks has been going strong since at least the 1970s. Pnn has several functions for running and adapting neural networks from sets of data. Creating a functioning neural network comes in 4 steps:
1. Learn: given a set of data, create or update a neural network.
2. Smooth: set a smoothing parameter, which controls how closely the neural network follows the training data. If you set it too tight, the network starts treating random error and noise in the training data as real signal; if you set it too loose, the network ignores detail in the data that should matter for classification.
3. Performance: measure how well your neural network classifies a real set of data, which gives you an exact success rate for its calls.
4. Guess: guess the category of a new data point or vector, giving a probability for each possible category and choosing the one with the highest probability.
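The four steps above can be sketched in base R. This is my own toy version, not the pnn package: the function names (pnn_learn, pnn_smooth, pnn_perf, pnn_guess), the Gaussian kernel, the leave-one-out scoring, and the choice of sigma = 0.5 on the built-in iris data are all assumptions made for illustration.

```r
# 1. Learn: a probabilistic neural network just stores the training data
pnn_learn <- function(features, labels) {
  list(x = as.matrix(features), y = factor(labels), sigma = NA)
}

# 2. Smooth: attach the smoothing parameter (the kernel width)
pnn_smooth <- function(nn, sigma) {
  nn$sigma <- sigma
  nn
}

# 4. Guess: score each category by the average Gaussian kernel between
#    the new point and that category's training points
pnn_guess <- function(nn, new_point) {
  diffs   <- sweep(nn$x, 2, as.numeric(new_point))  # subtract point from each row
  sq_dist <- rowSums(diffs^2)
  kernel  <- exp(-sq_dist / (2 * nn$sigma^2))
  scores  <- vapply(split(kernel, nn$y), mean, numeric(1))
  probs   <- scores / sum(scores)  # assumes sigma is not so small that all scores underflow to 0
  list(category = names(which.max(probs)), probabilities = probs)
}

# 3. Performance: leave each training point out in turn and see whether
#    the rest of the network guesses its category correctly
pnn_perf <- function(nn) {
  hits <- vapply(seq_len(nrow(nn$x)), function(i) {
    held_out <- list(x = nn$x[-i, , drop = FALSE], y = nn$y[-i], sigma = nn$sigma)
    pnn_guess(held_out, nn$x[i, ])$category == as.character(nn$y[i])
  }, logical(1))
  mean(hits)  # fraction of correct calls
}

nn       <- pnn_smooth(pnn_learn(iris[, 1:4], iris$Species), sigma = 0.5)
accuracy <- pnn_perf(nn)
guess    <- pnn_guess(nn, c(5.1, 3.5, 1.4, 0.2))
guess$category
guess$probabilities
```

Changing sigma and rerunning pnn_perf() is a quick way to see the trade-off from step 2: very small values make the network cling to individual training points, and very large values blur the categories together.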
I really liked the documentation and the examples in the package vignette. They were very clear, and the code worked every time. I like that the authors included the line library(pnn) at the top of every example so that there wouldn’t be weird issues with loading the package before using it, something I always forget to do.
Given the relatively short amount of time that I devote to each of these posts (~1-2 hours), I generally stay away from larger packages, which might have a great many functions and would be difficult to cover in any depth or quality. Even testing out all of the functions might be difficult for some of the larger packages. To compensate for this, I will be splitting up big packages in the future so that I can cover them in more depth. I am very interested in looking at the ggplot2 and VGAM packages, and I might devote a whole week to each of them. We’ll see.