Fitting distributions with R

Fitting distribution with R is something I have to do once in a while, but where do I start?

A good starting point to learn more about distribution fitting with R is Vito Ricci's tutorial on CRAN. I also find the vignettes of the actuar and fitdistrplus package a good read. I haven't looked into the recently published Handbook of fitting statistical distributions with R, by Z. Karian and E.J. Dudewicz, but it might be worthwhile in certain cases, see Xi'An's review. A more comprehensive overview of the various R packages is given by the CRAN Task View: Probability Distributions, maintained by Christophe Dutang.

How do I decide which distribution might be a good starting point?

I came across the paper Probabilistic approaches to risk by Aswath Damodaran. In Appendix 6.1 Aswath discusses the key characteristics of the most common distributions and in Figure 6A.15 he provides a decision tree diagram for choosing a distribution:


JD Long points to the Clickable diagram of distribution relationships by John Cook in his blog entry about Fitting distribution X to data from distribution Y . With those two charts I find it not too difficult anymore to find a reasonable starting point.

Once I have decided which distribution might be a good fit I start usually with the fitdistr function of the MASS package. However, since I discovered the fitdistrplus package I have become very fond of the fitdist function, as it comes with a wonderful plot method. It plots an empirical histogram with a theoretical density curve, a QQ and PP-plot and the empirical cumulative distribution with the theoretical distribution. Further, the package provides also goodness of fit tests via gofstat.

Suppose I have only 50 data points, of which I believe that they follow a log-normal distribution. How much variance can I expect? Well, let's experiment. I draw 50 random numbers from a log-normal distribution, fit the distribution to the sample data and repeat the exercise 50 times and plot the results using the plot function of the fitdistrplus package.


I notice quite a big variance in the results. For some samples other distributions, e.g. logistic, could provide a better fit. You might argue that 50 data points is not a lot of data, but in real life it often is, and hence this little example already shows me that fitting a distribution to data is not just about applying an algorithm, but requires a sound understanding of the process which generated the data as well.

n <- 50
m <- 50
set.seed(1)
mu <- -0.4
sig <- 0.12
x <- matrix(data=rlnorm(n*m, mu, sig), nrow=m)
library(fitdistrplus)
## Fit a log-normal distribution to the 50 random data set
f <- apply(x, 2, fitdist, "lnorm")
## Plot the results
for(i in 1:n)
plot(f[[i]])
## Save plot in an animated GIF-file
library(animation)
saveGIF({for(i in 1:n) plot(f[[i]])})
apply((sapply(f, "[[", "estimate")),1, summary)
# meanlog sdlog
# Min. -0.4347 0.09876
# 1st Qu. -0.4140 0.11480
# Median -0.4010 0.12110
# Mean -0.4011 0.12270
# 3rd Qu. -0.3899 0.12950
# Max. -0.3522 0.14780
## How much variance can we expect in the mean and std?
## Expeted mean
ExpectedMean <- function(mu, sig){
exp(mu+ sig^2/2)
}
## Expected std
ExpectedStd <- function(mu, sig){
sqrt((exp(sig^2)-1)*exp(2*mu + sig^2))
}
summary(apply(sapply(f, "[[", "estimate"), 2,
function(x) ExpectedMean(x[1], x[2])))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.6529 0.6665 0.6747 0.6748 0.6819 0.7087
summary(apply(sapply(f, "[[", "estimate"), 2,
function(x) ExpectedStd(x[1], x[2])))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.06604 0.07880 0.08212 0.08316 0.08794 0.10170
## Let's look at the goodness of fit statistics to get an
## idea how much variance we can expect there:
gof.ln <- lapply(f, gofstat)
gof.test <- lapply(gof.ln, function(x){
data.frame(x[c("chisqpvalue", "cvm", "ad", "ks")])
})
apply(do.call("rbind", gof.test), 2, summary)
# chisqpvalue cvm ad ks
# Min. 0.0002673 0.02117 0.1537 0.05438
# 1st Qu. 0.1394000 0.03755 0.2708 0.07488
# Median 0.3578000 0.04841 0.3216 0.08054
# Mean 0.3814000 0.05489 0.3564 0.08431
# 3rd Qu. 0.6409000 0.06913 0.4358 0.09436
# Max. 0.9245000 0.13220 0.7395 0.12570