# Journal of Statistical Theory and Applications

Volume 17, Issue 2, June 2018, Pages 307 - 323

# A Tutorial on Levels of Granularity: From Histograms to Clusters to Predictive Distributions

Authors
STANLEY L. SCLOVEslsclove@uic.edu
Department of Information & Decision Sciences, University of Illinois at Chicago, 601 S. Morgan St., Chicago, IL 60607-7124
Received 13 March 2017, Accepted 30 May 2017, Available Online 30 June 2018.
DOI
10.2991/jsta.2018.17.2.10How to use a DOI?
Keywords
Cluster Analysis; Finite Mixture Model; Bayesian models; Compound models; prior distribution; infinite mixture
Abstract

Consider the problem of modeling datasets such as numbers of accidents in a population of insured persons, or incidences of an illness in a population. Various levels of detail or granularity may be considered in describing the parent population. The levels used in fitting data and hence in describing the population may vary from a single distribution, possibly with extreme values, to a bimodal distribution, to a mixture of two or more distributions via the Finite Mixture Model, to modeling the population at the individual level via a compound model, which may be viewed as an infinite mixture model. Given a dataset, it is shown how to evaluate the fits of the various models by information criteria. Two datasets are considered in detail, one discrete, the other, continuous.

Open Access
This is an open access article under the CC BY-NC license (http://creativecommons.org/licences/by-nc/4.0/).

## 1.1 Background and Summary

Consider the problem of modeling a dataset of numbers accidents in a population of insured persons, or the incidence of an illness in a population. One can consider a spectrum of levels of granularity in describing the data and the corresponding population, from histograms, to a single distribution from a parametric family, to a bimodal distribution, to a mixture of two or more distributions, to modeling the population at the individual level via compound (predictive) distributions. Examples included are a dataset of employee days ill (a discrete variable) and a dataset of family expenditures on groceries (a continuous variable). Fits to the data obtained by various levels of granularity are compared using the Bayesian Information Criterion.

## 1.2 Levels of “Granularity”

In discussing the level at which to analyze a dataset, it will be shown how to go from individuals to histograms and modes to clusters or mixture components, back to individuals. The various levels proceed from histograms to clusters to predictive distributions

That is, one can go from modes to sub-populations back to individuals. These choices might be called choosing the level of “granularity” at which to analyze the dataset. Methods for various levels of granularity include those indicated in the table.

Method parameters Model
Histograms different bin widths Multinomial
Cluster Analysis parameter values for clusters of individuals Finite Mixture Model
Predictive distribution parameter value for each individual Bayesian prior distribution
Table 1:

Methods for various levels of granularity

There will be two extended analyses of datasets. These datasets are neither new nor large nor high-dimensional but hopefully will still be found to be interesting, particularly from the point of view of this paper. These two datasets are:

• Expenditure in a week on fruits and vegetables for 60 English families (Connor & Morrell, Statistics in Theory & Practice)

• Days ill in a year for 50 miners (hypothetical data from Kenkel, Statistics for Management & Economics)

## 2 Days Ill dataset

The first, with discrete data, concerns a (hypothetical) dataset of days ill in a year of n = 50 miners (Kenkel 1984). The days ill are of course integer values. They range from 0 to 18 days in the year.

 days 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 freq 2 3 5 5 2 5 5 4 6 3 0 1 4 1 2 0 0 1 1
Table 2:

Frequencies of days ill

The histogram (shown here with bins 0, 1-2, 3-4, … ,17-18) suggests bimodality, with modes at about 7 days and 12 days. (Strictly speaking, it can be argued that there is an error in this figure; namely, the rectangular bar for the value 0 has the same width as that for 1-2, 3-4, etc., even though these span two values instead of one. As will be discussed below, the width should be chosen so that relative frequency is proportional to area, and area = width × height, so that the height should be doubled if the bin width is halved. This will be discussed further below.)

The sample mean is about x¯=6.6 days and the sample variance is s2 = 19.07, that is, the sample standard deviation is s = 4.37 days. A single Poisson would not provide a good fit — for a Poisson distribution, the mean and variance are equal, but here the variance is much larger than the mean.

## 2.1 Extreme values

But, if one does fit a Poisson to the dataset, omitting, say, the upper two values 17 and 18 days, what conclusions might be drawn? Would the 17 and 18 be considered to be particularly unusual? That is to say, later, having fit a distribution, we can assess the probability of such extreme observations.

For now, we note that the mean of the other 50-2 = 48 observations is 5.1 days; the variance, 17.47, still very different values, so a single Poisson would not provide a good fit even after omitting the two largest, possibly outlying, observations. Note that also, besides the gap at 15 and 16 days, there is a gap at 10 days, nobody having been ill that number of days. Perhaps a mixture of distributions would make more sense.

## 2.2 Poisson mixture model

Consequently, a mixture of two Poissons was fit. The mixture model has p.m.f. (probability mass function) p(x) = π1 p1(x) + π2 p2(x), where p1(·) is the p.m.f. of a Poisson distribution with parameter λ1 and p2(·) is the p.m.f. of a Poisson distribution with parameter λ2.

Parameter estimates for the Poisson mixture model. Starting values for iterative estimation were obtained by a Student’s t method of clustering into two groups (Sclove 2016). In this method, each cut-point is tried.

In more detail: Obtain the order statistic, resulting from ordering the the observations {x1, x2, … , xn} as

x(1)x(2)x(n).

Candidate cut-off points c1, c2, … , cn−1 are chosen in between the order statistics:

x(1))<c1<x(2)<c2<<x(n1)<cn1<x(n).

For example, one can take cj to be the midpoint cj = (x(j) + x(j + 1))/2. One then computes two-sample t for each clustering, that is, for each cut-point. The best clustering into two clusters is the one which gives the largest value of |t|, or, equivalently, of t2. This is because t2 compares the within-groups sum of squares and the between-groups sum of squares.

Pooled t assumes equal variances in the two clusters. Unpooled t avoids this assumption. Not assuming equal variances, one can consider the standardized difference between means as an objective function, that is, one can use an unpooled two-sample t as the criterion, maximizing it over cut points. The unpooled t statistic is given by t2=(x¯1x¯2)2/(s12/n1+s22/n2) , that is, |t|=|x¯1x¯2|/s12/n1+s22/n2 .

The value of the two-sample, unequal variance Student’s t is computed for each cut-point, and the cut-point giving the largest t2 is chosen.

Iteration. The means of the resulting clusters are taken as tentative estimates of the distributional means; the cluster relative frequencies, as tentative estimates of the mixing probabilities. These estimates were 2.1 days, 8.9 days, with cluster frequencies of 17 out of 50 and 33 out of 50, that is, mixing probabilities of .34, .66. By doing a grid search in the vicinity of these initial estimates to maximize the mixture-model likelihood, the following estimates were obtained: λˆ1=2.77days , πˆ1=.40 , λˆ2=9.12days , πˆ2=.60 . Also, taking these as starting points for the EM (Expectation-Maximization) algorithm, only slightly different estimates were obtained; these were λˆ1=2.84days , λˆ2=9.20days , πˆ1=.41 , πˆ2=.59 .

Extreme value assessment with the fitted Poisson mixture. With the fitted Poisson mixture, the estimate of Pr{X > 16} is .007, fairly small. So perhaps results of 17 or more days could and should be considered outliers.

## 2.3 Comparison of models by model-selection criteria

The two fits, by histogram and by Poisson mixture, were compared by means of model-selection criteria. Given K alternative models, indexed by k = 1, 2, … , K, penalized-likelihood model-selection criteria are smaller-is-better criteria that can be written in the form

MSCk=2LLk+a(n)mk,
where mk is the number of free parameters used in fitting Model k, LLk is the log maximum likelihood of Model k, and a(n) = ln n for BIC (Bayesian Information Criterion; Schwarz 1978) and a(n) = 2 for all n for AIC (Akaike’s Information Criterion; Akaike 1973, 1974; Kashyap 1982; Sakamoto 1992). That is, for k = 1, 2, … , K alternative models,
AICk=2LLk+2mk,
and
BICk=2LLk+(lnn)mk.

The number of parameters for the Poisson mixture is 2 means plus 2 mixing probabilities, less 1 because the probabilities must add to 1. That is 3 free parameters for the Poisson mixture. The number of parameters for the histogram, scored by the multinomial distribution with 17 categories (0 through 18, but 15 and 16 are missing), less 1 because the multinomial probabilities must add to 1, leaving 16 free parameters.

The results are in the next table. The histogram wins by a bit according to AIC, but the Poisson mixture wins by a wide margin according to BIC. To see this, note that BIC is derived (Schwarz 1978) as the first terms in the Taylor series expansion of (-2 times) the posterior probability of Model k, Pr(Model k | data) = ppk, say. That is,

2lnppkConst.+BICk,orBICkCexp(BICk/2).

Model, k − 2 LLk mk AICk BICk ppk
k = 1 : histogram 261.642 16 293.642 324.234 5.0 × 10−7
k = 2 : Poisson mixture 283.473 3 289.473 295.209 ≈ 1
Table 3:

Comparison of two models

Model, k BICk same - 295 exp(−same/2) ppk
1 324.234 29.234 4.5 × 10−7 4.98 × 10−7
2 295.209 0.209 0.90085 1.000

sum = 0.90085
Table 4:

Calculation of posterior probabilties of alternative models

To calculate the posterior probabilities, one subtracts a large constant from each, divides by 2, exponeniates the negative of this, and sums these, dividing by the sum to normalize.

Different bin widths. If one makes several histograms, with different bin widths, how should the likelihood for histograms be computed? Given data points x1, x2, … , xn, the likelihood for a given p.m.f. p(·) is

L=i=1np(xi).

Here p(xi) is the p.m.f. at the data point xi. (For continuous data, we would write the p.d.f., f(xi). ) But in the context of histograms what we can take p(xi) to be?

Denote the number of bins by J. Let the bin width be denoted by h. This is an increment along the x-axis.

Let the bins be indexed by j, j = 1, 2, … , J. The class limits are x0, x0 + h, x0 + 2h, … , x0 + Jh. The class intervals (bins) are [x0, x0 + h), [x0 + h, x0 + 2h), … , [x0 + (J − 1)h, x0 + Jh). In the present application, x0 = 0. Now, let j(xi) denote the bin containing xi and nj(xi) be the frequency in that bin. To approximate f(xi), motivated by f(x1) ≈ [F(x2) − F(x1)]/(x2x1) = [F(x1 + h) − F(x1)]/[(xi + h) − x1] = [F(x1 + h) − F(x1)]/h, write

f(xI)[F(x)F(xi)]/h
and
f(xi)=probability density atxiprobability in bin containingxi/width of bin=[nj(xi)/n]/h=nj(xi)/nh.

That is, the concept is that probability density is probability per unit length along the x axis. Thus the likelihood is

L=i=1)np(xi)=i=1n(nj(xi)/nh)=(1/hn)i=1n(nj(xi)/n).

Note that p(x1,x2,,xn)=i=1npj(xi)=j=1Jpjnj is a multinomial p.m.f. with probabilities pj and frequencies nj for the J categories. The maximized likelihood L is this multinomial (with pj estimated as nj/n), divided by hn, which may be viewed as an adjustment to the likelihood due to the bin width h. In computing the likelihood, the probability density is to be used, where “density” is probability / bin width. Note that with a continuous variable we would compute probability density as f(xi)/h, that is, f(xi)/(Lebesgue measure of the bin interval), whereas with a discrete variable we are really computing probability density as p(xi)/h, where now h is the counting measure of the bin interval.

Varying bin widths. In the case of non-constant bin widths, with a bin-width of hj for the j-th interval, take the probability density at xi to be nj(i)/hj(i), where hj(i) is the width of the interval in which xi falls and nj(i) (short for nj(xi) ) is the frequency (count) in that interval. The likelihood is

L=i=1np(xi)=i=1n(nj(i)/n)/hj(i)=(1/nn)i=1nnj/hj(i).

 days 0 -1 2-3 4-5 6-7 8- 9 10- 11 12- 13 14- 15 16-17 18-19 freq 5 10 7 9 9 1 5 2 1 1
Table 5:

Sample distribution with a bin width of 2

 days 0 1 2-3 4-5 6-7 8- 9 10- 11 12- 13 14- 15 16-17 18 bin width 1 1 2 2 2 2 2 2 2 2 1 freq 2 3 10 7 9 9 1 5 2 1 1
Table 6:

Sample distribution with varying bin widths: bins 0, 1, 2-3,4-5.6-7, … , 16-17, 18

According to AIC, the histogram with varying bin widths wins, the Poisson mixture coming in second. According to BIC (and, equivalentlly, posterior probability), the Poisson mixture scores the best, by far. But the point is not just which model wins, but that such a comparison, comparing histograms on the one hand with fitted distributions on the other, can be made.

Levels of granularity, cont’d. Perhaps another level of granularity is approached by predictive distributions, which may be viewed as getting to the individual level of granularity. Predictive distributions may be viewed in the light of compound distributions resulting from a prior distribution on the parameter at the individual level. From the viewpoint of modern statistics, a predictive distribution is merely the marginal distribution of the observable random variable, having integrated out the prior on the parameter. (Details to follow.)

The Yule-Greenwood model approaches modeling at the individual level, stating that each individual may have his or her own accident rate λ and so is an example of a compound model. In terms of granularity, the Yule-Greenwood model is a classical example at the level of the individual in that it employs a Poisson model for each individual’s accident rate λ and then puts a (Gamma) distribution over the population of values of λ. The model is the Gamma-Poisson model (sometimes called the Poisson-Gamma model) and is a prime example of a compound model. The Gamma is a conjugate prior distribution for the Poisson, meaning that the posterior distribution of λ is also a member of the Gamma family. We discuss this further below; first, however, we fit histograms and mixtures to a dataset with continuous data.

## 3 An example with continuous data

The variable in the next example is expenditure in a week (£) of n = 60 English families on fruits and vegetables (Connor and Morrell 1977, data from the British Institute of Cost and Management Accountants). The data are reported to two decimals. This sort of measurement, treated as continuous, contrasts with the integer-valued variable considered in the example above.

Here are the data, sorted from smallest to largest:

0.21, 0.33, 0.36, 0.38, 0.41, 0.46, 0.48, 0.48, 0.51, 0.51, 0.51, 0.58, 0.64, 0.66, 0.69, 0.69, 0.71, 0.74, 0.74, 0.78, 0.78, 0.79, 0.84, 0.87, 0.87, 0.88, 0.89, 0.91, 0.91, 0.93, 0.98, 0.98, 1.03, 1.03, 1.05, 1.08, 1.12, 1.16, 1.17, 1.19, 1.24, 1.25, 1.26, 1.26, 1.28, 1.33, 1.38, 1.44, 1.48, 1.51, 1.53, 1.58, 1.61, 1.62, 1.76, 1.78, 1.79, 1.83, 1.96, 2.13

The minimum is 0.21 £; the maximum, 2.13 £. The sample mean is x¯=1.022£ , the sample standard deviation, s = 0.4562 £(sample variance s2 = 0.2081). The frequency distribution (see below in Figure 2 and Table 7) suggests possible bimodality.

Model, k − 2 LLk mk AICk BICk ppk
histogram, bin width h=1 261.6 16 293.6 324.2 .000
histogram, bin width h=2 273.2 9 291.2 308.4 .001
histogram, varying bin widths 267.8 9 285.8 303.0 .020
Poisson mixture 283.5 3 289.5 295.2 .978
Table 7:

Comparison of models

 lower limit 0.21 0.31 0.41 0.51 0.61 0.71 0.81 0.91 1.01 1.11 upper limit 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 Frequency 1 3 4 4 4 6 5 5 4 4 lower limit 1.21 1.31 1.41 1.51 1.61 1.71 1.81 1.91 2.01 2.11 upper limit 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 Frequency 5 2 2 3 2 3 1 1 0 1
Table 8:

Frequency distribution of weekly expenditure (£)

The distribution as tabulated here has a bin width h of 0.10. We consider below also the results for h = 0.2, for fitting a single Gamma and also for fitting a mixture of two (Gaussian) distributions. Below we compare these four fits by means of AIC and BIC.

## 3.1 Fitting a Gamma distribution

The two-parameter Gamma p.d.f. is

f(x)=λm1ex/β/[Γ(m)βm],m>0,λ>0,β>0,x>0.

The mean is . The variance is 2. Method-of-moments estimates are, for the scale parameter β = σ2/μ, βˆ=s2/x¯=0.2081/1.022=0.2035 . and for the shape parameter m = μ/β, so mˆ=x¯/βˆ=1.022/0.2035=5.0246 .

## 3.2.1 Gaussian mixture

The mixture model has p.d.f. f(x) = π1 f1(x) + π2 f2(x), where f1(·) is the p.d.f. of a Gaussian distribution with mean μ1 and variance σ12 and f2(·) is the p.d.f. of a Gaussian with mean μ2 and variance σ22 . The estimates are μˆ1=0.72£ , μˆ2=1.46£ , σˆ1=0.23£ , σˆ2=0.27£ , πˆ1=.62 , πˆ2=.38 . The results were obtained by approximate maximization of the likelihood doing an EM (Expectation-Maximization) iteration. Starting values were obtained by the Student’s t method (Sclove 2016). As mentioned above, in this method, each cut-point is tried. Two-sample, unequal variance Student’s t is computed for each cut-point, and the cut-point giving the largest t is chosen. This gives starting values obtained from the means and variances of the two resulting clusters. Then, given starting values μ1(0), σ1(0) , μ2(0) , σ2(0) , one then computes fj(xi;μj(0),σj2(0)) , j = 1, 2, i = 1, 2, … , n, and posterior probabilities of group membership, pp(j | xi), the posterior probability that xi arose from population j; at step s, for populations j = 1, 2,

pp(s)(j|xi)=πj(s)fj(xi;μj(s),σj2(s))/[π1(s)f1(xi;μj(s),σj2(s))+π2(s)f2(xi;μj(s),σj2(s))].

Then the estimates are updated. (See McLachlan and Peel (2000), p. 82.) For j = 1, 2,

μj(s+1)=i=1npp(s)(j|xi)xi/i=1npp(s)(j|xi).

As a check, note that if for one group j, it happened that pp(s)(j|xi) = 1 for all cases i, then the new estimate of the mean for that j is simply i=1nxi/n=x¯ . The updated estimates of the second moments are, for j = 1, 2,

μ2j(s+1)=i=1npp(s)(j|xi)xi2/i=1npp(s)(j|xi).

Then, variance equals raw second moment minus square of mean, or

σj2(s+1)=μ2j(s+1)[μj(s+1)]2.

Note that the numerical estimates of σ1 and σ2 are somewhat different; the ratio of variances is (0.27/0.23)2 = 0.075/0.051 = 1.46. Given this, it does not seem particularly worthwhile in this case to trying fitting two Gaussians with equal variances.

The table summarizes the results. According to AIC, the ranking is: Gamma, Gaussian mixture; histogram with bin width .2, histogram with bin width .1. According to BIC, the ranking is: Gaussian mixture, Gamma, histogram with bin width .2, histogram with bin width .1.

Model, k − 2 LLk mk AICk BICk ppk
histogram, bin width h=0.1 61.68 18 97.68 135.38 .000
histogram, bin width h=0.2 66.25 9 84.25 103.10 .000
Gamma 71.75 2 75.75 79.94 .382
Gaussian mixture 70.79 5 80.79 78.98 .618
Table 9:

Comparison of results

## 3.2.2 Gamma mixture

The mixture model has p.d.f. f(x) = π1 f1(x) + π2 f2(x), where now f1(·) is the p.d.f. of a Gamma distribution with shape parameter m1 and scale parameter β1 and f2(·) is the p.d.f. of a Gamma distribution with shape parameter m2 and scale parameter β2. The means and variances are, for distributions j = 1, 2, μj = mjβj and σj2=mj/βj2 . The inverse expressions, for the Gamma parameters in terms of the mean and variance, are βj=σj2/μj and mj=μj2/σj2 .

The resulting estimates are mˆ1=7.459 , βˆ1=0.099 , mˆ2=22.228 , βˆ2=0.066 , πˆ1=.61 , πˆ2=.39 .

The results were obtained by approximate maximization of the likelihood doing an EM (Expectation-Maximization) iteration. Starting values were obtained by the Student’s t method (Sclove 2016).

Given starting values μ1(0), σ1(0) , μ2(0) , σ2(0) , these were converted to starting values for m1, m2, β1, β2. One then computes the values of the Gamma densities,

fj(xi;mj(0),βj(0)),j=1,2,i=1,2,,n,
and posterior probabilities of group membership, pp(j | xi), the posterior probability that xi arose from population j; at step s, for populations j = 1, 2,
pp(s)(j|xi)=πj(s)fj(xi;mj(s),βj(s))/[π1(s)f1(xi;mj(s),βj(s))+π2(s)f2(xi;mj(s),βj(s))].

Then the estimates are updated. For j = 1, 2,

μj(s+1)=i=1npp(s)(j|xi)xi/i=1npp(s)(j|xi).

As a check, note that if for one group j, it happened that pp(s)(j|xi) = 1 for all cases i, then the new estimate of the mean for that j is simply i=1nxi/n=x¯ . The updated estimates of the second moments are, for j = 1, 2,

μ2j(s+1)=i=1npp(s)(j|xi)xi2/i=1npp(s)(j|xi).

Then, of course, variance equals raw second moment minus square of mean, or

σj2(s+1)=μ2j(s+1)[μj(s+1)]2.

Then these are converted to updated estimates of mj, βj, and the iteration continues until satisfactory convergence. LL, AIC, and BIC are computed with the resulting estimates.

## 3.3 Comparison of models by model-selection criteria

The table summarizes the results. According to AIC, the ranking is: Gamma; Gaussian mixture; Gamma mixture; histogram with bin width .2, histogram with bin width .1. According to BIC, the ranking is: Gaussian mixture, Gamma, Gamma mixture, histogram with bin width .2, histogram with bin width .1 The Gamma mixture has only a small posterior probability.

Model, k − 2 LLk mk AICk BICk ppk
histogram, bin width h=0.1 61.68 18 97.68 135.38 .000
histogram, bin width h=0.2 66.25 9 84.25 103.10 .000
Gamma 71.75 2 75.75 79.94 .382
Gaussian mixture 70.79 5 80.79 78.98 .617
Gamma mixture 71.75 5 81.75 92.22 .001
Table 10:

Comparison of results

## 4 Compound Models in General

First, notation notation for probability functions will be reviewed.

The probability density function (p.d.f.) of a continuous random variable (r.v.) X, evaluated at x, will be denoted by fX(x). The p.d.f. of a continuous random variable Y, evaluated at y, is similarly denoted by fY(y).

Now consider a bivariate variable x = (y, z). The joint p.d.f. of the r.v.s Y and Z, evaluated at (y, z) is fY,Z(y, z). Example: Y = WT, X = HT, the value of the joint p.d.f. at y = 80 kg and z = 170 cm is fWT,HT (80, 170).

Other notations include:

• fY|X(y|x): conditional probability density function of the r.v. Y, given that the value of the r.v. X is x. Example: fWT|HT(wt | HT = 170cm). This represents the bell-shaped curve of weights for men of height 170 cm.

• fY,Z(y | z) = fY|Z(y|z) fZ(z): This is the joint p.d.f. expressed as the product of the conditional of Y given Z and the marginal of Z

• fY(y) = ∫ fY,Z(y, z) dz = ∫ fY|Z(y|z) fZ(z) dz: marginal p.d.f. of Y

In the development that follows, fZ(z) plays the role of the prior probability function on the parameter. That is, denoting the parameter by θ, the function fZ(z) will become fΘ(θ).

The elements of compound models are:

• The distribution of the observable r.v., given the parameter(s), that is, the conditional distribution of X, given the parameter(s); and the marginal distribution (predictive distribution). The marginal distribution will have the hyperparameters among its parameters.

• the prior distribution. Its parameters are called hyperparameters.

A generic symbol for the parameter(s) of the conditional distribution of X is the conventional θ. As a generic symbol for the hyperparameters, one could use α, since the prior comes first in the model when one thinks of the parameter value being given first, and then the value of the variable being observed.

For use in compound models, the probability functions include the following:

The conditional distribution of the observable r.v. X, given the value of the parameter, is fX(x|θ); p.d.f. of X for given θ.

The prior distribution on the parameter θ with hyperparameter vector α, fΘ(θ; α).

The naming of compound models takes the form, prior distribution – conditional distribution. In the Gamma-Poisson model, the conditional distribution of X given λ is Poisson(λ) and the prior distribution on λ is Gamma. In the Beta-Binomial mode, the conditional distribution of X given p is Binomial with success probability p and the prior distribution on p is Beta. (Note that some people use the form, conditional distribution – prior distribution, e.g., Poisson-Gamma.)

## 5.1 Probability functions for the Gamma-Poisson model

In the Gamma-Poisson model, the distribution of X is Poisson with parameter usually called λ. The p.m.f. is

p(k)=eλλk/k!,λ>0,k=0,1,2,.

The mean and variance are both equal to λ.

Such a distribution can be considered, say, for the number of accidents per individual per year. For the days ill dataset (days ill in a year for a sample of n = 50 miners), we have fit a single Poisson (with mean 6.58 days per year). We looked at histograms and observed bimodality. Further, the fact that the sample variance of 19.06 was considerably larger than the sample mean was a hint of inadequacy of a single Poisson. As discussed above, a mixture of Poissons was fitted, with mixing probabilities about .6 and .4 and means about 3 days and 9 days. A finer level of granularity would be obtained by saying that each person has his own value of λ and putting a distributon on these over the population.

## 5.2 Gamma family of distributions

A Gamma distribution could be a good choice. The choice of the Gamma family is non-restrictive in that the family can achieve a wide variety of shapes. The single-parameter gamma has a shape parameter m; the two-parameter gamma family has, in addition, a scale parameter, β. (The reciprocal of β is the rate parameter, so-called because it is the rate, or intensity, of the associated Poisson process.) A Gamma distribution with parameter m, has p.d.f. f(λ) = Const. λm−1 eλ, λ > 0. The constant is 1/Γ(m). More generally, the two-parameter Gamma can be used: the p.d.f. is

f(λ)=λm1eλ/βΓ(m)βm,β>0,λ>0.

The mean is . The variance is 2.

The Negative Exponential family of distributions. The special case of m = 1 in the Gamma family gives the negative exponential family of distributions. So the

f(λ)=eλ/β/β,λ>0.

The mean is β. The variance is β2.

## 5.3 Development of the Gamma-Poisson model

Putting a population distribution over a parameter can be a very helpful way of modeling. The resulting model is called a compound model. In a compound model, the random variable X is considered as the result of sampling that yields an individual and that individual’s value of a parameter, and then the individual’s value of X is observed, from a distribution with that value of the parameter. Note that a compound model can be viewed as an infinite mixture model.

In this discussion, focus is on a couple of particular compound models, the Gamma-Poisson, and later, the Beta-Binomial.

The Yule-Greenwood model, from a modern viewpoint, is an application of the Gamma-Poisson model to a financial, in fact, actuarial, situation. It is in terms of a model for accident rates in a population. Suppose that the yearly number of accidents of any given individual i in a population is distributed according to a Poisson distribution with parameter λi accidents per year. (This is count data, similar to the days ill data.) Then the probability that individual i, with parameter value λi, has exactly k accidents in a year, k = 0, 1, 2, … , is

eλiλik/k!,λi>0,k=0,1,2,.

Some individuals are more accident prone (have a higher accident rate) than others, so different individuals have different values of λ. A distribution can be put on λ to deal with this. This is the Yule-Greenwood model, dating from 1920; a precursor of the Predictive Distributions of the new Predictive Analytics, predating even Abraham Wald (1950) as a founder of modern mathematical statistics and decision theory and Jimmie Savage (1954) as a founder of modern Bayesian Statistics.

The standard choice of a prior distribution on λ is a Gamma distribution. The Gamma family is a conjugate family to the Poisson, meaning that the prior and posterior distributions of λ are both in the Gamma family.

The joint distribution of X and Λ. The joint probability function of X and Λ is

fX,Λ(x,λ)=fΛ(λ)pX|Λ(x|λ),x=0,1,2,,λ>0.

The expressions for the Gamma and Poisson are put into this. That is, the weight assigned to pX||Λ(x|λ) is fΛ(λ).

The joint probability function is used to obtain

• the marginal distribution of X, by integrating out λ, and

• then the posterior distribution of Λ given x, by dividing the joint probability function by the marginal probability mass function of X.

Putting in the expressions for the Gamma and Poisson, it is seen that the marginal distribution of X, the number of accidents that a randomly selected individual has in a year, is of the form

fX(x)=0fX,Λ(x,λ)dλ=fX|Λ(x|λ)fΛ(λ)dλ.

When the prior is Gamma and the conditional is Poisson, this marginal distribution can be shown to be negative binomial. Its parameters are m and p = 1/(1 + β).

In the Bayesian model, the parameter of the conditional distribution of X, say θ, is treated as a random variable Θ.

In the Gamma-Poisson model, θ is the Poisson parameter λ.

The conditional distribution of X given that Θ = θ is Poisson(λ). The probability mass function is

pX|Λ(x;λ)=eλλx/x!,x=0,1,2,,.

The joint p.d.f. of X and Λ can be written as pX||Λ(x|λ) which is fΛ(λ)(x, λ) = fΛ pX(x|λ.

As mentioned above, from this, the posterior distribution of Λ, that is, the distribution of Λ given x, can be computed, and the marginal distribution of X can be computed. Marginal maximum likelihood estimation can be applied to obtain estimates of the remaining parameters.

Posterior distribution of Λ. Analogous to Pr(B|A) = Pr(AB) / Pr(A), the p.d.f. of the posterior distribution is the joint p.d.f., divided by the marginal p.d.f. of X :

fΛ|X(λ|x)=fX,Λ(x,λ)/fX(x).

This will turn out to be a Gamma distribution, that is, it is in the same family as the prior. The Gamma is a conjugate prior for the Poisson.

Marginal distribution of X. In Predictive Analytics, the marginal distribution of X is computed as a model of a future observation or observations of X.

The marginal distribution of X is obtained by integrating the joint distribution with respect to the parameter. Note that this computation combines information, by weighting the conditional distribution of X given λ with the prior on λ. This computation of the p.d.f. is, as stated above, fX(x)=0tf(x|λ)fΛ(λ)dλ .

Moments. The mean of the marginal distribution of X is mq/p = m β. The variance of the marginal distribution of X is mq/p2 = m β(1 + β).

## 5.4 Empirical Bayes estimation

Empirical Bayes estimation, at least in the present context, means estimating the parameters of the prior using observations from the marginal distribution.

The hyperparameters in terms of the moments of the marginal. The parameters of the prior are called hyperparameters. In this case, they are λ and β. Suppose we solve for them in terms of the first two moments of the marginal.

Estimating the prior parameters from the marginal. Estimates of the prior parameters m and β can be obtained by, for example, taking the expressions for the hyperparameters m and β in terms of the first two raw moments and plugging in estimates m1 and m2 . Given a sample X1, X2, … , Xn, we have m1=X¯=i=1NXi/N and m2=i=1nXi2/n .

## 5.5 Application to the days ill dataset

Returning to the days ill dataset for n = 50 miners, the days ill in a year ranged from 0 to 18; the distribution seems to be bimodal.

The p.m.f. of the Negative Binomial distribution with parameters m and p is where k is the number of trials in excess of m required to get m Heads. In the Gamma-Poisson model, the marginal distribution of X is Negative Binomial with parameters with parameters m and p = 1/(1 + β).

Given that the true mean of the marginal (“predictive distribution”) Negative Binomial is μ = mq/p = and the true variance is σ2 = mq/p2 = (1 + β), and the sample mean x¯=6.58 and the sample variance s2 = 19.07, one can set up two equations and solve for method of moments estimates of the hyperparameters m and β in the Gamma prior for λ.

The equations are [1] : m β = 6.58; [2] : m β(1 + β) = 19.07.

Putting [1] in [2] gives 6.58(1 + β) = 19.07, 1 + β = 19.07/6.58 2.898, betaˆ1.898 . Then m ≈ 6.58/β = 6.58/1.898 ≈ 3.467. Now, μ = m(1 − p)/p = m/pm, μ + m = m/p, p = m/(μ + m) or, estimating p = 3.467/(6.58 + 3.467) = 3.467/10.05 = 0.345. So now we have estimates of the hyperparameters.

To estimate the mean and variance of the Gamma prior, one can proceed as follows. The mean of the prior is , estimated as 6.58 days ill per year. The variance of the prior is 2, estimated as 6.58(1.898) ≈ 12.49. The standard deviation is thus estimated as 12.493.03 days ill per year.

Maximum likelihood estimates are not in closed form but numerical values for them could be obtained by numerical maximization of the likelihood function. It is helpful to use the method of moments as a quick and simple method to get an idea of the values of the parameters.

The table includes the marginal negative binomial with m = 3 and p = .344 in the comparison.

Model, k − 2 LLk mk AICk BICk ppk
histogram, bin width h=1 261.6 16 293.6 324.2 .000
histogram, bin width h=2 273.2 9 291.2 308.4 .000
histogram, varying bin widths 267.8 9 285.8 303.0 .002
Poisson mixture 283.5 3 289.5 295.2 .100
marginal Negative Binomial 283.0 2 286.0 290.8 .898
Table 11:

Comparison of models, cont’d

According to AIC, the histogram with varying bin widths still wins, the Negative Binomial coming in second. According to BIC (and, equivalentlly, posterior probability), the Negative Binomial scores the best, by far. This Negative Binomial is unimodal with a mode of .115 at 3 days. Because it is unimodal, it perhaps does not capture the flavor of the original data, which is reflected better by the Poisson mixture.

## 6.1 Beta-Binomial model

Another compound model is the Beta-Binomial model.In this model, the conditional distribution of X given p is Binomial(n,p). The prior on p is Beta(α, β). The posterior distribution of p given x is Beta(α + x, β + nx). It is as if there had been a first round of α + β trials, with α successes, followed by a second round of n trials, with x successes. Method of Moments estimates of the parameters of the prior can be relatively easilty obtained. So can the Bayes estimates.

## 6.2 Normal-Normal model

We have considered the Gamma-Poisson model and, briefly, another prominent compound model, the Beta-Binomial model, with a Beta prior on the Binomial success probability parameters. Still another compound model is the Normal-Normal model.

In the Normal-Normal model, X is distributed according to 𝒩(μ, σ2), the Gaussian distribution with mean μ and variance σ2. The prior on μ can be taken to be 𝒩(μ0,σ02) , or perhaps a Gaussian with a different mean if there is some particular reason to do this.

The posterior distribution is again Normal. That is, the Normal family is the conjugate family for the Normal distribution. The marginal distribution is also Normal, with mean ℰ[X] = ℰ[ℰ[X|μ] = ℰ[μ] = μ0. The variance of the marginal distribution is the mean of the conditional variance plus the variance of the conditional mean, 𝒱[X]=[σ2]+𝒱[[μ]=σ2+𝒱[μ]=σ2+σ02 . These two terms are the “components of variance”. The decomposition of the variance can be obtained also by doing the requisite algebra on the product of the prior and conditional. This model is similar to a Random Effects (Model II) model in ANOVA. The group effects (group mean minus overall mean) are considered as a sample from 𝒩(0,σ02) .

## 7 Proposed Extensions and Issues

Multivariate models. Multivariate generalizations, where the response is a vector rather than a scalar, could be interesting, both in general and in the context of the analysis of variance.

Parameter-space boundary issues. Chen and Szroeter (2016) provide a more widely applicable version of BIC (originally called the Schwarz Information Criterion, SIC). This version deals with boundary issues in the parameter space. These issues can be important when dealing with mixture models, as the parameter space changes as the number of component distributions changes..

A BIC for Maximum Marginal Likelihood. In view of the fact that, in this paper, compound models have been mentioned as a level of granularity for modeling at the individual level, it is worth mentioning that Spiegelhalter et al. (2002) use an information theoretic argument to derive a measure pD for the effective number of parameters in a model as the difference between the posterior mean of the deviance and the deviance at the posterior means of the parameters of interest. In general this measure pD approximately corresponds to the trace of the product of Fisher’s information and the posterior covariance, which in normal models is the trace of the ’hat’ matrix projecting observations onto fitted values, that is, the matrix of estimated regression coefficients. The properties of the measure in exponential families are explored in their paper. The posterior mean deviance is suggested as a Bayesian measure of fit or adequacy, and the contributions of individual observations to the fit and complexity can give rise to a diagnostic plot of deviance residuals against leverages. Adding pD to the posterior mean deviance gives a deviance information criterion for comparing models, which is related to other information criteria and has an approximate decision theoretic justification.

## Acknowledgments

An earlier version of this paper was presented at the 2016 Annual Meeting of the Classification Society, June 1-4, 2016, University of Missouri, and also at ICOSDA 2016, the 2016 International Conference on Statistical Distributions and Applications, October 14-16, 2016, Niagara Falls, Ontario, Canada, organized by Professor Ejaz Ahmed of Brock University and Professors Felix Famoye and Carl Lee of Central Michigan University. This paper was presented in the topic-invited session ”Extreme Value Distributions and Models”, organized by Professor Mei Luang Huang, Brock University; sincere thanks are extended to Professor Huang for the invitation to speak at the conference.

## Bibliography.

The books mentioned below on predictive analytics, those of Murphy and Bishop, do not discuss the Gamma-Poisson model explicitly. Those who wish to consult these books may however refer to them to find Murphy, p. 41 on the Gamma family of dsitributions and/or Bishop, p. 688 on the Gamma family. As mentioned, an original paper, anticipating the subject of compound models, is that of Major Greenwood and G. Udny Yule (1920). To review background in Probability Theory in general, see, for example, Emanuel Parzen (1992) or Sheldon Ross (2014). See also Parzen, Stochastic Processes (1962) or Ross, Applied Probability Models (1970).

## References

[1]H Akaike, “Information Theory and an Extension of the Maximum Likelihood Principle”, BN Petrov and F Càski (editors), Second International Symposium on Information Theory, Akademiai Kiàdo, Budapest, 1973, pp. 267-281.
[5]LR Connor and AJH Morrell, Statistics in Theory and Practice, 7th ed., Pitman, London, 1977.
[8]J Kenkel, Introductory Statistics for Management and Economics, 2nd ed., Duxbury Press, Boston, MA, 1984, pp. 31. Exercise 4
[10]KP Murphy, Machine Learning: a Probabilistic Perspective, The MIT Press, 2012.
[11]E Parzen, Modern Probability Theory and its Applications, Wiley, 1960. (Reprinted in 1992 as a Wiley Classics Edition.)
[12]E Parzen, Stochastic Processes, Wiley, New York, 1962. Dover Publications (Reprint of the original, published by Wiley.)
[13]SM Ross, Applied Probability Models with Optimization Applications, Holden-Day, San Francisco, 1970. Reprinted, Dover, New York, 1992
[15]Y Sakamoto, Categorical Data Analysis by AIC, Springer, New York, 1992.
[18]SL Sclove, “t for Two (Clusters)”, in Presented June 2, 2016, at the Classification Society Annual Meeting (University of Missouri, Columbia, 2016). Submitted for consideration for publication
[20]A Wald, Statistical Decision Functions, John Wiley and Sons, Chapman and Hall, New York, London, 1950.
Journal
Journal of Statistical Theory and Applications
Volume-Issue
17 - 2
Pages
307 - 323
Publication Date
2018/06/30
ISSN (Online)
2214-1766
ISSN (Print)
1538-7887
DOI
10.2991/jsta.2018.17.2.10How to use a DOI?
Open Access
This is an open access article under the CC BY-NC license (http://creativecommons.org/licences/by-nc/4.0/).

TY  - JOUR
AU  - STANLEY L. SCLOVE
PY  - 2018
DA  - 2018/06/30
TI  - A Tutorial on Levels of Granularity: From Histograms to Clusters to Predictive Distributions
JO  - Journal of Statistical Theory and Applications
SP  - 307
EP  - 323
VL  - 17
IS  - 2
SN  - 2214-1766
UR  - https://doi.org/10.2991/jsta.2018.17.2.10
DO  - 10.2991/jsta.2018.17.2.10
ID  - SCLOVE2018
ER  -