Your data analysis solution

Histograms and distribution fitting tutorial in Excel

2017-03-02

The purpose of this tutorial is to generate a histogram and test if a sample follows a negative binomial distribution using the XLSTAT distribution fitting tool in Excel. This distribution is often used to represent the aggregation/dispersion phenomenon of bacteria in water environments.

Data to create a histogram and fit a distribution

An Excel sheet with both the data and the results can be downloaded by clicking on the button below.

Download the data

The data correspond to an experiment where 200 samples of water from a river were cultured on medium with nutrients to determine the presence or absence of bacterial contamination with Escherichia coli. The number of colonies has been counted after 72 hours of incubation. In the Bact-Data column you will find the counts for the 200 samples.

Setting up the dialog box to create an histogram

Once XLSTAT is open, select the XLSTAT / Describing data / Histograms command (see below).

XLSTAT visualizing data menu, histograms

The dialog box then appears. Select the data on the Excel sheet named Data. 

In the general tab, select column B in the Data field. We activate the discrete option because the counts are discrete values. The Sample labels option is left activated because the first row of the data selection contains the name of the sample.

XLSTAT histograms dialog box, general tabXLSTAT histograms dialog box, options tabXLSTAT histograms dialog box, charts tab

Click on the OK button to launch the computations. The results will then be displayed on a new sheet.

Interpreting a histogram

The histogram is displayed on the sheet Histogram below the Summary Statistics table and followed by a table where the statistics of the histogram are available.

Histogram with Range=1

On the histogram we can see that the most frequent value is 0, which represents over 20% of the data. That is, in more than one sample out of five, no bacteria has been found. We also notice that the frequency decreases quickly. In one sample, over 36 colonies have been counted.

The following video shows how to do it.

Creating a histogram specifying the bounds of the intervals

Since we want to test the fit between the negative binomial distribution function and the sample (the Chi-square test requires that there is are least 5 data in a class), and because of the uncertain precision of the counts of the bacteria, it seems necessary to group the counts into larger classes. For this reason, we create a list of bounds that seems coherent with our problem: 0,1,2,3,4,5,10,15,20,40.

In order to verify if the frequencies of the new classes are greater than 5 and decrease regularly, we create a new histogram, specifying this time the bounds of the intervals in the tab Options.

XLSTAT histogram, user defind intervals

The computations begin once you have clicked on the OK button, and the new histogram appears (see sheet "Histogram1").

Histogram with user defined range

The following video shows you how to reproduce those results.

As we are satisfied by this result, we can now use the distribution fitting tool to test if the sample follows a negative binomial distribution.

Setting up the dialog box to fit a distribution

Select the XLSTAT / Modeling data / Distribution fitting command (see below).

XLSTAT tool bar for Distribution fitting

The Distribution fitting dialog box then appears. Select the data on the Excel sheet named Data.

In the General tab, select column B in the Data field. We let XLSTAT estimate the parameters of the negative binomial distribution function.

XLSTAT offers two different formulations of the negative binomial distribution. The one that is adapted to our case is the second one.

XLSTAT distribution fitting general tab

In the Options tab, activate the Goodness of Chi-square tests, which is necessary to test our assumption. We use the bounds that we defined above.

XLSTAT distribution fitting options tab

Select the following options in the tab Charts.

XLSTAT distribution fitting charts tab

Interpreting the results of a distribution fitting analysis

The first result of interest for us is the value of the k and p parameters of the negative binomial distribution (fitted using the maximum likelihood method), and the estimates of the sample and theoretical mean, variance, skewness and kurtosis. The closer these statistics obtained from the data and from the parameters, the better the fit. Here, the fit is excellent. Note: the theoretical mean is given by kp, and the variance by kp(p+1).

Estimated parameters and statistics table

The Chi-square goodness of fit test allows to test if the Chi-square distance between the empirical and theoretical distribution functions is above a critical value or not. A visual comparison between the observed and theoretical frequencies is available on the next figure.

Chart for Observed and theoretical frequencies

For classes 2, 6 and 7, there seems to be a slight difference. In spite of this small difference, the p-value computed for the test (0.767) is significantly higher than the significance level we have chosen (0.05). Therefore, the Chi-square test confirms our hypothesis that the data follow a negative binomial distribution.

As a conclusion, the presence of the bacteria of interest in the river in which the sample were collected, follows a negative binomial distribution (k=0.839, p=5.761), with a mean of 4.8 and a variance of 32.7.

The following video shows you how to do the fitting of the distribution.

1c26995d494fb3061dd0ae8571ffc0a4@xlstat.desk-mail.com
https://cdn.desk.com/
false
desk
Loading
seconds ago
a minute ago
minutes ago
an hour ago
hours ago
a day ago
days ago
about
false
Invalid characters found
/customer/portal/articles/autocomplete
9283