In this tutorial, we create a histogram and use the XLSTAT distribution fitting tool to test whether a sample follows a negative binomial distribution in **Excel**. This distribution is often used to represent the aggregation/dispersion of bacteria in water environments.

## Data to create a histogram and fit a distribution

An Excel sheet with both the data and the results can be downloaded by clicking here.

The data correspond to an experiment in which 200 samples of water from a river were cultured on a nutrient medium to determine the presence or absence of bacterial contamination with Escherichia coli. The number of colonies was counted after 72 hours of incubation. The Bact-Data column contains the counts for the 200 samples.

## Setting up the dialog box to create a histogram

After opening XLSTAT, select the **XLSTAT / Describing data / Histograms** command, or click on the corresponding button of the **Describing data** toolbar (see below).

Once you've clicked on the button, the dialog box appears. Select the data on the Excel sheet.

The **Data** are in the B column. We activate the **discrete** option because the counts are discrete values. The **Sample labels** option is left activated because the first row of the data selection contains the name of the sample.

The computations begin once you have clicked on the **OK** button. The results will then be displayed.

## Interpreting a histogram

After some summary statistics, the histogram is displayed on sheet **Histogram**, followed by a table where the statistics of the histogram are available.

On the histogram we can see that the most frequent value is 0, which represents over 20% of the data. That is, in more than one sample out of five, no bacteria were found. We also notice that the frequency decreases quickly as the count increases. In one sample, over 36 colonies were counted.

The following video shows how to do it.
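For readers who want to cross-check the discrete histogram outside XLSTAT, the relative frequency of each count can be tallied in a few lines of Python. This is a minimal sketch, not XLSTAT's implementation: the function name `relative_frequencies` and the short `example_counts` list are hypothetical stand-ins, and the real input would be the 200 values of the Bact-Data column.

```python
from collections import Counter


def relative_frequencies(counts):
    """Return {value: share of samples} for a list of integer counts."""
    n = len(counts)
    tally = Counter(counts)
    return {value: tally[value] / n for value in sorted(tally)}


# Hypothetical colony counts for illustration only,
# NOT the tutorial's Bact-Data column.
example_counts = [0, 0, 1, 3, 0, 2, 7, 1, 0, 12]

freqs = relative_frequencies(example_counts)
mode = max(freqs, key=freqs.get)  # value with the highest relative frequency
print(f"most frequent value: {mode} ({freqs[mode]:.0%} of samples)")
```

Applied to the real data, the entry for 0 should exceed 0.20, matching the reading of the histogram above.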

## Creating a histogram specifying the bounds of the intervals

Because we want to test the fit between the negative binomial distribution and the sample (the Chi-square test requires at least 5 observations in each class), and because the precision of the bacteria counts is uncertain, it seems necessary to group the counts into larger classes. For that reason, we created a list of bounds that seemed consistent with our problem: 0, 1, 2, 3, 4, 5, 10, 15, 20, 40.

To verify that the frequencies of the new classes are at least 5 and decrease regularly, we create a new histogram, this time specifying the bounds of the intervals.

To activate this tool, select the **XLSTAT / Preparing data / Discretization** command, or click on the corresponding button of the **Discretization** toolbar (see below).

The computations begin once you have clicked on the **OK** button, and the new histogram appears (see sheet "Histogram1").

The following video shows you how to reproduce those results.

As we are satisfied with this result, we can now use the distribution fitting tool to test if the sample follows a negative binomial distribution.
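The class check described in this section can also be sketched in Python. The bounds are the ones listed above; everything else is an assumption made for illustration: the helper names are hypothetical, the classes are taken as half-open intervals [lower, upper) (XLSTAT's own interval convention may differ), and `example_counts` is a made-up list rather than the tutorial's data.

```python
from bisect import bisect_right

BOUNDS = [0, 1, 2, 3, 4, 5, 10, 15, 20, 40]  # bounds listed in the tutorial


def class_frequencies(counts, bounds):
    """Count values per class, assuming half-open classes [lower, upper)."""
    freqs = [0] * (len(bounds) - 1)
    for value in counts:
        if bounds[0] <= value < bounds[-1]:
            # bisect_right gives the index of the first bound > value,
            # so subtracting 1 yields the class whose lower bound <= value.
            freqs[bisect_right(bounds, value) - 1] += 1
    return freqs


def small_classes(freqs, minimum=5):
    """Return 1-based indices of classes holding fewer than `minimum` values."""
    return [i + 1 for i, f in enumerate(freqs) if f < minimum]


# Hypothetical counts for illustration only, NOT the Bact-Data column.
example_counts = [0, 0, 1, 3, 0, 2, 7, 1, 0, 12, 5, 9, 22, 36]

freqs = class_frequencies(example_counts, BOUNDS)
print("class frequencies:", freqs)
print("classes below 5:", small_classes(freqs))
```

With the 200 real counts, an empty list from `small_classes` would indicate that every class meets the minimum required by the Chi-square test.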

## Setting up the dialog box to fit a distribution

To activate this tool, select the **XLSTAT / Modeling data / Distribution fitting** command, or click on the corresponding button of the **Modeling Data** toolbar (see below).

Once you've clicked on the button, the dialog box appears. Select the data on the Excel sheet. The **Data** are in the B column. We let XLSTAT **estimate** the parameters of the negative binomial distribution function.

XLSTAT offers two different formulations of the negative binomial distribution. The one that is adapted to our case is the second one.

We activate the options for the Kolmogorov-Smirnov and Chi-square goodness-of-fit tests, which are necessary to test our assumption. For the Chi-square test, we use the bounds that we defined above.

The following chart options have been selected.

## Interpreting the results of a distribution fitting analysis

The first results of interest are the values of the k and p parameters of the negative binomial distribution (fitted using the maximum likelihood method), and the sample and theoretical estimates of the mean, variance, skewness and kurtosis. The closer the statistics computed from the data are to those derived from the parameters, the better the fit. Here, the fit is excellent. Note: the theoretical mean is given by kp, and the variance by kp(p+1).
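The two formulas in the note can be checked numerically. The sketch below uses the fitted values reported in this tutorial (k=0.839, p=5.763) and a standard negative binomial probability function whose mean is kp and whose variance is kp(p+1); that this is exactly the formulation XLSTAT uses internally is an assumption. The function names are hypothetical, and `nb_from_moments` is a method-of-moments inversion shown only to illustrate the formulas, not the maximum likelihood fit performed by XLSTAT.

```python
import math


def nb_pmf(x, k, p):
    """P(X = x) for a negative binomial with mean k*p and variance k*p*(1+p)."""
    log_pmf = (math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
               + x * math.log(p / (1 + p)) - k * math.log(1 + p))
    return math.exp(log_pmf)


def nb_moments(k, p):
    """Theoretical mean and variance: kp and kp(p+1)."""
    return k * p, k * p * (1 + p)


def nb_from_moments(mean, variance):
    """Method-of-moments inversion (NOT the maximum likelihood fit)."""
    p = variance / mean - 1
    return mean / p, p


k, p = 0.839, 5.763  # fitted values reported in the tutorial
mean, variance = nb_moments(k, p)
print(round(mean, 1), round(variance, 1))  # 4.8 32.7

# Cross-check the mean by summing x * P(X = x); the tail beyond 2000 is negligible.
numeric_mean = sum(x * nb_pmf(x, k, p) for x in range(2000))
print(abs(numeric_mean - mean) < 1e-6)
```

The rounded values agree with the mean of 4.8 and variance of 32.7 quoted in the conclusion.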

The Chi-square goodness-of-fit test checks whether the Chi-square distance between the empirical and theoretical distributions exceeds a critical value. A visual comparison between the observed and theoretical frequencies is available on the next figure.

For classes 1, 6 and 7, there seems to be a slight difference. In spite of this small difference, the p-value computed for the test (0.767) is well above the significance level we have chosen (0.05). Therefore, the Chi-square test gives no reason to reject our hypothesis that the data follow a negative binomial distribution.
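To make the comparison between observed and theoretical frequencies concrete, the following Python sketch computes the theoretical frequency of each class from the fitted parameters and defines the Chi-square distance. It rests on several assumptions: the probability function is a standard negative binomial form with mean kp and variance kp(p+1), the classes are half-open intervals [lower, upper) built from the bounds above, and the function names are hypothetical. It does not compute the p-value, and the observed frequencies are not reproduced here; they would have to be read from the "Histogram1" sheet.

```python
import math

BOUNDS = [0, 1, 2, 3, 4, 5, 10, 15, 20, 40]  # bounds listed in the tutorial
K, P, N = 0.839, 5.763, 200                  # fitted parameters and sample size


def nb_pmf(x, k, p):
    """P(X = x) for a negative binomial with mean k*p and variance k*p*(1+p)."""
    log_pmf = (math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
               + x * math.log(p / (1 + p)) - k * math.log(1 + p))
    return math.exp(log_pmf)


def expected_frequencies(bounds, k, p, n):
    """Theoretical frequency of each class, assuming classes [lower, upper)."""
    return [n * sum(nb_pmf(x, k, p) for x in range(lo, hi))
            for lo, hi in zip(bounds, bounds[1:])]


def chi_square_statistic(observed, expected):
    """Sum over classes of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


expected = expected_frequencies(BOUNDS, K, P, N)
for i, e in enumerate(expected, start=1):
    print(f"class {i}: expected frequency {e:.1f}")

# observed = [...]  # class frequencies read from the Histogram1 sheet
# print(chi_square_statistic(observed, expected))
```

Because the last bound is 40, the small probability of counts of 40 or more is left out, so the theoretical frequencies sum to slightly less than 200.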

In conclusion, the presence of the bacteria of interest in the river from which the samples were collected follows a negative binomial distribution (k=0.839, p=5.763), with a mean of 4.8 and a variance of 32.7.

The following video shows you how to do the fitting of the distribution.