The histogram is one of the seven basic tools of quality control used to summarize, display and analyze process data. Karl Pearson, 1857–1936, introduced it as a way of showing the probability distribution of a continuous variable.
The derivation of the word “histogram” is uncertain. Sometimes it is said to be derived from the Greek “histos,” meaning “anything set upright” (as the masts of a ship, the bar of a loom, or the vertical bars of a histogram), and “gramma,” meaning “drawing, record, writing.” It is also said that Karl Pearson derived the name from “historical diagram.”
A histogram consists of tabular frequencies, shown as adjacent rectangles, erected over discrete intervals, with an area equal to the frequency of the observations in the interval. The height of a rectangle is therefore equal to the frequency density of the interval, i.e., the frequency divided by the width of the interval. The total area of the histogram is equal to the number of data points. A histogram may also be normalized to display relative frequencies. It then shows the proportion of cases that fall into each of several categories, with the total area equaling 1. The categories are usually specified as consecutive, non-overlapping intervals of a variable. The categories (intervals) must be adjacent, and often are chosen to be of the same size. The rectangles of a histogram are drawn so that they touch each other to indicate that the original variable is continuous.
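The relationship between frequency, width, height and area can be checked with a few lines of Python. This is only a sketch: the data values, the interval limits and the half-open convention [low, high) are assumptions made for illustration, not taken from the figures in this column.

```python
from bisect import bisect_right

def histogram(data, edges):
    """Count observations in consecutive intervals [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    return counts

# Made-up readings and deliberately unequal interval widths.
data = [1, 2, 2, 3, 5, 6, 7, 8, 12, 15, 18, 19]
edges = [0, 4, 10, 20]

counts = histogram(data, edges)
widths = [high - low for low, high in zip(edges, edges[1:])]

# Height of each rectangle = frequency density = frequency / width.
heights = [c / w for c, w in zip(counts, widths)]

# Area of each rectangle = height * width, which gives back the frequency.
areas = [h * w for h, w in zip(heights, widths)]

# Normalized histogram: divide every height by n so the total area is 1.
n = len(data)
norm_heights = [h / n for h in heights]
total_norm_area = sum(h * w for h, w in zip(norm_heights, widths))
```

With these values each interval holds four readings, yet the rectangles have different heights because the widths differ; the areas, not the heights, carry the frequencies.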
The ordinary histogram shows the number of data points per unit interval, so that with equal-width intervals the height of each bar is proportional to the share of the data that falls into that category. The area under the histogram represents the total number of data points. This histogram shows absolute numbers, with the frequency in thousands.
In Figure 1, the histogram on the right differs from the one on the left in that it shows the data cumulatively, so the height of the final bar equals 100%. The curve displayed is a simple density estimate.
In other words, a histogram represents a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies. The intervals are placed together in order to show that the data represented by the histogram, while exclusive, is also continuous. (For example, in a histogram it is possible to have two connecting intervals of 10.5–20.5 and 20.5–33.5, but not two connecting intervals of 10.5–20.5 and 22.5–32.5. Empty intervals are represented as empty and not skipped.)
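The rule about connecting intervals can be expressed as a small check. The sketch below reuses the interval limits from the example in the text; the function names, the readings and the half-open convention [low, high) are assumptions added for illustration.

```python
def is_contiguous(intervals):
    """True if every interval starts exactly where the previous one ends."""
    return all(a[1] == b[0] for a, b in zip(intervals, intervals[1:]))

def count_per_interval(data, intervals):
    """Count readings in each interval [low, high); empty intervals stay as 0."""
    return [sum(low <= x < high for x in data) for low, high in intervals]

connecting = [(10.5, 20.5), (20.5, 33.5)]   # acceptable for a histogram
gapped = [(10.5, 20.5), (22.5, 32.5)]       # leaves a gap from 20.5 to 22.5

connecting_ok = is_contiguous(connecting)
gapped_ok = is_contiguous(gapped)

# Hypothetical readings with nothing in the middle interval.
intervals = [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5)]
readings = [3, 7, 9, 22, 25]
counts = count_per_interval(readings, intervals)
```

The middle interval is reported with a count of zero rather than being dropped, which matches the rule that empty intervals are shown as empty and not skipped.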
Histograms are used to plot the density of data, and often for density estimation, that is, estimating the probability density function of the underlying variable. The total area of a histogram used for probability density is always normalized to 1. If the intervals on the x-axis all have a width of 1, the histogram is identical to a relative frequency plot.
Above are examples of ordinary and cumulative histograms of the same data. The data shown is a random sample of 10,000 points from a normal distribution with a mean of 0 and a standard deviation of 1.
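A comparable data set can be generated and tabulated as follows. This is a sketch only: the seed, the choice of 20 cells and the use of the sample minimum and maximum as the outer limits are arbitrary assumptions, so the counts will not match the figures exactly.

```python
import random
from itertools import accumulate

random.seed(0)  # arbitrary seed, used only so the sketch is repeatable
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]

num_cells = 20  # arbitrary choice for illustration
low, high = min(sample), max(sample)
width = (high - low) / num_cells

# Ordinary histogram: number of points falling in each cell.
counts = [0] * num_cells
for x in sample:
    # The largest value would index one past the end, so clamp it into the last cell.
    i = min(int((x - low) / width), num_cells - 1)
    counts[i] += 1

# Cumulative histogram: running total of the ordinary counts.
cumulative = list(accumulate(counts))
cumulative_pct = [100.0 * c / len(sample) for c in cumulative]
```

The cumulative values never decrease from one cell to the next, and the last one accounts for the whole sample.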
SHAPE OR FORM OF A DISTRIBUTION
The shape of a histogram provides important information about the data distribution. The histogram may be highly or moderately skewed to the left or right. A symmetrical shape is also possible, although a histogram of real data is almost never perfectly symmetrical. If the histogram is skewed to the left, or negatively skewed, the tail extends further to the left.
The mode of a distribution is the value that occurs most frequently or has the largest probability of occurrence. The sample mode occurs at the peak of the histogram.
For many phenomena, it is quite common for the distribution of the response values to cluster around a single mode (unimodal) and then distribute themselves with lesser frequency out into the tails. The normal distribution is the classic example of a unimodal distribution.
The histogram shown in Figure 2 illustrates data from a bimodal (2 peak) distribution. The histogram serves as a tool for diagnosing problems such as bimodality. Questioning the underlying reason for distributional non-unimodality frequently leads to greater insight and improved deterministic modeling of the phenomenon under study. For example, for the data presented above, the bimodal histogram is caused by a lack of uniformity in the data.
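A rough way to flag more than one peak is to look for cells that are taller than both of their neighbors. The cell counts below are invented for illustration and are not the data behind Figure 2; on real, noisy data some smoothing or a minimum peak height would normally be needed before trusting such a count.

```python
def local_peaks(counts):
    """Indices of cells strictly taller than both neighbors (ends compared to 0)."""
    padded = [0] + list(counts) + [0]
    return [
        i - 1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    ]

# Hypothetical cell counts with two humps.
counts = [2, 5, 9, 14, 9, 5, 3, 6, 11, 15, 10, 4, 1]

peaks = local_peaks(counts)                 # one index per peak
modal_cell = counts.index(max(counts))      # the tallest cell holds the sample mode
is_bimodal = len(peaks) == 2
```

Here two cells qualify as peaks, so the data would be flagged as bimodal and the reason for the second peak would be worth investigating.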
An example of a distribution skewed to the left might be the relative frequency of exam scores. Most of the scores are above 70 percent and only a few low scores occur. An example of a distribution skewed to the right, or positively skewed, is a histogram showing the relative frequency of housing values. A relatively small number of expensive homes creates the skewness to the right. The tail extends further to the right. In a symmetrical distribution, the left and right tails mirror each other; a histogram of IQ scores is an example. Histograms can be unimodal, bimodal or multimodal, depending on the dataset.
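The direction of skew can also be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which is one common definition among several. The exam scores and housing values are invented for illustration.

```python
def sample_skewness(values):
    """Moment-based skewness: negative for a left tail, positive for a right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical exam scores: mostly high, with one very low score.
exam_scores = [95, 92, 90, 88, 85, 80, 40]

# Hypothetical housing values in thousands: mostly modest, with one expensive home.
housing_values = [150, 160, 170, 180, 200, 250, 900]

# A perfectly symmetrical set of values.
symmetric = [1, 2, 3, 4, 5]

skew_exam = sample_skewness(exam_scores)        # negative: skewed to the left
skew_housing = sample_skewness(housing_values)  # positive: skewed to the right
skew_symmetric = sample_skewness(symmetric)     # zero: tails mirror each other
```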
A truncated histogram ends abruptly at one end, which indicates possible sorting or inspection of non-conforming parts. This may also mean that part of the distribution has been removed by screening, 100% inspection or review. Such practices are usually costly and are good candidates for improvement efforts.
Plateau Histograms. A nearly flat or plateau-like histogram often means that the process is not well defined or understood by those doing the work or inspection. Since individuals run the process in different ways, there are a great many different measurements and none that stand out. The solution is to more clearly define the process and/or piece part parameters.
The plateau might be called a “multimodal distribution.” Several processes with normal distributions are combined. Because there are many peaks close together, the top of the distribution resembles a plateau.
Number of cells and width. There is no “best” number of cells, and different cell sizes can reveal different features of the data. Some theoreticians have attempted to determine an optimal number of cells, but these methods generally make strong assumptions about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis, different cell widths may be appropriate, so experimentation is usually needed to determine an appropriate width. There are, however, various useful guidelines and rules of thumb.
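Two widely quoted rules of thumb, neither of which is named in this column, are the square-root rule and Sturges' rule. They are sketched below only as starting points; Sturges' rule in particular assumes roughly normal data, which is the kind of strong assumption noted above.

```python
import math

def square_root_rule(n):
    """Number of cells as the square root of the sample size, rounded up."""
    return math.ceil(math.sqrt(n))

def sturges_rule(n):
    """Number of cells as ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

n = 100  # hypothetical sample size
cells_sqrt = square_root_rule(n)
cells_sturges = sturges_rule(n)
```

For the same 100 readings the two rules give different answers, which illustrates why some experimentation with the cell width is usually worthwhile.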
Most engineers favor setting the number of cells somewhere between 11 and 17, but always an odd number. The latter point is important so that the mid-point of the distribution is not split between two cells. It is also a good rule, when using measurement data, to set the cell limits halfway between adjacent possible readings, that is, to one more decimal place than the most precise data. Consider what happens where a cell is 4 to 8 and the next cell 8 to 12. A reading of 8 could fall in either cell, hence the rule.
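The half-unit rule can be sketched as follows. The function name and its parameters are hypothetical, and the example assumes readings recorded to the nearest whole number, so the limits fall on values ending in .5.

```python
def cell_limits(lowest_reading, cell_width, num_cells, resolution):
    """Cell limits offset by half the measurement resolution.

    Because no reading can land exactly on a limit, every reading
    belongs to exactly one cell.
    """
    if num_cells % 2 == 0:
        raise ValueError("use an odd number of cells")
    start = lowest_reading - resolution / 2
    return [start + i * cell_width for i in range(num_cells + 1)]

# Readings to the nearest 1, cells 4 units wide, 11 cells (an odd number).
limits = cell_limits(lowest_reading=4, cell_width=4, num_cells=11, resolution=1)

# A reading of 8 now falls in one cell only: the one from 7.5 to 11.5.
reading = 8
cell_index = next(
    i for i in range(len(limits) - 1) if limits[i] < reading < limits[i + 1]
)
```

Instead of cells of 4 to 8 and 8 to 12, the limits become 3.5, 7.5, 11.5 and so on, which removes the ambiguity for a reading of 8.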
Kurtosis. In probability theory and statistics, kurtosis (derived from the Greek word meaning “bulging”) is any measure of the “peakedness” of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population.
One common measure of kurtosis, originating with Karl Pearson, is based on a scaled version of the fourth moment of the data or population, but it has been argued that this measure really measures heavy tails, and not peakedness. For this measure, higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations. It is common practice to use an adjusted version of Pearson’s kurtosis, the excess kurtosis, to provide a comparison of the shape of a given distribution to that of the normal distribution. Distributions with negative or positive excess kurtosis are called platykurtic or leptokurtic distributions, respectively. When a curve, or histogram, is compared to a normal distribution, a platykurtic data set has a flatter peak around its mean and thinner tails.
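Pearson's measure and the excess kurtosis can be computed directly from the moments. The sketch below uses the simple population form (fourth central moment divided by the square of the second) without any small-sample correction, and the two data sets are invented for illustration.

```python
def pearson_kurtosis(values):
    """Fourth central moment scaled by the squared variance (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

def excess_kurtosis(values):
    """Pearson's kurtosis minus 3, the value for a normal distribution."""
    return pearson_kurtosis(values) - 3.0

def describe(values):
    """Label a data set by the sign of its excess kurtosis."""
    excess = excess_kurtosis(values)
    if excess < 0:
        return "platykurtic"
    if excess > 0:
        return "leptokurtic"
    return "mesokurtic (zero excess)"

# Hypothetical flat data: every value from 1 to 10 occurs once.
flat = list(range(1, 11))

# Hypothetical data with most values at the center and two extreme deviations.
heavy = [0] * 8 + [-10, 10]

flat_label = describe(flat)
heavy_label = describe(heavy)
heavy_excess = excess_kurtosis(heavy)
```

In the second data set nearly all of the variance comes from the two extreme values, which is the situation the text describes as higher kurtosis.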
Leptokurtic describes a distribution in which the excess kurtosis is positive. Leptokurtic distributions have higher peaks around the mean compared to normal distributions. The Japanese scientist Genichi Taguchi argued that the goal of manufacturing should not be simply to produce product within the specification, but rather to produce product as close to nominal as possible. He argued that any deviation from nominal has a cost.
There isn’t space in this column to fully explain this idea; suffice it to say that a leptokurtic distribution will produce superior product. There is a greater difference between a part produced near the statistical design limit in a process producing a platykurtic distribution and one with a leptokurtic distribution.
The Taguchi Principle is the basis upon which Six Sigma theory and practice are built.
Leslie W. Flott, Ph.B., CQE, ASQ Fellow, is certified as an IDEM Wastewater Treatment Operator and Indiana Wastewater Treatment Operator. He received his Bachelor of Science degree in chemistry from Northwestern University and his master's degree in materials engineering from Notre Dame University. Most recently, Flott served as the environmental program director and instructor at Ivy Tech Community College. Prior to that, he was the health, environment, and safety manager at Wayne Metal Protection Company.