Box plot

Overview


In descriptive statistics, a boxplot (also known as a box-and-whisker diagram or box-and-whisker plot) is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation, lower quartile (Q1), median, upper quartile (Q3), and largest observation. A boxplot also indicates which observations, if any, might be considered outliers. The boxplot was invented in 1977 by the American statistician John Tukey.

Boxplots can display differences between populations without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and help identify outliers. Boxplots can be drawn either horizontally or vertically.

Construction
For a data set, one constructs a horizontal box plot in the following manner:
 * Calculate the first quartile (x.25), the median (x.50) and third quartile (x.75)
 * Calculate the interquartile range (IQR) by subtracting the first quartile from the third quartile (x.75 - x.25).
 * Construct a box above the number line bounded on the left by the first quartile (x.25) and on the right by the third quartile (x.75). The box may be as tall as one likes, although reasonably proportioned boxplots are customary.
 * Mark the median inside the box with a symbol or with a line dividing the box at the median value.
 * The mean value of the data can also be labeled with a point.
 * Any observation that lies more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile is considered an outlier. Mark the smallest value that is not an outlier with a vertical tick mark or "whisker", and connect the whisker to the box with a horizontal line. Likewise, mark the largest value that is not an outlier with a "whisker", and connect that whisker to the box with another horizontal line.
 * Indicate outliers by open and closed dots. "Extreme" outliers, those which lie more than 3*IQR to the left of the first quartile or to the right of the third quartile, are indicated by an open dot. "Mild" outliers, those which lie more than 1.5*IQR from the first or third quartile but are not extreme outliers, are indicated by a closed dot.
 * Add an appropriate label to the number line and title the boxplot.
 * A boxplot may be constructed in a similar manner vertically as opposed to horizontally by merely interchanging "bottom" for "left" and "top" for "right" in the above description.
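
The numerical part of the steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name boxplot_stats and the sample data are hypothetical, and quartiles are computed with the "inclusive" method of the standard library's statistics.quantiles, which is one of several quartile conventions in use.

```python
import statistics


def boxplot_stats(data):
    """Return the quantities needed to draw a Tukey-style box plot."""
    values = sorted(data)
    # Assumption: "inclusive" quartiles; other conventions give slightly
    # different values for small data sets.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Points beyond 1.5*IQR from the box are outliers; beyond 3*IQR, extreme.
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    inside = [v for v in values if inner_low <= v <= inner_high]
    extreme = [v for v in values if v < outer_low or v > outer_high]
    mild = [v for v in values
            if (outer_low <= v < inner_low) or (inner_high < v <= outer_high)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme observations that are not outliers.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "mild_outliers": mild, "extreme_outliers": extreme,
        "mean": statistics.fmean(values),
    }


# Hypothetical sample data, chosen to contain one mild and one extreme outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 100]
for name, value in boxplot_stats(sample).items():
    print(name, value)
```

The returned dictionary holds everything the drawing steps need: the box edges (q1, q3), the median line, the optional mean marker, the two whisker positions, and the points to draw as closed (mild) or open (extreme) dots.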

Example
A plain-text version might look like this:

                            +-----+-+
  o           *     |-------|     | |---|
                            +-----+-+

+---+---+---+---+---+---+---+---+---+---+   number line
0   1   2   3   4   5   6   7   8   9  10

For this data set:
 * smallest non-outlier observation = 5 (left "whisker")
 * lower (first) quartile (Q1, x.25) = 7
 * median (second quartile) (Med, x.5) = 8.5
 * upper (third) quartile (Q3, x.75) = 9
 * largest non-outlier observation = 10
 * interquartile range, IQR = Q3 - Q1 = 9 - 7 = 2
 * the value 3.5 is a "mild" outlier, between 1.5*(IQR) and 3*(IQR) below Q1
 * the value 0.5 is an "extreme" outlier, more than 3*(IQR) below Q1
 * the data is skewed to the left (negatively skewed)

The horizontal lines (the "whiskers") extend to at most 1.5 times the box width (the interquartile range) from either or both ends of the box. They must end at an observed value, thus connecting all the values outside the box that are not more than 1.5 times the box width away from the box. Three times the box width marks the boundary between "mild" and "extreme" outliers. In this boxplot, "mild" and "extreme" outliers are differentiated by closed and open dots, respectively.
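
The classification of the example's observations can be checked with a few lines of Python. This is a minimal sketch: the function name classify is hypothetical, and the quartiles Q1 = 7 and Q3 = 9 are taken from the example above.

```python
def classify(value, q1, q3):
    """Label a value as 'extreme', 'mild' or 'not an outlier' (Tukey's rule)."""
    iqr = q3 - q1
    # Distance of the value beyond the nearer edge of the box (0 if inside it).
    distance = max(q1 - value, value - q3, 0)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return "not an outlier"


Q1, Q3 = 7, 9  # quartiles from the example, so IQR = 2
for observation in (0.5, 3.5, 5, 10):
    print(observation, classify(observation, Q1, Q3))
```

With IQR = 2, the boundary for mild outliers is 7 - 3 = 4 and the boundary for extreme outliers is 7 - 6 = 1, so 3.5 is labelled mild, 0.5 is labelled extreme, and the whisker ends 5 and 10 are not outliers.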

There are alternative implementations of this detail of the box plot in various software packages, such as the whiskers extending to at most the 5th and 95th (or some more extreme) percentiles. Such approaches do not conform to Tukey's definition, with its emphasis on the median in particular and counting methods in general, and they tend to produce "outliers" for all data sets larger than ten, no matter what the shape of the distribution.
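
The difference between the two whisker rules can be illustrated with a short Python sketch. The function names and the evenly spaced test data are hypothetical, and the "inclusive" quantile method of statistics.quantiles is an assumption; individual software packages may compute percentiles differently.

```python
import statistics


def tukey_outliers(data):
    """Points more than 1.5*IQR outside the box (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in data if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]


def percentile_outliers(data, lower_pct=5, upper_pct=95):
    """Points outside whiskers placed at two integer percentiles (1..99)."""
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in data if v < low or v > high]


# Evenly spaced values with no unusual observations at all.
evenly_spaced = list(range(1, 101))
print(len(tukey_outliers(evenly_spaced)))       # 0
print(len(percentile_outliers(evenly_spaced)))  # 10
```

For these evenly spaced values Tukey's rule flags nothing, while the percentile whiskers mark the five smallest and five largest values as "outliers" purely because of how the whiskers are defined.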

Visualization


The boxplot is a quick graphical approach for examining one or more sets of data. Boxplots may seem more primitive than a histogram or probability density function (pdf), but they do have some advantages. Besides saving space on paper, boxplots are quicker to generate by hand. Histograms and probability density functions require assumptions about the statistical distribution. These assumptions can be a major barrier, because the choice of bins can heavily influence the histogram and an incorrect variance calculation will heavily affect the probability density function.

Because looking at a statistical distribution is more intuitive than looking at a boxplot, comparing the boxplot against the probability density function (theoretical histogram) for a normal N(0, σ²) distribution may be a useful tool for understanding the boxplot (Figure 2).
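
The same comparison can be made numerically. The sketch below uses Python's statistics.NormalDist to locate the box and the 1.5*IQR whisker limits for a standard normal distribution; setting σ = 1 is an assumption made for illustration, and for other values of σ the positions simply scale by σ.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # assumption: sigma = 1

# Quartiles of the theoretical distribution give the edges of the box.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Whisker limits at 1.5*IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass inside the box and beyond the whisker limits.
inside_box = std_normal.cdf(q3) - std_normal.cdf(q1)
beyond_fences = std_normal.cdf(lower_fence) + (1.0 - std_normal.cdf(upper_fence))

print(f"box edges:      {q1:.4f} to {q3:.4f} (IQR = {iqr:.4f})")
print(f"whisker limits: {lower_fence:.4f} to {upper_fence:.4f}")
print(f"inside the box: {inside_box:.1%}")
print(f"beyond limits:  {beyond_fences:.2%}")
```

For normally distributed data the box spans roughly ±0.67σ and holds half of the probability, the whisker limits fall near ±2.7σ, and a little under 1% of observations would be expected beyond them.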