Exploratory data analysis

Exploratory data analysis (EDA) is an approach to examining data in order to form hypotheses worth testing, complementing the tools of conventional statistics for testing hypotheses. The term was coined by John Tukey.

EDA development
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis), and that more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analysis, and employing them on the same set of data, can lead to systematic bias owing to the issues endemic to testing hypotheses suggested by the data.

The objectives of EDA are to:
 * Suggest hypotheses about the causes of observed phenomena
 * Assess assumptions on which statistical inference will be based
 * Support the selection of appropriate statistical tools and techniques
 * Provide a basis for further data collection through surveys or experiments

Tukey's books were notoriously opaque, and so several attempts were made to popularise his EDA ideas. Prominent among these was the Statistics in Society (MDST 242) course of The Open University.

Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.

Techniques
There are a number of tools that are useful for EDA, but EDA is defined more by the attitude taken than the techniques used.

The principal graphical techniques used in EDA are:

 * Box plot
 * Histogram
 * MultiVari chart
 * Run chart
 * Pareto chart
 * Scatter plot
 * Stem-and-leaf plot
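Of these displays, the stem-and-leaf plot is simple enough to sketch in pure Python. The helper below is illustrative only (the function name is my own, and it assumes non-negative integer data with tens-digit stems):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Render a basic stem-and-leaf plot for non-negative integers.

    Each value is split into a stem (the tens part) and a leaf
    (the units digit); leaves sharing a stem are listed on one row.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return "\n".join(
        f"{stem:2d} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    )

# Example: print(stem_and_leaf([12, 15, 21, 27, 27, 33]))
```

Because the leaves are actual digits of the data, the display doubles as a histogram while still letting every observation be read back off the plot.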

The principal quantitative techniques are:

 * Median polish
 * Letter values
 * Resistant line
 * Resistant smooth
 * Rootogram
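The first of these, Tukey's median polish, fits an additive model (overall + row effect + column effect + residual) to a two-way table by repeatedly sweeping out row and column medians. The sketch below is an illustrative NumPy implementation under my own choices of function name and a fixed iteration count, not a reference version:

```python
import numpy as np

def median_polish(table, n_iter=10):
    """Decompose a two-way table as overall + row + column + residual
    by alternately subtracting row medians and column medians."""
    residuals = np.array(table, dtype=float)
    overall = 0.0
    row_eff = np.zeros(residuals.shape[0])
    col_eff = np.zeros(residuals.shape[1])
    for _ in range(n_iter):
        row_med = np.median(residuals, axis=1)
        residuals -= row_med[:, None]
        row_eff += row_med
        col_med = np.median(residuals, axis=0)
        residuals -= col_med[None, :]
        col_eff += col_med
        # keep the effects centred: move their medians into the overall term
        m = np.median(row_eff)
        row_eff -= m
        overall += m
        m = np.median(col_eff)
        col_eff -= m
        overall += m
    return overall, row_eff, col_eff, residuals
```

Because medians rather than means are swept out, the fit is resistant to outlying cells, in keeping with the exploratory emphasis on resistant summaries.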

Techniques that are both graphical and quantitative include:

 * Multidimensional scaling
 * Ordination
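As an illustration of the first of these, classical (Torgerson) multidimensional scaling recovers a low-dimensional configuration of points from a matrix of pairwise distances via an eigendecomposition of the double-centred squared distances. This is a minimal sketch of that classical variant (function name mine), not of MDS in general:

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points in k dimensions from an n-by-n matrix D of
    pairwise Euclidean distances (classical/Torgerson scaling)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:k]      # keep the k largest eigenpairs
    scale = np.sqrt(np.maximum(eigvals[idx], 0.0))
    return eigvecs[:, idx] * scale
```

For distances that genuinely come from points in k-dimensional Euclidean space, the embedding reproduces them exactly (up to rotation and reflection); otherwise it gives a least-squares-style approximation, which is what makes the plot useful for exploration.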

History
Many EDA ideas can be traced back to earlier authors, for example:
 * Francis Galton - his emphasis on order statistics and percentiles
 * Arthur Bowley - used precursors of the stemplot and five-figure summary (Bowley actually used a "seven-figure summary", including the extremes, deciles and quartiles, along with the median)
 * Andrew Ehrenberg - his philosophy of data reduction (see his book of the same name)

The Open University course Statistics in Society (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.

For details of the above, see John Bibby's book HOTS: History of Teaching Statistics.

Software

 * CMU-DAP (Carnegie-Mellon University Data Analysis Package, FORTRAN source for EDA tools with English-style command syntax, 1977)
 * Fathom (for high-school and intro college courses)
 * LiveGraph (free real-time data series plotter)
 * TinkerPlots (for upper elementary and middle school students)