Data dredging

Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) search for 'statistically significant' relationships in large quantities of data. This activity was formerly known in the statistical community as data mining, but that term is now in widespread use with an essentially positive meaning, so the pejorative term data dredging is now used instead.

Conventional statistical procedure is to formulate a research hypothesis (such as 'people in higher social classes live longer'), then collect relevant data, and then carry out a statistical significance test to see whether the results could plausibly be due to chance.

A key point is that every hypothesis must be tested with evidence that was not used in constructing the hypothesis. This is because every data set contains some chance patterns that are not present in the population under study, or that would disappear with a sufficiently large sample. If the hypothesis is not tested on a different data set drawn from the same population, it is likely that the patterns found are merely chance patterns.

As a simplistic example, throwing five coins first and observing 2 heads and 3 tails might tempt one to ask why the coins favour tails by fifty percent (3 is 50% more than 2). Forming the hypothesis first, by contrast, leads to the conclusion that only a 5-0 or 0-5 result would be genuinely surprising, since the odds are 93.75% against such an extreme split occurring by chance. In the latter case, it becomes obvious that the data are not anomalous.
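These probabilities can be checked directly. The short stdlib-only sketch below computes them from the binomial distribution (the helper name `prob_split` is ours, chosen for illustration):

```python
from math import comb

def prob_split(n, k, p=0.5):
    """Probability of exactly k heads in n independent fair coin tosses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of an extreme 5-0 or 0-5 split in five tosses:
p_extreme = prob_split(5, 5) + prob_split(5, 0)   # 2/32 = 0.0625

# Probability of the observed 2 heads / 3 tails split:
p_observed = prob_split(5, 2)                     # 10/32 = 0.3125

print(f"P(5-0 or 0-5) = {p_extreme:.4f}")   # 0.0625, i.e. 93.75% against
print(f"P(2-3 split)  = {p_observed:.4f}")  # 0.3125 -- entirely unremarkable
```

A 2-3 split turns up nearly a third of the time, so treating it as a finding in need of explanation is exactly the error the article describes.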

It is important to realise that the alleged statistical significance here is completely spurious - significance tests do not protect against data dredging. When a hypothesis is tested on the very data set that suggested it, that data set is by definition not a representative sample, and any resulting significance levels are meaningless.

Examples
In meteorology, hypotheses are often formulated using weather data up to the present and then tested against incoming weather data, which ensures that, even subconsciously, the test data could not have influenced the formulation of the hypothesis. Such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. It also ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

Consider an analysis of sales in the period following an advertising campaign. Suppose that aggregate sales were unchanged, but that analysis of a sample of households found that sales did go up for Spanish-speaking households, or for households with incomes between $35,000 and $50,000, or for households that had refinanced in the past two years, and that such increases were 'statistically significant' when comparing the treatment and control groups. There would certainly be a temptation to report such findings as 'proof' that the campaign was successful, or would be successful if targeted to such groups in other markets.
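This subgroup hunt is easy to simulate. The sketch below (all names and figures are hypothetical) generates household data in which there is no real effect at all, then tests 20 arbitrary subgroup definitions with a simple permutation test; purely by chance, some subgroups will typically come out nominally 'significant' at the 0.05 level:

```python
import random

random.seed(1)

# Hypothetical data: 200 households whose post-campaign sales change is
# pure noise -- there is no real effect for any subgroup to pick up.
n = 200
sales_change = [random.gauss(0.0, 1.0) for _ in range(n)]

def perm_pvalue(values, member, n_perm=500):
    """Two-sided permutation test: is the subgroup mean unusually far
    from the mean of the remaining households?"""
    k = sum(member)
    group_mean = sum(v for v, m in zip(values, member) if m) / k
    rest_mean = sum(v for v, m in zip(values, member) if not m) / (len(values) - k)
    observed = abs(group_mean - rest_mean)
    extreme = 0
    for _ in range(n_perm):
        shuffled = random.sample(values, len(values))
        g = sum(shuffled[:k]) / k
        r = sum(shuffled[k:]) / (len(values) - k)
        if abs(g - r) >= observed:
            extreme += 1
    return extreme / n_perm

# Dredge: test 20 arbitrary subgroup definitions, none related to sales.
pvalues = []
for _ in range(20):
    idx = set(random.sample(range(n), 60))   # an arbitrary 60-household "subgroup"
    pvalues.append(perm_pvalue(sales_change, [i in idx for i in range(n)]))

hits = sum(p < 0.05 for p in pvalues)
print(f"nominally 'significant' subgroups in pure noise: {hits} of 20")
```

At a 0.05 threshold, roughly one in twenty tests on pure noise is expected to clear the bar, which is exactly why reporting only the subgroups that 'worked' is misleading.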

Remedies
One way to construct hypotheses while avoiding the problems of data dredging is randomization. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset - say, subset A - is examined for creating hypotheses. Once a hypothesis has been formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where such a hypothesis is also supported by B is it reasonable to believe that the hypothesis might be valid.
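A minimal sketch of this split-then-confirm discipline, using a hypothetical (social class, lifespan) data set matching the article's earlier example; `lifespan_gap` is an illustrative summary statistic standing in for a full significance test:

```python
import random
import statistics

random.seed(0)

# Hypothetical data set: (social_class, lifespan) pairs, purely illustrative.
data = [(random.randint(1, 5), random.gauss(75.0, 10.0)) for _ in range(1000)]

# Randomly partition into an exploration subset A and a confirmation subset B.
shuffled = random.sample(data, len(data))
subset_a, subset_b = shuffled[:500], shuffled[500:]

def lifespan_gap(rows):
    """Mean lifespan of upper classes (4-5) minus that of lower classes (1-2)."""
    upper = [life for cls, life in rows if cls >= 4]
    lower = [life for cls, life in rows if cls <= 2]
    return statistics.mean(upper) - statistics.mean(lower)

# Step 1: explore subset A freely; any pattern noticed here becomes a hypothesis.
gap_a = lifespan_gap(subset_a)

# Step 2: test that one hypothesis, once, on subset B, which played no part
# in suggesting it.
gap_b = lifespan_gap(subset_b)

print(f"gap on A (exploration):  {gap_a:+.2f} years")
print(f"gap on B (confirmation): {gap_b:+.2f} years")
```

The essential point is the one-way flow: subset B is never inspected until the hypothesis is fixed, so any pattern that survives on B cannot be an artefact of the search on A.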

Another remedy for data dredging is to record the number of all significance tests conducted during the experiment and simply multiply the final significance level by this number (the Bonferroni correction). This solution does not prevent collective data dredging, however, because the probability that a given null hypothesis is tested at all is influenced by the number of previous successful rejections of that hypothesis; a correction applied within a single study cannot account for tests performed elsewhere.
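Multiplying each p-value by the number of tests is equivalent to comparing it against the significance level divided by that number. A minimal sketch of the correction, with illustrative p-values:

```python
def bonferroni(pvalues, alpha=0.05):
    """Bonferroni correction: reject only those tests whose p-value falls
    below alpha / m, where m is the total number of tests conducted.
    (Equivalently: multiply each p-value by m and compare to alpha.)"""
    m = len(pvalues)
    threshold = alpha / m
    return [p < threshold for p in pvalues]

# Ten tests at the 0.05 level; the per-test threshold becomes 0.005.
pvals = [0.001, 0.04, 0.03, 0.2, 0.5, 0.8, 0.01, 0.06, 0.3, 0.9]
print(bonferroni(pvals))  # only 0.001 survives the correction
```

Note that three of these tests (0.001, 0.04, 0.03 and 0.01) would look 'significant' in isolation, but after accounting for all ten tests only the strongest result is retained.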