Why Data Dredging is trending

“If you torture the data long enough, it will confess to anything.”   — Ronald Coase, British economist

If you thought data was only ‘mined’ and ‘extracted’ for analysis, take a look at the frequently used practice of ‘data dredging’.

As we move from traditional eyeballing of statistical data to deeper, machine-based techniques, the entire process of data extraction becomes more technique-driven.

One such practice is analysing large volumes of data in the quest for any possible relationship. An example would be “fishing” in very large datasets to analyse crime clusters without understanding causation, or “snooping” through an app user’s habits to find correlations. That is, combing data for patterns without pre-established hypotheses or objectives. This may sound absurd, but it can actually throw up significant unseen relationships (what does the app user do at lunchtime when in the vicinity of Connaught Place, New Delhi?).
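A minimal sketch of what such hypothesis-free combing looks like in practice (the dataset here is synthetic noise, so any “finding” is spurious by construction; the column count and cutoff are illustrative assumptions, not from the article):

```python
# Scan many random variables for any correlation that looks "significant",
# with no hypothesis chosen in advance -- the essence of dredging.
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols = 100, 50
data = rng.normal(size=(n_rows, n_cols))  # pure noise: no real relationships

# Compare every pair of columns and flag |r| above an arbitrary cutoff.
hits = []
for i in range(n_cols):
    for j in range(i + 1, n_cols):
        r = np.corrcoef(data[:, i], data[:, j])[0, 1]
        if abs(r) > 0.25:  # roughly p < 0.05 for n = 100
            hits.append((i, j, round(r, 3)))

print(f"Scanned {n_cols * (n_cols - 1) // 2} pairs of noise columns, "
      f"found {len(hits)} 'significant' correlations")
```

Even though every column is independent noise, scanning 1,225 pairs reliably turns up a handful of “significant” correlations, which is exactly why patterns found this way need confirmation on fresh data.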

With the evolution of Big Data, a fundamentally different practice of experimental design has emerged. Formerly, the project and the questions asked would decide what data to collect for analysis. Now, the low cost of data storage has prompted a rethink: all kinds of data are collected first and then searched for significant patterns.

This practice of “data dredging” differs from traditional data mining.

Data Dredging Explained

Where the sample size is not truly representative, where there is ‘confounding’ or ‘selection bias’, or where too many hypotheses are tested against a given dataset, some variables may appear highly correlated and statistically significant even though no real effect exists between them. At a significance level of 0.05 (5%), roughly one in twenty tests of unrelated variables will look “significant” by chance alone. This is the typical case of “data dredging”: false positive findings that result from looking at too many possible associations. One way to curb such errors is to be stringent with “significance” levels, moving to P < 0.001 or beyond.
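The one-in-twenty rate, and the effect of tightening the threshold, can be checked with a quick simulation (this assumes SciPy is available; the test count and sample size are illustrative choices):

```python
# Test many hypotheses on pure noise: at p < 0.05 false positives appear at
# roughly the 5% rate, while p < 0.001 removes almost all of them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_obs = 1000, 50

p_values = []
for _ in range(n_tests):
    x = rng.normal(size=n_obs)
    y = rng.normal(size=n_obs)  # independent of x: there is no real effect
    _, p = stats.pearsonr(x, y)
    p_values.append(p)

loose = sum(p < 0.05 for p in p_values)
strict = sum(p < 0.001 for p in p_values)
print(f"False positives at p < 0.05:  {loose} of {n_tests}")
print(f"False positives at p < 0.001: {strict} of {n_tests}")
```

Expect on the order of 50 “discoveries” at the loose threshold and close to zero at the strict one, which is the rationale for moving to P < 0.001 when many associations are tested.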

Applications of Data Dredging

  • Forensic Analysis
  • Market Basket Analysis
  • Risk Analysis
  • Fraud detection
  • Medical Science
  • Public Health
  • Clinical Research
  • Digital Analytics
  • Social Media

When does Data Dredging occur?

  • Failure to adjust for the statistical effects of searching in large models
  • When there is statistical bias, confounding, or misrepresentation of the P<0.05 significance test
  • When there is suboptimal model construction
  • When there is ‘overfitting’ of data
  • When too many hypotheses are tested without proper statistical control
  • When there is ‘oversearching’ of relationships between variables
  • When there is overestimation of a model’s accuracy
  • When a data mining technique is explicitly used to prove a particular pre-established point of view!
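The ‘overfitting’ item above can be illustrated with a small sketch: a model with far more flexibility than the data warrants fits the training sample almost perfectly, yet that apparent accuracy is an artefact of chasing noise (the data and polynomial degrees here are illustrative assumptions):

```python
# Overfitting sketch: the true relation is linear, but a high-degree
# polynomial "discovers" structure in the noise and reports a far lower
# training error than the honest linear fit.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(scale=0.3, size=20)  # linear signal plus noise

def train_mse(degree):
    """Mean squared error of a degree-`degree` polynomial on its own
    training data."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_lin = train_mse(1)    # matches the true model
train_big = train_mse(12)   # flexible enough to chase the noise

print(f"train MSE, degree 1:  {train_lin:.4f}")
print(f"train MSE, degree 12: {train_big:.4f}")
```

The degree-12 fit always scores better on its own training data, but on fresh data drawn from the same process it would typically do worse, which is why a model’s accuracy must be judged on data it has not seen.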

So the next time you read research findings like “Teens who eat lots of chocolate tend to be slimmer”, take them with a pinch of salt. Better still, consider them a possible consequence of distorted “data dredging”!
