Sangeeta, Mar 18, 2015
“If you torture the data long enough, it will confess to anything.” — Ronald Coase, British economist
If you thought data was only ‘mined’ and ‘extracted’ for analysis, take a look at the frequently used method of ‘data dredging’.
As we move from traditional eyeballing of statistical data to machine-based techniques, the entire process of data extraction becomes more technique-driven.
One such data extraction practice is the analysis of large volumes of data in search of any possible relationship. An example would be “fishing” in very large datasets to analyse crime clusters without understanding causation, or “snooping” into an app user’s habits to find correlations. In other words, it means combing data for patterns without pre-established hypotheses or objectives. This may sound absurd, but it can actually throw up significant, previously unseen relationships (what does the app user do at lunchtime when in the vicinity of Connaught Place, New Delhi?).
With the evolution of Big Data, a fundamentally different practice of experimental design has emerged. Formerly, the project and the questions asked decided what data to collect and analyse. Now, the low cost of data storage has prompted a rethink: all kinds of data are collected first and then searched for significant patterns.
This practice of “data dredging” differs from traditional Data Mining practices.
Where the sample is not truly representative, where there is ‘confounding’ or ‘selection bias’, or where there are too many hypotheses for a given dataset, some variables may appear highly correlated and statistically significant even though there is no real effect between them. At a significance level of 0.05 (5%), roughly one in twenty tests of truly unrelated variables will look significant by chance alone. This is a typical case of “data dredging”: false positive findings that result from looking at too many possible associations. One way to reduce the errors of “data dredging” is to be more stringent with “significance” levels, moving to P<0.001 or beyond.
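To make the false positive problem concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a recipe from any particular study: it generates pairs of completely unrelated random variables, tests each pair for correlation, and counts how many look "significant" at P<0.05 and at P<0.001. The function names (pearson_r, approx_p_value, count_false_positives), the sample sizes and the use of the Fisher z-transformation as an approximate p-value are all choices made for this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r using the Fisher z-transformation.

    This is a normal approximation (an assumption of this sketch),
    reasonable for moderately large n.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(n_tests, n_samples, alpha, seed=0):
    """Test n_tests pairs of independent variables; count 'significant' ones.

    Every pair is unrelated by construction, so each hit is a false positive.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        xs = [rng.gauss(0, 1) for _ in range(n_samples)]
        ys = [rng.gauss(0, 1) for _ in range(n_samples)]
        if approx_p_value(pearson_r(xs, ys), n_samples) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    for alpha in (0.05, 0.001):
        hits = count_false_positives(n_tests=1000, n_samples=100, alpha=alpha)
        print(f"alpha={alpha}: {hits} of 1000 unrelated pairs look 'significant'")
```

Because the same seed is used for both thresholds, the two runs examine identical data, so the difference in counts comes purely from the stricter cut-off. A stricter threshold reduces false positives but does not remove them, and it also makes genuine but weak effects harder to detect.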
Applications of Data Dredging
When does Data Dredging occur?
So the next time you read research findings like “Teens who eat lots of chocolate tend to be slimmer”, take them with a pinch of salt. Better still, look at them as a possible consequence of distorted “data dredging”!