What if earthquakes were predicted?

The weekend was taken over by the worst earthquake in the sub-continent in the last eight decades. Thousands of lives have been lost, scores injured and property worth millions destroyed. News of further losses pours in as the aftershocks continue to shake us all up.

Geologists and scientists have always been monitoring the occurrence and frequency of earthquakes to simulate models. As the potential of seismic activity along tectonic fault lines is pre-established, it is surely possible to correlate structured and unstructured data from various sources to predict earthquakes? That’s what we die-hard believers in GIS and analytics would like to think.

Well, you may be surprised to know it is indeed possible.

Earthquake prediction began with the Haicheng earthquake of China (1975). A blend of empirical analysis, intuitive judgment, extensive scientific studies and a series of foreshocks was used to make the prediction. Evacuation of the city of one million people was ordered just days before a 7.3-magnitude quake devastated Haicheng. The success of the prediction was based on earthquake precursors – unusually high temperatures, sulphurous gas emissions, strange animal behaviour, abnormal earthquake cloud formations – appearing along fault lines, together with geologic data.

Nearly four decades on, advances in analytical rigour have moved beyond the descriptive to the predictive. A company called Terra Seismic now successfully forecasts earthquakes, like the Tarapaca megaquake (8.1) of Chile, the Mexico quake (7.2) and the 6.4 quake in Indonesia, nine days before it hit on March 3. Using its flagship Quake Hunters, the company offers seismic analytics-as-a-service to insurers and government agencies.


The Terra Seismic message tells you ‘Forecasting Earthquakes’ is possible!

Companies like Terra Seismic use satellite Big Data, earth observations and the internet “to predict major earthquakes anywhere in the world with 90% accuracy”. Data from satellite services, as well as ground based sensors is used to measure abnormalities in the atmosphere that occur before the real quake. The Apache Server system processes voluminous satellite data, correlating with sensor and other data based on earlier occurrences, for real-time estimation, analysis and simulation.

In an earlier blog, we explored the power of big data analytics in climate science. In seismic activity prediction too, big data makes it possible to analyse potentially seismic zones and connect them with huge volumes of structured and unstructured data, to construct fairly accurate models using statistical analysis.

Analysing unstructured data

Take Twitter data for instance. You can mine information related to earthquakes or their occurrence by targeting hashtags based on scientifically established earthquake precursors. So hashtags would factor in #unusual #clouds #behavior #animals #weather #gasemissions #cloudformations #hightemp #EarthquakePrediction and so on.

Using various filters, streaming endpoints, images, feeds, links, and the like, huge amounts of unstructured data can be correlated with potential seismic areas, hazard maps and fault lines.
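As a rough sketch of the hashtag idea, assuming the tweets have already been collected as plain text (no particular Twitter API is assumed here), the filtering step might look like this in Python. The tag list comes from the hashtags mentioned above; the sample tweets and the function names are made up for illustration.

```python
import re

# Precursor hashtags from the post, lower-cased for matching.
PRECURSOR_TAGS = {
    "unusual", "clouds", "behavior", "animals", "weather",
    "gasemissions", "cloudformations", "hightemp", "earthquakeprediction",
}

HASHTAG_RE = re.compile(r"#(\w+)")


def precursor_hashtags(text):
    """Return the set of precursor hashtags found in one tweet's text."""
    found = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    return found & PRECURSOR_TAGS


def filter_tweets(tweets, min_tags=1):
    """Keep tweets that mention at least `min_tags` distinct precursor hashtags."""
    return [t for t in tweets if len(precursor_hashtags(t)) >= min_tags]


# Hypothetical sample data, not real tweets.
sample = [
    "Strange #clouds over the valley and the dogs won't stop barking #animals",
    "Lovely #weather for a picnic",
    "New phone day! #gadgets",
]
matches = filter_tweets(sample, min_tags=2)
```

Asking for two or more precursor hashtags per tweet is one simple way to cut down noise from everyday tags like #weather. The matches would still need to be correlated with location and fault-line data before they mean anything.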


Pinned from QuakeHunters and USGS

Analysis of Structured Data – includes

  • satellite data from satellite services across the world, global meteorological data, data from field and laboratory observations
  • geophysical, geological, geochemical, mathematical and computational modeling of fault zones and seismic activity
  • atmospheric data, geologic information like crustal deformation

Correlated information – long-term probabilistic hazard assessments (shaking hazard maps, 30-year earthquake probability reports), foreshock probabilities, historical reports

Earthquake predictions can not only be used to prepare and ensure minimal loss of lives, but also be leveraged for risk and vulnerability assessments.

Now, are you wondering, like I am, why the severe 7.9 magnitude earthquake of Nepal was not forecast, given that the area was long expecting a magnitude 8 earthquake?

NASA has just put out that earthquake prediction is not possible. Yet, as proved by Terra Seismic in a couple of cases, it might just be possible!  So does the analytical mind give up on the possibilities of Big Data Analytics to make earthquake predictions?

UPDATE (1st May)

The USGS has put up on its website an Aftershock Forecast & Table (27th April)

“In the coming week, the USGS expects 3-14 M≥5 aftershocks of the magnitude 7.8 Nepal earthquake. Additionally, the USGS estimates that there is a 54% chance of a M≥6 aftershock, and a 7% chance of a M≥7 aftershock during this one-week period. … Based on general earthquake statistics, the expected number of M≥3 or 4 aftershocks can be estimated by multiplying the expected number of M≥5 aftershocks by 100 or 10, respectively.”
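The multiply-by-10 rule in the quote can be worked through in a few lines of Python. This is only a sketch of that rule of thumb, written as a tenfold increase in count for each unit drop in magnitude (a Gutenberg-Richter style scaling with a b-value of 1, which is an assumption here). The function name is hypothetical, and the code does not reproduce how the USGS arrives at its probabilities.

```python
def expected_aftershocks(n_m5, magnitude, b_value=1.0):
    """Scale an expected count of M>=5 aftershocks to another magnitude threshold.

    Uses N(M) = N(5) * 10 ** (b * (5 - M)); b = 1 gives the x10 / x100 rule.
    """
    return n_m5 * 10 ** (b_value * (5.0 - magnitude))


# The quoted forecast: 3 to 14 aftershocks of M>=5 in the coming week.
low, high = 3, 14
for m in (4, 3):
    lo_count = expected_aftershocks(low, m)
    hi_count = expected_aftershocks(high, m)
    print(f"M>={m}: about {lo_count:.0f} to {hi_count:.0f} aftershocks")
```

With the quoted range of 3 to 14, this gives roughly 30 to 140 aftershocks of M≥4 and 300 to 1400 of M≥3.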

So, the analytical rigour DOES extend to earthquake prediction. So what if Terra Seismic failed? It hasn’t stopped the USGS from making earthquake predictions or developing apps and systems that make statistical inferences!

Additional reading:

USGS – Earthquake Topics for education

USGS – Seismic Hazard Analysis tools

The ShakeAlert App endorsed by USGS

A list of free to download eBooks on Analytics, Data Mining and Big Data

The Complete Guide to Facebook Analytics

This is a great book to start with understanding how metrics are used for branding, using the success story of Facebook.

Blueprint for Success: Starting a Business Analyst Career

It gives an insight into the profession of a Business Analyst and walks you through the process of  launching your career in Business Analytics.

Multivariate Data Analysis

This edition is a simple applications-oriented introduction to multivariate analysis, for the non-statistician.

Data Structures and Algorithms

The book covers the fundamentals of data structure and algorithms design.

Theory and Applications for Advanced Text Mining

This is an introductory book to advanced text mining techniques.

No More Secrets with Big Data Analytics 

This book provides an overview of Big Data reality and how to unlock its potential.

Medical Statistics: A Guide to SPSS, Data Analysis and Critical Appraisal

A useful resource of statistical tools meant for researchers.

Data Warehousing and Data Mining 

The book is an introduction to data warehousing and data mining.

Modeling With Data

It explains how to execute computationally intensive analyses on very large data sets, create and debug statistical models, and how to run an analysis and check the results.

Data Mining and Analysis – Fundamental Concepts and Algorithms

The book lays the foundations of data mining and analysis, integrating related concepts from machine learning and statistics with an algorithmic perspective. It makes use of examples, covers core methods and cutting-edge research.

An Introduction to inbound Marketing Analytics

This highlights important inbound marketing metrics for data-driven decision making, for optimised marketing. It also tells you how to analyse the most important marketing channels.

Big Data Analytics and Social Business

The eBook explores data analytics and social business in the context of Big Data.

Gaussian Processes for Machine Learning

It describes the mathematical foundations and practical application of Gaussian processes, and how to apply Gaussian process methods to solve a range of problems.

How to make a career shift to Analytics

Are you thinking of making a career shift to analytics? Then perhaps you are wondering how to go about it. You may be asking yourself: How steep is the analytics learning curve? How long does it take to break into the analytics job scene? And more.

Making a career in analytics is not just about steep salaries, or being in the midst of a fast-paced exciting profession. It is about your passion for data, analysis and logical reasoning. It is also about domain knowledge, analytical tools and technical skills.

So if you are thinking of making a lateral shift to analytics, walk through these steps to know how to land your dream job!

Know the Analytics landscape

At the very outset, look around and learn more about the analytics industry. Understand what analytics is all about, why it is deployed, how it is used, and by which industries. Subscribe to various newsletters, check out Analytics websites and news, and get a general feel of the analytics hiring landscape. Who is hiring? What does the hirer look for? What skills and knowledge are in demand for mid to high positions? What is the salary trend?

If you  feel inspired by what you learned, then move on to find whether analytics is the right choice for you.

Find if you are ready

Once you have understood the Analytics industry and the requirements, it is time for some personal reflection.

Am I willing to give the time and effort to learn what is required?

Do I have the patience to wait through a period of about 6-12 months before landing the job I want?

Do I have the perseverance to put in a couple of years working my way through a new domain?

Can my work experience fit into the analytics industry?

If you tick all the above, then you have climbed the first rung to making a career transition to analytics.

Match your current experience and skills to analytics

Check out industries and application areas. Match them to your interests, learning abilities and work experience.

What job titles do I look for?  

Which industries offer a better scope in a career in analytics?

Does my experience make a difference in getting good jobs?

Does my current line of work use analytics?

What kind of analytics job roles would be right for me?

Identify where you can fit your skills and work experience to your advantage. Pick out suitable industry or sectors. Short-list job roles that best match your interests, industry experience and credentials.

Identify the gaps

Check out job sites.

What are the skill sets, educational and experience requirements listed? What tools are mandatory for job roles you have identified?

Ask yourself the following –

Do I have the necessary skills? What are the areas I need to brush up on (my high school statistics, for instance)? Which software and tools do I need to learn? What other qualifications do I need? Are there soft skills specific to the analytics ecosystem?

What books can I get my hands on? Which websites or blogs should I look up? How many hours do I need to devote to the learning process?

In other words, find the gaps. Identify the areas where you need to train or groom yourself for the changeover to analytics.

Get trained

This is perhaps the most important step for making a career transition to analytics.

Check out neighbourhood analytics training schools or online programmes. See what programmes are on offer. Match them to your needs, budget and time.

Ask yourself,

Which programme or study best fits my requirements?

Does the program strengthen the foundational aspects of analytics?

Does the course offer chances to work with real-world projects or intern with corporate houses?

What tools will I learn in the course that I cannot learn otherwise? Are there tools that I can learn on my own for free, using online resources?

Do I really need a domain based analytics programme, or a basic foundational course on data analytics?

Does a weekend programme suit me or one held in the evenings during the week?

Discuss with the counselor or trainer. Bounce your ideas off them. Talk about your plans to arrive at the right choice.

Enroll at the earliest and get started! We recommend taking one of the IVY Professional courses that give you a sound overall grooming for most analytics job roles.

Look out for that ‘dream job’

As you move along your analytics training curve, keep a look out for jobs on offer. Networking sites like LinkedIn and Analytic Bridge are great places to showcase your experience and skills. Follow appropriate Twitter accounts. Make the right noise and strike the right chord with the right people.

Check out job sites like Monster, Analytics Recruiting, Head Hunter, iCrunchData, Jobs in Data and Analytic Talent which feature mid to senior level career positions.

Make the pitch

Make sure to highlight your experience –  domain knowledge, technical qualification and experience,  programming skills,  consulting, research or model design, team management, projects handled – and knowledge of tools or software.

So go ahead and make that much longed-for mid-career transition to analytics! With the right approach and skill sets, you can actually command the kind of salary and job you are looking for.

Analytics Companies in Kolkata

This is a list of analytics companies with presence in Kolkata.

The specialisations are not exhaustive, as many analytics start-ups are a work-in-progress. If you have any additions or suggestions to make, please feel free to leave the company’s name as a comment below and we’ll add it to the list.



Big Data Analytics, Text Analytics, Social Media Analytics, Predictive Modelling, Data Science

Cognizant Analytics

Decision Sciences (Analytics)


Fractal Analytics (currently not operational in Kolkata)


Customer analytics, Operations, Big Data, AI and Machine Learning

Genpact Analytics 


Data analytics and Research, Financial Services analytics, Healthcare and Retail analytics delivery

Gyan Research and Analytics


Business consulting and research across various domains

HSBC Analytics


Banking, Risk

ICRA Techno Analytics Ltd


ICRA Technology Services


Business analytics

Ideal Analytics


Big data analytics, self-service BI, analytics on demand



Big Data & Analytics, Digital Marketing and CRM, Cloud

Quantta Analytics


Big Data, Predictive Analytics, Text Mining, Location Analytics (using GIS), SaaS

Sibia Analytics


Predictive modeling, decision support service, social media analytics, marketing analytics – CPG/FMCG, Retail, eCommerce, Telecom



Big Data Analytics

Wipro Analytics


Analytics deliveries in Big Data, Supply Chain, Operational


(updated 27.04.2015)

Top 5 terms in Ecommerce Analytics

  1. Users / Visitors


Definition: the number of people who visited the website within a given time. Also known as ‘unique visitors’, it counts each visitor only once in that period, however many times they return.

Calculation: Visitors are identified by cookies, each assigned a unique ID. Alternatively, IP addresses are also used.

The number of visitors is calculated for a day, week, month or stated time.

Desired Goals: The higher the number of unique visitors, the better.

Why are unique visitors measured?

This is used to check the audience, and understand the website’s influence or reach.

There are two methods of calculation, depending upon the report being generated:

  • The pre-calculated data method – used in reports with a single dimension of time frame, like Date, or Week in a particular year.
  • Data calculated on the fly method – used in custom reports over any dimension, like city, or browser. It is applied for computation of large data sets, with reference data in raw session tables.
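To make the counting concrete, here is a minimal Python sketch that derives unique visitors from a raw log of page views keyed by cookie ID. The log format, the cookie IDs and the function names are all hypothetical; real analytics tools keep far richer session tables.

```python
from collections import defaultdict
from datetime import date

# Hypothetical page-view log: (visit date, cookie ID).
hits = [
    (date(2015, 4, 1), "c-101"),
    (date(2015, 4, 1), "c-102"),
    (date(2015, 4, 1), "c-101"),  # same visitor returns on the same day
    (date(2015, 4, 2), "c-101"),
    (date(2015, 4, 2), "c-103"),
]


def unique_visitors_by_day(hits):
    """Count distinct cookie IDs seen on each day."""
    seen = defaultdict(set)
    for day, cookie_id in hits:
        seen[day].add(cookie_id)
    return {day: len(ids) for day, ids in seen.items()}


def unique_visitors(hits):
    """Count distinct cookie IDs over the whole log."""
    return len({cookie_id for _, cookie_id in hits})


daily = unique_visitors_by_day(hits)
total = unique_visitors(hits)
```

Here each day has 2 unique visitors, but the two-day total is 3, not 4, because visitor c-101 appears on both days. Unique counts cannot simply be added up across time frames, which is probably why custom reports have to go back to the raw session data.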
  2. Sessions (visits)

Definition: The length of time that a User spends on each visit to the website, for a given date or length of time.

Calculation: length of time in seconds

Desired Goals: the longer the sessions, the better the ability to engage the visitor for purchases (transactions) and repeat visits

Significance of Sessions

Average Session Duration and Percentage of new Sessions are yardsticks to understand the interest of visitors and how to engage their interest.

Average Session Duration = total duration of all sessions (in seconds) / number of sessions

Percentage of new Sessions – The percentage of new sessions or visits indicates how many visitors to the site are first-time visitors.
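Both yardsticks are easy to compute once sessions are logged. The Python sketch below assumes that average session duration is the total session time divided by the number of sessions, and that each session record carries a flag saying whether it was the visitor's first. The data and function names are invented for illustration.

```python
# Hypothetical sessions: (duration in seconds, is this the visitor's first session?)
sessions = [
    (120, True),
    (300, False),
    (45, True),
    (615, False),
    (60, False),
]


def average_session_duration(sessions):
    """Total session time in seconds divided by the number of sessions."""
    if not sessions:
        return 0.0
    return sum(duration for duration, _ in sessions) / len(sessions)


def percent_new_sessions(sessions):
    """Share of sessions that are first-time visits, as a percentage."""
    if not sessions:
        return 0.0
    new = sum(1 for _, is_new in sessions if is_new)
    return 100.0 * new / len(sessions)


avg_seconds = average_session_duration(sessions)
pct_new = percent_new_sessions(sessions)
```

For this sample the average session lasts 228 seconds and 40% of sessions are new.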

  3. Channel

Definition: The traffic source or origin of the visitors – direct links, email, PPC, search


  • help find important channels for revenue / acquisition
  • to find opportunities for incentives like loyalty programs or cashback 
  4. Transactions

Definition: Number of closed deals or orders placed and confirmed (after passing through the payment gateway). Where a purchase order is abandoned in the shopping cart, basket or booking points, it is not a transaction, but an abandonment.

Calculation: Each order placed by a visitor / customer is a transaction, allotted a unique Order ID.

  5. Churn

Definition: measures customers who do not return to the site. For example, a churn rate of 60% means 60 out of every 100 customers do not return to buy.

Here, customers refer to visitors who have actually made a purchase.

Calculation: calculated as a percentage of customers lost, or percentage of recurring sales lost

E.g. churn rate = number of customers lost last quarter / number of customers at the start of last quarter

Desired Goals: lower the churn rate, the better.

Significance of Churn rate

  • to influence customers to come back for repeat purchases
  • to calculate the Lifetime Value (LTV) per customer, a benchmark of the business policies
  • to analyse reasons for high churn rates and take corrective measures for a higher retention
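The churn calculation, and its link to Lifetime Value, can be sketched in Python as follows. The churn formula follows the example above. The LTV formula (average revenue per customer per period divided by the churn fraction) is a common simplification and an assumption here, not the only way to compute LTV. All numbers are invented.

```python
def churn_rate(customers_at_start, customers_lost):
    """Percentage of the starting customers who were lost during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


def simple_lifetime_value(revenue_per_customer_per_period, churn_pct):
    """Rough LTV: revenue per period times the expected number of periods (1 / churn)."""
    if churn_pct <= 0:
        raise ValueError("churn_pct must be positive")
    return revenue_per_customer_per_period / (churn_pct / 100.0)


# Hypothetical quarter: 500 customers at the start, 300 did not come back.
quarterly_churn = churn_rate(500, 300)
ltv = simple_lifetime_value(40.0, quarterly_churn)
```

With 300 of 500 customers lost, the churn rate is 60%. At 40 units of revenue per customer per quarter, the rough LTV works out to about 66.7, so lowering churn raises the value of every customer acquired.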

Why should Statisticians consider Analytics as a career?

Are you a statistician waiting to stretch your capabilities? Do you want to move beyond academics, biosciences and policy making? Then hold your breath.

A recent trend in analytics hiring reveals that “Statistics” is a desired degree for many of the job roles.


Statistical methods are applied to almost every area of business and industry. From developing analytics algorithms to constructing predictive models and studying complex datasets, statisticians have become critical to most analytics-based functions.

Statistical analysis is no longer just for political pollsters, clinical trials and policy makers, but intrinsic to the business process too. So if you want to get involved in the exciting world of business,  add an analytics certification to your Statistics bachelor’s / post grad. degree.

Hiring companies want ‘analytics savvy’  people with a statistics background.

So how do you know if you are ‘analytics savvy’?

  • You love working with data, especially huge noisy datasets.
  • Developing predictive models is your forte.
  • Mapping a business problem into the statistical world and communicating it in business terminology is easy.
  • Problem solving with the technical approach to “why” is one of your skills.
  • You have knowledge of database and programming.
  • You are familiar with the popular statistical software “R”, and / or SAS, SPSS.
  • Your strengths lie in the following – Logistic regression in a Big Data environment, GLM/regression theory, experimental design, cluster analysis, decision trees, machine learning applications.
  • You have useful niche skills like DOE, time series, survival analysis, or multivariate methods

So you have ticked most or all of the above. What now?

Learn more about possibilities.


  • Combining Statistics with an add-on course/ certificate in Data Analytics or Big Data Analytics increases your employability quotient.
  • You can leverage your interest in science, technology, or business to suit both your statistical knowledge and interest area.
  • Opt for jobs in any of the following verticals – finance, insurance, eCommerce, manufacturing, Government (census, population research, policy making), pharmaceutical, environmental science, economic analysis, medicine, education, health and social services, elections, and other areas of government and business.
  • Research in your chosen field of application, playing with various models to arrive at best-fit solutions.
  • Expect a lower barrier to job entry – your statistical degree is your strength.
  • Exercise options to work individually and/or as part of an interdisciplinary team.

A statistician’s toolbox is loaded with methodology, tools and approaches well suited to the business problems.

Here are some Job Roles that demand Statistics + Analytics skills:

Statistician roles

Bottomline: With Big Data having fueled one of the most hyper-growth niches of employment, a statistician armed with an analytics certification can look forward to excellent career prospects in his/her chosen field.

So go ahead and check out IVY for Analytics courses ideal for the Statistics student or professional.

The latest buzz in Indian Analytics industry – March 2015

Analytics maturity in India climbs new rungs of opportunity, as it gets tied to cricket, beauty pageants and news distribution. Personalisation knocks on your living room with customised ads on your TV set and tailor-made news in your local language on your smart phone.

It takes you back to this quote:

The ladder of success is best climbed by stepping on the rungs of opportunity – Ayn Rand.

Acquisitions, Partnerships & Expansions

# Bangalore-based regional news aggregator NewsHunt has acquired the mobile analytics company Vauntz. This will help it to publish finely tuned, targeted and personalised content to users in local languages. NewsHunt also plans to develop apps that will distribute personalised videos and audios with regional content.

# British-Australian mining corporation, Rio Tinto, has unveiled its Analytics Excellence Centre in Pune, to improve equipment productivity across its global operations.  A world-first for the mining industry!

New launches – start-ups, services, vendor products, apps

# Ahmedabad-based software company ElegantJ BI has launched version 4 of its BI suite, with easy-to-use API integration for 3rd party applications.

An easy-to-use API through web services and JavaScript embeds ElegantJ BI dashboards, reports, KPIs and graphs into other vendor applications or in-house applications, for fast, affordable implementation and mobile access.

# TCS has launched a smart phone application that allows users to track, analyse and visualise Twitter conversations. Devised for the oncoming UK General Election,  the ‘ElectUK’ app will engage voters, their representatives and political commentators, turning the  smart phone into a Big Data social media analytics tool.

# Gurgaon-based SilverPush has introduced an innovation for ad analytics, leveraging its cross-device expertise. It moves beyond the conventional impact measurement of TV commercials to retargeting the TV viewers on mobile.

“The company uses a sophisticated deterministic modelling to identify the multiple devices associated with a single user and maps his/ her demographics and behavioural property into a unique id. “

# Tableau Software is introducing six new features to its latest BI suite, making data visualisation easier. Data preparation, query performance, admin views and smarter map functionality are among the features introduced.

The Indian analytics landscape, with a head start in IT, is all set to rock across verticals and application scenarios. As start-ups continue to make news, analytics professionals can look forward to challenging and lucrative times ahead.

Why you need to understand Digital Analytics

Before I do a blog on the latest craze in the hiring landscape – ecommerce analytics, I’d like to cover digital analytics in its new avatar.

As this article in Forbes magazine says, 2014 was the Year of Digital Marketing Analytics, when the ‘digital’ took over the marketing landscape. According to analytics pundits, the trend spills over to 2015, with Digital Analytics continuing to dominate the analytics space.

We have already covered web analytics, and explained marketing analytics. So let’s examine digital analytics.

What is Digital Analytics?

Digital Analytics is defined as

“the collection, measurement, analysis and reporting of data for optimizing channel usage…”

The most popular definition is from Avinash Kaushik.

Digital analytics is the analysis of qualitative and quantitative data from your business and the competition to drive a continual improvement of the online experience that your customers and potential customers have which translates to your desired outcomes (both online and offline).

So what it does, is collect, measure, analyse and report data residing in digital media for optimum usage, answer questions and offer insights for business actions.

In the larger picture,

web analytics < digital analytics < marketing analytics < e-commerce analytics

Where is Digital Analytics used?

Why Digital Analytics?


Why do you need to equip yourself with Digital Analytics skills?

Today every activity, opinion or event is digitised. Social media, cell phones and the internet have greatly influenced consumer behaviour and the way businesses work. There is a seamless flow of information across the digital space, waiting to be harnessed for insights.

  • Every firm, commercial venture and store, has a digital presence – website, mobile apps, e-commerce, social media, content or comments on the web, and digital advertisements (video, TV and web).
  • Businesses are adopting a data-driven approach for maximizing ROI on the digital spend.
  • Social media, websites, cell phones, TV viewership and even store footfalls have become digitised points of data collection.
  • Digital analytics has become an integral part of every functional area – marketing, fraud detection, retail, social media, healthcare, banking and so on.
  • Organizations are moving their marketing budget to digital channels. Digital ad spending is growing the fastest, at 30%, and is expected to account for 9.51% of all ad spending in 2015. The e-commerce segment has seen the highest digital ad spends, growing at a CAGR of 59% since 2011, followed by the Telecom and FMCG segments.


  • 70 % of companies intend to increase investment in third-party social media services like Facebook.
  • Businesses like hotel booking and travel portals are tripling their spends on digital assets (investments in digital media management).

It is a ‘digital’ environment today.

So whether you are engaged in analysing the reason for the failure of a new product, or tracking repeat defaulters of a particular loan scheme, you will need to engage tools of digital analytics at some stage/s. It is no longer about digital marketing alone.

Bottomline – While web analytics may soon be passé, digital analytics is here to stay.  So it is time to hone your skills in digital analytics too.

10 Tips to face and ace an Analytics Interview

The way businesses work is fast changing. We now have telephone interviews, Skype sessions and of course, the conventional face-to-face. The entire process of getting interviewed starts with a telephone call or email from your institute or company. So at the very outset, gather information about the interview – the venue, time, whether you are expected to give tests or a presentation, any specific expectations, and so on.

With the above, you take the first step towards preparing yourself for the interview!

General, Behavioural (Soft skills) – How well you communicate, interact and present yourself

  1. First impressions do matter. So arrive before time. Dress appropriately. No strong perfumes. Walk in with a confident gait. And please, turn off your cell phone!
  2. Greet the members of the interview panel (“Good morning, Sirs and Madam”, “Good afternoon, Sirs”). Make eye contact with each member. Offer a firm handshake and a smile. Wait to be asked to sit.
  3. Talk positive. Avoid any negative phrases (“I am not sure”, “I don’t think so”, “I will not be able to”). Talk slowly and clearly, but confidently. Clear communication is very important.
  4. Understand the question before you answer. If you are not sure, you can ask politely for the question to be repeated (“I am sorry I could not get the question. Could you please repeat it?”), rather than making a wild guess.

Technical Assessment – How sound is your knowledge

  5. Be prepared for questions that test your basic statistical concepts, and how well you can apply them to different algorithms.
  6. Be prepared for questions that are solved using analytical tools like R and SAS, or whatever you have mentioned in your CV.
  7. Review the projects you have worked on – as a student or analyst. Be prepared to outline the analytical process, algorithm used, model validation / testing done, challenges faced and how they were resolved.

Skill Assessment / Situational Questions – How well you demonstrate your logical reasoning and business thinking skills

  8. Puzzle questions – Adopt a confident approach. Tackle the questions in a structured top-down manner, covering maximum aspects and situations. Ask questions. Call out assumptions if you are making them. Put it down on paper for a methodical answer in the shortest time.
  9. Case Studies (Real World situations) – Address the problem clearly. Describe your solution and the action you would take. Make sure to include appropriate analysis and technical skill application.

Talk aloud as you work through your answer. An ideal situation would be an interactive session. Explain your suppositions and the factors that you think may have an effect on your estimate.

  10. Presentation – The key here is to think on your feet, as you will most likely get a time limit. Outline the problem on paper immediately. Use figures to illustrate. Circle solutions. If asked to use the computer or whiteboard, transfer your ideas to the screen / board.

Focus on logic. Present a structured thought process. Cut out the fancy.

Bottomline: Interviewing methods and questions depend upon whether you are applying for an entry level / mid level position. The interviewing panel may also adopt new approaches that test you for your logical reasoning and out-of-the-box thinking. The technique here is to think fast (ace/play that winning swerve), present what you know (all the aces up your sleeve), as best you can.


Useful resources from IVY PROFESSIONAL SCHOOL:

How to land a great job in Analytics

Top 30 Common Interview Questions

Latest Trends in Analytics Hiring

‘Net’ that job

How good is your personal AQ (Analytics Quotient)

5 Skills you need to become an Analytic Professional

Use Social Networking Sites to get a job

Be Prepared

Why Data Dredging is trending

“If you torture the data long enough, it will confess to anything.”   — Ronald Coase, British economist

 If you thought DATA was only ‘mined’ and ‘extracted’ for analysis, take a look at this frequently used method of ‘data dredging’.

As we move over from traditional eyeballing of statistical data to dig deeper into machine based techniques, the entire process of DATA extraction gets more technique based.

One such DATA extraction practice is analysis of large volumes of data in the quest for ANY possible relationships. An example would be “fishing” in very large datasets to analyse crime clusters without understanding causation. Or say “snooping” into an App user’s habits for finding correlations. That is, combing data for patterns without pre-established hypotheses or objectives. This sounds absurd, but may actually throw up significant unseen relationships (what does the App user do at lunchtime when in the vicinity of Connaught Place, New Delhi?).

With the evolution of Big Data a fundamentally different practice of experimental design has evolved. Formerly, the project / questions asked would decide what data to collect, for analysis of the same. Now, the low cost of data storage has caused a rethink with all kinds of data being collected first and then searched for significant patterns.

This practice of “data dredging” differs from traditional Data Mining practices.

Data Dredging explained

Where the sample is not truly representative, where there is ‘confounding’ or ‘selection bias’, or where too many hypotheses are tested on a given dataset, some correlations may appear statistically significant at the .05 (5%) significance level even though there is no real effect between the variables. This is a typical case of “data dredging” with false positive findings, a result of looking at too many possible associations. One way to guard against errors of “data dredging” is being stringent with “significance” levels, moving to P<0.001 or beyond.
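A small simulation shows how false positives pile up when many hypotheses are tested. The Python sketch below is illustrative only: it compares pairs of groups drawn from the same distribution, so any “significant” difference is a false positive by construction. To stay within the standard library it uses a z-test that assumes a known unit variance, which is a simplification, and all names and numbers are invented.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample_a, sample_b):
    """p-value of a z-test for equal means, assuming unit variance and equal group sizes."""
    n = len(sample_a)
    z = (mean(sample_a) - mean(sample_b)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def null_p_values(n_tests, n_obs, rng):
    """Run n_tests comparisons where both groups come from the same distribution."""
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n_obs)]
        b = [rng.gauss(0, 1) for _ in range(n_obs)]
        p_values.append(two_sided_p(a, b))
    return p_values


rng = random.Random(42)  # fixed seed so the run is repeatable
p_values = null_p_values(n_tests=1000, n_obs=30, rng=rng)

loose = sum(p < 0.05 for p in p_values)    # "significant" at the 5% level
strict = sum(p < 0.001 for p in p_values)  # "significant" at the 0.1% level
print(f"False positives out of 1000 tests: {loose} at P<0.05, {strict} at P<0.001")
```

With 1000 tests and no real effect anywhere, around 50 results would be expected to pass P<0.05 purely by chance, against about one at P<0.001. That is the arithmetic behind tightening the significance level when many associations are examined.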

Applications of Data Dredging

  • Forensic Analysis
  • Market Basket Analysis
  • Risk Analysis
  • Fraud detection
  • Medical Science
  • Public Health
  • Clinical Research
  • Digital Analytics
  • Social Media

When does Data Dredging occur?

  • Failure to make adjustments for statistical effects of search in large models
  • When there is statistical bias, confounding or misrepresentation of  the P<0.05 significance test
  • When there is suboptimal model construction
  • When there is ‘Overfitting’ of data
  • When too many hypotheses are tested without proper statistical control
  • When there is ‘Oversearching’ of relationships between variables
  • Overestimation of model’s accuracy
  • When Data Mining technique is explicitly used to prove a particular pre-established point of view!

So the next time you read such research findings like “Teens who eat lots of chocolate tend to be slimmer” – take it with a pinch of salt. Better, look at it as a possible consequence of distorted “data dredging”!
