Common Statistical Models used in Predictive Analytics

In a previous blog post, we covered the use of predictive modeling techniques to forecast future outcomes. The techniques differ from one application to another, but a set of fundamental statistical techniques, mathematical algorithms and neural network systems underlies most predictive modeling.

Model Building for Forecasting

Statistical techniques and tools in use

  • Linear regression
  • Logistic regression
  • Cluster analysis
  • Analysis of variance (ANOVA)
  • Chi-squared test
  • Correlation
  • Factor analysis
  • Association rules
  • Decision trees
  • Time series
  • Experimental design
  • Bayesian theory – Naïve Bayes classifier
  • Sampling
  • Matrix operations
  • K-nearest neighbor algorithm (k-NN)
  • Pearson’s r

Commonly used Statistical models

Logistic Regression:

Logistic regression models the relationship between a categorical dependent (response) variable and one or more independent (explanatory) variables. It assesses how significant the relationship between the variables is, and estimates the probability (p) that event “1” occurs rather than event “0”. Where a good fit of the model is obtained, you can plug in the independent variable values for a new observation and predict whether the dependent value will be 0 or 1.
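To make the "plug in the values and predict 0 or 1" step concrete, here is a minimal sketch in plain Python. It assumes a tiny made-up data set and fits the coefficients by simple gradient descent; the function names (`fit_logistic`, `predict_probability`), the two features and the 0.5 cut-off are illustrative assumptions, and a statistics package would normally do the fitting.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, features):
    """Estimated probability that the outcome is 1 for one observation."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(z)

def fit_logistic(rows, labels, learning_rate=0.1, epochs=5000):
    """Fit the intercept and coefficients by gradient descent on the log-loss."""
    n_features = len(rows[0])
    intercept = 0.0
    coefficients = [0.0] * n_features
    for _ in range(epochs):
        grad_b = 0.0
        grad_w = [0.0] * n_features
        for features, y in zip(rows, labels):
            error = predict_probability(intercept, coefficients, features) - y
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        intercept -= learning_rate * grad_b / len(rows)
        coefficients = [w - learning_rate * g / len(rows)
                        for w, g in zip(coefficients, grad_w)]
    return intercept, coefficients

# Made-up observations: [debt-to-income ratio, late payments]; label 1 = event occurred.
rows = [[0.1, 0], [0.2, 0], [0.3, 1], [0.6, 2], [0.7, 3], [0.8, 4]]
labels = [0, 0, 0, 1, 1, 1]

intercept, coefficients = fit_logistic(rows, labels)

# Plug in the independent variable values for a new observation.
p_new = predict_probability(intercept, coefficients, [0.75, 3])
predicted_class = 1 if p_new >= 0.5 else 0  # 0.5 is an assumed cut-off
p_low = predict_probability(intercept, coefficients, [0.1, 0])
```

The model outputs a probability; turning it into a 0/1 prediction requires a cut-off, which in practice is chosen to suit the cost of each kind of error.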

Examples:

Banks – for building scorecards of customers applying for loans. The loan officer identifies characteristics that indicate the probability of loan default, and uses them to build a scorecard of good and bad credit risks. Data on past, current and potential customers is used to run a logistic regression model. The model is then used to classify potential customers who have applied for a loan as good or bad credit risks. This is binary logistic regression, as the dependent variable is dichotomous (loan default or no default).

Education institutions – An engineering college might estimate enrollments of new students in order to set cut-off marks and freeze admissions. A multiple logistic regression model takes into account Class 10, Class 12 and related AIJEE scores, distance from the college, demographic information including stream preferences, and historical enrollment data to calculate the probability of enrollment. The estimated model has to fit the data adequately for its results to be considered significant. The model can also be used to estimate how a single independent variable affects the likelihood of enrollment.
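One common way to express the effect of a single independent variable in a logistic regression is the odds ratio: the exponential of that variable's coefficient, holding the other variables fixed. The sketch below assumes two hypothetical coefficients purely for illustration; they are not taken from any real admissions model.

```python
import math

def odds_ratio(coefficient, change=1.0):
    """Factor by which the odds are multiplied when one variable rises by
    `change` units and all other variables stay fixed."""
    return math.exp(coefficient * change)

# Hypothetical fitted coefficients (per unit of each variable), for illustration only.
coefficients = {"class_12_score": 0.04, "distance_km": -0.02}

# Effect on the odds of 10 extra marks, and of living 25 km further away.
effect_of_10_marks = odds_ratio(coefficients["class_12_score"], change=10)
effect_of_25_km = odds_ratio(coefficients["distance_km"], change=25)
```

A value above 1 means the odds rise with the variable, and a value below 1 means they fall.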

Time Series:

A time series forecasting model predicts future values from previously observed (historical) values. The two main goals are to identify the phenomenon represented by the sequence of observations, and to forecast future values of the time series variable. The pattern in the observed data is identified, described and integrated with other data, and then extrapolated to predict future events.

Time Series predictive models are used to make forecasts where the temporal dimension is critical to the analysis. Typical application scenarios are demand prediction of a product during a particular month / period, estimation of inventory costs, forecast of train passengers for the next financial year, and so on.
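As a rough illustration of "identify the pattern, then extrapolate it", the sketch below fits a straight-line trend to past values by least squares and extends it forward. A linear trend is only one kind of pattern (real series often also have seasonality), and the demand figures and function names are made up for the example.

```python
def fit_linear_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return intercept, slope

def forecast(values, steps):
    """Extrapolate the fitted trend `steps` periods beyond the last observation."""
    intercept, slope = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + h) for h in range(steps)]

# Made-up monthly demand for a product over six months.
monthly_demand = [100, 110, 120, 130, 140, 150]

# Forecast demand for the next three months.
next_three_months = forecast(monthly_demand, 3)
```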

Clustering:

Clustering groups ‘like’ objects together, and the resulting clusters are used for predictive modeling. A model (typically a probability distribution) is hypothesized for each cluster, and the aim is to find the best fit of those models to the data. Clusters in customer behaviour may be used for predictive modeling, i.e. behavioural clustering, to predict the behaviour or buying patterns of customers. Clusters in product segmentation may be used to predict which categories of products customers are likely to buy. Algorithms automatically segment the objects on several variables to derive the profile, or ‘DNA’, of each cluster. This is then leveraged for predictive insights.

Cluster models are used to predict demand for products (a customer ordering baby clothes is likely to order diapers), brand preferences, the efficacy of a drug among a certain age group in clinical trials and stock market trends, to identify groups of car insurance policyholders with a higher average claim cost, and more.
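The automatic segmentation step can be sketched with k-means, a simple distance-based clustering algorithm. Note that k-means is swapped in here for brevity; it is not the model-based approach with a probability distribution per cluster. The customer data, the two features and the starting centroids are made-up assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))

def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(iterations):
        assignments = [nearest_centroid(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave a centroid in place if its cluster is empty
                centroids[k] = mean_point(members)
    return centroids, assignments

# Made-up customers: (orders per month, average basket value).
customers = [(1, 20), (2, 25), (1, 22), (9, 80), (10, 85), (8, 78)]

centroids, assignments = kmeans(
    customers, initial_centroids=[customers[0], customers[3]]
)

# A new customer is placed in the closest existing segment.
new_customer_cluster = nearest_centroid((9, 82), centroids)
```

Each centroid acts as a compact profile of its segment, and a new customer inherits the predictions made for the segment they fall into. In practice the variables would usually be scaled first so that no single one dominates the distance.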

Decision Trees:

This statistical technique is a tree-like predictive model of decisions and their possible consequences. Boolean tests on specific facts lead to general conclusions, with each decision point represented by a node. Rules trace the path from the root through the nodes until an action is derived. Problems are structured as a tree whose branches lead to end nodes, each representing a specific event or scenario, or its probability.

A basic decision tree model to predict how many people buy ice cream because they crave it, even if they don’t have extra money.
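A decision tree of this kind can be sketched as nested Boolean tests that are followed from the root to an end node. The tree below is a hypothetical reconstruction of the ice cream example, not the exact tree in the figure; the test names, the decisions and the sample people are all assumptions made for illustration.

```python
# Hypothetical tree: internal nodes hold a Boolean test, end nodes hold a decision.
tree = {
    "test": "has_extra_money",
    "yes": {"decision": "buys ice cream"},
    "no": {
        "test": "craves_ice_cream",
        "yes": {"decision": "buys ice cream"},
        "no": {"decision": "does not buy"},
    },
}

def decide(node, facts, path=()):
    """Follow the tests from the root to an end node.
    Returns the decision and the rule path that led to it."""
    if "decision" in node:
        return node["decision"], list(path)
    answer = "yes" if facts[node["test"]] else "no"
    step = f"{node['test']}={answer}"
    return decide(node[answer], facts, path + (step,))

# Made-up people to run through the tree.
people = [
    {"has_extra_money": True, "craves_ice_cream": False},
    {"has_extra_money": False, "craves_ice_cream": True},
    {"has_extra_money": False, "craves_ice_cream": False},
    {"has_extra_money": False, "craves_ice_cream": True},
]

results = [decide(tree, person) for person in people]

# Count those who buy because of a craving despite having no extra money.
craving_buyers = sum(
    1 for decision, path in results
    if decision == "buys ice cream" and "craves_ice_cream=yes" in path
)
```

Each returned path is one rule in the sense described above: the sequence of test outcomes from the root to the end node where the action is derived.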