 Top Data Science Interview Questions and Answers

Data science interviews can be tricky. As of 2021, many people aim to become data scientists, which leads to tough competition and difficult interview questions. It is an interesting subject, and critical business decisions depend on the insights a data scientist draws. The more critical those decisions are, the more the organization depends on the data scientist and the more that role is worth. Ivy Professional School has been a pioneer in developing the careers of aspiring data scientists and analysts since 2008.
Here we have listed 15 fundamental interview questions that recruiters use to assess a candidate’s potential. Interviewees can use these questions to broaden their grasp of the fundamentals required for data science. We recommend reading to the end, as there is a bonus question related to one of the 15 questions.

1) How do you select ‘k’ in k-means?

The elbow method is one way to select k for k-means clustering. Run k-means on the data set for a range of values of ‘k’, the number of clusters, and compute the within-cluster sum of squares (WSS) for each value.
WSS is the sum of the squared distances between each member of a cluster and its centroid. Plotted against k, WSS drops sharply at first and then levels off; the value of k at the bend, or elbow, of the curve is a reasonable choice.
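
To make the elbow idea concrete, here is a minimal sketch in plain Python. The function names (sq_dist, kmeans_wss), the toy POINTS data, and the choice to use the first k points as starting centroids are assumptions made for illustration; real libraries use better initialization. The sketch computes WSS for k = 1 to 5 so the drop-off can be inspected.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wss(points, k, max_iter=100):
    """Run a basic k-means and return the within-cluster sum of squares."""
    # Assumption for this sketch: start from the first k points.
    centroids = list(points[:k])
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so we are done
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))


# Toy data: three well-separated groups of four points each.
POINTS = [
    (1, 1), (8, 8), (1, 8), (1, 2), (8, 9), (1, 9),
    (2, 1), (9, 8), (2, 8), (2, 2), (9, 9), (2, 9),
]

wss_by_k = {k: kmeans_wss(POINTS, k) for k in range(1, 6)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

On this toy data, WSS falls steeply up to k = 3 and changes little afterwards, so k = 3 is the elbow.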

2) What is the significance of the p-value in hypothesis testing?

The significance of the p-value is as follows:
• When the p-value is ≤ 0.05: the result provides strong evidence against the null hypothesis, which means one can reject the null hypothesis.
• When the p-value is > 0.05: the result provides only weak evidence against the null hypothesis, so one fails to reject it. This is not the same as proving the null hypothesis true.
• When the p-value is right at the 0.05 cutoff: the result is marginal, which means that it could go either way.
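
As an illustration of these cutoffs, the sketch below computes an exact two-sided p-value for a coin-flip experiment, where the null hypothesis is that the coin is fair. The coin example, the function name two_sided_binomial_p_value, and the ALPHA constant are assumptions for this sketch and do not come from the original answer.

```python
from math import comb

ALPHA = 0.05  # the conventional cutoff discussed above


def two_sided_binomial_p_value(heads, flips):
    """Exact two-sided p-value for H0: the coin is fair (p = 0.5).

    It adds up the probability of every outcome at least as far from
    the expected number of heads as the observed one.
    """
    expected = flips / 2
    deviation = abs(heads - expected)
    extreme = sum(
        comb(flips, k) for k in range(flips + 1) if abs(k - expected) >= deviation
    )
    return extreme / 2 ** flips


for heads in (50, 60, 65):
    p = two_sided_binomial_p_value(heads, 100)
    decision = "reject H0" if p <= ALPHA else "fail to reject H0"
    print(heads, round(p, 4), decision)
```

With 100 flips, 60 heads gives a p-value slightly above 0.05 (a marginal case), while 65 heads gives a p-value well below it.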

3) How to build a random forest model?

A random forest is built from a number of decision trees. You split the data into different random samples and build a decision tree on each sample. The random forest then combines the predictions of all those trees.
Below are the steps to build a random forest model:
• Select ‘k’ features at random from a total of ‘m’ features, where k << m
• Among the ‘k’ features, calculate the node D using the best split point
• Split the node into daughter nodes using the best split
• Repeat steps two and three until the leaf nodes are finalized
• Build the forest by repeating steps one to four ‘n’ times to create ‘n’ trees
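
The steps above can be sketched in plain Python. To keep the example short, each "tree" here is a decision stump with a single split rather than a fully grown tree, and each tree is trained on a bootstrap sample (rows drawn with replacement); both are simplifying assumptions. The function names (best_stump, build_forest, predict) and the toy data are hypothetical.

```python
import random
from collections import Counter


def best_stump(rows, labels, feature_ids):
    """Steps 2-3: pick the single split with the fewest misclassifications."""
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive values.
        for a, b in zip(values, values[1:]):
            threshold = (a + b) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no split possible: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]


def build_forest(rows, labels, n_trees=25, k=1, seed=0):
    """Steps 1 and 5: repeat random feature selection and tree building."""
    rng = random.Random(seed)
    m = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        feature_ids = rng.sample(range(m), k)  # k random features out of m
        forest.append(best_stump(sample_rows, sample_labels, feature_ids))
    return forest


def predict(forest, row):
    """Combine the trees by majority vote."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy data: three features, two classes.
ROWS = [
    [1.0, 10.0, 0.10], [2.0, 11.0, 0.20], [1.5, 12.0, 0.30], [2.5, 10.5, 0.15],
    [8.0, 20.0, 0.90], [9.0, 21.0, 0.80], [8.5, 22.0, 0.95], [9.5, 20.5, 0.85],
]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

forest = build_forest(ROWS, LABELS, n_trees=25, k=1)
print(predict(forest, [1.2, 10.2, 0.12]), predict(forest, [9.0, 21.0, 0.90]))
```

A production model would grow each tree to full depth and use a purity measure such as Gini impurity to choose splits; the voting step stays the same.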

4) How can one avoid overfitting the model?

Overfitting means the model fits the training data too closely, including its noise, so it captures the quirks of a small sample rather than the general pattern and performs poorly on new data. There are three main methods to avoid overfitting:
• Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data
• Use cross-validation techniques, such as k-fold cross-validation
• Use regularization techniques, such as LASSO, that penalize certain model parameters if they are likely to cause overfitting
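
To show what k-fold cross-validation does mechanically, here is a minimal sketch that splits sample indices into k folds, with each fold serving once as the test set. The function name k_fold_indices is hypothetical, and the sketch skips shuffling, which is usually done first in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

The model is trained on each train split and scored on the matching test split; averaging the k scores gives an estimate of performance on unseen data.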

5) Define Machine Learning?

Machine learning is the ability of a system to learn from data on its own, without being explicitly programmed. It is an application of artificial intelligence.

6) What is the difference between univariate, bivariate, and multivariate analysis?

• Univariate data contains only one variable, and the analysis describes that variable alone.
• Bivariate data involves two different variables, and the analysis determines the relationship between them.
• Multivariate data involves three or more variables. It may contain more than one dependent variable, but the purpose is the same as in bivariate analysis: to determine the relationships among the variables.

7) How to calculate the Euclidean distance in Python?

Let’s say A1 = [0,6] and A2 = [4,3]. Then the Euclidean distance calculation goes like this:
from math import sqrt
ED = sqrt( (A1[0]-A2[0])**2 + (A1[1]-A2[1])**2 )  # ED = 5.0
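
For points with any number of coordinates, the same calculation can be wrapped in a small function that squares the difference in each coordinate, sums the results, and takes the square root. The function name euclidean_distance is chosen for this sketch; math.dist is the standard-library equivalent in Python 3.8 and later.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


A1 = [0, 6]
A2 = [4, 3]
print(euclidean_distance(A1, A2))  # 5.0, since sqrt(16 + 9) = 5
print(math.dist(A1, A2))           # 5.0, the built-in equivalent
```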

8) What is the meaning of dimensionality reduction and what are its benefits?

Dimensionality reduction means converting a data set with a large number of dimensions (fields) into one with fewer dimensions. This should not reduce the relevance of the data; it should only make it more concise. It reduces computation time and storage requirements.

9) Can I do Machine Learning using MS-Excel?

Yes. MS Excel can be used for machine learning on small data sets; for example, its Analysis ToolPak add-in includes a regression tool.

10) How does one treat outliers?

There are various ways to treat outliers:
• You can drop an outlier if it is a null or garbage value. By garbage value, we mean a value that has no relevance. For example, a string in the place of a numeric value is garbage and can safely be removed.
• You can also remove extreme values, which are outliers.
If you do not want to drop outliers, you can treat them instead:
• Choose a different model. Sometimes a non-linear model fits data points that a linear model treats as outliers.
• Normalize the data. This way, the extreme data points are pulled into a similar range.
• Use algorithms that are less affected by outliers, such as random forests.
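
The first two options can be sketched as follows. The answer above does not say how to decide that a value is extreme, so this sketch assumes the common 1.5 × IQR rule; that rule, the function names (drop_garbage, split_outliers), and the sample data are assumptions for illustration.

```python
from statistics import quantiles


def drop_garbage(values):
    """Keep only entries that can be read as numbers."""
    clean = []
    for v in values:
        try:
            clean.append(float(v))
        except (TypeError, ValueError):
            continue  # e.g. the string "n/a" or a missing value (None)
    return clean


def split_outliers(values, factor=1.5):
    """Separate values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


raw = [10, 12, "n/a", 11, 13, None, 12, 11, 10, 95]
clean = drop_garbage(raw)
kept, outliers = split_outliers(clean)
print(outliers)  # [95.0]
```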

11) Between Python and R, which one would you pick for text analytics, and why?

Python is the better choice for text analytics:
• It has the pandas library, which offers easy-to-use data structures and high-performance data analysis tools.
• It performs faster in text analytics.

R is better suited to machine learning than to text analysis. Hence, Python is the better pick.

12) What is a confusion matrix and how can you calculate the accuracy using it?

A confusion matrix is a simple table used to describe the performance of a classification model on test data for which the true values are known. The matrix compares the actual target values with those predicted by the machine learning model. The formula for accuracy is:
Accuracy = (True Positive + True Negative) / Total Observations
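
As a worked example of the formula, the sketch below counts the four cells of a binary confusion matrix from lists of actual and predicted labels and then computes accuracy. The function names (confusion_counts, accuracy) and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def accuracy(actual, predicted, positive=1):
    """(True Positive + True Negative) / Total Observations."""
    tp, tn, fp, fn = confusion_counts(actual, predicted, positive)
    return (tp + tn) / (tp + tn + fp + fn)


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))  # (3, 3, 1, 1)
print(accuracy(actual, predicted))          # 0.75
```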

13) What is the difference between supervised and unsupervised machine learning?

Supervised machine learning requires training on labelled data, whereas unsupervised machine learning does not require labelled data.

14) What are the different kernel functions in SVM?

There are four common types of kernels in SVM:
• Linear kernel: It is used when the data is linearly separable, i.e. the classes can be separated by a single straight line (or hyperplane).
• Polynomial kernel: It is a kernel function commonly used with support vector machines (SVMs) and other kernelized models. It represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, which allows non-linear models to be learned.
• Radial basis kernel: It is a kernel function used in machine learning to find a non-linear classifier or regression line.
• Sigmoid kernel: It comes from the neural network field, where the bipolar sigmoid function is often used as an activation function for artificial neurons. An SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network.
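
Each of these kernels is a function that takes two vectors and returns a similarity score. The sketch below writes out the usual textbook forms in plain Python. The parameter names gamma, coef0, and degree and their default values are assumptions for illustration; libraries differ in their defaults.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


u, v = [1, 2], [3, 4]
print(linear_kernel(u, v))                # 11
print(polynomial_kernel(u, v, degree=2))  # 144.0, i.e. (11 + 1) ** 2
print(rbf_kernel(u, u))                   # 1.0 for identical vectors
print(sigmoid_kernel(u, v))
```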

15) What is the meaning of variance?

Variance is the spread between numbers in a data set. It is the average of the squared differences from the mean. In modelling, high variance causes error when the model learns noise in the training data and then performs badly on new data; the model becomes highly sensitive to the training set and overfits.
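
The definition "average of the squared differences from the mean" translates directly into code. The sketch below computes this population variance (dividing by N; the sample variance divides by N - 1 instead) and checks it against statistics.pvariance from the standard library. The function name variance and the sample data are chosen for illustration.

```python
from statistics import pvariance


def variance(values):
    """Average of the squared differences from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))   # 4.0 (mean is 5, squared differences sum to 32)
print(pvariance(data))  # same value from the standard library
```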
It’s time for our bonus interview question now. This question is not about technical knowledge of data science; rather, it addresses the basic requirement a recruiter looks for in a candidate. It is important for a data scientist, data analyst, or decision scientist to understand the exact requirement before analyzing the data. Doing so improves their understanding and helps them deliver on time.

Bonus Question – What is the importance of Analysis of Data?

Data is everywhere, whether in structured or unstructured form. Yet stand-alone data makes no sense unless it is measured, managed, or optimized. To yield valuable insights, a data set has to be analyzed using a clearly defined methodology, strategy, and business goals. Data science, which includes analytics (quantitative analysis of data), big data management and reporting, and data structures and algorithms, is the scientific process of transforming data into insight for making better decisions.

Data Science is the future

At Ivy, we have always aimed at preparing our students for a fast-paced world. In this world of data analytics, we want people to learn more and excel. Data science is not just changing the world, it is defining it. Answering questions like these well will boost the recruiter’s confidence in the candidate. It will also keep the recruiter abreast of the candidate’s understanding and job fitness. Responsibility is what it takes for a data scientist to play the role of a business analyst and take the entire business to another level.