Top 10 Data Science and Analytics Interview Questions

Data science, or data-driven decision making, is an interdisciplinary field that uses scientific methods, processes and systems to extract knowledge from data in its various forms. The ultimate goal of a data scientist is to make decisions grounded in data-driven knowledge. Data science is gaining prominence with each passing day and is creating plenty of opportunities for those interested in pursuing a career as a data scientist. Analytics Insights brings you the top 10 Data Science and Analytics interview questions for a rewarding career in data science.

1. What is Data Science? List the differences between Supervised and Unsupervised Learning.

Answer: Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science brings together the concepts of data mining, machine learning and big data.

Supervised and unsupervised learning are two machine learning techniques used in different scenarios and with different datasets. Supervised learning is a machine learning method in which models are trained using labelled data. The model needs to find a mapping function that maps the input variable (X) to the output variable (Y). This form of learning needs supervision to train the model, much as a student learns in the presence of a teacher. Supervised learning can be used for two types of problems: Classification and Regression.

Example: Suppose there is a set of images of different types of animals. The task of the supervised learning model is to identify the animals and classify them accordingly. To do this, data scientists give the model the input data as well as the expected output for it, which means they train the model on the shape, size and colour of each animal together with its label. Once training is completed, the model is tested on a new set of animals. The model identifies each animal and predicts the output using a suitable algorithm.
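The animal example can be sketched with a nearest-neighbour classifier, one of the simplest supervised algorithms. This is only an illustration: the features (height and weight), the numbers and the function name `predict` are hypothetical, and a real project would more likely use a library such as Scikit-learn.

```python
import math

# Hypothetical labelled training data: (height_cm, weight_kg) -> animal label.
TRAINING_DATA = [
    ((25.0, 4.0), "cat"),
    ((23.0, 5.0), "cat"),
    ((60.0, 30.0), "dog"),
    ((55.0, 25.0), "dog"),
    ((160.0, 500.0), "horse"),
    ((150.0, 450.0), "horse"),
]


def predict(features, training_data=TRAINING_DATA):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(
        training_data,
        key=lambda example: math.dist(example[0], features),
    )
    return nearest_label


# New, unlabelled animals that the model has not seen before.
new_animals = [(24.0, 4.5), (58.0, 28.0), (155.0, 480.0)]
predictions = [predict(animal) for animal in new_animals]
print(predictions)
```

The labels in `TRAINING_DATA` are the "supervision": without them the model would have nothing to map the inputs to.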

Unsupervised learning is another machine learning method, in which patterns are inferred from unlabelled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision. Instead, it finds patterns in the data on its own.

The two types of problems associated with unsupervised learning are Clustering and Association.

Example: To understand unsupervised learning, we will reuse the example given above. Unlike supervised learning, here we do not provide any supervision to the model. Instead, we just provide the input dataset and allow the model to find the patterns in the data. With the help of a suitable algorithm, the model trains itself and divides the animals into different groups according to the features they have most in common.
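Grouping unlabelled animals can be sketched with k-means clustering, a common clustering algorithm. The data points, the helper names and the hand-picked starting centroids below are assumptions for illustration; real implementations usually choose starting centroids randomly.

```python
import math

# Hypothetical unlabelled data: (height_cm, weight_kg) with no animal names.
POINTS = [
    (25.0, 4.0), (23.0, 5.0),
    (60.0, 30.0), (55.0, 25.0),
    (160.0, 500.0), (150.0, 450.0),
]


def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(point, centroids[i]),
        )
        clusters[nearest].append(point)
    return clusters


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # Keep the old centroid if a cluster ends up empty.
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return assign(points, centroids), centroids


# Starting centroids picked by hand for this sketch.
clusters, centroids = k_means(POINTS, [POINTS[0], POINTS[2], POINTS[4]])
for cluster in clusters:
    print(cluster)
```

The algorithm never sees a label; it only discovers that the points fall into three groups of similar size and weight.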

2. What is linear regression?

Answer: Linear regression is a supervised learning algorithm that helps to establish the linear relationship between two variables. One is the predictor, or independent variable, and the other is the response, or dependent variable. Linear regression aims to understand how the dependent variable changes with respect to the independent variable. If there is only one independent variable, it is called simple linear regression; if there is more than one independent variable, it is known as multiple linear regression.
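As a minimal sketch, simple linear regression can be fitted with the ordinary least squares formulas: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The function name and the data below are hypothetical, and the data is deliberately made to lie exactly on a line.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: independent variable (hours studied) and
# dependent variable (exam score), generated from score = 2 * hours + 50.
hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 54, 56, 58, 60]

slope, intercept = fit_simple_linear_regression(hours_studied, exam_scores)
predicted_score = slope * 6 + intercept  # prediction for 6 hours of study
print(slope, intercept, predicted_score)
```

The slope answers the interview question directly: it is the change in the dependent variable for each one-unit change in the independent variable.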

3. Between Python and R, which is best suited for text analytics?

Answer: Python is generally the better option because of libraries such as Pandas, which provides easy-to-use data structures and high-performance data analysis tools. R is more suitable for machine learning than for text analysis alone. Python also tends to perform faster for most types of text analytics.
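A basic text analytics task, counting word frequencies, needs only the Python standard library. The sample text and the function name below are made up for illustration, and the tokenisation rule (lowercase letters and apostrophes) is a simplifying assumption.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical customer review text.
reviews = "Great product. Great price, but the delivery was slow. Slow delivery!"
counts = word_frequencies(reviews)
print(counts.most_common(3))
```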

4. Explain the SVM machine learning algorithm

Answer: SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both Regression and Classification. SVM plots each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. SVM then uses hyperplanes to separate the different classes, based on the provided kernel function.
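The idea of a separating hyperplane can be sketched with a linear SVM trained by sub-gradient descent on the hinge loss. This is a simplified illustration that covers only the linear kernel; the data, the function names and the hyperparameter values are all assumptions, and production code would normally rely on an existing library.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(points, labels, learning_rate=0.01,
                     regularization=0.01, epochs=2000):
    """Find weights and bias of a hyperplane separating labels -1 and +1."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            if y * (dot(weights, x) + bias) < 1:
                # Point is misclassified or inside the margin:
                # move the hyperplane away from it.
                weights = [
                    w - learning_rate * (regularization * w - y * xi)
                    for w, xi in zip(weights, x)
                ]
                bias += learning_rate * y
            else:
                # Point is safely classified: only shrink the weights,
                # which corresponds to widening the margin.
                weights = [w - learning_rate * regularization * w for w in weights]
    return weights, bias


def classify(point, weights, bias):
    """Return +1 or -1 depending on which side of the hyperplane the point lies."""
    return 1 if dot(weights, point) + bias >= 0 else -1


# Hypothetical two-feature data with two linearly separable classes.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

weights, bias = train_linear_svm(points, labels)
print(classify((0.0, 0.0), weights, bias), classify((6.0, 6.0), weights, bias))
```

With two features the hyperplane is a line; with n features the same code finds an (n-1)-dimensional hyperplane.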

5. What are outlier values and how to treat them?

Answer: Outlier values, or simply outliers, are data points that don’t belong to a certain population. An outlier is an abnormal observation that differs greatly from the other values in the set.

Outlier values can be identified using univariate analysis or some other graphical analysis method. A few outlier values can be assessed individually, but a large set of outlier values requires substituting them with either the 99th or the 1st percentile values.

There are two popular ways of treating outlier values:

• To change the value so that it can be brought within a range

• To simply remove the value
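Both treatments can be sketched in a few lines. The sketch below assumes the common interquartile range (IQR) rule for deciding what counts as an outlier, which is one convention among several and is not prescribed above; the data and function names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits; values outside them are treated as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values):
    """Treatment 1: bring each outlier back within the allowed range."""
    low, high = iqr_bounds(values)
    return [min(max(value, low), high) for value in values]


def remove_outliers(values):
    """Treatment 2: drop the outliers altogether."""
    low, high = iqr_bounds(values)
    return [value for value in values if low <= value <= high]


# Hypothetical ages with one implausible entry.
ages = [22, 25, 27, 29, 30, 31, 33, 35, 120]
print(iqr_bounds(ages))
print(cap_outliers(ages))
print(remove_outliers(ages))
```

Capping keeps the row, which matters when the other columns of that record are still useful; removal is simpler but shrinks the dataset.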

6. What are the various steps involved in an analytics project?

Answer: The following are the steps involved in an analytics project:

• Understanding the business problem

• Exploring the data and becoming familiar with it

• Preparing the data for modelling by means of detecting outlier values, transforming variables, treating missing values, et cetera

• Running the model and analysing the result for making appropriate changes or modifications to the model (an iterative step that repeats until the best possible outcome is gained)

• Validating the model using a new dataset

• Implementing the model and tracking the results to analyse its performance

7. Explain Deep Learning

Answer: Deep Learning is a branch of machine learning based on artificial neural networks with many layers; convolutional neural networks (CNN) are one widely used architecture. Deep learning has a wide array of uses, ranging from social network filtering to medical image analysis and speech recognition. Caffe, Chainer, Keras, Microsoft Cognitive Toolkit, PyTorch, and TensorFlow are some of the most popular Deep Learning frameworks as of today.

Although Deep Learning has been present for a long time, it has gained worldwide acclaim only recently, for two main reasons:
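What makes a network "deep" is that layers of simple units feed into one another. The toy sketch below runs a forward pass through a two-layer network that computes XOR, a function a single layer of this kind cannot represent. The weights are picked by hand purely for illustration; real deep learning frameworks learn the weights from data and use differentiable activation functions rather than a hard step.

```python
def step(value):
    """Simplest possible activation: fire (1) if the input is positive."""
    return 1 if value > 0 else 0


def layer(inputs, weights, biases):
    """Compute one layer: a weighted sum per unit, followed by the activation."""
    return [
        step(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


# Hand-picked (not learned) parameters.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]  # unit 1 acts as OR, unit 2 as AND
HIDDEN_BIASES = [-0.5, -1.5]
OUTPUT_WEIGHTS = [[1.0, -1.0]]             # OR and not AND gives XOR
OUTPUT_BIASES = [-0.5]


def xor_network(x1, x2):
    """Pass the inputs through the hidden layer, then the output layer."""
    hidden = layer([x1, x2], HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]


results = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
print(results)
```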

• An increase in the amount of data generation via various sources

• The growth in hardware resources required for running Deep Learning models

8. What are the skills required as a Python specialist Data Scientist?

Answer:

• Expertise in Pandas DataFrames, Scikit-learn, and N-dimensional NumPy arrays.

• Skill in applying element-wise vector and matrix operations on NumPy arrays.

• Understanding of built-in data types, including tuples, sets, dictionaries, and various others.

• Familiarity with the Anaconda distribution and the Conda package manager.

• Ability to write efficient list comprehensions and small, clean functions, and to avoid traditional for loops where possible.

• Knowledge of Python scripting and of how to optimize bottlenecks.
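A few of these skills can be shown side by side in plain Python. The sensor readings below are invented for illustration; the point is the contrast between a traditional for loop and the equivalent comprehension, and the use of tuples, sets and dictionaries.

```python
# Each reading is a tuple: (sensor name, temperature in Celsius).
readings = [
    ("sensor_a", 21.5),
    ("sensor_b", 19.0),
    ("sensor_a", 22.5),
    ("sensor_c", 30.0),
]

# Traditional for loop: collect the names of sensors reporting above 20 degrees.
warm_loop = []
for name, temperature in readings:
    if temperature > 20:
        warm_loop.append(name)

# The same logic as a list comprehension.
warm = [name for name, temperature in readings if temperature > 20]

# A set removes duplicates; a dict comprehension keeps the last reading per sensor.
unique_warm = set(warm)
latest = {name: temperature for name, temperature in readings}

print(warm, unique_warm, latest)
```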

9. Are there differences between Deep Learning and Machine Learning?

Answer: Yes, there are differences between Deep Learning and Machine Learning. They are as follows:

Deep Learning	Machine Learning
It is a subcomponent of machine learning concerned with algorithms inspired by the structure and function of the human brain, called Artificial Neural Networks.	It gives computers the ability to learn without being explicitly programmed.
It is one of the components of machine learning.	It includes supervised, unsupervised, and reinforcement machine learning processes, with Deep Learning as one of its components.

10. Why is TensorFlow considered a high priority in learning Data Science?

Answer: TensorFlow is considered a high priority in learning Data Science because it supports programming languages such as C++ and Python. This lets various data science processes compile and complete faster, within the stipulated time frame. It is often described as faster than the Keras and Torch libraries, although relative speed depends on the model and hardware. TensorFlow also supports computing devices including the CPU and GPU for faster input, editing, and analysis of data.