In this blog, we cover the major data analyst/data scientist interview questions and answers. Demand for data analysts keeps growing as technology advances. If you are looking to hire a data analyst, here are the top data analyst/data science interview questions, with answers, to ask during an interview.

**1. What are the top responsibilities of a Data Analyst?**

A data analyst's responsibilities vary by organization, but most roles share a common core that keeps the tasks and processes of the business running smoothly.

Responsibilities of a data analyst may include:

• Understanding the data structures and sources relevant to the business;

• Extracting data from these sources in a timely and efficient manner;

• Identifying, evaluating and implementing external services and tools to support data validation and cleansing;

• Developing and supporting the various reporting processes of the business;

• Auditing data and resolving any associated business issues for clients;

• Ensuring database security by developing user access levels;

• Analyzing, identifying and interpreting trends or patterns in complex data sets, and triggering alerts for the business teams;

• Evaluating historical data and making forecasts to grow the business;

• Developing and validating predictive models to improve business processes and identify key growth strategies.

**2. What are the key skills required for a data scientist?**

• Mathematics/Statistics knowledge: A data scientist should be able to work with statistical concepts seamlessly. Without a good grasp of statistics, a data scientist will struggle even with basics such as cleaning and manipulating data.

• Programming skills: A data scientist should be familiar with scripting languages (MATLAB, Python), spreadsheets (Excel), statistical packages (SAS, R, SPSS), and query languages (SQL, HiveQL, Pig Latin). Other useful computer skills include big data tools (Spark, Hive), web programming (JavaScript), data formats such as XML, and so on.

• Logical Deduction: This is a skill that comes with experience. The data scientist should be able to immediately identify anomalies and be able to draw out strategies from trends. Without this skill, a data scientist is not able to add value to the business.

• Besides these skills, domain knowledge is increasingly becoming a requirement for a data scientist. Example: Credit Risk, supply chain management, etc.

• Attention to detail, decision making, problem-solving and communication are some of the soft skills that a data scientist must develop.

**3. Summarize the various steps in an analytics project**

• Defining the objective function;

• Identifying key sources of data required for the analysis;

• Data preparation & cleaning;

• Data modelling

• Model Validation

• Implementation and tracking (deployment and monitoring the results)

**4. Define Data Cleansing (data cleaning)**

Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inconsistent or irrelevant parts of the data and then replacing, modifying, or deleting the dirty data. In model development, data cleaning also means identifying anomalies that cannot be represented consistently by one model. Example: for income estimation models, very high income values that are not consistent with the rest of the data should be either removed or capped at a maximum limit. The aim is to improve the quality of the data.
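
The income example can be sketched in a few lines of Python. This is only an illustration: the income figures are made up, and capping at the 90th percentile is an assumed rule rather than a standard one.

```python
def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q an integer in 0..100."""
    ordered = sorted(values)
    # Ceiling division keeps the rank computation in integer arithmetic.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]


def cap_values(values, cap):
    """Return a copy of values with anything above cap replaced by cap."""
    return [min(v, cap) for v in values]


# Hypothetical yearly incomes; the last one is inconsistent with the rest.
incomes = [32_000, 41_000, 38_500, 45_000, 52_000,
           39_000, 47_500, 36_000, 43_000, 2_500_000]

cap = percentile(incomes, 90)       # assumed capping rule
capped = cap_values(incomes, cap)
```

Whether to cap or to remove such records depends on the model and on how the extreme values arose, so the threshold should be agreed with the business rather than hard-coded.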

**5. What are the best practices for data cleaning?**

Best practices for data cleaning include:

• Understanding the range (min./max.), mean and median, and plotting the distribution to compare it against a normal curve;

• Identifying outliers in the data and treating them;

• Missing value treatment;
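
As a small sketch of the first and last practices, the snippet below summarizes a column and then fills its gaps. The `ages` list is invented, and filling with the median is just one possible treatment, chosen here because the median is less sensitive to extreme values than the mean.

```python
import statistics

# Hypothetical column with two missing entries and one suspicious value.
ages = [23, 25, None, 31, 27, None, 29, 120]

observed = [a for a in ages if a is not None]
summary = {
    "min": min(observed),
    "max": max(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
}

# Missing value treatment: replace each gap with the median of the observed values.
filled = [summary["median"] if a is None else a for a in ages]
```

A large gap between the mean and the median in `summary` is a quick hint that the column contains outliers worth inspecting before any modelling.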

**6. Explain what logistic regression is**

Logistic regression is a statistical method for modelling a binary (two-class) outcome as a function of one or more independent variables.

**7. Give some of the best tools useful for data analysis**

• Solver

• NodeXL

• KNIME

• R Programming

• SAS

• Weka

• Apache Spark

• Orange

• Io

• Talend

• RapidMiner

• OpenRefine

• Tableau

• Google Search Operators

• Google Fusion Tables

• Wolfram Alpha

• Pentaho

**8. What is the difference between data mining and data profiling?**

**Data profiling** is the process of analyzing the data available in an existing information source, such as a database, and collecting statistics or informative summaries about that data. It focuses on individual attributes, for example their discrete values, value ranges, and so on.

**Data mining** is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. It focuses on tasks such as cluster analysis, dependency discovery, sequence discovery, detection of unusual records, and others.

**9. What are some common problems faced by data analysts?**

Problems include:

• Data storage and quality

• Identifying overlapping data

• Common misspellings

• Duplicate data entries

• Varying value representations

• Missing values

• Illegal values

• Security and privacy of data

**10. What are Hadoop and MapReduce?**

Hadoop is an open-source framework developed by Apache for storing and processing large data sets in a distributed computing environment. MapReduce is the programming model Hadoop uses to process that data in parallel across the nodes of a cluster.

**11. What are the generally observed missing patterns?**

The generally observed patterns are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR, also written NMAR). The last category covers values whose missingness depends on an unobserved input variable or on the missing value itself.

**12. What is the KNN imputation method?**

In the KNN (k-nearest neighbours) imputation method, a missing attribute value is imputed using the values of that attribute from the k records that are most similar to the record with the missing value. The similarity of two records is determined with a distance function, such as Euclidean distance, computed over the attributes that are observed.
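
A minimal sketch of the idea in plain Python follows. The `knn_impute` helper and the `people` data are hypothetical, the function assumes that only the target column has gaps, and it averages the neighbours' values, which is one common choice among several.

```python
import math


def knn_impute(rows, target, k=2):
    """Fill None values in column `target` with the mean of the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    other = [i for i in range(len(rows[0])) if i != target]
    result = []
    for row in rows:
        if row[target] is not None:
            result.append(list(row))
            continue
        # Rank complete rows by Euclidean distance over the other columns.
        nearest = sorted(
            complete,
            key=lambda c: math.dist([row[i] for i in other], [c[i] for i in other]),
        )[:k]
        filled = list(row)
        filled[target] = sum(n[target] for n in nearest) / len(nearest)
        result.append(filled)
    return result


# Hypothetical columns: height_cm, weight_kg, shoe_size
people = [
    [160, 55, 37],
    [162, 58, 38],
    [180, 82, 44],
    [183, 85, 45],
    [161, 56, None],   # shoe size is missing
]
imputed = knn_impute(people, target=2, k=2)
```

In practice the columns would usually be scaled first, so that an attribute with a large numeric range does not dominate the distance.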

**13. What are the data validation methods used by data analysts?**

• Data screening

• Data verification

Validation methods may include allowed character checks, batch totals, cardinality check, consistency checks, control totals, cross-system consistency checks, data type checks, file existence check, format or picture check, logic check, limit check, presence check, range check, referential integrity, spelling and grammar check, and much more.
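
A few of these checks (presence, data type, range and format) can be sketched as a single function. The field names, the 0 to 120 age range and the deliberately loose email pattern are all assumptions made for the example.

```python
import re


def validate_record(record):
    """Return a list of human-readable validation errors for one record."""
    errors = []
    # Presence check: every required field must have a value.
    for field in ("id", "email", "age"):
        if record.get(field) in (None, ""):
            errors.append(f"{field}: missing")
    # Data type check, then range check, on age.
    age = record.get("age")
    if age not in (None, ""):
        if not isinstance(age, int):
            errors.append("age: not an integer")
        elif not 0 <= age <= 120:
            errors.append("age: out of range")
    # Format check: a loose something@something.tld pattern.
    email = record.get("email")
    if email and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        errors.append("email: bad format")
    return errors


ok = validate_record({"id": 1, "email": "a@b.com", "age": 30})
bad = validate_record({"id": 2, "email": "not-an-email", "age": 150})
```

Returning a list of errors, rather than stopping at the first failure, makes it easier to build the kind of validation report a data analyst is expected to produce.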

**14. What should a data analyst do with suspected or missing data?**

• Prepare a detailed validation report that provides information on all suspected or missing data

• Analyze suspected data further to validate its credibility

• Replace invalid data and assign it a validation code

• To handle missing data, use the most appropriate analysis technique, such as deletion methods, model-based methods, single imputation methods, etc.

**15. How do you deal with multi-source problems?**

• Performing a schema integration through the restructuring of schemas

• Identifying and merging similar records into a single record which will contain all relevant attributes without redundancy

**16. Explain what an outlier is**

An outlier is a value or observation in a data sample that diverges from the overall pattern, that is, it lies at an abnormal distance from the other values.
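
One common way to make "abnormal distance" concrete is the interquartile range (IQR) rule. The sketch below assumes the conventional multiplier of 1.5 and uses invented sensor readings; other rules, such as z-scores, are equally valid.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 14, 13, 95]
outliers = iqr_outliers(readings)
```

The rule only flags candidates. Whether a flagged value is an error or a genuine extreme still needs a human decision.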

**17. Name the different types of outliers**

There are three different types of outliers:

• Global outliers (also called “point anomalies”)

• Contextual (conditional) outliers

• Collective outliers

**18. What is the hierarchical clustering algorithm?**

Hierarchical clustering is an algorithm that organizes similar objects into groups called clusters. It works by repeatedly merging (agglomerative) or splitting (divisive) existing groups, creating a hierarchical structure that records the order in which the groups are merged or divided.
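
The merging variant can be sketched for one-dimensional points as below. This is a simplified illustration: the `agglomerative` function is hypothetical, it uses single linkage (distance between the closest members of two clusters), and it stops at a requested number of clusters instead of building the full tree.

```python
def agglomerative(points, num_clusters):
    """Merge the two closest clusters (single linkage) until num_clusters remain."""
    clusters = [[p] for p in points]
    merges = []  # order of merges, which is the hierarchy
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


clusters, merges = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], num_clusters=3)
```

The `merges` list is the information a dendrogram would draw: which groups were joined, in what order, and at what distance.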

**19. Define time series analysis**

It’s a statistical technique that deals with time series data, that is, observations recorded in time order, and the trends within them. It’s used to forecast the output of a process by analyzing previous data with statistical methods such as exponential smoothing, log-linear regression, etc.
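
Simple exponential smoothing, one of the methods mentioned above, fits in a few lines. The sales figures and the smoothing factor `alpha=0.5` are assumptions for the example; in practice alpha is tuned on historical data.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each level blends the new value with the previous level."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical monthly sales.
sales = [100, 110, 105, 120, 130]
smoothed = exponential_smoothing(sales, alpha=0.5)

# With this method, the one-step-ahead forecast is the last smoothed level.
forecast = smoothed[-1]
```

A higher alpha makes the forecast react faster to recent changes, while a lower alpha gives a smoother, slower-moving estimate.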

**20. List some of the statistical methods that are useful for data analysts**

• Markov process

• Imputation techniques

• Spatial and cluster processes

• Bayesian method

• Mathematical optimization

• Rank statistics, percentile, outliers detection

• Simplex algorithm

**21. What is the k-means algorithm?**

It’s an algorithm used to partition data into clusters. The k-means algorithm assigns the points of a given data set to a fixed number of clusters, k, chosen in advance.

The k-means algorithm works best when the clusters are roughly spherical, so that the data points in a cluster are centred on that cluster’s centroid and the variance or spread of each cluster is similar. Each data point belongs to the cluster with the closest centroid.
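
The two alternating steps of the algorithm, assigning points and updating centroids, can be sketched as follows. The `k_means` function is a bare-bones illustration that takes the starting centroids as an argument, an assumption made to keep the example deterministic; real implementations usually choose them randomly or with a seeding scheme.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```

Because the result depends on the starting centroids, it is common to run the algorithm several times and keep the best partition.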

**22. Define collaborative filtering**

It’s a technique used to build recommendation systems based on actual user behavioural data (user behavioural analytics): it predicts what a user will like from the preferences of users with similar behaviour. It’s most commonly seen on big sites in features like “recommended for you” or “you may also like”, for example when you’re shopping or browsing a reading list. Suggestions based on your browsing history are another example.
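
A toy user-based version is sketched below. The users, items and ratings are invented, and scoring unseen items by similarity-weighted ratings with cosine similarity is one simple scheme among many; production recommenders are considerably more involved.

```python
import math

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana":   {"book_a": 5, "book_b": 4, "book_c": 1},
    "ben":   {"book_a": 4, "book_b": 5, "book_d": 5},
    "chloe": {"book_c": 5, "book_d": 1, "book_e": 4},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


suggestions = recommend("ana", ratings)
```

Here `ana` rates books much like `ben` does, so the book `ben` liked and `ana` has not seen ends up at the top of her list.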

**23. What is MapReduce?**

It’s a programming model, with an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce splits a big data set into subsets, processes each subset on a different server (the map step), and then combines the results obtained from each (the reduce step).
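
The classic word-count example shows the shape of the model. The sketch below runs everything in one process, so it only imitates the data flow; the function names are illustrative, and on a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a subset of the data held on a different server.
chunks = ["big data big insights", "data beats opinions", "big wins"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

Because the map step treats every chunk independently, it can be spread across as many machines as there are chunks.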

**24. What is correlogram analysis?**

A correlogram is a visual representation of correlation statistics. In time series analysis it is a plot of the autocorrelation coefficients against the lag. The correlogram is a commonly used tool for checking randomness in a data set: if the data are random, the autocorrelations should be near zero at every non-zero lag.
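
The coefficients behind a correlogram can be computed directly. The sketch below uses the usual sample autocorrelation formula on an invented series that repeats every four steps, and prints a crude text version of the plot.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Hypothetical series with a period of four.
series = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
coeffs = [autocorrelation(series, lag) for lag in range(5)]

# Text correlogram: one bar per lag, sized by the magnitude of the coefficient.
for lag, r in enumerate(coeffs):
    print(f"lag {lag}: {r:+.2f} " + "#" * round(abs(r) * 20))
```

The strong positive coefficient at lag 4 reflects the repeating pattern, which is exactly the kind of non-randomness a correlogram is meant to expose.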

**25. What is an n-gram?**

An n-gram is a contiguous sequence of n tokens, usually words, characters or subsets of characters. An n-gram model is a kind of probabilistic language model that predicts the next item in a sequence from the previous (n-1) items.
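
The snippet below extracts n-grams from a sentence and builds a very small bigram (n = 2) predictor from them. The sentence and the helper names are made up, and the predictor simply picks the most frequent follower, with no smoothing.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngrams(tokens, 2)

# Bigram model: count which word follows each word.
following = defaultdict(Counter)
for prev, nxt in bigrams:
    following[prev][nxt] += 1


def predict_next(word):
    """Most frequent follower of `word` in the sample, or None if the word is unseen."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else None
```

With n = 2 the prediction depends on one previous word; a trigram model would condition on the previous two, and so on.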

**26. What is the imputation process? List out different types of imputation techniques?**

Imputation is a technique used to replace missing data elements with substituted values. There are two main types of imputation: single imputation and multiple imputation. Sub-types of single imputation include:

• Hot-deck imputation

• Cold-deck imputation

• Mean imputation

• Regression imputation

• Stochastic regression imputation
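
Two of these sub-types, mean imputation and regression imputation, are compared in the sketch below. The experience and salary figures are invented, and the regression is an ordinary least-squares line fitted by hand on the complete pairs.

```python
def mean_impute(ys):
    """Replace each missing value with the mean of the observed values."""
    observed = [y for y in ys if y is not None]
    mean = sum(observed) / len(observed)
    return [mean if y is None else y for y in ys]


def regression_impute(xs, ys):
    """Replace each missing y with the prediction of a least-squares line fitted on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


# Hypothetical data: salary (in thousands) is missing for the most experienced person.
years_experience = [1, 2, 3, 4, 5]
salary_k = [30, 40, 50, 60, None]

by_mean = mean_impute(salary_k)
by_regression = regression_impute(years_experience, salary_k)
```

Mean imputation ignores the relationship with experience, while regression imputation follows the trend. Stochastic regression would additionally add random noise to the predicted value so that the imputed data keep a realistic spread.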

**27. What is Logistic Regression?**

Logistic regression is one of the statistical methods used by data analysts to examine a dataset in which one or more independent variables determine an outcome. It is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). It allows a data analyst to show that the presence of a risk factor increases the odds of a given outcome by a specific factor.
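
The link between a coefficient and the odds can be checked numerically. In the sketch below the intercept and coefficients are hypothetical values standing in for an already fitted model, not the result of any real analysis.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Logistic model: pass the linear combination through the sigmoid function."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1 / (1 + math.exp(-z))


def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)


# Hypothetical fitted model; features are [smoker (0 or 1), age in decades].
intercept = -3.0
coefficients = [0.7, 0.4]

p_non_smoker = predict_probability(intercept, coefficients, [0, 5])
p_smoker = predict_probability(intercept, coefficients, [1, 5])

# The risk factor multiplies the odds by exp(coefficient), whatever the age.
odds_ratio = odds(p_smoker) / odds(p_non_smoker)
```

Here `odds_ratio` equals `math.exp(0.7)`, roughly 2, which is the sense in which a risk factor "increases the odds by a specific factor".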

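**28. What is a hash table collision? How can it be prevented?**

A hash table collision happens when two different keys hash to the same index. Collisions cannot be ruled out entirely, but they can be handled with techniques such as separate chaining, where each bucket stores a list of all entries that hash to it, and open addressing, where a colliding entry is placed in another free slot found by probing.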
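
As an illustration of handling two keys that land in the same bucket, here is a minimal separate-chaining table. The `ChainedHashTable` class is a teaching sketch with a fixed number of buckets and no resizing; it relies on CPython hashing small integers to themselves to force a collision.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a list of entries per bucket."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))        # new key, possibly sharing the bucket

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable(num_buckets=4)
table.put(1, "one")
table.put(5, "five")   # 5 % 4 == 1 % 4, so both keys share bucket 1
```

Both values remain retrievable because lookups compare the stored keys inside the bucket instead of trusting the index alone.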

There are other excellent data analyst/data science interview questions to ask during an interview. The questions above are just meant to give you a hint of what to ask a data analyst so that you hire the right candidate.

These are some of the top data analyst and data scientist interview questions, and we will keep updating the list. Stay tuned.