Data analytics, or data analysis, is the process of screening, cleaning, transforming, and modeling data with the objective of discovering useful information, suggesting conclusions, and supporting problem solving as well as decision making. Data analytics comprises multiple approaches and a variety of techniques and tools, and it is used in many different environments. It is carried out with graphical and statistical tools. The selection of tools for a given data analytics task depends on the overall objective as well as the source and type of the data.
The objective of the data analytics task can be to screen or inspect the data in order to find out whether the data fulfils certain requirements. These requirements can be a certain distribution, a certain homogeneity of the dataset (no outliers) or just the behaviour of the data under certain stratification conditions (using demographics).
Another objective would be the analysis of data, in particular survey data, to determine the reliability of the survey instrument that was used to collect data. Cronbach’s Alpha is often applied to perform this task. Cronbach’s Alpha determines whether survey items (questions/statements) that belong to the same factor are really behaving in a similar way, i.e. showing the same characteristic as other items in that factor. Testing reliability of a survey instrument is a prerequisite for further analysis using the dataset in question.
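As a rough illustration of the reliability check described above, the following sketch computes Cronbach's Alpha from its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total scores), where k is the number of items. The function name `cronbach_alpha` and the 5-point responses are hypothetical and only serve the example.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's Alpha for one factor.

    responses: list of rows, one row per respondent,
               each row holding that respondent's score for every item.
    """
    k = len(responses[0])  # number of items in the factor
    if k < 2:
        raise ValueError("Cronbach's Alpha needs at least two items")
    items = list(zip(*responses))  # transpose rows into item columns
    item_variance_sum = sum(variance(column) for column in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical answers of five respondents to three items of one factor
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 3],
    [1, 2, 1],
]
alpha = cronbach_alpha(responses)
print(f"Cronbach's Alpha: {alpha:.3f}")
```

Values closer to 1 indicate that the items behave more consistently. Which threshold counts as acceptable depends on the field and is not settled by the calculation itself.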
Data Manipulation Before Data Analytics
Often enough, data is not ready for analysis. This can be due to a data collection format that is not accepted by the subsequent analysis tools. It can also be due to a distribution that makes the data harder to analyse. Hence, reorganising, standardising or transforming the dataset (towards a normal distribution) might be necessary.
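Two of these preparation steps can be sketched in a few lines. The example below standardises values to z-scores and applies a logarithmic transformation, which is one common option for pulling right-skewed data closer to a normal distribution. It is an assumption here, not the only possible transformation. The function names and the cycle times are hypothetical.

```python
import math
from statistics import mean, stdev


def standardise(values):
    """Rescale values to z-scores (mean 0, sample standard deviation 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def log_transform(values):
    """Natural-log transform; only defined for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]


# Hypothetical right-skewed cycle times in hours
cycle_times = [1.2, 1.5, 1.1, 2.0, 9.5]
z_scores = standardise(cycle_times)
logged = log_transform(cycle_times)
print([round(z, 2) for z in z_scores])
print([round(v, 2) for v in logged])
```

Whether the transformed data is close enough to normal still has to be checked, for example with a plot or a normality test.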
Data Analytics with Descriptive Statistics
Descriptive Statistics includes a set of tools that is used to quantitatively describe a set of data. It usually indicates central tendency, variability, minimum, maximum as well as distribution and deviation from this distribution (kurtosis, skewness). Descriptive statistics might also highlight potential outliers for further inspection and action.
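The measures listed above can be computed with Python's standard library, as in the following minimal sketch. The `describe` helper is hypothetical. Skewness and excess kurtosis use the simple population (moment) formulas, and potential outliers are flagged with the 1.5 x IQR rule, which is one common convention among several.

```python
from statistics import mean, median, pstdev, quantiles, stdev


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    n = len(values)
    m = mean(values)
    sd_pop = pstdev(values)
    if sd_pop == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    # Moment-based skewness and excess kurtosis (0 for a normal distribution)
    skewness = sum((v - m) ** 3 for v in values) / (n * sd_pop ** 3)
    excess_kurtosis = sum((v - m) ** 4 for v in values) / (n * sd_pop ** 4) - 3
    # Quartiles and the 1.5 x IQR fences for flagging potential outliers
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return {
        "mean": m,
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
        "outliers": outliers,
    }


# Hypothetical waiting times in minutes with one suspicious value
waiting_times = [10, 11, 12, 11, 10, 12, 11, 50]
summary = describe(waiting_times)
for name, value in summary.items():
    print(name, value)
```

A flagged value is only a candidate for inspection. Whether it is an error or a genuine observation has to be decided with knowledge of the process.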
In contrast to descriptive statistics characterising a certain given set of data, inferential statistics uses a subset of the population, a sample, to draw conclusions regarding the population. The risk involved is calculated and depends on the required confidence level, confidence interval and the sample size at hand as well as the variation in the data set.
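The interplay of confidence level, variation and sample size can be illustrated for the estimate of a mean. The sketch below uses the normal approximation, margin = z * sigma / sqrt(n), which is an assumption that suits larger samples; for small samples a t-distribution would usually be used instead. All function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def margin_of_error(sigma, n, confidence=0.95):
    """Half-width of the confidence interval for a mean (normal approximation)."""
    return z_value(confidence) * sigma / math.sqrt(n)


def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest sample size whose margin of error does not exceed `margin`."""
    return math.ceil((z_value(confidence) * sigma / margin) ** 2)


# Hypothetical process with a standard deviation of 10 units
print(round(margin_of_error(sigma=10.0, n=100), 2))
print(required_sample_size(sigma=10.0, margin=2.0))
```

The formulas show the trade-off directly: more variation or a higher confidence level widens the interval, while a larger sample narrows it.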
Data Analytics with Factor Analysis
Factor Analysis is used to determine clusters in datasets, i.e. it finds observed variables that show a similar variability and can therefore be included in the same factor. A factor is a latent, unobserved variable that represents multiple observed variables in the same cluster. Under certain circumstances, this can lead to a reduction in the number of variables and hence to an increase in the sample size available for each remaining unobserved variable (factor). Both outcomes improve the power of subsequent statistical analysis of the data.
Factor analysis can be done with different approaches to pursue a multitude of objectives. Exploratory factor analysis (EFA) is used to identify complex interrelationships among items and determine clusters/factors whilst there is no predetermination of factors. Confirmatory factor analysis (CFA) is used to test the hypothesis that the items are associated with specific factors. In this case, factors are predetermined before the analysis.
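A full factor analysis requires a dedicated statistics package. As a first look before an exploratory factor analysis, it can help to inspect the pairwise correlations between items, since items that belong to the same factor should vary together. The sketch below is only this simple heuristic, not factor analysis itself. The greedy grouping rule, the threshold of 0.7, the item names and the scores are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def group_items(items, threshold=0.7):
    """Greedily group items whose absolute correlation with every
    member of a group reaches the threshold."""
    groups = []
    for name, scores in items.items():
        for group in groups:
            if all(abs(pearson(scores, items[m])) >= threshold for m in group):
                group.append(name)
                break
        else:
            groups.append([name])  # no fitting group found, start a new one
    return groups


# Hypothetical scores of five respondents on four survey items
items = {
    "q1": [1, 2, 3, 4, 5],
    "q2": [2, 1, 3, 5, 4],
    "q3": [3, 5, 1, 4, 2],
    "q4": [4, 5, 1, 3, 2],
}
groups = group_items(items)
print(groups)
```

With real survey data, the result of such a quick check would only suggest candidate factors that an EFA or CFA then has to examine properly.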
Data Analytics For Problem Solving
Data analytics can be helpful in problem solving by establishing the significance of the relationship between problems (Y) and potential root causes (X). A large variety of tools is available, and the selection of tools for a given task again depends on the overall objective as well as the source and type of the data. Discrete data, such as counts or attributes, require different tools than continuous data, such as measurements. Whilst continuous data can be transformed into discrete data for decision making, this process is irreversible: the original measurements cannot be recovered from the discrete values.
Depending on the data in X and Y, regression analysis or hypothesis testing is used to answer the question of whether there is a relationship between the problem and the alleged root cause. These tools do not make the decision, but they quantify the risk of a certain decision. The decision still has to be made by a person, for example the process owner. The roadmap for these tests follows a certain standard:
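The loss of information when continuous data is turned into discrete data can be seen in a small sketch. The specification limits, measurements and function name below are hypothetical.

```python
def to_attribute(measurements, lower, upper):
    """Convert continuous measurements into pass/fail attribute data."""
    return ["pass" if lower <= m <= upper else "fail" for m in measurements]


# Hypothetical shaft diameters in mm, checked against limits of 9.90 and 10.10
diameters_mm = [9.98, 10.02, 10.11, 9.87, 10.00]
verdicts = to_attribute(diameters_mm, lower=9.90, upper=10.10)
print(verdicts)
```

Both 10.11 (too large) and 9.87 (too small) end up as "fail", so the direction and size of the deviation are no longer visible in the attribute data.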
- Plot the data. There is a variety of data display tools available that help to prepare a decision. This step must not be disregarded. Otherwise the statistics might be misleading.
- Formulate the hypothesis. The hypothesis or model to be tested is the statistical expression for the practical problem. There is usually a null-hypothesis and an alternative hypothesis. Tests are designed to reject the null-hypothesis and accept the alternative hypothesis.
- Decide on the acceptable risk. Since hypothesis and regression tools do not take a decision but give the risk for a certain decision, the acceptable risk (α) needs to be defined before conducting the test.
- Select the right tool, i.e. the appropriate hypothesis test for a certain situation. Under certain circumstances, a series of tests might be necessary. Common tools include the 2-sample t-test, ANOVA, linear regression, binary logistic regression and the 2-proportion test.
- Test the assumptions. Many statistical tools can only be applied under certain circumstances. These circumstances must be checked for the test result to be valid.
- Conduct the test. If all assumptions are met, perform the test or regression on the actual dataset.
- Make a decision. Decide whether there is enough evidence to reject the null-hypothesis and accept the alternative. Translate the statistical outcome into practical relevance.
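The roadmap above can be sketched for one of the tools mentioned, the 2-proportion test. The example assumes a two-sided z-test with a pooled proportion and checks the usual rule of thumb for the normal approximation (at least 5 expected events and non-events per sample; some texts ask for 10). The plotting step is left out because it needs a charting library. The function name and the defect counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2, alpha=0.05):
    """Two-sided 2-proportion z-test.

    H0: both proportions are equal. HA: the proportions differ.
    Returns (z, p_value, reject_h0).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Test the assumptions: the normal approximation needs enough
    # expected events and non-events in each sample
    for n in (n1, n2):
        if n * pooled < 5 or n * (1 - pooled) < 5:
            raise ValueError("sample too small for the normal approximation")
    # Conduct the test
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Make a decision: compare the p-value with the acceptable risk alpha
    return z, p_value, p_value < alpha


# Hypothetical defect counts: 30 of 200 units on line A, 15 of 200 on line B
z, p_value, reject_h0 = two_proportion_z_test(30, 200, 15, 200, alpha=0.05)
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

The function only reports whether the evidence is strong enough at the chosen risk level. Whether a statistically significant difference is also practically relevant remains a judgement for whoever makes the decision.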
Applications for data analytics are found in all kinds of private and public organisations. Already some decades ago, companies like Motorola and General Electric discovered the power of data analytics and made it the core of their Six Sigma movement. They made sure that problem solving was data-driven and applied data analytics wherever appropriate. Nowadays, data analytics is widely used for problem solving as part of most Lean Six Sigma projects. Black Belts are usually well-versed in this kind of data analysis.
"I only believe in statistics that I doctored myself." (attributed to Winston Churchill)