Data analytics, or data analysis, is the process of screening, cleaning, transforming, and modelling data with the objective of discovering useful information, suggesting conclusions, and supporting problem solving as well as decision making. It encompasses multiple approaches and a variety of techniques and tools, and it finds applications in many different environments. Data analytics usually covers two steps, graphical analysis and statistical analysis. The selection of tools for a given data analytics task depends on the overall objective as well as the source and types of the data.
The objective of the data analytics task can be to screen or inspect the data in order to find out whether the data fulfils certain requirements. These requirements can be a certain distribution, a certain homogeneity of the dataset (no outliers) or just the behaviour of the data under certain stratification conditions (using demographics).
Another objective would be the analysis of data, in particular survey data, to determine the reliability of the survey instrument that was used to collect data. Cronbach’s Alpha is often applied to perform this task. Cronbach’s Alpha determines whether survey items (questions/statements) that belong to the same factor are really behaving in a similar way, i.e. showing the same characteristic as other items in that factor. Testing reliability of a survey instrument is a prerequisite for further analysis using the dataset in question.
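As a minimal sketch of this reliability check, Cronbach's Alpha can be computed from the item variances and the variance of the respondents' total scores. The function name `cronbach_alpha` and the response data are hypothetical and serve only as an illustration; the formula used is the standard one, k/(k-1) * (1 - sum of item variances / variance of total scores).

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's Alpha for one factor.

    items: one list of scores per survey item, with respondents
    in the same order in every list.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Sum of the sample variances of the individual items
    item_var_sum = sum(variance(col) for col in items)
    # Sample variance of each respondent's total score over all items
    totals = [sum(scores) for scores in zip(*items)]
    total_var = variance(totals)
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical answers of five respondents to three items of one factor
responses = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 1, 4],
]
reliability = cronbach_alpha(responses)
print(f"Cronbach's Alpha: {reliability:.2f}")
```

Values closer to 1 indicate that the items of the factor behave more consistently; which threshold counts as acceptable is a convention that depends on the field of study.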
Data Manipulation Before Data Analytics
Often enough, data is not ready for analysis. This can be due to a data collection format that is not in sync with subsequent analysis tools, or to a distribution that makes the data harder to analyse. Hence, reorganising, standardising or transforming the dataset (for example, towards a normal distribution) might be necessary.
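The following sketch illustrates two such manipulations: standardising values to z-scores and applying a log transformation, which is one common option for right-skewed, strictly positive data. The function names and the cycle-time values are hypothetical, and whether a log transformation is suitable depends on the dataset at hand.

```python
import math
from statistics import mean, stdev

def standardise(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

def log_transform(values):
    """Natural-log transform; only defined for strictly positive values."""
    if min(values) <= 0:
        raise ValueError("log transform requires positive values")
    return [math.log(x) for x in values]

# Hypothetical right-skewed cycle times in hours
cycle_times = [1.2, 1.5, 1.7, 2.1, 2.4, 3.0, 4.8, 9.5]
z_scores = standardise(cycle_times)
logged = log_transform(cycle_times)
```

After standardising, the values have a mean of 0 and a standard deviation of 1, which makes variables measured on different scales comparable.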
Data Analytics with Descriptive Statistics
Descriptive statistics includes a set of tools that is used to quantitatively describe a set of data. It usually indicates central tendency, variability, minimum and maximum, as well as the shape of the distribution and its deviation from a normal distribution (kurtosis, skewness). Descriptive statistics might also highlight potential outliers for further inspection and action.
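A minimal sketch of such a summary, using only the Python standard library, could look as follows. The helper names `describe` and `iqr_outliers` and the data are hypothetical. Skewness and excess kurtosis are computed here from simple moment estimators, and the 1.5 * IQR rule for flagging outliers is a common rule of thumb, not the only possible choice.

```python
from statistics import mean, median, quantiles, stdev

def describe(values):
    """Summarise a numeric sample with common descriptive statistics."""
    n = len(values)
    m = mean(values)
    # Central moments (divided by n), used for the shape measures
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": m,
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [x for x in values if x < low or x > high]

# Hypothetical right-skewed cycle times in hours
cycle_times = [1.2, 1.5, 1.7, 2.1, 2.4, 3.0, 4.8, 9.5]
summary = describe(cycle_times)
outliers = iqr_outliers(cycle_times)
```

A flagged value is only a candidate for inspection; whether it is a data error or a genuine observation has to be decided from the context.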
In contrast to descriptive statistics, which characterises a given set of data, inferential statistics uses a subset of the population, a sample, to draw conclusions regarding the population. The inherent risk of drawing a wrong conclusion depends on the required confidence level, the confidence interval and the sample size at hand as well as the variation in the dataset. The test result quantifies this risk.
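To illustrate how confidence level, sample size and variation interact, the sketch below computes a confidence interval for a population mean from a sample. It assumes a normal approximation (z-interval), which is a simplification; for small samples a t-distribution is usually more appropriate. The function name and the measurements are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    m = mean(sample)
    # Standard error shrinks with larger samples and grows with variation
    standard_error = stdev(sample) / n ** 0.5
    # Two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return m - z * standard_error, m + z * standard_error

# Hypothetical sample of 20 measurements
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.0,
          11.6, 12.1, 12.2, 11.9, 12.0, 12.4, 11.8, 12.1, 12.3, 11.9]

low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
```

Requiring a higher confidence level widens the interval, whilst a larger sample or less variation in the data narrows it.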
Data Analytics with Factor Analysis
Factor analysis helps determine clusters in datasets, i.e. it finds empirical variables that show a similar variability. These variables may therefore belong to the same factor. A factor is a latent, unobserved variable that represents multiple observed variables in the same cluster. Under certain circumstances, this can lead to a reduction of observed variables and hence the increase of sample size in the remaining unobserved variables (factors). Both outcomes improve the power of subsequent statistical analysis of the data.
Factor analysis can be done with different approaches to pursue a multitude of objectives. Exploratory factor analysis (EFA) is used to identify complex interrelationships among items and determine clusters/factors whilst there is no predetermination of factors. Confirmatory factor analysis (CFA) is used to test the hypothesis that the items are associated with specific factors. In this case, factors are predetermined before the analysis.
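A full exploratory or confirmatory factor analysis requires a dedicated statistics package and is not reproduced here. As a rough sketch of the underlying idea only, the code below computes pairwise Pearson correlations between survey items and lists the pairs that vary together, which is the kind of pattern EFA starts from. The item names, the data and the threshold of 0.7 are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical answers of six respondents to four survey items
items = {
    "A1": [1, 2, 3, 4, 5, 6],
    "A2": [2, 1, 4, 3, 6, 5],
    "B1": [1, 3, 2, 2, 3, 1],
    "B2": [2, 6, 3, 4, 5, 1],
}

# Item pairs whose variability is similar enough to suggest a shared factor
threshold = 0.7  # illustrative cut-off, not a standard value
strong_pairs = [
    (a, b)
    for a, b in combinations(items, 2)
    if abs(pearson(items[a], items[b])) >= threshold
]
print(strong_pairs)
```

In this illustration the items fall into two groups, which would correspond to two candidate factors; an actual factor analysis additionally estimates loadings and tests how well the factor structure fits the data.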
Data Analytics For Problem Solving
Data analytics can be helpful in problem solving by establishing the significance of the relationship between problems (Y) and potential root causes (X). A large variety of tools is available. The selection of tools for a given data analytics task depends on the overall objective, the source and types of data. Discrete data, such as counts or attributes, require different tools than continuous data, such as measurements. Whilst continuous data are transformable into discrete data for decision making, this process is irreversible.
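The irreversibility of turning continuous data into discrete data can be seen in a small sketch. The function name, the specification limits and the measurements are hypothetical.

```python
def to_attribute(measurements, lower, upper):
    """Convert continuous measurements into pass/fail attribute data."""
    return ["pass" if lower <= x <= upper else "fail" for x in measurements]

# Two different hypothetical batches of measurements
batch_a = [9.8, 10.1, 10.6]
batch_b = [10.0, 9.9, 11.2]

labels_a = to_attribute(batch_a, lower=9.5, upper=10.5)
labels_b = to_attribute(batch_b, lower=9.5, upper=10.5)
```

Both batches yield the same attribute data, so the original measurements cannot be recovered from the pass/fail results, and information such as how far a failed unit was outside the limits is lost.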
Depending on the data in X and Y, regression analysis or hypothesis testing will be used to answer the question whether there is a relationship between the problem and the alleged root cause. These tools do not make the decision; they quantify the risk attached to a certain decision. The decision is still to be made by a person, for example the process owner. The roadmap for these tests follows a standard sequence:
Data Analytics Steps
- Plot the data. There is a variety of data display tools available that help to prepare a decision. This step is of utmost importance. Otherwise the statistics might be misleading.
- Formulate the hypothesis. The hypothesis or model is the statistical expression for the practical problem. There is usually a null-hypothesis and an alternative hypothesis. By design, tests aim to reject the null-hypothesis in favour of the alternative hypothesis.
- Decide on the acceptable risk. Hypothesis and regression tools do not take a decision but give the risk for a certain decision. Hence, the researcher needs to define the acceptable risk (α) before conducting the test.
- Select the right tool, i.e. the appropriate hypothesis test for a certain situation. A series of tests might be necessary under certain circumstances. Common choices include the 2-sample t-test, ANOVA, linear regression, binary logistic regression, the 2-proportion test, the test for equal variances and the Chi-Squared test.
- Test the assumptions. Many statistical tools can only deliver reliable results under certain circumstances. These circumstances must be in place for valid test results.
- Conduct the test. After confirming all assumptions, perform the test or regression on the actual dataset.
- Make a decision. Decide whether there is enough evidence to reject the null-hypothesis and accept the alternative. Translate the statistical outcome into practical relevance.
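The steps above can be sketched with a 2-proportion test, one of the tools named in the list. This is a minimal illustration under stated assumptions: the defect counts are hypothetical, the plotting step is omitted, the test is a two-sided z-test with a pooled proportion, and the assumption check uses the common rule of thumb of at least 5 expected events and non-events per sample.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: defective units before and after a process change
defects_before, units_before = 48, 400
defects_after, units_after = 27, 400

# Formulate the hypothesis:
#   H0: the defect rate is the same before and after the change
#   HA: the defect rate differs
# Decide on the acceptable risk before conducting the test
alpha = 0.05

# Test the assumptions of the normal approximation (rule of thumb)
pooled = (defects_before + defects_after) / (units_before + units_after)
assumptions_ok = all(
    n * q >= 5
    for n in (units_before, units_after)
    for q in (pooled, 1 - pooled)
)

# Conduct the test
z, p_value = two_proportion_z_test(
    defects_before, units_before, defects_after, units_after
)

# Make a decision: reject H0 only if the risk is below the accepted level
reject_null = assumptions_ok and p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

The p-value is the risk of being wrong when rejecting the null-hypothesis. Whether a statistically significant difference in defect rates is also practically relevant remains a judgement for the decision maker.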
Applications for data analytics can be found in virtually all private and public organisations. Decades ago, companies like Motorola and General Electric discovered the power of data analytics and made it the core of their Six Sigma movement. They made sure that problem solving was data-driven and applied data analytics wherever appropriate. Nowadays, data analytics is widely used for problem solving as part of most Lean Six Sigma projects. Black Belts are usually well-versed in this kind of data analysis.
"I only believe in statistics that I doctored myself." (attributed to Winston Churchill)