Data analytics, or data analysis, is the process of screening, cleaning, transforming, and modeling data with the objective of discovering useful information, suggesting conclusions, and supporting problem solving as well as decision making. Researchers use multiple approaches, including a variety of techniques and tools, and data analytics finds applications in many different environments. It usually covers two steps: graphical analysis and statistical analysis. The selection of tools for a given data analytics task depends on the overall objective and on the source and types of the data given.
The objective of the data analytics task can be to screen or inspect the data in order to find out whether the data fulfils certain requirements. These requirements can be a certain distribution, a certain homogeneity of the dataset (no outliers) or just the behaviour of the data under certain stratification conditions (using demographics).
Another objective would be the analysis of data, in particular survey data, to determine the reliability of the survey instrument we used to collect data. Cronbach’s Alpha is often applied to perform this task. Cronbach’s Alpha determines whether survey items (questions/statements) that belong to the same factor are really behaving in a similar way, i.e. showing the same characteristic as other items in that factor. Testing reliability of a survey instrument is a prerequisite for further analysis using the dataset in question.
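As a minimal sketch of this reliability check, the snippet below computes Cronbach's Alpha from the usual variance formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The function name `cronbach_alpha` and the response data are hypothetical and only serve as illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's Alpha for one factor.

    items: one inner list per survey item, each holding that item's
    scores for the same respondents in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per respondent (sum across all items of the factor).
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 5-point Likert responses: 3 items, 6 respondents.
factor_items = [
    [4, 5, 3, 2, 4, 5],
    [4, 4, 3, 2, 5, 5],
    [5, 5, 2, 1, 4, 4],
]
alpha = cronbach_alpha(factor_items)
print(round(alpha, 3))
```

A commonly cited rule of thumb treats values of roughly 0.7 and above as acceptable, although the appropriate threshold depends on the purpose of the instrument.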
Often enough, data is not ready for analysis. The data collection format may not match what the subsequent analysis tools expect, or the distribution may make the data harder to analyse. Hence, reorganising, standardising or transforming the dataset (for example, towards a normal distribution) might be necessary.
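Two of these preparation steps can be sketched as follows: standardising to z-scores and a log transform. The log transform is only one common option for right-skewed data and does not guarantee a normal distribution. The function names and the data are assumptions made for this illustration.

```python
import math
from statistics import mean, stdev


def standardise(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


def log_transform(values):
    """Natural-log transform; only defined for strictly positive values."""
    if any(x <= 0 for x in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log(x) for x in values]


# Hypothetical right-skewed cycle times (a few very long durations).
cycle_times = [1.2, 1.5, 1.9, 2.4, 3.8, 7.5, 15.0]
z_scores = standardise(cycle_times)   # mean 0, standard deviation 1
logged = log_transform(cycle_times)   # long right tail is compressed
```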
Descriptive Statistics includes a set of tools that is used to quantitatively describe a set of data. It usually indicates central tendency, variability, minimum, maximum as well as distribution and deviation from this distribution (kurtosis, skewness). Descriptive statistics might also highlight potential outliers for further inspection and action.
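The measures listed above can be sketched with the standard library alone. Skewness and kurtosis are computed here from population central moments, which is one of several conventions, so statistical packages may report slightly different values. The helper `describe` and the sample are hypothetical.

```python
from statistics import mean, median, stdev


def describe(values):
    """Central tendency, spread, range and shape of a dataset."""
    n = len(values)
    m = mean(values)
    # Population central moments, used for the moment-based shape measures.
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": m,
        "median": median(values),
        "stdev": stdev(values),              # sample standard deviation
        "min": min(values),
        "max": max(values),
        "skewness": m3 / m2 ** 1.5,          # 0 for a symmetric distribution
        "excess_kurtosis": m4 / m2 ** 2 - 3, # 0 for a normal distribution
    }


sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```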
In contrast to descriptive statistics, which characterise a given set of data, inferential statistics uses a subset of the population, a sample, to draw conclusions regarding the population. The inherent risk of drawing a wrong conclusion depends on the required confidence level, the confidence interval and the sample size at hand, as well as on the variation in the data set. The test result quantifies this risk.
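How confidence level, sample size and variation interact can be illustrated with a confidence interval for a mean. The sketch below uses the normal approximation, which is an assumption that is only reasonable for moderately large samples; for small samples a t-distribution would give a wider interval. The function name and the fill weights are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(n)  # shrinks as n grows
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error


# Hypothetical sample of 30 fill weights in grams.
fill_weights = [
    50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 49.7, 50.4, 50.1, 49.9,
    50.0, 50.2, 49.8, 50.1, 50.3, 49.6, 50.0, 50.2, 49.9, 50.1,
    50.0, 49.8, 50.2, 50.1, 49.9, 50.3, 50.0, 49.7, 50.1, 50.0,
]
low95, high95 = mean_confidence_interval(fill_weights, 0.95)
low99, high99 = mean_confidence_interval(fill_weights, 0.99)
print(f"95%: {low95:.3f} to {high95:.3f}")
print(f"99%: {low99:.3f} to {high99:.3f}")
```

A higher confidence level widens the interval, while a larger sample or less variation in the data narrows it.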
Factor analysis helps determine clusters in datasets, i.e. it finds empirical variables that show a similar variability. These variables may therefore form the same factor. A factor is a latent, unobserved variable that accounts for the shared variability of multiple observed variables in the same cluster. Under certain circumstances, this can lead to a reduction of observed variables and hence to an increase of the sample size relative to the number of remaining variables (the factors). Both outcomes improve the power of subsequent statistical analysis of the data.
Factor analysis can use different approaches to pursue a multitude of objectives. Exploratory factor analysis (EFA) is used to identify complex interrelationships among items and determine clusters/factors whilst there is no predetermination of factors. Confirmatory factor analysis (CFA) is used to test the hypothesis that the items are associated with specific factors. In this case, factors are predetermined before the analysis.
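A full factor analysis needs a numerical library and is not shown here. As a preliminary look that often precedes EFA, the sketch below only computes an inter-item correlation matrix, in which items that vary together stand out as candidate clusters. The item names, the data and both helper functions are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(items):
    """Nested dict of pairwise correlations between named items."""
    names = list(items)
    return {a: {b: pearson(items[a], items[b]) for b in names} for a in names}


# Hypothetical responses of 6 respondents to 4 survey items.
responses = {
    "q1": [1, 2, 3, 4, 5, 4],
    "q2": [2, 2, 3, 5, 5, 4],
    "q3": [5, 1, 4, 2, 3, 3],
    "q4": [4, 1, 5, 2, 3, 2],
}
corr = correlation_matrix(responses)
for name, row in corr.items():
    print(name, [round(r, 2) for r in row.values()])
```

In this made-up data, q1 and q2 correlate strongly with each other, as do q3 and q4, while the correlations across the two pairs are weak. That pattern would suggest two candidate factors, which EFA could then extract or CFA could test.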
Data analytics can be helpful in problem solving by establishing the significance of the relationship between problems (Y) and potential root causes (X). A large variety of tools is available. The selection of tools for a given data analytics task depends on the overall objective, the source and the types of data. Discrete data, such as counts or attributes, require different tools than continuous data, such as measurements. Whilst continuous data can be transformed into discrete data for decision making, this process is irreversible.
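The irreversibility of that transformation can be shown with a small sketch: two different sets of measurements collapse into the same attribute data, so the original values cannot be recovered afterwards. The specification limits, the data and the function name are hypothetical.

```python
def to_attribute(measurements, lower, upper):
    """Turn continuous measurements into pass/fail attribute data."""
    return ["pass" if lower <= x <= upper else "fail" for x in measurements]


# Hypothetical shaft diameters in mm, specification 9.90 to 10.10 mm.
batch_a = [9.98, 10.02, 10.11, 9.87, 10.00]
batch_b = [9.95, 10.05, 10.30, 9.50, 10.01]

attributes_a = to_attribute(batch_a, lower=9.90, upper=10.10)
attributes_b = to_attribute(batch_b, lower=9.90, upper=10.10)

# Different measurements, identical attribute data: the information on how
# far each part is from the limits is lost.
print(attributes_a)
print(attributes_a == attributes_b)
```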
Depending on the data types of X and Y, regression analysis or hypothesis testing is used to answer the question of whether there is a relationship between the problem and the alleged root cause. These tools do not take the decision away, but they quantify the risk attached to a given decision. The decision itself still has to be made by someone, for example the process owner. The roadmap for these tests follows a certain standard:
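A common sequence, assumed here for illustration rather than taken from the text, is to state the null hypothesis (no relationship between X and Y), fix the acceptable risk as a significance level, compute a p-value, and compare the two. The sketch below does this for a discrete X (two machines) and a continuous Y (a measurement). It uses an exact permutation test because that needs only the standard library; a t-test or regression from a statistics package would be the more usual choice. All names and data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference of means.

    Returns the share of all possible regroupings whose mean difference
    is at least as large as the observed one (the p-value).
    """
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = count = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical measurements (Y) from two machines (X).
machine_a = [20.1, 19.8, 20.4, 20.0, 19.9]
machine_b = [21.0, 21.3, 20.9, 21.4, 21.1]

significance_level = 0.05  # accepted risk of wrongly claiming a relationship
p_value = permutation_test(machine_a, machine_b)
print(f"p-value: {p_value:.4f}")
print("relationship indicated" if p_value < significance_level
      else "no relationship shown")
```

The p-value expresses the risk of claiming a relationship that is not there; whether that risk is acceptable remains a decision for the person responsible.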
Applications for data analytics are evident in private and public organisations of all kinds. Already some decades ago, companies like Motorola and General Electric discovered the power of data analytics and made it the core of their Six Sigma movement. They made sure that problem solving was based on data and applied data analytics wherever appropriate. Nowadays, data analytics, or data science, is a vital part of problem solving and of most Lean Six Sigma projects. Six Sigma Black Belts are usually well-versed in this kind of data analysis.
"I only believe in statistics that I doctored myself." (attributed to Winston Churchill)