Making Sense of Linear Regression
22 Jan 2018. Posted by JC in Data Science.
Linear regression is one of the most commonly used hypothesis tests in Lean Six Sigma work. It offers the statistics for testing whether two or more sets of continuous data correlate with each other, i.e. whether one drives the other.
Here is an example starting with the absolute basics of linear regression. The data stem from a business simulation of package delivery in which several parameters are measured. The objective is to find the drivers of Delivery Time (Y). The question is whether one or more of the potential drivers Coding Time, Packaging Time, Sender and Package Type make a significant difference to Delivery Time (Figure 1).
To perform linear regression, we follow these steps:
1. Plot the Data
For any statistical application, it is essential to combine it with a graphical representation of the data. Several tools are available for this purpose. If both Y (Delivery Time) and X (Coding Time and Packaging Time) are continuous, the scatter plot is the tool of choice. For discrete X (Sender and Package Type) and continuous Y (Delivery Time), a stratified histogram, a dotplot or a boxplot helps.
The plots in Figure 2 show all four potential drivers and their potential influence on the delivery time. Obviously, Coding Time drives Delivery Time and Sender and Package Type influence Delivery Time as well. However, there is a problem with this set of plots, the underlying thinking and our preliminary conclusions: Plotting one X against the Y disregards all other variables, i.e. their potential influence is still part of the data set but shows up as random variation. Therefore, obvious questions are:
 Would the obviously marginal influence of Packaging Time on Delivery Time change if the variation by Sender or Package Type is accounted for, i.e. eliminated from the model?
 Is it possible that Long Distance packages take longer because the majority of them go to East Coast?
Since it is not possible to show all four Xs and their relationship with the Y in the same plot, we need to refer to statistics for help.
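Before turning to the regression itself, a simple cross-tabulation can show how the two discrete Xs overlap. The following is a minimal sketch in plain Python, not part of the original SigmaXL analysis. It uses the Delivery Time, Sender and Package Type columns from the Regression Data table in this article; the one-letter codes and variable names are only illustrative.

```python
from collections import defaultdict

SENDERS = {"W": "Woodlands", "E": "East Coast", "C": "Changi", "J": "Jurong"}
TYPES = {"L": "Long Distance", "S": "Short Distance"}

# Delivery Time, Sender, Package Type for rows 1-30 of the Regression Data table.
RAW = """
6.3,W,L 13,E,L 4.9,C,S 5.3,C,L 6.2,W,L 18.8,E,L 5.2,C,L 5.2,C,L 7.2,W,L 5.2,E,L
6,C,L 8.3,J,L 5.4,W,L 4.5,C,S 5.1,J,L 12.2,E,L 4.3,W,S 3.3,C,S 3.1,J,S 3.2,C,S
2.2,J,S 7.4,E,L 3.3,W,S 2.4,C,S 3.4,J,S 3,C,S 4.3,W,S 2.8,C,S 5,J,L 2.4,C,S
"""

# Group the delivery times by (Sender, Package Type).
groups = defaultdict(list)
for item in RAW.split():
    delivery, sender, ptype = item.split(",")
    groups[(SENDERS[sender], TYPES[ptype])].append(float(delivery))

counts = {key: len(times) for key, times in groups.items()}
means = {key: sum(times) / len(times) for key, times in groups.items()}

# Print one line per combination that occurs in the data.
for key in sorted(groups):
    print(f"{key[0]:<11}{key[1]:<15} n={counts[key]:>2}  mean={means[key]:.2f}")
```

In these data, all five East Coast packages are Long Distance packages (5 of the 16 Long Distance packages), so the effects of Sender and Package Type cannot be separated by looking at one plot at a time.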
2. Formulate the Hypothesis for Linear Regression
In this case, the parameter of interest is the relationship between X and Y, i.e. the null hypothesis is
H_{0}: X does not influence Y,
where X stands for any of the potential drivers and Y for Delivery Time.
This means the alternative hypothesis is
H_{A}: X has a significant influence on Y.
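In symbols, and assuming the usual multiple linear regression model, the hypotheses can be sketched as follows. The coefficients \beta_j and the error term \varepsilon are notation introduced here for illustration; they do not appear in the original article.

```latex
\[
  Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon
\]
\[
  H_0:\ \beta_j = 0
  \qquad \text{versus} \qquad
  H_A:\ \beta_j \neq 0
  \qquad (j = 1, \dots, k)
\]
```

A coefficient of zero means that the corresponding X does not change the predicted Y, which is what "X does not influence Y" amounts to in this model.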
3. Decide on the Acceptable Risk
Since there is no reason for changing the commonly used acceptable risk of 5%, i.e. 0.05, we use this risk as our threshold for making our decision.
4. Select the Right Tool
If there is a need for testing the influence of one or more continuous X on a continuous Y, the popular test for this situation is Linear Regression.
5. Test the Assumptions
First of all, there are no requirements on Xs or Y for a regression to be valid. However, the residuals, i.e. the variation that remains in the data points after applying the regression model, need to meet these requirements:
 Firstly, residuals need to be normally distributed,
 Secondly, residuals need to be independent of time,
 Thirdly, residuals need to be independent of X and
 Finally, residuals need to be independent of Fits, i.e. the calculated value for Y after applying the model.
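To make the terms Fits and residuals concrete, here is a minimal sketch in plain Python. It is a simplification and not the model from this article: it fits Delivery Time against Coding Time only, using the Regression Data table. It also assumes that the row number reflects the order in time, which the article does not state. The Durbin-Watson statistic is swapped in here as one common numeric check for independence of time; the article itself does not name a specific check.

```python
# Coding Time (X) and Delivery Time (Y), rows 1-30 of the Regression Data table.
coding_time = [0.42, 0.632, 0.452, 0.383, 0.588, 0.93, 0.431, 0.407, 0.396, 0.573,
               0.305, 0.628, 0.459, 0.522, 0.619, 0.554, 0.593, 0.334, 0.315, 0.35,
               0.23, 0.561, 0.31, 0.29, 0.353, 0.164, 0.32, 0.252, 0.358, 0.703]
delivery_time = [6.3, 13, 4.9, 5.3, 6.2, 18.8, 5.2, 5.2, 7.2, 5.2,
                 6, 8.3, 5.4, 4.5, 5.1, 12.2, 4.3, 3.3, 3.1, 3.2,
                 2.2, 7.4, 3.3, 2.4, 3.4, 3, 4.3, 2.8, 5, 2.4]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for one continuous X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def durbin_watson(res):
    """Durbin-Watson statistic: values near 2 suggest no serial correlation."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(e * e for e in res)
    return num / den


intercept, slope = fit_line(coding_time, delivery_time)
fits = [intercept + slope * x for x in coding_time]            # calculated Y
residuals = [y - f for y, f in zip(delivery_time, fits)]       # observed minus calculated Y

dw = durbin_watson(residuals)  # assumption: row order equals time order

# Row with the largest absolute residual, a candidate for an outlier check.
worst_row = max(range(len(residuals)), key=lambda i: abs(residuals[i])) + 1

print(f"slope={slope:.3f} intercept={intercept:.3f} DW={dw:.2f} worst row={worst_row}")
```

Normality of the residuals and their independence of X and Fits are usually judged from plots (normal probability plot, residuals versus X, residuals versus Fits), which this sketch leaves out.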
6. Conduct the Test
Running the linear regression in the statistics software SigmaXL generates the output in Figure 3.
The interpretation of linear regression statistics follows certain steps:
 Check the VIF (Variance Inflation Factors). If one or more VIF show a value of 5 or higher, the model is not valid since two or more factors (Xs) correlate with each other. This intercorrelation needs to be removed before proceeding. Solving this issue usually means removing Xs one by one until all VIF are below 5. Try removing factors that you cannot easily measure or control first.
 Check the model p-value. If this p-value is below 0.05 (or 5%), the model is valid, i.e. there is a significant X-Y relationship.
 Check the p-value for all predictor terms. Remove non-significant Xs (predictors, factors) one by one and rerun the model after each removal. Start with the highest p-value.
 Check whether the residuals adhere to the assumptions. This check is especially suitable for finding extreme measurements, i.e. outliers, values that do not seem to fit the model. These values should be investigated regarding accuracy of data collection.
 Formulate the final model.
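The VIF check in the first step can be illustrated with a minimal sketch in plain Python. For a predictor X_j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on all other Xs. As a simplification, the sketch below considers only the two continuous Xs from the Regression Data table, in which case R_j^2 is the squared Pearson correlation between them. The dummy-coded discrete Xs are left out, so the result will not necessarily match the VIF values in the SigmaXL output.

```python
from math import sqrt

# Coding Time and Packaging Time, rows 1-30 of the Regression Data table.
coding_time = [0.42, 0.632, 0.452, 0.383, 0.588, 0.93, 0.431, 0.407, 0.396, 0.573,
               0.305, 0.628, 0.459, 0.522, 0.619, 0.554, 0.593, 0.334, 0.315, 0.35,
               0.23, 0.561, 0.31, 0.29, 0.353, 0.164, 0.32, 0.252, 0.358, 0.703]
packaging_time = [0.079, 0.105, 0.093, 0.108, 0.1, 0.112, 0.128, 0.1, 0.108, 0.134,
                  0.099, 0.064, 0.097, 0.089, 0.088, 0.085, 0.125, 0.065, 0.088, 0.125,
                  0.111, 0.105, 0.108, 0.093, 0.092, 0.104, 0.125, 0.104, 0.108, 0.094]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(coding_time, packaging_time)
vif = 1 / (1 - r ** 2)  # with only two Xs, both share the same VIF

print(f"r={r:.3f} VIF={vif:.3f}")
print("VIF below 5" if vif < 5 else "VIF of 5 or higher: remove one of the Xs")
```

A VIF of exactly 1 would mean that the two Xs are not correlated at all; the value grows as the correlation between them gets stronger.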
Packaging Time turns out to be non-significant, so we have removed this term. Since Sender_Jurong and Sender_Woodlands are part of the Predictor Term Sender, they are automatically part of the model.
7. Make a Decision
As a result, rejecting H_{0} means that there is a significant relationship between Predictor Terms (Xs) and Delivery Time (Y). In order to influence (reduce) the Delivery Time, we have the option to work on Coding Time, Package Type and Sender. Since Package Type and Sender are customer requirements, it could make sense to change our promised delivery time for different Senders and different Package Types. The regression output delivers a regression equation that helps to predict Delivery Time for different factor settings:
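Delivery Time = (2.492) + (7.291) * Coding Time + (4.088) * Sender_East Coast + (-1.900) * Package Type_Short Distance
Here, Sender_East Coast and Package Type_Short Distance take the value 1 if the package has this attribute and 0 otherwise.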
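The regression equation can be turned into a small prediction helper. The following is a minimal sketch in plain Python; the function and constant names are hypothetical. It assumes that the two discrete terms are 0/1 indicator variables and that the Short Distance coefficient is negative. The negative sign is an assumption made here because it matches the Short Distance rows in the data table below, for example row 21.

```python
# Coefficients taken from the regression equation in this article.
INTERCEPT = 2.492
B_CODING_TIME = 7.291
B_SENDER_EAST_COAST = 4.088
B_SHORT_DISTANCE = -1.900  # assumption: negative sign, consistent with the data table


def predict_delivery_time(coding_time, sender, package_type):
    """Predict Delivery Time from Coding Time, Sender and Package Type."""
    # Indicator (dummy) variables: 1 if the attribute applies, else 0.
    east_coast = 1 if sender == "East Coast" else 0
    short_distance = 1 if package_type == "Short Distance" else 0
    return (INTERCEPT
            + B_CODING_TIME * coding_time
            + B_SENDER_EAST_COAST * east_coast
            + B_SHORT_DISTANCE * short_distance)


# Row 21 of the data table: Jurong, Short Distance, Coding Time 0.23, observed 2.2.
print(round(predict_delivery_time(0.23, "Jurong", "Short Distance"), 2))  # 2.27

# Row 2 of the data table: East Coast, Long Distance, Coding Time 0.632, observed 13.
print(round(predict_delivery_time(0.632, "East Coast", "Long Distance"), 2))
```

Such a helper could support the idea of promising different delivery times for different Senders and Package Types, since it gives one predicted value per factor setting.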
Here is the data:
Regression Data
No  Delivery Time  Coding Time  Packaging Time  Sender  Package Type 

1  6.3  0.42  0.079  Woodlands  Long Distance 
2  13  0.632  0.105  East Coast  Long Distance 
3  4.9  0.452  0.093  Changi  Short Distance 
4  5.3  0.383  0.108  Changi  Long Distance 
5  6.2  0.588  0.1  Woodlands  Long Distance 
6  18.8  0.93  0.112  East Coast  Long Distance 
7  5.2  0.431  0.128  Changi  Long Distance 
8  5.2  0.407  0.1  Changi  Long Distance 
9  7.2  0.396  0.108  Woodlands  Long Distance 
10  5.2  0.573  0.134  East Coast  Long Distance 
11  6  0.305  0.099  Changi  Long Distance 
12  8.3  0.628  0.064  Jurong  Long Distance 
13  5.4  0.459  0.097  Woodlands  Long Distance 
14  4.5  0.522  0.089  Changi  Short Distance 
15  5.1  0.619  0.088  Jurong  Long Distance 
16  12.2  0.554  0.085  East Coast  Long Distance 
17  4.3  0.593  0.125  Woodlands  Short Distance 
18  3.3  0.334  0.065  Changi  Short Distance 
19  3.1  0.315  0.088  Jurong  Short Distance 
20  3.2  0.35  0.125  Changi  Short Distance 
21  2.2  0.23  0.111  Jurong  Short Distance 
22  7.4  0.561  0.105  East Coast  Long Distance 
23  3.3  0.31  0.108  Woodlands  Short Distance 
24  2.4  0.29  0.093  Changi  Short Distance 
25  3.4  0.353  0.092  Jurong  Short Distance 
26  3  0.164  0.104  Changi  Short Distance 
27  4.3  0.32  0.125  Woodlands  Short Distance 
28  2.8  0.252  0.104  Changi  Short Distance 
29  5  0.358  0.108  Jurong  Long Distance 
30  2.4  0.703  0.094  Changi  Short Distance 
More data:
Height Shoe Gender
Height  Gender  ShoeSize 

186  Male  47 
169  Female  38 
173  Male  42 
165  Female  36 
165  Female  37 
169  Male  41 
175  Male  43 
174  Male  44 
158  Female  35 
158  Female  36 
181  Male  43 
178  Male  43 
176  Male  42 
170  Female  37 
188  Female  35 
185  Male  45 
183  Male  43 
178  Male  41 
170  Female  39 
167  Female  38 
168  Female  35 
