Chew Jian Chieh
Trust & Safety Operations Leader at LinkedIn, People Manager, Six Sigma Master Black Belt

Making Sense of Chi-Squared Test – Finding Differences in Proportions

Every blood donor of a large blood bank has to go through five process steps. These steps are Registration, Screening, HB Test, Donation and Refreshment. At the end of the process, which often takes around an hour, feedback forms are available for the donors. In one week, 210 donors returned these forms with their satisfaction score for each process step. This satisfaction score is measured on a six-point scale. The desired rating is either 5 (satisfied) or 6 (very satisfied). The results are summarised in Figure 1.

Figure 1: Data for Chi Squared Test

From the results obtained, there is an obvious majority of “Not 5 or 6” ratings for Step 5. The question is: is this a significant difference from the other steps’ satisfaction scores, or are we observing just random variation? How can we test this?

One way is to use the so-called Chi-Squared test (also written χ² test; another method is Fisher’s Exact test). The Chi-Squared test is a statistical tool to check whether there is a significant difference between observed frequencies (discrete data) and expected frequencies for two or more groups.

To perform the Chi-Squared test, the following steps are necessary:

1. Plot the Data

Figure 2: Column Chart for Chi-Squared Test Data

For any statistical application, it is essential to combine it with a graphical representation of the data. The selection of tools for this purpose is limited. They include pie chart, column chart and bar chart.

The column chart in Figure 2 reveals that customer satisfaction for Step 5 seems drastically lower (more “Not 5 or 6” ratings) compared to the other steps. Although there does not seem to be any doubt about a worse rating for Step 5, it is good practice to confirm the finding with a statistical test. Additionally, a statistical test can help to calculate the risk for this decision.

2. Formulate the Hypothesis for the Chi-Squared Test

In this case, the parameter of interest is a proportion, i.e. the null-hypothesis is

H0: P1 = P2 = P3 = P4 = P5, with all P being the population proportions of the five process steps in this blood donation process.

This means, the alternative hypothesis is

HA: At least one P is different to at least one other P.

3. Decide on the Acceptable Risk

Since there is no reason for changing the commonly used acceptable risk of 5%, i.e. 0.05, we use this risk as our threshold for making our decision.

4. Select the Right Tool

If there is a need for comparing two or more proportions, the popular test for this situation is the Chi-Squared test.

5. Test the Assumptions

Finally, there is only one prerequisite for the Chi-Squared test to work properly: All expected values need to be at least 5. After the test, we will know whether this assumption holds true.

Figure 3: Chi-Squared Test Results

6. Conduct the Chi-Squared Test

Using the Chi-Squared test, statistics software SigmaXL generates the output in Figure 3.

Since the p-value is 0.0000, i.e. less than 0.05 (or 5 percent), we reject H0 and accept HA. The risk of being wrong with this decision is practically zero.

From the test result in Figure 3 we can conclude that all expected counts are larger than five. This means the test result is valid.

7. Make a Decision

With this, the Chi-Squared test statistics means that there is at least one significant difference.

In order to interpret the Chi-Squared test results, three steps are necessary:

1. Check the p-value. As mentioned earlier, the p-value is 0.0000. This means there is at least one significant difference.
2. Look for the highest standardised residual (Std. Residual). The highest standardised residual is 15.703 at Step 5. This means Step 5 is significantly different in the “Not 5 or 6” category.
3. Check the nature of the deviation, i.e. compare Observed Counts and Expected Counts and conclude what the deviation means. In our case, Step 5 would be expected to produce 57.71 “Not 5 or 6” satisfaction ratings if it followed the average of all steps. Instead, Step 5 delivers 177 of these ratings. Ergo, Step 5 has received statistically worse ratings from the donors.
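The whole test can be sketched in a few lines of Python with scipy. Only Step 5’s counts (177 observed vs. roughly 57.7 expected “Not 5 or 6” ratings) are given above, so the counts for Steps 1–4 below are hypothetical stand-ins; the standardised residuals are computed as (observed − expected) / √expected:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Columns: Steps 1-5. Step 5's "Not 5 or 6" count is from the article;
# Steps 1-4 are hypothetical illustrative values.
not_5or6 = np.array([30, 26, 27, 29, 177])
is_5or6 = 210 - not_5or6          # all 210 donors rated every step
table = np.vstack([is_5or6, not_5or6])

chi2, p, dof, expected = chi2_contingency(table)
std_resid = (table - expected) / np.sqrt(expected)

print(f"chi2 = {chi2:.1f}, p = {p:.4f}")
print("residuals (Not 5 or 6 row):", np.round(std_resid[1], 3))
```

With these made-up counts the largest standardised residual (about 15.7) again falls on Step 5, mirroring the SigmaXL output.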

With this result, it is quite obvious that interventions are needed to increase customer satisfaction with Step 5.

Interested in the stats? Read here.

Making Sense of Test For Equal Variances

Three teams compete in our CHL business simulation: CHL Red, CHL Green and CHL Blue. After completing Day One, it looks like the teams show very different performance (Figure 1). Although the means look very similar, the variation is strikingly different. This is surprising, since all teams start with exactly the same prerequisites. To test this assumption of different variability among the teams, the Test for Equal Variances is deployed.

Figure 1: Data of CHL Blue, CHL Green and CHL Red in a Time Series Plot

Finding a significant difference in variances (the square of the standard deviation), i.e. in the variation of different data sets, is important. These data sets could stem from different teams performing the same job. If they show different variation, it usually means that different procedures are used to perform the same job. Looking into this may offer opportunities for improvement by learning from the best. However, this only makes sense if the difference is proven, i.e. statistically significant.

To perform the Test for Equal Variances, we take the following steps:

1. Plot the Data

Figure 2: Box Plot for CHL Blue, CHL Green and CHL Red

For any statistical application, it is essential to combine it with a graphical representation of the data. Several tools are available for this purpose. They include the popular stratified histogram, dotplot or boxplot. The Time Series Plot in Figure 1 does not clearly show the variability of the three teams.

The boxplot in Figure 2 shows much more clearly an obvious difference in the variation between the three groups. A statistical test can help to calculate the risk for this decision.

2. Formulate the Hypothesis for Test For Equal Variances

In this case, the parameter of interest is a standard deviation, i.e. the null-hypothesis is

H0: σBlue = σGreen = σRed,

with all σ being the population standard deviation of the three different teams.

This means, the alternative hypothesis is

HA: At least one σ is different to at least one other σ.

3. Decide on the Acceptable Risk

Since there is no reason for changing the commonly used acceptable risk of 5%, i.e. 0.05, we use this risk as our threshold for making our decision.

4. Select the Right Tool

Figure 3: Descriptive Statistics for CHL Blue, CHL Green and CHL Red

If there is a need for comparing variances, there are at least three popular tests available:

1. For two normally distributed data sets: F-test,
2. For more than two normally distributed data sets: Bartlett’s test and
3. For two or more non-normally distributed data sets: Levene’s test.

Since tests that are based on a certain distribution are usually sharper, we need to check whether we have normal data. Figure 3 reveals that CHL Blue does not show normality, following the p-value of the Anderson-Darling Normality test. Therefore, we need to run a Test for Equal Variances using Levene’s test.

5. Test the Assumptions

Finally, there are no other prerequisites for running Levene’s test.

6. Conduct the Test

Figure 4: Levene’s Test for Equal Variances

Running the Test For Equal Variances, using the statistics software SigmaXL generates the output in Figure 4.

Since the p-value for Levene’s Test Statistic is 0.0000, i.e. less than 0.05 (or 5 percent), we reject H0 and accept HA.
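Levene’s test is available in scipy. The raw CHL measurements are not reproduced here, so the data below are simulated with the standard deviations reported later in this section (1.09, 3.92 and 4.17 minutes) around a common mean; the numbers are illustrative, not the actual simulation data:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(1)
# Simulated cycle times (minutes): similar means, very different spreads.
chl_blue  = rng.normal(loc=20, scale=3.92, size=30)
chl_green = rng.normal(loc=20, scale=1.09, size=30)
chl_red   = rng.normal(loc=20, scale=4.17, size=30)

# scipy's default center='median' is the robust (Brown-Forsythe) variant.
stat, p = levene(chl_blue, chl_green, chl_red)
print(f"Levene statistic = {stat:.2f}, p = {p:.4f}")
```

If p falls below 0.05, H0 of equal variances is rejected, matching the article’s conclusion.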

7. Make a Decision

With this, the Levene’s test statistic means that there is at least one significant difference.

Additionally, these statistics show which CHL is different from which other CHL. The p-value for Levene Pairwise Probabilities is 0.0000 between CHL Blue and CHL Green, as well as between CHL Green and CHL Red, i.e. there is a significant difference between CHL Blue and CHL Green as well as between CHL Green and CHL Red. The boxplot shows the direction of this difference.

Finally, the statistics inform us that CHL Green seems to have a significantly better way to run the simulation with much less variation, i.e. a standard deviation of 1.09 min compared to 3.92 min and 4.17 min, respectively. After looking further into the procedure, we recognise that CHL Green organises packages in First-In-First-Out (FIFO) order, whereas CHL Blue and CHL Red do not ensure FIFO.

Interested in the stats? Read here.

Making Sense of the Two-Proportions Test

Consider a production process that produced 10,000 widgets in January and experienced a total of 112 rejected widgets after a quality control inspection (i.e., failure rate = 1.12%). A Six Sigma project was deployed to fix this problem and by March the improvement plan was in place. In April, the process produced 8,000 widgets and experienced a total of 63 rejects (failure rate = 0.79%). Did the process indeed improve?

Figure 1: Pie Charts for Two-Proportions Test

The appropriate hypothesis test for this question is the two-proportions test. As the name suggests, it is used when comparing the percentages of two groups. It only works, however, when the raw data behind the percentages (112 rejects out of 10,000 parts produced and 63 out of 8,000, respectively) is available, since the sample size is a determining factor for the test statistic.
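The arithmetic behind such a test can be sketched with the pooled two-proportion z-test below. Note this is one common variant; SigmaXL’s reported p-value may come from a slightly different computation (e.g. a continuity correction or Fisher’s exact test), so the result here is close to, but not necessarily identical with, the software output:

```python
from math import sqrt
from scipy.stats import norm

x1, n1 = 112, 10_000   # January rejects / production
x2, n2 = 63, 8_000     # April rejects / production

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                       # pooled proportion under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) # standard error of the difference
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                        # two-sided p-value

print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The p-value lands in the same region as the article’s 0.0264, i.e. below the 0.05 threshold.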

To perform this test, we take the following steps:

1. Plot the Data

For any statistical application, it is essential to combine it with a graphical representation of the data. The selection of tools for this purpose is limited. They include pie chart, column chart and bar chart.

The pie chart in Figure 1 shows that the percentage of defective widgets has gone down from January to April. However, there is not a large drop. Therefore, based on this plot, it is risky to draw a conclusion that there is a significant (i.e. statistically proven) difference between the defect rate in January and that in April. A statistical test can help to calculate the risk for this decision.

2. Formulate the Hypothesis for Two-Proportions Test

In this case, the parameter of interest is a proportion, i.e. the null-hypothesis is

H0: PJanuary = PApril,

with PJanuary and PApril being the real defect percentage for these two months.

This means, the alternative hypothesis is

HA: PJanuary ≠ PApril.

3. Decide on the Acceptable Risk

Since there is no reason for changing the commonly used acceptable risk of 5%, i.e. 0.05, we use this risk as our threshold for making our decision.

4. Select the Right Tool

If there is a need for comparing two proportions, the popular test for this situation is the two-proportions test.

5. Test the Assumptions

There are no prerequisites for the application of this test.

6. Conduct the Test

Using the two-proportions test, statistics software SigmaXL generates the output in Figure 2.

Figure 2: Results of Two-Proportions Test

Since the p-value is 0.0264, i.e. less than 0.05 (or 5 percent), we reject H0 and accept HA.

7. Make a Decision

As a result, rejecting H0 means that there is evidence for a significant difference between the January and the April batch. The risk for being wrong with this assumption is only 2.64%.

In conclusion, we can trust the change in the widget production line and expect improved quality under the new conditions.

Interested in the stats? Read here.

Making Sense of Linear Regression

Linear regression is one of the most commonly used hypothesis tests in Lean Six Sigma work. Linear regression offers the statistics for testing whether two or more sets of continuous data correlate with each other, i.e. whether one drives another.

Figure 1: Case Data for Linear Regression

Therefore, here is an example starting with the absolute basics of the linear regression. The data stem from a business simulation showcasing package delivery with several parameters being measured. The objective is to find the driver for Delivery Time Y. So, the question is, whether one or more of the potential drivers Coding Time, Packaging Time, Sender and Package Type make a significant difference on the Delivery Time (Figure 1).

To perform linear regression, we follow these steps:

1. Plot the Data

For any statistical application, it is essential to combine it with a graphical representation of the data. Several tools are available for this purpose. If Y (Delivery Time) and X are both continuous (Coding Time and Packaging Time), the scatter plot is the tool of choice. For discrete X (Sender and Package Type) and continuous Y (Delivery Time), a stratified histogram, a dotplot or a boxplot offer help.

Figure 2: Plots for Linear Regression Case

The plots in Figure 2 show all four potential drivers and their potential influence on the delivery time. Obviously, Coding Time drives Delivery Time and Sender and Package Type influence Delivery Time as well. However, there is a problem with this set of plots, the underlying thinking and our preliminary conclusions: Plotting one X against the Y disregards all other variables, i.e. their potential influence is still part of the data set but shows up as random variation. Therefore, obvious questions are:

1. Would the apparently marginal influence of Packaging Time on Delivery Time change if the variation caused by the other Xs is accounted for, i.e. eliminated from the model?
2. Is it possible that Long Distance packages take longer because the majority of them go to East Coast?

Since it is not possible to show all four Xs and their relationship with the Y in the same plot, we need to refer to statistics for help.

2. Formulate the Hypothesis for Linear Regression

In this case, the parameter of interest is the strength of the X-Y relationship, i.e. the null-hypothesis is

H0: X does not influence Y,

for each of the potential drivers X of the Delivery Time Y.

This means, the alternative hypothesis is

HA: X has a significant influence on Y.

3. Decide on the Acceptable Risk

Since there is no reason for changing the commonly used acceptable risk of 5%, i.e. 0.05, we use this risk as our threshold for making our decision.

4. Select the Right Tool

If there is a need for testing the influence of one or more continuous X on a continuous Y, the popular test for this situation is Linear Regression.

5. Test the Assumptions

First of all, there are no requirements on Xs or Y for a regression to be valid. However, the residuals, the data points including the rest variation after application of the regression model, need to follow these requirements:

• Firstly, residuals need to be normally distributed,
• Secondly, residuals need to be independent of time,
• Thirdly, residuals need to be independent of X and
• Finally residuals need to be independent of Fits, i.e. the calculated value for Y after applying the model.

6. Conduct the Test

Using the linear regression, statistics software SigmaXL generates the output in Figure 3.

Figure 3: Linear Regression Model – Step 1

The interpretation of linear regression statistics follows certain steps:

1. Check the VIF (Variance Inflation Factors). If one or more VIF show a value of 5 or higher, the model is not valid since two or more factors (Xs) correlate with each other. This inter-correlation needs to be removed before proceeding. Solving this issue usually means removing Xs one by one until all VIF are below 5. Try removing factors that you cannot easily measure or control first.
2. Check the model p-value. If this p-value is below 0.05 (or 5%), the model is valid, i.e. there is a significant X-Y relationship.
3. Check the p-value for all predictor terms. Remove non-significant Xs (predictors, factors) one-by-one and re-run the model after each removal. Start with the highest p-value.
4. Check the residuals whether they adhere to the assumptions. This check is especially suitable for finding extreme measurements, i.e. outliers, values that do not seem to fit the model. These values should be investigated regarding accuracy of data collection.
5. Formulate the final model.

Figure 4: Linear Regression Model – Step 2

Packaging Time turned out to be non-significant, so we have removed this term. Since Sender_Jurong and Sender_Woodlands are part of the predictor term Sender, they are automatically part of the model.

7. Make a Decision

As a result, rejecting H0 means that there is a significant relationship between Predictor Terms (Xs) and Delivery Time (Y). In order to influence (reduce) the Delivery Time, we have the option to work on Coding Time, Package Type and Sender. Since Package Type and Sender are customer requirements, it could make sense to change our promised delivery time for different Senders and different Package Types. The regression output delivers a regression equation that helps predicting Delivery Time for different factor settings:

Delivery Time = 2.492 + 7.291 × Coding Time + 4.088 × Sender_East Coast − 1.900 × Package Type_Short Distance
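The fitted equation can be applied directly for prediction. The inputs below are hypothetical (a package with a Coding Time of 1.0 minute, sent from East Coast, long distance), purely to show how the dummy variables enter the formula:

```python
def delivery_time(coding_time, east_coast, short_distance):
    """Predicted Delivery Time from the regression equation in the article.
    east_coast and short_distance are 0/1 dummy variables."""
    return (2.492
            + 7.291 * coding_time
            + 4.088 * east_coast
            - 1.900 * short_distance)

# Hypothetical package: Coding Time 1.0 min, East Coast sender, long distance.
print(delivery_time(1.0, east_coast=1, short_distance=0))  # about 13.87 minutes
```

Switching a dummy from 0 to 1 shifts the prediction by that term’s coefficient, which is exactly how the model quantifies Sender and Package Type effects.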

Interested in the stats? Read here.

Here is the data:

[table id=7 /]

More data:

[table id=8 /]

Making Sense of ANOVA – Find Differences in Population Means

Three methods for dissolving a powder in water show a different time (in minutes) it takes until the powder dissolves fully. The results are summarised in Figure 1.

Figure 1: Data for Dissolving a Powder (in Minutes)

There is an assumption that the population means of the three methods Method 1, Method 2 and Method 3 are not all equal (i.e., at least one method is different from the others). How can we test this?

One way is to use multiple two-sample t-tests and compare Method 1 with Method 2, Method 1 with Method 3 and Method 2 with Method 3 (comparing all the pairs). But if each test is run at a risk of 0.05, the overall probability of making a Type I error across the three tests increases to roughly 1 − 0.95³ ≈ 14%.

A better method is ANOVA (analysis of variance), which is a statistical technique for determining the existence of differences among several population means. The technique requires the analysis of different forms of variance – hence the name. But note: ANOVA is not a test to show that variances are different (that is the Test for Equal Variances); it tests whether means are different.

To perform this ANOVA, the following steps must be taken:

Figure 2: Boxplot of Data for Different Methods

1. Plot the Data

For any statistical application, it is essential to combine it with a graphical representation of the data. Several tools are available for this purpose. They include the popular stratified histogram, dotplot or boxplot.

The boxplot in Figure 2 shows that the dissolution time for Method 1 seems lowest and for Method 2 seems highest. However, there is a certain degree of overlap between the data sets. Therefore, based on this plot, it is risky to draw a conclusion that there is a significant (i.e. statistically proven) difference between any of these methods. A statistical test can help to calculate the risk for this decision.

2. Formulate the Hypothesis for ANOVA

In this case, the parameter of interest is an average, i.e. the null-hypothesis is

H0: μ1 = μ2 = μ3,

with all μ being the population means of the three methods to dissolve the powder.

This means, the alternative hypothesis is

HA: At least one μ is different to at least one other μ.

3. Decide on the Acceptable Risk

Since there is no reason for changing the commonly used acceptable risk of 5%, i.e. 0.05, we use this risk as our threshold for making our decision.

4. Select the Right Tool

If there is a need for comparing more than two means, the popular test for this situation is the ANOVA.

5. Test the Assumptions

Finally, the prerequisites for the ANOVA, the analysis of variances, to work properly are:

1. All data sets must be normal and
2. All variances must not be significantly different from each other.

Figure 3: Descriptive Statistics for Dissolving a Powder (in Minutes)

Firstly, since all samples show a p-value above 0.05 (or 5 percent) for the Anderson-Darling Normality test (Figure 3), we can conclude that all samples are normally distributed. The test for normality uses the Anderson Darling test for which the null hypothesis is “Data are normally distributed” and the alternative hypothesis is “Data are not normally distributed.”

Secondly, as an alternative to performing a test for equal variances, it is appropriate to check whether the confidence intervals for sigma (95% CI Sigma) overlap. If there is a large overlap, the assumption of no significant difference between the variances is valid.

This means, both prerequisites for ANOVA are met.

Figure 4: ANOVA Test Statistics

6. Conduct the Test

Using the ANOVA, the analysis of variances, statistics software SigmaXL generates the output in Figure 4.

Since the p-value is 0.0223, i.e. less than 0.05 (or 5 percent), we reject H0 and accept HA.
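With the raw dissolution times at hand, the one-way ANOVA is a one-liner in scipy. The times below are hypothetical (the article’s data in Figure 1 are not reproduced here), so the F statistic and p-value will differ from the 0.0223 above, but the mechanics are identical:

```python
from scipy.stats import f_oneway

# Hypothetical dissolution times (minutes) for the three methods.
method_1 = [4.2, 3.9, 4.5, 4.0, 4.3]
method_2 = [5.1, 5.4, 4.9, 5.3, 5.0]
method_3 = [4.6, 4.8, 4.4, 5.0, 4.5]

f_stat, p = f_oneway(method_1, method_2, method_3)
print(f"F = {f_stat:.2f}, p = {p:.4f}")
```

A p-value below 0.05 leads to rejecting H0, i.e. at least one method’s mean dissolution time differs.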

7. Make a Decision

With this, the ANOVA statistics mean that there is at least one significant difference.

Additionally, these statistics show which method is different from which other method. The p-value for the pairwise comparison is 0.0066 between Method 1 and Method 2, i.e. there is a significant difference between Method 1 and Method 2. The boxplot shows the direction of this difference.

Finally, the statistics inform us that this X (Method) explains only 36% of the total variation in the data. There might be other Xs explaining part of the remaining 64% of the variation.

Interested in the stats? Read here.

Making Sense of Binary Logistic Regression

In some situations, Lean Six Sigma practitioners find a Y that is discrete and Xs that are continuous. How can we apply regression in these cases? Black Belt training indicated that the correct technique is something called logistic regression or binary regression. But this tool is often not well understood.

Table 1: Data from Shuttle Investigation

An example about a well-known space shuttle accident can help to demystify logistic regression using the simplest logistic regression – binary logistic regression (binary regression), where the Y has just two potential outcomes (i.e., “yes” or “no”).

The data in Table 1 comes from the Presidential Commission on the Space Shuttle Challenger Accident (1986). It consists of the flight number, the air temperature at the time of the launch and whether or not there was damage to the booster rocket field joints. Using logistic regression, and given a particular temperature at launch time, this data can help to determine the probability of damage to the booster rocket field joints.

There are five steps to apply logistic regression.

Step 1. Plot the Data

A stratified dot plot might help to graphically display the data for binary regression. It is obvious from Figure 1 that the probability of damage is greater at lower temperatures. However, there is quite a fair bit of overlap in the distribution. Is launch temperature a real X (i.e., a real predictor of damage)? And if so, what is the probability of damage for any given launch temperature?

Figure 1: Logistic Regression – Dot Plot of Shuttle Data


Step 2. Formulate the Regression Model

Any regression requires a continuous output or Y. However, in this case the Y is discrete with only two categories or two events: Damage – yes or no. What to do? The “trick” behind the logistic regression is to turn the discrete output into a continuous output by calculating the probability (p) for the occurrence of a specific event. That means, the logistic regression provides a model to predict the p for a specific event for Y (here, the damage of booster rocket field joints, p = P[Y=1]) given any value of X (here, the temperature at the time of the launch). The logistic regression equation has the form:

ln(p/(1-p)) = β0 + β1 × X

This function is the so-called “logit” function, from which this regression takes its name. The procedure for fitting a logistic model is to determine the actual percentages for an event as a function of the X and to find the constant and coefficients that best fit these percentages.

Figure 2: Logistic Regression – Regression Model

This is exactly the equation that comes out of statistical software’s output for logistic regression:

ln(p/(1-p)) = 14.204 – 0.226 × Temperature

Step 3. Check Validity of Regression Model

There are two major checks that test the validity of this model:

1. P-value for the coefficient is less than 0.05.
A p-value is calculated for each coefficient. If the p-value is low, then there is a significant relationship between the X variable and the Y. In this case, the coefficient for temperature has a p-value of 0.042 (i.e., there is a significant relationship between the temperature and the probability of damage).
2. P-value of the “goodness of fit” tests are greater than 0.05.
Goodness-of-fit tests are conducted to see whether the model adequately fits the actual situation. Low p-values indicate a significant difference of the model from the observed data. Hence, the p-values should be above 0.05 to show that there are no significant differences between the predicted probabilities (from the model) and the observed probabilities (from the raw data). In this case, from the goodness-of-fit tests, none of them show a significant difference – the regression model is valid.

Step 4. Reverse the Logit Equation

Reversing the logit equation helps to obtain an answer to the question: given a particular setting of X, what is the probability of failure? Reversing it gives:

p = e^(14.204 – 0.226 × Temperature) / (1 + e^(14.204 – 0.226 × Temperature))

with e = 2.71828.

On the day of the Challenger incident, the temperature was 31 degrees Fahrenheit. Hence, the probability of damage to the booster rocket field joints on that day was:

p = e^(14.204 – 0.226 × 31) / (1 + e^(14.204 – 0.226 × 31)) = 0.999

Hence, damage was almost a certainty.
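This arithmetic is easy to check in a couple of lines, using the coefficients from the fitted model above:

```python
from math import exp

def damage_probability(temp_f):
    """Inverse logit of the fitted model ln(p/(1-p)) = 14.204 - 0.226*T."""
    logit = 14.204 - 0.226 * temp_f
    return exp(logit) / (1 + exp(logit))

print(round(damage_probability(31), 3))   # 0.999
```

The same function reproduces the later observation that at roughly 55 °F the predicted damage probability is still above 85%.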

Step 5. Visualize the Results (Optional)

Applying appropriate statistical software yields the event probability for all possible temperature settings. Using these event probabilities, we produce the scatter plot (decreasing logistic plot) in Figure 3.

Figure 3: Scatter Plot of Damage Versus Temperature

Using this scatter plot, it is quite easy to do some prediction. For example, if the outside temperature is about 55 degrees F, the probability of damage – or rather, of a leak in the booster rocket – is more than 85%.

Making Sense of the Two-Sample T-Test

The two-sample t-test is one of the most commonly used hypothesis tests in Lean Six Sigma work. The two-sample t-test offers the statistics for comparing the averages of two groups and identifying whether the groups are really significantly different or whether the difference is instead due to random chance.

Figure 1: Two-Sample t-Test Data

Most importantly, it helps to answer questions like whether the average success rate is higher after implementing a new sales tool than before or whether the test results of patients who received a drug are better than test results of those who received a placebo.

Here is an example starting with the absolute basics of the two-sample t-test. The question is, whether there is a significant (or only random) difference in the average cycle time to deliver a pizza from Pizza Company A vs. Pizza Company B. Figure 1 shows the data collected from a sample of deliveries of Company A and Company B.

To perform this test, the following steps must be taken:

Figure 2: Two-Sample t-Test – Boxplot

1. Plot the Data

For any statistical application, it is essential to combine it with a graphical representation of the data. Several tools are available for this purpose. They include the popular stratified histogram, dotplot or boxplot.

The boxplot in Figure 2 shows that the delivery time for Pizza Company B seems to be lower than for A. However, there is a certain degree of overlap between the two data sets. Therefore, based on this plot, it is risky to draw a conclusion that there is a significant (i.e. statistically proven) difference between the average delivery time of the two companies. A statistical test can help to calculate the risk for this decision.

2. Formulate the Hypothesis for Two-Sample t-Test

In this case, the parameter of interest is an average, i.e. the null-hypothesis is

H0: μA = μB,

with μA and μB being the population means of both companies.

This means, the alternative hypothesis is

HA: μA ≠ μB.

3. Decide on the Acceptable Risk

Since there is no reason for changing the commonly used acceptable risk of 5%, i.e. 0.05, we use this risk as our threshold for making our decision.

Figure 3: Descriptive Statistics for both Samples

4. Select the Right Tool

If there is a need for comparing two means, the popular test for this situation is the two-sample t-test or Student’s t-test.

5. Test the Assumptions

Finally, the only prerequisite for the application of the two-sample t-test is that the data need to be normally distributed. Therefore, we have drawn the descriptive statistics for both samples (Company A and Company B).

Since both samples have a p-value above 0.05 (or 5 percent) for the Anderson-Darling Normality test, we can conclude that both samples are normally distributed. The test for normality uses the Anderson Darling test for which the null hypothesis is “Data are normally distributed” and the alternative hypothesis is “Data are not normally distributed.”

Figure 4: Two-sample t-test statistics

6. Conduct the Test

Using the two-sample t-test, statistics software SigmaXL generates the output in Figure 4.

Since the p-value is 0.289, i.e. greater than 0.05 (or 5 percent), we cannot reject H0.
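The same computation is a one-liner in scipy. The delivery times below are made up for illustration (the article’s p-value of 0.289 comes from its own data in Figure 1), so the exact numbers differ but the conclusion is of the same kind:

```python
from scipy.stats import ttest_ind

# Hypothetical delivery times (minutes) for the two pizza companies.
company_a = [32, 35, 30, 34, 33, 36, 31, 35]
company_b = [31, 33, 29, 32, 34, 30, 33, 31]

# equal_var=False runs Welch's t-test, which does not assume equal variances.
t_stat, p = ttest_ind(company_a, company_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p:.3f}")
```

Since p exceeds 0.05 for these numbers as well, H0 would not be rejected.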

7. Make a Decision

As a result, not rejecting H0 means that there is not enough evidence for assuming a difference. Hence, we cannot conclude that the mean delivery times of the two companies differ. Claiming a difference would carry a 28.9% risk of being wrong.

Interested in the stats? Read here.

Making Sense of Attribute Gage R&R Calculations

Measurement error is unavoidable. There will always be some measurement variation that is due to the measurement system itself.

Most problematic measurement system issues come from measuring attribute data in terms that rely on human judgment such as good/bad, pass/fail, etc. This is because it is very difficult for all testers to apply the same operational definition of what is “good” and what is “bad.”

However, such measurement systems exist throughout industries. One example is quality control inspectors using a high-powered microscope to determine whether a pair of contact lenses is defect free. Hence, it is important to quantify how well such measurement systems are working.

Understanding Attribute Gage R&R

The tool used for this kind of analysis is called attribute gage R&R. The gage R&R stands for repeatability and reproducibility. Repeatability means that the same operator, measuring the same thing, using the same gage, should get the same reading every time. Reproducibility means that different operators, measuring the same thing, using the same gage, should get the same reading every time.

Most importantly, attribute gage R&R reveals two important findings – percentage of repeatability and percentage of reproducibility. Ideally, both percentages should be 100 percent. But generally, the rule of thumb is anything above 90 percent is quite adequate. However, this depends on the application.

Obtaining these percentages needs only simple mathematics. And there is really no need for sophisticated software. Nevertheless, some statistics software such as Minitab or SigmaXL have a module called Attribute Agreement Analysis that does the same and much more, and this makes analysts’ lives easier.

However, it is important for analysts to understand what the statistical software is doing to make good sense of the report. In this article, the steps are reproduced using spreadsheet software with a case study as an example.

Steps to Calculate Gage R&R

Setting Up the Attribute Gage R&R

Step 1: Select between 20 and 30 test samples that represent the full range of variation encountered in actual production runs. Practically speaking, if only “clearly good” and “clearly bad” parts are used, the ability of the measurement system to accurately categorise the ones in between will not show. For maximum confidence, a 50-50 mix of good and bad parts is recommended. A 30:70 ratio is acceptable.

Step 2: Have a master appraiser categorize each test sample into its true attribute category.

Step 3: Select two to three inspectors who do the job. Have them categorize each test sample without knowing what the master appraiser has rated them.

Step 4: Place the test samples in a new random order and have the inspectors repeat their assessments.

Step 5: For each inspector, count the number of times his or her two readings agree. Divide this number by the total inspected to obtain the percentage of agreement. This is the individual repeatability of that inspector.

Figure 1: Attribute Gage R&R Data

Furthermore, to obtain the overall repeatability, obtain the average of all individual repeatability percentages for all inspectors. In this case study, the overall repeatability is 93.33%, which means if the measurements are repeated on the same set of items, there is a 93.33% chance of getting the same results, which is not bad but not perfect.
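The repeatability arithmetic in Step 5 and the averaging above can be sketched in Python. The ratings below are made up to mirror the case-study percentages (100%, 95%, 85%) – they are not the actual Figure 1 data:

```python
def repeatability(trial1, trial2):
    """Fraction of samples where an inspector's two readings agree."""
    agree = sum(a == b for a, b in zip(trial1, trial2))
    return agree / len(trial1)

# Hypothetical data: each inspector rates the same 20 samples twice
# ("G" = good, "B" = bad).
inspectors = {
    "Inspector 1": (["G"] * 12 + ["B"] * 8,
                    ["G"] * 12 + ["B"] * 8),               # 20/20 readings agree
    "Inspector 2": (["G"] * 12 + ["B"] * 8,
                    ["G"] * 11 + ["B"] * 9),               # 19/20 readings agree
    "Inspector 3": (["G"] * 12 + ["B"] * 8,
                    ["G"] * 12 + ["B"] * 5 + ["G"] * 3),   # 17/20 readings agree
}

for name, (t1, t2) in inspectors.items():
    print(f"{name}: {repeatability(t1, t2):.0%}")

# Overall repeatability is the average of the individual percentages.
overall = sum(repeatability(t1, t2) for t1, t2 in inspectors.values()) / len(inspectors)
print(f"Overall repeatability: {overall:.2%}")  # → 93.33%
```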

Understanding the Results of Attribute Gage R&R

In this case, the individual repeatability of Inspector 1 is 100%, of Inspector 2 is 95% and of Inspector 3 is 85%. This means for example that Inspector 3 is only consistent with himself 85% of the time. He is probably inexperienced and needs retraining.

Step 6: Count the number of times each inspector’s two assessments agree both with each other and with the standard produced by the master appraiser in Step 2, then divide by the total inspected to obtain a percentage.

This percentage is the individual effectiveness or reproducibility. In this case, Inspector 1 is in agreement with the standard only 75% of the time.  Inspector 3 is in agreement with the standard only 65% of the time. Inspector 1 is clearly experienced in doing this kind of inspection. But he does not know the standard very well. Inspector 2 does. Inspector 3 needs explanation of the standard, too, after receiving general training in attribute inspection.

Step 7: Compute the percentage of times all the inspectors’ assessments agree for the first and second measurement for each sample item.

Step 8: Compute the percentage of the time all the inspectors’ assessments agree with each other and with the standard.

Finally, this percentage gives the overall effectiveness of the measurement system. The result is 65%. Hence, this is the percent of time all inspectors agree and their agreement matches with the standard.
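The effectiveness calculations of Steps 6 to 8 can be sketched the same way. The tiny dataset below is made up for illustration – five samples, three inspectors, two trials each – and does not reproduce the article’s exact percentages:

```python
# Made-up data: the master appraiser's standard and each inspector's
# two trials for the same five samples ("G" = good, "B" = bad).
standard = ["G", "G", "B", "B", "G"]
insp1 = (["G", "G", "B", "G", "G"], ["G", "G", "B", "G", "G"])  # consistent, but wrong on sample 4
insp2 = (["G", "G", "B", "B", "G"], ["G", "G", "B", "B", "G"])  # matches the standard perfectly
insp3 = (["G", "B", "B", "B", "G"], ["G", "G", "B", "B", "B"])  # inconsistent and off-standard

def effectiveness(standard, trial1, trial2):
    """Individual effectiveness: fraction of samples where both of an
    inspector's readings match the master appraiser's standard."""
    ok = sum(s == a == b for s, a, b in zip(standard, trial1, trial2))
    return ok / len(standard)

def overall_effectiveness(standard, all_trials):
    """Overall effectiveness: fraction of samples where every reading
    of every inspector matches the standard."""
    ok = sum(
        all(t[i] == standard[i] for trials in all_trials for t in trials)
        for i in range(len(standard))
    )
    return ok / len(standard)

for name, trials in [("Inspector 1", insp1), ("Inspector 2", insp2), ("Inspector 3", insp3)]:
    print(f"{name}: {effectiveness(standard, *trials):.0%}")
print(f"Overall effectiveness: {overall_effectiveness(standard, [insp1, insp2, insp3]):.0%}")
```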

In conclusion, Minitab and SigmaXL produce a lot more statistics in the output of the attribute agreement analysis, but for most cases and use, the analysis outlined in this article should suffice.

So What If the Gage R&R Is Not Good?

The key in all measurement systems is having a clear test method and clear criteria for what to accept and what to reject. The steps are as follows:

1. Identify what is to be measured.
2. Select the measurement instrument.
3. Develop the test method and criteria for pass or fail.
4. Test the test method and criteria (the operational definition) with some test samples (perform a gage R&R study).
5. Confirm that the gage R&R in the study is close to 100 percent.
6. Document the test method and criteria.
7. Train all inspectors on the test method and criteria.
8. Pilot run the new test method and criteria and perform periodic gage R&Rs to check if the measurement system is good.
9. Launch the new test method and criteria.

Interested in more details? Read here.

Eight Workable Strategies for Creating Lean Government

Lean Government. Even to the seasoned Lean practitioner, the idea of a Lean government sounds far-fetched. Governments are traditionally seen as the epitome of bureaucracy, and the guardians of red tape, incomprehensible forms and endless queues. But there are workable Lean strategies for governments seeking to reduce waste and become more efficient. Eight are outlined here.

Perhaps considering the eight ideas can spur government change agents to study Lean literature for potential improvement applications and in the longer run, start a Lean revolution in governments.

The idealised goal of Lean is “one-piece flow,” also known as continuous flow. One-piece flow is achieved when all waste is eliminated from the value stream and all that remains is value-added work from the perspective of customers. In manufacturing, one-piece flow is an ideal and will always remain one: fluctuations in customer demand, combined with customer requirements for ever shorter delivery times, force the manufacturer to create partially completed or completed inventories. This strategy itself creates waste, because the inventory must be stored and that storage must be managed.

The interesting thing about Lean Government is that one-piece flow is almost achievable here, because there is really no requirement for in-process inventories. Within government processes, there is no such thing as a partially finished job that is not the result of a customer order.

What would one-piece flow feel like in a typical government value stream? Consider one with only four value-added processing steps from the customer’s perspective.

Adding up all the value-added processing time, it should take no more than three hours to obtain a reply. Lean government is a possibility.

Most governments and their value streams are not lean. Recall personal experiences trying to obtain a government grant, applying for an international passport, getting a driver’s license or applying for a business permit. The typical experience is not that it took just three hours; more likely, it took more than a week. Nonetheless, it is possible to make government value streams lean. Here are eight ways:

No. 1 – Synchronisation to Customer Demands

Most government value streams are not designed and synchronised to customer demands. In Lean manufacturing, the concept of Takt time, or beat time, is well understood but within most governments, this concept is unheard of. Takt time is a concept that is used to design work and it measures the pace of customer demand. It is the “available time for production” divided by the “customer demand.” The resulting number tells how fast each process step must operate to obtain one-piece flow.

Here is a government example: Suppose 30 citizens apply for a particular government permit in one working day and each working day consists of seven working hours. The Takt time of this permit application process is 420 (7 x 60) minutes divided by 30 applications, which is equal to 14 minutes. This means that for these 30 applications to be processed, every 14 minutes, one permit must be processed to satisfy the customer demand.
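The Takt calculation in this example is a one-liner; here it is as a minimal Python sketch:

```python
# Takt time = available working time divided by customer demand.
available_minutes = 7 * 60   # seven working hours in a day
daily_demand = 30            # permit applications per day

takt_minutes = available_minutes / daily_demand
print(takt_minutes)  # → 14.0 (one permit must be completed every 14 minutes)
```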

The first permit will take the sum of all processing times to complete. Suppose there are 10 processing steps,  synchronised to Takt at 14 minutes each; then the first permit will take 140 minutes to be completed. If one-piece flow is achieved, the next permit in the queue will leave the line exactly 14 minutes after the first permit and so on. To complete all 30 permits, it will actually take 140 minutes plus 406 (29 x 14) minutes, assuming one-piece flow operations.

To achieve this, the cycle time of each processing step must be 14 minutes or less. Any step that takes longer becomes a bottleneck, and work gets stuck at that point. If one process step takes two hours (120 minutes), it needs about 9 (120 / 14) staff to cope with the workload – assuming these staff members work 100% on this process alone, which is very idealistic. In reality, it is more likely to take 12 to 15 staff, since staff members are usually only partially available for this process.
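The flow-time and staffing arithmetic above can be checked with a few lines of Python, using the figures from the example (14-minute Takt, 10 steps, one 120-minute step):

```python
import math

takt = 14        # minutes per permit, from the Takt calculation above
steps = 10       # processing steps, each synchronised to Takt

first_permit = steps * takt               # 140 minutes for the first permit
all_permits = first_permit + 29 * takt    # 546 minutes for all 30 permits

# A 120-minute step can only keep pace with Takt if staffed in parallel:
staff_needed = math.ceil(120 / takt)      # assuming 100% dedication to this step

print(first_permit, all_permits, staff_needed)  # → 140 546 9
```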

Government value streams are rarely designed around Takt time because the concept does not exist within most governments. One of the prerequisites for Lean Government itself does not exist. Most public sector administrators reject the idea that such a concept translates into their environment. As a result, workforce allocations in government value streams are rarely rationalised around Takt time, resulting in over capacity in some parts of the stream and under capacity in other parts.

The main waste that this produces is work-in-process inventory (WIP) and the most visible manifestation of this is the ever-full in-tray. WIP kills one-piece flow because it disables a processing step from producing to Takt. But WIP build up is inevitable in any government value stream that is not synchronised to Takt. This is the main reason why a three-hour job needs more than a week for processing.

No. 2 – Understand Variations in Customer Demand

Synchronisation to Takt generally requires two things – reducing the processing time of the step and establishing the correct staffing level. Suppose it takes 120 minutes to complete the application process at the government counter. To achieve one-piece flow at a Takt time of 14 minutes (as in the previous example) would require the manning of at least nine counters (120 divided by 14 minutes). This assumes that one customer arrives into the stream every 14 minutes. Reducing the processing time to 90 minutes would allow the manning level to be reduced to seven counters (90 divided by 14 minutes).

Of course, customers do not normally arrive at specific intervals. Most value streams experience significant variation in customer demand throughout the course of any typical workday. When the counter process is not synchronised to fluctuating customer demand, the familiar queue builds up. The typical government response to this problem is to build waiting areas and queue ticketing systems. This wastes not only expensive floor space (which taxpayers pay for) but, more importantly, the time of the citizens (customers). In some government value streams, this queuing can take hours.

This happens because fluctuations in customer demands typically are not monitored and also because government processes generally ask for more information than necessary at this first step, hence lengthening the processing time unnecessarily. If fluctuations in customer demands were monitored, the manning levels can be adjusted to match the requirements. This requires a workforce that is not only multi-skilled but also flexible – which brings up the next problem.

No. 3 – Create Work Cells

Most government value streams are organised around separate departments and functions. For example, to obtain a government approval for a permit, an application form probably has to flow through no less than three separate departments and/or functions prior to approval. The main reason is that people performing a particular type of function are normally grouped together in the same place.

Because of this, in most government setups, there is a type of internal post office system (registry process) that handles this movement of work from one part of the organisation to another. From a Lean perspective, this creates waste of transportation and waiting. In some government processes studied, this registry process makes at most three to four collections and deliveries a workday. Collected WIP is sorted according to destination and delivered at the next allocated time slot. This causes two problems – a waste of time managing the movement of WIP between processes and, more severely, the creation of a natural batch of work that kills the one-piece flow capability of the receiving processing step.

The solution to this kind of problem is deceptively simple. Why not create a work cell where all the necessary value-adding processing steps and personnel are located together? This cuts out the need for the registry process, which should take out 50 percent of the total processing time and allow for smoother work flow because batching is no longer required. Implementing this kind of solution has proven to be remarkably difficult, largely because of the mindset that says jobs of the same function should be clustered together.

A key feature of Lean work cells is the training of multi-skilled and flexible workers. In a Lean work cell, the goal is to have all workers trained to a level where everyone can perform the job at every workstation. Since everyone can do every job, processes are never left half finished because the right person to do a particular job is not around.

No. 4 – Eliminate Batching Work and Multi-Tasking

Because work within most governments is organised around functions and not around processes, most government officials are required to multi-task. Most government officials, at all levels, participate in more than one value stream. They also have a whole host of other types of work that takes them away from the main value-creating work streams (normally meetings and more meetings). To compensate for this, most government personnel batch their work – often waiting for a minimum number of work items to build up before working on them.

This strategy increases their personal efficiency. Obviously it is more efficient to process a batch of similar type of work within a compressed time slot than to process them as they arrive. This is because batching eliminates the need for several set-ups. (It is a common perception that administrative work requires no set-up time. Anyone who has done administrative work knows that this is not true. Every time a particular type of work is to be performed, the processing officer needs at least the time to adjust their mind to that new type of work.)

However, the whole batching problem and the time it takes can easily be eliminated if work is organised around work cells. But, as noted, work cells are not easy to create in governments. A Lean Government would most likely have those.

No. 5 – Enforce First in, First out

In manufacturing, “first in, first out” (FIFO) is the normal rule applied to the processing order of work. If a company does not adhere to FIFO, much variation is added into the total distribution of processing time. For instance, in a last-in-first-out system, the jobs that come in last are processed quickly while the jobs that come in first take much longer to process.

Normally, in manufacturing value streams, there are FIFO lanes that prevent the FIFO rule from being violated. In government processes, jobs are often delivered into an in-tray. The in-tray creates a natural last-in-first-out effect leading to large overall processing time fluctuations. Large overall processing time fluctuations make the overall process less capable of meeting customer requirements as a whole.
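The effect of an in-tray’s last-in-first-out behaviour on flow-time variation can be illustrated with a small simulation. The figures below (ten jobs arriving one per time unit, each taking two units to process) are made up for illustration:

```python
import statistics

def simulate(discipline, n_jobs=10, interarrival=1, service=2):
    """Single server; jobs arrive every `interarrival` time units and each
    takes `service` units. 'fifo' pops the oldest waiting job, 'lifo' the
    newest (the in-tray effect). Returns per-job flow times."""
    arrivals = [i * interarrival for i in range(n_jobs)]
    waiting, flow_times = [], []
    t, i = 0, 0
    while len(flow_times) < n_jobs:
        # Admit every job that has arrived by now.
        while i < n_jobs and arrivals[i] <= t:
            waiting.append(arrivals[i])
            i += 1
        if waiting:
            arr = waiting.pop(0) if discipline == "fifo" else waiting.pop()
            t += service
            flow_times.append(t - arr)   # completion minus arrival
        else:
            t = arrivals[i]              # idle until the next arrival

    return flow_times

fifo = simulate("fifo")
lifo = simulate("lifo")
print("FIFO spread:", round(statistics.pstdev(fifo), 2))  # ≈ 2.87
print("LIFO spread:", round(statistics.pstdev(lifo), 2))  # ≈ 6.02
```

Both disciplines finish the same jobs in the same total time, but under LIFO the spread of flow times roughly doubles: late arrivals fly through while early arrivals languish at the bottom of the in-tray.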

The solution is once again the creation of Lean work cells, where work is pulled from one processing step to the next rather than pushed. If work is always pulled (that is, work is only ordered from the previous processing step when the operator is free), the FIFO rule will always be adhered to. Once again, the move from a push culture towards a pull culture is difficult for most governments. The normal government manager’s mindset is to load people with more work than they can do so as to ensure that they are always occupied.

No. 6 – Implement Standardised Work and Load Levelling

Related to the first-in-first-out issue is the lack of understanding and application of standardised work within government value streams. Even in highly repeatable work, it is fairly common to find different government workers performing similar tasks using slightly different methods and hence taking slightly different amounts of time. Because work is not standardised, there is no basis for evaluation and improvement. Often, the “best” workers are loaded with more work because they work faster and more efficiently than other workers.

Overall and over time, this encourages government workers to slow their pace. They learn that additional work will be pushed to them once they complete a certain amount of their current workload. Hence, production is paced according to what the supervisor deems reasonable, not according to customer demands.

No. 7 – Do Today’s Work Today

Most government officials do not believe that work that arrives today can be finished today. They are correct to believe so, because the way the work streams are currently set up does not allow work that arrives today to be completed the same day. Over time, this cultivates a mindset that says, “We can always do it tomorrow.”

What many governments may not realise is that customer demands remain largely constant from day to day. That is, the number of people applying for a particular permit each day tends to average out. If about 300 apply on Monday, a similar number are likely to apply on Tuesday. If the government agency only managed to process 100 of the 300 applications on Monday, there will be about 500 applications waiting to be processed on Tuesday (200 carried over from the previous day). This is exactly where Lean Government offers opportunities.

The accumulation of WIP lengthens the expected flow time of a job. When the WIP is only 300, an applicant can reasonably expect the permit to be processed within three days. By the end of the month, with WIP at 5,900, an applicant can only expect the permit to be processed after 59 days. And the problem continues to grow.
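The backlog arithmetic behind these numbers follows Little’s law (expected flow time = WIP divided by throughput). The sketch below reproduces the article’s figures, assuming WIP is counted at the start of each day, before any processing:

```python
daily_arrivals = 300   # new applications each working day
daily_capacity = 100   # applications actually processed each day

def wip_at_start_of_day(day):
    """Backlog waiting when a given working day opens: the day's 300
    arrivals plus a net 200 carried over for each day since day 1."""
    return daily_arrivals + (daily_arrivals - daily_capacity) * (day - 1)

# Little's law: expected flow time = WIP / throughput.
for day in (1, 29):
    wip = wip_at_start_of_day(day)
    print(f"Day {day}: WIP = {wip}, expected flow time = {wip / daily_capacity:.0f} days")
```

On day 1 the 300 waiting applications imply a 3-day wait; by day 29 the 5,900-item backlog implies 59 days, matching the figures in the text.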

The only way to stop this is to design value streams that can complete what comes in by the same day.

No. 8 – Make the Value Stream Visible

Last but certainly not least, the easiest way toward Lean Governments is to teach government officials value stream mapping. Unlike manufacturing, there is no visible line in government. In fact, most people working in government do not even know they are part of a larger value stream. They think largely in terms of their job and their function.

Making the value stream visible through value stream mapping exposes non-valued steps, time wasted by transportation and WIP, excessive process variation caused by non-standard work processes and production rules, waste caused by rework, waste caused by excessive checking and more.

When a value stream map is created for their operations, many government officials are surprised by how much time and money is wasted. They are also surprised by how easy it is, once the value stream is visualised, to produce Lean Government value streams.

Using the Power for Good – Hypothesis Testing

Rejecting a null hypothesis when it is false is what every good hypothesis test should do. The “power of the test” measures how good a test is: it is the probability that the test will reject H0 when H0 is in fact false.
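To make the definition concrete, here is a minimal sketch of a power calculation for a one-sided z-test using Python’s standard library. The specific numbers (a null mean of 100, true mean of 105, sigma of 15, sample size 36) are illustrative assumptions, not from the article:

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 vs H1: mu > mu0,
    i.e. the probability of rejecting H0 when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # Rejection threshold for the sample mean at significance level alpha.
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability the sample mean exceeds the threshold under the true mean.
    return 1 - NormalDist(mu_true, se).cdf(critical)

# Illustrative numbers: testing H0: mu = 100 when the true mean is 105.
print(f"{z_test_power(100, 105, 15, 36):.2f}")  # → 0.64
```

Larger samples, bigger true differences, or smaller sigma all raise the power; when the true mean equals the null mean, the “power” collapses to alpha, the false-rejection rate.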