When gathering data (sampling) from a population in order to describe it and make inferences about it, randomization is essential. Why? "Random sampling eliminates bias by giving all individuals an equal chance to be chosen"(1), and the “mathematical theorems that are the foundation for frequently used statistical procedures are based on this assumption”(2).
When facilitating Six Sigma training, I often “quiz” students, once we have journeyed for a while down the hypothesis-testing and analytical path, by giving them a “case study” on a popular Canadian voluntary “tax”… Lotto-649.
Note: When I tell my 89-year-old mother how extremely remote the odds of winning Lotto-649 are, she always replies, “Well, someone has to win!” Yet, when this grande dame sees the projected interest that might be charged on her credit card bill if she fails to pay on time (she always pays the balance off in full each month), she refers to the interest as a “fine”… In that respect, she’s a very smart lady but… she faithfully continues to buy her lottery tickets each week and hasn’t won… yet!
A Lotto-649 ticket consists of six numbers, each ranging from 01 to 49 (hence the name “649”).
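To put a number on how remote those odds are, here is a minimal sketch in Python. It assumes the six numbers on a ticket are distinct, that order does not matter, and that only an exact match of all six wins the top prize; bonus numbers and secondary prizes are ignored.

```python
import math

# Assumption: a ticket is six distinct numbers from 1-49, order irrelevant.
NUMBERS_IN_POOL = 49
NUMBERS_PER_TICKET = 6

# Number of distinct tickets = "49 choose 6".
possible_tickets = math.comb(NUMBERS_IN_POOL, NUMBERS_PER_TICKET)

# Chance that one ticket matches all six drawn numbers.
jackpot_probability = 1 / possible_tickets

print(f"Possible tickets: {possible_tickets:,}")
print(f"Chance per ticket: {jackpot_probability:.10f}")
```

Under those assumptions a single ticket has one chance in roughly 14 million of matching all six numbers.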
A couple of years ago, I downloaded data from the Ontario Lottery and Gaming Corporation on how often each number (1 to 49) had been drawn since the launch of that weekly lottery. The results are depicted in the following chart.
There appear to be differences in how often each value (from 1 to 49) has come up in the random draw.
To further illustrate the difference, I ask the students to group each number into seven categories and calculate the total frequency for each category: Numbers 1-7, 8-14, 15-21, 22-28, 29-35, 36-42 and 43-49. The following chart illustrates the result of this sub-grouping of the numbers.
Based on the preceding chart, it would appear that numbers in the ranges 29-35 and 43-49 are drawn most frequently in Lotto-649, and poor little 8-14 the least often!
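The sub-grouping step is easy to reproduce. The sketch below is illustrative only: the function name `group_totals` is my own, and the frequencies are made-up placeholder values, not the Ontario Lottery and Gaming Corporation data.

```python
import random

def group_totals(freq, group_size=7):
    """Sum per-number draw counts over consecutive ranges (1-7, 8-14, ...)."""
    totals = {}
    for number in sorted(freq):
        start = ((number - 1) // group_size) * group_size + 1
        label = f"{start}-{start + group_size - 1}"
        totals[label] = totals.get(label, 0) + freq[number]
    return totals

# Placeholder data: hypothetical draw counts for numbers 1-49,
# generated here only so the example runs. NOT the real lottery history.
rng = random.Random(649)
freq = {n: rng.randint(380, 460) for n in range(1, 50)}

totals = group_totals(freq)
for label, total in totals.items():
    print(label, total)
```

Replacing `freq` with the downloaded counts would give the seven totals that feed the chart and the test that follows.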
At this point, we can employ the trusted “Chi Square” analysis to determine if what we are seeing is simply chance, or if the differences observed are statistically significant (i.e., have a low probability of occurring by chance). The results are as follows:
With a resulting p-value of 0.025, we can claim, with 97.5% confidence, that there is a difference in the overall frequency represented in these sub-groups! And all the time we thought the lottery was “fair” :-(
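For readers who want to see what the test is doing, here is a minimal sketch of a chi-square goodness-of-fit calculation in plain Python. It assumes that a fair lottery implies equal expected totals in each of the seven sub-groups (each holds seven numbers). The p-value uses a closed form that is exact only for an even number of degrees of freedom, which holds here (7 groups, 6 degrees of freedom). The totals are hypothetical, so the output will not reproduce the 0.025 reported above.

```python
import math

def chi2_sf_even_df(stat, df):
    """Upper-tail chi-square probability; exact closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive, even df")
    half = stat / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms

def chi_square_gof(observed):
    """Goodness-of-fit test against equal expected counts in every group."""
    k = len(observed)
    expected = sum(observed) / k
    stat = sum((o - expected) ** 2 / expected for o in observed)
    df = k - 1
    return stat, df, chi2_sf_even_df(stat, df)

# Hypothetical sub-group totals, for illustration only (not the real data).
hypothetical_totals = [2950, 2880, 2930, 2910, 2990, 2900, 2985]

stat, df, p_value = chi_square_gof(hypothetical_totals)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.3f}")
```

A statistics package would normally do this for you; `scipy.stats.chisquare`, for instance, performs the same equal-expected-counts test by default.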
At that point, I ask the students what “strategy” they might now employ to increase their odds of winning future Lotto-649 games. Some students will suggest that selecting numbers within the ranges of 29-35 and 43-49 may be their path to quick riches.
After some discussion, the question, “What might be wrong with our analysis?” is then posed to the students. Surprisingly, a suggestion that there has been a breach in the assumption of randomness is not immediately forthcoming, even though the importance of that principle has been a recurring theme for the students when previously exploring different statistical tests.
Perhaps more time has been spent on “driving” the analytical “cars” and not enough on the key theories (and risks) that go into the design of those vehicles? Is this a case of simply "losing sight of the forest for the trees"?
The students can then be asked to repeat their “sub-grouping” exercise but, this time, sub-group numbers randomly. The result might look something like the truncated table provided below. Note: In this case, all of the numbers 1-49 are randomly assigned to one of seven sub-groups, simply labelled A-G.
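The random assignment described in the note can be sketched as follows. This version assumes the seven sub-groups are of equal size (seven numbers each), which the exercise does not strictly require; the function name `random_groups` and the seed are my own choices for illustration.

```python
import random

def random_groups(numbers, labels="ABCDEFG", seed=None):
    """Shuffle the numbers and deal them into equally sized labelled groups."""
    pool = list(numbers)
    if len(pool) % len(labels) != 0:
        raise ValueError("numbers cannot be split evenly across the labels")
    rng = random.Random(seed)
    rng.shuffle(pool)
    size = len(pool) // len(labels)
    return {
        label: sorted(pool[i * size:(i + 1) * size])
        for i, label in enumerate(labels)
    }

# Assign 1-49 at random to sub-groups A-G (seed fixed only for repeatability).
groups = random_groups(range(1, 50), seed=2024)
for label, members in groups.items():
    print(label, members)
```

Summing the draw frequencies of each group's members then gives the seven totals for the repeated test.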
These new results can then be illustrated in a bar chart (below).
A repeat of the Chi Square analysis on these new groupings reveals the following:
The resulting p-value of 0.543 is very different when contrasted with the previous “test”. At this point, we would conclude that the differences we are seeing in the