Alternate titles: "Holiday Internet Usage" or... "Post-Holiday Season Internet Blues"
Why, only 10 days into my new billing cycle, would I be receiving an alert that 90% of my 125 GB monthly internet plan has been used up, with 20 days still to go? I know that my average household usage over the last eight months has been 70-80 GB. I also know that my daughter worked in the city over the summer but came home for the December holiday break (she arrived December 21).
After accessing my ISP account and downloading the usage data for the last two months, I see an opportunity for some "basic statistics" that I can then share with my daughter as we approach the subject of... Wanton Internet Streaming Abuse (aka "WISA").
Let's generate some descriptive statistics and a histogram, courtesy of SigmaXL... a pretty nifty "made in Canada" product from John Noguera and his team in Waterloo, Ontario (https://www.linkedin.com/in/johnnoguera).
Definitely not normal (but one might expect that). A median daily usage of 2.8 Gigabytes (GB) and a lot of dispersion. So, with 30 days times 2.8 GB per day, one expects total monthly consumption to be in the neighborhood of 84 GB, which is roughly in step with the historic average monthly usage of 70-80 GB. Nothing revealing so far. Let's generate a run chart to illustrate the usage over time:
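A quick check of the arithmetic, as a minimal sketch. The plan size, the median and the alert figures come from the numbers quoted above; the function names are my own and purely illustrative.

```python
# Figures quoted in the post; helper names are illustrative only.
PLAN_GB = 125.0          # monthly plan allowance
MEDIAN_DAILY_GB = 2.8    # median daily usage from the descriptive statistics
DAYS_IN_CYCLE = 30


def projected_monthly_gb(daily_gb: float, days: int = DAYS_IN_CYCLE) -> float:
    """Naive projection: one 'typical' day repeated for the whole cycle."""
    return daily_gb * days


def implied_daily_gb(fraction_used: float, days_elapsed: int,
                     plan_gb: float = PLAN_GB) -> float:
    """Average daily usage implied by a usage alert."""
    return fraction_used * plan_gb / days_elapsed


if __name__ == "__main__":
    # Median-based projection for a 30-day cycle.
    print(round(projected_monthly_gb(MEDIAN_DAILY_GB), 1))
    # Daily burn rate implied by "90% used after 10 days".
    print(round(implied_daily_gb(0.90, 10), 2))
```

The projection from the median lands in the mid-80s, while the alert implies an average of roughly 11 GB per day, about four times the median day. That gap is what the rest of the study tries to explain.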
Hmmm... Something doesn't smell right here. Usage spikes up after December 21st and stays elevated over the next fourteen days. When did our daughter come home for the holidays? Oh, yes! December 21st! (Note: For simplicity, we won't get into the p-values associated with the run chart... I don't think my daughter would appreciate that point in the discussion that will ensue as a result of this "study"). Just for fun, we'll generate a control chart on a "normally" transformed version of the data.
The control chart supports the assumption that something has "changed in the process". There is evidence of special cause that has affected the process. Observation: Daughter comes home and, interestingly, process changes. Correlation? Causation? We might as well play this out to the end. Since she arrived home on December 21st, we will add a categorical identifier ("Home - Yes", "Home - No") so we can subgroup the data for further analysis.
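For anyone following along outside SigmaXL, the categorical identifier can be added with a few lines of Python. This is only a sketch: the year and the sample records are made up for illustration (the real ISP export is not reproduced here), and the function names are hypothetical.

```python
from datetime import date

# Assumption: the year is hypothetical; only the day and month are known.
ARRIVAL = date(2023, 12, 21)


def home_label(day: date, arrival: date = ARRIVAL) -> str:
    """Label a day by whether the daughter was home (arrival day onward)."""
    return "Home - Yes" if day >= arrival else "Home - No"


def subgroup(records, arrival: date = ARRIVAL):
    """Split (date, gb) records into the two categorical groups."""
    groups = {"Home - Yes": [], "Home - No": []}
    for day, gb in records:
        groups[home_label(day, arrival)].append(gb)
    return groups


if __name__ == "__main__":
    # Made-up records, NOT the actual usage data from the ISP.
    sample = [
        (date(2023, 12, 19), 1.2),
        (date(2023, 12, 20), 0.8),
        (date(2023, 12, 21), 7.5),
        (date(2023, 12, 22), 10.1),
    ]
    print(subgroup(sample))
```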
The preceding comparative histogram and descriptive statistics help to illustrate the difference between "daughter home" versus "daughter not home". One thing that jumps out is the difference in the sample medians: 1 GB per day when daughter is not home and 9.3 GB when daughter is home. Another observation is the spread... far more variation in daily usage when daughter is home. Now, for one of my favourite tools... the box plot.
The differences really jump out in the preceding chart. Notice how both the difference in variation (dispersion) and the difference in central tendency (median) stand out in this simple yet effective chart? Well, we have the data, so why not go the distance and perform a couple of hypothesis tests? First, let's test to see if the variances of the two groups (daughter home, daughter away) differ. Our choice of tool, since the data is not normal, will be Levene's test, which SigmaXL handles very nicely.
A Bit About "Innocence" and "Guilt"
We have come to learn in our Six Sigma (and stats) classes that hypothesis tests are comprised of two statements... The "Null Hypothesis" (innocent until proven guilty beyond a reasonable doubt) and the "Alternate Hypothesis" which is the opposite of the null (the rejection of the assumption of innocence). The notation "Ho", pronounced "hoe", is used to represent the "null" hypothesis and Ha represents the alternate hypothesis. When we reject the assumption of "innocence", (the Ho), we want to be confident that it is the right decision. We want the risk of incorrectly rejecting the null hypothesis to be low, (in most cases, less than a 5% risk or 0.05).
In our example (above), the "Null" hypothesis (Ho) is that there is no difference in the variance of the two groups (daughter home, daughter not home). Our risk of being wrong in rejecting the Ho is represented by the p-value shown in the preceding analysis. With a p-value of 0.0007, there is virtually no risk in rejecting the null hypothesis (which states there is no difference...). We can confidently reject it in favour of the alternate hypothesis and we can safely conclude that...
There is a difference in the variance associated with daily GB usage of each group. Variation in usage is (much) higher when daughter is home. (As if this was a surprise).
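To take some of the mystery out of the "engine" behind Levene's test, here is a minimal sketch of the test statistic in plain Python. It assumes the median-centred variant (often called Brown-Forsythe), which is a common choice for non-normal data; I have not confirmed which variant SigmaXL uses, so treat this as illustrative rather than a reproduction of the analysis above.

```python
from statistics import mean, median


def levene_statistic(*groups, center=median):
    """Levene's W statistic for two or more groups.

    Each value is replaced by its absolute deviation from its own group's
    centre (the median by default), then a one-way ANOVA F ratio is
    computed on those deviations.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Absolute deviations of each observation from its group's centre.
    deviations = []
    for g in groups:
        c = center(g)
        deviations.append([abs(x - c) for x in g])

    group_means = [mean(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total

    # Between-group and within-group sums of squares of the deviations.
    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for z, m in zip(deviations, group_means) for v in z)

    return (n_total - k) / (k - 1) * between / within


if __name__ == "__main__":
    # Tiny made-up groups, NOT the usage data from the post.
    print(levene_statistic([1, 2, 3], [0, 2, 4]))
```

The p-value comes from comparing W against an F distribution with (k - 1, N - k) degrees of freedom. The standard library has no F distribution, so in practice a statistics package does this step (SciPy's `scipy.stats.levene` is one option).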
Okay? Onwards and upwards. Let's tackle the last test... a test of central tendency (location). In this case, since one of the groups is not normal, we defer to a "Mann-Whitney" test to see if the medians of the two groups are different. Once again, SigmaXL to the rescue... we plug in the data and, voila!, interpret the results.
What's great about hypothesis testing is that although the mathematical "engines" differ from test to test, the method for interpreting the results is similar. You remember from the preceding example that there are two statements in your hypothesis:
Ho (null): "There is no difference in the median (internet usage) for the two groups".
Ha (alternate): "There is a difference in the median (internet usage) for the two groups".
We want to be at least 95% confident that if we reject the null (Ho) in favour of the alternate (Ha) that we are correct in making that decision. Therefore, our risk (represented by the p-value), has to be less than 0.05. You decide... Do we reject the null (Ho) in favour of the alternate (Ha), or do we fail to reject the null? Hint: Look at the p-value in the preceding image.
CORRECT! We reject the null and can conclude that there IS a difference in the median daily usage of Internet (GB) when the daughter is home.
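For the curious, the Mann-Whitney "engine" can also be sketched in plain Python. This version uses the normal approximation with a tie correction and no continuity correction, which is reasonable for moderately large samples. Statistics packages may use exact methods or a continuity correction, so the p-value here could differ slightly from what SigmaXL reports. The function names and the sample numbers are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U, two-sided p-value) using the normal approximation.

    Note: raises ZeroDivisionError if every value in both groups is tied.
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = average_ranks(combined)

    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    # Tie correction for the standard deviation of U.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))

    z = (u - n1 * n2 / 2) / sigma  # z <= 0 because u is the smaller U
    p = min(1.0, 2 * NormalDist().cdf(z))
    return u, p


if __name__ == "__main__":
    # Made-up daily GB values, NOT the actual usage data.
    away = [0.5, 1.0, 1.2, 0.8, 2.0, 1.1]
    home = [9.3, 7.5, 12.0, 4.2, 10.4, 8.8]
    u, p = mann_whitney_u(away, home)
    print(u, p, "reject Ho" if p < 0.05 else "fail to reject Ho")
```

The last line is the whole interpretation rule in one expression: compare the p-value to 0.05 and decide.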
Struggling a bit with this hypothesis testing and the rules for interpretation? There is a simple sentence that captures everything we have discussed in the preceding example. Now, you have to say this with me... say it nice and loud... Ready? Okay! Repeat after me...
IF THE P IS LOW... THE HO MUST GO!
Now, if you can take a few minutes to stop laughing, let's wrap this study up. I do need to review the results with my daughter and take appropriate action to "address" the confirmed change in the process. Every study should result in conclusions and a decision or action. What have I learned from this?...
Internet usage increases when daughter is home for the holidays!
Variation in usage also increases!
Occasions when daughter is out with friends and partaking of holiday “spirit” (and lasting effects of same) may result in low/no usage for certain time intervals?
I immediately need to purchase more capacity (usage) when daughter is home (or pay a $3-4 per GB “surcharge” when the plan is exceeded).
An additional 50GB should cover us to the end of the month but I will need to monitor closely.
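To put a number on the surcharge option, here is a small sketch. The $3-4 per GB range is the figure quoted above; the price of the add-on itself is not stated, so it is left out, and the function name is my own.

```python
def overage_cost_range(overage_gb, low_rate=3.0, high_rate=4.0):
    """Surcharge range in dollars for usage beyond the plan allowance."""
    return overage_gb * low_rate, overage_gb * high_rate


if __name__ == "__main__":
    # Cost of running 50 GB over the plan at the quoted surcharge rates.
    low, high = overage_cost_range(50)
    print(f"${low:.0f} to ${high:.0f}")
```

Going 50 GB over at surcharge rates would cost somewhere between $150 and $200, which is the figure to weigh against whatever the ISP charges for the extra capacity up front.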
Here is my final observation and one that could easily be overlooked. Do I really need statistics to state the obvious? If one only considered the run chart and the box plot, and then looked at the situation from a practical perspective, a conclusion might have been drawn and a decision taken.
Descriptive and inferential statistics are extremely powerful tools, but we should reserve those methods and techniques for important, high-risk decisions when gut feel and simple (graphical) observation are not enough to achieve the level of confidence you need to make a decision or take action.
But… when the opportunity does present itself and you have the data to work with, do try and turn every experience into an opportunity for a lesson in statistics. Perhaps this is a silver lining in every cloud?