Statistical Power in Hypothesis Testing
An Interactive Guide to the What/Why/How of Power
What is Statistical Power?
Statistical power is a concept in hypothesis testing: it is the probability of detecting an effect when the effect actually exists. In my previous post, we walked through the procedure for conducting a hypothesis test. In this post, we build on that by introducing statistical power.
Power & Type 1 Error & Type 2 Error
When talking about power, Type 1 and Type 2 errors inevitably come up as well. All three are well-known hypothesis testing concepts that describe how the predicted result compares with the actual result.
Let’s continue to use the t-test example in my previous post “An Interactive Guide to Hypothesis Testing” to illustrate these concepts.
Recap: we used a one-tailed, two-sample t-test to compare two samples of customers: those who accepted the campaign offer and those who rejected it.
recency_P = df[df['Response']==1]['Recency'].sample(n=20, random_state=100)
recency_N = df[df['Response']==0]['Recency'].sample(n=20, random_state=100)
null hypothesis (H0): there is no difference in Recency between the customers who accept the offer and those who don’t - represented as the blue line.
alternative hypothesis (H1): customers who accept the offer have lower Recency than customers who don’t accept the offer - represented as the orange line.
Type 1 error (False Positive): If a value falls within the blue area in the chart, we reject the null hypothesis because the value is below the threshold, even though it occurred while the null hypothesis is true. This is a Type 1 error, or false positive. Its probability equals the significance level (usually 0.05), which means we accept a 5% risk of claiming that customers who accept the offer have lower Recency when in fact there is no difference. The consequence of a Type 1 error is that the company may send a new campaign offer to people with low Recency values and see a poor response rate.
Type 2 error (False Negative): The probability of failing to reject the null hypothesis when the alternative hypothesis is actually true, that is, claiming there is no difference between the two groups when a difference exists. In the business context, the marketing team may miss a targeted campaign opportunity with a high return on investment.
Statistical Power (True Positive): The probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. It is the complement of the Type 2 error: Power = 1 - Type 2 error. Here, it means we correctly conclude that customers who accept the offer are more likely to have lower Recency than customers who don’t accept the offer.
Hover over the chart below to see how Power, Type 1 error and Type 2 error change as we apply different thresholds.
Check out "Statistical Power" in our Code Snippet section, if you want to build this yourself.
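The relationship between the threshold and the three quantities can also be sketched numerically. The example below is a minimal illustration, not the chart's actual code: it assumes the test statistic is normally distributed with standard deviation 1 under both hypotheses, centered at 0 under H0 and at a hypothetical -2.3 under H1, and it rejects H0 in the lower tail as in the Recency example. The name error_rates is made up for this sketch.

```python
from statistics import NormalDist

# Assumed sampling distributions of the test statistic (illustrative values only).
h0 = NormalDist(mu=0.0, sigma=1.0)    # null hypothesis: no difference
h1 = NormalDist(mu=-2.3, sigma=1.0)   # alternative: hypothetical negative shift


def error_rates(threshold):
    """Return (type 1 error, type 2 error, power) for a lower-tail test
    that rejects H0 whenever the statistic falls below `threshold`."""
    type1 = h0.cdf(threshold)   # reject H0 although H0 is true
    power = h1.cdf(threshold)   # reject H0 when H1 is true
    type2 = 1.0 - power         # fail to reject H0 when H1 is true
    return type1, type2, power


for threshold in (-2.33, -1.645, -1.0):
    type1, type2, power = error_rates(threshold)
    print(f"threshold={threshold:6.3f}  type1={type1:.3f}  "
          f"type2={type2:.3f}  power={power:.3f}")
```

Under these assumptions, moving the threshold toward the center of the H0 curve increases both the Type 1 error and the power, which is the trade-off the chart visualizes.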
Why Use Statistical Power?
The significance level is widely used to decide whether a hypothesis test result is statistically significant. However, it tells only part of the story: it limits the risk of claiming a true effect or difference when no actual difference exists. Everything is based on the assumption that the null hypothesis is true.
What if we want to see the positive side of the story - the probability of making the right conclusion when the alternative hypothesis is true? We can use Power.
Additionally, power plays a role in determining the sample size. A small sample might yield a small p-value by chance, suggesting that the result is unlikely to be a false positive. But it does not guarantee that there is enough evidence for a true positive. Therefore, the target power is usually set before the experiment to determine the minimum sample size required to provide sufficient evidence for detecting a real effect.
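As a rough sketch of how a target power translates into a minimum sample size, the snippet below uses the standard normal approximation for a one-sided, two-sample comparison with equal group sizes: n per group is approximately 2 * ((z_(1-alpha) + z_power) / d)^2. The function name and the planning inputs (effect size 0.5, alpha 0.05, power 0.8) are illustrative assumptions, and an exact t-based calculation will generally ask for a slightly larger n.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group n for a one-sided two-sample test
    (normal approximation, equal group sizes)."""
    z = NormalDist()                   # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha)     # critical value for the significance level
    z_power = z.inv_cdf(power)         # quantile matching the target power
    n = 2 * ((z_alpha + z_power) / abs(effect_size)) ** 2
    return ceil(n)                     # round up to a whole observation


# Hypothetical planning inputs: medium effect, 5% significance, 80% power.
print(min_sample_size_per_group(effect_size=0.5, alpha=0.05, power=0.8))
```

Smaller expected effects or higher target power push the required sample size up, which is why these values are fixed before data collection.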
How to Calculate Power?
Power is affected by three factors: significance level, sample size and effect size. In Python, the solve_power() method of statsmodels' TTestIndPower calculates the power from the parameters effect_size, alpha and nobs1 when power is left as None.
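To see how each of the three factors moves power, the sketch below uses a normal approximation for a one-sided two-sample test with equal group sizes: power is approximately Φ(|d| * sqrt(n / 2) - z_(1-alpha)). This is a swapped-in approximation for illustration only; solve_power() itself works with the noncentral t distribution, so its results will differ slightly. The helper name approx_power and the baseline values are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, nobs1, alpha=0.05):
    """Normal-approximation power of a one-sided two-sample test
    with equal group sizes."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)               # rejection cutoff in z units
    shift = abs(effect_size) * sqrt(nobs1 / 2)   # expected z statistic under H1
    return z.cdf(shift - z_alpha)


# Vary one factor at a time around an arbitrary baseline (d=0.5, n=20, alpha=0.05).
for n in (10, 20, 50):
    print(f"nobs1={n:3d}        power={approx_power(0.5, n):.2f}")
for d in (0.2, 0.5, 0.8):
    print(f"effect_size={d}  power={approx_power(d, 20):.2f}")
for a in (0.01, 0.05, 0.10):
    print(f"alpha={a:.2f}       power={approx_power(0.5, 20, alpha=a):.2f}")
```

In each loop the power rises: with a larger sample, with a larger effect, and with a looser significance level.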
Let’s run a power analysis using the Customer Recency example above.
from statsmodels.stats.power import TTestIndPower

t_solver = TTestIndPower()
power = t_solver.solve_power(effect_size=recency_d, alpha=0.05, power=None, ratio=1, nobs1=20, alternative='smaller')
significance level: we set alpha to 0.05, which fixes the Type 1 error rate at 5%. alternative='smaller' specifies the alternative hypothesis: the mean difference between the two groups is less than 0.
sample size: nobs1 specifies the size of sample 1 (20 customers), and ratio is the number of observations in sample 2 relative to sample 1, so ratio=1 means sample 2 also has 20 customers.
effect size: effect_size is the difference between the two group means relative to the pooled standard deviation. For a two-sample t-test, we use the Cohen’s d formula below to calculate it, and we got 0.73 in magnitude (the code below yields a negative d when customers who accepted the offer have the lower mean Recency, which is the direction alternative='smaller' expects). As a rule of thumb, 0.20, 0.50, 0.80 and 1.3 are considered small, medium, large and very large effect sizes.
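import numpy as np

# group sizes and means (N = rejected the offer, P = accepted the offer)
n1, n2 = len(recency_N), len(recency_P)
m1, m2 = np.mean(recency_N), np.mean(recency_P)
# note: np.std defaults to ddof=0 (population SD); pass ddof=1 for the sample SD
sd1, sd2 = np.std(recency_N), np.std(recency_P)
# pooled standard deviation across both groups
pooled_sd = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
# Cohen's d: negative when customers who accepted have lower mean Recency
recency_d = (m2 - m1) / pooled_sd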
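Written as a formula, the computation above corresponds to the pooled-standard-deviation form of Cohen's d. The subscripts follow the code (1 for recency_N, 2 for recency_P), with group means written as x-bar and group standard deviations as s; this notation is added here for illustration.

```latex
d = \frac{\bar{x}_2 - \bar{x}_1}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}}
```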
As shown in the interactive chart "Power, Type1 error and Type2 error", when the significance level is 0.05, the power is 0.74.