# Statistical Power in Hypothesis Testing

## What is Statistical Power?

Statistical Power is the probability that a hypothesis test detects an effect when the effect actually exists. In my previous post, we walked through the procedure for conducting a hypothesis test. In this post, we will build on that by introducing statistical power in hypothesis testing.

### Power & Type 1 Error & Type 2 Error

When talking about Power, Type 1 and Type 2 errors inevitably come up as well. All three are well-known hypothesis testing concepts that compare the predicted results against the actual results.

Let’s continue to use the t-test example in my previous post “An Interactive Guide to Hypothesis Testing” to illustrate these concepts.

Recap: we used a one-tailed two-sample t-test to compare two samples of customers - customers who accepted the campaign offer and customers who rejected it.

```python
recency_P = df[df['Response']==1]['Recency'].sample(n=20, random_state=100)
recency_N = df[df['Response']==0]['Recency'].sample(n=20, random_state=100)
```
• null hypothesis (H0): there is no difference in Recency between customers who accept the offer and those who don't - represented as the blue line.

• alternative hypothesis (H1): customers who accept the offer have lower Recency than customers who don't - represented as the orange line.

Type 1 error (False Positive): values in the blue area of the chart occur when the null hypothesis is true, yet we reject the null hypothesis because they fall below the threshold. In doing so we make a Type 1 error, or false positive. Its rate equals the significance level (usually 0.05), which means we accept a 5% risk of claiming that customers who accept the offer have lower Recency when in fact there is no difference. The business consequence of a Type 1 error is that the company may send a new campaign offer to people with low Recency values, only to see a poor response rate.

Type 2 error (False Negative): the probability of failing to reject the null hypothesis when the alternative hypothesis is actually true - that is, claiming there is no difference between the two groups when a difference actually exists. In the business context, the marketing team may miss a targeted campaign opportunity with a high return on investment.

Statistical Power (True Positive): the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. It is the exact complement of the Type 2 error: Power = 1 - Type 2 error. Here, it is the probability of correctly concluding that customers who accept the offer have lower Recency than customers who don't.

Hover over the chart below and you will see how Power, Type 1 error and Type 2 error change as we apply different thresholds.

Check out "Statistical Power" in our Code Snippet section, if you want to build this yourself.
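If you prefer a numeric sketch to the interactive chart, the three quantities can be computed directly from two assumed normal sampling distributions. The means, standard deviation and threshold below are purely illustrative, not values from the article's data:

```python
from scipy.stats import norm

# Illustrative sampling distributions of the test statistic:
# under H0 the mean is 0, under H1 it is -2, both with SD 1.
# One-tailed "smaller" test: reject H0 below the 5% quantile of H0.
mu_h0, mu_h1, sd = 0.0, -2.0, 1.0
threshold = norm.ppf(0.05, loc=mu_h0, scale=sd)

type1 = norm.cdf(threshold, loc=mu_h0, scale=sd)      # P(reject H0 | H0 true)
type2 = 1 - norm.cdf(threshold, loc=mu_h1, scale=sd)  # P(keep H0   | H1 true)
power = 1 - type2                                     # P(reject H0 | H1 true)

print(f"Type 1 error: {type1:.3f}")  # 0.050 by construction
print(f"Type 2 error: {type2:.3f}")
print(f"Power:        {power:.3f}")
```

Moving `threshold` to the left or right reproduces exactly the trade-off the interactive chart shows: the three areas shift together.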

## Why Use Statistical Power?

The significance level is widely used to judge how statistically significant a hypothesis test is. However, it tells only part of the story - it guards against claiming a true effect / difference when no actual difference exists. Everything it measures is conditioned on the assumption that the null hypothesis is true.

What if we want to see the positive side of the story - the probability of making the right conclusion when the alternative hypothesis is true? We can use Power.

Additionally, Power also plays a role in determining the sample size. A small sample might produce a small p-value by chance; that limits the risk of a false positive mistake, but it does not guarantee enough evidence for a true positive. Therefore, Power is usually fixed before the experiment to determine the minimum sample size required to provide sufficient evidence for detecting a real effect.

## How to Calculate Power?

The magnitude of power is determined by three factors: significance level, sample size and effect size. The statsmodels function solve_power() returns the power given values for the parameters effect_size, alpha and nobs1.

Let’s run a power analysis using the Customer Recency example above.

```python
from statsmodels.stats.power import TTestIndPower

t_solver = TTestIndPower()
power = t_solver.solve_power(effect_size=recency_d, alpha=0.05, power=None,
                             ratio=1, nobs1=20, alternative='smaller')
```
• significance level: we set alpha to 0.05, which also fixes the Type 1 error rate at 5%. alternative='smaller' specifies the alternative hypothesis: the mean difference between the two groups is smaller than 0.

• sample size: nobs1 specifies the size of sample 1 (20 customers) and ratio is the number of observations in sample 2 relative to sample 1.

• effect size: effect_size is the difference between the two sample means relative to the pooled standard deviation. For a two-sample t-test, we use Cohen's d, computed below, and get a magnitude of 0.73. As a rule of thumb, 0.20, 0.50, 0.80 and 1.3 are considered small, medium, large and very large effect sizes.

```python
import numpy as np

n1, n2 = len(recency_N), len(recency_P)
m1, m2 = np.mean(recency_N), np.mean(recency_P)
# ddof=1 gives the sample standard deviation used in Cohen's d
sd1, sd2 = np.std(recency_N, ddof=1), np.std(recency_P, ddof=1)
pooled_sd = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
recency_d = (m2 - m1) / pooled_sd
```

As shown in the interactive chart "Power, Type1 error and Type2 error", when the significance level is 0.05, the power is 0.74.

## How to Increase Statistical Power?

Power is positively correlated with effect size, significance level and sample size.

### 1. Effect Size

Power increases when effect size increases (check out the code in Code Snippet to build this yourself)

A larger effect size means a larger difference in means relative to the pooled standard deviation. As the effect size grows, the observed difference between the two samples grows with it, providing more evidence that the alternative hypothesis is true, so Power increases. Hover over the line to see how Power changes as the effect size changes.
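Under the same settings as the example (alpha of 0.05, 20 customers per group), a short sketch with statsmodels illustrates this relationship. The effect sizes here are the rule-of-thumb benchmarks, not values from the data, and for simplicity the direction is 'larger' with positive effect sizes, which is symmetric to the article's 'smaller' test on negative ones:

```python
from statsmodels.stats.power import TTestIndPower

# Power as a function of effect size, with alpha and sample size held fixed.
solver = TTestIndPower()
for d in [0.2, 0.5, 0.8, 1.3]:  # small, medium, large, very large
    p = solver.solve_power(effect_size=d, alpha=0.05, nobs1=20,
                           ratio=1, alternative='larger')
    print(f"effect size {d:.1f} -> power {p:.2f}")
```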

### 2. Significance level / Type I error

Power increases when significance level (alpha value) increases (Check out the code in Code Snippet to build this yourself)

There is a trade-off between Type 1 and Type 2 errors, so if we allow more Type 1 error we also increase power. If you hover over the line in the first interactive chart "Power, Type 1 and Type 2 error", you will notice that when we try to mitigate Type 1 error, Type 2 error increases and power decreases. This is because minimizing false positive mistakes raises the bar and adds more constraints to what we can classify as a positive effect. When the standard is too high, we also reduce the probability of correctly classifying a true positive effect. We cannot make both perfect, so a common convention - a Type 1 error of 0.05 and a Power of 0.8 - is applied to balance this trade-off.
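The same trade-off can be seen numerically by holding the example's effect size magnitude (0.73) and sample size fixed while varying alpha (again using the symmetric 'larger' direction for simplicity):

```python
from statsmodels.stats.power import TTestIndPower

# Tolerating a higher Type 1 error rate (alpha) buys a higher power.
solver = TTestIndPower()
for alpha in [0.01, 0.05, 0.10]:
    p = solver.solve_power(effect_size=0.73, alpha=alpha, nobs1=20,
                           ratio=1, alternative='larger')
    print(f"alpha {alpha:.2f} -> power {p:.2f}")
```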

### 3. Sample size

Power increases as sample size increases (check out the code in Code Snippet to build this yourself)

Power has a positive correlation with sample size. A large sample brings down the variance of the sample mean, so the sample averages sit closer to the population means. As a result, a difference observed in the sample data is less likely to have occurred by chance. As demonstrated in the chart, when the sample size is as large as 100, it is easy to reach nearly 100% power even with a relatively small effect size.

In hypothesis testing, we often reverse the process and derive the required sample size given the desired power using the code below. For this example, it is required to have around 24 customers in each sample group to run a t-test with Power of 0.8.
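A sketch of that reversed calculation: we pass the desired power and leave nobs1 as None so that solve_power() treats the sample size as the unknown. As above, this uses the effect size magnitude (0.73) with the symmetric 'larger' direction:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the minimum sample size per group, given the desired power.
solver = TTestIndPower()
n_required = solver.solve_power(effect_size=0.73, alpha=0.05, power=0.8,
                                nobs1=None, ratio=1, alternative='larger')
print(round(n_required))  # about 24 customers per group
```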

## Take Home Message

In this article, we introduced the statistical concept of Power and answered some common questions about it.

• What is Statistical Power? - Power is the true positive rate, closely tied to Type 1 and Type 2 errors.

• Why use Statistical Power? - Power can be used to determine the required sample size.

• How to calculate Power? - Power is calculated from the effect size, significance level and sample size.