8 Inference About the Mean

8.1 Introduction

The simplest type of statistical inference concerns one parameter in one population. Before comparing multiple populations or modeling complex relationships, we begin with the most fundamental case:

Inference about the population mean.

This is the natural starting point because:

  • The mean is often the parameter of primary interest.
  • Its sampling distribution is well understood.
  • The Central Limit Theorem provides broad applicability.

Methods for inference fall into two broad categories:

  • Estimation
  • Hypothesis testing

In estimation, we attempt to learn the value of the population mean.
In hypothesis testing, we evaluate a claim about the value of the population mean.

Throughout this chapter, we emphasize a computational perspective. Rather than relying only on formulas, we simulate sampling distributions to understand the logic of inference.


8.2 Example: Blood Alcohol Level

Suppose we want to determine the true blood alcohol concentration (BAC) of an individual.

The testing device reports measurements with random error. The manufacturer specifies that measurement errors are normally distributed with known standard deviation:

  • Measurement error standard deviation: 0.03
  • Errors are unbiased (mean zero)

If the true BAC is \(\mu\), each observed measurement can be modeled as:

\[ X_i = \mu + \varepsilon_i, \]

where the errors are independent and normally distributed.

We are interested only in the mean BAC level \(\mu\).

This example will guide our study of:

  • Point estimation
  • Confidence intervals
  • Hypothesis testing
  • Sample size determination

8.3 Assumptions

Statistical inference about the mean relies on several assumptions about the data generating process.

Suppose we observe measurements

\[ X_1, X_2, \ldots, X_n \]

from a population with mean \(\mu\) and variance \(\sigma^2\).

We assume:

  1. Independence

The observations are independent.

\[ X_1, X_2, \ldots, X_n \text{ are independent.} \]

  2. Constant Mean

Each observation has the same population mean

\[ \mathbb{E}(X_i) = \mu. \]

  3. Constant Variance

Each observation has the same variance

\[ \mathbb{V}(X_i) = \sigma^2. \]

  4. Normality or Large Sample Size

Either

  • the population distribution is normal, or
  • the sample size is sufficiently large so that the Central Limit Theorem applies.

Under these assumptions, the sample mean has sampling distribution

\[ \bar{X} \sim N\left(\mu,\frac{\sigma^2}{n}\right) \]

when the population is normal, and is approximately normal for large samples.


8.4 Point Estimation

The first step in statistical inference is computing a point estimate.

If we assume the measurements are normally distributed with mean \(\mu\), the natural estimator is the sample mean:

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i. \]

Why the sample mean?

From probability theory:

  • \(\mathbb{E}(\bar{X}) = \mu\)
  • \(\mathbb{V}(\bar{X}) = \frac{\sigma^2}{n}\)
  • If the population is normal, \(\bar{X}\) is exactly normally distributed.

We verify this through simulation.

set.seed(123)

mu  <- 0.085
sd  <- 0.03
n   <- 5
B   <- 5000

means <- replicate(B, mean(rnorm(n, mu, sd)))

mean(means)
## [1] 0.08480454
sd(means)
## [1] 0.01343491
hist(means, breaks = 50, probability=TRUE, main="Sampling Distribution of the Sample Mean")
abline(v=mu, col="blue", lwd=2)

The histogram approximates a normal distribution centered at the true mean.


8.5 Interval Estimation

The sample mean is random. Reporting only \(\bar{X}\) does not indicate how close it is likely to be to \(\mu\).

To quantify uncertainty, we construct a confidence interval.

Because the population variance is known and the data are normal (or \(n\) is large), the sampling distribution of the sample mean is:

\[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\]

This is the key starting point. The mean of \(\bar{X}\) is \(\mu\), and its standard deviation (the standard error) is \(\sigma/\sqrt{n}\).


8.5.1 Step 1: Standardize

To derive the interval, we standardize \(\bar{X}\):

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}. \]

Since \(\bar{X}\) is normal, this standardized variable follows:

\[ Z \sim N(0,1). \]


8.5.2 Step 2: Pick a Confidence Level \(1 - \alpha\)

We choose a confidence level between \(0\) and \(1\),

\[ 1 - \alpha \]

Based on this level, we select the appropriate quantile of the standard normal distribution

\[ z_{\alpha/2} \]

that satisfies:

\[ P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha \]

This statement says that the probability a standard normal variable falls between \(-z_{\alpha/2}\) and \(z_{\alpha/2}\) is \(1-\alpha\).
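
As a quick numerical check, the quantile can be obtained with R's `qnorm()`. The sketch below assumes a 95% level (`alpha = 0.05`), a value chosen only for illustration; the names `z_half` and `central_prob` are ours.

```r
# Illustrative confidence level (assumed here: 95%)
alpha <- 0.05

# Upper alpha/2 quantile of the standard normal distribution
z_half <- qnorm(1 - alpha / 2)
z_half

# Probability that a standard normal variable lies in [-z_half, z_half]
central_prob <- pnorm(z_half) - pnorm(-z_half)
central_prob
```

The second value should equal \(1-\alpha\), which is the probability statement above evaluated numerically.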


8.5.3 Step 3: Substitute Back for \(Z\)

Replace \(Z\) with its definition:

\[ P\left( -z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2} \right) = 1 - \alpha. \]


8.5.4 Step 4: Solve for \(\mu\)

Multiply all parts by \(\sigma/\sqrt{n}\):

\[ P\left( -z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha. \]

Now subtract \(\bar{X}\) from all parts and multiply by \(-1\), which reverses the inequalities, to isolate \(\mu\):

\[ P\left( \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha. \]


8.5.5 Final Result

This leads directly to the \((1-\alpha)\) confidence interval:

\[ \bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
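
The formula translates directly into a few lines of R. The following is a minimal sketch, not a library function: the helper name `z_interval` and the five readings are hypothetical and serve only to illustrate the calculation, while \(\sigma = 0.03\) is the known standard deviation from the blood alcohol example.

```r
# Known-sigma (z) confidence interval for a mean.
# x: numeric vector of measurements; sigma: known standard deviation;
# conf_level: desired confidence level 1 - alpha.
z_interval <- function(x, sigma, conf_level = 0.95) {
  alpha  <- 1 - conf_level
  x_bar  <- mean(x)
  margin <- qnorm(1 - alpha / 2) * sigma / sqrt(length(x))
  c(lower = x_bar - margin, estimate = x_bar, upper = x_bar + margin)
}

# Hypothetical BAC readings (made up for illustration only)
readings <- c(0.091, 0.078, 0.102, 0.085, 0.069)
ci <- z_interval(readings, sigma = 0.03)
ci
```

Raising `conf_level` or lowering the number of readings widens the interval, as the formula predicts.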


8.5.6 Interpretation

The randomness is in \(\bar{X}\), not in \(\mu\).

Before observing the data, the interval is random. In repeated sampling, intervals constructed this way will contain the true mean \(\mu\) approximately \(100(1-\alpha)\%\) of the time.

After the data are observed, the interval is fixed, and we interpret it using this long-run frequency property.


This derivation shows that confidence intervals are not ad hoc formulas. They arise directly from:

  1. The sampling distribution of \(\bar{X}\)
  2. Standardization
  3. A probability statement about the standard normal distribution

Everything follows from these three ingredients.

8.5.7 Example Interval Estimation

We return to the blood alcohol example and simulate repeated confidence intervals.

set.seed(123)

n      <- 5
alpha  <- 0.05
reaAlc <- 0.085
sd     <- 0.03
rep    <- 100

ci_bounds <- replicate(rep, {
  obsErr <- rnorm(n, 0, sd)
  x <- reaAlc + obsErr
  x_bar <- mean(x)

  lower <- x_bar - qnorm(1 - alpha/2) * sd / sqrt(n)
  upper <- x_bar + qnorm(1 - alpha/2) * sd / sqrt(n)

  c(lower, upper)
})

ci_bounds <- t(ci_bounds)

plot(ci_bounds[,1], type="n",
     ylim=range(ci_bounds),
     xlab="Simulation Index",
     ylab="Confidence Interval")

for(i in 1:nrow(ci_bounds)){
  segments(i, ci_bounds[i,1], i, ci_bounds[i,2],
           col = ifelse(ci_bounds[i,1] <= reaAlc &
                        ci_bounds[i,2] >= reaAlc,
                        "black", "red"))
}

abline(h = reaAlc, col="blue", lwd=2)

Black intervals contain the true mean. Red intervals miss it.

This visualizes the long-run interpretation:

A 95% confidence procedure captures the true mean approximately 95% of the time.
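
With only 100 intervals, the observed proportion can differ noticeably from 95%. A larger simulation makes the long-run property easier to see. The sketch below reuses the settings of the blood alcohol example; the number of replications (`reps = 10000`) and the seed are assumptions chosen for illustration.

```r
set.seed(1)

n       <- 5
alpha   <- 0.05
true_mu <- 0.085   # true BAC used in the simulation
sigma   <- 0.03    # known measurement-error standard deviation
reps    <- 10000   # number of simulated samples (assumed value)

margin <- qnorm(1 - alpha / 2) * sigma / sqrt(n)

# For each simulated sample, record whether the interval covers true_mu
covered <- replicate(reps, {
  x_bar <- mean(rnorm(n, mean = true_mu, sd = sigma))
  (x_bar - margin <= true_mu) && (true_mu <= x_bar + margin)
})

coverage <- mean(covered)
coverage
```

The empirical coverage should be close to, but not exactly, 0.95, because it is itself an estimate based on a finite number of simulated samples.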




8.5.8 Choosing the Sample Size for Estimating the Population Mean

When constructing a confidence interval, we often face a practical design question:

How large must the sample be to achieve a desired level of precision?

Precision is measured by the margin of error, which is half the width of the confidence interval.

For a \((1-\alpha)\) confidence interval with known variance,

\[ E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]

This formula shows that the margin of error depends on:

  1. The confidence level (through \(z_{\alpha/2}\)),
  2. The population standard deviation \(\sigma\),
  3. The sample size \(n\).

8.5.8.1 Solving for the Required Sample Size

Suppose we want the margin of error to be no larger than some specified value \(E\). We solve the equation algebraically for \(n\).

Starting from

\[ E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \]

multiply both sides by \(\sqrt{n}\):

\[ E\sqrt{n} = z_{\alpha/2}\sigma. \]

Now divide both sides by \(E\):

\[ \sqrt{n} = \frac{z_{\alpha/2}\sigma}{E}. \]

Finally, square both sides:

\[ n = \left(\frac{z_{\alpha/2}\sigma}{E}\right)^2. \]

This formula gives the minimum required sample size to guarantee the desired precision.


8.5.8.2 Interpretation of the Formula

The structure of the formula immediately reveals:

  • If \(\sigma\) increases, \(n\) must increase.
  • If we want smaller \(E\) (greater precision), \(n\) must increase.
  • If we increase the confidence level (larger \(z_{\alpha/2}\)), \(n\) must increase.

Notice that \(n\) grows with the square of \(z_{\alpha/2}\) and \(\sigma\), and with the inverse square of \(E\).
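
These three effects can be seen by changing one input at a time. The sketch below wraps the sample size formula in a hypothetical helper, `required_n`; the baseline uses \(\sigma = 0.03\) from the blood alcohol example, and the margin of 0.01 and the alternative values are illustrative assumptions.

```r
# Minimum sample size for a target margin of error with known sigma
required_n <- function(sigma, margin, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling((z * sigma / margin)^2)
}

# Baseline: sigma = 0.03, margin = 0.01, 95% confidence
baseline <- required_n(sigma = 0.03, margin = 0.01)

# Change one input at a time (the alternative values are illustrative)
more_noise      <- required_n(sigma = 0.06, margin = 0.01)
more_precision  <- required_n(sigma = 0.03, margin = 0.005)
more_confidence <- required_n(sigma = 0.03, margin = 0.01, conf_level = 0.99)

c(baseline = baseline, more_noise = more_noise,
  more_precision = more_precision, more_confidence = more_confidence)
```

Doubling \(\sigma\) and halving \(E\) have the same effect on \(n\), since both enter the formula through the ratio \(\sigma/E\).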


8.5.8.3 Interpreting “Width 0.02”

If the officer wants the confidence interval to have total width 0.02, then the margin of error must be:

\[ E = 0.01. \]

The resulting confidence interval would have the form:

\[ \bar{X} \pm 0.01. \]

Using:

  • \(\sigma = 0.03\)
  • \(\alpha = 0.05\)
  • \(E = 0.01\)

we compute:

sigma <- 0.03
alpha <- 0.05
E     <- 0.01

z <- qnorm(1 - alpha/2)

n_required <- (z * sigma / E)^2
ceiling(n_required)
## [1] 35

We always round up to ensure the margin of error does not exceed the target value.


8.5.8.4 The Square-Root Relationship

From the margin of error formula,

\[ E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \]

we see that

\[ E \propto \frac{1}{\sqrt{n}}. \]

This square-root relationship has an important implication:

  • To cut the margin of error in half, we must quadruple the sample size.

Thus, gains in precision become increasingly expensive as we demand tighter intervals.
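
A short calculation makes the square-root relationship concrete. This sketch assumes \(\sigma = 0.03\) and 95% confidence, as in the blood alcohol example; the sample sizes are illustrative and chosen so that each is four times the previous one.

```r
sigma <- 0.03
z     <- qnorm(0.975)          # 95% confidence

margin_of_error <- function(n) z * sigma / sqrt(n)

n_values <- c(5, 20, 80, 320)  # each value is four times the previous one
margins  <- margin_of_error(n_values)

data.frame(n = n_values, margin = margins,
           ratio_to_previous = c(NA, margins[-1] / margins[-length(margins)]))
```

Each ratio equals one half: every quadrupling of \(n\) buys only a halving of the margin of error.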


8.5.8.5 If the Sample Size Is Fixed

In many applications, \(n\) is constrained by budget or logistics. If \(n\) is fixed, we compute the resulting margin of error:

n <- 5
E <- z * sigma / sqrt(n)
E
## [1] 0.02629568

This tells us how precise our estimate can realistically be.


8.5.8.6 Design Trade-Off

Choosing \(n\) is fundamentally a planning decision that balances:

  • Statistical precision,
  • Confidence level,
  • Practical cost.

Larger \(n\) leads to:

  • Smaller margin of error,
  • More reliable estimation,
  • Greater financial and operational cost.

Understanding this trade-off is essential in statistical study design and will reappear later in hypothesis testing and power analysis.


8.6 Hypothesis Testing

In estimation, we attempt to learn \(\mu\).

In hypothesis testing, we evaluate a specific claim about \(\mu\).

Suppose the legal limit for blood alcohol concentration (BAC) is 0.08. We wish to determine whether the true mean BAC exceeds this limit.

We formalize the problem as:

  • \(H_0: \mu = 0.08\)
  • \(H_a: \mu > 0.08\)

The null hypothesis represents the legal benchmark. The alternative reflects the claim that the limit has been exceeded.


8.6.1 Logical Structure of a Test

Hypothesis testing follows a proof-by-contradiction structure:

  1. Assume \(H_0\) is true.
  2. Determine what values of the data would be unlikely under that assumption.
  3. If the observed data fall in that unlikely region, reject \(H_0\).

The key is understanding the sampling distribution under \(H_0\).

If \(H_0\) is true, then

\[ \bar{X} \sim N\left(\mu_0,\frac{\sigma^2}{n}\right). \]

Standardizing gives the test statistic:

\[ Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}. \]

Under \(H_0\),

\[ Z \sim N(0,1). \]

This distribution determines the rejection region.


8.6.2 Developing the Rejection Region

For a right-tailed test at significance level \(\alpha\), we want:

\[ P(\text{Reject } H_0 \mid H_0 \text{ true}) = \alpha. \]

Since \(Z \sim N(0,1)\) under \(H_0\), we choose a cutoff \(z_\alpha\) such that

\[ P(Z \ge z_\alpha) = \alpha. \]

This defines the critical value.

Therefore, the rejection region in terms of \(Z\) is:

\[ \text{Reject } H_0 \quad \text{if} \quad Z \ge z_\alpha. \]

Substituting back for \(Z\):

\[ \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \ge z_\alpha. \]

Solving for \(\bar{X}\) gives the rejection region in the original measurement scale:

\[ \bar{X} \ge \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}. \]

This is an important step: the rejection region can be expressed either in standardized units (z-scores) or directly in terms of the sample mean.
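
For the blood alcohol setting, both versions of the cutoff can be computed directly. The sketch below assumes \(n = 5\) and \(\alpha = 0.05\), values used for illustration; `xbar_crit` is our name for the cutoff on the measurement scale.

```r
mu0   <- 0.08   # value of the mean under H0 (legal limit)
sigma <- 0.03   # known measurement-error standard deviation
n     <- 5      # sample size (illustrative)
alpha <- 0.05   # significance level (illustrative)

z_alpha   <- qnorm(1 - alpha)                  # critical value on the z scale
xbar_crit <- mu0 + z_alpha * sigma / sqrt(n)   # same cutoff on the BAC scale

c(z_alpha = z_alpha, xbar_crit = xbar_crit)

# Check: the probability of exceeding the cutoff under H0 is alpha
tail_prob <- pnorm(xbar_crit, mean = mu0, sd = sigma / sqrt(n), lower.tail = FALSE)
tail_prob
```

With these settings the cutoff is roughly 0.102, so the sample mean must exceed the legal limit by about 0.022 before \(H_0\) is rejected.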


8.6.3 Interpretation of the Rejection Region

The rejection region consists of sample means that are sufficiently large that they would be rare if \(\mu = 0.08\) were true.

Thus:

  • If \(\bar{X}\) falls in this region, we conclude the data are inconsistent with \(H_0\).
  • If \(\bar{X}\) does not fall in this region, we do not have enough evidence to reject \(H_0\).

Note carefully:

Failing to reject \(H_0\) does not mean \(H_0\) is true. It means the data are not sufficiently inconsistent with it.


8.6.4 Components of a Test

A complete hypothesis test requires:

  1. Null hypothesis (\(H_0\)) The benchmark claim being evaluated.

  2. Alternative hypothesis (\(H_a\)) The competing claim.

  3. Significance level (\(\alpha\)) The probability of a Type I error.

  4. Test statistic A standardized measure of discrepancy between data and \(H_0\).

  5. Sampling distribution under \(H_0\) Determines probabilities and critical values.

  6. Rejection region Values of the test statistic that lead to rejection.

  7. Decision rule and conclusion A formal statement in context.


8.6.5 Type I and Type II Errors

Two types of errors are possible.

8.6.5.1 Type I Error

Rejecting \(H_0\) when it is true.

\[ P(\text{Type I Error}) = \alpha. \]

In this context: concluding BAC exceeds 0.08 when in fact it does not.
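
The identity \(P(\text{Type I Error}) = \alpha\) can be checked by simulation: we generate data for which \(H_0\) is true by construction and count how often the right-tailed test rejects. The settings below mirror the blood alcohol example; the number of replications and the seed are assumptions.

```r
set.seed(1)

mu0   <- 0.08
sigma <- 0.03
n     <- 5
alpha <- 0.05
reps  <- 20000   # number of simulated samples (assumed value)

z_alpha <- qnorm(1 - alpha)

# Generate data with the true mean equal to mu0, so H0 holds by construction
rejected <- replicate(reps, {
  x <- rnorm(n, mean = mu0, sd = sigma)
  z <- (mean(x) - mu0) / (sigma / sqrt(n))
  z >= z_alpha
})

type1_rate <- mean(rejected)
type1_rate
```

The rejection rate should be close to \(\alpha = 0.05\), up to simulation error.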


8.6.5.2 Type II Error

Failing to reject \(H_0\) when it is false.

\[ P(\text{Type II Error}) = \beta. \]

In this context: failing to detect that BAC exceeds 0.08 when it truly does.


8.6.5.3 Trade-Off

For fixed \(n\):

  • Decreasing \(\alpha\) (being more conservative) → increases \(\beta\) → reduces power.

This trade-off is fundamental in test design.
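
For the right-tailed \(z\) test, \(\beta\) can be computed in closed form once a specific alternative is fixed. The sketch below assumes a true mean of 0.09 (an illustrative choice) and uses the fact that, when \(\mu = \mu_1\), the test statistic is normal with mean \((\mu_1 - \mu_0)/(\sigma/\sqrt{n})\) and variance 1. The helper `beta_for_alpha` is hypothetical.

```r
mu0    <- 0.08
mu_alt <- 0.09   # assumed true mean under the alternative (illustrative)
sigma  <- 0.03
n      <- 5

# Type II error probability of the right-tailed z test at level alpha
beta_for_alpha <- function(alpha) {
  shift <- (mu_alt - mu0) / (sigma / sqrt(n))  # mean of the statistic when mu = mu_alt
  pnorm(qnorm(1 - alpha) - shift)              # P(statistic < z_alpha) when mu = mu_alt
}

alphas <- c(0.10, 0.05, 0.01)
betas  <- sapply(alphas, beta_for_alpha)

data.frame(alpha = alphas, beta = betas, power = 1 - betas)
```

Reading down the table, each reduction in \(\alpha\) raises \(\beta\) and lowers the power, which is the trade-off described above.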


8.6.6 One-Tailed Blood Alcohol Test (Numerical Illustration)

We simulate data where the true mean is slightly above the legal limit.

set.seed(123)

mu0 <- 0.08
mu  <- 0.09
sd  <- 0.03
n   <- 5

x <- rnorm(n, mu, sd)
xbar <- mean(x)
# Test Statistic
z <- (xbar - mu0) / (sd / sqrt(n))
z
## [1] 1.178192
# Critical value for the rejection region (alpha = 0.05, as in the earlier examples)
alpha <- 0.05
R <- qnorm(p = 1 - alpha)
# Result of the Test
if(z > R){
  print("Reject the Null Hypothesis")
} else {
  print("Fail to reject the Null Hypothesis")
}
## [1] "Fail to reject the Null Hypothesis"

If \(z \ge z_\alpha\), we reject \(H_0\).

Otherwise, we fail to reject \(H_0\).


8.6.7 General Forms of Rejection Regions

8.6.7.1 Case 1: Right-Tailed Test

\[ H_a: \mu > \mu_0 \]

Reject if

\[ Z \ge z_\alpha. \]


8.6.7.2 Case 2: Left-Tailed Test

\[ H_a: \mu < \mu_0 \]

Reject if

\[ Z \le -z_\alpha. \]


8.6.7.3 Case 3: Two-Tailed Test

\[ H_a: \mu \ne \mu_0 \]

We split \(\alpha\) across both tails:

\[ P(|Z| \ge z_{\alpha/2}) = \alpha. \]

Reject if

\[ |Z| \ge z_{\alpha/2}. \]
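
The three cases can be collected into one decision function. This is a minimal sketch; the name `reject_h0` and the labels `"greater"`, `"less"`, and `"two.sided"` are our own choices (they follow the labels used by `t.test()`), and the statistic 1.8 is illustrative.

```r
# Decide whether to reject H0 from a z statistic.
# alternative: "greater" (right-tailed), "less" (left-tailed), "two.sided"
reject_h0 <- function(z, alpha = 0.05,
                      alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         greater   = z >=  qnorm(1 - alpha),
         less      = z <= -qnorm(1 - alpha),
         two.sided = abs(z) >= qnorm(1 - alpha / 2))
}

# The same statistic can lead to different decisions (z = 1.8 is illustrative)
reject_h0(1.8, alternative = "greater")
reject_h0(1.8, alternative = "less")
reject_h0(1.8, alternative = "two.sided")
```

At \(\alpha = 0.05\), the statistic 1.8 exceeds the one-sided cutoff but not the two-sided one, so only the right-tailed test rejects.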


8.6.8 Significance Level and p-Value

Instead of precomputing a rejection region, we may compute a p-value.

The p-value is:

The probability, assuming \(H_0\) is true, of observing a test statistic at least as extreme as the one obtained.

For a right-tailed test:

\[ \text{p-value} = P(Z \ge z_{\text{obs}} \mid H_0 \text{ true}). \]

p_value <- 1 - pnorm(z)
p_value

8.6.8.1 Decision Rule Using the p-Value

  • If p-value \(\le \alpha\), reject \(H_0\).
  • If p-value \(> \alpha\), fail to reject \(H_0\).

Both approaches—critical value and p-value—are mathematically equivalent. They differ only in presentation.
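
The equivalence can be verified numerically. The sketch below defines a hypothetical helper `p_value_z` for the three forms of the alternative and then compares the two decision rules for a right-tailed test over a grid of illustrative test statistics.

```r
# p-value of a z statistic for each form of the alternative
p_value_z <- function(z, alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         greater   = pnorm(z, lower.tail = FALSE),
         less      = pnorm(z),
         two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))
}

alpha  <- 0.05
z_grid <- seq(-3, 3, by = 0.25)   # illustrative values of the test statistic

# Right-tailed test: compare the two decision rules on every grid value
by_p_value  <- p_value_z(z_grid, "greater") <= alpha
by_critical <- z_grid >= qnorm(1 - alpha)

all(by_p_value == by_critical)
```

The two rules agree for every value in the grid, because \(P(Z \ge z_{\text{obs}}) \le \alpha\) holds exactly when \(z_{\text{obs}} \ge z_\alpha\).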


8.6.9 Conceptual Summary

Hypothesis testing proceeds by:

  1. Modeling the sampling distribution under \(H_0\).
  2. Determining what outcomes are rare under that model.
  3. Comparing observed data to that rarity threshold.

The rejection region formalizes what “rare” means. The p-value quantifies how rare the observed result actually is.

Both methods operationalize the same probabilistic logic.

8.6.10 Choosing the Sample Size for Testing the Population Mean

To detect a meaningful deviation from \(\mu_0\), we must choose \(n\) carefully.

Larger samples:

  • Reduce standard error
  • Increase power (reduce \(\beta\))

A detailed study of power and Type II error will follow in later sections.
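
As a preview, the effect of \(n\) on power can be tabulated for the right-tailed \(z\) test. The sketch assumes we want to detect a true mean of 0.09 (an illustrative choice); `power_for_n` is a hypothetical helper based on the normal sampling distribution of the test statistic.

```r
mu0    <- 0.08
mu_alt <- 0.09   # deviation we would like to detect (illustrative)
sigma  <- 0.03
alpha  <- 0.05

# Power of the right-tailed z test as a function of the sample size
power_for_n <- function(n) {
  se <- sigma / sqrt(n)
  pnorm(qnorm(1 - alpha) - (mu_alt - mu0) / se, lower.tail = FALSE)
}

n_values <- c(5, 10, 20, 40, 80)
power_table <- data.frame(n = n_values,
                          std_error = sigma / sqrt(n_values),
                          power = power_for_n(n_values))
power_table
```

As \(n\) grows, the standard error shrinks and the power rises toward 1.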


8.7 Inference About the Mean: Unknown Variance

In the previous sections, we constructed confidence intervals and hypothesis tests assuming that the population variance \(\sigma^2\) was known. In practice, this assumption is rarely realistic. Most of the time the population variance is unknown and must be estimated from the sample.

When the population variance is unknown, we replace \(\sigma\) with the sample standard deviation \(s\). However, this replacement introduces additional uncertainty. Because of this, the sampling distribution of the standardized mean is no longer normal. Instead, it follows the Student \(t\) distribution.

The Student \(t\) distribution allows us to correctly account for the additional variability introduced by estimating \(\sigma\).


8.7.1 Review Properties of the Student t Distribution

The Student \(t\) distribution is a continuous probability distribution that arises frequently in statistical inference. It forms a family of distributions indexed by a parameter called the degrees of freedom, denoted by \(\nu\).

8.7.1.1 Definition

A random variable \(T\) is said to follow a Student \(t\) distribution with \(\nu\) degrees of freedom, written

\[ T \sim t_\nu , \]

if it can be expressed as the ratio

\[ T = \frac{Z}{\sqrt{U/\nu}}, \]

where

  • \(Z \sim N(0,1)\),
  • \(U \sim \chi^2_\nu\),
  • \(Z\) and \(U\) are independent.

This representation shows that the \(t\) distribution is closely related to both the normal and chi–square distributions.
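
The ratio representation can be checked by simulation: we draw \(Z\) and \(U\) independently, form the ratio, and compare its quantiles with those returned by `qt()`. The degrees of freedom, the number of draws, and the seed below are illustrative assumptions.

```r
set.seed(1)

nu   <- 5        # degrees of freedom (illustrative)
reps <- 100000   # number of simulated draws (assumed value)

z <- rnorm(reps)             # Z ~ N(0, 1)
u <- rchisq(reps, df = nu)   # U ~ chi-square(nu), drawn independently of Z
t_sim <- z / sqrt(u / nu)    # ratio from the definition

# Compare simulated quantiles with the theoretical t quantiles
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
rbind(simulated   = quantile(t_sim, probs),
      theoretical = qt(probs, df = nu))
```

The two rows should agree up to simulation error.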

8.7.1.2 Shape and Behavior

The Student \(t\) distribution has several key characteristics:

  • It is symmetric around 0.
  • It has heavier tails than the standard normal distribution.
  • Its exact shape depends on the degrees of freedom \(\nu\).

For small values of \(\nu\), the distribution has very heavy tails, meaning that extreme values occur with higher probability. As \(\nu\) increases, the distribution becomes more concentrated around zero.

curve(dt(x, df = 5), from = -4, to = 4,
      ylab = "Density",
      xlab = "t value")

curve(dnorm(x), from = -4, to = 4, add = TRUE, lty = 2)
legend("topright",
       legend = c("t (df=5)", "Normal"),
       lty = c(1,2))

8.7.1.3 Relationship with the Normal Distribution

As the degrees of freedom increase, the Student \(t\) distribution approaches the standard normal distribution. In particular,

\[ t_\nu \rightarrow N(0,1) \quad \text{as } \nu \to \infty . \]

Thus, for large values of \(\nu\), the \(t\) distribution and the standard normal distribution are nearly indistinguishable.
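
The convergence is easy to see in the critical values. The sketch below compares the upper 2.5% quantile of \(t_\nu\) with the corresponding normal quantile for an illustrative set of degrees of freedom.

```r
dfs <- c(2, 5, 10, 30, 100, 1000)

# Upper 2.5% critical values of t for increasing degrees of freedom
t_crit <- qt(0.975, df = dfs)
z_crit <- qnorm(0.975)

data.frame(df = dfs, t_crit = t_crit, excess_over_z = t_crit - z_crit)
```

The \(t\) critical value is always larger than the normal one, and the gap shrinks toward zero as \(\nu\) grows.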

8.7.1.4 Mean and Variance

The mean and variance of the \(t\) distribution depend on \(\nu\):

  • If \(\nu > 1\), the mean is

\[ \mathbb{E}(T) = 0 . \]

  • If \(\nu > 2\), the variance is

\[ \mathbb{V}(T) = \frac{\nu}{\nu - 2}. \]

For \(1 < \nu \le 2\), the variance is infinite, reflecting the heavy tails of the distribution. For \(\nu \le 1\), neither the mean nor the variance is defined.
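
The variance formula can be checked with simulated draws from `rt()`. The sketch below assumes \(\nu = 10\), an illustrative value for which the sample variance is a stable estimate; the number of draws and the seed are also assumptions.

```r
set.seed(1)

nu   <- 10       # degrees of freedom (illustrative; must exceed 2)
reps <- 200000   # number of simulated draws (assumed value)

t_draws <- rt(reps, df = nu)

c(simulated_mean  = mean(t_draws),
  simulated_var   = var(t_draws),
  theoretical_var = nu / (nu - 2))
```

For degrees of freedom close to 2 the same check becomes unreliable, because the sample variance converges very slowly when the tails are that heavy.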

8.7.1.5 Critical Values

For many statistical procedures, we use quantiles (critical values) of the \(t\) distribution.

For a probability level \(\alpha\), the value \(t_{\alpha,\nu}\) satisfies

\[ P(T > t_{\alpha,\nu}) = \alpha, \quad T \sim t_\nu . \]

Because the distribution is symmetric,

\[ P(T < -t_{\alpha,\nu}) = \alpha. \]

These critical values are commonly tabulated and are used to construct confidence intervals and perform hypothesis tests.
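
In R, critical values come from `qt()` rather than a table. The sketch below assumes \(\alpha = 0.05\) and \(\nu = 4\), both illustrative, and confirms the two tail probabilities with `pt()`.

```r
alpha <- 0.05
nu    <- 4       # degrees of freedom (illustrative)

# Upper-tail critical value t_{alpha, nu}
t_crit <- qt(1 - alpha, df = nu)
t_crit

# Both tail probabilities equal alpha, by symmetry
upper_tail <- pt(t_crit, df = nu, lower.tail = FALSE)
lower_tail <- pt(-t_crit, df = nu)
c(upper_tail = upper_tail, lower_tail = lower_tail)
```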


8.7.2 Confidence Intervals: Unknown Variance

When the population variance is unknown, the confidence interval for the population mean is constructed using the Student \(t\) distribution.

Definition 8.1 (Confidence Interval for the Mean (Unknown Variance)) Let \(y_1, y_2, \dots, y_n\) be a random sample from a normal population with unknown variance.

A \((1-\alpha)\) confidence interval for the population mean \(\mu\) is

\[ \bar{y} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \]

where

  • \(\bar{y}\) is the sample mean
  • \(s\) is the sample standard deviation
  • \(t_{\alpha/2,n-1}\) is the critical value from the Student \(t\) distribution with \(n-1\) degrees of freedom

Compared to the normal confidence interval, the differences are that the sample standard deviation \(s\) replaces \(\sigma\) and that we use the \(t\) critical value instead of the \(z\) critical value.

Because the \(t\) distribution has heavier tails, the resulting confidence interval is typically wider, reflecting the additional uncertainty.
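
A simulation shows why the wider interval is needed. The sketch below builds two intervals from each normal sample, both using \(s\), one with the \(z\) critical value and one with the \(t\) critical value, and records how often each covers the true mean. The settings follow the blood alcohol example; the number of replications and the seed are assumptions.

```r
set.seed(1)

n       <- 5
alpha   <- 0.05
true_mu <- 0.085
sigma   <- 0.03
reps    <- 20000   # number of simulated samples (assumed value)

z_crit <- qnorm(1 - alpha / 2)
t_crit <- qt(1 - alpha / 2, df = n - 1)

# For each sample, check coverage of intervals built with s and either critical value
covers <- replicate(reps, {
  y   <- rnorm(n, mean = true_mu, sd = sigma)
  err <- abs(mean(y) - true_mu)
  se  <- sd(y) / sqrt(n)
  c(z_with_s = err <= z_crit * se, t_with_s = err <= t_crit * se)
})

coverage <- rowMeans(covers)
coverage
```

With \(n = 5\), the interval that pairs \(s\) with the \(z\) critical value covers the true mean noticeably less often than 95%, while the \(t\) interval stays close to the nominal level.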


8.7.2.1 Example: BAC Continued (Unknown Variance)

Suppose we collected a small sample of \(n=5\) BAC measurements from a driver suspected of driving under the influence. The legal limit is \(0.08\). Since the population variance is unknown and the sample size is small, we use the Student \(t\) distribution.

We compute a 95% confidence interval for the population mean BAC.

First, suppose the observed BAC values are:

# Set Seed
set.seed(2026)

# Simulation settings
n      <- 5
alpha  <- 0.05
reaBac <- 0.085
sd     <- 0.03

# Draws Samples
bac <- rnorm(n = n, mean = reaBac, sd = sd)

We begin by computing the sample size, sample mean, and sample standard deviation.

n <- length(bac)
y_bar <- mean(bac)
s <- sd(bac)

n
## [1] 5
y_bar
## [1] 0.07797249
s
## [1] 0.01917043

Next, we compute the standard error of the mean.

se <- s / sqrt(n)
se
## [1] 0.008573276

For a 95% confidence interval we need the critical value from the \(t\) distribution with \(n-1\) degrees of freedom.

alpha <- 0.05
t_crit <- qt(1 - alpha/2, df = n - 1)
t_crit
## [1] 2.776445

The confidence interval is

\[ \bar{y} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \]

We compute the interval endpoints.

lower <- y_bar - t_crit * se
upper <- y_bar + t_crit * se

lower
## [1] 0.05416926
upper
## [1] 0.1017757

Thus, the 95% confidence interval for the mean BAC is (0.0542, 0.1018).

This interval provides a range of plausible values for the population mean BAC.

Finally, we can verify the result using the built-in R function t.test().

t.test(bac, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  bac
## t = 9.0948, df = 4, p-value = 0.0008105
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.05416926 0.10177572
## sample estimates:
##  mean of x 
## 0.07797249

The output confirms the same confidence interval calculated manually.


8.7.3 Hypothesis Testing: Unknown Variance

Hypothesis testing for the mean with unknown variance also uses the Student \(t\) distribution.

Definition 8.2 (t Test for the Mean) Let \(y_1, y_2, \dots, y_n\) be a random sample from a normal population with unknown variance.

To test

  • \(H_0: \mu = \mu_0\)

we use the test statistic

\[ t = \frac{\bar{y}-\mu_0}{s/\sqrt{n}} \]

Under \(H_0\), this statistic follows a Student \(t\) distribution with \(n-1\) degrees of freedom.

The steps for the hypothesis test are:

  1. State the hypotheses
  2. Compute the test statistic
  3. Determine the rejection region or compute the p-value
  4. Make a decision
  5. Interpret the result in context

8.7.3.1 Example: BAC Continued (Hypothesis Testing, Unknown Variance)

We now test whether the driver's mean BAC exceeds the legal limit of 0.08. Since the population variance is unknown and the sample size is small, we use a one-sample \(t\) test.

We test

  • \(H_0: \mu = 0.08\)
  • \(H_a: \mu > 0.08\)

Suppose the BAC measurements are:

# Set Seed
set.seed(2026)

# Simulation settings
n      <- 5
alpha  <- 0.05
reaBac <- 0.085
sd     <- 0.03

# Draws Samples
bac <- rnorm(n = n, mean = reaBac, sd = sd)

First, compute the sample size, sample mean, and sample standard deviation.

n <- length(bac)
y_bar <- mean(bac)
s <- sd(bac)

n
## [1] 5
y_bar
## [1] 0.07797249
s
## [1] 0.01917043

Next, compute the test statistic

\[ t = \frac{\bar{y}-\mu_0}{s/\sqrt{n}} \]

where \(\mu_0 = 0.08\).

mu0 <- 0.08

t_stat <- (y_bar - mu0) / (s / sqrt(n))
t_stat
## [1] -0.2364921

The test statistic follows a \(t\) distribution with \(n-1\) degrees of freedom under the null hypothesis.

\[ \nu = n - 1 \]

df <- n - 1
df
## [1] 4

Next, compute the p-value for the right-tailed test.

p_value <- 1 - pt(t_stat, df)
p_value
## [1] 0.5876661

If the p-value is smaller than the significance level \(\alpha = 0.05\), we reject \(H_0\).

alpha <- 0.05
p_value < alpha
## [1] FALSE

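Here the comparison returns FALSE, so we fail to reject the null hypothesis: the data do not provide sufficient evidence that the mean BAC is greater than the legal limit. Had the value been TRUE, we would have rejected \(H_0\) and concluded that the mean BAC exceeds the limit.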

Finally, we can verify the result using the built-in t.test() function in R.

t.test(bac, mu = 0.08, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  bac
## t = -0.23649, df = 4, p-value = 0.5877
## alternative hypothesis: true mean is greater than 0.08
## 95 percent confidence interval:
##  0.05969558        Inf
## sample estimates:
##  mean of x 
## 0.07797249

This function performs the same hypothesis test and returns the test statistic, degrees of freedom, p-value, and confidence interval.


8.8 Not Normally Distributed Population

In the previous sections, our inference procedures relied on an important assumption:

The population from which the sample is drawn is normally distributed.

When this assumption holds, the theoretical distributions used in inference are exact. In particular:

  • The sample mean has an exact normal distribution.
  • When the variance is unknown, the statistic

\[ t = \frac{\bar{y}-\mu}{s/\sqrt{n}} \]

follows a Student \(t\) distribution.

However, in practice many populations are not normally distributed. Real-world data may be skewed, heavy-tailed, or contain outliers.

When the population distribution is not normal, the reliability of the inference procedures we developed depends strongly on the sample size and on whether the population variance is known or unknown.

We analyze three scenarios:

  • Large sample size
  • Intermediate sample size
  • Small sample size

8.8.1 Big Sample Size

When the sample size is large, the procedures previously described for inference about the mean are generally reliable even when the population distribution is not normal.

A common rule of thumb is

\[ n \ge 30 \]

although the required sample size may depend on how strongly the population deviates from normality.

The key reason these methods remain valid is the Central Limit Theorem, which describes the behavior of the sample mean for large samples.


8.8.1.1 Distribution of the Sample Mean

Even when the population is not normally distributed, the Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size becomes large.

Definition 8.3 (Central Limit Theorem) Let \(y_1, y_2, \dots, y_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2\).

As the sample size \(n\) becomes large, the sampling distribution of the sample mean \(\bar{y}\) approaches a normal distribution with

\[ \mathbb{E}[\bar{y}] = \mu \]

and

\[ \mathbb{V}[\bar{y}] = \frac{\sigma^2}{n}. \]

Equivalently, the standard deviation of \(\bar{y}\) (its standard error) is \(\sigma/\sqrt{n}\).

Because of this result, the sampling distribution of the sample mean is approximately normal for large \(n\), even if the population distribution itself is not normal.
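
One way to watch the theorem at work is to track the skewness of the sample mean as \(n\) grows. The sketch below samples from an Exponential(1) population, a strongly right-skewed distribution chosen only for illustration; the helper `skewness`, the number of replications, and the seed are assumptions.

```r
set.seed(1)

# Sample skewness (third standardized moment)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

reps     <- 20000            # simulated samples per sample size (assumed value)
n_values <- c(2, 10, 30, 100)

# Skewness of the sample mean when the population is Exponential(rate = 1)
skew_of_means <- sapply(n_values, function(n) {
  # Each row of the matrix is one simulated sample of size n
  means <- rowMeans(matrix(rexp(reps * n, rate = 1), nrow = reps))
  skewness(means)
})

data.frame(n = n_values, skewness_of_mean = skew_of_means)
```

A normal distribution has skewness 0. The skewness of \(\bar{y}\) shrinks steadily as \(n\) increases, although it is still visible at \(n = 30\) for a population this skewed.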

Variance known

If the population variance \(\sigma^2\) is known, the standardized statistic

\[ \frac{\bar{y}-\mu}{\sigma/\sqrt{n}} \]

is approximately standard normal for large \(n\). Therefore, the normal-based confidence intervals and hypothesis tests introduced earlier are typically reliable.

Variance unknown

If the population variance is unknown, we replace \(\sigma\) with the sample standard deviation \(s\). For large samples, \(s\) provides a good estimate of \(\sigma\), and the statistic

\[ \frac{\bar{y}-\mu}{s/\sqrt{n}} \]

is approximately normal as well.

As a result, the Student \(t\) procedures for the mean also work well for large samples, even when the population distribution is not normal.


8.8.1.2 Distribution of the Sample Variance

The behavior of the sample variance depends more strongly on the population distribution.

If the population is normal, the sample variance has a distribution related to the chi-square distribution. If the population is not normal, this exact result no longer holds.

However, for large samples:

  • The sample variance remains a consistent estimator of \(\sigma^2\).
  • The standard error \(s/\sqrt{n}\) provides a good approximation to the variability of the sample mean.

Therefore:

  • When the variance is known, inference about the mean remains reliable.
  • When the variance is unknown, the \(t\)-based procedures also remain reliable for large \(n\).

8.8.1.3 Example: Big Sample Size

# set.seed(123)

# Simulation settings
rep    <- 1000

n      <- 30
alpha  <- 0.05
reaAlc <- 0.085
sd     <- 0.03
df     <- 1
varKno <- FALSE

densityChiSquare <- function(x) {dchisq(x = x, df = df)}

curve(densityChiSquare, from = 0, to = 5 * df)

ci_bounds <- replicate(rep, {
  # Creates the Observations
  obsErr <- (rchisq(n, df = df) - df) / sqrt(2 * df) * sd
  x      <- reaAlc + obsErr
  # Computes the Sample Mean
  x_bar  <- mean(x)
  # Uses the known standard deviation, or estimates it from the sample if necessary
  if(varKno){
    s <- sd
  } else {
    s <- sd(x)
  }
  # Computes the Confidence Intervals
  lower <- x_bar - qt(1 - alpha/2, df = n-1) * s / sqrt(n)
  upper <- x_bar + qt(1 - alpha/2, df = n-1) * s / sqrt(n)
  # Saves
  c(lower, upper)
})

ci_bounds <- t(ci_bounds)

conInt <-  ci_bounds[, 1] <= reaAlc & ci_bounds[, 2] >= reaAlc

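print(paste0("Proportion of Confidence Intervals that contain the True Parameter: ", mean(conInt)))
## [1] "Proportion of Confidence Intervals that contain the True Parameter: 0.91"

In the run shown, about 91% of the intervals contain the true mean, somewhat below the nominal 95%. The errors here follow a strongly right-skewed chi-square distribution with 1 degree of freedom, so \(n = 30\) is not yet large enough for the \(t\) interval to reach its nominal coverage. This illustrates why the sample size required by the rule of thumb depends on how far the population departs from normality.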

8.8.2 Intermediate Sample Size

When the sample size is moderate, the reliability of inference procedures depends more strongly on the shape of the population distribution.

A rough range for intermediate sample sizes is

\[ 10 \le n < 30. \]

In this range, the Central Limit Theorem begins to take effect, but the sampling distributions may still be influenced by the shape of the population.


8.8.2.1 Distribution of the Sample Mean

For intermediate sample sizes, the sampling distribution of \(\bar{y}\) may still show noticeable deviations from normality if the population distribution is highly skewed or heavy-tailed.

Variance known

If the variance is known, the statistic

\[ \frac{\bar{y}-\mu}{\sigma/\sqrt{n}} \]

is no longer guaranteed to follow a normal distribution unless the population itself is normal.

However:

  • If the population distribution is approximately symmetric, the normal approximation is often acceptable.
  • If the population is strongly skewed, the approximation may be poor.

Variance unknown

When the variance is unknown, an additional source of variability appears because \(s\) is estimated from the sample.

In this situation:

  • The statistic used in the \(t\) procedures may not follow the theoretical \(t\) distribution.
  • The inference procedures may still work reasonably well when the population is close to symmetric.
  • If the population is highly skewed, the procedures may become unreliable.
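
The contrast between symmetric and skewed populations can be illustrated by simulation. The sketch below estimates the coverage of the 95% \(t\) interval at \(n = 15\) for a Uniform(0, 1) population (symmetric) and an Exponential(1) population (right-skewed). Both populations, the helper `t_coverage`, the number of replications, and the seed are illustrative assumptions.

```r
set.seed(1)

n     <- 15      # an intermediate sample size (illustrative)
alpha <- 0.05
reps  <- 20000   # number of simulated samples (assumed value)

# Empirical coverage of the t interval when sampling with `draw`,
# a function that generates k observations from a population with mean true_mu
t_coverage <- function(draw, true_mu) {
  t_crit <- qt(1 - alpha / 2, df = n - 1)
  hits <- replicate(reps, {
    y <- draw(n)
    abs(mean(y) - true_mu) <= t_crit * sd(y) / sqrt(n)
  })
  mean(hits)
}

coverage <- c(
  symmetric_uniform  = t_coverage(function(k) runif(k), true_mu = 0.5),
  skewed_exponential = t_coverage(function(k) rexp(k, rate = 1), true_mu = 1)
)
coverage
```

Coverage for the symmetric population should sit close to the nominal 95%, while coverage for the skewed population falls short of it.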

8.8.2.2 Distribution of the Sample Variance

For intermediate sample sizes, the sample variance can be sensitive to extreme observations.

This is particularly important when:

  • the population distribution has heavy tails, or
  • the sample contains outliers.

Consequently:

  • When the variance is known, inference about the mean depends mainly on the behavior of the sample mean.
  • When the variance is unknown, the additional uncertainty from estimating \(s\) can further reduce the reliability of the inference procedures.

8.8.3 Small Sample Size

When the sample size is small, deviations from normality can have a large effect on inference.

A common rule of thumb is

\[ n < 10. \]

In this situation, the theoretical results that justify the inference procedures developed earlier rely strongly on the assumption that the population distribution is normal.


8.8.3.1 Distribution of the Sample Mean

If the population is not normally distributed and the sample size is small, the sampling distribution of the sample mean may be:

  • skewed,
  • heavy-tailed,
  • very different from a normal distribution.

Variance known

Even if the population variance \(\sigma^2\) is known, the statistic

\[ \frac{\bar{y}-\mu}{\sigma/\sqrt{n}} \]

does not necessarily follow a normal distribution when the population is not normal.

Therefore, the normal-based inference procedures may not be valid.

Variance unknown

When the variance is unknown, the situation becomes even more difficult.

The \(t\) procedures rely on the assumption that the population distribution is normal when the sample size is small. If this assumption does not hold, the statistic

\[ \frac{\bar{y}-\mu}{s/\sqrt{n}} \]

may not follow the Student \(t\) distribution.

As a result, the confidence intervals and hypothesis tests based on the \(t\) distribution may perform poorly.


8.8.3.2 Distribution of the Sample Variance

The sample variance is particularly unstable when the sample size is small.

In small samples:

  • a single extreme observation can greatly affect \(s^2\),
  • the estimate of the standard error \(s/\sqrt{n}\) may be unreliable.

These issues further reduce the reliability of the inference procedures when the variance is unknown.
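A small numerical example makes the instability concrete. The sketch below uses five made-up measurements (hypothetical values, not data from the text) and replaces one of them with an extreme observation to show how strongly \(s^2\) and the estimated standard error react.

```r
# Illustrative data: five hypothetical measurements
y_clean   <- c(1, 2, 3, 4, 5)
# Same data with the last value replaced by an extreme observation
y_outlier <- c(1, 2, 3, 4, 50)

n <- length(y_clean)

# Sample variances
var_clean   <- var(y_clean)     # 2.5
var_outlier <- var(y_outlier)   # 452.5

# Estimated standard errors s / sqrt(n)
se_clean   <- sd(y_clean) / sqrt(n)
se_outlier <- sd(y_outlier) / sqrt(n)

print(c(var_clean = var_clean, var_outlier = var_outlier))
print(c(se_clean = se_clean, se_outlier = se_outlier))
```

A single changed value multiplies the sample variance by a factor of 181 in this example, and the estimated standard error grows accordingly.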


When both conditions occur:

  • the sample size is small, and
  • the population distribution is not approximately normal,

standard inference procedures for the mean may not be reliable.

In such situations, alternative approaches can be used that rely less heavily on distributional assumptions. One important approach is resampling methods, which we introduce in the next section.


8.8.4 What to Check Before Applying Inference Methods

When the population distribution is not known to be normal, the reliability of inference about the mean depends mainly on three elements:

  • the sample size
  • the shape of the data distribution
  • whether the population variance is known or unknown

Before applying the inference procedures introduced earlier, it is useful to evaluate the situation using the following guidelines.


8.8.4.1 Large Sample Size (\(n \ge 30\))

When the sample size is large, inference procedures for the mean are generally reliable even if the population is not normally distributed.

This occurs because the Central Limit Theorem ensures that the sampling distribution of the sample mean is approximately normal.

What to check:

  • Verify that the sample size is large (\(n \ge 30\)).
  • Look at a histogram or boxplot to confirm that there are no extreme outliers.
  • Confirm that the sample was obtained using an appropriate random sampling method.

If these conditions are satisfied:

  • If the variance \(\sigma^2\) is known, the normal-based procedures for inference about \(\mu\) can be used.
  • If the variance \(\sigma^2\) is unknown, the Student \(t\) procedures introduced earlier can be used.

8.8.4.2 Intermediate Sample Size (\(10 \le n < 30\))

When the sample size is moderate, the validity of the inference procedures depends more strongly on the shape of the data.

What to check:

  • Examine the shape of the sample distribution using graphical summaries such as histograms or boxplots.
  • Determine whether the data appear approximately symmetric.
  • Check for strong skewness or extreme outliers.

If the data appear approximately symmetric:

  • The previously described inference procedures are often still reasonable.

If the data show strong skewness or outliers:

  • The normal or \(t\) approximations may be unreliable.

The effect of non-normality is particularly important when the variance is unknown, since the standard error must be estimated using the sample variance.


8.8.4.3 Small Sample Size (\(n < 10\))

When the sample size is small, inference procedures depend heavily on the assumption that the population distribution is normal.

What to check:

  • Consider whether there is strong prior knowledge that the population distribution is approximately normal.
  • Examine the sample data for skewness or outliers, keeping in mind that small samples provide limited information about the true distribution.

If the population is approximately normal:

  • The inference procedures previously described remain valid.

If the population distribution is not approximately normal:

  • The theoretical distributions used in those procedures may not provide reliable results.

In these situations, alternative methods that rely less on distributional assumptions can be used. One such approach is bootstrap methods, which will be discussed in the next section.


8.8.5 Summary Table of When Inference Procedures Are Reliable

The following table summarizes when the inference procedures for the population mean can generally be applied when the population distribution is not known to be normal.

| Sample Size | Population Shape | Variance Known | Variance Unknown | Reliability of Inference |
|---|---|---|---|---|
| \(n \ge 30\) | Any reasonable distribution | Normal procedures generally reliable | \(t\) procedures generally reliable | Usually reliable due to the Central Limit Theorem |
| \(10 \le n < 30\) | Approximately symmetric | Normal approximation often acceptable | \(t\) procedures often acceptable | Depends on skewness and outliers |
| \(10 \le n < 30\) | Strongly skewed or heavy-tailed | Normal approximation may be unreliable | \(t\) procedures may be unreliable | Use caution |
| \(n < 10\) | Approximately normal | Normal procedures valid | \(t\) procedures valid | Reliable if normality assumption holds |
| \(n < 10\) | Not approximately normal | Normal procedures unreliable | \(t\) procedures unreliable | Alternative methods recommended |

This table provides a practical guideline for deciding whether the inference procedures previously introduced can be applied.

When the conditions for these procedures are not satisfied, alternative approaches that rely less on distributional assumptions may be needed. One such approach is bootstrap methods, which are introduced in the next section.
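The guidelines in the summary table can be written as a small lookup function. This is only a sketch: the function name `mean_inference_guideline` and the argument `shape_ok` are hypothetical, the thresholds are the rules of thumb from this section, and the judgment about shape still has to come from the analyst (for example, from a histogram or boxplot).

```r
# Hypothetical helper: encodes the rules of thumb from the summary table.
# `shape_ok` means "approximately symmetric" for 10 <= n < 30 and
# "approximately normal" for n < 10; it is a judgment supplied by the analyst.
mean_inference_guideline <- function(n, shape_ok) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 1, is.logical(shape_ok))
  if (n >= 30) {
    "usually reliable (Central Limit Theorem); still check for extreme outliers"
  } else if (n >= 10) {
    if (shape_ok) "often acceptable" else "use caution; approximations may be unreliable"
  } else {
    if (shape_ok) "reliable if the normality assumption holds" else "alternative methods recommended"
  }
}

print(mean_inference_guideline(n = 50, shape_ok = FALSE))
print(mean_inference_guideline(n = 15, shape_ok = TRUE))
print(mean_inference_guideline(n = 6,  shape_ok = FALSE))
```

The function does not replace looking at the data; it only makes the decision logic of the table explicit.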


8.9 Inferences with Small n (Bootstrap Methods)

When the sample size is small, the assumptions required for the \(t\) methods may not hold. In particular, the methods rely on the population being approximately normally distributed.

If the normality assumption is questionable, we can use bootstrap methods.

Bootstrap methods use resampling to approximate the sampling distribution of a statistic.

Instead of relying on theoretical distributions, we repeatedly sample from the observed data and compute the statistic of interest.


8.9.1 When to Apply

Bootstrap methods are useful when:

  • The sample size is small
  • The population distribution is unknown
  • The normality assumption may not hold

They are particularly useful when theoretical sampling distributions are difficult to derive.

However, bootstrap methods still require that the sample be representative of the population.


8.9.2 Steps for Bootstrap Confidence Intervals about the Mean

The bootstrap approach approximates the sampling distribution of the sample mean by repeatedly sampling from the observed data.

Steps:

  1. Start with the observed sample \(y_1, y_2, \ldots, y_n\)

  2. Draw a bootstrap sample of size \(n\) from the data with replacement

  3. Compute the sample mean for the bootstrap sample

  4. Repeat steps 2–3 many times (for example, 10,000 times)

  5. The collection of bootstrap means approximates the sampling distribution

  6. Construct a confidence interval using the percentiles of the bootstrap distribution

For a 95% confidence interval we use the 2.5% and 97.5% percentiles.
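For a single observed sample, the steps above can be sketched as follows. The data vector `y` contains made-up BAC readings, and the function name `boot_percentile_ci` and its defaults are assumptions chosen for this illustration.

```r
# Sketch of a percentile bootstrap interval for the mean.
# `y` holds hypothetical BAC readings made up for this illustration.
set.seed(2024)
y <- c(0.081, 0.092, 0.078, 0.088, 0.095, 0.083, 0.079, 0.090, 0.086, 0.084)

boot_percentile_ci <- function(y, B = 10000, conf = 0.95) {
  n <- length(y)
  # Steps 2-4: resample with replacement and record each bootstrap mean
  boot_means <- replicate(B, mean(sample(y, size = n, replace = TRUE)))
  # Step 6: percentiles of the bootstrap distribution
  a <- (1 - conf) / 2
  quantile(boot_means, probs = c(a, 1 - a), names = FALSE)
}

ci <- boot_percentile_ci(y)
print(ci)
```

The interval is read directly from the empirical distribution of the bootstrap means, so no normal or \(t\) quantile is involved.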

# set.seed(123)

# Simulation settings
rep    <- 100

n      <- 10
alpha  <- 0.05
reaAlc <- 0.085
sd     <- 0.01
df     <- 100
B      <- 1000


densityChiSquare <- function(x) {dchisq(x = x, df = df)}

curve(densityChiSquare, from = 0, to = 5 * df)

ci_bounds <- replicate(rep, {
  # Creates the Observations
  obsErr <- (rchisq(n, df = df) - df) / sqrt(2 * df) * sd
  x      <- reaAlc + obsErr
  # Computes the Sample Mean
  x_bar  <- mean(x)
  # Resamples
  booSam <- replicate(n = B, expr = sample(x = x, size = n, replace = TRUE))
  # Computes Means
  booMea <- colMeans(booSam)
  # Computes the Quantiles
  lower <- quantile(x = booMea, probs = 0.025)
  upper <- quantile(x = booMea, probs = 0.975)
  # Saves
  c(lower, upper)
})

ci_bounds <- t(ci_bounds)

conInt <-  ci_bounds[, 1] <= reaAlc & ci_bounds[, 2] >= reaAlc

print(paste0("Number of Samples: ", n))
## [1] "Number of Samples: 10"
print(paste0("Percentage of Confidence Intervals that contain the True Parameter: ", mean(conInt)))
## [1] "Percentage of Confidence Intervals that contain the True Parameter: 0.83"

8.9.3 Steps for Bootstrap Hypothesis Testing about the Mean

Bootstrap methods can also be used for hypothesis testing.

The key idea is to simulate the sampling distribution under the null hypothesis.

Steps:

  1. Assume the null hypothesis is true

  2. Center the data so that the sample mean equals the null value

  3. Draw bootstrap samples from the centered data

  4. Compute the bootstrap test statistic for each sample

  5. Compare the observed statistic to the bootstrap distribution

  6. Compute the p-value as the proportion of simulated statistics that are at least as extreme as the observed statistic, in the direction of the alternative hypothesis
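For a single observed sample, the steps above can be sketched as follows. The example assumes a right-tailed alternative \(H_a:\mu>\mu_0\); the data, the null value, and the function name `boot_mean_test` are hypothetical.

```r
# Sketch of a one-sample bootstrap test of H0: mu = mu0 against Ha: mu > mu0.
# The data and the null value are hypothetical.
set.seed(2024)
y   <- c(0.081, 0.092, 0.078, 0.088, 0.095, 0.083, 0.079, 0.090, 0.086, 0.084)
mu0 <- 0.08

boot_mean_test <- function(y, mu0, B = 10000) {
  n      <- length(y)
  y_bar  <- mean(y)
  # Step 2: shift the data so that its mean equals the null value
  y_null <- y - y_bar + mu0
  # Steps 3-4: bootstrap means generated under the null hypothesis
  boot_means <- replicate(B, mean(sample(y_null, size = n, replace = TRUE)))
  # Steps 5-6: right-tailed p-value
  mean(boot_means >= y_bar)
}

p_value <- boot_mean_test(y, mu0)
print(p_value)
```

For a left-tailed alternative the comparison in the last step would be reversed (`boot_means <= y_bar`).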

# set.seed(123)

# Simulation settings
rep    <- 100

n      <- 10
alpha  <- 0.05
reaAlc <- 0.085
sd     <- 0.01
mu0    <- reaAlc
df     <- 100
B      <- 1000


densityChiSquare <- function(x) {dchisq(x = x, df = df)}

curve(densityChiSquare, from = 0, to = 5 * df)

booPva <- replicate(rep, {
  # Creates the Observations
  obsErr <- (rchisq(n, df = df) - df) / sqrt(2 * df) * sd
  x      <- reaAlc + obsErr
  # Computes the Sample Mean
  x_bar  <- mean(x)
  # Centers
  x      <- x - x_bar + mu0
  # Resamples
  booSam <- replicate(n = B, expr = sample(x = x, size = n, replace = TRUE))
  # Computes Means
  booMea <- colMeans(booSam)
  # Computes the left-tailed p-value: proportion of null bootstrap means below the observed mean
  mean(x_bar > booMea)
})

# Proportion of times that the null hypothesis is rejected
rejPer <- mean(booPva < alpha)
print(paste0("Error rate: ", rejPer))
## [1] "Error rate: 0.11"

Bootstrap hypothesis tests are especially useful when the theoretical distribution of the test statistic is unknown or unreliable.


8.9.4 Bootstrap-\(t\) or Studentized Bootstrap

When the population distribution is not normal and the sample size is not sufficiently large for the Central Limit Theorem to provide a reliable approximation, the classical normal and \(t\) inference procedures may perform poorly.

A refinement of the basic bootstrap confidence interval is the Bootstrap-\(t\) (Studentized Bootstrap) method. This approach attempts to mimic the sampling distribution of the \(t\)-statistic rather than the sampling distribution of the sample mean itself.

The classical \(t\) statistic used in inference about the mean is

\[ t = \frac{\bar{y} - \mu}{s/\sqrt{n}} \]

which standardizes the sample mean by its estimated standard error.

The Bootstrap-\(t\) method applies this same idea within the bootstrap framework.

Instead of directly approximating the distribution of \(\bar{y}\), we approximate the distribution of the statistic

\[ t^* = \frac{\bar{y}^* - \bar{y}}{s^*/\sqrt{n}} \]

where

  • \(\bar{y}^*\) is the mean of a bootstrap sample
  • \(s^*\) is the standard deviation of the bootstrap sample

This procedure studentizes the bootstrap estimate by accounting for variability in the standard error.
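For a single sample, the construction can be sketched as follows. The data are made-up readings, and the function name `boot_t_ci` and its defaults are assumptions for this illustration. The interval is obtained by inverting the pivot \(t\): if \(q_{lo}\) and \(q_{hi}\) are the lower and upper quantiles of \(t^*\), then solving \(q_{lo} \le (\bar{y}-\mu)/(s/\sqrt{n}) \le q_{hi}\) for \(\mu\) gives \(\bar{y} - q_{hi}\, s/\sqrt{n} \le \mu \le \bar{y} - q_{lo}\, s/\sqrt{n}\).

```r
# Sketch of a bootstrap-t (studentized) interval for the mean.
# `y` holds hypothetical readings; B and conf are illustrative defaults.
set.seed(2024)
y <- c(0.081, 0.092, 0.078, 0.088, 0.095, 0.083, 0.079, 0.090, 0.086, 0.084)

boot_t_ci <- function(y, B = 5000, conf = 0.95) {
  n     <- length(y)
  y_bar <- mean(y)
  se    <- sd(y) / sqrt(n)
  # t* for each bootstrap sample, centered at the observed mean
  t_star <- replicate(B, {
    yb <- sample(y, size = n, replace = TRUE)
    (mean(yb) - y_bar) / (sd(yb) / sqrt(n))
  })
  # Drop the rare resamples with zero spread (t* undefined or infinite)
  t_star <- t_star[is.finite(t_star)]
  a <- (1 - conf) / 2
  q <- quantile(t_star, probs = c(a, 1 - a), names = FALSE)
  # Invert the pivot: the upper quantile gives the lower bound and vice versa
  c(lower = y_bar - q[2] * se, upper = y_bar - q[1] * se)
}

ci_t <- boot_t_ci(y)
print(ci_t)
```

Note that the upper quantile of \(t^*\) determines the lower bound. When the bootstrap distribution of \(t^*\) is skewed, this swap is what lets the interval adapt its shape.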

The following code exemplifies the studentized bootstrap confidence intervals and hypothesis testing.

# set.seed(123)

# Simulation settings
rep    <- 100

n      <- 5
alpha  <- 0.05
reaAlc <- 0.085
sd     <- 0.01
df     <- 1
B      <- 1000


densityChiSquare <- function(x) {dchisq(x = x, df = df)}

curve(densityChiSquare, from = 0, to = 5 * df)

ci_bounds <- replicate(rep, {
  # Creates the Observations
  obsErr <- (rchisq(n, df = df) - df) / sqrt(2 * df) * sd
  x      <- reaAlc + obsErr
  # Computes the Sample Mean
  x_bar  <- mean(x)
  # Computes the Sample SD
  sdx    <- sd(x)
  # Resamples
  booSam <- replicate(n = B, expr = sample(x = x, size = n, replace = TRUE))
  # Computes Means
  booMea <- colMeans(booSam)
  # Computes the SD
  booSD  <- apply(X = booSam, MARGIN = 2, FUN = sd)
  # Computes the t statistic
  booTst <- (booMea - x_bar) / (booSD / sqrt(n)) 
  # Computes the Quantiles of the bootstrap t
  booT025 <- quantile(x = booTst, probs = 0.025)
  booT975 <- quantile(x = booTst, probs = 0.975)
  # Computes the bootstrap-t lower and upper bound (inverting the pivot swaps the quantiles)
  lower <- x_bar - booT975 * sdx / sqrt(n)
  upper <- x_bar - booT025 * sdx / sqrt(n)
  # Saves
  c(lower, upper)
})

ci_bounds <- t(ci_bounds)

conInt <-  ci_bounds[, 1] <= reaAlc & ci_bounds[, 2] >= reaAlc

print(paste0("Number of Samples: ", n))
## [1] "Number of Samples: 5"
print(paste0("Percentage of Confidence Intervals that contain the True Parameter: ", mean(conInt)))
## [1] "Percentage of Confidence Intervals that contain the True Parameter: 0.9"

# Studentized bootstrap hypothesis test
# set.seed(123)

# Simulation settings
rep    <- 1000

n      <- 5
alpha  <- 0.05
reaAlc <- 0.085
mu0    <- 0.085
sd     <- 0.01
df     <- 1
B      <- 1000


densityChiSquare <- function(x) {dchisq(x = x, df = df)}

curve(densityChiSquare, from = 0, to = 5 * df)

booPva <- replicate(rep, {
  # Creates the Observations
  obsErr <- (rchisq(n, df = df) - df) / sqrt(2 * df) * sd
  x      <- reaAlc + obsErr
  # Computes the Sample Mean
  x_bar  <- mean(x)
  # Computes the Sample SD
  sdx    <- sd(x)
  # Computes the test statistic
  tst    <- (x_bar - mu0) / (sdx / sqrt(n))
  # Resamples
  booSam <- replicate(n = B, expr = sample(x = x, size = n, replace = TRUE))
  # Computes Means
  booMea <- colMeans(booSam)
  # Computes the SD
  booSD  <- apply(X = booSam, MARGIN = 2, FUN = sd)
  # Computes the t statistic
  booTst <- (booMea - x_bar) / (booSD / sqrt(n))
  # Computes the left-tailed p-value: proportion of bootstrap t statistics below the observed t
  mean(tst > booTst)
})

# Proportion of times that the null hypothesis is rejected
rejPer <- mean(booPva < alpha)
print(paste0("Error rate: ", rejPer))
## [1] "Error rate: 0.091"

8.9.4.1 Why Studentization Helps

Studentization adjusts the statistic by its estimated variability. This typically produces an approximation that is:

  • more stable across different samples
  • less sensitive to skewness in the population distribution
  • more accurate for moderate sample sizes

Because of this adjustment, the Bootstrap-\(t\) method often performs better than simpler bootstrap intervals when:

  • the population distribution is skewed
  • the variance is unknown
  • the sample size is not very large

8.9.4.2 Relationship to Classical Inference

Notice that the Bootstrap-\(t\) method mirrors the logic of the classical \(t\) procedures discussed earlier:

| Classical Method | Bootstrap Analogue |
|---|---|
| \(t = (\bar{y}-\mu)/(s/\sqrt{n})\) | \(t^* = (\bar{y}^*-\bar{y})/(s^*/\sqrt{n})\) |
| theoretical \(t\) distribution | empirical bootstrap distribution |
| assumptions about normality | fewer distributional assumptions |

Thus, the Bootstrap-\(t\) method replaces the theoretical \(t\) distribution with an empirical distribution obtained through resampling.

8.9.4.3 When the Bootstrap-\(t\) Method is Useful

The Bootstrap-\(t\) method becomes particularly useful in the following situations:

  • the population distribution is not normal
  • the sample size is small or moderate
  • the variance of the population is unknown
  • classical normal or \(t\) approximations are questionable

In these settings, bootstrap approaches can provide more reliable inference by using the observed data to approximate the sampling distribution.

The code earlier in this section illustrates the implementation of both the Bootstrap-\(t\) confidence interval and the corresponding hypothesis test.

8.10 Power and Sample Size

Up to this point, hypothesis testing has focused mainly on controlling the probability of a Type I error.

That is, when we choose a significance level \(\alpha\), we are controlling the probability of rejecting \(H_0\) when \(H_0\) is actually true.

But that is only part of the story.

A good hypothesis test should not only avoid false positives. It should also be able to detect meaningful departures from the null hypothesis when they truly exist.

This idea leads to the concept of power.


8.10.1 Motivation

Suppose we are testing

\[ H_0:\mu=\mu_0 \]

against an alternative such as

\[ H_a:\mu>\mu_0. \]

If the true population mean really is larger than \(\mu_0\), then we would like our test to reject \(H_0\) with high probability.

However, this does not happen automatically.

Even when the alternative is true, the sample mean still varies from sample to sample. If the sample is small or the variability is large, the observed sample mean may fail to look sufficiently different from \(\mu_0\), and the test may not reject.

So even under a false null hypothesis, a test can fail to detect the difference.

This is why we need a way to measure the sensitivity of a test.


8.10.2 Type II Error and Power

Definition 8.4 (Type II Error) A Type II error occurs when we fail to reject \(H_0\) even though \(H_0\) is false.

The probability of a Type II error is denoted by

\[ \beta. \]

Definition 8.5 (Power of a Test) The power of a hypothesis test is the probability of rejecting \(H_0\) when \(H_0\) is false.

So power is

\[ 1-\beta. \]

This is one of the most important quantities in hypothesis testing.

While \(\alpha\) measures how often we make a false positive error, power measures how often we successfully detect a real effect.

A powerful test is one that is good at finding meaningful departures from the null hypothesis.


8.10.3 Interpreting Power

Power answers the question:

If the null hypothesis is false in a specific way, how likely is the test to detect it?

This is important because “the null is false” is not enough by itself. The null can be false by a very small amount or by a large amount.

For example, if we test

\[ H_0:\mu=100 \]

and the true mean is actually \(\mu=101\), then detecting that difference may be difficult.

But if the true mean is actually \(\mu=120\), then detecting that difference is much easier.

So power depends on how far the true parameter is from the null value.

This is why power is not a single universal number unless we specify the alternative more precisely.
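A short simulation illustrates the point. The sketch below estimates the power of a right-tailed \(z\) test of \(H_0:\mu=100\) at the two true means mentioned above. The values \(\sigma = 15\) and \(n = 25\), as well as the function name `sim_power`, are assumptions chosen only for this example.

```r
# Sketch: Monte Carlo power of a right-tailed z test of H0: mu = 100.
# sigma = 15 and n = 25 are assumed values chosen only for illustration.
set.seed(1)
mu0   <- 100
sigma <- 15
n     <- 25
alpha <- 0.05
reps  <- 10000

# Proportion of simulated samples in which H0 is rejected
sim_power <- function(mu_true) {
  z_crit <- qnorm(1 - alpha)
  reject <- replicate(reps, {
    y <- rnorm(n, mean = mu_true, sd = sigma)
    z <- (mean(y) - mu0) / (sigma / sqrt(n))
    z >= z_crit
  })
  mean(reject)
}

power_near <- sim_power(101)  # true mean close to the null value
power_far  <- sim_power(120)  # true mean far from the null value
print(c(mu_101 = power_near, mu_120 = power_far))
```

Under these assumed settings the test rejects only rarely when the true mean is 101 and almost always when it is 120, even though the null hypothesis is false in both cases.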


8.10.4 What Affects Power?

Power depends on several factors:

  • the significance level \(\alpha\)
  • the sample size \(n\)
  • the variability of the population
  • the true distance from the null value
  • whether the test is one-sided or two-sided

Let us briefly interpret each one.

8.10.4.1 Significance Level

If we make \(\alpha\) larger, the rejection region becomes easier to reach.

That increases power.

But it also increases the probability of a Type I error.

So there is a trade-off:

  • larger \(\alpha\) gives more power
  • smaller \(\alpha\) gives stronger control of false positives

8.10.4.2 Sample Size

As the sample size increases, the standard error decreases.

For inference about a mean, the standard error is

\[ \frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \frac{s}{\sqrt{n}}. \]

So larger samples make the sampling distribution of the sample mean more concentrated.

That means that if the true mean differs from the null value, the sample mean is more likely to fall far enough from \(\mu_0\) to enter the rejection region.

This is one of the most important relationships in all of inference:

Larger samples generally increase power.

8.10.4.3 Population Variability

If the population variability is large, then the sample mean fluctuates more from sample to sample.

That makes it harder to distinguish signal from noise.

So larger variability decreases power.

Less variability makes it easier to detect real differences.

8.10.4.4 Distance from the Null

The farther the true mean is from the null value, the easier it is to detect the difference.

So power increases as the true parameter moves farther away from the null hypothesis value.

8.10.4.5 One-Sided vs Two-Sided Tests

For the same \(\alpha\), a one-sided test places the entire rejection region in one tail.

If the true effect is in that direction, the one-sided test usually has more power than a two-sided test.

However, this only makes sense when a one-sided alternative is justified by the scientific question before seeing the data.
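The direction of the first four effects can be checked numerically. The sketch below uses the closed-form power of the right-tailed \(z\) test for a normal population with known variance, which is derived in Section 8.10.5. The function name `power_z` and all numeric inputs are illustrative assumptions, and the one-sided versus two-sided comparison is not covered here.

```r
# Power of the right-tailed z test (normal population, known sigma).
# All numeric inputs below are illustrative assumptions.
power_z <- function(mu, mu0, sigma, n, alpha = 0.05) {
  pnorm(qnorm(1 - alpha) - (mu - mu0) / (sigma / sqrt(n)), lower.tail = FALSE)
}

# Baseline setting, then one factor changed at a time
effects <- c(
  baseline        = power_z(0.085, 0.08, 0.03,  n = 5),
  larger_alpha    = power_z(0.085, 0.08, 0.03,  n = 5, alpha = 0.10),
  larger_n        = power_z(0.085, 0.08, 0.03,  n = 50),
  smaller_sigma   = power_z(0.085, 0.08, 0.015, n = 5),
  larger_distance = power_z(0.100, 0.08, 0.03,  n = 5)
)
print(round(effects, 3))
```

Each modified setting yields a power above the baseline, in line with the qualitative discussion above.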


8.10.5 Example of Power Function for a Normal Population and Known Variance

For the one-population, normal, known-variance case, the cleanest way is to work in two steps:

  1. define the rejection region under \(H_0\),
  2. then compute the probability of falling in that region when the true mean is some alternative value \(\mu\).

Recall how the rejection region for the one-sample \(z\) test is set up: with \[ Z=\frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}, \] we reject for a right-tailed test when \(Z\ge z_\alpha\), equivalently when \[ \bar{X}\ge \mu_0+z_\alpha\frac{\sigma}{\sqrt{n}}. \] This comes directly from requiring the Type I error probability to be \(\alpha\).

Suppose we are testing \[ H_0:\mu=\mu_0 \qquad\text{vs}\qquad H_a:\mu>\mu_0. \]

Because the population is normal and \(\sigma\) is known, \[ \bar{X}\sim N\left(\mu,\frac{\sigma^2}{n}\right). \] Under \(H_0\), this becomes \[ \bar{X}\sim N\left(\mu_0,\frac{\sigma^2}{n}\right), \] which is why the rejection region is chosen using the standard normal cutoff.

Now define the critical value on the \(\bar X\) scale: \[ c=\mu_0+z_\alpha\frac{\sigma}{\sqrt{n}}. \]

So the rejection rule is:

\[ \text{Reject }H_0 \quad \text{if} \quad \bar{X}\ge c. \]

Now suppose the true mean is not \(\mu_0\), but some value \(\mu\) in the alternative. Then the sampling distribution is

\[ \bar{X}\sim N\left(\mu,\frac{\sigma^2}{n}\right). \]

The power at \(\mu\) is the probability of rejection under that true mean:

\[ \text{Power}(\mu)=P(\text{Reject }H_0\mid \mu\text{ is true}) = P(\bar{X}\ge c\mid \mu). \]

Substitute the critical value:

\[ \text{Power}(\mu) =P\left(\bar{X}\ge \mu_0+z_\alpha\frac{\sigma}{\sqrt{n}} \,\middle|\, \mu\right). \]

Now standardize using the true distribution of \(\bar X\) under mean \(\mu\):

\[ \text{Power}(\mu) = P\left( \frac{\bar X-\mu}{\sigma/\sqrt n} \ge \frac{\mu_0+z_\alpha\frac{\sigma}{\sqrt n}-\mu}{\sigma/\sqrt n} \right). \]

Since \[ \frac{\bar X-\mu}{\sigma/\sqrt n}\sim N(0,1), \] this becomes

\[ \text{Power}(\mu) =1-\Phi\left( \frac{\mu_0-\mu}{\sigma/\sqrt n}+z_\alpha \right). \]

A very common equivalent form is

\[ \boxed{ \text{Power}(\mu) =1-\Phi\left( z_\alpha-\frac{\mu-\mu_0}{\sigma/\sqrt n} \right) } \]

This is usually the most interpretable formula.

Since power is \(1-\beta\), this matches the definition given earlier: power is the probability of rejecting a false null hypothesis, and it depends on how far the true mean is from \(\mu_0\).

Interpretation:

  • If \(\mu=\mu_0\), then \[ \text{Power}(\mu_0)=1-\Phi(z_\alpha)=\alpha, \] as expected.

  • If \(\mu>\mu_0\), then \(\dfrac{\mu-\mu_0}{\sigma/\sqrt n}\) gets larger, so power increases.

  • Power increases when:

    • \(\mu-\mu_0\) increases,
    • \(n\) increases,
    • \(\sigma\) decreases,
    • or \(\alpha\) increases.

A compact way to think about it is through the signal-to-noise term \[ \frac{\mu-\mu_0}{\sigma/\sqrt n}, \] which measures how far the true mean is from the null in standard-error units.

For a concrete example, if \(\alpha=0.05\) in a right-tailed test, then \(z_\alpha=1.645\), so

\[ \text{Power}(\mu) =1-\Phi\left( 1.645-\frac{\mu-\mu_0}{\sigma/\sqrt n} \right). \]

So once you know \(\mu\), \(\mu_0\), \(\sigma\), and \(n\), you just plug them into that expression.
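The same expression can be inverted to choose a sample size. Requiring \(\text{Power}(\mu)\ge 1-\beta\) in the boxed formula is equivalent to \(z_\alpha-\sqrt{n}\,(\mu-\mu_0)/\sigma \le -z_\beta\), that is, \[ n \ge \left(\frac{(z_\alpha+z_\beta)\,\sigma}{\mu-\mu_0}\right)^2. \] The sketch below applies this to the BAC setting; the target power of 0.80 is an assumption chosen for illustration.

```r
# Sketch: smallest n that reaches a target power for the right-tailed z test.
# The target power of 0.80 is an assumption chosen for illustration.
mu0    <- 0.08   # null mean
mu     <- 0.085  # true mean under the alternative
sigma  <- 0.03   # known standard deviation
alpha  <- 0.05
target <- 0.80

# Power of the right-tailed z test as a function of the sample size
power_z <- function(n) {
  pnorm(qnorm(1 - alpha) - (mu - mu0) / (sigma / sqrt(n)), lower.tail = FALSE)
}

# Solving 1 - Phi(z_alpha - sqrt(n) * (mu - mu0) / sigma) >= target for n
n_req <- ceiling(((qnorm(1 - alpha) + qnorm(target)) * sigma / (mu - mu0))^2)

print(n_req)
print(c(power_at_n = power_z(n_req), power_below = power_z(n_req - 1)))
```

The second line of output confirms that `n_req` is the smallest sample size meeting the target: one observation fewer leaves the power just below it.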

8.10.5.1 Example Exact Computation of the Power Function

# Parameters
mu0 <- 0.08  # Null Mean (Legal limit for BAC)
muA <- 0.085 # True Mean 
sd  <- 0.03  # Standard Deviation (Provided by the Manufacturer of the Alcoholimeter)
n   <- 5    # Number of Samples
alp <- 0.05  # Significance Level (Confidence Level is 1 - alp)

# Using the SD information (Z test)
# Null Hypothesis Z score that defines the Rejection Region
za  <- qnorm(p = 1 - alp, mean = 0, sd = 1)
# Standardized Difference between Null and Reality (Signal-to-Noise Ratio)
d   <- (muA - mu0) / (sd / sqrt(n))
# Probability of Rejecting when the Alternative is true
pro <- pnorm(q = za - d, mean = 0, sd = 1, lower.tail = FALSE)

# Changing the True Mean
muA <- seq(0, 0.24, by = 0.001)

# Using the SD information (Z test)
# Null Hypothesis Z score that defines the Rejection Region
za  <- qnorm(p = 1 - alp, mean = 0, sd = 1)
# Standardized Difference between Null and Reality (Signal-to-Noise Ratio)
d   <- (muA - mu0) / (sd / sqrt(n))
# Probability of Rejecting when the Alternative is true
pro <- pnorm(q = za - d, mean = 0, sd = 1, lower.tail = FALSE)

# Plots the Power Function
plot(x    = muA,
     y    = pro,
     type = 'l',
     lwd  = 2,
     xlab = "Real BAC",
     ylab = expression(Power(mu[a])))
abline(v   = mu0,
       lwd = 2,
       col = "red")
abline(h   = alp,
       lwd = 2,
       col = "red")


8.10.6 Example Suboptimal Test

Assume again the one-sided problem

\[ H_0:\mu=\mu_0 \qquad\text{vs}\qquad H_a:\mu>\mu_0, \]

with

\[ X_1,\dots,X_n \overset{\text{iid}}{\sim} N(\mu,\sigma^2), \]

and now \(\sigma\) is in fact known, but we decide to ignore that and use the statistic

\[ T=\frac{\bar X-\mu_0}{S/\sqrt n}, \]

where

\[ S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2. \]


8.10.6.1 Rejection region

If we insist on using a \(t\) test with significance level \(\alpha\), then under \(H_0\) we have

\[ T\sim t_{n-1}, \]

so the rejection region for a right-tailed test is

\[ T\ge t_{\alpha,n-1}, \]

where \(t_{\alpha,n-1}\) is the upper-\(\alpha\) critical value from the \(t\) distribution with \(n-1\) degrees of freedom, that is,

\[ P\left(t_{n-1}\ge t_{\alpha,n-1}\right)=\alpha. \]

Equivalently, in terms of \(\bar X\) and \(S\),

\[ \text{Reject }H_0 \quad\text{if}\quad \bar X \ge \mu_0+t_{\alpha,n-1}\frac{S}{\sqrt n}. \]

So unlike the \(z\) test, the cutoff is now random because it depends on \(S\).


8.10.6.2 Distribution of the test statistic under a true mean \(\mu\)

When the true mean is \(\mu\), the statistic \(T\) no longer has a central \(t\) distribution. Instead it has a noncentral \(t\) distribution with

  • degrees of freedom: \(n-1\),
  • noncentrality parameter: \[ \delta=\frac{\mu-\mu_0}{\sigma/\sqrt n} =\sqrt n\frac{\mu-\mu_0}{\sigma}. \]

So under mean \(\mu\),

\[ T \sim t_{n-1}(\delta), \]

where \(t_{n-1}(\delta)\) denotes the noncentral \(t\) distribution.


8.10.6.3 Power function computation

The power at a true mean value \(\mu\) is

\[ \text{Power}_t(\mu) =P_\mu(\text{Reject }H_0) = P_\mu\left(T\ge t_{\alpha,n-1}\right). \]

Using the noncentral \(t\) distribution, this is

\[ \boxed{ \text{Power}_t(\mu) =P\left(t_{n-1}(\delta)\ge t_{\alpha,n-1}\right) } \]

with

\[ \delta=\sqrt n\frac{\mu-\mu_0}{\sigma}. \]

Equivalently, if \(F_{n-1,\delta}\) is the cdf of the noncentral \(t\) distribution,

\[ \boxed{ \text{Power}_t(\mu) = 1-F_{n-1,\delta}\left(t_{\alpha,n-1}\right) } \]

This is the exact power function for the test that wastes the known variance.
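In R, this power can be evaluated directly because `pt()` accepts a noncentrality parameter through its `ncp` argument. The sketch below reuses the BAC values (\(\mu_0 = 0.08\), \(\mu = 0.085\), \(\sigma = 0.03\), \(n = 5\)) as an illustration and also computes the \(z\)-test power for the same setting.

```r
# Sketch: exact power of the right-tailed t test via the noncentral t cdf.
# Parameter values reuse the BAC example and are otherwise illustrative.
mu0   <- 0.08
mu    <- 0.085
sigma <- 0.03
n     <- 5
alpha <- 0.05

delta  <- sqrt(n) * (mu - mu0) / sigma   # noncentrality parameter
t_crit <- qt(1 - alpha, df = n - 1)      # upper-alpha critical value

# Upper-tail probability of the noncentral t distribution at the critical value
power_t <- pt(t_crit, df = n - 1, ncp = delta, lower.tail = FALSE)

# Power of the z test that uses the known sigma, for comparison
power_z <- pnorm(qnorm(1 - alpha) - delta, lower.tail = FALSE)

print(c(power_t = power_t, power_z = power_z))
```

Computing the power this way avoids simulation error, which makes it a useful check on Monte Carlo estimates of the same quantity.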


8.10.6.4 Compare with the \(z\) test

If we use the known variance correctly, the \(z\) test rejects when

\[ Z=\frac{\bar X-\mu_0}{\sigma/\sqrt n}\ge z_\alpha, \]

so its power is

\[ \text{Power}_z(\mu) = 1-\Phi\left( z_\alpha-\sqrt n\frac{\mu-\mu_0}{\sigma} \right). \]

So we want to compare

\[ \text{Power}_t(\mu) \quad\text{and}\quad \text{Power}_z(\mu). \]


8.10.6.5 Why the \(t\)-based test has less power

The clean theoretical reason is this:

For the normal model with known variance, the test based on \(\bar X\) or equivalently on \(Z\) is the uniformly most powerful level-\(\alpha\) test for

\[ H_0:\mu=\mu_0 \qquad\text{vs}\qquad H_a:\mu>\mu_0. \]

That means:

among all tests with significance level \(\alpha\), no other test can have greater power for any \(\mu>\mu_0\).

Since the \(t\)-based test is another level-\(\alpha\) test, it must satisfy

\[ \boxed{ \text{Power}_t(\mu)\le \text{Power}_z(\mu) \qquad\text{for all }\mu>\mu_0. } \]

And because the two tests are not the same test, the inequality is strict for at least some alternatives:

\[ \boxed{ \text{Power}_t(\mu)< \text{Power}_z(\mu) \qquad\text{for some }\mu>\mu_0. } \]

In practice, it is typically strictly smaller for essentially all relevant alternatives.


8.10.6.6 Intuition for the power loss

The \(z\) test uses the exact standard error

\[ \frac{\sigma}{\sqrt n}, \]

which is known.

The \(t\)-based test replaces it by the random quantity

\[ \frac{S}{\sqrt n}. \]

So even though \(\sigma\) is known, the test injects extra randomness into the denominator. That makes the test statistic noisier.

There are two consequences:

  1. the reference distribution becomes heavier-tailed (\(t\) instead of normal),
  2. the rejection threshold becomes random because it depends on \(S\).

So the test is spending part of its information estimating something that was already known. That extra randomness lowers the ability to distinguish \(\mu>\mu_0\) from \(\mu=\mu_0\).


8.10.6.7 Another way to see it

The \(z\) test rejects for large values of \(\bar X\) alone:

\[ \bar X \ge \mu_0 + z_\alpha \frac{\sigma}{\sqrt n}. \]

This is exactly what the likelihood ratio suggests in the known-variance normal model: evidence against \(H_0\) is fully summarized by how large \(\bar X\) is.

The \(t\) test instead rejects when

\[ \bar X \ge \mu_0+t_{\alpha,n-1}\frac{S}{\sqrt n}. \]

So now the decision depends not only on \(\bar X\), but also on \(S\).

That means two samples with the same \(\bar X\) can lead to different conclusions just because their sample standard deviations differ, even though \(\sigma\) was already known and \(S\) is not needed. That extra dependence cannot improve power when \(\sigma\) is known.


8.10.6.8 Large-sample behavior

As \(n\) grows,

\[ t_{\alpha,n-1}\to z_\alpha \qquad\text{and}\qquad S\to \sigma, \]

so the \(t\) test becomes very close to the \(z\) test. Therefore,

\[ \text{Power}_t(\mu)\to \text{Power}_z(\mu) \quad\text{as } n\to\infty. \]

So the loss of power is mainly a small-sample issue.
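The convergence can be seen numerically. The sketch below tabulates the critical values and the two power functions for increasing \(n\); the standardized effect \((\mu-\mu_0)/\sigma = 0.3\) and the grid of sample sizes are assumptions chosen for illustration.

```r
# Sketch: the t critical value and t-test power approach their z counterparts.
# The effect size (mu - mu0) / sigma = 0.3 is an assumed illustrative value.
alpha  <- 0.05
effect <- 0.3
n_vals <- c(5, 10, 30, 100, 500)

z_crit <- qnorm(1 - alpha)
t_crit <- qt(1 - alpha, df = n_vals - 1)

# Noncentrality parameter for each sample size
delta   <- sqrt(n_vals) * effect
power_z <- pnorm(z_crit - delta, lower.tail = FALSE)
power_t <- pt(t_crit, df = n_vals - 1, ncp = delta, lower.tail = FALSE)

print(data.frame(n = n_vals, t_crit = t_crit, z_crit = z_crit,
                 power_t = power_t, power_z = power_z))
```

The \(t\) critical value decreases toward \(z_\alpha\) as \(n\) grows, and the gap between the two power values is much smaller at the largest sample size than at the smallest.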


8.10.6.9 Final statement

For the one-sample normal problem with known variance, if you ignore the known variance and use

\[ T=\frac{\bar X-\mu_0}{S/\sqrt n}, \]

then the level-\(\alpha\) rejection region is

\[ T\ge t_{\alpha,n-1}, \]

and the power at a true mean \(\mu\) is

\[ \boxed{ \text{Power}_t(\mu) =P\left(t_{n-1}(\delta)\ge t_{\alpha,n-1}\right), \qquad \delta=\sqrt n\frac{\mu-\mu_0}{\sigma}. } \]

Compared with the optimal \(z\)-test power

\[ \boxed{ \text{Power}_z(\mu) = 1-\Phi\left( z_\alpha-\sqrt n\frac{\mu-\mu_0}{\sigma} \right), } \]

we have, for the same significance level \(\alpha\),

\[ \boxed{ \text{Power}_t(\mu)\le \text{Power}_z(\mu) \quad\text{for all }\mu>\mu_0, } \]

because the \(z\) test is the uniformly most powerful level-\(\alpha\) test in the known-variance normal model.

8.10.6.10 Monte Carlo Simulation of the Power of a T test

# Parameters
mu0 <- 0.08  # Null Mean (Legal limit for BAC)
muA <- 0.085 # True Mean 
sd  <- 0.03  # Standard Deviation (Provided by the Manufacturer of the Alcoholimeter)
n   <- 5    # Number of Samples
alp <- 0.05  # Significance Level (Confidence Level is 1 - alp)

# Using the SD information (Z test)
# Null Hypothesis Z score that defines the Rejection Region
za  <- qnorm(p = 1 - alp, mean = 0, sd = 1)
# Standardized Difference between Null and Reality (Signal-to-Noise Ratio)
d   <- (muA - mu0) / (sd / sqrt(n))
# Probability of Rejecting when the Alternative is true
pro <- pnorm(q = za - d, mean = 0, sd = 1, lower.tail = FALSE)

# Changing the True Mean
muA <- seq(0.05, 0.15, by = 0.001)

# Using the SD information (Z test) on the whole grid of true means
# The critical value za does not depend on the true mean, so it is reused
# Standardized difference between the null mean and each true mean (signal-to-noise ratio)
d   <- (muA - mu0) / (sd / sqrt(n))
# Probability of rejecting at each true mean
pro <- pnorm(q = za - d, mean = 0, sd = 1, lower.tail = FALSE)

powZte <- pro

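# Estimating the power of the t test using Monte Carlo simulation
set.seed(1)              # arbitrary fixed seed so the results are reproducible
S      <- 1000           # number of simulated samples per true mean
M      <- length(muA)    # number of true means on the grid
rejMat <- matrix(data = NA, nrow = S, ncol = M)
# Critical t value that defines the rejection region (the same for every sample)
ta     <- qt(p = 1 - alp, df = n - 1)
# Simulates S samples for each true mean
for (i in seq_len(M)) {
  rej <- replicate(n = S, {
    # Draws one sample of n measurements
    x <- rnorm(n = n, mean = muA[i], sd = sd)
    # Computes the mean, the standard error, and the t statistic
    xBar  <- mean(x)
    xSe   <- sd(x) / sqrt(n)
    tStat <- (xBar - mu0) / xSe
    # Checks whether the test rejects
    tStat > ta
  })
  # Saves the S rejection indicators for this true mean
  rejMat[, i] <- rej
}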

# Power T Test (of Simulations)
powTte <- colMeans(rejMat)

# Plots the power functions of the z test and the t test
plot(x    = muA,
     y    = powZte,
     ylim = c(0, 1),
     type = 'l',
     lwd  = 2,
     xlab = "True BAC",
     ylab = "Power")
# Vertical line at the null mean; horizontal line at the significance level
abline(v   = mu0,
       lwd = 2,
       col = "red")
abline(h   = alp,
       lwd = 2,
       col = "red")
# Adds the simulated t-test power to the same axes
lines(x   = muA,
      y   = powTte,
      lwd = 2,
      col = 'blue')
legend("topleft",
       legend = c("Z test", "T test (simulated)"),
       col    = c('black', 'blue'),
       lwd    = 2)


8.10.7 Why Power Matters

A test with low power may fail to detect real and meaningful effects.

This can lead to conclusions such as:

  • “there is no evidence of a difference”
  • “the treatment had no effect”
  • “the mean is not significantly different from the null value”

when in reality the study simply did not have enough sensitivity.

This is a very important point:

Failing to reject \(H_0\) does not necessarily mean that \(H_0\) is true.

It may simply mean that the study had low power.

So power helps us interpret non-significant results more carefully.


8.10.8 Power and Sample Size

The relationship between power and sample size is especially important in practice.

Suppose everything else is fixed:

  • the significance level
  • the population variability
  • the effect size we want to detect

Then increasing \(n\) reduces the standard error and makes the test more sensitive.

As a result, power increases.

This means that sample size is one of the main tools available to the researcher when designing a study.

A larger sample allows us to detect smaller effects more reliably.

A smaller sample may only detect very large effects.

So before collecting data, researchers often ask:

How large should the sample be so that the test has enough power?

This is the motivation for sample size determination.
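To make the question concrete, the sketch below evaluates the power of the one-sided \(z\) test over a grid of sample sizes in the BAC example and reports the smallest \(n\) that reaches a target power. The target of 0.80, the true mean of 0.085, and the grid upper bound of 1000 are assumptions chosen for illustration, not values prescribed by the example.

```r
# Power of the one-sided z test as a function of the sample size.
mu0    <- 0.08    # null mean (legal limit)
mu_a   <- 0.085   # assumed true mean we want to detect
sigma  <- 0.03    # known standard deviation
alp    <- 0.05    # significance level
target <- 0.80    # hypothetical target power

n_grid <- 1:1000  # candidate sample sizes (upper bound is arbitrary)
power  <- pnorm(qnorm(1 - alp) - sqrt(n_grid) * (mu_a - mu0) / sigma,
                lower.tail = FALSE)

# Power at a few sample sizes, to see how quickly it grows
print(round(power[c(5, 50, 500)], 3))

# Smallest sample size on the grid whose power reaches the target
n_needed <- n_grid[which(power >= target)[1]]
print(n_needed)
```

A grid search is used here instead of a closed-form expression so that the logic stays visible: compute the power for each candidate \(n\), then pick the first one that is large enough.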


8.10.9 Intuition Through the Sampling Distribution

It is helpful to think about power graphically.

Under the null hypothesis, the sample mean has a sampling distribution centered at \(\mu_0\).

If the true mean is actually some value \(\mu_a \ne \mu_0\), then the sampling distribution is centered at \(\mu_a\) instead.

Power is the probability that this alternative sampling distribution places the test statistic inside the rejection region.

If the sample size is small:

  • the sampling distribution is wide
  • there is a lot of overlap between the null and alternative distributions
  • power may be low

If the sample size is large:

  • the sampling distribution is narrower
  • the null and alternative distributions are easier to separate
  • power is higher

So increasing the sample size improves power because it reduces uncertainty in the estimator.
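The same picture can be expressed numerically on the scale of \(\bar X\). In the sketch below, the rejection cutoff is computed from the null sampling distribution, and power is the area of the alternative sampling distribution beyond that cutoff. The true mean of 0.09, the two sample sizes, and the helper name `view_power` are assumptions made for this illustration.

```r
# Power read off the sampling distribution of the sample mean (z test, known sigma).
mu0   <- 0.08   # null mean
mu_a  <- 0.09   # assumed true mean under the alternative
sigma <- 0.03   # known standard deviation
alp   <- 0.05   # significance level

view_power <- function(n) {
  se     <- sigma / sqrt(n)               # width of the sampling distribution
  cutoff <- mu0 + qnorm(1 - alp) * se     # reject when the sample mean exceeds this
  # Area of the alternative sampling distribution inside the rejection region
  power  <- pnorm(cutoff, mean = mu_a, sd = se, lower.tail = FALSE)
  c(n = n, se = se, cutoff = cutoff, power = power)
}

out <- rbind(view_power(5), view_power(100))
print(round(out, 4))
```

With the larger sample, the standard error is smaller, the cutoff moves closer to \(\mu_0\), and more of the alternative distribution falls beyond it.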


8.10.10 Practical Interpretation

Suppose a traffic officer uses an alcoholimeter during a traffic stop to determine whether a driver's mean blood alcohol concentration exceeds the legal limit by an amount that is scientifically or legally important.

Then:

  • if the number of measurements is too small, the test may fail to detect that the true mean BAC is above the legal limit
  • if the number of measurements is large enough, the test has a better chance of detecting that exceedance

This is why power is not just a mathematical detail. It is a planning tool.

It helps answer whether the measurement procedure is capable of detecting a meaningful increase above the legal threshold, rather than missing it simply because the sample size is too small.


8.10.11 Summary

Power is the probability of rejecting a false null hypothesis.

It is related to Type II error through

\[ \text{Power} = 1 - \beta. \]

Power increases when:

  • the sample size increases
  • the variability decreases
  • the true mean is farther from the null value
  • the significance level is larger
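Each of the four factors listed above can be checked with the \(z\)-test power formula by changing one input at a time. The baseline values and the alternative values in this sketch are arbitrary choices for illustration, and `power_z` is a helper name introduced here.

```r
# Power of the one-sided z test; each argument can be varied separately.
power_z <- function(mu, n = 5, sigma = 0.03, alp = 0.05, mu0 = 0.08) {
  pnorm(qnorm(1 - alp) - sqrt(n) * (mu - mu0) / sigma, lower.tail = FALSE)
}

effects <- c(baseline      = power_z(mu = 0.09),
             larger_n      = power_z(mu = 0.09, n = 20),
             smaller_sigma = power_z(mu = 0.09, sigma = 0.015),
             farther_mu    = power_z(mu = 0.10),
             larger_alpha  = power_z(mu = 0.09, alp = 0.10))
print(round(effects, 3))
```

Each of the four modified settings should give a power above the baseline.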

The connection between power and sample size is especially important:

Larger samples generally lead to greater power because they reduce the standard error and make true effects easier to detect.

This idea will be important later when we discuss how to choose a sample size before collecting data.