3 Studies and Data Collection

In statistics, we collect data in order to answer questions about a population.

A well-designed study is essential to ensure that the conclusions we draw are reliable.

3.1 Statistical Studies

Definition 3.1 (Statistical Study) A statistical study is a process of collecting data in order to answer a question about a population or phenomenon.

A study begins with a clear question and a plan for how data will be collected.

3.2 Population and Sample

Definition 3.2 (Population) A population is the complete collection of all individuals or objects of interest.

Definition 3.3 (Sample) A sample is a subset of the population from which data are actually collected.

In most situations, it is not feasible to observe the entire population.

Instead, we collect data from a sample and use it to learn about the population.

3.3 Variables

Definition 3.4 (Variable) A variable is a characteristic or measurement that can take different values across observations.

Each observation in a study consists of measurements of one or more variables.

3.4 Types of Studies

There are two main types of statistical studies: observational studies and experimental studies.

3.4.1 Observational Studies

Definition 3.5 (Observational Study) An observational study is a study in which the researcher records data without manipulating any variables.

In an observational study, the researcher is a passive observer and does not interfere with the process generating the data.

3.4.2 Experimental Studies

Definition 3.6 (Experimental Study) An experimental study is a study in which the researcher actively manipulates one or more variables to observe their effect.

In an experiment, the researcher controls certain variables and studies how they affect the outcome.

3.4.3 Association vs Causation

A key distinction between observational and experimental studies is the ability to establish causation.

Observational studies can identify associations between variables, but an association alone does not show that one variable causes the other.
Experimental studies allow for stronger conclusions about cause-and-effect relationships.

Definition 3.7 (Confounding Variable) A confounding variable is a variable that affects both the explanatory and response variables (defined in Section 3.5), potentially distorting their relationship.

Confounding variables are a major limitation of observational studies, since they can create misleading associations.

3.5 Explanatory and Response Variables

Definition 3.8 (Explanatory Variable) An explanatory variable is a variable that may explain changes in another variable.

Definition 3.9 (Response Variable) A response variable is the variable being explained or predicted.

The goal of many studies is to understand how explanatory variables influence the response variable.

3.6 Types of Observational Studies

Observational studies can take different forms depending on how data are collected.

3.6.1 Sample Surveys

Definition 3.10 (Sample Survey) A sample survey is an observational study that collects information from a sample of the population at a single point in time, in order to describe the population as a whole.

Sample surveys are commonly used in polls and questionnaires.

3.6.2 Prospective Studies

A prospective study follows a group of individuals forward in time and records future outcomes.

3.6.3 Retrospective Studies

A retrospective study examines past data to identify relationships between variables.

3.7 Units in a Study

Definition 3.11 (Observation Unit) The observation unit is the entity on which measurements are taken.

Definition 3.12 (Sampling Unit) The sampling unit is the entity selected during the sampling process.

In many cases, the observation unit and the sampling unit are the same, but they can differ depending on the study design.

3.8 Sampling Methods

The way a sample is selected from a population plays a critical role in the quality of a study.

A good sampling method helps ensure that the sample is representative of the population.

3.8.1 Simple Random Sampling

Definition 3.13 (Simple Random Sampling) Simple random sampling is a method where every possible sample of a given size has an equal chance of being selected.

As a consequence, each individual in the population has the same probability of being included in the sample.
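The definition can be illustrated with a minimal Python sketch. The roster of 500 student IDs, the sample size of 100, and the seed are hypothetical choices made for illustration. The standard-library method `random.Random.sample` draws without replacement, so every subset of the requested size is equally likely.

```python
import random

# Hypothetical sampling frame: a complete list of the population (IDs 1..500).
population = list(range(1, 501))

# Fixed seed so the illustration is reproducible; the value is arbitrary.
rng = random.Random(42)

# Draw 100 distinct individuals; every subset of size 100 is equally likely.
srs = rng.sample(population, k=100)

print(len(srs), "students selected, all distinct:", len(set(srs)) == len(srs))
```

Note that the sketch needs the full list `population` before it can draw anything, which is the practical requirement listed under the disadvantages of this method.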

Advantages

Easy to understand and implement
Produces unbiased samples when properly conducted

Disadvantages

Requires a complete list of the population
Can be impractical for large or geographically dispersed populations

3.8.2 Stratified Sampling

Definition 3.14 (Stratified Sampling) Stratified sampling divides the population into non-overlapping, internally homogeneous groups called strata and then takes a random sample from each stratum.

The strata are formed based on characteristics that are relevant to the study.
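A possible sketch of stratified sampling in Python follows. The population, the stratum labels ("first-year" and "upper-year"), and the helper `stratified_sample` are hypothetical. The sketch also assumes proportional allocation (the same sampling fraction in every stratum), which is one common choice and not something the definition requires.

```python
import random

rng = random.Random(0)  # arbitrary seed for reproducibility

# Hypothetical population: (student_id, stratum) pairs, 300 + 200 students.
population = [
    (i, "first-year" if i <= 300 else "upper-year") for i in range(1, 501)
]

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw a simple random sample of the given fraction within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        n = round(len(members) * fraction)  # proportional allocation (assumed)
        sample.extend(rng.sample(members, n))
    return sample

sample = stratified_sample(population, lambda unit: unit[1], 0.2, rng)

# Count how many sampled students fall in each stratum.
counts = {}
for _, stratum in sample:
    counts[stratum] = counts.get(stratum, 0) + 1
print(counts)
```

Because each stratum is sampled separately, both subgroups are guaranteed to appear in the sample in proportion to their size in the population.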

Advantages

Ensures representation from all important subgroups
Can increase precision of estimates

Disadvantages

Requires knowledge of population structure
More complex to implement

3.8.3 Cluster Sampling

Definition 3.15 (Cluster Sampling) Cluster sampling divides the population into groups called clusters and randomly selects entire clusters to include in the sample.

All observations within selected clusters are included in the sample.
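The two steps of cluster sampling can be sketched in Python as follows. The setup is hypothetical: 20 tutorial sections of 25 students each serve as the clusters, and 4 sections are selected.

```python
import random

rng = random.Random(1)  # arbitrary seed for reproducibility

# Hypothetical population: 20 tutorial sections (clusters) of 25 students each.
clusters = {
    section: [f"s{section:02d}-{k:02d}" for k in range(1, 26)]
    for section in range(1, 21)
}

# Step 1: randomly select whole clusters.
chosen_sections = rng.sample(sorted(clusters), k=4)

# Step 2: include every student in each selected cluster.
sample = [
    student for section in chosen_sections for student in clusters[section]
]

print(sorted(chosen_sections), len(sample))
```

Only the list of sections is needed to draw the sample, not a list of all students, which is why this design is practical for large or dispersed populations.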

Advantages

Cost-effective and practical for large populations
Useful when the population is geographically spread out

Disadvantages

Can lead to higher variability in estimates than a simple random sample of the same size
Clusters may not be representative of the population

3.8.4 Systematic Sampling

Definition 3.16 (Systematic Sampling) Systematic sampling selects observations at regular intervals from an ordered list of the population.

For example, we might select every \(k\)-th observation after a random starting point among the first \(k\) observations.
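The every-\(k\)-th rule can be sketched in Python with list slicing. The helper name `systematic_sample`, the roster of 500 IDs, and the interval \(k = 5\) are hypothetical.

```python
import random

def systematic_sample(ordered_units, k, rng):
    """Select every k-th unit after a random start among the first k positions."""
    start = rng.randrange(k)  # random integer in 0..k-1
    return ordered_units[start::k]

# Hypothetical ordered list of the population (IDs 1..500).
roster = list(range(1, 501))

sample = systematic_sample(roster, 5, random.Random(7))
print(len(sample), sample[:3])
```

Only the starting position is random. Once it is fixed, the rest of the sample is fully determined by the ordering of the list.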

Advantages

Simple and quick to implement
Ensures a spread of observations across the population

Disadvantages

Can introduce bias if the list has a periodic pattern that coincides with the sampling interval
Not truly random if ordering is structured
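The risk from a structured ordering can be seen in a small, deliberately extreme Python sketch. It assumes a hypothetical roster in which students are listed in groups of five, each group sorted from the highest to the lowest score, and the sampling interval is also five.

```python
from statistics import mean

# Hypothetical roster: the same five-score pattern repeated 20 times,
# so the ordering has period 5.
group_scores = [90, 80, 70, 60, 50]
roster_scores = group_scores * 20  # 100 students

k = 5
population_mean = mean(roster_scores)

# Sample mean for each possible starting position 0..k-1.
sample_means = [mean(roster_scores[start::k]) for start in range(k)]

print(population_mean, sample_means)
```

Because the interval matches the period of the list, every possible sample contains students from a single rank position only. Each sample mean therefore equals one of the five scores in the pattern, and four of the five possible samples miss the population mean.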

3.8.5 Convenience Sampling

Definition 3.17 (Convenience Sampling) Convenience sampling selects observations that are easiest to access.

Advantages

Easy and inexpensive
Useful for exploratory analysis

Disadvantages

Often highly biased
Not representative of the population
Results cannot be generalized reliably

3.9 Summary of Sampling Methods

Probability sampling methods (simple random, stratified, cluster, systematic) allow for more reliable conclusions because selection is governed by chance rather than by the researcher's judgment.
Non-probability methods (such as convenience sampling) are prone to bias.
The choice of sampling method depends on the study objectives, resources, and population structure.

3.10 Bias in Studies

Definition 3.18 (Bias) Bias is a systematic error in data collection or analysis that leads to incorrect conclusions.

Bias can arise in many ways and can significantly affect the validity of a study.

3.10.1 Common Types of Bias

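Selection bias: occurs when the way the sample is selected systematically favors some parts of the population, so the sample is not representative.
Nonresponse bias: occurs when individuals selected for the sample do not respond and the nonrespondents differ systematically from those who do.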

Careful study design is required to minimize bias.
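Nonresponse bias can be illustrated with a short simulation in Python. Everything in it is an assumption made for illustration: weekly study hours are taken to be uniform between 0 and 20, and the chance that a student responds is assumed to grow in proportion to the hours they study.

```python
import random
from statistics import mean

rng = random.Random(3)  # arbitrary seed for reproducibility

# Hypothetical population: weekly study hours, uniform on [0, 20].
hours = [rng.uniform(0, 20) for _ in range(10_000)]

# Assumed response mechanism: P(respond) = hours / 20,
# so students who study more are more likely to answer.
respondents = [h for h in hours if rng.random() < h / 20]

print("population mean:", round(mean(hours), 1))
print("respondent mean:", round(mean(respondents), 1))
```

Under these assumptions the respondents study more, on average, than the population they were drawn from, so an estimate based only on respondents overstates typical study time no matter how many students are contacted.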

3.11 Summary

A well-designed study requires:

a clear definition of the population and sample
an understanding of the variables involved
an appropriate study design (observational or experimental)
awareness of potential bias and confounding variables

3.12 Example: Observational Study

Suppose we want to study the relationship between study time and exam performance among university students.

3.12.1 Description of the Study

We collect data from a group of students without assigning or controlling how much they study.
We simply record their study habits and exam scores.

This is an observational study because no variables are manipulated by the researcher.

3.12.2 Identifying the Components

Population

All university students enrolled in the course.

Sample

A group of 100 students selected from the course.

Sampling Method

Suppose we select students by choosing every 5th student from the class roster, starting from a randomly chosen student among the first five.

This is an example of systematic sampling.

Observation Unit

Each individual student.

Sampling Unit

Each student selected from the roster.

In this case, the sampling unit and observation unit are the same.

Variables

Explanatory variable: number of hours studied per week
Response variable: exam score

Each student contributes one observation consisting of these measurements.

3.12.3 Potential Confounding Variables

There are other variables that may affect exam performance, such as:

prior academic ability
attendance
access to tutoring
sleep and health

These variables may influence both study time and exam scores.

This makes them potential confounding variables: they can distort the observed relationship between the explanatory and response variables.
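A simulated sketch in Python shows how a confounder alone can produce an association. The data-generating process is entirely hypothetical: prior ability raises both study hours and exam scores, while study hours are given no direct effect on the score. The helper `pearson` computes the ordinary sample correlation coefficient.

```python
import random
from statistics import mean

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(11)  # arbitrary seed for reproducibility
n = 5_000

# Assumed process: ability drives BOTH variables;
# hours do not appear in the formula for score.
ability = [rng.gauss(0, 1) for _ in range(n)]
hours = [10 + 2 * a + rng.gauss(0, 1) for a in ability]
score = [70 + 8 * a + rng.gauss(0, 5) for a in ability]

r = pearson(hours, score)
print("correlation between hours and score:", round(r, 2))
```

The correlation is strongly positive even though, by construction, changing a student's hours would not change their score. An observational study that records only hours and scores cannot tell this situation apart from a genuine causal effect.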

3.12.4 Bias Considerations

If some students are absent when data are collected, the sample may suffer from nonresponse bias, especially if absent students differ from those present in their study habits or performance.

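If the class roster has a repeating pattern that lines up with the sampling interval (for example, students listed in groups of five and ranked by performance within each group), selecting every 5th student could introduce selection bias.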

3.12.5 Why Causality Cannot Be Established

Even if we observe that students who study more tend to have higher exam scores, we cannot conclude that studying more causes higher scores.

This is because:

confounding variables may explain the relationship
the researcher did not control or assign study time

As a result, the study can identify associations, but not cause-and-effect relationships.

3.13 Example: Experimental Study

Suppose we want to study whether a new tutoring program improves exam performance among university students.

3.13.1 Description of the Study

We select a group of students and randomly assign them to one of two groups:

a treatment group that receives the tutoring program
a control group that does not receive the program

We then compare their exam scores at the end of the semester.

This is an experimental study because the researcher actively assigns the treatment.

3.13.2 Identifying the Components

Population

All university students enrolled in the course.

Sample

A group of 100 students selected from the course.

Sampling Method

Suppose the students are selected using simple random sampling from the class roster.

Observation Unit

Each individual student.

Sampling Unit

Each student selected from the roster.

In this case, the sampling unit and observation unit are the same.

Variables

Explanatory variable: participation in the tutoring program (yes or no)
Response variable: exam score

Each student contributes one observation.

3.13.3 Random Assignment

After selecting the sample, students are randomly assigned to the treatment or control group.

Random assignment helps ensure that the two groups are comparable.
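One simple way to carry out random assignment is to shuffle the sample and split it in half, as in the Python sketch below. The student IDs and the equal 50/50 split are assumptions; the example in the text does not specify group sizes.

```python
import random

rng = random.Random(5)  # arbitrary seed for reproducibility

# Hypothetical sample of 100 student IDs already selected from the roster.
students = list(range(1, 101))

# Shuffle a copy, then split: first half -> treatment, second half -> control.
shuffled = students[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:50], shuffled[50:]

print(len(treatment), len(control))
```

Chance alone decides which group each student joins, so neither the researcher nor the students can influence who receives the tutoring program.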

3.13.4 Role of Confounding Variables

Variables such as:

prior academic ability
motivation
attendance

may still affect exam performance.

However, because of random assignment, these variables are expected to be balanced across the groups.

This reduces the impact of confounding variables.
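The phrase "expected to be balanced" can be illustrated by repeating the random assignment many times. In this Python sketch the prior-ability scores are simulated, and the helper `ability_gap` and the 2,000 repetitions are hypothetical choices.

```python
import random
from statistics import mean

rng = random.Random(8)  # arbitrary seed for reproducibility

# Hypothetical prior-ability scores for a sample of 100 students.
ability = [rng.gauss(70, 10) for _ in range(100)]

def ability_gap(scores, rng):
    """Randomly split scores in half; return mean(treatment) - mean(control)."""
    shuffled = scores[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return mean(shuffled[:half]) - mean(shuffled[half:])

# Repeat the random assignment many times and record the gap each time.
gaps = [ability_gap(ability, rng) for _ in range(2_000)]
average_gap = mean(gaps)

print("average gap in prior ability:", round(average_gap, 2))
print("largest single gap:", round(max(abs(g) for g in gaps), 2))
```

Averaged over many assignments the gap is close to zero, which is what balance "in expectation" means. Any single assignment can still show some imbalance by chance, which is why the effect of confounding is reduced rather than eliminated.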

3.13.5 Bias Considerations

If the initial sample is not representative of the population, the study may still suffer from selection bias.

However, within the experiment, random assignment reduces systematic differences between groups.

3.13.6 Why Causality Can Be Established

If the treatment group has significantly higher exam scores than the control group, that is, higher by more than the chance variation from random assignment would plausibly produce, we can attribute this difference to the tutoring program.

This is because:

the researcher controlled the explanatory variable
random assignment reduces the effect of confounding variables

As a result, experimental studies allow for cause-and-effect conclusions.

3.13.7 Conclusion

This example illustrates the key features of an experimental study:

control of the explanatory variable
random assignment
reduced influence of confounding variables

These features allow experimental studies to establish causal relationships, unlike observational studies.

3.14 From Studies to Data

Once a study is designed and data are collected, the next step is to summarize and describe the data.

In the next section, we introduce methods for data description.