3 Studies and Data Collection
In statistics, we collect data in order to answer questions about a population.
A well-designed study is essential to ensure that the conclusions we draw are reliable.
3.1 Statistical Studies
Definition 3.1 (Statistical Study) A statistical study is a process of collecting data in order to answer a question about a population or phenomenon.
A study begins with a clear question and a plan for how data will be collected.
3.2 Population and Sample
Definition 3.2 (Population) A population is the complete collection of all individuals or objects of interest.
Definition 3.3 (Sample) A sample is a subset of the population used to collect data.
In most situations, it is not feasible to observe the entire population.
Instead, we collect data from a sample and use it to learn about the population.
3.3 Variables
Definition 3.4 (Variable) A variable is a characteristic or measurement that can take different values across observations.
Each observation in a study consists of measurements of one or more variables.
3.4 Types of Studies
There are two main types of statistical studies: observational studies and experimental studies.
3.4.1 Observational Studies
Definition 3.5 (Observational Study) An observational study is a study where the researcher records data without manipulating any variables.
In an observational study, the researcher is a passive observer and does not interfere with the process generating the data.
3.4.2 Experimental Studies
Definition 3.6 (Experimental Study) An experimental study is a study where the researcher actively manipulates one or more variables to observe their effect.
In an experiment, the researcher controls certain variables and studies how they affect the outcome.
3.4.3 Association vs Causation
A key distinction between observational and experimental studies is the ability to establish causation.
- Observational studies can identify associations between variables.
- Experimental studies allow for stronger conclusions about cause-and-effect relationships.
Definition 3.7 (Confounding Variable) A confounding variable is a variable that affects both the explanatory and response variables, potentially distorting their relationship.
Confounding variables are a major limitation of observational studies, since they can create misleading associations.
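The following Python sketch illustrates how a confounder can create an association on its own. It is a simulation under stated assumptions, not real data: the variable `z` plays the role of the confounder and drives both `x` and `y`, while `x` has no direct effect on `y`. The names, the seed, and the normal distributions are all illustrative choices.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is reproducible

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Simulated data (purely illustrative): z is a confounder that drives both
# x and y, while x has no direct effect on y.
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r = pearson(x, y)
print(round(r, 2))
```

Under these assumptions the correlation between `x` and `y` is clearly positive (about 0.5 in theory), even though changing `x` would not change `y` at all.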
3.5 Explanatory and Response Variables
Definition 3.8 (Explanatory Variable) An explanatory variable is a variable that may explain changes in another variable.
Definition 3.9 (Response Variable) A response variable is the variable being explained or predicted.
The goal of many studies is to understand how explanatory variables influence the response variable.
3.6 Types of Observational Studies
Observational studies can take different forms depending on how data are collected.
3.6.1 Sample Surveys
Definition 3.10 (Sample Survey) A sample survey is an observational study that collects information from a sample of a population, typically at a single point in time.
Sample surveys are commonly used in polls and questionnaires.
3.7 Units in a Study
Definition 3.11 (Observation Unit) The observation unit is the entity on which measurements are taken.
Definition 3.12 (Sampling Unit) The sampling unit is the entity selected during the sampling process.
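In many cases, the observation unit and the sampling unit are the same, but they can differ depending on the study design. For example, if we select households and then measure every person in each selected household, the household is the sampling unit and the person is the observation unit.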
3.8 Sampling Methods
The way a sample is selected from a population plays a critical role in the quality of a study.
A good sampling method helps ensure that the sample is representative of the population.
3.8.1 Simple Random Sampling
Definition 3.13 (Simple Random Sampling) Simple random sampling is a method where every possible sample of a given size has an equal chance of being selected.
As a consequence, each individual in the population has the same probability of being included in the sample.
Advantages
- Easy to understand and implement
- Produces unbiased samples when properly conducted
Disadvantages
- Requires a complete list of the population
- Can be impractical for large or geographically dispersed populations
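A minimal Python sketch of simple random sampling, assuming a hypothetical roster of 500 student IDs. The roster, the seed, and the sample size of 100 are illustrative assumptions. The standard-library function `random.sample` draws without replacement, so every subset of the requested size is equally likely.

```python
import random

# Hypothetical sampling frame: a complete list of 500 student IDs.
population = list(range(1, 501))

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Draw 100 distinct IDs; every subset of size 100 is equally likely.
srs = rng.sample(population, k=100)

print(len(srs), len(set(srs)))
```

Note that the sketch needs the complete list `population` before it can draw anything, which is exactly the first disadvantage listed above.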
3.8.2 Stratified Sampling
Definition 3.14 (Stratified Sampling) Stratified sampling divides the population into non-overlapping, homogeneous groups called strata and then takes a random sample from each stratum.
The strata are formed based on characteristics that are relevant to the study.
Advantages
- Ensures representation from all important subgroups
- Can increase precision of estimates
Disadvantages
- Requires knowledge of population structure
- More complex to implement
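A short Python sketch of stratified sampling with proportional allocation. It assumes a hypothetical population of 500 students split into two strata by year of study. The strata, their sizes, and the one-in-five sampling rate are illustrative assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: (student_id, year) pairs, with year as the stratum.
population = [
    (i, "first-year" if i <= 300 else "upper-year") for i in range(1, 501)
]

# Group the population into strata.
strata = {}
for student_id, year in population:
    strata.setdefault(year, []).append(student_id)

# Proportional allocation: draw a simple random sample of one in five
# students within each stratum.
stratified_sample = {
    year: rng.sample(ids, k=len(ids) // 5) for year, ids in strata.items()
}

for year, ids in stratified_sample.items():
    print(year, len(ids))
```

Because each stratum is sampled separately, both groups are guaranteed to appear in the sample in proportion to their size (60 and 40 students here), which a single simple random sample does not guarantee.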
3.8.3 Cluster Sampling
Definition 3.15 (Cluster Sampling) Cluster sampling divides the population into groups called clusters and randomly selects entire clusters to include in the sample.
All observations within selected clusters are included in the sample.
Advantages
- Cost-effective and practical for large populations
- Useful when the population is geographically spread out
Disadvantages
- Can lead to higher variability in estimates
- Clusters may not be representative of the population
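A Python sketch of one-stage cluster sampling, assuming a hypothetical course with 20 tutorial sections of 25 students each. The section structure, the student labels, and the choice of 4 clusters are illustrative assumptions.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 20 tutorial sections (clusters) of 25 students each.
clusters = {
    section: [f"S{section:02d}-{seat:02d}" for seat in range(1, 26)]
    for section in range(1, 21)
}

# Randomly select 4 whole clusters.
chosen_sections = rng.sample(sorted(clusters), k=4)

# One-stage cluster sampling: include every student in each selected cluster.
cluster_sample = [
    student for section in chosen_sections for student in clusters[section]
]

print(chosen_sections, len(cluster_sample))
```

Here the randomness applies to sections, not to individual students, so only a list of sections is needed to draw the sample.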
3.8.4 Systematic Sampling
Definition 3.16 (Systematic Sampling) Systematic sampling selects observations at regular intervals from an ordered list of the population.
For example, we might select every \(k\)-th observation after a random starting point.
Advantages
- Simple and quick to implement
- Ensures a spread of observations across the population
Disadvantages
- Can introduce bias if there is a pattern in the data
- Not truly random if ordering is structured
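A minimal Python sketch of systematic sampling, assuming a hypothetical ordered roster of 500 student IDs and an interval of \(k = 5\). The roster and the seed are illustrative assumptions.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical ordered list of 500 student IDs (for example, a class roster).
roster = list(range(1, 501))

k = 5                     # sampling interval
start = rng.randrange(k)  # random starting index between 0 and k - 1

# Take every k-th entry, beginning at the random start.
systematic_sample = roster[start::k]

print(start, len(systematic_sample))
```

Only the starting point is random. Once it is fixed, the rest of the sample is determined by the ordering of the list, which is why a patterned ordering can introduce bias.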
3.8.5 Convenience Sampling
Definition 3.17 (Convenience Sampling) Convenience sampling selects observations that are easiest to access.
Advantages
- Easy and inexpensive
- Useful for exploratory analysis
Disadvantages
- Often highly biased
- Not representative of the population
- Results cannot be generalized reliably
3.9 Summary of Sampling Methods
- Probability sampling methods (simple random, stratified, cluster, systematic) rely on random selection, which supports more reliable conclusions about the population.
- Non-probability methods (such as convenience sampling) are prone to bias.
- The choice of sampling method depends on the study objectives, resources, and population structure.
3.10 Bias in Studies
Definition 3.18 (Bias) Bias is a systematic error in data collection or analysis that leads to incorrect conclusions.
Bias can arise in many ways, for example through how the sample is selected or through nonresponse, and it can significantly affect the validity of a study.
3.11 Summary
A well-designed study requires:
- a clear definition of the population and sample
- an understanding of the variables involved
- an appropriate study design (observational or experimental)
- awareness of potential bias and confounding variables
3.12 Example: Observational Study
Suppose we want to study the relationship between study time and exam performance among university students.
3.12.1 Description of the Study
We collect data from a group of students without assigning or controlling how much they study.
We simply record their study habits and exam scores.
This is an observational study because no variables are manipulated by the researcher.
3.12.2 Identifying the Components
Population
All university students enrolled in the course.
Sample
A group of 100 students selected from the course.
Sampling Method
Suppose we select students by choosing every 5th student from the class roster.
This is an example of systematic sampling.
Observation Unit
Each individual student.
Sampling Unit
Each student selected from the roster.
In this case, the sampling unit and observation unit are the same.
Variables
- Explanatory variable: number of hours studied per week
- Response variable: exam score
Each student contributes one observation consisting of these measurements.
3.12.3 Potential Confounding Variables
There are other variables that may affect exam performance, such as:
- prior academic ability
- attendance
- access to tutoring
- sleep and health
These variables may influence both study time and exam scores.
This makes them confounding variables: they can distort the relationship between the explanatory and response variables.
3.12.4 Bias Considerations
If some selected students are absent when data are collected, and absent students differ systematically from those who are present, the sample may suffer from nonresponse bias.
If the class roster is ordered in a non-random way (for example, by performance), systematic sampling could introduce selection bias.
3.12.5 Why Causality Cannot Be Established
Even if we observe that students who study more tend to have higher exam scores, we cannot conclude that studying more causes higher scores.
This is because:
- confounding variables may explain the relationship
- the researcher did not control or assign study time
As a result, the study can identify associations, but not cause-and-effect relationships.
3.13 Example: Experimental Study
Suppose we want to study whether a new tutoring program improves exam performance among university students.
3.13.1 Description of the Study
We select a group of students and randomly assign them to one of two groups:
- a treatment group that receives the tutoring program
- a control group that does not receive the program
We then compare their exam scores at the end of the semester.
This is an experimental study because the researcher actively assigns the treatment.
3.13.2 Identifying the Components
Population
All university students enrolled in the course.
Sample
A group of 100 students selected from the course.
Sampling Method
Suppose the students are selected using simple random sampling from the class roster.
Observation Unit
Each individual student.
Sampling Unit
Each student selected from the roster.
In this case, the sampling unit and observation unit are the same.
Variables
- Explanatory variable: participation in the tutoring program (yes or no)
- Response variable: exam score
Each student contributes one observation.
3.13.3 Random Assignment
After selecting the sample, students are randomly assigned to the treatment or control group.
Random assignment helps ensure that the two groups are comparable.
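A minimal Python sketch of random assignment, assuming the 100 sampled students are identified by hypothetical IDs 1 to 100 and that the two groups are of equal size. Both are illustrative assumptions.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is reproducible

# Hypothetical sample of 100 student IDs already selected from the roster.
sample_ids = list(range(1, 101))

# Shuffle a copy, then split it in half: the first half forms the
# treatment group and the second half forms the control group.
shuffled = sample_ids[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:50], shuffled[50:]

print(len(treatment), len(control))
```

Every student ends up in exactly one group, and chance alone decides which one, so neither the researcher nor the student influences the assignment.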
3.13.4 Role of Confounding Variables
The following variables may still affect exam performance:
- prior academic ability
- motivation
- attendance
However, because of random assignment, these variables are expected to be balanced across the groups.
This reduces the impact of confounding variables.
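The idea of balance can be checked with a small simulation. The Python sketch below assumes made-up prior-ability scores for 100 students and repeats the random assignment many times. The scores, the seed, and the number of repetitions are illustrative assumptions, not data from a real study.

```python
import random
from statistics import mean

rng = random.Random(5)  # fixed seed so the sketch is reproducible

# Simulated prior-ability scores for 100 students (illustrative values only).
ability = [rng.gauss(70, 10) for _ in range(100)]

def gap_after_random_assignment(scores, rng):
    """Randomly split scores into two equal groups; return the gap in group means."""
    shuffled = scores[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return mean(shuffled[:half]) - mean(shuffled[half:])

# Repeat the random assignment many times to see how the gap behaves on average.
gaps = [gap_after_random_assignment(ability, rng) for _ in range(2000)]
average_gap = mean(gaps)

print(round(average_gap, 2))
```

Any single assignment can leave a small gap in prior ability between the groups, but the gap averages out to roughly zero across repeated assignments. This is the sense in which confounders are balanced in expectation.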
3.13.5 Bias Considerations
If the initial sample is not representative of the population, the study may still suffer from selection bias.
However, within the experiment, random assignment reduces systematic differences between groups.
3.13.6 Why Causality Can Be Established
If the treatment group has significantly higher exam scores than the control group, we can attribute this difference to the tutoring program.
This is because:
- the researcher controlled the explanatory variable
- random assignment reduces the effect of confounding variables
As a result, experimental studies allow for cause-and-effect conclusions.