MATH 2565 W 2007 Section M Summary Week 2

From MathWiki

Basic issues

We analyze a data set to learn about a target of inference, usually a population (e.g. all potential customers of a firm) or a hypothetical situation (e.g. what would be likely to happen if I chose to take, or not to take, a particular cancer drug).

The key questions are:

  • Does our data set give us accurate information about the target?
  • Are there ways of analyzing the data that will help us get more accurate information about the target?

Classifying data sets

1. Experimental vs Observational (if there's a treatment): Is it under the control of an experimenter?
If the assignment of treatment is under experimental control, we say that the data set consists of experimental data. Otherwise (the treatment is selected by the subject, by nature, etc.), we have observational data.
Why does it matter?
If we really want to know whether the treatment causes some outcome (e.g. a cure), then we need an experiment. Observational data can be used for causal inference, but this requires much more complex methods of analysis, and we can never really be sure of the conclusion.
2. With experimental data: method of assignment of treatments: random, convenience, judgment, haphazard?
Why does it matter?
To be confident that differences are really due to the treatment and not to some other factor, assignment must be rigorously random. In medical studies, a properly randomized experiment is called a randomized clinical trial, or RCT for short. This is also known as a randomized controlled experiment.
A well designed experiment may attempt to exclude uninteresting reasons for which the treatment might cause the outcome, e.g.
- to exclude experimenter bias, we might ensure that the person assessing the outcome does not know which treatment a subject received
- to exclude psychological factors, we might ensure that the subject does not know which treatment is applied. (Note: for ethical reasons, human subjects must always know which treatments might be applied, and they must give consent to the entire process, including the fact that they won't be aware of the exact treatment they are receiving.)
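The best kind of experiment is a randomized controlled double-blind experiment, in which neither the subject nor the person assessing the outcome knows which treatment was received.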
However, such experiments are extremely expensive and difficult to carry out, so much of the time we need to make the best of data sets that are very far from ideal. Although we would like to have ideal data, the fact that we rarely do is part of what makes statistics as exciting a field as it is (in my opinion!).
3. With observational data sets: method of selection from a population: random (probability sample), convenience, judgment, haphazard?
Why does it matter?
If a sample is selected using an explicit probability method (a probability sample), then we can generalize validly to a population. Otherwise it's hard to be confident that inferences are valid.
An important kind of observational data set is the sample survey, with a randomly selected sample from a specified target population. (The list of units from which the sample is actually drawn is called the sampling frame.) With surveys, an important issue is the response rate: the quality of inferences from a well selected sample can be impaired by a poor response rate.
Note: It's possible to have a randomized experiment on a random sample. This would be ideal. In reality it rarely happens with human subjects. The need for consent for experiments means that the sample is somewhat self-selected at best.
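
The difference between self-selected and randomly assigned treatment can be illustrated with a small simulation. The following is a minimal sketch, not part of the course material: the `health` variable, the effect size, the sample size and the function name are all hypothetical choices made for illustration. Each simulated subject has an unobserved health score that affects the outcome. In the observational version healthier subjects choose the treatment; in the experimental version a coin flip decides.

```python
import random
import statistics

TRUE_EFFECT = 1.0  # assumed treatment effect, chosen only for illustration


def simulate(assignment, n=20000, seed=2565):
    """Return the difference in mean outcome, treated minus untreated.

    assignment is "random" (an experiment) or "self" (observational data
    in which healthier subjects choose the treatment).
    """
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)  # hypothetical unobserved factor
        if assignment == "random":
            gets_treatment = rng.random() < 0.5  # coin flip by the experimenter
        else:
            gets_treatment = health > 0  # subject decides, based on health
        outcome = health + rng.gauss(0, 1)
        if gets_treatment:
            treated.append(outcome + TRUE_EFFECT)
        else:
            untreated.append(outcome)
    return statistics.mean(treated) - statistics.mean(untreated)


experimental_diff = simulate("random")
observational_diff = simulate("self")
print(f"assumed true effect:        {TRUE_EFFECT:.2f}")
print(f"randomized assignment:      {experimental_diff:.2f}")
print(f"self-selected assignment:   {observational_diff:.2f}")
```

Under these assumptions the randomized difference should land close to the assumed effect, while the self-selected difference is inflated because the two groups differ in health as well as in treatment.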

Things that can go wrong

What can go wrong so the sample does not give accurate information?

1. Poor selection: the sample is not representative of the population or target of inference.
- non-random sample
- self-selection or use of judgment for allocation of treatment
2. Poor measurement:
- e.g. a question that induces bias in a survey.
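
The effect of poor selection, including a poor response rate, can also be sketched with a simulation. This is an illustrative sketch under made-up assumptions: the population size, the 40% level of support, and the response probabilities (supporters are assumed to be far more willing to answer than non-supporters) are all hypothetical.

```python
import random

rng = random.Random(2007)

# Hypothetical population: True means the person supports some proposal.
population = [rng.random() < 0.40 for _ in range(100_000)]
true_share = sum(population) / len(population)

# Probability sample: every unit has the same chance of being selected.
sample = rng.sample(population, 2000)
sample_share = sum(sample) / len(sample)

# Nonresponse: assume supporters answer with probability 0.8, others with 0.2.
respondents = [
    supports
    for supports in sample
    if rng.random() < (0.8 if supports else 0.2)
]
response_rate = len(respondents) / len(sample)
respondent_share = sum(respondents) / len(respondents)

print(f"share in population:        {true_share:.3f}")
print(f"share in random sample:     {sample_share:.3f}")
print(f"response rate:              {response_rate:.3f}")
print(f"share among respondents:    {respondent_share:.3f}")
```

The full random sample tracks the population share closely, but the share among those who actually respond overstates support, even though the sample itself was well selected.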

Classifying data, i.e. variables in data sets

See text and slides


Major issues in classifying data sets:

1. How were units obtained?
- random selection from a population
- haphazard/convenient selection from a population
- convenience sample
2. How are units selected for 'treatment' if any?
- random: stratified, cluster or block?
- self-selection?
- judgment or convenience?
3. If an experiment, how was it carried out?
- double-blind, single-blind or not blind?
4. If a survey:
- what is the response rate?
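
The questions above can be read as a rough decision procedure. The sketch below encodes one possible reading of it. The `Study` fields, the category strings and the wording of the returned labels are hypothetical names introduced here, not terminology from the text or slides.

```python
from dataclasses import dataclass
from typing import Optional

# Assignment methods that are under the experimenter's control.
EXPERIMENTER_CONTROLLED = {"random", "judgment", "convenience", "haphazard"}
BLINDING = {0: "not blind", 1: "single-blind", 2: "double-blind"}


@dataclass
class Study:
    # How units were obtained: "random", "convenience", "judgment" or "haphazard".
    selection: str
    # How treatment was assigned: one of EXPERIMENTER_CONTROLLED, "self",
    # or None if there is no treatment at all.
    assignment: Optional[str] = None
    # 0 = nobody blind, 1 = single-blind, 2 = double-blind.
    blind_level: int = 0
    # Fraction of selected units that responded (surveys only).
    response_rate: Optional[float] = None


def classify(study: Study) -> str:
    """Describe a data set using the four questions above."""
    if study.assignment in EXPERIMENTER_CONTROLLED:
        kind = "randomized" if study.assignment == "random" else "non-randomized"
        return f"{kind} experiment ({BLINDING[study.blind_level]})"
    if study.selection == "random":
        if study.response_rate is not None:
            return (
                "observational data: sample survey (probability sample, "
                f"response rate {study.response_rate:.0%})"
            )
        return "observational data (probability sample)"
    return "observational data (non-probability sample)"


print(classify(Study(selection="convenience", assignment="random", blind_level=2)))
print(classify(Study(selection="random", response_rate=0.62)))
print(classify(Study(selection="convenience", assignment="self")))
```

The order of the checks reflects the outline: who controls the treatment decides between experimental and observational data, and only then does the method of selecting units matter.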


  • When would you choose each of the following types of data:
    • survey
    • observational data
    • experiment
  • How would you determine whether a data set consists of
    • a survey,
    • observational data, or
    • an experiment?