MATH 1532 2011W

From MathWiki

MATH 1532 Statistics for Business and Society Winter 2011

  • Be sure to RELOAD this page in your browser to see the latest changes
  • See the syllabus file for all links to slides, sample mid-term, etc.

Breaking News

  • Mar 11: The second team assignment is due on March 31 (with apologies for an earlier error)
  • Feb 10: Here is a screen capture video (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/MATH1532_Week_6_Day_2.html) of the lecture on Feb 10. Note that there is a voice over to correct my mistake when I used the word 'reliability' instead of the correct expression: 'sensitivity'. The probability of a positive test result given the presence of a disease is properly called the 'sensitivity' of the test. Where I went fishing for the word 'reliability' I have no idea.
  • Feb 10: The sample mid-term test (http://www.math.yorku.ca/people/georges/Files/MATH1532/Tests/Sample_mid-term_test_MATH_1532.PDF) is available. There is also another file with one additional question (http://www.math.yorku.ca/people/georges/Files/MATH1532/Tests/Sample_mid-term_test_MATH_1532-ADDITIONAL_QUESTION.pdf) that could be (with high probability) on the mid-term.
  • Feb 3: See the latest changes in the syllabus page.
  • Feb 2: At the bottom of this page, there is a section called Special Topics where I have notes on topics covered in class and in slides but not fully covered in the textbook.
  • Jan 11: There is now an evolving syllabus for the course which will also serve to provide links to files, exercises and assignments as the term progresses.
  • I will be trying to get a room for a tutorial session at a time that is as generally useful as possible. In the meantime I have office hours on Thursdays from 4 to 6 pm.
  • Jan 6: Slides used in class plus one set of slides that was not used have been uploaded to MATH1532/Slides (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides)

General Information

Statistical reasoning is crucial for a critical understanding of the flood of data and information we face daily in modern society. Understanding the principles of statistical reasoning and being aware of a number of widespread errors in statistical thinking is often the key for distinguishing arguments that are sound from those that are fallacious.

This course stresses the logic and reasoning behind statistics avoiding emphasis on complex mathematical formulas. Statistical reasoning will be applied to a critical analysis of current events reported in the media and current scientific, medical and social controversies.


Instructor

  • Georges Monette, Ph.D., P.Stat. (http://www.ssc.ca/accreditation/index_e.html)
    • N626 Ross
    • Email: georges+math1532@yorku.ca (mailto:georges+math1532@yorku.ca) (Note: the "+math1532" portion is designed to avoid spam filters and to allow me to give your messages higher priority)
    • Phone: (416) 736-2100 ext. 77164
    • http://www.math.yorku.ca/~georges
    • Office hours: Thursday afternoons, 4 pm to 6 pm, in N626 Ross

Course work and grades

| Item | Date | Weight |
|---|---|---|
| Assignment 0 (individual) | Jan. 11 noon | 0% |
| Assignment 1 (team) | Feb. 3 (at beginning of class) | 10% |
| Mid-term test | Feb. 17 | 20% |
| Assignment 2 (team) | March 31 | 10% |
| Project (individual) | April 4 (by email or at my office) | 30% |
| Final exam | | 30% |

Text and References

The text book is:

  • Ken Black, Chuck Chakrapani, and Ignacio Castillo (2010) Business Statistics: For Contemporary Decision Making, Canadian Edition, Wiley.
  • We will also cover material that is not in the textbook. Links to background materials will be provided on this web page.

Syllabus

A table showing the syllabus also provides links to course notes, exercises and assignments.

Lectures and Tutorials

  • Class: Tuesdays and Thursdays, 10 am to 11:30 am in CLH (Curtis Lecture Hall) H
  • Optional tutorial: Time and location TBA

Important Dates

| Event | Date |
|---|---|
| Last date to enrol without permission | January 17 |
| Last date to drop the course without receiving a grade | March 4 |
| Last class | March 31 |
| Last date to submit term work and end of classes | April 4 |
| Reading week | February 19 to 25 |
| Exam period | April 6 to 23 |

Resources

Datasets, lecture notes, information on computing, etc. will be posted in http://www.math.yorku.ca/people/georges/Files/MATH1532/. Since some of the material may be copyrighted, access to the files is protected and requires a userid: 'buso' and a password: 'buso' also. If you find any interesting links please send them to me at georges+math1532@yorku.ca (mailto:georges+math1532@yorku.ca?subject=Interesting%20link%20for%20MATH1532) and I will try to add them to the links at the bottom of this page.

Using computers for the course

Some assignments and the project will require you to analyze data using computer software. The test and exam will require you to interpret output from the same software. You can learn the computing aspects of the course in a number of ways:

  • If you have access to a computer, you will be able to download the software for the course, an add-on package for Excel 2007. If you have a laptop, you are encouraged to bring it to class and to tutorials and office hours.
  • If you don't have access to a computer, you can get an account to use computers at York where the software will be available. If this is the case for you, please send me an email message so I can make appropriate arrangements.

Week 1: Jan 4,6: Introduction

Material covered

Textbook

None

Other materials

The lecture and links will be posted after the course.

What is 'Statistics'?

The definition in the text says:

Definition: Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty.
This is certainly an important aspect of statistics but I think it only tells a small part of the story. Statistics is the science (and art) of working with uncertainty --- whether you plan to make decisions or not. We tend to think of statements as true or false. But in practice, the truth or falsity of most important statements is not known for sure. There are all shades of uncertainty between being sure a statement is true and being sure it is false. Many of the most important decisions and choices we make in life are made despite the fact that we don't have all the information we would like to have to determine which route is best. Sometimes we simply act as if something is true or false although we don't really know. Statistics is not just about how to make these difficult decisions. It is also about remembering and being aware of our uncertainty so we know where to look for better information and we are ready to revise our hypotheses as relevant information becomes available. Statistics is not just about making decisions, it's about where to look for information that could lead us to change our decisions. It's about knowing when to keep an open mind and knowing when and how to change your mind.
Statistics is about the fascinating journey from ignorance to increasingly certain knowledge to wisdom. This is a journey we all follow individually. It is also a journey undertaken by disciplines, by political and social organisms and by mankind as a whole.

Experimental vs Observational data

If X and Y are correlated, what can it mean?
1) X causes Y?
2) Y causes X?
3) Another variable(s) Z(s) causes both X and Y?
a) Some Zs might be known and measurable. For these Zs we might be able to adjust using sophisticated statistical methods.
b) Some Zs might be known but hard or impossible to measure. This is more difficult to deal with.
c) Some Zs might not be discovered until the year 3000. We can't adjust statistically for these.
4) Selection: maybe there's no relationship but some data got thrown out or ignored and the data left created the impression of a relationship.
5) Chance: This is the one statisticians are really good at dealing with -- as you will learn in this course.
What if we have an 'experiment' with 'random allocation of X' to experimental units?
1) Possible: X may cause Y.
2) No! We know what caused X. It was the coin toss or the random number generator that caused X.
3) Maybe. But it could only be by chance that differences in levels of any combination of Zs, known or not, measurable or not, would have a large impact on Y.
4) We can exclude this by careful checking.
5) Chance again.
So we are left with two options:
1) X causes Y, or
2) Chance.
We can use statistical analysis to measure chance. If the chance is very small then we may be left with X causes Y as the plausible explanation.
How should you react to causal claims based on data analyses?
1) Experimental data or observational? You might have to ask questions to answer this. Sometimes it isn't obvious from the appearance of the data.
2) If experimental: was allocation random or by judgment or haphazard? Was the study double-blind? Are there possible biases in measurements? Psychological factors that influence outcome? Does the claim match the nature of the experiment or is the claim stretching to something that does not correspond exactly to what was done in the experiment?
3) If observational:
a) Can you poke an obvious hole in the claim? E.g. is there a plausible alternative explanation that was not taken into account in the analysis? In this case, you've countered the claim.
b) What has the analysis adjusted for? Are these factors that can be measured with precision? What kinds of factors are not accounted for?
Some examples: Toronto Star: Pulse (http://www.math.yorku.ca/~georges/Courses/2565/StatisticsInTheNews030926.html)
Which examples are experimental and which are observational?
Which conclusions are reasonable and which are not? Why?

Synopsis

Types of data

  • Purposes for analyzing data:
    • Descriptive
    • Inference: causal
    • Inference: predictive
  • Type of data
    • Experimental: X under control of experimenter: random assignment of levels of X
    • Observational: X determined by other factors and just observed, not manipulated
  • How purposes and types of data match
    • Descriptive statistics can be done with any kind of data since there is no intention to generalize
    • Causal inference is best done with experimental data
      • Caution: experiments are often conducted with volunteers who may not be similar to the target population for causal inference. Often, the only true experiments may be on animals who may or may not mimic the corresponding processes in humans.
    • Predictive inference is best done with observational data sampled so it is representative of the target population.
      • Just as random allocation is crucial for experiments, random selection is ideal for observational data for predictive inference.
    • Causal inference with observational data is highly problematic
      • Often, important questions are causal in nature and all that's available is observational data.
      • We can never be certain of causal conclusions based on observational data
      • Intelligent evaluation of causal claims based on observational data is challenging but may be the only way to shed light on crucial questions.
  • Assessing causal claims from observational data, where the relationship between X and Y is too strong to be attributed to chance:
    • Look for plausible alternative explanations:
      • Could Y cause X?
      • Are there obvious plausible confounding factors, that is, factors that could cause both X and Y? Note that factors that are caused by X and, in turn, cause Y are mediating factors that explain and do not contradict the possibility that X causes Y.
      • Have some of these possible confounding factors been controlled for in the study? How effectively?
      • Do important factors remain that have not been controlled for?
      • Consider the possibility of a selection effect.
      • Consider possible mediating factors that could explain how X could cause Y, even when the suggestion that X causes Y seems surprising.
      • When there are different sources of data, consider which seem more reliable and why.
      • What kind of data could determine whether X causes Y? Why does it not yet exist? Is it likely to be available in the future? What obstacles exist to obtaining such data?
      • Can you come to a practical conclusion and how much confidence do you have in it?
  • Good experiments:
    • Control vs treatment groups: experiments involve a comparison between two or more conditions or treatments
    • Placebos -- blinding of subject
    • Blinding of assessor
    • If both subject and assessor are blind, we have double blind
    • Randomization is crucial so we can be sure that all possible confounding factors, known or unknown, are not responsible for the outcome except possibly by chance. Randomization can be applied in many ways:
      • completely randomized design: take all subjects and randomly allocate to each treatment
      • paired designs: for two treatments: split subjects into pairs that are similar with respect to relevant variables, then randomly select within each pair.
      • blocked designs: for more than two treatments: split subjects into similar blocks with as many subjects as treatments, then randomly assign within each block.
      • longitudinal designs: give all or some of the treatments to each subject. Randomize order.
  • Special types of observational studies for causal inference:
    • Retrospective: (measure Y in the present or past and X in the past)
    • Prospective: measure X now, Y later.
    • Case-control: If Y is disease vs. no disease: choose a group of subjects with the disease (the cases) and then, for each case, find a non-diseased subject who is similar with respect to selected Zs. Measure X on everyone and see if X is related to Y.
    • Longitudinal without randomization: Subjects get all levels of X either in same order or in an order not controlled by experimenter.
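
The completely randomized and paired designs listed under 'Good experiments' can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subject labels, the two treatment names and the function names are hypothetical, and the seed is fixed only so the example is reproducible.

```python
import random
from collections import Counter

def completely_randomized(subjects, treatments, rng):
    """Shuffle all subjects, then deal them out to the treatments in turn."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: treatments[i % len(treatments)] for i, s in enumerate(shuffled)}

def paired_design(pairs, treatments, rng):
    """Within each pair of similar subjects, randomly decide who gets which of two treatments."""
    allocation = {}
    for first_subject, second_subject in pairs:
        # Sampling 2 items from a list of 2 gives the treatments in random order.
        first_trt, second_trt = rng.sample(list(treatments), 2)
        allocation[first_subject] = first_trt
        allocation[second_subject] = second_trt
    return allocation

rng = random.Random(2011)                     # fixed seed: illustration only
treatments = ["treatment", "control"]         # hypothetical labels
subjects = [f"S{i}" for i in range(1, 9)]     # eight hypothetical subjects

crd = completely_randomized(subjects, treatments, rng)
pairs = [("S1", "S2"), ("S3", "S4"), ("S5", "S6"), ("S7", "S8")]  # assumed already matched on relevant variables
paired = paired_design(pairs, treatments, rng)

print(Counter(crd.values()))  # group sizes under the completely randomized design
print(paired)                 # each pair contains one subject per treatment
```

In both designs the chance mechanism, not the experimenter or the subject, decides who gets which treatment, which is what protects against confounding factors.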


Exercises

Exercises are not graded but they are useful preparation for the mid-term test or the final exam

Read the articles in Toronto Star: Pulse (http://www.math.yorku.ca/~georges/Courses/2565/StatisticsInTheNews030926.html) and answer:

  • Which examples are experimental and which are observational?
  • Which conclusions are reasonable and which are not? Why?

Assignment 0

Due: 12 noon, Monday, January 10, 2011

I would like to know something about you and I also want to form random teams of 4 or 5 students to work on Assignment 1. I will use your emailed responses to this Assignment 0 to form the teams. You will receive the names of your team members on January 11 so you can meet face to face at the break during the class on January 11.
Send me (to georges+math1532@yorku.ca (mailto:georges+math1532@yorku.ca?subject=Assignment%200)) an email message from the e-mail account where you would like to receive email for the course.
In your email message, tell me about yourself by responding to the following questions. Please cut and paste the questions into your email message and then add your answers between the questions. Note that only the first two items will be shared with your work group:
1) Your given name and family name. (In parentheses, state the name by which you prefer to be called if it's different from your given name)
2) Your e-mail address
3) Your student number
4) Previous statistics courses if any?
5) What kind of computer do you plan to use for the course? e.g. Laptop, desktop, computer lab.
6) If you have a computer:
a) What operating system does your computer use: e.g. Windows Vista, MacOS X, Ubuntu Linux.
b) What spreadsheet packages do you have on your computer (e.g. Excel and/or Calc)? Provide the names and versions. e.g. Excel 2007.
c) What word processing packages (MS Word, OpenOffice Writer)?
d) What statistical packages do you have (if any) on your computer? (e.g. SPSS, SAS, S-Plus, R)
7) List the software packages you use and indicate your skill level on a scale from 1 to 10 (1 = very basic, 3 = basic user, 5 = solid knowledge of basic features, 8 = advanced user (e.g. can define macros if relevant), 10 = guru)
a) Operating systems:
b) Spread sheet software:
c) Word processing software:
d) Statistical software:
8) What do you want to get out of this course?
9) What career(s) are you thinking of pursuing?
10) Any other information you would like to share with me:

Team Assignment 1 (part 1)

Due: Feb. 3 at beginning of class

1) Find a topic in the news currently or within the past year that involves some controversy over the interpretation of evidence.
2) Collect some clippings or on-line links to news, magazine or journal articles related to the topic.
3) Discuss why the topic is controversial. Is the controversy over causality? Why is there room for disagreement? What kind of evidence, data or theory, is available to support the various sides of the issue? Discuss the apparent strengths and weaknesses in the data or theory on either side. Is the available data observational or experimental? Is this relevant to the issue? What kind of data, if any, could resolve the issue? What obstacles are there to obtaining the ideal data to resolve the issue? Is better data likely to become available and how would it be helpful?
4) End the assignment with brief individual essays (identify the authors) stating your individual positions on the topic. Have you adopted a point of view? Describe the ways in which you remain uncertain and how your uncertainty could be resolved. If you wish you can write this part of the assignment as if it were a panel discussion among the members of your team. You could, in fact, record a panel discussion and transcribe it to text.

You are not expected to become experts in two weeks in the topic you choose. The goal of the assignment is for you to become informed lay persons with an understanding of the nature of the controversy and uncertainty concerning your topic, an understanding of the approaches that could resolve it and the challenges to achieving a resolution.

All members of the team receive the same grade. The grade is based on the quality of your research and the interest and intellectual energy you display in dealing with the problem.

Links

This section will contain links to data sets, notes, and, when it works, a video of the screen display for the course

Week 2: Jan 11,13: Continuation of Week 1; BCC Chapter 1

Material covered

Textbook

BCC Chapter 1

  • Meaning of 'statistics'
  • Where statistics is used
  • Basic Concepts:
    • Purposes:
      • Description
      • Inference: predictive vs causal
    • Population, Sample
    • Types of variables: nominal, ordinal, interval, ratio

Exercises

BCC Chap. 1, p. 16: 1.5, 1.7

Exercises are not graded but they are useful preparation for the mid-term test or the final exam

Week 3: Jan 18, 20

See links and exercises in the syllabus table.

Week 4: Jan 25, 27

See links and exercises in the syllabus table.

Links

This section will contain links to data sets, notes, and, when it works, a video of the screen display for the course

Weeks 5-7

See the weekly syllabus file.

Week 8: Mar 1, 3

Assignment 2

Assignment 2 will be done in the same groups as Assignment 1 except that groups that have become too small may be combined with others. Assignment 2 consists of the accumulated problems from week to week that are assigned over the next three weeks. The assignment is due on March 31.

Each current group should send me (mailto:georges+math1532@yorku.ca) one email message giving me the name of the group and the names of its members. I'll address issues concerning reconstitution of groups on Sunday, March 6.

Project (Individual)

The general idea is to perform an analysis of some data that you find interesting using the statistical tools and critical insights that you have developed in the course. To help you find a topic and data you can have a look at Statistics: Pedagogical resources on this wiki.

  1. Identify a topic you find interesting about which you have a question that could be resolved with appropriate data and analysis.
  2. Find a number of sources (3 or more, except in very special cases where 3 or more sources would not exist) that provide information relevant to your question. At least one source should have relevant data.
  3. Perform some analyses of the data including summaries of the distribution of relevant variables and relevant graphs.
  4. Based on a critical assessment of your sources and your analysis, discuss the implications for your question.
  5. Discuss clearly the strengths and limitations of your analysis and existing information in addressing your question.

Some guidelines for your report:

  1. Aim for a length of 8 to 12 pages of analyses and discussion plus at least 2 pages of relevant graphs.
  2. Show the results of at least one and preferably two analyses using a single data set -- unless you are very ambitious and want to use more.

Grading:

  1. Clear expression of specific question and relevant field: 10%
  2. Choice of sources and clear references: 10%
  3. Clarity and quality of argument: 20%
  4. Relevance and quality of analysis: 20%
  5. Relevance and quality of graphs: 20%
  6. Clear formal academic style of writing: 5%
  7. Effort: 5%
  8. Structure: 5%
  9. Overall appearance: 5%

Special Topics

This is a summary of topics that received special emphasis in class and are covered in slides.

Computing standard deviations

There are 'quick' formulas to compute standard deviations but in practice we will always use software for any real data set. These formulas are typically not very informative and, surprisingly, very poor as computational formulas because they have poor numerical properties, i.e. they are easily affected by round-off error in the calculation.

It is more instructive to know how to use the 'long' way of calculating the standard deviation and to be able to apply it to small data sets.

Suppose the sample is: -5, 3, 0, -1, 3, 6. Note that n = 6.

| Data X_i | Mean \bar{X} | Deviation from mean X_i-\bar{X} | Squared deviation from mean (X_i-\bar{X})^2 |
|---|---|---|---|
| -5 | 1 | -6 | 36 |
| 3 | 1 | 2 | 4 |
| 0 | 1 | -1 | 1 |
| -1 | 1 | -2 | 4 |
| 3 | 1 | 2 | 4 |
| 6 | 1 | 5 | 25 |
| Total = 6 | Total must be the same as data | Total = 0 (calculate this as a check) | Total = 74 |

The sample variance of X is

s_X^2 = \frac{\Sigma (X_i-\bar{X})^2}{n-1}= \frac{74}{5}=14.8

The sample standard deviation is

s_X = \sqrt{s_X^2} = \sqrt{14.8} = 3.85 (here you finally need a calculator)
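
The 'long' way can also be written out step by step in Python as a check on the hand calculation. This is a minimal sketch: the function name is made up for the illustration, and the comparison with the standard library's `statistics.stdev` is included only to confirm that both use the n − 1 divisor.

```python
import math
import statistics

def sample_sd_long_way(data):
    """Follow the columns of the table: mean, deviations, squared deviations."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]      # these should add up to 0
    squared = [d ** 2 for d in deviations]
    variance = sum(squared) / (n - 1)          # sample variance divides by n - 1
    return mean, deviations, squared, variance, math.sqrt(variance)

sample = [-5, 3, 0, -1, 3, 6]
mean, deviations, squared, variance, sd = sample_sd_long_way(sample)

print("mean:", mean)
print("sum of deviations (check):", sum(deviations))
print("sum of squared deviations:", sum(squared))
print("variance:", variance)
print("standard deviation:", round(sd, 2))
print("statistics.stdev agrees:", math.isclose(sd, statistics.stdev(sample)))
```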

When is the empirical rule valid?

The empirical rule says how much probability to expect within 1, or within 2, or, generally within k, standard deviations of the mean of a variable. Two key values to remember: If a random variable obeys the empirical rule then you expect approximately 68% of the data to lie within one standard deviation of the mean and 95% to lie within 2 standard deviations of the mean.

To use the empirical rule wisely, we need to have a sense of what variables would obey it. The answer is: 'normally' distributed variables. What variables are normally distributed? One of the most famous theorems of mathematics is the Central Limit Theorem. It says that:

If a variable is generated as
  1. the sum of
  2. a relatively large number
  3. of relatively independent components
  4. none of which makes a relatively large contribution to the whole.
Then the variable will have a distribution that is close to normal.

Many natural variables obey the CLT much of the time, with rare exceptions. For example, human heights for a group of people of the same sex and same racial group may be normally distributed if we exclude rare individuals whose unusual height arises as the result of a genetic syndrome. Some variables, like income, do not have a normal distribution, perhaps because factors have more of a multiplicative effect or some factors have a much larger contribution than others.
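
A small simulation can make the connection between the Central Limit Theorem and the empirical rule concrete. The sketch below is an illustration under assumptions that are not part of the course material: each variable is built as the sum of 30 independent components drawn uniformly between 0 and 1, and the seed and sample size are arbitrary.

```python
import random
import statistics

def simulate_sums(n_components=30, n_draws=20000, seed=1532):
    """Each draw is the sum of many small independent components."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_components)) for _ in range(n_draws)]

def share_within(values, k):
    """Proportion of values within k standard deviations of their mean."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum(1 for v in values if abs(v - m) <= k * s) / len(values)

sums = simulate_sums()
within_1 = share_within(sums, 1)
within_2 = share_within(sums, 2)
print(f"within 1 SD: {within_1:.3f}, within 2 SD: {within_2:.3f}")
```

The two proportions should come out close to the 68% and 95% of the empirical rule, even though no single uniform component is anywhere near normal.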

z-scores

Suppose you know the value of a variable X; for example, suppose your height is 62 inches. To calculate the z-score, subtract the mean (population or sample) and then divide by the standard deviation (population or sample).

The general formula is

Z=\frac{X-\mu}{\sigma}

If the population mean height is μ = 68 inches and the variance is σ² = 16 squared inches, then your z-score is:

Z=\frac{X-\mu}{\sigma}=\frac{62-68}{\sqrt{16}}=-1.5

Be careful to pay attention to the difference between variance and standard deviation -- which is equal to the square root of variance. You need the standard deviation in the formula for the z-score.
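
The same calculation can be sketched in Python, using the numbers that appear in the formula above (X = 62, μ = 68, σ² = 16). The function name is made up for the illustration; it takes the variance as input on purpose, to make the square-root step explicit.

```python
import math

def z_score(x, mean, variance):
    """Standardize x: subtract the mean, divide by the standard deviation."""
    sd = math.sqrt(variance)   # the formula needs the standard deviation, not the variance
    return (x - mean) / sd

z = z_score(62, 68, 16)
print(z)   # (62 - 68) / 4
```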

Correlation

If the relationship between two continuous variables, X and Y, is approximately linear, then the correlation measures how strongly they are associated with each other. If the correlation is +1 or -1, then Y can be computed exactly from X (and vice versa). If the correlation is 0, then X provides no information about the likely value of Y (and vice versa).

To estimate the correlation from a scatterplot in which the variables are roughly linearly related and the points form a cloud, you can picture a data ellipse that follows the shape of the cloud and calculate the correlation as shown in this figure:

Predicting Y from X using the correlation

With correlations different from 0, we can use knowledge of X to make a better guess of the likely value of Y (and vice-versa). This is most easily done with z-scores.

Suppose we know that X and Y are approximately linearly related and we know their correlation and their means and standard deviations. If you know the value of X, you can use that value to make a better 'prediction' (guess) of the corresponding value of Y.

Use the following steps:

  1. Calculate the z-score for X: Z_X = (X - \mu_X) / \sigma_X
  2. Calculate the predicted z-score for Y: \hat{Z}_Y = \rho Z_X
  3. Calculate the predicted value of Y: \hat{Y} = (\hat{Z}_Y \times \sigma_Y) + \mu_Y
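
The three steps can be sketched as a short Python function. The numbers in the usage example are hypothetical and chosen only to keep the arithmetic easy: X is height in inches (mean 68, standard deviation 4), Y is weight in pounds (mean 150, standard deviation 20), and the correlation is assumed to be 0.5.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, rho):
    """Predict Y from X via z-scores, following the three steps above."""
    z_x = (x - mean_x) / sd_x        # step 1: z-score for X
    z_y_hat = rho * z_x              # step 2: predicted z-score for Y
    return z_y_hat * sd_y + mean_y   # step 3: back to the units of Y

# Hypothetical values, for illustration only.
y_hat = predict_y(x=62, mean_x=68, sd_x=4, mean_y=150, sd_y=20, rho=0.5)
print(y_hat)
```

Note that the predicted z-score for Y (−0.75 here) is closer to 0 than the z-score for X (−1.5) whenever the correlation is between −1 and +1: the prediction is pulled toward the mean of Y.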


Interpreting skewness and kurtosis in Excel

Files

Files for MATH 1532 (http://www.math.yorku.ca/people/georges/Files/MATH1532)

Links

MATH 1532 links