MATH 6627 2007-08


News

Speakers: Richard Cook and Jerry Lawless.
Held at the Department of Public Health Sciences, University of Toronto, 155 College Street, Room HS 610 (6th floor)
May 15, 2008, 8:30am – 5:00pm
  • SSC Case Studies (http://www.ssc.ca/documents/case_studies/2008/index_e.html)
Presentations in Ottawa May 25-28, 2008

General Information

Instructor

Meetings

The class will meet roughly every second week on Mondays from 2:30 to 5:30 in TEL 0004. Consult the schedule below for exact dates.

Goals

As undergraduates we learn statistics through a sequence of courses, each focusing on some part of statistical theory. When we solve problems in these courses, the tools we are expected to use are obvious. Real-world statistical problems are different: they rarely come with clear clues about which theory or method should be used.

In fact, many problems are best handled with eclectic solutions borrowing from many statistical fields. The goal of this course is to help you develop the skills and confidence to solve real-world problems. You will learn about the key role of many statistical concepts that are rarely seen in standard courses. You will also learn the vital role of visualization and graphics, communication (listening even more than talking) and presentation skills.

The course will help you develop skills in a number of areas:

  1. programming and data management skills in R: Although the emphasis in this course is entirely on R, many jobs expect a strong knowledge of SAS -- take every opportunity you can to also learn SAS
  2. graphics to visualize data and models
  3. how to work as a statistical consultant/collaborator in the analysis of scientific problems
  4. developing presentation skills
  5. developing an understanding of the role of statistics as a discipline and as a profession in science and business
  6. understanding ethical issues related to statistical practice

Text and references

  • Text: Philip I. Good and James W. Hardin (2006) Common Errors in Statistics (and How to Avoid Them), Wiley-Interscience, Hoboken, N.J. Steacie QA 276 G586 2006 (http://theta.library.yorku.ca/uhtbin/cgisirsi/IIhxfvdWGS/YORK/292410312/9). Available at the York University bookstore for $58.
  • References (on reserve at Steacie Science Library):
    • Javier Cabrera and Andrew McDougall (2001) Statistical Consulting, Springer-Verlag, N.Y. Steacie Reserves HA 29 C227 2002 (http://theta.library.yorku.ca/uhtbin/cgisirsi/x22BBAABe7/YORK/292410312/8/1953602)
    • Janice Derr (2000) Statistical consulting: a guide to effective communication, Duxbury Steacie Reserves HA 29 D386 2000 Book and CD-ROM (http://theta.library.yorku.ca/uhtbin/cgisirsi/2kBAfAvWtU/YORK/292410312/8/1727556)
  • Other references: MATH 6627: References (we will build up the list during the year).

Course Work

  • In the first term the work for the course consists primarily of statistical analyses, consulting reports and presentations done in groups.
  • Since most of the interesting consulting problems I have been involved with in recent years have required an understanding of mixed models, we will spend a few weeks on basic concepts in mixed models.
  • We will work through the text Common Errors ... and different groups will prepare presentations including example data sets and a discussion of implementation in R along with appropriate R functions if necessary.
  • In the second term, you will work on a major consultation project in which you will collaborate with a client to produce a deep and probing consulting report. The project is likely to involve mixed models.
  • You will also attend some real statistical consultations and prepare brief reports.
  • Another important part of the course work is your contribution to the wiki. In particular, we will develop two types of information on the wiki:
    • how to's in R: these are brief articles describing how to do something simple in R, either a graph, an analysis or a type of data manipulation.
    • Paradoxes and fallacies in statistics: As your knowledge of statistics becomes deeper you abandon many simple suppositions and replace them with more sophisticated ones. An important part of communication between statisticians and clients -- for that matter between statisticians and the public or between statisticians and students -- involves understanding simple, often fallacious, suppositions and how they can lead to a deeper understanding. We will develop wiki pages to discuss important paradoxes and fallacies.

Grading

Each project or assignment receives an overall grade out of 100. This grade is attributed to each participant. Assignments are worth 50% of the final grade, the major project is worth 25% and individual contributions are worth 25%.
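
As a worked illustration of the weighting above, the final grade is a weighted average of the three components. The sketch below is in Python rather than R only to keep it self-contained. The weights come from the grading scheme; the component scores (80, 90 and 70) and the function name are hypothetical.

```python
# Weights from the grading scheme above; the scores below are hypothetical.
WEIGHTS = {"assignments": 0.50, "major_project": 0.25, "individual": 0.25}


def final_grade(scores):
    """Weighted average of component scores, each out of 100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example_scores = {"assignments": 80, "major_project": 90, "individual": 70}
print(final_grade(example_scores))  # 0.5*80 + 0.25*90 + 0.25*70
```

Because assignment grades are attributed to every participant in a group, only the individual contribution component distinguishes members of the same group.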

Class list and teams

Class photo, names, e-mail addresses and assignment to teams can be found at http://www.math.yorku.ca/~georges/Courses/6627. Note that a userid and password are needed to access this page.

Schedule

Week 1: September 10, 2007

Topics
Course organization
Participation in SCS seminars
You are welcome to attend SCS (http://www.isr.yorku.ca/scs/index.html) (Statistical Consulting Service) weekly meetings which consist of bi-weekly 'staff meetings' and bi-weekly seminars on a statistical topic of interest to statistical consultants. The exact topic for this year will be determined in two weeks. Meetings take place every Friday at 2:30 in TEL 5082. Please send an e-mail message to Georges Monette (mailto:georges@yorku.ca) to have your name added to the SCS mailing list. Note that SCS also offers short courses some of which might be of interest to you.
Consulting, communication, writing reports
Statistical consulting environment
Writing reports: Secret of good writing: write so your reader understands you!
Notes on writing reports
Seven basic principles
Not all consulting activities require a formal report. Often a phone call, a verbal report in a face-to-face meeting, a letter or a memo is the most efficient way of communicating with a client.
Communication:
Interpersonal aspects of statistical consulting: Janice Derr, Statistical Consulting Video
Contributions by Doug Zahn
The role of statistics in society -- understanding evidence.
One of the greatest challenges in understanding evidence is bridging the gap between observational data and causal inference, i.e. understanding the links between statistical significance and statistical meaning.
Statistics in the news: Talk (http://www.math.yorku.ca/~georges/Files/TalkHS2006.pdf)
The Fundamental Contingency Table of Statistics
Types of Inference | Types of Data: Experimental      | Types of Data: Observational
Causal             | Where Fisher would like us to be | Where we often are
Predictive         | Very rare but problematic        | Okay: this is the topic of Frank Harrell's Regression Modeling Strategies
Finding meaning in observational data -- examples
Hans Rosling: Myths about the developing world (http://video.google.com/videoplay?docid=4237353244338529080)
Al Gore: An Inconvenient Truth (http://www.imdb.com/title/tt0497116/)
Andrey Feuerverger: The Lost Tomb of Jesus (http://www.imdb.com/title/tt0974593/)
Software
A working statistician should be proficient with at least SAS and R. This course uses R. A good consultant should also be familiar with packages that are likely to be used by clients, e.g. SPSS.
Getting started with R
Using a wiki for group assignments
Editing hints for course assignments
Assignments and things to do
Wiki: Get started with the wiki. You'll need to use it for the assignment, but start by logging in and providing some information about yourself on your user page.
Assignment: Assignment 1: write answers on wiki and prepare to present on Sept. 24
Prac07 Hox: q. 1, 5, 9, 14
Prac07 Jaynes: q. 2, 6, 10
Prac07 Jeffreys: q. 3, 7, 11
Prac07 Moser: q. 4, 8, 12, 13
Readings: Common errors: Read Chapter 1 and be prepared to discuss on Sept. 24
Software: Get started with R (Getting started with R). Attend the special session on Sept. 17 if you need to know the basics of R.
Class photo: See the class photo at the course web page (http://www.math.yorku.ca/~georges/Courses/6627#Photo) and enter your name for the caption.
Coming up soon: Start thinking about the next assignment: you will need to find a current or recent topic in the news that involves, explicitly or implicitly, an interesting statistical issue. You will prepare an analysis of the topic together with a review of scientific evidence. Are there gaps between the science and the public presentation of the topic?

Week 1.5: September 17, 2007

Topic
This is an optional tutorial on the use of R for those who have had little or no experience with R. Before the tutorial, be sure to have downloaded R and started covering some of the material in R: Getting started#First tutorials. If you have a laptop, install R on it and bring it to the class.
In this tutorial we will work through:
  1. the sample session in Venables and Ripley (2002) [1] (http://wiki.math.yorku.ca/index.php/VR4:_Chapter_1_summary) and
  2. the tutorial by John Fox prepared for a short course at UCLA: http://socserv.mcmaster.ca/jfox/Courses/UCLA/index.html

To continue learning R:

  1. Work through http://cran.r-project.org/doc/manuals/R-intro.html, also available as a pdf file through the help menu on the R console.
  2. Highly recommended for learning R systematically: work through the on-line textbook by J. H. Maindonald at http://wiki.math.yorku.ca/index.php/R:_Getting_started#Exploring_much_more_deeply

Week 2: September 24, 2007

Topics
  1. Statistics Canada: Careers for Mathematical Statisticians (http://www.statcan.ca/english/employment/ma/ma.htm).
  2. Discussion of Common Errors chapter 1.
  3. Observational vs Experimental data
  4. Making the best of observational data
  5. Visualizing multiple regression
Presentations
Assignment 1 will be presented by each group. Plan to take no more than 7 minutes per group.
Assignments and things to do
Group assignment (wiki pages: Prac07 Jeffreys Assignment 2, Prac07 Jaynes Assignment 2, Prac07 Moser Assignment 2)
  1. Find a current or recent topic in the news that involves, explicitly or implicitly, an interesting statistical issue. Prepare an analysis of the topic together with a review of scientific evidence. Are there gaps between the science and the public presentation of the topic?
  2. Find -- or think of -- an example in which you would expect Simpson's Paradox to lead to a paradoxical association between two variables. Can this lead to misinterpretation of the relationship among the variables? Discuss.
  3. Do the same for association in the 'wrong' direction due to selection.
Prepare a 7-minute presentation on your results. You can save your work on the wiki in files whose names begin with [[Prac07 Your.Group.Name Assignment 2]]
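
For item 2 above, a small numerical sketch may help fix ideas before you look for a real example. The counts below are invented purely for illustration and do not come from any study. Within each stratum the treated group recovers more often (0.9 vs 0.8 and 0.4 vs 0.3), yet after pooling the strata the treated group recovers less often (0.5 vs 0.7), because treated cases are concentrated in the severe stratum.

```python
# Hypothetical counts (recovered, total), invented purely for illustration.
counts = {
    "mild cases": {"treated": (18, 20), "untreated": (64, 80)},
    "severe cases": {"treated": (32, 80), "untreated": (6, 20)},
}


def rate(recovered, total):
    """Proportion recovered."""
    return recovered / total


def pooled_rate(arm):
    """Recovery rate for one arm after pooling over all strata."""
    recovered = sum(counts[stratum][arm][0] for stratum in counts)
    total = sum(counts[stratum][arm][1] for stratum in counts)
    return recovered / total


for stratum, arms in counts.items():
    print(stratum, {arm: rate(*pair) for arm, pair in arms.items()})
print("pooled", {arm: pooled_rate(arm) for arm in ("treated", "untreated")})
```

Whether the stratified or the pooled comparison is the relevant one depends on causal assumptions about the stratifying variable, which is the point of the discussion asked for above.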
Individual assignment
  1. Look at the data set http://www.math.yorku.ca/~georges/Data/coffee.csv. It has three relevant variables: 'Heart', a measure of heart condition -- the higher the less healthy; 'Coffee', a measure of coffee consumption; and 'Stress', a measure of occupational stress. How could you use this data to address the question of whether coffee consumption is harmful to the heart? Discuss the assumptions needed to get anywhere with the data and the nature of various assumptions that might lead to different interpretations, if relevant.
  2. Look at the data set http://www.math.yorku.ca/~georges/Data/hwX.csv where X is the remainder when you divide your 'class number' (the number from 1 to 20 on the class list on the web) by 4. Thus X will be 0, 1, 2, or 3. The data set contains three variables: Health (the higher the better), Height and Weight. All are in standardized units. What does this data set have to say about the relationship between Weight and Health? Discuss the assumptions needed to get anywhere with the data and the nature of various assumptions that might lead to different interpretations, if relevant.
These individual assignments should be sent to the instructor by email (Word attachments or text files are okay) before noon on Sunday, October 14th.
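
To find which hwX.csv file applies to you, take your class number modulo 4. The sketch below assumes only the URL pattern given in item 2; the function name and the class number 7 are hypothetical.

```python
BASE_URL = "http://www.math.yorku.ca/~georges/Data/"


def data_set_url(class_number):
    """URL of the hwX.csv file for a class number between 1 and 20."""
    if not 1 <= class_number <= 20:
        raise ValueError("class number must be between 1 and 20")
    x = class_number % 4  # remainder on division by 4: 0, 1, 2 or 3
    return f"{BASE_URL}hw{x}.csv"


print(data_set_url(7))  # 7 = 1*4 + 3, so X = 3
```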
Readings
  1. Read Common Errors chapter 2
  2. Review your textbooks on multiple regression. What is a confidence ellipse? What is its connection with hypothesis testing? What is a Scheffé confidence interval? What is a Bonferroni confidence interval?
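
As a starting point for this review, the two kinds of simultaneous interval can be written side by side. This is the standard textbook form, not something taken from the course notes, and the notation is assumed here: $\hat\psi = c'\hat\beta$ is an estimated linear combination of regression coefficients, $\mathrm{SE}(\hat\psi)$ its standard error, $\nu$ the residual degrees of freedom, $m$ the number of intervals chosen in advance, and $d$ the dimension of the space of linear combinations under consideration.

```latex
% Bonferroni: m intervals chosen in advance, joint coverage at least 1 - \alpha
\hat\psi_j \pm t_{\nu,\,1-\alpha/(2m)}\,\mathrm{SE}(\hat\psi_j), \qquad j = 1,\dots,m

% Scheffe: all linear combinations in a d-dimensional space, joint coverage 1 - \alpha
\hat\psi \pm \sqrt{d\,F_{d,\nu;\,1-\alpha}}\;\mathrm{SE}(\hat\psi)
```

With $m = 1$ and $d = 1$ both reduce to the usual $t$ interval, since $t_\nu^2$ has the $F_{1,\nu}$ distribution. The Scheffé interval for a given combination is the projection of the joint confidence ellipsoid onto that direction, which is where the connection with confidence ellipses and hypothesis testing comes in.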

Week 3: October 15, 2007

Topics
A review of multiple regression:
  1. The bivariate normal and its contours: iso-density or contour ellipse. script: visualizing normal contour ellipses (http://www.math.yorku.ca/~georges/R/MultivariateNormalContours.R)
  2. The dispersion ellipse [also known as a variance, data or deviation ellipse]:
    1. Correlation and regression, regression to the mean Visualizing Regression pp. 1- (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
    2. Further properties: Deviation Ellipse and Precision Ellipse (http://www.math.yorku.ca/~georges/Slides/Deviation_and_Inverse_Ellipses.pdf).
    3. Yet more properties: Statistics: Ellipses
    4. Confidence ellipses and intervals in regression: Statistics: Ellipses of regression
    5. Even more properties: Statistics: Geometry of the ellipse
    6. Some older material on the topic: Visualizing Regression in 3D (http://www.math.yorku.ca/~georges/Files/VisualizingRegression.pdf)
  3. Simple and Multiple regression:
    1. Simple vs. multiple regression: R script: Visualizing Multiple Regression Part 1 (http://www.math.yorku.ca/~georges/R/VisualizingMultipleRegressionPart1.R)
      1. Interpreting coefficients
      2. Relationship between unconditional vs conditional effects in regression, Simpson's Paradox with numerical predictors
      3. Criteria for selecting models: role of causal assumptions.
    2. Measurement error in regression: R script: Visualizing Multiple Regression Part 2 (http://www.math.yorku.ca/~georges/R/VisualizingMultipleRegressionPart2.R) (incomplete)
      1. The measurement error paradox: Why it can be more important to measure the variables you're not directly interested in.
        1. Two continuous variables Visualizing Regression in 3D pp. 35-52 (http://www.math.yorku.ca/~georges/Files/VisualizingRegression.pdf)
        2. One continuous and one categorical variable: Visualizing Regression pp. 104ff (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
    3. Review: regression diagnostics with simple regression:
      1. Notes
      2. Visualizing Regression pp. 99ff: Influence and Leverage (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
      3. With multiple regression:
    4. R script: Visualizing Multiple Regression Part 3 (http://www.math.yorku.ca/~georges/R/VisualizingMultipleRegressionPart3.R)
      1. Why forward stepwise might not work
      2. Residual plots: the old, the newer and the newest: what do they do?
      3. Outliers: why the good ones can be worse than the bad ones? Understanding what they do and how to find them.
      4. An example Visualizing Regression pp. 139ff: Added-variable plot: pay equity in a large law firm (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
      5. Some diagnostics for multiple regression R script: multiple regression and diagnostics with car (http://www.math.yorku.ca/~georges/R/MultipleRegressionAndDiagnostics.R)
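
The contour ellipses in item 1 can be generated by mapping the unit circle through a square root of the covariance matrix. The sketch below is not the linked R script; it is a minimal Python illustration using a Cholesky factor, and the parameter values (unit variances, correlation 0.6) and all function names are hypothetical.

```python
import math


def cholesky_2x2(s11, s12, s22):
    """Lower-triangular A with A A' = [[s11, s12], [s12, s22]]."""
    a11 = math.sqrt(s11)
    a21 = s12 / a11
    a22 = math.sqrt(s22 - a21 ** 2)
    return a11, a21, a22


def contour_ellipse(mean, s11, s12, s22, radius=1.0, n=100):
    """Points x = mean + radius * A u, with u running around the unit circle."""
    a11, a21, a22 = cholesky_2x2(s11, s12, s22)
    points = []
    for k in range(n):
        theta = 2 * math.pi * k / n
        u1, u2 = math.cos(theta), math.sin(theta)
        points.append((mean[0] + radius * a11 * u1,
                       mean[1] + radius * (a21 * u1 + a22 * u2)))
    return points


def mahalanobis_sq(point, mean, s11, s12, s22):
    """Squared Mahalanobis distance (x - mean)' Sigma^{-1} (x - mean)."""
    d1, d2 = point[0] - mean[0], point[1] - mean[1]
    det = s11 * s22 - s12 ** 2
    return (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det


# Hypothetical parameters: unit variances and correlation 0.6
ellipse = contour_ellipse(mean=(0.0, 0.0), s11=1.0, s12=0.6, s22=1.0)
```

Every generated point lies at Mahalanobis distance `radius` from the mean, so the points trace one iso-density contour. With these parameters the first point is (1, 0.6), where the ellipse has a vertical tangent; the regression line of the second variable on the first passes through that point, which is one way to see regression to the mean.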
Presentations
We will finish presentations of Assignment 1 and will do presentations on assignment 2.
Assignment 3, due with presentation November 19
Prepare notes in wiki files on the material seen in class today, combining graphics (e.g. using rgl.snapshot to capture 3D views), text and mathematical equations. You may use many wiki files for this, but remember to start the name of each file with 'Prac07 Your.Group.Name' to ensure that the file names are unique. Later we will restructure the notes into documents with better names. Prepare a 7-minute overview of the material (although the material itself would possibly take much longer than 7 minutes to cover).
Prac07 Pearson Assignment 3: Coffee, Stress and Health example
Prac07 Quetelet Assignment 3: Measurement error in Stress: its effect on the estimate of the effect of Stress and Coffee
Prac07 Rao Assignment 3: Health, Height and Weight with no outliers
Prac07 Robbins Assignment 3: Health, Height and Weight with outliers

Week 4: October 22, 2007

Continuation of October 15.

Week 5: November 5, 2007

Week 6: November 19, 2007

Assignment 2 Presentations

Statistics in Society

Statistics plays a major role in many public controversies. Recent examples that have received a lot of publicity include topics as diverse as the allegations that lottery ticket vendors in Ontario are defrauding lottery winners and the claim that a tomb has been discovered containing the remains of the Christian figure Jesus. The current scandal concerning incompetent pathology reports that led to many murder convictions of innocent people leads us to pause and consider whether or how statisticians might mislead jurors or public opinion through misapplications of statistical principles. There is an excellent passage in a TED Talk by Peter Donnelly at http://tedblog.typepad.com/tedblog/2006/11/statistician_pe.html. The portion on the role of statistical evidence in the conviction of Sally Clark for murdering two of her children, who are now thought to have died of sudden infant death syndrome, starts around the 13 min. mark. The statistical evidence was given by a pediatrician. Donnelly seems to suggest that a statistician would have done a much better job of testifying in this case. Consider whether it would be possible to provide clear guidance to a jury using only frequentist methods in a case such as Sally Clark's.

There are other examples in which statistics played a crucial -- sometimes insufficiently recognized -- role (add to this list):

  • The case of Susan Nelles at the Hospital for Sick Kids in Toronto in 1981. Nelles was accused of murdering at least 4 infants partly on the basis of statistical evidence. The case fell apart in a preliminary inquiry but the controversy has never been resolved. It is possible that it could be clarified with a better understanding of the statistical evidence.

Week 7: January 7, 2008

Statistical practice for regression

  • Modeling: substantive and statistical issues
  • Diagnostics: added variable plots, studentized residual vs leverage plots, transformations: Box-Cox, CERES plots
  • Visualization: partial residual plots, etc.
  • Alternative models: VIFs, principal components, etc.
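
As a small illustration of the VIFs mentioned above: with only two predictors, the variance inflation factor is 1 / (1 - r^2), where r is the correlation between them, and it is the same for both predictors. The sketch below is plain Python rather than the R used in class, and the standardized Height and Weight values are invented for illustration.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - r^2); with two predictors it is the same for both."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical standardized Height and Weight values, invented for illustration
height = [-1.2, -0.5, 0.0, 0.4, 1.3]
weight = [-1.0, -0.7, 0.2, 0.3, 1.2]
vif = vif_two_predictors(height, weight)
print(round(vif, 2), round(math.sqrt(vif), 2))
```

The square root of the VIF is the factor by which a coefficient's standard error is inflated relative to what it would be if the predictors were uncorrelated, other things being equal. With more than two predictors, r^2 is replaced by the R^2 from regressing each predictor on all the others.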

Week 8: January 21, 2008

Longitudinal data analysis with mixed models

Assignment

This is an individual assignment to be done with the help of your group as assigned for assignment 3. The individuals in each group will address different aspects of the analysis of the high school math achievement data set used by Bryk and Raudenbush available at http://www.math.yorku.ca/~georges/Data/hsfull.csv.

Description of variables:
ID: student identification number (for the study)
School: school identification number
Minority: an indicator for student ethnicity (Yes = member of a minority, No = not a member of a minority)
Sex: an indicator for student gender (Female and Male)
SES: the student's socio-economic status based on a standardized scale constructed from variables measuring parental education, occupation, and income
MathAch: a measure of the student's mathematics achievement (based on a mathematics test in the senior year)
Size: school enrollment
Sector: school sector: Public and Catholic
PRACAD: proportion of students in the academic track at the school
DISCLIM: a scale measuring disciplinary climate at the school
HIMINTY: an indicator of school enrollment ethnicity: 1 = more than 40% minority enrollment, 0 = less than or equal to 40%

Each member of the group thus works on a slightly different aspect of the problem.

Use the alphabetical ordering of the last names of the members of your group to decide who tackles each of the following questions.

Questions
1) Explore the relationship between gender and math achievement. Do girls do better at all-girls schools than at coed schools? Is the relationship between SES and math achievement the same among girls and boys?
2) Explore the role of SES, in particular to what extent does it seem to be the child's SES and to what extent is it the school's SES that is related to math achievement? Would it be desirable to send a low SES child to a high SES school? Would they be expected to do better or not? Specify relevant cautions in coming to causal conclusions based on your analysis.
3) Explore the relationship of minority status and math achievement. Is minority status related to math achievement in the same way regardless of gender and SES? What is the role of the composition of the school versus the status of the individual?
4) Explore in detail the differences between the two sectors. For what ranges of SES do Catholic schools appear to reach higher math achievement? Are there other possible explanations for this apparent phenomenon (e.g. curvilinearity in the relationship between SES and math achievement)?
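
For question 2, one common way to separate the child's SES from the school's SES is to split each student's SES into the school mean and the student's deviation from that mean. The sketch below shows only this data step, in Python for self-containment; it is one possible approach, not a prescribed analysis. The column names School, SES and MathAch follow the variable list above, while the data values and the derived names SES_school and SES_dev are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up extract using the column names from the variable list above;
# the values are invented and are not taken from hsfull.csv.
SAMPLE = """School,SES,MathAch
A,-0.5,8.0
A,0.1,12.0
A,0.4,13.0
B,0.6,15.0
B,1.0,18.0
"""


def add_school_ses(rows):
    """Add the school mean SES and each student's deviation from it."""
    by_school = defaultdict(list)
    for row in rows:
        by_school[row["School"]].append(float(row["SES"]))
    means = {school: sum(v) / len(v) for school, v in by_school.items()}
    for row in rows:
        row["SES_school"] = means[row["School"]]
        row["SES_dev"] = float(row["SES"]) - row["SES_school"]
    return rows


rows = add_school_ses(list(csv.DictReader(io.StringIO(SAMPLE))))
for row in rows:
    print(row["School"], round(row["SES_school"], 3), round(row["SES_dev"], 3))
```

Entering both derived variables as predictors allows the within-school and between-school relationships with math achievement to differ. Any causal reading of the between-school coefficient needs the cautions asked for in question 2.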
Guidelines
Make sure you address specific questions. Go beyond general statements based on default regression output. Use graphs effectively to illustrate your analyses. When using graphs comment on which aspects of the graph reflect significant effects and which do not.
Tutorial 
Next week, January 28, the class time (2:30 to 5:30) will be an optional tutorial session for anyone who has questions. You may, of course, contact me at other times.
Deadline
The deadline for sending your analyses is in two weeks plus one day (to give you the chance to ask final questions at the class in two weeks). Please send your analyses by e-mail by 11:59pm on Tuesday, February 19.

Week 9 etc.: TBA

From the tutorial on January 28: MATH 6627: 2008 Jan 28 R script

Week 10: Non-linear longitudinal models

  • Non-linear longitudinal models[3] (http://www.math.yorku.ca/~georges/Slides/TalkOnComasAndMigraines-v2008-06.pdf)

Week 11: Review of longitudinal models

Week 12: Practical and Ethical Issues in Statistical Consulting

Statistical Society of Canada:

Links

New York Times, April 8, 1984