MATH 6627 2007-08
From MathWiki
 Quick links
 Course web page with class list and teams http://www.math.yorku.ca/~georges/Courses/6627
 Prac07 Class Photo Caption
 http://www.amstat.org/sections/cnsl/index.html
 An R Primer http://www.stat.washington.edu/cggreen/rprimer/
News
 Workshop on Event History Analysis (http://www.math.yorku.ca/Who/Faculty/Ng/SORAworkshop2008/workshop2008.html)
 Speakers: Richard Cook and Jerry Lawless.
 Held at the Department of Public Health Sciences, University of Toronto 155 College Street, Room HS 610 (6th floor)
 May 15, 2008, 8:30am – 5:00pm
 SSC Case Studies (http://www.ssc.ca/documents/case_studies/2008/index_e.html)
 Presentations in Ottawa May 25-28, 2008
General Information
Instructor
 Georges Monette, Ph.D., P.Stat. (http://www.ssc.ca/accreditation/index_e.html)
 N626 Ross
 mailto:georges@yorku.ca
 http://www.math.yorku.ca/~georges
 Office hours: Mondays 5:30 to 6:30 pm
Meetings
The class will meet roughly every second week on Mondays from 2:30 to 5:30 in TEL 0004. Consult the schedule below for exact dates.
Goals
As undergraduates we learn statistics through a sequence of courses, each focusing on some part of statistical theory. When we solve problems in these courses, the tools we are expected to use are obvious. When you have to solve real-world statistical problems, there are rarely clear clues about the correct theory or method to use.
In fact, many problems are best handled with eclectic solutions borrowing from many statistical fields. The goal of this course is to help you develop the skills and confidence to solve real-world problems. You will learn about the key role of many statistical concepts that are rarely seen in standard courses. You will also learn the vital role of visualization and graphics, communication (listening even more than talking) and presentation skills.
The course will help you develop skills in a number of areas:
 programming and data management skills in R: although the emphasis in this course is entirely on R, many jobs expect a strong knowledge of SAS, so take every opportunity you can to learn SAS as well
 graphics to visualize data and models
 how to work as a statistical consultant/collaborator in the analysis of scientific problems
 developing presentation skills
 developing an understanding of the role of statistics as a discipline and as a profession in science and business
 understanding ethical issues related to statistical practice
Text and references
 Text: Philip I. Good and James W. Hardin (2006) Common Errors in Statistics (and How to Avoid Them), Wiley-Interscience, Hoboken, N.J. Steacie QA 276 G586 2006 (http://theta.library.yorku.ca/uhtbin/cgisirsi/IIhxfvdWGS/YORK/292410312/9). Available in the York University bookstore for $58.
 References (on reserve at Steacie Science Library):
 Javier Cabrera and Andrew McDougall (2001) Statistical Consulting, Springer-Verlag, N.Y. Steacie Reserves HA 29 C227 2002 (http://theta.library.yorku.ca/uhtbin/cgisirsi/x22BBAABe7/YORK/292410312/8/1953602)
 Janice Derr (2000) Statistical consulting: a guide to effective communication, Duxbury. Steacie Reserves HA 29 D386 2000 Book and CD-ROM (http://theta.library.yorku.ca/uhtbin/cgisirsi/2kBAfAvWtU/YORK/292410312/8/1727556)
 Other references: MATH 6627: References (we will build up the list during the year).
Course Work
 In the first term the work for the course consists primarily of statistical analyses, consulting reports and presentations done in groups.
 Since most of the interesting consulting problems I have been involved with in recent years have required an understanding of mixed models, we will spend a few weeks on basic concepts in mixed models.
 We will work through the text Common Errors ... and different groups will prepare presentations including example data sets and a discussion of implementation in R along with appropriate R functions if necessary.
 In the second term, you will work on a major consultation project in which you will collaborate with a client to produce a deep and probing consulting report. The project is likely to involve mixed models.
 You will also attend some real statistical consultations and prepare brief reports.
 Another important part of the course work is your contribution to the wiki. In particular, we will develop two types of information on the wiki:
 How-to's in R: these are brief articles describing how to do something simple in R, whether a graph, an analysis or a type of data manipulation.
 Paradoxes and fallacies in statistics: As your knowledge of statistics becomes deeper, you abandon many simple suppositions and replace them with more sophisticated ones. An important part of communication between statisticians and clients (and, for that matter, between statisticians and the public or between statisticians and students) involves understanding simple, often fallacious, suppositions and how they can lead to a deeper understanding. We will develop wiki pages to discuss important paradoxes and fallacies.
Grading
 Each project or assignment receives an overall grade out of 100. This grade is attributed to each participant. Assignments are worth 50% of the final grade, the major project is worth 25% and individual contributions are worth 25%.
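 The weighting above amounts to a weighted average of three component marks. A minimal sketch follows, written in Python purely for illustration (the course itself uses R). The function name and the sample marks are hypothetical, and it assumes each component has already been reduced to a single mark out of 100; how individual assignments are averaged is not specified here.

```python
def final_grade(assignments, project, individual):
    """Combine three component marks (each out of 100) using the stated weights."""
    # Weights as stated in the Grading section: 50% / 25% / 25%
    weights = {"assignments": 0.50, "project": 0.25, "individual": 0.25}
    return (weights["assignments"] * assignments
            + weights["project"] * project
            + weights["individual"] * individual)


# Hypothetical marks, not real course data
print(final_grade(assignments=80, project=90, individual=70))
```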
Class list and teams
 Class photo, names, email addresses and assignment to teams can be found at http://www.math.yorku.ca/~georges/Courses/6627. Note that a userid and password are needed to access this page.
Schedule
Week 1: September 10, 2007
 Topics

 Course organization
 Participation in SCS seminars
 You are welcome to attend SCS (http://www.isr.yorku.ca/scs/index.html) (Statistical Consulting Service) weekly meetings which consist of biweekly 'staff meetings' and biweekly seminars on a statistical topic of interest to statistical consultants. The exact topic for this year will be determined in two weeks. Meetings take place every Friday at 2:30 in TEL 5082. Please send an email message to Georges Monette (mailto:georges@yorku.ca) to have your name added to the SCS mailing list. Note that SCS also offers short courses some of which might be of interest to you.
 Consulting, communication, writing reports
 Writing reports: Secret of good writing: write so your reader understands you!
 Notes on writing reports
 Seven basic principles
 Not all consulting activities require a formal report. Often a phone call, a verbal report in a face-to-face meeting, a letter or a memo is the most efficient way of communicating with a client.
 Communication:
 Interpersonal aspects of statistical consulting: Janice Derr, Statistical Consulting Video
 Contributions by Doug Zahn
 The role of statistics in society: understanding evidence.
 One of the greatest challenges in understanding evidence is bridging the gap between observational data and causal inference, i.e. understanding the links between statistical significance and statistical meaning.
 Statistics in the news: Talk (http://www.math.yorku.ca/~georges/Files/TalkHS2006.pdf)
 Types of inference by type of data:

                 Experimental data                   Observational data
 Causal          Where Fisher would like us to be    Where we often are
 Predictive      Very rare but problematic           Okay: this is the topic of Frank Harrell's 'Regression Modeling Strategies'
 Finding meaning in observational data  examples
 Hans Rosling: Myths about the developing world (http://video.google.com/videoplay?docid=4237353244338529080)
 Al Gore: An Inconvenient Truth (http://www.imdb.com/title/tt0497116/)
 Andrey Feuerverger: The Lost Tomb of Jesus (http://www.imdb.com/title/tt0974593/)
 Software

 A working statistician should be proficient with at least SAS and R. This course uses R. A good consultant should also be familiar with packages that are likely to be used by clients, e.g. SPSS.
 Getting started with R
 Using a wiki for group assignments
 Assignments and things to do
 Wiki: Get started with the wiki. You'll need to use it for the assignment, but start by logging in and providing some information about yourself on your user page.
 Assignment: Assignment 1: write answers on wiki and prepare to present on Sept. 24
 Prac07 Hox: q. 1, 5, 9, 14
 Prac07 Jaynes: q. 2, 6, 10
 Prac07 Jeffreys: q. 3, 7, 11
 Prac07 Moser: q. 4, 8, 12, 13
 Readings: Common errors: Read Chapter 1 and be prepared to discuss on Sept. 24
 Software: Get started with R (Getting started with R). Attend the special session on Sept. 17 if you need to know the basics of R.
 Class photo: See the class photo at the course web page (http://www.math.yorku.ca/~georges/Courses/6627#Photo) and enter your name for the caption.
 Coming up soon: Start thinking about the next assignment: you will need to find a current or recent topic in the news that involves, explicitly or implicitly, an interesting statistical issue. You will prepare an analysis of the topic together with a review of scientific evidence. Are there gaps between the science and the public presentation of the topic?
Week 1.5: September 17, 2007
 Topic
 This is an optional tutorial on the use of R for those who have had little or no experience with R. Before the tutorial, be sure to have downloaded R and started covering some of the material in R: Getting started#First tutorials. If you have a laptop, install R on it and bring it to the class.
 In this tutorial we will work through:
 the sample session in Venables and Ripley (2002) [1] (http://wiki.math.yorku.ca/index.php/VR4:_Chapter_1_summary) and
 the tutorial by John Fox prepared for a short course at UCLA: http://socserv.mcmaster.ca/jfox/Courses/UCLA/index.html
To continue learning R:
 Work through http://cran.r-project.org/doc/manuals/R-intro.html, also available as a pdf file through the help menu on the R console.
 Highly recommended for learning R systematically: work through the online textbook by J. H. Maindonald at http://wiki.math.yorku.ca/index.php/R:_Getting_started#Exploring_much_more_deeply
Week 2: September 24, 2007
 Topics
 Statistics Canada: Careers for Mathematical Statisticians (http://www.statcan.ca/english/employment/ma/ma.htm).
 Discussion of Common Errors chapter 1.
 Observational vs Experimental data
 Making the best of observational data
 Visualizing multiple regression
 Presentations
 Assignment 1 will be presented by each group. Plan to take no more than 7 minutes per group.
 Assignments and things to do
 Group assignment: Prac07 Jeffreys Assignment 2, Prac07 Jaynes Assignment 2, Prac07 Moser Assignment 2
 Find a current or recent topic in the news that involves, explicitly or implicitly, an interesting statistical issue. Prepare an analysis of the topic together with a review of scientific evidence. Are there gaps between the science and the public presentation of the topic?
 Find, or think of, an example in which you would expect Simpson's Paradox to lead to a paradoxical association between two variables. Can this lead to misinterpretation of the relationship among the variables? Discuss.
 Do the same for association in the 'wrong' direction due to selection.
 Prepare a 7-minute presentation on your results. You can save your work on the wiki in files whose names begin with [[Prac07 Your.Group.Name Assignment 2]]
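 To make the two phenomena in this assignment concrete, here is a minimal sketch in Python (used only for illustration; the course works in R). The counts in the first part are illustrative numbers chosen to show a reversal, not data from the course. The second part simulates two independent variables and then keeps only the cases whose sum exceeds a cutoff. All function names, the cutoff and the seed are assumptions made for the example.

```python
import random

# Illustrative (successes, trials) counts for treatments A and B in two strata
counts = {
    "stratum 1": {"A": (81, 87), "B": (234, 270)},
    "stratum 2": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, trials):
    return successes / trials


def pooled_rate(treatment):
    """Success rate for a treatment when the stratifying variable is ignored."""
    successes = sum(counts[s][treatment][0] for s in counts)
    trials = sum(counts[s][treatment][1] for s in counts)
    return successes / trials


def correlation(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def selection_demo(n=20000, cutoff=1.0, seed=1):
    """Correlation of independent x and y, before and after selecting on x + y."""
    rng = random.Random(seed)
    pairs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    selected = [(x, y) for x, y in pairs if x + y > cutoff]
    return correlation(*zip(*pairs)), correlation(*zip(*selected))


# Simpson's Paradox: A does better within each stratum, B does better overall
for stratum, arms in counts.items():
    print(stratum, round(rate(*arms["A"]), 3), round(rate(*arms["B"]), 3))
print("pooled", round(pooled_rate("A"), 3), round(pooled_rate("B"), 3))

# Selection: independent variables become negatively associated among selected cases
r_all, r_selected = selection_demo()
print("correlation, all cases:", round(r_all, 3))
print("correlation, selected cases:", round(r_selected, 3))
```

 In the first part the reversal arises because the two treatments are distributed very differently across the strata. In the second part, selecting on the sum forces a low value of one variable to be compensated by a high value of the other.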
 Individual assignment
 Look at the data set http://www.math.yorku.ca/~georges/Data/coffee.csv. It has three relevant variables: 'Heart', a measure of heart condition (the higher, the less healthy); 'Coffee', a measure of coffee consumption; and 'Stress', a measure of occupational stress. How could you use this data to address the question of whether coffee consumption is harmful to the heart? Discuss assumptions needed to get anywhere with the data and discuss the nature of various assumptions that might lead to different interpretations, if relevant.
 Look at the data set http://www.math.yorku.ca/~georges/Data/hwX.csv where X is the remainder when you divide your 'class number' (the number from 1 to 20 on the class list on the web) by 4. Thus X will be 0, 1, 2, or 3. The data contains data on three variables: Health (the higher the better), Height and Weight. All are in standardized units. What would this data set have to say about the relationship between Weight and Health? Discuss assumptions needed to get anywhere with the data and discuss the nature of various assumptions that might lead to different interpretations, if relevant.
 These individual assignments should be sent to the instructor by email (Word attachments or text files are okay) before noon on Sunday, October 14th.
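 The rule for choosing your hwX.csv file is just the remainder on division by 4. A small sketch in Python (illustration only; the helper name is hypothetical, and the range check assumes class numbers run from 1 to 20 as stated above):

```python
BASE_URL = "http://www.math.yorku.ca/~georges/Data/"


def dataset_url(class_number):
    """Return the URL of the hwX.csv file, where X = class_number mod 4."""
    if not 1 <= class_number <= 20:
        raise ValueError("class number should be between 1 and 20")
    return f"{BASE_URL}hw{class_number % 4}.csv"


# Example: class number 7 leaves remainder 3 when divided by 4
print(dataset_url(7))
```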
 Readings
 Read Common Errors chapter 2
 Review your textbooks on multiple regression. What is a confidence ellipse? What is its connection with hypothesis testing? What is a Scheffé confidence interval? What is a Bonferroni confidence interval?
Week 3: October 15, 2007
 Topics
 A review of multiple regression:
 The bivariate normal and its contours: isodensity or contour ellipse. script: visualizing normal contour ellipses (http://www.math.yorku.ca/~georges/R/MultivariateNormalContours.R)
 The dispersion ellipse [also known as a variance, data or deviation ellipse]:
 Correlation and regression, regression to the mean Visualizing Regression pp. 1 (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
 Further properties: Deviation Ellipse and Precision Ellipse (http://www.math.yorku.ca/~georges/Slides/Deviation_and_Inverse_Ellipses.pdf).
 Yet more properties: Statistics: Ellipses
 Confidence ellipses and intervals in regression: Statistics: Ellipses of regression
 Even more properties: Statistics: Geometry of the ellipse
 Some older material on the topic: Visualizing Regression in 3D (http://www.math.yorku.ca/~georges/Files/VisualizingRegression.pdf)
 Simple and Multiple regression:
 Simple vs. multiple regression: R script: Visualizing Multiple Regression Part 1 (http://www.math.yorku.ca/~georges/R/VisualizingMultipleRegressionPart1.R)
 Interpreting coefficients
 Relationship between unconditional vs conditional effects in regression, Simpson's Paradox with numerical predictors
 Criteria for selecting models: role of causal assumptions.
 Measurement error in regression: R script: Visualizing Multiple Regression Part 2 (http://www.math.yorku.ca/~georges/R/VisualizingMultipleRegressionPart2.R) (incomplete)
 The measurement error paradox: Why it can be more important to measure the variables you're not directly interested in.
 Two continuous variables: Visualizing Regression in 3D pp. 35-52 (http://www.math.yorku.ca/~georges/Files/VisualizingRegression.pdf)
 One continuous and one categorical variable: Visualizing Regression pp. 104ff (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
 The measurement error paradox: Why it can be more important to measure the variables you're not directly interested in.
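 The measurement error paradox can be illustrated with a small simulation. The sketch below is in Python for illustration only (the class scripts are in R) and is not the course's script. It assumes a hypothetical setup in which Stress drives both Coffee and Heart, Coffee has no effect of its own, and Stress is recorded either exactly or with added noise. The variable names echo the coffee example, but all values are simulated.

```python
import random


def ols_two_predictors(y, x1, x2):
    """Least-squares slopes of y on x1 and x2 (with intercept), via the 2x2 normal equations."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate(n=20000, seed=1):
    """Return the Coffee coefficient when adjusting for true and for noisy Stress."""
    rng = random.Random(seed)
    stress = [rng.gauss(0, 1) for _ in range(n)]
    coffee = [s + rng.gauss(0, 1) for s in stress]        # Coffee depends on Stress
    heart = [s + rng.gauss(0, 1) for s in stress]         # Heart depends on Stress only
    stress_noisy = [s + rng.gauss(0, 1) for s in stress]  # Stress measured with error
    coffee_given_true, _ = ols_two_predictors(heart, coffee, stress)
    coffee_given_noisy, _ = ols_two_predictors(heart, coffee, stress_noisy)
    return coffee_given_true, coffee_given_noisy


b_true, b_noisy = simulate()
print("Coffee coefficient, adjusting for true Stress:", round(b_true, 3))
print("Coffee coefficient, adjusting for noisy Stress:", round(b_noisy, 3))
```

 Under these assumptions, adjusting for the exactly measured Stress gives a Coffee coefficient near zero, while adjusting for the noisy version leaves a clearly positive one (about 1/3 in expectation for this setup). This is why the quality of measurement of the control variable matters so much.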
 Review: regression diagnostics with simple regression:
 Notes
 Visualizing Regression pp. 99ff: Influence and Leverage (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
 With multiple regression:
 R script: Visualizing Multiple Regression Part 3 (http://www.math.yorku.ca/~georges/R/VisualizingMultipleRegressionPart3.R)
 Why forward stepwise might not work
 Residual plots: the old, the newer and the newest: what do they do?
 Outliers: why the good ones can be worse than the bad ones. Understanding what they do and how to find them.
 An example: Visualizing Regression pp. 139ff: Added-variable plot: pay equity in a large law firm (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
 Some diagnostics for multiple regression R script: multiple regression and diagnostics with car (http://www.math.yorku.ca/~georges/R/MultipleRegressionAndDiagnostics.R)
 Presentations
 We will finish presentations of Assignment 1 and will do presentations on assignment 2.
 Assignment 3, due with presentation November 19
 Prepare notes in wiki files on the material seen in class today, combining graphics (e.g. using rgl.snapshot to capture 3D views), text and mathematical equations. You may use many wiki files for this, but remember to start the name of each file with 'Prac07 Your.Group.Name' to ensure that the file names are unique. Later we will restructure the notes into documents with better names. Prepare a 7-minute overview of the material (although the material itself would possibly take much longer than 7 minutes to cover).
 Prac07 Pearson Assigment 3 Coffee, Stress and Health example
 Prac07 Quetelet Assigment 3 Measurement error in Stress: its effect on the estimate of the effect of Stress and Coffee
 Prac07 Rao Assigment 3 Health, Height and Weight with no outliers
 Prac07 Robbins Assigment 3 Health, Height and Weight with outliers
Week 4: October 22, 2007
 Continuation of October 15.
Week 5: November 5, 2007
Week 6: November 19, 2007
Assignment 2 Presentations
Statistics in Society
Statistics plays a major role in many public controversies. Recent examples that have received a lot of publicity include topics as diverse as the allegations that lottery ticket vendors in Ontario are defrauding lottery winners and the claim that a tomb has been discovered containing the remains of the Christian figure Jesus. The current scandal concerning incompetent pathology reports that led to many murder convictions of innocent people should lead us to pause and consider whether, and how, statisticians might mislead jurors or public opinion through misapplications of statistical principles. There's an excellent passage in a TED Talk by Peter Donnelly at http://tedblog.typepad.com/tedblog/2006/11/statistician_pe.html. The portion on the role of statistical evidence in the conviction of Sally Clark for murdering two of her children, who are now thought to have died of sudden infant death syndrome, starts around the 13 min. mark. Statistical evidence was given by a pediatrician. Donnelly seems to suggest that a statistician would have done a much better job of testifying in this case. Consider whether it would be possible to provide clear guidance to a jury using only frequentist methods in a case such as Sally Clark's.
There are other examples in which statistics played a crucial, sometimes insufficiently recognized, role (add to this list):
 The case of Susan Nelles at the Hospital for Sick Kids in Toronto in 1981. Nelles was accused of murdering at least 4 infants partly on the basis of statistical evidence. The case fell apart in a preliminary inquiry but the controversy has never been resolved. It is possible that it could be clarified with a better understanding of the statistical evidence.
Week 7: January 7, 2008
Statistical practice for regression
 Modeling: substantive and statistical issues
 Diagnostics: added-variable plots, studentized residual vs leverage plots, transformations: Box-Cox, CERES plots
 Visualization: partial residual plots, etc.
 Alternative models: VIFs, principal components, etc.
Week 8: January 21, 2008
Longitudinal data analysis with mixed models
 Longitudinal Data Analysis with Mixed Models: A Graphical Overview (http://www.math.yorku.ca/~georges/Slides/Workshopv10Slides.pdf)
 R script using 'library(nlme)' for multilevel modelling: [2] (http://www.math.yorku.ca/~georges/R/MultilevelModelsinR.R)
Assignment
This is an individual assignment to be done with the help of your group as assigned for assignment 3. The individuals in each group will address different aspects of the analysis of the high school math achievement data set used by Bryk and Raudenbush available at http://www.math.yorku.ca/~georges/Data/hsfull.csv.
 Description of variables:
 ID: student identification number (for the study)
 School: school identification number
 Minority: an indicator for student ethnicity (Yes = member of a minority, No = not a member of a minority)
 Sex: an indicator for student gender (Female and Male)
 SES: the student's socioeconomic status based on a standardized scale constructed from variables measuring parental education, occupation, and income
 MathAch: a measure of the student's mathematics achievement (based on a mathematics test in the senior year)
 Size: school enrollment
 Sector: school sector: Public and Catholic
 PRACAD: proportion of students in the academic track at the school
 DISCLIM: a scale measuring disciplinary climate at the school
 HIMINTY: an indicator of school enrollment ethnicity: 1 = more than 40% minority enrollment, 0 = less than or equal to 40%
That is, each member of the group works on a slightly different aspect of the problem.
Use the alphabetical ordering of the last names of the members of your group to decide who tackles each of the following questions.
 Questions
 1) Explore the relationship between gender and math achievement. Do girls do better at all-girls schools than at co-ed schools? Is the relationship between SES and math achievement the same among girls and boys?
 2) Explore the role of SES, in particular to what extent does it seem to be the child's SES and to what extent is it the school's SES that is related to math achievement? Would it be desirable to send a low SES child to a high SES school? Would they be expected to do better or not? Specify relevant cautions in coming to causal conclusions based on your analysis.
 3) Explore the relationship between minority status and math achievement. Is minority status related to math achievement in the same way regardless of gender and SES? What is the role of the composition of the school versus the status of the individual?
 4) Explore in detail the differences between the two sectors. For what ranges of SES do Catholic schools appear to reach higher math achievement? Are there other possible explanations for this apparent phenomenon (e.g. curvilinearity in the relationship between SES and math achievement)?
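 Several of these questions contrast a student's own SES with the SES composition of the school. One common way to prepare for such an analysis is to split SES into a school mean and a within-school deviation. The sketch below shows only that data-management step, in Python for illustration (you would normally do this in R). The rows are hypothetical, and the derived names SES_school_mean and SES_within are assumptions, not variables in hsfull.csv.

```python
from collections import defaultdict

# Hypothetical rows using the School and SES column names described above
rows = [
    {"School": 1, "SES": -0.5},
    {"School": 1, "SES": 0.5},
    {"School": 2, "SES": 1.0},
    {"School": 2, "SES": 2.0},
    {"School": 2, "SES": 0.0},
]


def add_contextual_ses(rows):
    """Add the school mean of SES and each student's deviation from that mean."""
    by_school = defaultdict(list)
    for row in rows:
        by_school[row["School"]].append(row["SES"])
    means = {school: sum(v) / len(v) for school, v in by_school.items()}
    return [
        dict(row,
             SES_school_mean=means[row["School"]],
             SES_within=row["SES"] - means[row["School"]])
        for row in rows
    ]


for row in add_contextual_ses(rows):
    print(row)
```

 Whether and how to use such derived variables in a mixed model is part of what the assignment asks you to work out. The sketch does not prescribe a model.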
 Guidelines
 Make sure you address specific questions. Go beyond general statements based on default regression output. Use graphs effectively to illustrate your analyses. When using graphs comment on which aspects of the graph reflect significant effects and which do not.
 Tutorial
 Next week, January 28, the class time (2:30 to 5:30) will be an optional tutorial session for anyone who has questions. You may, of course, contact me at other times.
 Deadline
 The deadline for sending your analyses is in two weeks plus one day (to give you the chance to ask final questions at the class in two weeks). Please send your analyses by email by 11:59pm on Tuesday, February 19.
Week 9 etc.: TBA
From the tutorial on January 28: MATH 6627: 2008 Jan 28 R script
Week 10: Nonlinear longitudinal models
 Nonlinear longitudinal models[3] (http://www.math.yorku.ca/~georges/Slides/TalkOnComasAndMigrainesv200806.pdf)
Week 11: Review of longitudinal models
 More on the R matrix: A First Look at Multilevel and Longitudinal Models p. 86 ff. (http://www.math.yorku.ca/~georges/Courses/Repeated/CourseNotes.pdf)
 More on the G matrix: Making T (G) or R (Sigma) Simpler (http://www.math.yorku.ca/~georges/Slides/NMakingVariance.pdf)
 More on contextual variables: Using contextual 'dummy' variables: Practical Issues Applying Mixed Models (http://www.math.yorku.ca/~georges/Slides/PracticalIssuesApplyingMixedModels.pdf)
 More on nonlinear models using SAS: Summary of PROC NLMIXED (ftp://people.math.yorku.ca/Slides/PROC%20NLMIXED%20SUMMARY.pdf)
 A good overview John Fox: Using Mixed Models in R (http://www.math.yorku.ca/~georges/Courses/Repeated/FoxMixedModelsinR.pdf)
 More on data ellipses and confidence ellipses: Visualizing Regression pp. 159 ff. (http://www.math.yorku.ca/~georges/Slides/VisualizingRegression.pdf)
Week 12: Practical and Ethical Issues in Statistical Consulting
 American Statistical Association Guide to ethics: http://www.amstat.org/profession/index.cfm?fuseaction=ethicalstatistics
 Dr. Mary Gray's talk on ethics in statistics: Media: Ethical Considerations Ottawa.pdf
 A discussion on the ethics of consulting for the tobacco industry: http://www.stat.columbia.edu/~cook/movabletype/archives/2005/10/the_ethics_of_c.html
 Did statisticians mask the suicide risks of SSRIs? http://www.newscientist.com/article/mg19726424.600didgsktrialdatamaskpaxilsuiciderisk.html and http://www.theglobeandmail.com/servlet/story/RTGAM.20080116.wpharma1701/EmailBNStory/specialScienceandHealth/home
 The numbers can't stand alone: how paradigms matter. York's Dr. Pat Armstrong on some common quantitative issues: Media: Doubtful Data.pdf
 Statistical mistakes throughout history, compiled by the Australian Statistical Society: Media: Booklet.pdf
Statistical Society of Canada:
 Document on ethics (http://www.ssc.ca/accreditation/documents/ethics_e.pdf)
 Accreditation (http://www.ssc.ca/accreditation/)
Links
 Steve's Attempt to Teach Statistics (http://www.childrensmercy.org/stats/) A very interesting site.
 D.R. Cox on statistical consulting: http://www.ssc.ca/resources/consultants/cox_e.html
 ASA Consulting Section: http://www.amstat.org/sections/cnsl/index.html
 TED Talks:
 Peter Donnelly on genomes: http://www.youtube.com/watch?v=kLmzxmRcUTo&mode=related&search=
 Hans Rosling (2006) Debunking third-world myths with the best stats you've ever seen: http://www.ted.com/talks/view/id/92
 Hans Rosling (2007) on New insights on poverty and life around the world: http://www.ted.com/index.php/talks/view/id/140
 United Nations Statistical Commission http://unstats.un.org/unsd/default.htm
 Rod Little on English style in scientific papers: http://sitemaker.umich.edu/rlittle/files/styletips.pdf
 Audio and slides of Workshop on Current Issues in the Analysis of Incomplete Longitudinal Data (October 13-15, 2005) at the Fields Institute: http://www.fields.utoronto.ca/audio/#CMM
 UBC web page for its consulting course http://www.stat.ubc.ca/Courses/Details/course.php?course=65
 Hugh Chipman's introduction to R http://ace.acadiau.ca/math/scc/workshops_2005/Rclass.html
 Statistical Consulting at Acadia http://ace.acadiau.ca/math/m4233/m4233.htm
 Constructionism and Reductionism: Two Approaches to Problem-Solving and Their Implications for Reform of Statistics and Mathematics Curricula http://www.amstat.org/publications/jse/secure/v7n2/lazaridis.cfm
 A Medical Mystery Unfolds in Minnesota (http://www.nytimes.com/2008/02/05/health/05pork.html?pagewanted=1&_r=1&nl=8hlth&emc=hltha1), New York Times, Feb. 5, 2008.
 An article on the Grange inquiry: TORONTO INFANT DEATH STIRS CONCERN (http://query.nytimes.com/gst/fullpage.html?sec=health&res=9B05E3D71638F93BA35757C0A962948260) by Douglas Martin, New York Times, April 8, 1984