MATH 1532 2009-10
MATH 1532 Statistics for Business and Society
Breaking News
- I am trying to get a room for a tutorial session on Thursday from 2:30 to 4:30. I will also have office hours from 8:30 am to 10:30 am on Fridays.
- Jan 26: Frequently asked questions for MATH 1532
- Jan 26: We are a bit behind in covering the material so the mid-term test will cover only chapters 1 to 4 omitting 4.5.
- Feb 2: Please have a look at the sample mid-term test (http://www.math.yorku.ca/people/georges/Files/MATH1532/Tests/Sample_Test_Intro_Stats.pdf).
- Feb 4: The room for the Thursday tutorial from 2:30 to 4:30 is in Founders' College, FC 106. Founders' College is building number 50 on the map. For reference our class meets in CSE which is building number 19.
- Feb 9: A copy of the textbook has been donated to Steacie Science Library and is available on 2-hour reserve.
- Feb 25: Assignment 2 will be due on March 25 and the Project on April 5.
- Mar 14: The final exam is scheduled for Friday, 16 Apr 2010 at 14:00 in CLH D, duration 2 hours.
- Mar 16: Full details on the individual projects have been posted.
- Mar 16: All questions for Assignment 2 have been posted for chapters 5, 6, and 7.
- Mar 18: There is additional information for Mac users. If you have had trouble using Rcmdr or downloading files, have a look here (http://wiki.math.yorku.ca/index.php/R:Installing_and_Using_R_and_Rcmdr_on_a_MAC).
- Mar 18: Some groups produced excellent reports for the first assignment. You can have a look at the winner of the Gold Medal (http://www.math.yorku.ca/people/georges/Files/MATH1532/Other_Files/SampleAssignment1.pdf) for inspiration.
- Mar 29: Sample solutions to Assignment 2 (http://www.math.yorku.ca/people/georges/Files/MATH1532/Other_Files/Sample_solutions_to_Assignment_2.pdf)
- Mar 29: How to use Rcmdr -- with pictures (http://wiki.math.yorku.ca/index.php/R:_Rcmdr_--_how_to)
- Mar 29: An extra tutorial will be held on Tuesday, March 30 from 1:30 to 3:30 pm in N 638 Ross.
- Apr 1: A set of sample questions (http://www.math.yorku.ca/people/georges/Files/MATH1532/Tests/Sample_Final.pdf) for the final exam is available.
- NEW: Apr 12: The room for the tutorial on Tuesday, April 13 from 1:30 to 3:30 is N638 Ross.
- NEW: Apr 12: The set of final exam questions with solutions (http://www.math.yorku.ca/people/georges/Files/MATH1532/Tests/Sample_Final_with_Solutions.pdf) is available. The currently posted version includes corrections to the original. Be sure to have a look at the corrections if you haven't already.
- NEW: Apr 13: Recording of the April 13 tutorial (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/MATH1532_April_13_Tutorial.html)
Files
- Files for MATH 1532 (http://www.math.yorku.ca/people/georges/Files/MATH1532)
- FAQ for Intro Stats (http://wiki.math.yorku.ca/index.php/Intro_Stats_FAQ)
- Week 2:
- Chapter 1 slides with notes on Tuesday (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week2_Slides_Chap_01_Tuesday_MATH1532_.pdf)
- Chapter 1 slides with notes on Tuesday and Thursday (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week2_Slides_Chap_01_TR_MATH1532.pdf)
- Notes on Tuesday (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week2_Notes_Tuesday_MATH1532.pdf)
- Notes on Tuesday and Thursday (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week2_Notes_TR_MATH1532.pdf)
- Recordings (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/) (click on the html file, not the swf file)
- Week 3:
- Computer malfunctioned on Tuesday
- Chapter 2 slides with notes (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week3_Chap_2_Slides_and_Notes.pdf)
- Notes (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week3_Notes.pdf)
- Recording (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/) (click on the html file, not the swf file)
- Week 4:
- Chapter 2 slides start on p. 24 with notes (last part done on board) (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week4_Slides_Chap_02-withNotes_Tuesday.pdf)
- Recording (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/MATH1532_10_Jan_26.html) of Tuesday class. The computer froze in the last 10 minutes but my voice was recorded while I was working at the blackboard.
- Weeks 5 to 12
- Slides (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/)
- Recordings (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/)
General Information
Statistical reasoning is crucial for a critical understanding of the flood of data and information we face daily in modern society. Understanding the principles of statistical reasoning and being aware of a number of widespread errors in statistical thinking is often the key for distinguishing arguments that are sound from those that are fallacious.
This course stresses the logic and reasoning behind statistics avoiding emphasis on complex mathematical formulas. Statistical reasoning will be applied to a critical analysis of current events reported in the media and current scientific, medical and social controversies.
Instructor
- Georges Monette, Ph.D., P.Stat. (http://www.ssc.ca/accreditation/index_e.html)
- N626 Ross
- Email: georges+math1532@yorku.ca (mailto:georges+math1532@yorku.ca)(Note: the "+math1532" portion is designed to avoid spam filters)
- Phone: (416) 736-2100 ext. 77164
- http://www.math.yorku.ca/~georges
- Office hours: Thursday afternoons, exact time TBA or by appointment
Course work and grades
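Item | Date | Weight |
---|---|---|
Assignment 0 (individual) | Jan. 11 noon | 0% |
Assignment 1 (team) | Feb. 2 | 10% |
Mid-term test | Feb. 11 | 20% |
Assignment 2 (team) | March 25 (originally March 11) | 10% |
Project (individual) | April 5 (originally April 4) | 30% |
Final exam | | 30% |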
Text
Jessica M. Utts and Robert F. Heckard, (2006) Statistical Ideas and Methods, Thomson.
Lectures and Tutorials
- Class: Tuesdays and Thursdays, 10 am to 11:30 am in CSE A (building 19 on the Keele Campus (http://www.yorku.ca/yorkweb/maps/keele-webmap.html))
- Optional tutorial: Thursdays 2:30 to 4:30, Location TBA
Important Dates
Event | Date |
---|---|
Last date to enrol without permission | January 19 |
Last date to drop the course without receiving a grade | March 8 |
Last class | March 31 |
Last date to submit term work and end of classes | April 5 |
Reading week | February 13 to 19 |
Exam period | April 7 to 23 |
Resources
Data sets, lecture notes, information on computing, etc. will be posted at http://www.math.yorku.ca/people/georges/Files/MATH1532/. Since some of the material may be copyrighted, access to the files is protected and requires a userid ('buso') and a password (also 'buso'). If you find any interesting links, please send them to me at georges+math1532@yorku.ca (mailto:georges+math1532@yorku.ca?subject=Interesting%20link%20for%20MATH1532) and I will try to add them to the links at the bottom of this page.
Using computers for the course
Some assignments and the project will require you to analyze data using computer software. The test and exam will require you to interpret output from the same software. You can learn the computing aspects of the course in a number of ways:
- If you have access to a computer, you can download the software for the course. We use free, open-source software that runs on Windows, MacOS X or Linux. If you have a laptop, you are encouraged to bring it to class and to tutorials and office hours.
- If you don't have access to a computer, you can get an account to use computers in the Gauss Lab where the software will be available.
- The course will show examples of statistical analyses with Microsoft Excel, OpenOffice Calc and R with Rcmdr.
Week 1: Jan 5,7: Introduction
Material covered
Textbook
Chapter 1, 3
What is 'Statistics'?
The definition in the text says:
Definition: Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty.
- This is certainly an important aspect of statistics, but I think it tells only a small part of the story. Statistics is the science (and art) of working with uncertainty, whether you plan to make decisions or not. We tend to think of statements as true or false. But in practice, the truth or falsity of most important statements is not known for sure. There are all shades of uncertainty between being sure a statement is true and being sure it is false. Many of the most important decisions and choices we make in life are made despite the fact that we don't have all the information we would like to have to determine which route is best. Sometimes we simply act as if something is true or false although we don't really know. Statistics is not just about how to make these difficult decisions. It is also about remembering and being aware of our uncertainty so we know where to look for better information and we are ready to revise our hypotheses as relevant information becomes available. Statistics is not just about making decisions; it's about where to look for information that could lead us to change our decisions. It's about knowing when to keep an open mind and knowing when and how to change your mind.
- Statistics is about the fascinating journey from ignorance to increasingly certain knowledge to wisdom. This is a journey we all follow individually. It is also a journey undertaken by disciplines, by political and social organisms and by mankind as a whole.
Experimental vs Observational data
- If X and Y are correlated, what can it mean?
- 1) X causes Y?
- 2) Y causes X?
- 3) Another variable(s) Z(s) causes both X and Y?
- a) Some Zs might be known and measurable. For these Zs we might be able to adjust using sophisticated statistical methods.
- b) Some Zs might be known but hard or impossible to measure. This is more difficult to deal with.
- c) Some Zs might not be discovered until the year 3000. We can't adjust statistically for these.
- 4) Selection: maybe there's no relationship but some data got thrown out or ignored and the data left created the impression of a relationship.
- 5) Chance: This is the one statisticians are really good at dealing with -- as you will learn in this course.
- What if we have an 'experiment' with 'random allocation of X' to experimental units?
- 1) possible
- 2) No! We know what caused X. It was the coin toss or the random number generator that caused X.
- 3) Maybe. But it could only be by chance that differences in levels of any combination of Zs, known or not, measurable or not, would have a large impact on Y.
- 4) We can exclude this by careful checking.
- 5) Chance again.
- So we are left with two options:
- 1) X causes Y, or
- 2) Chance.
- We can use statistical analysis to measure chance. If the chance is very small then we may be left with X causes Y as the plausible explanation.
- How should you react to causal claims based on data analyses?
- 1) Experimental data or observational? You might have to ask questions to answer this. Sometimes it isn't obvious from the appearance of the data.
- 2) If experimental: was allocation random or by judgment or haphazard? Was the study double-blind? Are there possible biases in measurements? Psychological factors that influence outcome? Does the claim match the nature of the experiment or is the claim stretching to something that does not correspond exactly to what was done in the experiment?
- 3) If observational:
- a) Can you poke an obvious hole in the claim? E.g. is there a plausible alternative explanation that was not taken into account in the analysis? In this case, you've countered the claim.
- b) What has the analysis adjusted for? Are these factors that can be measured with precision? What kinds of factors are not accounted for?
- Some examples: Toronto Star: Pulse (http://www.math.yorku.ca/~georges/Courses/2565/StatisticsInTheNews030926.html)
- Which examples are experimental and which are observational?
- Which conclusions are reasonable and which are not? Why?
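The third explanation in the list above (another variable Z causes both X and Y) can be illustrated with a small simulation in R. This is only a sketch under stated assumptions: the variables `z`, `x` and `y` are made up, and the data are generated so that X has no effect on Y at all.

```r
# Hypothetical simulation: Z causes both X and Y; X has NO effect on Y.
set.seed(1532)                 # arbitrary seed so the run is repeatable
n <- 5000
z <- rnorm(n)                  # the confounding factor
x <- z + rnorm(n)              # X depends on Z plus noise
y <- z + rnorm(n)              # Y depends on Z plus noise, not on X

# X and Y are clearly correlated even though neither causes the other.
r_xy <- cor(x, y)
print(r_xy)

# If Z is known and measurable, we can adjust for it in a regression.
# The coefficient of x is then close to 0.
adjusted <- lm(y ~ x + z)
print(coef(adjusted))
```

With random allocation of X, the link from Z to X is broken, which is why a Z can then produce an association between X and Y only by chance.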
Things to do
'Things to do' are tasks that are not graded but are important to keep up with the course
- Download and install R and Rcmdr on your computer.
- If you have neither Microsoft Excel nor OpenOffice, download and install OpenOffice, which is free, open-source software.
Exercises
Exercises are not graded but they are useful preparation for the mid-term test or the final exam
Text Chapter 1
- pages 9+: 1.1, 1.5, 1.6, 1.7, 1.8, 1.10, 1.11, 1.13, 1.14, 1.15, 1.16, 1.19, 1.23.
Assignment 0
Due: 12 noon, January 11, 2010
- I would like to know something about you and I also want to form random teams of 4 or 5 students to work on Assignment 1. I will use your emailed responses to this Assignment 0 to form the teams. You will receive the names of your team members on January 11 so you can meet face to face at the break during the class on January 12.
- Send me (to georges+math1532@yorku.ca (mailto:georges+math1532@yorku.ca?subject=Assignment%200)) an email message from the e-mail account where you would like to receive email for the course.
- In your email message, tell me about yourself by responding to the following questions. Please cut and paste the questions into your email message and then add your answers between the questions. Note that only the first two items will be shared with your work group:
- 1) Your given name and family name. (In parentheses, state the name by which you prefer to be called if it's different from your given name)
- 2) Your e-mail address
- 3) Your student number
- 4) Previous statistics courses if any?
- 5) What kind of computer do you plan to use for the course? e.g. Laptop, desktop, computer lab.
- 6) If you have a computer:
- a) What operating system does your computer use: e.g. Windows Vista, MacOS X, Ubuntu Linux.
- b) What spreadsheet packages (Excel, Calc) do you have installed on your computer?
- c) What word processing packages (MS Word, OpenOffice Writer)?
- d) What statistical packages do you have (if any) on your computer? (e.g. SPSS, SAS, S-Plus, R)
- 7) List the software packages you use and indicate your skill level on a scale from 1 to 10 (1 = very basic, 3 = basic user, 5 = solid knowledge of basic features, 8 = advanced user (e.g. can define macros if relevant), 10 = guru)
- a) Operating systems:
- b) Spread sheet software:
- c) Word processing software:
- d) Statistical software:
- 8) What do you want to get out of this course?
- 9) Any other information you would like to share with me:
Links
This section will contain links to data sets, notes, and, when it works, a video of the screen display for the course
- January 5:
- Recording of lecture (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/MATH1532_10_Jan_05.html) This shows the screen projected from my laptop along with sound from the lecture. You may need to enter a userid and password (both are 'buso')
- Lecture slides (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week1_10_Jan_05.pdf)
- January 7:
- Recording (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/MATH1532_10_Jan_07.html)
- Lecture slides (http://www.math.yorku.ca/people/georges/Files/MATH1532/Slides/Week1_10_Jan_07.pdf)
Week 2: January 12,14
Material covered
Textbook
During the 2nd week we covered much of the material in Chapters 1 and 3 of the textbook and we just started discussing Chapter 2.
Next week, we will cover Chapter 2, review Chapter 3 and start Chapter 4
Synopsis
Types of data
- Purposes for analyzing data:
- Descriptive
- Inference: causal
- Inference: predictive
- Type of data
- Experimental: X under control of experimenter: random assignment of levels of X
- Observational: X determined by other factors and just observed, not manipulated
- How purposes and types of data match
- Descriptive statistics can be done with any kind of data since there is no intention to generalize
- Causal inference is best done with experimental data
- Caution: experiments are often conducted with volunteers who may not be similar to the target population for causal inference. Often, the only true experiments may be on animals who may or may not mimic the corresponding processes in humans.
- Predictive inference is best done with observational data sampled so it is representative of the target population.
- Just as random allocation is crucial for experiments, random selection is ideal for observational data used for predictive inference.
- Causal inference with observational data is highly problematic
- Often, important questions are causal in nature and all that's available is observational data.
- We can never be certain of causal conclusions based on observational data
- Intelligent evaluation of causal claims based on observational data is challenging but may be the only way to shed light on crucial questions.
- Assessing causal claims from observational data, where the relationship between X and Y is too strong to be attributed to chance:
- Look for plausible alternative explanations:
- May Y cause X?
- Are there obvious plausible confounding factors: factors that could cause both X and Y. Note that factors that are caused by X and, in turn, cause Y are mediating factors that explain and do not contradict the possibility that X causes Y.
- Have some of these possible confounding factors been controlled for in the study? How effectively?
- Do important factors remain that have not been controlled for?
- Consider the possibility of a selection effect.
- Consider possible mediating factors that could explain how X could cause Y, even when the suggestion that X causes Y seems surprising.
- When there are different sources of data, consider which seem more reliable and why?
- What kind of data could determine whether X causes Y? Why does it not yet exist? Is it likely to be available in the future? What obstacles exist to obtaining such data?
- Can you come to a practical conclusion and how much confidence do you have in it?
- Good experiments:
- Control vs treatment groups: experiments involve a comparison between two or more conditions or treatments.
- Placebos -- blinding of subject
- Blinding of assessor
- If both subject and assessor are blind, we have double blind
- Randomization is crucial so we can be sure that possible confounding factors, known or unknown, are not responsible for the outcome except possibly by chance. Randomization can be applied in many ways:
- completely randomized design: take all subjects and randomly allocate to each treatment
- paired designs: for two treatments: split subjects into pairs that are similar with respect to relevant variables, then randomly assign the two treatments within each pair.
- blocked designs: for more than two treatments: split subjects into similar blocks with as many subjects as treatments, then randomly assign treatments within each block.
- longitudinal designs: give all or some of the treatments to each subject. Randomize order.
- Special types of observational studies for causal inference:
- Retrospective: measure Y in the present or past and X in the past.
- Prospective: measure X now, Y later.
- Case-control: If Y is disease vs. no disease: choose a group of subjects with the disease (the cases) and then, for each case, find a non-diseased subject who is similar with respect to selected Zs. Measure X on everyone and see if X is related to Y.
- Longitudinal without randomization: Subjects get all levels of X either in same order or in an order not controlled by experimenter.
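The randomized designs listed above can be sketched in a few lines of R. The example is illustrative only: the 12 subjects, the two treatments "A" and "B", and the pairing are all hypothetical, and `sample()` stands in for the coin toss or random number generator.

```r
# Hypothetical example: 12 subjects, two treatments "A" and "B".
set.seed(2010)                         # arbitrary seed for repeatability
subjects <- paste0("S", 1:12)

# Completely randomized design: shuffle a balanced vector of treatments.
crd <- data.frame(subject = subjects,
                  treatment = sample(rep(c("A", "B"), each = 6)))
print(table(crd$treatment))

# Paired design: subjects are first matched into 6 similar pairs
# (here the pairing is simply assumed: S1 with S2, S3 with S4, ...),
# then the order of "A" and "B" is randomized within each pair.
pair <- rep(1:6, each = 2)
paired <- data.frame(subject = subjects,
                     pair = pair,
                     treatment = unlist(lapply(1:6, function(i) sample(c("A", "B")))))
print(paired)
```

A blocked design with more than two treatments follows the same pattern as the paired design, with `sample()` applied to the full set of treatments within each block.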
Exercises
Chapter 3, pp 82--87:
- 4, 6, 7 (distinguish between confounding factors -- Z causes X and Z causes Y -- and mediating factors -- X causes W which causes Y), 8--11, 19, 24, 27--31, 38, 41, 49.
Links
- You can find the talk by Hans Rosling by using Google. You will find a number of other interesting talks on statistics in the TED seminars.
Readings for next week
I expect to cover Chapter 2 and much of Chapter 4 by the end of the week.
Beyond Chapter 4, the readings will generally include less than whole chapters and I will indicate what can be omitted.
Note that for the first 3 weeks we're covering all of chapters 1, 2, 3, and some of 4 although the lectures structure the material in chapters 1 and 3 quite differently from the text in order to give you an alternative development of these important concepts.
Some important ideas not in the text are:
- the explicit list of 5 possible reasons for an association between X and Y in observational data
- the connection between these reasons and the possible reasons with experimental data
- the distinction between confounding factors and mediating factors
These concepts make explicit and clarify the reasoning underlying many statements in the text.
Team Assignment 1
Due: Feb. 2
- 1) Find a topic in the news currently or within the past year that involves some controversy over the interpretation of evidence.
- 2) Collect some clippings or on-line links to news, magazine or journal articles related to the topic.
- 3) Discuss why the topic is controversial. Is the controversy over causality? Why is there room for disagreement? What kind of evidence, data or theory, is available to support the various sides of the issue? Discuss the apparent strengths and weaknesses in the data or theory on either side. Is the available data observational or experimental? Is this relevant to the issue? What kind of data, if any, could resolve the issue? What obstacles are there to obtaining the ideal data to resolve the issue? Is better data likely to become available and how would it be helpful?
- 4) End the assignment with brief individual essays (identify the authors) stating your individual positions on the topic. Have you adopted a point of view? Describe the ways in which you remain uncertain and how your uncertainty could be resolved. If you wish, you can write this part of the assignment as if it were a panel discussion among the members of your team. You could, in fact, record a panel discussion and transcribe it to text.
You are not expected to become experts in two weeks in the topic you choose. The goal of the assignment is for you to become informed lay persons with an understanding of the nature of the controversy and uncertainty concerning your topic, an understanding of the approaches that could resolve it and the challenges to achieving a resolution.
All members of the team receive the same grade. The grade is based on the quality of your research and the interest and intellectual energy you display in dealing with the problem.
Week 3, January 19, 21
Material covered
Using R and Rcmdr:
This video (http://www.math.yorku.ca/people/georges/Files/MATH1532/Recordings/Project_Directories_and_Downloading.html) demonstrates how to create a project folder (directory) and have R start in that folder, so that it reads and writes files there. The video also shows how to download an Excel file from the internet and read it into R. The same method can be applied to Excel files already on your computer.
Graphs with Rcmdr
Purpose | Rcmdr menu | Notes |
---|---|---|
one categorical variable | Graphs > Bar graph; Graphs > Pie chart | |
two categorical variables | need command line | library(lattice); with(Dataset, barchart(table(Xcat, Ycat), stack = F, auto.key = T)) |
X cat var and Y num var | Graphs > Boxplot | click on Plot by groups |
one numeric variable | Graphs > Histogram; Graphs > Boxplot | |
two numeric variables | Graphs > Scatterplot | prompts for x and y variables |
X,Y num. vars & Z cat. var. | Graphs > Scatterplot | click on Plot by groups to choose Z |
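For those who prefer typing commands, the graphs in the table above can be approximated from the R command line. This is a sketch, not the commands Rcmdr itself generates: it uses the built-in `mtcars` data as a stand-in for your own `Dataset`, and the variable names `Xcat` and `Ycat` are created here only for illustration.

```r
library(lattice)   # 'lattice' ships with standard R installations

# 'mtcars' is a data set built into R; it stands in for your own 'Dataset'.
Dataset <- transform(mtcars,
                     Xcat = factor(cyl),
                     Ycat = factor(am, labels = c("automatic", "manual")))

barplot(table(Dataset$Xcat))                  # one categorical variable
pie(table(Dataset$Xcat))

# Two categorical variables (the command-line entry in the table).
# A lattice plot must be printed explicitly when run from a script.
p <- with(Dataset, barchart(table(Xcat, Ycat), stack = FALSE, auto.key = TRUE))
print(p)

boxplot(mpg ~ Xcat, data = Dataset)           # X categorical, Y numeric
hist(Dataset$mpg)                             # one numeric variable
boxplot(Dataset$mpg)
plot(mpg ~ wt, data = Dataset)                # two numeric variables

# Two numeric variables with points coloured by a categorical Z
plot(Dataset$wt, Dataset$mpg, col = as.integer(Dataset$Xcat))
```

Writing `FALSE` and `TRUE` in full is a little safer than `F` and `T`, which are ordinary variables that can be overwritten.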
Statistics with Rcmdr
Purpose | Rcmdr menu | Notes |
---|---|---|
all variables | Statistics > Summaries > Active data set | |
one categorical variable | Statistics > Summaries > Frequency distributions | |
two categorical variables | Statistics > Contingency tables > Two-way tables | Choose X as Row variable and Y as Column variable; request multiple tables, selecting No percentages and Column percentages |
X cat var and Y num var | Statistics > Means > One-way ANOVA; Statistics > Summaries > Numerical summaries | X variable is groups |
one numeric variable | Statistics > Means > Single-sample t-test; Statistics > Summaries > Numerical summaries | |
two numeric variables | Statistics > Fit models | |
X,Y num. vars & Z cat. var. | Statistics > Fit models | |
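The menu selections above have rough command-line counterparts in base R. The sketch below is an approximation rather than the exact code Rcmdr produces; it again uses the built-in `mtcars` data in place of your own data set, and the null value `mu = 20` in the t-test is arbitrary.

```r
# 'mtcars' (built into R) stands in for your own data set.
Dataset <- transform(mtcars,
                     Xcat = factor(cyl),
                     Ycat = factor(am, labels = c("automatic", "manual")))

print(summary(Dataset))                          # all variables
print(table(Dataset$Xcat))                       # one categorical variable

tab <- table(Dataset$Xcat, Dataset$Ycat)         # two categorical variables
print(tab)                                       # counts (no percentages)
col_pct <- 100 * prop.table(tab, margin = 2)     # column percentages
print(round(col_pct, 1))

# X categorical, Y numeric: group means, group SDs and a one-way ANOVA
print(tapply(Dataset$mpg, Dataset$Xcat, mean))
print(tapply(Dataset$mpg, Dataset$Xcat, sd))
print(summary(aov(mpg ~ Xcat, data = Dataset)))

# One numeric variable: numerical summaries and a single-sample t-test
print(summary(Dataset$mpg))
print(t.test(Dataset$mpg, mu = 20))              # mu = 20 is an arbitrary null value

# Two numeric variables, then the same model with a categorical Z added
print(coef(lm(mpg ~ wt, data = Dataset)))
print(coef(lm(mpg ~ wt + Xcat, data = Dataset)))
```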
Textbook
- Chapter 2
- Quick review and consolidation of chapter 3
- Chapter 4
Synopsis
- Types of variables
- Asking questions: about one categorical variable, one categorical and one numeric, etc.
- Describing and summarizing categorical and numerical data
- Outliers
- Graphs for numerical data, interpreting, shape of data
- Creating a histogram, dotplot and stem and leaf plot
- Summaries for quantitative data:
- How large? Measures of location: mean, median
- How far apart? Measures of spread: quartiles, inter-quartile range, range, standard deviation
- Percentiles
- The five number summary and boxplots
- Normal distribution: determined by mean and standard deviation
- Many variables have distributions that are close to normal -- for deep reasons, e.g. the Central Limit Theorem.
- Interpreting the mean and standard deviation: 'Empirical rule': 68% within 1 SD, 95% within 2 SDs, 99.7% within 3 SDs.
- Standardizing a variable: z-score.
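The summaries in this synopsis can be tried out in R. The sketch below uses simulated values (a hypothetical variable `y` with mean 70 and SD 10), so the exact numbers mean nothing in themselves. Note also that `quantile()` uses R's default rule for quartiles, which may differ slightly from the method in the textbook.

```r
# Hypothetical data: 1000 simulated values from a normal distribution.
set.seed(42)                         # arbitrary seed for repeatability
y <- rnorm(1000, mean = 70, sd = 10)

print(c(mean = mean(y), median = median(y)))   # measures of location
print(c(sd = sd(y), IQR = IQR(y)))             # measures of spread
print(range(y))
print(quantile(y))                   # minimum, quartiles and maximum

z <- (y - mean(y)) / sd(y)           # z-scores: number of SDs from the mean

# Proportion of observations within 1, 2 and 3 SDs of the mean;
# for roughly normal data these should be near 0.68, 0.95 and 0.997.
within <- sapply(1:3, function(k) mean(abs(z) <= k))
print(within)

boxplot(y)                           # boxplot based on the five-number summary
```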
FORTHCOMING: Menu selections in Rcmdr
Exercises
Note: In the textbook, the problems in chapter 2 are numbered 2.1, 2.2, etc. but, in the following lists, I only show the number after the period because the number before the period is always the same in each chapter.
Chapter 2, pp 48--57: 1, 2, 4, 5, 8, 14, 16 (data set in Data directory on web), 18, 29, 36, 37 (use arithmetic and use R), 39, 48, 54, 56, 70 (in R use pnorm: pnorm(.5) is the proportion below a z-score of .5, pnorm(-1.5) is the proportion below -1.5, etc.), 72, 75, 80, 84, 86, 96.
Chapter 3, pp 82--87: 4, 8, 9, 18, 24, 25 (here the authors do not seem to make the crucial distinction between a confounding factor and a mediating factor. Whether the answer is a 'confounding factor' or a 'mediating factor' depends on the relationship between the variable in question and the explanatory variable. If the variable causes the explanatory variable, you have a confounding factor. If the variable is caused by the explanatory variable, you have a mediating factor. The former provides a counter-argument to causality; the latter provides a plausible explanation for causality), 27, 37, 49, 52, 60.
Chapter 4 (omit 4.5), pp 121--128: 6, 9, 17, 22, 34, 48, 80, 82.
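The `pnorm` hint for exercise 70 of Chapter 2 can be extended to proportions above a z-score or between two z-scores. This is a small sketch; the mean of 100 and SD of 15 in the last lines are made-up numbers, not taken from the textbook.

```r
below_half <- pnorm(0.5)                 # proportion below z = 0.5
below_neg  <- pnorm(-1.5)                # proportion below z = -1.5
between    <- pnorm(0.5) - pnorm(-1.5)   # proportion between the two z-scores
above_half <- 1 - pnorm(0.5)             # proportion above z = 0.5

print(round(c(below_half, below_neg, between, above_half), 4))

# For a value on the original scale, standardize first.
# The mean and SD below are hypothetical.
x <- 115; m <- 100; s <- 15
print(pnorm((x - m) / s))                # same as pnorm(x, mean = m, sd = s)
```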
Things to do
Play with R and Rcmdr
Links
- Data sets needed for exercises can be found in the Data directory in the repository of files for the course (http://www.math.yorku.ca/people/georges/Files/MATH1532).
Readings for next week
Finish Chapter 4, Chapters 5 and 6 (I read through them and there is nothing we can really drop without losing an important point)
Weeks 4, 5, 6
Consolidation and review of previous material, term test. See slides at http://www.math.yorku.ca/people/georges/Files/NATS1500 for material.
Week 7, February 24
Material covered
Chapter 5 and the beginning of Chapter 6
Regression with two variables with Rcmdr
Purpose | Rcmdr menu | Notes |
---|---|---|
Explore num. vars. + possibly 1 cat. var. | Graphs > Scatterplot matrix; Graphs > 3D Graphs > 3D scatterplot; Statistics > Summaries > Numerical summaries | You can also include one categorical variable by selecting "Summarize by groups" or "Plot by groups" |
Scatterplot | Graphs > Scatterplot | |
Correlation | Statistics > Summaries > Correlation matrix | Use correlation test for p-values |
Fitting the least-squares line, i.e. the estimated linear regression equation | Statistics > Fit models > Linear regression; Models > Summarize models / Confidence intervals / Add observation statistics to data / etc. | After adding observation statistics to the data you can plot residuals in various ways to check whether there are patterns remaining in the residuals |
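The same steps can be run from the command line. The sketch below is an approximation of what the menus do, not the code Rcmdr generates; it uses the built-in `cars` data (speed and stopping distance) as a stand-in for your own data, and the column names `fit_value` and `residual` are chosen here only for illustration.

```r
# 'cars' is a data set built into R; it stands in for your own data.
plot(dist ~ speed, data = cars)                 # scatterplot

r <- cor(cars$speed, cars$dist)                 # correlation
print(r)
print(cor.test(cars$speed, cars$dist))          # correlation test with p-value

fit <- lm(dist ~ speed, data = cars)            # least-squares line
abline(fit)                                     # add the line to the scatterplot
print(summary(fit))
print(confint(fit))                             # confidence intervals

# Observation statistics: fitted values and residuals added to the data
cars2 <- transform(cars, fit_value = fitted(fit), residual = resid(fit))
plot(residual ~ fit_value, data = cars2)        # look for remaining patterns
abline(h = 0, lty = 2)
```

A residual plot with no visible curve or funnel shape suggests that a straight line is a reasonable description of the relationship.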
Purpose | Rcmdr menu | Notes |
---|---|---|
Graph | need command line | library(lattice);with(Dataset, barchart( table( Xcat, Ycat), stack = F, auto.key=T)) |
Statistics | Statistics > Contingency tables > Two-way tables | Choose X as Row variable and Y as Column variable; request multiple tables, selecting No percentages and Column percentages |
Assignment 2
Assignment 2 will be done in the same groups as Assignment 1 except that groups that have become too small may be combined with others. Assignment 2 consists of the accumulated problems from week to week that are assigned over the next three weeks. The assignment is due on March 25.
Each current group should send me (mailto:georges+nats1500@yorku.ca) one email message giving me the name of the group and the names of its members. I'll address issues concerning reconstitution of groups on Sunday, Feb. 28.
Project (Individual)
The general idea is to perform an analysis of some data that you find interesting using the statistical tools and critical insights that you have developed in the course. To help you find a topic and data you can have a look at Statistics: Pedagogical resources on this wiki.
- Identify a topic you find interesting about which you have a question that could be resolved with appropriate data and analysis.
- Find a number of sources (3 or more-- except in very special cases where 3 or more sources would not exist) that provide information relevant to your question. At least one source should have relevant data.
- Perform some analyses of the data including summaries of the distribution of relevant variables and relevant graphs.
- Based on a critical assessment of your sources and your analysis, discuss the implications for your question.
- Discuss clearly the strengths and limitations of your analysis and existing information in addressing your question.
Some guidelines for your report:
- Aim for a length of 8 to 12 pages of analyses and discussion plus at least 2 pages of relevant graphs.
- Show the results of at least one and preferably two analyses using a single data set -- unless you are very ambitious and want to use more.
Grading:
- Clear expression of specific question and relevant field: 10%
- Choice of sources and clear references: 10%
- Clarity and quality of argument: 20%
- Relevance and quality of analysis: 20%
- Relevance and quality of graphs: 20%
- Clear formal academic style of writing: 5%
- Effort: 5%
- Structure: 5%
- Overall appearance: 5%
Exercises and Assignment
Notes:
- Numbers in red need to be done for Assignment 2.
- The numbers shown in the text all have the form '5.x' where 'x' is the number of the question within chapter 5. In the following lists I only show 'x'.
Chapter 5, pp. 161--168:
- Looking for Patterns with Scatterplots:
- 1, 2, 3, 7
- Describing Linear Pattern with a Regression Line:
- 11, 14
- Measuring Strength and Direction with Correlation:
- 24 (important -- likely to be on exam), 27 (also a good candidate for the exam)
- Why the Answers May Not Make Sense & Correlation Does Not Prove Causation:
- 36 (refers to 7), 39, 40
- Chapter Exercises:
- 46, 48, 49, 59, 60, 61, 62.
Chapter 6, pp. 193--201:
- Displaying Relationships Between Categorical Variables:
- 3, 4, 6, 7
- Risk, Relative Risk, Odds Ratio and Increased Risk & Misleading Statistics About Risk:
- 10 (nice exam question), 11-14 (ditto), 20 (refers to 6), 22
- The Effect of a Third Variable and Simpson's Paradox:
- 27, 29, 31
- Assessing the Statistical Significance of a 2 x 2 Table:
- 33, 34, 43
- Chapter Exercises:
- 56, 57, 58, 62.
Readings for next week
Reread Chapter 6, read Chapter 7.
Mid-Term Test
Five-number summary:
mean      sd        0%  25%  50%  75%  100%  n
71.36923  14.27582  37  61   73   81   99    65
Shapiro-Wilk normality test:
data: mm$MT
W = 0.9775, p-value = 0.2826
Week 8, March 2, 4
Material covered
Chapters 6 and 7
Meaning of probability
- Relative frequency: what proportion of the time do you expect something to happen?
- How do you find a relative frequency?
- Mathematically, making assumptions about the physical world: e.g.
- The probability of getting a 2 when rolling a 6-sided die is 1/6. This requires the assumption that all sides of the die are equally likely.
- The probability of getting 9 or 10 heads when tossing a coin 10 times is 11/1024, or about 0.0107 (the probability of getting exactly 10 heads is 1/1024, or about 0.0009766). This assumes that heads and tails are equally likely and that the tosses are independent.
- Empirically, by observing a random process many times. e.g.
- The probability that a newborn is a boy is approximately 0.51. We obtain this by studying a large number of birth records and calculating the proportion of boys.
- Note that the relative frequency interpretation of probability requires the existence of a stable random process that can be repeated independently -- at least conceptually -- many times to produce the relative frequency of a particular event. For example, the random process is tossing a fair coin and the event is 'tossing a head'.
- A particular probability (relative frequency) may not be known, but we need to be able to conceive of observing the random process repeatedly.
- Personal or subjective probability: e.g. What is the probability that I will get an A in this course?
- Subjective probability provides a way of attaching weights to uncertain events or to uncertain beliefs about events, statements, historical propositions or theories. An unusual but interesting example is the attempt to attach a probability to the possibility that a particular tomb (http://en.wikipedia.org/w/index.php?title=The_Lost_Tomb_of_Jesus&oldid=342935090) discovered in East Talpiot, a neighbourhood in Jerusalem, is the Lost Tomb of Jesus.
- A test for subjective probabilities is whether they are coherent: Do they obey the basic laws of probability? For example, your subjective probability of not getting an A should be equal to 1 minus your subjective probability of getting an A -- otherwise you are less likely to get an A!
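The two ways of finding a relative frequency described above can be compared directly. The following is a minimal sketch in Python (the course itself uses R and Rcmdr). It assumes a fair coin and independent tosses. The function names, the seed and the number of simulated repetitions are illustrative choices, not part of the course material.

```python
import random
from math import comb


def binom_prob(k, n, p=0.5):
    """P(exactly k heads in n independent tosses, each with P(head) = p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Mathematical approach: exact values under the fair-coin assumption.
p_ten = binom_prob(10, 10)                    # 1/1024
p_nine_or_ten = binom_prob(9, 10) + p_ten     # 11/1024


def simulate_nine_or_ten(trials, seed=1532):
    """Empirical approach: repeat the 10-toss experiment many times and
    return the proportion of repetitions with 9 or 10 heads."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for reproducibility
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(10))
        if heads >= 9:
            hits += 1
    return hits / trials


estimate = simulate_nine_or_ten(100_000)

print(f"exact P(10 heads)       = {p_ten:.7f}")
print(f"exact P(9 or 10 heads)  = {p_nine_or_ten:.7f}")
print(f"simulated P(9 or 10)    = {estimate:.7f}")
```

With many repetitions the simulated proportion should settle near the exact value, which is the sense in which a probability is a long-run relative frequency.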
Exercises
Notes:
- Numbers in red need to be done for Assignment 2.
- The numbers shown in the text all have the form '7.x' where 'x' is the number of the question within chapter 7. In the following lists I only show 'x'.
Chapter 7, pp. 240--247:
- 1. Random Circumstances & 2. Interpretations of Probability
- 2, 5, 7, 16
- 3. Probability Definitions and Relationships
- 18, 19, 20
- 4. Basic Rules for Finding Probabilities
- 34, 35, 36, 42
- 5. Strategies for Finding Complicated Probabilities
- 44, 45, 46, 47, 50, 54
- 6. Using Simulation to Estimate Probabilities
- none
- 7. Coincidences and Intuitive Judgments about Probability
- 64, 68, 72, 76 (similar question likely to be on test)
- Chapter Exercises
- 82, 83, 84, 85, 91 to 98 (sequence of exercises on same problem).
Readings for next week
Reread Chapter 7 and read Chapters 10 and 11 (skip Chapters 8 and 9).
Week 9, March 9, 11
Material covered
- Chapters 7, 10
- Peter Donnelly and the meaning of probability: How juries are fooled by statistics (http://www.youtube.com/watch?v=kLmzxmRcUTo)
- Multiple vs simple regression: why the 'effect' of coffee controlling for 'stress' can be quite different from the 'effect' of coffee not controlling for stress: IntroStats: 3d coffee example.R.
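The contrast between simple and multiple regression can be illustrated with simulated data. This is a minimal sketch in Python and is not the content of the R script referenced above. The outcome variable, the coefficients (-0.5 for coffee, 2 for stress), the sample size and the seed are all hypothetical values chosen so that the two slopes have opposite signs.

```python
import random


def cross(xs, ys):
    """Sum of centred cross-products of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


def simple_slope(y, x):
    """Least-squares slope of y on x alone."""
    return cross(x, y) / cross(x, x)


def multiple_slopes(y, x1, x2):
    """Least-squares slopes of y on x1 and x2 together (with intercept),
    obtained by solving the 2 x 2 normal equations."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12**2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


# Hypothetical data: stressed people drink more coffee, stress raises the
# outcome, and coffee itself lowers it slightly.
rng = random.Random(2010)  # arbitrary seed for reproducibility
n = 5000
stress = [rng.gauss(0, 1) for _ in range(n)]
coffee = [s + rng.gauss(0, 1) for s in stress]
outcome = [-0.5 * c + 2 * s + rng.gauss(0, 1) for c, s in zip(coffee, stress)]

slope_ignoring_stress = simple_slope(outcome, coffee)
slope_coffee, slope_stress = multiple_slopes(outcome, coffee, stress)

print(f"coffee slope, not controlling for stress: {slope_ignoring_stress:.2f}")
print(f"coffee slope, controlling for stress:     {slope_coffee:.2f}")
```

In this setup the simple-regression slope for coffee is positive because coffee acts partly as a stand-in for stress, while the slope controlling for stress recovers the negative coefficient used to generate the data.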
Week 10, March 16, 18
- Pap test sensitivity and specificity (http://en.wikipedia.org/w/index.php?title=Pap_test&oldid=349393357#Technical_aspects)
- Wikipedia: Sensitivity and Specificity (http://en.wikipedia.org/w/index.php?title=Sensitivity_and_specificity&oldid=347278963)
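Sensitivity and specificity describe how a test behaves when the true condition is known. The probability that a person with a positive result actually has the condition also depends on how common the condition is. The sketch below, in Python, applies Bayes' rule. The sensitivity, specificity and prevalence values are hypothetical, chosen only for illustration, and are not figures for the Pap test.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(condition | positive test) by Bayes' rule.

    sensitivity = P(positive | condition)
    specificity = P(negative | no condition)
    prevalence  = P(condition)
    """
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Hypothetical values for illustration only.
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.95,
                                prevalence=0.01)
print(f"P(condition | positive test) = {ppv:.3f}")
```

With a rare condition most positive results are false positives even when the test is quite accurate, which is why a positive screening result is usually followed by further testing.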
Files
Files for MATH 1532 (http://www.math.yorku.ca/people/georges/Files/MATH1532)