
MATH 4939 is a 'capstone course': a course in which you draw on your accumulated knowledge, and particularly your ability to learn new things on your own, to carry out statistical analyses and write reports in a setting that reflects how statistics is used in the real world.

What does this involve?

  • real data and real questions, not textbook examples
    • messy: a great deal of the work is in understanding the data, cleaning it up, creating new variables for analysis and modelling
  • collaborative work: almost everything worth doing in the real world is too big or too complex to be done by one person. You work in teams that are selected by your employer. Collaboration involves using appropriate software tools for collaboration. In our case we'll use git and github.
  • presenting and interpreting results:
    • producing meaningful results and interpreting them: not just the mechanical output of statistical analysis but interpretation of the meaning of results in the context of the study.
  • proficiency with software: SAS and R: not just using routines off the shelf but creating your own solutions and presentations
    • producing effective graphics that tell the story you want to tell in a way that is understood by the audience you want to reach
  • learning how to learn independently: most courses have a well defined syllabus. The exercises relate neatly to each component of the course. Real work and real research are entirely different. You need to address ill-posed questions with methods that you have to choose, and often develop, yourself. The background you acquired in former courses is a springboard for deeper discovery.

You won't develop 'expertise' in any of these areas in 4 months. That takes a lifetime. But I hope that you'll get a good start on your way to independent growth and discovery.

Please complete this survey before January 4, 2016.

Table of contents


  • Right away
    • Forming groups
    • Installing everything and starting on an easy project that includes specific questions requiring e.g. Wald tests
    • Wald tests for regression
    • Start with R
    • Start blogging
    • Start working on R and SAS
    • Odd jobs
    • 20 Questions

Planning Issues

  • Goal: Use statistical knowledge and programming skills to work on a project that resembles work in a real setting:
    • Knowing how to do analyses + knowing why: what does the analysis mean? How can the results be interpreted?
    • Working on a team of people with diverse backgrounds who bring different skills to the task.
    • Working with real data.
    • Using software for collaboration.
    • Preparing analyses for collaborators and statisticians
    • Preparing reports for clients and employers
    • Showing results with graphs
    • Presenting results in a limited amount of time
    • Connect with the wider world of statistics:
      • Be aware of controversies and major issues in statistics and science:
        • reproducibility
        • what's the relevance of the Bayesian/Frequentist conflict?
        • why is causality important yet so widely ignored?
        • How can I keep up and join the fray?
  • Why?
    • Gain experience you can discuss in job interviews.
  • Evaluation: Exams, projects (evaluation of contributions), odd jobs, blog contributions
  • R and SAS exercises? Coordinated with textbook
  • Statistical concepts? From technique and theory to meaning
  • Collaboration: Tools + principles
  • Setting up blogs
  • Odd jobs??
  • Real statistics:
    • Meaning
    • Presentation
    • Writing

Topics planning

Philosophy of approach to course

  • create experiences that provide narratives -- stories to tell at interviews
  • create experiences in collaborating in situations that are authentic: have something to say when you're asked: "what was your worst experience working with other people and how did you handle it?"
  • Learn 'how to' do stats but especially why and what does it mean?

To Do

  • Find and assign experts
  • Check on textbook
  • Check re access to SAS
  • Check re my copy -- I think I have electronic version

To do first

  • Get R, RStudio and Github going
  • Get blogging on wiki going -- or should I use github

Activity planning

  • data.table
  • github
  • wiki (github or practicum)
  • moodle? (can we focus on github?)


  • The multifaceted statistical profession


git init # local init in .
git add file
git add .
git commit
git status
git log
git log --oneline --graph --decorate --all
# others
git revert
git mv
git rm
git branch <new-name>
git checkout <branch-name>
git merge <branch-name>  # into current branch
git checkout -b <new-branch-name>  # combine git branch and git checkout

git xxx --help

# periodic garbage collection:
git gc

# what to do about big files?
find . -size +5000000c 2>/dev/null -exec ls -l {} \;
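The commands above can be strung together into a small end-to-end session. This is only a sketch under stated assumptions: the scratch directory, the file name analysis.R, the branch name graphs and the user identity are all made up for illustration, and the default branch may be called master or main depending on how git is configured.

```shell
#!/bin/sh
# Minimal local git session in a throwaway directory (all names are illustrative).
set -e
demo=$(mktemp -d)
cd "$demo"

git init -q .                                   # local init in the current directory
git config user.name  "Demo Student"            # placeholder identity, stored in this repo only
git config user.email "student@example.com"
main_branch=$(git symbolic-ref --short HEAD)    # 'master' or 'main', depending on git's settings

echo "x <- 1:10" > analysis.R
git add analysis.R
git commit -q -m "Add first analysis script"

git checkout -q -b graphs                       # git branch + git checkout in one step
echo "plot(x)" >> analysis.R
git commit -q -am "Add a plot"

git checkout -q "$main_branch"
git merge -q graphs                             # merge 'graphs' into the current branch

git --no-pager log --oneline --graph --decorate --all
n_commits=$(git rev-list --count HEAD)
echo "commits on $main_branch: $n_commits"
```

Working in a temporary directory keeps the experiment away from any real project, so it is safe to repeat while learning the commands.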


  •  ?


  • develop strategy for access and syllabus


Week 1


  • Bring laptops to next class
  • We will meet in N604 so you can use computers to log in to blackwell

To do before Tuesday, Jan 5, 2016 at 11:59pm

Issues in statistical practice

  • Survey on computing in MATH courses
  • Twenty questions
    • Understanding the meaning of data analyses: looking outside the analysis
    • The multifaceted role of statistics in exploring the world
    • The role of complexity and uncertainty
      • Learning to learn: most of the tools and methods you will use in 10 years don't exist today
      • Learning to collaborate: most tasks are much too complex to be done alone
  • Getting your workstation/laptop ready
    • Install R and RStudio
    • Install git (might come with RStudio)
    • Create a github account
    • Create practicum account (will use this as wiki because of storage limitations on github)
    • Alternatives to personal laptop: access to blackwell via lab.
    • Access to SAS ???
  • Course organization
    • Team Project
      • Topics and expectations
      • Assignment of teams: random -- do publicly (recourse)
  • Questions to ponder
  • Setting up R, RStudio, Git and GitHub

Week 2

One principle is to progress on different topics in parallel:

  • Learning R
    • Markdown
  • Applied statistical concepts
  • Progress on projects
    • Collaborating in github
  • Learning SAS

Project 1

Purpose: do something fun to get to know software, etc.

  • Have a look at gapminder
  • Find an interesting issue to discuss
  • download and analyze data
  • do some interesting graphs
  • a discussion of what your analysis might mean -- and what it might not mean.
  • prepare a 5-minute presentation + 5 minutes for questions, discussion on Friday, January ??

Week 3

Week 4

Week 5

Week 6

Week 7

Week 8

Week 9

Week 10

Week 11

Week 12

Possible activities

  • Select online tutorials for R and for SAS
  • Subscribe to and post comments on R Bloggers
  • Maybe each student chooses a blog and posts regularly on it, e.g. Gelman (2), Wasserman, Lumley, ASA, ...
  • Assignment that requires answers to specific questions

Using Github to collaborate on a project

  • Using Github in the classroom
  • Find or create a tutorial on Github
    • need to address the 'what to' as well as the 'how to'
    • focus on workflow adapted to 1. package development and 2. collaboration
    • address technical issues (how to do X) and workflow/collaboration issues: using issues, milestones and task assignment.
      • how to setup a project for relative symmetric collaboration
      • how to use issues for initiation, discussion and resolution of problems
  • Install R and RStudio
  • Install devtools
  • Install git and Create an account on github
    Great description:
    Some of this might have been done with Rstudio but following the above should work
  • Set up SSH keys on your computer and github:
    I recommend using SSH rather than HTTPS, even though Github recommends HTTPS. SSH is harder to set up initially (but help is available) and much more convenient to use as you go along. This is important because everything will go much more smoothly if you pull and push to Github frequently. Otherwise the frequency of conflicting contributions will be much higher.
    1. Create a key from RStudio: Tools|Global Options|Git/SVN|(Create RSA Key..) then click on (Create). I suggest not using a passphrase if you are the only person with access to your computer. If you use one, keep it easy to type because you will have to type it frequently.
    2. Click on (View public key) and use Ctrl-C to copy the key
    3. Log in to your account in Github and click on your photo at the top right then click on "Your Profile" and then on "Edit Profile" at the top right.
    4. Click on "SSH keys" in the menu on the left.
    5. Click on "Add SSH key" and paste (Ctrl-V) the public key. Add a description, e.g. your computer's nickname in case you end up using different computers.
    Adding an SSH key
  • Use RStudio to create a small package with a repository (see Hadley Wickham, R Packages)
  • Link the local repository to github: git remote add origin
  • Commit changes and push to github: git push -u origin master
  • Development workflow
    • Install package on various machines:
  • Collaborator
    • Create project in Rstudio
      Pull project: git remote add origin
      In shell:
      git fetch --all # downloads from github but doesn't change local files
      git reset --hard origin/master # or relevant branch -- just the first time since Rstudio might have left some files
      Pull project
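The owner and collaborator steps above can be rehearsed without touching Github. In the sketch below a local bare repository (hub.git, a made-up name) stands in for the Github repository, so the remote is a file path rather than a Github URL. The directory names, identities and the DESCRIPTION contents are likewise assumptions made for illustration.

```shell
#!/bin/sh
# Rehearsal of the owner/collaborator workflow with a local stand-in for Github.
set -e
work=$(mktemp -d)
git init -q --bare "$work/hub.git"        # stand-in for the Github repository (assumption)

# Owner: create a project, link it to the remote, commit and push.
git init -q "$work/owner"
cd "$work/owner"
git config user.name  "Owner"             # placeholder identity
git config user.email "owner@example.com"
branch=$(git symbolic-ref --short HEAD)   # 'master' or 'main', depending on git's settings
echo "Package: demo" > DESCRIPTION
git add DESCRIPTION
git commit -q -m "Initial commit"
git remote add origin "$work/hub.git"     # on Github this would be the repository's SSH URL
git push -q -u origin "$branch"

# Collaborator: the project directory already exists and contains a leftover file.
mkdir "$work/collab"
cd "$work/collab"
git init -q .
echo "stray" > DESCRIPTION                # hypothetical file left behind by project creation
git remote add origin "$work/hub.git"
git fetch -q --all                        # downloads from the remote, local files unchanged
git reset -q --hard "origin/$branch"      # first time only: make local files match the remote
cat DESCRIPTION
```

Note that git reset --hard overwrites local files that are in the way, which is why it is only appropriate the first time, before the collaborator has done any work of their own.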

Approaches to R


Hierarchical data manipulation

  • Consider 'data.table' to allow quick manipulation of large data frames
  • modify 'up' unless I can find something else
  • 'fread' for reading, 'foreign' for SAS, SPSS, 'readxl' for Excel.
  • 'fasttime' for times and dates but only for 'true POSIXct' dates


  • Testing statistical methods with fake (= true) data.

Test questions

  • xo <- x[order(nchar(x))] # order by size


Approaches to SAS

  • Programming
  • Major PROCs
  • Graphics


  • Intensive regression review
  • Multilevel
  • Overview of other methods
  • Emphasize role and importance of understanding data
  • Starting GLMM (



Image:Data science Venn diagram.PNG

Course components

  • 5 or 10 marks for having installed R and R Studio and writing a small package with a vignette in R markdown including some Latex. Grade given when student comes during office hours to demonstrate that they have done this.

Current issues in statistics


Why α = 0.05 doesn't mean what you would like it to mean:

  • Regina Nuzzo (2015) How scientists fool themselves – and how they can stop: Humans are remarkably good at self-deception. But growing concern about reproducibility is driving many researchers to seek ways to fight their own worst instincts. Nature, 07 October 2015.
  • Regina Nuzzo (2014) Scientific method: Statistical errors: P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Nature 12 February 2014.
  • Jeffrey T. Leek, Roger D. Peng (2015) Statistics: P values are just the tip of the iceberg. Nature 28 April 2015.
    Why banning NHST doesn't solve the problem.
  • Pre-registered reports:
    An even more radical extension of this idea is the introduction of registered reports: publications in which scientists present their research plans for peer review before they even do the experiment. If the plan is approved, the researchers get an 'in-principle' guarantee of publication, no matter how strong or weak the results turn out to be. This should reduce the unconscious temptation to warp the data analysis, says Pashler. At the same time, he adds, it should keep peer reviewers from discounting a study's results or complaining after results are known. “People are evaluating methods without knowing whether they're going to find the results congenial or not,” he says. “It should create a much higher level of honesty among referees.” More than 20 journals are offering or plan to offer some format of registered reports.
    from Regina Nuzzo (2015) How scientists fool themselves – and how they can stop: Humans are remarkably good at self-deception. But growing concern about reproducibility is driving many researchers to seek ways to fight their own worst instincts. Nature, 07 October 2015.

p-hacking, HARKing, JARKing

  • HARKing, or Hypothesizing After the Results are Known
  • JARKing, or Justifying After the Results are Known

Prosecutor's Fallacy

When should you look at a problem from a Bayesian perspective?

  • Peter Green's letter on Sally Clark
    Argues that it's not the p-value but the likelihood ratio we should consider:
    The jury needs to weigh up two competing explanations for the babies' deaths: SIDS or murder. The fact that two deaths by SIDS is quite unlikely is, taken alone, of little value. Two deaths by murder may well be even more unlikely. What matters is the relative likelihood of the deaths under each explanation, not just how unlikely they are under one explanation.
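One way to see why the likelihood ratio is the relevant quantity is the odds form of Bayes' theorem. This is the standard textbook identity, not a formula quoted from the letter. The symbol E (the evidence: two infant deaths in one family) is notation introduced here for illustration.

```latex
\[
\underbrace{\frac{P(\text{murder} \mid E)}{P(\text{SIDS} \mid E)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(E \mid \text{murder})}{P(E \mid \text{SIDS})}}_{\text{likelihood ratio}}
\times
\underbrace{\frac{P(\text{murder})}{P(\text{SIDS})}}_{\text{prior odds}}
\]
```

A very small P(E | SIDS) on its own says nothing about the left-hand side: if P(E | murder) is just as small or smaller, the evidence does not favour murder over SIDS.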


Other courses at York

  • MATH 3330

Online tutorial


Odd jobs

  • write a tutorial for (multi)hierarchical data manipulation:
    • Using Spida: capply, up, merge, cLag, long, wide (need to write), gicc
    • Using SAS
    • Using Hadley Wickham's tools
    • Using data.table (maybe the fastest and easiest?)
    It should show how to perform a number of common operations:
    • creating a summary variable by id, including aggregation and selection using another variable
    • creating a summary data frame
    • wide to long and long to wide


Git and GitHub

  • GitHub for Beginners
  • Git Users Manual
  • Using git with Word
  • Synchronizing with origin
    git push --all -u
    git pull --all
  • How to set up a local repo:
    • Ideal: clone a github repo. If none, create a github repo with just a file and then clone.
      • HTTPS or SSH: Github recommends the former but the latter is much better. It's harder to set up, but once it's set up the daily workflow is much easier: pushing and pulling is just a question of clicking an icon, with no need to fill in a username and password. Setting up SSH is not too difficult via RStudio.
  • How to create a new branch and push it upstream
    • git branch new-branch # creates the branch
    • git checkout new-branch # makes it the current branch
      • can combine both into one command: git checkout -b new-branch
    • when ready to commit and push
      • git commit -m 'message'
      • git push -u origin new-branch # this sets up the association for future pushing and pulling. In future, you just need 'git push' when you're in the branch.
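The branch steps above can be tried safely against a local stand-in for Github. In this sketch hub.git is a bare repository playing the role of the Github repo. The paths, file names and identity are assumptions made for illustration, and cloning the empty stand-in prints a harmless warning.

```shell
#!/bin/sh
# Create a branch, commit on it, and push it upstream (local stand-in for Github).
set -e
work=$(mktemp -d)
git init -q --bare "$work/hub.git"              # stand-in for the Github repository (assumption)
git clone -q "$work/hub.git" "$work/local"      # 'Ideal: clone a github repo'
cd "$work/local"
git config user.name  "Demo Student"            # placeholder identity
git config user.email "student@example.com"

echo "# demo" > README.md
git add README.md
git commit -q -m "Add README"
git push -q -u origin HEAD                      # push the current (default) branch

git checkout -q -b new-branch                   # create the branch and make it current
echo "notes" > notes.txt
git add notes.txt
git commit -q -m 'message'
git push -q -u origin new-branch                # sets up the association for later push/pull

upstream=$(git rev-parse --abbrev-ref 'new-branch@{upstream}')
echo "new-branch tracks: $upstream"
```

Once the upstream is recorded, a plain git push or git pull from inside new-branch goes to the matching branch on the remote.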

Markup languages

  • wiki, markdown, R markdown


Odd jobs

Short interesting, clear and practical tutorials posted on wiki:

  • For someone with a background in relational databases: describe how to implement relational database operations in R and SAS.
  • Some aspects of Scraping
  • How to do something in SAS or R that is equivalent to what you do in the other language
  • Tutorial on how to do something in SAS that we did in R
  • Relational database operations in 'data.table'
  • Hierarchical data manipulations with spida/dplyr/data.table.
    • Comparison of features: e.g. I suspect that 'data.table' is most scalable


  • Focus first week on specific R skills to do Arrests analysis
    • Cover factors and linear estimates very early in the course
    • Why R, then SAS: because the envelope gets pushed in R. New methods are available in R as they are published. In SAS, they appear 5 to 15 years later, but possibly much better engineered.
    • Revisit concepts periodically
  • Multiple threads:
    • programming in R, and SAS
    • specific data analytic skills
    • connecting with statistical issues in the world
    • meaning and interpretation of statistical results
    • working on and presenting project
    • NOT a traditional assignment/exam course


  • Hierarchical manipulation: capply, merge, up, cLag
  • Constructing linear hypotheses: wald, L, Lfx, effects package tools
  • Simulation