VR4: Chapter 1 summary

The following commands have been adapted from Venables and Ripley (2002), Modern Applied Statistics with S (4th ed.), and serve as a first introduction to the capabilities of R.

After installing R, you can work through the following commands to get an overview of R's capabilities. In an R command, any text following a number sign (#) is a comment. The '>' is the standard prompt. To work through the following examples, type the commands (the text between the '>' prompt and the comment symbol '#') in the R console window. Try variations and experiment to see what happens. You can also cut and paste, but you won't learn as much as you would by typing!
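For example, typing the line below prints the sum; everything after the '#' is ignored by R:

> 1 + 2   # this part of the line is a comment
[1] 3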

Starting

You can open a general help page:

> help.start()

Arithmetic

R evaluates ordinary mathematical expressions and prints the results.

> 2 + 3
[1] 5
> sqrt( 3/4)/( 1/3 - 2/pi^2)
[1] 6.6265

Experiment with the command above to find out whether, in the expression

> 2/pi^2

exponentiation (^2) is evaluated before division (/) or after.
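One quick way to check (a small experiment, not from the book) is to add explicit parentheses and compare:

> 2/pi^2      # [1] 0.2026424
> (2/pi)^2    # [1] 0.4052847
> 2/(pi^2)    # [1] 0.2026424 -- same as the first, so ^ is applied before /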

Loading a library

The following command loads the MASS (Modern Applied Statistics with S) library that contains functions and data sets that accompany the text.

> library(MASS)

Some libraries must first be installed from the internet (you only need to do this once):

> install.packages("car")
> install.packages("rgl")

and then they can be loaded when needed:

> library(car)
> library(rgl)
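To see which libraries are installed, or which are currently loaded, these standard commands help (a small aside, not in the original):

> library()   # list installed packages
> search()    # list what is currently attached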

Read the help page about how to use help:

> ?help
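You can also ask for help on a specific function, or search the help system by keyword (standard base R calls, not part of the original sequence):

> ?mean                      # help page for the mean function
> help.search("histogram")   # search help pages by topic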

Generating and looking at random numbers

Generate 1,000 pairs of standard normal variates:

> x <- rnorm(1000)
> y <- rnorm(1000)
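As a quick sanity check (a small addition, not from the book), the sample mean and standard deviation should be close to 0 and 1:

> mean(x)   # close to 0
> sd(x)     # close to 1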

Histogram of a mixture of normal distributions. Experiment with the number of bins (25) and the shift (3) of the second component.

> truehist( c(x, y + 3), nbins = 25)

Note:

  • c(x, y+3) combines the vector 'x' and the vector 'y+3' into a single vector (see the small example below).
  • nbins: number of bars in the histogram
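A tiny illustration of how c() combines vectors (a small aside, not from the book):

> c( c(1, 2), c(3, 4) )
[1] 1 2 3 4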


Read about optional arguments:

> ?truehist

2D density plot:

> contour( dd <- kde2d(x, y) )   # compute a 2D kernel density estimate and plot its contours

Greyscale or pseudo-color plot:

> image(dd)
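For a third view of the same density estimate, you could try a perspective plot (persp is a standard graphics function; the viewing angles below are just a suggestion):

> persp(dd$x, dd$y, dd$z, phi = 30, theta = 20)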

Simple regression

> x <- seq( 1, 20, 0.5)   # sequence from 1 to 20 by 0.5
> x                       # print it
> w <- 1 + x/2            # a weight vector
> y <- x + w*rnorm(x)     # generate y with standard deviations
                          #    equal to w
 
> dum <- data.frame( x, y, w )     # make a data frame of 3 columns
> dum                     # typing its name shows the object
> rm( x, y, w )           # remove the original variables
> fm <- lm ( y ~ x, data = dum)   # fit a simple linear regression (between x and y)
> summary( fm )                   # and look at the analysis
> fm1 <- lm( y ~ x, data = dum, weights = 1/w^2 )  # a weighted regression
> summary(fm1)
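To see how much the weights change the fit (a quick check, not in the original), compare the estimated coefficients directly:

> coef(fm)    # unweighted estimates
> coef(fm1)   # weighted estimates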
 
> lrf <- loess( y ~ x, dum)    # a smooth regression curve using a
                               #   modern local regression function
> attach( dum )                # make the columns in the data frame visible
                               #   as variables (Note: before this command, typing x and
                               #   y would yield errors)
> plot( x, y)                  # a standard scatterplot to which we will
                               #   fit regression curves 
> lines( spline( x, fitted (lrf)), col = 2)  # add in the local regression using spline
                                             #  interpolation between calculated
                                             #  points
> abline(0, 1, lty = 3, col = 3)    # add in the true regression line (lty=3: read line type=dotted;
                                    #   col=3: read colour=green; type "?par" for details)
> abline(fm, col = 4)               # add in the unweighted regression line
> abline(fm1, lty = 4, col = 5)     # add in the weighted regression line
> plot( fitted(fm), resid(fm),
       xlab = "Fitted Values",
       ylab = "Residuals")          # a standard regression diagnostic plot
                                    # to check for heteroscedasticity.
                                    # The data were generated from a
                                    # heteroscedastic process. Can you see
                                    # it from this plot?
> qqnorm( resid(fm) )               # Normal scores to check for skewness,
                                    #   kurtosis and outliers.  How would you
                                    #   expect heteroscedasticity to show up?
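A reference line makes the normal scores plot easier to read; qqline (base R, not part of the original sequence) draws a line through the quartiles:

> qqline( resid(fm) )   # line through the first and third quartiles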
> search()                          # Have a look at the search path
> detach()                          # Remove the data frame from the search path
> search()                          # Have another look. How did it change?
> rm(fm, fm1, lrf, dum)             # Clean up (this is not that important in R 
                                    #    for reasons we will understand later)

Scottish hill races

> data(hills)              # In R, data sets in packages are loaded with the 'data' command.
> hills                    # List the data
> pairs(hills)             # A pairwise scatterplot

> attach(hills)            # Make the columns available by name
> plot(dist, time)         # Plot the points
> identify( dist, time, rownames(hills) )
                           # Use the left mouse button to identify outlying points
                           # and the right mouse button to quit. On a Macintosh
                           # click outside the plot to quit.
> abline( lm( time ~ dist ))  # Show the least-squares regression line
> detach()                    # Clean up
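Since time presumably depends on the climb as well as the distance, a natural follow-up experiment (not part of the original sequence) is to fit both predictors:

> summary( lm( time ~ dist + climb, data = hills) )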

Effect of outliers

We can explore the effect of outliers on a linear regression by designing our own examples interactively. Try this several times.


> plot( c(0,1), c(0,1), type = "n")  # set up an empty plot with both axes from 0 to 1
> xy <- locator(type = "p")         # create your own data set by clicking on the 
                                    # left mouse button, then with the right mouse button
                                    # to finish. With a Macintosh click outside the
                                    # plot to finish.
> abline( lm( y ~ x, xy), col = 4)  # Fit a least-squares, a robust and a resistant
                                    # regression line. Repeat to discover the effect of
                                    # outliers, both vertically and horizontally
> abline( rlm( y ~ x, xy, method = "MM") , lty = 3, col = 3)
> abline( lqs( y ~ x, xy), lty = 2, col = 2)
> rm(xy)
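To keep track of which line is which, you could add a legend while the plot is still open (not in the original; the line types and colours match the abline calls above):

> legend( "topleft", c("least squares", "MM robust", "LQS resistant"),
          lty = c(1, 3, 2), col = c(4, 3, 2) )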

Michelson

We now look at data from Michelson's 1879 experiment to measure the speed of light. There are five experiments (column Expt); each has 20 runs (column Run), and Speed is the recorded speed of light, in km/sec, less 299,000. (The currently accepted value on this scale is 734.5.)


> data(michelson)
> attach(michelson)
> search()
> plot( Expt, Speed, main = "Speed of Light Data",
         xlab = "Experiment No.")    # Expt is a factor, so this draws
                                     #   one boxplot per experiment
> fm <- aov( Speed ~ Run + Expt)                  # analyze as a randomized block design
                                                  # with runs and experiments as factors
> summary( fm )
> fm0 <- update( fm, . ~ . - Run)       # fit the sub-model omitting the nonsense factor Runs
                                        # and compare using a formal analysis of variance.
> anova(fm0, fm)
> detach()
> rm( fm, fm0)
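If you want to see the per-experiment means that the analysis compares (a small addition, not in the original):

> with( michelson, tapply(Speed, Expt, mean) )   # mean recorded speed per experiment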

3D visualization with rgl

> library(rgl)                      # load the 3D visualization library
> rgl.open()                        # open an rgl device window
> rgl.bg(col='white')               # set a white background
> x <- sort(rnorm(1000))
> y <- rnorm(1000)
> z1 <- atan2(x,y)                  # angle of each (x, y) pair
> z <- rnorm(1000) + atan2(x, y)    # the same angle plus noise
> plot3d(x, y, z1, col = rainbow(1000), size = 2)
> bbox3d()                          # draw a bounding box around the points
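To save the current rgl view to an image file, rgl provides a snapshot function (assuming the working directory is writable; the file name is just an example):

> rgl.snapshot("density.png")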

Statistical tables