R: Q and A

From MathWiki

Questions and Answers about R

Table of contents

ls

Q. When I entered ls() in R to get a list of list of objects available in R for me to work on, I got only one named "Last Warning" - how do I find out what that is (i.e., a variable, a set of variables or a function). Also, am I supposed to get more than one?? or is there something else I need to install?

A: ls() shows your "working database" or "Global Environment". It only contains variables and function you have created in this session or that were created in previous sessions and saved when you quit R. Most of the objects available to you are in other databases or packages. One way of getting an overview of what's available in packages you have installed is to use the command:

> help.start()

This will open a browser window with links under the link 'packages' that will show you datasets and functions in each package.

Also, to get information about the internal structure of an object, 'x', perhaps the best way is to use:

> str(x)

Downweighting outliers

Q. How would you downweight outlying observations in R, i.e., once you identify which observations are outliers, how would you downweight them in R??

A. This question has a potentially extensive answer which depends on the function used for analysis. A simple answer for 'lm' is to use the argument 'weights'. e.g.

x <- 1:10
y <- x + rnorm(10)
y[1] <- 10
plot( x, y )
fit1 <- lm( y ~ x )
summary(fit1)
abline(fit1)
wts <- c( .1, rep(1,9))
fit2 <- lm( y ~ x, weights = wts )
summary(fit2)
abline(fit2, col = 'red')


Omitting outliers

Q. How can I omit points entirely in a regression?

A. To omit a point entirely in a regression, you can use a number of methods. If the data frame is called dd and you want to omit the 12th and 14th observations you could use:

 fit <- lm ( y ~ x, dd, subset = -c(12,14))

To omit points based on the value of another variable w, you could, for example, do something like:

 fit <- lm ( y ~ x, dd, subset = w > 0)

You can also select a subset of the data frame with:

 fit <- lm ( y ~ x, dd[ dd$w > 0, ] )

Summary data sets

Q. I would like to create a data frame that has means of subsets of an original data frame where the subsets are formed by a factor in the data set.

A. I'll use data(Prestige) in library(car).

> library(car)
> data(Prestige)
> head(Prestige)
                    education income women prestige census type
GOV.ADMINISTRATORS      13.11  12351 11.16     68.8   1113 prof
GENERAL.MANAGERS        12.26  25879  4.02     69.1   1130 prof
ACCOUNTANTS             12.77   9271 15.70     63.4   1171 prof
PURCHASING.OFFICERS     11.42   8865  9.11     56.8   1175 prof
CHEMISTS                14.62   8403 11.68     73.5   2111 prof
PHYSICISTS              15.64  11030  5.13     77.6   2113 prof
# There are three types:
> table(Prestige$type)

 bc prof   wc 
 44   31   23 

To get means by 'type', you could use:

>   xx <- by( Prestige, Prestige$type, mean)
Warning messages:
1: argument is not numeric or logical: returning NA in: mean.default(X6, ...) 
2: argument is not numeric or logical: returning NA in: mean.default(X6, ...) 
3: argument is not numeric or logical: returning NA in: mean.default(X6, ...) 

You can ignore the warning messages that are produced when 'mean' tries to take the mean of the factor 'type'.

The 'by' function produces a list:

>  xx 
Prestige$type: bc
  education      income       women    prestige      census        type 
   8.359318 5374.136364   18.970682   35.527273 7944.659091          NA 
------------------------------------------------------------------- 
Prestige$type: prof
  education      income       women    prestige      census        type 
   14.08419 10559.45161    25.51161    67.84839  2574.35484          NA 
------------------------------------------------------------------- 
Prestige$type: wc
 education     income      women   prestige     census       type 
  11.02174 5052.30435   52.82739   42.24348 4340.69565         NA 
> 

Now, we'd like to turn this list into the rows of a matrix. We would like the elements of the list to be arguments to the function 'rbind'. We can do this with the 'do.call' function:

> mat <- do.call( 'rbind', xx)
> mat
     education    income    women prestige   census type
bc    8.359318  5374.136 18.97068 35.52727 7944.659   NA
prof 14.084194 10559.452 25.51161 67.84839 2574.355   NA
wc   11.021739  5052.304 52.82739 42.24348 4340.696   NA

We can now use 'mat' as a data frame for plotting and fitting a line:

> plot( education ~ income, as.data.frame(mat))
> fit <- lm( education ~ income, as.data.frame(mat))

etc.

3d plots

A: If I would like to do a 3-d plot, do I first have to upload a library from CRAN?

Q: You need to install the 'rgl' library with

 > install.packages('rgl')

Then to use scatter3d which was written by John Fox just a month ago you need to 'source' the function with

 > source("http://www.math.yorku.ca/~georges/R/coursefun.R")


identify 3d

Q. Identify 3d doesn't work.

A. To use identify3d you need to right-click and drag to create a rectangle around the points you want to identify. You then release and the labels for the points appear. To end identify mode, you right-click without dragging.


Changing colours in plots

Q: How do I change the colour of a line (when plotting more than one regression line on a single scatterplot)?

A: There are a number of ways of doing this. The simplest to explain, if you plotted data points using 'plot' is to plot each line using for example:

 > abline( fitx, col = 'red')

In most plotting functions, you can specify colour with the 'col' argument. You can also get some functions to do multiple colours in one go, e.g. matplot, xyplot( ..., groups = grouping.factor,...). 'matplot' is quite easy to use and you can play around with it to see what happens. xyplot is bit more advanced but more powerful -- of course.

Identifying points in plots

Q: How do I do the identifying function to list outliers in scatter plots. Can this be done in other types of plots (i.e. histogram) as well?

A: If you have variables x, y and labels in 'lab', then you could do:

> plot(x,y)
> identify(x,y,lab)

You left-click on each point you want to identify and right-click to end. To identify points with 'boxplot' you would use the fact that the function

> boxplot(x)

draws the boxplot over the value '1' on the x-axis. Thus:

> identify( rep(1,length(x)) , x, lab)

will work. If you have a factor 'fac', then

> boxplot( x ~ fac )
> identify( fac, x, lab)

will do the right things. With other functions, it's a matter of experimenting.


Assessing the variance in Y

Q. How would I go about assessing the variance (and spread) of the changes in Y as occurring with X in R? Stated another way. How would I examine the data to ensure that the assumption of equal variances has not been violated using R?

A. ??

Non-parametric analyses in R

Q. What are the commands for performing different nonparametric regression analyses in R?

A.

Robust and resistant lines in R

Q. I had previously asked you about how to fit a regression line while downweighting outliers and as I look through the tutorials and notes, I came across the lqs and rlm functions. I know that the lqs function is the most resistant to outliers, so was wondering if by using it I would also be in a way downweighting outliers.

A. These methods downweight residual outliers but not necessarily outliers in predictor space. 'rlm' uses M-estimation downweighting large residuals with Tukey's biweight -- like 'lowess' -- and 'lqs' minimizes a 'trimmed' sum of squares omitting the largest residuals. Both methods fit linear models, not non-parametric models so an outlier in predictor space could have a large influence. Which method is better? I suspect they could converge in subtly different ways with some datasets resulting in different subsets of the data being downweighted and, thus, in different fitted models. Which robust methods should be used among those still in current use would be a very good topic for a paper.

Uploading plots to a wiki page

Question
How do I upload a plot to show it in a wiki page?
Note
If you want to upload anything but a plot (e.g. a word (.doc) file or a .pdf file) please see Using_the_Wiki:_Q_and_A#Uploading_a_file.
Answer
1. Create the plot in the usual way.
2. Create a .png file of the plot. You can do this using the 'File|Save as|Png' menu of the R graphics window or you can use R commands: (Note that to use the following you need to have created a directory 'R' in the base directory of your C drive. Make the name of the file very descriptive. Include your name to ensure that the name will be unique.
 > dev.copy(device = png, file ='c:/R/MyNamePlotDescription.png")
 > dev.off()
3. Edit the wiki page in which you wish to include the graph.
4. Click on 'Upload file' in the menu on the left side of the wiki page display. Then click 'Browse' and navigate to your png file.
5. Click on 'Open'
6. Answer the questions in the dialog boxes, e.g. the question on ownership of copyright.
7. Use the 'back arrow' on you browser to go back to the editing pane.
8. Include a link to the graph in the wiki page. It should have the form
[[Image:MyNamePlotDescription.png]]
Alternatively, you can show a thumbnail of your graph with:
[[Image:MyNamePlotDescription.png|thumb]]

Uploading a .csv data frame and reading it in R

To upload a .csv (Comma Separated Values) data frame, you first create it with a program like Excel. Be sure that the file contains variable names in the first row.

  • In Excel you save the file as a .csv file by using the drop-down menu "File | Save as" and choose 'CSV (Comma delimited) *.csv' from the 'Save as type' menu. When you save the file choose a name that will be unique on the wiki.
  • From your web browser, logged in to the MathStat wiki, click on "Upload file" in the window on the left.
  • Click on "Browse" and find the file you want to upload.
  • Complete the rest of the information and upload the page.
  • You can create a link to the file on the wiki in the form [[Media:MyNameDataDescription.csv]] (supposing that your file was named 'MyNameDataDescription.csv'. Clicking on this will show the file in raw .csv format which could be used for cutting and pasting to a raw text editor on your computer (e.g. notepad).
  • To read the file directly into R over the internet, you need to get the 'raw URL' for the file. This is the URL you see in your browser's url window when you view the raw .csv file. It will have a form like:
http://wiki.math.yorku.ca/images/7/73/MyNameDataDescription.csv
  • Cut and paste this URL into a wiki page. Other users can then retrieve the file directly with an R command like
   dd <- read.csv("http://wiki.math.yorku.ca/images/7/73/MyNameDataDescription.csv")

Changing the reference level for a factor

Q: The default reference level using treatment contrasts for a factor in a model formula is the level that is first in alphabetical order. How can I change it to another level?

A: Suppose that your factor is f:

> f <- factor(c('b','b','c','a'))
> f
[1] c a b c a
Levels: a b c

The default contrasts have 'a' as the reference level:

> contrasts(f)
  b c
a 0 0
b 1 0
c 0 1

You can change the order of levels with:

> ff <- factor(f , levels = c('c','b','a'))
> ff
[1] c a b c a
Levels: c b a

which makes 'c' the reference level:

> contrasts(ff)
  b a
c 0 0
b 1 0
a 0 1

Note that creating an 'ordered' factor doesn't achieve the same thing since the default coding for an ordered factor is generated with 'poly'.

Creating a subset of a data frame

Q: How do I create a subset of a data frame?

A: Suppose the data frame is called Angell. You can create a data frame with selected rows by specifying their row indices. e.g.

> dd <- Angell[ c(4,12,35), ]

A more 'R-like' solution might use the 'pmatch' function. e.g.

> dd <- Angell[ pmatch( c("Rich","Chatt",...), rownames(Angell)), ]

Reading data into R

Q: How can I quickly read a small data set into R?

A: You can cut and paste the data into a R script window (either within R or using Tinn-R). Then add a blank line at the end of the data and prepend a scan command. The scan command has a 'what' argument that is a list giving the names of the variables and a sample value, numeric, character or logical, as follows:

  > dd <- scan( what = list( x = 0, y = 0, fac = 'a', flag = T))
  1.2 2.3 male  F
  2.1 4.5 female  T
  9.2 2.1 male F
  # leave the previous line blank to end the input
  # dd is now a list and needs to be turned into a data.frame
  > dd <- as.data.frame(dd)
  > dd
      x y    w    lo
  1 2.0 3    a  TRUE
  2 2.0 3    b FALSE
  3 1.2 5 more FALSE

If cutting and pasting doesn't produce separate lines of data, you can add the argument 'multi.line = T' to scan.

Confidence ellipse for two slopes

1. fit the model, e.g.

  > fit <- lm( y ~ x1 + x2, dd)

2. Download 'cell'

  > source("http://www.math.yorku.ca/~georges/R/fun.R")

3. Prepare the plot area

  > plot( rbind(cell(fit),c(0,0)), type = 'n')

4. Plot the ellipse

  > lines( cell(fit))

5. Improve the description above.


> mean(x)
[1] 2.3
> sample(10)
[1] 2 3  4  3