R: Warnings

From MathWiki

This is a page of traps or 'gotchas' in R: things that can go wrong silently, without any warning, and lead to serious errors.


factor as numeric

Many problems occur because of the dual personality of factors.

merge

When merging two data frames, if the same variable name is used for a factor in one data frame and for a numeric variable in the other, the merge will create a factor that will probably not make sense.
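One way to guard against this, offered only as a sketch: compare the classes of the shared columns before merging. The helper 'class_mismatches' and the data frames 'd1' and 'd2' are made up for illustration; they are not part of base R.

```r
# Hypothetical helper (not part of base R): return the names of the columns
# shared by two data frames whose classes differ.
class_mismatches <- function(x, y) {
  shared <- intersect(names(x), names(y))
  differs <- vapply(
    shared,
    function(nm) !identical(class(x[[nm]]), class(y[[nm]])),
    logical(1)
  )
  shared[differs]
}

# Illustrative data: 'id' is a factor in d1 and numeric in d2.
d1 <- data.frame(id = factor(c("10", "20", "30")), x = 1:3)
d2 <- data.frame(id = c(10, 20, 30), z = c("p", "q", "r"))

bad <- class_mismatches(d1, d2)   # names of the columns to fix before merging

# Convert the factor to numeric through its character values, then merge.
d1$id <- as.numeric(as.character(d1$id))
merged <- merge(d1, d2, by = "id")
```

Checking before the merge is cheaper than diagnosing a nonsensical factor afterwards.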

as.numeric

Applied to a factor, 'as.numeric' returns the internal integer 'codes', not the values of the factor coerced to numeric values. To coerce the values to numeric, use:

 > as.numeric( as.character( ff ))

where 'ff' is a factor.
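A small sketch of the difference, using a made-up factor whose labels happen to be numbers. The form 'as.numeric(levels(ff))[ff]' is a commonly used equivalent that converts each level only once.

```r
# Illustrative factor: the labels look like numbers, but the levels are
# sorted as character strings.
ff <- factor(c("10", "5", "20"))

codes  <- as.numeric(ff)                  # internal integer codes, not the values
values <- as.numeric(as.character(ff))    # the values the labels represent
alt    <- as.numeric(levels(ff))[ff]      # same values, converting each level once

print(codes)
print(values)
```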

Looping on a factor

  > for ( nn in ff ) print(nn)

will print the 'codes', not the values. This can be a problem when you think you are using 'nn' as a 'name index' into a different list. You will silently select by 'numerical index'.

NOTE: This is no longer true in version 2.9.1. The character values of the factor will be printed.
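Whatever the version, converting explicitly removes the ambiguity. The sketch below uses a made-up named list 'lookup' and contrasts selecting by name with selecting by the integer codes.

```r
# Illustrative data: a factor and a named list to look values up in.
ff <- factor(c("b", "a", "c"), levels = c("a", "c", "b"))
lookup <- list(a = "first", b = "second", c = "third")

# Loop over the character values so that 'nn' is always a name index.
for (nn in as.character(ff)) print(lookup[[nn]])

# Selecting by name versus selecting by the underlying integer codes.
by_name <- unlist(lookup[as.character(ff)])
by_code <- unlist(lookup[as.integer(ff)])
```

Here 'by_name' and 'by_code' pick the elements in different orders, which is exactly the silent error described above.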

factor quirks

expand.grid

The following examples were produced with version 2.9.1

If 'ff' is a factor, e.g.

 > (ff <- factor( c('b','a',NA,'c','c'), levels= c('a','c','b')))
 [1] b    a    <NA> c    c   
 Levels: a c b

Then 'expand.grid' produces different results depending on whether one uses 'expand.grid( ff )' or 'expand.grid( levels(ff) )'. Applied to a character vector, 'expand.grid' creates a factor whose levels are in the order in which the values appear in the character vector. Applied to a factor, it returns a factor whose values are in the same order as the original and whose levels are in the same order as the original levels. 99% of the time, this is really a feature, not a bug, but it can give surprises.

Note that the order of the data in 'ff' is different from that of the levels. Both of these orderings are different from lexicographical order. A repeated value and a missing value have also been included for illustration.

Using the factor:

> expand.grid(ff = ff)
    ff
1    b
2    a
3 <NA>
4    c
5    c

produces a vector identical to the original:

> expand.grid(ff = ff)$ff
[1] b    a    <NA> c    c   
Levels: a c b

Using the factor levels:

> expand.grid(ff = levels(ff))
  ff
1  a
2  c
3  b

produces the 'same factor' (i.e. same levels in same order) but different values, namely the values in the order of the levels:

> expand.grid(ff = levels(ff))$ff
[1] a c b
Levels: a c b

Using 'unique(ff)':

> expand.grid(ff = unique(ff))
    ff
1    b
2    a
3 <NA>
4    c

produces the same result as using the factor, except that repeated values are not included (this works as expected, although this may not have been the case in earlier versions):

> expand.grid(ff = unique(ff))$ff
[1] b    a    <NA> c   
Levels: a c b

Using 'as.character(ff)':

> expand.grid(ff = as.character(ff))
    ff
1    b
2    a
3 <NA>
4    c
5    c

produces values identical to the argument, with levels in the order in which the values first appear:

> expand.grid(ff = as.character(ff))$ff
[1] b    a    <NA> c    c   
Levels: b a c

Thus, it would seem that 'unique(ff)' would be best if one wishes to preserve the order of both the values and the levels of the original and to include missing values.
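The four variants can be compared side by side by collecting only their levels. This is a sketch that assumes the default behaviour of 'expand.grid', in which character arguments are converted to factors; the list name 'variants' is illustrative.

```r
ff <- factor(c("b", "a", NA, "c", "c"), levels = c("a", "c", "b"))

# One expand.grid call per variant, keeping only the resulting column.
variants <- list(
  factor    = expand.grid(ff = ff)$ff,
  levels    = expand.grid(ff = levels(ff))$ff,
  unique    = expand.grid(ff = unique(ff))$ff,
  character = expand.grid(ff = as.character(ff))$ff
)

# Levels of each variant: only the character version reorders them.
lapply(variants, levels)

# Number of rows produced by each variant.
vapply(variants, length, integer(1))
```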

One potentially problematic issue is that of adding a variable created with 'tapply':

> tapply( ff, ff, function(x ) unique(x)) 
a c b 
1 2 3 

Given that 'unique' applied to a factor returns a factor, it is surprising that numerical codes are returned here. The reason, however, lies neither in 'unique' nor in the manner in which portions of the argument are presented to 'unique', but in the manner in which 'tapply' simplifies the results to form a single vector. Consider what happens when the result cannot be simplified:

> tapply( ff, ff, function(x ) x) 
$a
[1] a
Levels: a c b

$c
[1] c c
Levels: a c b

$b
[1] b
Levels: a c b

To get a clearer picture of what 'tapply' returns, we can use:

> tapply( as.character(ff), ff, function(x ) unique(x)) 
  a   c   b 
"a" "c" "b" 

Note that 'tapply' uses the ordering of the factor levels.
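Another way to sidestep the simplification, sketched here rather than recommended as the only approach, is to ask 'tapply' not to simplify and to convert each piece explicitly. The names 'res' and 'vals' are illustrative.

```r
ff <- factor(c("b", "a", NA, "c", "c"), levels = c("a", "c", "b"))

# simplify = FALSE keeps each result as a factor inside a list,
# so no codes are substituted for the values.
res <- tapply(ff, ff, unique, simplify = FALSE)

# Convert each one-element factor to its character value, one per level.
vals <- vapply(res, as.character, character(1))
print(vals)
```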

So it looks like we could 'add' the outcome of 'tapply' to the data frame created with 'expand.grid':

> (ddf <- expand.grid( ff = levels(ff)))
  ff
1  a
2  c
3  b
> (y <- tapply( as.character(ff) , ff, unique) )
  a   c   b 
"a" "c" "b" 

Note that:

> ddf$y <- y
> ddf
  ff y
1  a a
2  c c
3  b b

works, but:

> sapply(ddf, class)
      ff        y 
"factor"  "array" 

which can cause problems when using 'ddf' for other purposes such as model fitting. The problems are more serious if 'tapply' is used with a multidimensional index. It is best to use:

> ddf$y <- c(y)
> ddf
  ff y
1  a a
2  c c
3  b b
> sapply(ddf, class)
         ff           y 
   "factor" "character"
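The multidimensional case can be sketched in the same way. The example relies on two facts: 'expand.grid' varies its first argument fastest, and 'c()' flattens a matrix column by column, so the two orderings line up. The names 'g1', 'g2', 'val', 'm' and 'grid' are made up for illustration.

```r
# Illustrative data: two grouping factors and a numeric value.
g1  <- factor(c("a", "a", "b", "b"))
g2  <- factor(c("x", "y", "x", "y"))
val <- c(1, 2, 3, 4)

# With two index factors, tapply returns a matrix (rows = g1, columns = g2).
m <- tapply(val, list(g1 = g1, g2 = g2), sum)

# Build the grid with the factors in the same order as the index list,
# then flatten the matrix with c() so that 'y' is a plain vector.
grid <- expand.grid(g1 = levels(g1), g2 = levels(g2))
grid$y <- c(m)

sapply(grid, class)
```

If the factors are given to 'expand.grid' in a different order from the 'tapply' index, the values will no longer line up with the rows.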

A final observation: if it is desired to include 'NA' as a regular value in the index of 'tapply', the indexing factor should be created to include NA's:

> (ff <- factor( c('b','a',NA,'c','c'), exclude = NULL) )
[1] b    a    <NA> c    c   
Levels: a b c <NA>
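A short sketch of the effect on 'tapply': with NA included as a level, the missing value gets its own cell instead of being dropped. The count vector 'counts' is illustrative.

```r
# Factor with NA kept as a level.
ff <- factor(c("b", "a", NA, "c", "c"), exclude = NULL)

# Count the observations in each group, including the NA group.
counts <- tapply(rep(1, length(ff)), ff, sum)
print(counts)
```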

Question: How could you set a different order for levels?
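One possible answer, offered as a sketch: pass the desired order through the 'levels' argument, listing NA among the levels, while keeping 'exclude = NULL'. The name 'ff2' and the particular order chosen are illustrative.

```r
# Specify the level order explicitly, with NA listed as one of the levels.
ff2 <- factor(c("b", "a", NA, "c", "c"),
              levels = c("c", "a", NA, "b"),
              exclude = NULL)

levels(ff2)       # the levels in the requested order
as.integer(ff2)   # codes follow the new level order
```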