My criticism of R numeric summary

Datetime: 2016-08-23 01:43:45          Topic: R Program

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

My criticism of R’s numeric summary() method is this: it is unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of trade-offs. summary() likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary trade-offs.

The Big Lebowski, 1998.

Please read on for some context and my criticism.

Introduction

My group has been doing a lot more professional training lately. This is interesting because bright students put a lot of demands on how you organize and communicate. They want things that make sense (so they can learn them), that are powerful (so it is worth learning them), and that are regular (so they can compose them and move beyond what you are teaching). Students are less sympathetic to implementation history and unstated conventions, as new users tend not to benefit from them. Remember that a new R student is still deciding whether they want to use R. To them it is new, so an instructor needs to defend R’s current trade-offs (not its evolutionary path). We find it is best to point out both what is great in R and what isn’t great (versus skipping such portions, or worse, trying to justify them).

Please keep this in mind when I demonstrate what goes wrong when one attempts to teach R’s summary() function to the laity.

The Issue

Suppose you had a list or vector of numbers in R. It would be useful to be able to produce and view some summaries or statistics about these numbers. The primary way to do this in R is to call the summary() method. Here is an example:

numbers <- 1:7
print(numbers)
##  [1]  1  2  3  4  5  6  7

summary(numbers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.5     4.0     4.0     5.5     7.0

From the names attached to the results you can get the meanings and move on. But the whole time you are hoping none of your students call summary() on a single number, because if they do, they have a very good chance of seeing summary() fail. And now you have broken trust in R.

Let’s tack into the wind and demonstrate the failure:

summary(15555)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15560   15560   15560   15560   15560   15560

summary() is claiming the minimum value from the set of numbers c(15555) is 15560. This is a deliberately trivial example where we can see what is going on (it sure looks like presentation rounding). To make matters worse, this isn’t just confusion generated during presentation: the actual values are wrong.

str(summary(15555))
## Classes 'summaryDefault', 'table'  Named num [1:6] 15560 15560 15560 15560 15560 ...
##   ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...

summary(15555)[['Min.']] == min(15555)
## [1] FALSE

It may seem silly to expect the slots from a summary() call on a vector would be used in calculation (when we have direct functions such as quantile() and mean() for getting the same results), but using values from summaries of models is standard practice in R. The trivial linear model summary summary(lm(y~0,data.frame(y=15555))) shows rounded results (though it appears to hold accurate results, and only round during presentation; use unclass() to inspect the actual values).

Why it Matters

This is in fact a problem. You can say this is a consequence of the “default settings of summary()” and it is my fault for not changing those settings. But frankly it is quite fair to expect the default settings to be safe and sane.

Let us also appeal to authority:

The many computational steps between original data source and displayed results must all be truthful, or the effect of the analysis may be worthless, if not pernicious. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive.

John Chambers, Software for Data Analysis: Programming with R, Springer 2008.

The point is you are delegating work to your system. If it needlessly fails (no matter how trivially) when observed, how can you trust it when unobserved? John Chambers’ point is that trust is very expensive to build up, so you really don’t want to squander it.

I used to try to “lecture this away” as just being “rounding in the presentation for neatness.” But this runs into two objections:

  • Why doesn’t the presentation hint at this by switching to scientific notation such as 1.556e+4?
  • If summary() “is just presentation” wouldn’t it be a string?
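Both objections can be made concrete. The following is a minimal sketch, not part of summary() itself, and the variable names are chosen only for this illustration. It contrasts rounding a value with signif(), which returns a silently changed number, with formatting the value for display using formatC(), which returns a string and leaves the stored value alone.

```r
x <- 15555

# Rounding the value: the result is still numeric, but no longer equal to x.
rounded <- signif(x, 4)
is.numeric(rounded)   # TRUE
rounded == x          # FALSE

# Rounding for presentation only: the result is a string, x is unchanged.
# format = "g" uses C-style %g formatting, which switches to scientific
# notation when the value has more integer digits than requested.
shown <- formatC(x, digits = 4, format = "g")
is.character(shown)   # TRUE
print(shown)
```

A string result, or scientific notation, would at least signal to the reader that what they are looking at is a display form and not the data.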

We are losing substitutability. We would love to be able to say to students that “summary() is a convenient shorthand and you can treat the following as equivalent”:

  • summary(x)[['Min.']] == min(x)
  • summary(x)[['1st Qu.']] == quantile(x,0.25)
  • summary(x)[['Median']] == median(x)
  • summary(x)[['Mean']] == mean(x)
  • summary(x)[['3rd Qu.']] == quantile(x,0.75)
  • summary(x)[['Max.']] == max(x)

But the above isn’t always the case. What we would like is for summary() to contain these values and get pretty printing by using the S3 or S4 object system to override the print() method. It is quite likely summary() predates these object systems, so achieved pretty printing through rounding of values.
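The design described above can be sketched with the S3 object system. This is only an illustration under stated assumptions: exact_summary() and its "exact_summary" class are hypothetical names invented here, not anything R provides, and the default of four digits simply mirrors the rounding seen earlier.

```r
# Hypothetical constructor: compute the six statistics and keep them exact.
exact_summary <- function(x, digits = 4) {
  qq <- stats::quantile(x, names = FALSE)
  values <- c(qq[1:3], mean(x), qq[4:5])
  names(values) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  structure(values, digits = digits, class = "exact_summary")
}

# S3 print method: round only the displayed copy, never the stored values.
print.exact_summary <- function(x, ...) {
  shown <- signif(unclass(x), attr(x, "digits"))
  attr(shown, "digits") <- NULL
  print(shown)
  invisible(x)
}

s <- exact_summary(15555)
print(s)                     # displays rounded values
s[["Min."]] == min(15555)    # TRUE: the stored value is exact
```

With this arrangement the printed table can be as neat as one likes, while the slots remain substitutable for min(), quantile(), median(), mean(), and max().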

What is going on?

We can take a look at the actual code and see what is happening. We are looking for a reason, not an excuse.

From help(summary) we see summary() takes a digits option with default value digits = max(3, getOption("digits")-3) (let’s not even get into why setting digits directly does one thing and the system default is shifted by 3). getOption("digits") returns 7 on my machine, so we are asking for four-digit rounding, which is consistent with what we saw. Digging through the dispatch rules, we can determine that for a numeric vector summary() eventually calls summary.default(). By calling print(summary.default) we can look at the code. The offending snippet is:

qq <- stats::quantile(object)
qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)

After computing the quantiles, summary() then calls signif() to round the results. R isn’t inaccurate; it just went out of its way to round the results.
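The two steps can be rerun by hand to watch the rounding happen. This is a sketch that mirrors the quoted snippet outside of summary.default(); the names exact and rounded are introduced here for illustration, and the digits value depends on the local getOption("digits") setting.

```r
# Recompute the default number of digits from the documented formula.
# This is 4 when getOption("digits") is 7; a different setting changes it.
digits <- max(3, getOption("digits") - 3)

object <- 15555

# The same two steps as the quoted snippet, run by hand.
qq <- stats::quantile(object)
exact <- c(qq[1L:3L], mean(object), qq[4L:5L])
rounded <- signif(exact, digits)

print(unname(exact))     # the values before rounding
print(unname(rounded))   # the values after rounding

# With exactly four significant digits, 15555 becomes 15560.
signif(15555, 4) == 15560   # TRUE
```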

Why is this whiny rant so long?

One reason this article is long is the behavior we are describing breaks expectations. So we end up having to document what is actually going on (a laborious process) instead of being able to rely on shared educated expectations. The whining is where actualities and expectations diverge.

Conclusion

summary() attempts to achieve neatness and legibility. This is a laudable goal, if achievable. Numeric analysis is not so simple that rounding could safely achieve such a goal.

It is well known that rounding is not a safe or faithful operation (it loses information, and can be catastrophic if naively applied in many stages of a complex calculation). Because it is obvious rounding is dangerous, sophisticated students are surprised that it defaults to “on” in common calculations without indication or warning (such as moving to scientific notation). summary() compounds this error by returning rounded values (instead of rounding only at print/presentation). As summary() is often a first view of data (along with print()), we encounter confusing, inconsistent situations where un-rounded values (presentation of original data) and rounded values are compared.

Of course, we can (and should) teach students to call mean(x) and quantile(x) rather than summary(x) when they want to reuse the summary statistics. But then we have to explain why. After seeing something like this it becomes an unfortunate additional teaching goal to convince students that more of R doesn’t behave like summary().
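As a minimal sketch of that advice, the statistics can be computed with the direct functions and collected into a named vector for reuse. The vector name stats_direct and its element names are chosen only for this example.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)

# Compute each statistic directly; nothing here rounds the result.
stats_direct <- c(
  min    = min(x),
  q1     = quantile(x, 0.25, names = FALSE),
  median = median(x),
  mean   = mean(x),
  q3     = quantile(x, 0.75, names = FALSE),
  max    = max(x)
)
print(stats_direct)

# The direct functions return the input value unchanged for a single number.
min(15555) == 15555    # TRUE
mean(15555) == 15555   # TRUE
```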




