One of the useful features of random forests is the ability to assess the importance of predictor variables, measured either as the mean decrease in accuracy or the mean decrease in node impurity attributable to each variable. While many statisticians, data scientists, and machine learning practitioners are familiar with variable importance measures for random forests, I don’t often see analyses that include class-specific variable importance.
I think this is a shame because for many problems variables may be more or less important for different classes. Overall variable importance gives only a high-level view of the problem; examining the importance of predictors for a single class provides insight into the features that differentiate that class from the others.
While getting the overall importance for variables in a random forest model in R is simple, it is not entirely obvious how to return the class-specific importance. In fact, trying to extract a measure of class-specific variable importance can be incredibly frustrating (until you find the incredibly simple solution).
Demonstration of the problem
Suppose that I want to fit a random forest model in R and assess the variable importance of my predictors. This is simple enough.
```r
> my_model <- randomForest(Species ~ ., data=iris, importance=TRUE)
> importance(my_model)
                setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length  6.710277  8.5874923  8.989345            12.107605         9.676950
Sepal.Width   4.501133  0.7066144  5.481982             5.368399         2.380181
Petal.Length 22.323162 34.0679333 28.312631            34.034319        42.895229
Petal.Width  22.203651 32.8962472 30.960037            34.227022        44.272395
```
Now if I want to get the variable importance for a specific class, but don’t know how, I would just turn to the documentation.
It turns out that the importance function takes an argument called class.
class – for classification problem, which class-specific measure to return.
This seems straightforward enough, and most users would just do this:
```r
> importance(my_model, class='setosa')
                setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length  6.710277  8.5874923  8.989345            12.107605         9.676950
Sepal.Width   4.501133  0.7066144  5.481982             5.368399         2.380181
Petal.Length 22.323162 34.0679333 28.312631            34.034319        42.895229
Petal.Width  22.203651 32.8962472 30.960037            34.227022        44.272395
```
but this is a trap: it returns the exact same result as importance(my_model) did. Not only does the function not behave as a user would reasonably expect, it gives no message or warning.
At this point it is reasonable to expect the user to go back to the function documentation and look for an answer, or perhaps an example of how to properly request class-specific variable importance. But the user won’t find the answer there. Now the user will probably think carefully (but not carefully enough) and try to extract the class-specific variable importances manually. This is somewhat dangerous if the user decides to pull the importances out of the model object directly.
```r
> my_model$importance
                  setosa  versicolor   virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length 0.036573992 0.025547632 0.046447034          0.036346581         9.676950
Sepal.Width  0.008046969 0.001263502 0.009825166          0.006397894         2.380181
Petal.Length 0.334837811 0.293303764 0.294833879          0.303596194        42.895229
Petal.Width  0.332024023 0.309245664 0.275216276          0.302861787        44.272395
```
Observe that these variable importances differ from the ones returned by importance(my_model) above! This is because importance has an optional argument, defaulting to TRUE, that scales the measures by their standard errors. Scaling is generally a good idea, so we might decide to get the variable importances for each class by indexing the result of importance(my_model):
```r
> importance(my_model)[,'setosa']
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    6.710277     4.501133    22.323162    22.203651
```
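If we want the scaled class-specific importances for every class at once, we can index all of the class columns together rather than one at a time. A minimal sketch, assuming the iris model fit above and that the randomForest package is installed:

```r
library(randomForest)

# Fit as before; importance=TRUE is required for the accuracy-based measures.
my_model <- randomForest(Species ~ ., data = iris, importance = TRUE)

# The first nclass columns of the scaled importance matrix hold the
# class-specific mean decreases in accuracy, one column per class.
imp <- importance(my_model)
class_imp <- imp[, levels(iris$Species)]

# A single class's column matches importance(my_model)[, 'setosa'] above.
class_imp[, "setosa"]
```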
But why doesn’t the extractor function work as expected? We can examine this by digging into the function code.
The first hint to our question is found on line 11: allImp <- is.null(type) && hasImp. This checks whether the user has passed an argument to type (the type of importance measure to return) and whether the model object has a matrix of importance values. So in addition to specifying which class we want importance measures for, we also have to specify the type of importance measure. Going back to the function documentation, the value for type should be either 1 (mean decrease in accuracy) or 2 (mean decrease in node impurity). Unmentioned in the documentation, importance will only return class-specific variable importance as a measure of mean decrease in accuracy.
```r
> importance(my_model, type=1, class='setosa')
                setosa
Sepal.Length  6.710277
Sepal.Width   4.501133
Petal.Length 22.323162
Petal.Width  22.203651
> importance(my_model, type=2, class='setosa')
Error in importance.randomForest(my_model, type = 2, class = "setosa") :
  No class-specific measure for that type
```
While not mentioned at all in the documentation for importance, the documentation for randomForest does note that the only class-specific measure computed is the mean decrease in accuracy.
For classification, the first nclass columns are the class-specific measures computed as mean descrease [sic] in accuracy. The nclass + 1st column is the mean descrease [sic] in accuracy over all classes. The last column is the mean decrease in Gini index.
Since only one type of class-specific variable importance measure is available, it does not make sense to require an argument to type. Further, it is not readily apparent from the function documentation that the function requires both arguments. Only after reviewing the code and the randomForest documentation does it become clear what the behavior of the importance function is, and why.
Some people may shrug at this and say, “so what? So what if I have to pass type=1 to get the class-specific variable importance?” These people are missing the point. As we saw above, importance(my_model, class='setosa') gave the exact same response as importance(my_model). There was absolutely no indication that the function was being used incorrectly (not as the author intended). This could easily be corrected with a message:
```r
> importance(my_model, class='setosa')
Warning in importance(my_model, class='setosa') :
  class is non-null but type is null; specify type=1 to get class-specific variable importance
                setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length  6.710277  8.5874923  8.989345            12.107605         9.676950
Sepal.Width   4.501133  0.7066144  5.481982             5.368399         2.380181
Petal.Length 22.323162 34.0679333 28.312631            34.034319        42.895229
Petal.Width  22.203651 32.8962472 30.960037            34.227022        44.272395
```
Some people will argue that users should have to specify values for both class and type. They might say that requiring type=1 informs the user that the measure of variable importance they are receiving is the mean decrease in accuracy. This could be addressed through either a message or, preferably, clarifying the function documentation. The obscurity of the documentation for importance is a bad thing: it makes users dig to find the answer (that type=1 is required).
Compounding of bad behavior
My final grief about this is demonstrated through the use of another function, varImpPlot, which creates a dotchart of variable importance and makes a direct call to importance. When a user requests a dotchart of variable importance for a given class but does not explicitly pass type=1, the resulting plot contains the overall variable importance (side-by-side plots of mean decrease in accuracy and impurity). If a user passes a value to class, it is clear that the user wanted a plot of class-specific variable importances, but again no warning is given that the user (it’s always the user’s fault) is in fact using the function incorrectly.
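The workaround mirrors the importance call: pass type=1 explicitly alongside class. A sketch, again assuming the iris model from above and that randomForest is installed:

```r
library(randomForest)
my_model <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Passing both class and type=1 yields the class-specific dotchart;
# dropping type silently falls back to the overall side-by-side plot.
varImpPlot(my_model, type = 1, class = "setosa",
           main = "Variable importance for setosa")
```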
Examining the importance of predictors for a single class gives insight into the features that define that class. However, the behavior and accompanying documentation of the importance function may dissuade less experienced R users from examining class-specific variable importance. Improvements could easily be made by changing the behavior of the function (e.g. defaulting type to 1 when class is non-null), by clarifying the function documentation, or preferably both.
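The behavioral fix suggested above is small enough to sketch as a hypothetical wrapper (the name class_importance is my own, not part of the randomForest package): when class is supplied, default type to 1 instead of silently ignoring class.

```r
# Hypothetical wrapper: make class-specific importance the default
# behavior when a class is requested, rather than requiring type=1.
class_importance <- function(model, class, type = 1, ...) {
  randomForest::importance(model, type = type, class = class, ...)
}
```

With this wrapper, class_importance(my_model, "setosa") returns the class-specific mean decrease in accuracy without the caller having to know about the type argument at all.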