One of the useful features of random forests is the ability to assess the importance of predictor variables, measured either as the mean decrease in accuracy or the mean decrease in node impurity attributable to each variable. While many statisticians, data scientists, and machine learning practitioners are familiar with variable importance measures for random forests, I don’t often see analyses that include class-specific variable importance.
I think this is a shame because for many problems variables may be more or less important for different classes. Overall variable importance gives only a high-level view of the problem; examining the importance of predictors for a single class provides insight into the features that differentiate that class from the others.
While getting the overall importance for variables in a random forest model in R is simple, it is not entirely obvious how to return the class-specific importance. In fact, trying to extract a measure of class-specific variable importance can be incredibly frustrating (until you find the incredibly simple solution).
Demonstration of the problem
Suppose that I want to fit a random forest model in R and assess the variable importance of my predictors. This is simple enough.
```r
> my_model <- randomForest(Species ~ ., data=iris, importance=TRUE)
> importance(my_model)
                setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length  6.710277  8.5874923  8.989345            12.107605         9.676950
Sepal.Width   4.501133  0.7066144  5.481982             5.368399         2.380181
Petal.Length 22.323162 34.0679333 28.312631            34.034319        42.895229
Petal.Width  22.203651 32.8962472 30.960037            34.227022        44.272395
```
Now if I want to get the variable importance for a specific class, but don’t know how, I would just turn to the documentation.
It turns out that the importance function takes an argument called class.
class – for classification problem, which class-specific measure to return.
This seems straightforward enough, and most users would just do this:
```r
> importance(my_model, class='setosa')
                setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length  6.710277  8.5874923  8.989345            12.107605         9.676950
Sepal.Width   4.501133  0.7066144  5.481982             5.368399         2.380181
Petal.Length 22.323162 34.0679333 28.312631            34.034319        42.895229
Petal.Width  22.203651 32.8962472 30.960037            34.227022        44.272395
```
but this is a trap: it returns the exact same result as importance(my_model) did. Not only does the function not behave as a user would reasonably expect, it gives no message or warning.
At this point it is reasonable to expect the user to go back to the function documentation and look for an answer, or perhaps an example of how to properly request class-specific variable importance. But the user won’t find the answer there. Now the user will probably think carefully (but not carefully enough) and try to extract the class-specific variable importances manually. This is somewhat dangerous if the user decides to pull the importances out of the model object directly.
```r
> my_model$importance
                  setosa  versicolor   virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length 0.036573992 0.025547632 0.046447034          0.036346581         9.676950
Sepal.Width  0.008046969 0.001263502 0.009825166          0.006397894         2.380181
Petal.Length 0.334837811 0.293303764 0.294833879          0.303596194        42.895229
Petal.Width  0.332024023 0.309245664 0.275216276          0.302861787        44.272395
```
Observe that these variable importances differ from the ones returned by importance(my_model) above! This is because importance has an optional argument, defaulting to TRUE, that scales the measures by their standard errors. Scaling is generally a good idea, so we might decide to get the variable importances for each class by indexing the result of importance(my_model):
```r
> importance(my_model)[,'setosa']
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    6.710277     4.501133    22.323162    22.203651
```
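If we want the scaled class-specific importances for every class at once, we can index all of the class columns together rather than one at a time. A minimal sketch, assuming the iris model fit above and that the randomForest package is installed:

```r
library(randomForest)

# Fit as before; importance=TRUE is required for the accuracy-based measures.
my_model <- randomForest(Species ~ ., data = iris, importance = TRUE)

# The first nclass columns of the scaled importance matrix hold the
# class-specific mean decreases in accuracy, one column per class.
imp <- importance(my_model)
class_imp <- imp[, levels(iris$Species)]

# A single class's column matches importance(my_model)[, 'setosa'] above.
class_imp[, "setosa"]
```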
But why doesn’t the extractor function work as expected? We can examine this by digging into the function code.
The first hint to our question is found on line 11: allImp <- is.null(type) && hasImp. This checks whether the user has passed an argument to type (the type of importance measure to return) and whether the model object has a matrix of importance values. So in addition to specifying which class we want importance measures for, we also have to specify the type of importance measure. Going back to the function documentation, the value for type should be either 1 (mean decrease in accuracy) or 2 (mean decrease in node impurity). Unmentioned in the documentation, importance will only return class-specific variable importance as a measure of mean decrease in accuracy.
```r
> importance(my_model, type=1, class='setosa')
                setosa
Sepal.Length  6.710277
Sepal.Width   4.501133
Petal.Length 22.323162
Petal.Width  22.203651
> importance(my_model, type=2, class='setosa')
Error in importance.randomForest(my_model, type = 2, class = "setosa") :
  No class-specific measure for that type
```
While not mentioned at all in the documentation for importance, the documentation for randomForest does note that the only class-specific measure computed is the mean decrease in accuracy.
For classification, the first nclass columns are the class-specific measures computed as mean descrease [sic] in accuracy. The nclass + 1st column is the mean descrease [sic] in accuracy over all classes. The last column is the mean decrease in Gini index.
Since only one type of class-specific variable importance measure is available, it does not make sense to require an argument to type. Further, it is not readily apparent from the function documentation that the function requires both arguments. Only after reviewing the code and the randomForest documentation does it become clear what the behavior of the importance function is, and why.
Some people may shrug at this and say, “so what? So what if I have to pass type=1 to get the class-specific variable importance?” These people are missing the point. As we saw above, importance(my_model, class='setosa') gave the exact same response as importance(my_model). There was absolutely no indication that the function was being used incorrectly (not as the author intended). This could easily be corrected with a message:
```r
> importance(my_model, class='setosa')
Warning in importance(my_model, class='setosa') :
  class is non-null but type is null; specify type=1 to get class-specific variable importance
                setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length  6.710277  8.5874923  8.989345            12.107605         9.676950
Sepal.Width   4.501133  0.7066144  5.481982             5.368399         2.380181
Petal.Length 22.323162 34.0679333 28.312631            34.034319        42.895229
Petal.Width  22.203651 32.8962472 30.960037            34.227022        44.272395
```
Some people will argue that users should have to specify values for both class and type. They might say that requiring type=1 informs the user that the measure of variable importance they are receiving is the mean decrease in accuracy. This could be addressed through either a message or, preferably, clarifying the function documentation. The obscurity of the documentation for importance is a bad thing: it makes users dig to find the answer (that type=1 is required).
Compounding of bad behavior
My final grief about this is demonstrated through the use of another function, varImpPlot, which creates a dotchart of variable importance and makes a direct call to importance. When a user requests a dotchart of variable importance for a given class but does not explicitly pass type=1, the resulting plot contains the overall variable importance (side-by-side plots of mean decrease in accuracy and impurity). If a user passes a value to class, it is clear that the user wanted a plot of class-specific variable importances, but again no warning is given that the user (it’s always the user’s fault) is in fact using the function incorrectly.
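The workaround mirrors the importance call: pass type=1 explicitly alongside class. A sketch, again assuming the iris model from above and that randomForest is installed:

```r
library(randomForest)
my_model <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Passing both class and type=1 yields the class-specific dotchart;
# dropping type silently falls back to the overall side-by-side plot.
varImpPlot(my_model, type = 1, class = "setosa",
           main = "Variable importance for setosa")
```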
Examining the importance of predictors for a single class gives insight into the features that define that class. However, the behavior and accompanying documentation of the importance function may dissuade less experienced R users from examining class-specific variable importance. Improvements could easily be made by changing the behavior of the function (e.g. defaulting type to 1 when class is non-null), by clarifying the function documentation, or preferably both.
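The behavioral fix suggested above is small enough to sketch as a hypothetical wrapper (the name class_importance is my own, not part of the randomForest package): when class is supplied, default type to 1 instead of silently ignoring class.

```r
# Hypothetical wrapper: make class-specific importance the default
# behavior when a class is requested, rather than requiring type=1.
class_importance <- function(model, class, type = 1, ...) {
  randomForest::importance(model, type = type, class = class, ...)
}
```

With this wrapper, class_importance(my_model, "setosa") returns the class-specific mean decrease in accuracy without the caller having to know about the type argument at all.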