14  Feature Importance and Variable Selection

14.1 Introduction: Ranking Predictors in Complex Models

In the previous chapter, we introduced a general map of interpretable machine learning and distinguished global from local explanations. Feature importance is often the first global summary we compute, because it gives a compact answer to a central descriptive question: which predictors carry the strongest signal for model predictions?

At the same time, feature importance is not a single quantity with a universal interpretation. Different methods encode different perturbations, assumptions, and notions of contribution. For descriptive analysis, this is not so much a limitation as a reminder that interpretation is comparative and contextual.

This chapter develops feature importance from that perspective. We review common importance definitions, compute them on reproducible examples, and connect ranking to variable selection. The goal is to establish a careful workflow that remains useful for communication while respecting technical nuance.

14.2 Why Feature Importance Matters in Descriptive Work

When models include nonlinearities and interactions, coefficient style interpretation is often unavailable or incomplete. Importance summaries help us recover structure by organizing predictors into an approximate relevance ordering.

A useful ranking can support several descriptive tasks:

  • prioritizing variables for deeper inspection,
  • comparing model behavior across datasets or time periods,
  • identifying redundant predictors before simplification,
  • communicating model focus to collaborators.

These uses are practical, but they are not causal claims. If an importance score is high, we learn that the model relied on that variable for prediction under the observed data distribution. We do not learn that intervening on the variable would necessarily change the outcome.

14.3 Two Broad Families of Importance Measures

In applied tabular analysis, two families appear frequently.

14.3.1 Impurity Based Importance (Model Specific)

Tree based models evaluate splits by impurity reduction. In regression trees, impurity is commonly measured by residual sum of squares. In classification trees, impurity is often measured by Gini or entropy style criteria. Aggregating split improvements across trees gives an internal importance score.

For a feature \(X_j\), a generic form is (Breiman 2001):

\[ I_j^{imp} = \sum_{t=1}^{T} \sum_{s \in S_{j,t}} \Delta i_{s,t}, \]

where \(T\) is the number of trees, \(S_{j,t}\) is the set of splits using feature \(j\) in tree \(t\), and \(\Delta i_{s,t}\) is impurity decrease at split \(s\).

These scores are efficient to compute, but they can favor variables with many potential split points, and they can distribute credit unevenly when predictors are correlated.
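To see the single-tree analogue of this formula, we can fit one regression tree and read off the accumulated impurity decreases that the fitted object stores per variable. This is a sketch on simulated data; note that rpart's `variable.importance` also credits surrogate splits, so it is not exactly the bare sum in the formula above.

```r
# A single regression tree on simulated data; rpart records, for each
# variable, the summed impurity (RSS) decrease over the splits that use it
# (plus surrogate-split credit, which the formula above omits).
library(rpart)

set.seed(1)
n <- 300
sim <- data.frame(
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = rnorm(n)
)
sim$y <- 2 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

tree_fit <- rpart(y ~ x1 + x2 + x3, data = sim)

# Summed impurity decreases per variable, the single-tree analogue of I_j^imp
tree_fit$variable.importance
```

As expected given the simulation, x1 dominates the ranking; averaging such scores over many trees gives the forest-level version.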

14.3.2 Permutation Importance (Mostly Model Agnostic)

Permutation importance asks a different question. If we randomly shuffle a feature and break its link to the outcome while preserving its marginal distribution, how much does prediction error increase?

As discussed in the previous chapter, the population level target can be written as a difference in expected loss before and after permutation. Here we emphasize the empirical version that we estimate in practice on a dataset of size \(n\), where \(\pi\) denotes a random permutation of the indices \(\{1, \dots, n\}\) (Fisher, Rudin, and Dominici 2019):

\[ \widehat{I}_j^{perm} = \frac{1}{n}\sum_{i=1}^{n} L\left(y_i, \hat{f}(x_{i,-j}, x_{\pi(i),j})\right) - \frac{1}{n}\sum_{i=1}^{n} L\left(y_i, \hat{f}(x_i)\right). \]

If shuffling \(X_j\) leads to a substantial error increase, the model appears to rely on that feature. This definition is often more comparable across model classes, though correlation can still dilute or redistribute signal.
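The estimator above can be written in a few lines. The following sketch uses squared error loss and a simple linear model on simulated data, so the connection between the formula and the code is direct; the data and coefficients are illustrative.

```r
# Minimal empirical permutation importance for a single feature:
# mean loss after shuffling the feature minus mean loss before.
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 3 * x1 + 0.2 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = dat)

mse <- function(obs, pred) mean((obs - pred)^2)
base_mse <- mse(dat$y, predict(fit, dat))

perm_once <- function(v) {
    d <- dat
    d[[v]] <- sample(d[[v]])        # break the link, keep the marginal
    mse(dat$y, predict(fit, d)) - base_mse
}

sapply(c("x1", "x2"), perm_once)    # x1 should dominate
```

Because the shuffle is random, a single permutation gives a noisy estimate; the repeated-shuffle version we use later in the chapter averages this quantity over several draws.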

14.4 Running Example: Random Forest on Boston Housing

We continue with a familiar setup from earlier chapters and fit a random forest for medv on a moderate set of predictors.

library(randomForest)
library(dplyr)
library(ggplot2)
library(gt)

data(Boston, package = "MASS")

set.seed(123)
idx <- sample(seq_len(nrow(Boston)), size = 0.7 * nrow(Boston))
train <- Boston[idx, ]
test <- Boston[-idx, ]

predictors <- c("lstat", "rm", "ptratio", "indus", "nox", "crim")
rf_formula <- as.formula(paste("medv ~", paste(predictors, collapse = " + ")))

set.seed(123)
rf_fit <- randomForest(
    rf_formula,
    data = train,
    ntree = 500,
    mtry = 2,
    importance = TRUE
)

rf_pred <- predict(rf_fit, newdata = test)
rf_rmse <- sqrt(mean((test$medv - rf_pred)^2))

rf_rmse
[1] 3.5575

This test RMSE provides context for interpretation. Importance summaries from models with weak predictive signal are often unstable, so predictive adequacy and interpretability should be read together.
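One quick way to anchor the scale of this RMSE is to compare it against a trivial baseline that predicts the training mean for every test case. The snippet below rebuilds the same split (same seed) so it stands alone.

```r
# Context for the forest's test RMSE: the error of a mean-only predictor
# on the identical train/test split used above.
data(Boston, package = "MASS")

set.seed(123)
idx <- sample(seq_len(nrow(Boston)), size = 0.7 * nrow(Boston))
train <- Boston[idx, ]
test <- Boston[-idx, ]

baseline_rmse <- sqrt(mean((test$medv - mean(train$medv))^2))
baseline_rmse
```

The baseline error is several times larger than the forest's, which suggests the model has enough signal for its importance summaries to be worth interpreting.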

14.5 Model Specific Importance from the Forest Object

The randomForest object provides two common regression importance summaries, %IncMSE and IncNodePurity. We collect both to compare rankings.

rf_importance <- importance(rf_fit)

imp_tbl <- tibble::tibble(
    variable = rownames(rf_importance),
    inc_mse = rf_importance[, "%IncMSE"],
    inc_node_purity = rf_importance[, "IncNodePurity"]
) |>
    arrange(desc(inc_mse))

imp_tbl |>
    gt() |>
    cols_label(
        variable = "Variable",
        inc_mse = "% increase in MSE",
        inc_node_purity = "Increase in node purity"
    ) |>
    fmt_number(columns = c(inc_mse, inc_node_purity), decimals = 3)
Variable % increase in MSE Increase in node purity
rm 33.752 8,924.274
lstat 32.154 9,536.612
nox 21.368 2,615.381
crim 18.043 2,590.015
ptratio 17.088 2,735.910
indus 11.933 1,822.111

imp_tbl |>
    ggplot(aes(x = reorder(variable, inc_mse), y = inc_mse)) +
    geom_col(fill = "steelblue", alpha = 0.85) +
    coord_flip() +
    labs(
        title = "Random Forest Importance (%IncMSE)",
        x = NULL,
        y = "% increase in MSE"
    ) +
    theme_minimal()

This ranking is often informative, but it should not be treated as a fixed truth. Small rank changes can occur with different seeds, data splits, or hyperparameters.
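The seed sensitivity mentioned above is easy to check directly: refit the same forest under a few seeds and compare the resulting %IncMSE orderings. This is a sketch that reconstructs the training split so it runs on its own; three seeds are enough to spot unstable neighboring ranks.

```r
# Refit the forest under several seeds and collect the %IncMSE ordering
# from each fit; columns that differ indicate rank instability.
library(randomForest)
data(Boston, package = "MASS")

set.seed(123)
idx <- sample(seq_len(nrow(Boston)), size = 0.7 * nrow(Boston))
train <- Boston[idx, ]
preds <- c("lstat", "rm", "ptratio", "indus", "nox", "crim")
f <- as.formula(paste("medv ~", paste(preds, collapse = " + ")))

orderings <- sapply(c(1, 2, 3), function(s) {
    set.seed(s)
    fit <- randomForest(f, data = train, ntree = 500, mtry = 2,
                        importance = TRUE)
    names(sort(importance(fit)[, "%IncMSE"], decreasing = TRUE))
})
orderings  # one column of ranked variable names per seed
```

If the top positions agree across columns while the middle positions swap, it is safer to report the stable top group and describe the rest as broadly comparable.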

14.6 Permutation Importance on Held Out Data

To complement model specific scores, we compute permutation importance directly on the test set with repeated shuffles.

perm_importance <- function(model, data, outcome, vars, n_repeats = 12, seed = 123) {
    set.seed(seed)

    base_pred <- predict(model, newdata = data)
    base_rmse <- sqrt(mean((data[[outcome]] - base_pred)^2))

    results <- lapply(vars, function(v) {
        deltas <- numeric(n_repeats)

        for (b in seq_len(n_repeats)) {
            perm_data <- data
            perm_data[[v]] <- sample(perm_data[[v]])
            perm_pred <- predict(model, newdata = perm_data)
            perm_rmse <- sqrt(mean((data[[outcome]] - perm_pred)^2))
            deltas[b] <- perm_rmse - base_rmse
        }

        tibble::tibble(
            variable = v,
            mean_delta_rmse = mean(deltas),
            sd_delta_rmse = sd(deltas)
        )
    })

    dplyr::bind_rows(results) |>
        arrange(desc(mean_delta_rmse))
}

perm_imp <- perm_importance(
    model = rf_fit,
    data = test,
    outcome = "medv",
    vars = predictors,
    n_repeats = 15,
    seed = 123
)

perm_imp |>
    gt() |>
    cols_label(
        variable = "Variable",
        mean_delta_rmse = "Mean increase in RMSE",
        sd_delta_rmse = "SD across permutations"
    ) |>
    fmt_number(columns = c(mean_delta_rmse, sd_delta_rmse), decimals = 3)
Variable Mean increase in RMSE SD across permutations
lstat 3.792 0.277
rm 3.064 0.307
nox 0.437 0.098
ptratio 0.408 0.068
crim 0.224 0.058
indus 0.136 0.069

perm_imp |>
    ggplot(aes(x = reorder(variable, mean_delta_rmse), y = mean_delta_rmse)) +
    geom_col(fill = "darkred", alpha = 0.80) +
    geom_errorbar(
        aes(
            ymin = mean_delta_rmse - sd_delta_rmse,
            ymax = mean_delta_rmse + sd_delta_rmse
        ),
        width = 0.15,
        color = "gray25"
    ) +
    coord_flip() +
    labs(
        title = "Permutation Importance on Test Data",
        subtitle = "Error bars show ±1 SD across repeated shuffles",
        x = NULL,
        y = "Increase in RMSE"
    ) +
    theme_minimal()

The dispersion bars help us avoid over interpreting very close ranks. If two variables have overlapping uncertainty ranges, it can be preferable to report them as similarly important rather than forcing a strict order.

14.7 Comparing Importance Definitions

A practical check is to compare rank agreement between methods.

rank_compare <- imp_tbl |>
    dplyr::select(variable, inc_mse) |>
    left_join(perm_imp |> dplyr::select(variable, mean_delta_rmse), by = "variable") |>
    mutate(
        rank_inc_mse = rank(-inc_mse, ties.method = "average"),
        rank_perm = rank(-mean_delta_rmse, ties.method = "average")
    )

rank_cor <- cor(rank_compare$rank_inc_mse, rank_compare$rank_perm,
                method = "spearman")

rank_compare |>
    arrange(rank_perm) |>
    gt() |>
    cols_label(
        variable = "Variable",
        inc_mse = "% increase in MSE",
        mean_delta_rmse = "Permutation ΔRMSE",
        rank_inc_mse = "Rank (%IncMSE)",
        rank_perm = "Rank (Permutation)"
    ) |>
    fmt_number(columns = c(inc_mse, mean_delta_rmse), decimals = 3)
Variable % increase in MSE Permutation ΔRMSE Rank (%IncMSE) Rank (Permutation)
lstat 32.154 3.792 2 1
rm 33.752 3.064 1 2
nox 21.368 0.437 3 3
ptratio 17.088 0.408 5 4
crim 18.043 0.224 4 5
indus 11.933 0.136 6 6

rank_cor
[1] 0.8857143

High rank agreement can increase confidence that a variable is consistently influential under more than one definition. Disagreement is also informative, because it often signals correlation structure or interaction effects that deserve deeper inspection.

14.8 Correlated Predictors and Shared Signal

When predictors overlap strongly, attribution becomes ambiguous. We can diagnose this by inspecting the predictor correlation matrix.

cor_mat <- cor(train[, predictors])

cor_mat |>
    as.data.frame() |>
    round(2) |>
    gt(rownames_to_stub = TRUE)

        lstat    rm    ptratio  indus  nox    crim
lstat    1.00  -0.60    0.40    0.60   0.60   0.46
rm      -0.60   1.00   -0.39   -0.41  -0.31  -0.22
ptratio  0.40  -0.39    1.00    0.36   0.20   0.29
indus    0.60  -0.41    0.36    1.00   0.78   0.41
nox      0.60  -0.31    0.20    0.78   1.00   0.42
crim     0.46  -0.22    0.29    0.41   0.42   1.00

If two variables are strongly correlated, shuffling one may have a limited effect because the other can partially recover the same information. In that case, a low marginal importance does not necessarily mean substantive irrelevance.

For communication, it is often helpful to pair feature importance with grouped interpretation, for example, discussing signal carried by related predictor families rather than only isolated variables.
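One concrete way to implement grouped interpretation is grouped permutation: shuffle a set of related predictors with a single joint permutation, so their shared signal is removed together rather than recovered by the untouched partner. The sketch below rebuilds the running example and uses the highly correlated pair {indus, nox} as the group.

```r
# Grouped permutation importance: apply one row permutation to a whole
# block of predictors, then compare against single-variable shuffles.
library(randomForest)
data(Boston, package = "MASS")

set.seed(123)
idx <- sample(seq_len(nrow(Boston)), size = 0.7 * nrow(Boston))
train <- Boston[idx, ]
test <- Boston[-idx, ]
preds <- c("lstat", "rm", "ptratio", "indus", "nox", "crim")
f <- as.formula(paste("medv ~", paste(preds, collapse = " + ")))

set.seed(123)
fit <- randomForest(f, data = train, ntree = 500, mtry = 2)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
base <- rmse(test$medv, predict(fit, test))

group_perm <- function(group) {
    d <- test
    ord <- sample(nrow(d))          # one permutation for the whole group
    d[group] <- d[ord, group]
    rmse(test$medv, predict(fit, d)) - base
}

set.seed(1)
c(joint = group_perm(c("indus", "nox")),
  indus = group_perm("indus"),
  nox   = group_perm("nox"))
```

When the joint increase clearly exceeds the sum of the individual increases, the pair shares signal that single-variable shuffles cannot attribute to either member alone.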

14.9 From Ranking to Variable Selection

Importance ranking often motivates a second question: how many predictors do we need for a stable descriptive model? A simple approach is to order variables by importance, fit models with the top \(k\) variables, and inspect out of sample error as \(k\) grows.

ordered_vars <- perm_imp$variable

subset_perf <- lapply(seq_along(ordered_vars), function(k) {
    vars_k <- ordered_vars[1:k]
    formula_k <- as.formula(paste("medv ~", paste(vars_k, collapse = " + ")))

    set.seed(200 + k)
    fit_k <- randomForest(
        formula_k,
        data = train,
        ntree = 400,
        mtry = max(1, floor(sqrt(k)))
    )

    pred_k <- predict(fit_k, newdata = test)
    rmse_k <- sqrt(mean((test$medv - pred_k)^2))

    tibble::tibble(
        k = k,
        rmse = rmse_k,
        variables = paste(vars_k, collapse = ", ")
    )
}) |>
    bind_rows()

subset_perf |>
    gt() |>
    cols_label(
        k = "Number of top variables",
        rmse = "Test RMSE",
        variables = "Variables included"
    ) |>
    fmt_number(columns = rmse, decimals = 3)
Number of top variables Test RMSE Variables included
1 6.504 lstat
2 4.715 lstat, rm
3 3.665 lstat, rm, nox
4 3.770 lstat, rm, nox, ptratio
5 3.588 lstat, rm, nox, ptratio, crim
6 3.490 lstat, rm, nox, ptratio, crim, indus

best_k <- subset_perf$k[which.min(subset_perf$rmse)]

subset_perf |>
    ggplot(aes(x = k, y = rmse)) +
    geom_line(color = "steelblue", linewidth = 1) +
    geom_point(color = "steelblue", size = 2) +
    geom_vline(xintercept = best_k, linetype = "dashed", color = "darkred") +
    scale_x_continuous(breaks = seq_len(length(ordered_vars))) +
    labs(
        title = "Parsimony Profile Based on Ranked Variables",
        subtitle = "Dashed line marks the best RMSE in this split",
        x = "Number of included variables",
        y = "Test RMSE"
    ) +
    theme_minimal()

This profile is intentionally pragmatic. It does not claim to identify a uniquely correct subset, but it helps us evaluate the trade off between parsimony and predictive adequacy in a transparent way.
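If we want a rule rather than visual inspection, a simple "within tolerance" criterion works: take the smallest \(k\) whose RMSE is within a chosen margin of the best observed value. The helper below is a sketch; the rmse values are illustrative roundings of the table above, and the 5% tolerance is an arbitrary choice to be set by the analyst.

```r
# Smallest k whose RMSE is within `tol` (relative) of the best RMSE.
choose_k <- function(perf, tol = 0.05) {
    best <- min(perf$rmse)
    min(perf$k[perf$rmse <= best * (1 + tol)])
}

# Illustrative profile (rounded from the parsimony table above).
perf <- data.frame(
    k = 1:6,
    rmse = c(6.50, 4.72, 3.67, 3.77, 3.59, 3.49)
)
choose_k(perf)  # → 5
```

Under this rule we would report the five-variable model as adequate, trading a small amount of error for a shorter variable list.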

14.10 A Brief Classification Illustration

Feature importance behaves similarly for classification, although impurity and loss definitions differ. We use the Pima diabetes data for a compact illustration.

data(Pima.tr, package = "MASS")
data(Pima.te, package = "MASS")

set.seed(321)
rf_cls <- randomForest(
    type ~ npreg + glu + bp + skin + bmi + ped + age,
    data = Pima.tr,
    ntree = 500,
    mtry = 3,
    importance = TRUE
)

cls_imp <- importance(rf_cls)

cls_tbl <- tibble::tibble(
    variable = rownames(cls_imp),
    mean_decrease_accuracy = cls_imp[, "MeanDecreaseAccuracy"],
    mean_decrease_gini = cls_imp[, "MeanDecreaseGini"]
) |>
    arrange(desc(mean_decrease_accuracy))

cls_tbl |>
    gt() |>
    cols_label(
        variable = "Variable",
        mean_decrease_accuracy = "Mean decrease accuracy",
        mean_decrease_gini = "Mean decrease Gini"
    ) |>
    fmt_number(columns = c(mean_decrease_accuracy, mean_decrease_gini), decimals = 3)
Variable Mean decrease accuracy Mean decrease Gini
glu 20.921 24.081
age 14.010 14.625
bmi 10.670 12.735
ped 9.824 13.883
npreg 8.470 8.622
skin 0.812 8.212
bp -0.798 7.298

Even in this short example, ranking should be interpreted jointly with domain context and class balance considerations. Importance identifies model reliance, not clinical causality.
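For classification, the held-out permutation analogue simply swaps RMSE for a classification loss. The sketch below uses misclassification rate on Pima.te and checks three of the top-ranked variables; it mirrors the regression helper from earlier in the chapter, with a single shuffle per variable for brevity.

```r
# Permutation importance for the classifier, using held-out error rate
# (misclassification on Pima.te) as the loss. Single shuffle per variable.
library(randomForest)
data(Pima.tr, package = "MASS")
data(Pima.te, package = "MASS")

set.seed(321)
fit <- randomForest(type ~ npreg + glu + bp + skin + bmi + ped + age,
                    data = Pima.tr, ntree = 500, mtry = 3)

err <- function(d) mean(predict(fit, d) != Pima.te$type)
base <- err(Pima.te)

set.seed(1)
sapply(c("glu", "age", "bmi"), function(v) {
    d <- Pima.te
    d[[v]] <- sample(d[[v]])
    err(d) - base
})
```

As with regression, averaging over repeated shuffles and reporting the spread would make these values more stable, and a proper scoring rule on predicted probabilities could replace the error rate when class balance is a concern.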

14.11 Good Practice for Reporting Feature Importance

A careful descriptive report often benefits from a short checklist:

  • report which importance definition was used,
  • indicate whether evaluation is in sample, OOB, or held out,
  • include some notion of variability (for example repeated permutations),
  • note potential correlation induced dilution,
  • avoid causal language unless causal identification is justified separately.

These practices do not eliminate ambiguity, but they make interpretation more transparent and reproducible.

14.12 Summary and Key Takeaways

  • Feature importance is a global explanatory tool that ranks model reliance on predictors.
  • Impurity based and permutation based scores answer related but distinct questions, and both can be informative.
  • Correlated predictors can redistribute importance, so low marginal importance is not always low substantive relevance.
  • Ranking can guide variable selection through parsimony profiles that compare error across top \(k\) subsets.
  • Importance is descriptive unless paired with a separate causal design.

14.13 Looking Ahead

Feature importance tells us which predictors matter most for model performance, but it does not describe the functional shape of those effects. The next chapter therefore focuses on partial dependence and related profile methods, where we examine how predictions vary as individual predictors change across their observed ranges.