15  Partial Dependence and Individual Conditional Expectation

15.1 Introduction: Moving from Importance to Functional Interpretation

In the previous chapter, we focused on feature importance as a global ranking device. That perspective helps us identify which predictors the model appears to rely on most. It does not yet tell us how predictions change as a predictor varies.

Partial dependence and individual conditional expectation (ICE) plots address this next layer of interpretation. Rather than asking only which variables matter, we ask how model predictions respond across the range of one or more predictors. This shift is central in descriptive work with nonlinear models, because it reveals shape, heterogeneity, and potential interaction structure.

This chapter develops both methods from first principles, then implements them in reproducible R code on a tree based model. We emphasize practical interpretation, known limitations, and communication choices that keep conclusions technically accurate.

15.2 Why Profile Based Explanations Matter

Feature rankings are compact and useful, but they collapse functional detail. Two predictors may receive similar importance scores while having very different predictive patterns, for example, one roughly monotonic and another strongly nonlinear.

Profile based summaries help with questions such as:

  • how predictions evolve over the observed range of a predictor,
  • whether effects appear approximately linear, threshold-like, or saturating,
  • whether different observations exhibit similar or divergent response profiles.

These are descriptive statements about fitted model behavior under the empirical data distribution. As discussed earlier in the book, they should not be interpreted as causal response functions.

15.3 Formal Definitions

Let a fitted predictor be denoted by \(\hat{f}(x)\), and partition the feature vector as \((x_S, x_C)\), where \(x_S\) is the focal feature (or feature subset) and \(x_C\) collects the remaining features.

15.3.1 Partial Dependence Function

Building on the introductory formulation presented earlier in the book, we now write partial dependence for a subset of focal features \(S\) (with complement \(C\)):

\[ f_S^{PD}(x_S) = \mathbb{E}_{X_C}\left[\hat{f}(x_S, X_C)\right]. \]

In practice, this expectation is estimated by Monte Carlo averaging over observed rows (Friedman 2001):

\[ \widehat{f}_S^{PD}(x_S) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}(x_S, x_{C}^{(i)}). \]

This notation accommodates one dimensional and multi dimensional profiles in a unified way. Operationally, we evaluate the estimator on a grid of \(x_S\) values and average predictions at each grid point.
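To make the averaging step concrete, consider a tiny hand-built example. The function and values below are hypothetical, chosen only to illustrate the estimator:

```r
# Hypothetical fitted model: f(x1, x2) = 2*x1 + x2 (illustration only)
f_hat <- function(x1, x2) 2 * x1 + x2

# Observed values of the complement feature x_C (here, x2)
x2_obs <- c(0, 1, 2, 3)

# PD estimate at a single grid point x_S = 1.5: hold x1 fixed at 1.5
# and average the model over the observed x2 values
mean(f_hat(1.5, x2_obs))  # 2*1.5 + mean(x2_obs) = 4.5
```

Repeating this computation over a grid of \(x_S\) values traces out the full PD curve.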

15.3.2 Individual Conditional Expectation Curves

For observation \(i\), the corresponding ICE curve is (Goldstein et al. 2015):

\[ \hat{f}_{ICE}^{(i)}(x_S) = \hat{f}(x_S, x_{C}^{(i)}). \]

PD is the average of ICE curves over observations. This relationship matters because it shows exactly what PD can hide: if ICE curves differ substantially in shape, the average can mask heterogeneity.
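The averaging identity is easy to verify numerically. A minimal sketch on a hypothetical model with an interaction, where ICE slopes differ by observation but their column mean reproduces PD:

```r
# Hypothetical model with an interaction: the ICE slope in x1 equals x2
f_hat <- function(x1, x2) x1 * x2
x2_obs <- c(-1, 0, 1, 2)   # observed complement values
grid <- c(0, 0.5, 1)       # grid for the focal feature x1

# ICE matrix: one row per observation, one column per grid point
ice_mat <- sapply(grid, function(g) f_hat(g, x2_obs))

# PD at each grid point is the column mean of the ICE curves
colMeans(ice_mat)  # 0.00 0.25 0.50, i.e. average slope mean(x2_obs) = 0.5
```

The individual curves have slopes -1, 0, 1, and 2, yet their average is a gentle line with slope 0.5. This is precisely the kind of heterogeneity a PD curve alone would hide.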

15.3.3 Centered ICE Curves

A common variant is centered ICE (c-ICE), where each ICE curve is shifted to a common reference value \(x_S^{ref}\) (Goldstein et al. 2015):

\[ \hat{f}_{cICE}^{(i)}(x_S) = \hat{f}_{ICE}^{(i)}(x_S) - \hat{f}_{ICE}^{(i)}(x_S^{ref}). \]

Centering removes vertical offsets and makes slope differences easier to inspect.
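A small numeric sketch with hypothetical values shows the effect: two curves that differ only by a vertical offset become identical after centering at the first grid point:

```r
ice_a <- c(10, 12, 16)   # hypothetical ICE curve for one observation
ice_b <- c(20, 22, 26)   # same shape, shifted upward by 10

# Center each curve at the reference (first) grid point
c_ice_a <- ice_a - ice_a[1]
c_ice_b <- ice_b - ice_b[1]

identical(c_ice_a, c_ice_b)  # TRUE: the offset is gone, shapes align
```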

15.4 Running Example: Random Forest on Boston Housing

To maintain continuity with the interpretability sequence, we continue with a random forest model for medv using a compact predictor set.

library(randomForest)  # random forest fitting
library(dplyr)         # bind_rows and grouped summaries
library(tibble)        # tibble construction
library(ggplot2)       # plotting
library(gt)            # formatted tables

data(Boston, package = "MASS")

set.seed(123)
idx <- sample(seq_len(nrow(Boston)), size = 0.7 * nrow(Boston))
train <- Boston[idx, ]
test <- Boston[-idx, ]

predictors <- c("lstat", "rm", "ptratio", "indus", "nox", "crim")
rf_formula <- as.formula(paste("medv ~", paste(predictors, collapse = " + ")))

set.seed(123)
rf_fit <- randomForest(
    rf_formula,
    data = train,
    ntree = 500,
    mtry = 2,
    importance = TRUE
)

rf_pred <- predict(rf_fit, newdata = test)
rf_rmse <- sqrt(mean((test$medv - rf_pred)^2))

rf_rmse
[1] 3.5575

The predictive error gives context for interpretation. Profile based explanations are most informative when the model captures a meaningful share of signal.

15.5 Computing One Dimensional Partial Dependence

As introduced earlier in the book (see Chapter 13), one dimensional PD is obtained by averaging predictions while varying one focal feature over a grid. Here we move one step further by comparing PD profiles for two predictors, lstat and rm, to emphasize that profile interpretation is inherently comparative.

partial_dependence <- function(model, data, variable, grid_values) {
    pd_values <- sapply(grid_values, function(g) {
        new_data <- data
        new_data[[variable]] <- g
        mean(predict(model, newdata = new_data))
    })

    tibble::tibble(
        value = grid_values,
        pd = pd_values,
        variable = variable
    )
}

lstat_grid <- seq(
    from = as.numeric(quantile(train$lstat, 0.05)),
    to = as.numeric(quantile(train$lstat, 0.95)),
    length.out = 50
)

rm_grid <- seq(
    from = as.numeric(quantile(train$rm, 0.05)),
    to = as.numeric(quantile(train$rm, 0.95)),
    length.out = 50
)

pd_lstat <- partial_dependence(
    model = rf_fit,
    data = train,
    variable = "lstat",
    grid_values = lstat_grid
)

pd_rm <- partial_dependence(
    model = rf_fit,
    data = train,
    variable = "rm",
    grid_values = rm_grid
)

pd_profiles <- bind_rows(
    pd_lstat,
    pd_rm
)

rug_profiles <- bind_rows(
    tibble(value = train$lstat, variable = "lstat"),
    tibble(value = train$rm, variable = "rm")
)
pd_profiles |>
    ggplot(aes(x = value, y = pd)) +
    geom_line(color = "darkred", linewidth = 1) +
    geom_rug(
        data = rug_profiles,
        aes(x = value),
        inherit.aes = FALSE,
        sides = "b",
        alpha = 0.20
    ) +
    facet_wrap(~ variable, scales = "free_x") +
    labs(
        title = "One Dimensional Partial Dependence Profiles",
        subtitle = "Comparison of lstat and rm with support rugs",
        x = "Feature value",
        y = "Average predicted medv"
    ) +
    theme_minimal()

The two panels highlight complementary directions of association. As lstat increases, the average prediction decreases, while larger rm values are associated with higher predicted medv. In both cases, the curvature suggests that effects are not well summarized by a single linear slope.
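The manual implementation above keeps the mechanics explicit. In applied work, packaged implementations such as the pdp package perform the same computation. A brief sketch, assuming pdp is installed (grid construction defaults may differ slightly from the manual version, so curves can differ in resolution but not in substance):

```r
library(pdp)

# PD profile for lstat from the fitted forest; 'train' supplies the
# complement-feature rows that are averaged over at each grid point
pd_lstat_pkg <- partial(rf_fit, pred.var = "lstat", train = train)
plotPartial(pd_lstat_pkg)
```

Comparing a packaged profile against a hand-rolled one is a useful sanity check on both.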

15.6 Individual Conditional Expectation and Heterogeneity

To inspect whether this average pattern is broadly shared, we compute ICE curves for a subsample of observations.

ice_curves <- function(model, data, variable, grid_values, ids = NULL) {
    if (is.null(ids)) {
        ids <- seq_len(nrow(data))
    }

    ice_list <- lapply(ids, function(i) {
        obs <- data[i, , drop = FALSE]
        preds <- sapply(grid_values, function(g) {
            new_obs <- obs
            new_obs[[variable]] <- g
            as.numeric(predict(model, newdata = new_obs))
        })

        tibble::tibble(
            obs_id = i,
            value = grid_values,
            ice = preds
        )
    })

    dplyr::bind_rows(ice_list)
}

set.seed(321)
ice_ids <- sample(seq_len(nrow(train)), size = 60)

ice_lstat <- ice_curves(
    model = rf_fit,
    data = train,
    variable = "lstat",
    grid_values = lstat_grid,
    ids = ice_ids
)
ice_lstat |>
    ggplot(aes(x = value, y = ice, group = obs_id)) +
    geom_line(color = "gray50", alpha = 0.30) +
    geom_line(
        data = pd_lstat,
        aes(x = value, y = pd),
        inherit.aes = FALSE,
        color = "darkred",
        linewidth = 1.1
    ) +
    labs(
        title = "ICE Curves with PD Overlay for lstat",
        subtitle = "Gray lines: individual profiles, red line: average profile",
        x = "lstat",
        y = "Predicted medv"
    ) +
    theme_minimal()

The spread of gray lines shows that observations do not all respond identically. PD remains useful as a summary, but ICE reveals that a single average curve can hide meaningful variation.

15.7 Centered ICE for Shape Comparison

Vertical differences between ICE curves often reflect baseline prediction level rather than slope behavior. Centering helps isolate shape differences.

ice_centered <- ice_lstat |>
    group_by(obs_id) |>
    mutate(ice_c = ice - first(ice)) |>
    ungroup()

pd_centered <- pd_lstat |>
    mutate(pd_c = pd - first(pd))
ice_centered |>
    ggplot(aes(x = value, y = ice_c, group = obs_id)) +
    geom_line(color = "gray55", alpha = 0.30) +
    geom_line(
        data = pd_centered,
        aes(x = value, y = pd_c),
        inherit.aes = FALSE,
        color = "navy",
        linewidth = 1.1
    ) +
    labs(
        title = "Centered ICE (c-ICE) for lstat",
        subtitle = "Centering improves comparison of profile shape",
        x = "lstat",
        y = "Change in prediction from reference grid point"
    ) +
    theme_minimal()

In this representation, we can compare profile slopes more directly. Divergence in c-ICE trajectories suggests interaction effects or observation specific nonlinearities.
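One way to make the slope comparison concrete is to reduce each c-ICE curve to its average finite-difference slope. The sketch below uses hypothetical centered curves; the same computation applies per obs_id on ice_centered:

```r
grid <- c(5, 10, 15, 20)
cice <- rbind(
    c(0, -2, -4, -6),    # hypothetical: steady decline
    c(0, -1, -1.5, -2),  # hypothetical: early drop, then flattening
    c(0, -3, -5, -6)     # hypothetical: steep start, saturating
)

# Average slope of each centered curve via finite differences
avg_slope <- apply(cice, 1, function(y) mean(diff(y) / diff(grid)))
round(avg_slope, 3)  # -0.400 -0.133 -0.400
```

Note that the first and third curves share the same average slope despite different curvature, a reminder that scalar summaries complement, rather than replace, inspection of the full trajectories.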

15.8 Quantifying ICE Dispersion Along the Grid

Visual inspection is important, but a compact numeric summary also helps. We compute the standard deviation of ICE predictions at each grid point.

ice_dispersion <- ice_lstat |>
    group_by(value) |>
    summarise(sd_ice = sd(ice), .groups = "drop")

dispersion_summary <- ice_dispersion |>
    summarise(
        min_sd = min(sd_ice),
        median_sd = median(sd_ice),
        max_sd = max(sd_ice)
    )

dispersion_summary |>
    gt() |>
    cols_label(
        min_sd = "Minimum SD",
        median_sd = "Median SD",
        max_sd = "Maximum SD"
    ) |>
    fmt_number(columns = everything(), decimals = 3)
Minimum SD   Median SD   Maximum SD
     4.513       5.522        6.263
ice_dispersion |>
    ggplot(aes(x = value, y = sd_ice)) +
    geom_line(color = "purple", linewidth = 1) +
    labs(
        title = "Dispersion of ICE Curves Across lstat",
        x = "lstat",
        y = "SD of individual predictions"
    ) +
    theme_minimal()

Higher dispersion regions indicate where individual profiles differ more strongly. These zones are often where interaction structure is more pronounced.

15.9 Two Dimensional Partial Dependence

When interactions are suspected between two predictors, a two dimensional PD surface can provide a useful global view.

partial_dependence_2d <- function(model, data, var1, var2, grid1, grid2) {
    grid <- expand.grid(
        value1 = grid1,
        value2 = grid2,
        KEEP.OUT.ATTRS = FALSE,
        stringsAsFactors = FALSE
    )

    pd_vals <- mapply(function(g1, g2) {
        new_data <- data
        new_data[[var1]] <- g1
        new_data[[var2]] <- g2
        mean(predict(model, newdata = new_data))
    }, grid$value1, grid$value2)

    tibble::tibble(
        value1 = grid$value1,
        value2 = grid$value2,
        pd = pd_vals
    )
}

rm_grid_2d <- seq(
    from = as.numeric(quantile(train$rm, 0.05)),
    to = as.numeric(quantile(train$rm, 0.95)),
    length.out = 30
)

lstat_grid_2d <- seq(
    from = as.numeric(quantile(train$lstat, 0.05)),
    to = as.numeric(quantile(train$lstat, 0.95)),
    length.out = 30
)

pd_rm_lstat <- partial_dependence_2d(
    model = rf_fit,
    data = train,
    var1 = "rm",
    var2 = "lstat",
    grid1 = rm_grid_2d,
    grid2 = lstat_grid_2d
)
pd_rm_lstat |>
    ggplot(aes(x = value1, y = value2, z = pd)) +
    geom_tile(aes(fill = pd)) +
    geom_contour(color = "white", alpha = 0.5) +
    scale_fill_gradient(low = "#f7fbff", high = "#08306b") +
    labs(
        title = "Two Dimensional Partial Dependence Surface",
        subtitle = "Joint profile for rm and lstat",
        x = "rm",
        y = "lstat",
        fill = "Predicted medv"
    ) +
    theme_minimal()

This surface supports a joint reading. Predictions tend to increase with larger rm and decrease with larger lstat, and contour geometry helps us assess whether the combined pattern is approximately additive or more interaction-like.
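A simple descriptive check of additivity, in the spirit of Friedman's H statistic, compares the joint profile with the sum of centered one-dimensional profiles. The sketch below uses a hypothetical model with a known interaction term; the same comparison could be run on pd_rm_lstat, pd_rm, and pd_lstat after aligning their grids:

```r
set.seed(1)
f_hat <- function(x1, x2) x1 + x2 + 0.5 * x1 * x2  # known interaction
obs <- data.frame(x1 = runif(200), x2 = runif(200))
g1 <- seq(0, 1, length.out = 11)
g2 <- seq(0, 1, length.out = 11)

# One-dimensional PD profiles (averaging over the observed complement)
pd1 <- sapply(g1, function(v) mean(f_hat(v, obs$x2)))
pd2 <- sapply(g2, function(v) mean(f_hat(obs$x1, v)))

# With both features focal there is no complement left to average over,
# so on the grid the joint PD surface is the model itself
pd12 <- outer(g1, g2, f_hat)

# Residual of the joint surface after removing the additive approximation
center <- function(v) v - mean(v)
additive <- outer(center(pd1), center(pd2), `+`)
resid <- center(pd12) - additive
h2 <- sum(resid^2) / sum(center(pd12)^2)
h2  # near zero for additive models; clearly positive here
```

Values near zero indicate an approximately additive joint profile; larger values point to interaction-like structure. This grid-based version is a heuristic only; Friedman's H statistic evaluates the profiles at observed data points.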

15.10 Practical Limitations and Interpretation Cautions

PD and ICE are powerful, but they rely on assumptions that deserve explicit attention.

  • Correlation among predictors can create unrealistic combinations when we vary one feature while keeping others fixed from observed rows. This can affect profile shape.
  • Support awareness matters. Curves are most reliable in regions with adequate data density, which is why rugs or density overlays are useful.
  • Averaging can obscure heterogeneity. PD should be interpreted jointly with ICE whenever interaction effects are plausible.
  • Descriptive, not interventional. These curves describe model predictions under observed covariate structure, not causal effects of manipulating predictors.

In practice, we often combine profile methods with other interpretability summaries, such as permutation importance and local attributions, to obtain a more balanced account.

15.11 Summary and Key Takeaways

Partial dependence and ICE add functional detail to model interpretation. Several points from this chapter are central:

  • Partial dependence is a global average profile over non focal predictors.
  • ICE exposes observation level variation that PD can hide.
  • Centered ICE improves shape comparison by removing baseline offsets.
  • Two dimensional PD helps inspect joint structure and potential interactions between predictors.
  • Interpretation quality depends on data support and dependence structure, so profile reading should be paired with caution and complementary diagnostics.

Together, these tools provide a richer descriptive view of model behavior than variable ranking alone.

15.12 Looking Ahead

The next chapter turns to Shapley value based explanations. Whereas partial dependence and ICE focus on functional profiles across feature values, Shapley methods decompose individual predictions into additive feature contributions. This gives us a complementary local perspective, and helps connect global patterns to case specific explanations.