18  Automated Feature Engineering and Interaction Discovery

18.2 Why Feature Engineering Can Be Automated in Descriptive Work

Earlier in the book, we repeatedly saw that model conclusions depend on how the signal is represented. If an important interaction is omitted, an additive model may appear systematically biased in parts of the covariate space. If a relationship is nonlinear, raw linear terms can underfit even when model diagnostics look acceptable in aggregate.

A modest automated feature engineering workflow can support descriptive analysis by helping us:

  • detect candidate nonlinear structure that is not visible in additive baselines,
  • rank interaction candidates under a common cross validation design,
  • compare several near-optimal representations instead of one handcrafted specification,
  • document feature space uncertainty, not only parameter uncertainty.

As in the previous chapter, these are predictive and descriptive statements about fitted behavior under observed data conditions. They do not, by themselves, identify causal effects.

18.3 A Formal View: Search Over Feature Mappings

Let \(X \in \mathbb{R}^{n \times p}\) denote the original predictor matrix and \(y\) the outcome. A feature engineering procedure defines a mapping family

\[ \Phi = \{\phi_1, \dots, \phi_m\}, \quad \phi_j: \mathbb{R}^p \to \mathbb{R}^{q_j}, \]

where each mapping can represent, for example, polynomial terms, logarithmic transforms, or interactions. For a chosen modeling algorithm \(\mathcal{A}\) with parameters \(\theta\), we evaluate candidates of the form

\[ \hat{f}_{j,\theta}(x) = \mathcal{A}(\phi_j(x); \theta). \]

Automated feature engineering then becomes a structured optimization problem:

\[ (j^\star, \theta^\star) = \arg\min_{j,\theta} \widehat{R}(\hat{f}_{j,\theta}), \]

with \(\widehat{R}\) estimated by cross validation or a related resampling protocol.

In practice, this search can be large. A key methodological point is that search design (candidate library, screening rule, evaluation budget) is part of the inferential context and should be reported explicitly.
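The abstract search over \((j, \theta)\) can be made concrete with a small sketch. The code below is purely illustrative, assuming a toy predictor matrix, ordinary least squares as the algorithm \(\mathcal{A}\), and a three-element mapping family; names such as `phi` and `risks` are hypothetical and not part of the chapter's pipeline.

```r
# Minimal sketch of a mapping family Phi; all names are illustrative.
set.seed(1)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
y <- X[, 1] + X[, 2]^2 + rnorm(50, sd = 0.1)

Phi <- list(
    identity  = function(X) X,                       # phi_1: raw predictors
    quadratic = function(X) cbind(X, X^2),           # phi_2: add squared terms
    log       = function(X) cbind(X, log1p(abs(X)))  # phi_3: add log terms
)

# Evaluate each mapping with least squares as the algorithm A. For brevity
# this uses in-sample RMSE; in practice R-hat comes from cross validation.
risks <- vapply(Phi, function(phi) {
    Z <- phi(X)
    fit <- lm.fit(cbind(1, Z), y)
    sqrt(mean(fit$residuals^2))
}, numeric(1))

names(which.min(risks))
```

Because the quadratic mapping contains the data generating terms, it attains the lowest estimated risk here; with a resampled \(\widehat{R}\) the same loop structure applies unchanged.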

18.4 Running Example: Boston Housing With a Transparent Candidate Library

To remain consistent with the previous chapters, we use the Boston housing data and predict medv. We first fit a baseline additive linear model, then evaluate engineered candidates one at a time on top of that baseline.

data(Boston, package = "MASS")

set.seed(123)
idx_test <- sample(seq_len(nrow(Boston)), size = floor(0.2 * nrow(Boston)))

analysis_data <- Boston[-idx_test, ]
test_data <- Boston[idx_test, ]

predictors <- c("lstat", "rm", "ptratio", "indus", "nox", "crim")
target <- "medv"

baseline_formula <- as.formula(
    paste(target, "~", paste(predictors, collapse = " + "))
)

nrow(analysis_data)
[1] 405
nrow(test_data)
[1] 101

We keep a final holdout test set untouched during candidate screening. This separation helps reduce optimism after searching across many engineered terms.

18.5 A Reproducible Evaluation Engine

We implement helper functions for fold assignment and cross validated RMSE. The code below is intentionally compact and explicit, so that each design choice is auditable.

rmse <- function(actual, predicted) {
    sqrt(mean((actual - predicted)^2))
}

make_folds <- function(n, k = 5, seed = 123) {
    set.seed(seed)
    sample(rep(seq_len(k), length.out = n))
}

cv_rmse_lm <- function(data, formula_obj, fold_id) {
    # Uses the globally defined `target` to extract the outcome column.
    k <- length(unique(fold_id))
    fold_errors <- numeric(k)

    for (fold in seq_len(k)) {
        train_fold <- data[fold_id != fold, , drop = FALSE]
        valid_fold <- data[fold_id == fold, , drop = FALSE]

        fit <- lm(formula_obj, data = train_fold)
        pred <- as.numeric(predict(fit, newdata = valid_fold))
        fold_errors[fold] <- rmse(valid_fold[[target]], pred)
    }

    tibble::tibble(
        mean_rmse = mean(fold_errors),
        sd_rmse = sd(fold_errors)
    )
}

18.6 Constructing Candidate Features

For each predictor, we generate two transformation candidates: a quadratic term and a logarithmic transform (log1p, which remains defined at zero). We also generate all pairwise interactions among the selected predictors.

transform_terms <- unlist(lapply(predictors, function(v) {
    c(
        paste0("I(", v, "^2)"),
        paste0("log1p(", v, ")")
    )
}))

interaction_terms <- combn(predictors, 2, simplify = FALSE) |>
    lapply(function(pair) paste(pair, collapse = ":")) |>
    unlist()

candidate_terms <- c(transform_terms, interaction_terms)

length(candidate_terms)
[1] 27
head(candidate_terms)
[1] "I(lstat^2)"     "log1p(lstat)"   "I(rm^2)"        "log1p(rm)"     
[5] "I(ptratio^2)"   "log1p(ptratio)"

This library is not exhaustive, but it is broad enough to illustrate automated screening. In applied projects, candidate generation is often guided by domain constraints, monotonicity expectations, and operational interpretability requirements.
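As one sketch of domain guided pruning, interaction candidates can be restricted to those involving a focal predictor. The constraint below (keep only interactions involving rm) is a hypothetical example for illustration, not a recommendation for the Boston data, and the small term vector stands in for the full library built above.

```r
# Hedged sketch: prune interaction candidates with a domain driven rule.
# The focal-predictor constraint here is purely illustrative.
candidate_terms <- c("I(lstat^2)", "log1p(lstat)",
                     "lstat:rm", "rm:ptratio", "nox:crim")

is_interaction <- grepl(":", candidate_terms, fixed = TRUE)
involves_rm <- grepl("\\brm\\b", candidate_terms)  # rm on either side of ":"

# Keep all transformations, but only interactions that involve rm.
kept <- candidate_terms[!is_interaction | involves_rm]
kept
```

The same filtering pattern extends to monotonicity or interpretability constraints: encode the rule as a predicate over term labels and subset the library before screening.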

18.7 Screening Candidates by Cross Validated Improvement

We now evaluate each candidate term by augmenting the baseline formula with one additional engineered feature and measuring out of fold RMSE.

fold_id <- make_folds(n = nrow(analysis_data), k = 5, seed = 456)

baseline_cv <- cv_rmse_lm(
    data = analysis_data,
    formula_obj = baseline_formula,
    fold_id = fold_id
)

screen_results <- lapply(candidate_terms, function(term) {
    formula_term <- as.formula(
        paste(target, "~", paste(c(predictors, term), collapse = " + "))
    )

    res <- cv_rmse_lm(
        data = analysis_data,
        formula_obj = formula_term,
        fold_id = fold_id
    )

    tibble::tibble(
        term = term,
        term_type = ifelse(grepl(":", term), "Interaction", "Transformation"),
        mean_rmse = res$mean_rmse,
        sd_rmse = res$sd_rmse,
        delta_rmse = baseline_cv$mean_rmse - res$mean_rmse
    )
}) |>
    bind_rows() |>
    arrange(desc(delta_rmse), mean_rmse)

baseline_cv |>
    mutate(model = "Baseline additive model") |>
    gt() |>
    cols_label(
        model = "Model",
        mean_rmse = "CV mean RMSE",
        sd_rmse = "CV SD"
    ) |>
    fmt_number(columns = c(mean_rmse, sd_rmse), decimals = 3)
Model                      CV mean RMSE   CV SD
Baseline additive model           5.314   0.742
screen_results |>
    slice_head(n = 12) |>
    mutate(rank = row_number()) |>
    dplyr::select(rank, term, term_type, mean_rmse, sd_rmse, delta_rmse) |>
    gt() |>
    cols_label(
        rank = "Rank",
        term = "Candidate term",
        term_type = "Type",
        mean_rmse = "CV mean RMSE",
        sd_rmse = "CV SD",
        delta_rmse = "Improvement vs baseline"
    ) |>
    fmt_number(columns = c(mean_rmse, sd_rmse, delta_rmse), decimals = 3)
Rank  Candidate term  Type            CV mean RMSE  CV SD  Improvement vs baseline
   1  lstat:rm        Interaction            4.570  0.825                    0.744
   2  log1p(lstat)    Transformation         4.629  0.471                    0.685
   3  I(rm^2)         Transformation         4.664  0.998                    0.650
   4  log1p(rm)       Transformation         4.703  1.036                    0.611
   5  I(lstat^2)      Transformation         4.791  0.477                    0.523
   6  rm:ptratio      Interaction            4.792  0.761                    0.522
   7  rm:indus        Interaction            4.943  0.882                    0.371
   8  rm:nox          Interaction            4.992  0.920                    0.322
   9  nox:crim        Interaction            5.202  0.755                    0.112
  10  log1p(crim)     Transformation         5.263  0.743                    0.051
  11  log1p(nox)      Transformation         5.276  0.655                    0.038
  12  I(crim^2)       Transformation         5.278  0.750                    0.036

The ranking helps us identify promising terms, while the SD column provides context about stability across folds. Small differences should be interpreted cautiously when they are comparable to fold level variability.

18.8 Building an Engineered Model From Top Candidates

A practical compromise between flexibility and readability is to keep only the strongest positive candidates. Here we retain up to five terms with positive estimated gain.

selected_terms <- screen_results |>
    filter(delta_rmse > 0) |>
    slice_head(n = 5) |>
    pull(term)

if (length(selected_terms) == 0) {
    selected_terms <- screen_results |> slice_head(n = 1) |> pull(term)
}

engineered_formula <- as.formula(
    paste(target, "~", paste(c(predictors, selected_terms), collapse = " + "))
)

engineered_formula
medv ~ lstat + rm + ptratio + indus + nox + crim + lstat:rm + 
    log1p(lstat) + I(rm^2) + log1p(rm) + I(lstat^2)
baseline_fit <- lm(baseline_formula, data = analysis_data)
engineered_fit <- lm(engineered_formula, data = analysis_data)

pred_baseline <- as.numeric(predict(baseline_fit, newdata = test_data))
pred_engineered <- as.numeric(predict(engineered_fit, newdata = test_data))

comparison_tbl <- tibble::tibble(
    model = c("Baseline additive", "Engineered linear"),
    test_rmse = c(
        rmse(test_data[[target]], pred_baseline),
        rmse(test_data[[target]], pred_engineered)
    )
)

comparison_tbl |>
    gt() |>
    cols_label(
        model = "Model",
        test_rmse = "Holdout RMSE"
    ) |>
    fmt_number(columns = test_rmse, decimals = 3)
Model              Holdout RMSE
Baseline additive         5.177
Engineered linear         4.540

This test set comparison is only one realization, but it indicates whether screened candidates transfer beyond the resampling environment used during selection.

18.9 Inspecting the Interaction Signal

Screening tables are useful, but interaction discovery is often easier to communicate with a response surface. We extract the top ranked interaction term, if available, and visualize fitted values while keeping other predictors at their analysis set medians.

top_interaction <- screen_results |>
    filter(term_type == "Interaction") |>
    slice_head(n = 1)

top_interaction |>
    gt() |>
    cols_label(
        term = "Top Interaction Term",
        term_type = "Type",
        mean_rmse = "CV mean RMSE",
        sd_rmse = "CV SD",
        delta_rmse = "Improvement vs baseline"
    ) |>
    fmt_number(columns = c(mean_rmse, sd_rmse, delta_rmse), decimals = 3)
Top Interaction Term  Type         CV mean RMSE  CV SD  Improvement vs baseline
lstat:rm              Interaction         4.570  0.825                    0.744
if (nrow(top_interaction) == 1) {
    vars <- strsplit(top_interaction$term[1], ":", fixed = TRUE)[[1]]
    v1 <- vars[1]
    v2 <- vars[2]

    grid_v1 <- seq(
        as.numeric(quantile(analysis_data[[v1]], 0.1)),
        as.numeric(quantile(analysis_data[[v1]], 0.9)),
        length.out = 45
    )

    grid_v2 <- seq(
        as.numeric(quantile(analysis_data[[v2]], 0.1)),
        as.numeric(quantile(analysis_data[[v2]], 0.9)),
        length.out = 45
    )

    surface_data <- expand.grid(
        x1 = grid_v1,
        x2 = grid_v2
    )
    names(surface_data) <- c(v1, v2)

    for (p in predictors) {
        if (!(p %in% c(v1, v2))) {
            surface_data[[p]] <- median(analysis_data[[p]])
        }
    }

    surface_data$pred <- as.numeric(predict(engineered_fit, newdata = surface_data))

    surface_data |>
        ggplot(aes(x = .data[[v1]], y = .data[[v2]], z = pred, fill = pred)) +
        geom_raster(interpolate = TRUE) +
        geom_contour(color = "white", alpha = 0.6) +
        scale_fill_gradient(low = "#f7fbff", high = "#08306b") +
        labs(
            title = "Predicted Response Surface for the Top Interaction",
            subtitle = paste("Interaction candidate:", top_interaction$term[1]),
            x = v1,
            y = v2,
            fill = "Predicted medv"
        ) +
        theme_minimal()
}

The surface should be read as model behavior conditional on fixed values of remaining predictors. It is a useful descriptive map, but not direct evidence of causal interaction.

18.10 Practical Considerations and Failure Modes

Automated feature engineering can increase descriptive resolution, but several risks are worth monitoring.

  • Search induced optimism can appear when large candidate libraries are screened with weak validation protocols.
  • Collinearity inflation can destabilize coefficient level interpretation in engineered linear models.
  • Data leakage can occur if transformations use information outside the analysis fold.
  • Interpretability drift can arise when many engineered terms are retained without pruning.

These issues do not invalidate automated workflows. They highlight why fold aware preprocessing, explicit holdout testing, and conservative reporting are central parts of the methodology.
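As one concrete illustration of fold aware preprocessing, scaling parameters can be estimated on the training fold only and then applied to the validation fold, so no information crosses the fold boundary. The helper `scale_train_apply` and the toy data frame below are hypothetical, introduced only to make the leakage rule tangible.

```r
# Hedged sketch of fold aware preprocessing: centers and scales come from
# the training fold alone and are reused, unchanged, on the validation fold.
scale_train_apply <- function(train, valid, cols) {
    centers <- vapply(train[cols], mean, numeric(1))
    scales  <- vapply(train[cols], sd, numeric(1))
    rescale <- function(df) {
        df[cols] <- Map(function(x, m, s) (x - m) / s, df[cols], centers, scales)
        df
    }
    list(train = rescale(train), valid = rescale(valid))
}

# Toy example (illustrative values, not the Boston data):
d <- data.frame(x = c(1, 2, 3, 4, 10), y = c(2, 4, 6, 8, 20))
folds <- scale_train_apply(d[1:4, ], d[5, , drop = FALSE], cols = "x")
```

Note that the validation row is standardized with the training fold's mean and standard deviation; recomputing those statistics on the pooled data would be exactly the leakage described above.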

18.11 Reporting Guidelines for Automated Feature Engineering

For transparent descriptive reporting, it is often helpful to document:

  • candidate generation rules (which transforms and interactions were considered),
  • validation design and random seeds,
  • selection criterion (for example, improvement in cross validated RMSE),
  • final retained terms and their practical contribution,
  • holdout performance relative to a simpler baseline.

When several engineered specifications are near tied, presenting a small candidate set can be more informative than emphasizing a single winner.
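One way to operationalize "near tied" is to report every candidate within one fold level standard error of the best. The helper `near_tied` below is a hypothetical sketch, and the toy numbers are taken from the screening table earlier in the chapter.

```r
# Hedged sketch: retain all candidates whose CV mean RMSE lies within one
# standard error of the best candidate, rather than a single winner.
near_tied <- function(results, k = 5) {
    best <- results[which.min(results$mean_rmse), ]
    threshold <- best$mean_rmse + best$sd_rmse / sqrt(k)
    results[results$mean_rmse <= threshold, ]
}

# Toy input using values from the screening table above:
toy <- data.frame(
    term      = c("lstat:rm", "log1p(lstat)", "nox:crim"),
    mean_rmse = c(4.570, 4.629, 5.202),
    sd_rmse   = c(0.825, 0.471, 0.755)
)

near_tied(toy)
```

Here both top candidates fall inside the one standard error band, so a report would present them side by side rather than declaring lstat:rm the sole winner.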

18.12 Summary and Key Takeaways

  • Automated feature engineering extends AutoML logic from model space to feature space.
  • Interaction and transformation screening can reveal predictive structure not captured by additive baselines.
  • Cross validated ranking is most useful when read together with variability measures and holdout confirmation.
  • Interaction surfaces support interpretation, while remaining conditional predictive summaries.
  • Transparent documentation of the candidate library and validation design is essential for credible descriptive conclusions.

18.13 Looking Ahead

This chapter focused on representation search in a controlled tabular workflow. In the next chapters, we turn to integrated case studies where model search, feature engineering, and interpretation are combined in full descriptive analyses for policy, health, and business contexts.