17 AutoML as a Descriptive Tool

17.2 Why AutoML Can Be Useful for Descriptive Analysis

When we fit one model at a time, we often learn deeply about that model, but we can miss broader patterns. A modest AutoML workflow helps us answer additional questions:

  • whether a flexible nonlinear model materially outperforms a simpler baseline,
  • how sensitive performance is to hyperparameter choices,
  • whether several candidate models are practically tied,
  • how robust model rankings are across resamples.

These are descriptive questions about predictive structure and model behavior under the observed data distribution. As in earlier chapters, they should not be interpreted as causal claims.

17.3 AutoML as an Optimization Problem

A generic AutoML procedure searches over a space of pipelines and minimizes an estimated generalization loss.

Let \(\mathcal{A}\) denote a set of algorithms, and let \(\Lambda_a\) be the hyperparameter space for algorithm \(a \in \mathcal{A}\). A candidate is a pair \((a, \lambda)\) with \(\lambda \in \Lambda_a\). For loss function \(L\), the target can be written as (Feurer et al. 2019; Hutter, Kotthoff, and Vanschoren 2019):

\[ (a^\star, \lambda^\star) = \arg\min_{a \in \mathcal{A},\, \lambda \in \Lambda_a} \widehat{R}(a, \lambda), \]

where \(\widehat{R}(a, \lambda)\) is an estimate of predictive risk (for example, cross validated RMSE in regression).
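With \(K\)-fold cross validation and squared error loss, one common version of this estimator averages per fold RMSE. Writing \(V_k\) for the validation indices of fold \(k\) and \(\hat{f}_{a,\lambda}^{(-k)}\) for the model fit without fold \(k\):

\[ \widehat{R}(a, \lambda) = \frac{1}{K} \sum_{k=1}^{K} \sqrt{ \frac{1}{|V_k|} \sum_{i \in V_k} \left( y_i - \hat{f}_{a,\lambda}^{(-k)}(x_i) \right)^2 }. \]

This is exactly the mean of fold level RMSE values that the helper functions in this chapter compute.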

In practice, an AutoML system also includes design choices that matter substantively:

  • search strategy (grid, random, Bayesian, evolutionary),
  • resampling protocol (holdout, cross validation, repeated cross validation),
  • budget constraints (time, number of evaluations, early stopping),
  • objective definition (single metric or multi objective with complexity penalties).

For descriptive work, these choices should be reported explicitly because they shape conclusions.
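To make the search-strategy choice concrete, the following minimal sketch contrasts random search with grid enumeration for a tree model. The function draw_candidate and its parameter ranges are illustrative assumptions, not part of the workflow developed below.

```r
# Minimal sketch of random search: draw candidates instead of enumerating a grid.
# draw_candidate() and its ranges are illustrative, not the chapter's search space.
set.seed(1)
draw_candidate <- function() {
    list(
        cp = runif(1, min = 0.001, max = 0.05),  # complexity parameter range
        maxdepth = sample(2:8, size = 1)         # allowed tree depths
    )
}

# Ten randomly drawn (cp, maxdepth) candidates; a grid of the same size would
# instead enumerate fixed values.
candidates <- replicate(10, draw_candidate(), simplify = FALSE)
length(candidates)
```

Random search covers a continuous range of cp values, whereas a grid evaluates only the points we enumerate in advance; both strategies plug into the same evaluation machinery.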

17.4 Running Example: A Lightweight AutoML Workflow on Boston Housing

To keep continuity with recent chapters, we use the Boston housing data and predict medv. We compare three model families:

  • linear regression,
  • regression trees (rpart),
  • random forests.

The workflow is intentionally transparent and uses only packages already present in this book sequence.

data(Boston, package = "MASS")

set.seed(123)
idx_test <- sample(seq_len(nrow(Boston)), size = floor(0.2 * nrow(Boston)))

analysis_data <- Boston[-idx_test, ]
test_data <- Boston[idx_test, ]

predictors <- c("lstat", "rm", "ptratio", "indus", "nox", "crim")
target <- "medv"

model_formula <- as.formula(paste(target, "~", paste(predictors, collapse = " + ")))

nrow(analysis_data)
[1] 405
nrow(test_data)
[1] 101

We reserve a final test set and perform model search only on analysis_data. This separation helps us avoid optimistic reporting when comparing many candidates.

17.5 A Reproducible Search and Evaluation Engine

We now define helper functions for fold creation, RMSE computation, and candidate evaluation.

rmse <- function(actual, predicted) {
    sqrt(mean((actual - predicted)^2))
}

make_folds <- function(n, k = 5, seed = 123) {
    set.seed(seed)
    fold_id <- sample(rep(seq_len(k), length.out = n))
    fold_id
}
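A quick check of the fold helper, assuming make_folds() as defined above; fold_id_check is a throwaway name used only for this demonstration.

```r
# make_folds() produces balanced assignments: fold sizes differ by at most one.
fold_id_check <- make_folds(n = 23, k = 5, seed = 1)
table(fold_id_check)
# fold sizes: 5 5 5 4 4
```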

evaluate_candidate <- function(data,
                               outcome,
                               fold_id,
                               fit_fun,
                               pred_fun,
                               params = list()) {
    k <- length(unique(fold_id))
    fold_rmse <- numeric(k)

    for (fold in seq_len(k)) {
        train_fold <- data[fold_id != fold, , drop = FALSE]
        valid_fold <- data[fold_id == fold, , drop = FALSE]

        fit_obj <- do.call(fit_fun, c(list(train_data = train_fold), params))
        pred <- pred_fun(fit_obj, valid_fold)
        fold_rmse[fold] <- rmse(valid_fold[[outcome]], pred)
    }

    tibble::tibble(
        mean_rmse = mean(fold_rmse),
        sd_rmse = sd(fold_rmse)
    )
}

Next, we define model specific fit and predict wrappers.

fit_lm_model <- function(train_data, formula_obj) {
    lm(formula_obj, data = train_data)
}

pred_lm_model <- function(model_obj, new_data) {
    as.numeric(predict(model_obj, newdata = new_data))
}

fit_tree_model <- function(train_data, formula_obj, cp, maxdepth, minsplit) {
    rpart(
        formula_obj,
        data = train_data,
        method = "anova",
        control = rpart.control(cp = cp, maxdepth = maxdepth, minsplit = minsplit)
    )
}

pred_tree_model <- function(model_obj, new_data) {
    as.numeric(predict(model_obj, newdata = new_data))
}

fit_rf_model <- function(train_data, formula_obj, mtry, ntree, seed = 123) {
    set.seed(seed)
    randomForest(
        formula_obj,
        data = train_data,
        mtry = mtry,
        ntree = ntree
    )
}

pred_rf_model <- function(model_obj, new_data) {
    as.numeric(predict(model_obj, newdata = new_data))
}
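Before building the full candidate grid, a quick smoke test of one fit/predict pair confirms the wrapper contract. This continues from the objects defined above; smoke_fit is a throwaway name and the configuration is arbitrary.

```r
# Fit one tree configuration and predict on a few held-out rows.
smoke_fit <- fit_tree_model(
    train_data = analysis_data,
    formula_obj = model_formula,
    cp = 0.01, maxdepth = 3, minsplit = 20
)
head(pred_tree_model(smoke_fit, test_data), 3)
```

Each wrapper returns a numeric vector of predictions, which is the only property evaluate_candidate relies on.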

17.6 Candidate Space and Leaderboard Construction

We specify a compact candidate space, evaluate each candidate with five fold cross validation, and collect results in one table.

fold_id <- make_folds(n = nrow(analysis_data), k = 5, seed = 456)

results_list <- list()
row_counter <- 1

# 1) Linear regression baseline
lm_eval <- evaluate_candidate(
    data = analysis_data,
    outcome = target,
    fold_id = fold_id,
    fit_fun = fit_lm_model,
    pred_fun = pred_lm_model,
    params = list(formula_obj = model_formula)
)

results_list[[row_counter]] <- tibble::tibble(
    family = "Linear regression",
    config = "standard OLS",
    mean_rmse = lm_eval$mean_rmse,
    sd_rmse = lm_eval$sd_rmse
)
row_counter <- row_counter + 1

# 2) Regression tree grid
tree_grid <- expand.grid(
    cp = c(0.001, 0.01),
    maxdepth = c(3, 6),
    minsplit = c(10, 20)
)

for (i in seq_len(nrow(tree_grid))) {
    this_eval <- evaluate_candidate(
        data = analysis_data,
        outcome = target,
        fold_id = fold_id,
        fit_fun = fit_tree_model,
        pred_fun = pred_tree_model,
        params = list(
            formula_obj = model_formula,
            cp = tree_grid$cp[i],
            maxdepth = tree_grid$maxdepth[i],
            minsplit = tree_grid$minsplit[i]
        )
    )

    results_list[[row_counter]] <- tibble::tibble(
        family = "Regression tree",
        config = paste0(
            "cp=", tree_grid$cp[i],
            ", maxdepth=", tree_grid$maxdepth[i],
            ", minsplit=", tree_grid$minsplit[i]
        ),
        mean_rmse = this_eval$mean_rmse,
        sd_rmse = this_eval$sd_rmse
    )
    row_counter <- row_counter + 1
}

# 3) Random forest grid
rf_grid <- expand.grid(
    mtry = c(2, 3, 4),
    ntree = c(200, 400)
)

for (i in seq_len(nrow(rf_grid))) {
    this_eval <- evaluate_candidate(
        data = analysis_data,
        outcome = target,
        fold_id = fold_id,
        fit_fun = fit_rf_model,
        pred_fun = pred_rf_model,
        params = list(
            formula_obj = model_formula,
            mtry = rf_grid$mtry[i],
            ntree = rf_grid$ntree[i],
            seed = 100 + i
        )
    )

    results_list[[row_counter]] <- tibble::tibble(
        family = "Random forest",
        config = paste0("mtry=", rf_grid$mtry[i], ", ntree=", rf_grid$ntree[i]),
        mean_rmse = this_eval$mean_rmse,
        sd_rmse = this_eval$sd_rmse
    )
    row_counter <- row_counter + 1
}

leaderboard <- bind_rows(results_list) |>
    arrange(mean_rmse)

leaderboard |>
    mutate(rank = row_number()) |>
    dplyr::select(rank, family, config, mean_rmse, sd_rmse) |>
    gt() |>
    cols_label(
        rank = "Rank",
        family = "Model family",
        config = "Configuration",
        mean_rmse = "CV mean RMSE",
        sd_rmse = "CV SD"
    ) |>
    fmt_number(columns = c(mean_rmse, sd_rmse), decimals = 3)
Rank Model family Configuration CV mean RMSE CV SD
1 Random forest mtry=2, ntree=400 3.562 0.546
2 Random forest mtry=2, ntree=200 3.595 0.547
3 Random forest mtry=3, ntree=400 3.621 0.538
4 Random forest mtry=3, ntree=200 3.624 0.586
5 Random forest mtry=4, ntree=400 3.728 0.574
6 Random forest mtry=4, ntree=200 3.729 0.568
7 Regression tree cp=0.001, maxdepth=6, minsplit=10 4.567 0.906
8 Regression tree cp=0.01, maxdepth=6, minsplit=10 4.691 0.833
9 Regression tree cp=0.001, maxdepth=6, minsplit=20 4.725 0.825
10 Regression tree cp=0.001, maxdepth=3, minsplit=10 4.859 0.838
11 Regression tree cp=0.01, maxdepth=3, minsplit=10 4.879 0.850
12 Regression tree cp=0.001, maxdepth=3, minsplit=20 4.917 0.745
13 Regression tree cp=0.01, maxdepth=6, minsplit=20 4.953 0.896
14 Regression tree cp=0.01, maxdepth=3, minsplit=20 4.978 0.849
15 Linear regression standard OLS 5.314 0.742

top_plot <- leaderboard |>
    mutate(
        model_id = paste(family, config, sep = " | "),
        model_id = factor(model_id, levels = rev(model_id))
    )

top_plot |>
    ggplot(aes(x = model_id, y = mean_rmse, color = family)) +
    geom_point(size = 2.3) +
    geom_errorbar(aes(ymin = mean_rmse - sd_rmse, ymax = mean_rmse + sd_rmse), width = 0.2) +
    coord_flip() +
    labs(
        title = "AutoML Leaderboard by Cross Validated RMSE",
        subtitle = "Error bars represent ±1 fold standard deviation",
        x = NULL,
        y = "Cross validated RMSE"
    ) +
    theme_minimal()

The leaderboard provides more than a winner. It shows whether performance differences are large, modest, or practically negligible relative to resampling variability.

17.7 Selecting a Final Candidate and Evaluating on Holdout Data

We now fit the best ranked candidate on all analysis_data and evaluate once on the untouched test set.

best_row <- leaderboard[1, ]

best_row |>
    gt() |>
    cols_label(
        family = "Selected family",
        config = "Selected configuration",
        mean_rmse = "CV mean RMSE",
        sd_rmse = "CV SD"
    ) |>
    fmt_number(columns = c(mean_rmse, sd_rmse), decimals = 3)
Selected family Selected configuration CV mean RMSE CV SD
Random forest mtry=2, ntree=400 3.562 0.546

fit_best_model <- function(best_row, train_data, formula_obj) {
    if (best_row$family == "Linear regression") {
        return(fit_lm_model(train_data = train_data, formula_obj = formula_obj))
    }

    if (best_row$family == "Regression tree") {
        parts <- strsplit(best_row$config, ", ")[[1]]
        cp_val <- as.numeric(sub("cp=", "", parts[1]))
        maxdepth_val <- as.numeric(sub("maxdepth=", "", parts[2]))
        minsplit_val <- as.numeric(sub("minsplit=", "", parts[3]))

        return(
            fit_tree_model(
                train_data = train_data,
                formula_obj = formula_obj,
                cp = cp_val,
                maxdepth = maxdepth_val,
                minsplit = minsplit_val
            )
        )
    }

    parts <- strsplit(best_row$config, ", ")[[1]]
    mtry_val <- as.numeric(sub("mtry=", "", parts[1]))
    ntree_val <- as.numeric(sub("ntree=", "", parts[2]))

    fit_rf_model(
        train_data = train_data,
        formula_obj = formula_obj,
        mtry = mtry_val,
        ntree = ntree_val,
        seed = 999
    )
}

predict_best_model <- function(best_row, model_obj, new_data) {
    if (best_row$family == "Linear regression") {
        return(pred_lm_model(model_obj, new_data))
    }
    if (best_row$family == "Regression tree") {
        return(pred_tree_model(model_obj, new_data))
    }
    pred_rf_model(model_obj, new_data)
}

best_model <- fit_best_model(
    best_row = best_row,
    train_data = analysis_data,
    formula_obj = model_formula
)

test_pred <- predict_best_model(best_row, best_model, test_data)
test_rmse <- rmse(test_data[[target]], test_pred)

tibble::tibble(
    best_family = best_row$family,
    best_config = best_row$config,
    cv_rmse = best_row$mean_rmse,
    test_rmse = test_rmse
) |>
    gt() |>
    cols_label(
        best_family = "Selected family",
        best_config = "Selected configuration",
        cv_rmse = "CV RMSE",
        test_rmse = "Holdout RMSE"
    ) |>
    fmt_number(columns = c(cv_rmse, test_rmse), decimals = 3)
Selected family Selected configuration CV RMSE Holdout RMSE
Random forest mtry=2, ntree=400 3.562 3.750

The gap between cross validated and holdout RMSE offers a quick diagnostic of selection stability. A small gap suggests that the search process did not overfit strongly to resampling noise.
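One way to read this diagnostic is to scale the gap by the fold-level variability from cross validation; gap is an illustrative name, and the rounded value reflects the leaderboard numbers shown above.

```r
# CV-to-holdout gap, expressed in units of the best candidate's fold SD.
gap <- test_rmse - best_row$mean_rmse
round(gap / best_row$sd_rmse, 2)
# about 0.34 for the rounded values shown above: well within one fold SD
```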

17.8 Near Optimal Models and Practical Equivalence

A useful descriptive habit is to avoid over interpreting tiny leaderboard differences. One pragmatic criterion is to treat candidates as near optimal when their mean RMSE is within one standard deviation of the best candidate.

threshold <- leaderboard$mean_rmse[1] + leaderboard$sd_rmse[1]

near_optimal <- leaderboard |>
    filter(mean_rmse <= threshold) |>
    mutate(rank = row_number())

near_optimal |>
    dplyr::select(rank, family, config, mean_rmse, sd_rmse) |>
    gt() |>
    cols_label(
        rank = "Rank among near-optimal",
        family = "Model family",
        config = "Configuration",
        mean_rmse = "CV mean RMSE",
        sd_rmse = "CV SD"
    ) |>
    fmt_number(columns = c(mean_rmse, sd_rmse), decimals = 3)
Rank among near-optimal Model family Configuration CV mean RMSE CV SD
1 Random forest mtry=2, ntree=400 3.562 0.546
2 Random forest mtry=2, ntree=200 3.595 0.547
3 Random forest mtry=3, ntree=400 3.621 0.538
4 Random forest mtry=3, ntree=200 3.624 0.586
5 Random forest mtry=4, ntree=400 3.728 0.574
6 Random forest mtry=4, ntree=200 3.729 0.568

When several models are near tied, we can use secondary criteria, for example, computational cost, interpretability, or robustness under distribution shift assumptions.
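One such tie break can be scripted directly from the leaderboard. Parsing ntree out of the config string is a shortcut specific to how config was constructed above; the cost proxy here is illustrative.

```r
# Among near-tied candidates, sort by a cost proxy (ntree), then by RMSE:
# at similar accuracy, prefer the cheaper forest.
near_optimal |>
    dplyr::mutate(ntree = as.numeric(sub(".*ntree=", "", config))) |>
    dplyr::arrange(ntree, mean_rmse) |>
    dplyr::select(family, config, mean_rmse, ntree)
```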

17.9 Descriptive Interpretation of AutoML Results

In a descriptive workflow, the AutoML output is not only a model selector. It is also a structured summary of what the data seem to support under competing functional assumptions.

For instance:

  • if linear regression is competitive, the dominant structure may be close to additive linear signal,
  • if trees and forests improve substantially, nonlinearities or interactions likely matter,
  • if many configurations tie, model uncertainty should be communicated explicitly.

This interpretation remains predictive and data dependent. As discussed earlier in the book, moving from predictive regularities to causal statements requires additional design assumptions and identification strategies.

17.10 Good Practice for Reporting AutoML in Descriptive Work

To keep reporting transparent and reproducible, it is often useful to document:

  • search space (model families and hyperparameter ranges),
  • resampling protocol and random seeds,
  • optimization metric and any tie breaking rule,
  • final holdout performance,
  • whether conclusions rely on one model or a near optimal set.

These elements help readers understand how much of the final conclusion is driven by data signal versus search design.

17.11 Limits and Cautions

AutoML can improve coverage of model space, but it does not eliminate substantive judgment.

Key limits include:

  • performance metrics can hide subgroup specific errors,
  • search spaces can encode strong implicit priors,
  • correlated predictors can produce unstable model rankings,
  • repeated experimentation can still induce selection bias if holdout data are reused.

A careful descriptive workflow therefore combines automation with explicit diagnostics, domain context, and communication of uncertainty.

17.12 AutoML in Practice: Vertex AI and SageMaker Canvas

The R workflow above makes each modeling choice explicit. In practice, many teams implement similar logic through managed platforms. To connect this chapter to operational settings, we briefly consider two widely used platforms, Vertex AI AutoML and SageMaker Canvas, using the same evaluative lens developed earlier in the chapter.

17.12.1 Overview of Vertex AI AutoML

Vertex AI AutoML is Google’s managed environment for automated model training and selection. In tabular settings, it supports supervised tasks such as classification and regression by combining data preparation steps, candidate model search, hyperparameter tuning, and evaluation reporting within one workflow.

The platform workflow typically follows these steps:

  1. users upload tabular data and specify the prediction target;
  2. the system performs automatic data exploration and feature engineering;
  3. candidates are generated by trying multiple model architectures and hyperparameter combinations;
  4. each candidate is evaluated via cross validation on a held-out portion of training data;
  5. a leaderboard ranks models by performance metrics such as RMSE (regression) or AUC-ROC (classification);
  6. the top-ranked model or an ensemble is deployed for inference.

Vertex AI emphasizes broad model exploration and provides strong predictive results, especially when sufficient computational budget is available. The platform generates detailed evaluation reports including feature importance scores and model comparison tables that align with the leaderboard concept discussed in this chapter.

17.12.2 Overview of SageMaker Canvas

SageMaker Canvas is AWS’s visual interface for building machine learning models with limited coding. For tabular data, it offers guided workflows for data ingestion, model building, evaluation, and prediction generation, with optional quick build versus standard optimization modes.

The Canvas workflow is similarly structured:

  1. users import data via the web interface or cloud storage;
  2. the system performs exploratory data analysis and data profiling;
  3. the user specifies the target variable and builds a model with minimal configuration;
  4. Canvas automatically tries different model families and hyperparameters;
  5. results including predictions, model metrics, and feature importance are presented in a dashboard;
  6. predictions can be generated on new data directly within the interface.

SageMaker Canvas prioritizes usability and rapid iteration. In applied contexts, this design can lower entry barriers for analysts who need interpretable summaries and quick feedback before investing in heavier engineering pipelines. The platform includes model explanation features that communicate variable importance to non-technical stakeholders.

17.12.3 Practical Comparison

Comparing Vertex AI AutoML and SageMaker Canvas from a descriptive analysis perspective highlights several structural patterns.

Similarities include:

  • both platforms automate model selection and hyperparameter tuning from tabular data,
  • both report standard evaluation metrics and feature importance summaries,
  • both allow prediction generation without manual pipeline coding,
  • both maintain a candidate leaderboard or ranking of explored models.

Structural differences documented in official platform descriptions include:

  • Vertex AI’s architecture is designed to explore a broader search space and can generate ensemble models for higher predictive accuracy, at the cost of longer training time,
  • SageMaker Canvas emphasizes quick iteration with a “quick build” mode for rapid baseline models and a “standard build” mode for more thorough search, trading speed for computational cost control,
  • Canvas aims for a beginner-friendly visual interface with drag-and-drop data preparation,
  • Vertex AI and Canvas differ in their approach to uncertainty quantification; Vertex AI provides detailed training metrics and hyperparameter sensitivity reports, while Canvas focuses on prediction confidence intervals,
  • pricing models differ and change over time; both platforms bill primarily for training and prediction usage, so current official pricing documentation should be consulted before committing to either.

These contrasts should be interpreted as reflecting different design priorities rather than universal rankings. Organization policies, existing cloud investments, team technical expertise, and project requirements all influence which platform is more suitable in practice.

17.12.4 Strengths, Limitations, and Use Cases

For advanced descriptive analysis of tabular data, a platform choice can be framed in terms of analytic priorities.

  • Vertex AI AutoML can be attractive when broader model search and comprehensive feature importance analysis are central goals, and when organization infrastructure already uses Google Cloud Platform.
  • SageMaker Canvas can be attractive when rapid iteration, visual collaboration with non-technical stakeholders, and AWS ecosystem integration are priorities.
  • In both platforms, automated outputs still require statistical interpretation, especially when predictors are correlated, classes are imbalanced, or subgroup performance differs.

An analytically balanced approach is therefore to treat these platforms as structured experimentation environments documented by official resources. They can accelerate candidate discovery and provide transparent leaderboards for model comparison, while inferential framing, diagnostic scrutiny, and domain interpretation remain human responsibilities.

17.12.5 Implications for Descriptive and Exploratory Analysis

From the perspective of this book, the main value of these platforms is not only faster model fitting. Their broader contribution is methodological: they make it easier to compare many predictive hypotheses under a consistent evaluation framework, capture that comparison in transparent leaderboards, and generate feature importance summaries at scale.

That capacity aligns directly with advanced descriptive analysis, where we use predictive tools to characterize pattern strength, nonlinear structure, interaction potential, and uncertainty in model choice. Vertex AI and SageMaker Canvas operationalize the leaderboard and ranking logic developed in the R workflow of this chapter. When used with careful interpretation of results and explicit documentation of search design, platform-based AutoML can complement the transparent R-first workflow developed earlier in this chapter.

17.13 Summary and Key Takeaways

  • AutoML can be used as a descriptive instrument, not only as a predictive optimizer.
  • A transparent search setup provides comparative evidence about functional form and model sensitivity.
  • Leaderboards are most informative when interpreted with variability and near tie structure.
  • Platform implementations such as Vertex AI AutoML and SageMaker Canvas extend this logic to operational settings, with trade-offs in accuracy, speed, usability, and cost that should be evaluated in context.
  • Final model selection should be paired with untouched holdout evaluation and explicit reporting choices.
  • Automation supports analysis, but it does not replace substantive interpretation.

17.14 Looking Ahead

This chapter used automation to search model and hyperparameter spaces. The next chapter extends that idea to the predictor space itself, where automated feature engineering helps us discover transformed variables and interaction candidates that can improve both predictive performance and descriptive resolution.