21  Case Study: Business Analytics and Customer Insights

21.1 Introduction: AutoML and Automated Feature Engineering for Customer Insight

Across the previous chapters, we developed a descriptive analytical toolkit that moves from exploratory association analysis through segmentation to interpretable model summaries and communication. This case study applies the final two layers of that toolkit to a business setting: automated feature engineering, which structures the search for informative derived predictors, and AutoML-style model comparison, which surveys multiple model families under a common evaluation protocol.

The analytical emphasis is intentionally non-predictive. In business analytics, the goal is often to understand which customer behaviors drive value heterogeneity and how they interact, not to optimize a prediction system. AutoML and automated feature engineering support this by making the search process explicit and transparent rather than driven by ad hoc choices. They allow us to document what the data supports across a range of model families and feature representations, rather than committing to a single handcrafted specification.

21.2 Business Context and Customer Analytics Questions

We analyze a subscription business with recurring transaction data. The analytical questions we address are:

  1. How do customer transaction patterns vary across segments?
  2. Which transformation and interaction terms most improve our understanding of customer value, according to a systematic candidate screening procedure (following Chapter 18)?
  3. How does a multi-model AutoML comparison reveal which model family best captures customer value structure (following Chapter 17)?
  4. What interpretable customer profiles emerge from variable importance rankings?
  5. How can we communicate customer insights to product and marketing teams?

21.3 Customer Data and Baseline Profiling

We construct a customer dataset from transaction-level data aggregated to customer level. For reproducibility, we simulate transaction data that reflects realistic e-commerce patterns.

set.seed(789)

n_customers <- 500

customer_transactions <- expand_grid(
    customer_id = 1:n_customers,
    month = 1:12
) |>
    mutate(
        n_trans = rpois(n(), lambda = 2 + rnorm(n(), mean = 0, sd = 0.5)) + 1,
        avg_transaction_value = runif(n(), min = 20, max = 200),
        trans_amount = n_trans * avg_transaction_value
    ) |>
    slice_sample(prop = 0.7)

nrow(customer_transactions)
[1] 4200

From transaction data, we aggregate to customer-level metrics that serve as both features and communication targets.

customer_base <- customer_transactions |>
    group_by(customer_id) |>
    summarise(
        n_transactions = sum(n_trans),
        total_spent = sum(trans_amount),
        avg_order_value = mean(avg_transaction_value),
        months_active = n_distinct(month),
        std_spending = sd(trans_amount),
        .groups = "drop"
    ) |>
    mutate(std_spending = ifelse(is.na(std_spending), 0, std_spending)) |>
    drop_na()

nrow(customer_base)
[1] 500
customer_overview <- customer_base |>
    summarise(
        n_customers = n(),
        mean_total_spent = mean(total_spent),
        median_total_spent = median(total_spent),
        mean_transactions = mean(n_transactions),
        mean_aov = mean(avg_order_value),
        mean_months = mean(months_active)
    )

customer_overview |>
    gt() |>
    cols_label(
        n_customers = "Number of Customers",
        mean_total_spent = "Mean Total Spent",
        median_total_spent = "Median Total Spent",
        mean_transactions = "Mean Transactions",
        mean_aov = "Mean AOV",
        mean_months = "Mean Months Active"
    ) |>
    fmt_number(columns = -n_customers, decimals = 2)
Number of Customers Mean Total Spent Median Total Spent Mean Transactions Mean AOV Mean Months Active
500 2,764.49 2,629.08 24.98 110.24 8.40
customer_base |>
    ggplot(aes(x = total_spent)) +
    geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
    labs(
        title = "Distribution of Customer Total Spending",
        x = "Total Spent ($)",
        y = "Number of Customers"
    ) +
    theme_minimal()

21.4 RFM-Style Baseline Segmentation

Before turning to automated methods, we create a simple baseline segmentation based on RFM, a widely used customer analytics framework: Recency (how recently customers bought), Frequency (how often they buy), and Monetary value (how much they spend). We use months_active as a proxy for recency, n_transactions for frequency, and total_spent for monetary value. We create quintile-based scores for each dimension and define four segments from combinations of these scores.

rfm_segments <- customer_base |>
    mutate(
        r_score = ntile(months_active, 5),
        f_score = ntile(n_transactions, 5),
        m_score = ntile(total_spent, 5),
        rfm_segment = case_when(
            r_score >= 4 & f_score >= 4 & m_score >= 4 ~ "High Value",
            r_score <= 2 & f_score <= 2 ~ "Inactive",
            r_score >= 4 & f_score >= 3 ~ "Engaged",
            TRUE ~ "Standard"
        )
    )

segment_summary <- rfm_segments |>
    group_by(rfm_segment) |>
    summarise(
        n_customers = n(),
        mean_spent = mean(total_spent),
        mean_trans = mean(n_transactions),
        mean_months = mean(months_active),
        .groups = "drop"
    )

segment_summary |>
    gt() |>
    cols_label(
        rfm_segment = "Segment",
        n_customers = "N",
        mean_spent = "Avg. Spending",
        mean_trans = "Avg. Transactions",
        mean_months = "Avg. Months Active"
    ) |>
    fmt_number(columns = -c(rfm_segment, n_customers), decimals = 2)
Segment N Avg. Spending Avg. Transactions Avg. Months Active
Engaged 67 2,602.33 26.61 9.76
High Value 109 3,823.70 32.61 10.14
Inactive 138 2,020.00 18.14 6.59
Standard 186 2,754.55 24.98 8.24
rfm_segments |>
    ggplot(aes(x = rfm_segment, y = total_spent, fill = rfm_segment)) +
    geom_boxplot(alpha = 0.7) +
    labs(
        title = "Spending Distribution by RFM Segment",
        x = "RFM Segment",
        y = "Total Spent ($)",
        fill = "Segment"
    ) +
    theme_minimal() +
    theme(legend.position = "none")

This baseline segmentation provides a business-interpretable starting point. The automated feature engineering and model search steps that follow will reveal whether richer representations of customer behavior capture meaningful additional structure.

21.5 Automated Feature Engineering with Candidate Screening

RFM is an effective summary, but it discards information about the shape of customer behavior. Automated feature engineering, following the workflow developed in Chapter 18, provides a principled alternative: rather than manually selecting derived features, we generate a library of candidate transformations and interaction terms, then screen them by cross-validated predictive improvement.

We reserve a held-out test set before any screening, following the same separation used in Chapter 18.

set.seed(999)
idx_test <- sample(seq_len(nrow(customer_base)), size = 0.2 * nrow(customer_base))

analysis_data <- customer_base[-idx_test, ]
test_data     <- customer_base[idx_test, ]

cat("Analysis set:", nrow(analysis_data), "customers\n")
Analysis set: 400 customers
cat("Test set:    ", nrow(test_data), "customers\n")
Test set:     100 customers

21.5.1 Constructing the Candidate Library

We define four raw features and generate two classes of candidate terms: element-wise transformations (log and square root, which are useful for right-skewed customer metrics) and pairwise interactions (which capture how feature combinations relate to spending).

raw_features <- c("n_transactions", "avg_order_value", "months_active", "std_spending")

# Transformation candidates
transform_candidates <- unlist(lapply(raw_features, function(v) {
    c(paste0("log1p(", v, ")"), paste0("sqrt(", v, ")"))
}))

# Pairwise interaction candidates
interaction_candidates <- combn(raw_features, 2, simplify = FALSE) |>
    lapply(function(pair) paste(pair, collapse = ":")) |>
    unlist()

all_candidates <- c(transform_candidates, interaction_candidates)
cat("Transformation candidates:", length(transform_candidates), "\n")
Transformation candidates: 8 
cat("Interaction candidates:   ", length(interaction_candidates), "\n")
Interaction candidates:    6 
cat("Total candidates:         ", length(all_candidates), "\n")
Total candidates:          14 

21.5.2 Screening by Cross-Validated Improvement

We define a baseline additive linear model and evaluate each candidate term by augmenting this baseline with one additional feature at a time. The evaluation uses five-fold cross-validation on the analysis set, with holdout RMSE as the criterion.

rmse <- function(actual, predicted) {
    sqrt(mean((actual - predicted)^2))
}

make_folds <- function(n, k = 5, seed = 123) {
    set.seed(seed)
    sample(rep(seq_len(k), length.out = n))
}

cv_rmse_lm <- function(data, formula_obj, fold_id) {
    k <- length(unique(fold_id))
    fold_errors <- numeric(k)

    for (fold in seq_len(k)) {
        train_fold <- data[fold_id != fold, , drop = FALSE]
        valid_fold <- data[fold_id == fold, , drop = FALSE]

        fit <- lm(formula_obj, data = train_fold)
        pred <- as.numeric(predict(fit, newdata = valid_fold))
        fold_errors[fold] <- rmse(valid_fold[["total_spent"]], pred)
    }

    mean(fold_errors)
}

fold_id <- make_folds(n = nrow(analysis_data), k = 5, seed = 456)

baseline_formula <- total_spent ~ n_transactions + avg_order_value + months_active + std_spending
baseline_cv <- cv_rmse_lm(analysis_data, baseline_formula, fold_id)

cat("Baseline CV RMSE:", round(baseline_cv, 2), "\n")
Baseline CV RMSE: 190.44 
screening_results <- lapply(all_candidates, function(term) {
    aug_formula <- as.formula(
        paste("total_spent ~ n_transactions + avg_order_value + months_active + std_spending +", term)
    )
    tryCatch({
        cv_val <- cv_rmse_lm(analysis_data, aug_formula, fold_id)
        tibble::tibble(
            candidate = term,
            cv_rmse   = cv_val,
            improvement = baseline_cv - cv_val
        )
    }, error = function(e) {
        tibble::tibble(candidate = term, cv_rmse = NA_real_, improvement = NA_real_)
    })
})

feature_leaderboard <- bind_rows(screening_results) |>
    drop_na() |>
    arrange(desc(improvement))

feature_leaderboard |>
    gt() |>
    cols_label(
        candidate   = "Candidate feature",
        cv_rmse     = "CV RMSE",
        improvement = "Improvement vs. baseline"
    ) |>
    fmt_number(columns = c(cv_rmse, improvement), decimals = 2)
Candidate feature CV RMSE Improvement vs. baseline
n_transactions:avg_order_value 135.88 54.56
avg_order_value:months_active 156.77 33.67
months_active:std_spending 178.55 11.88
n_transactions:std_spending 180.86 9.58
log1p(n_transactions) 188.86 1.57
sqrt(n_transactions) 189.02 1.42
n_transactions:months_active 189.07 1.37
sqrt(months_active) 189.57 0.86
log1p(months_active) 189.58 0.86
avg_order_value:std_spending 189.67 0.77
log1p(std_spending) 189.74 0.70
sqrt(std_spending) 190.03 0.40
sqrt(avg_order_value) 190.58 −0.14
log1p(avg_order_value) 190.61 −0.17
feature_leaderboard |>
    ggplot(aes(
        x    = reorder(candidate, improvement),
        y    = improvement,
        fill = improvement > 0
    )) +
    geom_col(alpha = 0.8) +
    coord_flip() +
    scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "gray60"),
                      guide = "none") +
    labs(
        title = "Candidate Feature Screening",
        subtitle = "Improvement in CV RMSE over additive baseline",
        x     = NULL,
        y     = "Improvement in CV RMSE"
    ) +
    theme_minimal()

The screening leaderboard shows which feature representations add descriptive value beyond the additive baseline. Interaction terms that appear at the top reflect genuine behavioral coupling between customer metrics: for example, a high-frequency customer with a high average order value represents a qualitatively different profile from one where only one dimension is elevated. This kind of multiplicative structure is not captured by the additive baseline and is exactly the type of pattern that automated feature engineering helps surface.

A methodological caveat is in order: some interaction candidates may show strong improvement partly because of how the outcome was constructed. Here, total_spent is aggregated from n_trans × avg_transaction_value, so the top-ranked n_transactions:avg_order_value interaction partially reconstructs the outcome's own definition. In practice, a business analyst should inspect top-ranked features carefully to distinguish genuine behavioral insight from artefacts of variable construction.
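This caveat can be made concrete with a self-contained toy sketch (the variables freq, aov, and spend are hypothetical stand-ins, not chapter objects): when an outcome is built as a product of two variables plus noise, the matching interaction candidate tracks the outcome almost mechanically, so a large screening improvement is partly definitional.

```r
# Toy illustration (hypothetical variables): the outcome is constructed as
# a product, so the matching interaction correlates with it almost by
# construction rather than by behavioral insight.
set.seed(1)
toy <- data.frame(
    freq = rpois(200, lambda = 20),         # transaction-count analogue
    aov  = runif(200, min = 20, max = 200)  # order-value analogue
)
toy$spend <- toy$freq * toy$aov + rnorm(200, sd = 50)

cor(toy$freq * toy$aov, toy$spend)  # close to 1 by construction
```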

21.6 AutoML-Style Model Search and Leaderboard

Automated feature engineering answers the question of what to include as inputs. An AutoML model search, following Chapter 17, answers the complementary question: which model family best captures the structure of customer value heterogeneity across those inputs?

We compare four groups of candidates under a common five-fold cross-validation protocol: a linear regression baseline, linear regressions augmented with the top screened features from the previous section, regression trees with several hyperparameter configurations, and random forests with several mtry values. This mirrors the search structure developed in Chapter 17, applied to the customer data.

21.6.1 Evaluation Engine

evaluate_candidate <- function(data,
                               outcome,
                               fold_id,
                               fit_fun,
                               pred_fun,
                               params = list()) {
    k <- length(unique(fold_id))
    fold_rmse <- numeric(k)

    for (fold in seq_len(k)) {
        train_fold <- data[fold_id != fold, , drop = FALSE]
        valid_fold <- data[fold_id == fold, , drop = FALSE]

        fit_obj <- do.call(fit_fun, c(list(train_data = train_fold), params))
        pred    <- pred_fun(fit_obj, valid_fold)
        fold_rmse[fold] <- rmse(valid_fold[[outcome]], pred)
    }

    tibble::tibble(mean_rmse = mean(fold_rmse), sd_rmse = sd(fold_rmse))
}
automl_formula <- total_spent ~ n_transactions + avg_order_value + months_active + std_spending

fit_lm_model <- function(train_data, formula_obj) {
    lm(formula_obj, data = train_data)
}

pred_lm_model <- function(model_obj, new_data) {
    as.numeric(predict(model_obj, newdata = new_data))
}

fit_tree_model <- function(train_data, formula_obj, cp, maxdepth, minsplit) {
    rpart(
        formula_obj,
        data    = train_data,
        method  = "anova",
        control = rpart.control(cp = cp, maxdepth = maxdepth, minsplit = minsplit)
    )
}

pred_tree_model <- function(model_obj, new_data) {
    as.numeric(predict(model_obj, newdata = new_data))
}

fit_rf_model <- function(train_data, formula_obj, mtry, ntree, seed = 123) {
    set.seed(seed)
    randomForest(formula_obj, data = train_data, mtry = mtry, ntree = ntree)
}

pred_rf_model <- function(model_obj, new_data) {
    as.numeric(predict(model_obj, newdata = new_data))
}

21.6.2 Candidate Space and Leaderboard

results_list    <- list()
row_counter     <- 1
formula_registry <- list()   # stores formulas keyed by "family||config"

# 1a. Linear regression baseline
lm_eval <- evaluate_candidate(
    data     = analysis_data,
    outcome  = "total_spent",
    fold_id  = fold_id,
    fit_fun  = fit_lm_model,
    pred_fun = pred_lm_model,
    params   = list(formula_obj = automl_formula)
)

results_list[[row_counter]] <- tibble::tibble(
    family    = "Linear regression",
    config    = "standard OLS",
    mean_rmse = lm_eval$mean_rmse,
    sd_rmse   = lm_eval$sd_rmse
)
formula_registry[["Linear regression||standard OLS"]] <- automl_formula
row_counter <- row_counter + 1

# 1b. Linear regression augmented with top screened features
top_engineered <- feature_leaderboard |>
    filter(improvement > 0) |>
    slice_head(n = 3) |>
    pull(candidate)

for (k in seq_along(top_engineered)) {
    aug_terms    <- paste(top_engineered[1:k], collapse = " + ")
    aug_formula  <- as.formula(
        paste("total_spent ~ n_transactions + avg_order_value + months_active + std_spending +",
              aug_terms)
    )
    config_label <- paste0("+ top ", k, ifelse(k == 1, " feature", " features"))
    formula_registry[[paste0("Linear regression (engineered)||", config_label)]] <- aug_formula

    aug_eval <- evaluate_candidate(
        data     = analysis_data,
        outcome  = "total_spent",
        fold_id  = fold_id,
        fit_fun  = fit_lm_model,
        pred_fun = pred_lm_model,
        params   = list(formula_obj = aug_formula)
    )
    results_list[[row_counter]] <- tibble::tibble(
        family    = "Linear regression (engineered)",
        config    = config_label,
        mean_rmse = aug_eval$mean_rmse,
        sd_rmse   = aug_eval$sd_rmse
    )
    row_counter <- row_counter + 1
}

# 2. Regression tree grid
tree_grid <- expand.grid(
    cp       = c(0.001, 0.01),
    maxdepth = c(3, 6),
    minsplit = c(10, 20)
)

for (i in seq_len(nrow(tree_grid))) {
    this_eval <- evaluate_candidate(
        data     = analysis_data,
        outcome  = "total_spent",
        fold_id  = fold_id,
        fit_fun  = fit_tree_model,
        pred_fun = pred_tree_model,
        params   = list(
            formula_obj = automl_formula,
            cp          = tree_grid$cp[i],
            maxdepth    = tree_grid$maxdepth[i],
            minsplit    = tree_grid$minsplit[i]
        )
    )

    results_list[[row_counter]] <- tibble::tibble(
        family   = "Regression tree",
        config   = paste0("cp=", tree_grid$cp[i], ", depth=", tree_grid$maxdepth[i],
                          ", minsplit=", tree_grid$minsplit[i]),
        mean_rmse = this_eval$mean_rmse,
        sd_rmse   = this_eval$sd_rmse
    )
    row_counter <- row_counter + 1
}

# 3. Random forest grid
rf_grid <- expand.grid(
    mtry  = c(2, 3, 4),
    ntree = c(200, 400)
)

for (i in seq_len(nrow(rf_grid))) {
    this_eval <- evaluate_candidate(
        data     = analysis_data,
        outcome  = "total_spent",
        fold_id  = fold_id,
        fit_fun  = fit_rf_model,
        pred_fun = pred_rf_model,
        params   = list(
            formula_obj = automl_formula,
            mtry        = rf_grid$mtry[i],
            ntree       = rf_grid$ntree[i],
            seed        = 100 + i
        )
    )

    results_list[[row_counter]] <- tibble::tibble(
        family   = "Random forest",
        config   = paste0("mtry=", rf_grid$mtry[i], ", ntree=", rf_grid$ntree[i]),
        mean_rmse = this_eval$mean_rmse,
        sd_rmse   = this_eval$sd_rmse
    )
    row_counter <- row_counter + 1
}

automl_leaderboard <- bind_rows(results_list) |>
    arrange(mean_rmse)
automl_leaderboard |>
    gt() |>
    cols_label(
        family    = "Model family",
        config    = "Configuration",
        mean_rmse = "Mean CV RMSE",
        sd_rmse   = "SD CV RMSE"
    ) |>
    fmt_number(columns = c(mean_rmse, sd_rmse), decimals = 2)
Model family Configuration Mean CV RMSE SD CV RMSE
Linear regression (engineered) + top 3 features 133.39 10.81
Linear regression (engineered) + top 1 feature 135.88 11.73
Linear regression (engineered) + top 2 features 136.73 11.65
Linear regression standard OLS 190.44 8.45
Random forest mtry=3, ntree=200 201.05 16.03
Random forest mtry=4, ntree=200 201.58 23.62
Random forest mtry=3, ntree=400 201.94 15.53
Random forest mtry=4, ntree=400 202.72 19.78
Random forest mtry=2, ntree=400 211.76 16.51
Random forest mtry=2, ntree=200 214.71 15.13
Regression tree cp=0.001, depth=6, minsplit=10 326.26 33.14
Regression tree cp=0.001, depth=6, minsplit=20 341.90 35.14
Regression tree cp=0.01, depth=6, minsplit=10 408.88 38.09
Regression tree cp=0.01, depth=6, minsplit=20 408.88 38.09
Regression tree cp=0.001, depth=3, minsplit=10 440.14 18.63
Regression tree cp=0.01, depth=3, minsplit=10 440.14 18.63
Regression tree cp=0.001, depth=3, minsplit=20 440.14 18.63
Regression tree cp=0.01, depth=3, minsplit=20 440.14 18.63
automl_leaderboard |>
    mutate(rank = seq_len(n())) |>
    ggplot(aes(x = rank, y = mean_rmse, color = family,
               ymin = mean_rmse - sd_rmse, ymax = mean_rmse + sd_rmse)) +
    geom_pointrange(alpha = 0.8) +
    labs(
        title  = "AutoML Leaderboard: CV RMSE by Model Family and Configuration",
        x      = "Candidate rank",
        y      = "Mean CV RMSE",
        color  = "Model family"
    ) +
    theme_minimal()

The leaderboard now spans four model families: the plain linear baseline, linear regression augmented with top screened features from the previous section, regression trees, and random forests. This is the natural combination of the feature engineering workflow (Chapter 18) and the model search workflow (Chapter 17): screening candidates first and then searching over model families that include those candidates.

The leaderboard identifies the winning configuration, in terms of both model family and feature representation, under a common evaluation protocol. If a plain linear model with raw features wins, the relationship is approximately additive and no engineered features are needed. If an augmented linear model wins, the interaction or transformation terms add descriptive value while more complex nonlinear models do not. If ensemble models win, there is genuine nonlinear or interaction structure that linear representations cannot capture regardless of feature engineering.

Comparing the winning CV RMSE back to the baseline (plain linear regression, evaluated in the same folds) provides a calibrated sense of how much the combined feature engineering and model search workflow gained over the starting point.

The winning configuration, a linear model augmented with the top three screened features, is used for all downstream analysis.

21.7 Feature Importance and Variable Selection

We fit the winning model on the full analysis set and rank the original raw predictors by model-agnostic permutation importance, following Chapter 14. Permutation importance is computed by measuring how much holdout RMSE increases when each feature is randomly shuffled, breaking its relationship with the outcome while leaving the model unchanged. A large increase signals a feature the model relies on heavily; a near-zero increase signals that the model can predict almost as well without it.

Permuting a raw feature (for example n_transactions) affects all formula terms that depend on it, including any engineered terms such as log1p(n_transactions) or interaction terms. This means the importance scores correctly attribute signal to the original customer behavior variables rather than to their derived representations, regardless of which formula won the leaderboard.
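The propagation mechanism can be seen in a small self-contained sketch (the variables x and y are hypothetical, not chapter objects): because predict() re-evaluates the model formula on the data it is given, shuffling a raw column also changes every derived term computed from it.

```r
# Toy sketch: permuting a raw column propagates to formula-derived terms,
# since predict() recomputes log1p(x) from the supplied data.
set.seed(2)
toy <- data.frame(x = runif(50, min = 1, max = 100))
toy$y <- 3 * log1p(toy$x) + rnorm(50, sd = 0.1)

fit <- lm(y ~ log1p(x), data = toy)

toy_perm   <- toy
toy_perm$x <- sample(toy_perm$x)

# Predictions differ because the derived term is recomputed from the
# permuted column, even though the fitted model itself is unchanged.
any(predict(fit, newdata = toy) != predict(fit, newdata = toy_perm))
```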

# Recover the formula of the best linear model from the registry
winner_key     <- paste(automl_leaderboard$family[1],
                        automl_leaderboard$config[1], sep = "||")
winning_formula <- if (!is.null(formula_registry[[winner_key]])) {
    formula_registry[[winner_key]]
} else {
    automl_formula   # fallback: plain baseline
}

lm_final <- lm(winning_formula, data = analysis_data)
cat("Winning formula:\n")
Winning formula:
print(winning_formula)
total_spent ~ n_transactions + avg_order_value + months_active + 
    std_spending + n_transactions:avg_order_value + avg_order_value:months_active + 
    months_active:std_spending
permute_importance <- function(model, data, outcome, features, n_perm = 100, seed = 456) {
    set.seed(seed)
    baseline_pred <- as.numeric(predict(model, newdata = data))
    baseline_rmse <- rmse(data[[outcome]], baseline_pred)

    imp <- sapply(features, function(v) {
        perm_rmse <- replicate(n_perm, {
            data_perm       <- data
            data_perm[[v]]  <- sample(data_perm[[v]])
            rmse(data[[outcome]], as.numeric(predict(model, newdata = data_perm)))
        })
        mean(perm_rmse) - baseline_rmse
    })

    tibble::tibble(
        feature    = names(imp),
        importance = as.numeric(imp)
    ) |>
        arrange(desc(importance))
}

features_to_rank <- c("n_transactions", "avg_order_value", "months_active", "std_spending")

importance_tbl <- permute_importance(
    model    = lm_final,
    data     = test_data,
    outcome  = "total_spent",
    features = features_to_rank
)

importance_tbl |>
    gt() |>
    cols_label(
        feature    = "Feature",
        importance = "Permutation Importance (increase in holdout RMSE)"
    ) |>
    fmt_number(columns = importance, decimals = 2)
Feature Permutation Importance (increase in holdout RMSE)
n_transactions 568.64
avg_order_value 460.57
std_spending 151.53
months_active 39.60
importance_tbl |>
    ggplot(aes(x = reorder(feature, importance), y = importance)) +
    geom_col(fill = "darkgreen", alpha = 0.75) +
    coord_flip() +
    labs(
        title = "Permutation Importance: Customer Value Drivers (Winning Linear Model)",
        x     = NULL,
        y     = "Increase in holdout RMSE when feature is permuted"
    ) +
    theme_minimal()

Cross-referencing importance rankings with the feature screening leaderboard provides a consistency check. When the same predictors appear both as top candidates in the screening step and as high-importance variables in the winning model, there is convergent evidence that these features carry genuine descriptive signal. When rankings differ, the discrepancy may reflect collinearity between predictors or sensitivity to how the candidate term interacts with the baseline model form.
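One way to operationalize this consistency check, assuming the chapter objects importance_tbl and feature_leaderboard are still in the workspace, is to count for each raw feature how many positively screened candidates mention it. The counting rule is an illustrative heuristic, not part of the chapter's formal workflow.

```r
# Sketch: rough convergence check between the two rankings. A raw feature
# that is both high-importance and present in many positively screened
# candidates has convergent evidence behind it.
consistency_check <- importance_tbl |>
    mutate(
        n_screened_hits = sapply(feature, function(v) {
            winners <- feature_leaderboard$candidate[
                feature_leaderboard$improvement > 0
            ]
            sum(grepl(v, winners, fixed = TRUE))
        })
    )

consistency_check
```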

21.8 Customer Targeting and Segmentation

Using the trained model, we score all customers and partition them by predicted value into four business-relevant tiers.

customer_scored <- customer_base |>
    mutate(
        predicted_value = as.numeric(predict(lm_final, newdata = customer_base)),
        value_segment   = cut(
            predicted_value,
            breaks        = quantile(predicted_value, probs = c(0, 0.25, 0.50, 0.75, 1.0)),
            labels        = c("Emerging", "Growth", "Established", "Premium"),
            include.lowest = TRUE
        )
    )

segment_profile <- customer_scored |>
    group_by(value_segment) |>
    summarise(
        n_customers          = n(),
        actual_avg_spend     = mean(total_spent),
        predicted_avg_spend  = mean(predicted_value),
        mean_transactions    = mean(n_transactions),
        mean_months_active   = mean(months_active),
        .groups = "drop"
    )

segment_profile |>
    gt() |>
    cols_label(
        value_segment        = "Segment",
        n_customers          = "N",
        actual_avg_spend     = "Actual Avg. Spending",
        predicted_avg_spend  = "Predicted Avg. Spending",
        mean_transactions    = "Mean Transactions",
        mean_months_active   = "Mean Months Active"
    ) |>
    fmt_number(columns = -c(value_segment, n_customers), decimals = 2)
Segment N Actual Avg. Spending Predicted Avg. Spending Mean Transactions Mean Months Active
Emerging 125 1,741.30 1,745.07 18.65 7.06
Growth 125 2,414.82 2,409.82 23.10 8.15
Established 125 2,969.78 2,974.38 26.18 8.78
Premium 125 3,932.06 3,932.73 31.97 9.61
customer_scored |>
    ggplot(aes(x = value_segment, y = total_spent, fill = value_segment)) +
    geom_violin(alpha = 0.7) +
    geom_boxplot(width = 0.1, alpha = 0.5) +
    labs(
        title = "Spending Distribution Across Predicted Value Segments",
        x     = "Value Segment",
        y     = "Total Spent ($)",
        fill  = "Segment"
    ) +
    theme_minimal() +
    theme(legend.position = "none")

These data-driven segments can inform differentiated marketing, pricing, or retention strategies. The model-based partitioning complements the rule-based RFM segmentation from earlier in the chapter: comparing the two directly shows whether the automated method adds resolution beyond the manual baseline.
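That comparison can be run directly, assuming the chapter objects rfm_segments and customer_scored are in the workspace: a cross-tabulation shows how the rule-based and model-based partitions overlap.

```r
# Sketch: cross-tabulate RFM segments against model-based value tiers.
# Heavy off-diagonal mass indicates the model adds resolution beyond RFM.
rfm_segments |>
    select(customer_id, rfm_segment) |>
    left_join(
        customer_scored |> select(customer_id, value_segment),
        by = "customer_id"
    ) |>
    count(rfm_segment, value_segment) |>
    pivot_wider(names_from = value_segment, values_from = n, values_fill = 0)
```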

21.9 Communicating Customer Insights to Business Stakeholders

Data-driven customer segmentation is most valuable when translated into actionable business narratives. The following insights bridge the feature screening and AutoML results with business language.

Insight 1: Transaction frequency and average order value are the primary value drivers. Permutation importance indicates that n_transactions and avg_order_value are the two strongest contributors to predicted customer spending. The feature screening step confirms that their interaction term provides additional explanatory power beyond either predictor in isolation. This suggests that growth actions should target basket expansion (higher order value) and repeat purchase behavior (higher transaction volume) simultaneously, rather than treating them as independent levers.

Insight 2: Customer value is well described by a linear model with a few interaction terms; algorithmic complexity is not required. The AutoML leaderboard shows that linear regression augmented with the top screened features outperforms all tree-based and ensemble alternatives. This is a substantive finding: the drivers of total spending combine through a small number of interpretable terms, most notably the coupling of transaction frequency with order value, rather than through complex nonlinear structure. For business planning, this simplicity is an asset: each coefficient, including the interaction terms, is directly interpretable as a marginal value driver without requiring complex model explanations.

Insight 3: A four-tier segmentation is operationally useful. The model separates customers into four value tiers with rising average spend from one tier to the next. Each tier can receive a different plan, for instance: low-cost activation for Emerging, frequency-building offers for Growth, loyalty reinforcement for Established, and high-touch retention for Premium. The feature importance rankings also give each tier a behavioral profile that supports message targeting within that tier.

These insights translate feature rankings and model structure into business language that can guide strategy without requiring stakeholders to engage with the underlying algorithmic details.

21.10 Holdout Validation and Model Performance

To assess whether the segmentation generalizes beyond the analysis sample, we evaluate model performance on the held-out test set reserved at the start of the analysis.

test_pred <- as.numeric(predict(lm_final, newdata = test_data))

holdout_rmse <- rmse(test_data$total_spent, test_pred)
holdout_mae  <- mean(abs(test_data$total_spent - test_pred))
r_squared    <- cor(test_data$total_spent, test_pred)^2

tibble::tibble(
    metric = c("Root Mean Squared Error", "Mean Absolute Error", "R-squared"),
    value  = c(holdout_rmse, holdout_mae, r_squared)
) |>
    gt() |>
    cols_label(metric = "Metric", value = "Value") |>
    fmt_number(columns = value, decimals = 2)
Metric Value
Root Mean Squared Error 140.05
Mean Absolute Error 109.51
R-squared 0.96
tibble::tibble(
    actual    = test_data$total_spent,
    predicted = test_pred
) |>
    ggplot(aes(x = actual, y = predicted)) +
    geom_point(alpha = 0.5, size = 2) +
    geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
    labs(
        title = "Holdout Validation: Predicted vs. Actual Customer Spending",
        x     = "Actual Total Spent ($)",
        y     = "Predicted Total Spent ($)"
    ) +
    theme_minimal()

Holdout performance confirms that the model’s segmentation and feature ranking reflect stable patterns in customer behavior. Comparing CV RMSE from the leaderboard to the holdout RMSE provides a consistency check: a large discrepancy would suggest over-optimistic leaderboard estimates, possibly due to overfitting during the feature screening or model search phase.

21.11 Limitations and Reproducibility Notes

Several caveats warrant attention.

  1. This case study uses simulated data. Real customer data often contains temporal autocorrelation, seasonality, sparse records, and missing values that require additional preprocessing. The feature-outcome relationships illustrated here are deliberately simple to make the workflow visible.
  2. The strong contribution of the interaction between n_transactions and avg_order_value in the feature screening partly reflects the mathematical construction of total_spent in this dataset. In applied settings, careful examination of whether top-ranked features encode genuine behavioral insight or are artefacts of variable construction is essential.
  3. The AutoML search is intentionally lightweight. Production systems would explore larger candidate spaces, use repeated cross-validation, and apply model-specific regularization. The principles remain the same, but the engineering budget is different.
  4. Feature importance can be sensitive to feature correlation and scale. When predictors are correlated, importance scores should be interpreted as group-level signals rather than individual contributions.

All random components use set seeds for reproducibility. In applied settings, stability checks across alternative model specifications and multiple random seeds are recommended before using results to inform strategic decisions.
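A minimal stability check, assuming analysis_data, make_folds, and cv_rmse_lm from this chapter are available, reruns the baseline-versus-top-candidate comparison under several fold seeds; consistently positive gaps indicate the screening result is not an artefact of one particular fold assignment.

```r
# Sketch: stability of the top screening result across fold seeds.
base_formula <- total_spent ~ n_transactions + avg_order_value +
    months_active + std_spending
aug_formula <- update(base_formula, . ~ . + n_transactions:avg_order_value)

gap_by_seed <- sapply(c(101, 202, 303), function(s) {
    fid <- make_folds(n = nrow(analysis_data), k = 5, seed = s)
    cv_rmse_lm(analysis_data, base_formula, fid) -
        cv_rmse_lm(analysis_data, aug_formula, fid)
})

gap_by_seed  # positive gaps across seeds indicate a stable improvement
```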

21.12 Summary and Key Takeaways

  • Automated feature engineering systematizes the search for informative derived predictors by generating a candidate library and screening it by cross-validated improvement, following the workflow from Chapter 18.
  • AutoML-style model comparison surveys multiple model families under a common evaluation protocol, producing a leaderboard that describes how much structure different algorithms can extract from the data, following Chapter 17.
  • The combination of feature screening and model leaderboard provides two complementary views of the same question: what drives customer value, and how complex is that structure?
  • Model-agnostic permutation importance applied to the winning model translates AutoML and feature engineering results into a ranked list of customer behavior drivers, regardless of which model family prevailed.
  • Data-driven customer segmentation supports differentiated targeting and resource allocation.
  • The emphasis throughout is descriptive: AutoML and automated feature engineering are instruments for learning about data, not for deploying prediction systems.

21.13 Looking Ahead

This case study focused on automated feature engineering and AutoML-style model comparison as descriptive instruments for customer insight. The next chapter consolidates the core lessons from all three case studies (policy, health, and business), highlights the main practical principles of descriptive tabular analysis, and closes with recommendations for applying this workflow responsibly in real-world projects.