4  Unified Association Measures for Mixed-Type Variables

4.1 Introduction: The Challenge of Mixed-Type Data

Real-world datasets rarely consist of purely continuous or purely categorical variables. A typical health survey contains continuous biomarkers, ordinal symptom scales, categorical diagnoses, and binary treatment indicators. A business database combines numeric sales figures, categorical product types, ordinal customer satisfaction ratings, and binary contract status.

A helpful framing question is: How do we measure association across such mixed data in a unified, comparable way?

Chapter 2 established that different variable types require different association measures. But practitioners often face a practical problem: given a dataset with mixed types, how do we compute a single association matrix that allows us to rank relationships, identify anomalies, and prepare for network analysis or further modeling?

This chapter presents a practical framework for unified association measurement. We aim to:

  1. Survey classical type-specific measures and their limitations
  2. Introduce modern alternatives that are more flexible and often more robust
  3. Build a type-aware framework for selecting appropriate measures
  4. Demonstrate implementation on real mixed-type data

4.2 Variable Type Classification

Association is not a single concept. The way we measure relationships depends on the measurement scale of the variables involved. A correlation coefficient makes sense for two continuous variables, but it becomes misleading when applied to categorical or ordinal data. A good descriptive workflow usually begins by classifying variable types and selecting association measures accordingly.

We classify variables into five types:

Type Examples Characteristics
Continuous income, height, temperature Interval/ratio scale, numeric measurements
Discrete count data, number of events Non-negative integers, often right-skewed
Ordinal education level, Likert scales Ordered categories, rank structure matters
Categorical region, industry, color Unordered labels, no inherent order
Binary yes/no, 0/1, presence/absence Two categories, often treatment or outcome indicator

The core goal is comparability: use measures that help us rank associations across mixed types without forcing invalid assumptions.

4.3 Classical Measures by Variable Type

Classical association measures are well-understood, computationally efficient, and interpretable. However, each is designed for specific type combinations. We review the most important measures organized by the types of variables they handle.

4.3.1 Continuous-Continuous: Pearson and Spearman

Pearson’s \(r\) measures linear association between two continuous variables (Pearson 1895; Anderson 1985). It is scale-invariant, interpretable, and widely understood. However, it has significant limitations:

  • Sensitive to outliers
  • Assumes bivariate normality for inference (though we focus on description)
  • Misses nonlinear relationships (two variables can be perfectly dependent with \(r = 0\))
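The last limitation is easy to demonstrate with a minimal sketch: a perfectly deterministic but symmetric relationship yields a Pearson correlation of essentially zero.

```r
# Perfect dependence, (near-)zero Pearson correlation:
# y = x^2 on a grid symmetric about 0, so the linear component cancels
x <- seq(-3, 3, length.out = 101)
y <- x^2
cor(x, y)  # ~0 (zero up to floating-point error)
```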
# Pearson correlation on mtcars
mtcars |>
    select(mpg, wt, hp, disp) |>
    cor(method = "pearson") |>
    as.data.frame() |>
    tibble::rownames_to_column(var = "Variable") |>
    gt() |>
    fmt_number(columns = -Variable, decimals = 3)
Variable mpg wt hp disp
mpg 1.000 −0.868 −0.776 −0.848
wt −0.868 1.000 0.659 0.888
hp −0.776 0.659 1.000 0.791
disp −0.848 0.888 0.791 1.000

Spearman’s \(\rho\) ranks data first, then computes Pearson’s \(r\) on ranks (Spearman 1961). This makes it:

  • Robust to outliers
  • Sensitive to monotonic (not just linear) relationships
  • Suitable for ordinal data
# Compare Pearson and Spearman on data with nonlinear association
set.seed(42)
data_nonlin <- tibble(
  x = seq(0, 4, length.out = 100),
  y = x^2 + rnorm(100, 0, 2)
)

tibble(
  Measure = c("Pearson", "Spearman"),
  r = c(
    cor(data_nonlin$x, data_nonlin$y, method = "pearson"),
    cor(data_nonlin$x, data_nonlin$y, method = "spearman")
  )
) |>
  gt() |>
  fmt_number(columns = r, decimals = 3)
Measure r
Pearson 0.880
Spearman 0.877

Recommendation: Spearman is often a sensible default for continuous–continuous pairs due to its robustness; Pearson can be helpful when linearity is established and outliers are minimal.

4.3.2 Categorical-Categorical: Cramér’s V

For two categorical variables, Cramér’s V normalizes the chi-square statistic to \([0, 1]\) (David and Cramer 1947):

\[V = \sqrt{\frac{\chi^2}{n(k-1)}}\]

where \(k = \min(\text{rows}, \text{columns})\). This is intuitive and widely used, though it can be biased in small or sparse tables (bias-corrected variants exist).

# Cramér's V for categorical variables
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  sqrt(as.numeric(chi2) / (n * (k - 1)))
}

# Example: cylinder type vs transmission type
mtcars |>
  summarise(V = cramers_v(factor(cyl), factor(am))) |>
  gt() |>
  fmt_number(columns = V, decimals = 3)
V
0.523

4.3.3 Continuous-Categorical: Eta-Squared

When a continuous variable varies across groups defined by a categorical variable, eta-squared (\(\eta^2\)) captures the proportion of variance explained (Fisher 1990; Cohen 2013):

\[\eta^2 = \frac{\text{SS}_{\text{between}}}{\text{SS}_{\text{total}}}.\]

This is directly interpretable as effect size and ranges from 0 to 1.

# Eta-squared: mpg explained by number of cylinders
eta_squared <- function(y, group) {
  df <- tibble(y = y, group = group)
  grand_mean <- mean(df$y, na.rm = TRUE)
  ss_total <- sum((df$y - grand_mean)^2, na.rm = TRUE)
  ss_between <- df |>
    group_by(group) |>
    summarise(n = n(), mean_y = mean(y, na.rm = TRUE), .groups = "drop") |>
    mutate(ss = n * (mean_y - grand_mean)^2) |>
    summarise(ss = sum(ss)) |>
    pull(ss)
  ss_between / ss_total
}

mtcars |>
  summarise(eta2 = eta_squared(mpg, factor(cyl))) |>
  gt() |>
  fmt_number(columns = eta2, decimals = 3)
eta2
0.732

4.3.4 Ordinal Variables: Kendall’s Tau

Ordinal variables carry rank information that is often best retained. Kendall’s \(\tau\) is a natural choice for ordinal–ordinal associations and handles tied observations well (Kendall 1938). Spearman’s \(\rho\) can also be used for ordinal data.

# Ordinal example using a cut of mpg
mtcars |>
  mutate(mpg_ordinal = cut(mpg, breaks = 4, ordered_result = TRUE)) |>
  summarise(
    spearman = cor(as.numeric(mpg_ordinal), wt, method = "spearman"),
    kendall = cor(as.numeric(mpg_ordinal), wt, method = "kendall")
  ) |>
  gt() |>
  fmt_number(columns = c(spearman, kendall), decimals = 3)
spearman kendall
−0.777 −0.656

4.3.5 Binary-Binary: Phi Coefficient

Binary variables can be treated as a special case of categorical data. For two binary indicators, the phi coefficient is equivalent to Pearson correlation on 0/1 coding.

# Phi coefficient for two binary variables
phi <- function(x, y) {
  cor(as.integer(x), as.integer(y))
}

mtcars |>
  mutate(am_bin = am == 1, vs_bin = vs == 1) |>
  summarise(phi = phi(am_bin, vs_bin)) |>
  gt() |>
  fmt_number(columns = phi, decimals = 3)
phi
0.168

4.4 Modern Alternatives: Beyond Classical Measures

Classical measures work well within their designed domains but lack flexibility. We present modern association measures that offer broader applicability and often improved robustness.

4.4.1 Distance Correlation

Distance correlation (Székely, Rizzo, and Bakirov 2007) is a measure of dependence that:

  • Works for any dimension (univariate, multivariate)
  • Detects both linear and nonlinear associations
  • Equals zero if and only if the variables are independent
  • Is not sensitive to outliers in the same way as Pearson’s \(r\)

The sample distance correlation is computed from the pairwise distance matrices of the two variables: each matrix is double-centered (row means, column means, and the grand mean subtracted), and the resulting entries are correlated.

# Distance correlation via the energy package
library(energy)

# Example
mtcars |>
  summarise(
    dcor = dcor(mpg, wt),
    pearson = cor(mpg, wt, method = "pearson")
  ) |>
  gt() |>
  fmt_number(columns = c(dcor, pearson), decimals = 3)
dcor pearson
0.871 −0.868

Practical advantages:

  • Captures nonlinear relationships
  • Less sensitive to outliers than Pearson’s \(r\), though not fully robust (it is not rank-based)
  • Theoretically appealing (zero iff independent)

Disadvantages:

  • Computationally expensive for large samples
  • Less interpretable than Pearson’s \(r\) or Spearman’s \(\rho\)
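For intuition, the double-centering recipe can be sketched directly in base R. This is a pedagogical sketch: dcor_manual is an illustrative helper name, not part of the energy package, whose dcor() is the reference implementation.

```r
# Sample distance correlation "by hand" (illustrative helper, not an energy function)
dcor_manual <- function(x, y) {
  A <- as.matrix(dist(x))  # pairwise distances |x_i - x_j|
  B <- as.matrix(dist(y))
  # Double-center: subtract row and column means, add back the grand mean
  center <- function(M) sweep(sweep(M, 1, rowMeans(M)), 2, colMeans(M)) + mean(M)
  A_c <- center(A)
  B_c <- center(B)
  dcov2  <- mean(A_c * B_c)   # squared sample distance covariance
  dvar_x <- mean(A_c * A_c)
  dvar_y <- mean(B_c * B_c)
  sqrt(dcov2 / sqrt(dvar_x * dvar_y))
}

dcor_manual(mtcars$mpg, mtcars$wt)  # agrees with energy::dcor(mpg, wt)
```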

4.4.2 Maximal Information Coefficient (MIC)

The Maximal Information Coefficient (Reshef et al. 2011) measures association by finding the grid partition of the data that maximizes mutual information. It detects diverse functional relationships (linear, nonlinear, exponential, etc.) and is normalized to \([0, 1]\).

Strengths:

  • Detects diverse functional relationships
  • Symmetric and normalized
  • No assumption about relationship shape

Weaknesses:

  • Computationally demanding
  • Can be noisy with small sample sizes
  • Requires specialized software
# MIC example (requires the minerva package)
library(minerva)

# Compute MIC for a pair
set.seed(42)
x <- rnorm(200)
y <- x^2 + rnorm(200, 0, 0.5)

mic_result <- mine(x, y)
tibble(
  Measure = "MIC",
  Value = mic_result$MIC
) |>
  gt() |>
  fmt_number(columns = Value, decimals = 3)
Measure Value
MIC 0.729

4.4.3 Mutual Information

Mutual information (MI) quantifies the amount of information one variable contains about another (Shannon 1948; Cover and Thomas 2001):

\[I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.\]

Mutual information is:

  • Symmetric and non-negative
  • Applicable to any combination of variable types
  • Invariant to monotonic transformations of the margins
  • Zero if and only if the variables are independent

Interpretation challenge: MI is measured in “nats” or “bits” and lacks an intuitive 0-1 scale, making comparisons across variable pairs difficult without normalization.

# Normalized mutual information
mutual_info_norm <- function(x, y) {
  # Discretize continuous variables if needed
  x_cat <- if (is.numeric(x)) cut(x, breaks = 5) else factor(x)
  y_cat <- if (is.numeric(y)) cut(y, breaks = 5) else factor(y)
  
  tab <- table(x_cat, y_cat)
  px <- rowSums(tab) / sum(tab)
  py <- colSums(tab) / sum(tab)
  pxy <- tab / sum(tab)
  
  # Mutual information
  mi <- 0
  for (i in 1:nrow(pxy)) {
    for (j in 1:ncol(pxy)) {
      if (pxy[i, j] > 0) {
        mi <- mi + pxy[i, j] * log(pxy[i, j] / (px[i] * py[j]))
      }
    }
  }
  
  # Normalize by entropy
  hx <- -sum(px * log(px + 1e-10))
  hy <- -sum(py * log(py + 1e-10))
  2 * mi / (hx + hy)
}

# Example
mtcars |>
  summarise(
    nmi = mutual_info_norm(mpg, wt)
  ) |>
  gt() |>
  fmt_number(columns = nmi, decimals = 3)
nmi
0.576

4.4.4 Copula-Based Measures

Copulas separate marginal distributions from the dependence structure (Sklar 1959). Common copula-derived measures include Kendall’s tau and Spearman’s rho, which are rank-based and capture dependence structure independent of marginal scales.
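A quick way to see the copula idea concretely is to transform each margin to its pseudo-observations (ranks rescaled into the unit interval); rank correlations are unchanged by this transformation because they depend only on the ranks. The names u and v below are illustrative.

```r
# Pseudo-observations: the empirical margins mapped into (0, 1) via ranks
u <- rank(mtcars$mpg) / (nrow(mtcars) + 1)
v <- rank(mtcars$wt)  / (nrow(mtcars) + 1)

# Spearman's rho is identical on raw data and pseudo-observations
all.equal(
  cor(u, v, method = "spearman"),
  cor(mtcars$mpg, mtcars$wt, method = "spearman")
)  # TRUE
```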

# Kendall's tau and Spearman's rho
mtcars |>
  summarise(
    kendall_tau = cor(mpg, wt, method = "kendall"),
    spearman_rho = cor(mpg, wt, method = "spearman"),
    pearson_r = cor(mpg, wt, method = "pearson")
  ) |>
  gt() |>
  fmt_number(columns = everything(), decimals = 3)
kendall_tau spearman_rho pearson_r
−0.728 −0.886 −0.868

4.5 A Type-Aware Framework for Mixed Data

Given the diversity of measures, we need a practical approach that applies the most appropriate measure to each variable pair based on their types.

4.5.1 Association Measure Selection Matrix

The table below summarizes recommended measures for each type combination. This consolidated matrix serves as the primary reference for selecting appropriate measures:

X → Y Continuous Discrete Ordinal Categorical Binary
Continuous Spearman (default), Pearson, distance correlation Spearman, distance correlation Spearman, polyserial \(\eta^2\), distance correlation Point-biserial, \(\eta^2\)
Discrete Spearman, distance correlation Spearman Spearman \(\eta^2\), distance correlation \(\eta^2\)
Ordinal Spearman, polyserial Spearman Kendall’s \(\tau\), Spearman Polychoric, Cramér’s V Rank-biserial
Categorical \(\eta^2\), distance correlation \(\eta^2\), distance correlation Polychoric, Cramér’s V Cramér’s V, mutual information \(\phi\), Cramér’s V
Binary Point-biserial, \(\eta^2\) \(\eta^2\) Rank-biserial \(\phi\), Cramér’s V \(\phi\)

Rationale for selections:

  • Continuous-Continuous: Use Spearman as default (robust to outliers); add distance correlation to detect nonlinearity
  • Continuous-Categorical: Use \(\eta^2\) to measure variance explained by group membership
  • Categorical-Categorical: Use Cramér’s V (normalized chi-square)
  • Binary pairs: Use phi coefficient (equivalent to Pearson on 0/1 encoding)
  • Mixed discrete/ordinal: Prefer rank-based measures

4.5.2 Implementation: Type-Aware Association Matrix

We now implement a complete workflow that classifies variables, selects appropriate measures, and computes a unified association matrix. Note that this sketch covers only the type combinations that occur in the example below; pairs it does not handle (for instance, discrete–categorical) return NA, which is why some entries of the resulting matrix are missing.

# Define automatic variable classification
classify_variable <- function(x, n_unique_threshold = 10) {
  # Remove NA for classification
  x_clean <- x[!is.na(x)]
  n_unique <- n_distinct(x_clean)
  
  if (is.numeric(x)) {
    if (all(x_clean == as.integer(x_clean)) && min(x_clean, na.rm = TRUE) >= 0) {
      if (n_unique <= n_unique_threshold) "ordinal" else "discrete"
    } else {
      "continuous"
    }
  } else if (is.factor(x) || is.character(x)) {
    if (n_unique == 2) "binary" else "categorical"
  } else {
    "unknown"
  }
}

# Define type-aware association computation
compute_association <- function(x, y, type_x, type_y) {
  # Remove pairs with missing data
  valid_idx <- !is.na(x) & !is.na(y)
  x <- x[valid_idx]
  y <- y[valid_idx]
  
  if (length(x) == 0) return(NA_real_)
  
  # Handle numeric conversion for type checking
  if (is.logical(x)) x <- as.integer(x)
  if (is.logical(y)) y <- as.integer(y)
  
  # Continuous–Continuous, Continuous–Discrete, or Continuous–Ordinal
  if ((type_x %in% c("continuous", "discrete")) && 
      (type_y %in% c("continuous", "discrete", "ordinal"))) {
    return(cor(x, as.numeric(y), method = "spearman", use = "complete.obs"))
  }
  
  if ((type_y %in% c("continuous", "discrete")) && 
      (type_x %in% c("continuous", "discrete", "ordinal"))) {
    return(cor(as.numeric(x), y, method = "spearman", use = "complete.obs"))
  }
  
  # Continuous–Categorical or Continuous–Binary (Eta-squared)
  if ((type_x == "continuous" && type_y %in% c("categorical", "binary")) ||
      (type_y == "continuous" && type_x %in% c("categorical", "binary"))) {
    if (type_x %in% c("categorical", "binary")) {
      return(eta_squared(y, x))
    } else {
      return(eta_squared(x, y))
    }
  }
  
  # Categorical–Categorical, Categorical–Binary, or Ordinal–Categorical
  if ((type_x %in% c("categorical", "binary", "ordinal")) && 
      (type_y %in% c("categorical", "binary", "ordinal"))) {
    return(cramers_v(factor(x), factor(y)))
  }
  
  # Unhandled type combinations (e.g., discrete–categorical) fall through to NA
  return(NA_real_)
}

# Prepare mtcars with mixed types
mtcars_mixed <- mtcars |>
  mutate(
    cyl_cat = factor(cyl),
    am_bin = factor(am),
    vs_bin = factor(vs),
    gear_cat = factor(gear)
  ) |>
  select(mpg, wt, hp, cyl_cat, am_bin, gear_cat)

# Classify variables
var_types <- tibble(
  var_name = names(mtcars_mixed),
  var_type = map_chr(mtcars_mixed, classify_variable)
)

var_types |>
  gt()
var_name var_type
mpg continuous
wt continuous
hp discrete
cyl_cat categorical
am_bin binary
gear_cat categorical
# Compute association matrix
vars <- names(mtcars_mixed)
n_vars <- length(vars)
assoc_matrix <- matrix(1.0, nrow = n_vars, ncol = n_vars)
rownames(assoc_matrix) <- vars
colnames(assoc_matrix) <- vars

for (i in 1:n_vars) {
  for (j in i:n_vars) {
    if (i == j) {
      assoc_matrix[i, j] <- 1.0
    } else {
      assoc <- compute_association(
        mtcars_mixed[[i]], mtcars_mixed[[j]],
        var_types$var_type[i], var_types$var_type[j]
      )
      assoc_matrix[i, j] <- assoc
      assoc_matrix[j, i] <- assoc
    }
  }
}

# Display association matrix as table
assoc_matrix |>
  as.data.frame() |>
  tibble::rownames_to_column(var = "Variable") |>
  gt() |>
  fmt_number(columns = -Variable, decimals = 3)
Variable mpg wt hp cyl_cat am_bin gear_cat
mpg 1.000 −0.886 −0.895 0.732 0.360 0.429
wt −0.886 1.000 0.775 0.612 0.480 0.434
hp −0.895 0.775 1.000 NA NA NA
cyl_cat 0.732 0.612 NA 1.000 0.523 0.531
am_bin 0.360 0.480 NA 0.523 1.000 0.809
gear_cat 0.429 0.434 NA 0.531 0.809 1.000
# Display as heatmap
assoc_matrix |>
  as.data.frame() |>
  tibble::rownames_to_column(var = "x") |>
  pivot_longer(-x, names_to = "y", values_to = "assoc") |>
  ggplot(aes(x = x, y = y, fill = assoc)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#3b4cc0", mid = "white", high = "#b40426",
                       # limits must span the negative rank correlations in the matrix
                       limits = c(-1, 1), breaks = seq(-1, 1, 0.5)) +
  labs(
    title = "Type-Aware Association Heatmap",
    x = NULL, y = NULL, fill = "Association"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4.6 Normalization and Comparability

Different measures operate on different scales. Some are naturally bounded to \([0, 1]\); others are unbounded. To compare associations across variable pairs, we often need to normalize.

4.6.1 Standard Normalization Approaches

  1. Min-max scaling: Scale a measure to \([0, 1]\) using its theoretical bounds \[\text{Normalized} = \frac{\text{Measure} - \min}{\max - \min}\]
  2. Absolute value: For measures that can be negative (e.g., correlation), use absolute value if directionality is not of interest \[\text{Normalized} = |\text{Measure}|\]
  3. Fisher z-transformation: For Pearson’s \(r\) specifically, use \(z = \frac{1}{2} \ln\left(\frac{1+r}{1-r}\right)\) to stabilize variance
  4. Mutual information normalization: Divide by maximum possible MI or entropy \[\text{Normalized MI} = \frac{I(X;Y)}{\min(H(X), H(Y))}\]
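As a small illustration of item 3, the Fisher transformation is simply the inverse hyperbolic tangent; fisher_z below is an illustrative helper, not a library function.

```r
# Fisher z-transformation of a correlation coefficient (illustrative helper)
fisher_z <- function(r) 0.5 * log((1 + r) / (1 - r))

fisher_z(0.5)        # 0.549..., same as atanh(0.5)
tanh(fisher_z(0.5))  # back-transform recovers r = 0.5
```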

4.6.2 Important Caveats

Normalization enables ranking but not direct comparison across measure types. A normalized value of 0.6 for Cramér’s V is not equivalent to 0.6 for Spearman’s \(\rho\). It helps to:

  • Document your normalization choice
  • Interpret results within measure families first
  • Use normalization to identify anomalies and ranking, not to claim equivalence
# Example: normalize association matrix to [0, 1]
assoc_matrix_norm <- assoc_matrix
diag(assoc_matrix_norm) <- NA  # Exclude diagonal (all 1s)

# Min-max normalize each off-diagonal value
# (when measures differ in sign, taking absolute values first is often preferable)
min_assoc <- min(assoc_matrix_norm, na.rm = TRUE)
max_assoc <- max(assoc_matrix_norm, na.rm = TRUE)

assoc_matrix_norm <- (assoc_matrix_norm - min_assoc) / (max_assoc - min_assoc)
diag(assoc_matrix_norm) <- 1  # Restore diagonal

# Heatmap of normalized values
assoc_matrix_norm |>
  as.data.frame() |>
  tibble::rownames_to_column(var = "x") |>
  pivot_longer(-x, names_to = "y", values_to = "assoc") |>
  filter(x != y) |>  # Exclude diagonal for clarity
  ggplot(aes(x = x, y = y, fill = assoc)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(direction = -1, na.value = "white") +
  labs(
    title = "Normalized Association Matrix",
    x = NULL, y = NULL, fill = "Normalized\nAssociation"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4.7 Handling Special Cases

4.7.1 Missing Data

Associations with missing data require explicit handling:

  1. Listwise deletion: Remove all rows with any missing values (simple but wasteful)
  2. Pairwise deletion: Compute each association only from complete cases for that pair (more data, but samples vary by pair)
  3. Imputation: Replace missing values with estimates before computing associations
# Example: listwise vs pairwise deletion (identical here, since only one
# variable pair is involved; they diverge once more variables enter)
mtcars_with_na <- mtcars |>
  mutate(
    mpg = ifelse(row_number() %in% c(5, 10), NA, mpg),
    wt = ifelse(row_number() %in% c(3, 8), NA, wt)
  )

tibble(
  Method = c("Listwise", "Pairwise (mpg-wt)"),
  N = c(
    nrow(na.omit(mtcars_with_na[, c("mpg", "wt")])),
    sum(!is.na(mtcars_with_na$mpg) & !is.na(mtcars_with_na$wt))
  ),
  Correlation = c(
    cor(na.omit(mtcars_with_na[, c("mpg", "wt")])[, 1],
        na.omit(mtcars_with_na[, c("mpg", "wt")])[, 2]),
    cor(mtcars_with_na$mpg, mtcars_with_na$wt, use = "complete.obs")
  )
) |>
  gt() |>
  fmt_number(columns = c(Correlation), decimals = 3)
Method N Correlation
Listwise 28 −0.875
Pairwise (mpg-wt) 28 −0.875

4.7.2 Outliers and Robustness

Outliers can distort Pearson correlations but have less impact on rank-based measures. When outliers are present:

  • Use Spearman or Kendall instead of Pearson
  • Inspect scatterplots for unusual patterns
  • Consider robust alternatives (winsorization, trimmed correlation)
# Impact of outliers
set.seed(42)
x <- rnorm(50)
y <- 0.7 * x + rnorm(50, 0, 0.5)

# Add outliers
x_outlier <- c(x, 10, 10)
y_outlier <- c(y, -8, 12)

tibble(
  Scenario = c("No outliers", "With outliers"),
  Pearson = c(
    cor(x, y, method = "pearson"),
    cor(x_outlier, y_outlier, method = "pearson")
  ),
  Spearman = c(
    cor(x, y, method = "spearman"),
    cor(x_outlier, y_outlier, method = "spearman")
  )
) |>
  gt() |>
  fmt_number(columns = c(Pearson, Spearman), decimals = 3)
Scenario Pearson Spearman
No outliers 0.840 0.845
With outliers 0.310 0.751

4.8 Practical Workflow Summary

Here is a step-by-step workflow for building a unified association matrix:

  1. Classify variables by type (continuous, discrete, ordinal, categorical, binary)
  2. Handle missing data explicitly (document your choice)
  3. Inspect distributions for outliers, skewness, and multimodality
  4. Select measures based on the type-aware selection matrix above
  5. Compute pairwise associations with appropriate measures
  6. Normalize if comparability across pairs is desired
  7. Visualize as a heatmap or network (see later chapters)
  8. Interpret in context (avoid overstating quantitative results)

4.9 Summary and Key Takeaways

  • Variable types matter: Different measurement scales require different association measures to avoid misleading results.
  • Classical measures (Pearson, Spearman, Cramér’s V, \(\eta^2\)) are efficient, interpretable, and widely understood, but each is specialized for specific type combinations.
  • Modern alternatives (distance correlation, MIC, mutual information) offer greater flexibility and can detect nonlinear relationships, but are often more computationally expensive and less interpretable.
  • Type-aware framework: A systematic approach to selecting appropriate measures based on variable type combinations ensures valid and comparable association estimates.
  • Normalization enables ranking but does not erase conceptual differences between measures. It helps to document our choices and interpret results carefully.
  • Special cases (missing data, outliers) require explicit handling. Rank-based measures are more robust than Pearson correlation to outliers.

4.10 Looking Ahead

With a unified framework for association measurement now in hand, the next chapter extends this work to nonlinear and conditional associations. We plan to explore copulas, partial correlations, and graphical models to capture more complex dependence structures. Then, we move to network representations that visualize multivariate association patterns in a format that stakeholders can interpret and act upon.