2 Data Types in Tabular Data

2.1 Why Variable Type Matters

Which descriptive statistics are meaningful depends on a variable's measurement scale. A statistic that is informative for continuous variables can be misleading for categorical or ordinal data; the mean of numeric region codes, for example, has no interpretation. Before we compute any association, it helps to classify variables carefully.

This chapter provides a compact guide to data types and the practical decisions they imply. The goal is to build a shared vocabulary for the rest of the book.

2.2 Measurement Scales

Most tabular variables fall into one of four measurement scales:

Nominal: unordered labels (e.g., region, industry, color)
Ordinal: ordered categories (e.g., education level, Likert scales)
Interval: numeric scales with equal spacing but no true zero (e.g., temperature in °C)
Ratio: numeric scales with a meaningful zero (e.g., income, height)

In descriptive analysis, the key distinctions are ordered vs. unordered and numeric vs. non-numeric. They determine whether ranks, distances, or variance-based measures are appropriate.
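The interval/ratio distinction can be checked with simple arithmetic. The sketch below uses plain Python; the helper name `celsius_to_kelvin` and the two readings are illustrative choices, not part of any library or dataset. It converts two Celsius readings to kelvins and shows that the ratio of the readings depends on the unit, while the difference does not.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return c + 273.15


# Two illustrative temperature readings
warm_c, cool_c = 20.0, 10.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

ratio_c = warm_c / cool_c   # 2.0, an artifact of where 0 °C happens to sit
ratio_k = warm_k / cool_k   # roughly 1.035 on the absolute scale
diff_c = warm_c - cool_c    # differences are meaningful on both scales
diff_k = warm_k - cool_k

print(f"ratio in °C: {ratio_c:.3f}, ratio in K: {ratio_k:.3f}")
print(f"difference in °C: {diff_c:.1f}, difference in K: {diff_k:.1f}")
```

Because the ratio changes with the unit, statements such as "twice as warm" are not supported by an interval scale, whereas distances between values are.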

2.3 Common Data Types in Tabular Data

In practice, we usually work with these operational types:

Continuous: real-valued measurements (income, height, temperature)
Discrete: integer-valued counts, usually non-negative (number of events, number of visits)
Categorical (nominal): unordered labels (industry, region)
Ordinal: ordered categories (education level, satisfaction)
Binary: two-level indicators (yes/no, 0/1)

Binary variables can be treated as a special case of categorical or ordinal data, though they often benefit from their own association measures for clarity and comparability.

2.4 Practical Heuristics for Classification

When codebooks are incomplete, a few simple heuristics can help classify variables:

Check data types: numeric vs. character/factor.
Inspect unique values: few unique values often signal categorical or ordinal variables.
Look for ordering: ordered factors or natural rank in labels.
Validate units: interval and ratio scales differ in interpretability (e.g., 20°C is not “twice” as warm as 10°C).

A misclassified variable can distort downstream association measures, so it is worth building these checks into any descriptive pipeline.
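The first three heuristics can be combined into a rough first-pass classifier. The following is a minimal sketch in plain Python, not a definitive rule set: the function name `guess_type`, the `max_levels` threshold of 10, and the `ordered_levels` argument (standing in for ordering known from a codebook) are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def guess_type(values: Sequence[object],
               ordered_levels: Optional[Sequence[str]] = None,
               max_levels: int = 10) -> str:
    """Guess an operational type for one column (heuristic only).

    `max_levels` is an arbitrary illustrative threshold, and
    `ordered_levels` is a hypothetical list of known ordered labels.
    """
    present = [v for v in values if v is not None]  # ignore missing entries
    unique = set(present)

    if len(unique) == 2:
        return "binary"
    if ordered_levels is not None and unique <= set(ordered_levels):
        return "ordinal"
    # bool is a subclass of int in Python, so exclude it explicitly
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in present)
    if not numeric:
        return "categorical"
    if len(unique) <= max_levels:
        # Few distinct numbers: may be codes, ranks, or small counts
        return "ambiguous"
    if all(float(v).is_integer() for v in present):
        return "discrete"
    return "continuous"


print(guess_type(["yes", "no", "yes"]))
print(guess_type(["low", "high", "medium"],
                 ordered_levels=["low", "medium", "high"]))
print(guess_type(["north", "south", "east"]))
print(guess_type([1, 2, 1, 3]))
print(guess_type(list(range(40))))
print(guess_type([x * 0.37 for x in range(50)]))
```

The "ambiguous" result is deliberate: a numeric column with few distinct values could hold category codes, ordinal ranks, or small counts, and only documentation or domain knowledge can settle which.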

2.5 Common Pitfalls

Numeric codes for categories: e.g., 1 = “North”, 2 = “South”. It is usually better to treat these as categorical, not continuous.
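Ordinal scales stored as text: e.g., “low”, “medium”, “high” without explicit ordering. Plain text sorts alphabetically (“high”, “low”, “medium”), so the intended rank is lost unless it is declared.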
Sparse categories: rare levels can inflate or destabilize association measures.
Mixed scales: variables constructed from heterogeneous sources (e.g., indices) may not behave like true ratio scales.
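The first two pitfalls are easy to reproduce. The sketch below uses made-up region codes and satisfaction labels (hypothetical data, for illustration only) to show that the mean of category codes depends on an arbitrary labeling, and that text labels are ordered alphabetically rather than by rank.

```python
from statistics import mean

# Hypothetical coding: 1 = "North", 2 = "South", 3 = "East"
region_codes = [1, 1, 2, 3, 3, 3]
# The same observations under an equally valid coding:
# 1 = "East", 2 = "North", 3 = "South"
recoded = [2, 2, 3, 1, 1, 1]

mean_original = mean(region_codes)  # depends entirely on the coding
mean_recoded = mean(recoded)        # same data, different "average"

# Ordinal labels stored as plain text compare alphabetically
levels = ["low", "medium", "high"]
alphabetical = sorted(levels)
text_max = max(levels)              # the alphabetically last label

print(mean_original, mean_recoded)
print(alphabetical)
print(text_max)
```

Here the "largest" satisfaction level comes out as "medium", and the two means disagree even though the underlying observations are identical.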

2.6 Checklist Before Computing Associations

Confirm variable types with documentation or exploratory checks.
Convert numeric codes to factors when they represent labels.
Encode ordinal variables with explicit ordering.
Decide how to handle rare categories (merge, drop, or keep with caution).
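The checklist steps can be sketched in a few lines of plain Python. The codebook, the data, the `merge_rare` helper, and its `min_count` threshold below are all hypothetical and chosen only for illustration; in a real pipeline the equivalent steps would typically use the categorical types of whatever data-frame library is in use.

```python
from collections import Counter

# Hypothetical codebook and data, for illustration only
region_labels = {1: "North", 2: "South"}
region_codes = [1, 2, 2, 1, 2]
regions = [region_labels[c] for c in region_codes]  # numeric codes -> labels

# Explicit ordering for an ordinal variable
satisfaction_order = ["low", "medium", "high"]
satisfaction = ["high", "low", "medium", "high"]
satisfaction_rank = [satisfaction_order.index(s) for s in satisfaction]


def merge_rare(values, min_count=2, other="Other"):
    """Replace levels seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


industry = ["retail", "retail", "mining", "retail", "fishing", "tech", "tech"]
industry_merged = merge_rare(industry)

print(regions)
print(satisfaction_rank)
print(industry_merged)
```

Merging is only one of the options listed above; whether to merge, drop, or keep rare levels depends on how much information they carry and on the association measure being used.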

2.7 Summary and Key Takeaways

Variable type determines appropriate methods: The measurement scale (nominal, ordinal, interval, ratio) shapes which descriptive statistics and association measures are valid.
Five operational types cover most cases: Continuous, discrete, categorical, ordinal, and binary variables each call for different analytical approaches.
Misclassification can lead to misleading results: Treating numeric codes as continuous or ignoring ordinal structure can distort associations and summaries.
Heuristics help when codebooks are incomplete: Checking data types, unique values, natural ordering, and units provides practical guidance for classification.
Common pitfalls are avoidable: Being attentive to numeric codes for categories, unordered ordinal text, sparse categories, and mixed-scale indices helps prevent errors.
Explicit classification supports comparability: Type-aware analysis helps ensure that different variable pairs are measured with appropriate, comparable methods.

2.8 Looking Ahead

The next chapter expands the descriptive toolkit beyond basic summaries. With data types clarified, we can interpret richer multivariate structure and prepare for systematic association measurement in later parts of the book.