This post is a quick summary of three papers that break down the goals and approaches of statistical modeling. I find myself returning to these papers often, and I heartily recommend reading all three:

Statistical Modeling: The Two Cultures

Leo Breiman wrote “Statistical Modeling: The Two Cultures” in 2001.

Breiman argues that there are two goals in analyzing data:

  • Prediction: To be able to predict what the responses (y) are going to be to future input variables (X).
  • Information: To extract some information about how nature is associating the response variables (y) to the input variables (X).

Breiman uses the common analogy of a black box data-generating process that receives a vector of input variables (X) and produces response variables (y).

The data modeling culture assumes “a stochastic data model for the inside of the black box”. Think linear regression.

The algorithmic modeling culture assumes the inside of the box is “complex and unknown”. Think deep learning or gradient boosting.
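
For concreteness, here is a minimal sketch of the two cultures applied to the same prediction task. The dataset and model choices are my own illustration (scikit-learn on synthetic data), not anything from Breiman’s paper:

```python
# A sketch of the two cultures on one prediction task (illustrative only).
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data modeling culture: assume a stochastic model (here, linear) for the black box.
linear = LinearRegression().fit(X_train, y_train)

# Algorithmic modeling culture: treat the inside of the box as complex and unknown.
boosted = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

print("linear R^2: ", r2_score(y_test, linear.predict(X_test)))
print("boosted R^2:", r2_score(y_test, boosted.predict(X_test)))
```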

I am not against data models per se. … But the emphasis needs to be on the problem and on the data.

I associate data modeling culture with an interest in interpretable machine learning.

Interpretability

Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict. For instance, linear regression gives a fairly interpretable picture of the y, X relation. But its accuracy is usually less than that of the less interpretable neural nets.

Breiman calls this the Occam dilemma: “Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors.” He claims that “the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable.”

The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information.

I like this re-framing of interpretability as a process for getting information about the variables. It’s interesting that in the 20 years since this was published, a whole field (mechanistic interpretability) has arisen that uses interpretability methods to understand the inner workings of the models themselves, which demonstrates how complex the models have gotten!
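
In that spirit, here is a small sketch of interpretability as information-extraction, again on synthetic data of my own choosing: read the coefficients of the simple model directly, and probe the less interpretable model with permutation importance.

```python
# Two ways to get information about the y, X relation (illustrative only):
# coefficients for the simple model, permutation importance for the complex one.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

# Interpretable model: the fitted coefficients are a direct read-out.
linear = LinearRegression().fit(X, y)
print("linear coefficients:", np.round(linear.coef_, 2))

# Less interpretable model: ask which inputs matter by permuting them.
boosted = GradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(boosted, X, y, n_repeats=10, random_state=0)
print("permutation importances:", np.round(result.importances_mean, 2))
```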

Responses

In 2021, the journal Observational Studies published a special issue with 28 commentaries on Breiman’s essay. Nandita Mitra summarizes the responses as covering:

  1. The blending and cross-fertilization of modeling cultures (not just two distinct ones) under historical, foundational, and flexible paradigms
  2. The importance of interpretable algorithms, understanding the data, outcome reasoning, modeling based on scientific theory, and social responsibility
  3. Causal modeling
  4. Bayesian inference
  5. Distributed learning, targeted learning, supervised learning, and computational thinking

Judea Pearl, for example, writes that “the two cultures contrasted by Breiman are not descriptive vs. causal but, rather, two styles of descriptive modeling, one interpretable, the other uninterpretable”. He argues (as he always does) for causal modeling: “As we read Breiman’s paper today … we may say that his advocacy of algorithmic prediction was justified. Guided by a formal causal model for identification and bias reduction, the predictive component in the analysis can safely be trusted to non-interpretable algorithms. The interpretation can be accomplished separately by the causal component of the analysis.”

On Discriminative vs. Generative Classifiers

“On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes” is a famous paper by famous authors: Andrew Ng and Michael I. Jordan. It focuses on the distinction between discriminative and generative classifiers, providing a useful abstraction for thinking about modeling.

Generative classifiers learn a model of the joint probability, p(x,y), of the inputs x and the label y, and make their predictions by using Bayes’ rule to calculate p(y|x), and then picking the most likely label y. Discriminative classifiers model the posterior p(y|x) directly, or learn a direct map from inputs x to the class labels. There are several compelling reasons for using discriminative rather than generative classifiers, one of which, succinctly articulated by Vapnik, is that “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(x|y)].”

I found this distinction hugely helpful when I was first learning about machine learning, although this distinction between the posterior and the joint probability is fairly tenuous for many modeling approaches.
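
For a concrete contrast, here is a minimal sketch of the two kinds of classifier from the paper’s title, fit to the same data. The synthetic dataset and scikit-learn estimators are my choices, not the paper’s experimental setup: GaussianNB estimates p(x|y) and p(y) and applies Bayes’ rule, while LogisticRegression models p(y|x) directly.

```python
# Generative vs. discriminative classifiers on the same data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Generative: model p(x|y) and p(y), then use Bayes' rule for p(y|x).
generative = GaussianNB().fit(X_train, y_train)

# Discriminative: model the posterior p(y|x) directly.
discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("naive Bayes accuracy:        ", generative.score(X_test, y_test))
print("logistic regression accuracy:", discriminative.score(X_test, y_test))
```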

Integrating explanation and prediction in computational social science

In “Integrating explanation and prediction in computational social science” (open-access pdf), Hofman et al. articulate the use of statistical modeling for answering social science research questions.

They articulate four approaches to modeling:

  1. Descriptive modeling: Describe situations in the past or present (but neither causal nor predictive)
  2. Explanatory modeling: Estimate effects of changing a situation (but many effects are small)
  3. Predictive modeling: Forecast outcomes for similar situations in the future (but can break under changes)
  4. Integrative modeling: Predict outcomes and estimate effects in as yet unseen situations

They lay out these four approaches to modeling in a 2x2 grid reflecting two primary axes:

  • Focusing on specific features or causal effects vs. predicting outcomes
  • One dataset (no experimental intervention or distributional shift) vs multiple datasets (e.g. comparisons under intervention or under distributional shift)

Descriptive and explanatory modeling are mostly self-explanatory, but the distinction between predictive and integrative modeling is a little more subtle:

Whereas [predictive modeling] concerns itself with data that are out of sample, but still from the same (statistical) distribution, [for integrative modeling] the focus is on generalizing ‘out of distribution’ to a situation that might change either naturally, owing to some factor out of our control, or because of some intentional intervention such as an experiment or change in policy.
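
As a toy illustration of that distinction (entirely my own, not from the paper), the sketch below evaluates one fitted model on held-out data from the same distribution and on data whose input distribution has been artificially shifted:

```python
# In-distribution vs. out-of-distribution evaluation of the same model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def simulate(n, shift=0.0):
    # A toy data-generating process; `shift` moves the input distribution.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 3))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)
    return X, y

X_train, y_train = simulate(2000)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

X_same, y_same = simulate(1000, shift=0.0)        # same distribution (predictive setting)
X_shifted, y_shifted = simulate(1000, shift=2.0)  # shifted inputs (the setting integrative modeling targets)

print("in-distribution R^2:    ", r2_score(y_same, model.predict(X_same)))
print("out-of-distribution R^2:", r2_score(y_shifted, model.predict(X_shifted)))
```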

When planning a new modeling project, I encourage researchers to think through the specific type of modeling they are trying to do. These days, many people I work with try to apply predictive modeling tools to explanatory or integrative modeling tasks.