Inference with predicted data refers to making inferences (causal or otherwise) about a phenomenon when some or all of the data we use are the output of a predictive algorithm of some kind. The dream of using statistical models to make data collection cheaper has intoxicated many researchers over the years, but predictive models introduce bias. Quantifying and correcting that bias remains a long-standing open research question.

I may clean this post up later, but here are a few sources discussing these issues:

Kentaro Hoffman et al. “Do We Really Even Need Data?” (Feb 2024).

Zach Wood-Doughty et al. “Challenges of Using Text Classifiers for Causal Inference”

In “Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond”, Amir Feder et al. discuss how “causal assumptions are complicated when variables necessary for a causal analysis are extracted automatically from text”.

The main idea is to use NLP methods to extract confounding aspects from text and then adjust for those aspects in an estimation approach such as propensity score matching. However, how and when these methods violate causal assumptions remain open questions.
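
To make that pipeline concrete, here is a minimal, entirely synthetic sketch (my own toy construction, not code from any of the papers above): a classifier recovers a latent confounder from text, and the prediction feeds a propensity score model. I use inverse propensity weighting instead of matching just to keep the code short; in practice the classifier would be trained on a separately annotated subset rather than on the full ground truth.

```python
# Toy "text as confounder" pipeline: predict the confounder from text,
# then adjust for the prediction via inverse propensity weighting.
# All data are synthetic; variable names and numbers are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Latent confounder (e.g., stress) affects both treatment and outcome,
# and is only observable through the text a person writes.
stress = rng.integers(0, 2, size=n)
texts = [
    "deadline pressure exhausted overwhelmed anxious" if s
    else "calm relaxed weekend hiking with friends"
    for s in stress
]
treatment = rng.binomial(1, 0.3 + 0.4 * stress)                 # confounded assignment
outcome = 1.0 * treatment - 2.0 * stress + rng.normal(0, 1, n)  # true effect = 1.0

# Step 1: recover the confounder from text with a classifier.
# (Here it is trained on the ground-truth labels for brevity; in practice
# you would train on a hand-annotated subset.)
X = TfidfVectorizer().fit_transform(texts)
conf_hat = LogisticRegression().fit(X, stress).predict_proba(X)[:, 1]

# Step 2: propensity scores estimated from the *predicted* confounder.
Z = conf_hat.reshape(-1, 1)
propensity = LogisticRegression().fit(Z, treatment).predict_proba(Z)[:, 1]

# Step 3: inverse-propensity-weighted contrast of treated vs. untreated.
weights = np.where(treatment == 1, 1 / propensity, 1 / (1 - propensity))
ate = (np.average(outcome[treatment == 1], weights=weights[treatment == 1])
       - np.average(outcome[treatment == 0], weights=weights[treatment == 0]))
print(f"IPW estimate: {ate:.2f} (the naive difference in means would be biased)")
```

The interesting failure modes all live in Step 1: if the classifier's errors are correlated with treatment or outcome, the adjustment can make things worse, which is exactly the concern raised by Wood-Doughty et al. and Feder et al. above.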

My favorite example of this approach is the work of Koustuv Saha, e.g. “A Social Media Study on the Effects of Psychiatric Medication Use”.

Open questions I have:

  • I would love to see a good taxonomy of the different types of inference with predicted data, beyond Hoffman et al. There are huge bodies of work on, for example, missing data imputation and causal inference that could be united under a single framework.
  • In what circumstances might using a predictive model reduce bias relative to alternative data collection instruments? This seems hard but valuable to quantify, and it is closely related to data quality estimation. My only touchpoint on this topic is “Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election” (June 2018) by Xiao-Li Meng. Meng shows that the error of a sample average relative to the population average factors into the product of (1) data quality, (2) data quantity, and (3) problem difficulty (the identity is sketched after this list). We might apply a similar decomposition to compare the cost of using a statistical model of a given quality against a high-quality (but much more expensive) manual data collection method.
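
For reference, here is Meng's identity as I understand it, with notation lightly adapted from the paper: for a sample of size n from a population of size N, with sampling fraction f = n/N and inclusion indicator R,

```latex
\bar{Y}_n - \bar{Y}_N
  = \underbrace{\hat{\rho}_{R,Y}}_{\text{data quality}}
    \times
    \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{data quantity}}
    \times
    \underbrace{\sigma_Y}_{\text{problem difficulty}}
```

where the "data defect correlation" ρ̂ between inclusion and the outcome captures data quality, the dropout odds (1−f)/f capture data quantity, and the population standard deviation σ_Y captures problem difficulty. The analogy I have in mind: a predictive model and a manual annotation pipeline induce different "defect correlations," and the question is when the cheaper instrument's larger defect is outweighed by the quantity it buys.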