The conventional approach to machine learning annotation is to have multiple people label each data instance and then take the majority vote. We can do better.
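For concreteness, here is the baseline the rest of this post tries to improve on, as a minimal sketch (the function name and tie-breaking behavior here are my own choices):

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label for one item; ties are broken arbitrarily."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["cat", "dog", "cat", "dog", "cat"]))  # -> cat
```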

There are two primary families of approaches to reaching consensus: (1) modeling processes that improve on majority vote, and (2) discussion or validation processes of some kind. This blog post rounds up a few papers related to annotator disagreement.

Modeling

The canonical approach to probabilistic modeling of annotators and annotator disagreement is the Dawid-Skene model, proposed by Dawid and Skene in 1979. My favorite write-up on the model is Michael Camilleri’s “Reaching a Consensus in crowdsourced data using the Dawid-Skene Model” (2020).
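To make the model concrete: Dawid-Skene treats each item’s true label as a latent variable and gives each annotator a confusion matrix (the probability of reporting each label given each true class), estimating both jointly, typically with EM. Below is a minimal, illustrative sketch of that EM loop in numpy; the variable names are my own. The per-item posteriors it returns are exactly the kind of per-label certainty measure that the Passonneau and Carpenter paper below argues for.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iters=50):
    """Minimal EM for the Dawid-Skene model.

    labels: (n_items, n_annotators) int array; -1 marks a missing annotation.
    Returns per-item posterior class probabilities and, per annotator,
    a confusion matrix p(observed label | true class).
    """
    n_items, n_annotators = labels.shape

    # Initialize posteriors T[i, k] = p(true class of item i is k)
    # from each item's label fractions (a soft majority vote).
    T = np.zeros((n_items, n_classes))
    for i in range(n_items):
        observed = labels[i][labels[i] >= 0]
        T[i] = np.bincount(observed, minlength=n_classes)
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: class prior and annotator confusion matrices
        # from the current soft label assignments.
        class_prior = T.mean(axis=0)
        theta = np.full((n_annotators, n_classes, n_classes), 1e-6)  # smoothing
        for i in range(n_items):
            for j in range(n_annotators):
                if labels[i, j] >= 0:
                    theta[j, :, labels[i, j]] += T[i]
        theta /= theta.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true class given all its labels.
        log_T = np.tile(np.log(class_prior + 1e-12), (n_items, 1))
        for i in range(n_items):
            for j in range(n_annotators):
                if labels[i, j] >= 0:
                    log_T[i] += np.log(theta[j, :, labels[i, j]])
        T = np.exp(log_T - log_T.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return T, theta

# Toy data: 5 items, 3 annotators, 2 classes; -1 = no label.
labels = np.array([
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, -1, 1],
])
posteriors, confusion = dawid_skene(labels, n_classes=2)
print(posteriors.round(2))  # a certainty measure for each inferred gold label
```

In this toy run the third annotator disagrees with the first two on most items, so the model learns to discount that annotator’s labels rather than counting every vote equally.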

Rebecca Passonneau and Bob Carpenter take a more modern approach in “The Benefits of a Model of Annotation” (2014), which also has a blog post summary. Summarizing from their abstract:

Standard agreement measures for interannotator reliability are neither necessary nor sufficient to ensure a high quality corpus. [A probabilistic] annotation model provides far more information [than conventional strategies like majority vote], including a certainty measure for each gold standard label; the crowdsourced data was collected at less than half the cost of the conventional approach.

Discussion processes

My go-to paper for thinking about discussion processes is Schaekermann et al.’s “Resolvable vs. Irresolvable Disagreement: A Study on Worker Deliberation in Crowd Work” (2018), which is comprehensible and provides useful pointers to other research.

Conceptualizing disagreement and uncertainty

Nan-Chen Chen et al. write usefully about the sources of ambiguity in data annotation, suggesting two primary dimensions: data ambiguity and human subjectivity (“Using Machine Learning to Support Qualitative Coding in Social Science: Shifting the Focus to Ambiguity”, 2018).

Others have discussed uncertainty in predictions, whether those predictions come from human annotators or from probabilistic models. The distinction between aleatoric uncertainty (irreducible noise or ambiguity in the data itself) and epistemic uncertainty (uncertainty that more data or a better model could reduce) is a useful way of breaking down the concept; see e.g. “Deep Learning Uncertainty in Machine Teaching” (2022).
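One common way to operationalize that distinction (not specific to any of the papers above) is to take several predictive distributions for the same input, e.g. from an ensemble, dropout samples, or multiple annotators, and split the total predictive entropy into an aleatoric part (the average entropy of the individual distributions) and an epistemic part (the disagreement between them). A rough sketch:

```python
import numpy as np

def decompose_uncertainty(member_probs):
    """Split predictive uncertainty from several predictors into two parts.

    member_probs: (n_members, n_classes) array, each row one predictor's
                  class distribution for the same input.
    Returns (total, aleatoric, epistemic), in nats.
    """
    eps = 1e-12
    mean_p = member_probs.mean(axis=0)
    total = -np.sum(mean_p * np.log(mean_p + eps))       # entropy of the averaged prediction
    aleatoric = -np.mean(np.sum(member_probs * np.log(member_probs + eps), axis=1))
    epistemic = total - aleatoric                         # disagreement (mutual information)
    return total, aleatoric, epistemic

# Predictors agree the input is genuinely ambiguous: mostly aleatoric.
print(decompose_uncertainty(np.array([[0.5, 0.5], [0.55, 0.45], [0.45, 0.55]])))
# Predictors are each confident but disagree with one another: mostly epistemic.
print(decompose_uncertainty(np.array([[0.95, 0.05], [0.05, 0.95]])))
```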

Mitchell Gordon proposed “jury learning”. See “Jury Learning: Integrating Dissenting Voices into Machine Learning Models” (2022) and “The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality” (2021).

Also on juries: “Can Online Juries Make Consistent, Repeatable Decisions?” (2021). In the context of community moderation (which is relevant for many real-world annotation practices), see “Measuring User-Moderator Alignment on r/ChangeMyView” (2023).