Consensus Relevance with Topic and Worker Conditional Models
Paul N. Bennett, Microsoft Research
 
Joint with
Ece Kamar, Microsoft Research
Gabriella Kazai, Microsoft Research Cambridge
Motivation for Consensus Task
Recover the actual relevance of a topic-document pair
from the noisy predictions of multiple labelers.
Obtain a more reliable signal from the crowd and/or
benefit from scale (expert quality from inexperienced
assessors).
A variety of approaches have been proposed in the literature
and in the competition:
Supervised: Classification models.
Semi-supervised: EM-style algorithms.
Unsupervised: majority vote.
Common Axes of Generalization
Two axes: whether the topic was observed in training, and whether the
document's relevance was observed in training. The three cases that
require generalization:
Topic observed, relevance not observed: compute consensus for "new
documents" on known topics.
Topic not observed, relevance observed: compute consensus on new topics
for documents with known relevance on other topics.
Topic not observed, relevance not observed: use rules or observed worker
accuracies on other topics/documents to compute consensus on new topics
and documents.
 
Note hidden axis of observed workers.
Our Approach
 
Supervised
Given gold-truth judgments on a topic set and worker responses, learn a
consensus model that generalizes to new documents on the same topic set.
Must also be able to generalize to new workers.
 
Want a well-founded probabilistic method.
Needs to handle the major sources of worker error:
Worker skill/accuracy.
Topic difficulty.
Needs to handle correlation in labels.
Correlation is expected because of the underlying true label.

Note: we will use "assessor" for the ground-truth labeler and "worker"
for the noisy labelers.
Basic Model

The probability of relevance should depend on the topic, the document,
and the elicited worker responses:

P(R_{i,j} | t_i, d_j, r_1, ..., r_K elicited for (i,j))

where t_i ∈ T is a particular topic, d_j ∈ D is a particular document,
R_{i,j} ∈ {0,1} is the event that d_j is relevant to topic t_i, and
r_k ∈ {0,1} is the response of worker k for the (i,j)-th pair.
Abbreviate the response vector as r_{1:K}.
Exchangeability Related Assumptions
 
Given two identical sets of voting history, we
assume two workers have the same response
distribution.
 
Whether or not a worker’s opinion is elicited is
not informative.
 
The ordering of responses/elicitation is not
informative.
Relevance Conditional Independence

Assume conditional independence of worker responses given document
relevance; this implies workers have comparable accuracies across tasks.
Assume one topic-independent prior on relevance:

P(R_{i,j} | t_i, d_j, r_{1:K}) ∝ P(R_{i,j}) ∏_k P(r_k | R_{i,j})

where P(R_{i,j}) is the probability of relevance across all topics, and
P(r_k | R_{i,j}) is the probability of a random worker's response given
relevance (across all topics). Referred to as naïve Bayes.
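As a concrete illustration, here is a minimal sketch of this naïve Bayes
consensus model in Python. It is not the authors' implementation; the data
layout (0/1 relevance labels, lists of 0/1 worker votes) and the add-one
smoothing are assumptions made for the example.

```python
from collections import Counter

def fit_naive_bayes(labeled_pairs):
    """Estimate P(R) and P(r | R) from gold-labeled data.

    labeled_pairs: iterable of (relevance, votes), where relevance is 0/1
    and votes is a list of 0/1 worker responses for that topic-document
    pair. Add-one smoothing keeps probabilities away from zero.
    """
    rel_counts = Counter()    # relevance value -> number of pairs
    vote_counts = Counter()   # (relevance, vote) -> number of responses
    for rel, votes in labeled_pairs:
        rel_counts[rel] += 1
        for v in votes:
            vote_counts[(rel, v)] += 1
    n_pairs = sum(rel_counts.values())
    prior = {rel: (rel_counts[rel] + 1) / (n_pairs + 2) for rel in (0, 1)}
    likelihood = {}
    for rel in (0, 1):
        total = vote_counts[(rel, 0)] + vote_counts[(rel, 1)]
        for v in (0, 1):
            likelihood[(v, rel)] = (vote_counts[(rel, v)] + 1) / (total + 2)
    return prior, likelihood

def posterior_relevance(prior, likelihood, votes):
    """P(R = 1 | r_{1:K}) under conditional independence of votes given R."""
    score = {rel: prior[rel] for rel in (0, 1)}
    for v in votes:
        for rel in (0, 1):
            score[rel] *= likelihood[(v, rel)]
    return score[1] / (score[0] + score[1])
```

For example, posterior_relevance(prior, likelihood, [1, 1, 0]) returns the
consensus probability of relevance for a pair with two positive votes and
one negative vote.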
Topic and Relevance Conditional Independence

Assume responses are conditionally independent given topic and relevance;
this implies workers have comparable accuracy within a topic, but varying
across topics. Assume a topic-dependent prior on relevance:

P(R_{i,j} | t_i, d_j, r_{1:K}) ∝ P(R_{i,j} | t_i) ∏_k P(r_k | R_{i,j}, t_i)

where P(R_{i,j} | t_i) is the probability of relevance for this topic, and
P(r_k | R_{i,j}, t_i) is the probability of a random worker's response
given relevance for this topic. Referred to as nB Topic.
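The nB Topic variant only changes which tables are estimated: both the
prior and the response likelihood become topic-indexed. A minimal sketch,
again with an assumed data layout and add-one smoothing; with few gold
labels per topic, the smoothing does a lot of work, and backing off to the
global estimates would be a plausible refinement.

```python
from collections import Counter

def fit_nb_topic(labeled_pairs):
    """Estimate P(R | t) and P(r | R, t) per topic from gold-labeled data.

    labeled_pairs: iterable of (topic, relevance, votes). Returns
    topic-indexed prior and likelihood tables; scoring then mirrors
    posterior_relevance above, with the tables looked up by topic.
    """
    rel_counts = Counter()    # (topic, relevance) -> number of pairs
    vote_counts = Counter()   # (topic, relevance, vote) -> responses
    topics = set()
    for t, rel, votes in labeled_pairs:
        topics.add(t)
        rel_counts[(t, rel)] += 1
        for v in votes:
            vote_counts[(t, rel, v)] += 1
    prior, likelihood = {}, {}
    for t in topics:
        n_t = rel_counts[(t, 0)] + rel_counts[(t, 1)]
        for rel in (0, 1):
            prior[(t, rel)] = (rel_counts[(t, rel)] + 1) / (n_t + 2)
            total = vote_counts[(t, rel, 0)] + vote_counts[(t, rel, 1)]
            for v in (0, 1):
                likelihood[(t, v, rel)] = (vote_counts[(t, rel, v)] + 1) / (total + 2)
    return prior, likelihood
```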
Worker and Relevance Conditional Independence

Each worker has a particular skill/accuracy in making relevance judgments,
which can be estimated by aggregating a history of accuracy a_k across all
tasks. Assume responses are independent conditional on historical accuracy
and relevance:

P(R_{i,j} | t_i, d_j, r_{1:K}) ∝ P(R_{i,j}) ∏_k P(r_k | R_{i,j}, a_k)

where P(R_{i,j}) is the probability of relevance across all topics, and
P(r_k | R_{i,j}, a_k) is the probability of this worker's response given
relevance (across all topics). Referred to as nB Worker.
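For nB Worker, one simple instantiation (an assumption here, not
necessarily the authors' exact parameterization) treats each worker's vote
as agreeing with the true relevance with probability a_k, the worker's
smoothed historical accuracy. Unseen workers fall back to an uninformative
0.5, which is one way to meet the requirement of generalizing to new
workers.

```python
from collections import Counter

def fit_worker_accuracies(labeled_pairs):
    """Estimate each worker's historical accuracy a_k from gold labels.

    labeled_pairs: iterable of (relevance, votes), where votes is a list
    of (worker_id, vote). Smoothed toward 0.5 for rarely-seen workers.
    """
    correct, total = Counter(), Counter()
    for rel, votes in labeled_pairs:
        for worker, v in votes:
            correct[worker] += int(v == rel)
            total[worker] += 1
    return {w: (correct[w] + 1) / (total[w] + 2) for w in total}

def nb_worker_posterior(p_rel, accuracies, votes, default_acc=0.5):
    """P(R = 1 | votes): each vote matches R with probability a_k."""
    score = {0: 1.0 - p_rel, 1: p_rel}
    for worker, v in votes:
        a = accuracies.get(worker, default_acc)  # new worker -> uninformative
        for rel in (0, 1):
            score[rel] *= a if v == rel else 1.0 - a
    return score[1] / (score[0] + score[1])
```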
Evaluation
 
Which Label
Gold: evaluate using expert assessor’s label as truth.
Consensus: evaluate using consensus of participants’ responses
as truth.
Other Participant: evaluate using a particular participant’s
responses as truth.
 
 
Methodology
Used the development validation set as a test set to decide which method
to submit.
Split the development training set 80/20 into train/validation by
topic-docID pair (i.e., for a given topic, all responses for a docID were
completely in or out of the validation set).
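A sketch of that split, assuming each response record carries "topic" and
"docID" fields; the shuffle-based selection is an assumption, and any
grouping that keeps a pair's responses together would do.

```python
import random

def split_by_pair(responses, valid_fraction=0.2, seed=0):
    """80/20 train/validation split at the (topic, docID) level, so all
    responses for a given pair land entirely on one side of the split."""
    pairs = sorted({(r["topic"], r["docID"]) for r in responses})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_valid = int(len(pairs) * valid_fraction)
    valid_pairs = set(pairs[:n_valid])
    train = [r for r in responses if (r["topic"], r["docID"]) not in valid_pairs]
    valid = [r for r in responses if (r["topic"], r["docID"]) in valid_pairs]
    return train, valid
```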
Development Set

Model         | TruePos | TrueNeg | FalsePos | FalseNeg | Accuracy | DefaultAcc | Prec  | Recall | Specificity
Majority Vote |     101 |       8 |       17 |       19 |    75.2% |      82.8% | 85.6% |  84.2% |       32.0%
naïve Bayes   |     120 |       0 |       25 |        0 |    82.8% |      82.8% | 82.8% | 100.0% |        0.0%
nB Topic      |     115 |       7 |       18 |        5 |    84.1% |      82.8% | 86.5% |  95.8% |       28.0%
nB Worker     |     117 |       1 |       24 |        3 |    81.4% |      82.8% | 83.0% |  97.5% |        4.0%

The skew and scarcity of the development set made model selection
challenging. We chose nB Topic since it was the only method that
outperformed the baseline of predicting the most common class.
Results

(rows sorted by Log Loss)
Team          | Accuracy | Soft Acc. | Recall | Precision | Specificity | Log Loss | RMSE  | Acc. Rank | Soft Acc. Rank
MSRC          |    69.3% |     64.0% |  79.0% |     66.2% |       59.6% |   610.28 | 44.9% |         3 |              6
uogTr         |    36.7% |     44.1% |  13.6% |     25.3% |       59.8% |   931.74 | 58.8% |        10 |             10
LingPipe      |    67.6% |     66.2% |  76.2% |     65.0% |       59.0% |   975.88 | 49.7% |         5 |              4
GeAnn         |    60.7% |     57.7% |  88.4% |     56.9% |       33.0% |  1150.45 | 51.3% |         7 |              8
UWaterlooMDS  |    69.4% |     67.4% |  80.2% |     66.0% |       58.6% |  1435.79 | 50.1% |         2 |              3
uc3m          |    69.9% |     69.9% |  75.4% |     67.9% |       64.4% |  2772.38 | 54.9% |         1 |              1
BUPT-WILDCAT  |    68.5% |     68.5% |  78.6% |     65.4% |       58.4% |  2901.33 | 56.1% |         4 |              2
TUD_DMIR      |    66.2% |     66.2% |  76.4% |     63.5% |       56.0% |  3113.16 | 58.1% |         6 |              5
UTaustin      |    60.4% |     60.4% |  90.8% |     56.5% |       30.0% |  3647.36 | 62.9% |         8 |              7
qirdcsuog     |    52.9% |     52.9% |  82.4% |     51.8% |       23.4% |  4338.12 | 68.6% |         9 |              9

Methods that report probabilities did better on the probability measures
in almost all cases, and almost always improve when converted to
decisions via a decision-theoretic threshold.
The outlier's log-loss performance versus its accuracy after thresholding
implies it is poorly calibrated with respect to the decision threshold,
but likely good overall.
Our method is best on the probability measures and near the top in general.
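To make the calibration point concrete, here is a small illustration (not
track data) of how log loss separates calibration from thresholded
accuracy: the two predictors below make identical decisions at a 0.5
threshold, but the over-confident one pays heavily for its single mistake.

```python
import math

def total_log_loss(labels, probs, eps=1e-15):
    """Sum of negative log-likelihoods of the predicted P(relevant)."""
    loss = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= math.log(p) if y == 1 else math.log(1.0 - p)
    return loss

labels = [1, 1, 1, 0]
moderate = [0.7, 0.7, 0.4, 0.3]      # wrong on the 3rd pair, mildly
extreme = [0.99, 0.99, 0.01, 0.01]   # same decisions, over-confident
print(total_log_loss(labels, moderate))  # ~1.99
print(total_log_loss(labels, extreme))   # ~4.64
```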
 
 
Conclusions
The simple topic-and-relevance conditional independence model produces:
Best performance on probability measures on the gold set.
Nearly the best performance on accuracy.
Topic-level effects explain the majority of variability in
judgments (on this data and over the set of submissions).
Future:
Worker-relevance on test set
Worker-topic-relevance conditional independence model
Method performance versus best/median individual worker
(sufficient data to evaluate?)
 
Thoughts for Future
Crowdsourcing Tracks
 
Is consensus independent of elicitation?
Can consensus be studied independently of the design used to
collect worker responses?
Probably okay if development and test sets are collected
with the same methodology.
 
Collection-design factors likely worth analyzing:
Number of gold-standard labels in the "training set" per topic.
Number of labels per worker.
Number of labels per item.
Number of worker responses on observed items.
Stability of the topic-conditional prior on relevance.
 
Questions?
 