Entity-specific Rankings of Knowledge Base Properties
This talk studies the property ranking problem: given an entity and a set of knowledge base properties, decide which properties are most interesting for that entity. It discusses applications in knowledge base curation and natural language generation, relates the problem to prior work on entity ranking and fact ranking, and contributes a dataset built on the 100 most frequent non-ID properties, illustrated with entities such as Cristiano Ronaldo.
Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties. Simon Razniewski, Max Planck Institute for Informatics, Germany. Joint work with Vevake Balaraman (FBK Trento, Italy) and Werner Nutt (FU Bozen-Bolzano, Italy).
Knowledge Bases: Structured collections of facts about the general world. Examples: Wikidata, DBpedia, Google Knowledge Graph, Microsoft Satori.
Example properties (Cristiano Ronaldo): Date of birth: 5 February 1985. Native language: Portuguese. Position played: Forward. Nickname: El Comandante. Mass: 80 kg. Religion: Roman Catholic. Official website: www.cristianoronaldo.com.
Editors: Information overload. More than 3000 properties can be asserted in Wikidata, for example doctoral advisor, medical condition, monastic order, handedness, names of pets. Which ones actually matter?
The property ranking problem. Given: an entity and a set of attributes/properties. Question: Which properties are actually interesting for that entity?
Property Ranking: Applications. Knowledge base curation: What the heck, we are missing the team of Ronaldo? Natural language generation: Which properties should be mentioned (first)? Comparing informativeness of entries: How good is the data about Ronaldo compared with Maradona?
Related work. Considerable work on entity ranking (e.g. PageRank) and fact ranking (e.g. trivia generation). Property suggestion: (1) Association rules [Abedjan and Naumann, DBS 2013]: ad-hoc combination of rules, no human evaluation. (2) Generic machine learning [Atzori and Dessi, FGCS 2016]: class level only, limited evaluation. (3) Frequency-based [Ahmeti, Razniewski, Polleres, ESWC 2017]: very simple heuristic.
Contribution 1: Dataset. Properties: the 100 most frequent non-ID properties. Entities: recently edited ones. 350 triples, each consisting of a random entity and a property pair, with 10 judgments per triple, e.g. (Ronaldo, doctoral advisor, medical condition). Example annotation task: "Cristiano Ronaldo dos Santos Aveiro (born 5 February 1985) is a Portuguese professional footballer who plays as a forward for Spanish club Real Madrid and the Portugal national team. Often considered the best player in the world, Ronaldo is the first player in history to win four European Golden Shoes. Which of the following two properties would be more interesting to know about for Cristiano Ronaldo? a. Doctoral advisor b. Medical condition"
Agreement distribution. Remainder of the talk: focus on triples with >= 0.8 agreement.
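The agreement filter can be sketched as follows. This is a minimal illustration that assumes agreement means the fraction of the 10 annotators who picked the majority option; the slides do not define the measure, and the judgment lists below are invented.

```python
from collections import Counter

def agreement(judgments):
    """Fraction of annotators who chose the majority option (assumed definition)."""
    counts = Counter(judgments)
    return max(counts.values()) / len(judgments)

# Hypothetical judgments: "a" or "b" per annotator, 10 annotators per triple.
records = {
    ("Cristiano Ronaldo", "doctoral advisor", "medical condition"): ["b"] * 9 + ["a"],
    ("Cristiano Ronaldo", "religion", "mass"): ["a"] * 6 + ["b"] * 4,
}

# Keep only triples on which annotators agree strongly enough.
kept = {triple: votes for triple, votes in records.items() if agreement(votes) >= 0.8}
```

With these invented votes, only the first triple passes the 0.8 threshold.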
Learning to Rank (L2R): Theory. 3 core paradigms: pointwise, pairwise, and listwise L2R. Pointwise: learn a score function, e.g. score(Ronaldo, positionPlayed) = 0.73. Pairwise: learn a pairwise preference function, e.g. preference(Ronaldo, positionPlayed, lastName) = lastName. (Listwise: directly learn lists from lists.) Pointwise: unstable, because the score depends on framing. Supervised pairwise: requires too much training data (100 properties yield about 5,000 pairs per entity; we have only 1/350). Choice left: unsupervised approximation of pairwise ranking.
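A small sketch of the pairwise view, under stated assumptions: the property names are placeholders, the preference function is a toy stand-in for whatever method decides a pair, and ranking by number of pairwise wins is one common way to turn preferences into a list (the slides do not say how, or whether, a full list is built).

```python
from itertools import combinations
from math import comb

# 100 placeholder property names; the pair count explains the "about 5,000" figure.
properties = [f"P{i}" for i in range(100)]
pairs_per_entity = comb(len(properties), 2)  # 100 * 99 / 2 = 4950

def rank_by_pairwise_wins(props, prefer):
    """Order properties by how many pairwise comparisons each one wins."""
    wins = {p: 0 for p in props}
    for a, b in combinations(props, 2):
        wins[prefer(a, b)] += 1
    return sorted(props, key=lambda p: wins[p], reverse=True)

# Toy preference for one entity: earlier in this list means more interesting.
# lastName beating positionPlayed follows the slide; hairColor is invented.
toy_order = ["lastName", "positionPlayed", "hairColor"]

def toy_prefer(a, b):
    return a if toy_order.index(a) < toy_order.index(b) else b

ranking = rank_by_pairwise_wins(["hairColor", "positionPlayed", "lastName"], toy_prefer)
```

The pair count makes the data problem concrete: labelling every pair for even one entity would take 4,950 comparisons.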
Baselines (1/2). 1. Human frequency: the winner is the property more frequently used for humans. 2. Occupation frequency (= Ahmeti et al. 2017): the winner is the property more frequently used within the profession. 3. Google count: the winner is the string returning more results in Google, e.g. "Ronaldo doctoral advisor" vs. "Ronaldo medical condition". 4. Association rules (= Abedjan and Naumann 2013): source code of the Wikidata implementation by Abedjan and Naumann is available.
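The two frequency baselines can be illustrated with a toy knowledge base. Everything below is invented for illustration (entities, occupations, property sets), and the tie-breaking rule is an assumption; the sketch only shows how restricting the counts to one occupation can flip the winner.

```python
from collections import Counter

# Toy knowledge base: entity -> (occupation, set of properties present). All invented.
toy_kb = {
    "E1": ("footballer", {"position played", "member of sports team"}),
    "E2": ("footballer", {"position played", "medical condition"}),
    "E3": ("researcher", {"doctoral advisor", "educated at"}),
    "E4": ("researcher", {"doctoral advisor"}),
    "E5": ("researcher", {"doctoral advisor", "medical condition"}),
}

def property_counts(kb, occupation=None):
    """Count property usage over all humans, or only over one occupation."""
    counts = Counter()
    for occ, props in kb.values():
        if occupation is None or occ == occupation:
            counts.update(props)
    return counts

def frequency_winner(prop_a, prop_b, counts):
    """Return the more frequently used property (ties go to prop_a, an assumption)."""
    return prop_a if counts[prop_a] >= counts[prop_b] else prop_b

human_pick = frequency_winner(
    "doctoral advisor", "medical condition", property_counts(toy_kb))
footballer_pick = frequency_winner(
    "doctoral advisor", "medical condition", property_counts(toy_kb, "footballer"))
```

In this toy data the human-frequency baseline picks "doctoral advisor", while the occupation-frequency baseline for footballers picks "medical condition".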
Baselines (2/2)
Method | Performance
Random | 50%
Human frequency | 60.6%
Occupation frequency | 58.6%
Google count | 58%
Property suggester | 61.3%
Annotator agreement | 87.5%
How can we get better? 1. Use property presence. 2. Explore semantic similarity between entities and properties.
Property presence: Idea. Hypothesis: property presence indicates interestingness. If true: we can use a model that predicts presence to predict interestingness ("transfer learning"). But can we predict property presence?
Property presence: Training. One regression classifier per property pair. Input: bag of words from the entity's Wikipedia article. Training: 5000 entities that have Property 1 and not Property 2, and 5000 entities that have Property 2 and not Property 1. Precision on predicting presence: 94.8%. Example: weights for position played vs. religious order.
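A minimal sketch of one such pair classifier, assuming "regression classifier" means logistic regression (the slide does not name the model). It is a plain-Python stand-in trained by stochastic gradient descent on four invented sentences instead of 10,000 Wikipedia articles; label 1 means "has position played, not religious order" and label 0 the reverse.

```python
import math

def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, bow):
    return bias + sum(weights.get(w, 0.0) * c for w, c in bow.items())

def train_pair_classifier(examples, epochs=200, lr=0.1):
    """Logistic regression via SGD; examples are (text, label) with label in {0, 1}."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            bow = bag_of_words(text)
            error = label - sigmoid(score(weights, bias, bow))
            bias += lr * error
            for w, c in bow.items():
                weights[w] = weights.get(w, 0.0) + lr * error * c
    return weights, bias

def predict_property_1(weights, bias, text):
    """Estimated probability that Property 1 (not Property 2) is the one present."""
    return sigmoid(score(weights, bias, bag_of_words(text)))

# Invented training texts standing in for Wikipedia articles.
examples = [
    ("footballer who plays as a forward for the club", 1),
    ("professional goalkeeper for the national team", 1),
    ("monk who joined the abbey as a friar", 0),
    ("nun of the benedictine convent", 0),
]
weights, bias = train_pair_classifier(examples)
p_footballer = predict_property_1(weights, bias, "portuguese footballer who plays as a forward")
p_friar = predict_property_1(weights, bias, "a friar of the abbey")
```

The learned `weights` dictionary plays the role of the per-word weights mentioned on the slide: words seen only with one property receive weights of the corresponding sign. One such model would be trained for every property pair.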
Property presence: Transfer. Precision in predicting interestingness: 72%, roughly 10 percentage points better than the baselines. Expensive training of O(n^2) classifiers pays off. But: still data-driven, and misses similarity, e.g. for a soccer player, "drafted by" vs. "military conflict".
How can we get better? 1. Predict property presence. 2. Explore semantic similarity between entities and properties.
Semantic similarity: Idea. Entities have Wikipedia articles, e.g. "Cristiano Ronaldo is a professional footballer who ...". Properties have descriptions, e.g. "Position played: position that someone played on for a team". Use semantic similarity between entity articles and property descriptions as a heuristic for interestingness.
Semantic similarity: Implementation. Topic modelling = represent texts as distributions over topics. Common techniques: LSI (Latent Semantic Indexing) and LDA (Latent Dirichlet Allocation). Semantic similarity = cosine of topic vectors.
Semantic similarity: LDA Examples. Ronaldo = 52% T6 + 12% T18 + 7% T26 + 6% T41 + ... Member of sports team = 96% T6 + ... Goals scored = 92% T6 + ...
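The cosine comparison of topic vectors can be sketched with the numbers from the slide. Two assumptions apply: the topic mass not shown on the slide is treated as zero, and the "unrelated" property is invented purely for contrast.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {topic: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Topic weights as listed on the slide; unlisted topics are assumed to be zero.
ronaldo = {"T6": 0.52, "T18": 0.12, "T26": 0.07, "T41": 0.06}
member_of_sports_team = {"T6": 0.96}
goals_scored = {"T6": 0.92}

# Hypothetical property dominated by a topic Ronaldo's article does not use.
unrelated_property = {"T3": 0.90}

sim_team = cosine(ronaldo, member_of_sports_team)
sim_unrelated = cosine(ronaldo, unrelated_property)
```

Because both sports properties load almost entirely on T6, the topic that dominates Ronaldo's article, their similarity to the entity is high, while a property with no shared topic scores zero.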
Semantic similarity: Performance. Precision on predicting interestingness: LDA 60%, LSI 65.3%. Pro: great at discovering relevance/applicability. Cons: struggles with short property descriptions; similarity is only a mediocre proxy for interestingness.
Ensemble classifiers. Condorcet's jury theorem (1785): the more people involved in a voting decision, the better, if their judgment is better than random and if they are sufficiently statistically independent. Machine learning: a combination of many weak classifiers can be beneficial. Majority voting among Google count, LSI, LDA, occupation frequency, and regression: 74% precision.
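Both ideas on this slide fit in a few lines. The sketch below assumes idealized conditions for the jury-theorem part (fully independent voters with identical accuracy), which the five actual methods need not satisfy; the example votes are invented.

```python
from collections import Counter
from math import comb

def majority_vote(votes):
    """Return the option chosen by most voters (an odd voter count avoids ties)."""
    return Counter(votes).most_common(1)[0][0]

def majority_correct_probability(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is right."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Hypothetical votes of the five methods on one (entity, property pair) question.
votes = [
    "medical condition",  # Google count
    "doctoral advisor",   # LSI
    "medical condition",  # LDA
    "medical condition",  # occupation frequency
    "doctoral advisor",   # regression
]
winner = majority_vote(votes)

# Idealized jury theorem: five independent voters, each right 60% of the time.
p_five = majority_correct_probability(5, 0.6)
```

Under these idealized assumptions, five voters at 60% accuracy reach a majority that is right with probability of about 0.68, which is the effect the ensemble tries to exploit.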
Conclusion. Semantic similarity is great for detecting applicability. Predicting presence is expensive, but worth it. Best method (ensemble): 74% precision, much better than the baselines (~60%) but still notably worse than humans (87.5%). Possible extensions: include an explicit notion of importance; acquire larger textual descriptions of properties; explore incorporation of formal semantics/constraints. Dataset: www.kaggle.com/srazniewski/wikidatapropertyranking
Appendix: Instability of pointwise scores. Setup 1, important properties: country of origin, participant of, award received, date of birth, member of sports team, position played on team / speciality, country of citizenship. Setup 2, tail properties: academic degree, image of grave, educated at, brother, hair color, military rank, religion.
Appendix: Correlation between methods