Automatic Discovery of Character Infoboxes from Books - CharBoxes System

The CharBoxes system aims to automatically discover character infoboxes from books, assisting in effective summarization, marketing, and understanding of book characters. By extracting important character details, constructing social graphs, summarizing character-centric text, and more, CharBoxes enhances the extraction of structured data from free text for character analysis.


Uploaded on Sep 27, 2024


Presentation Transcript


  1. CharBoxes: A System for Automatic Discovery of Character Infoboxes from Books
     Manish Gupta, Piyush Bansal, Vasudeva Varma
     8th July 2014

  2. Motivation (1)
     - We live in an entity-centric world.
     - Structured data about book characters is not easily available.
     - State-of-the-art (Harry Potter example)

  3. Motivation (2)
     - Automatic discovery of character infoboxes can help in
       - Effective summarization
       - Effective marketing of books
       - Aid understanding
     - Challenges
       - Automatic discovery of important characters given a book
       - Automatic social graph construction relating the discovered characters
       - Automatic summarization of text most related to each of the characters
       - Automatic infobox extraction from such summarized text for each character

  4. Shelfari does it (manually?)

  5. Goal of CharBoxes
     For every character, show me
     - Most related persons (along with the relationship preferably)
     - Most related places and organizations (along with verbs indicating relation preferably)
     - Personality traits of the person
     - Overall sentiment of the person
     - Frequently mentioned dress, actions, looks
     - Sociability of the person
     - Books in which appeared
     - Character-centric text summary

  6. Comparison with Related Work
     - Analysis of books or multi-documents
       - Most of the work is on summarization
       - A blog on integrating locations in books with points on maps
     - Extracting structured data from free text
       - Widely studied
       - But we focus on using this to extract infoboxes from books
     - Novelty
       - Sentiment-based summarizer
       - Character-specific summary based on subject-predicate-object facts
       - Heuristic patterns to extract attribute values for characters

  7. System Diagram
     - Book text
     - POS Tagging + NER + Cleaning
     - Person Names
     - Chapter Boundary Detection
     - Co-reference Resolution + Parse Tree Analysis
     - Characters
     - Linguistically Analyzed Book text
     - Person-person Interaction Network
     - Most related places and organizations
     - Character Centric Text Summarization (Fact triplet extraction + Sentiment Analysis)
     - Extracting Character-Centric Facts
     - Character Infoboxes

  8. Character Extraction
     - Input: Book text
     - Extract authors and year of publication, if available
     - Post-process POS-tagged data to obtain names
     - Post-process to merge tokens
     - Clean names
     - Sort by frequency
     - Merge names using simple rules
     - Handle diminutives
     - Map parts of names to canonical name
     - Maintain list of ambiguous names
     - Output: List of popular characters in the book
     - Example output (name: frequency): Harry: 1083, Ron: 347, Hagrid: 290, Hermione: 201, Snape: 151, Dumbledore: 131, Dudley: 120, Neville: 104, Quirrell: 93, Vernon: 83, McGonagall: 83, Malfoy: 83, Potter: 81, Dursley: 46, Weasley: 40, Wood: 34, Petunia: 34, Percy: 31, Voldemort: 30, Norbert: 22
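
The cleaning, merging, and frequency-sorting steps of character extraction could be sketched roughly as follows. This is only an illustration under stated assumptions: it takes already-extracted name mentions as input (POS tagging and NER are out of scope), and the `HONORIFICS` and `DIMINUTIVES` tables and the functions `clean` and `popular_characters` are hypothetical names invented here, not part of CharBoxes.

```python
from collections import Counter

# Hypothetical lookup tables, invented for this sketch.
HONORIFICS = {"mr", "mrs", "miss", "professor"}
DIMINUTIVES = {"Ronnie": "Ron"}


def clean(mention):
    """Strip punctuation and honorifics from a raw name mention."""
    tokens = [t.strip(".,;:!?\"'") for t in mention.split()]
    return " ".join(t for t in tokens if t and t.lower() not in HONORIFICS)


def popular_characters(mentions):
    """Return (names ranked by frequency, set of ambiguous short names)."""
    raw = Counter()
    for mention in mentions:
        name = clean(mention)
        if name:
            raw[DIMINUTIVES.get(name, name)] += 1

    full_names = [n for n in raw if " " in n]
    merged, ambiguous = Counter(), set()
    for name, freq in raw.items():
        # Full names that contain this single-token name as one of their parts.
        owners = [] if " " in name else [f for f in full_names if name in f.split()]
        if len(owners) == 1:
            merged[owners[0]] += freq  # unique match: fold into the full name
        else:
            merged[name] += freq       # no match, or several matches
            if len(owners) > 1:
                ambiguous.add(name)    # e.g. a shared family name
    return merged.most_common(), ambiguous


mentions = ["Vernon", "Vernon", "Mr. Vernon Dursley", "Ron Weasley",
            "Percy Weasley", "Weasley", "Ronnie", "Harry,"]
ranked, ambiguous = popular_characters(mentions)
```

In this toy input, "Weasley" matches two full names, so it is kept apart and flagged as ambiguous instead of being merged.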

  10. Linguistic Analysis
      - Chapter Boundary Detection
        - Clues like "Chapter X", "Lesson X", "Section X"
        - Hints from table of contents
        - If no clear chapters, use topic shift detection
      - Co-reference Resolution
        - On each chapter
        - Resolve pronouns or short names to full names
        - 'Uncle Vernon': [('Vernon', 83), ('Uncle Vernon', 16), ('Vernon Dursley', 1)]
      - Parse Tree Analysis
        - Understand dependencies
        - Understand subject-predicate-object
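
Chapter boundary detection from heading clues could look roughly like the sketch below. The regular expression is an assumption about what "Chapter X", "Lesson X", and "Section X" headings look like (digits, Roman numerals, or the words one to ten), and `split_chapters` is a hypothetical helper. The table-of-contents and topic-shift fallbacks are not implemented here.

```python
import re

# Assumed heading pattern: a clue word followed by a number in some form.
HEADING = re.compile(
    r"^[ \t]*(?:chapter|lesson|section)[ \t]+"
    r"(?:\d+|[ivxlcdm]+|one|two|three|four|five|six|seven|eight|nine|ten)\b[^\n]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_chapters(text):
    """Split text at heading lines; text before the first heading is dropped."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        # No clear chapters: a fuller system would try topic shift detection.
        return [text]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


book = (
    "Title page\n"
    "CHAPTER ONE\nThe Boy Who Lived\nMr. Dursley was proud.\n"
    "Chapter 2\nThe Vanishing Glass\nDudley counted presents.\n"
)
chapters = split_chapters(book)
```

Splitting first keeps later steps, such as per-chapter co-reference resolution, working on smaller units.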

  12. Person-Person Graph Construction (1)
      - Build an interaction graph between characters using
        - Non-ambiguous mentions and dialogue extraction
        - Keywords like said, told, say, tell, says, screamed, etc.
      - Perform disambiguation of ambiguous mentions
        - E.g., Weasley in Harry Potter and the Philosopher's Stone
        - Using context words
        - Mention of full name in vicinity
        - Frequency of co-occurrence with other entities in the vicinity based on the graph
      - Use disambiguated mentions to refine interaction graph
      - Annotate the graph with relationships (if extracted using word clues)
        - Mother, father, sibling
        - Friend, enemy
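
A first-pass interaction graph built from speech keywords and unambiguous mentions might be sketched as below. It assumes the text is already split into sentences and that the character list contains only unambiguous names. The function `interaction_graph` is a hypothetical name, and disambiguation and relationship annotation are left out.

```python
import re
from collections import Counter
from itertools import combinations

# Speech keywords listed on the slide.
SPEECH_VERBS = {"said", "told", "say", "tell", "says", "screamed"}


def interaction_graph(sentences, characters):
    """Count, per character pair, the speech sentences mentioning both."""
    edges = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if not words & SPEECH_VERBS:
            continue  # no dialogue clue in this sentence
        present = sorted(
            c for c in characters
            if re.search(r"\b" + re.escape(c) + r"\b", sentence)
        )
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


sentences = [
    "Professor McGonagall shot a sharp look at Dumbledore and said the owls are nothing.",
    "Harry walked to the front door.",
    "Hagrid told Harry about Dumbledore.",
]
characters = ["Dumbledore", "Professor McGonagall", "Hagrid", "Harry"]
edges = interaction_graph(sentences, characters)
```

Edge weights here are plain co-occurrence counts; a refinement pass could add the disambiguated mentions afterwards.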

  13. Person-Person Graph Construction (2)
      - ['Dumbledore', 'Professor McGonagall']
        Professor McGonagall shot a sharp look at Dumbledore and said , `` The owls are nothing next to the rumors that are flying around .
      - ['Dumbledore', 'Hagrid', 'Professor McGonagall']
        `` But I c-c-can ' t stand it -- Lily an ' James dead -- an ' poor little Harry off ter live with Muggles - '' `` Yes , yes , it 's all very sad , but get a grip on yourself , Hagrid , or we 'll be found , '' Professor McGonagall whispered , patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door.
      - Identifying the set of people participating in a text conversation is a hard problem.

  15. Related Places and Organizations Extraction
      - Given a character
        - Most relevant places and organizations associated with the character are discovered
        - Frequency and proximity of mentions
      - Use linking verb to establish relationship between person and place/organization
        - For example, "studies" could be the most frequent verb linking Harry Potter with Hogwarts.
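
Picking the most frequent linking verb could be sketched as below, assuming subject-verb-object triples are already available from parse tree analysis. The triples are made-up toy data, `related_places` is a hypothetical helper, and only mention frequency is modeled (proximity is omitted).

```python
from collections import Counter, defaultdict


def related_places(triples, character, places):
    """Rank places linked to a character; report the top linking verb."""
    verbs_by_place = defaultdict(Counter)
    for subj, verb, obj in triples:
        if subj == character and obj in places:
            verbs_by_place[obj][verb] += 1
    ranked = sorted(
        verbs_by_place.items(),
        key=lambda item: sum(item[1].values()),
        reverse=True,
    )
    return [
        (place, sum(verbs.values()), verbs.most_common(1)[0][0])
        for place, verbs in ranked
    ]


# Toy triples, invented for illustration only.
triples = [
    ("Harry", "studies", "Hogwarts"),
    ("Harry", "studies", "Hogwarts"),
    ("Harry", "visits", "Hogwarts"),
    ("Harry", "visits", "Gringotts"),
    ("Ron", "studies", "Hogwarts"),
]
result = related_places(triples, "Harry", {"Hogwarts", "Gringotts"})
```

Each result tuple holds the place, the number of linking mentions, and the most frequent verb for that pair.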

  17. Character-centric Summary Generation
      - Consider all sentences containing the character
      - Remove sentences which also contain other characters
      - Remove sentences with quotations
      - Rank sentences with more entities higher
      - Rank longer sentences higher
      - Rank sentences which introduce a new entity higher
      - Rank sentences with dress description or looks of the character higher
      - Rank sentences with extreme sentiments higher
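
The filter-then-rank idea could be sketched as below. This is a partial illustration: it implements only the two removal rules and the "more entities" and "longer sentence" criteria, compared in that order, which is an assumed priority. The new-entity, dress, and sentiment criteria are omitted, and `summarize` and `QUOTE_MARKS` are hypothetical names.

```python
# Quotation marks to look for, including tokenizer-style `` and ''.
QUOTE_MARKS = ('"', "``", "''", "\u201c", "\u201d")


def summarize(sentences, character, other_characters, entities, top_k=3):
    """Return up to top_k character-centric sentences, best first."""
    candidates = []
    for s in sentences:
        if character not in s:
            continue  # keep only sentences containing the character
        if any(other in s for other in other_characters):
            continue  # drop sentences that mention other characters
        if any(mark in s for mark in QUOTE_MARKS):
            continue  # drop sentences with quotations
        candidates.append(s)

    def score(s):
        n_entities = sum(1 for e in entities if e in s)
        return (n_entities, len(s.split()))  # entities first, then length

    return sorted(candidates, key=score, reverse=True)[:top_k]


sentences = [
    "Harry lived in a cupboard at Privet Drive.",
    "Harry looked at Ron.",
    '"Harry!" someone shouted.',
    "Harry wore round glasses.",
    "Hagrid arrived.",
]
summary = summarize(
    sentences, "Harry", ["Ron", "Hagrid"], ["Privet Drive", "Hogwarts"], top_k=2
)
```

A weighted sum over all criteria would be an alternative to the tuple ordering used here.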

  19. Character-Centric Facts Extraction
      Extract the following for every person
      - Year of birth/death
        - Using time clues
      - Looks, qualities of the person
        - Either direct text mentions or inferred from the spoken sentences
      - Overall sentiment of the person (hero/villain)
        - Based on sentences containing mentions
      - Frequently mentioned facts
        - Like relation between Harry Potter and quidditch (linked by the verb "plays")
      - Sociability of the person
        - Based on number of other characters it interacts with
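
Of the facts listed, sociability is the simplest to illustrate: it can be read as the number of distinct characters a person interacts with. The sketch below assumes an interaction graph is available as a mapping from character pairs to interaction counts; that input format and the name `sociability` are assumptions made for illustration.

```python
from collections import defaultdict


def sociability(edges):
    """Map each character to the number of distinct characters it interacts with."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {name: len(others) for name, others in neighbours.items()}


# Toy interaction graph: (character, character) -> interaction count.
edges = {
    ("Dumbledore", "Hagrid"): 1,
    ("Dumbledore", "Harry"): 2,
    ("Hagrid", "Harry"): 1,
    ("Dumbledore", "Professor McGonagall"): 1,
}
scores = sociability(edges)
```

This counts neighbours only; weighting by interaction counts would be a different measure.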

  20. Conclusion CharBoxes is a system which is expected to take book text as input and output structured Infoboxes for various characters in the book. The system would utilize deep natural language processing techniques complemented by domain specific heuristics. The system can be very useful in summarizing books in a structured way in terms of insights about characters discussed in the book.
