German Discourse Blog Corpus Compilation & Annotation

This presentation describes the compilation and annotation of a discourse-structured blog corpus for German. It covers data collection, data annotation, specific problems, and next steps. The project aims to foster interoperability of resources and to develop models for annotating the structural and linguistic peculiarities of blogs. The annotation scheme should favor (partially) automatic annotation and allow easy, reversible anonymization.


Uploaded on Dec 05, 2024 | 1 Views


Presentation Transcript


  1. Compilation and Annotation of the Discourse-structured Blog Corpus for German. Natali Karlova-Bourbonus (Natali.Karlova-Bourbonus@zmi.uni-giessen.de), Holger Grumt Suárez (Holger.H.Grumt-Suarez@germanistik.uni-giessen.de), Henning Lobin (Henning.Lobin@uni-giessen.de)

  2. outline: I. Data Collection; II. Data Annotation; III. Specific Problems; IV. Next Steps

  3. #60854 #54045 #54105 #65405 #65405 #465765

  4. I. DATA COLLECTION

  5. 1,177 blogposts from 80 bloggers and 14,601 comments from 1,429 commentators; 2,716,569 tokens

  6. II. DATA ANNOTATION

  7. why TEI?

  8. foster interoperability of resources in/between research fields

  9. interoperability of CMC resources with other types of resources

  10. our requirements

  11. models for the annotation of the structural and linguistic peculiarities of blogs

  12. the basic structure should favor (partially) automatic annotation

  13. easy and reversible anonymization
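One way to read "easy and reversible" is a pseudonym table that is stored separately from the corpus: names are replaced by opaque IDs in the annotated files, and only the holder of the table can map them back. The following is a minimal sketch under that assumption. The class name, the `K` prefix, and the sample names are hypothetical and are not taken from the project.

```python
from itertools import count


class Pseudonymizer:
    """Map real names to opaque IDs and keep the key table for reversal."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._counter = count(1)
        self.name_to_id = {}
        self.id_to_name = {}

    def anonymize(self, name: str) -> str:
        # Reuse an existing ID so one person always gets the same pseudonym.
        if name not in self.name_to_id:
            pseudonym = f"{self.prefix}{next(self._counter)}"
            self.name_to_id[name] = pseudonym
            self.id_to_name[pseudonym] = name
        return self.name_to_id[name]

    def deanonymize(self, pseudonym: str) -> str:
        # Reversal is only possible with access to the key table.
        return self.id_to_name[pseudonym]


# Hypothetical commentator names, for illustration only.
commenters = Pseudonymizer(prefix="K")
first = commenters.anonymize("Erika Mustermann")
second = commenters.anonymize("Max Mustermann")
again = commenters.anonymize("Erika Mustermann")
```

Keeping the table outside the published corpus is what makes the step reversible for the maintainers without exposing names to corpus users.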

  14. elements for the annotation of phenomena on the microlevel (e.g. emoticons, addressing terms)

  15. what to annotate?

  16. three types of information

  17. three types of information, type A: explicit information (e.g. blog title, blogger name)

  18. three types of information, type B: implicit information (e.g. usual activity time or topics of interest of a commentator)

  19. three types of information, type C: interpretative information (e.g. topic of the comment)

  20. what we want: differentiate between blogpost and comment; anonymisation; level of indentation; replies; timestamp; x "function" of comment; title and paragraph; x category and tags/keywords; o e.g. addressing terms, emoticons, blockquotes

  21. Procedure (explicit information): EXTRACTION OF VALUES FROM HTML DOCUMENT; INTEGRATION OF VALUES IN TEI TEMPLATE; TEI TEMPLATE GENERATION

  22. <div type="blog"> <head> <ref target="">Natur des Glaubens</ref> [ ] </head> <post type="blogpost" target="" xml:id="B" who="#A" indentLevel="" when=""> <p></p> <p></p> </post> <div type="blog"> <head> <ref target=""></ref> [ ] </head> <post type="blogpost" target="" xml:id="B" who="#A" indentLevel="" when=""> <p></p> <p></p> </post> </div> </div>
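The integration of extracted values into a template like the one above could look roughly as follows. This is a sketch, not the authors' tooling: the function name, the dictionary keys, and all sample values except the blog title "Natur des Glaubens" are assumptions, and the values are taken to be already extracted from the HTML page.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so ElementTree serializes this as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def build_blog_div(values: dict) -> ET.Element:
    """Build <div type="blog"> with a <head> and one <post> from extracted values."""
    div = ET.Element("div", {"type": "blog"})
    head = ET.SubElement(div, "head")
    ref = ET.SubElement(head, "ref", {"target": values["blog_url"]})
    ref.text = values["blog_title"]

    post = ET.SubElement(div, "post", {
        "type": "blogpost",
        "target": values["post_url"],
        XML_ID: values["post_id"],
        "who": "#" + values["author_id"],
        "indentLevel": values["indent_level"],
        "when": values["timestamp"],
    })
    for text in values["paragraphs"]:
        ET.SubElement(post, "p").text = text
    return div


# Hypothetical values, as if already extracted from a blog's HTML page.
example = {
    "blog_url": "https://example.org/blog",
    "blog_title": "Natur des Glaubens",
    "post_url": "https://example.org/blog/post-1",
    "post_id": "B1",
    "author_id": "A1",
    "indent_level": "0",  # placeholder; the slide leaves this attribute empty
    "timestamp": "2011-06-25T07:00:27+00:00",
    "paragraphs": ["First paragraph.", "Second paragraph."],
}
div = build_blog_div(example)
xml_text = ET.tostring(div, encoding="unicode")
```

Building the tree programmatically rather than by string substitution keeps the output well-formed even when extracted text contains characters such as `<` or `&`.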

  23. III. SPECIFIC PROBLEMS

  24. Problem 1, explanation: comment level assignment is limited to five levels

  25. Problem 1, example: LEVEL 1 LEVEL 2 LEVEL 3

  26. Problem 1, example: LEVEL 4 LEVEL 5 CORRECT: LEVEL 5 LEVEL 6

  27. Problem 1, solution: manual assignment of the levels after the first comment of the fifth level (2557 out of 12044) or preservation of the original CMC structure?
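If the true reply target of each comment is known (for example after manual annotation), the real nesting depth no longer has to be read from the capped page layout; it can be derived from the reply chain. The sketch below assumes a hypothetical mapping from comment ID to parent ID, counts top-level comments as level 1, and assumes the reply links contain no cycles. None of these choices is confirmed by the slides.

```python
from typing import Dict, Optional


def indent_levels(reply_to: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Compute the nesting depth per comment; top-level comments get level 1."""
    levels: Dict[str, int] = {}

    def level_of(comment_id: str) -> int:
        # Memoize so each comment's depth is computed only once.
        if comment_id not in levels:
            parent = reply_to[comment_id]
            levels[comment_id] = 1 if parent is None else level_of(parent) + 1
        return levels[comment_id]

    for comment_id in reply_to:
        level_of(comment_id)
    return levels


# Hypothetical thread: each comment replies to the previous one.
thread = {"c1": None, "c2": "c1", "c3": "c2", "c4": "c3", "c5": "c4", "c6": "c5"}
levels = indent_levels(thread)
```

In this toy thread the last comment comes out at level 6, which a display limited to five levels could not show.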

  28. Problem 2, explanation: multiple references included in a comment

  29. Problem 2, example: cues of multiple references: @name; [name] schrieb (English: wrote); most references, however, are not of a standardized form. <post type="comment" target="" xml:id="B1142K13286" who="#K" replyTo="B1142K13277 B1142K13280 B1142K13281 B1142K13285" indentLevel="3" when="2011-06-25T07:00:27+00:00"> <p>@ Manuel Kr ger, Veritatibus, Ares and Co.</p> <p>Mein Blog nutzt die einmalige Chance des Brainstorming aus Forendiskussionen</p> <p>Ich entstelle den Sinn keiner Aussagen, [ ]</p> </post>
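The two standardized cues named on this slide (`@name` and `[name] schrieb`) can be picked up with simple patterns to pre-select candidates for the annotator. The sketch below is an assumption about how such a pre-filter might look, not the project's method. Since most references are not standardized, it can only propose candidates, and it will also produce noise (for example on e-mail addresses). The function name and sample texts are hypothetical.

```python
import re

# "@ Name, Name und Name": everything after "@" up to the end of the line.
AT_LINE = re.compile(r"@\s*(.+)")
# "Name schrieb": the text before "schrieb" at the start of a line.
WROTE = re.compile(r"^\s*(.+?)\s+schrieb\b", re.MULTILINE)


def reference_candidates(text: str) -> list:
    """Return names that may be reply targets, in order of appearance per cue."""
    candidates = []
    for match in AT_LINE.finditer(text):
        # Split enumerations on commas and on "und"/"and".
        for part in re.split(r",|\bund\b|\band\b", match.group(1)):
            name = part.strip(" .:")
            if name:
                candidates.append(name)
    for match in WROTE.finditer(text):
        candidates.append(match.group(1).strip())
    return candidates


at_result = reference_candidates("@ Anna Beispiel, Bernd und Clara\nIch stimme zu.")
wrote_result = reference_candidates("Dora schrieb:\nGuter Punkt.")
```

The candidate names would still have to be matched against earlier comments in the thread, by hand or otherwise, to fill the `replyTo` attribute.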

  30. Problem 2, solution: manual annotation of multiple references

  31. Problem 3, explanation: multiple versions of the same comment with minor edits (e.g. corrections, additional information, further thoughts)

  32. Problem 3, example: additional thought ("bonus comment")

  33. Problem 3, solution: typology construction; semi-automated annotation; typology of comments as attribute of the <post> tag? (e.g. editing="correction")
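For the semi-automated part, one conceivable first pass is to flag pairs of highly similar comments by the same author and leave the typology decision (correction, addition, further thought) to a human annotator. The sketch below uses Python's standard `difflib` for the similarity score. The 0.8 threshold, the tuple layout, and the sample comments are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def likely_revisions(comments, threshold=0.8):
    """Return (earlier_id, later_id, ratio) for similar comments by one author.

    `comments` is a list of (comment_id, author_id, text) in posting order.
    """
    pairs = []
    for i, (id_a, who_a, text_a) in enumerate(comments):
        for id_b, who_b, text_b in comments[i + 1:]:
            if who_a != who_b:
                continue  # only compare comments by the same author
            ratio = SequenceMatcher(None, text_a, text_b).ratio()
            if ratio >= threshold:
                pairs.append((id_a, id_b, ratio))
    return pairs


# Hypothetical comments, for illustration only.
sample = [
    ("c1", "K1", "This is an important point in the debate."),
    ("c2", "K1", "This is an important point in this debate."),
    ("c3", "K2", "This is an important point in the debate."),
    ("c4", "K1", "A completely different topic."),
]
pairs = likely_revisions(sample)
```

A pure similarity check finds corrected re-posts but not "bonus comments" that add new text, so those would still need another cue, such as short time gaps between posts by the same author.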

  34. tags, categories

  35. blogpost, comment

  36. IV. NEXT STEPS

  37. Legal aspects of data usage; publication and access: in what form?; information of type B (implicit) and type C (interpretative)
