German Discourse Blog Corpus Compilation & Annotation
This presentation covers the compilation and annotation of a discourse-structured blog corpus for German: data collection, data annotation, specific problems, and next steps. The project aims to foster interoperability between resources and to develop models for annotating the structural and linguistic features of blogs. The annotation scheme should support (partially) automatic annotation and easily reversible anonymization.
Presentation Transcript
Compilation and Annotation of the Discourse-structured Blog Corpus for German. Natali Karlova-Bourbonus (Natali.Karlova-Bourbonus@zmi.uni-giessen.de), Holger Grumt Suárez (Holger.H.Grumt-Suarez@germanistik.uni-giessen.de), Henning Lobin (Henning.Lobin@uni-giessen.de)
Outline: I. Data Collection; II. Data Annotation; III. Specific Problems; IV. Next Steps
I. DATA COLLECTION
1,177 blog posts from 80 bloggers and 14,601 comments from 1,429 commenters; 2,716,569 tokens
II. DATA ANNOTATION
Why TEI?
foster interoperability of resources in/between research fields
interoperability of CMC resources with other types of resources
Our requirements
models for the annotation of the structural and linguistic peculiarities of blogs
the basic structure should favor (partially) automatic annotation
elements for the annotation of phenomena on the microlevel (e.g. emoticons, addressing terms)
What to annotate?
Three types of information
Type A: explicit information (e.g. blog title, blogger name)
Type B: implicit information (e.g. usual activity time or topics of interest of a commenter)
Type C: interpretative information (e.g. topic of the comment)
What we want: differentiate between blog post and comment; anonymisation; level of indentation; replies; timestamp; "function" of comment; title and paragraph; category and tags/keywords; micro-level phenomena (e.g. addressing terms, emoticons, blockquotes)
Procedure (explicit information): extraction of values from HTML document; integration of values in TEI template; TEI template generation
Empty TEI template: <div type="blog"> <head> <ref target=""></ref> […] </head> <post type="blogpost" target="" xml:id="B" who="#A" indentLevel="" when=""> <p></p> <p></p> </post> </div>
Template with the blog title integrated: <div type="blog"> <head> <ref target="">Natur des Glaubens</ref> […] </head> <post type="blogpost" target="" xml:id="B" who="#A" indentLevel="" when=""> <p></p> <p></p> </post> </div>
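The procedure for explicit information can be sketched in Python with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the HTML layout (an `h1` with class `blog-title`, paragraphs inside an `article` element), the class `BlogPostExtractor`, and the function `build_tei` are all assumptions. Only the TEI element and attribute names come from the template.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Clark-notation key; ElementTree serializes it as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


class BlogPostExtractor(HTMLParser):
    """Collects title and paragraphs from a HYPOTHETICAL HTML layout."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_article = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("class") == "blog-title":
            self._in_title = True
        elif tag == "article":
            self._in_article = True
        elif tag == "p" and self._in_article:
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False
        elif tag == "article":
            self._in_article = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def build_tei(title, paragraphs, post_id, author_ref, when):
    """Generates the template and integrates the extracted values."""
    div = ET.Element("div", {"type": "blog"})
    head = ET.SubElement(div, "head")
    ET.SubElement(head, "ref", {"target": ""}).text = title
    post = ET.SubElement(div, "post", {
        "type": "blogpost", "target": "", XML_ID: post_id,
        "who": author_ref, "indentLevel": "", "when": when,
    })
    for text in paragraphs:
        ET.SubElement(post, "p").text = text
    return div
```

Calling `ET.tostring(build_tei(...), encoding="unicode")` yields the filled template as a string. Attributes the sketch cannot derive from the page (`target`, `indentLevel`) are left empty, as in the template.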
III. SPECIFIC PROBLEMS
Problem 1, explanation: comment level assignment is limited to five levels
Problem 1, example: LEVEL 1, LEVEL 2, LEVEL 3
Problem 1, example (continued): LEVEL 4, LEVEL 5; CORRECT: LEVEL 5, LEVEL 6
Problem 1, solution: manual assignment of the levels after the first comment of the fifth level (2557 out of 12044), or preservation of the original CMC structure?
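If reply targets are known for the affected comments (for instance after manual annotation), the uncapped level follows from the reply chain. The sketch below illustrates that idea only; it is not the method proposed in the slides. It assumes that top-level comments have level 1, that each comment replies to at most one parent, and that the reply links contain no cycles. The comment ids and both function names are hypothetical.

```python
MAX_DISPLAYED_LEVEL = 5  # cap reported in the slides


def true_levels(reply_to):
    """Map comment id -> uncapped level.

    reply_to maps each comment id to its parent id, or to None for
    top-level comments (ASSUMED to be level 1).
    """
    levels = {}

    def level(cid):
        if cid not in levels:
            parent = reply_to[cid]
            levels[cid] = 1 if parent is None else level(parent) + 1
        return levels[cid]

    for cid in reply_to:
        level(cid)
    return levels


def capped_mismatches(reply_to, displayed):
    """Ids displayed at the cap whose reply chain is actually deeper."""
    levels = true_levels(reply_to)
    return sorted(
        cid for cid, shown in displayed.items()
        if shown == MAX_DISPLAYED_LEVEL and levels[cid] > shown
    )
```

Comments returned by `capped_mismatches` are the ones where the displayed indentation and the reply structure disagree, so they are the candidates for either correction or preservation of the original CMC structure.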
Problem 2, explanation: multiple references included in a comment
Problem 2, example: cues of multiple references: @name, "[name] schrieb" (English: "[name] wrote"); most references, however, are not of a standardized form
<post type="comment" target="" xml:id="B1142K13286" who="#K" replyTo="B1142K13277 B1142K13280 B1142K13281 B1142K13285" indentLevel="3" when="2011-06-25T07:00:27+00:00">
<p>@ Manuel Krüger, Veritatibus, Ares and Co.</p>
<p>Mein Blog nutzt die einmalige Chance des Brainstorming aus Forendiskussionen</p>
<p>Ich entstelle den Sinn keiner Aussagen, […]</p>
</post>
Problem 2, solution: manual annotation of multiple references
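The two standardized cues named above ("@name" and "[name] schrieb") could be used to pre-select comments for manual annotation. The following is a heuristic sketch under assumptions: the regular expressions, the three-token limit on names, and the function name `reference_cues` are illustrative choices. It does not cover the non-standardized references that make up most cases.

```python
import re

# "@ Name, Name and Name" at the start of a paragraph
AT_CUE = re.compile(r"@\s*([^\n]+)")
# "Name schrieb" at the start of a paragraph; names of up to three
# tokens are an ASSUMPTION to limit false positives
WROTE_CUE = re.compile(r"((?:\S+\s+){0,2}\S+)\s+schrieb\b")
SEPARATORS = re.compile(r"\s*,\s*|\s+(?:und|and)\s+")


def reference_cues(paragraph):
    """Return candidate addressee names found via the two cues."""
    text = paragraph.strip()
    at = AT_CUE.match(text)
    if at:
        parts = SEPARATORS.split(at.group(1))
        return [p.strip(" .:") for p in parts if p.strip(" .:")]
    wrote = WROTE_CUE.match(text)
    if wrote:
        return [wrote.group(1)]
    return []


def needs_manual_annotation(paragraphs):
    """True if any paragraph names more than one addressee."""
    return any(len(reference_cues(p)) > 1 for p in paragraphs)
```

The output is a list of candidates, not a resolved `replyTo` value: mapping a name to the comment it refers to still requires a human decision, in line with the manual solution above.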
Problem 3, explanation: multiple versions of the same comment with minor edits (e.g. corrections, additional information, further thoughts)
Problem 3, example: additional thought ("bonus comment")
Problem 3, solution: typology construction; semi-automated annotation; typology of comments as an attribute of the <post> tag? (e.g. editing="correction")
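One way the semi-automated step could work is to propose pairs of near-identical comments by the same author and leave the typology value (e.g. `editing="correction"`) to the annotator. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The similarity threshold of 0.8, the input tuple layout, and the function name `edit_candidates` are assumptions, not values from the project.

```python
from difflib import SequenceMatcher


def edit_candidates(comments, threshold=0.8):
    """Propose (earlier_id, later_id, similarity) pairs for review.

    comments: list of (comment_id, author_ref, text) tuples in
    chronological order. Only comments by the same author are compared.
    """
    pairs = []
    for i, (id_a, who_a, text_a) in enumerate(comments):
        for id_b, who_b, text_b in comments[i + 1:]:
            if who_a != who_b:
                continue
            ratio = SequenceMatcher(None, text_a, text_b).ratio()
            if ratio >= threshold:
                pairs.append((id_a, id_b, round(ratio, 2)))
    return pairs
```

String similarity will mainly surface corrections. Additional thoughts ("bonus comments") share little text with the original comment and would probably need other cues, such as the same author posting again shortly afterwards.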
[Figure: tags and categories]
[Figure: blog post and comment]
IV. NEXT STEPS
Legal aspects of data usage; publication and access: in what form?; information of type B (implicit) and type C (interpretative)