German Discourse Blog Corpus Compilation & Annotation

Slide Note

Compilation and annotation of a discourse-structured blog corpus for German, involving data collection, annotation, addressing specific problems, and planning next steps. The project focuses on fostering interoperability, meeting requirements, and developing models for annotating blogs' structural and linguistic aspects. Emphasis is on facilitating automatic annotation and ensuring easy reversible anonymization.


Uploaded on Dec 05, 2024 | 1 Views


Presentation Transcript

Compilation and Annotation of the Discourse-structured Blog Corpus for German
Natali Karlova-Bourbonus (Natali.Karlova-Bourbonus@zmi.uni-giessen.de)
Holger Grumt Suárez (Holger.H.Grumt-Suarez@germanistik.uni-giessen.de)
Henning Lobin (Henning.Lobin@uni-giessen.de)

Outline: I. Data Collection; II. Data Annotation; III. Specific Problems; IV. Next Steps


I. DATA COLLECTION

1,177 blog posts from 80 bloggers and 14,601 comments from 1,429 commentators; 2,716,569 tokens in total

II. DATA ANNOTATION

Why TEI?

foster interoperability of resources in/between research fields

interoperability of CMC resources with other types of resources

Our requirements

models for the annotation of the structural and linguistic peculiarities of blogs

the basic structure should favor (partially) automatic annotation

easy and reversible anonymization

elements for the annotation of phenomena on the microlevel (e.g. emoticons, addressing terms)
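The requirement of easy and reversible anonymization can be illustrated with a small pseudonymization table. This is only a sketch under assumptions: the class name `Pseudonymizer`, the prefix `K` for commentators, and the running-number scheme are hypothetical and are not taken from the project itself. The idea is that each name maps to one stable code, and the stored key allows the mapping to be undone.

```python
import itertools


class Pseudonymizer:
    """Maps real names to stable codes and keeps the key needed for reversal.

    Hypothetical helper for illustration; not the project's actual tooling.
    """

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._forward = {}                  # real name -> code
        self._counter = itertools.count(1)  # running number for new codes

    def code_for(self, name: str) -> str:
        # The same name always receives the same code.
        if name not in self._forward:
            self._forward[name] = f"{self.prefix}{next(self._counter)}"
        return self._forward[name]

    def reverse_key(self) -> dict:
        # The inverse table makes the anonymization reversible.
        return {code: name for name, code in self._forward.items()}


commentators = Pseudonymizer("K")
codes = [commentators.code_for(n) for n in ["Anna", "Ben", "Anna"]]
key = commentators.reverse_key()
```

Keeping the reverse key in a separate, access-restricted file would let the published corpus stay anonymous while the original names remain recoverable for the project team.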

What to annotate?

Three types of information

Type A: explicit information (e.g. blog title, blogger name)

Type B: implicit information (e.g. usual activity time or topics of interest of a commentator)

Type C: interpretative information (e.g. topic of the comment)

What we want: differentiation between blog post and comment; anonymisation; level of indentation; replies; timestamp; "function" of comment; title and paragraph; category and tags/keywords; e.g. addressing terms, emoticons, blockquotes

Procedure (explicit information): extraction of values from the HTML document; integration of values in the TEI template; TEI template generation

<div type="blog">
  <head>
    <ref target="">Natur des Glaubens</ref> [...]
  </head>
  <post type="blogpost" target="" xml:id="B" who="#A" indentLevel="" when="">
    <p></p>
    <p></p>
  </post>
  <div type="blog">
    <head>
      <ref target=""></ref> [...]
    </head>
    <post type="blogpost" target="" xml:id="B" who="#A" indentLevel="" when="">
      <p></p>
      <p></p>
    </post>
  </div>
</div>
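The procedure for explicit information (extract values from the HTML document, then integrate them into the TEI template) might look roughly like the following sketch. It uses only the Python standard library. The sample HTML, the class name `entry-title`, the `<time datetime>` element, and the identifiers `B1` and `#A1` are assumptions made for illustration; real blog markup will differ from platform to platform.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialized as xml:id

# Hypothetical blog markup; actual platforms use different structures.
SAMPLE_HTML = """
<article>
  <h1 class="entry-title">Natur des Glaubens</h1>
  <time datetime="2011-06-25T07:00:27+00:00"></time>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</article>
"""


class BlogValueExtractor(HTMLParser):
    """Collects title, timestamp and paragraphs from the sample markup."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.when = ""
        self.paragraphs = []
        self._current = None  # which text field is currently being filled

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("class") == "entry-title":
            self._current = "title"
        elif tag == "time":
            self.when = attrs.get("datetime", "")
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def build_tei(values, post_id, author_ref):
    """Fills the extracted values into a <div type="blog"> template."""
    div = ET.Element("div", {"type": "blog"})
    head = ET.SubElement(div, "head")
    ET.SubElement(head, "ref", {"target": ""}).text = values.title
    post = ET.SubElement(div, "post", {
        "type": "blogpost", "target": "", XML_ID: post_id,
        "who": author_ref, "indentLevel": "", "when": values.when,
    })
    for text in values.paragraphs:
        ET.SubElement(post, "p").text = text
    return div


extractor = BlogValueExtractor()
extractor.feed(SAMPLE_HTML)
tei = build_tei(extractor, post_id="B1", author_ref="#A1")
print(ET.tostring(tei, encoding="unicode"))
```

Separating extraction from template filling keeps the platform-specific part (the HTML parser) small, so that a new blog platform would only need a new extractor.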

III. SPECIFIC PROBLEMS

Problem 1 (explanation): comment level assignment is limited to five levels

Problem 1 (example): LEVEL 1, LEVEL 2, LEVEL 3

Problem 1 (example, continued): LEVEL 4, LEVEL 5; CORRECT: LEVEL 5, LEVEL 6

Problem 1 (solution): manual assignment of the levels after the first comment of the fifth level (2557 out of 12044), or preservation of the original CMC structure?
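Where a single reply target is known for each comment, the indentation level beyond the platform's five-level display limit could in principle be recomputed from the reply chain. The sketch below assumes exactly that: a hypothetical mapping `reply_to` from comment ID to parent ID (`None` for comments that answer the blog post directly), top-level comments counted as level 1, and no cycles in the data. It does not replace manual assignment where reply targets are unknown.

```python
def true_indent_levels(reply_to):
    """Derives each comment's depth from its chain of reply targets.

    reply_to: dict mapping comment ID -> parent comment ID, or None for a
    comment that replies directly to the blog post (counted as level 1).
    """
    levels = {}

    def level(comment_id):
        # Memoized walk up the reply chain; assumes the chain has no cycles.
        if comment_id not in levels:
            parent = reply_to[comment_id]
            levels[comment_id] = 1 if parent is None else level(parent) + 1
        return levels[comment_id]

    for comment_id in reply_to:
        level(comment_id)
    return levels


# Hypothetical thread: each comment replies to the previous one.
thread = {"c1": None, "c2": "c1", "c3": "c2", "c4": "c3", "c5": "c4", "c6": "c5"}
levels = true_indent_levels(thread)
```

In this made-up thread the last comment ends up at level 6, although a platform limited to five levels would display it at level 5.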

Problem 2 (explanation): multiple references included in a comment

Problem 2 (example): cues of multiple references are @name and "[name] schrieb" (English: "[name] wrote"). Most references, however, are not of a standardized form.

<post type="comment" target="" xml:id="B1142K13286" who="#K" replyTo="B1142K13277 B1142K13280 B1142K13281 B1142K13285" indentLevel="3" when="2011-06-25T07:00:27+00:00">
  <p>@ Manuel Kr ger, Veritatibus, Ares and Co.</p>
  <p>Mein Blog nutzt die einmalige Chance des Brainstorming aus Forendiskussionen</p>
  <p>Ich entstelle den Sinn keiner Aussagen, [...]</p>
</post>

Problem 2 (solution): manual annotation of multiple references
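The two standardized cues named for this problem, @name and "[name] schrieb", could be pre-flagged automatically to support the manual annotation. The following is a minimal sketch with hypothetical regular expressions; it captures only a single name token per cue, so multi-word names and free-form references still need manual work.

```python
import re

# Hypothetical patterns: one name token after "@" or before "schrieb".
AT_CUE = re.compile(r"@\s?(\w[\w-]*)")
WROTE_CUE = re.compile(r"\b(\w[\w-]*)\s+schrieb\b")


def find_reference_cues(text):
    """Returns (cue type, name token) pairs found in a comment text."""
    cues = [("at", m.group(1)) for m in AT_CUE.finditer(text)]
    cues += [("schrieb", m.group(1)) for m in WROTE_CUE.finditer(text)]
    return cues


sample = "@Veritatibus: Danke. Ares schrieb: Das stimmt nicht."
print(find_reference_cues(sample))
```

For a line such as "@ Manuel, Veritatibus, Ares and Co." the sketch finds only the first name, which shows why non-standardized references cannot be resolved by pattern matching alone.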

Problem 3 (explanation): multiple versions of the same comment with minor edits (e.g. corrections, additional information, further thoughts)

Problem 3 (example): additional thought ("bonus comment")

Problem 3 (solution): typology construction; semi-automated annotation; typology of comments as an attribute of the <post> tag? (e.g. editing="correction")
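A semi-automated step for this problem could compare consecutive comments by the same author and suggest a value for the proposed editing attribute. The sketch below is an assumption-laden illustration: the function name `classify_revision`, the label "addition", and the similarity threshold of 0.8 are hypothetical, and only the label "correction" comes from the example above. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def classify_revision(earlier, later, threshold=0.8):
    """Suggests an editing label for `later` relative to `earlier`.

    Returns "addition" if the later text extends the earlier one,
    "correction" if both texts are highly similar, and None otherwise.
    The suggestion is meant to be checked by a human annotator.
    """
    if later != earlier and later.startswith(earlier):
        return "addition"
    ratio = SequenceMatcher(None, earlier, later).ratio()
    if later != earlier and ratio >= threshold:
        return "correction"
    return None


first = "Das ist ein Tipfehler im Text."
second = "Das ist ein Tippfehler im Text."
print(classify_revision(first, second))
```

The threshold would have to be tuned on annotated examples, since short comments can look similar by chance.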