Statistical Analysis of Discourse in Corpus Linguistics

Slide Note

Statistical analysis plays a crucial role in understanding discourse in corpus linguistics. This presentation covers collocations, keywords, and the reliability of manual coding in linguistic research. The tension between the fluid nature of discourse and the rigour expected by statistical methods presents a particular challenge. The strength of collocational relationships is assessed against different baselines, such as the random co-occurrence ("shake the box") model, using association measures such as MI2, MI3, log likelihood and z score.

Uploaded by ogle_bi on Sep 30, 2024

Presentation Transcript

Semantics and discourse: Collocations, keywords and reliability of manual coding
Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press.

Statistical analysis of discourse involves an inherent paradox. While discourse is often fluid, ambiguous and fuzzy, statistics expects rigour, precision and clearly defined categories.

Think about and discuss
1. What associations come to your mind when you see the word "love"?
2. Why do you think the word has these associations for you?

Collocations
[Diagram labels: node, collocates, collocation window (span): 1L 1R, i.e. one word to the left and one word to the right of the node]

Collocations (cont.)
Is "my" really a genuine collocate of "love" in the poem? In other words, is "my" really strongly associated with "love"?
Observed frequency (3) compared with:
1) No baseline: We compare the observed frequencies of all individual words co-occurring with the node and produce a rank-ordered list.
2) Random co-occurrence baseline ("shake the box" model): We compare the observed frequencies with frequencies expected by chance alone and evaluate the strength of collocation using a mathematical equation which puts emphasis on a particular aspect of the collocational relationship.
3) Word competition baseline: We use a different type of baseline from random co-occurrence; this baseline is incorporated in the equation, which again highlights a particular aspect of the collocational relationship.

Shake the box model
$$\text{expected frequency of collocation} = \frac{\text{node frequency} \times \text{collocate frequency}}{\text{no. of tokens in text or corpus}}$$
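
The expected-frequency formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the counts in the usage lines are hypothetical and are not taken from the slides; only the observed frequency of 3 echoes the earlier slide.

```python
def expected_collocation_frequency(node_freq, collocate_freq, corpus_tokens):
    """Expected co-occurrence frequency under the random ("shake the box") baseline.

    Implements: node frequency x collocate frequency / no. of tokens in text or corpus.
    """
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return node_freq * collocate_freq / corpus_tokens


# Hypothetical counts, for illustration only (not from the slides).
expected = expected_collocation_frequency(node_freq=10, collocate_freq=20, corpus_tokens=1000)
observed = 3  # observed frequency mentioned on the earlier slide
print(expected, observed > expected)
```

An observed frequency well above the expected one suggests that the co-occurrence is not due to chance alone; the association measures then quantify that difference in different ways.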

Association measures
Measures listed on the slide: MI2, MI3, log likelihood, z score, T score, Dice, log Dice, Delta P, log ratio, Cohen's d.
[The equations for the individual measures are not recoverable from the transcript.]

Association measures (cont.)

Collocation networks
[Figure: collocation network diagram. Labels: node, N2, N3, N4 (nodes) and C1 to C5 (collocates).]

CPN (collocation parameters notation; Brezina et al. 2015)
Statistic ID: 4b
Statistic name: MI2
Statistic cut-off value: 3
L and R span: L5-R5
Minimum collocate freq. (C): 5
Minimum collocation freq. (NC): 1
Filter: function words removed
Example: 4b-MI2(3), L5-R5, C5-NC1; function words removed
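
As a rough sketch of how the seven parameters combine into the notation string shown in the example, the following Python snippet assembles a CPN label from its parts. The class and field names are hypothetical, chosen for this illustration; only the parameter values and the target string come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollocationParameters:
    """Hypothetical container for the seven CPN fields listed on the slide."""
    statistic_id: str
    statistic_name: str
    cutoff: float
    left_span: int
    right_span: int
    min_collocate_freq: int    # C
    min_collocation_freq: int  # NC
    filter_note: str = ""

    def to_cpn(self) -> str:
        # Mirrors the layout of the slide's example string.
        core = (
            f"{self.statistic_id}-{self.statistic_name}({self.cutoff:g}), "
            f"L{self.left_span}-R{self.right_span}, "
            f"C{self.min_collocate_freq}-NC{self.min_collocation_freq}"
        )
        return f"{core}; {self.filter_note}" if self.filter_note else core


params = CollocationParameters("4b", "MI2", 3, 5, 5, 5, 1, "function words removed")
print(params.to_cpn())
```

Recording the parameters in one structure makes it harder to report a collocation analysis with one of the settings left out.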

Keywords
Positive keywords (+), lockwords (0), negative keywords (-)
Corpus of interest (C)   Reference corpus (R)   Decision
frequent                 infrequent             + (positive keyword)
infrequent               frequent               - (negative keyword)
comparable freq.         comparable freq.       0 (lockword)
[The diagram also shows the values 100M and 1M.]

Keywords (cont.)
Column headings: SMP (with 100 as the constant), Log likelihood, Log ratio, Cohen's d
$$\text{log likelihood}_{\text{short}} = 2 \times \left( O_{11} \times \log\frac{O_{11}}{E_{11}} + O_{21} \times \log\frac{O_{21}}{E_{21}} \right)$$
$$\text{simple maths parameter} = \frac{\text{relative frequency of } w \text{ in } C + k}{\text{relative frequency of } w \text{ in } R + k}$$
Keywords shown on the slide (the assignment of words to the four columns is not recoverable from the transcript): U.S. U.S. LABOR NEIGHBORHOOD TOWARD PERCENT AMERICAN PERCENT PROGRAM NEIGHBORHOOD RECOGNIZE DEFENSE CONGRESSIONAL PROGRAM TOWARD STATES FEDERAL BUSH PRESIDENT CENTER MR. PROGRAMS UNITED STATE CONGRESS WASHINGTON AMERICANS DEFENSE CALIFORNIA WAR TOWARD AMERICAN BUSH FEDERAL STATES CENTER MR. PRESIDENT PROGRAMS UNITED WASHINGTON CONGRESS AMERICANS STATE CALIFORNIA AMERICA DEFENSE NEIGHBORS COLORED MANHATTAN FAVORITE RECOGNIZED CENTER REALIZE RECOGNIZING TRAVELED SIGNALED COLOR CALIFORNIA GOTTEN LABOR FAVOR FINALLY CENTERS ATLANTA PGF2A MACDOWELL MRNA NEIGHBORS ABBY GENOME FLORIDA 9-11 DOE POE ROUSSEAU NS1 REZKO MITCH ADDITIVES
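
The two keyword equations on this slide can be sketched in Python as follows. Several points are assumptions made for this illustration and are not stated on the slide: the function names are hypothetical, relative frequencies are taken to be normalised on the same basis in both corpora, the logarithm is taken to be the natural log, and a term with an observed frequency of zero is treated as contributing zero. The expected frequencies are passed in directly because the slide does not define how they are obtained.

```python
import math


def simple_maths_parameter(rel_freq_c, rel_freq_r, k=100.0):
    """(relative frequency of w in C + k) / (relative frequency of w in R + k)."""
    return (rel_freq_c + k) / (rel_freq_r + k)


def log_likelihood_short(o11, e11, o21, e21):
    """2 * (O11 * log(O11 / E11) + O21 * log(O21 / E21)), natural log assumed."""
    def term(observed, expected):
        # Assumption: a zero observed frequency contributes nothing to the sum.
        return observed * math.log(observed / expected) if observed > 0 else 0.0
    return 2 * (term(o11, e11) + term(o21, e21))


# Hypothetical values, for illustration only.
print(simple_maths_parameter(rel_freq_c=300.0, rel_freq_r=100.0))
print(log_likelihood_short(o11=10, e11=10, o21=5, e21=5))
```

The constant k dampens the effect of very low frequencies, which is one reason why different statistics and different settings yield different keyword lists.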

Inter-rater agreement
Inter-rater agreement, which is an estimate of how reliable and consistent a coding is, should be reported in studies working with a judgement variable. A judgement variable is a variable that involves categorisation or evaluation of cases (e.g. concordance lines) by the analyst, which might bring an element of subjectivity into the study.

Inter-rater agreement (cont.)
Positive or negative? Categorisation is a matter of choice. Rigorous?

Inter-rater agreement (cont.)
Rater 1: negative, negative, negative, positive, negative, negative, negative, negative, positive, negative

Inter-rater agreement (cont.)
Rater 1    Rater 2
negative   negative
negative   negative
negative   negative
positive   positive
negative   positive
negative   negative
negative   negative
negative   negative
positive   negative
negative   negative

Inter-rater agreement (cont.)
$$\text{raw agreement} = \frac{\text{cases of agreement}}{\text{total no. of cases}} = \frac{8}{10} = 0.8$$
$$\text{agreement statistic} = \frac{\text{raw agreement} - \text{agreement by chance}}{1 - \text{agreement by chance}}$$
$$AC_1 = \frac{0.8 - 0.32}{1 - 0.32} = 0.71$$
Rater 1    Rater 2    Agreement
negative   negative   YES
negative   negative   YES
negative   negative   YES
positive   positive   YES
negative   positive   NO
negative   negative   YES
negative   negative   YES
negative   negative   YES
positive   negative   NO
negative   negative   YES
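
The worked example above can be reproduced with a short Python sketch. The slide gives the value 0.32 for agreement by chance without showing how it is obtained; the code below assumes the usual definition of chance agreement for Gwet's AC1 (the average of π(1 − π) over categories, where π is a category's mean share across the two raters), which yields 0.32 for these data. The function names are hypothetical.

```python
from collections import Counter


def raw_agreement(rater1, rater2):
    """Proportion of cases on which both raters chose the same category."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("Both raters must code the same non-empty set of cases.")
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)


def chance_agreement_ac1(rater1, rater2, categories):
    """Chance agreement as commonly defined for Gwet's AC1 (assumed here)."""
    n, q = len(rater1), len(categories)
    if q < 2:
        raise ValueError("At least two categories are needed.")
    counts1, counts2 = Counter(rater1), Counter(rater2)
    total = 0.0
    for category in categories:
        # Mean proportion of cases assigned to this category by the two raters.
        pi = (counts1[category] / n + counts2[category] / n) / 2
        total += pi * (1 - pi)
    return total / (q - 1)


def gwet_ac1(rater1, rater2, categories=("negative", "positive")):
    """(raw agreement - agreement by chance) / (1 - agreement by chance)."""
    pa = raw_agreement(rater1, rater2)
    pe = chance_agreement_ac1(rater1, rater2, categories)
    return (pa - pe) / (1 - pe)


# The ten codings from the slide.
rater1 = ["negative", "negative", "negative", "positive", "negative",
          "negative", "negative", "negative", "positive", "negative"]
rater2 = ["negative", "negative", "negative", "positive", "positive",
          "negative", "negative", "negative", "negative", "negative"]

print(round(raw_agreement(rater1, rater2), 2))                                       # 0.8
print(round(chance_agreement_ac1(rater1, rater2, ("negative", "positive")), 2))      # 0.32
print(round(gwet_ac1(rater1, rater2), 2))                                            # 0.71
```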

Cohen's κ / AC1
[Scale: [-1] absolute disagreement; 0 random agreement; 0.67 agreement; 0.8 very good agreement; 1 absolute agreement. The value 0.71 from the worked example is marked on the scale.]

Inter-rater agreement (cont.)
Type of judgement variable   No. of values   No. of raters   Statistic(s) to use
Nominal (categories)         2 and more      2               Gwet's AC1 and Cohen's κ
Nominal (categories)         2 and more      3 and more      Gwet's AC1 and Fleiss' κ
Ordinal (ranks)              2 and more      2 and more      Gwet's AC2
Interval/Ratio (scale)       2 and more      2 and more      Interclass correlation (ICC)
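
The decision table above can also be read as a simple lookup. The sketch below encodes it in Python; the function name and the string labels for the variable types are hypothetical, and the returned names simply repeat the table's entries.

```python
def suggest_agreement_statistics(variable_type, n_raters):
    """Return the statistic(s) the table lists for a judgement variable."""
    if n_raters < 2:
        raise ValueError("Inter-rater agreement needs at least two raters.")
    kind = variable_type.strip().lower()
    if kind == "nominal":
        if n_raters == 2:
            return ["Gwet's AC1", "Cohen's kappa"]
        return ["Gwet's AC1", "Fleiss' kappa"]
    if kind == "ordinal":
        return ["Gwet's AC2"]
    if kind in ("interval", "ratio"):
        return ["Interclass correlation (ICC)"]
    raise ValueError(f"Unknown variable type: {variable_type!r}")


print(suggest_agreement_statistics("nominal", 2))
print(suggest_agreement_statistics("ordinal", 4))
```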

Things to remember
- There are many association measures, each highlighting different aspects of the collocational relationship (e.g. frequency or exclusivity). There is no one best association measure.
- Collocations can be presented in tabular (table) or visual (graph) form.
- Collocation networks show complex cross-associations in texts and discourses.
- The keyword procedure is in essence a comparison which depends on a number of parameters. There is no such thing as one set of keywords.
- For judgement variables, an inter-rater agreement statistic should be reported. Gwet's AC1 and AC2, Cohen's κ and Fleiss' κ, as well as Interclass correlation, can be used depending on the situation.
