Enhancing Corpus Analysis: Text- and Sub-text-Level Analysis
This presentation makes a case for improving the text-level and sub-text-level analysis of corpora. It reviews traditional approaches, the capabilities and limitations of current tools, and the database design needed for fast metadata processing, and argues for user-friendly solutions that extend researchers' analytical capabilities.
Presentation Transcript
A case for improving the text-level and sub-text-level analysis of corpora
Laurence ANTHONY, Faculty of Science and Engineering, Waseda University, Japan (anthony@waseda.jp, @antlabjp)
Robbie LOVE, School of Education, University of Leeds, UK (r.love@leeds.ac.uk, @lovermob)
Corpus Linguistics 2019, Cardiff University, Cardiff, UK, July 23-26, 2019
Overview
- Background: traditional approaches to corpus analysis; current capabilities and challenges; research aims
- Design and implementation: database architecture; software overview
- Case study: text- and sub-text-level analysis of the Spoken BNC2014 using AntConc (4.0)
- Conclusion and future challenges: where next?
Background: Traditional approaches to corpus analysis, current capabilities and challenges, research aims
Background: Traditional approaches to corpus analysis
Whole-corpus analysis: treating the entire corpus as a single unit of analysis
- analyses in isolation, e.g. KWIC concordancing, dispersion plots, collocation
- analyses by comparison with other (entire) corpora, e.g. keyword analysis, vocabulary profiling, CDA
Part-corpus analysis: dividing the entire corpus into easily identifiable pieces
- comparisons of parts of a corpus with each other
- comparisons of parts of a corpus with (parts of) other corpora, e.g. genre studies, literary studies, ...
Background: Current capabilities and challenges (online and offline tools)
Online concordancing tools, e.g. CQPweb (Hardie 2012), Sketch Engine (Kilgarriff et al. 2004), BYU Corpora (Davies 2008-)
- Most offer (whole-corpus) token-level metadata (annotation) processing (e.g. token/type, POS, lemma)
- Some offer sub-text-level metadata processing (e.g. speakers in CQPweb)
- Many offer text-level metadata processing (e.g. register, genre, time period)
Offline concordancing tools, e.g. AntConc (Anthony 2018), WordSmith Tools (Scott 2017), LancsBox (Brezina et al. 2015)
- Most favour token-level metadata processing at the whole-corpus level
Background: Current capabilities and challenges (online and offline tools)
Concordancing tools are useful, but:
- online tools offer less accommodation for the user's own data
- offline tools offer less sub-text- and text-level functionality
The solution lies in programming, but:
- many researchers are not programmers
- effective database design for sub-text- and text-level processing is an ongoing research question
As a result, many users chop up their data manually and/or shape the research to match the capabilities of the tool.
Background: Research aims
Research aims:
- Create a design for the fast processing of text- and sub-text-level metadata in corpora
- Test and incorporate this design into an update of AntConc
The project:
- Funded by the Japan Society for the Promotion of Science (JSPS)
- One-month Visiting Research Fellowship, hosted by Waseda University, Tokyo, Japan, November-December 2018
https://www.jsps.go.jp/english/
Design and implementation: Database architecture, software overview
Design and implementation: Database architecture (tables and indexes)
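The slide presents the table-and-index design as a diagram that the transcript does not reproduce. As a purely illustrative sketch (not the actual AntConc 4.0 database design; all table and column names below are assumptions), one way to store token-, sub-text-, and text-level metadata for fast filtered queries is a relational layout with one table per metadata level, indexed on the columns used for lookups:

```python
import sqlite3

# Illustrative schema only: table and column names are assumptions,
# not the actual AntConc 4.0 design.
SCHEMA = """
CREATE TABLE texts (                -- text-level metadata (one row per corpus file)
    text_id     INTEGER PRIMARY KEY,
    name        TEXT,
    n_speakers  INTEGER,
    transcriber TEXT
);
CREATE TABLE contributors (         -- sub-text-level metadata (one row per speaker)
    con_id      INTEGER PRIMARY KEY,
    gender      TEXT,
    agerange    TEXT,
    dialect     TEXT
);
CREATE TABLE tokens (               -- token-level metadata (one row per running token)
    tok_id      INTEGER PRIMARY KEY,
    text_id     INTEGER REFERENCES texts(text_id),
    con_id      INTEGER REFERENCES contributors(con_id),
    position    INTEGER,            -- offset within the text, for KWIC context
    word        TEXT,
    pos         TEXT,
    lemma       TEXT,
    trans       TEXT                -- e.g. 'overlap'
);
-- Indexes so that word searches and metadata filters avoid full-table scans
CREATE INDEX idx_tokens_word ON tokens(word);
CREATE INDEX idx_tokens_text ON tokens(text_id);
CREATE INDEX idx_tokens_con  ON tokens(con_id);
"""

conn = sqlite3.connect(":memory:")  # a single corpus database file in practice
conn.executescript(SCHEMA)
```

With a layout like this, a text-level or speaker-level restriction becomes a join plus an indexed filter rather than a re-scan of the raw corpus files, which is the kind of speed-up the research aims describe.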
Design and implementation: Software overview (AntConc 4.0 beta)
Design and implementation: Software overview (AntConc 4.0 beta)
- Freeware
- Multiplatform: Win 7/8/10+, Mac OS X, Linux
- Portable: no installation required
- Metadata processing: token-, sub-text-, and text-level metadata analysis
- Tools: KWIC Concordancer, Distribution Plot, File View, Clusters, N-grams, Collocates, Word Frequency, Keyword Frequency
https://www.laurenceanthony.net/software/antconc
Case study: Text- and sub-text-level analysis of the Spoken BNC2014 using AntConc (4.0)
Spoken BNC2014 (Love et al. 2017): Overview
- 11.5 million tokens of contemporary spoken British English
- Genre: informal conversation
- Years: 2012-2016
- Publisher: Lancaster University & Cambridge University Press
- Availability: publicly available for free
A highly structured corpus, rich in token-, sub-text-, and text-level metadata:
- Token: e.g. POS, lemma, utterances, overlaps, pauses, ...
- Sub-text: e.g. speaker demographics incl. gender, age, (regional) dialect, socio-economic status, ...
- Text: e.g. situational context, no. of speakers, transcriber, ...
http://corpora.lancs.ac.uk/bnc2014/
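To make the three metadata levels concrete, the sketch below loads a deliberately simplified, hypothetical XML fragment into the illustrative tables created above. The element and attribute names (text, u, w, who, agerange, and so on) are assumptions for this sketch only, not the actual Spoken BNC2014 encoding, which uses its own tag set and keeps speaker metadata in separate documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: one text, one speaker, two tokens.
SAMPLE = """
<text id="SAMPLE1" n_speakers="1" transcriber="T15">
  <u who="S0001">
    <w pos="UH" lemma="er">er</w>
    <w pos="JJ" lemma="happy">happy</w>
  </u>
</text>
"""

# Hypothetical speaker metadata keyed by speaker code.
SPEAKERS = {"S0001": {"gender": "F", "agerange": "30_39", "dialect": "north"}}

def load_text(conn, xml_string, speakers):
    """Populate the texts, contributors, and tokens tables from one XML text."""
    root = ET.fromstring(xml_string)
    # Text-level metadata: one row per corpus file.
    cur = conn.execute(
        "INSERT INTO texts (name, n_speakers, transcriber) VALUES (?, ?, ?)",
        (root.get("id"), int(root.get("n_speakers")), root.get("transcriber")))
    text_id = cur.lastrowid
    # Sub-text-level metadata: one row per speaker.
    con_ids = {}
    for code, spk in speakers.items():
        cur = conn.execute(
            "INSERT INTO contributors (gender, agerange, dialect) VALUES (?, ?, ?)",
            (spk["gender"], spk["agerange"], spk["dialect"]))
        con_ids[code] = cur.lastrowid
    # Token-level metadata: one row per running token, linked to its text and speaker.
    position = 0
    for u in root.iter("u"):
        for w in u.iter("w"):
            conn.execute(
                "INSERT INTO tokens (text_id, con_id, position, word, pos, lemma) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (text_id, con_ids[u.get("who")], position, w.text, w.get("pos"), w.get("lemma")))
            position += 1
    conn.commit()

load_text(conn, SAMPLE, SPEAKERS)  # conn from the schema sketch above
```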
Spoken BNC2014 (Love et al. 2017): Query syntax (under investigation)
Three categories:
- Token (tok): token-level metadata categories
- Contributor (con): sub-text-level (speaker) metadata categories
- Document (doc): text-level metadata categories
Queries can include any one or more categories, as well as multiples of the same category:
happy <con gender = "male">
happy <doc n_speakers = 3>&<con gender = "male">
happy <con gender="female"> <tok trans = "overlap">
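As a rough illustration of how this syntax could be handled, the sketch below splits a query string into its search term and its metadata constraints. It assumes only the <category attribute = "value"> pattern shown on the slide; the helper name and regular expression are hypothetical, not the authors' actual implementation.

```python
import re

# Matches constraints of the form <tok attr = "value">, <con attr = value>, etc.
CONSTRAINT = re.compile(r'<\s*(tok|con|doc)\s+(\w+)\s*=\s*"?([^">]+?)"?\s*>')

def parse_query(query: str):
    """Return (search_term, [(category, attribute, value), ...]) for one query."""
    constraints = [m.groups() for m in CONSTRAINT.finditer(query)]
    # Whatever remains after stripping constraints and '&' joiners is the search term.
    term = CONSTRAINT.sub("", query).replace("&", " ").strip()
    return term, constraints

# Example queries from the slide
print(parse_query('happy <con gender = "male">'))
# ('happy', [('con', 'gender', 'male')])
print(parse_query('happy <doc n_speakers = 3>&<con gender = "male">'))
# ('happy', [('doc', 'n_speakers', '3'), ('con', 'gender', 'male')])
```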
Spoken BNC2014 (Love et al. 2017): Demo queries
Token-level: all filled pauses
  *_UH
Sub-text-level: all "_UH" filled pauses produced by speakers aged 30-39
  <con agerange = "30_39"> *_UH
Sub-text-level: all instances of the lemma "give" by female speakers
  <con gender = "F"> {give}
Text-level: all instances of "innit" in texts produced by transcriber no. 15
  <doc transcriber = T15> innit
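Continuing the illustrative sketch (same assumed schema and parse_query helper as above, not the actual AntConc internals), each parsed constraint can be turned into an extra join-and-filter condition in SQL. Wildcards such as *_UH and the {give} lemma syntax are not handled here; the sketch covers only exact word matches with metadata filters.

```python
# Map each query category to its table alias in the illustrative schema.
ALIAS = {"tok": "t", "con": "c", "doc": "d"}

def to_sql(term, constraints):
    """Build a parameterised SQL query over the illustrative schema.
    Assumes each attribute name matches a column name; a real implementation
    would validate attributes before interpolating them into the SQL string."""
    sql = ("SELECT t.word, t.text_id, t.position FROM tokens t "
           "JOIN contributors c ON t.con_id = c.con_id "
           "JOIN texts d ON t.text_id = d.text_id "
           "WHERE t.word = ?")
    params = [term]
    for cat, attr, value in constraints:
        sql += f" AND {ALIAS[cat]}.{attr} = ?"
        params.append(value)
    return sql, params

# e.g. all hits for "happy" spoken by speakers aged 30-39
term, constraints = parse_query('happy <con agerange = "30_39">')
sql, params = to_sql(term, constraints)
rows = conn.execute(sql, params).fetchall()  # conn and data from the sketches above
print(rows)
```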
Conclusion and future challenges: Where next?
Conclusion and future challenges: Pushing towards the next generation of corpus analysis and corpus availability
- The more powerful the tools, the more freedom researchers have to investigate language
- Faster, more automatic processing of metadata is one example of functionality that boosts research power
- The Spoken BNC2014 is a good guinea pig for testing, since it is highly structured
Making sub-text-level and text-level functionality universal:
- Design specialized software modules to process corpora?
- Design/modify publicly accessible corpora to work with popular tools?
- Promote standards for data interoperability?
Conclusion and future challenges: Challenges/limitations (user perspectives)
- Another new syntax? (requires user knowledge of the syntax and its abbreviations)
- Drop-down menus? (easy to use, but limited in scope) [slide mock-up: TOK / CON / DOC selectors plus a free-text search box]
Future directions:
- A universal input format for all structured corpora, which offline tools could process automatically to offer this functionality instantly
References
1. Anthony, L. (2018). AntConc (Version 3.5.7) [Computer software]. Tokyo, Japan: Waseda University. Available from http://www.laurenceanthony.net/software
2. Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173.
3. Davies, M. (2008-). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. Available online at https://corpus.byu.edu/coca/
4. Hardie, A. (2012). CQPweb: Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380-409.
5. Kilgarriff, A., Rychlý, P., Smrž, P., & Tugwell, D. (2004). The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress. Lorient, France.
6. Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319-344.
7. Scott, M. (2017). WordSmith Tools (Version 7). Stroud: Lexical Analysis Software.