Enhancing Corpus Analysis: Text- and Sub-text-Level Analysis
This presentation makes a case for improving the text-level and sub-text-level analysis of corpora. It reviews traditional approaches, the capabilities and limitations of current tools, and the database design needed for fast metadata processing, and argues for user-friendly solutions that extend researchers' analytical capabilities.
Presentation Transcript
A case for improving the text-level and sub-text-level analysis of corpora
Laurence ANTHONY, Faculty of Science and Engineering, Waseda University, Japan (anthony@waseda.jp, @antlabjp)
Robbie LOVE, School of Education, University of Leeds, UK (r.love@leeds.ac.uk, @lovermob)
Corpus Linguistics 2019, Cardiff University, Cardiff, UK, July 23-26, 2019
Overview
- Background: traditional approaches to corpus analysis; current capabilities and challenges; research aims
- Design and implementation: database architecture; software overview
- Case study: text- and sub-text-level analysis of the Spoken BNC2014 using AntConc (4.0)
- Conclusion and future challenges: where next?
Background: Traditional approaches to corpus analysis, current capabilities and challenges, research aims
Background: Traditional approaches to corpus analysis
Whole-corpus analysis: treating the entire corpus as a single unit of analysis
- analyses in isolation, e.g. KWIC concordancing, dispersion plots, collocation
- analyses by comparison with other (entire) corpora, e.g. keyword analysis, vocabulary profiling, CDA
Part-corpus analysis: dividing the entire corpus into easily identifiable pieces
- comparisons of parts of a corpus with each other
- comparisons of parts of a corpus with (parts of) other corpora, e.g. genre studies, literary studies, ...
Background: Current capabilities and challenges (online and offline tools)
Online concordancing tools, e.g. CQPweb (Hardie 2012), Sketch Engine (Kilgarriff et al. 2004), BYU Corpora (Davies 2008-)
- Most offer (whole-corpus) token-level metadata (annotation) processing (e.g. token/type, POS, lemma)
- Some offer sub-text-level metadata processing (e.g. speakers in CQPweb)
- Many offer text-level metadata processing (e.g. register, genre, time period)
Offline concordancing tools, e.g. AntConc (Anthony 2018), WordSmith Tools (Scott 2017), LancsBox (Brezina et al. 2015)
- Most favour token-level metadata processing at the whole-corpus level
Background: Current capabilities and challenges (online and offline tools)
Concordancing tools are useful, but:
- online tools offer less accommodation for the user's own data
- offline tools offer less sub-text- and text-level functionality
The solution lies in programming, but:
- many researchers are not programmers
- effective database design for sub-text- and text-level processing is an ongoing research question
As a result, many users chop up their data manually and/or shape the research to match the capabilities of the tool.
Background: Research aims
Research aims:
- Create a design for the fast processing of text- and sub-text-level metadata in corpora
- Test and incorporate this design into an update of AntConc
The project:
- Funded by the Japan Society for the Promotion of Science (JSPS)
- One-month Visiting Research Fellowship, hosted by Waseda University, Tokyo, Japan, November-December 2018
https://www.jsps.go.jp/english/
Design and implementation: Database architecture, software overview
Design and implementation: Database architecture (tables and indexes)
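The slide presents the table-and-index design as a diagram that the transcript does not reproduce. As a purely illustrative sketch (not the actual AntConc 4.0 database design; all table and column names below are assumptions), one way to store token-, sub-text-, and text-level metadata for fast filtered queries is a relational layout with one table per metadata level, indexed on the columns used for lookups:

```python
import sqlite3

# Illustrative schema only: table and column names are assumptions,
# not the actual AntConc 4.0 design.
SCHEMA = """
CREATE TABLE texts (                -- text-level metadata (one row per corpus file)
    text_id     INTEGER PRIMARY KEY,
    name        TEXT,
    n_speakers  INTEGER,
    transcriber TEXT
);
CREATE TABLE contributors (         -- sub-text-level metadata (one row per speaker)
    con_id      INTEGER PRIMARY KEY,
    gender      TEXT,
    agerange    TEXT,
    dialect     TEXT
);
CREATE TABLE tokens (               -- token-level metadata (one row per running token)
    tok_id      INTEGER PRIMARY KEY,
    text_id     INTEGER REFERENCES texts(text_id),
    con_id      INTEGER REFERENCES contributors(con_id),
    position    INTEGER,            -- offset within the text, for KWIC context
    word        TEXT,
    pos         TEXT,
    lemma       TEXT,
    trans       TEXT                -- e.g. 'overlap'
);
-- Indexes so that word searches and metadata filters avoid full-table scans
CREATE INDEX idx_tokens_word ON tokens(word);
CREATE INDEX idx_tokens_text ON tokens(text_id);
CREATE INDEX idx_tokens_con  ON tokens(con_id);
"""

conn = sqlite3.connect(":memory:")  # a single corpus database file in practice
conn.executescript(SCHEMA)
```

With a layout like this, a text-level or speaker-level restriction becomes a join plus an indexed filter rather than a re-scan of the raw corpus files, which is the kind of speed-up the research aims describe.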
Design and implementation: Software overview (AntConc 4.0 beta)
Design and implementation: Software overview (AntConc 4.0 beta)
- Freeware
- Multiplatform: Win 7/8/10+, Mac OS X, Linux
- Portable: no installation required
- Metadata processing: token-, sub-text-, and text-level metadata analysis
- Tools: KWIC Concordancer, Distribution Plot, File View, Clusters, N-grams, Collocates, Word Frequency, Keyword Frequency
https://www.laurenceanthony.net/software/antconc
Case study: Text- and sub-text-level analysis of the Spoken BNC2014 using AntConc (4.0)
Spoken BNC2014 (Love et al. 2017): Overview
- 11.5 million tokens of contemporary spoken British English
- Genre: informal conversation
- Years: 2012-2016
- Publisher: Lancaster University & Cambridge University Press
- Availability: publicly available for free
A highly structured corpus, rich in token-, sub-text-, and text-level metadata:
- Token: e.g. POS, lemma, utterances, overlaps, pauses, ...
- Sub-text: e.g. speaker demographics incl. gender, age, (regional) dialect, socio-economic status, ...
- Text: e.g. situational context, no. of speakers, transcriber, ...
http://corpora.lancs.ac.uk/bnc2014/
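To make the three metadata levels concrete, the sketch below loads a deliberately simplified, hypothetical XML fragment into the illustrative tables created above. The element and attribute names (text, u, w, who, agerange, and so on) are assumptions for this sketch only, not the actual Spoken BNC2014 encoding, which uses its own tag set and keeps speaker metadata in separate documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: one text, one speaker, two tokens.
SAMPLE = """
<text id="SAMPLE1" n_speakers="1" transcriber="T15">
  <u who="S0001">
    <w pos="UH" lemma="er">er</w>
    <w pos="JJ" lemma="happy">happy</w>
  </u>
</text>
"""

# Hypothetical speaker metadata keyed by speaker code.
SPEAKERS = {"S0001": {"gender": "F", "agerange": "30_39", "dialect": "north"}}

def load_text(conn, xml_string, speakers):
    """Populate the texts, contributors, and tokens tables from one XML text."""
    root = ET.fromstring(xml_string)
    # Text-level metadata: one row per corpus file.
    cur = conn.execute(
        "INSERT INTO texts (name, n_speakers, transcriber) VALUES (?, ?, ?)",
        (root.get("id"), int(root.get("n_speakers")), root.get("transcriber")))
    text_id = cur.lastrowid
    # Sub-text-level metadata: one row per speaker.
    con_ids = {}
    for code, spk in speakers.items():
        cur = conn.execute(
            "INSERT INTO contributors (gender, agerange, dialect) VALUES (?, ?, ?)",
            (spk["gender"], spk["agerange"], spk["dialect"]))
        con_ids[code] = cur.lastrowid
    # Token-level metadata: one row per running token, linked to its text and speaker.
    position = 0
    for u in root.iter("u"):
        for w in u.iter("w"):
            conn.execute(
                "INSERT INTO tokens (text_id, con_id, position, word, pos, lemma) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (text_id, con_ids[u.get("who")], position, w.text, w.get("pos"), w.get("lemma")))
            position += 1
    conn.commit()

load_text(conn, SAMPLE, SPEAKERS)  # conn from the schema sketch above
```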
Spoken BNC2014 (Love et al. 2017): Query syntax (under investigation)
Three categories:
- Token (tok): token-level metadata categories
- Contributor (con): sub-text-level (speaker) metadata categories
- Document (doc): text-level metadata categories
Queries can include any one or more categories, as well as multiples of the same category:
happy <con gender = "male">
happy <doc n_speakers = 3>&<con gender = "male">
happy <con gender="female"> <tok trans = "overlap">
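As a rough illustration of how this syntax could be handled, the sketch below splits a query string into its search term and its metadata constraints. It assumes only the <category attribute = "value"> pattern shown on the slide; the helper name and regular expression are hypothetical, not the authors' actual implementation.

```python
import re

# Matches constraints of the form <tok attr = "value">, <con attr = value>, etc.
CONSTRAINT = re.compile(r'<\s*(tok|con|doc)\s+(\w+)\s*=\s*"?([^">]+?)"?\s*>')

def parse_query(query: str):
    """Return (search_term, [(category, attribute, value), ...]) for one query."""
    constraints = [m.groups() for m in CONSTRAINT.finditer(query)]
    # Whatever remains after stripping constraints and '&' joiners is the search term.
    term = CONSTRAINT.sub("", query).replace("&", " ").strip()
    return term, constraints

# Example queries from the slide
print(parse_query('happy <con gender = "male">'))
# ('happy', [('con', 'gender', 'male')])
print(parse_query('happy <doc n_speakers = 3>&<con gender = "male">'))
# ('happy', [('doc', 'n_speakers', '3'), ('con', 'gender', 'male')])
```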
Spoken BNC2014 (Love et al. 2017): Demo queries
Token-level: all filled pauses
  *_UH
Sub-text-level: all "_UH" filled pauses produced by speakers aged 30-39
  <con agerange = "30_39"> *_UH
Sub-text-level: all instances of the lemma "give" by female speakers
  <con gender = "F"> {give}
Text-level: all instances of "innit" in texts produced by transcriber no. 15
  <doc transcriber = T15> innit
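Continuing the illustrative sketch (same assumed schema and parse_query helper as above, not the actual AntConc internals), each parsed constraint can be turned into an extra join-and-filter condition in SQL. Wildcards such as *_UH and the {give} lemma syntax are not handled here; the sketch covers only exact word matches with metadata filters.

```python
# Map each query category to its table alias in the illustrative schema.
ALIAS = {"tok": "t", "con": "c", "doc": "d"}

def to_sql(term, constraints):
    """Build a parameterised SQL query over the illustrative schema.
    Assumes each attribute name matches a column name; a real implementation
    would validate attributes before interpolating them into the SQL string."""
    sql = ("SELECT t.word, t.text_id, t.position FROM tokens t "
           "JOIN contributors c ON t.con_id = c.con_id "
           "JOIN texts d ON t.text_id = d.text_id "
           "WHERE t.word = ?")
    params = [term]
    for cat, attr, value in constraints:
        sql += f" AND {ALIAS[cat]}.{attr} = ?"
        params.append(value)
    return sql, params

# e.g. all hits for "happy" spoken by speakers aged 30-39
term, constraints = parse_query('happy <con agerange = "30_39">')
sql, params = to_sql(term, constraints)
rows = conn.execute(sql, params).fetchall()  # conn and data from the sketches above
print(rows)
```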
Conclusion and future challenges: Where next?
Conclusion and future challenges: Pushing towards the next generation of corpus analysis and corpus availability
- The more powerful the tools, the more freedom researchers have to investigate language
- Faster, more automatic processing of metadata is one example of functionality that boosts research power
- The Spoken BNC2014 is a good guinea pig for testing, since it is highly structured
Making sub-text-level and text-level functionality universal:
- Design specialized software modules to process corpora?
- Design/modify publicly accessible corpora to work with popular tools?
- Promote standards for data interoperability?
Conclusion and future challenges: Challenges/limitations (user perspectives)
- Another new syntax? (requires user knowledge of the syntax and its abbreviations)
- Drop-down menus? (easy to use, but limited in scope) [slide mock-up: TOK / CON / DOC selectors plus a free-text search box]
Future directions:
- A universal input format for all structured corpora, which offline tools could process automatically to offer this functionality instantly
References
1. Anthony, L. (2018). AntConc (Version 3.5.7) [Computer software]. Tokyo, Japan: Waseda University. Available from http://www.laurenceanthony.net/software
2. Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173.
3. Davies, M. (2008-). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. Available online at https://corpus.byu.edu/coca/
4. Hardie, A. (2012). CQPweb: Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380-409.
5. Kilgarriff, A., Rychlý, P., Smrž, P., & Tugwell, D. (2004). The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress. Lorient, France.
6. Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319-344.
7. Scott, M. (2017). WordSmith Tools (Version 7). Stroud: Lexical Analysis Software.