Enhancing Corpus Analysis: Text and Sub-text Level Analysis

 
A case for improving the text-level and sub-text-level analysis of corpora
 
Corpus Linguistics 2019, Cardiff University, Cardiff, UK, July 23-26, 2019
 
Robbie LOVE
School of Education,
University of Leeds, UK
r.love@leeds.ac.uk
        @lovermob
 
Laurence ANTHONY
Faculty of Science and Engineering,
Waseda University, Japan
anthony@waseda.jp
        @antlabjp
Overview
 
Background
Traditional approaches to corpus analysis
Current capabilities and challenges
Research aims
Design and implementation
Database architecture
Software overview
Case study
Text- and sub-text-level analysis of the Spoken BNC2014 using AntConc (4.0)
Conclusion and future challenges
Where next?
 
 
 
 
Background
Traditional approaches to corpus analysis; current capabilities and challenges; research aims
Background
Traditional approaches to corpus analysis
 
Whole-corpus analysis
Treating the entire corpus as a single unit of analysis
analyses in isolation
e.g. KWIC concordancers, Dispersion plots, Collocation, …
analyses by comparison to other (entire) corpora
e.g. Keyword analysis, Vocabulary profiling, CDA, …
Part-corpus analysis
Dividing the entire corpus into easily identifiable pieces
comparisons of parts of a corpus with each other
comparisons of parts of a corpus with (parts of) other corpora
e.g. genre studies, literary studies, ...
 
 
Background
Current capabilities and challenges (online and offline tools)
 
Online concordancing tools
e.g. CQPweb (Hardie 2012), Sketch Engine (Kilgarriff et al. 2014), BYU Corpora (Davies 2018), …
Most offer (whole-corpus) token-level metadata (annotation) processing (e.g. token/type, POS, lemma)
Some offer sub-text-level metadata processing (e.g. speakers in CQPweb)
Many offer text-level metadata processing (e.g. register, genre, time period)
Offline concordancing tools
e.g. AntConc (Anthony 2018), WordSmith Tools (Scott 2017), LancsBox (Brezina et al. 2015)
Most favour token-level metadata processing at the whole-corpus level
Background
Current capabilities and challenges (online and offline tools)
 
Concordancing tools are useful, but…
online tools offer less accommodation for users’ own data
offline tools offer less sub-text- and text-level functionality
The solution lies in programming, but…
many researchers are not programmers
effective database design for sub-text- and text-level processing is an ongoing research question
So…
many users try to ‘chop up’ their data manually…
…and/or shape the research to match the tool’s capabilities
 
 
 
Background
Research aims
 
Research aims
Create a design for the fast processing of text and sub-text metadata in corpora
Test and incorporate this design into an update for AntConc
The project
Funded by the Japan Society for the Promotion of Science (JSPS)
One-month Visiting Research Fellowship, hosted by Waseda University, Tokyo, Japan
November-December 2018
https://www.jsps.go.jp/english/
 
Design and implementation
Database architecture, Software overview
 
Design and implementation
Database architecture (tables and indexes)
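The original slide presents the architecture only as a diagram. As an illustration of one possible design, and not AntConc's actual internal schema, the three metadata levels could map onto three relational tables with indexes on frequently filtered columns. The sketch below uses Python's sqlite3 with hypothetical table and column names drawn from the Spoken BNC2014 examples later in the talk:

```python
import sqlite3

# Hypothetical three-level schema: documents (text level),
# contributors (sub-text level), tokens (token level).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE documents (
    doc_id      INTEGER PRIMARY KEY,
    n_speakers  INTEGER,
    transcriber TEXT
);
CREATE TABLE contributors (
    con_id   INTEGER PRIMARY KEY,
    doc_id   INTEGER REFERENCES documents(doc_id),
    gender   TEXT,
    agerange TEXT
);
CREATE TABLE tokens (
    tok_id  INTEGER PRIMARY KEY,
    doc_id  INTEGER REFERENCES documents(doc_id),
    con_id  INTEGER REFERENCES contributors(con_id),
    surface TEXT,
    pos     TEXT,
    lemma   TEXT
);
-- Indexes keep metadata-filtered searches from scanning every token.
CREATE INDEX idx_tok_surface ON tokens(surface);
CREATE INDEX idx_tok_lemma   ON tokens(lemma);
CREATE INDEX idx_con_doc     ON contributors(doc_id);
""")

# Toy data: one document, two speakers, three tokens.
con.execute("INSERT INTO documents VALUES (1, 2, 'T15')")
con.executemany("INSERT INTO contributors VALUES (?, ?, ?, ?)",
                [(1, 1, 'F', '30_39'), (2, 1, 'M', '30_39')])
con.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?, ?)",
                [(1, 1, 1, 'gives', 'VVZ', 'give'),
                 (2, 1, 2, 'gave',  'VVD', 'give'),
                 (3, 1, 1, 'happy', 'JJ',  'happy')])

# A sub-text-level search (cf. the later query <con gender="F"> {give})
# becomes a join between the token and contributor tables.
hits = con.execute("""
    SELECT t.surface
    FROM tokens t JOIN contributors c ON t.con_id = c.con_id
    WHERE c.gender = 'F' AND t.lemma = 'give'
""").fetchall()
print(hits)  # [('gives',)]
```

The key design point is that each token row carries foreign keys to its speaker and its document, so any combination of the three metadata levels reduces to indexed joins rather than re-parsing the corpus files.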
 
 
Design and implementation
Software overview (AntConc 4.0 beta)
 
Freeware
Multiplatform: Win 7/8/10+, Linux, Mac OS X
Portable: no installation required
Metadata processing: token-, sub-text-, and text-level metadata analysis

Tools
KWIC Concordancer
Distribution Plot
File View
Clusters
N-grams
Collocates
Word Frequency
Keyword Frequency
 
https://www.laurenceanthony.net/software/antconc
 
Case Study
Text- and sub-text-level analysis of the Spoken BNC2014 using AntConc (4.0)
Spoken BNC2014 (Love et al. 2017)
Overview
 
11.5 million tokens of contemporary spoken British English
Genre: Informal conversation
Years: 2012-2016
Publisher: Lancaster University & Cambridge University Press
Availability: Publicly available for free
Highly ‘structured’ corpus, rich in token-, sub-text-, and text-level metadata
Token: e.g. POS, lemma, utterances, overlaps, pauses, …
Sub-text: e.g. speaker demographics incl. gender, age, (regional) dialect, socio-economic status, …
Text: e.g. situational context, no. of speakers, transcriber, …
http://corpora.lancs.ac.uk/bnc2014/
Spoken BNC2014 (Love et al. 2017)
Query syntax (under investigation)
 
Three categories:
Token (‘tok’): token-level metadata categories
Contributor (‘con’): sub-text-level (speaker) metadata categories
Document (‘doc’): text-level metadata categories
Queries can include any one or more categories, as well as multiples of the same category
happy
<con gender = "male"> happy
<doc n_speakers = 3>&<con gender = "male"> happy
<con gender="female">
<tok trans = "overlap">
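As a sketch of how a query string in this syntax might be decomposed (the slides show the syntax under investigation, not an implementation, so the function and names below are hypothetical), a small regex-based parser could separate the metadata filters from the search term:

```python
import re

# Hypothetical parser for queries such as:
#   <doc n_speakers = 3>&<con gender = "male"> happy
# It extracts (category, attribute, value) filter triples for the
# doc/con/tok categories, then treats the remainder as the search term.
FILTER_RE = re.compile(r'<\s*(doc|con|tok)\s+(\w+)\s*=\s*"?([^">]+?)"?\s*>')

def parse_query(query):
    """Split a query into metadata filters and a (possibly empty) term."""
    filters = FILTER_RE.findall(query)
    term = FILTER_RE.sub("", query).replace("&", " ").strip()
    return filters, term

filters, term = parse_query('<doc n_speakers = 3>&<con gender = "male"> happy')
print(filters)  # [('doc', 'n_speakers', '3'), ('con', 'gender', 'male')]
print(term)     # happy
```

The quotes around values are optional in the regex, so both `n_speakers = 3` and `gender = "male"` parse the same way; a metadata-only query such as `<con gender="female">` yields filters with an empty term.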
Spoken BNC2014 (Love et al. 2017)
Demo queries
 
Token-level
All filled pauses
*_UH
Sub-text-level
All "_UH" filled pauses produced by speakers aged 30-39
<con agerange = "30_39"> *_UH
All instances of the lemma "give", by female speakers
<con gender= "F"> {give}
Text-level
All instances of "innit" in texts produced by transcriber no. 15
<doc transcriber = T15> in n it
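The token-level query `*_UH` uses a wildcard over word_TAG strings. One plausible way to support such wildcards (an assumption for illustration, not AntConc's actual code) is to compile them to regular expressions:

```python
import re

# Hypothetical conversion of concordancer-style wildcards (* = any
# characters) into anchored regular expressions, applied to
# POS-tagged tokens of the form word_TAG.
def wildcard_to_regex(pattern):
    return re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")

tagged = ["erm_UH", "er_UH", "give_VVB", "innit_FU"]
uh = wildcard_to_regex("*_UH")
uh_matches = [t for t in tagged if uh.match(t)]
print(uh_matches)  # ['erm_UH', 'er_UH']
```

Escaping the pattern first means only the `*` is special, so literal underscores and tag names match exactly as typed.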
 
Conclusion and future challenges
Where next?
Conclusion and future challenges
 
Pushing towards the next generation of corpus analysis and corpus availability
The more powerful the tools, the more freedom researchers have to investigate language
The faster, more automatic processing of metadata is one example of functionality that boosts research power
The Spoken BNC2014 is a good ‘guinea pig’ for testing since it is highly structured
Making sub-text-level and text-level functionality universal
Design specialized software modules to process corpora?
Design/modify publicly accessible corpora to work with popular tools?
Promote standards for data interoperability?
Conclusion and future challenges
 
Challenges/limitations (user perspectives)
Another new syntax…? (requires user knowledge of syntax abbreviations)
Drop-down menus? (easy to use but limited in scope)
 
 
 
Future directions
A universal input format for all ‘structured’ corpora, which can be processed automatically by offline tools in order to offer this functionality instantly
 
References
 
1. Anthony, L. (2018). AntConc (Version 3.5.7) [Computer software]. Tokyo, Japan: Waseda University. Available from http://www.laurenceanthony.net/software
2. Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173.
3. Davies, M. (2008-). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. Available online at https://corpus.byu.edu/coca/
4. Hardie, A. (2012). CQPweb - combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380-409.
5. Kilgarriff, A., Rychlý, P., Smrž, P., & Tugwell, D. (2004). The Sketch Engine. Information Technology.
6. Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319-344.
7. Scott, M. (2017). WordSmith Tools (Version 7). Stroud: Lexical Analysis Software.
 
 
 
 
 
Spoken BNC2014 (Love et al. 2017)
Demo queries
 
Token-level
All filled pauses
*_UH
All short pauses
<tok:dur="short">
Sub-text-level
All tokens produced by speakers aged 30-39
<con:agerange="30-39">
All instances of "thanks", by female speakers, in texts featuring 2 speakers
<doc:n_speakers=2> <con:gender="female"> thanks
Text-level
All instances of "innit" in texts produced by transcriber no. 2
<doc:transcriber=2> in n it
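Putting the levels together, a combined query such as `<doc:n_speakers=2> <con:gender="female"> thanks` amounts to filtering tokens joined with their speaker and document metadata. The sketch below shows that reduction over toy in-memory records; all names and structures are hypothetical illustrations, not the tool's implementation:

```python
# Toy records for the three metadata levels. Contributors are keyed by
# (speaker id, document id) because the same speaker code may recur
# across documents.
docs = {1: {"n_speakers": 2}, 2: {"n_speakers": 4}}
cons = {("A", 1): {"gender": "female"}, ("B", 1): {"gender": "male"}}
toks = [
    {"doc": 1, "spk": "A", "surface": "thanks"},
    {"doc": 1, "spk": "B", "surface": "thanks"},
    {"doc": 2, "spk": "A", "surface": "thanks"},
]

def run_query(surface, doc_filters, con_filters):
    """Return tokens matching the surface form and all metadata filters."""
    hits = []
    for t in toks:
        if t["surface"] != surface:
            continue
        doc = docs[t["doc"]]
        spk = cons.get((t["spk"], t["doc"]), {})
        if (all(doc.get(k) == v for k, v in doc_filters.items())
                and all(spk.get(k) == v for k, v in con_filters.items())):
            hits.append(t)
    return hits

# Equivalent of <doc:n_speakers=2> <con:gender="female"> thanks
hits = run_query("thanks", {"n_speakers": 2}, {"gender": "female"})
print(len(hits))  # 1
```

Only the first token survives: the second fails the speaker (gender) filter and the third fails the document (n_speakers) filter, which is exactly the text- and sub-text-level narrowing the demo queries describe.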