Understanding Corpus Linguistics in Web Research


An overview of corpus linguistics and the web as a corpus, based on a presentation by Adam Kilgarriff. It covers the definition of a corpus, the growth of corpora since the 1960s, corpus types and parameters, and the vast amount of linguistic data available on the web, along with the use of corpora in fields such as education, psychology, and language technology.





Presentation Transcript


  1. What's on the Web? The Web as a Linguistic Corpus. Adam Kilgarriff, Lexical Computing Ltd, University of Leeds

  2. You can't help noticing. Replaceable or replacable? http://googlefight.com BL, Jan 2011 Kilgarriff: Web as Corpus 2

  3. What is a corpus? A collection of texts. Call it a corpus when used for literary or linguistic research.

  4. History

  5. Corpora since the 1960s. [Chart: corpus size in words, axis from 10^6 to 10^9, by decade from the 1960s to the 2000s; corpora plotted: Brown/LOB, COBUILD, BNC, OEC]

  6. Pioneers. Dictionary publishers: most words are rare, so a corpus must be vast. Other interested parties, mostly for word frequency lists: educationalists, psychologists. Since the 1990s: language technology.

  7. Corpus types. Monolingual. Parallel: bi-texts (a text and its translation); statistical machine translation; Google Translate. Comparable: more than one language, the same kind of text for each.

  8. Parameters. Language. Size: a thousand to a trillion (1,000 to 1,000,000,000,000), measured in words, sentences, GB, or hours. Text type: writing, speech; newspaper, blog, chat, academic, …, mixed; sport, hairdressing, DNA of the nematode worm.

  9. The Web. Very, very large. 2006 estimates for the duplicate-free, linguistic, Google-indexed web: German 44 billion words, Italian 25 billion words, English 1-10 trillion words. Most languages. Most language types. Up to date. Free. Instant access.

  10. What is out there? What text types are there on the web? Some are new: chatroom. Proportions: is it overwhelmed by porn? How much? Hard question.

  11. Comparing frequency lists. Web1T: a present from Google; all 1-, 2-, 3-, 4- and 5-grams with f>40 in one trillion words of English. Compare with the British National Corpus: 100m words, early 1990s, pre-web. Keywords of each vs. the other: highest contrast of frequency.
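The keyword comparison on this slide can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it assumes frequencies are normalised per million words and scored with a smoothed ratio, and the function names, the `smoothing` constant and the two toy word lists are all hypothetical.

```python
from collections import Counter


def per_million(counts):
    """Convert raw counts to frequencies per million words."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def keywords(focus_counts, reference_counts, smoothing=100.0, top=10):
    """Rank words by how much more frequent they are in the focus list.

    Assumption: the score is (focus_fpm + smoothing) / (reference_fpm + smoothing).
    The smoothing constant keeps words absent from one list from dividing by zero.
    """
    focus = per_million(focus_counts)
    reference = per_million(reference_counts)
    vocabulary = set(focus) | set(reference)
    scores = {
        word: (focus.get(word, 0.0) + smoothing) / (reference.get(word, 0.0) + smoothing)
        for word in vocabulary
    }
    # Highest contrast first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda word: (-scores[word], word))[:top]


# Toy stand-ins for the two corpora (invented data, for illustration only).
web = Counter("the browser url the forum the url browser config".split())
bnc = Counter("the she stood the herself she the council felt".split())

web_high = keywords(web, bnc, top=2)
bnc_high = keywords(bnc, web, top=1)
```

Running the function in both directions gives the two lists the following slides call "Web-high" and "BNC-high".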

  12. Web-high (155 terms). 61 web and computing: config, browser, spyware, url, www, forum. 38 porn. 22 US English (incl. Spanish influence: los). 18 business/products common on the web: poker, viagra, lingerie, ringtone, dvd, casino, rental, collectible, tiffany (NB: BNC is old). 4 legal: trademarks, pursuant, accordance, herein.

  13. BNC-high. Excluding British English and transcription/tokenisation anomalies: herself, stood, seemed, she, looked, yesterday, sat, considerable, had, council, felt, perhaps, walked, round, her, towards, claimed, knew, obviously, remained, himself, he, him.

  14. Observations. Pronouns and past tense verbs: fiction. Masc vs fem. Yesterday: probably daily newspapers. Constancy of ratios: he/him/himself, she/her/herself.

  15. Corpus Factory. Most languages: no large corpora. Goal: 100 biggest languages, 100m-word corpora. BootCat method: take seed words (from Wikipedia); send them to a search engine in random pairs, threes or fours; collect the pages the search engine finds; repeat 50,000 times.
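The BootCat loop described on this slide can be sketched roughly as follows. This is only an outline under stated assumptions: `seed_tuples`, `collect_pages` and `fake_search` are hypothetical names, and `fake_search` is a stand-in that returns invented URLs where a real run would call a search engine.

```python
import random


def seed_tuples(seeds, n_queries, sizes=(2, 3, 4), rng=None):
    """Build queries from random pairs, threes or fours of seed words."""
    if rng is None:
        rng = random.Random()
    queries = []
    for _ in range(n_queries):
        size = rng.choice(sizes)
        # sample() picks distinct seed words, so no word repeats within a query.
        queries.append(tuple(rng.sample(seeds, size)))
    return queries


def collect_pages(queries, search):
    """Send each query to `search` and gather the URLs it returns, once each."""
    seen = set()
    urls = []
    for query in queries:
        for url in search(" ".join(query)):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def fake_search(query):
    """Hypothetical stand-in for a search engine: one invented URL per query word."""
    return [f"http://example.org/{word}" for word in query.split()]


seeds = ["river", "market", "school", "music", "winter", "bridge"]
queries = seed_tuples(seeds, n_queries=5, rng=random.Random(0))
pages = collect_pages(queries, fake_search)
```

In a real run, `n_queries` would be in the tens of thousands, as on the slide, and the collected pages would then be downloaded and cleaned.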

  16. 42 Languages: Arabic, Bengali, Bulgarian, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Indonesian, Irish, Italian, Japanese, Korean, Malay, Malayalam, Maltese, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovene, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Vietnamese, Welsh.

  17. Corpus quality. Character encoding. Boilerplate: navigation bars, adverts, legal disclaimers, … Duplicates. Language: contamination by English. Concerns shared by Google, Microsoft, IBM etc. LCL use (and develop) leading methods.
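As a small illustration of the duplicates problem, the sketch below drops paragraphs that are identical after whitespace and case normalisation. It covers exact duplicates only, which is much simpler than the near-duplicate detection a production pipeline would need; the function names are hypothetical and nothing here is taken from the tools mentioned in the talk.

```python
import hashlib


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def drop_duplicates(paragraphs):
    """Keep the first copy of each paragraph, comparing normalised text."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        # Hashing keeps the `seen` set small even when paragraphs are long.
        key = hashlib.sha1(normalise(paragraph).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(paragraph)
    return kept


paragraphs = [
    "Home | About | Contact",
    "A real sentence.",
    "home |  about | contact",
    "A real sentence.",
]
unique = drop_duplicates(paragraphs)
```

Repeated navigation bars and disclaimers tend to show up as exactly this kind of duplicate, so even this simple pass removes some boilerplate as a side effect.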

  18. Levels of processing. Lemmas and word forms: invade (lemma) vs invade, invaded, invades, invading (word forms). Part-of-speech tagging, also called word-class tagging: brush (verb) ("she brushed him aside") vs. brush (noun) ("Give me the brush."); can (verb) ("he can do it") vs. can (noun) ("the beer can"). Some languages, not others.
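To make the difference between word forms, lemmas and part-of-speech tags concrete, here is a toy sketch. It assumes the tokens arrive already tagged and uses a tiny hand-written lookup table; `LEMMAS` and `lemma_pos_counts` are hypothetical, and a real system would use a trained lemmatiser and tagger instead.

```python
from collections import Counter

# Tiny hand-written lookup from word form to lemma (illustration only).
LEMMAS = {
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
    "brushed": "brush",
    "brushes": "brush",
}


def lemma_pos_counts(tagged_tokens, lemmas=LEMMAS):
    """Count (lemma, part-of-speech) pairs from (word form, tag) pairs."""
    return Counter(
        (lemmas.get(word.lower(), word.lower()), pos)
        for word, pos in tagged_tokens
    )


tagged = [
    ("She", "pronoun"), ("brushed", "verb"), ("him", "pronoun"), ("aside", "adverb"),
    ("Give", "verb"), ("me", "pronoun"), ("the", "determiner"), ("brush", "noun"),
    ("They", "pronoun"), ("invaded", "verb"), ("and", "conjunction"), ("invades", "verb"),
]
counts = lemma_pos_counts(tagged)
```

Counting by lemma merges "invaded" and "invades" into one entry, while the tag keeps "brush" the verb apart from "brush" the noun.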

  19. Demo
