Corpus Linguistics in Web Research

 
What's on the Web?
The Web as a Linguistic Corpus
 
Adam Kilgarriff
Lexical Computing Ltd
University of Leeds
 
BL, Jan 2011
 
Kilgarriff: Web as Corpus
 
 
You can’t help noticing
 
 
Replaceable or replacable?
http://googlefight.com
 
What is a corpus?
 
A collection of texts
Call it a corpus when
Used for literary or linguistic research
 
 
History
 
 
Corpora since the 1960s
 
 
[Chart: corpus size in words (log scale, 10^6 to 10^9) by decade, 1960s to 2000s. Corpora shown: Brown/LOB, COBUILD, BNC, OEC]
 
Pioneers
 
Dictionary publishers
Most words rare: must be vast
Other interested parties
Mostly for word frequency lists:
Educationalists
Psychologists
Since 1990s
Language technology
 
 
Corpus types
 
Monolingual
Parallel
Bi-texts: a text and its translation
Statistical machine translation
Google translate
Comparable
More than one language, same kind of text for
each
 
 
Parameters
 
Language
Size
A thousand to a trillion words
1,000 to 1,000,000,000,000
words, sentences, GB, hours
Text type
Writing, speech
Newspaper, blog, chat, academic, …, mixed
Sport, hairdressing, DNA of the nematode worm
 
 
The Web
 
Very very large
2006 estimates for the duplicate-free, linguistic, Google-indexed web
German: 44 billion words
Italian: 25 billion words
English: 1 to 10 trillion words
Most languages
Most language types
Up-to-date
Free
Instant access
 
 
What is out there?
 
What text types are there on the web?
some are new: chatroom
proportions
is it overwhelmed by porn?  How much?
Hard question
 
 
Comparing frequency lists
 
Web1T
Present from Google
All 1-, 2-, 3-, 4- and 5-grams with f>40 in one trillion words of English
Compare with British National Corpus
100m words
Early 1990s: pre-web
Keywords of each vs. the other
Highest contrast of frequency
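
One way to make "highest contrast of frequency" concrete is a smoothed ratio of per-million frequencies. The slide does not say which statistic was used, so the smoothing constant k, the function names and the toy counts below are assumptions for illustration only.

```python
from collections import Counter

def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}

def keyword_scores(focus, reference, k=100.0):
    """Score each word by (fpm_focus + k) / (fpm_reference + k).

    k is a smoothing constant (an assumption, not taken from the slides):
    it avoids division by zero and damps the scores of very rare words.
    """
    f = per_million(focus)
    r = per_million(reference)
    vocab = set(f) | set(r)
    return {w: (f.get(w, 0.0) + k) / (r.get(w, 0.0) + k) for w in vocab}

# Toy counts, invented purely to exercise the function.
web = Counter({"browser": 50, "the": 900, "she": 50})
bnc = Counter({"browser": 0, "the": 900, "she": 100})

scores = keyword_scores(web, bnc)
web_high = sorted(scores, key=scores.get, reverse=True)  # most web-typical first
bnc_high = sorted(scores, key=scores.get)                # most BNC-typical first
```

Words with a score well above 1 are candidates for the "web-high" list, and words well below 1 for the "BNC-high" list; a word equally frequent in both corpora scores about 1.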
 
 
Web-high (155 terms)
 
61 web and computing
config browser spyware url www forum
38 porn
22 US English (incl. Spanish influence: los)
18 business/products common on web
poker viagra lingerie ringtone dvd casino rental collectible tiffany
NB: BNC is old
4 legal
trademarks pursuant accordance herein
 
 
 
 
BNC-high
 
Exclude British English, transcription/tokenisation anomalies
 
herself stood seemed she looked
 
yesterday sat
considerable had council felt perhaps walked
round her towards claimed knew obviously
remained himself he him
 
 
Observations
 
Pronouns and past tense verbs
Fiction
Masc vs fem
Yesterday
Probably daily newspapers
Constancy of ratios:
He/him/himself
She/her/herself
 
Corpus Factory
 
Most languages: no large corpora
Goal
100 biggest languages, 100m-word corpora
BootCat method
Repeat 50,000 times
Seed words
Send to a search engine
In random pairs, threes or fours
Collect the pages the search engine finds
Seed words from Wikipedia
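
The query-building and page-collection steps listed above can be sketched roughly as follows. This is a minimal illustration, not the BootCaT tool itself: the function names, the seed list and the `search` callback are hypothetical, and no real search engine API is assumed.

```python
import random

def make_queries(seeds, n_queries, sizes=(2, 3, 4), rng=None):
    """Build search queries from random pairs, threes or fours of seed words."""
    rng = rng or random.Random()
    queries = []
    for _ in range(n_queries):
        size = rng.choice(sizes)
        # sample() picks distinct seed words, so no word repeats in a query
        queries.append(" ".join(rng.sample(seeds, size)))
    return queries

def collect_pages(queries, search):
    """Gather the unique URLs returned for each query.

    `search` is a caller-supplied function (hypothetical here) that takes a
    query string and returns a list of URLs.
    """
    seen = set()
    for q in queries:
        for url in search(q):
            seen.add(url)
    return seen

# Hypothetical seed list; the slides take seed words from Wikipedia.
seeds = ["river", "market", "teacher", "window", "music", "winter"]
queries = make_queries(seeds, n_queries=5, rng=random.Random(0))
```

Setting `n_queries` to 50,000 would correspond to the repeat count on the slide; passing a seeded `random.Random` keeps a run reproducible.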
 
 
42 Languages
 
Arabic Bengali Bulgarian Chinese Croatian
Czech Danish Dutch English Estonian Finnish
French German Greek Gujarati Hebrew Hindi
Indonesian Irish Italian Japanese Korean
Malay Malayalam Maltese Norwegian Persian
Polish Portuguese Romanian Russian Serbian
Slovene Spanish Swahili Swedish Tamil Telugu
Thai Turkish Vietnamese Welsh
 
 
Corpus quality
 
Character encoding
‘boilerplate’
Navigation bars, adverts, legal disclaimers, …
Duplicates
Language
Contamination by English
 
Concerns shared by Google, Microsoft, IBM etc
LCL use (and develop) leading methods
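
As a rough illustration of the duplicates problem only, the sketch below drops paragraphs whose normalised text has already been seen. It is far simpler than the near-duplicate and boilerplate detection used in practice, and it is not a description of the methods LCL uses; the function names and sample strings are invented.

```python
import hashlib

def normalise(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())

def drop_duplicates(paragraphs):
    """Keep the first occurrence of each paragraph, comparing normalised text."""
    seen = set()
    kept = []
    for p in paragraphs:
        digest = hashlib.sha1(normalise(p).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(p)
    return kept

# Invented sample: a navigation bar repeated across pages, plus repeated text.
pages = [
    "Home | About | Contact",
    "Corpora are collections of texts.",
    "home  |  about | contact",
    "Corpora are collections of texts.",
]
unique = drop_duplicates(pages)
```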
 
 
Levels of processing
 
Lemmas and word forms
invade (lemma) vs invade, invaded, invades, invading (word forms)
Part-of-speech tagging
Also word-class tagging
brush (verb) (“she brushed him aside”) vs. brush (noun) (“Give me the brush.”)
can (verb) (“he can do it”) vs. can (noun) (“the beer can”)
Some languages, not others
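
The lemma/word-form distinction can be shown with a toy count. The lookup table below is a hypothetical stand-in: real lemmatisers rely on morphological analysis and part-of-speech information, which is exactly what separates "can" the verb from "can" the noun.

```python
from collections import Counter

# Hypothetical lookup table: word form -> lemma.
LEMMAS = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_counts(tokens):
    """Count tokens by lemma, falling back to the lowercased form itself."""
    return Counter(LEMMAS.get(t.lower(), t.lower()) for t in tokens)

tokens = "They invaded in spring and the army invades again".split()
form_counts = Counter(t.lower() for t in tokens)  # counts per word form
counts = lemma_counts(tokens)                     # counts per lemma
```

Counting by word form gives "invaded" and "invades" one hit each, while counting by lemma gives "invade" two.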
 
 
Demo
 
 

Explore the world of corpus linguistics through Adam Kilgarriff's research, delving into the definition of a corpus, its historical background, types, parameters, and the vastness of linguistic data available on the web since the 1960s. Discover the significance of corpora in various fields such as education, psychology, and language technology.

Uploaded on Sep 11, 2024 | 0 Views


