Evolution of the Web: A Journey Through Time

 
So much of everything
 
Adam Kilgarriff
Lexical Computing Ltd
 
To the web on its thirteenth birthday
 
Teenager now, growth spurts wild and luxuriant,
Your gangling form changes month on month.
Born from the womb of academe in 1994
Pornography was your wetnurse
Growing you big and strong
Teaching you new tricks
(webcams, streaming video, cash payments)
that your parents never dreamed of.
 
Teenagers must have independence!
So Web-2.0 has arrived.
Now everybody feeds you.
 
 
 
New life for democracy
- are you aren't you?
- are you aren't you?
Giving voice to millions or cancer
Spam spammer spamming
Phishing, viruses, Trojans
Appropriating mutating multiplying.
New life or cancer? Both of course
Like nuclear physics and the Russian Revolution.
You are huge like Exxon, Zeus and China
We tremble and we lick our lips as we watch you grow.
 
AK 2007
 
 
Changes
 
Then
 
Documents
Library
Passive
Information
Computer
Public
 
Now
 
+ social media, audio, video
+ phone exchange, marketplace
+ Active
+ Interaction
+ tablet, smartphone
+ degrees of privacy
 
 
Dick Whittington
off to London 
where the streets are paved with gold 
to make his fortune
 
 
 
 
 
Hero
    
or
   
Villain
 
 
As linguists we can
 
Study the web
new
fascinating
mostly language: 
we are well-placed
Use the web as infrastructure
Click and get it
Use the web as a source of data
All important: all different
 
 
 
Sketch Engine
 
Online corpus query tool
Ready-to-use corpora
All major world languages
Many others
Install your own corpora
Build corpora from the web
 
 
Language varieties
 
Get a corpus
How does it compare with a reference
Qualitatively
Quantitatively
 
Biber 1988: Variation across Speech and Writing
 
Qualitative
 
Take keyword lists
[a-z]{3,}
Lemma if lemmatisation identical, else word
C1 vs C2, top 100/200
C2 vs C1, top 100/200
study
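
The recipe above can be sketched in a few lines. This is only an illustration, not the Sketch Engine implementation: the function name `keyword_list`, the toy counts, and the use of the (nf1+k)/(nf2+k) score from later in the deck are assumptions made for the example.

```python
import re

# Keep only items made of three or more lowercase letters, as on the slide.
WORD_RE = re.compile(r"[a-z]{3,}")


def keyword_list(focus, reference, top=100, k=100.0):
    """Rank items of `focus` against `reference` (both: word -> raw count).

    Scoring is an assumption: per-million frequencies with a constant k
    added to both sides, as in the keyword formula later in the deck.
    """
    focus_total = sum(focus.values())
    ref_total = sum(reference.values())
    scores = {}
    for word in set(focus) | set(reference):
        if not WORD_RE.fullmatch(word):
            continue  # drops short items, capitals, digits, punctuation
        nf_focus = focus.get(word, 0) * 1_000_000 / focus_total
        nf_ref = reference.get(word, 0) * 1_000_000 / ref_total
        scores[word] = (nf_focus + k) / (nf_ref + k)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]


# Hypothetical toy counts standing in for C1 and C2.
c1 = {"dragon": 50, "the": 500, "ok": 30, "Harry": 20, "market": 1}
c2 = {"market": 60, "the": 520, "dragon": 1, "no": 19}

print(keyword_list(c1, c2, top=2))  # C1 vs C2
print(keyword_list(c2, c1, top=2))  # C2 vs C1
```

Running it in both directions gives the two lists to study side by side.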
 
Example: OCC and OEC

OEC: general reference corpus
BrE and AmE
All 21st century
Some fiction
OCC: writing for children
Most BrE
Some 21st century, some earlier
Most fiction
Compare 21st century fiction subcorpora
(not enough British 21st cent fiction on OEC)
 
 
It’s ever so interesting
Do it
 
Liverpool, July 2009
 
Kilgarriff: Simple Maths
 
 
Simple maths for keywords
“This word is twice as common here as there”
 
 
 
 
What does it mean?
For word wubble

                      Freq (f)   Corp size   Per million
Focus corp (fc)       40         10m         4
Reference corp (rc)   50         25m         2

Ratio=2: wubble is twice as common in fc as rc
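
The wubble arithmetic can be checked with a minimal sketch. The function names are made up for this example; the numbers are the ones on the slide.

```python
def per_million(freq, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return freq * 1_000_000 / corpus_size


def frequency_ratio(fc_freq, fc_size, rc_freq, rc_size):
    """Ratio of per-million frequencies, focus corpus over reference corpus."""
    return per_million(fc_freq, fc_size) / per_million(rc_freq, rc_size)


# wubble: 40 hits in a 10m-word focus corpus, 50 hits in a 25m-word reference corpus
print(per_million(40, 10_000_000))   # 4.0
print(per_million(50, 25_000_000))   # 2.0
print(frequency_ratio(40, 10_000_000, 50, 25_000_000))  # 2.0
```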
 
 
Not just words
Grammatical constructions
Suffixes
Keyword list
Calculate ratio for all words
Sort
Keywords: at top of list
 
 
Good enough for keywords?
 
 
Almost, but
1. Are corpora well matched?
2. Burstiness
3. You can’t divide by zero
4. High ratios more common for rare words
 
 
1. Are corpora well matched?
 
Proportionality
If fiction contains more American, newspaper more British…
genre compromised by region
Usual problem
Issue in corpus design
Not here
 
 
2. Burstiness

Word          BNC freq   BNC files
mucosa        1031       9
theology      1032       230
unfortunate   1031       648
 
Discount frequency for bursty words
 Gries, CL 2007, also CL journal
 We use ARF (average reduced frequency)
 Not here
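
For readers who want to see what a burstiness discount looks like, here is a rough sketch of ARF as it is usually defined: with N the corpus size, f the frequency and v = N/f, sum min(d, v) over the distances d between successive occurrences (treating the corpus as circular) and divide by v. The formula is not given on the slides, so treat it, and the function name, as an assumption.

```python
def average_reduced_frequency(positions, corpus_size):
    """ARF of a word from the token positions of its occurrences.

    Evenly spread words keep roughly their raw frequency; words whose
    occurrences cluster in one place are discounted towards 1.
    """
    f = len(positions)
    if f == 0:
        return 0.0
    pos = sorted(positions)
    v = corpus_size / f  # average distance if the word were evenly spread
    # Distances between consecutive occurrences ...
    distances = [pos[i] - pos[i - 1] for i in range(1, f)]
    # ... plus the wrap-around distance from the last back to the first.
    distances.append(pos[0] + corpus_size - pos[-1])
    return sum(min(d, v) for d in distances) / v


# Hypothetical 100-token corpus, ten occurrences each.
even = list(range(0, 100, 10))   # spread throughout the corpus
bursty = list(range(10))         # all in one burst at the start
print(average_reduced_frequency(even, 100))    # 10.0
print(average_reduced_frequency(bursty, 100))  # 1.9
```

Both toy words have raw frequency 10, but the bursty one is discounted to about 2.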
 
 
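3. You can’t divide by zero

word       fc     rc   ratio
buggle     10     0    ?
stort      100    0    ?
nammikin   1000   0    ?

Standard solution: add one

word       fc     rc   ratio
buggle     11     1    11
stort      101    1    101
nammikin   1001   1    1001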
 
Problem solved
 
 
4. High ratios more common for rarer words

word   fc     rc    ratio   interesting?
spug   10     1     10      no
grod   1000   100   10      yes

Some researchers: grammar, grammar words
Some researchers: lexis, content words
No right answer
Slider?
 
 
Solution
 
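Don’t just add 1, add n

n=1

word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       11      1       11.00   1
middling    200     100     201     101     1.99    2
common      12000   10000   12001   10001   1.20    3

n=100

word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       110     100     1.10    3
middling    200     100     300     200     1.50    1
common      12000   10000   12100   10100   1.20    2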
 
 
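n=1000

word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       1010    1000    1.01    3
middling    200     100     1200    1100    1.09    2
common      12000   10000   13000   11000   1.18    1

Summary

word        fc      rc      n=1   n=100   n=1000
obscurish   10      0       1st   3rd     3rd
middling    200     100     2nd   1st     2nd
common      12000   10000   3rd   2nd     1st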
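
The effect of n on the ranking can be reproduced with a short sketch. The three words and their frequencies are the ones used on these slides; the function name is made up for the example.

```python
# (fc, rc) frequencies for the three example words from the slides
WORDS = {
    "obscurish": (10, 0),
    "middling": (200, 100),
    "common": (12000, 10000),
}


def simple_maths_ranking(freqs, n):
    """Rank words by (fc + n) / (rc + n), highest ratio first."""
    scores = {word: (fc + n) / (rc + n) for word, (fc, rc) in freqs.items()}
    return sorted(scores, key=scores.get, reverse=True)


for n in (1, 100, 1000):
    print(n, simple_maths_ranking(WORDS, n))
# 1 ['obscurish', 'middling', 'common']
# 100 ['middling', 'common', 'obscurish']
# 1000 ['common', 'middling', 'obscurish']
```

A small n puts rare words at the top and a large n puts common words there, so n works as the slider asked for earlier.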
 
 
But what about
 
Mutual information
Log-likelihood
Chi-square
Fisher’s test
Don’t they use cleverer maths?
 
 
Yes but
 
Clever maths is for hypothesis testing
Can you defeat null hypothesis?
Language is not random, so
  … you always can
Null hypothesis never true
Hypothesis-testing not informative
Clever maths irrelevant
Kilgarriff 2006, CLLT
 
 
Moreover…
 
just one answer
grammar words vs content words?
does not help
confuses and obscures
 
 
Example
 
BAWE
British Academic Written English
Nesi and Thompson, completed last year
Student essays
Arts/Humanities, Social Sciences, Life Sciences, Physical Sciences
 fc: ArtsHum, rc: SocSci
With n=10 and n=1000
 
 
 
Quantitative
 
Methods, evaluation
Kilgarriff 2001, Comparing Corpora, Int J Corp Ling
Then:
not many corpora to compare
Now:
Many
Ad hoc, from web
First question: is it any good, how does it compare
Let’s make it easy: offer it in Sketch Engine
 
A digital native
 
who still likes
books and crayons
 
 
 
 
 
Thank you
 
 
Original method
 
C1 and C2:
Same size, by design
Put together, find 500 highest freq words
For each of these words
Freqs: f1 in C1, f2 in C2, mean=(f1+f2)/2
(f1-f2)^2/mean  (chi-square statistic)
Sum
Divide by 500: CBDF
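
The steps above can be sketched as follows. This is a reading of the slide, not the code behind the 2001 paper: the function name and the toy counts are hypothetical, and the per-word term is taken exactly as written on the slide.

```python
from collections import Counter


def cbdf(c1, c2, top_n=500):
    """Corpus distance for two same-size corpora (word -> raw count).

    Pool the corpora, take the top_n most frequent words, sum
    (f1 - f2)^2 / mean over them and divide by the number of words used.
    """
    combined = Counter(c1) + Counter(c2)
    words = [word for word, _ in combined.most_common(top_n)]
    if not words:
        raise ValueError("both corpora are empty")
    total = 0.0
    for word in words:
        f1, f2 = c1.get(word, 0), c2.get(word, 0)
        mean = (f1 + f2) / 2  # positive, since the word occurs in the pooled corpus
        total += (f1 - f2) ** 2 / mean
    # Equals "divide by 500" whenever at least 500 words are available.
    return total / len(words)


# Hypothetical toy corpora
same = {"a": 10, "b": 20}
left = {"a": 30, "b": 10}
right = {"a": 10, "b": 10}
print(cbdf(same, same, top_n=2))   # 0.0
print(cbdf(left, right, top_n=2))  # 10.0
```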
 
Evaluated
 
Known-similarity corpora
Shows it worked
Used to set parameter (500)
CBDF better than alternative measures tested
 
Adjustments for SkE
 
Problem: non-identical tokenisation
Some awkward words: can’t
undermine stats as one corpus has zero
Solution
commonest 5000 words in each corpus
intersection only
commonest 500 in intersection
 
Adjustments for SkE
 
Corpus size highly variable
Chi-square not so dependable
Also not consistent with our keyword lists
Link to keyword lists – link quant to qual
Keyword lists
nf = normalised (per million) frequencies
Keyword lists: (nf1+k)/(nf2+k)
Default value for k=100
We use: if nf1>nf2, (nf1+k)/(nf2+k), else (nf2+k)/(nf1+k)
Evaluated on Known-Sim Corpora
 as good as/better than chi-square
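
A sketch of the adjusted measure, combining both "Adjustments for SkE" slides. The slides give the word selection and the per-word score but not how the scores are combined or how the intersection is ranked; averaging the scores and ranking by summed per-million frequency are assumptions made here, as are all function names.

```python
def normalised(freqs):
    """Raw counts -> per-million frequencies (nf)."""
    total = sum(freqs.values())
    return {word: f * 1_000_000 / total for word, f in freqs.items()}


def commonest(freqs, n):
    """The n most frequent words of one corpus, as a set."""
    return set(sorted(freqs, key=freqs.get, reverse=True)[:n])


def corpus_distance(c1, c2, k=100.0, pool=5000, top_n=500):
    """Mean symmetric keyword ratio over words common to both corpora."""
    nf1, nf2 = normalised(c1), normalised(c2)
    # Only words among the commonest `pool` in BOTH corpora are used, so
    # tokenisation mismatches such as "can't" cannot contribute zeros.
    shared = commonest(c1, pool) & commonest(c2, pool)
    if not shared:
        raise ValueError("no shared words among the commonest items")
    # Assumption: rank the intersection by summed per-million frequency.
    words = sorted(shared, key=lambda w: nf1[w] + nf2[w], reverse=True)[:top_n]
    ratios = []
    for word in words:
        high = max(nf1[word], nf2[word])
        low = min(nf1[word], nf2[word])
        ratios.append((high + k) / (low + k))  # always >= 1
    # Assumption: the per-word ratios are averaged.
    return sum(ratios) / len(ratios)


# Hypothetical toy corpora
x = {"a": 500, "b": 500}
y = {"a": 750, "b": 250}
print(corpus_distance(x, x))  # 1.0 for identical corpora
print(corpus_distance(x, y))  # above 1.0
```

Because the larger normalised frequency always goes on top, the score is the same whichever corpus is called C1.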
 
 
 
 
  
Simple maths for keywords
This word is twice as common in this text type as that

              N     freq   Freq per m
Focus corp    2m    80     40
Ref corp      15m   300    20

Ratio: 2
 
Intuitive
Nearly right but:
How well matched are corpora
Not here
Burstiness
Not here
Can’t divide by zero
Commoner vs. rarer words
 
What’s missing
 
Heterogeneity
“how similar is BNC to WSJ”?
We need to know heterogeneity before we can interpret
The leading diagonal
2001 paper: randomising halves
Inelegant and inefficient
Depended on standard size of document
 
New definition, method (Pavel)
 
Heterogeneity (def)
Distance between most different partitions
Cluster to find ‘most different partitions’
Bottom-up clustering
until largest cluster has over one third of data
Rest: the other partition
Problem
nxn distance matrix where n > 1 million
Solution: do it in steps
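
A toy sketch of the definition above, to make the stopping rule concrete. Everything beyond the slide is an assumption: documents are represented as small numeric feature vectors, clusters are compared by Euclidean distance between size-weighted centroids, and the naive pairwise search ignores the scaling problem the slide mentions.

```python
import math


def heterogeneity(docs, sizes):
    """docs: list of feature vectors; sizes: token count of each document."""
    clusters = [[i] for i in range(len(docs))]
    total = sum(sizes)
    dims = len(docs[0])

    def size(cluster):
        return sum(sizes[i] for i in cluster)

    def centroid(cluster):
        weight = size(cluster)
        return [sum(docs[i][d] * sizes[i] for i in cluster) / weight
                for d in range(dims)]

    # Bottom-up clustering: merge the closest pair until the largest
    # cluster holds more than one third of the data.
    while len(clusters) > 1 and max(size(c) for c in clusters) <= total / 3:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Largest cluster is one partition; everything else is the other.
    largest = max(clusters, key=size)
    rest = [i for c in clusters if c is not largest for i in c]
    if not rest:
        return 0.0
    return math.dist(centroid(largest), centroid(rest))


# Hypothetical corpus: six equal-size documents in two obvious groups.
docs = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
sizes = [1, 1, 1, 1, 1, 1]
print(heterogeneity(docs, sizes))  # close to 10
```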
 
Summary
 
Extrinsic evaluation
Method defined
Big project for coming months
Corpus comparison
Qualitative: use keywords
Quantitative
On beta
Heterogeneity (to complete the task) to follow (soon)
 
 
 
you should understand the maths you use
 
 
The Sketch Engine
 
Leading corpus query tool
Widely used by dictionary publishers, at universities
Large corpora for many languages available
Word sketches
Web service
Since last week:
Implements SimpleMaths
 
 
 
 
 
Thank you
 
http://www.sketchengine.co.uk
 
 
 
 
Language is never ever ever random
 




