Unveiling the Feed Corpus: A Comprehensive Study

Feed Corpus: An Ever Growing Up to Date Corpus
Akshay Minocha, Siva Reddy, Adam Kilgarriff
Lexical Computing Ltd
The Feed Corpus is built to study language change over months and years. Feeds are discovered through Twitter keyword searches, validated, scheduled by priority, and crawled. The collected text is then de-duplicated, cleaned, and linguistically processed to build an ever-growing, up-to-date corpus.

  • Corpus Study
  • Linguistic Processing
  • Language Change
  • Feed Discovery
  • Data Validation

Uploaded on Sep 18, 2024



Presentation Transcript


  1. Feed Corpus: An Ever Growing Up to Date Corpus
     Akshay Minocha, Siva Reddy, Adam Kilgarriff
     Lexical Computing Ltd

  2. Introduction
       • Study language change
           o over months, years
       • Most web pages
           o no info about when written
       • Feeds
           o written then posted
       • Same feeds over time
           o we hope: identical genre mix; only factor that changes is time

  3. Method
       • Feed Discovery
       • Feed Validation
       • Feed Scheduler
       • Feed Crawler
       • Cleaning, de-duplication, Linguistic Processing

  4. Feed Discovery via Twitter
       • Tweets often contain links for posts on feeds
           o bloggers, newswires often tweet "see my new post at http..."
       • Twitter keyword searches
           o News, business, arts, games, regional, science, shopping, society, etc.
           o Ignore retweets
           o Every 15 minutes
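As a minimal sketch of the filtering side of this discovery step, the Python below drops retweets and pulls candidate links out of tweet text. The function names, the "RT @" prefix heuristic, and the URL pattern are assumptions made for illustration and are not given on the slide. Running the keyword searches every 15 minutes is left to the caller.

```python
import re

# Assumed pattern for links embedded in tweet text (not from the slide).
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def is_retweet(text):
    """Heuristic assumption: classic retweets start with 'RT @'."""
    return text.lstrip().startswith("RT @")


def candidate_links(tweets):
    """Return unique links from non-retweet tweets, in first-seen order."""
    seen = set()
    links = []
    for text in tweets:
        if is_retweet(text):
            continue  # the slide says retweets are ignored
        for url in URL_RE.findall(text):
            url = url.rstrip(".,;:!?)")  # trim trailing sentence punctuation
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links
```

Each returned link is only a candidate: it still has to pass feed validation before it is scheduled.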

  5. Sample Search
       • Aim - To make the most out of the search results
       • https://twitter.com/search?q=news%20source%3Atwitterfeed%20filter%3Alinks&lang=en&include_entities=1&rpp=100
       • Query - News
       • Source - twitterfeed
       • Filter - Links (so that every returned tweet contains a link)
       • Language - en (English)
       • Include Entities - Info like geo, user, etc.
       • rpp - results per page (maximum 100)
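The sample URL can be assembled from its parts with the standard library, which makes it easy to swap in other keywords. This is a sketch: the endpoint and parameter names are copied from the slide and reflect Twitter search as it was in 2013, so they may no longer work. The function name `build_search_url` is hypothetical.

```python
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://twitter.com/search"  # as shown on the slide


def build_search_url(keyword, source="twitterfeed", lang="en", rpp=100):
    """Build a search URL in the same shape as the slide's example."""
    params = {
        "q": f"{keyword} source:{source} filter:links",
        "lang": lang,
        "include_entities": 1,
        "rpp": rpp,  # results per page; the slide gives 100 as the maximum
    }
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return SEARCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)
```

Calling `build_search_url("news")` reproduces the URL on the slide; other topic keywords (business, arts, games, and so on) can be passed in the same way.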

  6. Feed Validation
       • Does the link lead directly to a feed?
           o does metadata contain type=application/rss+xml or type=application/atom+xml
       • If yes, good
       • If no
           o search for a feed in domain of the link
           o if no, search for feed in (one_step_from_domain)
       • If still no
           o link is blacklisted
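The validation cascade can be sketched as follows. The slide does not say where the "metadata" lives, so this sketch assumes it means `<link>` elements in the page's HTML. The `fetch` callable (returning HTML text or `None`) and the names `advertised_feeds` and `validate` are hypothetical. The "one step from domain" stage is omitted because the slide does not define it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect hrefs of <link> tags whose type marks an RSS or Atom feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        feed_type = (attrs.get("type") or "").lower()
        if feed_type in FEED_TYPES and attrs.get("href"):
            # Resolve relative hrefs against the page they came from.
            self.feeds.append(urljoin(self.base_url, attrs["href"]))


def advertised_feeds(html, base_url):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


def validate(link, fetch, blacklist):
    """Try the linked page, then the domain root; blacklist on failure."""
    parts = urlsplit(link)
    root = f"{parts.scheme}://{parts.netloc}/"
    for page in (link, root):
        html = fetch(page)
        feeds = advertised_feeds(html, page) if html else []
        if feeds:
            return feeds[0]
    blacklist.add(link)
    return None
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and makes the cascade easy to test with canned pages.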

  7. Scheduling
       • Inputs
           o Frequency of update: average over last ten feeds
           o Yield Rate: ratio, raw data input to 'good text' output, as in Spiderling (Suchomel and Pomikalek 2012)
       • Output
           o priority level for checking the feed
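The slide names the two inputs and the output but not the formula that combines them. The sketch below is one plausible combination, stated as an assumption: priority is the yield rate divided by the mean gap between the last ten updates, so feeds that update often and produce mostly usable text are checked first. Here the yield rate is taken as clean output size over raw input size. All function names are hypothetical.

```python
def mean_update_interval(timestamps, window=10):
    """Mean gap (in the timestamps' own units) over the last `window` updates."""
    recent = sorted(timestamps)[-window:]
    if len(recent) < 2:
        return None  # not enough history to estimate a frequency
    gaps = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return sum(gaps) / len(gaps)


def yield_rate(raw_size, clean_size):
    """Share of downloaded data that survives as 'good text'."""
    return clean_size / raw_size if raw_size else 0.0


def priority(timestamps, raw_size, clean_size):
    """Assumed scoring: good-text share per unit of time between updates."""
    interval = mean_update_interval(timestamps)
    if not interval:
        return 0.0  # no usable interval estimate, lowest priority
    return yield_rate(raw_size, clean_size) / interval
```

With equal yield rates, a feed updated hourly scores 24 times higher than one updated daily, which matches the intuition that it should be revisited more often.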

  8. Feed Crawler
       • Visit feed at top of queue
       • Is there new content?
           o If yes: is it already in corpus? (Onion: Pomikalek)
           o If not already in corpus: clean up (JusText: Pomikalek), then add to corpus
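The crawler loop can be sketched with a priority queue. The slide uses Onion for duplicate detection and JusText for clean-up. To stay self-contained, this sketch swaps in simple stand-ins: an exact-match hash of normalised text in place of Onion, and a caller-supplied `clean` function in place of JusText. The `fetch_entries` callable and all other names are hypothetical.

```python
import hashlib
import heapq


def fingerprint(text):
    """Stand-in for Onion: hash of lower-cased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def crawl(queue, fetch_entries, clean, corpus, seen):
    """Drain a heap of (negated priority, feed URL) pairs into the corpus."""
    while queue:
        _, feed_url = heapq.heappop(queue)  # feed at the top of the queue
        for raw in fetch_entries(feed_url):
            key = fingerprint(raw)
            if key in seen:
                continue  # already in corpus
            seen.add(key)
            text = clean(raw)  # stand-in for JusText clean-up
            if text:
                corpus.append(text)
```

Priorities are negated because `heapq` pops the smallest item first. A production system would re-queue each feed after visiting it, which this sketch leaves out.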

  9. Prepare for analysis
       • Lemmatise, POS-tag
       • Load into Sketch Engine

  10. Initial run: Feb-March 2013
       • Raw: 1.36 billion English words
       • 300 million words after deduplication, cleaning
       • 150,000+ feeds
       • Delivered to CUP
       • Keep their corpus up-to-date
       • Keywords vs enTenTen12
           o [a-z]{3,}
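The slide does not say how the keywords were scored, only that the comparison was against enTenTen12 and involved the pattern `[a-z]{3,}`. The sketch below assumes the pattern filters candidate words (lower-case, at least three letters) and uses a smoothed ratio of per-million frequencies as the keyness score, which is one common choice. The function names and the smoothing constant are assumptions.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]{3,}")  # pattern from the slide


def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def keywords(focus_tokens, ref_tokens, smoothing=1.0, top=10):
    """Rank focus-corpus words by smoothed frequency ratio vs the reference."""
    focus = per_million(Counter(focus_tokens))
    ref = per_million(Counter(ref_tokens))
    scores = {
        word: (freq + smoothing) / (ref.get(word, 0.0) + smoothing)
        for word, freq in focus.items()
        if WORD_RE.fullmatch(word)  # keep only candidates matching [a-z]{3,}
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda word: (-scores[word], word))[:top]
```

The smoothing constant keeps words that are absent from the reference corpus from dividing by zero, and raising it shifts the ranking toward more frequent words.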

  11. An earlier version
       • maintenance

  12. Future Work
       • MAINTAIN
       • Include "Category Tags"
       • Other languages
           o Collection started now
           o Identification by langid.py (Lui and Baldwin 2012)
       • "No-typo" material
           o copy-edited subset, so newspapers, business: yes; personal blogs: no
           o method: manual classification of 100 highest-volume feeds
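For the multilingual extension, documents would need to be routed to per-language corpora. The sketch below groups documents by the label a classifier returns. The classifier is passed in as a parameter so the example runs without extra dependencies, and the function name `group_by_language` and the `min_score` threshold are hypothetical.

```python
from collections import defaultdict


def group_by_language(docs, classify, min_score=None):
    """Group documents by the language code returned by `classify`.

    `classify` must return a (language_code, score) pair for a text.
    Documents scoring below `min_score` (if given) go under 'unknown'.
    """
    groups = defaultdict(list)
    for doc in docs:
        lang, score = classify(doc)
        if min_score is not None and score < min_score:
            lang = "unknown"
        groups[lang].append(doc)
    return dict(groups)
```

With langid.py installed, `langid.classify` can be passed as `classify`, since it returns a language code together with a score. How to pick a sensible `min_score` depends on how those scores are scaled and is left open here.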

  13. Thank You
       http://www.sketchengine.co.uk
