Unveiling the Feed Corpus: A Comprehensive Study
Explore how the Feed Corpus tackles the challenge of monitoring language evolution over time by discovering feeds via Twitter, then validating, scheduling, and crawling them. The methodology adds cleaning, de-duplication, and linguistic processing to build an ever-growing, up-to-date corpus.
Presentation Transcript
Feed Corpus: An Ever-Growing, Up-to-Date Corpus
Akshay Minocha, Siva Reddy, Adam Kilgarriff (Lexical Computing Ltd)
Introduction
- Study language change over months and years.
- Most web pages carry no information about when they were written.
- Feeds are written and then posted, so the posting date tells us when the text was written.
- We follow the same feeds over time: we hope the genre mix stays identical, so that the only factor that changes is time.
Method
- Feed Discovery
- Feed Validation
- Feed Scheduler
- Feed Crawler
- Cleaning, de-duplication, linguistic processing
Feed Discovery via Twitter
- Tweets often contain links to posts on feeds: bloggers and newswires often tweet "see my new post at http...".
- Twitter keyword searches: news, business, arts, games, regional, science, shopping, society, etc.
- Retweets are ignored.
- Searches run every 15 minutes.
Sample Search
Aim: to make the most of the search results.
https://twitter.com/search?q=news%20source%3Atwitterfeed%20filter%3Alinks&lang=en&include_entities=1&rpp=100
- Query: news
- Source: twitterfeed
- Filter: links (only tweets that contain links)
- Language: en (English)
- Include entities: extra information such as geo, user, etc.
- rpp: results per page (maximum 100)
A sketch of issuing such a search appears below.
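One way such a search could be issued and mined for links, as a minimal sketch: it assumes the historical Twitter Search API that returned JSON (the endpoint, response field names, and polling wrapper are illustrative assumptions, not the authors' code).

```python
import time
import requests  # assumed HTTP client

# Historical v1 search endpoint; illustrative, since retired by Twitter.
SEARCH_URL = "https://search.twitter.com/search.json"

def discover_feed_links(keyword):
    """Search tweets containing links for one keyword; skip retweets."""
    params = {
        "q": f"{keyword} source:twitterfeed filter:links",
        "lang": "en",
        "include_entities": 1,
        "rpp": 100,  # results per page, maximum 100
    }
    tweets = requests.get(SEARCH_URL, params=params).json().get("results", [])
    links = []
    for tweet in tweets:
        if tweet.get("text", "").startswith("RT"):  # ignore retweets
            continue
        for url in tweet.get("entities", {}).get("urls", []):
            links.append(url.get("expanded_url") or url.get("url"))
    return links

if __name__ == "__main__":
    keywords = ["news", "business", "arts", "science"]  # one per category
    while True:
        for kw in keywords:
            print(discover_feed_links(kw))
        time.sleep(15 * 60)  # repeat every 15 minutes
```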
Feed Validation
Does the link lead directly to a feed?
- Check whether the page metadata contains type="application/rss+xml" or type="application/atom+xml".
- If yes: good.
- If no: search for a feed in the domain of the link; failing that, search for a feed one step from the domain.
- If there is still no feed: the link is blacklisted.
A sketch of this check follows.
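A minimal sketch of the metadata check, assuming feeds are advertised through <link> elements in the page head; FeedLinkFinder, find_feeds, and validate are hypothetical names, and the one-step-from-domain search is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import requests  # assumed HTTP client

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link> tags whose type attribute declares a feed."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("type") in FEED_TYPES:
            self.feeds.append(attrs.get("href"))

def find_feeds(url):
    finder = FeedLinkFinder()
    finder.feed(requests.get(url).text)
    return finder.feeds

def validate(url, blacklist):
    """Return a feed URL for this link, or blacklist the link."""
    feeds = find_feeds(url)
    if not feeds:
        domain = "{0.scheme}://{0.netloc}/".format(urlparse(url))
        feeds = find_feeds(domain)
        # (searching one step from the domain is omitted in this sketch)
    if feeds:
        return feeds[0]
    blacklist.add(url)
    return None
```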
Scheduling
Inputs:
- Frequency of update, averaged over the feed's last ten posts.
- Yield rate: the ratio of raw data input to 'good text' output, as in SpiderLing (Suchomel and Pomikálek 2012).
Output:
- A priority level for checking the feed.
One plausible scoring rule is sketched below.
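The slides give the inputs but not a formula, so this scoring rule is purely an illustrative assumption: frequently updated, high-yield feeds get checked first.

```python
import heapq
import time

def priority(update_interval_secs, yield_rate):
    """Lower score = checked sooner (illustrative formula, not the authors')."""
    return update_interval_secs / max(yield_rate, 1e-6)

queue = []  # min-heap of (score, next_check_time, feed_url)

def schedule(feed_url, update_interval_secs, yield_rate):
    score = priority(update_interval_secs, yield_rate)
    heapq.heappush(queue, (score, time.time() + update_interval_secs, feed_url))
```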
Feed Crawler
- Visit the feed at the top of the queue.
- Is there new content?
- If yes: is it already in the corpus? (De-duplication with Onion; Pomikálek.)
- If not: clean it up (jusText; Pomikálek) and add it to the corpus.
A sketch of the loop follows.
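A sketch of one iteration of that loop; fetch, is_duplicate, and clean are hypothetical stand-ins for the feed reader, Onion, and jusText respectively.

```python
import heapq

def crawl_once(queue, corpus, fetch, is_duplicate, clean):
    """Pop the highest-priority feed and keep any new, non-duplicate text."""
    _score, _due, feed_url = heapq.heappop(queue)
    for item in fetch(feed_url):        # new items since the last visit
        if is_duplicate(item, corpus):  # Onion-style de-duplication
            continue
        corpus.append(clean(item))      # jusText-style boilerplate removal
```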
Prepare for Analysis
- Lemmatise and POS-tag.
- Load into Sketch Engine.
Initial Run: February-March 2013
- Raw: 1.36 billion English words.
- 300 million words after de-duplication and cleaning.
- 150,000+ feeds.
- Delivered to CUP, to keep their corpus up to date.
- Keyword comparison vs. enTenTen12, restricted to tokens matching [a-z]{3,}; see the example below.
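The [a-z]{3,} restriction keeps only lowercase alphabetic tokens of three or more letters; a one-line illustration (the token list is made up):

```python
import re

tokens = ["the", "NLP", "up", "corpus", "2013"]
keyword_candidates = [t for t in tokens if re.fullmatch(r"[a-z]{3,}", t)]
# -> ['the', 'corpus']
```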
[Slide: an earlier version of the system; maintenance]
Future Work
- Maintain the corpus.
- Include "category tags".
- Other languages: collection has started; identification by langid.py (Lui and Baldwin 2012), as shown below.
- "No-typo" material: a copy-edited subset, so newspapers and business feeds: yes; personal blogs: no. Method: manual classification of the 100 highest-volume feeds.
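langid.py classifies a string in one call; the sample sentence here is made up.

```python
import langid  # pip install langid

# classify() returns a (language_code, score) pair
lang, score = langid.classify("Dies ist ein deutscher Satz.")
print(lang)  # 'de'
```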
Thank You http://www.sketchengine.co.uk