Algorithms and Abstractions for Phrase Finding Task

Algorithms and Abstractions for Stream-and-Sort

Explore the implementation of the phrase finding task using algorithms based on BLRT and pointwise KL-divergence. Learn about statistics, including naïve Bayes training and classification, inverted indices, and more. Guest lecture by Manik Varma from MSR India.

  • Algorithms
  • Abstractions
  • Phrase Finding
  • BLRT
  • Statistics


Presentation Transcript


  1. Algorithms and Abstractions for Stream-and-Sort

  2. Announcements Thursday: there will be a short quiz. The quiz will close at midnight Thursday, but it is probably best to do it soon after class.

  3. Recap Last week: algorithms that use only stream-and-sort primitives: naïve Bayes training (event counting); inverted indices, preprocessing for distributional clustering, and naïve Bayes classification (with event counts out of memory). The phrase finding task: score all the phrases in a corpus for phrasiness and informativeness, using statistics based on BLRT and pointwise KL-divergence.

  4. Today The phrase finding task, now the implementation: score all the phrases in a corpus for phrasiness and informativeness, using statistics based on BLRT and pointwise KL-divergence; the phrase finding algorithm. Abstractions for Map-Reduce. Guest lecture: Manik Varma, MSR India.

  5. Phrase Finding William W. Cohen

  6. Phraseness1 based on BLRT Define p_i = k_i / n_i, p = (k1 + k2) / (n1 + n2), and L(p, k, n) = p^k (1 - p)^(n - k). Then

    φ_p(n1, k1, n2, k2) = 2 log [ L(p1, k1, n1) L(p2, k2, n2) / ( L(p, k1, n1) L(p, k2, n2) ) ]

  For a phrase "x y" (the event W1=x ^ W2=y), the counts are:

    k1 = C(W1=x ^ W2=y): how often the bigram x y occurs in corpus C
    n1 = C(W1=x): how often the word x occurs in corpus C
    k2 = C(W1≠x ^ W2=y): how often y occurs in C after a non-x
    n2 = C(W1≠x): how often a non-x occurs in C

  The test asks: does y occur at the same frequency after x as in other positions?
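
A minimal Python sketch of this statistic, computed in the log domain for numerical stability (the helper names log_L and blrt_score are mine; the slide gives only the ratio form):

    import math

    def log_L(p, k, n):
        # log of the binomial likelihood L(p, k, n) = p^k * (1 - p)^(n - k)
        if p <= 0.0 or p >= 1.0:
            # degenerate p: the likelihood is 1 only if the counts agree exactly
            exact = (p <= 0.0 and k == 0) or (p >= 1.0 and k == n)
            return 0.0 if exact else float("-inf")
        return k * math.log(p) + (n - k) * math.log(1.0 - p)

    def blrt_score(k1, n1, k2, n2):
        # 2 log [ L(p1,k1,n1) L(p2,k2,n2) / ( L(p,k1,n1) L(p,k2,n2) ) ]
        p1, p2 = k1 / n1, k2 / n2
        p = (k1 + k2) / (n1 + n2)
        return 2.0 * (log_L(p1, k1, n1) + log_L(p2, k2, n2)
                      - log_L(p, k1, n1) - log_L(p, k2, n2))

    # Phraseness of bigram "x y", all counts from corpus C:
    #   k1 = C(W1=x ^ W2=y), n1 = C(W1=x)
    #   k2 = C(W1!=x ^ W2=y), n2 = C(W1!=x)
    print(blrt_score(k1=100, n1=150, k2=300, n2=15000))  # toy counts, for illustration only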

  7. Informativeness1 based on BLRT Define p_i = k_i / n_i, p = (k1 + k2) / (n1 + n2), and L(p, k, n) = p^k (1 - p)^(n - k). Then

    φ_i(n1, k1, n2, k2) = 2 log [ L(p1, k1, n1) L(p2, k2, n2) / ( L(p, k1, n1) L(p, k2, n2) ) ]

  For a phrase "x y" (the event W1=x ^ W2=y) and two corpora, foreground C and background B:

    k1 = C(W1=x ^ W2=y): how often the bigram x y occurs in corpus C
    n1 = C(W1=* ^ W2=*): how many bigrams there are in corpus C
    k2 = B(W1=x ^ W2=y): how often x y occurs in the background corpus
    n2 = B(W1=* ^ W2=*): how many bigrams there are in the background corpus

  The test asks: does x y occur at the same frequency in both corpora?
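
Assuming the blrt_score sketch above, informativeness is the same statistic applied to cross-corpus counts; a hypothetical wrapper:

    # Informativeness of bigram "x y", using foreground C and background B:
    #   k1 = C(W1=x ^ W2=y), n1 = C(W1=* ^ W2=*)  (bigram count, total bigrams in C)
    #   k2 = B(W1=x ^ W2=y), n2 = B(W1=* ^ W2=*)  (the same counts in B)
    def informativeness(c_xy, c_total_bigrams, b_xy, b_total_bigrams):
        return blrt_score(c_xy, c_total_bigrams, b_xy, b_total_bigrams)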

  8. The breakdown: what makes a good phrase To compare distributions, use KL-divergence. Phraseness: the difference between the bigram and unigram language models on the foreground corpus, measured by pointwise KL divergence: δ_w(p||q) = p(w) log ( p(w) / q(w) ). Bigram model: P(x y) = P(x) P(y|x). Unigram model: P(x y) = P(x) P(y).

  9. The breakdown: what makes a good phrase To compare distributions, use KL-divergence. Informativeness: the difference between the foreground and background models, measured by pointwise KL divergence (here w is the candidate phrase being scored). Bigram model: P(x y) = P(x) P(y|x). Unigram model: P(x y) = P(x) P(y).

  10. The breakdown: what makes a good phrase To compare distributions, use KL-divergence. Combined: the difference between the foreground bigram model and the background unigram model, measured by pointwise KL divergence. Bigram model: P(x y) = P(x) P(y|x). Unigram model: P(x y) = P(x) P(y).
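
A minimal sketch of the three pointwise-KL scores, assuming the bigram and unigram probabilities have already been estimated from counts (the function and argument names are mine):

    import math

    def pointwise_kl(p_w, q_w):
        # delta_w(p || q) = p(w) * log( p(w) / q(w) )
        return p_w * math.log(p_w / q_w)

    def phrase_scores(fg_xy, fg_x, fg_y, bg_xy, bg_x, bg_y):
        # fg_* are foreground-model probabilities, bg_* background-model ones
        phraseness = pointwise_kl(fg_xy, fg_x * fg_y)   # fg bigram vs. fg unigram
        informativeness = pointwise_kl(fg_xy, bg_xy)    # foreground vs. background
        combined = pointwise_kl(fg_xy, bg_x * bg_y)     # fg bigram vs. bg unigram
        return phraseness, informativeness, combined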

  11. Implementation Request-and-answer pattern. Main data structure: tables of key-value pairs, where the key is a phrase x y and the value is a mapping from attribute names (like phraseness, freq-in-B, ...) to numeric values. Keys and values are just strings. We'll operate mostly by sending messages to this data structure and getting results back, or else by streaming through the whole table. For really big data, we'd also need tables where the key is a word and the value is a set of attributes of the word (freq-in-B, freq-in-C, ...).

  12. Generating and scoring phrases: 1 Stream through the foreground corpus and count events W1=x ^ W2=y the same way we do in training naïve Bayes: stream-and-sort and accumulate deltas (a sum-reduce). Don't bother generating boring phrases (e.g., ones that cross a sentence boundary or contain a stopword). Then stream through the output and convert it to phrase, attributes-of-phrase records with one attribute: freq-in-C=n. Also stream through the foreground corpus and count events W1=x in a (memory-based) hashtable. This is enough* to compute phrasiness: p(x y) = f(freq-in-C(x), freq-in-C(y), freq-in-C(x y)), so you can do that with a scan through the phrase table that adds an extra attribute (holding the word frequencies in memory). *Actually you also need the total number of words and the total number of phrases. (A stream-and-sort sketch of the counting step follows below.)
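
A minimal stream-and-sort sketch of this counting step; the tab-separated record layout and the stopword filtering details are my assumptions, not the slides':

    from itertools import groupby

    def map_bigrams(lines, stopwords=frozenset()):
        # Emit one "x y<TAB>1" delta per bigram, skipping boring phrases
        # (here just stopwords; sentence-boundary filtering omitted).
        for line in lines:
            toks = line.strip().lower().split()
            for x, y in zip(toks, toks[1:]):
                if x not in stopwords and y not in stopwords:
                    print(f"{x} {y}\t1")

    def reduce_counts(sorted_lines):
        # After an external sort, sum the deltas for each bigram and emit one
        # phrase record with a single attribute: freq-in-C=n.
        pairs = (l.rstrip("\n").split("\t") for l in sorted_lines)
        for phrase, grp in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{phrase}\tfreq-in-C={sum(int(v) for _, v in grp)}")

    # Pipeline shape: map_bigrams(stdin) | sort | reduce_counts(stdin)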

  13. Generating and scoring phrases: 2 Stream through the background corpus, count events W1=x ^ W2=y, and convert the counts to phrase, attributes-of-phrase records with one attribute: freq-in-B=n. Sort the two phrase tables (freq-in-B and freq-in-C) together and run the output through another reducer that appends together all the attributes associated with the same key, so we now have records carrying both frequencies.

  14. Generating and scoring phrases: 3 Scan through the phrase table one more time and add the informativeness attribute and the overall quality attribute. Summary, assuming the word vocabulary nW is small:
    Scan foreground corpus C for phrases: O(nC), producing mC phrase records (of course mC << nC)
    Compute phrasiness: O(mC)
    Scan background corpus B for phrases: O(nB), producing mB phrase records
    Sort together and combine records: O(m log m), m = mB + mC
    Compute informativeness and combined quality: O(m)
  Assumes the word counts fit in memory.

  15. Recap Is there a stream-and-sort analog of this request-and-answer pattern? We have a record of all event counts for each word (aardvark: C[w^Y=sports]=2; agent: C[w^Y=sports]=1027, C[w^Y=worldNews]=...; zynga: C[w^Y=sports]=21, C[w^Y=worldNews]=...) and test data (id1: "found an aardvark in zynga's farmville today!"; id2; id3; ...). In the request-and-answer version, the classification logic asks for the counters it needs; in the stream-and-sort version, each test document is turned into counter-request records (found ~ctr to id1; aardvark ~ctr to id1; ...; today ~ctr to idi), and the requests are combined and sorted in with the count records.

  16. Recap A stream-and-sort analog of the request-and-answer pattern: merge the word-count records (aardvark C[w^Y=sports]=2; agent C[w^Y=sports]=...; zynga C[...]) with the counter requests (aardvark ~ctr to id1; agent ~ctr to id345, ~ctr to id9854, ~ctr to id34742; zynga ~ctr to id1; ...), combine and sort, and hand the merged stream to the request-handling logic.

  17. Recap The request-handling logic is a single scan over the merged, sorted stream:

    previousKey = somethingImpossible
    for each (key, val) in input:
        if key == previousKey:
            Answer(recordForPrevKey, val)
        else:
            previousKey = key
            recordForPrevKey = val

    define Answer(record, request):
        find id where request = "~ctr to id"
        print "id ~ctr for <key> is record"

  Because the counter record for a word sorts before the ~ctr requests for it, each request is answered from the record seen just before it.

  18. Recap The same scan, with its output: running the request-handling logic over the merged stream (aardvark C[w^Y=sports]=2; aardvark ~ctr to id1; agent ~ctr to id345, ~ctr to id9854, ...; zynga ~ctr to id1; ...) produces one answer line per request, e.g.

    id1 ~ctr for aardvark is C[w^Y=sports]=2
    id1 ~ctr for zynga is ...

  (a runnable version of this scan follows below).
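
A runnable sketch of that scan, assuming tab-separated key-value lines in which each word's counter record sorts before the ~ctr requests for that word (as it does here, since "C" sorts before "~"):

    def answer_requests(sorted_lines):
        previous_key, record_for_prev_key = None, None
        for line in sorted_lines:
            key, val = line.rstrip("\n").split("\t", 1)
            if val.startswith("~ctr to "):
                # a request: answer it from the last record seen for this key
                if key == previous_key:
                    doc_id = val[len("~ctr to "):]
                    print(f"{doc_id}\t~ctr for {key} is {record_for_prev_key}")
            else:
                # a counter record: remember it for the requests that follow
                previous_key, record_for_prev_key = key, val

    answer_requests(["aardvark\tC[w^Y=sports]=2", "aardvark\t~ctr to id1"])
    # -> id1    ~ctr for aardvark is C[w^Y=sports]=2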

  19. Recap The answer lines (id1 ~ctr for aardvark is C[w^Y=sports]=2; id1 ~ctr for zynga is ...; ...) are then combined and sorted back in with the original test data (id1: "found an aardvark in zynga's farmville today!"; id2; id3; ...), so each document is brought together with the counters it requested.

  20. Recap What we ended up with: one record per test document, e.g.

    Key: id1
    Value: found aardvark zynga farmville today ...
           ~ctr for aardvark is C[w^Y=sports]=2
           ~ctr for found is C[w^Y=sports]=1027, C[w^Y=worldNews]=564
           ...

  that is, each document's words (w2,1 w2,2 w2,3 ... for id2, and so on) together with all the counters needed to classify it.

  21. Ramping it up: keeping word counts out of memory Goal: records for each phrase x y with attributes freq-in-B, freq-in-C, freq-of-x-in-C, freq-of-y-in-C, ... Assume I have already built the phrase tables and word tables; how do I incorporate the word attributes into the phrase records? For each phrase x y, request the necessary word frequencies: print "x ~request=freq-in-C,from=xy" and "y ~request=freq-in-C,from=xy". Sort all the word requests in with the word tables. Scan through the result and generate the answers: for each word w with attributes a1=n1, a2=n2, ..., answer each pending request by printing a record keyed by the requesting phrase xy. Sort the answers in with the x y records, then scan through and augment the x y records appropriately. (Both scans are sketched below.)
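
A sketch of the request-generating and answer-generating scans, under assumed record layouts (phrase lines "x y<TAB>attributes", word lines "w<TAB>freq-in-C=n"):

    def emit_word_requests(phrase_lines):
        # For each phrase "x y", request the foreground frequency of each word.
        for line in phrase_lines:
            phrase = line.rstrip("\n").split("\t", 1)[0]
            x, y = phrase.split()
            print(f"{x}\t~request=freq-in-C,from={phrase}")
            print(f"{y}\t~request=freq-in-C,from={phrase}")

    def answer_word_requests(sorted_lines):
        # After sorting the requests in with the word table (word records sort
        # before "~request=..." lines), answer each request, re-keyed by the
        # requesting phrase.
        word_record = None
        for line in sorted_lines:
            key, val = line.rstrip("\n").split("\t", 1)
            if val.startswith("~request="):
                phrase = val.split("from=", 1)[1]
                print(f"{phrase}\t{key}:{word_record}")  # e.g. "x y<TAB>x:freq-in-C=3"
            else:
                word_record = val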

  22. Generating and scoring phrases: 3 Summary
    1. Scan foreground corpus C for phrases and words: O(nC), producing mC phrase records and vC word records
    2. Scan the phrase records producing word-frequency requests: O(mC), producing 2mC requests
    3. Sort the requests in with the word records: O((2mC + vC) log(2mC + vC)) = O(mC log mC), since vC < mC
    4. Scan through and answer the requests: O(mC)
    5. Sort the answers in with the phrase records: O(mC log mC)
    6. Repeat 1-5 for the background corpus: O(nB + mB log mB)
    7. Combine the two phrase tables: O(m log m), m = mB + mC
    8. Compute all the statistics: O(m)

  23. ABSTRACTIONS FOR STREAM-AND-SORT AND MAP-REDUCE

  24. Abstractions On Top Of Map-Reduce We've decomposed some algorithms into a map-reduce workflow (a series of map-reduce steps): naïve Bayes training, naïve Bayes testing, and phrase scoring. How else can we express these sorts of computations? Are there some common special cases of map-reduce steps that we can parameterize and reuse?

  25. Abstractions On Top Of Map-Reduce Some obvious streaming processes: for each row in a table, transform it and output the result. Example: stem words in a stream of word-count pairs: ("aardvarks", 1) → ("aardvark", 1). Proposed syntax: table2 = MAP table1 TO λ row : f(row), where f(row) → row. Or: decide whether to keep each row with some boolean test, and copy out only the ones that pass. Example: apply stopwords: ("aardvark", 1) → ("aardvark", 1); ("the", 1) → deleted. Proposed syntax: table2 = FILTER table1 BY λ row : f(row), where f(row) → {true, false}. (Both operators are sketched below.)
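
A minimal sketch of MAP and FILTER as Python generators over row streams; the operator names mirror the proposed syntax, and the example rows and crude "stemmer" are made up:

    def MAP(table, f):
        # table2 = MAP table1 TO lambda row : f(row)
        for row in table:
            yield f(row)

    def FILTER(table, f):
        # table2 = FILTER table1 BY lambda row : f(row)
        for row in table:
            if f(row):
                yield row

    rows = [("aardvarks", 1), ("the", 1)]
    stemmed = list(MAP(rows, lambda wc: (wc[0].rstrip("s"), wc[1])))  # crude "stemmer"
    print(stemmed)                                     # [('aardvark', 1), ('the', 1)]
    print(list(FILTER(stemmed, lambda wc: wc[0] not in {"the", "an"})))
    # -> [('aardvark', 1)]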

  26. Abstractions On Top Of Map-Reduce A (non-obvious?) streaming process: for each row in a table, transform it to a list of items, then splice all the lists together to get the output table (flatten). Proposed syntax: table2 = FLATMAP table1 TO λ row : f(row), where f(row) → list of rows. Example: tokenizing a line: "I found an aardvark" → ["i", "found", "an", "aardvark"]; "We love zymurgy" → ["we", "love", "zymurgy"]; ...but the final table is one word per row. (Sketched below.)
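
A matching sketch of FLATMAP, using the slide's tokenization example:

    def FLATMAP(table, f):
        # table2 = FLATMAP table1 TO lambda row : f(row), where f returns a
        # list of rows; the lists are spliced together into one output stream.
        for row in table:
            yield from f(row)

    lines = ["I found an aardvark", "We love zymurgy"]
    print(list(FLATMAP(lines, lambda line: line.lower().split())))
    # -> ['i', 'found', 'an', 'aardvark', 'we', 'love', 'zymurgy']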

  27. Abstractions On Top Of Map-Reduce Another example from the naïve Bayes test program.

  28. NB Test Step We have a table of event counts (X=w1^Y=sports: 5245; X=w1^Y=worldNews: 1054; X=w2^Y=...: 2120; ...). How to reorganize it, with stream and sort: for each C[X=w^Y=y]=n, print the pair (w, C[w^Y=y]=n); then sort, and build a list of the values associated with each key w, like an inverted index. The result maps each word to the counts associated with it: aardvark: C[w^Y=sports]=2; agent: C[w^Y=sports]=1027, C[w^Y=worldNews]=564; zynga: C[w^Y=sports]=21, C[w^Y=worldNews]=4464.

  29. NB Test Step The general case: we're taking rows from a table in a particular format (event, count), applying a function to get a new value (the word for the event), and grouping the rows of the table by this new value. This grouping operation is a special case of a map-reduce. Proposed syntax: GROUP table BY λ row : f(row), where f(row) → field; f could be defined via a function, a field of a defined record structure, ... Applied to the event counts (X=w1^Y=sports: 5245; X=w1^Y=worldNews: 1054; ...), it yields the word-to-counters table (aardvark: C[w^Y=sports]=2; agent: ...; zynga: ...).

  30. NB Test Step Aside: you guys know how to implement this, right?
    1. Output pairs (f(row), row) with a map/streaming process
    2. Sort the pairs by key, which is f(row)
    3. Reduce and aggregate by appending together all the values associated with the same key
  This grouping operation is a special case of a map-reduce. Proposed syntax: GROUP table BY λ row : f(row), where f(row) → field; f could be defined via a function, a field of a defined record structure, ... (the three steps are sketched below).
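
A minimal in-memory sketch of those three steps (sorted() stands in for the external disk-based sort; the example rows are made up):

    from itertools import groupby

    def GROUP(table, f):
        # Steps 1-2: key each row by f(row) and sort by that key.
        keyed = sorted(((f(row), row) for row in table), key=lambda kr: kr[0])
        # Step 3: append together all the rows that share a key.
        for key, pairs in groupby(keyed, key=lambda kr: kr[0]):
            yield key, [row for _, row in pairs]

    # Example: group (event, count) rows by the word inside the event.
    rows = [("X=w1^Y=sports", 5245), ("X=w1^Y=worldNews", 1054), ("X=w2^Y=sports", 37)]
    print(dict(GROUP(rows, lambda row: row[0].split("^")[0])))
    # -> {'X=w1': [('X=w1^Y=sports', 5245), ('X=w1^Y=worldNews', 1054)],
    #     'X=w2': [('X=w2^Y=sports', 37)]}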

  31. Abstractions On Top Of Map-Reduce And another example from the naïve Bayes test program.

  32. Request-and-answer Inputs: a record of all event counts for each word (aardvark: C[w^Y=sports]=2; agent: C[w^Y=sports]=1027, C[w^Y=worldNews]=...; zynga: C[w^Y=sports]=21, C[w^Y=worldNews]=...) and the test data (id1: w1,1 w1,2 w1,3 ... w1,k1; id2: w2,1 w2,2 w2,3 ...; id3: w3,1 w3,2 ...; ...). Step 2: stream through the test data and, for each test case idi wi,1 wi,2 wi,3 ... wi,ki, request the event counters needed to classify idi from the event-count DB, then classify using the answers.

  33. Request-and-answer Break the computation down into stages: (1) Generate the data being requested (indexed by key, here a word), e.g. with GROUP ... BY. (2) Generate the requests as (key, requestor) pairs, e.g. with FLATMAP ... TO. (3) Join these two tables by key, where a join is defined as a cross-product followed by a filter that drops pairs with different values for their keys; this replaces the step of concatenating two different tables of key-value pairs and reducing them together. (4) Postprocess the joined result. (A sort-merge sketch of the join follows below.)
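
A minimal sort-merge sketch of this join, with made-up example rows; the cross-product within each key group implements the definition above:

    from itertools import groupby

    def JOIN(table1, f, table2, g):
        # Tag each row with its key and source table, sort the union, then
        # pair up rows sharing a key (cross-product within each key group).
        tagged = [(f(r), 0, r) for r in table1] + [(g(r), 1, r) for r in table2]
        tagged.sort(key=lambda t: (t[0], t[1]))   # an external sort in practice
        for key, grp in groupby(tagged, key=lambda t: t[0]):
            rows = list(grp)
            lefts = [r for _, side, r in rows if side == 0]
            rights = [r for _, side, r in rows if side == 1]
            for l in lefts:
                for r in rights:
                    yield key, l, r

    counters = [("aardvark", "C[w^Y=sports]=2")]
    requests = [("aardvark", "~ctr to id1")]
    print(list(JOIN(counters, lambda r: r[0], requests, lambda r: r[0])))
    # -> [('aardvark', ('aardvark', 'C[w^Y=sports]=2'), ('aardvark', '~ctr to id1'))]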

  34. [Figure: the join, before and after. Left: the word-counter table (aardvark: C[w^Y=sports]=2; agent: C[w^Y=sports]=1027, C[w^Y=worldNews]=...; zynga: C[w^Y=sports]=21, C[w^Y=worldNews]=4464) and the request table (found ~ctr to id1; aardvark ~ctr to id1; zynga ~ctr to id1, ~ctr to id2). Right: the joined table, with each word's counters lined up against all the requests for that word (aardvark: C[w^Y=sports]=2 with ~ctr to id1; agent: counters with ~ctr to id345, ~ctr to id9854, ~ctr to id34742; zynga: counters with ~ctr to id1, ...).]

  35. Examples: JOIN wordInDoc BY word, wordCounters BY word (assuming word(row) is defined correctly), or, using Python syntax for the key functions, JOIN wordInDoc BY lambda (word, docid): word, wordCounters BY lambda (word, counters): word. Proposed syntax: JOIN table1 BY λ row : f(row), table2 BY λ row : g(row). [The slide repeats the before-and-after tables from the previous figure, now with each request joined to its counters.]
