Locality Sensitive Hashing (LSH) for Nearest Neighbor Queries

1. Locality sensitive hashing (LSH)
2. Nearest Neighbor
Given a set P of n points in R^d.
3. Nearest Neighbor
We want to build a data structure to answer nearest neighbor queries.
4. Voronoi Diagram
Build a Voronoi diagram and a point location data structure.
5. Curse of dimensionality
In R^2 the Voronoi diagram has size O(n) and a query takes O(log n) time.
In R^d the complexity is O(n^⌈d/2⌉).
Other techniques also scale badly with the dimension.
6. Problem has many variants
Approximate nearest neighbor (more on this soon), k nearest neighbors, all neighbors within a given distance.
7. All nearest neighbors
Finding the nearest neighbors of every point in the data (near-duplicate web pages).
8. Locality Sensitive Hashing
We will use a family of hash functions such that close points tend to hash to the same bucket.
Put all points of P in their buckets; ideally we want the query q to find its nearest neighbor in its bucket.
9. Locality Sensitive Hashing
A family H of functions is (d1 < d2, p1 > p2)-sensitive if
    d(p,q) ≤ d1  =>  Pr[h(p) = h(q)] ≥ p1
    d(p,q) ≥ d2  =>  Pr[h(p) = h(q)] ≤ p2
10. Locality sensitive family for a distance function d(p,q)
(Figure: collision probability as a function of distance, with thresholds d1 and d2.)
11. LSF for Hamming distance
Think of the points as strings of m bits and consider the Hamming distance ham(p,q).
H = {h_i(p) = the i-th bit of p} is locality sensitive with respect to ham:
    Pr[h_i(p) = h_i(q)] = 1 - ham(p,q)/m   (for h_i chosen uniformly at random)
So this family is (d1, d2, 1 - d1/m, 1 - d2/m)-sensitive.
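As an illustration of this bit-sampling family, here is a minimal Python sketch (the function names and the use of the random module are my own, not from the slides): each hash function returns one randomly chosen bit of the point, and the empirical collision rate matches 1 - ham(p,q)/m.

    import random

    def make_bit_sampler(m, seed=None):
        """Draw one h_i from H = {h_i(p) = i-th bit of p} for m-bit points."""
        i = random.Random(seed).randrange(m)   # the sampled coordinate
        return lambda p: p[i]                  # p is a sequence of m bits (0/1)

    # Empirically check Pr[h(p) = h(q)] = 1 - ham(p,q)/m on one pair of points.
    m = 16
    p = [1, 0] * 8
    q = list(p); q[0] ^= 1; q[5] ^= 1          # Hamming distance 2 from p
    hs = [make_bit_sampler(m, seed=s) for s in range(10000)]
    agree = sum(h(p) == h(q) for h in hs) / len(hs)
    print(agree, 1 - 2 / m)                    # both values are close to 0.875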
12. Jaccard distance and random permutations
Think of p and q as sets.
    jaccard(p,q) = |p ∩ q| / |p ∪ q|
    Jd(p,q) = 1 - jaccard(p,q) = 1 - |p ∩ q| / |p ∪ q|
H = {h_π(p) = the minimum, in the order given by π, of the items in p | π is a random permutation of the universe}
    Pr[h_π(p) = h_π(q)] = jaccard(p,q)
Efficiency: we need to pick π from a min-wise independent family of permutations.
This family is (d1, d2, 1 - d1, 1 - d2)-sensitive with respect to Jd.
13. Jaccard distance and minhash
Think of p and q as sets; Jd(p,q) = 1 - jaccard(p,q) = 1 - |p ∩ q| / |p ∪ q|.
H = {h_r(p) = min over x ∈ p of r(x), where r assigns each item an independent uniform value in [0,1]}
    Pr[h_r(p) = h_r(q)] = jaccard(p,q)
The precision of r should be high enough to avoid ties.
As before, this family is (d1, d2, 1 - d1, 1 - d2)-sensitive.
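A small sketch of this minhash family in Python (the salted-string seeding below is an illustrative stand-in for a truly random function r from items to [0,1]; none of the names come from the slides): each h_r maps a set to the element with the smallest pseudo-random value, and the fraction of agreeing minhashes estimates jaccard(p,q).

    import random

    def make_minhash(seed):
        """One h_r: r(x) is a fixed pseudo-random value in [0,1] per item x."""
        def r(x):
            return random.Random(f"{seed}:{x}").random()
        return lambda s: min(s, key=r)

    p = {"a", "b", "c", "d"}
    q = {"b", "c", "d", "e", "f"}
    jaccard = len(p & q) / len(p | q)                     # 3/6 = 0.5

    hashes = [make_minhash(seed) for seed in range(2000)]
    estimate = sum(h(p) == h(q) for h in hashes) / len(hashes)
    print(jaccard, estimate)                              # estimate is close to 0.5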
14. Cosine distance
p and q are vectors; the distance d(p,q) is the angle θ between them.
15. Cosine distance
H = {h_r(p) = 1 if r·p > 0, 0 otherwise | r is a random unit vector}
16. Cosine distance
H = {h_r(p) = 1 if r·p > 0, 0 otherwise | r is a random unit vector}
Pr[h_r(p) = h_r(q)] = ?
17. Cosine distance
H = {h_r(p) = 1 if r·p > 0, 0 otherwise | r is a random unit vector}
Pr[h_r(p) = h_r(q)] = 1 - θ/π, where θ is the angle between p and q.
18. Cosine distance
H = {h_r(p) = 1 if r·p > 0, 0 otherwise | r is a random unit vector}
This is a (θ1, θ2, 1 - θ1/π, 1 - θ2/π)-sensitive family.
19. Cosine distance
H = {h_r(p) = 1 if r·p > 0, 0 otherwise | r is a random unit vector}
For binary vectors (like term-document incidence vectors) the angle is
    θ(A,B) = arccos( |A ∩ B| / sqrt(|A|·|B|) )
This is a (θ1, θ2, 1 - θ1/π, 1 - θ2/π)-sensitive family.
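A minimal sketch of this random hyperplane family, often called SimHash (the Gaussian sampling and the function names are my own illustration, not from the slides): hash by the sign of a dot product with a random direction, and check that Pr[h_r(p) = h_r(q)] is close to 1 - θ/π.

    import math
    import random

    def make_hyperplane_hash(dim, seed=None):
        """One h_r: 1 if r·p > 0 else 0, for a random direction r.
        Sampling each coordinate from a Gaussian gives a uniformly random direction;
        normalizing r to unit length would not change the sign of r·p."""
        rng = random.Random(seed)
        r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        return lambda p: 1 if sum(ri * pi for ri, pi in zip(r, p)) > 0 else 0

    p = [1.0, 2.0, 0.5]
    q = [1.5, 1.0, 0.0]
    cos = sum(a * b for a, b in zip(p, q)) / (math.hypot(*p) * math.hypot(*q))
    theta = math.acos(cos)

    hs = [make_hyperplane_hash(3, seed=s) for s in range(20000)]
    agree = sum(h(p) == h(q) for h in hs) / len(hs)
    print(agree, 1 - theta / math.pi)   # the two values should be close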
20. Combining by “AND”
Reduce the number of false positives by concatenating hash functions to get a new family of hash functions:
    h(p) = h1(p)h2(p)h3(p)h4(p)……hk(p) = 00101010
We get a new family of hash functions: h(p) = h(q) iff hi(p) = hi(q) for every i.
If the original family is (d1, d2, p1, p2)-sensitive, then the new family is (d1, d2, p1^k, p2^k)-sensitive.
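For concreteness, a sketch of the “AND” construction (names are illustrative, not from the slides): concatenate k independently drawn basic hash functions into one tuple-valued function, so two points collide only if they agree on all k coordinates.

    import random

    def and_combine(basic_hashes):
        """AND: h(p) = (h1(p), ..., hk(p)); points collide iff they agree on all k,
        so a collision probability of p for one function becomes p**k."""
        return lambda p: tuple(h(p) for h in basic_hashes)

    # Basic family: bit sampling for m-bit points, as on the Hamming-distance slide.
    def bit_sampler(m, seed):
        i = random.Random(seed).randrange(m)
        return lambda p: p[i]

    k, m = 4, 16
    g = and_combine([bit_sampler(m, seed=s) for s in range(k)])
    print(g([1, 0] * 8))    # a bucket label made of k sampled bits, e.g. (1, 0, 0, 1)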
21. Combining by “OR”
Reduce the number of false negatives:
    h(p) = h1(p),h2(p),h3(p),h4(p),……,hL(p) = 0,0,1,0,1,0,1,0
We get a new family of hash functions: h(p) matches h(q) iff there is some i such that hi(p) = hi(q).
If the original family is (d1, d2, p1, p2)-sensitive, then the new family is (d1, d2, 1 - (1 - p1)^L, 1 - (1 - p2)^L)-sensitive.
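A matching sketch for the “OR” combination (again illustrative, not from the slides): with L functions kept side by side, two points are considered a match if they collide under at least one of them.

    def or_match(hash_functions, p, q):
        """OR over L functions: p and q match iff h_i(p) == h_i(q) for some i,
        so a single-function collision probability p becomes 1 - (1 - p)**L."""
        return any(h(p) == h(q) for h in hash_functions)

In the actual index the “OR” is realized by storing every point in L separate hash tables and taking the union of the query's L buckets; see the sketch after slide 32.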
22. “And k” followed by “Or L”
A (d1, d2, p1, p2)-sensitive family becomes a (d1, d2, 1 - (1 - p1^k)^L, 1 - (1 - p2^k)^L)-sensitive family.
What does this do?
23. The function 1 - (1 - p^k)^L for k=5, L=20
For example, if (p1, p2) were (0.6, 0.4), then after amplification they become (0.802, 0.186).
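These numbers can be reproduced directly; a small check (the function name is mine):

    def amplified(p, k, L):
        """Probability of at least one collision after "AND k" followed by "OR L"."""
        return 1 - (1 - p**k)**L

    k, L = 5, 20
    print(round(amplified(0.6, k, L), 3))   # 0.802  (close pairs: few false negatives)
    print(round(amplified(0.4, k, L), 3))   # 0.186  (far pairs: few false positives)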
24. A theoretical result on NN
25. (r,ε)-neighbor problem
If there is a neighbor p such that d(p,q) ≤ r, return a point p′ such that d(p′,q) ≤ (1+ε)r.
26. (r,ε)-neighbor problem
Let's construct a data structure that succeeds with constant probability.
Focus on the Hamming distance first.
27. NN using locality sensitive hashing
Take a (r1, r2, p1, p2) = (r, (1+ε)r, 1 - r/m, 1 - (1+ε)r/m)-sensitive family.
If there is a neighbor at distance r we catch it with probability p1.
28. NN using locality sensitive hashing
Take a (r1, r2, p1, p2) = (r, (1+ε)r, 1 - r/m, 1 - (1+ε)r/m)-sensitive family.
If there is a neighbor at distance r we catch it with probability p1, so to guarantee catching it we need to “or” about 1/p1 functions.
29. NN using locality sensitive hashing
Take a (r1, r2, p1, p2) = (r, (1+ε)r, 1 - r/m, 1 - (1+ε)r/m)-sensitive family.
If there is a neighbor at distance r we catch it with probability p1, so to guarantee catching it we need to “or” about 1/p1 functions.
But we also get false positives. How many?
30. NN using locality sensitive hashing
Same setup as on the previous slides. Across the 1/p1 “or”-ed functions, the expected number of false positives (points at distance at least (1+ε)r that collide with q) is at most n·p2/p1.
31. NN using locality sensitive hashing
Take a (r1, r2, p1, p2) = (r, (1+ε)r, 1 - r/m, 1 - (1+ε)r/m)-sensitive family.
Make a new function by concatenating (“and”) k of these basic functions.
We get a (r1, r2, (p1)^k, (p2)^k)-sensitive family.
If there is a neighbor at distance r we catch it with probability (p1)^k, so to guarantee catching it we need L = 1/(p1)^k functions. We get a (r1, r2, 1 - (1 - (p1)^k)^L, 1 - (1 - (p2)^k)^L)-sensitive family.
But we also get false positives in our 1/(p1)^k buckets. How many? In expectation, at most n(p2)^k/(p1)^k.
32. (r,ε)-neighbor with constant probability
Scan the first 4·n(p2)^k/(p1)^k points in the buckets and return the closest.
A close neighbor (at distance ≤ r1) is in one of the buckets with probability ≥ 1 - 1/e.
There are ≤ 4·n(p2)^k/(p1)^k false positives with probability ≥ 3/4 (by Markov's inequality).
Both events happen with constant probability.
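Putting the pieces together, here is a compact sketch of the whole scheme for Hamming space. The class name, the bit-sampling hashes, and the parameter handling are my own illustration of the analysis on these slides, not code from the deck: build L = ceil(1/(p1)^k) tables keyed by k sampled bits, and at query time scan at most 4·n·(p2/p1)^k candidate points.

    import math
    import random
    from collections import defaultdict

    def hamming(p, q):
        return sum(a != b for a, b in zip(p, q))

    class HammingLSH:
        """Sketch of the (r, eps)-neighbor structure for m-bit points."""

        def __init__(self, points, m, r, eps, k, seed=0):
            p1 = 1 - r / m                       # collision probability at distance r
            p2 = 1 - (1 + eps) * r / m           # collision probability at distance (1+eps)*r
            self.L = math.ceil(1 / p1 ** k)      # number of "AND-k" functions to "OR"
            self.budget = math.ceil(4 * len(points) * (p2 / p1) ** k)
            rng = random.Random(seed)
            # Each table hashes a point to the tuple of k randomly sampled bits.
            self.coords = [[rng.randrange(m) for _ in range(k)] for _ in range(self.L)]
            self.tables = [defaultdict(list) for _ in range(self.L)]
            self.points = points
            for idx, point in enumerate(points):
                for table, cs in zip(self.tables, self.coords):
                    table[tuple(point[c] for c in cs)].append(idx)

        def query(self, q, r, eps):
            best, best_dist, scanned = None, None, 0
            for table, cs in zip(self.tables, self.coords):
                for idx in table.get(tuple(q[c] for c in cs), []):
                    d = hamming(self.points[idx], q)
                    if best_dist is None or d < best_dist:
                        best, best_dist = self.points[idx], d
                    scanned += 1
                    if scanned >= self.budget:   # scan at most 4*n*(p2/p1)^k candidates
                        break
                else:
                    continue
                break
            # With constant probability this returns a point within (1+eps)*r
            # whenever some data point lies within distance r of q.
            if best_dist is not None and best_dist <= (1 + eps) * r:
                return best
            return None

    # Hypothetical usage (parameters chosen for illustration only):
    # pts = [[random.randrange(2) for _ in range(64)] for _ in range(1000)]
    # index = HammingLSH(pts, m=64, r=8, eps=1.0, k=10)
    # print(index.query(pts[0], r=8, eps=1.0))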
33. Analysis
Total query time: (k + n·(p2)^k) / (p1)^k hash evaluations and distance computations, where each operation takes time proportional to the dimension.
We want to choose k to minimize this. Since the total is the sum of two terms, it is at most twice the larger one, so the minimizing k roughly balances them.
34. Analysis
Total query time: (k + n·(p2)^k) / (p1)^k (each operation takes time proportional to the dimension).
We want to choose k to minimize this. Taking k = log(n) / log(1/p2) = log_{1/p2}(n) makes n·(p2)^k = 1, so the second term stops growing with n.
35. Summary
Total query time: (k + n·(p2)^k) / (p1)^k.
Put k = log_{1/p2}(n). Then n·(p2)^k = 1 and 1/(p1)^k = n^ρ, where ρ = log(1/p1) / log(1/p2).
Query time: O(n^ρ · log n) operations, each proportional to the dimension.
Space: O(n^{1+ρ}), since the n points are stored in L = n^ρ hash tables.
36. What is ρ?
ρ = log(1/p1) / log(1/p2) = log(1/(1 - r/m)) / log(1/(1 - (1+ε)r/m)) ≤ 1/(1+ε),
so the query time is roughly n^{1/(1+ε)}.
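A quick numerical check of this bound (illustrative, not from the deck): compute ρ for the Hamming family and compare it with 1/(1+ε).

    import math

    def rho(r, m, eps):
        """rho = log(1/p1)/log(1/p2) for p1 = 1 - r/m, p2 = 1 - (1+eps)*r/m."""
        p1 = 1 - r / m
        p2 = 1 - (1 + eps) * r / m
        return math.log(1 / p1) / math.log(1 / p2)

    for eps in (0.5, 1.0, 2.0):
        print(eps, round(rho(r=10, m=1000, eps=eps), 3), round(1 / (1 + eps), 3))
    # rho stays below 1/(1+eps), so the query time is roughly n**(1/(1+eps)).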
37. (1+ε)-approximate NN
Given q, find p such that for every point p′: d(q,p) ≤ (1+ε)·d(q,p′).
We can use our solution to the (r,ε)-neighbor problem.
38. (1+ε)-approximate NN vs (r,ε)-neighbor problem
If we know r_min and r_max we can find a (1+ε)-approximate NN using log(r_max/r_min) instances of the (r, ε′ ≈ ε/2)-neighbor problem.
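A sketch of this reduction (the function name and the geometric grid of radii are my own illustration): try the (r, ε′)-neighbor structure for geometrically increasing radii and return the first hit.

    def approximate_nn(query_r_neighbor, r_min, r_max, eps_prime):
        """Reduce (1+eps)-approximate NN to (r, eps')-neighbor queries.
        query_r_neighbor(q, r) is assumed to return a point within (1 + eps_prime) * r
        of q whenever some data point lies within distance r, and None otherwise."""
        def nn(q):
            r = r_min
            while r <= (1 + eps_prime) * r_max:
                p = query_r_neighbor(q, r)
                if p is not None:
                    return p                  # d(q, p) <= (1 + eps_prime) * r
                r *= (1 + eps_prime)          # next radius in the geometric grid
            return None
        return nn

If the query at the previous radius r/(1+ε′) found nothing, the true nearest neighbor is farther than r/(1+ε′), so the returned point is within a factor of about (1+ε′)^2 ≈ 1+ε of optimal, which is why the slide takes ε′ ≈ ε/2.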
39. LSH using p-stable distributions
Definition: A distribution D is 2-stable if, when X1, ..., Xd are drawn independently from D, the sum Σ_i v_i·X_i is distributed as ||v||·X, where X is also drawn from D (the Gaussian distribution is 2-stable).
So what do we do with this?
    h(p) = Σ_i p_i·X_i
    h(p) - h(q) = Σ_i p_i·X_i - Σ_i q_i·X_i = Σ_i (p_i - q_i)·X_i = ||p - q||·X
40. LSH using p-stable distributions
So what do we do with this?
    h(p) = ⌊(p·X)/r⌋, i.e., project p onto the random direction X and cut the line into buckets of width r.
Pick the bucket width r to minimize ρ.
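A minimal sketch of this family using the 2-stable Gaussian distribution (the random offset b comes from the standard construction of Datar et al. and is not shown on the slide; the names are illustrative):

    import random

    def make_pstable_hash(dim, r, seed=None):
        """h(p) = floor((p·X + b)/r): project onto a Gaussian vector X (2-stable),
        shift by a random offset b in [0, r), and cut the line into width-r buckets.
        Points with small ||p - q||_2 tend to land in the same bucket."""
        rng = random.Random(seed)
        X = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = rng.uniform(0.0, r)
        def h(p):
            return int((sum(x * pi for x, pi in zip(X, p)) + b) // r)
        return h

    h = make_pstable_hash(dim=3, r=4.0, seed=1)
    print(h([1.0, 2.0, 0.5]), h([1.1, 2.0, 0.4]), h([9.0, -5.0, 3.0]))
    # Over the random choice of X and b, the first two (close) points share a bucket
    # far more often than either does with the third (far) point.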
41. Bibliography
M. Charikar: Similarity estimation techniques from rounding algorithms. STOC 2002: 380-388.
P. Indyk, R. Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998: 604-613.
A. Gionis, P. Indyk, R. Motwani: Similarity Search in High Dimensions via Hashing. VLDB 1999: 518-529.
M. R. Henzinger: Finding near-duplicate web pages: a large-scale evaluation of algorithms. SIGIR 2006: 284-291.
G. S. Manku, A. Jain, A. Das Sarma: Detecting near-duplicates for web crawling. WWW 2007: 141-150.