Overview of Bloom Filters and Succinct Data Structures
Bloom filters and succinct data structures are efficient data structures used for set-membership tests and approximate queries. Bloom filters offer compact storage for set membership with a trade-off in accuracy, while succinct data structures aim to balance high performance, low space cost, and support for delete operations. This content covers the basic concepts, applications, and implementations of these data structures.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Cuckoo Filters Michael T. Goodrich University of California, Irvine Some slides adapted from slides by Fan, Andersen, Kaminsky, and Mitzenmacher
What is Bloom Filter? A Compact Data Structure Storing Set-membership Bloom Filters answer is item x x in set Y Y by: definitely no definitely no , or probably yes probably yes with probability to be wrong false positive rate false positive rate Benefit: not always precise but highly compact Typically a few bits per item Achieving lower (more accurate) requires spending more bits per item
Example Use: Web Crawling www www. .example example. .com com Please verify Please verify www.example.com www.example.com Remote Server It is Good! It is Good! Probably Yes! Probably Yes! Lookup( Lookup( www.example.com www.example.com ) ) No! No! Known Malicious URLs Stored in Bloom Filter Scale to millions URLs
Bloom Filter Review A Bloom Filter consists of m m bits and k k hash functions Example: m m = 10, k k = 3 Insert(x) Lookup(y) = = not found not found hash3(x) hash3(y) hash1(x) hash1(y) hash2(x) hash2(y) 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0
Succinct Data Structures for Approximate Set-membership Tests High Low Space Cost Delete Support Performance Bloom Filter Counting Bloom Filter Can we achieve all three in practice? Can we achieve all three in practice?
Basic Idea: Store Fingerprints in Hash Table Fingerprint Fingerprint(x): A hash value of x Lower false positive rate , longer fingerprint 0: 1: 2: 3: 4: FP(a) FP(c) 5: 6: 7: FP(b)
Basic Idea: Store Fingerprints in Hash Table Fingerprint Fingerprint(x): A hash value of x Lower false positive rate , longer fingerprint Insert Insert(x): add Fingerprint(x) Fingerprint(x) to hash table 0: 1: 2: 3: 4: FP(x) FP(a) FP(c) 5: 6: 7: FP(b)
Basic Idea: Store Fingerprints in Hash Table Fingerprint Fingerprint(x): A hash value of x Lower false positive rate , longer fingerprint Insert Insert(x): add Fingerprint(x) Fingerprint(x) to hash table Lookup Lookup(x): search Fingerprint Fingerprint(x) in hashtable Lookup(x) = = found 0: 1: 2: 3: 4: FP(x) FP(a) FP(c) 5: 6: 7: FP(b)
Basic Idea: Store Fingerprints in Hash Table Fingerprint Fingerprint(x): A hash value of x Lower false positive rate , longer fingerprint Insert Insert(x): add Fingerprint(x) Fingerprint(x) to hash table Lookup Lookup(x): search Fingerprint Fingerprint(x) in hashtable Delete Delete(x): remove Fingerprint Fingerprint(x) from hashtable Delete(x) 0: 1: 2: 3: 4: FP(x) FP(a) FP(c) 5: 6: 7: FP(b) How to Construct Hashtable?
Cuckoo Hashing[Pagh2004] Higher Space Utilization 4-way set-associative table: >95% entries occupied Fast Lookup: O(1) [Pagh2004] Good But .. 0: 1: 2: 3: 4: hash1(x) lookup(x) hash2(x) 5: 6: 7: Standard cuckoo hashing doesn t use fingerprints
Standard Cuckoo Requires Storing Each Item 0: 1: 2: 3: 4: b h1(x) Insert(x) Insert(x) c 5: 6: 7: h2(x) a
Standard Cuckoo Requires Storing Each Item 0: 1: 2: 3: 4: b Insert(x) Insert(x) c 5: 6: 7: Rehash a: alternate(a) = 4 Kick a to bucket 4 h2(x) x
Standard Cuckoo Requires Storing Each Item 0: 1: 2: 3: 4: b Rehash c: alternate(c) = 1 Kick c to bucket 1 Insert(x) Insert(x) a 5: 6: 7: Rehash a: alternate(a) = 4 Kick a to bucket 4 h2(x) x
Standard Cuckoo Requires Storing Each Item Insert complete (or fail if MaxSteps reached) 0: 1: 2: 3: 4: c b Rehash c: alternate(c) = 1 Kick c to bucket 1 Insert(x) Insert(x) a 5: 6: 7: Rehash a: alternate(a) = 4 Kick a to bucket 4 h2(x) x
Challenge: How to Perform Cuckoo? Cuckoo hashing requires rehashing and displacing existing items 0: 1: 2: 3: 4: FP(b) Kick FP(c) to which bucket? FP(c) 5: 6: 7: Kick FP(a) to which bucket? FP(a) With only fingerprint, how to calculate item s alternate bucket?
We Apply Partial-Key Cuckoo Standard Cuckoo Hashing: two independent hash functions for two buckets functions for two buckets two independent hash bucket1 = hash1(x) bucket2 = hash2(x) Partial-key Cuckoo Hashing: use one bucket and fingerprint to derive the other fingerprint to derive the other [Fan2013] use one bucket and [Fan2013] bucket1 = hash(x) bucket2 = bucket1 hash(FP(x)) To displace existing fingerprint: alternate(x) = current(x) hash(FP(x))
Partial Key Cuckoo Hashing Perform cuckoo hashing on fingerprints 0: 1: 2: 3: 4: FP(b) Kick FP(c) to 4 hash(FP(c)) FP(c) 5: 6: 7: Kick FP(a) to 6 hash(FP(a)) FP(a) Can we still achieve high space utilization with partial-key cuckoo hashing?
Set Intersection Problem Preprocess a collection of sets so as to quickly answer set-intersection queries.
Review: Cuckoo Filters A cuckoo table and parallel cuckoo filter. See Fan et al. 14, Eppstein 16. Provides improved functionality over Bloom filters. Each fingerprint is stored in 1 of 2 places. T: cat pig dog cat dog pig M: 111 000 000 000 000 000 111 111 000 F: 2 5 4 1 word 1 word 1 word
New: 2-3 Cuckoo Filters A cuckoo table and parallel cuckoo filter, where each element is stored in 2-out-of-3 possible locations. T: cat dog pig dog cat pig pig dog cat M: 111 000 111 111 000 111 111 111 000 F: 2 4 5 2 5 4 1 word 1 word 1 word
Intersecting Two 2-3 Cuckoo Filters At least one location must overlap: We may have false-positives, though. T1: cat dog pig cat pig dog M1: 111 000 111 111 000 111 111 111 000 F1: 2 4 5 2 5 4 T2: cat cat fox pig pig fox M2: 111 111 000 000 111 111 111 111 000 F2: 2 2 4 5 5 4
Set Intersection Analysis By above analysis, we can construct 2-3 cuckoo filters of size at least 2 with constant-size stashes with probability at least 1-1/wc. Output intersection elements & discard false positives Intersect cuckoo filters w/ bit-parallel operations Examine each colliding position Choosing a fingerprint size of at least log w bits implies false positives occur with probability less than 1/w. Expected number of false positives is O(n/w). Thus, expected time for a set intersection query is O(n(log w)/w + k), where k is the output size.