Bloom Filters

 
Bloom Filters
 
An Introduction and Really Most Of It
CMSC 491
Hadoop-Based Distributed Computing
Spring 2016
Adam Shook
 
A
g
e
n
d
a
 
Discuss what a set data structure is using math
terms
Discuss the concept of a Bloom filter
Explore the mathematical magic behind Bloom
filters
 
Set!
 
A 
set 
is an 
unsorted
 data structure containing
unique
 values
Most common uses are:
Error-free 
set membership tests
Storing unique members of data (remove duplicates)
Iterating through data in no particular order
Other fun operations like unions, intersections, subsets,
etceteras!
Other sets support sorting and duplicate values
But we aren’t here to talk about those
Set Insertion
peter
lois
chris
peter
stewie
chris
 
stewie
lois
chris
peter
insert
is_memb
er
Set Membership Test
peter
 
 
stewie
lois
chris
peter
is_memb
er
Set Membership Test
adam
 
stewie
lois
chris
peter
Use Case!
 
I’ve got a bunch of interesting keywords, 
A
I’ve got a data set 
B
I want to check if a record in 
B
 contains a word in 
A
Make a new data set 
C
 
for some cool data science
 
  
for each record 
x 
in 
B
   
for each word 
w 
in 
x
    
if 
w
 in 
A
     
emit 
x
Use Case, Solved!
 
Stuff all the data in 
A
 
into a set
Get an A+ on your computer science project
Impress the boss
 
But what if 
A
 
is stupid big?
 
credit to mr. squarepants
Memory Footprint
 
 
A
 
contains 1 billion unique strings, average of 32
characters in length
8 bits per character
32 characters per string
1 billion of them
8 bits * 32 * 1,000,000,000 …
Roughly 29.8 GB of raw storage required to hold these
elements
+ overhead
+ even more if you are using Java
 
For the sake of argument, let’s all agree that 
A
 
doesn’t
fit comfortably on a computer…
 
credit to xkcd and paint
Making a Set Smaller
 
What two ‘features’ of a set can we relax to meet
our requirements and have a reasonable memory
footprint?
 
Functionality
Only want set membership operations
Accuracy
Don’t really need to be 100% accurate
Use Case, Revised!
 
I’ve got a bunch of interesting keywords, 
A
I’ve got a data set 
B
I want to check if a record in 
B
 contains a word in 
A
Make a new data set 
C
 
for some cool data science
I don’t really care if some stuff in 
C
 
doesn’t contain words from 
A
 
    for each record 
x 
in 
B
        for each word 
w 
in 
x
            
if 
w
 
is likely 
in 
A 
with false positive 
p
                emit 
x
Let me paint you a story…
 
We travel back to 1970…
Burton Howard Bloom was investigating means to
eliminate unnecessary disk accesses for particular
algorithms
Came up with the a 
probabilistic data structure 
for
set membership
Useful for programs with expensive operations
where the operation is often unnecessary
A structure only 15% of the size of the original can
eliminate 85% of unnecessary disk accesses
Bloom Filter
 
A space-efficient means to test if an element is a
member of a set
Elements can be added, but cannot be removed
Storage cost for a single element is independent of
the element size
The members are not stored, so they cannot be
retrieved
There are no false negatives, but false positives are
possible
 
How It’s Made – Training a Bloom
Filter
 
Given
An array of bits size
 
m
, initialized to 0
k
 hash functions
n
 elements
 
foreach element 
n
i
 in 
n
 
foreach function 
k
i
 in 
k
  
m
[
k
i
(
n
i
) % 
m
] = 1
 
Training a Bloom filter is O(
n
)
How It’s Made – Training a Bloom
Filter
peter
lois
chris
1
1
1
1
1
1
1
 
How It’s Made – Membership
Testing
 
Given
A trained Bloom filter of size 
m
The same 
k
 hash functions
An element 
x
 
foreach function 
k
i
 in 
k
 
if 
m
[
k
i
(
x
) % 
m
] is 0
  
return false
return true
 
Testing a Bloom filter is O(1)
How It’s Made – Membership
Testing
peter
 
How It’s Made – Membership
Testing
adam
I know what you’re thinking
The Catch
 
What we make up for in space, we give up the
accuracy…
I give you… the false positive!
cleveland
 
 
credit to xkcd and paint
 
Controlling the False Positive Rate
 
Given
Approximate number of elements in 
A
, 
n
A willingness to tolerate a percent 
p
 of false positives
k
 is the optimal number of hash functions
 
We can approximate 
m
 
 
 
If you want the full details, read the paper or Wikipedia
Back to our use case…
 
n
 = 1,000,000,000
p
 = .1
 
After dusting off the calculators….
 
m
 = 4.792 x10
9
 bits or 0.558 GB
 
An improvement of 29.8/0.558 = 53.4!
A
n
d
 
n
o
w
 
t
h
a
t
 
w
e
 
h
a
v
e
 
m
 
We can use 
n
 
and 
m
 
to calculate 
k
 = 
m
/
n
 * ln(2)
 
 
 
 
But I haven’t heard of 3.32 hash functions so let’s call
it 4
 
References
 
Wikipedia
Slide Note
Embed
Share

Set data structures, specifically Bloom filters, are explored in this Spring 2016 course on Hadoop-Based Distributed Computing. Learn about the concept of Bloom filters, set membership tests, and how they can be used for error-free membership checks, removing duplicates, and other set operations. Dive into the mathematical magic behind Bloom filters and understand their applications in data science projects.

  • Bloom Filters
  • Set Data Structures
  • Hadoop-Based Computing
  • Distributed Computing
  • Data Science

Uploaded on Feb 15, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Bloom Filters An Introduction and Really Most Of It CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

  2. Agenda Agenda Discuss what a set data structure is using math terms Discuss the concept of a Bloom filter Explore the mathematical magic behind Bloom filters

  3. Set! A set is an unsorted data structure containing unique values Most common uses are: Error-free set membership tests Storing unique members of data (remove duplicates) Iterating through data in no particular order Other fun operations like unions, intersections, subsets, etceteras! Other sets support sorting and duplicate values But we aren t here to talk about those

  4. Set Insertion peter lois chris peter stewie chris stewie lois chris peter insert

  5. Set Membership Test stewie lois chris peter is_memb er peter

  6. Set Membership Test stewie lois chris peter is_memb er adam

  7. Use Case! I ve got a bunch of interesting keywords, A I ve got a data set B I want to check if a record in B contains a word in A Make a new data set C for some cool data science for each record x in B for each word w in x if w in A emit x

  8. Use Case, Solved! Stuff all the data in A into a set Get an A+ on your computer science project Impress the boss But what if A is stupid big? credit to mr. squarepants

  9. Memory Footprint A contains 1 billion unique strings, average of 32 characters in length 8 bits per character 32 characters per string 1 billion of them 8 bits * 32 * 1,000,000,000 Roughly 29.8 GB of raw storage required to hold these elements + overhead + even more if you are using Java For the sake of argument, let s all agree that Adoesn t fit comfortably on a computer

  10. credit to xkcd and paint

  11. Making a Set Smaller What two features of a set can we relax to meet our requirements and have a reasonable memory footprint? Functionality Only want set membership operations Accuracy Don t really need to be 100% accurate

  12. Use Case, Revised! I ve got a bunch of interesting keywords, A I ve got a data set B I want to check if a record in B contains a word in A Make a new data set C for some cool data science I don t really care if some stuff in C doesn t contain words from A for each record x in B for each word w in x if wis likely in A with false positive p emit x

  13. Let me paint you a story We travel back to 1970 Burton Howard Bloom was investigating means to eliminate unnecessary disk accesses for particular algorithms Came up with the a probabilistic data structure for set membership Useful for programs with expensive operations where the operation is often unnecessary A structure only 15% of the size of the original can eliminate 85% of unnecessary disk accesses

  14. Bloom Filter A space-efficient means to test if an element is a member of a set Elements can be added, but cannot be removed Storage cost for a single element is independent of the element size The members are not stored, so they cannot be retrieved There are no false negatives, but false positives are possible

  15. How Its Made Training a Bloom Filter Given An array of bits size m, initialized to 0 k hash functions n elements foreach element ni in n foreach function ki in k m[ki(ni) % m] = 1 Training a Bloom filter is O(n)

  16. How Its Made Training a Bloom Filter peterlois chris 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0

  17. How Its Made Membership Testing Given A trained Bloom filter of size m The same k hash functions An element x foreach function ki in k if m[ki(x) % m] is 0 return false return true Testing a Bloom filter is O(1)

  18. How Its Made Membership Testing 0 1 1 1 1 1 0 1 0 1 peter

  19. How Its Made Membership Testing 0 1 1 1 1 1 0 1 0 1 adam

  20. I know what youre thinking

  21. The Catch What we make up for in space, we give up the accuracy I give you the false positive! 0 1 1 1 1 1 0 1 0 1 cleveland

  22. credit to xkcd and paint

  23. Controlling the False Positive Rate Given Approximate number of elements in A, n A willingness to tolerate a percent p of false positives k is the optimal number of hash functions We can approximate m If you want the full details, read the paper or Wikipedia

  24. Back to our use case n = 1,000,000,000 p = .1 After dusting off the calculators . m = 4.792 x109 bits or 0.558 GB An improvement of 29.8/0.558 = 53.4!

  25. And now that we have m m We can use n and m to calculate k = m/n * ln(2) But I haven t heard of 3.32 hash functions so let s call it 4

  26. References Wikipedia

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#