Bloom Filters

Bloom Filters

An Introduction and Really Most Of It

CMSC 491

Hadoop-Based Distributed Computing

Spring 2016

Adam Shook

•

Discuss what a set data structure is using math

terms

•

Discuss the concept of a Bloom filter

•

Explore the mathematical magic behind Bloom

filters

Set!

•

set

is an

unsorted

 data structure containing

unique

 values

•

Most common uses are:

•

Error-free

set membership tests

•

Storing unique members of data (remove duplicates)

•

Iterating through data in no particular order

•

Other fun operations like unions, intersections, subsets,

etceteras!

•

Other sets support sorting and duplicate values

•

But we aren’t here to talk about those

Set Insertion

peter

lois

chris

peter

stewie

chris

stewie

lois

chris

peter

insert

is_memb

er

Set Membership Test

peter



stewie

lois

chris

peter

is_memb

er

Set Membership Test

adam

stewie

lois

chris

peter

Use Case!

•

I’ve got a bunch of interesting keywords,

•

I’ve got a data set

•

I want to check if a record in

 contains a word in

•

Make a new data set

for some cool data science

for each record

in

for each word

in

if

in

emit

Use Case, Solved!

•

Stuff all the data in

into a set

•

Get an A+ on your computer science project

•

Impress the boss

•

But what if

is stupid big?

credit to mr. squarepants

Memory Footprint

•

contains 1 billion unique strings, average of 32

characters in length

•

8 bits per character

•

32 characters per string

•

1 billion of them

•

8 bits * 32 * 1,000,000,000 …

•

Roughly 29.8 GB of raw storage required to hold these

elements

•

+ overhead

•

+ even more if you are using Java

•

For the sake of argument, let’s all agree that

doesn’t

fit comfortably on a computer…

credit to xkcd and paint

Making a Set Smaller

•

What two ‘features’ of a set can we relax to meet

our requirements and have a reasonable memory

footprint?

•

Functionality

•

Only want set membership operations

•

Accuracy

•

Don’t really need to be 100% accurate

Use Case, Revised!

•

I’ve got a bunch of interesting keywords,

•

I’ve got a data set

•

I want to check if a record in

 contains a word in

•

Make a new data set

for some cool data science

•

I don’t really care if some stuff in

doesn’t contain words from

    for each record

in

        for each word

in

if

is likely

in

with false positive

                emit

Let me paint you a story…

•

We travel back to 1970…

•

Burton Howard Bloom was investigating means to

eliminate unnecessary disk accesses for particular

algorithms

•

Came up with the a

probabilistic data structure

for

set membership

•

Useful for programs with expensive operations

where the operation is often unnecessary

•

A structure only 15% of the size of the original can

eliminate 85% of unnecessary disk accesses

Bloom Filter

•

A space-efficient means to test if an element is a

member of a set

•

Elements can be added, but cannot be removed

•

Storage cost for a single element is independent of

the element size

•

The members are not stored, so they cannot be

retrieved

•

There are no false negatives, but false positives are

possible

How It’s Made – Training a Bloom

Filter

Given

An array of bits size

, initialized to 0

 hash functions

 elements

foreach element

in

foreach function

in

) %

] = 1

Training a Bloom filter is O(

How It’s Made – Training a Bloom

Filter

peter

lois

chris

How It’s Made – Membership

Testing

Given

A trained Bloom filter of size

The same

 hash functions

An element

foreach function

in

if

) %

] is 0

return false

return true

Testing a Bloom filter is O(1)

How It’s Made – Membership

Testing

peter



How It’s Made – Membership

Testing

adam

I know what you’re thinking

The Catch

•

What we make up for in space, we give up the

accuracy…

•

I give you… the false positive!

cleveland



credit to xkcd and paint

Controlling the False Positive Rate

Given

Approximate number of elements in

A willingness to tolerate a percent

 of false positives

 is the optimal number of hash functions

We can approximate

If you want the full details, read the paper or Wikipedia

Back to our use case…

 = 1,000,000,000

 = .1

After dusting off the calculators….

 = 4.792 x10

 bits or 0.558 GB

An improvement of 29.8/0.558 = 53.4!

…

We can use

and

to calculate

 * ln(2)

But I haven’t heard of 3.32 hash functions so let’s call

it 4

References

•

Wikipedia

Slide Note

Embed Share

Download

Set data structures, specifically Bloom filters, are explored in this Spring 2016 course on Hadoop-Based Distributed Computing. Learn about the concept of Bloom filters, set membership tests, and how they can be used for error-free membership checks, removing duplicates, and other set operations. Dive into the mathematical magic behind Bloom filters and understand their applications in data science projects.

gaus_id Follow

Uploaded on Feb 15, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Bloom Filters An Introduction and Really Most Of It CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

Agenda Agenda Discuss what a set data structure is using math terms Discuss the concept of a Bloom filter Explore the mathematical magic behind Bloom filters

Set! A set is an unsorted data structure containing unique values Most common uses are: Error-free set membership tests Storing unique members of data (remove duplicates) Iterating through data in no particular order Other fun operations like unions, intersections, subsets, etceteras! Other sets support sorting and duplicate values But we aren t here to talk about those

Set Insertion peter lois chris peter stewie chris stewie lois chris peter insert

Set Membership Test stewie lois chris peter is_memb er peter

Set Membership Test stewie lois chris peter is_memb er adam

Use Case! I ve got a bunch of interesting keywords, A I ve got a data set B I want to check if a record in B contains a word in A Make a new data set C for some cool data science for each record x in B for each word w in x if w in A emit x

Use Case, Solved! Stuff all the data in A into a set Get an A+ on your computer science project Impress the boss But what if A is stupid big? credit to mr. squarepants

Memory Footprint A contains 1 billion unique strings, average of 32 characters in length 8 bits per character 32 characters per string 1 billion of them 8 bits * 32 * 1,000,000,000 Roughly 29.8 GB of raw storage required to hold these elements + overhead + even more if you are using Java For the sake of argument, let s all agree that Adoesn t fit comfortably on a computer

credit to xkcd and paint

Making a Set Smaller What two features of a set can we relax to meet our requirements and have a reasonable memory footprint? Functionality Only want set membership operations Accuracy Don t really need to be 100% accurate

Use Case, Revised! I ve got a bunch of interesting keywords, A I ve got a data set B I want to check if a record in B contains a word in A Make a new data set C for some cool data science I don t really care if some stuff in C doesn t contain words from A for each record x in B for each word w in x if wis likely in A with false positive p emit x

Let me paint you a story We travel back to 1970 Burton Howard Bloom was investigating means to eliminate unnecessary disk accesses for particular algorithms Came up with the a probabilistic data structure for set membership Useful for programs with expensive operations where the operation is often unnecessary A structure only 15% of the size of the original can eliminate 85% of unnecessary disk accesses

Bloom Filter A space-efficient means to test if an element is a member of a set Elements can be added, but cannot be removed Storage cost for a single element is independent of the element size The members are not stored, so they cannot be retrieved There are no false negatives, but false positives are possible

How Its Made Training a Bloom Filter Given An array of bits size m, initialized to 0 k hash functions n elements foreach element ni in n foreach function ki in k m[ki(ni) % m] = 1 Training a Bloom filter is O(n)

How Its Made Training a Bloom Filter peterlois chris 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0

How Its Made Membership Testing Given A trained Bloom filter of size m The same k hash functions An element x foreach function ki in k if m[ki(x) % m] is 0 return false return true Testing a Bloom filter is O(1)

How Its Made Membership Testing 0 1 1 1 1 1 0 1 0 1 peter

How Its Made Membership Testing 0 1 1 1 1 1 0 1 0 1 adam

I know what youre thinking

The Catch What we make up for in space, we give up the accuracy I give you the false positive! 0 1 1 1 1 1 0 1 0 1 cleveland

credit to xkcd and paint

Controlling the False Positive Rate Given Approximate number of elements in A, n A willingness to tolerate a percent p of false positives k is the optimal number of hash functions We can approximate m If you want the full details, read the paper or Wikipedia

Back to our use case n = 1,000,000,000 p = .1 After dusting off the calculators . m = 4.792 x109 bits or 0.558 GB An improvement of 29.8/0.558 = 53.4!