Efficient and Scalable Protocol for Private Set Intersection in Big Data Security

When Private Set Intersection Meets Big Data: An Efficient and Scalable Protocol

ECE 693 Big Data Security

Abstract

• Large-scale data processing brings new challenges to the design of privacy-preserving protocols: how to meet the increasing speed and throughput requirements of modern applications, and how to scale up smoothly when the data being protected is big. Efficiency and scalability become critical criteria for privacy-preserving protocols in the age of Big Data.

• In this paper, we present a new Private Set Intersection (PSI) protocol that is extremely efficient and highly scalable compared with existing protocols. The protocol is based on a novel approach that we call oblivious Bloom intersection.

• It has linear complexity and relies mostly on efficient symmetric key operations. It is highly scalable because most operations can be parallelized easily.

• The protocol has two versions: a basic protocol and an enhanced protocol. The security of the two variants is analyzed and proved in the semi-honest model and the malicious model respectively.

Needs

• Examples: (1) geneticists need to search 3 billion base pairs in a personal genome to find genetic disorders that might cause diabetes or cancers, (2) epidemiologists need to link multiple medical databases that contain millions of patients’ records to identify risk factors for diseases, and (3) online retailers want to correlate petabytes of their transaction records with customers’ social network activities, hoping to increase customer satisfaction.

• No privacy-preserving data processing service is cost free, and this brings new challenges: how to meet the increasing speed and throughput requirements of modern applications, and how to scale up smoothly when the data being protected is big?

• With the prevalence of large-scale data processing, efficiency and scalability become critical criteria for designing a privacy-preserving protocol in the age of “Big Data”.

Private Set Intersection (PSI) problem

• Definition: two parties, a client and a server, want to jointly compute the intersection of their private input sets in such a way that, at the end, the client learns the intersection and the server learns nothing.

• The PSI problem has been extensively studied for two reasons: first, set intersection is a foundational primitive; second, it has many practical applications.

• PSI protocols are often criticized as impractical because performance becomes unacceptable when the input size or the security parameter becomes large, and it is difficult to improve performance just by adding hardware proportionally. The criticism is not unfounded.

Existing ones

• Currently two protocols claim to be the fastest PSI protocol: the RSA-OPRF-based protocol by De Cristofaro et al. [15, 16] and the garbled circuit protocol by Huang et al. [26]. Both protocols have a highly optimized implementation. We obtained the source code from the authors of these two protocols and tested the performance.

• To compute the intersection of two 1,048,576-element (2^20) sets, De Cristofaro’s protocol needs 10.6 minutes at 80-bit security, but requires a much longer time at 256-bit security. From tests with smaller sets, we estimate the time to be approximately 131 hours.

• The tests with million-element sets on Huang’s protocol were unsuccessful because the Java Virtual Machine ran out of memory on the client computer, which has 16 GB of RAM. From tests with smaller sets, we estimate that Huang’s protocol requires 27 hours and 51 hours to compute the intersection at 80-bit and 256-bit security respectively.

• Clearly, to use PSI in real-world applications, we need more practical protocols.

Contributions

• A new PSI protocol that is much more efficient than all existing PSI protocols. The protocol is designed based on a novel two-party computation approach, which makes use of a new variant of Bloom filters that we call garbled Bloom filters, and we refer to the new approach as oblivious Bloom intersection. The ideas of garbled Bloom filters and oblivious Bloom intersection are general and of independent interest.

• Our PSI protocol has two versions: a basic protocol, whose security can be proved in the semi-honest model, and an enhanced protocol, whose security can be proved in the malicious model.

• The basic protocol has linear complexity (with a small constant factor) and relies mostly on symmetric key operations. It is fast even with large input sets, and when the security parameter increases, the performance degrades gracefully. Test results show it is orders of magnitude faster than the previous best protocols.

• The enhanced protocol is an extension of the basic protocol that only increases the cost by a factor proportional to the security parameter.

Scalability

• The computational, memory and communication complexities are all linear in the size of the input sets.

• More attractively, most operations in the protocol can be performed in the SPMD (single program, multiple data) fashion, which means little effort is required to separate the computation into a number of parallel tasks.

• Therefore it can take full advantage of the parallel processing capacity provided by current multi-core CPUs, GPGPUs (general-purpose graphics processing units) and cloud computing.

• As a result, the protocol is particularly suitable for Big Data oriented applications that have to process data in a parallelized and/or distributed way.

• We have implemented a proof-of-concept prototype of the basic protocol. To compute the intersection of two million-element sets, it needs only 41 seconds (80-bit) and 5.65 minutes (256-bit) on two moderate computers in parallel mode.

Notations

Bloom Filters

• A Bloom filter [9] {m, n, k, H} is a compact data structure for probabilistic set membership testing. A Bloom filter is an array of m bits that can represent a set S of at most n elements.

• A Bloom filter comes with a set of k independent uniform hash functions H = {h_0, …, h_{k-1}}. Each h_i maps elements (n in total) to index numbers (m in total) uniformly over the range [0, m-1].

• Initially, all bits in the array are set to 0. To insert an element x (belonging to S) into the filter, the element is hashed using the k hash functions to get k index numbers. The bits at ALL these indexes in the bit array are set to 1, i.e. set BF_S[h_i(x)] = 1 for 0 ≤ i ≤ k-1. Note that the total number of bits m >> k (each element generates k index positions).

• To check if an item y is in S, y is hashed by the k hash functions, and all index locations that y hashes to are checked. If ANY of the bits at these locations is 0, y is certainly NOT in S; otherwise y is most likely (but not certainly) in S.

Figure notes: the BF is an array with m bits. The array can represent n elements, all belonging to a set S. There are k hash functions (H(.)).
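The insert and query steps above can be sketched in a few lines of Python. This is only an illustration: the class name `BloomFilter` and the way the k hash functions are derived (SHA-256 over the function index plus the element) are assumptions of this sketch, not the construction used by the authors.

```python
import hashlib


class BloomFilter:
    """An (m, n, k, H)-Bloom filter backed by a list of m bits."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially all bits are 0

    def _indexes(self, element: bytes) -> list[int]:
        # h_i(x): hash the function index i together with the element.
        # This stands in for k independent uniform hash functions (k < 256).
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + element).digest(), "big") % self.m
            for i in range(self.k)
        ]

    def add(self, element: bytes) -> None:
        # Set the bits at ALL k index positions to 1.
        for idx in self._indexes(element):
            self.bits[idx] = 1

    def __contains__(self, element: bytes) -> bool:
        # Any 0 bit means "certainly not in S"; all 1 bits means "most likely in S".
        return all(self.bits[idx] for idx in self._indexes(element))


bf = BloomFilter(m=1024, k=8)
for item in (b"alice", b"bob"):
    bf.add(item)

print(b"alice" in bf)  # an inserted element is always found
print(sum(bf.bits))    # at most 2 * k bits are set
```

Because every inserted element sets all of its own positions, a query for an inserted element can never fail; only the opposite error (a false positive) is possible.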

Bloom filter – Intersection set

ANDing the Bloom filters of S1 and S2 bit by bit gives a Bloom filter for the intersection of S1 and S2.

No false negative, i.e., the following won’t happen:

“An element is in fact in the intersection of S1 and S2, but the BF intersection query result says it is NOT there.”

That is, if an element is in the set intersection, then the BF will also report it as there.

False positive: an element is NOT in the intersection, but the BF reports it as there. This can happen, but with a small chance.
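A small Python sketch of this idea follows. The parameters `M` and `K`, the helper names, and the SHA-256-based index derivation are illustrative assumptions. The sketch ANDs two filters and checks that every element of the true intersection is still reported as present.

```python
import hashlib

M, K = 512, 6  # illustrative filter length and number of hash functions


def indexes(element: bytes) -> list[int]:
    # Stand-in for the k hash functions h_0 ... h_{k-1}.
    return [
        int.from_bytes(hashlib.sha256(bytes([i]) + element).digest(), "big") % M
        for i in range(K)
    ]


def build_bf(items) -> list[int]:
    bits = [0] * M
    for item in items:
        for idx in indexes(item):
            bits[idx] = 1
    return bits


def query(bits: list[int], element: bytes) -> bool:
    return all(bits[idx] for idx in indexes(element))


s1 = {b"a", b"b", b"c"}
s2 = {b"b", b"c", b"d"}

# Bitwise AND of the two filters approximates a filter for the intersection.
bf_and = [x & y for x, y in zip(build_bf(s1), build_bf(s2))]

# Every element of the true intersection is found (no false negatives).
print(all(query(bf_and, e) for e in s1 & s2))
```

An element outside the intersection is rejected unless all of its positions happen to be set by other elements, which is the small false positive chance mentioned above.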

Secret Sharing

• Secret sharing is a fundamental cryptographic primitive. It allows a dealer to split a secret s into n shares such that the secret s can be recovered efficiently with any subset of t or more shares. With any subset of fewer than t shares, the secret is unrecoverable and the shares give no information about the secret. Such a system is called a (t, n)-secret sharing scheme.
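As a concrete instance, the sketch below implements the special case t = n with XOR, which is the kind of scheme the garbled Bloom filter relies on later in the deck. The function names `split` and `recover` are chosen here for illustration only.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split(secret: bytes, n: int) -> list[bytes]:
    """Split `secret` into n shares; all n shares are needed to recover it."""
    # n - 1 shares are uniformly random strings of the same length.
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    # The last share is the secret XORed with all the random shares.
    last = reduce(xor_bytes, shares, secret)
    return shares + [last]


def recover(shares: list[bytes]) -> bytes:
    # XORing all shares cancels the random strings and leaves the secret.
    return reduce(xor_bytes, shares)


shares = split(b"secret-element!!", 4)
print(recover(shares) == b"secret-element!!")
```

Any n - 1 of these shares are jointly uniform random strings, so they reveal nothing about the secret. A general (t, n) scheme with t < n needs a different construction, such as Shamir's polynomial-based scheme.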

Oblivious Transfer Protocol (definition)

• Oblivious transfer [39, 20] allows a sender to send part of its input to a receiver in a manner that protects both parties.

• Namely, the sender does not know which part the receiver receives, and the receiver does not learn any information about the other part of the sender’s input.

• Generally, the sender holds m pairs of strings (X_{j,0}, X_{j,1}) and the receiver holds an m-bit selection string r. The receiver decides to select m specific indexes (each corresponding to an element in the sender’s input x).

• Note that r_j here can only be 0 or 1. It is one of the bits of the receiver’s selection string r.

• From the jth pair, the receiver obtains one of the two strings, either the left one X_{j,0} or the right one X_{j,1}, according to r_j.
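The input/output behaviour described above can be written down as a plain function. This sketch shows only the ideal functionality: it offers no privacy at all, and the name `ideal_ot` is made up for illustration. A real OT protocol computes the same result without revealing the selection bits to the sender or the unselected strings to the receiver.

```python
def ideal_ot(pairs: list[tuple[bytes, bytes]], selection: list[int]) -> list[bytes]:
    """Return X[j][r_j] for every j (ideal functionality, not a secure protocol)."""
    if len(pairs) != len(selection):
        raise ValueError("need exactly one selection bit per pair")
    if any(bit not in (0, 1) for bit in selection):
        raise ValueError("selection bits must be 0 or 1")
    # From the jth pair, pick the left string if r_j = 0, the right one if r_j = 1.
    return [pair[bit] for pair, bit in zip(pairs, selection)]


sender_pairs = [(b"x10", b"x11"), (b"x20", b"x21"), (b"x30", b"x31")]
receiver_bits = [0, 1, 1]
print(ideal_ot(sender_pairs, receiver_bits))  # [b'x10', b'x21', b'x31']
```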

Reduce complexity for Oblivious transfer protocols

• Oblivious transfer protocols are costly and often become the efficiency bottleneck in protocol design. However, Beaver has shown that it is possible to obtain a large number of oblivious transfers given only a small number of actual oblivious transfer calls [7]. In this direction, efficient OT extensions were proposed in [27]. The extensions rely on the Random Oracle model.

Semi-honest Model

• We prove the security of the basic protocol in the presence of static semi-honest adversaries. In this model, the adversary controls one of the parties and follows the protocol specification exactly. However, it may try to learn more information about the other party’s input.

• Note on the functionality f: an empty string as the output f_2 means “learn nothing about the other party”. This is the main idea of PSI.

Semi-honest Model

• In the semi-honest model, a protocol π is secure if whatever can be computed by a party in the protocol can be obtained from its own input and output only. This is formalized by the simulation paradigm.

• We require a party’s view in a protocol execution to be simulatable given only its input and output. The view of party i during an execution of π on (x; y) is denoted by view_i^π(x, y), which equals (w, r^i, m_1^i, …, m_t^i), where w ∈ {x, y} is the input of party i, r^i is the outcome of i’s internal random coin tosses, and m_j^i represents the jth message that it received.

• For party i = 1: the view must be simulatable from its input x and its output f_1(x, y).

• For party i = 2: the view must be simulatable from its input y and its output f_2(x, y).

Basic Protocol

• Conceptually the protocol is very simple: the client computes a Bloom filter (BF) that encodes its set C, and the server computes a garbled Bloom filter (GBF) that encodes its set S.

• Then they run an oblivious transfer (OT) protocol so that the client obtains a garbled Bloom filter (GBF) that represents the intersection, and the server learns nothing.

• Then the client queries the intersection garbled Bloom filter and obtains the intersection contents.

Garbled Bloom Filters

• A garbled Bloom filter (GBF) is the garbled version of a standard Bloom filter.

Common features (GBF & general BF):

• From a high-level point of view (the low-level view is on the next slide), there is no difference between a garbled Bloom filter and a Bloom filter: it encodes a set of at most n elements in an array of length m, and it supports membership queries with zero false negatives and a very small false positive probability.

• To add an element, the element is mapped by k independent uniform hash functions into k index numbers, and the corresponding array locations are updated.

• To query an element, the element is mapped by the same k hash functions into k index numbers, and the corresponding array locations are checked.

Low-level point of view

Different from general BF (change BF’s “bits” to “strings”):

• From a low-level point of view, a garbled Bloom filter is backed by a different data structure. Namely, instead of using an array of bits, a garbled Bloom filter uses an array of λ-bit strings, where λ is a security parameter.

• In the rest of the paper, we use (m, n, k, H, λ)-garbled Bloom filter to denote a garbled Bloom filter parameterized by (m, n, k, H, λ), and we denote a garbled Bloom filter that encodes a set S by GBF_S.

Add an element

• To add an element x to a garbled Bloom filter, we split the element itself into k λ-bit shares using the XOR-based secret sharing scheme. Each share has λ bits and occupies one position in the GBF; remember that a GBF stores a string in each position, not a single bit.

• The element is also mapped into k index numbers, and we store one λ-bit share in each location h_i(x). Note this is a very loose description; the actual process is more complicated.

• To query an element y, we collect all bit strings at the locations h_i(y) and XOR them together. If the result is y, then y is in S; otherwise y is not in S.

Figure note: a location that more than one element hashes to holds a common “share”.

Algorithm 1

In Algorithm 1, we first create an empty garbled Bloom filter and initialize each location to NULL (lines 1-4).

Annotations on the algorithm listing:

• Determine the location in the GBF.

• Put a string (a share) in the GBF.

• Get the next share.

• Put a random string there.
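The algorithm listing itself did not survive in this transcript, so the following Python sketch reconstructs the build procedure from the description and annotations above. It is not the authors' code. The names `build_gbf`, `indexes` and `encode`, the choice λ = 128 bits, the SHA-256-based hashing, and the de-duplication of repeated index positions are all assumptions of the sketch.

```python
import hashlib
import secrets

LAMBDA_BYTES = 16  # lambda = 128 bits, an illustrative choice


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def indexes(element: bytes, m: int, k: int) -> list[int]:
    # Distinct positions among h_0(x), ..., h_{k-1}(x), in order of first use.
    raw = [
        int.from_bytes(hashlib.sha256(bytes([i]) + element).digest(), "big") % m
        for i in range(k)
    ]
    return list(dict.fromkeys(raw))


def build_gbf(elements, m: int, k: int) -> list[bytes]:
    gbf = [None] * m  # every location starts as NULL
    for x in elements:
        empty_slot = None
        final_share = x
        for j in indexes(x, m, k):  # determine the location in the GBF
            if gbf[j] is None:
                if empty_slot is None:
                    empty_slot = j  # reserve one free slot for the final share
                else:
                    gbf[j] = secrets.token_bytes(LAMBDA_BYTES)  # put a new share
                    final_share = xor_bytes(final_share, gbf[j])
            else:
                # Location already used by an earlier element: reuse that share.
                final_share = xor_bytes(final_share, gbf[j])
        if empty_slot is None:
            raise RuntimeError("no free location left for this element")
        gbf[empty_slot] = final_share
    # Locations that no element hashed to receive random strings.
    return [s if s is not None else secrets.token_bytes(LAMBDA_BYTES) for s in gbf]


def encode(item: str) -> bytes:
    # Map an arbitrary item to a lambda-bit string (an assumption of this demo).
    return hashlib.sha256(item.encode()).digest()[:LAMBDA_BYTES]


server_set = [encode(s) for s in ("apple", "banana", "cherry")]
gbf = build_gbf(server_set, m=256, k=8)

# XORing the strings at an element's positions gives back the element.
x = server_set[0]
acc = bytes(LAMBDA_BYTES)
for j in indexes(x, 256, 8):
    acc = xor_bytes(acc, gbf[j])
print(acc == x)
```

Existing shares are never overwritten, so elements added earlier stay recoverable when later elements reuse their locations. The `RuntimeError` branch covers the rare case in which every position of a new element is already occupied.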

Query the garbled Bloom Filter

• In a garbled Bloom filter, each location holds a λ-bit string that is either (1) a share of certain elements or (2) a random string (see the previous algorithm). By analogy, a share in a garbled Bloom filter is equivalent to a “1” bit in a Bloom filter, and a random string is equivalent to a “0” bit.

• As with Bloom filters, there is no false negative when using a GBF because (1) all shares of an encoded element are guaranteed to be retrievable, and (2) the XOR-based secret sharing scheme always produces the original element when all shares are available.

• When using a GBF, we need to consider and differentiate the following two probabilities.

• (1) The collision probability of a GBF is the probability that y is NOT in S, but it hashes to the same set of index numbers as some x in S. A collision does not cause a false positive: the recovered string (Algorithm 2) is x, not y, so the query result is still false. However, it reveals x. The collision probability is negligible in k.

• Loosely, we can use the upper bound of the false positive probability of a Bloom filter as the upper bound of the collision probability of a garbled Bloom filter. Note that collisions do not affect the security of our protocol, but may be a concern if a GBF is used in other protocols.

• (2) The “false positive” probability of GBF_S is the probability that y is not in S but the recovered string equals y coincidentally. This probability is at most 2^-λ.

Algorithm 2

In summary, with proper parameters, a garbled Bloom filter exhibits properties similar to those of a Bloom filter when encoding set membership: no false negatives and negligible false positives.

Annotation on the algorithm listing: each time a share is fetched, it is XORed into the running result; the final result is the completely retrieved element.
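The query procedure can be sketched as follows. To keep the example self-contained, the GBF is built by hand for a single element rather than with a full build routine. The parameters `M`, `K`, `LAMBDA_BYTES` and all function names are illustrative assumptions.

```python
import hashlib
import secrets

LAMBDA_BYTES = 16  # lambda = 128 bits, an illustrative choice
M, K = 64, 4       # illustrative filter length and number of hash functions


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def indexes(element: bytes) -> list[int]:
    # Distinct positions among h_0(y), ..., h_{k-1}(y).
    raw = [
        int.from_bytes(hashlib.sha256(bytes([i]) + element).digest(), "big") % M
        for i in range(K)
    ]
    return list(dict.fromkeys(raw))


def query_gbf(gbf: list[bytes], y: bytes) -> bool:
    recovered = bytes(LAMBDA_BYTES)  # all-zero string
    for j in indexes(y):
        recovered = xor_bytes(recovered, gbf[j])  # XOR in each fetched share
    return recovered == y


# Hand-build a GBF that encodes one element x: random strings everywhere,
# then fix one of x's positions so that the strings at its positions XOR to x.
x = hashlib.sha256(b"apple").digest()[:LAMBDA_BYTES]
gbf = [secrets.token_bytes(LAMBDA_BYTES) for _ in range(M)]
pos = indexes(x)
partial = bytes(LAMBDA_BYTES)
for j in pos[1:]:
    partial = xor_bytes(partial, gbf[j])
gbf[pos[0]] = xor_bytes(x, partial)

y = hashlib.sha256(b"durian").digest()[:LAMBDA_BYTES]
print(query_gbf(gbf, x))  # x was encoded, so the XOR of its shares is x
print(query_gbf(gbf, y))  # y was not encoded; a match has probability about 2^-128
```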

Produce an Intersection GBF

• We show how to produce an intersection garbled Bloom filter from an (m, n, k, H)-Bloom filter (client) and an (m, n, k, H, λ)-garbled Bloom filter (server).

• The idea is quite similar to creating an intersection Bloom filter by ANDing two Bloom filters.

• Let’s say we have an (m, n, k, H)-Bloom filter BF_C that encodes a set C and an (m, n, k, H, λ)-garbled Bloom filter GBF_S that encodes a set S. We use Algorithm 3 to build the intersection garbled Bloom filter GBF_{C∩S}.

The intuition of the algorithm is this:

(1) If an element x is in C ∩ S, then for every position i it hashes to, BF_C[i] must be a 1 bit and GBF_S[i] must be a share of x. Therefore, by running the algorithm, all shares of x are copied to the new garbled Bloom filter. That is, all elements in C ∩ S are preserved in the new garbled Bloom filter.

(2) On the other hand, if x is not in C ∩ S, then with high probability at least one share will not be copied. In other words, elements not in C ∩ S are eliminated from the new garbled Bloom filter.

Thus the new garbled Bloom filter is indeed a garbled Bloom filter that encodes the intersection.

Annotations on Algorithm 3:

• GBF_S is controlled by the server.

• BF_C is controlled by the client.

• Based on the client’s query requirement (a 1 bit in BF_C), we take the string at the server’s corresponding position and put it in the “intersection” GBF.

From a security point of view, a more interesting property of the intersection GBF is that it is indistinguishable from a GBF built from scratch that encodes C ∩ S. That is, the outputs of Algorithms 1 and 3 look almost the same, and we want adversaries to be unable to distinguish them.

λ: string size of each element of the GBF.
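The copying step can be sketched in Python as below. The listing of Algorithm 3 is not reproduced in this transcript, so filling the positions where the client's bit is 0 with fresh random strings is an assumption of this sketch, chosen so that those positions look like the random strings of a GBF built from scratch. The name `intersect_gbf` is also illustrative.

```python
import secrets

LAMBDA_BYTES = 16  # lambda = 128 bits, an illustrative choice


def intersect_gbf(bf_client: list[int], gbf_server: list[bytes]) -> list[bytes]:
    """Copy GBF_S[i] where BF_C[i] is 1; otherwise put a fresh random string."""
    if len(bf_client) != len(gbf_server):
        raise ValueError("both filters must have the same length m")
    return [
        share if bit == 1 else secrets.token_bytes(LAMBDA_BYTES)
        for bit, share in zip(bf_client, gbf_server)
    ]


bf_c = [1, 0, 1, 1, 0, 0]                                   # client's Bloom filter bits
gbf_s = [secrets.token_bytes(LAMBDA_BYTES) for _ in bf_c]   # server's GBF strings
gbf_cap = intersect_gbf(bf_c, gbf_s)

# Positions with a 1 bit carry the server's string; the others are unrelated.
print([gbf_cap[i] == gbf_s[i] for i in range(len(bf_c))])
```

In the protocol no single party runs this loop in the clear, since the client holds BF_C and the server holds GBF_S. The sketch only shows the result that the two-party computation is meant to produce.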

Oblivious Bloom Intersection

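• The idea of the basic protocol is shown in Figure 2: the two parties run Algorithm 3 jointly using oblivious transfer. Thus we call it oblivious Bloom intersection.

• The protocol runs as follows:

(1) The client builds BF_C from its set C, and the server builds GBF_S from its set S.

(2) The two parties run an OT protocol in which the server is the sender and the client is the receiver, using the bit array of BF_C as its selection string.

(3) The client obtains GBF_{C∩S}, queries it with the elements of C, and outputs the elements that are found.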

It is secure

• Informally, the correctness of the protocol follows from Theorems 3 and 6. The protocol produces a garbled Bloom filter that encodes C ∩ S; by querying it, the client obtains the correct intersection except with negligible probability.

• To see why the protocol is secure, notice that the only messages sent in the protocol are the messages of the OT protocol.

• The client’s privacy is protected because the server learns no information about BF_C in the OT execution.

• The server’s privacy is protected because the client receives only GBF_{C∩S} from the server, and it contains only information about elements in C ∩ S.

The basic protocol is secure!

The Enhanced PSI protocol (uses encryption!)

Why enhanced protocol?

• In the basic protocol, the interaction between the two parties is essentially an oblivious transfer.

• At first glance, it seems that we can easily obtain a fully secure protocol by replacing the semi-honest OT protocol with one that is secure against malicious parties. However, this is not enough. A fully secure OT protocol can prevent malicious behaviors such as changing input during the protocol execution, but it cannot prevent a malicious client from mounting a full universe attack.

• In a full universe attack, a malicious client encodes the full universe of all possible elements in its Bloom filter and uses it in the PSI protocol to learn the server’s entire set. A Bloom filter can easily represent the full universe by setting all the bits to 1. This is a special feature of Bloom filters, and it causes a problem when we try to construct a simulator for the client in the malicious model. Namely, when the adversary uses the all-one Bloom filter, the simulator needs to enumerate all elements in the universe and send them to the trusted party in the ideal process.

• Without making any assumptions, the universe is potentially too large, and a polynomial-time algorithm may fail to enumerate all elements.

Prevent the full universe attack

• To prevent the full universe attack, we add a step to make sure that the client’s Bloom filter is not all-one. More specifically, the server uses a symmetric key block cipher to encrypt the strings in its garbled Bloom filter before transferring them to the client.

• The server forces the client to behave honestly by splitting the key into m shares using a secret sharing scheme. The client uses the bit array in its Bloom filter as the selection string to receive the intersection garbled Bloom filter and the shares of the key. If the bit in the selection string is 0, the client receives a share of the key; if the bit is 1, the client receives an encrypted string in GBF_S.

• The intuition is that if the client cheats by using an all-one Bloom filter, it will not be able to gather enough shares to recover the key, and thus will not be able to decrypt the encrypted garbled Bloom filter.
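The counting argument behind this intuition can be illustrated with a short sketch. It models only what the client ends up holding, with no encryption or OT involved. The `threshold` parameter is a hypothetical t of a (t, m)-secret sharing of the key, since the slides do not state which threshold the protocol uses, and the function names are made up for the example.

```python
def received_counts(bf_bits: list[int]) -> tuple[int, int]:
    """Return (number of key shares, number of encrypted GBF strings) obtained."""
    key_shares = sum(1 for b in bf_bits if b == 0)  # a 0 bit yields a key share
    encrypted = sum(1 for b in bf_bits if b == 1)   # a 1 bit yields an encrypted string
    return key_shares, encrypted


def can_decrypt(bf_bits: list[int], threshold: int) -> bool:
    # `threshold` is a hypothetical parameter: the number of key shares needed
    # to reconstruct the block cipher key.
    key_shares, _ = received_counts(bf_bits)
    return key_shares >= threshold


m = 16
honest_bf = [1, 0] * (m // 2)  # about half of the bits are set
all_one_bf = [1] * m           # full universe attack

print(can_decrypt(honest_bf, threshold=m // 2))   # True: 8 key shares collected
print(can_decrypt(all_one_bf, threshold=m // 2))  # False: no key share collected
```

The more 1 bits the client claims, the fewer key shares it collects, so a filter that is too dense leaves the client with ciphertexts it cannot open.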

Implementation

• The authors have implemented a prototype of the basic protocol in C. The source code (and its Java port) is released online:
http://personal.cis.strath.ac.uk/changyu.dong/PSI/PSI.html

• It uses OpenSSL (1.0.1e) for the cryptographic operations. Keyed SHA-1 is currently used to build and query Bloom filters and garbled Bloom filters.

Implement in MapReduce !

Extremely Big Data Set & Cloud Computing

• In practice, to process an extremely big data set, we have to distribute the task across multiple computers. New computing paradigms such as cloud computing make it possible to execute such distributed tasks “on demand”.

• Our protocol can be easily deployed on cloud platforms. Here we show how to do it with the semi-honest protocol; the fully secure protocol case is similar. From a high-level point of view, the client and the server throw their elements into b bins using a hash function.

• Then they build Bloom filters and garbled Bloom filters for each bin. The parameter k is still determined by the desired false positive probability, and the parameter m is determined by k and the bin size. The filters are associated with the bin number. Then for each 0 ≤ i < b, the server uses OT to transfer the garbled Bloom filter for bin i to the client, who uses its Bloom filter for bin i as the selection string. The client then queries all elements in its bin i against the received garbled Bloom filter and adds any positive elements to the result set.

• In the end, the client has the intersection. Conceptually, this splits a big set into b smaller sets, each of which can be handled by a single node. It is correct because the two parties use the same hash function, so an element thrown by the server into bin i will also be thrown by the client into bin i. The idea can be implemented easily using the MapReduce programming model [19].
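The binning idea can be sketched without any MapReduce framework. In the Python sketch below, `per_bin_psi` is a placeholder in which a plain set intersection stands in for one run of the oblivious Bloom intersection protocol on a single node. The function names and the SHA-256-based bin assignment are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def bin_of(element: str, b: int) -> int:
    # Both parties must use this same hash function to pick the bin.
    return int.from_bytes(hashlib.sha256(element.encode()).digest(), "big") % b


def to_bins(elements, b: int) -> dict[int, set[str]]:
    bins = defaultdict(set)
    for e in elements:
        bins[bin_of(e, b)].add(e)
    return bins


def per_bin_psi(client_bin: set[str], server_bin: set[str]) -> set[str]:
    # Placeholder for one protocol run on bin i (BF, GBF, OT, query).
    return client_bin & server_bin


def distributed_intersection(client, server, b: int) -> set[str]:
    c_bins, s_bins = to_bins(client, b), to_bins(server, b)
    result = set()
    for i in range(b):  # each bin is an independent task that could run on its own node
        result |= per_bin_psi(c_bins.get(i, set()), s_bins.get(i, set()))
    return result


client = {"a", "b", "c", "d"}
server = {"c", "d", "e"}
print(distributed_intersection(client, server, b=4) == client & server)
```

Since a common element always lands in the same bin on both sides, the union of the per-bin results equals the full intersection. In a MapReduce setting, the map phase would correspond to `to_bins` and the reduce phase to the per-bin protocol runs.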


Uploaded on Oct 04, 2024


When Private Set Intersection Meets Big Data: An Efficient and Scalable Protocol ECE 693 Big Data Security

Abstract Large scale data processing brings new challenges to the design of privacy- preserving protocols: how to meet the increasing requirements of speed and throughput of modern applications, and how to scale up smoothly when data being protected is big. Efficiency and scalability become critical criteria for privacy preserving protocols in the age of Big Data. In this paper, we present a new Private Set Intersection (PSI) protocol that is extremely efficient and highly scalable compared with existing protocols. The protocol is based on a novel approach that we call oblivious Bloom intersection. It has linear complexity and relies mostly on efficient symmetric key operations. It has high scalability due to the fact that most operations can be parallelized easily. The protocol has two versions: a basic protocol and an enhanced protocol, the security of the two variants is analyzed and proved in the semi-honest model and the malicious model respectively.

Needs Examples: (1) geneticists need to search 3 billion base pairs in personal genome to find genetic disorders that might cause diabetes or cancers, (2) epidemiologists need to link multiple medical databases that contain millions of patients records to identify risk factors for diseases, and (3) online retailers want to correlate petabytes of their transaction records with customers social network activities, hoping to increase customer satisfaction. Any privacy-preserving data processing service is not cost free and this has brought us new challenges: how to meet the increasing requirements of speed and throughput of modern applications, and how to scale up smoothly when data being protected is big? With the prevalence of large scale data processing, efficiency and scalability become critical criteria for designing a privacy-preserving protocol in the age of Big Data .

Private Set Intersection (PSI) problem Definition: Namely, two parties, a client and a server, want to jointly compute the intersection of their private input sets in a manner that at the end the client learns the intersection and the server learns nothing. The PSI problem has been extensively studied for two reasons, firstly set intersection is a foundational primitive and secondly it has many practical applications. PSI protocols are often criticized as being impractical because the performance becomes unacceptable when the input size or the security parameter becomes large, and it is difficult to improve the performance by just adding hardware proportionally. The criticism is not unfounded.

Existing ones Currently two protocols claim to be the fastest PSI protocol: the RSA-OPRF-based protocol by De Cristofaro et al [15, 16] and the garbled circuit protocol by Huang et al [26]. Both protocols have a highly optimized implementation. We obtained the source code from the authors of these two protocols and tested the performance. To compute the intersection of two 1,048,576-element (220) sets, De Cristofaro s protocol needs 10.6 minutes at 80-bit security, but requires a much longer time at 256-bit security. We estimate the time to be approximately 131 hours from tests with smaller sets. The tests with million-element sets on Huang s protocol were unsuccessful because the Java Virtual Machine ran out of memory on the client computer that has 16 GB RAM. From tests with smaller sets, we estimate that Huang s protocol requires 27 hours and 51 hours respectively to compute the intersection at 80-bit and 256-bit security. Clearly to use PSI in real world applications, we need more practical protocols.

Contributions a new PSI protocol that is much more efficient than all the already existing PSI protocols. The protocol is designed based on a novel two-party computation approach, which makes use of a new variant of Bloom filters that we call garbled Bloom filters, and we refer the new approach as oblivious Bloom intersection. The ideas of garbled Bloom filters and oblivious Bloom intersection are general and have their own interests. Our PSI protocol has two versions: a basic protocol, security of which can be proved in the semi-honest model, and an enhanced protocol, security of which can be proved in the malicious model. The basic protocol has linear complexity (with a small constant factor) and relies mostly on symmetric key operations. It is fast even with large input sets, and when the security parameter increases, the performance degrades gracefully. Test results show it is orders of magnitude faster than the previous best protocols. The enhanced protocol is an extension of the basic protocol, that only increases the cost by a factor proportional to the security parameter.

scalability The computational, memory and communication complexities are all linear in the size of the input sets. More attractively, most operations in the protocol can be performed in the SPMD (single program, multiple data) fashion, which means little effort is required to separate the computation into a number of parallel tasks. Therefore it can fully take the advantage of parallel processing capacity provided by current multi-core CPUs, GPGPUs (General-purpose graphics processing unit) and cloud computing. As a result, the protocol is particularly suitable for Big Data oriented applications that have to process data in a parallelized and/or distributed way. We have implemented a proof of concept prototype of the basic protocol. To compute the intersection of two million-element sets, it needs only 41 seconds (80-bit) and 5.65 minutes (256-bit) on two moderate computers in parallel mode.

Notations

The BF is an array with m bits Bloom Filters The array can represent n elements, all belong to a set S K hash functions (H(.)) A Bloom filter [9] {m, n, k, H} is a compact data structure for probabilistic set membership testing. A Bloom filter is an array of m bits that can represent a set S of at most n elements. A Bloom filter comes with a set of k independent uniform hash functions H = {h0, , hk-1} that each himaps its elements (total n) to index numbers (total m) over the range [0, m-1] uniformly. Initially, all bits in the array are set to 0. To insert an element x (belonging to S) into the filter, the element is hashed using the k hash functions to get k index numbers. The bits at ALL these indexes in the bit array are set to 1, i.e. set BFS[hi(x)] = 1 for 0 < I < k -1. (total bits m >> k (each element generates k index positions) To check if an item y is in S, y is hashed by the k hash functions, and all indexes locations that y hashes to are checked. If ANY of the bits at the locations is 0 , y is certainly NOT in S, otherwise y is most likely (maybe not) in S.

i.e., the following wont happen: If an element is in fact NOT in the intersection of S1 and S2, but it is there based on BF intersection query result. That is, if an element is NOT in the set intersection, then BF will also tell NOT in there. Bloom filter Intersection set S1 S2 False positive: If an element is in the intersection, but the BF tells Not there. This can happen, but with a small chance.

Secret Sharing Secret sharing is a fundamental cryptographic primitive. It allows a dealer to split a secret s into n shares such that the secret s can be recovered efficiently with any subset of t or more shares. With any subset of less than t shares, the secret is unrecoverable and the shares give no information about the secret. Such a system is called a (t, n)-secret sharing scheme

Oblivious Transfer Protocol (definition) Oblivious transfer [39, 20] allows a sender to send part of its input to a receiver in a manner that protects both parties. Namely, the sender does not know which part the receiver receives, and the receiver does not learn any information about the other part of the sender s input. Generally Note that rj here can only be 0 or 1. It comes from the receiver s selection string r (one of its bits) Here the receiver may select jth Pair (one of the two in the pair, either left Xj,0 or right Xj,1) Receiver decides to select m specific indexes (each corresponding to an element in sender s x)

Reduce complexity for Oblivious transfer protocols Oblivious transfer protocols are costly and often become the efficiency bottleneck in protocol design. However it has been shown by Beaver that it is possible to obtain a large number of oblivious transfers given only a small number of actual oblivious transfer calls [7]. In this direction, efficient OT extensions were proposed in [27]. The extensions rely on the Random Oracle

Semi-honest Model We prove the security of the basic protocol in the presence of static semi-honest adversaries. In the model, the adversary controls one of the parties and follows the protocol specification exactly. However, it may try to learn more information about the other party s input. f Empty string means learn nothing about the other part (f2) .This is the main idea of PSI.

Semi-honest Model In the semi-honest model, a protocol is secure if whatever can be computed by a party in the protocol can be obtained from its input (x) and output (y) only. This is formalized by the simulation paradigm. We require a party s view in a protocol execution to be simulatable given only its input and output. The view of the party i during an execution of on (x; y) is denoted by which equals to where is the input of party i , ri is the output of i s internal random coin tosses, and represents the jth message that it received. For party i=1 For party i=2

Basic Protocol

Conceptually the protocol is very simple: the client computes a Bloom filter (BF) that encodes its set C, and the server computes a garbled Bloom filter (GBF) that encodes its set S. They then run an oblivious transfer (OT) protocol so that the client obtains a garbled Bloom filter that represents the intersection, while the server learns nothing. Finally, the client queries the intersection garbled Bloom filter and obtains the contents of the intersection.

Garbled Bloom Filters

A garbled Bloom filter (GBF) is the garbled version of a standard Bloom filter.

Common features (GBF and general BF): from a high-level point of view (the next slide covers the low level), there is no difference between a garbled Bloom filter and a Bloom filter. Each encodes a set of at most n elements in an array of length m and supports membership queries with no false negatives and a very small false-positive probability. To add an element, the element is mapped by k independent uniform hash functions into k index numbers, and the corresponding array locations are updated. To query an element, the element is mapped by the same k hash functions into k index numbers, and the corresponding array locations are checked.

Low-level point of view: change the BF's bits to strings

Different from general BF: from a low-level point of view, a garbled Bloom filter is backed by a different data structure. Namely, instead of using an array of bits, a garbled Bloom filter uses an array of λ-bit strings, where λ is a security parameter. In the rest of the paper, we use (m, n, k, H, λ)-garbled Bloom filter to denote a garbled Bloom filter parameterized by (m, n, k, H, λ), we denote a garbled

Add an element

To add an element x to a garbled Bloom filter, we split the element itself into k shares of λ bits each, using the XOR-based secret sharing scheme. Each share occupies one position in the GBF; remember that the GBF stores a string in each position, not a single bit. The element is also mapped into k index numbers, and we store one λ-bit share at each location hi(x). Note that this is a very loose description; the actual process is more complicated.

To query an element y, we collect all bit strings at the locations hi(y) and XOR them together. If the result is y, then y is in S; otherwise y is not in S.

(Figure label: common share.)

Algorithm 1

In Algorithm 1, we first create an empty garbled Bloom filter and initialize each location to NULL (lines 1-4).

Annotations on the remaining steps of the algorithm: determine the location in the GBF; get the next share; put a string (a share) in the GBF; put a random string in each location that is still empty.
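The annotated steps can be sketched as follows. This is a minimal illustration, not the authors' C implementation: elements are modelled as integers below 2^λ, λ = 128 is an assumed value, and the salted SHA-256 index function `indexes` is a hypothetical stand-in for the k hash functions in H.

```python
import hashlib
import secrets

LAMBDA = 128  # assumed security parameter: bit length of every stored string


def indexes(x, k, m):
    """Map element x to its hash positions (illustrative salted SHA-256)."""
    data = x.to_bytes(LAMBDA // 8, "big")
    raw = [
        int.from_bytes(hashlib.sha256(bytes([i]) + data).digest(), "big") % m
        for i in range(k)
    ]
    return list(dict.fromkeys(raw))  # drop repeated positions, keep order


def build_gbf(elements, m, k):
    gbf = [None] * m  # empty filter: every location starts as NULL
    for x in elements:
        empty_slot = None
        final_share = x
        for j in indexes(x, k, m):  # determine the location in the GBF
            if gbf[j] is None:
                if empty_slot is None:
                    empty_slot = j  # reserve one free location for the last share
                else:
                    gbf[j] = secrets.randbits(LAMBDA)  # put a fresh share here
                    final_share ^= gbf[j]
            else:
                final_share ^= gbf[j]  # location already holds a share: reuse it
        if empty_slot is None:
            raise RuntimeError("no free location left for this element")
        gbf[empty_slot] = final_share  # all shares of x now XOR to x
    # Put a random string in every location that is still NULL.
    return [secrets.randbits(LAMBDA) if s is None else s for s in gbf]


server_set = [11, 22, 33]
gbf = build_gbf(server_set, m=256, k=4)
```

Reusing a string that is already stored is what lets two elements share a location without corrupting each other, at the cost of a small failure probability when an element finds no free location at all.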

Query the garbled Bloom Filter

In a garbled Bloom filter, each location is a λ-bit string that is either (1) a share of certain elements or (2) a random string (see the previous algorithm). By analogy, a share in a garbled Bloom filter is equivalent to a 1 bit in a Bloom filter, and a random string is equivalent to a 0 bit. As with Bloom filters, there are no false negatives when using a GBF because (1) all shares of an encoded element are guaranteed to be retrievable, and (2) the XOR-based secret sharing scheme always produces the original element when all shares are available.

When using a GBF, we need to consider and differentiate the following two probabilities:

(1) The collision probability of a GBF is the probability that y is NOT in S but hashes to the same set of index numbers as some x in S. A collision does not cause a false positive: the recovered string (Algorithm 2) is x, not y, so the query result is still false. However, it reveals x. The collision probability is negligible in k. Loosely, we can use the upper bound of the false-positive probability of a Bloom filter as the upper bound of the collision probability of a garbled Bloom filter. Note that collisions do not affect the security of our protocol, but they may be a concern if a GBF is used in other protocols.

(2) The false-positive probability of GBF_S is the probability that y is not in S but the recovered string equals y coincidentally. This probability is at most 2^(-λ).

Algorithm 2

In summary, with proper parameters, a garbled Bloom filter exhibits properties similar to those of a Bloom filter when encoding set membership: no false negatives and a negligible false-positive probability.

Annotation on the algorithm: each time a share is fetched, it is XORed into a running value; the final result is the completely recovered element.
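A query can be sketched in a few lines. As an illustration only, the filter below is built by hand for a single element; integers stand in for λ-bit strings, λ = 128 is an assumed value, and `indexes` is a hypothetical salted SHA-256 stand-in for the k hash functions.

```python
import hashlib
import secrets

LAMBDA = 128  # assumed security parameter


def indexes(y, k, m):
    """Map element y to its hash positions (illustrative salted SHA-256)."""
    data = y.to_bytes(LAMBDA // 8, "big")
    raw = [
        int.from_bytes(hashlib.sha256(bytes([i]) + data).digest(), "big") % m
        for i in range(k)
    ]
    return list(dict.fromkeys(raw))  # drop repeated positions, keep order


def query_gbf(gbf, y, k):
    recovered = 0
    for j in indexes(y, k, len(gbf)):
        recovered ^= gbf[j]  # each time we get a share, XOR it in
    return recovered == y  # match means y was encoded in the filter


# Hand-built filter holding the single element x = 42.
m, k, x = 64, 4, 42
gbf = [secrets.randbits(LAMBDA) for _ in range(m)]
positions = indexes(x, k, m)
rest = 0
for j in positions[1:]:
    rest ^= gbf[j]
gbf[positions[0]] = x ^ rest  # now the strings at x's positions XOR to x

found = query_gbf(gbf, 42, k)
missing = query_gbf(gbf, 43, k)
```

The query for 42 always succeeds, while the query for 43 succeeds only if a XOR of random strings happens to equal 43, which is the 2^(-λ) false-positive event.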

Produce an Intersection GBF

We show how to produce an intersection garbled Bloom filter from an (m, n, k, H)-Bloom filter (client) and an (m, n, k, H, λ)-garbled Bloom filter (server). The idea is quite similar to creating an intersection Bloom filter by ANDing two Bloom filters. Let's say we have an (m, n, k, H)-Bloom filter BF_C that encodes a set C and an (m, n, k, H, λ)-garbled Bloom filter GBF_S that encodes a set S. We use Algorithm 3 to build the intersection garbled Bloom filter GBF_{C∩S}.

Annotations on Algorithm 3: the GBF is controlled by the server, and the BF is controlled by the client. Based on the client's query requirement (the 1 bits of its BF), we take the string at the server's corresponding position and put it in the intersection GBF.

The intuition of the algorithm is this: (1) if an element x is in C ∩ S, then for every position i it hashes to, BF_C[i] must be a 1 bit and GBF_S[i] must be a share of x. Therefore, by running the algorithm, all shares of x are copied to the new garbled Bloom filter. That is, all elements in C ∩ S are preserved in the new garbled Bloom filter. (2) On the other hand, if x is not in C ∩ S, then with high probability at least one share will not be copied. In other words, elements not in C ∩ S are eliminated from the new garbled Bloom filter. Thus the new garbled Bloom filter is indeed a garbled Bloom filter that encodes the intersection.
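The copy-or-randomize rule behind this intuition can be sketched directly. The function name `intersect_gbf`, the value λ = 128, and the toy inputs are assumptions made for illustration; integers stand in for λ-bit strings.

```python
import secrets

LAMBDA = 128  # assumed security parameter


def intersect_gbf(bf_c, gbf_s):
    """Build the intersection GBF from the client's BF and the server's GBF.

    Position i keeps the server's string where the client's bit is 1 and
    receives a fresh random string where the client's bit is 0.
    """
    if len(bf_c) != len(gbf_s):
        raise ValueError("both filters must have the same length m")
    return [
        share if bit == 1 else secrets.randbits(LAMBDA)
        for bit, share in zip(bf_c, gbf_s)
    ]


bf_c = [1, 0, 1, 1, 0, 0]                          # client's Bloom filter bits
gbf_s = [secrets.randbits(LAMBDA) for _ in bf_c]   # server's garbled Bloom filter
gbf_int = intersect_gbf(bf_c, gbf_s)
```

An element of S survives only if every one of its positions carries a 1 bit in the client's filter, which mirrors ANDing two ordinary Bloom filters.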

From a security point of view, a more interesting property of the intersection GBF is that it is indistinguishable from a GBF built from scratch that encodes C ∩ S. That is, the output of Algorithm 3 looks almost the same as the output of Algorithm 1 run on C ∩ S, and we want an adversary to be unable to distinguish them.

λ: string size (in bits) of each element of the GBF

Oblivious Bloom Intersection

The idea of the basic protocol is shown in Figure 2: the two parties jointly run Algorithm 3 using oblivious transfer. We therefore call it oblivious Bloom intersection. The protocol runs as follows:

It is secure

Informally, the correctness of the protocol follows from Theorems 3 and 6. The protocol produces a garbled Bloom filter that encodes C ∩ S; by querying it, the client obtains the correct intersection except with negligible probability.

To see why the protocol is secure, notice that the only messages sent in the protocol are the messages of the OT protocol. The client's privacy is protected because the server learns no information about BF_C in the OT execution. The server's privacy is protected because the client receives only GBF_{C∩S} from the server, and it contains only information about elements in C ∩ S.

The basic protocol is secure!

The Enhanced PSI Protocol (uses encryption!)

Why an enhanced protocol?

In the basic protocol, the interaction between the two parties is essentially an oblivious transfer. At first glance, it seems that we can easily obtain a fully secure protocol by replacing the semi-honest OT protocol with one that is secure against malicious parties. However, this is not enough. A fully secure OT protocol can prevent malicious behaviors such as changing input during the protocol execution, but it cannot prevent a malicious client from mounting a full universe attack.

In a full universe attack, a malicious client encodes the full universe of all possible elements in its Bloom filter and uses it in the PSI protocol to learn the server's entire set. A Bloom filter can easily represent the full universe by setting all the bits to 1. This is a special feature of Bloom filters, and it causes a problem when we try to construct a simulator for the client in the malicious model. Namely, when the adversary uses the all-one Bloom filter, the simulator needs to enumerate all elements in the universe and send them to the trusted party in the ideal process. Without making any assumptions, the universe is potentially too large, and a polynomial-time algorithm may fail to enumerate all elements.

Prevent the full universe attack

To prevent the full universe attack, we add a step to make sure that the client's Bloom filter is not all-one. More specifically, the server uses a symmetric-key block cipher to encrypt the strings in its garbled Bloom filter before transferring them to the client. It forces the client to behave honestly by splitting the key into m shares using a secret sharing scheme. The client uses the bit array in its Bloom filter as the selection string to receive the intersection garbled Bloom filter and the shares of the key. If a bit in the selection string is 0, the client receives a share of the key; if the bit is 1, the client receives an encrypted string from GBF_S. The intuition is that if the client cheats by using an all-one Bloom filter, it will not be able to gather enough shares to recover the key, and thus will not be able to decrypt the encrypted garbled Bloom filter.
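The key-splitting step can be illustrated with Shamir's threshold secret sharing. The slide does not say which scheme or threshold is used, so the choice of Shamir, the threshold of m/2, the field modulus, and all names below are assumptions; block-cipher encryption of the GBF strings is omitted.

```python
import secrets

PRIME = 2 ** 127 - 1  # Mersenne prime used as the field modulus (illustrative)


def split_key(key, threshold, m):
    """Shamir sharing: any `threshold` of the m shares recover `key`."""
    coeffs = [key] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, m + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_key(shares):
    """Lagrange interpolation of the sharing polynomial at 0."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


m, threshold = 8, 4                      # threshold m/2 is an assumption
key = secrets.randbelow(PRIME)           # key modelled as a field element
key_shares = split_key(key, threshold, m)

honest_bf = [1, 0, 0, 1, 0, 1, 0, 1]     # 0 bit at position i -> receive share i
honest_shares = [key_shares[i] for i, bit in enumerate(honest_bf) if bit == 0]

all_one_bf = [1] * m                     # full universe attack
cheater_shares = [key_shares[i] for i, bit in enumerate(all_one_bf) if bit == 0]
```

An honest filter has enough 0 bits to collect a threshold of shares and rebuild the key, whereas the all-one filter collects none and is left holding ciphertexts it cannot decrypt.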

Implementation

The authors have implemented a prototype of the basic protocol in C. The source code (and its Java port) is released online: http://personal.cis.strath.ac.uk/changyu.dong/PSI/PSI.html. It uses OpenSSL (1.0.1e) for the cryptographic operations. Keyed SHA-1 is currently used to build and query Bloom filters and garbled Bloom filters.

Implement in MapReduce!

Extremely Big Data Set & Cloud Computing

In practice, to process an extremely big data set, we have to distribute the task across multiple computers. New computing paradigms such as cloud computing make it possible to execute such distributed tasks on demand. Our protocol can be easily deployed on cloud platforms. Here we show how to do it with the semi-honest protocol; the fully secure protocol case is similar.

From a high-level point of view, the client and the server throw their elements into b bins using a hash function. Then they build Bloom filters and garbled Bloom filters for each bin. The parameter k is still determined by the desired false-positive probability, and the parameter m is determined by k and the bin size. The filters are associated with the bin number. Then for each 0 ≤ i < b, the server uses OT to transfer the garbled Bloom filter for bin i to the client, who uses its Bloom filter for bin i as the selection string. The client then queries all elements in its bin i against the received garbled Bloom filter and adds any positive elements to the result set. In the end, the client has the intersection.

Conceptually, this splits a big set into b smaller sets that can each be handled by a single node. It is correct because the two parties use the same hash function, so an element thrown by the server into bin i will also be thrown by the client into bin i. The idea can be implemented easily using the MapReduce programming model [19].
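The binning idea can be sketched without any MapReduce machinery. In this illustration the per-bin PSI run is replaced by a plain set intersection, and the SHA-256 based `bin_of` function and the sample elements are assumptions.

```python
import hashlib


def bin_of(element, b):
    """Assign an element to one of b bins with a hash both parties share."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % b


def partition(elements, b):
    bins = [[] for _ in range(b)]
    for e in elements:
        bins[bin_of(e, b)].append(e)
    return bins


def binned_intersection(client, server, b):
    client_bins = partition(client, b)
    server_bins = partition(server, b)
    result = set()
    for i in range(b):
        # Stand-in for one PSI run on bin i (BF, GBF and OT in the protocol).
        # Each iteration is independent, so it could run on its own node.
        result |= set(client_bins[i]) & set(server_bins[i])
    return result


client_set = ["alice", "bob", "carol"]
server_set = ["bob", "carol", "dave"]
common = binned_intersection(client_set, server_set, b=4)
```

Because both sides use the same hash function, a common element always lands in the same bin number on both sides, so no intersection element is lost by processing the bins separately.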
