Efficient and Scalable Protocol for Private Set Intersection in Big Data Security

Slide Note

Large-scale data processing presents challenges for privacy-preserving protocols, particularly in terms of efficiency and scalability. This paper introduces a novel Private Set Intersection (PSI) protocol called oblivious Bloom intersection, offering linear complexity and high scalability. The protocol addresses the need for speed and throughput in modern applications dealing with big data while ensuring privacy.

lea_ste Follow

Uploaded on Oct 04, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

When Private Set Intersection Meets Big Data: An Efficient and Scalable Protocol ECE 693 Big Data Security

Abstract Large scale data processing brings new challenges to the design of privacy- preserving protocols: how to meet the increasing requirements of speed and throughput of modern applications, and how to scale up smoothly when data being protected is big. Efficiency and scalability become critical criteria for privacy preserving protocols in the age of Big Data. In this paper, we present a new Private Set Intersection (PSI) protocol that is extremely efficient and highly scalable compared with existing protocols. The protocol is based on a novel approach that we call oblivious Bloom intersection. It has linear complexity and relies mostly on efficient symmetric key operations. It has high scalability due to the fact that most operations can be parallelized easily. The protocol has two versions: a basic protocol and an enhanced protocol, the security of the two variants is analyzed and proved in the semi-honest model and the malicious model respectively.

Needs Examples: (1) geneticists need to search 3 billion base pairs in personal genome to find genetic disorders that might cause diabetes or cancers, (2) epidemiologists need to link multiple medical databases that contain millions of patients records to identify risk factors for diseases, and (3) online retailers want to correlate petabytes of their transaction records with customers social network activities, hoping to increase customer satisfaction. Any privacy-preserving data processing service is not cost free and this has brought us new challenges: how to meet the increasing requirements of speed and throughput of modern applications, and how to scale up smoothly when data being protected is big? With the prevalence of large scale data processing, efficiency and scalability become critical criteria for designing a privacy-preserving protocol in the age of Big Data .

Private Set Intersection (PSI) problem Definition: Namely, two parties, a client and a server, want to jointly compute the intersection of their private input sets in a manner that at the end the client learns the intersection and the server learns nothing. The PSI problem has been extensively studied for two reasons, firstly set intersection is a foundational primitive and secondly it has many practical applications. PSI protocols are often criticized as being impractical because the performance becomes unacceptable when the input size or the security parameter becomes large, and it is difficult to improve the performance by just adding hardware proportionally. The criticism is not unfounded.

Existing ones Currently two protocols claim to be the fastest PSI protocol: the RSA-OPRF-based protocol by De Cristofaro et al [15, 16] and the garbled circuit protocol by Huang et al [26]. Both protocols have a highly optimized implementation. We obtained the source code from the authors of these two protocols and tested the performance. To compute the intersection of two 1,048,576-element (220) sets, De Cristofaro s protocol needs 10.6 minutes at 80-bit security, but requires a much longer time at 256-bit security. We estimate the time to be approximately 131 hours from tests with smaller sets. The tests with million-element sets on Huang s protocol were unsuccessful because the Java Virtual Machine ran out of memory on the client computer that has 16 GB RAM. From tests with smaller sets, we estimate that Huang s protocol requires 27 hours and 51 hours respectively to compute the intersection at 80-bit and 256-bit security. Clearly to use PSI in real world applications, we need more practical protocols.

Contributions a new PSI protocol that is much more efficient than all the already existing PSI protocols. The protocol is designed based on a novel two-party computation approach, which makes use of a new variant of Bloom filters that we call garbled Bloom filters, and we refer the new approach as oblivious Bloom intersection. The ideas of garbled Bloom filters and oblivious Bloom intersection are general and have their own interests. Our PSI protocol has two versions: a basic protocol, security of which can be proved in the semi-honest model, and an enhanced protocol, security of which can be proved in the malicious model. The basic protocol has linear complexity (with a small constant factor) and relies mostly on symmetric key operations. It is fast even with large input sets, and when the security parameter increases, the performance degrades gracefully. Test results show it is orders of magnitude faster than the previous best protocols. The enhanced protocol is an extension of the basic protocol, that only increases the cost by a factor proportional to the security parameter.

scalability The computational, memory and communication complexities are all linear in the size of the input sets. More attractively, most operations in the protocol can be performed in the SPMD (single program, multiple data) fashion, which means little effort is required to separate the computation into a number of parallel tasks. Therefore it can fully take the advantage of parallel processing capacity provided by current multi-core CPUs, GPGPUs (General-purpose graphics processing unit) and cloud computing. As a result, the protocol is particularly suitable for Big Data oriented applications that have to process data in a parallelized and/or distributed way. We have implemented a proof of concept prototype of the basic protocol. To compute the intersection of two million-element sets, it needs only 41 seconds (80-bit) and 5.65 minutes (256-bit) on two moderate computers in parallel mode.

Notations

The BF is an array with m bits Bloom Filters The array can represent n elements, all belong to a set S K hash functions (H(.)) A Bloom filter [9] {m, n, k, H} is a compact data structure for probabilistic set membership testing. A Bloom filter is an array of m bits that can represent a set S of at most n elements. A Bloom filter comes with a set of k independent uniform hash functions H = {h0, , hk-1} that each himaps its elements (total n) to index numbers (total m) over the range [0, m-1] uniformly. Initially, all bits in the array are set to 0. To insert an element x (belonging to S) into the filter, the element is hashed using the k hash functions to get k index numbers. The bits at ALL these indexes in the bit array are set to 1, i.e. set BFS[hi(x)] = 1 for 0 < I < k -1. (total bits m >> k (each element generates k index positions) To check if an item y is in S, y is hashed by the k hash functions, and all indexes locations that y hashes to are checked. If ANY of the bits at the locations is 0 , y is certainly NOT in S, otherwise y is most likely (maybe not) in S.

i.e., the following wont happen: If an element is in fact NOT in the intersection of S1 and S2, but it is there based on BF intersection query result. That is, if an element is NOT in the set intersection, then BF will also tell NOT in there. Bloom filter Intersection set S1 S2 False positive: If an element is in the intersection, but the BF tells Not there. This can happen, but with a small chance.

Secret Sharing Secret sharing is a fundamental cryptographic primitive. It allows a dealer to split a secret s into n shares such that the secret s can be recovered efficiently with any subset of t or more shares. With any subset of less than t shares, the secret is unrecoverable and the shares give no information about the secret. Such a system is called a (t, n)-secret sharing scheme

Oblivious Transfer Protocol (definition) Oblivious transfer [39, 20] allows a sender to send part of its input to a receiver in a manner that protects both parties. Namely, the sender does not know which part the receiver receives, and the receiver does not learn any information about the other part of the sender s input. Generally Note that rj here can only be 0 or 1. It comes from the receiver s selection string r (one of its bits) Here the receiver may select jth Pair (one of the two in the pair, either left Xj,0 or right Xj,1) Receiver decides to select m specific indexes (each corresponding to an element in sender s x)

Reduce complexity for Oblivious transfer protocols Oblivious transfer protocols are costly and often become the efficiency bottleneck in protocol design. However it has been shown by Beaver that it is possible to obtain a large number of oblivious transfers given only a small number of actual oblivious transfer calls [7]. In this direction, efficient OT extensions were proposed in [27]. The extensions rely on the Random Oracle

Semi-honest Model We prove the security of the basic protocol in the presence of static semi-honest adversaries. In the model, the adversary controls one of the parties and follows the protocol specification exactly. However, it may try to learn more information about the other party s input. f Empty string means learn nothing about the other part (f2) .This is the main idea of PSI.

Semi-honest Model In the semi-honest model, a protocol is secure if whatever can be computed by a party in the protocol can be obtained from its input (x) and output (y) only. This is formalized by the simulation paradigm. We require a party s view in a protocol execution to be simulatable given only its input and output. The view of the party i during an execution of on (x; y) is denoted by which equals to where is the input of party i , ri is the output of i s internal random coin tosses, and represents the jth message that it received. For party i=1 For party i=2

Basic Protocol Conceptually the protocol is very simple: the client computes a Bloom filter (BF) that encodes its set C and the server computes a garbled Bloom filter (BGF) that encodes its set S. Then they run an oblivious transfer (OT) protocol so that the client obtains a garbled Bloom filter (GBF) that represents the intersection, and the server learns nothing. Then the client queries the intersection garbled Bloom filter and obtains the intersection contents.

Garbled Bloom Filters A garbled Bloom filter (GBF) is the garbled version of a standard Bloom filter. Common features (GBF & general BF): From a high level point of view (next slide: low level), there is no difference between a garbled Bloom filter and a Bloom filter: it encodes a set of at most n elements in an array of length m, it supports membership query with zero false negative and very small false positive. To add an element, the element is mapped by k independent uniform hash functions into k index numbers, and the corresponding array locations are updated. To query an element, the element is mapped by the same k hash functions into k index numbers, and the corresponding array locations are checked.

Low-level point of view Change BF s bits to strings Different from general BF: From a low-level point of view, a garbled Bloom filter is backed by a different data structure. Namely, instead of using an array of bits, a garbled Bloom filter uses an array of -bit strings, where is a security parameter. In the rest of the paper, we use (m, n, k, H, )-garbled Bloom filter to denote a garbled Bloom filter parameterized by (m, n, k, H, ), we denote a garbled

Add an element To add an element x to a garbled Bloom filter, we split the element itself into multiple k-bit shares (each share has k bits, which occupies a position in GBF; remember that GBF stores a string in each position, not a single bit), by using the XOR-based secret sharing scheme. The element is also mapped into k index numbers, and we store one k-bit share in each location ID = hi(x). Note this is a very loose description, the actual process is more complicated. To query an element y, we collect all bit strings at hi(y) and XOR them together. If the result is y, then y is in S, otherwise y is not in S. common share

Algorithm 1 In Algorithm 1, we first create an empty garbled Bloom filter and initialize each location to NULL (line 1-4). Determine location in GBF Get the next share Put a string (a share) in GBF Put a random string there

Query the garbled Bloom Filter In a garbled Bloom filter, each location is a-bit string that is either (1) a share of certain elements or (2) a random string (see previous algorithm). Analogously, a share in a gabled Bloom filter is equivalent to a 1 bit in a Bloom filter, and a random string is equivalent to a 0 bit. Same as the Bloom filters, there is no false negative when using a GBF because (1) all shares of an encoded element are guaranteed to be retrievable, and (2) the XOR-based secret sharing scheme always produces the original element when all shares are available. When using a GBF, we need to consider and differentiate the following two probabilities: (1) The collision probability of a GBF is the probability when y is NOT in S, but it hashes to the same set of index numbers as some x. A collision does not cause false positive: the recovered string (Algorithm 2) is x but not y so the query result is still false. However it reveals x. The collision probability is negligible in k. Loosely, we can use the upper bound of the false positive probability of a Bloom filter as the upper bound of the collision probability of a garbled Bloom filter. Note that collisions do not affect the security of our protocol, but may be a concern if a GBF is used in other protocols. (2) The false positive probability of GBFS is the probability when y is not in S but the recovered string equals y coincidentally. This probability is at most 2- .

Algorithm 2 In summary, with proper parameters, a garbled Bloom filter exhibits similar properties when encoding set membership: no false negative and negligible false positive. Each time get a share, XOR with it, then the final result is the completely retrieved element.

Produce an Intersection GBF we show how to produce an intersection garbled Bloom filter from an (m, n, k,H)-garbled Bloom filter (client) and an (m, n, k,H)-Bloom filter (server). The idea is quite similar to creating an intersection Bloom filter by ANDing two Bloom filters. Let s say we have an (m, n, k,H)-Bloom filter BFC that encodes a set C and an (m, n, k,H )-garbled Bloom filter GBFS that encodes a set S. We use Algorithm 3 to build the intersection garbled Bloom filter GBF C\S.

Based on clients query requirement, This is GBF is controlled by Server This is BF is controlled by Client From the server s corresponding position, we take that element and put in the intersection GBF The intuition of the algorithm is this: (1) if an element x is in C \ S, then for every position i it hashes to, BFC[i] must be a 1 bit and GBFS[i] must be a share of x. Therefore by running the algorithm, all shares of x are copied to the new garbled Bloom filter. That is, all elements in C \S are preserved in the new garbled Bloom filter. (2) On the other hand, if x is not in C \S, then with a high probability, at least one share will not be copied. In other words, elements not in C \ S are eliminated from the new garbled Bloom filter. Thus the new garbled Bloom filter is indeed a garbled Bloom filter that encodes the intersection.

From security point of view, a more interesting property of the intersection GBF is that it is indistinguishable from a GBF built from scratch that encodes C \ S. That is --- Therefore, Algorithm 1 & 3 produce the same result. We wish enemies do not distinguish them! (i.e., they look almost the same)

: String size in each element of GBF

Oblivious Bloom Intersection The idea of the basic protocol is shown in Figure 2. That is, to run Algorithm 3 by two parties using oblivious transfer. Thus we call it oblivious Bloom intersection. The protocol runs as follows:

It is secure Informally, the correctness of the protocol follows from Theorem 3 and 6. The protocol produces a garbled Bloom filter that encodes C \ S, then by querying it the client can obtain the correct intersection except for a negligible probability. To see why the protocol is secure, notice that the only messages being sent in the protocol are the messages in the OT protocol. The client s privacy is protected because the server learns no information about BFC in the OT execution. The server s privacy is protected because the client receives only GBF C\S from the server and it contains only information about elements in C \ S.

The basic protocol is secure!

The Enhanced PSI protocol (used Encryption!)

Why enhanced protocol? In the basic protocol, the interaction between the two parties is essentially an oblivious transfer. At the first glance, it seems that we can easily obtain a fully secure protocol by replacing the semi-honest OT protocol with one that is secure against malicious parties. However, this is not enough. A fully secure OT protocol can prevent malicious behaviors such as changing input during the protocol execution but it cannot prevent a malicious client from mounting a full universe attack. In a full universe attack, a malicious client encodes the full universe of all possible elements in its Bloom filter and uses it in the PSI protocol to learn the server s entire set. A Bloom filter can easily represent the full universe by setting all the bits to 1. This is a special feature of Bloom filters and it causes a problem when we try to construct a simulator for the client in the malicious model. Namely, when the adversary uses the all-one Bloom filter, the simulator needs to enumerate all elements in the universe and send them to the trusted party in the ideal process. Without making any assumptions, the universe is potentially too large and a polynomial time algorithm may fail to enumerate all elements.

Prevent the full universe attack To prevent the full universe attack, we add a step to make sure that the client s Bloom filter is not all-one. More specifically, the server uses a symmetric key block cipher to encrypt strings in its garbled Bloom filter before transferring them to the client. It forces the client to behave honestly by splitting the key into m shares using a secret sharing scheme. The client uses the bit array in its Bloom filter as the selection string to receive the intersection garbled Bloom filter and the shares of the key. If the bit in the selection string is 0, the client receives a share of the key; if the bit is 1, the client receives an encrypted string in GBFS. The intuition is that if the client cheats by using an all-one Bloom filter, it will not be able to gather enough shares to recover the key, and thus will not be able to decrypt the encrypted garbled Bloom filter.

Implementation They have implemented a prototype of the basic protocol in C. The source code (and its Java port) is released online. http://personal.cis.strath.ac.uk/changyu.dong/PSI/PSI.html It uses OpenSSL (1.0.1e) for the cryptographic operations. We currently use keyed SHA-1 to build/query Bloom filters and garbled Bloom filters

Implement in MapReduce !

Extremely Big Data Set & Cloud Computing In practice, to process extremely big data set, we have to distribute the task on multiple computers. New computing paradigms such as cloud computing make it possible to execute such distributed tasks on demand . Our protocol can be easily deployed on cloud platforms. Here we show how to do it with the semi-honest protocol. The fully secure protocol case is similar. From a high level point of view, the client and the server throw their elements into bins using an hash function. Then they build Bloom filters and garbled Bloom filters for each bin. The parameter k is still determined by the desired false positive probability, the parameter m is determined by k and the bin size. The filters are associated with the bin number. Then for each 0 i < b, the server uses OT to transfer the garbled Bloom filter for bin i to the client, who uses its Bloom filter for bin i as the selection string. The client then queries all elements in its bin i against the received garbled Bloom filter and adds any positive elements into the result set. In the end, the client has the intersection. Conceptually, this splits a big set into b smaller sets that each can be handled by a single node. It is correct because the two parties use the same hash function so an element thrown by the server into bin i will also be threw by the client into bin i. The idea can be implemented using the MapReduce programming model [19] easily.

Efficient and Scalable Protocol for Private Set Intersection in Big Data Security

Download Presentation

Presentation Transcript

Related

More Related Content