Efficient Bucket Management in Extendible Hashing
In cases where a bucket (primary page) becomes full in extendible hashing, re-organizing the file by doubling the number of buckets can be costly in terms of resource utilization. An alternative approach involves using a directory of pointers to buckets and doubling the directory instead of all buckets. By splitting only the overflowing bucket, this method minimizes the need to read and write all pages, making it a more efficient solution. The trick lies in adjusting the hash function to determine bucket allocations. This concept is illustrated with examples and explanations of global and local depth considerations in extendible hashing.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Extendible Hashing Situation: Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets? Reading and writing all pages is expensive! and is needlessly expensive on resource use. Idea: Use directory of pointers to buckets, double # of buckets by doubling the directory , splitting just the bucket that overflowed! Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page! Trick lies in how hash function is adjusted! Not always necessary! CSCIX370: Database Management
LOCAL DEPTH 2 Bucket A Example 4* 12* 32* 16* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* Directory is array of size 4. To find bucket for r, take last `global depth # bits of h(r) e.g., h(r) = 5 = binary 101, it is in bucket pointed to by 01. hash fn used: h(k) = k (for illustration only). 01 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES v Insert: If bucket is full, split it (allocate new page, re-distribute data entries). E.g., consider insert 20* (10100). v If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.) CSCIX370: Database Management
Example Remarks Depth deals with how many bits from the hash address suffix we examine at a given time. Global depth = what s the #bits needed to correctly find the home bucket for an arbitrary data entry, in general? Local depth of bkt B = how many bits did I really need to look at to get to bucket B? Global depth >= local depth. Check this on examples. CSCIX370: Database Management
Insert h(r)=20 - Part 1 Suppose h(k) = k for this example. Bucket A split into 2 using an extra bit, i.e., 3 bits A divisible by 8, i.e., 1000 A2 divisible by 4, i.e., 100 note that only one bucket needs to be re-distributed, i.e., re-hashed B, C, D remain unchanged Where to link A2? 2 LOCAL DEPTH Bucket A 32* 16* GLOBAL DEPTH 2 2 Bucket B 1* 5* 21*13* 00 01 10 11 2 Bucket C 10* 2 DIRECTORY Bucket D 15* 7* 19* 2 Bucket A2 (`split image' of Bucket A) 4* 12* 20* CSCIX370: Database Management
Insert h(r)=20 Part 2 3 LOCAL DEPTH double the directory add 1 to global depth & to local depth of A/A2. now can distinguish between A and A2 notice the difference in local depth between buckets multiple pointers to the same bucket Review properties of LD & GD. 32* 16* Bucket A GLOBAL DEPTH 2 3 1* 5* 21*13* 000 Bucket B 001 010 2 10* Bucket C 011 100 2 101 15* 7* 19* Bucket D 110 111 3 DIRECTORY 12* 20* Bucket A2 (`split image' of Bucket A) 4* Insert 9 (1001) now CSCIX370: Database Management
Points to Note 20 = binary 10100. Last 2 bits (00) tell us r belongs in A or A2. Last 3 bits needed to tell which. Global depth of directory: min # of bits needed to tell which bucket an entry belongs to = max{local depths}. Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket. When does bucket split cause directory doubling? Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; directory is doubled by copying it over and `fixing pointer to split image page. (Use of least significant bits enables efficient doubling via copying of directory!) CSCIX370: Database Management
EH - Insert 3, 4, 7, 2, 5, 1, 6 0: [ ., . ] bp = *0 1: [ ., . ] bp = *1 0: [4, 2] bp = *0 1: [3, 7] bp = *1 insert 5 => OVF 0: [4, 2] bp = *0 1: [5, 1] bp = *01 insert 1, 6 => OVF 3: [3, 7] bp = *11 0: [4, . ] bp = *00 1: [5, 1] bp = *01 3: [3, 7] bp = *11 buckets are out of order => 2: [2, 6] bp = *10 a directory (not shown) is required CSCIX370: Database Management
Comments on Extendible Hashing If directory fits in memory, equality search answered with one disk access; else two. 100MB file, 100 bytes/rec, 4K page; contains 1,000,000 records (as data entries); 40 records/page 106/40 = 25,000 pages of data entries; as many directory elements; can handle using 15bit addresses; chances may be high that directory will fit in memory. Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large. Delete: If removal of data entry makes bucket empty, check to see whether all `split images can be merged if each directory element points to the same bucket as its split image, can halve directory rarely done in practice (e.g., leave room for future insertions). CSCIX370: Database Management
Comments on Extendible Hashing If directory fits in memory, equality search answered with one disk access; else two. 100MB file, 100 bytes/rec, 4K page; contains 1,000,000 records (as data entries); 40 records/page 106/40 = 25,000 pages of data Let keys be unsigned 32-bit integers and the number of slots per bucket s = 4 Insert 5 keys that will maximize the size of the directory What are those five keys? How many entries will there be in the directory? If starting with 2 buckets, how many entries in directory before inserting the 5th key? CSCIX370: Database Management