Introduction to Database Systems

undefined
C
S
 
4
0
5
G
:
 
I
n
t
r
o
d
u
c
t
i
o
n
 
t
o
D
a
t
a
b
a
s
e
 
S
y
s
t
e
m
s
Instructor: Jinze Liu
Fall 2017
3/1/2025
Jinze Liu @ University of Kentucky
2
R
e
v
i
e
w
 
The unit of disk read and write is
Block (or called Page)
The disk access time is composed by
Seek time
Rotation time
Data transfer time
3/1/2025
Jinze Liu @ University of Kentucky
3
R
e
v
i
e
w
 
A row in a table, when located on disks, is called
A record
Two types of record:
Fixed-length
Variable-length
3/1/2025
Jinze Liu @ University of Kentucky
4
R
e
v
i
e
w
 
In an abstract sense, a file is
A set of “records” on a disk
In reality, a file is
A set of disk pages
Each record lives on
A page
Physical Record ID (RID)
A tuple of <page#, slot#>
3/1/2025
Jinze Liu @ University of Kentucky
5
T
o
d
a
y
s
 
T
o
p
i
c
How to locate data in a file 
fast?
Introduction to indexing
Tree-based indexes
ISAM
: 
Indexed sequence access method
B
+
-tree
3/1/2025
Jinze Liu @ University of Kentucky
6
B
a
s
i
c
s
Given a value, locate the record(s) with this value
   
SELECT * FROM 
R
 WHERE 
A
 = 
value
;
   
SELECT * FROM 
R
, 
S
 WHERE 
R
.
A
 = 
S
.
B
;
Other search criteria, e.g.
Range search
   
SELECT * FROM 
R
 WHERE 
A
 > 
value
;
Keyword search
3/1/2025
Jinze Liu @ University of Kentucky
7
D
e
n
s
e
 
a
n
d
 
s
p
a
r
s
e
 
i
n
d
e
x
e
s
 
Dense
: one index entry for each search key value
 
Sparse
: one index entry for each block
Records must be 
clustered
 according to the search key
3/1/2025
Jinze Liu @ University of Kentucky
8
D
e
n
s
e
 
v
e
r
s
u
s
 
s
p
a
r
s
e
 
i
n
d
e
x
e
s
 
Index size
Sparse index is smaller
Requirement on records
Records must be clustered for sparse index
Lookup
Sparse index is smaller and may fit in memory
Dense index can directly tell if a record exists
Update
Easier for sparse index
3/1/2025
Jinze Liu @ University of Kentucky
9
P
r
i
m
a
r
y
 
a
n
d
 
s
e
c
o
n
d
a
r
y
 
i
n
d
e
x
e
s
Primary index
Created for the 
primary key
 of a table
Records are usually clustered according to the primary key
Can be sparse
Secondary index
Usually dense
SQL
PRIMARY KEY
 declaration automatically creates a primary
index, 
UNIQUE
 key automatically creates a secondary index
Additional secondary index can be created on non-key
attribute(s)
CREATE INDEX
 StudentGPAIndex 
ON
 Student(GPA)
;
 
 
T
r
e
e
-
S
t
r
u
c
t
u
r
e
d
 
I
n
d
e
x
e
s
:
 
I
n
t
r
o
d
u
c
t
i
o
n
Tree-structured indexing techniques support both 
range
selections 
and 
equality selections
.
ISAM =
I
ndexed
 
S
equential
 
A
ccess
 
M
ethod
static structure; early index technology.
B
+
 tree
:  dynamic, adjusts gracefully under inserts and
deletes.
3/1/2025
10
Jinze Liu @ University of Kentucky
 
 
M
o
t
i
v
a
t
i
o
n
 
f
o
r
 
I
n
d
e
x
 
``
Find all students with gpa > 3.0
’’
If data file is sorted, do binary search
Cost of binary search in a database can be quite high,
Why?
Simple idea:  Create an `index’ file.
 
Can do binary search on (smaller) index file!
P
a
g
e
 
1
P
a
g
e
 
2
P
a
g
e
 
N
P
a
g
e
 
3
D
a
t
a
 
F
i
l
e
3/1/2025
11
Jinze Liu @ University of Kentucky
3/1/2025
Jinze Liu @ University of Kentucky
12
I
S
A
M
 
What if an index is still too big?
Put a another (sparse) index on top of that!
ISAM
 (
Index Sequential Access Method
), more or less
 
Example: look up 197
3/1/2025
Jinze Liu @ University of Kentucky
13
U
p
d
a
t
e
s
 
w
i
t
h
 
I
S
A
M
 
Overflow chains and empty data blocks degrade
performance
Worst case: most records go into one long chain
 
Example: insert 107
 
Example: delete 129
3/1/2025
Jinze Liu @ University of Kentucky
14
A
 
N
o
t
e
 
o
f
 
C
a
u
t
i
o
n
ISAM is an old-fashioned idea
B+-trees are usually better, as we’ll see
But, ISAM is a good place to start to understand the
idea of indexing
Upshot
Don’t brag about being an ISAM expert on your
resume
Do understand how they work, and tradeoffs with B
+
-
trees
3/1/2025
Jinze Liu @ University of Kentucky
15
B
+
-
t
r
e
e
A hierarchy of intervals
Balanced
 (more or less): good performance guarantee
Disk-based
: one node per block; large fan-out
3/1/2025
Jinze Liu @ University of Kentucky
16
S
a
m
p
l
e
 
B
+
-
t
r
e
e
 
n
o
d
e
s
Max fan-out: 4
1
2
0
1
5
0
1
8
0
to keys 
100 <=
k
 < 120
to keys
120 <=
k
 < 150
to keys
150 <= 
k
 < 180
to keys
180 <= 
k
Non-leaf
1
2
0
1
3
0
to records with these 
k
 values;
or, store records directly in leaves
to next leaf node in sequence
Leaf
to keys
100 · 
k
3/1/2025
Jinze Liu @ University of Kentucky
17
B
+
-
t
r
e
e
 
b
a
l
a
n
c
i
n
g
 
p
r
o
p
e
r
t
i
e
s
Height constraint: all leaves at the same lowest level
Fan-out constraint: all nodes at least half full
(except root)
  
     
Max #   Max #
  
Min #
 
Min #
        
    pointers
        
keys
     
    active pointers
         
 keys
    
Non-leaf
 
f
 
f
 – 1
   
Root
  
f
 
f
 – 1
  
2
  
1
Leaf
  
f
 
f
 – 1
   
3/1/2025
Jinze Liu @ University of Kentucky
18
L
o
o
k
u
p
s
 
SELECT * FROM 
R
 WHERE 
k
 = 179
;
3
5
1
1
3
0
3
5
1
0
0
1
0
1
1
1
0
1
2
0
1
3
0
1
5
0
1
5
6
1
7
9
1
8
0
2
0
0
3
0
1
0
0
1
2
0
1
5
0
1
8
0
Max fan-out: 4
 
Not found
 
SELECT * FROM 
R
 WHERE 
k
 = 32
;
3/1/2025
Jinze Liu @ University of Kentucky
19
R
a
n
g
e
 
q
u
e
r
y
SELECT * FROM 
R
 WHERE 
k
 > 32 AND 
k
 < 179
;
3
5
1
1
3
0
3
5
1
0
0
1
0
1
1
1
0
1
2
0
1
3
0
1
5
0
1
5
6
1
7
9
1
8
0
2
0
0
3
0
1
0
0
1
2
0
1
5
0
1
8
0
Max fan-out: 4
 
Look up 32…
 
And follow next-leaf pointers
3/1/2025
Jinze Liu @ University of Kentucky
20
I
n
s
e
r
t
i
o
n
Insert a record with search key value 
32
3
5
1
1
3
0
3
5
1
0
0
1
0
1
1
1
0
1
2
0
1
3
0
1
5
0
1
5
6
1
7
9
1
8
0
2
0
0
3
0
1
0
0
1
2
0
1
5
0
1
8
0
Max fan-out: 4
 
Look up where the
inserted key
should go…
 
And insert it right there
3/1/2025
Jinze Liu @ University of Kentucky
21
A
n
o
t
h
e
r
 
i
n
s
e
r
t
i
o
n
 
e
x
a
m
p
l
e
Insert a record with search key value 
152
1
0
0
1
0
1
1
1
0
1
2
0
1
3
0
1
5
0
1
5
6
1
7
9
1
8
0
2
0
0
1
0
0
1
2
0
1
5
0
1
8
0
Max fan-out: 4
 
Oops, node is already full!
3/1/2025
Jinze Liu @ University of Kentucky
22
N
o
d
e
 
s
p
l
i
t
t
i
n
g
1
2
0
1
3
0
1
5
0
1
5
2
1
5
6
1
7
9
1
8
0
2
0
0
1
0
0
1
2
0
1
5
0
1
8
0
Max fan-out: 4
 
Yikes, this node is
also already full!
1
0
0
1
0
1
1
1
0
3/1/2025
Jinze Liu @ University of Kentucky
23
M
o
r
e
 
n
o
d
e
 
s
p
l
i
t
t
i
n
g
 
In the worst case, node splitting can “propagate” all the way up to the root of
the tree (not illustrated here)
Splitting the root introduces a new root of fan-out 2 and causes the tree to
grow “up” by one level
1
2
0
1
3
0
1
5
0
1
5
2
1
5
6
1
7
9
1
8
0
2
0
0
1
0
0
1
8
0
Max fan-out: 4
1
0
0
1
0
1
1
1
0
1
2
0
1
5
0
1
5
6
3/1/2025
Jinze Liu @ University of Kentucky
24
I
n
s
e
r
t
i
o
n
 
B
+
-tree Insert
Find correct leaf 
L.
Put data entry onto 
L
.
If 
L 
has enough space, 
done
!
Else, must 
split
  
L (into L and a new node L2)
Distribute entries evenly, 
copy up
 
middle key.
Insert index entry pointing to 
L2 
into parent of 
L
.
This can happen recursively
Tree growth: gets 
wider
 and (sometimes) 
one level taller at top
.
 
3/1/2025
Jinze Liu @ University of Kentucky
25
D
e
l
e
t
i
o
n
Delete a record with search key value 
130
1
0
0
1
0
1
1
1
0
1
2
0
1
3
0
1
5
0
1
5
6
1
7
9
1
8
0
2
0
0
1
0
0
1
2
0
1
5
0
1
8
0
Max fan-out: 4
 
Look up the key
to be deleted…
 
And delete it
 
Oops, node is too empty!
3/1/2025
Jinze Liu @ University of Kentucky
26
S
t
e
a
l
i
n
g
 
f
r
o
m
 
a
 
s
i
b
l
i
n
g
1
0
0
1
0
1
1
1
0
1
2
0
1
5
0
1
5
6
1
7
9
1
8
0
2
0
0
1
0
0
1
2
0
1
5
0
1
8
0
Max fan-out: 4
 
Remember to fix the key
in the least common ancestor
3/1/2025
Jinze Liu @ University of Kentucky
27
A
n
o
t
h
e
r
 
d
e
l
e
t
i
o
n
 
e
x
a
m
p
l
e
Delete a record with search key value 
179
1
0
0
1
0
1
1
1
0
1
2
0
1
5
0
1
5
6
1
7
9
1
8
0
2
0
0
1
0
0
1
2
0
1
5
6
1
8
0
Max fan-out: 4
 
Cannot steal from siblings
 
Then coalesce (merge) with a sibling!
3/1/2025
Jinze Liu @ University of Kentucky
28
C
o
a
l
e
s
c
i
n
g
1
0
0
1
0
1
1
1
0
1
2
0
1
5
0
1
0
0
1
2
0
1
5
6
1
8
0
Max fan-out: 4
 
Deletion can “propagate” all the way up to the root of the tree (not illustrated
here)
When the root becomes empty, the tree “shrinks” by one level
1
5
6
1
8
0
2
0
0
3/1/2025
Jinze Liu @ University of Kentucky
29
D
e
l
e
t
i
o
n
 
B+-tree Delete
Start at root, find leaf 
L
 where entry belongs.
Remove the entry.
If L is at least half-full, 
done!
If L has only 
d-1 
entries,
Try to 
redistribute
, borrowing from sibling
 (adjacent node
with same parent as L)
.
If re-distribution fails, 
merge
 
L 
and sibling.
If merge occurred, must delete entry (pointing to 
L
 or sibling)
from parent of 
L
.
Tree shrink: gets 
narrower
 and (sometimes) 
one level lower at
top
.
 
 
E
x
a
m
p
l
e
 
B
+
 
T
r
e
e
 
-
 
I
n
s
e
r
t
i
n
g
 
8
*
 
In this example, we can avoid split by re-distributing
entries; however, this is usually not done in practice.
 
Notice that root was split, leading to increase in height.
3/1/2025
30
Jinze Liu @ University of Kentucky
 
 
E
x
a
m
p
l
e
 
T
r
e
e
 
(
i
n
c
l
u
d
i
n
g
 
8
*
)
D
e
l
e
t
e
 
1
9
*
 
a
n
d
 
2
0
*
 
.
.
.
3/1/2025
31
Jinze Liu @ University of Kentucky
 
 
E
x
a
m
p
l
e
 
T
r
e
e
 
(
i
n
c
l
u
d
i
n
g
 
8
*
)
D
e
l
e
t
e
 
1
9
*
 
a
n
d
 
2
0
*
 
.
.
.
 
Deleting 19* is easy.
Deleting 20* is done with re-distribution. Notice
how middle key is 
copied up
.
3/1/2025
32
Jinze Liu @ University of Kentucky
 
 
 
 
 
 
 
 
 
 
.
.
.
 
A
n
d
 
T
h
e
n
 
D
e
l
e
t
i
n
g
 
2
4
*
Must merge.
Observe 
`
toss
of index
entry (key 27 on right), and
`
pull down
of index entry
(below).
3
0
2
2
*
2
7
*
2
9
*
3
3
*
3
4
*
3
8
*
3
9
*
2
*
3
*
7
*
1
4
*
1
6
*
2
2
*
2
7
*
2
9
*
3
3
*
3
4
*
3
8
*
3
9
*
5
*
8
*
R
o
o
t
3
0
1
3
5
1
7
3/1/2025
33
Jinze Liu @ University of Kentucky
3/1/2025
Jinze Liu @ University of Kentucky
34
P
e
r
f
o
r
m
a
n
c
e
 
a
n
a
l
y
s
i
s
How many I/O’s are required for each operation?
h
, the 
height of the tree
 (more or less)
Plus one or two to manipulate actual records
Plus 
O
(
h
) for reorganization (should be very rare if 
f
 is large)
Minus one if we cache the root in memory
How big is 
h
?
Roughly 
log
fan-out
 
N
, where 
N
 is the number of records
B
+
-tree properties guarantee that fan-out is least 
f
 / 2 for all non-root
nodes
Fan-out is typically large (in hundreds)—many keys and pointers
can fit into one block
A 4-level B
+
-tree is enough for typical tables
3/1/2025
Jinze Liu @ University of Kentucky
35
B
+
-
t
r
e
e
 
i
n
 
p
r
a
c
t
i
c
e
Complex reorganization for deletion often is not
implemented (e.g., Oracle, Informix)
Leave nodes less than half full and periodically reorganize
Most commercial DBMS use B
+
-tree instead of hashing-
based indexes because B
+
-tree handles range queries
3/1/2025
Jinze Liu @ University of Kentucky
36
T
h
e
 
H
a
l
l
o
w
e
e
n
 
P
r
o
b
l
e
m
 
Story from the early days of System R…
 
UPDATE Payroll
SET salary = salary * 1.1
WHERE salary >= 100000;
There is a B
+
-tree index on 
Payroll
(
salary
)
The update never stopped (why?)
Solutions?
Scan index in reverse
Before update, scan index to create a complete “to-do” list
During update, maintain a “done” list
Tag every row with transaction/statement id
3/1/2025
Jinze Liu @ University of Kentucky
37
B
+
-
t
r
e
e
 
v
e
r
s
u
s
 
I
S
A
M
ISAM is more 
static
; B
+
-tree is more 
dynamic
ISAM is more compact (at least initially)
Fewer levels and I/O’s than B
+
-tree
Overtime, ISAM may not be balanced
Cannot provide guaranteed performance as B
+
-tree does
3/1/2025
Jinze Liu @ University of Kentucky
38
B
+
-
t
r
e
e
 
v
e
r
s
u
s
 
B
-
t
r
e
e
 
B-tree: why not store records (or record pointers) in
non-leaf nodes?
These records can be accessed with fewer I/O’s
Problems?
Storing more data in a node decreases fan-out and
increases 
h
Records in leaves require more I/O’s to access
Vast majority of the records live in leaves!
3/1/2025
Jinze Liu @ University of Kentucky
39
B
e
y
o
n
d
 
I
S
A
M
,
 
B
-
,
 
a
n
d
 
B
+
-
t
r
e
e
s
Other tree-based indexes: R-trees and variants, GiST,
etc.
Hashing-based indexes: extensible hashing, linear
hashing, etc.
Text indexes: inverted-list index, suffix arrays, etc.
Other tricks: bitmap index, bit-sliced index, etc.
How about indexing subgraph search?
3/1/2025
Jinze Liu @ University of Kentucky
40
S
u
m
m
a
r
y
 
Two types of queries
Key-search
Range-query
B
+
-tree operations
Search
Insert
Split child
Delete
Redistribution
B
+
-tree sorting
Next: disk-based sorting algorithms
Slide Note
Embed
Share

This course covers topics such as disk access time, rows and records in tables, file organization, indexing methods, dense and sparse indexes, primary and secondary indexes, and more. The content delves into the fundamentals of database systems, with a focus on efficient data retrieval techniques and indexing strategies like B+-trees and ISAM. Gain insights into how data is stored and accessed, and learn about different types of indexes and their advantages in database management.

  • Database Systems
  • Indexing Methods
  • Disk Access Time
  • B+-trees
  • File Organization

Uploaded on Mar 01, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2017

  2. Review The unit of disk read and write is Block (or called Page) The disk access time is composed by Seek time Rotation time Data transfer time 3/1/2025 Jinze Liu @ University of Kentucky 2

  3. Review A row in a table, when located on disks, is called A record Two types of record: Fixed-length Variable-length 3/1/2025 Jinze Liu @ University of Kentucky 3

  4. Review In an abstract sense, a file is A set of records on a disk In reality, a file is A set of disk pages Each record lives on A page Physical Record ID (RID) A tuple of <page#, slot#> 3/1/2025 Jinze Liu @ University of Kentucky 4

  5. Todays Topic How to locate data in a file fast? Introduction to indexing Tree-based indexes ISAM: Indexed sequence access method B+-tree 3/1/2025 Jinze Liu @ University of Kentucky 5

  6. Basics Given a value, locate the record(s) with this value SELECT * FROM R WHERE A = value; SELECT * FROM R, S WHERE R.A = S.B; Other search criteria, e.g. Range search SELECT * FROM R WHERE A > value; Keyword search database indexing Search 3/1/2025 Jinze Liu @ University of Kentucky 6

  7. Dense and sparse indexes Dense: one index entry for each search key value Sparse: one index entry for each block Records must be clustered according to the search key 3/1/2025 Jinze Liu @ University of Kentucky 7

  8. Dense versus sparse indexes Index size Sparse index is smaller Requirement on records Records must be clustered for sparse index Lookup Sparse index is smaller and may fit in memory Dense index can directly tell if a record exists Update Easier for sparse index 3/1/2025 Jinze Liu @ University of Kentucky 8

  9. Primary and secondary indexes Primary index Created for the primary key of a table Records are usually clustered according to the primary key Can be sparse Secondary index Usually dense SQL PRIMARY KEY declaration automatically creates a primary index, UNIQUE key automatically creates a secondary index Additional secondary index can be created on non-key attribute(s) CREATE INDEX StudentGPAIndex ON Student(GPA); 3/1/2025 Jinze Liu @ University of Kentucky 9

  10. Tree-Structured Indexes: Introduction Tree-structured indexing techniques support both range selections and equality selections. ISAM =Indexed Sequential Access Method static structure; early index technology. B+ tree: dynamic, adjusts gracefully under inserts and deletes. 3/1/2025 Jinze Liu @ University of Kentucky 10

  11. Motivation for Index ``Find all students with gpa > 3.0 If data file is sorted, do binary search Cost of binary search in a database can be quite high, Why? Simple idea: Create an `index file. index entry Index File kN k2 k1 P0 K1 Pm P1 K2 P2 Km Data File Page N Page 3 Page 1 Page 2 Can do binary search on (smaller) index file! Jinze Liu @ University of Kentucky 3/1/2025 11

  12. ISAM What if an index is still too big? Put a another (sparse) index on top of that! ISAM (Index Sequential Access Method), more or less Example: look up 197 100, 200, , 901 Index blocks 100, 123, , 192 200, 901, , 996 100, 108, 119, 121 123, 129, 192, 197, 200, 202, 901, 907, 996, 997, Data blocks 3/1/2025 Jinze Liu @ University of Kentucky 12

  13. Updates with ISAM Example: insert 107 Example: delete 129 100, 200, , 901 Index blocks 100, 123, , 192 200, 901, , 996 100, 108, 119, 121 123, 129, 192, 197, 200, 202, 901, 907, 996, 997, Data blocks 107 Overflow block Overflow chains and empty data blocks degrade performance Worst case: most records go into one long chain 3/1/2025 Jinze Liu @ University of Kentucky 13

  14. A Note of Caution ISAM is an old-fashioned idea B+-trees are usually better, as we ll see But, ISAM is a good place to start to understand the idea of indexing Upshot Don t brag about being an ISAM expert on your resume Do understand how they work, and tradeoffs with B+- trees 3/1/2025 Jinze Liu @ University of Kentucky 14

  15. B+-tree A hierarchy of intervals Balanced (more or less): good performance guarantee Disk-based: one node per block; large fan-out Max fan-out: 4 100 120 150 180 30 179 180 120 156 100 101 130 150 200 110 11 30 35 5 3 3/1/2025 Jinze Liu @ University of Kentucky 15

  16. Sample B+-tree nodes to keys 100 k Max fan-out: 4 150 180 120 Non-leaf to keys 100 <=k < 120 to keys to keys to keys 180 <= k 120 <=k < 150 150 <= k < 180 120 130 Leaf to next leaf node in sequence to records with these k values; or, store records directly in leaves 3/1/2025 Jinze Liu @ University of Kentucky 16

  17. B+-tree balancing properties Height constraint: all leaves at the same lowest level Fan-out constraint: all nodes at least half full (except root) Max # Max # pointers Non-leaf Root Leaf active pointers 2 f Min # Min # keys f 1 f 1 f 1 keys 2 / f 1 f f f / 2 f 1 / 2 / 2 f 3/1/2025 Jinze Liu @ University of Kentucky 17

  18. Lookups SELECT * FROM R WHERE k = 179; SELECT * FROM R WHERE k = 32; Max fan-out: 4 100 120 150 180 30 Not found 179 179 180 120 156 100 101 130 150 200 110 11 30 35 5 3 3/1/2025 Jinze Liu @ University of Kentucky 18

  19. Range query SELECT * FROM R WHERE k > 32 AND k < 179; Max fan-out: 4 100 120 150 180 30 Look up 32 179 180 120 120 156 156 100 100 101 101 130 130 150 150 200 110 110 11 30 35 35 5 3 And follow next-leaf pointers 3/1/2025 Jinze Liu @ University of Kentucky 19

  20. Insertion Insert a record with search key value 32 Max fan-out: 4 100 120 150 180 30 Look up where the inserted key should go 32 179 180 120 156 100 101 130 150 200 110 11 30 35 5 3 And insert it right there 3/1/2025 Jinze Liu @ University of Kentucky 20

  21. Another insertion example Insert a record with search key value 152 Max fan-out: 4 100 120 150 180 152 179 156 120 180 100 101 130 150 200 110 Oops, node is already full! 3/1/2025 Jinze Liu @ University of Kentucky 21

  22. Node splitting Max fan-out: 4 100 Yikes, this node is also already full! 120 150 156 180 150 179 120 180 130 100 152 156 101 200 110 3/1/2025 Jinze Liu @ University of Kentucky 22

  23. More node splitting Max fan-out: 4 156 100 180 120 150 150 179 120 180 130 100 152 156 101 200 110 In the worst case, node splitting can propagate all the way up to the root of the tree (not illustrated here) Splitting the root introduces a new root of fan-out 2 and causes the tree to grow up by one level 3/1/2025 Jinze Liu @ University of Kentucky 23

  24. Insertion B+-tree Insert Find correct leaf L. Put data entry onto L. If L has enough space, done! Else, must splitL (into L and a new node L2) Distribute entries evenly, copy upmiddle key. Insert index entry pointing to L2 into parent of L. This can happen recursively Tree growth: gets wider and (sometimes) one level taller at top. 3/1/2025 Jinze Liu @ University of Kentucky 24

  25. Deletion Delete a record with search key value 130 Max fan-out: 4 100 If a sibling has more than enough keys, steal one! 120 150 180 Look up the key to be deleted 179 156 120 180 100 101 130 150 200 110 And delete it Oops, node is too empty! 3/1/2025 Jinze Liu @ University of Kentucky 25

  26. Stealing from a sibling Max fan-out: 4 100 156 Remember to fix the key in the least common ancestor 120 150 180 179 120 150 180 100 156 101 200 110 3/1/2025 Jinze Liu @ University of Kentucky 26

  27. Another deletion example Delete a record with search key value 179 Max fan-out: 4 100 120 156 180 150 179 120 180 100 156 101 200 110 Cannot steal from siblings Then coalesce (merge) with a sibling! 3/1/2025 Jinze Liu @ University of Kentucky 27

  28. Coalescing Max fan-out: 4 100 Remember to delete the appropriate key from parent 120 156 180 180 150 120 100 156 200 101 110 Deletion can propagate all the way up to the root of the tree (not illustrated here) When the root becomes empty, the tree shrinks by one level 3/1/2025 Jinze Liu @ University of Kentucky 28

  29. Deletion B+-tree Delete Start at root, find leaf L where entry belongs. Remove the entry. If L is at least half-full, done! If L has only d-1 entries, Try to redistribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, mergeL and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Tree shrink: gets narrower and (sometimes) one level lower at top. 3/1/2025 Jinze Liu @ University of Kentucky 29

  30. Example B+ Tree - Inserting 8* Root Root 17 30 24 13 17 24 30 5 13 33* 34* 38* 39* 2* 3* 19* 20* 22* 24* 27* 29* 5* 7* 8* 14* 16* 33* 34* 38* 39* 19* 20* 22* 24* 27* 29* 3* 5* 2* 7* 14* 16* Notice that root was split, leading to increase in height. In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice. 3/1/2025 Jinze Liu @ University of Kentucky 30

  31. Example Tree (including 8*) Delete 19* and 20* ... Root Root 17 30 24 13 17 24 30 5 13 33* 34* 38* 39* 33* 34* 38* 39* 19* 20* 22* 24* 27* 29* 3* 3* 5* 2* 2* 7* 14* 16* 7* 19* 20* 22* 24* 27* 29* 5* 8* 14* 16* 3/1/2025 Jinze Liu @ University of Kentucky 31

  32. Example Tree (including 8*) Delete 19* and 20* ... Root Root 17 17 27 30 5 13 24 30 5 13 33* 34* 38* 39* 2* 3* 22* 24* 27* 29* 5* 7* 8* 14* 16* 33* 34* 38* 39* 2* 3* 19* 20* 22* 24* 27* 29* 5* 7* 8* 14* 16* Deleting 19* is easy. Deleting 20* is done with re-distribution. Notice how middle key is copied up. 3/1/2025 Jinze Liu @ University of Kentucky 32

  33. ... And Then Deleting 24* Must merge. Observe `toss of index entry (key 27 on right), and `pull down of index entry (below). 30 39* 22* 27* 38* 29* 33* 34* Root 5 13 17 30 33* 34* 38* 39* 3* 22* 27* 29* 2* 5* 7* 8* 14* 16* 3/1/2025 Jinze Liu @ University of Kentucky 33

  34. Performance analysis How many I/O s are required for each operation? h, the height of the tree (more or less) Plus one or two to manipulate actual records Plus O(h) for reorganization (should be very rare if f is large) Minus one if we cache the root in memory How big is h? Roughly logfan-outN, where N is the number of records B+-tree properties guarantee that fan-out is least f / 2 for all non-root nodes Fan-out is typically large (in hundreds) many keys and pointers can fit into one block A 4-level B+-tree is enough for typical tables 3/1/2025 Jinze Liu @ University of Kentucky 34

  35. B+-tree in practice Complex reorganization for deletion often is not implemented (e.g., Oracle, Informix) Leave nodes less than half full and periodically reorganize Most commercial DBMS use B+-tree instead of hashing- based indexes because B+-tree handles range queries 3/1/2025 Jinze Liu @ University of Kentucky 35

  36. The Halloween Problem Story from the early days of System R UPDATE Payroll SET salary = salary * 1.1 WHERE salary >= 100000; There is a B+-tree index on Payroll(salary) The update never stopped (why?) Solutions? Scan index in reverse Before update, scan index to create a complete to-do list During update, maintain a done list Tag every row with transaction/statement id 3/1/2025 Jinze Liu @ University of Kentucky 36

  37. B+-tree versus ISAM ISAM is more static; B+-tree is more dynamic ISAM is more compact (at least initially) Fewer levels and I/O s than B+-tree Overtime, ISAM may not be balanced Cannot provide guaranteed performance as B+-tree does 3/1/2025 Jinze Liu @ University of Kentucky 37

  38. B+-tree versus B-tree B-tree: why not store records (or record pointers) in non-leaf nodes? These records can be accessed with fewer I/O s Problems? Storing more data in a node decreases fan-out and increases h Records in leaves require more I/O s to access Vast majority of the records live in leaves! 3/1/2025 Jinze Liu @ University of Kentucky 38

  39. Beyond ISAM, B-, and B+-trees Other tree-based indexes: R-trees and variants, GiST, etc. Hashing-based indexes: extensible hashing, linear hashing, etc. Text indexes: inverted-list index, suffix arrays, etc. Other tricks: bitmap index, bit-sliced index, etc. How about indexing subgraph search? 3/1/2025 Jinze Liu @ University of Kentucky 39

  40. Summary Two types of queries Key-search Range-query B+-tree operations Search Insert Split child Delete Redistribution B+-tree sorting Next: disk-based sorting algorithms 3/1/2025 Jinze Liu @ University of Kentucky 40

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#