Introduction to Database Systems

undefined

Instructor: Jinze Liu

Fall 2017

3/1/2025

Jinze Liu @ University of Kentucky



The unit of disk read and write is



Block (or called Page)



The disk access time is composed by



Seek time



Rotation time



Data transfer time

3/1/2025

Jinze Liu @ University of Kentucky



A row in a table, when located on disks, is called



A record



Two types of record:



Fixed-length



Variable-length

3/1/2025

Jinze Liu @ University of Kentucky



In an abstract sense, a file is



A set of “records” on a disk



In reality, a file is



A set of disk pages



Each record lives on



A page



Physical Record ID (RID)



A tuple of <page#, slot#>

3/1/2025

Jinze Liu @ University of Kentucky

’



How to locate data in a file

fast?



Introduction to indexing



Tree-based indexes



ISAM

Indexed sequence access method



-tree

3/1/2025

Jinze Liu @ University of Kentucky



Given a value, locate the record(s) with this value

SELECT * FROM

 WHERE

value

SELECT * FROM

 WHERE



Other search criteria, e.g.



Range search

SELECT * FROM

 WHERE

value



Keyword search

3/1/2025

Jinze Liu @ University of Kentucky



Dense

: one index entry for each search key value



Sparse

: one index entry for each block



Records must be

clustered

 according to the search key

3/1/2025

Jinze Liu @ University of Kentucky



Index size



Sparse index is smaller



Requirement on records



Records must be clustered for sparse index



Lookup



Sparse index is smaller and may fit in memory



Dense index can directly tell if a record exists



Update



Easier for sparse index

3/1/2025

Jinze Liu @ University of Kentucky



Primary index



Created for the

primary key

 of a table



Records are usually clustered according to the primary key



Can be sparse



Secondary index



Usually dense



SQL



PRIMARY KEY

 declaration automatically creates a primary

index,

UNIQUE

 key automatically creates a secondary index



Additional secondary index can be created on non-key

attribute(s)

CREATE INDEX

 StudentGPAIndex

ON

 Student(GPA)



Tree-structured indexing techniques support both

range

selections

and

equality selections



ISAM =

ndexed

equential

ccess

ethod



static structure; early index technology.



 tree

:  dynamic, adjusts gracefully under inserts and

deletes.

3/1/2025

Jinze Liu @ University of Kentucky



``

Find all students with gpa > 3.0

’’



If data file is sorted, do binary search



Cost of binary search in a database can be quite high,

Why?



Simple idea:  Create an `index’ file.

Can do binary search on (smaller) index file!

3/1/2025

Jinze Liu @ University of Kentucky

3/1/2025

Jinze Liu @ University of Kentucky



What if an index is still too big?



Put a another (sparse) index on top of that!



ISAM

Index Sequential Access Method

), more or less

Example: look up 197

3/1/2025

Jinze Liu @ University of Kentucky



Overflow chains and empty data blocks degrade

performance



Worst case: most records go into one long chain

Example: insert 107

Example: delete 129

3/1/2025

Jinze Liu @ University of Kentucky



ISAM is an old-fashioned idea



B+-trees are usually better, as we’ll see



But, ISAM is a good place to start to understand the

idea of indexing



Upshot



Don’t brag about being an ISAM expert on your

resume



Do understand how they work, and tradeoffs with B

trees

3/1/2025

Jinze Liu @ University of Kentucky



A hierarchy of intervals



Balanced

 (more or less): good performance guarantee



Disk-based

: one node per block; large fan-out

3/1/2025

Jinze Liu @ University of Kentucky

Max fan-out: 4

to keys

100 <=

 < 120

to keys

120 <=

 < 150

to keys

150 <=

 < 180

to keys

180 <=

Non-leaf

to records with these

 values;

or, store records directly in leaves

to next leaf node in sequence

Leaf

to keys

100 ·

3/1/2025

Jinze Liu @ University of Kentucky



Height constraint: all leaves at the same lowest level



Fan-out constraint: all nodes at least half full

(except root)

Max #   Max #

Min #

Min #

    pointers

keys

    active pointers

 keys

Non-leaf

 – 1

Root

 – 1

Leaf

 – 1

3/1/2025

Jinze Liu @ University of Kentucky

SELECT * FROM

 WHERE

 = 179

Max fan-out: 4

Not found

SELECT * FROM

 WHERE

 = 32

3/1/2025

Jinze Liu @ University of Kentucky

SELECT * FROM

 WHERE

 > 32 AND

 < 179

Max fan-out: 4

Look up 32…

And follow next-leaf pointers

3/1/2025

Jinze Liu @ University of Kentucky



Insert a record with search key value

Max fan-out: 4

Look up where the

inserted key

should go…

And insert it right there

3/1/2025

Jinze Liu @ University of Kentucky



Insert a record with search key value

Max fan-out: 4

Oops, node is already full!

3/1/2025

Jinze Liu @ University of Kentucky

Max fan-out: 4

Yikes, this node is

also already full!

3/1/2025

Jinze Liu @ University of Kentucky



In the worst case, node splitting can “propagate” all the way up to the root of

the tree (not illustrated here)



Splitting the root introduces a new root of fan-out 2 and causes the tree to

grow “up” by one level

Max fan-out: 4

3/1/2025

Jinze Liu @ University of Kentucky



-tree Insert



Find correct leaf

L.



Put data entry onto



If

has enough space,

done



Else, must

split

L (into L and a new node L2)



Distribute entries evenly,

copy up

middle key.



Insert index entry pointing to

L2

into parent of



This can happen recursively



Tree growth: gets

wider

 and (sometimes)

one level taller at top

3/1/2025

Jinze Liu @ University of Kentucky



Delete a record with search key value

Max fan-out: 4

Look up the key

to be deleted…

And delete it

Oops, node is too empty!

3/1/2025

Jinze Liu @ University of Kentucky

Max fan-out: 4

Remember to fix the key

in the least common ancestor

3/1/2025

Jinze Liu @ University of Kentucky



Delete a record with search key value

Max fan-out: 4

Cannot steal from siblings

Then coalesce (merge) with a sibling!

3/1/2025

Jinze Liu @ University of Kentucky

Max fan-out: 4



Deletion can “propagate” all the way up to the root of the tree (not illustrated

here)



When the root becomes empty, the tree “shrinks” by one level

3/1/2025

Jinze Liu @ University of Kentucky



B+-tree Delete



Start at root, find leaf

 where entry belongs.



Remove the entry.



If L is at least half-full,

done!



If L has only

d-1

entries,



Try to

redistribute

, borrowing from sibling

 (adjacent node

with same parent as L)



If re-distribution fails,

merge

and sibling.



If merge occurred, must delete entry (pointing to

 or sibling)

from parent of



Tree shrink: gets

narrower

 and (sometimes)

one level lower at

top

In this example, we can avoid split by re-distributing

entries; however, this is usually not done in practice.

Notice that root was split, leading to increase in height.

3/1/2025

Jinze Liu @ University of Kentucky

3/1/2025

Jinze Liu @ University of Kentucky



Deleting 19* is easy.



Deleting 20* is done with re-distribution. Notice

how middle key is

copied up

3/1/2025

Jinze Liu @ University of Kentucky



Must merge.



Observe

toss

’

of index

entry (key 27 on right), and

pull down

’

of index entry

(below).

3/1/2025

Jinze Liu @ University of Kentucky

3/1/2025

Jinze Liu @ University of Kentucky



How many I/O’s are required for each operation?



, the

height of the tree

 (more or less)



Plus one or two to manipulate actual records



Plus

) for reorganization (should be very rare if

 is large)



Minus one if we cache the root in memory



How big is



Roughly

log

fan-out

, where

 is the number of records



-tree properties guarantee that fan-out is least

 / 2 for all non-root

nodes



Fan-out is typically large (in hundreds)—many keys and pointers

can fit into one block



A 4-level B

-tree is enough for typical tables

3/1/2025

Jinze Liu @ University of Kentucky



Complex reorganization for deletion often is not

implemented (e.g., Oracle, Informix)



Leave nodes less than half full and periodically reorganize



Most commercial DBMS use B

-tree instead of hashing-

based indexes because B

-tree handles range queries

3/1/2025

Jinze Liu @ University of Kentucky



Story from the early days of System R…

UPDATE Payroll

SET salary = salary * 1.1

WHERE salary >= 100000;



There is a B

-tree index on

Payroll

salary



The update never stopped (why?)



Solutions?



Scan index in reverse



Before update, scan index to create a complete “to-do” list



During update, maintain a “done” list



Tag every row with transaction/statement id

3/1/2025

Jinze Liu @ University of Kentucky



ISAM is more

static

; B

-tree is more

dynamic



ISAM is more compact (at least initially)



Fewer levels and I/O’s than B

-tree



Overtime, ISAM may not be balanced



Cannot provide guaranteed performance as B

-tree does

3/1/2025

Jinze Liu @ University of Kentucky



B-tree: why not store records (or record pointers) in

non-leaf nodes?



These records can be accessed with fewer I/O’s



Problems?



Storing more data in a node decreases fan-out and

increases



Records in leaves require more I/O’s to access



Vast majority of the records live in leaves!

3/1/2025

Jinze Liu @ University of Kentucky



Other tree-based indexes: R-trees and variants, GiST,

etc.



Hashing-based indexes: extensible hashing, linear

hashing, etc.



Text indexes: inverted-list index, suffix arrays, etc.



Other tricks: bitmap index, bit-sliced index, etc.



How about indexing subgraph search?

3/1/2025

Jinze Liu @ University of Kentucky



Two types of queries



Key-search



Range-query



-tree operations



Search



Insert



Split child



Delete



Redistribution



-tree sorting



Next: disk-based sorting algorithms

Slide Note

Embed Share

Download

This course covers topics such as disk access time, rows and records in tables, file organization, indexing methods, dense and sparse indexes, primary and secondary indexes, and more. The content delves into the fundamentals of database systems, with a focus on efficient data retrieval techniques and indexing strategies like B+-trees and ISAM. Gain insights into how data is stored and accessed, and learn about different types of indexes and their advantages in database management.

nicolaus Follow

Uploaded on Mar 01, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2017

Review The unit of disk read and write is Block (or called Page) The disk access time is composed by Seek time Rotation time Data transfer time 3/1/2025 Jinze Liu @ University of Kentucky 2

Review A row in a table, when located on disks, is called A record Two types of record: Fixed-length Variable-length 3/1/2025 Jinze Liu @ University of Kentucky 3

Review In an abstract sense, a file is A set of records on a disk In reality, a file is A set of disk pages Each record lives on A page Physical Record ID (RID) A tuple of <page#, slot#> 3/1/2025 Jinze Liu @ University of Kentucky 4

Todays Topic How to locate data in a file fast? Introduction to indexing Tree-based indexes ISAM: Indexed sequence access method B+-tree 3/1/2025 Jinze Liu @ University of Kentucky 5

Basics Given a value, locate the record(s) with this value SELECT * FROM R WHERE A = value; SELECT * FROM R, S WHERE R.A = S.B; Other search criteria, e.g. Range search SELECT * FROM R WHERE A > value; Keyword search database indexing Search 3/1/2025 Jinze Liu @ University of Kentucky 6

Dense and sparse indexes Dense: one index entry for each search key value Sparse: one index entry for each block Records must be clustered according to the search key 3/1/2025 Jinze Liu @ University of Kentucky 7

Dense versus sparse indexes Index size Sparse index is smaller Requirement on records Records must be clustered for sparse index Lookup Sparse index is smaller and may fit in memory Dense index can directly tell if a record exists Update Easier for sparse index 3/1/2025 Jinze Liu @ University of Kentucky 8

Primary and secondary indexes Primary index Created for the primary key of a table Records are usually clustered according to the primary key Can be sparse Secondary index Usually dense SQL PRIMARY KEY declaration automatically creates a primary index, UNIQUE key automatically creates a secondary index Additional secondary index can be created on non-key attribute(s) CREATE INDEX StudentGPAIndex ON Student(GPA); 3/1/2025 Jinze Liu @ University of Kentucky 9

Tree-Structured Indexes: Introduction Tree-structured indexing techniques support both range selections and equality selections. ISAM =Indexed Sequential Access Method static structure; early index technology. B+ tree: dynamic, adjusts gracefully under inserts and deletes. 3/1/2025 Jinze Liu @ University of Kentucky 10

Motivation for Index ``Find all students with gpa > 3.0 If data file is sorted, do binary search Cost of binary search in a database can be quite high, Why? Simple idea: Create an `index file. index entry Index File kN k2 k1 P0 K1 Pm P1 K2 P2 Km Data File Page N Page 3 Page 1 Page 2 Can do binary search on (smaller) index file! Jinze Liu @ University of Kentucky 3/1/2025 11

ISAM What if an index is still too big? Put a another (sparse) index on top of that! ISAM (Index Sequential Access Method), more or less Example: look up 197 100, 200, , 901 Index blocks 100, 123, , 192 200, 901, , 996 100, 108, 119, 121 123, 129, 192, 197, 200, 202, 901, 907, 996, 997, Data blocks 3/1/2025 Jinze Liu @ University of Kentucky 12

Updates with ISAM Example: insert 107 Example: delete 129 100, 200, , 901 Index blocks 100, 123, , 192 200, 901, , 996 100, 108, 119, 121 123, 129, 192, 197, 200, 202, 901, 907, 996, 997, Data blocks 107 Overflow block Overflow chains and empty data blocks degrade performance Worst case: most records go into one long chain 3/1/2025 Jinze Liu @ University of Kentucky 13

A Note of Caution ISAM is an old-fashioned idea B+-trees are usually better, as we ll see But, ISAM is a good place to start to understand the idea of indexing Upshot Don t brag about being an ISAM expert on your resume Do understand how they work, and tradeoffs with B+- trees 3/1/2025 Jinze Liu @ University of Kentucky 14

B+-tree A hierarchy of intervals Balanced (more or less): good performance guarantee Disk-based: one node per block; large fan-out Max fan-out: 4 100 120 150 180 30 179 180 120 156 100 101 130 150 200 110 11 30 35 5 3 3/1/2025 Jinze Liu @ University of Kentucky 15

Sample B+-tree nodes to keys 100 k Max fan-out: 4 150 180 120 Non-leaf to keys 100 <=k < 120 to keys to keys to keys 180 <= k 120 <=k < 150 150 <= k < 180 120 130 Leaf to next leaf node in sequence to records with these k values; or, store records directly in leaves 3/1/2025 Jinze Liu @ University of Kentucky 16

B+-tree balancing properties Height constraint: all leaves at the same lowest level Fan-out constraint: all nodes at least half full (except root) Max # Max # pointers Non-leaf Root Leaf active pointers 2 f Min # Min # keys f 1 f 1 f 1 keys 2 / f 1 f f f / 2 f 1 / 2 / 2 f 3/1/2025 Jinze Liu @ University of Kentucky 17

Lookups SELECT * FROM R WHERE k = 179; SELECT * FROM R WHERE k = 32; Max fan-out: 4 100 120 150 180 30 Not found 179 179 180 120 156 100 101 130 150 200 110 11 30 35 5 3 3/1/2025 Jinze Liu @ University of Kentucky 18

Range query SELECT * FROM R WHERE k > 32 AND k < 179; Max fan-out: 4 100 120 150 180 30 Look up 32 179 180 120 120 156 156 100 100 101 101 130 130 150 150 200 110 110 11 30 35 35 5 3 And follow next-leaf pointers 3/1/2025 Jinze Liu @ University of Kentucky 19

Insertion Insert a record with search key value 32 Max fan-out: 4 100 120 150 180 30 Look up where the inserted key should go 32 179 180 120 156 100 101 130 150 200 110 11 30 35 5 3 And insert it right there 3/1/2025 Jinze Liu @ University of Kentucky 20

Another insertion example Insert a record with search key value 152 Max fan-out: 4 100 120 150 180 152 179 156 120 180 100 101 130 150 200 110 Oops, node is already full! 3/1/2025 Jinze Liu @ University of Kentucky 21

Node splitting Max fan-out: 4 100 Yikes, this node is also already full! 120 150 156 180 150 179 120 180 130 100 152 156 101 200 110 3/1/2025 Jinze Liu @ University of Kentucky 22

More node splitting Max fan-out: 4 156 100 180 120 150 150 179 120 180 130 100 152 156 101 200 110 In the worst case, node splitting can propagate all the way up to the root of the tree (not illustrated here) Splitting the root introduces a new root of fan-out 2 and causes the tree to grow up by one level 3/1/2025 Jinze Liu @ University of Kentucky 23

Insertion B+-tree Insert Find correct leaf L. Put data entry onto L. If L has enough space, done! Else, must splitL (into L and a new node L2) Distribute entries evenly, copy upmiddle key. Insert index entry pointing to L2 into parent of L. This can happen recursively Tree growth: gets wider and (sometimes) one level taller at top. 3/1/2025 Jinze Liu @ University of Kentucky 24

Deletion Delete a record with search key value 130 Max fan-out: 4 100 If a sibling has more than enough keys, steal one! 120 150 180 Look up the key to be deleted 179 156 120 180 100 101 130 150 200 110 And delete it Oops, node is too empty! 3/1/2025 Jinze Liu @ University of Kentucky 25

Stealing from a sibling Max fan-out: 4 100 156 Remember to fix the key in the least common ancestor 120 150 180 179 120 150 180 100 156 101 200 110 3/1/2025 Jinze Liu @ University of Kentucky 26

Another deletion example Delete a record with search key value 179 Max fan-out: 4 100 120 156 180 150 179 120 180 100 156 101 200 110 Cannot steal from siblings Then coalesce (merge) with a sibling! 3/1/2025 Jinze Liu @ University of Kentucky 27

Coalescing Max fan-out: 4 100 Remember to delete the appropriate key from parent 120 156 180 180 150 120 100 156 200 101 110 Deletion can propagate all the way up to the root of the tree (not illustrated here) When the root becomes empty, the tree shrinks by one level 3/1/2025 Jinze Liu @ University of Kentucky 28

Deletion B+-tree Delete Start at root, find leaf L where entry belongs. Remove the entry. If L is at least half-full, done! If L has only d-1 entries, Try to redistribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, mergeL and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Tree shrink: gets narrower and (sometimes) one level lower at top. 3/1/2025 Jinze Liu @ University of Kentucky 29

Example B+ Tree - Inserting 8* Root Root 17 30 24 13 17 24 30 5 13 33* 34* 38* 39* 2* 3* 19* 20* 22* 24* 27* 29* 5* 7* 8* 14* 16* 33* 34* 38* 39* 19* 20* 22* 24* 27* 29* 3* 5* 2* 7* 14* 16* Notice that root was split, leading to increase in height. In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice. 3/1/2025 Jinze Liu @ University of Kentucky 30

Example Tree (including 8*) Delete 19* and 20* ... Root Root 17 30 24 13 17 24 30 5 13 33* 34* 38* 39* 33* 34* 38* 39* 19* 20* 22* 24* 27* 29* 3* 3* 5* 2* 2* 7* 14* 16* 7* 19* 20* 22* 24* 27* 29* 5* 8* 14* 16* 3/1/2025 Jinze Liu @ University of Kentucky 31

Example Tree (including 8*) Delete 19* and 20* ... Root Root 17 17 27 30 5 13 24 30 5 13 33* 34* 38* 39* 2* 3* 22* 24* 27* 29* 5* 7* 8* 14* 16* 33* 34* 38* 39* 2* 3* 19* 20* 22* 24* 27* 29* 5* 7* 8* 14* 16* Deleting 19* is easy. Deleting 20* is done with re-distribution. Notice how middle key is copied up. 3/1/2025 Jinze Liu @ University of Kentucky 32

... And Then Deleting 24* Must merge. Observe `toss of index entry (key 27 on right), and `pull down of index entry (below). 30 39* 22* 27* 38* 29* 33* 34* Root 5 13 17 30 33* 34* 38* 39* 3* 22* 27* 29* 2* 5* 7* 8* 14* 16* 3/1/2025 Jinze Liu @ University of Kentucky 33

Performance analysis How many I/O s are required for each operation? h, the height of the tree (more or less) Plus one or two to manipulate actual records Plus O(h) for reorganization (should be very rare if f is large) Minus one if we cache the root in memory How big is h? Roughly logfan-outN, where N is the number of records B+-tree properties guarantee that fan-out is least f / 2 for all non-root nodes Fan-out is typically large (in hundreds) many keys and pointers can fit into one block A 4-level B+-tree is enough for typical tables 3/1/2025 Jinze Liu @ University of Kentucky 34

B+-tree in practice Complex reorganization for deletion often is not implemented (e.g., Oracle, Informix) Leave nodes less than half full and periodically reorganize Most commercial DBMS use B+-tree instead of hashing- based indexes because B+-tree handles range queries 3/1/2025 Jinze Liu @ University of Kentucky 35

The Halloween Problem Story from the early days of System R UPDATE Payroll SET salary = salary * 1.1 WHERE salary >= 100000; There is a B+-tree index on Payroll(salary) The update never stopped (why?) Solutions? Scan index in reverse Before update, scan index to create a complete to-do list During update, maintain a done list Tag every row with transaction/statement id 3/1/2025 Jinze Liu @ University of Kentucky 36

B+-tree versus ISAM ISAM is more static; B+-tree is more dynamic ISAM is more compact (at least initially) Fewer levels and I/O s than B+-tree Overtime, ISAM may not be balanced Cannot provide guaranteed performance as B+-tree does 3/1/2025 Jinze Liu @ University of Kentucky 37

B+-tree versus B-tree B-tree: why not store records (or record pointers) in non-leaf nodes? These records can be accessed with fewer I/O s Problems? Storing more data in a node decreases fan-out and increases h Records in leaves require more I/O s to access Vast majority of the records live in leaves! 3/1/2025 Jinze Liu @ University of Kentucky 38

Beyond ISAM, B-, and B+-trees Other tree-based indexes: R-trees and variants, GiST, etc. Hashing-based indexes: extensible hashing, linear hashing, etc. Text indexes: inverted-list index, suffix arrays, etc. Other tricks: bitmap index, bit-sliced index, etc. How about indexing subgraph search? 3/1/2025 Jinze Liu @ University of Kentucky 39

Summary Two types of queries Key-search Range-query B+-tree operations Search Insert Split child Delete Redistribution B+-tree sorting Next: disk-based sorting algorithms 3/1/2025 Jinze Liu @ University of Kentucky 40

Introduction to Database Systems

Download Presentation

Presentation Transcript

Related

More Related Content