Relational Database Query Execution

undefined

RELATIONAL DATABASE

QUERY EXECUTION

Software Performance Engineering

The Life of a Query

■

An SQL query starts as a string

■

Compilation: gets converted to a

parse tree

■

Query rewrite

–

Convert from parse tree to

logical query plan

–

Some optimization steps here

–

Different tree structure

■

Physical plan generation

–

Convert logical query plan to physical query plan

–

Select algorithms for each of the operators

–

Choose order of operations

–

Governed largely by disk I/O

■

“Query optimizer” is a catch-all term for those second two steps

Example Parse Tree

■

From:

http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/5-

query-opt/FIGS/parseTree01.gif

Physical Plan Operators

■

Table scan

–

Iterate over the entire table in physical order

–

Return them in the order they were physically located

■

Index scan

–

Do a lookup on an index

–

Return the items that match the index

–

Possibly physically adjacent, possibly not

■

Sort-scan

–

Return the tuples needed in sorted order (e.g. ORDER BY, or

other query operations)

–

Can be done in a few ways

■

Does the index support sorted returns? e.g. use b+trees

■

Will the results fit in a sort buffer?

Place tuples in a sort buffer and sort as a separate operation

■

Otherwise, sort the results in pieces and place them back on

secondary storage and return merge-sorted

Categories of Operations

■

Tuple-at-a-time, unary operations

–

Some operations only need to see one tuple at a time

–

e.g. selection

–

Only needs a small buffer

–

But, they still benefit from multiple buffers because

physical scans can read a lot at a time

■

Full-relation, unary operations

–

The entire table is needed, but only one parameter

–

e.g. grouping (GROUP BY)

–

e.g. duplicate-elimination (DISTINCT)

■

Full-relation, binary operations

–

e.g. join, union, intersection, cross-products

One-Pass vs. Two-Pass Algorithms

■

Some algorithms allow for only one pass through the data,

others require two iterations

■

Many of the two-pass algorithms involve sorting first, then

using that assumption

■

So are two-pass algorithms worse than one-pass?

–

Not necessarily.

–

Two-pass algorithms can use less memory

–

If sorting was already required, then using that

assumption speeds up other operations anyway – best to

choose the right algorithm for the situation

e.g. DISTINCT

■

Is DISTINCT a slow operation?

–

SELECT DISTINCT username FROM access_logs

–

Without any other conditions, yes

–

Requires an input buffer to conduct a hash-based distinction

–

Memory usage is dependent upon the distribution of

usernames!! (i.e. many names, bigger hash table…)

■

But! What if we:

–

Use a B-tree: then we can return lists in sorted order

–

Have the table be in physically sorted order so that sequential

reads are fast

■

Now we can do a two-pass

–

Knowing we’ll see

username

s in sorted order, we know that any

duplicates would be adjacent

–

Now we can store ONLY the last tuple and compare

–

Less memory, and faster

e.g. GROUP BY

■

How can aggregation be computed?

–

e.g.

 SELECT username, count(*) FROM access_logs

    GROUP BY username

–

One-pass algorithm: hash table in the input buffer,

“admin”=>10}

■

Note: aggregation operator has a big impact on memory usage

–

e.g.

COUNT

MIN

MAX

 are small – need one number per group

–

e.g,

AVG

 is big – need to keep the entire group!

■

Two-pass algorithm: sort by

username

–

e.g.

 SELECT student, avg(*) FROM grades

    GROUP BY student

–

You know all students named “dustin” are adjacent on disk

–

Store beginnings and ends of groups to the places on disk and

access them when needed

Iterator Pattern

■

Relational databases will often implement the iterator pattern

–

Have a LIMIT on the result set explicitly

–

Or, the result set has an internal limit based on the

iterator

■

This can often complicate the implementation of these

algorithms

■

This often can be exploited for performance reasons:

–

Instead of sorting an entire table, only try to retrieve the

top 10

–

“Retrieve the top 10” is a different algorithm than

sorting!

Slide Note

Embed Share

Download

In the world of relational databases, query execution is a crucial process involving steps like compilation, optimization, and physical plan generation. This involves converting SQL queries into logical query plans, selecting algorithms, and optimizing operations for efficient performance. Different types of operators and categories of operations play a role in executing queries effectively. Understanding the process from query string to result retrieval is essential for software performance engineering. One-pass and two-pass algorithms further impact query processing efficiency, emphasizing the importance of choosing the right approach based on the situation.

teci920 Follow

Uploaded on Mar 05, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

RELATIONAL DATABASE QUERY EXECUTION Software Performance Engineering 1

The Life of a Query An SQL query starts as a string Compilation: gets converted to a parse tree Query rewrite Convert from parse tree to logical query plan Some optimization steps here Different tree structure Physical plan generation Convert logical query plan to physical query plan Select algorithms for each of the operators Choose order of operations Governed largely by disk I/O parse tree logical query plan Query optimizer is a catch-all term for those second two steps 2

Example Parse Tree From: http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/5- query-opt/FIGS/parseTree01.gif 3

Physical Plan Operators Table scan Iterate over the entire table in physical order Return them in the order they were physically located Index scan Do a lookup on an index Return the items that match the index Possibly physically adjacent, possibly not Sort-scan Return the tuples needed in sorted order (e.g. ORDER BY, or other query operations) Can be done in a few ways Does the index support sorted returns? e.g. use b+trees Will the results fit in a sort buffer? Place tuples in a sort buffer and sort as a separate operation Otherwise, sort the results in pieces and place them back on secondary storage and return merge-sorted 4

Categories of Operations Tuple-at-a-time, unary operations Some operations only need to see one tuple at a time e.g. selection Only needs a small buffer But, they still benefit from multiple buffers because physical scans can read a lot at a time Full-relation, unary operations The entire table is needed, but only one parameter e.g. grouping (GROUP BY) e.g. duplicate-elimination (DISTINCT) Full-relation, binary operations e.g. join, union, intersection, cross-products 5

One-Pass vs. Two-Pass Algorithms Some algorithms allow for only one pass through the data, others require two iterations Many of the two-pass algorithms involve sorting first, then using that assumption So are two-pass algorithms worse than one-pass? Not necessarily. Two-pass algorithms can use less memory If sorting was already required, then using that assumption speeds up other operations anyway best to choose the right algorithm for the situation 6

e.g. DISTINCT Is DISTINCT a slow operation? SELECT DISTINCT username FROM access_logs Without any other conditions, yes Requires an input buffer to conduct a hash-based distinction Memory usage is dependent upon the distribution of usernames!! (i.e. many names, bigger hash table ) But! What if we: Use a B-tree: then we can return lists in sorted order Have the table be in physically sorted order so that sequential reads are fast Now we can do a two-pass Knowing we ll see usernames in sorted order, we know that any duplicates would be adjacent Now we can store ONLY the last tuple and compare Less memory, and faster 7

e.g. GROUP BY How can aggregation be computed? e.g. SELECT username, count(*) FROM access_logs GROUP BY username One-pass algorithm: hash table in the input buffer, { admin =>10} Note: aggregation operator has a big impact on memory usage e.g. COUNT, MIN, MAX are small need one number per group e.g, AVG is big need to keep the entire group! Two-pass algorithm: sort by username e.g. SELECT student, avg(*) FROM grades GROUP BY student You know all students named dustin are adjacent on disk Store beginnings and ends of groups to the places on disk and access them when needed 8

Iterator Pattern Relational databases will often implement the iterator pattern Have a LIMIT on the result set explicitly Or, the result set has an internal limit based on the iterator This can often complicate the implementation of these algorithms This often can be exploited for performance reasons: Instead of sorting an entire table, only try to retrieve the top 10 Retrieve the top 10 is a different algorithm than sorting! 9

Relational Database Query Execution

Download Presentation

Presentation Transcript

Related

More Related Content