Relational Database Query Execution

undefined
 
RELATIONAL DATABASE
QUERY EXECUTION
 
Software Performance Engineering
 
1
 
The Life of a Query
 
An SQL query starts as a string
Compilation: gets converted to a 
parse tree
Query rewrite
Convert from parse tree to 
logical query plan
Some optimization steps here
Different tree structure
Physical plan generation
Convert logical query plan to physical query plan
Select algorithms for each of the operators
Choose order of operations
Governed largely by disk I/O
 
“Query optimizer” is a catch-all term for those second two steps
 
 
 
2
 
Example Parse Tree
 
 
 
 
 
 
 
 
 
From:
http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/5-
query-opt/FIGS/parseTree01.gif
 
3
 
 
Physical Plan Operators
 
Table scan
Iterate over the entire table in physical order
Return them in the order they were physically located
Index scan
Do a lookup on an index
Return the items that match the index
Possibly physically adjacent, possibly not
Sort-scan
Return the tuples needed in sorted order (e.g. ORDER BY, or
other query operations)
Can be done in a few ways
Does the index support sorted returns? e.g. use b+trees
Will the results fit in a sort buffer?
Place tuples in a sort buffer and sort as a separate operation
Otherwise, sort the results in pieces and place them back on
secondary storage and return merge-sorted
 
4
 
Categories of Operations
 
Tuple-at-a-time, unary operations
Some operations only need to see one tuple at a time
e.g. selection
Only needs a small buffer
But, they still benefit from multiple buffers because
physical scans can read a lot at a time
Full-relation, unary operations
The entire table is needed, but only one parameter
e.g. grouping (GROUP BY)
e.g. duplicate-elimination (DISTINCT)
Full-relation, binary operations
e.g. join, union, intersection, cross-products
 
 
5
 
One-Pass vs. Two-Pass Algorithms
 
Some algorithms allow for only one pass through the data,
others require two iterations
 
Many of the two-pass algorithms involve sorting first, then
using that assumption
 
So are two-pass algorithms worse than one-pass?
Not necessarily.
Two-pass algorithms can use less memory
If sorting was already required, then using that
assumption speeds up other operations anyway – best to
choose the right algorithm for the situation
 
6
 
e.g. DISTINCT
 
Is DISTINCT a slow operation?
SELECT DISTINCT username FROM access_logs
Without any other conditions, yes
Requires an input buffer to conduct a hash-based distinction
Memory usage is dependent upon the distribution of
usernames!! (i.e. many names, bigger hash table…)
But! What if we:
Use a B-tree: then we can return lists in sorted order
Have the table be in physically sorted order so that sequential
reads are fast
Now we can do a two-pass
Knowing we’ll see 
username
s in sorted order, we know that any
duplicates would be adjacent
Now we can store ONLY the last tuple and compare
Less memory, and faster
 
7
 
e.g. GROUP BY
 
How can aggregation be computed?
e.g.
 SELECT username, count(*) FROM access_logs
    GROUP BY username
One-pass algorithm: hash table in the input buffer, 
{
“admin”=>10}
Note: aggregation operator has a big impact on memory usage
e.g. 
COUNT
, 
MIN
, 
MAX
 are small – need one number per group
e.g, 
AVG
 is big – need to keep the entire group!
Two-pass algorithm: sort by 
username
e.g.
 SELECT student, avg(*) FROM grades
    GROUP BY student
You know all students named “dustin” are adjacent on disk
Store beginnings and ends of groups to the places on disk and
access them when needed
 
8
 
Iterator Pattern
 
Relational databases will often implement the iterator pattern
Have a LIMIT on the result set explicitly
Or, the result set has an internal limit based on the
iterator
This can often complicate the implementation of these
algorithms
This often can be exploited for performance reasons:
Instead of sorting an entire table, only try to retrieve the
top 10
“Retrieve the top 10” is a different algorithm than
sorting!
 
9
Slide Note
Embed
Share

In the world of relational databases, query execution is a crucial process involving steps like compilation, optimization, and physical plan generation. This involves converting SQL queries into logical query plans, selecting algorithms, and optimizing operations for efficient performance. Different types of operators and categories of operations play a role in executing queries effectively. Understanding the process from query string to result retrieval is essential for software performance engineering. One-pass and two-pass algorithms further impact query processing efficiency, emphasizing the importance of choosing the right approach based on the situation.

  • Relational Database
  • Query Execution
  • Software Performance
  • Optimization
  • SQL

Uploaded on Mar 05, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. RELATIONAL DATABASE QUERY EXECUTION Software Performance Engineering 1

  2. The Life of a Query An SQL query starts as a string Compilation: gets converted to a parse tree Query rewrite Convert from parse tree to logical query plan Some optimization steps here Different tree structure Physical plan generation Convert logical query plan to physical query plan Select algorithms for each of the operators Choose order of operations Governed largely by disk I/O parse tree logical query plan Query optimizer is a catch-all term for those second two steps 2

  3. Example Parse Tree From: http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/5- query-opt/FIGS/parseTree01.gif 3

  4. Physical Plan Operators Table scan Iterate over the entire table in physical order Return them in the order they were physically located Index scan Do a lookup on an index Return the items that match the index Possibly physically adjacent, possibly not Sort-scan Return the tuples needed in sorted order (e.g. ORDER BY, or other query operations) Can be done in a few ways Does the index support sorted returns? e.g. use b+trees Will the results fit in a sort buffer? Place tuples in a sort buffer and sort as a separate operation Otherwise, sort the results in pieces and place them back on secondary storage and return merge-sorted 4

  5. Categories of Operations Tuple-at-a-time, unary operations Some operations only need to see one tuple at a time e.g. selection Only needs a small buffer But, they still benefit from multiple buffers because physical scans can read a lot at a time Full-relation, unary operations The entire table is needed, but only one parameter e.g. grouping (GROUP BY) e.g. duplicate-elimination (DISTINCT) Full-relation, binary operations e.g. join, union, intersection, cross-products 5

  6. One-Pass vs. Two-Pass Algorithms Some algorithms allow for only one pass through the data, others require two iterations Many of the two-pass algorithms involve sorting first, then using that assumption So are two-pass algorithms worse than one-pass? Not necessarily. Two-pass algorithms can use less memory If sorting was already required, then using that assumption speeds up other operations anyway best to choose the right algorithm for the situation 6

  7. e.g. DISTINCT Is DISTINCT a slow operation? SELECT DISTINCT username FROM access_logs Without any other conditions, yes Requires an input buffer to conduct a hash-based distinction Memory usage is dependent upon the distribution of usernames!! (i.e. many names, bigger hash table ) But! What if we: Use a B-tree: then we can return lists in sorted order Have the table be in physically sorted order so that sequential reads are fast Now we can do a two-pass Knowing we ll see usernames in sorted order, we know that any duplicates would be adjacent Now we can store ONLY the last tuple and compare Less memory, and faster 7

  8. e.g. GROUP BY How can aggregation be computed? e.g. SELECT username, count(*) FROM access_logs GROUP BY username One-pass algorithm: hash table in the input buffer, { admin =>10} Note: aggregation operator has a big impact on memory usage e.g. COUNT, MIN, MAX are small need one number per group e.g, AVG is big need to keep the entire group! Two-pass algorithm: sort by username e.g. SELECT student, avg(*) FROM grades GROUP BY student You know all students named dustin are adjacent on disk Store beginnings and ends of groups to the places on disk and access them when needed 8

  9. Iterator Pattern Relational databases will often implement the iterator pattern Have a LIMIT on the result set explicitly Or, the result set has an internal limit based on the iterator This can often complicate the implementation of these algorithms This often can be exploited for performance reasons: Instead of sorting an entire table, only try to retrieve the top 10 Retrieve the top 10 is a different algorithm than sorting! 9

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#