Understanding Parallel Databases and Their Impact on Performance

Slide Note
Embed
Share

Explore the concept of parallel databases, how they address the I/O bottleneck, and their benefits such as increased scalability and improved application availability. Learn about parallel architectures and shared memory systems in advanced database design. Discover the importance of concurrency control and the impact of the memory hierarchy on data processing efficiency.


Uploaded on Sep 12, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Parallel Databases COMP3211 Advanced Databases Dr Nicholas Gibbins - nmg@ecs.soton.ac.uk 2014-2015

  2. Overview The I/O bottleneck Parallel architectures Parallel query processing Inter-operator parallelism Intra-operator parallelism Bushy parallelism Concurrency control Reliability 2

  3. The I/O Bottleneck

  4. The Memory Hierarchy, Revisited Type Capacity Latency Registers 101 bytes 1 cycle L1 104 bytes <5 cycles L2 105 bytes 5-10 cycles RAM 109-1010 bytes 20-30 cycles (10-8 s) Hard Disk 1011-1012 bytes 106 cycles (10-3 s) 4

  5. The I/O Bottleneck Access time to secondary storage (hard disks) dominates performance of DBMSes Two approaches to addressing this: Main memory databases (expensive!) Parallel databases (cheaper!) Increase I/O bandwidth by spreading data across a number of disks 5

  6. Definitions Parallelism An arrangement or state that permits several operations or tasks to be performed simultaneously rather than consecutively Parallel Databases have the ability to split processing of data access to data across multiple processors, multiple disks 6

  7. Why Parallel Databases Hardware trends Reduced elapsed time for queries Increased transaction throughput Increased scalability Better price/performance Improved application availability Access to more data In short, for better performance 7

  8. Parallel Architectures

  9. Shared Memory Architecture Tightly coupled P P P Symmetric Multiprocessor (SMP) P = processor Global Memory M = memory 9

  10. Software Shared Memory Less complex database software P P P Limited scalability Single buffer Single database storage Global Memory 10

  11. Shared Disc Architecture Loosely coupled P P P Distributed Memory M M M S 11

  12. Software Shared Disc Avoids memory bottleneck P P P Same page may be in more than one buffer at once can lead to incoherence M M M Needs global locking mechanism S Single logical database storage Each processor has its own database buffer 12

  13. Shared Nothing Architecture Massively Parallel P P P Loosely Coupled High Speed Interconnect (between processors) M M M 13

  14. Software - Shared Nothing Each processor owns part of the data P P P Each processor has its own database buffer M M M One page is only in one local buffer no buffer incoherence Needs distributed deadlock detection Needs multiphase commit protocol Needs to break SQL requests into multiple sub-requests 14

  15. Hardware vs. Software Architecture It is possible to use one software strategy on a different hardware arrangement Also possible to simulate one hardware configuration on another Virtual Shared Disk (VSD) makes an IBM SP shared nothing system look like a shared disc setup (for Oracle) From this point on, we deal only with shared nothing 15

  16. Shared Nothing Challenges Partitioning the data Keeping the partitioned data balanced Splitting up queries to get the work done Avoiding distributed deadlock Concurrency control Dealing with node failure 16

  17. Parallel Query Processing

  18. Dividing up the Work Application Coordinator Process Worker Process Worker Process Worker Process 18

  19. Database Software on each node App1 App2 DBMS DBMS DBMS C1 C2 W1 W2 W1 W2 W1 W2 19

  20. Inter-Query Parallelism Improves throughput Different queries/transactions execute on different processors (largely equivalent to material in lectures on concurrency) 20

  21. Intra-Query Parallelism Improves response times (lower latency) Intra-operator (horizontal) parallelism Operators decomposed into independent operator instances, which perform the same operation on different subsets of data Inter-operator (vertical) parallelism Operations are overlapped Pipeline data from one stage to the next without materialisation Bushy (independent) parallelism Subtrees in query plan executed concurrently 21

  22. Intra-Operator Parallelism

  23. Intra-Operator Parallelism SQL Query Subset Queries Subset Queries Subset Queries Subset Queries Processor Processor Processor Processor 23

  24. Partitioning Decomposition of operators relies on data being partitioned across the servers that comprise the parallel database Access data in parallel to mitigate the I/O bottleneck Partitions should aim to spread I/O load evenly across servers Choice of partitions affords different parallel query processing approaches: Range partitioning Hash partitioning Schema partitioning 24

  25. Range Partitioning A-H I-P Q-Z 25

  26. Hash Partitioning Table 26

  27. Schema Partitioning Table 1 Table 2 27

  28. Rebalancing Data Data in proper balance Data grows, performance drops Add new nodes and disc Redistribute data to new nodes 28

  29. Intra-Operator Parallelism Example query: SELECT c1,c2 FROM t WHERE c1>5.5 Assumptions: 100,000 rows Predicates eliminate 90% of the rows Considerations for query plans: Data shipping Query shipping 29

  30. Data Shipping c1,c2 c1>5.5 t1 t2 t3 t4 30

  31. Data Shipping Coordinator and Worker 10,000 tuples (c1,c2) Network 25,000 tuples 25,000 tuples 25,000 tuples 25,000 tuples Worker Worker Worker Worker 31

  32. Query Shipping c1,c2 c1,c2 c1,c2 c1,c2 c1>5.5 c1>5.5 c1>5.5 c1>5.5 t1 t2 t3 t4 32

  33. Query Shipping 10,000 tuples (c1,c2) Coordinator Network 2,500 tuples 2,500 tuples 2,500 tuples 2,500 tuples Worker Worker Worker Worker 33

  34. Query Shipping Benefits Database operations are performed where the data are, as far as possible Network traffic is minimised For basic database operators, code developed for serial implementations can be reused In practice, mixture of query shipping and data shipping has to be employed 34

  35. Inter-Operator Parallelism

  36. Inter-Operator Parallelism Allows operators with a producer-consumer dependency to be executed concurrently Results produced by producer are pipelined directly to consumer Consumer can start before producer has produced all results No need to materialise intermediate relations on disk (although available buffer memory is a constraint) Best suited to single-pass operators 36

  37. Inter-Operator Parallelism Scan Join Sort Scan Join Sort time 37

  38. Intra- + Inter-Operator Parallelism Scan Join Sort Scan Join Sort Scan Scan Join Join Sort Sort time 38

  39. The Volcano Architecture Basic operators as usual: scan, join, sort, aggregate (sum, count, average, etc) The Exchange operator Inserted between the steps of a query to: Pipeline results Direct streams of data to the next step(s), redistributing as necessary Provides mechanism to support both vertical and horizontal parallelism 39

  40. Exchange Operators Example query: SELECT county, SUM(order_item) FROM customer, order WHERE order.customer_id=customer_id GROUP BY county ORDER BY SUM(order_item) 40

  41. Exchange Operators SORT GROUP HASH JOIN SCAN SCAN Customer Order 41

  42. Exchange Operators HASH JOIN HASH JOIN HASH JOIN EXCHANGE SCAN SCAN Customer 42

  43. Exchange Operators HASH JOIN HASH JOIN HASH JOIN EXCHANGE EXCHANGE SCAN SCAN SCAN SCAN SCAN 43 Customer Order

  44. SORT EXCHANGE GROUP GROUP EXCHANGE HASH JOIN HASH JOIN HASH JOIN EXCHANGE EXCHANGE SCAN SCAN SCAN SCAN SCAN 44 Customer Order

  45. Bushy Parallelism

  46. Bushy Parallelism Execute subtrees concurrently R S T U 46

  47. Parallel Query Processing

  48. Some Parallel Queries Enquiry Collocated Join Directed Join Broadcast Join Repartitioned Join Combine aspects of intra-operator and bushy parallelism 48

  49. Orders Database CUSTOMER CUSTKEY C_NAME C_NATION ORDER ORDERKEY DATE CUSTKEY SUPPKEY SUPPLIER SUPPKEY S_NAME S_NATION 49

  50. Enquiry/Query How many customers live in the UK? Return to application Coordinator SUM Return subcounts to coordinator SCAN Slave Task COUNT Multiple partitions of customer table 50

More Related Content