SQL Server Polybase: Data Virtualization Overview

 
 
 
 
 
 
 
 
 
SQL Server
T-SQL
Analytics
Apps
ODBC
NoSQL
Relational databases
Big data
PolyBase external tables
 
D
istributed compute engine integrated with SQL
 Server
Query data where it lives using T-SQL
Distributed, scalable query performance
Manual/deploy with SQL Server
Auto deploy/optimize with Big Data Clusters
Intelligence over all data
 
“It’s all about
Data Virtualization”
 
 
 
 
 
 
 
 
 
 
SQL Engine
Polybase
Engine
Polybase Data
Movement
Service
SQL Engine
Polybase
Engine
Polybase Data
Movement
Service
SQL Engine
Polybase
Engine
Polybase Data
Movement
Service
mpdwsvc.exe 
 
 
 
 
 
 
 
 
 
 
ODBC Connector
 
odbc
 
You install the driver
No client software
install required
 
 
 
 
 
 
 
 
 
SQL Server
2019
 
 
 
 
 
 
 
 
 
 
The Solution
Kubernetes and containers to deploy and scale elastically
Everything pre-installed including HDFS cluster
Build a control plane to help manage and monitor
Enhance SQL Server to read from HDFS natively
Provide a data mart for cached results
Introduce Spark and Notebooks for Data scientists
 
 
 
 
Polybase head node
Polybase head node
In Linux container
In Linux container
 
Polybase compute nodes
Polybase compute nodes
In Linux containers
In Linux containers
 
“Built-in”
“Built-in”
 Data Sources
 Data Sources
 
MapReduce
MapReduce
Not used
Not used
mpdwsvc.exe
uses SQLPAL
Control Plane
Controller Svc
Configuration Store (SQL Server)
Grafana
Elastic Search
Azure FSM Engine
InfluxDB
Kibana
 
 
 
 
 
CREATE EXTERNAL DATA SOURCE 
SqlDataPool
WITH (LOCATION = 
'sqldatapool
://
service-mssql-
controller:8080/datapools/default
’);
 
 
 
 
CREATE EXTERNAL TABLE…
    WITH
    (
        DATA_SOURCE = 
SqlDataPool
,
        DISTRIBUTION = ROUND_ROBIN
    );
 
 
 
 
 
 
 
 
 
 
https://aka.ms/sqlworkshops
https://aka.ms/bobsql
https://aka.ms/bobwardms
https://aka.ms/bobsqldemos
Slide Note

© Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Embed
Share

Learn about SQL Server Polybase, a data virtualization feature that allows distributed query processing and data virtualization across various sources such as HDFS, Cosmos DB, and more. Discover how to use Polybase to build a data hub within SQL Server, enabling efficient query performance and analytics for diverse data sources.

  • SQL Server
  • Data Virtualization
  • Polybase
  • Distributed Query
  • Big Data Clusters

Uploaded on Sep 17, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Polybase++ in SQL Server 2019 SQLIntersection June 2019 Bob Ward, Microsoft

  2. Session learning objectives At the end of this session, you should be better able to Understand what is Polybase and how to use it. Understand how Polybase works to know its capabilities for data virtualization Learn how to use Polybase to build a data hub with SQL Server

  3. What is SQL Server Polybase? It s all about Data Virtualization Distributed compute engine integrated with SQL Server Query data where it lives using T-SQL Distributed, scalable query performance Analytics Apps T-SQL Manual/deploy with SQL Server SQL Server Auto deploy/optimize with Big Data Clusters PolyBase external tables ODBC NoSQL Relational databases Big data Excel Cosmos DB HDFS Intelligence over all data

  4. The Journey to Data Virtualization in SQL Server 2019 SQL Server 2019 includes Polybase classic, new ODBC data sources, and Big Data Clusters (BDC) Project Aris commences in 2017 to take Polybase to the next level Microsoft acquires Metanautix in 2015 bringing new connectors Polybase ships with SQL Server 2016 - Polybase classic Built-in connectors on Linux Polybase comes to PDW 2012 MSFT Jim Gray Labs Project to query big data with SQL in 2011 David Dewitt Rimma Nehme SQL Server 2019 is in Preview Some details subject to change

  5. Using Polybase in SQL Server: T-SQL EXTERNAL TABLE Not simple without BDC Login and password Only for HDFS Setup and configure Polybase Create EXTERNAL DATA SOURCE Create Setup EXTERNAL FILE FORMAT authentication Create EXTERNAL TABLE Create statistics on key columns Query like any other table And join to any other table or external table metadata Results streamed HDFS WWI Data lives here SQL INSERT only for HDFS Cosmos DB

  6. control and execution The SQL Server Polybase Architecture EE only Data flow All editions All editions Compute node Head node Compute node DW dbs SQL Engine SQL Engine SQL Engine DW dbs DW dbs Need more scale? Add compute nodes tempdb tempdb tempdb Polybase Engine Polybase Engine Polybase Engine ... Polybase Data Movement Service Polybase Data Movement Service Polybase Data Movement Service Scan or pushdown Scale out with partitions shuffle mpdwsvc.exe Your data sources HDFS Cosmos DB

  7. Demo Dive into Polybase

  8. SQL Server Polybase and Hadoop Polybase Classic Connect to Hadoop: Cloudera or HortonWorks on Windows and Linux Azure Blob Storage Direct Steaming or Java MapReduce SQL Server or Azure SQL Data Warehouse Polybase and HDFS in Big Data Clusters Direct Access to HDFS via SQL Server Engine Hadoop cluster pre-installed with HDFS and Spark HDFS metadata handled within the cluster

  9. Polybase and Other Connectors LOCATION string in EXTERNAL DATA SOURCE \binn\Polybase\ODBC Drivers Built-in Connectors (use ODBC) sqlserver sqlserver oracle oracle teradata teradata mongodb mongodb SQL Server, Azure SQL Database, Azure SQL Data Warehouse Scale out with partitions No client software install required MongoDB or CosmosDB (using MongoDB API) ODBC Connector odbc odbc You install the driver 64bit ODBC 3.0+ compliant Ex. SAP HANA (HDBCODBC Driver)

  10. Polybase as a Semantic Layer Use EXTERNAL table to map common names familiar to your SQL Server Exclude fields or columns from data source that are not needed (Planned) Use SQL Server views to abstract joins and data source access Use UNION to join similar data from data sources and local SQL Server

  11. External Tables vs Linked Servers External Tables Linked Servers Database Scoped Instance Scoped ODBC Drivers OLEDB Providers Read-only* Read/write Scale out queries with push-down Single threaded queries with push-down Failover with AG Requires separate config from AG Basic authentication Basic and integrated authentication Distributed Transactions not supported Distributed Transactions supported * Insert into HDFS allowed

  12. SQL Server 2019: Data Virtualization Modern StockItems Legacy Suppliers Mobile App Orders SQL Server 2019 WWI WideWorldimporters SQL Accounts Receivable Customers from Acquisition Order Reviews

  13. Demo Data Virtualization around the WideWorld

  14. Lessons Learned COLLATE required for character columns. Compat may be required ORACLE case sensitive for LOCATION = <instance>.<schema>.<table> LOCATION for SQL Server is <db>.<schema>.<table> EXTERNAL tables don t support these types (they may be more) VARCHAR(MAX) GEOGRAPHY Computed Columns JSON MongoDB (CosmosDB) observations Be careful of types in your document LOCATION = <database>.<collection> Need to dive in EXTERNAL TABLE compatibility (Ex. Row Level Security)

  15. SQL Server 2019 Big Data Clusters and Polybase The Problem The Problem Customers want to integrate with Big Data and other data sources easily Polybase is not simple to setup, configure, maintain, and scale elastically Customers may not have a Hadoop cluster or want to build one Polybase classic (MapReduce) could be better The Solution The Solution Kubernetes and containers to deploy and scale elastically Everything pre-installed including HDFS cluster Build a control plane to help manage and monitor Enhance SQL Server to read from HDFS natively Provide a data mart for cached results Introduce Spark and Notebooks for Data scientists

  16. Polybase in SQL Server 2019 Big Data Clusters mpdwsvc.exe uses SQLPAL Control Plane Controller Svc Azure FSM Engine Kibana Grafana Custom apps BI Analytics Elastic Search InfluxDB Configuration Store (SQL Server) Polybase head node In Linux container Cluster Polybase compute nodes In Linux containers SQL Server master instance SQL External data sources Compute pool Compute pool Compute pool Directly read from HDFS SQL Compute Node SQL Compute Node SQL Compute Node SQL Compute Node SQL Compute Node Data mart Storage pool Built-in Data Sources SQL Data Node SQL Data Node SQL Server SQL Server SQL Server Spark Spark Spark MapReduce Not used HDFS Data Node HDFS Data Node HDFS Data Node Storage Storage Kubernetes pod IoT data Node Node Node Node Node Node Node Persistent storage

  17. Storage and Data Pools Data Sources CREATE EXTERNAL DATA SOURCE SqlStoragePool WITH (LOCATION = 'sqlhdfs 'sqlhdfs://service controller:8080 controller:8080 ); SqlStoragePool service- -mssql mssql- - CREATE EXTERNAL DATA SOURCE SqlDataPool WITH (LOCATION = 'sqldatapool 'sqldatapool://service controller:8080/datapools/default controller:8080/datapools/default ); SqlDataPool service- -mssql mssql- - Preinstalled in model of Master Instance mssql-controller REST endpoint CREATE EXTERNAL TABLE WITH ( DATA_SOURCE = SqlStoragePool LOCATION = '/clickstream_data', FILE_FORMAT = csv_file ); CREATE EXTERNAL TABLE WITH ( DATA_SOURCE = SqlDataPool DISTRIBUTION = ROUND_ROBIN ); SqlDataPool, SqlStoragePool,

  18. Azure Data Studio and Data Virtualization Using PROSE for intelligent import and schema detection Code generation in Notebooks Import wizards External Table detection for HDFS

  19. Session takeaways Polybase = Data Virtualization = Reduced Need for ETL Polybase provides distributed read scale performance Big Data Clusters automate the deployment of Polybase Download and try it yourself Sign up for EAP for SQL Server 2019 Big Data Clusters

  20. https://aka.ms/sqlworkshops https://aka.ms/bobsql https://aka.ms/bobwardms https://aka.ms/bobsqldemos Session resources SQL Server 2019 Polybase documentation SQL Server Big Data Clusters documentation Polybase demos on GitHub Azure SQL Database Elastic Query documentation Loading data into Azure SQL Data Warehouse with Polybase

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#