SQL Server Polybase: Data Virtualization Overview
Learn about SQL Server Polybase, a data virtualization feature that allows distributed query processing and data virtualization across various sources such as HDFS, Cosmos DB, and more. Discover how to use Polybase to build a data hub within SQL Server, enabling efficient query performance and analytics for diverse data sources.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Polybase++ in SQL Server 2019 SQLIntersection June 2019 Bob Ward, Microsoft
Session learning objectives At the end of this session, you should be better able to Understand what is Polybase and how to use it. Understand how Polybase works to know its capabilities for data virtualization Learn how to use Polybase to build a data hub with SQL Server
What is SQL Server Polybase? It s all about Data Virtualization Distributed compute engine integrated with SQL Server Query data where it lives using T-SQL Distributed, scalable query performance Analytics Apps T-SQL Manual/deploy with SQL Server SQL Server Auto deploy/optimize with Big Data Clusters PolyBase external tables ODBC NoSQL Relational databases Big data Excel Cosmos DB HDFS Intelligence over all data
The Journey to Data Virtualization in SQL Server 2019 SQL Server 2019 includes Polybase classic, new ODBC data sources, and Big Data Clusters (BDC) Project Aris commences in 2017 to take Polybase to the next level Microsoft acquires Metanautix in 2015 bringing new connectors Polybase ships with SQL Server 2016 - Polybase classic Built-in connectors on Linux Polybase comes to PDW 2012 MSFT Jim Gray Labs Project to query big data with SQL in 2011 David Dewitt Rimma Nehme SQL Server 2019 is in Preview Some details subject to change
Using Polybase in SQL Server: T-SQL EXTERNAL TABLE Not simple without BDC Login and password Only for HDFS Setup and configure Polybase Create EXTERNAL DATA SOURCE Create Setup EXTERNAL FILE FORMAT authentication Create EXTERNAL TABLE Create statistics on key columns Query like any other table And join to any other table or external table metadata Results streamed HDFS WWI Data lives here SQL INSERT only for HDFS Cosmos DB
control and execution The SQL Server Polybase Architecture EE only Data flow All editions All editions Compute node Head node Compute node DW dbs SQL Engine SQL Engine SQL Engine DW dbs DW dbs Need more scale? Add compute nodes tempdb tempdb tempdb Polybase Engine Polybase Engine Polybase Engine ... Polybase Data Movement Service Polybase Data Movement Service Polybase Data Movement Service Scan or pushdown Scale out with partitions shuffle mpdwsvc.exe Your data sources HDFS Cosmos DB
Demo Dive into Polybase
SQL Server Polybase and Hadoop Polybase Classic Connect to Hadoop: Cloudera or HortonWorks on Windows and Linux Azure Blob Storage Direct Steaming or Java MapReduce SQL Server or Azure SQL Data Warehouse Polybase and HDFS in Big Data Clusters Direct Access to HDFS via SQL Server Engine Hadoop cluster pre-installed with HDFS and Spark HDFS metadata handled within the cluster
Polybase and Other Connectors LOCATION string in EXTERNAL DATA SOURCE \binn\Polybase\ODBC Drivers Built-in Connectors (use ODBC) sqlserver sqlserver oracle oracle teradata teradata mongodb mongodb SQL Server, Azure SQL Database, Azure SQL Data Warehouse Scale out with partitions No client software install required MongoDB or CosmosDB (using MongoDB API) ODBC Connector odbc odbc You install the driver 64bit ODBC 3.0+ compliant Ex. SAP HANA (HDBCODBC Driver)
Polybase as a Semantic Layer Use EXTERNAL table to map common names familiar to your SQL Server Exclude fields or columns from data source that are not needed (Planned) Use SQL Server views to abstract joins and data source access Use UNION to join similar data from data sources and local SQL Server
External Tables vs Linked Servers External Tables Linked Servers Database Scoped Instance Scoped ODBC Drivers OLEDB Providers Read-only* Read/write Scale out queries with push-down Single threaded queries with push-down Failover with AG Requires separate config from AG Basic authentication Basic and integrated authentication Distributed Transactions not supported Distributed Transactions supported * Insert into HDFS allowed
SQL Server 2019: Data Virtualization Modern StockItems Legacy Suppliers Mobile App Orders SQL Server 2019 WWI WideWorldimporters SQL Accounts Receivable Customers from Acquisition Order Reviews
Demo Data Virtualization around the WideWorld
Lessons Learned COLLATE required for character columns. Compat may be required ORACLE case sensitive for LOCATION = <instance>.<schema>.<table> LOCATION for SQL Server is <db>.<schema>.<table> EXTERNAL tables don t support these types (they may be more) VARCHAR(MAX) GEOGRAPHY Computed Columns JSON MongoDB (CosmosDB) observations Be careful of types in your document LOCATION = <database>.<collection> Need to dive in EXTERNAL TABLE compatibility (Ex. Row Level Security)
SQL Server 2019 Big Data Clusters and Polybase The Problem The Problem Customers want to integrate with Big Data and other data sources easily Polybase is not simple to setup, configure, maintain, and scale elastically Customers may not have a Hadoop cluster or want to build one Polybase classic (MapReduce) could be better The Solution The Solution Kubernetes and containers to deploy and scale elastically Everything pre-installed including HDFS cluster Build a control plane to help manage and monitor Enhance SQL Server to read from HDFS natively Provide a data mart for cached results Introduce Spark and Notebooks for Data scientists
Polybase in SQL Server 2019 Big Data Clusters mpdwsvc.exe uses SQLPAL Control Plane Controller Svc Azure FSM Engine Kibana Grafana Custom apps BI Analytics Elastic Search InfluxDB Configuration Store (SQL Server) Polybase head node In Linux container Cluster Polybase compute nodes In Linux containers SQL Server master instance SQL External data sources Compute pool Compute pool Compute pool Directly read from HDFS SQL Compute Node SQL Compute Node SQL Compute Node SQL Compute Node SQL Compute Node Data mart Storage pool Built-in Data Sources SQL Data Node SQL Data Node SQL Server SQL Server SQL Server Spark Spark Spark MapReduce Not used HDFS Data Node HDFS Data Node HDFS Data Node Storage Storage Kubernetes pod IoT data Node Node Node Node Node Node Node Persistent storage
Storage and Data Pools Data Sources CREATE EXTERNAL DATA SOURCE SqlStoragePool WITH (LOCATION = 'sqlhdfs 'sqlhdfs://service controller:8080 controller:8080 ); SqlStoragePool service- -mssql mssql- - CREATE EXTERNAL DATA SOURCE SqlDataPool WITH (LOCATION = 'sqldatapool 'sqldatapool://service controller:8080/datapools/default controller:8080/datapools/default ); SqlDataPool service- -mssql mssql- - Preinstalled in model of Master Instance mssql-controller REST endpoint CREATE EXTERNAL TABLE WITH ( DATA_SOURCE = SqlStoragePool LOCATION = '/clickstream_data', FILE_FORMAT = csv_file ); CREATE EXTERNAL TABLE WITH ( DATA_SOURCE = SqlDataPool DISTRIBUTION = ROUND_ROBIN ); SqlDataPool, SqlStoragePool,
Azure Data Studio and Data Virtualization Using PROSE for intelligent import and schema detection Code generation in Notebooks Import wizards External Table detection for HDFS
Session takeaways Polybase = Data Virtualization = Reduced Need for ETL Polybase provides distributed read scale performance Big Data Clusters automate the deployment of Polybase Download and try it yourself Sign up for EAP for SQL Server 2019 Big Data Clusters
https://aka.ms/sqlworkshops https://aka.ms/bobsql https://aka.ms/bobwardms https://aka.ms/bobsqldemos Session resources SQL Server 2019 Polybase documentation SQL Server Big Data Clusters documentation Polybase demos on GitHub Azure SQL Database Elastic Query documentation Loading data into Azure SQL Data Warehouse with Polybase