Understanding Snowflake Data Warehousing Solution

Slide Note
Embed
Share

Data warehousing involves capturing data from diverse sources to support decision-making processes. However, traditional methods face challenges with schema-less and semi-structured data. Snowflake offers a cloud-based solution that separates storage and compute, allowing efficient processing of SQL and semi-structured data. Virtual Warehouses and a robust SQL execution engine further enhance performance and scalability.


Uploaded on Sep 16, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Interactive Queries in Data Warehouses Shreya

  2. What does Data Warehousing consist of? Capturing data from diverse sources to analyze later usually for business intelligence Usually housed on an enterprise mainframe server (moving towards cloud) User queries and analysis used to support management s decision making process

  3. What is the problem? Data now comes from less predictable sources like web applications, mobile devices, and sensor data. This data is often in a schema-less, semi structured format with self defined categories. Eg. Json, xml Defined data types were important for ETL (Extract, Transform, Load) pipelines requiring Hadoop or Spark. These are not efficient for data warehousing.

  4. Snowflake Overview Designed as a pay as you go service Support for both SQL and semi structured and schema less data (extensions to handle this). Pure software as a service solution Runs on AWS

  5. Snowflake Separates storage and compute Storage is through Amazon s S3 Tables are horizontally partitioned. Within each file columns or attributes are grouped together and compressed. Known as PAX or hybrid columnar Compute is provided through snowflake s share nothing engine To reduce work traffic between the two, compute nodes cache some table data on local disk

  6. Virtual Warehouses This is an abstraction for the users Virtual warehouses consist of clusters of EC2 (elastic compute). Each EC2 is a worker node for a VW. These can be created, destroyed, resized on demand Come in X Small to XX Large. Each worker node has a cache of table data on local disk Each query spawns a new worker process, which dies after its done. VWs have access to same shared tables

  7. SQL Execution Engine Characteristics Columnar: Storing column wise Vectorized: Avoids storing intermediate results. Data is processed in pipelined fashion. Pushbased: relational operators push results to downstream operators. Improves cache efficiency

  8. Shared Data Architecture The cloud service layer is multi tenant VW have their own worker nodes. This layer is shared nothing Data storage shared across VWs belonging to one data center.

  9. Search optimizations Given an analytical workload, Snowflake uses multi- version concurrency control (MVCC), which means a copy of every changed database object is preserved for some duration. Maintaining indices is expensive. Snowflake employs a min-max based pruning. Files have some metadata indicated attributes about the data, so that it may be pruned. small storage overhead does for semi structured data storing columns inside

  10. Technical Differentiators Pure software as a service Failure tolerance Allows variant (native SQL type), array (of values), and object (JavaScript like) Cloning and time travel Security

  11. Future Work Make Snowflake a full self service model, without developer involvement. If a query fails it is entirely rerun, which can be costly for a long query. Each worker node has a cache of table data that currently uses LRU, but the policy could be improved. Snowflake also doesn t handle the situation if an availability zone is unavailable. Currently it requires reallocating the query to another VW.

Related


More Related Content