Overview of Delta Lake, Apache Spark, and Databricks Pricing
Delta Lake is an open-source storage layer that enables ACID transactions in big data workloads. Apache Spark is a unified analytics engine supporting various libraries for large-scale data processing. Databricks offers a pricing model based on DBUs, providing support for AWS and Microsoft Azure. Explore more about these technologies for reliable data lakes and efficient data processing.
Presentation Transcript
Delta Lake Luca Gagliardelli luca.gagliardelli@unimore.it www.dbgroup.unimore.it
Apache Spark Apache Spark (spark.apache.org) is a unified analytics engine for large-scale data processing; Commercialized by Databricks since 2013; Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. It can combine these libraries seamlessly in the same application. It supports many languages: Java, Scala, Python, R, and SQL. 2
Databricks Pricing Prices are per DBU. A Databricks Unit (DBU) is a unit of processing capability per hour, billed on per-second usage. Databricks supports many AWS EC2 instance types; the larger the instance, the more DBUs you consume on an hourly basis. For example, 1 DBU is the equivalent of Databricks running on a c4.2xlarge machine for an hour. Databricks provides support for Amazon Web Services and for Microsoft Azure. 3
Databricks Pricing AWS 1 DBU is the equivalent of Databricks running on a c4.2xlarge machine for an hour (4 vCPU, 15GB of RAM) https://databricks.com/product/aws-pricing 4
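To make the pricing model concrete, here is a minimal Scala sketch of the arithmetic: DBUs consumed scale with the instance's DBU rating, the cluster size, and the running time. The DBU rate and the price per DBU below are assumed placeholder values, not actual Databricks list prices.

```scala
// Illustrative DBU cost estimate; the rates below are placeholders, not real prices.
object DbuCostExample extends App {
  val dbuPerInstanceHour = 1.0   // e.g. one c4.2xlarge ~ 1 DBU per hour
  val pricePerDbu        = 0.40  // assumed $/DBU for a given plan (placeholder)
  val instances          = 8     // cluster size
  val hours              = 2.5   // billing is per second; expressed here in hours

  val dbusConsumed = dbuPerInstanceHour * instances * hours
  val dbuCost      = dbusConsumed * pricePerDbu // cloud VM cost is billed separately

  println(f"DBUs consumed: $dbusConsumed%.1f, DBU cost: $$$dbuCost%.2f")
}
```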
Databricks Pricing MS Azure https://azure.microsoft.com/en-us/pricing/details/databricks 5
Lambda architecture Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. Main drawbacks: it is hard to set up and maintain; data validation is difficult; updating data in the data lake is hard; usually no isolation is provided. 6
Reliable Data Lakes at Scale Delta Lake (www.delta.io) is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Developed by Databricks in 2017, it has been open source since 2019; it is able to process petabytes of data. https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html 7
Delta Architecture Delta Lake tables can handle a continuous flow of data, allowing the quality of the data to be improved incrementally until it is ready for consumption. The slide diagram shows events flowing through tables of increasing data quality: Bronze (raw ingestion), Silver (filtering, cleaning, augmentation), and Gold (business-level aggregates), which feed streaming analytics and AI & reporting. https://www.youtube.com/watch?v=FePv0lro0z8 8
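A minimal Scala sketch of this bronze/silver/gold flow, assuming a local Spark session with the Delta Lake library on the classpath; the paths, column names (eventId, eventType, timestamp) and transformations are hypothetical, chosen only to illustrate the pattern.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Delta-enabled SparkSession (requires the Delta Lake jars on the classpath).
val spark = SparkSession.builder()
  .appName("delta-medallion-sketch").master("local[*]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Bronze: raw events ingested as-is (paths and schema are hypothetical).
val bronze = spark.read.json("/data/raw/events")
bronze.write.format("delta").mode("append").save("/delta/events_bronze")

// Silver: filtered, cleaned, augmented records.
spark.read.format("delta").load("/delta/events_bronze")
  .filter(col("eventType").isNotNull)
  .dropDuplicates("eventId")
  .withColumn("ingestDate", to_date(col("timestamp")))
  .write.format("delta").mode("overwrite").save("/delta/events_silver")

// Gold: business-level aggregates ready for reporting.
spark.read.format("delta").load("/delta/events_silver")
  .groupBy("ingestDate", "eventType").count()
  .write.format("delta").mode("overwrite").save("/delta/events_gold")
```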
Storage A Delta Lake table is a directory on the filesystem; data are stored in Apache Parquet format, while logs are stored in JSON. Delta Lake supports concurrent reads and writes from multiple clusters, and it supports many storage systems: HDFS (the default implementation), Amazon S3, and Microsoft Azure Storage. https://docs.delta.io/0.7.0/delta-storage.html 9
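A small sketch of the storage layout, reusing a Delta-enabled session configured as in the previous example; the path and sample rows are made up. Writing a DataFrame in delta format produces Parquet data files plus a _delta_log directory of JSON commit files.

```scala
import org.apache.spark.sql.SparkSession
import java.io.File

// Reuses the Delta-enabled session from the earlier sketch, if already created.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Write a tiny Delta table to a local path (hypothetical).
Seq((1, "alice"), (2, "bob")).toDF("id", "name")
  .write.format("delta").mode("overwrite").save("/tmp/delta/users")

// The table directory now contains Parquet data files
// and a _delta_log/ directory with JSON transaction-log entries.
new File("/tmp/delta/users").listFiles().foreach(f => println(f.getName))
new File("/tmp/delta/users/_delta_log").listFiles().foreach(f => println(f.getName))
```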
Apache Parquet Parquet (parquet.apache.org) is a columnar storage format available to any project in the Hadoop ecosystem. Benefits of columnar storage: column-wise compression is efficient and saves storage space; compression techniques specific to a type can be applied, as the column values tend to be of the same type; queries that fetch specific column values need not read the entire row data, thus improving performance; different encoding techniques can be applied to different columns; several metadata fields are stored for each column. 10
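A brief sketch of the column-pruning benefit with plain Spark and Parquet; the file path and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Write a small Parquet file (columns are hypothetical).
Seq(("2020-01-01", "login", 12L), ("2020-01-02", "logout", 7L))
  .toDF("date", "eventType", "count")
  .write.mode("overwrite").parquet("/tmp/parquet/events")

// Because Parquet is columnar, this query only reads the "eventType"
// column chunks from disk instead of whole rows (column pruning).
spark.read.parquet("/tmp/parquet/events")
  .select("eventType")
  .distinct()
  .show()
```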
Key Features Open Format: all data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet. Unified Batch and Streaming: a table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box. 100% Compatible with Apache Spark: developers can use Delta Lake with their existing data pipelines with minimal change, as it is fully compatible with Spark. 11
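A sketch of the same Delta table path being used as a streaming source, a streaming sink, and a batch table; the paths and checkpoint location are assumptions, and a Delta-enabled SparkSession as configured earlier is presumed.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the Delta-enabled session from the earlier sketch, if already created.
val spark = SparkSession.builder().getOrCreate()

// The bronze table acts as a streaming source...
val incoming = spark.readStream.format("delta").load("/delta/events_bronze")

// ...the silver table acts as a streaming sink...
incoming.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/delta/_checkpoints/events_silver")
  .start("/delta/events_silver")

// ...and the same silver table can still be queried as a plain batch table at any time.
spark.read.format("delta").load("/delta/events_silver").count()
```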
Key Features ACID Transactions: Delta Lake provides serializability, the strongest isolation level. Scalable Metadata Handling: Delta Lake treats metadata (schema, name, partitioning) just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease. Data Versioning: Delta Lake provides snapshots of data, enabling developers to access and revert to earlier versions of data for audits, rollbacks, or to reproduce experiments. Audit History: the Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes. These features are implemented through the Delta Lake transaction log. https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html 12
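A short sketch of the versioning and audit features using the Delta Lake history() and versionAsOf APIs; the table path is hypothetical and a Delta-enabled session is assumed.

```scala
import org.apache.spark.sql.SparkSession
import io.delta.tables.DeltaTable

// Reuses the Delta-enabled session from the earlier sketch, if already created.
val spark = SparkSession.builder().getOrCreate()

// Audit history: every commit recorded in the transaction log.
DeltaTable.forPath(spark, "/delta/events_silver")
  .history()                     // DataFrame of versions, timestamps, operations
  .select("version", "timestamp", "operation")
  .show(truncate = false)

// Time travel: read the table as it was at an earlier version.
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/delta/events_silver")
v0.show()
```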
Key Features Schema Validation: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption. Schema validation ensures data quality by rejecting writes to a table that do not match the table's schema. Schema Evolution: big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL. Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that is changing over time. Most commonly, it's used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns. https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html 13
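A sketch of schema enforcement and evolution on the small users table from the storage example: an append with an extra column is rejected by default, and succeeds once mergeSchema is enabled. The table path and column names are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the Delta-enabled session from the earlier sketch, if already created.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// The existing table has columns (id, name); the new batch adds an extra column.
val newBatch = Seq((3, "carol", "it")).toDF("id", "name", "department")

// Schema enforcement: this append is rejected because the schema does not match.
// newBatch.write.format("delta").mode("append").save("/tmp/delta/users")

// Schema evolution: opting in with mergeSchema adds the new column automatically.
newBatch.write.format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save("/tmp/delta/users")
```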
Key Features Updates and Deletes Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture. Because Delta Lake adds a transactional layer that provides structured data management on top of your data lake, it can dramatically simplify and speed up your ability to locate and remove personal information in response to consumer GDPR or CCPA requests. https://docs.databricks.com/security/privacy/gdpr-delta.html 14
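A sketch of the delete / update / merge APIs via io.delta.tables.DeltaTable in Scala, in the spirit of a GDPR/CCPA erasure or a change-data-capture upsert; the table path, ids, and conditions are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import io.delta.tables.DeltaTable

// Reuses the Delta-enabled session from the earlier sketch, if already created.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val users = DeltaTable.forPath(spark, "/tmp/delta/users")

// Delete: e.g. remove a user's records for a GDPR/CCPA erasure request.
users.delete("id = 2")

// Update: rewrite values in place for rows matching a condition.
users.update(col("name") === "alice", Map("name" -> lit("alice-renamed")))

// Merge (upsert): apply a batch of changes, as in change data capture.
val changes = Seq((1, "alice2", "eng"), (4, "dave", "hr")).toDF("id", "name", "department")
users.as("t")
  .merge(changes.as("s"), "t.id = s.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```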
Organizations using and contributing to Delta Lake 15
Use Cases ETL: Delta Lake is used to store raw data and to clean and transform it through ETL pipelines. Business Intelligence: Delta Lake can be used to run ad-hoc queries, directly or through BI software such as Tableau. Security Information and Event Management: a company logs a wide range of computer system events (TCP/UDP flows, authentication requests, SSH logins, etc.); hundreds of terabytes are logged per day and kept for forensic analysis, resulting in a multi-petabyte dataset; multiple ETL, SQL, analytics, and machine learning jobs are run against these tables to search for known intrusion patterns. Armbrust, Michael, et al. "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores." VLDB (2020). 16
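A hypothetical ad-hoc query in the spirit of the SIEM use case: the Delta table path and columns (sourceIp, eventType, outcome, eventTime) are invented for illustration, and a Delta-enabled session is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the Delta-enabled session from the earlier sketch, if already created.
val spark = SparkSession.builder().getOrCreate()

// Expose a Delta table of security events to SQL (path and columns are hypothetical).
spark.read.format("delta").load("/delta/security_events")
  .createOrReplaceTempView("security_events")

// Ad-hoc query: count failed SSH logins per source IP over the last day.
spark.sql("""
  SELECT sourceIp, COUNT(*) AS failedLogins
  FROM security_events
  WHERE eventType = 'ssh_login' AND outcome = 'failure'
    AND eventTime >= current_timestamp() - INTERVAL 1 DAY
  GROUP BY sourceIp
  ORDER BY failedLogins DESC
""").show()
```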