Efficient Spark ETL on Hadoop: SETL Approach

An overview of how SETL offers an efficient approach to Spark ETL on Hadoop, focusing on reducing memory footprint, managing output file sizes, and using low-level file-format APIs. With significant performance improvements, including an 83% reduction in task hours and an 87% reduction in file count, SETL streamlines common Hadoop ETL workloads: processing, transformation, cleansing, and generating detailed exception reports.


Presentation Transcript


  1. SETL: Efficient Spark ETL on Hadoop  www.aqfer.com

  2. Agenda
     - Context
     - Idea 1: Reducing memory footprint
     - Idea 2: File size management
     - Idea 3: Low-level file-format APIs
     - Performance numbers: 83% reduction in task hours, 87% reduction in file count

  3. Context: A common Hadoop ETL workload
     - Input: log data from web servers, beacons, IoT devices, and so on
     - Processing: transformation into a fixed target schema; cleansing
     - Output: clean granular data, clean aggregations, and an exception report of rows that could not be imported

  4. Running example for this talk
     - Input: beacon records imported once an hour
     - Granular data: Avro output for export to external systems; Parquet output for the AWS Athena query engine; a 45-column schema with 3 map columns, one array column, and the rest scalar
     - Aggregate data: aggregated on date, hour, country, event-type, and two other fields, and combinations thereof; written as CSV files
     - Exception report: CSV files containing the error message and the erroneous row
     - Size: 1 billion events per day; roughly 80 GB of Avro and 60 GB of Parquet per day

  5. Traditional approach using Spark
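In a traditional pipeline, each output is produced by a separate Spark action, so the cleaned dataset is either cached in full or rescanned for every write. A rough illustrative sketch follows; the paths, column names, and cleansing filter are assumptions, and the spark-avro module is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of a traditional multi-action ETL job.
val spark = SparkSession.builder.appName("traditional-etl").getOrCreate()
val raw = spark.read.json("hdfs:///logs/beacons/2024/10/09/00/")   // placeholder input path

val clean = raw.filter("eventType is not null")                    // illustrative cleansing step
clean.cache()                                                       // must hold the peak input size

clean.write.format("avro").save("hdfs:///out/granular-avro/")       // action 1: granular Avro
clean.write.parquet("hdfs:///out/granular-parquet/")                // action 2: granular Parquet
clean.groupBy("date", "hour", "country", "eventType")
  .count()
  .write.csv("hdfs:///out/aggregates/")                             // action 3: aggregates
```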

  6. Problems with the traditional approach
     - Memory requirement is proportional to the input size
     - If the number of containers is too small, data is serialized and deserialized in each stage
     - Must provision for the peak input size, which varies with the hour of the day and/or the day of the week
     - If a large backlog needs to be processed, the job may fail with out-of-memory errors
     - Multiple scans of the entire dataset lead to inefficiency

  7. SETL: Save output as you scan

  8. Benefits
     - Granular data is scanned just once; output is saved as a side-effect of processing rather than as a separate action on the RDD
     - Only a small amount of data is kept in memory: memory requirements grow sublinearly with input size, depending on the cardinality of the aggregation dimensions
     - Common objection: side-effects are not resilient against task failures. Not in our case: if a task is rerun, its outputs are simply overwritten (see the sketch below)
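A minimal sketch of the save-as-you-scan idea, not SETL's actual code: each partition is scanned once, granular rows are written out as a side-effect, and only the small aggregate map is returned to the driver. The Event type, the SideEffectWriter stand-in, and the local output path are assumptions for illustration; a real job would stream Avro/Parquet to HDFS or S3 instead.

```scala
case class Event(date: String, hour: Int, country: String, eventType: String, value: Long)

// Stand-in for the real low-level Avro/Parquet writer.
class SideEffectWriter(path: String) {
  private val out = new java.io.PrintWriter(path)
  def append(e: Event): Unit =
    out.println(s"${e.date},${e.hour},${e.country},${e.eventType},${e.value}")
  def close(): Unit = out.close()
}

def processPartition(idx: Int, events: Iterator[Event])
    : Iterator[((String, Int, String, String), Long)] = {
  val aggregates = scala.collection.mutable.Map.empty[(String, Int, String, String), Long]
  // Rerunning a failed task overwrites the same file, keeping the side-effect idempotent.
  val writer = new SideEffectWriter(s"/tmp/granular-part-$idx.csv")
  try {
    events.foreach { e =>
      writer.append(e)                                   // side-effect: granular output
      val key = (e.date, e.hour, e.country, e.eventType)
      aggregates(key) = aggregates.getOrElse(key, 0L) + e.value
    }
  } finally writer.close()
  aggregates.iterator                                    // only the small aggregates stay in memory
}

// Usage: a single scan produces both the granular files and the aggregates.
// val aggregates = eventsRdd.mapPartitionsWithIndex(processPartition _).reduceByKey(_ + _)
```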

  9. Limitations
     - Limited benefit for non-additive aggregates, e.g. count-distinct, median, and other percentiles
     - Workaround: approximate algorithms, e.g. HyperLogLog for count-distinct and the Munro-Paterson algorithm for quantiles (see the sketch below)
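As one way to apply that workaround, the sketch below uses Spark SQL's built-in approximations: approx_count_distinct (backed by HyperLogLog++) and percentile_approx. The column names and paths are placeholders, not part of the running example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approx_count_distinct, expr}

val spark = SparkSession.builder.appName("approx-aggregates").getOrCreate()
val events = spark.read.parquet("s3://bucket/clean-granular/")        // placeholder path

val approx = events.groupBy("country").agg(
  approx_count_distinct("userId", 0.01).as("uniques"),                // HyperLogLog++, ~1% relative error
  expr("percentile_approx(latencyMs, 0.5)").as("median_latency")      // approximate median
)
approx.write.csv("s3://bucket/approx-aggregates/")
```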

  10. Problem: Hard to control the size of output files
      - Desired: the programmer specifies the file size, and the input size determines the number of files
      - Spark: the programmer specifies the number of files, and the input size determines their sizes

  11. Manage file sizes by moving blocks
      - Avro, ORC, and Parquet files consist of large relocatable blocks
      - Such files can be merged into larger files without looking inside the blocks (see the Avro sketch below)
      - CSV and JSON files are wholly relocatable: simply concatenating them produces larger files
      - The merge can run as a Spark job, and multiple file formats can be merged in a single Spark job
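A minimal sketch of block-level Avro merging under these assumptions: all inputs share a schema and compression codec, and local java.io.File paths stand in for the real HDFS/S3 locations. DataFileWriter.appendAllFrom with recompress = false copies whole data blocks without deserializing individual records.

```scala
import java.io.{File, FileInputStream}
import org.apache.avro.file.{DataFileReader, DataFileStream, DataFileWriter}
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}

def mergeAvro(inputs: Seq[File], output: File): Unit = {
  require(inputs.nonEmpty, "need at least one input file")
  // Take the schema from the first input; all inputs are assumed to match it.
  val first = new DataFileReader[GenericRecord](inputs.head, new GenericDatumReader[GenericRecord]())
  val schema = first.getSchema
  first.close()

  val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  writer.create(schema, output)
  inputs.foreach { in =>
    val stream = new DataFileStream[GenericRecord](new FileInputStream(in), new GenericDatumReader[GenericRecord]())
    writer.appendAllFrom(stream, false)   // relocate blocks as-is, no per-record decode
    stream.close()
  }
  writer.close()
}
```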

  12. Example: Parquet file merge operation
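A comparable sketch for Parquet, using the low-level ParquetFileWriter.appendFile call, which copies whole row groups without decoding records. It assumes all inputs share one schema; paths are placeholders, and exact constructors and deprecations vary across parquet-mr versions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.hadoop.util.HadoopInputFile

def mergeParquet(inputs: Seq[Path], output: Path, conf: Configuration): Unit = {
  // Read the schema from the first footer; all inputs are assumed to match it.
  val reader = ParquetFileReader.open(HadoopInputFile.fromPath(inputs.head, conf))
  val schema = reader.getFooter.getFileMetaData.getSchema
  reader.close()

  val writer = new ParquetFileWriter(conf, schema, output, ParquetFileWriter.Mode.CREATE)
  writer.start()
  inputs.foreach(in => writer.appendFile(HadoopInputFile.fromPath(in, conf)))  // copies row groups verbatim
  writer.end(java.util.Collections.emptyMap[String, String]())
}
```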

  13. Problem: High-level APIs are inefficient
      - Multiple layers, designed for generality rather than performance
      - Every layer walks the schema object once per record
      - Boxing and unboxing cause performance problems too
      - A large number of intermediate objects are created and destroyed
      - Using the lowest-level API to save the contents gives up to 10x write performance

  14. Solution: Use custom writers
      - What about maintenance? Every time the schema changes, do you end up coding new writers?
      - An improved Avro compiler generates the writers for a given schema
      - An Avro schema, being an LL(1) grammar, lends itself to robust writer generation, e.g. a Parquet writer for a given Avro schema (see the sketch below)
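As a rough illustration of what a schema-specific writer looks like, the sketch below hand-codes an Avro DatumWriter that encodes fields directly in schema order, with no GenericRecord objects and no per-record schema walk. The Event record and its three-field schema are assumptions for illustration; SETL's actual generated writers may differ.

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.io.{DatumWriter, Encoder}

case class Event(ts: Long, country: String, eventType: String)

// The schema must match the field order written by the custom writer below.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |  {"name":"ts","type":"long"},
    |  {"name":"country","type":"string"},
    |  {"name":"eventType","type":"string"}]}""".stripMargin)

// Schema-specific writer: fields are encoded directly, with no intermediate records.
class EventDatumWriter extends DatumWriter[Event] {
  override def setSchema(s: Schema): Unit = ()   // schema is fixed when the writer is generated
  override def write(e: Event, out: Encoder): Unit = {
    out.writeLong(e.ts)
    out.writeString(e.country)
    out.writeString(e.eventType)
  }
}

// Usage: the container-file layer (blocks, sync markers, compression) is unchanged.
// val w = new DataFileWriter[Event](new EventDatumWriter).create(schema, new File("events.avro"))
// w.append(Event(0L, "US", "click")); w.close()
```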

  15. Example: Layers of Avro file format for Spark

  16. SETL Performance
      Configurations: A = Traditional, B = Traditional w/ low-level IO,
      C = Traditional w/ low-level IO + file-merge, D = Full SETL

      Metric                                       A        B        C        D
      Executor memory seconds (GB-seconds)         198693   149798   162384   33259
      % reduction vs. A                            -        25%      18%      83%
      Output files per format per partition        100      100      13       13
      Number of executors                          70       70       70       20
      Memory per executor (MB)                     5120     5120     5120     900
      Cores per executor                           2        2        2        2
      Elapsed time (seconds)                       731      555      600      547
      Vcore-seconds                                56754    42775    46376    22155

  17. Conclusion
      - The new approach to the ETL job makes efficient use of CPU and memory
      - Container sizes become predictable; only the duration of the job varies with the input
      - Output files reach the desired size without affecting other tuning parameters
