Data Lakes and Cloud Analytics in Research

 
Data Lakes
The Non-CERN Perspective?
 
Towards terascale-in-the-cloud?
J Jensen, STFC(?)
10 Apr 2018
 
Exercise
 
Analyse data in the cloud?
Towards the 100 GB – 1 TB range
Old database paradigm:
“Just drop it in”
Mostly using my own Azure
account, and spare time
And work done during phone meetings
:-)
Split personality (= work/personal) in
portals works well
Traditional Data Workflow
(Workflow diagram: find data → ingest into platform → tidy data → reformat data → EDA → analysis → hypothesis testing → results → publish, annotate, archive, share.)
"Traditional" analysis
 
R, Mathematica, Matlab, Excel, Jupyter
notebooks
Locally, or in cloud
ALL DATA IS LOADED INTO MEMORY
If your data is too big, sample, and analyse
the sample
Centuries of good (and bad) practice
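In Python terms, this traditional pattern might look like the short sketch below; pandas is assumed, and "sample.csv" is a placeholder for a sample small enough to fit comfortably in RAM:

# Sketch of the traditional in-memory pattern: load a manageable sample
# and explore it interactively. "sample.csv" is a placeholder filename.
import pandas as pd

df = pd.read_csv("sample.csv")   # the whole sample is pulled into memory here
print(df.describe())             # quick exploratory summary of the numeric columns
print(df.head())                 # eyeball the first few rows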
 
Limits of In-Memory
Processing
 
STFC’s cloud
Default: 8 GB max per machine, 20 GB/person
Typ. 90 GB, ~300 GB if you ask v. nicely
Azure
432 GB in some regions
256 GB in Northern Europe
Alibaba cloud
"Up to 768 GB"
 
Suppose you want to analyse all your data... (or just a large dataset)
 
Hadoop
Requires a degree in Astrology
Made easier with Pig, Hive, etc.
Load it into a database
Data Lake Analytics?!
What's a Data Lake?
 
… Hadoop (Wikipedia)
Linked data centres with HSM (CERN)
A sort of sophisticated row-based database
(Microsoft)
"Fake News!" (some bloke on the Internet)
(See Sam’s post to storage group)
Getting Started (Azure)
 
Analysis language: U-SQL, a hybrid of SQL and C#
Assumes data is row-based
And text (CSV, TSV, …) … !!
However, it understands .gz and transparently
uncompresses
Auto-parallelisation
You decide # parallel tasks
Max 250 but it can be increased of course
@timestring = EXTRACT AccIndex string,
                      longitude float, latitude float, severity int?,
                      Date string, Time string,
                      SpeedLimit int?, light int?, weather int?
              FROM "/data/acc-restricted.csv.gz"
              USING Extractors.Csv(nullEscape: "-1",
                                   rowDelimiter: "\n",
                                   encoding: Encoding.[ASCII],
                                   skipFirstNRows: 1);

@dates = SELECT DateTime.ParseExact( Date + Time, "dd/MM/yyyyHH:mm",
                System.Globalization.CultureInfo.InvariantCulture ) AS date
         FROM @timestring;

@hours = SELECT date.ToString( "HH", DateTimeFormatInfo.InvariantInfo ) AS hour
         FROM @dates;

@table = SELECT hour, COUNT(hour) AS HourCount FROM @hours GROUP BY hour;

OUTPUT @table TO "/output/table.csv" USING Outputters.Csv();
Simple example U-SQL (one of my earlier experiments)
Callouts from the slide, annotating the example above:
Import: column name and type (MUST import ALL!!)
A rowset is a listy type (rows)
Case sensitive
Partial types!? Nullable columns are declared int?
Reads from the default lake store; OUTPUT writes back to the default lake store
Doesn't understand headers (hence skipFirstNRows: 1)
Use .NET (in C#) for expressions such as DateTime.ParseExact
 
Getting Started (Azure)
 
Developing - Azure portal or Visual Studio
VS lets you run stuff locally, and gives syntax
highlighting – portal doesn't
VS lets you save stuff!
Portal saves your job 
when it is (attempted) run
Keep reusing old jobs – works well for linear and incremental
development, but takes a bit getting used to
Safe to discard edits and back up to job list
Need to submit the job to find out whether it is correct and works
 
Data – it should be obvious, but...
 
Put your analytics where your data is
Same cloud region
Don't run your analysis on the full dataset
Create a (representative) sample
Analyse it in a tool you know well (R, Excel,…)
Debug your script; when it finally runs, compare the outputs (see the sketch after this list)
Debug runs with lowest number of parallel jobs
(Optionally) Develop/build locally, then run in cloud
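As a concrete illustration of the "compare the outputs" step, here is a minimal pandas sketch that redoes the hour-count from the U-SQL example locally and checks it against the downloaded cloud output. It assumes the cloud job ran over the same sample, that the sample's header names the columns Date and Time as in the EXTRACT, and that Outputters.Csv() kept its default of writing no header row; "sample.csv" and "table.csv" are placeholder local filenames:

# Sketch: reproduce the hour-count locally on the sample and compare with
# the cloud job's output. Filenames and column names are assumptions.
import pandas as pd

sample = pd.read_csv("sample.csv", dtype=str).dropna(subset=["Date", "Time"])

# Mirror the U-SQL logic: combine Date and Time, parse, keep the hour, count per hour
# (the U-SQL concatenated them directly; a space is inserted here for readability)
stamps = pd.to_datetime(sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H:%M")
local = stamps.dt.strftime("%H").value_counts().sort_index()

# The default outputter writes no header row, so supply the column names here
cloud = pd.read_csv("table.csv", header=None, names=["hour", "HourCount"], dtype={"hour": str})
cloud = cloud.set_index("hour")["HourCount"].sort_index()

print(pd.DataFrame({"local": local, "cloud": cloud}))
print("match:", local.astype(int).equals(cloud.astype(int)))

If the counts disagree, the usual suspects are the null escape, the header row, and the date parsing rather than the aggregation itself.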
Updated workflow
(Workflow diagram: subset/sample → tidy data → local analysis → ingest sample into cloud → develop/test → scale to larger data → download results → compare → delete cloud resources.)
 
Data – sampling
 
Your samples should contain all the weirdness of your full data
This is tricky for sparse datasets, where rare oddities can easily be missed by a sample
Easy to sample CSV and TSV (… awk, head; see the sketch below)
(Illustration: a small sampled column of values, including oddities such as "1." and "-".)
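For completeness, a minimal Python sketch of the same idea, reading the (possibly gzipped) file line by line so nothing large is held in memory; the filenames and the 1% rate are illustrative:

# Sketch: draw a random ~1% sample of a large (optionally gzipped) CSV without
# loading it into memory, keeping the header row. As noted above, rare oddities
# may still need to be added to the sample by hand.
import gzip
import random

def sample_csv(path, out_path, fraction=0.01, seed=42):
    rng = random.Random(seed)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as src, open(out_path, "w") as dst:
        dst.write(next(src))                 # copy the header row unchanged
        for line in src:
            if rng.random() < fraction:
                dst.write(line)

sample_csv("acc-restricted.csv.gz", "sample.csv")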
 
Data – the rules of the game
 
Assume we cannot egress cloud-generated data
Until we have the final results
Download/re-upload editing is not feasible
However, data from outside the cloud can be re-ingested
Thus, edit cloud-generated data in the cloud?
 
Editing in the Cloud
(Diagram: ADLA, ADLS, a blob store, and a VM.)
 
Editing in the Cloud
(Diagram: ADLA, ADLS, a blob store, a VM with a FUSE-mounted blob store, a "Data Factory", Azure AD, and an Azure AD authenticator.)
 
Editing in the Cloud – Resource Groups
(Diagram: ADLA, ADLS, blob store, VM, and "Data Factory", split across two resource groups, RG1 and RG2.)
Data Factory
 
Like a data workflow orchestrator
Insists on understanding the file as it
passes through(!?)
Understands CSV but not gzipped CSV(!)
Had trouble with the ADLS permissions
Not fw
Not at-rest encryption
Caveat: you can attach (expensive!) things that are not deleted along with the resource that contains the factory!
 
Big data analysis workflow
 
"Workflow" in Azure-ese is "Data Factory"
Not currently available in Northern Europe
Can move stuff from A to B in Northern Europe
even if it sits in Western Europe
Can also move from 3rd party (e.g. desktop, S3
endpoint, etc.)
 
Down The Rabbit Hole
 
Build Workflows, automate
Do create subdirs/containers
(e.g. input, output)
The rabbit holes: you can spend a lot of time debugging or investigating
E.g. at-rest encryption, access
perms, Azure AD
Feels like spending time on
“unrelated” tasks
Sometimes it's best to start over
with your lessons learnt
Eventually need to get to the
“science bit”
 
Giant Steps
 
Services that “understand” science data?
Lots of science domains with not much in common
Except of course everybody uses HDF5 except HEP
Deploy VMs/containers in the cloud
Interfaces to science data
Science apps in market place
ROOT is in Debian (and so is Globus…)
 
Data Scientist
 
As before need platform expertise
Only more so (terascale = extra complexity)
Azure docs pretty good – tuts, refs.
In addition to viz., stats, maths, CS, domain,
comms. [Schutt & O’Neil]
Danger!
The infrastructure scales
But the maths doesn't (necessarily)
 
var(X) = Σ(X − X̄)² / N          (two-pass: subtract the mean, then sum the squares)
var(X) = (ΣX² − N·X̄²) / N        (one-pass: accumulate ΣX and ΣX² in a single scan)
ΣX² ... the running sum of squares becomes enormous at terascale, and subtracting two nearly equal, very large numbers destroys precision (the result can even come out negative).
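The one-pass form is attractive at scale because it needs only a single scan over the data, but it is exactly the one that misbehaves numerically. A minimal sketch of the effect in plain Python (the 1e9 offset is just an illustrative way of making the mean large relative to the spread):

# Sketch: the one-pass variance formula loses precision when the mean is large
# relative to the spread, while the two-pass formula stays accurate.
import random
import statistics

random.seed(1)
n = 1_000_000
xs = [1e9 + random.gauss(0, 1) for _ in range(n)]   # true variance is about 1

mean = sum(xs) / n
one_pass = (sum(x * x for x in xs) - n * mean * mean) / n   # (sum of squares - N*mean^2) / N
two_pass = sum((x - mean) ** 2 for x in xs) / n             # sum of squared deviations / N

print("one-pass :", one_pass)                 # often wildly wrong, can even be negative
print("two-pass :", two_pass)                 # close to 1
print("reference:", statistics.pvariance(xs))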
 
Science Data
 
CSV and TSV are, of course, rather non-optimal for science
Lots of impl. variations
No consistent NULL
Escape characters
Larger than binary encodings
Variable record size
Need an Extractor for HDF5?
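Until such an extractor exists, one pragmatic workaround is to flatten HDF5 into the text formats ADLA already understands before ingest. A minimal sketch, assuming h5py and pandas are available and that the file holds a simple two-dimensional, table-like dataset (the file and dataset names are made up):

# Sketch: flatten a simple table-like HDF5 dataset to CSV so the stock
# Extractors.Csv() can read it. File, dataset, and output names are assumptions.
import h5py
import pandas as pd

with h5py.File("measurements.h5", "r") as f:
    data = f["observations"][:]      # read the dataset into a NumPy array
                                     # (for terascale files you would read in chunks)

pd.DataFrame(data).to_csv("observations.csv", index=False)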
 
Conclusion
 
Need platform expertise
The proof is very much in the pudding