Exploring Data Lakes and Cloud Analytics in Research
A non-CERN perspective on data lakes and cloud analytics, focusing on terascale data processing in the cloud. Covers traditional data workflows, analysis tools such as R and Jupyter notebooks, the limits of in-memory processing, Hadoop, data lakes, and Azure Data Lake Analytics with its U-SQL language, and the complexities and possibilities of analysing large datasets in the cloud.
Presentation Transcript
Data Lakes: The Non-CERN Perspective? Towards terascale-in-the-cloud?
J Jensen, STFC(?), 10 Apr 2018
Exercise
- Analyse data in the cloud? Towards the 100 GB to 1 TB range
- Old database paradigm: just drop it in
- Mostly using my own Azure account, and spare time (and work done during phone meetings :-) )
- Split personality (work/personal) in portals works well
Traditional Data Workflow
Stages (from the diagram): find data, ingest into platform, tidy data, reformat data, EDA, hypothesis testing, analysis, results, annotate, archive, publish, share
"Traditional" analysis R, Mathematica, Matlab, Excel, Jupyter notebooks Locally, or in cloud ALL DATA IS LOADED INTO MEMORY If your data is too big, sample, and analyse the sample Centuries of good (and bad) practice GridPP Storage and Data Management Group
Limits of In-Memory Processing
- STFC's cloud: default 8 GB max per machine, 20 GB/person; typically 90 GB, ~300 GB if you ask very nicely
- Azure: 432 GB in some regions, 256 GB in Northern Europe
- Alibaba cloud: "up to 768 GB"
Suppose you want to analyse all your data... (or just a large dataset)
- Hadoop: requires a degree in Astrology; made easier with Pig, Hive, etc.
- Load it into a database
- Data Lake Analytics?!
What's a Data Lake?
- Hadoop (Wikipedia)
- Linked data centres with HSM (CERN)
- A sort of sophisticated row-based database (Microsoft)
- "Fake News!" (some bloke on the Internet)
- (See Sam's post to the storage group)
Getting Started (Azure)
- Analysis language: U-SQL, a hybrid of SQL and C#
- Assumes data is row-based and text (CSV, TSV, ...)!!
- However, it understands .gz and transparently uncompresses
- Auto-parallelisation: you decide the number of parallel tasks; max 250, but it can be increased, of course
Simple example U-SQL (one of my earlier experiments)
- Listy type (rows); case sensitive
- Import: column name and type (MUST import ALL!!)
- Doesn't understand headers; partial types!? (nullable, e.g. int?)
- Reads from and writes to the default lake store
- Can call .NET (C#) methods, e.g. DateTime.ParseExact

    @timestring =
        EXTRACT AccIndex string,
                longitude float,
                latitude float,
                severity int?,
                Date string,
                Time string,
                SpeedLimit int?,
                light int?,
                weather int?
        FROM "/data/acc-restricted.csv.gz"
        USING Extractors.Csv(nullEscape: "-1", rowDelimiter: "\n",
                             encoding: Encoding.[ASCII], skipFirstNRows: 1);

    @dates =
        SELECT DateTime.ParseExact(Date + Time, "dd/MM/yyyyHH:mm",
                   System.Globalization.CultureInfo.InvariantCulture) AS date
        FROM @timestring;

    @hours =
        SELECT date.ToString("HH", DateTimeFormatInfo.InvariantInfo) AS hour
        FROM @dates;

    @table =
        SELECT hour, COUNT(hour) AS HourCount
        FROM @hours
        GROUP BY hour;

    OUTPUT @table
    TO "/output/table.csv"
    USING Outputters.Csv();
Getting Started (Azure)
- Developing: Azure portal or Visual Studio
- VS lets you run stuff locally and gives syntax highlighting; the portal doesn't
- VS lets you save stuff! The portal saves your job only when it is (attempted to be) run
- Keep reusing old jobs: works well for linear and incremental development, but takes a bit of getting used to
- Safe to discard edits and back up to the job list
- Need to submit the job to find out whether it is correct and works
Getting Started (Azure)
[image-only slides, presumably Azure portal screenshots]
Data: it should be obvious, but...
- Put your analytics where your data is (same cloud region)
- Don't run your analysis on the full dataset: create a (representative) sample
- Analyse it in a tool you know well (R, Excel, ...)
- Debug your script; when it finally runs, compare the outputs (see the sketch below)
- Debug runs with the lowest number of parallel jobs
- (Optionally) develop/build locally, then run in the cloud
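A minimal local cross-check of the hourly count produced by the U-SQL job above, assuming Python with pandas; the sample and output file names are illustrative, and it assumes the cloud output was written without a header row (the Outputters.Csv() default), not something stated on the slides:

```python
import pandas as pd

# Hypothetical local sample of the CSV that was analysed in the cloud.
sample = pd.read_csv("acc-restricted-sample.csv")

# Recreate the hour-of-day histogram that the U-SQL job writes to /output/table.csv.
timestamps = pd.to_datetime(sample["Date"] + sample["Time"], format="%d/%m/%Y%H:%M")
local_counts = timestamps.dt.strftime("%H").value_counts().sort_index()

# Cloud output downloaded from the lake store (assumed headerless).
cloud = pd.read_csv("table.csv", names=["hour", "HourCount"], dtype={"hour": str})
cloud_counts = cloud.set_index("hour")["HourCount"].sort_index()

# The distributions should broadly agree; they are only identical when the
# cloud job was run over the very same sample.
print(pd.concat({"local": local_counts, "cloud": cloud_counts}, axis=1))
```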
Updated workflow
Stages (from the diagram): ingest sample into cloud, tidy data, subset/sample, local analysis, scale to larger data, develop/test, download results, compare, delete the cloud copy
Data sampling
- Your samples should contain all the weirdness of your full data
- This is tricky for sparse datasets (make sure all your weirdness is present in the sample); see the sketch below
- Easy to sample CSV and TSV (awk, head)
[illustration: a small sparse-data matrix]
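A sketch of sampling a CSV while keeping the weirdness, assuming Python: keep every row whose value in a chosen sparse column is rare, plus a small random fraction of the rest. The file names, column name, and thresholds are placeholders, not from the slides:

```python
import csv
import random
from collections import Counter

SRC, DST = "full.csv", "sample.csv"   # placeholder file names
COLUMN, KEEP = "weather", 0.01        # hypothetical sparse column, ~1% baseline sample

random.seed(0)

# First pass: find values of COLUMN that are rare in the full data.
with open(SRC, newline="") as f:
    counts = Counter(row[COLUMN] for row in csv.DictReader(f))
rare = {value for value, n in counts.items() if n < 100}   # threshold is arbitrary

# Second pass: keep all rows with a rare value, plus a random fraction of the rest.
with open(SRC, newline="") as f, open(DST, "w", newline="") as out:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row[COLUMN] in rare or random.random() < KEEP:
            writer.writerow(row)
```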
Data: the rules of the game
- Assume we cannot egress cloud-generated data until we have the final results
- Download/re-upload editing is not feasible
- However, data from outside the cloud we can re-ingest
- Thus, edit cloud-generated data in the cloud?
Editing in the Cloud
[diagram: ADLA, a VM, blob store, ADLS]
Editing in the Cloud
[diagram: ADLA, a VM, blob store, ADLS; plus Azure AD, an Azure AD authenticator, a FUSE-mounted blob store, and Data Factory]
Editing in the Cloud: Resource Groups
[diagram: the same components split across resource groups RG1 and RG2]
Data Factory
- Like a data workflow orchestrator
- Insists on understanding the file as it passes through (!?)
- Understands CSV but not gzipped CSV (!)
- Had trouble with the ADLS permissions (not the firewall, not at-rest encryption)
- Caveat: you can attach (expensive!) things that do not get deleted with the resource that contains the factory!
Big data analysis workflow
- "Workflow" in Azure-ese is "Data Factory"
- Not currently available in Northern Europe
- Can move stuff from A to B in Northern Europe even if the factory itself sits in Western Europe
- Can also move data from third parties (e.g. a desktop, an S3 endpoint, etc.)
Down The Rabbit Hole
- Build workflows, automate
- Do create subdirectories/containers (e.g. input, output)
- The rabbit holes: you can spend a lot of time debugging or investigating, e.g. at-rest encryption, access permissions, Azure AD
- Feels like spending time on unrelated tasks
- Sometimes it's best to start over with your lessons learnt
- Eventually you need to get to the science bit
Giant Steps
- Services that understand science data?
- Lots of science domains with not much in common; except, of course, everybody uses HDF5, except HEP
- Deploy VMs/containers in the cloud: interfaces to science data, science apps in the marketplace
- ROOT is in Debian (and so is Globus...)
Data Scientist
- As before, need platform expertise, only more so (terascale = extra complexity)
- Azure docs are pretty good: tutorials, references
- In addition to visualisation, stats, maths, CS, domain knowledge, communication [Schutt & O'Neil]
Danger!
- The infrastructure scales, but the maths doesn't (necessarily)
- var(X) = Σ (Xi - mean(X))² / N
- var(X) = (Σ Xi²) / N - mean(X)²
- The second, single-pass form is algebraically equivalent but can lose essentially all precision at scale (see the sketch below)
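A sketch of the standard single-pass, numerically stable alternative (Welford's algorithm), in Python; purely illustrative, not from the slides:

```python
def online_variance(xs):
    """Welford's single-pass algorithm: stable variance without holding
    the data in memory. Returns the population variance."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses the updated mean
    return m2 / n if n else float("nan")


# The one-pass textbook form E[X²] - E[X]² can lose all precision
# when the mean is large relative to the spread:
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
naive = sum(x * x for x in data) / len(data) - (sum(data) / len(data)) ** 2
print(naive, online_variance(data))   # naive is wildly off; Welford gives 22.5
```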
Science Data
- CSV and TSV are, of course, rather non-optimal for science
- Lots of implementation variations; no consistent NULL; escape characters
- Larger than binary encodings; variable record size
- Need an Extractor for HDF5? (see the sketch below)
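Until such an extractor exists, one stopgap is to flatten HDF5 into CSV before ingest. A minimal sketch assuming Python with h5py and numpy; the file and dataset names are placeholders, and a two-dimensional numeric dataset is assumed:

```python
import csv

import h5py
import numpy as np

SRC, DST = "measurements.h5", "measurements.csv"   # placeholder file names
DATASET = "events"                                  # hypothetical 2-D dataset

with h5py.File(SRC, "r") as f:
    data = np.asarray(f[DATASET])   # load the dataset into memory

with open(DST, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([f"col{i}" for i in range(data.shape[1])])  # synthetic header
    writer.writerows(data.tolist())
```

This of course reintroduces the text-encoding bloat the slide complains about, so it only makes sense for modest volumes or as a bridge until a proper HDF5 extractor is available.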
Conclusion
- Need platform expertise
- The proof is very much in the pudding