Data Lakes and Cloud Analytics in Research

 
Data Lakes
The Non-CERN Perspective?
 
Towards terascale-in-the-cloud?
J Jensen, STFC(?)
10 Apr 2018
 
Exercise
 
Analyse data in the cloud?
Towards the 100 GB – 1 TB range
Old database paradigm:
“Just drop it in”
Mostly using my own Azure
account, and spare time
And work done during phone meetings
:-)
Split personality (= work/personal) in
portals works well
Traditional Data Workflow
(Workflow diagram: find data → ingest into platform → tidy data → reformat data → EDA → analysis → hypothesis testing → results → publish, annotate, archive, share.)
"Traditional" analysis
 
R, Mathematica, Matlab, Excel, Jupyter
notebooks
Locally, or in cloud
ALL DATA IS LOADED INTO MEMORY
If your data is too big, sample, and analyse
the sample
Centuries of good (and bad) practice
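In Python terms, this traditional pattern might look like the short sketch below; pandas is assumed, and "sample.csv" is a placeholder for a sample small enough to fit comfortably in RAM:

# Sketch of the traditional in-memory pattern: load a manageable sample
# and explore it interactively. "sample.csv" is a placeholder filename.
import pandas as pd

df = pd.read_csv("sample.csv")   # the whole sample is pulled into memory here
print(df.describe())             # quick exploratory summary of the numeric columns
print(df.head())                 # eyeball the first few rows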
 
Limits of In-Memory
Processing
 
STFC’s cloud
Default: 8 GB max per machine, 20 GB/person
Typ. 90 GB, ~300 GB if you ask v. nicely
Azure
432 GB in some regions
256 GB in Northern Europe
Alibaba cloud
"Up to 768 GB"
 
Suppose you want to analyse all your data... (or just a large dataset)
 
Hadoop
Requires a degree in Astrology
Made easier with Pig, Hive, etc.
Load it into a database
Data Lake Analytics?!
What's a Data Lake?
 
… Hadoop (Wikipedia)
Linked data centres with HSM (CERN)
A sort of sophisticated row-based database
(Microsoft)
"Fake News!" (some bloke on the Internet)
(See Sam’s post to storage group)
Getting Started (Azure)
 
Analysis language: U-SQL, a hybrid of SQL and C#
Assumes data is row-based
And text (CSV, TSV, …) … !!
However, it understands .gz and transparently
uncompresses
Auto-parallelisation
You decide # parallel tasks
Max 250 but it can be increased of course
@timestring = EXTRACT AccIndex string,
                      longitude float, latitude float, severity int?,
                      Date string, Time string,
                      SpeedLimit int?, light int?, weather int?
              FROM "/data/acc-restricted.csv.gz"
              USING Extractors.Csv(nullEscape: "-1",
                                   rowDelimiter: "\n",
                                   encoding: Encoding.[ASCII],
                                   skipFirstNRows: 1);

@dates = SELECT DateTime.ParseExact( Date + Time, "dd/MM/yyyyHH:mm",
                System.Globalization.CultureInfo.InvariantCulture ) AS date
         FROM @timestring;

@hours = SELECT date.ToString( "HH", DateTimeFormatInfo.InvariantInfo ) AS hour
         FROM @dates;

@table = SELECT hour, COUNT(hour) AS HourCount FROM @hours GROUP BY hour;

OUTPUT @table TO "/output/table.csv" USING Outputters.Csv();
Simple example U-SQL (one of my earlier experiments)
Callouts from the slide, annotating the example above:
Import: column name and type (MUST import ALL!!)
A rowset is a listy type (rows)
Case sensitive
Partial types!? Nullable columns are declared int?
Reads from the default lake store; OUTPUT writes back to the default lake store
Doesn't understand headers (hence skipFirstNRows: 1)
Use .NET (in C#) for expressions such as DateTime.ParseExact
 
Getting Started (Azure)
 
Developing - Azure portal or Visual Studio
VS lets you run stuff locally, and gives syntax
highlighting – portal doesn't
VS lets you save stuff!
Portal saves your job 
when it is (attempted) run
Keep reusing old jobs – works well for linear and incremental
development, but takes a bit getting used to
Safe to discard edits and back up to job list
Need to submit the job to find out whether it is correct and works
 
Data – it should be obvious, but...
 
Put your analytics where your data is
Same cloud region
Don't run your analysis on the full dataset
Create a (representative) sample
Analyse it in a tool you know well (R, Excel,…)
Debug your script; when it finally runs, compare the outputs (see the sketch after this list)
Debug runs with lowest number of parallel jobs
(Optionally) Develop/build locally, then run in cloud
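As a concrete illustration of the "compare the outputs" step, here is a minimal pandas sketch that redoes the hour-count from the U-SQL example locally and checks it against the downloaded cloud output. It assumes the cloud job ran over the same sample, that the sample's header names the columns Date and Time as in the EXTRACT, and that Outputters.Csv() kept its default of writing no header row; "sample.csv" and "table.csv" are placeholder local filenames:

# Sketch: reproduce the hour-count locally on the sample and compare with
# the cloud job's output. Filenames and column names are assumptions.
import pandas as pd

sample = pd.read_csv("sample.csv", dtype=str).dropna(subset=["Date", "Time"])

# Mirror the U-SQL logic: combine Date and Time, parse, keep the hour, count per hour
# (the U-SQL concatenated them directly; a space is inserted here for readability)
stamps = pd.to_datetime(sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H:%M")
local = stamps.dt.strftime("%H").value_counts().sort_index()

# The default outputter writes no header row, so supply the column names here
cloud = pd.read_csv("table.csv", header=None, names=["hour", "HourCount"], dtype={"hour": str})
cloud = cloud.set_index("hour")["HourCount"].sort_index()

print(pd.DataFrame({"local": local, "cloud": cloud}))
print("match:", local.astype(int).equals(cloud.astype(int)))

If the counts disagree, the usual suspects are the null escape, the header row, and the date parsing rather than the aggregation itself.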
Updated workflow
(Workflow diagram: subset/sample → tidy data → local analysis → ingest sample into cloud → develop/test → scale to larger data → download results → compare → delete cloud resources.)
 
Data – sampling
 
Your samples should contain all the weirdness of your full data
This is tricky for sparse datasets, where rare oddities can easily be missed by a sample
Easy to sample CSV and TSV (… awk, head; see the sketch below)
(Illustration: a small sampled column of values, including oddities such as "1." and "-".)
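For completeness, a minimal Python sketch of the same idea, reading the (possibly gzipped) file line by line so nothing large is held in memory; the filenames and the 1% rate are illustrative:

# Sketch: draw a random ~1% sample of a large (optionally gzipped) CSV without
# loading it into memory, keeping the header row. As noted above, rare oddities
# may still need to be added to the sample by hand.
import gzip
import random

def sample_csv(path, out_path, fraction=0.01, seed=42):
    rng = random.Random(seed)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as src, open(out_path, "w") as dst:
        dst.write(next(src))                 # copy the header row unchanged
        for line in src:
            if rng.random() < fraction:
                dst.write(line)

sample_csv("acc-restricted.csv.gz", "sample.csv")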
 
Data – the rules of the game
 
Assume we cannot egress cloud-generated data
Until we have the final results
Download/re-upload editing is not feasible
However, data from outside the cloud can be re-ingested
Thus, edit cloud-generated data in the cloud?
 
Editing in the Cloud
(Diagram: ADLA, ADLS, a blob store, and a VM.)
 
Editing in the Cloud
(Diagram: ADLA, ADLS, a blob store, a VM with a FUSE-mounted blob store, a "Data Factory", Azure AD, and an Azure AD authenticator.)
 
Editing in the Cloud – Resource Groups
(Diagram: ADLA, ADLS, blob store, VM, and "Data Factory", split across two resource groups, RG1 and RG2.)
Data Factory
 
Like a data workflow orchestrator
Insists on understanding the file as it
passes through(!?)
Understands CSV but not gzipped CSV(!)
Had trouble with the ADLS permissions
Not fw
Not at-rest encryption
Caveat: you can attach (expensive!) things that are not deleted along with the resource that contains the factory!
 
Big data analysis workflow
 
"Workflow" in Azure-ese is "Data Factory"
Not currently available in Northern Europe
Can move stuff from A to B in Northern Europe
even if it sits in Western Europe
Can also move from 3rd party (e.g. desktop, S3
endpoint, etc.)
 
Down The Rabbit Hole
 
Build Workflows, automate
Do create subdirs/containers
(e.g. input, output)
The rabbit holes: you can spend a lot of time debugging or investigating
E.g. at-rest encryption, access
perms, Azure AD
Feels like spending time on
“unrelated” tasks
Sometimes it's best to start over
with your lessons learnt
Eventually need to get to the
“science bit”
 
Giant Steps
 
Services that “understand” science data?
Lots of science domains with not much in common
Except of course everybody uses HDF5 except HEP
Deploy VMs/containers in the cloud
Interfaces to science data
Science apps in market place
ROOT is in Debian (and so is Globus…)
 
Data Scientist
 
As before need platform expertise
Only more so (terascale = extra complexity)
Azure docs pretty good – tuts, refs.
In addition to viz., stats, maths, CS, domain,
comms. [Schutt & O’Neil]
Danger!
The infrastructure scales
But the maths doesn't (necessarily)
 
var(X) = Σ(X − X̄)² / N          (two-pass: subtract the mean, then sum the squares)
var(X) = (ΣX² − N·X̄²) / N        (one-pass: accumulate ΣX and ΣX² in a single scan)
ΣX² ... the running sum of squares becomes enormous at terascale, and subtracting two nearly equal, very large numbers destroys precision (the result can even come out negative).
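The one-pass form is attractive at scale because it needs only a single scan over the data, but it is exactly the one that misbehaves numerically. A minimal sketch of the effect in plain Python (the 1e9 offset is just an illustrative way of making the mean large relative to the spread):

# Sketch: the one-pass variance formula loses precision when the mean is large
# relative to the spread, while the two-pass formula stays accurate.
import random
import statistics

random.seed(1)
n = 1_000_000
xs = [1e9 + random.gauss(0, 1) for _ in range(n)]   # true variance is about 1

mean = sum(xs) / n
one_pass = (sum(x * x for x in xs) - n * mean * mean) / n   # (sum of squares - N*mean^2) / N
two_pass = sum((x - mean) ** 2 for x in xs) / n             # sum of squared deviations / N

print("one-pass :", one_pass)                 # often wildly wrong, can even be negative
print("two-pass :", two_pass)                 # close to 1
print("reference:", statistics.pvariance(xs))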
 
Science Data
 
CSV and TSV are, of course, rather non-optimal for science
Lots of impl. variations
No consistent NULL
Escape characters
Larger than binary encodings
Variable record size
Need an Extractor for HDF5?
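Until such an extractor exists, one pragmatic workaround is to flatten HDF5 into the text formats ADLA already understands before ingest. A minimal sketch, assuming h5py and pandas are available and that the file holds a simple two-dimensional, table-like dataset (the file and dataset names are made up):

# Sketch: flatten a simple table-like HDF5 dataset to CSV so the stock
# Extractors.Csv() can read it. File, dataset, and output names are assumptions.
import h5py
import pandas as pd

with h5py.File("measurements.h5", "r") as f:
    data = f["observations"][:]      # read the dataset into a NumPy array
                                     # (for terascale files you would read in chunks)

pd.DataFrame(data).to_csv("observations.csv", index=False)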
 
Conclusion
 
Need platform expertise
The proof is very much in the pudding