Implementing a Distributed Volumetric Data Analytics Toolkit on Apache Spark
Chao Chen, Yuzhong Yan, Lei Huang, Lijun Qian
PVAMU
August 2017
Outline
Challenges
Main contributions
Methodology
Experiments and analysis
Conclusion & future work
References
The Normal Workflow
[Diagram: Volumetric Data → Data Scientists → Matlab Codes → Computer Scientists → OpenMP/MPI Codes → Product]
The Big Challenge
Exponential growth in the volume of data
Increasing complexity and rapidly changing workflows
Handling different formats of unstructured streaming data
From computation-intensive to both data-intensive and computation-intensive
Big Data in the Oil & Gas Industry
Growing volume of data: 300 MB/km² in the early 90s; 25 GB/km² in 2006; growing toward PBs/km²
Growing variety of data: unstructured and semi-structured well logs and drilling reports
Growing velocity of data: full 3D acquisition at 8 GB/s - 20 GB/s
Source: http://www.slideshare.net/bjorna/big-data-in-oil-and-gas?related=3
The Workflow to Handle Big Data
[Diagram: Sample/Synthetic Data → Data Scientists → Matlab Codes; Actual Data → Computer Scientists → MPI Codes → Product]
Main Contributions
The paper addresses the performance of large distributed multi-dimensional arrays on a big data analytics platform by applying HPC experiences and practices.
Scalability/Performance
Friendly User Interfaces
The Updated Workflow
[Diagram: Actual Volumetric Data → Data Scientists → DMAT → Product]
Software Stack of DMAT
[Figure: DMAT software stack]
Architecture of DMAT
[Figure: DMAT architecture]
Main Components in DMAT
Data Distribution
Distributed Data Structure
Operations on Data Structure
High-Level Interfaces
Data Distribution
SEGY or A3D
RDD in Spark
What is RDD?
Key Concepts of RDD
Problems of RDD in Spark
Interfaces are still too low-level for domain-specific experts
Each partition is independent, and all mappers run independently
The performance of accessing an individual record is poor
VolumetricRDD
Abstract class hiding the details of RDD
PlaneVolume/BrickVolume
Metadata Info
Load/Save Interface
Operations on distributed partitions
https://github.com/amplab/spark-indexedrdd
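The VolumetricRDD interface is described here only at bullet level, so the following is a hypothetical single-process sketch of the idea: a wrapper that hides the partitioning behind metadata plus whole-volume operations. The names `PlaneVolume`, `load`, and `map_planes` are illustrative assumptions, not DMAT's actual API.

```python
# Hypothetical single-machine sketch of the VolumetricRDD idea: hide the
# partition layout behind metadata and whole-volume operations.
# (Illustrative names only -- not DMAT's real API.)

class PlaneVolume:
    """A 3D volume stored as a list of 2D planes (the partition unit)."""

    def __init__(self, planes, metadata):
        self.planes = planes          # planes[z][y][x]
        self.metadata = metadata      # e.g. {"nz": ..., "ny": ..., "nx": ...}

    @classmethod
    def load(cls, nz, ny, nx, fill=0.0):
        # Stand-in for a real SEGY/A3D loader.
        planes = [[[fill] * nx for _ in range(ny)] for _ in range(nz)]
        return cls(planes, {"nz": nz, "ny": ny, "nx": nx})

    def map_planes(self, fn):
        # In Spark this would be a transformation on the underlying RDD;
        # here each plane is processed independently, as a mapper would.
        return PlaneVolume([fn(p) for p in self.planes], dict(self.metadata))

def scale_plane(plane, factor=2.0):
    return [[v * factor for v in row] for row in plane]

vol = PlaneVolume.load(nz=4, ny=3, nx=3, fill=1.0)
doubled = vol.map_planes(scale_plane)
```

The point of the abstraction is that a domain expert writes `scale_plane`-style functions against planes and never touches RDD partitions directly.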
Seismic Data Example
APIs
Transpose
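The transpose results themselves are on chart slides; as a hedged illustration of what a volumetric transpose entails, the sketch below swaps the z and x axes of a plane-ordered volume. Every output plane needs one slice from every input plane, which is why a distributed transpose forces an all-to-all shuffle in Spark. The function name is illustrative, not DMAT's API.

```python
# Illustrative sketch (not DMAT code): transpose a volume stored as a list
# of z-planes into a list of x-planes. In a distributed setting each output
# plane gathers data from every input plane -- an all-to-all shuffle.

def transpose_zx(planes):
    """planes[z][y][x]  ->  out[x][y][z]"""
    nz, ny, nx = len(planes), len(planes[0]), len(planes[0][0])
    return [[[planes[z][y][x] for z in range(nz)]
             for y in range(ny)]
            for x in range(nx)]

# Toy 4x3x2 volume whose value encodes its own (z, y, x) index.
vol = [[[100 * z + 10 * y + x for x in range(2)]
        for y in range(3)]
       for z in range(4)]
t = transpose_zx(vol)
```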
Performance of Transpose
Stencil Computation: Jacobi
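The Jacobi slide itself is a figure; for reference, here is a minimal pure-Python version of the stencil pattern being benchmarked: each interior point is replaced by the average of its four neighbours while boundary values stay fixed. This is a single-machine illustration, not the DMAT implementation.

```python
# Minimal Jacobi stencil sweep on a 2D grid (single-machine illustration).
# Interior point (i, j) becomes the average of its four neighbours;
# boundary cells are held fixed.

def jacobi_step(grid):
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]            # copy keeps boundaries fixed
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new

# Toy boundary-value problem: top edge held at 1.0, everything else 0.0.
g = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
for _ in range(50):
    g = jacobi_step(g)
```

Because every interior update reads neighbouring cells, running this over distributed partitions requires the boundary planes of each partition to be visible to its neighbours, which is what the overlap template below addresses.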
Running Results: Jacobi
Templates in DMAT
Granularity: Sample, Trace, Plane, Volume
Plane Template
Overlap Template
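The overlap template is shown only as a figure; presumably it tackles the partition-independence problem noted earlier, since a stencil at a partition edge needs planes owned by the neighbouring partition. A common remedy is to extend each partition with halo (ghost) copies of its neighbours' boundary planes, sketched below under that assumption; `partition_with_overlap` is an illustrative name, not DMAT's API.

```python
# Hedged sketch of an overlap (halo) partitioning scheme: each partition of
# z-planes is padded with `halo` plane(s) from its neighbours, so a stencil
# can run on every partition independently -- the property Spark mappers
# require. (Illustrative only, not DMAT code.)

def partition_with_overlap(planes, num_parts, halo=1):
    n = len(planes)
    size = n // num_parts
    parts = []
    for p in range(num_parts):
        lo = max(0, p * size - halo)          # extend left by the halo
        hi = min(n, (p + 1) * size + halo)    # extend right by the halo
        parts.append(planes[lo:hi])
    return parts

planes = list(range(8))   # stand-in: plane z holds the value z
parts = partition_with_overlap(planes, num_parts=2, halo=1)
```

With a halo of one plane, partitions [0..3] and [4..7] become [0..4] and [3..7]: each side now owns a read-only copy of the plane it needs from the other, at the cost of redundant storage and recomputing the overlap after each sweep.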
Performance of Stencil Computation
Conclusion
Spark provides good performance with in-memory RDDs
DMAT bridges domain experts and parallel computing by hiding the details of parallelism
RDD is simple, but performance differs greatly across usages (granularity and memory)
Algorithms, data distribution, and resource configuration all affect performance; deep profiling is the most important way to boost it
Spark's overall performance is not as good as HPC's, but its productivity is better
Future Work
More templates to handle other complex applications
Support for streaming data
Integration with other frameworks such as TensorFlow and DL4J, and Python support
Debugging and profiling interface
GPU support
Acknowledgements
This research work is supported in part by US NSF award IIP-1543214, the U.S. Dept. of the Navy under agreement number N00014-17-1-3062, and the U.S. Office of the Assistant Secretary of Defense for Research and Engineering (OASD(R&E)) under agreement number FA8750-15-2-0119.
References
1. http://www.open.edu/openlearnworks/mod/page/view.php?id=41010
2. http://csegrecorder.com/articles/view/advances-in-true-volume-interpretation-of-structure-and-stratigraphy-in-3d
3. http://www.slideshare.net/bjorna/big-data-in-oil-and-gas?related=3
4. http://www.winbold.com/upstream/using-big-data-technologies-to-optimize-operations-in-upstream-oil-and-gas/
5. https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf
6. https://www.slb.com/~/media/Files/resources/oilfield_review/ors94/0794/p23_31.pdf
Thanks a lot!

This paper discusses the challenges, methodology, experiments, and conclusions of implementing a distributed volumetric data analytics toolkit on Apache Spark to address the performance of large distributed multi-dimensional arrays on big data analytics platforms. The toolkit aims to handle the exponential growth in data volume and complexity in scenarios such as the oil & gas industry. It focuses on scalability, performance, user interfaces, and software stack architecture.
