Astronomy Big Data and Cyberinfrastructure for AI Innovation

 
Using Astronomy Big Data and
National Cyberinfrastructure to
Drive AI Access and Innovation
 
Curt Dodds - Institute for Astronomy
University of Hawaii, Manoa
 
Pan-STARRS Milky Way Gigapan
 
Astronomy Big Data

Sources of Big Data
Observation (ground, space)
Simulation
Surveys
Long duration time-series telescope observations
Moore's Law
Increased image size and dimensionality
Increased simulation grid resolution and step frequency
 
Astronomy Big Data

Solar System
Sun, asteroids, comets, planets
Galactic
Stars
Exoplanets
Extragalactic
Galaxies, quasars
Cosmology
 
 
The Sun
 
Daniel K Inouye Solar Telescope (DKIST)
 
 
Daniel K Inouye Solar Telescope (DKIST)
 
 
Spectropolarimetric Inversion in 4-Dimensions (SPIN4D)
 
 
Hinode Solar Optical Telescope Spectropolarimeter
 
 
 
All Sky
Surveys
 
All-Sky Surveys

All-Sky Automated Survey for Supernovae (ASAS-SN)
Asteroid Terrestrial-impact Last Alert System (ATLAS)
Panoramic Survey Telescope and Rapid Response System (Pan-STARRS)
 
All-Sky Surveys

Time-domain Astronomy
Variable stars
Supernovae (exploding stars)
Solar flares and coronal mass ejections (CME)
Object Classification
Galaxy, quasar, star, asteroid, comet, supernova, variable star type
Regression
Estimated photometric redshift (distance from Earth)
 
ASAS-SN Sky Patrol 2.0 light curve service
 
ATLAS-VAR data release of light curves with classification
 
Pan-STARRS WISE-PS1-STRM Catalog
 
 
 
National AI
Cyberinfrastructure
 
National AI Cyberinfrastructure
 
ACCESS
Open Science Grid
Open Science Data Federation (OSDF) / Pelican Platform
National Research Platform
Commercial cloud providers: EC2, GCP, Azure, etc.
National AI Research Resource (NAIRR) pilot
National Data Platform (NDP)
National Science Data Fabric (NSDF)
Campus HPC, Science DMZ, DTNs
 
 
 
National Astronomy
Cyberinfrastructure
 
National Astronomy Data
 
NASA archives were not designed for AI/ML
 
Designed before the AI renaissance
SQL queries with extremely limited result sizes
 
Typically <<10Gbps bandwidth from archive sites
 
Large N**2 crossmatch queries unsupported (but important!)
 
Image cutout services are not performant or scalable
Friction prevents researchers (grad students!) from working at scale
Tools and services are fragmented and heterogeneous
Some recent projects have addressed these issues in part (ASAS-SN, DKIST, LSST)
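
To illustrate why large crossmatch queries are costly, here is a minimal sketch of a brute-force positional crossmatch between two small catalogs. The toy catalogs, the names `angular_sep_deg` and `crossmatch`, and the 1 arcsecond match radius are illustrative assumptions, not part of any archive's API. Every source in one catalog is compared with every source in the other, so the work grows with the product of the two catalog sizes.

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def crossmatch(cat_a, cat_b, radius_arcsec=1.0):
    """Brute-force crossmatch: compare every source in cat_a with every source in cat_b.

    Each catalog is a list of (source_id, ra_deg, dec_deg) tuples.
    Returns (id_a, id_b, separation_arcsec) for every pair within the radius.
    """
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_sep_deg(ra_a, dec_a, ra_b, dec_b) * 3600.0
            if sep <= radius_arcsec:
                matches.append((id_a, id_b, sep))
    return matches

# Hypothetical toy catalogs, for illustration only.
catalog_a = [("a1", 150.0000, 2.0000), ("a2", 150.1000, 2.1000)]
catalog_b = [("b1", 150.0001, 2.0001), ("b2", 151.0000, 3.0000)]
pairs = crossmatch(catalog_a, catalog_b)
```

Spatial indexing (sky tessellation, k-d trees) can avoid the full pairwise comparison, but for survey-scale catalogs the query still needs server-side support, which is the gap this slide points at.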
 
Legacy Data Access
 
ATLAS Photometry Server
Next, submit an RA and Dec coordinate to the server to obtain a URL for checking the status. Note that our request may be throttled if we make too many in a short time.
Mikulski Archive for Space Telescopes (MAST)
(Hubble Space Telescope, Pan-STARRS, JWST, Kepler, TESS)
3GB MyDB for query results (to query 150TB Pan-STARRS DR2 catalog)
You can retrieve 0.002% of the data BY DESIGN!
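
The 0.002% figure follows directly from the two sizes quoted on this slide. A quick check, assuming decimal units (1 TB = 1000 GB):

```python
mydb_gb = 3         # MyDB quota for query results, from the slide
catalog_tb = 150    # Pan-STARRS DR2 catalog size, from the slide

catalog_gb = catalog_tb * 1000    # assumption: decimal units, 1 TB = 1000 GB
fraction = mydb_gb / catalog_gb
percent = fraction * 100

print(f"Retrievable fraction: {percent:.3f}%")   # prints "Retrievable fraction: 0.002%"
```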
 
Legacy Data Access Patterns
 
Example: Download ATLAS Variable Stars from MAST
https://archive.stsci.edu/hlsp/atlas-var (Heinze et al. 2018)
Shard 360 deg into 180 x 2 deg partitions, each 100 MB < x < 2 GB
Had to use trial and error to determine partition limits
Manually write a download script
Wait 5 days for download of 29GB of data to finish
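
The sharding step can be sketched as follows. This is a minimal illustration that assumes the 360 degrees are split along right ascension into equal 2 degree strips; the slide does not say which coordinate was sharded, and the function name `ra_partitions` is hypothetical. The last lines work out the effective transfer rate implied by 29 GB in 5 days, again assuming decimal units.

```python
def ra_partitions(width_deg=2, total_deg=360):
    """Return (ra_min, ra_max) bounds covering [0, total_deg) in equal-width strips."""
    if total_deg % width_deg != 0:
        raise ValueError("width_deg must divide total_deg evenly")
    return [(start, start + width_deg) for start in range(0, total_deg, width_deg)]

partitions = ra_partitions()
print(len(partitions), partitions[0], partitions[-1])   # 180 (0, 2) (358, 360)

# Effective throughput of the download described above (decimal units assumed).
total_bytes = 29e9              # 29 GB
elapsed_s = 5 * 24 * 3600       # 5 days
rate_mbit_s = total_bytes * 8 / elapsed_s / 1e6
print(f"{rate_mbit_s:.2f} Mbit/s")
```

The result is roughly half a megabit per second, far below what a 10 Gbps research network link can carry.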
 
Legacy Data Access
 
Example: “A catalog of broad morphology of Pan-STARRS galaxies based on deep learning”, Hunter Goddard (MS thesis)
 
 
 
 
 
 
https://krex.k-state.edu/bitstream/handle/2097/41353/HunterGoddard2021.pdf
 
 
 
New Data Access
 
 
 
Driving AI
Innovation
 
Driving AI/ML Innovation
 
Reduce time to get started
Data discovery as a service
Data exploration as a service
Data ready for AI/ML training
Preprocessing adjacent to data origin
High throughput data distribution optimized for PyTorch, Keras
Transparent data caching
Eliminate sources of friction
 
Driving AI/ML Innovation

Support novel data access patterns
Online training data for AI/ML on time-series
Real-time data sources
AI/ML inference applications
Data exploration without data movement
Data preprocessing without data movement
Move only the data you want
Transparent caching for efficiency and performance
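
One way to picture transparent caching combined with "move only the data you want" is a read-through cache keyed by object name and byte range. The sketch below is an illustration under stated assumptions: `RangeCache`, `fetch_range` and the in-memory `origin` dictionary are hypothetical stand-ins, not the OSDF or Pelican API.

```python
from collections import OrderedDict

class RangeCache:
    """Read-through LRU cache keyed by (object name, offset, length)."""

    def __init__(self, fetch_range, max_entries=128):
        self._fetch_range = fetch_range    # callable that reads from the origin
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, name, offset, length):
        key = (name, offset, length)
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)    # mark as most recently used
            return self._entries[key]
        self.misses += 1
        data = self._fetch_range(name, offset, length)    # only this slice moves
        self._entries[key] = data
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)             # evict least recently used
        return data

# Hypothetical origin: an in-memory dict standing in for remote storage.
origin = {"lightcurves.bin": bytes(range(256))}

def fetch_range(name, offset, length):
    return origin[name][offset:offset + length]

cache = RangeCache(fetch_range, max_entries=2)
first = cache.read("lightcurves.bin", 16, 4)     # miss: fetched from the origin
second = cache.read("lightcurves.bin", 16, 4)    # hit: served from the cache
```

The caller uses the same `read` call in both cases, which is what makes the cache transparent; only the hit and miss counters reveal where the bytes came from.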
 
 
OSDF/Pelican
 
Hawaii OSDF Data Origins

Participate in OSDF/Pelican
Deploy data origin service on UH/IfA DTNs
Deploy data origin service on CC* HPC storage
Internal outreach to researchers
Who produce data
Who consume data
 
Hawaii OSDF Data Origins
 
IfA DTNs
dtn-itc
Hinode SOT SP solar observations and inversions mirror from High Altitude Observatory in Boulder, CO
Critical Early DKIST Science: Spectropolarimetric Inversion in Four Dimensions with Deep Learning (SPIN4D)
ATLAS-VAR variable star light curves
dtn-max (Baltimore)
dtn-naoj (Tokyo)
dtn-hurp (Hilo, Hawaii)
dtn-uk (planned - London)
 
Hawaii OSDF Data Origins

UH CC* KoaStore data origin (new):
CC* UH 800TB set aside for data federation using OSDF

Datasets (work in progress)
ASAS-SN - light curves for any source
SPIN4D - solar photosphere simulation
Hinode SOT SP - solar spectropolarimetric survey
ATLAS-VAR - variable stars
StePS - cosmological N-body simulation
 
 
NRP
 
Institute for Astronomy K8s/NRP

Heterogeneous K8s cluster in Hawaii
640 CPU cores
8x L40S GPU, 2x V100 GPU
Federate to NRP
Storage integration
on-premise project storage clusters (ATLAS, ASAS-SN, SPIN4D, Pan-STARRS, H20)
campus HPC Lustre storage cluster
IfA DTNs
 
Data Services

Vision: to make siloed astronomy data from Hawaii available for ML training on NAIRR, NDP, NRP, OSG and other HPC resources.

Objectives:
Dataset discovery service on OSDF data origin, UH DTNs
Dataset discovery/exploration on OSG, NRP resources (Jupyter Notebook)
Dataset streaming service on OSDF data origin, UH DTNs
Dataset client streaming to OSG, NRP resources (Jupyter Notebook, PyTorch, Keras)
 
Extract-Transform-Distribute (ETD)
 
The ETD Data Discovery and Streaming Service is deployed adjacent to a data source using containers (Docker, K8s).

Discovery - enumerate available datasets, file exploration and access
Extract - select, slice and sample from data sources
Transform - process extracted examples for AI training, e.g. torch.utils.data.DataLoader and tf.data.Dataset
Distribute - asynchronous parallel streaming
 
Proof of Concept at Univ. of Hawaii using DTNs, NRP, OSDF
Applications for education, training, transfer-learning, real-time inference
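
The four ETD stages can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the service's implementation: the in-memory `DATASETS` catalog, the function names, and the peak normalization used as the transform are all hypothetical, and an asyncio queue stands in for network streaming. In a real client, the received batches would feed something like torch.utils.data.DataLoader or tf.data.Dataset.

```python
import asyncio

# Hypothetical catalog standing in for datasets stored next to the service.
DATASETS = {
    "toy-lightcurves": [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 5.0, 5.0], [0.0, 1.0, 2.0]],
}

def discover():
    """Discovery: enumerate available datasets."""
    return sorted(DATASETS)

def extract(name, start=0, stop=None, step=1):
    """Extract: select a slice or sample of examples from a dataset."""
    return DATASETS[name][start:stop:step]

def transform(example):
    """Transform: scale each example by its maximum (illustrative preprocessing)."""
    peak = max(example) or 1.0
    return [value / peak for value in example]

async def distribute(examples, queue, batch_size=2):
    """Distribute: put transformed batches on a queue, yielding when it is full."""
    for i in range(0, len(examples), batch_size):
        batch = [transform(e) for e in examples[i:i + batch_size]]
        await queue.put(batch)
    await queue.put(None)    # sentinel: stream finished

async def consume(queue):
    """Client side: collect batches until the sentinel arrives."""
    received = []
    while (batch := await queue.get()) is not None:
        received.append(batch)
    return received

async def main():
    name = discover()[0]
    queue = asyncio.Queue(maxsize=4)
    # Producer and consumer run concurrently on the same event loop.
    _, batches = await asyncio.gather(distribute(extract(name), queue), consume(queue))
    return batches

batches = asyncio.run(main())
```

Keeping extract and transform next to the data means only the prepared batches cross the network, which matches the "preprocessing adjacent to data origin" goal stated earlier in the deck.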
 
Resources
 
ACCESS
Open Science Grid (OSG)
Open Science Data Federation (OSDF)
Pelican Platform
National Research Platform (NRP)
National AI Research Resource (NAIRR) pilot
National Data Platform (NDP)
National Science Data Fabric (NSDF)
Science DMZ
Data Transfer Node (DTN)
 
Contact Information
 
Curt Dodds
Institute for Astronomy, University of Hawaii, Manoa
dodds@hawaii.edu
 

Harnessing the power of big data in astronomy, this presentation by Curt Dodds from the Institute for Astronomy at the University of Hawaii, Manoa, delves into the utilization of national cyberinfrastructure to advance artificial intelligence access and foster innovation in the field. The discussion covers a range of topics including the sources of astronomy big data, from solar system bodies to galaxies and cosmological phenomena, as well as cutting-edge technologies such as the Daniel K. Inouye Solar Telescope and Spectropolarimetric Inversion in 4-Dimensions (SPIN4D). All-sky surveys like ASAS-SN, ATLAS, and Pan-STARRS play a crucial role in time-domain astronomy, enabling the study of variable stars, supernovae, and other transient events. The presentation underscores how the convergence of astronomy, big data, and AI is revolutionizing our understanding of the universe.


Uploaded on Sep 11, 2024



