Statistical Genomics and Machine Learning Challenges in AI

Statistical Genomics
Zhiwu Zhang
Washington State University
Machine Learning
Administration
Final HW due on Friday (April 28) at 3:10PM
Final exam: May 3, 4:30-6:30pm, open book, 50 questions
Grade submission: May 4 to students and May 5 to university
Course evaluation target: 100% responses by Friday (April 28)
Group pictures: Wednesday (April 26)
Outline
 
Challenges in finding the best method.
Machine learning
Knowledge  mining
MMAP algorithm
Cloud computing
MMAP performances
Domains
1996: Garry Kasparov vs. IBM Deep Blue
Hard-coded moves from multiple players beat the champion
 
IF… IF… IF… (a long chain of hard-coded rules)
Machine learning
 
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
https://engineering.jhu.edu/ams/research/statistics-and-machine-learning
Statistical learning methods with ML features
Linear Regression
Cluster Analysis
Decision Tree
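Linear regression is a simple way to see what "learning without being explicitly programmed" means in practice. The sketch below is illustrative only: the data are simulated, the true slope of 2 and the sample sizes are assumptions, and nothing here comes from the course material. The same `lm()` call is refit on more and more data, and the estimated slope typically moves closer to the true value without any change to the code.

```r
# Simulated data; the true relationship y = 2x + noise is an assumption
set.seed(1)
x_all <- runif(2000)
y_all <- 2 * x_all + rnorm(2000)

# Same code, more data: refit the model on growing subsets
sizes <- c(20, 200, 2000)
slope_error <- sapply(sizes, function(n) {
  fit <- lm(y ~ x, data = data.frame(x = x_all[1:n], y = y_all[1:n]))
  abs(coef(fit)[["x"]] - 2)   # distance of the learned slope from the truth
})
names(slope_error) <- sizes
print(slope_error)
```

A hard-coded IF rule would stay the same however much data arrived; here only the data change.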
Knowledge Mining
Data
Knowledge
Prediction
Genomic Prediction
Input data: marker density, LD decay, sample size, h² (heritability), PCA
Prediction accuracy
Impute missing accuracy
K-means cluster analysis
Mining the best method: Find the nearest neighbor from the cluster analysis based on existing features, including heritability, and newly gained prediction accuracies. Choose the method that gave the highest prediction accuracy for the nearest neighbor as the candidate to examine in the next iteration.
Examined? (Yes / No)
Output
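The nearest-neighbor step in "Mining the best method" can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the mMAP implementation: the knowledge base, its feature values, and the accuracies are hypothetical, and a plain Euclidean distance on standardized features stands in for the K-means cluster analysis named on the slide.

```r
# Hypothetical knowledge base: one row per previously analyzed dataset
knowledge <- data.frame(
  h2     = c(0.3, 0.7, 0.5),     # heritability
  n      = c(300, 1000, 500),    # sample size
  rrBLUP = c(0.25, 0.41, 0.36),  # prediction accuracy per method
  gBLUP  = c(0.25, 0.41, 0.35),
  cBLUP  = c(0.28, 0.50, 0.34)
)
methods <- c("rrBLUP", "gBLUP", "cBLUP")

# Features of a new dataset (no accuracies known yet)
new_data <- c(h2 = 0.65, n = 900)

# Standardize features so no single feature dominates the distance
feats <- scale(knowledge[, c("h2", "n")])
new_z <- (new_data - attr(feats, "scaled:center")) / attr(feats, "scaled:scale")

# Euclidean distance from the new dataset to each known dataset
d <- sqrt(rowSums(sweep(feats, 2, new_z)^2))
nearest <- which.min(d)

# Candidate = method with the highest accuracy for the nearest neighbor
candidate <- methods[which.max(unlist(knowledge[nearest, methods]))]
print(candidate)
```

The candidate is only a starting point: it still has to be run on the new data before its accuracy is known.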
Knowledge and expansion
Table labels: Imputed, Highest, New, Nearest, Next
New data
Impute missing accuracy
Cluster analysis
Let M be the best method for the old data that is the closest to the new data
Has M already been applied to the new data?
No: apply M to the new data
Yes: stop and output the best
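The loop in this flow chart can be sketched as follows. Everything in the block is an assumption made for illustration: `run_method` is a hypothetical stand-in for actually fitting a method on the new data, the accuracy values are made up, the starting method is arbitrary, and the imputation and cluster analysis steps are simplified to a Euclidean distance over the accuracies measured so far.

```r
# Hypothetical stand-in for running a method on the new data;
# here it only looks up made-up accuracies
run_method <- function(method) {
  c(rrBLUP = 0.30, gBLUP = 0.31, cBLUP = 0.42)[[method]]
}

# Hypothetical knowledge base: accuracies of each method on old datasets
old_acc <- rbind(
  old1 = c(rrBLUP = 0.41, gBLUP = 0.41, cBLUP = 0.50),
  old2 = c(rrBLUP = 0.43, gBLUP = 0.43, cBLUP = 0.39),
  old3 = c(rrBLUP = 0.23, gBLUP = 0.24, cBLUP = 0.30)
)
new_acc <- setNames(rep(NA_real_, ncol(old_acc)), colnames(old_acc))

M <- "rrBLUP"  # arbitrary starting method (an assumption)
repeat {
  new_acc[M] <- run_method(M)  # apply M to the new data
  # Nearest old dataset, using only the accuracies measured so far
  known <- !is.na(new_acc)
  d <- sqrt(rowSums(sweep(old_acc[, known, drop = FALSE], 2, new_acc[known])^2))
  nearest <- names(which.min(d))
  M <- names(which.max(old_acc[nearest, ]))  # best method for that neighbor
  if (!is.na(new_acc[M])) break              # M already applied: stop
}
best <- names(which.max(new_acc))  # output the best method found
print(best)
```

In this toy run the loop stops after trying two of the three methods, which is the point of the search: it avoids running every method on every new dataset.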
 
mMAP stays on the top
(Figure panels: Maize, Mice, Pine; axis: Number of genes)
Real traits
mMap: An Online Computing Platform to Transform Genotypes to Phenotypes by Mining the Maximum Accuracy of Prediction
You Tang
mMAP website: http://zzlab.net/mMAP
Upload Files
Create New Project
Check status and download results
Simulation with GAPIT
# Load the GAPIT functions and the demo data (numeric genotypes and SNP map)
source("http://zzlab.net/GAPIT/gapit_functions.txt")
myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T)
myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T)
# Hold out one fifth of the individuals for testing
set.seed(99164)
n=nrow(myGD)
testing=sample(n,round(n/5),replace=F)
training=-testing
# Simulate a trait with heritability 0.7 and 20 QTNs
set.seed(99164)
mySim=GAPIT.Phenotype.Simulation(GD=myGD, GM=myGM, h2=.7, NQTN=20, QTNDist="normal")
#MMAP
names(mySim$Y)=c("Taxa", "SimTrait")
write.table(mySim$Y[training,], file="mdp_YRef.txt", sep="\t",quote=F,row.names=F)
#upload mdp_numeric.txt and mdp_YRef.txt to MMAP (http://zzlab.net/mMAP)
#Analysis
mymMapRef=read.csv("public820.csv")
accuracy <- cor(mySim$u[testing], mymMapRef[testing,2])^2
plot(mymMapRef[testing,2], mySim$u[testing])
mtext(paste("R square=", accuracy,sep=""), side = 3)
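The same hold-out evaluation can be tried without GAPIT or the mMAP website. The sketch below is a self-contained illustration under stated assumptions: genotypes and marker effects are simulated, ridge regression with an arbitrary penalty `lambda` stands in for whatever method mMAP selects, and all sizes are made up. Only the evaluation logic (hide one fifth of the phenotypes, predict them, square the correlation with the true genetic values) mirrors the slides.

```r
set.seed(99164)
n <- 200; m <- 50                                      # individuals, markers (made up)
G <- matrix(sample(0:2, n * m, replace = TRUE), n, m)  # simulated 0/1/2 genotypes
u <- as.vector(G %*% rnorm(m))                         # true genetic values
h2 <- 0.7
y <- u + rnorm(n, sd = sqrt(var(u) * (1 - h2) / h2))   # phenotype with h2 = 0.7

# Hold out one fifth of the individuals, as in the slides
testing <- sample(n, round(n / 5))
training <- -testing

# Ridge regression on the training set (stand-in for the mMAP prediction)
mu   <- colMeans(G[training, ])
Xtr  <- sweep(G[training, ], 2, mu)   # center markers with training means
Xte  <- sweep(G[testing, ], 2, mu)
ybar <- mean(y[training])
lambda <- 10                          # arbitrary penalty, not tuned
beta <- solve(crossprod(Xtr) + lambda * diag(m),
              crossprod(Xtr, y[training] - ybar))
pred <- as.vector(Xte %*% beta) + ybar

# Prediction accuracy: squared correlation with the true genetic values
accuracy <- cor(u[testing], pred)^2
print(accuracy)
```

Because the true genetic values `u` are known in a simulation, accuracy is measured against them and not against the noisy phenotypes.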
Outline
 
Challenges in finding the best method.
Machine learning
Knowledge  mining
MMAP algorithm
Cloud computing
MMAP performances

Explore the intersection of statistical genomics, machine learning, and artificial intelligence through topics like knowledge mining, MMAP algorithm, cloud computing, and historical events such as Garry Kasparov vs IBM Deep Blue. Delve into the concepts of statistical learning methods, data prediction, and genomic analysis to understand how these fields intersect and evolve.

  • Statistical Genomics
  • Machine Learning Challenges
  • AI
  • Knowledge Mining
  • Genomic Prediction

Uploaded on Apr 16, 2024 | 7 Views



Presentation Transcript



  8. DEEP BLUE (Sum only, no ability to learn) "Machine learning is a computer system that gets better over new data without reprogramming" -Zhiwu Zhang


  12. Knowledge and expansion
      Labels on the slide: Imputed, Highest, New, Nearest, Next

      Data                      rrBLUP  gBLUP  cBLUP
      Start.Weight              0.41    0.41   0.50
      FN.Age                    0.57    0.58   0.74
      FN.PctWtLoss              0.25    0.25   0.28
      FN.postWeight             0.40    0.40   0.49
      FN.preWeight              0.39    0.39   0.49
      Weight.Growth Intercept   0.37    0.37   0.48
      Weight.Growth Slope       0.31    0.32   0.34
      HTLC                      0.43    0.43   0.39
      BA                        0.49    0.49   0.47
      BD                        0.25    0.25   0.24
      BLC                       0.47    0.47   0.44
      CWAC                      0.45    0.44   0.43
      CWAL                      0.36    0.35   0.34
      DBH                       0.43    0.42   0.39
      HT                        0.36    0.35   0.34
      rootnum                   0.23    0.23   0.21
      rootnumbin                0.26    0.26   0.26
      gall                      0.23    0.22   0.22
      rustbin                   0.27    0.27   0.28
      c5c6                      0.26    0.25   0.23
      density                   0.20    0.21   0.18
      lateWood.4                0.23    0.24   0.24
      lignin                    0.16    0.17   0.16
      stiffnessTree             0.37    0.38   0.37

