Machine Learning Approach for Hierarchical Classification of Transposable Elements

12/15/2024
Introduction
Data Collection
Feature Extraction
Hierarchical Classification Approaches
Machine Learning Methods for the Prediction of Hierarchical Categories
Results
Conclusion

Transposable elements (TEs), or jumping genes, are DNA sequences that have
an intrinsic capability to move within a host genome from one genomic location to another.
The new genomic location can be on the same chromosome or on a different one.

TEs were first discovered by Barbara McClintock (a.k.a. maize scientist) in 1948.

TEs play an important role in:
Modifying the functionality of genes
E.g., insertion of L1-type TEs in tumor suppressor genes could lead to cancer.

Hence, proper classification of the TEs identified in a genome is important for understanding their particular
role in germline and somatic evolution.

[Figure: TE taxonomy proposed by Wicker et al. (hierarchy diagram with node codes such as 1, 1.1 and 1.1.2)]

For our study, we collected pre-annotated DNA sequences of TEs.

The hierarchical annotations of the TEs follow Wicker’s taxonomy.

For the annotation of TEs, the repetitive DNA sequences were obtained from two public
repositories:
Repbase
PGSB

The Repbase repository contains TEs from different eukaryotic species.

PGSB is a compilation of plant repetitive sequences from different databases:
TREP
TIGR repeats
PlantSat
GenBank

Each TE in a dataset is represented by a set of k-mers,
which are obtained by counting the frequency of each substring of length k.
E.g., for k=2, all combinations of (AA, AT, AG, AC, …, CC) in the sequence are extracted.

Example sequence: CCGCAAAAGTTGTC

For k=2: AA = 2, CC = 2, TT = 2
For k=3: CCG = 1, CAA = 1, AAG = 1
For k=4: CCGC = 1, AAAA = 1, GTTG = 1

For each TE, k-mers of sizes k = 2, 3 and 4 were used as features.

Feature values were standardized such that the mean = 0 and standard deviation = 1.
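
The feature extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes overlapping k-mers counted with a sliding window of step 1 (the slides do not state the counting convention, and counts depend on it), a fixed A/C/G/T alphabet, and per-feature z-scoring across the dataset. The names `kmer_counts`, `feature_vector` and `standardize`, and the two extra toy sequences, are hypothetical.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

ALPHABET = "ACGT"  # assumption: k-mers containing other symbols (e.g. N) are ignored


def kmer_counts(seq, k):
    """Count overlapping k-mers (window step 1) for all 4**k possible k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {kmer: counts.get(kmer, 0) for kmer in all_kmers}


def feature_vector(seq, ks=(2, 3, 4)):
    """Concatenate the k-mer counts for every k in ks into one flat list."""
    vec = []
    for k in ks:
        vec.extend(kmer_counts(seq, k).values())
    return vec


def standardize(rows):
    """Z-score each column across all rows; constant columns become 0.0."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, mus, sds)]
            for row in rows]


# Toy dataset: the slide's example sequence plus two made-up sequences.
seqs = ["CCGCAAAAGTTGTC", "ACGTACGTACGTAC", "TTTTGGGGCCCCAA"]
X = standardize([feature_vector(s) for s in seqs])
print(len(X[0]))  # 16 + 64 + 256 = 336 features per TE
```

With k = 2, 3 and 4 every TE maps to a vector of 336 values, regardless of sequence length, which is what lets a standard classifier consume sequences of different sizes.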

Classification of TEs can be treated as a hierarchical classification problem.
The class hierarchy can be represented by a directed acyclic graph or a tree.

Hierarchical classification of TEs is performed using top-down strategies.
Two recent top-down strategies for the hierarchical classification of TEs are:
non-Leaf Local Classifier per Parent Node (nLLCPN)
Local Classifier per Parent Node and Branch (LCPNB)

In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph.

At the root, …CCGCAAAAGTTGTC… is classified as either 1 or 2.
At node 2, …CCGCAAAAGTTGTC… is classified as either 2 itself or 2.1.
At node 2.1, …CCGCAAAAGTTGTC… is classified as either 2.1 itself or 2.1.1.
At node 2.1.1, …CCGCAAAAGTTGTC… is classified as 2.1.1.2.
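
The top-down walk in this example can be sketched as a short loop. This is only an illustration of the control flow under stated assumptions: `classify_top_down` and `toy_classifiers` are hypothetical names, and the lambdas stand in for trained multi-class classifiers whose outputs are hard-coded to reproduce the slide's path.

```python
def classify_top_down(x, classifiers, root="root"):
    """Descend from the root until a node predicts itself or a leaf is reached."""
    node = root
    while node in classifiers:          # only non-leaf nodes have a classifier
        predicted = classifiers[node](x)
        if predicted == node:           # the classifier chose the parent itself: stop here
            break
        node = predicted                # otherwise move down to the predicted child
    return node


# Hypothetical stand-ins for the trained classifier at each non-leaf node.
toy_classifiers = {
    "root": lambda x: "2",
    "2": lambda x: "2.1",
    "2.1": lambda x: "2.1.1",
    "2.1.1": lambda x: "2.1.1.2",
}

print(classify_top_down("CCGCAAAAGTTGTC", toy_classifiers))  # 2.1.1.2
```

Allowing a node to predict itself is what lets the walk stop at an internal class when the sequence does not fit any of the children.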

In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph, and prediction
probabilities are obtained for all the classes.

The path leading to the final classification:
2 (0.6) → 2.1 (1) → 2.1.1 (0.8) → 2.1.1.1 (0.4)
Average = (0.6 + 1 + 0.8 + 0.4)/4 = 0.7

[Figure: class hierarchy with a prediction probability on each edge]
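
The path-averaging step can be sketched as follows. This is a minimal sketch under assumptions: it scores every root-to-leaf path by the mean of its edge probabilities and keeps the best one, and it does not model stopping at an internal node. The function names and the `toy_probs` tree are hypothetical; only the path 2 (0.6), 2.1 (1), 2.1.1 (0.8), 2.1.1.1 (0.4) is taken from the slide, the remaining probabilities are made up for illustration.

```python
def leaf_paths(tree, node="root", path=()):
    """Yield every root-to-leaf path as a tuple of (class, probability) pairs."""
    children = tree.get(node)
    if not children:                    # no children recorded: this node is a leaf
        yield path
        return
    for child, prob in children.items():
        yield from leaf_paths(tree, child, path + ((child, prob),))


def best_path(tree):
    """Return (mean probability, list of classes) for the highest-scoring path."""
    scored = [
        (sum(p for _, p in path) / len(path), [c for c, _ in path])
        for path in leaf_paths(tree)
    ]
    return max(scored, key=lambda item: item[0])


# Probabilities each local classifier assigns to its children (illustrative values).
toy_probs = {
    "root": {"1": 0.4, "2": 0.6},
    "1": {"1.1": 0.6, "1.4": 0.2, "1.5": 0.2},
    "1.1": {"1.1.1": 0.8, "1.1.2": 0.2},
    "2": {"2.1": 1.0},
    "2.1": {"2.1.1": 0.8},
    "2.1.1": {"2.1.1.1": 0.4, "2.1.1.2": 0.2, "2.1.1.5": 0.2, "2.1.1.8": 0.2},
}

score, classes = best_path(toy_probs)
print(classes, round(score, 2))  # ['2', '2.1', '2.1.1', '2.1.1.1'] 0.7
```

Unlike a greedy descent, this scoring considers all classifiers' probabilities at once, so a confident decision deep in the tree can compensate for a less certain one near the root.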

We applied several machine learning methods at each non-leaf node of the directed acyclic graph:
Artificial Neural Network (ANN)
ExtraTree Classifier (ET)
Gradient Boosting Classifier (GBC)
Logistic Regression (LogReg)
Random Forest (RF)
Support Vector Machines (SVM)

The state-of-the-art method implements an ANN with a single hidden layer of 200 nodes as the
multi-class classifier.

In this study, in contrast, we propose SVM-based multi-class classification.

We implemented an SVM with an RBF kernel and optimized the cost and gamma parameters using a grid
search.
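
For reference, the two tuned parameters enter the model as follows. These are the standard textbook forms of the RBF kernel and the soft-margin SVM objective, given here as a reminder rather than as the authors' exact formulation; the candidate sets $G_C$ and $G_\gamma$ and the cross-validated score $\mathrm{CV}$ are placeholders, since the slides do not list the grid values or the selection criterion.

```latex
% RBF kernel: gamma controls how quickly similarity decays with distance
K(x, x') = \exp\!\left(-\gamma \,\lVert x - x' \rVert^{2}\right)

% Soft-margin SVM: the cost C trades margin width against training errors
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
\quad \text{s.t.} \quad y_i\left(w^{\top}\phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0

% Grid search: evaluate every pair and keep the best cross-validated score
(C^{*}, \gamma^{*}) = \arg\max_{(C,\gamma) \,\in\, G_C \times G_\gamma} \mathrm{CV}(C, \gamma)
```

A larger $\gamma$ makes the decision boundary more local, and a larger $C$ penalizes misclassified training points more heavily, which is why the two are usually tuned jointly.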

Hierarchical Precision: $hP = \frac{\sum_i |C_i \cap Z_i|}{\sum_i |Z_i|}$
Hierarchical Recall: $hR = \frac{\sum_i |C_i \cap Z_i|}{\sum_i |C_i|}$
Hierarchical F-measure: $hF = \frac{2 \cdot hP \cdot hR}{hP + hR}$

Here, $C_i$ and $Z_i$ represent the sets of true and predicted classes for an instance $i$, respectively.

The performance of each classifier is evaluated using a 3-fold cross-validation strategy.
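
The hierarchical measures can be computed with a few set operations. The sketch below assumes the usual definitions (overlap between true and predicted class sets, summed over all instances, divided by the total predicted or true set sizes) and assumes that each set contains a class together with all of its ancestors, derived here from dotted codes such as 2.1.1. The helper names and the two toy instances are hypothetical.

```python
def with_ancestors(label):
    """Expand '2.1.1' into {'2', '2.1', '2.1.1'} (assumes dotted class codes)."""
    parts = label.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_scores(true_labels, predicted_labels):
    """Return (hP, hR, hF), summing set sizes over all instances."""
    overlap = pred_total = true_total = 0
    for t, p in zip(true_labels, predicted_labels):
        C, Z = with_ancestors(t), with_ancestors(p)   # true and predicted class sets
        overlap += len(C & Z)
        pred_total += len(Z)
        true_total += len(C)
    hP = overlap / pred_total
    hR = overlap / true_total
    hF = 2 * hP * hR / (hP + hR) if hP + hR else 0.0
    return hP, hR, hF


# Two toy instances: one wrong only at the last level, one predicted too deep.
y_true = ["2.1.1.2", "1.1"]
y_pred = ["2.1.1.1", "1.1.2"]
hP, hR, hF = hierarchical_scores(y_true, y_pred)
print(round(hP, 3), round(hR, 3), round(hF, 3))  # 0.714 0.833 0.769
```

Because shared ancestors count as correct, a prediction that is wrong only at the deepest level is penalized far less than one that picks the wrong top-level class.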

Table I. Comparative results of different machine learning approaches on the PGSB hierarchical datasets.
nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

Table II. Comparative results of different machine learning approaches on the Repbase hierarchical datasets.
nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

Fig. 1. Hierarchical F-measure comparison between different machine learning approaches for the nLLCPN and
LCPNB hierarchical classification methods on the PGSB dataset.

Fig. 2. Hierarchical F-measure comparison between different machine learning approaches for the nLLCPN and
LCPNB hierarchical classification methods on the Repbase dataset.

An advanced machine learning approach improves the prediction accuracy of hierarchical classification of TEs.

Optimizing the cost and gamma parameters of a support vector machine (SVM) with a radial basis function
(RBF) kernel leads to better hierarchical classification of transposable elements.

We plan to improve the classification accuracy through the following approaches:
Adding biochemistry-related features
Implementing advanced machine learning techniques
Implementing novel hierarchical classification approaches

This study presents a machine learning approach for the hierarchical classification of transposable elements (TEs) based on pre-annotated DNA sequences. The research includes data collection, feature extraction using k-mers, and classification approaches. Proper categorization of TEs is crucial for understanding their impact on genetic evolution. Various machine learning methods are applied to predict hierarchical categories of TEs, offering insights into their functional roles in genomes.


Uploaded on Dec 15, 2024



Presentation Transcript


  1. Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Approach
     Avdesh Mishra, Manisha Panta, MdTamjidul Hoque, Joel Atallah
     Computer Science and Biological Sciences Department, University of New Orleans

  2. Presentation Overview
     Introduction
     Data Collection
     Feature Extraction
     Hierarchical Classification Approaches
     Machine Learning Methods for the Prediction of Hierarchical Categories
     Results
     Conclusion

  3. Transposable Elements
     Transposable elements (TEs), or jumping genes, are DNA sequences that have an intrinsic capability to move within a host genome from one genomic location to another.
     The new genomic location can be on the same chromosome or on a different one.
     TEs were first discovered by Barbara McClintock (a.k.a. maize scientist) in 1948.
     TEs play an important role in modifying the functionality of genes, e.g., insertion of L1-type TEs in tumor suppressor genes could lead to cancer.
     Hence, proper classification of the TEs identified in a genome is important for understanding their particular role in germline and somatic evolution.

  4. Illustration of TEs Taxonomy Proposed by Wicker et al.
     Root
       Class I (Retrotransposons)
         LTR: Copia, Gypsy, Bel-pao, Retrovirus, ERV
         DIRS: DIRS, Ngaro, VIPER
         PLE: Penelope
         LINE: R2, RTE, Jockey, L1, I
         SINE: tRNA, 7SL, 5S
       Class II (DNA Transposons)
         Subclass 1
           TIR: Tc1-Mariner, hAT, Mutator, Merlin, Transib, P, PiggyBac, PIF-Harbinger, CACTA
           Crypton: Crypton
         Subclass 2
           Helitron: Helitron
           Maverick: Maverick-Polinton
     Node codes shown on the diagram: 1, 1.1, 1.1.2

  5. Data Collection
     For our study, we collected pre-annotated DNA sequences of TEs.
     The hierarchical annotations of the TEs follow Wicker's taxonomy.
     For the annotation of TEs, the repetitive DNA sequences were obtained from two public repositories: Repbase and PGSB.
     The Repbase repository contains TEs from different eukaryotic species.
     PGSB is a compilation of plant repetitive sequences from different databases: TREP, TIGR repeats, PlantSat and GenBank.
     Number of FASTA sequences: PGSB 18680, Repbase 34561

  6. Feature Extraction
     Each TE in a dataset is represented by a set of k-mers, which are obtained by counting the frequency of each substring of length k.
     E.g., for k=2, all combinations of (AA, AT, AG, AC, ..., CC) in the sequence are extracted.
     Example sequence: CCGCAAAAGTTGTC
     For k=2: AA = 2, CC = 2, TT = 2
     For k=3: CCG = 1, CAA = 1, AAG = 1
     For k=4: CCGC = 1, AAAA = 1, GTTG = 1
     For each TE, k-mers of sizes k = 2, 3 and 4 were used as features.
     Feature values were standardized such that the mean = 0 and standard deviation = 1.

  7. Hierarchical Classification Approaches
     Classification of TEs can be treated as a hierarchical classification problem.
     The class hierarchy can be represented by a directed acyclic graph or a tree.
     Hierarchical classification of TEs is performed using top-down strategies.
     Two recent top-down strategies for the hierarchical classification of TEs are:
     non-Leaf Local Classifier per Parent Node (nLLCPN)
     Local Classifier per Parent Node and Branch (LCPNB)

  8. non-Leaf Local Classifier per Parent Node Approach
     In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph.
     At the root, CCGCAAAAGTTGTC is classified as either 1 or 2.
     At node 2, CCGCAAAAGTTGTC is classified as either 2 itself or 2.1.
     At node 2.1, CCGCAAAAGTTGTC is classified as either 2.1 itself or 2.1.1.
     At node 2.1.1, CCGCAAAAGTTGTC is classified as 2.1.1.2.
     Nodes shown in the diagram: Root; 1, 2; 1.1, 1.4, 1.5, 2.1; 1.1.1, 1.1.2, 2.1.1; 2.1.1.1, 2.1.1.2, 2.1.1.5, 2.1.1.8

  9. Local Classifier per Parent Node and Branch Approach
     In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph, and prediction probabilities are obtained for all the classes.
     The path leading to the final classification: 2 (0.6) → 2.1 (1) → 2.1.1 (0.8) → 2.1.1.1 (0.4)
     Average = (0.6 + 1 + 0.8 + 0.4)/4 = 0.7
     [Figure: the same class hierarchy as on slide 8, with a prediction probability on each edge]

  10. Machine Learning Methods for the Prediction of Hierarchical Categories
     We applied several machine learning methods at each non-leaf node of the directed acyclic graph:
     Artificial Neural Network (ANN)
     ExtraTree Classifier (ET)
     Gradient Boosting Classifier (GBC)
     Logistic Regression (LogReg)
     Random Forest (RF)
     Support Vector Machines (SVM)

  11. Machine Learning Methods for the Prediction of Hierarchical Categories
     The state-of-the-art method implements an ANN with a single hidden layer of 200 nodes as the multi-class classifier.
     In this study, in contrast, we propose SVM-based multi-class classification.
     We implemented an SVM with an RBF kernel and optimized the cost and gamma parameters using a grid search.

  12. Performance Measures
     Hierarchical Precision: $hP = \frac{\sum_i |C_i \cap Z_i|}{\sum_i |Z_i|}$
     Hierarchical Recall: $hR = \frac{\sum_i |C_i \cap Z_i|}{\sum_i |C_i|}$
     Hierarchical F-measure: $hF = \frac{2 \cdot hP \cdot hR}{hP + hR}$
     Here, $C_i$ and $Z_i$ represent the sets of true and predicted classes for an instance $i$, respectively.
     The performance of each classifier is evaluated using a 3-fold cross-validation strategy.

  13. Results
     Table I. Comparative results of different machine learning approaches on the PGSB hierarchical datasets.
     nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

     Method          MIPS - nLLCPN                       MIPS - LCPNB
                     hP       hR       hF                hP       hR       hF
     GBC             86.75%   86.25%   0.864972486       86.11%   86.45%   0.862758219
     SVM             88.21%   86.51%   0.873518029       87.34%   86.10%   0.867151847
     ANN             82.13%   85.51%   0.837699065       82.93%   83.44%   0.831846433
     ExtraTree       76.03%   78.94%   0.774524643       84.50%   85%      0.847494297
     Random Forest   76.98%   79.55%   0.782458818       84.12%   84.69%   0.844037783
     LogReg          76%      78.89%   0.774172489       83.55%   84.21%   0.838769007

  14. Results
     Table II. Comparative results of different machine learning approaches on the Repbase hierarchical datasets.
     nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

     Method          Repbase - nLLCPN                    Repbase - LCPNB
                     hP       hR       hF                hP       hR       hF
     GBC             81.98%   84.04%   0.830022352       81.94%   84.59%   0.832277949
     SVM             85.44%   86.64%   0.860347824       85.75%   87.05%   0.863959027
     ANN             80.27%   83.32%   0.817704912       80.57%   83.26%   0.818944098
     ExtraTree       76.02%   78.93%   0.774524643       76.95%   79.99%   0.78444174
     Random Forest   76.98%   79.55%   0.782458818       77.67%   80.27%   0.789473439
     LogReg          75.99%   78.89%   0.774172489       76.12%   79.16%   0.776128202

  15. Results
     Fig. 1. Hierarchical F-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.
     [Bar chart: hF Comparison for PGSB dataset; x-axis: Machine Learning Methods; y-axis: Hierarchical f-measure]

     Method          nLLCPN        LCPNB
     LogReg          0.840740388   0.838769007
     Random Forest   0.855858795   0.844037783
     ExtraTree       0.856005417   0.847494297
     ANN             0.837699065   0.831846433
     GBC             0.864972486   0.862758219
     SVM             0.873518029   0.867151847

  16. Results
     Fig. 2. Hierarchical F-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.
     [Bar chart: hF Comparison for Repbase dataset; x-axis: Machine Learning Methods; y-axis: Hierarchical f-measure]

     Method          nLLCPN        LCPNB
     LogReg          0.774172489   0.776128202
     Random Forest   0.782458818   0.789473439
     ExtraTree       0.774524643   0.78444174
     ANN             0.817704912   0.818944098
     GBC             0.830022352   0.832277949
     SVM             0.860347824   0.863959027

  17. Conclusion and Future Work
     An advanced machine learning approach improves the prediction accuracy of hierarchical classification of TEs.
     Optimizing the cost and gamma parameters of a support vector machine (SVM) with a radial basis function (RBF) kernel leads to better hierarchical classification of transposable elements.
     We plan to improve the classification accuracy through the following approaches:
     Adding biochemistry-related features
     Implementing advanced machine learning techniques
     Implementing novel hierarchical classification approaches
