Machine Learning Approach for Hierarchical Classification of Transposable Elements
This study presents a machine learning approach for the hierarchical classification of transposable elements (TEs) based on pre-annotated DNA sequences. The research includes data collection, feature extraction using k-mers, and classification approaches. Proper categorization of TEs is crucial for understanding their impact on genetic evolution. Various machine learning methods are applied to predict hierarchical categories of TEs, offering insights into their functional roles in genomes.
Presentation Transcript
Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Approach
Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah
Computer Science and Biological Sciences Departments, University of New Orleans
Presentation Overview
- Introduction
- Data Collection
- Feature Extraction
- Hierarchical Classification Approaches
- Machine Learning Methods for the Prediction of Hierarchical Categories
- Results
- Conclusion
12/15/2024
Transposable Elements
- Transposable elements (TEs), or jumping genes, are DNA sequences with an intrinsic capability to move from one genomic location to another within a host genome.
- The new location can be on the same chromosome or on a different one.
- TEs were first discovered by Barbara McClintock (a.k.a. the maize scientist) in 1948.
- TEs play an important role in modifying the functionality of genes. E.g., insertion of L1-type TEs in tumor suppressor genes could lead to cancer.
- Hence, proper classification of the TEs identified in a genome is important for understanding their particular role in germline and somatic evolution.
Illustration of the TE Taxonomy Proposed by Wicker et al.
[Figure: taxonomy tree. Node labels such as 1, 1.1 and 1.1.2 mark positions in the hierarchy.]
Root
  Class I (Retrotransposons)
    LTR: Copia, Gypsy, Bel-Pao, Retrovirus, ERV
    DIRS: DIRS, Ngaro, VIPER
    PLE: Penelope
    LINE: R2, RTE, Jockey, L1, I
    SINE: tRNA, 7SL, 5S
  Class II (DNA Transposons)
    Subclass 1
      TIR: Tc1-Mariner, hAT, Mutator, Merlin, Transib, P, PiggyBac, PIF-Harbinger, CACTA
      Crypton: Crypton
    Subclass 2
      Helitron: Helitron
      Maverick: Maverick-Polinton
Data Collection
- For our study, we collected pre-annotated DNA sequences of TEs.
- The hierarchical annotations of the TEs follow Wicker's taxonomy.
- The repetitive DNA sequences were obtained from two public repositories: Repbase and PGSB.
- The Repbase repository contains TEs from different eukaryotic species.
- PGSB is a compilation of plant repetitive sequences from different databases: TREP, TIGR repeats, PlantSat, and GenBank.

Repository   FASTA sequences
PGSB         18680
Repbase      34561
Feature Extraction
- Each TE in a dataset is represented by a set of k-mer features, obtained by counting the frequency of every substring of length k in its sequence.
- E.g., for k = 2, the counts of all 16 dinucleotides (AA, AT, AG, AC, ..., CC) in the sequence are extracted.
- Example sequence: CCGCAAAAGTTGTC (counts use overlapping windows)
  For k = 2: AA = 3, CC = 1, TT = 1
  For k = 3: CCG = 1, CAA = 1, AAG = 1
  For k = 4: CCGC = 1, AAAA = 1, GTTG = 1
- For each TE, k-mers with k sizes of 2, 3 and 4 were used as features.
- Feature values were standardized such that the mean = 0 and the standard deviation = 1.
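The k-mer counting and standardization steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`kmer_counts`, `feature_vector`, `standardize`) are hypothetical. It assumes overlapping windows, the alphabet A/C/G/T, and the population standard deviation.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

ALPHABET = "ACGT"  # assumption: k-mers containing other symbols (e.g. N) are ignored


def kmer_counts(seq, k):
    """Count overlapping k-mers and return a count for each of the 4**k possible k-mers."""
    seq = seq.upper()
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


def feature_vector(seq, ks=(2, 3, 4)):
    """Concatenate the k-mer counts for every k in ks (16 + 64 + 256 = 336 features)."""
    vec = []
    for k in ks:
        vec.extend(kmer_counts(seq, k).values())
    return vec


def standardize(rows):
    """Column-wise z-score: each feature gets mean 0 and standard deviation 1.

    Columns with zero variance are mapped to 0.0 to avoid dividing by zero.
    """
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) for c in cols]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


example = "CCGCAAAAGTTGTC"
two_mers = kmer_counts(example, 2)
features = feature_vector(example)
```

Standardization has to be fitted on the training sequences only and then reused for the test sequences. The sketch omits that split for brevity.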
Hierarchical Classification Approaches
- Classification of TEs can be treated as a hierarchical classification problem.
- The class hierarchy can be represented by a tree or a directed acyclic graph.
- Hierarchical classification of TEs is performed using top-down strategies.
- Two recent top-down strategies for the hierarchical classification of TEs are:
  - non-Leaf Local Classifier per Parent Node (nLLCPN)
  - Local Classifier per Parent Node and Branch (LCPNB)
non-Leaf Local Classifier per Parent Node Approach
In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph.
[Figure: example tree (Root; classes 1 and 2; descendants such as 1.1, 1.4, 1.5, 1.1.1, 1.1.2, 2.1, 2.1.1, 2.1.1.1, 2.1.1.2, 2.1.1.5, 2.1.1.8) tracing the sequence CCGCAAAAGTTGTC from the root downwards.]
1. At Root, the sequence is classified as either 1 or 2.
2. At node 2, it is classified as either 2 itself or 2.1.
3. At node 2.1, it is classified as either 2.1 itself or 2.1.1.
4. At node 2.1.1, it is classified as 2.1.1.2.
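The top-down walk of nLLCPN can be sketched as a loop over per-node classifiers. This is only an illustration under stated assumptions: `predict_nllcpn` is a hypothetical name, and the trained classifiers are replaced by stub functions that return fixed answers. It assumes each classifier returns either a child of its node or the node itself, which means "stop here".

```python
def predict_nllcpn(seq, classifiers, root="Root"):
    """Walk the hierarchy top-down and return the list of predicted classes.

    `classifiers` maps a non-leaf node to a callable that takes the sequence
    and returns either one of the node's children or the node itself (stop).
    A node without an entry in `classifiers` is treated as a leaf.
    """
    node = root
    path = []
    while node in classifiers:
        predicted = classifiers[node](seq)
        if predicted == node:  # the classifier chose "itself": stop at this level
            break
        path.append(predicted)
        node = predicted
    return path


# Stub classifiers that reproduce the walk described above (hypothetical outputs).
stub_classifiers = {
    "Root": lambda seq: "2",
    "2": lambda seq: "2.1",
    "2.1": lambda seq: "2.1.1",
    "2.1.1": lambda seq: "2.1.1.2",
}

path = predict_nllcpn("CCGCAAAAGTTGTC", stub_classifiers)
```

In a real setting each stub would be a multi-class model trained only on the sequences that belong to its node's subtree.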
Local Classifier per Parent Node and Branch Approach
In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph, and prediction probabilities are obtained for all the classes.
[Figure: the example tree with a prediction probability on every edge.]
The path leading to the final classification: 2 (0.6) -> 2.1 (1) -> 2.1.1 (0.8) -> 2.1.1.1 (0.4)
Average = (0.6 + 1 + 0.8 + 0.4) / 4 = 0.7
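The branch-averaging step can be sketched as follows. This assumes the final classification is the root-to-leaf path with the highest average probability. `best_branch` is a hypothetical name. The edge probabilities are illustrative: only those on the path 2 -> 2.1 -> 2.1.1 -> 2.1.1.1 come from the slide, and the rest are made up.

```python
def best_branch(edge_probs, root="Root"):
    """Return (path, score) for the root-to-leaf path with the highest mean probability.

    `edge_probs` maps each non-leaf node to {child: probability}, i.e. the
    output of the local classifier at that node. A node that is missing from
    `edge_probs` is treated as a leaf.
    """
    best_path, best_score = [], float("-inf")
    stack = [(root, [], [])]  # (current node, classes so far, probabilities so far)
    while stack:
        node, path, probs = stack.pop()
        children = edge_probs.get(node)
        if not children:  # leaf reached: score the whole branch
            if probs:
                score = sum(probs) / len(probs)
                if score > best_score:
                    best_path, best_score = path, score
            continue
        for child, p in children.items():
            stack.append((child, path + [child], probs + [p]))
    return best_path, best_score


# Illustrative probabilities (hypothetical except along the slide's example path).
edge_probs = {
    "Root": {"1": 0.4, "2": 0.6},
    "1": {"1.1": 0.5, "1.4": 0.5},
    "2": {"2.1": 1.0},
    "2.1": {"2.1.1": 0.8, "2.1.2": 0.2},
    "2.1.1": {"2.1.1.1": 0.4, "2.1.1.2": 0.3, "2.1.1.5": 0.3},
}

path, score = best_branch(edge_probs)
```

Averaging means a branch is not discarded because of one low-probability step, unlike a greedy walk that commits to the most probable child at each level.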
Machine Learning Methods for the Prediction of Hierarchical Categories
We applied several machine learning methods at each non-leaf node of the directed acyclic graph:
- Artificial Neural Network (ANN)
- ExtraTree Classifier (ET)
- Gradient Boosting Classifier (GBC)
- Logistic Regression (LogReg)
- Random Forest (RF)
- Support Vector Machines (SVM)
Machine Learning Methods for the Prediction of Hierarchical Categories
- The state-of-the-art method uses an ANN with a single hidden layer of 200 nodes as the multi-class classifier.
- In this study, we instead propose SVM-based multi-class classification.
- We implemented an SVM with an RBF kernel and optimized the cost and gamma parameters using a grid search for optimal performance.
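A grid search over the cost (C) and gamma parameters of an RBF-kernel SVM could look like the sketch below, written with scikit-learn. The slides do not name a library, so the library choice, the grid values, and the synthetic two-feature data are all assumptions made for illustration. In practice `X` would hold the standardized k-mer features of the sequences at one non-leaf node and `y` their child classes.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: three well-separated clusters, 30 samples each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(30, 2)) for c in centers])
y = np.repeat(["1", "2.1", "2.2"], 30)

# Standardize inside the pipeline so each fold is scaled using its training part only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grid; the values searched in the study are not given in the slides.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

# cv=3 mirrors the 3-fold cross-validation used for evaluation.
search = GridSearchCV(model, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

For LCPNB, which needs class probabilities, the same pipeline could be built with `SVC(kernel="rbf", probability=True)`.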
Performance Measures
Hierarchical Precision: $hP = \frac{\sum_i |Z_i \cap C_i|}{\sum_i |Z_i|}$
Hierarchical Recall: $hR = \frac{\sum_i |Z_i \cap C_i|}{\sum_i |C_i|}$
Hierarchical F-measure: $hF = \frac{2 \cdot hP \cdot hR}{hP + hR}$
Here, $C_i$ and $Z_i$ represent the sets of true and predicted classes for an instance $i$, respectively.
The performance of each classifier is evaluated using a 3-fold cross-validation strategy.
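The hierarchical precision, recall and f-measure can be computed as sketched below. The sketch assumes each class is written as a dotted path such as 2.1.1.2, and that the class set of an instance contains the class together with all of its ancestors (2, 2.1, 2.1.1, 2.1.1.2). That is the usual convention for these measures but is not stated on the slide. The function names are hypothetical.

```python
def with_ancestors(label):
    """Expand a dotted label into the set holding it and all of its ancestors."""
    parts = label.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_prf(true_labels, pred_labels):
    """Return (hP, hR, hF) summed over all instances.

    hP = sum_i |Z_i & C_i| / sum_i |Z_i|
    hR = sum_i |Z_i & C_i| / sum_i |C_i|
    hF = 2 * hP * hR / (hP + hR)
    where C_i is the true class set and Z_i the predicted class set of instance i.
    """
    overlap = pred_total = true_total = 0
    for true, pred in zip(true_labels, pred_labels):
        c_i, z_i = with_ancestors(true), with_ancestors(pred)
        overlap += len(c_i & z_i)
        pred_total += len(z_i)
        true_total += len(c_i)
    hp = overlap / pred_total
    hr = overlap / true_total
    hf = 2 * hp * hr / (hp + hr) if hp + hr > 0 else 0.0
    return hp, hr, hf


# A prediction that stops one level early keeps full precision but loses recall.
hp, hr, hf = hierarchical_prf(["2.1.1.2", "1.1"], ["2.1", "1.1"])
```

Unlike flat accuracy, these measures give partial credit when a prediction is correct at the upper levels of the hierarchy and wrong only near the leaves.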
Results
Table I. Comparative results of different machine learning approaches on the PGSB hierarchical datasets. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

                MIPS - nLLCPN                    MIPS - LCPNB
Method          hP       hR       hF             hP       hR       hF
GBC             86.75%   86.25%   0.864972486    86.11%   86.45%   0.862758219
SVM             88.21%   86.51%   0.873518029    87.34%   86.10%   0.867151847
ANN             82.13%   85.51%   0.837699065    82.93%   83.44%   0.831846433
ExtraTree       76.03%   78.94%   0.774524643    84.50%   85%      0.847494297
Random Forest   76.98%   79.55%   0.782458818    84.12%   84.69%   0.844037783
LogReg          76%      78.89%   0.774172489    83.55%   84.21%   0.838769007
Results
Table II. Comparative results of different machine learning approaches on the Repbase hierarchical datasets. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

                Repbase - nLLCPN                 Repbase - LCPNB
Method          hP       hR       hF             hP       hR       hF
GBC             81.98%   84.04%   0.830022352    81.94%   84.59%   0.832277949
SVM             85.44%   86.64%   0.860347824    85.75%   87.05%   0.863959027
ANN             80.27%   83.32%   0.817704912    80.57%   83.26%   0.818944098
ExtraTree       76.02%   78.93%   0.774524643    76.95%   79.99%   0.78444174
Random Forest   76.98%   79.55%   0.782458818    77.67%   80.27%   0.789473439
LogReg          75.99%   78.89%   0.774172489    76.12%   79.16%   0.776128202
Results
Fig. 1. Hierarchical f-measure (hF) comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.
[Bar chart: "hF Comparison for PGSB dataset"; x-axis: Machine Learning Methods; y-axis: Hierarchical f-measure (0.81 to 0.88); series: nLLCPN, LCPNB.]

Method          nLLCPN        LCPNB
LogReg          0.840740388   0.838769007
Random Forest   0.855858795   0.844037783
ExtraTree       0.856005417   0.847494297
ANN             0.837699065   0.831846433
GBC             0.864972486   0.862758219
SVM             0.873518029   0.867151847
Results
Fig. 2. Hierarchical f-measure (hF) comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.
[Bar chart: "hF Comparison for Repbase dataset"; x-axis: Machine Learning Methods; y-axis: Hierarchical f-measure (0.72 to 0.88); series: nLLCPN, LCPNB.]

Method          nLLCPN        LCPNB
LogReg          0.774172489   0.776128202
Random Forest   0.782458818   0.789473439
ExtraTree       0.774524643   0.78444174
ANN             0.817704912   0.818944098
GBC             0.830022352   0.832277949
SVM             0.860347824   0.863959027
Conclusion and Future Work
- An advanced machine learning approach improves the prediction accuracy of hierarchical classification of TEs.
- Optimizing the cost and gamma parameters of a support vector machine (SVM) with a radial basis function (RBF) kernel leads to better hierarchical classification of transposable elements.
- We plan to improve the classification accuracy through the following approaches:
  - Adding biochemical features
  - Implementing advanced machine learning techniques
  - Implementing novel hierarchical classification approaches