Machine Learning Approach for Hierarchical Classification of Transposable Elements

12/15/2024



Presentation Overview

Introduction
Data Collection
Feature Extraction
Hierarchical Classification Approaches
Machine Learning Methods for the Prediction of Hierarchical Categories
Results
Conclusion



Transposable Elements

Transposable elements (TEs), or jumping genes, are DNA sequences that have an intrinsic capability to move within a host genome from one genomic location to another.

The new genomic location can be on either the same or a different chromosome.

TEs were first discovered in maize by Barbara McClintock in 1948.

TEs play an important role in modifying the functionality of genes. For example, insertion of L1-type TEs in tumor suppressor genes could lead to cancer.

Hence, proper classification of the TEs identified in a genome is important for understanding their particular role in germline and somatic evolution.

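[Figure: illustration of the TE taxonomy proposed by Wicker et al.; the figure labels example nodes with hierarchical codes such as 1, 1.1 and 1.1.2.]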



Data Collection

For our study, we collected pre-annotated DNA sequences of TEs.

The hierarchical annotation of the TEs follows Wicker's taxonomy.

The repetitive DNA sequences were obtained from two public repositories:
  Repbase
  PGSB

The Repbase repository contains TEs from different eukaryotic species.

PGSB is a compilation of plant repetitive sequences from different databases:
  TREP
  TIGR repeats
  PlantSat
  GenBank



Feature Extraction

Each TE in a dataset is represented by a set of k-mer features, obtained by counting the frequency of each substring of length k in its sequence.

E.g. for k=2, the frequencies of all 16 dinucleotides (AA, AT, AG, AC, ..., CC) in the sequence are extracted.

Example counts:
  For k=2: AA = 2, CC = 2, TT = 2
  For k=3: CCG = 1, CAA = 1, AAG = 1
  For k=4: CCGC = 1, AAAA = 1, GTTG = 1

For each TE, k-mers with k sizes of 2, 3 and 4 were used as features.

Feature values were standardized such that the mean = 0 and the standard deviation = 1.
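The feature extraction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes overlapping (stride-1) windows, an alphabet restricted to A, C, G and T, and population standard deviation for scaling. The function names `kmer_counts`, `feature_vector` and `standardize` are hypothetical, and the second and third sequences are made up for the demo. The slides do not say whether windows overlap, so counts produced here need not match the illustrative values on the slide.

```python
from itertools import product
from statistics import mean, pstdev

ALPHABET = "ACGT"

def kmer_counts(sequence, k):
    """Count every length-k substring using overlapping (stride-1) windows."""
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing non-ACGT symbols such as N
            counts[kmer] += 1
    return counts

def feature_vector(sequence, ks=(2, 3, 4)):
    """Concatenate k-mer counts for k = 2, 3 and 4 (16 + 64 + 256 = 336 features)."""
    vector = []
    for k in ks:
        counts = kmer_counts(sequence, k)
        vector.extend(counts[kmer] for kmer in sorted(counts))
    return vector

def standardize(rows):
    """Scale each feature column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant columns map to 0
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# The first sequence is the fragment shown in the slides; the others are made up.
sequences = ["CCGCAAAAGTTGTC", "ACGTACGTACGTAC", "TTTTGGGGCCCCAA"]
features = standardize([feature_vector(s) for s in sequences])
```

Each row of `features` is one TE and each column one k-mer. In practice the scaling statistics would be estimated on the training folds only and then reused on the held-out fold.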



Hierarchical Classification Approaches

Classification of TEs can be treated as a hierarchical classification problem.

The class hierarchy can be represented by a directed acyclic graph or a tree.

Hierarchical classification of TEs is performed using top-down strategies.

Two recent top-down strategies for the hierarchical classification of TEs are:
  non-Leaf Local Classifier per Parent Node (nLLCPN)
  Local Classifier per Parent Node and Branch (LCPNB)
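To make the tree representation concrete, here is a small sketch that builds a class tree from dotted class codes. The codes are the ones visible in the example tree of the slides; the helper names `parent`, `build_tree` and `ancestors` are hypothetical, and the sketch covers only the tree case (one parent per node), not a general directed acyclic graph.

```python
def parent(code):
    """Return the parent of a dotted class code, or 'root' for a top-level class."""
    return code.rsplit(".", 1)[0] if "." in code else "root"

def build_tree(codes):
    """Map every node (including 'root') to the sorted list of its children."""
    children = {"root": []}
    for code in codes:
        children.setdefault(code, [])
    for code in codes:
        children.setdefault(parent(code), []).append(code)
    return {node: sorted(kids) for node, kids in children.items()}

def ancestors(code):
    """Return the path from the top-level class down to `code`, inclusive."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

codes = ["1", "2", "1.1", "1.4", "1.5", "2.1", "1.1.1", "1.1.2", "2.1.1",
         "2.1.1.1", "2.1.1.2", "2.1.1.5", "2.1.1.8"]
tree = build_tree(codes)

# Top-down strategies train one local classifier at each of these nodes.
non_leaf = [node for node, kids in tree.items() if kids]
```

Encoding the hierarchy in the class code itself keeps the parent lookup trivial, which is convenient when one local classifier has to be trained per non-leaf node.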



non-Leaf Local Classifier per Parent Node Approach

In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph. For the example sequence …CCGCAAAAGTTGTC…:

At the root, the sequence is classified as either 1 or 2.
At node 2, it is classified as either node 2 itself or 2.1.
At node 2.1, it is classified as either node 2.1 itself or 2.1.1.
At node 2.1.1, it is classified as 2.1.1.2.
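The top-down walk described for nLLCPN can be sketched as a short loop. This is an illustration under assumptions rather than the authors' implementation: each local classifier is modelled as a plain callable that returns either one of the node's children or the node's own label (meaning "stop here"), and the lambdas below are hypothetical stand-ins for trained models.

```python
def predict_nllcpn(x, classifiers, root="root"):
    """Walk down from the root; stop at a leaf or when a classifier picks its own node."""
    node = root
    while node in classifiers:          # only non-leaf nodes have a classifier
        choice = classifiers[node](x)   # one of the node's children, or the node itself
        if choice == node:
            break
        node = choice
    return node

# Hypothetical stand-ins for trained multi-class classifiers, one per non-leaf node.
classifiers = {
    "root": lambda x: "2",
    "2": lambda x: "2.1",
    "2.1": lambda x: "2.1.1",
    "2.1.1": lambda x: "2.1.1.2",
}

print(predict_nllcpn("CCGCAAAAGTTGTC", classifiers))  # prints 2.1.1.2
```

Letting a classifier return its own node is what allows a prediction to end at an internal class when no child fits well.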



Local Classifier per Parent Node and Branch Approach

In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph and prediction probabilities are obtained for all the classes.

The path leading to the final classification:

2 (0.6) → 2.1 (1) → 2.1.1 (0.8) → 2.1.1.1 (0.4)

Average = (0.6 + 1 + 0.8 + 0.4)/4 = 0.7
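The branch averaging used by LCPNB can be sketched as follows. Two assumptions are made here that the slides do not confirm: only root-to-leaf branches are compared, and the branch with the highest mean probability wins. The probabilities on the path 2, 2.1, 2.1.1, 2.1.1.1 are taken from the slide; every other probability in `prob` is hypothetical, and `branch_scores` is an illustrative name.

```python
def branch_scores(tree, prob, root="root"):
    """Return {leaf: mean probability along the root-to-leaf path} for every leaf."""
    scores = {}
    stack = [(root, [])]
    while stack:
        node, path_probs = stack.pop()
        kids = tree.get(node, [])
        if not kids:  # reached a leaf: average the probabilities collected on the way
            scores[node] = sum(path_probs) / len(path_probs)
            continue
        for kid in kids:
            stack.append((kid, path_probs + [prob[kid]]))
    return scores

tree = {
    "root": ["1", "2"],
    "1": ["1.1", "1.4", "1.5"],
    "1.1": ["1.1.1", "1.1.2"],
    "2": ["2.1"],
    "2.1": ["2.1.1"],
    "2.1.1": ["2.1.1.1", "2.1.1.2", "2.1.1.5", "2.1.1.8"],
}

# Path values 0.6, 1, 0.8, 0.4 come from the slide; the rest are hypothetical.
prob = {
    "1": 0.4, "2": 0.6,
    "1.1": 0.6, "1.4": 0.2, "1.5": 0.2,
    "1.1.1": 0.6, "1.1.2": 0.4,
    "2.1": 1.0, "2.1.1": 0.8,
    "2.1.1.1": 0.4, "2.1.1.2": 0.2, "2.1.1.5": 0.2, "2.1.1.8": 0.2,
}

scores = branch_scores(tree, prob)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # prints 2.1.1.1 0.7
```

Compared with a greedy walk, scoring whole branches lets a confident decision deeper in the tree compensate for an uncertain one near the root.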



Machine Learning Methods for the Prediction of Hierarchical Categories

We applied several machine learning methods at each non-leaf node of the directed acyclic graph:
  Artificial Neural Network (ANN)
  ExtraTree Classifier (ET)
  Gradient Boosting Classifier (GBC)
  Logistic Regression (LogReg)
  Random Forest (RF)
  Support Vector Machines (SVM)



The state-of-the-art method implements an ANN with a single hidden layer of 200 nodes as the multi-class classifier.

In contrast, in this study we propose SVM-based multi-class classification.

We implemented an SVM with an RBF kernel and optimized its cost and gamma parameters using a grid search.
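The grid search over cost and gamma can be sketched with the standard library alone. Everything specific here is an assumption: the slides do not give the searched ranges, so the exponential grids are hypothetical, and `toy_score` is a placeholder for what would in practice be a cross-validated score of an RBF-kernel SVM (for example scikit-learn's `SVC(kernel="rbf", C=..., gamma=...)`).

```python
from itertools import product
from math import log2

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical exponential grids; the slides do not state the ranges that were searched.
param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 2)],
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],
}

def toy_score(C, gamma):
    """Placeholder for a cross-validated score; peaks at C = 8, gamma = 0.125."""
    return -((log2(C) - 3) ** 2 + (log2(gamma) + 3) ** 2)

best_params, best_score = grid_search(param_grid, toy_score)
```

Searching on a logarithmic grid is a common choice for these two parameters because useful values can span several orders of magnitude.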

Performance Measures

Hierarchical precision (hP), hierarchical recall (hR) and hierarchical f-measure (hF) are computed from Ci and Zi, the sets of true and predicted classes for an instance i, respectively.

The performance of each classifier is evaluated using a 3-fold cross-validation strategy.
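A 3-fold split can be sketched as below. This is a simplified illustration: it uses contiguous folds without shuffling or stratification, which is an assumption, since the slides do not describe how the folds were drawn. The name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Yield (train_indices, test_indices) pairs for contiguous, near-equal folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, n_folds=3))
```

Every instance lands in exactly one test fold, so each TE is predicted once by a model that never saw it during training.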

Results

Table I. Comparative results of different machine learning approaches on the PGSB hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

Table II. Comparative results of different machine learning approaches on the Repbase hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

Fig. 1. Hierarchical f-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.

Fig. 2. Hierarchical f-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.



Conclusion and Future Work

Advanced machine learning approaches improve the prediction accuracy of hierarchical classification of TEs.

Optimizing the cost and gamma parameters of a support vector machine (SVM) with a radial basis function (RBF) kernel leads to better hierarchical classification of transposable elements.

We plan to improve the classification accuracy through the following approaches:
  Addition of biochemistry-related features
  Implementing advanced machine learning techniques
  Implementing novel hierarchical classification approaches


This study presents a machine learning approach for the hierarchical classification of transposable elements (TEs) based on pre-annotated DNA sequences. The research includes data collection, feature extraction using k-mers, and classification approaches. Proper categorization of TEs is crucial for understanding their impact on genetic evolution. Various machine learning methods are applied to predict hierarchical categories of TEs, offering insights into their functional roles in genomes.

Uploaded by jenesisp on Dec 15, 2024


Presentation Transcript

Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Approach

Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah

Computer Science and Biological Sciences Department, University of New Orleans


Illustration of the TE Taxonomy Proposed by Wicker et al.

Root
  Class I (Retrotransposons)
    LTR: Copia, Gypsy, Bel-pao, Retrovirus, ERV
    DIRS: DIRS, Ngaro, VIPER
    PLE: Penelope
    LINE: R2, RTE, Jockey, L1, I
    SINE: tRNA, 7SL, 5S
  Class II (DNA Transposons)
    Subclass 1
      TIR: Tc1-Mariner, hAT, Mutator, Merlin, Transib, P, PiggyBac, PIF-Harbinger, CACTA
      Crypton: Crypton
    Subclass 2
      Helitron: Helitron
      Maverick: Maverick-Polinton

The figure also labels example nodes with the hierarchical codes 1, 1.1 and 1.1.2.

Data Collection

Number of FASTA sequences collected from each repository:
  PGSB: 18680
  Repbase: 34561

Feature Extraction

Example sequence shown on the slide: C C G C A A A A G T T G T C
  For k=2: AA = 2, CC = 2, TT = 2
  For k=3: CCG = 1, CAA = 1, AAG = 1
  For k=4: CCGC = 1, AAAA = 1, GTTG = 1


non-Leaf Local Classifier per Parent Node Approach

[Figure: example class tree used to illustrate nLLCPN]
Root
  1
    1.1
      1.1.1
      1.1.2
    1.4
    1.5
  2
    2.1
      2.1.1
        2.1.1.1
        2.1.1.2
        2.1.1.5
        2.1.1.8

The sequence CCGCAAAAGTTGTC is classified as 1 or 2 at the root, then as node 2 itself or 2.1, then as node 2.1 itself or 2.1.1, and finally as 2.1.1.2.

Local Classifier per Parent Node and Branch Approach

[Figure: example class tree with a prediction probability attached to each node]

The path leading to the final classification: 2 (0.6) → 2.1 (1) → 2.1.1 (0.8) → 2.1.1.1 (0.4)

Average = (0.6 + 1 + 0.8 + 0.4)/4 = 0.7


Performance Measures

Hierarchical Precision: $hP = \frac{\sum_i |C_i \cap Z_i|}{\sum_i |Z_i|}$

Hierarchical Recall: $hR = \frac{\sum_i |C_i \cap Z_i|}{\sum_i |C_i|}$

Hierarchical F-measure: $hF = \frac{2 \cdot hP \cdot hR}{hP + hR}$

Here, $C_i$ and $Z_i$ represent the sets of true and predicted classes for an instance $i$, respectively.

The performance of each classifier is evaluated using a 3-fold cross-validation strategy.
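The three measures can be computed directly from per-instance class sets, as in the sketch below. It assumes the usual convention for hierarchical measures, in which each set contains a class together with all of its ancestors; the slides only say "set of true and predicted classes", so this is an assumption. The two example instances are made up, and `hierarchical_scores` and `with_ancestors` are hypothetical names.

```python
def with_ancestors(code):
    """Expand a dotted class code into the set of the class and all its ancestors."""
    parts = code.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}

def hierarchical_scores(true_sets, pred_sets):
    """Return (hP, hR, hF) summed over all instances."""
    overlap = sum(len(c & z) for c, z in zip(true_sets, pred_sets))
    n_pred = sum(len(z) for z in pred_sets)
    n_true = sum(len(c) for c in true_sets)
    hp = overlap / n_pred if n_pred else 0.0
    hr = overlap / n_true if n_true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf

# Two made-up instances: a wrong leaf under the right parent, and an over-deep prediction.
true_sets = [with_ancestors("2.1.1.2"), with_ancestors("1.1")]
pred_sets = [with_ancestors("2.1.1.1"), with_ancestors("1.1.2")]

hp, hr, hf = hierarchical_scores(true_sets, pred_sets)
```

With these inputs the overlap is 5 classes out of 7 predicted and 6 true, so hP = 5/7, hR = 5/6 and hF = 10/13. A prediction that is wrong only at the deepest level still earns partial credit, which is the point of using hierarchical rather than flat measures.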

Results

Table I. Comparative results of different machine learning approaches on the PGSB hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

                 MIPS - nLLCPN                      MIPS - LCPNB
Method           hP       hR       hF               hP       hR       hF
GBC              86.75%   86.25%   0.864972486      86.11%   86.45%   0.862758219
SVM              88.21%   86.51%   0.873518029      87.34%   86.10%   0.867151847
ANN              82.13%   85.51%   0.837699065      82.93%   83.44%   0.831846433
ExtraTree        76.03%   78.94%   0.774524643      84.50%   85%      0.847494297
Random Forest    76.98%   79.55%   0.782458818      84.12%   84.69%   0.844037783
LogReg           76%      78.89%   0.774172489      83.55%   84.21%   0.838769007
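As a quick consistency check on how hF relates to hP and hR, the SVM nLLCPN values reported for the PGSB dataset (hP = 88.21%, hR = 86.51%) can be plugged into the f-measure formula. The rounding of the inputs to two decimals is the only assumption here.

```latex
hF = \frac{2 \cdot hP \cdot hR}{hP + hR}
   = \frac{2 \cdot 0.8821 \cdot 0.8651}{0.8821 + 0.8651}
   = \frac{1.52621}{1.7472}
   \approx 0.8735
```

This agrees with the reported hF of 0.873518029 to four decimal places.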

Table II. Comparative results of different machine learning approaches on the Repbase hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

                 Repbase - nLLCPN                   Repbase - LCPNB
Method           hP       hR       hF               hP       hR       hF
GBC              81.98%   84.04%   0.830022352      81.94%   84.59%   0.832277949
SVM              85.44%   86.64%   0.860347824      85.75%   87.05%   0.863959027
ANN              80.27%   83.32%   0.817704912      80.57%   83.26%   0.818944098
ExtraTree        76.02%   78.93%   0.774524643      76.95%   79.99%   0.78444174
Random Forest    76.98%   79.55%   0.782458818      77.67%   80.27%   0.789473439
LogReg           75.99%   78.89%   0.774172489      76.12%   79.16%   0.776128202

Fig. 1. Hierarchical f-measure (hF) comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.

[Bar chart: hierarchical f-measure by machine learning method, with the values below]
Method           nLLCPN         LCPNB
LogReg           0.840740388    0.838769007
Random Forest    0.855858795    0.844037783
ExtraTree        0.856005417    0.847494297
ANN              0.837699065    0.831846433
GBC              0.864972486    0.862758219
SVM              0.873518029    0.867151847

Fig. 2. Hierarchical f-measure (hF) comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.

[Bar chart: hierarchical f-measure by machine learning method, with the values below]
Method           nLLCPN         LCPNB
LogReg           0.774172489    0.776128202
Random Forest    0.782458818    0.789473439
ExtraTree        0.774524643    0.78444174
ANN              0.817704912    0.818944098
GBC              0.830022352    0.832277949
SVM              0.860347824    0.863959027

