TY - JOUR
T1 - Using k-mer embeddings learned from a Skip-gram based neural network for building a cross-species DNA N6-methyladenine site prediction model
AU - Nguyen, Trinh Trung Duong
AU - Trinh, Van Ngu
AU - Le, Nguyen Quoc Khanh
AU - Ou, Yu Yen
N1 - Funding Information:
This work was partially supported by the Ministry of Science and Technology, Taiwan, R.O.C. under Grant No. MOST 109-2811-E-155-505 and No. MOST 109-2221-E-155-045.
Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Nature B.V.
PY - 2021/12
Y1 - 2021/12
N2 - Key message: This study used k-mer embeddings as effective features to identify DNA N6-methyladenine sites in plant genomes and obtained improved performance without substantial effort in feature extraction, combination and selection. Abstract: Identification of DNA N6-methyladenine sites has been a very active topic in computational biology due to the unavailability of suitable methods to identify them accurately, especially in plants. Substantial results were obtained with great effort put into extracting, heuristically searching, or fusing diverse types of features, not to mention a feature selection step. In this study, we regarded DNA sequences as textual information and employed natural language processing techniques to decipher hidden biological meanings from those sequences. In other words, we considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings were then generated from these language models for use in machine learning prediction models. Skip-gram neural networks formed the basis of the language models, and ensemble tree-based algorithms served as the machine learning algorithms for the prediction models. We trained the prediction model on the Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance, with AUC approaching an ideal value on the Rosaceae dataset (0.99), a high score on the Rice dataset (0.95) and improved performance on Rice dataset, while enjoying an elegant yet efficient feature extraction process.
AB - Key message: This study used k-mer embeddings as effective features to identify DNA N6-methyladenine sites in plant genomes and obtained improved performance without substantial effort in feature extraction, combination and selection. Abstract: Identification of DNA N6-methyladenine sites has been a very active topic in computational biology due to the unavailability of suitable methods to identify them accurately, especially in plants. Substantial results were obtained with great effort put into extracting, heuristically searching, or fusing diverse types of features, not to mention a feature selection step. In this study, we regarded DNA sequences as textual information and employed natural language processing techniques to decipher hidden biological meanings from those sequences. In other words, we considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings were then generated from these language models for use in machine learning prediction models. Skip-gram neural networks formed the basis of the language models, and ensemble tree-based algorithms served as the machine learning algorithms for the prediction models. We trained the prediction model on the Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance, with AUC approaching an ideal value on the Rosaceae dataset (0.99), a high score on the Rice dataset (0.95) and improved performance on Rice dataset, while enjoying an elegant yet efficient feature extraction process.
KW - DNA N6-methyladenine site prediction
KW - Ensemble tree-based algorithms
KW - k-mer embeddings
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85120091604&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85120091604&partnerID=8YFLogxK
U2 - 10.1007/s11103-021-01204-1
DO - 10.1007/s11103-021-01204-1
M3 - Article
C2 - 34843033
AN - SCOPUS:85120091604
SN - 0167-4412
VL - 107
SP - 533
EP - 542
JO - Plant Molecular Biology
JF - Plant Molecular Biology
IS - 6
ER -