TY - JOUR
T1 - Using a hybrid neural network architecture for DNA sequence representation
T2 - A study on N4-methylcytosine sites
AU - Nguyen, Van Nui
AU - Ho, Trang Thi
AU - Doan, Thu Dung
AU - Le, Nguyen Quoc Khanh
N1 - Publisher Copyright: © 2024 Elsevier Ltd
PY - 2024/8
Y1 - 2024/8
N2 - N4-methylcytosine (4mC) is a modified form of cytosine found in DNA that contributes to epigenetic regulation. It occurs in various genomes, including those of the Rosaceae family, which encompasses significant fruit crops like apples, cherries, and roses. Previous investigations have examined the distribution and functional implications of 4mC sites within the Rosaceae genome, focusing on their potential roles in gene expression regulation, environmental adaptation, and evolution. This research aims to improve the accuracy of predicting 4mC sites within the genome of Fragaria vesca, a Rosaceae plant species. Building upon the original 4mC-w2vec method, which combines word embedding processing and a convolutional neural network (CNN), we incorporated additional feature encoding techniques and leveraged pre-trained natural language processing (NLP) models with several deep learning architectures, including variants of CNN, recurrent neural networks (RNN), and long short-term memory (LSTM). Our assessments show that the best model is a CNN using fastText encoding. This model achieves a sensitivity of 0.909, a specificity of 0.77, and an accuracy of 0.879 on an independent dataset. Furthermore, it outperforms previously published methods evaluated on the same dataset.
AB - N4-methylcytosine (4mC) is a modified form of cytosine found in DNA that contributes to epigenetic regulation. It occurs in various genomes, including those of the Rosaceae family, which encompasses significant fruit crops like apples, cherries, and roses. Previous investigations have examined the distribution and functional implications of 4mC sites within the Rosaceae genome, focusing on their potential roles in gene expression regulation, environmental adaptation, and evolution. This research aims to improve the accuracy of predicting 4mC sites within the genome of Fragaria vesca, a Rosaceae plant species. Building upon the original 4mC-w2vec method, which combines word embedding processing and a convolutional neural network (CNN), we incorporated additional feature encoding techniques and leveraged pre-trained natural language processing (NLP) models with several deep learning architectures, including variants of CNN, recurrent neural networks (RNN), and long short-term memory (LSTM). Our assessments show that the best model is a CNN using fastText encoding. This model achieves a sensitivity of 0.909, a specificity of 0.77, and an accuracy of 0.879 on an independent dataset. Furthermore, it outperforms previously published methods evaluated on the same dataset.
KW - Deep learning
KW - DNA N4-methylcytosine
KW - Natural language processing
KW - Rosaceae genome
KW - Sequence analysis
KW - Word embedding
UR - http://www.scopus.com/inward/record.url?scp=85195814917&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85195814917&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2024.108664
DO - 10.1016/j.compbiomed.2024.108664
M3 - Article
C2 - 38875905
AN - SCOPUS:85195814917
SN - 0010-4825
VL - 178
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 108664
ER -