TY - JOUR
T1 - iN6-methylat (5-step)
T2 - identifying DNA N6-methyladenine sites in rice genome using continuous bag of nucleobases via Chou’s 5-step rule
AU - Le, Nguyen Quoc Khanh
N1 - Publisher Copyright: © 2019, Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2019/1/1
Y1 - 2019/1/1
N2 - DNA N6-methyladenine is a non-canonical DNA modification that occurs at low levels in different eukaryotes, and it has been identified as serving an extremely important function in life. Moreover, about 0.2% of adenines in the rice genome are marked by DNA N6-methyladenine, a higher proportion than in most other species. Therefore, the identification of these sites has become a very important area of study, especially in biological research. Although a few computational tools have been employed to address this problem, considerable effort is still required to improve their performance. In this study, we represent DNA sequences as continuous bags of nucleobases, including sub-word information of their biological words, which then serve as features fed into a support vector machine algorithm to identify these sites. Our model, which uses this hybrid approach, identified DNA N6-methyladenine sites with a jackknife test sensitivity of 86.48%, specificity of 89.09%, accuracy of 87.78%, and MCC of 0.756. Compared to the state-of-the-art predictor as well as the other methods, our proposed model yields superior performance in all the metrics. Moreover, this study provides a basis for further research that can enrich the field of applying natural language processing techniques to biological sequences.
AB - DNA N6-methyladenine is a non-canonical DNA modification that occurs at low levels in different eukaryotes, and it has been identified as serving an extremely important function in life. Moreover, about 0.2% of adenines in the rice genome are marked by DNA N6-methyladenine, a higher proportion than in most other species. Therefore, the identification of these sites has become a very important area of study, especially in biological research. Although a few computational tools have been employed to address this problem, considerable effort is still required to improve their performance. In this study, we represent DNA sequences as continuous bags of nucleobases, including sub-word information of their biological words, which then serve as features fed into a support vector machine algorithm to identify these sites. Our model, which uses this hybrid approach, identified DNA N6-methyladenine sites with a jackknife test sensitivity of 86.48%, specificity of 89.09%, accuracy of 87.78%, and MCC of 0.756. Compared to the state-of-the-art predictor as well as the other methods, our proposed model yields superior performance in all the metrics. Moreover, this study provides a basis for further research that can enrich the field of applying natural language processing techniques to biological sequences.
KW - Continuous bag of words
KW - DNA N6-methyladenine
KW - DNA replication
KW - FastText
KW - Skip gram
KW - Support vector machine
UR - http://www.scopus.com/inward/record.url?scp=85065388389&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85065388389&partnerID=8YFLogxK
U2 - 10.1007/s00438-019-01570-y
DO - 10.1007/s00438-019-01570-y
M3 - Article
AN - SCOPUS:85065388389
SN - 1617-4615
VL - 294
SP - 1173
EP - 1182
JO - Molecular Genetics and Genomics
JF - Molecular Genetics and Genomics
IS - 5
ER -