TY - JOUR
T1 - A sequence-based approach for identifying recombination spots in Saccharomyces cerevisiae by using hyper-parameter optimization in FastText and support vector machine
AU - Do, Duyen Thi
AU - Le, Nguyen Quoc Khanh
N1 - Publisher Copyright: © 2019 Elsevier B.V.
PY - 2019/11/15
Y1 - 2019/11/15
N2 - Meiotic recombination is a biological process that plays a crucial role in genetic evolution. The ability of machine learning models to extract the desired information embedded in DNA sequences has therefore drawn a great deal of attention among biologists. Several recent attempts have been made to address this problem; however, their performance still needs to be improved. The current study aims to investigate the relationship between a natural language processing model and supervised learning in classifying DNA sequences. The idea is to process DNA sequences with a FastText model, including sub-word information, and then use the resulting representations as features in a suitable supervised learning algorithm. With this hybrid approach, we classify DNA recombination spots with a sensitivity of 90%, specificity of 94.76%, accuracy of 92.6%, and MCC of 0.851. These results suggest that our newly proposed method is superior to other methods on the same benchmark dataset. This study could therefore shed light on developing prediction models for recombination spots in particular, and for DNA sequences in general.
AB - Meiotic recombination is a biological process that plays a crucial role in genetic evolution. The ability of machine learning models to extract the desired information embedded in DNA sequences has therefore drawn a great deal of attention among biologists. Several recent attempts have been made to address this problem; however, their performance still needs to be improved. The current study aims to investigate the relationship between a natural language processing model and supervised learning in classifying DNA sequences. The idea is to process DNA sequences with a FastText model, including sub-word information, and then use the resulting representations as features in a suitable supervised learning algorithm. With this hybrid approach, we classify DNA recombination spots with a sensitivity of 90%, specificity of 94.76%, accuracy of 92.6%, and MCC of 0.851. These results suggest that our newly proposed method is superior to other methods on the same benchmark dataset. This study could therefore shed light on developing prediction models for recombination spots in particular, and for DNA sequences in general.
KW - Continuous bag of words
KW - DNA sequencing
KW - FastText
KW - Meiotic recombination
KW - Prediction model
KW - Support vector machine
UR - http://www.scopus.com/inward/record.url?scp=85073676486&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073676486&partnerID=8YFLogxK
U2 - 10.1016/j.chemolab.2019.103855
DO - 10.1016/j.chemolab.2019.103855
M3 - Article
AN - SCOPUS:85073676486
SN - 0169-7439
VL - 194
JO - Chemometrics and Intelligent Laboratory Systems
JF - Chemometrics and Intelligent Laboratory Systems
M1 - 103855
ER -