TY - JOUR
T1 - BERT-Promoter
T2 - An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection
AU - Le, Nguyen Quoc Khanh
AU - Ho, Quang Thai
AU - Nguyen, Van Nui
AU - Chang, Jung Su
N1 - Funding Information:
This work has been supported by the Ministry of Science and Technology, Taiwan [MOST 110-2221-E-038-001-MY2] and [MOST 111-2628-E-038-002-MY3].
Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/8
Y1 - 2022/8
N2 - A promoter is a DNA sequence that initiates transcription and regulates when and where genes are expressed in an organism. Because of its importance in molecular biology, identifying DNA promoters is challenging but can provide useful information about their functions and related diseases. Over the past decade, several computational models have been developed to predict promoters from high-throughput sequencing data. Although some useful predictors have been proposed, shortfalls remain in those models, and there is an urgent need to improve predictive performance to meet practical requirements. In this study, we proposed a novel architecture that incorporates transformer-based natural language processing (NLP) and explainable machine learning to address this problem. More specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was employed to encode DNA sequences, and SHapley Additive exPlanations (SHAP) analysis served as a feature selection step to identify the top-ranked BERT encodings. In the final stage, different machine learning classifiers were trained on the top features to produce the prediction outcomes. This study predicted not only DNA promoters but also their activities (strong or weak promoters). Overall, our experiments showed accuracies of 85.5% and 76.9% for these two levels, respectively. Our model outperformed previously published predictors on the same dataset in most evaluation metrics. We named our predictor BERT-Promoter; it is freely available at https://github.com/khanhlee/bert-promoter.
AB - A promoter is a DNA sequence that initiates transcription and regulates when and where genes are expressed in an organism. Because of its importance in molecular biology, identifying DNA promoters is challenging but can provide useful information about their functions and related diseases. Over the past decade, several computational models have been developed to predict promoters from high-throughput sequencing data. Although some useful predictors have been proposed, shortfalls remain in those models, and there is an urgent need to improve predictive performance to meet practical requirements. In this study, we proposed a novel architecture that incorporates transformer-based natural language processing (NLP) and explainable machine learning to address this problem. More specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was employed to encode DNA sequences, and SHapley Additive exPlanations (SHAP) analysis served as a feature selection step to identify the top-ranked BERT encodings. In the final stage, different machine learning classifiers were trained on the top features to produce the prediction outcomes. This study predicted not only DNA promoters but also their activities (strong or weak promoters). Overall, our experiments showed accuracies of 85.5% and 76.9% for these two levels, respectively. Our model outperformed previously published predictors on the same dataset in most evaluation metrics. We named our predictor BERT-Promoter; it is freely available at https://github.com/khanhlee/bert-promoter.
KW - BERT multilingual cases
KW - Contextualized word embedding
KW - Explainable artificial intelligence
KW - EXtreme Gradient Boosting
KW - Promoter region
KW - SHAP
UR - http://www.scopus.com/inward/record.url?scp=85134587901&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134587901&partnerID=8YFLogxK
U2 - 10.1016/j.compbiolchem.2022.107732
DO - 10.1016/j.compbiolchem.2022.107732
M3 - Article
C2 - 35863177
AN - SCOPUS:85134587901
SN - 1476-9271
VL - 99
JO - Computational Biology and Chemistry
JF - Computational Biology and Chemistry
M1 - 107732
ER -