TY - JOUR
T1 - A sequence-based prediction of Kruppel-like factors proteins using XGBoost and optimized features
AU - Le, Nguyen Quoc Khanh
AU - Do, Duyen Thi
AU - Nguyen, Trinh Trung Duong
AU - Le, Quynh Anh
N1 - Funding Information:
This work has been supported by the Research Grant for Newly Hired Faculty, Taipei Medical University (TMU), Taiwan [grant number: TMU108-AE1-B26] and Higher Education Sprout Project, Ministry of Education (MOE), Taiwan [grant number: DP2-110-21121-01-A-06].
Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021/6/30
Y1 - 2021/6/30
AB - Krüppel-like factors (KLF) are a group of conserved zinc finger-containing transcription factors that are involved in various physiological and biological processes, including cell proliferation, differentiation, development, and apoptosis. Bioinformatics methods such as sequence similarity searches, multiple sequence alignment, phylogenetic reconstruction, and gene synteny analysis have been proposed to broaden our knowledge of KLF proteins. In this study, we propose a novel computational approach that applies machine learning to features calculated from primary sequences. Specifically, our XGBoost-based model identifies KLF proteins with an accuracy of 96.4% and an MCC of 0.704. It also shows promising performance when tested on an independent dataset. Therefore, our model could serve as a useful tool to identify new KLF proteins and provide necessary information for biologists and researchers studying KLF proteins. Our machine learning source code and datasets are freely available at https://github.com/khanhlee/KLF-XGB.
KW - eXtreme Gradient Boosting
KW - Feature selection
KW - Kruppel-like factor
KW - Protein sequence
KW - SMOTE imbalance
KW - Zinc finger
UR - http://www.scopus.com/inward/record.url?scp=85104343442&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104343442&partnerID=8YFLogxK
U2 - 10.1016/j.gene.2021.145643
DO - 10.1016/j.gene.2021.145643
M3 - Article
C2 - 33848577
AN - SCOPUS:85104343442
SN - 0378-1119
VL - 787
JO - Gene
JF - Gene
M1 - 145643
ER -