TY - GEN
T1 - Incorporating Natural Language-Based and Sequence-Based Features to Predict Protein Sumoylation Sites
AU - Tran, Thi Xuan
AU - Nguyen, Van Nui
AU - Le, Nguyen Quoc Khanh
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
AB - The incidence of thyroid cancer and breast cancer is increasing every year, and the specific pathogenesis is unclear. Post-translational modifications are an important regulatory mechanism that affects the function of almost all proteins. They are essential for a diverse and well-functioning proteome and can integrate metabolism with physiological and pathological processes. In recent years, post-translational modifications have become a research hotspot, with methylation, phosphorylation, acetylation and succinylation being the main focus. SUMOylated proteins are predominantly localized in the nucleus, and SUMO regulates nuclear processes, including cell cycle control and DNA repair. SUMOylation has been increasingly implicated in cancer, Alzheimer’s, and Parkinson’s diseases. Therefore, identifying and characterizing SUMOylation sites is essential for modification-specific proteomics. This study proposes a novel scheme for predicting protein SUMOylation sites that incorporates natural language features (Word2Vec) and sequence-based features. A novel model, called RSX_SUMO, is proposed for the prediction of protein SUMOylation sites. Our experiments show that RSX_SUMO achieves the highest performance in both five-fold cross-validation and independent testing, reaching an accuracy of 88.6% and an MCC of 0.743 on the independent test set. In addition, a comparison with several existing prediction models shows that the proposed model outperforms them. We hope that our findings will provide useful suggestions and be of help to researchers in related studies.
KW - Machine learning
KW - Random forest
KW - SUMOylation sites prediction
KW - SVM
KW - Word2Vec
KW - XGBoost
UR - http://www.scopus.com/inward/record.url?scp=85172003989&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85172003989&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-36886-8_7
DO - 10.1007/978-3-031-36886-8_7
M3 - Conference contribution
AN - SCOPUS:85172003989
SN - 9783031368851
T3 - Lecture Notes in Networks and Systems
SP - 74
EP - 88
BT - The 12th Conference on Information Technology and Its Applications - Proceedings of the International Conference CITA 2023
A2 - Nguyen, Ngoc Thanh
A2 - Le-Minh, Hoa
A2 - Huynh, Cong-Phap
A2 - Nguyen, Quang-Vu
PB - Springer Science and Business Media Deutschland GmbH
T2 - Proceedings of the 12th International Conference on Information Technology and its Applications, CITA 2023
Y2 - 28 July 2023 through 29 July 2023
ER -