TY - JOUR
T1 - Integrating CNN and Bi-LSTM for protein succinylation sites prediction based on Natural Language Processing technique
AU - Tran, Thi Xuan
AU - Khanh Le, Nguyen Quoc
AU - Nguyen, Van Nui
N1 - Publisher Copyright: © 2025
PY - 2025/3
Y1 - 2025/3
N2 - Protein succinylation, a post-translational modification wherein a succinyl group (-CO-CH₂-CH₂-CO-) attaches to lysine residues, plays a critical regulatory role in cellular processes. Dysregulated succinylation has been implicated in the onset and progression of various diseases, including liver, cardiac, pulmonary, and neurological disorders. However, identifying succinylation sites through experimental methods is often labor-intensive, costly, and technically challenging. To address this, we introduce CBiLSuccSite, an approach that integrates Convolutional Neural Networks (CNN) with Bidirectional Long Short-Term Memory (Bi-LSTM) networks for the accurate prediction of protein succinylation sites. Our approach employs a word embedding layer to encode protein sequences, enabling the automatic learning of intricate patterns and dependencies without manual feature extraction. In 10-fold cross-validation, CBiLSuccSite achieved superior predictive performance, with an Area Under the Curve (AUC) of 0.826 and a Matthews Correlation Coefficient (MCC) of 0.502. Independent testing further validated its robustness, yielding an AUC of 0.818 and an MCC of 0.53. The integration of CNN and Bi-LSTM leverages the strengths of both architectures, establishing CBiLSuccSite as an effective tool for protein language processing and succinylation site prediction. Our model and code are publicly accessible at: https://github.com/nuinvtnu/CBiLSuccSite.
AB - Protein succinylation, a post-translational modification wherein a succinyl group (-CO-CH₂-CH₂-CO-) attaches to lysine residues, plays a critical regulatory role in cellular processes. Dysregulated succinylation has been implicated in the onset and progression of various diseases, including liver, cardiac, pulmonary, and neurological disorders. However, identifying succinylation sites through experimental methods is often labor-intensive, costly, and technically challenging. To address this, we introduce CBiLSuccSite, an approach that integrates Convolutional Neural Networks (CNN) with Bidirectional Long Short-Term Memory (Bi-LSTM) networks for the accurate prediction of protein succinylation sites. Our approach employs a word embedding layer to encode protein sequences, enabling the automatic learning of intricate patterns and dependencies without manual feature extraction. In 10-fold cross-validation, CBiLSuccSite achieved superior predictive performance, with an Area Under the Curve (AUC) of 0.826 and a Matthews Correlation Coefficient (MCC) of 0.502. Independent testing further validated its robustness, yielding an AUC of 0.818 and an MCC of 0.53. The integration of CNN and Bi-LSTM leverages the strengths of both architectures, establishing CBiLSuccSite as an effective tool for protein language processing and succinylation site prediction. Our model and code are publicly accessible at: https://github.com/nuinvtnu/CBiLSuccSite.
KW - Bi-direction long short-term memory (Bi-LSTM)
KW - Convolutional Neural Network (CNN)
KW - Natural Language Processing (NLP)
KW - Succinylation
KW - Word embedding
UR - http://www.scopus.com/inward/record.url?scp=85214348412&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85214348412&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2025.109664
DO - 10.1016/j.compbiomed.2025.109664
M3 - Article
AN - SCOPUS:85214348412
SN - 0010-4825
VL - 186
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 109664
ER -