TY - JOUR
T1 - Enhancing Arabidopsis thaliana ubiquitination site prediction through knowledge distillation and natural language processing
AU - Nguyen, Van Nui
AU - Tran, Thi Xuan
AU - Nguyen, Thi Tuyen
AU - Le, Nguyen Quoc Khanh
N1 - Publisher Copyright: © 2024 Elsevier Inc.
PY - 2024/12
Y1 - 2024/12
AB - Protein ubiquitination is a critical post-translational modification (PTM) involved in diverse biological processes and plays a pivotal role in regulating physiological mechanisms and disease states. Despite various efforts to develop ubiquitination site prediction tools across species, these tools mainly rely on predefined sequence features and machine learning algorithms, with species-specific variations in ubiquitination patterns remaining poorly understood. This study introduces a novel approach for predicting Arabidopsis thaliana ubiquitination sites using a neural network model based on knowledge distillation and natural language processing (NLP) of protein sequences. Our framework employs a multi-species “Teacher model” to guide a more compact, species-specific “Student model”, with the “Teacher” generating pseudo-labels that enhance the “Student” learning and prediction robustness. Cross-validation results demonstrate that our model achieves superior performance, with an accuracy of 86.3 % and an area under the curve (AUC) of 0.926, while independent testing confirmed these results with an accuracy of 86.3 % and an AUC of 0.923. Comparative analysis with established predictors further highlights the model's superiority, emphasizing the effectiveness of integrating knowledge distillation and NLP in ubiquitination prediction tasks. This study presents a promising and efficient approach for ubiquitination site prediction, offering valuable insights for researchers in related fields. The code and resources are available on GitHub: https://github.com/nuinvtnu/KD_ArapUbi.
KW - Arabidopsis thaliana
KW - Knowledge distillation
KW - Natural language processing (NLP)
KW - Neural network model
KW - Post-translational modification (PTM)
KW - Protein ubiquitination
UR - http://www.scopus.com/inward/record.url?scp=85207895665&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85207895665&partnerID=8YFLogxK
U2 - 10.1016/j.ymeth.2024.10.006
DO - 10.1016/j.ymeth.2024.10.006
M3 - Article
C2 - 39447942
AN - SCOPUS:85207895665
SN - 1046-2023
VL - 232
SP - 65
EP - 71
JO - Methods
JF - Methods
ER -