TY - JOUR
T1 - VF-Pred
T2 - Predicting virulence factor using sequence alignment percentage and ensemble learning models
AU - Singh, Shreya
AU - Le, Nguyen Quoc Khanh
AU - Wang, Cheng
N1 - Publisher Copyright: © 2023 Elsevier Ltd
PY - 2024/1
Y1 - 2024/1
N2 - This study introduces VF-Pred, a novel framework for detecting virulence factors (VFs) through the analysis of genomic data. VFs are crucial for pathogens to successfully infect host tissue and evade the immune system, leading to the onset of infectious diseases. Identifying VFs accurately is important for developing potent drugs and vaccines to counter these diseases. To accomplish this, VF-Pred combines various feature engineering techniques to generate inputs for distinct machine learning classification models. The collective predictions of these models are then consolidated by a final downstream model using an innovative ensembling approach. One notable aspect of VF-Pred is the inclusion of a novel Seq-Alignment feature, which significantly enhances the accuracy of the employed machine learning algorithms. The framework was trained on 982 features obtained from extensive feature engineering, using an ensemble of 25 models. The new downstream ensembling technique adopted by VF-Pred surpasses existing stacking strategies and other ensembling methods in VF detection. Although similar studies have been done earlier, VF-Pred stands out in comparison, showing higher accuracy (83.5 %) and higher sensitivity (87 %) in the identification of VFs. VF-Pred is accessible through a user-friendly web page: the user provides an identifier and a protein sequence, and the tool predicts a high or low likelihood that the sequence is a VF. Overall, VF-Pred is a promising methodology for the identification of VFs, potentially paving the way for more effective strategies against infectious diseases.
AB - This study introduces VF-Pred, a novel framework for detecting virulence factors (VFs) through the analysis of genomic data. VFs are crucial for pathogens to successfully infect host tissue and evade the immune system, leading to the onset of infectious diseases. Identifying VFs accurately is important for developing potent drugs and vaccines to counter these diseases. To accomplish this, VF-Pred combines various feature engineering techniques to generate inputs for distinct machine learning classification models. The collective predictions of these models are then consolidated by a final downstream model using an innovative ensembling approach. One notable aspect of VF-Pred is the inclusion of a novel Seq-Alignment feature, which significantly enhances the accuracy of the employed machine learning algorithms. The framework was trained on 982 features obtained from extensive feature engineering, using an ensemble of 25 models. The new downstream ensembling technique adopted by VF-Pred surpasses existing stacking strategies and other ensembling methods in VF detection. Although similar studies have been done earlier, VF-Pred stands out in comparison, showing higher accuracy (83.5 %) and higher sensitivity (87 %) in the identification of VFs. VF-Pred is accessible through a user-friendly web page: the user provides an identifier and a protein sequence, and the tool predicts a high or low likelihood that the sequence is a VF. Overall, VF-Pred is a promising methodology for the identification of VFs, potentially paving the way for more effective strategies against infectious diseases.
KW - Ensemble learning
KW - Feature engineering
KW - Machine learning
KW - Protein sequence analysis
KW - Sequence alignment
KW - Virulence factors
UR - http://www.scopus.com/inward/record.url?scp=85177083268&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85177083268&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2023.107662
DO - 10.1016/j.compbiomed.2023.107662
M3 - Article
C2 - 37979206
AN - SCOPUS:85177083268
SN - 0010-4825
VL - 168
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 107662
ER -