Sa-TTCA: An SVM-based approach for tumor T-cell antigen classification using features extracted from biological sequencing and natural language processing

Thi Oanh Tran, Nguyen Quoc Khanh Le

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Accurately predicting tumor T-cell antigen (TTCA) sequences is a crucial task in the development of cancer vaccines and immunotherapies. TTCAs, which are derived from tumor cells, are presented to immune cells (T cells) by the major histocompatibility complex (MHC) through the recognition of specific portions of their structure known as epitopes. More specifically, MHC class I presents TTCAs to T-cell receptors (TCRs), which are located on the surface of CD8+ T cells. However, TTCA sequences are highly variable, which complicates vaccine design. Recently, machine learning (ML) models have been developed to predict TTCA sequences, which could aid fast and accurate TTCA identification. In constructing a TTCA predictor, the peptide encoding strategy is an important step. Previous studies have encoded TTCA sequences with biological descriptors. However, no study has used natural language processing (NLP), a promising approach for this purpose. Just as sentences are made up of words with diverse properties, biological sequences hold unique characteristics that reflect evolutionary, physicochemical, and structural information. We hypothesized that NLP methods would benefit TTCA prediction. To develop a new model for identifying TTCAs, we first constructed a baseline model using widely used ML algorithms and features extracted from biological descriptors. Then, to improve model performance, we added features extracted from biological language models (BLMs) based on NLP methods. In addition, we conducted feature selection using the Chi-square and Pearson correlation coefficient techniques. SMOTE, up-sampling, and Near-Miss were then used to handle the imbalanced data. Finally, we optimized Sa-TTCA by applying the SVM algorithm to the four most effective feature groups. The best-performing Sa-TTCA achieved a competitive balanced accuracy of 87.5% on the training set and 72.0% on an independent testing set.
Our results suggest that integrating biological descriptors with natural language processing has the potential to improve the precision of predicting protein/peptide functionality, which could be beneficial for developing cancer vaccines.
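As an illustration of two ideas the abstract mentions, the sketch below computes balanced accuracy (the reported metric) and performs simple random up-sampling of a minority class. It is a minimal standard-library example on made-up labels, not the authors' code; the function names `balanced_accuracy` and `random_upsample` are hypothetical, and SMOTE, Near-Miss, feature selection, and the SVM itself are not shown.

```python
import random
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for two classes this is
    (sensitivity + specificity) / 2."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def random_upsample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class is as large as the largest one (illustrative only)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Made-up labels: 1 = TTCA, 0 = non-TTCA (assumed encoding).
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 1]

# Recall for class 1 is 6/6, recall for class 0 is 1/2.
print(balanced_accuracy(y_true, y_pred))  # 0.75

# Up-sample the minority class (0) to match the majority class (1).
X = ["pep%d" % i for i in range(len(y_true))]  # placeholder peptide IDs
X_bal, y_bal = random_upsample(X, y_true)
print(Counter(y_bal))
```

On these toy labels plain accuracy is 7/8 = 0.875, while balanced accuracy is 0.75, which shows why the balanced metric is the more informative one when classes are unequal in size.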
Original language: English
Article number: 108408
Journal: Computers in Biology and Medicine
Volume: 174
Publication status: Published - May 2024

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications
