Sa-TTCA: An SVM-based approach for tumor T-cell antigen classification using features extracted from biological sequencing and natural language processing

Thi Oanh Tran, Nguyen Quoc Khanh Le

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)


Accurately predicting tumor T-cell antigen (TTCA) sequences is a crucial task in the development of cancer vaccines and immunotherapies. TTCAs, which are derived from tumor cells, are presented to immune cells (T cells) through the major histocompatibility complex (MHC), via the recognition of specific portions of their structure known as epitopes. More specifically, MHC class I presents TTCAs to T-cell receptors (TCRs), which are located on the surface of CD8+ T cells. However, TTCA sequences are highly variable, which complicates vaccine design. Recently, machine learning (ML) models have been developed to predict TTCA sequences, which could aid fast and accurate TTCA identification. When constructing a TTCA predictor, the peptide encoding strategy is an important step. Previous studies have used biological descriptors to encode TTCA sequences. However, no study has used natural language processing (NLP), a promising approach for this purpose. Just as sentences are made of words with diverse properties, biological sequences hold unique characteristics that reflect evolutionary information, physicochemical values, and structural information. We hypothesized that NLP methods would benefit the prediction of TTCAs. To develop a new model for identifying TTCAs, we first constructed a baseline model with widely used ML algorithms and features extracted from biological descriptors. Then, to improve model performance, we added features extracted from biological language models (BLMs) based on NLP methods. In addition, we performed feature selection using the Chi-square and Pearson correlation coefficient techniques. SMOTE, up-sampling, and Near-Miss were then used to handle the imbalanced data. Finally, we optimized Sa-TTCA by applying the SVM algorithm to the four most effective feature groups. The best-performing Sa-TTCA model achieved a competitive balanced accuracy of 87.5% on the training set and 72.0% on an independent test set.
Our results suggest that integrating biological descriptors with natural language processing has the potential to improve the precision of predicting protein/peptide functionality, which could be beneficial for developing cancer vaccines.
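Two ideas in the abstract, up-sampling of imbalanced data and balanced accuracy, can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation: the function names, the toy labels, and the placeholder peptide strings are all assumptions made for illustration, and the up-sampling shown is plain random duplication of minority-class samples (SMOTE and Near-Miss work differently and are not shown).

```python
import random
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for two classes this is
    (sensitivity + specificity) / 2."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def random_upsample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy data: 6 positives (label 1) and 2 negatives (label 0).
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
# A classifier biased toward the majority class misses one negative.
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

# Placeholder strings standing in for peptide sequences (not real TTCAs).
peptides = ["AAAA", "CCCC", "DDDD", "EEEE", "FFFF", "GGGG", "HHHH", "IIII"]
up_peptides, up_labels = random_upsample(peptides, y_true)

print(f"accuracy={accuracy:.3f}  balanced accuracy={bal_acc:.3f}")
print("class counts after up-sampling:", dict(Counter(up_labels)))
```

On the toy labels, plain accuracy rewards the majority-class bias while balanced accuracy penalizes the missed negative, which is why the latter is a common choice for imbalanced data. In practice, resampling is applied only to the training data so that the independent test set keeps its original class distribution.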

Original language: English
Article number: 108408
Journal: Computers in Biology and Medicine
Publication status: Published - May 2024


Keywords

  • Biological language models
  • Cancer vaccines
  • Machine learning
  • Peptide sequences
  • Protein encoding

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications

