TY - JOUR
T1 - Incorporating a transfer learning technique with amino acid embeddings to efficiently predict N-linked glycosylation sites in ion channels
AU - Nguyen, Trinh Trung Duong
AU - Le, Nguyen Quoc Khanh
AU - Tran, The Anh
AU - Pham, Dinh Minh
AU - Ou, Yu Yen
N1 - Funding Information:
This work was supported by the Ministry of Science and Technology, Taiwan, R.O.C. [grant numbers MOST 109-2811-E-155-505 and MOST 109-2221-E-155-045].
Publisher Copyright:
© 2021
PY - 2021/3
Y1 - 2021/3
AB - Glycosylation is a dynamic enzymatic process that attaches glycans to proteins or other organic molecules such as lipoproteins. Research has shown that this process plays a fundamental role in modulating the functions of ion channel proteins. This study used a computational method to predict N-linked glycosylation sites, the most common type, in ion channel proteins. From segments of ion channel proteins centered on N-linked glycosylation sites, the amino acid embedding vectors of each residue were concatenated to create features for prediction. We experimented with two different models for converting amino acids to their corresponding embeddings: one was trained on ion channel sequences and the other on a large dataset of more than one million protein sequences. The latter model applied the idea of transfer learning and emerged as a more efficient feature extractor. Our best model was obtained from this transfer learning approach and a hyperparameter tuning process using random search with 5-fold cross-validation. It achieved an accuracy, specificity, sensitivity, and Matthews correlation coefficient of 93.4%, 92.8%, 98.6%, and 0.726, respectively. The corresponding scores on an independent test set were 92.9%, 92.2%, 99%, and 0.717. These results outperform those obtained with position-specific scoring matrix features, which are predominantly employed in post-translational modification site prediction. Furthermore, compared with N-GlyDE, GlycoEP, and SPRINT-Gly, the most recent N-linked glycosylation site predictors, our model yields higher scores on all four metrics, further demonstrating the efficiency of our approach.
KW - Amino acid embeddings
KW - Ion channel
KW - N-linked glycosylation
KW - Post-translational modification site prediction
KW - Transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85099455451&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099455451&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2021.104212
DO - 10.1016/j.compbiomed.2021.104212
M3 - Article
C2 - 33454535
AN - SCOPUS:85099455451
SN - 0010-4825
VL - 130
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 104212
ER -