TY - JOUR
T1 - Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters
AU - Nguyen, Trinh Trung Duong
AU - Le, Nguyen Quoc Khanh
AU - Ho, Quang Thai
AU - Phan, Dinh Van
AU - Ou, Yu Yen
N1 - Publisher Copyright: © 2019 Elsevier Inc.
PY - 2019/7/15
Y1 - 2019/7/15
N2 - Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, making it an important problem for bioinformatics researchers. In this study, we applied the word embedding approach, the main driver of the natural language processing breakthroughs of recent years, to protein sequences of transporters. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the length of the biological words that constitute each protein sequence to find the optimal length, which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 on 5-fold cross-validation and 0.99 on the independent test. With this result, our study can help biologists identify transporters based on substrate specificities and provides a basis for further research that enriches the field of applying natural language processing techniques in bioinformatics.
AB - Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, making it an important problem for bioinformatics researchers. In this study, we applied the word embedding approach, the main driver of the natural language processing breakthroughs of recent years, to protein sequences of transporters. We defined each protein sequence based on the word embeddings and frequencies of its biological words. The protein features were then fed into machine learning models for prediction. We also varied the length of the biological words that constitute each protein sequence to find the optimal length, which generated the most discriminative feature set. Compared to four other feature types created from protein sequences, our proposed features help prediction models yield superior performance. Our best models reach an average area under the curve of 0.96 on 5-fold cross-validation and 0.99 on the independent test. With this result, our study can help biologists identify transporters based on substrate specificities and provides a basis for further research that enriches the field of applying natural language processing techniques in bioinformatics.
KW - Feature extraction
KW - Natural language processing
KW - Protein function prediction
KW - Substrate specificities
KW - Support vector machine
KW - Transporter
KW - Word embeddings
UR - http://www.scopus.com/inward/record.url?scp=85064809652&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064809652&partnerID=8YFLogxK
U2 - 10.1016/j.ab.2019.04.011
DO - 10.1016/j.ab.2019.04.011
M3 - Article
C2 - 31022378
AN - SCOPUS:85064809652
SN - 0003-2697
VL - 577
SP - 73
EP - 81
JO - Analytical Biochemistry
JF - Analytical Biochemistry
ER -