TY - JOUR
T1 - Using Language Representation Learning Approach to Efficiently Identify Protein Complex Categories in Electron Transport Chain
AU - Nguyen, Trinh Trung Duong
AU - Le, Nguyen Quoc Khanh
AU - Ho, Quang Thai
AU - Phan, Dinh Van
AU - Ou, Yu Yen
N1 - © 2020 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
PY - 2020/10
Y1 - 2020/10
N2 - We herein proposed a novel approach based on language representation learning to categorize electron complex proteins into five types. The idea stemmed from the shared characteristics of human language and protein sequence language; thus, advanced natural language processing techniques were used to extract useful features. Specifically, we employed transfer learning and word embedding techniques to analyze electron complex sequences and create efficient feature sets before using a support vector machine algorithm to classify them. During the 5-fold cross-validation process, seven types of sequence-based features were analyzed to find the optimal features. On average, our final classification models achieved accuracy, specificity, sensitivity, and MCC of 96 %, 96.1 %, 95.3 %, and 0.86, respectively, on cross-validation data. On the independent test data, the corresponding performance scores were 95.3 %, 92.6 %, 94 %, and 0.87. We concluded that, using features extracted with these representation learning methods, the prediction performance of a simple machine learning algorithm is on par with that of an existing deep neural network method on the task of categorizing electron complexes, while offering a much faster way to generate features. Furthermore, the results also showed that combining features learned by the representation learning methods with sequence motif counts helps yield better performance.
AB - We herein proposed a novel approach based on language representation learning to categorize electron complex proteins into five types. The idea stemmed from the shared characteristics of human language and protein sequence language; thus, advanced natural language processing techniques were used to extract useful features. Specifically, we employed transfer learning and word embedding techniques to analyze electron complex sequences and create efficient feature sets before using a support vector machine algorithm to classify them. During the 5-fold cross-validation process, seven types of sequence-based features were analyzed to find the optimal features. On average, our final classification models achieved accuracy, specificity, sensitivity, and MCC of 96 %, 96.1 %, 95.3 %, and 0.86, respectively, on cross-validation data. On the independent test data, the corresponding performance scores were 95.3 %, 92.6 %, 94 %, and 0.87. We concluded that, using features extracted with these representation learning methods, the prediction performance of a simple machine learning algorithm is on par with that of an existing deep neural network method on the task of categorizing electron complexes, while offering a much faster way to generate features. Furthermore, the results also showed that combining features learned by the representation learning methods with sequence motif counts helps yield better performance.
KW - electron complexes
KW - motif frequencies
KW - protein function prediction
KW - representation learning
KW - transfer learning
KW - word embeddings
UR - http://www.scopus.com/inward/record.url?scp=85088120445&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088120445&partnerID=8YFLogxK
U2 - 10.1002/minf.202000033
DO - 10.1002/minf.202000033
M3 - Article
C2 - 32598045
AN - SCOPUS:85088120445
SN - 1868-1743
VL - 39
JO - Molecular Informatics
JF - Molecular Informatics
IS - 10
M1 - 2000033
ER -