TY - JOUR
T1 - Gradient Boosting over Linguistic-Pattern-Structured Trees for Learning Protein–Protein Interaction in the Biomedical Literature
AU - Warikoo, Neha
AU - Chang, Yung Chun
AU - Ma, Shang Pin
N1 - Funding Information:
This research was funded by the National Science and Technology Council of Taiwan, under grant NSTC 111-2221-E-038-025, and the University System of Taipei Joint Research Program under grant USTP-NTOU-TMU-109-03.
Publisher Copyright:
© 2022 by the authors.
PY - 2022/10
Y1 - 2022/10
N2 - Protein-based studies contribute significantly to gathering functional information about biological systems; therefore, the protein–protein interaction detection task is one of the most researched topics in the biomedical literature. To this end, many state-of-the-art systems using syntactic tree kernels (TK) and deep learning have been developed. However, these models are computationally complex and have limited learning interpretability. In this paper, we introduce a linguistic-pattern-representation-based Gradient-Tree Boosting model, i.e., LpGBoost. It uses linguistic patterns to optimize and generate semantically relevant representation vectors for learning over the gradient-tree boosting. The patterns are learned via unsupervised modeling by clustering invariant semantic features. These linguistic representations are semi-interpretable with rich semantic knowledge, and owing to their shallow representation, they are also computationally less expensive. Our experiments with six protein–protein interaction (PPI) corpora demonstrate that LpGBoost outperforms the SOTA tree-kernel models, as well as the CNN-based interaction detection studies for BioInfer and AIMed corpora.
AB - Protein-based studies contribute significantly to gathering functional information about biological systems; therefore, the protein–protein interaction detection task is one of the most researched topics in the biomedical literature. To this end, many state-of-the-art systems using syntactic tree kernels (TK) and deep learning have been developed. However, these models are computationally complex and have limited learning interpretability. In this paper, we introduce a linguistic-pattern-representation-based Gradient-Tree Boosting model, i.e., LpGBoost. It uses linguistic patterns to optimize and generate semantically relevant representation vectors for learning over the gradient-tree boosting. The patterns are learned via unsupervised modeling by clustering invariant semantic features. These linguistic representations are semi-interpretable with rich semantic knowledge, and owing to their shallow representation, they are also computationally less expensive. Our experiments with six protein–protein interaction (PPI) corpora demonstrate that LpGBoost outperforms the SOTA tree-kernel models, as well as the CNN-based interaction detection studies for BioInfer and AIMed corpora.
KW - bioinformatics
KW - gradient-tree boosting
KW - linguistic patterns
KW - natural language processing
KW - protein–protein interaction
UR - http://www.scopus.com/inward/record.url?scp=85140491984&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140491984&partnerID=8YFLogxK
U2 - 10.3390/app122010199
DO - 10.3390/app122010199
M3 - Article
AN - SCOPUS:85140491984
SN - 2076-3417
VL - 12
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 20
M1 - 10199
ER -