TY - JOUR
T1 - NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions
AU - Dai, Hong Jie
AU - Singh, Onkar
AU - Jonnagaddala, Jitendra
AU - Su, Emily Chia Yu
N1 - Publisher Copyright: © The Author(s) 2016. Published by Oxford University Press.
PY - 2016
Y1 - 2016
N2 - In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were employed, one to recognize gene mentions and one to normalize those mentions to their Entrez Gene IDs by utilizing a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery. Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.
AB - In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were employed, one to recognize gene mentions and one to normalize those mentions to their Entrez Gene IDs by utilizing a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery. Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.
UR - http://www.scopus.com/inward/record.url?scp=85011021624&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85011021624&partnerID=8YFLogxK
U2 - 10.1093/database/baw111
DO - 10.1093/database/baw111
M3 - Article
C2 - 27465130
AN - SCOPUS:85011021624
SN - 1758-0463
VL - 2016
JO - Database
JF - Database
M1 - baw111
ER -