TY - GEN
T1 - Human Pol II promoter prediction by using nucleotide property composition features
AU - Huang, Wen Lin
AU - Tung, Chun Wei
AU - Ho, Shinn Ying
PY - 2010/5/3
Y1 - 2010/5/3
N2 - The RNA polymerase II (Pol II) promoter is a key region that regulates differential transcription of protein-coding genes. Identifying Pol II promoters is one of the most challenging problems in genome annotation. Although many promoter prediction methods and tools have been developed, they have not yet extracted informative features from large-scale DNA sequences to improve predictive accuracy. A prediction method, ProPolyII, is proposed that mines informative nucleotide property composition (NPC) features to design a support vector machine-based classifier. An existing data set, HumP (1872 human promoters and 1870 non-promoters), is used to evaluate ProPolyII for promoter prediction. ProPolyII yields 70 informative NPC features with training and test accuracies of 99.1% and 95.1%, respectively. The 70 NPC features consist of 46 4-mer motifs, 3 nucleotide properties, and 21 global descriptors. These accuracies are better than those of Prom-Machine (94.9% and 91.1%) and M1 (97.4% and 93.6%), which use the top 128 4-mer motifs and 36 global descriptors, respectively. The high predictive performance indicates that ProPolyII can be beneficial for identifying promoters compared with other methods.
AB - The RNA polymerase II (Pol II) promoter is a key region that regulates differential transcription of protein-coding genes. Identifying Pol II promoters is one of the most challenging problems in genome annotation. Although many promoter prediction methods and tools have been developed, they have not yet extracted informative features from large-scale DNA sequences to improve predictive accuracy. A prediction method, ProPolyII, is proposed that mines informative nucleotide property composition (NPC) features to design a support vector machine-based classifier. An existing data set, HumP (1872 human promoters and 1870 non-promoters), is used to evaluate ProPolyII for promoter prediction. ProPolyII yields 70 informative NPC features with training and test accuracies of 99.1% and 95.1%, respectively. The 70 NPC features consist of 46 4-mer motifs, 3 nucleotide properties, and 21 global descriptors. These accuracies are better than those of Prom-Machine (94.9% and 91.1%) and M1 (97.4% and 93.6%), which use the top 128 4-mer motifs and 36 global descriptors, respectively. The high predictive performance indicates that ProPolyII can be beneficial for identifying promoters compared with other methods.
KW - Global descriptors
KW - Nucleotide property
KW - Promoter
KW - Support vector machine
UR - http://www.scopus.com/inward/record.url?scp=77951548363&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77951548363&partnerID=8YFLogxK
U2 - 10.1145/1722024.1722050
DO - 10.1145/1722024.1722050
M3 - Conference contribution
AN - SCOPUS:77951548363
SN - 9781605587226
T3 - ISB 2010 Proceedings - International Symposium on Biocomputing
BT - ISB 2010 Proceedings - International Symposium on Biocomputing
T2 - International Symposium on Biocomputing, ISB 2010
Y2 - 15 February 2010 through 17 February 2010
ER -