TY - JOUR
T1 - An Extensive Examination of Discovering 5-Methylcytosine Sites in Genome-Wide DNA Promoters Using Machine Learning Based Approaches
AU - Nguyen, Trinh Trung Duong
AU - Tran, The Anh
AU - Le, Nguyen Quoc Khanh
AU - Pham, Dinh Minh
AU - Ou, Yu Yen
N1 - Funding Information:
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grants MOST 109-2811-E-155-505 and MOST 109-2221-E-155-045. Additionally, Dinh-Minh Pham is partially supported by a VAST Project ĐLTE 00.03/18-19.
Publisher Copyright:
© 2004-2012 IEEE.
PY - 2022
Y1 - 2022
N2 - It is well known that the major reasons for the rapid proliferation of cancer cells are the hypomethylation of the whole cancer genome and the hypermethylation of the promoters of particular tumor suppressor genes. Locating 5-methylcytosine (5mC) sites in promoters is therefore a crucial step toward further understanding the relationship between promoter methylation and the regulation of mRNA gene expression. High-throughput identification of DNA 5mC in the wet lab is still time-consuming and labor-intensive. Thus, finding the 5mC sites of genome-wide DNA promoters remains an important task. We compared the effectiveness of the most popular and powerful machine learning techniques, namely XGBoost, Random Forest, Deep Forest, and Deep Feedforward Neural Network, in predicting the 5mC sites of genome-wide DNA promoters. A feature extraction method based on k-mer embeddings learned from a language model was also applied. Overall, all the surveyed models outperformed the deep learning models of the latest studies on the same dataset that employ other encoding schemes. Furthermore, the best model achieved an AUC score of 0.962 on both cross-validation and independent test data. We concluded that our approach identifies 5mC sites in promoters efficiently and with high performance.
AB - It is well known that the major reasons for the rapid proliferation of cancer cells are the hypomethylation of the whole cancer genome and the hypermethylation of the promoters of particular tumor suppressor genes. Locating 5-methylcytosine (5mC) sites in promoters is therefore a crucial step toward further understanding the relationship between promoter methylation and the regulation of mRNA gene expression. High-throughput identification of DNA 5mC in the wet lab is still time-consuming and labor-intensive. Thus, finding the 5mC sites of genome-wide DNA promoters remains an important task. We compared the effectiveness of the most popular and powerful machine learning techniques, namely XGBoost, Random Forest, Deep Forest, and Deep Feedforward Neural Network, in predicting the 5mC sites of genome-wide DNA promoters. A feature extraction method based on k-mer embeddings learned from a language model was also applied. Overall, all the surveyed models outperformed the deep learning models of the latest studies on the same dataset that employ other encoding schemes. Furthermore, the best model achieved an AUC score of 0.962 on both cross-validation and independent test data. We concluded that our approach identifies 5mC sites in promoters efficiently and with high performance.
KW - advanced machine learning classifiers
KW - DNA 5-methylcytosine site prediction
KW - k-mers embedding
KW - Natural language processing
KW - promoter
UR - http://www.scopus.com/inward/record.url?scp=85107205727&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107205727&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2021.3082184
DO - 10.1109/TCBB.2021.3082184
M3 - Article
C2 - 34014828
AN - SCOPUS:85107205727
SN - 1545-5963
VL - 19
SP - 87
EP - 94
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 1
ER -