Applying data mining for the analysis of breast cancer data

Der Ming Liou, Wei Pin Chang

Research output: Contribution to journal › Article › peer-review

22 Citations (Scopus)


Data mining, also known as knowledge discovery in databases (KDD), is the process of automatically searching large volumes of data for patterns. For instance, a clinical pattern might indicate that women who have diabetes or hypertension are more likely to suffer a stroke within the next 5 years. A physician can then learn valuable knowledge from the data mining process. Here, we present a study of the application of artificial intelligence and data mining techniques to prediction models of breast cancer. An artificial neural network, a decision tree, logistic regression, and a genetic algorithm were compared, with the accuracy and positive predictive value of each algorithm used as the evaluation indicators. The data set comprised 699 records acquired from breast cancer patients at the University of Wisconsin, with nine predictor variables and one outcome variable, and the data analysis was followed by tenfold cross-validation. The results revealed that the accuracy of the logistic regression model was 0.9434 (sensitivity 0.9716, specificity 0.9482), that of the decision tree model 0.9434 (sensitivity 0.9615, specificity 0.9105), that of the neural network model 0.9502 (sensitivity 0.9628, specificity 0.9273), and that of the genetic algorithm model 0.9878 (sensitivity 1, specificity 0.9802). The accuracy of the genetic algorithm was significantly higher than the average predicted accuracy of 0.9612. The predicted outcome of the logistic regression model was higher than that of the neural network model, but no significant difference was observed. The average predicted accuracy of the decision tree model was 0.9435, the lowest of the four predictive models. The standard deviation across the tenfold cross-validation was rather unreliable.
This study indicated that the genetic algorithm model yielded better results than the other data mining models for the analysis of breast cancer patient data, in terms of both the overall accuracy of patient classification and the expression and complexity of the classification rule. The results showed that the genetic algorithm described in the present study produced accurate classifications of breast cancer data, and that the classification rule it identified was more acceptable and comprehensible.
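
The evaluation described in the abstract (accuracy, sensitivity, specificity, and positive predictive value, estimated by tenfold cross-validation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The synthetic data, the toy threshold classifier, and every function name below are assumptions introduced only for illustration; the sketch merely mirrors the record count (699) and the number of predictors (nine) mentioned above.

```python
import random
from statistics import mean, stdev


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity, and positive predictive value."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
    }


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for each fold."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train])
        y_pred = [predict(model, X[i]) for i in held_out]
        y_true = [y[i] for i in held_out]
        scores.append(metrics(*confusion_counts(y_true, y_pred)))
    return scores


# Hypothetical stand-in classifier: threshold on the sum of the predictors.
def fit_threshold(X_train, y_train):
    pos = [sum(x) for x, label in zip(X_train, y_train) if label == 1]
    neg = [sum(x) for x, label in zip(X_train, y_train) if label != 1]
    return (mean(pos) + mean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if sum(x) > threshold else 0


# Synthetic data (an assumption, NOT the Wisconsin records): 699 rows, 9 predictors.
rng = random.Random(42)
X = [[rng.randint(1, 10) for _ in range(9)] for _ in range(699)]
y = [1 if sum(row) > 50 else 0 for row in X]

fold_scores = cross_validate(X, y, fit_threshold, predict_threshold, k=10)
accuracies = [s["accuracy"] for s in fold_scores]
print(f"mean accuracy {mean(accuracies):.4f}, fold std dev {stdev(accuracies):.4f}")
```

Any of the four model families named in the abstract could be substituted for the toy classifier by supplying different `fit` and `predict` callables. The per-fold standard deviation printed at the end is the kind of spread the abstract refers to when it comments on the reliability of the tenfold estimate.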

Original language: English
Pages (from-to): 175-189
Number of pages: 15
Journal: Methods in molecular biology (Clifton, N.J.)
Publication status: Published - Jan 1 2015
Externally published: Yes

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

