Identify Breast Cancer Subtypes by Gene Expression Profiles

Grace S. Shieh, Chyi-Huey Bai, Chih Lee

Research output: Contribution to journalArticlepeer-review


Support vector machines (SVMs), with linear, polynomial and
radial kernels, were applied to classify subtypes of breast cancer by gene
expression profiles of tissues samples. Using the top 500 genes ranked by
between-group to within-group sum of squares, SVMs with linear kernel had
an average accuracy rate about 97% when applied to a balanced dataset; this
accuracy rate was significantly higher than that of the original data. After
imputation, the smallest subsample of the balanced dataset was comparable
to the other subsamples’ (containing more than 10 samples). In biomedical
sciences, it is of interest to identify genes that can be used to classify subtypes
of breast cancer well. Using SVMs, we identified 500 genes and looked up
the functions of 297 genes from databases. Furthermore, about 65% of these
297 genes were known to be related to breast cancer, and this confirms the
consistency of our results with existing biomedical knowledge. Those 203
genes may also be investigated further to see if they are involved in breast
cancer; any novel findings will be important.
Original languageEnglish
Pages (from-to)165-175
Journaljournal of data science
Publication statusPublished - 2004
Externally publishedYes

Cite this