TY - JOUR
T1 - Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data
AU - Huang, Yuan
AU - Liu, Jin
AU - Yi, Huangdi
AU - Shia, Ben Chang
AU - Ma, Shuangge
N1 - Publisher Copyright:
Copyright © 2016 John Wiley & Sons, Ltd.
PY - 2017/2/10
Y1 - 2017/2/10
N2 - In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis utilizes information of multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets – or equivalently, similarity of model sparsity structures – across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance.
AB - In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis utilizes information of multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets – or equivalently, similarity of model sparsity structures – across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance.
KW - heterogeneity structure
KW - integrative analysis
KW - marker identification
KW - model sparsity structure
KW - sparse boosting
UR - http://www.scopus.com/inward/record.url?scp=84988909504&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84988909504&partnerID=8YFLogxK
U2 - 10.1002/sim.7138
DO - 10.1002/sim.7138
M3 - Article
C2 - 27667129
AN - SCOPUS:84988909504
SN - 0277-6715
VL - 36
SP - 509
EP - 559
JO - Statistics in Medicine
JF - Statistics in Medicine
IS - 3
ER -