College of Liberal Arts
Big data and the predictive modeling of high-dimensional datasets are of great interest to practitioners in many fields, such as finance, biology, and economics. The researchers on this project are taking model combination, a methodology that is widely and efficiently used for low-dimensional datasets, and adapting it to high-dimensional settings. So far, little of the literature has discussed combining models for high-dimensional datasets. This project will develop a general risk bound for the proposed methodology for high-dimensional predictive modeling, particularly for classification problems. Further, an efficient computing algorithm for the combination schemes will be developed and wrapped into a publicly available R package.
Many real big-data sets will be analyzed with multiple high-dimensional classification methods using cross-validation. This process will take about 10 million non-linear numerical optimizations. Besides working with real data, the researchers will perform various numerical experiments to gain a better understanding of their methods. For each scenario, they will compare their methods with between five and ten other popular methods and run a large number of replicates to reduce the bias from sampling. These experiments will take about 10 million calculations.
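To make the idea of model combination concrete, here is a minimal sketch in Python. The project's actual combination scheme is not specified in this description, so the sketch assumes one common choice: each candidate classifier receives an exponential weight based on its log-loss on held-out (cross-validation) data, and the combined prediction is the weighted average of the candidates' predicted probabilities. The function names, the `temperature` parameter, and the toy data are all hypothetical and serve only as illustration.

```python
import math

def log_loss(probs, labels):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)

def combination_weights(heldout_probs, labels, temperature=1.0):
    """Assumed scheme: w_k proportional to exp(-n * loss_k / temperature)."""
    n = len(labels)
    losses = [log_loss(p, labels) for p in heldout_probs]
    best = min(losses)  # subtract the minimum for numerical stability
    raw = [math.exp(-n * (loss - best) / temperature) for loss in losses]
    total = sum(raw)
    return [r / total for r in raw]

def combine(probs_per_model, weights):
    """Weighted average of each model's predicted probabilities."""
    n = len(probs_per_model[0])
    return [sum(w * p[i] for w, p in zip(weights, probs_per_model))
            for i in range(n)]

# Toy held-out predictions from three hypothetical classifiers.
labels = [1, 0, 1, 1, 0]
heldout = [
    [0.9, 0.2, 0.8, 0.7, 0.1],  # informative model
    [0.5, 0.5, 0.5, 0.5, 0.5],  # uninformative model
    [0.3, 0.6, 0.4, 0.2, 0.7],  # misleading model
]
weights = combination_weights(heldout, labels)
combined = combine(heldout, weights)
```

In this toy setup the informative model receives most of the weight, so the combined prediction stays close to it. In a real study the held-out probabilities would come from cross-validation folds rather than being written by hand.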