Mining Earth Science and Biomedical Data
The primary objective of this research is to develop novel, high-performance data-mining algorithms and tools for large-scale datasets that arise in a variety of applications. One part of this effort involves gigabyte datasets collected by earth-observing satellites, which must be processed to better understand global-scale changes in biosphere processes and patterns, as well as data generated by scientific simulations, which can be used to gain insight into the underlying physical processes. The key technical challenges in mining these datasets include high volume, dimensionality, and heterogeneity; the spatio-temporal aspect of the data; possibly skewed class distributions; the distributed nature of the data; and the complexity of converting raw collected data into high-level features.
Other research efforts include developing novel data-mining techniques for analyzing biomedical data in a number of collaborative interdisciplinary projects. These include mining electronic health records to find patterns that distinguish between similar patients with disparate clinical outcomes; creating software to analyze DNA sequencing data from tumors in order to identify heterogeneous cellular subclones; and mining fMRI data from subjects with schizophrenia, bipolar disorder, and other mental disorders to identify brain activity patterns that can help in understanding these complex diseases.
All of these projects require significant computing resources due to their "big data" nature, making MSI resources critical for the group's daily research. Computational challenges imposed by the large size of the datasets will be addressed by building upon our past research in highly parallel formulations of key data-mining kernels (anomaly/outlier detection, association-pattern discovery, clustering, and rare-class predictive modeling) that can take advantage of high-performance computers.
A bibliography of this group’s publications is attached.