College of Science & Engineering

Twin Cities

This group works on large-scale machine learning and data analysis, applied to problems in natural language processing, climate science, and ecology. The datasets for these problems are substantial, requiring large numbers of computations to search through both explicit and implicit information. Further, the methods used take advantage of algorithms that can be distributed across multiple nodes. The researchers are working on four projects during 2020:

**Text Mining on Health Journals:** Online social support groups are important places for patients and caregivers to seek information, express themselves, and exchange support. This group is collaborating with CaringBridge, a prominent online community for writing about and sharing personal health journeys, to study these online health communities. Current work investigates how patients manage transitions during their personal health journeys. MSI is used to store and compute with the tens of millions of text entries provided by CaringBridge. The project currently pursues three lines of inquiry with respect to CaringBridge:

- Journal replies analysis: Does receiving certain types of replies have a tangible impact on the writing behavior of site authors? The researchers are using MSI, particularly Jupyter notebooks hosted in interactive jobs and the R kernel at notebooks.msi.umn.edu, to describe replies on CaringBridge quantitatively and qualitatively and to measure their impact on author behavior using causal inference methods.
- Cancer patient labor shifts: How do cancer patients' responsibilities change over time after a diagnosis? In 2020, this work will expand to focus on the health- and cancer-care implications of the group's 2019 modeling work; the expanded work may include some training of deep learning models (PyTorch), although for the most part the researchers have been training large CPU-based linear models.
- Reddit community conflict: Separately from the CaringBridge collaboration, the group is analyzing inter-community conflict and conflict resolution using Reddit data.
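The replies-analysis question above, whether certain reply types change author behavior, can be illustrated with a toy comparison. This is a minimal sketch with invented data and labels; it computes only a naive difference in means and deliberately ignores the confounding that the group's causal inference methods are meant to address:

```python
# Hypothetical sketch: naive comparison of posting behavior by reply type.
# The records and the days-until-next-entry values below are invented for
# illustration; the actual study uses causal-inference methods to adjust
# for confounders, which this naive difference-in-means does not.
from statistics import mean

# Each record: (received_supportive_reply, days_until_next_journal_entry)
entries = [
    (True, 2), (True, 3), (True, 1), (True, 4),
    (False, 7), (False, 5), (False, 9), (False, 6),
]

treated = [days for supported, days in entries if supported]
control = [days for supported, days in entries if not supported]

# Naive estimate: difference in mean gap between groups. A causal analysis
# would additionally adjust for author activity level, diagnosis, etc.
naive_effect = mean(treated) - mean(control)
print(f"naive difference in days-to-next-entry: {naive_effect:.2f}")
```

A negative value here would suggest that supported authors return sooner, but without adjustment for who tends to receive supportive replies the number is descriptive, not causal.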

**Deep Learning Methods for Sub-Seasonal Forecasting:** The researchers are training deep learning models for sub-seasonal climate prediction using datasets from NOAA. The datasets contain daily observations of climate variables at a spatial resolution of 0.5 degrees (latitude) by 0.5 degrees (longitude) from 1980 to 2019; there are therefore approximately 15,000 observations with two to three million feature dimensions reflecting climate variables such as temperature and precipitation. The dataset is used for a prediction task: forecasting temperature and precipitation over North America two to eight weeks ahead. The researchers will explore deep learning models such as convolutional networks, auto-encoders, transformers, and recurrent networks, each of which has many parameters that require training. To the best of the group's knowledge, this is one of the first applications proposing deep learning models for sub-seasonal climate forecasting, and as such it will require running multiple iterations of these models for parameter tuning and live evaluation.

**Structure Learning for Ecology:** The researchers are applying statistical methods such as sub-sampling, re-sampling, and cross-validation to structure learning methods for undirected graphical models. Existing structure learning methods (such as graphical lasso, neighborhood selection, and CLIME) depend heavily on hyper-parameters. Therefore, a large number of experiments is required to find stable structures in the given data without relying on hyper-parameter tuning. The researchers will apply the proposed methods to both synthetic datasets and real-world datasets from climate science and ecology, such as the TRY plant trait dataset. To guarantee the reliability of the results, they will run repeated experiments on different combinations of statistical methods and structure learning methods for graphs ranging from ~10 to ~10,000 nodes.
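The sub-sampling idea behind stable structure recovery can be sketched in a few lines. This is an illustrative toy, not the group's method: a thresholded inverse covariance stands in for a proper sparse estimator (graphical lasso, neighborhood selection), the data are synthetic, and the threshold and stability cutoff are arbitrary choices for the example:

```python
# Illustrative sketch of stability-based edge selection via sub-sampling.
# A thresholded partial-correlation matrix stands in for a real sparse
# structure estimator so the sub-sampling logic stays self-contained.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: variables 0 and 1 strongly dependent, variable 2 independent.
n, p = 500, 3
z = rng.standard_normal(n)
X = np.column_stack([z + 0.1 * rng.standard_normal(n),
                     z + 0.1 * rng.standard_normal(n),
                     rng.standard_normal(n)])

def edges(data, thresh=0.2):
    """Edges whose partial correlation (from the precision matrix) exceeds thresh."""
    prec = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(prec))
    partial = -prec / np.outer(d, d)
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(partial[i, j]) > thresh}

# Sub-sample repeatedly and keep only edges that recur across sub-samples.
counts = {}
n_sub = 50
for _ in range(n_sub):
    idx = rng.choice(n, size=n // 2, replace=False)
    for e in edges(X[idx]):
        counts[e] = counts.get(e, 0) + 1

stable = {e for e, c in counts.items() if c / n_sub >= 0.8}
print("stable edges:", stable)
```

The point of the sub-sampling loop is that edges appearing in most sub-samples are robust to sampling noise, which is what makes the recovered structure less sensitive to any single hyper-parameter choice.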
**High-Dimensional Geometry of Deep Neural Networks:** When training deep neural networks for machine learning, the Hessian of the training loss is crucial in determining many behaviors of the networks. The eigenvalues of the Hessian characterize the local curvature of the loss, which, for example, determines how fast models can be optimized via first-order methods (at least for convex problems) and is also conjectured to influence generalization properties. To better understand the recent success of deep learning methods, these researchers are studying two important matrices: the Hessian of the training loss of deep neural networks, and the second moment of the stochastic gradients. State-of-the-art architectures, such as the convolutional neural network "VGG" and very deep architectures like ResNets, have tens of millions of parameters, so both the Hessian of the loss and the second moment of the stochastic gradients are million-by-million-dimensional matrices. With limited memory, directly computing such a huge matrix is nearly impossible. The researchers instead take advantage of state-of-the-art tools from numerical linear algebra, e.g., the Lanczos method and the power method, to approximate the top few eigenvalues of the Hessian and of the second moment of the stochastic gradients. It takes more than a thousand iterations to train a deep neural network to convergence, and the estimated time to compute even the top few eigenvalues of the desired matrices at each iteration is about 3-4 hours. The researchers are experimenting with 50-100 different settings.
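The key to these methods is that they never form the matrix explicitly: both the Lanczos and power methods only need matrix-vector products. A minimal sketch of the power method follows, with a small explicit symmetric matrix standing in for the Hessian; in the actual setting each matrix-vector product would be a Hessian-vector product computed matrix-free via automatic differentiation:

```python
# Minimal sketch of the power method for the top eigenvalue of a symmetric
# operator. In the research setting the Hessian is never formed explicitly:
# matvec would instead be a Hessian-vector product obtained from autodiff
# (Pearlmutter's trick), at a cost comparable to a few gradient evaluations.
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "Hessian": a small symmetric matrix with known top eigenvalue 5.
A = np.diag([5.0, 2.0, 1.0, 0.5])

def top_eigenvalue(matvec, dim, iters=200):
    """Power method: repeatedly apply the operator and renormalize."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = matvec(v)
        v = w / np.linalg.norm(w)
    return v @ matvec(v)  # Rayleigh quotient at the converged vector

lam = top_eigenvalue(lambda v: A @ v, 4)
print(f"estimated top eigenvalue: {lam:.4f}")
```

Lanczos generalizes this idea, building a small tridiagonal matrix from the same matrix-vector products to recover several extreme eigenvalues at once rather than just the largest.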