Dr. Matthew Weber

CLA Journalism & Mass Comm
College of Liberal Arts
Twin Cities
Project Title: Large Scale Network Analysis of Web Data Tracing News Media

This project aims to create sample databases and develop a prototype research tool, the Archives Unleashed Toolkit, using data from the Internet Archive, a digital library of archived webpages. Convenient, efficient access to archival Internet data could open many new avenues of social science research. In addition, evidence from this project will inform the creation and dissemination of general guidelines for conducting theoretically and methodologically rigorous longitudinal research with archival web data. The project focuses on data pertaining to news media and uses social network analysis to understand how information flows among news sources.

The Archives Unleashed Toolkit is an open-source platform, built on Hadoop, for managing web archives. It provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark. The datasets analyzed are roughly 5 to 10 TB in raw format, but they can be quickly analyzed and compressed with the Archives Unleashed Toolkit and then processed for social network analysis.
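As a rough illustration of the final step described above, the sketch below turns page-level hyperlink records into a weighted domain-to-domain network and counts incoming links per domain. It is a minimal sketch using only the Python standard library, not the Archives Unleashed Toolkit API. The record layout (crawl date, source URL, target URL), the function names, and the sample domains are all hypothetical and assumed for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def build_domain_edges(link_records):
    """Collapse page-level links into weighted domain-to-domain edges."""
    edges = Counter()
    for _crawl_date, source_url, target_url in link_records:
        source, target = domain_of(source_url), domain_of(target_url)
        # Skip malformed URLs and links that stay within one domain.
        if source and target and source != target:
            edges[(source, target)] += 1
    return edges


def weighted_in_degree(edges):
    """Sum the incoming link counts for each target domain."""
    in_degree = Counter()
    for (_source, target), weight in edges.items():
        in_degree[target] += weight
    return in_degree


# Hypothetical sample records: (crawl date, source page, linked page).
sample_records = [
    ("2012-01", "http://www.example-news.com/a", "http://example-wire.org/story1"),
    ("2012-01", "http://example-blog.net/post", "http://example-wire.org/story1"),
    ("2012-02", "http://www.example-news.com/b", "http://example-blog.net/post"),
    ("2012-02", "http://www.example-news.com/c", "http://www.example-news.com/a"),
]

edges = build_domain_edges(sample_records)
in_degree = weighted_in_degree(edges)

# List domains from most to least linked-to.
for domain, count in in_degree.most_common():
    print(domain, count)
```

Weighted in-degree is only one of many network measures. It is used here because it gives a simple first view of which outlets other sites link to most often.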

Project Investigators

Mr. David Olsen
Dr. David Porter
Michael Souren
Dr. Matthew Weber