College of Liberal Arts
This project aims to create sample databases and develop a prototype research tool, the Archives Unleashed Toolkit, using data from the Internet Archive, a library of webpages. Convenient, efficient access to archival Internet data has the potential to open up many new avenues of social science research. In addition, evidence from this project will inform the creation and dissemination of general guidelines for conducting theoretically and methodologically rigorous longitudinal research using archival web data. The project focuses on data pertaining to news media, using social network analysis to understand how information flows.

The Archives Unleashed Toolkit is an open-source platform for managing web archives built on Hadoop. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark. The datasets analyzed are roughly 5 to 10 TB in raw format, but they can be quickly analyzed and compressed using the Archives Unleashed Toolkit and then processed for social network analysis.
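As a rough illustration of the social network analysis step, the sketch below assumes that hyperlinks have already been extracted from the archive as (source URL, target URL) pairs. It collapses them into a weighted domain-to-domain graph and counts inbound links per domain. The function names and sample URLs are hypothetical and are not part of the Archives Unleashed Toolkit. It uses only the Python standard library and shows one possible approach, not the project's actual pipeline.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def domain(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_domain_graph(links):
    """Aggregate (source_url, target_url) pairs into weighted domain edges.

    Links within a single domain are dropped, since they say nothing
    about information flow between outlets.
    """
    edges = Counter()
    for src, dst in links:
        s, d = domain(src), domain(dst)
        if s and d and s != d:
            edges[(s, d)] += 1
    return edges


def in_degree(edges):
    """Sum inbound link weights per domain (a simple centrality measure)."""
    totals = defaultdict(int)
    for (_, target), weight in edges.items():
        totals[target] += weight
    return dict(totals)


# Hypothetical sample data standing in for links extracted from the archive.
sample_links = [
    ("http://www.alpha-news.example/story1", "http://beta-times.example/report"),
    ("http://alpha-news.example/story2", "http://beta-times.example/other"),
    ("http://gamma-post.example/x", "http://beta-times.example/report"),
    ("http://beta-times.example/report", "http://beta-times.example/about"),
    ("http://gamma-post.example/y", "https://www.alpha-news.example/story1"),
]

edges = build_domain_graph(sample_links)
inbound = in_degree(edges)
```

Aggregating to the domain level before any graph analysis keeps the edge list small relative to the raw archive, which is one reason a multi-terabyte collection can become tractable for network analysis on ordinary hardware.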