Data Mining


From the Apache Pig website:


Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
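To illustrate what "amenable to parallelization" can mean, here is a minimal sketch in plain Python (not Pig Latin, and not part of Pig). It assumes the data is already split into partitions; the function names `count_partition` and `merge_counts` are hypothetical and chosen only for this example.

```python
from collections import Counter


def count_partition(records):
    # Each partition is counted on its own, so this step could run in parallel.
    return Counter(word for line in records for word in line.split())


def merge_counts(partials):
    # Adding partial counts is associative, so the merge order does not matter.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Made-up sample data: two partitions of text lines.
    partitions = [["a b", "a"], ["b c"]]
    print(merge_counts(count_partition(p) for p in partitions))
```

Because each partition is processed independently and the merge step is order-insensitive, a system like Pig can spread this kind of work across many machines.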



D4M attempts to combine the advantages of five distinct processing technologies (sparse linear algebra, associative arrays, fuzzy algebra, distributed arrays, and triple-store/NoSQL databases such as Hadoop HBase and Apache Accumulo) to provide a database and computation system that addresses the problems associated with Big Data. 
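The associative-array idea can be sketched in a few lines of Python. This is a toy model written for illustration, not D4M's actual API; the class name `AssocArray` and its methods are assumptions. It stores values under string (row, column) keys, much like triples in a triple store, and adds two arrays element-wise as in sparse linear algebra.

```python
from collections import defaultdict


class AssocArray:
    """Toy sparse associative array: (row key, column key) -> numeric value."""

    def __init__(self, triples=()):
        self.data = defaultdict(float)
        for row, col, value in triples:
            self.data[(row, col)] += value

    def __add__(self, other):
        # Element-wise sum over the union of keys; missing entries count as zero.
        result = AssocArray()
        for source in (self.data, other.data):
            for key, value in source.items():
                result.data[key] += value
        return result

    def row(self, row_key):
        # All stored column/value pairs for one row key.
        return {col: v for (r, col), v in self.data.items() if r == row_key}


if __name__ == "__main__":
    # Made-up sample triples.
    a = AssocArray([("alice", "color|red", 1)])
    b = AssocArray([("alice", "color|red", 2), ("bob", "color|blue", 1)])
    print((a + b).row("alice"))
```

Only the entries that exist are stored, which is what makes this representation practical for very sparse data.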


Graphviz is free, open source graph visualization software. It takes descriptions of graphs in a text format and creates images from them, so that they can be embedded in presentations, papers, and other documents.
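As a rough illustration of the text format, the Python sketch below builds a small graph description in Graphviz's DOT language. The helper name `to_dot` and the sample edge list are hypothetical; the script only produces the text and does not call Graphviz itself.

```python
def to_dot(edges, name="G"):
    """Build a DOT description of a directed graph from (source, target) pairs."""
    lines = [f"digraph {name} {{"]
    for source, target in edges:
        # Quote node names so that names containing spaces remain valid DOT.
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up three-stage pipeline used only as sample data.
    print(to_dot([("load", "clean"), ("clean", "plot")]))
```

Saving the output to a file such as graph.dot and running `dot -Tpng graph.dot -o graph.png` should then produce a PNG image, assuming Graphviz is installed.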


JMP is statistical discovery software. It is the SAS product designed for dynamic data visualization on the desktop. It links statistics with graphics and makes users' data easily accessible in many ways.


Weka is data mining software written in Java. It consists of a collection of machine learning algorithms for data pre-processing, classification, regression, clustering, association rules, and visualization.


Bioconductor is an open source and open development software project to provide tools for the analysis and comprehension of genomic data. The broad goals of the project are to provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data, to facilitate the integration of biological metadata in the analysis of experimental data, and to allow the rapid development of extensible, scalable, and interoperable software.


Note: MSI's SAS license will permanently expire on 10/29/17, after which the software will no longer be available.


SAS is a scalable, integrated software environment and a flexible, extensible programming language designed for data access, data manipulation, and data analysis. It offers data management, predictive analytics, data mining, drug discovery solutions, and much more.


R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.


MATLAB is a high-level technical computing language and interactive environment for data visualization, data analysis, numerical computation, and algorithm development.