Modern society generates massive amounts of new information each and every day, and much of this information comes in the form of natural language text. Such text spans both traditional media (e.g. newspapers) and social media (e.g. blogs and microblogs). For example, Twitter users alone produce over 140 million tweets every day!  In order to help people keep track of and make sense of the vast amounts of new information being constantly generated, massively scalable methods for text processing and analysis are required.
This course will introduce and provide hands-on experience building massively scalable text analysis applications using MapReduce, a programming paradigm pioneered by Google.  MapReduce achieves scalability by dividing the computation to be performed across many different computers. A key strength of MapReduce is that it automatically handles many low-level system details of distributed computing, freeing the programmer to focus instead on higher-level issues of data analysis. Whereas traditional high-performance computing has addressed computationally intensive analysis of smaller datasets, MapReduce addresses information-intensive analysis in which the computation is I/O-bound (i.e. the sheer volume of data being processed presents the major bottleneck, rather than the sophistication of the computation being performed). Nonetheless, a variety of additional challenges and considerations arise in adapting traditional algorithms to scale to massive datasets via MapReduce.
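To make the paradigm concrete, here is a minimal single-machine sketch of the canonical word-count example in plain Java (the class and method names are illustrative, not part of any Hadoop API). The map phase emits a (word, 1) pair for each word; the framework groups pairs by key (the "shuffle" step, simulated here with a TreeMap); and the reduce phase sums the counts for each word. In a real Hadoop job, these phases run in parallel across many machines, which is where the scalability comes from.

```java
import java.util.*;

// Illustrative, single-machine sketch of the MapReduce word-count pattern.
// A real Hadoop job would distribute the map and reduce phases across a
// cluster and read its input from a distributed file system.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all counts emitted for a single word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog");

        // Shuffle/sort step: group emitted values by key.
        // Hadoop performs this automatically between map and reduce.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

Note that the programmer writes only the two small functions; all of the grouping, distribution, and fault tolerance is the framework's job, which is exactly the division of labor described above.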

Today, MapReduce has achieved widespread adoption across industry and scientific computing via an open-source Java implementation called Hadoop, whose development is being led by Yahoo. We will use Hadoop in this course to perform scalable text analysis and build scalable Information Retrieval, Web search, and Natural Language Processing applications.

This class is based on and will closely follow the Data Intensive Information Processing Applications course offered by Jimmy Lin at the University of Maryland.

Prerequisites. Course concepts will be taught and reinforced through extensive programming using Hadoop. Students are expected to have prior experience with Java or Scala and to already be comfortable programming in at least one of these languages. No previous experience with MapReduce, Hadoop, or distributed programming is necessary. We welcome students across disciplines who may be interested in learning to use MapReduce in order to later analyze massive scientific or humanities datasets arising in their own respective disciplines.

Credit card requirement. Computing will be performed using two clusters: Amazon's Elastic MapReduce (EMR) service and the smaller TACC-Hadoop cluster. Students will be provided free credits for using the TACC-Hadoop cluster for coursework. Thanks to sponsorship, students will also receive sufficient free EMR credit to perform all required coursework. However, in order to activate their EMR accounts, students will be required to have and use their own credit card. Students will also be responsible for any additional EMR computing costs if they exhaust the free credit provided.