Problem Definition:

Nashville is the capital and most populous city of Tennessee, with 691,243 residents in 2017.

Austin is the state capital of Texas and had a population of 950,715 in 2017.

Both cities are known for their live-music scenes, outdoor activities, tourism, and VCs. Ultimately, your team’s goal is to show why one city is a better place to be a VC.

During this project, you’ll use Big Data techniques to support the decision process, discuss many aspects of data analytics, and have a big data guru available as a resource for guidance. You and your team are expected to give a final presentation of your findings on Day 5.

Technologies Covered:

  • Scala
  • Hadoop
  • Hive
  • Spark
  • Cassandra
  • HBase
  • Kafka


How do you illustrate that one place is better than another? That is, of course, entirely up to your team. For example, you could use statistical information to show growth…,, and

For Austin, you might want to utilize data from places like, and

For Nashville you could see, and


Five Days

Outline for Hadoop Ecosystem Workshop Using Hackathon-Style Project Training

Day 1:

Scala
The goal on Day 1 will be to immerse students in the hackathon-like environment while they begin exploring the Scala programming language. Our project on Day 1 will be to leverage the Java-based jsoup library to scrape individual websites. Our baseline challenge will be: Given an arbitrary website, extract the text contents and:

  • Build up a word-frequency table. Later on, we will use this data as a website signature
  • Find the number of internal and external links referenced on the page
  • Safely navigate through linked pages while finding the most important pieces of information
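
As a starting point for the first two bullets, here is a minimal sketch in plain Scala. It assumes the page text and the raw href values have already been extracted (for example with jsoup); the object name `PageStats` and both function names are hypothetical, chosen only for illustration, and the tokenization rule (split on anything that is not a letter) is one simple choice among many.

```scala
import java.net.URI
import scala.util.Try

object PageStats {
  /** Lower-cases the text, splits on non-letter characters, and counts each word. */
  def wordFrequencies(text: String): Map[String, Int] =
    text.toLowerCase
      .split("[^\\p{L}]+")
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.length }

  /** Returns (internal, external) link counts relative to the page's own host. */
  def countLinks(pageUrl: String, hrefs: Seq[String]): (Int, Int) = {
    val base = new URI(pageUrl)
    // Resolve relative hrefs against the page URL; skip malformed ones.
    val resolved = hrefs.flatMap(h => Try(base.resolve(h.trim)).toOption)
    // Links without a host (mailto:, javascript:, ...) are ignored here.
    val hosts = resolved.flatMap(u => Option(u.getHost))
    val (internal, external) = hosts.partition(_.equalsIgnoreCase(base.getHost))
    (internal.size, external.size)
  }
}
```

With jsoup, the inputs would typically come from `Jsoup.connect(url).get()`, then `doc.body().text()` for the text and the `href` attribute of each element in `doc.select("a[href]")` for the links. Safe navigation through linked pages (the third bullet) is left out of this sketch; it would at least need a visited-URL set and a depth limit.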

In order to enable student work, the Day 1 lectures will cover:

  • JVM Languages and an introduction to Scala
  • Lab: Scala Compilation
  • Scala as a functional programming language
  • Basic Scala data types
  • Basic Scala collection types
  • Lab: Working with Dictionaries
  • Functions as objects
  • Classes

Day 2:

Hadoop Ecosystem
On Day 2, students will use basic Hadoop tools to store scraped content and perform aggregate analyses. In particular, they will join Crunchbase information about VC funding rounds with website contents to identify features that differentiate VCs.

The lecture portion in support of this work will cover:

  • Introduction to the Hadoop Ecosystem
  • SQL Querying by means of Hive
  • Reading and writing to/from HDFS
  • MapReduce to join features
  • Word Count analytics using Hive
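
To make the join idea concrete before moving to a cluster, the sketch below imitates the phases of a reduce-side MapReduce join using ordinary Scala collections. Everything about the record shapes is an assumption: `FundingRound`, `SiteSignature`, their fields, and the choice of `domain` as the join key are hypothetical stand-ins, not the actual Crunchbase schema or scraper output.

```scala
object JoinSketch {
  // Hypothetical record shapes; the real Crunchbase export and scraper output will differ.
  final case class FundingRound(domain: String, roundType: String, amountUsd: Long)
  final case class SiteSignature(domain: String, topWords: Seq[String])

  /** Inner-joins funding rounds to site signatures on `domain`. */
  def joinByDomain(
      rounds: Seq[FundingRound],
      sites: Seq[SiteSignature]
  ): Seq[(FundingRound, SiteSignature)] = {
    type Tagged = Either[FundingRound, SiteSignature]

    // "Map" phase: emit (key, tagged record) pairs from both inputs.
    val tagged: Seq[(String, Tagged)] =
      rounds.map(r => (r.domain, Left(r): Tagged)) ++
        sites.map(s => (s.domain, Right(s): Tagged))

    // "Shuffle" phase: bring all records that share a key together.
    tagged.groupBy(_._1).values.toSeq.flatMap { group =>
      // "Reduce" phase: pair every round with every signature for this key.
      val rs = group.collect { case (_, Left(r)) => r }
      val ss = group.collect { case (_, Right(s)) => s }
      for (r <- rs; s <- ss) yield (r, s)
    }
  }
}
```

On a real cluster the grouping step is performed by the framework's shuffle rather than by `groupBy`, and in Hive the same result would normally be expressed as a SQL join. Keys that appear on only one side are dropped here, which corresponds to an inner join.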

Day 3:

On Day 3, students will continue to expand their analysis of features using Spark and Spark SQL. They will also leverage Spark Streaming to generate aggregate information about the scraping process and its progress.
Lecture contents will include:

  • Spark Architecture
  • RDDs and the Spark API
  • Lab: Text Analysis using Spark
  • Spark SQL and DataFrames (overview)
  • Spark Streaming in Detail
  • Apache Zeppelin
  • Lab: Spark Streaming

Day 4:

Cassandra and HBase
On Day 4, students will begin the process of merging their team’s information into an overall Cassandra database. They will further compete on analytics based upon features introduced by various teams.
The day’s lecture contents will include:

  • The CAP theorem
  • Eventual consistency
  • Cassandra architecture
  • Cassandra queries
  • Lab: Querying and updating Cassandra
  • HBase architecture
  • Lab: HBase vs Cassandra

Day 5:

Kafka and Spark Streaming
On Day 5, students will integrate the various components into a full application using Kafka. The formal contents will be kept short in order to let teams explore topics of interest to them.
The lecture portion of the class will cover:

  • Introduction to data pipelines
  • Kafka architecture
  • Lab: Kafka and Spark Streaming
  • Presentations of findings
  • An optional SparkML lab will be provided; students may choose to use SparkML to classify interesting properties of the Crunchbase dataset.