Spark RDD Performance Improvement Techniques (Post 2 of 2)

October 4, 2016

In this post we will review the more important aspects related to RDD checkpointing. We will continue working on the over500 RDD we created in…

Spark RDD Performance Improvement Techniques (Post 1 of 2)

September 13, 2016

Spark offers developers two simple and quite efficient techniques to improve RDD performance and operations against them: caching and checkpointing. Caching allows you to save a materialized…

Apache Spark class development complete

September 13, 2016

Last week I completed development of our two-day class teaching Apache Spark, which will be integrated in our Big Data and Data Science classes after the…

SparkR on CDH and HDP

September 13, 2016

Spark added support for R back in version 1.4.1, and you can use it in Spark Standalone mode. Big Hadoop distros, like Cloudera’s CDH and…

Simple Algorithms for Effective Data Processing in Java

January 30, 2016

The needs of Big Data processing require specific tools which nowadays are, in many cases, represented by the Hadoop product ecosystem. When I speak to…

Spark SQL

December 24, 2015

In this blog I will show you how to configure and run Spark SQL on Cloudera Distribution of Hadoop (CDH). I used the QuickStart VM…

The Simplest Possible Streaming MapReduce Script

September 10, 2015

With the Hadoop Streaming API you can use any scripting language (Perl, Ruby, Python, etc.) as long as it understands the STDIN / STDOUT channels…

MuleSoft Summit 2015 in Toronto

July 24, 2015

On July 16, I attended a one-day MuleSoft Summit in Toronto. In a nutshell, Mule is trying to add more API management capabilities to their…

Provisioning Tomcat with the Amazon EC2 Service

April 24, 2015

In this blog article, I will walk you through the steps required to quickly provision an instance of the Tomcat web server in the Amazon…

Linux Containers

April 22, 2015

What are Linux Containers? LinuX Containers (LXC) is an OS-level virtualization that allows multiple Linux systems to run on a single physical machine in a…