Python with NumPy and pandas
This tutorial is adapted from the Web Age course Applied Data Science with Python. This tutorial aims at helping you refresh your knowledge of Python and…
Using k-means Machine Learning Algorithm with Apache Spark and R
In this post, I will demonstrate the usage of the k-means clustering algorithm in R and in Apache Spark. Apache Spark (hereinafter Spark) offers two implementations…
Spark RDD Performance Improvement Techniques (Post 2 of 2)
In this post we will review the more important aspects related to RDD checkpointing. We will continue working on the over500 RDD we created in…
Spark RDD Performance Improvement Techniques (Post 1 of 2)
Spark offers developers two simple and quite efficient techniques to improve the performance of RDDs and of operations against them: caching and checkpointing. Caching allows you to save a materialized…
Apache Spark class development complete
Last week I completed development of our 2-day class teaching Apache Spark, which will be integrated into our Big Data and Data Science classes after the…
SparkR on CDH and HDP
Spark added support for R back in version 1.4.1, and you can use it in Spark Standalone mode. Big Hadoop distros, like Cloudera’s CDH and…
Simple Algorithms for Effective Data Processing in Java
The needs of Big Data processing require specific tools which nowadays are, in many cases, represented by the Hadoop product ecosystem. When I speak to…
Spark SQL
In this blog I will show you how to configure and run Spark SQL on the Cloudera Distribution of Hadoop (CDH). I used the QuickStart VM…
Using the k-Nearest Neighbors Algorithm in R
k-Nearest Neighbors is a supervised machine learning algorithm for object classification that is widely used in data science and business analytics. In this post, I…