Providing Technology Training and Mentoring For Modern Technology Adoption
In this post, I will demonstrate the usage of the k-means clustering algorithm in R and in Apache Spark.Apache Spark (hereinafter Spark) offers two implementations of k-means algorithm: one is packaged with its MLlib library; the other one exists in Spark’s spark.ml package. While both implementations are currently more or less functionally equivalent, the Spark
In this post we will review the more important aspects related to RDD checkpointing. We will continue working on the over500 RDD we created in the previous post on caching. You will remember that checkpointing is a process of truncating an RDD’s lineage graph and saving its materialized version on a persistence store.
Spark offers developers two simple and quite efficient techniques to improve RDD performance and operations against them: caching and checkpointing. Caching allows you to save a materialized RDD in memory, which greatly improves iterative or multi-pass operations that need to traverse the same data set over and over again (e.g. in machine learning algorithms.)
Last week I completed development of our 2 day class teaching Apache Spark which will be integrated in our Big Data and Data Science classes after the QA cycle. I will be feeding some fragments of the material with additional comments and notes that would help you get a taste of what the new content is all
Spark added support for R back in version 1.4.1. and you can use it in Spark Standalone mode. Big Hadoop distros, like Cloudera’s CDH and Hortonworks’ HDP that bundle Spark, have varying degree of support for R. For the time being, CDH decided to opt out of supporting R (their latest CDH 5.8.x version does
The needs of Big Data processing require specific tools which nowadays are, in many cases, represented by the Hadoop product ecosystem. When I speak to people who work with Hadoop, they say that their deployments are usually pretty modest: about 20 machines, give or take. It may account for the fact that most companies are
In this blog I will show you how to configure and run Spark SQL on Cloudera Distribution of Hadoop (CDH). I used the QuickStart VM version 5.4 running Spark SQL version 1.3 from inside the Spark shell (Scala REPL).
k-Nearest Neighbors is a supervised machine learning algorithm for object classification that is widely used in data science and business analytics. In this post, I will show how to use R’s knn() function which implements the k-Nearest Neighbors (kNN) algorithm in a simple scenario which you can extend to cover your more complex and practical