Providing Technology Training and Mentoring For Modern Technology Adoption
In this post, I will demonstrate the usage of the k-means clustering algorithm in R and in Apache Spark. Apache Spark (hereinafter Spark) offers two implementations of the k-means algorithm: one is packaged with its MLlib library; the other exists in Spark’s spark.ml package. While both implementations are currently more or less functionally equivalent, the Spark
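For readers who want to see what k-means does before reaching for a library, here is a minimal pure-Python sketch of the classic assign-then-update loop (Lloyd's algorithm). It is not the R, MLlib, or spark.ml implementation; the function names `kmeans` and `squared_distance`, the fixed iteration cap, and the seeded random initialization are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Hypothetical helper: cluster a list of numeric tuples into k groups.

    Returns (centers, labels), where labels[i] is the index of the
    center closest to points[i].
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct input points

    def assign(current_centers):
        # Assignment step: each point goes to its nearest center.
        return [
            min(range(k), key=lambda i: squared_distance(p, current_centers[i]))
            for p in points
        ]

    for _ in range(iterations):
        labels = assign(centers)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers

    return centers, assign(centers)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

The library implementations add better initialization and distributed execution, but the two alternating steps are the same idea.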
In this post we will review the more important aspects of RDD checkpointing. We will continue working on the over500 RDD we created in the previous post on caching. You will remember that checkpointing is the process of truncating an RDD’s lineage graph and saving its materialized version to a persistent store.
Spark offers developers two simple and quite efficient techniques to improve the performance of RDDs and of operations against them: caching and checkpointing. Caching allows you to save a materialized RDD in memory, which greatly speeds up iterative or multi-pass operations that need to traverse the same data set over and over again (e.g., in machine learning algorithms).
Last week I completed development of our two-day class teaching Apache Spark, which will be integrated into our Big Data and Data Science classes after the QA cycle. I will be sharing some fragments of the material with additional comments and notes that should help you get a taste of what the new content is all
Spark added support for R back in version 1.4.1, and you can use it in Spark Standalone mode. Big Hadoop distros that bundle Spark, like Cloudera’s CDH and Hortonworks’ HDP, have varying degrees of support for R. For the time being, CDH decided to opt out of supporting R (their latest CDH 5.8.x version does
The needs of Big Data processing require specific tools which nowadays are, in many cases, represented by the Hadoop product ecosystem. When I speak to people who work with Hadoop, they say that their deployments are usually pretty modest: about 20 machines, give or take. It may account for the fact that most companies are
In this blog I will show you how to configure and run Spark SQL on Cloudera Distribution of Hadoop (CDH). I used the QuickStart VM version 5.4 running Spark SQL version 1.3 from inside the Spark shell (Scala REPL).
With the Hadoop Streaming API you can use any scripting language (Perl, Ruby, Python, etc.) as long as it understands the STDIN / STDOUT channels (they all do, directly or via an I/O library). The corresponding runtimes of these languages must be present on all data nodes in your cluster to process the input data
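To make the STDIN / STDOUT contract concrete, here is a minimal word-count sketch in Python. It assumes tab-separated key/value records and that the reducer receives its input sorted by key, which is how streaming jobs normally deliver mapper output. The function names and the "map" / "reduce" command-line argument are hypothetical conventions chosen for this example, not part of the Streaming API.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer logic: sum the counts of each key.

    Assumes the input records arrive sorted by key, so equal keys
    are adjacent and can be grouped in a single pass.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: run the same script with "map" or "reduce"
# as its first argument; it then reads STDIN and writes STDOUT.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Because the script only talks to STDIN and STDOUT, you can try it locally with a shell pipeline that sorts the mapper output before feeding it to the reducer.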