Posted by Mikhail Vladimirov in Big Data, Data Science and Business Analytics on January 31, 2017
k-means clustering is an example of an unsupervised ML algorithm: the only hint you have to give the computer is how many clusters (classes of objects) you expect to find in your data set. The algorithm then uses your data as the training data set to build a model and tries to figure out the boundaries of those clusters. After that, you can proceed to the classification phase with your test data.
With k-means, you essentially have your computer (or a cluster of computers) partition your data into Voronoi cells, where each cell represents one of the identified clusters.
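To make the idea concrete, here is a minimal sketch of k-means in plain Python (Lloyd's iteration). It is an illustration under simple assumptions, not the implementation used in the full post: the function names, the sample points, and the fixed random seed are all made up for this example. Assigning each point to its nearest centroid is exactly what carves the data into Voronoi cells.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the closest centroid, i.e. the Voronoi cell the point falls in."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Return k centroids found by alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: put every point into the cell of its nearest centroid.
        cells = [[] for _ in range(k)]
        for p in points:
            cells[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cell
        # (an empty cell keeps its previous centroid).
        updated = [mean(cell) if cell else centroids[i]
                   for i, cell in enumerate(cells)]
        if updated == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = updated
    return centroids


# Hypothetical training data: two visibly separate groups of 2-D points.
training = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
            (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids = kmeans(training, k=2)
labels = [nearest(p, centroids) for p in training]

# "Classification phase": a new point gets the label of its nearest centroid.
new_label = nearest((0.0, 0.0), centroids)
```

Note that the cluster numbers themselves (0 or 1) are arbitrary; only the grouping is meaningful.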
Posted by Mikhail Vladimirov in Big Data, Data Science and Business Analytics on October 4, 2016
You will remember that checkpointing is the process of truncating an RDD’s lineage graph and saving its materialized version to a persistent store.
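The definition above can be sketched with a toy model in plain Python. This is not the Spark API: the `Dataset` class, its methods, and the pickle file name are hypothetical stand-ins chosen for illustration. Also note one simplification: the toy `checkpoint()` writes immediately, whereas in Spark an RDD is only marked for checkpointing and is saved when an action runs.

```python
import os
import pickle
import tempfile


class Dataset:
    """Toy stand-in for an RDD: either source data or a parent plus a function."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent   # previous step in the lineage, if any
        self.fn = fn           # transformation applied to the parent's elements
        self.data = data       # source data for a root dataset
        self.path = None       # set once the dataset has been checkpointed

    def map(self, fn):
        """Record a new transformation; nothing is computed yet."""
        return Dataset(parent=self, fn=fn)

    def lineage_depth(self):
        """Number of transformations needed to rebuild this dataset."""
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def collect(self):
        """Materialize the data, from the checkpoint file if one exists."""
        if self.path is not None:
            with open(self.path, "rb") as f:
                return pickle.load(f)
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

    def checkpoint(self, directory):
        """Save the materialized data and cut the link to the parents."""
        materialized = self.collect()
        path = os.path.join(directory, "checkpoint.pkl")
        with open(path, "wb") as f:
            pickle.dump(materialized, f)
        self.path = path
        self.parent = None     # lineage is truncated here
        self.fn = None


with tempfile.TemporaryDirectory() as store:
    ds = Dataset(data=[1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
    depth_before = ds.lineage_depth()   # two recorded transformations
    ds.checkpoint(store)
    depth_after = ds.lineage_depth()    # lineage no longer needed
    restored = ds.collect()             # read back from the store
```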
Posted by Mikhail Vladimirov in Big Data, Data Science and Business Analytics on September 13, 2016
Caching allows you to save a materialized RDD in memory, which greatly speeds up iterative or multi-pass operations that need to traverse the same data set over and over again (e.g., in machine learning algorithms).
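Why this helps multi-pass work can be shown with a small plain-Python toy. It is a sketch of the idea only, not Spark code: the `LazyDataset` class and its `evaluations` counter are hypothetical names invented for this example. Without caching, every pass recomputes the data from its recipe; with caching, the first pass materializes it and later passes reuse the in-memory copy.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it holds a recipe for the data, not the data."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that builds the data
        self._use_cache = False
        self._cached = None
        self.evaluations = 0       # how many times the recipe has been run

    def map(self, fn):
        """Describe a derived dataset; nothing is computed until collect()."""
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        """Ask for the materialized data to be kept in memory after first use."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the data, recomputing it unless a cached copy is available."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


source = LazyDataset(lambda: list(range(5)))

# Three passes over an uncached dataset run the recipe three times.
uncached = source.map(lambda x: x * x)
for _ in range(3):
    uncached.collect()

# Three passes over a cached dataset run the recipe only once.
cached = source.map(lambda x: x * x).cache()
for _ in range(3):
    squares = cached.collect()
```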
Posted by Mikhail Vladimirov in Big Data, Data Science and Business Analytics on September 13, 2016
Last week I completed development of our two-day Apache Spark class, which will be integrated into our Big Data and Data Science classes after the QA cycle.
I will be posting some fragments of the material, with additional comments and notes, to help you get a taste of what the new content is all about and see whether it can help you in your work.
Stay tuned!
Posted by Mikhail Vladimirov in Big Data, Data Science and Business Analytics on September 13, 2016
Spark added support for R back in version 1.4, and you can use it in Spark Standalone mode.
Big Hadoop distros that bundle Spark, like Cloudera’s CDH and Hortonworks’ HDP, have varying degrees of support for R. For the time being, CDH has opted out of supporting R (the latest CDH 5.8.x version does not even include the sparkR binaries), while HDP (versions 2.3.2, 2.4, … ) includes SparkR as a technical preview and bundles some R-related components, like the sparkR script. Making it all work (if that is possible at all at present) is another story, and making it run on YARN may be a whole novel the size of War and Peace. So you can view this more as a demonstration of Hortonworks’ commitment to Spark, and we are left with the original supported language triad: Scala, Python, and Java.
Posted by Mikhail Vladimirov in Big Data, Data Science and Business Analytics, Java on January 29, 2016
The needs of Big Data processing require specific tools which nowadays are, in many cases, represented by the Hadoop product ecosystem.
When I speak to people who work with Hadoop, they say that their deployments are usually pretty modest: about 20 machines, give or take. This is probably because most companies are still in the technology adoption phase, evaluating this Big Data platform; with time, the number of machines in their Hadoop clusters will probably grow into the 3- or even 4-digit range.
Development on Hadoop is becoming more agile, with shorter execution cycles: Apache Tez, Cloudera’s Impala, and Databricks’ Spark are some of the technologies that help along the way.
Posted by Mikhail Vladimirov in Big Data, Data Science and Business Analytics, Scala on December 24, 2015
In this blog I will show you how to configure and run Spark SQL on Cloudera Distribution of Hadoop (CDH). I used the QuickStart VM version 5.4 running Spark SQL version 1.3 from inside the Spark shell (Scala REPL).
Posted by Mikhail Vladimirov in Data Science and Business Analytics on November 24, 2013
k-Nearest Neighbors is a supervised machine learning algorithm for object classification that is widely used in data science and business analytics.
In this post, I will show how to use R’s knn() function, which implements the k-Nearest Neighbors (kNN) algorithm, in a simple scenario that you can extend to cover more complex and practical cases. R is free and kNN has not been patented by some evil patent trolls (“patent assertion entities”), so there are no legal or other restrictions to stop us from going ahead with the demonstration.
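For readers who want to see the mechanics before reaching for a library, here is a minimal kNN sketch in plain Python. It is not R's knn() and does not mirror its arguments; the function name `knn_classify`, the sample measurements, and the labels are assumptions made up for this illustration. The idea is simply to find the k labeled training points closest to the query and take a majority vote.

```python
import math
from collections import Counter


def knn_classify(train, labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point with its label and sort by distance to the query.
    by_distance = sorted(zip(train, labels),
                         key=lambda pair: math.dist(pair[0], query))
    # Count the labels of the k closest neighbors and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training set: (height, weight) style measurements.
train = [(1.0, 1.2), (1.1, 0.9), (0.8, 1.0),
         (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]
labels = ["small", "small", "small", "large", "large", "large"]

prediction_a = knn_classify(train, labels, (1.0, 1.0), k=3)
prediction_b = knn_classify(train, labels, (5.0, 5.0), k=3)
```

An odd k is a common choice for two-class problems because it avoids tied votes. `math.dist` requires Python 3.8 or later.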
Copyright © 2012-2017 Web Age Solutions Inc.