• Spark’s Machine Learning Library (MLlib) provides an array of high-quality distributed Machine Learning (ML) algorithms
• The MLlib library implements a whole suite of statistical and machine learning algorithms (see Notes for details)
• MLlib provides tools for:
  • Building processing workflows (e.g. feature extraction and data transformation),
  • Parameter optimization, and
  • ML model management for model saving and loading
• MLlib applications run on top of Spark and take full advantage of Spark’s distributed in-memory design
• MLlib applications are claimed to run 10X+ faster than implementations of similar algorithms built with Apache Mahout
  • Apache Mahout apps leverage Hadoop’s MapReduce engine
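
The three tool categories listed above (processing workflows, parameter optimization, and model saving/loading) can be sketched with PySpark's DataFrame-based `pyspark.ml` API. This is a minimal sketch rather than a definitive recipe: it assumes PySpark is installed, a SparkSession is already active, and the training DataFrame has `text` and `label` columns. The function names, grid values, and save path are hypothetical.

```python
# Hypothetical grid values, chosen only for illustration.
NUM_FEATURES_GRID = [1000, 10000]
REG_PARAM_GRID = [0.1, 0.01]


def build_cross_validator(num_folds=3):
    """Assemble a text-classification workflow wrapped in a grid search.

    Assumes an active SparkSession; the stages cannot be created without one.
    """
    # Imported lazily so this module still loads where PySpark is absent.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Workflow: feature extraction stages followed by an estimator.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

    # Parameter optimization: every combination in the grid is evaluated.
    grid = (ParamGridBuilder()
            .addGrid(hashing_tf.numFeatures, NUM_FEATURES_GRID)
            .addGrid(lr.regParam, REG_PARAM_GRID)
            .build())
    return CrossValidator(estimator=pipeline,
                          estimatorParamMaps=grid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=num_folds)


def train_and_save(train_df, path):
    """Fit the grid search on train_df and persist the best pipeline model."""
    cv_model = build_cross_validator().fit(train_df)
    cv_model.bestModel.write().overwrite().save(path)
    return cv_model.bestModel


def load_model(path):
    """Model management: reload a previously saved pipeline model."""
    from pyspark.ml import PipelineModel
    return PipelineModel.load(path)
```

With a running session, `train_and_save(train_df, "/tmp/text-model")` would fit four parameter combinations per fold, and `load_model` would return a model whose `transform` method scores new DataFrames.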


Machine Learning Algorithms in Apache Spark


• The following options are available for running Spark applications on a cluster:
  • Spark Stand-alone – Spark’s own cluster management system
    • Limited in terms of configuration options and scalability
  • External cluster management systems (the preferred option for large processing jobs):
    • Hadoop’s YARN
    • Mesos
• Running Spark using a cluster management system aids in computing efficiency, fault-tolerance, and scalability of your data processing solutions
• For development and prototyping, you can run Spark on a single (local) machine (without distributed processing capabilities)
• In all scenarios there is a Driver program (your Spark application or a Spark Shell session) which creates a Spark Context pointing to the Spark Master
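
The deployment options above differ mainly in the master URL that the Driver passes when it creates its Spark Context. The sketch below illustrates this, assuming the conventional default ports (7077 for a stand-alone master, 5050 for Mesos) and a YARN setup whose cluster location comes from the Hadoop configuration. The helper names `master_url` and `create_context` are hypothetical, and PySpark is imported only when a context is actually created.

```python
# Conventional default ports; adjust for your own cluster (assumption).
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(mode, host=None, port=None):
    """Return the Spark master URL for a given deployment mode."""
    if mode == "local":
        # Single machine, one worker thread per core, no cluster manager.
        return "local[*]"
    if mode == "yarn":
        # The cluster is located through the Hadoop configuration files.
        return "yarn"
    if mode in DEFAULT_PORTS:
        if not host:
            raise ValueError(f"mode '{mode}' needs a master host")
        scheme = "spark" if mode == "standalone" else "mesos"
        return f"{scheme}://{host}:{port or DEFAULT_PORTS[mode]}"
    raise ValueError(f"unknown mode: {mode}")


def create_context(app_name, master):
    """Driver side: create a SparkContext pointing at the given master."""
    # Imported lazily so master_url() works without PySpark installed.
    from pyspark import SparkConf, SparkContext
    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)
```

For prototyping, `create_context("demo", master_url("local"))` would start a local context; switching to a cluster only changes the URL, for example `master_url("standalone", "master-host")`.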


To Spark or Not to Spark?



• R is a programming language and environment used for statistical computing and data analysis
• Distributed under the GNU General Public License
• Widely used by statisticians and data miners
• R is supported by a very active user community
• More than five thousand additional packages available at the Comprehensive R Archive Network (CRAN) and other repositories
• R is an interpreted implementation of the S statistical computing language with elements borrowed from the Scheme language
• For computationally intensive tasks, R can leverage C/C++ and FORTRAN routines that can be linked to R and called at run time
• In addition to the command line interface, R has several GUI environments; the primary GUI ships with R itself
• R supports the production of publication-quality statistical graphs


Using R as a tool for Business Analytics