Defining Data Science for Architects

December 30, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 What is Data Science? Data science focuses on the extraction of knowledge and business…

Introduction to Pandas for Architects

December 29, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 What is pandas? pandas (https://pandas.pydata.org/) is an open-source library that provides high-performance, memory-efficient, easy-to-use…

Data Visualization in Python for Architects

December 29, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 Why Do I Need Data Visualization? The common wisdom states that: Seeing is believing…

An AWS CLI / Node.js Script for Terminating EC2 Instances

June 6, 2017

The AWS Command Line Interface (CLI) is a powerful scripting platform written in Python that uses the AWS Cloud’s RESTful management API for performing various…

Using k-means Machine Learning Algorithm with Apache Spark and R

January 31, 2017

In this post, I will demonstrate the usage of the k-means clustering algorithm in R and in Apache Spark. Apache Spark (hereinafter Spark) offers two implementations…

Spark RDD Performance Improvement Techniques (Post 2 of 2)

October 4, 2016

In this post we will review the more important aspects related to RDD checkpointing. We will continue working on the over500 RDD we created in…

Spark RDD Performance Improvement Techniques (Post 1 of 2)

September 13, 2016

Spark offers developers two simple and quite efficient techniques to improve the performance of RDDs and of operations against them: caching and checkpointing. Caching allows you to save a materialized…

Apache Spark class development complete

September 13, 2016

Last week I completed development of our 2-day class teaching Apache Spark, which will be integrated in our Big Data and Data Science classes after the…

SparkR on CDH and HDP

September 13, 2016

Spark added support for R back in version 1.4.1, and you can use it in Spark Standalone mode. Big Hadoop distros, like Cloudera’s CDH and…

Simple Algorithms for Effective Data Processing in Java

January 30, 2016

The needs of Big Data processing require specific tools which nowadays are, in many cases, represented by the Hadoop product ecosystem. When I speak to…