
Data Science and Business Analytics Blog Articles

 

SQL Notebooks in Databricks

August 11, 2022

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3208-programming-on-azure-databricks-with-pyspark-sql-and-scala. In this tutorial, you will learn how to create and use SQL notebooks in Databricks, which enable developers and business users to query data cataloged as tables using standard SQL commands. This tutorial depends on the…

How to unlock a higher salary in IT with Data Science and Data Engineering training

April 8, 2022

Whether you or your employees are just entering the IT field or have been working in the industry for years, two areas of expertise worth considering are data science and data engineering. These quickly growing fields are in high demand and give professionals access to better job titles and higher salaries. On the employer’s side, mastery of these fields also increases the potential to take…

Robust Python Programming Techniques

March 29, 2022

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3174-pragmatic-python-programming. 1.1 Defining Robust Programming: We will define robust programming as a collection of assorted programming techniques, methods, practices, and libraries that can help you…

Learning the CoLab Jupyter Notebook Environment

March 29, 2022

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3174-pragmatic-python-programming. Google Colaboratory (CoLab) is a free Jupyter notebook interactive development environment (REPL) hosted in Google’s cloud that we are going to use in this course. In this tutorial, you will learn about the main features of the Google…

Future Trends in Data Science and Data Engineering

February 7, 2022

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3169-data-science-and-data-engineering-in-2022. 1.1 Big Trends in 2021: 2021 was an incremental year in terms of breakthroughs, with an exponential rise in the demand for data professionals, the…

Defining Data Science for Architects

December 30, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 What is Data Science? Data science focuses on the extraction of knowledge and business insights from data. It does so by leveraging techniques and theories…

Data Visualization in Python for Architects

December 29, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 Why Do I Need Data Visualization? The common wisdom states that seeing is believing and a picture is worth a thousand words. Data visualization…

Introduction to Pandas for Architects

December 29, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 What is pandas? pandas (https://pandas.pydata.org/) is an open-source library that provides high-performance, memory-efficient, easy-to-use data…

Addressing the IT Skills Shortage by Developing Your Own Technical Expertise

September 14, 2021

Developing and maintaining your organization’s technical skills is paramount to the success of your business and to all individuals who are part of your IT team. A recent statement from a Bloomberg article on the state of the IT talent shortage: “Nearly 2-in-3 consulting firms say they’re short-staffed, and 1-in-5 are turning down work as a result, according to a survey from Source Global Research, which provides research and analysis for the professional services industry.”

Creating and Working with Databases in Snowflake

May 3, 2021

This tutorial is adapted from the Web Age course Snowflake Cloud Data Platform from the Ground Up. In this tutorial, you will learn how to create databases, tables, and warehouses in the Snowflake Web UI.

Querying Data in Snowflake

May 3, 2021

In this tutorial, you will learn how to create and run queries in Snowflake. You will also learn about query profiles and views, and we will review the new Snowsight experience UI. According to Snowflake’s documentation, “Snowflake supports standard SQL, including a subset of ANSI SQL:1999 and the SQL:2003 analytic extensions. Snowflake also supports common variations for a number of commands where those variations do not…”

The Snowflake Web UI

April 29, 2021

This tutorial is adapted from the Web Age course Snowflake Cloud Data Platform from the Ground Up Training. In this tutorial, you will familiarize yourself with the Snowflake Web UI (a.k.a. the Web Portal, Snowflake Manager, and Snowflake Console).

Searching with Apache Solr

April 27, 2021

This tutorial is adapted from the Web Age course Apache Solr for Data Engineers. Part 1: Solr Sets. Solr has many capabilities when it comes to searching; of course, this depends on the data being used in the set. As with SQL, it is very important to know and understand the data sets before running high-level queries on them. Let’s work…

How to Repair and Normalize Data with Pandas?

April 5, 2021

This tutorial is adapted from the Web Age course Data Engineering Bootcamp Training Using Python and PySpark. When you embark on a new data engineering, data science, or machine learning project, right off the bat you may be faced with defects in your input dataset, including but not…

Data Visualization and EDA with Pandas and Seaborn

April 3, 2021

This tutorial is adapted from the Web Age course Intermediate Data Engineering with Python. Data visualization is a great vehicle for communicating data analysis results to potentially non-technical stakeholders, as well as a critical activity in exploratory data analysis (EDA). In this tutorial, you will learn about…

How to Use Python’s Functional Programming Capabilities?

March 31, 2021

This tutorial is adapted from the Web Age course Practical Python 3 Programming. In this tutorial, you will learn how to use Python’s functional programming capabilities. Part 1 – Create a New Python 3 Jupyter Notebook

Functional Programming in Python

November 21, 2020

This tutorial is adapted from the Web Age course Introduction to Python Programming. 1.1 What is Functional Programming? Functional programming reduces problems to a set of function calls. The functions used, referred to as pure functions, follow these rules: they only produce a result and do not modify the parameters…
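The two rules quoted in the excerpt above can be illustrated with a minimal sketch (the function names and data are illustrative, not from the course):

```python
# A pure function: only produces a result; its argument is left untouched.
def scaled(values, factor):
    return [v * factor for v in values]

# An impure variant: mutates the list passed in, violating the rules above.
def scale_in_place(values, factor):
    for i in range(len(values)):
        values[i] *= factor

data = [1, 2, 3]
result = scaled(data, 10)
print(result)  # [10, 20, 30]
print(data)    # [1, 2, 3] -- the input was not modified
```

Because pure functions depend only on their inputs, calls like `scaled(data, 10)` can be freely reordered, cached, or parallelized, which is the practical pay-off of the style.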

How to do Data Grouping and Aggregation with Pandas?

October 30, 2020

This tutorial is adapted from the Web Age course Data Engineering Bootcamp Training (Using Python and PySpark). 1.1 Data Aggregation and Grouping: The pandas module offers functions that closely emulate SQL functionality for data grouping, aggregation, and filtering.
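As a quick illustration of that SQL parallel, here is a hedged sketch with toy data (the column names are illustrative, not from the course):

```python
import pandas as pd

# Toy order data.
df = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "sales":  [100,    200,    50,     75,     25],
})

# Roughly: SELECT region, SUM(sales) FROM df GROUP BY region HAVING SUM(sales) > 100
totals = df.groupby("region")["sales"].sum()
print(totals[totals > 100])
```

`groupby` plays the role of `GROUP BY`, the `sum()` aggregation of `SUM()`, and the boolean filter on the result of `HAVING`.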

Comparing Hive with Spark SQL

March 9, 2020

This tutorial is adapted from the Web Age course Hadoop Programming on the Cloudera Platform. In this tutorial, you will work through two functionally equivalent examples/demos – one written in Hive (v. 1.1) and the other written using the PySpark API for the Spark SQL module (v. 1.6) – to see the differences…

Data Visualization with matplotlib and seaborn in Python

March 4, 2020

This tutorial is adapted from the Web Age course Advanced Data Analytics with Pyspark. 1.1 Data Visualization: The common wisdom states that ‘seeing is believing and a picture is worth a thousand words’. Data visualization techniques help users understand the data and its underlying trends and patterns by displaying it in a variety of graphical…

Data Science and ML Algorithms with PySpark

December 11, 2019

This tutorial is adapted from the Web Age course Practical Machine Learning with Apache Spark. 8.1 Types of Machine Learning: There are three main types of machine learning (ML): unsupervised learning, supervised learning, and reinforcement learning. We will be learning only about the unsupervised and supervised learning types. 8.2 Supervised vs. Unsupervised…

Introduction to Jupyter Notebooks

November 25, 2019

This tutorial is adapted from the Web Age course Practical Machine Learning with Apache Spark. 6.1 Python Dev Tools and REPLs: In addition to the standard Python REPL, Python development is supported through these tools and systems: IPython, Jupyter with a Python kernel (runtime), Visual Studio Code’s Python plug-in, PySpark (integrated with the Python REPL)…

Data Visualization in Python using Matplotlib

November 25, 2019

7.1 What is Data Visualization? The common wisdom states that seeing is believing and a picture is worth a thousand words. Data visualization techniques help users understand the data and its underlying trends and patterns by displaying it in a variety of graphical forms (heatmaps, scatter plots, charts, etc.). Data visualization is also a great vehicle for communicating analysis results to stakeholders, and it is an indispensable activity in exploratory data analysis (EDA). Business intelligence…

What is Data Engineering?

November 15, 2019

1.1 Data is King: Data is king, and it outlives applications; applications, in turn, outlive integrations. Organizations striving to become data-driven need to institute efficient, intelligent, and robust ways of processing data. Data engineering addresses many aspects of this process. 1.2 Translating Data into Operational and Business Insights

Distributed Computing Concepts for Data Engineers

November 15, 2019

1.1 The Traditional Client–Server Processing Pattern: This pattern works well for small-to-medium data set sizes; fetching 1 TB of data, however, might take longer than an hour.

PySpark Shell

October 17, 2019

1.1 What is Spark Shell? The Spark Shell offers interactive command-line environments for Scala and Python users. The SparkR shell has so far been thoroughly tested only with Spark Standalone, not with all available Hadoop distros, and is therefore not covered here. The…

Data Ingestion in AWS

October 16, 2019

Multipart Upload Overview: The multipart upload API enables you to upload large objects in parts.
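The AWS-specific API calls aside, the part-splitting idea behind multipart upload can be sketched in plain Python (`chunk_bytes` is an illustrative helper, not an AWS API; the 5 MB part size mirrors S3's minimum part size for all parts except the last):

```python
def chunk_bytes(data: bytes, part_size: int):
    """Yield successive fixed-size parts of a byte string;
    the final part may be smaller."""
    for offset in range(0, len(data), part_size):
        yield data[offset:offset + part_size]

payload = b"x" * (12 * 1024 * 1024)            # a 12 MB object
parts = list(chunk_bytes(payload, 5 * 1024 * 1024))
print([len(p) for p in parts])                 # [5242880, 5242880, 2097152]
```

In a real multipart upload, each part would be sent in a separate (possibly parallel) request, and the service reassembles them once the upload is completed.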

Introduction to PySpark

October 16, 2019

1.1 What is Apache Spark? Apache Spark (Spark) is a general-purpose processing system for large-scale data. Spark is effective for data processing of up to 100s of terabytes on…

Python for Data Science

July 25, 2019

This tutorial is adapted from the Web Age course Applied Data Science with Python. It provides a quick overview of Python modules and high-power features, the NumPy, pandas, SciPy, and scikit-learn libraries, Jupyter notebooks, and the Anaconda distribution.

Python with NumPy and pandas

July 24, 2019

This tutorial is adapted from the Web Age course Applied Data Science with Python. It aims to help you refresh your knowledge of Python and show how Python integrates with the NumPy and pandas libraries. Part 1 – Set up the Environment

Using k-means Machine Learning Algorithm with Apache Spark and R

January 31, 2017

In this post, I will demonstrate the usage of the k-means clustering algorithm in R and in Apache Spark. Apache Spark (hereinafter Spark) offers two implementations of the k-means algorithm: one is packaged with its MLlib library; the other lives in Spark’s spark.ml package. While both implementations are currently more or less functionally equivalent, the Spark ML team recommends using the…
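Independently of either Spark implementation, the core algorithm is simple; here is a minimal pure-Python sketch on 1-D data (not Spark's or R's code, just the assign-then-recenter loop):

```python
def kmeans_1d(points, centers, iterations=10):
    """Minimal k-means on 1-D data: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Recompute centers; drop any center that attracted no points.
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans_1d(points, centers=[1.0, 10.0]))  # [1.5, 10.5]
```

Production implementations add a convergence test, better initialization (e.g. k-means++), and distance computations over full vectors, but the two alternating steps are the same.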

Spark RDD Performance Improvement Techniques (Post 2 of 2)

October 4, 2016

In this post, we will review the more important aspects of RDD checkpointing. We will continue working with the over500 RDD we created in the previous post on caching. You will remember that checkpointing is the process of truncating an RDD’s lineage graph and saving its materialized…

Spark RDD Performance Improvement Techniques (Post 1 of 2)

September 13, 2016

Spark offers developers two simple and quite efficient techniques to improve RDD performance and operations against them: caching and checkpointing. Caching allows you to save a materialized RDD in memory, which greatly improves iterative or multi-pass operations that need to traverse the same data set over and over again (e.g., in machine learning algorithms).
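Spark's `cache()`/`persist()` APIs aside, the pay-off of materializing a result once for multi-pass work can be illustrated in plain Python (the generator stands in for an uncached RDD whose lineage is recomputed on every traversal; all names are illustrative):

```python
def expensive_records():
    """A fresh generator each call: traversing it re-runs the whole
    computation, like re-evaluating an uncached RDD's lineage."""
    return (n * n for n in range(5))

# Without caching: each pass triggers a full recomputation.
pass1 = sum(expensive_records())
pass2 = sum(expensive_records())

# With "caching": materialize once, then traverse the saved result cheaply.
cached = list(expensive_records())
pass3 = sum(cached)
pass4 = max(cached)
print(pass1, pass2, pass3, pass4)  # 30 30 30 16
```

The results are identical either way; what caching changes is that the second and later passes read the materialized data instead of redoing the work.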

SparkR on CDH and HDP

September 13, 2016

Spark added support for R back in version 1.4.1, and you can use it in Spark Standalone mode. Big Hadoop distros that bundle Spark, like Cloudera’s CDH and Hortonworks’ HDP, have varying degrees of support for R. For the time being, CDH has opted out of supporting R (their latest CDH 5.8.x version does not even have sparkR binaries), while HDP (versions 2.3.2, 2.4, …) includes SparkR as a technical-preview technology and bundles some R-related components, like the sparkR script. Making it all…

Simple Algorithms for Effective Data Processing in Java

January 30, 2016

The needs of Big Data processing require specific tools, which nowadays are, in many cases, represented by the Hadoop product ecosystem. When I speak to people who work with Hadoop, they say that their deployments are usually pretty modest: about 20 machines, give or take. This may be because most companies are still in the technology-adoption phase, evaluating this Big Data platform, and with time the number of machines in their Hadoop clusters will probably grow into 3- or even 4-digit…
