There is no consensus on how best to describe data quality -- Wikipedia's contributors offer this opinion [https://en.wikipedia.org/wiki/Data_quality]:

"There are many definitions of data quality, but data is generally considered high quality if it is 'fit for [its] intended uses in operations, decision making and planning'."   

Data quality is a shared concern of an organization's IT and business sides. Ultimately, data quality (or the lack thereof) affects the quality of the analytical work carried out in support of tactical and strategic decision-making, as well as the organization's ability to generate revenue through integrations with partners or by trading its data as a valuable, highly liquid asset.

This hands-on course discusses the core concepts and ideas underlying data quality and introduces the audience to the methods, practices, and techniques used to achieve the much sought-after levels of data quality.

Some of the hands-on exercises and labs use Python, the programming language of choice for data engineers, data scientists, and machine learning practitioners. Familiarity with Python syntax is desirable but not critical: little actual coding is involved (most of the code is already written for you), as the exercises and labs are intended to facilitate the learning process rather than to teach you how to program in Python.

Audience

Business Analysts, Data Engineers, Software Developers, Architects, and Technical Managers 

Prerequisites

Practical work experience in data processing environments.

Duration

Two days

Outline for Practical Data Quality Training

Chapter 1. Data Quality Introduction

  • Data Quality Defined
  • Data Quality Dimensions/Properties
  • Interpreting Data Quality Properties
  • The Typical Data Analytics (Machine Learning) Pipeline
  • Data Quality Assurance
  • Common Factors Contributing to Poor Data Quality
  • Is Bad Data Quality a Good or a Bad Thing?
  • Data Quality is a Shared Concern
  • Data Governance
  • Common Issues that can be Prevented through Effective Governance
  • The Data Steward Role
  • Common Steps to Overcome Data Quality Issues
  • Data Observability
  • Application Performance Monitoring (APM) and Observability Magic Quadrant
  • Example of (Operational) Observability Dashboard
  • Data Quality and Data Observability Relationship
  • Example of an Observability-Enabling Service
  • A Glossary of Business Terms
  • Data Dictionaries
  • Example of a Data Dictionary
  • SLAs
  • SLAs and Non-Functional Requirements
  • The Great, Fast, and Cheap Quality Diagram
  • Summary

Chapter 2. Measuring the Quality of the Data

  • Examples of Data Quality Metrics
  • Measuring Data Quality
  • Common Corrective Measures for Data Quality Problems
  • Descriptive Statistics
  • Correlation
  • Normal Distribution and Z-Score
  • Non-uniformity of a Probability Distribution
  • Shannon Entropy
  • Gini Impurity
  • Example of Using Gini Impurity Formula
  • Confusion Matrix
  • The Binary Classification Confusion Matrix
  • A Binary Classification Confusion Matrix Visually
  • Example of a Confusion Matrix
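
To give a flavor of two of the metrics listed in this chapter, here is a minimal Python sketch of Shannon entropy and Gini impurity applied to a categorical column. It is an illustration only and is not taken from the course labs; the function names and the sample values are assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    total = len(values)
    # Sum of -p * log2(p) over the relative frequency p of each distinct value.
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(values).values())


def gini_impurity(values):
    """Gini impurity: 1 minus the sum of squared relative frequencies."""
    if not values:
        raise ValueError("values must not be empty")
    total = len(values)
    return 1.0 - sum((c / total) ** 2 for c in Counter(values).values())


# Hypothetical sample columns (illustrative data, not from the course).
balanced = ["US", "CA", "US", "CA"]   # two values, evenly split
skewed = ["US", "US", "US", "CA"]     # one value dominates

for name, column in [("balanced", balanced), ("skewed", skewed)]:
    print(name, shannon_entropy(column), gini_impurity(column))
```

Both measures are highest when the values are evenly spread and fall to zero when a column holds a single repeated value, which is why they can serve as a rough signal of non-uniformity in a field.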

Chapter 3. Methods and Techniques for Data Quality

  • Connecting to the Digital Realm
  • States of Digital Data
  • The Methods and Techniques to Ensure Data Quality
  • Maintenance
  • Automation
  • Workflow (Pipeline) Orchestration Systems
  • Example of a Workflow Orchestration System: Apache NiFi
  • NiFi Processor Types
  • Building a Simple Data Flow in the NiFi Designer
  • Logging
  • Logging Levels
  • Data Formats
  • Interoperable Data
  • Timeliness
  • Efficient Storage with Columnar Formats
  • Storage and Querying Efficiencies of the Parquet Columnar Storage Format
  • Assertions
  • The assert Expression in Python
  • Two Types of Errors
  • Runtime Errors/Exceptions
  • Life after an Exception
  • Assertions vs Errors (Exceptions)
  • Data Validation
  • Data Normalization
  • DDL-based Data Validation
  • An SQL DDL Schema with Constraints Example
  • Apache Hive and Schema-on-Demand
  • An Example of Hive DDL
  • XML and JSON Schemas
  • The Schema Production and Consumption Diagram
  • Example of an XSD Schema Authoring Editor
  • Regular Expressions
  • Regular Expressions Elements
  • What is Unit Testing and Why Should I Care?
  • Unit Testing and Test-Driven Development
  • TDD Benefits
  • Testing for Failure
  • Logging and Monitoring
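
As a rough illustration of the "Assertions vs Errors (Exceptions)" and "Data Validation" topics above, the following Python sketch uses exceptions for bad input data and an assert for an internal programming invariant. It is a sketch under assumptions, not course material: the function names, the sample records, and the 0 to 150 age range are all hypothetical.

```python
def parse_age(raw):
    """Validate one externally supplied value; bad data raises ValueError."""
    try:
        age = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"age is not an integer: {raw!r}")
    # Hypothetical business rule chosen for this example.
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return age


def mean_age(ages):
    """Average of already validated ages."""
    # An assert documents an invariant the caller must uphold. Assertions are
    # skipped when Python runs with -O, so they must not guard input data.
    assert len(ages) > 0, "caller must pass at least one age"
    return sum(ages) / len(ages)


def clean(records):
    """Split raw records into valid ages and (raw value, reason) rejections."""
    valid, rejected = [], []
    for raw in records:
        try:
            valid.append(parse_age(raw))
        except ValueError as exc:
            rejected.append((raw, str(exc)))
    return valid, rejected


# Hypothetical raw input with three bad records.
valid, rejected = clean(["34", "abc", "-5", None, "61"])
print(valid)             # ages that passed validation
print(len(rejected))     # number of rejected records
print(mean_age(valid))   # average of the valid ages
```

Rejected records are collected together with the reason for rejection instead of being silently dropped, so they can be logged and reviewed later.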

Chapter 4. Data Consistency

  • The Consistency Consensus
  • The Two-phase Commit (2PC) Protocol
  • The Two-phase Commit (2PC) Protocol Diagram
  • The CAP Theorem
  • Mechanisms for Guaranteeing a Single CAP Property
  • The CAP Triangle
  • Eventual Consistency
  • Example of the Consistency vs Availability Gap
  • How eBay Preempts Possible Database Corruption
  • The Saga Pattern
  • Saga Log and Execution Coordinator
  • The Saga Happy Path
  • A Saga Compensatory Requests Example
  • The Event Sourcing Pattern
  • Event Sourcing Example
  • Applying Efficiencies to Event Sourcing
  • Time Accuracy and Consistency
  • Network Time Protocol (NTP)

Chapter 5. Data Quality Best Practices

  • Best Practices