WA3247

DataOps for IT Professionals Training

DataOps (Data Operations) is a process-oriented methodology, supported by related technologies, that serves the needs of data analytics teams throughout the entire data lifecycle: from data acquisition, through storage and processing, to consumption (converting data into insights).
Course Details

Duration

1 day

Prerequisites

Practical work experience in data processing environments.

Target Audience

  • Data Engineers
  • Developers
  • Architects
  • Technical Managers 

Skills Gained

  • Ensure a high level of data quality
Course Outline
  • DataOps Introduction
    • DataOps Enterprise Data Technologies
    • Enterprise Data Processing Challenges and IT Systems' Woes:
      • Data Quality
      • What Makes Information Systems Cluttered and Myopic
      • Fragmented Data Sources
      • Different Data Formats
      • System Interoperability 
      • Maintenance Issues
    • Data-Related Roles
    • Data Engineering
    • What is DataOps?
    • The DataOps Technology and Methodology Stack
    • The DataOps Manifesto
    • Agile Development
    • DevOps
    • The Lean Manufacturing Methodology
    • Key Components of a DataOps Platform
    • Overview of DataOps Tools and Services
    • Overview of DataOps Platforms
  • Data Quality
    • Data Quality Definitions
    • Dimensions of Data Quality
    • Defining "Bad" Data
      • Missing Data
      • Wrong/Incorrect Data or Data Format
      • Inconsistent Data
      • Outdated (Stale) Information
      • Unverifiable Data
      • Withheld Data
    • Common Causes for "Bad" Data
      • Human Factor
      • Infrastructure- and Network-Related Issues
      • Software Defects
      • Using the Wrong Tool for the Job
      • Using Untrusted Data
      • Aggregation of Data from Disparate Data Sources with an Impedance Mismatch
      • Wrong QoS Settings of Queueing Systems
      • Wrong Caching System Settings, e.g. TTL
      • Not Using the "Ground Truth" Data
      • Differently Configured Development/UAT/Production Systems
        • How to Eliminate Environment Disparity
      • Confusing Big-Endian and Little-Endian Byte Order
    • Ensuring Data Quality
      • Ensuring Integrity of Datasets (see the sketch after this list)
        • Dataset Checksums:
          • CRC (cyclic redundancy check) as an automatic error-detection mechanism
          • MD5 and SHA-* Hashes 
        • Using Dataset Shapes for Basic Integrity Checks
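
A minimal Python sketch of the checksum and shape checks above; the file name, expected digest, and expected shape are hypothetical placeholders:

    import hashlib
    import zlib

    import pandas as pd

    def file_sha256(path, chunk_size=1 << 20):
        """Stream the file through SHA-256 so large datasets need not fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def file_crc32(path):
        """CRC-32: a fast, non-cryptographic error-detection check."""
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                crc = zlib.crc32(chunk, crc)
        return crc

    EXPECTED_SHA256 = "..."          # hypothetical digest published with the dataset
    EXPECTED_SHAPE = (10_000, 12)    # hypothetical (rows, columns)

    assert file_sha256("sales.csv") == EXPECTED_SHA256, "checksum mismatch"
    assert pd.read_csv("sales.csv").shape == EXPECTED_SHAPE, "unexpected shape"
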
    • Dealing with \"Bad\" Input Data
      • DDL-Enforced Schema vs. Schema-on-Read (a.k.a. Schema-on-Demand)
      • SQL Constraints as Rules for Column-Level and Table-Wide Data
      • XML Schema Definition (XSD) for XML Documents 
      • Validating JSON Documents
      • Regular Expressions
      • Cleansing of Data at Rest
      • Controlling Integrity of Data-in-Transit
      • Database Normalization
        • Normal Forms
        • When to De/normalize
      • Using Assertions in Applications 
      • Operationalizing Input Data Validation
        • Microservices
        • API Management Solutions
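
A short Python sketch of operationalized input validation, combining JSON parsing, a regular-expression check, and application-level assertions; the field names and rules are hypothetical:

    import json
    import re

    EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

    def validate_record(raw):
        """Parse one inbound JSON record and reject 'bad' data early."""
        record = json.loads(raw)   # malformed JSON fails fast here
        assert isinstance(record.get("age"), int), "age must be an integer"
        assert 0 <= record["age"] <= 130, "age out of range"
        if not EMAIL_RE.match(record.get("email", "")):
            raise ValueError("invalid email: %r" % record.get("email"))
        return record

    print(validate_record('{"age": 42, "email": "jo@example.com"}'))

The same checks could sit behind a microservice endpoint or an API management policy so that every data producer passes through them.
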
    • Data Consistency and Availability
      • Example of a Consistency vs Availability Gap: https://www.youtube.com/watch?v=A-brgkkjnHc  
      • The CAP Triangle: Selecting Which System to Use
    • Dealing with Duplicate Data (see the pandas sketch below)
      • At Source
      • In Application
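
For illustration, deduplication "in application" with pandas; the data and column names are hypothetical. "At source", the usual equivalent is a unique constraint in the database itself:

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "email": ["a@example.com", "a@example.com", "b@example.com"],
    })

    # Drop exact duplicate rows, keeping the first occurrence.
    deduped = df.drop_duplicates(keep="first")
    print(deduped)

    # At source, a unique constraint rejects duplicates on insert, e.g.:
    #   ALTER TABLE customers ADD CONSTRAINT uq_customer_email
    #       UNIQUE (customer_id, email);
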
    • Dealing with Missing (NaN) Data
      • Example of Using NumPy and pandas Python Libraries
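
For example, counting, dropping, and imputing missing values in a pandas Series (the values are hypothetical):

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, 3.0, np.nan])

    print(s.isna().sum())       # count the missing values -> 2
    print(s.dropna())           # drop the missing entries
    print(s.fillna(s.mean()))   # or impute them, here with the mean (2.0)
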
    • Master (Authoritative) Data Management 
      • The "Golden Record"/"Ground Truth" Concept
    • Enforcing Data Consistency with the scikit-learn LabelEncoder Class
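
A minimal sketch of LabelEncoder in use; the labels are hypothetical:

    from sklearn.preprocessing import LabelEncoder

    enc = LabelEncoder()
    # Identical labels always map to identical integer codes, which keeps
    # categorical values consistent across a pipeline.
    codes = enc.fit_transform(["apple", "banana", "apple", "cherry"])
    print(codes)                          # [0 1 0 2]
    print(list(enc.classes_))             # ['apple', 'banana', 'cherry']
    print(enc.inverse_transform(codes))   # back to the original labels
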
    • Data Provenance
    • The Event Sourcing Pattern
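
A toy Python sketch of the Event Sourcing pattern, in which state is never stored directly but rebuilt by replaying an append-only event log; the entity and event names are hypothetical:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Event:
        account: str
        amount: int   # positive = deposit, negative = withdrawal

    # The append-only log is the system of record.
    log = [Event("acct-1", 100), Event("acct-1", -30), Event("acct-2", 50)]

    def balance(account):
        """Derive the current state by replaying every event for the account."""
        return sum(e.amount for e in log if e.account == account)

    print(balance("acct-1"))   # 70

Because every state change is retained as an event, the log also doubles as a provenance and audit trail.
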
    • Adopting a Culture of Automation
    • Ongoing Auditing
    • Monitoring and Alerting
    • UiPath
    • Workflow (Pipeline) Orchestration Systems
      • DataOps Data Pipelines
      • Apache NiFi
  • How to Lead with Data
    • Enterprise Architecture Components
      • Business Architecture
      • Information Architecture
      • Application Architecture
      • Technology Architecture
    • DataOps Functional Architecture
    • The Snowflake Data Cloud
    • Cloud Design for System Resiliency
    • New Data Architecture:
      • Data Ownership
      • Shared Environment Security Controls
  • Data Governance [OPTIONAL]
    • The Need for Data Governance
    • Controlling the Decision-Making Process
    • Controlling "Agile IT"
    • Types of Requirements
      • Product
      • Process
    • Scoping Requirements
    • Governance Gotchas
    • Governance Best Practices