WA3247
DataOps for IT Professionals Training
DataOps (Data Operations) is a process-oriented methodology, together with its supporting technologies, that serves the needs of data analytics teams throughout the entire data lifecycle: from data acquisition and storage to processing and consumption (converting the data into insights/information).
Course Details
Duration
1 day
Prerequisites
Practical work experience in data processing environments.
Target Audience
- Data Engineers
- Developers
- Architects
- Technical Managers
Skills Gained
Ensure a high level of data quality.
Course Outline
- DataOps Introduction
- DataOps Enterprise Data Technologies
- Enterprise Data Processing Challenges and IT Systems' Woes:
- Data Quality
- What Makes Information Systems Cluttered and Myopic
- Fragmented Data Sources
- Different Data Formats
- System Interoperability
- Maintenance Issues
- Data-Related Roles
- Data Engineering
- What is DataOps?
- The DataOps Technology and Methodology Stack
- The DataOps Manifesto
- Agile Development
- DevOps
- The Lean Manufacturing Methodology
- Key Components of a DataOps Platform
- Overview of DataOps Tools and Services
- Overview of DataOps Platforms
- Data Quality
- Data Quality Definitions
- Dimensions of Data Quality
- Defining "Bad" Data
- Missing Data
- Wrong/Incorrect Data or Data Format
- Inconsistent Data
- Outdated (Stale) Information
- Unverifiable Data
- Withheld Data
- Common Causes for "Bad" Data
- Human Factor
- Infrastructure- and Network-Related Issues
- Software Defects
- Using the Wrong Tool for the Job
- Using Untrusted Data
- Aggregation of Data from Disparate Data Sources that have Impedance Mismatch
- Wrong QoS Settings of Queueing Systems
- Wrong Caching System Settings, e.g. TTL
- Not Using the "Ground Truth" Data
- Differently Configured Development/UAT/Production Systems
- How to Eliminate Environment Disparity
- Confusing Big-Endian and Little-Endian Byte Order
- Ensuring Data Quality
- Ensuring Integrity of Datasets
- Dataset Checksums:
- CRC (cyclic redundancy check) as an automatic error-detection mechanism
- MD5 and SHA-* Hashes
- The Dataset Shapes for Basic Integrity Checks
- Dealing with "Bad" Input Data
- DDL-enforced Schema & Schema-on-Demand (-on-Read)
- SQL Constraints as Rules for Column-Level and Table-Wide Data
- XML Schema Definition (XSD) for XML Documents
- Validating JSON Documents
- Regular Expressions
- Data Cleansing of Data at Rest
- Controlling Integrity of Data-in-Transit
- Database Normalization
- Normal Forms
- When to De/normalize
- Using Assertions in Applications
- Operationalizing Input Data Validation
- Microservices
- API Management Solutions
- Data Consistency and Availability
- Example of a Consistency vs Availability Gap: https://www.youtube.com/watch?v=A-brgkkjnHc
- The CAP Triangle: Selecting Which System to Use
- Dealing with Duplicate Data
- At Source
- In Application
- Dealing with Missing (NaN) Data
- Example of Using NumPy and pandas Python Libraries
- Master (Authoritative) Data Management
- The "Golden Record"/"Ground Truth" Concept
- Enforcing Data Consistency with the scikit-learn LabelEncoder Class
- Data Provenance
- The Event Sourcing Pattern
- Adopting the Culture of Automation
- Ongoing Auditing
- Monitoring and Alerting
- UiPath
- Workflow (Pipeline) Orchestration Systems
- DataOps Data Pipelines
- Apache NiFi
- How to Lead with Data
- Enterprise Architecture Components
- Business Architecture
- Information Architecture
- Application Architecture
- Technology Architecture
- DataOps Functional Architecture
- The Snowflake Data Cloud
- Cloud Design for System Resiliency
- New Data Architecture:
- Data Ownership
- Shared Environment Security Controls
- Data Governance [OPTIONAL]
- The Need for Data Governance
- Controlling the Decision-Making Process
- Controlling "Agile IT"
- Types of Requirements
- Product
- Process
- Scoping Requirements
- Governance Gotchas
- Governance Best Practices
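
As a small illustration of the "Dataset Checksums" and "Dataset Shapes for Basic Integrity Checks" topics in the outline above, the following Python sketch fingerprints a CSV payload with CRC32, MD5, and SHA-256 and records its (rows, columns) shape. It is not taken from the course materials: the function names `dataset_fingerprint` and `verify`, the sample data, and the assumption that the dataset is a UTF-8 CSV with a header row are all hypothetical choices made for this example.

```python
import csv
import hashlib
import io
import zlib


def dataset_fingerprint(raw: bytes) -> dict:
    """Return checksums and the (rows, columns) shape of a CSV payload.

    Assumes (for this sketch) UTF-8 text with a header row on the first line.
    """
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
    header, body = rows[0], rows[1:]
    return {
        "crc32": zlib.crc32(raw),                    # fast detection of accidental corruption
        "md5": hashlib.md5(raw).hexdigest(),         # legacy hash, still common for file checks
        "sha256": hashlib.sha256(raw).hexdigest(),   # stronger hash from the SHA-2 family
        "shape": (len(body), len(header)),           # data rows x columns
    }


def verify(raw: bytes, expected: dict) -> list:
    """Return the names of the checks that do not match the expected fingerprint."""
    actual = dataset_fingerprint(raw)
    return [name for name in expected if actual[name] != expected[name]]


# Hypothetical sample data: the producer publishes a fingerprint with the dataset.
original = b"id,amount\n1,10.5\n2,7.25\n"
expected = dataset_fingerprint(original)

# A truncated transfer loses the last row; the consumer detects the mismatch.
truncated = b"id,amount\n1,10.5\n"
failed = verify(truncated, expected)
print(failed)
```

The shape check is cheap and catches truncated or mis-joined datasets early, while the hashes detect any byte-level change. CRC32 is meant for accidental errors only; it should not be relied on where deliberate tampering is a concern.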