What you NEED to know about Data Engineering on Microsoft Azure in 2023
Traditional job titles like database administrator, database developer, and business intelligence developer have evolved.
Data in modern systems involves the 3 Vs: volume, velocity, and variety.
Understanding when to use which data store is critical, since modern systems
frequently have massive data (i.e., big data) and streaming requirements.
This is where data engineering enters the picture.
A data engineer needs to be familiar with the many alternatives for storing and manipulating data.
There are three types of data, and Microsoft Azure offers a wide range of data platform technologies
to fulfill the demands of these different types of data.
Structured data is data that follows a schema, which means that all of the data has the same fields or properties.
Structured data can be kept in a table with rows and columns in a database.
Semi-structured data cannot be properly organized into tables, rows, and columns.
Semi-structured data uses _tags_ or _keys_ to organize and structure the data.
Examples of semi-structured data include XML and JSON.
Unstructured data refers to data that does not have a predefined structure.
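To make the distinction concrete, here is a minimal sketch (the record and field names are illustrative) showing the same customer data as structured rows versus a semi-structured JSON document:

```python
import json

# Structured: every record has the same fixed columns, like rows in a SQL table.
structured_rows = [
    ("C001", "Avery", "Seattle"),
    ("C002", "Jordan", "Austin"),
]

# Semi-structured: keys act as tags, and records may differ in shape.
semi_structured = {
    "id": "C001",
    "name": "Avery",
    "orders": [  # nested data that does not fit a flat row
        {"sku": "A-100", "qty": 2},
    ],
}

doc = json.dumps(semi_structured)
print(json.loads(doc)["orders"][0]["qty"])  # keys let us navigate the structure
```

The nested `orders` list is exactly the kind of shape that would be awkward to flatten into a table but is natural in JSON or XML.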
NoSQL databases are classified into four types:
- Key-Value Stores
- Document Databases
- Graph Databases
- Column-Family Stores
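As a rough illustration of the first of these models (plain Python, not any specific Azure API), a key-value store behaves like a dictionary keyed by a single lookup key:

```python
# A toy in-memory key-value store illustrating the NoSQL key-value model.
# Real services (e.g., Azure Table storage or Cosmos DB) add partitioning,
# replication, and durability on top of this basic get/put contract.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, value) -> None:
        self._data[key] = value  # the value is opaque to the store

    def get(self, key: str, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("session:42", {"user": "avery", "cart": ["A-100"]})
print(store.get("session:42"))
```

The store never inspects the value; all lookups go through the key, which is why key-value stores scale well but support only simple access patterns.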
What to Use for Your Data?
As data engineers, we have several Azure services available to store and process data:
Azure Storage
Azure Data Lake Storage
Azure Databricks
Azure Cosmos DB
Azure SQL Database
Azure Synapse Analytics (formerly Azure SQL Data Warehouse)
Azure Stream Analytics
Azure Data Factory
Azure HDInsight
Azure Data Catalog
Let's explore the basics of what these options are and when to use what option.
1. Azure Storage
Azure Storage, or a storage account, is useful when you need a low-cost, high-throughput data store.
It can be used to store NoSQL data.
If you are coming from the traditional business intelligence developer/dba/database developer background, you can use this service to store files, such as CSV, Excel, and XML.
This service offers various techniques to store data, such as containers, file shares, tables, and queues. It can also be used as an HDInsight Hadoop data store.
2. Data Lake Storage
Data Lake Storage is an extension of the Azure Storage/storage account.
This service is also useful when you need a low-cost and high throughput data store.
It can also be used as a data store for Azure Databricks, HDInsight, and IoT workloads.
3. Azure Databricks
Azure Databricks makes the deployment of a Spark-based cluster easier.
This service provides a fast, collaborative environment for big data processing and machine learning workloads.
Azure Databricks can be utilized both by data engineers and data scientists.
Azure Databricks provides integration with other Azure Services and Power BI.
Learn about Azure Cosmos DB and how to configure it.
As part of this course, you will learn:
- Introduction to Azure Cosmos DB
- Select appropriate Cosmos DB APIs
- Set up replicas in Cosmos DB
- Comparison with AWS DynamoDB
4. Azure Cosmos DB
Azure Cosmos DB is a globally distributed database service for semi-structured (NoSQL) data.
Azure Cosmos DB offers multiple database APIs, which include the Core (SQL) API, API for MongoDB, Cassandra API, Gremlin API, and Table API. By using these APIs, you can model real-world data using documents, key-value, graphs, and column-family data models.
These APIs allow your applications to treat Azure Cosmos DB as if it were one of various other database technologies, without the overhead of management and scaling.
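As a hedged sketch (plain Python data structures, not the Cosmos DB SDK), the same fact can be represented under the document, key-value, and graph data models these APIs expose:

```python
# One fact -- "Avery lives in Seattle" -- in three NoSQL data models.

# Document model (Core SQL / MongoDB APIs): a self-contained JSON-like record.
document = {"id": "person-1", "name": "Avery", "city": "Seattle"}

# Key-value model (Table API): a single key maps to an opaque value.
key_value = {"person-1": '{"name": "Avery", "city": "Seattle"}'}

# Graph model (Gremlin API): vertices connected by labeled edges.
vertices = {"person-1": {"label": "person", "name": "Avery"},
            "city-1": {"label": "city", "name": "Seattle"}}
edges = [("person-1", "livesIn", "city-1")]

# Each model answers "where does Avery live?" differently.
print(document["city"])
print([dst for src, label, dst in edges if src == "person-1" and label == "livesIn"])
```

The point is that the stored fact is the same; the API you pick determines how your application navigates it.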
Here are some of the prominent characteristics of Azure Cosmos DB:
- Millisecond query response time
- 99.999% availability of data
- Worldwide elastic scale of both storage and throughput
- Multiple consistency levels to control data integrity with concurrency
5. Azure SQL Database
If you are coming from a traditional database administrator/database developer/BI developer background, this is the easiest service to understand.
Azure SQL Database is a relational data store.
This service supports transactional (OLTP) workloads.
This service supports elastic scalability and a high volume of inserts and reads.
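To illustrate the transactional (OLTP) pattern, here is a minimal sketch using Python's built-in sqlite3 as a local stand-in for Azure SQL Database (the table and column names are made up):

```python
import sqlite3

# sqlite3 stands in for Azure SQL Database here; in production you would
# connect with an Azure SQL driver, but the transactional pattern is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER)")

# OLTP workloads are many small, atomic reads and writes.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", ("A-100", 2))
    conn.execute("UPDATE orders SET qty = qty + 1 WHERE sku = ?", ("A-100",))

row = conn.execute("SELECT sku, qty FROM orders WHERE sku = ?", ("A-100",)).fetchone()
print(row)  # ("A-100", 3)
```

Either both statements in the transaction take effect or neither does, which is the core guarantee OLTP systems provide.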
Learn to design a Modern Data Warehouse using Azure Synapse Analytics and how to secure a data warehouse in Azure Synapse Analytics.
As part of this course, you will learn:
- Design a Modern Data Warehouse using Azure Synapse Analytics
- Secure a data warehouse in Azure Synapse Analytics
- Manage files in an Azure data lake
- Secure files stored in an Azure data lake
6. Azure Synapse Analytics
Azure Synapse Analytics is useful when you want to manage data warehousing and analytical workloads.
This service can also be used when you require an integrated relational and big data store.
It is a low-cost storage solution.
You can pause and resume computing resources for Azure Synapse Analytics to save costs even further when you don’t plan to use the service.
It can be scaled elastically.
The service has an integrated workbench that allows you to perform the following operations:
1. Data Ingestion
2. Data Exploration
3. Data Analysis
4. Data Visualization
7. Azure Stream Analytics
Traditional business intelligence solutions used to be static. Modern systems often require data streaming in real-time.
Azure Stream Analytics is useful when you require a fully managed event processing engine and analysis of streaming data.
It can also be combined with the Azure IoT service to analyze streaming data.
Stream Analytics Query Language can be used to query the streaming data.
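As a rough illustration of what a Stream Analytics tumbling-window aggregation computes (plain Python, not the service itself; the event times and window size are made up):

```python
from collections import defaultdict

# Events from a stream: (timestamp_seconds, sensor_id, reading).
events = [
    (1, "s1", 10.0), (3, "s1", 14.0),   # falls in window [0, 5)
    (6, "s1", 20.0), (9, "s1", 22.0),   # falls in window [5, 10)
]

# A 5-second tumbling window: fixed, non-overlapping buckets, conceptually
# like GROUP BY with a tumbling window in Stream Analytics Query Language.
WINDOW = 5
buckets = defaultdict(list)
for ts, sensor, value in events:
    window_start = (ts // WINDOW) * WINDOW
    buckets[(window_start, sensor)].append(value)

averages = {k: sum(v) / len(v) for k, v in buckets.items()}
print(averages)  # {(0, 's1'): 12.0, (5, 's1'): 21.0}
```

Each event lands in exactly one window, so the per-window averages can be emitted as soon as a window closes, which is what makes this pattern suitable for real-time dashboards.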
Learn to perform data integration with Azure Data Factory and to perform code-free transformation at scale with Azure Data Factory.
As part of this course, you will learn:
- Data integration with Azure Data Factory or Azure Synapse Pipelines
- Code-free transformation at scale with Azure Data Factory or Azure Synapse Pipelines
- Execute code-free transformations at scale with Azure Synapse Pipelines
- Create data pipeline to import poorly formatted CSV files
- Create Mapping Data Flows
8. Azure Data Factory
If you are coming from a traditional business intelligence background then you might have used SQL Server Integration Services (SSIS) to create ETL pipelines.
Azure Data Factory is similar to SSIS for modern cloud-based systems.
This service can be used to connect to a wide range of data platforms, transform data, and orchestrate the batch movement of data.
It can also be integrated with SSIS packages.
Azure Data Factory is also integrated into Azure Synapse Analytics.
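To make the ETL idea concrete, here is a minimal batch pipeline sketch in plain Python (the file layout and cleanup rules are illustrative); Azure Data Factory orchestrates the same extract-transform-load steps across cloud data stores:

```python
import csv
import io

# Extract: read rows from a messy CSV source (stray whitespace, mixed case).
raw = io.StringIO("sku,qty\n a-100 ,2\n b-200 ,5\n")
rows = list(csv.reader(raw))
header, data = rows[0], rows[1:]

# Transform: clean each field, the way a mapping/transformation step would.
cleaned = [{"sku": sku.strip().upper(), "qty": int(qty)} for sku, qty in data]

# Load: write the cleaned rows to a destination (here, an in-memory sink).
sink = io.StringIO()
writer = csv.DictWriter(sink, fieldnames=["sku", "qty"])
writer.writeheader()
writer.writerows(cleaned)
print(sink.getvalue())
```

In a real pipeline the source and sink would be cloud data stores and the orchestration, scheduling, and retries would be handled by the service rather than by your script.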
9. Azure HDInsight
Azure HDInsight is useful when you need a low-cost, high-throughput storage solution for NoSQL data.
This service provides a Hadoop platform-as-a-service that supports Hadoop, HBase, Storm, and Kafka.
10. Azure Data Catalog
Maintaining several data sources can become challenging.
To make things easier, you can annotate data sources with descriptive metadata.
Azure Data Catalog is useful when you require documentation of your data stores.
This service also helps users discover data sources by searching the metadata.
DP-203: Data Engineering on Microsoft Azure
As you learned in this article, data engineering on Azure can be quite daunting since there are several technologies available to data engineers.
The DP-203: Data Engineering on Microsoft Azure course is a four-day course that helps you understand the various data storage solutions and create an integrated solution that utilizes a variety of data sources.
In the DP-203 Data Engineering on Microsoft Azure course, you will learn the various ingestion techniques that can be used to load data, whether using the Apache Spark capability found in Azure Synapse Analytics or Azure Databricks, or using Azure Data Factory or Azure Synapse pipelines.
Students will also learn the various ways they can transform the data using the same technologies used to ingest it.
They will spend time learning how to monitor and analyze the performance of analytical systems so that they can optimize the performance of data loads and of queries issued against those systems.
They will understand the importance of implementing security to ensure that data is protected at rest and in transit.
Finally, students will see how the data in an analytical system can be used to create dashboards or build predictive models in Azure Synapse Analytics.
Certification Exam DP-203: Data Engineering on Microsoft Azure
Obtaining a certification in the subject is also a great way to learn, improve, and display your expertise. The DP-203 course helps you prepare for the certification exam.
Candidates for this exam should have subject matter expertise in integrating, transforming, and consolidating data from various structured and unstructured data systems into a structure that is suitable for building analytics solutions.
You can read up on the certification exam details on the official website.
A background in data engineering is advantageous but not needed. You may get a head start by reading up on essential data engineering concepts like OLTP vs OLAP, data warehouses, and data lakes. You can optionally take the DP-900 course to go through the data engineering concepts. Having familiarity with cloud computing and Microsoft Azure can also be helpful.
Suggested Roadmap to Prepare for the Exam
Although you can choose to take only the DP-203 course, the suggested roadmap to prepare for the DP-203: Data Engineering on Microsoft Azure exam is as follows:
AZ-900 -> DP-900 -> DP-203
- AZ-900: Prepares you for cloud fundamentals.
- DP-900: Prepares you for core data engineering concepts.
- DP-203: The actual Data Engineering on Azure course.
How to Practice for the Exam
You can use the official practice exam tool available here:
(Note: It is NOT free.)
The official practice exam allows you to select a test length that fits the time you have available to practice.
You may choose whether it shows you the answers immediately or at the end.
The format of the questions matches the exam.