
Azure Data Engineer Associate Certification Guide
Azure is one of the world's leading cloud providers, offering numerous services for data hosting and data processing. Most companies today are either cloud-native or migrating to the cloud faster than ever. This has led to an explosion of data engineering jobs, with both aspiring and experienced data engineers competing to stand out.
Gaining the DP-203: Azure Data Engineer Associate certification is a sure-fire way of showing future employers that you have what it takes to become an Azure data engineer. This book will help you prepare for the DP-203 exam in a structured way, covering every topic in the syllabus with detailed explanations and exam tips. The book starts with the fundamentals of Azure, then uses a hypothetical company as a running example to walk you through the various stages of building data engineering solutions. Throughout the chapters, you'll learn about the Azure components involved in building data systems and explore them through a wide range of real-world use cases. Finally, you'll work through sample questions and answers to familiarize yourself with the format of the exam.
By the end of this Azure book, you'll have gained the confidence you need to pass the DP-203 exam with ease and land your dream job in data engineering.
- Azure Data Engineer Associate Certification Guide
- Contributors
- About the author
- About the reviewers
- Preface
- Who this book is for
- What this book covers
- Download the example code files
- Download the color images
- Get in touch
- Reviews
- Share Your Thoughts
- Part 1: Azure Basics
- Chapter 1: Introducing Azure Basics
- Technical requirements
- Introducing the Azure portal
- Exploring Azure accounts, subscriptions, and resource groups
- Azure account
- Azure subscription
- Resource groups
- Establishing a use case
- Introducing Azure Services
- Infrastructure as a Service (IaaS)
- Platform as a Service (PaaS)
- Software as a Service (SaaS)
- Function as a Service (FaaS)
- Exploring Azure VMs
- Creating a VM using the Azure portal
- Creating a VM using the Azure CLI
- Exploring Azure Storage
- Azure Blob storage
- Azure Data Lake Storage Gen2 (ADLS Gen2)
- Azure Files
- Azure Queues
- Azure Tables
- Azure Managed Disks
- Exploring Azure Networking (VNet)
- Exploring Azure Compute
- VM Scale Sets
- Azure App Service
- Azure Kubernetes Service
- Azure Functions
- Azure Service Fabric
- Azure Batch
- Summary
- Part 2: Data Storage
- Chapter 2: Designing a Data Storage Structure
- Technical requirements
- Designing an Azure data lake
- How is a data lake different from a data warehouse?
- When should you use a data lake?
- Data lake zones
- Data lake architecture
- Exploring Azure technologies that can be used to build a data lake
- Selecting the right file types for storage
- Avro
- Parquet
- ORC
- Comparing Avro, Parquet, and ORC
- Choosing the right file types for analytical queries
- Designing storage for efficient querying
- Storage layer
- Application layer
- Query layer
- Designing storage for data pruning
- Dedicated SQL pool example with pruning
- Spark example with pruning
- Designing folder structures for data transformation
- Streaming and IoT scenarios
- Batch scenarios
- Designing a distribution strategy
- Round-robin tables
- Hash-distributed tables
- Replicated tables
- Designing a data archiving solution
- Hot access tier
- Cool access tier
- Archive access tier
- Data lifecycle management
- Summary
- Chapter 3: Designing a Partition Strategy
- Understanding the basics of partitioning
- Benefits of partitioning
- Designing a partition strategy for files
- Azure Blob storage
- ADLS Gen2
- Designing a partition strategy for analytical workloads
- Horizontal partitioning
- Vertical partitioning
- Functional partitioning
- Designing a partition strategy for efficiency/performance
- Iterative query performance improvement process
- Designing a partition strategy for Azure Synapse Analytics
- Performance improvement while loading data
- Performance improvement for filtering queries
- Identifying when partitioning is needed in ADLS Gen2
- Summary
- Chapter 4: Designing the Serving Layer
- Technical requirements
- Learning the basics of data modeling and schemas
- Dimensional models
- Designing Star and Snowflake schemas
- Star schemas
- Snowflake schemas
- Designing slowly changing dimensions (SCDs)
- Designing SCD1
- Designing SCD2
- Designing SCD3
- Designing SCD4
- Designing SCD5, SCD6, and SCD7
- Designing a solution for temporal data
- Designing a dimensional hierarchy
- Designing for incremental loading
- Watermarks
- File timestamps
- File partitions and folder structures
- Designing analytical stores
- Security considerations
- Scalability considerations
- Designing metastores in Azure Synapse Analytics and Azure Databricks
- Azure Synapse Analytics
- Azure Databricks (and Azure Synapse Spark)
- Summary
- Chapter 5: Implementing Physical Data Storage Structures
- Technical requirements
- Getting started with Azure Synapse Analytics
- Implementing compression
- Compressing files using Synapse Pipelines or Azure Data Factory (ADF)
- Compressing files using Spark
- Implementing partitioning
- Using ADF/Synapse pipelines to create data partitions
- Partitioning for analytical workloads
- Implementing horizontal partitioning or sharding
- Sharding in Synapse dedicated pools
- Sharding using Spark
- Implementing distributions
- Hash distribution
- Round-robin distribution
- Replicated distribution
- Implementing different table geometries with Azure Synapse Analytics pools
- Clustered columnstore indexing
- Heap indexing
- Clustered indexing
- Implementing data redundancy
- Azure storage redundancy in the primary region
- Azure storage redundancy in secondary regions
- Azure SQL geo-replication
- Azure Synapse SQL data replication
- Azure Cosmos DB data replication
- Example of setting up redundancy in Azure Storage
- Implementing data archiving
- Summary
- Chapter 6: Implementing Logical Data Structures
- Technical requirements
- Building a temporal data solution
- Building a slowly changing dimension
- Updating new rows
- Updating the modified rows
- Building a logical folder structure
- Implementing file and folder structures for efficient querying and data pruning
- Deleting an old partition
- Adding a new partition
- Building external tables
- Summary
- Chapter 7: Implementing the Serving Layer
- Technical requirements
- Delivering data in a relational star schema
- Implementing a dimensional hierarchy
- Synapse SQL serverless
- Synapse Spark
- Azure Databricks
- Maintaining metadata
- Metadata using Synapse SQL and Spark pools
- Metadata using Azure Databricks
- Summary
- Part 3: Design and Develop Data Processing (25-30%)
- Chapter 8: Ingesting and Transforming Data
- Technical requirements
- Transforming data by using Apache Spark
- What are RDDs?
- What are DataFrames?
- Transforming data by using T-SQL
- Transforming data by using ADF
- Schema transformations
- Row transformations
- Multi-I/O transformations
- ADF templates
- Transforming data by using Azure Synapse pipelines
- Transforming data by using Stream Analytics
- Cleansing data
- Handling missing/null values
- Trimming inputs
- Standardizing values
- Handling outliers
- Removing duplicates/deduping
- Splitting data
- File splits
- Shredding JSON
- Extracting values from JSON using Spark
- Extracting values from JSON using SQL
- Extracting values from JSON using ADF
- Encoding and decoding data
- Encoding and decoding using SQL
- Encoding and decoding using Spark
- Encoding and decoding using ADF
- Configuring error handling for the transformation
- Normalizing and denormalizing values
- Denormalizing values using Pivot
- Normalizing values using Unpivot
- Transforming data by using Scala
- Performing Exploratory Data Analysis (EDA)
- Data exploration using Spark
- Data exploration using SQL
- Data exploration using ADF
- Summary
- Chapter 9: Designing and Developing a Batch Processing Solution
- Technical requirements
- Designing a batch processing solution
- Developing batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks
- Storage
- Data ingestion
- Data preparation/data cleansing
- Transformation
- Using PolyBase to ingest the data into the analytics data store
- Using Power BI to display the insights
- Creating data pipelines
- Integrating Jupyter/Python notebooks into a data pipeline
- Designing and implementing incremental data loads
- Designing and developing slowly changing dimensions
- Handling duplicate data
- Handling missing data
- Handling late-arriving data
- Handling late-arriving data in the ingestion/transformation stage
- Handling late-arriving data in the serving stage
- Upserting data
- Regressing to a previous state
- Introducing Azure Batch
- Running a sample Azure Batch job
- Configuring the batch size
- Scaling resources
- Azure Batch
- Azure Databricks
- Synapse Spark
- Synapse SQL
- Configuring batch retention
- Designing and configuring exception handling
- Types of errors
- Remedial actions
- Handling security and compliance requirements
- The Azure Security Benchmark
- Best practices for Azure Batch
- Summary
- Chapter 10: Designing and Developing a Stream Processing Solution
- Technical requirements
- Designing a stream processing solution
- Introducing Azure Event Hubs
- Introducing Azure Stream Analytics (ASA)
- Introducing Spark Streaming
- Developing a stream processing solution using ASA, Azure Databricks, and Azure Event Hubs
- A streaming solution using Event Hubs and ASA
- A streaming solution using Event Hubs and Spark Streaming
- Processing data using Spark Structured Streaming
- Monitoring for performance and functional regressions
- Monitoring in Event Hubs
- Monitoring in ASA
- Monitoring in Spark Streaming
- Processing time series data
- Types of timestamps
- Windowed aggregates
- Checkpointing or watermarking
- Replaying data from a previous timestamp
- Designing and creating windowed aggregates
- Tumbling windows
- Hopping windows
- Sliding windows
- Session windows
- Snapshot windows
- Configuring checkpoints/watermarking during processing
- Checkpointing in ASA
- Checkpointing in Event Hubs
- Checkpointing in Spark
- Replaying archived stream data
- Transformations using Stream Analytics
- The COUNT and DISTINCT transformations
- CAST transformations
- LIKE transformations
- Handling schema drifts
- Handling schema drifts using Event Hubs
- Handling schema drifts in Spark
- Processing across partitions
- What are partitions?
- Processing data across partitions
- Processing within one partition
- Scaling resources
- Scaling in Event Hubs
- Scaling in ASA
- Scaling in Azure Databricks Spark Streaming
- Handling interruptions
- Handling interruptions in Event Hubs
- Handling interruptions in ASA
- Designing and configuring exception handling
- Upserting data
- Designing and creating tests for data pipelines
- Optimizing pipelines for analytical or transactional purposes
- Summary
- Chapter 11: Managing Batches and Pipelines
- Technical requirements
- Triggering batches
- Handling failed Batch loads
- Pool errors
- Node errors
- Job errors
- Task errors
- Validating Batch loads
- Scheduling data pipelines in Data Factory/Synapse pipelines
- Managing data pipelines in Data Factory/Synapse pipelines
- Integration runtimes
- ADF monitoring
- Managing Spark jobs in a pipeline
- Implementing version control for pipeline artifacts
- Configuring source control in ADF
- Integrating with Azure DevOps
- Integrating with GitHub
- Summary
- Part 4: Design and Implement Data Security (10-15%)
- Chapter 12: Designing Security for Data Policies and Standards
- Technical requirements
- Introducing the security and privacy requirements
- Designing and implementing data encryption for data at rest and in transit
- Encryption at rest
- Encryption in transit
- Designing and implementing a data auditing strategy
- Storage auditing
- SQL auditing
- Designing and implementing a data masking strategy
- Designing and implementing Azure role-based access control and a POSIX-like access control list for Data Lake Storage Gen2
- Restricting access using Azure RBAC
- Restricting access using ACLs
- Designing and implementing row-level and column-level security
- Designing row-level security
- Designing column-level security
- Designing and implementing a data retention policy
- Designing to purge data based on business requirements
- Purging data in Azure Data Lake Storage Gen2
- Purging data in Azure Synapse SQL
- Managing identities, keys, and secrets across different data platform technologies
- Azure Active Directory
- Azure Key Vault
- Access keys and shared access signatures in Azure Storage
- Implementing secure endpoints (private and public)
- Implementing resource tokens in Azure Databricks
- Loading a DataFrame with sensitive information
- Writing encrypted data to tables or Parquet files
- Designing for data privacy and managing sensitive information
- Microsoft Defender
- Summary
- Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
- Chapter 13: Monitoring Data Storage and Data Processing
- Technical requirements
- Implementing logging used by Azure Monitor
- Configuring monitoring services
- Understanding custom logging options
- Interpreting Azure Monitor metrics and logs
- Interpreting Azure Monitor metrics
- Interpreting Azure Monitor logs
- Measuring the performance of data movement
- Monitoring data pipeline performance
- Monitoring and updating statistics about data across a system
- Creating statistics for Synapse dedicated pools
- Updating statistics for Synapse dedicated pools
- Creating statistics for Synapse serverless pools
- Updating statistics for Synapse serverless pools
- Measuring query performance
- Monitoring Synapse SQL pool performance
- Spark query performance monitoring
- Interpreting a Spark DAG
- Monitoring cluster performance
- Monitoring overall cluster performance
- Monitoring per-node performance
- Monitoring YARN queue/scheduler performance
- Monitoring storage throttling
- Scheduling and monitoring pipeline tests
- Summary
- Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing
- Technical requirements
- Compacting small files
- Rewriting user-defined functions (UDFs)
- Writing UDFs in Synapse SQL pools
- Writing UDFs in Spark
- Writing UDFs in Stream Analytics
- Handling skews in data
- Fixing skews at the storage level
- Fixing skews at the compute level
- Handling data spills
- Identifying data spills in Synapse SQL
- Identifying data spills in Spark
- Tuning shuffle partitions
- Finding shuffling in a pipeline
- Identifying shuffles in a SQL query plan
- Identifying shuffles in a Spark query plan
- Optimizing resource management
- Optimizing Synapse SQL pools
- Optimizing Spark
- Tuning queries by using indexers
- Indexing in Synapse SQL
- Indexing in the Synapse Spark pool using Hyperspace
- Tuning queries by using cache
- Optimizing pipelines for analytical or transactional purposes
- OLTP systems
- OLAP systems
- Implementing HTAP using Azure Synapse Link and Azure Cosmos DB
- Optimizing pipelines for descriptive versus analytical workloads
- Common optimizations for descriptive and analytical pipelines
- Specific optimizations for descriptive and analytical pipelines
- Troubleshooting a failed Spark job
- Debugging environmental issues
- Debugging job issues
- Troubleshooting a failed pipeline run
- Summary
- Part 6: Practice Exercises
- Chapter 15: Sample Questions with Solutions
- Exploring the question formats
- Case study-based questions
- Case study data lake
- Scenario-based questions
- Shared access signature
- Direct questions
- ADF transformation
- Ordering sequence questions
- ASA setup steps
- Code segment questions
- Column security
- Sample questions from the Design and Implement Data Storage section
- Case study data lake
- Data visualization
- Data partitioning
- Synapse SQL pool table design 1
- Synapse SQL pool table design 2
- Slowly changing dimensions
- Storage tiers
- Disaster recovery
- Synapse SQL external tables
- Sample questions from the Design and Develop Data Processing section
- Data lake design
- ASA windows
- Spark transformation
- ADF integration runtimes
- ADF triggers
- Sample questions from the Design and Implement Data Security section
- TDE/Always Encrypted
- Auditing Azure SQL/Synapse SQL
- Dynamic data masking
- RBAC and POSIX ACLs
- Row-level security
- Sample questions from the Monitor and Optimize Data Storage and Data Processing section
- Blob storage monitoring
- T-SQL optimization
- ADF monitoring
- Setting up alerts in ASA
- Summary
- Why subscribe?
- Other Books You May Enjoy
- Packt is searching for authors like you
- Share Your Thoughts
- Title: Azure Data Engineer Associate Certification Guide
- Author: Newton Alex
- ISBN (ebook): 9781801812832
- Publication date: 2022-02-28