
Simplify Big Data Analytics with Amazon EMR
Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS.
This book is a practical guide to building data pipelines with Amazon EMR. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and plan the migration of on-premises Hadoop clusters to EMR. Along the way, you'll explore best practices and cost optimization techniques for implementing your data analytics pipeline in EMR.
By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.
- Simplify Big Data Analytics with Amazon EMR
- Contributors
- About the author
- About the reviewers
- Preface
- Who this book is for
- What this book covers
- To get the most out of this book
- Download the example code files
- Code in Action
- Download the color images
- Conventions used
- Get in touch
- Share Your Thoughts
- Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
- Chapter 1: An Overview of Amazon EMR
- What is Amazon EMR?
- What is big data?
- Hadoop processing framework to handle big data
- Overview of Amazon EMR managed and scalable Hadoop cluster in AWS
- A brief history of the major big data releases
- Benefits of Amazon EMR
- Decoupling compute and storage
- Persistent versus transient clusters
- Integration with other AWS services
- Amazon S3 with EMR File System (EMRFS)
- Amazon Kinesis Data Streams (KDS)
- Amazon Managed Streaming for Kafka (MSK)
- AWS Glue Data Catalog
- Amazon Relational Database Service (RDS)
- Amazon DynamoDB
- Amazon Redshift
- AWS Lake Formation
- AWS Identity and Access Management (IAM)
- AWS Key Management Service (KMS)
- Lake House architecture overview
- EMR release history
- Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew
- AWS Glue
- AWS Glue DataBrew
- Choosing the right service for your use case
- Summary
- Test your knowledge
- Further reading
- Chapter 2: Exploring the Architecture and Deployment Options
- EMR architecture deep dive
- Distributed storage layer
- YARN cluster resource manager
- Distributed processing frameworks
- Hadoop applications
- Understanding clusters and nodes
- Uniform instance groups
- Instance fleet
- Using S3 versus HDFS for cluster storage
- HDFS as cluster-persistent storage
- Amazon S3 as a persistent data store
- Understanding the cluster life cycle
- Options to submit work to the cluster
- Submitting jobs to the cluster as EMR steps
- Building Hadoop jobs with dependencies in a specific EMR release version
- EMR deployment options
- Amazon EMR on Amazon EC2
- Amazon EMR on Amazon EKS
- Amazon EMR on AWS Outposts
- EMR pricing for different deployment options
- Monitoring and controlling your costs with AWS Budgets and Cost Explorer
- Summary
- Test your knowledge
- Further reading
- Chapter 3: Common Use Cases and Architecture Patterns
- Reference architecture for batch ETL workloads
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for clickstream analytics
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for interactive analytics and ML
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for real-time streaming analytics
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for genomics data analytics
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for log analytics
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Summary
- Test your knowledge
- Further reading
- Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR
- Technical requirements
- Understanding popular big data applications in EMR
- Hive
- Presto
- Spark
- HBase
- Hue
- Ganglia
- Machine learning frameworks available in EMR
- TensorFlow
- MXNet
- Notebook options available in EMR
- EMR Notebooks
- JupyterHub
- EMR Studio
- Zeppelin
- Summary
- Test your knowledge
- Further reading
- Section 2: Configuration, Scaling, Data Security, and Governance
- Chapter 5: Setting Up and Configuring EMR Clusters
- Technical requirements
- Setting up and configuring clusters with the EMR console's quick create option
- Advanced configuration for cluster hardware and software
- Understanding the Software Configuration section
- Understanding Steps
- Understanding the Hardware Configuration section
- Understanding general configurations
- Understanding Security Options
- Working with AMIs and controlling cluster termination
- Working with AMIs
- Controlling the EMR cluster termination process
- Troubleshooting and logging in your EMR cluster
- Tools available to debug your EMR cluster
- Viewing and restarting cluster application processes
- Troubleshooting a failed cluster
- Troubleshooting a slow cluster
- Logging in your EMR cluster
- Summary
- Test your knowledge
- Further reading
- Chapter 6: Monitoring, Scaling, and High Availability
- Technical requirements
- Monitoring your EMR cluster
- Monitoring clusters and applications with web user interfaces
- Monitoring cluster metrics with CloudWatch monitoring
- EMR API audit logging with AWS CloudTrail
- Scaling cluster resources
- Managed scaling in EMR
- Autoscaling in EMR with a custom policy for instance groups
- Manually resizing your EMR cluster
- Comparing managed scaling with autoscaling
- Cluster cloning and high availability with multiple master nodes
- High availability with multiple master nodes
- Cloning an existing EMR cluster
- Summary
- Test your knowledge
- Further reading
- Chapter 7: Understanding Security in Amazon EMR
- Technical requirements
- Understanding the basics of security
- Creating security configurations
- Specifying a security configuration for your cluster
- AWS IAM integration with Amazon EMR
- Configuring an IAM service role for your EMR cluster
- Configuring IAM roles for EMRFS
- Integrating IAM roles in applications that invoke AWS services directly
- Allowing users and groups to create and modify roles
- Identity-based policies and best practices
- Understanding authentication to cluster nodes
- Understanding data protection in EMR
- Encrypting data at rest for EMRFS on Amazon S3 data
- Encrypting data in transit for EMRFS on Amazon S3 data
- Role of security groups and interface VPC endpoints
- Controlling cluster network traffic with security groups
- Connecting to Amazon EMR on an EC2 cluster using an interface VPC endpoint
- Connecting to Amazon EMR on an EKS cluster using an interface VPC endpoint
- Summary
- Test your knowledge
- Further reading
- Chapter 8: Understanding Data Governance in Amazon EMR
- Technical requirements
- Understanding data catalog and access management options
- Using AWS Glue Data Catalog
- Integrating AWS Glue Data Catalog with Amazon EMR
- Permission management on top of a data catalog
- Understanding Amazon EMR integration with AWS Lake Formation
- Integrating Lake Formation with Amazon EMR
- Launching an EMR cluster with Lake Formation
- Setting up EMR notebooks to work with Lake Formation
- Understanding Amazon EMR integration with Apache Ranger
- Setting up Apache Ranger in EMR
- Understanding Apache Ranger plugins
- Summary
- Test your knowledge
- Further reading
- Section 3: Implementing Common Use Cases and Best Practices
- Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
- Technical requirements
- Use case and architecture overview
- Architecture overview
- Implementation steps
- Creating Amazon S3 buckets
- Creating the AWS Lambda function
- Configuring an S3 file arrival event to trigger the Lambda function
- Triggering the EMR job
- Validating the output using Amazon Athena
- Defining a virtual Glue Data Catalog table on top of Amazon S3 data
- Querying output data using Amazon Athena standard SQL
- Spark ETL and Lambda function code walk-through
- Understanding the AWS Lambda function code
- Understanding the PySpark script integrated into the EMR step
- Summary
- Test your knowledge
- Further reading
- Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
- Technical requirements
- Use case and architecture overview
- Architecture overview
- Implementation steps
- Creating Amazon S3 buckets
- Creating the Amazon Kinesis data stream
- Creating and configuring the Kinesis Data Generator tool
- Creating an Amazon EMR cluster and configuring a Spark Streaming job
- Validating output using Amazon Athena
- Defining a virtual Glue Catalog table on top of Amazon S3 data
- Querying output data using a standard SQL query in Amazon Athena
- Spark Streaming code walk-through
- Summary
- Test your knowledge
- Further reading
- Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
- Technical requirements
- Apache Hudi overview
- Popular use cases
- Registering Hudi data with your Hive or Glue Data Catalog metastore
- Creating an EMR cluster and an EMR notebook
- Creating an EMR cluster
- Creating an EMR notebook
- Creating an Amazon S3 bucket
- Interactive development with Spark and Hudi
- Creating a PySpark notebook for development
- Integrating Hudi with our PySpark notebook
- Executing Spark and Hudi scripts in your notebook
- Summary
- Test your knowledge
- Further reading
- Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
- Technical requirements
- Overview of AWS Step Functions
- Integrating AWS Step Functions to orchestrate EMR jobs
- Overview of Apache Airflow and MWAA
- Integrating Airflow to trigger EMR jobs
- Summary
- Test your knowledge
- Further reading
- Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
- Understanding migration approaches
- Lift and shift
- Re-architecting
- Hybrid architecture
- Migrating data and metadata catalogs
- Migrating data
- Migrating metadata catalogs
- Migrating ETL jobs and Oozie workflows
- Migrating Oozie workflows
- Testing and validation
- Validating metadata quality
- Validating data quality
- Best practices for migration
- Summary
- Test your knowledge
- Further reading
- Chapter 14: Best Practices and Cost-Optimization Techniques
- Best practices around EMR cluster configurations
- Choosing the correct cluster type (transient versus long-running)
- Best practices around sizing your cluster
- Optimization techniques for data processing and storage
- Best practices for cluster persistent storage
- Best practices while processing data using EMR
- Security best practices
- Configuring edge nodes outside of the cluster to limit connectivity
- Integrating logging, monitoring, and audit controls into your cluster
- Blocking public access to your EMR cluster
- Protecting your data at rest and in transit
- Cost-optimization techniques
- Cost savings with compute resources
- Cost savings with storage
- Integrating AWS Budgets and Cost Explorer
- AWS Trusted Advisor
- Cost allocation tags
- Limitations of Amazon EMR and possible workarounds
- Summary
- Test your knowledge
- Further reading
- Why subscribe?
- Other Books You May Enjoy
- Packt is searching for authors like you
- Share Your Thoughts
- Title: Simplify Big Data Analytics with Amazon EMR
- Author: Sakti Mishra
- Original title: Simplify Big Data Analytics with Amazon EMR
- Ebook ISBN: 9781801077729
- Publication date: 2022-03-25
- Item ID: e_2t2b