Helion


Ebook details

Building Big Data Pipelines with Apache Beam



Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing.

This book will help you to confidently build data processing pipelines with Apache Beam. You'll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You'll also learn how to test and run the pipelines efficiently. As you progress, you'll explore how to structure your code for reusability and also use various Domain-Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you'll understand advanced Apache Beam concepts, such as implementing your own I/O connectors.

By the end of this book, you'll have gained a deep understanding of the Apache Beam model and be able to apply it to solve your own data processing problems.
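To give a feel for the programming model before the chapter-by-chapter contents below, here is a minimal sketch of a bounded word-count pipeline written with Beam's Java SDK and executed on the built-in DirectRunner. The class name, the hard-coded input line, and the console-printing step are illustrative choices, not code taken from the book.

    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalWordCount {
      public static void main(String[] args) {
        // Create a pipeline; with no runner specified it runs on the local DirectRunner.
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // A tiny bounded, in-memory source; the book's tasks read from streams instead.
            .apply(Create.of("to be or not to be"))
            // Split each line into individual words.
            .apply(FlatMapElements
                .into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("\\s+"))))
            // Count how many times each word occurs, producing KV<word, count>.
            .apply(Count.perElement())
            // Print the results; a real pipeline would write to a sink instead.
            .apply(ParDo.of(new DoFn<KV<String, Long>, Void>() {
              @ProcessElement
              public void process(@Element KV<String, Long> element) {
                System.out.println(element.getKey() + ": " + element.getValue());
              }
            }));

        pipeline.run().waitUntilFinish();
      }
    }

The same pipeline shape carries over to the streaming tasks in the early chapters, where the bounded Create source is replaced by an unbounded one such as the Apache Kafka topics set up in Chapter 2.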

  • Building Big Data Pipelines with Apache Beam
  • Contributors
  • About the author
  • About the reviewer
  • Preface
    • Who this book is for
    • What this book covers
    • To get the most out of this book
    • Download the example code files
    • Download the color images
    • Conventions used
    • Get in touch
    • Share Your Thoughts
  • Section 1 Apache Beam: Essentials
  • Chapter 1: Introduction to Data Processing with Apache Beam
    • Technical requirements
    • Why Apache Beam?
    • Writing your first pipeline
    • Running our pipeline against streaming data
    • Exploring the key properties of unbounded data
    • Measuring event time progress inside data streams
      • States and triggers
      • Timers
    • Assigning data to windows
      • Defining the life cycle of a state in terms of windows
      • Pane accumulation
    • Unifying batch and streaming data processing
    • Summary
  • Chapter 2: Implementing, Testing, and Deploying Basic Pipelines
    • Technical requirements
    • Setting up the environment for this book
      • Installing Apache Kafka
      • Making our code accessible from minikube
      • Installing Apache Flink
      • Reinstalling the complete environment
    • Task 1 Calculating the K most frequent words in a stream of lines of text
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
    • Task 2 Calculating the maximal length of a word in a stream
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
    • Specifying the PCollection Coder object and the TypeDescriptor object
    • Understanding default triggers, on time, and closing behavior
    • Introducing the primitive PTransform object Combine
    • Task 3 Calculating the average length of words in a stream
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
    • Task 4 Calculating the average length of words in a stream with fixed lookback
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
    • Ensuring pipeline upgradability
    • Task 5 Calculating performance statistics for a sport activity tracking application
      • Defining the problem
      • Discussing the problem decomposition
      • Solution implementation
      • Testing our solution
      • Deploying our solution
    • Introducing the primitive PTransform object GroupByKey
    • Introducing the primitive PTransform object Partition
    • Summary
  • Chapter 3: Implementing Pipelines Using Stateful Processing
    • Technical requirements
    • Task 6 Using an external service for data augmentation
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
    • Introducing the primitive PTransform object stateless ParDo
    • Task 7 Batching queries to an external RPC service
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
    • Task 8 Batching queries to an external RPC service with defined batch sizes
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
    • Introducing the primitive PTransform object stateful ParDo
      • Describing the theoretical properties of the stateful ParDo object
      • Applying the theoretical properties of the stateful ParDo object to the API of DoFn
    • Using side outputs
    • Defining droppable data in Beam
    • Task 9 Separating droppable data from the rest of the data processing
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
    • Task 10 Separating droppable data from the rest of the data processing, part 2
      • Defining the problem
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
    • Using side inputs
    • Summary
  • Section 2 Apache Beam: Toward Improving Usability
  • Chapter 4: Structuring Code for Reusability
    • Technical requirements
    • Explaining PTransform expansion
    • Task 11 Enhancing SportTracker by runner motivation using side inputs
      • Problem definition
      • Problem decomposition discussion
      • Solution implementation
      • Testing our solution
      • Deploying our solution
    • Introducing composite transform CoGroupByKey
  • Task 12 Enhancing SportTracker by runner motivation using CoGroupByKey
      • Problem definition
      • Problem decomposition discussion
      • Solution implementation
    • Introducing the Join library DSL
    • Stream-to-stream joins explained
  • Task 13 Writing a reusable PTransform: StreamingInnerJoin
      • Problem definition
      • Problem decomposition discussion
      • Solution implementation
      • Testing our solution
      • Deploying our solution
    • Table-stream duality
    • Summary
  • Chapter 5: Using SQL for Pipeline Implementation
    • Technical requirements
    • Understanding schemas
      • Attaching a schema to a PCollection
      • Transforms for PCollections with schemas
    • Implementing our first streaming pipeline using SQL
    • Task 14 Implementing SQLMaxWordLength
      • Problem definition
      • Problem decomposition discussion
      • Solution implementation
    • Task 15 Implementing SchemaSportTracker
      • Problem definition
      • Problem decomposition discussion
      • Solution implementation
    • Task 16 Implementing SQLSportTrackerMotivation
      • Problem definition
      • Problem decomposition discussion
      • Solution implementation
    • Further development of Apache Beam SQL
    • Summary
  • Chapter 6: Using Your Preferred Language with Portability
    • Technical requirements
    • Introducing the portability layer
      • Portable representation of the pipeline
      • Job Service
      • SDK harness
    • Implementing our first pipelines in the Python SDK
      • Implementing our first Python pipeline
      • Implementing our first streaming Python pipeline
    • Task 17 Implementing MaxWordLength in the Python SDK
      • Problem definition
      • Problem decomposition discussion
      • Solution implementation
      • Testing our solution
      • Deploying our solution
    • Python SDK type hints and coders
    • Task 18 Implementing SportTracker in the Python SDK
      • Problem definition
      • Solution implementation
      • Testing our solution
      • Deploying our solution
    • Task 19 Implementing RPCParDo in the Python SDK
      • Problem definition
      • Solution implementation
      • Testing our solution
      • Deploying our solution
    • Task 20 Implementing SportTrackerMotivation in the Python SDK
      • Problem definition
      • Solution implementation
      • Deploying our solution
    • Using the DataFrame API
    • Interactive programming using InteractiveRunner
    • Introducing and using cross-language pipelines
    • Summary
  • Section 3 Apache Beam: Advanced Concepts
  • Chapter 7: Extending Apache Beam's I/O Connectors
    • Technical requirements
    • Defining splittable DoFn as a unification for bounded and unbounded sources
  • Task 21 Implementing our own splittable DoFn: a streaming file source
      • The problem definition
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
  • Task 22 A non-I/O application of splittable DoFn: PiSampler
      • The problem definition
      • Discussing the problem decomposition
      • Implementing the solution
      • Testing our solution
      • Deploying our solution
    • The legacy Source API and the Read transform
    • Writing a custom data sink
      • The inherent non-determinism of Apache Beam pipelines
    • Summary
  • Chapter 8: Understanding How Runners Execute Pipelines
    • Describing the anatomy of an Apache Beam runner
      • Identifying which transforms should be overridden
    • Explaining the differences between classic and portable runners
      • Classic runners
      • Portable pipeline representations
      • The executable stage concept and the pipeline fusion process
    • Understanding how a runner handles state
      • Ensuring fault tolerance
      • Local state with periodic checkpoints
      • Remote state
    • Exploring the Apache Beam capability matrix
    • Understanding windowing semantics in depth
      • Merging and non-merging windows
    • Debugging pipelines and using Apache Beam metrics for observability
      • Using metrics in the Java SDK
    • Summary
    • Why subscribe?
  • Other Books You May Enjoy
    • Packt is searching for authors like you
    • Share Your Thoughts