E-book details

Apache Flume: Distributed Log Collection for Hadoop. If your role includes moving datasets into Hadoop, this book will help you do it more efficiently using Apache Flume. From installation to customization, it's a complete step-by-step guide on making the service work for you

Steve Hoffman, Steven Hoffman, Kevin A. McGrail

Ebook
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers the problems with HDFS and streaming data/logs, and how Flume can resolve them. The book explains the generalized architecture of Flume, which includes moving data to and from databases and NoSQL-ish data stores, as well as optimizing performance. It also includes real-world scenarios on Flume implementation.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and compilation of Flume.

It gives you a heads-up on how to use channels and channel selectors. For each architectural component (sources, channels, sinks, channel processors, sink groups, and so on), the various implementations are covered in detail along with their configuration options, so you can customize Flume to your specific needs. It also offers pointers on writing custom implementations.

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.
  • Apache Flume: Distributed Log Collection for Hadoop
    • Table of Contents
    • Apache Flume: Distributed Log Collection for Hadoop
    • Credits
    • About the Author
    • About the Reviewers
    • www.PacktPub.com
      • Support files, eBooks, discount offers and more
        • Why Subscribe?
        • Free Access for Packt account holders
    • Preface
      • What this book covers
      • What you need for this book
      • Who this book is for
      • Conventions
      • Reader feedback
      • Customer support
        • Errata
        • Piracy
        • Questions
    • 1. Overview and Architecture
      • Flume 0.9
      • Flume 1.X (Flume-NG)
      • The problem with HDFS and streaming data/logs
      • Sources, channels, and sinks
      • Flume events
        • Interceptors, channel selectors, and sink processors
        • Tiered data collection (multiple flows and/or agents)
      • Summary
    • 2. Flume Quick Start
      • Downloading Flume
        • Flume in Hadoop distributions
      • Flume configuration file overview
      • Starting up with "Hello World"
      • Summary
    • 3. Channels
      • Memory channel
      • File channel
      • Summary
    • 4. Sinks and Sink Processors
      • HDFS sink
        • Path and filename
        • File rotation
      • Compression codecs
      • Event serializers
        • Text output
        • Text with headers
        • Apache Avro
        • File type
          • Sequence file
          • Data stream
          • Compressed stream
        • Timeouts and workers
      • Sink groups
        • Load balancing
        • Failover
      • Summary
    • 5. Sources and Channel Selectors
      • The problem with using tail
      • The exec source
      • The spooling directory source
      • Syslog sources
        • The syslog UDP source
        • The syslog TCP source
        • The multiport syslog TCP source
      • Channel selectors
        • Replicating
        • Multiplexing
      • Summary
    • 6. Interceptors, ETL, and Routing
      • Interceptors
        • Timestamp
        • Host
        • Static
        • Regular expression filtering
        • Regular expression extractor
        • Custom interceptors
      • Tiering data flows
        • Avro Source/Sink
        • Command-line Avro
        • Log4J Appender
        • The Load Balancing Log4J Appender
      • Routing
      • Summary
    • 7. Monitoring Flume
      • Monitoring the agent process
        • Monit
        • Nagios
      • Monitoring performance metrics
        • Ganglia
        • The internal HTTP server
        • Custom monitoring hooks
      • Summary
    • 8. There Is No Spoon: The Realities of Real-time Distributed Data Collection
      • Transport time versus log time
      • Time zones are evil
      • Capacity planning
      • Considerations for multiple data centers
      • Compliance and data expiry
      • Summary
    • Index
  • Title: Apache Flume: Distributed Log Collection for Hadoop. If your role includes moving datasets into Hadoop, this book will help you do it more efficiently using Apache Flume. From installation to customization, it's a complete step-by-step guide on making the service work for you
  • Author: Steve Hoffman, Steven Hoffman, Kevin A. McGrail
  • Original title: Apache Flume: Distributed Log Collection for Hadoop. If your role includes moving datasets into Hadoop, this book will help you do it more efficiently using Apache Flume. From installation to customization, it's a complete step-by-step guide on making the service work for you.
  • ISBN: 9781782167921
  • Date of issue: 2013-07-16
  • Format: Ebook
  • Item ID: e_3bhv
  • Publisher: Packt Publishing