E-book details

Hadoop Beginner's Guide. Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial

Gerald Turkington, Kevin A. McGrail

Ebook
Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.

Hadoop Beginner's Guide removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real-world problems.

Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and use additional products to integrate with other systems.

While teaching different ways to develop applications to run on Hadoop, the book also covers tools such as Hive, Sqoop, and Flume that show how Hadoop can be integrated with relational databases and log collection.

In addition to examples on Hadoop clusters running on Ubuntu, the book covers uses of cloud services such as Amazon EC2 and Elastic MapReduce.
  • Hadoop Beginner's Guide
    • Table of Contents
    • Hadoop Beginner's Guide
    • Credits
    • About the Author
    • About the Reviewers
    • www.PacktPub.com
      • Support files, eBooks, discount offers and more
        • Why Subscribe?
        • Free Access for Packt account holders
    • Preface
      • What this book covers
      • What you need for this book
      • Who this book is for
      • Conventions
      • Time for action heading
        • What just happened?
        • Pop quiz heading
        • Have a go hero heading
      • Reader feedback
      • Customer support
        • Downloading the example code
        • Errata
        • Piracy
        • Questions
    • 1. What It's All About
      • Big data processing
        • The value of data
        • Historically for the few and not the many
          • Classic data processing systems
            • Scale-up
            • Early approaches to scale-out
          • Limiting factors
        • A different approach
          • All roads lead to scale-out
          • Share nothing
          • Expect failure
          • Smart software, dumb hardware
          • Move processing, not data
          • Build applications, not infrastructure
        • Hadoop
          • Thanks, Google
          • Thanks, Doug
          • Thanks, Yahoo
          • Parts of Hadoop
          • Common building blocks
          • HDFS
          • MapReduce
          • Better together
          • Common architecture
          • What it is and isn't good for
      • Cloud computing with Amazon Web Services
        • Too many clouds
        • A third way
        • Different types of costs
        • AWS infrastructure on demand from Amazon
          • Elastic Compute Cloud (EC2)
          • Simple Storage Service (S3)
          • Elastic MapReduce (EMR)
        • What this book covers
          • A dual approach
      • Summary
    • 2. Getting Hadoop Up and Running
      • Hadoop on a local Ubuntu host
        • Other operating systems
      • Time for action checking the prerequisites
        • What just happened?
        • Setting up Hadoop
          • A note on versions
      • Time for action downloading Hadoop
        • What just happened?
      • Time for action setting up SSH
        • What just happened?
        • Configuring and running Hadoop
      • Time for action using Hadoop to calculate Pi
        • What just happened?
        • Three modes
      • Time for action configuring the pseudo-distributed mode
        • What just happened?
        • Configuring the base directory and formatting the filesystem
      • Time for action changing the base HDFS directory
        • What just happened?
      • Time for action formatting the NameNode
        • What just happened?
        • Starting and using Hadoop
      • Time for action starting Hadoop
        • What just happened?
      • Time for action using HDFS
        • What just happened?
      • Time for action WordCount, the Hello World of MapReduce
        • What just happened?
        • Have a go hero WordCount on a larger body of text
        • Monitoring Hadoop from the browser
          • The HDFS web UI
            • The MapReduce web UI
      • Using Elastic MapReduce
        • Setting up an account in Amazon Web Services
          • Creating an AWS account
          • Signing up for the necessary services
      • Time for action WordCount on EMR using the management console
        • What just happened?
        • Have a go hero other EMR sample applications
        • Other ways of using EMR
          • AWS credentials
          • The EMR command-line tools
        • The AWS ecosystem
      • Comparison of local versus EMR Hadoop
      • Summary
    • 3. Understanding MapReduce
      • Key/value pairs
        • What it means
        • Why key/value data?
          • Some real-world examples
        • MapReduce as a series of key/value transformations
        • Pop quiz key/value pairs
      • The Hadoop Java API for MapReduce
        • The 0.20 MapReduce Java API
          • The Mapper class
          • The Reducer class
          • The Driver class
      • Writing MapReduce programs
      • Time for action setting up the classpath
        • What just happened?
      • Time for action implementing WordCount
        • What just happened?
      • Time for action building a JAR file
        • What just happened?
      • Time for action running WordCount on a local Hadoop cluster
        • What just happened?
      • Time for action running WordCount on EMR
        • What just happened?
        • The pre-0.20 Java MapReduce API
        • Hadoop-provided mapper and reducer implementations
      • Time for action WordCount the easy way
        • What just happened?
      • Walking through a run of WordCount
        • Startup
        • Splitting the input
        • Task assignment
        • Task startup
        • Ongoing JobTracker monitoring
        • Mapper input
        • Mapper execution
        • Mapper output and reduce input
        • Partitioning
        • The optional partition function
        • Reducer input
        • Reducer execution
        • Reducer output
        • Shutdown
        • That's all there is to it!
        • Apart from the combiner... maybe
          • Why have a combiner?
      • Time for action WordCount with a combiner
        • What just happened?
          • When you can use the reducer as the combiner
      • Time for action fixing WordCount to work with a combiner
        • What just happened?
        • Reuse is your friend
        • Pop quiz MapReduce mechanics
      • Hadoop-specific data types
        • The Writable and WritableComparable interfaces
        • Introducing the wrapper classes
          • Primitive wrapper classes
          • Array wrapper classes
          • Map wrapper classes
      • Time for action using the Writable wrapper classes
        • What just happened?
          • Other wrapper classes
        • Have a go hero playing with Writables
          • Making your own
      • Input/output
        • Files, splits, and records
        • InputFormat and RecordReader
        • Hadoop-provided InputFormat
        • Hadoop-provided RecordReader
        • OutputFormat and RecordWriter
        • Hadoop-provided OutputFormat
        • Don't forget Sequence files
      • Summary
    • 4. Developing MapReduce Programs
      • Using languages other than Java with Hadoop
        • How Hadoop Streaming works
        • Why to use Hadoop Streaming
      • Time for action implementing WordCount using Streaming
        • What just happened?
        • Differences in jobs when using Streaming
      • Analyzing a large dataset
        • Getting the UFO sighting dataset
        • Getting a feel for the dataset
      • Time for action summarizing the UFO data
        • What just happened?
          • Examining UFO shapes
      • Time for action summarizing the shape data
        • What just happened?
      • Time for action correlating sighting duration to UFO shape
        • What just happened?
          • Using Streaming scripts outside Hadoop
      • Time for action performing the shape/time analysis from the command line
        • What just happened?
        • Java shape and location analysis
      • Time for action using ChainMapper for field validation/analysis
        • What just happened?
        • Have a go hero
          • Too many abbreviations
          • Using the Distributed Cache
      • Time for action using the Distributed Cache to improve location output
        • What just happened?
      • Counters, status, and other output
      • Time for action creating counters, task states, and writing log output
        • What just happened?
        • Too much information!
      • Summary
    • 5. Advanced MapReduce Techniques
      • Simple, advanced, and in-between
      • Joins
        • When this is a bad idea
        • Map-side versus reduce-side joins
        • Matching account and sales information
      • Time for action reduce-side join using MultipleInputs
        • What just happened?
          • DataJoinMapper and TaggedMapperOutput
        • Implementing map-side joins
          • Using the Distributed Cache
        • Have a go hero - Implementing map-side joins
          • Pruning data to fit in the cache
          • Using a data representation instead of raw data
          • Using multiple mappers
        • To join or not to join...
      • Graph algorithms
        • Graph 101
        • Graphs and MapReduce a match made somewhere
        • Representing a graph
      • Time for action representing the graph
        • What just happened?
        • Overview of the algorithm
          • The mapper
          • The reducer
          • Iterative application
      • Time for action creating the source code
        • What just happened?
      • Time for action the first run
        • What just happened?
      • Time for action the second run
        • What just happened?
      • Time for action the third run
        • What just happened?
      • Time for action the fourth and last run
        • What just happened?
        • Running multiple jobs
        • Final thoughts on graphs
      • Using language-independent data structures
        • Candidate technologies
        • Introducing Avro
      • Time for action getting and installing Avro
        • What just happened?
        • Avro and schemas
      • Time for action defining the schema
        • What just happened?
      • Time for action creating the source Avro data with Ruby
        • What just happened?
      • Time for action consuming the Avro data with Java
        • What just happened?
        • Using Avro within MapReduce
      • Time for action generating shape summaries in MapReduce
        • What just happened?
      • Time for action examining the output data with Ruby
        • What just happened?
      • Time for action examining the output data with Java
        • What just happened?
        • Have a go hero graphs in Avro
        • Going forward with Avro
      • Summary
    • 6. When Things Break
      • Failure
        • Embrace failure
        • Or at least don't fear it
        • Don't try this at home
        • Types of failure
        • Hadoop node failure
          • The dfsadmin command
          • Cluster setup, test files, and block sizes
          • Fault tolerance and Elastic MapReduce
      • Time for action killing a DataNode process
        • What just happened?
          • NameNode and DataNode communication
        • Have a go hero NameNode log delving
      • Time for action the replication factor in action
        • What just happened?
      • Time for action intentionally causing missing blocks
        • What just happened?
          • When data may be lost
          • Block corruption
      • Time for action killing a TaskTracker process
        • What just happened?
          • Comparing the DataNode and TaskTracker failures
          • Permanent failure
        • Killing the cluster masters
      • Time for action killing the JobTracker
        • What just happened?
          • Starting a replacement JobTracker
        • Have a go hero moving the JobTracker to a new host
      • Time for action killing the NameNode process
        • What just happened?
          • Starting a replacement NameNode
          • The role of the NameNode in more detail
          • File systems, files, blocks, and nodes
          • The single most important piece of data in the cluster fsimage
          • DataNode startup
          • Safe mode
          • SecondaryNameNode
          • So what to do when the NameNode process has a critical failure?
          • BackupNode/CheckpointNode and NameNode HA
          • Hardware failure
          • Host failure
          • Host corruption
          • The risk of correlated failures
        • Task failure due to software
          • Failure of slow running tasks
      • Time for action causing task failure
        • What just happened?
        • Have a go hero HDFS programmatic access
          • Hadoop's handling of slow-running tasks
          • Speculative execution
          • Hadoop's handling of failing tasks
        • Have a go hero causing tasks to fail
        • Task failure due to data
          • Handling dirty data through code
          • Using Hadoop's skip mode
      • Time for action handling dirty data by using skip mode
        • What just happened?
          • To skip or not to skip...
      • Summary
    • 7. Keeping Things Running
      • A note on EMR
      • Hadoop configuration properties
        • Default values
      • Time for action browsing default properties
        • What just happened?
        • Additional property elements
        • Default storage location
        • Where to set properties
      • Setting up a cluster
        • How many hosts?
          • Calculating usable space on a node
          • Location of the master nodes
          • Sizing hardware
          • Processor / memory / storage ratio
          • EMR as a prototyping platform
        • Special node requirements
        • Storage types
          • Commodity versus enterprise class storage
          • Single disk versus RAID
          • Finding the balance
          • Network storage
        • Hadoop networking configuration
          • How blocks are placed
          • Rack awareness
            • The rack-awareness script
      • Time for action examining the default rack configuration
        • What just happened?
      • Time for action adding a rack awareness script
        • What just happened?
        • What is commodity hardware anyway?
        • Pop quiz setting up a cluster
      • Cluster access control
        • The Hadoop security model
      • Time for action demonstrating the default security
        • What just happened?
          • User identity
            • The super user
          • More granular access control
        • Working around the security model via physical access control
      • Managing the NameNode
        • Configuring multiple locations for the fsimage class
      • Time for action adding an additional fsimage location
        • What just happened?
          • Where to write the fsimage copies
        • Swapping to another NameNode host
          • Having things ready before disaster strikes
      • Time for action swapping to a new NameNode host
        • What just happened?
          • Don't celebrate quite yet!
          • What about MapReduce?
        • Have a go hero swapping to a new NameNode host
      • Managing HDFS
        • Where to write data
        • Using balancer
          • When to rebalance
      • MapReduce management
        • Command line job management
        • Have a go hero command line job management
        • Job priorities and scheduling
      • Time for action changing job priorities and killing a job
        • What just happened?
        • Alternative schedulers
          • Capacity Scheduler
          • Fair Scheduler
          • Enabling alternative schedulers
          • When to use alternative schedulers
      • Scaling
        • Adding capacity to a local Hadoop cluster
        • Have a go hero adding a node and running balancer
        • Adding capacity to an EMR job flow
          • Expanding a running job flow
      • Summary
    • 8. A Relational View on Data with Hive
      • Overview of Hive
        • Why use Hive?
        • Thanks, Facebook!
      • Setting up Hive
        • Prerequisites
        • Getting Hive
      • Time for action installing Hive
        • What just happened?
      • Using Hive
      • Time for action creating a table for the UFO data
        • What just happened?
      • Time for action inserting the UFO data
        • What just happened?
        • Validating the data
      • Time for action validating the table
        • What just happened?
      • Time for action redefining the table with the correct column separator
        • What just happened?
        • Hive tables real or not?
      • Time for action creating a table from an existing file
        • What just happened?
      • Time for action performing a join
        • What just happened?
        • Have a go hero improve the join to use regular expressions
        • Hive and SQL views
      • Time for action using views
        • What just happened?
        • Handling dirty data in Hive
        • Have a go hero do it!
      • Time for action exporting query output
        • What just happened?
        • Partitioning the table
      • Time for action making a partitioned UFO sighting table
        • What just happened?
        • Bucketing, clustering, and sorting... oh my!
        • User-Defined Function
      • Time for action adding a new User Defined Function (UDF)
        • What just happened?
        • To preprocess or not to preprocess...
        • Hive versus Pig
        • What we didn't cover
      • Hive on Amazon Web Services
      • Time for action running UFO analysis on EMR
        • What just happened?
        • Using interactive job flows for development
        • Have a go hero using an interactive EMR cluster
        • Integration with other AWS products
      • Summary
    • 9. Working with Relational Databases
      • Common data paths
        • Hadoop as an archive store
        • Hadoop as a preprocessing step
        • Hadoop as a data input tool
        • The serpent eats its own tail
      • Setting up MySQL
      • Time for action installing and setting up MySQL
        • What just happened?
        • Did it have to be so hard?
      • Time for action configuring MySQL to allow remote connections
        • What just happened?
        • Don't do this in production!
      • Time for action setting up the employee database
        • What just happened?
        • Be careful with data file access rights
      • Getting data into Hadoop
        • Using MySQL tools and manual import
        • Have a go hero exporting the employee table into HDFS
        • Accessing the database from the mapper
        • A better way introducing Sqoop
      • Time for action downloading and configuring Sqoop
        • What just happened?
          • Sqoop and Hadoop versions
          • Sqoop and HDFS
      • Time for action exporting data from MySQL to HDFS
        • What just happened?
          • Mappers and primary key columns
          • Other options
          • Sqoop's architecture
        • Importing data into Hive using Sqoop
      • Time for action exporting data from MySQL into Hive
        • What just happened?
      • Time for action a more selective import
        • What just happened?
          • Datatype issues
      • Time for action using a type mapping
        • What just happened?
      • Time for action importing data from a raw query
        • What just happened?
        • Have a go hero
          • Sqoop and Hive partitions
          • Field and line terminators
      • Getting data out of Hadoop
        • Writing data from within the reducer
        • Writing SQL import files from the reducer
        • A better way Sqoop again
      • Time for action importing data from Hadoop into MySQL
        • What just happened?
          • Differences between Sqoop imports and exports
          • Inserts versus updates
        • Have a go hero
          • Sqoop and Hive exports
      • Time for action importing Hive data into MySQL
        • What just happened?
      • Time for action fixing the mapping and re-running the export
        • What just happened?
          • Other Sqoop features
            • Incremental merge
            • Avoiding partial exports
            • Sqoop as a code generator
      • AWS considerations
        • Considering RDS
      • Summary
    • 10. Data Collection with Flume
      • A note about AWS
      • Data data everywhere...
        • Types of data
        • Getting network traffic into Hadoop
      • Time for action getting web server data into Hadoop
        • What just happened?
        • Have a go hero
        • Getting files into Hadoop
        • Hidden issues
          • Keeping network data on the network
          • Hadoop dependencies
          • Reliability
          • Re-creating the wheel
          • A common framework approach
      • Introducing Apache Flume
        • A note on versioning
      • Time for action installing and configuring Flume
        • What just happened?
        • Using Flume to capture network data
      • Time for action capturing network traffic in a log file
        • What just happened?
      • Time for action logging to the console
        • What just happened?
        • Writing network data to log files
      • Time for action capturing the output of a command to a flat file
        • What just happened?
          • Logs versus files
      • Time for action capturing a remote file in a local flat file
        • What just happened?
        • Sources, sinks, and channels
          • Sources
          • Sinks
          • Channels
          • Or roll your own
        • Understanding the Flume configuration files
        • Have a go hero
        • It's all about events
      • Time for action writing network traffic onto HDFS
        • What just happened?
      • Time for action adding timestamps
        • What just happened?
        • To Sqoop or to Flume...
      • Time for action multi level Flume networks
        • What just happened?
      • Time for action writing to multiple sinks
        • What just happened?
        • Selectors replicating and multiplexing
        • Handling sink failure
        • Have a go hero - Handling sink failure
        • Next, the world
        • Have a go hero - Next, the world
      • The bigger picture
        • Data lifecycle
        • Staging data
        • Scheduling
      • Summary
    • 11. Where to Go Next
      • What we did and didn't cover in this book
      • Upcoming Hadoop changes
      • Alternative distributions
        • Why alternative distributions?
          • Bundling
          • Free and commercial extensions
            • Cloudera Distribution for Hadoop
            • Hortonworks Data Platform
            • MapR
            • IBM InfoSphere Big Insights
          • Choosing a distribution
      • Other Apache projects
        • HBase
        • Oozie
        • Whirr
        • Mahout
        • MRUnit
      • Other programming abstractions
        • Pig
        • Cascading
      • AWS resources
        • HBase on EMR
        • SimpleDB
        • DynamoDB
      • Sources of information
        • Source code
        • Mailing lists and forums
        • LinkedIn groups
        • HUGs
        • Conferences
      • Summary
    • A. Pop Quiz Answers
      • Chapter 3, Understanding MapReduce
        • Pop quiz key/value pairs
        • Pop quiz walking through a run of WordCount
      • Chapter 7, Keeping Things Running
        • Pop quiz setting up a cluster
    • Index
  • Title: Hadoop Beginner's Guide. Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial
  • Author: Gerald Turkington, Kevin A. McGrail
  • Original title: Hadoop Beginner's Guide. Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial.
  • ISBN: 9781849517317
  • Date of issue: 2013-02-22
  • Format: Ebook
  • Item ID: e_3cc7
  • Publisher: Packt Publishing