Hadoop Beginner's Guide
Ebook · 1,003 pages · 7 hours


About this ebook

In Detail

Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.

"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real world problems.

Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, how to maintain the system, and how to use additional products to integrate with other systems.

While exploring different ways to develop applications that run on Hadoop, the book also covers tools such as Hive, Sqoop, and Flume, which show how Hadoop can be integrated with relational databases and log collection.

In addition to examples that run on local Hadoop clusters on Ubuntu, the book covers the use of cloud services such as Amazon EC2 and Elastic MapReduce.

Approach

As a Packt Beginner's Guide, the book is packed with clear step-by-step instructions for performing the most useful tasks, getting you up and running quickly, and learning by doing.

Who this book is for

This book assumes no existing experience with Hadoop or cloud services. It does assume familiarity with a programming language such as Java or Ruby, but gives you the needed background on the other topics.

Language: English
Release date: Feb 22, 2013
ISBN: 9781849517317
Author

Garry Turkington

Garry Turkington has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering at Improve Digital and the company's lead architect, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital he spent time at Amazon UK, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this he spent a decade in various government positions in both the UK and USA. He has BSc and PhD degrees in computer science from the Queens University of Belfast in Northern Ireland and an MEng in Systems Engineering from Stevens Institute of Technology in the USA.



Reviews for Hadoop Beginner's Guide

Rating: 4.1 out of 5 stars (7 ratings, 2 reviews)


  • Rating: 1 out of 5 stars
    I'm a data architect and modeler. I'm not a programmer. I was looking to get an overview to understand how Hadoop would influence the way I architect and model. I read the first chapter and haven't got a clue.
  • Rating: 4 out of 5 stars
    good

Book preview

Hadoop Beginner's Guide - Garry Turkington

Table of Contents

Hadoop Beginner's Guide

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe?

Free Access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Time for action – heading

What just happened?

Pop quiz – heading

Have a go hero – heading

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. What It's All About

Big data processing

The value of data

Historically for the few and not the many

Classic data processing systems

Scale-up

Early approaches to scale-out

Limiting factors

A different approach

All roads lead to scale-out

Share nothing

Expect failure

Smart software, dumb hardware

Move processing, not data

Build applications, not infrastructure

Hadoop

Thanks, Google

Thanks, Doug

Thanks, Yahoo

Parts of Hadoop

Common building blocks

HDFS

MapReduce

Better together

Common architecture

What it is and isn't good for

Cloud computing with Amazon Web Services

Too many clouds

A third way

Different types of costs

AWS – infrastructure on demand from Amazon

Elastic Compute Cloud (EC2)

Simple Storage Service (S3)

Elastic MapReduce (EMR)

What this book covers

A dual approach

Summary

2. Getting Hadoop Up and Running

Hadoop on a local Ubuntu host

Other operating systems

Time for action – checking the prerequisites

What just happened?

Setting up Hadoop

A note on versions

Time for action – downloading Hadoop

What just happened?

Time for action – setting up SSH

What just happened?

Configuring and running Hadoop

Time for action – using Hadoop to calculate Pi

What just happened?

Three modes

Time for action – configuring the pseudo-distributed mode

What just happened?

Configuring the base directory and formatting the filesystem

Time for action – changing the base HDFS directory

What just happened?

Time for action – formatting the NameNode

What just happened?

Starting and using Hadoop

Time for action – starting Hadoop

What just happened?

Time for action – using HDFS

What just happened?

Time for action – WordCount, the Hello World of MapReduce

What just happened?

Have a go hero – WordCount on a larger body of text

Monitoring Hadoop from the browser

The HDFS web UI

The MapReduce web UI

Using Elastic MapReduce

Setting up an account in Amazon Web Services

Creating an AWS account

Signing up for the necessary services

Time for action – WordCount on EMR using the management console

What just happened?

Have a go hero – other EMR sample applications

Other ways of using EMR

AWS credentials

The EMR command-line tools

The AWS ecosystem

Comparison of local versus EMR Hadoop

Summary

3. Understanding MapReduce

Key/value pairs

What it means

Why key/value data?

Some real-world examples

MapReduce as a series of key/value transformations

Pop quiz – key/value pairs

The Hadoop Java API for MapReduce

The 0.20 MapReduce Java API

The Mapper class

The Reducer class

The Driver class

Writing MapReduce programs

Time for action – setting up the classpath

What just happened?

Time for action – implementing WordCount

What just happened?

Time for action – building a JAR file

What just happened?

Time for action – running WordCount on a local Hadoop cluster

What just happened?

Time for action – running WordCount on EMR

What just happened?

The pre-0.20 Java MapReduce API

Hadoop-provided mapper and reducer implementations

Time for action – WordCount the easy way

What just happened?

Walking through a run of WordCount

Startup

Splitting the input

Task assignment

Task startup

Ongoing JobTracker monitoring

Mapper input

Mapper execution

Mapper output and reduce input

Partitioning

The optional partition function

Reducer input

Reducer execution

Reducer output

Shutdown

That's all there is to it!

Apart from the combiner…maybe

Why have a combiner?

Time for action – WordCount with a combiner

What just happened?

When you can use the reducer as the combiner

Time for action – fixing WordCount to work with a combiner

What just happened?

Reuse is your friend

Pop quiz – MapReduce mechanics

Hadoop-specific data types

The Writable and WritableComparable interfaces

Introducing the wrapper classes

Primitive wrapper classes

Array wrapper classes

Map wrapper classes

Time for action – using the Writable wrapper classes

What just happened?

Other wrapper classes

Have a go hero – playing with Writables

Making your own

Input/output

Files, splits, and records

InputFormat and RecordReader

Hadoop-provided InputFormat

Hadoop-provided RecordReader

OutputFormat and RecordWriter

Hadoop-provided OutputFormat

Don't forget Sequence files

Summary

4. Developing MapReduce Programs

Using languages other than Java with Hadoop

How Hadoop Streaming works

Why use Hadoop Streaming

Time for action – implementing WordCount using Streaming

What just happened?

Differences in jobs when using Streaming

Analyzing a large dataset

Getting the UFO sighting dataset

Getting a feel for the dataset

Time for action – summarizing the UFO data

What just happened?

Examining UFO shapes

Time for action – summarizing the shape data

What just happened?

Time for action – correlating sighting duration to UFO shape

What just happened?

Using Streaming scripts outside Hadoop

Time for action – performing the shape/time analysis from the command line

What just happened?

Java shape and location analysis

Time for action – using ChainMapper for field validation/analysis

What just happened?

Have a go hero

Too many abbreviations

Using the Distributed Cache

Time for action – using the Distributed Cache to improve location output

What just happened?

Counters, status, and other output

Time for action – creating counters, task states, and writing log output

What just happened?

Too much information!

Summary

5. Advanced MapReduce Techniques

Simple, advanced, and in-between

Joins

When this is a bad idea

Map-side versus reduce-side joins

Matching account and sales information

Time for action – reduce-side join using MultipleInputs

What just happened?

DataJoinMapper and TaggedMapperOutput

Implementing map-side joins

Using the Distributed Cache

Have a go hero – Implementing map-side joins

Pruning data to fit in the cache

Using a data representation instead of raw data

Using multiple mappers

To join or not to join...

Graph algorithms

Graph 101

Graphs and MapReduce – a match made somewhere

Representing a graph

Time for action – representing the graph

What just happened?

Overview of the algorithm

The mapper

The reducer

Iterative application

Time for action – creating the source code

What just happened?

Time for action – the first run

What just happened?

Time for action – the second run

What just happened?

Time for action – the third run

What just happened?

Time for action – the fourth and last run

What just happened?

Running multiple jobs

Final thoughts on graphs

Using language-independent data structures

Candidate technologies

Introducing Avro

Time for action – getting and installing Avro

What just happened?

Avro and schemas

Time for action – defining the schema

What just happened?

Time for action – creating the source Avro data with Ruby

What just happened?

Time for action – consuming the Avro data with Java

What just happened?

Using Avro within MapReduce

Time for action – generating shape summaries in MapReduce

What just happened?

Time for action – examining the output data with Ruby

What just happened?

Time for action – examining the output data with Java

What just happened?

Have a go hero – graphs in Avro

Going forward with Avro

Summary

6. When Things Break

Failure

Embrace failure

Or at least don't fear it

Don't try this at home

Types of failure

Hadoop node failure

The dfsadmin command

Cluster setup, test files, and block sizes

Fault tolerance and Elastic MapReduce

Time for action – killing a DataNode process

What just happened?

NameNode and DataNode communication

Have a go hero – NameNode log delving

Time for action – the replication factor in action

What just happened?

Time for action – intentionally causing missing blocks

What just happened?

When data may be lost

Block corruption

Time for action – killing a TaskTracker process

What just happened?

Comparing the DataNode and TaskTracker failures

Permanent failure

Killing the cluster masters

Time for action – killing the JobTracker

What just happened?

Starting a replacement JobTracker

Have a go hero – moving the JobTracker to a new host

Time for action – killing the NameNode process

What just happened?

Starting a replacement NameNode

The role of the NameNode in more detail

File systems, files, blocks, and nodes

The single most important piece of data in the cluster – fsimage

DataNode startup

Safe mode

SecondaryNameNode

So what to do when the NameNode process has a critical failure?

BackupNode/CheckpointNode and NameNode HA

Hardware failure

Host failure

Host corruption

The risk of correlated failures

Task failure due to software

Failure of slow running tasks

Time for action – causing task failure

What just happened?

Have a go hero – HDFS programmatic access

Hadoop's handling of slow-running tasks

Speculative execution

Hadoop's handling of failing tasks

Have a go hero – causing tasks to fail

Task failure due to data

Handling dirty data through code

Using Hadoop's skip mode

Time for action – handling dirty data by using skip mode

What just happened?

To skip or not to skip...

Summary

7. Keeping Things Running

A note on EMR

Hadoop configuration properties

Default values

Time for action – browsing default properties

What just happened?

Additional property elements

Default storage location

Where to set properties

Setting up a cluster

How many hosts?

Calculating usable space on a node

Location of the master nodes

Sizing hardware

Processor / memory / storage ratio

EMR as a prototyping platform

Special node requirements

Storage types

Commodity versus enterprise class storage

Single disk versus RAID

Finding the balance

Network storage

Hadoop networking configuration

How blocks are placed

Rack awareness

The rack-awareness script

Time for action – examining the default rack configuration

What just happened?

Time for action – adding a rack awareness script

What just happened?

What is commodity hardware anyway?

Pop quiz – setting up a cluster

Cluster access control

The Hadoop security model

Time for action – demonstrating the default security

What just happened?

User identity

The super user

More granular access control

Working around the security model via physical access control

Managing the NameNode

Configuring multiple locations for the fsimage class

Time for action – adding an additional fsimage location

What just happened?

Where to write the fsimage copies

Swapping to another NameNode host

Having things ready before disaster strikes

Time for action – swapping to a new NameNode host

What just happened?

Don't celebrate quite yet!

What about MapReduce?

Have a go hero – swapping to a new NameNode host

Managing HDFS

Where to write data

Using balancer

When to rebalance

MapReduce management

Command line job management

Have a go hero – command line job management

Job priorities and scheduling

Time for action – changing job priorities and killing a job

What just happened?

Alternative schedulers

Capacity Scheduler

Fair Scheduler

Enabling alternative schedulers

When to use alternative schedulers

Scaling

Adding capacity to a local Hadoop cluster

Have a go hero – adding a node and running balancer

Adding capacity to an EMR job flow

Expanding a running job flow

Summary

8. A Relational View on Data with Hive

Overview of Hive

Why use Hive?

Thanks, Facebook!

Setting up Hive

Prerequisites

Getting Hive

Time for action – installing Hive

What just happened?

Using Hive

Time for action – creating a table for the UFO data

What just happened?

Time for action – inserting the UFO data

What just happened?

Validating the data

Time for action – validating the table

What just happened?

Time for action – redefining the table with the correct column separator

What just happened?

Hive tables – real or not?

Time for action – creating a table from an existing file

What just happened?

Time for action – performing a join

What just happened?

Have a go hero – improve the join to use regular expressions

Hive and SQL views

Time for action – using views

What just happened?

Handling dirty data in Hive

Have a go hero – do it!

Time for action – exporting query output

What just happened?

Partitioning the table

Time for action – making a partitioned UFO sighting table

What just happened?

Bucketing, clustering, and sorting... oh my!

User-Defined Function

Time for action – adding a new User Defined Function (UDF)

What just happened?

To preprocess or not to preprocess...

Hive versus Pig

What we didn't cover

Hive on Amazon Web Services

Time for action – running UFO analysis on EMR

What just happened?

Using interactive job flows for development

Have a go hero – using an interactive EMR cluster

Integration with other AWS products

Summary

9. Working with Relational Databases

Common data paths

Hadoop as an archive store

Hadoop as a preprocessing step

Hadoop as a data input tool

The serpent eats its own tail

Setting up MySQL

Time for action – installing and setting up MySQL

What just happened?

Did it have to be so hard?

Time for action – configuring MySQL to allow remote connections

What just happened?

Don't do this in production!

Time for action – setting up the employee database

What just happened?

Be careful with data file access rights

Getting data into Hadoop

Using MySQL tools and manual import

Have a go hero – exporting the employee table into HDFS

Accessing the database from the mapper

A better way – introducing Sqoop

Time for action – downloading and configuring Sqoop

What just happened?

Sqoop and Hadoop versions

Sqoop and HDFS

Time for action – exporting data from MySQL to HDFS

What just happened?

Mappers and primary key columns

Other options

Sqoop's architecture

Importing data into Hive using Sqoop

Time for action – exporting data from MySQL into Hive

What just happened?

Time for action – a more selective import

What just happened?

Datatype issues

Time for action – using a type mapping

What just happened?

Time for action – importing data from a raw query

What just happened?

Have a go hero

Sqoop and Hive partitions

Field and line terminators

Getting data out of Hadoop

Writing data from within the reducer

Writing SQL import files from the reducer

A better way – Sqoop again

Time for action – importing data from Hadoop into MySQL

What just happened?

Differences between Sqoop imports and exports

Inserts versus updates

Have a go hero

Sqoop and Hive exports

Time for action – importing Hive data into MySQL

What just happened?

Time for action – fixing the mapping and re-running the export

What just happened?

Other Sqoop features

Incremental merge

Avoiding partial exports

Sqoop as a code generator

AWS considerations

Considering RDS

Summary

10. Data Collection with Flume

A note about AWS

Data data everywhere...

Types of data

Getting network traffic into Hadoop

Time for action – getting web server data into Hadoop

What just happened?

Have a go hero

Getting files into Hadoop

Hidden issues

Keeping network data on the network

Hadoop dependencies

Reliability

Re-creating the wheel

A common framework approach

Introducing Apache Flume

A note on versioning

Time for action – installing and configuring Flume

What just happened?

Using Flume to capture network data

Time for action – capturing network traffic in a log file

What just happened?

Time for action – logging to the console

What just happened?

Writing network data to log files

Time for action – capturing the output of a command to a flat file

What just happened?

Logs versus files

Time for action – capturing a remote file in a local flat file

What just happened?

Sources, sinks, and channels

Sources

Sinks

Channels

Or roll your own

Understanding the Flume configuration files

Have a go hero

It's all about events

Time for action – writing network traffic onto HDFS

What just happened?

Time for action – adding timestamps

What just happened?

To Sqoop or to Flume...

Time for action – multi-level Flume networks

What just happened?

Time for action – writing to multiple sinks

What just happened?

Selectors replicating and multiplexing

Handling sink failure

Have a go hero – Handling sink failure

Next, the world

Have a go hero – Next, the world

The bigger picture

Data lifecycle

Staging data

Scheduling

Summary

11. Where to Go Next

What we did and didn't cover in this book

Upcoming Hadoop changes

Alternative distributions

Why alternative distributions?

Bundling

Free and commercial extensions

Cloudera Distribution for Hadoop

Hortonworks Data Platform

MapR

IBM InfoSphere Big Insights

Choosing a distribution

Other Apache projects

HBase

Oozie

Whir

Mahout

MRUnit

Other programming abstractions

Pig

Cascading

AWS resources

HBase on EMR

SimpleDB

DynamoDB

Sources of information

Source code

Mailing lists and forums

LinkedIn groups

HUGs

Conferences

Summary

A. Pop Quiz Answers

Chapter 3, Understanding MapReduce

Pop quiz – key/value pairs

Pop quiz – walking through a run of WordCount

Chapter 7, Keeping Things Running

Pop quiz – setting up a cluster

Index

Hadoop Beginner's Guide

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2013

Production Reference: 1150213

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-84951-730-0

www.packtpub.com

Cover Image by Asher Wishkerman (<a.wishkerman@mpic.de>)

Credits

Author

Garry Turkington

Reviewers

David Gruzman

Muthusamy Manigandan

Vidyasagar N V

Acquisition Editor

Robin de Jongh

Lead Technical Editor

Azharuddin Sheikh

Technical Editors

Ankita Meshram

Varun Pius Rodrigues

Copy Editors

Brandt D'Mello

Aditya Nair

Laxmi Subramanian

Ruta Waghmare

Project Coordinator

Leena Purkait

Proofreader

Maria Gould

Indexer

Hemangini Bari

Production Coordinator

Nitesh Thakur

Cover Work

Nitesh Thakur

About the Author

Garry Turkington has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering and Lead Architect at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams building systems that process Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and USA.

He has BSc and PhD degrees in Computer Science from the Queens University of Belfast in Northern Ireland and an MEng in Systems Engineering from Stevens Institute of Technology in the USA.

I would like to thank my wife Lea for her support and encouragement—not to mention her patience—throughout the writing of this book and my daughter, Maya, whose spirit and curiosity is more of an inspiration than she could ever imagine.

About the Reviewers

David Gruzman is a Hadoop and big data architect with more than 18 years of hands-on experience, specializing in the design and implementation of scalable high-performance distributed systems. He has extensive expertise in OOA/OOD and (R)DBMS technology. He is an Agile methodology adept and strongly believes that a daily coding routine makes good software architects. He is interested in solving challenging problems related to real-time analytics and the application of machine learning algorithms to big data sets.

He founded—and is working with—BigDataCraft.com, a boutique consulting firm in the area of big data. Visit their site at www.bigdatacraft.com. David can be contacted at david@bigdatacraft.com. More detailed information about his skills and experience can be found at http://www.linkedin.com/in/davidgruzman.

Muthusamy Manigandan is a systems architect for a startup. Prior to this, he was a Staff Engineer at VMWare and Principal Engineer with Oracle. Mani has been programming for the past 14 years on large-scale distributed-computing applications. His areas of interest are machine learning and algorithms.

Vidyasagar N V has been interested in computer science since an early age. Some of his serious work in computers and computer networks began during his high school days. Later, he went to the prestigious Institute Of Technology, Banaras Hindu University, for his B.Tech. He has been working as a software developer and data expert, developing and building scalable systems. He has worked with a variety of second, third, and fourth generation languages. He has worked with flat files, indexed files, hierarchical databases, network databases, relational databases, NoSQL databases, Hadoop, and related technologies. Currently, he is working as Senior Developer at Collective Inc., developing big data-based structured data extraction techniques from the Web and local information. He enjoys producing high-quality software and web-based solutions and designing secure and scalable data systems. He can be contacted at <vidyasagar1729@gmail.com>.

I would like to thank the Almighty, my parents, Mr. N Srinivasa Rao and Mrs. Latha Rao, and my family who supported and backed me throughout my life. I would also like to thank my friends for being good friends and all those people willing to donate their time, effort, and expertise by participating in open source software projects. Thank you, Packt Publishing for selecting me as one of the technical reviewers for this wonderful book. It is my honor to be a part of it.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. 

Why Subscribe?

Fully searchable across every book published by Packt

Copy and paste, print and bookmark content

On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

This book is here to help you make sense of Hadoop and use it to solve your big data problems. It's a really exciting time to work with data processing technologies such as Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of large corporations and government agencies—is now possible through free open source software (OSS).

But because of the seeming complexity and pace of change in this area, getting a grip on the basics can be somewhat intimidating. That's where this book comes in, giving you an understanding of just what Hadoop is, how it works, and how you can use it to extract value from your data now.

In addition to an explanation of core Hadoop, we also spend several chapters exploring other technologies that either use Hadoop or integrate with it. Our goal is to give you an understanding not just of what Hadoop is but also how to use it as a part of your broader technical infrastructure.

A complementary technology is cloud computing, and in particular, the offerings from Amazon Web Services. Throughout the book, we will show you how to use these services to host your Hadoop workloads, demonstrating that not only can you process large data volumes, but you also don't need to buy any physical hardware to do so.

What this book covers

This book comprises three main parts: Chapters 1 through 5, which cover the core of Hadoop and how it works; Chapters 6 and 7, which cover the more operational aspects of Hadoop; and Chapters 8 through 11, which look at the use of Hadoop alongside other products and technologies.

Chapter 1, What It's All About, gives an overview of the trends that have made Hadoop and cloud computing such important technologies today.

Chapter 2, Getting Hadoop Up and Running, walks you through the initial setup of a local Hadoop cluster and the running of some demo jobs. For comparison, the same work is also executed on Amazon's hosted Hadoop service.
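
As a taste of what those first jobs look like, the bundled examples can be launched with commands along the following lines (the example JAR name varies between Hadoop releases, so treat these as illustrative rather than exact):

hadoop jar hadoop-examples-*.jar pi 4 1000
hadoop jar hadoop-examples-*.jar wordcount input output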

Chapter 3, Understanding MapReduce, goes inside the workings of Hadoop to show how MapReduce jobs are executed and demonstrates how to write applications using the Java API.
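
To give an early sense of what such applications look like, the following is a minimal mapper sketch against the 0.20 (org.apache.hadoop.mapreduce) API; the class name and the simple whitespace tokenization are illustrative rather than a listing taken from the chapter:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits a (word, 1) pair for every word in each input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.length() > 0) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

A matching reducer sums the counts emitted for each word, and a small driver class configures and submits the job; both are developed step by step in the chapter.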

Chapter 4, Developing MapReduce Programs, takes a case study of a moderately sized data set to demonstrate techniques to help when deciding how to approach the processing and analysis of a new data source.

Chapter 5, Advanced MapReduce Techniques, looks at a few more sophisticated ways of applying MapReduce to problems that don't necessarily seem immediately applicable to the Hadoop processing model.

Chapter 6, When Things Break, examines Hadoop's much-vaunted high availability and fault tolerance in some detail and sees just how good it is by intentionally causing havoc through killing processes and using corrupt data.

Chapter 7, Keeping Things Running, takes a more operational view of Hadoop and will be of most use for those who need to administer a Hadoop cluster. Along with demonstrating some best practice, it describes how to prepare for the worst operational disasters so you can sleep at night.

Chapter 8, A Relational View On Data With Hive, introduces Apache Hive, which allows Hadoop data to be queried with a SQL-like syntax.
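
To give a flavor of that syntax, a HiveQL query in the following style (the table and column names here are hypothetical) groups and counts rows much as a relational database would:

SELECT shape, COUNT(*) AS sightings
FROM ufodata
GROUP BY shape
ORDER BY sightings DESC;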

Chapter 9, Working With Relational Databases, explores how Hadoop can be integrated with existing databases, and in particular, how to move data from one to the other.
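
Much of that movement is handled by Sqoop, which is driven from the command line; a sketch of the kind of invocation involved (the connection string, credentials, and paths are placeholders) looks like this:

sqoop import --connect jdbc:mysql://localhost/employees \
    --username hadoopuser -P \
    --table employees --target-dir /data/employees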

Chapter 10, Data Collection with Flume, shows how Apache Flume can be used to gather data from multiple sources and deliver it to destinations such as Hadoop.
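
Flume agents are configured through simple properties files; a minimal sketch (the agent, source, channel, sink, and path names are made up for illustration) that tails a log file and writes the events to HDFS might look like this:

# One agent with a single exec source, memory channel, and HDFS sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/myapp.log
agent1.sources.src1.channels = ch1
agent1.channels.ch1.type = memory
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1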

Chapter 11, Where To Go Next, wraps up the book with an overview of the broader Hadoop ecosystem, highlighting other products and technologies of potential interest. In addition, it gives some ideas on how to get involved with the Hadoop community and to get help.

What you need for this book

As we discuss the various Hadoop-related software packages used in this book, we will describe the particular requirements for each chapter. However, you will generally need somewhere to run your Hadoop cluster.

In the simplest case, a single Linux-based machine will give you a platform to explore almost all the exercises in this book. We assume you have a recent distribution of Ubuntu, but as long as you are familiar with the Linux command line, any modern distribution will suffice.

Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration.

Since we also explore Amazon Web Services, you can run all the examples on EC2 instances, and we will look at some other, more Hadoop-specific uses of AWS throughout the book. AWS services are usable by anyone, but you will need a credit card to sign up!

Who this book is for

We assume you are reading this book because you want to know more about Hadoop at a hands-on level; the key audience is those with software development experience but no prior exposure to Hadoop or similar big data technologies.

For developers who want to know how to write MapReduce applications, we assume you are comfortable writing Java programs and are familiar with the Unix command-line interface. We will also show you a few programs in Ruby, but these are usually only to demonstrate language independence, and you don't need to be a Ruby expert.

For architects and system administrators, the book also provides significant value in explaining how Hadoop works, its place in the broader architecture, and how it can be managed operationally. Some of the more involved techniques in Chapter 4, Developing MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably of less direct interest to this audience.

Conventions

In this book, you will find several headings appearing frequently.

To give clear instructions on how to complete a procedure or task, we use:

Time for action – heading

Action 1

Action 2

Action 3

Instructions often need some extra explanation to make sense, so they are followed with:

What just happened?

This heading explains the working of tasks or instructions that you have just completed.

You will also find some other learning aids in the book, including:

Pop quiz – heading

These are short multiple-choice questions intended to help you test your own understanding.

Have a go hero – heading

These set practical challenges and give you ideas for experimenting with what you have learned.

You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: You may notice that we used the Unix command rm to remove the Drush directory rather than the DOS del command.

A block of code is set as follows:

# * Fine Tuning

#

key_buffer = 16M

key_buffer_size = 32M

max_allowed_packet = 16M

thread_stack = 512K

thread_cache_size = 8

max_connections = 300

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

# * Fine Tuning

#

key_buffer = 16M

key_buffer_size = 32M

max_allowed_packet = 16M

 

thread_stack = 512K

thread_cache_size = 8

max_connections = 300

Any command-line input or output is written as follows:

cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: On the Select Destination Location screen, click on Next to accept the default destination.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. What It's All About

This book is about Hadoop, an open source framework for large-scale data processing. Before we get into the details of the technology and its use in later chapters, it is important to spend a little time exploring the trends that led to Hadoop's creation and its enormous success.

Hadoop was not created in a vacuum; instead, it exists due to the explosion in the amount of data being created and consumed and a shift that sees this data deluge arrive at small startups and not just huge multinationals. At the same time, other trends have changed how software and systems are deployed, using cloud resources alongside or even in preference to more traditional infrastructures.

This chapter will explore some of these trends and explain in detail the specific problems Hadoop seeks to solve and the drivers that shaped its design.

In the rest of this chapter we shall:

Learn about the big data revolution

Understand what Hadoop is
