Apache Hadoop 2.0: Data Analysis with the Hortonworks Data Platform using Pig and Hive


Course type: Essentials

Course details
Course title: Apache Hadoop 2.0: Data Analysis with the Hortonworks Data Platform using Pig and Hive
Delivery method: Classroom
Duration: 4 days



Link to page: www.qa.com/HWAHDPH2

Course dates

Currently scheduled dates for this training course
Location AUG SEP OCT NOV
International House, E1W: 1, 8, 9
Middlesex Street, E1: 6


This 4-day hands-on training course teaches students how to develop applications and analyze Big Data stored in Apache Hadoop 2.0 using Pig and Hive. It covers Hadoop 2.0, YARN and the Hadoop Distributed File System (HDFS) in detail, gives an overview of MapReduce, and takes a deep dive into using Pig and Hive to perform data analytics on Big Data. Other topics include data ingestion using Sqoop and Flume, and defining workflows using Oozie.


Students should be familiar with programming principles and have experience in software development.
SQL knowledge is also helpful. No prior Hadoop knowledge is required.

Target Audience:

Software developers who need to understand and develop applications for Hadoop 2.0.

Delegates will learn how to

  • Explain Hadoop 2.0 and YARN
  • Explain use cases for Hadoop
  • Explain how HDFS Federation works in Hadoop 2.0
  • Explain the various tools and frameworks in the Hadoop 2.0 ecosystem
  • Explain the architecture of the Hadoop Distributed File System (HDFS)
  • Use the Hadoop client to input data into HDFS
  • Use Sqoop to transfer data between Hadoop and a relational database
  • Explain the architecture of MapReduce
  • Explain the architecture of YARN
  • Run a MapReduce job on YARN
  • Write a Pig script to explore and transform data in HDFS
  • Define advanced Pig relations
  • Use Pig to apply structure to unstructured Big Data
  • Invoke a Pig User-Defined Function
  • Use Pig to organize and analyze Big Data
  • Understand how Hive tables are defined and implemented
  • Use the new Hive windowing functions
  • Explain and use the various Hive file formats
  • Create and populate a Hive table that uses the new ORC file format
  • Use Hive to run SQL-like queries to perform data analysis
  • Use Hive to join datasets using a variety of techniques, including Map-side joins and Sort-Merge-Bucket joins
  • Write efficient Hive queries
  • Create ngrams and context ngrams using Hive
  • Perform data analytics like quantiles and page rank on Big Data using the DataFu Pig library
  • Explain the uses and purpose of HCatalog
  • Use HCatalog with Pig and Hive
  • Define a workflow using Oozie
  • Schedule a recurring workflow
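
As a rough illustration of what the Hive windowing objective involves, the sketch below mimics in plain Python how four standard SQL ranking functions (row_number, rank, dense_rank, cume_dist) treat ties within a single ordered partition. The function name `ranking_columns` and the sample values are hypothetical and are not taken from the course material; in Hive itself these functions are written as window functions in a query rather than as Python code.

```python
def ranking_columns(values):
    """Return (value, row_number, rank, dense_rank, cume_dist) for each row,
    ordered ascending by value within one partition."""
    ordered = sorted(values)
    n = len(ordered)
    rows = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that compares unequal to every real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position   # rank skips ahead after a run of ties
            dense += 1        # dense_rank grows by one per distinct value
            previous = value
        # cume_dist: share of rows whose value is <= the current value
        at_or_below = sum(1 for other in ordered if other <= value)
        rows.append((value, position, rank, dense, at_or_below / n))
    return rows


if __name__ == "__main__":
    for row in ranking_columns([10, 20, 20, 30]):
        print(row)
```

The tied values share a rank and a dense_rank but still receive distinct row numbers, which is the distinction the ranking functions are meant to expose.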

Course Outline

Day 1

  • Understanding Hadoop 2.0 and YARN

  • The Hadoop Distributed File System (HDFS)

  • Inputting Data into HDFS

  • The MapReduce Framework

Day 2

  • Introduction to Pig

  • Advanced Pig Programming

Day 3

  • Hive Programming

  • Using HCatalog

Day 4

  • Advanced Hive Programming
  • Data Analysis and Statistics
  • Defining Workflow with Oozie

Lab Content

Students will work through the following lab exercises using the Hortonworks Data Platform 2.0:
  • Use Sqoop to transfer data between HDFS and an RDBMS
  • Run a MapReduce job
  • Run a YARN application
  • Explore and transform data using Pig
  • Split a dataset using Pig
  • Join two datasets using Pig
  • Use Pig to transform and export a dataset for use with Hive
  • Use HCatLoader and HCatStorer to retrieve HCatalog schemas from within a Pig script
  • Understand how a Hive table is stored in HDFS
  • Use Hive to discover useful information in a dataset
  • Understand how Hive queries get executed as MapReduce jobs
  • Perform a join of two datasets with Hive
  • Use advanced Hive features like windowing, views and ORC files
  • Use the Hive analytics functions (rank, dense_rank, cume_dist, row_number)
  • Write a custom reducer in Python that reduces the number of underlying MapReduce jobs generated from a Hive query
  • Analyze and sessionize clickstream data using the Pig DataFu library
  • Compute quantiles of NYSE stock prices
  • Use Hive to compute ngrams on Avro-formatted files
  • Define an Oozie workflow
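
For readers unfamiliar with the custom-reducer lab listed above, the following is a minimal sketch of the general style of such a script, not the course's actual exercise. It assumes rows arrive as tab-separated `key<TAB>value` lines that are already grouped by key (for example after a shuffle or a sort on the key), and it simply sums the values per key. The names `parse` and `reduce_counts` and the input layout are assumptions made for illustration.

```python
from itertools import groupby


def parse(line):
    """Split one 'key<TAB>value' line into a (key, integer value) pair."""
    key, value = line.rstrip("\n").split("\t", 1)
    return key, int(value)


def reduce_counts(lines):
    """Yield 'key<TAB>total' for each run of consecutive lines sharing a key.

    Assumes the input is already grouped by key; groupby only merges
    adjacent rows, so ungrouped input would produce repeated keys.
    """
    pairs = (parse(line) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(value for _, value in group)
        yield f"{key}\t{total}"
```

Used as a streaming script, the function would typically be driven by standard input, for example `for out in reduce_counts(sys.stdin): print(out)`, with Hive's TRANSFORM clause (or Hadoop streaming) feeding it rows. Doing the aggregation in one streamed pass is the kind of approach that can avoid an extra MapReduce stage, which is the goal the lab describes.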
