About this course

Course type Premium
Course code QAASBH
Duration 3 Days

This three-day course covers the essentials for developers who need to create applications that analyze Big Data stored in Apache Hadoop using Spark. Each student has their own Spark cluster and access to dozens of hands-on labs.

Audience
Data Analysts and Software Developers

Prerequisites

To get the most out of this training, you should have the following knowledge or experience, as these topics will not be covered in class.

  • Hadoop Distributed File System (HDFS), YARN (Yet Another Resource Negotiator) and the MapReduce processing engine
  • Scala or Python coding
  • Linux command line experience

Delegates will learn how to

After successfully completing this course, you will be able to:

  • Discuss how Spark ties into the HDFS and YARN components
  • Tune Spark performance
  • Create applications using ‘spark-submit’
  • Apply Spark operators (Transformations and Actions)
  • Work with Pair RDDs, including joins and unions
  • Stream data with Spark using micro-batch processing and windowing
  • Discuss Spark 2.0 machine learning (ML), including the Pipeline architecture
  • Understand Zeppelin visualizations of Spark output
  • Explain the differences between Resilient Distributed Datasets (RDDs), DataFrames, Tables and Dataset objects

Outline

Module 0. Intro and Setup:

  • How to start Spark and Zeppelin services in Ambari
  • How to log in to Spark using Python and Scala

Module 1. Spark Architecture:

  • What is Apache Spark?
  • Spark components (Driver, Context, YARN, HDFS, Workers, Executors)
  • Spark processing (Jobs, Stages, Tasks)

Module 2. Getting Started with RDDs:

  • Running queries in Python, Scala and Zeppelin
  • Creating RDDs
  • Queries using the most popular Transformations and Actions

Module 3. Pair RDDs:

  • Difference between RDDs and Pair RDDs
  • 1 Pair Actions, 1 Pair Transformations and 2 Pair Transformations
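As a rough illustration of the key-value operations this module refers to, the plain-Python sketch below mimics what a per-key reduction and a join do to collections of (key, value) pairs. It does not use the Spark API (in Spark the corresponding Pair RDD operations are `reduceByKey` and `join`). The helper names `reduce_by_key` and `inner_join` and the sample data are hypothetical, and reading "1 Pair" and "2 Pair" as operations on one or on two Pair RDDs is an assumption.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Combine all values that share a key (an operation on ONE pair collection)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def inner_join(left, right):
    """Match values from TWO pair collections wherever their keys are equal."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


# Hypothetical sample data: units sold and unit prices, keyed by product.
sales = [("apples", 3), ("pears", 5), ("apples", 4)]
prices = [("apples", 0.5), ("pears", 0.8)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = inner_join(sorted(totals.items()), prices)
print(totals)
print(joined)
```

The reduction works on a single keyed collection, while the join needs two collections that share keys, which is one plausible way to read the one-pair and two-pair split in the module outline.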


Module 4. Spark SQL:

  • Working with DataFrames, Tables and Datasets
  • Catalyst optimizer overview

Module 5. Spark Streaming:

  • Working with DStreams
  • Stateless and Stateful Streaming labs using HDFS and Sockets
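To give a feel for the micro-batch and windowing ideas behind this module, here is a small plain-Python sketch. It is not Spark Streaming code: the functions `micro_batches` and `windowed_counts`, the one-second batch length and the sample events are all assumptions made for illustration. Counting within a single batch corresponds loosely to stateless processing, while counting across a sliding window of batches is a simple form of stateful processing.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, word) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, word in events:
        index = int(timestamp // batch_seconds)
        batches.setdefault(index, []).append(word)
    if not batches:
        return []
    # Keep empty batches so that list positions still line up with time.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


def windowed_counts(batches, window_batches, slide_batches):
    """Count words over a sliding window whose size is measured in batches."""
    results = []
    for end in range(slide_batches, len(batches) + 1, slide_batches):
        start = max(0, end - window_batches)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results.append(dict(counts))
    return results


# Hypothetical stream: (arrival time in seconds, word).
events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "a"), (3.9, "b")]

batches = micro_batches(events, batch_seconds=1)
windows = windowed_counts(batches, window_batches=2, slide_batches=1)
print(batches)
print(windows)
```

Each window here covers the two most recent batches and slides forward one batch at a time, so neighbouring windows overlap by one batch.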

Module 6. Visualizations using Zeppelin:

  • Creating various Charts using DataFrames and Tables
  • How to create Pivot charts and Dynamic forms

Module 7. Spark UI:

  • Overview of Jobs, Stages and Tasks
  • Monitoring Spark jobs in Spark UI

Module 8. Performance Tuning:

  • Caching, Checkpoint, Accumulators and Broadcast Variables
  • Hashed Partitions, Tungsten, Executor memory and Serialization

Module 9. Spark Applications:

  • Creating an application via spark-submit
  • Parameter configurations (number of executors, driver memory, executor cores, etc.)
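As a sketch of how the parameters listed above map onto a spark-submit call, the Python snippet below assembles a command line without running it. The flags `--master`, `--num-executors`, `--driver-memory` and `--executor-cores` are standard spark-submit options. The helper `build_spark_submit`, the script name `word_count.py`, the HDFS path and the chosen values are hypothetical and would need adjusting for a real cluster.

```python
import shlex


def build_spark_submit(app_path, num_executors, driver_memory, executor_cores, app_args=()):
    """Return the argument list for a spark-submit call that targets YARN."""
    command = [
        "spark-submit",
        "--master", "yarn",
        "--num-executors", str(num_executors),
        "--driver-memory", driver_memory,
        "--executor-cores", str(executor_cores),
        app_path,
    ]
    # Anything after the application path is passed to the application itself.
    command.extend(app_args)
    return command


# Hypothetical application and sizing values, for illustration only.
command = build_spark_submit(
    "word_count.py",
    num_executors=4,
    driver_memory="2g",
    executor_cores=2,
    app_args=["hdfs:///data/input.txt"],
)
command_line = shlex.join(command)
print(command_line)
```

Building the command as a list keeps each option and its value separate, so the same list could later be handed to `subprocess.run` on a machine where Spark is installed.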

Module 10. Spark 2.0 Machine Learning (ML):

  • How ML Pipelines work
  • Making Predictions using a Decision Tree

Delivery method

Classroom / Attend from Anywhere

Receive classroom training at one of our nationwide training centres, or attend remotely via web access from anywhere.

Trusted, awarded and accredited

Fully accredited to ensure we provide the highest possible standards in learning.

All third party trademark rights acknowledged.