About this course

Course type Premium
Course code QAASAH
Duration 2 Days

This 2-day training course introduces Spark 2.0 functionality and performance tuning techniques. It covers the fundamentals of the Spark project, including the basics of RDDs (Resilient Distributed Datasets) and the operators used on them (Transformations and Actions).

Audience
Data Analysts and Software Developers

Prerequisites

To get the most out of this training, you should have the following knowledge or experience, as it forms the foundation for this advanced course.

  • Apache Spark (Basic) on Hadoop

Delegates will learn how to

After successfully completing this course, you will be able to:

  • Discuss Datasets and how to use the Spark catalog
  • Apply Spark performance tuning techniques
  • Understand the internal workings of the Catalyst optimizer and the Tungsten memory manager
  • Discuss the two ML implementations in Spark (ML and MLlib) and how they differ

Outline

Module 0. Intro and Setup:

  • Zeppelin note

Module 1. Datasets and Catalogs:

  • What is a Dataset?
  • Dataset versus SQL/DataFrames
  • When to use which object
  • Serialization performance using Encoders
  • Encoders and semi-structured data
  • Dataset caching (1 of 2)
  • 01: Dataset Caching (2 of 2)
  • 02a/b/c: Common ways to create DS
  • 03: Creating DS from an RDD
  • Cannot create DS these ways
  • 04: Casting DS and convert DS to DF to RDD
  • 05a: map() on DS means lose column names
  • 05b: map() characteristics on Dataset
  • 06: select on DS
  • 07: filter() and groupBy() on DS
  • 08: joinWith() on DS
  • 09: explain() on DS
  • 10: Catalog: List Hive databases
  • 11: Catalog: List Hive tables, Spark Views
  • 12: Catalog: List column names on table
  • 13: Catalog: List Spark functions
  • Review Questions: Datasets/Catalog
  • In Review: Datasets/Catalog

Module 2. Catalyst and Tungsten functionalities:

  • Before we Begin: Open Zeppelin note
  • DataFrames, Datasets and Views use Catalyst/Tungsten
  • Catalyst optimizer overview
  • 01a: Catalyst: Join on 2 Spark Views demo
  • 01a: Catalyst demo: Join on 2 Spark views
  • But RDDs can’t use Catalyst
  • Loading data in Spark 2.x and Catalyst
  • 02a: Load data (old way), then Join (1 of 3)
  • Execution Plan from ‘old way’ loading (2 of 3)
  • 02b: DataFrameReader: Load/Execution Plan (3 of 3)
  • 03a: Dropping hints to Catalyst (1 of 2)
  • 03b: Dropping hints to Catalyst (2 of 2)
  • 04a: Catalyst: Column pruning demo
  • 04b: Catalyst: Column (& Partition) pruning
  • Catalyst: Predicate pushdown concepts
  • 05: Catalyst: Predicate pushdown (1 of 2)
  • 05: Catalyst: Predicate pushdown (2 of 2)
  • Tungsten overview
  • Tungsten: Binary processing
  • Tungsten: Improved Memory usage
  • 06: Tungsten: Improved Caching demo
  • 07: Tungsten: Whole-stage code gen
  • 08: Tungsten: Whole-stage code gen demo
  • Tungsten: Whole-stage code gen Vectorization
  • Review Questions: Catalyst/Tungsten
  • In Review: Catalyst/Tungsten

Module 3. Performance Tuning:

  • 2 types of Machine Learning
  • How Models Are Created
  • Four common MLlib functions
  • What is Supervised Learning?
  • Spark Supervised Learning workflow
  • Walking the Workflow: Predicting SPAM (1 of 3)
  • Walking the Workflow: Predicting SPAM (2 of 3)
  • Walking the Workflow: Predicting SPAM (3 of 3)
  • Unsupervised Learning
  • RDD – Machine Learning (MLlib)
  • Walking the Workflow: Predicting SPAM (1 of 3)
  • KMeans scenario
  • 01a: KMeans – Load data
  • 01b: KMeans – Create Model and Predict
  • 01c: KMeans – Compare Actual to Predict
  • Collaborative Filtering (CF) recommender
  • Will Carl like ‘Star Wars’?
  • 02a: CF – Load Movie data
  • 02b: CF – Create Model and Factors
  • 02c: CF – Map MovieID to MovieName
  • 02d: CF – Make User recommendation
  • Classification Functions (Supervised)
  • Before we Begin: Classification uses LabeledPoint. So what is LabeledPoint?
  • CASTing X-var and Y-vars for LabeledPoint
  • Logistic Regression, Support Vector Machines, NaïveBayes and Decision Tree (Supervised)
  • 03a: Logistic Regression, Support Vector Machines, NaïveBayes, and Decision Tree
  • 03b: Logistic Regression, Support Vector Machines, NaïveBayes, and Decision Tree
  • 03c: Logistic Regression, Support Vector Machines, NaïveBayes, and Decision Tree
  • 03c: Logistic Regression, Support Vector Machines, NaïveBayes, and Decision Tree (cont.)
  • DataFrames – Machine Learning (ML)
  • ML Pipeline Terminology
  • How ML Pipeline Works
  • 02: Predict Bike Rentals (GBT Regression)
  • 02a: Know the Data
  • 02b: Load and View Data types
  • Clean the Data (remove columns)
  • 02c: Clean the Data (remove columns) (cont.)
  • 02d: Clean the Data (change to Double)
  • 02e: Visualize the DataFrame
  • 02f: Create Train/Test Set from DataFrame
  • Train ML Pipeline – The Big Picture
  • 02g: Define Feature Processing Pipeline
  • 02h: Define Model Training of Pipeline
  • 02i: Add CrossValidation to Pipeline
  • 02j: Tie Features/Model Together in Pipeline
  • 02k: Train the Pipeline
  • 02l: Make Predictions, evaluate Results
  • 02l: Make Predictions, evaluate Results (cont.)
  • 02m/n: Visualize the Model’s DataFrame
  • Improving the Model
  • Predict Titanic Survivors (Random Forest)
  • 03a: Know the Data
  • 03b: Load and view Data types and Data
  • 03c: Clean data – Add column ‘FamilySize’
  • 03d: Clean data – Replace NULLs (cont.)
  • 03e: Clean data – Replace empty strings (cont.)
  • 03f: Split DataFrame into TrainDF / TestDF
  • 03g: IMPORT ML packages
  • 03h: Index Categorical and Label columns
  • 03i: Assemble all Features into Vector
  • 03j: Using Decision Tree classifier
  • 03k: Retrieve Original labels
  • 03l: Create Pipeline
  • 03m: Selecting the best Model
  • 03n: Make Prediction using TestDF
  • Review Questions: Machine Learning
  • In Review: Machine Learning
  • But wait, there’s more (for MLlib) (Appendix)
  • Linear Regression scenario (Supervised)
  • Linear Regression (1 of 6)
  • Linear Regression (2 of 6)
  • Linear Regression (3 of 6)
  • Linear Regression (4 of 6)
  • Linear Regression (5 of 6)
  • Linear Regression (6 of 6)

Delivery method

Classroom / Attend from Anywhere

Receive classroom training at one of our nationwide training centres, or attend remotely via web access from anywhere.

All third party trademark rights acknowledged.