Overview
Get hands-on experience designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, and analyze data. This course covers structured, unstructured, and streaming data.
Products:
- BigQuery
- Bigtable
- Cloud Storage
- Cloud SQL
- Spanner
- Dataproc
- Dataflow
- Cloud Data Fusion
- Cloud Composer
- Pub/Sub
Prerequisites
Participants should have:
- Prior Google Cloud experience using Cloud Shell and accessing products from the Google Cloud console.
- Basic proficiency with a common query language such as SQL.
- Experience with data modeling and ETL (extract, transform, load) activities.
- Experience developing applications using a common programming language such as Python.
Target audience
This course is designed for:
- Data engineers
- Database administrators
- System administrators
Learning Outcomes
By the end of this course, learners will be able to:
- Design and build data processing systems on Google Cloud.
- Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.
- Derive business insights from extremely large datasets using BigQuery.
- Leverage unstructured data using Spark and ML APIs on Dataproc.
- Enable instant insights from streaming data.
Course Outline
Module 01: Data engineering tasks and components
- The role of a data engineer
- Data sources versus data sinks
- Data formats
- Storage solution options on Google Cloud
- Metadata management options on Google Cloud
- Share datasets using Analytics Hub
Module 02: Data replication and migration
- Replication and migration architecture
- The gcloud command line tool
- Moving datasets
- Datastream
Module 03: The extract and load data pipeline pattern
- Extract and load architecture
- The bq command line tool
- BigQuery Data Transfer Service
- BigLake
Module 04: The extract, load, and transform data pipeline pattern
- Extract, load, and transform (ELT) architecture
- SQL scripting and scheduling with BigQuery
- Dataform
Module 05: The extract, transform, and load data pipeline pattern
- Extract, transform, and load (ETL) architecture
- Google Cloud GUI tools for ETL data pipelines
- Batch data processing using Dataproc
- Streaming data processing options
- Bigtable and data pipelines
Module 06: Automation techniques
- Automation patterns and options for pipelines
- Cloud Scheduler and Workflows
- Cloud Composer
- Cloud Run functions
- Eventarc
Module 07: Introduction to data engineering
- Data engineer’s role
- Data engineering challenges
- Introduction to BigQuery
- Data lakes and data warehouses
- Transactional databases versus data warehouses
- Effective partnership with other data teams
- Management of data access and governance
- Building of production-ready pipelines
- Google Cloud customer case study
Module 08: Build a data lake
- Introduction to data lakes
- Data storage and ETL options on Google Cloud
- Building of a data lake using Cloud Storage
- Secure Cloud Storage
- Store all sorts of data types
- Cloud SQL as your OLTP system
Module 09: Build a data warehouse
- The modern data warehouse
- Introduction to BigQuery
- Get started with BigQuery
- Loading of data into BigQuery
- Exploration of schemas
- Schema design
- Nested and repeated fields
- Optimization with partitioning and clustering
Module 10: Introduction to building batch data pipelines
- EL, ELT, ETL
- Quality considerations
- Ways of executing operations in BigQuery
- Shortcomings
- ETL to solve data quality issues
Module 11: Execute Spark on Dataproc
- The Hadoop ecosystem
- Run Hadoop on Dataproc
- Cloud Storage instead of HDFS
- Optimize Dataproc
Module 12: Serverless data processing with Dataflow
- Introduction to Dataflow
- Reasons why customers value Dataflow
- Dataflow pipelines
- Aggregating with GroupByKey and Combine
- Side inputs and windows
- Dataflow templates
Module 13: Manage data pipelines with Cloud Data Fusion and Cloud Composer
- Build batch data pipelines visually with Cloud Data Fusion
  - Components
  - Overview
  - Building a pipeline
  - Exploring data using Wrangler
- Orchestrate work between Google Cloud services with Cloud Composer
  - Apache Airflow environment
  - DAGs and operators
  - Workflow scheduling
  - Monitoring and logging
Module 14: Serverless messaging with Pub/Sub
- Introduction to Pub/Sub
- Pub/Sub push versus pull
- Publishing with Pub/Sub code
Module 16: Dataflow streaming features
- Streaming data challenges
- Dataflow windowing
Module 17: High-throughput BigQuery and Bigtable streaming features
- Streaming into BigQuery and visualizing results
- High-throughput streaming with Bigtable
- Optimizing Bigtable performance
Module 18: Advanced BigQuery functionality and performance
- Analytic window functions
- GIS functions
- Performance considerations
Exams and assessments
There is no specific certification related to this course.
Hands-on learning
There are practical labs in this course.
Frequently asked questions
How can I create an account on myQA.com?
There are a number of ways to create an account. If you are a self-funder, simply select the "Create account" option on the login page.
If you have been booked onto a course by your company, you will receive a confirmation email. From this email, select "Sign into myQA" and you will be taken to the "Create account" page. Complete all of the details and select "Create account".
If you have the booking number you can also go here and select the "I have a booking number" option. Enter the booking reference and your surname. If the details match, you will be taken to the "Create account" page from where you can enter your details and confirm your account.
Find more answers to frequently asked questions in our FAQs: Bookings & Cancellations page.
How do QA’s virtual classroom courses work?
Our virtual classroom courses allow you to access award-winning classroom training, without leaving your home or office. Our learning professionals are specially trained on how to interact with remote attendees and our remote labs ensure all participants can take part in hands-on exercises wherever they are.
We use the WebEx video conferencing platform by Cisco. Before you book, check that you meet the WebEx system requirements and run a test meeting to ensure the software is compatible with your firewall settings. If it doesn’t work, try adjusting your settings or contact your IT department about permitting the website.
How do QA’s online courses work?
QA online courses, also commonly known as distance learning or e-learning courses, take the form of interactive software designed for individual learning. You will also have access to full support from our subject-matter experts for the duration of your course.
Once you have purchased the online course and completed your registration, you will receive the details you need to access it through our e-learning platform, and you can start learning straight away from any compatible device. Access to the online learning platform is valid for one year from the booking date.
All courses are built around case studies and presented in an engaging format, which includes storytelling elements, video, audio and humour. Every case study is supported by sample documents and a collection of Knowledge Nuggets that provide more in-depth detail on the wider processes.
When will I receive my joining instructions?
Joining instructions for QA courses are sent two weeks prior to the course start date, or immediately if the booking is confirmed within this timeframe. For course bookings made via QA but delivered by a third-party supplier, joining instructions are sent to attendees prior to the training course, but timescales vary depending on each supplier’s terms.
When will I receive my certificate?
Certificates of Achievement are issued at the end of the course, either as a hard copy or via email.