QA | 26 January 2016
Over the last six months or so, I’ve fallen into the pattern of writing a blog entry every time AWS release a new major version of a course. This means I’m now behind by a couple of entries, since just before Christmas, AWS released major updates of their Big Data and Developer courses. In this blog, then, I’ll talk about the new version of 'Big Data on AWS'.
The course 'Big Data on AWS' provides an introduction to running Big Data workloads in the AWS cloud, with an emphasis on Amazon services and best practices. Specifically, it isn't a course about open-source Big Data software: it doesn't teach you Hadoop, Hive, Pig, Spark, or any other such tools, although delegates will briefly use all of them during the labs. Nor is it a course about becoming a data scientist: I can show you how to build a Big Data pipeline on AWS, but I can't tell you what questions to ask to extract value from your data.
If you'd like to learn about Hadoop and its ecosystem (Hive, Pig, etc.) in more detail, take a look at our Hortonworks curriculum. If you're interested in data science, we have several courses, beginning with our introductory course 'Understanding Data Science and Big Data'. But if you want to know more about the AWS services for Big Data, read on!
This is a standalone course, sitting outside of the various AWS certification paths. There is no corresponding exam. The prerequisites are fairly light – delegates should have a basic knowledge of core AWS services; the 'Technical Essentials' course will give you all the background you need. The official prerequisites also suggest delegates should be familiar with Big Data technologies such as Hadoop, although this is not absolutely essential.
As with the other blogs in this series, I'd like to tell you a bit about the content, add some detail to what's in the official course outline and generally give you a feel for the course. Please understand that AWS frequently release new versions of courseware, each delivery is different, and this blog will surely be outdated after a while, so please don't treat these details as contractual – contact us if you want to know the current state of play. That said, the content currently looks roughly like this:
Day One: This day introduces the big data pipeline, covers ingestion and storage issues, introduces Kinesis, and begins to discuss EMR.
- Overview of Big Data: A quick introduction, with discussion of some likely use cases. We discuss the “pipeline” model that provides the structure of the course: raw data arriving to be ingested and stored, followed by processing, possibly followed by visualisation. Hot and cold data: how long does it take to get from raw data to actionable insights?
- Ingestion, Transfer and Compression: Best practices for ingesting data and transferring it to AWS, and the issues around transferring and storing large amounts of data.
- Storage Solutions: Introduces AWS storage solutions, particularly S3 and datastores such as RDS and DynamoDB, comparing and contrasting the capabilities of each. Followed by a hands-on lab using DynamoDB.
- Big Data Processing and Amazon Kinesis: Introducing streaming big data and Kinesis Streams. Ingesting and processing data using Kinesis; architecture of a Kinesis application.
- Introduction to Apache Hadoop and Amazon EMR: Describes Hadoop and its components (e.g. YARN, HDFS, MapReduce) and introduces EMR as a managed Hadoop service.
- Using Amazon EMR: Since the first version of the course, EMR has seen some major changes; we discuss the options now available for launching an EMR cluster.
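As a taste of the Kinesis material above, here's a minimal sketch of putting a record onto a stream with the boto3 SDK. To be clear, this is my own illustration rather than course material: the stream name, event shape and partition-key choice are all invented.

```python
import json

def build_kinesis_record(stream_name, event, partition_key):
    """Package an event as keyword arguments for a Kinesis put_record call.

    Kinesis routes each record to a shard by hashing the partition key,
    so records sharing a key (here, a hypothetical user ID) stay in
    order within a single shard.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),  # payload must be bytes
        "PartitionKey": partition_key,
    }

record = build_kinesis_record(
    "clickstream",                       # hypothetical stream name
    {"user": "alice", "page": "/home"},  # hypothetical event
    partition_key="alice",
)

# With AWS credentials configured, sending the record is one call:
#   import boto3
#   boto3.client("kinesis").put_record(**record)
```

A consumer application then reads records shard by shard – the architecture the course covers in the Kinesis module.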
Day Two: This day is devoted entirely to EMR, the Hadoop ecosystem, and the cost and security issues around using EMR. Day two has many more labs than day one, giving delegates hands-on experience with several different tools from the Hadoop ecosystem.
- Hadoop Programming Frameworks: We introduce some of the most popular ecosystem projects around Hadoop: Pig, Hive, Streaming, Spark, Presto and others. The goal is to get a feel for which tools are suited to which use cases, and how they compare to each other. Followed by two labs: one using Hive to process server logs, and another using Hadoop Streaming to process chemical data.
- Streamlining Your Amazon EMR Experience with Hue: Hue (the Hadoop User Experience) provides a convenient web-based interface for writing and managing Hadoop projects, particularly for Hive and Pig. We introduce Hue, and then do a hands-on lab using it to run Pig scripts.
- Spark on Amazon EMR: We introduce Spark and some of its associated modules, notably Spark SQL, and discuss what they offer to Hadoop/EMR users. Followed by a hands-on lab using Spark and Spark SQL on EMR.
- Managing Your Amazon EMR Costs: We discuss the costs associated with EMR, and how to minimize them. This includes associated costs around data transfer and storage.
- Securing Your Amazon EMR Deployments: The data held in a big data processing environment is often both valuable and confidential; securing the pipeline is essential. We discuss security both for EMR itself and for the other parts of your pipeline, including encryption of data at rest and in flight.
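The Hadoop Streaming lab mentioned above relies on the fact that a streaming job can be written in any language that reads stdin and writes stdout, with Hadoop sorting the mapper output by key between the two phases. A minimal word-count-style sketch in Python (my own illustration, not the course's lab code, which processes chemical data) might look like this:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit a tab-separated (word, 1) line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reducer(lines):
    """Sum the counts per word; Hadoop delivers reducer input sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as either phase: python wordcount.py map|reduce
    phase = mapper if sys.argv[1] == "map" else reducer
    for out in phase(sys.stdin):
        print(out)
```

On EMR you would pass the same script as both the -mapper and -reducer arguments to the hadoop-streaming jar; locally you can simulate a job with a shell pipeline through sort.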
Day Three: This day focuses on analysis and visualisation of data, as well as introducing some design patterns and discussing orchestration. A large part of the day is devoted to Amazon Redshift.
- Data Warehousing and Columnar Datastores: A brief introduction to data warehousing and how it differs from transactional datastores.
- Amazon Redshift and Big Data: An introduction to Redshift, the AWS data warehouse product. Followed by a hands-on lab in which we create an ETL pipeline that processes CloudFront logs using EMR before loading the results into Redshift.
- Optimising Your Amazon Redshift Environment: Discusses loading data into Redshift, choosing distribution and sort keys, compression options and other best practices.
- Big Data Design Patterns: Brings together the various products discussed during the course to show how they can be integrated into a real-world big data environment.
- Visualising and Orchestrating Big Data: A discussion of data visualisation, followed by a quick overview of Data Pipeline, an AWS service that can be used to orchestrate the regular processing of data. Followed by a hands-on lab in which we create a visualisation of the Redshift data created in the previous lab.
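To give a flavour of the Redshift optimisation topics above: a distribution key determines which node slice each row lands on, and a sort key determines the physical order of rows on disk. Here's an illustrative table definition – the table, columns and encodings are invented for this sketch – written as a Python string of the kind you might hand to a Redshift client such as psycopg2:

```python
# Illustrative Redshift DDL (invented schema, not from the course):
# - DISTKEY co-locates rows sharing user_id on one node slice,
#   which helps joins on that column avoid data shuffling;
# - SORTKEY keeps rows ordered by event_time, so queries with a
#   time-range predicate can skip whole blocks;
# - ENCODE picks a per-column compression scheme.
ddl = """
CREATE TABLE page_views (
    user_id    BIGINT       ENCODE delta,
    url        VARCHAR(512) ENCODE lzo,
    event_time TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (event_time);
"""
```

Getting these choices wrong is cheap to fix early and expensive later, since changing keys generally means rebuilding the table – one reason the course gives them a module of their own.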
If you're looking to create a Big Data environment on Amazon Web Services, this course contains a great deal of invaluable information. It integrates well both with our AWS curriculum and with the courses we offer around the Hadoop ecosystem.