Module 1: Introduction
This module introduces both Data Science and Big Data.
- The origins of data science
- What skills does a data scientist need?
- How do they differ from those required for BI (Business Intelligence)?
- Case Studies
Module 2: Big Data
This module outlines the difference between the tabular data that underpins the relational model and Big Data. It explores not only what Big Data is, but also why the commercial (and scientific) worlds are so interested in it. Finally, it looks at how we can cross-analyse tabular data and Big Data.
- Data Science isn't just about big data, but the two are certainly related
- Atomicity of data
- Tabular and big data
- What is big data?
- Why are we interested in big data?
- Where is the business value?
- How can we cross-analyse tabular and big data?
Module 3: Finding the patterns in data
One of the vital skills for a data scientist is to be able to understand how numbers behave, how they are distributed and how we can determine the significance of any differences that we observe between numerical values. This involves an understanding of normal distributions, means, modes and standard deviations as well as, for example, Chi squared and t tests. This module covers these topics.
- Continuous and discontinuous data
- Random numbers aren't
- Flat distributions
- How multiple independent factors interact
- Normal distributions
- Mean, mode and median
- Standard deviation
- Sampling populations
- Chi squared
- t test
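As a small illustration of the summary statistics listed above, the following sketch computes the mean, median, mode and standard deviation with Python's standard `statistics` module. The sample values are invented purely for illustration, and Python is used here only as a convenient notation; it is not taken from the course material.

```python
import statistics

# Hypothetical sample: these values are invented for illustration only.
sample = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(sample)      # arithmetic average
median = statistics.median(sample)  # middle value of the sorted data
mode = statistics.mode(sample)      # most frequently occurring value
stdev = statistics.stdev(sample)    # sample standard deviation (n - 1 denominator)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("standard deviation:", round(stdev, 3))
```

Note that `statistics.stdev` treats the data as a sample drawn from a larger population; `statistics.pstdev` would be the choice if the data were the whole population.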
Module 4: Data models - relational and NoSQL
This module describes the different models that are used to represent data and specifically contrasts the relational and NoSQL worlds. It covers CAP theorem and why that is relevant to data models.
- Schema and schema-less storage
- Deciding what analysis can be performed and where
- CAP theorem
- NoSQL databases
Module 5: Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce are well-established examples of tools and methodologies for manipulating Big Data.
- HDFS and MapReduce
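To give a feel for the MapReduce idea, here is a minimal single-machine sketch of a word count in plain Python. It does not use Hadoop or HDFS; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample documents are hypothetical and chosen only to mirror the map, shuffle and reduce steps of the model.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Hypothetical input: each string stands in for one document.
documents = ["big data is big", "data science uses data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines, with the framework handling the grouping step; this sketch only shows the shape of the computation.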
Module 6: Data visualisation
The ability to create data visualisations that have meaning for a given set of data and the target audience is a major part of being a Data Scientist. This module describes how to plan and deploy visualisations and provides two case studies of where this has been successfully achieved.
- Different visualisations for different types of data
- Visualising data - a case study
Module 7: Introduction to R
R is a well-established, open source language, very specifically aimed at analysis. This module introduces the language and provides some practical work in using it.
- Introduction to R
- Lab: Using R
Module 8: Data mining
This module introduces not only data mining, but also the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, which helps to ensure that data mining is carried out effectively. It also introduces the Monte Carlo methodology for modelling and analysing systems.
- What is data mining?
- Data mining v. querying
- Business understanding
- Data understanding
- Data preparation
- Result validation
- Change and monitor
- Dangers of over-fitting
- Outliers (and how to deal with them)
- False positives
- Monte Carlo simulations
- Specific data mining techniques
- Clustering - Design an algorithm
- Decision trees
- Sequence analysis
- Neural nets
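As one illustration of the Monte Carlo approach mentioned in this module, the sketch below estimates pi by random sampling. It is a classic textbook example rather than anything taken from the course itself; the function name `estimate_pi` and the fixed seed are assumptions made so that the run is repeatable.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points that fall inside a quarter circle."""
    rng = random.Random(seed)  # fixed seed (an assumption) so results are repeatable
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()  # random point in the unit square
        if x * x + y * y <= 1.0:           # point lies inside the quarter circle
            inside += 1
    # Quarter-circle area / square area = pi / 4, so scale the fraction by 4.
    return 4 * inside / n_samples

estimate = estimate_pi(100_000)
print("estimate of pi:", estimate)
```

The estimate improves slowly as the number of samples grows, which is typical of Monte Carlo methods: the error shrinks roughly with the square root of the sample count.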