Exploratory Data Analysis is the 4th course in the Coursera Data Science Specialization.

Quick Overview
Duration of the course: 4 weeks.
Workload: 3-7 hours per week; most of the time will be spent on the projects.
Videos with slides: Approximately 5 hours in total.
Quizzes: 2, in week 1 and week 2.
Other material: None.
Course project: 2 peer-reviewed projects, in week 1 and week 3; see details below.
Formal prerequisites: This course has hard dependencies on R Programming and The Data Scientist’s Toolbox.
Level of difficulty given only the formal prerequisites: Easy to medium.

Course Project

Project 1: 

You are given a dataset of electric power consumption and four figures. The assignment is to recreate those four figures (colors, text, types, etc.) using the base plotting system, so with a very clear goal you get some hands-on experience with the base plotting system. The resulting figures and the R scripts that produce them are shared on GitHub.

Project 2:

In this project you explore a dataset of fine particulate matter pollution in the US. You are given 6 questions about the dataset, and the assignment is to answer each of them with a single plot. You can use any of the plotting systems in R. The resulting figures and the R scripts that produce them are uploaded to Coursera.

About the course

This course covers the essential exploratory techniques for summarizing data. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data. [Coursera]

What will you learn

Week 1

  • Principles of analytic graphics.
    • Principle 1: Show comparisons
    • Principle 2: Show causality, mechanism, explanation
    • Principle 3: Show multivariate data
    • Principle 4: Integrate multiple modes of evidence
    • Principle 5: Describe and document the evidence
    • Principle 6: Content is king
  • Exploratory graphs: Quick and dirty plots to explore basic questions and hypotheses.
  • Base plotting system: create a plot and add annotations. You learn about:
    • Boxplots
    • Histograms
    • Barplots
    • Scatterplots
    • And more....
  • Graphics devices in R. 
  • Tip: In RStudio, running the command example(points) or example(plot) demonstrates the plotting options of base plotting.
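
To make the "create a plot and add annotations" idea concrete, here is a minimal sketch of my own (not taken from the course material). It uses the airquality dataset that ships with R; the output file name, the colors and the choice of a PDF device are assumptions made for illustration. Other file devices such as png() follow the same open, draw, close pattern.

```r
# Base plotting: create a plot first, then add annotations piece by piece.
# The file name below is a placeholder in a temporary directory.
out_file <- file.path(tempdir(), "ozone_vs_wind.pdf")

pdf(out_file, width = 6, height = 6)        # open a file graphics device
with(airquality, plot(Wind, Ozone,
                      main = "Ozone and wind in New York City",
                      xlab = "Wind (mph)", ylab = "Ozone (ppb)"))

# Annotation functions draw on top of the plot that already exists.
may <- subset(airquality, Month == 5)
points(may$Wind, may$Ozone, col = "blue", pch = 19)   # highlight May
fit <- lm(Ozone ~ Wind, data = airquality)            # rows with NA are dropped
abline(fit, lwd = 2)                                  # add the regression line
legend("topright", legend = c("May", "Other months"),
       col = c("blue", "black"), pch = c(19, 1))

invisible(dev.off())                        # close the device to write the file
```

Without the pdf() and dev.off() calls, the same commands draw on the screen device, which is the quicker route while you are still exploring.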

Week 2

  • Lattice plotting system: everything is constructed in one go, returning a so-called trellis object. This is not very easy to work with and doesn't seem to be a favourite of the course instructors either.
  • The ggplot2 plotting system: the newest and prettiest plotting system, created by Hadley Wickham as an implementation of the grammar of graphics.
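
The difference between the two systems is easier to see side by side. The following is a rough sketch of my own, assuming the mtcars dataset that ships with R; the variable choice and labels are arbitrary. The lattice package comes with standard R installations, while ggplot2 has to be installed separately, so that part is skipped when the package is missing.

```r
library(lattice)   # bundled with standard R installations

# lattice: one call describes the whole figure, including one panel per
# number of cylinders.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars, layout = c(3, 1),
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
class(p)           # a "trellis" object; nothing is drawn until it is printed
print(p)

# ggplot2: the same figure built up by adding layers to a plot object.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  g <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    facet_grid(. ~ cyl) +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
  print(g)
}
```

In an interactive session both objects print automatically when typed at the prompt; inside a script or a function the explicit print() calls are needed.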

Week 3

  • Hierarchical clustering. Clustering organizes things that are close into groups. The following questions are explored, though not very deeply, and we don't really get any theory.
    • How do we define close?
    • How do we group things?
    • How do we visualize the grouping?
    • How do we interpret the grouping?
  • K-means clustering. A partitioning approach to clustering.
  • Dimension reduction. We are introduced to:
    • PCA: Principal component analysis. The statistical goal is to find a new set of uncorrelated variables that explains as much of the variability as possible. The principal components equal the right singular vectors from the SVD when the variables are centered and scaled first.
    • SVD: Singular value decomposition. The data compression goal is to find the best matrix of lower rank that still explains the original data.
  • Colors in R plots. Fun for Feinschmeckers.
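
The four clustering questions in the list above map quite directly onto a handful of base R functions. This is a small sketch under my own assumptions: the data are simulated, and the seed, the three cluster centres, the Euclidean distance and complete linkage are arbitrary choices for illustration.

```r
set.seed(1234)  # arbitrary seed so the simulated data are reproducible

# Simulated data: 12 points scattered around three made-up centres.
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
pts <- data.frame(x = x, y = y)

# How do we define close? Here: Euclidean distance between points.
d <- dist(pts, method = "euclidean")

# How do we group things? Merge the closest clusters step by step.
hc <- hclust(d, method = "complete")

# How do we visualize the grouping? As a dendrogram.
plot(hc, main = "Cluster dendrogram")

# How do we interpret the grouping? Cut the tree into a chosen number of groups.
hc_groups <- cutree(hc, k = 3)

# K-means instead partitions the points around a fixed number of centroids.
km <- kmeans(pts, centers = 3, nstart = 10)

# Cross-tabulate the two groupings to see how far they agree.
table(hierarchical = hc_groups, kmeans = km$cluster)
```

Note that the cluster labels themselves are arbitrary, so group 1 from hclust does not have to be group 1 from kmeans; the cross-table is what shows whether the two methods found the same groups.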

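The link between PCA and SVD from the dimension reduction bullet can be checked numerically. This is a sketch of my own on a made-up random matrix (seed, dimensions and the correlated column are arbitrary), assuming the variables are centered and scaled before the decomposition.

```r
set.seed(42)  # arbitrary seed

# Made-up data: 40 observations of 5 variables, two of them correlated.
m <- matrix(rnorm(40 * 5), nrow = 40, ncol = 5)
m[, 2] <- m[, 1] + rnorm(40, sd = 0.3)

pca <- prcomp(m, center = TRUE, scale. = TRUE)
sv  <- svd(scale(m))   # scale() centers and scales each column

# The PCA loadings match the right singular vectors up to sign.
max_diff <- max(abs(abs(pca$rotation) - abs(sv$v)))

# Share of the total variance explained by each component.
var_explained <- sv$d^2 / sum(sv$d^2)

# Data compression view: rank-2 approximation of the scaled matrix.
approx2 <- sv$u[, 1:2] %*% diag(sv$d[1:2]) %*% t(sv$v[, 1:2])
```

Plotting var_explained, for example with plot(var_explained, type = "b"), gives a quick picture of how many components are worth keeping.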

Week 4

  • Clustering case study. Using cluster analysis to understand human activities from smartphone data. This is the dataset you worked with in the course Getting and Cleaning Data. Roger Peng demonstrates how cluster analysis can take us from movement data collected with smartphones to a conclusion about what the person is doing, e.g. standing, sitting or lying down. Quite fun!
  • Air pollution case study. The speed-talking Roger Peng gives a demonstration of how to explore a dataset, with small quizzes in the video to keep you engaged.

Review

This course is basically about learning how to make plots in R. It is definitely one of the easier courses in the specialization. You get an introduction to the three plotting systems in R (base plotting, lattice plotting and ggplot2) and to their properties and differences. In later courses base plotting and ggplot2 are used frequently, so I suggest focusing on learning one or both of those. ggplot2 is the newest plotting system, created by Hadley Wickham as an implementation of the grammar of graphics.

Weeks 3 and 4 introduce clustering techniques and dimension reduction, but we never get to practice any of this in quizzes or projects. We might get to that in a later course, though.

Where to go next

Move on:

After finishing this course, consider:
Coursera Reproducible Research, the 5th course in the Coursera data science specialization. Read my review of Reproducible Research Course.

Go deeper:

If you want to explore this topic further, consider these books:

The Grammar of Graphics is a theory or technique used to shorten the distance from mind to page.

   

   The Grammar of Graphics, by Leland Wilkinson.

ggplot2 is an implementation in R of the Grammar of Graphics. ggplot2 was created by Hadley Wickham.

  ggplot2: Elegant Graphics for Data Analysis (Use R!), by Hadley Wickham.

Beautiful Evidence is about how seeing turns into showing, how data and evidence turn into explanation.

  Beautiful Evidence, by Edward Tufte