Reproducible Research is the 5th course in the Coursera Data Science Specialization.

Quick Overview
Duration of the course: 4 weeks.
Workload: 6-9 hours per week; most of the time will be spent on the projects.
Videos with slides: About 4 hours in total.
Quizzes: 2, in week 1 and week 2.
Other material: None.
Course project: 2 peer-reviewed projects, in week 2 and week 3; see details below.
Formal prerequisites: This course has hard dependencies on R Programming and The Data Scientist’s Toolbox.
Additional, helpful prerequisites: If you have some experience with writing academic reports, in addition to the formal prerequisites, this course will not be hard. You also need to make plots in the projects, which is taught in the Exploratory Data Analysis course, so I strongly recommend taking that course first.
Level of difficulty given only the formal prerequisites: Medium to hard.
Level of difficulty given the formal and additional prerequisites: Easy.

 

 

Course Project

Project 1: 

You are given an activity monitoring dataset to analyse. The assignment specifies which steps to perform in the analysis, including which plots to make, so your hand is held all the way. The result is a report written in R Markdown, and you are even provided with a template for the report. The report is shared on GitHub.

Project 2:

In this project you will explore a dataset of storms and other extreme weather events in the US. You will perform an analysis to answer two questions, and the analysis and conclusions should be presented in a report written in R Markdown and compiled to HTML using knitr. How you perform the analysis is more up to you than in the first project. The report is shared on RPubs.

About the course

This course focuses on the concepts and tools behind reporting modern data analyses in a reproducible manner. Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.  The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows for people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually conducted the analysis are available. This course will focus on literate statistical analysis tools which allow one to publish data analyses in a single document that allows others to easily execute the same analysis to obtain the same results. [Coursera]

What you will learn

Week 1

  • Concepts and ideas of reproducible research.
  • Literate (Statistical) Programming (originating from Knuth): concepts of weave and tangle. 
  • How to structure a data analysis.
  • Steps in data analysis:

      • Define the question
      • Define the ideal data set
      • Determine what data you can access
      • Obtain the data
      • Clean the data
      • Exploratory data analysis
      • Statistical prediction/modeling
      • Interpret results
      • Challenge results
      • Synthesize/write up results
      • Create reproducible code
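
To make the weave and tangle concepts from the Week 1 list a bit more concrete, here is a minimal sketch in Python. It is not taken from the course: the document text, the chunk names and the functions tangle and weave are made up for illustration, and the chunk delimiters ("<<name>>=" to open, "@" to close) follow the noweb-like style that Sweave uses. Tangle extracts the code for the machine; weave produces the document for the human reader. This weave only displays the code and does not run it.

```python
import re

# A tiny literate document (hypothetical content). A code chunk starts with a
# line "<<name>>=" and ends with a line containing only "@".
DOCUMENT = """\
We start by storing a few daily step counts.
<<load>>=
steps = [120, 340, 95]
@
Then we compute the total number of steps.
<<total>>=
total = sum(steps)
@
"""

# Group 1 is the chunk name, group 2 is the chunk body (including its newline).
CHUNK = re.compile(r"^<<(.*?)>>=\n(.*?)^@$", re.MULTILINE | re.DOTALL)


def tangle(document):
    """Return only the code chunks, joined into one program for the machine."""
    return "".join(match.group(2) for match in CHUNK.finditer(document))


def weave(document):
    """Return a human-readable document with each chunk shown as indented code."""
    def show(match):
        code = match.group(2)
        return "".join("    " + line for line in code.splitlines(keepends=True))
    return CHUNK.sub(show, document)


program = tangle(DOCUMENT)
readable = weave(DOCUMENT)
print(program)
print(readable)
```

The tangled program can be run on its own, while the woven text keeps the prose and the code together in reading order.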

Week 2

  • R Markdown: an extension to Markdown that makes it possible to include R code and LaTeX equations. Really cool. Here is a cheat sheet for R Markdown.
  • Knitr: an R package that lets you compile (or "knit") R Markdown to HTML, PDF or Word via Markdown. It is built into RStudio. Knitr is a tool for literate statistical programming; Sweave is another one, but it is not used in this course.
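
The idea behind knitting can be sketched in a few lines of Python. This is only an illustration of the principle, not how knitr is implemented: the function knit, the sample document and the "<<>>=" / "@" chunk delimiters are all assumptions made to keep the sketch short (R Markdown itself delimits chunks with backtick fences). Each chunk is executed in order, and whatever it prints is spliced into the document right after the code.

```python
import contextlib
import io
import re

# Hypothetical source document with two code chunks.
SOURCE = """\
Total steps over three days:
<<>>=
steps = [120, 340, 95]
print(sum(steps))
@
Largest single day:
<<>>=
print(max(steps))
@
"""

# Group 1 is the chunk body; the opening and closing lines are discarded.
CHUNK = re.compile(r"^<<.*?>>=\n(.*?)^@\n", re.MULTILINE | re.DOTALL)


def knit(source):
    """Run each chunk in order and splice its printed output into the text."""
    namespace = {}  # shared, so later chunks see variables from earlier ones

    def run(match):
        code = match.group(1)
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
        shown = "".join("    " + line for line in code.splitlines(keepends=True))
        # Prefix output lines with "##", the default output prefix in knitr.
        output = "".join(
            "    ## " + line for line in buffer.getvalue().splitlines(keepends=True)
        )
        return shown + output

    return CHUNK.sub(run, source)


report = knit(SOURCE)
print(report)
```

Because the numbers in the report come from running the code, the text and the results cannot drift apart, which is the point of using such a tool for the projects.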

Week 3

  • RPubs.
  • Evidence-based data analysis: A standard best practice for a given scientific area (to support reproducible research). Analogous to a pre-specified clinical trial protocol. A deterministic statistical machine.
  • Reproducible Research Checklist: 

      • Are we doing good science?
      • Was any part of this analysis done by hand?
          - If so, are those parts precisely documented?
          - Does the documentation match reality?
      • Have we taught a computer to do as much as possible (i.e. coded)?
      • Are we using a version control system?
      • Have we documented our software environment?
      • Have we saved any output that we cannot reconstruct from original data + code?
      • How far back in the analysis pipeline can we go before our results are no longer (automatically) reproducible?
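
Two of the checklist items, documenting the software environment and being able to trace results back to the original data, lend themselves to a small sketch. The Python below is my own illustration and not course material: the function names and the sample data are hypothetical. In R, sessionInfo() serves the same purpose as the environment part.

```python
import hashlib
import json
import platform
import sys


def describe_environment():
    """Collect basic facts about the software environment running the analysis."""
    return {
        "python_version": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
    }


def data_fingerprint(raw_bytes):
    """Return a SHA-256 digest so readers can confirm they hold the same data."""
    return hashlib.sha256(raw_bytes).hexdigest()


# Hypothetical raw data; in a real analysis this would be the bytes of the data file.
raw = b"date,steps\n2012-10-01,120\n2012-10-02,340\n"

record = {
    "environment": describe_environment(),
    "data_sha256": data_fingerprint(raw),
}
print(json.dumps(record, indent=2))
```

Saving such a record next to the report lets a reader see which software produced the results and check that their copy of the data is the one that was analysed.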


Week 4

  • Caching computations and packaging them for distribution using the R cacher package.
  • A guest lecture by Baggerly about finding many flaws in a published paper on cancer research.
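
The caching idea from the first Week 4 bullet can be illustrated with a short Python sketch. It is not the API of the R cacher package: the function cached_eval and the on-disk layout are assumptions of mine. It only shows the general principle of keying a stored result on a hash of the code that produced it.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_eval(source, cache_dir):
    """Evaluate a Python expression, reusing a stored result if one exists.

    The cache key is the SHA-256 digest of the expression text, so changing
    the code changes the key and forces a fresh computation.
    Returns a tuple (value, was_cached).
    """
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / (key + ".pkl")
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle), True
    value = eval(source)  # fine for a sketch; never eval untrusted input
    with path.open("wb") as handle:
        pickle.dump(value, handle)
    return value, False


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_eval("sum(range(1000))", cache_dir)   # computed and stored
    second = cached_eval("sum(range(1000))", cache_dir)  # loaded from the cache
    print(first, second)
```

A directory of such cached results could be shared with the code, so that a reader can inspect an expensive intermediate result without rerunning the whole analysis.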

Review

In this course we are introduced to a number of interesting tools, and the two projects provide a good opportunity to practice doing data analysis in R. We are also taught the scientific methods and philosophy of data science, which, not surprisingly, resemble those of other natural sciences. Thus, if you have experience with scientific methods in some other (natural) science, you will most likely find the material of this course very intuitive and easy to follow.

In one of the lectures Roger Peng makes a claim that I find surprising. He says that the group of researchers most likely to reproduce your research is the group that believes you are wrong and that your conclusions are false. I can agree with that, but then Roger Peng claims that this is not science!? I disagree: you should embrace the sceptic, think of it as quality testing, and make sure that your analysis and conclusions are solid enough to withstand the sceptic's investigation. Reproducibility does not mean we can trust the analysis (it may still be wrong or have flaws), but it makes it checkable, like writing down the proof of a mathematical theorem. The proof (the data analysis) may have flaws, and thus the theorem (the conclusion) may or may not be false, but including the proof means that we can check it. If the proof is correct, we can also trust the theorem. If, on the other hand, the proof is not correct, the theorem may still be true, but we can't tell from this proof.

Where to go next

Move on:

After finishing this course, consider:
Coursera Statistical Inference, the 6th course in the Coursera Data Science Specialization. Read my review of the Statistical Inference course.

Go deeper:

If you want to explore this topic further, consider:
Book: Literate Programming by Donald Knuth.