DataCamp will hold your hand as you dip your toes into your very first Kaggle competition. This mini course walks you through the steps and makes sure you feel the success of climbing the Kaggle leaderboard in no time.

Here is my review of the DataCamp course Kaggle R Tutorial on Machine Learning, which is built on Trevor Stephens' tutorial.

 


It's always nice to have someone holding your hand the first time you dip your toe

DataCamp will hold your hand while you explore a small dataset and gradually build an R script using decision trees and random forests. A small part of the script (the data cleaning) is already written for you. The script produces a CSV file that you can upload to Kaggle.

What prerequisites are needed

A little R programming experience.

Knowing the concepts of training and test set.

Knowing just a little about random forests makes it more fun, but is not strictly necessary.

If you have been following the Coursera courses R Programming and Practical Machine Learning, you are good to go.

The data cleaning is done for you; otherwise knowing how to clean data would be a prerequisite.

Course summary:

The course is divided into three chapters:

Chapter 1, Raising anchor:

- Make a first prediction based purely on gender; very simple.

- You can choose to upload this to Kaggle. I tried it and came in at no. 2565 on the leaderboard, with a score of 0.76555. A minimal sketch of such a submission follows below.
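For orientation, here is a minimal sketch of what a gender-based submission could look like. The column names PassengerId, Sex and Survived come from Kaggle's Titanic data, but the file paths and names are my own choices, not the course's:

# Read the Kaggle Titanic test set (path is an assumption)
test <- read.csv("test.csv")

# Predict: all women survive, all men do not
test$Survived <- 0
test$Survived[test$Sex == "female"] <- 1

# Write the two-column submission file Kaggle expects
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(my_solution, file = "gender_solution.csv", row.names = FALSE)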

Chapter 2, From icebergs to trees:

- Build a prediction function with decision trees using the rpart package.

- Upload the result to Kaggle. I did this and came in at no. 1553 on the leaderboard with a score of 0.78469, an improvement of 0.01914 (about 1.9 percentage points):

[Screenshot: Kaggle leaderboard after the decision tree submission]

- Next we learn about overfitting.

- Introducing new, calculated features gives a better prediction function; I was now no. 758 with a score of 0.80383 (see the sketch below):

[Screenshot: Kaggle leaderboard after adding the calculated features]
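As a rough sketch of the chapter 2 approach (the exact features and control parameters in the course may differ; family_size is just an example of a calculated feature, and the minsplit/cp values are illustrative):

library(rpart)

# Example of a calculated feature: family size from siblings/spouses and parents/children
train$family_size <- train$SibSp + train$Parch + 1
test$family_size <- test$SibSp + test$Parch + 1

# Grow a classification tree; minsplit and cp control how far the tree grows,
# and loosening them too much leads to the overfitting the course demonstrates
my_tree <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare +
                   Embarked + family_size,
                 data = train, method = "class",
                 control = rpart.control(minsplit = 50, cp = 0))

# Predict classes on the test set
my_prediction <- predict(my_tree, newdata = test, type = "class")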

Chapter 3, Improving your predictions through random forests:

- Introduces the random forest algorithm.

- The data is cleaned for missing values (this is done for you).

- Some trouble with the way the answer is tested: it depends on the order of the variables in the function call. I wasted some time on this annoying problem; if you look at the discussion in the sidebar of the course, you will see that others had the same issue.

Here is the code that worked for me:

library(randomForest)

# Apply the Random Forest Algorithm
my_forest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp +
                            Parch + Fare + Embarked + Title,
                          data = train, importance = TRUE, ntree = 1000)

# Make your prediction using the test set
my_prediction <- predict(object = my_forest, newdata = test)

- Plot the importance of the variables with varImpPlot(my_forest); see the sketch below.
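For completeness, a hedged sketch of the remaining steps; my_solution and the file name are my own naming, and PassengerId plus Survived is the submission format Kaggle expects for the Titanic competition:

# Plot which variables the forest found most important
varImpPlot(my_forest)

# Write the submission file for Kaggle
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)
write.csv(my_solution, file = "forest_solution.csv", row.names = FALSE)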

It turns out that the random forest in this case is not as good as the simple decision trees:

[Screenshot: Kaggle leaderboard after the random forest submission]

This is somewhat confusing, since the chapter is called "Improving your predictions through random forests". The course lets you set a seed in the script, so my best guess is that the difference is due to changes in some of the R packages involved.

This is fun

This is a very small and easy-to-understand dataset, so if you are eager to try out Kaggle but are not sure how to get started, I can really recommend this course. Uploading to Kaggle for the first time and seeing your name on the leaderboard feels great, even if it is at number 2565; there is nothing like a friendly competition to get you motivated. It's a shame, though, that the third chapter doesn't give the expected results. I hope DataCamp will fix this.

Anyone can do it

Anyone who has seen a bit of R code and understands very basic concepts of machine learning can follow this little course.

What did I learn from this

I learned that getting started with Kaggle is easier than one might think; it's not all about huge datasets and competing for money. You also get a nice-looking certificate that you can download and share on social media.

Here is what mine looks like:

[Image: my DataCamp course certificate]

How much time did I spend:

DataCamp says 1 hour, but I spent a little more than 2 hours; some time was wasted due to the problem with evaluation on DataCamp (the discussion on the site shows that a lot of people have trouble with this). I think 1-3 hours is what you need.


Would be nice if 

DataCamp would let me download all my script work after the course.

Your scripts are stored on DataCamp, though, so you can go back to the course after you have finished and review your work by clicking through the exercises with the 'Next' button.

If you want to keep a local copy of the scripts you produce, you will have to copy-paste them after each chapter.

It would also be nice if DataCamp fixed chapter 3 so that the random forest script produces a better result than the decision trees.