Random Forest

1) Introduction

For this project we will be exploring the use of tree methods to classify schools as Private or Public based off their features.Let’s start by getting the data which is included in the ISLR library, the College data frame.

2) Load Libraries

Let’s start by installing the library

install.packages('ISLR')
install.packages ('ggplot2')
install.packages ('caTools')
install.packages('rpart')
install.packages('rpart.plot')

3) Read the data

Let’s check the head of College, which is a built in data frame with ISLR.

library(ISLR)
head(College)

The data frame has 777 observations on the following 18 variables

Private - A factor with levels No and Yes indicating private or public university
Apps - Number of applications received
Accept - Number of applications accepted
Enroll - Number of new students enrolled
Top10perc - Pct. new students from top 10% of H.S. class
Top25perc - Pct. new students from top 25% of H.S. class
F.Undergrad - Number of fulltime undergraduates
P.Undergrad - Number of parttime undergraduates
Outstate - Out-of-state tuition
Room.Board - Room and board costs
Books - Estimated book costs
Personal - Estimated personal spending
PhD - Pct. of faculty with Ph.D.’s
Terminal - Pct. of faculty with terminal degree
S.F.Ratio - Student/faculty ratio
perc.alumni - Pct. alumni who donate
Expend - Instructional expenditure per student
Grad.Rate - Graduation rate

4) Explore the data

library(ggplot2)
ggplot(df,aes(Room.Board,Grad.Rate)) + geom_point(aes(color=Private))

screen shot 2018-01-22 at 9 59 03 pm

Let’s create a histogram of full time undergrad students

ggplot(df,aes(F.Undergrad)) + geom_histogram(aes(fill=Private),color='black',bins=50)

screen shot 2018-01-22 at 11 09 07 pm

Now let’s create a histogram of Grad.Rate colored by Private

ggplot(df,aes(Grad.Rate)) + geom_histogram(aes(fill=Private),color='black',bins=50)

screen shot 2018-01-22 at 11 12 24 pm

It’s interesting to note that there’s a college with a grad rate more than 100%. Let’s find out which and update the value to 100


subset(df,Grad.Rate > 100)

df['Cazenovia College','Grad.Rate'] <- 100

5) Train & Test data

Before we apply machine learning algorithms, we will need to split the data into training and testing sets. This enables to train an algorithm using the training data set and evaluate its accuracy on the test data set. An unrealistically low error value can arise due to overfitting if an algorithm is trained on the training data and evaluated for performance on the same data.

library(caTools)
set.seed(101) 
sample = sample.split(df$Private, SplitRatio = .70)
train = subset(df, sample == TRUE)
test = subset(df, sample == FALSE)

6) Decision Tree

We will create the model using rpart library to build a decision tree to predict whether or not a school is Private.

Machine Learning

Example of a project}