Decision Trees with Python

Let’s take a look at how we can construct decision trees in Python.

A decision tree is a model used to solve classification and regression tasks. As we saw in our example for R, the model splits the data along the explanatory variables to arrive at a set of possible outcomes, which we can then use to make a decision.

In this particular example, we will analyse the effect of various explanatory variables (age, gender, web pages viewed per day, hours of video watched per week, person’s income) on internet usage (megabyte consumption per week).

1. Firstly, let’s load our libraries.

Here, we are importing the numpy library as np, as well as the train_test_split function from sklearn. Using the latter, we are going to split the data into training and test data: the model is built on the training data, and its performance is then checked against the test data. We also use os to set the working directory to the folder that contains the dataset.

In [1]:

import os

import numpy as np
from sklearn.model_selection import train_test_split

# Set the working directory to the folder that holds the dataset
path = "C:/Users/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)
os.getcwd()

Out[1]:

'C:\\Users\\michaeljgrogan\\Documents\\a_documents\\computing\\data science\\datasets'

2. We then load our dataset and variables.

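We now use np.loadtxt to load our data from a CSV file. The first five columns hold the explanatory variables (x), and the sixth column holds internet usage, our dependent variable (y).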

In [2]:

# Variables
dataset = np.loadtxt("internetlogit.csv", delimiter=",")
x = dataset[:, 0:5]  # explanatory variables (columns 0-4)
y = dataset[:, 5]    # dependent variable (column 5)
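To make the column slicing concrete, here is a minimal sketch that runs np.loadtxt on a small in-memory CSV. The three rows of values are invented purely for illustration and are not taken from the real dataset; only the layout (five explanatory columns followed by the target) mirrors the code above.

```python
import io

import numpy as np

# Hypothetical three-row stand-in for the CSV file:
# five explanatory columns followed by the target column.
csv_text = """25,1,10,5,30000,1200
40,0,22,12,52000,8400
31,1,15,8,41000,3100
"""

# np.loadtxt accepts any file-like object, so a StringIO works for a demo.
# A file that starts with a header row would also need skiprows=1.
demo = np.loadtxt(io.StringIO(csv_text), delimiter=",")

x_demo = demo[:, 0:5]  # columns 0-4: explanatory variables
y_demo = demo[:, 5]    # column 5: dependent variable

print(x_demo.shape)  # (3, 5)
print(y_demo)        # [1200. 8400. 3100.]
```

Note that np.loadtxt expects purely numeric content by default, so a CSV with text columns would need a different loader.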

3. Import the DecisionTreeRegressor from sklearn.

Here, we are importing the DecisionTreeRegressor, and then splitting our data into training and test components. As mentioned, the model is fitted on X_train and y_train only; X_test and y_test are held back for evaluation.

In [3]:

from sklearn.tree import DecisionTreeRegressor
X_train, X_test, y_train, y_test = train_test_split(x, y)
tree = DecisionTreeRegressor().fit(X_train,y_train)
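Because train_test_split shuffles the rows at random, the call above produces a different split (and therefore slightly different scores) on each run. The following is a minimal sketch of a reproducible version, using synthetic stand-in data rather than the real dataset; the seed value 0 and the names x_demo, y_demo and demo_tree are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): 100 rows, 5 explanatory columns.
rng = np.random.default_rng(0)
x_demo = rng.random((100, 5))
y_demo = 1000 * x_demo[:, 0] + 500 * x_demo[:, 1]

# With no test_size given, train_test_split holds out 25% of the rows.
# random_state fixes the shuffle so the split is the same on every run.
X_tr, X_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (75, 5) (25, 5)

# Seeding the tree as well keeps ties between equally good splits repeatable.
demo_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
```

Fixing the seed is mainly useful when you want the numbers in a write-up to match what a reader gets when rerunning the code.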

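4. Determine training and test set scores.

Although the output below is labelled as accuracy, tree.score returns the coefficient of determination (R²) for a regressor, not a classification accuracy. We see an R² of 1.000 on the training set and 0.851 on the test set. The perfect training score reflects a fully grown tree fitting its training rows exactly, so the test score is the more meaningful of the two.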

In [4]:

print("Training set accuracy: {:.3f}".format(tree.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(tree.score(X_test, y_test)))

Out[4]:

Training set accuracy: 1.000
Test set accuracy: 0.851
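To see what the score method is measuring, the sketch below compares it with r2_score from sklearn.metrics on synthetic stand-in data. The data, the noise level and the variable names are assumptions made for the demonstration; the point is only that the two numbers agree for a regressor.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): a noisy linear target.
rng = np.random.default_rng(0)
x_demo = rng.random((200, 5))
y_demo = 1000 * x_demo[:, 0] + rng.normal(0, 50, 200)

X_tr, X_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=0)
demo_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# For a regressor, .score() is the R^2 of the predictions against the targets.
score_via_method = demo_tree.score(X_te, y_te)
score_via_metric = r2_score(y_te, demo_tree.predict(X_te))
print(np.isclose(score_via_method, score_via_metric))  # True

# A tree grown without a depth limit fits its own training rows almost perfectly,
# which is why the training score alone says little about predictive quality.
train_score = demo_tree.score(X_tr, y_tr)
test_score = score_via_method
print(train_score > test_score)
```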

5. Yield predictions from our decision tree.

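Using tree.predict, we generate predictions for the dependent variable from our model. Note that x here is the full dataset, so the predictions cover both the training rows and the test rows.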

In [5]:

dtree = tree.predict(x)

In [6]:

dtree

Out[6]:

array([  875.,  1792., 27754., 28256.,  4438.,  2102.,  8520.,   500.,
       22997., 26517., 15109., 20956.,  3310.,  3197.,  1211., 18005.,
       22854., 10278.,   739.,  3724.,  4733.,   971.,  6263., 24677.,
...
        8113.,  3166.,  5332.,  2232., 21989.,  5360.,  5837.,  2509.,
        5580.,  5947., 11564.,  5888.,  9130., 16105.,  1593.,  4448.,
       12771., 28511.,  6883.])

6. Next, we calculate the percentage error between our predictions and the actual data.

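The decision tree model has yielded estimates of internet usage for us, and we now wish to calculate the percentage deviation between the estimated and actual results. The deviation is expressed relative to the predicted value and keeps its sign. Most entries are zero because the tree reproduces the rows it was trained on exactly.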

In [7]:

percentageerror_tree=((y-dtree)/dtree)*100
percentageerror_tree

Out[7]:

array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -6.59871600e+00,  0.00000000e+00,
...
        1.64641577e+02,  0.00000000e+00,  1.09477689e+01,  0.00000000e+00,
        0.00000000e+00,  1.21515057e+01,  0.00000000e+00,  6.60971223e+00,
        0.00000000e+00,  0.00000000e+00, -1.13177394e+01])

In [8]:

np.mean(percentageerror_tree)

Out[8]:

4.116144626944652
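One thing to keep in mind when reading this figure is that positive and negative errors cancel when a signed percentage error is averaged. The sketch below uses four invented actual/predicted pairs to contrast the signed mean with the mean absolute percentage error; as an assumption that differs from the cell above, it divides by the actual value, which is the more common convention. In practice it would also make sense to compute such a figure on the held-out test rows only.

```python
import numpy as np

# Hypothetical actual and predicted values for four held-out rows.
y_actual = np.array([100.0, 200.0, 400.0, 500.0])
y_pred = np.array([110.0, 180.0, 400.0, 550.0])

# Signed percentage error, relative to the actual value.
signed_pct = (y_pred - y_actual) / y_actual * 100

# Averaging signed errors lets over- and under-predictions offset each other.
mean_signed = np.mean(signed_pct)

# Taking absolute values first gives the mean absolute percentage error (MAPE).
mean_abs = np.mean(np.abs(signed_pct))

print(round(float(mean_signed), 2))  # 2.5
print(round(float(mean_abs), 2))     # 7.5
```

Here the signed mean suggests a 2.5% deviation, while the typical miss is actually 7.5% in size.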

7. Let’s now graph our decision tree.

While our decision tree model scores well on the test set, we would now like to plot the tree itself, so that we can interpret the relationships between our variables.

The library we will use to do this is called graphviz, and can be installed with pip as follows (I’m using Python 3.6 at the time of writing, so will use pip3 to install):

pip3 install graphviz

Now that graphviz is installed, we first export the tree as a .dot file, and then read it back in:

from sklearn.tree import export_graphviz
export_graphviz(tree,out_file="tree.dot")

import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()

graphviz.Source(dot_graph)

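Once that’s done, we can open up a shell and convert the .dot file to PDF format. The dot command belongs to the Graphviz software itself, which has to be installed separately from the pip package: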

dot -Tpdf tree.dot -o tree.pdf
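By default the exported nodes are labelled with column indices, which makes the plot harder to read. The sketch below shows one way to label them, using the feature_names argument of export_graphviz on a small synthetic tree. The column names and their order are assumptions (the source does not state the column order of the CSV), and max_depth=2 is used only to keep the demo output short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Synthetic stand-in data (an assumption): the target depends on column 0 only.
rng = np.random.default_rng(0)
x_demo = rng.random((50, 5))
y_demo = 1000 * x_demo[:, 0]

# A shallow tree keeps the exported graph small enough to read.
demo_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x_demo, y_demo)

# Hypothetical column names; the real column order in the CSV is assumed here.
feature_names = ["age", "gender", "webpages", "videohours", "income"]

# out_file=None returns the DOT source as a string instead of writing a file.
dot_text = export_graphviz(
    demo_tree,
    out_file=None,
    feature_names=feature_names,
    filled=True,
)

# Show the first few lines of the DOT source.
print("\n".join(dot_text.splitlines()[:3]))
```

The resulting string can be passed to graphviz.Source in the same way as the contents of tree.dot.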

Here is a plot of our decision tree:

[Figure: decision trees]

Conclusion: How Decision Trees Have Helped Us Predict Internet Usage

As we can see, the mean percentage deviation between our predicted values and the actual values is about 4.12%, computed across the full dataset. Taking this together with the R² score of 0.851 for our test set, the decision tree appears to predict internet usage well.


Dataset

internet.csv
