Friday, October 9, 2020

Decision Tree



"""
Code adapted from "COGNITIVE CLASS.ai - Decision Trees", author: Saeed Aghabozorgi.
A decision tree is a supervised learning algorithm; here it is used for classification.
"""

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

##### Step1: Data & Pre-processing #####
# Imagine that you are a medical researcher compiling data for a study.
# You have collected data about a set of patients, all of whom suffered from the same illness.
# During their course of treatment, each patient responded to one of 5 medications: Drug A, B, C, X, or Y.
# Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness.
# The features of this dataset are the Age, Sex, Blood Pressure, Cholesterol, and Na_to_K value of each patient. The target is the drug that each patient responded to.
# This is a multiclass classification problem (five possible drugs): use the training part of the dataset to build a decision tree, then use the tree to predict the class of an unknown patient, i.e. which drug to prescribe to a new patient.
# scikit-learn decision trees do not accept string-valued categorical features (e.g. Sex, BP), so these features are converted to numerical codes first.

# The dataset used is https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv
df = pd.read_csv("data.csv", delimiter=',')
X = df[['Age','Sex','BP','Cholesterol','Na_to_K']].values

# Pre-processing: convert categorical values into numerical values
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1])

le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW','NORMAL','HIGH'])
X[:,2] = le_BP.transform(X[:,2])

le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL','HIGH'])
X[:,3] = le_Chol.transform(X[:,3])
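
# Aside (an illustrative sketch, not part of the original lab): LabelEncoder
# assigns integer codes in sorted (alphabetical) order of the labels, not in
# the order passed to fit(). The demo_* names below are hypothetical and do
# not affect the variables used by the rest of this script.
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
demo_encoder.fit(['LOW', 'NORMAL', 'HIGH'])
demo_classes = demo_encoder.classes_.tolist()  # labels in sorted order
demo_codes = demo_encoder.transform(['LOW', 'NORMAL', 'HIGH']).tolist()
print("Sorted labels:", demo_classes)
print("Codes for LOW, NORMAL, HIGH:", demo_codes)
# The codes are arbitrary identifiers: 'HIGH' gets 0, 'LOW' gets 1 and
# 'NORMAL' gets 2, so they do not follow the clinical low < normal < high order.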

y = df["Drug"] # type of drug a patient responded to

##### Step2: Train/Test data #####
from sklearn.model_selection import train_test_split
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

##### Step3: Modeling Decision Tree #####
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
drugTree.fit(X_trainset,y_trainset)
predTree = drugTree.predict(X_testset)
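
# Aside (an illustrative sketch, not part of the original lab): with
# criterion="entropy", splits are chosen to reduce the Shannon entropy of the
# class labels in each node, H = -sum(p_i * log2(p_i)). The helper
# label_entropy is a hypothetical name, not a scikit-learn function.
import numpy as np

def label_entropy(labels):
    """Return the Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    # abs() only turns a floating-point -0.0 into 0.0; entropy is never negative.
    return float(abs(-(probs * np.log2(probs)).sum()))

demo_pure = label_entropy(['drugA', 'drugA', 'drugA'])            # single class: 0.0
demo_mixed = label_entropy(['drugA', 'drugA', 'drugB', 'drugB'])  # 50/50 mix: 1.0
print("Entropy of a pure node:", demo_pure)
print("Entropy of a 50/50 node:", demo_mixed)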

##### Step4: Evaluation #####
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTree's Accuracy:", metrics.accuracy_score(y_testset, predTree))
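
# Aside (an illustrative sketch with made-up labels, not results from the
# drug dataset): accuracy_score is the fraction of predictions that match the
# true labels, which can be checked by hand. The demo_* names are hypothetical.
import numpy as np
from sklearn.metrics import accuracy_score

demo_true = np.array(['drugA', 'drugB', 'drugX', 'drugY'])
demo_pred = np.array(['drugA', 'drugB', 'drugY', 'drugY'])
demo_manual_accuracy = float(np.mean(demo_true == demo_pred))  # 3 of 4 match
demo_sklearn_accuracy = float(accuracy_score(demo_true, demo_pred))
print("Manual accuracy:", demo_manual_accuracy)
print("accuracy_score:", demo_sklearn_accuracy)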

##### Step5: Visualization #####
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree

dot_data = StringIO()
filename = "drugtree.png"
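featureNames = list(df.columns[0:5])  # assumes the first five columns are the features used in X
targetNames = df["Drug"].unique().tolist()
# export_graphviz writes the DOT source into dot_data and returns None when out_file is given
tree.export_graphviz(drugTree, feature_names=featureNames, out_file=dot_data, class_names=np.unique(y_trainset), filled=True, special_characters=True, rotate=False)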
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img, interpolation='nearest')
plt.show()  # needed to display the figure when running as a plain script
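
# Aside (an illustrative sketch, not part of the original lab): when Graphviz
# or pydotplus is not installed, sklearn.tree.export_text can print the
# learned rules as plain text instead. The toy_* data and names below are
# hypothetical and unrelated to the drug dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

toy_X = [[0], [1], [2], [3]]
toy_y = ['a', 'a', 'b', 'b']
toy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_tree.fit(toy_X, toy_y)
toy_rules = export_text(toy_tree, feature_names=['toy_feature'])
print(toy_rules)
# For the tree trained above, the analogous call would be
# export_text(drugTree, feature_names=featureNames), assuming featureNames
# is a plain list of the five feature column names.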