The following tutorial contains Python examples for solving classification problems. You should refer to Chapters 3 and 4 of the "Introduction to Data Mining" book to understand some of the concepts introduced in this tutorial. The notebook can be downloaded from http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial6/tutorial6.ipynb.
Classification is the task of predicting a nominal-valued attribute (known as the class label) based on the values of other attributes (known as the predictor variables). The goals for this tutorial are as follows: 1. To provide examples of using different classification techniques from the scikit-learn library package. 2. To demonstrate the problem of model overfitting.
Read the step-by-step instructions below carefully. To execute the code, click on the corresponding cell and press the SHIFT-ENTER keys simultaneously.
Vertebrate Dataset
Each vertebrate is classified into one of 5 categories: mammals, reptiles, birds, fishes, and amphibians, based on a set of explanatory attributes (predictor variables). Except for "name", the rest of the attributes have been converted into a binary one-hot encoded representation. To illustrate this, we will first load the data into a Pandas DataFrame object and display its content.
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/INFO-523-Exercises/r-python-exercise3/main/data/vertebrate.csv', header='infer')
data
Given the limited number of training examples, suppose we convert the problem into a binary classification task (mammals versus non-mammals). We can do so by replacing the class labels of all instances with non-mammals, except for those that belong to the mammals class.
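A minimal sketch of this conversion, followed by a cross-tabulation of two attributes against the class label, is shown below (the column names 'Class', 'Warm-blooded', and 'Gives Birth' are assumptions about the dataset):

# Convert to a binary classification task: mammals versus non-mammals
data['Class'] = data['Class'].replace(
    to_replace=['fishes', 'birds', 'amphibians', 'reptiles'], value='non-mammals')

# Cross-tabulate two of the attributes against the class label
# (column names are assumed from the vertebrate data described above)
pd.crosstab([data['Warm-blooded'], data['Gives Birth']], data['Class'])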
The results above show that it is possible to distinguish mammals from non-mammals using these two attributes alone, since each combination of their attribute values yields instances that all belong to the same class. For example, mammals can be identified as warm-blooded vertebrates that give birth to their young. Such a relationship can also be derived using a decision tree classifier, as shown by the example given in the next subsection.
Decision Tree Classifier
In this section, we apply a decision tree classifier to the vertebrate dataset described in the previous subsection.
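A sketch of the commands described below, assuming the binary-labeled DataFrame data from the previous subsection (the 'Name' and 'Class' column names are assumptions):

from sklearn import tree

Y = data['Class']                          # target class attribute
X = data.drop(['Name', 'Class'], axis=1)   # predictor attributes

# Decision tree with entropy as the impurity measure and maximum depth 3
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf = clf.fit(X, Y)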
The preceding commands extract the predictor (X) and target class (Y) attributes from the vertebrate dataset and create a decision tree classifier object that uses entropy as its impurity measure for the splitting criterion. The decision tree class in the Python sklearn library also supports 'gini' as the impurity measure. The classifier above is also constrained to generate trees with a maximum depth of 3. Next, the classifier is trained on the labeled data using the fit() function.
We can plot the resulting decision tree obtained after training the classifier. To do this, you must first install both graphviz (http://www.graphviz.org) and its Python interface called pydotplus (http://pydotplus.readthedocs.io/).
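The following sketch illustrates both steps: drawing the tree with pydotplus and then applying the trained classifier to a handful of test examples. The test instances and their attribute values are illustrative assumptions chosen to match the prediction table shown below; they assume the same one-hot encoded columns as the training data.

import pydotplus
from IPython.display import Image

# Visualize the trained decision tree
dot_data = tree.export_graphviz(clf, feature_names=X.columns,
                                class_names=['mammals', 'non-mammals'],
                                filled=True, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

# Apply the classifier to previously unseen test examples
# (attribute values below are assumptions for illustration)
testData = pd.DataFrame(
    [['gila monster', 0, 0, 0, 0, 1, 1, 'non-mammals'],
     ['platypus',     1, 0, 0, 0, 1, 1, 'mammals'],
     ['owl',          1, 0, 0, 1, 1, 0, 'non-mammals'],
     ['dolphin',      1, 1, 1, 0, 0, 0, 'mammals']],
    columns=data.columns)

testY = testData['Class']
testX = testData.drop(['Name', 'Class'], axis=1)

predY = clf.predict(testX)
pd.concat([testData['Name'], pd.Series(predY, name='Predicted Class')], axis=1)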
Name Predicted Class
0 gila monster non-mammals
1 platypus non-mammals
2 owl non-mammals
3 dolphin mammals
Except for platypus, which is an egg-laying mammal, the classifier correctly predicts the class label of the test examples. We can calculate the accuracy of the classifier on the test data as shown by the example given below.
from sklearn.metrics import accuracy_score

print('Accuracy on test data is %.2f' % (accuracy_score(testY, predY)))
Accuracy on test data is 0.75
Model Overfitting
To illustrate the problem of model overfitting, we consider a two-dimensional dataset containing 1500 labeled instances, each of which is assigned to one of two classes, 0 or 1. Instances from each class are generated as follows: 1. Instances from class 1 are generated from a mixture of 3 Gaussian distributions, centered at [6,14], [10,6], and [14,14], respectively. 2. Instances from class 0 are generated from a uniform distribution over a square region whose sides have a length of 20.
For simplicity, both classes have an equal number of labeled instances. The code for generating and plotting the data is shown below. All instances from class 1 are shown in red while those from class 0 are shown in black.
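A sketch of how such a dataset could be generated and plotted, assuming unit-variance, uncorrelated Gaussian components and the square region [0, 20] x [0, 20] (the random seed is an assumption):

import numpy as np
import matplotlib.pyplot as plt

N = 1500
np.random.seed(1)

# Class 1: mixture of 3 Gaussians centered at [6,14], [10,6], and [14,14]
cov = np.eye(2)   # unit-variance, uncorrelated components (assumption)
X1 = np.concatenate([
    np.random.multivariate_normal([6, 14], cov, N // 6),
    np.random.multivariate_normal([10, 6], cov, N // 6),
    np.random.multivariate_normal([14, 14], cov, N // 6)])

# Class 0: uniform distribution over a square whose sides have length 20
X0 = np.random.uniform(low=0, high=20, size=(N // 2, 2))

X = np.concatenate((X1, X0))
Y = np.concatenate((np.ones(len(X1)), np.zeros(len(X0))))

# Plot class 1 in red and class 0 in black
plt.plot(X1[:, 0], X1[:, 1], 'r+', X0[:, 0], X0[:, 1], 'k.', ms=4)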
In this example, we reserve 80% of the labeled data for training and the remaining 20% for testing. We then fit decision trees of different maximum depths (from 2 to 50) to the training set and plot their respective accuracies when applied to the training and test sets.
#########################################
# Training and Test set creation
#########################################
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

from sklearn import tree
from sklearn.metrics import accuracy_score

#########################################
# Model fitting and evaluation
#########################################
maxdepths = [2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,45,50]

trainAcc = np.zeros(len(maxdepths))
testAcc = np.zeros(len(maxdepths))
index = 0

for depth in maxdepths:
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc[index] = accuracy_score(Y_train, Y_predTrain)
    testAcc[index] = accuracy_score(Y_test, Y_predTest)
    index += 1

#########################################
# Plot of training and test accuracies
#########################################
plt.plot(maxdepths, trainAcc, 'ro-', maxdepths, testAcc, 'bv--')
plt.legend(['Training Accuracy', 'Test Accuracy'])
plt.xlabel('Max depth')
plt.ylabel('Accuracy')
The plot above shows that training accuracy will continue to improve as the maximum depth of the tree increases (i.e., as the model becomes more complex). However, the test accuracy initially improves up to a maximum depth of 5, before it gradually decreases due to model overfitting.
Alternative Classification Techniques
Besides the decision tree classifier, the Python sklearn library also supports other classification techniques. In this section, we provide examples to illustrate how to apply the k-nearest neighbor classifier, linear classifiers (logistic regression and support vector machine), as well as ensemble methods (boosting, bagging, and random forest) to the 2-dimensional dataset given in the previous section.
K-Nearest Neighbor Classifier
In this approach, the class label of a test instance is predicted based on the majority class of its k closest training instances. The number of nearest neighbors, k, is a hyperparameter that must be provided by the user, along with the distance metric. By default, we can use Euclidean distance (which is equivalent to Minkowski distance with exponent p=2):
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

numNeighbors = [1, 5, 10, 15, 20, 25, 30]
trainAcc = []
testAcc = []

for k in numNeighbors:
    clf = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc.append(accuracy_score(Y_train, Y_predTrain))
    testAcc.append(accuracy_score(Y_test, Y_predTest))
plt.plot(numNeighbors, trainAcc, 'ro-', numNeighbors, testAcc, 'bv--')
plt.legend(['Training Accuracy', 'Test Accuracy'])
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
Linear Classifiers
Linear classifiers such as logistic regression and support vector machine (SVM) construct a linear separating hyperplane to distinguish instances from different classes.
For logistic regression, the model can be described by the following equation:

\[ P(y = 1 \mid x) = \frac{1}{1 + \exp(-w^T x - b)} \]

The model parameters \((w^*, b^*)\) are estimated by minimizing the following regularized negative log-likelihood function:

\[ (w^*, b^*) = \arg\min_{w,b} \; -\sum_{i=1}^{N} \Big[ y_i \log P(y_i = 1 \mid x_i) + (1 - y_i) \log P(y_i = 0 \mid x_i) \Big] + \frac{1}{C}\,\Omega\big([w, b]\big) \]

where \(C\) is a hyperparameter that controls the inverse of model complexity (smaller values imply stronger regularization) while \(\Omega(\cdot)\) is the regularization term, which by default is assumed to be an \(l_2\)-norm in sklearn.
For support vector machine, the model parameters \((w^*, b^*)\) are estimated by solving the following constrained (soft-margin) optimization problem:

\[ \min_{w, b, \{\xi_i\}} \; \frac{\|w\|^2}{2} + C \sum_i \xi_i \]
\[ \textrm{subject to} \quad \forall i: \; y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \]

where the slack variables \(\xi_i\) allow some training instances to fall on the wrong side of the margin.
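A sketch of how the two linear classifiers might be fit to the training data for a range of values of the hyperparameter C (the specific C values are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

C_values = [0.01, 0.1, 0.2, 0.5, 0.8, 1, 5, 10, 20, 50]
LRtrainAcc, LRtestAcc = [], []
SVMtrainAcc, SVMtestAcc = [], []

for c in C_values:
    # Logistic regression with l2 regularization (sklearn default)
    clf = LogisticRegression(C=c)
    clf.fit(X_train, Y_train)
    LRtrainAcc.append(accuracy_score(Y_train, clf.predict(X_train)))
    LRtestAcc.append(accuracy_score(Y_test, clf.predict(X_test)))

    # Linear SVM
    clf = SVC(C=c, kernel='linear')
    clf.fit(X_train, Y_train)
    SVMtrainAcc.append(accuracy_score(Y_train, clf.predict(X_train)))
    SVMtestAcc.append(accuracy_score(Y_test, clf.predict(X_test)))

# Plot training and test accuracies as a function of C
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(C_values, LRtrainAcc, 'ro-', C_values, LRtestAcc, 'bv--')
ax1.set_xscale('log'); ax1.set_xlabel('C'); ax1.set_ylabel('Accuracy')
ax1.set_title('Logistic Regression')
ax2.plot(C_values, SVMtrainAcc, 'ro-', C_values, SVMtestAcc, 'bv--')
ax2.set_xscale('log'); ax2.set_xlabel('C'); ax2.set_ylabel('Accuracy')
ax2.set_title('Linear SVM')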
Note that linear classifiers perform poorly on the data since the true decision boundaries between classes are nonlinear for the given 2-dimensional dataset.
Nonlinear Support Vector Machine
The code below shows an example of using a nonlinear support vector machine with a Gaussian radial basis function kernel to fit the 2-dimensional dataset.
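A minimal sketch, assuming the same train/test split as before; the value C=1 and the gamma setting are assumptions:

from sklearn.svm import SVC

# Nonlinear SVM with a Gaussian radial basis function (RBF) kernel
clf = SVC(C=1, kernel='rbf', gamma='auto')
clf.fit(X_train, Y_train)

print('Training accuracy: %.2f' % accuracy_score(Y_train, clf.predict(X_train)))
print('Test accuracy: %.2f' % accuracy_score(Y_test, clf.predict(X_test)))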
Observe that the nonlinear SVM can achieve a higher test accuracy compared to the linear SVM.
Ensemble Methods
An ensemble classifier constructs a set of base classifiers from the training data and performs classification by taking a vote on the predictions made by each base classifier. We consider 3 types of ensemble classifiers in this example: bagging, boosting, and random forest. A detailed explanation of these classifiers can be found in Section 4.10 of the book.
In the example below, we fit 500 base classifiers to the 2-dimensional dataset using each ensemble method. The base classifier corresponds to a decision tree with a maximum depth of 10.
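A sketch of the three ensemble methods using sklearn's BaggingClassifier, AdaBoostClassifier, and RandomForestClassifier with 500 base estimators each (the parameter name estimator= follows recent sklearn versions; older releases use base_estimator= instead):

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

numBaseClassifiers = 500
maxdepth = 10

# Base classifier: decision tree with a maximum depth of 10
base = DecisionTreeClassifier(max_depth=maxdepth)

# Note: in sklearn versions before 1.2, pass base_estimator= instead of estimator=
methods = {
    'Bagging': BaggingClassifier(estimator=base, n_estimators=numBaseClassifiers),
    'AdaBoost': AdaBoostClassifier(estimator=base, n_estimators=numBaseClassifiers),
    'Random Forest': RandomForestClassifier(max_depth=maxdepth, n_estimators=numBaseClassifiers),
}

for name, clf in methods.items():
    clf.fit(X_train, Y_train)
    trainAcc = accuracy_score(Y_train, clf.predict(X_train))
    testAcc = accuracy_score(Y_test, clf.predict(X_test))
    print('%s: training accuracy = %.2f, test accuracy = %.2f' % (name, trainAcc, testAcc))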
This section provides several examples of using the Python sklearn library to build classification models from a given input dataset. We also illustrate the problem of model overfitting and show how to apply different classification methods to the given dataset.