Add learning curves

moink · May 8, 2019 · 307aa93
1 parent 3371e83
commit 307aa93

Showing 7 changed files with 98 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -132,6 +132,12 @@ Running that model on the test set (not used in tuning the hyperparameters) resu
 
 This comes out to a precision of 98.8%, recall of 98.2%, f1-score of 0.985, and accuracy of 97.6%. This seems like a reasonably good classifier for this problem.
 
+I created a learning curve for this classifier, shown below.
+
+![Learning curve for two-class PCA SVM](/outputs/lc_two_class_pca_svm.png "Learning curve for two-class PCA SVM")
+
+This learning curve shows that the training and cross-validation accuracy have not yet converged, and that the cross-validation accuracy is still increasing with the number of data points. The gap between the two is about 2 percentage points of accuracy. This model would benefit from gathering more data, or perhaps from generating synthetic data from the existing data. The problem could also benefit from more time put into feature engineering, feature selection, and hyperparameter optimization.
+
 ## Multiclass Classification
 
 ### Principal Components Analysis and Support Vector Machine
@@ -146,6 +152,12 @@ Let's look in more detail about why the classifier was not able to get better ac
 
 As you can see, the algorithm does pretty well at distinguishing class 1 (seizures) from the rest, like in the binary classification. It has errors of almost every combination, except that it doesn't ever identify a seizure data point as a healthy, eyes-open point. It has trouble telling classes 4 and 5 - eyes open and closed - and especially classes 2 and 3 - measurements in the tumor area and in the healthy part of the brain where there's a tumor elsewhere - apart.
 
+Here's the learning curve for this classifier:
+
+![Learning curve for five-class PCA SVM](/outputs/lc_five_class_pca_svm.png "Learning curve for five-class PCA SVM")
+
+As in the binary classification problem, but more markedly, this learning curve shows that the training and cross-validation accuracy have not converged, and that the cross-validation accuracy is still increasing with the number of data points. The gap between the two is about 25 percentage points of accuracy, which is quite large, so there is considerable room for improvement. This model would benefit from gathering more data, or perhaps from generating synthetic data from the existing data. The problem could also benefit from more time put into feature engineering, feature selection, and hyperparameter optimization.
+
 ### Random Forest
 
 Since the accuracy was only 70%, even with the best hyperparameters, I wanted to try a different class of method, so I chose a random decision forest. I again ran 5-fold cross-validation on the training set to choose the hyperparameters. Unfortunately, it wasn't an improvement on the PCA and SVM pipeline. The grid search chose a maximum tree depth of 25, a maximum number of features used per tree of 30, and 250 trees. All of these were the largest values in my grid search, and it took 43 seconds to train this forest. The best cross-validation accuracy on the training set was 69.3%. On the held-out test set, the accuracy was 68.7%. Here's the confusion matrix on the test set:
@@ -242,6 +254,6 @@ To run it run:
 
 Your python will give you a URL - visit it with your browser.
 
-## Conclusion
+## Next steps
 
-I think I could improve the models with further tuning (for example, a finer grid search) and by trying other models (e.g. different neural network architectures, logistic regression, fourier transforms). However I think the improvements would be marginal and that I have gotten pretty close to the limits of what the data can provide.
+The learning curves show that there is still substantial room for improvement in these classifiers. Further attention to feature engineering and dimensionality reduction would help. I would also want to try Fourier decomposition as the dimensionality-reduction step. If the researchers are able to collect more data, that would help too. It would also be interesting to generate synthetic data and see whether it improves the models.
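Note on the synthetic-data idea in the new "Next steps" text: one simple way to derive extra samples from existing ones is to add small random noise to each feature row and keep the label. The sketch below is only an illustration of that idea, not something this commit implements. The helper name `jitter_samples`, the `noise_std` value, and the tiny example rows are all assumptions, and whether jittered EEG traces keep their class is an open question that would need checking.

```python
import random


def jitter_samples(features, targets, noise_std=0.01, seed=0):
    """Return noisy copies of each feature row, paired with the same labels.

    Hypothetical helper: Gaussian jitter is one common augmentation for
    time series, chosen here purely for illustration.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    noisy = [[value + rng.gauss(0.0, noise_std) for value in row]
             for row in features]
    return noisy, list(targets)


# Placeholder rows standing in for rescaled feature vectors
features = [[0.10, 0.12, 0.11], [0.50, -0.40, 0.45]]
targets = [5, 1]
augmented, augmented_targets = jitter_samples(features, targets)
print(len(augmented), augmented_targets)
```

Any augmented rows should be added to the training split only, so that the held-out test set still measures performance on real data.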
diff --git a/epiclass/epiclass.py b/epiclass/epiclass.py
@@ -27,6 +27,7 @@
 import os
 
 import joblib
+import numpy as np
 import matplotlib
 import matplotlib.pyplot as plt
 import pandas as pd
@@ -39,6 +40,8 @@
 from sklearn.model_selection import train_test_split, GridSearchCV
 from sklearn.pipeline import Pipeline
 from sklearn.svm import SVC
+from sklearn.model_selection import learning_curve
+from sklearn.model_selection import ShuffleSplit
 
 LEGEND_COORDS = (1.2, 0.8)
 TOTAL_PCA_COMPONENTS = 60
@@ -73,7 +76,6 @@ def run(actions):
     """
     epidata = pd.read_csv(os.path.join('data', 'data.csv'))
     set_matplotlib_params()
-    # explore_data(epidata)
     features = epidata.drop(['y', 'Unnamed: 0'], axis=1) / 2047.0
     target = epidata['y']
     x_train, x_test, y_train, y_test = train_test_split(features, target,
@@ -90,6 +92,7 @@ def run(actions):
     if 'nn' in actions:
         run_nn(x_train, y_train, x_test, y_test)
 
+
 def run_explore(epidata, x_train, y_train):
     """Generate several plots to help understand the data
 
@@ -130,6 +133,7 @@ def run_pca_svm2(x_train, y_train, x_test, y_test):
                            x_test, (y_test == 1).astype(int),
                            'two_class_pca_svm')
 
+
 def run_pca_svm5(x_train, y_train, x_test, y_test):
     """Train multiclass classifier with PCA and SVM
 
@@ -208,6 +212,7 @@ def run_nn(x_train, y_train, x_test, y_test):
     create_and_test_neural_net(x_train, x_test, y_train, y_test)
     visualize_confusion(os.path.join('outputs', 'confusion_nn'))
 
+
 def save_data_to_file(features, targets, filename):
     """Save features and targets to a csv file
 
@@ -404,6 +409,10 @@ def train_and_save_pca_svm(n_components, C, gamma, x_train, y_train,
                                   'confusion_' + filename_root + '.csv'))
     model_filename = os.path.join('models', filename_root + '.z')
     joblib.dump(pipeline, model_filename)
+    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
+    plot_learning_curve(pipeline, 'Learning curve: PCA-SVM', x_train, y_train,
+                        os.path.join('outputs', 'lc_' + filename_root + '.png'),
+                        cv=cv)
 
 
 def make_violin_plots(features, targets):
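Note on the `ShuffleSplit` call added above: with `test_size=0.2`, each of the 10 splits holds out 20% of `x_train` for validation, and the default `train_sizes` of `np.linspace(.1, 1.0, 5)` are fractions of the remaining 80%. The sketch below works through that arithmetic in plain Python. The helper name `approximate_train_sizes` and the value `n_samples = 1000` are placeholders rather than anything from this repository, and scikit-learn's own rounding may differ by a sample.

```python
import math


def approximate_train_sizes(n_samples, test_size=0.2, n_ticks=5,
                            lowest=0.1, highest=1.0):
    """Approximate the absolute training sizes on the learning-curve x-axis."""
    # Samples left for training once each shuffle split holds out its test part
    max_train = n_samples - math.ceil(n_samples * test_size)
    # Evenly spaced fractions, like np.linspace(lowest, highest, n_ticks)
    step = (highest - lowest) / (n_ticks - 1)
    fractions = [lowest + i * step for i in range(n_ticks)]
    # Truncate to whole samples
    return [int(max_train * fraction) for fraction in fractions]


# n_samples is a placeholder value for illustration
sizes = approximate_train_sizes(1000)
print(sizes)
```

The largest point on the curve is therefore 80% of the training set, which is worth keeping in mind when comparing it with scores reported for the full training set.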
@@ -878,5 +887,75 @@ def train_nn(x_train, y_train):
     return model
 
 
+def plot_learning_curve(estimator, title, x_train, y_train, filename,
+                        ylim=None, cv=None, train_sizes=None):
+    """
+    Generate a plot of the learning curve.
+    Args:
+        estimator : object type that implements the "fit" and "predict" methods
+            A classifier for which to plot the learning curve.
+        title: str
+            Title for the chart.
+        x_train: array-like, shape (n_samples, n_features)
+            Training vector, where n_samples is the number of samples and
+            n_features is the number of features.
+        y_train: array-like, shape (n_samples) or
+            (n_samples, n_features)
+            Target relative to x_train for classification.
+        filename: str
+            Path to file to save
+        ylim: tuple, shape (ymin, ymax), optional
+            Defines minimum and maximum y-values plotted.
+        cv: int, cross-validation generator or an iterable, optional
+            Determines the cross-validation splitting strategy.
+            Possible inputs for cv are:
+              - None, to use the default 3-fold cross-validation,
+              - integer, to specify the number of folds.
+              - :term:`CV splitter`,
+              - An iterable yielding (train, test) splits as arrays of indices.
+        train_sizes : array-like, shape (n_ticks,) of float or int
+            Relative or absolute numbers of training examples that will be used
+            to generate the learning curve. If a float, it is
+            regarded as a fraction of the maximum size of the training set
+            (that is determined by the selected validation method), i.e. it has
+            to be within (0, 1]. Otherwise it is interpreted as absolute
+            sizes of the training sets. Note that for classification the
+            number of samples usually has to be big enough to contain at
+            least one sample from each class. (default:
+            np.linspace(0.1, 1.0, 5))
+
+    Returns:
+        None
+    """
+    # mostly taken from
+    # https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
+    if train_sizes is None:
+        train_sizes = np.linspace(.1, 1.0, 5)
+    plt.figure()
+    plt.title(title)
+    if ylim is not None:
+        plt.ylim(*ylim)
+    plt.xlabel("Training examples")
+    plt.ylabel("Score")
+    train_sizes, train_scores, test_scores = learning_curve(
+        estimator, x_train, y_train, cv=cv, train_sizes=train_sizes)
+    train_scores_mean = np.mean(train_scores, axis=1)
+    train_scores_std = np.std(train_scores, axis=1)
+    test_scores_mean = np.mean(test_scores, axis=1)
+    test_scores_std = np.std(test_scores, axis=1)
+    plt.grid()
+    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
+                     train_scores_mean + train_scores_std, alpha=0.1,
+                     color="r")
+    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
+                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
+    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
+             label="Training score")
+    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
+             label="Cross-validation score")
+    plt.legend(loc="best")
+    plt.savefig(filename)
+
+
 if __name__ == '__main__':
     print('To run methods in this module, use the run_epiclass.py script')
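Note on reading numbers off these curves: `learning_curve` returns score arrays with one row per training-set size and one column per split, and `plot_learning_curve` averages across the splits. The sketch below reproduces that averaging in plain Python and reports the gap between training and cross-validation accuracy at the largest training size. The helper name `summarize_gap` and the score values are made up for illustration; they are not the results from this commit.

```python
from statistics import mean, pstdev


def summarize_gap(train_scores, test_scores):
    """Return (train_mean, test_mean, gap, test_std) for the last training size.

    Each argument is a list of rows, one row per training-set size and one
    value per cross-validation split.
    """
    train_mean = mean(train_scores[-1])
    test_mean = mean(test_scores[-1])
    # pstdev is the population standard deviation, like np.std's default
    test_std = pstdev(test_scores[-1])
    return train_mean, test_mean, train_mean - test_mean, test_std


# Illustrative placeholder scores: 2 training sizes x 4 splits
train_scores = [[1.00, 1.00, 1.00, 1.00], [0.99, 1.00, 0.99, 1.00]]
test_scores = [[0.95, 0.96, 0.94, 0.95], [0.97, 0.98, 0.97, 0.98]]

train_mean, test_mean, gap, test_std = summarize_gap(train_scores, test_scores)
print(f"train={train_mean:.3f} cv={test_mean:.3f} gap={gap:.3f} std={test_std:.3f}")
```

Comparing the gap with the spread across splits gives a rough sense of whether the difference between the two curves is larger than the split-to-split variation.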
diff --git a/outputs/lc_five_class_pca_svm.png b/outputs/lc_five_class_pca_svm.png
diff --git a/outputs/lc_two_class_pca_svm.png b/outputs/lc_two_class_pca_svm.png
diff --git a/setup.py b/setup.py
@@ -11,6 +11,6 @@
     description='Visualization and prediction of epileptic seizure data set',
     install_requires=['keras', 'pandas', 'joblib', 'matplotlib', 'seaborn',
                       'scikit-learn', 'flask>=1.0.2', 'flask_restful',
-                      'TensorFlow'],
+                      'TensorFlow', 'numpy'],
     scripts=['run_epiclass.py', 'api.py', 'test_deployment.py']
 )
diff --git a/test_deployment.py b/test_deployment.py
@@ -122,6 +122,5 @@ def test_confusion_matrix(self):
         expected_result.columns = confusion.columns
         assert_frame_equal(confusion, expected_result)
 
-
 if __name__ == '__main__':
     unittest.main()
diff --git a/web_gui.py b/web_gui.py
@@ -120,8 +120,6 @@ def convert_query(self, query):
         model_input = rescaled.values.reshape(1, -1)
         return model_input
 
-
-if __name__ == '__main__':
-    # Route the URL to the resource
-    api.add_resource(PredictSeizure, '/')
-    app.run(debug=True)
+# Route the URL to the resource
+api.add_resource(PredictSeizure, '/')
+app.run(debug=True)