Add learning curves
moink committed May 8, 2019
1 parent 3371e83 commit 307aa93
Showing 7 changed files with 98 additions and 10 deletions.
16 changes: 14 additions & 2 deletions README.md
@@ -132,6 +132,12 @@ Running that model on the test set (not used in tuning the hyperparameters) resu

This comes out to a precision of 98.8%, recall of 98.2%, f1-score of 0.985, and accuracy of 97.6%. This seems like a reasonably good classifier for this problem.
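As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two percentages quoted above. This is only a sketch; the function name `f1_from_precision_recall` is illustrative and not part of this repository.

```python
def f1_from_precision_recall(precision, recall):
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported above for the two-class PCA + SVM classifier.
f1 = f1_from_precision_recall(0.988, 0.982)
print(round(f1, 3))  # prints 0.985
```

The result agrees with the f1-score of 0.985 reported above.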

I created a learning curve for this classifier, shown below.

![Learning curve for two-class PCA SVM](/outputs/lc_two_class_pca_svm.png "Learning curve for two-class PCA SVM")

This learning curve shows that the training and test accuracy have not converged, and that the test accuracy is still increasing with the number of data points. A gap of about 2 percentage points of accuracy remains between the training and test scores. This model would therefore benefit from more data, either newly gathered or generated synthetically from the existing data. The problem could also benefit from more time spent on feature engineering, feature selection, and hyperparameter optimization.
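The gap can also be read off numerically instead of from the plot. The following is a minimal sketch, assuming scikit-learn's `learning_curve` and a synthetic stand-in data set from `make_classification` (not the seizure data); the helper name `final_score_gap` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.svm import SVC


def final_score_gap(estimator, x, y, cv):
    """Return (train_mean, validation_mean, gap) at the largest training size."""
    _, train_scores, test_scores = learning_curve(
        estimator, x, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))
    # Each row is one training size; each column is one cross-validation split.
    train_mean = np.mean(train_scores, axis=1)[-1]
    test_mean = np.mean(test_scores, axis=1)[-1]
    return train_mean, test_mean, train_mean - test_mean


# Synthetic stand-in data (an assumption for illustration only).
x, y = make_classification(n_samples=300, n_features=20, random_state=0)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
train_mean, test_mean, gap = final_score_gap(SVC(gamma='scale'), x, y, cv)
print('train=%.3f validation=%.3f gap=%.3f' % (train_mean, test_mean, gap))
```

A gap that shrinks as the training size grows suggests that more data is likely to help, which is the reading given above.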

## Multiclass Classification

### Principal Components Analysis and Support Vector Machine
@@ -146,6 +152,12 @@ Let's look in more detail about why the classifier was not able to get better ac

As you can see, the algorithm does pretty well at distinguishing class 1 (seizures) from the rest, as in the binary classification. It makes errors in almost every combination, except that it never identifies a seizure data point as a healthy, eyes-open point. It has trouble telling apart classes 4 and 5 (eyes open and eyes closed), and especially classes 2 and 3 (measurements in the tumor area and in the healthy part of a brain that has a tumor elsewhere).

Here's the learning curve:

![Learning curve for multiclass PCA SVM](/outputs/lc_five_class_pca_svm.png "Learning curve for five-class PCA SVM")

As in the binary classification problem, only more so, this learning curve shows that the training and test accuracy have not converged, and that the test accuracy is still increasing with the number of data points. The gap between the training and test scores is about 25 percentage points of accuracy, which is quite high, so there is substantial room for improvement. This model would benefit from more data, either newly gathered or generated synthetically from the existing data. The problem could also benefit from more time spent on feature engineering, feature selection, and hyperparameter optimization.
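One simple way to make synthetic data from existing data is to add small random noise to each training row. This is only a sketch of that idea, not something this repository implements; the function name `augment_with_noise` and the default `noise_scale` are assumptions chosen for illustration.

```python
import numpy as np


def augment_with_noise(x, y, n_copies=2, noise_scale=0.01, seed=0):
    """Append n_copies jittered copies of every row of x, repeating its label."""
    rng = np.random.RandomState(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    # Each copy is the original rows plus independent Gaussian noise.
    noisy = [x + rng.normal(0.0, noise_scale, size=x.shape)
             for _ in range(n_copies)]
    x_aug = np.vstack([x] + noisy)
    y_aug = np.concatenate([y] * (n_copies + 1))
    return x_aug, y_aug


x_demo = np.zeros((4, 3))
y_demo = np.array([1, 2, 3, 4])
x_aug, y_aug = augment_with_noise(x_demo, y_demo)
print(x_aug.shape, y_aug.shape)  # prints (12, 3) (12,)
```

Augmentation like this should be applied only to the training portion of each split. Jittered copies of a row that end up in the validation or test data would leak information and inflate the scores.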

### Random Forest

Since the accuracy was only 70%, even with the best hyperparameters, I wanted to try a different class of method, so I chose a random decision forest. I again ran 5-fold cross-validation on the training set to choose the hyperparameters. Unfortunately, it wasn't an improvement on the PCA and SVM pipeline. The grid search chose a maximum tree depth of 25, a maximum of 30 features used per tree, and 250 trees. Each of these was the largest value in my grid, and it took 43 seconds to train this forest. The best cross-validation accuracy on the training set was 69.3%. On the held-out test set, the accuracy was 68.7%. Here's the confusion matrix on the test set:
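A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. This is an illustration under stated assumptions rather than the code used here: the helper `tune_random_forest` is hypothetical, the data is a synthetic stand-in, the grid is deliberately tiny so the example runs quickly, and I am assuming the "maximum number of features" above maps to scikit-learn's `max_features` parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def tune_random_forest(x_train, y_train, param_grid):
    """Grid-search a random forest with 5-fold cross-validation."""
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, scoring='accuracy')
    search.fit(x_train, y_train)
    return search


# Synthetic five-class stand-in data (an assumption, not the seizure data).
x, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           n_classes=5, random_state=0)
# Illustrative grid; only the upper values 25 and 30 come from the text above.
grid = {'max_depth': [5, 25], 'max_features': [10, 30], 'n_estimators': [25]}
search = tune_random_forest(x, y, grid)
print(search.best_params_, search.best_score_)
```

When every selected value sits on the upper edge of the grid, as reported above, extending the grid upward is a common next step before concluding that the model family has been exhausted.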
@@ -242,6 +254,6 @@ To run it run:

Your python will give you a URL - visit it with your browser.

## Conclusion
## Next steps

I think I could improve the models with further tuning (for example, a finer grid search) and by trying other models (e.g. different neural network architectures, logistic regression, fourier transforms). However I think the improvements would be marginal and that I have gotten pretty close to the limits of what the data can provide.
The learning curves show that there is still substantial improvement to be made in these classifiers. Further attention to feature engineering and dimensionality reduction would help. I would also want to try Fourier decomposition as the dimensionality-reduction step. If it were an option from the researchers, gathering more data would also help. It would be interesting to explore generating synthetic data as well to see if that could improve the model.
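As a rough sketch of what a Fourier-based dimensionality-reduction step could look like, the example below keeps the magnitudes of the lowest-frequency FFT coefficients and feeds them to an SVM. Everything here is an assumption for illustration: the function `fft_magnitudes`, the choice of 30 coefficients, the signal length of 128, and the synthetic sine-wave data are not taken from this repository.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def fft_magnitudes(x, n_keep=30):
    """Return the magnitudes of the first n_keep real-FFT coefficients per row."""
    spectrum = np.fft.rfft(np.asarray(x, dtype=float), axis=1)
    return np.abs(spectrum[:, :n_keep])


# Hypothetical pipeline: FFT magnitudes in place of PCA, followed by an SVM.
fft_svm = Pipeline([
    ('fft', FunctionTransformer(fft_magnitudes, validate=False)),
    ('svm', SVC(gamma='scale')),
])

# Synthetic stand-in: two classes of noisy sine waves with different frequencies.
rng = np.random.RandomState(0)
t = np.linspace(0.0, 1.0, 128, endpoint=False)
labels = rng.randint(0, 2, size=200)
freqs = np.where(labels == 1, 12.0, 5.0)
signals = np.sin(2 * np.pi * freqs[:, None] * t) + 0.5 * rng.randn(200, 128)

fft_svm.fit(signals[:150], labels[:150])
accuracy = fft_svm.score(signals[150:], labels[150:])
print(accuracy)
```

Because the transform is a pipeline step, it could be swapped in for PCA and tuned through the same grid search, with the number of kept coefficients as a hyperparameter.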
81 changes: 80 additions & 1 deletion epiclass/epiclass.py
@@ -27,6 +27,7 @@
import os

import joblib
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
@@ -39,6 +40,8 @@
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

LEGEND_COORDS = (1.2, 0.8)
TOTAL_PCA_COMPONENTS = 60
@@ -73,7 +76,6 @@ def run(actions):
    """
    epidata = pd.read_csv(os.path.join('data', 'data.csv'))
    set_matplotlib_params()
    # explore_data(epidata)
    features = epidata.drop(['y', 'Unnamed: 0'], axis=1) / 2047.0
    target = epidata['y']
    x_train, x_test, y_train, y_test = train_test_split(features, target,
@@ -90,6 +92,7 @@ def run(actions):
    if 'nn' in actions:
        run_nn(x_train, y_train, x_test, y_test)


def run_explore(epidata, x_train, y_train):
    """Generate several plots to help understand the data
@@ -130,6 +133,7 @@ def run_pca_svm2(x_train, y_train, x_test, y_test):
x_test, (y_test == 1).astype(int),
'two_class_pca_svm')


def run_pca_svm5(x_train, y_train, x_test, y_test):
    """Train multiclass classifier with PCA and SVM
@@ -208,6 +212,7 @@ def run_nn(x_train, y_train, x_test, y_test):
    create_and_test_neural_net(x_train, x_test, y_train, y_test)
    visualize_confusion(os.path.join('outputs', 'confusion_nn'))


def save_data_to_file(features, targets, filename):
    """Save features and targets to a csv file
@@ -404,6 +409,10 @@ def train_and_save_pca_svm(n_components, C, gamma, x_train, y_train,
'confusion_' + filename_root + '.csv'))
    model_filename = os.path.join('models', filename_root + '.z')
    joblib.dump(pipeline, model_filename)
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    plot_learning_curve(pipeline, 'Learning curve: PCA-SVM', x_train, y_train,
                        os.path.join('outputs', 'lc_' + filename_root + '.png'),
                        cv=cv)


def make_violin_plots(features, targets):
@@ -878,5 +887,75 @@ def train_nn(x_train, y_train):
    return model


def plot_learning_curve(estimator, title, x_train, y_train, filename,
                        ylim=None, cv=None, train_sizes=None):
    """
    Generate a plot of the learning curve.
    Args:
        estimator : object type that implements the "fit" and "predict" methods
            A classifier for which to plot the learning curve.
        title: str
            Title for the chart.
        x_train: array-like, shape (n_samples, n_features)
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.
        y_train: array-like, shape (n_samples) or (n_samples, n_features),
            optional
            Target relative to X for classification
        filename: str
            Path to file to save
        ylim: tuple, shape (ymin, ymax), optional
            Defines minimum and maximum y-values plotted.
        cv: int, cross-validation generator or an iterable, optional
            Determines the cross-validation splitting strategy.
            Possible inputs for cv are:
              - None, to use the default 3-fold cross-validation,
              - integer, to specify the number of folds.
              - :term:`CV splitter`,
              - An iterable yielding (train, test) splits as arrays of indices.
        train_sizes : array-like, shape (n_ticks,) of float or int
            Relative or absolute numbers of training examples that will be used
            to generate the learning curve. If a float, it is
            regarded as a fraction of the maximum size of the training set
            (that is determined by the selected validation method), i.e. it has
            to be within (0, 1]. Otherwise it is interpreted as absolute
            sizes of the training sets. Note that for classification the
            number of samples usually have to be big enough to contain at
            least one sample from each class. (default:
            np.linspace(0.1, 1.0, 5))
    Returns:
        None
    """
    # mostly taken from
    # https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
    if train_sizes is None:
        train_sizes = np.linspace(.1, 1.0, 5)
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, x_train, y_train, cv=cv, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    plt.savefig(filename)


if __name__ == '__main__':
    print('To run methods in this module, use the run_epiclass.py script')
Binary file added outputs/lc_five_class_pca_svm.png
Binary file added outputs/lc_two_class_pca_svm.png
2 changes: 1 addition & 1 deletion setup.py
@@ -11,6 +11,6 @@
description='Visualization and prediction of epileptic seizure data set',
install_requires=['keras', 'pandas', 'joblib', 'matplotlib', 'seaborn',
'scikit-learn', 'flask>=1.0.2', 'flask_restful',
'TensorFlow'],
'TensorFlow', 'numpy'],
scripts=['run_epiclass.py', 'api.py', 'test_deployment.py']
)
1 change: 0 additions & 1 deletion test_deployment.py
@@ -122,6 +122,5 @@ def test_confusion_matrix(self):
        expected_result.columns = confusion.columns
        assert_frame_equal(confusion, expected_result)


if __name__ == '__main__':
    unittest.main()
8 changes: 3 additions & 5 deletions web_gui.py
@@ -120,8 +120,6 @@ def convert_query(self, query):
        model_input = rescaled.values.reshape(1, -1)
        return model_input


if __name__ == '__main__':
# Route the URL to the resource
api.add_resource(PredictSeizure, '/')
app.run(debug=True)
# Route the URL to the resource
api.add_resource(PredictSeizure, '/')
app.run(debug=True)
