
Include explained variance in PCA component plot #1239

Closed · wants to merge 2 commits

Conversation

gregparkes

Explained variance ratio (as percentage) included in brackets for each dimension in component plot, including 3D.

This PR contributes to #476 which suggests enhancements to PCA component plots, among other things.

I've added additional text to the x, y, and z labels of the plot, incorporating the explained_variance_ratio_ attribute of a fitted sklearn PCA model.

This is a very small change and hence doesn't really warrant a full example, as all the changes are clear in the diff.
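Still, for readers outside the diff, here is a minimal sketch of the approach using plain sklearn and matplotlib (load_iris is just an illustrative dataset, not part of the PR):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Fit a 2-component PCA to a small example dataset
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
Xp = pca.transform(X)

# Append each component's explained variance ratio, as a percentage
# in brackets, to the corresponding axis label
ratios = pca.explained_variance_ratio_
fig, ax = plt.subplots()
ax.scatter(Xp[:, 0], Xp[:, 1], c=y)
ax.set_xlabel("Principal Component 1 ({:.1f}%)".format(ratios[0] * 100))
ax.set_ylabel("Principal Component 2 ({:.1f}%)".format(ratios[1] * 100))
plt.show()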

CHECKLIST

  • Is the commit message formatted correctly?
  • Have you noted the new functionality/bugfix in the release notes of the next release?
  • Included a sample plot to visually illustrate your changes?
  • Do all of your functions and methods have docstrings?
  • Have you added/updated unit tests where appropriate?
  • Have you updated the baseline images if necessary?
  • Have you run the unit tests using pytest?
  • Is your code style correct (are you using PEP8, pyflakes)?
  • Have you documented your new feature/functionality in the docs?

lwgray (Contributor) commented May 17, 2022

@bbengfort how do you feel about these changes?

bbengfort (Member) commented

@gregparkes in a PCA projection, how would you use the explained variance in the legend? E.g. would you trust a projection more if the sum of the explained variance percentage was greater than 85% or differentiate between the different axes based on their explained variance?

Would you mind attaching a figure produced from your changes to help us understand how the legend influences analysis of the visualization?

The primary use of this projection is as a high dimensional data visualization tool; the goal of which is to discern separability between classes or other patterns that might be easy to model. Explained variance is a useful tool for understanding a PCA projection and we have a work-in-progress explained variance visualizer here: #1037 -- if you're interested that tool could definitely use some help getting to the finish line!


codecov bot commented May 21, 2022

Codecov Report

Merging #1239 (f1e43ea) into develop (ad0d133) will increase coverage by 0.00%.
The diff coverage is 75.00%.

@@           Coverage Diff            @@
##           develop    #1239   +/-   ##
========================================
  Coverage    90.58%   90.58%           
========================================
  Files           92       92           
  Lines         5213     5214    +1     
========================================
+ Hits          4722     4723    +1     
  Misses         491      491           
Impacted Files                 Coverage           Δ
yellowbrick/features/pca.py    92.24% <75.00%>    (+0.06%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ad0d133...f1e43ea.

gregparkes (Author) commented

@bbengfort Thanks for your response!

Q: in a PCA projection, how would you use the explained variance in the legend?
A: I would not use this information in the legend, as displaying floating-point numbers there looks ugly in my opinion. Having it in the component plot is nice because, alongside the 2D visualisation, it conveys how well the PCA model linearly compresses the input space into the subspace, whilst not being too confusing for those uninitiated in PCA.

Q+: E.g. would you trust a projection more if the sum of the explained variance percentage was greater than 85% or differentiate between the different axes based on their explained variance?
A: I would avoid arbitrary thresholds such as 90% or 85% explained variance ratio; instead, if the first few principal components capture most of the variance (i.e. the cumulative explained-variance curve rises steeply), then PCA has done a good job. If the first 2 PCs have a very low percentage, it may indicate that the user should try a different technique, as the visualisation isn't representative of an optimal subspace. Differentiation between the axes will depend entirely on the input data and likely requires expert knowledge to interpret directional information with respect to each eigenvector and eigenvalue. The eigenvalues/vectors are sorted anyway, so PC1 always explains more variance than PC2.
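To make the cumulative-variance point concrete, here is a small sketch (load_digits is just an illustrative dataset):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Inspect the cumulative explained variance curve rather than
# relying on a fixed 85%/90% cutoff
X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("First 2 PCs capture {:.1f}% of the variance".format(cumulative[1] * 100))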

Q: Would you mind attaching a figure produced from your changes to help us understand how the legend influences analysis of the visualization?
A: Sure, I'll provide a code snippet and plot below.

from yellowbrick.features import PCA
from yellowbrick.datasets import load_credit

# Load the credit dataset and name the target classes
X, y = load_credit()
classes = ['account in default', 'current with bills']

# Fit and project the data onto the first two principal components,
# then draw the component plot (axis labels now include explained variance)
visualizer = PCA(scale=True, classes=classes)
visualizer.fit_transform(X, y)
visualizer.show()

[Figure: 2D PCA component plot of the credit dataset; explained variance percentages shown in brackets in the axis labels]

Q: The primary use of this projection is as a high dimensional data visualization tool; the goal of which is to discern separability between classes or other patterns that might be easy to model. Explained variance is a useful tool for understanding a PCA projection and we have a work-in-progress explained variance visualizer here: #1037 -- if you're interested that tool could definitely use some help getting to the finish line!
A: Sure, if I have time. As I mentioned above, choosing an optimal n_components/threshold for explained variance is tricky; there's some interesting work by Minka on automatic selection using probabilistic techniques, but I'm not sure how that fits within the visualizer's scope.
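For reference, scikit-learn already exposes Minka's probabilistic selection through PCA's n_components='mle' option; a quick sketch (load_digits is just an illustrative dataset):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Minka's MLE chooses the number of components automatically
# (requires n_samples >= n_features and the full SVD solver)
X, _ = load_digits(return_X_y=True)
pca = PCA(n_components="mle", svd_solver="full").fit(X)
print("MLE selected {} components".format(pca.n_components_))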

This pull request was closed.