
Include explained variance in PCA component plot #1239

Closed · wants to merge 2 commits

Conversation

gregparkes

Explained variance ratio (as percentage) included in brackets for each dimension in component plot, including 3D.

This PR contributes to #476 which suggests enhancements to PCA component plots, among other things.

I've added additional text to the x, y, and z labels of the plot, incorporating the explained_variance_ratio_ attribute of a fitted sklearn PCA model.

This is a very small change and hence doesn't really warrant a full example, as all the changes are clear in the diff.
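Still, for readers outside the diff, here is a minimal sketch of the approach using plain sklearn and matplotlib (load_iris is just an illustrative dataset, not part of the PR):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Fit a 2-component PCA to a small example dataset
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
Xp = pca.transform(X)

# Append each component's explained variance ratio, as a percentage
# in brackets, to the corresponding axis label
ratios = pca.explained_variance_ratio_
fig, ax = plt.subplots()
ax.scatter(Xp[:, 0], Xp[:, 1], c=y)
ax.set_xlabel("Principal Component 1 ({:.1f}%)".format(ratios[0] * 100))
ax.set_ylabel("Principal Component 2 ({:.1f}%)".format(ratios[1] * 100))
plt.show()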

CHECKLIST

  • Is the commit message formatted correctly?
  • Have you noted the new functionality/bugfix in the release notes of the next release?
  • Included a sample plot to visually illustrate your changes?
  • Do all of your functions and methods have docstrings?
  • Have you added/updated unit tests where appropriate?
  • Have you updated the baseline images if necessary?
  • Have you run the unit tests using pytest?
  • Is your code style correct (are you using PEP8, pyflakes)?
  • Have you documented your new feature/functionality in the docs?

lwgray (Contributor) commented May 17, 2022

@bbengfort how do you feel about these changes?

bbengfort (Member) commented

@gregparkes in a PCA projection, how would you use the explained variance in the legend? E.g. would you trust a projection more if the sum of the explained variance percentage was greater than 85% or differentiate between the different axes based on their explained variance?

Would you mind attaching a figure produced from your changes to help us understand how the legend influences analysis of the visualization?

The primary use of this projection is as a high dimensional data visualization tool; the goal of which is to discern separability between classes or other patterns that might be easy to model. Explained variance is a useful tool for understanding a PCA projection and we have a work-in-progress explained variance visualizer here: #1037 -- if you're interested that tool could definitely use some help getting to the finish line!


codecov bot commented May 21, 2022

Codecov Report

Merging #1239 (f1e43ea) into develop (ad0d133) will increase coverage by 0.00%.
The diff coverage is 75.00%.

@@           Coverage Diff            @@
##           develop    #1239   +/-   ##
========================================
  Coverage    90.58%   90.58%           
========================================
  Files           92       92           
  Lines         5213     5214    +1     
========================================
+ Hits          4722     4723    +1     
  Misses         491      491           
Impacted Files                 Coverage           Δ
yellowbrick/features/pca.py    92.24% <75.00%>    (+0.06%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ad0d133...f1e43ea.

gregparkes (Author) commented

@bbengfort Thanks for your response!

Q: in a PCA projection, how would you use the explained variance in the legend?
A: I would not use this information in the legend, as displaying floating-point numbers there looks ugly in my opinion. Having it in the component plot is nice because, alongside the 2D visualisation, it conveys how well the PCA model linearly compresses the input space into the subspace, whilst not being too confusing for those uninitiated in PCA.

Q+: E.g. would you trust a projection more if the sum of the explained variance percentage was greater than 85% or differentiate between the different axes based on their explained variance?
A: I would avoid arbitrary thresholds such as 90% or 85% explained variance ratio; instead, if the first few principal components capture most of the variance (i.e. the cumulative explained-variance curve rises steeply), then PCA has done a good job. If the first 2 PCs have a very low percentage, it may indicate that the user should try a different technique, as the visualisation isn't representative of an optimal subspace. Differentiation between the axes will depend entirely on the input data and likely requires expert knowledge to interpret directional information with respect to each eigenvector and eigenvalue. The eigenvalues/vectors are sorted anyway, so PC1 always explains more variance than PC2.
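To make the cumulative-variance point concrete, here is a small sketch (load_digits is just an illustrative dataset):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Inspect the cumulative explained variance curve rather than
# relying on a fixed 85%/90% cutoff
X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("First 2 PCs capture {:.1f}% of the variance".format(cumulative[1] * 100))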

Q: Would you mind attaching a figure produced from your changes to help us understand how the legend influences analysis of the visualization?
A: Sure, I'll provide a code snippet and plot below.

from yellowbrick.features import PCA
from yellowbrick.datasets import load_credit

# Load the credit dataset and name the target classes
X, y = load_credit()
classes = ['account in default', 'current with bills']

# Fit and project the data onto the first two principal components,
# then draw the component plot (axis labels now include explained variance)
visualizer = PCA(scale=True, classes=classes)
visualizer.fit_transform(X, y)
visualizer.show()

[Figure: 2D PCA component plot of the credit dataset; explained variance percentages shown in brackets in the axis labels]

Q: The primary use of this projection is as a high dimensional data visualization tool; the goal of which is to discern separability between classes or other patterns that might be easy to model. Explained variance is a useful tool for understanding a PCA projection and we have a work-in-progress explained variance visualizer here: #1037 -- if you're interested that tool could definitely use some help getting to the finish line!
A: Sure, if I have time. As I mentioned above, choosing an optimal n_components/threshold for explained variance is tricky; there's some interesting work by Minka on automatic selection using probabilistic techniques, but I'm not sure how that fits within the visualizer's scope.
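For reference, scikit-learn already exposes Minka's probabilistic selection through PCA's n_components='mle' option; a quick sketch (load_digits is just an illustrative dataset):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Minka's MLE chooses the number of components automatically
# (requires n_samples >= n_features and the full SVD solver)
X, _ = load_digits(return_X_y=True)
pca = PCA(n_components="mle", svd_solver="full").fit(X)
print("MLE selected {} components".format(pca.n_components_))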

This pull request was closed.