[BUG] Chapter 2 OneHotEncoder Shape Mismatch Issue + Solution #115

Open
MadinaKamolova opened this issue Dec 13, 2023 · 2 comments

@MadinaKamolova

Thanks for helping us improve this project!

Before you create this issue
Please make sure you are using the latest updated code and libraries: see https://github.com/ageron/handson-ml3/blob/main/INSTALL.md#update-this-project-and-its-libraries

Also please make sure to read the FAQ (https://github.com/ageron/handson-ml3#faq) and search for existing issues (both open and closed), as your question may already have been answered: https://github.com/ageron/handson-ml3/issues

Describe the bug
Edition 3, page 133/1457 (Kindle e-book): the data transformed by OneHotEncoder is not converted with .toarray() before being passed to pd.DataFrame, which results in the error ValueError: Shape of passed values is (2, 1), indices imply (2, 5). With the current code in the book, pandas sees the passed values as having shape (2, 1) instead of (2, 5).

To Reproduce
Please copy the code that fails here, using code blocks like this:

cat_encoder.handle_unknown = "ignore"
cat_encoder.transform(df_test_unknown)

df_output = pd.DataFrame(cat_encoder.transform(df_test_unknown),
                         columns=cat_encoder.get_feature_names_out(),
                         index=df_test_unknown.index)

Solution

cat_encoder.handle_unknown = "ignore"
test = cat_encoder.transform(df_test_unknown)
df_output = pd.DataFrame(test.toarray(),
                         columns=cat_encoder.get_feature_names_out(),
                         index=df_test_unknown.index)

Versions (please complete the following information):

  • OS: Windows 11
  • Python: [e.g. 3.11]

Additional context
Maybe add this to the FAQ, or wherever else you think readers will notice it (buying the book again for just one fix is impractical).

@dialogbox

I think a better solution is to use pandas.DataFrame.sparse.from_spmatrix explicitly:

df_output = pd.DataFrame.sparse.from_spmatrix(
    cat_encoder.transform(df_test_unknown),
    columns=cat_encoder.get_feature_names_out(),
    index=df_test_unknown.index)

@dialogbox

Oh, I think I found where the confusion is coming from.

In the book, the author suggests using a dense matrix only as an alternative:

Alternatively, you can set sparse=False when creating the OneHotEncoder, in which case the transform() method will return a regular (dense) NumPy array directly.

Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (p. 131). O'Reilly Media. Kindle Edition.

There is no code block for this in the book, so I used the sparse encoder, which caused this problem.
But in the notebook code in this repo, cat_encoder is created with the sparse_output=False option:

cat_encoder = OneHotEncoder(sparse_output=False)

So the code in this repo is correct.
