-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EHN] Add jointly
option for min_max_scale
#1112
[EHN] Add jointly
option for min_max_scale
#1112
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #1112 +/- ##
==========================================
+ Coverage 97.95% 97.98% +0.03%
==========================================
Files 77 77
Lines 3175 3180 +5
==========================================
+ Hits 3110 3116 +6
+ Misses 65 64 -1 |
janitor/functions/min_max_scale.py
Outdated
@@ -23,23 +23,18 @@ def min_max_scale( | |||
df: pd.DataFrame, | |||
feature_range: tuple[int | float, int | float] = (0, 1), | |||
column_name: str | int | list[str | int] | pd.Index = None, | |||
entire_data: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think entire_data
is a better name.
Need to be added to the changelog file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, though I also need a bit of time and space to think of a better name. I wonder if others on the dev team have ideas? @pyjanitor-devs/core-devs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure of a good name to use. maybe keep
like in Pandas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean 'keep' its value could be 'column' or 'all'?
But this variable is better as a boolean type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, horrible parameter name. maybe scale_all
or scale_all_columns
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, @Zeroto521, on second thought, I think we need a bit better definition of the semantics. min_max_scale
currently operates with the assumption of operating on one column, so the use of column_name
makes sense here. Below is my attempt at reasoning through the multiple ways we could use min_max_scale
.
- Scale one column.
- Scale multiple columns independently.
- Scale multiple columns, but jointly (so they are all scaled to the same min and max)
entire_data
is a special case of scaling multiple columns jointly.
Since we're working on this function, it's a good chance to change the API to be flexible yet also sensible. What if the API was, instead the following?
def min_max_scale(df, feature_range, column_names: Iterable[Hashable] | Callable, jointly: bool):
If jointly
is True, then the column_names
provided are jointly scaled; otherwise, they are not.
I wanted to point out a new behaviour that we might be able to support across the rest of the API -- by making column_names
accept a Callable that has the signature:
def column_names_callable(df) -> Iterable[Hashable]:
we can enable min_max_scale
on all columns by doing:
df.min_max_scale(feature_range=(0, 1), column_names = lambda df: df.columns, jointly=True)
This is pretty concise without resorting to needing to maintain string mappings for special-case behaviour.
I'm glad we didn't rush to merge this PR, giving us the time and space to think clearly about the semantics of the API.
@samukweku and @Zeroto521 what do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to point out a new behaviour that we might be able to support across the rest of the API -- by making column_names accept a Callable that has the signature:
def column_names_callable(df) -> Iterable[Hashable]:
I'm sorry I still can't get the point why column_names
could accept a callback type argument.
The usage of column_names
is to get dataframe's columns like df[column_names]
.
So if column_names
has some column names which is not in df.columns
.
There will raise an error. Whatever column_names
is Iterable[Hashable]
or callback type could return Iterable[Hashable]
.
we can enable min_max_scale on all columns by doing:
df.min_max_scale(feature_range=(0, 1), column_names = lambda df: df.columns, jointly=True)
To scale all columns, I thought we could use the default argument as None
for column_names
without no more inputting.
df.min_max_scale(feature_range=(0, 1), column_names=None, jointly=True)
Are there more examples to show the importance of the callback type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments, @Zeroto521!
Regarding why we might want to allow column_names
to be a Callable, I had the idea that it helps support being explicit over implicit, which is in the Zen of Python. Setting column_names=None
makes selecting all columns implicit, whereas setting column_names=lambda df: df.columns
makes selecting all columns explicit. In addition, it allows the selection of arbitrary subsets of column names programmatically, without needing to hard-code those names.
On further thought, I can see how column_names=None
actually follows the pattern established in other places in the library, so I think, for now, we can:
- Use
column_names=None
to imply selection of all columns, and - Talk more about
column_names: Callable
in the issue tracker, deferring the implementation till later.
What do you all think about jointly=True
as the keyword for triggering whether to independently scale each column or to jointly scale all columns specified in column_names
? @Zeroto521 if you're in agreement with the keyword argument, then I think, let's get that specified in this PR, then we can close out the PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree with using jointly
.
About column_names
whether could receive a callable type or not.
I can understand now. column_names=lambda df: df.columns
is an implicit style and also a trick.
Select columns | Using callable type | Using Iterable type | df.columns |
---|---|---|---|
Select the first three columns | lambda df: df.columns[:3] |
['a', 'b', 'c'] |
pd.Index(list('abcde')) |
Select str type columns | lambda df: [i for i in df.columns if instance(i, str)] |
['a', 'b', 'c'] |
pd.Index(['a', 'b', 'c', 1]) |
As you said, we can put it aside at present.
Once the parameter column_names
of min_max_scale
could receive, other functions also need to do the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More discussions move to #1115
Well, GitHub Action style checking is failing but pre-commit.ci app is okay. |
@Zeroto521 looks like we're good. The action didn't fail b/c of your code; it failed b/c of a system error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Zeroto521 I'm going to pre-approve the PR. Regarding whether to use jointly
as the kwarg name, please take a decision and we'll run with it.
entire_data
option for min_max_scale
jointly
option for min_max_scale
…/Zeroto521/pyjanitor into min-max-scale-entire-data-option
|
||
Changed in version 0.24.0: Deleted "old_min", "old_max", "new_min", and "new_max" | ||
options. | ||
Changed in version 0.24.0: Added "feature_range", and "jointly" options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this renders well, then we can merge this PR.
PR Description
Please describe the changes proposed in the pull request:
min_max_scale
support to transform each column values or entire valuesThis PR resolves #1067.
PR Checklist
Please ensure that you have done the following:
<your_username>
:dev
, but rather from<your_username>
:<feature-branch_name>
.AUTHORS.md
.CHANGELOG.md
under the latest version header (i.e. the one that is "on deck") describing the contribution.Automatic checks
There will be automatic checks run on the PR. These include:
Relevant Reviewers
Please tag maintainers to review.