[BUG] Modify concatenate_columns ignore_empty output #1166
base: dev
Codecov Report
@@ Coverage Diff @@
## dev #1166 +/- ##
==========================================
- Coverage 98.04% 98.01% -0.03%
==========================================
Files 76 76
Lines 3524 3525 +1
==========================================
Hits 3455 3455
- Misses 69 70 +1
Thanks for fixing this problem.
I think that with ignore_empty=True we could fill the NaN values first:
df[new_column_name] = (
    df[column_names].fillna("").astype(str).agg(sep.join, axis=1)
    if ignore_empty
    else df[column_names].astype(str).agg(sep.join, axis=1)
)
It's also better to do the fill operation first and the type conversion second.
After converting to str, it's hard to say how many different NaN strings there are to replace.
df.fillna("").astype(str)
df.astype(str).replace(["NaT", "nan", "<NA>"], "")
@Fu-Jie thank you for your contribution here! I'm noticing that the change will likely be a breaking change, i.e. it modifies the old expected behaviour of the function. Can we ensure that the suggested changes are toggleable via function arguments?
df[column_names]
    .astype(str)
    .replace(["NaT", "nan", "<NA>"], "")
    .agg(sep.join, axis=1)
We might want an argument here to toggle between old and new behaviours. Would you be open to doing so?
I'm sorry, the translation software I used may not have described it clearly.
My idea for ignoring null values comes from Excel's TEXTJOIN implementation, which I feel is more in line with actual use:
https://support.microsoft.com/en-us/office/textjoin-function-357b449a-ec91-49d0-80c3-0e8fc845691c
@ericmjl
My understanding is that if null values are not ignored, the following behaviour is more reasonable: a null still occupies a slot but is rendered as an empty string.
Pseudo-code
sep = ','
if ignore_empty = False then 1,2,pd.NA -> 1,2,
if ignore_empty = True then 1,2,pd.NA -> 1,2
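The pseudo-code above could be made runnable roughly as follows. This is only a sketch of the proposed semantics: `join_row` is a hypothetical helper that works on a single row of values and is not part of pyjanitor, and the separator default of `","` simply mirrors the pseudo-code.

```python
import pandas as pd


def join_row(values, sep=",", ignore_empty=True):
    """Hypothetical helper (not part of pyjanitor): join one row of values."""
    if ignore_empty:
        # Drop nulls entirely, so no separator is emitted for them.
        parts = [str(v) for v in values if not pd.isna(v)]
    else:
        # Keep a slot for each null, rendered as an empty string.
        parts = ["" if pd.isna(v) else str(v) for v in values]
    return sep.join(parts)


print(join_row([1, 2, pd.NA], ignore_empty=False))  # 1,2,
print(join_row([1, 2, pd.NA], ignore_empty=True))   # 1,2
```

Because the null check uses `pd.isna` on the actual values rather than on their string form, a literal string such as "nan" would be kept as ordinary text.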
Co-authored-by: 40% <[email protected]>
@Zeroto521 I tested None, pd.NA, pd.NaT, and np.nan; there should be nothing else.
import numpy as np
import pandas as pd


def fillna_astype(df, sep="-"):
    # Fill real null values first, then convert to str.
    return df.fillna("").astype(str).agg(sep.join, axis=1)


def astype_fillna(df, sep="-"):
    # Convert to str first, then replace the string forms of null values.
    return df.astype(str).replace(["NaT", "nan", "<NA>"], "").agg(sep.join, axis=1)


# Normal case.
# Both of them pass, but `astype_fillna` also needs to replace `None` and `'NaN'`.
pd.DataFrame(
    {
        "a": ["string", 1, 1.5, np.nan],
        "b": ["another_string", 0, pd.NA, None],
    }
).pipe(fillna_astype)
# 0    string-another_string
# 1                      1-0
# 2                     1.5-
# 3                        -
# dtype: object

pd.DataFrame(
    {
        "a": ["string", 1, 1.5, np.nan],
        "b": ["another_string", 0, pd.NA, None],
    }
).pipe(astype_fillna)
# 0    string-another_string
# 1                      1-0
# 2                     1.5-
# 3                    -None
# dtype: object

# This one is a special case where `astype_fillna` fails.
# We only want to fill real NA values, not strings that look like them.
pd.DataFrame(
    {
        "a": ["string", np.nan, pd.NA, None],
        "b": ["another_string", "nan", "<NA>", "None"],
    }
).pipe(fillna_astype)
# 0    string-another_string
# 1                     -nan
# 2                    -<NA>
# 3                    -None
# dtype: object

pd.DataFrame(
    {
        "a": ["string", np.nan, pd.NA, None],
        "b": ["another_string", "nan", "<NA>", "None"],
    }
).pipe(astype_fillna)
# 0    string-another_string
# 1                        -  # wrong
# 2                        -  # wrong
# 3                None-None  # wrong
# dtype: object
normal case
It could be a pandas (1.3.5) version issue: in my environment, both None and np.nan become "nan" after astype(str). The list of NA strings to replace should be extended:
def astype_fillna(df, sep="-"):
    return df.astype(str).replace(["NaT", "nan", "<NA>", "None"], "").agg(sep.join, axis=1)
about the fillna-then-astype issue with float or int columns
import pandas as pd
def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)
pd.DataFrame(
    {
        "b": [1, 0, pd.NA, 3],
    },
    dtype=pd.Float32Dtype(),
).pipe(fillna_astype)
# TypeError: <U1 cannot be converted to a FloatingDtype
special case
In my opinion, what should be dealt with is the null values of the column, not text that happens to represent an empty meaning.
I think that isn't a good example for
This situation may appear with an int column returned by read_sql (pandas default parsing) or when read_excel is called with dtype='int'.
use astype("string")
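A minimal sketch of the `astype("string")` suggestion, assuming the goal is to avoid filling a nullable float column with `""` directly. `string_fillna` is a hypothetical name used only for illustration; it converts to the pandas string dtype first, so the fill value has a compatible type.

```python
import pandas as pd


def string_fillna(df, sep="-"):
    # Hypothetical variant of fillna_astype: convert to the nullable string
    # dtype first, so that filling with "" does not clash with a numeric dtype.
    return df.astype("string").fillna("").agg(sep.join, axis=1)


df = pd.DataFrame({"b": [1, 0, pd.NA, 3]}, dtype=pd.Float32Dtype())
result = string_fillna(df)
print(result.tolist())  # the pd.NA row becomes an empty string
```

Real nulls stay null through the conversion and are only turned into `""` by the fill step, so text values such as "nan" in other columns would be left alone.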
Any thoughts on the progress of this PR @thatlittleboy @ericmjl @Zeroto521 ? @Fu-Jie kindly rebase so that this PR is updated to the latest
I think we're close! just need to clear up some inconsistencies against the docstrings
@@ -28,7 +28,7 @@ def test_concatenate_columns_null_values(missingdata_df):
         new_column_name="index",
         ignore_empty=True,
     )
-    expected_values = ["1.0-1", "2.0-2", "nan-3"] * 3
+    expected_values = ["1.0-1", "2.0-2", "3"] * 3
Please also update the docstrings for the test.
@@ -28,7 +28,7 @@ def test_concatenate_columns_null_values(missingdata_df):
         new_column_name="index",
         ignore_empty=True,
     )
-    expected_values = ["1.0-1", "2.0-2", "nan-3"] * 3
+    expected_values = ["1.0-1", "2.0-2", "3"] * 3
     assert expected_values == df["index"].tolist()
I also think it might be worth writing a test merging a custom dataframe with a float column (NaN), a datetime column (NaT) and a string column (None/NA?).
And assert the expected output accordingly.
Then, mention this PR or the attached issue in the test docstring as well, please.
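One way the requested test might look is sketched below. To keep the example self-contained, it exercises `join_non_null`, a hypothetical stand-in for the ignore-empty behaviour discussed in this PR, rather than pyjanitor's actual `concatenate_columns`; the column names and values are likewise made up for illustration.

```python
import numpy as np
import pandas as pd


def join_non_null(df, column_names, sep="-"):
    """Hypothetical stand-in: join each row's non-null values as strings."""
    rows = zip(*(df[name] for name in column_names))
    joined = [sep.join(str(v) for v in row if pd.notna(v)) for row in rows]
    return pd.Series(joined, index=df.index)


def test_join_non_null_mixed_dtypes():
    """Nulls in float (NaN), datetime (NaT) and string (None) columns
    are skipped when joining. See PR #1166 and issue #1164."""
    df = pd.DataFrame(
        {
            "flt": [1.5, np.nan, 2.5],
            "dt": pd.to_datetime(["2022-01-01", "2022-01-02", None]),
            "txt": ["a", None, "c"],
        }
    )
    result = join_non_null(df, ["flt", "dt", "txt"])
    expected = [
        "1.5-2022-01-01 00:00:00-a",  # no nulls in this row
        "2022-01-02 00:00:00",        # NaN and None are skipped
        "2.5-c",                      # NaT is skipped
    ]
    assert result.tolist() == expected


test_join_non_null_mixed_dtypes()
```

In the real test suite the helper would be replaced by a call to the function under test with ignore_empty=True, and the expected strings adjusted to whatever formatting that function produces.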
PR Description
Please describe the changes proposed in the pull request:
This PR resolves #1164.
PR Checklist
Please ensure that you have done the following:
<your_username>:dev, but rather from <your_username>:<feature-branch_name>.
AUTHORS.md.
CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
Automatic checks
There will be automatic checks run on the PR. These include:
Relevant Reviewers
Please tag maintainers to review.