TypeError: sum() got an unexpected keyword argument 'skipna' #29481

Closed
sbitzer opened this issue Nov 8, 2019 · 14 comments
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version

Comments

@sbitzer

sbitzer commented Nov 8, 2019

Code Sample, a copy-pastable example if possible

df = pd.DataFrame(index=np.arange(10), columns=np.arange(5), dtype=float)

df = 
    0   1   2   3   4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN

df.groupby(pd.Series(['a', 'a', 'b', 'b', 'b']), axis=1).agg('sum', skipna=True)

Problem description

The above call to agg gives

KeyError: 'a'

This is because here:

result[col] = self._try_cast(result[col], self.obj[col])

we are trying to access a new column name ('a') in the original DataFrame.

It occurs only when _cython_agg_general cannot be used, e.g., when the keyword argument skipna is passed to agg. Without the skipna argument, the expected output below is produced.

Expected Output

df = 
     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0
3  0.0  0.0
4  0.0  0.0
5  0.0  0.0
6  0.0  0.0
7  0.0  0.0
8  0.0  0.0
9  0.0  0.0
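
For reference, here is a minimal sketch of one way to get the expected output without passing skipna at all. It assumes that transposing the frame is acceptable for the data at hand, and it avoids the axis=1 argument of groupby, which is deprecated in recent pandas releases. The names labels and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Same all-NaN frame as in the code sample above.
df = pd.DataFrame(index=np.arange(10), columns=np.arange(5), dtype=float)
labels = pd.Series(["a", "a", "b", "b", "b"])

# Transpose so the original columns become rows, group those rows by label,
# sum each group (NaN is skipped by default), then transpose back.
result = df.T.groupby(labels).sum().T
print(result)
```

Because the default sum skips NaN, every cell of the all-NaN input comes out as 0.0.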

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 7
machine          : AMD64
processor        : Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en
LOCALE           : None.None

pandas           : 0.25.1
numpy            : 1.16.4
pytz             : 2018.9
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 40.8.0
Cython           : 0.29.7
pytest           : 4.3.1
hypothesis       : None
sphinx           : 2.1.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10
IPython          : 7.8.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.1
numexpr          : 2.6.9
odfpy            : None
openpyxl         : 2.6.1
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : 1.3.1
tables           : 3.5.2
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None

@WillAyd
Member

WillAyd commented Nov 10, 2019

Somewhat interesting, but this gives a different error on master:

>>> df.groupby(pd.Series(['a', 'a', 'b', 'b', 'b']), axis=1).agg('sum', skipna=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pandas/pandas/core/groupby/generic.py", line 880, in aggregate
    result, how = self._aggregate(func, _level=_level, *args, **kwargs)
  File "/Users/williamayd/clones/pandas/pandas/core/base.py", line 330, in _aggregate
    return self._try_aggregate_string_function(arg, *args, **kwargs), None
  File "/Users/williamayd/clones/pandas/pandas/core/base.py", line 281, in _try_aggregate_string_function
    return f(*args, **kwargs)
  File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1356, in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
TypeError: _cython_agg_general() got an unexpected keyword argument 'skipna'

@jbrockmendel

@WillAyd
Member

WillAyd commented Nov 10, 2019

Though this might be correct now: sum doesn't accept keyword arguments when supplied a DataFrame or when passed as an argument to DataFrame.apply (note the requirements for agg in the docs).

@jbrockmendel
Member

I guess you could reasonably expect it to work like df.sum, which uses nansum rules?

@ghost

ghost commented May 17, 2020

Hi everyone! This issue hasn't had any interaction for a long time. Is it still relevant?

@sireesha-m

Hi all,
I am running into the same problem with a fresh installation of pandas 1.0.3. After a groupby, I am aggregating with sum() and skipna=False, and it throws this error:

cnt = df.groupby(groupby_cols).sum(skipna=False)[prop_cols]

_cython_agg_general() got an unexpected keyword argument 'skipna'

It was working perfectly fine until I installed the libraries in a new virtual env. My requirement is that I have a column in the DataFrame that is all NaN, and I don't want those values to be ignored after the groupby. I want NaNs to be propagated as NaNs in the result object.

Is this a known issue that was introduced recently? If so, can you please tell me whether there is a fix I can install?

@noelslice

noelslice commented May 27, 2020

Same thing here with 1.0.3, but I think the skipna argument has been removed from the underlying groupby median/sum etc., and missing values are now always excluded:

self = <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f9a1f720438>, kwargs = {'skipna': True}

    @Substitution(name="groupby")
    @Appender(_common_see_also)
    def median(self, **kwargs):
        """
        Compute median of groups, excluding missing values.
    
        For multiple groupings, the result index will be a MultiIndex
    
        Returns
        -------
        Series or DataFrame
            Median of values within each group.
        """
        return self._cython_agg_general(
            "median",
            alt=lambda x, axis: Series(x).median(axis=axis, **kwargs),
>           **kwargs,
        )
E       TypeError: _cython_agg_general() got an unexpected keyword argument 'skipna'

But what if skipna were only included in the kwargs for Series(x).median(axis=axis, **kwargs) and not passed to _cython_agg_general?
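
To make the question concrete, here is a toy sketch of the two forwarding styles. It is not the pandas implementation: fast_path, median_forwarding and median_fallback_only are hypothetical stand-ins, and fast_path simply hands back the fallback callable so both variants can be compared.

```python
import numpy as np
import pandas as pd


def fast_path(how, alt):
    """Toy stand-in for a fast aggregation path that accepts no extra keywords."""
    return alt


def median_forwarding(**kwargs):
    # Same shape as the traceback above: kwargs go to the fallback AND the fast path.
    alt = lambda x: pd.Series(x).median(**kwargs)
    return fast_path("median", alt=alt, **kwargs)


def median_fallback_only(**kwargs):
    # The proposed variant: kwargs reach only the Series-based fallback.
    alt = lambda x: pd.Series(x).median(**kwargs)
    return fast_path("median", alt=alt)


try:
    median_forwarding(skipna=True)
    error = None
except TypeError as exc:
    error = str(exc)  # complains about the unexpected 'skipna' keyword

fallback = median_fallback_only(skipna=False)
value = fallback([1.0, np.nan, 3.0])  # NaN, because skipna=False reaches Series.median
```

One caveat with the second variant: if the real fast path handles the data without ever calling the fallback, the keyword would be dropped silently instead of raising.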

@ghuname

ghuname commented Jun 7, 2020

I can confirm that I got the same error when I tried to group a DataFrame by several columns (one of which contains NaN values) and then find the maximum of the series "Lp".

df.groupby([columns_but_one_of_them_contains_nans]).Lp.max(skipna=False)

returned

TypeError: _cython_agg_general() got an unexpected keyword argument 'skipna'

pandas 1.0.3

@the-moose-machine

the-moose-machine commented Jun 8, 2020

I confirm the same. All NaN values are picked up as 0. This is useless when manipulating data for academic research. For instance,

>>> a = pd.DataFrame([np.NaN, np.NaN])
>>> a.sum()
0    0.0

This results in 0

Meanwhile, when doing the same with skipna=False:

>>> a.sum(skipna=False)
0   NaN

This results in NaN, which is the desired output when calculating means. However, when attempting the same within the groupby function:

>>> a[2] = ['a','a']
>>> a.groupby(2).agg({0:sum})
     0
2
a  0.0

the sum always returns 0 and there is no option to stop NaN values from being skipped.

These 0 values skew the means and standard deviations, resulting in wrong figures. When working with a huge amount of data we realised that the results of our study did not make sense. On further investigation, I discovered this bug within pandas. I fear that several others may have unknowingly reported inaccurate figures when manipulating data with pandas data frames. So this bug is very much relevant.

@lucashusted

Definitely still an issue.
@the-moose-machine, you can do this as a temporary workaround:

df = df.groupby([groupby_variables]).apply(lambda x: x.sum(skipna=False))

This will return null values whenever it encounters missing values in the data it is summing. However, I have found that this method is far slower than a comparable groupby.sum().
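
A runnable sketch of that workaround, next to a vectorized alternative that may be faster on large frames. The column names key and val are made up for illustration, and the second approach is a suggestion rather than anything built into pandas: it computes the ordinary sum, then blanks out every group that contained at least one NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "val": [1.0, np.nan, 2.0, 3.0],
})

# Direct but slow: Series.sum honours skipna=False inside each group.
slow = df.groupby("key")["val"].apply(lambda s: s.sum(skipna=False))

# Vectorized: flag groups that contain a NaN, then mask their ordinary sums.
has_nan = df["val"].isna().groupby(df["key"]).any()
fast = df.groupby("key")["val"].sum().mask(has_nan)

print(slow)
print(fast)
```

Both give NaN for group a (which contains a missing value) and 5.0 for group b.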

The problem might actually be due to the casting back of np.sum, though this is above my paygrade.

All I know is that

np.sum([np.nan,12,1,3,1])

returns nan, but

pd.DataFrame([np.nan,12,1,3,1]).apply(np.sum)

returns 17.
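
A short sketch of what is likely going on here: when np.sum receives a pandas object rather than a plain array, NumPy hands the work to the object's own sum method, and Series.sum skips NaN by default. The variable names below are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 12, 1, 3, 1])

# np.sum on a Series defers to Series.sum, which uses skipna=True by default.
on_series = np.sum(s)

# np.sum on the underlying ndarray uses NumPy's own reduction, so NaN propagates.
on_array = np.sum(s.to_numpy())

print(on_series, on_array)
```

So the difference comes from which sum implementation ends up running, not from a cast of the result.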

@njdepsky

njdepsky commented Jul 10, 2020

This is a frustrating shortcoming of the groupby.sum() function. But since the mean is just the sum of values divided by the number of values, one alternative is to just multiply the groupby.mean() result with the groupby.count().

df.groupby('group')['values'].mean()*df.groupby('group')['values'].count()

The mean() returns NaN when all values in a group are NaN, count() returns 0 when all values are NaN, and 0*np.nan returns NaN, so their product gives a groupby.sum result that has correct sums but keeps NaN where all values in a group are NaN.

Not sure how much slower this is than a simple groupby.sum(), however...
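
A small sketch of this trick on made-up data, compared with the min_count argument that groupby sum already accepts. Both only produce NaN for groups in which every value is missing, so neither is a full substitute for skipna=False.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "values": [1.0, np.nan, np.nan, np.nan],
})
g = df.groupby("group")["values"]

# mean * count: an all-NaN group gives NaN * 0, which is NaN.
via_product = g.mean() * g.count()

# min_count=1: require at least one non-NaN value, otherwise return NaN.
via_min_count = g.sum(min_count=1)

print(via_product)
print(via_min_count)
```

Group x still sums to 1.0 even though it contains a NaN; only the all-NaN group y comes out as NaN.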

@simonjayhawkins
Member

On master, the code sample in the OP gives:

>>> pd.__version__
'1.2.0.dev0+78.g838070883'
>>>
>>> df = pd.DataFrame(index=np.arange(10), columns=np.arange(5), dtype=float)
>>>
>>> df.groupby(pd.Series(["a", "a", "b", "b", "b"]), axis=1).agg("sum", skipna=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\groupby\generic.py", line 943, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "C:\Users\simon\pandas\pandas\core\base.py", line 307, in _aggregate
    return self._try_aggregate_string_function(arg, *args, **kwargs), None
  File "C:\Users\simon\pandas\pandas\core\base.py", line 263, in _try_aggregate_string_function
    return f(*args, **kwargs)
TypeError: sum() got an unexpected keyword argument 'skipna'
>>>

Will update the title to make the issue more discoverable.

@simonjayhawkins simonjayhawkins changed the title KeyError during aggregation of grouped columns TypeError: sum() got an unexpected keyword argument 'skipna' Aug 11, 2020
@simonjayhawkins simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version labels Aug 11, 2020
@thomas6g

thomas6g commented Nov 6, 2020

I noticed the same issue with ...groupby.median(skipna=True). I checked several versions. It works with pandas 0.25.3 and fails since pandas 1.0.0.

I wonder if that was intended, because the API code and docs changed from 0.25.3 to 1.1.0:

pandas 0.25.3

    @Substitution(name="groupby")
    @Appender(_common_see_also)
    def median(self, **kwargs):
        """
        Compute median of groups, excluding missing values.
        For multiple groupings, the result index will be a MultiIndex
        Returns
        -------
        Series or DataFrame
            Median of values within each group.
        """

pandas 1.1.0

    @Substitution(name="groupby")
    @Appender(_common_see_also)
    def median(self, numeric_only=True):
        """
        Compute median of groups, excluding missing values.
        For multiple groupings, the result index will be a MultiIndex
        Parameters
        ----------
        numeric_only : bool, default True
            Include only float, int, boolean columns. If None, will attempt to use
            everything, then use only numeric data.
        Returns
        -------
        Series or DataFrame
            Median of values within each group.
        """

@jorisvandenbossche
Member

The underlying issue here is that the skipna keyword is not yet implemented for groupby reductions like groupby(..).sum().

It might be that this keyword was previously ignored and only recently started to raise, but it never actually worked (or was never documented).

Adding skipna to the grouped reductions is covered in #15675, so I am going to close this as a duplicate of #15675.

@jorisvandenbossche
Member

Duplicate of #15675

@jorisvandenbossche jorisvandenbossche marked this as a duplicate of #15675 Nov 20, 2020
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 20, 2020