TypeError: sum() got an unexpected keyword argument 'skipna' #29481

sbitzer · 2019-11-08T12:13:18Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.arange(10), columns=np.arange(5), dtype=float)

df = 
    0   1   2   3   4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN

df.groupby(pd.Series(['a', 'a', 'b', 'b', 'b']), axis=1).agg('sum', skipna=True)

Problem description

The above call to agg gives

KeyError: 'a'

This is because here:

pandas/pandas/core/groupby/groupby.py

Line 1376 in 67ee16a

result[col] = self._try_cast(result[col], self.obj[col])

we are trying to access a new column name ('a') in the original DataFrame.

It only occurs when the _cython_agg_general path cannot be used, e.g., when the keyword argument skipna is given to agg. Without the skipna argument, the expected output below is produced.

Expected Output

df = 
     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0
3  0.0  0.0
4  0.0  0.0
5  0.0  0.0
6  0.0  0.0
7  0.0  0.0
8  0.0  0.0
9  0.0  0.0

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 7
machine          : AMD64
processor        : Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en
LOCALE           : None.None

pandas           : 0.25.1
numpy            : 1.16.4
pytz             : 2018.9
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 40.8.0
Cython           : 0.29.7
pytest           : 4.3.1
hypothesis       : None
sphinx           : 2.1.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10
IPython          : 7.8.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.1
numexpr          : 2.6.9
odfpy            : None
openpyxl         : 2.6.1
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : 1.3.1
tables           : 3.5.2
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None


WillAyd · 2019-11-10T00:48:04Z

Somewhat interesting but this gives a different error on master:

>>> df.groupby(pd.Series(['a', 'a', 'b', 'b', 'b']), axis=1).agg('sum', skipna=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pandas/pandas/core/groupby/generic.py", line 880, in aggregate
    result, how = self._aggregate(func, _level=_level, *args, **kwargs)
  File "/Users/williamayd/clones/pandas/pandas/core/base.py", line 330, in _aggregate
    return self._try_aggregate_string_function(arg, *args, **kwargs), None
  File "/Users/williamayd/clones/pandas/pandas/core/base.py", line 281, in _try_aggregate_string_function
    return f(*args, **kwargs)
  File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1356, in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
TypeError: _cython_agg_general() got an unexpected keyword argument 'skipna'

@jbrockmendel

WillAyd · 2019-11-10T00:50:58Z

Though this might be correct now; sum doesn't accept keyword arguments when supplied a DataFrame or as an argument to DataFrame.apply (note requirements for agg in docs)

jbrockmendel · 2019-11-19T00:06:28Z

I guess you could reasonably expect it to work like df.sum, which uses nansum rules?

ghost · 2020-05-17T18:28:33Z

Hi everyone! This issue hasn't had any interaction for a long time. Is it still relevant?

sireesha-m · 2020-05-18T06:56:23Z

Hi all,
I am running into the same error with a fresh installation of pandas 1.0.3. After a groupby, I am aggregating with sum() and skipna=False, and it throws this error.

cnt = df.groupby(groupby_cols).sum(skipna=False)[prop_cols]

_cython_agg_general() got an unexpected keyword argument 'skipna'

It was working perfectly fine until I installed the libraries in a new virtual env. My requirement is that I have a column in the DataFrame that contains only NaNs, and I don't want them to be ignored after the groupby. I want NaNs to be propagated as NaNs in the result object.

Is this a known issue that was introduced recently? If so, can you please tell me if there is a fix that I can install?

noelslice · 2020-05-27T16:12:12Z

Same thing here with 1.0.3, but I think the skipna argument has been removed from the underlying groupby median/sum etc., and missing values are now always excluded:

self = <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f9a1f720438>, kwargs = {'skipna': True}

    @Substitution(name="groupby")
    @Appender(_common_see_also)
    def median(self, **kwargs):
        """
        Compute median of groups, excluding missing values.
    
        For multiple groupings, the result index will be a MultiIndex
    
        Returns
        -------
        Series or DataFrame
            Median of values within each group.
        """
        return self._cython_agg_general(
            "median",
            alt=lambda x, axis: Series(x).median(axis=axis, **kwargs),
>           **kwargs,
        )
E       TypeError: _cython_agg_general() got an unexpected keyword argument 'skipna'

But what if skipna was only included in the kwargs for Series(x).median(axis=axis, **kwargs) and not _cython_agg_general?

ghuname · 2020-06-07T09:25:44Z

I can confirm that I got the same error when I tried to group a DataFrame by several columns (one of which contains NaN values) and then find the maximum of the series "Lp".

df.groupby([columns_but_one_of_them_contains_nans]).Lp.max(skipna=False)

returned

TypeError: _cython_agg_general() got an unexpected keyword argument 'skipna'

pandas 1.0.3

the-moose-machine · 2020-06-08T04:26:35Z

I confirm the same. All NaN values are picked up as 0. This is useless when manipulating data for academic research. For instance,

>>> a = pd.DataFrame([np.NaN, np.NaN])
>>> a.sum()
0    0.0

This results in 0

Meanwhile, when doing the same with skipna=False:

>>> a.sum(skipna=False)
0   NaN

This results in NaN which is the desired output when calculating means. However when attempting the same within the groupby function:

>>> a[2] = ['a','a']
>>> a.groupby(2).agg({0:sum})
     0
2
a  0.0

the sum always returns 0 and there is no option to stop it from skipping NaN values.

These 0 values skew the means and standard deviations, resulting in wrong figures. When working with a huge amount of data, we realised that the results of our study did not make sense. On further investigation, I discovered this bug within pandas. I fear that several others may have unknowingly reported inaccurate figures when manipulating data with pandas data frames. So this bug is very much relevant.

lucashusted · 2020-07-01T21:08:07Z

Definitely still an issue
@the-moose-machine you can do this as a temporary workaround:

df = df.groupby([groupby_variables]).apply(lambda x: x.sum(skipna=False))

This will return null values whenever it encounters missing values in the thing it is summing. However, I have found that this method is far slower than a comparable groupby.sum().
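A minimal sketch of that workaround on a small frame. The column names `group` and `value` and the data are made up for illustration; selecting the value column before calling `apply` is an assumption made here so that the grouping key is not passed into the lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "a" contains a missing value, group "b" does not.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, np.nan, 2.0, 3.0],
})

# Workaround: run Series.sum(skipna=False) once per group via apply.
strict_sum = df.groupby("group")["value"].apply(lambda s: s.sum(skipna=False))

# Built-in grouped sum for comparison: missing values are skipped.
default_sum = df.groupby("group")["value"].sum()

print(strict_sum)
print(default_sum)
```

Here `strict_sum` is NaN for group "a" and 5.0 for group "b", while `default_sum` reports 1.0 for group "a". The lambda runs in Python once per group, which is consistent with the slowdown mentioned above.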

The problem might actually be due to the casting back of np.sum, though this is above my paygrade.

All I know is that

np.sum([np.nan,12,1,3,1])

returns nan, but

pd.DataFrame([np.nan,12,1,3,1]).apply(np.sum)

returns 17.
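One likely explanation, offered as a sketch and not a statement about pandas internals: when `np.sum` receives an object that is not a plain ndarray, it hands the work to that object's own `sum` method, and `Series.sum` skips NaN by default. The snippet below compares the two paths on the same values.

```python
import numpy as np
import pandas as pd

values = [np.nan, 12, 1, 3, 1]

# Plain ndarray: NumPy does the reduction itself and NaN propagates.
on_array = np.sum(np.array(values))

# pandas Series: the call ends up in Series.sum, which skips NaN by default.
on_series = np.sum(pd.Series(values))

# Asking pandas explicitly not to skip NaN restores the NumPy-like result.
on_series_strict = pd.Series(values).sum(skipna=False)

print(on_array, on_series, on_series_strict)
```

So the difference comes from the default `skipna=True` on the pandas side, not from any casting of the result.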

njdepsky · 2020-07-10T19:49:23Z

This is a frustrating shortcoming of the groupby.sum() function. But since the mean is just the sum of values divided by the number of values, one alternative is to just multiply the groupby.mean() result with the groupby.count().

df.groupby('group')['values'].mean()*df.groupby('group')['values'].count()

mean() returns NaN when all values in a group are NaN, count() returns 0 in that case, and 0*np.nan is NaN. Their product therefore gives a groupby.sum result that has the correct sums but keeps NaN where all values in a group are NaN.

Not sure how much slower this is than a simple groupby.sum(), however...
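A small sketch of the mean-times-count trick on made-up data, next to `sum(min_count=1)`, which is an existing parameter of the grouped `sum` and appears to give the same result in this case. Note that both only return NaN for groups that are entirely missing; a group with some valid values still gets the sum of those values, which is not the same as `skipna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is partly missing, "b" is all missing, "c" is complete.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "values": [1.0, np.nan, np.nan, np.nan, 2.0, 3.0],
})

grouped = df.groupby("group")["values"]

# mean() is NaN for an all-NaN group and count() is 0, so the product is NaN.
via_mean_count = grouped.mean() * grouped.count()

# min_count=1 requires at least one valid value, otherwise the sum is NaN.
via_min_count = grouped.sum(min_count=1)

print(via_mean_count)
print(via_min_count)
```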

simonjayhawkins · 2020-08-11T10:33:47Z

On master, the code sample in the OP gives:

>>> pd.__version__
'1.2.0.dev0+78.g838070883'
>>>
>>> df = pd.DataFrame(index=np.arange(10), columns=np.arange(5), dtype=float)
>>>
>>> df.groupby(pd.Series(["a", "a", "b", "b", "b"]), axis=1).agg("sum", skipna=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\groupby\generic.py", line 943, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "C:\Users\simon\pandas\pandas\core\base.py", line 307, in _aggregate
    return self._try_aggregate_string_function(arg, *args, **kwargs), None
  File "C:\Users\simon\pandas\pandas\core\base.py", line 263, in _try_aggregate_string_function
    return f(*args, **kwargs)
TypeError: sum() got an unexpected keyword argument 'skipna'
>>>

I will update the title to make the issue more discoverable.

thomas6g · 2020-11-06T15:48:02Z

I noticed the same issue with ...groupby.median(skipna=True). I checked several versions. It works with pandas 0.25.3 and fails since pandas 1.0.0.

I wonder if that was intended because the API code and doc changed from 0.25.3 to 1.1.0:

pandas 0.25.3

    @Substitution(name="groupby")
    @Appender(_common_see_also)
    def median(self, **kwargs):
        """
        Compute median of groups, excluding missing values.
        For multiple groupings, the result index will be a MultiIndex
        Returns
        -------
        Series or DataFrame
            Median of values within each group.
        """

pandas 1.1.0

    @Substitution(name="groupby")
    @Appender(_common_see_also)
    def median(self, numeric_only=True):
        """
        Compute median of groups, excluding missing values.
        For multiple groupings, the result index will be a MultiIndex
        Parameters
        ----------
        numeric_only : bool, default True
            Include only float, int, boolean columns. If None, will attempt to use
            everything, then use only numeric data.
        Returns
        -------
        Series or DataFrame
            Median of values within each group.
        """

jorisvandenbossche · 2020-11-20T13:00:32Z

The underlying issue here is that the skipna keyword is not yet implemented for groupby reductions like groupby(..).sum().

It might be that this keyword was previously ignored and only recently started to raise, but it never actually worked (or was never documented).

The improvement to add skipna to the grouped reductions is covered in #15675, so I am going to close this as a duplicate of #15675.
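For versions where grouped reductions do not accept skipna, one way to emulate `skipna=False` for a grouped sum without a per-group Python lambda is to compute the usual sum and then blank out every group that contains a missing value. This is only a sketch: `grouped_sum_keep_nan` is a hypothetical helper, not a pandas function, and the column names and data are made up.

```python
import numpy as np
import pandas as pd


def grouped_sum_keep_nan(df, key, col):
    """Hypothetical helper: grouped sum that is NaN for any group containing NaN."""
    # Regular grouped sum; missing values are skipped here.
    sums = df.groupby(key)[col].sum()
    # Boolean flag per group: True if the group has at least one missing value.
    has_nan = df[col].isna().groupby(df[key]).any()
    # Keep the sum only for groups without missing values.
    return sums.where(~has_nan)


df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, np.nan, 2.0, 3.0],
})

result = grouped_sum_keep_nan(df, "group", "value")
print(result)
```

Both intermediate results are indexed by the group key, so `where` aligns them by label.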

jorisvandenbossche · 2020-11-20T13:01:29Z

Duplicate of #15675

WillAyd added the Groupby label Nov 10, 2019

simonjayhawkins mentioned this issue Aug 11, 2020

QST:TypeError: sum() got an unexpected keyword argument 'skipna' #35616

Closed

simonjayhawkins changed the title ~~KeyError during aggregation of grouped columns~~ TypeError: sum() got an unexpected keyword argument 'skipna' Aug 11, 2020

simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version labels Aug 11, 2020

jorisvandenbossche closed this as completed Nov 20, 2020

jorisvandenbossche marked this as a duplicate of #15675 Nov 20, 2020

jorisvandenbossche added this to the No action milestone Nov 20, 2020
