API / COW: ensure every new Series/DataFrame also has new (shallow copy) index #53699

jorisvandenbossche · 2023-06-16T16:14:55Z

The issue at #53529 was about the index still sharing mutable state (because of being the same object) in case of getting a series out of a DataFrame (even with CoW turned on):

pd.options.mode.copy_on_write = True
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})
s = df["a"]

>>> df.index is s.index
True

and thus if you would modify for example the index name of the Series, also the DataFrame gets updated (this PR ensures the above return value is False).

I think under the CoW rules, it makes sense to also ensure mutation of such index attributes don't propagate, similarly to mutating values, by ensuring we use a shallow copy of the Index whenever a new DataFrame/Series object is being returned from some operation or method.

Now, this goes quite a bit further than just the typical indexing operation above. To start, methods that return a shallow copy under CoW should also do this, as a start the copy() method itself:

df2 = df.copy(deep=False)
# this currently raises, fixed with this PR (and same for .columns)
assert df2.index is not df.index

But even for new objects that actually don't share data even with CoW, we do share the index / columns:

# method that return new data
df2 = df.diff()
# currently returns True, should be False?
df2.index is df.index

# operations that return new data
df2 = df + 1
# currently returns True, should be False?
df2.index is df.index

I think that in the CoW spirit that every new DataFrame/Series object should be independent, none of those cases should ever share the index and always use shallow copies (so essentially df1.index is df2.index can only be true if df1 is df2, i.e. if we have identical objects).

closes Copy-on-Write (CoW): index labels are still shared mutable state #53529, closes API: with CoW, should every new Series/DataFrame object also have its own new Index objects? #53721
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

…py) index

jorisvandenbossche · 2023-06-16T16:18:51Z

Note about the status of the PR: I fixed it for the indexing operations and tested this, and for shallow copy(). I already went through managers.py code and added a comment everywhere it might need an additional shallow copy, but I only want to fix those while also adding a test for it to ensure every case is covered (will do that next).

It does maybe raise the question where this is best to handle? Currently I handle this in many places where it is needed. In theory, if the view is very cheap, we could also "just" always take a view of the passed axes in the BlockManager(..) constructor, and then it would need much fewer code changes (and you have less chance to forget this in some new code). But that would of course do this too many times (I should check the performance overhead of Index.view())

EDIT: The shallow copy(deep=False) case was broken off and is already merged as #53722

github-actions · 2023-07-20T00:05:35Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

API / COW: ensure every new Series/DataFrame also has new (shallow co…

1a00586

…py) index

jorisvandenbossche added API Design Copy / view semantics labels Jun 16, 2023

jorisvandenbossche added this to the 2.1 milestone Jun 16, 2023

jorisvandenbossche mentioned this pull request Jun 16, 2023

Copy-on-Write (CoW): index labels are still shared mutable state #53529

Open

3 tasks

fix existing tests checking for index identity

dff37fe

jorisvandenbossche mentioned this pull request Jun 19, 2023

API / CoW: shallow copy of DataFrame/Series (.copy(deep=False)) also returns shallow copy of the index/columns #53722

Merged

4 tasks

github-actions bot added the Stale label Jul 20, 2023

mroeschke removed this from the 2.1 milestone Aug 23, 2023

jorisvandenbossche added this to the 2.2 milestone Nov 17, 2023

Merge remote-tracking branch 'upstream/main' into cow-index-shallow-copy

589b2a3

jorisvandenbossche modified the milestones: 2.2, 3.0 Jan 10, 2024

simonjayhawkins modified the milestones: 3.0, 2.3 Nov 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API / COW: ensure every new Series/DataFrame also has new (shallow copy) index #53699

API / COW: ensure every new Series/DataFrame also has new (shallow copy) index #53699

jorisvandenbossche commented Jun 16, 2023 •

edited

Loading

jorisvandenbossche commented Jun 16, 2023 •

edited

Loading

github-actions bot commented Jul 20, 2023

API / COW: ensure every new Series/DataFrame also has new (shallow copy) index #53699

Are you sure you want to change the base?

API / COW: ensure every new Series/DataFrame also has new (shallow copy) index #53699

Conversation

jorisvandenbossche commented Jun 16, 2023 • edited Loading

jorisvandenbossche commented Jun 16, 2023 • edited Loading

github-actions bot commented Jul 20, 2023

jorisvandenbossche commented Jun 16, 2023 •

edited

Loading

jorisvandenbossche commented Jun 16, 2023 •

edited

Loading