Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API / COW: ensure every new Series/DataFrame also has new (shallow copy) index #53699

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jun 16, 2023

The issue at #53529 was about the index still sharing mutable state (because of being the same object) in case of getting a series out of a DataFrame (even with CoW turned on):

pd.options.mode.copy_on_write = True
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})
s = df["a"]

>>> df.index is s.index
True

and thus if you would modify for example the index name of the Series, also the DataFrame gets updated (this PR ensures the above return value is False).

I think under the CoW rules, it makes sense to also ensure mutation of such index attributes don't propagate, similarly to mutating values, by ensuring we use a shallow copy of the Index whenever a new DataFrame/Series object is being returned from some operation or method.

Now, this goes quite a bit further than just the typical indexing operation above. To start, methods that return a shallow copy under CoW should also do this, as a start the copy() method itself:

df2 = df.copy(deep=False)
# this currently raises, fixed with this PR (and same for .columns)
assert df2.index is not df.index

But even for new objects that actually don't share data even with CoW, we do share the index / columns:

# method that return new data
df2 = df.diff()
# currently returns True, should be False?
df2.index is df.index

# operations that return new data
df2 = df + 1
# currently returns True, should be False?
df2.index is df.index

I think that in the CoW spirit that every new DataFrame/Series object should be independent, none of those cases should ever share the index and always use shallow copies (so essentially df1.index is df2.index can only be true if df1 is df2, i.e. if we have identical objects).

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 16, 2023

Note about the status of the PR: I fixed it for the indexing operations and tested this, and for shallow copy(). I already went through managers.py code and added a comment everywhere it might need an additional shallow copy, but I only want to fix those while also adding a test for it to ensure every case is covered (will do that next).

It does maybe raise the question where this is best to handle? Currently I handle this in many places where it is needed. In theory, if the view is very cheap, we could also "just" always take a view of the passed axes in the BlockManager(..) constructor, and then it would need much fewer code changes (and you have less chance to forget this in some new code). But that would of course do this too many times (I should check the performance overhead of Index.view())

EDIT: The shallow copy(deep=False) case was broken off and is already merged as #53722

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 20, 2023
@mroeschke mroeschke removed this from the 2.1 milestone Aug 23, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.2 milestone Nov 17, 2023
@jorisvandenbossche jorisvandenbossche modified the milestones: 2.2, 3.0 Jan 10, 2024
@simonjayhawkins simonjayhawkins modified the milestones: 3.0, 2.3 Nov 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants