[SPARK-51314][DOCS][PS] Add proper note for distributed-sequence about indeterministic case #50086

Open
itholic wants to merge 2 commits into master
Conversation

itholic (Contributor) commented Feb 26, 2025

What changes were proposed in this pull request?

This PR proposes to add a proper note for the `distributed-sequence` default index about its non-deterministic behavior.

Why are the changes needed?

The `distributed-sequence` default index can assign indexes to rows non-deterministically, which may confuse users, so it is better to document this behavior.

For example,

# Reading the same data twice
>>> import pyspark.pandas as ps
>>> df1 = ps.read_csv("big_data.csv")
>>> df2 = ps.read_csv("big_data.csv")

# The row-index mapping for `df1` and `df2` could be different when using `distributed-sequence`.
>>> df1.head(10)
     record_id start_date   end_date
0  RECORD_1001 2024-01-01 2024-01-10
1  RECORD_1002 2024-01-15 2024-01-20
2  RECORD_1003 2024-02-01 2024-02-10
3  RECORD_1004 2024-02-15 2024-02-20
4  RECORD_1005 2024-03-01 2024-03-10
5  RECORD_1006 2024-03-15 2024-03-20
6  RECORD_1007 2024-04-01 2024-04-10
7  RECORD_1008 2024-04-15 2024-04-20
8  RECORD_1009 2024-05-01 2024-05-10
9  RECORD_1010 2024-05-15 2024-05-20

>>> df2.head(10)
     record_id start_date   end_date
0  RECORD_2001 2024-06-01 2024-06-10
1  RECORD_2002 2024-06-15 2024-06-20
2  RECORD_2003 2024-07-01 2024-07-10
3  RECORD_2004 2024-07-15 2024-07-20
4  RECORD_2005 2024-08-01 2024-08-10
5  RECORD_2006 2024-08-15 2024-08-20
6  RECORD_2007 2024-09-01 2024-09-10
7  RECORD_2008 2024-09-15 2024-09-20
8  RECORD_2009 2024-10-01 2024-10-10
9  RECORD_2010 2024-10-15 2024-10-20

# Using `index_col` prevents the non-deterministic case
>>> df1 = ps.read_csv("big_data.csv", index_col="record_id")
>>> df2 = ps.read_csv("big_data.csv", index_col="record_id")

# Now both DataFrames are guaranteed to map rows to the same index
>>> df1.head(10)
            start_date   end_date
record_id
RECORD_1001 2024-01-01 2024-01-10
RECORD_1002 2024-01-15 2024-01-20
RECORD_1003 2024-02-01 2024-02-10
RECORD_1004 2024-02-15 2024-02-20
RECORD_1005 2024-03-01 2024-03-10
RECORD_1006 2024-03-15 2024-03-20
RECORD_1007 2024-04-01 2024-04-10
RECORD_1008 2024-04-15 2024-04-20
RECORD_1009 2024-05-01 2024-05-10
RECORD_1010 2024-05-15 2024-05-20

>>> df2.head(10)
            start_date   end_date
record_id
RECORD_1001 2024-01-01 2024-01-10
RECORD_1002 2024-01-15 2024-01-20
RECORD_1003 2024-02-01 2024-02-10
RECORD_1004 2024-02-15 2024-02-20
RECORD_1005 2024-03-01 2024-03-10
RECORD_1006 2024-03-15 2024-03-20
RECORD_1007 2024-04-01 2024-04-10
RECORD_1008 2024-04-15 2024-04-20
RECORD_1009 2024-05-01 2024-05-10
RECORD_1010 2024-05-15 2024-05-20
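
For reference, the default index type that produces this behavior is configurable through the pandas-on-Spark options API. A minimal sketch, assuming `pyspark.pandas` is available and `big_data.csv` is the same illustrative file as above:

# `distributed-sequence` is the default index type; it can be inspected
# and changed via the options API.
>>> import pyspark.pandas as ps
>>> ps.get_option("compute.default_index_type")
'distributed-sequence'

# Passing `index_col` when reading, as shown above, avoids relying on the
# default index entirely and keeps the row-to-index mapping stable.
>>> df = ps.read_csv("big_data.csv", index_col="record_id")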

Does this PR introduce any user-facing change?

No API changes, but the note will be added to user-facing documentation.

[Screenshot of the rendered documentation note, 2025-02-26]

How was this patch tested?

Manually tested, and also the existing CI should pass.

Was this patch authored or co-authored using generative AI tooling?

No.

@the-sakthi (Member) commented:

LGTM
