[SPARK-51314][DOCS][PS] Add proper note for distributed-sequence about indeterministic case #50086

Open
itholic wants to merge 2 commits into master
Conversation

itholic (Contributor) commented Feb 26, 2025

What changes were proposed in this pull request?

This PR proposes to add a proper note for the `distributed-sequence` default index about its non-deterministic behavior.

Why are the changes needed?

The `distributed-sequence` default index can assign indexes to rows non-deterministically, which may confuse users, so it is better to document this behavior.

For example,

# Reading the same data twice
>>> import pyspark.pandas as ps
>>> df1 = ps.read_csv("big_data.csv")
>>> df2 = ps.read_csv("big_data.csv")

# The row-index mapping for `df1` and `df2` could be different when using `distributed-sequence`.
>>> df1.head(10)
     record_id start_date   end_date
0  RECORD_1001 2024-01-01 2024-01-10
1  RECORD_1002 2024-01-15 2024-01-20
2  RECORD_1003 2024-02-01 2024-02-10
3  RECORD_1004 2024-02-15 2024-02-20
4  RECORD_1005 2024-03-01 2024-03-10
5  RECORD_1006 2024-03-15 2024-03-20
6  RECORD_1007 2024-04-01 2024-04-10
7  RECORD_1008 2024-04-15 2024-04-20
8  RECORD_1009 2024-05-01 2024-05-10
9  RECORD_1010 2024-05-15 2024-05-20

>>> df2.head(10)
     record_id start_date   end_date
0  RECORD_2001 2024-06-01 2024-06-10
1  RECORD_2002 2024-06-15 2024-06-20
2  RECORD_2003 2024-07-01 2024-07-10
3  RECORD_2004 2024-07-15 2024-07-20
4  RECORD_2005 2024-08-01 2024-08-10
5  RECORD_2006 2024-08-15 2024-08-20
6  RECORD_2007 2024-09-01 2024-09-10
7  RECORD_2008 2024-09-15 2024-09-20
8  RECORD_2009 2024-10-01 2024-10-10
9  RECORD_2010 2024-10-15 2024-10-20

# Using `index_col` prevents the non-deterministic case
>>> df1 = ps.read_csv("big_data.csv", index_col="record_id")
>>> df2 = ps.read_csv("big_data.csv", index_col="record_id")

# Now both DataFrames are guaranteed to map rows to the same index
>>> df1.head(10)
            start_date   end_date
record_id
RECORD_1001 2024-01-01 2024-01-10
RECORD_1002 2024-01-15 2024-01-20
RECORD_1003 2024-02-01 2024-02-10
RECORD_1004 2024-02-15 2024-02-20
RECORD_1005 2024-03-01 2024-03-10
RECORD_1006 2024-03-15 2024-03-20
RECORD_1007 2024-04-01 2024-04-10
RECORD_1008 2024-04-15 2024-04-20
RECORD_1009 2024-05-01 2024-05-10
RECORD_1010 2024-05-15 2024-05-20

>>> df2.head(10)
            start_date   end_date
record_id
RECORD_1001 2024-01-01 2024-01-10
RECORD_1002 2024-01-15 2024-01-20
RECORD_1003 2024-02-01 2024-02-10
RECORD_1004 2024-02-15 2024-02-20
RECORD_1005 2024-03-01 2024-03-10
RECORD_1006 2024-03-15 2024-03-20
RECORD_1007 2024-04-01 2024-04-10
RECORD_1008 2024-04-15 2024-04-20
RECORD_1009 2024-05-01 2024-05-10
RECORD_1010 2024-05-15 2024-05-20
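
For reference, the default index type that produces this behavior is configurable through the pandas-on-Spark options API. A minimal sketch, assuming `pyspark.pandas` is available and `big_data.csv` is the same illustrative file as above:

# `distributed-sequence` is the default index type; it can be inspected
# and changed via the options API.
>>> import pyspark.pandas as ps
>>> ps.get_option("compute.default_index_type")
'distributed-sequence'

# Passing `index_col` when reading, as shown above, avoids relying on the
# default index entirely and keeps the row-to-index mapping stable.
>>> df = ps.read_csv("big_data.csv", index_col="record_id")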

Does this PR introduce any user-facing change?

No API changes, but the note will be added to user-facing documentation.

[Screenshot of the rendered documentation note, 2025-02-26]

How was this patch tested?

Manually tested, and also the existing CI should pass.

Was this patch authored or co-authored using generative AI tooling?

No.

@the-sakthi (Member) commented:

LGTM
