Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple uniqueness checks for sorting-related functions #3312

Merged
merged 31 commits into from
Jun 19, 2023
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
376b569
Basic support for checkunique kwarg in sorting-related functions
alonsoC1s Apr 7, 2023
b9fc81d
Added tests for the checkunique kwarg
alonsoC1s Apr 8, 2023
c35b338
Fixing Tests + Changing Error to ArgumentError
alonsoC1s Apr 8, 2023
9de0421
Accounting for the case with order(...) clauses
alonsoC1s Apr 9, 2023
01761ee
Improved tests for selectors with checkunique
alonsoC1s Apr 11, 2023
add0ad6
Fixed broken uniqueness testing
alonsoC1s Apr 11, 2023
e310687
Merge branch 'JuliaData:main' into simple_uniqueness_sorting
alonsoC1s Apr 12, 2023
7de6e70
Added missing tests for sort
alonsoC1s Apr 12, 2023
262db54
Fixed Docstrings + Replace == with ===
alonsoC1s Apr 13, 2023
d433cf1
Apply suggestions from code review
alonsoC1s Apr 19, 2023
f7d2ad9
Fixed sorting with order kwarg
alonsoC1s Apr 21, 2023
bdff445
Fixed complex sorting order detection
alonsoC1s Apr 21, 2023
d68dda9
Improved tests for complex order clauses
alonsoC1s Apr 22, 2023
651eab6
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Apr 22, 2023
f49023d
Fixed logic error in is_complex + added tests
alonsoC1s Apr 24, 2023
73a31bb
Apply suggestions from code review
alonsoC1s Apr 28, 2023
5ef1e4d
More robust complexity checks for Order.Orderings + tests
alonsoC1s May 18, 2023
e5bd016
Update src/abstractdataframe/sort.jl
alonsoC1s May 18, 2023
bac221b
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s May 18, 2023
b672135
Added error message for unsupported Order types
alonsoC1s May 18, 2023
8746370
Added complexity checks and tests for other subtypes of Order
alonsoC1s May 21, 2023
1576c07
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 1, 2023
2a71db8
Finalizing checks for multiple Ordering subtypes + tests
alonsoC1s Jun 6, 2023
ce79d07
Update src/abstractdataframe/sort.jl
bkamins Jun 12, 2023
536926a
Update src/abstractdataframe/sort.jl
alonsoC1s Jun 13, 2023
cf59d15
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 13, 2023
d333474
Stylistic changes
alonsoC1s Jun 13, 2023
ac25bb5
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 16, 2023
e002b7b
Update src/abstractdataframe/sort.jl
bkamins Jun 16, 2023
56c228e
Removed explanation from sort docstring
alonsoC1s Jun 16, 2023
929cacd
Apply suggestions from code review
bkamins Jun 18, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,10 @@
treated as if they were wrapped in `Cols` and does not throw an error
when a vector of duplicate indices is passed when doing column selection
([#3302](https://github.com/JuliaData/DataFrames.jl/pull/3302))
* Added the kwarg `checkunique` to sorting related functions (`issorted`,
`sort`, `sort!` and `sortperm`) that throws an error when duplicate elements
make multiple sort orders valid
([#2159](https://github.com/JuliaData/DataFrames.jl/issues/2159))
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved

# DataFrames.jl v1.5 Release Notes

Expand Down
69 changes: 59 additions & 10 deletions src/abstractdataframe/sort.jl
Original file line number Diff line number Diff line change
Expand Up @@ -357,7 +357,8 @@ column in the corresponding position in `cols`.
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
checkunique::Bool=false)

Test whether data frame `df` sorted by column(s) `cols`. Checking against
multiple columns is done lexicographically.
Expand Down Expand Up @@ -397,14 +398,26 @@ function Base.issorted(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
to_scalar(x::AbstractVector) = only(x)
to_scalar(x::Any) = x

# exclude AbstractVector as in that case cols can contain order(...) clauses
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
if checkunique
newcols = Int[]

for col in cols
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
push!(newcols, index(df)[(_getcol(col))])
end
if !allunique(df, newcols)
throw(ArgumentError("Non-unique elements found. Multiple orders " *
"are valid"))
end
end
if cols isa ColumnIndex
return issorted(df[!, cols], lt=to_scalar(lt), by=to_scalar(by),
rev=to_scalar(rev), order=to_scalar(order))
Expand All @@ -427,7 +440,8 @@ Base.issorted(::AbstractDataFrame, ::Base.Order.Ordering) =
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
view::Bool=false)
view::Bool=false,
checkunique::Bool=false)

Return a data frame containing the rows in `df` sorted by column(s) `cols`.
Sorting on multiple columns is done lexicographically.
Expand Down Expand Up @@ -506,8 +520,9 @@ julia> sort(df, [:x, order(:y, rev=true)])
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
view::Bool=false)
rowidxs = sortperm(df, cols, alg=alg, lt=lt, by=by, rev=rev, order=order)
view::Bool=false,
checkunique::Bool=false)
rowidxs = sortperm(df, cols, alg=alg, lt=lt, by=by, rev=rev, order=order, checkunique=checkunique)
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
return view ? Base.view(df, rowidxs, :) : df[rowidxs, :]
end

Expand All @@ -517,7 +532,8 @@ end
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)

Return a permutation vector of row indices of data frame `df` that puts them in
sorted order according to column(s) `cols`.
Expand Down Expand Up @@ -579,13 +595,25 @@ function Base.sortperm(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
# exclude AbstractVector as in that case cols can contain order(...) clauses
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
ord = ordering(df, cols, lt, by, rev, order)
_alg = Sort.defalg(df, ord; alg=alg, cols=cols)
if checkunique
newcols = Int[]

for col in cols
push!(newcols, index(df)[(_getcol(col))])
end
if !allunique(df, newcols)
throw(ArgumentError("Non-unique elements found. Multiple orders " *
"are valid"))
end
end
return _sortperm(df, _alg, ord)
end

Expand All @@ -601,7 +629,8 @@ _sortperm(df::AbstractDataFrame, a::Algorithm, o::Ordering) =
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)

Sort data frame `df` by column(s) `cols` by permuting its rows in-place.
Sorting on multiple columns is done lexicographicallly.
Expand Down Expand Up @@ -682,16 +711,36 @@ function Base.sort!(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)

# exclude AbstractVector as in that case cols can contain order(...) clauses
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
ord = ordering(df, cols, lt, by, rev, order)
_alg = Sort.defalg(df, ord; alg=alg, cols=cols)
if checkunique
newcols = Int[]

for col in cols
push!(newcols, index(df)[(_getcol(col))])
end
if !allunique(df, newcols)
throw(ArgumentError("Non-unique elements found. Multiple orders " *
"are valid"))
end
end
return sort!(df, _alg, ord)
end

Base.sort!(df::AbstractDataFrame, a::Base.Sort.Algorithm, o::Base.Sort.Ordering) =
function Base.sort!(df::AbstractDataFrame, a::Base.Sort.Algorithm,
o::Base.Sort.Ordering, checkunique::Bool=false)
if checkunique
if !allunique(df) # Necessary to check all cols AFAIU
throw(ArgumentError("Non-unique elements found. Multiple orders " *
"are valid"))
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
end
end
permute!(df, _sortperm(df, a, o))
end
3 changes: 3 additions & 0 deletions test/sort.jl
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@ using DataFrames, Random, Test, CategoricalArrays
@test issorted(DataFrame())
@test sortperm(d) == sortperm(dv1)
@test sortperm(d[:, [:dv3, :dv1]]) == sortperm(dv3)
@test_throws ArgumentError sortperm(d, :cv1, checkunique=true)
@test_throws ArgumentError sortperm(d, [:cv1, :dv1], checkunique=true)
@test sort(d, :dv1)[!, :dv3] == sort(d, "dv1")[!, "dv3"] == sortperm(dv1)
@test sort(d, :dv2)[!, :dv3] == sortperm(dv1)
@test sort(d, :cv1)[!, :dv3] == sortperm(dv1)
Expand All @@ -30,6 +32,7 @@ using DataFrames, Random, Test, CategoricalArrays
@test issorted(sort(df, rev=true), rev=true)
@test issorted(sort(df, [:chrom, :pos])[:, [:chrom, :pos]])
@test issorted(sort(df, ["chrom", "pos"])[:, ["chrom", "pos"]])
@test_throws ArgumentError issorted(sort(df), :rank, checkunique=true)

ds = sort(df, [order(:rank, rev=true), :chrom, :pos])
@test issorted(ds, [order(:rank, rev=true), :chrom, :pos])
Expand Down