Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort missing data placement #2267

Closed
chris-b1 opened this issue May 22, 2020 · 36 comments
Closed

sort missing data placement #2267

chris-b1 opened this issue May 22, 2020 · 36 comments

Comments

@chris-b1
Copy link

Currently, missing is treated as the largest value in a sort. pandas has a na_position kwarg that lets you specify how missing data should be ordered, by default placing it last, regardless of ascending or descending sort.

Whether that default is correct (I like it, but could be familiarity) I do think having an option on sort/order to control missing placement would be nice.

DataFrames 0.21

julia> df = DataFrame(:a => [1.0, 9.0, 2.0, missing])
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 11.0      │
│ 29.0      │
│ 32.0      │
│ 4missing  │

julia> sort(df, :a)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 11.0      │
│ 22.0      │
│ 39.0      │
│ 4missing  │

julia> sort(df, :a, rev=true)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1missing  │
│ 29.0      │
│ 32.0      │
│ 41.0

pandas

df = pd.DataFrame({'a': [1.0, 9.0, 2.0, np.NaN]})

df
Out[42]: 
     a
0  1.0
1  9.0
2  2.0
3  NaN

df.sort_values('a')
Out[43]: 
     a
0  1.0
2  2.0
1  9.0
3  NaN

df.sort_values('a', ascending=False)
Out[44]: 
     a
1  9.0
2  2.0
0  1.0
3  NaN

df.sort_values('a', ascending=False, na_position='first')
Out[45]: 
     a
3  NaN
1  9.0
2  2.0
0  1.0
@bkamins
Copy link
Member

bkamins commented May 22, 2020

@nalimilan - what do you think? We could add this, but it is against the design in Base.

@chris-b1
Copy link
Author

Got it, yeah not sure how hard this should be special cased. For what it's worth below is an isless comparer that got me to what I needed (sort with rev=true and missing at the end)

isless_missingsmallest(::Missing, ::Any) = true
isless_missingsmallest(::Any, ::Missing) = false
isless_missingsmallest(::Missing, ::Missing) = false
isless_missingsmallest(x, y) = isless(x, y)

julia> sort(df, :a; lt=isless_missingsmallest)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 1missing  │
│ 21.0      │
│ 32.0      │
│ 49.0      │

julia> sort(df, :a; lt=isless_missingsmallest, rev=true)
4×1 DataFrame
│ Row │ a        │
│     │ Float64? │
├─────┼──────────┤
│ 19.0      │
│ 22.0      │
│ 31.0      │
│ 4missing

@nalimilan
Copy link
Member

Do you have an actual use case for sorting missings first? I wonder whether it's worth adding more complexity to that function. Maybe Missings.jl could provide a custom isless function like the one you showed and groupby could allow passing it.

@chris-b1
Copy link
Author

The actual usecase is missings last, because I'm using a descending sort. This would be pretty typical for reporting-type tasks - e.g., "show sales by region, ranked from largest to smallest, with any region missing data at the bottom of the report"

Most of the time could probably work around by filling missings with 0s

@nalimilan
Copy link
Member

Ah yes it's indeed unfortunate that missing values are sorted first when reversing order. I guess adding a keyword argument could be justified then. We could have missinglast::Bool=!rev. Though you wouldn't be able to set it when passing order as they would conflict.

@nivupaiR
Copy link

I think this feature is useful. How about an optional keyword in sort such as skipmissing=true?

@bkamins bkamins added this to the 1.x milestone Nov 25, 2020
@bkamins
Copy link
Member

bkamins commented Nov 25, 2020

It sure can be added as an opt-in, with the current behaviour being the default. We just need to wrork out the details of the design.

@russmack
Copy link

How did this play out?

@bkamins
Copy link
Member

bkamins commented May 15, 2023

As you can see it is marked for 1.x release. We did not get additional requests for this feature during the last three years, so it did not get a high priority.

The current solution for this requirement is:

julia> sort(df, rev=true, lt=(x, y) -> ismissing(x) ? true : ismissing(y) ? false : isless(x, y))
4×1 DataFrame
 Row │ a
     │ Float64?
─────┼───────────
   1 │       9.0
   2 │       2.0
   3 │       1.0
   4 │ missing

Essentially what we would need to add is a special variant of isless which considers missing as the smallest not as the largest.
The question is what should be the name of this function?

@bkamins bkamins modified the milestones: 1.x, 1.6 May 15, 2023
@alonsoC1s
Copy link
Contributor

I think I might be able to implement this. Would we handle this via a keyword argument, or a new variant of isless?

@bkamins
Copy link
Member

bkamins commented Jun 25, 2023

@alonsoC1s - I think adding a function would be better. Also, it probably makes sense to have it in Missings.jl.
The reason is that then it could be used in any sorting/permutation function in the whole ecosystem.

The question is what the name should be. The isless_missingsmallest seems too lengthy.

@nalimilan @LilithHafner - do you have opinions?

@LilithHafner
Copy link
Contributor

It would be nice to have this compose nicely with other orderings (e.g. use lt=⊂ and missingsmallest=true). Perhaps missingsmallest could be a higher order function with

missingsmallest(f) = (x,y) -> ismissing(y) ? false : ismissing(x) ? true : f(x, y)

Then the name would be missingsmallest(isless).

If efficiency is important, we should make the missingsmallest option dispatchable, though. Using an opaque lt function rules out most optimizations. This could look like

struct MissingSmallest{O<:Ordering} <: Ordering
    o::O
end

lt(::MissingSmallest, ::Missing, ::Missing) = false
lt(::MissingSmallest, ::Any, ::Missing) = false
lt(::MissingSmallest, ::Missing, ::Any) = true
lt(ms::MissingSmallest, x::Any, y::Any) = lt(ms.o, x, y)

@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::MissingSmallest, kw)
        # put missing at end
        Base.Sort._sort!(v, a.next, o.o, (kw...; hi=new_hi))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{<:MissingSmallest}, kw)
        # put missing at beginning
        Base.Sort._sort!(v, a.next, ReverseOrdering(o.fwd.o), (kw...; lo=new_lo))
    end
end

If this is implemented as a new Ordering, that would require a keyword argument or users creating their own Orderings.

The users creating their own orderings approach can be implemented in Missings.jl

@bkamins
Copy link
Member

bkamins commented Jun 25, 2023

Thank you for your nice ideas 😄.

If efficiency is important, we should make the missingsmallest option dispatchable, though.

I think efficiency is important.

Why do you think your approach with missingsmallest(f) would not be efficient? Because it introduces extra branching?

An approach with Ordering subtype indeed makes sense as an alternative. My questions would be:

  • why do you think it would be faster?
  • I would be a bit afraid that using Ordering would be hard for users to learn (but maybe I am wrong here; it just seems to me that using the missingsmallest(f) wrapper is simpler to grasp for a typical user)

@LilithHafner
Copy link
Contributor

LilithHafner commented Jun 25, 2023

why do you think it would be faster?

With an opaque lt ordering (e.g. missingsmallest(f)), we can get a plenty fast comparison based sort, and I imagine a non-adaptive comparison based sorting algorithm would perform identically on lt=missingsmallest(isless) and lt=isless but the sneaky tricks to optimize all the special cases don't apply. EDIT: yes they do, see @bkamins's comment.

For example, sorting a Vector{Union{Missing, Int64}} in Forward order typically shifts the missing values to the end so that the remaining sorting is faster (either branch free or simply very well predicted branches, I haven't checked) and in this case we can also use countsort or radixsort to sort the remaining values which are all Int64. With missingsmallest(f) we'd be stuck with vanilla ScratchQuickSort or InsertionSort.

I would be a bit afraid that using Ordering would be hard for users to learn (but maybe I am wrong here; it just seems to me that using the missingsmallest(f) wrapper is simpler to grasp for a typical user)

Yeah. Usability-wise, I think sort!(df, missingssmallest=true) >> sort!(df, lt=missingssmallest(isless) ≈> sort(df, ord=MissingSmallest(Forward)). However, if we don't add missingsmallest to base, then sort!(v::Vector, missingsmallest=true) will not be an option. I think it would be good to get an idea of weather or not folks want to support this option in Base before finalizing the design here.

@bkamins
Copy link
Member

bkamins commented Jun 25, 2023

For example, sorting a Vector{Union{Missing, Int64}} in Forward order typically shifts the missing values to the end so that the remaining sorting is faster

We could add special methods that would detect that missingsmallest(f) is passed that would do the same tricks that we already do but just put missing in another place.

I think it would be good to get an idea of weather or not folks want to support this option in Base before finalizing the design here.

I have pinged #internals on Slack. Let us see if someone has an opinion.

I think that the preference will be not to add this extra kwarg. The issue is that, in general, it could conflict with lt or by passed by user (that could treat missing in some different non-standard way). I think passing lt=missingssmallest(isless) is more composable.

@nalimilan
Copy link
Member

See also previous discussion at JuliaData/Missings.jl#142. There I had proposed lt=missingisless. lt=missingsmallest(isless) sounds a bit too complex for casual users, and "smallest" isn't really the vocabulary we use regarding sorting. Though maybe we could have both something like missingisless for convenience (possibly in Missings.jl) and something lower-level but more flexible and on which algorithms would dispatch.

@LilithHafner
Copy link
Contributor

We could add special methods that would detect that missingsmallest(f) is passed that would do the same tricks that we already do but just put missing in another place.

Yep! Good idea, you're right. There should be no performance loss from using the lt approach.

lt=missingsmallest(isless) seems like the soundest approach internally (up to renaming), but I don't see a problem with an alias. I like that Reverse === ReverseOrdering(ForwardOrdering()).

Bikeshed: we could call the higher order function missingsfirst

@jariji
Copy link
Contributor

jariji commented Jun 25, 2023

Current feeling - missingsfirst seems more intuitive than missingsless under rev = true and equally intuitive under rev = false.

@LilithHafner
Copy link
Contributor

welll... I was thinking that sort!(lt=missingsfirst(isless), rev=true) would put missing values at the end, which is the opposite of intuitive imo. We can't really have sort!(lt=X, rev=true) and sort!(lt=X, rev=false) put missing values in the same place because rev is applied by swapping the arguments to lt, so having the term first in X is probably a bad idea in retrospect. (unless someone can think of a way of implementing this that lets us have sort!(lt=missingsfirst(isless), rev=true) do the obvious thing of putting missing first and sorting everything else in reverse order)

@bkamins
Copy link
Member

bkamins commented Jun 26, 2023

I was thinking that sort!(lt=missingsfirst(isless), rev=true) would put missing values at the end

I agree. I feel we need lt to define the standard order and rev=true to reverse it entirely.
As opposed to SQL I think in Julia we do not treat missing in a special way in general (except for selected packages, but this functionality should be general).


So the conclusion of what is to do would be to have in Missings.jl:

  1. Define missingssmallest(fun)
  2. Define const missingsless = missingssmallest(isless)
  3. Add documentation and make a release
  4. Add special implementations of sorting algorithms if they encounter lt=missingsless (this can be a separate PR from the PR implementing the functionality)

The only thing is what names should we use (and this is usually hard).

Are there better alternatives than missingssmallest and missingless (as commented missingsfirst is unfortunately not good as we are specifying order not location)?

@jariji
Copy link
Contributor

jariji commented Jun 26, 2023

missingsleast would be more suitable than missingssmallest. missingless is fine too, its only downside being that it could be interpreted as meaning "without missings".

@nalimilan
Copy link
Member

Note that my suggestion was missingisless, not missingsless. :-)

Something that worries me if we support both lt=missingisless and lt=missingsmallest(isless) is that the two names are quite similar and people who don't get the subtlety of isless vs "smallest" could confuse them. Maybe we could merge them? With some tricks it should be possible to write lt=missingsmallest as a shorthand for lt=missingsmallest(isless).

Instead of "smallest", we could also say "lowest", as it's more consistent with "lower than"/"greater than".

@jariji
Copy link
Contributor

jariji commented Jun 26, 2023

Instead of "smallest", we could also say "lowest", as it's more consistent with "lower than"/"greater than".

"least" goes with "less".

@alonsoC1s
Copy link
Contributor

I feel like this feature could be well recieved in the wider ecosystem, it might be worth it to propose it upstream (Missings.jl or perhaps even Base itself)

@bkamins
Copy link
Member

bkamins commented Jun 28, 2023

@alonsoC1s - yes. This should be added in Missings.jl. I just keep the discussion here to have all things in one place. I understand you offered to implement it. If this is the case please feel free to open a PR in Missings.jl adding what is agreed. We still have to settle on the names, but these can be easily updated when we have the PR opened (and it would help to move the things forward). Thank you!

@bkamins bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023
@bkamins
Copy link
Member

bkamins commented Jul 15, 2023

@alonsoC1s - are you planning to make a PR fo Missings.jl with this? (if not I can make it to move things forward)

@alonsoC1s
Copy link
Contributor

Sorry, it totally slipped my mind. I am planning on it, it will be done in a few days at most

@alonsoC1s
Copy link
Contributor

I just opened PR JuliaData/Missings.jl#144 at Missings.jl as a draft PR to work out the naming. I followed the roadmap set by @bkamins (i.e implementing the partial order function solution).

My two cents on the naming: The missingsless function (isless but modified so missing is the smallest value) can be a bit hard to undestand and for some reason I found myself making typos even while writing simple tests and things like that. Perhaps something like missingslast would be intuitive?

@bkamins
Copy link
Member

bkamins commented Jul 19, 2023

Thank you. So I close this issue, and we can move the discussion to Missings.jl

@bkamins bkamins closed this as completed Jul 19, 2023
@nalimilan
Copy link
Member

Let's continue the discussion on naming given that it has already been developed here. I don't really like missingsless as it could mean "missings less", i.e. without missing values, and it uses the plural while missingsmallest uses the singular. Why not use missingisless, which I find clearer?

My two cents on the naming: The missingsless function (isless but modified so missing is the smallest value) can be a bit hard to undestand and for some reason I found myself making typos even while writing simple tests and things like that. Perhaps something like missingslast would be intuitive?

I guess this criticism applies to missingisless as well? missingslast is problematic as with sort(x, lt=missingslast), they would end up first instead.

@jariji proposed missingsleast. Do you think that's better? I'm not convinced it would be easier to remember or type for users. Among the options there's also missinglowest.

@bkamins
Copy link
Member

bkamins commented Jul 25, 2023

My ranking of preference missinglowest > missingleast > missingsisless. The reasoning is based on what I would find easiest to remember and type.

Now another idea would be to define:

missingsmallest(f=isless) = (x, y) -> ismissing(y) ? false : ismissing(x) ? true : f(x, y)

Then we would introduce just one name missingsmallest and user could call it by=missingsmallest() if by=missingsmallest(isless) is wanted.

@nalimilan
Copy link
Member

Then we would introduce just one name missingsmallest and user could call it by=missingsmallest() if by=missingsmallest(isless) is wanted.

You mean lt=missingsmallest(), right?

Maybe we could even allow lt=missingsmallest (without parentheses), by having missingsmallest(x, y; lt=isless) = ismissing(y) ? false : ismissing(x) ? true : lt(x, y) and missingsmallest(; lt=isless) = (x, y) -> missingsmallest(x, y; lt=lt). Not sure whether it would be confusing.

@bkamins
Copy link
Member

bkamins commented Jul 28, 2023

You mean lt=missingsmallest(), right?

Yes - this is what I meant. Sorry.

Maybe we could even allow lt=missingsmallest

We could but I think that then a simpler definition could be:

missingsmallest(x, y) = ismissing(y) ? false : ismissing(x) ? true : isless(x, y)
missingsmallest(lt) = (x, y) -> ismissing(y) ? false : ismissing(x) ? true : lt(x, y)

I am not sure which syntax would be better. lt=missinsmallest is appealing for consistency with lt=isless, but lt=missinsmallest() has a simpler design. What do other people involved think?

@bkamins
Copy link
Member

bkamins commented Sep 3, 2023

Any more comments on these options? (so that we can have some decision and move forward with the implementation)

@LilithHafner
Copy link
Contributor

LilithHafner commented Sep 3, 2023

I agree with your latest API proposal with missingsmallest as an "alias" for missingsmallest(isless). The API would be:

missingsmallest(lt)(x,y) === ismissing(y) ? false : ismissing(x) ? true : lt(x, y)
missingsmallest(x,y) === missingsmallest(isless)(x,y)

I support supporting bare lt=missingsmallest because I think using missingsmallest in conjunction with custom lt will be uncommon.


A simple implementation, if we don't want to dispatch to specialized algorithms is

missingsmallest(lt) = (x, y) -> ismissing(y) ? false : ismissing(x) ? true : lt(x, y)
missingsmallest(x, y) = missingsmallest(isless)(x, y)

A clever (too clever) for implementation that allows missingsmallest(isless) === missingsmallest and supports optimizations is

# this struct is internal and exists because we can't dispatch on anonymous functions.
struct MissingSmallest{T}
    lt::T
end
const missingsmallest = MissingSmallest(lt)
(ms::MissingSmallest)(x, y) = ismissing(y) ? false : ismissing(x) ? true : ms.lt(x, y)
(::MissingSmallest{::typeof(isless)})(lt) = MissingSmallest(lt)

# Interesting properties

@test missingsmallest === missingsmallest(isless)
@test missingsmallest === missingsmallest(isless)(isless)

# Optimizations

# TODO: upstream `_Lt` into Base.Order
_Lt(::typeof(isless)) = Forward
_Lt(lt) = Lt(lt)

@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::Lt{<:MissingSmallest}, kw)
        # put missing at beginning
        Base.Sort._sort!(v, a.next, _Lt(o.lt.lt), (kw...; hi=new_hi))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{<:Lt{<:MissingSmallest}}, kw)
        # put missing at end
        Base.Sort._sort!(v, a.next, ReverseOrdering(_Lt(o.fwd.lt.lt)), (kw...; lo=new_lo))
    end
end

Another, less clever but probably better, implementation is

# Implementation

# this struct is internal and exists because we can't dispatch on anonymous functions.
struct MissingSmallest{T}
    lt::T
end
missingsmallest(lt) = MissingSmallest(lt)
missingsmallest(x, y) = missingsmallest(lt)(x, y)
(ms::MissingSmallest)(x, y) = ismissing(y) ? false : ismissing(x) ? true : ms.lt(x, y)

# Interesting properties

@test_throws MethodError missingsmallest(isless)(isless)
@test missingsmallest !== missingsmallest(isless) # sad but necessary

# Optimizations

# TODO: upstream `_Lt` into Base.Order
_Lt(::typeof(isless)) = Forward
_Lt(lt) = Lt(lt)

withoutmissingordering(::typeof(missingsmallest)) = Forward
withoutmissingordering(ms::MissingSmallest) = _Lt(ms)
const _MissingSmallest = Union{MissingSmallest, typeof(missingsmallest}}

@static if isdefined(Base.Sort, :MissingOptimization) && isdefined(Base.Sort, :_sort!)
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::Lt{<:_MissingSmallest}, kw)
        # put missing at beginning
        Base.Sort._sort!(v, a.next, withoutmissingordering(o), (kw...; hi=new_hi))
    end
    function Base.Sort._sort!(v::AbstractVector, a::MissingOptimization, o::ReverseOrdering{Lt{<:_MissingSmallest}}, kw)
        # put missing at end
        Base.Sort._sort!(v, a.next, ReverseOrdering(withoutmissingordering(o.fwd)), (kw...; lo=new_lo))
    end
end

@nalimilan
Copy link
Member

I'd be fine with either the simple implementation or the less clever one. Optimizations can be added later if we feel the need for them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants