Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: System.Linq.Shuffle() #111221

Open
andanteyk opened this issue Jan 9, 2025 · 14 comments
Open

[API Proposal]: System.Linq.Shuffle() #111221

andanteyk opened this issue Jan 9, 2025 · 14 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Linq
Milestone

Comments

@andanteyk
Copy link

Background and motivation

Shuffle is a universal operation, such as in games and preprocessing for machine learning.

LINQ includes APIs for changing the order, such as OrderBy() and Reverse().
However, there is no dedicated API for shuffling an IEnumerable<T> yet.
(There is System.Random.Shuffle() for Span<T>, but if you want to take an arbitrary IEnumerable<T> sequence, you'll probably need a custom implementation.)

On StackOverflow, we often see examples of shuffling implemented with code like .OrderBy(_ => Guid.NewGuid()).
However, this implementation is inefficient, and there are more efficient ways to shuffle.

Therefore, I propose to implement System.Linq.Shuffle().

API Proposal

namespace System.Linq;

public static class Enumerable 
{
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source) { }
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random randomSource) { }
}

API Usage

var songs = ["Lorem", "ipsum", "dolor", "sit", "amet"];
foreach(var song in songs.Shuffle())
{
    Play(song);
}

// pick 3 songs with no duplicates
var threeSongs = songs.Shuffle().Take(3).ToArray();

Alternative Designs

Possible considerations:

  • Make the return type a dedicated type, for example IOrderedEnumerable<T> for OrderBy().
  • An overload that takes a Random instance, which allows for reproducible shuffling, but requires a fixed implementation.

Besides, Shuffle() allows for an easy (though not optimal) solution to #102229.

var noDuplicationGetItems = array.Shuffle().Take(n);

Risks

No response

@andanteyk andanteyk added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jan 9, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jan 9, 2025
@LeaFrock
Copy link
Contributor

LeaFrock commented Jan 9, 2025

See #78419 (comment), #73864 (comment).

Shuffle depends highly on the random algorithm instead of IEnumerable<T>. At least it's easy to add an extension for yourself.

@andanteyk
Copy link
Author

Even if you say "you can easily implement it yourself," it is not efficient, and if someone (including me) has implemented it themselves many times, I think that is a reason to include it in the official API.
OrderBy() is not the optimal implementation, nor is it the optimal use. It is for "sorting".

I don't think it's a problem even if Random.Shuffle() is used internally.
(Strictly speaking, an inside-out implementation would be more efficient than the method of "copying to an internal array and then performing an inplace shuffle".)
Even though Random.Shuffle() exists, additional implementation is currently required to support IEnumerable<T>. (It will probably be necessary to make it T[] first.)
Therefore, I think there is a reason for its existence.

If "Random, not the collection, should be responsible", would that be Random.Shared.Shuffle(IEnumerable<T> source)?
Still, I personally prefer the LINQ style.

@LeaFrock
Copy link
Contributor

LeaFrock commented Jan 9, 2025

Random, not the collection, should be responsible

Yes, the key point is how to implement the shuffle effect with randomness.

additional implementation is currently required to support IEnumerable. (It will probably be necessary to make it T[] first.)

The IEnumerable is not the concept of 'collection' (without Count, which ICollection has; without indexer, which IList has). How to shuffle a data set by Random without an expectable count and an accessible indexer? The problem is not so easy as you think.

it is not efficient

I agree but the similar problems aren't always the responsibility of runtime.

As I mentioned above, the upper level developers/libraries can provides some encapsulation practice based on BCL.

You can add the following extension method, which I think covers 99% usage scenes.

        public static void Shuffle<T>(this IList<T> source)
        {
            if (source is T[] array)
            {
                Random.Shared.Shuffle(array);
                return;
            }
            if (source is List<T> list)
            {
                Random.Shared.Shuffle(CollectionsMarshal.AsSpan(list));
                return;
            }
            ShuffleSlow(source);

            static void ShuffleSlow(IList<T> values)
            {
                int n = values.Count;
                var rand = Random.Shared;
                for (int i = 0; i < n - 1; i++)
                {
                    int j = rand.Next(i, n);

                    if (j != i)
                    {
                        (values[j], values[i]) = (values[i], values[j]);
                    }
                }
            }
        }

Even so, you should be realized:

  1. the Shuffle modify the source immediately, which LINQ APIs doesn't do so.
  2. CollectionsMarshal.AsSpan(list) has some risks if other thread is updating the list.

So you can see, the answer depends on your use cases. You have to change the implement as you require.

@eiriktsarpalis
Copy link
Member

How could this be implemented without bias in cases where the length of the enumerable isn't known ahead of time or could potentially be infinite?

@eiriktsarpalis
Copy link
Member

To answer my own question, this is typically addressed using reservoir sampling. This necessarily introduces bias in the generated permutations, however it might be a good-enough compromise for some use cases.

One potential implementation could involve implementing the classical shuffling algorithm for sources implementing IList<T>/IReadOnlyList<T> and then falling back to reservoir sampling for everything else. The problem with such an approach is that it violates LSP, two enumerables with identical elements would be shuffled using different algorithms with different statistical properties depending on what other interfaces they happen to implement.

@andanteyk
Copy link
Author

As far as I understand, reservoir sampling is an algorithm that extracts elements partially, so it may be difficult to apply it to the current application, which requires shuffling the entire array.
In addition, the result cannot be determined until all elements are enumerated (at least once), so I don't think it can handle infinitely long IEnumerable<T>.
(I'm sorry if I'm wrong, please point it out.)

In my opinion, it would be implemented like Reverse() does now, by first materializing the array with ToArray(), then shuffling it, and returning it in order.
OrderBy() seems to have a separate array of the original data obtained with ToArray() and a sorted index.

Of course, there may be better ideas.

@eiriktsarpalis eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Jan 10, 2025
@eiriktsarpalis eiriktsarpalis added this to the Future milestone Jan 10, 2025
@stephentoub
Copy link
Member

In my opinion, it would be implemented like Reverse() does now, by first materializing the array with ToArray(), then shuffling it, and returning it in order.

That is my expectation as well. It would do the simple/obvious thing, basically:

T[] data = source.ToArray();
Random.Shared.Shuffle(data);

Which begs the question how much value adding this actually provides.

@eiriktsarpalis
Copy link
Member

Which begs the question how much value adding this actually provides.

The only benefit I can think of is avoiding copying when the source is a list.

@stephentoub
Copy link
Member

Which begs the question how much value adding this actually provides.

The only benefit I can think of is avoiding copying when the source is a list.

That would violate expectations set by other LINQ APIs like Order, OrderBy, Reverse, Except, Intersect, etc.

@eiriktsarpalis
Copy link
Member

In what way?

@stephentoub
Copy link
Member

In what way?

Maybe I misunderstood your suggestion, which I thought was to mutate the source rather than copying.

That said, it would still deviate from Reverse, for example, which snapshots the contents when calling GetEnumerator, such that changes to the source after that aren't reflected in the reversed data. There's a long-standing debate about that behavior, though.

@eiriktsarpalis
Copy link
Member

In what way?

Maybe I misunderstood your suggestion, which I thought was to mutate the source rather than copying.

My thinking was that we could replicate the permutation algorithm for IList<T> using an iterator. It should make it more efficient to sample data by doing list.Shuffle().Take(10) when list is sufficiently large.

That said, it would still deviate from Reverse

I can see how changing Reverse might be a point of contention, but I don't think he have to replicate that behavior for a new method.

@stephentoub
Copy link
Member

It should make it more efficient to sample data by doing list.Shuffle().Take(10) when list is sufficiently large.

Sure, if we believe that to be a common case, it could be optimized in combination with subsequent operations if the source is of a known length.

@rickbrew
Copy link
Contributor

rickbrew commented Jan 18, 2025

I use this a little in Paint.NET. There is one location in my code where I randomize the order that tiles are rendered so as to reduce locking dependencies (spatially adjacent tiles often share an underlying cache node, which means they are tied to the same mutex). This helps rendering throughput quite a bit in this case.

Here's my implementation of the standard Fisher-Yates shuffling algorithm. It may be possible to modernize some of this more, I wrote it a long time ago, but the core algorithm is dirt simple.

internal static class ListAlgorithms
{
    public static void FisherYatesShuffle<T, TList>(TList list, int startIndex, int length, Random random)
        where TList : IList<T>
    {
        if (length == 0)
        {
            return;
        }

        // http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle

        for (int i = startIndex + length - 1; i >= startIndex + 1; --i)
        {
            int j = startIndex + random.Next(i - startIndex + 1);
            SwapElements<T, TList>(list, i, j);
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void FisherYatesShuffle<T>(IList<T> list)
    {
        FisherYatesShuffle(list, Random.Shared);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void FisherYatesShuffle<T>(IList<T> list, Random random)
    {
        FisherYatesShuffle(list, 0, list.Count, random);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void FisherYatesShuffle<T>(IList<T> list, int startIndex, int length, Random random)
    {
        FisherYatesShuffle<T, IList<T>>(list, startIndex, length, random);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void FisherYatesShuffle<T, TList>(TList list)
        where TList : IList<T>
    {
        FisherYatesShuffle<T, TList>(list, Random.Shared);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void FisherYatesShuffle<T, TList>(TList list, Random random)
        where TList : IList<T>
    {
        FisherYatesShuffle<T, TList>(list, 0, list.Count, random);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void SwapElements<T, TList>(TList a, int i, int j)
        where TList : IList<T>
    {
        if (i != j)
        {
            T local = a[i];
            a[i] = a[j];
            a[j] = local;
        }
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Linq
Projects
None yet
Development

No branches or pull requests

5 participants