Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(3DS) Network issues #14381

Closed
MrHuu opened this issue Sep 4, 2022 · 27 comments · Fixed by #14389
Closed

(3DS) Network issues #14381

MrHuu opened this issue Sep 4, 2022 · 27 comments · Fixed by #14389

Comments

@MrHuu
Copy link
Contributor

MrHuu commented Sep 4, 2022

Description

Latest builds for the 3DS platform have an issue with all network enabled features.
Affecting the Online Updater, netplay and achievements.

It turns out i was loading an older core during testing earlier.
I did not test Netplay functionality itself.

Expected behavior

Network connectivity.

Actual behavior

No network connectivity.

Steps to reproduce the bug

  • Online Updater -> Core Downloader
    Result: Failed to retreive core list!
[ERROR] [core updater] Download of core list 'http://buildbot.libretro.com/nightly/nintendo/3ds/latest/cia/.index-extended' failed: Internal error.
  • Netplay -> Refresh Netplay Host List
    Result: No error is shown in menu, only console.
[ERROR] Download failed: Internal error.
  • Start game with Achievements enabled:
    Result: Could not communicate with http://retroachievements.org
[INFO] [RCHEEVOS]: Attempting to login user (with password)
[ERROR] [RCHEEVOS]: Error resolving hash 0: Internal error. (automatic retry in 250ms)
[ERROR] [RCHEEVOS]: Error logging in 0: Internal error. (automatic retry in 250ms)
[ERROR] [RCHEEVOS]: Error resolving hash 0: Internal error. (automatic retry in 500ms)
[ERROR] [RCHEEVOS]: Error logging in 0: Internal error. (automatic retry in 500ms)
[ERROR] [RCHEEVOS]: Error resolving hash 0: Internal error. (automatic retry in 1000ms)
[ERROR] [RCHEEVOS]: Error logging in 0: Internal error. (automatic retry in 1000ms)
[INFO] [RCHEEVOS]: Could not communicate with http://retroachievements.org
[INFO] [RCHEEVOS]: Error logging in: Load aborted

Bisect Results

First bad commit:
e45958b

Version/Commit

All versions since #14351 are affected.
Reverting e45958b on current master restores functionality.

Environment information

  • OS: 3DS
  • Compiler: arm-none-eabi-gcc (devkitARM release 57) 11.2.0
@LibretroAdmin
Copy link
Contributor

@cthulhu-throwaway pinging you. @MrHuu is a 3DS developer so he likely could work with you figuring out a solution.

@ghost
Copy link

ghost commented Sep 4, 2022

@MrHuu probably because I've changed all calls to socket_connect that required a timeout to socket_connect_with_timeout and we already know that function (likely due to making the socket non-blocking through fcntl) is behaving strangely on the 3DS.
Back then you'd reported that you would investigate it; I've neither a 3DS nor do I know anyone else running RetroArch through it.

Could you try doing some macro branching for socket_connect_with_timeout?

#ifdef _3DS
socket_connect(...);
#else
socket_connect_with_timeout(...);
#endif

P.S. Note that some calls to socket_connect require you to leave the socket in non-blocking mode afterwards; socket_connect_with_timeout already does that, so pay attention whether the call site does a socket_set_block(fd, true) after socket_connect_with_timeout, and if it does, you don't want to make the socket non-blocking.

EDIT: The 3DS library uses the same method we do to make sockets non-blocking: https://github.com/devkitPro/libctru/blob/master/libctru/source/services/soc/soc_ioctl.c
I don't think there is anything I can do other than point at the source of the problem.

EDIT: I would also suggest debugging every step of the socket_connect_with_timeout function (it's a small function) to ensure that it's indeed the call to socket_nonblock/socket_set_block that is the one failing.

EDIT: The reason for the warning at https://github.com/libretro/RetroArch/blob/master/network/netplay/netplay_frontend.c#L95-L96 is because libctr does not accept F_SETFD: https://github.com/devkitPro/libctru/blob/master/libctru/source/services/soc/soc_fcntl.c#L48-L51
The problem with socket_connect_with_timeout is likely something else; I absolutely can't do anything without you verifying where socket_connect_with_timeout is failing.

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 4, 2022

https://github.com/libretro/RetroArch/blob/master/libretro-common/net/net_socket.c#L694-L695

   if (!socket_nonblock(fd))
      return false;

It seems to fail here somehow.
When not making the call, things seem to work so far.

@ghost
Copy link

ghost commented Sep 4, 2022

Could you check the returned values (fcntl(fd, F_GETFL) and fcntl(fd, F_SETFL, flags)): https://github.com/libretro/RetroArch/blob/master/libretro-common/net/net_socket.c#L194-L201

It's very odd that's failing there because before that commit, net_http already checked for the returned value of socket_nonblock for non-HTTPS connections (netplay lobby is always HTTP):

if ( socket_connect(fd, (void*)next_addr, true) >= 0
&& socket_nonblock(fd))
break;

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 4, 2022

It works as a fix when disabling that call, my apologies, it does not fail on:

if (!socket_nonblock(fd))
return false;
Both fcntl calls return 0.

Next call:

res = connect(fd, addr->ai_addr, addr->ai_addrlen);
Returns: -1

Next:

getsockopt(fd, SOL_SOCKET, SO_ERROR, (char*)&error, &errsz);
Returns: -1
With an error: -26

Returning false from the socket_connect_with_timeout() function.

@ghost
Copy link

ghost commented Sep 5, 2022

Next call:

res = connect(fd, addr->ai_addr, addr->ai_addrlen);

Returns: -1

Correct. A non-blocking socket will return -1 on connect.

Next:

getsockopt(fd, SOL_SOCKET, SO_ERROR, (char*)&error, &errsz);

Returns: -1
With an error: -26
Returning false from the socket_connect_with_timeout() function.

This seems to be EINPROGRESS: https://github.com/devkitPro/libctru/blob/master/libctru/source/services/soc/soc_common.c#L37

Either poll is behaving incorrectly and returning a writable socket before the connection is ready or the error opt is not being cleaned fast enough (or at all).

For the first you can prevent net_compat.h from defining HAVE_POLL for the 3DS: https://github.com/libretro/RetroArch/blob/master/libretro-common/include/net/net_compat.h#L179-L184

#elif !defined(__PS3__) && !defined(_3DS)
#include <poll.h>

#define NETWORK_HAVE_POLL 1

#endif

This will have socket_wait fall back to select.

If the above solution is still returning EINPROGRESS on getsockopt's SO_ERROR, we can just not perform that check on the 3DS: https://github.com/libretro/RetroArch/blob/master/libretro-common/net/net_socket.c#L731-L740

#if !defined(GEKKO) && !defined(_3DS) && defined(SO_ERROR)
   {
      int       error = -1;
      socklen_t errsz = sizeof(error);

      getsockopt(fd, SOL_SOCKET, SO_ERROR, (char*)&error, &errsz);
      if (error)
         return false;
   }
#endif

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 5, 2022

Excluding NETWORK_HAVE_POLL results in the same error.

#elif !defined(__PS3__) && !defined(_3DS)
#include <poll.h>

#define NETWORK_HAVE_POLL 1

#endif

Excluding the getsockopt call does solve the issue.

#if !defined(GEKKO) && !defined(_3DS) && defined(SO_ERROR)
   {
      int       error = -1;
      socklen_t errsz = sizeof(error);

      getsockopt(fd, SOL_SOCKET, SO_ERROR, (char*)&error, &errsz);
      if (error)
         return false;
   }
#endif

If i understand correctly, in this case getsockopt is only used to check if the returned socket is valid?

Other than not catching potential socket errors, could excluding it cause issues on it's own?
Otherwise, i think we can mark this as the solution as it restores functionality in my experience.

@LibretroAdmin
Copy link
Contributor

LibretroAdmin commented Sep 5, 2022

If we do decide to skip the getsockopt call for both Gekko (NGC/Wii) and 3DS, can we add some comment to it so that a future coder knows why this ifdef is there?

@ghost
Copy link

ghost commented Sep 5, 2022

If i understand correctly, in this case getsockopt is only used to check if the returned socket is valid?

The reason we've that check there is that the socket might've been made writable (which returns success on select/poll), but the connnection actually failed, such as a connection refused error.
After returning from select/poll, we then check the value of SO_ERROR to confirm that the connection was indeed made.
If we skip this check and the connection failed with the socket being made writable without select/poll reporting an error, it should error out when we call send or recv. I think most, if not all, code using socket_connect_with_timeout will handle the error and close the socket afterwards, so it SHOULD be safe to skip this check for platforms that can't reliably test the value of SO_ERROR.

libogc (Wii) doesn't implement getsockopt, which is why it was branched out, and to be honest, those unofficial homebrew POSIX-like implementations are so bad... They are either poorly implemented, doesn't follow POSIX standards or are buggy as hell (like here, SO_ERROR should've been cleaned by the time the socket was made writable with a successful connection).

Feel free to submit a PR adding the 3DS as another exception to the SO_ERROR check.

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 5, 2022

Makes me wonder if the libctru implementation is incomplete, or maybe some other way of using it is expected.

@cthulhu-throwaway
May i kindly thank you for your clear guidance on this issue. It's been very informative to me. At least we get to know a(n other) limitation to be aware of.

@ghost
Copy link

ghost commented Sep 6, 2022

Makes me wonder if the libctru implementation is incomplete, or maybe some other way of using it is expected.

I think it's a problem with the 3DS kernel and they just go along with it. Checked the code earlier and it just does a syscall as usual: https://github.com/devkitPro/libctru/blob/master/libctru/source/services/soc/soc_connect.c#L35-L46

@cthulhu-throwaway May i kindly thank you for your clear guidance on this issue. It's been very informative to me. At least we get to know a(n other) limitation to be aware of.

Those homebrew SDKs have a lot of limitations compared to other POSIX or POSIX wannabe systems. On libogc, many of the network-related functions available in any half-compliant POSIX system are not implemented, even though there are syscalls for them; even select is not implemented, forcing us to have to always use poll, which was not a thing until my recent implementation of socket_poll in libretro-common.
Vita has neither select nor poll, and I'd to implement those functions around the non-POSIX epoll, which Vita has an implementation similar to that of Linux, albeit with some differences.

@ghost
Copy link

ghost commented Sep 6, 2022

In another note, fixing socket_connect_with_timeout will likely fix netplay for the 3DS aswell.

@ghost
Copy link

ghost commented Sep 9, 2022

@MrHuu Found this old issue at their repo: devkitPro/libctru#412

Seems like you need to implement an alternative check through EISCONN afterwards.

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 11, 2022

I'm not really sure on how to properly add a check, with the information provided in said issue.
I'll have to get back to this, once i''ve read myself into the subject.

@ghost
Copy link

ghost commented Sep 11, 2022

I'm not really sure on how to properly add a check, with the information provided in said issue. I'll have to get back to this, once i''ve read myself into the subject.

#if !defined(GEKKO) && !defined(_3DS) && defined(SO_ERROR)
...
#else
   if (connect(fd, addr->ai_addr, addr->ai_addrlen) < 0 && errno != EISCONN)
      return false;
#endif

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 11, 2022

mtheall mentioned we shouldn't be calling connect a second time.
Would this be the most optimal solution in this case?

If i understand correct; calling connect again with the same arguments, should return EISCONN (which translates to the socket already being connected successfully).

@ghost
Copy link

ghost commented Sep 12, 2022

mtheall mentioned we shouldn't be calling connect a second time. Would this be the most optimal solution in this case?

See the responses below that. You wouldn't need to call connect a second time, if we could verify the value of SO_ERROR, which we obviously can't here. The same applies to libogc, as it doesn't implement getsockopt.

This issue has been there for over 4 years, I doubt there is much of a chance of it being fixed.
The only other way of verifying whether the socket is successfully connected or not, is by brute forcing a send or recv; both require that the call site do it rather than the function, as we would otherwise be sending/consuming data. My quote from above:

it should error out when we call send or recv. I think most, if not all, code using socket_connect_with_timeout will handle the error and close the socket afterwards, so it SHOULD be safe to skip this check for platforms that can't reliably test the value of SO_ERROR.

Still, the second connection hack should be implemented for those platforms in case a call site for socket_connect_with_timeout does not expect a subsequent recv or send call to fail after just establishing connection.

@ghost
Copy link

ghost commented Sep 12, 2022

If i understand correct; calling connect again with the same arguments, should return EISCONN (which translates to the socket already being connected successfully).

Regardless of the arguments, it should always return EISCONN if the socket is already connected; we just use the same arguments to make it easier to understand our intentions.

At that point:
EISCONN: successfully connected.
EALREADY: connection is still in progress (shouldn't happen because socket_wait will return ready = false, and thus an early return false).
ANY OTHER ERROR (refused, etc): we catch those here now and return false from the function.

@ghost
Copy link

ghost commented Sep 16, 2022

@MrHuu I am still awaiting you on this. I don't feel like committing the change myself because I lack the hardware, and would rather not change stuff unless there are people to test them out.

I would like to wrap these final issues up (poll might need fixing/tweaking for the 3DS aswell, according to that libctr issue from above).

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 16, 2022

On the 3DS, calling connect a second time, it does return EISCONN.

I'm unable to verify the same behavior for GEKKO, should it be excluded until verified?

@ghost
Copy link

ghost commented Sep 17, 2022

@MrHuu you should probably check if that poll issue is present here too.
According to the issue, poll will always block for the full timeout on libctr, even if we already have one event into revents.

I think the easiest way to test this out is by using netplay to connect to a host that you know should never take 10 seconds to connect, but is still taking 10 seconds or more. We've a 10 seconds timeout for the client to connect:

if (socket_connect_with_timeout(fd, (void*)addr, 10000))

Take in mind that if the 3DS has core updating available, netplay always enforce a core update unless the core is locked. Run the test after updating all cores, or after starting netplay for a second time, or by locking the core (prevents update).

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 18, 2022

It appears netplay currently doesn't work on both .cia an .3dsx builds. Without in-depth investigation, i suspect the command line approach of launching a netplay session would be the issue here. Since the core relaunches, just loading the game without starting netplay.
I'll have to get back on this.

I did notice quite a bit of delay regarding polling. When running a game with achievements enabled, the task queue seems to be on hold. Most noticeable in the menu, while a achievement enabled game is running in the background, there's a huge delay on loading backgrounds and thumbnails.
Disabling achievements resolves this behavior.

Also, setting the 10 sec timeout to 100ms reduces the delay to a minimum while retaining network connectivity.
So far it does seem to wait for the full timeout even if the socket is already connected.

@ghost
Copy link

ghost commented Sep 18, 2022

It appears netplay currently doesn't work on both .cia an .3dsx builds. Without in-depth investigation, i suspect the command line approach of launching a netplay session would be the issue here. Since the core relaunches, just loading the game without starting netplay. I'll have to get back on this.

This approach seems to be the only one that works correctly with static-core platforms. It was developed to fix netplay for the Vita, but I've since patched all network-enabled static-core platforms to work with it.
Some platforms do discard all arguments (with the exception of content) when forking, so I'd to patch that aswell in order to not touch the arguments if the first parameter is either -H or -C; this however, was not the case for the 3DS, so you might wanna take another look if the arguments are not being discarded at the forked process.

EDIT: I've checked the Rosalina, Ninjhax2 and Cia bootloaders and all of them appear to be correct with the current netplay fork arguments code.
Would be interesting if you could check out which arguments we've at the start and end of rarch_main.

@ghost
Copy link

ghost commented Sep 18, 2022

I did notice quite a bit of delay regarding polling. When running a game with achievements enabled, the task queue seems to be on hold. Most noticeable in the menu, while a achievement enabled game is running in the background, there's a huge delay on loading backgrounds and thumbnails. Disabling achievements resolves this behavior.

Also, setting the 10 sec timeout to 100ms reduces the delay to a minimum while retaining network connectivity. So far it does seem to wait for the full timeout even if the socket is already connected.

One of the comments in that issue mentioned splitting poll into multiple calls. It's hacky and involves multiple syscalls, but I don't see another way to avoid blocking for the full timeout.
Add this branch into socket_poll at net_socket.c:

#elif defined(_3DS)
   int i;
   int timeout_quotient;
   int timeout_remainder;
   int ret = -1;

#define TIMEOUT_DIVISOR 100
   if (timeout <= TIMEOUT_DIVISOR)
      return poll(fds, nfds, timeout);

   timeout_quotient = timeout / TIMEOUT_DIVISOR;
   for (i = 0; i < timeout_quotient; i++)
   {
      ret = poll(fds, nfds, TIMEOUT_DIVISOR);

      /* Success or error. */
      if (ret)
         return ret;
   }

   timeout_remainder = timeout % TIMEOUT_DIVISOR;
   if (timeout_remainder)
      ret = poll(fds, nfds, timeout_remainder);

   return ret;
#undef TIMEOUT_DIVISOR

#else

@MrHuu
Copy link
Contributor Author

MrHuu commented Sep 19, 2022

This fixes just about every hiccup i've encountered lately. Which were mostly achievement related.

I've also noticed:
devkitPro/pacman-packages#130 (comment)
Looking back at how retroarch used select, would probably be the reason why there were no noticeable issues before.

I'll try to figure out how the commandline is handled for both .3dsx and .cia, to enable netplay again.
There are still a few 3ds related issues i'm worried about encountering;
Some threaded tasks may lock up the system, and there's still a small chance of failure during self-updating a .cia file.

Other than that i think most, if not all, network issues are fixed when merged.
I really appreciate your effort on refactoring and fixing these platform specific issues.

@ghost
Copy link

ghost commented Sep 19, 2022

I've also noticed: devkitPro/pacman-packages#130 (comment) Looking back at how retroarch used select, would probably be the reason why there were no noticeable issues before.

libctr uses poll for select (as mentioned in the linked issue): https://github.com/devkitPro/libctru/blob/master/libctru/source/services/soc/soc_select.c#L44-L47
The reason you didn't notice any hiccups before is because the http code used by achievements used connect on a blocking socket, returning immediately after a successful connection without having to call select/poll.

I'll try to figure out how the commandline is handled for both .3dsx and .cia, to enable netplay again. There are still a few 3ds related issues i'm worried about encountering; Some threaded tasks may lock up the system, and there's still a small chance of failure during self-updating a .cia file.

You can do it easily by adding some logs to rarch_main:

for (i = 0; i < argc; i++)
   RARCH_LOG("ARG (%d): %s\n", i, argv[i]);

If these logs are correct, then the content load task is discarding the netplay arguments.

@Queaven
Copy link

Queaven commented Apr 6, 2024

It appears netplay currently doesn't work on both .cia an .3dsx builds. Without in-depth investigation, i suspect the command line approach of launching a netplay session would be the issue here. Since the core relaunches, just loading the game without starting netplay. I'll have to get back on this.

I did notice quite a bit of delay regarding polling. When running a game with achievements enabled, the task queue seems to be on hold. Most noticeable in the menu, while a achievement enabled game is running in the background, there's a huge delay on loading backgrounds and thumbnails. Disabling achievements resolves this behavior.

Also, setting the 10 sec timeout to 100ms reduces the delay to a minimum while retaining network connectivity. So far it does seem to wait for the full timeout even if the socket is already connected.

Hello, any news on this? im just curious about the possibility of multiplaying retroarch from a new 3ds. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants