Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WorkerSettings: Add disableLiburing option (enable_liburing in Rust) #1442

Merged
merged 15 commits into from
Aug 12, 2024

Conversation

ibc
Copy link
Member

@ibc ibc commented Aug 9, 2024

Details

### Details

- `createWorker({ disableLiburing: true })` disables LibUring usage despite it's supported by the worker and current host.
- Related (still to be fixed) issue which brings lot of context: #1435
@ibc ibc requested review from jmillan and nazar-pc August 9, 2024 16:34
rust/src/worker.rs Outdated Show resolved Hide resolved
rust/src/worker.rs Outdated Show resolved Hide resolved
@ibc ibc requested a review from nazar-pc August 9, 2024 17:01
@ibc ibc changed the title WorkerSettings: Add disableLiburing option WorkerSettings: Add disableLiburing option (enable_liburing in Rust) Aug 9, 2024
@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

OMG Rust tests fail because of this:

mediasoup-worker::mediasoup_worker_run() | settings error: unknown long option given as argument

I hate that getopt.h ancient thing...

And BTW it fails due to this obviously wrong code that I'm fixing now, but it doesn't justify the problem:

if enable_liburing {
    spawn_args.push("--disable_liburing".to_string());
}

rust/src/worker.rs Outdated Show resolved Hide resolved
rust/src/worker.rs Outdated Show resolved Hide resolved
@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

Ok, this is done.

Copy link
Collaborator

@nazar-pc nazar-pc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming CI passes this looks fine to me

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

Why is this happening in CI????

https://github.com/versatica/mediasoup/actions/runs/10323061408/job/28579657794?pr=1442

cargo clippy works in my machine and we force Rust version!!!!

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

OMG I don't want to spend all the weekend with this.

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

Ok, cargo clippy --all-targets -- -D warnings fails in local too.

@nazar-pc
Copy link
Collaborator

nazar-pc commented Aug 9, 2024

I suspect you might be forcing a different version locally. It wants you to write this idiomatic Rust instead:

let settings = WorkerSettings {
    enable_liburing: false,
    ..WorkerSettings::default()
};

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

Why does it complain about this?

error: field assignment outside of initializer for an instance created with Default::default()
  --> rust/src/router/tests.rs:21:13
   |
21 |             settings.enable_liburing = false;
   |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
note: consider initializing the variable with `worker::WorkerSettings { enable_liburing: false, ..Default::default() }` and removing relevant reassignments
  --> rust/src/router/tests.rs:20:13
   |
20 |             let mut settings = WorkerSettings::default();
   |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the code:

worker_manager
        .create_worker({
            let mut settings = WorkerSettings::default();
            settings.enable_liburing = false;

            settings
        })
        .await
        .expect("Failed to create worker")

We do THE SAME in many other files such as here:

let worker = worker_manager
            .create_worker({
                let mut settings = WorkerSettings::default();

                settings.log_level = WorkerLogLevel::Debug;
                settings.log_tags = vec![WorkerLogTag::Info];
                settings.rtc_port_range = 0..=9999;
                settings.dtls_files = Some(WorkerDtlsFiles {
                    certificate: "tests/integration/data/dtls-cert.pem".into(),
                    private_key: "tests/integration/data/dtls-key.pem".into(),
                });
                settings.libwebrtc_field_trials =
                    Some("WebRTC-Bwe-AlrLimitedBackoff/Disabled/".to_string());
                settings.app_data = AppData::new(CustomAppData { bar: 456 });

                settings
            })
            .await
            .expect("Failed to create worker with custom settings");

@nazar-pc
Copy link
Collaborator

nazar-pc commented Aug 9, 2024

I suspect it has a heuristic about number of fields or just didn't learn to handle more complex cases, not sure which one of those two

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

Pushed, should be green now.

@nazar-pc
Copy link
Collaborator

nazar-pc commented Aug 9, 2024

Generally what it suggests does look a bit nicer and even shorter, but not always

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

OMG now tests failing...

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

WTH???

https://github.com/versatica/mediasoup/actions/runs/10323816619/job/28581996102?pr=1442

  process didn't exit successfully: `/home/runner/work/mediasoup/mediasoup/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)

@nazar-pc
Copy link
Collaborator

nazar-pc commented Aug 9, 2024

Memory issues... again 😕

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

Memory corruption... again 😕

When did this happen in a test recently? I don't remember.

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

It fails consistently, this is depressing, no changes have been done that should affect this AT ALL:

@ibc
Copy link
Member Author

ibc commented Aug 9, 2024

No idea why the issue happens consistently but indeed it's the already know issue in Rust due to the separate usrsctp thread. I worked on this some months ago and there is an unfinished and very complex PR for which honestly I don't have time to revive now. I think I will disable these failing tests because we do know that the underlying code can fail. Well, all this assuming that the failing test is using two workers with pipe transports. If not, ignore all I said here.

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

Those are the ways Rust tests systematically fail. Since they run in parallel and taking into account that memory issues could make tests fail at any time once

running 27 tests
test data_structures::tests::dtls_fingerprint ... ok
test ortc::tests::generate_router_rtp_capabilities_succeeds ... ok
test ortc::tests::generate_router_rtp_capabilities_too_many_codecs ... ok
test ortc::tests::generate_router_rtp_capabilities_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_get_consumable_rtp_parameters_get_consumer_rtp_parameters_get_pipe_consumer_rtp_parameters_succeeds ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_unsupported ... ok
test router::audio_level_observer::tests::router_close_event ... ok
test router::active_speaker_observer::tests::router_close_event ... ok
test router::consumer::tests::transport_close_event ... ok
test router::consumer::tests::producer_close_event ... ok
test router::direct_transport::tests::router_close_event ... ok
test router::data_producer::tests::transport_close_event ... ok
test router::data_consumer::tests::transport_close_event ... ok
test router::data_consumer::tests::data_producer_close_event ... ok
test router::pipe_transport::tests::data_producer_close_is_transmitted_to_pipe_data_consumer ... ok
test router::plain_transport::tests::router_close_event ... ok
test router::producer::tests::transport_close_event ... ok
error: test failed, to rerun pass `-p mediasoup --lib`

Caused by:
  process didn't exit successfully: `/home/runner/work/mediasoup/mediasoup/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)
Error: Process completed with exit code 101.
##[debug]Finishing: cargo test
running 27 tests
test data_structures::tests::dtls_fingerprint ... ok
test ortc::tests::generate_router_rtp_capabilities_succeeds ... ok
test ortc::tests::generate_router_rtp_capabilities_too_many_codecs ... ok
test ortc::tests::generate_router_rtp_capabilities_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_get_consumable_rtp_parameters_get_consumer_rtp_parameters_get_pipe_consumer_rtp_parameters_succeeds ... ok
test router::audio_level_observer::tests::router_close_event ... ok
test router::active_speaker_observer::tests::router_close_event ... ok
test router::consumer::tests::producer_close_event ... ok
test router::data_consumer::tests::data_producer_close_event ... ok
test router::data_consumer::tests::transport_close_event ... ok
test router::consumer::tests::transport_close_event ... ok
test router::data_producer::tests::transport_close_event ... ok
test router::direct_transport::tests::router_close_event ... ok
test router::plain_transport::tests::router_close_event ... ok
test router::pipe_transport::tests::data_producer_close_is_transmitted_to_pipe_data_consumer ... ok
test router::producer::tests::transport_close_event ... ok
error: test failed, to rerun pass `-p mediasoup --lib`

Caused by:
  process didn't exit successfully: `/home/runner/work/mediasoup/mediasoup/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)
Error: Process completed with exit code 101.
running 27 tests
test data_structures::tests::dtls_fingerprint ... ok
test ortc::tests::generate_router_rtp_capabilities_succeeds ... ok
test ortc::tests::generate_router_rtp_capabilities_too_many_codecs ... ok
test ortc::tests::generate_router_rtp_capabilities_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_get_consumable_rtp_parameters_get_consumer_rtp_parameters_get_pipe_consumer_rtp_parameters_succeeds ... ok
test router::active_speaker_observer::tests::router_close_event ... ok
test router::audio_level_observer::tests::router_close_event ... ok
test router::consumer::tests::transport_close_event ... ok
test router::consumer::tests::producer_close_event ... ok
test router::data_consumer::tests::transport_close_event ... ok
test router::direct_transport::tests::router_close_event ... ok
test router::data_producer::tests::transport_close_event ... ok
test router::data_consumer::tests::data_producer_close_event ... ok
test router::plain_transport::tests::router_close_event ... ok
error: test failed, to rerun pass `-p mediasoup --lib`

Caused by:
  process didn't exit successfully: `/home/runner/work/mediasoup/mediasoup/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)
Error: Process completed with exit code 101.

I would like to say that this is pipe transport related but the third test attempt doesn't even execute any "pipe" test.

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

This is completely amazing. Rust tests in v3 branch don't fail. But Rust tests in this PR branch fial consistently. Changes in this PR cannot be the culprit at all, they do not affect anything in Rust.

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

@nazar-pc help please? I doubt that Rust tests are even failing due to this well know issue:

#1352

In a comment above you can see that sometimes pipe transports do not even run and tests still fail. Maybe I'm wrong but AFAIK we only run more than 1 workers in Rust tests in those pipe transport related tests, so I'm completely lost.

@nazar-pc
Copy link
Collaborator

Changes in this PR cannot be the culprit at all, they do not affect anything in Rust.

Either changes in this PR are the culprit (however unlikely that seems) or they trigger an issue that existed before as well, but was hard to reproduce.

Maybe I'm wrong but AFAIK we only run more than 1 workers in Rust tests in those pipe transport related tests, so I'm completely lost.

There are as many tests running concurrently as CPU cores on the machine by default, so we have many workers with potentially different settings running at the same time and order is not deterministic.

@ibc ibc mentioned this pull request Aug 12, 2024
Comment on lines 78 to 81
if (!optarg)
{
MS_THROW_TYPE_ERROR("unknown configuration parameter: %s", optarg);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this deleted? It seems like an important safety check that caller provided something meaningful as an input.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the new --disableLiburing command line argument doesn't have any value so such a check throws if present. I can check that optargs exist for all the other arguments but didn't consider it necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, but other options do require a value. Maybe we have a test that checks that and it causes memory corruption because you suddenly create a value out of null pointer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this change:

db2252c

Is it enough? What do you mean with "you suddenly create a value out of null pointer"? Command line arguments are created by Node and Rust layers in their Worker classes. Tests can not trigger wrong arguments passed to the worker.

Just wondering about this: In worker/utils.rs:

pub(super) fn run_worker_with_channels<OE>(
    id: WorkerId,
    thread_initializer: Option<Arc<dyn Fn() + Send + Sync>>,
    args: Vec<String>,
    worker_closed: Arc<AtomicBool>,
    on_exit: OE,
) -> WorkerRunResult
where
    OE: FnOnce(Result<(), ExitError>) + Send + 'static,
{
    let (channel, prepared_channel_read, prepared_channel_write) =
        Channel::new(Arc::clone(&worker_closed));
    let buffer_worker_messages_guard =
        channel.buffer_messages_for(SubscriptionTarget::String(std::process::id().to_string()));

    std::thread::Builder::new()
        .name(format!("mediasoup-worker-{id}"))
        .spawn(move || {
            if let Some(thread_initializer) = thread_initializer {
                thread_initializer();
            }
            let argc = args.len() as c_int;
            let args_cstring = args
                .into_iter()
                .map(|s| -> CString { CString::new(s).unwrap() })
                .collect::<Vec<_>>();
            let argv = args_cstring
                .iter()
                .map(|arg| arg.as_ptr().cast::<c_char>())
                .collect::<Vec<_>>();
            let version = CString::new(env!("CARGO_PKG_VERSION")).unwrap();

            let status_code = unsafe {
                let (channel_read_fn, channel_read_ctx, _channel_write_callback) =
                    prepared_channel_read.deconstruct();
                let (channel_write_fn, channel_write_ctx, _channel_read_callback) =
                    prepared_channel_write.deconstruct();

                mediasoup_sys::mediasoup_worker_run(
                    argc,
                    argv.as_ptr(),
                    version.as_ptr(),
                    0,
                    0,
                    channel_read_fn,
                    channel_read_ctx,
                    channel_write_fn,
                    channel_write_ctx,
                )
            };

Here args is a command line arguments string, something like:

"--logLevel=warn --disableLiburing"

Maybe something dangerous when doing this?:

let args_cstring = args
                .into_iter()
                .map(|s| -> CString { CString::new(s).unwrap() })
                .collect::<Vec<_>>();
            let argv = args_cstring
                .iter()
                .map(|arg| arg.as_ptr().cast::<c_char>())
                .collect::<Vec<_>>();

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, no problem there IMHO. It just splits the string into these strings:

  • "--logLevel=warn"
  • "--disableLiburing"

It doesn't do anything like assuming/expecting a "=" symbol, so no danger here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why commit db2252c should fix this problem. It probably won't and, instead of wasting more time on this, I will change the new command line argument and add a value to it. No time to deal with ancient command line args stuff.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we're not calling it incorrectly right now, but we could. And that would blow up instead of crashing with a nice message. Do not trust input, at least not to the degree that impacts memory safety.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it's safe and we don't assume anything. Arg values are now mandatory. See latest changes.

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

Maybe I'm wrong but AFAIK we only run more than 1 workers in Rust tests in those pipe transport related tests, so I'm completely lost.

There are as many tests running concurrently as CPU cores on the machine by default, so we have many workers with potentially different settings running at the same time and order is not deterministic.

Do you mean that different tests run in parallel in the same Rust process? I mean, of course! Ok, I assume there is no way in Rust to create a separate process for each test file, right?

@nazar-pc
Copy link
Collaborator

Do you mean that different tests run in parallel in the same Rust process? I mean, of course! Ok, I assume there is no way in Rust to create a separate process for each test file, right?

Yes, it does run in the same process by default. You could probably make it run in separate processes somehow, but we're dealing with memory safe language, so why, right? That is not a solution to the problem we have here anyway 🙂

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

You could probably make it run in separate processes somehow

No, trust me, I couldn't.

but we're dealing with memory safe language, so why, right?

Because there is also C code running its own separate thread: #1352

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

Ok, some news:

  • I can reproduce the Rust test failure in a real Linux host, no need to wait for CI.
  • I've made command line argument mandatory here: 669ddca
  • Now Rust tests fail even in macOS with this error:
    mediasoup-worker::mediasoup_worker_run() | settings error: missing value in command line 
    argument: (null)
    test router::data_producer::tests::transport_close_event ... ok
    test router::tests::worker_close_event ... FAILED
    

So it seems that we are sending some command line arg without value. I will print things to debug.

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

Ok, some news:

  • I can reproduce the Rust test failure in a real Linux host, no need to wait for CI.
  • I've made command line argument mandatory here: 669ddca
  • Now Rust tests fail even in macOS with this error:
    mediasoup-worker::mediasoup_worker_run() | settings error: missing value in command line 
    argument: (null)
    test router::data_producer::tests::transport_close_event ... ok
    test router::tests::worker_close_event ... FAILED
    

So it seems that we are sending some command line arg without value. I will print things to debug.

Ignore, I forgot to add =true in Rust. Now things should work.

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

This is strongly depressing. Now WHAT????

running 27 tests
test data_structures::tests::dtls_fingerprint ... ok
test ortc::tests::generate_router_rtp_capabilities_succeeds ... ok
test ortc::tests::generate_router_rtp_capabilities_too_many_codecs ... ok
test ortc::tests::generate_router_rtp_capabilities_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_get_consumable_rtp_parameters_get_consumer_rtp_parameters_get_pipe_consumer_rtp_parameters_succeeds ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_unsupported ... ok
test router::audio_level_observer::tests::router_close_event ... ok
test router::active_speaker_observer::tests::router_close_event ... ok
test router::data_consumer::tests::transport_close_event ... ok
test router::consumer::tests::producer_close_event ... ok
test router::consumer::tests::transport_close_event ... ok
test router::data_consumer::tests::data_producer_close_event ... ok
test router::data_producer::tests::transport_close_event ... ok
error: test failed, to rerun pass `-p mediasoup --lib`

Caused by:
  process didn't exit successfully: `/home/deploy/src/mediasoup-v3-deploy/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

Ohhh, so we are considering that liburing is enabled or disabled for all workers, but Rust tests run in parallel in same process and in just one of those tests we were disabling liburing. In the other ones liburing is enabled. So here the problem:

void DepLibUring::ClassInit()
{
	const auto mayor = io_uring_major_version();
	const auto minor = io_uring_minor_version();

	MS_DEBUG_TAG(info, "liburing version: \"%i.%i\"", mayor, minor);

	if (Settings::configuration.liburingDisabled)
	{
		MS_DEBUG_TAG(info, "liburing disabled by user settings");

		return;
	}

	// This must be called first.
	DepLibUring::CheckRuntimeSupport();

	if (DepLibUring::IsEnabled())
	{
		DepLibUring::liburing = new LibUring();

		MS_DEBUG_TAG(info, "liburing enabled");
	}
	else
	{
		MS_DEBUG_TAG(info, "liburing not enabled");
	}
}
  • DepLibUring::liburing singleton is thread_local static.
  • However we have static bool enabled.

Comment on lines +19 to +22
.create_worker(WorkerSettings {
enable_liburing: false,
..WorkerSettings::default()
})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nazar-pc, despite this works at intended (default settings are used but enable_liburing which is set to false instead), is this correct idiomatic syntax in Rust?

In Node the default object must be added first, then those fields whose value must be different:

options = { ...options, foo: false }

It surprises me that in Rust it works if the changes are added first before the object with default values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, I believe .. should be used at the very end, I don't think it compiles otherwise at all

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. For me idiomatic would be "here some values and after them those that override them" but it's ok. Rust guys have their own anti idiomatic concept of what idiomatic should mean.

@ibc
Copy link
Member Author

ibc commented Aug 12, 2024

Finally green. Merging!

@ibc ibc merged commit 43ef325 into v3 Aug 12, 2024
34 checks passed
@ibc ibc deleted the worker-add-option-to-disable-liburing branch August 12, 2024 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants