
WIP: Use readv for the RX path of the network device to avoid one memory copy per frame #4799

Draft: wants to merge 3 commits into main from net_rx_readv
Conversation

bchalios (Contributor)

This is still WIP. I will update the description before the PR is ready for reviewing.

Changes

...

Reason

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@bchalios bchalios force-pushed the net_rx_readv branch 3 times, most recently from 266551b to f98af8a Compare September 12, 2024 15:01

codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 88.26531% with 23 lines in your changes missing coverage. Please review.

Project coverage is 84.33%. Comparing base (8eea9df) to head (f98af8a).

Current head f98af8a differs from pull request most recent head 424b001

Please upload reports for the commit 424b001 to get more accurate results.

Files with missing lines Patch % Lines
src/vmm/src/devices/virtio/iovec.rs 69.49% 18 Missing ⚠️
src/vmm/src/devices/virtio/iov_deque.rs 96.82% 4 Missing ⚠️
src/vmm/src/devices/virtio/vsock/mod.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4799      +/-   ##
==========================================
+ Coverage   84.32%   84.33%   +0.01%     
==========================================
  Files         249      250       +1     
  Lines       27521    27693     +172     
==========================================
+ Hits        23206    23355     +149     
- Misses       4315     4338      +23     
Flag Coverage Δ
5.10-c5n.metal 84.55% <88.26%> (+0.01%) ⬆️
5.10-m5n.metal 84.53% <88.26%> (+<0.01%) ⬆️
5.10-m6a.metal 83.83% <88.26%> (+0.01%) ⬆️
5.10-m6g.metal 80.94% <88.26%> (+0.04%) ⬆️
5.10-m6i.metal 84.53% <88.26%> (+<0.01%) ⬆️
5.10-m7g.metal 80.94% <88.26%> (+0.04%) ⬆️
6.1-c5n.metal 84.55% <88.26%> (+0.01%) ⬆️
6.1-m5n.metal 84.53% <88.26%> (+<0.01%) ⬆️
6.1-m6a.metal 83.83% <88.26%> (+0.02%) ⬆️
6.1-m6g.metal 80.94% <88.26%> (+0.04%) ⬆️
6.1-m6i.metal 84.53% <88.26%> (+0.01%) ⬆️
6.1-m7g.metal 80.94% <88.26%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.


@bchalios bchalios force-pushed the net_rx_readv branch 2 times, most recently from 1d90b83 to da21f71 Compare September 12, 2024 22:15
Add a ring buffer type that is tailored for holding `struct iovec`
objects that point to guest memory for IO. The `struct iovec` objects
represent the memory that the guest passed to us as `Descriptors` in a
VirtIO queue for performing some I/O operation.

We plan to use this type to describe the guest memory we have available
for network RX. This should let us optimize the reception of data from
the TAP device using `readv`, thus avoiding a memory copy.

Co-authored-by: Egor Lazarchuk <[email protected]>
Signed-off-by: Babis Chalios <[email protected]>
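The commit message above describes a ring buffer whose live window is always contiguous. A minimal sketch of that idea (names like `IovDequeSketch` are hypothetical; a `Vec` of twice the capacity simulates the double memory mapping the real `IovDeque` gets from mapping the same pages twice, and `(u64, usize)` stands in for `libc::iovec`):

```rust
// Sketch: the live window [head, tail) stays contiguous because every slot
// is mirrored at offset CAPACITY. With a real double mapping the mirror
// write is unnecessary; here we do it by hand.
const CAPACITY: usize = 256; // FIRECRACKER_MAX_QUEUE_SIZE in the real code

#[derive(Debug)]
struct IovDequeSketch {
    buf: Vec<(u64, usize)>,
    head: usize, // always in [0, CAPACITY)
    tail: usize, // always in [head, head + CAPACITY]
}

impl IovDequeSketch {
    fn new() -> Self {
        Self { buf: vec![(0, 0); 2 * CAPACITY], head: 0, tail: 0 }
    }

    fn len(&self) -> usize {
        self.tail - self.head
    }

    fn push_back(&mut self, iov: (u64, usize)) {
        assert!(self.len() < CAPACITY, "deque is full");
        // Write the slot and its mirror in the second half of the buffer.
        self.buf[self.tail] = iov;
        self.buf[(self.tail + CAPACITY) % (2 * CAPACITY)] = iov;
        self.tail += 1;
    }

    fn pop_front(&mut self) -> Option<(u64, usize)> {
        if self.len() == 0 {
            return None;
        }
        let iov = self.buf[self.head];
        self.head += 1;
        // Keep head within the first mapping so [head, tail) stays in bounds.
        if self.head >= CAPACITY {
            self.head -= CAPACITY;
            self.tail -= CAPACITY;
        }
        Some(iov)
    }

    fn as_slice(&self) -> &[(u64, usize)] {
        // Thanks to the mirroring, this range never needs to wrap.
        &self.buf[self.head..self.tail]
    }
}
```

The payoff is `as_slice`: callers (e.g. a `readv` path) always get one contiguous `iovec` run regardless of where the window sits.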
Allow IoVecBufferMut objects to store multiple DescriptorChain objects,
so that we can describe guest memory meant to be used for receiving data
(for example memory used for network RX) as a single (sparse) memory
region.

This will allow us to always keep track of all the available memory we
have for performing RX and to use `readv` for copying data from the TAP
device into guest memory, avoiding the extra copy. In the future, it
will also facilitate the implementation of mergeable buffers for the RX
path of the network device.

Signed-off-by: Babis Chalios <[email protected]>
Right now, we are performing two copies for writing a frame from the TAP
device into guest memory. We first read the frame in an array held by
the Net device and then copy that array in a DescriptorChain.

In order to avoid the double copy use the readv system call to read
directly from the TAP device into the buffers described by
DescriptorChain.

The main challenge with this is that DescriptorChain objects describe
memory that is at least 65562 bytes long when guest TSO4, TSO6 or UFO is
enabled, or 1526 bytes otherwise, and parsing the chain incurs overhead
which we pay even if the frame we are receiving is much smaller than
these sizes.

PR firecracker-microvm#4748 reduced
the overheads involved with parsing DescriptorChain objects. To further
avoid this overhead, move the parsing of DescriptorChain objects out of
the hot path of process_rx() where we are actually receiving a frame
into process_rx_queue_event() where we get the notification that the
guest added new buffers for network RX.

Signed-off-by: Babis Chalios <[email protected]>
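The double-copy elimination described above boils down to one vectored read. A small safe-Rust sketch using `std`'s `read_vectored` (the safe equivalent of the `readv(2)` call made on the TAP fd; the function name is hypothetical):

```rust
use std::io::{IoSliceMut, Read};

// Sketch: scatter-read a frame directly into multiple guest buffers with a
// single vectored read, instead of reading into an intermediate array held
// by the Net device and then copying into the DescriptorChain.
fn read_frame_into_guest(
    src: &mut impl Read,
    guest_bufs: &mut [IoSliceMut<'_>],
) -> std::io::Result<usize> {
    // One call fills the (sparse) guest memory directly.
    src.read_vectored(guest_bufs)
}
```

With a real TAP fd this is one `readv` syscall; the example below uses an in-memory `Cursor` as the source to keep it self-contained.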
Comment on lines +113 to +114
// SAFETY: We are calling the system call with valid arguments and properly checking its
// return value

To be able to claim that we are calling the syscall with valid arguments, this function should be unsafe, with the invariant being "arguments are valid for mmap"

Comment on lines +187 to +197
// SAFETY:
// * `buffer` is valid both for reads and writes (allocated with `libc::PROT_READ |
// libc::PROT_WRITE`. `
// * `buffer` is aligned at `PAGE_SIZE`
// * `buffer` points to memory allocated with a single system call to `libc::mmap`
let iov = unsafe {
std::slice::from_raw_parts_mut(
buffer.cast(),
2 * PAGE_SIZE / std::mem::size_of::<libc::iovec>(),
)
};

Mhhh, I don't think we can do this. The two pages contain the same physical memory, so we're creating aliasing mutable references here. But if we just store everything as *mut libc::iovec, I think it'll be fine :)


And if we use pointers, we also get rid of the lifetime parameter (which is kinda wrong to have anyway, because the iov field is self referential, and that's not something rust's lifetimes can express)
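The suggestion in the two comments above can be sketched in `std`-only Rust (names hypothetical; a toy `Iovec` stands in for `libc::iovec`): store a raw pointer so no long-lived `&mut` references to the aliased pages exist, and drop the lifetime parameter entirely.

```rust
// Stand-in for libc::iovec so the sketch needs no external crate.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Iovec {
    base: usize,
    len: usize,
}

// Sketch: keeping a *mut instead of &'a mut [Iovec] avoids both the
// self-referential lifetime and aliasing &mut views of double-mapped pages.
struct IovDequePtr {
    iov: *mut Iovec,
    capacity: usize,
}

impl IovDequePtr {
    fn new(capacity: usize) -> Self {
        // The real code would mmap; here we leak a boxed slice we own.
        let boxed = vec![Iovec { base: 0, len: 0 }; capacity].into_boxed_slice();
        Self { iov: Box::into_raw(boxed) as *mut Iovec, capacity }
    }

    fn write(&mut self, idx: usize, v: Iovec) {
        assert!(idx < self.capacity);
        // SAFETY: idx is in bounds of the allocation we own.
        unsafe { self.iov.add(idx).write(v) }
    }

    fn read(&self, idx: usize) -> Iovec {
        assert!(idx < self.capacity);
        // SAFETY: idx is in bounds and Iovec is Copy.
        unsafe { self.iov.add(idx).read() }
    }
}

impl Drop for IovDequePtr {
    fn drop(&mut self) {
        // SAFETY: reconstruct the Box leaked in `new` to free the memory.
        unsafe {
            drop(Box::from_raw(std::slice::from_raw_parts_mut(
                self.iov,
                self.capacity,
            )));
        }
    }
}
```

Each access creates only a short-lived reference inside `read`/`write`, which sidesteps the aliasing concern raised above.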

Comment on lines +256 to +260
self.head += 1;
if self.head > usize::from(FIRECRACKER_MAX_QUEUE_SIZE) {
self.head -= usize::from(FIRECRACKER_MAX_QUEUE_SIZE);
self.tail -= usize::from(FIRECRACKER_MAX_QUEUE_SIZE);
}

I wonder if we could do the trick that the virtio spec does here, and simply store these indices as Wrapping<usize>, let them count up indefinitely, and just do modulo when we want to access an element? Although I also don't really see why we would want to do that 😂
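The virtio-style trick mentioned above can be sketched in a few lines (names hypothetical; it requires a power-of-two capacity so the modulo is a cheap mask):

```rust
use std::num::Wrapping;

// Sketch: head and tail are free-running Wrapping counters; they are only
// reduced modulo the capacity when indexing into the backing storage.
const CAP: usize = 256; // must be a power of two

struct WrappingIdx {
    head: Wrapping<u16>,
    tail: Wrapping<u16>,
}

impl WrappingIdx {
    fn len(&self) -> u16 {
        // Correct even after either counter overflows u16.
        (self.tail - self.head).0
    }

    fn slot(idx: Wrapping<u16>) -> usize {
        usize::from(idx.0) & (CAP - 1)
    }
}
```

This removes the explicit re-normalization branch in `pop_front`, at the cost of masking on every access.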


/// Drop the first `iovec` objects that describe guest memory of `size` length.
///
/// This will drop `iovec` buffers which describe a total memory size of up to `size`. This

Suggested change
/// This will drop `iovec` buffers which describe a total memory size of up to `size`. This
/// This will drop `iovec` buffers which describe a total memory size of at least `size`. This

no? Because if size == 1 and the first iovec contains 10 bytes, then it drops iovecs describing 10 bytes
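The reviewer's point can be demonstrated with a tiny sketch that tracks only iovec lengths (function name hypothetical): whole iovecs are dropped until at least `size` bytes are covered, so the total dropped can exceed `size`.

```rust
// Sketch: drop whole iovecs (lengths only) from the front until the total
// dropped reaches `size`; the last iovec may overshoot.
fn drop_iovs_sketch(iovs: &mut Vec<usize>, size: usize) -> usize {
    let mut dropped = 0;
    while dropped < size && !iovs.is_empty() {
        // Drop the entire first iovec, even if it overshoots `size`.
        dropped += iovs.remove(0);
    }
    dropped
}
```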

Comment on lines +284 to +286
pub(crate) fn as_mut_slice(&mut self) -> &mut [iovec] {
&mut self.iov[self.head..self.tail]
}

Also need to work with pointers here

pub fn drop_iovecs(&mut self, size: u32) -> Result<(), IoVecError> {
let dropped = self.vecs.drop_iovs(size as usize);

// Users should ask us to drop a `size` of memory that is not exactly covered by `iovec`

nit: s/should/should not ?

// objects. In other words, the sum of the lengths of all dropped `iovec` objects should be
// equal to the `size` we were asked to drop. If it isn't, something is seriously wrong
// with the VirtIO queue or the emulation logic, so fail at this point.
assert_eq!(u32::try_from(dropped).unwrap(), size);

Suggested change
assert_eq!(u32::try_from(dropped).unwrap(), size);
assert_eq!(dropped, usize::from(size));

do we really want to panic here though? The function returns Result, so could we just return an Err?
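The non-panicking alternative asked about above could look like this (the `PartialDrop` variant and `check_dropped` helper are hypothetical, not part of the PR):

```rust
// Sketch: surface the length mismatch as an error variant instead of
// asserting, so callers of the Result-returning function can handle it.
#[derive(Debug, PartialEq)]
enum IoVecError {
    PartialDrop { requested: u32, dropped: usize },
}

fn check_dropped(dropped: usize, size: u32) -> Result<(), IoVecError> {
    if dropped == size as usize {
        Ok(())
    } else {
        Err(IoVecError::PartialDrop { requested: size, dropped })
    }
}
```

Whether to prefer this over the assert depends on whether a mismatch is considered a recoverable condition or, as the PR's comment argues, a sign that queue emulation is already broken.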

rand::fill(&mut rand_bytes).map_err(|err| {
METRICS.host_rng_fails.inc();
err
})?;

// It is ok to unwrap here. We are writing `iovec.len()` bytes at offset 0.
iovec.write_all_volatile_at(&rand_bytes, 0).unwrap();
Ok(iovec.len())
Ok(u32::try_from(iovec.len()).unwrap())

this makes me sad :(

I don't think it's strictly necessary, but it'd be cool to differentiate between the new "multi descriptor chain" IoVecBufferMuts and the old "one descriptor chain" ones. We could add an L: Add type parameter that is either u32 or usize and specifies the length of the IoVecBufferMut. Then only the impl IoVecBufferMut<usize> block would have all the appending stuff, and the rng device and everywhere else can continue using IoVecBufferMut<u32> without these ugly try_froms.


I mostly agree with this. In my head though it would be nicer to just leave IoVecBufferMut alone and use IovDeque standalone. The specialization of IoVecBufferMut feels like C++ hell, let's not go that route.

if !Self::rate_limit_request(&mut self.rate_limiter, u64::from(iovec.len())) {
if !Self::rate_limit_request(
&mut self.rate_limiter,
u64::try_from(iovec.len()).unwrap(),

usize_to_u64 here

#[derive(Debug)]
pub(crate) struct IovDeque<'a> {
iov: &'a mut [libc::iovec],
_memfd: Memfd,

We don't need to store memfd. After mmap is done, it is safe to drop memfd:

       After the mmap() call has returned, the file descriptor, fd, can
       be closed immediately without invalidating the mapping.

}

/// Create a new [`IovDeque`] that can hold memory described by a single VirtIO queue.
pub(crate) fn new() -> Result<Self, IovDequeError> {

Can we remove the (crate) thing? It is useless because this type is only used in the vmm crate anyway.

Comment on lines +316 to +318
/// Drop memory from the `IoVecBufferMut`
///
/// This will drop memory described by the `IoVecBufferMut` starting from the beginning.

We don't really drop memory, just some iovs.

/// Drop iovecs from the beginning of the `IoVecBufferMut`
...

@@ -510,6 +495,10 @@ impl Net {

// We currently prioritize packets from the MMDS over regular network packets.
fn read_from_mmds_or_tap(&mut self) -> Result<usize, NetError> {
if self.rx_buffer.len() == 0 {

Shouldn't clippy complain about not having is_empty method?
