-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Use readv
for the RX path of the network device to avoid one memory copy per frame
#4799
base: main
Are you sure you want to change the base?
Conversation
266551b
to
f98af8a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4799 +/- ##
==========================================
+ Coverage 84.32% 84.33% +0.01%
==========================================
Files 249 250 +1
Lines 27521 27693 +172
==========================================
+ Hits 23206 23355 +149
- Misses 4315 4338 +23
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
1d90b83
to
da21f71
Compare
Add a ring buffer type that is tailored for holding `struct iovec` objects that point to guest memory for IO. The `struct iovec` objects represent the memory that the guest passed to us as `Descriptors` in a VirtIO queue for performing some I/O operation. We plan to use this type to describe the guest memory we have available for doing network RX. This should facilitate us in optimizing the reception of data from the TAP device using `readv`, thus avoiding a memory copy. Co-authored-by: Egor Lazarchuk <[email protected]> Signed-off-by: Babis Chalios <[email protected]>
da21f71
to
23551b8
Compare
Allow IoVecBufferMut objects to store multiple DescriptorChain objects, so that we can describe guest memory meant to be used for receiving data (for example memory used for network RX) as a single (sparse) memory region. This will allow us to always keep track all the available memory we have for performing RX and use `readv` for copying memory from the TAP device inside guest memory avoiding the extra copy. In the future, it will also facilitate the implementation of mergeable buffers for the RX path of the network device. Signed-off-by: Babis Chalios <[email protected]>
Right now, we are performing two copies for writing a frame from the TAP device into guest memory. We first read the frame in an array held by the Net device and then copy that array in a DescriptorChain. In order to avoid the double copy use the readv system call to read directly from the TAP device into the buffers described by DescriptorChain. The main challenge with this is that DescriptorChain objects describe memory that is at least 65562 bytes long when guest TSO4, TSO6 or UFO are enabled or 1526 otherwise and parsing the chain includes overhead which we pay even if the frame we are receiving is much smaller than these sizes. PR firecracker-microvm#4748 reduced the overheads involved with parsing DescriptorChain objects. To further avoid this overhead, move the parsing of DescriptorChain objects out of the hot path of process_rx() where we are actually receiving a frame into process_rx_queue_event() where we get the notification that the guest added new buffers for network RX. Signed-off-by: Babis Chalios <[email protected]>
23551b8
to
424b001
Compare
// SAFETY: We are calling the system call with valid arguments and properly checking its | ||
// return value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be able to claim that we are calling the syscall with valid arguments, this function should be unsafe
, with the invariant being "arguments are valid for mmap
"
// SAFETY: | ||
// * `buffer` is valid both for reads and writes (allocated with `libc::PROT_READ | | ||
// libc::PROT_WRITE`. ` | ||
// * `buffer` is aligned at `PAGE_SIZE` | ||
// * `buffer` points to memory allocated with a single system call to `libc::mmap` | ||
let iov = unsafe { | ||
std::slice::from_raw_parts_mut( | ||
buffer.cast(), | ||
2 * PAGE_SIZE / std::mem::size_of::<libc::iovec>(), | ||
) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mhhh, I don't think we can do this. The two pages contain the same physical memory, so we're creating aliasing mutable references here. But if we just store everything as *mut libc::iovec
, I think it'll be fine :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if we use pointers, we also get rid of the lifetime parameter (which is kinda wrong to have anyway, because the iov
field is self referential, and that's not something rust's lifetimes can express)
self.head += 1; | ||
if self.head > usize::from(FIRECRACKER_MAX_QUEUE_SIZE) { | ||
self.head -= usize::from(FIRECRACKER_MAX_QUEUE_SIZE); | ||
self.tail -= usize::from(FIRECRACKER_MAX_QUEUE_SIZE); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could do the trick that the virtio spec does here, and simply store these indices as Wrapping<usize>
, let them count up indefinitely, and just do modulo when we want to access an element? Although I also don't really see why we would want to do that 😂
|
||
/// Drop the first `iovec` objects that describe guest memory of `size` length. | ||
/// | ||
/// This will drop `iovec` buffers which describe a total memory size of up to `size`. This |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// This will drop `iovec` buffers which describe a total memory size of up to `size`. This | |
/// This will drop `iovec` buffers which describe a total memory size of at least `size`. This |
no? Because if size == 1
and the first iovec contains 10 bytes, then it drop iovecs describing 10 bytes
pub(crate) fn as_mut_slice(&mut self) -> &mut [iovec] { | ||
&mut self.iov[self.head..self.tail] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need to work with pointers here
pub fn drop_iovecs(&mut self, size: u32) -> Result<(), IoVecError> { | ||
let dropped = self.vecs.drop_iovs(size as usize); | ||
|
||
// Users should ask us to drop a `size` of memory that is not exactly covered by `iovec` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/should/should not ?
// objects. In other words, the sum of the lengths of all dropped `iovec` objects should be | ||
// equal to the `size` we were asked to drop. If it isn't, something is seriously wrong | ||
// with the VirtIO queue or the emulation logic, so fail at this point. | ||
assert_eq!(u32::try_from(dropped).unwrap(), size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert_eq!(u32::try_from(dropped).unwrap(), size); | |
assert_eq!(dropped, usize::from(size)); |
do we really want to panic here though? The function returns Result
, so could we just return an Err
?
rand::fill(&mut rand_bytes).map_err(|err| { | ||
METRICS.host_rng_fails.inc(); | ||
err | ||
})?; | ||
|
||
// It is ok to unwrap here. We are writing `iovec.len()` bytes at offset 0. | ||
iovec.write_all_volatile_at(&rand_bytes, 0).unwrap(); | ||
Ok(iovec.len()) | ||
Ok(u32::try_from(iovec.len()).unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes me sad :(
I don't think its strictly necessary, but it'd be cool to differentiate between the new "multi descriptor chain" iovecbuffermuts and the old "one descriptor chain" ones. We could add a L: Add
type parameter that is either u32
or usize
and specifies the length of the IoVecBufferMut
. Then only the impl IoVecBufferMut<usize>
block would have all the appending stuff, and in the rng device and everywhere else can continue using IoVecBufferMut<u32>
without these ugly try_froms
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly agree with this. In my head thought it would be nicer to just leave IoVecBufferMut
alone and use IovDequeue
standalone. The specialization of IoVecBufferMut
feels like C++ hell, let's not go that rout.
if !Self::rate_limit_request(&mut self.rate_limiter, u64::from(iovec.len())) { | ||
if !Self::rate_limit_request( | ||
&mut self.rate_limiter, | ||
u64::try_from(iovec.len()).unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usize_to_u64
here
#[derive(Debug)] | ||
pub(crate) struct IovDeque<'a> { | ||
iov: &'a mut [libc::iovec], | ||
_memfd: Memfd, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to store memfd. After mmap
is done, it is safe to drop memfd
:
After the mmap() call has returned, the file descriptor, fd, can
be closed immediately without invalidating the mapping.
} | ||
|
||
/// Create a new [`IovDeque`] that can hold memory described by a single VirtIO queue. | ||
pub(crate) fn new() -> Result<Self, IovDequeError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove (crate)
thing? It is useless because this type only used in vmm
crate anyways.
rand::fill(&mut rand_bytes).map_err(|err| { | ||
METRICS.host_rng_fails.inc(); | ||
err | ||
})?; | ||
|
||
// It is ok to unwrap here. We are writing `iovec.len()` bytes at offset 0. | ||
iovec.write_all_volatile_at(&rand_bytes, 0).unwrap(); | ||
Ok(iovec.len()) | ||
Ok(u32::try_from(iovec.len()).unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly agree with this. In my head thought it would be nicer to just leave IoVecBufferMut
alone and use IovDequeue
standalone. The specialization of IoVecBufferMut
feels like C++ hell, let's not go that rout.
/// Drop memory from the `IoVecBufferMut` | ||
/// | ||
/// This will drop memory described by the `IoVecBufferMut` starting from the beginning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't really drop memory, just some iovs.
/// Drop iovecs from the beginning of the `IoVecBufferMut`
...
@@ -510,6 +495,10 @@ impl Net { | |||
|
|||
// We currently prioritize packets from the MMDS over regular network packets. | |||
fn read_from_mmds_or_tap(&mut self) -> Result<usize, NetError> { | |||
if self.rx_buffer.len() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't clippy
complain about not having is_empty
method?
This is still WIP. I will update the description before the PR is ready for reviewing.
Changes
...
Reason
...
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md
.PR Checklist
PR.
CHANGELOG.md
.TODO
s link to an issue.contribution quality standards.
rust-vmm
.