refactor(virtio-net): avoid copy on rx #4742
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #4742      +/-   ##
==========================================
- Coverage   84.37%   84.13%   -0.25%
==========================================
  Files         249      249
  Lines       27433    27601     +168
==========================================
+ Hits        23147    23222      +75
- Misses       4286     4379      +93
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Hi @ihciah, thx for the PR. Really nice description, and from a first glance the changes look good as well. But can you please split the commit into multiple ones? It would make it easier for people to review.
Force-pushed from cdf8683 to c752d3d
Thank you. It is split now, and more tests have been added.
Force-pushed from c752d3d to 4c10f3e
Add and implement MemBytesExt to provide load_obj. Signed-off-by: ihciah <[email protected]>
Add load_next_descriptor and use it in load_descriptor_chain. Signed-off-by: ihciah <[email protected]>
Add load_descriptor_chain and other essential methods for IoVecBufferMut. Signed-off-by: ihciah <[email protected]>
Add read_iovec for Tap. Signed-off-by: ihciah <[email protected]>
Drain tap device on init to prevent reading some initial packets. Signed-off-by: ihciah <[email protected]>
Improve virtio-net performance by reading directly into the desc chain; introduce Readiness management to avoid redundant readv calls and make the code more readable. Signed-off-by: ihciah <[email protected]>
Force-pushed from 4c10f3e to c145abf
@ShadowCurse Hi! I've split the commits, could you re-review this PR? Thanks.
Hi @ihciah and thanks a lot for the PR. It looks very promising. Strangely enough, I was working on a similar approach as well and have a draft branch here: https://github.com/bchalios/firecracker/tree/pre-process-rx-bufs. I went through a different path for eliminating the processing overhead. We are currently in the process of evaluating the performance of both PRs. In our setup we do see performance improvements, but some test cases regress and we want to understand why and whether we can fix them. I will keep you posted on my investigation.
I think this may be good for latency, but it cannot eliminate the cost of converting the desc chain. In my testing I emitted metrics for the conversion op and the chain length, and it was really surprising that every read desc chain has length 19 (every single one is exactly 19), which is just what I speculated here: cloud-hypervisor/cloud-hypervisor#6636 (comment). 19 is big enough to cause a performance regression. What we can do is make the loop cheaper: the cost that can be avoided here is the checking and copying.
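To make that cost concrete, here is a minimal sketch of what converting one chain into an iovec involves (illustrative only, not Firecracker's actual code: the `Descriptor` layout and the `read_desc` callback are simplified stand-ins for checked guest-memory reads):

```rust
/// Simplified split-virtqueue descriptor layout.
#[repr(C)]
#[derive(Clone, Copy)]
struct Descriptor {
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}

const VIRTQ_DESC_F_NEXT: u16 = 0x1;

/// For a chain of N descriptors we pay N checked guest-memory reads of a
/// 16-byte descriptor plus N flag/bounds checks before a single readv call;
/// with N == 19 this loop is what dominates the conversion cost.
fn chain_to_iovec(
    read_desc: impl Fn(u16) -> Option<Descriptor>,
    head: u16,
) -> Vec<(u64, usize)> {
    let mut iovec = Vec::new();
    let mut next = Some(head);
    while let Some(index) = next {
        let Some(desc) = read_desc(index) else { break };
        iovec.push((desc.addr, desc.len as usize));
        next = ((desc.flags & VIRTQ_DESC_F_NEXT) != 0).then_some(desc.next);
    }
    iovec
}
```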
Did you validate this? I would assume that, because the usage pattern is:

```rust
let mut next_descriptor = Some(head);
while let Some(desc) = next_descriptor {
    ...
    next_descriptor = desc.next_descriptor();
}
```

the additional copies would be optimized away anyway. Also regarding
I know and I agree. The two optimizations (reducing the cost of parsing and moving the parsing out of the hot path) are largely orthogonal. I suggest we coordinate with Egor as well in order to understand which optimizations make sense and bundle them together. I will be working to measure the performance of the various pieces and post updates here.
Hi @ihciah, we have looked through your PR and we appreciate the effort you put in, but the scope of your changes is bigger than is needed for a
Appreciate it. I'll work on it soon.

This PR indeed contains a lot of things intended to improve performance and avoid degradation in certain cases. If you think the PR should be split, could you tell me how you would like it split? If you think anything needs to change, please comment and I will adjust it to your preferred approach. I'm willing to work with your team to get it done.
Ok, let's see what we can do about this. As I previously mentioned, the `feat: add load_obj which is a faster read_obj` commit can be moved into its own PR. Also as I mentioned, @bchalios will take care of `readv`, so the `improve: add load_next_descriptor`, `feat: impl basic iovec ability for IoVecBufferMut`, `feat: support read_iovec for tap device`, and a small portion of the `refactor(virtio-net): avoid copy on rx` commits are not needed in this PR. So only the remaining part of `refactor(virtio-net): avoid copy on rx` remains. I will leave some comments on it, but in general it would be nice if you could somehow split it into smaller commits (like `add readiness`, `update rx path`, ...). Otherwise it basically touches the whole `device.rs`.
```rust
let mut notify_guest = false;
let mut tried_pop = false;
let mut poped_any = false;
```
I see you use these flags to track what is going on in this function. I assume you need this because you put all the logic of handling rx in here. Because of this, the function is a bit oversized, and it is quite hard to keep all the state in mind. In general I see that it consists of several parts (you even left comments there):
```rust
// Read from MMDS first.
if self.readiness.rx_mmds_pre_check() {
    // mmds handling
}
// Read from tap.
if self.readiness.rx_tap_pre_check() {
    // tap handling
}
```
So I would suggest moving MMDS and TAP functionality into separate methods. As for the "shared" part you have here:
```rust
if notify_guest {
    self.try_signal_queue(NetQueue::Rx)?;
}
if tried_pop && !poped_any {
    self.metrics.no_rx_avail_buffer.inc();
}
```
I think you can make the above MMDS and TAP functions return something like:
```rust
enum RxResult {
    Empty,
    Ok,
    Err(...),
}
```
or `Option<Result<(), Error>>`.
So in the end the whole function can be:
```rust
fn process_rx(&mut self) -> Result<(), DeviceError> {
    let result = if self.readiness.rx_mmds_pre_check() {
        self.process_rx_mmds()
    } else if self.readiness.rx_tap_pre_check() {
        self.process_rx_tap()
    } else {
        RxResult::Empty
    };
    match result {
        RxResult::Empty => self.metrics.no_rx_avail_buffer.inc(),
        RxResult::Ok => self.try_signal_queue(NetQueue::Rx)?,
        RxResult::Err(e) => return Err(e),
    }
    Ok(())
}
```
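For illustration, one of the split-out handlers could then look roughly like this (hypothetical names throughout: `pop_rx_chain` and `read_tap_into_chain` are stand-ins, not the device's actual API):

```rust
fn process_rx_tap(&mut self) -> RxResult {
    // Try to get a descriptor chain to receive the next frame into.
    let Some(chain) = self.pop_rx_chain() else {
        return RxResult::Empty;
    };
    // readv() from the tap straight into the guest buffers of the chain.
    match self.read_tap_into_chain(chain) {
        Ok(_bytes_read) => RxResult::Ok,
        Err(e) => RxResult::Err(e),
    }
}
```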
```rust
macro_rules! push_desc {
    ($n:expr) => {
        rx_queue
            .add_used(mem, head_index, $n)
            .map_err(DeviceError::QueueError)?;
        notify_guest = true;
    };
}
```
I don't think we really need a macro for this. Also, if you address the comment above, I think this will go away naturally.
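For illustration, the two statements the macro expands to could simply be written inline at each call site (using the names from the snippet above; `bytes_written` stands in for whatever `$n` was at that site):

```rust
rx_queue
    .add_used(mem, head_index, bytes_written)
    .map_err(DeviceError::QueueError)?;
notify_guest = true;
```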
```rust
let frame_consumed_by_mmds = Self::write_to_mmds_or_tap(
    self.mmds_ns.as_mut(),
    &mut self.tx_rate_limiter,
    &mut self.tx_frame_headers,
    &self.tx_buffer,
    &mut self.tap,
    self.guest_mac,
    &self.metrics,
```
It would be nice if you could make this function work on `self` (so `self.write_to_mmds_or_tap`). It only takes `self.*` arguments anyway.
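Concretely, the suggestion might look roughly like the sketch below; the return type and any remaining non-`self` parameters are assumptions, not the PR's actual signature:

```rust
impl Net {
    // Hypothetical method form of the associated function above.
    fn write_to_mmds_or_tap(&mut self /* plus any non-self arguments */) -> Result<bool, DeviceError> {
        // Same body as before, but using self.mmds_ns, self.tx_rate_limiter,
        // self.tx_frame_headers, self.tx_buffer, self.tap, self.guest_mac and
        // self.metrics directly instead of receiving them as parameters.
        todo!()
    }
}
```

The call site would then shrink to `self.write_to_mmds_or_tap(...)`.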
```rust
macro_rules! push_desc {
    () => {
        tx_queue
            .add_used(mem, head_index, 0)
            .map_err(DeviceError::QueueError)?;
        used_any = true;
    };
}
```
Here we can also skip the macro and just repeat these four lines a couple of times.
Hi @ihciah, it has been some time, and since then we have updated virtio-net with readv (and MRG_RXBUF) support. Because of those changes, this PR seems to be redundant now, so I will close it. Feel free to open a follow-up PR if you think there is room for additional improvements.
Improve virtio-net performance by reading directly into the desc chain; introduce Readiness management to avoid redundant readv calls and make the code more readable.
Changes
Reason
In the current VMM-emulated net device read implementation, data is copied into a VMM-owned buffer before being copied into the descriptor chain. The new method converts the descriptor chain directly into an iovec and then uses tap readv for direct input.
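For illustration, here is a rough sketch of the readv half of that idea (not the PR's actual code; it assumes the `libc` crate and that the guest buffers arrive as already-validated mutable slices):

```rust
use std::io;
use std::os::unix::io::RawFd;

/// Scatter-read one frame from the tap fd straight into the guest buffers
/// described by a descriptor chain, with no intermediate bounce buffer.
fn readv_into_guest(tap_fd: RawFd, guest_slices: &mut [&mut [u8]]) -> io::Result<usize> {
    // One iovec entry per descriptor-chain segment.
    let iovecs: Vec<libc::iovec> = guest_slices
        .iter_mut()
        .map(|s| libc::iovec {
            iov_base: s.as_mut_ptr().cast(),
            iov_len: s.len(),
        })
        .collect();

    // SAFETY: every iovec points at a valid, exclusively borrowed buffer
    // that outlives this call.
    let n = unsafe { libc::readv(tap_fd, iovecs.as_ptr(), iovecs.len() as i32) };
    if n < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(n as usize)
    }
}
```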
Each implementation has its advantages and disadvantages. The current method avoids traversing the entire descriptor chain for iovec conversion when handling small packets but incurs copy overhead for larger packets.
Previously, #2958 switched from copy+write to writev, resulting in performance improvement for the write side. The author of that PR mentioned that modifying it to readv was tried but faced iovec conversion overhead.
However, I believe this issue is hard to avoid completely but isn't unsolvable. We can try to minimize the read overhead of the descriptor chain to reduce conversion costs:

- Add a `load_next_descriptor` method that reduces the stack copying overhead of the `DescriptorChain` structure.
- Add `MemBytesExt`, which after passing certain safety checks (checks a non-malicious guest inevitably passes; if they fail, it falls back to `read_obj`) uses `ptr::read_volatile` to replace the more complex `read_obj`, achieving significant performance improvements in benchmarks (a rough sketch of the idea follows the notes below).

Performance comparison data under these optimizations (the unit is Gbps):
Note: because this is a consumer CPU and Turbo Boost has not been disabled, the data may be somewhat inaccurate. If needed, I can upload some perf data.
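As referenced in the change list above, here is a minimal sketch of the `read_volatile` fast-path idea behind `MemBytesExt` (illustrative only; the real trait, its exact checks, and the fallback path in the PR may differ):

```rust
use std::ptr;

/// Hypothetical fast read of a Copy object from a contiguous guest-memory
/// region: if the object fits and is suitably aligned, use one volatile load;
/// otherwise fall back to the slower, generic read_obj-style path.
///
/// SAFETY: `host_ptr` must point to `region_len` readable bytes.
unsafe fn load_obj_fast<T: Copy>(
    host_ptr: *const u8,
    region_len: usize,
    offset: usize,
    fallback_read_obj: impl Fn() -> T,
) -> T {
    let size = std::mem::size_of::<T>();
    // Checks a well-behaved guest always passes: the object lies fully inside
    // the region and its address is aligned for T.
    if offset.checked_add(size).is_some_and(|end| end <= region_len) {
        let src = host_ptr.add(offset) as *const T;
        if src.align_offset(std::mem::align_of::<T>()) == 0 {
            return ptr::read_volatile(src);
        }
    }
    fallback_read_obj()
}
```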
Testing environment:
```bash
iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K
iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -R
iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -M "$1"
iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -R -M "$1"
```
The new implementation shows significant performance improvements when reading large packets, with over a 10% maximum throughput increase in my benchmarks; however, it has a minor performance degradation with small packets.
The current implementation of the read path is quite complex, as shown below:
License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following the Developer Certificate of Origin and signing off your commits, please check CONTRIBUTING.md.

PR Checklist

(Template items referencing the PR description, CHANGELOG.md, TODOs linking to an issue, contribution quality standards, and rust-vmm.)