refactor(virtio-net): avoid copy on rx #4742

Closed

Conversation

Contributor

@ihciah ihciah commented Aug 21, 2024

Improve virtio-net performance by reading directly into the descriptor chain; introduce readiness management to avoid redundant readv calls and make the code more readable.

Changes

  1. Change virtio-net from copy-read to direct-read: previously, the net device implementation read data from the TAP into a buffer owned by the VMM and then copied it into the descriptor chain. The new implementation converts the descriptor chain into an iovec first and then reads the data directly into it (see the sketch after this list).
  2. Optimize the read performance of the descriptor chain: reduce stack copying and the overhead of guest memory reads.
  3. Refactor the code related to reading data in the net device: add readiness management and centralize the main logic in the process_rx function, making the code simpler and clearer.
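To make the direct-read idea concrete, here is a minimal sketch of turning a descriptor chain into an iovec array and filling it with a single readv() call on the tap. The chain representation and the `desc_to_host_ptr` translation helper are illustrative assumptions, not Firecracker's actual types:

```rust
use std::os::unix::io::AsRawFd;

// Sketch only: read from the tap straight into guest memory via readv(),
// instead of reading into a VMM buffer and copying into the chain afterwards.
fn read_tap_into_chain(
    tap: &impl AsRawFd,
    chain: &[(u64, u32)],                      // (guest address, length) of each writable descriptor
    desc_to_host_ptr: impl Fn(u64) -> *mut u8, // hypothetical guest-to-host address translation
) -> std::io::Result<usize> {
    // Build the iovec array once per chain; no intermediate copy.
    let iovecs: Vec<libc::iovec> = chain
        .iter()
        .map(|&(addr, len)| libc::iovec {
            iov_base: desc_to_host_ptr(addr) as *mut libc::c_void,
            iov_len: len as usize,
        })
        .collect();

    // SAFETY: each iovec points at valid, writable guest memory of the stated length.
    let ret =
        unsafe { libc::readv(tap.as_raw_fd(), iovecs.as_ptr(), iovecs.len() as libc::c_int) };
    if ret < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(ret as usize)
    }
}
```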

Reason

  1. Performance
    In the current VMM-emulated net device read implementation, data is read into a VMM-owned buffer before being copied to the descriptor chain. The new method converts the descriptor chain directly into an iovec and then uses tap readv for direct input.

Each implementation has its advantages and disadvantages. The current method avoids traversing the entire descriptor chain for iovec conversion when handling small packets but incurs copy overhead for larger packets.

Previously, #2958 switched from copy+write to writev, resulting in performance improvement for the write side. The author of that PR mentioned that modifying it to readv was tried but faced iovec conversion overhead.

However, while I believe this overhead is hard to avoid completely, it isn't unsolvable. We can try to minimize the read overhead of the descriptor chain to reduce the conversion cost:

  1. In virtio: skip redundant memory check #4723, I removed an unnecessary check which reduced certain memory read costs in my benchmarks.
  2. I introduced a load_next_descriptor method that reduces the stack copying overhead of the DescriptorChain structure.
  3. I added a MemBytesExt which, after certain safety checks pass (a non-malicious guest always passes them; if not, it falls back to read_obj), uses ptr::read_volatile in place of the more complex read_obj, achieving significant performance improvements in benchmarks (see the sketch after this list).
  4. For future optimization, focusing on GuestMemory look-up could be effective. Since memory mappings are static, if the entire virtio queue is within a single contiguous mapping region, we could save this specific GuestMemoryRegion instead of the entire GuestMemory upon device activation. This would substantially reduce the lookup overhead for each memory access. Although there might typically be only two regions, the performance data show significant overhead that should not be overlooked. I have attempted to make changes to facilitate this optimization in the vm-memory repository: feat: support find_arc_region and IntoIterator for GuestMemoryMmap rust-vmm/vm-memory#293.
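As a concrete illustration of point 3, a minimal sketch of a load_obj-style fast path is shown below. The `host_ptr_if_safe` and `read_obj_fallback` helpers are hypothetical stand-ins; the actual MemBytesExt in this PR performs its checks against GuestMemory:

```rust
/// Sketch only: fast-path load of a plain-old-data virtio descriptor.
/// `host_ptr_if_safe` returns a host pointer only when the whole range
/// [addr, addr + size_of::<T>()) lies inside one mapped region and is
/// suitably aligned for T; otherwise we fall back to the generic path.
fn load_obj<T: Copy>(
    addr: u64,
    host_ptr_if_safe: impl Fn(u64, usize) -> Option<*const u8>,
    read_obj_fallback: impl Fn(u64) -> std::io::Result<T>,
) -> std::io::Result<T> {
    match host_ptr_if_safe(addr, std::mem::size_of::<T>()) {
        // SAFETY: the helper guarantees the range is mapped, readable and aligned,
        // and T is plain old data, so a volatile load is sound.
        Some(ptr) => Ok(unsafe { std::ptr::read_volatile(ptr as *const T) }),
        // A non-malicious guest never hits this branch; fall back to the slower,
        // fully checked read_obj equivalent.
        None => read_obj_fallback(addr),
    }
}
```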

Performance comparison data with these optimizations (unit: Gbps):

| Bench  | Old   | New   | Change   |
|--------|-------|-------|----------|
| F      | 37.74 | 42.4  | +12.35%  |
| R      | 40.8  | 40.4  | -0.98%   |
| F-96   | 37.6  | 42.8  | +13.83%  |
| F-128  | 37.81 | 42.5  | +12.40%  |
| F-256  | 37.43 | 42.8  | +14.35%  |
| F-1460 | 37.89 | 42.4  | +11.90%  |
| R-96   | 38.42 | 39.82 | +3.64%   |
| R-128  | 38.22 | 38.29 | +0.18%   |
| R-256  | 38.55 | 38.3  | -0.65%   |
| R-1460 | 40.8  | 40.3  | -1.23%   |

Note: because this is a consumer CPU and Turbo Boost has not been disabled, the data may be somewhat inaccurate. If needed, I can upload some perf data.

Testing environment:

  • CPU: AMD Ryzen 7 7840HS w/ Radeon Graphics
  • VM: 4C2G, Host: 16C32G
  • Affinity: 4 vCPUs and 1 main thread bound to 5 different cores; 4 host iperf processes bound to 4 different cores.
  • iperf3: 4 host clients, 4 guest servers.
    • F: iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K
    • R: iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -R
    • F-num: iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -M "$1"
    • R-num: iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -R -M "$1"

The new implementation shows significant performance improvements when reading large packets, with over a 10% maximum throughput increase in my benchmarks; however, it has a minor performance degradation with small packets.

  2. Code Refactoring
    The current implementation of the read path is quite complex, as shown below:

[image: diagram of the current RX handling path]

In the new implementation, the main logic is consolidated in `process_rx` and `process_tx`, with state managed via readiness tracking. Recording state with readiness also avoids potential overhead from redundant queue and tap reads, making the code easier to understand.
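As an illustration of the readiness tracking described above, the sketch below shows the shape of the idea; the field names and exact conditions are assumptions, not the PR's actual type:

```rust
/// Sketch: record which RX preconditions currently hold so that process_rx()
/// only touches the tap or the queue when progress is actually possible.
#[derive(Default)]
struct Readiness {
    /// The tap reported readable data (EPOLLIN seen and not yet drained).
    tap_readable: bool,
    /// The guest has published RX descriptors we have not consumed yet.
    rx_buffers_available: bool,
}

impl Readiness {
    /// Reading from the tap can make progress only when both sides are ready,
    /// which avoids readv() calls that would just return EAGAIN.
    fn rx_tap_pre_check(&self) -> bool {
        self.tap_readable && self.rx_buffers_available
    }
}
```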

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.


codecov bot commented Aug 22, 2024

Codecov Report

Attention: Patch coverage is 86.85969% with 59 lines in your changes missing coverage. Please review.

Project coverage is 84.13%. Comparing base (a364da8) to head (c145abf).
Report is 246 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/vmm/src/devices/virtio/net/device.rs | 84.45% | 51 Missing ⚠️ |
| src/vmm/src/devices/virtio/queue.rs | 86.36% | 6 Missing ⚠️ |
| src/vmm/src/devices/virtio/iovec.rs | 98.38% | 1 Missing ⚠️ |
| src/vmm/src/devices/virtio/net/tap.rs | 92.30% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4742      +/-   ##
==========================================
- Coverage   84.37%   84.13%   -0.25%     
==========================================
  Files         249      249              
  Lines       27433    27601     +168     
==========================================
+ Hits        23147    23222      +75     
- Misses       4286     4379      +93     
| Flag | Coverage Δ |
|------|------------|
| 5.10-c5n.metal | 84.33% <86.85%> (-0.27%) ⬇️ |
| 5.10-m5n.metal | 84.32% <86.85%> (-0.27%) ⬇️ |
| 5.10-m6a.metal | 83.60% <86.85%> (-0.28%) ⬇️ |
| 5.10-m6g.metal | 80.65% <86.85%> (-0.26%) ⬇️ |
| 5.10-m6i.metal | 84.32% <86.85%> (-0.27%) ⬇️ |
| 5.10-m7g.metal | 80.65% <86.85%> (-0.26%) ⬇️ |
| 6.1-c5n.metal | 84.33% <86.85%> (-0.27%) ⬇️ |
| 6.1-m5n.metal | 84.32% <86.85%> (-0.27%) ⬇️ |
| 6.1-m6a.metal | 83.61% <86.85%> (-0.28%) ⬇️ |
| 6.1-m6g.metal | 80.64% <86.85%> (-0.27%) ⬇️ |
| 6.1-m6i.metal | 84.32% <86.85%> (-0.26%) ⬇️ |
| 6.1-m7g.metal | 80.64% <86.85%> (-0.27%) ⬇️ |

Flags with carried forward coverage won't be shown.


Contributor

@ShadowCurse ShadowCurse left a comment


Hi @ihciah, thx for the PR. Really nice description, and at first glance the changes look good as well. But can you please split the commit into multiple ones? It would be easier for people to review.

@ihciah
Contributor Author

ihciah commented Aug 22, 2024

Hi @ihciah, thx for the PR. Really nice description, and at first glance the changes look good as well. But can you please split the commit into multiple ones? It would be easier for people to review.

Thank you. It is split now, and more tests have been added.

Add and implement MemBytesExt to provide load_obj.

Signed-off-by: ihciah <[email protected]>
Add load_next_descriptor and use it in load_descriptor_chain.

Signed-off-by: ihciah <[email protected]>
Add load_descriptor_chain and other essential methods for
IoVecBufferMut.

Signed-off-by: ihciah <[email protected]>
Add read_iovec for Tap.

Signed-off-by: ihciah <[email protected]>
Drain tap device on init to prevent reading some initial packets.

Signed-off-by: ihciah <[email protected]>
Improve virtio-net performance by directly read into desc chain;
introduce Readiness management to avoid redundant readv and make
code more readable.

Signed-off-by: ihciah <[email protected]>
@ihciah
Contributor Author

ihciah commented Aug 27, 2024

@ShadowCurse Hi! I've split the commits; could you re-review this PR? Thanks.

@bchalios
Contributor

Hi @ihciah and thanks a lot for the PR. It looks very promising. Strangely enough, I was working on a similar approach as well and have a draft branch here: https://github.com/bchalios/firecracker/tree/pre-process-rx-bufs.

I went through a different path for eliminating the overhead of processing DescriptorChain objects, which is pre-processing them outside the hot path (when the TAP device is ready). I process the buffers when the guest sends us a notification that it added RX buffers to the queue. The idea is that once the TAP is ready, we will already have an IoVecBuffer ready for handling the RX.
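A rough sketch of that pre-processing idea, for illustration only (the types and method names below are assumptions, not the actual code in that branch):

```rust
use std::collections::VecDeque;

/// Sketch: an RX buffer whose descriptor chain was already parsed into iovecs.
struct ParsedRxBuffer {
    head_index: u16,          // chain head, returned to the queue after use
    iovecs: Vec<libc::iovec>, // pre-computed host-address iovecs for the chain
}

#[derive(Default)]
struct RxBuffers {
    ready: VecDeque<ParsedRxBuffer>,
}

impl RxBuffers {
    /// Called from the RX queue event handler, outside the tap hot path.
    fn on_queue_notification<I: Iterator<Item = ParsedRxBuffer>>(&mut self, chains: I) {
        self.ready.extend(chains);
    }

    /// Called from the tap event handler: hand out the next prepared buffer, if any.
    fn next_ready(&mut self) -> Option<ParsedRxBuffer> {
        self.ready.pop_front()
    }
}
```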

We are currently in the process of evaluating the performance of both PRs. In our setup we do see performance improvements, but some test cases regress and we want to understand why and if we could fix them.

I will keep you posted with my investigation.

@ihciah
Contributor Author

ihciah commented Aug 27, 2024

@bchalios

I went through a different path for eliminating the overhead of processing DescriptorChain objects, which is pre-processing them outside the hot path (when the TAP device is ready).

I think this may be good for latency but cannot eliminate the cost of desc chain conversion. In my testing, I emitted metrics for the conversion op and the chain length, and it is really surprising that every read desc chain's length is 19 (each one is exactly 19), which is just what I speculated here: cloud-hypervisor/cloud-hypervisor#6636 (comment).

19 is big enough to cause a performance regression. What we can do is make the loop cheaper. The costs that can be avoided here are the checking and the copying.

  1. I compared the diff before and after virtio: skip redundant memory check #4723; it shows a very positive optimization.
  2. read_obj is heavy. With the quick load_obj in this PR, a lot of memcpy is avoided.
    1. before: https://share.firefox.dev/3MmTQru
    2. after: https://share.firefox.dev/3MovNrY
    3. Note: select the firecracker thread only.
  3. Updating DescriptorChain in place also helps by a small amount (sketched below). After all, DescriptorChain itself is 40 bytes, and 40*19 is a non-negligible stack copy size.
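A minimal sketch of what updating in place means here (the struct layout and the `read_desc` helper are illustrative assumptions, not the PR's exact code):

```rust
/// Sketch: instead of building and returning a fresh chain struct on every hop
/// (a ~40-byte copy, done 19 times per chain in the workload above), overwrite
/// the current one with the next descriptor's fields.
struct DescChain {
    index: u16,
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}

const VIRTQ_DESC_F_NEXT: u16 = 0x1;

impl DescChain {
    /// `read_desc` is a hypothetical helper that loads descriptor `i` from the
    /// ring as (addr, len, flags, next). Returns false when the chain ends.
    fn load_next_descriptor(
        &mut self,
        read_desc: impl Fn(u16) -> Option<(u64, u32, u16, u16)>,
    ) -> bool {
        if self.flags & VIRTQ_DESC_F_NEXT == 0 {
            return false;
        }
        match read_desc(self.next) {
            Some((addr, len, flags, next)) => {
                self.index = self.next;
                self.addr = addr;
                self.len = len;
                self.flags = flags;
                self.next = next;
                true
            }
            None => false,
        }
    }
}
```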

@ShadowCurse
Contributor

Updating DescriptorChain in place also helps by a small amount. After all, DescriptorChain itself is 40 bytes, and 40*19 is a non-negligible stack copy size.

Did you validate this? I would assume because the usage pattern is:

let mut next_descriptor = Some(head);
while let Some(desc) = next_descriptor {
  ...
  next_descriptor = desc.next_descriptor();
}

the next_descriptor = desc.next_descriptor(); would just reuse the already existing next_descriptor.

Also, regarding the read_obj optimization: I am working on an overall improvement for the Queue here: #4748, and would prefer to go in that direction (e.g. address all unnecessary calls to GuestMemoryMmap at once) rather than optimizing a single read_obj call.

@bchalios
Contributor

I think this may be good for latency but cannot eliminate the cost of desc chain conversion. In my testing, I emitted metrics for the conversion op and the chain length, and it is really surprising that every read desc chain's length is 19 (each one is exactly 19), which is just what I speculated here: cloud-hypervisor/cloud-hypervisor#6636 (comment).

I know and I agree. The two optimizations (reducing the cost of parsing and moving the parsing out of the hot path) are largely orthogonal. I suggest we co-ordinate as well with Egor in order to understand which optimizations make sense and bundle them together. I will be working to measure performance of the various pieces and post updates here.

@ShadowCurse
Contributor

Hi @ihciah, we have looked through your PR and we appreciate the effort you put in, but the scope of your changes is bigger than needed for a readv implementation. As @bchalios mentioned, we were working on similar changes recently and we think it is better for us to take over the implementation of readv in this case.
The change we are happy to accept though is the load_obj change. If you think it is indeed faster than the default read_obj, can you please open a new PR for it?

@ihciah
Contributor Author

ihciah commented Sep 5, 2024

The change we are happy to accept though is the load_obj change. If you think it is indeed faster than the default read_obj, can you please open a new PR for it?

Appreciate it. I'll work on it soon.

the scope of your changes is bigger than needed for a readv implementation. As @bchalios mentioned, we were working on similar changes recently and we think it is better for us to take over the implementation of readv in this case.

This PR indeed contains a lot of things to improve performance and avoid degradation in certain cases. If you think the PR can be split, could you tell me your expected way to do it? If you think anything needs to change, please comment and I can adjust it in your preferred way. I'm willing to work with your team to get it done.

Contributor

@ShadowCurse ShadowCurse left a comment


Ok, let's see what we can do about this. As I previously mentioned, the feat: add load_obj which is a faster read_obj commit can be moved into its own PR. Also, as I mentioned, @bchalios will take care of readv, so the improve: add load_next_descriptor, feat: impl basic iovec ability for IoVecBufferMut, feat: support read_iovec for tap device, and a small portion of the refactor(virtio-net): avoid copy on rx commits are not needed in this PR. So only the remaining part of refactor(virtio-net): avoid copy on rx remains. I will leave some comments on it, but in general it would be nice if you could somehow split it into smaller commits (like add readiness, update rx path, ...). Otherwise it basically touches the whole of device.rs.

Comment on lines +420 to +422
let mut notify_guest = false;
let mut tried_pop = false;
let mut poped_any = false;

I see you use these flags to track what is going on in this function. I assume you need this because you put all the logic of handling rx in here. Because of this, the function is a bit oversized, and it is quite hard to keep all the state in mind. In general I see that it consists of several parts (you even left comments there):

// Read from MMDS first.
if self.readiness.rx_mmds_pre_check() {
  // mmds handling
}
// Read from tap.
if self.readiness.rx_tap_pre_check() {
  // tap handling
}

So I would suggest moving MMDS and TAP functionality into separate methods. As for the "shared" part you have here:

if notify_guest {
  self.try_signal_queue(NetQueue::Rx)?;
}
if tried_pop && !poped_any {
  self.metrics.no_rx_avail_buffer.inc();
}

I think you can make the above MMDS and TAP functions return something like:

enum RxResult {
  Empty,
  Ok,
  Err(...)
}
or
Option<Result<(), Error>>

So in the end the whole function can be:

fn process_rx(&mut self) -> Result<(), DeviceError> {
    let result = if self.readiness.rx_mmds_pre_check() {
        self.process_rx_mmds()
    } else if self.readiness.rx_tap_pre_check() {
        self.process_rx_tap()
    } else {
        return Ok(());
    };
    match result {
        RxResult::Empty => self.metrics.no_rx_avail_buffer.inc(),
        RxResult::Ok => self.try_signal_queue(NetQueue::Rx)?,
        RxResult::Err(e) => return Err(e),
    }
    Ok(())
}

Comment on lines +524 to +531
macro_rules! push_desc {
    ($n:expr) => {
        rx_queue
            .add_used(mem, head_index, $n)
            .map_err(DeviceError::QueueError)?;
        notify_guest = true;
    };
}

I don't think we really need a macro for this. Also, if you address the comment above, I think this will go away naturally.

Comment on lines +710 to +717
let frame_consumed_by_mmds = Self::write_to_mmds_or_tap(
    self.mmds_ns.as_mut(),
    &mut self.tx_rate_limiter,
    &mut self.tx_frame_headers,
    &self.tx_buffer,
    &mut self.tap,
    self.guest_mac,
    &self.metrics,

Would be nice if you could make this function work on self (so self.write_to_mmds_or_tap). It only takes self.* arguments anyway.

Comment on lines +674 to +681
macro_rules! push_desc {
    () => {
        tx_queue
            .add_used(mem, head_index, 0)
            .map_err(DeviceError::QueueError)?;
        used_any = true;
    };
}

Here we can also skip the macro and repeat these 4 lines a couple of times.

@ShadowCurse
Contributor

Hi @ihciah, it has been some time, and since then we have updated virtio-net with readv (and MRG_RXBUF) support. Because of those changes, this PR seems to be redundant now, so I will close it. Feel free to open a follow-up PR if you think there is room for additional improvements.
