refactor(virtio-net): avoid copy on rx #4742

Closed

Conversation

Contributor

@ihciah ihciah commented Aug 21, 2024

Improve virtio-net performance by reading directly into the descriptor chain; introduce readiness management to avoid redundant readv calls and make the code more readable.

Changes

  1. Change virtio-net from copy-read to direct-read: previously, the net device implementation read data from the TAP into a buffer owned by the VMM and then copied it into the descriptor chain. The new implementation converts the descriptor chain into an iovec first and then reads the data directly into it (see the sketch after this list).
  2. Optimize the read performance of the descriptor chain: reduce stack copying and the overhead of guest memory reads.
  3. Refactor the code related to reading data in the net device: add readiness management and centralize the main logic in the process_rx function, making the code simpler and clearer.
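To make the direct-read idea concrete, here is a minimal sketch of turning a descriptor chain into an iovec array and filling it with a single readv() call on the tap. The chain representation and the `desc_to_host_ptr` translation helper are illustrative assumptions, not Firecracker's actual types:

```rust
use std::os::unix::io::AsRawFd;

// Sketch only: read from the tap straight into guest memory via readv(),
// instead of reading into a VMM buffer and copying into the chain afterwards.
fn read_tap_into_chain(
    tap: &impl AsRawFd,
    chain: &[(u64, u32)],                      // (guest address, length) of each writable descriptor
    desc_to_host_ptr: impl Fn(u64) -> *mut u8, // hypothetical guest-to-host address translation
) -> std::io::Result<usize> {
    // Build the iovec array once per chain; no intermediate copy.
    let iovecs: Vec<libc::iovec> = chain
        .iter()
        .map(|&(addr, len)| libc::iovec {
            iov_base: desc_to_host_ptr(addr) as *mut libc::c_void,
            iov_len: len as usize,
        })
        .collect();

    // SAFETY: each iovec points at valid, writable guest memory of the stated length.
    let ret =
        unsafe { libc::readv(tap.as_raw_fd(), iovecs.as_ptr(), iovecs.len() as libc::c_int) };
    if ret < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(ret as usize)
    }
}
```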

Reason

  1. Performance
    In the current VMM-emulated net device read implementation, data is read into a VMM-owned buffer before being copied to the descriptor chain. The new method converts the descriptor chain directly into an iovec and then uses tap readv for direct input.

Each implementation has its advantages and disadvantages. The current method avoids traversing the entire descriptor chain for iovec conversion when handling small packets but incurs copy overhead for larger packets.

Previously, #2958 switched from copy+write to writev, resulting in performance improvement for the write side. The author of that PR mentioned that modifying it to readv was tried but faced iovec conversion overhead.

However, while I believe this overhead is hard to avoid completely, it isn't unsolvable. We can try to minimize the read overhead of the descriptor chain to reduce the conversion cost:

  1. In virtio: skip redundant memory check #4723, I removed an unnecessary check which reduced certain memory read costs in my benchmarks.
  2. I introduced a load_next_descriptor method that reduces the stack copying overhead of the DescriptorChain structure.
  3. I added a MemBytesExt which, after certain safety checks pass (a non-malicious guest always passes them; if not, it falls back to read_obj), uses ptr::read_volatile in place of the more complex read_obj, achieving significant performance improvements in benchmarks (see the sketch after this list).
  4. For future optimization, focusing on GuestMemory look-up could be effective. Since memory mappings are static, if the entire virtio queue is within a single contiguous mapping region, we could save this specific GuestMemoryRegion instead of the entire GuestMemory upon device activation. This would substantially reduce the lookup overhead for each memory access. Although there might typically be only two regions, the performance data show significant overhead that should not be overlooked. I have attempted to make changes to facilitate this optimization in the vm-memory repository: feat: support find_arc_region and IntoIterator for GuestMemoryMmap rust-vmm/vm-memory#293.
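As a concrete illustration of point 3, a minimal sketch of a load_obj-style fast path is shown below. The `host_ptr_if_safe` and `read_obj_fallback` helpers are hypothetical stand-ins; the actual MemBytesExt in this PR performs its checks against GuestMemory:

```rust
/// Sketch only: fast-path load of a plain-old-data virtio descriptor.
/// `host_ptr_if_safe` returns a host pointer only when the whole range
/// [addr, addr + size_of::<T>()) lies inside one mapped region and is
/// suitably aligned for T; otherwise we fall back to the generic path.
fn load_obj<T: Copy>(
    addr: u64,
    host_ptr_if_safe: impl Fn(u64, usize) -> Option<*const u8>,
    read_obj_fallback: impl Fn(u64) -> std::io::Result<T>,
) -> std::io::Result<T> {
    match host_ptr_if_safe(addr, std::mem::size_of::<T>()) {
        // SAFETY: the helper guarantees the range is mapped, readable and aligned,
        // and T is plain old data, so a volatile load is sound.
        Some(ptr) => Ok(unsafe { std::ptr::read_volatile(ptr as *const T) }),
        // A non-malicious guest never hits this branch; fall back to the slower,
        // fully checked read_obj equivalent.
        None => read_obj_fallback(addr),
    }
}
```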

Performance comparison data with these optimizations (unit: Gbps):

| Bench  | Old   | New   | Change   |
|--------|-------|-------|----------|
| F      | 37.74 | 42.4  | +12.35%  |
| R      | 40.8  | 40.4  | -0.98%   |
| F-96   | 37.6  | 42.8  | +13.83%  |
| F-128  | 37.81 | 42.5  | +12.40%  |
| F-256  | 37.43 | 42.8  | +14.35%  |
| F-1460 | 37.89 | 42.4  | +11.90%  |
| R-96   | 38.42 | 39.82 | +3.64%   |
| R-128  | 38.22 | 38.29 | +0.18%   |
| R-256  | 38.55 | 38.3  | -0.65%   |
| R-1460 | 40.8  | 40.3  | -1.23%   |

Note: because this is a consumer CPU and Turbo Boost has not been disabled, the data may be somewhat inaccurate. If needed, I can upload some perf data.

Testing environment:

  • CPU: AMD Ryzen 7 7840HS w/ Radeon Graphics
  • VM: 4C2G, Host: 16C32G
  • Affinity: 4 vCPUs and 1 main thread bound to 5 different cores; 4 host iperf processes bound to 4 different cores.
  • iperf3: 4 host clients, 4 guest servers.
    • F: iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K
    • R: iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -R
    • F-num: iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -M "$1"
    • R-num: iperf3 -A 1 --client 172.16.0.2 -T 5201 -p 5201 --omit 5 --time 30 -l 512K -R -M "$1"

The new implementation shows significant performance improvements when reading large packets, with over a 10% maximum throughput increase in my benchmarks; however, it has a minor performance degradation with small packets.

  2. Code Refactoring
    The current implementation of the read path is quite complex, as shown below:

[image: diagram of the current RX handling path]

In the new implementation, the main logic is consolidated in `process_rx` and `process_tx`, with state managed via readiness tracking. Recording state with readiness also avoids potential overhead from redundant queue and tap reads, making the code easier to understand.
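As an illustration of the readiness tracking described above, the sketch below shows the shape of the idea; the field names and exact conditions are assumptions, not the PR's actual type:

```rust
/// Sketch: record which RX preconditions currently hold so that process_rx()
/// only touches the tap or the queue when progress is actually possible.
#[derive(Default)]
struct Readiness {
    /// The tap reported readable data (EPOLLIN seen and not yet drained).
    tap_readable: bool,
    /// The guest has published RX descriptors we have not consumed yet.
    rx_buffers_available: bool,
}

impl Readiness {
    /// Reading from the tap can make progress only when both sides are ready,
    /// which avoids readv() calls that would just return EAGAIN.
    fn rx_tap_pre_check(&self) -> bool {
        self.tap_readable && self.rx_buffers_available
    }
}
```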

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.


codecov bot commented Aug 22, 2024

Codecov Report

Attention: Patch coverage is 86.85969% with 59 lines in your changes missing coverage. Please review.

Project coverage is 84.13%. Comparing base (a364da8) to head (c145abf).
Report is 246 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/vmm/src/devices/virtio/net/device.rs | 84.45% | 51 Missing ⚠️ |
| src/vmm/src/devices/virtio/queue.rs | 86.36% | 6 Missing ⚠️ |
| src/vmm/src/devices/virtio/iovec.rs | 98.38% | 1 Missing ⚠️ |
| src/vmm/src/devices/virtio/net/tap.rs | 92.30% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4742      +/-   ##
==========================================
- Coverage   84.37%   84.13%   -0.25%     
==========================================
  Files         249      249              
  Lines       27433    27601     +168     
==========================================
+ Hits        23147    23222      +75     
- Misses       4286     4379      +93     
| Flag | Coverage Δ |
|------|------------|
| 5.10-c5n.metal | 84.33% <86.85%> (-0.27%) ⬇️ |
| 5.10-m5n.metal | 84.32% <86.85%> (-0.27%) ⬇️ |
| 5.10-m6a.metal | 83.60% <86.85%> (-0.28%) ⬇️ |
| 5.10-m6g.metal | 80.65% <86.85%> (-0.26%) ⬇️ |
| 5.10-m6i.metal | 84.32% <86.85%> (-0.27%) ⬇️ |
| 5.10-m7g.metal | 80.65% <86.85%> (-0.26%) ⬇️ |
| 6.1-c5n.metal | 84.33% <86.85%> (-0.27%) ⬇️ |
| 6.1-m5n.metal | 84.32% <86.85%> (-0.27%) ⬇️ |
| 6.1-m6a.metal | 83.61% <86.85%> (-0.28%) ⬇️ |
| 6.1-m6g.metal | 80.64% <86.85%> (-0.27%) ⬇️ |
| 6.1-m6i.metal | 84.32% <86.85%> (-0.26%) ⬇️ |
| 6.1-m7g.metal | 80.64% <86.85%> (-0.27%) ⬇️ |

Flags with carried forward coverage won't be shown.


Contributor

@ShadowCurse ShadowCurse left a comment


Hi @ihciah, thx for the PR. Really nice description, and at first glance the changes look good as well. But can you please split the commit into multiple ones? It would be easier for people to review.

@ihciah
Contributor Author

ihciah commented Aug 22, 2024

Hi @ihciah, thx for the PR. Really nice description, and at first glance the changes look good as well. But can you please split the commit into multiple ones? It would be easier for people to review.

Thank you. It is split now, and more tests have been added.

Add and implement MemBytesExt to provide load_obj.

Signed-off-by: ihciah <[email protected]>
Add load_next_descriptor and use it in load_descriptor_chain.

Signed-off-by: ihciah <[email protected]>
Add load_descriptor_chain and other essential methods for
IoVecBufferMut.

Signed-off-by: ihciah <[email protected]>
Add read_iovec for Tap.

Signed-off-by: ihciah <[email protected]>
Drain tap device on init to prevent reading some initial packets.

Signed-off-by: ihciah <[email protected]>
Improve virtio-net performance by directly read into desc chain;
introduce Readiness management to avoid redundant readv and make
code more readable.

Signed-off-by: ihciah <[email protected]>
@ihciah
Contributor Author

ihciah commented Aug 27, 2024

@ShadowCurse Hi! I've split the commits; could you re-review this PR? Thanks.

@bchalios
Contributor

Hi @ihciah and thanks a lot for the PR. It looks very promising. Strangely enough, I was working on a similar approach as well and have a draft branch here: https://github.com/bchalios/firecracker/tree/pre-process-rx-bufs.

I went through a different path for eliminating the overhead of processing DescriptorChain objects, which is pre-processing them outside the hot path (when the TAP device is ready). I process the buffers when the guest sends us a notification that it added RX buffers to the queue. The idea is that once the TAP is ready, we will already have an IoVecBuffer ready for handling the RX.
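A rough sketch of that pre-processing idea, for illustration only (the types and method names below are assumptions, not the actual code in that branch):

```rust
use std::collections::VecDeque;

/// Sketch: an RX buffer whose descriptor chain was already parsed into iovecs.
struct ParsedRxBuffer {
    head_index: u16,          // chain head, returned to the queue after use
    iovecs: Vec<libc::iovec>, // pre-computed host-address iovecs for the chain
}

#[derive(Default)]
struct RxBuffers {
    ready: VecDeque<ParsedRxBuffer>,
}

impl RxBuffers {
    /// Called from the RX queue event handler, outside the tap hot path.
    fn on_queue_notification<I: Iterator<Item = ParsedRxBuffer>>(&mut self, chains: I) {
        self.ready.extend(chains);
    }

    /// Called from the tap event handler: hand out the next prepared buffer, if any.
    fn next_ready(&mut self) -> Option<ParsedRxBuffer> {
        self.ready.pop_front()
    }
}
```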

We are currently in the process of evaluating the performance of both PRs. In our setup we do see performance improvements, but some test cases regress and we want to understand why and if we could fix them.

I will keep you posted with my investigation.

@ihciah
Contributor Author

ihciah commented Aug 27, 2024

@bchalios

I went through a different path for eliminating the overhead of processing DescriptorChain objects, which is pre-processing them outside the hot path (when the TAP device is ready).

I think this may be good for latency but cannot eliminate the cost of desc chain conversion. In my testing, I emitted metrics for the conversion op and the chain length, and it is really surprising that every read desc chain's length is 19 (each one is exactly 19), which is just what I speculated here: cloud-hypervisor/cloud-hypervisor#6636 (comment).

19 is big enough to cause a performance regression. What we can do is make the loop cheaper. The costs that can be avoided here are the checking and the copying.

  1. I compared the diff before and after virtio: skip redundant memory check #4723; it shows a very positive optimization.
  2. read_obj is heavy. With the quick load_obj in this PR, a lot of memcpy is avoided.
    1. before: https://share.firefox.dev/3MmTQru
    2. after: https://share.firefox.dev/3MovNrY
    3. Note: select the firecracker thread only.
  3. Updating DescriptorChain in place also helps by a small amount (sketched below). After all, DescriptorChain itself is 40 bytes, and 40*19 is a non-negligible stack copy size.
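A minimal sketch of what updating in place means here (the struct layout and the `read_desc` helper are illustrative assumptions, not the PR's exact code):

```rust
/// Sketch: instead of building and returning a fresh chain struct on every hop
/// (a ~40-byte copy, done 19 times per chain in the workload above), overwrite
/// the current one with the next descriptor's fields.
struct DescChain {
    index: u16,
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}

const VIRTQ_DESC_F_NEXT: u16 = 0x1;

impl DescChain {
    /// `read_desc` is a hypothetical helper that loads descriptor `i` from the
    /// ring as (addr, len, flags, next). Returns false when the chain ends.
    fn load_next_descriptor(
        &mut self,
        read_desc: impl Fn(u16) -> Option<(u64, u32, u16, u16)>,
    ) -> bool {
        if self.flags & VIRTQ_DESC_F_NEXT == 0 {
            return false;
        }
        match read_desc(self.next) {
            Some((addr, len, flags, next)) => {
                self.index = self.next;
                self.addr = addr;
                self.len = len;
                self.flags = flags;
                self.next = next;
                true
            }
            None => false,
        }
    }
}
```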

@ShadowCurse
Contributor

Updating DescriptorChain in place also helps by a small amount. After all, DescriptorChain itself is 40 bytes, and 40*19 is a non-negligible stack copy size.

Did you validate this? I would assume because the usage pattern is:

let mut next_descriptor = Some(head);
while let Some(desc) = next_descriptor {
  ...
  next_descriptor = desc.next_descriptor();
}

the next_descriptor = desc.next_descriptor(); would just reuse the already existing next_descriptor.

Also, regarding the read_obj optimization: I am working on an overall improvement for the Queue here: #4748, and would prefer to go in that direction (e.g. address all unnecessary calls to GuestMemoryMmap at once) rather than optimizing a single read_obj call.

@bchalios
Contributor

I think this may be good for latency but cannot eliminate the cost of desc chain conversion. In my testing, I emitted metrics for the conversion op and the chain length, and it is really surprising that every read desc chain's length is 19 (each one is exactly 19), which is just what I speculated here: cloud-hypervisor/cloud-hypervisor#6636 (comment).

I know and I agree. The two optimizations (reducing the cost of parsing and moving the parsing out of the hot path) are largely orthogonal. I suggest we co-ordinate as well with Egor in order to understand which optimizations make sense and bundle them together. I will be working to measure performance of the various pieces and post updates here.

@ShadowCurse
Contributor

Hi @ihciah, we have looked through your PR and we appreciate the effort you put in, but the scope of your changes is bigger than needed for a readv implementation. As @bchalios mentioned, we were working on similar changes recently and we think it is better for us to take over the implementation of readv in this case.
The change we are happy to accept though is the load_obj change. If you think it is indeed faster than the default read_obj, can you please open a new PR for it?

@ihciah
Contributor Author

ihciah commented Sep 5, 2024

The change we are happy to accept though is the load_obj change. If you think it is indeed faster than the default read_obj, can you please open a new PR for it?

Appreciate it. I'll work on it soon.

the scope of your changes is bigger than needed for a readv implementation. As @bchalios mentioned, we were working on similar changes recently and we think it is better for us to take over the implementation of readv in this case.

This PR indeed contains a lot of things to improve performance and avoid degradation in certain cases. If you think the PR can be split, could you tell me your expected way to do it? If you think anything needs to change, please comment and I can adjust it in your preferred way. I'm willing to work with your team to get it done.

Contributor

@ShadowCurse ShadowCurse left a comment


Ok, let's see what we can do about this. As I previously mentioned, the feat: add load_obj which is a faster read_obj commit can be moved into its own PR. Also, as I mentioned, @bchalios will take care of readv, so the improve: add load_next_descriptor, feat: impl basic iovec ability for IoVecBufferMut, feat: support read_iovec for tap device, and a small portion of the refactor(virtio-net): avoid copy on rx commits are not needed in this PR. So only the remaining part of refactor(virtio-net): avoid copy on rx remains. I will leave some comments on it, but in general it would be nice if you could somehow split it into smaller commits (like add readiness, update rx path, ...). Otherwise it basically touches the whole of device.rs.

Comment on lines +420 to +422
let mut notify_guest = false;
let mut tried_pop = false;
let mut poped_any = false;

I see you use these flags to track what is going on in this function. I assume you need this because you put all the logic of handling rx in here. Because of this, the function is a bit oversized, and it is quite hard to keep all the state in mind. In general I see that it consists of several parts (you even left comments there):

// Read from MMDS first.
if self.readiness.rx_mmds_pre_check() {
  // mmds handling
}
// Read from tap.
if self.readiness.rx_tap_pre_check() {
  // tap handling
}

So I would suggest moving MMDS and TAP functionality into separate methods. As for the "shared" part you have here:

if notify_guest {
  self.try_signal_queue(NetQueue::Rx)?;
}
if tried_pop && !poped_any {
  self.metrics.no_rx_avail_buffer.inc();
}

I think you can make the above MMDS and TAP functions return something like:

enum RxResult {
  Empty,
  Ok,
  Err(...)
}
or
Option<Result<(), Error>>

So in the end the whole function can be:

fn process_rx(&mut self) -> Result<(), DeviceError> {
    let result = if self.readiness.rx_mmds_pre_check() {
        self.process_rx_mmds()
    } else if self.readiness.rx_tap_pre_check() {
        self.process_rx_tap()
    } else {
        return Ok(());
    };
    match result {
        RxResult::Empty => self.metrics.no_rx_avail_buffer.inc(),
        RxResult::Ok => self.try_signal_queue(NetQueue::Rx)?,
        RxResult::Err(e) => return Err(e),
    }
    Ok(())
}

Comment on lines +524 to +531
macro_rules! push_desc {
    ($n:expr) => {
        rx_queue
            .add_used(mem, head_index, $n)
            .map_err(DeviceError::QueueError)?;
        notify_guest = true;
    };
}

I don't think we really need a macro for this. Also, if you address the comment above, I think this will go away naturally.

Comment on lines +710 to +717
let frame_consumed_by_mmds = Self::write_to_mmds_or_tap(
    self.mmds_ns.as_mut(),
    &mut self.tx_rate_limiter,
    &mut self.tx_frame_headers,
    &self.tx_buffer,
    &mut self.tap,
    self.guest_mac,
    &self.metrics,

Would be nice if you could make this function work on self (so self.write_to_mmds_or_tap). It only takes self.* arguments anyway.

Comment on lines +674 to +681
macro_rules! push_desc {
    () => {
        tx_queue
            .add_used(mem, head_index, 0)
            .map_err(DeviceError::QueueError)?;
        used_any = true;
    };
}

Here we can also skip the macro and repeat these 4 lines a couple of times.

@ShadowCurse
Contributor

Hi @ihciah, it has been some time, and since then we have updated virtio-net with readv (and MRG_RXBUF) support. Because of those changes, this PR seems to be redundant now, so I will close it. Feel free to open a follow-up PR if you think there is room for additional improvements.
