[PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel

Hawkins Jiawei posted 8 patches 9 months, 3 weeks ago
Failed in applying to current master (apply log)
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>
There is a newer version of this series
hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
hw/virtio/vhost-shadow-virtqueue.h |   3 +-
net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
3 files changed, 249 insertions(+), 146 deletions(-)
[PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Hawkins Jiawei 9 months, 3 weeks ago
This patchset allows QEMU to delay polling and checking the device
used buffer until either the SVQ is full or control commands shadow
buffers are full, instead of polling and checking immediately after
sending each SVQ control command, so that QEMU can send all the SVQ
control commands in parallel, which have better performance improvement.

I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
guest to build a test environment for sending multiple CVQ state load
commands. This patch series can improve latency from 10023 us to
8697 us for about 4099 CVQ state load commands, about 0.32 us per command.

Note that this patch should be based on
patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].

[1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html

TestStep
========
1. regression testing using vp-vdpa device
  - For L0 guest, boot QEMU with two virtio-net-pci net device with
`ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
      -device virtio-net-pci,disable-legacy=on,disable-modern=off,
iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...

  - For L1 guest, apply the patch series and compile the source code,
start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
`ctrl_rx`, `ctrl_rx_extra` features on, command line like:
      -netdev type=vhost-vdpa,x-svq=true,...
      -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
ctrl_rx=on,ctrl_rx_extra=on...

  - For L2 source guest, run the following bash command:
```bash
#!/bin/sh

for idx1 in {0..9}
do
  for idx2 in {0..9}
  do
    for idx3 in {0..6}
    do
      ip link add macvlan$idx1$idx2$idx3 link eth0
address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
      ip link set macvlan$idx1$idx2$idx3 up
    done
  done
done
```
  - Execute the live migration in L2 source monitor

  - Result
    * with this series, QEMU should not trigger any error or warning.



2. perf using vp-vdpa device
  - For L0 guest, boot QEMU with two virtio-net-pci net device with
`ctrl_vq`, `ctrl_vlan` features on, command line like:
      -device virtio-net-pci,disable-legacy=on,disable-modern=off,
iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
indirect_desc=off,queue_reset=off,ctrl_vlan=on,...

  - For L1 guest, apply the patch series, then apply an addtional
patch to record the load time in microseconds as following:
```diff
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 6b958d6363..501b510fd2 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
     }
 
     if (net->nc->info->load) {
+        int64_t start_us = g_get_monotonic_time();
         r = net->nc->info->load(net->nc);
+        error_report("vhost_vdpa_net_load() = %ld us",
+                     g_get_monotonic_time() - start_us);
         if (r < 0) {
             goto fail;
         }
```

  - For L1 guest, compile the code, and start QEMU with two vdpa device
with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
command line like:
      -netdev type=vhost-vdpa,x-svq=true,...
      -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
ctrl_vlan=on...

  - For L2 source guest, run the following bash command:
```bash
#!/bin/sh

for idx in {1..4094}
do
  ip link add link eth0 name vlan$idx type vlan id $idx
done
```

  - wait for some time, then execute the live migration in L2 source monitor

  - Result
    * with this series, QEMU should not trigger any warning
or error except something like "vhost_vdpa_net_load() = 8697 us"
    * without this series, QEMU should not trigger any warning
or error except something like "vhost_vdpa_net_load() = 10023 us"

ChangeLog
=========
v3:
  - refactor vhost_svq_poll() to accept cmds_in_flight
suggested by Jason and Eugenio
  - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
it suggested by Eugenio
  - poll and check when SVQ is full or control commands shadow buffers is
full

v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
  - recover accidentally deleted rows
  - remove extra newline
  - refactor `need_poll_len` to `cmds_in_flight`
  - return -EINVAL when vhost_svq_poll() return 0 or check
on buffers written by device fails
  - change the type of `in_cursor`, and refactor the
code for updating cursor
  - return directly when vhost_vdpa_net_load_{mac,mq}()
returns a failure in vhost_vdpa_net_load()

v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/

Hawkins Jiawei (8):
  vhost: Add argument to vhost_svq_poll()
  vdpa: Use iovec for vhost_vdpa_net_cvq_add()
  vhost: Expose vhost_svq_available_slots()
  vdpa: Avoid using vhost_vdpa_net_load_*() outside
    vhost_vdpa_net_load()
  vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
  vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
  vdpa: Add cursors to vhost_vdpa_net_loadx()
  vdpa: Send cvq state load commands in parallel

 hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
 hw/virtio/vhost-shadow-virtqueue.h |   3 +-
 net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
 3 files changed, 249 insertions(+), 146 deletions(-)

-- 
2.25.1
Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Michael S. Tsirkin 9 months, 3 weeks ago
On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
> This patchset allows QEMU to delay polling and checking the device
> used buffer until either the SVQ is full or control commands shadow
> buffers are full, instead of polling and checking immediately after
> sending each SVQ control command, so that QEMU can send all the SVQ
> control commands in parallel, which have better performance improvement.
> 
> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
> guest to build a test environment for sending multiple CVQ state load
> commands. This patch series can improve latency from 10023 us to
> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.

Looks like a tiny improvement.
At the same time we have O(n^2) behaviour with memory mappings.
Not saying we must not do this but I think it's worth
checking where the bottleneck is. My guess would be
vp_vdpa is not doing things in parallel. Want to try fixing that
to see how far it can be pushed?


> Note that this patch should be based on
> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
> 
> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
> 
> TestStep
> ========
> 1. regression testing using vp-vdpa device
>   - For L0 guest, boot QEMU with two virtio-net-pci net device with
> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>       -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
> 
>   - For L1 guest, apply the patch series and compile the source code,
> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>       -netdev type=vhost-vdpa,x-svq=true,...
>       -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> ctrl_rx=on,ctrl_rx_extra=on...
> 
>   - For L2 source guest, run the following bash command:
> ```bash
> #!/bin/sh
> 
> for idx1 in {0..9}
> do
>   for idx2 in {0..9}
>   do
>     for idx3 in {0..6}
>     do
>       ip link add macvlan$idx1$idx2$idx3 link eth0
> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
>       ip link set macvlan$idx1$idx2$idx3 up
>     done
>   done
> done
> ```
>   - Execute the live migration in L2 source monitor
> 
>   - Result
>     * with this series, QEMU should not trigger any error or warning.
> 
> 
> 
> 2. perf using vp-vdpa device
>   - For L0 guest, boot QEMU with two virtio-net-pci net device with
> `ctrl_vq`, `ctrl_vlan` features on, command line like:
>       -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
> 
>   - For L1 guest, apply the patch series, then apply an addtional
> patch to record the load time in microseconds as following:
> ```diff
> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> index 6b958d6363..501b510fd2 100644
> --- a/hw/net/vhost_net.c
> +++ b/hw/net/vhost_net.c
> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
>      }
>  
>      if (net->nc->info->load) {
> +        int64_t start_us = g_get_monotonic_time();
>          r = net->nc->info->load(net->nc);
> +        error_report("vhost_vdpa_net_load() = %ld us",
> +                     g_get_monotonic_time() - start_us);
>          if (r < 0) {
>              goto fail;
>          }
> ```
> 
>   - For L1 guest, compile the code, and start QEMU with two vdpa device
> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
> command line like:
>       -netdev type=vhost-vdpa,x-svq=true,...
>       -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> ctrl_vlan=on...
> 
>   - For L2 source guest, run the following bash command:
> ```bash
> #!/bin/sh
> 
> for idx in {1..4094}
> do
>   ip link add link eth0 name vlan$idx type vlan id $idx
> done
> ```
> 
>   - wait for some time, then execute the live migration in L2 source monitor
> 
>   - Result
>     * with this series, QEMU should not trigger any warning
> or error except something like "vhost_vdpa_net_load() = 8697 us"
>     * without this series, QEMU should not trigger any warning
> or error except something like "vhost_vdpa_net_load() = 10023 us"
> 
> ChangeLog
> =========
> v3:
>   - refactor vhost_svq_poll() to accept cmds_in_flight
> suggested by Jason and Eugenio
>   - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
> it suggested by Eugenio
>   - poll and check when SVQ is full or control commands shadow buffers is
> full
> 
> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
>   - recover accidentally deleted rows
>   - remove extra newline
>   - refactor `need_poll_len` to `cmds_in_flight`
>   - return -EINVAL when vhost_svq_poll() return 0 or check
> on buffers written by device fails
>   - change the type of `in_cursor`, and refactor the
> code for updating cursor
>   - return directly when vhost_vdpa_net_load_{mac,mq}()
> returns a failure in vhost_vdpa_net_load()
> 
> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
> 
> Hawkins Jiawei (8):
>   vhost: Add argument to vhost_svq_poll()
>   vdpa: Use iovec for vhost_vdpa_net_cvq_add()
>   vhost: Expose vhost_svq_available_slots()
>   vdpa: Avoid using vhost_vdpa_net_load_*() outside
>     vhost_vdpa_net_load()
>   vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
>   vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
>   vdpa: Add cursors to vhost_vdpa_net_loadx()
>   vdpa: Send cvq state load commands in parallel
> 
>  hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
>  hw/virtio/vhost-shadow-virtqueue.h |   3 +-
>  net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
>  3 files changed, 249 insertions(+), 146 deletions(-)
> 
> -- 
> 2.25.1
Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Hawkins Jiawei 9 months, 3 weeks ago
在 2023/7/19 17:11, Michael S. Tsirkin 写道:
> On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
>> This patchset allows QEMU to delay polling and checking the device
>> used buffer until either the SVQ is full or control commands shadow
>> buffers are full, instead of polling and checking immediately after
>> sending each SVQ control command, so that QEMU can send all the SVQ
>> control commands in parallel, which have better performance improvement.
>>
>> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
>> guest to build a test environment for sending multiple CVQ state load
>> commands. This patch series can improve latency from 10023 us to
>> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.
>
> Looks like a tiny improvement.
> At the same time we have O(n^2) behaviour with memory mappings.

Hi Michael,

Thanks for your review.

I wonder why you say "we have O(n^2) behaviour on memory mappings" here?

 From my understanding, QEMU maps two page-size buffers as control
commands shadow buffers at device startup. These buffers then are used
to cache SVQ control commands, where QEMU fills them with multiple SVQ control
commands bytes, flushes them when SVQ descriptors are full or these
control commands shadow buffers reach their capacity.

QEMU repeats this process until all CVQ state load commands have been
sent in loading.

In this loading process, only control commands shadow buffers
translation should be relative to memory mappings, which should be
O(log n) behaviour to my understanding(Please correct me if I am wrong).

> Not saying we must not do this but I think it's worth
> checking where the bottleneck is. My guess would be
> vp_vdpa is not doing things in parallel. Want to try fixing that

As for "vp_vdpa is not doing things in parallel.", do you mean
the vp_vdpa device cannot process QEMU's SVQ control commands
in parallel?

In this situation, I will try to use real vdpa hardware to
test the patch series performance.

> to see how far it can be pushed?

Currently, I am involved in the "Add virtio-net Control Virtqueue state
restore support" project in Google Summer of Code now. Because I am
uncertain about the time it will take to fix that problem in the vp_vdpa
device, I prefer to complete the gsoc project first.

Thanks!


>
>
>> Note that this patch should be based on
>> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
>>
>> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
>>
>> TestStep
>> ========
>> 1. regression testing using vp-vdpa device
>>    - For L0 guest, boot QEMU with two virtio-net-pci net device with
>> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>>        -device virtio-net-pci,disable-legacy=on,disable-modern=off,
>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
>> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
>>
>>    - For L1 guest, apply the patch series and compile the source code,
>> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
>> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>>        -netdev type=vhost-vdpa,x-svq=true,...
>>        -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
>> ctrl_rx=on,ctrl_rx_extra=on...
>>
>>    - For L2 source guest, run the following bash command:
>> ```bash
>> #!/bin/sh
>>
>> for idx1 in {0..9}
>> do
>>    for idx2 in {0..9}
>>    do
>>      for idx3 in {0..6}
>>      do
>>        ip link add macvlan$idx1$idx2$idx3 link eth0
>> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
>>        ip link set macvlan$idx1$idx2$idx3 up
>>      done
>>    done
>> done
>> ```
>>    - Execute the live migration in L2 source monitor
>>
>>    - Result
>>      * with this series, QEMU should not trigger any error or warning.
>>
>>
>>
>> 2. perf using vp-vdpa device
>>    - For L0 guest, boot QEMU with two virtio-net-pci net device with
>> `ctrl_vq`, `ctrl_vlan` features on, command line like:
>>        -device virtio-net-pci,disable-legacy=on,disable-modern=off,
>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
>> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
>>
>>    - For L1 guest, apply the patch series, then apply an addtional
>> patch to record the load time in microseconds as following:
>> ```diff
>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>> index 6b958d6363..501b510fd2 100644
>> --- a/hw/net/vhost_net.c
>> +++ b/hw/net/vhost_net.c
>> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
>>       }
>>
>>       if (net->nc->info->load) {
>> +        int64_t start_us = g_get_monotonic_time();
>>           r = net->nc->info->load(net->nc);
>> +        error_report("vhost_vdpa_net_load() = %ld us",
>> +                     g_get_monotonic_time() - start_us);
>>           if (r < 0) {
>>               goto fail;
>>           }
>> ```
>>
>>    - For L1 guest, compile the code, and start QEMU with two vdpa device
>> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
>> command line like:
>>        -netdev type=vhost-vdpa,x-svq=true,...
>>        -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
>> ctrl_vlan=on...
>>
>>    - For L2 source guest, run the following bash command:
>> ```bash
>> #!/bin/sh
>>
>> for idx in {1..4094}
>> do
>>    ip link add link eth0 name vlan$idx type vlan id $idx
>> done
>> ```
>>
>>    - wait for some time, then execute the live migration in L2 source monitor
>>
>>    - Result
>>      * with this series, QEMU should not trigger any warning
>> or error except something like "vhost_vdpa_net_load() = 8697 us"
>>      * without this series, QEMU should not trigger any warning
>> or error except something like "vhost_vdpa_net_load() = 10023 us"
>>
>> ChangeLog
>> =========
>> v3:
>>    - refactor vhost_svq_poll() to accept cmds_in_flight
>> suggested by Jason and Eugenio
>>    - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
>> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
>> it suggested by Eugenio
>>    - poll and check when SVQ is full or control commands shadow buffers is
>> full
>>
>> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
>>    - recover accidentally deleted rows
>>    - remove extra newline
>>    - refactor `need_poll_len` to `cmds_in_flight`
>>    - return -EINVAL when vhost_svq_poll() return 0 or check
>> on buffers written by device fails
>>    - change the type of `in_cursor`, and refactor the
>> code for updating cursor
>>    - return directly when vhost_vdpa_net_load_{mac,mq}()
>> returns a failure in vhost_vdpa_net_load()
>>
>> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
>>
>> Hawkins Jiawei (8):
>>    vhost: Add argument to vhost_svq_poll()
>>    vdpa: Use iovec for vhost_vdpa_net_cvq_add()
>>    vhost: Expose vhost_svq_available_slots()
>>    vdpa: Avoid using vhost_vdpa_net_load_*() outside
>>      vhost_vdpa_net_load()
>>    vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
>>    vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
>>    vdpa: Add cursors to vhost_vdpa_net_loadx()
>>    vdpa: Send cvq state load commands in parallel
>>
>>   hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
>>   hw/virtio/vhost-shadow-virtqueue.h |   3 +-
>>   net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
>>   3 files changed, 249 insertions(+), 146 deletions(-)
>>
>> --
>> 2.25.1
>
Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Michael S. Tsirkin 9 months, 3 weeks ago
On Wed, Jul 19, 2023 at 08:35:50PM +0800, Hawkins Jiawei wrote:
> 在 2023/7/19 17:11, Michael S. Tsirkin 写道:
> > On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
> >> This patchset allows QEMU to delay polling and checking the device
> >> used buffer until either the SVQ is full or control commands shadow
> >> buffers are full, instead of polling and checking immediately after
> >> sending each SVQ control command, so that QEMU can send all the SVQ
> >> control commands in parallel, which have better performance improvement.
> >>
> >> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
> >> guest to build a test environment for sending multiple CVQ state load
> >> commands. This patch series can improve latency from 10023 us to
> >> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.
> >
> > Looks like a tiny improvement.
> > At the same time we have O(n^2) behaviour with memory mappings.
> 
> Hi Michael,
> 
> Thanks for your review.
> 
> I wonder why you say "we have O(n^2) behaviour on memory mappings" here?

it's not specific to virtio - it's related to device init.
generally each device has some memory. during boot bios
enables each individually O(n) where n is # of devices.
memory maps has to be updated and in qemu this update
is at least superlinear with n (more like O(n log n) I think).
This gets up > O(n^2) with n number of devices.

>  From my understanding, QEMU maps two page-size buffers as control
> commands shadow buffers at device startup. These buffers then are used
> to cache SVQ control commands, where QEMU fills them with multiple SVQ control
> commands bytes, flushes them when SVQ descriptors are full or these
> control commands shadow buffers reach their capacity.
> 
> QEMU repeats this process until all CVQ state load commands have been
> sent in loading.
> 
> In this loading process, only control commands shadow buffers
> translation should be relative to memory mappings, which should be
> O(log n) behaviour to my understanding(Please correct me if I am wrong).
> 
> > Not saying we must not do this but I think it's worth
> > checking where the bottleneck is. My guess would be
> > vp_vdpa is not doing things in parallel. Want to try fixing that
> 
> As for "vp_vdpa is not doing things in parallel.", do you mean
> the vp_vdpa device cannot process QEMU's SVQ control commands
> in parallel?
> 
> In this situation, I will try to use real vdpa hardware to
> test the patch series performance.

yea, pls do that.

> > to see how far it can be pushed?
> 
> Currently, I am involved in the "Add virtio-net Control Virtqueue state
> restore support" project in Google Summer of Code now. Because I am
> uncertain about the time it will take to fix that problem in the vp_vdpa
> device, I prefer to complete the gsoc project first.
> 
> Thanks!
> 
> 
> >
> >
> >> Note that this patch should be based on
> >> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
> >>
> >> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
> >>
> >> TestStep
> >> ========
> >> 1. regression testing using vp-vdpa device
> >>    - For L0 guest, boot QEMU with two virtio-net-pci net device with
> >> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
> >>        -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> >> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> >> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
> >>
> >>    - For L1 guest, apply the patch series and compile the source code,
> >> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
> >> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
> >>        -netdev type=vhost-vdpa,x-svq=true,...
> >>        -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> >> ctrl_rx=on,ctrl_rx_extra=on...
> >>
> >>    - For L2 source guest, run the following bash command:
> >> ```bash
> >> #!/bin/sh
> >>
> >> for idx1 in {0..9}
> >> do
> >>    for idx2 in {0..9}
> >>    do
> >>      for idx3 in {0..6}
> >>      do
> >>        ip link add macvlan$idx1$idx2$idx3 link eth0
> >> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
> >>        ip link set macvlan$idx1$idx2$idx3 up
> >>      done
> >>    done
> >> done
> >> ```
> >>    - Execute the live migration in L2 source monitor
> >>
> >>    - Result
> >>      * with this series, QEMU should not trigger any error or warning.
> >>
> >>
> >>
> >> 2. perf using vp-vdpa device
> >>    - For L0 guest, boot QEMU with two virtio-net-pci net device with
> >> `ctrl_vq`, `ctrl_vlan` features on, command line like:
> >>        -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> >> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> >> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
> >>
> >>    - For L1 guest, apply the patch series, then apply an addtional
> >> patch to record the load time in microseconds as following:
> >> ```diff
> >> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> >> index 6b958d6363..501b510fd2 100644
> >> --- a/hw/net/vhost_net.c
> >> +++ b/hw/net/vhost_net.c
> >> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
> >>       }
> >>
> >>       if (net->nc->info->load) {
> >> +        int64_t start_us = g_get_monotonic_time();
> >>           r = net->nc->info->load(net->nc);
> >> +        error_report("vhost_vdpa_net_load() = %ld us",
> >> +                     g_get_monotonic_time() - start_us);
> >>           if (r < 0) {
> >>               goto fail;
> >>           }
> >> ```
> >>
> >>    - For L1 guest, compile the code, and start QEMU with two vdpa device
> >> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
> >> command line like:
> >>        -netdev type=vhost-vdpa,x-svq=true,...
> >>        -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> >> ctrl_vlan=on...
> >>
> >>    - For L2 source guest, run the following bash command:
> >> ```bash
> >> #!/bin/sh
> >>
> >> for idx in {1..4094}
> >> do
> >>    ip link add link eth0 name vlan$idx type vlan id $idx
> >> done
> >> ```
> >>
> >>    - wait for some time, then execute the live migration in L2 source monitor
> >>
> >>    - Result
> >>      * with this series, QEMU should not trigger any warning
> >> or error except something like "vhost_vdpa_net_load() = 8697 us"
> >>      * without this series, QEMU should not trigger any warning
> >> or error except something like "vhost_vdpa_net_load() = 10023 us"
> >>
> >> ChangeLog
> >> =========
> >> v3:
> >>    - refactor vhost_svq_poll() to accept cmds_in_flight
> >> suggested by Jason and Eugenio
> >>    - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
> >> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
> >> it suggested by Eugenio
> >>    - poll and check when SVQ is full or control commands shadow buffers is
> >> full
> >>
> >> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
> >>    - recover accidentally deleted rows
> >>    - remove extra newline
> >>    - refactor `need_poll_len` to `cmds_in_flight`
> >>    - return -EINVAL when vhost_svq_poll() return 0 or check
> >> on buffers written by device fails
> >>    - change the type of `in_cursor`, and refactor the
> >> code for updating cursor
> >>    - return directly when vhost_vdpa_net_load_{mac,mq}()
> >> returns a failure in vhost_vdpa_net_load()
> >>
> >> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
> >>
> >> Hawkins Jiawei (8):
> >>    vhost: Add argument to vhost_svq_poll()
> >>    vdpa: Use iovec for vhost_vdpa_net_cvq_add()
> >>    vhost: Expose vhost_svq_available_slots()
> >>    vdpa: Avoid using vhost_vdpa_net_load_*() outside
> >>      vhost_vdpa_net_load()
> >>    vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
> >>    vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
> >>    vdpa: Add cursors to vhost_vdpa_net_loadx()
> >>    vdpa: Send cvq state load commands in parallel
> >>
> >>   hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
> >>   hw/virtio/vhost-shadow-virtqueue.h |   3 +-
> >>   net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
> >>   3 files changed, 249 insertions(+), 146 deletions(-)
> >>
> >> --
> >> 2.25.1
> >


Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Hawkins Jiawei 9 months, 3 weeks ago
在 2023/7/19 20:46, Michael S. Tsirkin 写道:
> On Wed, Jul 19, 2023 at 08:35:50PM +0800, Hawkins Jiawei wrote:
>> 在 2023/7/19 17:11, Michael S. Tsirkin 写道:
>>> On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
>>>> This patchset allows QEMU to delay polling and checking the device
>>>> used buffer until either the SVQ is full or control commands shadow
>>>> buffers are full, instead of polling and checking immediately after
>>>> sending each SVQ control command, so that QEMU can send all the SVQ
>>>> control commands in parallel, which have better performance improvement.
>>>>
>>>> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
>>>> guest to build a test environment for sending multiple CVQ state load
>>>> commands. This patch series can improve latency from 10023 us to
>>>> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.
>>>
>>> Looks like a tiny improvement.
>>> At the same time we have O(n^2) behaviour with memory mappings.
>>
>> Hi Michael,
>>
>> Thanks for your review.
>>
>> I wonder why you say "we have O(n^2) behaviour on memory mappings" here?
>
> it's not specific to virtio - it's related to device init.
> generally each device has some memory. during boot bios
> enables each individually O(n) where n is # of devices.
> memory maps has to be updated and in qemu this update
> is at least superlinear with n (more like O(n log n) I think).
> This gets up > O(n^2) with n number of devices.

Thanks for your explanation.


>
>>   From my understanding, QEMU maps two page-size buffers as control
>> commands shadow buffers at device startup. These buffers then are used
>> to cache SVQ control commands, where QEMU fills them with multiple SVQ control
>> commands bytes, flushes them when SVQ descriptors are full or these
>> control commands shadow buffers reach their capacity.
>>
>> QEMU repeats this process until all CVQ state load commands have been
>> sent in loading.
>>
>> In this loading process, only control commands shadow buffers
>> translation should be relative to memory mappings, which should be
>> O(log n) behaviour to my understanding(Please correct me if I am wrong).
>>
>>> Not saying we must not do this but I think it's worth
>>> checking where the bottleneck is. My guess would be
>>> vp_vdpa is not doing things in parallel. Want to try fixing that
>>
>> As for "vp_vdpa is not doing things in parallel.", do you mean
>> the vp_vdpa device cannot process QEMU's SVQ control commands
>> in parallel?
>>
>> In this situation, I will try to use real vdpa hardware to
>> test the patch series performance.
>
> yea, pls do that.
>
>>> to see how far it can be pushed?
>>
>> Currently, I am involved in the "Add virtio-net Control Virtqueue state
>> restore support" project in Google Summer of Code now. Because I am
>> uncertain about the time it will take to fix that problem in the vp_vdpa
>> device, I prefer to complete the gsoc project first.
>>
>> Thanks!
>>
>>
>>>
>>>
>>>> Note that this patch should be based on
>>>> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
>>>>
>>>> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
>>>>
>>>> TestStep
>>>> ========
>>>> 1. regression testing using vp-vdpa device
>>>>     - For L0 guest, boot QEMU with two virtio-net-pci net device with
>>>> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>>>>         -device virtio-net-pci,disable-legacy=on,disable-modern=off,
>>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
>>>> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
>>>>
>>>>     - For L1 guest, apply the patch series and compile the source code,
>>>> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
>>>> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>>>>         -netdev type=vhost-vdpa,x-svq=true,...
>>>>         -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
>>>> ctrl_rx=on,ctrl_rx_extra=on...
>>>>
>>>>     - For L2 source guest, run the following bash command:
>>>> ```bash
>>>> #!/bin/sh
>>>>
>>>> for idx1 in {0..9}
>>>> do
>>>>     for idx2 in {0..9}
>>>>     do
>>>>       for idx3 in {0..6}
>>>>       do
>>>>         ip link add macvlan$idx1$idx2$idx3 link eth0
>>>> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
>>>>         ip link set macvlan$idx1$idx2$idx3 up
>>>>       done
>>>>     done
>>>> done
>>>> ```
>>>>     - Execute the live migration in L2 source monitor
>>>>
>>>>     - Result
>>>>       * with this series, QEMU should not trigger any error or warning.
>>>>
>>>>
>>>>
>>>> 2. perf using vp-vdpa device
>>>>     - For L0 guest, boot QEMU with two virtio-net-pci net device with
>>>> `ctrl_vq`, `ctrl_vlan` features on, command line like:
>>>>         -device virtio-net-pci,disable-legacy=on,disable-modern=off,
>>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
>>>> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
>>>>
>>>>     - For L1 guest, apply the patch series, then apply an addtional
>>>> patch to record the load time in microseconds as following:
>>>> ```diff
>>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>>>> index 6b958d6363..501b510fd2 100644
>>>> --- a/hw/net/vhost_net.c
>>>> +++ b/hw/net/vhost_net.c
>>>> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
>>>>        }
>>>>
>>>>        if (net->nc->info->load) {
>>>> +        int64_t start_us = g_get_monotonic_time();
>>>>            r = net->nc->info->load(net->nc);
>>>> +        error_report("vhost_vdpa_net_load() = %ld us",
>>>> +                     g_get_monotonic_time() - start_us);
>>>>            if (r < 0) {
>>>>                goto fail;
>>>>            }
>>>> ```
>>>>
>>>>     - For L1 guest, compile the code, and start QEMU with two vdpa device
>>>> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
>>>> command line like:
>>>>         -netdev type=vhost-vdpa,x-svq=true,...
>>>>         -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
>>>> ctrl_vlan=on...
>>>>
>>>>     - For L2 source guest, run the following bash command:
>>>> ```bash
>>>> #!/bin/sh
>>>>
>>>> for idx in {1..4094}
>>>> do
>>>>     ip link add link eth0 name vlan$idx type vlan id $idx
>>>> done
>>>> ```
>>>>
>>>>     - wait for some time, then execute the live migration in L2 source monitor
>>>>
>>>>     - Result
>>>>       * with this series, QEMU should not trigger any warning
>>>> or error except something like "vhost_vdpa_net_load() = 8697 us"
>>>>       * without this series, QEMU should not trigger any warning
>>>> or error except something like "vhost_vdpa_net_load() = 10023 us"
>>>>
>>>> ChangeLog
>>>> =========
>>>> v3:
>>>>     - refactor vhost_svq_poll() to accept cmds_in_flight
>>>> suggested by Jason and Eugenio
>>>>     - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
>>>> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
>>>> it suggested by Eugenio
>>>>     - poll and check when SVQ is full or control commands shadow buffers is
>>>> full
>>>>
>>>> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
>>>>     - recover accidentally deleted rows
>>>>     - remove extra newline
>>>>     - refactor `need_poll_len` to `cmds_in_flight`
>>>>     - return -EINVAL when vhost_svq_poll() return 0 or check
>>>> on buffers written by device fails
>>>>     - change the type of `in_cursor`, and refactor the
>>>> code for updating cursor
>>>>     - return directly when vhost_vdpa_net_load_{mac,mq}()
>>>> returns a failure in vhost_vdpa_net_load()
>>>>
>>>> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
>>>>
>>>> Hawkins Jiawei (8):
>>>>     vhost: Add argument to vhost_svq_poll()
>>>>     vdpa: Use iovec for vhost_vdpa_net_cvq_add()
>>>>     vhost: Expose vhost_svq_available_slots()
>>>>     vdpa: Avoid using vhost_vdpa_net_load_*() outside
>>>>       vhost_vdpa_net_load()
>>>>     vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
>>>>     vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
>>>>     vdpa: Add cursors to vhost_vdpa_net_loadx()
>>>>     vdpa: Send cvq state load commands in parallel
>>>>
>>>>    hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
>>>>    hw/virtio/vhost-shadow-virtqueue.h |   3 +-
>>>>    net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
>>>>    3 files changed, 249 insertions(+), 146 deletions(-)
>>>>
>>>> --
>>>> 2.25.1
>>>
>
Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Lei Yang 9 months, 3 weeks ago
Hello Hawkins and Michael

Looks like there are big changes about vp_vdpa, therefore, if needed,
QE can test this series in QE's environment before the patch is
merged, and provide the result.

BR
Lei


On Wed, Jul 19, 2023 at 8:37 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
>
> 在 2023/7/19 17:11, Michael S. Tsirkin 写道:
> > On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
> >> This patchset allows QEMU to delay polling and checking the device
> >> used buffer until either the SVQ is full or control commands shadow
> >> buffers are full, instead of polling and checking immediately after
> >> sending each SVQ control command, so that QEMU can send all the SVQ
> >> control commands in parallel, which have better performance improvement.
> >>
> >> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
> >> guest to build a test environment for sending multiple CVQ state load
> >> commands. This patch series can improve latency from 10023 us to
> >> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.
> >
> > Looks like a tiny improvement.
> > At the same time we have O(n^2) behaviour with memory mappings.
>
> Hi Michael,
>
> Thanks for your review.
>
> I wonder why you say "we have O(n^2) behaviour on memory mappings" here?
>
>  From my understanding, QEMU maps two page-size buffers as control
> commands shadow buffers at device startup. These buffers then are used
> to cache SVQ control commands, where QEMU fills them with multiple SVQ control
> commands bytes, flushes them when SVQ descriptors are full or these
> control commands shadow buffers reach their capacity.
>
> QEMU repeats this process until all CVQ state load commands have been
> sent in loading.
>
> In this loading process, only control commands shadow buffers
> translation should be relative to memory mappings, which should be
> O(log n) behaviour to my understanding(Please correct me if I am wrong).
>
> > Not saying we must not do this but I think it's worth
> > checking where the bottleneck is. My guess would be
> > vp_vdpa is not doing things in parallel. Want to try fixing that
>
> As for "vp_vdpa is not doing things in parallel.", do you mean
> the vp_vdpa device cannot process QEMU's SVQ control commands
> in parallel?
>
> In this situation, I will try to use real vdpa hardware to
> test the patch series performance.
>
> > to see how far it can be pushed?
>
> Currently, I am involved in the "Add virtio-net Control Virtqueue state
> restore support" project in Google Summer of Code now. Because I am
> uncertain about the time it will take to fix that problem in the vp_vdpa
> device, I prefer to complete the gsoc project first.
>
> Thanks!
>
>
> >
> >
> >> Note that this patch should be based on
> >> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
> >>
> >> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
> >>
> >> TestStep
> >> ========
> >> 1. regression testing using vp-vdpa device
> >>    - For L0 guest, boot QEMU with two virtio-net-pci net device with
> >> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
> >>        -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> >> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> >> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
> >>
> >>    - For L1 guest, apply the patch series and compile the source code,
> >> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
> >> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
> >>        -netdev type=vhost-vdpa,x-svq=true,...
> >>        -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> >> ctrl_rx=on,ctrl_rx_extra=on...
> >>
> >>    - For L2 source guest, run the following bash command:
> >> ```bash
> >> #!/bin/sh
> >>
> >> for idx1 in {0..9}
> >> do
> >>    for idx2 in {0..9}
> >>    do
> >>      for idx3 in {0..6}
> >>      do
> >>        ip link add macvlan$idx1$idx2$idx3 link eth0
> >> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
> >>        ip link set macvlan$idx1$idx2$idx3 up
> >>      done
> >>    done
> >> done
> >> ```
> >>    - Execute the live migration in L2 source monitor
> >>
> >>    - Result
> >>      * with this series, QEMU should not trigger any error or warning.
> >>
> >>
> >>
> >> 2. perf using vp-vdpa device
> >>    - For L0 guest, boot QEMU with two virtio-net-pci net device with
> >> `ctrl_vq`, `ctrl_vlan` features on, command line like:
> >>        -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> >> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> >> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
> >>
> >>    - For L1 guest, apply the patch series, then apply an addtional
> >> patch to record the load time in microseconds as following:
> >> ```diff
> >> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> >> index 6b958d6363..501b510fd2 100644
> >> --- a/hw/net/vhost_net.c
> >> +++ b/hw/net/vhost_net.c
> >> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
> >>       }
> >>
> >>       if (net->nc->info->load) {
> >> +        int64_t start_us = g_get_monotonic_time();
> >>           r = net->nc->info->load(net->nc);
> >> +        error_report("vhost_vdpa_net_load() = %ld us",
> >> +                     g_get_monotonic_time() - start_us);
> >>           if (r < 0) {
> >>               goto fail;
> >>           }
> >> ```
> >>
> >>    - For L1 guest, compile the code, and start QEMU with two vdpa device
> >> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
> >> command line like:
> >>        -netdev type=vhost-vdpa,x-svq=true,...
> >>        -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> >> ctrl_vlan=on...
> >>
> >>    - For L2 source guest, run the following bash command:
> >> ```bash
> >> #!/bin/sh
> >>
> >> for idx in {1..4094}
> >> do
> >>    ip link add link eth0 name vlan$idx type vlan id $idx
> >> done
> >> ```
> >>
> >>    - wait for some time, then execute the live migration in L2 source monitor
> >>
> >>    - Result
> >>      * with this series, QEMU should not trigger any warning
> >> or error except something like "vhost_vdpa_net_load() = 8697 us"
> >>      * without this series, QEMU should not trigger any warning
> >> or error except something like "vhost_vdpa_net_load() = 10023 us"
> >>
> >> ChangeLog
> >> =========
> >> v3:
> >>    - refactor vhost_svq_poll() to accept cmds_in_flight
> >> suggested by Jason and Eugenio
> >>    - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
> >> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
> >> it suggested by Eugenio
> >>    - poll and check when SVQ is full or control commands shadow buffers is
> >> full
> >>
> >> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
> >>    - recover accidentally deleted rows
> >>    - remove extra newline
> >>    - refactor `need_poll_len` to `cmds_in_flight`
> >>    - return -EINVAL when vhost_svq_poll() return 0 or check
> >> on buffers written by device fails
> >>    - change the type of `in_cursor`, and refactor the
> >> code for updating cursor
> >>    - return directly when vhost_vdpa_net_load_{mac,mq}()
> >> returns a failure in vhost_vdpa_net_load()
> >>
> >> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
> >>
> >> Hawkins Jiawei (8):
> >>    vhost: Add argument to vhost_svq_poll()
> >>    vdpa: Use iovec for vhost_vdpa_net_cvq_add()
> >>    vhost: Expose vhost_svq_available_slots()
> >>    vdpa: Avoid using vhost_vdpa_net_load_*() outside
> >>      vhost_vdpa_net_load()
> >>    vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
> >>    vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
> >>    vdpa: Add cursors to vhost_vdpa_net_loadx()
> >>    vdpa: Send cvq state load commands in parallel
> >>
> >>   hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
> >>   hw/virtio/vhost-shadow-virtqueue.h |   3 +-
> >>   net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
> >>   3 files changed, 249 insertions(+), 146 deletions(-)
> >>
> >> --
> >> 2.25.1
> >
>
Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Hawkins Jiawei 9 months, 3 weeks ago
在 2023/7/19 20:44, Lei Yang 写道:
> Hello Hawkins and Michael
>
> Looks like there are big changes about vp_vdpa, therefore, if needed,
> QE can test this series in QE's environment before the patch is

Hi Lei,

This patch series does not modify the code of vp_vdpa. Instead, it only
modifies how QEMU sends SVQ control commands to the vdpa device.

Considering that the behavior of the vp_vdpa device differs from that
of real vdpa hardware, would it be possible for you to test this patch
series on a real vdpa device?

Thanks!


> merged, and provide the result.
>
> BR
> Lei
>
>
> On Wed, Jul 19, 2023 at 8:37 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
>>
>> 在 2023/7/19 17:11, Michael S. Tsirkin 写道:
>>> On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
>>>> This patchset allows QEMU to delay polling and checking the device
>>>> used buffer until either the SVQ is full or control commands shadow
>>>> buffers are full, instead of polling and checking immediately after
>>>> sending each SVQ control command, so that QEMU can send all the SVQ
>>>> control commands in parallel, which have better performance improvement.
>>>>
>>>> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
>>>> guest to build a test environment for sending multiple CVQ state load
>>>> commands. This patch series can improve latency from 10023 us to
>>>> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.
>>>
>>> Looks like a tiny improvement.
>>> At the same time we have O(n^2) behaviour with memory mappings.
>>
>> Hi Michael,
>>
>> Thanks for your review.
>>
>> I wonder why you say "we have O(n^2) behaviour on memory mappings" here?
>>
>>   From my understanding, QEMU maps two page-size buffers as control
>> commands shadow buffers at device startup. These buffers then are used
>> to cache SVQ control commands, where QEMU fills them with multiple SVQ control
>> commands bytes, flushes them when SVQ descriptors are full or these
>> control commands shadow buffers reach their capacity.
>>
>> QEMU repeats this process until all CVQ state load commands have been
>> sent in loading.
>>
>> In this loading process, only control commands shadow buffers
>> translation should be relative to memory mappings, which should be
>> O(log n) behaviour to my understanding(Please correct me if I am wrong).
>>
>>> Not saying we must not do this but I think it's worth
>>> checking where the bottleneck is. My guess would be
>>> vp_vdpa is not doing things in parallel. Want to try fixing that
>>
>> As for "vp_vdpa is not doing things in parallel.", do you mean
>> the vp_vdpa device cannot process QEMU's SVQ control commands
>> in parallel?
>>
>> In this situation, I will try to use real vdpa hardware to
>> test the patch series performance.
>>
>>> to see how far it can be pushed?
>>
>> Currently, I am involved in the "Add virtio-net Control Virtqueue state
>> restore support" project in Google Summer of Code now. Because I am
>> uncertain about the time it will take to fix that problem in the vp_vdpa
>> device, I prefer to complete the gsoc project first.
>>
>> Thanks!
>>
>>
>>>
>>>
>>>> Note that this patch should be based on
>>>> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
>>>>
>>>> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
>>>>
>>>> TestStep
>>>> ========
>>>> 1. regression testing using vp-vdpa device
>>>>     - For L0 guest, boot QEMU with two virtio-net-pci net device with
>>>> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>>>>         -device virtio-net-pci,disable-legacy=on,disable-modern=off,
>>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
>>>> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
>>>>
>>>>     - For L1 guest, apply the patch series and compile the source code,
>>>> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
>>>> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>>>>         -netdev type=vhost-vdpa,x-svq=true,...
>>>>         -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
>>>> ctrl_rx=on,ctrl_rx_extra=on...
>>>>
>>>>     - For L2 source guest, run the following bash command:
>>>> ```bash
>>>> #!/bin/sh
>>>>
>>>> for idx1 in {0..9}
>>>> do
>>>>     for idx2 in {0..9}
>>>>     do
>>>>       for idx3 in {0..6}
>>>>       do
>>>>         ip link add macvlan$idx1$idx2$idx3 link eth0
>>>> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
>>>>         ip link set macvlan$idx1$idx2$idx3 up
>>>>       done
>>>>     done
>>>> done
>>>> ```
>>>>     - Execute the live migration in L2 source monitor
>>>>
>>>>     - Result
>>>>       * with this series, QEMU should not trigger any error or warning.
>>>>
>>>>
>>>>
>>>> 2. perf using vp-vdpa device
>>>>     - For L0 guest, boot QEMU with two virtio-net-pci net device with
>>>> `ctrl_vq`, `ctrl_vlan` features on, command line like:
>>>>         -device virtio-net-pci,disable-legacy=on,disable-modern=off,
>>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
>>>> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
>>>>
>>>>     - For L1 guest, apply the patch series, then apply an addtional
>>>> patch to record the load time in microseconds as following:
>>>> ```diff
>>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>>>> index 6b958d6363..501b510fd2 100644
>>>> --- a/hw/net/vhost_net.c
>>>> +++ b/hw/net/vhost_net.c
>>>> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
>>>>        }
>>>>
>>>>        if (net->nc->info->load) {
>>>> +        int64_t start_us = g_get_monotonic_time();
>>>>            r = net->nc->info->load(net->nc);
>>>> +        error_report("vhost_vdpa_net_load() = %ld us",
>>>> +                     g_get_monotonic_time() - start_us);
>>>>            if (r < 0) {
>>>>                goto fail;
>>>>            }
>>>> ```
>>>>
>>>>     - For L1 guest, compile the code, and start QEMU with two vdpa device
>>>> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
>>>> command line like:
>>>>         -netdev type=vhost-vdpa,x-svq=true,...
>>>>         -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
>>>> ctrl_vlan=on...
>>>>
>>>>     - For L2 source guest, run the following bash command:
>>>> ```bash
>>>> #!/bin/sh
>>>>
>>>> for idx in {1..4094}
>>>> do
>>>>     ip link add link eth0 name vlan$idx type vlan id $idx
>>>> done
>>>> ```
>>>>
>>>>     - wait for some time, then execute the live migration in L2 source monitor
>>>>
>>>>     - Result
>>>>       * with this series, QEMU should not trigger any warning
>>>> or error except something like "vhost_vdpa_net_load() = 8697 us"
>>>>       * without this series, QEMU should not trigger any warning
>>>> or error except something like "vhost_vdpa_net_load() = 10023 us"
>>>>
>>>> ChangeLog
>>>> =========
>>>> v3:
>>>>     - refactor vhost_svq_poll() to accept cmds_in_flight
>>>> suggested by Jason and Eugenio
>>>>     - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
>>>> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
>>>> it suggested by Eugenio
>>>>     - poll and check when SVQ is full or control commands shadow buffers is
>>>> full
>>>>
>>>> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
>>>>     - recover accidentally deleted rows
>>>>     - remove extra newline
>>>>     - refactor `need_poll_len` to `cmds_in_flight`
>>>>     - return -EINVAL when vhost_svq_poll() return 0 or check
>>>> on buffers written by device fails
>>>>     - change the type of `in_cursor`, and refactor the
>>>> code for updating cursor
>>>>     - return directly when vhost_vdpa_net_load_{mac,mq}()
>>>> returns a failure in vhost_vdpa_net_load()
>>>>
>>>> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
>>>>
>>>> Hawkins Jiawei (8):
>>>>     vhost: Add argument to vhost_svq_poll()
>>>>     vdpa: Use iovec for vhost_vdpa_net_cvq_add()
>>>>     vhost: Expose vhost_svq_available_slots()
>>>>     vdpa: Avoid using vhost_vdpa_net_load_*() outside
>>>>       vhost_vdpa_net_load()
>>>>     vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
>>>>     vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
>>>>     vdpa: Add cursors to vhost_vdpa_net_loadx()
>>>>     vdpa: Send cvq state load commands in parallel
>>>>
>>>>    hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
>>>>    hw/virtio/vhost-shadow-virtqueue.h |   3 +-
>>>>    net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
>>>>    3 files changed, 249 insertions(+), 146 deletions(-)
>>>>
>>>> --
>>>> 2.25.1
>>>
>>
>
Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Lei Yang 9 months, 2 weeks ago
On Wed, Jul 19, 2023 at 11:25 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
>
> 在 2023/7/19 20:44, Lei Yang 写道:
> > Hello Hawkins and Michael
> >
> > Looks like there are big changes about vp_vdpa, therefore, if needed,
> > QE can test this series in QE's environment before the patch is
>
> Hi Lei,
>
> This patch series does not modify the code of vp_vdpa. Instead, it only
> modifies how QEMU sends SVQ control commands to the vdpa device.
>
Hi Hawkins

> Considering that the behavior of the vp_vdpa device differs from that
> of real vdpa hardware, would it be possible for you to test this patch
> series on a real vdpa device?

Yes, there is a hardware device to test it , I will update the test
results ASAP.

BR
Lei
>
> Thanks!
>
>
> > merged, and provide the result.
> >
> > BR
> > Lei
> >
> >
> > On Wed, Jul 19, 2023 at 8:37 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
> >>
> >> 在 2023/7/19 17:11, Michael S. Tsirkin 写道:
> >>> On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
> >>>> This patchset allows QEMU to delay polling and checking the device
> >>>> used buffer until either the SVQ is full or control commands shadow
> >>>> buffers are full, instead of polling and checking immediately after
> >>>> sending each SVQ control command, so that QEMU can send all the SVQ
> >>>> control commands in parallel, which have better performance improvement.
> >>>>
> >>>> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
> >>>> guest to build a test environment for sending multiple CVQ state load
> >>>> commands. This patch series can improve latency from 10023 us to
> >>>> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.
> >>>
> >>> Looks like a tiny improvement.
> >>> At the same time we have O(n^2) behaviour with memory mappings.
> >>
> >> Hi Michael,
> >>
> >> Thanks for your review.
> >>
> >> I wonder why you say "we have O(n^2) behaviour on memory mappings" here?
> >>
> >>   From my understanding, QEMU maps two page-size buffers as control
> >> commands shadow buffers at device startup. These buffers then are used
> >> to cache SVQ control commands, where QEMU fills them with multiple SVQ control
> >> commands bytes, flushes them when SVQ descriptors are full or these
> >> control commands shadow buffers reach their capacity.
> >>
> >> QEMU repeats this process until all CVQ state load commands have been
> >> sent in loading.
> >>
> >> In this loading process, only control commands shadow buffers
> >> translation should be relative to memory mappings, which should be
> >> O(log n) behaviour to my understanding(Please correct me if I am wrong).
> >>
> >>> Not saying we must not do this but I think it's worth
> >>> checking where the bottleneck is. My guess would be
> >>> vp_vdpa is not doing things in parallel. Want to try fixing that
> >>
> >> As for "vp_vdpa is not doing things in parallel.", do you mean
> >> the vp_vdpa device cannot process QEMU's SVQ control commands
> >> in parallel?
> >>
> >> In this situation, I will try to use real vdpa hardware to
> >> test the patch series performance.
> >>
> >>> to see how far it can be pushed?
> >>
> >> Currently, I am involved in the "Add virtio-net Control Virtqueue state
> >> restore support" project in Google Summer of Code now. Because I am
> >> uncertain about the time it will take to fix that problem in the vp_vdpa
> >> device, I prefer to complete the gsoc project first.
> >>
> >> Thanks!
> >>
> >>
> >>>
> >>>
> >>>> Note that this patch should be based on
> >>>> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
> >>>>
> >>>> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
> >>>>
> >>>> TestStep
> >>>> ========
> >>>> 1. regression testing using vp-vdpa device
> >>>>     - For L0 guest, boot QEMU with two virtio-net-pci net device with
> >>>> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
> >>>>         -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> >>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> >>>> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
> >>>>
> >>>>     - For L1 guest, apply the patch series and compile the source code,
> >>>> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
> >>>> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
> >>>>         -netdev type=vhost-vdpa,x-svq=true,...
> >>>>         -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> >>>> ctrl_rx=on,ctrl_rx_extra=on...
> >>>>
> >>>>     - For L2 source guest, run the following bash command:
> >>>> ```bash
> >>>> #!/bin/sh
> >>>>
> >>>> for idx1 in {0..9}
> >>>> do
> >>>>     for idx2 in {0..9}
> >>>>     do
> >>>>       for idx3 in {0..6}
> >>>>       do
> >>>>         ip link add macvlan$idx1$idx2$idx3 link eth0
> >>>> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
> >>>>         ip link set macvlan$idx1$idx2$idx3 up
> >>>>       done
> >>>>     done
> >>>> done
> >>>> ```
> >>>>     - Execute the live migration in L2 source monitor
> >>>>
> >>>>     - Result
> >>>>       * with this series, QEMU should not trigger any error or warning.
> >>>>
> >>>>
> >>>>
> >>>> 2. perf using vp-vdpa device
> >>>>     - For L0 guest, boot QEMU with two virtio-net-pci net device with
> >>>> `ctrl_vq`, `ctrl_vlan` features on, command line like:
> >>>>         -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> >>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> >>>> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
> >>>>
> >>>>     - For L1 guest, apply the patch series, then apply an addtional
> >>>> patch to record the load time in microseconds as following:
> >>>> ```diff
> >>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> >>>> index 6b958d6363..501b510fd2 100644
> >>>> --- a/hw/net/vhost_net.c
> >>>> +++ b/hw/net/vhost_net.c
> >>>> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
> >>>>        }
> >>>>
> >>>>        if (net->nc->info->load) {
> >>>> +        int64_t start_us = g_get_monotonic_time();
> >>>>            r = net->nc->info->load(net->nc);
> >>>> +        error_report("vhost_vdpa_net_load() = %ld us",
> >>>> +                     g_get_monotonic_time() - start_us);
> >>>>            if (r < 0) {
> >>>>                goto fail;
> >>>>            }
> >>>> ```
> >>>>
> >>>>     - For L1 guest, compile the code, and start QEMU with two vdpa device
> >>>> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
> >>>> command line like:
> >>>>         -netdev type=vhost-vdpa,x-svq=true,...
> >>>>         -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> >>>> ctrl_vlan=on...
> >>>>
> >>>>     - For L2 source guest, run the following bash command:
> >>>> ```bash
> >>>> #!/bin/sh
> >>>>
> >>>> for idx in {1..4094}
> >>>> do
> >>>>     ip link add link eth0 name vlan$idx type vlan id $idx
> >>>> done
> >>>> ```
> >>>>
> >>>>     - wait for some time, then execute the live migration in L2 source monitor
> >>>>
> >>>>     - Result
> >>>>       * with this series, QEMU should not trigger any warning
> >>>> or error except something like "vhost_vdpa_net_load() = 8697 us"
> >>>>       * without this series, QEMU should not trigger any warning
> >>>> or error except something like "vhost_vdpa_net_load() = 10023 us"
> >>>>
> >>>> ChangeLog
> >>>> =========
> >>>> v3:
> >>>>     - refactor vhost_svq_poll() to accept cmds_in_flight
> >>>> suggested by Jason and Eugenio
> >>>>     - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
> >>>> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
> >>>> it suggested by Eugenio
> >>>>     - poll and check when SVQ is full or control commands shadow buffers is
> >>>> full
> >>>>
> >>>> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
> >>>>     - recover accidentally deleted rows
> >>>>     - remove extra newline
> >>>>     - refactor `need_poll_len` to `cmds_in_flight`
> >>>>     - return -EINVAL when vhost_svq_poll() return 0 or check
> >>>> on buffers written by device fails
> >>>>     - change the type of `in_cursor`, and refactor the
> >>>> code for updating cursor
> >>>>     - return directly when vhost_vdpa_net_load_{mac,mq}()
> >>>> returns a failure in vhost_vdpa_net_load()
> >>>>
> >>>> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
> >>>>
> >>>> Hawkins Jiawei (8):
> >>>>     vhost: Add argument to vhost_svq_poll()
> >>>>     vdpa: Use iovec for vhost_vdpa_net_cvq_add()
> >>>>     vhost: Expose vhost_svq_available_slots()
> >>>>     vdpa: Avoid using vhost_vdpa_net_load_*() outside
> >>>>       vhost_vdpa_net_load()
> >>>>     vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
> >>>>     vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
> >>>>     vdpa: Add cursors to vhost_vdpa_net_loadx()
> >>>>     vdpa: Send cvq state load commands in parallel
> >>>>
> >>>>    hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
> >>>>    hw/virtio/vhost-shadow-virtqueue.h |   3 +-
> >>>>    net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
> >>>>    3 files changed, 249 insertions(+), 146 deletions(-)
> >>>>
> >>>> --
> >>>> 2.25.1
> >>>
> >>
> >
>
Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Lei Yang 9 months, 2 weeks ago
According to the Hawkins provided steps, I tested two cases based on
applied this series patches and without it. And all tests are based on
the real hardware.
Case 1, without  this series
Source: qemu-system-x86_64: vhost_vdpa_net_load() = 23308 us
Dest: qemu-system-x86_64: vhost_vdpa_net_load() = 23296 us

Case 2, applied  this series
Source: qemu-system-x86_64: vhost_vdpa_net_load() = 6558 us
Dest: qemu-system-x86_64: vhost_vdpa_net_load() = 6539 us

Tested-by: Lei Yang <leiyang@redhat.com>


On Thu, Jul 20, 2023 at 6:54 AM Lei Yang <leiyang@redhat.com> wrote:
>
> On Wed, Jul 19, 2023 at 11:25 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
> >
> > 在 2023/7/19 20:44, Lei Yang 写道:
> > > Hello Hawkins and Michael
> > >
> > > Looks like there are big changes about vp_vdpa, therefore, if needed,
> > > QE can test this series in QE's environment before the patch is
> >
> > Hi Lei,
> >
> > This patch series does not modify the code of vp_vdpa. Instead, it only
> > modifies how QEMU sends SVQ control commands to the vdpa device.
> >
> Hi Hawkins
>
> > Considering that the behavior of the vp_vdpa device differs from that
> > of real vdpa hardware, would it be possible for you to test this patch
> > series on a real vdpa device?
>
> Yes, there is a hardware device to test it , I will update the test
> results ASAP.
>
> BR
> Lei
> >
> > Thanks!
> >
> >
> > > merged, and provide the result.
> > >
> > > BR
> > > Lei
> > >
> > >
> > > On Wed, Jul 19, 2023 at 8:37 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
> > >>
> > >> 在 2023/7/19 17:11, Michael S. Tsirkin 写道:
> > >>> On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
> > >>>> This patchset allows QEMU to delay polling and checking the device
> > >>>> used buffer until either the SVQ is full or control commands shadow
> > >>>> buffers are full, instead of polling and checking immediately after
> > >>>> sending each SVQ control command, so that QEMU can send all the SVQ
> > >>>> control commands in parallel, which have better performance improvement.
> > >>>>
> > >>>> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
> > >>>> guest to build a test environment for sending multiple CVQ state load
> > >>>> commands. This patch series can improve latency from 10023 us to
> > >>>> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.
> > >>>
> > >>> Looks like a tiny improvement.
> > >>> At the same time we have O(n^2) behaviour with memory mappings.
> > >>
> > >> Hi Michael,
> > >>
> > >> Thanks for your review.
> > >>
> > >> I wonder why you say "we have O(n^2) behaviour on memory mappings" here?
> > >>
> > >>   From my understanding, QEMU maps two page-size buffers as control
> > >> commands shadow buffers at device startup. These buffers then are used
> > >> to cache SVQ control commands, where QEMU fills them with multiple SVQ control
> > >> commands bytes, flushes them when SVQ descriptors are full or these
> > >> control commands shadow buffers reach their capacity.
> > >>
> > >> QEMU repeats this process until all CVQ state load commands have been
> > >> sent in loading.
> > >>
> > >> In this loading process, only control commands shadow buffers
> > >> translation should be relative to memory mappings, which should be
> > >> O(log n) behaviour to my understanding(Please correct me if I am wrong).
> > >>
> > >>> Not saying we must not do this but I think it's worth
> > >>> checking where the bottleneck is. My guess would be
> > >>> vp_vdpa is not doing things in parallel. Want to try fixing that
> > >>
> > >> As for "vp_vdpa is not doing things in parallel.", do you mean
> > >> the vp_vdpa device cannot process QEMU's SVQ control commands
> > >> in parallel?
> > >>
> > >> In this situation, I will try to use real vdpa hardware to
> > >> test the patch series performance.
> > >>
> > >>> to see how far it can be pushed?
> > >>
> > >> Currently, I am involved in the "Add virtio-net Control Virtqueue state
> > >> restore support" project in Google Summer of Code now. Because I am
> > >> uncertain about the time it will take to fix that problem in the vp_vdpa
> > >> device, I prefer to complete the gsoc project first.
> > >>
> > >> Thanks!
> > >>
> > >>
> > >>>
> > >>>
> > >>>> Note that this patch should be based on
> > >>>> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
> > >>>>
> > >>>> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
> > >>>>
> > >>>> TestStep
> > >>>> ========
> > >>>> 1. regression testing using vp-vdpa device
> > >>>>     - For L0 guest, boot QEMU with two virtio-net-pci net device with
> > >>>> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
> > >>>>         -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> > >>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> > >>>> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
> > >>>>
> > >>>>     - For L1 guest, apply the patch series and compile the source code,
> > >>>> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
> > >>>> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
> > >>>>         -netdev type=vhost-vdpa,x-svq=true,...
> > >>>>         -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> > >>>> ctrl_rx=on,ctrl_rx_extra=on...
> > >>>>
> > >>>>     - For L2 source guest, run the following bash command:
> > >>>> ```bash
> > >>>> #!/bin/sh
> > >>>>
> > >>>> for idx1 in {0..9}
> > >>>> do
> > >>>>     for idx2 in {0..9}
> > >>>>     do
> > >>>>       for idx3 in {0..6}
> > >>>>       do
> > >>>>         ip link add macvlan$idx1$idx2$idx3 link eth0
> > >>>> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
> > >>>>         ip link set macvlan$idx1$idx2$idx3 up
> > >>>>       done
> > >>>>     done
> > >>>> done
> > >>>> ```
> > >>>>     - Execute the live migration in L2 source monitor
> > >>>>
> > >>>>     - Result
> > >>>>       * with this series, QEMU should not trigger any error or warning.
> > >>>>
> > >>>>
> > >>>>
> > >>>> 2. perf using vp-vdpa device
> > >>>>     - For L0 guest, boot QEMU with two virtio-net-pci net device with
> > >>>> `ctrl_vq`, `ctrl_vlan` features on, command line like:
> > >>>>         -device virtio-net-pci,disable-legacy=on,disable-modern=off,
> > >>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
> > >>>> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
> > >>>>
> > >>>>     - For L1 guest, apply the patch series, then apply an addtional
> > >>>> patch to record the load time in microseconds as following:
> > >>>> ```diff
> > >>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > >>>> index 6b958d6363..501b510fd2 100644
> > >>>> --- a/hw/net/vhost_net.c
> > >>>> +++ b/hw/net/vhost_net.c
> > >>>> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
> > >>>>        }
> > >>>>
> > >>>>        if (net->nc->info->load) {
> > >>>> +        int64_t start_us = g_get_monotonic_time();
> > >>>>            r = net->nc->info->load(net->nc);
> > >>>> +        error_report("vhost_vdpa_net_load() = %ld us",
> > >>>> +                     g_get_monotonic_time() - start_us);
> > >>>>            if (r < 0) {
> > >>>>                goto fail;
> > >>>>            }
> > >>>> ```
> > >>>>
> > >>>>     - For L1 guest, compile the code, and start QEMU with two vdpa device
> > >>>> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
> > >>>> command line like:
> > >>>>         -netdev type=vhost-vdpa,x-svq=true,...
> > >>>>         -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
> > >>>> ctrl_vlan=on...
> > >>>>
> > >>>>     - For L2 source guest, run the following bash command:
> > >>>> ```bash
> > >>>> #!/bin/sh
> > >>>>
> > >>>> for idx in {1..4094}
> > >>>> do
> > >>>>     ip link add link eth0 name vlan$idx type vlan id $idx
> > >>>> done
> > >>>> ```
> > >>>>
> > >>>>     - wait for some time, then execute the live migration in L2 source monitor
> > >>>>
> > >>>>     - Result
> > >>>>       * with this series, QEMU should not trigger any warning
> > >>>> or error except something like "vhost_vdpa_net_load() = 8697 us"
> > >>>>       * without this series, QEMU should not trigger any warning
> > >>>> or error except something like "vhost_vdpa_net_load() = 10023 us"
> > >>>>
> > >>>> ChangeLog
> > >>>> =========
> > >>>> v3:
> > >>>>     - refactor vhost_svq_poll() to accept cmds_in_flight
> > >>>> suggested by Jason and Eugenio
> > >>>>     - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
> > >>>> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
> > >>>> it suggested by Eugenio
> > >>>>     - poll and check when SVQ is full or control commands shadow buffers is
> > >>>> full
> > >>>>
> > >>>> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
> > >>>>     - recover accidentally deleted rows
> > >>>>     - remove extra newline
> > >>>>     - refactor `need_poll_len` to `cmds_in_flight`
> > >>>>     - return -EINVAL when vhost_svq_poll() return 0 or check
> > >>>> on buffers written by device fails
> > >>>>     - change the type of `in_cursor`, and refactor the
> > >>>> code for updating cursor
> > >>>>     - return directly when vhost_vdpa_net_load_{mac,mq}()
> > >>>> returns a failure in vhost_vdpa_net_load()
> > >>>>
> > >>>> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
> > >>>>
> > >>>> Hawkins Jiawei (8):
> > >>>>     vhost: Add argument to vhost_svq_poll()
> > >>>>     vdpa: Use iovec for vhost_vdpa_net_cvq_add()
> > >>>>     vhost: Expose vhost_svq_available_slots()
> > >>>>     vdpa: Avoid using vhost_vdpa_net_load_*() outside
> > >>>>       vhost_vdpa_net_load()
> > >>>>     vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
> > >>>>     vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
> > >>>>     vdpa: Add cursors to vhost_vdpa_net_loadx()
> > >>>>     vdpa: Send cvq state load commands in parallel
> > >>>>
> > >>>>    hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
> > >>>>    hw/virtio/vhost-shadow-virtqueue.h |   3 +-
> > >>>>    net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
> > >>>>    3 files changed, 249 insertions(+), 146 deletions(-)
> > >>>>
> > >>>> --
> > >>>> 2.25.1
> > >>>
> > >>
> > >
> >
Re: [PATCH v3 0/8] vdpa: Send all CVQ state load commands in parallel
Posted by Hawkins Jiawei 9 months, 2 weeks ago
在 2023/7/20 16:53, Lei Yang 写道:
> According to the Hawkins provided steps, I tested two cases based on
> applied this series patches and without it. And all tests are based on
> the real hardware.
> Case 1, without  this series
> Source: qemu-system-x86_64: vhost_vdpa_net_load() = 23308 us
> Dest: qemu-system-x86_64: vhost_vdpa_net_load() = 23296 us
>
> Case 2, applied  this series
> Source: qemu-system-x86_64: vhost_vdpa_net_load() = 6558 us
> Dest: qemu-system-x86_64: vhost_vdpa_net_load() = 6539 us
>
> Tested-by: Lei Yang <leiyang@redhat.com>
>
>
> On Thu, Jul 20, 2023 at 6:54 AM Lei Yang <leiyang@redhat.com> wrote:
>>
>> On Wed, Jul 19, 2023 at 11:25 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
>>>
>>> 在 2023/7/19 20:44, Lei Yang 写道:
>>>> Hello Hawkins and Michael
>>>>
>>>> Looks like there are big changes about vp_vdpa, therefore, if needed,
>>>> QE can test this series in QE's environment before the patch is
>>>
>>> Hi Lei,
>>>
>>> This patch series does not modify the code of vp_vdpa. Instead, it only
>>> modifies how QEMU sends SVQ control commands to the vdpa device.
>>>
>> Hi Hawkins
>>
>>> Considering that the behavior of the vp_vdpa device differs from that
>>> of real vdpa hardware, would it be possible for you to test this patch
>>> series on a real vdpa device?
>>
>> Yes, there is a hardware device to test it , I will update the test
>> results ASAP.
>>
>> BR
>> Lei
>>>
>>> Thanks!
>>>
>>>
>>>> merged, and provide the result.
>>>>
>>>> BR
>>>> Lei
>>>>
>>>>
>>>> On Wed, Jul 19, 2023 at 8:37 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
>>>>>
>>>>> 在 2023/7/19 17:11, Michael S. Tsirkin 写道:
>>>>>> On Wed, Jul 19, 2023 at 03:53:45PM +0800, Hawkins Jiawei wrote:
>>>>>>> This patchset allows QEMU to delay polling and checking the device
>>>>>>> used buffer until either the SVQ is full or control commands shadow
>>>>>>> buffers are full, instead of polling and checking immediately after
>>>>>>> sending each SVQ control command, so that QEMU can send all the SVQ
>>>>>>> control commands in parallel, which have better performance improvement.
>>>>>>>
>>>>>>> I use vp_vdpa device to simulate vdpa device, and create 4094 VLANS in
>>>>>>> guest to build a test environment for sending multiple CVQ state load
>>>>>>> commands. This patch series can improve latency from 10023 us to
>>>>>>> 8697 us for about 4099 CVQ state load commands, about 0.32 us per command.

It appears that the performance timing with the vp_vdpa device in this
method is not consistently stable.

I retest this patch series with the same steps with the vp_vdpa device.

With this patch series, in the majority of cases, the time for CVQ state
load commands is around 8000 us ~ 10000 us.

Without this patch series, in the majority of cases, the time for CVQ
state load commands is around 14000 us ~ 20000 us.

Thanks!


>>>>>>
>>>>>> Looks like a tiny improvement.
>>>>>> At the same time we have O(n^2) behaviour with memory mappings.
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Thanks for your review.
>>>>>
>>>>> I wonder why you say "we have O(n^2) behaviour on memory mappings" here?
>>>>>
>>>>>    From my understanding, QEMU maps two page-size buffers as control
>>>>> commands shadow buffers at device startup. These buffers then are used
>>>>> to cache SVQ control commands, where QEMU fills them with multiple SVQ control
>>>>> commands bytes, flushes them when SVQ descriptors are full or these
>>>>> control commands shadow buffers reach their capacity.
>>>>>
>>>>> QEMU repeats this process until all CVQ state load commands have been
>>>>> sent in loading.
>>>>>
>>>>> In this loading process, only control commands shadow buffers
>>>>> translation should be relative to memory mappings, which should be
>>>>> O(log n) behaviour to my understanding(Please correct me if I am wrong).
>>>>>
>>>>>> Not saying we must not do this but I think it's worth
>>>>>> checking where the bottleneck is. My guess would be
>>>>>> vp_vdpa is not doing things in parallel. Want to try fixing that
>>>>>
>>>>> As for "vp_vdpa is not doing things in parallel.", do you mean
>>>>> the vp_vdpa device cannot process QEMU's SVQ control commands
>>>>> in parallel?
>>>>>
>>>>> In this situation, I will try to use real vdpa hardware to
>>>>> test the patch series performance.
>>>>>
>>>>>> to see how far it can be pushed?
>>>>>
>>>>> Currently, I am involved in the "Add virtio-net Control Virtqueue state
>>>>> restore support" project in Google Summer of Code now. Because I am
>>>>> uncertain about the time it will take to fix that problem in the vp_vdpa
>>>>> device, I prefer to complete the gsoc project first.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> Note that this patch should be based on
>>>>>>> patch "Vhost-vdpa Shadow Virtqueue VLAN support" at [1].
>>>>>>>
>>>>>>> [1]. https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg03719.html
>>>>>>>
>>>>>>> TestStep
>>>>>>> ========
>>>>>>> 1. regression testing using vp-vdpa device
>>>>>>>      - For L0 guest, boot QEMU with two virtio-net-pci net device with
>>>>>>> `ctrl_vq`, `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>>>>>>>          -device virtio-net-pci,disable-legacy=on,disable-modern=off,
>>>>>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
>>>>>>> indirect_desc=off,queue_reset=off,ctrl_rx=on,ctrl_rx_extra=on,...
>>>>>>>
>>>>>>>      - For L1 guest, apply the patch series and compile the source code,
>>>>>>> start QEMU with two vdpa device with svq mode on, enable the `ctrl_vq`,
>>>>>>> `ctrl_rx`, `ctrl_rx_extra` features on, command line like:
>>>>>>>          -netdev type=vhost-vdpa,x-svq=true,...
>>>>>>>          -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
>>>>>>> ctrl_rx=on,ctrl_rx_extra=on...
>>>>>>>
>>>>>>>      - For L2 source guest, run the following bash command:
>>>>>>> ```bash
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> for idx1 in {0..9}
>>>>>>> do
>>>>>>>      for idx2 in {0..9}
>>>>>>>      do
>>>>>>>        for idx3 in {0..6}
>>>>>>>        do
>>>>>>>          ip link add macvlan$idx1$idx2$idx3 link eth0
>>>>>>> address 4a:30:10:19:$idx1$idx2:1$idx3 type macvlan mode bridge
>>>>>>>          ip link set macvlan$idx1$idx2$idx3 up
>>>>>>>        done
>>>>>>>      done
>>>>>>> done
>>>>>>> ```
>>>>>>>      - Execute the live migration in L2 source monitor
>>>>>>>
>>>>>>>      - Result
>>>>>>>        * with this series, QEMU should not trigger any error or warning.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2. perf using vp-vdpa device
>>>>>>>      - For L0 guest, boot QEMU with two virtio-net-pci net device with
>>>>>>> `ctrl_vq`, `ctrl_vlan` features on, command line like:
>>>>>>>          -device virtio-net-pci,disable-legacy=on,disable-modern=off,
>>>>>>> iommu_platform=on,mq=on,ctrl_vq=on,guest_announce=off,
>>>>>>> indirect_desc=off,queue_reset=off,ctrl_vlan=on,...
>>>>>>>
>>>>>>>      - For L1 guest, apply the patch series, then apply an addtional
>>>>>>> patch to record the load time in microseconds as following:
>>>>>>> ```diff
>>>>>>> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
>>>>>>> index 6b958d6363..501b510fd2 100644
>>>>>>> --- a/hw/net/vhost_net.c
>>>>>>> +++ b/hw/net/vhost_net.c
>>>>>>> @@ -295,7 +295,10 @@ static int vhost_net_start_one(struct vhost_net *net,
>>>>>>>         }
>>>>>>>
>>>>>>>         if (net->nc->info->load) {
>>>>>>> +        int64_t start_us = g_get_monotonic_time();
>>>>>>>             r = net->nc->info->load(net->nc);
>>>>>>> +        error_report("vhost_vdpa_net_load() = %ld us",
>>>>>>> +                     g_get_monotonic_time() - start_us);
>>>>>>>             if (r < 0) {
>>>>>>>                 goto fail;
>>>>>>>             }
>>>>>>> ```
>>>>>>>
>>>>>>>      - For L1 guest, compile the code, and start QEMU with two vdpa device
>>>>>>> with svq mode on, enable the `ctrl_vq`, `ctrl_vlan` features on,
>>>>>>> command line like:
>>>>>>>          -netdev type=vhost-vdpa,x-svq=true,...
>>>>>>>          -device virtio-net-pci,mq=on,guest_announce=off,ctrl_vq=on,
>>>>>>> ctrl_vlan=on...
>>>>>>>
>>>>>>>      - For L2 source guest, run the following bash command:
>>>>>>> ```bash
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> for idx in {1..4094}
>>>>>>> do
>>>>>>>      ip link add link eth0 name vlan$idx type vlan id $idx
>>>>>>> done
>>>>>>> ```
>>>>>>>
>>>>>>>      - wait for some time, then execute the live migration in L2 source monitor
>>>>>>>
>>>>>>>      - Result
>>>>>>>        * with this series, QEMU should not trigger any warning
>>>>>>> or error except something like "vhost_vdpa_net_load() = 8697 us"
>>>>>>>        * without this series, QEMU should not trigger any warning
>>>>>>> or error except something like "vhost_vdpa_net_load() = 10023 us"
>>>>>>>
>>>>>>> ChangeLog
>>>>>>> =========
>>>>>>> v3:
>>>>>>>      - refactor vhost_svq_poll() to accept cmds_in_flight
>>>>>>> suggested by Jason and Eugenio
>>>>>>>      - refactor vhost_vdpa_net_cvq_add() to make control commands buffers
>>>>>>> is not tied to `s->cvq_cmd_out_buffer` and `s->status`, so we can reuse
>>>>>>> it suggested by Eugenio
>>>>>>>      - poll and check when SVQ is full or control commands shadow buffers is
>>>>>>> full
>>>>>>>
>>>>>>> v2: https://lore.kernel.org/all/cover.1683371965.git.yin31149@gmail.com/
>>>>>>>      - recover accidentally deleted rows
>>>>>>>      - remove extra newline
>>>>>>>      - refactor `need_poll_len` to `cmds_in_flight`
>>>>>>>      - return -EINVAL when vhost_svq_poll() return 0 or check
>>>>>>> on buffers written by device fails
>>>>>>>      - change the type of `in_cursor`, and refactor the
>>>>>>> code for updating cursor
>>>>>>>      - return directly when vhost_vdpa_net_load_{mac,mq}()
>>>>>>> returns a failure in vhost_vdpa_net_load()
>>>>>>>
>>>>>>> v1: https://lore.kernel.org/all/cover.1681732982.git.yin31149@gmail.com/
>>>>>>>
>>>>>>> Hawkins Jiawei (8):
>>>>>>>      vhost: Add argument to vhost_svq_poll()
>>>>>>>      vdpa: Use iovec for vhost_vdpa_net_cvq_add()
>>>>>>>      vhost: Expose vhost_svq_available_slots()
>>>>>>>      vdpa: Avoid using vhost_vdpa_net_load_*() outside
>>>>>>>        vhost_vdpa_net_load()
>>>>>>>      vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
>>>>>>>      vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
>>>>>>>      vdpa: Add cursors to vhost_vdpa_net_loadx()
>>>>>>>      vdpa: Send cvq state load commands in parallel
>>>>>>>
>>>>>>>     hw/virtio/vhost-shadow-virtqueue.c |  38 ++--
>>>>>>>     hw/virtio/vhost-shadow-virtqueue.h |   3 +-
>>>>>>>     net/vhost-vdpa.c                   | 354 ++++++++++++++++++-----------
>>>>>>>     3 files changed, 249 insertions(+), 146 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.25.1
>>>>>>
>>>>>
>>>>
>>>
>
[PATCH v3 1/8] vhost: Add argument to vhost_svq_poll()
Posted by Hawkins Jiawei 9 months, 3 weeks ago
Next patches in this series will no longer perform an
immediate poll and check of the device's used buffers
for each CVQ state load command. Instead, they will
send CVQ state load commands in parallel by polling
multiple pending buffers at once.

To achieve this, this patch refactoring vhost_svq_poll()
to accept a new argument `num`, which allows vhost_svq_poll()
to wait for the device to use multiple elements,
rather than polling for a single element.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 36 ++++++++++++++++++------------
 hw/virtio/vhost-shadow-virtqueue.h |  2 +-
 net/vhost-vdpa.c                   |  2 +-
 3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 49e5aed931..e731b1d2ea 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -514,29 +514,37 @@ static void vhost_svq_flush(VhostShadowVirtqueue *svq,
 }
 
 /**
- * Poll the SVQ for one device used buffer.
+ * Poll the SVQ to wait for the device to use the specified number
+ * of elements and return the total length written by the device.
  *
  * This function race with main event loop SVQ polling, so extra
  * synchronization is needed.
  *
- * Return the length written by the device.
+ * @svq: The svq
+ * @num: The number of elements that need to be used
  */
-size_t vhost_svq_poll(VhostShadowVirtqueue *svq)
+size_t vhost_svq_poll(VhostShadowVirtqueue *svq, size_t num)
 {
-    int64_t start_us = g_get_monotonic_time();
-    uint32_t len = 0;
+    size_t len = 0;
+    uint32_t r;
 
-    do {
-        if (vhost_svq_more_used(svq)) {
-            break;
-        }
+    while (num--) {
+        int64_t start_us = g_get_monotonic_time();
 
-        if (unlikely(g_get_monotonic_time() - start_us > 10e6)) {
-            return 0;
-        }
-    } while (true);
+        do {
+            if (vhost_svq_more_used(svq)) {
+                break;
+            }
+
+            if (unlikely(g_get_monotonic_time() - start_us > 10e6)) {
+                return len;
+            }
+        } while (true);
+
+        vhost_svq_get_buf(svq, &r);
+        len += r;
+    }
 
-    vhost_svq_get_buf(svq, &len);
     return len;
 }
 
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 6efe051a70..5bce67837b 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -119,7 +119,7 @@ void vhost_svq_push_elem(VhostShadowVirtqueue *svq,
 int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
                   size_t out_num, const struct iovec *in_sg, size_t in_num,
                   VirtQueueElement *elem);
-size_t vhost_svq_poll(VhostShadowVirtqueue *svq);
+size_t vhost_svq_poll(VhostShadowVirtqueue *svq, size_t num);
 
 void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
 void vhost_svq_set_svq_call_fd(VhostShadowVirtqueue *svq, int call_fd);
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index dfd271c456..d1dd140bf6 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -625,7 +625,7 @@ static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
      * descriptor. Also, we need to take the answer before SVQ pulls by itself,
      * when BQL is released
      */
-    return vhost_svq_poll(svq);
+    return vhost_svq_poll(svq, 1);
 }
 
 static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
-- 
2.25.1
Re: [PATCH v3 1/8] vhost: Add argument to vhost_svq_poll()
Posted by Eugenio Perez Martin 8 months, 3 weeks ago
On Wed, Jul 19, 2023 at 9:54 AM Hawkins Jiawei <yin31149@gmail.com> wrote:
>

The subject could be more explicit. What about "add count argument to
vhost_svq_poll"?

Apart from that:
Acked-by: Eugenio Pérez <eperezma@redhat.com>

> Next patches in this series will no longer perform an
> immediate poll and check of the device's used buffers
> for each CVQ state load command. Instead, they will
> send CVQ state load commands in parallel by polling
> multiple pending buffers at once.
>
> To achieve this, this patch refactoring vhost_svq_poll()
> to accept a new argument `num`, which allows vhost_svq_poll()
> to wait for the device to use multiple elements,
> rather than polling for a single element.
>
> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
> ---
>  hw/virtio/vhost-shadow-virtqueue.c | 36 ++++++++++++++++++------------
>  hw/virtio/vhost-shadow-virtqueue.h |  2 +-
>  net/vhost-vdpa.c                   |  2 +-
>  3 files changed, 24 insertions(+), 16 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 49e5aed931..e731b1d2ea 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -514,29 +514,37 @@ static void vhost_svq_flush(VhostShadowVirtqueue *svq,
>  }
>
>  /**
> - * Poll the SVQ for one device used buffer.
> + * Poll the SVQ to wait for the device to use the specified number
> + * of elements and return the total length written by the device.
>   *
>   * This function race with main event loop SVQ polling, so extra
>   * synchronization is needed.
>   *
> - * Return the length written by the device.
> + * @svq: The svq
> + * @num: The number of elements that need to be used
>   */
> -size_t vhost_svq_poll(VhostShadowVirtqueue *svq)
> +size_t vhost_svq_poll(VhostShadowVirtqueue *svq, size_t num)
>  {
> -    int64_t start_us = g_get_monotonic_time();
> -    uint32_t len = 0;
> +    size_t len = 0;
> +    uint32_t r;
>
> -    do {
> -        if (vhost_svq_more_used(svq)) {
> -            break;
> -        }
> +    while (num--) {
> +        int64_t start_us = g_get_monotonic_time();
>
> -        if (unlikely(g_get_monotonic_time() - start_us > 10e6)) {
> -            return 0;
> -        }
> -    } while (true);
> +        do {
> +            if (vhost_svq_more_used(svq)) {
> +                break;
> +            }
> +
> +            if (unlikely(g_get_monotonic_time() - start_us > 10e6)) {
> +                return len;
> +            }
> +        } while (true);
> +
> +        vhost_svq_get_buf(svq, &r);
> +        len += r;
> +    }
>
> -    vhost_svq_get_buf(svq, &len);
>      return len;
>  }
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 6efe051a70..5bce67837b 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -119,7 +119,7 @@ void vhost_svq_push_elem(VhostShadowVirtqueue *svq,
>  int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
>                    size_t out_num, const struct iovec *in_sg, size_t in_num,
>                    VirtQueueElement *elem);
> -size_t vhost_svq_poll(VhostShadowVirtqueue *svq);
> +size_t vhost_svq_poll(VhostShadowVirtqueue *svq, size_t num);
>
>  void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
>  void vhost_svq_set_svq_call_fd(VhostShadowVirtqueue *svq, int call_fd);
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index dfd271c456..d1dd140bf6 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -625,7 +625,7 @@ static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
>       * descriptor. Also, we need to take the answer before SVQ pulls by itself,
>       * when BQL is released
>       */
> -    return vhost_svq_poll(svq);
> +    return vhost_svq_poll(svq, 1);
>  }
>
>  static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
> --
> 2.25.1
>
Re: [PATCH v3 1/8] vhost: Add argument to vhost_svq_poll()
Posted by Hawkins Jiawei 8 months, 2 weeks ago
On 2023/8/18 23:08, Eugenio Perez Martin wrote:
> On Wed, Jul 19, 2023 at 9:54 AM Hawkins Jiawei <yin31149@gmail.com> wrote:
>>
>
> The subject could be more explicit. What about "add count argument to
> vhost_svq_poll"?

Hi Eugenio,

Thanks for reviewing.
You are right, I will use this new subject in the v4 patch.

Thanks!


>
> Apart from that:
> Acked-by: Eugenio Pérez <eperezma@redhat.com>
>
>> Next patches in this series will no longer perform an
>> immediate poll and check of the device's used buffers
>> for each CVQ state load command. Instead, they will
>> send CVQ state load commands in parallel by polling
>> multiple pending buffers at once.
>>
>> To achieve this, this patch refactoring vhost_svq_poll()
>> to accept a new argument `num`, which allows vhost_svq_poll()
>> to wait for the device to use multiple elements,
>> rather than polling for a single element.
>>
>> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
>> ---
>>   hw/virtio/vhost-shadow-virtqueue.c | 36 ++++++++++++++++++------------
>>   hw/virtio/vhost-shadow-virtqueue.h |  2 +-
>>   net/vhost-vdpa.c                   |  2 +-
>>   3 files changed, 24 insertions(+), 16 deletions(-)
>>
>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>> index 49e5aed931..e731b1d2ea 100644
>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>> @@ -514,29 +514,37 @@ static void vhost_svq_flush(VhostShadowVirtqueue *svq,
>>   }
>>
>>   /**
>> - * Poll the SVQ for one device used buffer.
>> + * Poll the SVQ to wait for the device to use the specified number
>> + * of elements and return the total length written by the device.
>>    *
>>    * This function race with main event loop SVQ polling, so extra
>>    * synchronization is needed.
>>    *
>> - * Return the length written by the device.
>> + * @svq: The svq
>> + * @num: The number of elements that need to be used
>>    */
>> -size_t vhost_svq_poll(VhostShadowVirtqueue *svq)
>> +size_t vhost_svq_poll(VhostShadowVirtqueue *svq, size_t num)
>>   {
>> -    int64_t start_us = g_get_monotonic_time();
>> -    uint32_t len = 0;
>> +    size_t len = 0;
>> +    uint32_t r;
>>
>> -    do {
>> -        if (vhost_svq_more_used(svq)) {
>> -            break;
>> -        }
>> +    while (num--) {
>> +        int64_t start_us = g_get_monotonic_time();
>>
>> -        if (unlikely(g_get_monotonic_time() - start_us > 10e6)) {
>> -            return 0;
>> -        }
>> -    } while (true);
>> +        do {
>> +            if (vhost_svq_more_used(svq)) {
>> +                break;
>> +            }
>> +
>> +            if (unlikely(g_get_monotonic_time() - start_us > 10e6)) {
>> +                return len;
>> +            }
>> +        } while (true);
>> +
>> +        vhost_svq_get_buf(svq, &r);
>> +        len += r;
>> +    }
>>
>> -    vhost_svq_get_buf(svq, &len);
>>       return len;
>>   }
>>
>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>> index 6efe051a70..5bce67837b 100644
>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>> @@ -119,7 +119,7 @@ void vhost_svq_push_elem(VhostShadowVirtqueue *svq,
>>   int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
>>                     size_t out_num, const struct iovec *in_sg, size_t in_num,
>>                     VirtQueueElement *elem);
>> -size_t vhost_svq_poll(VhostShadowVirtqueue *svq);
>> +size_t vhost_svq_poll(VhostShadowVirtqueue *svq, size_t num);
>>
>>   void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
>>   void vhost_svq_set_svq_call_fd(VhostShadowVirtqueue *svq, int call_fd);
>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>> index dfd271c456..d1dd140bf6 100644
>> --- a/net/vhost-vdpa.c
>> +++ b/net/vhost-vdpa.c
>> @@ -625,7 +625,7 @@ static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
>>        * descriptor. Also, we need to take the answer before SVQ pulls by itself,
>>        * when BQL is released
>>        */
>> -    return vhost_svq_poll(svq);
>> +    return vhost_svq_poll(svq, 1);
>>   }
>>
>>   static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
>> --
>> 2.25.1
>>
>
[PATCH v3 2/8] vdpa: Use iovec for vhost_vdpa_net_cvq_add()
Posted by Hawkins Jiawei 9 months, 3 weeks ago
Next patches in this series will no longer perform an
immediate poll and check of the device's used buffers
for each CVQ state load command. Consequently, there
will be multiple pending buffers in the shadow VirtQueue,
making it a must for every control command to have its
own buffer.

To achieve this, this patch refactor vhost_vdpa_net_cvq_add()
to accept `struct iovec`, which eliminates the coupling of
control commands to `s->cvq_cmd_out_buffer` and `s->status`,
allowing them to use their own buffer.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 net/vhost-vdpa.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index d1dd140bf6..6b16c8ece0 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -596,22 +596,14 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
     vhost_vdpa_net_client_stop(nc);
 }
 
-static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
-                                      size_t in_len)
+static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s,
+                                      struct iovec *out_sg, size_t out_num,
+                                      struct iovec *in_sg, size_t in_num)
 {
-    /* Buffers for the device */
-    const struct iovec out = {
-        .iov_base = s->cvq_cmd_out_buffer,
-        .iov_len = out_len,
-    };
-    const struct iovec in = {
-        .iov_base = s->status,
-        .iov_len = sizeof(virtio_net_ctrl_ack),
-    };
     VhostShadowVirtqueue *svq = g_ptr_array_index(s->vhost_vdpa.shadow_vqs, 0);
     int r;
 
-    r = vhost_svq_add(svq, &out, 1, &in, 1, NULL);
+    r = vhost_svq_add(svq, out_sg, out_num, in_sg, in_num, NULL);
     if (unlikely(r != 0)) {
         if (unlikely(r == -ENOSPC)) {
             qemu_log_mask(LOG_GUEST_ERROR, "%s: No space on device queue\n",
@@ -637,6 +629,15 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
         .cmd = cmd,
     };
     size_t data_size = iov_size(data_sg, data_num);
+    /* Buffers for the device */
+    struct iovec out = {
+        .iov_base = s->cvq_cmd_out_buffer,
+        .iov_len = sizeof(ctrl) + data_size,
+    };
+    struct iovec in = {
+        .iov_base = s->status,
+        .iov_len = sizeof(*s->status),
+    };
 
     assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
 
@@ -647,8 +648,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
     iov_to_buf(data_sg, data_num, 0,
                s->cvq_cmd_out_buffer + sizeof(ctrl), data_size);
 
-    return vhost_vdpa_net_cvq_add(s, data_size + sizeof(ctrl),
-                                  sizeof(virtio_net_ctrl_ack));
+    return vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1);
 }
 
 static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n)
@@ -1222,9 +1222,7 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
     struct iovec out = {
         .iov_base = s->cvq_cmd_out_buffer,
     };
-    /* in buffer used for device model */
-    const struct iovec in = {
-        .iov_base = &status,
+    struct iovec in = {
         .iov_len = sizeof(status),
     };
     ssize_t dev_written = -EINVAL;
@@ -1232,6 +1230,8 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
     out.iov_len = iov_to_buf(elem->out_sg, elem->out_num, 0,
                              s->cvq_cmd_out_buffer,
                              vhost_vdpa_net_cvq_cmd_page_len());
+    /* In buffer used for the vdpa device */
+    in.iov_base = s->status;
 
     ctrl = s->cvq_cmd_out_buffer;
     if (ctrl->class == VIRTIO_NET_CTRL_ANNOUNCE) {
@@ -1260,7 +1260,7 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
             goto out;
         }
     } else {
-        dev_written = vhost_vdpa_net_cvq_add(s, out.iov_len, sizeof(status));
+        dev_written = vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1);
         if (unlikely(dev_written < 0)) {
             goto out;
         }
@@ -1276,6 +1276,8 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
     }
 
     status = VIRTIO_NET_ERR;
+    /* In buffer used for the device model */
+    in.iov_base = &status;
     virtio_net_handle_ctrl_iov(svq->vdev, &in, 1, &out, 1);
     if (status != VIRTIO_NET_OK) {
         error_report("Bad CVQ processing in model");
-- 
2.25.1
[PATCH v3 3/8] vhost: Expose vhost_svq_available_slots()
Posted by Hawkins Jiawei 9 months, 3 weeks ago
Next patches in this series will delay the polling
and checking of buffers until either the SVQ is
full or control commands shadow buffers are full,
no longer perform an immediate poll and check of
the device's used buffers for each CVQ state load command.

To achieve this, this patch exposes
vhost_svq_available_slots() and introduces a helper function,
allowing QEMU to know whether the SVQ is full.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 2 +-
 hw/virtio/vhost-shadow-virtqueue.h | 1 +
 net/vhost-vdpa.c                   | 9 +++++++++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index e731b1d2ea..fc5f408f77 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -66,7 +66,7 @@ bool vhost_svq_valid_features(uint64_t features, Error **errp)
  *
  * @svq: The svq
  */
-static uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq)
+uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq)
 {
     return svq->num_free;
 }
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 5bce67837b..19c842a15b 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -114,6 +114,7 @@ typedef struct VhostShadowVirtqueue {
 
 bool vhost_svq_valid_features(uint64_t features, Error **errp);
 
+uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq);
 void vhost_svq_push_elem(VhostShadowVirtqueue *svq,
                          const VirtQueueElement *elem, uint32_t len);
 int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 6b16c8ece0..dd71008e08 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -620,6 +620,13 @@ static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s,
     return vhost_svq_poll(svq, 1);
 }
 
+/* Convenience wrapper to get number of available SVQ descriptors */
+static uint16_t vhost_vdpa_net_svq_available_slots(VhostVDPAState *s)
+{
+    VhostShadowVirtqueue *svq = g_ptr_array_index(s->vhost_vdpa.shadow_vqs, 0);
+    return vhost_svq_available_slots(svq);
+}
+
 static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
                                        uint8_t cmd, const struct iovec *data_sg,
                                        size_t data_num)
@@ -640,6 +647,8 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
     };
 
     assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
+    /* Each CVQ command has one out descriptor and one in descriptor */
+    assert(vhost_vdpa_net_svq_available_slots(s) >= 2);
 
     /* pack the CVQ command header */
     memcpy(s->cvq_cmd_out_buffer, &ctrl, sizeof(ctrl));
-- 
2.25.1
Re: [PATCH v3 3/8] vhost: Expose vhost_svq_available_slots()
Posted by Eugenio Perez Martin 8 months, 3 weeks ago
On Wed, Jul 19, 2023 at 9:54 AM Hawkins Jiawei <yin31149@gmail.com> wrote:
>
> Next patches in this series will delay the polling
> and checking of buffers until either the SVQ is
> full or control commands shadow buffers are full,
> no longer perform an immediate poll and check of
> the device's used buffers for each CVQ state load command.
>
> To achieve this, this patch exposes
> vhost_svq_available_slots() and introduces a helper function,
> allowing QEMU to know whether the SVQ is full.
>
> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>

Acked-by: Eugenio Pérez <eperezma@redhat.com>

> ---
>  hw/virtio/vhost-shadow-virtqueue.c | 2 +-
>  hw/virtio/vhost-shadow-virtqueue.h | 1 +
>  net/vhost-vdpa.c                   | 9 +++++++++
>  3 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index e731b1d2ea..fc5f408f77 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -66,7 +66,7 @@ bool vhost_svq_valid_features(uint64_t features, Error **errp)
>   *
>   * @svq: The svq
>   */
> -static uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq)
> +uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq)
>  {
>      return svq->num_free;
>  }
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 5bce67837b..19c842a15b 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -114,6 +114,7 @@ typedef struct VhostShadowVirtqueue {
>
>  bool vhost_svq_valid_features(uint64_t features, Error **errp);
>
> +uint16_t vhost_svq_available_slots(const VhostShadowVirtqueue *svq);
>  void vhost_svq_push_elem(VhostShadowVirtqueue *svq,
>                           const VirtQueueElement *elem, uint32_t len);
>  int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 6b16c8ece0..dd71008e08 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -620,6 +620,13 @@ static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s,
>      return vhost_svq_poll(svq, 1);
>  }
>
> +/* Convenience wrapper to get number of available SVQ descriptors */
> +static uint16_t vhost_vdpa_net_svq_available_slots(VhostVDPAState *s)
> +{
> +    VhostShadowVirtqueue *svq = g_ptr_array_index(s->vhost_vdpa.shadow_vqs, 0);
> +    return vhost_svq_available_slots(svq);
> +}
> +
>  static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
>                                         uint8_t cmd, const struct iovec *data_sg,
>                                         size_t data_num)
> @@ -640,6 +647,8 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
>      };
>
>      assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
> +    /* Each CVQ command has one out descriptor and one in descriptor */
> +    assert(vhost_vdpa_net_svq_available_slots(s) >= 2);
>
>      /* pack the CVQ command header */
>      memcpy(s->cvq_cmd_out_buffer, &ctrl, sizeof(ctrl));
> --
> 2.25.1
>
[PATCH v3 4/8] vdpa: Avoid using vhost_vdpa_net_load_*() outside vhost_vdpa_net_load()
Posted by Hawkins Jiawei 9 months, 3 weeks ago
Next patches in this series will refactor vhost_vdpa_net_load_cmd()
to iterate through the control commands shadow buffers, allowing QEMU
to send CVQ state load commands in parallel at device startup.

Considering that QEMU always forwards the CVQ command serialized
outside of vhost_vdpa_net_load(), it is more elegant to send the
CVQ commands directly without invoking vhost_vdpa_net_load_*() helpers.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 net/vhost-vdpa.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index dd71008e08..ae8f59adaa 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -1098,12 +1098,14 @@ static NetClientInfo net_vhost_vdpa_cvq_info = {
  */
 static int vhost_vdpa_net_excessive_mac_filter_cvq_add(VhostVDPAState *s,
                                                        VirtQueueElement *elem,
-                                                       struct iovec *out)
+                                                       struct iovec *out,
+                                                       struct iovec *in)
 {
     struct virtio_net_ctrl_mac mac_data, *mac_ptr;
     struct virtio_net_ctrl_hdr *hdr_ptr;
     uint32_t cursor;
     ssize_t r;
+    uint8_t on = 1;
 
     /* parse the non-multicast MAC address entries from CVQ command */
     cursor = sizeof(*hdr_ptr);
@@ -1151,7 +1153,16 @@ static int vhost_vdpa_net_excessive_mac_filter_cvq_add(VhostVDPAState *s,
      * filter table to the vdpa device, it should send the
      * VIRTIO_NET_CTRL_RX_PROMISC CVQ command to enable promiscuous mode
      */
-    r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_PROMISC, 1);
+    cursor = 0;
+    hdr_ptr = out->iov_base;
+    out->iov_len = sizeof(*hdr_ptr) + sizeof(on);
+    assert(out->iov_len < vhost_vdpa_net_cvq_cmd_page_len());
+
+    hdr_ptr->class = VIRTIO_NET_CTRL_RX;
+    hdr_ptr->cmd = VIRTIO_NET_CTRL_RX_PROMISC;
+    cursor += sizeof(*hdr_ptr);
+    *(uint8_t *)(out->iov_base + cursor) = on;
+    r = vhost_vdpa_net_cvq_add(s, out, 1, in, 1);
     if (unlikely(r < 0)) {
         return r;
     }
@@ -1264,7 +1275,7 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
          * the CVQ command direclty.
          */
         dev_written = vhost_vdpa_net_excessive_mac_filter_cvq_add(s, elem,
-                                                                  &out);
+                                                                  &out, &in);
         if (unlikely(dev_written < 0)) {
             goto out;
         }
-- 
2.25.1
Re: [PATCH v3 4/8] vdpa: Avoid using vhost_vdpa_net_load_*() outside vhost_vdpa_net_load()
Posted by Eugenio Perez Martin 8 months, 3 weeks ago
On Wed, Jul 19, 2023 at 9:54 AM Hawkins Jiawei <yin31149@gmail.com> wrote:
>
> Next patches in this series will refactor vhost_vdpa_net_load_cmd()
> to iterate through the control commands shadow buffers, allowing QEMU
> to send CVQ state load commands in parallel at device startup.
>
> Considering that QEMU always forwards the CVQ command serialized
> outside of vhost_vdpa_net_load(), it is more elegant to send the
> CVQ commands directly without invoking vhost_vdpa_net_load_*() helpers.
>
> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
> ---
>  net/vhost-vdpa.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index dd71008e08..ae8f59adaa 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -1098,12 +1098,14 @@ static NetClientInfo net_vhost_vdpa_cvq_info = {
>   */
>  static int vhost_vdpa_net_excessive_mac_filter_cvq_add(VhostVDPAState *s,
>                                                         VirtQueueElement *elem,
> -                                                       struct iovec *out)
> +                                                       struct iovec *out,
> +                                                       struct iovec *in)
>  {
>      struct virtio_net_ctrl_mac mac_data, *mac_ptr;
>      struct virtio_net_ctrl_hdr *hdr_ptr;
>      uint32_t cursor;
>      ssize_t r;
> +    uint8_t on = 1;
>
>      /* parse the non-multicast MAC address entries from CVQ command */
>      cursor = sizeof(*hdr_ptr);
> @@ -1151,7 +1153,16 @@ static int vhost_vdpa_net_excessive_mac_filter_cvq_add(VhostVDPAState *s,
>       * filter table to the vdpa device, it should send the
>       * VIRTIO_NET_CTRL_RX_PROMISC CVQ command to enable promiscuous mode
>       */
> -    r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_PROMISC, 1);
> +    cursor = 0;
> +    hdr_ptr = out->iov_base;
> +    out->iov_len = sizeof(*hdr_ptr) + sizeof(on);
> +    assert(out->iov_len < vhost_vdpa_net_cvq_cmd_page_len());
> +
> +    hdr_ptr->class = VIRTIO_NET_CTRL_RX;
> +    hdr_ptr->cmd = VIRTIO_NET_CTRL_RX_PROMISC;
> +    cursor += sizeof(*hdr_ptr);
> +    *(uint8_t *)(out->iov_base + cursor) = on;
> +    r = vhost_vdpa_net_cvq_add(s, out, 1, in, 1);

Can this be done with iov_from_buf?

>      if (unlikely(r < 0)) {
>          return r;
>      }
> @@ -1264,7 +1275,7 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
>           * the CVQ command direclty.
>           */
>          dev_written = vhost_vdpa_net_excessive_mac_filter_cvq_add(s, elem,
> -                                                                  &out);
> +                                                                  &out, &in);
>          if (unlikely(dev_written < 0)) {
>              goto out;
>          }
> --
> 2.25.1
>
Re: [PATCH v3 4/8] vdpa: Avoid using vhost_vdpa_net_load_*() outside vhost_vdpa_net_load()
Posted by Hawkins Jiawei 8 months, 2 weeks ago
On 2023/8/18 23:39, Eugenio Perez Martin wrote:
> On Wed, Jul 19, 2023 at 9:54 AM Hawkins Jiawei <yin31149@gmail.com> wrote:
>>
>> Next patches in this series will refactor vhost_vdpa_net_load_cmd()
>> to iterate through the control commands shadow buffers, allowing QEMU
>> to send CVQ state load commands in parallel at device startup.
>>
>> Considering that QEMU always forwards the CVQ command serialized
>> outside of vhost_vdpa_net_load(), it is more elegant to send the
>> CVQ commands directly without invoking vhost_vdpa_net_load_*() helpers.
>>
>> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
>> ---
>>   net/vhost-vdpa.c | 17 ++++++++++++++---
>>   1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
>> index dd71008e08..ae8f59adaa 100644
>> --- a/net/vhost-vdpa.c
>> +++ b/net/vhost-vdpa.c
>> @@ -1098,12 +1098,14 @@ static NetClientInfo net_vhost_vdpa_cvq_info = {
>>    */
>>   static int vhost_vdpa_net_excessive_mac_filter_cvq_add(VhostVDPAState *s,
>>                                                          VirtQueueElement *elem,
>> -                                                       struct iovec *out)
>> +                                                       struct iovec *out,
>> +                                                       struct iovec *in)
>>   {
>>       struct virtio_net_ctrl_mac mac_data, *mac_ptr;
>>       struct virtio_net_ctrl_hdr *hdr_ptr;
>>       uint32_t cursor;
>>       ssize_t r;
>> +    uint8_t on = 1;
>>
>>       /* parse the non-multicast MAC address entries from CVQ command */
>>       cursor = sizeof(*hdr_ptr);
>> @@ -1151,7 +1153,16 @@ static int vhost_vdpa_net_excessive_mac_filter_cvq_add(VhostVDPAState *s,
>>        * filter table to the vdpa device, it should send the
>>        * VIRTIO_NET_CTRL_RX_PROMISC CVQ command to enable promiscuous mode
>>        */
>> -    r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_PROMISC, 1);
>> +    cursor = 0;
>> +    hdr_ptr = out->iov_base;
>> +    out->iov_len = sizeof(*hdr_ptr) + sizeof(on);
>> +    assert(out->iov_len < vhost_vdpa_net_cvq_cmd_page_len());
>> +
>> +    hdr_ptr->class = VIRTIO_NET_CTRL_RX;
>> +    hdr_ptr->cmd = VIRTIO_NET_CTRL_RX_PROMISC;
>> +    cursor += sizeof(*hdr_ptr);
>> +    *(uint8_t *)(out->iov_base + cursor) = on;
>> +    r = vhost_vdpa_net_cvq_add(s, out, 1, in, 1);
>
> Can this be done with iov_from_buf?

Hi Eugenio,

Yes, this should be done by iov_from_buf(), I will refactor the code
according to your suggestion.

Thanks!


>
>>       if (unlikely(r < 0)) {
>>           return r;
>>       }
>> @@ -1264,7 +1275,7 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
>>            * the CVQ command direclty.
>>            */
>>           dev_written = vhost_vdpa_net_excessive_mac_filter_cvq_add(s, elem,
>> -                                                                  &out);
>> +                                                                  &out, &in);
>>           if (unlikely(dev_written < 0)) {
>>               goto out;
>>           }
>> --
>> 2.25.1
>>
>
[PATCH v3 5/8] vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
Posted by Hawkins Jiawei 9 months, 3 weeks ago
Considering that vhost_vdpa_net_load_rx_mode() is only called
within vhost_vdpa_net_load_rx() now, this patch refactors
vhost_vdpa_net_load_rx_mode() to include a check for the
device's ack, simplifying the code and improving its maintainability.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 net/vhost-vdpa.c | 76 ++++++++++++++++++++----------------------------
 1 file changed, 31 insertions(+), 45 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index ae8f59adaa..fe0ba19724 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -814,14 +814,24 @@ static int vhost_vdpa_net_load_rx_mode(VhostVDPAState *s,
         .iov_base = &on,
         .iov_len = sizeof(on),
     };
-    return vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_RX,
-                                   cmd, &data, 1);
+    ssize_t dev_written;
+
+    dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_RX,
+                                          cmd, &data, 1);
+    if (unlikely(dev_written < 0)) {
+        return dev_written;
+    }
+    if (*s->status != VIRTIO_NET_OK) {
+        return -EIO;
+    }
+
+    return 0;
 }
 
 static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
                                   const VirtIONet *n)
 {
-    ssize_t dev_written;
+    ssize_t r;
 
     if (!virtio_vdev_has_feature(&n->parent_obj, VIRTIO_NET_F_CTRL_RX)) {
         return 0;
@@ -846,13 +856,9 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (!n->mac_table.uni_overflow && !n->promisc) {
-        dev_written = vhost_vdpa_net_load_rx_mode(s,
-                                            VIRTIO_NET_CTRL_RX_PROMISC, 0);
-        if (unlikely(dev_written < 0)) {
-            return dev_written;
-        }
-        if (*s->status != VIRTIO_NET_OK) {
-            return -EIO;
+        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_PROMISC, 0);
+        if (unlikely(r < 0)) {
+            return r;
         }
     }
 
@@ -874,13 +880,9 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->mac_table.multi_overflow || n->allmulti) {
-        dev_written = vhost_vdpa_net_load_rx_mode(s,
-                                            VIRTIO_NET_CTRL_RX_ALLMULTI, 1);
-        if (unlikely(dev_written < 0)) {
-            return dev_written;
-        }
-        if (*s->status != VIRTIO_NET_OK) {
-            return -EIO;
+        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_ALLMULTI, 1);
+        if (unlikely(r < 0)) {
+            return r;
         }
     }
 
@@ -899,13 +901,9 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->alluni) {
-        dev_written = vhost_vdpa_net_load_rx_mode(s,
-                                            VIRTIO_NET_CTRL_RX_ALLUNI, 1);
-        if (dev_written < 0) {
-            return dev_written;
-        }
-        if (*s->status != VIRTIO_NET_OK) {
-            return -EIO;
+        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_ALLUNI, 1);
+        if (r < 0) {
+            return r;
         }
     }
 
@@ -920,13 +918,9 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->nomulti) {
-        dev_written = vhost_vdpa_net_load_rx_mode(s,
-                                            VIRTIO_NET_CTRL_RX_NOMULTI, 1);
-        if (dev_written < 0) {
-            return dev_written;
-        }
-        if (*s->status != VIRTIO_NET_OK) {
-            return -EIO;
+        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_NOMULTI, 1);
+        if (r < 0) {
+            return r;
         }
     }
 
@@ -941,13 +935,9 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->nouni) {
-        dev_written = vhost_vdpa_net_load_rx_mode(s,
-                                            VIRTIO_NET_CTRL_RX_NOUNI, 1);
-        if (dev_written < 0) {
-            return dev_written;
-        }
-        if (*s->status != VIRTIO_NET_OK) {
-            return -EIO;
+        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_NOUNI, 1);
+        if (r < 0) {
+            return r;
         }
     }
 
@@ -962,13 +952,9 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->nobcast) {
-        dev_written = vhost_vdpa_net_load_rx_mode(s,
-                                            VIRTIO_NET_CTRL_RX_NOBCAST, 1);
-        if (dev_written < 0) {
-            return dev_written;
-        }
-        if (*s->status != VIRTIO_NET_OK) {
-            return -EIO;
+        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_NOBCAST, 1);
+        if (r < 0) {
+            return r;
         }
     }
 
-- 
2.25.1
[PATCH v3 6/8] vdpa: Move vhost_svq_poll() to the caller of vhost_vdpa_net_cvq_add()
Posted by Hawkins Jiawei 9 months, 3 weeks ago
This patch moves vhost_svq_poll() to the caller of
vhost_vdpa_net_cvq_add() and introduces a helper funtion.

By making this change, next patches in this series is
able to refactor vhost_vdpa_net_load_x() only to delay
the polling and checking process until either the SVQ
is full or control commands shadow buffers are full.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 net/vhost-vdpa.c | 50 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index fe0ba19724..d06f38403f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -609,15 +609,21 @@ static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s,
             qemu_log_mask(LOG_GUEST_ERROR, "%s: No space on device queue\n",
                           __func__);
         }
-        return r;
     }
 
-    /*
-     * We can poll here since we've had BQL from the time we sent the
-     * descriptor. Also, we need to take the answer before SVQ pulls by itself,
-     * when BQL is released
-     */
-    return vhost_svq_poll(svq, 1);
+    return r;
+}
+
+/*
+ * Convenience wrapper to poll SVQ for multiple control commands.
+ *
+ * Caller should hold the BQL when invoking this function, and should take
+ * the answer before SVQ pulls by itself when BQL is released.
+ */
+static ssize_t vhost_vdpa_net_svq_poll(VhostVDPAState *s, size_t cmds_in_flight)
+{
+    VhostShadowVirtqueue *svq = g_ptr_array_index(s->vhost_vdpa.shadow_vqs, 0);
+    return vhost_svq_poll(svq, cmds_in_flight);
 }
 
 /* Convenience wrapper to get number of available SVQ descriptors */
@@ -645,6 +651,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
         .iov_base = s->status,
         .iov_len = sizeof(*s->status),
     };
+    ssize_t r;
 
     assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
     /* Each CVQ command has one out descriptor and one in descriptor */
@@ -657,7 +664,16 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
     iov_to_buf(data_sg, data_num, 0,
                s->cvq_cmd_out_buffer + sizeof(ctrl), data_size);
 
-    return vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1);
+    r = vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1);
+    if (unlikely(r < 0)) {
+        return r;
+    }
+
+    /*
+     * We can poll here since we've had BQL from the time
+     * we sent the descriptor.
+     */
+    return vhost_vdpa_net_svq_poll(s, 1);
 }
 
 static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n)
@@ -1152,6 +1168,12 @@ static int vhost_vdpa_net_excessive_mac_filter_cvq_add(VhostVDPAState *s,
     if (unlikely(r < 0)) {
         return r;
     }
+
+    /*
+     * We can poll here since we've had BQL from the time
+     * we sent the descriptor.
+     */
+    vhost_vdpa_net_svq_poll(s, 1);
     if (*s->status != VIRTIO_NET_OK) {
         return sizeof(*s->status);
     }
@@ -1266,10 +1288,18 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
             goto out;
         }
     } else {
-        dev_written = vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1);
-        if (unlikely(dev_written < 0)) {
+        ssize_t r;
+        r = vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1);
+        if (unlikely(r < 0)) {
+            dev_written = r;
             goto out;
         }
+
+        /*
+         * We can poll here since we've had BQL from the time
+         * we sent the descriptor.
+         */
+        dev_written = vhost_vdpa_net_svq_poll(s, 1);
     }
 
     if (unlikely(dev_written < sizeof(status))) {
-- 
2.25.1
[PATCH v3 7/8] vdpa: Add cursors to vhost_vdpa_net_loadx()
Posted by Hawkins Jiawei 9 months, 3 weeks ago
This patch adds `out_cursor` and `in_cursor` arguments
to vhost_vdpa_net_loadx().

By making this change, next patches in this series
can refactor vhost_vdpa_net_load_cmd() directly to
iterate through the control commands shadow buffers,
allowing QEMU to send CVQ state load commands in parallel
at device startup.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 net/vhost-vdpa.c | 79 ++++++++++++++++++++++++++++++------------------
 1 file changed, 50 insertions(+), 29 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index d06f38403f..795c9c1fd2 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -633,7 +633,8 @@ static uint16_t vhost_vdpa_net_svq_available_slots(VhostVDPAState *s)
     return vhost_svq_available_slots(svq);
 }
 
-static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
+static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, void **out_cursor,
+                                       void **in_cursor, uint8_t class,
                                        uint8_t cmd, const struct iovec *data_sg,
                                        size_t data_num)
 {
@@ -644,11 +645,11 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
     size_t data_size = iov_size(data_sg, data_num);
     /* Buffers for the device */
     struct iovec out = {
-        .iov_base = s->cvq_cmd_out_buffer,
+        .iov_base = *out_cursor,
         .iov_len = sizeof(ctrl) + data_size,
     };
     struct iovec in = {
-        .iov_base = s->status,
+        .iov_base = *in_cursor,
         .iov_len = sizeof(*s->status),
     };
     ssize_t r;
@@ -658,11 +659,11 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
     assert(vhost_vdpa_net_svq_available_slots(s) >= 2);
 
     /* pack the CVQ command header */
-    memcpy(s->cvq_cmd_out_buffer, &ctrl, sizeof(ctrl));
+    memcpy(out.iov_base, &ctrl, sizeof(ctrl));
 
     /* pack the CVQ command command-specific-data */
     iov_to_buf(data_sg, data_num, 0,
-               s->cvq_cmd_out_buffer + sizeof(ctrl), data_size);
+               out.iov_base + sizeof(ctrl), data_size);
 
     r = vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1);
     if (unlikely(r < 0)) {
@@ -676,14 +677,16 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
     return vhost_vdpa_net_svq_poll(s, 1);
 }
 
-static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n)
+static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n,
+                                   void **out_cursor, void **in_cursor)
 {
     if (virtio_vdev_has_feature(&n->parent_obj, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
         const struct iovec data = {
             .iov_base = (void *)n->mac,
             .iov_len = sizeof(n->mac),
         };
-        ssize_t dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_MAC,
+        ssize_t dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                                  VIRTIO_NET_CTRL_MAC,
                                                   VIRTIO_NET_CTRL_MAC_ADDR_SET,
                                                   &data, 1);
         if (unlikely(dev_written < 0)) {
@@ -735,7 +738,7 @@ static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n)
             .iov_len = mul_macs_size,
         },
     };
-    ssize_t dev_written = vhost_vdpa_net_load_cmd(s,
+    ssize_t dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
                                 VIRTIO_NET_CTRL_MAC,
                                 VIRTIO_NET_CTRL_MAC_TABLE_SET,
                                 data, ARRAY_SIZE(data));
@@ -750,7 +753,8 @@ static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n)
 }
 
 static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
-                                  const VirtIONet *n)
+                                  const VirtIONet *n,
+                                  void **out_cursor, void **in_cursor)
 {
     struct virtio_net_ctrl_mq mq;
     ssize_t dev_written;
@@ -764,7 +768,8 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
         .iov_base = &mq,
         .iov_len = sizeof(mq),
     };
-    dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_MQ,
+    dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                          VIRTIO_NET_CTRL_MQ,
                                           VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET,
                                           &data, 1);
     if (unlikely(dev_written < 0)) {
@@ -778,7 +783,8 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
 }
 
 static int vhost_vdpa_net_load_offloads(VhostVDPAState *s,
-                                        const VirtIONet *n)
+                                        const VirtIONet *n,
+                                        void **out_cursor, void **in_cursor)
 {
     uint64_t offloads;
     ssize_t dev_written;
@@ -809,7 +815,8 @@ static int vhost_vdpa_net_load_offloads(VhostVDPAState *s,
         .iov_base = &offloads,
         .iov_len = sizeof(offloads),
     };
-    dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_GUEST_OFFLOADS,
+    dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                          VIRTIO_NET_CTRL_GUEST_OFFLOADS,
                                           VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
                                           &data, 1);
     if (unlikely(dev_written < 0)) {
@@ -823,6 +830,7 @@ static int vhost_vdpa_net_load_offloads(VhostVDPAState *s,
 }
 
 static int vhost_vdpa_net_load_rx_mode(VhostVDPAState *s,
+                                       void **out_cursor, void **in_cursor,
                                        uint8_t cmd,
                                        uint8_t on)
 {
@@ -832,8 +840,8 @@ static int vhost_vdpa_net_load_rx_mode(VhostVDPAState *s,
     };
     ssize_t dev_written;
 
-    dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_RX,
-                                          cmd, &data, 1);
+    dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                          VIRTIO_NET_CTRL_RX, cmd, &data, 1);
     if (unlikely(dev_written < 0)) {
         return dev_written;
     }
@@ -845,7 +853,8 @@ static int vhost_vdpa_net_load_rx_mode(VhostVDPAState *s,
 }
 
 static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
-                                  const VirtIONet *n)
+                                  const VirtIONet *n,
+                                  void **out_cursor, void **in_cursor)
 {
     ssize_t r;
 
@@ -872,7 +881,8 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (!n->mac_table.uni_overflow && !n->promisc) {
-        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_PROMISC, 0);
+        r = vhost_vdpa_net_load_rx_mode(s, out_cursor, in_cursor,
+                                        VIRTIO_NET_CTRL_RX_PROMISC, 0);
         if (unlikely(r < 0)) {
             return r;
         }
@@ -896,7 +906,8 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->mac_table.multi_overflow || n->allmulti) {
-        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_ALLMULTI, 1);
+        r = vhost_vdpa_net_load_rx_mode(s, out_cursor, in_cursor,
+                                        VIRTIO_NET_CTRL_RX_ALLMULTI, 1);
         if (unlikely(r < 0)) {
             return r;
         }
@@ -917,7 +928,8 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->alluni) {
-        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_ALLUNI, 1);
+        r = vhost_vdpa_net_load_rx_mode(s, out_cursor, in_cursor,
+                                        VIRTIO_NET_CTRL_RX_ALLUNI, 1);
         if (r < 0) {
             return r;
         }
@@ -934,7 +946,8 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->nomulti) {
-        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_NOMULTI, 1);
+        r = vhost_vdpa_net_load_rx_mode(s, out_cursor, in_cursor,
+                                        VIRTIO_NET_CTRL_RX_NOMULTI, 1);
         if (r < 0) {
             return r;
         }
@@ -951,7 +964,8 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->nouni) {
-        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_NOUNI, 1);
+        r = vhost_vdpa_net_load_rx_mode(s, out_cursor, in_cursor,
+                                        VIRTIO_NET_CTRL_RX_NOUNI, 1);
         if (r < 0) {
             return r;
         }
@@ -968,7 +982,8 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
      * configuration only at live migration.
      */
     if (n->nobcast) {
-        r = vhost_vdpa_net_load_rx_mode(s, VIRTIO_NET_CTRL_RX_NOBCAST, 1);
+        r = vhost_vdpa_net_load_rx_mode(s, out_cursor, in_cursor,
+                                        VIRTIO_NET_CTRL_RX_NOBCAST, 1);
         if (r < 0) {
             return r;
         }
@@ -979,13 +994,15 @@ static int vhost_vdpa_net_load_rx(VhostVDPAState *s,
 
 static int vhost_vdpa_net_load_single_vlan(VhostVDPAState *s,
                                            const VirtIONet *n,
+                                           void **out_cursor, void **in_cursor,
                                            uint16_t vid)
 {
     const struct iovec data = {
         .iov_base = &vid,
         .iov_len = sizeof(vid),
     };
-    ssize_t dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_VLAN,
+    ssize_t dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                                  VIRTIO_NET_CTRL_VLAN,
                                                   VIRTIO_NET_CTRL_VLAN_ADD,
                                                   &data, 1);
     if (unlikely(dev_written < 0)) {
@@ -999,7 +1016,8 @@ static int vhost_vdpa_net_load_single_vlan(VhostVDPAState *s,
 }
 
 static int vhost_vdpa_net_load_vlan(VhostVDPAState *s,
-                                    const VirtIONet *n)
+                                    const VirtIONet *n,
+                                    void **out_cursor, void **in_cursor)
 {
     int r;
 
@@ -1010,7 +1028,8 @@ static int vhost_vdpa_net_load_vlan(VhostVDPAState *s,
     for (int i = 0; i < MAX_VLAN >> 5; i++) {
         for (int j = 0; n->vlans[i] && j <= 0x1f; j++) {
             if (n->vlans[i] & (1U << j)) {
-                r = vhost_vdpa_net_load_single_vlan(s, n, (i << 5) + j);
+                r = vhost_vdpa_net_load_single_vlan(s, n, out_cursor,
+                                                    in_cursor, (i << 5) + j);
                 if (unlikely(r != 0)) {
                     return r;
                 }
@@ -1028,6 +1047,8 @@ static int vhost_vdpa_net_load(NetClientState *nc)
     struct vhost_vdpa *v = &s->vhost_vdpa;
     const VirtIONet *n;
     int r;
+    void *out_cursor = s->cvq_cmd_out_buffer,
+         *in_cursor = s->status;
 
     assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 
@@ -1036,23 +1057,23 @@ static int vhost_vdpa_net_load(NetClientState *nc)
     }
 
     n = VIRTIO_NET(v->dev->vdev);
-    r = vhost_vdpa_net_load_mac(s, n);
+    r = vhost_vdpa_net_load_mac(s, n, &out_cursor, &in_cursor);
     if (unlikely(r < 0)) {
         return r;
     }
-    r = vhost_vdpa_net_load_mq(s, n);
+    r = vhost_vdpa_net_load_mq(s, n, &out_cursor, &in_cursor);
     if (unlikely(r)) {
         return r;
     }
-    r = vhost_vdpa_net_load_offloads(s, n);
+    r = vhost_vdpa_net_load_offloads(s, n, &out_cursor, &in_cursor);
     if (unlikely(r)) {
         return r;
     }
-    r = vhost_vdpa_net_load_rx(s, n);
+    r = vhost_vdpa_net_load_rx(s, n, &out_cursor, &in_cursor);
     if (unlikely(r)) {
         return r;
     }
-    r = vhost_vdpa_net_load_vlan(s, n);
+    r = vhost_vdpa_net_load_vlan(s, n, &out_cursor, &in_cursor);
     if (unlikely(r)) {
         return r;
     }
-- 
2.25.1
[PATCH v3 8/8] vdpa: Send cvq state load commands in parallel
Posted by Hawkins Jiawei 9 months, 3 weeks ago
This patch enables sending CVQ state load commands
in parallel at device startup by following steps:

  * Refactor vhost_vdpa_net_load_cmd() to iterate through
the control commands shadow buffers. This allows different
CVQ state load commands to use their own unique buffers.

  * Delay the polling and checking of buffers until either
the SVQ is full or control commands shadow buffers are full.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1578
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 net/vhost-vdpa.c | 157 +++++++++++++++++++++++++++++------------------
 1 file changed, 96 insertions(+), 61 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 795c9c1fd2..1ebb58f7f6 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -633,6 +633,26 @@ static uint16_t vhost_vdpa_net_svq_available_slots(VhostVDPAState *s)
     return vhost_svq_available_slots(svq);
 }
 
+/*
+ * Poll SVQ for multiple pending control commands and check the device's ack.
+ *
+ * Caller should hold the BQL when invoking this function.
+ */
+static ssize_t vhost_vdpa_net_svq_flush(VhostVDPAState *s,
+                                        size_t cmds_in_flight)
+{
+    vhost_vdpa_net_svq_poll(s, cmds_in_flight);
+
+    /* Device should and must use only one byte ack each control command */
+    assert(cmds_in_flight < vhost_vdpa_net_cvq_cmd_page_len());
+    for (int i = 0; i < cmds_in_flight; ++i) {
+        if (s->status[i] != VIRTIO_NET_OK) {
+            return -EIO;
+        }
+    }
+    return 0;
+}
+
 static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, void **out_cursor,
                                        void **in_cursor, uint8_t class,
                                        uint8_t cmd, const struct iovec *data_sg,
@@ -642,19 +662,41 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, void **out_cursor,
         .class = class,
         .cmd = cmd,
     };
-    size_t data_size = iov_size(data_sg, data_num);
+    size_t data_size = iov_size(data_sg, data_num),
+           left_bytes = vhost_vdpa_net_cvq_cmd_page_len() -
+                        (*out_cursor - s->cvq_cmd_out_buffer);
     /* Buffers for the device */
     struct iovec out = {
-        .iov_base = *out_cursor,
         .iov_len = sizeof(ctrl) + data_size,
     };
     struct iovec in = {
-        .iov_base = *in_cursor,
         .iov_len = sizeof(*s->status),
     };
     ssize_t r;
 
-    assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
+    if (sizeof(ctrl) > left_bytes || data_size > left_bytes - sizeof(ctrl) ||
+        vhost_vdpa_net_svq_available_slots(s) < 2) {
+        /*
+         * It is time to flush all pending control commands if SVQ is full
+         * or control commands shadow buffers are full.
+         *
+         * We can poll here since we've had BQL from the time
+         * we sent the descriptor.
+         */
+        r = vhost_vdpa_net_svq_flush(s, *in_cursor - (void *)s->status);
+        if (unlikely(r < 0)) {
+            return r;
+        }
+
+        *out_cursor = s->cvq_cmd_out_buffer;
+        *in_cursor = s->status;
+        left_bytes = vhost_vdpa_net_cvq_cmd_page_len();
+    }
+
+    out.iov_base = *out_cursor;
+    in.iov_base = *in_cursor;
+
+    assert(data_size <= left_bytes - sizeof(ctrl));
     /* Each CVQ command has one out descriptor and one in descriptor */
     assert(vhost_vdpa_net_svq_available_slots(s) >= 2);
 
@@ -670,11 +712,11 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, void **out_cursor,
         return r;
     }
 
-    /*
-     * We can poll here since we've had BQL from the time
-     * we sent the descriptor.
-     */
-    return vhost_vdpa_net_svq_poll(s, 1);
+    /* iterate the cursors */
+    *out_cursor += out.iov_len;
+    *in_cursor += in.iov_len;
+
+    return 0;
 }
 
 static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n,
@@ -685,15 +727,12 @@ static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n,
             .iov_base = (void *)n->mac,
             .iov_len = sizeof(n->mac),
         };
-        ssize_t dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
-                                                  VIRTIO_NET_CTRL_MAC,
-                                                  VIRTIO_NET_CTRL_MAC_ADDR_SET,
-                                                  &data, 1);
-        if (unlikely(dev_written < 0)) {
-            return dev_written;
-        }
-        if (*s->status != VIRTIO_NET_OK) {
-            return -EIO;
+        ssize_t r = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                               VIRTIO_NET_CTRL_MAC,
+                                               VIRTIO_NET_CTRL_MAC_ADDR_SET,
+                                               &data, 1);
+        if (unlikely(r < 0)) {
+            return r;
         }
     }
 
@@ -738,15 +777,12 @@ static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n,
             .iov_len = mul_macs_size,
         },
     };
-    ssize_t dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+    ssize_t r = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
                                 VIRTIO_NET_CTRL_MAC,
                                 VIRTIO_NET_CTRL_MAC_TABLE_SET,
                                 data, ARRAY_SIZE(data));
-    if (unlikely(dev_written < 0)) {
-        return dev_written;
-    }
-    if (*s->status != VIRTIO_NET_OK) {
-        return -EIO;
+    if (unlikely(r < 0)) {
+        return r;
     }
 
     return 0;
@@ -757,7 +793,7 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
                                   void **out_cursor, void **in_cursor)
 {
     struct virtio_net_ctrl_mq mq;
-    ssize_t dev_written;
+    ssize_t r;
 
     if (!virtio_vdev_has_feature(&n->parent_obj, VIRTIO_NET_F_MQ)) {
         return 0;
@@ -768,15 +804,12 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
         .iov_base = &mq,
         .iov_len = sizeof(mq),
     };
-    dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
-                                          VIRTIO_NET_CTRL_MQ,
-                                          VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET,
-                                          &data, 1);
-    if (unlikely(dev_written < 0)) {
-        return dev_written;
-    }
-    if (*s->status != VIRTIO_NET_OK) {
-        return -EIO;
+    r = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                   VIRTIO_NET_CTRL_MQ,
+                                   VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET,
+                                   &data, 1);
+    if (unlikely(r < 0)) {
+        return r;
     }
 
     return 0;
@@ -787,7 +820,7 @@ static int vhost_vdpa_net_load_offloads(VhostVDPAState *s,
                                         void **out_cursor, void **in_cursor)
 {
     uint64_t offloads;
-    ssize_t dev_written;
+    ssize_t r;
 
     if (!virtio_vdev_has_feature(&n->parent_obj,
                                  VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
@@ -815,15 +848,12 @@ static int vhost_vdpa_net_load_offloads(VhostVDPAState *s,
         .iov_base = &offloads,
         .iov_len = sizeof(offloads),
     };
-    dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
-                                          VIRTIO_NET_CTRL_GUEST_OFFLOADS,
-                                          VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
-                                          &data, 1);
-    if (unlikely(dev_written < 0)) {
-        return dev_written;
-    }
-    if (*s->status != VIRTIO_NET_OK) {
-        return -EIO;
+    r = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                   VIRTIO_NET_CTRL_GUEST_OFFLOADS,
+                                   VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
+                                   &data, 1);
+    if (unlikely(r < 0)) {
+        return r;
     }
 
     return 0;
@@ -838,15 +868,12 @@ static int vhost_vdpa_net_load_rx_mode(VhostVDPAState *s,
         .iov_base = &on,
         .iov_len = sizeof(on),
     };
-    ssize_t dev_written;
+    ssize_t r;
 
-    dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
-                                          VIRTIO_NET_CTRL_RX, cmd, &data, 1);
-    if (unlikely(dev_written < 0)) {
-        return dev_written;
-    }
-    if (*s->status != VIRTIO_NET_OK) {
-        return -EIO;
+    r = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                VIRTIO_NET_CTRL_RX, cmd, &data, 1);
+    if (unlikely(r < 0)) {
+        return r;
     }
 
     return 0;
@@ -1001,15 +1028,12 @@ static int vhost_vdpa_net_load_single_vlan(VhostVDPAState *s,
         .iov_base = &vid,
         .iov_len = sizeof(vid),
     };
-    ssize_t dev_written = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
-                                                  VIRTIO_NET_CTRL_VLAN,
-                                                  VIRTIO_NET_CTRL_VLAN_ADD,
-                                                  &data, 1);
-    if (unlikely(dev_written < 0)) {
-        return dev_written;
-    }
-    if (unlikely(*s->status != VIRTIO_NET_OK)) {
-        return -EIO;
+    ssize_t r = vhost_vdpa_net_load_cmd(s, out_cursor, in_cursor,
+                                           VIRTIO_NET_CTRL_VLAN,
+                                           VIRTIO_NET_CTRL_VLAN_ADD,
+                                           &data, 1);
+    if (unlikely(r < 0)) {
+        return r;
     }
 
     return 0;
@@ -1078,6 +1102,17 @@ static int vhost_vdpa_net_load(NetClientState *nc)
         return r;
     }
 
+    /*
+     * We need to poll and check all pending device's used buffers.
+     *
+     * We can poll here since we've had BQL from the time
+     * we sent the descriptor.
+     */
+    r = vhost_vdpa_net_svq_flush(s, in_cursor - (void *)s->status);
+    if (unlikely(r)) {
+        return r;
+    }
+
     return 0;
 }
 
-- 
2.25.1