[v4] Add packed virtqueue to shadow virtqueue

[RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 5 months ago

Hi,

There are two issues that I found while trying to test
my changes. I thought I would send the patch series
as well in case that helps in troubleshooting. I haven't
been able to find an issue in the implementation yet.
Maybe I am missing something.

I have been following the "Hands on vDPA: what do you do
when you ain't got the hardware v2 (Part 2)" [1] blog to
test my changes. To boot the L1 VM, I ran:

sudo ./qemu/build/qemu-system-x86_64 \
-enable-kvm \
-drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
-net nic,model=virtio \
-net user,hostfwd=tcp::2222-:22 \
-device intel-iommu,snoop-control=on \
-device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
-netdev tap,id=net0,script=no,downscript=no \
-nographic \
-m 8G \
-smp 4 \
-M q35 \
-cpu host 2>&1 | tee vm.log

Without "guest_uso4=off,guest_uso6=off,host_uso=off,
guest_announce=off" in "-device virtio-net-pci", QEMU
throws "vdpa svq does not work with features" [2] when
trying to boot L2.

The enums added in commit #2 of this series are new and
weren't in the earlier versions of the series. Without
this change, x-svq=true throws "SVQ invalid device feature
flags" [3] and x-svq is consequently disabled.
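
Both error messages come from feature checks: SVQ is only enabled when
every offered feature bit is one it can handle. The following is a
minimal sketch of such a mask check, not QEMU's actual
vhost_svq_valid_features(). The helper name svq_features_ok() and the
BIT_ULL macro are assumptions made for this example; the bit numbers
are the transport feature bits from the virtio specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Transport feature bit numbers from the virtio specification. */
#define VIRTIO_F_VERSION_1        32
#define VIRTIO_F_ACCESS_PLATFORM  33
#define VIRTIO_F_RING_PACKED      34

#define BIT_ULL(n) (1ULL << (n))

/*
 * Hypothetical helper: returns true when every offered bit is present
 * in the supported mask. On return, *unsupported holds the offered
 * bits that are missing from the mask (0 on success).
 */
static bool svq_features_ok(uint64_t offered, uint64_t supported,
                            uint64_t *unsupported)
{
    *unsupported = offered & ~supported;
    return *unsupported == 0;
}
```

With a supported mask that lacks VIRTIO_F_RING_PACKED, a device
offering packed=on fails this check, which is the kind of situation
the new enums are meant to avoid.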

The first issue is related to running traffic in L2
with vhost-vdpa.

In L0:

$ ip addr add 111.1.1.1/24 dev tap0
$ ip link set tap0 up
$ ip addr show tap0
4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
    inet 111.1.1.1/24 scope global tap0
       valid_lft forever preferred_lft forever
    inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll 
       valid_lft forever preferred_lft forever

I am able to run traffic in L2 when booting without
x-svq.

In L1:

$ ./qemu/build/qemu-system-x86_64 \
-nographic \
-m 4G \
-enable-kvm \
-M q35 \
-drive file=//root/L2.qcow2,media=disk,if=virtio \
-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
-device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
-smp 4 \
-cpu host \
2>&1 | tee vm.log

In L2:

# ip addr add 111.1.1.2/24 dev eth0
# ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
    altname enp0s7
    inet 111.1.1.2/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

# ip route
111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2

# ping 111.1.1.1 -w3
PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms

--- 111.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2034ms
rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms


But if I boot L2 with x-svq=true as shown below, I am unable
to ping the host machine.

$ ./qemu/build/qemu-system-x86_64 \
-nographic \
-m 4G \
-enable-kvm \
-M q35 \
-drive file=//root/L2.qcow2,media=disk,if=virtio \
-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
-device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
-smp 4 \
-cpu host \
2>&1 | tee vm.log

In L2:

# ip addr add 111.1.1.2/24 dev eth0
# ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
    altname enp0s7
    inet 111.1.1.2/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

# ip route
111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2

# ping 111.1.1.1 -w10
PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
ping: sendmsg: No route to host
From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
From 111.1.1.2 icmp_seq=3 Destination Host Unreachable

--- 111.1.1.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
pipe 3

The other issue is related to booting L2 with "x-svq=true"
and "packed=on".

In L1:

$ ./qemu/build/qemu-system-x86_64 \
-nographic \
-m 4G \
-enable-kvm \
-M q35 \
-drive file=//root/L2.qcow2,media=disk,if=virtio \
-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
-device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
-smp 4 \
-cpu host \
2>&1 | tee vm.log

The kernel throws "virtio_net virtio1: output.0:id 0 is not
a head!" [4].

Here's part of the trace:

[...]
[  945.370085] watchdog: BUG: soft lockup - CPU#2 stuck for 863s! [NetworkManager:795]
[  945.372467] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_g
[  945.387413] CPU: 2 PID: 795 Comm: NetworkManager Tainted: G             L     6.8.7-200.fc39.x86_64 #1
[  945.390685] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[  945.394256] RIP: 0010:virtnet_poll+0xd8/0x5c0 [virtio_net]
[  945.395998] Code: c0 74 5c 65 8b 05 24 37 8b 3f 41 89 86 c4 00 00 00 80 bb 40 04 00 00 00 75 32 48 8b 3b e8 00 00 28 c7 48 89 df be8
[  945.401465] RSP: 0018:ffffabaec0134e48 EFLAGS: 00000246
[  945.403362] RAX: ffff9bf904432000 RBX: ffff9bf9085b1800 RCX: 00000000ffff0001
[  945.405447] RDX: 0000000000008080 RSI: 0000000000000001 RDI: ffff9bf9085b1800
[  945.408361] RBP: ffff9bf9085b0808 R08: 0000000000000001 R09: ffffabaec0134ba8
[  945.410828] R10: ffffabaec0134ba0 R11: 0000000000000003 R12: ffff9bf905a34ac0
[  945.413272] R13: 0000000000000040 R14: ffff9bf905a34a00 R15: ffff9bf9085b0800
[  945.415180] FS:  00007fa81f0f1540(0000) GS:ffff9bf97bd00000(0000) knlGS:0000000000000000
[  945.418177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  945.419415] CR2: 000055614ba8dc48 CR3: 0000000102b42006 CR4: 0000000000770ef0
[  945.423312] PKRU: 55555554
[  945.424238] Call Trace:
[  945.424238]  <IRQ>
[  945.426236]  ? watchdog_timer_fn+0x1e6/0x270
[  945.427304]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  945.428239]  ? __hrtimer_run_queues+0x10f/0x2b0
[  945.431304]  ? hrtimer_interrupt+0xf8/0x230
[  945.432236]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
[  945.434187]  ? sysvec_apic_timer_interrupt+0x39/0x90
[  945.436306]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  945.438199]  ? virtnet_poll+0xd8/0x5c0 [virtio_net]
[  945.438199]  ? virtnet_poll+0xd0/0x5c0 [virtio_net]
[  945.440197]  ? handle_irq_event+0x50/0x80
[  945.442415]  ? sched_clock_cpu+0x5e/0x190
[  945.444563]  ? irqtime_account_irq+0x40/0xc0
[  945.446191]  __napi_poll+0x28/0x1c0
[  945.446191]  net_rx_action+0x2a4/0x380
[  945.448851]  ? _raw_spin_unlock_irqrestore+0xe/0x40
[  945.450209]  ? note_gp_changes+0x6c/0x80
[  945.452252]  __do_softirq+0xc9/0x2c8
[  945.453579]  do_softirq.part.0+0x3d/0x60
[  945.454188]  </IRQ>
[  945.454188]  <TASK>
[  945.456175]  __local_bh_enable_ip+0x68/0x70
[  945.458373]  virtnet_open+0xdc/0x310 [virtio_net]
[  945.460005]  __dev_open+0xfa/0x1b0
[  945.461310]  __dev_change_flags+0x1dc/0x250
[  945.462800]  dev_change_flags+0x26/0x70
[  945.464190]  do_setlink+0x375/0x12d0
[...]
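
For reference, the "is not a head!" message is what the Linux virtio
ring code prints when the device reports a used buffer id for which
the driver has no buffer outstanding. A minimal sketch of that kind of
check is shown below. The names (desc_state, check_used_id, the status
enum) and the fixed ring size are hypothetical and only loosely
modeled on the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4 /* illustrative size, not taken from the thread */

/* Per-id driver bookkeeping: data is non-NULL only while in flight. */
struct desc_state {
    void *data;
};

enum get_buf_status {
    GET_BUF_OK,
    GET_BUF_OUT_OF_RANGE, /* id does not fit in the ring */
    GET_BUF_NOT_A_HEAD    /* id has no outstanding buffer */
};

/*
 * Validate an id reported as used by the device and hand back the
 * buffer cookie that was stored when the buffer was made available.
 */
static enum get_buf_status check_used_id(struct desc_state *state,
                                         uint16_t id, void **data)
{
    if (id >= RING_SIZE) {
        return GET_BUF_OUT_OF_RANGE;
    }
    if (state[id].data == NULL) {
        return GET_BUF_NOT_A_HEAD;
    }
    *data = state[id].data;
    state[id].data = NULL; /* the id is free again */
    return GET_BUF_OK;
}
```

In this model, "id 0 is not a head" means the driver read id 0 from a
descriptor it considered used, but had nothing in flight under that id.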

I am not sure if this issue is similar to the one
described in this patch (race between channels
setting and refill) [5]. As described in the patch,
I see drivers/net/virtio_net.c:virtnet_open() invoke
try_fill_recv() and schedule_delayed_work() [6]. I
am unfamiliar with this code, so I am not sure how to
proceed.

Maybe I can try disabling NAPI and check whether the
lockup still occurs, if that is possible. Would this
be a good next step to troubleshoot the soft lockup?

Thanks,
Sahil

Changes v3 -> v4:
- Split commit #1 of v3 into commit #1 and #2 in
  this series [7].
- Commit #3 is commit #2 of v3.
- Commit #4 is based on commit #3 of v3.
- Commit #5 was sent as an individual patch [8].
- vhost-shadow-virtqueue.c
  (vhost_svq_valid_features): Add enums.
  (vhost_svq_memory_packed): Remove function.
  (vhost_svq_driver_area_size,vhost_svq_descriptor_area_size): Decouple functions.
  (vhost_svq_device_area_size): Rewrite function.
  (vhost_svq_start): Simplify implementation.
  (vhost_svq_stop): Unconditionally munmap().
- vhost-shadow-virtqueue.h: New function declaration.
- vhost-vdpa.c
  (vhost_vdpa_svq_unmap_rings): Call vhost_vdpa_svq_unmap_ring().
  (vhost_vdpa_svq_map_rings): New mappings.
  (vhost_vdpa_svq_setup): Add comment.

[1] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/net/vhost-vdpa.c#L167
[3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L58
[4] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1763
[5] https://lkml.iu.edu/hypermail/linux/kernel/1307.0/01455.html
[6] https://github.com/torvalds/linux/blob/master/drivers/net/virtio_net.c#L3104
[7] https://lists.nongnu.org/archive/html/qemu-devel/2024-08/msg01148.html
[8] https://lists.nongnu.org/archive/html/qemu-devel/2024-11/msg00598.html

Sahil Siddiq (5):
  vhost: Refactor vhost_svq_add_split
  vhost: Write descriptors to packed svq
  vhost: Data structure changes to support packed vqs
  vdpa: Allocate memory for svq and map them to vdpa
  vdpa: Support setting vring_base for packed svq

 hw/virtio/vhost-shadow-virtqueue.c | 222 +++++++++++++++++++----------
 hw/virtio/vhost-shadow-virtqueue.h |  70 ++++++---
 hw/virtio/vhost-vdpa.c             |  47 +++++-
 3 files changed, 237 insertions(+), 102 deletions(-)

-- 
2.47.0

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 4 months, 3 weeks ago

On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> There are two issues that I found while trying to test
> my changes. I thought I would send the patch series
> as well in case that helps in troubleshooting. I haven't
> been able to find an issue in the implementation yet.
> Maybe I am missing something.
>
> I have been following the "Hands on vDPA: what do you do
> when you ain't got the hardware v2 (Part 2)" [1] blog to
> test my changes. To boot the L1 VM, I ran:
>
> sudo ./qemu/build/qemu-system-x86_64 \
> -enable-kvm \
> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> -net nic,model=virtio \
> -net user,hostfwd=tcp::2222-:22 \
> -device intel-iommu,snoop-control=on \
> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> -netdev tap,id=net0,script=no,downscript=no \
> -nographic \
> -m 8G \
> -smp 4 \
> -M q35 \
> -cpu host 2>&1 | tee vm.log
>
> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
> guest_announce=off" in "-device virtio-net-pci", QEMU
> throws "vdpa svq does not work with features" [2] when
> trying to boot L2.
>
> The enums added in commit #2 in this series is new and
> wasn't in the earlier versions of the series. Without
> this change, x-svq=true throws "SVQ invalid device feature
> flags" [3] and x-svq is consequently disabled.
>
> The first issue is related to running traffic in L2
> with vhost-vdpa.
>
> In L0:
>
> $ ip addr add 111.1.1.1/24 dev tap0
> $ ip link set tap0 up
> $ ip addr show tap0
> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>     link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>     inet 111.1.1.1/24 scope global tap0
>        valid_lft forever preferred_lft forever
>     inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>        valid_lft forever preferred_lft forever
>
> I am able to run traffic in L2 when booting without
> x-svq.
>
> In L1:
>
> $ ./qemu/build/qemu-system-x86_64 \
> -nographic \
> -m 4G \
> -enable-kvm \
> -M q35 \
> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> -smp 4 \
> -cpu host \
> 2>&1 | tee vm.log
>
> In L2:
>
> # ip addr add 111.1.1.2/24 dev eth0
> # ip addr show eth0
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>     link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>     altname enp0s7
>     inet 111.1.1.2/24 scope global eth0
>        valid_lft forever preferred_lft forever
>     inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>        valid_lft forever preferred_lft forever
>
> # ip route
> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>
> # ping 111.1.1.1 -w3
> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
>
> --- 111.1.1.1 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
>
>
> But if I boot L2 with x-svq=true as shown below, I am unable
> to ping the host machine.
>
> $ ./qemu/build/qemu-system-x86_64 \
> -nographic \
> -m 4G \
> -enable-kvm \
> -M q35 \
> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> -smp 4 \
> -cpu host \
> 2>&1 | tee vm.log
>
> In L2:
>
> # ip addr add 111.1.1.2/24 dev eth0
> # ip addr show eth0
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>     link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>     altname enp0s7
>     inet 111.1.1.2/24 scope global eth0
>        valid_lft forever preferred_lft forever
>     inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>        valid_lft forever preferred_lft forever
>
> # ip route
> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>
> # ping 111.1.1.1 -w10
> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
> ping: sendmsg: No route to host
> From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
> From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
>
> --- 111.1.1.1 ping statistics ---
> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
> pipe 3
>
> The other issue is related to booting L2 with "x-svq=true"
> and "packed=on".
>
> In L1:
>
> $ ./qemu/build/qemu-system-x86_64 \
> -nographic \
> -m 4G \
> -enable-kvm \
> -M q35 \
> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> -smp 4 \
> -cpu host \
> 2>&1 | tee vm.log
>
> The kernel throws "virtio_net virtio1: output.0:id 0 is not
> a head!" [4].
>

So this series implements the descriptor forwarding from the guest to
the device in packed vq. We also need to forward the descriptors from
the device to the guest. The device writes them in the SVQ ring.

The functions responsible for that in QEMU are
hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which runs when the
device writes used descriptors to the SVQ, and
hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf, which it calls. We
need modifications similar to those in vhost_svq_add: make them
conditional on whether we're in split or packed vq, and "copy" the code
from Linux's drivers/virtio/virtio_ring.c:virtqueue_get_buf.

After these modifications you should be able to ping and forward
traffic. As always, it is totally ok if it needs more than one
iteration, and feel free to ask any question you have :).
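
As a rough illustration of the device-to-guest direction described
above, the sketch below shows how a packed-ring driver can decide
whether a descriptor has been marked used. It follows the rule in the
virtio specification: the AVAIL and USED flag bits must be equal to
each other and to the driver's used wrap counter. It is modeled on the
logic in Linux's virtio_ring.c but is not that code. The struct and
function names (packed_desc, used_cursor, desc_is_used, advance_used)
are made up for this example, and endianness conversion and memory
barriers are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bit positions for packed descriptors (virtio specification). */
#define VRING_PACKED_DESC_F_AVAIL 7
#define VRING_PACKED_DESC_F_USED  15

/* Packed descriptor layout; host endianness assumed for simplicity. */
struct packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Driver-side position in the ring when consuming used descriptors. */
struct used_cursor {
    uint16_t last_used_idx;
    bool used_wrap_counter; /* starts as true for a fresh ring */
};

/* A descriptor is used when AVAIL == USED == used wrap counter. */
static bool desc_is_used(const struct packed_desc *desc,
                         bool used_wrap_counter)
{
    bool avail = !!(desc->flags & (1u << VRING_PACKED_DESC_F_AVAIL));
    bool used = !!(desc->flags & (1u << VRING_PACKED_DESC_F_USED));

    return avail == used && used == used_wrap_counter;
}

/* Skip the descriptors of a consumed buffer, flipping the counter on wrap. */
static void advance_used(struct used_cursor *c, uint16_t ring_size,
                         uint16_t ndescs)
{
    c->last_used_idx += ndescs;
    if (c->last_used_idx >= ring_size) {
        c->last_used_idx -= ring_size;
        c->used_wrap_counter = !c->used_wrap_counter;
    }
}
```

A packed version of vhost_svq_get_buf would presumably need the same
two pieces: the used check at last_used_idx, and the wrap-aware
advance by the number of descriptors in the returned chain.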

> Here's part of the trace:
>
> [...]
> [  945.370085] watchdog: BUG: soft lockup - CPU#2 stuck for 863s! [NetworkManager:795]
> [  945.372467] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_g
> [  945.387413] CPU: 2 PID: 795 Comm: NetworkManager Tainted: G             L     6.8.7-200.fc39.x86_64 #1
> [  945.390685] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [  945.394256] RIP: 0010:virtnet_poll+0xd8/0x5c0 [virtio_net]
> [  945.395998] Code: c0 74 5c 65 8b 05 24 37 8b 3f 41 89 86 c4 00 00 00 80 bb 40 04 00 00 00 75 32 48 8b 3b e8 00 00 28 c7 48 89 df be8
> [  945.401465] RSP: 0018:ffffabaec0134e48 EFLAGS: 00000246
> [  945.403362] RAX: ffff9bf904432000 RBX: ffff9bf9085b1800 RCX: 00000000ffff0001
> [  945.405447] RDX: 0000000000008080 RSI: 0000000000000001 RDI: ffff9bf9085b1800
> [  945.408361] RBP: ffff9bf9085b0808 R08: 0000000000000001 R09: ffffabaec0134ba8
> [  945.410828] R10: ffffabaec0134ba0 R11: 0000000000000003 R12: ffff9bf905a34ac0
> [  945.413272] R13: 0000000000000040 R14: ffff9bf905a34a00 R15: ffff9bf9085b0800
> [  945.415180] FS:  00007fa81f0f1540(0000) GS:ffff9bf97bd00000(0000) knlGS:0000000000000000
> [  945.418177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  945.419415] CR2: 000055614ba8dc48 CR3: 0000000102b42006 CR4: 0000000000770ef0
> [  945.423312] PKRU: 55555554
> [  945.424238] Call Trace:
> [  945.424238]  <IRQ>
> [  945.426236]  ? watchdog_timer_fn+0x1e6/0x270
> [  945.427304]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [  945.428239]  ? __hrtimer_run_queues+0x10f/0x2b0
> [  945.431304]  ? hrtimer_interrupt+0xf8/0x230
> [  945.432236]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
> [  945.434187]  ? sysvec_apic_timer_interrupt+0x39/0x90
> [  945.436306]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  945.438199]  ? virtnet_poll+0xd8/0x5c0 [virtio_net]
> [  945.438199]  ? virtnet_poll+0xd0/0x5c0 [virtio_net]
> [  945.440197]  ? handle_irq_event+0x50/0x80
> [  945.442415]  ? sched_clock_cpu+0x5e/0x190
> [  945.444563]  ? irqtime_account_irq+0x40/0xc0
> [  945.446191]  __napi_poll+0x28/0x1c0
> [  945.446191]  net_rx_action+0x2a4/0x380
> [  945.448851]  ? _raw_spin_unlock_irqrestore+0xe/0x40
> [  945.450209]  ? note_gp_changes+0x6c/0x80
> [  945.452252]  __do_softirq+0xc9/0x2c8
> [  945.453579]  do_softirq.part.0+0x3d/0x60
> [  945.454188]  </IRQ>
> [  945.454188]  <TASK>
> [  945.456175]  __local_bh_enable_ip+0x68/0x70
> [  945.458373]  virtnet_open+0xdc/0x310 [virtio_net]
> [  945.460005]  __dev_open+0xfa/0x1b0
> [  945.461310]  __dev_change_flags+0x1dc/0x250
> [  945.462800]  dev_change_flags+0x26/0x70
> [  945.464190]  do_setlink+0x375/0x12d0
> [...]
>
> I am not sure if this issue is similar to the one
> described in this patch (race between channels
> setting and refill) [5]. As described in the patch,
> I see drivers/net/virtio_net:virtnet_open invoke
> try_fill_recv() and schedule_delayed_work() [6]. I
> am unfamiliar with this and so I am not sure how to
> progress.
>
> Maybe I can try disabling napi and checking it out
> if that is possible. Would this be a good next step
> to troubleshoot the kernel crash?
>
> Thanks,
> Sahil
>
> Changes v3 -> v4:
> - Split commit #1 of v3 into commit #1 and #2 in
>   this series [7].
> - Commit #3 is commit #2 of v3.
> - Commit #4 is based on commit #3 of v3.
> - Commit #5 was sent as an individual patch [8].
> - vhost-shadow-virtqueue.c
>   (vhost_svq_valid_features): Add enums.
>   (vhost_svq_memory_packed): Remove function.
>   (vhost_svq_driver_area_size,vhost_svq_descriptor_area_size): Decouple functions.
>   (vhost_svq_device_area_size): Rewrite function.
>   (vhost_svq_start): Simplify implementation.
>   (vhost_svq_stop): Unconditionally munmap().
> - vhost-shadow-virtqueue.h: New function declaration.
> - vhost-vdpa.c
>   (vhost_vdpa_svq_unmap_rings): Call vhost_vdpa_svq_unmap_ring().
>   (vhost_vdpa_svq_map_rings): New mappings.
>   (vhost_vdpa_svq_setup): Add comment.
>
> [1] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/net/vhost-vdpa.c#L167
> [3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L58
> [4] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1763
> [5] https://lkml.iu.edu/hypermail/linux/kernel/1307.0/01455.html
> [6] https://github.com/torvalds/linux/blob/master/drivers/net/virtio_net.c#L3104
> [7] https://lists.nongnu.org/archive/html/qemu-devel/2024-08/msg01148.html
> [8] https://lists.nongnu.org/archive/html/qemu-devel/2024-11/msg00598.html
>
> Sahil Siddiq (5):
>   vhost: Refactor vhost_svq_add_split
>   vhost: Write descriptors to packed svq
>   vhost: Data structure changes to support packed vqs
>   vdpa: Allocate memory for svq and map them to vdpa
>   vdpa: Support setting vring_base for packed svq
>
>  hw/virtio/vhost-shadow-virtqueue.c | 222 +++++++++++++++++++----------
>  hw/virtio/vhost-shadow-virtqueue.h |  70 ++++++---
>  hw/virtio/vhost-vdpa.c             |  47 +++++-
>  3 files changed, 237 insertions(+), 102 deletions(-)
>
> --
> 2.47.0
>

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 4 months, 3 weeks ago

Hi,

On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> There are two issues that I found while trying to test
>> my changes. I thought I would send the patch series
>> as well in case that helps in troubleshooting. I haven't
>> been able to find an issue in the implementation yet.
>> Maybe I am missing something.
>>
>> I have been following the "Hands on vDPA: what do you do
>> when you ain't got the hardware v2 (Part 2)" [1] blog to
>> test my changes. To boot the L1 VM, I ran:
>>
>> sudo ./qemu/build/qemu-system-x86_64 \
>> -enable-kvm \
>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>> -net nic,model=virtio \
>> -net user,hostfwd=tcp::2222-:22 \
>> -device intel-iommu,snoop-control=on \
>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
>> -netdev tap,id=net0,script=no,downscript=no \
>> -nographic \
>> -m 8G \
>> -smp 4 \
>> -M q35 \
>> -cpu host 2>&1 | tee vm.log
>>
>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
>> guest_announce=off" in "-device virtio-net-pci", QEMU
>> throws "vdpa svq does not work with features" [2] when
>> trying to boot L2.
>>
>> The enums added in commit #2 in this series is new and
>> wasn't in the earlier versions of the series. Without
>> this change, x-svq=true throws "SVQ invalid device feature
>> flags" [3] and x-svq is consequently disabled.
>>
>> The first issue is related to running traffic in L2
>> with vhost-vdpa.
>>
>> In L0:
>>
>> $ ip addr add 111.1.1.1/24 dev tap0
>> $ ip link set tap0 up
>> $ ip addr show tap0
>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>      link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>      inet 111.1.1.1/24 scope global tap0
>>         valid_lft forever preferred_lft forever
>>      inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>         valid_lft forever preferred_lft forever
>>
>> I am able to run traffic in L2 when booting without
>> x-svq.
>>
>> In L1:
>>
>> $ ./qemu/build/qemu-system-x86_64 \
>> -nographic \
>> -m 4G \
>> -enable-kvm \
>> -M q35 \
>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>> -smp 4 \
>> -cpu host \
>> 2>&1 | tee vm.log
>>
>> In L2:
>>
>> # ip addr add 111.1.1.2/24 dev eth0
>> # ip addr show eth0
>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>      link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>      altname enp0s7
>>      inet 111.1.1.2/24 scope global eth0
>>         valid_lft forever preferred_lft forever
>>      inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>         valid_lft forever preferred_lft forever
>>
>> # ip route
>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>
>> # ping 111.1.1.1 -w3
>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
>>
>> --- 111.1.1.1 ping statistics ---
>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
>>
>>
>> But if I boot L2 with x-svq=true as shown below, I am unable
>> to ping the host machine.
>>
>> $ ./qemu/build/qemu-system-x86_64 \
>> -nographic \
>> -m 4G \
>> -enable-kvm \
>> -M q35 \
>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>> -smp 4 \
>> -cpu host \
>> 2>&1 | tee vm.log
>>
>> In L2:
>>
>> # ip addr add 111.1.1.2/24 dev eth0
>> # ip addr show eth0
>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>      link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>      altname enp0s7
>>      inet 111.1.1.2/24 scope global eth0
>>         valid_lft forever preferred_lft forever
>>      inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>         valid_lft forever preferred_lft forever
>>
>> # ip route
>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>
>> # ping 111.1.1.1 -w10
>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>  From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
>> ping: sendmsg: No route to host
>>  From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
>>  From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
>>
>> --- 111.1.1.1 ping statistics ---
>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
>> pipe 3
>>
>> The other issue is related to booting L2 with "x-svq=true"
>> and "packed=on".
>>
>> In L1:
>>
>> $ ./qemu/build/qemu-system-x86_64 \
>> -nographic \
>> -m 4G \
>> -enable-kvm \
>> -M q35 \
>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
>> -smp 4 \
>> -cpu host \
>> 2>&1 | tee vm.log
>>
>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
>> a head!" [4].
>>
> 
> So this series implements the descriptor forwarding from the guest to
> the device in packed vq. We also need to forward the descriptors from
> the device to the guest. The device writes them in the SVQ ring.
> 
> The functions responsible for that in QEMU are
> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
> the device when used descriptors are written to the SVQ, which calls
> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
> modifications similar to vhost_svq_add: Make them conditional if we're
> in split or packed vq, and "copy" the code from Linux's
> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
> 
> After these modifications you should be able to ping and forward
> traffic. As always, It is totally ok if it needs more than one
> iteration, and feel free to ask any question you have :).
> 

I misunderstood this part. While working on extending
hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
for packed vqs, I realized that this function and
vhost_svq_flush() already support split vqs. However, I am
unable to ping L0 when booting L2 with "x-svq=true" and
"packed=off" or when the "packed" option is not specified
in QEMU's command line.

I tried debugging these functions for split vqs after running
the following QEMU commands while following the blog [2].

Booting L1:

$ sudo ./qemu/build/qemu-system-x86_64 \
-enable-kvm \
-drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
-net nic,model=virtio \
-net user,hostfwd=tcp::2222-:22 \
-device intel-iommu,snoop-control=on \
-device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
-netdev tap,id=net0,script=no,downscript=no \
-nographic \
-m 8G \
-smp 4 \
-M q35 \
-cpu host 2>&1 | tee vm.log

Booting L2:

# ./qemu/build/qemu-system-x86_64 \
-nographic \
-m 4G \
-enable-kvm \
-M q35 \
-drive file=//root/L2.qcow2,media=disk,if=virtio \
-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
-device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
-smp 4 \
-cpu host \
2>&1 | tee vm.log

I printed out the contents of the VirtQueueElement returned
by vhost_svq_get_buf() in vhost_svq_flush() [3].
I noticed that the "len" set by vhost_svq_get_buf() is
always 0, while VirtQueueElement.len is non-zero.
I haven't understood the difference between these two "len"s.

The "len" that is set to 0 is then passed to virtqueue_fill()
(defined in virtio.c) [4]. Could this point to why I am not able
to ping L0 from L2?

Thanks,
Sahil

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L418
[2] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
[3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L488
[4] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L501
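
On the "len" question above: in the split layout, each used-ring entry
carries an id and a len, and per the virtio specification that len is
the number of bytes the device wrote into the buffer. The sketch below
shows how a driver reads such entries. It assumes host endianness, no
memory barriers, and an illustrative ring size, and the names
(used_elem, used_ring, pop_used) are hypothetical rather than QEMU's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 4 /* illustrative; split ring sizes are powers of two */

/* One used-ring entry: head id of the chain and bytes written by the device. */
struct used_elem {
    uint32_t id;
    uint32_t len;
};

struct used_ring {
    uint16_t flags;
    uint16_t idx; /* free-running index of the next entry the device writes */
    struct used_elem ring[RING_SIZE];
};

/*
 * Copy out the next used entry, if any, and advance the driver's
 * private last_used_idx. Returns false when nothing new is pending.
 */
static bool pop_used(const struct used_ring *used, uint16_t *last_used_idx,
                     struct used_elem *out)
{
    if (*last_used_idx == used->idx) {
        return false;
    }
    *out = used->ring[*last_used_idx % RING_SIZE];
    (*last_used_idx)++;
    return true;
}
```

If the value reported by vhost_svq_get_buf() is this used-ring len,
then 0 would be the expected value for transmit buffers, since the
device only reads them, and would not by itself explain the failed
ping.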

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 4 months, 3 weeks ago

On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
> > On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> There are two issues that I found while trying to test
> >> my changes. I thought I would send the patch series
> >> as well in case that helps in troubleshooting. I haven't
> >> been able to find an issue in the implementation yet.
> >> Maybe I am missing something.
> >>
> >> I have been following the "Hands on vDPA: what do you do
> >> when you ain't got the hardware v2 (Part 2)" [1] blog to
> >> test my changes. To boot the L1 VM, I ran:
> >>
> >> sudo ./qemu/build/qemu-system-x86_64 \
> >> -enable-kvm \
> >> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >> -net nic,model=virtio \
> >> -net user,hostfwd=tcp::2222-:22 \
> >> -device intel-iommu,snoop-control=on \
> >> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> >> -netdev tap,id=net0,script=no,downscript=no \
> >> -nographic \
> >> -m 8G \
> >> -smp 4 \
> >> -M q35 \
> >> -cpu host 2>&1 | tee vm.log
> >>
> >> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
> >> guest_announce=off" in "-device virtio-net-pci", QEMU
> >> throws "vdpa svq does not work with features" [2] when
> >> trying to boot L2.
> >>
> >> The enums added in commit #2 in this series is new and
> >> wasn't in the earlier versions of the series. Without
> >> this change, x-svq=true throws "SVQ invalid device feature
> >> flags" [3] and x-svq is consequently disabled.
> >>
> >> The first issue is related to running traffic in L2
> >> with vhost-vdpa.
> >>
> >> In L0:
> >>
> >> $ ip addr add 111.1.1.1/24 dev tap0
> >> $ ip link set tap0 up
> >> $ ip addr show tap0
> >> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>      link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>      inet 111.1.1.1/24 scope global tap0
> >>         valid_lft forever preferred_lft forever
> >>      inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>         valid_lft forever preferred_lft forever
> >>
> >> I am able to run traffic in L2 when booting without
> >> x-svq.
> >>
> >> In L1:
> >>
> >> $ ./qemu/build/qemu-system-x86_64 \
> >> -nographic \
> >> -m 4G \
> >> -enable-kvm \
> >> -M q35 \
> >> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
> >> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >> -smp 4 \
> >> -cpu host \
> >> 2>&1 | tee vm.log
> >>
> >> In L2:
> >>
> >> # ip addr add 111.1.1.2/24 dev eth0
> >> # ip addr show eth0
> >> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>      link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>      altname enp0s7
> >>      inet 111.1.1.2/24 scope global eth0
> >>         valid_lft forever preferred_lft forever
> >>      inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>         valid_lft forever preferred_lft forever
> >>
> >> # ip route
> >> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>
> >> # ping 111.1.1.1 -w3
> >> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
> >> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
> >> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
> >>
> >> --- 111.1.1.1 ping statistics ---
> >> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
> >> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
> >>
> >>
> >> But if I boot L2 with x-svq=true as shown below, I am unable
> >> to ping the host machine.
> >>
> >> $ ./qemu/build/qemu-system-x86_64 \
> >> -nographic \
> >> -m 4G \
> >> -enable-kvm \
> >> -M q35 \
> >> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >> -smp 4 \
> >> -cpu host \
> >> 2>&1 | tee vm.log
> >>
> >> In L2:
> >>
> >> # ip addr add 111.1.1.2/24 dev eth0
> >> # ip addr show eth0
> >> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>      link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>      altname enp0s7
> >>      inet 111.1.1.2/24 scope global eth0
> >>         valid_lft forever preferred_lft forever
> >>      inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>         valid_lft forever preferred_lft forever
> >>
> >> # ip route
> >> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>
> >> # ping 111.1.1.1 -w10
> >> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>  From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
> >> ping: sendmsg: No route to host
> >>  From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
> >>  From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
> >>
> >> --- 111.1.1.1 ping statistics ---
> >> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
> >> pipe 3
> >>
> >> The other issue is related to booting L2 with "x-svq=true"
> >> and "packed=on".
> >>
> >> In L1:
> >>
> >> $ ./qemu/build/qemu-system-x86_64 \
> >> -nographic \
> >> -m 4G \
> >> -enable-kvm \
> >> -M q35 \
> >> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
> >> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> >> -smp 4 \
> >> -cpu host \
> >> 2>&1 | tee vm.log
> >>
> >> The kernel throws "virtio_net virtio1: output.0:id 0 is not
> >> a head!" [4].
> >>
> >
> > So this series implements the descriptor forwarding from the guest to
> > the device in packed vq. We also need to forward the descriptors from
> > the device to the guest. The device writes them in the SVQ ring.
> >
> > The functions responsible for that in QEMU are
> > hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
> > the device when used descriptors are written to the SVQ, which calls
> > hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
> > modifications similar to vhost_svq_add: Make them conditional if we're
> > in split or packed vq, and "copy" the code from Linux's
> > drivers/virtio/virtio_ring.c:virtqueue_get_buf.
> >
> > After these modifications you should be able to ping and forward
> > traffic. As always, It is totally ok if it needs more than one
> > iteration, and feel free to ask any question you have :).
> >
>
> I misunderstood this part. While working on extending
> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
> for packed vqs, I realized that this function and
> vhost_svq_flush() already support split vqs. However, I am
> unable to ping L0 when booting L2 with "x-svq=true" and
> "packed=off" or when the "packed" option is not specified
> in QEMU's command line.
>
> I tried debugging these functions for split vqs after running
> the following QEMU commands while following the blog [2].
>
> Booting L1:
>
> $ sudo ./qemu/build/qemu-system-x86_64 \
> -enable-kvm \
> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> -net nic,model=virtio \
> -net user,hostfwd=tcp::2222-:22 \
> -device intel-iommu,snoop-control=on \
> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> -netdev tap,id=net0,script=no,downscript=no \
> -nographic \
> -m 8G \
> -smp 4 \
> -M q35 \
> -cpu host 2>&1 | tee vm.log
>
> Booting L2:
>
> # ./qemu/build/qemu-system-x86_64 \
> -nographic \
> -m 4G \
> -enable-kvm \
> -M q35 \
> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> -smp 4 \
> -cpu host \
> 2>&1 | tee vm.log
>
> I printed out the contents of VirtQueueElement returned
> by vhost_svq_get_buf() in vhost_svq_flush() [3].
> I noticed that "len" which is set by "vhost_svq_get_buf"
> is always set to 0 while VirtQueueElement.len is non-zero.
> I haven't understood the difference between these two "len"s.
>

VirtQueueElement.len is the length of the buffer, while the len of
vhost_svq_get_buf is the bytes written by the device. In the case of
the tx queue, VirtQueueElement.len is the length of the tx packet, and
the vhost_svq_get_buf len is always 0 as the device does not write. In
the case of rx, VirtQueueElement.len is the available length for an rx
frame, and the vhost_svq_get_buf len is the actual length written by
the device.

To be 100% accurate, an rx packet can span multiple buffers, but
SVQ does not need special code to handle this.

So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
0), and 0 for tx queue (svq->vq->index % 2 == 1).
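
As a rough illustration of the two lengths, here is a minimal sketch in
C. The struct and helper names are hypothetical and are not QEMU types;
the sketch only covers the data queues (even index = rx, odd index = tx)
and deliberately ignores the control vq.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one used buffer; not QEMU's VirtQueueElement. */
struct used_buf_sketch {
    unsigned vq_index;     /* data queue index: even = rx, odd = tx */
    uint32_t buf_len;      /* buffer length, the role of VirtQueueElement.len */
    uint32_t written_len;  /* bytes written by the device, the get_buf len */
};

static bool is_tx(unsigned vq_index)
{
    return vq_index % 2 == 1;
}

/* True when written_len is consistent with the direction of the queue. */
static bool written_len_plausible(const struct used_buf_sketch *b)
{
    if (is_tx(b->vq_index)) {
        /* The device only reads tx buffers, so it writes nothing back. */
        return b->written_len == 0;
    }
    /* For rx the device writes a frame that must fit in the buffer. */
    return b->written_len > 0 && b->written_len <= b->buf_len;
}
```

With this model, a used tx buffer with written_len == 0 is expected,
while a used rx buffer with written_len == 0 would be the suspicious
case.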

Take into account that vhost_svq_get_buf only handles split vq at the
moment! It should be renamed or split into vhost_svq_get_buf_split.
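
A possible shape for that split/packed dispatch is sketched below. This
is not the QEMU implementation: svq_sketch and its fields are
hypothetical, memory barriers and endianness conversions are omitted,
and each buffer is assumed to use a single descriptor. Only the flag bit
positions (7 for AVAIL, 15 for USED) and the ring entry layouts follow
the virtio spec.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_F_AVAIL (1u << 7)   /* packed descriptor AVAIL flag bit */
#define DESC_F_USED  (1u << 15)  /* packed descriptor USED flag bit */

struct used_elem   { uint32_t id; uint32_t len; };   /* split used ring entry */
struct packed_desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };

/* Hypothetical, simplified SVQ state; field names are illustrative only. */
struct svq_sketch {
    bool is_packed;
    uint16_t num;               /* ring size */
    uint16_t last_used_idx;     /* split: free-running; packed: ring slot */
    bool used_wrap_counter;     /* packed only */
    uint16_t used_idx;          /* split only: idx written by the device */
    struct used_elem *used_ring;    /* split only */
    struct packed_desc *desc;       /* packed only */
};

static bool get_buf_split(struct svq_sketch *q, uint32_t *id, uint32_t *len)
{
    if (q->last_used_idx == q->used_idx) {
        return false;           /* nothing new in the used ring */
    }
    const struct used_elem *e = &q->used_ring[q->last_used_idx % q->num];
    *id = e->id;
    *len = e->len;
    q->last_used_idx++;
    return true;
}

static bool get_buf_packed(struct svq_sketch *q, uint32_t *id, uint32_t *len)
{
    const struct packed_desc *d = &q->desc[q->last_used_idx];
    bool avail = d->flags & DESC_F_AVAIL;
    bool used = d->flags & DESC_F_USED;

    /* Used means both bits are equal and match our wrap counter. */
    if (avail != used || used != q->used_wrap_counter) {
        return false;
    }
    *id = d->id;
    *len = d->len;
    /* Single-descriptor buffers assumed; a chain would skip more slots. */
    if (++q->last_used_idx >= q->num) {
        q->last_used_idx = 0;
        q->used_wrap_counter = !q->used_wrap_counter;
    }
    return true;
}

static bool get_buf(struct svq_sketch *q, uint32_t *id, uint32_t *len)
{
    return q->is_packed ? get_buf_packed(q, id, len)
                        : get_buf_split(q, id, len);
}
```

The caller keeps using one entry point, and only the way of detecting a
used entry and reading id/len differs between the two layouts.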

> The "len" that is set to 0 is used in "virtqueue_fill()" in
> virtio.c [4]. Could this point to why I am not able to ping
> L0 from L2?
>

It depends :). Let me know in what vq you find that.

> Thanks,
> Sahil
>
> [1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L418
> [2] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
> [3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L488
> [4] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L501
>

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 4 months, 3 weeks ago

Hi,

Thank you for your reply.

On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> [...]
>>>> I have been following the "Hands on vDPA: what do you do
>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
>>>> test my changes. To boot the L1 VM, I ran:
>>>>
>>>> sudo ./qemu/build/qemu-system-x86_64 \
>>>> -enable-kvm \
>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>> -net nic,model=virtio \
>>>> -net user,hostfwd=tcp::2222-:22 \
>>>> -device intel-iommu,snoop-control=on \
>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>> -nographic \
>>>> -m 8G \
>>>> -smp 4 \
>>>> -M q35 \
>>>> -cpu host 2>&1 | tee vm.log
>>>>
>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
>>>> throws "vdpa svq does not work with features" [2] when
>>>> trying to boot L2.
>>>>
>>>> The enums added in commit #2 in this series is new and
>>>> wasn't in the earlier versions of the series. Without
>>>> this change, x-svq=true throws "SVQ invalid device feature
>>>> flags" [3] and x-svq is consequently disabled.
>>>>
>>>> The first issue is related to running traffic in L2
>>>> with vhost-vdpa.
>>>>
>>>> In L0:
>>>>
>>>> $ ip addr add 111.1.1.1/24 dev tap0
>>>> $ ip link set tap0 up
>>>> $ ip addr show tap0
>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>       link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>       inet 111.1.1.1/24 scope global tap0
>>>>          valid_lft forever preferred_lft forever
>>>>       inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>          valid_lft forever preferred_lft forever
>>>>
>>>> I am able to run traffic in L2 when booting without
>>>> x-svq.
>>>>
>>>> In L1:
>>>>
>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>> -nographic \
>>>> -m 4G \
>>>> -enable-kvm \
>>>> -M q35 \
>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>> -smp 4 \
>>>> -cpu host \
>>>> 2>&1 | tee vm.log
>>>>
>>>> In L2:
>>>>
>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>> # ip addr show eth0
>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>       link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>       altname enp0s7
>>>>       inet 111.1.1.2/24 scope global eth0
>>>>          valid_lft forever preferred_lft forever
>>>>       inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>          valid_lft forever preferred_lft forever
>>>>
>>>> # ip route
>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>
>>>> # ping 111.1.1.1 -w3
>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
>>>>
>>>> --- 111.1.1.1 ping statistics ---
>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
>>>>
>>>>
>>>> But if I boot L2 with x-svq=true as shown below, I am unable
>>>> to ping the host machine.
>>>>
>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>> -nographic \
>>>> -m 4G \
>>>> -enable-kvm \
>>>> -M q35 \
>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>> -smp 4 \
>>>> -cpu host \
>>>> 2>&1 | tee vm.log
>>>>
>>>> In L2:
>>>>
>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>> # ip addr show eth0
>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>       link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>       altname enp0s7
>>>>       inet 111.1.1.2/24 scope global eth0
>>>>          valid_lft forever preferred_lft forever
>>>>       inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>          valid_lft forever preferred_lft forever
>>>>
>>>> # ip route
>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>
>>>> # ping 111.1.1.1 -w10
>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>   From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
>>>> ping: sendmsg: No route to host
>>>>   From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
>>>>   From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
>>>>
>>>> --- 111.1.1.1 ping statistics ---
>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
>>>> pipe 3
>>>>
>>>> The other issue is related to booting L2 with "x-svq=true"
>>>> and "packed=on".
>>>>
>>>> In L1:
>>>>
>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>> -nographic \
>>>> -m 4G \
>>>> -enable-kvm \
>>>> -M q35 \
>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
>>>> -smp 4 \
>>>> -cpu host \
>>>> 2>&1 | tee vm.log
>>>>
>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
>>>> a head!" [4].
>>>>
>>>
>>> So this series implements the descriptor forwarding from the guest to
>>> the device in packed vq. We also need to forward the descriptors from
>>> the device to the guest. The device writes them in the SVQ ring.
>>>
>>> The functions responsible for that in QEMU are
>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
>>> the device when used descriptors are written to the SVQ, which calls
>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
>>> modifications similar to vhost_svq_add: Make them conditional if we're
>>> in split or packed vq, and "copy" the code from Linux's
>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
>>>
>>> After these modifications you should be able to ping and forward
>>> traffic. As always, It is totally ok if it needs more than one
>>> iteration, and feel free to ask any question you have :).
>>>
>>
>> I misunderstood this part. While working on extending
>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
>> for packed vqs, I realized that this function and
>> vhost_svq_flush() already support split vqs. However, I am
>> unable to ping L0 when booting L2 with "x-svq=true" and
>> "packed=off" or when the "packed" option is not specified
>> in QEMU's command line.
>>
>> I tried debugging these functions for split vqs after running
>> the following QEMU commands while following the blog [2].
>>
>> Booting L1:
>>
>> $ sudo ./qemu/build/qemu-system-x86_64 \
>> -enable-kvm \
>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>> -net nic,model=virtio \
>> -net user,hostfwd=tcp::2222-:22 \
>> -device intel-iommu,snoop-control=on \
>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
>> -netdev tap,id=net0,script=no,downscript=no \
>> -nographic \
>> -m 8G \
>> -smp 4 \
>> -M q35 \
>> -cpu host 2>&1 | tee vm.log
>>
>> Booting L2:
>>
>> # ./qemu/build/qemu-system-x86_64 \
>> -nographic \
>> -m 4G \
>> -enable-kvm \
>> -M q35 \
>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>> -smp 4 \
>> -cpu host \
>> 2>&1 | tee vm.log
>>
>> I printed out the contents of VirtQueueElement returned
>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
>> I noticed that "len" which is set by "vhost_svq_get_buf"
>> is always set to 0 while VirtQueueElement.len is non-zero.
>> I haven't understood the difference between these two "len"s.
>>
> 
> VirtQueueElement.len is the length of the buffer, while the len of
> vhost_svq_get_buf is the bytes written by the device. In the case of
> the tx queue, VirtQueueElement.len is the length of the tx packet, and
> the vhost_svq_get_buf len is always 0 as the device does not write. In
> the case of rx, VirtQueueElement.len is the available length for an rx
> frame, and the vhost_svq_get_buf len is the actual length written by
> the device.
> 
> To be 100% accurate, an rx packet can span multiple buffers, but
> SVQ does not need special code to handle this.
> 
> So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
> 0), and 0 for tx queue (svq->vq->index % 2 == 1).
> 
> Take into account that vhost_svq_get_buf only handles split vq at the
> moment! It should be renamed or split into vhost_svq_get_buf_split.

In L1, there are 2 virtio network devices.

# lspci -nn | grep -i net
00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)

I am using the second one (1af4:1041) for testing my changes and have
bound this device to the vp_vdpa driver.

# vdpa dev show -jp
{
     "dev": {
         "vdpa0": {
             "type": "network",
             "mgmtdev": "pci/0000:00:04.0",
             "vendor_id": 6900,
             "max_vqs": 3,
             "max_vq_size": 256
         }
     }
}

The max number of vqs is 3 with the max size being 256.

Since there are 2 virtio-net devices, vhost_vdpa_svqs_start [1]
is called twice. For each of them, it calls vhost_svq_start [2]
v->shadow_vqs->len number of times.

Printing the values of dev->vdev->name, v->shadow_vqs->len and
svq->vring.num in vhost_vdpa_svqs_start gives:

name: virtio-net
len: 2
num: 256
num: 256
name: virtio-net
len: 1
num: 64

I am not sure how to match the above log lines to the
right virtio-net device since the actual value of num
can be less than "max_vq_size" in the output of "vdpa
dev show".

I think the first 3 log lines correspond to the virtio
net device that I am using for testing since it has
2 vqs (rx and tx) while the other virtio-net device
only has one vq.

When printing out the values of svq->vring.num,
used_elem.len and used_elem.id in vhost_svq_get_buf,
there are two sets of output. One set corresponds to
svq->vring.num = 64 and the other corresponds to
svq->vring.num = 256.

For svq->vring.num = 64, only the following line
is printed repeatedly:

size: 64, len: 1, i: 0

For svq->vring.num = 256, the following line is
printed 20 times,

size: 256, len: 0, i: 0

followed by:

size: 256, len: 0, i: 1
size: 256, len: 0, i: 1

used_elem.len is used to set the value of len that is
returned by vhost_svq_get_buf, and it's always 0.

So the value of "len" returned by vhost_svq_get_buf
when called in vhost_svq_flush is also 0.

Thanks,
Sahil

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-vdpa.c#L1243
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-vdpa.c#L1265

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 4 months, 2 weeks ago

On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> Thank you for your reply.
>
> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
> > On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
> >>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> [...]
> >>>> I have been following the "Hands on vDPA: what do you do
> >>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
> >>>> test my changes. To boot the L1 VM, I ran:
> >>>>
> >>>> sudo ./qemu/build/qemu-system-x86_64 \
> >>>> -enable-kvm \
> >>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>> -net nic,model=virtio \
> >>>> -net user,hostfwd=tcp::2222-:22 \
> >>>> -device intel-iommu,snoop-control=on \
> >>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>> -nographic \
> >>>> -m 8G \
> >>>> -smp 4 \
> >>>> -M q35 \
> >>>> -cpu host 2>&1 | tee vm.log
> >>>>
> >>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
> >>>> guest_announce=off" in "-device virtio-net-pci", QEMU
> >>>> throws "vdpa svq does not work with features" [2] when
> >>>> trying to boot L2.
> >>>>
> >>>> The enums added in commit #2 in this series is new and
> >>>> wasn't in the earlier versions of the series. Without
> >>>> this change, x-svq=true throws "SVQ invalid device feature
> >>>> flags" [3] and x-svq is consequently disabled.
> >>>>
> >>>> The first issue is related to running traffic in L2
> >>>> with vhost-vdpa.
> >>>>
> >>>> In L0:
> >>>>
> >>>> $ ip addr add 111.1.1.1/24 dev tap0
> >>>> $ ip link set tap0 up
> >>>> $ ip addr show tap0
> >>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>>>       link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>>>       inet 111.1.1.1/24 scope global tap0
> >>>>          valid_lft forever preferred_lft forever
> >>>>       inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>>>          valid_lft forever preferred_lft forever
> >>>>
> >>>> I am able to run traffic in L2 when booting without
> >>>> x-svq.
> >>>>
> >>>> In L1:
> >>>>
> >>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>> -nographic \
> >>>> -m 4G \
> >>>> -enable-kvm \
> >>>> -M q35 \
> >>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
> >>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>> -smp 4 \
> >>>> -cpu host \
> >>>> 2>&1 | tee vm.log
> >>>>
> >>>> In L2:
> >>>>
> >>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>> # ip addr show eth0
> >>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>       link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>       altname enp0s7
> >>>>       inet 111.1.1.2/24 scope global eth0
> >>>>          valid_lft forever preferred_lft forever
> >>>>       inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>          valid_lft forever preferred_lft forever
> >>>>
> >>>> # ip route
> >>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>
> >>>> # ping 111.1.1.1 -w3
> >>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
> >>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
> >>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
> >>>>
> >>>> --- 111.1.1.1 ping statistics ---
> >>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
> >>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
> >>>>
> >>>>
> >>>> But if I boot L2 with x-svq=true as shown below, I am unable
> >>>> to ping the host machine.
> >>>>
> >>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>> -nographic \
> >>>> -m 4G \
> >>>> -enable-kvm \
> >>>> -M q35 \
> >>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>> -smp 4 \
> >>>> -cpu host \
> >>>> 2>&1 | tee vm.log
> >>>>
> >>>> In L2:
> >>>>
> >>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>> # ip addr show eth0
> >>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>       link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>       altname enp0s7
> >>>>       inet 111.1.1.2/24 scope global eth0
> >>>>          valid_lft forever preferred_lft forever
> >>>>       inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>          valid_lft forever preferred_lft forever
> >>>>
> >>>> # ip route
> >>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>
> >>>> # ping 111.1.1.1 -w10
> >>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>   From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
> >>>> ping: sendmsg: No route to host
> >>>>   From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
> >>>>   From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
> >>>>
> >>>> --- 111.1.1.1 ping statistics ---
> >>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
> >>>> pipe 3
> >>>>
> >>>> The other issue is related to booting L2 with "x-svq=true"
> >>>> and "packed=on".
> >>>>
> >>>> In L1:
> >>>>
> >>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>> -nographic \
> >>>> -m 4G \
> >>>> -enable-kvm \
> >>>> -M q35 \
> >>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
> >>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> >>>> -smp 4 \
> >>>> -cpu host \
> >>>> 2>&1 | tee vm.log
> >>>>
> >>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
> >>>> a head!" [4].
> >>>>
> >>>
> >>> So this series implements the descriptor forwarding from the guest to
> >>> the device in packed vq. We also need to forward the descriptors from
> >>> the device to the guest. The device writes them in the SVQ ring.
> >>>
> >>> The functions responsible for that in QEMU are
> >>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
> >>> the device when used descriptors are written to the SVQ, which calls
> >>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
> >>> modifications similar to vhost_svq_add: Make them conditional if we're
> >>> in split or packed vq, and "copy" the code from Linux's
> >>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
> >>>
> >>> After these modifications you should be able to ping and forward
> >>> traffic. As always, It is totally ok if it needs more than one
> >>> iteration, and feel free to ask any question you have :).
> >>>
> >>
> >> I misunderstood this part. While working on extending
> >> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
> >> for packed vqs, I realized that this function and
> >> vhost_svq_flush() already support split vqs. However, I am
> >> unable to ping L0 when booting L2 with "x-svq=true" and
> >> "packed=off" or when the "packed" option is not specified
> >> in QEMU's command line.
> >>
> >> I tried debugging these functions for split vqs after running
> >> the following QEMU commands while following the blog [2].
> >>
> >> Booting L1:
> >>
> >> $ sudo ./qemu/build/qemu-system-x86_64 \
> >> -enable-kvm \
> >> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >> -net nic,model=virtio \
> >> -net user,hostfwd=tcp::2222-:22 \
> >> -device intel-iommu,snoop-control=on \
> >> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> >> -netdev tap,id=net0,script=no,downscript=no \
> >> -nographic \
> >> -m 8G \
> >> -smp 4 \
> >> -M q35 \
> >> -cpu host 2>&1 | tee vm.log
> >>
> >> Booting L2:
> >>
> >> # ./qemu/build/qemu-system-x86_64 \
> >> -nographic \
> >> -m 4G \
> >> -enable-kvm \
> >> -M q35 \
> >> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >> -smp 4 \
> >> -cpu host \
> >> 2>&1 | tee vm.log
> >>
> >> I printed out the contents of VirtQueueElement returned
> >> by vhost_svq_get_buf() in vhost_svq_flush() [3].
> >> I noticed that "len" which is set by "vhost_svq_get_buf"
> >> is always set to 0 while VirtQueueElement.len is non-zero.
> >> I haven't understood the difference between these two "len"s.
> >>
> >
> > VirtQueueElement.len is the length of the buffer, while the len of
> > vhost_svq_get_buf is the bytes written by the device. In the case of
> > the tx queue, VirtQueueElement.len is the length of the tx packet, and
> > the vhost_svq_get_buf len is always 0 as the device does not write. In
> > the case of rx, VirtQueueElement.len is the available length for an rx
> > frame, and the vhost_svq_get_buf len is the actual length written by
> > the device.
> >
> > To be 100% accurate, an rx packet can span multiple buffers, but
> > SVQ does not need special code to handle this.
> >
> > So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
> > 0), and 0 for tx queue (svq->vq->index % 2 == 1).
> >
> > Take into account that vhost_svq_get_buf only handles split vq at the
> > moment! It should be renamed or split into vhost_svq_get_buf_split.
>
> In L1, there are 2 virtio network devices.
>
> # lspci -nn | grep -i net
> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
>
> I am using the second one (1af4:1041) for testing my changes and have
> bound this device to the vp_vdpa driver.
>
> # vdpa dev show -jp
> {
>      "dev": {
>          "vdpa0": {
>              "type": "network",
>              "mgmtdev": "pci/0000:00:04.0",
>              "vendor_id": 6900,
>              "max_vqs": 3,

How is max_vqs=3? For this to happen, L0 QEMU should have the
virtio-net-pci,...,queues=3 cmdline argument. It's clear the guest is
not using them, so we can add mq=off to simplify the scenario.

>              "max_vq_size": 256
>          }
>      }
> }
>
> The max number of vqs is 3 with the max size being 256.
>
> Since, there are 2 virtio net devices, vhost_vdpa_svqs_start [1]
> is called twice. For each of them. it calls vhost_svq_start [2]
> v->shadow_vqs->len number of times.
>

Ok I understand this confusion, as the code is not intuitive :). Take
into account you can only have svq in vdpa devices, so both
vhost_vdpa_svqs_start are acting on the vdpa device.

You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
vdpa) devices are modelled internally as two devices in QEMU: one for
the dataplane vqs, and another for the control vq. There are historical
reasons for this, but we use it in vdpa to always shadow the CVQ while
leaving the dataplane as passthrough if x-svq=off and the virtio &
virtio-net feature set is understood by SVQ.

If you break at vhost_vdpa_svqs_start with gdb and go higher in the
stack you should reach vhost_net_start, that starts each vhost_net
device individually.

To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
with a different vhost_net device in QEMU, but you don't need to take
that into account when implementing the packed vq :).

> Printing the values of dev->vdev->name, v->shadow_vqs->len and
> svq->vring.num in vhost_vdpa_svqs_start gives:
>
> name: virtio-net
> len: 2
> num: 256
> num: 256

QEMU's first vhost_net device, the dataplane.

> name: virtio-net
> len: 1
> num: 64
>

QEMU's second vhost_net device, the control virtqueue.

> I am not sure how to match the above log lines to the
> right virtio-net device since the actual value of num
> can be less than "max_vq_size" in the output of "vdpa
> dev show".
>

Yes, the device can set a different vq max per vq, and the driver can
negotiate a lower vq size per vq too.

> I think the first 3 log lines correspond to the virtio
> net device that I am using for testing since it has
> 2 vqs (rx and tx) while the other virtio-net device
> only has one vq.
>
> When printing out the values of svq->vring.num,
> used_elem.len and used_elem.id in vhost_svq_get_buf,
> there are two sets of output. One set corresponds to
> svq->vring.num = 64 and the other corresponds to
> svq->vring.num = 256.
>
> For svq->vring.num = 64, only the following line
> is printed repeatedly:
>
> size: 64, len: 1, i: 0
>

This is with packed=off, right? If this is testing with packed, you
need to change the code to accommodate it. Let me know if you need
more help with this.

In the CVQ the only reply is a byte, indicating if the command was
applied or not. This seems ok to me.

The queue can also recycle ids as long as they are not available, so
that part seems correct to me too.
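
To illustrate why the same id can show up over and over, here is a toy
Python model (not QEMU code). The names IdPool and ids_seen are made up
for this sketch, and the "most recently freed id is reused first" policy
is an assumption about how a free list typically behaves, not a statement
about the exact SVQ implementation:

```python
from collections import deque


class IdPool:
    """Toy model of descriptor-id reuse in a virtqueue with `size` entries."""

    def __init__(self, size):
        self.free = deque(range(size))  # ids the driver may hand out
        self.in_flight = set()          # ids exposed to the device, not yet used

    def make_available(self):
        """Driver side: take a free id and expose a buffer to the device."""
        desc_id = self.free.popleft()
        self.in_flight.add(desc_id)
        return desc_id

    def mark_used(self, desc_id):
        """Device side: hand the buffer back; the id becomes reusable."""
        self.in_flight.remove(desc_id)
        self.free.appendleft(desc_id)   # assumed policy: reuse newest freed id


def ids_seen(size, commands):
    """Ids observed when each buffer is used before the next one is added."""
    pool = IdPool(size)
    seen = []
    for _ in range(commands):
        desc_id = pool.make_available()
        pool.mark_used(desc_id)
        seen.append(desc_id)
    return seen
```

With one command in flight at a time (as with CVQ commands that are
acknowledged before the next one is sent), ids_seen(64, 5) only ever
reports id 0 in this model.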

> For svq->vring.num = 256, the following line is
> printed 20 times,
>
> size: 256, len: 0, i: 0
>
> followed by:
>
> size: 256, len: 0, i: 1
> size: 256, len: 0, i: 1
>

This makes sense for the tx queue too. Can you print the VirtQueue index?

> used_elem.len is used to set the value of len that is
> returned by vhost_svq_get_buf, and it's always 0.
>
> So the value of "len" returned by vhost_svq_get_buf
> when called in vhost_svq_flush is also 0.
>
> Thanks,
> Sahil
>
> [1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-vdpa.c#L1243
> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-vdpa.c#L1265
>

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 4 months, 2 weeks ago

Hi,

On 12/17/24 1:20 PM, Eugenio Perez Martin wrote:
> On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
>>> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
>>>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>> [...]
>>>>>> I have been following the "Hands on vDPA: what do you do
>>>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
>>>>>> test my changes. To boot the L1 VM, I ran:
>>>>>>
>>>>>> sudo ./qemu/build/qemu-system-x86_64 \
>>>>>> -enable-kvm \
>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>>>> -net nic,model=virtio \
>>>>>> -net user,hostfwd=tcp::2222-:22 \
>>>>>> -device intel-iommu,snoop-control=on \
>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
>>>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>>>> -nographic \
>>>>>> -m 8G \
>>>>>> -smp 4 \
>>>>>> -M q35 \
>>>>>> -cpu host 2>&1 | tee vm.log
>>>>>>
>>>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
>>>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
>>>>>> throws "vdpa svq does not work with features" [2] when
>>>>>> trying to boot L2.
>>>>>>
>>>>>> The enums added in commit #2 in this series is new and
>>>>>> wasn't in the earlier versions of the series. Without
>>>>>> this change, x-svq=true throws "SVQ invalid device feature
>>>>>> flags" [3] and x-svq is consequently disabled.
>>>>>>
>>>>>> The first issue is related to running traffic in L2
>>>>>> with vhost-vdpa.
>>>>>>
>>>>>> In L0:
>>>>>>
>>>>>> $ ip addr add 111.1.1.1/24 dev tap0
>>>>>> $ ip link set tap0 up
>>>>>> $ ip addr show tap0
>>>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>>>        link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>>>        inet 111.1.1.1/24 scope global tap0
>>>>>>           valid_lft forever preferred_lft forever
>>>>>>        inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>>>           valid_lft forever preferred_lft forever
>>>>>>
>>>>>> I am able to run traffic in L2 when booting without
>>>>>> x-svq.
>>>>>>
>>>>>> In L1:
>>>>>>
>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>> -nographic \
>>>>>> -m 4G \
>>>>>> -enable-kvm \
>>>>>> -M q35 \
>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>> -smp 4 \
>>>>>> -cpu host \
>>>>>> 2>&1 | tee vm.log
>>>>>>
>>>>>> In L2:
>>>>>>
>>>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>>>> # ip addr show eth0
>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>        link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>        altname enp0s7
>>>>>>        inet 111.1.1.2/24 scope global eth0
>>>>>>           valid_lft forever preferred_lft forever
>>>>>>        inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>>>           valid_lft forever preferred_lft forever
>>>>>>
>>>>>> # ip route
>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>>>
>>>>>> # ping 111.1.1.1 -w3
>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
>>>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
>>>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
>>>>>>
>>>>>> --- 111.1.1.1 ping statistics ---
>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
>>>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
>>>>>>
>>>>>>
>>>>>> But if I boot L2 with x-svq=true as shown below, I am unable
>>>>>> to ping the host machine.
>>>>>>
>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>> -nographic \
>>>>>> -m 4G \
>>>>>> -enable-kvm \
>>>>>> -M q35 \
>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>> -smp 4 \
>>>>>> -cpu host \
>>>>>> 2>&1 | tee vm.log
>>>>>>
>>>>>> In L2:
>>>>>>
>>>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>>>> # ip addr show eth0
>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>        link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>        altname enp0s7
>>>>>>        inet 111.1.1.2/24 scope global eth0
>>>>>>           valid_lft forever preferred_lft forever
>>>>>>        inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>>>           valid_lft forever preferred_lft forever
>>>>>>
>>>>>> # ip route
>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>>>
>>>>>> # ping 111.1.1.1 -w10
>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>>>    From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
>>>>>> ping: sendmsg: No route to host
>>>>>>    From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
>>>>>>    From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
>>>>>>
>>>>>> --- 111.1.1.1 ping statistics ---
>>>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
>>>>>> pipe 3
>>>>>>
>>>>>> The other issue is related to booting L2 with "x-svq=true"
>>>>>> and "packed=on".
>>>>>>
>>>>>> In L1:
>>>>>>
>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>> -nographic \
>>>>>> -m 4G \
>>>>>> -enable-kvm \
>>>>>> -M q35 \
>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
>>>>>> -smp 4 \
>>>>>> -cpu host \
>>>>>> 2>&1 | tee vm.log
>>>>>>
>>>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
>>>>>> a head!" [4].
>>>>>>
>>>>>
>>>>> So this series implements the descriptor forwarding from the guest to
>>>>> the device in packed vq. We also need to forward the descriptors from
>>>>> the device to the guest. The device writes them in the SVQ ring.
>>>>>
>>>>> The functions responsible for that in QEMU are
>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
>>>>> the device when used descriptors are written to the SVQ, which calls
>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
>>>>> modifications similar to vhost_svq_add: Make them conditional if we're
>>>>> in split or packed vq, and "copy" the code from Linux's
>>>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
>>>>>
>>>>> After these modifications you should be able to ping and forward
>>>>> traffic. As always, It is totally ok if it needs more than one
>>>>> iteration, and feel free to ask any question you have :).
>>>>>
>>>>
>>>> I misunderstood this part. While working on extending
>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
>>>> for packed vqs, I realized that this function and
>>>> vhost_svq_flush() already support split vqs. However, I am
>>>> unable to ping L0 when booting L2 with "x-svq=true" and
>>>> "packed=off" or when the "packed" option is not specified
>>>> in QEMU's command line.
>>>>
>>>> I tried debugging these functions for split vqs after running
>>>> the following QEMU commands while following the blog [2].
>>>>
>>>> Booting L1:
>>>>
>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
>>>> -enable-kvm \
>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>> -net nic,model=virtio \
>>>> -net user,hostfwd=tcp::2222-:22 \
>>>> -device intel-iommu,snoop-control=on \
>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>> -nographic \
>>>> -m 8G \
>>>> -smp 4 \
>>>> -M q35 \
>>>> -cpu host 2>&1 | tee vm.log
>>>>
>>>> Booting L2:
>>>>
>>>> # ./qemu/build/qemu-system-x86_64 \
>>>> -nographic \
>>>> -m 4G \
>>>> -enable-kvm \
>>>> -M q35 \
>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>> -smp 4 \
>>>> -cpu host \
>>>> 2>&1 | tee vm.log
>>>>
>>>> I printed out the contents of VirtQueueElement returned
>>>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
>>>> I noticed that "len" which is set by "vhost_svq_get_buf"
>>>> is always set to 0 while VirtQueueElement.len is non-zero.
>>>> I haven't understood the difference between these two "len"s.
>>>>
>>>
>>> VirtQueueElement.len is the length of the buffer, while the len of
>>> vhost_svq_get_buf is the bytes written by the device. In the case of
>>> the tx queue, VirtQueueElement.len is the length of the tx packet, and
>>> the vhost_svq_get_buf len is always 0 as the device does not write. In
>>> the case of rx, VirtQueueElement.len is the available length for a rx
>>> frame, and the vhost_svq_get_buf len is the actual length written by
>>> the device.
>>>
>>> To be 100% accurate a rx packet can span over multiple buffers, but
>>> SVQ does not need special code to handle this.
>>>
>>> So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
>>> 0), and 0 for tx queue (svq->vq->index % 2 == 1).
>>>
>>> Take into account that vhost_svq_get_buf only handles split vq at the
>>> moment! It should be renamed to, or split into, vhost_svq_get_buf_split.
>>
>> In L1, there are 2 virtio network devices.
>>
>> # lspci -nn | grep -i net
>> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
>> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
>>
>> I am using the second one (1af4:1041) for testing my changes and have
>> bound this device to the vp_vdpa driver.
>>
>> # vdpa dev show -jp
>> {
>>       "dev": {
>>           "vdpa0": {
>>               "type": "network",
>>               "mgmtdev": "pci/0000:00:04.0",
>>               "vendor_id": 6900,
>>               "max_vqs": 3,
> 
> How is max_vqs=3? For this to happen L0 QEMU should have
> virtio-net-pci,...,queues=3 cmdline argument.

I am not sure why max_vqs is 3. I haven't set the value of queues to 3
in the cmdline argument. Is max_vqs expected to have a default value
other than 3?

In the blog [1] as well, max_vqs is 3 even though there's no queues=3
argument.

> It's clear the guest is not using them, we can add mq=off
> to simplify the scenario.

The value of max_vqs is still 3 after adding mq=off. The whole
command that I run in L0 to boot L1 is:

$ sudo ./qemu/build/qemu-system-x86_64 \
-enable-kvm \
-drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
-net nic,model=virtio \
-net user,hostfwd=tcp::2222-:22 \
-device intel-iommu,snoop-control=on \
-device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
-netdev tap,id=net0,script=no,downscript=no \
-nographic \
-m 8G \
-smp 4 \
-M q35 \
-cpu host 2>&1 | tee vm.log

Could it be that 2 of the 3 vqs are used for the dataplane and
the third vq is the control vq?

>>               "max_vq_size": 256
>>           }
>>       }
>> }
>>
>> The max number of vqs is 3 with the max size being 256.
>>
>> Since, there are 2 virtio net devices, vhost_vdpa_svqs_start [1]
>> is called twice. For each of them. it calls vhost_svq_start [2]
>> v->shadow_vqs->len number of times.
>>
> 
> Ok I understand this confusion, as the code is not intuitive :). Take
> into account you can only have svq in vdpa devices, so both
> vhost_vdpa_svqs_start are acting on the vdpa device.
> 
> You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
> vdpa) devices are modelled internally as two devices in QEMU: One for
> the dataplane vq, and other for the control vq. There are historical
> reasons for this, but we use it in vdpa to always shadow the CVQ while
> leaving dataplane passthrough if x-svq=off and the virtio & virtio-net
> feature set is understood by SVQ.
> 
> If you break at vhost_vdpa_svqs_start with gdb and go higher in the
> stack you should reach vhost_net_start, that starts each vhost_net
> device individually.
> 
> To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
> with a different vhost_net device in QEMU, but you don't need to take
> that into account when implementing the packed vq :).

Got it, this makes sense now.

>> Printing the values of dev->vdev->name, v->shadow_vqs->len and
>> svq->vring.num in vhost_vdpa_svqs_start gives:
>>
>> name: virtio-net
>> len: 2
>> num: 256
>> num: 256
> 
> First QEMU's vhost_net device, the dataplane.
> 
>> name: virtio-net
>> len: 1
>> num: 64
>>
> 
> Second QEMU's vhost_net device, the control virtqueue.

Ok, if I understand this correctly, the control vq doesn't
need separate queues for rx and tx.

>> I am not sure how to match the above log lines to the
>> right virtio-net device since the actual value of num
>> can be less than "max_vq_size" in the output of "vdpa
>> dev show".
>>
> 
> Yes, the device can set a different vq max per vq, and the driver can
> negotiate a lower vq size per vq too.
> 
>> I think the first 3 log lines correspond to the virtio
>> net device that I am using for testing since it has
>> 2 vqs (rx and tx) while the other virtio-net device
>> only has one vq.
>>
>> When printing out the values of svq->vring.num,
>> used_elem.len and used_elem.id in vhost_svq_get_buf,
>> there are two sets of output. One set corresponds to
>> svq->vring.num = 64 and the other corresponds to
>> svq->vring.num = 256.
>>
>> For svq->vring.num = 64, only the following line
>> is printed repeatedly:
>>
>> size: 64, len: 1, i: 0
>>
> 
> This is with packed=off, right? If this is testing with packed, you
> need to change the code to accommodate it. Let me know if you need
> more help with this.

Yes, this is for packed=off. For the time being, I am trying to
get L2 to communicate with L0 using split virtqueues and x-svq=true.

> In the CVQ the only reply is a byte, indicating if the command was
> applied or not. This seems ok to me.

Understood.

> The queue can also recycle ids as long as they are not available, so
> that part seems correct to me too.

I am a little confused here. The ids are recycled when they are
available (i.e., the id is not already in use), right?

>> For svq->vring.num = 256, the following line is
>> printed 20 times,
>>
>> size: 256, len: 0, i: 0
>>
>> followed by:
>>
>> size: 256, len: 0, i: 1
>> size: 256, len: 0, i: 1
>>
> 
> This makes sense for the tx queue too. Can you print the VirtQueue index?

For svq->vring.num = 64, the vq index is 2. So the following line
(svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
is printed repeatedly:

size: 64, len: 1, i: 0, vq idx: 2

For svq->vring.num = 256, the following line is repeated several
times:

size: 256, len: 0, i: 0, vq idx: 1

This is followed by:

size: 256, len: 0, i: 1, vq idx: 1

In both cases, queue_index is 1. To get the value of queue_index,
I used "virtio_get_queue_index(svq->vq)" [2].

Since the queue_index is 1, I guess this means this is the tx queue
and the value of len (0) is correct. However, nothing with
queue_index % 2 == 0 is printed by vhost_svq_get_buf() which means
the device is not sending anything to the guest. Is this correct?
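
To double-check my reading of these log lines, here is a small Python
model (not QEMU code) of what I understand so far. The function names are
made up for this sketch. It assumes that rx queues have even indices, tx
queues have odd indices, and the control vq sits right after the last
dataplane queue (index 2 with a single queue pair). The expected lengths
follow the explanation quoted above: 0 for tx, a one-byte reply for the
CVQ, and more than 0 for rx.

```python
def vq_role(index, queue_pairs=1, ctrl_vq=True):
    """Return 'rx', 'tx' or 'cvq' for a virtio-net virtqueue index."""
    data_vqs = 2 * queue_pairs
    total = data_vqs + (1 if ctrl_vq else 0)
    if not 0 <= index < total:
        raise ValueError(f"index {index} out of range for {total} vqs")
    if ctrl_vq and index == data_vqs:
        return "cvq"
    return "rx" if index % 2 == 0 else "tx"


def written_len_plausible(index, written_len, queue_pairs=1, ctrl_vq=True):
    """Check the device-written length against the role of the queue."""
    role = vq_role(index, queue_pairs, ctrl_vq)
    if role == "tx":
        return written_len == 0   # the device only reads tx buffers
    if role == "cvq":
        return written_len == 1   # single status byte
    return written_len > 0        # rx: the device wrote a frame
```

Under these assumptions, "len: 1, vq idx: 2" and "len: 0, vq idx: 1" both
pass the check, and what is missing from my output is any line for
vq idx 0.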

>> used_elem.len is used to set the value of len that is
>> returned by vhost_svq_get_buf, and it's always 0.
>>
>> So the value of "len" returned by vhost_svq_get_buf
>> when called in vhost_svq_flush is also 0.
>>
>> Thanks,
>> Sahil
>>
>> [1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-vdpa.c#L1243
>> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-vdpa.c#L1265
>>
> 

Thanks,
Sahil

[1] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
[2] https://gitlab.com/qemu-project/qemu/-/blob/99d6a32469debf1a48921125879b614d15acfb7a/hw/virtio/virtio.c#L3454

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 4 months, 2 weeks ago

On Thu, Dec 19, 2024 at 8:37 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 12/17/24 1:20 PM, Eugenio Perez Martin wrote:
> > On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
> >>> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>> [...]
> >>>>>> I have been following the "Hands on vDPA: what do you do
> >>>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
> >>>>>> test my changes. To boot the L1 VM, I ran:
> >>>>>>
> >>>>>> sudo ./qemu/build/qemu-system-x86_64 \
> >>>>>> -enable-kvm \
> >>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>>>> -net nic,model=virtio \
> >>>>>> -net user,hostfwd=tcp::2222-:22 \
> >>>>>> -device intel-iommu,snoop-control=on \
> >>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>>>> -nographic \
> >>>>>> -m 8G \
> >>>>>> -smp 4 \
> >>>>>> -M q35 \
> >>>>>> -cpu host 2>&1 | tee vm.log
> >>>>>>
> >>>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
> >>>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
> >>>>>> throws "vdpa svq does not work with features" [2] when
> >>>>>> trying to boot L2.
> >>>>>>
> >>>>>> The enums added in commit #2 in this series is new and
> >>>>>> wasn't in the earlier versions of the series. Without
> >>>>>> this change, x-svq=true throws "SVQ invalid device feature
> >>>>>> flags" [3] and x-svq is consequently disabled.
> >>>>>>
> >>>>>> The first issue is related to running traffic in L2
> >>>>>> with vhost-vdpa.
> >>>>>>
> >>>>>> In L0:
> >>>>>>
> >>>>>> $ ip addr add 111.1.1.1/24 dev tap0
> >>>>>> $ ip link set tap0 up
> >>>>>> $ ip addr show tap0
> >>>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>>>>>        link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>>>>>        inet 111.1.1.1/24 scope global tap0
> >>>>>>           valid_lft forever preferred_lft forever
> >>>>>>        inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>>>>>           valid_lft forever preferred_lft forever
> >>>>>>
> >>>>>> I am able to run traffic in L2 when booting without
> >>>>>> x-svq.
> >>>>>>
> >>>>>> In L1:
> >>>>>>
> >>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>> -nographic \
> >>>>>> -m 4G \
> >>>>>> -enable-kvm \
> >>>>>> -M q35 \
> >>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
> >>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>> -smp 4 \
> >>>>>> -cpu host \
> >>>>>> 2>&1 | tee vm.log
> >>>>>>
> >>>>>> In L2:
> >>>>>>
> >>>>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>>>> # ip addr show eth0
> >>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>        link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>        altname enp0s7
> >>>>>>        inet 111.1.1.2/24 scope global eth0
> >>>>>>           valid_lft forever preferred_lft forever
> >>>>>>        inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>>>           valid_lft forever preferred_lft forever
> >>>>>>
> >>>>>> # ip route
> >>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>>>
> >>>>>> # ping 111.1.1.1 -w3
> >>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
> >>>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
> >>>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
> >>>>>>
> >>>>>> --- 111.1.1.1 ping statistics ---
> >>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
> >>>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
> >>>>>>
> >>>>>>
> >>>>>> But if I boot L2 with x-svq=true as shown below, I am unable
> >>>>>> to ping the host machine.
> >>>>>>
> >>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>> -nographic \
> >>>>>> -m 4G \
> >>>>>> -enable-kvm \
> >>>>>> -M q35 \
> >>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>> -smp 4 \
> >>>>>> -cpu host \
> >>>>>> 2>&1 | tee vm.log
> >>>>>>
> >>>>>> In L2:
> >>>>>>
> >>>>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>>>> # ip addr show eth0
> >>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>        link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>        altname enp0s7
> >>>>>>        inet 111.1.1.2/24 scope global eth0
> >>>>>>           valid_lft forever preferred_lft forever
> >>>>>>        inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>>>           valid_lft forever preferred_lft forever
> >>>>>>
> >>>>>> # ip route
> >>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>>>
> >>>>>> # ping 111.1.1.1 -w10
> >>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>>>    From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
> >>>>>> ping: sendmsg: No route to host
> >>>>>>    From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
> >>>>>>    From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
> >>>>>>
> >>>>>> --- 111.1.1.1 ping statistics ---
> >>>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
> >>>>>> pipe 3
> >>>>>>
> >>>>>> The other issue is related to booting L2 with "x-svq=true"
> >>>>>> and "packed=on".
> >>>>>>
> >>>>>> In L1:
> >>>>>>
> >>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>> -nographic \
> >>>>>> -m 4G \
> >>>>>> -enable-kvm \
> >>>>>> -M q35 \
> >>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
> >>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> >>>>>> -smp 4 \
> >>>>>> -cpu host \
> >>>>>> 2>&1 | tee vm.log
> >>>>>>
> >>>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
> >>>>>> a head!" [4].
> >>>>>>
> >>>>>
> >>>>> So this series implements the descriptor forwarding from the guest to
> >>>>> the device in packed vq. We also need to forward the descriptors from
> >>>>> the device to the guest. The device writes them in the SVQ ring.
> >>>>>
> >>>>> The functions responsible for that in QEMU are
> >>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
> >>>>> the device when used descriptors are written to the SVQ, which calls
> >>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
> >>>>> modifications similar to vhost_svq_add: Make them conditional if we're
> >>>>> in split or packed vq, and "copy" the code from Linux's
> >>>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
> >>>>>
> >>>>> After these modifications you should be able to ping and forward
> >>>>> traffic. As always, It is totally ok if it needs more than one
> >>>>> iteration, and feel free to ask any question you have :).
> >>>>>
> >>>>
> >>>> I misunderstood this part. While working on extending
> >>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
> >>>> for packed vqs, I realized that this function and
> >>>> vhost_svq_flush() already support split vqs. However, I am
> >>>> unable to ping L0 when booting L2 with "x-svq=true" and
> >>>> "packed=off" or when the "packed" option is not specified
> >>>> in QEMU's command line.
> >>>>
> >>>> I tried debugging these functions for split vqs after running
> >>>> the following QEMU commands while following the blog [2].
> >>>>
> >>>> Booting L1:
> >>>>
> >>>> $ sudo ./qemu/build/qemu-system-x86_64 \
> >>>> -enable-kvm \
> >>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>> -net nic,model=virtio \
> >>>> -net user,hostfwd=tcp::2222-:22 \
> >>>> -device intel-iommu,snoop-control=on \
> >>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>> -nographic \
> >>>> -m 8G \
> >>>> -smp 4 \
> >>>> -M q35 \
> >>>> -cpu host 2>&1 | tee vm.log
> >>>>
> >>>> Booting L2:
> >>>>
> >>>> # ./qemu/build/qemu-system-x86_64 \
> >>>> -nographic \
> >>>> -m 4G \
> >>>> -enable-kvm \
> >>>> -M q35 \
> >>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>> -smp 4 \
> >>>> -cpu host \
> >>>> 2>&1 | tee vm.log
> >>>>
> >>>> I printed out the contents of VirtQueueElement returned
> >>>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
> >>>> I noticed that "len" which is set by "vhost_svq_get_buf"
> >>>> is always set to 0 while VirtQueueElement.len is non-zero.
> >>>> I haven't understood the difference between these two "len"s.
> >>>>
> >>>
> >>> VirtQueueElement.len is the length of the buffer, while the len of
> >>> vhost_svq_get_buf is the bytes written by the device. In the case of
> >>> the tx queue, VirtQueueElement.len is the length of the tx packet, and
> >>> the vhost_svq_get_buf len is always 0 as the device does not write. In
> >>> the case of rx, VirtQueueElement.len is the available length for a rx
> >>> frame, and the vhost_svq_get_buf len is the actual length written by
> >>> the device.
> >>>
> >>> To be 100% accurate a rx packet can span over multiple buffers, but
> >>> SVQ does not need special code to handle this.
> >>>
> >>> So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
> >>> 0), and 0 for tx queue (svq->vq->index % 2 == 1).
> >>>
> >>> Take into account that vhost_svq_get_buf only handles split vq at the
> >>> moment! It should be renamed to, or split into, vhost_svq_get_buf_split.
> >>
> >> In L1, there are 2 virtio network devices.
> >>
> >> # lspci -nn | grep -i net
> >> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
> >> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
> >>
> >> I am using the second one (1af4:1041) for testing my changes and have
> >> bound this device to the vp_vdpa driver.
> >>
> >> # vdpa dev show -jp
> >> {
> >>       "dev": {
> >>           "vdpa0": {
> >>               "type": "network",
> >>               "mgmtdev": "pci/0000:00:04.0",
> >>               "vendor_id": 6900,
> >>               "max_vqs": 3,
> >
> > How is max_vqs=3? For this to happen L0 QEMU should have
> > virtio-net-pci,...,queues=3 cmdline argument.

Ouch! I totally misread it :(. Everything is correct, max_vqs should
be 3: one queue pair (rx + tx) plus the control vq. I read it as the
number of virtio_net queues, which counts queue *pairs*, each made up
of an rx and a tx queue.
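
As a rough sketch of that arithmetic (this is not QEMU code; the
function name is made up for illustration), the number of vqs a
virtio-net device exposes can be computed as:

```c
#include <stdbool.h>

/*
 * Hypothetical helper: number of virtqueues exposed by a virtio-net
 * device. Each queue pair contributes one rx and one tx queue, and the
 * control vq (ctrl_vq=on) adds one more.
 */
static unsigned toy_virtio_net_total_vqs(unsigned queue_pairs, bool ctrl_vq)
{
    return 2 * queue_pairs + (ctrl_vq ? 1 : 0);
}
```

With a single queue pair and ctrl_vq=on, as in the command lines in
this thread, this gives 3, which matches the max_vqs reported by
"vdpa dev show".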

>
> I am not sure why max_vqs is 3. I haven't set the value of queues to 3
> in the cmdline argument. Is max_vqs expected to have a default value
> other than 3?
>
> In the blog [1] as well, max_vqs is 3 even though there's no queues=3
> argument.
>
> > It's clear the guest is not using them, we can add mq=off
> > to simplify the scenario.
>
> The value of max_vqs is still 3 after adding mq=off. The whole
> command that I run to boot L0 is:
>
> $ sudo ./qemu/build/qemu-system-x86_64 \
> -enable-kvm \
> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> -net nic,model=virtio \
> -net user,hostfwd=tcp::2222-:22 \
> -device intel-iommu,snoop-control=on \
> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> -netdev tap,id=net0,script=no,downscript=no \
> -nographic \
> -m 8G \
> -smp 4 \
> -M q35 \
> -cpu host 2>&1 | tee vm.log
>
> Could it be that 2 of the 3 vqs are used for the dataplane and
> the third vq is the control vq?
>
> >>               "max_vq_size": 256
> >>           }
> >>       }
> >> }
> >>
> >> The max number of vqs is 3 with the max size being 256.
> >>
> >> Since there are 2 virtio net devices, vhost_vdpa_svqs_start [1]
> >> is called twice. For each of them, it calls vhost_svq_start [2]
> >> v->shadow_vqs->len number of times.
> >>
> >
> > Ok I understand this confusion, as the code is not intuitive :). Take
> > into account you can only have svq in vdpa devices, so both
> > vhost_vdpa_svqs_start are acting on the vdpa device.
> >
> > You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
> > vdpa) devices are modelled internally as two devices in QEMU: One for
> > the dataplane vq, and other for the control vq. There are historical
> > reasons for this, but we use it in vdpa to always shadow the CVQ while
> > leaving dataplane passthrough if x-svq=off and the virtio & virtio-net
> > feature set is understood by SVQ.
> >
> > If you break at vhost_vdpa_svqs_start with gdb and go higher in the
> > stack you should reach vhost_net_start, that starts each vhost_net
> > device individually.
> >
> > To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
> > with a different vhost_net device in QEMU, but you don't need to take
> > that into account implementing the packed vq :).
>
> Got it, this makes sense now.
>
> >> Printing the values of dev->vdev->name, v->shadow_vqs->len and
> >> svq->vring.num in vhost_vdpa_svqs_start gives:
> >>
> >> name: virtio-net
> >> len: 2
> >> num: 256
> >> num: 256
> >
> > First QEMU's vhost_net device, the dataplane.
> >
> >> name: virtio-net
> >> len: 1
> >> num: 64
> >>
> >
> > Second QEMU's vhost_net device, the control virtqueue.
>
> Ok, if I understand this correctly, the control vq doesn't
> need separate queues for rx and tx.
>

That's right. Since CVQ has one reply per command, the driver can just
send ro+rw descriptors to the device. In the case of RX, the device
needs a queue with only-writable descriptors, as neither the device nor
the driver knows how many packets will arrive.
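
To illustrate the ro+rw layout of a control command, here is a toy
sketch. It assumes the virtio-net control message format (a class/cmd
header and command data that the device reads, followed by a one-byte
ack that the device writes). The toy_* names are invented for this
example and the "device" below does not interpret the command at all:

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_OK  0
#define VIRTIO_NET_ERR 1

/* Device-readable header at the start of every control command. */
struct toy_ctrl_hdr {
    uint8_t class_;
    uint8_t cmd;
};

/*
 * Toy device side of the CVQ (hypothetical, for illustration only).
 * `out` is the device-readable part (header + command data) and `in`
 * is the device-writable part (the ack byte). Returns the number of
 * bytes written by the device, i.e. the used length.
 */
static uint32_t toy_cvq_handle(const uint8_t *out, size_t out_len,
                               uint8_t *in, size_t in_len)
{
    (void)out; /* a real device would parse the command here */

    if (in_len < 1) {
        return 0; /* no room for the ack, nothing written */
    }

    /* Accept anything that at least carries a full header. */
    in[0] = (out_len >= sizeof(struct toy_ctrl_hdr)) ? VIRTIO_NET_OK
                                                     : VIRTIO_NET_ERR;
    return 1;
}
```

The single ack byte is the only thing the device writes, which is
consistent with the "len: 1" lines seen for the vq of size 64.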

> >> I am not sure how to match the above log lines to the
> >> right virtio-net device since the actual value of num
> >> can be less than "max_vq_size" in the output of "vdpa
> >> dev show".
> >>
> >
> > Yes, the device can set a different vq max per vq, and the driver can
> > negotiate a lower vq size per vq too.
> >
> >> I think the first 3 log lines correspond to the virtio
> >> net device that I am using for testing since it has
> >> 2 vqs (rx and tx) while the other virtio-net device
> >> only has one vq.
> >>
> >> When printing out the values of svq->vring.num,
> >> used_elem.len and used_elem.id in vhost_svq_get_buf,
> >> there are two sets of output. One set corresponds to
> >> svq->vring.num = 64 and the other corresponds to
> >> svq->vring.num = 256.
> >>
> >> For svq->vring.num = 64, only the following line
> >> is printed repeatedly:
> >>
> >> size: 64, len: 1, i: 0
> >>
> >
> > This is with packed=off, right? If this is testing with packed, you
> > need to change the code to accommodate it. Let me know if you need
> > more help with this.
>
> Yes, this is for packed=off. For the time being, I am trying to
> get L2 to communicate with L0 using split virtqueues and x-svq=true.
>

Got it.

> > In the CVQ the only reply is a byte, indicating if the command was
> > applied or not. This seems ok to me.
>
> Understood.
>
> > The queue can also recycle ids as long as they are not available, so
> > that part seems correct to me too.
>
> I am a little confused here. The ids are recycled when they are
> available (i.e., the id is not already in use), right?
>

In virtio, "available" means the driver has offered the descriptors to
the device, so the device can use them. "Used" means the device has
returned them to the driver. I think we're aligned, it's just better to
follow the virtio nomenclature :).
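
A toy model of the split ring may help with the terminology. This is
only a sketch under simplifying assumptions (no descriptor table, no
memory barriers, no flags or event index), and all the toy_* names are
hypothetical rather than taken from QEMU or Linux:

```c
#include <stdint.h>

#define TOY_RING_SIZE 4

struct toy_used_elem {
    uint32_t id;  /* head of the descriptor chain being returned */
    uint32_t len; /* bytes written by the device */
};

struct toy_ring {
    uint16_t avail_ring[TOY_RING_SIZE];
    uint16_t avail_idx;
    struct toy_used_elem used_ring[TOY_RING_SIZE];
    uint16_t used_idx;
    uint16_t last_used_idx; /* driver-private cursor into the used ring */
};

/* Driver: offer descriptor head `id` to the device ("available"). */
static void toy_make_available(struct toy_ring *r, uint16_t id)
{
    r->avail_ring[r->avail_idx % TOY_RING_SIZE] = id;
    r->avail_idx++;
}

/* Device: return head `id` to the driver ("used"). */
static void toy_mark_used(struct toy_ring *r, uint16_t id, uint32_t len)
{
    r->used_ring[r->used_idx % TOY_RING_SIZE] =
        (struct toy_used_elem){ .id = id, .len = len };
    r->used_idx++;
}

/* Driver: consume the next used element; returns 1 if there was one. */
static int toy_get_used(struct toy_ring *r, struct toy_used_elem *out)
{
    if (r->last_used_idx == r->used_idx) {
        return 0;
    }
    *out = r->used_ring[r->last_used_idx % TOY_RING_SIZE];
    r->last_used_idx++;
    return 1;
}
```

Once the driver has consumed the used element for id 0, it is free to
make id 0 available again, which is why the same id can show up
repeatedly in the logs.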

> >> For svq->vring.num = 256, the following line is
> >> printed 20 times,
> >>
> >> size: 256, len: 0, i: 0
> >>
> >> followed by:
> >>
> >> size: 256, len: 0, i: 1
> >> size: 256, len: 0, i: 1
> >>
> >
> > This makes sense for the tx queue too. Can you print the VirtQueue index?
>
> For svq->vring.num = 64, the vq index is 2. So the following line
> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
> is printed repeatedly:
>
> size: 64, len: 1, i: 0, vq idx: 2
>
> For svq->vring.num = 256, the following line is repeated several
> times:
>
> size: 256, len: 0, i: 0, vq idx: 1
>
> This is followed by:
>
> size: 256, len: 0, i: 1, vq idx: 1
>
> In both cases, queue_index is 1. To get the value of queue_index,
> I used "virtio_get_queue_index(svq->vq)" [2].
>
> Since the queue_index is 1, I guess this means this is the tx queue
> and the value of len (0) is correct. However, nothing with
> queue_index % 2 == 0 is printed by vhost_svq_get_buf() which means
> the device is not sending anything to the guest. Is this correct?
>

Yes, that's totally correct.

You can set -netdev tap,...,vhost=off in L0 qemu and trace (or debug
with gdb) it to check what it is receiving. You should see calls to
hw/net/virtio-net.c:virtio_net_flush_tx. The corresponding function to
receive is virtio_net_receive_rcu. I recommend you trace it too, just
in case you see any strange call to it.

> >> used_elem.len is used to set the value of len that is
> >> returned by vhost_svq_get_buf, and it's always 0.
> >>
> >> So the value of "len" returned by vhost_svq_get_buf
> >> when called in vhost_svq_flush is also 0.
> >>
> >> Thanks,
> >> Sahil
> >>
> >> [1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-vdpa.c#L1243
> >> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-vdpa.c#L1265
> >>
> >
>
> Thanks,
> Sahil
>
> [1] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
> [2] https://gitlab.com/qemu-project/qemu/-/blob/99d6a32469debf1a48921125879b614d15acfb7a/hw/virtio/virtio.c#L3454
>

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 4 months ago

Hi,

On 12/20/24 12:28 PM, Eugenio Perez Martin wrote:
> On Thu, Dec 19, 2024 at 8:37 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> On 12/17/24 1:20 PM, Eugenio Perez Martin wrote:
>>> On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
>>>>> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
>>>>>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>> [...]
>>>>>>>> I have been following the "Hands on vDPA: what do you do
>>>>>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
>>>>>>>> test my changes. To boot the L1 VM, I ran:
>>>>>>>>
>>>>>>>> sudo ./qemu/build/qemu-system-x86_64 \
>>>>>>>> -enable-kvm \
>>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>>>>>> -net nic,model=virtio \
>>>>>>>> -net user,hostfwd=tcp::2222-:22 \
>>>>>>>> -device intel-iommu,snoop-control=on \
>>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
>>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>>>>>> -nographic \
>>>>>>>> -m 8G \
>>>>>>>> -smp 4 \
>>>>>>>> -M q35 \
>>>>>>>> -cpu host 2>&1 | tee vm.log
>>>>>>>>
>>>>>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
>>>>>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
>>>>>>>> throws "vdpa svq does not work with features" [2] when
>>>>>>>> trying to boot L2.
>>>>>>>>
>>>>>>>> The enums added in commit #2 in this series is new and
>>>>>>>> wasn't in the earlier versions of the series. Without
>>>>>>>> this change, x-svq=true throws "SVQ invalid device feature
>>>>>>>> flags" [3] and x-svq is consequently disabled.
>>>>>>>>
>>>>>>>> The first issue is related to running traffic in L2
>>>>>>>> with vhost-vdpa.
>>>>>>>>
>>>>>>>> In L0:
>>>>>>>>
>>>>>>>> $ ip addr add 111.1.1.1/24 dev tap0
>>>>>>>> $ ip link set tap0 up
>>>>>>>> $ ip addr show tap0
>>>>>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>>>>>         link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>>>>>         inet 111.1.1.1/24 scope global tap0
>>>>>>>>            valid_lft forever preferred_lft forever
>>>>>>>>         inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>>>>>            valid_lft forever preferred_lft forever
>>>>>>>>
>>>>>>>> I am able to run traffic in L2 when booting without
>>>>>>>> x-svq.
>>>>>>>>
>>>>>>>> In L1:
>>>>>>>>
>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>> -nographic \
>>>>>>>> -m 4G \
>>>>>>>> -enable-kvm \
>>>>>>>> -M q35 \
>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>>>> -smp 4 \
>>>>>>>> -cpu host \
>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>
>>>>>>>> In L2:
>>>>>>>>
>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>>>>>> # ip addr show eth0
>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>>>         link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>>>         altname enp0s7
>>>>>>>>         inet 111.1.1.2/24 scope global eth0
>>>>>>>>            valid_lft forever preferred_lft forever
>>>>>>>>         inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>>>>>            valid_lft forever preferred_lft forever
>>>>>>>>
>>>>>>>> # ip route
>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>>>>>
>>>>>>>> # ping 111.1.1.1 -w3
>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
>>>>>>>>
>>>>>>>> --- 111.1.1.1 ping statistics ---
>>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
>>>>>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
>>>>>>>>
>>>>>>>>
>>>>>>>> But if I boot L2 with x-svq=true as shown below, I am unable
>>>>>>>> to ping the host machine.
>>>>>>>>
>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>> -nographic \
>>>>>>>> -m 4G \
>>>>>>>> -enable-kvm \
>>>>>>>> -M q35 \
>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>>>> -smp 4 \
>>>>>>>> -cpu host \
>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>
>>>>>>>> In L2:
>>>>>>>>
>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>>>>>> # ip addr show eth0
>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>>>         link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>>>         altname enp0s7
>>>>>>>>         inet 111.1.1.2/24 scope global eth0
>>>>>>>>            valid_lft forever preferred_lft forever
>>>>>>>>         inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>>>>>            valid_lft forever preferred_lft forever
>>>>>>>>
>>>>>>>> # ip route
>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>>>>>
>>>>>>>> # ping 111.1.1.1 -w10
>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>>>>>     From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
>>>>>>>> ping: sendmsg: No route to host
>>>>>>>>     From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
>>>>>>>>     From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
>>>>>>>>
>>>>>>>> --- 111.1.1.1 ping statistics ---
>>>>>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
>>>>>>>> pipe 3
>>>>>>>>
>>>>>>>> The other issue is related to booting L2 with "x-svq=true"
>>>>>>>> and "packed=on".
>>>>>>>>
>>>>>>>> In L1:
>>>>>>>>
>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>> -nographic \
>>>>>>>> -m 4G \
>>>>>>>> -enable-kvm \
>>>>>>>> -M q35 \
>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
>>>>>>>> -smp 4 \
>>>>>>>> -cpu host \
>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>
>>>>>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
>>>>>>>> a head!" [4].
>>>>>>>>
>>>>>>>
>>>>>>> So this series implements the descriptor forwarding from the guest to
>>>>>>> the device in packed vq. We also need to forward the descriptors from
>>>>>>> the device to the guest. The device writes them in the SVQ ring.
>>>>>>>
>>>>>>> The functions responsible for that in QEMU are
>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
>>>>>>> the device when used descriptors are written to the SVQ, which calls
>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
>>>>>>> modifications similar to vhost_svq_add: Make them conditional if we're
>>>>>>> in split or packed vq, and "copy" the code from Linux's
>>>>>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
>>>>>>>
>>>>>>> After these modifications you should be able to ping and forward
>>>>>>> traffic. As always, It is totally ok if it needs more than one
>>>>>>> iteration, and feel free to ask any question you have :).
>>>>>>>
>>>>>>
>>>>>> I misunderstood this part. While working on extending
>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
>>>>>> for packed vqs, I realized that this function and
>>>>>> vhost_svq_flush() already support split vqs. However, I am
>>>>>> unable to ping L0 when booting L2 with "x-svq=true" and
>>>>>> "packed=off" or when the "packed" option is not specified
>>>>>> in QEMU's command line.
>>>>>>
>>>>>> I tried debugging these functions for split vqs after running
>>>>>> the following QEMU commands while following the blog [2].
>>>>>>
>>>>>> Booting L1:
>>>>>>
>>>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
>>>>>> -enable-kvm \
>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>>>> -net nic,model=virtio \
>>>>>> -net user,hostfwd=tcp::2222-:22 \
>>>>>> -device intel-iommu,snoop-control=on \
>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
>>>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>>>> -nographic \
>>>>>> -m 8G \
>>>>>> -smp 4 \
>>>>>> -M q35 \
>>>>>> -cpu host 2>&1 | tee vm.log
>>>>>>
>>>>>> Booting L2:
>>>>>>
>>>>>> # ./qemu/build/qemu-system-x86_64 \
>>>>>> -nographic \
>>>>>> -m 4G \
>>>>>> -enable-kvm \
>>>>>> -M q35 \
>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>> -smp 4 \
>>>>>> -cpu host \
>>>>>> 2>&1 | tee vm.log
>>>>>>
>>>>>> I printed out the contents of VirtQueueElement returned
>>>>>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
>>>>>> I noticed that "len" which is set by "vhost_svq_get_buf"
>>>>>> is always set to 0 while VirtQueueElement.len is non-zero.
>>>>>> I haven't understood the difference between these two "len"s.
>>>>>>
>>>>>
>>>>> VirtQueueElement.len is the length of the buffer, while the len of
>>>>> vhost_svq_get_buf is the bytes written by the device. In the case of
>>>>> the tx queue, VirtQueueElement.len is the length of the tx packet, and the
>>>>> vhost_svq_get_buf len is always 0 as the device does not write. In the
>>>>> case of rx, VirtQueueElement.len is the available length for a rx frame,
>>>>> and the vhost_svq_get_buf len is the actual length written by the
>>>>> device.
>>>>>
>>>>> To be 100% accurate a rx packet can span over multiple buffers, but
>>>>> SVQ does not need special code to handle this.
>>>>>
>>>>> So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
>>>>> 0), and 0 for tx queue (svq->vq->index % 2 == 1).
>>>>>
>>>>> Take into account that vhost_svq_get_buf only handles split vq at the
>>>>> moment! It should be renamed to, or split into, vhost_svq_get_buf_split.
>>>>
>>>> In L1, there are 2 virtio network devices.
>>>>
>>>> # lspci -nn | grep -i net
>>>> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
>>>> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
>>>>
>>>> I am using the second one (1af4:1041) for testing my changes and have
>>>> bound this device to the vp_vdpa driver.
>>>>
>>>> # vdpa dev show -jp
>>>> {
>>>>        "dev": {
>>>>            "vdpa0": {
>>>>                "type": "network",
>>>>                "mgmtdev": "pci/0000:00:04.0",
>>>>                "vendor_id": 6900,
>>>>                "max_vqs": 3,
>>>
>>> How is max_vqs=3? For this to happen L0 QEMU should have
>>> virtio-net-pci,...,queues=3 cmdline argument.
> 
> Ouch! I totally misread it :(. Everything is correct, max_vqs should
> be 3: one queue pair (rx + tx) plus the control vq. I read it as the
> number of virtio_net queues, which counts queue *pairs*, each made up
> of an rx and a tx queue.

Understood :)

>>
>> I am not sure why max_vqs is 3. I haven't set the value of queues to 3
>> in the cmdline argument. Is max_vqs expected to have a default value
>> other than 3?
>>
>> In the blog [1] as well, max_vqs is 3 even though there's no queues=3
>> argument.
>>
>>> It's clear the guest is not using them, we can add mq=off
>>> to simplify the scenario.
>>
>> The value of max_vqs is still 3 after adding mq=off. The whole
>> command that I run to boot L0 is:
>>
>> $ sudo ./qemu/build/qemu-system-x86_64 \
>> -enable-kvm \
>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>> -net nic,model=virtio \
>> -net user,hostfwd=tcp::2222-:22 \
>> -device intel-iommu,snoop-control=on \
>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
>> -netdev tap,id=net0,script=no,downscript=no \
>> -nographic \
>> -m 8G \
>> -smp 4 \
>> -M q35 \
>> -cpu host 2>&1 | tee vm.log
>>
>> Could it be that 2 of the 3 vqs are used for the dataplane and
>> the third vq is the control vq?
>>
>>>>                "max_vq_size": 256
>>>>            }
>>>>        }
>>>> }
>>>>
>>>> The max number of vqs is 3 with the max size being 256.
>>>>
>>>> Since there are 2 virtio net devices, vhost_vdpa_svqs_start [1]
>>>> is called twice. For each of them, it calls vhost_svq_start [2]
>>>> v->shadow_vqs->len number of times.
>>>>
>>>
>>> Ok I understand this confusion, as the code is not intuitive :). Take
>>> into account you can only have svq in vdpa devices, so both
>>> vhost_vdpa_svqs_start are acting on the vdpa device.
>>>
>>> You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
>>> vdpa) devices are modelled internally as two devices in QEMU: One for
>>> the dataplane vq, and other for the control vq. There are historical
>>> reasons for this, but we use it in vdpa to always shadow the CVQ while
>>> leaving dataplane passthrough if x-svq=off and the virtio & virtio-net
>>> feature set is understood by SVQ.
>>>
>>> If you break at vhost_vdpa_svqs_start with gdb and go higher in the
>>> stack you should reach vhost_net_start, that starts each vhost_net
>>> device individually.
>>>
>>> To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
>>> with a different vhost_net device in QEMU, but you don't need to take
>>> that into account implementing the packed vq :).
>>
>> Got it, this makes sense now.
>>
>>>> Printing the values of dev->vdev->name, v->shadow_vqs->len and
>>>> svq->vring.num in vhost_vdpa_svqs_start gives:
>>>>
>>>> name: virtio-net
>>>> len: 2
>>>> num: 256
>>>> num: 256
>>>
>>> First QEMU's vhost_net device, the dataplane.
>>>
>>>> name: virtio-net
>>>> len: 1
>>>> num: 64
>>>>
>>>
>>> Second QEMU's vhost_net device, the control virtqueue.
>>
>> Ok, if I understand this correctly, the control vq doesn't
>> need separate queues for rx and tx.
>>
> 
> That's right. Since CVQ has one reply per command, the driver can just
> send ro+rw descriptors to the device. In the case of RX, the device
> needs a queue with only-writable descriptors, as neither the device nor
> the driver knows how many packets will arrive.

Got it, this makes sense now.

>>>> I am not sure how to match the above log lines to the
>>>> right virtio-net device since the actual value of num
>>>> can be less than "max_vq_size" in the output of "vdpa
>>>> dev show".
>>>>
>>>
>>> Yes, the device can set a different vq max per vq, and the driver can
>>> negotiate a lower vq size per vq too.
>>>
>>>> I think the first 3 log lines correspond to the virtio
>>>> net device that I am using for testing since it has
>>>> 2 vqs (rx and tx) while the other virtio-net device
>>>> only has one vq.
>>>>
>>>> When printing out the values of svq->vring.num,
>>>> used_elem.len and used_elem.id in vhost_svq_get_buf,
>>>> there are two sets of output. One set corresponds to
>>>> svq->vring.num = 64 and the other corresponds to
>>>> svq->vring.num = 256.
>>>>
>>>> For svq->vring.num = 64, only the following line
>>>> is printed repeatedly:
>>>>
>>>> size: 64, len: 1, i: 0
>>>>
>>>
>>> This is with packed=off, right? If this is testing with packed, you
>>> need to change the code to accommodate it. Let me know if you need
>>> more help with this.
>>
>> Yes, this is for packed=off. For the time being, I am trying to
>> get L2 to communicate with L0 using split virtqueues and x-svq=true.
>>
> 
> Got it.
> 
>>> In the CVQ the only reply is a byte, indicating if the command was
>>> applied or not. This seems ok to me.
>>
>> Understood.
>>
>>> The queue can also recycle ids as long as they are not available, so
>>> that part seems correct to me too.
>>
>> I am a little confused here. The ids are recycled when they are
>> available (i.e., the id is not already in use), right?
>>
> 
> In virtio, "available" means the driver has offered the descriptors to
> the device, so the device can use them. "Used" means the device has
> returned them to the driver. I think we're aligned, it's just better to
> follow the virtio nomenclature :).

Got it.

>>>> For svq->vring.num = 256, the following line is
>>>> printed 20 times,
>>>>
>>>> size: 256, len: 0, i: 0
>>>>
>>>> followed by:
>>>>
>>>> size: 256, len: 0, i: 1
>>>> size: 256, len: 0, i: 1
>>>>
>>>
>>> This makes sense for the tx queue too. Can you print the VirtQueue index?
>>
>> For svq->vring.num = 64, the vq index is 2. So the following line
>> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
>> is printed repeatedly:
>>
>> size: 64, len: 1, i: 0, vq idx: 2
>>
>> For svq->vring.num = 256, the following line is repeated several
>> times:
>>
>> size: 256, len: 0, i: 0, vq idx: 1
>>
>> This is followed by:
>>
>> size: 256, len: 0, i: 1, vq idx: 1
>>
>> In both cases, queue_index is 1. To get the value of queue_index,
>> I used "virtio_get_queue_index(svq->vq)" [2].
>>
>> Since the queue_index is 1, I guess this means this is the tx queue
>> and the value of len (0) is correct. However, nothing with
>> queue_index % 2 == 0 is printed by vhost_svq_get_buf() which means
>> the device is not sending anything to the guest. Is this correct?
>>
> 
> Yes, that's totally correct.
> 
> You can set -netdev tap,...,vhost=off in L0 qemu and trace (or debug
> with gdb) it to check what it is receiving. You should see calls to
> hw/net/virtio-net.c:virtio_net_flush_tx. The corresponding function to
> receive is virtio_net_receive_rcu. I recommend you trace it too, just
> in case you see any strange call to it.
> 

I added "vhost=off" to -netdev tap in L0's qemu command. I followed all
the steps in the blog [1] up till the point where L2 is booted. Before
booting L2, I had no issues pinging L0 from L1.

For each ping, the following trace lines were printed by QEMU:

virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 0 out_num 1
virtqueue_pop vq 0x5d04109b0ce8 elem 0x5d041024f560 in_num 0 out_num 1
virtqueue_fill vq 0x5d04109b0ce8 elem 0x5d041024f560 len 0 idx 0
virtqueue_flush vq 0x5d04109b0ce8 count 1
virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0ce8
virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 1 out_num 0
virtqueue_pop vq 0x5d04109b0c50 elem 0x5d041024f560 in_num 1 out_num 0
virtqueue_fill vq 0x5d04109b0c50 elem 0x5d041024f560 len 110 idx 0
virtqueue_flush vq 0x5d04109b0c50 count 1
virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50

The first 5 lines look like they were printed when an echo request was
sent to L0 and the next 5 lines were printed when an echo reply was
received.
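
The way I am reading these lines can be sketched roughly as follows.
This is just a helper I would use to sort the log, not anything from
QEMU, and the toy_* names are made up. It assumes the virtqueue_pop
trace format shown above: an element with only device-readable buffers
(out_num > 0, in_num == 0) is taken to be tx, and one with only
device-writable buffers (in_num > 0, out_num == 0) is taken to be rx.

```c
#include <stdio.h>

enum toy_dir { TOY_DIR_UNKNOWN, TOY_DIR_TX, TOY_DIR_RX };

/*
 * Classify a "virtqueue_pop" trace line by the buffers it carries.
 * Lines for other trace events, or elements with both readable and
 * writable buffers (e.g. control commands), are reported as unknown.
 */
static enum toy_dir toy_classify_pop(const char *line)
{
    unsigned long long vq, elem;
    unsigned in_num, out_num;

    if (sscanf(line, "virtqueue_pop vq %llx elem %llx in_num %u out_num %u",
               &vq, &elem, &in_num, &out_num) != 4) {
        return TOY_DIR_UNKNOWN;
    }
    if (in_num == 0 && out_num > 0) {
        return TOY_DIR_TX;
    }
    if (in_num > 0 && out_num == 0) {
        return TOY_DIR_RX;
    }
    return TOY_DIR_UNKNOWN;
}
```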

After booting L2, I set up the tap device's IP address in L0 and the
vDPA port's IP address in L2.

When trying to ping L0 from L2, I only see the following lines being
printed:

virtqueue_alloc_element elem 0x5d041099ffd0 size 56 in_num 0 out_num 1
virtqueue_pop vq 0x5d0410d87168 elem 0x5d041099ffd0 in_num 0 out_num 1
virtqueue_fill vq 0x5d0410d87168 elem 0x5d041099ffd0 len 0 idx 0
virtqueue_flush vq 0x5d0410d87168 count 1
virtio_notify vdev 0x5d0410d79a10 vq 0x5d0410d87168

There's no reception. I used wireshark to inspect the packets that are
being sent and received through the tap device in L0.

When pinging L0 from L2, I see one of the following two outcomes:

Outcome 1:
----------
L2 broadcasts ARP packets and L0 replies to L2.

Source             Destination        Protocol    Length    Info
52:54:00:12:34:57  Broadcast          ARP         42        Who has 111.1.1.1? Tell 111.1.1.2
d2:6d:b9:61:e1:9a  52:54:00:12:34:57  ARP         42        111.1.1.1 is at d2:6d:b9:61:e1:9a

Outcome 2 (less frequent):
--------------------------
L2 sends an ICMP echo request packet to L0 and L0 sends a reply,
but the reply is not received by L2.

Source             Destination        Protocol    Length    Info
111.1.1.2          111.1.1.1          ICMP        98        Echo (ping) request  id=0x0006, seq=1/256, ttl=64
111.1.1.1          111.1.1.2          ICMP        98        Echo (ping) reply    id=0x0006, seq=1/256, ttl=64

When pinging L2 from L0 I get the following output in
wireshark:

Source             Destination        Protocol    Length    Info
111.1.1.1          111.1.1.2          ICMP        100       Echo (ping) request  id=0x002c, seq=2/512, ttl=64 (no response found!)

I do see a lot of traced lines being printed (by the QEMU instance that
was started in L0) with in_num > 0, for example:

virtqueue_alloc_element elem 0x5d040fdbad30 size 56 in_num 1 out_num 0
virtqueue_pop vq 0x5d04109b0c50 elem 0x5d040fdbad30 in_num 1 out_num 0
virtqueue_fill vq 0x5d04109b0c50 elem 0x5d040fdbad30 len 76 idx 0
virtqueue_flush vq 0x5d04109b0c50 count 1
virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50

It looks like L1 is receiving data from L0 but this is not related to
the pings that are sent from L2. I haven't figured out what data is
actually being transferred in this case. It's not necessary for all of
the data that L1 receives from L0 to be passed to L2, is it?

>>>> For svq->vring.num = 256, the following line is
>>>> printed 20 times,
>>>>
>>>> size: 256, len: 0, i: 0
>>>>
>>>> followed by:
>>>>
>>>> size: 256, len: 0, i: 1
>>>> size: 256, len: 0, i: 1
>>>>
>>>
>>> This makes sense for the tx queue too. Can you print the VirtQueue index?
>>
>> For svq->vring.num = 64, the vq index is 2. So the following line
>> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
>> is printed repeatedly:
>>
>> size: 64, len: 1, i: 0, vq idx: 2
>>
>> For svq->vring.num = 256, the following line is repeated several
>> times:
>>
>> size: 256, len: 0, i: 0, vq idx: 1
>>
>> This is followed by:
>>
>> size: 256, len: 0, i: 1, vq idx: 1
>>
>> In both cases, queue_index is 1.

I also noticed that there are now some lines with svq->vring.num = 256
where len > 0. These lines were printed by the QEMU instance running
in L1, so this corresponds to data that was received by L2.

svq->vring.num  used_elem.len  used_elem.id  svq->vq->queue_index
size: 256       len: 82        i: 0          vq idx: 0
size: 256       len: 82        i: 1          vq idx: 0
size: 256       len: 82        i: 2          vq idx: 0
size: 256       len: 54        i: 3          vq idx: 0

I still haven't figured out what data was received by L2 but I am
slightly confused as to why this data was received by L2 but not
the ICMP echo replies sent by L0.
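
For reference, this is how I am mapping the "vq idx" column to queue
roles. It is a sketch that assumes the usual virtio-net layout (rx
queues at even indices, tx queues at odd indices, and the control vq
right after the last queue pair). The names are invented for this
example:

```c
#include <stdbool.h>

enum toy_vq_role { TOY_VQ_RX, TOY_VQ_TX, TOY_VQ_CTRL };

/*
 * Hypothetical helper: role of the virtqueue at `index` for a device
 * with `queue_pairs` rx/tx pairs and an optional control vq. The
 * caller is expected to pass an index that exists on the device.
 */
static enum toy_vq_role toy_vq_role_of(unsigned index, unsigned queue_pairs,
                                       bool ctrl_vq)
{
    if (ctrl_vq && index == 2 * queue_pairs) {
        return TOY_VQ_CTRL;
    }
    return (index % 2 == 0) ? TOY_VQ_RX : TOY_VQ_TX;
}
```

With one queue pair and ctrl_vq=on, index 0 is rx, index 1 is tx and
index 2 is the control vq, which lines up with the sizes and lengths
printed above.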

Thanks,
Sahil

[1] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 3 months, 4 weeks ago

On Fri, Jan 3, 2025 at 2:06 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 12/20/24 12:28 PM, Eugenio Perez Martin wrote:
> > On Thu, Dec 19, 2024 at 8:37 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On 12/17/24 1:20 PM, Eugenio Perez Martin wrote:
> >>> On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
> >>>>> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
> >>>>>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>>> [...]
> >>>>>>>> I have been following the "Hands on vDPA: what do you do
> >>>>>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
> >>>>>>>> test my changes. To boot the L1 VM, I ran:
> >>>>>>>>
> >>>>>>>> sudo ./qemu/build/qemu-system-x86_64 \
> >>>>>>>> -enable-kvm \
> >>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>>>>>> -net nic,model=virtio \
> >>>>>>>> -net user,hostfwd=tcp::2222-:22 \
> >>>>>>>> -device intel-iommu,snoop-control=on \
> >>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>>>>>> -nographic \
> >>>>>>>> -m 8G \
> >>>>>>>> -smp 4 \
> >>>>>>>> -M q35 \
> >>>>>>>> -cpu host 2>&1 | tee vm.log
> >>>>>>>>
> >>>>>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
> >>>>>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
> >>>>>>>> throws "vdpa svq does not work with features" [2] when
> >>>>>>>> trying to boot L2.
> >>>>>>>>
> >>>>>>>> The enums added in commit #2 in this series are new and
> >>>>>>>> weren't in the earlier versions of the series. Without
> >>>>>>>> this change, x-svq=true throws "SVQ invalid device feature
> >>>>>>>> flags" [3] and x-svq is consequently disabled.
> >>>>>>>>
> >>>>>>>> The first issue is related to running traffic in L2
> >>>>>>>> with vhost-vdpa.
> >>>>>>>>
> >>>>>>>> In L0:
> >>>>>>>>
> >>>>>>>> $ ip addr add 111.1.1.1/24 dev tap0
> >>>>>>>> $ ip link set tap0 up
> >>>>>>>> $ ip addr show tap0
> >>>>>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>>>>>>>         link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>>>>>>>         inet 111.1.1.1/24 scope global tap0
> >>>>>>>>            valid_lft forever preferred_lft forever
> >>>>>>>>         inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>>>>>>>            valid_lft forever preferred_lft forever
> >>>>>>>>
> >>>>>>>> I am able to run traffic in L2 when booting without
> >>>>>>>> x-svq.
> >>>>>>>>
> >>>>>>>> In L1:
> >>>>>>>>
> >>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>> -nographic \
> >>>>>>>> -m 4G \
> >>>>>>>> -enable-kvm \
> >>>>>>>> -M q35 \
> >>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
> >>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>>>> -smp 4 \
> >>>>>>>> -cpu host \
> >>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>
> >>>>>>>> In L2:
> >>>>>>>>
> >>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>>>>>> # ip addr show eth0
> >>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>>>         link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>>>         altname enp0s7
> >>>>>>>>         inet 111.1.1.2/24 scope global eth0
> >>>>>>>>            valid_lft forever preferred_lft forever
> >>>>>>>>         inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>>>>>            valid_lft forever preferred_lft forever
> >>>>>>>>
> >>>>>>>> # ip route
> >>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>>>>>
> >>>>>>>> # ping 111.1.1.1 -w3
> >>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
> >>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
> >>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
> >>>>>>>>
> >>>>>>>> --- 111.1.1.1 ping statistics ---
> >>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
> >>>>>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> But if I boot L2 with x-svq=true as shown below, I am unable
> >>>>>>>> to ping the host machine.
> >>>>>>>>
> >>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>> -nographic \
> >>>>>>>> -m 4G \
> >>>>>>>> -enable-kvm \
> >>>>>>>> -M q35 \
> >>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>>>> -smp 4 \
> >>>>>>>> -cpu host \
> >>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>
> >>>>>>>> In L2:
> >>>>>>>>
> >>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>>>>>> # ip addr show eth0
> >>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>>>         link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>>>         altname enp0s7
> >>>>>>>>         inet 111.1.1.2/24 scope global eth0
> >>>>>>>>            valid_lft forever preferred_lft forever
> >>>>>>>>         inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>>>>>            valid_lft forever preferred_lft forever
> >>>>>>>>
> >>>>>>>> # ip route
> >>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>>>>>
> >>>>>>>> # ping 111.1.1.1 -w10
> >>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>>>>>     From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
> >>>>>>>> ping: sendmsg: No route to host
> >>>>>>>>     From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
> >>>>>>>>     From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
> >>>>>>>>
> >>>>>>>> --- 111.1.1.1 ping statistics ---
> >>>>>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
> >>>>>>>> pipe 3
> >>>>>>>>
> >>>>>>>> The other issue is related to booting L2 with "x-svq=true"
> >>>>>>>> and "packed=on".
> >>>>>>>>
> >>>>>>>> In L1:
> >>>>>>>>
> >>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>> -nographic \
> >>>>>>>> -m 4G \
> >>>>>>>> -enable-kvm \
> >>>>>>>> -M q35 \
> >>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
> >>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> >>>>>>>> -smp 4 \
> >>>>>>>> -cpu host \
> >>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>
> >>>>>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
> >>>>>>>> a head!" [4].
> >>>>>>>>
> >>>>>>>
> >>>>>>> So this series implements the descriptor forwarding from the guest to
> >>>>>>> the device in packed vq. We also need to forward the descriptors from
> >>>>>>> the device to the guest. The device writes them in the SVQ ring.
> >>>>>>>
> >>>>>>> The functions responsible for that in QEMU are
> >>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
> >>>>>>> the device when used descriptors are written to the SVQ, which calls
> >>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
> >>>>>>> modifications similar to vhost_svq_add: Make them conditional if we're
> >>>>>>> in split or packed vq, and "copy" the code from Linux's
> >>>>>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
> >>>>>>>
> >>>>>>> After these modifications you should be able to ping and forward
> >>>>>>> traffic. As always, it is totally ok if it needs more than one
> >>>>>>> iteration, and feel free to ask any question you have :).
> >>>>>>>
> >>>>>>
> >>>>>> I misunderstood this part. While working on extending
> >>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
> >>>>>> for packed vqs, I realized that this function and
> >>>>>> vhost_svq_flush() already support split vqs. However, I am
> >>>>>> unable to ping L0 when booting L2 with "x-svq=true" and
> >>>>>> "packed=off" or when the "packed" option is not specified
> >>>>>> in QEMU's command line.
> >>>>>>
> >>>>>> I tried debugging these functions for split vqs after running
> >>>>>> the following QEMU commands while following the blog [2].
> >>>>>>
> >>>>>> Booting L1:
> >>>>>>
> >>>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
> >>>>>> -enable-kvm \
> >>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>>>> -net nic,model=virtio \
> >>>>>> -net user,hostfwd=tcp::2222-:22 \
> >>>>>> -device intel-iommu,snoop-control=on \
> >>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>>>> -nographic \
> >>>>>> -m 8G \
> >>>>>> -smp 4 \
> >>>>>> -M q35 \
> >>>>>> -cpu host 2>&1 | tee vm.log
> >>>>>>
> >>>>>> Booting L2:
> >>>>>>
> >>>>>> # ./qemu/build/qemu-system-x86_64 \
> >>>>>> -nographic \
> >>>>>> -m 4G \
> >>>>>> -enable-kvm \
> >>>>>> -M q35 \
> >>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>> -smp 4 \
> >>>>>> -cpu host \
> >>>>>> 2>&1 | tee vm.log
> >>>>>>
> >>>>>> I printed out the contents of VirtQueueElement returned
> >>>>>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
> >>>>>> I noticed that "len" which is set by "vhost_svq_get_buf"
> >>>>>> is always set to 0 while VirtQueueElement.len is non-zero.
> >>>>>> I haven't understood the difference between these two "len"s.
> >>>>>>
> >>>>>
> >>>>> VirtQueueElement.len is the length of the buffer, while the len of
> >>>>> vhost_svq_get_buf is the bytes written by the device. In the case of
> >>>>> the tx queue, VirtQueueElement.len is the length of the tx packet, and
> >>>>> the vhost_svq_get_buf len is always 0 as the device does not write. In
> >>>>> the case of rx, VirtQueueElement.len is the available length for a rx
> >>>>> frame, and the vhost_svq_get_buf len is the actual length written by
> >>>>> the device.
> >>>>>
> >>>>> To be 100% accurate a rx packet can span over multiple buffers, but
> >>>>> SVQ does not need special code to handle this.
> >>>>>
> >>>>> So vhost_svq_get_buf should return > 0 for an rx queue (svq->vq->index %
> >>>>> 2 == 0), and 0 for a tx queue (svq->vq->index % 2 == 1).
> >>>>>
> >>>>> Take into account that vhost_svq_get_buf only handles split vqs at the
> >>>>> moment! It should be renamed to, or split into, vhost_svq_get_buf_split.
> >>>>
> >>>> In L1, there are 2 virtio network devices.
> >>>>
> >>>> # lspci -nn | grep -i net
> >>>> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
> >>>> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
> >>>>
> >>>> I am using the second one (1af4:1041) for testing my changes and have
> >>>> bound this device to the vp_vdpa driver.
> >>>>
> >>>> # vdpa dev show -jp
> >>>> {
> >>>>        "dev": {
> >>>>            "vdpa0": {
> >>>>                "type": "network",
> >>>>                "mgmtdev": "pci/0000:00:04.0",
> >>>>                "vendor_id": 6900,
> >>>>                "max_vqs": 3,
> >>>
> >>> How is max_vqs=3? For this to happen L0 QEMU should have
> >>> virtio-net-pci,...,queues=3 cmdline argument.
> >
> > Ouch! I totally misread it :(. Everything is correct, max_vqs should
> > be 3. I read it as the virtio_net queues, which means queue *pairs*,
> > as it includes rx and tx queue.
>
> Understood :)
>
> >>
> >> I am not sure why max_vqs is 3. I haven't set the value of queues to 3
> >> in the cmdline argument. Is max_vqs expected to have a default value
> >> other than 3?
> >>
> >> In the blog [1] as well, max_vqs is 3 even though there's no queues=3
> >> argument.
> >>
> >>> It's clear the guest is not using them, we can add mq=off
> >>> to simplify the scenario.
> >>
> >> The value of max_vqs is still 3 after adding mq=off. The whole
> >> command that I run in L0 to boot L1 is:
> >>
> >> $ sudo ./qemu/build/qemu-system-x86_64 \
> >> -enable-kvm \
> >> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >> -net nic,model=virtio \
> >> -net user,hostfwd=tcp::2222-:22 \
> >> -device intel-iommu,snoop-control=on \
> >> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> >> -netdev tap,id=net0,script=no,downscript=no \
> >> -nographic \
> >> -m 8G \
> >> -smp 4 \
> >> -M q35 \
> >> -cpu host 2>&1 | tee vm.log
> >>
> >> Could it be that 2 of the 3 vqs are used for the dataplane and
> >> the third vq is the control vq?
> >>
> >>>>                "max_vq_size": 256
> >>>>            }
> >>>>        }
> >>>> }
> >>>>
> >>>> The max number of vqs is 3 with the max size being 256.
> >>>>
> >>>> Since there are 2 virtio-net devices, vhost_vdpa_svqs_start [1]
> >>>> is called twice. For each of them, it calls vhost_svq_start [2]
> >>>> v->shadow_vqs->len times.
> >>>>
> >>>
> >>> Ok I understand this confusion, as the code is not intuitive :). Take
> >>> into account you can only have svq in vdpa devices, so both
> >>> vhost_vdpa_svqs_start are acting on the vdpa device.
> >>>
> >>> You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
> >>> vdpa) devices are modelled internally as two devices in QEMU: one for
> >>> the dataplane vqs, and another for the control vq. There are historical
> >>> reasons for this, but we use it in vdpa to always shadow the CVQ while
> >>> leaving the dataplane in passthrough if x-svq=off and the virtio &
> >>> virtio-net feature set is understood by SVQ.
> >>>
> >>> If you break at vhost_vdpa_svqs_start with gdb and go higher in the
> >>> stack you should reach vhost_net_start, that starts each vhost_net
> >>> device individually.
> >>>
> >>> To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
> >>> with a different vhost_net device in QEMU, but you don't need to take
> >>> that into account when implementing the packed vq :).
> >>
> >> Got it, this makes sense now.
> >>
> >>>> Printing the values of dev->vdev->name, v->shadow_vqs->len and
> >>>> svq->vring.num in vhost_vdpa_svqs_start gives:
> >>>>
> >>>> name: virtio-net
> >>>> len: 2
> >>>> num: 256
> >>>> num: 256
> >>>
> >>> First QEMU's vhost_net device, the dataplane.
> >>>
> >>>> name: virtio-net
> >>>> len: 1
> >>>> num: 64
> >>>>
> >>>
> >>> Second QEMU's vhost_net device, the control virtqueue.
> >>
> >> Ok, if I understand this correctly, the control vq doesn't
> >> need separate queues for rx and tx.
> >>
> >
> > That's right. Since CVQ has one reply per command, the driver can just
> > send ro+rw descriptors to the device. In the case of RX, the device
> > needs a queue with write-only descriptors, as neither the device nor
> > the driver knows how many packets will arrive.
>
> Got it, this makes sense now.
>
> >>>> I am not sure how to match the above log lines to the
> >>>> right virtio-net device since the actual value of num
> >>>> can be less than "max_vq_size" in the output of "vdpa
> >>>> dev show".
> >>>>
> >>>
> >>> Yes, the device can set a different vq max per vq, and the driver can
> >>> negotiate a lower vq size per vq too.
> >>>
> >>>> I think the first 3 log lines correspond to the virtio
> >>>> net device that I am using for testing since it has
> >>>> 2 vqs (rx and tx) while the other virtio-net device
> >>>> only has one vq.
> >>>>
> >>>> When printing out the values of svq->vring.num,
> >>>> used_elem.len and used_elem.id in vhost_svq_get_buf,
> >>>> there are two sets of output. One set corresponds to
> >>>> svq->vring.num = 64 and the other corresponds to
> >>>> svq->vring.num = 256.
> >>>>
> >>>> For svq->vring.num = 64, only the following line
> >>>> is printed repeatedly:
> >>>>
> >>>> size: 64, len: 1, i: 0
> >>>>
> >>>
> >>> This is with packed=off, right? If this is testing with packed, you
> >>> need to change the code to accommodate it. Let me know if you need
> >>> more help with this.
> >>
> >> Yes, this is for packed=off. For the time being, I am trying to
> >> get L2 to communicate with L0 using split virtqueues and x-svq=true.
> >>
> >
> > Got it.
> >
> >>> In the CVQ the only reply is a byte, indicating if the command was
> >>> applied or not. This seems ok to me.
> >>
> >> Understood.
> >>
> >>> The queue can also recycle ids as long as they are not available, so
> >>> that part seems correct to me too.
> >>
> >> I am a little confused here. The ids are recycled when they are
> >> available (i.e., the id is not already in use), right?
> >>
> >
> > In virtio, "available" means that the device can use them, and "used"
> > means that the device has returned them to the driver. I think we're
> > aligned; it's just better to follow the virtio nomenclature :).
>
> Got it.
>
> >>>> For svq->vring.num = 256, the following line is
> >>>> printed 20 times,
> >>>>
> >>>> size: 256, len: 0, i: 0
> >>>>
> >>>> followed by:
> >>>>
> >>>> size: 256, len: 0, i: 1
> >>>> size: 256, len: 0, i: 1
> >>>>
> >>>
> >>> This makes sense for the tx queue too. Can you print the VirtQueue index?
> >>
> >> For svq->vring.num = 64, the vq index is 2. So the following line
> >> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
> >> is printed repeatedly:
> >>
> >> size: 64, len: 1, i: 0, vq idx: 2
> >>
> >> For svq->vring.num = 256, the following line is repeated several
> >> times:
> >>
> >> size: 256, len: 0, i: 0, vq idx: 1
> >>
> >> This is followed by:
> >>
> >> size: 256, len: 0, i: 1, vq idx: 1
> >>
> >> In both cases, queue_index is 1. To get the value of queue_index,
> >> I used "virtio_get_queue_index(svq->vq)" [2].
> >>
> >> Since the queue_index is 1, I guess this means this is the tx queue
> >> and the value of len (0) is correct. However, nothing with
> >> queue_index % 2 == 0 is printed by vhost_svq_get_buf() which means
> >> the device is not sending anything to the guest. Is this correct?
> >>
> >
> > Yes, that's totally correct.
> >
> > You can set -netdev tap,...,vhost=off in L0 qemu and trace (or debug
> > with gdb) it to check what it is receiving. You should see calls to
> > hw/net/virtio-net.c:virtio_net_flush_tx. The corresponding function to
> > receive is virtio_net_receive_rcu; I recommend you trace it too, just
> > in case you see any strange call to it.
> >
>
> I added "vhost=off" to -netdev tap in L0's qemu command. I followed all
> the steps in the blog [1] up till the point where L2 is booted. Before
> booting L2, I had no issues pinging L0 from L1.
>
> For each ping, the following trace lines were printed by QEMU:
>
> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 0 out_num 1
> virtqueue_pop vq 0x5d04109b0ce8 elem 0x5d041024f560 in_num 0 out_num 1
> virtqueue_fill vq 0x5d04109b0ce8 elem 0x5d041024f560 len 0 idx 0
> virtqueue_flush vq 0x5d04109b0ce8 count 1
> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0ce8
> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 1 out_num 0
> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d041024f560 in_num 1 out_num 0
> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d041024f560 len 110 idx 0
> virtqueue_flush vq 0x5d04109b0c50 count 1
> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
>
> The first 5 lines look like they were printed when an echo request was
> sent to L0 and the next 5 lines were printed when an echo reply was
> received.
>
> After booting L2, I set up the tap device's IP address in L0 and the
> vDPA port's IP address in L2.
>
> When trying to ping L0 from L2, I only see the following lines being
> printed:
>
> virtqueue_alloc_element elem 0x5d041099ffd0 size 56 in_num 0 out_num 1
> virtqueue_pop vq 0x5d0410d87168 elem 0x5d041099ffd0 in_num 0 out_num 1
> virtqueue_fill vq 0x5d0410d87168 elem 0x5d041099ffd0 len 0 idx 0
> virtqueue_flush vq 0x5d0410d87168 count 1
> virtio_notify vdev 0x5d0410d79a10 vq 0x5d0410d87168
>
> There's no reception. I used wireshark to inspect the packets that are
> being sent and received through the tap device in L0.
>
> When pinging L0 from L2, I see one of the following two outcomes:
>
> Outcome 1:
> ----------
> L2 broadcasts ARP packets and L0 replies to L2.
>
> Source             Destination        Protocol    Length    Info
> 52:54:00:12:34:57  Broadcast          ARP         42        Who has 111.1.1.1? Tell 111.1.1.2
> d2:6d:b9:61:e1:9a  52:54:00:12:34:57  ARP         42        111.1.1.1 is at d2:6d:b9:61:e1:9a
>
> Outcome 2 (less frequent):
> --------------------------
> L2 sends an ICMP echo request packet to L0 and L0 sends a reply,
> but the reply is not received by L2.
>
> Source             Destination        Protocol    Length    Info
> 111.1.1.2          111.1.1.1          ICMP        98        Echo (ping) request  id=0x0006, seq=1/256, ttl=64
> 111.1.1.1          111.1.1.2          ICMP        98        Echo (ping) reply    id=0x0006, seq=1/256, ttl=64
>
> When pinging L2 from L0 I get the following output in
> wireshark:
>
> Source             Destination        Protocol    Length    Info
> 111.1.1.1          111.1.1.2          ICMP        100       Echo (ping) request  id=0x002c, seq=2/512, ttl=64 (no response found!)
>
> I do see a lot of traced lines being printed (by the QEMU instance that
> was started in L0) with in_num > 0, for example:
>
> virtqueue_alloc_element elem 0x5d040fdbad30 size 56 in_num 1 out_num 0
> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d040fdbad30 in_num 1 out_num 0
> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d040fdbad30 len 76 idx 0
> virtqueue_flush vq 0x5d04109b0c50 count 1
> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
>

So L0 is able to receive data from L2. We're halfway there, good! :)

> It looks like L1 is receiving data from L0 but this is not related to
> the pings that are sent from L2. I haven't figured out what data is
> actually being transferred in this case. It's not necessary for all of
> the data that L1 receives from L0 to be passed to L2, is it?
>

It should be noise, yes.
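
One way to get a hint about what a given buffer carries is to compare its len with known frame sizes. The numbers in the ping output, the Wireshark capture and the virtqueue_fill trace of the echo reply received by L1 line up if one assumes a 12-byte virtio-net header in front of each frame. The sketch below only redoes that arithmetic in C; the constant and function names are invented for illustration, and the 12-byte header size is an assumption about this setup, not something read from the logs.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SK_ETH_HDR_LEN  = 14, /* Ethernet header, no VLAN tag */
    SK_IPV4_HDR_LEN = 20, /* IPv4 header, no options */
    SK_ICMP_HDR_LEN = 8,  /* ICMP echo header */
    SK_VNET_HDR_LEN = 12  /* ASSUMPTION: virtio-net header size in this setup */
};

/* Length of an ICMP echo at the IP level for a given payload size. */
static size_t icmp_echo_ip_len(size_t payload_len)
{
    return payload_len + SK_ICMP_HDR_LEN + SK_IPV4_HDR_LEN;
}

/* Length of the same ICMP echo as an Ethernet frame (what Wireshark shows). */
static size_t icmp_echo_frame_len(size_t payload_len)
{
    return icmp_echo_ip_len(payload_len) + SK_ETH_HDR_LEN;
}

/* Length written into an rx buffer: assumed virtio-net header plus the frame. */
static size_t used_len_for_frame(size_t frame_len)
{
    assert(frame_len >= (size_t)SK_ETH_HDR_LEN);
    return frame_len + SK_VNET_HDR_LEN;
}
```

For the default 56-byte ping payload this gives 84 bytes at the IP level (the "56(84)" in the ping output), 98 bytes on the wire (the Wireshark length), and 110 bytes with the assumed header (the len in the virtqueue_fill trace of the echo reply).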

> >>>> For svq->vring.num = 256, the following line is
> >>>> printed 20 times,
> >>>>
> >>>> size: 256, len: 0, i: 0
> >>>>
> >>>> followed by:
> >>>>
> >>>> size: 256, len: 0, i: 1
> >>>> size: 256, len: 0, i: 1
> >>>>
> >>>
> >>> This makes sense for the tx queue too. Can you print the VirtQueue index?
> >>
> >> For svq->vring.num = 64, the vq index is 2. So the following line
> >> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
> >> is printed repeatedly:
> >>
> >> size: 64, len: 1, i: 0, vq idx: 2
> >>
> >> For svq->vring.num = 256, the following line is repeated several
> >> times:
> >>
> >> size: 256, len: 0, i: 0, vq idx: 1
> >>
> >> This is followed by:
> >>
> >> size: 256, len: 0, i: 1, vq idx: 1
> >>
> >> In both cases, queue_index is 1.
>
> I also noticed that there are now some lines with svq->vring.num = 256
> where len > 0. These lines were printed by the QEMU instance running
> in L1, so this corresponds to data that was received by L2.
>
> svq->vring.num  used_elem.len  used_elem.id  svq->vq->queue_index
> size: 256       len: 82        i: 0          vq idx: 0
> size: 256       len: 82        i: 1          vq idx: 0
> size: 256       len: 82        i: 2          vq idx: 0
> size: 256       len: 54        i: 3          vq idx: 0
>
> I still haven't figured out what data was received by L2 but I am
> slightly confused as to why this data was received by L2 but not
> the ICMP echo replies sent by L0.
>

We're on the right track; let's trace it deeper. I guess these are
printed from vhost_svq_flush, right? Do virtqueue_fill,
virtqueue_flush, and event_notifier_set(&svq->svq_call) run properly,
or do you see anything strange with gdb / tracing?
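
To make the question concrete, here is a rough standalone model in C of the call order being asked about: one fill per used element, then a single flush and a single guest notification per batch. It assumes that is how the flush path is structured; mock_used, mock_svq, mock_get_buf and mock_flush are hypothetical stand-ins with counters, not QEMU's types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one used entry returned by the device. */
struct mock_used {
    unsigned id;
    unsigned len;
};

/* Hypothetical stand-in for an SVQ, with counters instead of real rings. */
struct mock_svq {
    const struct mock_used *used; /* entries the device marked as used */
    size_t used_count;
    size_t next;                  /* next used entry to hand out */
    unsigned fills;               /* times the fill step ran */
    unsigned flushes;             /* times the flush step ran */
    unsigned last_flush_count;    /* element count passed to the last flush */
    unsigned calls;               /* times the guest was notified */
};

/* Stand-in for the get_buf step: pop one used entry, or NULL if none left. */
static const struct mock_used *mock_get_buf(struct mock_svq *svq)
{
    if (svq->next == svq->used_count) {
        return NULL;
    }
    return &svq->used[svq->next++];
}

/* Drain all used entries: fill each one, then flush and notify once. */
static void mock_flush(struct mock_svq *svq)
{
    const struct mock_used *elem;
    unsigned i = 0;

    assert(svq != NULL);
    while ((elem = mock_get_buf(svq)) != NULL) {
        (void)elem->len; /* the real path would forward this len to the guest */
        svq->fills++;    /* models the per-element fill step */
        i++;
    }
    if (i > 0) {
        svq->flushes++;            /* models the single flush of i elements */
        svq->last_flush_count = i;
        svq->calls++;              /* models the guest notification */
    }
}
```

Feeding it the four used entries from the log (len 82, 82, 82, 54) leaves fills at 4 and flushes and calls at 1 each. If tracing the real path shows fills without a following flush or notification, that is where to look next.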

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 3 months, 2 weeks ago

Hi,

On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
> On Fri, Jan 3, 2025 at 2:06 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> On 12/20/24 12:28 PM, Eugenio Perez Martin wrote:
>>> On Thu, Dec 19, 2024 at 8:37 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 12/17/24 1:20 PM, Eugenio Perez Martin wrote:
>>>>> On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
>>>>>>> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
>>>>>>>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>>>> [...]
>>>>>>>>>> I have been following the "Hands on vDPA: what do you do
>>>>>>>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
>>>>>>>>>> test my changes. To boot the L1 VM, I ran:
>>>>>>>>>>
>>>>>>>>>> sudo ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>> -enable-kvm \
>>>>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>>>>>>>> -net nic,model=virtio \
>>>>>>>>>> -net user,hostfwd=tcp::2222-:22 \
>>>>>>>>>> -device intel-iommu,snoop-control=on \
>>>>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
>>>>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>>>>>>>> -nographic \
>>>>>>>>>> -m 8G \
>>>>>>>>>> -smp 4 \
>>>>>>>>>> -M q35 \
>>>>>>>>>> -cpu host 2>&1 | tee vm.log
>>>>>>>>>>
>>>>>>>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
>>>>>>>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
>>>>>>>>>> throws "vdpa svq does not work with features" [2] when
>>>>>>>>>> trying to boot L2.
>>>>>>>>>>
>>>>>>>>>> The enums added in commit #2 in this series are new and
>>>>>>>>>> weren't in the earlier versions of the series. Without
>>>>>>>>>> this change, x-svq=true throws "SVQ invalid device feature
>>>>>>>>>> flags" [3] and x-svq is consequently disabled.
>>>>>>>>>>
>>>>>>>>>> The first issue is related to running traffic in L2
>>>>>>>>>> with vhost-vdpa.
>>>>>>>>>>
>>>>>>>>>> In L0:
>>>>>>>>>>
>>>>>>>>>> $ ip addr add 111.1.1.1/24 dev tap0
>>>>>>>>>> $ ip link set tap0 up
>>>>>>>>>> $ ip addr show tap0
>>>>>>>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>>>>>>>          link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>          inet 111.1.1.1/24 scope global tap0
>>>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>>>          inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>>>
>>>>>>>>>> I am able to run traffic in L2 when booting without
>>>>>>>>>> x-svq.
>>>>>>>>>>
>>>>>>>>>> In L1:
>>>>>>>>>>
>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>> -nographic \
>>>>>>>>>> -m 4G \
>>>>>>>>>> -enable-kvm \
>>>>>>>>>> -M q35 \
>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>>>>>> -smp 4 \
>>>>>>>>>> -cpu host \
>>>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>>>
>>>>>>>>>> In L2:
>>>>>>>>>>
>>>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>>>>>>>> # ip addr show eth0
>>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>>>>>          link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>          altname enp0s7
>>>>>>>>>>          inet 111.1.1.2/24 scope global eth0
>>>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>>>          inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>>>
>>>>>>>>>> # ip route
>>>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>>>>>>>
>>>>>>>>>> # ping 111.1.1.1 -w3
>>>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
>>>>>>>>>>
>>>>>>>>>> --- 111.1.1.1 ping statistics ---
>>>>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
>>>>>>>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But if I boot L2 with x-svq=true as shown below, I am unable
>>>>>>>>>> to ping the host machine.
>>>>>>>>>>
>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>> -nographic \
>>>>>>>>>> -m 4G \
>>>>>>>>>> -enable-kvm \
>>>>>>>>>> -M q35 \
>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>>>>>> -smp 4 \
>>>>>>>>>> -cpu host \
>>>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>>>
>>>>>>>>>> In L2:
>>>>>>>>>>
>>>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>>>>>>>> # ip addr show eth0
>>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>>>>>          link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>          altname enp0s7
>>>>>>>>>>          inet 111.1.1.2/24 scope global eth0
>>>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>>>          inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>>>
>>>>>>>>>> # ip route
>>>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>>>>>>>
>>>>>>>>>> # ping 111.1.1.1 -w10
>>>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>>>>>>>      From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
>>>>>>>>>> ping: sendmsg: No route to host
>>>>>>>>>>      From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
>>>>>>>>>>      From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
>>>>>>>>>>
>>>>>>>>>> --- 111.1.1.1 ping statistics ---
>>>>>>>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
>>>>>>>>>> pipe 3
>>>>>>>>>>
>>>>>>>>>> The other issue is related to booting L2 with "x-svq=true"
>>>>>>>>>> and "packed=on".
>>>>>>>>>>
>>>>>>>>>> In L1:
>>>>>>>>>>
>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>> -nographic \
>>>>>>>>>> -m 4G \
>>>>>>>>>> -enable-kvm \
>>>>>>>>>> -M q35 \
>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
>>>>>>>>>> -smp 4 \
>>>>>>>>>> -cpu host \
>>>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>>>
>>>>>>>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
>>>>>>>>>> a head!" [4].
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So this series implements the descriptor forwarding from the guest to
>>>>>>>>> the device in packed vq. We also need to forward the descriptors from
>>>>>>>>> the device to the guest. The device writes them in the SVQ ring.
>>>>>>>>>
>>>>>>>>> The functions responsible for that in QEMU are
>>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
>>>>>>>>> the device when used descriptors are written to the SVQ, which calls
>>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
>>>>>>>>> modifications similar to vhost_svq_add: Make them conditional if we're
>>>>>>>>> in split or packed vq, and "copy" the code from Linux's
>>>>>>>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
>>>>>>>>>
>>>>>>>>> After these modifications you should be able to ping and forward
>>>>>>>>> traffic. As always, it is totally ok if it needs more than one
>>>>>>>>> iteration, and feel free to ask any question you have :).
>>>>>>>>>
>>>>>>>>
>>>>>>>> I misunderstood this part. While working on extending
>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
>>>>>>>> for packed vqs, I realized that this function and
>>>>>>>> vhost_svq_flush() already support split vqs. However, I am
>>>>>>>> unable to ping L0 when booting L2 with "x-svq=true" and
>>>>>>>> "packed=off" or when the "packed" option is not specified
>>>>>>>> in QEMU's command line.
>>>>>>>>
>>>>>>>> I tried debugging these functions for split vqs after running
>>>>>>>> the following QEMU commands while following the blog [2].
>>>>>>>>
>>>>>>>> Booting L1:
>>>>>>>>
>>>>>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
>>>>>>>> -enable-kvm \
>>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>>>>>> -net nic,model=virtio \
>>>>>>>> -net user,hostfwd=tcp::2222-:22 \
>>>>>>>> -device intel-iommu,snoop-control=on \
>>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
>>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>>>>>> -nographic \
>>>>>>>> -m 8G \
>>>>>>>> -smp 4 \
>>>>>>>> -M q35 \
>>>>>>>> -cpu host 2>&1 | tee vm.log
>>>>>>>>
>>>>>>>> Booting L2:
>>>>>>>>
>>>>>>>> # ./qemu/build/qemu-system-x86_64 \
>>>>>>>> -nographic \
>>>>>>>> -m 4G \
>>>>>>>> -enable-kvm \
>>>>>>>> -M q35 \
>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>>>> -smp 4 \
>>>>>>>> -cpu host \
>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>
>>>>>>>> I printed out the contents of VirtQueueElement returned
>>>>>>>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
>>>>>>>> I noticed that "len" which is set by "vhost_svq_get_buf"
>>>>>>>> is always set to 0 while VirtQueueElement.len is non-zero.
>>>>>>>> I haven't understood the difference between these two "len"s.
>>>>>>>>
>>>>>>>
>>>>>>> VirtQueueElement.len is the length of the buffer, while the len of
>>>>>>> vhost_svq_get_buf is the bytes written by the device. In the case of
>>>>>>> the tx queue, VirtQueueElement.len is the length of the tx packet, and
>>>>>>> the vhost_svq_get_buf len is always 0 as the device does not write. In
>>>>>>> the case of rx, VirtQueueElement.len is the available length for a rx
>>>>>>> frame, and the vhost_svq_get_buf len is the actual length written by
>>>>>>> the device.
>>>>>>>
>>>>>>> To be 100% accurate a rx packet can span over multiple buffers, but
>>>>>>> SVQ does not need special code to handle this.
>>>>>>>
>>>>>>> So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
>>>>>>> 0), and 0 for tx queue (svq->vq->index % 2 == 1).
>>>>>>>
>>>>>>> Take into account that vhost_svq_get_buf only handles split vq at the
>>>>>>> moment! It should be renamed or split into vhost_svq_get_buf_split.
>>>>>>
>>>>>> In L1, there are 2 virtio network devices.
>>>>>>
>>>>>> # lspci -nn | grep -i net
>>>>>> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
>>>>>> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
>>>>>>
>>>>>> I am using the second one (1af4:1041) for testing my changes and have
>>>>>> bound this device to the vp_vdpa driver.
>>>>>>
>>>>>> # vdpa dev show -jp
>>>>>> {
>>>>>>         "dev": {
>>>>>>             "vdpa0": {
>>>>>>                 "type": "network",
>>>>>>                 "mgmtdev": "pci/0000:00:04.0",
>>>>>>                 "vendor_id": 6900,
>>>>>>                 "max_vqs": 3,
>>>>>
>>>>> How is max_vqs=3? For this to happen L0 QEMU should have
>>>>> virtio-net-pci,...,queues=3 cmdline argument.
>>>
>>> Ouch! I totally misread it :(. Everything is correct, max_vqs should
>>> be 3. I read it as the virtio_net queues, which means queue *pairs*,
>>> as it includes rx and tx queue.
>>
>> Understood :)
>>
>>>>
>>>> I am not sure why max_vqs is 3. I haven't set the value of queues to 3
>>>> in the cmdline argument. Is max_vqs expected to have a default value
>>>> other than 3?
>>>>
>>>> In the blog [1] as well, max_vqs is 3 even though there's no queues=3
>>>> argument.
>>>>
>>>>> It's clear the guest is not using them, we can add mq=off
>>>>> to simplify the scenario.
>>>>
>>>> The value of max_vqs is still 3 after adding mq=off. The whole
>>>> command that I run to boot L0 is:
>>>>
>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
>>>> -enable-kvm \
>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>> -net nic,model=virtio \
>>>> -net user,hostfwd=tcp::2222-:22 \
>>>> -device intel-iommu,snoop-control=on \
>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>> -nographic \
>>>> -m 8G \
>>>> -smp 4 \
>>>> -M q35 \
>>>> -cpu host 2>&1 | tee vm.log
>>>>
>>>> Could it be that 2 of the 3 vqs are used for the dataplane and
>>>> the third vq is the control vq?
>>>>
>>>>>>                 "max_vq_size": 256
>>>>>>             }
>>>>>>         }
>>>>>> }
>>>>>>
>>>>>> The max number of vqs is 3 with the max size being 256.
>>>>>>
>>>>>> Since there are 2 virtio-net devices, vhost_vdpa_svqs_start [1]
>>>>>> is called twice. For each of them, it calls vhost_svq_start [2]
>>>>>> v->shadow_vqs->len number of times.
>>>>>>
>>>>>
>>>>> Ok I understand this confusion, as the code is not intuitive :). Take
>>>>> into account you can only have svq in vdpa devices, so both
>>>>> vhost_vdpa_svqs_start are acting on the vdpa device.
>>>>>
>>>>> You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
>>>>> vdpa) devices are modelled internally as two devices in QEMU: One for
>>>>> the dataplane vq, and other for the control vq. There are historical
>>>>> reasons for this, but we use it in vdpa to always shadow the CVQ while
>>>>> leaving dataplane passthrough if x-svq=off and the virtio & virtio-net
>>>>> feature set is understood by SVQ.
>>>>>
>>>>> If you break at vhost_vdpa_svqs_start with gdb and go higher in the
>>>>> stack you should reach vhost_net_start, that starts each vhost_net
>>>>> device individually.
>>>>>
>>>>> To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
>>>>> with a different vhost_net device in QEMU, but you don't need to take
>>>>> that into account implementing the packed vq :).
>>>>
>>>> Got it, this makes sense now.
>>>>
>>>>>> Printing the values of dev->vdev->name, v->shadow_vqs->len and
>>>>>> svq->vring.num in vhost_vdpa_svqs_start gives:
>>>>>>
>>>>>> name: virtio-net
>>>>>> len: 2
>>>>>> num: 256
>>>>>> num: 256
>>>>>
>>>>> First QEMU's vhost_net device, the dataplane.
>>>>>
>>>>>> name: virtio-net
>>>>>> len: 1
>>>>>> num: 64
>>>>>>
>>>>>
>>>>> Second QEMU's vhost_net device, the control virtqueue.
>>>>
>>>> Ok, if I understand this correctly, the control vq doesn't
>>>> need separate queues for rx and tx.
>>>>
>>>
>>> That's right. Since CVQ has one reply per command, the driver can just
>>> send ro+rw descriptors to the device. In the case of RX, the device
>>> needs a queue with only-writable descriptors, as neither the device nor
>>> the driver knows how many packets will arrive.
>>
>> Got it, this makes sense now.
>>
>>>>>> I am not sure how to match the above log lines to the
>>>>>> right virtio-net device since the actual value of num
>>>>>> can be less than "max_vq_size" in the output of "vdpa
>>>>>> dev show".
>>>>>>
>>>>>
>>>>> Yes, the device can set a different vq max per vq, and the driver can
>>>>> negotiate a lower vq size per vq too.
>>>>>
>>>>>> I think the first 3 log lines correspond to the virtio
>>>>>> net device that I am using for testing since it has
>>>>>> 2 vqs (rx and tx) while the other virtio-net device
>>>>>> only has one vq.
>>>>>>
>>>>>> When printing out the values of svq->vring.num,
>>>>>> used_elem.len and used_elem.id in vhost_svq_get_buf,
>>>>>> there are two sets of output. One set corresponds to
>>>>>> svq->vring.num = 64 and the other corresponds to
>>>>>> svq->vring.num = 256.
>>>>>>
>>>>>> For svq->vring.num = 64, only the following line
>>>>>> is printed repeatedly:
>>>>>>
>>>>>> size: 64, len: 1, i: 0
>>>>>>
>>>>>
>>>>> This is with packed=off, right? If this is testing with packed, you
>>>>> need to change the code to accommodate it. Let me know if you need
>>>>> more help with this.
>>>>
>>>> Yes, this is for packed=off. For the time being, I am trying to
>>>> get L2 to communicate with L0 using split virtqueues and x-svq=true.
>>>>
>>>
>>> Got it.
>>>
>>>>> In the CVQ the only reply is a byte, indicating if the command was
>>>>> applied or not. This seems ok to me.
>>>>
>>>> Understood.
>>>>
>>>>> The queue can also recycle ids as long as they are not available, so
>>>>> that part seems correct to me too.
>>>>
>>>> I am a little confused here. The ids are recycled when they are
>>>> available (i.e., the id is not already in use), right?
>>>>
>>>
>>> In virtio, "available" means that the device can use them, and "used"
>>> means that the device has returned them to the driver. I think you're
>>> aligned, it's just that it is better to follow the virtio nomenclature :).
>>
>> Got it.
>>
>>>>>> For svq->vring.num = 256, the following line is
>>>>>> printed 20 times,
>>>>>>
>>>>>> size: 256, len: 0, i: 0
>>>>>>
>>>>>> followed by:
>>>>>>
>>>>>> size: 256, len: 0, i: 1
>>>>>> size: 256, len: 0, i: 1
>>>>>>
>>>>>
>>>>> This makes sense for the tx queue too. Can you print the VirtQueue index?
>>>>
>>>> For svq->vring.num = 64, the vq index is 2. So the following line
>>>> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
>>>> is printed repeatedly:
>>>>
>>>> size: 64, len: 1, i: 0, vq idx: 2
>>>>
>>>> For svq->vring.num = 256, the following line is repeated several
>>>> times:
>>>>
>>>> size: 256, len: 0, i: 0, vq idx: 1
>>>>
>>>> This is followed by:
>>>>
>>>> size: 256, len: 0, i: 1, vq idx: 1
>>>>
>>>> In both cases, queue_index is 1. To get the value of queue_index,
>>>> I used "virtio_get_queue_index(svq->vq)" [2].
>>>>
>>>> Since the queue_index is 1, I guess this means this is the tx queue
>>>> and the value of len (0) is correct. However, nothing with
>>>> queue_index % 2 == 0 is printed by vhost_svq_get_buf() which means
>>>> the device is not sending anything to the guest. Is this correct?
>>>>
>>>
>>> Yes, that's totally correct.
>>>
>>> You can set -netdev tap,...,vhost=off in L0 qemu and trace (or debug
>>> with gdb) it to check what it is receiving. You should see calls to
>>> hw/net/virtio-net.c:virtio_net_flush_tx. The corresponding function to
>>> receive is virtio_net_receive_rcu; I recommend you trace it too, just
>>> in case you see any strange call to it.
>>>
>>
>> I added "vhost=off" to -netdev tap in L0's qemu command. I followed all
>> the steps in the blog [1] up till the point where L2 is booted. Before
>> booting L2, I had no issues pinging L0 from L1.
>>
>> For each ping, the following trace lines were printed by QEMU:
>>
>> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 0 out_num 1
>> virtqueue_pop vq 0x5d04109b0ce8 elem 0x5d041024f560 in_num 0 out_num 1
>> virtqueue_fill vq 0x5d04109b0ce8 elem 0x5d041024f560 len 0 idx 0
>> virtqueue_flush vq 0x5d04109b0ce8 count 1
>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0ce8
>> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 1 out_num 0
>> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d041024f560 in_num 1 out_num 0
>> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d041024f560 len 110 idx 0
>> virtqueue_flush vq 0x5d04109b0c50 count 1
>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
>>
>> The first 5 lines look like they were printed when an echo request was
>> sent to L0 and the next 5 lines were printed when an echo reply was
>> received.
>>
>> After booting L2, I set up the tap device's IP address in L0 and the
>> vDPA port's IP address in L2.
>>
>> When trying to ping L0 from L2, I only see the following lines being
>> printed:
>>
>> virtqueue_alloc_element elem 0x5d041099ffd0 size 56 in_num 0 out_num 1
>> virtqueue_pop vq 0x5d0410d87168 elem 0x5d041099ffd0 in_num 0 out_num 1
>> virtqueue_fill vq 0x5d0410d87168 elem 0x5d041099ffd0 len 0 idx 0
>> virtqueue_flush vq 0x5d0410d87168 count 1
>> virtio_notify vdev 0x5d0410d79a10 vq 0x5d0410d87168
>>
>> There's no reception. I used wireshark to inspect the packets that are
>> being sent and received through the tap device in L0.
>>
>> When pinging L0 from L2, I see one of the following two outcomes:
>>
>> Outcome 1:
>> ----------
>> L2 broadcasts ARP packets and L0 replies to L2.
>>
>> Source             Destination        Protocol    Length    Info
>> 52:54:00:12:34:57  Broadcast          ARP         42        Who has 111.1.1.1? Tell 111.1.1.2
>> d2:6d:b9:61:e1:9a  52:54:00:12:34:57  ARP         42        111.1.1.1 is at d2:6d:b9:61:e1:9a
>>
>> Outcome 2 (less frequent):
>> --------------------------
>> L2 sends an ICMP echo request packet to L0 and L0 sends a reply,
>> but the reply is not received by L2.
>>
>> Source             Destination        Protocol    Length    Info
>> 111.1.1.2          111.1.1.1          ICMP        98        Echo (ping) request  id=0x0006, seq=1/256, ttl=64
>> 111.1.1.1          111.1.1.2          ICMP        98        Echo (ping) reply    id=0x0006, seq=1/256, ttl=64
>>
>> When pinging L2 from L0 I get the following output in
>> wireshark:
>>
>> Source             Destination        Protocol    Length    Info
>> 111.1.1.1          111.1.1.2          ICMP        100       Echo (ping) request  id=0x002c, seq=2/512, ttl=64 (no response found!)
>>
>> I do see a lot of traced lines being printed (by the QEMU instance that
>> was started in L0) with in_num > 0, for example:
>>
>> virtqueue_alloc_element elem 0x5d040fdbad30 size 56 in_num 1 out_num 0
>> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d040fdbad30 in_num 1 out_num 0
>> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d040fdbad30 len 76 idx 0
>> virtqueue_flush vq 0x5d04109b0c50 count 1
>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
>>
> 
> So L0 is able to receive data from L2. We're halfway there, Good! :).
> 
>> It looks like L1 is receiving data from L0 but this is not related to
>> the pings that are sent from L2. I haven't figured out what data is
>> actually being transferred in this case. It's not necessary for all of
>> the data that L1 receives from L0 to be passed to L2, is it?
>>
> 
> It should be noise, yes.
> 

Understood.

>> I also noticed that there are now some lines with svq->vring.num = 256
>> where len > 0. These lines were printed by the QEMU instance running
>> in L1, so this corresponds to data that was received by L2.
>>
>> svq->vring.num  used_elem.len  used_elem.id  svq->vq->queue_index
>> size: 256       len: 82        i: 0          vq idx: 0
>> size: 256       len: 82        i: 1          vq idx: 0
>> size: 256       len: 82        i: 2          vq idx: 0
>> size: 256       len: 54        i: 3          vq idx: 0
>>
>> I still haven't figured out what data was received by L2 but I am
>> slightly confused as to why this data was received by L2 but not
>> the ICMP echo replies sent by L0.
>>
> 
> We're on a good track, let's trace it deeper. I guess these are
> printed from vhost_svq_flush, right? Do virtqueue_fill,
> virtqueue_flush, and event_notifier_set(&svq->svq_call) run properly,
> or do you see anything strange with gdb / tracing?
> 

Apologies for the delay in replying. It took me a while to figure
this out, but I have now understood why this doesn't work. L1 is
unable to receive messages from L0 because they get filtered out
by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
the MAC addresses.

In L0, I have:

$ ip a show tap0
6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
     link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
     inet 111.1.1.1/24 scope global tap0
        valid_lft forever preferred_lft forever
     inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
        valid_lft forever preferred_lft forever

In L1:

# ip a show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
     link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
     altname enp0s2
     inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
        valid_lft 83455sec preferred_lft 83455sec
     inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
        valid_lft 86064sec preferred_lft 14064sec
     inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
        valid_lft forever preferred_lft forever

I'll call this L1-eth0.

In L2:
# ip a show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
     link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
     altname enp0s7
     inet 111.1.1.2/24 scope global eth0
        valid_lft forever preferred_lft forever

I'll call this L2-eth0.

Apart from eth0, lo is the only other device in both L1 and L2.

A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
as its destination address. When booting L2 with x-svq=false, the
value of n->mac in VirtIONet is also L2-eth0's MAC address. So, L1
accepts the frames and passes them on to L2, and pinging works [2].

However, when booting L2 with x-svq=true, n->mac is set to L1-eth0's
MAC address (LSB = 56) in virtio_net_handle_mac() [3]. n->mac_table.macs
also does not seem to have L2-eth0's MAC address. Due to this,
receive_filter() filters out all the frames [4] that were meant for
L2-eth0.
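
To make the filtering concrete, here is a minimal sketch in Python of the
unicast check as I understand it. It is not the QEMU code: NicFilterState,
accepts_unicast and parse_mac are names made up for illustration, and the
broadcast, multicast and VLAN paths of receive_filter() are left out.

```python
from dataclasses import dataclass, field


def parse_mac(text: str) -> bytes:
    """Turn 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


@dataclass
class NicFilterState:
    # Hypothetical stand-in for the few VirtIONet fields discussed above.
    mac: bytes                                      # plays the role of n->mac
    mac_table: list = field(default_factory=list)   # plays the role of n->mac_table.macs
    promisc: bool = False


def accepts_unicast(state: NicFilterState, dest: bytes) -> bool:
    """Simplified unicast check: accept only frames addressed to this NIC."""
    if state.promisc:
        return True
    if dest == state.mac:
        return True
    return dest in state.mac_table


L1_ETH0 = parse_mac("52:54:00:12:34:56")
L2_ETH0 = parse_mac("52:54:00:12:34:57")

# x-svq=false: n->mac holds L2-eth0's address, so the frame is accepted.
without_svq = NicFilterState(mac=L2_ETH0)
# x-svq=true: n->mac holds L1-eth0's address and the table has no entry for L2.
with_svq = NicFilterState(mac=L1_ETH0)

print(accepts_unicast(without_svq, L2_ETH0))  # True
print(accepts_unicast(with_svq, L2_ETH0))     # False
```

In this model the frame destined for L2-eth0 only gets through if n->mac
matches it, if the address is in the table, or if the device is in
promiscuous mode.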

With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
[3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
virtio_net_handle_mac() doesn't seem to be getting called. I haven't
understood how the MAC address is set in VirtIONet when x-svq=false.
Understanding this might help see why n->mac has different values
when x-svq is false vs when it is true.

Thanks,
Sahil

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1944
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1775
[3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1108
[4] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1770-1786

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 3 months, 2 weeks ago

On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
> > On Fri, Jan 3, 2025 at 2:06 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On 12/20/24 12:28 PM, Eugenio Perez Martin wrote:
> >>> On Thu, Dec 19, 2024 at 8:37 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 12/17/24 1:20 PM, Eugenio Perez Martin wrote:
> >>>>> On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
> >>>>>>> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>>>>> [...]
> >>>>>>>>>> I have been following the "Hands on vDPA: what do you do
> >>>>>>>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
> >>>>>>>>>> test my changes. To boot the L1 VM, I ran:
> >>>>>>>>>>
> >>>>>>>>>> sudo ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>> -enable-kvm \
> >>>>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>>>>>>>> -net nic,model=virtio \
> >>>>>>>>>> -net user,hostfwd=tcp::2222-:22 \
> >>>>>>>>>> -device intel-iommu,snoop-control=on \
> >>>>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>>>>>>>> -nographic \
> >>>>>>>>>> -m 8G \
> >>>>>>>>>> -smp 4 \
> >>>>>>>>>> -M q35 \
> >>>>>>>>>> -cpu host 2>&1 | tee vm.log
> >>>>>>>>>>
> >>>>>>>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
> >>>>>>>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
> >>>>>>>>>> throws "vdpa svq does not work with features" [2] when
> >>>>>>>>>> trying to boot L2.
> >>>>>>>>>>
> >>>>>>>>>> The enums added in commit #2 in this series are new and
> >>>>>>>>>> weren't in the earlier versions of the series. Without
> >>>>>>>>>> this change, x-svq=true throws "SVQ invalid device feature
> >>>>>>>>>> flags" [3] and x-svq is consequently disabled.
> >>>>>>>>>>
> >>>>>>>>>> The first issue is related to running traffic in L2
> >>>>>>>>>> with vhost-vdpa.
> >>>>>>>>>>
> >>>>>>>>>> In L0:
> >>>>>>>>>>
> >>>>>>>>>> $ ip addr add 111.1.1.1/24 dev tap0
> >>>>>>>>>> $ ip link set tap0 up
> >>>>>>>>>> $ ip addr show tap0
> >>>>>>>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>>>>>>>>>          link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>>>>>>>>>          inet 111.1.1.1/24 scope global tap0
> >>>>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>>>          inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>>>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>>>
> >>>>>>>>>> I am able to run traffic in L2 when booting without
> >>>>>>>>>> x-svq.
> >>>>>>>>>>
> >>>>>>>>>> In L1:
> >>>>>>>>>>
> >>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>> -nographic \
> >>>>>>>>>> -m 4G \
> >>>>>>>>>> -enable-kvm \
> >>>>>>>>>> -M q35 \
> >>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
> >>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>>>>>> -smp 4 \
> >>>>>>>>>> -cpu host \
> >>>>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>>>
> >>>>>>>>>> In L2:
> >>>>>>>>>>
> >>>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>>>>>>>> # ip addr show eth0
> >>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>>>>>          link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>>>>>          altname enp0s7
> >>>>>>>>>>          inet 111.1.1.2/24 scope global eth0
> >>>>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>>>          inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>>>
> >>>>>>>>>> # ip route
> >>>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>>>>>>>
> >>>>>>>>>> # ping 111.1.1.1 -w3
> >>>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
> >>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
> >>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
> >>>>>>>>>>
> >>>>>>>>>> --- 111.1.1.1 ping statistics ---
> >>>>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
> >>>>>>>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> But if I boot L2 with x-svq=true as shown below, I am unable
> >>>>>>>>>> to ping the host machine.
> >>>>>>>>>>
> >>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>> -nographic \
> >>>>>>>>>> -m 4G \
> >>>>>>>>>> -enable-kvm \
> >>>>>>>>>> -M q35 \
> >>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>>>>>> -smp 4 \
> >>>>>>>>>> -cpu host \
> >>>>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>>>
> >>>>>>>>>> In L2:
> >>>>>>>>>>
> >>>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>>>>>>>> # ip addr show eth0
> >>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>>>>>          link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>>>>>          altname enp0s7
> >>>>>>>>>>          inet 111.1.1.2/24 scope global eth0
> >>>>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>>>          inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>>>
> >>>>>>>>>> # ip route
> >>>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>>>>>>>
> >>>>>>>>>> # ping 111.1.1.1 -w10
> >>>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>>>>>>>      From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
> >>>>>>>>>> ping: sendmsg: No route to host
> >>>>>>>>>>      From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
> >>>>>>>>>>      From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
> >>>>>>>>>>
> >>>>>>>>>> --- 111.1.1.1 ping statistics ---
> >>>>>>>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
> >>>>>>>>>> pipe 3
> >>>>>>>>>>
> >>>>>>>>>> The other issue is related to booting L2 with "x-svq=true"
> >>>>>>>>>> and "packed=on".
> >>>>>>>>>>
> >>>>>>>>>> In L1:
> >>>>>>>>>>
> >>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>> -nographic \
> >>>>>>>>>> -m 4G \
> >>>>>>>>>> -enable-kvm \
> >>>>>>>>>> -M q35 \
> >>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
> >>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> >>>>>>>>>> -smp 4 \
> >>>>>>>>>> -cpu host \
> >>>>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>>>
> >>>>>>>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
> >>>>>>>>>> a head!" [4].
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> So this series implements the descriptor forwarding from the guest to
> >>>>>>>>> the device in packed vq. We also need to forward the descriptors from
> >>>>>>>>> the device to the guest. The device writes them in the SVQ ring.
> >>>>>>>>>
> >>>>>>>>> The functions responsible for that in QEMU are
> >>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
> >>>>>>>>> the device when used descriptors are written to the SVQ, which calls
> >>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
> >>>>>>>>> modifications similar to vhost_svq_add: Make them conditional if we're
> >>>>>>>>> in split or packed vq, and "copy" the code from Linux's
> >>>>>>>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
> >>>>>>>>>
> >>>>>>>>> After these modifications you should be able to ping and forward
> >>>>>>>>> traffic. As always, it is totally ok if it needs more than one
> >>>>>>>>> iteration, and feel free to ask any question you have :).
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I misunderstood this part. While working on extending
> >>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
> >>>>>>>> for packed vqs, I realized that this function and
> >>>>>>>> vhost_svq_flush() already support split vqs. However, I am
> >>>>>>>> unable to ping L0 when booting L2 with "x-svq=true" and
> >>>>>>>> "packed=off" or when the "packed" option is not specified
> >>>>>>>> in QEMU's command line.
> >>>>>>>>
> >>>>>>>> I tried debugging these functions for split vqs after running
> >>>>>>>> the following QEMU commands while following the blog [2].
> >>>>>>>>
> >>>>>>>> Booting L1:
> >>>>>>>>
> >>>>>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
> >>>>>>>> -enable-kvm \
> >>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>>>>>> -net nic,model=virtio \
> >>>>>>>> -net user,hostfwd=tcp::2222-:22 \
> >>>>>>>> -device intel-iommu,snoop-control=on \
> >>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>>>>>> -nographic \
> >>>>>>>> -m 8G \
> >>>>>>>> -smp 4 \
> >>>>>>>> -M q35 \
> >>>>>>>> -cpu host 2>&1 | tee vm.log
> >>>>>>>>
> >>>>>>>> Booting L2:
> >>>>>>>>
> >>>>>>>> # ./qemu/build/qemu-system-x86_64 \
> >>>>>>>> -nographic \
> >>>>>>>> -m 4G \
> >>>>>>>> -enable-kvm \
> >>>>>>>> -M q35 \
> >>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>>>> -smp 4 \
> >>>>>>>> -cpu host \
> >>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>
> >>>>>>>> I printed out the contents of VirtQueueElement returned
> >>>>>>>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
> >>>>>>>> I noticed that "len" which is set by "vhost_svq_get_buf"
> >>>>>>>> is always set to 0 while VirtQueueElement.len is non-zero.
> >>>>>>>> I haven't understood the difference between these two "len"s.
> >>>>>>>>
> >>>>>>>
> >>>>>>> VirtQueueElement.len is the length of the buffer, while the len of
> >>>>>>> vhost_svq_get_buf is the bytes written by the device. In the case of
> >>>>>>> the tx queue, VirtQueueElement.len is the length of the tx packet, and
> >>>>>>> the vhost_svq_get_buf len is always 0 as the device does not write. In
> >>>>>>> the case of rx, VirtQueueElement.len is the available length for a rx
> >>>>>>> frame, and the vhost_svq_get_buf len is the actual length written by
> >>>>>>> the device.
> >>>>>>>
> >>>>>>> To be 100% accurate a rx packet can span over multiple buffers, but
> >>>>>>> SVQ does not need special code to handle this.
> >>>>>>>
> >>>>>>> So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
> >>>>>>> 0), and 0 for tx queue (svq->vq->index % 2 == 1).
> >>>>>>>
> >>>>>>> Take into account that vhost_svq_get_buf only handles split vq at the
> >>>>>>> moment! It should be renamed or split into vhost_svq_get_buf_split.
> >>>>>>
> >>>>>> In L1, there are 2 virtio network devices.
> >>>>>>
> >>>>>> # lspci -nn | grep -i net
> >>>>>> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
> >>>>>> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
> >>>>>>
> >>>>>> I am using the second one (1af4:1041) for testing my changes and have
> >>>>>> bound this device to the vp_vdpa driver.
> >>>>>>
> >>>>>> # vdpa dev show -jp
> >>>>>> {
> >>>>>>         "dev": {
> >>>>>>             "vdpa0": {
> >>>>>>                 "type": "network",
> >>>>>>                 "mgmtdev": "pci/0000:00:04.0",
> >>>>>>                 "vendor_id": 6900,
> >>>>>>                 "max_vqs": 3,
> >>>>>
> >>>>> How is max_vqs=3? For this to happen L0 QEMU should have
> >>>>> virtio-net-pci,...,queues=3 cmdline argument.
> >>>
> >>> Ouch! I totally misread it :(. Everything is correct, max_vqs should
> >>> be 3. I read it as the virtio_net queues, which means queue *pairs*,
> >>> as it includes rx and tx queue.
> >>
> >> Understood :)
> >>
> >>>>
> >>>> I am not sure why max_vqs is 3. I haven't set the value of queues to 3
> >>>> in the cmdline argument. Is max_vqs expected to have a default value
> >>>> other than 3?
> >>>>
> >>>> In the blog [1] as well, max_vqs is 3 even though there's no queues=3
> >>>> argument.
> >>>>
> >>>>> It's clear the guest is not using them, we can add mq=off
> >>>>> to simplify the scenario.
> >>>>
> >>>> The value of max_vqs is still 3 after adding mq=off. The whole
> >>>> command that I run to boot L0 is:
> >>>>
> >>>> $ sudo ./qemu/build/qemu-system-x86_64 \
> >>>> -enable-kvm \
> >>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>> -net nic,model=virtio \
> >>>> -net user,hostfwd=tcp::2222-:22 \
> >>>> -device intel-iommu,snoop-control=on \
> >>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>> -nographic \
> >>>> -m 8G \
> >>>> -smp 4 \
> >>>> -M q35 \
> >>>> -cpu host 2>&1 | tee vm.log
> >>>>
> >>>> Could it be that 2 of the 3 vqs are used for the dataplane and
> >>>> the third vq is the control vq?
> >>>>
> >>>>>>                 "max_vq_size": 256
> >>>>>>             }
> >>>>>>         }
> >>>>>> }
> >>>>>>
> >>>>>> The max number of vqs is 3 with the max size being 256.
> >>>>>>
> >>>>>> Since there are 2 virtio net devices, vhost_vdpa_svqs_start [1]
> >>>>>> is called twice. For each of them, it calls vhost_svq_start [2]
> >>>>>> v->shadow_vqs->len number of times.
> >>>>>>
> >>>>>
> >>>>> Ok I understand this confusion, as the code is not intuitive :). Take
> >>>>> into account you can only have svq in vdpa devices, so both
> >>>>> vhost_vdpa_svqs_start are acting on the vdpa device.
> >>>>>
> >>>>> You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
> >>>>> vdpa) devices are modelled internally as two devices in QEMU: One for
> >>>>> the dataplane vq, and other for the control vq. There are historical
> >>>>> reasons for this, but we use it in vdpa to always shadow the CVQ while
> >>>>> leaving dataplane passthrough if x-svq=off and the virtio & virtio-net
> >>>>> feature set is understood by SVQ.
> >>>>>
> >>>>> If you break at vhost_vdpa_svqs_start with gdb and go higher in the
> >>>>> stack you should reach vhost_net_start, that starts each vhost_net
> >>>>> device individually.
> >>>>>
> >>>>> To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
> >>>>> with a different vhost_net device in QEMU, but you don't need to take
> >>>>> that into account implementing the packed vq :).
> >>>>
> >>>> Got it, this makes sense now.
> >>>>
> >>>>>> Printing the values of dev->vdev->name, v->shadow_vqs->len and
> >>>>>> svq->vring.num in vhost_vdpa_svqs_start gives:
> >>>>>>
> >>>>>> name: virtio-net
> >>>>>> len: 2
> >>>>>> num: 256
> >>>>>> num: 256
> >>>>>
> >>>>> First QEMU's vhost_net device, the dataplane.
> >>>>>
> >>>>>> name: virtio-net
> >>>>>> len: 1
> >>>>>> num: 64
> >>>>>>
> >>>>>
> >>>>> Second QEMU's vhost_net device, the control virtqueue.
> >>>>
> >>>> Ok, if I understand this correctly, the control vq doesn't
> >>>> need separate queues for rx and tx.
> >>>>
> >>>
> >>> That's right. Since CVQ has one reply per command, the driver can just
> >>> send ro+rw descriptors to the device. In the case of RX, the device
> >>> needs a queue with only-writable descriptors, as neither the device nor
> >>> the driver knows how many packets will arrive.
> >>
> >> Got it, this makes sense now.
> >>
> >>>>>> I am not sure how to match the above log lines to the
> >>>>>> right virtio-net device since the actual value of num
> >>>>>> can be less than "max_vq_size" in the output of "vdpa
> >>>>>> dev show".
> >>>>>>
> >>>>>
> >>>>> Yes, the device can set a different vq max per vq, and the driver can
> >>>>> negotiate a lower vq size per vq too.
> >>>>>
> >>>>>> I think the first 3 log lines correspond to the virtio
> >>>>>> net device that I am using for testing since it has
> >>>>>> 2 vqs (rx and tx) while the other virtio-net device
> >>>>>> only has one vq.
> >>>>>>
> >>>>>> When printing out the values of svq->vring.num,
> >>>>>> used_elem.len and used_elem.id in vhost_svq_get_buf,
> >>>>>> there are two sets of output. One set corresponds to
> >>>>>> svq->vring.num = 64 and the other corresponds to
> >>>>>> svq->vring.num = 256.
> >>>>>>
> >>>>>> For svq->vring.num = 64, only the following line
> >>>>>> is printed repeatedly:
> >>>>>>
> >>>>>> size: 64, len: 1, i: 0
> >>>>>>
> >>>>>
> >>>>> This is with packed=off, right? If this is testing with packed, you
> >>>>> need to change the code to accommodate it. Let me know if you need
> >>>>> more help with this.
> >>>>
> >>>> Yes, this is for packed=off. For the time being, I am trying to
> >>>> get L2 to communicate with L0 using split virtqueues and x-svq=true.
> >>>>
> >>>
> >>> Got it.
> >>>
> >>>>> In the CVQ the only reply is a byte, indicating if the command was
> >>>>> applied or not. This seems ok to me.
> >>>>
> >>>> Understood.
> >>>>
> >>>>> The queue can also recycle ids as long as they are not available, so
> >>>>> that part seems correct to me too.
> >>>>
> >>>> I am a little confused here. The ids are recycled when they are
> >>>> available (i.e., the id is not already in use), right?
> >>>>
> >>>
> >>> In virtio, "available" means that the device can use them, and "used"
> >>> means that the device has returned them to the driver. I think we're
> >>> aligned; it is just better to follow the virtio nomenclature :).
> >>
> >> Got it.
> >>
> >>>>>> For svq->vring.num = 256, the following line is
> >>>>>> printed 20 times,
> >>>>>>
> >>>>>> size: 256, len: 0, i: 0
> >>>>>>
> >>>>>> followed by:
> >>>>>>
> >>>>>> size: 256, len: 0, i: 1
> >>>>>> size: 256, len: 0, i: 1
> >>>>>>
> >>>>>
> >>>>> This makes sense for the tx queue too. Can you print the VirtQueue index?
> >>>>
> >>>> For svq->vring.num = 64, the vq index is 2. So the following line
> >>>> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
> >>>> is printed repeatedly:
> >>>>
> >>>> size: 64, len: 1, i: 0, vq idx: 2
> >>>>
> >>>> For svq->vring.num = 256, the following line is repeated several
> >>>> times:
> >>>>
> >>>> size: 256, len: 0, i: 0, vq idx: 1
> >>>>
> >>>> This is followed by:
> >>>>
> >>>> size: 256, len: 0, i: 1, vq idx: 1
> >>>>
> >>>> In both cases, queue_index is 1. To get the value of queue_index,
> >>>> I used "virtio_get_queue_index(svq->vq)" [2].
> >>>>
> >>>> Since the queue_index is 1, I guess this means this is the tx queue
> >>>> and the value of len (0) is correct. However, nothing with
> >>>> queue_index % 2 == 0 is printed by vhost_svq_get_buf() which means
> >>>> the device is not sending anything to the guest. Is this correct?
> >>>>
> >>>
> >>> Yes, that's totally correct.
> >>>
> >>> You can set -netdev tap,...,vhost=off in L0 qemu and trace (or debug
> >>> with gdb) it to check what it is receiving. You should see calls to
> >>> hw/net/virtio-net.c:virtio_net_flush_tx. The corresponding function to
> >>> receive is virtio_net_receive_rcu; I recommend you trace it too, just
> >>> in case you see any strange call to it.
> >>>
> >>
> >> I added "vhost=off" to -netdev tap in L0's qemu command. I followed all
> >> the steps in the blog [1] up till the point where L2 is booted. Before
> >> booting L2, I had no issues pinging L0 from L1.
> >>
> >> For each ping, the following trace lines were printed by QEMU:
> >>
> >> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 0 out_num 1
> >> virtqueue_pop vq 0x5d04109b0ce8 elem 0x5d041024f560 in_num 0 out_num 1
> >> virtqueue_fill vq 0x5d04109b0ce8 elem 0x5d041024f560 len 0 idx 0
> >> virtqueue_flush vq 0x5d04109b0ce8 count 1
> >> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0ce8
> >> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 1 out_num 0
> >> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d041024f560 in_num 1 out_num 0
> >> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d041024f560 len 110 idx 0
> >> virtqueue_flush vq 0x5d04109b0c50 count 1
> >> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
> >>
> >> The first 5 lines look like they were printed when an echo request was
> >> sent to L0 and the next 5 lines were printed when an echo reply was
> >> received.
> >>
> >> After booting L2, I set up the tap device's IP address in L0 and the
> >> vDPA port's IP address in L2.
> >>
> >> When trying to ping L0 from L2, I only see the following lines being
> >> printed:
> >>
> >> virtqueue_alloc_element elem 0x5d041099ffd0 size 56 in_num 0 out_num 1
> >> virtqueue_pop vq 0x5d0410d87168 elem 0x5d041099ffd0 in_num 0 out_num 1
> >> virtqueue_fill vq 0x5d0410d87168 elem 0x5d041099ffd0 len 0 idx 0
> >> virtqueue_flush vq 0x5d0410d87168 count 1
> >> virtio_notify vdev 0x5d0410d79a10 vq 0x5d0410d87168
> >>
> >> There's no reception. I used wireshark to inspect the packets that are
> >> being sent and received through the tap device in L0.
> >>
> >> When pinging L0 from L2, I see one of the following two outcomes:
> >>
> >> Outcome 1:
> >> ----------
> >> L2 broadcasts ARP packets and L0 replies to L2.
> >>
> >> Source             Destination        Protocol    Length    Info
> >> 52:54:00:12:34:57  Broadcast          ARP         42        Who has 111.1.1.1? Tell 111.1.1.2
> >> d2:6d:b9:61:e1:9a  52:54:00:12:34:57  ARP         42        111.1.1.1 is at d2:6d:b9:61:e1:9a
> >>
> >> Outcome 2 (less frequent):
> >> --------------------------
> >> L2 sends an ICMP echo request packet to L0 and L0 sends a reply,
> >> but the reply is not received by L2.
> >>
> >> Source             Destination        Protocol    Length    Info
> >> 111.1.1.2          111.1.1.1          ICMP        98        Echo (ping) request  id=0x0006, seq=1/256, ttl=64
> >> 111.1.1.1          111.1.1.2          ICMP        98        Echo (ping) reply    id=0x0006, seq=1/256, ttl=64
> >>
> >> When pinging L2 from L0 I get the following output in
> >> wireshark:
> >>
> >> Source             Destination        Protocol    Length    Info
> >> 111.1.1.1          111.1.1.2          ICMP        100       Echo (ping) request  id=0x002c, seq=2/512, ttl=64 (no response found!)
> >>
> >> I do see a lot of traced lines being printed (by the QEMU instance that
> >> was started in L0) with in_num > 0, for example:
> >>
> >> virtqueue_alloc_element elem 0x5d040fdbad30 size 56 in_num 1 out_num 0
> >> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d040fdbad30 in_num 1 out_num 0
> >> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d040fdbad30 len 76 idx 0
> >> virtqueue_flush vq 0x5d04109b0c50 count 1
> >> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
> >>
> >
> > So L0 is able to receive data from L2. We're halfway there, Good! :).
> >
> >> It looks like L1 is receiving data from L0 but this is not related to
> >> the pings that are sent from L2. I haven't figured out what data is
> >> actually being transferred in this case. It's not necessary for all of
> >> the data that L1 receives from L0 to be passed to L2, is it?
> >>
> >
> > It should be noise, yes.
> >
>
> Understood.
>
> >>>>>> For svq->vring.num = 256, the following line is
> >>>>>> printed 20 times,
> >>>>>>
> >>>>>> size: 256, len: 0, i: 0
> >>>>>>
> >>>>>> followed by:
> >>>>>>
> >>>>>> size: 256, len: 0, i: 1
> >>>>>> size: 256, len: 0, i: 1
> >>>>>>
> >>>>>
> >>>>> This makes sense for the tx queue too. Can you print the VirtQueue index?
> >>>>
> >>>> For svq->vring.num = 64, the vq index is 2. So the following line
> >>>> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
> >>>> is printed repeatedly:
> >>>>
> >>>> size: 64, len: 1, i: 0, vq idx: 2
> >>>>
> >>>> For svq->vring.num = 256, the following line is repeated several
> >>>> times:
> >>>>
> >>>> size: 256, len: 0, i: 0, vq idx: 1
> >>>>
> >>>> This is followed by:
> >>>>
> >>>> size: 256, len: 0, i: 1, vq idx: 1
> >>>>
> >>>> In both cases, queue_index is 1.
> >>
> >> I also noticed that there are now some lines with svq->vring.num = 256
> >> where len > 0. These lines were printed by the QEMU instance running
> >> in L1, so this corresponds to data that was received by L2.
> >>
> >> svq->vring.num  used_elem.len  used_elem.id  svq->vq->queue_index
> >> size: 256       len: 82        i: 0          vq idx: 0
> >> size: 256       len: 82        i: 1          vq idx: 0
> >> size: 256       len: 82        i: 2          vq idx: 0
> >> size: 256       len: 54        i: 3          vq idx: 0
> >>
> >> I still haven't figured out what data was received by L2 but I am
> >> slightly confused as to why this data was received by L2 but not
> >> the ICMP echo replies sent by L0.
> >>
> >
> > We're on a good track, let's trace it deeper. I guess these are
> > printed from vhost_svq_flush, right? Do virtqueue_fill,
> > virtqueue_flush, and event_notifier_set(&svq->svq_call) run properly,
> > or do you see anything strange with gdb / tracing?
> >
>
> Apologies for the delay in replying. It took me a while to figure
> this out, but I have now understood why this doesn't work. L1 is
> unable to receive messages from L0 because they get filtered out
> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
> the MAC addresses.
>
> In L0, I have:
>
> $ ip a show tap0
> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>      link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>      inet 111.1.1.1/24 scope global tap0
>         valid_lft forever preferred_lft forever
>      inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>         valid_lft forever preferred_lft forever
>
> In L1:
>
> # ip a show eth0
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>      link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
>      altname enp0s2
>      inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
>         valid_lft 83455sec preferred_lft 83455sec
>      inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
>         valid_lft 86064sec preferred_lft 14064sec
>      inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
>         valid_lft forever preferred_lft forever
>
> I'll call this L1-eth0.
>
> In L2:
> # ip a show eth0
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
>      link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>      altname enp0s7
>      inet 111.1.1.2/24 scope global eth0
>         valid_lft forever preferred_lft forever
>
> I'll call this L2-eth0.
>
> Apart from eth0, lo is the only other device in both L1 and L2.
>
> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
> as its destination address. When booting L2 with x-svq=false, the
> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
> the frames and passes them on to L2 and pinging works [2].
>

This behavior is interesting by itself, but L1's kernel network stack
should not receive anything. As I read it, even if it does receive the
frame, it should not forward it to L2, as L2 is in a different subnet.
Are you able to see the frame using tcpdump on L1?

Maybe we can make the scenario clearer by telling which virtio-net
device is which with virtio-net-pci,mac=XX:... ?

> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
> (LSB = 56) in virtio_net_handle_mac() [3].

Can you tell with gdb bt if this function is called from net or the
SVQ subsystem?

> n->mac_table.macs also
> does not seem to have L2-eth0's MAC address. Due to this,
> receive_filter() filters out all the frames [4] that were meant for
> L2-eth0.
>

In the vp_vdpa scenario of the blog, receive_filter should not be
called in the QEMU running in the L1 guest (the nested one). Can you
check with gdb, or by printing a trace, whether it is called?

> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
> understood how the MAC address is set in VirtIONet when x-svq=false.
> Understanding this might help see why n->mac has different values
> when x-svq is false vs when it is true.
>

Ok, this makes sense, as the x-svq=true case is the one that receives
the set MAC message. You should see it in L0's QEMU though, in both the
x-svq=true and x-svq=false scenarios. Can you check it?

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 3 months, 1 week ago

Hi,

On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
>>> On Fri, Jan 3, 2025 at 2:06 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 12/20/24 12:28 PM, Eugenio Perez Martin wrote:
>>>>> On Thu, Dec 19, 2024 at 8:37 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 12/17/24 1:20 PM, Eugenio Perez Martin wrote:
>>>>>>> On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
>>>>>>>>> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>>>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
>>>>>>>>>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>>>>>> [...]
>>>>>>>>>>>> I have been following the "Hands on vDPA: what do you do
>>>>>>>>>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
>>>>>>>>>>>> test my changes. To boot the L1 VM, I ran:
>>>>>>>>>>>>
>>>>>>>>>>>> sudo ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>>>> -enable-kvm \
>>>>>>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>>>>>>>>>> -net nic,model=virtio \
>>>>>>>>>>>> -net user,hostfwd=tcp::2222-:22 \
>>>>>>>>>>>> -device intel-iommu,snoop-control=on \
>>>>>>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
>>>>>>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>>>>>>>>>> -nographic \
>>>>>>>>>>>> -m 8G \
>>>>>>>>>>>> -smp 4 \
>>>>>>>>>>>> -M q35 \
>>>>>>>>>>>> -cpu host 2>&1 | tee vm.log
>>>>>>>>>>>>
>>>>>>>>>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
>>>>>>>>>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
>>>>>>>>>>>> throws "vdpa svq does not work with features" [2] when
>>>>>>>>>>>> trying to boot L2.
>>>>>>>>>>>>
>>>>>>>>>>>> The enums added in commit #2 in this series are new and
>>>>>>>>>>>> weren't in the earlier versions of the series. Without
>>>>>>>>>>>> this change, x-svq=true throws "SVQ invalid device feature
>>>>>>>>>>>> flags" [3] and x-svq is consequently disabled.
>>>>>>>>>>>>
>>>>>>>>>>>> The first issue is related to running traffic in L2
>>>>>>>>>>>> with vhost-vdpa.
>>>>>>>>>>>>
>>>>>>>>>>>> In L0:
>>>>>>>>>>>>
>>>>>>>>>>>> $ ip addr add 111.1.1.1/24 dev tap0
>>>>>>>>>>>> $ ip link set tap0 up
>>>>>>>>>>>> $ ip addr show tap0
>>>>>>>>>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>>>>>>>>>           link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>>>           inet 111.1.1.1/24 scope global tap0
>>>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>>>           inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>>>
>>>>>>>>>>>> I am able to run traffic in L2 when booting without
>>>>>>>>>>>> x-svq.
>>>>>>>>>>>>
>>>>>>>>>>>> In L1:
>>>>>>>>>>>>
>>>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>>>> -nographic \
>>>>>>>>>>>> -m 4G \
>>>>>>>>>>>> -enable-kvm \
>>>>>>>>>>>> -M q35 \
>>>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
>>>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>>>>>>>> -smp 4 \
>>>>>>>>>>>> -cpu host \
>>>>>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>>>>>
>>>>>>>>>>>> In L2:
>>>>>>>>>>>>
>>>>>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>>>>>>>>>> # ip addr show eth0
>>>>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>>>>>>>           link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>>>           altname enp0s7
>>>>>>>>>>>>           inet 111.1.1.2/24 scope global eth0
>>>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>>>           inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>>>
>>>>>>>>>>>> # ip route
>>>>>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>>>>>>>>>
>>>>>>>>>>>> # ping 111.1.1.1 -w3
>>>>>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
>>>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
>>>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
>>>>>>>>>>>>
>>>>>>>>>>>> --- 111.1.1.1 ping statistics ---
>>>>>>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
>>>>>>>>>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> But if I boot L2 with x-svq=true as shown below, I am unable
>>>>>>>>>>>> to ping the host machine.
>>>>>>>>>>>>
>>>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>>>> -nographic \
>>>>>>>>>>>> -m 4G \
>>>>>>>>>>>> -enable-kvm \
>>>>>>>>>>>> -M q35 \
>>>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>>>>>>>> -smp 4 \
>>>>>>>>>>>> -cpu host \
>>>>>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>>>>>
>>>>>>>>>>>> In L2:
>>>>>>>>>>>>
>>>>>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
>>>>>>>>>>>> # ip addr show eth0
>>>>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>>>>>>>           link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>>>           altname enp0s7
>>>>>>>>>>>>           inet 111.1.1.2/24 scope global eth0
>>>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>>>           inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>>>
>>>>>>>>>>>> # ip route
>>>>>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>>>>>>>>>>>
>>>>>>>>>>>> # ping 111.1.1.1 -w10
>>>>>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>>>>>>>>>>>       From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
>>>>>>>>>>>> ping: sendmsg: No route to host
>>>>>>>>>>>>       From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
>>>>>>>>>>>>       From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
>>>>>>>>>>>>
>>>>>>>>>>>> --- 111.1.1.1 ping statistics ---
>>>>>>>>>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
>>>>>>>>>>>> pipe 3
>>>>>>>>>>>>
>>>>>>>>>>>> The other issue is related to booting L2 with "x-svq=true"
>>>>>>>>>>>> and "packed=on".
>>>>>>>>>>>>
>>>>>>>>>>>> In L1:
>>>>>>>>>>>>
>>>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>>>> -nographic \
>>>>>>>>>>>> -m 4G \
>>>>>>>>>>>> -enable-kvm \
>>>>>>>>>>>> -M q35 \
>>>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
>>>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
>>>>>>>>>>>> -smp 4 \
>>>>>>>>>>>> -cpu host \
>>>>>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>>>>>
>>>>>>>>>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
>>>>>>>>>>>> a head!" [4].
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So this series implements the descriptor forwarding from the guest to
>>>>>>>>>>> the device in packed vq. We also need to forward the descriptors from
>>>>>>>>>>> the device to the guest. The device writes them in the SVQ ring.
>>>>>>>>>>>
>>>>>>>>>>> The functions responsible for that in QEMU are
>>>>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
>>>>>>>>>>> the device when used descriptors are written to the SVQ, which calls
>>>>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
>>>>>>>>>>> modifications similar to vhost_svq_add: Make them conditional if we're
>>>>>>>>>>> in split or packed vq, and "copy" the code from Linux's
>>>>>>>>>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
>>>>>>>>>>>
>>>>>>>>>>> After these modifications you should be able to ping and forward
>>>>>>>>>>> traffic. As always, it is totally ok if it needs more than one
>>>>>>>>>>> iteration, and feel free to ask any question you have :).
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I misunderstood this part. While working on extending
>>>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
>>>>>>>>>> for packed vqs, I realized that this function and
>>>>>>>>>> vhost_svq_flush() already support split vqs. However, I am
>>>>>>>>>> unable to ping L0 when booting L2 with "x-svq=true" and
>>>>>>>>>> "packed=off" or when the "packed" option is not specified
>>>>>>>>>> in QEMU's command line.
>>>>>>>>>>
>>>>>>>>>> I tried debugging these functions for split vqs after running
>>>>>>>>>> the following QEMU commands while following the blog [2].
>>>>>>>>>>
>>>>>>>>>> Booting L1:
>>>>>>>>>>
>>>>>>>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>> -enable-kvm \
>>>>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>>>>>>>> -net nic,model=virtio \
>>>>>>>>>> -net user,hostfwd=tcp::2222-:22 \
>>>>>>>>>> -device intel-iommu,snoop-control=on \
>>>>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
>>>>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>>>>>>>> -nographic \
>>>>>>>>>> -m 8G \
>>>>>>>>>> -smp 4 \
>>>>>>>>>> -M q35 \
>>>>>>>>>> -cpu host 2>&1 | tee vm.log
>>>>>>>>>>
>>>>>>>>>> Booting L2:
>>>>>>>>>>
>>>>>>>>>> # ./qemu/build/qemu-system-x86_64 \
>>>>>>>>>> -nographic \
>>>>>>>>>> -m 4G \
>>>>>>>>>> -enable-kvm \
>>>>>>>>>> -M q35 \
>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>>>>>>>>>> -smp 4 \
>>>>>>>>>> -cpu host \
>>>>>>>>>> 2>&1 | tee vm.log
>>>>>>>>>>
>>>>>>>>>> I printed out the contents of VirtQueueElement returned
>>>>>>>>>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
>>>>>>>>>> I noticed that "len" which is set by "vhost_svq_get_buf"
>>>>>>>>>> is always set to 0 while VirtQueueElement.len is non-zero.
>>>>>>>>>> I haven't understood the difference between these two "len"s.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> VirtQueueElement.len is the length of the buffer, while the len of
>>>>>>>>> vhost_svq_get_buf is the bytes written by the device. In the case of
>>>>>>>>> the tx queue, VirtQueueElement.len is the length of the tx packet, and
>>>>>>>>> the vhost_svq_get_buf len is always 0 as the device does not write. In
>>>>>>>>> the case of rx, VirtQueueElement.len is the available length for a rx
>>>>>>>>> frame, and the vhost_svq_get_buf len is the actual length written by
>>>>>>>>> the device.
>>>>>>>>>
>>>>>>>>> To be 100% accurate a rx packet can span over multiple buffers, but
>>>>>>>>> SVQ does not need special code to handle this.
>>>>>>>>>
>>>>>>>>> So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index ==
>>>>>>>>> 0), and 0 for tx queue (svq->vq->index % 2 == 1).
>>>>>>>>>
>>>>>>>>> Take into account that vhost_svq_get_buf only handles split vq at the
>>>>>>>>> moment! It should be renamed or split into vhost_svq_get_buf_split.
>>>>>>>>
>>>>>>>> In L1, there are 2 virtio network devices.
>>>>>>>>
>>>>>>>> # lspci -nn | grep -i net
>>>>>>>> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
>>>>>>>> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
>>>>>>>>
>>>>>>>> I am using the second one (1af4:1041) for testing my changes and have
>>>>>>>> bound this device to the vp_vdpa driver.
>>>>>>>>
>>>>>>>> # vdpa dev show -jp
>>>>>>>> {
>>>>>>>>          "dev": {
>>>>>>>>              "vdpa0": {
>>>>>>>>                  "type": "network",
>>>>>>>>                  "mgmtdev": "pci/0000:00:04.0",
>>>>>>>>                  "vendor_id": 6900,
>>>>>>>>                  "max_vqs": 3,
>>>>>>>
>>>>>>> How is max_vqs=3? For this to happen L0 QEMU should have
>>>>>>> virtio-net-pci,...,queues=3 cmdline argument.
>>>>>
>>>>> Ouch! I totally misread it :(. Everything is correct, max_vqs should
>>>>> be 3. I read it as the virtio_net queues, which means queue *pairs*,
>>>>> as it includes rx and tx queue.
>>>>
>>>> Understood :)
>>>>
>>>>>>
>>>>>> I am not sure why max_vqs is 3. I haven't set the value of queues to 3
>>>>>> in the cmdline argument. Is max_vqs expected to have a default value
>>>>>> other than 3?
>>>>>>
>>>>>> In the blog [1] as well, max_vqs is 3 even though there's no queues=3
>>>>>> argument.
>>>>>>
>>>>>>> It's clear the guest is not using them, we can add mq=off
>>>>>>> to simplify the scenario.
>>>>>>
>>>>>> The value of max_vqs is still 3 after adding mq=off. The whole
>>>>>> command that I run to boot L0 is:
>>>>>>
>>>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
>>>>>> -enable-kvm \
>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>>>>>> -net nic,model=virtio \
>>>>>> -net user,hostfwd=tcp::2222-:22 \
>>>>>> -device intel-iommu,snoop-control=on \
>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
>>>>>> -netdev tap,id=net0,script=no,downscript=no \
>>>>>> -nographic \
>>>>>> -m 8G \
>>>>>> -smp 4 \
>>>>>> -M q35 \
>>>>>> -cpu host 2>&1 | tee vm.log
>>>>>>
>>>>>> Could it be that 2 of the 3 vqs are used for the dataplane and
>>>>>> the third vq is the control vq?
>>>>>>
>>>>>>>>                  "max_vq_size": 256
>>>>>>>>              }
>>>>>>>>          }
>>>>>>>> }
>>>>>>>>
>>>>>>>> The max number of vqs is 3 with the max size being 256.
>>>>>>>>
>>>>>>>> Since there are 2 virtio net devices, vhost_vdpa_svqs_start [1]
>>>>>>>> is called twice. For each of them, it calls vhost_svq_start [2]
>>>>>>>> v->shadow_vqs->len number of times.
>>>>>>>>
>>>>>>>
>>>>>>> Ok I understand this confusion, as the code is not intuitive :). Take
>>>>>>> into account you can only have svq in vdpa devices, so both
>>>>>>> vhost_vdpa_svqs_start are acting on the vdpa device.
>>>>>>>
>>>>>>> You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
>>>>>>> vdpa) devices are modelled internally as two devices in QEMU: One for
>>>>>>> the dataplane vq, and another for the control vq. There are historical
>>>>>>> reasons for this, but we use it in vdpa to always shadow the CVQ while
>>>>>>> leaving dataplane passthrough if x-svq=off and the virtio & virtio-net
>>>>>>> feature set is understood by SVQ.
>>>>>>>
>>>>>>> If you break at vhost_vdpa_svqs_start with gdb and go higher in the
>>>>>>> stack you should reach vhost_net_start, that starts each vhost_net
>>>>>>> device individually.
>>>>>>>
>>>>>>> To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
>>>>>>> with a different vhost_net device in QEMU, but you don't need to take
>>>>>>> that into account implementing the packed vq :).
>>>>>>
>>>>>> Got it, this makes sense now.
>>>>>>
>>>>>>>> Printing the values of dev->vdev->name, v->shadow_vqs->len and
>>>>>>>> svq->vring.num in vhost_vdpa_svqs_start gives:
>>>>>>>>
>>>>>>>> name: virtio-net
>>>>>>>> len: 2
>>>>>>>> num: 256
>>>>>>>> num: 256
>>>>>>>
>>>>>>> First QEMU's vhost_net device, the dataplane.
>>>>>>>
>>>>>>>> name: virtio-net
>>>>>>>> len: 1
>>>>>>>> num: 64
>>>>>>>>
>>>>>>>
>>>>>>> Second QEMU's vhost_net device, the control virtqueue.
>>>>>>
>>>>>> Ok, if I understand this correctly, the control vq doesn't
>>>>>> need separate queues for rx and tx.
>>>>>>
>>>>>
>>>>> That's right. Since CVQ has one reply per command, the driver can just
>>>>> send ro+rw descriptors to the device. In the case of RX, the device
>>>>> needs a queue with only-writable descriptors, as neither the device nor
>>>>> the driver knows how many packets will arrive.
>>>>
>>>> Got it, this makes sense now.
>>>>
>>>>>>>> I am not sure how to match the above log lines to the
>>>>>>>> right virtio-net device since the actual value of num
>>>>>>>> can be less than "max_vq_size" in the output of "vdpa
>>>>>>>> dev show".
>>>>>>>>
>>>>>>>
>>>>>>> Yes, the device can set a different vq max per vq, and the driver can
>>>>>>> negotiate a lower vq size per vq too.
>>>>>>>
>>>>>>>> I think the first 3 log lines correspond to the virtio
>>>>>>>> net device that I am using for testing since it has
>>>>>>>> 2 vqs (rx and tx) while the other virtio-net device
>>>>>>>> only has one vq.
>>>>>>>>
>>>>>>>> When printing out the values of svq->vring.num,
>>>>>>>> used_elem.len and used_elem.id in vhost_svq_get_buf,
>>>>>>>> there are two sets of output. One set corresponds to
>>>>>>>> svq->vring.num = 64 and the other corresponds to
>>>>>>>> svq->vring.num = 256.
>>>>>>>>
>>>>>>>> For svq->vring.num = 64, only the following line
>>>>>>>> is printed repeatedly:
>>>>>>>>
>>>>>>>> size: 64, len: 1, i: 0
>>>>>>>>
>>>>>>>
>>>>>>> This is with packed=off, right? If this is testing with packed, you
>>>>>>> need to change the code to accommodate it. Let me know if you need
>>>>>>> more help with this.
>>>>>>
>>>>>> Yes, this is for packed=off. For the time being, I am trying to
>>>>>> get L2 to communicate with L0 using split virtqueues and x-svq=true.
>>>>>>
>>>>>
>>>>> Got it.
>>>>>
>>>>>>> In the CVQ the only reply is a byte, indicating if the command was
>>>>>>> applied or not. This seems ok to me.
>>>>>>
>>>>>> Understood.
>>>>>>
>>>>>>> The queue can also recycle ids as long as they are not available, so
>>>>>>> that part seems correct to me too.
>>>>>>
>>>>>> I am a little confused here. The ids are recycled when they are
>>>>>> available (i.e., the id is not already in use), right?
>>>>>>
>>>>>
>>>>> In virtio, "available" means that the device can use them, and "used"
>>>>> means that the device has returned them to the driver. I think you're
>>>>> aligned; it's just better to follow the virtio nomenclature :).
>>>>
>>>> Got it.
>>>>
>>>>>>>> For svq->vring.num = 256, the following line is
>>>>>>>> printed 20 times,
>>>>>>>>
>>>>>>>> size: 256, len: 0, i: 0
>>>>>>>>
>>>>>>>> followed by:
>>>>>>>>
>>>>>>>> size: 256, len: 0, i: 1
>>>>>>>> size: 256, len: 0, i: 1
>>>>>>>>
>>>>>>>
>>>>>>> This makes sense for the tx queue too. Can you print the VirtQueue index?
>>>>>>
>>>>>> For svq->vring.num = 64, the vq index is 2. So the following line
>>>>>> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
>>>>>> is printed repeatedly:
>>>>>>
>>>>>> size: 64, len: 1, i: 0, vq idx: 2
>>>>>>
>>>>>> For svq->vring.num = 256, the following line is repeated several
>>>>>> times:
>>>>>>
>>>>>> size: 256, len: 0, i: 0, vq idx: 1
>>>>>>
>>>>>> This is followed by:
>>>>>>
>>>>>> size: 256, len: 0, i: 1, vq idx: 1
>>>>>>
>>>>>> In both cases, queue_index is 1. To get the value of queue_index,
>>>>>> I used "virtio_get_queue_index(svq->vq)" [2].
>>>>>>
>>>>>> Since the queue_index is 1, I guess this means this is the tx queue
>>>>>> and the value of len (0) is correct. However, nothing with
>>>>>> queue_index % 2 == 0 is printed by vhost_svq_get_buf() which means
>>>>>> the device is not sending anything to the guest. Is this correct?
>>>>>>
>>>>>
>>>>> Yes, that's totally correct.
>>>>>
>>>>> You can set -netdev tap,...,vhost=off in L0 qemu and trace (or debug
>>>>> with gdb) it to check what is receiving. You should see calls to
>>>>> hw/net/virtio-net.c:virtio_net_flush_tx. The corresponding function to
>>>>> receive is virtio_net_receive_rcu; I recommend you trace it too, just
>>>>> in case you see any strange call to it.
>>>>>
>>>>
>>>> I added "vhost=off" to -netdev tap in L0's qemu command. I followed all
>>>> the steps in the blog [1] up till the point where L2 is booted. Before
>>>> booting L2, I had no issues pinging L0 from L1.
>>>>
>>>> For each ping, the following trace lines were printed by QEMU:
>>>>
>>>> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 0 out_num 1
>>>> virtqueue_pop vq 0x5d04109b0ce8 elem 0x5d041024f560 in_num 0 out_num 1
>>>> virtqueue_fill vq 0x5d04109b0ce8 elem 0x5d041024f560 len 0 idx 0
>>>> virtqueue_flush vq 0x5d04109b0ce8 count 1
>>>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0ce8
>>>> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 1 out_num 0
>>>> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d041024f560 in_num 1 out_num 0
>>>> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d041024f560 len 110 idx 0
>>>> virtqueue_flush vq 0x5d04109b0c50 count 1
>>>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
>>>>
>>>> The first 5 lines look like they were printed when an echo request was
>>>> sent to L0 and the next 5 lines were printed when an echo reply was
>>>> received.
>>>>
>>>> After booting L2, I set up the tap device's IP address in L0 and the
>>>> vDPA port's IP address in L2.
>>>>
>>>> When trying to ping L0 from L2, I only see the following lines being
>>>> printed:
>>>>
>>>> virtqueue_alloc_element elem 0x5d041099ffd0 size 56 in_num 0 out_num 1
>>>> virtqueue_pop vq 0x5d0410d87168 elem 0x5d041099ffd0 in_num 0 out_num 1
>>>> virtqueue_fill vq 0x5d0410d87168 elem 0x5d041099ffd0 len 0 idx 0
>>>> virtqueue_flush vq 0x5d0410d87168 count 1
>>>> virtio_notify vdev 0x5d0410d79a10 vq 0x5d0410d87168
>>>>
>>>> There's no reception. I used wireshark to inspect the packets that are
>>>> being sent and received through the tap device in L0.
>>>>
>>>> When pinging L0 from L2, I see one of the following two outcomes:
>>>>
>>>> Outcome 1:
>>>> ----------
>>>> L2 broadcasts ARP packets and L0 replies to L2.
>>>>
>>>> Source             Destination        Protocol    Length    Info
>>>> 52:54:00:12:34:57  Broadcast          ARP         42        Who has 111.1.1.1? Tell 111.1.1.2
>>>> d2:6d:b9:61:e1:9a  52:54:00:12:34:57  ARP         42        111.1.1.1 is at d2:6d:b9:61:e1:9a
>>>>
>>>> Outcome 2 (less frequent):
>>>> --------------------------
>>>> L2 sends an ICMP echo request packet to L0 and L0 sends a reply,
>>>> but the reply is not received by L2.
>>>>
>>>> Source             Destination        Protocol    Length    Info
>>>> 111.1.1.2          111.1.1.1          ICMP        98        Echo (ping) request  id=0x0006, seq=1/256, ttl=64
>>>> 111.1.1.1          111.1.1.2          ICMP        98        Echo (ping) reply    id=0x0006, seq=1/256, ttl=64
>>>>
>>>> When pinging L2 from L0 I get the following output in
>>>> wireshark:
>>>>
>>>> Source             Destination        Protocol    Length    Info
>>>> 111.1.1.1          111.1.1.2          ICMP        100       Echo (ping) request  id=0x002c, seq=2/512, ttl=64 (no response found!)
>>>>
>>>> I do see a lot of traced lines being printed (by the QEMU instance that
>>>> was started in L0) with in_num >= 1, for example:
>>>>
>>>> virtqueue_alloc_element elem 0x5d040fdbad30 size 56 in_num 1 out_num 0
>>>> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d040fdbad30 in_num 1 out_num 0
>>>> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d040fdbad30 len 76 idx 0
>>>> virtqueue_flush vq 0x5d04109b0c50 count 1
>>>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
>>>>
>>>
>>> So L0 is able to receive data from L2. We're halfway there. Good! :)
>>>
>>>> It looks like L1 is receiving data from L0 but this is not related to
>>>> the pings that are sent from L2. I haven't figured out what data is
>>>> actually being transferred in this case. It's not necessary for all of
>>>> the data that L1 receives from L0 to be passed to L2, is it?
>>>>
>>>
>>> It should be noise, yes.
>>>
>>
>> Understood.
>>
>>>>>>>> For svq->vring.num = 256, the following line is
>>>>>>>> printed 20 times,
>>>>>>>>
>>>>>>>> size: 256, len: 0, i: 0
>>>>>>>>
>>>>>>>> followed by:
>>>>>>>>
>>>>>>>> size: 256, len: 0, i: 1
>>>>>>>> size: 256, len: 0, i: 1
>>>>>>>>
>>>>>>>
>>>>>>> This makes sense for the tx queue too. Can you print the VirtQueue index?
>>>>>>
>>>>>> For svq->vring.num = 64, the vq index is 2. So the following line
>>>>>> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
>>>>>> is printed repeatedly:
>>>>>>
>>>>>> size: 64, len: 1, i: 0, vq idx: 2
>>>>>>
>>>>>> For svq->vring.num = 256, the following line is repeated several
>>>>>> times:
>>>>>>
>>>>>> size: 256, len: 0, i: 0, vq idx: 1
>>>>>>
>>>>>> This is followed by:
>>>>>>
>>>>>> size: 256, len: 0, i: 1, vq idx: 1
>>>>>>
>>>>>> In both cases, queue_index is 1.
>>>>
>>>> I also noticed that there are now some lines with svq->vring.num = 256
>>>> where len > 0. These lines were printed by the QEMU instance running
>>>> in L1, so this corresponds to data that was received by L2.
>>>>
>>>> svq->vring.num  used_elem.len  used_elem.id  svq->vq->queue_index
>>>> size: 256       len: 82        i: 0          vq idx: 0
>>>> size: 256       len: 82        i: 1          vq idx: 0
>>>> size: 256       len: 82        i: 2          vq idx: 0
>>>> size: 256       len: 54        i: 3          vq idx: 0
>>>>
>>>> I still haven't figured out what data was received by L2 but I am
>>>> slightly confused as to why this data was received by L2 but not
>>>> the ICMP echo replies sent by L0.
>>>>
>>>
>>> We're on a good track, let's trace it deeper. I guess these are
>>> printed from vhost_svq_flush, right? Do virtqueue_fill,
>>> virtqueue_flush, and event_notifier_set(&svq->svq_call) run properly,
>>> or do you see anything strange with gdb / tracing?
>>>
>>
>> Apologies for the delay in replying. It took me a while to figure
>> this out, but I have now understood why this doesn't work. L1 is
>> unable to receive messages from L0 because they get filtered out
>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
>> the MAC addresses.
>>
>> In L0, I have:
>>
>> $ ip a show tap0
>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>       link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>       inet 111.1.1.1/24 scope global tap0
>>          valid_lft forever preferred_lft forever
>>       inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>          valid_lft forever preferred_lft forever
>>
>> In L1:
>>
>> # ip a show eth0
>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>       link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
>>       altname enp0s2
>>       inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
>>          valid_lft 83455sec preferred_lft 83455sec
>>       inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
>>          valid_lft 86064sec preferred_lft 14064sec
>>       inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
>>          valid_lft forever preferred_lft forever
>>
>> I'll call this L1-eth0.
>>
>> In L2:
>> # ip a show eth0
>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
>>       link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>       altname enp0s7
>>       inet 111.1.1.2/24 scope global eth0
>>          valid_lft forever preferred_lft forever
>>
>> I'll call this L2-eth0.
>>
>> Apart from eth0, lo is the only other device in both L1 and L2.
>>
>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
>> as its destination address. When booting L2 with x-svq=false, the
>> value of n->mac in VirtIONet is also L2-eth0's MAC address. So, L1
>> accepts the frames and passes them on to L2 and pinging works [2].
>>
> 
> So this behavior is interesting by itself. But L1's kernel net system
> should not receive anything. As I read it, even if it receives it, it
> should not forward the frame to L2 as it is in a different subnet. Are
> you able to read it using tcpdump on L1?

I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
that were directed at L2 even though L2 was able to receive them.
Similarly, it didn't capture any packets that were sent from L2 to
L0. This is when L2 is launched with x-svq=false.

With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
receive_filter allows L2 to receive packets from L0. I added
the following line just before line 1771 [1] to check this out.

n->mac[5] = 0x57;
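
To make the effect of that one-line hack concrete, here is a minimal Python model of the unicast part of a MAC filter. It is only a sketch and not QEMU's code: `parse_mac` and `accepts_unicast` are hypothetical helpers, and the model assumes a unicast frame is accepted only when its destination equals the primary MAC (the role of n->mac) or an entry in a MAC table (the role of n->mac_table.macs). Promiscuous mode, multicast, broadcast and VLAN handling are deliberately left out.

```python
def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


def accepts_unicast(dest: bytes, primary_mac: bytes, mac_table=()) -> bool:
    """Hypothetical stand-in for the unicast branch of a MAC filter.

    A frame passes only if its destination matches the primary MAC or
    one of the entries in the MAC table.
    """
    return dest == primary_mac or dest in mac_table


L1_ETH0 = parse_mac("52:54:00:12:34:56")
L2_ETH0 = parse_mac("52:54:00:12:34:57")

# x-svq=true case: the primary MAC is L1-eth0's and the table is empty,
# so a frame addressed to L2-eth0 does not match anything.
accepted_before = accepts_unicast(L2_ETH0, L1_ETH0)

# Same idea as the debugging hack: overwrite the last byte with 0x57.
patched = bytearray(L1_ETH0)
patched[5] = 0x57
accepted_after = accepts_unicast(L2_ETH0, bytes(patched))

print(accepted_before, accepted_after)
```

In this model the frame for L2-eth0 is rejected until either the primary MAC or the MAC table contains the address ending in 57, which matches what I observe with x-svq=true.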

> Maybe we can make the scenario clearer by telling which virtio-net
> device is which with virtio-net-pci,mac=XX:... ?
> 
>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0's
>> MAC address (LSB = 56) in virtio_net_handle_mac() [3].
> 
> Can you tell with gdb bt if this function is called from net or the
> SVQ subsystem?

I am struggling to learn how one uses gdb to debug QEMU. I tried running
QEMU in L0 with -s and -S in one terminal. In another terminal, I ran
the following:

$ gdb ./build/qemu-system-x86_64

I then ran the following in gdb's console, but stepping through or
continuing the execution gives me errors:

(gdb) target remote localhost:1234
(gdb) break -source ../hw/net/virtio-net.c -function receive_filter
(gdb) c
Continuing.
Warning:
Cannot insert breakpoint 2.
Cannot access memory at address 0x9058c6

Command aborted.
(gdb) ni
Continuing.
Warning:
Cannot insert breakpoint 2.
Cannot access memory at address 0x9058c6

Command aborted.

I built QEMU using ./configure --enable-debug.

I also tried using the --disable-pie option but this results
in a build error.

[8063/8844] Linking target qemu-keymap
FAILED: qemu-keymap
cc -m64  -o qemu-keymap <...>
/usr/bin/ld: libevent-loop-base.a.p/event-loop-base.c.o: relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: failed to set dynamic section sizes: bad value
collect2: error: ld returned 1 exit status

>> n->mac_table.macs also
>> does not seem to have L2-eth0's MAC address. Due to this,
>> receive_filter() filters out all the frames [4] that were meant for
>> L2-eth0.
>>
> 
> In the vp_vdpa scenario of the blog receive_filter should not be
> called in the qemu running in the L1 guest, the nested one. Can you
> check it with gdb or by printing a trace if it is called?

This is right. receive_filter is not called in L1's QEMU with
x-svq=true.

>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
>> understood how the MAC address is set in VirtIONet when x-svq=false.
>> Understanding this might help see why n->mac has different values
>> when x-svq is false vs when it is true.
>>
> 
> Ok this makes sense, as x-svq=true is the one that receives the set
> mac message. You should see it in L0's QEMU though, both in x-svq=on
> and x-svq=off scenarios. Can you check it?

L0's QEMU seems to be receiving the "set mac" message only when L1
is launched with x-svq=true. With x-svq=off, I don't see any call
to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
in L0.
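
For reference, this is roughly what a "set mac" request on the control virtqueue looks like, sketched in Python. The class and command values come from the virtio specification (VIRTIO_NET_CTRL_MAC = 1, VIRTIO_NET_CTRL_MAC_ADDR_SET = 1, VIRTIO_NET_OK = 0, VIRTIO_NET_ERR = 1). The helper names `build_mac_addr_set` and `parse_ctrl_request` are hypothetical and exist only for illustration. The device-readable part carries the class, the command and the 6-byte MAC; the device then writes a single ack byte into a separate device-writable buffer.

```python
import struct

# Values from the virtio specification (network device, control virtqueue).
VIRTIO_NET_CTRL_MAC = 1
VIRTIO_NET_CTRL_MAC_ADDR_SET = 1
VIRTIO_NET_OK = 0
VIRTIO_NET_ERR = 1


def build_mac_addr_set(mac: bytes) -> bytes:
    """Build the device-readable part of a MAC_ADDR_SET request.

    Layout: 1 byte class, 1 byte command, 6 bytes MAC address.
    The 1-byte ack is not included; the device writes it separately.
    """
    if len(mac) != 6:
        raise ValueError("a MAC address must be exactly 6 bytes")
    return struct.pack("BB", VIRTIO_NET_CTRL_MAC,
                       VIRTIO_NET_CTRL_MAC_ADDR_SET) + mac


def parse_ctrl_request(buf: bytes):
    """Split a control request into (class, command, payload)."""
    ctrl_class, command = struct.unpack_from("BB", buf)
    return ctrl_class, command, buf[2:]


request = build_mac_addr_set(bytes.fromhex("525400123457"))
ctrl_class, command, payload = parse_ctrl_request(request)
is_mac_addr_set = (ctrl_class == VIRTIO_NET_CTRL_MAC
                   and command == VIRTIO_NET_CTRL_MAC_ADDR_SET)
```

A check equivalent to `is_mac_addr_set` is what I am looking for in L0 when I say that I don't see cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET with x-svq=off.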

Thanks,
Sahil

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1771

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 3 months, 1 week ago

On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
> > On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
> >>> On Fri, Jan 3, 2025 at 2:06 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 12/20/24 12:28 PM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Dec 19, 2024 at 8:37 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 12/17/24 1:20 PM, Eugenio Perez Martin wrote:
> >>>>>>> On Tue, Dec 17, 2024 at 6:45 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>>> On 12/16/24 2:09 PM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Sun, Dec 15, 2024 at 6:27 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>>>>> On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>>>>>>> [...]
> >>>>>>>>>>>> I have been following the "Hands on vDPA: what do you do
> >>>>>>>>>>>> when you ain't got the hardware v2 (Part 2)" [1] blog to
> >>>>>>>>>>>> test my changes. To boot the L1 VM, I ran:
> >>>>>>>>>>>>
> >>>>>>>>>>>> sudo ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>>>> -enable-kvm \
> >>>>>>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>>>>>>>>>> -net nic,model=virtio \
> >>>>>>>>>>>> -net user,hostfwd=tcp::2222-:22 \
> >>>>>>>>>>>> -device intel-iommu,snoop-control=on \
> >>>>>>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>>>>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>>>>>>>>>> -nographic \
> >>>>>>>>>>>> -m 8G \
> >>>>>>>>>>>> -smp 4 \
> >>>>>>>>>>>> -M q35 \
> >>>>>>>>>>>> -cpu host 2>&1 | tee vm.log
> >>>>>>>>>>>>
> >>>>>>>>>>>> Without "guest_uso4=off,guest_uso6=off,host_uso=off,
> >>>>>>>>>>>> guest_announce=off" in "-device virtio-net-pci", QEMU
> >>>>>>>>>>>> throws "vdpa svq does not work with features" [2] when
> >>>>>>>>>>>> trying to boot L2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The enums added in commit #2 in this series are new and
> >>>>>>>>>>>> weren't in the earlier versions of the series. Without
> >>>>>>>>>>>> this change, x-svq=true throws "SVQ invalid device feature
> >>>>>>>>>>>> flags" [3] and x-svq is consequently disabled.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The first issue is related to running traffic in L2
> >>>>>>>>>>>> with vhost-vdpa.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In L0:
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ ip addr add 111.1.1.1/24 dev tap0
> >>>>>>>>>>>> $ ip link set tap0 up
> >>>>>>>>>>>> $ ip addr show tap0
> >>>>>>>>>>>> 4: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>>>>>>>>>>>           link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>>>>>>>>>>>           inet 111.1.1.1/24 scope global tap0
> >>>>>>>>>>>>              valid_lft forever preferred_lft forever
> >>>>>>>>>>>>           inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>>>>>>>>>>>              valid_lft forever preferred_lft forever
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am able to run traffic in L2 when booting without
> >>>>>>>>>>>> x-svq.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In L1:
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>>>> -nographic \
> >>>>>>>>>>>> -m 4G \
> >>>>>>>>>>>> -enable-kvm \
> >>>>>>>>>>>> -M q35 \
> >>>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
> >>>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>>>>>>>> -smp 4 \
> >>>>>>>>>>>> -cpu host \
> >>>>>>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>>>>>
> >>>>>>>>>>>> In L2:
> >>>>>>>>>>>>
> >>>>>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>>>>>>>>>> # ip addr show eth0
> >>>>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>>>>>>>           link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>>>>>>>           altname enp0s7
> >>>>>>>>>>>>           inet 111.1.1.2/24 scope global eth0
> >>>>>>>>>>>>              valid_lft forever preferred_lft forever
> >>>>>>>>>>>>           inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>>>>>>>>>              valid_lft forever preferred_lft forever
> >>>>>>>>>>>>
> >>>>>>>>>>>> # ip route
> >>>>>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>>>>>>>>>
> >>>>>>>>>>>> # ping 111.1.1.1 -w3
> >>>>>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=1 ttl=64 time=0.407 ms
> >>>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=2 ttl=64 time=0.671 ms
> >>>>>>>>>>>> 64 bytes from 111.1.1.1: icmp_seq=3 ttl=64 time=0.291 ms
> >>>>>>>>>>>>
> >>>>>>>>>>>> --- 111.1.1.1 ping statistics ---
> >>>>>>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2034ms
> >>>>>>>>>>>> rtt min/avg/max/mdev = 0.291/0.456/0.671/0.159 ms
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> But if I boot L2 with x-svq=true as shown below, I am unable
> >>>>>>>>>>>> to ping the host machine.
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>>>> -nographic \
> >>>>>>>>>>>> -m 4G \
> >>>>>>>>>>>> -enable-kvm \
> >>>>>>>>>>>> -M q35 \
> >>>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>>>>>>>> -smp 4 \
> >>>>>>>>>>>> -cpu host \
> >>>>>>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>>>>>
> >>>>>>>>>>>> In L2:
> >>>>>>>>>>>>
> >>>>>>>>>>>> # ip addr add 111.1.1.2/24 dev eth0
> >>>>>>>>>>>> # ip addr show eth0
> >>>>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>>>>>>>           link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>>>>>>>           altname enp0s7
> >>>>>>>>>>>>           inet 111.1.1.2/24 scope global eth0
> >>>>>>>>>>>>              valid_lft forever preferred_lft forever
> >>>>>>>>>>>>           inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
> >>>>>>>>>>>>              valid_lft forever preferred_lft forever
> >>>>>>>>>>>>
> >>>>>>>>>>>> # ip route
> >>>>>>>>>>>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
> >>>>>>>>>>>>
> >>>>>>>>>>>> # ping 111.1.1.1 -w10
> >>>>>>>>>>>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
> >>>>>>>>>>>>       From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
> >>>>>>>>>>>> ping: sendmsg: No route to host
> >>>>>>>>>>>>       From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
> >>>>>>>>>>>>       From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
> >>>>>>>>>>>>
> >>>>>>>>>>>> --- 111.1.1.1 ping statistics ---
> >>>>>>>>>>>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
> >>>>>>>>>>>> pipe 3
> >>>>>>>>>>>>
> >>>>>>>>>>>> The other issue is related to booting L2 with "x-svq=true"
> >>>>>>>>>>>> and "packed=on".
> >>>>>>>>>>>>
> >>>>>>>>>>>> In L1:
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>>>> -nographic \
> >>>>>>>>>>>> -m 4G \
> >>>>>>>>>>>> -enable-kvm \
> >>>>>>>>>>>> -M q35 \
> >>>>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
> >>>>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> >>>>>>>>>>>> -smp 4 \
> >>>>>>>>>>>> -cpu host \
> >>>>>>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>>>>>
> >>>>>>>>>>>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
> >>>>>>>>>>>> a head!" [4].
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> So this series implements the descriptor forwarding from the guest to
> >>>>>>>>>>> the device in packed vq. We also need to forward the descriptors from
> >>>>>>>>>>> the device to the guest. The device writes them in the SVQ ring.
> >>>>>>>>>>>
> >>>>>>>>>>> The functions responsible for that in QEMU are
> >>>>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which is called by
> >>>>>>>>>>> the device when used descriptors are written to the SVQ, which calls
> >>>>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to do
> >>>>>>>>>>> modifications similar to vhost_svq_add: Make them conditional if we're
> >>>>>>>>>>> in split or packed vq, and "copy" the code from Linux's
> >>>>>>>>>>> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
> >>>>>>>>>>>
> >>>>>>>>>>> After these modifications you should be able to ping and forward
> >>>>>>>>>>> traffic. As always, it is totally ok if it needs more than one
> >>>>>>>>>>> iteration, and feel free to ask any question you have :).
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I misunderstood this part. While working on extending
> >>>>>>>>>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf() [1]
> >>>>>>>>>> for packed vqs, I realized that this function and
> >>>>>>>>>> vhost_svq_flush() already support split vqs. However, I am
> >>>>>>>>>> unable to ping L0 when booting L2 with "x-svq=true" and
> >>>>>>>>>> "packed=off" or when the "packed" option is not specified
> >>>>>>>>>> in QEMU's command line.
> >>>>>>>>>>
> >>>>>>>>>> I tried debugging these functions for split vqs after running
> >>>>>>>>>> the following QEMU commands while following the blog [2].
> >>>>>>>>>>
> >>>>>>>>>> Booting L1:
> >>>>>>>>>>
> >>>>>>>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>> -enable-kvm \
> >>>>>>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>>>>>>>> -net nic,model=virtio \
> >>>>>>>>>> -net user,hostfwd=tcp::2222-:22 \
> >>>>>>>>>> -device intel-iommu,snoop-control=on \
> >>>>>>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>>>>>>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>>>>>>>> -nographic \
> >>>>>>>>>> -m 8G \
> >>>>>>>>>> -smp 4 \
> >>>>>>>>>> -M q35 \
> >>>>>>>>>> -cpu host 2>&1 | tee vm.log
> >>>>>>>>>>
> >>>>>>>>>> Booting L2:
> >>>>>>>>>>
> >>>>>>>>>> # ./qemu/build/qemu-system-x86_64 \
> >>>>>>>>>> -nographic \
> >>>>>>>>>> -m 4G \
> >>>>>>>>>> -enable-kvm \
> >>>>>>>>>> -M q35 \
> >>>>>>>>>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> >>>>>>>>>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> >>>>>>>>>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
> >>>>>>>>>> -smp 4 \
> >>>>>>>>>> -cpu host \
> >>>>>>>>>> 2>&1 | tee vm.log
> >>>>>>>>>>
> >>>>>>>>>> I printed out the contents of VirtQueueElement returned
> >>>>>>>>>> by vhost_svq_get_buf() in vhost_svq_flush() [3].
> >>>>>>>>>> I noticed that "len" which is set by "vhost_svq_get_buf"
> >>>>>>>>>> is always set to 0 while VirtQueueElement.len is non-zero.
> >>>>>>>>>> I haven't understood the difference between these two "len"s.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> VirtQueueElement.len is the length of the buffer, while the len of
> >>>>>>>>> vhost_svq_get_buf is the bytes written by the device. In the case of
> >>>>>>>>> the tx queue, VirtQueueElement.len is the length of the tx packet, and
> >>>>>>>>> the vhost_svq_get_buf len is always 0 as the device does not write. In
> >>>>>>>>> the case of rx, VirtQueueElement.len is the available length for a rx frame,
> >>>>>>>>> and the vhost_svq_get_buf len is the actual length written by the
> >>>>>>>>> device.
> >>>>>>>>>
> >>>>>>>>> To be 100% accurate a rx packet can span over multiple buffers, but
> >>>>>>>>> SVQ does not need special code to handle this.
> >>>>>>>>>
> >>>>>>>>> So vhost_svq_get_buf should return > 0 for rx queue (svq->vq->index %
> >>>>>>>>> 2 == 0), and 0 for tx queue (svq->vq->index % 2 == 1).
> >>>>>>>>>
> >>>>>>>>> Take into account that vhost_svq_get_buf only handles split vq at the
> >>>>>>>>> moment! It should be renamed or split into vhost_svq_get_buf_split.
> >>>>>>>>
> >>>>>>>> In L1, there are 2 virtio network devices.
> >>>>>>>>
> >>>>>>>> # lspci -nn | grep -i net
> >>>>>>>> 00:02.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
> >>>>>>>> 00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
> >>>>>>>>
> >>>>>>>> I am using the second one (1af4:1041) for testing my changes and have
> >>>>>>>> bound this device to the vp_vdpa driver.
> >>>>>>>>
> >>>>>>>> # vdpa dev show -jp
> >>>>>>>> {
> >>>>>>>>          "dev": {
> >>>>>>>>              "vdpa0": {
> >>>>>>>>                  "type": "network",
> >>>>>>>>                  "mgmtdev": "pci/0000:00:04.0",
> >>>>>>>>                  "vendor_id": 6900,
> >>>>>>>>                  "max_vqs": 3,
> >>>>>>>
> >>>>>>> How is max_vqs=3? For this to happen L0 QEMU should have
> >>>>>>> virtio-net-pci,...,queues=3 cmdline argument.
> >>>>>
> >>>>> Ouch! I totally misread it :(. Everything is correct, max_vqs should
> >>>>> be 3. I read it as the virtio_net queues, which means queue *pairs*,
> >>>>> as it includes rx and tx queue.
> >>>>
> >>>> Understood :)
> >>>>
> >>>>>>
> >>>>>> I am not sure why max_vqs is 3. I haven't set the value of queues to 3
> >>>>>> in the cmdline argument. Is max_vqs expected to have a default value
> >>>>>> other than 3?
> >>>>>>
> >>>>>> In the blog [1] as well, max_vqs is 3 even though there's no queues=3
> >>>>>> argument.
> >>>>>>
> >>>>>>> It's clear the guest is not using them, we can add mq=off
> >>>>>>> to simplify the scenario.
> >>>>>>
> >>>>>> The value of max_vqs is still 3 after adding mq=off. The whole
> >>>>>> command that I run to boot L0 is:
> >>>>>>
> >>>>>> $ sudo ./qemu/build/qemu-system-x86_64 \
> >>>>>> -enable-kvm \
> >>>>>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >>>>>> -net nic,model=virtio \
> >>>>>> -net user,hostfwd=tcp::2222-:22 \
> >>>>>> -device intel-iommu,snoop-control=on \
> >>>>>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=on,ctrl_rx=on,packed=off,event_idx=off,bus=pcie.0,addr=0x4 \
> >>>>>> -netdev tap,id=net0,script=no,downscript=no \
> >>>>>> -nographic \
> >>>>>> -m 8G \
> >>>>>> -smp 4 \
> >>>>>> -M q35 \
> >>>>>> -cpu host 2>&1 | tee vm.log
> >>>>>>
> >>>>>> Could it be that 2 of the 3 vqs are used for the dataplane and
> >>>>>> the third vq is the control vq?
> >>>>>>
> >>>>>>>>                  "max_vq_size": 256
> >>>>>>>>              }
> >>>>>>>>          }
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> The max number of vqs is 3 with the max size being 256.
> >>>>>>>>
> >>>>>>>> Since there are 2 virtio net devices, vhost_vdpa_svqs_start [1]
> >>>>>>>> is called twice. For each of them, it calls vhost_svq_start [2]
> >>>>>>>> v->shadow_vqs->len number of times.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Ok I understand this confusion, as the code is not intuitive :). Take
> >>>>>>> into account you can only have svq in vdpa devices, so both
> >>>>>>> vhost_vdpa_svqs_start are acting on the vdpa device.
> >>>>>>>
> >>>>>>> You are seeing two calls to vhost_vdpa_svqs_start because virtio (and
> >>>>>>> vdpa) devices are modelled internally as two devices in QEMU: One for
> >>>>>>> the dataplane vq, and other for the control vq. There are historical
> >>>>>>> reasons for this, but we use it in vdpa to always shadow the CVQ while
> >>>>>>> leaving dataplane passthrough if x-svq=off and the virtio & virtio-net
> >>>>>>> feature set is understood by SVQ.
> >>>>>>>
> >>>>>>> If you break at vhost_vdpa_svqs_start with gdb and go higher in the
> >>>>>>> stack you should reach vhost_net_start, that starts each vhost_net
> >>>>>>> device individually.
> >>>>>>>
> >>>>>>> To be 100% honest, each dataplane *queue pair* (rx+tx) is modelled
> >>>>>>> with a different vhost_net device in QEMU, but you don't need to take
> >>>>>>> that into account implementing the packed vq :).
> >>>>>>
> >>>>>> Got it, this makes sense now.
> >>>>>>
> >>>>>>>> Printing the values of dev->vdev->name, v->shadow_vqs->len and
> >>>>>>>> svq->vring.num in vhost_vdpa_svqs_start gives:
> >>>>>>>>
> >>>>>>>> name: virtio-net
> >>>>>>>> len: 2
> >>>>>>>> num: 256
> >>>>>>>> num: 256
> >>>>>>>
> >>>>>>> First QEMU's vhost_net device, the dataplane.
> >>>>>>>
> >>>>>>>> name: virtio-net
> >>>>>>>> len: 1
> >>>>>>>> num: 64
> >>>>>>>>
> >>>>>>>
> >>>>>>> Second QEMU's vhost_net device, the control virtqueue.
> >>>>>>
> >>>>>> Ok, if I understand this correctly, the control vq doesn't
> >>>>>> need separate queues for rx and tx.
> >>>>>>
> >>>>>
> >>>>> That's right. Since CVQ has one reply per command, the driver can just
> >>>>> send ro+rw descriptors to the device. In the case of RX, the device
> >>>>> needs a queue with only-writable descriptors, as neither the device nor
> >>>>> the driver knows how many packets will arrive.
> >>>>
> >>>> Got it, this makes sense now.
> >>>>
> >>>>>>>> I am not sure how to match the above log lines to the
> >>>>>>>> right virtio-net device since the actual value of num
> >>>>>>>> can be less than "max_vq_size" in the output of "vdpa
> >>>>>>>> dev show".
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yes, the device can set a different vq max per vq, and the driver can
> >>>>>>> negotiate a lower vq size per vq too.
> >>>>>>>
> >>>>>>>> I think the first 3 log lines correspond to the virtio
> >>>>>>>> net device that I am using for testing since it has
> >>>>>>>> 2 vqs (rx and tx) while the other virtio-net device
> >>>>>>>> only has one vq.
> >>>>>>>>
> >>>>>>>> When printing out the values of svq->vring.num,
> >>>>>>>> used_elem.len and used_elem.id in vhost_svq_get_buf,
> >>>>>>>> there are two sets of output. One set corresponds to
> >>>>>>>> svq->vring.num = 64 and the other corresponds to
> >>>>>>>> svq->vring.num = 256.
> >>>>>>>>
> >>>>>>>> For svq->vring.num = 64, only the following line
> >>>>>>>> is printed repeatedly:
> >>>>>>>>
> >>>>>>>> size: 64, len: 1, i: 0
> >>>>>>>>
> >>>>>>>
> >>>>>>> This is with packed=off, right? If this is testing with packed, you
> >>>>>>> need to change the code to accommodate it. Let me know if you need
> >>>>>>> more help with this.
> >>>>>>
> >>>>>> Yes, this is for packed=off. For the time being, I am trying to
> >>>>>> get L2 to communicate with L0 using split virtqueues and x-svq=true.
> >>>>>>
> >>>>>
> >>>>> Got it.
> >>>>>
> >>>>>>> In the CVQ the only reply is a byte, indicating if the command was
> >>>>>>> applied or not. This seems ok to me.
> >>>>>>
> >>>>>> Understood.
> >>>>>>
> >>>>>>> The queue can also recycle ids as long as they are not available, so
> >>>>>>> that part seems correct to me too.
> >>>>>>
> >>>>>> I am a little confused here. The ids are recycled when they are
> >>>>>> available (i.e., the id is not already in use), right?
> >>>>>>
> >>>>>
> >>>>> In virtio, "available" means the device can use them, and "used" means
> >>>>> the device has returned them to the driver. I think we're aligned; it's
> >>>>> just better to follow the virtio nomenclature :).
> >>>>
> >>>> Got it.
> >>>>
> >>>>>>>> For svq->vring.num = 256, the following line is
> >>>>>>>> printed 20 times,
> >>>>>>>>
> >>>>>>>> size: 256, len: 0, i: 0
> >>>>>>>>
> >>>>>>>> followed by:
> >>>>>>>>
> >>>>>>>> size: 256, len: 0, i: 1
> >>>>>>>> size: 256, len: 0, i: 1
> >>>>>>>>
> >>>>>>>
> >>>>>>> This makes sense for the tx queue too. Can you print the VirtQueue index?
> >>>>>>
> >>>>>> For svq->vring.num = 64, the vq index is 2. So the following line
> >>>>>> (svq->vring.num, used_elem.len, used_elem.id, svq->vq->queue_index)
> >>>>>> is printed repeatedly:
> >>>>>>
> >>>>>> size: 64, len: 1, i: 0, vq idx: 2
> >>>>>>
> >>>>>> For svq->vring.num = 256, the following line is repeated several
> >>>>>> times:
> >>>>>>
> >>>>>> size: 256, len: 0, i: 0, vq idx: 1
> >>>>>>
> >>>>>> This is followed by:
> >>>>>>
> >>>>>> size: 256, len: 0, i: 1, vq idx: 1
> >>>>>>
> >>>>>> In both cases, queue_index is 1. To get the value of queue_index,
> >>>>>> I used "virtio_get_queue_index(svq->vq)" [2].
> >>>>>>
> >>>>>> Since the queue_index is 1, I guess this means this is the tx queue
> >>>>>> and the value of len (0) is correct. However, nothing with
> >>>>>> queue_index % 2 == 0 is printed by vhost_svq_get_buf() which means
> >>>>>> the device is not sending anything to the guest. Is this correct?
> >>>>>>
> >>>>>
> >>>>> Yes, that's totally correct.
> >>>>>
> >>>>> You can set -netdev tap,...,vhost=off in L0 qemu and trace (or debug
> >>>>> with gdb) it to check what is receiving. You should see calls to
> >>>>> hw/net/virtio-net.c:virtio_net_flush_tx. The corresponding function to
> >>>>> receive is virtio_net_receive_rcu. I recommend you trace it too, just
> >>>>> in case you see any strange call to it.
> >>>>>
> >>>>
> >>>> I added "vhost=off" to -netdev tap in L0's qemu command. I followed all
> >>>> the steps in the blog [1] up till the point where L2 is booted. Before
> >>>> booting L2, I had no issues pinging L0 from L1.
> >>>>
> >>>> For each ping, the following trace lines were printed by QEMU:
> >>>>
> >>>> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 0 out_num 1
> >>>> virtqueue_pop vq 0x5d04109b0ce8 elem 0x5d041024f560 in_num 0 out_num 1
> >>>> virtqueue_fill vq 0x5d04109b0ce8 elem 0x5d041024f560 len 0 idx 0
> >>>> virtqueue_flush vq 0x5d04109b0ce8 count 1
> >>>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0ce8
> >>>> virtqueue_alloc_element elem 0x5d041024f560 size 56 in_num 1 out_num 0
> >>>> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d041024f560 in_num 1 out_num 0
> >>>> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d041024f560 len 110 idx 0
> >>>> virtqueue_flush vq 0x5d04109b0c50 count 1
> >>>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
> >>>>
> >>>> The first 5 lines look like they were printed when an echo request was
> >>>> sent to L0 and the next 5 lines were printed when an echo reply was
> >>>> received.
> >>>>
> >>>> After booting L2, I set up the tap device's IP address in L0 and the
> >>>> vDPA port's IP address in L2.
> >>>>
> >>>> When trying to ping L0 from L2, I only see the following lines being
> >>>> printed:
> >>>>
> >>>> virtqueue_alloc_element elem 0x5d041099ffd0 size 56 in_num 0 out_num 1
> >>>> virtqueue_pop vq 0x5d0410d87168 elem 0x5d041099ffd0 in_num 0 out_num 1
> >>>> virtqueue_fill vq 0x5d0410d87168 elem 0x5d041099ffd0 len 0 idx 0
> >>>> virtqueue_flush vq 0x5d0410d87168 count 1
> >>>> virtio_notify vdev 0x5d0410d79a10 vq 0x5d0410d87168
> >>>>
> >>>> There's no reception. I used wireshark to inspect the packets that are
> >>>> being sent and received through the tap device in L0.
> >>>>
> >>>> When pinging L0 from L2, I see one of the following two outcomes:
> >>>>
> >>>> Outcome 1:
> >>>> ----------
> >>>> L2 broadcasts ARP packets and L0 replies to L2.
> >>>>
> >>>> Source             Destination        Protocol    Length    Info
> >>>> 52:54:00:12:34:57  Broadcast          ARP         42        Who has 111.1.1.1? Tell 111.1.1.2
> >>>> d2:6d:b9:61:e1:9a  52:54:00:12:34:57  ARP         42        111.1.1.1 is at d2:6d:b9:61:e1:9a
> >>>>
> >>>> Outcome 2 (less frequent):
> >>>> --------------------------
> >>>> L2 sends an ICMP echo request packet to L0 and L0 sends a reply,
> >>>> but the reply is not received by L2.
> >>>>
> >>>> Source             Destination        Protocol    Length    Info
> >>>> 111.1.1.2          111.1.1.1          ICMP        98        Echo (ping) request  id=0x0006, seq=1/256, ttl=64
> >>>> 111.1.1.1          111.1.1.2          ICMP        98        Echo (ping) reply    id=0x0006, seq=1/256, ttl=64
> >>>>
> >>>> When pinging L2 from L0 I get the following output in
> >>>> wireshark:
> >>>>
> >>>> Source             Destination        Protocol    Length    Info
> >>>> 111.1.1.1          111.1.1.2          ICMP        100       Echo (ping) request  id=0x002c, seq=2/512, ttl=64 (no response found!)
> >>>>
> >>>> I do see a lot of traced lines being printed (by the QEMU instance that
> >>>> was started in L0) with in_num > 0, for example:
> >>>>
> >>>> virtqueue_alloc_element elem 0x5d040fdbad30 size 56 in_num 1 out_num 0
> >>>> virtqueue_pop vq 0x5d04109b0c50 elem 0x5d040fdbad30 in_num 1 out_num 0
> >>>> virtqueue_fill vq 0x5d04109b0c50 elem 0x5d040fdbad30 len 76 idx 0
> >>>> virtqueue_flush vq 0x5d04109b0c50 count 1
> >>>> virtio_notify vdev 0x5d04109a8d50 vq 0x5d04109b0c50
> >>>>
> >>>
> >>> So L0 is able to receive data from L2. We're halfway there, Good! :).
> >>>
> >>>> It looks like L1 is receiving data from L0 but this is not related to
> >>>> the pings that are sent from L2. I haven't figured out what data is
> >>>> actually being transferred in this case. It's not necessary for all of
> >>>> the data that L1 receives from L0 to be passed to L2, is it?
> >>>>
> >>>
> >>> It should be noise, yes.
> >>>
> >>
> >> Understood.
> >>
> >>>> I also noticed that there are now some lines with svq->vring.num = 256
> >>>> where len > 0. These lines were printed by the QEMU instance running
> >>>> in L1, so this corresponds to data that was received by L2.
> >>>>
> >>>> svq->vring.num  used_elem.len  used_elem.id  svq->vq->queue_index
> >>>> size: 256       len: 82        i: 0          vq idx: 0
> >>>> size: 256       len: 82        i: 1          vq idx: 0
> >>>> size: 256       len: 82        i: 2          vq idx: 0
> >>>> size: 256       len: 54        i: 3          vq idx: 0
> >>>>
> >>>> I still haven't figured out what data was received by L2 but I am
> >>>> slightly confused as to why this data was received by L2 but not
> >>>> the ICMP echo replies sent by L0.
> >>>>
> >>>
> >>> We're on a good track, let's trace it deeper. I guess these are
> >>> printed from vhost_svq_flush, right? Do virtqueue_fill,
> >>> virtqueue_flush, and event_notifier_set(&svq->svq_call) run properly,
> >>> or do you see anything strange with gdb / tracing?
> >>>
> >>
> >> Apologies for the delay in replying. It took me a while to figure
> >> this out, but I have now understood why this doesn't work. L1 is
> >> unable to receive messages from L0 because they get filtered out
> >> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
> >> the MAC addresses.
> >>
> >> In L0, I have:
> >>
> >> $ ip a show tap0
> >> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>       link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>       inet 111.1.1.1/24 scope global tap0
> >>          valid_lft forever preferred_lft forever
> >>       inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>          valid_lft forever preferred_lft forever
> >>
> >> In L1:
> >>
> >> # ip a show eth0
> >> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>       link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
> >>       altname enp0s2
> >>       inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
> >>          valid_lft 83455sec preferred_lft 83455sec
> >>       inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
> >>          valid_lft 86064sec preferred_lft 14064sec
> >>       inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
> >>          valid_lft forever preferred_lft forever
> >>
> >> I'll call this L1-eth0.
> >>
> >> In L2:
> >> # ip a show eth0
> >> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
> >>       link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>       altname enp0s7
> >>       inet 111.1.1.2/24 scope global eth0
> >>          valid_lft forever preferred_lft forever
> >>
> >> I'll call this L2-eth0.
> >>
> >> Apart from eth0, lo is the only other device in both L1 and L2.
> >>
> >> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
> >> as its destination address. When booting L2 with x-svq=false, the
> >> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
> >> the frames and passes them on to L2 and pinging works [2].
> >>
> >
> > So this behavior is interesting by itself. But L1's kernel net system
> > should not receive anything. As I read it, even if it receives it, it
> > should not forward the frame to L2 as it is in a different subnet. Are
> > you able to read it using tcpdump on L1?
>
> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
> that were directed at L2 even though L2 was able to receive them.
> Similarly, it didn't capture any packets that were sent from L2 to
> L0. This is when L2 is launched with x-svq=false.
>

That's right. The virtio dataplane goes directly from L0 to L2, you
should not be able to see any packets in the net of L1.

> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
> receive_filter allows L2 to receive packets from L0. I added
> the following line just before line 1771 [1] to check this out.
>
> n->mac[5] = 0x57;
>

That's very interesting. Let me answer all the gdb questions below and
we can debug it deeper :).

> > Maybe we can make the scenario clearer by telling which virtio-net
> > device is which with virtio_net_pci,mac=XX:... ?
> >
> >> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
> >> (LSB = 56) in virtio_net_handle_mac() [3].
> >
> > Can you tell with gdb bt if this function is called from net or the
> > SVQ subsystem?
>
> I am struggling to learn how one uses gdb to debug QEMU. I tried running
> QEMU in L0 with -s and -S in one terminal. In another terminal, I ran
> the following:
>

The -s option makes QEMU act as a gdb server for the *guest
kernel*. That is useful for debugging the virtio_net driver in the
nested guest, for example, but SVQ lives in the nested QEMU, in
userspace. So you don't need to use it here.

> $ gdb ./build/qemu-system-x86_64
>
> I then ran the following in gdb's console, but stepping through or
> continuing the execution gives me errors:
>
> (gdb) target remote localhost:1234
> (gdb) break -source ../hw/net/virtio-net.c -function receive_filter
> (gdb) c
> Continuing.
> Warning:
> Cannot insert breakpoint 2.
> Cannot access memory at address 0x9058c6
>
> Command aborted.
> (gdb) ni
> Continuing.
> Warning:
> Cannot insert breakpoint 2.
> Cannot access memory at address 0x9058c6
>
> Command aborted.
>

Yes, you're trying to debug the kernel running in the guest with
QEMU's sources, so it is not possible to correlate functions, symbols,
etc :).

To run QEMU with gdb, just add gdb --args in front of the qemu invocation.

> I built QEMU using ./configure --enable-debug.
>
> I also tried using the --disable-pie option but this results
> in a build error.
>
> [8063/8844] Linking target qemu-keymap
> FAILED: qemu-keymap
> cc -m64  -o qemu-keymap <...>
> /usr/bin/ld: libevent-loop-base.a.p/event-loop-base.c.o: relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIE
> /usr/bin/ld: failed to set dynamic section sizes: bad value
> collect2: error: ld returned 1 exit status
>
> >> n->mac_table.macs also
> >> does not seem to have L2-eth0's MAC address. Due to this,
> >> receive_filter() filters out all the frames [4] that were meant for
> >> L2-eth0.
> >>
> >
> > In the vp_vdpa scenario of the blog receive_filter should not be
> > called in the qemu running in the L1 guest, the nested one. Can you
> > check it with gdb or by printing a trace if it is called?
>
> This is right. receive_filter is not called in L1's QEMU with
> x-svq=true.
>
> >> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
> >> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
> >> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
> >> understood how the MAC address is set in VirtIONet when x-svq=false.
> >> Understanding this might help see why n->mac has different values
> >> when x-svq is false vs when it is true.
> >>
> >
> > Ok this makes sense, as x-svq=true is the one that receives the set
> > mac message. You should see it in L0's QEMU though, both in x-svq=on
> > and x-svq=off scenarios. Can you check it?
>
> L0's QEMU seems to be receiving the "set mac" message only when L1
> is launched with x-svq=true. With x-svq=off, I don't see any call
> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
> in L0.
>

Ok this is interesting. Let's disable control virtqueue to start with
something simpler:
-device virtio-net-pci,netdev=net0,ctrl_vq=off,...

QEMU will start complaining about features that depend on ctrl_vq,
like ctrl_rx. Let's disable all of them and check this new scenario.
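
As a rough illustration of that dependency, here is a minimal sketch
that reports which feature bits are set without the control virtqueue
they rely on. The bit numbers follow the virtio spec's network device
feature bits; the DEMO_* names and the helper functions are
hypothetical and are not taken from QEMU, and the list of dependent
features is limited to the ones I am sure about.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bit numbers as listed in the virtio spec for network devices.
 * The DEMO_ prefix marks these as illustration-only names. */
enum {
    DEMO_F_CTRL_VQ        = 17,
    DEMO_F_CTRL_RX        = 18,
    DEMO_F_CTRL_VLAN      = 19,
    DEMO_F_GUEST_ANNOUNCE = 21,
    DEMO_F_MQ             = 22,
    DEMO_F_CTRL_MAC_ADDR  = 23,
};

#define DEMO_BIT(n) (UINT64_C(1) << (n))

/* Features that the spec says require the control virtqueue. */
uint64_t demo_ctrl_vq_dependents(void)
{
    return DEMO_BIT(DEMO_F_CTRL_RX) |
           DEMO_BIT(DEMO_F_CTRL_VLAN) |
           DEMO_BIT(DEMO_F_GUEST_ANNOUNCE) |
           DEMO_BIT(DEMO_F_MQ) |
           DEMO_BIT(DEMO_F_CTRL_MAC_ADDR);
}

/* Returns the features that are set but lack the control virtqueue they
 * depend on. Returns 0 if the feature set is consistent in that respect. */
uint64_t demo_orphaned_features(uint64_t features)
{
    if (features & DEMO_BIT(DEMO_F_CTRL_VQ)) {
        return 0;
    }
    return features & demo_ctrl_vq_dependents();
}
```

With ctrl_vq=off, a non-zero result here corresponds to the options
that would also have to be switched off on the command line.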

Thanks!

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 3 months ago

Hi,

On 1/24/25 1:04 PM, Eugenio Perez Martin wrote:
> On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
>>>> [...]
>>>> Apologies for the delay in replying. It took me a while to figure
>>>> this out, but I have now understood why this doesn't work. L1 is
>>>> unable to receive messages from L0 because they get filtered out
>>>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
>>>> the MAC addresses.
>>>>
>>>> In L0, I have:
>>>>
>>>> $ ip a show tap0
>>>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>        link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>        inet 111.1.1.1/24 scope global tap0
>>>>           valid_lft forever preferred_lft forever
>>>>        inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>           valid_lft forever preferred_lft forever
>>>>
>>>> In L1:
>>>>
>>>> # ip a show eth0
>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>        link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
>>>>        altname enp0s2
>>>>        inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
>>>>           valid_lft 83455sec preferred_lft 83455sec
>>>>        inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
>>>>           valid_lft 86064sec preferred_lft 14064sec
>>>>        inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
>>>>           valid_lft forever preferred_lft forever
>>>>
>>>> I'll call this L1-eth0.
>>>>
>>>> In L2:
>>>> # ip a show eth0
>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
>>>>        link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>        altname enp0s7
>>>>        inet 111.1.1.2/24 scope global eth0
>>>>           valid_lft forever preferred_lft forever
>>>>
>>>> I'll call this L2-eth0.
>>>>
>>>> Apart from eth0, lo is the only other device in both L1 and L2.
>>>>
>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
>>>> as its destination address. When booting L2 with x-svq=false, the
>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
>>>> the frames and passes them on to L2 and pinging works [2].
>>>>
>>>
>>> So this behavior is interesting by itself. But L1's kernel net system
>>> should not receive anything. As I read it, even if it receives it, it
>>> should not forward the frame to L2 as it is in a different subnet. Are
>>> you able to read it using tcpdump on L1?
>>
>> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
>> that were directed at L2 even though L2 was able to receive them.
>> Similarly, it didn't capture any packets that were sent from L2 to
>> L0. This is when L2 is launched with x-svq=false.
>>
> 
> That's right. The virtio dataplane goes directly from L0 to L2, you
> should not be able to see any packets in the net of L1.

I am a little confused here. Since vhost=off is set in L0's QEMU
(which is used to boot L1), I am able to inspect the packets when
tracing/debugging receive_filter in hw/net/virtio-net.c [1]. Does
this mean the dataplane from L0 to L2 passes through L0's QEMU
(so L0 QEMU is aware of what's going on), but bypasses L1 completely,
so L1's kernel does not know what packets are being sent/received?
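
For context, the unicast part of the check that drops the frames can
be pictured roughly as below. This is only a simplified sketch and not
the actual receive_filter code: it ignores promiscuous mode, broadcast,
multicast and VLAN handling, and the DemoNic type and function name
are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

/* Hypothetical stand-in for the MAC-related state of the NIC model. */
typedef struct {
    uint8_t mac[DEMO_ETH_ALEN];            /* device MAC, like n->mac */
    const uint8_t (*table)[DEMO_ETH_ALEN]; /* extra unicast entries, may be NULL */
    size_t table_len;                      /* number of entries in table */
} DemoNic;

/* Accept a unicast frame only if its destination matches the device MAC
 * or one of the entries in the unicast table. */
bool demo_accept_unicast(const DemoNic *nic, const uint8_t *dst)
{
    assert(nic != NULL && dst != NULL);

    if (memcmp(dst, nic->mac, DEMO_ETH_ALEN) == 0) {
        return true;
    }
    for (size_t i = 0; i < nic->table_len; i++) {
        if (memcmp(dst, nic->table[i], DEMO_ETH_ALEN) == 0) {
            return true;
        }
    }
    return false;
}
```

With the device MAC ending in 0x56 and an empty table, a frame
addressed to the MAC ending in 0x57 is rejected, which matches what I
see with x-svq=true. Changing the last byte of the device MAC to 0x57
makes the same frame pass.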

>> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
>> receive_filter allows L2 to receive packets from L0. I added
>> the following line just before line 1771 [1] to check this out.
>>
>> n->mac[5] = 0x57;
>>
> 
> That's very interesting. Let me answer all the gdb questions below and
> we can debug it deeper :).
> 

Thank you for the primer on using gdb with QEMU. I am able to debug
QEMU now.

>>> Maybe we can make the scenario clearer by telling which virtio-net
>>> device is which with virtio_net_pci,mac=XX:... ?
>>>
>>>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
>>>> (LSB = 56) in virtio_net_handle_mac() [3].
>>>
>>> Can you tell with gdb bt if this function is called from net or the
>>> SVQ subsystem?
>>

It looks like the function is being called from net.

(gdb) bt
#0  virtio_net_handle_mac (n=0x15622425e, cmd=85 'U', iov=0x555558865980, iov_cnt=1476792840) at ../hw/net/virtio-net.c:1098
#1  0x0000555555e5920b in virtio_net_handle_ctrl_iov (vdev=0x555558fdacd0, in_sg=0x5555580611f8, in_num=1, out_sg=0x555558061208,
      out_num=1) at ../hw/net/virtio-net.c:1581
#2  0x0000555555e593a0 in virtio_net_handle_ctrl (vdev=0x555558fdacd0, vq=0x555558fe7730) at ../hw/net/virtio-net.c:1610
#3  0x0000555555e9a7d8 in virtio_queue_notify_vq (vq=0x555558fe7730) at ../hw/virtio/virtio.c:2484
#4  0x0000555555e9dffb in virtio_queue_host_notifier_read (n=0x555558fe77a4) at ../hw/virtio/virtio.c:3869
#5  0x000055555620329f in aio_dispatch_handler (ctx=0x555557d9f840, node=0x7fffdca7ba80) at ../util/aio-posix.c:373
#6  0x000055555620346f in aio_dispatch_handlers (ctx=0x555557d9f840) at ../util/aio-posix.c:415
#7  0x00005555562034cb in aio_dispatch (ctx=0x555557d9f840) at ../util/aio-posix.c:425
#8  0x00005555562242b5 in aio_ctx_dispatch (source=0x555557d9f840, callback=0x0, user_data=0x0) at ../util/async.c:361
#9  0x00007ffff6d86559 in ?? () from /usr/lib/libglib-2.0.so.0
#10 0x00007ffff6d86858 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#11 0x0000555556225bf9 in glib_pollfds_poll () at ../util/main-loop.c:287
#12 0x0000555556225c87 in os_host_main_loop_wait (timeout=294672) at ../util/main-loop.c:310
#13 0x0000555556225db6 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
#14 0x0000555555c0c1a3 in qemu_main_loop () at ../system/runstate.c:835
#15 0x000055555612bd8d in qemu_default_main (opaque=0x0) at ../system/main.c:48
#16 0x000055555612be3d in main (argc=23, argv=0x7fffffffe508) at ../system/main.c:76

virtio_queue_notify_vq at hw/virtio/virtio.c:2484 [2] calls
vq->handle_output(vdev, vq). I see "handle_output" is a function
pointer and in this case it seems to be pointing to
virtio_net_handle_ctrl.
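
To check my understanding of this dispatch, here is a minimal sketch
of the pattern with made-up types. DemoDevice, DemoQueue and the
demo_* functions are hypothetical stand-ins and not QEMU's
definitions, and the NULL check before the call is an assumption of
the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a virtio device and virtqueue. */
typedef struct DemoDevice DemoDevice;
typedef struct DemoQueue DemoQueue;
typedef void (*DemoHandleOutput)(DemoDevice *dev, DemoQueue *q);

struct DemoDevice {
    int ctrl_calls; /* how often the control handler ran */
    int tx_calls;   /* how often the tx handler ran */
};

struct DemoQueue {
    DemoDevice *dev;
    DemoHandleOutput handle_output; /* chosen when the queue is created */
};

/* Plays the role of virtio_net_handle_ctrl in this sketch. */
void demo_handle_ctrl(DemoDevice *dev, DemoQueue *q)
{
    (void)q;
    dev->ctrl_calls++;
}

/* Plays the role of a tx handler in this sketch. */
void demo_handle_tx(DemoDevice *dev, DemoQueue *q)
{
    (void)q;
    dev->tx_calls++;
}

/* The notify path does not know which handler it runs. It simply
 * follows the per-queue pointer if one is set. */
void demo_queue_notify(DemoQueue *q)
{
    assert(q != NULL);
    if (q->handle_output) {
        q->handle_output(q->dev, q);
    }
}
```

So a notification on the control queue ends up in the control handler
only because that queue was created with it as its handle_output, which
is consistent with frames #2 and #3 of the backtrace.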

>>>> [...]
>>>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
>>>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
>>>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
>>>> understood how the MAC address is set in VirtIONet when x-svq=false.
>>>> Understanding this might help see why n->mac has different values
>>>> when x-svq is false vs when it is true.
>>>
>>> Ok this makes sense, as x-svq=true is the one that receives the set
>>> mac message. You should see it in L0's QEMU though, both in x-svq=on
>>> and x-svq=off scenarios. Can you check it?
>>
>> L0's QEMU seems to be receiving the "set mac" message only when L1
>> is launched with x-svq=true. With x-svq=off, I don't see any call
>> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
>> in L0.
>>
> 
> Ok this is interesting. Let's disable control virtqueue to start with
> something simpler:
> device virtio-net-pci,netdev=net0,ctrl_vq=off,...
> 
> QEMU will start complaining about features that depend on ctrl_vq,
> like ctrl_rx. Let's disable all of them and check this new scenario.
>

I am still investigating this part. I set ctrl_vq=off and ctrl_rx=off.
I didn't get any errors as such about features that depend on ctrl_vq.
However, I did notice that after booting L2 (x-svq=true as well as
x-svq=false), no eth0 device was created. There was only a "lo" interface
in L2. An eth0 interface is present only when L1 (L0 QEMU) is booted
with ctrl_vq=on and ctrl_rx=on.

Thanks,
Sahil

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1738
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L2484

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 3 months ago

On Fri, Jan 31, 2025 at 6:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 1/24/25 1:04 PM, Eugenio Perez Martin wrote:
> > On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
> >>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
> >>>> [...]
> >>>> Apologies for the delay in replying. It took me a while to figure
> >>>> this out, but I have now understood why this doesn't work. L1 is
> >>>> unable to receive messages from L0 because they get filtered out
> >>>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
> >>>> the MAC addresses.
> >>>>
> >>>> In L0, I have:
> >>>>
> >>>> $ ip a show tap0
> >>>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>>>        link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>>>        inet 111.1.1.1/24 scope global tap0
> >>>>           valid_lft forever preferred_lft forever
> >>>>        inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>>>           valid_lft forever preferred_lft forever
> >>>>
> >>>> In L1:
> >>>>
> >>>> # ip a show eth0
> >>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>        link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
> >>>>        altname enp0s2
> >>>>        inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
> >>>>           valid_lft 83455sec preferred_lft 83455sec
> >>>>        inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
> >>>>           valid_lft 86064sec preferred_lft 14064sec
> >>>>        inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
> >>>>           valid_lft forever preferred_lft forever
> >>>>
> >>>> I'll call this L1-eth0.
> >>>>
> >>>> In L2:
> >>>> # ip a show eth0
> >>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
> >>>>        link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>        altname enp0s7
> >>>>        inet 111.1.1.2/24 scope global eth0
> >>>>           valid_lft forever preferred_lft forever
> >>>>
> >>>> I'll call this L2-eth0.
> >>>>
> >>>> Apart from eth0, lo is the only other device in both L1 and L2.
> >>>>
> >>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
> >>>> as its destination address. When booting L2 with x-svq=false, the
> >>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
> >>>> the frames and passes them on to L2 and pinging works [2].
> >>>>
> >>>
> >>> So this behavior is interesting by itself. But L1's kernel net system
> >>> should not receive anything. As I read it, even if it receives it, it
> >>> should not forward the frame to L2 as it is in a different subnet. Are
> >>> you able to read it using tcpdump on L1?
> >>
> >> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
> >> that were directed at L2 even though L2 was able to receive them.
> >> Similarly, it didn't capture any packets that were sent from L2 to
> >> L0. This is when L2 is launched with x-svq=false.
> >>
> >
> > That's right. The virtio dataplane goes directly from L0 to L2, you
> > should not be able to see any packets in the net of L1.
>
> I am a little confused here. Since vhost=off is set in L0's QEMU
> (which is used to boot L1), I am able to inspect the packets when
> tracing/debugging receive_filter in hw/net/virtio-net.c. [1] Does
> this mean the dataplane from L0 to L2 passes through L0's QEMU
> (so L0 QEMU is aware of what's going on), but bypasses L1 completely
> so L1's kernel does not know what packets are being sent/received.
>

That's right. We're saving processing power and context switches that way :).

> >> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
> >> receive_filter allows L2 to receive packets from L0. I added
> >> the following line just before line 1771 [1] to check this out.
> >>
> >> n->mac[5] = 0x57;
> >>
> >
> > That's very interesting. Let me answer all the gdb questions below and
> > we can debug it deeper :).
> >
>
> Thank you for the primer on using gdb with QEMU. I am able to debug
> QEMU now.
>
> >>> Maybe we can make the scenario clearer by telling which virtio-net
> >>> device is which with virtio_net_pci,mac=XX:... ?
> >>>
> >>>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
> >>>> (LSB = 56) in virtio_net_handle_mac() [3].
> >>>
> >>> Can you tell with gdb bt if this function is called from net or the
> >>> SVQ subsystem?
> >>
>
> It looks like the function is being called from net.
>
> (gdb) bt
> #0  virtio_net_handle_mac (n=0x15622425e, cmd=85 'U', iov=0x555558865980, iov_cnt=1476792840) at ../hw/net/virtio-net.c:1098
> #1  0x0000555555e5920b in virtio_net_handle_ctrl_iov (vdev=0x555558fdacd0, in_sg=0x5555580611f8, in_num=1, out_sg=0x555558061208,
>       out_num=1) at ../hw/net/virtio-net.c:1581
> #2  0x0000555555e593a0 in virtio_net_handle_ctrl (vdev=0x555558fdacd0, vq=0x555558fe7730) at ../hw/net/virtio-net.c:1610
> #3  0x0000555555e9a7d8 in virtio_queue_notify_vq (vq=0x555558fe7730) at ../hw/virtio/virtio.c:2484
> #4  0x0000555555e9dffb in virtio_queue_host_notifier_read (n=0x555558fe77a4) at ../hw/virtio/virtio.c:3869
> #5  0x000055555620329f in aio_dispatch_handler (ctx=0x555557d9f840, node=0x7fffdca7ba80) at ../util/aio-posix.c:373
> #6  0x000055555620346f in aio_dispatch_handlers (ctx=0x555557d9f840) at ../util/aio-posix.c:415
> #7  0x00005555562034cb in aio_dispatch (ctx=0x555557d9f840) at ../util/aio-posix.c:425
> #8  0x00005555562242b5 in aio_ctx_dispatch (source=0x555557d9f840, callback=0x0, user_data=0x0) at ../util/async.c:361
> #9  0x00007ffff6d86559 in ?? () from /usr/lib/libglib-2.0.so.0
> #10 0x00007ffff6d86858 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
> #11 0x0000555556225bf9 in glib_pollfds_poll () at ../util/main-loop.c:287
> #12 0x0000555556225c87 in os_host_main_loop_wait (timeout=294672) at ../util/main-loop.c:310
> #13 0x0000555556225db6 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
> #14 0x0000555555c0c1a3 in qemu_main_loop () at ../system/runstate.c:835
> #15 0x000055555612bd8d in qemu_default_main (opaque=0x0) at ../system/main.c:48
> #16 0x000055555612be3d in main (argc=23, argv=0x7fffffffe508) at ../system/main.c:76
>
> virtio_queue_notify_vq at hw/virtio/virtio.c:2484 [2] calls
> vq->handle_output(vdev, vq). I see "handle_output" is a function
> pointer and in this case it seems to be pointing to
> virtio_net_handle_ctrl.
>
> >>>> [...]
> >>>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
> >>>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
> >>>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
> >>>> understood how the MAC address is set in VirtIONet when x-svq=false.
> >>>> Understanding this might help see why n->mac has different values
> >>>> when x-svq is false vs when it is true.
> >>>
> >>> Ok this makes sense, as x-svq=true is the one that receives the set
> >>> mac message. You should see it in L0's QEMU though, both in x-svq=on
> >>> and x-svq=off scenarios. Can you check it?
> >>
> >> L0's QEMU seems to be receiving the "set mac" message only when L1
> >> is launched with x-svq=true. With x-svq=off, I don't see any call
> >> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
> >> in L0.
> >>
> >
> > Ok this is interesting. Let's disable control virtqueue to start with
> > something simpler:
> > device virtio-net-pci,netdev=net0,ctrl_vq=off,...
> >
> > QEMU will start complaining about features that depend on ctrl_vq,
> > like ctrl_rx. Let's disable all of them and check this new scenario.
> >
>
> I am still investigating this part. I set ctrl_vq=off and ctrl_rx=off.
> I didn't get any errors as such about features that depend on ctrl_vq.
> However, I did notice that after booting L2 (x-svq=true as well as
> x-svq=false), no eth0 device was created. There was only a "lo" interface
> in L2. An eth0 interface is present only when L1 (L0 QEMU) is booted
> with ctrl_vq=on and ctrl_rx=on.
>

Are there any error messages in the nested guest's dmesg? Is it fixed
when you set the same MAC address on L0's virtio-net-pci and on L1's?

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 3 months ago

Hi,

On 1/31/25 12:27 PM, Eugenio Perez Martin wrote:
> On Fri, Jan 31, 2025 at 6:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 1/24/25 1:04 PM, Eugenio Perez Martin wrote:
>>> On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
>>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
>>>>>> [...]
>>>>>> Apologies for the delay in replying. It took me a while to figure
>>>>>> this out, but I have now understood why this doesn't work. L1 is
>>>>>> unable to receive messages from L0 because they get filtered out
>>>>>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
>>>>>> the MAC addresses.
>>>>>>
>>>>>> In L0, I have:
>>>>>>
>>>>>> $ ip a show tap0
>>>>>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>>>         link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>>>         inet 111.1.1.1/24 scope global tap0
>>>>>>            valid_lft forever preferred_lft forever
>>>>>>         inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>>>            valid_lft forever preferred_lft forever
>>>>>>
>>>>>> In L1:
>>>>>>
>>>>>> # ip a show eth0
>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>         link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
>>>>>>         altname enp0s2
>>>>>>         inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
>>>>>>            valid_lft 83455sec preferred_lft 83455sec
>>>>>>         inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
>>>>>>            valid_lft 86064sec preferred_lft 14064sec
>>>>>>         inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
>>>>>>            valid_lft forever preferred_lft forever
>>>>>>
>>>>>> I'll call this L1-eth0.
>>>>>>
>>>>>> In L2:
>>>>>> # ip a show eth0
>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
>>>>>>         link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>         altname enp0s7
>>>>>>         inet 111.1.1.2/24 scope global eth0
>>>>>>            valid_lft forever preferred_lft forever
>>>>>>
>>>>>> I'll call this L2-eth0.
>>>>>>
>>>>>> Apart from eth0, lo is the only other device in both L1 and L2.
>>>>>>
>>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
>>>>>> as its destination address. When booting L2 with x-svq=false, the
>>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
>>>>>> the frames and passes them on to L2 and pinging works [2].
>>>>>>
>>>>>
>>>>> So this behavior is interesting by itself. But L1's kernel net system
>>>>> should not receive anything. As I read it, even if it receives it, it
>>>>> should not forward the frame to L2 as it is in a different subnet. Are
>>>>> you able to read it using tcpdump on L1?
>>>>
>>>> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
>>>> that were directed at L2 even though L2 was able to receive them.
>>>> Similarly, it didn't capture any packets that were sent from L2 to
>>>> L0. This is when L2 is launched with x-svq=false.
>>>>
>>>
>>> That's right. The virtio dataplane goes directly from L0 to L2, you
>>> should not be able to see any packets in the net of L1.
>>
>> I am a little confused here. Since vhost=off is set in L0's QEMU
>> (which is used to boot L1), I am able to inspect the packets when
>> tracing/debugging receive_filter in hw/net/virtio-net.c. [1] Does
>> this mean the dataplane from L0 to L2 passes through L0's QEMU
>> (so L0 QEMU is aware of what's going on), but bypasses L1 completely
>> so L1's kernel does not know what packets are being sent/received.
>>
> 
> That's right. We're saving processing power and context switches that way :).

Got it. I have understood this part. In a previous mail (also present above):

>>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq wrote:
>>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
>>>>>> as its destination address. When booting L2 with x-svq=false, the
>>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
>>>>>> the frames and passes them on to L2 and pinging works [2].
>>>>>>

I was a little unclear in my explanation. I meant the frame received by
L0's QEMU (which is running L1), not by L1 itself.

>>>> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
>>>> receive_filter allows L2 to receive packets from L0. I added
>>>> the following line just before line 1771 [1] to check this out.
>>>>
>>>> n->mac[5] = 0x57;
>>>>
>>>
>>> That's very interesting. Let me answer all the gdb questions below and
>>> we can debug it deeper :).
>>>
>>
>> Thank you for the primer on using gdb with QEMU. I am able to debug
>> QEMU now.
>>
>>>>> Maybe we can make the scenario clearer by telling which virtio-net
>>>>> device is which with virtio_net_pci,mac=XX:... ?
>>>>>
>>>>>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
>>>>>> (LSB = 56) in virtio_net_handle_mac() [3].
>>>>>
>>>>> Can you tell with gdb bt if this function is called from net or the
>>>>> SVQ subsystem?
>>>>
>>
>> It looks like the function is being called from net.
>>
>> (gdb) bt
>> #0  virtio_net_handle_mac (n=0x15622425e, cmd=85 'U', iov=0x555558865980, iov_cnt=1476792840) at ../hw/net/virtio-net.c:1098
>> #1  0x0000555555e5920b in virtio_net_handle_ctrl_iov (vdev=0x555558fdacd0, in_sg=0x5555580611f8, in_num=1, out_sg=0x555558061208,
>>        out_num=1) at ../hw/net/virtio-net.c:1581
>> #2  0x0000555555e593a0 in virtio_net_handle_ctrl (vdev=0x555558fdacd0, vq=0x555558fe7730) at ../hw/net/virtio-net.c:1610
>> #3  0x0000555555e9a7d8 in virtio_queue_notify_vq (vq=0x555558fe7730) at ../hw/virtio/virtio.c:2484
>> #4  0x0000555555e9dffb in virtio_queue_host_notifier_read (n=0x555558fe77a4) at ../hw/virtio/virtio.c:3869
>> #5  0x000055555620329f in aio_dispatch_handler (ctx=0x555557d9f840, node=0x7fffdca7ba80) at ../util/aio-posix.c:373
>> #6  0x000055555620346f in aio_dispatch_handlers (ctx=0x555557d9f840) at ../util/aio-posix.c:415
>> #7  0x00005555562034cb in aio_dispatch (ctx=0x555557d9f840) at ../util/aio-posix.c:425
>> #8  0x00005555562242b5 in aio_ctx_dispatch (source=0x555557d9f840, callback=0x0, user_data=0x0) at ../util/async.c:361
>> #9  0x00007ffff6d86559 in ?? () from /usr/lib/libglib-2.0.so.0
>> #10 0x00007ffff6d86858 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
>> #11 0x0000555556225bf9 in glib_pollfds_poll () at ../util/main-loop.c:287
>> #12 0x0000555556225c87 in os_host_main_loop_wait (timeout=294672) at ../util/main-loop.c:310
>> #13 0x0000555556225db6 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
>> #14 0x0000555555c0c1a3 in qemu_main_loop () at ../system/runstate.c:835
>> #15 0x000055555612bd8d in qemu_default_main (opaque=0x0) at ../system/main.c:48
>> #16 0x000055555612be3d in main (argc=23, argv=0x7fffffffe508) at ../system/main.c:76
>>
>> virtio_queue_notify_vq at hw/virtio/virtio.c:2484 [2] calls
>> vq->handle_output(vdev, vq). I see "handle_output" is a function
>> pointer and in this case it seems to be pointing to
>> virtio_net_handle_ctrl.
>>
>>>>>> [...]
>>>>>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
>>>>>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
>>>>>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
>>>>>> understood how the MAC address is set in VirtIONet when x-svq=false.
>>>>>> Understanding this might help see why n->mac has different values
>>>>>> when x-svq is false vs when it is true.
>>>>>
>>>>> Ok this makes sense, as x-svq=true is the one that receives the set
>>>>> mac message. You should see it in L0's QEMU though, both in x-svq=on
>>>>> and x-svq=off scenarios. Can you check it?
>>>>
>>>> L0's QEMU seems to be receiving the "set mac" message only when L1
>>>> is launched with x-svq=true. With x-svq=off, I don't see any call
>>>> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
>>>> in L0.
>>>>
>>>
>>> Ok this is interesting. Let's disable control virtqueue to start with
>>> something simpler:
>>> device virtio-net-pci,netdev=net0,ctrl_vq=off,...
>>>
>>> QEMU will start complaining about features that depend on ctrl_vq,
>>> like ctrl_rx. Let's disable all of them and check this new scenario.
>>>
>>
>> I am still investigating this part. I set ctrl_vq=off and ctrl_rx=off.
>> I didn't get any errors as such about features that depend on ctrl_vq.
>> However, I did notice that after booting L2 (x-svq=true as well as
>> x-svq=false), no eth0 device was created. There was only a "lo" interface
>> in L2. An eth0 interface is present only when L1 (L0 QEMU) is booted
>> with ctrl_vq=on and ctrl_rx=on.
>>
> 
> Any error messages on the nested guest's dmesg?

Oh, yes, there were error messages in the output of dmesg related to
ctrl_vq. After adding the following args, there were no error messages
in dmesg.

-device virtio-net-pci,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off

I see that the eth0 interface is also created. I am able to ping L0
from L2 and vice versa as well (even with x-svq=true). This is because
n->promisc is set when these features are disabled and receive_filter() [1]
always returns 1.
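
To make the two cases concrete, here is a minimal sketch of the checks
I am describing. It is not the code from hw/net/virtio-net.c: the struct
and function names (sketch_net, sketch_receive_filter) are made up for
illustration, and the VLAN, broadcast, multicast and MAC table checks
are left out. It only models the promiscuous shortcut and the unicast
comparison against n->mac.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6

/* Hypothetical stand-in for the two VirtIONet fields discussed here. */
struct sketch_net {
    uint8_t mac[SKETCH_ETH_ALEN]; /* plays the role of n->mac */
    bool promisc;                 /* plays the role of n->promisc */
};

/*
 * Returns 1 to accept the frame and 0 to drop it.
 * The first SKETCH_ETH_ALEN bytes of an Ethernet frame are the
 * destination MAC address.
 */
static int sketch_receive_filter(const struct sketch_net *n,
                                 const uint8_t *frame, size_t len)
{
    assert(n != NULL && frame != NULL);

    /* Promiscuous mode: everything is accepted, n->mac is never read. */
    if (n->promisc) {
        return 1;
    }

    /* Too short to even carry a destination address. */
    if (len < SKETCH_ETH_ALEN) {
        return 0;
    }

    /* Unicast: accept only if the destination matches the device MAC. */
    return memcmp(frame, n->mac, SKETCH_ETH_ALEN) == 0;
}
```

With mac ending in 0x56 and promisc unset, a frame addressed to
...:57 is dropped. The same frame is accepted once promisc is set or
once mac[5] is 0x57, which matches what I saw while tracing.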

> Is it fixed when you set the same mac address on L0
> virtio-net-pci and L1's?
> 

I didn't have to set the same mac address in this case since promiscuous
mode seems to be getting enabled which allows pinging to work.

There is another concept that I am a little confused about. In the case
where L2 is booted with x-svq=false (and all ctrl features such as ctrl_vq,
ctrl_rx, etc. are on), I am able to ping L0 from L2. When tracing
receive_filter() in L0-QEMU, I see the values of n->mac and the destination
mac address in the ICMP packet match [2].

I haven't understood what n->mac refers to over here. MAC addresses are
globally unique and so the mac address of the device in L1 should be
different from that in L2. But I see L0-QEMU's n->mac is set to the mac
address of the device in L2 (allowing receive_filter to accept the packet).

Thanks,
Sahil

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1745
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1775

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 3 months ago

On Tue, Feb 4, 2025 at 1:49 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 1/31/25 12:27 PM, Eugenio Perez Martin wrote:
> > On Fri, Jan 31, 2025 at 6:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 1/24/25 1:04 PM, Eugenio Perez Martin wrote:
> >>> On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
> >>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
> >>>>>> [...]
> >>>>>> Apologies for the delay in replying. It took me a while to figure
> >>>>>> this out, but I have now understood why this doesn't work. L1 is
> >>>>>> unable to receive messages from L0 because they get filtered out
> >>>>>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
> >>>>>> the MAC addresses.
> >>>>>>
> >>>>>> In L0, I have:
> >>>>>>
> >>>>>> $ ip a show tap0
> >>>>>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>>>>>         link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>>>>>         inet 111.1.1.1/24 scope global tap0
> >>>>>>            valid_lft forever preferred_lft forever
> >>>>>>         inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>>>>>            valid_lft forever preferred_lft forever
> >>>>>>
> >>>>>> In L1:
> >>>>>>
> >>>>>> # ip a show eth0
> >>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>         link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
> >>>>>>         altname enp0s2
> >>>>>>         inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
> >>>>>>            valid_lft 83455sec preferred_lft 83455sec
> >>>>>>         inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
> >>>>>>            valid_lft 86064sec preferred_lft 14064sec
> >>>>>>         inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
> >>>>>>            valid_lft forever preferred_lft forever
> >>>>>>
> >>>>>> I'll call this L1-eth0.
> >>>>>>
> >>>>>> In L2:
> >>>>>> # ip a show eth0
> >>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
> >>>>>>         link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>         altname enp0s7
> >>>>>>         inet 111.1.1.2/24 scope global eth0
> >>>>>>            valid_lft forever preferred_lft forever
> >>>>>>
> >>>>>> I'll call this L2-eth0.
> >>>>>>
> >>>>>> Apart from eth0, lo is the only other device in both L1 and L2.
> >>>>>>
> >>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
> >>>>>> as its destination address. When booting L2 with x-svq=false, the
> >>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
> >>>>>> the frames and passes them on to L2 and pinging works [2].
> >>>>>>
> >>>>>
> >>>>> So this behavior is interesting by itself. But L1's kernel net system
> >>>>> should not receive anything. As I read it, even if it receives it, it
> >>>>> should not forward the frame to L2 as it is in a different subnet. Are
> >>>>> you able to read it using tcpdump on L1?
> >>>>
> >>>> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
> >>>> that were directed at L2 even though L2 was able to receive them.
> >>>> Similarly, it didn't capture any packets that were sent from L2 to
> >>>> L0. This is when L2 is launched with x-svq=false.
> >>>>
> >>>
> >>> That's right. The virtio dataplane goes directly from L0 to L2, you
> >>> should not be able to see any packets in the net of L1.
> >>
> >> I am a little confused here. Since vhost=off is set in L0's QEMU
> >> (which is used to boot L1), I am able to inspect the packets when
> >> tracing/debugging receive_filter in hw/net/virtio-net.c. [1] Does
> >> this mean the dataplane from L0 to L2 passes through L0's QEMU
> >> (so L0 QEMU is aware of what's going on), but bypasses L1 completely
> >> so L1's kernel does not know what packets are being sent/received.
> >>
> >
> > That's right. We're saving processing power and context switches that way :).
>
> Got it. I have understood this part. In a previous mail (also present above):
>
> >>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq wrote:
> >>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
> >>>>>> as its destination address. When booting L2 with x-svq=false, the
> >>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
> >>>>>> the frames and passes them on to L2 and pinging works [2].
> >>>>>>
>
> I was a little unclear in my explanation. I meant to say the frame received by
> L0-QEMU (which is running L1).
>
> >>>> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
> >>>> receive_filter allows L2 to receive packets from L0. I added
> >>>> the following line just before line 1771 [1] to check this out.
> >>>>
> >>>> n->mac[5] = 0x57;
> >>>>
> >>>
> >>> That's very interesting. Let me answer all the gdb questions below and
> >>> we can debug it deeper :).
> >>>
> >>
> >> Thank you for the primer on using gdb with QEMU. I am able to debug
> >> QEMU now.
> >>
> >>>>> Maybe we can make the scenario clearer by telling which virtio-net
> >>>>> device is which with virtio_net_pci,mac=XX:... ?
> >>>>>
> >>>>>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
> >>>>>> (LSB = 56) in virtio_net_handle_mac() [3].
> >>>>>
> >>>>> Can you tell with gdb bt if this function is called from net or the
> >>>>> SVQ subsystem?
> >>>>
> >>
> >> It looks like the function is being called from net.
> >>
> >> (gdb) bt
> >> #0  virtio_net_handle_mac (n=0x15622425e, cmd=85 'U', iov=0x555558865980, iov_cnt=1476792840) at ../hw/net/virtio-net.c:1098
> >> #1  0x0000555555e5920b in virtio_net_handle_ctrl_iov (vdev=0x555558fdacd0, in_sg=0x5555580611f8, in_num=1, out_sg=0x555558061208,
> >>        out_num=1) at ../hw/net/virtio-net.c:1581
> >> #2  0x0000555555e593a0 in virtio_net_handle_ctrl (vdev=0x555558fdacd0, vq=0x555558fe7730) at ../hw/net/virtio-net.c:1610
> >> #3  0x0000555555e9a7d8 in virtio_queue_notify_vq (vq=0x555558fe7730) at ../hw/virtio/virtio.c:2484
> >> #4  0x0000555555e9dffb in virtio_queue_host_notifier_read (n=0x555558fe77a4) at ../hw/virtio/virtio.c:3869
> >> #5  0x000055555620329f in aio_dispatch_handler (ctx=0x555557d9f840, node=0x7fffdca7ba80) at ../util/aio-posix.c:373
> >> #6  0x000055555620346f in aio_dispatch_handlers (ctx=0x555557d9f840) at ../util/aio-posix.c:415
> >> #7  0x00005555562034cb in aio_dispatch (ctx=0x555557d9f840) at ../util/aio-posix.c:425
> >> #8  0x00005555562242b5 in aio_ctx_dispatch (source=0x555557d9f840, callback=0x0, user_data=0x0) at ../util/async.c:361
> >> #9  0x00007ffff6d86559 in ?? () from /usr/lib/libglib-2.0.so.0
> >> #10 0x00007ffff6d86858 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
> >> #11 0x0000555556225bf9 in glib_pollfds_poll () at ../util/main-loop.c:287
> >> #12 0x0000555556225c87 in os_host_main_loop_wait (timeout=294672) at ../util/main-loop.c:310
> >> #13 0x0000555556225db6 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
> >> #14 0x0000555555c0c1a3 in qemu_main_loop () at ../system/runstate.c:835
> >> #15 0x000055555612bd8d in qemu_default_main (opaque=0x0) at ../system/main.c:48
> >> #16 0x000055555612be3d in main (argc=23, argv=0x7fffffffe508) at ../system/main.c:76
> >>
> >> virtio_queue_notify_vq at hw/virtio/virtio.c:2484 [2] calls
> >> vq->handle_output(vdev, vq). I see "handle_output" is a function
> >> pointer and in this case it seems to be pointing to
> >> virtio_net_handle_ctrl.
> >>
> >>>>>> [...]
> >>>>>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
> >>>>>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
> >>>>>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
> >>>>>> understood how the MAC address is set in VirtIONet when x-svq=false.
> >>>>>> Understanding this might help see why n->mac has different values
> >>>>>> when x-svq is false vs when it is true.
> >>>>>
> >>>>> Ok this makes sense, as x-svq=true is the one that receives the set
> >>>>> mac message. You should see it in L0's QEMU though, both in x-svq=on
> >>>>> and x-svq=off scenarios. Can you check it?
> >>>>
> >>>> L0's QEMU seems to be receiving the "set mac" message only when L1
> >>>> is launched with x-svq=true. With x-svq=off, I don't see any call
> >>>> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
> >>>> in L0.
> >>>>
> >>>
> >>> Ok this is interesting. Let's disable control virtqueue to start with
> >>> something simpler:
> >>> device virtio-net-pci,netdev=net0,ctrl_vq=off,...
> >>>
> >>> QEMU will start complaining about features that depend on ctrl_vq,
> >>> like ctrl_rx. Let's disable all of them and check this new scenario.
> >>>
> >>
> >> I am still investigating this part. I set ctrl_vq=off and ctrl_rx=off.
> >> I didn't get any errors as such about features that depend on ctrl_vq.
> >> However, I did notice that after booting L2 (x-svq=true as well as
> >> x-svq=false), no eth0 device was created. There was only a "lo" interface
> >> in L2. An eth0 interface is present only when L1 (L0 QEMU) is booted
> >> with ctrl_vq=on and ctrl_rx=on.
> >>
> >
> > Any error messages on the nested guest's dmesg?
>
> Oh, yes, there were error messages in the output of dmesg related to
> ctrl_vq. After adding the following args, there were no error messages
> in dmesg.
>
> -device virtio-net-pci,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off
>
> I see that the eth0 interface is also created. I am able to ping L0
> from L2 and vice versa as well (even with x-svq=true). This is because
> n->promisc is set when these features are disabled and receive_filter() [1]
> always returns 1.
>
> > Is it fixed when you set the same mac address on L0
> > virtio-net-pci and L1's?
> >
>
> I didn't have to set the same mac address in this case since promiscuous
> mode seems to be getting enabled which allows pinging to work.
>
> There is another concept that I am a little confused about. In the case
> where L2 is booted with x-svq=false (and all ctrl features such as ctrl_vq,
> ctrl_rx, etc. are on), I am able to ping L0 from L2. When tracing
> receive_filter() in L0-QEMU, I see the values of n->mac and the destination
> mac address in the ICMP packet match [2].
>

SVQ makes an effort to set the MAC address at the beginning of
operation. L0 interprets that as "filter out all MACs except this
one". But SVQ cannot set the MAC if ctrl_mac_addr=off, so the NIC
receives all packets and the guest kernel needs to do the filtering
by itself.
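
In case it helps while tracing, here is a rough sketch of what that
"set MAC" control command carries. The class, command and ack values
are the ones from the virtio-net specification. The helper names
(sketch_*) and the single flat 8-byte buffer are assumptions made for
illustration only; this is not how QEMU or SVQ lay the command out in
the control virtqueue.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values from the virtio-net specification. */
#define VIRTIO_NET_CTRL_MAC          1 /* control class */
#define VIRTIO_NET_CTRL_MAC_ADDR_SET 1 /* command within that class */
#define VIRTIO_NET_OK                0 /* ack written by the device */
#define VIRTIO_NET_ERR               1

#define SKETCH_MAC_LEN 6
#define SKETCH_CMD_LEN (2 + SKETCH_MAC_LEN) /* class + cmd + MAC */

/* Driver side: serialise the command. Returns bytes written, 0 on error. */
static size_t sketch_build_mac_addr_set(uint8_t *out, size_t out_len,
                                        const uint8_t mac[SKETCH_MAC_LEN])
{
    assert(out != NULL && mac != NULL);

    if (out_len < SKETCH_CMD_LEN) {
        return 0;
    }
    out[0] = VIRTIO_NET_CTRL_MAC;
    out[1] = VIRTIO_NET_CTRL_MAC_ADDR_SET;
    memcpy(out + 2, mac, SKETCH_MAC_LEN);
    return SKETCH_CMD_LEN;
}

/* Device side: validate the command and update the filter MAC. */
static uint8_t sketch_handle_mac_addr_set(uint8_t dev_mac[SKETCH_MAC_LEN],
                                          const uint8_t *buf, size_t len)
{
    assert(dev_mac != NULL && buf != NULL);

    if (len != SKETCH_CMD_LEN ||
        buf[0] != VIRTIO_NET_CTRL_MAC ||
        buf[1] != VIRTIO_NET_CTRL_MAC_ADDR_SET) {
        return VIRTIO_NET_ERR;
    }
    memcpy(dev_mac, buf + 2, SKETCH_MAC_LEN);
    return VIRTIO_NET_OK;
}
```

After a successful command the device's MAC is whatever the driver
sent, so any unicast frame addressed to a different MAC is dropped
from then on unless the device is in promiscuous mode.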

> I haven't understood what n->mac refers to over here. MAC addresses are
> globally unique and so the mac address of the device in L1 should be
> different from that in L2.

With vDPA, they should be the same device even if they are declared in
different cmdlines or layers of virtualization. If it were a physical
NIC, QEMU should declare the MAC of the physical NIC too.

There is a thread on the QEMU mailing list discussing how QEMU should
influence the control plane. Maybe it would be easier if QEMU just
checked the device's MAC and ignored the cmdline, but that behavior
would be surprising compared to the rest of the vhost backends, like
vhost-kernel. Another option is to just emit a warning if the cmdline
MAC differs from the one that the device reports.
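
Purely as an illustration of the warning option, a check along these
lines would be enough. This is a hypothetical sketch, not existing
QEMU code: the function names and the message text are made up, and
it prints to stderr where QEMU would use its own reporting helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAC_LEN     6
#define SKETCH_MAC_STR_LEN 18 /* "xx:xx:xx:xx:xx:xx" plus NUL */

/* Format a MAC address as lower-case, colon-separated hex. */
static void sketch_format_mac(char out[SKETCH_MAC_STR_LEN],
                              const uint8_t mac[SKETCH_MAC_LEN])
{
    snprintf(out, SKETCH_MAC_STR_LEN, "%02x:%02x:%02x:%02x:%02x:%02x",
             mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
}

/*
 * Returns true if the cmdline MAC equals the MAC the device reports.
 * On a mismatch it warns and returns false; it does not change either
 * address, so the caller still decides which one wins.
 */
static bool sketch_check_mac(const uint8_t cmdline_mac[SKETCH_MAC_LEN],
                             const uint8_t device_mac[SKETCH_MAC_LEN])
{
    char want[SKETCH_MAC_STR_LEN];
    char have[SKETCH_MAC_STR_LEN];

    assert(cmdline_mac != NULL && device_mac != NULL);

    if (memcmp(cmdline_mac, device_mac, SKETCH_MAC_LEN) == 0) {
        return true;
    }

    sketch_format_mac(want, cmdline_mac);
    sketch_format_mac(have, device_mac);
    fprintf(stderr, "warning: cmdline MAC %s differs from device MAC %s\n",
            want, have);
    return false;
}
```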


> But I see L0-QEMU's n->mac is set to the mac
> address of the device in L2 (allowing receive_filter to accept the packet).
>

That's interesting. Can you use gdb to check further what
receive_filter and virtio_net_receive_rcu do? As long as
virtio_net_receive_rcu flushes the packet on the receive queue, SVQ
should receive it.

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 3 months ago

Hi,

On 2/4/25 11:40 PM, Eugenio Perez Martin wrote:
> On Tue, Feb 4, 2025 at 1:49 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 1/31/25 12:27 PM, Eugenio Perez Martin wrote:
>>> On Fri, Jan 31, 2025 at 6:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 1/24/25 1:04 PM, Eugenio Perez Martin wrote:
>>>>> On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
>>>>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
>>>>>>>> [...]
>>>>>>>> Apologies for the delay in replying. It took me a while to figure
>>>>>>>> this out, but I have now understood why this doesn't work. L1 is
>>>>>>>> unable to receive messages from L0 because they get filtered out
>>>>>>>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
>>>>>>>> the MAC addresses.
>>>>>>>>
>>>>>>>> In L0, I have:
>>>>>>>>
>>>>>>>> $ ip a show tap0
>>>>>>>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>>>>>          link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>>>>>          inet 111.1.1.1/24 scope global tap0
>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>          inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>
>>>>>>>> In L1:
>>>>>>>>
>>>>>>>> # ip a show eth0
>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>>>          link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
>>>>>>>>          altname enp0s2
>>>>>>>>          inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
>>>>>>>>             valid_lft 83455sec preferred_lft 83455sec
>>>>>>>>          inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
>>>>>>>>             valid_lft 86064sec preferred_lft 14064sec
>>>>>>>>          inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>
>>>>>>>> I'll call this L1-eth0.
>>>>>>>>
>>>>>>>> In L2:
>>>>>>>> # ip a show eth0
>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
>>>>>>>>          link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>>>          altname enp0s7
>>>>>>>>          inet 111.1.1.2/24 scope global eth0
>>>>>>>>             valid_lft forever preferred_lft forever
>>>>>>>>
>>>>>>>> I'll call this L2-eth0.
>>>>>>>>
>>>>>>>> Apart from eth0, lo is the only other device in both L1 and L2.
>>>>>>>>
>>>>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
>>>>>>>> as its destination address. When booting L2 with x-svq=false, the
>>>>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
>>>>>>>> the frames and passes them on to L2 and pinging works [2].
>>>>>>>>
>>>>>>>
>>>>>>> So this behavior is interesting by itself. But L1's kernel net system
>>>>>>> should not receive anything. As I read it, even if it receives it, it
>>>>>>> should not forward the frame to L2 as it is in a different subnet. Are
>>>>>>> you able to read it using tcpdump on L1?
>>>>>>
>>>>>> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
>>>>>> that were directed at L2 even though L2 was able to receive them.
>>>>>> Similarly, it didn't capture any packets that were sent from L2 to
>>>>>> L0. This is when L2 is launched with x-svq=false.
>>>>>> [...]
>>>>>> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
>>>>>> receive_filter allows L2 to receive packets from L0. I added
>>>>>> the following line just before line 1771 [1] to check this out.
>>>>>>
>>>>>> n->mac[5] = 0x57;
>>>>>>
>>>>>
>>>>> That's very interesting. Let me answer all the gdb questions below and
>>>>> we can debug it deeper :).
>>>>>
>>>>
>>>> Thank you for the primer on using gdb with QEMU. I am able to debug
>>>> QEMU now.
>>>>
>>>>>>> Maybe we can make the scenario clearer by telling which virtio-net
>>>>>>> device is which with virtio_net_pci,mac=XX:... ?
>>>>>>>
>>>>>>>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
>>>>>>>> (LSB = 56) in virtio_net_handle_mac() [3].
>>>>>>>
>>>>>>> Can you tell with gdb bt if this function is called from net or the
>>>>>>> SVQ subsystem?
>>>>>>
>>>>
>>>> It looks like the function is being called from net.
>>>>
>>>> (gdb) bt
>>>> #0  virtio_net_handle_mac (n=0x15622425e, cmd=85 'U', iov=0x555558865980, iov_cnt=1476792840) at ../hw/net/virtio-net.c:1098
>>>> #1  0x0000555555e5920b in virtio_net_handle_ctrl_iov (vdev=0x555558fdacd0, in_sg=0x5555580611f8, in_num=1, out_sg=0x555558061208,
>>>>         out_num=1) at ../hw/net/virtio-net.c:1581
>>>> #2  0x0000555555e593a0 in virtio_net_handle_ctrl (vdev=0x555558fdacd0, vq=0x555558fe7730) at ../hw/net/virtio-net.c:1610
>>>> #3  0x0000555555e9a7d8 in virtio_queue_notify_vq (vq=0x555558fe7730) at ../hw/virtio/virtio.c:2484
>>>> #4  0x0000555555e9dffb in virtio_queue_host_notifier_read (n=0x555558fe77a4) at ../hw/virtio/virtio.c:3869
>>>> #5  0x000055555620329f in aio_dispatch_handler (ctx=0x555557d9f840, node=0x7fffdca7ba80) at ../util/aio-posix.c:373
>>>> #6  0x000055555620346f in aio_dispatch_handlers (ctx=0x555557d9f840) at ../util/aio-posix.c:415
>>>> #7  0x00005555562034cb in aio_dispatch (ctx=0x555557d9f840) at ../util/aio-posix.c:425
>>>> #8  0x00005555562242b5 in aio_ctx_dispatch (source=0x555557d9f840, callback=0x0, user_data=0x0) at ../util/async.c:361
>>>> #9  0x00007ffff6d86559 in ?? () from /usr/lib/libglib-2.0.so.0
>>>> #10 0x00007ffff6d86858 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
>>>> #11 0x0000555556225bf9 in glib_pollfds_poll () at ../util/main-loop.c:287
>>>> #12 0x0000555556225c87 in os_host_main_loop_wait (timeout=294672) at ../util/main-loop.c:310
>>>> #13 0x0000555556225db6 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
>>>> #14 0x0000555555c0c1a3 in qemu_main_loop () at ../system/runstate.c:835
>>>> #15 0x000055555612bd8d in qemu_default_main (opaque=0x0) at ../system/main.c:48
>>>> #16 0x000055555612be3d in main (argc=23, argv=0x7fffffffe508) at ../system/main.c:76
>>>>
>>>> virtio_queue_notify_vq at hw/virtio/virtio.c:2484 [2] calls
>>>> vq->handle_output(vdev, vq). I see "handle_output" is a function
>>>> pointer and in this case it seems to be pointing to
>>>> virtio_net_handle_ctrl.
>>>>
>>>>>>>> [...]
>>>>>>>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
>>>>>>>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
>>>>>>>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
>>>>>>>> understood how the MAC address is set in VirtIONet when x-svq=false.
>>>>>>>> Understanding this might help see why n->mac has different values
>>>>>>>> when x-svq is false vs when it is true.
>>>>>>>
>>>>>>> Ok this makes sense, as x-svq=true is the one that receives the set
>>>>>>> mac message. You should see it in L0's QEMU though, both in x-svq=on
>>>>>>> and x-svq=off scenarios. Can you check it?
>>>>>>
>>>>>> L0's QEMU seems to be receiving the "set mac" message only when L1
>>>>>> is launched with x-svq=true. With x-svq=off, I don't see any call
>>>>>> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
>>>>>> in L0.
>>>>>>
>>>>>
>>>>> Ok this is interesting. Let's disable control virtqueue to start with
>>>>> something simpler:
>>>>> -device virtio-net-pci,netdev=net0,ctrl_vq=off,...
>>>>>
>>>>> QEMU will start complaining about features that depend on ctrl_vq,
>>>>> like ctrl_rx. Let's disable all of them and check this new scenario.
>>>>>
>>>>
>>>> I am still investigating this part. I set ctrl_vq=off and ctrl_rx=off.
>>>> I didn't get any errors as such about features that depend on ctrl_vq.
>>>> However, I did notice that after booting L2 (x-svq=true as well as
>>>> x-svq=false), no eth0 device was created. There was only a "lo" interface
>>>> in L2. An eth0 interface is present only when L1 (L0 QEMU) is booted
>>>> with ctrl_vq=on and ctrl_rx=on.
>>>>
>>>
>>> Any error messages on the nested guest's dmesg?
>>
>> Oh, yes, there were error messages in the output of dmesg related to
>> ctrl_vq. After adding the following args, there were no error messages
>> in dmesg.
>>
>> -device virtio-net-pci,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off
>>
>> I see that the eth0 interface is also created. I am able to ping L0
>> from L2 and vice versa as well (even with x-svq=true). This is because
>> n->promisc is set when these features are disabled and receive_filter() [1]
>> always returns 1.
>>
>>> Is it fixed when you set the same mac address on L0
>>> virtio-net-pci and L1's?
>>>
>>
>> I didn't have to set the same mac address in this case since promiscuous
>> mode seems to be getting enabled which allows pinging to work.
>>
>> There is another concept that I am a little confused about. In the case
>> where L2 is booted with x-svq=false (and all ctrl features such as ctrl_vq,
>> ctrl_rx, etc. are on), I am able to ping L0 from L2. When tracing
>> receive_filter() in L0-QEMU, I see the values of n->mac and the destination
>> mac address in the ICMP packet match [2].
>>
> 
> SVQ makes an effort to set the MAC address at the beginning of
> operation. L0 interprets it as "filter out all MACs except this
> one". But SVQ cannot set the MAC if ctrl_mac_addr=off, so the NIC
> receives all packets and the guest kernel needs to filter them
> itself.
> 
>> I haven't understood what n->mac refers to over here. MAC addresses are
>> globally unique and so the mac address of the device in L1 should be
>> different from that in L2.
> 
> With vDPA, they should be the same device even if they are declared in
> different cmdlines or layers of virtualization. If it were a physical
> NIC, QEMU should declare the MAC of the physical NIC too.

Understood. I guess the issue with x-svq=true is that the MAC address
set in L0-QEMU's n->mac is different from that of the device in L2. That's
why the packets get filtered out with x-svq=true but pinging works with
x-svq=false.

> There is a thread on the QEMU mailing list discussing how QEMU should
> influence the control plane, and maybe it would be easier if QEMU
> just checks the device's MAC and ignores cmdline. But then, that
> behavior would be surprising for the rest of vhosts like vhost-kernel.
> Or just emit a warning if the MAC is different from the one that the
> device reports.
> 

Got it.

>> But I see L0-QEMU's n->mac is set to the mac
>> address of the device in L2 (allowing receive_filter to accept the packet).
>>
> 
> That's interesting. Can you check further with gdb what receive_filter
> and virtio_net_receive_rcu do? As long as virtio_net_receive_rcu
> flushes the packet on the receive queue, SVQ should receive it.
> 
Irrespective of the value of x-svq, the control flow is the same up to
the MAC address comparison in receive_filter() [1]. For x-svq=true,
the equality check between n->mac and the packet's destination MAC address
fails and the packet is filtered out. It is not flushed to the receive
queue. With x-svq=false, the check passes and the packet is queued.
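
The filtering behaviour described above can be illustrated with a minimal
sketch. This is not QEMU's receive_filter(); it is a pared-down model that
assumes only a promiscuous flag, the broadcast address, and a single unicast
MAC. The struct nic_model and accept_frame() names are hypothetical, and
multicast, VLAN, and MAC-table handling are left out on purpose.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical stand-in for the few VirtIONet fields discussed here. */
struct nic_model {
    bool promisc;            /* accept everything when set */
    uint8_t mac[ETH_ALEN];   /* address the device filters on */
};

/*
 * Returns true if the frame would be queued for the guest.
 * The destination MAC is the first six bytes of an Ethernet frame.
 */
static bool accept_frame(const struct nic_model *n, const uint8_t *frame)
{
    static const uint8_t bcast[ETH_ALEN] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    };

    if (n->promisc) {
        return true;                      /* no filtering at all */
    }
    if (memcmp(frame, bcast, ETH_ALEN) == 0) {
        return true;                      /* broadcast always passes */
    }
    /* Unicast: accept only if the destination matches the device MAC. */
    return memcmp(frame, n->mac, ETH_ALEN) == 0;
}
```

In this model, a device MAC ending in 0x56 and a frame destined to a MAC
ending in 0x57 give a rejected frame, which matches the x-svq=true case.
Either patching mac[5] to 0x57 or setting promisc makes the same frame pass,
which matches the two workarounds seen so far.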

On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
> PS: Please note that you can check packed_vq SVQ implementation
> already without CVQ, as these features are totally orthogonal :).
> 

Right. Now that I can ping with the ctrl features turned off, I think
this should take precedence. There's another issue specific to the
packed virtqueue case. It causes the kernel to crash. I have been
investigating this and the situation here looks very similar to what's
explained in Jason Wang's mail [2]. My plan of action is to apply his
changes in L2's kernel and check if that resolves the problem.

The details of the crash can be found in this mail [3].

Thanks,
Sahil

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/net/virtio-net.c#L1775
[2] https://lkml.iu.edu/hypermail/linux/kernel/1307.0/01455.html
[3] https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01134.html

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 3 months ago

On Thu, Feb 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 2/4/25 11:40 PM, Eugenio Perez Martin wrote:
> > On Tue, Feb 4, 2025 at 1:49 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 1/31/25 12:27 PM, Eugenio Perez Martin wrote:
> >>> On Fri, Jan 31, 2025 at 6:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 1/24/25 1:04 PM, Eugenio Perez Martin wrote:
> >>>>> On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
> >>>>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
> >>>>>>>> [...]
> >>>>>>>> Apologies for the delay in replying. It took me a while to figure
> >>>>>>>> this out, but I have now understood why this doesn't work. L1 is
> >>>>>>>> unable to receive messages from L0 because they get filtered out
> >>>>>>>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
> >>>>>>>> the MAC addresses.
> >>>>>>>>
> >>>>>>>> In L0, I have:
> >>>>>>>>
> >>>>>>>> $ ip a show tap0
> >>>>>>>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> >>>>>>>>          link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> >>>>>>>>          inet 111.1.1.1/24 scope global tap0
> >>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>          inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> >>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>
> >>>>>>>> In L1:
> >>>>>>>>
> >>>>>>>> # ip a show eth0
> >>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> >>>>>>>>          link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
> >>>>>>>>          altname enp0s2
> >>>>>>>>          inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
> >>>>>>>>             valid_lft 83455sec preferred_lft 83455sec
> >>>>>>>>          inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
> >>>>>>>>             valid_lft 86064sec preferred_lft 14064sec
> >>>>>>>>          inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
> >>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>
> >>>>>>>> I'll call this L1-eth0.
> >>>>>>>>
> >>>>>>>> In L2:
> >>>>>>>> # ip a show eth0
> >>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
> >>>>>>>>          link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> >>>>>>>>          altname enp0s7
> >>>>>>>>          inet 111.1.1.2/24 scope global eth0
> >>>>>>>>             valid_lft forever preferred_lft forever
> >>>>>>>>
> >>>>>>>> I'll call this L2-eth0.
> >>>>>>>>
> >>>>>>>> Apart from eth0, lo is the only other device in both L1 and L2.
> >>>>>>>>
> >>>>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
> >>>>>>>> as its destination address. When booting L2 with x-svq=false, the
> >>>>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
> >>>>>>>> the frames and passes them on to L2 and pinging works [2].
> >>>>>>>>
> >>>>>>>
> >>>>>>> So this behavior is interesting by itself. But L1's kernel net system
> >>>>>>> should not receive anything. As I read it, even if it receives it, it
> >>>>>>> should not forward the frame to L2 as it is in a different subnet. Are
> >>>>>>> you able to read it using tcpdump on L1?
> >>>>>>
> >>>>>> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
> >>>>>> that were directed at L2 even though L2 was able to receive them.
> >>>>>> Similarly, it didn't capture any packets that were sent from L2 to
> >>>>>> L0. This is when L2 is launched with x-svq=false.
> >>>>>> [...]
> >>>>>> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
> >>>>>> receive_filter allows L2 to receive packets from L0. I added
> >>>>>> the following line just before line 1771 [1] to check this out.
> >>>>>>
> >>>>>> n->mac[5] = 0x57;
> >>>>>>
> >>>>>
> >>>>> That's very interesting. Let me answer all the gdb questions below and
> >>>>> we can debug it deeper :).
> >>>>>
> >>>>
> >>>> Thank you for the primer on using gdb with QEMU. I am able to debug
> >>>> QEMU now.
> >>>>
> >>>>>>> Maybe we can make the scenario clearer by telling which virtio-net
> >>>>>>> device is which with virtio-net-pci,mac=XX:... ?
> >>>>>>>
> >>>>>>>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
> >>>>>>>> (LSB = 56) in virtio_net_handle_mac() [3].
> >>>>>>>
> >>>>>>> Can you tell with gdb bt if this function is called from net or the
> >>>>>>> SVQ subsystem?
> >>>>>>
> >>>>
> >>>> It looks like the function is being called from net.
> >>>>
> >>>> (gdb) bt
> >>>> #0  virtio_net_handle_mac (n=0x15622425e, cmd=85 'U', iov=0x555558865980, iov_cnt=1476792840) at ../hw/net/virtio-net.c:1098
> >>>> #1  0x0000555555e5920b in virtio_net_handle_ctrl_iov (vdev=0x555558fdacd0, in_sg=0x5555580611f8, in_num=1, out_sg=0x555558061208,
> >>>>         out_num=1) at ../hw/net/virtio-net.c:1581
> >>>> #2  0x0000555555e593a0 in virtio_net_handle_ctrl (vdev=0x555558fdacd0, vq=0x555558fe7730) at ../hw/net/virtio-net.c:1610
> >>>> #3  0x0000555555e9a7d8 in virtio_queue_notify_vq (vq=0x555558fe7730) at ../hw/virtio/virtio.c:2484
> >>>> #4  0x0000555555e9dffb in virtio_queue_host_notifier_read (n=0x555558fe77a4) at ../hw/virtio/virtio.c:3869
> >>>> #5  0x000055555620329f in aio_dispatch_handler (ctx=0x555557d9f840, node=0x7fffdca7ba80) at ../util/aio-posix.c:373
> >>>> #6  0x000055555620346f in aio_dispatch_handlers (ctx=0x555557d9f840) at ../util/aio-posix.c:415
> >>>> #7  0x00005555562034cb in aio_dispatch (ctx=0x555557d9f840) at ../util/aio-posix.c:425
> >>>> #8  0x00005555562242b5 in aio_ctx_dispatch (source=0x555557d9f840, callback=0x0, user_data=0x0) at ../util/async.c:361
> >>>> #9  0x00007ffff6d86559 in ?? () from /usr/lib/libglib-2.0.so.0
> >>>> #10 0x00007ffff6d86858 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
> >>>> #11 0x0000555556225bf9 in glib_pollfds_poll () at ../util/main-loop.c:287
> >>>> #12 0x0000555556225c87 in os_host_main_loop_wait (timeout=294672) at ../util/main-loop.c:310
> >>>> #13 0x0000555556225db6 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
> >>>> #14 0x0000555555c0c1a3 in qemu_main_loop () at ../system/runstate.c:835
> >>>> #15 0x000055555612bd8d in qemu_default_main (opaque=0x0) at ../system/main.c:48
> >>>> #16 0x000055555612be3d in main (argc=23, argv=0x7fffffffe508) at ../system/main.c:76
> >>>>
> >>>> virtio_queue_notify_vq at hw/virtio/virtio.c:2484 [2] calls
> >>>> vq->handle_output(vdev, vq). I see "handle_output" is a function
> >>>> pointer and in this case it seems to be pointing to
> >>>> virtio_net_handle_ctrl.
> >>>>
> >>>>>>>> [...]
> >>>>>>>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
> >>>>>>>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
> >>>>>>>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
> >>>>>>>> understood how the MAC address is set in VirtIONet when x-svq=false.
> >>>>>>>> Understanding this might help see why n->mac has different values
> >>>>>>>> when x-svq is false vs when it is true.
> >>>>>>>
> >>>>>>> Ok this makes sense, as x-svq=true is the one that receives the set
> >>>>>>> mac message. You should see it in L0's QEMU though, both in x-svq=on
> >>>>>>> and x-svq=off scenarios. Can you check it?
> >>>>>>
> >>>>>> L0's QEMU seems to be receiving the "set mac" message only when L1
> >>>>>> is launched with x-svq=true. With x-svq=off, I don't see any call
> >>>>>> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
> >>>>>> in L0.
> >>>>>>
> >>>>>
> >>>>> Ok this is interesting. Let's disable control virtqueue to start with
> >>>>> something simpler:
> >>>>> -device virtio-net-pci,netdev=net0,ctrl_vq=off,...
> >>>>>
> >>>>> QEMU will start complaining about features that depend on ctrl_vq,
> >>>>> like ctrl_rx. Let's disable all of them and check this new scenario.
> >>>>>
> >>>>
> >>>> I am still investigating this part. I set ctrl_vq=off and ctrl_rx=off.
> >>>> I didn't get any errors as such about features that depend on ctrl_vq.
> >>>> However, I did notice that after booting L2 (x-svq=true as well as
> >>>> x-svq=false), no eth0 device was created. There was only a "lo" interface
> >>>> in L2. An eth0 interface is present only when L1 (L0 QEMU) is booted
> >>>> with ctrl_vq=on and ctrl_rx=on.
> >>>>
> >>>
> >>> Any error messages on the nested guest's dmesg?
> >>
> >> Oh, yes, there were error messages in the output of dmesg related to
> >> ctrl_vq. After adding the following args, there were no error messages
> >> in dmesg.
> >>
> >> -device virtio-net-pci,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off
> >>
> >> I see that the eth0 interface is also created. I am able to ping L0
> >> from L2 and vice versa as well (even with x-svq=true). This is because
> >> n->promisc is set when these features are disabled and receive_filter() [1]
> >> always returns 1.
> >>
> >>> Is it fixed when you set the same mac address on L0
> >>> virtio-net-pci and L1's?
> >>>
> >>
> >> I didn't have to set the same mac address in this case since promiscuous
> >> mode seems to be getting enabled which allows pinging to work.
> >>
> >> There is another concept that I am a little confused about. In the case
> >> where L2 is booted with x-svq=false (and all ctrl features such as ctrl_vq,
> >> ctrl_rx, etc. are on), I am able to ping L0 from L2. When tracing
> >> receive_filter() in L0-QEMU, I see the values of n->mac and the destination
> >> mac address in the ICMP packet match [2].
> >>
> >
> > SVQ makes an effort to set the MAC address at the beginning of
> > operation. L0 interprets it as "filter out all MACs except this
> > one". But SVQ cannot set the MAC if ctrl_mac_addr=off, so the NIC
> > receives all packets and the guest kernel needs to filter them
> > itself.
> >
> >> I haven't understood what n->mac refers to over here. MAC addresses are
> >> globally unique and so the mac address of the device in L1 should be
> >> different from that in L2.
> >
> > With vDPA, they should be the same device even if they are declared in
> > different cmdlines or layers of virtualization. If it were a physical
> > NIC, QEMU should declare the MAC of the physical NIC too.
>
> Understood. I guess the issue with x-svq=true is that the MAC address
> set in L0-QEMU's n->mac is different from the device in L2. That's why
> the packets get filtered out with x-svq=true but pinging works with
> x-svq=false.
>

Right!


> > There is a thread on the QEMU mailing list discussing how QEMU should
> > influence the control plane, and maybe it would be easier if QEMU
> > just checks the device's MAC and ignores cmdline. But then, that
> > behavior would be surprising for the rest of vhosts like vhost-kernel.
> > Or just emit a warning if the MAC is different from the one that the
> > device reports.
> >
>
> Got it.
>
> >> But I see L0-QEMU's n->mac is set to the mac
> >> address of the device in L2 (allowing receive_filter to accept the packet).
> >>
> >
> > That's interesting. Can you check further with gdb what receive_filter
> > and virtio_net_receive_rcu do? As long as virtio_net_receive_rcu
> > flushes the packet on the receive queue, SVQ should receive it.
> >
> The control flow irrespective of the value of x-svq is the same up till
> the MAC address comparison in receive_filter() [1]. For x-svq=true,
> the equality check between n->mac and the packet's destination MAC address
> fails and the packet is filtered out. It is not flushed to the receive
> queue. With x-svq=false, this is not the case.
>
> On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
> > PS: Please note that you can check packed_vq SVQ implementation
> > already without CVQ, as these features are totally orthogonal :).
> >
>
> Right. Now that I can ping with the ctrl features turned off, I think
> this should take precedence. There's another issue specific to the
> packed virtqueue case. It causes the kernel to crash. I have been
> investigating this and the situation here looks very similar to what's
> explained in Jason Wang's mail [2]. My plan of action is to apply his
> changes in L2's kernel and check if that resolves the problem.
>
> The details of the crash can be found in this mail [3].
>

If you're testing this series without changes, I think that is caused
by not implementing the packed version of vhost_svq_get_buf.

https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01902.html

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 2 months, 4 weeks ago

Hi,

On 2/6/25 12:42 PM, Eugenio Perez Martin wrote:
> On Thu, Feb 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> On 2/4/25 11:40 PM, Eugenio Perez Martin wrote:
>>> On Tue, Feb 4, 2025 at 1:49 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 1/31/25 12:27 PM, Eugenio Perez Martin wrote:
>>>>> On Fri, Jan 31, 2025 at 6:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>> On 1/24/25 1:04 PM, Eugenio Perez Martin wrote:
>>>>>>> On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
>>>>>>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>>>>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
>>>>>>>>>> [...]
>>>>>>>>>> Apologies for the delay in replying. It took me a while to figure
>>>>>>>>>> this out, but I have now understood why this doesn't work. L1 is
>>>>>>>>>> unable to receive messages from L0 because they get filtered out
>>>>>>>>>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
>>>>>>>>>> the MAC addresses.
>>>>>>>>>>
>>>>>>>>>> In L0, I have:
>>>>>>>>>>
>>>>>>>>>> $ ip a show tap0
>>>>>>>>>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
>>>>>>>>>>           link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>           inet 111.1.1.1/24 scope global tap0
>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>           inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>
>>>>>>>>>> In L1:
>>>>>>>>>>
>>>>>>>>>> # ip a show eth0
>>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>>>>>>>           link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>           altname enp0s2
>>>>>>>>>>           inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
>>>>>>>>>>              valid_lft 83455sec preferred_lft 83455sec
>>>>>>>>>>           inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
>>>>>>>>>>              valid_lft 86064sec preferred_lft 14064sec
>>>>>>>>>>           inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>
>>>>>>>>>> I'll call this L1-eth0.
>>>>>>>>>>
>>>>>>>>>> In L2:
>>>>>>>>>> # ip a show eth0
>>>>>>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
>>>>>>>>>>           link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>>>>>>>>>           altname enp0s7
>>>>>>>>>>           inet 111.1.1.2/24 scope global eth0
>>>>>>>>>>              valid_lft forever preferred_lft forever
>>>>>>>>>>
>>>>>>>>>> I'll call this L2-eth0.
>>>>>>>>>>
>>>>>>>>>> Apart from eth0, lo is the only other device in both L1 and L2.
>>>>>>>>>>
>>>>>>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
>>>>>>>>>> as its destination address. When booting L2 with x-svq=false, the
>>>>>>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
>>>>>>>>>> the frames and passes them on to L2 and pinging works [2].
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So this behavior is interesting by itself. But L1's kernel net system
>>>>>>>>> should not receive anything. As I read it, even if it receives it, it
>>>>>>>>> should not forward the frame to L2 as it is in a different subnet. Are
>>>>>>>>> you able to read it using tcpdump on L1?
>>>>>>>>
>>>>>>>> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
>>>>>>>> that were directed at L2 even though L2 was able to receive them.
>>>>>>>> Similarly, it didn't capture any packets that were sent from L2 to
>>>>>>>> L0. This is when L2 is launched with x-svq=false.
>>>>>>>> [...]
>>>>>>>> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
>>>>>>>> receive_filter allows L2 to receive packets from L0. I added
>>>>>>>> the following line just before line 1771 [1] to check this out.
>>>>>>>>
>>>>>>>> n->mac[5] = 0x57;
>>>>>>>>
>>>>>>>
>>>>>>> That's very interesting. Let me answer all the gdb questions below and
>>>>>>> we can debug it deeper :).
>>>>>>>
>>>>>>
>>>>>> Thank you for the primer on using gdb with QEMU. I am able to debug
>>>>>> QEMU now.
>>>>>>
>>>>>>>>> Maybe we can make the scenario clearer by telling which virtio-net
>>>>>>>>> device is which with virtio-net-pci,mac=XX:... ?
>>>>>>>>>
>>>>>>>>>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
>>>>>>>>>> (LSB = 56) in virtio_net_handle_mac() [3].
>>>>>>>>>
>>>>>>>>> Can you tell with gdb bt if this function is called from net or the
>>>>>>>>> SVQ subsystem?
>>>>>>>>
>>>>>>
>>>>>> It looks like the function is being called from net.
>>>>>>
>>>>>> (gdb) bt
>>>>>> #0  virtio_net_handle_mac (n=0x15622425e, cmd=85 'U', iov=0x555558865980, iov_cnt=1476792840) at ../hw/net/virtio-net.c:1098
>>>>>> #1  0x0000555555e5920b in virtio_net_handle_ctrl_iov (vdev=0x555558fdacd0, in_sg=0x5555580611f8, in_num=1, out_sg=0x555558061208,
>>>>>>          out_num=1) at ../hw/net/virtio-net.c:1581
>>>>>> #2  0x0000555555e593a0 in virtio_net_handle_ctrl (vdev=0x555558fdacd0, vq=0x555558fe7730) at ../hw/net/virtio-net.c:1610
>>>>>> #3  0x0000555555e9a7d8 in virtio_queue_notify_vq (vq=0x555558fe7730) at ../hw/virtio/virtio.c:2484
>>>>>> #4  0x0000555555e9dffb in virtio_queue_host_notifier_read (n=0x555558fe77a4) at ../hw/virtio/virtio.c:3869
>>>>>> #5  0x000055555620329f in aio_dispatch_handler (ctx=0x555557d9f840, node=0x7fffdca7ba80) at ../util/aio-posix.c:373
>>>>>> #6  0x000055555620346f in aio_dispatch_handlers (ctx=0x555557d9f840) at ../util/aio-posix.c:415
>>>>>> #7  0x00005555562034cb in aio_dispatch (ctx=0x555557d9f840) at ../util/aio-posix.c:425
>>>>>> #8  0x00005555562242b5 in aio_ctx_dispatch (source=0x555557d9f840, callback=0x0, user_data=0x0) at ../util/async.c:361
>>>>>> #9  0x00007ffff6d86559 in ?? () from /usr/lib/libglib-2.0.so.0
>>>>>> #10 0x00007ffff6d86858 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
>>>>>> #11 0x0000555556225bf9 in glib_pollfds_poll () at ../util/main-loop.c:287
>>>>>> #12 0x0000555556225c87 in os_host_main_loop_wait (timeout=294672) at ../util/main-loop.c:310
>>>>>> #13 0x0000555556225db6 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
>>>>>> #14 0x0000555555c0c1a3 in qemu_main_loop () at ../system/runstate.c:835
>>>>>> #15 0x000055555612bd8d in qemu_default_main (opaque=0x0) at ../system/main.c:48
>>>>>> #16 0x000055555612be3d in main (argc=23, argv=0x7fffffffe508) at ../system/main.c:76
>>>>>>
>>>>>> virtio_queue_notify_vq at hw/virtio/virtio.c:2484 [2] calls
>>>>>> vq->handle_output(vdev, vq). I see "handle_output" is a function
>>>>>> pointer and in this case it seems to be pointing to
>>>>>> virtio_net_handle_ctrl.
>>>>>>
>>>>>>>>>> [...]
>>>>>>>>>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
>>>>>>>>>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
>>>>>>>>>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
>>>>>>>>>> understood how the MAC address is set in VirtIONet when x-svq=false.
>>>>>>>>>> Understanding this might help see why n->mac has different values
>>>>>>>>>> when x-svq is false vs when it is true.
>>>>>>>>>
>>>>>>>>> Ok this makes sense, as x-svq=true is the one that receives the set
>>>>>>>>> mac message. You should see it in L0's QEMU though, both in x-svq=on
>>>>>>>>> and x-svq=off scenarios. Can you check it?
>>>>>>>>
>>>>>>>> L0's QEMU seems to be receiving the "set mac" message only when L1
>>>>>>>> is launched with x-svq=true. With x-svq=off, I don't see any call
>>>>>>>> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
>>>>>>>> in L0.
>>>>>>>>
>>>>>>>
>>>>>>> Ok this is interesting. Let's disable control virtqueue to start with
>>>>>>> something simpler:
>>>>>>> -device virtio-net-pci,netdev=net0,ctrl_vq=off,...
>>>>>>>
>>>>>>> QEMU will start complaining about features that depend on ctrl_vq,
>>>>>>> like ctrl_rx. Let's disable all of them and check this new scenario.
>>>>>>>
>>>>>>
>>>>>> I am still investigating this part. I set ctrl_vq=off and ctrl_rx=off.
>>>>>> I didn't get any errors as such about features that depend on ctrl_vq.
>>>>>> However, I did notice that after booting L2 (x-svq=true as well as
>>>>>> x-svq=false), no eth0 device was created. There was only a "lo" interface
>>>>>> in L2. An eth0 interface is present only when L1 (L0 QEMU) is booted
>>>>>> with ctrl_vq=on and ctrl_rx=on.
>>>>>>
>>>>>
>>>>> Any error messages on the nested guest's dmesg?
>>>>
>>>> Oh, yes, there were error messages in the output of dmesg related to
>>>> ctrl_vq. After adding the following args, there were no error messages
>>>> in dmesg.
>>>>
>>>> -device virtio-net-pci,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off
>>>>
>>>> I see that the eth0 interface is also created. I am able to ping L0
>>>> from L2 and vice versa as well (even with x-svq=true). This is because
>>>> n->promisc is set when these features are disabled and receive_filter() [1]
>>>> always returns 1.
>>>>
>>>>> Is it fixed when you set the same mac address on L0
>>>>> virtio-net-pci and L1's?
>>>>>
>>>>
>>>> I didn't have to set the same mac address in this case since promiscuous
>>>> mode seems to be getting enabled which allows pinging to work.
>>>>
>>>> There is another concept that I am a little confused about. In the case
>>>> where L2 is booted with x-svq=false (and all ctrl features such as ctrl_vq,
>>>> ctrl_rx, etc. are on), I am able to ping L0 from L2. When tracing
>>>> receive_filter() in L0-QEMU, I see the values of n->mac and the destination
>>>> mac address in the ICMP packet match [2].
>>>>
>>>
>>> SVQ makes an effort to set the MAC address at the beginning of
>>> operation. L0 interprets it as "filter out all MACs except this
>>> one". But SVQ cannot set the MAC if ctrl_mac_addr=off, so the NIC
>>> receives all packets and the guest kernel needs to filter them
>>> itself.
>>>
>>>> I haven't understood what n->mac refers to over here. MAC addresses are
>>>> globally unique and so the mac address of the device in L1 should be
>>>> different from that in L2.
>>>
>>> With vDPA, they should be the same device even if they are declared in
>>> different cmdlines or layers of virtualization. If it were a physical
>>> NIC, QEMU should declare the MAC of the physical NIC too.
>>
>> Understood. I guess the issue with x-svq=true is that the MAC address
>> set in L0-QEMU's n->mac is different from the device in L2. That's why
>> the packets get filtered out with x-svq=true but pinging works with
>> x-svq=false.
>>
> 
> Right!
> 
> 
>>> There is a thread on the QEMU mailing list discussing how QEMU should
>>> influence the control plane, and maybe it would be easier if QEMU
>>> just checks the device's MAC and ignores cmdline. But then, that
>>> behavior would be surprising for the rest of vhosts like vhost-kernel.
>>> Or just emit a warning if the MAC is different from the one that the
>>> device reports.
>>>
>>
>> Got it.
>>
>>>> But I see L0-QEMU's n->mac is set to the mac
>>>> address of the device in L2 (allowing receive_filter to accept the packet).
>>>>
>>>
>>> That's interesting. Can you check further with gdb what receive_filter
>>> and virtio_net_receive_rcu do? As long as virtio_net_receive_rcu
>>> flushes the packet on the receive queue, SVQ should receive it.
>>>
>> The control flow irrespective of the value of x-svq is the same up till
>> the MAC address comparison in receive_filter() [1]. For x-svq=true,
>> the equality check between n->mac and the packet's destination MAC address
>> fails and the packet is filtered out. It is not flushed to the receive
>> queue. With x-svq=false, this is not the case.
>>
>> On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
>>> PS: Please note that you can check packed_vq SVQ implementation
>>> already without CVQ, as these features are totally orthogonal :).
>>>
>>
>> Right. Now that I can ping with the ctrl features turned off, I think
>> this should take precedence. There's another issue specific to the
>> packed virtqueue case. It causes the kernel to crash. I have been
>> investigating this and the situation here looks very similar to what's
>> explained in Jason Wang's mail [2]. My plan of action is to apply his
>> changes in L2's kernel and check if that resolves the problem.
>>
>> The details of the crash can be found in this mail [3].
>>
> 
> If you're testing this series without changes, I think that is caused
> by not implementing the packed version of vhost_svq_get_buf.
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01902.html
> 

Oh, apologies, I think I had misunderstood your response in the linked mail.
Until now, I thought they were unrelated. In that case, I'll implement the
packed version of vhost_svq_get_buf. Hopefully that fixes it :).

Thanks,
Sahil

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 2 months, 3 weeks ago

Hi,

On 2/6/25 8:47 PM, Sahil Siddiq wrote:
> On 2/6/25 12:42 PM, Eugenio Perez Martin wrote:
>> On Thu, Feb 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>> On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
>>>> PS: Please note that you can check packed_vq SVQ implementation
>>>> already without CVQ, as these features are totally orthogonal :).
>>>>
>>>
>>> Right. Now that I can ping with the ctrl features turned off, I think
>>> this should take precedence. There's another issue specific to the
>>> packed virtqueue case. It causes the kernel to crash. I have been
>>> investigating this and the situation here looks very similar to what's
>>> explained in Jason Wang's mail [2]. My plan of action is to apply his
>>> changes in L2's kernel and check if that resolves the problem.
>>>
>>> The details of the crash can be found in this mail [3].
>>>
>>
>> If you're testing this series without changes, I think that is caused
>> by not implementing the packed version of vhost_svq_get_buf.
>>
>> https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01902.html
>>
> 
> Oh, apologies, I think I had misunderstood your response in the linked mail.
> Until now, I thought they were unrelated. In that case, I'll implement the
> packed version of vhost_svq_get_buf. Hopefully that fixes it :).
> 

I noticed one thing while testing some of the changes that I have made.
I haven't finished making the relevant changes to all the functions which
will have to handle split and packed vq differently. L2's kernel crashes
when I launch L0-QEMU with ctrl_vq=on,ctrl_rx=on. However, when I start
L0-QEMU with ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off, L2's
kernel boots successfully. Tracing L2-QEMU also confirms that the packed
feature is enabled. With all the ctrl features disabled, I think pinging
will also be possible once I finish implementing the packed versions of
the other functions.

There's another thing that I am confused about regarding the current
implementation (in the master branch).

In hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_vring_write_descs() [1],
svq->free_head saves the descriptor in the specified format using
"le16_to_cpu" (line 171). On the other hand, the value of i is stored
in the native endianness using "cpu_to_le16" (line 168). If "i" is to be
stored in the native endianness (little endian in this case), then
should svq->free_head first be converted to little endian before being
assigned to "i" at the start of the function (line 142)?

Thanks,
Sahil

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 2 months, 3 weeks ago

On Mon, Feb 10, 2025 at 11:58 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 2/6/25 8:47 PM, Sahil Siddiq wrote:
> > On 2/6/25 12:42 PM, Eugenio Perez Martin wrote:
> >> On Thu, Feb 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>> On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
> >>>> PS: Please note that you can check packed_vq SVQ implementation
> >>>> already without CVQ, as these features are totally orthogonal :).
> >>>>
> >>>
> >>> Right. Now that I can ping with the ctrl features turned off, I think
> >>> this should take precedence. There's another issue specific to the
> >>> packed virtqueue case. It causes the kernel to crash. I have been
> >>> investigating this and the situation here looks very similar to what's
> >>> explained in Jason Wang's mail [2]. My plan of action is to apply his
> >>> changes in L2's kernel and check if that resolves the problem.
> >>>
> >>> The details of the crash can be found in this mail [3].
> >>>
> >>
> >> If you're testing this series without changes, I think that is caused
> >> by not implementing the packed version of vhost_svq_get_buf.
> >>
> >> https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01902.html
> >>
> >
> > Oh, apologies, I think I had misunderstood your response in the linked mail.
> > Until now, I thought they were unrelated. In that case, I'll implement the
> > packed version of vhost_svq_get_buf. Hopefully that fixes it :).
> >
>
> I noticed one thing while testing some of the changes that I have made.
> I haven't finished making the relevant changes to all the functions which
> will have to handle split and packed vq differently. L2's kernel crashes
> when I launch L0-QEMU with ctrl_vq=on,ctrl_rx=on.

Interesting, is it a crash similar to this one (NULL pointer dereference
in virtnet_set_features)?

https://issues.redhat.com/browse/RHEL-391

> However, when I start
> L0-QEMU with ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off, L2's
> kernel boots successfully. Tracing L2-QEMU also confirms that the packed
> feature is enabled. With all the ctrl features disabled, I think pinging
> will also be possible once I finish implementing the packed versions of
> the other functions.
>

Good!

> There's another thing that I am confused about regarding the current
> implementation (in the master branch).
>
> In hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_vring_write_descs() [1],
> svq->free_head saves the descriptor in the specified format using
> "le16_to_cpu" (line 171).

Good catch, this should be le16_to_cpu actually. But code-wise it is the
same, so we have no visible error. Do you want to send a patch to fix
it?

> On the other hand, the value of i is stored
> in the native endianness using "cpu_to_le16" (line 168). If "i" is to be
> stored in the native endianness (little endian in this case), then
> should svq->free_head first be converted to little endian before being
> assigned to "i" at the start of the function (line 142)?
>

This part is correct in the code, as it is used by the host, not
written to the guest or read from the guest. So no conversion is
needed.

Thanks!

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 2 months, 3 weeks ago

Hi,

On 2/10/25 7:53 PM, Eugenio Perez Martin wrote:
> On Mon, Feb 10, 2025 at 11:58 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 2/6/25 8:47 PM, Sahil Siddiq wrote:
>>> On 2/6/25 12:42 PM, Eugenio Perez Martin wrote:
>>>> On Thu, Feb 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>> On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
>>>>>> PS: Please note that you can check packed_vq SVQ implementation
>>>>>> already without CVQ, as these features are totally orthogonal :).
>>>>>>
>>>>>
>>>>> Right. Now that I can ping with the ctrl features turned off, I think
>>>>> this should take precedence. There's another issue specific to the
>>>>> packed virtqueue case. It causes the kernel to crash. I have been
>>>>> investigating this and the situation here looks very similar to what's
>>>>> explained in Jason Wang's mail [2]. My plan of action is to apply his
>>>>> changes in L2's kernel and check if that resolves the problem.
>>>>>
>>>>> The details of the crash can be found in this mail [3].
>>>>>
>>>>
>>>> If you're testing this series without changes, I think that is caused
>>>> by not implementing the packed version of vhost_svq_get_buf.
>>>>
>>>> https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01902.html
>>>>
>>>
>>> Oh, apologies, I think I had misunderstood your response in the linked mail.
>>> Until now, I thought they were unrelated. In that case, I'll implement the
>>> packed version of vhost_svq_get_buf. Hopefully that fixes it :).
>>>
>>
>> I noticed one thing while testing some of the changes that I have made.
>> I haven't finished making the relevant changes to all the functions which
>> will have to handle split and packed vq differently. L2's kernel crashes
>> when I launch L0-QEMU with ctrl_vq=on,ctrl_rx=on.
> 
> Interesting, is it a crash similar to this one (NULL pointer dereference
> in virtnet_set_features)?
>
> https://issues.redhat.com/browse/RHEL-391

I am not able to access this bug report (even with a Red Hat account). It
says it may have been deleted or I don't have the permission to view it.

It's hard to tell if this is the same issue. I don't think it is the same
issue though since I don't see any such indication in the logs. The kernel
throws the following:

[   23.047503] virtio_net virtio1: output.0:id 0 is not a head!
[   49.173243] watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [NetworkManager:782]
[   49.174167] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvg
[   49.188258] CPU: 1 PID: 782 Comm: NetworkManager Not tainted 6.8.7-200.fc39.x86_64 #1
[   49.193196] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   49.193196] RIP: 0010:virtqueue_get_buf+0x0/0x20

Maybe I was incorrect in stating that the kernel crashes. It's more like
the kernel is stuck in a loop (according to these blog posts on soft
lockup [1][2]).

In the above trace, RIP is in virtqueue_get_buf() [3]. This is what
calls virtqueue_get_buf_ctx_packed() [4] which throws the error.

What I don't understand is why vq->packed.desc_state[id].data [5] is
NULL when the control features are turned on, but doesn't seem to be
NULL when the control features are turned off.

>> However, when I start
>> L0-QEMU with ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off, L2's
>> kernel boots successfully. Tracing L2-QEMU also confirms that the packed
>> feature is enabled. With all the ctrl features disabled, I think pinging
>> will also be possible once I finish implementing the packed versions of
>> the other functions.
>>
> 
> Good!
> 
>> There's another thing that I am confused about regarding the current
>> implementation (in the master branch).
>>
>> In hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_vring_write_descs() [1],
>> svq->free_head saves the descriptor in the specified format using
>> "le16_to_cpu" (line 171).
> 
> Good catch, this should be le16_to_cpu actually. But code wise is the
> same, so we have no visible error. Do you want to send a patch to fix
> it?
> 
Sorry, I am still a little confused here. Did you mean cpu_to_le16
by any chance? Based on what I have understood, if it is to be used
by the host machine, then it should be cpu_to_le16.

I can send a patch once this is clear, or can even integrate it in
this patch series since this patch series refactors that function
anyway.

>> On the other hand, the value of i is stored
>> in the native endianness using "cpu_to_le16" (line 168). If "i" is to be
>> stored in the native endianness (little endian in this case), then
>> should svq->free_head first be converted to little endian before being
>> assigned to "i" at the start of the function (line 142)?
>>
> 
> This part is correct in the code, as it is used by the host, not
> written to the guest or read from the guest. So no conversion is
> needed.

Understood.

Thanks,
Sahil

[1] https://www.suse.com/support/kb/doc/?id=000018705
[2] https://softlockup.com/SystemAdministration/Linux/Kernel/softlockup/
[3] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L2545
[4] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1727
[5] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1762

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 2 months, 3 weeks ago

On Mon, Feb 10, 2025 at 5:25 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 2/10/25 7:53 PM, Eugenio Perez Martin wrote:
> > On Mon, Feb 10, 2025 at 11:58 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 2/6/25 8:47 PM, Sahil Siddiq wrote:
> >>> On 2/6/25 12:42 PM, Eugenio Perez Martin wrote:
> >>>> On Thu, Feb 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>> On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
> >>>>>> PS: Please note that you can check packed_vq SVQ implementation
> >>>>>> already without CVQ, as these features are totally orthogonal :).
> >>>>>>
> >>>>>
> >>>>> Right. Now that I can ping with the ctrl features turned off, I think
> >>>>> this should take precedence. There's another issue specific to the
> >>>>> packed virtqueue case. It causes the kernel to crash. I have been
> >>>>> investigating this and the situation here looks very similar to what's
> >>>>> explained in Jason Wang's mail [2]. My plan of action is to apply his
> >>>>> changes in L2's kernel and check if that resolves the problem.
> >>>>>
> >>>>> The details of the crash can be found in this mail [3].
> >>>>>
> >>>>
> >>>> If you're testing this series without changes, I think that is caused
> >>>> by not implementing the packed version of vhost_svq_get_buf.
> >>>>
> >>>> https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01902.html
> >>>>
> >>>
> >>> Oh, apologies, I think I had misunderstood your response in the linked mail.
> >>> Until now, I thought they were unrelated. In that case, I'll implement the
> >>> packed version of vhost_svq_get_buf. Hopefully that fixes it :).
> >>>
> >>
> >> I noticed one thing while testing some of the changes that I have made.
> >> I haven't finished making the relevant changes to all the functions which
> >> will have to handle split and packed vq differently. L2's kernel crashes
> >> when I launch L0-QEMU with ctrl_vq=on,ctrl_rx=on.
> >
> > Interesting, is it a crash similar to this one (NULL pointer dereference
> > in virtnet_set_features)?
> >
> > https://issues.redhat.com/browse/RHEL-391
> I am not able to access this bug report (even with a Red Hat account). It
> says it may have been deleted or I don't have the permission to view it.
>
> It's hard to tell if this is the same issue. I don't think it is the same
> issue though since I don't see any such indication in the logs. The kernel
> throws the following:
>
> [   23.047503] virtio_net virtio1: output.0:id 0 is not a head!

This is a common error when modifying dataplane code; it is unlikely
that you make deep changes without seeing it at least once :). It
indicates that your code is marking descriptor id 0 as used when the
guest didn't make it available.
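
To make the failure mode concrete, here is a toy model of the
driver-side bookkeeping behind that message. It is only a sketch and
not the kernel code: the class and method names are hypothetical, and
the real driver tracks more state per descriptor.

```python
class DriverQueue:
    """Toy model of a driver's per-id buffer tracking (hypothetical names)."""

    def __init__(self, size):
        # One slot per descriptor id; None means no buffer is outstanding.
        self.desc_state = [None] * size

    def add_buf(self, desc_id, data):
        # The driver remembers the buffer when it makes the id available.
        self.desc_state[desc_id] = data

    def get_buf(self, used_id):
        # The device reported used_id as used, so it must be outstanding.
        if used_id >= len(self.desc_state):
            raise ValueError(f"id {used_id} out of range")
        data = self.desc_state[used_id]
        if data is None:
            # Same situation as the message in the log above.
            raise ValueError(f"id {used_id} is not a head!")
        self.desc_state[used_id] = None
        return data
```

If the device side writes a used entry for id 0 while only id 1 was
made available, get_buf(0) fails exactly because nothing was recorded
for that id.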

If this is happening in control virtqueue, I'd check if the code is
setting the flags as used in ring[1] when it shouldn't. But my bet is
that the rx queue is the wrong one.

> [   49.173243] watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [NetworkManager:782]
> [   49.174167] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvg
> [   49.188258] CPU: 1 PID: 782 Comm: NetworkManager Not tainted 6.8.7-200.fc39.x86_64 #1
> [   49.193196] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   49.193196] RIP: 0010:virtqueue_get_buf+0x0/0x20
>

Two possibilities about this part:
a) You're spending "too long" in the debugger in QEMU. From the kernel
POV, the function virtqueue_get_buf is taking too long to complete so
it detects it as a lockup. You can check this scenario by not running
QEMU under GDB or disabling all breakpoints. You can ignore this
message if it no longer appears that way. If you still see the
message, go to possibility b.

b) The kernel has a bug that makes it softlockup in virtqueue_get_buf.
The kernel should not soft lockup even if your changes were malicious
:(, so it is something to be fixed. If you have the time, can you test
with the latest upstream kernel?

> Maybe I was incorrect in stating that the kernel crashes. It's more like
> the kernel is stuck in a loop (according to these blog posts on soft
> lockup [1][2]).
>
> In the above trace, RIP is in virtqueue_get_buf() [3]. This is what
> calls virtqueue_get_buf_ctx_packed() [4] which throws the error.
>
> What I don't understand is why vq->packed.desc_state[id].data [5] is
> NULL when the control features are turned on, but doesn't seem to be
> NULL when the control features are turned off.
>

Due to the net subsystem lock, CVQ handling is not as robust / secure
against this error as the dataplane queues. There is an ongoing effort
to make it more robust, so maybe this is something to fix in that
line.

Can you post the whole backtrace that the kernel prints?

> >> However, when I start
> >> L0-QEMU with ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off, L2's
> >> kernel boots successfully. Tracing L2-QEMU also confirms that the packed
> >> feature is enabled. With all the ctrl features disabled, I think pinging
> >> will also be possible once I finish implementing the packed versions of
> >> the other functions.
> >>
> >
> > Good!
> >
> >> There's another thing that I am confused about regarding the current
> >> implementation (in the master branch).
> >>
> >> In hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_vring_write_descs() [1],
> >> svq->free_head saves the descriptor in the specified format using
> >> "le16_to_cpu" (line 171).
> >
> > Good catch, this should be le16_to_cpu actually. But code wise is the
> > same, so we have no visible error. Do you want to send a patch to fix
> > it?
> >
> Sorry, I am still a little confused here. Did you mean cpu_to_le16
> by any chance? Based on what I have understood, if it is to be used
> by the host machine, then it should be cpu_to_le16.
>
> I can send a patch once this is clear, or can even integrate it in
> this patch series since this patch series refactors that function
> anyway.
>

Ok, I don't know how I read the function to answer you that :(. Let me
start from scratch,

In line 171, we're copying data from QEMU internals that are not in
the guest memory to other QEMU internals. So no cpu_to_le* or
le*_to_cpu is needed.
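
As a rough illustration of that rule (a sketch only, with hypothetical
names, not the QEMU code): values that stay inside the host program are
plain native integers, and an explicit little-endian encoding happens
only when a value crosses into memory shared with the guest.

```python
import struct


class ShadowState:
    """Host-only bookkeeping; hypothetical stand-in for SVQ internals."""

    def __init__(self):
        self.free_head = 0  # native integer, never seen by the guest

    def take_free_head(self, next_free):
        # Host-to-host copy: no byte swapping of any kind.
        i = self.free_head
        self.free_head = next_free
        return i


def write_ring_u16(ring, offset, value):
    # Boundary with guest-visible memory: encode explicitly as LE16,
    # which is the role cpu_to_le16 plays in C.
    struct.pack_into("<H", ring, offset, value)


def read_ring_u16(ring, offset):
    # Reading back from guest-visible memory: decode from LE16,
    # which is the role le16_to_cpu plays in C.
    return struct.unpack_from("<H", ring, offset)[0]
```

On a little-endian host both conversions are no-ops in C, which is why
a misplaced one produces no visible error there.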

> >> On the other hand, the value of i is stored
> >> in the native endianness using "cpu_to_le16" (line 168). If "i" is to be
> >> stored in the native endianness (little endian in this case), then
> >> should svq->free_head first be converted to little endian before being
> >> assigned to "i" at the start of the function (line 142)?
> >>
> >

No endianness conversion is needed here for the same reason: everything
is internal to QEMU and not intended to be seen by the guest.

> > This part is correct in the code, as it is used by the host, not
> > written to the guest or read from the guest. So no conversion is
> > needed.
>
> Understood.
>
> Thanks,
> Sahil
>
> [1] https://www.suse.com/support/kb/doc/?id=000018705
> [2] https://softlockup.com/SystemAdministration/Linux/Kernel/softlockup/
> [3] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L2545
> [4] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1727
> [5] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1762
>

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 2 months ago

Hi,

Sorry for the delay in my response. There was a lot to absorb in the
previous mail and I thought I would spend some more time exploring
this.

On 2/11/25 1:27 PM, Eugenio Perez Martin wrote:
> On Mon, Feb 10, 2025 at 5:25 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 2/10/25 7:53 PM, Eugenio Perez Martin wrote:
>>> On Mon, Feb 10, 2025 at 11:58 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 2/6/25 8:47 PM, Sahil Siddiq wrote:
>>>>> On 2/6/25 12:42 PM, Eugenio Perez Martin wrote:
>>>>>> On Thu, Feb 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>>> On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
>>>>>>>> PS: Please note that you can check packed_vq SVQ implementation
>>>>>>>> already without CVQ, as these features are totally orthogonal :).
>>>>>>>>
>>>>>>>
>>>>>>> Right. Now that I can ping with the ctrl features turned off, I think
>>>>>>> this should take precedence. There's another issue specific to the
>>>>>>> packed virtqueue case. It causes the kernel to crash. I have been
>>>>>>> investigating this and the situation here looks very similar to what's
>>>>>>> explained in Jason Wang's mail [2]. My plan of action is to apply his
>>>>>>> changes in L2's kernel and check if that resolves the problem.
>>>>>>>
>>>>>>> The details of the crash can be found in this mail [3].
>>>>>>>
>>>>>>
>>>>>> If you're testing this series without changes, I think that is caused
>>>>>> by not implementing the packed version of vhost_svq_get_buf.
>>>>>>
>>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01902.html
>>>>>>
>>>>>
>>>>> Oh, apologies, I think I had misunderstood your response in the linked mail.
>>>>> Until now, I thought they were unrelated. In that case, I'll implement the
>>>>> packed version of vhost_svq_get_buf. Hopefully that fixes it :).
>>>>>
>>>>
>>>> I noticed one thing while testing some of the changes that I have made.
>>>> I haven't finished making the relevant changes to all the functions which
>>>> will have to handle split and packed vq differently. L2's kernel crashes
>>>> when I launch L0-QEMU with ctrl_vq=on,ctrl_rx=on.
>>>
>>> Interesting, is it a crash similar to this one (NULL pointer dereference
>>> in virtnet_set_features)?
>>>
>>> https://issues.redhat.com/browse/RHEL-391
>> I am not able to access this bug report (even with a Red Hat account). It
>> says it may have been deleted or I don't have the permission to view it.
>>
>> It's hard to tell if this is the same issue. I don't think it is the same
>> issue though since I don't see any such indication in the logs. The kernel
>> throws the following:
>>
>> [   23.047503] virtio_net virtio1: output.0:id 0 is not a head!
> 
> This is a common error when modifying code of the dataplane, it is
> unlikely to do deep changes and not see this error :). It indicates
> that your code is marking the descriptor id 0 as used when the guest
> didn't make it available.

Right, I explored this a little further. I noticed that there were
a few issues in my implementation with the way packed vqs were being
handled (apart from the lack of implementation of
vhost_svq_get_buf_packed). After making the relevant changes and
implementing vhost_svq_get_buf_packed, I couldn't see this issue
anymore.

> If this is happening in control virtqueue, I'd check if the code is
> setting the flags as used in ring[1] when it shouldn't. But my bet is
> that the rx queue is the wrong one.

The flags were one of the issues. I hadn't initialized "avail_used_flags"
correctly. Rectifying them seems to have solved this issue. However, I see
two new issues (described further below).
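
For anyone following along, this is a minimal model of how the packed
ring flags are expected to behave, based on the AVAIL (bit 7) and USED
(bit 15) descriptor flag bits from the virtio specification. It is a
sketch under those assumptions, not the code in this series; the class
and function names are hypothetical, and only "avail_used_flags" mirrors
the field mentioned above.

```python
VRING_PACKED_DESC_F_AVAIL = 7   # bit position, per the virtio spec
VRING_PACKED_DESC_F_USED = 15   # bit position, per the virtio spec
AVAIL = 1 << VRING_PACKED_DESC_F_AVAIL
USED = 1 << VRING_PACKED_DESC_F_USED


class PackedFlags:
    """Flags the side adding buffers stamps on descriptors (toy model)."""

    def __init__(self):
        # First pass over the ring: wrap counter is 1, so AVAIL=1, USED=0.
        self.avail_wrap_counter = True
        self.avail_used_flags = AVAIL

    def wrap(self):
        # Each time the ring wraps, both bits flip together.
        self.avail_wrap_counter = not self.avail_wrap_counter
        self.avail_used_flags ^= AVAIL | USED


def is_used(flags, used_wrap_counter):
    # A descriptor counts as used when AVAIL == USED and both match
    # the wrap counter of the side consuming used entries.
    avail = bool(flags & AVAIL)
    used = bool(flags & USED)
    return avail == used and used == used_wrap_counter
```

Starting with avail_used_flags at 0 instead of AVAIL would make the
first descriptors look unavailable to the other side, which matches the
kind of symptom described above.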

>> [   49.173243] watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [NetworkManager:782]
>> [   49.174167] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvg
>> [   49.188258] CPU: 1 PID: 782 Comm: NetworkManager Not tainted 6.8.7-200.fc39.x86_64 #1
>> [   49.193196] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> [   49.193196] RIP: 0010:virtqueue_get_buf+0x0/0x20
>>
> 
> Two possibilities about this part:
> a) You're spending "too long" in the debugger in QEMU. From the kernel
> POV, the function virtqueue_get_buf is taking too long to complete so
> it detects it as a lockup. You can check this scenario by not running
> QEMU under GDB or disabling all breakpoints. You can ignore this
> message if you don't find the error this way. If you still see the
> message, goto possibility b.
> 
> b) The kernel has a bug that makes it softlockup in virtqueue_get_buf.
> The kernel should not soft lockup even if your changes were malicious
> :(, so it is something to be fixed. If you have the time, can you test
> with the latest upstream kernel?

I wasn't running QEMU under GDB, so there may indeed be an issue in the
kernel. While I don't see a soft lockup at this exact point after making
the above described changes, I do see a soft lockup issue in another part
of virtio-net.

When testing my implementation with the control feature bits turned on,
the kernel throws the following warning while booting.

[    9.046478] net eth0: Failed to disable allmulti mode.

This is printed as a dev_warn() in drivers/net/virtio_net.c:virtnet_rx_mode_work [1].
The kernel doesn't continue booting beyond this point and after a few seconds,
it reports a soft lockup.

>> Maybe I was incorrect in stating that the kernel crashes. It's more like
>> the kernel is stuck in a loop (according to these blog posts on soft
>> lockup [1][2]).
>>
>> In the above trace, RIP is in virtqueue_get_buf() [3]. This is what
>> calls virtqueue_get_buf_ctx_packed() [4] which throws the error.
>>
>> What I don't understand is why vq->packed.desc_state[id].data [5] is
>> NULL when the control features are turned on, but doesn't seem to be
>> NULL when the control features are turned off.
> 
> Due to the net subsystem lock, CVQ handling is not as robust / secure
> against this error as the dataplane queues. There is an ongoing effort
> to make it more robust, so maybe this is something to fix in that
> line.
> 
> Can you put the whole backtrace that prints the kernel?

I haven't tested these changes with the latest kernel yet. I think this would be
a good time to test against the latest kernel. I'll update my kernel.

Here's the backtrace that is printed in the kernel that I currently have installed
(6.8.5-201.fc39.x86_64), in case this is relevant.

[   65.214308] watchdog: BUG: soft lockup - CPU#0 stuck for 51s! [NetworkManage]
[   65.215933] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intelg
[   65.238465] CPU: 0 PID: 784 Comm: NetworkManager Tainted: G             L   1
[   65.242530] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14
[   65.248474] RIP: 0010:virtnet_send_command+0x17c/0x1e0 [virtio_net]
[   65.251505] Code: 74 24 48 e8 f6 b1 40 c1 85 c0 78 60 48 8b 7b 08 e8 29 92 43
[   65.260475] RSP: 0018:ffffb8038073f298 EFLAGS: 00000246
[   65.260475] RAX: 0000000000000000 RBX: ffff8ea600f389c0 RCX: ffffb8038073f29c
[   65.265165] RDX: 0000000000008003 RSI: 0000000000000000 RDI: ffff8ea60cead300
[   65.269528] RBP: ffffb8038073f2c0 R08: 0000000000000001 R09: ffff8ea600f389c0
[   65.272532] R10: 0000000000000030 R11: 0000000000000002 R12: 0000000000000002
[   65.274483] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   65.278518] FS:  00007f2dd66f2540(0000) GS:ffff8ea67bc00000(0000) knlGS:00000
[   65.280653] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   65.284492] CR2: 00005593d46e5868 CR3: 000000010d310001 CR4: 0000000000770ef0
[   65.286464] PKRU: 55555554
[   65.288524] Call Trace:
[   65.291470]  <IRQ>
[   65.291470]  ? watchdog_timer_fn+0x1e6/0x270
[   65.293464]  ? __pfx_watchdog_timer_fn+0x10/0x10
[   65.296496]  ? __hrtimer_run_queues+0x10f/0x2b0
[   65.297578]  ? hrtimer_interrupt+0xf8/0x230
[   65.300472]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
[   65.301680]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[   65.305464]  </IRQ>
[   65.305464]  <TASK>
[   65.305464]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   65.308705]  ? virtnet_send_command+0x17c/0x1e0 [virtio_net]
[   65.312466]  ? virtnet_send_command+0x176/0x1e0 [virtio_net]
[   65.314465]  virtnet_set_rx_mode+0xd8/0x340 [virtio_net]
[   65.317466]  __dev_mc_add+0x79/0x80
[   65.318462]  igmp_group_added+0x1f2/0x210
[   65.320541]  ____ip_mc_inc_group+0x15b/0x250
[   65.323522]  ip_mc_up+0x4f/0xb0
[   65.324491]  inetdev_event+0x27a/0x700
[   65.325469]  ? _raw_spin_unlock_irqrestore+0xe/0x40
[   65.329462]  notifier_call_chain+0x5a/0xd0
[   65.331717]  __dev_notify_flags+0x5c/0xf0
[   65.332491]  dev_change_flags+0x54/0x70
[   65.334508]  do_setlink+0x375/0x12d0
[   65.336554]  ? __nla_validate_parse+0x61/0xd50
[   65.338510]  __rtnl_newlink+0x668/0xa30
[   65.340733]  ? security_unix_may_send+0x21/0x50
[   65.342620]  rtnl_newlink+0x47/0x70
[   65.344556]  rtnetlink_rcv_msg+0x14f/0x3c0
[   65.346509]  ? avc_has_perm_noaudit+0x6b/0xf0
[   65.348470]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[   65.350533]  netlink_rcv_skb+0x58/0x110
[   65.352482]  netlink_unicast+0x1a3/0x290
[   65.354547]  netlink_sendmsg+0x223/0x490
[   65.356480]  ____sys_sendmsg+0x396/0x3d0
[   65.357482]  ? copy_msghdr_from_user+0x7d/0xc0
[   65.360488]  ___sys_sendmsg+0x9a/0xe0
[   65.360488]  __sys_sendmsg+0x7a/0xd0
[   65.364591]  do_syscall_64+0x83/0x170
[   65.365485]  ? syscall_exit_to_user_mode+0x83/0x230
[   65.368475]  ? do_syscall_64+0x90/0x170
[   65.370477]  ? _raw_spin_unlock+0xe/0x30
[   65.372498]  ? proc_sys_call_handler+0xfc/0x2e0
[   65.374474]  ? kvm_clock_get_cycles+0x18/0x30
[   65.376475]  ? ktime_get_ts64+0x47/0xe0
[   65.378457]  ? posix_get_monotonic_timespec+0x65/0xa0
[   65.380535]  ? put_timespec64+0x3e/0x70
[   65.382458]  ? syscall_exit_to_user_mode+0x83/0x230
[   65.384542]  ? do_syscall_64+0x90/0x170
[   65.384542]  ? do_syscall_64+0x90/0x170
[   65.387505]  ? ksys_write+0xd8/0xf0
[   65.388670]  ? syscall_exit_to_user_mode+0x83/0x230
[   65.390522]  ? do_syscall_64+0x90/0x170
[   65.390522]  ? syscall_exit_to_user_mode+0x83/0x230
[   65.394472]  ? do_syscall_64+0x90/0x170
[   65.396532]  ? syscall_exit_to_user_mode+0x83/0x230
[   65.398519]  ? do_syscall_64+0x90/0x170
[   65.400486]  ? do_user_addr_fault+0x304/0x670
[   65.400486]  ? clear_bhb_loop+0x55/0xb0
[   65.404531]  ? clear_bhb_loop+0x55/0xb0
[   65.405471]  ? clear_bhb_loop+0x55/0xb0
[   65.408520]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   65.408520] RIP: 0033:0x7f2dd7810a1b
[   65.413467] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 09 3b
[   65.420593] RSP: 002b:00007ffcff6bd520 EFLAGS: 00000293 ORIG_RAX: 0000000000e
[   65.425554] RAX: ffffffffffffffda RBX: 00005593d4679a90 RCX: 00007f2dd7810a1b
[   65.428519] RDX: 0000000000000000 RSI: 00007ffcff6bd560 RDI: 000000000000000d
[   65.430509] RBP: 00007ffcff6bd540 R08: 0000000000000000 R09: 0000000000000000
[   65.434723] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000000a
[   65.438526] R13: 00005593d4679a90 R14: 0000000000000001 R15: 0000000000000000
[   65.440555]  </TASK>
[   71.028432] rcu: INFO: rcu_preempt self-detected stall on CPU
[   71.028432] rcu:     0-....: (1 GPs behind) idle=7764/1/0x4000000000000000 s9
[   71.036518] rcu:     (t=60010 jiffies g=2193 q=1947 ncpus=4)
[   71.041707] CPU: 0 PID: 784 Comm: NetworkManager Tainted: G             L   1
[   71.050455] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14
[   71.055661] RIP: 0010:virtnet_send_command+0x17c/0x1e0 [virtio_net]
[   71.059518] Code: 74 24 48 e8 f6 b1 40 c1 85 c0 78 60 48 8b 7b 08 e8 29 92 43
[   71.065526] RSP: 0018:ffffb8038073f298 EFLAGS: 00000246
[   71.067651] RAX: 0000000000000000 RBX: ffff8ea600f389c0 RCX: ffffb8038073f29c
[   71.069472] RDX: 0000000000008003 RSI: 0000000000000000 RDI: ffff8ea60cead300
[   71.071461] RBP: ffffb8038073f2c0 R08: 0000000000000001 R09: ffff8ea600f389c0
[   71.075455] R10: 0000000000000030 R11: 0000000000000002 R12: 0000000000000002
[   71.078461] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   71.079579] FS:  00007f2dd66f2540(0000) GS:ffff8ea67bc00000(0000) knlGS:00000
[   71.083577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   71.083577] CR2: 00005593d46e5868 CR3: 000000010d310001 CR4: 0000000000770ef0
[   71.087582] PKRU: 55555554
[   71.090472] Call Trace:
[   71.091452]  <IRQ>
[   71.091452]  ? rcu_dump_cpu_stacks+0xc4/0x100
[   71.095487]  ? rcu_sched_clock_irq+0x32e/0x1040
[   71.095487]  ? task_tick_fair+0x40/0x3f0
[   71.100466]  ? trigger_load_balance+0x73/0x360
[   71.100466]  ? update_process_times+0x74/0xb0
[   71.103539]  ? tick_sched_handle+0x21/0x60
[   71.107494]  ? tick_nohz_highres_handler+0x6f/0x90
[   71.107572]  ? __pfx_tick_nohz_highres_handler+0x10/0x10
[   71.111477]  ? __hrtimer_run_queues+0x10f/0x2b0
[   71.111477]  ? hrtimer_interrupt+0xf8/0x230
[   71.116489]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
[   71.119526]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[   71.119526]  </IRQ>
[   71.124489]  <TASK>
[   71.124489]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   71.127499]  ? virtnet_send_command+0x17c/0x1e0 [virtio_net]
[   71.127499]  ? virtnet_send_command+0x176/0x1e0 [virtio_net]
[   71.132613]  virtnet_set_rx_mode+0xd8/0x340 [virtio_net]
[   71.136474]  __dev_mc_add+0x79/0x80
[   71.136474]  igmp_group_added+0x1f2/0x210
[   71.139469]  ____ip_mc_inc_group+0x15b/0x250
[   71.140473]  ip_mc_up+0x4f/0xb0
[   71.143492]  inetdev_event+0x27a/0x700
[   71.144486]  ? _raw_spin_unlock_irqrestore+0xe/0x40
[   71.147600]  notifier_call_chain+0x5a/0xd0
[   71.148918]  __dev_notify_flags+0x5c/0xf0
[   71.151634]  dev_change_flags+0x54/0x70
[   71.153529]  do_setlink+0x375/0x12d0
[   71.155476]  ? __nla_validate_parse+0x61/0xd50
[   71.157541]  __rtnl_newlink+0x668/0xa30
[   71.159503]  ? security_unix_may_send+0x21/0x50
[   71.161954]  rtnl_newlink+0x47/0x70
[   71.163680]  rtnetlink_rcv_msg+0x14f/0x3c0
[   71.165468]  ? avc_has_perm_noaudit+0x6b/0xf0
[   71.167506]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[   71.170499]  netlink_rcv_skb+0x58/0x110
[   71.171461]  netlink_unicast+0x1a3/0x290
[   71.174477]  netlink_sendmsg+0x223/0x490
[   71.175472]  ____sys_sendmsg+0x396/0x3d0
[   71.175472]  ? copy_msghdr_from_user+0x7d/0xc0
[   71.179465]  ___sys_sendmsg+0x9a/0xe0
[   71.179465]  __sys_sendmsg+0x7a/0xd0
[   71.182526]  do_syscall_64+0x83/0x170
[   71.183522]  ? syscall_exit_to_user_mode+0x83/0x230
[   71.183522]  ? do_syscall_64+0x90/0x170
[   71.187502]  ? _raw_spin_unlock+0xe/0x30
[   71.187502]  ? proc_sys_call_handler+0xfc/0x2e0
[   71.191500]  ? kvm_clock_get_cycles+0x18/0x30
[   71.191500]  ? ktime_get_ts64+0x47/0xe0
[   71.195472]  ? posix_get_monotonic_timespec+0x65/0xa0
[   71.195472]  ? put_timespec64+0x3e/0x70
[   71.198593]  ? syscall_exit_to_user_mode+0x83/0x230
[   71.199571]  ? do_syscall_64+0x90/0x170
[   71.202457]  ? do_syscall_64+0x90/0x170
[   71.203463]  ? ksys_write+0xd8/0xf0
[   71.203463]  ? syscall_exit_to_user_mode+0x83/0x230
[   71.207464]  ? do_syscall_64+0x90/0x170
[   71.207464]  ? syscall_exit_to_user_mode+0x83/0x230
[   71.211460]  ? do_syscall_64+0x90/0x170
[   71.211460]  ? syscall_exit_to_user_mode+0x83/0x230
[   71.211460]  ? do_syscall_64+0x90/0x170
[   71.216481]  ? do_user_addr_fault+0x304/0x670
[   71.216481]  ? clear_bhb_loop+0x55/0xb0
[   71.220472]  ? clear_bhb_loop+0x55/0xb0
[   71.220472]  ? clear_bhb_loop+0x55/0xb0
[   71.223704]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   71.225564] RIP: 0033:0x7f2dd7810a1b
[   71.227495] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 09 3b
[   71.235475] RSP: 002b:00007ffcff6bd520 EFLAGS: 00000293 ORIG_RAX: 0000000000e
[   71.239515] RAX: ffffffffffffffda RBX: 00005593d4679a90 RCX: 00007f2dd7810a1b
[   71.241643] RDX: 0000000000000000 RSI: 00007ffcff6bd560 RDI: 000000000000000d
[   71.245469] RBP: 00007ffcff6bd540 R08: 0000000000000000 R09: 0000000000000000
[   71.247467] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000000a
[   71.251479] R13: 00005593d4679a90 R14: 0000000000000001 R15: 0000000000000000
[   71.251479]  </TASK>

When I test my changes with the control feature bits turned off, I see another
issue. The kernel boots successfully in this case, but I noticed that no new
elements in the dataplane are added to the virtqueue. This is because, in
hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_translate_addr() [2], when gpas
is not null and QEMU tries to retrieve the IOVA address from the GPA->IOVA
tree, the result of map is NULL in the following line [3]:

map = vhost_iova_tree_find_gpa(svq->iova_tree, &needle)

Due to this, vhost_svq_vring_write_descs() [4] simply returns false and nothing
is added to the virtqueue.

This issue is present even for split virtqueues, when I test my changes with
"packed=off". However, I don't see any issues when I build QEMU from the master
branch. I think the issue might lie in how memory is being allocated to the
virtqueues in my implementation, but I am not sure. I have a few ideas regarding
how this can be debugged. I'll let you know if I find anything else.
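
To make the failure mode concrete, here is a minimal standalone sketch of the
lookup. It does not use QEMU's real types: Map, find_gpa() and translate() are
hypothetical stand-ins for the mapping entry, vhost_iova_tree_find_gpa() and
vhost_svq_translate_addr(), and the tree is modeled as a flat array. It only
illustrates that a GPA range which was never inserted yields NULL, so the
translation (and with it the descriptor write) fails.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one GPA->IOVA mapping entry. */
typedef struct Map {
    uint64_t gpa;   /* first guest physical address of the range */
    uint64_t iova;  /* IOVA at which the range is mapped */
    uint64_t size;  /* length of the range in bytes */
} Map;

/* Return the mapping that fully contains [gpa, gpa + len), or NULL. */
static const Map *find_gpa(const Map *maps, size_t n,
                           uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        const Map *m = &maps[i];

        if (gpa >= m->gpa && len <= m->size &&
            gpa - m->gpa <= m->size - len) {
            return m;
        }
    }
    return NULL;
}

/* Translate a GPA to an IOVA; false means "no mapping", nothing is queued. */
static bool translate(const Map *maps, size_t n,
                      uint64_t gpa, uint64_t len, uint64_t *iova)
{
    const Map *m = find_gpa(maps, n, gpa, len);

    if (!m) {
        return false;
    }
    *iova = m->iova + (gpa - m->gpa);
    return true;
}
```

With a single 0x2000-byte mapping at GPA 0x1000 and IOVA 0x80000, translate()
returns IOVA 0x80800 for GPA 0x1800 and fails for GPA 0x9000. In this model
the second case is what an empty or incomplete GPA->IOVA tree would look like.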

>>> [...]
>>>> There's another thing that I am confused about regarding the current
>>>> implementation (in the master branch).
>>>>
>>>> In hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_vring_write_descs() [1],
>>>> svq->free_head saves the descriptor in the specified format using
>>>> "le16_to_cpu" (line 171).
>>>
>>> Good catch, this should be le16_to_cpu actually. But code wise is the
>>> same, so we have no visible error. Do you want to send a patch to fix
>>> it?
>>>
>> Sorry, I am still a little confused here. Did you mean cpu_to_le16
>> by any chance? Based on what I have understood, if it is to be used
>> by the host machine, then it should be cpu_to_le16.
>>
>> I can send a patch once this is clear, or can even integrate it in
>> this patch series since this patch series refactors that function
>> anyway.
>>
> 
> Ok, I don't know how I read the function to answer you that :(. Let me
> start from scratch,
> 
> In line 171, we're copying data from QEMU internals, that are not in
> the guest memory, to other QEMU internals. So no cpu_to_le* or
> le*to_cpu is needed.

Understood.

>>>> On the other hand, the value of i is stored
>>>> in the native endianness using "cpu_to_le16" (line 168). If "i" is to be
>>>> stored in the native endianness (little endian in this case), then
>>>> should svq->free_head first be converted to little endian before being
>>>> assigned to "i" at the start of the function (line 142)?
>>>>
>>>
> 
> No endianness conversion is needed here for the same reason, all is
> internal to QEMU and not intended to be seen by the guest.
> 

Got it. This makes sense now.

Thanks,
Sahil

[1] https://github.com/torvalds/linux/blob/master/drivers/net/virtio_net.c#L3712
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L83
[3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L104
[4] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L171

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 2 months ago

On Thu, Mar 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> Sorry for the delay in my response. There was a lot to absorb in the
> previous mail and I thought I would spend some more time exploring
> this.
>
> On 2/11/25 1:27 PM, Eugenio Perez Martin wrote:
> > On Mon, Feb 10, 2025 at 5:25 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 2/10/25 7:53 PM, Eugenio Perez Martin wrote:
> >>> On Mon, Feb 10, 2025 at 11:58 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 2/6/25 8:47 PM, Sahil Siddiq wrote:
> >>>>> On 2/6/25 12:42 PM, Eugenio Perez Martin wrote:
> >>>>>> On Thu, Feb 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>>> On 2/4/25 11:45 PM, Eugenio Perez Martin wrote:
> >>>>>>>> PS: Please note that you can check packed_vq SVQ implementation
> >>>>>>>> already without CVQ, as these features are totally orthogonal :).
> >>>>>>>>
> >>>>>>>
> >>>>>>> Right. Now that I can ping with the ctrl features turned off, I think
> >>>>>>> this should take precedence. There's another issue specific to the
> >>>>>>> packed virtqueue case. It causes the kernel to crash. I have been
> >>>>>>> investigating this and the situation here looks very similar to what's
> >>>>>>> explained in Jason Wang's mail [2]. My plan of action is to apply his
> >>>>>>> changes in L2's kernel and check if that resolves the problem.
> >>>>>>>
> >>>>>>> The details of the crash can be found in this mail [3].
> >>>>>>>
> >>>>>>
> >>>>>> If you're testing this series without changes, I think that is caused
> >>>>>> by not implementing the packed version of vhost_svq_get_buf.
> >>>>>>
> >>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2024-12/msg01902.html
> >>>>>>
> >>>>>
> >>>>> Oh, apologies, I think I had misunderstood your response in the linked mail.
> >>>>> Until now, I thought they were unrelated. In that case, I'll implement the
> >>>>> packed version of vhost_svq_get_buf. Hopefully that fixes it :).
> >>>>>
> >>>>
> >>>> I noticed one thing while testing some of the changes that I have made.
> >>>> I haven't finished making the relevant changes to all the functions which
> >>>> will have to handle split and packed vq differently. L2's kernel crashes
> >>>> when I launch L0-QEMU with ctrl_vq=on,ctrl_rx=on.
> >>>
> >>> Interesting, is a similar crash than this? (NULL ptr deference on
> >>> virtnet_set_features)?
> >>>
> >>> https://issues.redhat.com/browse/RHEL-391
> >> I am not able to access this bug report (even with a Red Hat account). It
> >> says it may have been deleted or I don't have the permission to view it.
> >>
> >> It's hard to tell if this is the same issue. I don't think it is the same
> >> issue though since I don't see any such indication in the logs. The kernel
> >> throws the following:
> >>
> >> [   23.047503] virtio_net virtio1: output.0:id 0 is not a head!
> >
> > This is a common error when modifying code of the dataplane, it is
> > unlikely to do deep changes and not see this error :). It indicates
> > that your code is marking the descriptor id 0 as used when the guest
> > didn't make it available.
>
> Right, I explored this a little further. I noticed that there were
> a few issues in my implementation with the way packed vqs were being
> handled (apart from the lack of implementation of
> vhost_svq_get_buf_packed). After making the relevant changes and
> implementing vhost_svq_get_buf_packed, I couldn't see this issue
> anymore.
>
> > If this is happening in control virtqueue, I'd check if the code is
> > setting the flags as used in ring[1] when it shouldn't. But my bet is
> > that the rx queue is the wrong one.
>
> The flags were one of the issues. I hadn't initialized "avail_used_flags"
> correctly. Rectifying them seems to have solved this issue. However, I see
> two new issues (described further below).
>
> >> [   49.173243] watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [NetworkManager:782]
> >> [   49.174167] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvg
> >> [   49.188258] CPU: 1 PID: 782 Comm: NetworkManager Not tainted 6.8.7-200.fc39.x86_64 #1
> >> [   49.193196] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> >> [   49.193196] RIP: 0010:virtqueue_get_buf+0x0/0x20
> >>
> >
> > Two possibilities about this part:
> > a) You're spending "too long" in the debugger in QEMU. From the kernel
> > POV, the function virtqueue_get_buf is taking too long to complete so
> > it detects it as a lockup. You can check this scenario by not running
> > QEMU under GDB or disabling all breakpoints. You can ignore this
> > message if you don't find the error this way. If you still see the
> > message, goto possibility b.
> >
> > b) The kernel has a bug that makes it softlockup in virtqueue_get_buf.
> > The kernel should not soft lockup even if your changes were malicious
> > :(, so it is something to be fixed. If you have the time, can you test
> > with the latest upstream kernel?
>
> I wasn't running QEMU under GDB, so there may indeed be an issue in the
> kernel. While I don't see a soft lockup at this exact point after making
> the above described changes, I do see a soft lockup issue in another part
> of virtio-net.
>
> When testing my implementation with the control feature bits turned on,
> the kernel throws the following warning while booting.
>
> [    9.046478] net eth0: Failed to disable allmulti mode.
>
> This is printed as a dev_warn() in drivers/net/virtio_net.c:virtnet_rx_mode_work [1].
> The kernel doesn't continue booting beyond this point and after a few seconds,
> it reports a soft lockup.
>
> >> Maybe I was incorrect in stating that the kernel crashes. It's more like
> >> the kernel is stuck in a loop (according to these blog posts on soft
> >> lockup [1][2]).
> >>
> >> In the above trace, RIP is in virtqueue_get_buf() [3]. This is what
> >> calls virtqueue_get_buf_ctx_packed() [4] which throws the error.
> >>
> >> What I don't understand is why vq->packed.desc_state[id].data [5] is
> >> NULL when the control features are turned on, but doesn't seem to be
> >> NULL when the control features are turned off.
> >
> > Due to the net subsystem lock, CVQ handling is not as robust / secure
> > against this error as the dataplane queues. There is an ongoing effort
> > to make it more robust, so maybe this is something to fix in that
> > line.
> >
> > Can you put the whole backtrace that prints the kernel?
>
> I haven't tested these changes with the latest kernel yet. I think this would be
> a good time to test against the latest kernel. I'll update my kernel.
>
> Here's the backtrace that is printed in the kernel that I currently have installed
> (6.8.5-201.fc39.x86_64), in case this is relevant.
>
> [   65.214308] watchdog: BUG: soft lockup - CPU#0 stuck for 51s! [NetworkManage]
> [   65.215933] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intelg
> [   65.238465] CPU: 0 PID: 784 Comm: NetworkManager Tainted: G             L   1
> [   65.242530] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14
> [   65.248474] RIP: 0010:virtnet_send_command+0x17c/0x1e0 [virtio_net]
> [...]
>

Yes, the kernel does hit a soft lockup waiting for a reply if the CVQ does
not move forward. This is a known issue that is being fixed, but it is not
easy :). To get packed vq support working, we can either disable CVQ
entirely or try to process the message the kernel is trying to send.
Both approaches come down to the same functions in SVQ, so you can
pick the one you feel more comfortable with :).
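
For context, here is a simplified model of that wait. It is not the kernel's
code: FakeCvq, cvq_get_buf() and send_command() are hypothetical names, and
the max_spins bound exists only so that the example terminates. The trace
above, with RIP inside virtnet_send_command(), suggests the driver keeps
polling for the used buffer instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical control virtqueue with a scripted device behaviour. */
typedef struct FakeCvq {
    unsigned polls_until_used; /* polls that return nothing before the ack */
    bool device_stuck;         /* device never marks the buffer as used */
    char ack;                  /* buffer handed back once it is used */
} FakeCvq;

/* Model of polling the queue: NULL until the device has used the buffer. */
static void *cvq_get_buf(FakeCvq *vq)
{
    if (vq->device_stuck) {
        return NULL;
    }
    if (vq->polls_until_used > 0) {
        vq->polls_until_used--;
        return NULL;
    }
    return &vq->ack;
}

/*
 * Submit-and-wait model: poll until the command buffer comes back.
 * Returns false if the device did not answer within max_spins polls;
 * without that bound, a stuck device would keep the CPU in this loop.
 */
static bool send_command(FakeCvq *vq, unsigned long max_spins)
{
    for (unsigned long spins = 0; spins < max_spins; spins++) {
        if (cvq_get_buf(vq)) {
            return true;
        }
    }
    return false;
}
```

In this model a device that answers after three polls lets send_command()
return true, while one with device_stuck set exhausts any bound, which is
the situation where SVQ never moves the CVQ forward.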

> When I test my changes with the control feature bits turned off, I see another
> issue. The kernel boots successfully in this case, but I noticed that no new
> elements in the dataplane are added to the virtqueue. This is because, in
> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_translate_addr() [2], when gpas
> is not null and QEMU tries to retrieve the IOVA address from the GPA->IOVA
> tree, the result of map is NULL in the following line [3]:
>
> map = vhost_iova_tree_find_gpa(svq->iova_tree, &needle)
>
> Due to this, vhost_svq_vring_write_descs() [4] simply returns false and nothing
> is added to the virtqueue.
>
> This issue is present even for split virtqueues, when I test my changes with
> "packed=off". However, I don't see any issues when I build QEMU from the master
> branch. I think the issue might lie in how memory is being allocated to the
> virtqueues in my implementation, but I am not sure. I have a few ideas regarding
> how this can be debugged. I'll let you know if I find anything else.
>

Understood! In case you run out of ideas, it seems like a good
candidate for bisection.

Thanks for the update!

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 1 month, 1 week ago

Hi,

On 3/6/25 12:53 PM, Eugenio Perez Martin wrote:
> On Thu, Mar 6, 2025 at 6:26 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> [...]
>> On 2/11/25 1:27 PM, Eugenio Perez Martin wrote:
>>> [...]
>>>> [   49.173243] watchdog: BUG: soft lockup - CPU#1 stuck for 25s! [NetworkManager:782]
>>>> [   49.174167] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvg
>>>> [   49.188258] CPU: 1 PID: 782 Comm: NetworkManager Not tainted 6.8.7-200.fc39.x86_64 #1
>>>> [   49.193196] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>>>> [   49.193196] RIP: 0010:virtqueue_get_buf+0x0/0x20
>>>>
>>>
>>> Two possibilities about this part:
>>> a) You're spending "too long" in the debugger in QEMU. From the kernel
>>> POV, the function virtqueue_get_buf is taking too long to complete so
>>> it detects it as a lockup. You can check this scenario by not running
>>> QEMU under GDB or disabling all breakpoints. You can ignore this
>>> message if you don't find the error this way. If you still see the
>>> message, goto possibility b.
>>>
>>> b) The kernel has a bug that makes it softlockup in virtqueue_get_buf.
>>> The kernel should not soft lockup even if your changes were malicious
>>> :(, so it is something to be fixed. If you have the time, can you test
>>> with the latest upstream kernel?
>>
>> I wasn't running QEMU under GDB, so there may indeed be an issue in the
>> kernel. While I don't see a soft lockup at this exact point after making
>> the above described changes, I do see a soft lockup issue in another part
>> of virtio-net.
>>
>> When testing my implementation with the control feature bits turned on,
>> the kernel throws the following warning while booting.
>>
>> [    9.046478] net eth0: Failed to disable allmulti mode.
>>
>> This is printed as a dev_warn() in drivers/net/virtio_net.c:virtnet_rx_mode_work [1].
>> The kernel doesn't continue booting beyond this point and after a few seconds,
>> it reports a soft lockup.
>>
>>>> Maybe I was incorrect in stating that the kernel crashes. It's more like
>>>> the kernel is stuck in a loop (according to these blog posts on soft
>>>> lockup [1][2]).
>>>>
>>>> In the above trace, RIP is in virtqueue_get_buf() [3]. This is what
>>>> calls virtqueue_get_buf_ctx_packed() [4] which throws the error.
>>>>
>>>> What I don't understand is why vq->packed.desc_state[id].data [5] is
>>>> NULL when the control features are turned on, but doesn't seem to be
>>>> NULL when the control features are turned off.
>>>
>>> Due to the net subsystem lock, CVQ handling is not as robust / secure
>>> against this error as the dataplane queues. There is an ongoing effort
>>> to make it more robust, so maybe this is something to fix in that
>>> line.
>>>
>>> Can you put the whole backtrace that prints the kernel?
>>
>> I haven't tested these changes with the latest kernel yet. I think this would be
>> a good time to test against the latest kernel. I'll update my kernel.
>>
>> Here's the backtrace that is printed in the kernel that I currently have installed
>> (6.8.5-201.fc39.x86_64), in case this is relevant.
>>
>> [   65.214308] watchdog: BUG: soft lockup - CPU#0 stuck for 51s! [NetworkManage]
>> [   65.215933] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intelg
>> [   65.238465] CPU: 0 PID: 784 Comm: NetworkManager Tainted: G             L   1
>> [   65.242530] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14
>> [   65.248474] RIP: 0010:virtnet_send_command+0x17c/0x1e0 [virtio_net]
>> [   65.251505] Code: 74 24 48 e8 f6 b1 40 c1 85 c0 78 60 48 8b 7b 08 e8 29 92 43
>> [   65.260475] RSP: 0018:ffffb8038073f298 EFLAGS: 00000246
>> [   65.260475] RAX: 0000000000000000 RBX: ffff8ea600f389c0 RCX: ffffb8038073f29c
>> [   65.265165] RDX: 0000000000008003 RSI: 0000000000000000 RDI: ffff8ea60cead300
>> [   65.269528] RBP: ffffb8038073f2c0 R08: 0000000000000001 R09: ffff8ea600f389c0
>> [   65.272532] R10: 0000000000000030 R11: 0000000000000002 R12: 0000000000000002
>> [   65.274483] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [   65.278518] FS:  00007f2dd66f2540(0000) GS:ffff8ea67bc00000(0000) knlGS:00000
>> [   65.280653] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   65.284492] CR2: 00005593d46e5868 CR3: 000000010d310001 CR4: 0000000000770ef0
>> [   65.286464] PKRU: 55555554
>> [   65.288524] Call Trace:
>> [   65.291470]  <IRQ>
>> [   65.291470]  ? watchdog_timer_fn+0x1e6/0x270
>> [   65.293464]  ? __pfx_watchdog_timer_fn+0x10/0x10
>> [   65.296496]  ? __hrtimer_run_queues+0x10f/0x2b0
>> [   65.297578]  ? hrtimer_interrupt+0xf8/0x230
>> [   65.300472]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
>> [   65.301680]  ? sysvec_apic_timer_interrupt+0x6d/0x90
>> [   65.305464]  </IRQ>
>> [   65.305464]  <TASK>
>> [   65.305464]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [   65.308705]  ? virtnet_send_command+0x17c/0x1e0 [virtio_net]
>> [   65.312466]  ? virtnet_send_command+0x176/0x1e0 [virtio_net]
>> [   65.314465]  virtnet_set_rx_mode+0xd8/0x340 [virtio_net]
>> [   65.317466]  __dev_mc_add+0x79/0x80
>> [   65.318462]  igmp_group_added+0x1f2/0x210
>> [   65.320541]  ____ip_mc_inc_group+0x15b/0x250
>> [   65.323522]  ip_mc_up+0x4f/0xb0
>> [   65.324491]  inetdev_event+0x27a/0x700
>> [   65.325469]  ? _raw_spin_unlock_irqrestore+0xe/0x40
>> [   65.329462]  notifier_call_chain+0x5a/0xd0
>> [   65.331717]  __dev_notify_flags+0x5c/0xf0
>> [   65.332491]  dev_change_flags+0x54/0x70
>> [   65.334508]  do_setlink+0x375/0x12d0
>> [   65.336554]  ? __nla_validate_parse+0x61/0xd50
>> [   65.338510]  __rtnl_newlink+0x668/0xa30
>> [   65.340733]  ? security_unix_may_send+0x21/0x50
>> [   65.342620]  rtnl_newlink+0x47/0x70
>> [   65.344556]  rtnetlink_rcv_msg+0x14f/0x3c0
>> [   65.346509]  ? avc_has_perm_noaudit+0x6b/0xf0
>> [   65.348470]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
>> [   65.350533]  netlink_rcv_skb+0x58/0x110
>> [   65.352482]  netlink_unicast+0x1a3/0x290
>> [   65.354547]  netlink_sendmsg+0x223/0x490
>> [   65.356480]  ____sys_sendmsg+0x396/0x3d0
>> [   65.357482]  ? copy_msghdr_from_user+0x7d/0xc0
>> [   65.360488]  ___sys_sendmsg+0x9a/0xe0
>> [   65.360488]  __sys_sendmsg+0x7a/0xd0
>> [   65.364591]  do_syscall_64+0x83/0x170
>> [   65.365485]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   65.368475]  ? do_syscall_64+0x90/0x170
>> [   65.370477]  ? _raw_spin_unlock+0xe/0x30
>> [   65.372498]  ? proc_sys_call_handler+0xfc/0x2e0
>> [   65.374474]  ? kvm_clock_get_cycles+0x18/0x30
>> [   65.376475]  ? ktime_get_ts64+0x47/0xe0
>> [   65.378457]  ? posix_get_monotonic_timespec+0x65/0xa0
>> [   65.380535]  ? put_timespec64+0x3e/0x70
>> [   65.382458]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   65.384542]  ? do_syscall_64+0x90/0x170
>> [   65.384542]  ? do_syscall_64+0x90/0x170
>> [   65.387505]  ? ksys_write+0xd8/0xf0
>> [   65.388670]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   65.390522]  ? do_syscall_64+0x90/0x170
>> [   65.390522]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   65.394472]  ? do_syscall_64+0x90/0x170
>> [   65.396532]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   65.398519]  ? do_syscall_64+0x90/0x170
>> [   65.400486]  ? do_user_addr_fault+0x304/0x670
>> [   65.400486]  ? clear_bhb_loop+0x55/0xb0
>> [   65.404531]  ? clear_bhb_loop+0x55/0xb0
>> [   65.405471]  ? clear_bhb_loop+0x55/0xb0
>> [   65.408520]  entry_SYSCALL_64_after_hwframe+0x78/0x80
>> [   65.408520] RIP: 0033:0x7f2dd7810a1b
>> [   65.413467] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 09 3b
>> [   65.420593] RSP: 002b:00007ffcff6bd520 EFLAGS: 00000293 ORIG_RAX: 0000000000e
>> [   65.425554] RAX: ffffffffffffffda RBX: 00005593d4679a90 RCX: 00007f2dd7810a1b
>> [   65.428519] RDX: 0000000000000000 RSI: 00007ffcff6bd560 RDI: 000000000000000d
>> [   65.430509] RBP: 00007ffcff6bd540 R08: 0000000000000000 R09: 0000000000000000
>> [   65.434723] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000000a
>> [   65.438526] R13: 00005593d4679a90 R14: 0000000000000001 R15: 0000000000000000
>> [   65.440555]  </TASK>
>> [   71.028432] rcu: INFO: rcu_preempt self-detected stall on CPU
>> [   71.028432] rcu:     0-....: (1 GPs behind) idle=7764/1/0x4000000000000000 s9
>> [   71.036518] rcu:     (t=60010 jiffies g=2193 q=1947 ncpus=4)
>> [   71.041707] CPU: 0 PID: 784 Comm: NetworkManager Tainted: G             L   1
>> [   71.050455] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14
>> [   71.055661] RIP: 0010:virtnet_send_command+0x17c/0x1e0 [virtio_net]
>> [   71.059518] Code: 74 24 48 e8 f6 b1 40 c1 85 c0 78 60 48 8b 7b 08 e8 29 92 43
>> [   71.065526] RSP: 0018:ffffb8038073f298 EFLAGS: 00000246
>> [   71.067651] RAX: 0000000000000000 RBX: ffff8ea600f389c0 RCX: ffffb8038073f29c
>> [   71.069472] RDX: 0000000000008003 RSI: 0000000000000000 RDI: ffff8ea60cead300
>> [   71.071461] RBP: ffffb8038073f2c0 R08: 0000000000000001 R09: ffff8ea600f389c0
>> [   71.075455] R10: 0000000000000030 R11: 0000000000000002 R12: 0000000000000002
>> [   71.078461] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [   71.079579] FS:  00007f2dd66f2540(0000) GS:ffff8ea67bc00000(0000) knlGS:00000
>> [   71.083577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   71.083577] CR2: 00005593d46e5868 CR3: 000000010d310001 CR4: 0000000000770ef0
>> [   71.087582] PKRU: 55555554
>> [   71.090472] Call Trace:
>> [   71.091452]  <IRQ>
>> [   71.091452]  ? rcu_dump_cpu_stacks+0xc4/0x100
>> [   71.095487]  ? rcu_sched_clock_irq+0x32e/0x1040
>> [   71.095487]  ? task_tick_fair+0x40/0x3f0
>> [   71.100466]  ? trigger_load_balance+0x73/0x360
>> [   71.100466]  ? update_process_times+0x74/0xb0
>> [   71.103539]  ? tick_sched_handle+0x21/0x60
>> [   71.107494]  ? tick_nohz_highres_handler+0x6f/0x90
>> [   71.107572]  ? __pfx_tick_nohz_highres_handler+0x10/0x10
>> [   71.111477]  ? __hrtimer_run_queues+0x10f/0x2b0
>> [   71.111477]  ? hrtimer_interrupt+0xf8/0x230
>> [   71.116489]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
>> [   71.119526]  ? sysvec_apic_timer_interrupt+0x6d/0x90
>> [   71.119526]  </IRQ>
>> [   71.124489]  <TASK>
>> [   71.124489]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [   71.127499]  ? virtnet_send_command+0x17c/0x1e0 [virtio_net]
>> [   71.127499]  ? virtnet_send_command+0x176/0x1e0 [virtio_net]
>> [   71.132613]  virtnet_set_rx_mode+0xd8/0x340 [virtio_net]
>> [   71.136474]  __dev_mc_add+0x79/0x80
>> [   71.136474]  igmp_group_added+0x1f2/0x210
>> [   71.139469]  ____ip_mc_inc_group+0x15b/0x250
>> [   71.140473]  ip_mc_up+0x4f/0xb0
>> [   71.143492]  inetdev_event+0x27a/0x700
>> [   71.144486]  ? _raw_spin_unlock_irqrestore+0xe/0x40
>> [   71.147600]  notifier_call_chain+0x5a/0xd0
>> [   71.148918]  __dev_notify_flags+0x5c/0xf0
>> [   71.151634]  dev_change_flags+0x54/0x70
>> [   71.153529]  do_setlink+0x375/0x12d0
>> [   71.155476]  ? __nla_validate_parse+0x61/0xd50
>> [   71.157541]  __rtnl_newlink+0x668/0xa30
>> [   71.159503]  ? security_unix_may_send+0x21/0x50
>> [   71.161954]  rtnl_newlink+0x47/0x70
>> [   71.163680]  rtnetlink_rcv_msg+0x14f/0x3c0
>> [   71.165468]  ? avc_has_perm_noaudit+0x6b/0xf0
>> [   71.167506]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
>> [   71.170499]  netlink_rcv_skb+0x58/0x110
>> [   71.171461]  netlink_unicast+0x1a3/0x290
>> [   71.174477]  netlink_sendmsg+0x223/0x490
>> [   71.175472]  ____sys_sendmsg+0x396/0x3d0
>> [   71.175472]  ? copy_msghdr_from_user+0x7d/0xc0
>> [   71.179465]  ___sys_sendmsg+0x9a/0xe0
>> [   71.179465]  __sys_sendmsg+0x7a/0xd0
>> [   71.182526]  do_syscall_64+0x83/0x170
>> [   71.183522]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   71.183522]  ? do_syscall_64+0x90/0x170
>> [   71.187502]  ? _raw_spin_unlock+0xe/0x30
>> [   71.187502]  ? proc_sys_call_handler+0xfc/0x2e0
>> [   71.191500]  ? kvm_clock_get_cycles+0x18/0x30
>> [   71.191500]  ? ktime_get_ts64+0x47/0xe0
>> [   71.195472]  ? posix_get_monotonic_timespec+0x65/0xa0
>> [   71.195472]  ? put_timespec64+0x3e/0x70
>> [   71.198593]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   71.199571]  ? do_syscall_64+0x90/0x170
>> [   71.202457]  ? do_syscall_64+0x90/0x170
>> [   71.203463]  ? ksys_write+0xd8/0xf0
>> [   71.203463]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   71.207464]  ? do_syscall_64+0x90/0x170
>> [   71.207464]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   71.211460]  ? do_syscall_64+0x90/0x170
>> [   71.211460]  ? syscall_exit_to_user_mode+0x83/0x230
>> [   71.211460]  ? do_syscall_64+0x90/0x170
>> [   71.216481]  ? do_user_addr_fault+0x304/0x670
>> [   71.216481]  ? clear_bhb_loop+0x55/0xb0
>> [   71.220472]  ? clear_bhb_loop+0x55/0xb0
>> [   71.220472]  ? clear_bhb_loop+0x55/0xb0
>> [   71.223704]  entry_SYSCALL_64_after_hwframe+0x78/0x80
>> [   71.225564] RIP: 0033:0x7f2dd7810a1b
>> [   71.227495] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 09 3b
>> [   71.235475] RSP: 002b:00007ffcff6bd520 EFLAGS: 00000293 ORIG_RAX: 0000000000e
>> [   71.239515] RAX: ffffffffffffffda RBX: 00005593d4679a90 RCX: 00007f2dd7810a1b
>> [   71.241643] RDX: 0000000000000000 RSI: 00007ffcff6bd560 RDI: 000000000000000d
>> [   71.245469] RBP: 00007ffcff6bd540 R08: 0000000000000000 R09: 0000000000000000
>> [   71.247467] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000000a
>> [   71.251479] R13: 00005593d4679a90 R14: 0000000000000001 R15: 0000000000000000
>> [   71.251479]  </TASK>
>>
> 
> Yes, the kernel does softlock waiting for a reply if the CVQ does not
> move forward. This is a known issue that is being fixed, but it is not
> easy :). To achieve the packed vq support, we can either disable CVQ
> entirely or try to process the message the kernel is trying to send.
> Both approaches come down to the same functions in SVQ, so you can
> pick the one you feel more comfortable with :).

I would like to start with disabling the CVQ altogether. The data plane
implementation is still giving a problem as described below. Once that
is resolved, it should be easier to handle the CVQ.
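
For context, the lockup in the trace above is consistent with the driver polling the CVQ for a reply that never arrives. Below is a toy model of such a wait loop. All names (sketch_cvq, sketch_cvq_poll, sketch_send_command) are hypothetical and the spin limit is an assumption added for illustration; this is not the virtio_net code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a control virtqueue. The "device" replies
 * once polls_until_reply polls have happened. A value of 0 models a
 * device that never moves the queue forward.
 */
struct sketch_cvq {
    uint64_t polls_seen;
    uint64_t polls_until_reply;
};

/* One poll of the queue: returns true once a reply is available. */
static bool sketch_cvq_poll(struct sketch_cvq *cvq)
{
    cvq->polls_seen++;
    return cvq->polls_until_reply != 0 &&
           cvq->polls_seen >= cvq->polls_until_reply;
}

/*
 * Busy-wait for a reply, giving up after max_spins polls.
 * Without such a limit, a stalled queue would keep the caller
 * spinning forever, which is what a soft lockup report looks like.
 */
static bool sketch_send_command(struct sketch_cvq *cvq, uint64_t max_spins)
{
    assert(max_spins > 0);

    for (uint64_t i = 0; i < max_spins; i++) {
        if (sketch_cvq_poll(cvq)) {
            return true;
        }
    }
    return false;
}
```

With a responsive queue the loop exits after a few polls. With a stalled queue it only exits because of the artificial limit.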

>> When I test my changes with the control feature bits turned off, I see another
>> issue. The kernel boots successfully in this case, but I noticed that no new
>> elements in the dataplane are added to the virtqueue. This is because, in
>> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_translate_addr() [2], when gpas
>> is not null and QEMU tries to retrieve the IOVA address from the GPA->IOVA
>> tree, the result of map is NULL in the following line [3]:
>>
>> map = vhost_iova_tree_find_gpa(svq->iova_tree, &needle)
>>
>> Due to this, vhost_svq_vring_write_descs() [4] simply returns false and nothing
>> is added to the virtqueue.
>>
>> This issue is present even for split virtqueues, when I test my changes with
>> "packed=off". However, I don't see any issues when I build QEMU from the master
>> branch. I think the issue might lie in how memory is being allocated to the
>> virtqueues in my implementation, but I am not sure. I have a few ideas regarding
>> how this can be debugged. I'll let you know if I find anything else.
>>
> 
> Understood! In case you run out of ideas, it seems like a good
> candidate for bisection.
> 
> Thanks for the update!
> 

I managed to make some progress. I am no longer having problems with this.
I am facing a new issue, though, and it does not arise every time.

When testing the current state of my changes, I am able to ping L0 from L2
and vice versa via packed vqs. Unfortunately, this does not work every time.

I thought I would send a new patch series for review in case I have missed
something.

I have explained the issue in greater detail in the cover letter.

Thanks,
Sahil

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Eugenio Perez Martin 3 months ago

On Tue, Feb 4, 2025 at 7:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Tue, Feb 4, 2025 at 1:49 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >
> > Hi,
> >
> > On 1/31/25 12:27 PM, Eugenio Perez Martin wrote:
> > > On Fri, Jan 31, 2025 at 6:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> > >> On 1/24/25 1:04 PM, Eugenio Perez Martin wrote:
> > >>> On Fri, Jan 24, 2025 at 6:47 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> > >>>> On 1/21/25 10:07 PM, Eugenio Perez Martin wrote:
> > >>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> > >>>>>> On 1/7/25 1:35 PM, Eugenio Perez Martin wrote:
> > >>>>>> [...]
> > >>>>>> Apologies for the delay in replying. It took me a while to figure
> > >>>>>> this out, but I have now understood why this doesn't work. L1 is
> > >>>>>> unable to receive messages from L0 because they get filtered out
> > >>>>>> by hw/net/virtio-net.c:receive_filter [1]. There's an issue with
> > >>>>>> the MAC addresses.
> > >>>>>>
> > >>>>>> In L0, I have:
> > >>>>>>
> > >>>>>> $ ip a show tap0
> > >>>>>> 6: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
> > >>>>>>         link/ether d2:6d:b9:61:e1:9a brd ff:ff:ff:ff:ff:ff
> > >>>>>>         inet 111.1.1.1/24 scope global tap0
> > >>>>>>            valid_lft forever preferred_lft forever
> > >>>>>>         inet6 fe80::d06d:b9ff:fe61:e19a/64 scope link proto kernel_ll
> > >>>>>>            valid_lft forever preferred_lft forever
> > >>>>>>
> > >>>>>> In L1:
> > >>>>>>
> > >>>>>> # ip a show eth0
> > >>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
> > >>>>>>         link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
> > >>>>>>         altname enp0s2
> > >>>>>>         inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
> > >>>>>>            valid_lft 83455sec preferred_lft 83455sec
> > >>>>>>         inet6 fec0::7bd2:265e:3b8e:5acc/64 scope site dynamic noprefixroute
> > >>>>>>            valid_lft 86064sec preferred_lft 14064sec
> > >>>>>>         inet6 fe80::50e7:5bf6:fff8:a7b0/64 scope link noprefixroute
> > >>>>>>            valid_lft forever preferred_lft forever
> > >>>>>>
> > >>>>>> I'll call this L1-eth0.
> > >>>>>>
> > >>>>>> In L2:
> > >>>>>> # ip a show eth0
> > >>>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP gro0
> > >>>>>>         link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
> > >>>>>>         altname enp0s7
> > >>>>>>         inet 111.1.1.2/24 scope global eth0
> > >>>>>>            valid_lft forever preferred_lft forever
> > >>>>>>
> > >>>>>> I'll call this L2-eth0.
> > >>>>>>
> > >>>>>> Apart from eth0, lo is the only other device in both L1 and L2.
> > >>>>>>
> > >>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
> > >>>>>> as its destination address. When booting L2 with x-svq=false, the
> > >>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
> > >>>>>> the frames and passes them on to L2 and pinging works [2].
> > >>>>>>
> > >>>>>
> > >>>>> So this behavior is interesting by itself. But L1's kernel net system
> > >>>>> should not receive anything. As I read it, even if it receives it, it
> > >>>>> should not forward the frame to L2 as it is in a different subnet. Are
> > >>>>> you able to read it using tcpdump on L1?
> > >>>>
> > >>>> I ran "tcpdump -i eth0" in L1. It didn't capture any of the packets
> > >>>> that were directed at L2 even though L2 was able to receive them.
> > >>>> Similarly, it didn't capture any packets that were sent from L2 to
> > >>>> L0. This is when L2 is launched with x-svq=false.
> > >>>>
> > >>>
> > >>> That's right. The virtio dataplane goes directly from L0 to L2, you
> > >>> should not be able to see any packets in the net of L1.
> > >>
> > >> I am a little confused here. Since vhost=off is set in L0's QEMU
> > >> (which is used to boot L1), I am able to inspect the packets when
> > >> tracing/debugging receive_filter in hw/net/virtio-net.c. [1] Does
> > >> this mean the dataplane from L0 to L2 passes through L0's QEMU
> > >> (so L0 QEMU is aware of what's going on), but bypasses L1 completely
> > >> so L1's kernel does not know what packets are being sent/received?
> > >>
> > >
> > > That's right. We're saving processing power and context switches that way :).
> >
> > Got it. I have understood this part. In a previous mail (also present above):
> >
> > >>>>> On Sun, Jan 19, 2025 at 7:37 AM Sahil Siddiq wrote:
> > >>>>>> A frame that L1 receives from L0 has L2-eth0's MAC address (LSB = 57)
> > >>>>>> as its destination address. When booting L2 with x-svq=false, the
> > >>>>>> value of n->mac in VirtIONet is also L2-eth0. So, L1 accepts
> > >>>>>> the frames and passes them on to L2 and pinging works [2].
> > >>>>>>
> >
> > I was a little unclear in my explanation. I meant to say the frame received by
> > L0-QEMU (which is running L1).
> >
> > >>>> With x-svq=true, forcibly setting the LSB of n->mac to 0x57 in
> > >>>> receive_filter allows L2 to receive packets from L0. I added
> > >>>> the following line just before line 1771 [1] to check this out.
> > >>>>
> > >>>> n->mac[5] = 0x57;
> > >>>>
> > >>>
> > >>> That's very interesting. Let me answer all the gdb questions below and
> > >>> we can debug it deeper :).
> > >>>
> > >>
> > >> Thank you for the primer on using gdb with QEMU. I am able to debug
> > >> QEMU now.
> > >>
> > >>>>> Maybe we can make the scenario clearer by telling which virtio-net
> > >>>>> device is which with virtio-net-pci,mac=XX:... ?
> > >>>>>
> > >>>>>> However, when booting L2 with x-svq=true, n->mac is set to L1-eth0
> > >>>>>> (LSB = 56) in virtio_net_handle_mac() [3].
> > >>>>>
> > >>>>> Can you tell with gdb bt if this function is called from net or the
> > >>>>> SVQ subsystem?
> > >>>>
> > >>
> > >> It looks like the function is being called from net.
> > >>
> > >> (gdb) bt
> > >> #0  virtio_net_handle_mac (n=0x15622425e, cmd=85 'U', iov=0x555558865980, iov_cnt=1476792840) at ../hw/net/virtio-net.c:1098
> > >> #1  0x0000555555e5920b in virtio_net_handle_ctrl_iov (vdev=0x555558fdacd0, in_sg=0x5555580611f8, in_num=1, out_sg=0x555558061208,
> > >>        out_num=1) at ../hw/net/virtio-net.c:1581
> > >> #2  0x0000555555e593a0 in virtio_net_handle_ctrl (vdev=0x555558fdacd0, vq=0x555558fe7730) at ../hw/net/virtio-net.c:1610
> > >> #3  0x0000555555e9a7d8 in virtio_queue_notify_vq (vq=0x555558fe7730) at ../hw/virtio/virtio.c:2484
> > >> #4  0x0000555555e9dffb in virtio_queue_host_notifier_read (n=0x555558fe77a4) at ../hw/virtio/virtio.c:3869
> > >> #5  0x000055555620329f in aio_dispatch_handler (ctx=0x555557d9f840, node=0x7fffdca7ba80) at ../util/aio-posix.c:373
> > >> #6  0x000055555620346f in aio_dispatch_handlers (ctx=0x555557d9f840) at ../util/aio-posix.c:415
> > >> #7  0x00005555562034cb in aio_dispatch (ctx=0x555557d9f840) at ../util/aio-posix.c:425
> > >> #8  0x00005555562242b5 in aio_ctx_dispatch (source=0x555557d9f840, callback=0x0, user_data=0x0) at ../util/async.c:361
> > >> #9  0x00007ffff6d86559 in ?? () from /usr/lib/libglib-2.0.so.0
> > >> #10 0x00007ffff6d86858 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
> > >> #11 0x0000555556225bf9 in glib_pollfds_poll () at ../util/main-loop.c:287
> > >> #12 0x0000555556225c87 in os_host_main_loop_wait (timeout=294672) at ../util/main-loop.c:310
> > >> #13 0x0000555556225db6 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
> > >> #14 0x0000555555c0c1a3 in qemu_main_loop () at ../system/runstate.c:835
> > >> #15 0x000055555612bd8d in qemu_default_main (opaque=0x0) at ../system/main.c:48
> > >> #16 0x000055555612be3d in main (argc=23, argv=0x7fffffffe508) at ../system/main.c:76
> > >>
> > >> virtio_queue_notify_vq at hw/virtio/virtio.c:2484 [2] calls
> > >> vq->handle_output(vdev, vq). I see "handle_output" is a function
> > >> pointer and in this case it seems to be pointing to
> > >> virtio_net_handle_ctrl.
> > >>
> > >>>>>> [...]
> > >>>>>> With x-svq=true, I see that n->mac is set by virtio_net_handle_mac()
> > >>>>>> [3] when L1 receives VIRTIO_NET_CTRL_MAC_ADDR_SET. With x-svq=false,
> > >>>>>> virtio_net_handle_mac() doesn't seem to be getting called. I haven't
> > >>>>>> understood how the MAC address is set in VirtIONet when x-svq=false.
> > >>>>>> Understanding this might help see why n->mac has different values
> > >>>>>> when x-svq is false vs when it is true.
> > >>>>>
> > >>>>> Ok this makes sense, as x-svq=true is the one that receives the set
> > >>>>> mac message. You should see it in L0's QEMU though, both in x-svq=on
> > >>>>> and x-svq=off scenarios. Can you check it?
> > >>>>
> > >>>> L0's QEMU seems to be receiving the "set mac" message only when L1
> > >>>> is launched with x-svq=true. With x-svq=off, I don't see any call
> > >>>> to virtio_net_handle_mac with cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET
> > >>>> in L0.
> > >>>>
> > >>>
> > >>> Ok this is interesting. Let's disable control virtqueue to start with
> > >>> something simpler:
> > >>> -device virtio-net-pci,netdev=net0,ctrl_vq=off,...
> > >>>
> > >>> QEMU will start complaining about features that depend on ctrl_vq,
> > >>> like ctrl_rx. Let's disable all of them and check this new scenario.
> > >>>
> > >>
> > >> I am still investigating this part. I set ctrl_vq=off and ctrl_rx=off.
> > >> I didn't get any errors as such about features that depend on ctrl_vq.
> > >> However, I did notice that after booting L2 (x-svq=true as well as
> > >> x-svq=false), no eth0 device was created. There was only a "lo" interface
> > >> in L2. An eth0 interface is present only when L1 (L0 QEMU) is booted
> > >> with ctrl_vq=on and ctrl_rx=on.
> > >>
> > >
> > > Any error messages on the nested guest's dmesg?
> >
> > Oh, yes, there were error messages in the output of dmesg related to
> > ctrl_vq. After adding the following args, there were no error messages
> > in dmesg.
> >
> > -device virtio-net-pci,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off
> >
> > I see that the eth0 interface is also created. I am able to ping L0
> > from L2 and vice versa as well (even with x-svq=true). This is because
> > n->promisc is set when these features are disabled and receive_filter() [1]
> > always returns 1.
> >
> > > Is it fixed when you set the same mac address on L0
> > > virtio-net-pci and L1's?
> > >
> >
> > I didn't have to set the same mac address in this case since promiscuous
> > mode seems to be getting enabled which allows pinging to work.
> >
> > There is another concept that I am a little confused about. In the case
> > where L2 is booted with x-svq=false (and all ctrl features such as ctrl_vq,
> > ctrl_rx, etc. are on), I am able to ping L0 from L2. When tracing
> > receive_filter() in L0-QEMU, I see the values of n->mac and the destination
> > mac address in the ICMP packet match [2].
> >
>
> SVQ makes an effort to set the MAC address at the beginning of
> operation. L0 interprets it as "filter out all MACs except this
> one". But SVQ cannot set the MAC if ctrl_mac_addr=off, so the NIC
> receives all packets and the guest kernel needs to filter them out
> by itself.
>
> > I haven't understood what n->mac refers to over here. MAC addresses are
> > globally unique and so the mac address of the device in L1 should be
> > different from that in L2.
>
> With vDPA, they should be the same device even if they are declared in
> different cmdlines or layers of virtualization. If it were a physical
> NIC, QEMU should declare the MAC of the physical NIC too.
>
> There is a thread on the QEMU mailing list discussing how QEMU should
> influence the control plane, and maybe it would be easier if QEMU
> just checked the device's MAC and ignored the cmdline. But then, that
> behavior would be surprising for the rest of vhosts like vhost-kernel.
> Another option is to just emit a warning if the MAC is different from
> the one that the device reports.
>
>
> > But I see L0-QEMU's n->mac is set to the mac
> > address of the device in L2 (allowing receive_filter to accept the packet).
> >
>
> That's interesting, can you check further with gdb what receive_filter
> and virtio_net_receive_rcu do? As long as virtio_net_receive_rcu
> flushes the packet on the receive queue, SVQ should receive it.

PS: Please note that you can already check the packed vq SVQ
implementation without CVQ, as these features are totally orthogonal :).
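
To make the filtering discussed above concrete, here is a minimal sketch of the two cases that matter in this thread: promiscuous mode accepts every frame, otherwise the destination MAC has to match the device's MAC. The names sketch_nic and sketch_receive_filter are made up for illustration. This is not the code in hw/net/virtio-net.c, and the real receive_filter() handles more cases than the two modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6

/* Hypothetical, reduced view of the device state used for filtering. */
struct sketch_nic {
    bool promisc;                    /* accept everything when set */
    uint8_t mac[SKETCH_ETH_ALEN];    /* MAC the device filters on */
};

/*
 * Returns true if the frame would be accepted. Only the promiscuous
 * flag and an exact unicast match are modelled in this sketch.
 */
static bool sketch_receive_filter(const struct sketch_nic *n,
                                  const uint8_t *frame, size_t len)
{
    assert(n != NULL && frame != NULL);

    if (n->promisc) {
        return true;
    }
    if (len < SKETCH_ETH_ALEN) {
        return false;
    }
    /* The destination MAC is the first six bytes of an Ethernet frame. */
    return memcmp(frame, n->mac, SKETCH_ETH_ALEN) == 0;
}
```

In this model, a frame addressed to 52:54:00:12:34:57 is dropped while the device MAC ends in 56, and accepted once the MAC matches or promisc is set, which mirrors the behaviour reported earlier in the thread.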

Re: [RFC v4 0/5] Add packed virtqueue to shadow virtqueue

Posted by Sahil Siddiq 4 months, 3 weeks ago

Hi,

Thank you for your reply.

On 12/10/24 2:57 PM, Eugenio Perez Martin wrote:
> On Thu, Dec 5, 2024 at 9:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> There are two issues that I found while trying to test
>> my changes. I thought I would send the patch series
>> as well in case that helps in troubleshooting. I haven't
>> been able to find an issue in the implementation yet.
>> Maybe I am missing something.
>>
>> I have been following the "Hands on vDPA: what do you do
>> when you ain't got the hardware v2 (Part 2)" [1] blog to
>> test my changes. To boot the L1 VM, I ran:
>>
>> [...]
>>
>> But if I boot L2 with x-svq=true as shown below, I am unable
>> to ping the host machine.
>>
>> $ ./qemu/build/qemu-system-x86_64 \
>> -nographic \
>> -m 4G \
>> -enable-kvm \
>> -M q35 \
>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,bus=pcie.0,addr=0x7 \
>> -smp 4 \
>> -cpu host \
>> 2>&1 | tee vm.log
>>
>> In L2:
>>
>> # ip addr add 111.1.1.2/24 dev eth0
>> # ip addr show eth0
>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>      link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
>>      altname enp0s7
>>      inet 111.1.1.2/24 scope global eth0
>>         valid_lft forever preferred_lft forever
>>      inet6 fe80::9877:de30:5f17:35f9/64 scope link noprefixroute
>>         valid_lft forever preferred_lft forever
>>
>> # ip route
>> 111.1.1.0/24 dev eth0 proto kernel scope link src 111.1.1.2
>>
>> # ping 111.1.1.1 -w10
>> PING 111.1.1.1 (111.1.1.1) 56(84) bytes of data.
>>  From 111.1.1.2 icmp_seq=1 Destination Host Unreachable
>> ping: sendmsg: No route to host
>>  From 111.1.1.2 icmp_seq=2 Destination Host Unreachable
>>  From 111.1.1.2 icmp_seq=3 Destination Host Unreachable
>>
>> --- 111.1.1.1 ping statistics ---
>> 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2076ms
>> pipe 3
>>
>> The other issue is related to booting L2 with "x-svq=true"
>> and "packed=on".
>>
>> In L1:
>>
>> $ ./qemu/build/qemu-system-x86_64 \
>> -nographic \
>> -m 4G \
>> -enable-kvm \
>> -M q35 \
>> -drive file=//root/L2.qcow2,media=disk,if=virtio \
>> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-svq=true \
>> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,ctrl_vq=on,ctrl_rx=on,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
>> -smp 4 \
>> -cpu host \
>> 2>&1 | tee vm.log
>>
>> The kernel throws "virtio_net virtio1: output.0:id 0 is not
>> a head!" [4].
>>
> 
> So this series implements the descriptor forwarding from the guest to
> the device in packed vq. We also need to forward the descriptors from
> the device to the guest. The device writes them in the SVQ ring.
> 
> The function responsible for that in QEMU is
> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush, which runs when
> the device writes used descriptors to the SVQ and which in turn calls
> hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf. We need to make
> modifications similar to those in vhost_svq_add: make them conditional
> on whether we're in split or packed vq, and "copy" the code from Linux's
> drivers/virtio/virtio_ring.c:virtqueue_get_buf.
> 
> After these modifications you should be able to ping and forward
> traffic. As always, it is totally ok if it needs more than one
> iteration, and feel free to ask any questions you have :).
> 

Understood, I'll make these changes and will test it again.
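
As a starting point, the used-descriptor check for a packed ring could look roughly like the sketch below. It assumes the packed layout from the virtio specification, where a descriptor counts as used when its AVAIL and USED flag bits are equal and match the driver's used wrap counter. The struct and function names are hypothetical, endianness conversion and memory barriers are left out, and it advances by a single slot instead of a whole descriptor chain, so it is not what vhost_svq_get_buf would finally contain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bit positions used by the packed virtqueue layout. */
#define SKETCH_DESC_F_AVAIL_BIT 7
#define SKETCH_DESC_F_USED_BIT  15

/* Hypothetical, simplified packed descriptor (host-endian fields). */
struct sketch_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Hypothetical driver-side state for consuming used descriptors. */
struct sketch_used_state {
    uint16_t last_used_idx;
    bool used_wrap_counter;    /* starts as true */
};

/* A descriptor is used when AVAIL == USED == used wrap counter. */
static bool sketch_desc_is_used(const struct sketch_packed_desc *desc,
                                bool used_wrap_counter)
{
    bool avail = desc->flags & (1u << SKETCH_DESC_F_AVAIL_BIT);
    bool used = desc->flags & (1u << SKETCH_DESC_F_USED_BIT);

    return avail == used && used == used_wrap_counter;
}

/*
 * Fetch the next used entry, if any. On success, stores the buffer id
 * and written length and moves last_used_idx forward by one slot,
 * flipping the wrap counter when the index wraps around.
 */
static bool sketch_get_used(const struct sketch_packed_desc *ring,
                            uint16_t ring_size,
                            struct sketch_used_state *st,
                            uint16_t *id, uint32_t *len)
{
    assert(ring_size > 0 && st->last_used_idx < ring_size);

    const struct sketch_packed_desc *desc = &ring[st->last_used_idx];
    if (!sketch_desc_is_used(desc, st->used_wrap_counter)) {
        return false;
    }

    /* Real code needs a read barrier between the flags and id/len reads. */
    *id = desc->id;
    *len = desc->len;

    st->last_used_idx++;
    if (st->last_used_idx >= ring_size) {
        st->last_used_idx = 0;
        st->used_wrap_counter = !st->used_wrap_counter;
    }
    return true;
}
```

The wrap counter is what lets a stale entry from the previous lap be told apart from a freshly used one without clearing the flags.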

Thanks,
Sahil