[RFC v5 0/7] Add packed format to shadow virtqueue

Sahil Siddiq posted 7 patches 10 months, 3 weeks ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20250324135929.74945-1-sahilcdq@proton.me
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Stefano Garzarella <sgarzare@redhat.com>, "Eugenio Pérez" <eperezma@redhat.com>
hw/virtio/vhost-shadow-virtqueue.c | 396 ++++++++++++++++++++++-------
hw/virtio/vhost-shadow-virtqueue.h |  88 ++++---
hw/virtio/vhost-vdpa.c             |  52 +++-
3 files changed, 404 insertions(+), 132 deletions(-)
[RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Sahil Siddiq 10 months, 3 weeks ago
Hi,

I managed to fix a few issues while testing this patch series.
There is still one issue that I am unable to resolve. I thought
I would send this patch series for review in case I have missed
something.

The issue is that this patch series does not work every time. I
am able to ping L0 from L2 and vice versa via packed SVQ when it
works.

When this doesn't work, both VMs throw a "Destination Host
Unreachable" error. This is sometimes (not always) accompanied
by the following kernel error (thrown by L2-kernel):

virtio_net virtio1: output.0:id 1 is not a head!

This error is not thrown always, but when it is thrown, the id
varies. This is invariably followed by a soft lockup:

[  284.662292] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [swapper/1:0]
[  284.662292] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class vfg
[  284.662292] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.7-200.fc39.x86_64 #1
[  284.662292] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[  284.662292] RIP: 0010:virtqueue_enable_cb_delayed+0x115/0x150
[  284.662292] Code: 44 77 04 0f ae f0 48 8b 42 70 0f b7 40 02 66 2b 42 50 66 39 c1 0f 93 c0 c3 cc cc cc cc 66 87 44 77 04 eb e2 f0 83 44 24 fc 00 <e9> 5a f1
[  284.662292] RSP: 0018:ffffb8f000100cb0 EFLAGS: 00000246
[  284.662292] RAX: 0000000000000000 RBX: ffff96f20204d800 RCX: ffff96f206f5e000
[  284.662292] RDX: ffff96f2054fd900 RSI: ffffb8f000100c7c RDI: ffff96f2054fd900
[  284.662292] RBP: ffff96f2078bb000 R08: 0000000000000001 R09: 0000000000000001
[  284.662292] R10: ffff96f2078bb000 R11: 0000000000000005 R12: ffff96f207bb4a00
[  284.662292] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96f20452fd00
[  284.662292] FS:  0000000000000000(0000) GS:ffff96f27bc80000(0000) knlGS:0000000000000000
[  284.662292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  284.662292] CR2: 00007f2a9ca191e8 CR3: 0000000136422003 CR4: 0000000000770ef0
[  284.662292] PKRU: 55555554
[  284.662292] Call Trace:
[  284.662292]  <IRQ>
[  284.662292]  ? watchdog_timer_fn+0x1e6/0x270
[  284.662292]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  284.662292]  ? __hrtimer_run_queues+0x10f/0x2b0
[  284.662292]  ? hrtimer_interrupt+0xf8/0x230
[  284.662292]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
[  284.662292]  ? sysvec_apic_timer_interrupt+0x39/0x90
[  284.662292]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  284.662292]  ? virtqueue_enable_cb_delayed+0x115/0x150
[  284.662292]  start_xmit+0x2a6/0x4f0 [virtio_net]
[  284.662292]  ? netif_skb_features+0x98/0x300
[  284.662292]  dev_hard_start_xmit+0x61/0x1d0
[  284.662292]  sch_direct_xmit+0xa4/0x390
[  284.662292]  __dev_queue_xmit+0x84f/0xdc0
[  284.662292]  ? nf_hook_slow+0x42/0xf0
[  284.662292]  ip_finish_output2+0x2b8/0x580
[  284.662292]  igmp_ifc_timer_expire+0x1d5/0x430
[  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
[  284.662292]  call_timer_fn+0x21/0x130
[  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
[  284.662292]  __run_timers+0x21f/0x2b0
[  284.662292]  run_timer_softirq+0x1d/0x40
[  284.662292]  __do_softirq+0xc9/0x2c8
[  284.662292]  __irq_exit_rcu+0xa6/0xc0
[  284.662292]  sysvec_apic_timer_interrupt+0x72/0x90
[  284.662292]  </IRQ>
[  284.662292]  <TASK>
[  284.662292]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
[  284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 <c3> cc c0
[  284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
[  284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[  284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 00000000000289ec
[  284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 0000000000000001
[  284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 0000000000000000
[  284.662292]  default_idle+0x9/0x20
[  284.662292]  default_idle_call+0x2c/0xe0
[  284.662292]  do_idle+0x226/0x270
[  284.662292]  cpu_startup_entry+0x2a/0x30
[  284.662292]  start_secondary+0x11e/0x140
[  284.662292]  secondary_startup_64_no_verify+0x184/0x18b
[  284.662292]  </TASK>

The soft lockup seems to happen in
drivers/net/virtio_net.c:start_xmit() [1].

I don't think the issue is in the kernel because I haven't seen
any issue when testing my changes with split vqs. Only packed vqs
give an issue.

L0 kernel version: 6.12.13-1-lts

QEMU command to boot L1:

$ sudo ./qemu/build/qemu-system-x86_64 \
-enable-kvm \
-drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
-net nic,model=virtio \
-net user,hostfwd=tcp::2222-:22 \
-device intel-iommu,snoop-control=on \
-device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
-netdev tap,id=net0,script=no,downscript=no,vhost=off \
-nographic \
-m 8G \
-smp 4 \
-M q35 \
-cpu host 2>&1 | tee vm.log

L1 kernel version: 6.8.5-201.fc39.x86_64

I have been following the "Hands on vDPA - Part 2" blog
to set up the environment in L1 [2].

QEMU command to boot L2:

# ./qemu/build/qemu-system-x86_64 \
-nographic \
-m 4G \
-enable-kvm \
-M q35 \
-drive file=//root/L2.qcow2,media=disk,if=virtio \
-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
-device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
-smp 4 \
-cpu host \
2>&1 | tee vm.log

L2 kernel version: 6.8.7-200.fc39.x86_64

I confirmed that packed vqs are enabled in L2 by running the
following:

# cut -c35 /sys/devices/pci0000\:00/0000\:00\:07.0/virtio1/features 
1

I may be wrong, but I think the issue in my implementation might be
related to:

1. incorrect endianness conversions.
2. implementation of "vhost_svq_more_used_packed" in commit #5.
3. implementation of "vhost_svq_(en|dis)able_notification" in commit #5.
4. something else?

Thanks,
Sahil

[1] https://github.com/torvalds/linux/blob/master/drivers/net/virtio_net.c#L3245
[2] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2

Sahil Siddiq (7):
  vhost: Refactor vhost_svq_add_split
  vhost: Data structure changes to support packed vqs
  vhost: Forward descriptors to device via packed SVQ
  vdpa: Allocate memory for SVQ and map them to vdpa
  vhost: Forward descriptors to guest via packed vqs
  vhost: Validate transport device features for packed vqs
  vdpa: Support setting vring_base for packed SVQ

 hw/virtio/vhost-shadow-virtqueue.c | 396 ++++++++++++++++++++++-------
 hw/virtio/vhost-shadow-virtqueue.h |  88 ++++---
 hw/virtio/vhost-vdpa.c             |  52 +++-
 3 files changed, 404 insertions(+), 132 deletions(-)

-- 
2.48.1
Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Eugenio Perez Martin 10 months, 2 weeks ago
On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> I managed to fix a few issues while testing this patch series.
> There is still one issue that I am unable to resolve. I thought
> I would send this patch series for review in case I have missed
> something.
>
> The issue is that this patch series does not work every time. I
> am able to ping L0 from L2 and vice versa via packed SVQ when it
> works.
>

So we're on a very good track then!

> When this doesn't work, both VMs throw a "Destination Host
> Unreachable" error. This is sometimes (not always) accompanied
> by the following kernel error (thrown by L2-kernel):
>
> virtio_net virtio1: output.0:id 1 is not a head!
>

How many packets have been sent or received before hitting this? If
the answer to that is "the vq size", maybe there is a bug in the code
that handles the wraparound of the packed vq, as the used and avail
flags need to be twisted. You can count them in the SVQ code.
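
For reference, the producer side has to invert the AVAIL/USED bit
polarity every time the avail index wraps; if that inversion (or the
write ordering around it) is off by one, the device keeps seeing stale
descriptors from the previous lap. A minimal sketch of the pattern
(illustrative type and field names, not the code in this series;
endianness conversions and the barrier are only noted in comments):

#include <stdint.h>

/* Bit positions of the packed-ring descriptor flags (virtio 1.1). */
#define VRING_PACKED_DESC_F_AVAIL 7
#define VRING_PACKED_DESC_F_USED  15

struct packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

struct packed_ring {
    struct packed_desc *desc;
    uint16_t num;              /* ring size                      */
    uint16_t next_avail_idx;   /* slot for the next descriptor   */
    uint16_t avail_used_flags; /* current AVAIL/USED polarity    */
};

static void ring_add(struct packed_ring *r, uint64_t addr, uint32_t len,
                     uint16_t id, uint16_t extra_flags)
{
    struct packed_desc *d = &r->desc[r->next_avail_idx];

    d->addr = addr;            /* cpu_to_le64() etc. in real code */
    d->len = len;
    d->id = id;
    /* A write barrier belongs here: the flags write must be the last
     * store the device can observe for this descriptor. */
    d->flags = r->avail_used_flags | extra_flags;

    if (++r->next_avail_idx >= r->num) {
        r->next_avail_idx = 0;
        /* Wraparound: flip the polarity used to mark descriptors
         * available on the next lap of the ring. */
        r->avail_used_flags ^= (1 << VRING_PACKED_DESC_F_AVAIL) |
                               (1 << VRING_PACKED_DESC_F_USED);
    }
}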

> This error is not thrown always, but when it is thrown, the id
> varies. This is invariably followed by a soft lockup:
>
> [  284.662292] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [swapper/1:0]
> [  284.662292] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class vfg
> [  284.662292] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.7-200.fc39.x86_64 #1
> [  284.662292] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [  284.662292] RIP: 0010:virtqueue_enable_cb_delayed+0x115/0x150
> [  284.662292] Code: 44 77 04 0f ae f0 48 8b 42 70 0f b7 40 02 66 2b 42 50 66 39 c1 0f 93 c0 c3 cc cc cc cc 66 87 44 77 04 eb e2 f0 83 44 24 fc 00 <e9> 5a f1
> [  284.662292] RSP: 0018:ffffb8f000100cb0 EFLAGS: 00000246
> [  284.662292] RAX: 0000000000000000 RBX: ffff96f20204d800 RCX: ffff96f206f5e000
> [  284.662292] RDX: ffff96f2054fd900 RSI: ffffb8f000100c7c RDI: ffff96f2054fd900
> [  284.662292] RBP: ffff96f2078bb000 R08: 0000000000000001 R09: 0000000000000001
> [  284.662292] R10: ffff96f2078bb000 R11: 0000000000000005 R12: ffff96f207bb4a00
> [  284.662292] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96f20452fd00
> [  284.662292] FS:  0000000000000000(0000) GS:ffff96f27bc80000(0000) knlGS:0000000000000000
> [  284.662292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  284.662292] CR2: 00007f2a9ca191e8 CR3: 0000000136422003 CR4: 0000000000770ef0
> [  284.662292] PKRU: 55555554
> [  284.662292] Call Trace:
> [  284.662292]  <IRQ>
> [  284.662292]  ? watchdog_timer_fn+0x1e6/0x270
> [  284.662292]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [  284.662292]  ? __hrtimer_run_queues+0x10f/0x2b0
> [  284.662292]  ? hrtimer_interrupt+0xf8/0x230
> [  284.662292]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
> [  284.662292]  ? sysvec_apic_timer_interrupt+0x39/0x90
> [  284.662292]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  284.662292]  ? virtqueue_enable_cb_delayed+0x115/0x150
> [  284.662292]  start_xmit+0x2a6/0x4f0 [virtio_net]
> [  284.662292]  ? netif_skb_features+0x98/0x300
> [  284.662292]  dev_hard_start_xmit+0x61/0x1d0
> [  284.662292]  sch_direct_xmit+0xa4/0x390
> [  284.662292]  __dev_queue_xmit+0x84f/0xdc0
> [  284.662292]  ? nf_hook_slow+0x42/0xf0
> [  284.662292]  ip_finish_output2+0x2b8/0x580
> [  284.662292]  igmp_ifc_timer_expire+0x1d5/0x430
> [  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> [  284.662292]  call_timer_fn+0x21/0x130
> [  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> [  284.662292]  __run_timers+0x21f/0x2b0
> [  284.662292]  run_timer_softirq+0x1d/0x40
> [  284.662292]  __do_softirq+0xc9/0x2c8
> [  284.662292]  __irq_exit_rcu+0xa6/0xc0
> [  284.662292]  sysvec_apic_timer_interrupt+0x72/0x90
> [  284.662292]  </IRQ>
> [  284.662292]  <TASK>
> [  284.662292]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
> [  284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 <c3> cc c0
> [  284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
> [  284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
> [  284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 00000000000289ec
> [  284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 0000000000000001
> [  284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [  284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 0000000000000000
> [  284.662292]  default_idle+0x9/0x20
> [  284.662292]  default_idle_call+0x2c/0xe0
> [  284.662292]  do_idle+0x226/0x270
> [  284.662292]  cpu_startup_entry+0x2a/0x30
> [  284.662292]  start_secondary+0x11e/0x140
> [  284.662292]  secondary_startup_64_no_verify+0x184/0x18b
> [  284.662292]  </TASK>
>
> The soft lockup seems to happen in
> drivers/net/virtio_net.c:start_xmit() [1].
>

Maybe it gets stuck in the do {} while(...
!virtqueue_enable_cb_delayed()) ? you can add a printk in
virtqueue_enable_cb_delayed return and check if it matches with the
speed you're sending or receiving ping. For example, if ping is each
second, you should not see a lot of traces.

If this does not work I'd try never disabling notifications, both in
the kernel and SVQ, and check if that works.
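
For that check, a rate-limited wrapper is usually enough. A debug-only
sketch, assuming you patch drivers/net/virtio_net.c locally so that
start_xmit() calls it instead of virtqueue_enable_cb_delayed() directly
(not meant for upstream):

#include <linux/virtio.h>
#include <linux/printk.h>

static bool dbg_enable_cb_delayed(struct virtqueue *vq)
{
    bool armed = virtqueue_enable_cb_delayed(vq);

    /* If this fires much more often than one ping per second, the
     * driver is spinning in that do {} while () loop. */
    printk_ratelimited(KERN_INFO "%s: enable_cb_delayed -> %d\n",
                       vq->name, armed);
    return armed;
}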

> I don't think the issue is in the kernel because I haven't seen
> any issue when testing my changes with split vqs. Only packed vqs
> give an issue.
>
> L0 kernel version: 6.12.13-1-lts
>
> QEMU command to boot L1:
>
> $ sudo ./qemu/build/qemu-system-x86_64 \
> -enable-kvm \
> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> -net nic,model=virtio \
> -net user,hostfwd=tcp::2222-:22 \
> -device intel-iommu,snoop-control=on \
> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> -netdev tap,id=net0,script=no,downscript=no,vhost=off \
> -nographic \
> -m 8G \
> -smp 4 \
> -M q35 \
> -cpu host 2>&1 | tee vm.log
>
> L1 kernel version: 6.8.5-201.fc39.x86_64
>
> I have been following the "Hands on vDPA - Part 2" blog
> to set up the environment in L1 [2].
>
> QEMU command to boot L2:
>
> # ./qemu/build/qemu-system-x86_64 \
> -nographic \
> -m 4G \
> -enable-kvm \
> -M q35 \
> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> -smp 4 \
> -cpu host \
> 2>&1 | tee vm.log
>
> L2 kernel version: 6.8.7-200.fc39.x86_64
>
> I confirmed that packed vqs are enabled in L2 by running the
> following:
>
> # cut -c35 /sys/devices/pci0000\:00/0000\:00\:07.0/virtio1/features
> 1
>
> I may be wrong, but I think the issue in my implementation might be
> related to:
>
> 1. incorrect endianness conversions.
> 2. implementation of "vhost_svq_more_used_packed" in commit #5.
> 3. implementation of "vhost_svq_(en|dis)able_notification" in commit #5.
> 4. something else?
>

I think 1 is unlikely. I'd go with 2 and 3.

Let me know if the proposed changes work!

> Thanks,
> Sahil
>
> [1] https://github.com/torvalds/linux/blob/master/drivers/net/virtio_net.c#L3245
> [2] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
>
> Sahil Siddiq (7):
>   vhost: Refactor vhost_svq_add_split
>   vhost: Data structure changes to support packed vqs
>   vhost: Forward descriptors to device via packed SVQ
>   vdpa: Allocate memory for SVQ and map them to vdpa
>   vhost: Forward descriptors to guest via packed vqs
>   vhost: Validate transport device features for packed vqs
>   vdpa: Support setting vring_base for packed SVQ
>
>  hw/virtio/vhost-shadow-virtqueue.c | 396 ++++++++++++++++++++++-------
>  hw/virtio/vhost-shadow-virtqueue.h |  88 ++++---
>  hw/virtio/vhost-vdpa.c             |  52 +++-
>  3 files changed, 404 insertions(+), 132 deletions(-)
>
> --
> 2.48.1
>
Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Sahil Siddiq 10 months ago
Hi,

On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> I managed to fix a few issues while testing this patch series.
>> There is still one issue that I am unable to resolve. I thought
>> I would send this patch series for review in case I have missed
>> something.
>>
>> The issue is that this patch series does not work every time. I
>> am able to ping L0 from L2 and vice versa via packed SVQ when it
>> works.
> 
> So we're on a very good track then!
> 
>> When this doesn't work, both VMs throw a "Destination Host
>> Unreachable" error. This is sometimes (not always) accompanied
>> by the following kernel error (thrown by L2-kernel):
>>
>> virtio_net virtio1: output.0:id 1 is not a head!
>>
> 
> How many packets have been sent or received before hitting this? If
> the answer to that is "the vq size", maybe there is a bug in the code
> that handles the wraparound of the packed vq, as the used and avail
> flags need to be twisted. You can count them in the SVQ code.

I did a lot more testing. This issue is quite unpredictable in terms
of the time at which it appears after booting L2. So far, it almost
always appears after booting L2. Even when pinging works, this issue
appears after several seconds of pinging.

The total number of svq descriptors varied in every test run. But in
every case, all 256 indices were filled in the descriptor region for
vq with vq_idx = 0. This is the RX vq, right? This was filled while L2
was booting. In the case when the ctrl vq is disabled, I am not sure
what is responsible for filling the vqs in the data plane during
booting.

=====
The issue is hit most frequently when the following command is run
in L0:
$ ip addr add 111.1.1.1/24 dev tap0
$ ip link set tap0 up

or, running the following in L2:
# ip addr add 111.1.1.2/24 dev eth0

The other vq (vq_idx=1) is not filled completely before the issue is
hit. I have been noting down the numbers and here is an example:

295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
|_ 256 additions in vq_idx = 0, all with unique ids
     |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
|_ 39 additions in vq_idx = 1
     |_ 13 descriptors had id = 0
     |_ 26 descriptors had id = 1
     |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)

There was one case in which vq_idx=0 had wrapped around. I verified
that flags were set appropriately during the wrap (avail and used flags
were flipped as expected).

=====
The next common situation where this issue is hit is during startup.
Before L2 can finish booting successfully, this error is thrown:

virtio_net virtio1: output.0:id 0 is not a head!

258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
|_ 256 additions in vq_idx = 0, all with unique ids
    |---- None of them were received by the device (vhost_svq_get_buf_packed)
|_ 2 additions in vq_idx = 1
    |_ id = 0 in index 0
    |_ id = 1 in index 1
    |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)

=====
Another case is after several seconds of pinging L0 from L2.

[   99.034114] virtio_net virtio1: output.0:id 0 is not a head!

366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
|_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
|   |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
|_ 77 additions in vq_idx = 1
     |_ 76 descriptors had id = 0
     |_ 1 descriptor had id = 1
     |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)

I am not entirely sure now if there's an issue in the packed vq
implementation in QEMU or if this is being caused due to some sort
of race condition in linux.

"id is not a head" is being thrown because vq->packed.desc_state[id].data
doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
stored some data for this id via vhost_svq_add() [2]. Linux sets the value
of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
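
To make the failure mode concrete, here is a tiny user-space model of
that check (my own sketch, not the kernel code): the driver only
accepts a used id whose desc_state[].data slot was populated when the
id was made available, so handing back an id twice, or one that was
never made available, produces exactly this message.

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_NUM 256

static void *desc_state_data[RING_NUM]; /* models vq->packed.desc_state[id].data */

static bool use_id(unsigned int id)
{
    if (id >= RING_NUM || !desc_state_data[id]) {
        fprintf(stderr, "output.0: id %u is not a head!\n", id);
        return false;
    }
    desc_state_data[id] = NULL;         /* the id is free again once used */
    return true;
}

int main(void)
{
    static int token;

    desc_state_data[0] = &token;        /* driver makes id 0 available  */
    use_id(0);                          /* ok: consumes id 0            */
    use_id(0);                          /* fails: id 0 came back twice
                                           (or was never made available) */
    return 0;
}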

>> This error is not thrown always, but when it is thrown, the id
>> varies. This is invariably followed by a soft lockup:
>> [...]
>> [  284.662292] Call Trace:
>> [  284.662292]  <IRQ>
>> [  284.662292]  ? watchdog_timer_fn+0x1e6/0x270
>> [  284.662292]  ? __pfx_watchdog_timer_fn+0x10/0x10
>> [  284.662292]  ? __hrtimer_run_queues+0x10f/0x2b0
>> [  284.662292]  ? hrtimer_interrupt+0xf8/0x230
>> [  284.662292]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
>> [  284.662292]  ? sysvec_apic_timer_interrupt+0x39/0x90
>> [  284.662292]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [  284.662292]  ? virtqueue_enable_cb_delayed+0x115/0x150
>> [  284.662292]  start_xmit+0x2a6/0x4f0 [virtio_net]
>> [  284.662292]  ? netif_skb_features+0x98/0x300
>> [  284.662292]  dev_hard_start_xmit+0x61/0x1d0
>> [  284.662292]  sch_direct_xmit+0xa4/0x390
>> [  284.662292]  __dev_queue_xmit+0x84f/0xdc0
>> [  284.662292]  ? nf_hook_slow+0x42/0xf0
>> [  284.662292]  ip_finish_output2+0x2b8/0x580
>> [  284.662292]  igmp_ifc_timer_expire+0x1d5/0x430
>> [  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
>> [  284.662292]  call_timer_fn+0x21/0x130
>> [  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
>> [  284.662292]  __run_timers+0x21f/0x2b0
>> [  284.662292]  run_timer_softirq+0x1d/0x40
>> [  284.662292]  __do_softirq+0xc9/0x2c8
>> [  284.662292]  __irq_exit_rcu+0xa6/0xc0
>> [  284.662292]  sysvec_apic_timer_interrupt+0x72/0x90
>> [  284.662292]  </IRQ>
>> [  284.662292]  <TASK>
>> [  284.662292]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [  284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
>> [  284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 <c3> cc c0
>> [  284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
>> [  284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
>> [  284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 00000000000289ec
>> [  284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 0000000000000001
>> [  284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
>> [  284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 0000000000000000
>> [  284.662292]  default_idle+0x9/0x20
>> [  284.662292]  default_idle_call+0x2c/0xe0
>> [  284.662292]  do_idle+0x226/0x270
>> [  284.662292]  cpu_startup_entry+0x2a/0x30
>> [  284.662292]  start_secondary+0x11e/0x140
>> [  284.662292]  secondary_startup_64_no_verify+0x184/0x18b
>> [  284.662292]  </TASK>
>>
>> The soft lockup seems to happen in
>> drivers/net/virtio_net.c:start_xmit() [1].
>>
> 
> Maybe it gets stuck in the do {} while(...
> !virtqueue_enable_cb_delayed()) ? you can add a printk in
> virtqueue_enable_cb_delayed return and check if it matches with the
> speed you're sending or receiving ping. For example, if ping is each
> second, you should not see a lot of traces.
> 
> If this does not work I'd try never disabling notifications, both in
> the kernel and SVQ, and check if that works.

In order to disable notifications, will something have to be commented
out in the implementation?

>> [...]
>> QEMU command to boot L1:
>>
>> $ sudo ./qemu/build/qemu-system-x86_64 \
>> -enable-kvm \
>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>> -net nic,model=virtio \
>> -net user,hostfwd=tcp::2222-:22 \
>> -device intel-iommu,snoop-control=on \
>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
>> -netdev tap,id=net0,script=no,downscript=no,vhost=off \
>> -nographic \
>> -m 8G \
>> -smp 4 \
>> -M q35 \
>> -cpu host 2>&1 | tee vm.log
>>

I have added "-device virtio-net-pci,indirect_desc=off,queue_reset=off"
to the L0 QEMU command to boot L1.

Thanks,
Sahil

[1] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1762
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L290
[3] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1564

Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Eugenio Perez Martin 9 months, 4 weeks ago
On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> I managed to fix a few issues while testing this patch series.
> >> There is still one issue that I am unable to resolve. I thought
> >> I would send this patch series for review in case I have missed
> >> something.
> >>
> >> The issue is that this patch series does not work every time. I
> >> am able to ping L0 from L2 and vice versa via packed SVQ when it
> >> works.
> >
> > So we're on a very good track then!
> >
> >> When this doesn't work, both VMs throw a "Destination Host
> >> Unreachable" error. This is sometimes (not always) accompanied
> >> by the following kernel error (thrown by L2-kernel):
> >>
> >> virtio_net virtio1: output.0:id 1 is not a head!
> >>
> >
> > How many packets have been sent or received before hitting this? If
> > the answer to that is "the vq size", maybe there is a bug in the code
> > that handles the wraparound of the packed vq, as the used and avail
> > flags need to be twisted. You can count them in the SVQ code.
>
> I did a lot more testing. This issue is quite unpredictable in terms
> of the time at which it appears after booting L2. So far, it almost
> always appears after booting L2. Even when pinging works, this issue
> appears after several seconds of pinging.
>

Maybe you can speed it up with ping -f?

> The total number of svq descriptors varied in every test run. But in
> every case, all 256 indices were filled in the descriptor region for
> vq with vq_idx = 0. This is the RX vq, right?

Right!

> This was filled while L2
> was booting. In the case when the ctrl vq is disabled, I am not sure
> what is responsible for filling the vqs in the data plane during
> booting.
>

The nested guest's driver fills the rx queue at startup. After that,
that nested guest kicks and SVQ receives the descriptors. It copies
the descriptors to the shadow virtqueue and then kicks L0 QEMU.

> =====
> The issue is hit most frequently when the following command is run
> in L0:
> $ ip addr add 111.1.1.1/24 dev tap0
> $ ip link set tap0 up
>
> or, running the following in L2:
> # ip addr add 111.1.1.2/24 dev eth0
>

I guess those are able to start the network, aren't they?

> The other vq (vq_idx=1) is not filled completely before the issue is
> hit.
> I have been noting down the numbers and here is an example:
>
> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
>      |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
> |_ 39 additions in vq_idx = 1
>      |_ 13 descriptors had id = 0
>      |_ 26 descriptors had id = 1
>      |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>
> There was one case in which vq_idx=0 had wrapped around. I verified
> that flags were set appropriately during the wrap (avail and used flags
> were flipped as expected).
>

Ok sounds like you're able to reach it before filling the queue. I'd
go for debugging notifications for this one then. More on this below.

> =====
> The next common situation where this issue is hit is during startup.
> Before L2 can finish booting successfully, this error is thrown:
>
> virtio_net virtio1: output.0:id 0 is not a head!
>
> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
>     |---- None of them were received by the device (vhost_svq_get_buf_packed)
> |_ 2 additions in vq_idx = 1
>     |_ id = 0 in index 0
>     |_ id = 1 in index 1
>     |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
>
> =====
> Another case is after several seconds of pinging L0 from L2.
>
> [   99.034114] virtio_net virtio1: output.0:id 0 is not a head!
>

So the L2 guest sees a descriptor it has not made available
previously. This can be caused because SVQ returns the same descriptor
twice, or it doesn't fill the id or flags properly. It can also be
caused because we're not protecting the write ordering in the ring,
but I don't see anything obviously wrong by looking at the code.

> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
> |   |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
> |_ 77 additions in vq_idx = 1
>      |_ 76 descriptors had id = 0
>      |_ 1 descriptor had id = 1
>      |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>
> I am not entirely sure now if there's an issue in the packed vq
> implementation in QEMU or if this is being caused due to some sort
> of race condition in linux.
>
> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
>

Let's keep debugging further. Can you trace the ids that the L2 kernel
makes available, and then the ones that it uses? At the same time, can
you trace the ids that the svq sees in vhost_svq_get_buf and the ones
that flushes? This allows us to check the set of available descriptors
at any given time.
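
On the SVQ side, a crude outstanding-id tracker would be enough for
that. An illustrative sketch (the names and hook points are up to you;
call the helpers from the packed add path and from vhost_svq_get_buf):

#include <stdbool.h>
#include <stdio.h>

#define DBG_NUM_VQS  2
#define DBG_RING_NUM 256

static bool dbg_id_outstanding[DBG_NUM_VQS][DBG_RING_NUM];

/* Call when SVQ makes an id available to the device. */
static void dbg_id_avail(unsigned int vq_idx, unsigned int id)
{
    if (dbg_id_outstanding[vq_idx][id]) {
        fprintf(stderr, "vq %u: id %u made available twice?\n", vq_idx, id);
    }
    dbg_id_outstanding[vq_idx][id] = true;
}

/* Call when SVQ sees the device return an id. */
static void dbg_id_used(unsigned int vq_idx, unsigned int id)
{
    if (!dbg_id_outstanding[vq_idx][id]) {
        fprintf(stderr, "vq %u: used id %u was not outstanding\n",
                vq_idx, id);
    }
    dbg_id_outstanding[vq_idx][id] = false;
}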

> >> This error is not thrown always, but when it is thrown, the id
> >> varies. This is invariably followed by a soft lockup:
> >> [...]
> >> [  284.662292] Call Trace:
> >> [  284.662292]  <IRQ>
> >> [  284.662292]  ? watchdog_timer_fn+0x1e6/0x270
> >> [  284.662292]  ? __pfx_watchdog_timer_fn+0x10/0x10
> >> [  284.662292]  ? __hrtimer_run_queues+0x10f/0x2b0
> >> [  284.662292]  ? hrtimer_interrupt+0xf8/0x230
> >> [  284.662292]  ? __sysvec_apic_timer_interrupt+0x4d/0x140
> >> [  284.662292]  ? sysvec_apic_timer_interrupt+0x39/0x90
> >> [  284.662292]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> [  284.662292]  ? virtqueue_enable_cb_delayed+0x115/0x150
> >> [  284.662292]  start_xmit+0x2a6/0x4f0 [virtio_net]
> >> [  284.662292]  ? netif_skb_features+0x98/0x300
> >> [  284.662292]  dev_hard_start_xmit+0x61/0x1d0
> >> [  284.662292]  sch_direct_xmit+0xa4/0x390
> >> [  284.662292]  __dev_queue_xmit+0x84f/0xdc0
> >> [  284.662292]  ? nf_hook_slow+0x42/0xf0
> >> [  284.662292]  ip_finish_output2+0x2b8/0x580
> >> [  284.662292]  igmp_ifc_timer_expire+0x1d5/0x430
> >> [  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> >> [  284.662292]  call_timer_fn+0x21/0x130
> >> [  284.662292]  ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> >> [  284.662292]  __run_timers+0x21f/0x2b0
> >> [  284.662292]  run_timer_softirq+0x1d/0x40
> >> [  284.662292]  __do_softirq+0xc9/0x2c8
> >> [  284.662292]  __irq_exit_rcu+0xa6/0xc0
> >> [  284.662292]  sysvec_apic_timer_interrupt+0x72/0x90
> >> [  284.662292]  </IRQ>
> >> [  284.662292]  <TASK>
> >> [  284.662292]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> [  284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
> >> [  284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 <c3> cc c0
> >> [  284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
> >> [  284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
> >> [  284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 00000000000289ec
> >> [  284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 0000000000000001
> >> [  284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> >> [  284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 0000000000000000
> >> [  284.662292]  default_idle+0x9/0x20
> >> [  284.662292]  default_idle_call+0x2c/0xe0
> >> [  284.662292]  do_idle+0x226/0x270
> >> [  284.662292]  cpu_startup_entry+0x2a/0x30
> >> [  284.662292]  start_secondary+0x11e/0x140
> >> [  284.662292]  secondary_startup_64_no_verify+0x184/0x18b
> >> [  284.662292]  </TASK>
> >>
> >> The soft lockup seems to happen in
> >> drivers/net/virtio_net.c:start_xmit() [1].
> >>
> >
> > Maybe it gets stuck in the do {} while(...
> > !virtqueue_enable_cb_delayed()) ? you can add a printk in
> > virtqueue_enable_cb_delayed return and check if it matches with the
> > speed you're sending or receiving ping. For example, if ping is each
> > second, you should not see a lot of traces.
> >
> > If this does not work I'd try never disabling notifications, both in
> > the kernel and SVQ, and check if that works.
>
> In order to disable notifications, will something have to be commented
> out in the implementation?
>

To *never* disable notifications you should comment out the SVQ calls
to virtio_queue_set_notification and vhost_svq_disable_notification.
This way the other sides (the L0 device in QEMU and the guest) are forced
to always notify SVQ, and we can check whether that solves the first issue.
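
At the ring level, "never disabling notifications" just means leaving
the packed-ring event suppression area at "enable" and never writing
"disable". As a reminder of what that area looks like (spec-level
sketch; the macro and struct names below are illustrative, not the
ones used in the series):

#include <stdint.h>

#define RING_EVENT_FLAGS_ENABLE  0x0 /* always notify                    */
#define RING_EVENT_FLAGS_DISABLE 0x1 /* never notify                     */
#define RING_EVENT_FLAGS_DESC    0x2 /* notify for a specific descriptor */

struct pvirtq_event_suppress {
    uint16_t desc;  /* descriptor offset plus wrap bit, for DESC mode */
    uint16_t flags; /* one of the RING_EVENT_FLAGS_* values           */
};

/* Forcing notifications to stay on amounts to writing ENABLE once and
 * never writing DISABLE afterwards (endianness conversion omitted). */
static inline void keep_notifications_on(struct pvirtq_event_suppress *es)
{
    es->flags = RING_EVENT_FLAGS_ENABLE;
}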

> >> [...]
> >> QEMU command to boot L1:
> >>
> >> $ sudo ./qemu/build/qemu-system-x86_64 \
> >> -enable-kvm \
> >> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >> -net nic,model=virtio \
> >> -net user,hostfwd=tcp::2222-:22 \
> >> -device intel-iommu,snoop-control=on \
> >> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> >> -netdev tap,id=net0,script=no,downscript=no,vhost=off \
> >> -nographic \
> >> -m 8G \
> >> -smp 4 \
> >> -M q35 \
> >> -cpu host 2>&1 | tee vm.log
> >>
>
> I have added "-device virtio-net-pci,indirect_desc=off,queue_reset=off"
> to the L0 QEMU command to boot L1.
>
> Thanks,
> Sahil
>
> [1] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1762
> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L290
> [3] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1564
>
Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Sahil Siddiq 9 months ago
Hi,

Apologies, I haven't been in touch for a while. I have an update that
I would like to give.

On 4/16/25 12:50 PM, Eugenio Perez Martin wrote:
> On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
>>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> I managed to fix a few issues while testing this patch series.
>>>> There is still one issue that I am unable to resolve. I thought
>>>> I would send this patch series for review in case I have missed
>>>> something.
>>>>
>>>> The issue is that this patch series does not work every time. I
>>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
>>>> works.
>>>
>>> So we're on a very good track then!
>>>
>>>> When this doesn't work, both VMs throw a "Destination Host
>>>> Unreachable" error. This is sometimes (not always) accompanied
>>>> by the following kernel error (thrown by L2-kernel):
>>>>
>>>> virtio_net virtio1: output.0:id 1 is not a head!
>>>>
>>>
>>> How many packets have been sent or received before hitting this? If
>>> the answer to that is "the vq size", maybe there is a bug in the code
>>> that handles the wraparound of the packed vq, as the used and avail
>>> flags need to be twisted. You can count them in the SVQ code.
>>
>> I did a lot more testing. This issue is quite unpredictable in terms
>> of the time at which it appears after booting L2. So far, it almost
>> always appears after booting L2. Even when pinging works, this issue
>> appears after several seconds of pinging.
>>
> 
> Maybe you can speed it up with ping -f?

Thank you, I was able to run tests much faster with the -f option. So
far I have noticed that the RX queue does not give problems. When all
the descriptors are used it is able to wrap around without issues.

>> The total number of svq descriptors varied in every test run. But in
>> every case, all 256 indices were filled in the descriptor region for
>> vq with vq_idx = 0. This is the RX vq, right?
> 
> Right!

The TX queue seems to be problematic. More on this below.

>> This was filled while L2
>> was booting. In the case when the ctrl vq is disabled, I am not sure
>> what is responsible for filling the vqs in the data plane during
>> booting.
>>
> The nested guest's driver fills the rx queue at startup. After that,
> that nested guest kicks and SVQ receives the descriptors. It copies
> the descriptors to the shadow virtqueue and then kicks L0 QEMU.

Understood.

>> =====
>> The issue is hit most frequently when the following command is run
>> in L0:
>> $ ip addr add 111.1.1.1/24 dev tap0
>> $ ip link set tap0 up
>>
>> or, running the following in L2:
>> # ip addr add 111.1.1.2/24 dev eth0
>>
> 
> I guess those are able to start the network, aren't they?

Yes, that's correct.

>> The other vq (vq_idx=1) is not filled completely before the issue is
>> hit.
>> I have been noting down the numbers and here is an example:
>>
>> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
>> |_ 256 additions in vq_idx = 0, all with unique ids
>>       |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
>> |_ 39 additions in vq_idx = 1
>>       |_ 13 descriptors had id = 0
>>       |_ 26 descriptors had id = 1
>>       |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>>
>> There was one case in which vq_idx=0 had wrapped around. I verified
>> that flags were set appropriately during the wrap (avail and used flags
>> were flipped as expected).
>>
> 
> Ok sounds like you're able to reach it before filling the queue. I'd
> go for debugging notifications for this one then. More on this below.
> 
>> =====
>> The next common situation where this issue is hit is during startup.
>> Before L2 can finish booting successfully, this error is thrown:
>>
>> virtio_net virtio1: output.0:id 0 is not a head!
>>
>> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
>> |_ 256 additions in vq_idx = 0, all with unique ids
>>      |---- None of them were received by the device (vhost_svq_get_buf_packed)
>> |_ 2 additions in vq_idx = 1
>>      |_ id = 0 in index 0
>>      |_ id = 1 in index 1
>>      |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
>>
>> =====
>> Another case is after several seconds of pinging L0 from L2.
>>
>> [   99.034114] virtio_net virtio1: output.0:id 0 is not a head!
>>
> 
> So the L2 guest sees a descriptor it has not made available
> previously. This can be caused because SVQ returns the same descriptor
> twice, or it doesn't fill the id or flags properly. It can also be
> caused because we're not protecting the write ordering in the ring,
> but I don't see anything obviously wrong by looking at the code.
> 
>> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
>> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
>> |   |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
>> |_ 77 additions in vq_idx = 1
>>       |_ 76 descriptors had id = 0
>>       |_ 1 descriptor had id = 1
>>       |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>>
>> I am not entirely sure now if there's an issue in the packed vq
>> implementation in QEMU or if this is being caused due to some sort
>> of race condition in linux.
>>
>> "id is not a head" is being thrown because vq->packed.desc_state[id].data
>> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
>> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
>> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
>>
> 
> Let's keep debugging further. Can you trace the ids that the L2 kernel
> makes available, and then the ones that it uses? At the same time, can
> you trace the ids that the svq sees in vhost_svq_get_buf and the ones
> that flushes? This allows us to check the set of available descriptors
> at any given time.
> 
In the linux kernel, I am printing which descriptor is received in which
queue in drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_packed() [1].
I see the following lines getting printed for the TX queue:

[  192.101591] output.0 -> id: 0
[  213.737417] output.0 -> id: 0
[  213.738714] output.0 -> id: 1
[  213.740093] output.0 -> id: 0
[  213.741521] virtio_net virtio1: output.0:id 0 is not a head!

In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_add_packed(), I am
printing the head_idx, id, len, flags and vq_idx. Just before the crash,
the following lines are printed:

head_idx: 157, id: 0, len: 122, flags: 32768, vq idx: 1
head_idx: 158, id: 0, len: 122, flags: 32768, vq idx: 1
head_idx: 159, id: 0, len: 66, flags: 32768, vq idx: 1
head_idx: 160, id: 1, len: 102, flags: 32768, vq idx: 1

In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf_packed(), I
am printing the id, last_used index, used wrap counter and vq_idx. These
are the lines just before the crash:

id: 0, last_used: 158, used_wrap_counter: 0, vq idx: 1
id: 0, last_used: 159, used_wrap_counter: 0, vq idx: 1
id: 0, last_used: 160, used_wrap_counter: 0, vq idx: 1
id: 1, last_used: 161, used_wrap_counter: 0, vq idx: 1

In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush() [2], I am tracking
the values of i and vq_idx in the outer do..while() loop as well as in the inner
while(true) loop. The value of i is used as the "idx" in virtqueue_fill() [3] and
as "count" in virtqueue_flush() [4]. Lines printed in each iteration of the outer
do...while loop are enclosed between "===" lines. These are the lines just before
the crash:

===
in_loop: i: 0, vq idx: 1
in_loop: i: 1, vq idx: 1
out_loop: i: 1, vq idx: 1
===
in_loop: i: 0, vq idx: 1
in_loop: i: 1, vq idx: 1
out_loop: i: 1, vq idx: 1
===
in_loop: i: 0, vq idx: 1
in_loop: i: 1, vq idx: 1
in_loop: i: 2, vq idx: 1
out_loop: i: 2, vq idx: 1
===
in_loop: i: 0, vq idx: 1
out_loop: i: 0, vq idx: 1

I have only investigated which descriptors the kernel uses. I'll also check
which descriptors are made available by the kernel. I'll let you know what I
find.

Thanks,
Sahil

[1] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1727
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L499
[3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L1008
[4] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L1147


Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Eugenio Perez Martin 9 months ago
On Wed, May 14, 2025 at 8:22 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> Apologies, I haven't been in touch for a while. I have an update that
> I would like to give.
>
> On 4/16/25 12:50 PM, Eugenio Perez Martin wrote:
> > On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> >>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> I managed to fix a few issues while testing this patch series.
> >>>> There is still one issue that I am unable to resolve. I thought
> >>>> I would send this patch series for review in case I have missed
> >>>> something.
> >>>>
> >>>> The issue is that this patch series does not work every time. I
> >>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
> >>>> works.
> >>>
> >>> So we're on a very good track then!
> >>>
> >>>> When this doesn't work, both VMs throw a "Destination Host
> >>>> Unreachable" error. This is sometimes (not always) accompanied
> >>>> by the following kernel error (thrown by L2-kernel):
> >>>>
> >>>> virtio_net virtio1: output.0:id 1 is not a head!
> >>>>
> >>>
> >>> How many packets have been sent or received before hitting this? If
> >>> the answer to that is "the vq size", maybe there is a bug in the code
> >>> that handles the wraparound of the packed vq, as the used and avail
> >>> flags need to be twisted. You can count them in the SVQ code.
> >>
> >> I did a lot more testing. This issue is quite unpredictable in terms
> >> of the time at which it appears after booting L2. So far, it almost
> >> always appears after booting L2. Even when pinging works, this issue
> >> appears after several seconds of pinging.
> >>
> >
> > Maybe you can speed it up with ping -f?
>
> Thank you, I was able to run tests much faster with the -f option. So
> far I have noticed that the RX queue does not give problems. When all
> the descriptors are used it is able to wrap around without issues.
>
> >> The total number of svq descriptors varied in every test run. But in
> >> every case, all 256 indices were filled in the descriptor region for
> >> vq with vq_idx = 0. This is the RX vq, right?
> >
> > Right!
>
> The TX queue seems to be problematic. More on this below.
>
> >> This was filled while L2
> >> was booting. In the case when the ctrl vq is disabled, I am not sure
> >> what is responsible for filling the vqs in the data plane during
> >> booting.
> >>
> > The nested guest's driver fills the rx queue at startup. After that,
> > that nested guest kicks and SVQ receives the descriptors. It copies
> > the descriptors to the shadow virtqueue and then kicks L0 QEMU.
>
> Understood.
>
> >> =====
> >> The issue is hit most frequently when the following command is run
> >> in L0:
> >> $ ip addr add 111.1.1.1/24 dev tap0
> >> $ ip link set tap0 up
> >>
> >> or, running the following in L2:
> >> # ip addr add 111.1.1.2/24 dev eth0
> >>
> >
> > I guess those are able to start the network, aren't they?
>
> Yes, that's correct.
>
> >> The other vq (vq_idx=1) is not filled completely before the issue is
> >> hit.
> >> I have been noting down the numbers and here is an example:
> >>
> >> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> >> |_ 256 additions in vq_idx = 0, all with unique ids
> >>       |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
> >> |_ 39 additions in vq_idx = 1
> >>       |_ 13 descriptors had id = 0
> >>       |_ 26 descriptors had id = 1
> >>       |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> >>
> >> There was one case in which vq_idx=0 had wrapped around. I verified
> >> that flags were set appropriately during the wrap (avail and used flags
> >> were flipped as expected).
> >>
> >
> > Ok sounds like you're able to reach it before filling the queue. I'd
> > go for debugging notifications for this one then. More on this below.
> >
> >> =====
> >> The next common situation where this issue is hit is during startup.
> >> Before L2 can finish booting successfully, this error is thrown:
> >>
> >> virtio_net virtio1: output.0:id 0 is not a head!
> >>
> >> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
> >> |_ 256 additions in vq_idx = 0, all with unique ids
> >>      |---- None of them were received by the device (vhost_svq_get_buf_packed)
> >> |_ 2 additions in vq_idx = 1
> >>      |_ id = 0 in index 0
> >>      |_ id = 1 in index 1
> >>      |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
> >>
> >> =====
> >> Another case is after several seconds of pinging L0 from L2.
> >>
> >> [   99.034114] virtio_net virtio1: output.0:id 0 is not a head!
> >>
> >
> > So the L2 guest sees a descriptor it has not made available
> > previously. This can be caused because SVQ returns the same descriptor
> > twice, or it doesn't fill the id or flags properly. It can also be
> > caused because we're not protecting the write ordering in the ring,
> > but I don't see anything obviously wrong by looking at the code.
> >
> >> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> >> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
> >> |   |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
> >> |_ 77 additions in vq_idx = 1
> >>       |_ 76 descriptors had id = 0
> >>       |_ 1 descriptor had id = 1
> >>       |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> >>
> >> I am not entirely sure now if there's an issue in the packed vq
> >> implementation in QEMU or if this is being caused due to some sort
> >> of race condition in linux.
> >>
> >> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> >> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> >> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> >> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
> >>
> >
> > Let's keep debugging further. Can you trace the ids that the L2 kernel
> > makes available, and then the ones that it uses? At the same time, can
> > you trace the ids that the svq sees in vhost_svq_get_buf and the ones
> > that flushes? This allows us to check the set of available descriptors
> > at any given time.
> >
> In the linux kernel, I am printing which descriptor is received in which
> queue in drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_packed() [1].
> I see the following lines getting printed for the TX queue:
>
> [  192.101591] output.0 -> id: 0
> [  213.737417] output.0 -> id: 0
> [  213.738714] output.0 -> id: 1
> [  213.740093] output.0 -> id: 0
> [  213.741521] virtio_net virtio1: output.0:id 0 is not a head!
>

I find it curious that this is the first descriptor with id 1. Do you
have any other descriptor with id 1 previously? Does it fail
consistently with id 1?

You should have descriptors with id 1 and higher in the rx queue, and the
code should not be able to tell the difference, so it seems weird that it
fails with tx. But who knows :).

> In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_add_packed(), I am
> printing the head_idx, id, len, flags and vq_idx. Just before the crash,
> the following lines are printed:
>
> head_idx: 157, id: 0, len: 122, flags: 32768, vq idx: 1
> head_idx: 158, id: 0, len: 122, flags: 32768, vq idx: 1
> head_idx: 159, id: 0, len: 66, flags: 32768, vq idx: 1
> head_idx: 160, id: 1, len: 102, flags: 32768, vq idx: 1
>
> In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf_packed(), I
> am printing the id, last_used index, used wrap counter and vq_idx. These
> are the lines just before the crash:
>
> id: 0, last_used: 158, used_wrap_counter: 0, vq idx: 1
> id: 0, last_used: 159, used_wrap_counter: 0, vq idx: 1
> id: 0, last_used: 160, used_wrap_counter: 0, vq idx: 1
> id: 1, last_used: 161, used_wrap_counter: 0, vq idx: 1
>
> In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush() [2], I am tracking
> the values of i and vq_idx in the outer do..while() loop as well as in the inner
> while(true) loop. The value of i is used as the "idx" in virtqueue_fill() [3] and
> as "count" in virtqueue_flush() [4]. Lines printed in each iteration of the outer
> do...while loop are enclosed between "===" lines. These are the lines just before
> the crash:
>

I'd print VirtQueueElement members too.
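
Something along these lines should do; a sketch only, so double-check
the field names against include/hw/virtio/virtio.h in your tree:

#include "qemu/osdep.h"
#include "hw/virtio/virtio.h"

/* Dump the element right before virtqueue_fill()/virtqueue_flush() so
 * the guest-visible id (elem->index) can be correlated with what the
 * L2 kernel later complains about. */
static void dbg_dump_elem(const VirtQueueElement *elem, unsigned int vq_idx,
                          uint32_t len)
{
    fprintf(stderr,
            "vq %u: elem index=%u ndescs=%u out_num=%u in_num=%u len=%u\n",
            vq_idx, elem->index, elem->ndescs, elem->out_num, elem->in_num,
            len);
}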

It seems you're super close to fixing it :).

Thanks!

> ===
> in_loop: i: 0, vq idx: 1
> in_loop: i: 1, vq idx: 1
> out_loop: i: 1, vq idx: 1
> ===
> in_loop: i: 0, vq idx: 1
> in_loop: i: 1, vq idx: 1
> out_loop: i: 1, vq idx: 1
> ===
> in_loop: i: 0, vq idx: 1
> in_loop: i: 1, vq idx: 1
> in_loop: i: 2, vq idx: 1
> out_loop: i: 2, vq idx: 1
> ===
> in_loop: i: 0, vq idx: 1
> out_loop: i: 0, vq idx: 1
>
> I have only investigated which descriptors the kernel uses. I'll also check
> which descriptors are made available by the kernel. I'll let you know what I
> find.
>
> Thanks,
> Sahil
>
> [1] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1727
> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L499
> [3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L1008
> [4] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L1147
>
Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Sahil Siddiq 7 months, 2 weeks ago
Hi,

It's been a while since I sent an email. I thought I would send an update
to keep you in the loop.

I have been comparing svq's mechanism for split and packed vqs hoping to
find something that might lead to the source of the issue.

One thing worth noting is that when I use kernel version 6.8.5 for testing,
the crashes are far more frequent. In kernel version 6.15.0-rc3+, it's much
harder to reproduce.

On 5/15/25 11:49 AM, Eugenio Perez Martin wrote:
> On Wed, May 14, 2025 at 8:22 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 4/16/25 12:50 PM, Eugenio Perez Martin wrote:
>>> On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
>>>>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>> I managed to fix a few issues while testing this patch series.
>>>>>> There is still one issue that I am unable to resolve. I thought
>>>>>> I would send this patch series for review in case I have missed
>>>>>> something.
>>>>>>
>>>>>> The issue is that this patch series does not work every time. I
>>>>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
>>>>>> works.
>>>>>>
>>>>>> When this doesn't work, both VMs throw a "Destination Host
>>>>>> Unreachable" error. This is sometimes (not always) accompanied
>>>>>> by the following kernel error (thrown by L2-kernel):
>>>>>>
>>>>>> virtio_net virtio1: output.0:id 1 is not a head!
>>>>>>
>>
>> The TX queue seems to be problematic. More on this below.

Sometimes RX also results in this crash, but it seems to be less frequent.

>>>> This was filled while L2
>>>> was booting. In the case when the ctrl vq is disabled, I am not sure
>>>> what is responsible for filling the vqs in the data plane during
>>>> booting.
>>>>
>>> The nested guest's driver fills the rx queue at startup. After that,
>>> that nested guest kicks and SVQ receives the descriptors. It copies
>>> the descriptors to the shadow virtqueue and then kicks L0 QEMU.
>>
>> Understood.
>>
>>>> The other vq (vq_idx=1) is not filled completely before the issue is
>>>> hit.
>>>> I have been noting down the numbers and here is an example:
>>>>
>>>> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
>>>> |_ 256 additions in vq_idx = 0, all with unique ids
>>>>        |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
>>>> |_ 39 additions in vq_idx = 1
>>>>        |_ 13 descriptors had id = 0
>>>>        |_ 26 descriptors had id = 1
>>>>        |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>>>>
>>>> There was one case in which vq_idx=0 had wrapped around. I verified
>>>> that flags were set appropriately during the wrap (avail and used flags
>>>> were flipped as expected).
>>>>
>>>
>>> Ok sounds like you're able to reach it before filling the queue. I'd
>>> go for debugging notifications for this one then. More on this below.
>>>
>>>> =====
>>>> The next common situation where this issue is hit is during startup.
>>>> Before L2 can finish booting successfully, this error is thrown:
>>>>
>>>> virtio_net virtio1: output.0:id 0 is not a head!
>>>>
>>>> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
>>>> |_ 256 additions in vq_idx = 0, all with unique ids
>>>>       |---- None of them were received by the device (vhost_svq_get_buf_packed)
>>>> |_ 2 additions in vq_idx = 1
>>>>       |_ id = 0 in index 0
>>>>       |_ id = 1 in index 1
>>>>       |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
>>>>
>>>> =====
>>>> Another case is after several seconds of pinging L0 from L2.
>>>>
>>>> [   99.034114] virtio_net virtio1: output.0:id 0 is not a head!
>>>>
>>>
>>> So the L2 guest sees a descriptor it has not made available
>>> previously. This can be caused because SVQ returns the same descriptor
>>> twice, or it doesn't fill the id or flags properly. It can also be
>>> caused because we're not protecting the write ordering in the ring,
>>> but I don't see anything obviously wrong by looking at the code.
>>>
>>>> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
>>>> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
>>>> |   |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
>>>> |_ 77 additions in vq_idx = 1
>>>>        |_ 76 descriptors had id = 0
>>>>        |_ 1 descriptor had id = 1
>>>>        |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>>>>
>>>>
>>>> "id is not a head" is being thrown because vq->packed.desc_state[id].data
>>>> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
>>>> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
>>>> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
>>>>
>>>
>>> Let's keep debugging further. Can you trace the ids that the L2 kernel
>>> makes available, and then the ones that it uses? At the same time, can
>>> you trace the ids that the svq sees in vhost_svq_get_buf and the ones
>>> that flushes? This allows us to check the set of available descriptors
>>> at any given time.
>>>
>> In the linux kernel, I am printing which descriptor is received in which
>> queue in drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_packed() [1].
>> I see the following lines getting printed for the TX queue:
>>
>> [  192.101591] output.0 -> id: 0
>> [  213.737417] output.0 -> id: 0
>> [  213.738714] output.0 -> id: 1
>> [  213.740093] output.0 -> id: 0
>> [  213.741521] virtio_net virtio1: output.0:id 0 is not a head!
>>
> 
> I find it particular that it is the first descriptor with id 1. Do you
> have any other descriptor with id 1 previously? Does it fail
> consistently with id 1?

Yes, the descriptor with id 1 had been used previously in TX. The failing id
varies between test runs; it has failed with other ids as well. In one test
run, it failed with id 17. I think there's an off-by-one bug here: it crashes
when it receives id 'x - 1' instead of 'x'.

> You should have descriptors with id 1 and more in the rx queue and the
> code should not be able to tell the difference, so it seems weird it
> fails with tx. But who knows :).

Oh, I thought it would be able to differentiate between them since it knows
which vq->idx it's coming from.

I think there's something off in the way "free_head", "last_used_idx" and
"desc_next" values are calculated in vhost_svq_get_buf_packed() [1].

In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
them in order till id 8. At this point it received id 7 again for some reason
and then crashed.

L2:

[ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
[ 1641.130621] (notify) output.0 -> function will return true
[ 1641.132022] output.0 -> id: 0
[ 1739.502358] input.0 -> id: 0
[ 1739.503003] input.0 -> id: 1
[ 1739.562024] input.0 -> id: 2
[ 1739.578682] input.0 -> id: 3
[ 1739.661913] input.0 -> id: 4
[ 1739.828796] input.0 -> id: 5
[ 1739.829789] input.0 -> id: 6
[ 1740.078757] input.0 -> id: 7
[ 1740.079749] input.0 -> id: 8
[ 1740.080382] input.0 -> id: 7    <----Received 7 again
[ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!

QEMU logs (vhost_svq_get_buf_packed):
------
size              : svq->vring.num
len               : svq->vring_packed.vring.desc[last_used].len
id                : svq->vring_packed.vring.desc[last_used].id
num               : svq->desc_state[id].ndescs
last_used_chain   : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
free_head         : svq->free_head
last_used         : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
------

size: 256, len: 102, id: 0, vq idx: 0
id: 0, last_used_chain: 0, free_head: 0, vq idx: 0
num: 1, free_head: 0, id: 0, last_used: 1, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 74, id: 1, vq idx: 0
id: 1, last_used_chain: 1, free_head: 0, vq idx: 0
num: 1, free_head: 1, id: 1, last_used: 2, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 102, id: 2, vq idx: 0
id: 2, last_used_chain: 2, free_head: 1, vq idx: 0
num: 1, free_head: 2, id: 2, last_used: 3, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 82, id: 3, vq idx: 0
id: 3, last_used_chain: 3, free_head: 2, vq idx: 0
num: 1, free_head: 3, id: 3, last_used: 4, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 74, id: 4, vq idx: 0
id: 4, last_used_chain: 4, free_head: 3, vq idx: 0
num: 1, free_head: 4, id: 4, last_used: 5, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 82, id: 5, vq idx: 0
id: 5, last_used_chain: 5, free_head: 4, vq idx: 0
num: 1, free_head: 5, id: 5, last_used: 6, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 104, id: 6, vq idx: 0
id: 6, last_used_chain: 6, free_head: 5, vq idx: 0
num: 1, free_head: 6, id: 6, last_used: 7, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 82, id: 7, vq idx: 0
id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 104, id: 8, vq idx: 0
id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 98, id: 9, vq idx: 0
id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 104, id: 10, vq idx: 0
id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
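
As an aside, the last_used / used_wrap_counter derivation in the legend can
be restated as a tiny standalone snippet (assuming the wrap counter lives in
bit VRING_PACKED_EVENT_F_WRAP_CTR = 15, as in the virtio spec):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VRING_PACKED_EVENT_F_WRAP_CTR 15

/* Strip the wrap-counter bit to get the ring index. */
static uint16_t used_idx(uint16_t last_used_idx)
{
    return last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
}

/* The wrap counter itself is the top bit of last_used_idx. */
static bool used_wrap_counter(uint16_t last_used_idx)
{
    return !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR));
}

int main(void)
{
    uint16_t v = (1 << VRING_PACKED_EVENT_F_WRAP_CTR) | 9;
    printf("idx=%d wrap=%d\n", used_idx(v), used_wrap_counter(v));
    /* prints: idx=9 wrap=1 */
    return 0;
}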

I have a few more ideas of what to do. I'll let you know if I find something
else.

Thanks,
Sahil

[1] https://github.com/valdaarhun/qemu/blob/packed_vq/hw/virtio/vhost-shadow-virtqueue.c#L687
[2] https://github.com/valdaarhun/qemu/blob/packed_vq/hw/virtio/vhost-shadow-virtqueue.c#L629

Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Eugenio Perez Martin 7 months, 2 weeks ago
On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> It's been a while since I sent an email. I thought I would send an update
> to keep you in the loop.
>
> I have been comparing svq's mechanism for split and packed vqs hoping to
> find something that might lead to the source of the issue.
>
> One thing worth noting is that when I use kernel version 6.8.5 for testing,
> the crashes are far more frequent. In kernel version 6.15.0-rc3+, it's much
> harder to reproduce.
>
> On 5/15/25 11:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 14, 2025 at 8:22 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 4/16/25 12:50 PM, Eugenio Perez Martin wrote:
> >>> On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> >>>>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>> I managed to fix a few issues while testing this patch series.
> >>>>>> There is still one issue that I am unable to resolve. I thought
> >>>>>> I would send this patch series for review in case I have missed
> >>>>>> something.
> >>>>>>
> >>>>>> The issue is that this patch series does not work every time. I
> >>>>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
> >>>>>> works.
> >>>>>>
> >>>>>> When this doesn't work, both VMs throw a "Destination Host
> >>>>>> Unreachable" error. This is sometimes (not always) accompanied
> >>>>>> by the following kernel error (thrown by L2-kernel):
> >>>>>>
> >>>>>> virtio_net virtio1: output.0:id 1 is not a head!
> >>>>>>
> >>
> >> The TX queue seems to be problematic. More on this below.
>
> Sometimes RX also results in this crash, but it seems to be less frequent.
>
> >>>> This was filled while L2
> >>>> was booting. In the case when the ctrl vq is disabled, I am not sure
> >>>> what is responsible for filling the vqs in the data plane during
> >>>> booting.
> >>>>
> >>> The nested guest's driver fills the rx queue at startup. After that,
> >>> that nested guest kicks and SVQ receives the descriptors. It copies
> >>> the descriptors to the shadow virtqueue and then kicks L0 QEMU.
> >>
> >> Understood.
> >>
> >>>> The other vq (vq_idx=1) is not filled completely before the issue is
> >>>> hit.
> >>>> I have been noting down the numbers and here is an example:
> >>>>
> >>>> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> >>>> |_ 256 additions in vq_idx = 0, all with unique ids
> >>>>        |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
> >>>> |_ 39 additions in vq_idx = 1
> >>>>        |_ 13 descriptors had id = 0
> >>>>        |_ 26 descriptors had id = 1
> >>>>        |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> >>>>
> >>>> There was one case in which vq_idx=0 had wrapped around. I verified
> >>>> that flags were set appropriately during the wrap (avail and used flags
> >>>> were flipped as expected).
> >>>>
> >>>
> >>> Ok sounds like you're able to reach it before filling the queue. I'd
> >>> go for debugging notifications for this one then. More on this below.
> >>>
> >>>> =====
> >>>> The next common situation where this issue is hit is during startup.
> >>>> Before L2 can finish booting successfully, this error is thrown:
> >>>>
> >>>> virtio_net virtio1: output.0:id 0 is not a head!
> >>>>
> >>>> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
> >>>> |_ 256 additions in vq_idx = 0, all with unique ids
> >>>>       |---- None of them were received by the device (vhost_svq_get_buf_packed)
> >>>> |_ 2 additions in vq_idx = 1
> >>>>       |_ id = 0 in index 0
> >>>>       |_ id = 1 in index 1
> >>>>       |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
> >>>>
> >>>> =====
> >>>> Another case is after several seconds of pinging L0 from L2.
> >>>>
> >>>> [   99.034114] virtio_net virtio1: output.0:id 0 is not a head!
> >>>>
> >>>
> >>> So the L2 guest sees a descriptor it has not made available
> >>> previously. This can be caused because SVQ returns the same descriptor
> >>> twice, or it doesn't fill the id or flags properly. It can also be
> >>> caused because we're not protecting the write ordering in the ring,
> >>> but I don't see anything obviously wrong by looking at the code.
> >>>
> >>>> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> >>>> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
> >>>> |   |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
> >>>> |_ 77 additions in vq_idx = 1
> >>>>        |_ 76 descriptors had id = 0
> >>>>        |_ 1 descriptor had id = 1
> >>>>        |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> >>>>
> >>>>
> >>>> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> >>>> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> >>>> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> >>>> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
> >>>>
> >>>
> >>> Let's keep debugging further. Can you trace the ids that the L2 kernel
> >>> makes available, and then the ones that it uses? At the same time, can
> >>> you trace the ids that the svq sees in vhost_svq_get_buf and the ones
> >>> that flushes? This allows us to check the set of available descriptors
> >>> at any given time.
> >>>
> >> In the linux kernel, I am printing which descriptor is received in which
> >> queue in drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_packed() [1].
> >> I see the following lines getting printed for the TX queue:
> >>
> >> [  192.101591] output.0 -> id: 0
> >> [  213.737417] output.0 -> id: 0
> >> [  213.738714] output.0 -> id: 1
> >> [  213.740093] output.0 -> id: 0
> >> [  213.741521] virtio_net virtio1: output.0:id 0 is not a head!
> >>
> >
> > I find it particular that it is the first descriptor with id 1. Do you
> > have any other descriptor with id 1 previously? Does it fail
> > consistently with id 1?
>
> Yes, the descriptor with id 1 was used previously in TX. It varies between
> test runs. It has failed with other ids as well during some test runs. In
> one test run, it failed with id 17. I think there's an off-by-one bug here.
> It crashes when it receives id 'x - 1' instead of 'x'.
> > You should have descriptors with id 1 and more in the rx queue and the
> > code should not be able to tell the difference, so it seems weird it
> > fails with tx. But who knows :).
> Oh, I thought it would be able to differentiate between them since it knows
> which vq->idx it's coming from.
>
> I think there's something off in the way "free_head", "last_used_idx" and
> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
>
> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
> them in order till id 8. At this point it received id 7 again for some reason
> and then crashed.
>
> L2:
>
> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
> [ 1641.130621] (notify) output.0 -> function will return true
> [ 1641.132022] output.0 -> id: 0
> [ 1739.502358] input.0 -> id: 0
> [ 1739.503003] input.0 -> id: 1
> [ 1739.562024] input.0 -> id: 2
> [ 1739.578682] input.0 -> id: 3
> [ 1739.661913] input.0 -> id: 4
> [ 1739.828796] input.0 -> id: 5
> [ 1739.829789] input.0 -> id: 6
> [ 1740.078757] input.0 -> id: 7
> [ 1740.079749] input.0 -> id: 8
> [ 1740.080382] input.0 -> id: 7    <----Received 7 again
> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
>
> QEMU logs (vhost_svq_get_buf_packed):
> ------
> size              : svq->vring.num
> len               : svq->vring_packed.vring.desc[last_used].len
> id                : svq->vring_packed.vring.desc[last_used].id
> num               : svq->desc_state[id].ndescs
> last_used_chain   : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
> free_head         : svq->free_head
> last_used         : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
> ------
>
> size: 256, len: 102, id: 0, vq idx: 0
> id: 0, last_used_chain: 0, free_head: 0, vq idx: 0
> num: 1, free_head: 0, id: 0, last_used: 1, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 74, id: 1, vq idx: 0
> id: 1, last_used_chain: 1, free_head: 0, vq idx: 0
> num: 1, free_head: 1, id: 1, last_used: 2, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 102, id: 2, vq idx: 0
> id: 2, last_used_chain: 2, free_head: 1, vq idx: 0
> num: 1, free_head: 2, id: 2, last_used: 3, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 82, id: 3, vq idx: 0
> id: 3, last_used_chain: 3, free_head: 2, vq idx: 0
> num: 1, free_head: 3, id: 3, last_used: 4, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 74, id: 4, vq idx: 0
> id: 4, last_used_chain: 4, free_head: 3, vq idx: 0
> num: 1, free_head: 4, id: 4, last_used: 5, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 82, id: 5, vq idx: 0
> id: 5, last_used_chain: 5, free_head: 4, vq idx: 0
> num: 1, free_head: 5, id: 5, last_used: 6, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 104, id: 6, vq idx: 0
> id: 6, last_used_chain: 6, free_head: 5, vq idx: 0
> num: 1, free_head: 6, id: 6, last_used: 7, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 82, id: 7, vq idx: 0
> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 104, id: 8, vq idx: 0
> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 98, id: 9, vq idx: 0
> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 104, id: 10, vq idx: 0
> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
>
> I have a few more ideas of what to do. I'll let you know if I find something
> else.
>

I cannot find anything just by inspection. What about printing all the
desc_state and all desc_next to check for incoherencies in each
svq_add and get_buf?
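
Something along these lines, as a rough sketch using the field names from
this series (vhost_svq_dump_state is a hypothetical helper and the snippet
is not compile-tested):

/* Hypothetical debug helper: dump the whole free list and descriptor
 * state; call it at the end of vhost_svq_add_packed() and
 * vhost_svq_get_buf_packed() to spot inconsistencies. */
static void vhost_svq_dump_state(VhostShadowVirtqueue *svq)
{
    fprintf(stderr, "free_head=%u\n", (unsigned)svq->free_head);
    for (unsigned i = 0; i < svq->vring.num; i++) {
        fprintf(stderr, "  i=%u desc_next=%u ndescs=%u elem=%p\n",
                i, (unsigned)svq->desc_next[i],
                (unsigned)svq->desc_state[i].ndescs,
                (void *)svq->desc_state[i].elem);
    }
}
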
Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Sahil Siddiq 6 months, 1 week ago
Hi,

I think I have finally found the reason behind this issue.

The order in which "add_packed" and "get_buf_packed" are performed in the
nested guest kernel (L2 kernel) and in QEMU is a little different. Due to
this, the values of free_head and svq->desc_next[] diverge and the guest
crashes at some point. More below.

On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
> On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> I think there's something off in the way "free_head", "last_used_idx" and
>> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
>>
>> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
>> them in order till id 8. At this point it received id 7 again for some reason
>> and then crashed.
>>
>> L2:
>>
>> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
>> [ 1641.130621] (notify) output.0 -> function will return true
>> [ 1641.132022] output.0 -> id: 0
>> [ 1739.502358] input.0 -> id: 0
>> [ 1739.503003] input.0 -> id: 1
>> [ 1739.562024] input.0 -> id: 2
>> [ 1739.578682] input.0 -> id: 3
>> [ 1739.661913] input.0 -> id: 4
>> [ 1739.828796] input.0 -> id: 5
>> [ 1739.829789] input.0 -> id: 6
>> [ 1740.078757] input.0 -> id: 7
>> [ 1740.079749] input.0 -> id: 8
>> [ 1740.080382] input.0 -> id: 7    <----Received 7 again
>> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
>>
>> QEMU logs (vhost_svq_get_buf_packed):
>> ------
>> size              : svq->vring.num
>> len               : svq->vring_packed.vring.desc[last_used].len
>> id                : svq->vring_packed.vring.desc[last_used].id
>> num               : svq->desc_state[id].ndescs
>> last_used_chain   : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
>> free_head         : svq->free_head
>> last_used         : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
>> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
>> ------
>> size: 256, len: 82, id: 7, vq idx: 0
>> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
>> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
>> ------
>> size: 256, len: 104, id: 8, vq idx: 0
>> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
>> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
>> ------
>> size: 256, len: 98, id: 9, vq idx: 0
>> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
>> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
>> ------
>> size: 256, len: 104, id: 10, vq idx: 0
>> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
>> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
>>
>> I have a few more ideas of what to do. I'll let you know if I find something
>> else.
>>
> I cannot find anything just by inspection. What about printing all the
> desc_state and all desc_next to check for incoherencies in each
> svq_add and get_buf?
In this test, all 256 descriptors were filled in the RX vq.

In the TX queue, the L2 kernel would add one descriptor at a time and notify
QEMU. QEMU would then register it in its SVQ and mark it as "available".
After processing the descriptor, QEMU would mark it as "used" and flush it
back to L2. L2, in turn, would mark this descriptor as "used". After this
process, L2 would add the next descriptor in the TX vq while reusing this
ID. This was observed from idx 0 through idx 7.

L2's debug logs:

[   18.379112] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor in idx 6
[   18.387134] (add_packed) output.0 -> idx: 6
[   18.389897] (add_packed) output.0 -> id: 0
[   18.392290] (add_packed) output.0 -> len: 74
[   18.394606] (add_packed) output.0 -> addr: 5012315726
[   18.397043] (add_packed) output.0 -> next id: 1
[   18.399861] Entering prepare_packed: output.0
[   18.402478] (prepare_packed) output.0 -> needs_kick: 1
[   18.404998] (notify) output.0 -> function will return true        <----- Notify QEMU
[   18.406349] output.0 -> id: 0, idx: 6                             <----- Mark ID 0 in idx 6 as used
[   18.409482] output.0 -> old free_head: 1, new free_head: 0        <----- ID 0 can be reused
[   18.410919] (after get_buf processed) output.0 -> id: 0, idx: 7   <----- Next slot is idx 7
[   18.921895] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor with ID = 0 in idx 7
[   18.930093] (add_packed) output.0 -> idx: 7
[   18.935715] (add_packed) output.0 -> id: 0
[   18.937609] (add_packed) output.0 -> len: 122
[   18.939614] (add_packed) output.0 -> addr: 4925868038
[   18.941710] (add_packed) output.0 -> next id: 1
[   18.944032] Entering prepare_packed: output.0
[   18.946148] (prepare_packed) output.0 -> needs_kick: 1
[   18.948234] (notify) output.0 -> function will return true        <----- Notify QEMU
[   18.949606] output.0 -> id: 0, idx: 7                             <----- Mark ID 0 in idx 7 as used
[   18.952756] output.0 -> old free_head: 1, new free_head: 0        <----- ID 0 can be reused
[   18.955154] (after get_buf processed) output.0 -> id: 0, idx: 8   <----- Next slot is idx 8

There was no issue in QEMU till this point.

[   19.177536] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor with ID = 0 in idx 8
[   19.182415] (add_packed) output.0 -> idx: 8
[   19.187257] (add_packed) output.0 -> id: 0
[   19.191355] (add_packed) output.0 -> len: 102
[   19.195131] (add_packed) output.0 -> addr: 4370702342
[   19.199224] (add_packed) output.0 -> next id: 1
[   19.204929] Entering prepare_packed: output.0
[   19.209505] (prepare_packed) output.0 -> needs_kick: 1
[   19.213820] (notify) output.0 -> function will return true       <----- Notify QEMU
[   19.218792] (use_indirect?) output.0 -> verdict: 0               <----- Next slot is idx 9
[   19.224730] (add_packed) output.0 -> idx: 9
[   19.227067] (add_packed) output.0 -> id: 1                       <----- ID 0 can't be reused yet, so use ID = 1
[   19.229090] (add_packed) output.0 -> len: 330
[   19.231182] (add_packed) output.0 -> addr: 4311020614
[   19.233302] (add_packed) output.0 -> next id: 2
[   19.235620] Entering prepare_packed: output.0
[   19.237781] (prepare_packed) output.0 -> needs_kick: 1
[   19.239958] (notify) output.0 -> function will return true       <----- Notify QEMU
[   19.237780] output.0 -> id: 0, idx: 8                            <----- Mark ID 0 in idx 8 as used
[   19.243676] output.0 -> old free_head: 2, new free_head: 0       <----- ID 0 can now be reused
[   19.245214] (after get_buf processed) output.0 -> id: 0, idx: 9  <----- Next slot is idx 9
[   19.247097] output.0 -> id: 1, idx: 9                            <----- Mark ID 1 in idx 9 as used
[   19.249612] output.0 -> old free_head: 0, new free_head: 1       <----- ID 1 can now be reused
[   19.252266] (after get_buf processed) output.0 -> id: 1, idx: 10 <----- Next slot is idx 10

ID 0 and ID 1 in idx 8 and idx 9 respectively are pushed to QEMU
before either of them is marked as used.

But in QEMU, the order is slightly different.

num: 1, init_flags: 128                                                        <----- vhost_svq_add_packed()
idx: 8, id: 0, len: 0, flags: 0, vq idx: 1                                     <----- Before adding descriptor
idx: 8, id: 0, len: 102, flags: 128, vq idx: 1                                 <----- After adding descriptor
Finally: new_idx: 9, head_idx: 8, id: 0, len: 102, flags: 128, vq idx: 1
svq->vring.num: 256                                                            <----- Begin vhost_svq_get_buf_packed()
descriptor_len: 0
descriptor_id: 0                                                               <----- Mark ID = 0 as used
last_used: 8                                                                   <----- Processing idx 8
used_wrap_counter: 1
svq->desc_state[id].ndescs: 1
free_head: 0                                                                   <----- Update free_head to 0.
last_used: 9                                                                   <----- Update last_used to 9.
vq idx: 1                                                                      <----- End vhost_svq_get_buf_packed()
i: 0                                                                           <----- vhost_svq_flush()
descriptor_len: 0
elem->len: 22086
i: 1
elem_is_null: 1
vq idx: 1                                                                      <----- End vhost_svq_flush()
num: 1, init_flags: 128                                                        <----- vhost_svq_add_packed()
idx: 9, id: 0, len: 0, flags: 0, curr: 0, vq idx: 1                            <----- Before adding descriptor
idx: 9, id: 0, len: 330, flags: 128, curr: 1, vq idx: 1                        <----- After adding descriptor
Finally: new_idx: 10, head_idx: 9, id: 0, len: 330, flags: 128, vq idx: 1      <----- ID 0 has been reused (versus ID 1 in L2)
svq->vring.num: 256                                                            <----- Begin vhost_svq_get_buf_packed()
descriptor_len: 0
descriptor_id: 0                                                               <----- Mark ID = 0 as used
last_used: 9                                                                   <----- Processing idx 9
used_wrap_counter: 1
svq->desc_state[id].ndescs: 1
free_head: 0                                                                   <----- Update free_head to 0.
last_used: 10                                                                  <----- Update last_used to 10.
vq idx: 1                                                                      <----- End vhost_svq_get_buf_packed()
i: 0                                                                           <----- vhost_svq_flush()
descriptor_len: 0
elem->len: 22086
i: 1
elem_is_null: 1
vq idx: 1                                                                      <----- End vhost_svq_flush()

In QEMU, id 0 is added in idx 8, but it is marked as used before a
descriptor can be added in idx 9. Because of this, there's a discrepancy
between the guest and QEMU in the values of free_head and svq->desc_next.

In the current implementation, the ID values are generated, maintained
and processed by QEMU instead of being read from the guest's memory. I think
reading the value of ID from guest memory (similar to how the descriptor
length is read from guest memory) should resolve this issue.

The alternative would be to ensure that "add_packed" and "get_buf_packed"
are synchronized between the guest and QEMU.

What are your thoughts on this?

Thanks,
Sahil




Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Eugenio Perez Martin 6 months, 1 week ago
On Wed, Jul 30, 2025 at 4:33 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> I think I have finally found the reason behind this issue.
>
> The order in which "add_packed" and "get_buf_packed" are performed in the
> nested guest kernel (L2 kernel) and QEMU are a little different. Due to
> this, the values in free_head and svq->desc_next[] differ and the guest
> crashes at some point. More below.
>
> On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
> > On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> I think there's something off in the way "free_head", "last_used_idx" and
> >> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
> >>
> >> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
> >> them in order till id 8. At this point it received id 7 again for some reason
> >> and then crashed.
> >>
> >> L2:
> >>
> >> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
> >> [ 1641.130621] (notify) output.0 -> function will return true
> >> [ 1641.132022] output.0 -> id: 0
> >> [ 1739.502358] input.0 -> id: 0
> >> [ 1739.503003] input.0 -> id: 1
> >> [ 1739.562024] input.0 -> id: 2
> >> [ 1739.578682] input.0 -> id: 3
> >> [ 1739.661913] input.0 -> id: 4
> >> [ 1739.828796] input.0 -> id: 5
> >> [ 1739.829789] input.0 -> id: 6
> >> [ 1740.078757] input.0 -> id: 7
> >> [ 1740.079749] input.0 -> id: 8
> >> [ 1740.080382] input.0 -> id: 7    <----Received 7 again
> >> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
> >>
> >> QEMU logs (vhost_svq_get_buf_packed):
> >> ------
> >> size              : svq->vring.num
> >> len               : svq->vring_packed.vring.desc[last_used].len
> >> id                : svq->vring_packed.vring.desc[last_used].id
> >> num               : svq->desc_state[id].ndescs
> >> last_used_chain   : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
> >> free_head         : svq->free_head
> >> last_used         : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
> >> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
> >> ------
> >> size: 256, len: 82, id: 7, vq idx: 0
> >> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
> >> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
> >> ------
> >> size: 256, len: 104, id: 8, vq idx: 0
> >> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
> >> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
> >> ------
> >> size: 256, len: 98, id: 9, vq idx: 0
> >> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
> >> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
> >> ------
> >> size: 256, len: 104, id: 10, vq idx: 0
> >> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
> >> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
> >>
> >> I have a few more ideas of what to do. I'll let you know if I find something
> >> else.
> >>
> > I cannot find anything just by inspection. What about printing all the
> > desc_state and all desc_next to check for incoherencies in each
> > svq_add and get_buf?
> In this test, all 256 descriptors were filled in the RX vq.
>
> In the TX queue, L2 kernel would add one descriptor at a time and notify
> QEMU. QEMU would then register it in its SVQ and mark it as "available".
> After processing the descriptor, QEMU would mark it as "used" and flush it
> back to L2. L2, in turn, would mark this descriptor as "used". After this
> process, L2 would add the next descriptor in the TX vq while reusing this
> ID. This was observed from idx 0 till idx 7.
>
> L2's debug logs:
>
> [   18.379112] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor in idx 6
> [   18.387134] (add_packed) output.0 -> idx: 6
> [   18.389897] (add_packed) output.0 -> id: 0
> [   18.392290] (add_packed) output.0 -> len: 74
> [   18.394606] (add_packed) output.0 -> addr: 5012315726
> [   18.397043] (add_packed) output.0 -> next id: 1
> [   18.399861] Entering prepare_packed: output.0
> [   18.402478] (prepare_packed) output.0 -> needs_kick: 1
> [   18.404998] (notify) output.0 -> function will return true        <----- Notify QEMU
> [   18.406349] output.0 -> id: 0, idx: 6                             <----- Mark ID 0 in idx 6 as used
> [   18.409482] output.0 -> old free_head: 1, new free_head: 0        <----- ID 0 can be reused
> [   18.410919] (after get_buf processed) output.0 -> id: 0, idx: 7   <----- Next slot is idx 7
> [   18.921895] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor with ID = 0 in idx 7
> [   18.930093] (add_packed) output.0 -> idx: 7
> [   18.935715] (add_packed) output.0 -> id: 0
> [   18.937609] (add_packed) output.0 -> len: 122
> [   18.939614] (add_packed) output.0 -> addr: 4925868038
> [   18.941710] (add_packed) output.0 -> next id: 1
> [   18.944032] Entering prepare_packed: output.0
> [   18.946148] (prepare_packed) output.0 -> needs_kick: 1
> [   18.948234] (notify) output.0 -> function will return true        <----- Notify QEMU
> [   18.949606] output.0 -> id: 0, idx: 7                             <----- Mark ID 0 in idx 7 as used
> [   18.952756] output.0 -> old free_head: 1, new free_head: 0        <----- ID 0 can be reused
> [   18.955154] (after get_buf processed) output.0 -> id: 0, idx: 8   <----- Next slot is idx 8
>
> There was no issue in QEMU till this point.
>
> [   19.177536] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor with ID = 0 in idx 8
> [   19.182415] (add_packed) output.0 -> idx: 8
> [   19.187257] (add_packed) output.0 -> id: 0
> [   19.191355] (add_packed) output.0 -> len: 102
> [   19.195131] (add_packed) output.0 -> addr: 4370702342
> [   19.199224] (add_packed) output.0 -> next id: 1
> [   19.204929] Entering prepare_packed: output.0
> [   19.209505] (prepare_packed) output.0 -> needs_kick: 1
> [   19.213820] (notify) output.0 -> function will return true       <----- Notify QEMU
> [   19.218792] (use_indirect?) output.0 -> verdict: 0               <----- Next slot is idx 9
> [   19.224730] (add_packed) output.0 -> idx: 9
> [   19.227067] (add_packed) output.0 -> id: 1                       <----- ID 0 can't be reused yet, so use ID = 1
> [   19.229090] (add_packed) output.0 -> len: 330
> [   19.231182] (add_packed) output.0 -> addr: 4311020614
> [   19.233302] (add_packed) output.0 -> next id: 2
> [   19.235620] Entering prepare_packed: output.0
> [   19.237781] (prepare_packed) output.0 -> needs_kick: 1
> [   19.239958] (notify) output.0 -> function will return true       <----- Notify QEMU
> [   19.237780] output.0 -> id: 0, idx: 8                            <----- Mark ID 0 in idx 8 as used
> [   19.243676] output.0 -> old free_head: 2, new free_head: 0       <----- ID 0 can now be reused
> [   19.245214] (after get_buf processed) output.0 -> id: 0, idx: 9  <----- Next slot is idx 9
> [   19.247097] output.0 -> id: 1, idx: 9                            <----- Mark ID 1 in idx 9 as used
> [   19.249612] output.0 -> old free_head: 0, new free_head: 1       <----- ID 1 can now be reused
> [   19.252266] (after get_buf processed) output.0 -> id: 1, idx: 10 <----- Next slot is idx 10
>
> ID 0 and ID 1 in idx 8 and idx 9 respectively are pushed to QEMU
> before either of them are marked as used.
>
> But in QEMU, the order is slightly different.
>
> num: 1, init_flags: 128                                                        <----- vhost_svq_add_packed()
> idx: 8, id: 0, len: 0, flags: 0, vq idx: 1                                     <----- Before adding descriptor
> idx: 8, id: 0, len: 102, flags: 128, vq idx: 1                                 <----- After adding descriptor
> Finally: new_idx: 9, head_idx: 8, id: 0, len: 102, flags: 128, vq idx: 1
> svq->vring.num: 256                                                            <----- Begin vhost_svq_get_buf_packed()
> descriptor_len: 0
> descriptor_id: 0                                                               <----- Mark ID = 0 as used
> last_used: 8                                                                   <----- Processing idx 8
> used_wrap_counter: 1
> svq->desc_state[id].ndescs: 1
> free_head: 0                                                                   <----- Update free_head to 0.
> last_used: 9                                                                   <----- Update last_used to 9.
> vq idx: 1                                                                      <----- End vhost_svq_get_buf_packed()
> i: 0                                                                           <----- vhost_svq_flush()
> descriptor_len: 0
> elem->len: 22086
> i: 1
> elem_is_null: 1
> vq idx: 1                                                                      <----- End vhost_svq_flush()
> num: 1, init_flags: 128                                                        <----- vhost_svq_add_packed()
> idx: 9, id: 0, len: 0, flags: 0, curr: 0, vq idx: 1                            <----- Before adding descriptor
> idx: 9, id: 0, len: 330, flags: 128, curr: 1, vq idx: 1                        <----- After adding descriptor
> Finally: new_idx: 10, head_idx: 9, id: 0, len: 330, flags: 128, vq idx: 1      <----- ID 0 has been reused (versus ID 1 in L2)
> svq->vring.num: 256                                                            <----- Begin vhost_svq_get_buf_packed()
> descriptor_len: 0
> descriptor_id: 0                                                               <----- Mark ID = 0 as used
> last_used: 9                                                                   <----- Processing idx 9
> used_wrap_counter: 1
> svq->desc_state[id].ndescs: 1
> free_head: 0                                                                   <----- Update free_head to 0.
> last_used: 10                                                                  <----- Update last_used to 10.
> vq idx: 1                                                                      <----- End vhost_svq_get_buf_packed()
> i: 0                                                                           <----- vhost_svq_flush()
> descriptor_len: 0
> elem->len: 22086
> i: 1
> elem_is_null: 1
> vq idx: 1                                                                      <----- End vhost_svq_flush()
>
> In QEMU, id 0 is added in idx 8. But it's marked as used before a
> descriptor can be added in idx 9. Because of this there's a discrepancy
> in the value of free_head and in svq->desc_next.
>
> In the current implementation, the values of ID are generated, maintained
> and processed by QEMU instead of reading from the guest's memory. I think
> reading the value of ID from the guest memory (similar to reading the
> descriptor length from guest memory) should resolve this issue.
>

Ok you made a good catch here :).

The 1:1 sync is hard to achieve as a single buffer in the guest may
need to be split into many buffers on the host.

> The alternative would be to ensure that "add_packed" and "get_buf_packed"
> are synchronized between the guest and QEMU.
>

Yes, they're synchronized. When the guest makes an available
descriptor, its head is saved in the VirtQueueElement of the SVQ's
head idx on svq->desc_state.

Reviewing patch 3/7 I see you're actually returning the id of the
first descriptor of the chain in *head, while it should be the id of
the *last* descriptor. It should not be the cause of the failure, as I
don't see any descriptor chain in the log. To keep the free linked
list happy we may need to store the head of the descriptor chain in
the vq too.
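
For illustration, finding that last id by walking desc_next would look
roughly like this standalone sketch (hypothetical code in the spirit of
vhost_svq_last_desc_of_chain(), not the series itself):

#include <stdint.h>

/* Follow num-1 links starting from id; the returned id is the last
 * descriptor of the chain. */
uint16_t last_desc_of_chain(const uint16_t *desc_next,
                            uint16_t num, uint16_t id)
{
    for (uint16_t j = 0; j + 1 < num; j++) {
        id = desc_next[id];
    }
    return id;
}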

Now, why is SVQ id 0 being reused? Sounds like free_list is not
initialized to 0, 1, 2... but to something else like 0, 0, 0, etc. Can
you print the whole list in each iteration?

> What are your thoughts on this?
>
> Thanks,
> Sahil
>
>
>
Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Sahil Siddiq 6 months, 1 week ago
Hi,

On 7/31/25 7:22 PM, Eugenio Perez Martin wrote:
> On Wed, Jul 30, 2025 at 4:33 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> I think I have finally found the reason behind this issue.
>>
>> The order in which "add_packed" and "get_buf_packed" are performed in the
>> nested guest kernel (L2 kernel) and QEMU are a little different. Due to
>> this, the values in free_head and svq->desc_next[] differ and the guest
>> crashes at some point. More below.
>>
>> On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
>>> On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> I think there's something off in the way "free_head", "last_used_idx" and
>>>> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
>>>>
>>>> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
>>>> them in order till id 8. At this point it received id 7 again for some reason
>>>> and then crashed.
>>>>
>>>> L2:
>>>>
>>>> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
>>>> [ 1641.130621] (notify) output.0 -> function will return true
>>>> [ 1641.132022] output.0 -> id: 0
>>>> [ 1739.502358] input.0 -> id: 0
>>>> [ 1739.503003] input.0 -> id: 1
>>>> [ 1739.562024] input.0 -> id: 2
>>>> [ 1739.578682] input.0 -> id: 3
>>>> [ 1739.661913] input.0 -> id: 4
>>>> [ 1739.828796] input.0 -> id: 5
>>>> [ 1739.829789] input.0 -> id: 6
>>>> [ 1740.078757] input.0 -> id: 7
>>>> [ 1740.079749] input.0 -> id: 8
>>>> [ 1740.080382] input.0 -> id: 7    <----Received 7 again
>>>> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
>>>>
>>>> QEMU logs (vhost_svq_get_buf_packed):
>>>> ------
>>>> size              : svq->vring.num
>>>> len               : svq->vring_packed.vring.desc[last_used].len
>>>> id                : svq->vring_packed.vring.desc[last_used].id
>>>> num               : svq->desc_state[id].ndescs
>>>> last_used_chain   : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
>>>> free_head         : svq->free_head
>>>> last_used         : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
>>>> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
>>>> ------
>>>> size: 256, len: 82, id: 7, vq idx: 0
>>>> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
>>>> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
>>>> ------
>>>> size: 256, len: 104, id: 8, vq idx: 0
>>>> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
>>>> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
>>>> ------
>>>> size: 256, len: 98, id: 9, vq idx: 0
>>>> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
>>>> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
>>>> ------
>>>> size: 256, len: 104, id: 10, vq idx: 0
>>>> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
>>>> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
>>>>
>>>> I have a few more ideas of what to do. I'll let you know if I find something
>>>> else.
>>>>
>>> I cannot find anything just by inspection. What about printing all the
>>> desc_state and all desc_next to check for incoherencies in each
>>> svq_add and get_buf?
>> In this test, all 256 descriptors were filled in the RX vq.
>>
>> In the TX queue, L2 kernel would add one descriptor at a time and notify
>> QEMU. QEMU would then register it in its SVQ and mark it as "available".
>> After processing the descriptor, QEMU would mark it as "used" and flush it
>> back to L2. L2, in turn, would mark this descriptor as "used". After this
>> process, L2 would add the next descriptor in the TX vq while reusing this
>> ID. This was observed from idx 0 till idx 7.
>>
>> L2's debug logs:
>>
>> [   18.379112] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor in idx 6
>> [   18.387134] (add_packed) output.0 -> idx: 6
>> [   18.389897] (add_packed) output.0 -> id: 0
>> [   18.392290] (add_packed) output.0 -> len: 74
>> [   18.394606] (add_packed) output.0 -> addr: 5012315726
>> [   18.397043] (add_packed) output.0 -> next id: 1
>> [   18.399861] Entering prepare_packed: output.0
>> [   18.402478] (prepare_packed) output.0 -> needs_kick: 1
>> [   18.404998] (notify) output.0 -> function will return true        <----- Notify QEMU
>> [   18.406349] output.0 -> id: 0, idx: 6                             <----- Mark ID 0 in idx 6 as used
>> [   18.409482] output.0 -> old free_head: 1, new free_head: 0        <----- ID 0 can be reused
>> [   18.410919] (after get_buf processed) output.0 -> id: 0, idx: 7   <----- Next slot is idx 7
>> [   18.921895] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor with ID = 0 in idx 7
>> [   18.930093] (add_packed) output.0 -> idx: 7
>> [   18.935715] (add_packed) output.0 -> id: 0
>> [   18.937609] (add_packed) output.0 -> len: 122
>> [   18.939614] (add_packed) output.0 -> addr: 4925868038
>> [   18.941710] (add_packed) output.0 -> next id: 1
>> [   18.944032] Entering prepare_packed: output.0
>> [   18.946148] (prepare_packed) output.0 -> needs_kick: 1
>> [   18.948234] (notify) output.0 -> function will return true        <----- Notify QEMU
>> [   18.949606] output.0 -> id: 0, idx: 7                             <----- Mark ID 0 in idx 7 as used
>> [   18.952756] output.0 -> old free_head: 1, new free_head: 0        <----- ID 0 can be reused
>> [   18.955154] (after get_buf processed) output.0 -> id: 0, idx: 8   <----- Next slot is idx 8
>>
>> There was no issue in QEMU till this point.
>>
>> [   19.177536] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor with ID = 0 in idx 8
>> [   19.182415] (add_packed) output.0 -> idx: 8
>> [   19.187257] (add_packed) output.0 -> id: 0
>> [   19.191355] (add_packed) output.0 -> len: 102
>> [   19.195131] (add_packed) output.0 -> addr: 4370702342
>> [   19.199224] (add_packed) output.0 -> next id: 1
>> [   19.204929] Entering prepare_packed: output.0
>> [   19.209505] (prepare_packed) output.0 -> needs_kick: 1
>> [   19.213820] (notify) output.0 -> function will return true       <----- Notify QEMU
>> [   19.218792] (use_indirect?) output.0 -> verdict: 0               <----- Next slot is idx 9
>> [   19.224730] (add_packed) output.0 -> idx: 9
>> [   19.227067] (add_packed) output.0 -> id: 1                       <----- ID 0 can't be reused yet, so use ID = 1
>> [   19.229090] (add_packed) output.0 -> len: 330
>> [   19.231182] (add_packed) output.0 -> addr: 4311020614
>> [   19.233302] (add_packed) output.0 -> next id: 2
>> [   19.235620] Entering prepare_packed: output.0
>> [   19.237781] (prepare_packed) output.0 -> needs_kick: 1
>> [   19.239958] (notify) output.0 -> function will return true       <----- Notify QEMU
>> [   19.237780] output.0 -> id: 0, idx: 8                            <----- Mark ID 0 in idx 8 as used
>> [   19.243676] output.0 -> old free_head: 2, new free_head: 0       <----- ID 0 can now be reused
>> [   19.245214] (after get_buf processed) output.0 -> id: 0, idx: 9  <----- Next slot is idx 9
>> [   19.247097] output.0 -> id: 1, idx: 9                            <----- Mark ID 1 in idx 9 as used
>> [   19.249612] output.0 -> old free_head: 0, new free_head: 1       <----- ID 1 can now be reused
>> [   19.252266] (after get_buf processed) output.0 -> id: 1, idx: 10 <----- Next slot is idx 10
>>
>> ID 0 and ID 1 in idx 8 and idx 9 respectively are pushed to QEMU
>> before either of them are marked as used.
>>
>> But in QEMU, the order is slightly different.
>>
>> num: 1, init_flags: 128                                                        <----- vhost_svq_add_packed()
>> idx: 8, id: 0, len: 0, flags: 0, vq idx: 1                                     <----- Before adding descriptor
>> idx: 8, id: 0, len: 102, flags: 128, vq idx: 1                                 <----- After adding descriptor
>> Finally: new_idx: 9, head_idx: 8, id: 0, len: 102, flags: 128, vq idx: 1
>> svq->vring.num: 256                                                            <----- Begin vhost_svq_get_buf_packed()
>> descriptor_len: 0
>> descriptor_id: 0                                                               <----- Mark ID = 0 as used
>> last_used: 8                                                                   <----- Processing idx 8
>> used_wrap_counter: 1
>> svq->desc_state[id].ndescs: 1
>> free_head: 0                                                                   <----- Update free_head to 0.
>> last_used: 9                                                                   <----- Update last_used to 9.
>> vq idx: 1                                                                      <----- End vhost_svq_get_buf_packed()
>> i: 0                                                                           <----- vhost_svq_flush()
>> descriptor_len: 0
>> elem->len: 22086
>> i: 1
>> elem_is_null: 1
>> vq idx: 1                                                                      <----- End vhost_svq_flush()
>> num: 1, init_flags: 128                                                        <----- vhost_svq_add_packed()
>> idx: 9, id: 0, len: 0, flags: 0, curr: 0, vq idx: 1                            <----- Before adding descriptor
>> idx: 9, id: 0, len: 330, flags: 128, curr: 1, vq idx: 1                        <----- After adding descriptor
>> Finally: new_idx: 10, head_idx: 9, id: 0, len: 330, flags: 128, vq idx: 1      <----- ID 0 has been reused (versus ID 1 in L2)
>> svq->vring.num: 256                                                            <----- Begin vhost_svq_get_buf_packed()
>> descriptor_len: 0
>> descriptor_id: 0                                                               <----- Mark ID = 0 as used
>> last_used: 9                                                                   <----- Processing idx 9
>> used_wrap_counter: 1
>> svq->desc_state[id].ndescs: 1
>> free_head: 0                                                                   <----- Update free_head to 0.
>> last_used: 10                                                                  <----- Update last_used to 10.
>> vq idx: 1                                                                      <----- End vhost_svq_get_buf_packed()
>> i: 0                                                                           <----- vhost_svq_flush()
>> descriptor_len: 0
>> elem->len: 22086
>> i: 1
>> elem_is_null: 1
>> vq idx: 1                                                                      <----- End vhost_svq_flush()
>>
>> In QEMU, id 0 is added in idx 8. But it's marked as used before a
>> descriptor can be added in idx 9. Because of this there's a discrepancy
>> in the value of free_head and in svq->desc_next.
>>
>> In the current implementation, the values of ID are generated, maintained
>> and processed by QEMU instead of reading from the guest's memory. I think
>> reading the value of ID from the guest memory (similar to reading the
>> descriptor length from guest memory) should resolve this issue.
>>
> 
> Ok you made a good catch here :).
> 
> The 1:1 sync is hard to achieve as a single buffer in the guest may
> need to be splitted in many buffers in the host.
> 
>> The alternative would be to ensure that "add_packed" and "get_buf_packed"
>> are synchronized between the guest and QEMU.
> 
> Yes, they're synchronized. When the guest makes an available
> descriptor, its head is saved in the VirtQueueElement of the SVQ's
> head idx on svq->desc_state.
> 
> Reviewing patch 3/7 I see you're actually returning the id of the
> first descriptor of the chain in *head, while it should be the id of
> the *last* descriptor. It should not be the cause of the failure, as I
> don't see any descriptor chain in the log.

Does this mean that if the current free_head is 3 and the length of the
chain is 4, then ID 7 should be saved in the descriptor ring? In the current
implementation, *all* descriptors in the chain are assigned the same
ID (= free_head, instead of free_head + length of chain).

> To keep the free linked list happy we may need to store the head of the
> descriptor chain in the vq too.
> 
> Now, why is SVQ id 0 being reused? Sounds like free_list is not
> initialized to 0, 1, 2... but to something else like 0, 0, 0, etc. Can
> you print the whole list in each iteration?
> 

The free_list initially has the following values:
index 0 -> 1
index 1 -> 2
...
index 254 -> 255
index 255 -> 0

free_head is set to 0 at the beginning. When add_packed() executes for
the first time, ID 0 is used and free_head is set to 1. If get_buf_packed()
is run immediately after the "add" operation, free_list[ID] is set to
the current free_head (ID 1 in this case). After this, free_head is set to
the used ID (0 in this case).

So, free_list still looks like this:
index 0 -> 1
index 1 -> 2
...
index 254 -> 255
index 255 -> 0

But the free_head is 0. So, ID = 0 is reused again.

When 2 IDs (such as with idx 8 and idx 9 in the guest) are added to the
SVQ without either being used, then free_head is updated as shown:

free_head = 0 -> 1 -> 2

And then marking both IDs as "used" results in:

free_list[0] = current free_head (= 2)
free_head = 0
free_list[1] = current free_head (= 0)
free_head = 1

So free_list looks like this:

index 0 -> 2
index 1 -> 0
index 2 -> 3
index 3 -> 4
...
index 254 -> 255
index 255 -> 0

None of the indices in free_list hold the value 1.

Since at this point free_head is 1, ID 1 is now used very frequently
when add_packed() and get_buf_packed() run one at a time in an interleaved
fashion.
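
For what it's worth, the free_head/free_list updates described above can be
mirrored in a tiny standalone sketch (hypothetical code, not the QEMU
implementation). Running it reproduces both observations: the interleaved
add/get_buf case hands out id 0 every time, and after two back-to-back adds
the value 1 disappears from the list while free_head becomes 1:

#include <stdint.h>
#include <stdio.h>

#define NUM 8

static uint16_t desc_next[NUM];   /* the "free_list" above */
static uint16_t free_head;

static uint16_t add_packed(void)          /* id selection on add */
{
    uint16_t id = free_head;
    free_head = desc_next[id];
    return id;
}

static void get_buf_packed(uint16_t id)   /* return id to the free list */
{
    desc_next[id] = free_head;
    free_head = id;
}

int main(void)
{
    free_head = 0;
    for (uint16_t i = 0; i < NUM; i++) {
        desc_next[i] = (i + 1) % NUM;     /* 0 -> 1 -> ... -> NUM-1 -> 0 */
    }

    /* One descriptor at a time (idx 0..7 in the log): id 0 every time. */
    for (int i = 0; i < 3; i++) {
        uint16_t id = add_packed();
        printf("interleaved add -> id %d\n", id);
        get_buf_packed(id);
    }

    /* Two adds before any get_buf (idx 8 and 9 in the log). */
    uint16_t a = add_packed();            /* id 0 */
    uint16_t b = add_packed();            /* id 1 */
    printf("back-to-back adds -> ids %d, %d\n", a, b);
    get_buf_packed(a);
    get_buf_packed(b);
    /* Value 1 no longer appears anywhere in desc_next. */
    printf("free_head=%d desc_next[0]=%d desc_next[1]=%d\n",
           free_head, desc_next[0], desc_next[1]);
    return 0;
}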

Thanks,
Sahil

Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Eugenio Perez Martin 6 months, 1 week ago
On Mon, Aug 4, 2025 at 8:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 7/31/25 7:22 PM, Eugenio Perez Martin wrote:
> > On Wed, Jul 30, 2025 at 4:33 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> I think I have finally found the reason behind this issue.
> >>
> >> The order in which "add_packed" and "get_buf_packed" are performed in the
> >> nested guest kernel (L2 kernel) and QEMU are a little different. Due to
> >> this, the values in free_head and svq->desc_next[] differ and the guest
> >> crashes at some point. More below.
> >>
> >> On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
> >>> On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> I think there's something off in the way "free_head", "last_used_idx" and
> >>>> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
> >>>>
> >>>> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
> >>>> them in order till id 8. At this point it received id 7 again for some reason
> >>>> and then crashed.
> >>>>
> >>>> L2:
> >>>>
> >>>> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
> >>>> [ 1641.130621] (notify) output.0 -> function will return true
> >>>> [ 1641.132022] output.0 -> id: 0
> >>>> [ 1739.502358] input.0 -> id: 0
> >>>> [ 1739.503003] input.0 -> id: 1
> >>>> [ 1739.562024] input.0 -> id: 2
> >>>> [ 1739.578682] input.0 -> id: 3
> >>>> [ 1739.661913] input.0 -> id: 4
> >>>> [ 1739.828796] input.0 -> id: 5
> >>>> [ 1739.829789] input.0 -> id: 6
> >>>> [ 1740.078757] input.0 -> id: 7
> >>>> [ 1740.079749] input.0 -> id: 8
> >>>> [ 1740.080382] input.0 -> id: 7    <----Received 7 again
> >>>> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
> >>>>
> >>>> QEMU logs (vhost_svq_get_buf_packed):
> >>>> ------
> >>>> size              : svq->vring.num
> >>>> len               : svq->vring_packed.vring.desc[last_used].len
> >>>> id                : svq->vring_packed.vring.desc[last_used].id
> >>>> num               : svq->desc_state[id].ndescs
> >>>> last_used_chain   : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
> >>>> free_head         : svq->free_head
> >>>> last_used         : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
> >>>> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
> >>>> ------
> >>>> size: 256, len: 82, id: 7, vq idx: 0
> >>>> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
> >>>> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
> >>>> ------
> >>>> size: 256, len: 104, id: 8, vq idx: 0
> >>>> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
> >>>> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
> >>>> ------
> >>>> size: 256, len: 98, id: 9, vq idx: 0
> >>>> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
> >>>> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
> >>>> ------
> >>>> size: 256, len: 104, id: 10, vq idx: 0
> >>>> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
> >>>> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
> >>>>
> >>>> I have a few more ideas of what to do. I'll let you know if I find something
> >>>> else.
> >>>>
> >>> I cannot find anything just by inspection. What about printing all the
> >>> desc_state and all desc_next to check for incoherencies in each
> >>> svq_add and get_buf?
> >> In this test, all 256 descriptors were filled in the RX vq.
> >>
> >> In the TX queue, the L2 kernel would add one descriptor at a time and notify
> >> QEMU. QEMU would then register it in its SVQ and mark it as "available".
> >> After processing the descriptor, QEMU would mark it as "used" and flush it
> >> back to L2. L2, in turn, would mark this descriptor as "used". After this
> >> process, L2 would add the next descriptor in the TX vq while reusing this
> >> ID. This was observed from idx 0 till idx 7.
> >>
> >> L2's debug logs:
> >>
> >> [   18.379112] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor in idx 6
> >> [   18.387134] (add_packed) output.0 -> idx: 6
> >> [   18.389897] (add_packed) output.0 -> id: 0
> >> [   18.392290] (add_packed) output.0 -> len: 74
> >> [   18.394606] (add_packed) output.0 -> addr: 5012315726
> >> [   18.397043] (add_packed) output.0 -> next id: 1
> >> [   18.399861] Entering prepare_packed: output.0
> >> [   18.402478] (prepare_packed) output.0 -> needs_kick: 1
> >> [   18.404998] (notify) output.0 -> function will return true        <----- Notify QEMU
> >> [   18.406349] output.0 -> id: 0, idx: 6                             <----- Mark ID 0 in idx 6 as used
> >> [   18.409482] output.0 -> old free_head: 1, new free_head: 0        <----- ID 0 can be reused
> >> [   18.410919] (after get_buf processed) output.0 -> id: 0, idx: 7   <----- Next slot is idx 7
> >> [   18.921895] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor with ID = 0 in idx 7
> >> [   18.930093] (add_packed) output.0 -> idx: 7
> >> [   18.935715] (add_packed) output.0 -> id: 0
> >> [   18.937609] (add_packed) output.0 -> len: 122
> >> [   18.939614] (add_packed) output.0 -> addr: 4925868038
> >> [   18.941710] (add_packed) output.0 -> next id: 1
> >> [   18.944032] Entering prepare_packed: output.0
> >> [   18.946148] (prepare_packed) output.0 -> needs_kick: 1
> >> [   18.948234] (notify) output.0 -> function will return true        <----- Notify QEMU
> >> [   18.949606] output.0 -> id: 0, idx: 7                             <----- Mark ID 0 in idx 7 as used
> >> [   18.952756] output.0 -> old free_head: 1, new free_head: 0        <----- ID 0 can be reused
> >> [   18.955154] (after get_buf processed) output.0 -> id: 0, idx: 8   <----- Next slot is idx 8
> >>
> >> There was no issue in QEMU till this point.
> >>
> >> [   19.177536] (use_indirect?) output.0 -> verdict: 0                <----- Begin adding descriptor with ID = 0 in idx 8
> >> [   19.182415] (add_packed) output.0 -> idx: 8
> >> [   19.187257] (add_packed) output.0 -> id: 0
> >> [   19.191355] (add_packed) output.0 -> len: 102
> >> [   19.195131] (add_packed) output.0 -> addr: 4370702342
> >> [   19.199224] (add_packed) output.0 -> next id: 1
> >> [   19.204929] Entering prepare_packed: output.0
> >> [   19.209505] (prepare_packed) output.0 -> needs_kick: 1
> >> [   19.213820] (notify) output.0 -> function will return true       <----- Notify QEMU
> >> [   19.218792] (use_indirect?) output.0 -> verdict: 0               <----- Next slot is idx 9
> >> [   19.224730] (add_packed) output.0 -> idx: 9
> >> [   19.227067] (add_packed) output.0 -> id: 1                       <----- ID 0 can't be reused yet, so use ID = 1
> >> [   19.229090] (add_packed) output.0 -> len: 330
> >> [   19.231182] (add_packed) output.0 -> addr: 4311020614
> >> [   19.233302] (add_packed) output.0 -> next id: 2
> >> [   19.235620] Entering prepare_packed: output.0
> >> [   19.237781] (prepare_packed) output.0 -> needs_kick: 1
> >> [   19.239958] (notify) output.0 -> function will return true       <----- Notify QEMU
> >> [   19.237780] output.0 -> id: 0, idx: 8                            <----- Mark ID 0 in idx 8 as used
> >> [   19.243676] output.0 -> old free_head: 2, new free_head: 0       <----- ID 0 can now be reused
> >> [   19.245214] (after get_buf processed) output.0 -> id: 0, idx: 9  <----- Next slot is idx 9
> >> [   19.247097] output.0 -> id: 1, idx: 9                            <----- Mark ID 1 in idx 9 as used
> >> [   19.249612] output.0 -> old free_head: 0, new free_head: 1       <----- ID 1 can now be reused
> >> [   19.252266] (after get_buf processed) output.0 -> id: 1, idx: 10 <----- Next slot is idx 10
> >>
> >> ID 0 and ID 1 in idx 8 and idx 9 respectively are pushed to QEMU
> >> before either of them is marked as used.
> >>
> >> But in QEMU, the order is slightly different.
> >>
> >> num: 1, init_flags: 128                                                        <----- vhost_svq_add_packed()
> >> idx: 8, id: 0, len: 0, flags: 0, vq idx: 1                                     <----- Before adding descriptor
> >> idx: 8, id: 0, len: 102, flags: 128, vq idx: 1                                 <----- After adding descriptor
> >> Finally: new_idx: 9, head_idx: 8, id: 0, len: 102, flags: 128, vq idx: 1
> >> svq->vring.num: 256                                                            <----- Begin vhost_svq_get_buf_packed()
> >> descriptor_len: 0
> >> descriptor_id: 0                                                               <----- Mark ID = 0 as used
> >> last_used: 8                                                                   <----- Processing idx 8
> >> used_wrap_counter: 1
> >> svq->desc_state[id].ndescs: 1
> >> free_head: 0                                                                   <----- Update free_head to 0.
> >> last_used: 9                                                                   <----- Update last_used to 9.
> >> vq idx: 1                                                                      <----- End vhost_svq_get_buf_packed()
> >> i: 0                                                                           <----- vhost_svq_flush()
> >> descriptor_len: 0
> >> elem->len: 22086
> >> i: 1
> >> elem_is_null: 1
> >> vq idx: 1                                                                      <----- End vhost_svq_flush()
> >> num: 1, init_flags: 128                                                        <----- vhost_svq_add_packed()
> >> idx: 9, id: 0, len: 0, flags: 0, curr: 0, vq idx: 1                            <----- Before adding descriptor
> >> idx: 9, id: 0, len: 330, flags: 128, curr: 1, vq idx: 1                        <----- After adding descriptor
> >> Finally: new_idx: 10, head_idx: 9, id: 0, len: 330, flags: 128, vq idx: 1      <----- ID 0 has been reused (versus ID 1 in L2)
> >> svq->vring.num: 256                                                            <----- Begin vhost_svq_get_buf_packed()
> >> descriptor_len: 0
> >> descriptor_id: 0                                                               <----- Mark ID = 0 as used
> >> last_used: 9                                                                   <----- Processing idx 9
> >> used_wrap_counter: 1
> >> svq->desc_state[id].ndescs: 1
> >> free_head: 0                                                                   <----- Update free_head to 0.
> >> last_used: 10                                                                  <----- Update last_used to 10.
> >> vq idx: 1                                                                      <----- End vhost_svq_get_buf_packed()
> >> i: 0                                                                           <----- vhost_svq_flush()
> >> descriptor_len: 0
> >> elem->len: 22086
> >> i: 1
> >> elem_is_null: 1
> >> vq idx: 1                                                                      <----- End vhost_svq_flush()
> >>
> >> In QEMU, id 0 is added in idx 8. But it's marked as used before a
> >> descriptor can be added in idx 9. Because of this there's a discrepancy
> >> in the value of free_head and in svq->desc_next.
> >>
> >> In the current implementation, the values of ID are generated, maintained
> >> and processed by QEMU instead of being read from the guest's memory. I think
> >> reading the value of ID from the guest memory (similar to reading the
> >> descriptor length from guest memory) should resolve this issue.
> >>
> >
> > Ok you made a good catch here :).
> >
> > The 1:1 sync is hard to achieve as a single buffer in the guest may
> > need to be split into many buffers in the host.
> >
> >> The alternative would be to ensure that "add_packed" and "get_buf_packed"
> >> are synchronized between the guest and QEMU.
> >
> > Yes, they're synchronized. When the guest makes a descriptor
> > available, its head is saved in the VirtQueueElement at the SVQ's
> > head idx in svq->desc_state.
> >
> > Reviewing patch 3/7 I see you're actually returning the id of the
> > first descriptor of the chain in *head, while it should be the id of
> > the *last* descriptor. It should not be the cause of the failure, as I
> > don't see any descriptor chain in the log.
>
> Does this mean if the current free_head is 3 and the length of the chain
> is 4, then ID 7 should be saved in the descriptor ring? In the current
> implementation, *all* descriptors in the chain are being assigned the same
> ID (= free_head instead of free_head + length of chain).
>

Ouch, you're right! I recall having the same comment while reading
the kernel's version, and then I forgot about it.
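
As far as I remember, the chain consumes one free-list entry per
descriptor, but every descriptor is stamped with the same buffer id
(the free_head at the time of the add). So in your example the id
written is 3 and the new free_head becomes 7. A throwaway model of
that (my own code, not the kernel's):

/* Throwaway model, not the actual kernel code: with a fresh
 * sequential free list and free_head = 3, a 4-descriptor chain
 * stamps id 3 into every descriptor and advances free_head to 7. */
#include <stdint.h>
#include <stdio.h>

#define NUM 256

int main(void)
{
    uint16_t next[NUM];              /* free-list links */
    uint16_t free_head = 3;          /* pretend ids 0..2 are in flight */

    for (int i = 0; i < NUM; i++) {
        next[i] = (i + 1) % NUM;
    }

    uint16_t id = free_head;         /* buffer id for the whole chain */
    uint16_t curr = id;
    for (int n = 0; n < 4; n++) {    /* chain of 4 descriptors */
        printf("descriptor %d gets id %d\n", n, id);
        curr = next[curr];           /* one free-list entry per descriptor */
    }
    free_head = curr;

    printf("new free_head = %d\n", free_head);    /* 7 */
    return 0;
}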

> > To keep the free linked list happy we may need to store the head of the
> > descriptor chain in the vq too.
> >
> > Now, why is SVQ id 0 being reused? Sounds like free_list is not
> > initialized to 0, 1, 2... but to something else like 0, 0, 0, etc. Can
> > you print the whole list in each iteration?
> >
>
> The free_list initially has the following values:
> index 0 -> 1
> index 1 -> 2
> ...
> index 254 -> 255
> index 255 -> 0
>
> free_head is set to 0 at the beginning. When add_packed() executes for
> the first time ID 0 is used and free_head is set to 1. If get_buf_packed()
> is run immediately after the "add" operation, free_list[ID] is set to
> the current free_head (ID 1 in this case). After this free_head is set to
> the used ID (0 in this case).
>
> So, free_list still looks like this:
> index 0 -> 1
> index 1 -> 2
> ...
> index 254 -> 255
> index 255 -> 0
>
> But the free_head is 0. So, ID = 0 is reused again.
>
> When 2 IDs (such as with idx 8 and idx 9 in the guest) are added to the
> SVQ without either being used, then free_head is updated as shown:
>
> free_head = 0 -> 1 -> 2
>
> And then marking both IDs as "used" results in:
>
> free_list[0] = current free_head (= 2)
> free_head = 0
> free_list[1] = current free_head (= 0)
> free_head = 1
>
> So free_list looks like this:
>
> index 0 -> 2
> index 1 -> 0
> index 2 -> 3
> index 3 -> 4
> ...
> index 254 -> 255
> index 255 -> 0
>
> None of the indices in free_list hold the value 1.
>

So the free_list should always traverse all the indices. If SVQ
makes available 0 and 1, and then the device uses 0 and 1 in that
order, the free list should be 1 -> 0 -> 2 -> 3... Otherwise the loop
will make the same descriptor available twice. You should always hit
this case in the rx queue of the Linux kernel virtio driver once it
uses two descriptors; maybe you can trace it and compare it with your
SVQ code?
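
Something along these lines, printed after every add and get_buf,
should catch the moment the list stops covering every free id. This
is just an untested sketch; adapt the names to your SVQ fields:

/* Untested sketch; adapt the names to the SVQ fields (e.g.
 * svq->desc_next, svq->free_head, svq->vring.num and however the
 * number of free ids is tracked). Walk the free list for as many
 * steps as there are free ids; a repeated or out-of-range id means
 * the same descriptor will eventually be made available twice. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_VQ_SIZE 1024    /* stand-in for QEMU's VIRTQUEUE_MAX_SIZE */

bool free_list_is_coherent(const uint16_t *desc_next, uint16_t free_head,
                           uint16_t num, uint16_t num_free)
{
    bool seen[MAX_VQ_SIZE] = { false };
    uint16_t id = free_head;

    for (uint16_t n = 0; n < num_free; n++) {
        if (id >= num || seen[id]) {
            fprintf(stderr, "free list broken at step %d (id %d)\n", n, id);
            return false;
        }
        seen[id] = true;    /* each free id must appear exactly once */
        id = desc_next[id];
    }
    return true;
}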

I just spotted that you're using le32_to_cpu() on the used descriptor
id in patch 5/7, but it should be le16_to_cpu(). It shouldn't matter
on x86, but it would cause problems on BE archs.
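
I.e. the read should look something like this (I took the field
names from your debug legend earlier in the thread, so adjust as
needed; the point is only that the id field is 16 bits wide):

/* The id field of a packed descriptor is 16 bits, so it needs
 * le16_to_cpu(); len is 32 bits and stays le32_to_cpu(). */
uint16_t id  = le16_to_cpu(svq->vring_packed.vring.desc[last_used].id);
uint32_t len = le32_to_cpu(svq->vring_packed.vring.desc[last_used].len);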

> Since at this point free_head is 1, ID 1 is now used very frequently
> if add_packed() and get_buf_packed() run one at a time in an interleaved
> fashion.
>

This part is expected: the free list works as a LIFO, so the most
recently freed ID is the first one to be reused.
Re: [RFC v5 0/7] Add packed format to shadow virtqueue
Posted by Sahil Siddiq 9 months, 4 weeks ago
Hi,

On 4/14/25 2:50 PM, Sahil Siddiq wrote:
> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>> I managed to fix a few issues while testing this patch series.
>>> There is still one issue that I am unable to resolve. I thought
>>> I would send this patch series for review in case I have missed
>>> something.
>>>
>>> The issue is that this patch series does not work every time. I
>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
>>> works.
>>
>> So we're on a very good track then!
>>
>>> When this doesn't work, both VMs throw a "Destination Host
>>> Unreachable" error. This is sometimes (not always) accompanied
>>> by the following kernel error (thrown by L2-kernel):
>>>
>>> virtio_net virtio1: output.0:id 1 is not a head!
>>>
>>
>> How many packets have been sent or received before hitting this? If
>> the answer to that is "the vq size", maybe there is a bug in the code
>> that handles the wraparound of the packed vq, as the used and avail
>> flags need to be twisted. You can count them in the SVQ code.
> 
> I did a lot more testing. This issue is quite unpredictable in terms
> of the time at which it appears after booting L2. So far, it almost
> always appears after booting L2. Even when pinging works, this issue
> appears after several seconds of pinging.
> 
> The total number of svq descriptors varied in every test run. But in
> every case, all 256 indices were filled in the descriptor region for
> vq with vq_idx = 0. This is the RX vq, right? This was filled while L2
> was booting. In the case when the ctrl vq is disabled, I am not sure
> what is responsible for filling the vqs in the data plane during
> booting.
> 
> =====
> The issue is hit most frequently when the following command is run
> in L0:
> $ ip addr add 111.1.1.1/24 dev tap0
> $ ip link set tap0 up
> 
> or, running the following in L2:
> # ip addr add 111.1.1.2/24 dev eth0
> 
> The other vq (vq_idx=1) is not filled completely before the issue is
> hit. I have been noting down the numbers and here is an example:
> 
> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
>      |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
> |_ 39 additions in vq_idx = 1
>      |_ 13 descriptors had id = 0
>      |_ 26 descriptors had id = 1
>      |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> 
> There was one case in which vq_idx=0 had wrapped around. I verified
> that flags were set appropriately during the wrap (avail and used flags
> were flipped as expected).
> 
> =====
> The next common situation where this issue is hit is during startup.
> Before L2 can finish booting successfully, this error is thrown:
> 
> virtio_net virtio1: output.0:id 0 is not a head!
> 
> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
>     |---- None of them were received by the device (vhost_svq_get_buf_packed)
> |_ 2 additions in vq_idx = 1
>     |_ id = 0 in index 0
>     |_ id = 1 in index 1
>     |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
> 
> =====
> Another case is after several seconds of pinging L0 from L2.
> 
> [   99.034114] virtio_net virtio1: output.0:id 0 is not a head!
> 
> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
> |   |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
> |_ 77 additions in vq_idx = 1
>      |_ 76 descriptors had id = 0
>      |_ 1 descriptor had id = 1
>      |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> 
> I am not entirely sure now if there's an issue in the packed vq
> implementation in QEMU or if this is being caused by some sort
> of race condition in Linux.

After some more testing, I think the issue is indeed in the current
implementation of packed vq in QEMU. The kernel does not crash when
using packed vqs with x-svq=false. I have an idea that might help
find the issue. It involves debugging the Linux kernel. I'll try this
out and will let you know how it goes.

> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
> [...]
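
For reference, the kernel check that produces this message looks
roughly like this (paraphrasing virtqueue_get_buf_ctx_packed() in
drivers/virtio/virtio_ring.c from memory, so the details may be
slightly off):

    id = le16_to_cpu(vq->packed.vring.desc[last_used].id);
    *len = le32_to_cpu(vq->packed.vring.desc[last_used].len);

    if (unlikely(id >= vq->packed.vring.num)) {
        BAD_RING(vq, "id %u out of range\n", id);
        return NULL;
    }
    if (unlikely(!vq->packed.desc_state[id].data)) {
        BAD_RING(vq, "id %u is not a head!\n", id);
        return NULL;
    }

So any id that QEMU flushes back to L2 after the guest driver has
already detached it (or before it was ever made available) will trip
the second check.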

Thanks,
Sahil