[RFC 0/8] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci

Greg Kurz posted 8 patches 3 years ago
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20210325150735.1098387-1-groug@kaod.org
Maintainers: Stefan Hajnoczi <stefanha@redhat.com>, Max Reitz <mreitz@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Fam Zheng <fam@euphon.net>, Kevin Wolf <kwolf@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>
There is a newer version of this series
[RFC 0/8] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
Posted by Greg Kurz 3 years ago
Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU,
a serious slowdown may be observed on setups with a large enough number
of vCPUs.

Example with a pseries guest on a bi-POWER9 socket system (128 HW threads):

1		0m20.922s	0m21.346s
2		0m21.230s	0m20.350s
4		0m21.761s	0m20.997s
8		0m22.770s	0m20.051s
16		0m22.038s	0m19.994s
32		0m22.928s	0m20.803s
64		0m26.583s	0m22.953s
128		0m41.273s	0m32.333s
256		2m4.727s 	1m16.924s
384		6m5.563s 	3m26.186s

Both perf and gprof indicate that QEMU is hogging CPUs when setting up
the ioeventfds:

 67.88%  swapper         [kernel.kallsyms]  [k] power_pmu_enable
  9.47%  qemu-kvm        [kernel.kallsyms]  [k] smp_call_function_single
  8.64%  qemu-kvm        [kernel.kallsyms]  [k] power_pmu_enable
=>2.79%  qemu-kvm        qemu-kvm           [.] memory_region_ioeventfd_before
=>2.12%  qemu-kvm        qemu-kvm           [.] address_space_update_ioeventfds
  0.56%  kworker/8:0-mm  [kernel.kallsyms]  [k] smp_call_function_single

address_space_update_ioeventfds() is called when committing an MR
transaction, i.e. once per ioeventfd with the current code base,
and it internally loops over all ioeventfds:

static void address_space_update_ioeventfds(AddressSpace *as)
{
[...]
    FOR_EACH_FLAT_RANGE(fr, view) {
        for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {

This means that the setup of ioeventfds for these devices has
quadratic time complexity.

This series introduces generic APIs to allow batch creation and deletion
of ioeventfds, and converts virtio-blk and virtio-scsi to use them. This
greatly improves the numbers:

1		0m21.271s	0m22.076s
2		0m20.912s	0m19.716s
4		0m20.508s	0m19.310s
8		0m21.374s	0m20.273s
16		0m21.559s	0m21.374s
32		0m22.532s	0m21.271s
64		0m26.550s	0m22.007s
128		0m29.115s	0m27.446s
256		0m44.752s	0m41.004s
384		1m2.884s	0m58.023s
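
For reference, the conversion essentially boils down to something like
the following (simplified sketch with a hypothetical helper name; the
actual patches add dedicated virtio-bus and memory APIs rather than
open-coding this in each device):

static int set_host_notifiers_batched(VirtioBusState *bus, int nvqs)
{
    int i, err = 0;

    /* Defer the flatview/ioeventfd rebuild until the final commit */
    memory_region_transaction_begin();

    for (i = 0; i < nvqs; i++) {
        err = virtio_bus_set_host_notifier(bus, i, true);
        if (err < 0) {
            break;
        }
    }

    /*
     * address_space_update_ioeventfds() now runs once per address space
     * here, instead of once per virtqueue.
     */
    memory_region_transaction_commit();

    return err;
}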

The series deliberately spans multiple subsystems to make review and
experimentation easier. It also includes some preliminary fixes along
the way. It is therefore posted as an RFC for now, but if the general
idea is acceptable, a non-RFC could be posted and the feature possibly
extended to other devices that might suffer from similar scaling
issues, e.g. vhost-scsi-pci, vhost-user-scsi-pci and vhost-user-blk-pci,
though I haven't checked those.

This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108
which reported the issue for virtio-scsi-pci.

Greg Kurz (8):
  memory: Allow eventfd add/del without starting a transaction
  virtio: Introduce virtio_bus_set_host_notifiers()
  virtio: Add API to batch set host notifiers
  virtio-pci: Batch add/del ioeventfds in a single MR transaction
  virtio-blk: Fix rollback path in virtio_blk_data_plane_start()
  virtio-blk: Use virtio_bus_set_host_notifiers()
  virtio-scsi: Set host notifiers and callbacks separately
  virtio-scsi: Use virtio_bus_set_host_notifiers()

 hw/virtio/virtio-pci.h          |  1 +
 include/exec/memory.h           | 48 ++++++++++++++++------
 include/hw/virtio/virtio-bus.h  |  7 ++++
 hw/block/dataplane/virtio-blk.c | 26 +++++-------
 hw/scsi/virtio-scsi-dataplane.c | 68 ++++++++++++++++++--------------
 hw/virtio/virtio-bus.c          | 70 +++++++++++++++++++++++++++++++++
 hw/virtio/virtio-pci.c          | 53 +++++++++++++++++--------
 softmmu/memory.c                | 42 ++++++++++++--------
 8 files changed, 225 insertions(+), 90 deletions(-)

-- 
2.26.3



Re: [RFC 0/8] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
Posted by Michael S. Tsirkin 3 years ago
On Thu, Mar 25, 2021 at 04:07:27PM +0100, Greg Kurz wrote:
> [...]


Series looks ok from a quick look ...

this is a regression isn't it?
So I guess we'll need that in 6.0 or revert the # of vqs
change for now ...



Re: [RFC 0/8] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
Posted by Stefan Hajnoczi 3 years ago
On Thu, Mar 25, 2021 at 01:05:16PM -0400, Michael S. Tsirkin wrote:
> On Thu, Mar 25, 2021 at 04:07:27PM +0100, Greg Kurz wrote:
> > [...]
> 
> 
> Series looks ok from a quick look ...
> 
> this is a regression isn't it?
> So I guess we'll need that in 6.0 or revert the # of vqs
> change for now ...

Commit 9445e1e15e66c19e42bea942ba810db28052cd05 ("virtio-blk-pci:
default num_queues to -smp N") was already released in QEMU 5.2.0. It is
not a QEMU 6.0 regression.

I'll review this series on Monday.

Stefan
Re: [RFC 0/8] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
Posted by Greg Kurz 3 years ago
On Thu, 25 Mar 2021 17:43:10 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Thu, Mar 25, 2021 at 01:05:16PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Mar 25, 2021 at 04:07:27PM +0100, Greg Kurz wrote:
> > > [...]
> > 
> > 
> > Series looks ok from a quick look ...
> > 
> > this is a regression isn't it?
> > So I guess we'll need that in 6.0 or revert the # of vqs
> > change for now ...
> 
> Commit 9445e1e15e66c19e42bea942ba810db28052cd05 ("virtio-blk-pci:
> default num_queues to -smp N") was already released in QEMU 5.2.0. It is
> not a QEMU 6.0 regression.
> 

Oh you're right, I've just checked and QEMU 5.2.0 has the same problem.

> I'll review this series on Monday.
> 

Thanks!

> Stefan

Re: [RFC 0/8] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
Posted by Greg Kurz 3 years ago
On Thu, 25 Mar 2021 13:05:16 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Mar 25, 2021 at 04:07:27PM +0100, Greg Kurz wrote:
> > [...]
> 
> 
> Series looks ok from a quick look ...
> 
> this is a regression isn't it?

This is a regression only if we consider the defaults. Manually setting
num_queues to a high value already affects existing devices.

> So I guess we'll need that in 6.0 or revert the # of vqs
> change for now ...
> 

I'm not sure it is safe to merge these fixes this late... Also, as
said in the cover letter, I've only tested virtio-scsi and virtio-blk,
but I believe the vhost-user-* variants might be affected too.

Reverting the # of vqs change would really be a pity IMHO. What
about mitigating the effects instead, e.g. enforcing the previous
behavior only if # vcpus > 64?
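
Something along these lines is what I have in mind (hypothetical sketch
only, not part of this series; the helper name, signature and threshold
are approximate):

/*
 * Hypothetical mitigation: fall back to the pre-5.2 default of a single
 * queue when the vCPU count is large enough for the per-ioeventfd MR
 * commits to hurt. The 64 threshold is arbitrary.
 */
#define VIRTIO_PCI_AUTO_QUEUES_CPU_LIMIT 64

unsigned virtio_pci_optimal_num_queues(unsigned fixed_queues)
{
    unsigned cpus = current_machine->smp.cpus;

    if (cpus > VIRTIO_PCI_AUTO_QUEUES_CPU_LIMIT) {
        return 1;
    }
    return MIN(cpus, VIRTIO_QUEUE_MAX - fixed_queues);
}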



Re: [RFC 0/8] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
Posted by Stefan Hajnoczi 3 years ago
On Thu, Mar 25, 2021 at 04:07:27PM +0100, Greg Kurz wrote:
> [...]
> 
> This series introduces generic APIs to allow batch creation and deletion
> of ioeventfds, and converts virtio-blk and virtio-scsi to use them. This
> greatly improves the numbers:
> 
> 1		0m21.271s	0m22.076s
> 2		0m20.912s	0m19.716s
> 4		0m20.508s	0m19.310s
> 8		0m21.374s	0m20.273s
> 16		0m21.559s	0m21.374s
> 32		0m22.532s	0m21.271s
> 64		0m26.550s	0m22.007s
> 128		0m29.115s	0m27.446s
> 256		0m44.752s	0m41.004s
> 384		1m2.884s	0m58.023s

Excellent numbers!

I wonder if the code can be simplified since
memory_region_transaction_begin/end() supports nesting. Why not call
them directly from the device model instead of introducing callbacks in
core virtio and virtio-pci code?
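
Since nested begin/commit pairs only trigger the rebuild at the
outermost commit, the device model could roughly do something like this
(untested sketch reusing the existing helpers):

memory_region_transaction_begin();
for (i = 0; i < nvqs; i++) {
    r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, true);
    if (r < 0) {
        break;
    }
}
/* the single flatview/ioeventfd update happens here, at depth 0 */
memory_region_transaction_commit();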

Also, do you think there are other opportunities to have a long
transaction to batch up machine init, device hotplug, etc? It's not
clear to me when transactions must be ended. Clearly it's necessary to
end the transaction if we need to do something that depends on the
MemoryRegion, eventfd, etc being updated. But most of the time there is
no immediate need to end the transaction and more code could share the
same transaction before we go back to the event loop or vcpu thread.

Stefan
Re: [RFC 0/8] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci
Posted by Greg Kurz 2 years, 12 months ago
On Mon, 29 Mar 2021 18:35:16 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Thu, Mar 25, 2021 at 04:07:27PM +0100, Greg Kurz wrote:
> > [...]
> 
> Excellent numbers!
> 
> I wonder if the code can be simplified since
> memory_region_transaction_begin/end() supports nesting. Why not call
> them directly from the device model instead of introducing callbacks in
> core virtio and virtio-pci code?
> 

It seems a bit awkward for the device model to assume that a memory
transaction is needed to set up host notifiers: they are ioeventfds
under the hood, but the device doesn't know that.

> Also, do you think there are other opportunities to have a long
> transaction to batch up machine init, device hotplug, etc? It's not
> clear to me when transactions must be ended. Clearly it's necessary to

The transaction *must* be ended before calling
virtio_bus_cleanup_host_notifier() because
address_space_add_del_ioeventfds(), called when
finishing the transaction, needs the "to-be-closed"
eventfds to still be open; otherwise the KVM_IOEVENTFD
ioctl() might fail with EBADF.

See this change in patch 3:

@@ -315,6 +338,10 @@ static void virtio_bus_unset_and_cleanup_host_notifiers(VirtioBusState *bus,
 
     for (i = 0; i < nvqs; i++) {
         virtio_bus_set_host_notifier(bus, i + n_offset, false);
+    }
+    /* Let address_space_update_ioeventfds() run before closing ioeventfds */
+    virtio_bus_set_host_notifier_commit(bus);
+    for (i = 0; i < nvqs; i++) {
         virtio_bus_cleanup_host_notifier(bus, i + n_offset);
     }
 }

Maybe I should provide more details on why we're doing that?

> end the transaction if we need to do something that depends on the
> MemoryRegion, eventfd, etc being updated. But most of the time there is
> no immediate need to end the transaction and more code could share the
> same transaction before we go back to the event loop or vcpu thread.
> 

I can't speak for all scenarios that involve memory transactions, but
it seems this is definitely not the case for ioeventfds: the rest
of the code expects the transaction to be complete.

> Stefan

Thanks for the review!

Cheers,

--
Greg