[PATCH 0/4] vl: Prioritize device realizations

Peter Xu posted 4 patches 2 years, 8 months ago
Test checkpatch passed
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20210818194217.110451-1-peterx@redhat.com
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, "Daniel P. Berrangé" <berrange@redhat.com>, Eduardo Habkost <ehabkost@redhat.com>
[PATCH 0/4] vl: Prioritize device realizations
Posted by Peter Xu 2 years, 8 months ago
This is a long-pending issue that we haven't fixed.  The issue is that QEMU
has implicit device ordering requirements at realization time; when they are
not met, some devices may not work properly.

The initial requirement dates from when vfio-pci started to work with vIOMMUs.
To make sure vfio-pci will get the correct DMA address space, the vIOMMU device
needs to be created before vfio-pci, otherwise vfio-pci will stop working when
the guest enables the vIOMMU and the device at the same time.

AFAIU libvirt has code that guarantees this ordering.  QEMU cmdline users need
to pay attention themselves, or things will stop working at some point.

Recently there's a growing, similar requirement for vDPA.  It's not a hard
requirement so far, but vDPA has patches that try to work around this issue.

This patchset allows us to realize devices in a fixed order, so that e.g.
platform devices (bus devices, IOMMUs, etc.) are created first, followed by the
rest of the normal devices.  It's done simply by sorting the QemuOptsList of
"device" entries before realization.  The priorities are currently taken from
the migration priorities, which may look a little odd, but migration ordering
is really about the same problem, and we can clean that part up in the future.
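
Roughly, the idea can be sketched as below (illustrative only, not the actual
patch; the priority lookup and all names here are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct DeviceOpt {
    const char *driver;   /* value of "-device <driver>,..." */
    int priority;         /* lower == realize earlier */
    size_t seq;           /* cmdline position, used as a tie-breaker */
} DeviceOpt;

/* Made-up priority lookup: platform-ish devices (e.g. vIOMMUs) first. */
static int realize_priority(const char *driver)
{
    return strstr(driver, "iommu") ? 0 : 10;
}

static int cmp_opts(const void *a, const void *b)
{
    const DeviceOpt *x = a, *y = b;

    if (x->priority != y->priority) {
        return x->priority - y->priority;
    }
    /* qsort() is not stable, so fall back to the cmdline order; this is
     * what keeps the relative ordering intact within a priority class. */
    return (x->seq > y->seq) - (x->seq < y->seq);
}

int main(void)
{
    DeviceOpt opts[] = { { "vfio-pci" }, { "virtio-net-pci" }, { "intel-iommu" } };
    size_t i, n = sizeof(opts) / sizeof(opts[0]);

    for (i = 0; i < n; i++) {
        opts[i].priority = realize_priority(opts[i].driver);
        opts[i].seq = i;
    }
    qsort(opts, n, sizeof(opts[0]), cmp_opts);

    /* Prints: intel-iommu, vfio-pci, virtio-net-pci. */
    for (i = 0; i < n; i++) {
        printf("realize: %s\n", opts[i].driver);
    }
    return 0;
}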

Libvirt can certainly keep its own ordering, so old QEMU binaries will still
work; however, that ordering won't be needed for new QEMUs after this patchset,
so with a new binary we should be able to specify '-device' on the qemu cmdline
in any order we wish.

Logically this should also work for vDPA, and its workaround code could then be
replaced with a more straightforward approach.

Please review, thanks.

Peter Xu (4):
  qdev-monitor: Trace qdev creation
  qemu-config: Allow in-place sorting of QemuOptsList
  qdev: Export qdev_get_device_class()
  vl: Prioritize realizations of devices

 include/monitor/qdev.h     |  2 ++
 include/qemu/config-file.h |  4 ++++
 softmmu/qdev-monitor.c     |  4 +++-
 softmmu/trace-events       |  3 +++
 softmmu/vl.c               | 35 +++++++++++++++++++++++++++
 util/qemu-config.c         | 48 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 95 insertions(+), 1 deletion(-)

-- 
2.31.1


Re: [PATCH 0/4] vl: Prioritize device realizations
Posted by David Hildenbrand 2 years, 6 months ago
On 18.08.21 21:42, Peter Xu wrote:
> [...]

Hi Peter, looks like I have another use case:

vhost devices can heavily restrict the number of available memslots:
e.g., upstream KVM supports ~64k, vhost-user usually 32 (!). With virtio-mem
intending to make use of multiple memslots [1] and auto-detecting how
many to use based on the currently available memslots when plugging and
realizing the virtio-mem device, this implies that realizing vhost
devices (especially vhost-user devices) after virtio-mem devices can
similarly result in issues: when trying to realize the vhost device
with restricted memslots, QEMU will bail out.

So similarly, we'd want to realize any vhost-* before any virtio-mem device.
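
To spell out the arithmetic with a sketch (the limits below are just the
rough figures quoted above, and the slot counts are made up):

#include <stdio.h>

static unsigned min_u(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

int main(void)
{
    unsigned kvm_limit = 64 * 1024;   /* upstream KVM: ~64k memslots */
    unsigned vhost_user_limit = 32;   /* typical vhost-user backend */
    unsigned used_by_others = 10;     /* boot memory, other devices, ... */

    /* vhost-user realized first: virtio-mem sees the real budget and
     * can auto-size itself to the remaining 22 memslots. */
    unsigned budget = min_u(kvm_limit, vhost_user_limit) - used_by_others;
    printf("virtio-mem may auto-use up to %u memslots\n", budget);

    /* virtio-mem realized first: it auto-sizes against the ~64k KVM
     * limit only, and the vhost-user device realized afterwards cannot
     * cope with that many slots -- QEMU bails out. */
    unsigned virtio_mem_slots = 256;  /* auto-detected without the limit */
    if (used_by_others + virtio_mem_slots > vhost_user_limit) {
        printf("vhost-user realize fails: %u slots in use, backend limit %u\n",
               used_by_others + virtio_mem_slots, vhost_user_limit);
    }
    return 0;
}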

Do you have any updated version of this patchset? Thanks!

[1] https://lkml.kernel.org/r/20211013103330.26869-1-david@redhat.com


-- 
Thanks,

David / dhildenb


Re: [PATCH 0/4] vl: Prioritize device realizations
Posted by Daniel P. Berrangé 2 years, 6 months ago
On Wed, Oct 20, 2021 at 03:44:08PM +0200, David Hildenbrand wrote:
> On 18.08.21 21:42, Peter Xu wrote:
> > [...]
> 
> Hi Peter, looks like I have another use case:
> 
> vhost devices can heavily restrict the number of available memslots:
> e.g., upstream KVM supports ~64k, vhost-user usually 32 (!). With virtio-mem
> intending to make use of multiple memslots [1] and auto-detecting how
> many to use based on the currently available memslots when plugging and
> realizing the virtio-mem device, this implies that realizing vhost
> devices (especially vhost-user devices) after virtio-mem devices can
> similarly result in issues: when trying to realize the vhost device
> with restricted memslots, QEMU will bail out.
> 
> So similarly, we'd want to realize any vhost-* before any virtio-mem device.

Ordering virtio-mem vs vhost-* devices doesn't feel like a good
solution to this problem. E.g., if you start a guest with several
vhost-* devices and virtio-mem auto-decides to use all/most of the
remaining memslots, we've now surely broken the ability to then
hotplug more vhost-* devices at runtime, by not leaving memslots
for them.

I think virtio-mem configuration needs to be stable in its memslot
usage regardless of how many other types of devices are present,
and not auto-adjust how many it consumes.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH 0/4] vl: Prioritize device realizations
Posted by David Hildenbrand 2 years, 6 months ago
On 20.10.21 15:48, Daniel P. Berrangé wrote:
> On Wed, Oct 20, 2021 at 03:44:08PM +0200, David Hildenbrand wrote:
>> [...]
>>
>> So similarly, we'd want to realize any vhost-* before any virtio-mem device.
> 
> Ordering virtio-mem vs vhost-* devices doesn't feel like a good
> solution to this problem. E.g., if you start a guest with several
> vhost-* devices and virtio-mem auto-decides to use all/most of the
> remaining memslots, we've now surely broken the ability to then
> hotplug more vhost-* devices at runtime, by not leaving memslots
> for them.

You can hotplug vhost-* devices as you want; they don't "consume"
memslots, they can only restrict the total number of memslots if they
support fewer.

We have this situation today already:

Coldplug/hotplug > 32 DIMMs to a VM. Then hotplug a vhost-user device
that's based on libvhost-user or Rust's vhost-user-backend. The hotplug
will fail.

Nothing is really different with virtio-mem, except that you can configure
how many memslots it should actually use if you care about the above
situation.

> 
> I think virtio-mem configuration needs to be stable in its memslot
> usage regardless of how many other types of devices are present,
> and not auto-adjust how many it consumes.

There is a parameter to limit the number of memslots a virtio-mem device
can use ("max-memslots") to handle such corner-case environments as you
describe; a sketch of the semantics follows below.

Set to 1 - exactly one ("old behavior").
Set to 0 - auto-detect.
Set to > 1 - auto-detect and cap at the given value.
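
A sketch of those semantics (the parameter name is from above; the
auto-detection heuristic below is just a placeholder):

#include <stdio.h>

/* Placeholder heuristic; real auto-detection is part of the virtio-mem
 * proposal, not this sketch. */
static unsigned auto_detect(unsigned free_memslots)
{
    unsigned n = free_memslots / 2;

    return n ? n : 1;
}

static unsigned memslots_to_use(unsigned max_memslots, unsigned free_memslots)
{
    unsigned n;

    if (max_memslots == 1) {
        return 1;                           /* exactly one: "old behavior" */
    }
    if (max_memslots == 0) {
        return auto_detect(free_memslots);  /* pure auto-detect */
    }
    /* > 1: auto-detect, capped at the user-given value. */
    n = auto_detect(free_memslots);
    return n < max_memslots ? n : max_memslots;
}

int main(void)
{
    printf("%u\n", memslots_to_use(1, 100));    /* 1  */
    printf("%u\n", memslots_to_use(0, 100));    /* 50 */
    printf("%u\n", memslots_to_use(8, 100));    /* 8  */
    return 0;
}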

99.999% of all users don't care about hotplugging vhost devices that limit
the number of memslots and will happily use "0". The remainder can be
handled via realization priority. Nothing to confuse ordinary users with,
IMHO.

-- 
Thanks,

David / dhildenb


Re: [PATCH 0/4] vl: Prioritize device realizations
Posted by Peter Xu 2 years, 6 months ago
On Wed, Oct 20, 2021 at 03:44:08PM +0200, David Hildenbrand wrote:
> On 18.08.21 21:42, Peter Xu wrote:
> > [...]
> 
> Hi Peter, looks like I have another use case:

Hi, David,

> 
> [...]
> 
> Do you have any updated version of this patchset? Thanks!

Yes I should follow this up, thanks for asking.

Though after Markus and Igor pointed out to me that there's much more to order
than device and object types, I don't have a good way to fix the ordering issue
for good across all the problems; obviously the current solution only applies
to device class ordering.

Examples that Markus provided:

https://lore.kernel.org/qemu-devel/87ilzj81q7.fsf@dusky.pond.sub.org/

Also there can be inter-dependency issues within a single device class, e.g.,
for PCI buses: if bus pcie.2 has a parent PCI bus pcie.1, then we must specify
the "-device" for pcie.1 before the "-device" for pcie.2, otherwise QEMU will
fail to boot too.
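
For instance (hypothetical cmdline; "some-bridge" stands in for whatever
device model provides the child bus):

  # works: the parent bus is specified first
  -device pcie-root-port,id=pcie.1,bus=pcie.0 \
  -device some-bridge,id=pcie.2,bus=pcie.1

  # fails: pcie.1 does not exist yet when pcie.2 is realized
  -device some-bridge,id=pcie.2,bus=pcie.1 \
  -device pcie-root-port,id=pcie.1,bus=pcie.0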

Any of the above examples means ordering based on device class can only solve
part of the problem, not all of it.

And I can buy into what people worry about regarding yet another way to fix
ordering while the root issue is still unsettled, even if the current solution
seems to work for vIOMMU/vfio, and I have a feeling it could work for the
virtio-mem issue you're tackling too.

My plan is to move on with what Igor suggested: using the pre_plug hook of the
vIOMMU to make sure no special devices like vfio-pci are realized before it.  I
think it'll be as simple as a PCI bus scan on all the PCIe host bridges looking
for vfio-pci; it can even be put directly into realize(), I think, as I don't
see an obvious difference between failing in pre_plug() or in realize() so far.
Then I'll just drop this series, so the new version may not really help with
virtio-mem anymore; though I'm not sure whether virtio-mem can do a similar
thing.
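
Something along these lines, modeled with plain structs rather than the real
qdev API (illustrative only):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct Dev {
    const char *driver;
    struct Dev *child;   /* first device on the bus this device exposes */
    struct Dev *next;    /* next device on the same bus */
} Dev;

/* Recursively scan a bus hierarchy for an already-created vfio-pci. */
static bool bus_has_vfio_pci(const Dev *d)
{
    for (; d; d = d->next) {
        if (!strcmp(d->driver, "vfio-pci") || bus_has_vfio_pci(d->child)) {
            return true;
        }
    }
    return false;
}

int main(void)
{
    /* pcie-host-bridge -> pcie-root-port -> vfio-pci */
    Dev vfio = { "vfio-pci", NULL, NULL };
    Dev port = { "pcie-root-port", &vfio, NULL };
    Dev host = { "pcie-host-bridge", &port, NULL };

    if (bus_has_vfio_pci(&host)) {
        /* In QEMU this would fail the vIOMMU's pre_plug()/realize(). */
        printf("error: vfio-pci must be specified after the vIOMMU\n");
        return 1;
    }
    return 0;
}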

Taking one step back, OTOH, I do second what Daniel commented in the other
thread about leaving that problem to the user; it's sad to know that we already
have the pmem restriction, so hot-plugging some devices can already start to
fail, but maybe failing is fine as long as nothing will crash? :)

I also do think it's nice to at least allow the user to specify the exact
value of the virtio-mem slot number, without any smart tricks being played,
when the user wants to - I think it's still okay to do automatic detection, but
to me that's already part of "policy" rather than "mechanism", so IMHO it had
better be optional.  And now I have a feeling that maybe QEMU should just be
good at providing these mechanisms first, and then we leave the rest of the
problems to libvirt; maybe that's a better place to do all these sanity checks
and the smart things like deciding slot numbers.  For QEMU, failing at the
right point without interrupting the guest seems to be good enough so far.

I think "early failing" seems to not be a problem for virtio-mem already since
if there's a conflict on the slot number then e.g. vhost-user will already fail
early, not sure whether it means it's good enough.  For vIOMMU I may need to
work on the other bus scanning patchset to make sure when vfio is specified
before vIOMMU then we should fail qemu early, and that's still missing.

Thanks,

-- 
Peter Xu


Re: [PATCH 0/4] vl: Prioritize device realizations
Posted by David Hildenbrand 2 years, 6 months ago
> Yes I should follow this up, thanks for asking.
> 
> Though after Markus and Igor pointed out to me that there's much more to
> order than device and object types, I don't have a good way to fix the
> ordering issue for good across all the problems; obviously the current
> solution only applies to device class ordering.
> 
> Examples that Markus provided:
> 
> https://lore.kernel.org/qemu-devel/87ilzj81q7.fsf@dusky.pond.sub.org/
> 
> Also there can be inter-dependency issues within a single device class,
> e.g., for PCI buses: if bus pcie.2 has a parent PCI bus pcie.1, then we
> must specify the "-device" for pcie.1 before the "-device" for pcie.2,
> otherwise QEMU will fail to boot too.
> 
Interesting case :)

> Any of the above examples means ordering based on device class can only
> solve part of the problem, not all of it.
> 
> And I can buy into what people worry about regarding yet another way to fix
> ordering while the root issue is still unsettled, even if the current
> solution seems to work for vIOMMU/vfio, and I have a feeling it could work
> for the virtio-mem issue you're tackling too.

The question is whether we need to get it all 100% right from the start. To
me, for example, the single-device-class issue is a whole different beast.

I know, whenever someone proposes a way to tackle part of a challenging
problem, everybody discovers their hopes and dreams and suddenly you
have to go all the way to solve the complete problem. The end result is
that there is no improvement at all instead of incremental improvement.

> 
> My plan is to move on with what Igor suggested: using the pre_plug hook of
> the vIOMMU to make sure no special devices like vfio-pci are realized before
> it.  I think it'll be as simple as a PCI bus scan on all the PCIe host
> bridges looking for vfio-pci; it can even be put directly into realize(), I
> think, as I don't see an obvious difference between failing in pre_plug() or
> in realize() so far.  Then I'll just drop this series, so the new version
> may not really help with virtio-mem anymore; though I'm not sure whether
> virtio-mem can do a similar thing.

In the case of virtio-mem, we'll already fail properly when realizing the
vhost-* device and figuring out that the memslot limit the device
imposes isn't going to work. So what you would want to have for the vIOMMU
is already in place (failing instead of silently continuing).

> 
> Taking one step back, OTOH, I do second what Daniel commented in the other
> thread about leaving that problem to the user; it's sad to know that we
> already have the pmem restriction, so hot-plugging some devices can already
> start to fail, but maybe failing is fine as long as nothing will crash? :)

I very much agree.

There are two cases:

1. QEMU failing to start because vhost-* was specified after virtio-mem.
We get an indication that points at the number of memslots. And as the
user explicitly enabled this via e.g. "max-memslots=0", I think that's fair
enough.

2. Hotplug of vhost-* devices failing. We similarly get an indication
that points at the number of memslots, and again I think that's fair
enough. The guest will continue working just fine.

The result of that discussion is that the default should always be
"max-memslots=1" and that users have to opt in manually.

> 
> I also do think it's nice to at least allow the user to specify the exact
> value of the virtio-mem slot number, without any smart tricks being played,
> when the user wants to - I think it's still okay to do automatic detection,
> but to me that's already part of "policy" rather than "mechanism", so IMHO
> it had better be optional.  And now I have a feeling that maybe QEMU should
> just be good at providing these mechanisms first, and then we leave the rest
> of the problems to libvirt; maybe that's a better place to do all these
> sanity checks and the smart things like deciding slot numbers.  For QEMU,
> failing at the right point without interrupting the guest seems to be good
> enough so far.

I'm not planning on letting the user set the actual number of memslots
to use, only an upper limit. But to me, it's fundamentally the same: the
user has to enable this behavior explicitly.

> 
> I think "early failing" seems to not be a problem for virtio-mem already since
> if there's a conflict on the slot number then e.g. vhost-user will already fail
> early, not sure whether it means it's good enough.  For vIOMMU I may need to
> work on the other bus scanning patchset to make sure when vfio is specified
> before vIOMMU then we should fail qemu early, and that's still missing.

Right, thanks!


-- 
Thanks,

David / dhildenb


Re: [PATCH 0/4] vl: Prioritize device realizations
Posted by Peter Xu 2 years, 6 months ago
On Thu, Oct 21, 2021 at 09:17:57AM +0200, David Hildenbrand wrote:
> I know, whenever someone proposes a way to tackle part of a challenging
> problem, everybody discovers their hopes and dreams and suddenly you
> have to go all the way to solve the complete problem. The end result is
> that there is no improvement at all instead of incremental improvement.

Yeah, there's the trade-off; either we don't move forward, or we potentially
bring (more) chaos so the code becomes less maintainable.  Until I'm sure I
won't do the latter, and can convince the others, I need to hold off a bit. :-)

> I'm not planning on letting the user set the actual number of memslots
> to use, only an upper limit. But to me, it's fundamentally the same: the
> user has to enable this behavior explicitly.

I'm not familiar enough with virtio-mem's side; it's just that if it stops
working when the ideal value (even in a very corner case) is less than the
specified maximum, then that trick stops people from specifying the ideal
value.  But if it's "the bigger the better" then indeed I don't see much to
worry about.

-- 
Peter Xu


Re: [PATCH 0/4] vl: Prioritize device realizations
Posted by David Hildenbrand 2 years, 6 months ago
On 21.10.21 10:00, Peter Xu wrote:
> On Thu, Oct 21, 2021 at 09:17:57AM +0200, David Hildenbrand wrote:
>> I know, whenever someone proposes a way to tackle part of a challenging
>> problem, everybody discovers their hopes and dreams and suddenly you
>> have to go all the way to solve the complete problem. The end result is
>> that there is no improvement at all instead of incremental improvement.
> 
> Yeah, there's the trade-off; either we don't move forward, or we
> potentially bring (more) chaos so the code becomes less maintainable.
> Until I'm sure I won't do the latter, and can convince the others, I need
> to hold off a bit. :-)

Sure :)

>> I'm not planning on letting the user set the actual number of memslots
>> to use, only an upper limit. But to me, it's fundamentally the same: the
>> user has to enable this behavior explicitly.
> 
> I'm not familiar enough with virtio-mem's side; it's just that if it stops
> working when the ideal value (even in a very corner case) is less than the
> specified maximum, then that trick stops people from specifying the ideal
> value.  But if it's "the bigger the better" then indeed I don't see much to
> worry about.

Usually it's "the bigger the better", but there are a lot of exceptions,
and error handling on weird user input is a little hairy ... but I'm
playing with it right now, essentially having

"memslots=0" -> auto-detect as well as possible
"memslots=1" -> default
"memslots>1" -> use the user input, but bail out if some conditions aren't met.
In particular, fail plugging if there are not sufficient free memslots around.

-- 
Thanks,

David / dhildenb