src/qemu/qemu_command.c | 20 ++- src/qemu/qemu_domain.h | 3 + src/qemu/qemu_hotplug.c | 322 ++++++++++++++++++++++++++++++++++++++-- src/qemu/qemu_hotplug.h | 3 + src/qemu/qemu_process.c | 38 ++++- 5 files changed, 368 insertions(+), 18 deletions(-)
From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> This patch series is RFC to implement to make <transient/> disk sharable. Currently, <transient/> disk option for qemu uses blockdev-snapshot QMP command with overlay. In that case, qemu holds the write lock to the original disk, so we cannot share the original disks with the other qemu guests. This patch series tries to implement to make the disks, which have <transient/> disk option, sharable by using disk hot-plug. First, create the overlay disk with the original disk is set as the backingStore. Then, blockdev-del the StorageProgs and FormatProgs of the disk. That's because to fix the bootindex of the disk. Lastly, device_add the disks and CPUs start. The sharable transient disk has the same limitation as the current <transient/> disk option, and also has the following limitations: - qemu needs to have hotplug feature - All the disk bus is virtio Masayoshi Mizuma (7): qemu_hotplug: Add asynJob argument to qemuDomainAttachDiskGeneric qemu_hotplug: Add asyncJob argument to qemuDomainAttachDeviceDiskLiveInternal qemu_hotplug: Add asynJob argument to qemuDomainRemoveDiskDevice qemu_hotplug: Add bootindex argument to qemuDomainAttachDiskGeneric qemu_hotplug: make <transient/> disk sharable qemu: Add check whether the transient disks are sharable qemu: Add virtio disks as sharable transient disks src/qemu/qemu_command.c | 20 ++- src/qemu/qemu_domain.h | 3 + src/qemu/qemu_hotplug.c | 322 ++++++++++++++++++++++++++++++++++++++-- src/qemu/qemu_hotplug.h | 3 + src/qemu/qemu_process.c | 38 ++++- 5 files changed, 368 insertions(+), 18 deletions(-) -- 2.27.0
On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote: > From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> > > This patch series is RFC to implement to make <transient/> disk sharable. > > Currently, <transient/> disk option for qemu uses blockdev-snapshot QMP > command with overlay. > In that case, qemu holds the write lock to the original disk, so we cannot > share the original disks with the other qemu guests. > > This patch series tries to implement to make the disks, which have > <transient/> disk option, sharable by using disk hot-plug. > > First, create the overlay disk with the original disk is set as the backingStore. > Then, blockdev-del the StorageProgs and FormatProgs of the disk. That's > because to fix the bootindex of the disk. > Lastly, device_add the disks and CPUs start. So I had a look at the proposed implementation and its too ugly and complicated. Firstly one of the overcomplicated bits is the part which adds the disk backend, then removes it and then adds it back. That's too much. Secondly you are hotplugging ALL virtio disks even those which are not transient, that won't work either. This must stay limited to <transient> disk to minimize the impact. Thirdly the code is also overcomplicated by the fact that you workaround the fact that qemuDomainAttachDeviceDiskLiveInternal doesn't actually support hotplug of transient disks. You then prepare everything somewhere behind it's back and try to hotplug the disk as not-transient. It would be way better if you support hotplug of transient disk in the first place reusing the hotplug of the backend and adding image creating for the frontend. I also don't understand the issue with bootindex. If the problem is that bootindex is calculated in qemuBuildDisksCommandLine and used directly then you'll need to refactor the code in first place, pre-calculate the indices and store them in the disk private data and then use them from there, but you must not modify disk->info.bootIndex as that gets represented back to the user when dumping XML. The new code also the name of the VM as suffix to the temporary image, but doesn't update the previous code which is used for disks which don't support hotplug. That needs to be fixed too. The fact that we are doing hotplug of the disk should also stay hidden, so the code doing the hotplug should not actually emit the device-added event and the whole thing should not need to update the device list at all, but should be placed before the regular update of the device list. In the end that means you shouldn't need 'TransientDiskSharable' private data variable and shouldn do any modification of disk->src->readonly and such. Please note that this is a high level review. I've spotted some coding style inconsistencies and such, but since this will need significant rework I'll not point them out. Also don't forget to add test cases, where it will be visible that the disk (neither frontend nor backend) is added on the commandline. Also it should be fairly trivial to support it for SCSI disks since they support hotplug too.
On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote: > On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote: > > From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> > > > > This patch series is RFC to implement to make <transient/> disk sharable. > > > > Currently, <transient/> disk option for qemu uses blockdev-snapshot QMP > > command with overlay. > > In that case, qemu holds the write lock to the original disk, so we cannot > > share the original disks with the other qemu guests. > > > > This patch series tries to implement to make the disks, which have > > <transient/> disk option, sharable by using disk hot-plug. > > > > First, create the overlay disk with the original disk is set as the backingStore. > > Then, blockdev-del the StorageProgs and FormatProgs of the disk. That's > > because to fix the bootindex of the disk. > > Lastly, device_add the disks and CPUs start. > > So I had a look at the proposed implementation and its too ugly and > complicated. > > Firstly one of the overcomplicated bits is the part which adds the disk > backend, then removes it and then adds it back. That's too much. > > Secondly you are hotplugging ALL virtio disks even those which are not > transient, that won't work either. This must stay limited to <transient> > disk to minimize the impact. > > Thirdly the code is also overcomplicated by the fact that you workaround > the fact that qemuDomainAttachDeviceDiskLiveInternal doesn't actually > support hotplug of transient disks. You then prepare everything > somewhere behind it's back and try to hotplug the disk as not-transient. > > It would be way better if you support hotplug of transient disk in the > first place reusing the hotplug of the backend and adding image creating > for the frontend. > > I also don't understand the issue with bootindex. If the problem is that > bootindex is calculated in qemuBuildDisksCommandLine and used directly > then you'll need to refactor the code in first place, pre-calculate the > indices and store them in the disk private data and then use them from > there, but you must not modify disk->info.bootIndex as that gets > represented back to the user when dumping XML. > > The new code also the name of the VM as suffix to the temporary image, > but doesn't update the previous code which is used for disks which don't > support hotplug. That needs to be fixed too. > > The fact that we are doing hotplug of the disk should also stay hidden, > so the code doing the hotplug should not actually emit the device-added > event and the whole thing should not need to update the device list at > all, but should be placed before the regular update of the device list. > > In the end that means you shouldn't need 'TransientDiskSharable' private > data variable and shouldn do any modification of disk->src->readonly and > such. > > Please note that this is a high level review. I've spotted some coding > style inconsistencies and such, but since this will need significant > rework I'll not point them out. > > Also don't forget to add test cases, where it will be visible that the > disk (neither frontend nor backend) is added on the commandline. > > Also it should be fairly trivial to support it for SCSI disks since they > support hotplug too. Is there a reason we can't use the existing <readonly/> element as a way to make the underlying base image be treated as read only and thus become sharable ? Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Mon, Jan 25, 2021 at 09:33:31 +0000, Daniel Berrange wrote: > On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote: > > On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote: > > > From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> [...] > > Please note that this is a high level review. I've spotted some coding > > style inconsistencies and such, but since this will need significant > > rework I'll not point them out. > > > > Also don't forget to add test cases, where it will be visible that the > > disk (neither frontend nor backend) is added on the commandline. > > > > Also it should be fairly trivial to support it for SCSI disks since they > > support hotplug too. > > Is there a reason we can't use the existing <readonly/> element as a way > to make the underlying base image be treated as read only and thus become > sharable ? The problem is that the qemu device frontend code infers the readonly state from the backend image specification when it's instantiated. Thus if you start qemu with the topmost layer marked as read-only so that the write lock doesn't get asserted, this fact gets stored in the frontend. If you then later insert a writable layer into the chain on top of it, the frontend stays read-only. IDE/SATA disks are outhright refused to be read-only by qemu, virtio disks remember that they are read-only and keep it into guest execution and SCSI disks caused an abort() (which was recently fixed in qemu upstream.) Thus the only two options are: 1) hotplug the frontend after we do the shenanigans with the image 2) implement explicit control of the read-only state of the frontend in qemu in the first place, adapt to it in libvirt and then keep using the current approach in libvirt
On Mon, Jan 25, 2021 at 10:51:09AM +0100, Peter Krempa wrote: > On Mon, Jan 25, 2021 at 09:33:31 +0000, Daniel Berrange wrote: > > On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote: > > > On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote: > > > > From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> > > [...] > > > > Please note that this is a high level review. I've spotted some coding > > > style inconsistencies and such, but since this will need significant > > > rework I'll not point them out. > > > > > > Also don't forget to add test cases, where it will be visible that the > > > disk (neither frontend nor backend) is added on the commandline. > > > > > > Also it should be fairly trivial to support it for SCSI disks since they > > > support hotplug too. > > > > Is there a reason we can't use the existing <readonly/> element as a way > > to make the underlying base image be treated as read only and thus become > > sharable ? > > The problem is that the qemu device frontend code infers the readonly > state from the backend image specification when it's instantiated. > > Thus if you start qemu with the topmost layer marked as read-only so > that the write lock doesn't get asserted, this fact gets stored in the > frontend. If you then later insert a writable layer into the chain on > top of it, the frontend stays read-only. > > IDE/SATA disks are outhright refused to be read-only by qemu, virtio > disks remember that they are read-only and keep it into guest execution > and SCSI disks caused an abort() (which was recently fixed in qemu > upstream.) > > Thus the only two options are: > > 1) hotplug the frontend after we do the shenanigans with the image > 2) implement explicit control of the read-only state of the frontend in > qemu in the first place, adapt to it in libvirt and then keep using > the current approach in libvirt I must be missing something here, because the topmost layer with <transient> isn't read-only - it would be writable as that's the whole point of the throwaway transient overlay file. It just feels to me that the situation desired is already one that is supported and working when explicitly top+base are specified in the XML. Consider if we have <disk> <source file="top.qcow2> <backingStore> <source file="base.qcow2"> </backing> </disk> IIUC, base.qcow2 will be readonly, and top.qcow2 will be writable by default. I'm suggesting that <disk> <source file="base.qcow2"> <transient/> <readonly/> </disk> should be treated as identical to the first. During domain startup preparation, we would effectvely transform the latter into the former, with a random tempfile for top.qcow. This shouldn't add any real complexity to our existing code and not involve tricks with hotplug. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Mon, Jan 25, 2021 at 10:04:41 +0000, Daniel Berrange wrote: > On Mon, Jan 25, 2021 at 10:51:09AM +0100, Peter Krempa wrote: > > On Mon, Jan 25, 2021 at 09:33:31 +0000, Daniel Berrange wrote: > > > On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote: > > > > On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote: > > > > > From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> > > > > [...] > > > > > > Please note that this is a high level review. I've spotted some coding > > > > style inconsistencies and such, but since this will need significant > > > > rework I'll not point them out. > > > > > > > > Also don't forget to add test cases, where it will be visible that the > > > > disk (neither frontend nor backend) is added on the commandline. > > > > > > > > Also it should be fairly trivial to support it for SCSI disks since they > > > > support hotplug too. > > > > > > Is there a reason we can't use the existing <readonly/> element as a way > > > to make the underlying base image be treated as read only and thus become > > > sharable ? > > > > The problem is that the qemu device frontend code infers the readonly > > state from the backend image specification when it's instantiated. > > > > Thus if you start qemu with the topmost layer marked as read-only so > > that the write lock doesn't get asserted, this fact gets stored in the > > frontend. If you then later insert a writable layer into the chain on > > top of it, the frontend stays read-only. > > > > IDE/SATA disks are outhright refused to be read-only by qemu, virtio > > disks remember that they are read-only and keep it into guest execution > > and SCSI disks caused an abort() (which was recently fixed in qemu > > upstream.) > > > > Thus the only two options are: > > > > 1) hotplug the frontend after we do the shenanigans with the image > > 2) implement explicit control of the read-only state of the frontend in > > qemu in the first place, adapt to it in libvirt and then keep using > > the current approach in libvirt > > I must be missing something here, because the topmost layer with > <transient> isn't read-only - it would be writable as that's the > whole point of the throwaway transient overlay file. It just feels > to me that the situation desired is already one that is supported > and working when explicitly top+base are specified in the XML. [...] > should be treated as identical to the first. During domain > startup preparation, we would effectvely transform the latter > into the former, with a random tempfile for top.qcow. This Your assumptions are right, but the problem is how it's done: The temporary file is created with 'blockdev-create' during the initialization of the VM. That means that the chain starts with all the existing layers, and we then add the throwaway layer before starting execution of the guest once qemu is already started. Unfortunately that's after the frontend is initialized (and thus remembers the readonly state) and after the storage is already open (thus it already asserted-or wanted to assert the write lock). 'blockdev-create' is used to prevent another invocation of qemu-img
On Mon, Jan 25, 2021 at 11:11:43AM +0100, Peter Krempa wrote: > On Mon, Jan 25, 2021 at 10:04:41 +0000, Daniel Berrange wrote: > > On Mon, Jan 25, 2021 at 10:51:09AM +0100, Peter Krempa wrote: > > > On Mon, Jan 25, 2021 at 09:33:31 +0000, Daniel Berrange wrote: > > > > On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote: > > > > > On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote: > > > > > > From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> > > > > > > [...] > > > > > > > > Please note that this is a high level review. I've spotted some coding > > > > > style inconsistencies and such, but since this will need significant > > > > > rework I'll not point them out. > > > > > > > > > > Also don't forget to add test cases, where it will be visible that the > > > > > disk (neither frontend nor backend) is added on the commandline. > > > > > > > > > > Also it should be fairly trivial to support it for SCSI disks since they > > > > > support hotplug too. > > > > > > > > Is there a reason we can't use the existing <readonly/> element as a way > > > > to make the underlying base image be treated as read only and thus become > > > > sharable ? > > > > > > The problem is that the qemu device frontend code infers the readonly > > > state from the backend image specification when it's instantiated. > > > > > > Thus if you start qemu with the topmost layer marked as read-only so > > > that the write lock doesn't get asserted, this fact gets stored in the > > > frontend. If you then later insert a writable layer into the chain on > > > top of it, the frontend stays read-only. > > > > > > IDE/SATA disks are outhright refused to be read-only by qemu, virtio > > > disks remember that they are read-only and keep it into guest execution > > > and SCSI disks caused an abort() (which was recently fixed in qemu > > > upstream.) > > > > > > Thus the only two options are: > > > > > > 1) hotplug the frontend after we do the shenanigans with the image > > > 2) implement explicit control of the read-only state of the frontend in > > > qemu in the first place, adapt to it in libvirt and then keep using > > > the current approach in libvirt > > > > I must be missing something here, because the topmost layer with > > <transient> isn't read-only - it would be writable as that's the > > whole point of the throwaway transient overlay file. It just feels > > to me that the situation desired is already one that is supported > > and working when explicitly top+base are specified in the XML. > > [...] > > > should be treated as identical to the first. During domain > > startup preparation, we would effectvely transform the latter > > into the former, with a random tempfile for top.qcow. This > > Your assumptions are right, but the problem is how it's done: > > The temporary file is created with 'blockdev-create' during the > initialization of the VM. That means that the chain starts with all the > existing layers, and we then add the throwaway layer before starting > execution of the guest once qemu is already started. > > Unfortunately that's after the frontend is initialized (and thus > remembers the readonly state) and after the storage is already open > (thus it already asserted-or wanted to assert the write lock). > > 'blockdev-create' is used to prevent another invocation of qemu-img Wouldn't it make life simpler to just use qemu-img and avoid these problems in the first place ? Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Mon, Jan 25, 2021 at 10:13:51 +0000, Daniel Berrange wrote: > On Mon, Jan 25, 2021 at 11:11:43AM +0100, Peter Krempa wrote: > > On Mon, Jan 25, 2021 at 10:04:41 +0000, Daniel Berrange wrote: [...] > > Unfortunately that's after the frontend is initialized (and thus > > remembers the readonly state) and after the storage is already open > > (thus it already asserted-or wanted to assert the write lock). > > > > 'blockdev-create' is used to prevent another invocation of qemu-img > > Wouldn't it make life simpler to just use qemu-img and avoid these > problems in the first place ? The use of 'qemu-img' in the qemu driver code is limited to inactive snapshots and I'd really like us to keep it that way. I'm more willing to declare that sharing of the original image with a <transient/> disk is unsupported rather than cave in using qemu-img. The user can always provide their own overlay and discard it appropriately.
On Mon, Jan 25, 2021 at 11:28:00 +0100, Peter Krempa wrote: > On Mon, Jan 25, 2021 at 10:13:51 +0000, Daniel Berrange wrote: > > On Mon, Jan 25, 2021 at 11:11:43AM +0100, Peter Krempa wrote: > > > On Mon, Jan 25, 2021 at 10:04:41 +0000, Daniel Berrange wrote: > > [...] > > > > Unfortunately that's after the frontend is initialized (and thus > > > remembers the readonly state) and after the storage is already open > > > (thus it already asserted-or wanted to assert the write lock). > > > > > > 'blockdev-create' is used to prevent another invocation of qemu-img > > > > Wouldn't it make life simpler to just use qemu-img and avoid these > > problems in the first place ? > > The use of 'qemu-img' in the qemu driver code is limited to inactive > snapshots and I'd really like us to keep it that way. > > I'm more willing to declare that sharing of the original image with a > <transient/> disk is unsupported rather than cave in using qemu-img. To be more specific, qemu-img uses a really crusty old commadline. Trying to adapt any blockdev code to it (such as encrypted qcow2 and others) isn't something I see a point in doing and maintaining an interlocking matrix of supported things. Similarly, the idea for supporting offline-blockjobs was always to use either qemu -m none or nowadays qemu-storage-daemon, rather than trying to probe and adapt to the quirks of qemu-img.
On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote: > On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote: > > From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> > > > > This patch series is RFC to implement to make <transient/> disk sharable. > > > > Currently, <transient/> disk option for qemu uses blockdev-snapshot QMP > > command with overlay. > > In that case, qemu holds the write lock to the original disk, so we cannot > > share the original disks with the other qemu guests. > > > > This patch series tries to implement to make the disks, which have > > <transient/> disk option, sharable by using disk hot-plug. > > > > First, create the overlay disk with the original disk is set as the backingStore. > > Then, blockdev-del the StorageProgs and FormatProgs of the disk. That's > > because to fix the bootindex of the disk. > > Lastly, device_add the disks and CPUs start. Thank you for your review! > > So I had a look at the proposed implementation and its too ugly and > complicated. > > Firstly one of the overcomplicated bits is the part which adds the disk > backend, then removes it and then adds it back. That's too much. > > Secondly you are hotplugging ALL virtio disks even those which are not > transient, that won't work either. This must stay limited to <transient> > disk to minimize the impact. I agree that it's overkill to hotplugging all disks... I'm hotplugging all virtio disks because I have experienced a boot order issue. Case: the domain xml has two virtio disks, the one has <transient/> and the boot order is 1, the other doesn't have <transient/> and the boot order is 2. In that case, qemu tried to boot from the latter disk after the former disk is hot-added even though the order is 2. The boot issue doesn't happen if the latter disk is also hot-added. > > Thirdly the code is also overcomplicated by the fact that you workaround > the fact that qemuDomainAttachDeviceDiskLiveInternal doesn't actually > support hotplug of transient disks. You then prepare everything > somewhere behind it's back and try to hotplug the disk as not-transient. > > It would be way better if you support hotplug of transient disk in the > first place reusing the hotplug of the backend and adding image creating > for the frontend. Make sense, thanks. > > I also don't understand the issue with bootindex. If the problem is that > bootindex is calculated in qemuBuildDisksCommandLine and used directly > then you'll need to refactor the code in first place, pre-calculate the > indices and store them in the disk private data and then use them from > there, but you must not modify disk->info.bootIndex as that gets > represented back to the user when dumping XML. > > The new code also the name of the VM as suffix to the temporary image, > but doesn't update the previous code which is used for disks which don't > support hotplug. That needs to be fixed too. Got it. > > The fact that we are doing hotplug of the disk should also stay hidden, > so the code doing the hotplug should not actually emit the device-added > event and the whole thing should not need to update the device list at > all, but should be placed before the regular update of the device list. > > In the end that means you shouldn't need 'TransientDiskSharable' private > data variable and shouldn do any modification of disk->src->readonly and > such. OK. > > Please note that this is a high level review. I've spotted some coding > style inconsistencies and such, but since this will need significant > rework I'll not point them out. > > Also don't forget to add test cases, where it will be visible that the > disk (neither frontend nor backend) is added on the commandline. > > Also it should be fairly trivial to support it for SCSI disks since they > support hotplug too. I suppose the patches will be following parts... 1. Hot-add support for <transient/> disks 2. Some changes to be sharable <transient/> disks 3. virtio and scsi specific changes to support sharable <transient/> 4. Testing Thanks! Masa
© 2016 - 2024 Red Hat, Inc.