Currently only the locking order of SRCU vs kvm->slots_arch_lock
and kvm->slots_lock is documented. Extend this to kvm->lock
since Xen emulation got it terribly wrong.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
Documentation/virt/kvm/locking.rst | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 845a561629f1..a3ca76f9be75 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,17 +16,26 @@ The acquisition orders for mutexes are as follows:
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
them together is quite rare.
-- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
- synchronize_srcu(&kvm->srcu). Therefore kvm->slots_arch_lock
- can be taken inside a kvm->srcu read-side critical section,
- while kvm->slots_lock cannot.
-
- kvm->mn_active_invalidate_count ensures that pairs of
invalidate_range_start() and invalidate_range_end() callbacks
use the same memslots array. kvm->slots_lock and kvm->slots_arch_lock
are taken on the waiting side in install_new_memslots, so MMU notifiers
must not take either kvm->slots_lock or kvm->slots_arch_lock.
+For SRCU:
+
+- ``synchronize_srcu(&kvm->srcu)`` is called _inside_
+ the kvm->slots_lock critical section, therefore kvm->slots_lock
+ cannot be taken inside a kvm->srcu read-side critical section.
+ Instead, kvm->slots_arch_lock is released before the call
+ to ``synchronize_srcu()`` and _can_ be taken inside a
+ kvm->srcu read-side critical section.
+
+- kvm->lock is taken inside kvm->srcu, therefore
+ ``synchronize_srcu(&kvm->srcu)`` cannot be called inside
+ a kvm->lock critical section. If you cannot delay the
+ call until after kvm->lock is released, use ``call_srcu``.
+
On x86:
- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
--
2.31.1
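
A minimal sketch of the deadlock the new ``call_srcu`` paragraph is guarding
against, and of the deferred alternative it suggests.  This is illustrative
only and not part of the patch: struct some_obj, free_obj_rcu() and the
task_* functions are invented names; only kvm->lock, kvm->srcu and the
standard mutex/SRCU primitives are assumed.

#include <linux/kvm_host.h>
#include <linux/slab.h>
#include <linux/srcu.h>

/* Hypothetical object published to kvm->srcu readers, for illustration only. */
struct some_obj {
        struct rcu_head rcu;
        /* ... payload ... */
};

/* Task A: the documented order, kvm->lock taken inside a kvm->srcu reader. */
static void task_a(struct kvm *kvm)
{
        int idx = srcu_read_lock(&kvm->srcu);

        mutex_lock(&kvm->lock);         /* may block waiting for task B */
        /* ... */
        mutex_unlock(&kvm->lock);
        srcu_read_unlock(&kvm->srcu, idx);
}

/* Task B: the pattern the new text forbids. */
static void task_b_broken(struct kvm *kvm)
{
        mutex_lock(&kvm->lock);
        /*
         * Deadlock: this waits for task A to leave its SRCU read-side
         * critical section, but task A is blocked on kvm->lock, which
         * this task holds.
         */
        synchronize_srcu(&kvm->srcu);
        mutex_unlock(&kvm->lock);
}

/* Task B, reworked: unpublish under kvm->lock, defer the free with call_srcu(). */
static void free_obj_rcu(struct rcu_head *head)
{
        kfree(container_of(head, struct some_obj, rcu));
}

static void task_b_fixed(struct kvm *kvm, struct some_obj *obj)
{
        mutex_lock(&kvm->lock);
        /* ... remove obj from whatever structure SRCU readers walk ... */
        mutex_unlock(&kvm->lock);
        call_srcu(&kvm->srcu, &obj->rcu, free_obj_rcu);  /* never blocks */
}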
On Wed, Dec 28, 2022, Paolo Bonzini wrote:
> Currently only the locking order of SRCU vs kvm->slots_arch_lock
> and kvm->slots_lock is documented. Extend this to kvm->lock
> since Xen emulation got it terribly wrong.
>
> +- kvm->lock is taken inside kvm->srcu, therefore
> +  ``synchronize_srcu(&kvm->srcu)`` cannot be called inside
> +  a kvm->lock critical section.  If you cannot delay the
> +  call until after kvm->lock is released, use ``call_srcu``.

Prior to the recent Xen change, is this actually true?

There are many instances where kvm->srcu is taken inside kvm->lock, but I
can't find any existing cases where the reverse is true.  Logically, it makes
sense to take kvm->lock first since kvm->srcu can be taken deep in helpers,
e.g. for accessing guest memory.  It's also more consistent to take kvm->lock
first since kvm->srcu is taken inside vcpu->mutex, and vcpu->mutex is taken
inside kvm->lock.

Disallowing synchronize_srcu(kvm->srcu) inside kvm->lock isn't problematic
per se, but it's going to result in a weird set of rules because
synchronize_srcu() can, and is, called while holding a variety of other locks.

In other words, IMO taking kvm->srcu outside of kvm->lock in the Xen code is
the real bug.
On Tue, Jan 03, 2023, Sean Christopherson wrote:
> On Wed, Dec 28, 2022, Paolo Bonzini wrote:
> > +- kvm->lock is taken inside kvm->srcu, therefore
>
> Prior to the recent Xen change, is this actually true?

I was thinking of a different change, but v5.19 is still kinda recent, so I'll
count it.  Better to be lucky than good :-)

Commit 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
introduced the only case I can find where kvm->lock is taken inside kvm->srcu.

> There are many instances where kvm->srcu is taken inside kvm->lock, but I
> can't find any existing cases where the reverse is true.  Logically, it makes
> sense to take kvm->lock first since kvm->srcu can be taken deep in helpers,
> e.g. for accessing guest memory.  It's also more consistent to take kvm->lock
> first since kvm->srcu is taken inside vcpu->mutex, and vcpu->mutex is taken
> inside kvm->lock.
>
> Disallowing synchronize_srcu(kvm->srcu) inside kvm->lock isn't problematic
> per se, but it's going to result in a weird set of rules because
> synchronize_srcu() can, and is, called while holding a variety of other locks.
>
> In other words, IMO taking kvm->srcu outside of kvm->lock in the Xen code is
> the real bug.

I'm doubling down on this.  Taking kvm->srcu outside of kvm->lock is all kinds
of sketchy, and likely indicates a larger problem.  The aforementioned commit
that introduced the problematic kvm->srcu vs. kvm->lock nesting also blatantly
violates the ordering between kvm->lock and vcpu->mutex.  Details in the
link[*].

The vast majority of flows that take kvm->srcu will already hold a lock of
some kind, otherwise the task can't safely dereference any VM/vCPU/device data
and thus has no reason to acquire kvm->srcu.  E.g. taking kvm->srcu to read
guest memory is nonsensical without a stable guest physical address to work
with.

There are exceptions, e.g. evtchn_set_fn() and maybe some ioctls(), but in
most cases, taking kvm->lock inside kvm->srcu is just asking for problems.

[*] https://lore.kernel.org/all/Y7dN0Negds7XUbvI@google.com
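
To make the ordering argument concrete, here is a rough sketch of the two
nestings being contrasted above.  The function names are invented for
illustration and are not actual KVM call sites; only kvm->lock, vcpu->mutex
and kvm->srcu are assumed.

#include <linux/kvm_host.h>

/*
 * Established nesting: kvm->lock outermost, then vcpu->mutex, with the
 * kvm->srcu read-side section taken deep in helpers (e.g. to access
 * guest memory).
 */
static void established_nesting(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
        int idx;

        mutex_lock(&kvm->lock);
        mutex_lock(&vcpu->mutex);
        idx = srcu_read_lock(&kvm->srcu);
        /* ... e.g. read guest memory ... */
        srcu_read_unlock(&kvm->srcu, idx);
        mutex_unlock(&vcpu->mutex);
        mutex_unlock(&kvm->lock);
}

/*
 * The inversion being called the real bug: a task already inside a
 * kvm->srcu read-side critical section goes on to take kvm->lock,
 * i.e. kvm->srcu ends up outside of kvm->lock.
 */
static void inverted_nesting(struct kvm *kvm)
{
        int idx = srcu_read_lock(&kvm->srcu);

        mutex_lock(&kvm->lock);         /* kvm->lock now nests inside kvm->srcu */
        /* ... */
        mutex_unlock(&kvm->lock);
        srcu_read_unlock(&kvm->srcu, idx);
}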
On Thu, 2023-01-05 at 22:38 +0000, Sean Christopherson wrote:
> > > +- kvm->lock is taken inside kvm->srcu, therefore
> >
> > Prior to the recent Xen change, is this actually true?
>
> I was thinking of a different change, but v5.19 is still kinda recent, so I'll
> count it.  Better to be lucky than good :-)
>
> Commit 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
> introduced the only case I can find where kvm->lock is taken inside kvm->srcu.
>
> I'm doubling down on this.  Taking kvm->srcu outside of kvm->lock is all kinds
> of sketchy, and likely indicates a larger problem.  The aforementioned commit
> that introduced the problematic kvm->srcu vs. kvm->lock nesting also blatantly
> violates the ordering between kvm->lock and vcpu->mutex.  Details in the
> link[*].
>
> There are exceptions, e.g. evtchn_set_fn() and maybe some ioctls(), but in
> most cases, taking kvm->lock inside kvm->srcu is just asking for problems.
>
> [*] https://lore.kernel.org/all/Y7dN0Negds7XUbvI@google.com

So... If we apply the patch below, then the Xen code never takes kvm->lock
inside kvm->srcu.  (Never takes kvm->lock at all, in fact; I'm just rereading
the manual search/replace to make sure the change is valid in all cases.)

It does take kvm->srcu *without* holding kvm->lock, just not "outside" it.

I think we can probably revert commit a79b53aaaa ("KVM: x86: fix deadlock for
KVM_XEN_EVTCHN_RESET") after doing this too?

From: David Woodhouse <dwmw@amazon.co.uk>
Date: Wed, 11 Jan 2023 10:04:38 +0000
Subject: [PATCH] KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock
 leaf node lock

In commit 14243b387137a ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN
and event channel delivery") the clever version of me left some helpful
notes for those who would come after him:

	/*
	 * For the irqfd workqueue, using the main kvm->lock mutex is
	 * fine since this function is invoked from kvm_set_irq() with
	 * no other lock held, no srcu. In future if it will be called
	 * directly from a vCPU thread (e.g. on hypercall for an IPI)
	 * then it may need to switch to using a leaf-node mutex for
	 * serializing the shared_info mapping.
	 */
	mutex_lock(&kvm->lock);

In commit 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from
guests") the other version of me ran straight past that comment without
reading it, and introduced a potential deadlock by taking vcpu->mutex
and kvm->lock in the wrong order.
Solve this as originally suggested, by adding a leaf-node lock in the
Xen state rather than using kvm->lock for it.

Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/xen.c              | 65 +++++++++++++++------------------
 2 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2f5bf581d00a..4ef0143fa682 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1115,6 +1115,7 @@ struct msr_bitmap_range {
 
 /* Xen emulation context */
 struct kvm_xen {
+	struct mutex xen_lock;
 	u32 xen_version;
 	bool long_mode;
 	bool runstate_update_flag;
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index c444948ab1ac..713241808a3d 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -604,26 +604,26 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 		if (!IS_ENABLED(CONFIG_64BIT) && data->u.long_mode) {
 			r = -EINVAL;
 		} else {
-			mutex_lock(&kvm->lock);
+			mutex_lock(&kvm->arch.xen.xen_lock);
 			kvm->arch.xen.long_mode = !!data->u.long_mode;
-			mutex_unlock(&kvm->lock);
+			mutex_unlock(&kvm->arch.xen.xen_lock);
 			r = 0;
 		}
 		break;
 
 	case KVM_XEN_ATTR_TYPE_SHARED_INFO:
-		mutex_lock(&kvm->lock);
+		mutex_lock(&kvm->arch.xen.xen_lock);
 		r = kvm_xen_shared_info_init(kvm, data->u.shared_info.gfn);
-		mutex_unlock(&kvm->lock);
+		mutex_unlock(&kvm->arch.xen.xen_lock);
 		break;
 
 	case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
 		if (data->u.vector && data->u.vector < 0x10)
 			r = -EINVAL;
 		else {
-			mutex_lock(&kvm->lock);
+			mutex_lock(&kvm->arch.xen.xen_lock);
 			kvm->arch.xen.upcall_vector = data->u.vector;
-			mutex_unlock(&kvm->lock);
+			mutex_unlock(&kvm->arch.xen.xen_lock);
 			r = 0;
 		}
 		break;
@@ -633,9 +633,9 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 		break;
 
 	case KVM_XEN_ATTR_TYPE_XEN_VERSION:
-		mutex_lock(&kvm->lock);
+		mutex_lock(&kvm->arch.xen.xen_lock);
 		kvm->arch.xen.xen_version = data->u.xen_version;
-		mutex_unlock(&kvm->lock);
+		mutex_unlock(&kvm->arch.xen.xen_lock);
 		r = 0;
 		break;
 
@@ -644,9 +644,9 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 			r = -EOPNOTSUPP;
 			break;
 		}
-		mutex_lock(&kvm->lock);
+		mutex_lock(&kvm->arch.xen.xen_lock);
 		kvm->arch.xen.runstate_update_flag = !!data->u.runstate_update_flag;
-		mutex_unlock(&kvm->lock);
+		mutex_unlock(&kvm->arch.xen.xen_lock);
 		r = 0;
 		break;
 
@@ -661,7 +661,7 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 {
 	int r = -ENOENT;
 
-	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->arch.xen.xen_lock);
 
 	switch (data->type) {
 	case KVM_XEN_ATTR_TYPE_LONG_MODE:
@@ -700,7 +700,7 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 		break;
 	}
 
-	mutex_unlock(&kvm->lock);
+	mutex_unlock(&kvm->arch.xen.xen_lock);
 	return r;
 }
 
@@ -708,7 +708,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 {
 	int idx, r = -ENOENT;
 
-	mutex_lock(&vcpu->kvm->lock);
+	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
 
 	switch (data->type) {
@@ -936,7 +936,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 	}
 
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
-	mutex_unlock(&vcpu->kvm->lock);
+	mutex_unlock(&vcpu->kvm->arch.xen.xen_lock);
 	return r;
 }
 
@@ -944,7 +944,7 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 {
 	int r = -ENOENT;
 
-	mutex_lock(&vcpu->kvm->lock);
+	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
 
 	switch (data->type) {
 	case KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO:
@@ -1027,7 +1027,7 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		break;
 	}
 
-	mutex_unlock(&vcpu->kvm->lock);
+	mutex_unlock(&vcpu->kvm->arch.xen.xen_lock);
 	return r;
 }
 
@@ -1120,7 +1120,7 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
 		     xhc->blob_size_32 || xhc->blob_size_64))
 		return -EINVAL;
 
-	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->arch.xen.xen_lock);
 
 	if (xhc->msr && !kvm->arch.xen_hvm_config.msr)
 		static_branch_inc(&kvm_xen_enabled.key);
@@ -1129,7 +1129,7 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
 
 	memcpy(&kvm->arch.xen_hvm_config, xhc, sizeof(*xhc));
 
-	mutex_unlock(&kvm->lock);
+	mutex_unlock(&kvm->arch.xen.xen_lock);
 	return 0;
 }
 
@@ -1672,15 +1672,7 @@ static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 		mm_borrowed = true;
 	}
 
-	/*
-	 * For the irqfd workqueue, using the main kvm->lock mutex is
-	 * fine since this function is invoked from kvm_set_irq() with
-	 * no other lock held, no srcu. In future if it will be called
-	 * directly from a vCPU thread (e.g. on hypercall for an IPI)
-	 * then it may need to switch to using a leaf-node mutex for
-	 * serializing the shared_info mapping.
-	 */
-	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->arch.xen.xen_lock);
 
 	/*
 	 * It is theoretically possible for the page to be unmapped
@@ -1709,7 +1701,7 @@ static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 		srcu_read_unlock(&kvm->srcu, idx);
 	} while(!rc);
 
-	mutex_unlock(&kvm->lock);
+	mutex_unlock(&kvm->arch.xen.xen_lock);
 
 	if (mm_borrowed)
 		kthread_unuse_mm(kvm->mm);
@@ -1825,7 +1817,7 @@ static int kvm_xen_eventfd_update(struct kvm *kvm,
 	int ret;
 
 	/* Protect writes to evtchnfd as well as the idr lookup.  */
-	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->arch.xen.xen_lock);
 	evtchnfd = idr_find(&kvm->arch.xen.evtchn_ports, port);
 	ret = -ENOENT;
@@ -1856,7 +1848,7 @@ static int kvm_xen_eventfd_update(struct kvm *kvm,
 	}
 	ret = 0;
 out_unlock:
-	mutex_unlock(&kvm->lock);
+	mutex_unlock(&kvm->arch.xen.xen_lock);
 	return ret;
 }
 
@@ -1919,10 +1911,10 @@ static int kvm_xen_eventfd_assign(struct kvm *kvm,
 		evtchnfd->deliver.port.priority = data->u.evtchn.deliver.port.priority;
 	}
 
-	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->arch.xen.xen_lock);
 	ret = idr_alloc(&kvm->arch.xen.evtchn_ports, evtchnfd, port, port + 1,
 			GFP_KERNEL);
-	mutex_unlock(&kvm->lock);
+	mutex_unlock(&kvm->arch.xen.xen_lock);
 	if (ret >= 0)
 		return 0;
 
@@ -1940,9 +1932,9 @@ static int kvm_xen_eventfd_deassign(struct kvm *kvm, u32 port)
 {
 	struct evtchnfd *evtchnfd;
 
-	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->arch.xen.xen_lock);
 	evtchnfd = idr_remove(&kvm->arch.xen.evtchn_ports, port);
-	mutex_unlock(&kvm->lock);
+	mutex_unlock(&kvm->arch.xen.xen_lock);
 
 	if (!evtchnfd)
 		return -ENOENT;
@@ -1959,7 +1951,7 @@ static int kvm_xen_eventfd_reset(struct kvm *kvm)
 	struct evtchnfd *evtchnfd;
 	int i;
 
-	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->arch.xen.xen_lock);
 	idr_for_each_entry(&kvm->arch.xen.evtchn_ports, evtchnfd, i) {
 		idr_remove(&kvm->arch.xen.evtchn_ports, evtchnfd->send_port);
 		synchronize_srcu(&kvm->srcu);
@@ -1967,7 +1959,7 @@ static int kvm_xen_eventfd_reset(struct kvm *kvm)
 			eventfd_ctx_put(evtchnfd->deliver.eventfd.ctx);
 		kfree(evtchnfd);
 	}
-	mutex_unlock(&kvm->lock);
+	mutex_unlock(&kvm->arch.xen.xen_lock);
 
 	return 0;
 }
@@ -2059,6 +2051,7 @@ void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
 
 void kvm_xen_init_vm(struct kvm *kvm)
 {
+	mutex_init(&kvm->arch.xen.xen_lock);
 	idr_init(&kvm->arch.xen.evtchn_ports);
 	kvm_gpc_init(&kvm->arch.xen.shinfo_cache, kvm, NULL, KVM_HOST_USES_PFN);
 }
-- 
2.35.3
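
The general shape of the fix, reduced to a generic sketch: feature-private
state gets its own leaf mutex, initialized at VM init and only ever taken
around that state, so it never needs to nest around kvm->lock or vcpu->mutex.
struct foo_state and the foo_* helpers below are invented for illustration and
are not the actual KVM/Xen code.

#include <linux/gfp.h>
#include <linux/idr.h>
#include <linux/mutex.h>

/* Hypothetical feature-private state, analogous to struct kvm_xen above. */
struct foo_state {
        struct mutex lock;      /* leaf: innermost lock on every path that takes it */
        struct idr ports;
};

static void foo_init(struct foo_state *foo)
{
        mutex_init(&foo->lock);
        idr_init(&foo->ports);
}

/* Callable from paths that may already hold vcpu->mutex and/or kvm->srcu. */
static int foo_add_port(struct foo_state *foo, void *obj, int port)
{
        int ret;

        mutex_lock(&foo->lock);
        ret = idr_alloc(&foo->ports, obj, port, port + 1, GFP_KERNEL);
        mutex_unlock(&foo->lock);

        return ret < 0 ? ret : 0;
}

The idr_alloc() call can sleep (GFP_KERNEL), which is fine under a mutex; the
property that matters is that nothing else is acquired while foo->lock is held.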