From nobody Thu Apr 10 18:30:43 2025 Received: from mail-pl1-f202.google.com (mail-pl1-f202.google.com [209.85.214.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C5F3820E6ED for ; Tue, 1 Apr 2025 15:49:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1743522558; cv=none; b=d5sneqFIxUcW49TaTIN0s1XN/XuluRIkErB57FsHPfb+8abfzHBJpkIv7PtyJhqdvqamVPBa32StYJ+RMJAQOAA5Ib6LWpD3hxbQ9stoLITkuw6A/20cQmghX9ot/XTYhl7N3Zt/6xjaMG9QMerwLbgS77QQxef7S4HfWVgH/Ks= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1743522558; c=relaxed/simple; bh=vhm5JRAE6njpdjScz5UK8LdwniERJDF2bJGfN6xBYh4=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=kDI9fLEaC0oMVvxQXakINSZkoU0EDWSCulCGcszHaKdL1kbV4y2faH4mpwMtepEz8/PECwUeGg9BRkoIkEFJitYUU6/pQlbE8NtrlWEUV/+lnLzJQi12hPjCwrHOX/V6iVBCdrq0w+nHhc9PrVZwCU2JBHFA1pLvmgsU35u9iKw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=VztYwNuw; arc=none smtp.client-ip=209.85.214.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="VztYwNuw" Received: by mail-pl1-f202.google.com with SMTP id d9443c01a7336-2241e7e3addso91385055ad.1 for ; Tue, 01 Apr 2025 08:49:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1743522555; x=1744127355; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:reply-to:from:to:cc:subject:date:message-id:reply-to; bh=8wH3EOfhowl2ah+tdgIUAwIwU2yTiFV9tKRtcsqEfls=; b=VztYwNuw5CBmlg1pU1133QTiIJZWe/w5dJA5xCrGtbVVr1N/XvO3tqGbQPZjJHAEGe KMZMoiqY35jND/HUxDESZLQs87IvgTZzyT+QSGuFU5wz3+lgYkzZ8iL2EPmQL7WXsW5O DX4n3njQwEJGbw0V18WXrVY5fHFzfL5YryG67WFhiZeI+8iwU0ldaIbh7C69K5DSsCZJ qUwbavOIyVJ9x4ZyaY6xvxPdjb6pGhXUlSTJM0woBPRKvuh9cb8+XMrj5EU6Vl3Neqh1 kUHfQk6PShDZxnkT9PTicnSxfJHj9DWZzYsrq3CQBcPR/ComwQIfrB7g8Q2O+V/83i+f tu8A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1743522555; x=1744127355; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:reply-to:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=8wH3EOfhowl2ah+tdgIUAwIwU2yTiFV9tKRtcsqEfls=; b=LgCbytwfN524h47v3Gp3AvSVWa87Bk7TQiB/kcccT0EOSdODnOKzYaddhC9e3xIUzV DuTfOlD4AJGDRWcdLzSPlx5NBgd6yicdFdQVzqv3Ad93yhl9kIbHvozeTwm7t/jKx0N9 n+cKCKIVuV0dOj9y+jpMHwS7t9/fOwXMtzWkaCP2jz2KiwT6ZctHPOdAQ3KyYPPm/omT Vtt/15nR+yOBZX6KOm26RTwTv74qY0WCxuLxvDZ9htubCC6T+1SB8Kgeqm080j/8N4D+ B0awNoNU3jjY7/PUirnW7zy43mtxm8Z92/a+2LBbjbRdXuaO9vQF6cAAc71z0D957979 RnSA== X-Forwarded-Encrypted: i=1; AJvYcCWAnQc6gusJGjM2lEDTw7j3R1DAxrN/9XNLe9wciq2kHDfzslu0AWu0P+hRyZdLAX8ac0Q8zoCHOcdtXOE=@vger.kernel.org X-Gm-Message-State: AOJu0YwZybdrSjKRg4G8DPTIkyQd9Vh1jhLq4rUEs9ExXRNo41rz3Xce sapqUIirpf9bsStrdUm9MJV3670zpkKwfoxjBjV6IoEV8Km42WjrxrBfhrT4DJ5mqW50Mf4ll1a 57g== X-Google-Smtp-Source: AGHT+IH86RKe+94T4SRRkFodCtx5qKlady13L3+LhOcM6mMcYsZC46a6axWch98wxyXO/mby85l8LBVLSYY= X-Received: from pltl12.prod.google.com ([2002:a17:902:70cc:b0:226:342c:5750]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a17:902:f693:b0:220:ca39:d453 with SMTP id d9443c01a7336-2295be7969fmr50326595ad.17.1743522555079; Tue, 01 Apr 2025 08:49:15 -0700 (PDT) Reply-To: Sean Christopherson Date: Tue, 1 Apr 2025 08:47:26 -0700 In-Reply-To: <20250401154727.835231-1-seanjc@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250401154727.835231-1-seanjc@google.com> X-Mailer: git-send-email 2.49.0.472.ge94155a9ec-goog Message-ID: <20250401154727.835231-2-seanjc@google.com> Subject: [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list From: Sean Christopherson To: Sean Christopherson , Paolo Bonzini Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Yan Zhao Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Assert that IRQs are already disabled when putting a vCPU on a CPU's PI wakeup list, as opposed to saving/disabling+restoring IRQs. KVM relies on IRQs being disabled until the vCPU task is fully scheduled out, i.e. until the scheduler has dropped all of its per-CPU locks (e.g. for the runqueue), as attempting to wake the task while it's being scheduled out could lead to deadlock. Signed-off-by: Sean Christopherson Reviewed-by: Maxim Levitsky Reviewed-by: Yan Zhao --- arch/x86/kvm/vmx/posted_intr.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c index ec08fa3caf43..840d435229a8 100644 --- a/arch/x86/kvm/vmx/posted_intr.c +++ b/arch/x86/kvm/vmx/posted_intr.c @@ -148,9 +148,8 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *v= cpu) struct pi_desc *pi_desc =3D vcpu_to_pi_desc(vcpu); struct vcpu_vmx *vmx =3D to_vmx(vcpu); struct pi_desc old, new; - unsigned long flags; =20 - local_irq_save(flags); + lockdep_assert_irqs_disabled(); =20 raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); list_add_tail(&vmx->pi_wakeup_list, @@ -176,8 +175,6 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *v= cpu) */ if (pi_test_on(&new)) __apic_send_IPI_self(POSTED_INTR_WAKEUP_VECTOR); - - local_irq_restore(flags); } =20 static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu) --=20 2.49.0.472.ge94155a9ec-goog From nobody Thu Apr 10 18:30:43 2025 Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BBA7020E718 for ; Tue, 1 Apr 2025 15:49:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1743522559; cv=none; b=mdkhzdLo5dKdouZFLygmSPGQBgve++N3ZddzzYX47Kov1i+QLmcQJnxFYbLCE8uuyEGe3rxqqtqdrnh0lfkHSeXXQp8LV1eiCgPyn5Kbp4wy7AoYKo06oV1fvtER3QiDpCFWEOMGpyOsjA7qOXI4byD41Fi2uY8OApjcqOJYPts= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1743522559; c=relaxed/simple; bh=gLK58HHSD3/s9+5+CMqOiJMgocg0/qBgCo1ULcStcuk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=h4H5v9K6UuEFYKzg16oDh7Dm84jnI+1W5nHnce3kx9bQs37l9CbBNRVEsEm5s8/UYo5TZ8b4o2HzZ0vIhwqBGO7/QZUPFT9EhaN/oyq7dGg6VzDTuOFb7WtP0RzE4zk3JDmcBDF0iPbcZL9IE1LNCQLR7kG0nNvK6Y1BIKj+L0Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=XuZXeNoC; arc=none smtp.client-ip=209.85.214.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="XuZXeNoC" Received: by mail-pl1-f201.google.com with SMTP id d9443c01a7336-229668c8659so3283575ad.3 for ; Tue, 01 Apr 2025 08:49:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1743522557; x=1744127357; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:reply-to:from:to:cc:subject:date:message-id:reply-to; bh=NEoKKraLkus5TSEUW8Vu6zSvk2yZjVwDVKYenBySZQo=; b=XuZXeNoCfaTQ9eqeARFWLaTlcWi1Fa5WDJBUDc06FWkCn+KB6/3sxTe8X0Nlq2NYCt EdjitEQtkHXzHyEVQMNMxpeuGJojXFddQ+LmmTiZgFZdyaEeoNIJ/A8mZu0Doc8IgXKL j4OrJ2iB8pK5kJJNE6HIVKgsaJcU4c2ljwEvsPHCAxsY3zxp5e5IrfkXfGQJp67DrXwD jgOXQxGj3KpL4XiS61UC3cAGW+Imbg0MoHrDxRLxiIdhYL6ConXV51UncSKsheIOMtx8 OU25YjcZ8uK5jF5paGkmlznGjC7g05saOLEAzbXUnjy6eLzPM9Uzft1Ip2H7oEyJ9LQQ zmpg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1743522557; x=1744127357; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:reply-to:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=NEoKKraLkus5TSEUW8Vu6zSvk2yZjVwDVKYenBySZQo=; b=BOlmirOeWGsr3zBLp6eF0bM8yAeViw2EAEyCg5j7tTRgxLuBV8t45PnK+/KfLgrZ5Y AANO1+ce1zzEgwGZmZ6RpvXh3/eAYL1IV+BI/9dBXM6HRydCHqFBIka0sw5JCHcStOTj pS0N/jc2sQ3WsvvhZrbNlWMIdVGILGHQpcBYg1N0OhYnEElicELhuDCdMMN8l+KfInqX QAy3RQBv80vtb7yXU/0mEfioXKoLID9Q1ZeC1Gn2462CLSijq2oguHH3gYsPCbUPE2uI /Z10FzG8h4gOtkcq+jxY95sY1k7F6iIiKOTVDU7pmQ84VsbfKyU2n2sqj6Jqso0/lUcL qUIQ== X-Forwarded-Encrypted: i=1; AJvYcCU9khTBChivCOKjcjNV/uRmGUwruTOd5EY15gPQCSaR6PR6H/AwpCrCVRMg7uHpEUaKe2yBcIjAxa5pV/E=@vger.kernel.org X-Gm-Message-State: AOJu0YxI8k/xe1T3xoq5s7jg6g4wHQdLS40m+UUaYBnB+PZla/DlsxT4 JktqudqpTWQ8Ai14c1EyJXppTCwkHVVWyETb7RmlfybWdJCdIBUIvPKnUuqZu2CSRSkctXJeO9t uRQ== X-Google-Smtp-Source: AGHT+IGbhkLRRFTUQ7bxiQyU2fg0e5+VIELmG5bgHhONmz/4t+jU7czIP6A4Q16LN9DgWQLrpb52LAhd85I= X-Received: from pfbjw37.prod.google.com ([2002:a05:6a00:92a5:b0:736:6fb6:7fc]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a17:902:ce09:b0:224:10a2:cae1 with SMTP id d9443c01a7336-2292f9df155mr209876235ad.37.1743522556908; Tue, 01 Apr 2025 08:49:16 -0700 (PDT) Reply-To: Sean Christopherson Date: Tue, 1 Apr 2025 08:47:27 -0700 In-Reply-To: <20250401154727.835231-1-seanjc@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250401154727.835231-1-seanjc@google.com> X-Mailer: git-send-email 2.49.0.472.ge94155a9ec-goog Message-ID: <20250401154727.835231-3-seanjc@google.com> Subject: [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive From: Sean Christopherson To: Sean Christopherson , Paolo Bonzini Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Yan Zhao Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Yan Zhao Use a separate subclass when acquiring KVM's per-CPU posted interrupts wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list of vCPUs to wake, to workaround a false positive deadlock. Chain exists of: &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); lock(&rq->__lock); lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); lock(&p->pi_lock); *** DEADLOCK *** In the wakeup handler, the callchain is *always*: sysvec_kvm_posted_intr_wakeup_ipi() | --> pi_wakeup_handler() | --> kvm_vcpu_wake_up() | --> try_to_wake_up(), and the lock order is: &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock. For the schedule out path, the callchain is always (for all intents and purposes; if the kernel is preemptible, kvm_sched_out() can be called from something other than schedule(), but the beginning of the callchain will be the same point in vcpu_block()): vcpu_block() | --> schedule() | --> kvm_sched_out() | --> vmx_vcpu_put() | --> vmx_vcpu_pi_put() | --> pi_enable_wakeup_handler() and the lock order is: &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for wakeup, and complains about the A=3D>C versus C=3D>A inversion. In practic= e, deadlock can't occur between schedule out and the wakeup handler as they are mutually exclusive. The entirely of the schedule out code that runs with the problematic scheduler locks held, does so with IRQs disabled, i.e. can't run concurrently with the wakeup handler. Use a subclass instead disabling lockdep entirely, and tell lockdep that both subclasses are being acquired when loading a vCPU, as the sched_out and sched_in paths are NOT mutually exclusive, e.g. CPU 0 CPU 1 --------------- --------------- vCPU0 sched_out vCPU1 sched_in vCPU1 sched_out vCPU 0 sched_in where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup list+lock. Signed-off-by: Yan Zhao Signed-off-by: Sean Christopherson Reviewed-by: Maxim Levitsky --- arch/x86/kvm/vmx/posted_intr.c | 32 +++++++++++++++++++++++++++++--- 1 file changed, 29 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c index 840d435229a8..51116fe69a50 100644 --- a/arch/x86/kvm/vmx/posted_intr.c +++ b/arch/x86/kvm/vmx/posted_intr.c @@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_c= pu); */ static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock); =20 +#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING + static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu) { return &(to_vmx(vcpu)->pi_desc); @@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu) * current pCPU if the task was migrated. */ if (pi_desc->nv =3D=3D POSTED_INTR_WAKEUP_VECTOR) { - raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); + raw_spinlock_t *spinlock =3D &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cp= u); + + /* + * In addition to taking the wakeup lock for the regular/IRQ + * context, tell lockdep it is being taken for the "sched out" + * context as well. vCPU loads happens in task context, and + * this is taking the lock of the *previous* CPU, i.e. can race + * with both the scheduler and the wakeup handler. + */ + raw_spin_lock(spinlock); + spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_); list_del(&vmx->pi_wakeup_list); - raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); + spin_release(&spinlock->dep_map, _RET_IP_); + raw_spin_unlock(spinlock); } =20 dest =3D cpu_physical_id(cpu); @@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *= vcpu) =20 lockdep_assert_irqs_disabled(); =20 - raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); + /* + * Acquire the wakeup lock using the "sched out" context to workaround + * a lockdep false positive. When this is called, schedule() holds + * various per-CPU scheduler locks. When the wakeup handler runs, it + * holds this CPU's wakeup lock while calling try_to_wake_up(), which + * can eventually take the aforementioned scheduler locks, which causes + * lockdep to assume there is deadlock. + * + * Deadlock can't actually occur because IRQs are disabled for the + * entirety of the sched_out critical section, i.e. the wakeup handler + * can't run while the scheduler locks are held. + */ + raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), + PI_LOCK_SCHED_OUT); list_add_tail(&vmx->pi_wakeup_list, &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu)); raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu)); --=20 2.49.0.472.ge94155a9ec-goog