From nobody Sun Apr 5 13:04:37 2026
Reply-To: Sean Christopherson
Date: Tue, 17 Feb 2026 16:54:38 -0800
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
X-Mailer: git-send-email 2.53.0.335.g19a08e0c02-goog
Message-ID:
<20260218005438.2619063-1-seanjc@google.com>
Subject: [PATCH] KVM: x86: Defer non-architectural delivery of exception payload to userspace read
From: Sean Christopherson
To: Sean Christopherson, Paolo Bonzini
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Yosry Ahmed
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

When attempting to play nice with userspace that hasn't enabled
KVM_CAP_EXCEPTION_PAYLOAD, defer KVM's non-architectural delivery of the
payload until userspace actually reads relevant vCPU state, and more
importantly, force delivery of the payload in *all* paths where userspace
saves relevant vCPU state, not just KVM_GET_VCPU_EVENTS.

Ignoring userspace save/restore for the moment, delivering the payload
before the exception is injected is wrong regardless of whether L1 or L2
is running.  To make matters even more confusing, the flaw *currently*
being papered over by the !is_guest_mode() check isn't even the same bug
that commit da998b46d244 ("kvm: x86: Defer setting of CR2 until #PF
delivery") was trying to avoid.

At the time of commit da998b46d244, KVM didn't correctly handle exception
intercepts, as KVM would wait until VM-Entry into L2 was imminent to
check if the queued exception should morph to a nested VM-Exit.  I.e. KVM
would deliver the payload to L2 and then synthesize a VM-Exit into L1.
But the payload was only the most blatant issue, e.g. waiting to check
exception intercepts would also lead to KVM incorrectly escalating a
should-be-intercepted #PF into a #DF.

That underlying bug was eventually fixed by commit 7709aba8f716 ("KVM:
x86: Morph pending exceptions to pending VM-Exits at queue time"), but in
the interim, commit a06230b62b89 ("KVM: x86: Deliver exception payload on
KVM_GET_VCPU_EVENTS") came along and subtly added another dependency on
the !is_guest_mode() check.
While not recorded in the changelog, the motivation for deferring the
!exception_payload_enabled delivery was to fix a flaw where a synthesized
MTF (Monitor Trap Flag) VM-Exit would drop a pending #DB and clobber DR6.
On a VM-Exit, VMX CPUs save pending #DB information into the VMCS, which
is emulated by KVM in nested_vmx_update_pending_dbg() by grabbing the
payload from the queued/pending exception.  I.e. prematurely delivering
the payload would cause the pending #DB to not be recorded in the VMCS,
and of course, clobber L2's DR6 as seen by L1.

Jumping back to save+restore, the quirked behavior of forcing delivery of
the payload only works if userspace does KVM_GET_VCPU_EVENTS *before* CR2
or DR6 is saved, i.e. before KVM_GET_SREGS{,2} and KVM_GET_DEBUGREGS.
E.g. if userspace does KVM_GET_SREGS before KVM_GET_VCPU_EVENTS, then the
CR2 saved by userspace won't contain the payload for the exception saved
by KVM_GET_VCPU_EVENTS.

Deliberately deliver the payload in the store_regs() path, as it's the
least awful option even though userspace may not be doing save+restore.
Because if userspace _is_ doing save+restore, it could elide
KVM_GET_SREGS knowing that SREGS were already saved when the vCPU exited.
Link: https://lore.kernel.org/all/20200207103608.110305-1-oupton@google.com
Cc: Yosry Ahmed
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
Reviewed-by: Yosry Ahmed
Tested-by: Yosry Ahmed
---
 arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..365ce3ea4a32 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
 	vcpu->arch.exception.error_code = error_code;
 	vcpu->arch.exception.has_payload = has_payload;
 	vcpu->arch.exception.payload = payload;
-	if (!is_guest_mode(vcpu))
-		kvm_deliver_exception_payload(vcpu,
-					      &vcpu->arch.exception);
 	return;
 }
 
@@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
-static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
-					       struct kvm_vcpu_events *events)
+static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu)
 {
-	struct kvm_queued_exception *ex;
-
-	process_nmi(vcpu);
-
-#ifdef CONFIG_KVM_SMM
-	if (kvm_check_request(KVM_REQ_SMI, vcpu))
-		process_smi(vcpu);
-#endif
-
 	/*
 	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
 	 * the only time there can be two queued exceptions is if there's a
@@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	if (vcpu->arch.exception_vmexit.pending &&
 	    !vcpu->arch.exception.pending &&
 	    !vcpu->arch.exception.injected)
-		ex = &vcpu->arch.exception_vmexit;
-	else
-		ex = &vcpu->arch.exception;
+		return &vcpu->arch.exception_vmexit;
+
+	return &vcpu->arch.exception;
+}
+
+static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu)
+{
+	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
 
 	/*
-	 * In guest mode, payload delivery should be deferred if the exception
-	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
-	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
-	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
-	 * propagate the payload and so it cannot be safely deferred.  Deliver
-	 * the payload if the capability hasn't been requested.
+	 * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver
+	 * the pending exception payload when userspace saves *any* vCPU state
+	 * that interacts with exception payloads to avoid breaking userspace.
+	 *
+	 * Architecturally, KVM must not deliver an exception payload until the
+	 * exception is actually injected, e.g. to avoid losing pending #DB
+	 * information (which VMX tracks in the VMCS), and to avoid clobbering
+	 * state if the exception is never injected for whatever reason.  But
+	 * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or
+	 * may not propagate the payload across save+restore, and so KVM can't
+	 * safely defer delivery of the payload.
 	 */
 	if (!vcpu->kvm->arch.exception_payload_enabled &&
 	    ex->pending && ex->has_payload)
 		kvm_deliver_exception_payload(vcpu, ex);
+}
+
+static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
+					       struct kvm_vcpu_events *events)
+{
+	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
+
+	process_nmi(vcpu);
+
+#ifdef CONFIG_KVM_SMM
+	if (kvm_check_request(KVM_REQ_SMI, vcpu))
+		process_smi(vcpu);
+#endif
+
+	kvm_handle_exception_payload_quirk(vcpu);
 
 	memset(events, 0, sizeof(*events));
 
@@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
 	    vcpu->arch.guest_state_protected)
 		return -EINVAL;
 
+	kvm_handle_exception_payload_quirk(vcpu);
+
 	memset(dbgregs, 0, sizeof(*dbgregs));
 
 	BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
@@ -12137,6 +12151,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	if (vcpu->arch.guest_state_protected)
 		goto skip_protected_regs;
 
+	kvm_handle_exception_payload_quirk(vcpu);
+
 	kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
 	kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
 	kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);

base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
-- 
2.53.0.335.g19a08e0c02-goog