From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D2690C433EF for ; Thu, 23 Dec 2021 22:23:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350430AbhLWWXs (ORCPT ); Thu, 23 Dec 2021 17:23:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38648 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350412AbhLWWXo (ORCPT ); Thu, 23 Dec 2021 17:23:44 -0500 Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com [IPv6:2607:f8b0:4864:20::1049]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 176C0C061756 for ; Thu, 23 Dec 2021 14:23:44 -0800 (PST) Received: by mail-pj1-x1049.google.com with SMTP id w23-20020a17090a15d700b001b15a89e63fso6400544pjd.3 for ; Thu, 23 Dec 2021 14:23:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=79VM+FjN4+gNHCbA/cPFX/DKrZdg0OUTgV3ZeDUQV64=; b=nlDhfoBF1UMR4zeZ2AMSsTw+2NFav++t76m5v/u01E8WX1IeLkIOgAwZxi4k8dtyD8 DINKhA/okKbzh3Bl79w0UHPYPe+mI4qQSZBXUA7Q47lwZHbduNO4ktigSHa5Ti2t/EkP 9kdEt9PUWiSDuPmNi2sOxizUslZ06oNOwgJg5zBRPAZNUhm5xf+Ew1gTWxeKs+A3wRS2 LDbF9nF0FWwkIu5BMvu+ht6a0imSBbA4HP6uyTQIFXDk/edSQHkFwBMbZFIq4yFBcfyM +Jnkd2dtw/HibGSkH9vS+j3S8TWU1rDiX1GYc8Fia1d0AR1CpgvgnpxJMh+NbICv3H0j QQCA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=79VM+FjN4+gNHCbA/cPFX/DKrZdg0OUTgV3ZeDUQV64=; b=rUamxjLpqFcFYEcSu72DrgKY1EL9m/l9TzNCLYlnUT/OQZ38wvY6WoSt7BXKMKMJ0p PDmw4NVqGdEpCC1QnRB7kQaJkmXu7ScJ25n4fMA6IY84Wr1zQHhKfA5pLRbX1sjCtz91 
x7wvnbSSkNBJK+emjhfobf7CuLCJQidNl+DYwsWfUYiwqK7p8m87JSGUMYp5F7l7Erj2 bb+qXEeNzuA+9Sjhc0JKGMkAT1/pIeulC1co9HCKHWfo1lzuok0sT3EqGioRHvPh63pA /tp8M4in3My1HCLIA3+ixGxzksKHKLTYnz5bDePATg6nraGAcMhwr4FwXhuvSHagdZsX PZpA== X-Gm-Message-State: AOAM5337aBZ7H8rdVbFTTmIRaYPcFbWuYTlw7RDzSPg5mcUgNYUBuRQj WiDEW6Q6x+qDCugWXdm6DSycFqFOkno= X-Google-Smtp-Source: ABdhPJzIg1slQVwWGbXNsGn/iIW/Jo8Ri2O+OYkocHu7aRAWSVbg1N6hanW77pCArB6nbzVf3Peu+sZUvEs= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90a:fa12:: with SMTP id cm18mr4714053pjb.141.1640298223566; Thu, 23 Dec 2021 14:23:43 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:49 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-2-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 01/30] KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Use the common TDP MMU zap helper when handling an MMU notifier unmap event, the two flows are semantically identical. Consolidate the code in preparation for a future bug fix, as both kvm_tdp_mmu_unmap_gfn_range() and __kvm_tdp_mmu_zap_gfn_range() are guilty of not zapping SPTEs in invalid roots. No functional change intended. 
Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 7b1bc816b7c3..d320b56d5cd7 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1032,13 +1032,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, bool flush) { - struct kvm_mmu_page *root; - - for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false) - flush = zap_gfn_range(kvm, root, range->start, range->end, - range->may_block, flush, false); - - return flush; + return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start, + range->end, range->may_block, flush); } typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter, -- 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DBB16C433F5 for ; Thu, 23 Dec 2021 22:23:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350442AbhLWWXx (ORCPT ); Thu, 23 Dec 2021 17:23:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38656 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350417AbhLWWXp (ORCPT ); Thu, 23 Dec 2021 17:23:45 -0500 Received: from mail-pg1-x549.google.com (mail-pg1-x549.google.com [IPv6:2607:f8b0:4864:20::549]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7420FC061757 for ; Thu, 23 Dec 2021 14:23:45 -0800 (PST) Received: by mail-pg1-x549.google.com with SMTP id k21-20020a63f015000000b0033db7baf101so3863612pgh.19 for ; Thu, 23 Dec 2021 14:23:45 -0800 (PST) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=O0B5yHs9mrLguToeJUNtqwnxH46d2zShhuTlEqWnd2Y=; b=j3f476DX6cFUy1LeHxbgKU/39XYmfhXM7YCJPoGXYJXG4tzh/TIg4LDroE2Eh8ExsZ AszwXw0eWIPsVr7PnWBZcgNh6EpXOAOPb0XC3CmRJXBSR0HJsp62GWyPgDvuFruj8vHG PfjgOdZhKq9insutYDBhPQnerzAdn3MEPZpEXni76KbXEiDUAu39kuYLUx949LrpK2l1 s/7T1AusdxAnqAKm683N6/qpuVdUd+mEwrFZpFkZm23/wPMb7dnPnFgoASe+Av8Kv9gF KGyCf757+Ogsd5v6vCHUBpuzle1TjAWrsR3KpYnAOnCNESSYhkGC/FZIGaJCqBdVbDsS uxWA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=O0B5yHs9mrLguToeJUNtqwnxH46d2zShhuTlEqWnd2Y=; b=ECA/t4rhc4Wsh+OtZdXGiqIj/Z7NKINwXQsC/O6yujrwrNnO14uXxACLZ5CZcH4gDO JsnfdbWTJYFnl17JCFfom5Q30/tUzmpZSlqtlYpsGQ1WnGkBmSao3zAYgnbgy4EuTNck uTrVoXlnkxnQ3l/5G9GDX+MEA68H3/lTCIEt5fiCXOdjO1wnMFLPPJ7jYn8UdCWicGH1 lrW7uV4Zvo3J5nb7IvK/R0qZgJUiuw0oCoZu66/N+7HYAz2m0EIEFe6aZIgSAsytHkMb T/0/yt+MVvdqAuapR3DxdXq2Ory71GbOmMMTRHmquDlxqQHnKIuAXGkibzrqwDzV6zYS nbIA== X-Gm-Message-State: AOAM530WLH6yFGyN077uH4cfhx0eryVMS0ImMwtghiP7NvaAC0WWAtXF Qze6jZ1yZLYjXrl0hndhErYgkXiWv9U= X-Google-Smtp-Source: ABdhPJwz12iTb+zG1EDREi+ibfO4OLS6nmunCF6taJTygfFDtz9t2fFNJG5TmDYp4Fppj1Dyf3gqG6psTDE= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a05:6a00:2187:b0:4a7:fedf:8ff9 with SMTP id h7-20020a056a00218700b004a7fedf8ff9mr4129022pfi.9.1640298224993; Thu, 23 Dec 2021 14:23:44 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:50 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-3-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 02/30] KVM: x86/mmu: Move "invalid" 
check out of kvm_tdp_mmu_get_root() From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Move the check for an invalid root out of kvm_tdp_mmu_get_root() and into the one place it actually matters, tdp_mmu_next_root(), as the other user already has an implicit validity check. A future bug fix will need to get references to invalid roots to honor mmu_notifier requests; there's no point in forcing what will be a common path to open code getting a reference to a root. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++++-- arch/x86/kvm/mmu/tdp_mmu.h | 3 --- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index d320b56d5cd7..200001190fcf 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -121,9 +121,14 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots, typeof(*next_root), link); - while (next_root && !kvm_tdp_mmu_get_root(kvm, next_root)) + while (next_root) { + if (!next_root->role.invalid && + kvm_tdp_mmu_get_root(kvm, next_root)) + break; + next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots, &next_root->link, typeof(*next_root), link); + } rcu_read_unlock(); @@ -200,7 +205,10 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level); - /* Check for an existing root before allocating a new one. */ + /* + * Check for an existing root before allocating a new one. 
Note, the + * role check prevents consuming an invalid root. + */ for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) { if (root->role.word == role.word && kvm_tdp_mmu_get_root(kvm, root)) diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 3899004a5d91..08c917511fed 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -10,9 +10,6 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu); __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm, struct kvm_mmu_page *root) { - if (root->role.invalid) - return false; - return refcount_inc_not_zero(&root->tdp_mmu_root_count); } -- 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A6499C433F5 for ; Thu, 23 Dec 2021 22:23:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350471AbhLWWX4 (ORCPT ); Thu, 23 Dec 2021 17:23:56 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38668 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350422AbhLWWXr (ORCPT ); Thu, 23 Dec 2021 17:23:47 -0500 Received: from mail-pg1-x549.google.com (mail-pg1-x549.google.com [IPv6:2607:f8b0:4864:20::549]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2F3AFC061759 for ; Thu, 23 Dec 2021 14:23:47 -0800 (PST) Received: by mail-pg1-x549.google.com with SMTP id s16-20020a63ff50000000b0033b6e4cedc8so3877813pgk.8 for ; Thu, 23 Dec 2021 14:23:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=r5AFn4RCMgqUfgGu1hVllvQY0Qnai8Du0jee32fy2GE=; b=JnGNaEUL5DaNDIAqcnqFUV9gJzkXIPHdne6msuarIDeRsdzVe9j5BZhYTgWPAc3kb2 
YqigPSe0qYop80rjrutzrRQGfh1rPoopUu4u9rNRPJlfrEZnSwJl27MvGF5cvJ8gPs6h HBIxYsDfqTTuCiEe4Bq+c9m+27x5cU0RuofNyUp29gEWCAi38RhdXRg/F3q6EFyNA6Yc ZeoKqcLUYGpwopeTRsaQgKCjEhSvqf22qbLnqgsLQW1jqqna1Z+wDLrsZeVFxSXCIi+X 0Yd0H7u/joCRL6cSEI1hat2U6dbd06rZHBUawWfq0NGuXTBcczlnMSeVdIPnSXLx/O0f V9IA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=r5AFn4RCMgqUfgGu1hVllvQY0Qnai8Du0jee32fy2GE=; b=UTUB6a7adNjyE5sxIu3Y/AZAXuKqp86MPRsWRDmxqfbVrPK6OSwhTjqOQwlGK9+SGf adZX/DLx2Gr/DjSFYTcoI0uVd3NUVCg6NKj7+xwKSSCILa0ofD2sPj2sCyZmvBCxbrV1 L2iYEnYEeSbhklD90KH/ULSxJ6j2wODZuQGhQo4ydKTq/SG8khU/QArc4+B57kY6B2FS pWjbeFcd3nEtP4vx5w3Q/gRlYVLBG2PVgaOgXj98j787JApmVkuXriwnLzxxriClDG7X fV8orTVjiNdtFi4sP9rg4j3xMM4VNq4Snx+rYtFnNsEIjIIuPuxubRPEzHxIrpW3Bpff nWdQ== X-Gm-Message-State: AOAM532t9JHReo/teIaV/lxDPH4uavWGqdCbjeUotPopHRpQlZjg1+fM RWZ4hRhOQEA45nwsZ75qkHamTyEprGk= X-Google-Smtp-Source: ABdhPJwBWPjs350K7xbsdKCjZNMI5z/bHmRVnyyOermRQrAcMNQmE8Qc+ApcdvilpdVim7ddlMHaLN/StUc= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a05:6a00:114d:b0:4a2:87bd:37f with SMTP id b13-20020a056a00114d00b004a287bd037fmr4357612pfm.82.1640298226732; Thu, 23 Dec 2021 14:23:46 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:51 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-4-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 03/30] KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk 
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Zap both valid and invalid roots when zapping/unmapping a gfn range, as KVM must ensure it holds no references to the freed page after returning from the unmap operation.

Most notably, the TDP MMU doesn't zap invalid roots in mmu_notifier callbacks. This leads to use-after-free and other issues if the mmu_notifier runs to completion while an invalid root zapper yields as KVM fails to honor the requirement that there must be _no_ references to the page after the mmu_notifier returns.

The bug is most easily reproduced by hacking KVM to cause a collision between set_nx_huge_pages() + kvm_mmu_notifier_release(), but the bug exists between kvm_mmu_notifier_invalidate_range_start() and memslot updates as well.

Invalidating a root ensures pages aren't accessible by the guest, and KVM won't read or write page data itself, but KVM will trigger e.g. kvm_set_pfn_dirty() when zapping SPTEs, and thus completing a zap _after_ the mmu_notifier returns is fatal.
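The failure mode described above can be sketched in plain userspace C. This is a toy model and not KVM code: struct toy_root and the toy_*() helpers are hypothetical names invented for illustration, and only the (!only_valid || !invalid) filter is modeled on the patch. It shows that an unmap which skips invalid roots can return while an invalidated, but not yet freed, root still maps the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a TDP MMU root; this is not KVM's struct kvm_mmu_page. */
struct toy_root {
	bool invalid;	/* root was invalidated but has not been freed yet */
	int refcount;	/* 0 means the root is defunct */
	bool maps_page;	/* root still holds a mapping for the page of interest */
};

/*
 * Return the index of the next root that may be visited, searching from
 * @start, or -1 if the end of the array was reached.  If @only_valid is
 * true, invalid roots are skipped.
 */
static int toy_next_root(const struct toy_root *roots, size_t nr, size_t start,
			 bool only_valid)
{
	for (size_t i = start; i < nr; i++) {
		if ((!only_valid || !roots[i].invalid) && roots[i].refcount > 0)
			return (int)i;
	}
	return -1;
}

/* Zap the page from every visited root; return how many roots still map it. */
static int toy_unmap_page(struct toy_root *roots, size_t nr, bool only_valid)
{
	int stale = 0;

	assert(roots || !nr);

	for (int i = toy_next_root(roots, nr, 0, only_valid); i >= 0;
	     i = toy_next_root(roots, nr, (size_t)i + 1, only_valid))
		roots[i].maps_page = false;

	for (size_t i = 0; i < nr; i++)
		stale += roots[i].maps_page;

	return stale;
}

/* One valid and one invalid root both map the page; unmap and count leftovers. */
int toy_stale_roots_after_unmap(bool only_valid)
{
	struct toy_root roots[] = {
		{ .invalid = false, .refcount = 1, .maps_page = true },
		{ .invalid = true,  .refcount = 1, .maps_page = true },
	};

	return toy_unmap_page(roots, 2, only_valid);
}
```

In this model, toy_stale_roots_after_unmap(true) leaves one stale mapping behind in the invalid root, while toy_stale_roots_after_unmap(false) leaves none, which is the behavior the unmap path needs.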
WARNING: CPU: 24 PID: 1496 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:173 [kvm] RIP: 0010:kvm_is_zone_device_pfn+0x96/0xa0 [kvm] Call Trace: kvm_set_pfn_dirty+0xa8/0xe0 [kvm] __handle_changed_spte+0x2ab/0x5e0 [kvm] __handle_changed_spte+0x2ab/0x5e0 [kvm] __handle_changed_spte+0x2ab/0x5e0 [kvm] zap_gfn_range+0x1f3/0x310 [kvm] kvm_tdp_mmu_zap_invalidated_roots+0x50/0x90 [kvm] kvm_mmu_zap_all_fast+0x177/0x1a0 [kvm] set_nx_huge_pages+0xb4/0x190 [kvm] param_attr_store+0x70/0x100 module_attr_store+0x19/0x30 kernfs_fop_write_iter+0x119/0x1b0 new_sync_write+0x11c/0x1b0 vfs_write+0x1cc/0x270 ksys_write+0x5f/0xe0 do_syscall_64+0x38/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++++++++++--------------- 1 file changed, 24 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 200001190fcf..577985fa001d 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -99,15 +99,18 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, } /* - * Finds the next valid root after root (or the first valid root if root - * is NULL), takes a reference on it, and returns that next root. If root - * is not NULL, this thread should have already taken a reference on it, and - * that reference will be dropped. If no valid root is found, this - * function will return NULL. + * Returns the next root after @prev_root (or the first root if @prev_root is + * NULL). A reference to the returned root is acquired, and the reference to + * @prev_root is released (the caller obviously must hold a reference to + * @prev_root if it's non-NULL). + * + * If @only_valid is true, invalid roots are skipped. + * + * Returns NULL if the end of tdp_mmu_roots was reached. 
*/ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, struct kvm_mmu_page *prev_root, - bool shared) + bool shared, bool only_valid) { struct kvm_mmu_page *next_root; @@ -122,7 +125,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, typeof(*next_root), link); while (next_root) { - if (!next_root->role.invalid && + if ((!only_valid || !next_root->role.invalid) && kvm_tdp_mmu_get_root(kvm, next_root)) break; @@ -148,13 +151,19 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, * mode. In the unlikely event that this thread must free a root, the lock * will be temporarily dropped and reacquired in write mode. */ -#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \ - for (_root = tdp_mmu_next_root(_kvm, NULL, _shared); \ - _root; \ - _root = tdp_mmu_next_root(_kvm, _root, _shared)) \ - if (kvm_mmu_page_as_id(_root) != _as_id) { \ +#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\ + for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \ + _root; \ + _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \ + if (kvm_mmu_page_as_id(_root) != _as_id) { \ } else +#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \ + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true) + +#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \ + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false) + #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \ lockdep_is_held_type(&kvm->mmu_lock, 0) || \ @@ -1224,7 +1233,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, lockdep_assert_held_read(&kvm->mmu_lock); - for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) spte_set |= wrprot_gfn_range(kvm, root, 
slot->base_gfn, slot->base_gfn + slot->npages, min_level); @@ -1294,7 +1303,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, lockdep_assert_held_read(&kvm->mmu_lock); - for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn, slot->base_gfn + slot->npages); @@ -1419,7 +1428,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, lockdep_assert_held_read(&kvm->mmu_lock); - for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) + for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) zap_collapsible_spte_range(kvm, root, slot); } -- 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 796ADC433F5 for ; Thu, 23 Dec 2021 22:24:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350528AbhLWWYA (ORCPT ); Thu, 23 Dec 2021 17:24:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38690 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350446AbhLWWXw (ORCPT ); Thu, 23 Dec 2021 17:23:52 -0500 Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com [IPv6:2607:f8b0:4864:20::449]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EE780C06175D for ; Thu, 23 Dec 2021 14:23:48 -0800 (PST) Received: by mail-pf1-x449.google.com with SMTP id n18-20020a056a00213200b004baa74aca72so3987918pfj.19 for ; Thu, 23 Dec 2021 14:23:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; 
bh=ZUWnlx6tDuEZy0/vV2sw5dzru8Z2gIDouCHSlfxHcxQ=; b=k4V9QdP6pgnjQiAyDIGFQ3IbXAdMAIGLwhbBQGB76N29OcRiybSsC1Xq3IbzDQW0JV ud5aflcB07n6mMIEAbFEJGTVUZkS6hL2EiQCLoFu+bJQ41J+kn4rOvQjYbn+SBZJVmRl 2KFAfBfJepWrqN3BuApd2XekxXIGmWVoeb2R9HqgyoFaoKHh8EH5/W+RzAfCcPhHk7X9 lN4HmlUbpX505CK5tZMLIVLJrFlCsC1ZTF2dbvxKlrJUG74Di/h/0MROuY+Gg8Y8pSGi Da+PtK5QGf+tuqWBw/NE5N6AUCFE6VmRAhX62gd3R9Ljd/fgY1wMZsHIv2UPUWcJ8Rdh Aqfw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=ZUWnlx6tDuEZy0/vV2sw5dzru8Z2gIDouCHSlfxHcxQ=; b=SlnX6DtGZArDnl+4gvRjGHULEAm7bwDZowqL3YpzfZcjo+3A+fcWEclyxWIF4Cv/pT 3TEO05DfOvIphkfrB16wZAePGkleqWgLRKCmQQCG11YU2Cn8qH+DDOpW0/0AGKWqLq0Z phlpGQaeQ1+Y4AaPeA7m06ZRHcdfqMfXEJ2c8NqdTFb438OlMJUfaEaCuhgoFxjOpn0r c9TmGC0q5+n07crdcDkm0f+FVz95x9QJTHab8nxd7BuwuABebNh6zGfNkcdn/NrJzpOJ 9uYBSSAuEBxPpsm2ZkBVqUhoZEkLNBqCYpX4Rzl7lawFAbIyWdez80wSn2xsEBYxZ3yH 8R0A== X-Gm-Message-State: AOAM53241piiQ5zl5yRtDHh0voMBMrIZa6yZE1oTyFeAQ5JJU7ejHUgi EL4REDIO9zYMW4A+bld83yDX+AF7NWo= X-Google-Smtp-Source: ABdhPJyn5LW9UyEMjasLetWykpvTmR07wO2hYJdg6iT6muq92I13GNnJNVEpjCDGp6kCVhZ+fkbOYRoBGQE= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a62:6488:0:b0:4ba:95ec:a333 with SMTP id y130-20020a626488000000b004ba95eca333mr4093463pfb.23.1640298228413; Thu, 23 Dec 2021 14:23:48 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:52 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-5-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 04/30] KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg 
Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Now that tdp_mmu_next_root() can process both valid and invalid roots, extend it to be able to process _only_ invalid roots, add yet another iterator macro for walking invalid roots, and use the new macro in kvm_tdp_mmu_zap_invalidated_roots(). No functional change intended. Signed-off-by: Sean Christopherson Reviewed-by: David Matlack --- arch/x86/kvm/mmu/tdp_mmu.c | 74 ++++++++++++++------------------------ 1 file changed, 26 insertions(+), 48 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 577985fa001d..41e975841ea6 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -98,6 +98,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback); } +enum tdp_mmu_roots_iter_type { + ALL_ROOTS = -1, + VALID_ROOTS = 0, + INVALID_ROOTS = 1, +}; + /* * Returns the next root after @prev_root (or the first root if @prev_root is * NULL). A reference to the returned root is acquired, and the reference to @@ -110,10 +116,16 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, */ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, struct kvm_mmu_page *prev_root, - bool shared, bool only_valid) + bool shared, + enum tdp_mmu_roots_iter_type type) { struct kvm_mmu_page *next_root; + kvm_lockdep_assert_mmu_lock_held(kvm, shared); + + /* Ensure correctness for the below comparison against role.invalid. 
*/ + BUILD_BUG_ON(!!VALID_ROOTS || !INVALID_ROOTS); + rcu_read_lock(); if (prev_root) @@ -125,7 +137,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, typeof(*next_root), link); while (next_root) { - if ((!only_valid || !next_root->role.invalid) && + if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) && kvm_tdp_mmu_get_root(kvm, next_root)) break; @@ -151,18 +163,21 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, * mode. In the unlikely event that this thread must free a root, the lock * will be temporarily dropped and reacquired in write mode. */ -#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\ - for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \ +#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _type) \ + for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _type); \ _root; \ - _root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \ - if (kvm_mmu_page_as_id(_root) != _as_id) { \ + _root = tdp_mmu_next_root(_kvm, _root, _shared, _type)) \ + if (_as_id > 0 && kvm_mmu_page_as_id(_root) != _as_id) { \ } else +#define for_each_invalid_tdp_mmu_root_yield_safe(_kvm, _root) \ + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, -1, true, INVALID_ROOTS) + #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \ - __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true) + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, VALID_ROOTS) #define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \ - __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, false) + __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS) #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \ @@ -811,28 +826,6 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm) 
kvm_flush_remote_tlbs(kvm); } -static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm, - struct kvm_mmu_page *prev_root) -{ - struct kvm_mmu_page *next_root; - - if (prev_root) - next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots, - &prev_root->link, - typeof(*prev_root), link); - else - next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots, - typeof(*next_root), link); - - while (next_root && !(next_root->role.invalid && - refcount_read(&next_root->tdp_mmu_root_count))) - next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots, - &next_root->link, - typeof(*next_root), link); - - return next_root; -} - /* * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each * invalidated root, they will not be freed until this function drops the @@ -843,36 +836,21 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm, */ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) { - struct kvm_mmu_page *next_root; struct kvm_mmu_page *root; bool flush = false; lockdep_assert_held_read(&kvm->mmu_lock); - rcu_read_lock(); - - root = next_invalidated_root(kvm, NULL); - - while (root) { - next_root = next_invalidated_root(kvm, root); - - rcu_read_unlock(); - + for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) { flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true); /* - * Put the reference acquired in - * kvm_tdp_mmu_invalidate_roots + * Put the reference acquired in kvm_tdp_mmu_invalidate_roots(). + * Note, the iterator holds its own reference. 
*/ kvm_tdp_mmu_put_root(kvm, root, true); - - root =3D next_root; - - rcu_read_lock(); } =20 - rcu_read_unlock(); - if (flush) kvm_flush_remote_tlbs(kvm); } --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A1AD4C433F5 for ; Thu, 23 Dec 2021 22:24:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350509AbhLWWX6 (ORCPT ); Thu, 23 Dec 2021 17:23:58 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38700 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350452AbhLWWXx (ORCPT ); Thu, 23 Dec 2021 17:23:53 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 847ADC061784 for ; Thu, 23 Dec 2021 14:23:50 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id b8-20020a17090a10c800b001a61dff6c9dso4031104pje.5 for ; Thu, 23 Dec 2021 14:23:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=fpIDrpWZwmj6BNONuDX3BQPSQp/TrPoIB2oyQBfGSwo=; b=GRcaaHwNTlE7NbCGNV/wpojHO94gvVAt5hXaCqZRlKal8bVZDb3AR7GVeWbPVCBIp3 JylcRrzpPNy4j7rzIblndCS2J2VE68MPO0n1sfCMr0Snbg06HoeuO8+GnBfNtMljOxvQ rjLPLHtz/rbxAA/maMsbXxnkiXxNQ/1D2UPVOJowJrFNEVjrLljnelDNEvtmMqcKH9h/ iDERCmRJOGePmwwt6T+JwDLSqX8CFOqZ2n6mILxbTJ2mao/zYUoKYGTo/nZYbgQTbRgX 6w9HzzMk/m/xRaxtf+jplH3oB6AfmUi6TgqUVA9fcdhQHIsvOZIyUFBPw5j6mk4OppQV +baQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; 
bh=fpIDrpWZwmj6BNONuDX3BQPSQp/TrPoIB2oyQBfGSwo=; b=gg7s3Tb/9lQ4IMCz/CQkUKduPdqL3Tc5uwon0e1K2kHDf04wnLsDVV5ucsseta/0TW O9DrW+2PL4ZAUZqMSmUJCXUIhfG3zHHBvs242dgkYahqnWWppV4ZfxVMokj5XtbuJQFO h44UFxuDy6Zg1rPfRL1S61hxk1J2jBy6F9KyfSLPSvNu/S9mZKTjk9HJqRoyqCWacy8I bQwgWenqueZJcGoFCP+ZxOWPSmgJGNf7a0cvUGj5LgFAU1YsdWlDzn/nJiEzZWmt0zTU mgAzEX4kqFcN4zXIoZ0VyXTSVLrDHwYWANGasSxSsUnwoIT50AB/6GxzxCOx0ejeuUF7 UX1g== X-Gm-Message-State: AOAM532bE43TxK/YM1zk07GimD8k5VQVp1xfRsQJ2paVgaMYXWJ7B334 3USH+fjeMrgxG6TbRH9MeeiOjA/jFNA= X-Google-Smtp-Source: ABdhPJyclXLRTwW6GJeCEWsyo4Kj1bo3ZQ/06TQEYE++AKaw4dOta3j54//LgcIFPrdnUSZzHL2DS03/7L0= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:902:7797:b0:143:88c3:7ff1 with SMTP id o23-20020a170902779700b0014388c37ff1mr3972429pll.22.1640298230048; Thu, 23 Dec 2021 14:23:50 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:53 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-6-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 05/30] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Explicitly check for present SPTEs when clearing dirty bits in the TDP MMU. This isn't strictly required for correctness, as setting the dirty bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT. 
However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would complain if anyone actually turned on KVM's MMU debugging. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Cc: Ben Gardon Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 41e975841ea6..fcbae282af6f 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1239,6 +1239,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) continue; + if (!is_shadow_present_pte(iter.old_spte)) + continue; + if (spte_ad_need_write_protect(iter.old_spte)) { if (is_writable_pte(iter.old_spte)) new_spte = iter.old_spte & ~PT_WRITABLE_MASK; -- 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2A89DC4332F for ; Thu, 23 Dec 2021 22:24:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350474AbhLWWYF (ORCPT ); Thu, 23 Dec 2021 17:24:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38708 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350464AbhLWWXy (ORCPT ); Thu, 23 Dec 2021 17:23:54 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 38FE2C061756 for ; Thu, 23 Dec 2021 14:23:52 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id h33-20020a17090a29a400b001b20b7d48dfso3945276pjd.0 for ; Thu, 23 Dec 2021 14:23:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=/C3VB2aYSnMeLUhQSsL7UqIJMPi2v/9JP8zRWM/80L8=; b=WuV/B8WuGmqt6FNXhJhidQyLJqByY8R6S4ZMlQAbUXQZQUDMDuSLC68P/xmjUnvDLP JnubPJVQiV7ttO7tFrLc5QmqHgHRjPvdci+BrdZN8hQyU2n3Guv4NZpTRrkxrezcs76J i7x/qN85MEICBd9nxqSRuv0XSE4PG8tsJD5PMvY3fC1uEt/du6GFEASv4mOlyQJUpj0r gZ6BVbSpidgHzxXq/eAud7xsDaiApk5WZXPTm10Gt9gRw0oFd1LTx2asedpN7L2ITwev QM0FxO15N98a0sOzrsuJeXqMJvABv/E+QVzGke+etsP8z9sXHOIMBSVQF+lA14rGSZu4 ko9w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=/C3VB2aYSnMeLUhQSsL7UqIJMPi2v/9JP8zRWM/80L8=; b=0kUXAdnsPqA4rcp5M9JL7NEi+kxiPiIY5hKEm+vKQvq8mzJVkv6VnhWLW9CwQZqtxk QzwNMC9whpluKSt/KNEo8iJAOlPPP6TYzM66n1XVCkPVOHZVDb9ADCwiS267gKhaI8zl M0FmlMaGtgPquB38UdyDlrCX+u2at7UrAjwp7QXu3RKUTlcUFKSlB8Y5UhYoTnkL/X5s tIlVTEVRdz8POFlm0BBtH0aeGF0GX82c3DolWLujn1HCyQhQyzbp4DnGJ51RTIy9zbpf Imkoof1wQ+uzsrLdtR5e3Dv8RYxtYWmU1W2h9Y+P+wqPxIYadaeKQIcHnWVoo4n5XLzp fsdg== X-Gm-Message-State: AOAM531pwhM8TQCQHwKeDD8/V3VRFgqeZxDlK5vNfVKT1gPejd+BJgr/ PycicFQJt5nxfM/heF+IBAEkqZCvPpM= X-Google-Smtp-Source: ABdhPJx/W4+wNaHw8WuqGA4G5O9fXG26GZ5qtzGiOo9Q9/SYGQyULfCkHCCJuiCApMAwzTTiAEQ/rD6Mryo= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:aa7:8545:0:b0:4ba:7163:7dfd with SMTP id y5-20020aa78545000000b004ba71637dfdmr4278433pfn.61.1640298231685; Thu, 23 Dec 2021 14:23:51 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:54 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-7-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 06/30] KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap 
From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Fix misleading and arguably wrong comments in the TDP MMU's fast zap flow. The comments, and the fact that actually zapping invalid roots was added separately, strongly suggest that zapping invalid roots is an optimization and not required for correctness. That is a lie. KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(), because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(), KVM is relying on it to fully remove all references to the memslot. Once the memslot is gone, KVM's mmu_notifier hooks will be unable to find the stale references as the hva=>gfn translation is done via the memslots. If KVM doesn't immediately zap SPTEs and userspace unmaps a range after deleting a memslot, KVM will fail to zap in response to the mmu_notifier due to not finding a memslot corresponding to the notifier's range, which leads to a variation of use-after-free. The other misleading comment (and code) explicitly states that roots without a reference should be skipped. While that's technically true, it's also extremely misleading as it should be impossible for KVM to encounter a defunct root on the list while holding mmu_lock for write. Opportunistically add a WARN to enforce that invariant.
Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU") Fixes: 4c6654bd160d ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns") Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 8 +++++++ arch/x86/kvm/mmu/tdp_mmu.c | 46 +++++++++++++++++++++----------------- 2 files changed, 33 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 1d275e9d76b5..94590bc97a67 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5692,6 +5692,14 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm) write_unlock(&kvm->mmu_lock); + /* + * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before + * returning to the caller, e.g. if the zap is in response to a memslot + * deletion, mmu_notifier callbacks will be unable to reach the SPTEs + * associated with the deleted memslot once the update completes, and + * deferring the zap until the final reference to the root is put would + * lead to use-after-free. + */ if (is_tdp_mmu_enabled(kvm)) { read_lock(&kvm->mmu_lock); kvm_tdp_mmu_zap_invalidated_roots(kvm); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index fcbae282af6f..4f5c8e7380a9 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -827,12 +827,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm) } /* - * Since kvm_tdp_mmu_zap_all_fast has acquired a reference to each - * invalidated root, they will not be freed until this function drops the - * reference. Before dropping that reference, tear down the paging - * structure so that whichever thread does drop the last reference - * only has to do a trivial amount of work. Since the roots are invalid, - * no new SPTEs should be created under them. + * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast + * zap" completes.
Since kvm_tdp_mmu_invalidate_all_roots() has acquired a + * reference to each invalidated root, roots will not be freed until after this + * function drops the gifted reference, e.g. so that vCPUs don't get stuck with + * tearing down paging structures. */ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) { @@ -856,21 +855,25 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) } /* - * Mark each TDP MMU root as invalid so that other threads - * will drop their references and allow the root count to - * go to 0. + * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that + * is about to be zapped, e.g. in response to a memslots update. The caller is + * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to do the actual + * zapping. * - * Also take a reference on all roots so that this thread - * can do the bulk of the work required to free the roots - * once they are invalidated. Without this reference, a - * vCPU thread might drop the last reference to a root and - * get stuck with tearing down the entire paging structure. + * Take a reference on all roots to prevent the root from being freed before it + * is zapped by this thread. Freeing a root is not a correctness issue, but if + * a vCPU drops the last reference to a root prior to the root being zapped, it + * will get stuck with tearing down the entire paging structure. * - * Roots which have a zero refcount should be skipped as - * they're already being torn down. - * Already invalid roots should be referenced again so that - * they aren't freed before kvm_tdp_mmu_zap_all_fast is - * done with them. + * Get a reference even if the root is already invalid, + * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all + * invalid roots, e.g. there's no epoch to identify roots that were invalidated + * by a previous call.
Roots stay on the list until the last reference is + * dropped, so even though all invalid roots are zapped, a root may not go away + * for quite some time, e.g. if a vCPU blocks across multiple memslot updates. + * + * Because mmu_lock is held for write, it should be impossible to observe a + * root with zero refcount, i.e. the list of roots cannot be stale. * * This has essentially the same effect for the TDP MMU * as updating mmu_valid_gen does for the shadow MMU. @@ -880,9 +883,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm) struct kvm_mmu_page *root; lockdep_assert_held_write(&kvm->mmu_lock); - list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) - if (refcount_inc_not_zero(&root->tdp_mmu_root_count)) + list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) { + if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(kvm, root))) root->role.invalid = true; + } } /* -- 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5E808C4332F for ; Thu, 23 Dec 2021 22:24:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350545AbhLWWY0 (ORCPT ); Thu, 23 Dec 2021 17:24:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38692 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350478AbhLWWXz (ORCPT ); Thu, 23 Dec 2021 17:23:55 -0500 Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com [IPv6:2607:f8b0:4864:20::1049]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D8729C061759 for ; Thu, 23 Dec 2021 14:23:53 -0800 (PST) Received: by mail-pj1-x1049.google.com with SMTP id w19-20020a17090aea1300b001ad6e2148ccso4997035pjy.1 for ; Thu, 23 Dec 2021 14:23:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256;
c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=cBF5OK3OFSDwub1dlVAqrTIj8KZrSkeas4HxldxyG2U=; b=TLf5tBuDqLRcIFVQFfDb6uV/HMlwQlGQb4qRYPHmVkJLRl8aHqeDSO2sXi4vEwJvoR /aqESeuhQF+AcaTBTarCLB9tpKxJdLjK+Zxyg7RGbKu28YTtim/Efk9y0bXlTCL1AUcg dk1Jux03qvlXIMR7+sUlTvBUivPkQYP3GIjhKf0AxLp8TtfB5u3VS2yRiY/tIyd+fp6L T0wbTq0/y8EQu3Jklm3itVEfMgqAzqv8T2feiZR41DESCU56m3eloHruYSq5bm+Wp/BB xwAgCzmeY7Cjikn/+K+AFpJu0L9kiMbuUT+HRNCuolZr4OlyP5jLJEwgdadZhr9HM7/c c0jQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=cBF5OK3OFSDwub1dlVAqrTIj8KZrSkeas4HxldxyG2U=; b=xi8fZpCxOb3p+F3F+9HpVfJrfa0yrccawaYW4tfULGwNygpTxLHLAIZ0XM/lzLyHag 1cemvzZC3O07mOTITo71t43mETpVfQGk2G0r00+TMjBKYIN+sTk4rccMqQkaHMY752ca aPSz9ZWBkJXuphLWMksSY2bDrMjpPlqhmnuyykFVcAFP396l3zZoIULpCuLwV1//WwI7 G17fJuTjqtQMVBjr7ECBt3Mb16E6r25ID+/tASSuLbfSKhLsXKLewXmBqEHURXhrPuaL RBqr35M7llnLYiga72ksTPeMZYu40ABcqK2zyXrIe0E/ez1P9AjGl6t94fdZ6Yt7p+CA L6OA== X-Gm-Message-State: AOAM530yNCYVeL5i0QuSXfK83r1t3LHROI9FLTj3bQegbq1ScxAPKBMS JCLQ0qFH8i+E5Z8CM1QZBPnSYCd2xcI= X-Google-Smtp-Source: ABdhPJzugLke9FkvgxxgSQ+09nGN4pAmXPvlEe1oTl20wsMFlwtYzirQ5F8h7APJb64Aj2880xUWr9DMwe0= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a05:6a00:cc7:b0:4bb:1511:8401 with SMTP id b7-20020a056a000cc700b004bb15118401mr4405550pfv.44.1640298233370; Thu, 23 Dec 2021 14:23:53 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:55 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-8-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 07/30] KVM: x86/mmu: Formalize TDP MMU's (unintended?) 
deferred TLB flush logic From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Explicitly ignore the result of zap_gfn_range() when putting the last reference to a TDP MMU root, and add a pile of comments to formalize the TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this only affects the !shared case, as zap_gfn_range() subtly never returns true for "flush" in the shared case, where the flush is handled by tdp_mmu_zap_spte_atomic(). Putting the root without a flush is ok because even if there are stale references to the root in the TLB, they are unreachable because KVM will not run the guest with the same ASID without first flushing (where ASID in this context refers to both SVM's explicit ASID and Intel's implicit ASID that is constructed from VPID+PCID+EPT4A+etc...). Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 8 ++++++++ arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++- 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 94590bc97a67..6549c13e89d9 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5100,6 +5100,14 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) kvm_mmu_sync_roots(vcpu); kvm_mmu_load_pgd(vcpu); + + /* + * Flush any TLB entries for the new root, the provenance of the root + * is unknown. Even if KVM ensures there are no stale TLB + * entries for a freed root, in theory, an out-of-tree hypervisor could + * have left stale entries. Flushing on alloc also allows KVM to skip + * the TLB flush when freeing a root (see kvm_tdp_mmu_put_root()).
+ */ static_call(kvm_x86_tlb_flush_current)(vcpu); out: return r; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 4f5c8e7380a9..66b75c197c94 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, list_del_rcu(&root->link); spin_unlock(&kvm->arch.tdp_mmu_pages_lock); - zap_gfn_range(kvm, root, 0, -1ull, false, false, shared); + /* + * A TLB flush is not necessary as KVM performs a local TLB flush when + * allocating a new root (see kvm_mmu_load()), and when migrating a vCPU + * to a different pCPU. Note, the local TLB flush on reuse also + * invalidates any paging-structure-cache entries, i.e. TLB entries for + * intermediate paging structures, that may be zapped, as such entries + * are associated with the ASID on both VMX and SVM. + */ + (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared); call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback); } -- 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9360AC433F5 for ; Thu, 23 Dec 2021 22:24:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350698AbhLWWYT (ORCPT ); Thu, 23 Dec 2021 17:24:19 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38690 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350465AbhLWWXz (ORCPT ); Thu, 23 Dec 2021 17:23:55 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3D608C06175B for ; Thu, 23 Dec 2021 14:23:55 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id
l8-20020a17090b078800b001b1ea649932so4205624pjz.7 for ; Thu, 23 Dec 2021 14:23:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=mXdSrY6OMtNZUL6SBm5FhuqN7QxidhmmvtoR0HoEp0w=; b=eF2q4Zt9YAbIih9a3n0/rfpvUWpw8rDGnz16pJJ3XtJpmUbV2YOJ2F5YZP+91TIFad PuBpIMzR3JQ3MIqr9Nfd7EMKtpLkf/dLver+tZntBKf047SChjpKatVzV9G5UclCkFit UclKd8ehDVIT2k8w95gs3j3l4p54SoR9ACuoWJDqFmrTB0DcI5ntW4pySEDaW/SXtgEG rgqR/kmyG81kdaadahZ7ua/XmYXN9455w/QG+UaNKKOEornHEaTmUdsNqEeFTfAydVE8 O8MOsHIcFbz98H0kruDzkUffXKq8u0yJrATJIA7dDCI/GcUoE/J61phobhKhNIJvCOFY SwCA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=mXdSrY6OMtNZUL6SBm5FhuqN7QxidhmmvtoR0HoEp0w=; b=b/jXoAR2WW41BzKcgbCsOCGNYODyyifdx5akW22/LPwYQkVdXnirACed49VgSqzSSA CMV2SlfDrhSwOFPelxFdqFwXqswSb3WgVQ276gmGWk97TRjwWLU5LsyqiVDZb+uVpPrj 3+GcJP4a1qnueIuLYNsKt+7xHkYrtCCDi+SwMh4FpfstIKV79toBB7hZEcO9eJ6EE4eh Pb3EanmJQUz4ZbvvkZ+6tsMXLjRVRQGbxcDydztPLaHgmHrsa10FUstOjVWPtWKV6iLd Z2mBlEyxj61xl2RRYtnEFKas5S/NNp+f8AOV5nbvYdaZaM2gDrA3MS/VF0cQQG9g9uRH Up0Q== X-Gm-Message-State: AOAM531QVRjGGw+Q9gxgQ7fx0hkMKELAK56yC0JBJJa6rGuxTu2bJzOw r1MZd0YTTubY6YDsSmO6IEmQ0rau6dw= X-Google-Smtp-Source: ABdhPJxDEcHm77MbBJGQXPXjeuSUz5e9A1qqA8iTW9llI8YtYRxrojoUl4/366Zjo9YyB/JNzSp7+7Wx/6Q= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90b:2243:: with SMTP id hk3mr5061652pjb.72.1640298234740; Thu, 23 Dec 2021 14:23:54 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:56 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-9-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog 
Subject: [PATCH v2 08/30] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Remove the misleading flush "handling" when zapping invalidated TDP MMU roots, and document that flushing is unnecessary for all flavors of MMUs when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU is dead code, as zap_gfn_range() is called with shared=true, in which case it will never return true due to the flushing being handled by tdp_mmu_zap_spte_atomic(). No functional change intended. Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 10 +++++++--- arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++++----- 2 files changed, 17 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6549c13e89d9..f660906c8230 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5645,9 +5645,13 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm) } /* - * Trigger a remote TLB flush before freeing the page tables to ensure - * KVM is not in the middle of a lockless shadow page table walk, which - * may reference the pages. + * Kick all vCPUs (via remote TLB flush) before freeing the page tables + * to ensure KVM is not in the middle of a lockless shadow page table + * walk, which may reference the pages. The remote TLB flush itself is + * not required and is simply a convenient way to kick vCPUs as needed. + * KVM performs a local TLB flush when allocating a new root (see + * kvm_mmu_load()), and the reload in the caller ensures no vCPUs are + * running with an obsolete MMU.
+ */ kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 66b75c197c94..87785dce1bd4 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -844,12 +844,20 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm) void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) { struct kvm_mmu_page *root; - bool flush = false; lockdep_assert_held_read(&kvm->mmu_lock); for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) { - flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true); + /* + * A TLB flush is unnecessary, invalidated roots are guaranteed + * to be unreachable by the guest (see kvm_tdp_mmu_put_root() + * for more details), and unlike the legacy MMU, no vCPU kick + * is needed to play nice with lockless shadow walks as the TDP + * MMU protects its paging structures via RCU. Note, zapping + * will still flush on yield, but that's a minor performance + * blip and not a functional issue. + */ + (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true); /* * Put the reference acquired in kvm_tdp_mmu_invalidate_roots().
@@ -857,9 +865,6 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) */ kvm_tdp_mmu_put_root(kvm, root, true); } - - if (flush) - kvm_flush_remote_tlbs(kvm); } =20 /* --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CD13DC433F5 for ; Thu, 23 Dec 2021 22:24:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350646AbhLWWYM (ORCPT ); Thu, 23 Dec 2021 17:24:12 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38694 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350490AbhLWWX4 (ORCPT ); Thu, 23 Dec 2021 17:23:56 -0500 Received: from mail-pf1-x44a.google.com (mail-pf1-x44a.google.com [IPv6:2607:f8b0:4864:20::44a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 99E1CC06175E for ; Thu, 23 Dec 2021 14:23:56 -0800 (PST) Received: by mail-pf1-x44a.google.com with SMTP id o67-20020a62cd46000000b004ba4d2f70b5so3987547pfg.16 for ; Thu, 23 Dec 2021 14:23:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=/tgTHkNO24YF7TEp6ddw/+GI148Bo3vvZKD6TVuZqKc=; b=c7szg7bIMyxnqWFnynUtnxGaauqoLh+3Dc6S1L+RSwCz5ENGT7B4D4aa+SQspWrX2a rHOJUCadILTnbnCczP3Ea15+LvPcNxa2iUJ3txp8CZI1YYVCU0I3LBaZl5B3sJeJVFCY l9pEFLKvoin2yHbgh4cLp/sdn/U/PTDabM53rofzKy8aCWs1X7tL0n66xEpp3sSnPSvi cxasmMkp6068upgJ32Xov6BFv5PPEzwQvMXTzmZGlrLBbPBezDxBvKSA4iI7E1sQ/iOf xZe+LLSR6XNgTD7iHa+PLkj6Ycy9Khq5RnX6A6BFDnJg5xmHMGAFh1bkmP9UJ5pg2lbD XubA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; 
bh=/tgTHkNO24YF7TEp6ddw/+GI148Bo3vvZKD6TVuZqKc=; b=UO3EXDi2Lg6w0hdf33xsTfwyBFC5QlYUAoAti+/bkDajSRsjk1KmdKNOkJuL7SzmiI iNtfPIzDC+cHGWHnXPFbogxaKZAmusE9LpglEDJW9S0aGh8rnDNNDQoLVb/NKvCQFuRU Xg68ISYSSQ9V6Dow90ZTSzVLT4F5pSmXo5MG8t/rjIcFoZF2le0RmssQ7O5wL+dyWkiy ldHYZx6pqRS1xyKGSQ6ViRNnbuzZIjOkUs9Hs7YR6AvM1y283qoPhwOgigtsPLf6E6KN H6t4eXZR1FxpbCDG0AGuLTurs1ANI+v0BfF3WeMQYbIzQrIllU3Pudem0kxilJHAYWLA 840A== X-Gm-Message-State: AOAM532djiFEnGENuzJ9FeOMl5JNcS/NZOXXnPLO6Fojrw8rTKUDvj/A zzg3Foz4fDnGMukP4rZxy7r3vKnSoHE= X-Google-Smtp-Source: ABdhPJx61pNg8/a4KRQPgnCci3Pfp3tZaa8BoFTdRQXMVKG9zPf8CcCKzCdQLs/52ih+UKgSbOq2mqkJ57U= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:902:aa89:b0:148:a2e7:fb69 with SMTP id d9-20020a170902aa8900b00148a2e7fb69mr4114580plr.170.1640298236092; Thu, 23 Dec 2021 14:23:56 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:57 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-10-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 09/30] KVM: x86/mmu: Drop unused @kvm param from kvm_tdp_mmu_get_root() From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Drop the unused @kvm param from kvm_tdp_mmu_get_root(). No functional change intended. 
Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 4 ++-- arch/x86/kvm/mmu/tdp_mmu.h | 3 +-- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 87785dce1bd4..cd093fa73d14 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -146,7 +146,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, while (next_root) { if ((type == ALL_ROOTS || (type == !!next_root->role.invalid)) && - kvm_tdp_mmu_get_root(kvm, next_root)) + kvm_tdp_mmu_get_root(next_root)) break; next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots, @@ -243,7 +243,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu) */ for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) { if (root->role.word == role.word && - kvm_tdp_mmu_get_root(kvm, root)) + kvm_tdp_mmu_get_root(root)) goto out; } diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 08c917511fed..6b9bdd652bca 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -7,8 +7,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu); -__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm, - struct kvm_mmu_page *root) +__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root) { return refcount_inc_not_zero(&root->tdp_mmu_root_count); } -- 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C8ED2C433F5 for ; Thu, 23 Dec 2021 22:24:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350468AbhLWWYG (ORCPT ); Thu, 23 Dec 2021 17:24:06 -0500 Received: from lindbergh.monkeyblade.net
([23.128.96.19]:38690 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350502AbhLWWX6 (ORCPT ); Thu, 23 Dec 2021 17:23:58 -0500 Received: from mail-pg1-x54a.google.com (mail-pg1-x54a.google.com [IPv6:2607:f8b0:4864:20::54a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F418EC06175D for ; Thu, 23 Dec 2021 14:23:57 -0800 (PST) Received: by mail-pg1-x54a.google.com with SMTP id i12-20020a63584c000000b00330ec6e2c37so3895880pgm.7 for ; Thu, 23 Dec 2021 14:23:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=sXTDU9i0d/5c6YP+iUCNiapzmeptsQFxGufHlkniv+A=; b=hWwrUNLyMCsLs7esrSH4Uv/cisnjMfrQ7QbOy6jMZvgC+LG0MpjKjc/n5GFGqRahrn 7P8RbCZxBEPh1H1RNawS1jOUYK5xLazCvQanCUMeoZ8LW/7DpM2Nm0riBepm+oiN+8bJ ymbXY+PgFD2+eTASw0SfB67DCDtpiexaMQpD/o1xGLF4e7OysdebvZxrsgBfrlUkrJWs SthaSgIqFIHpJDRn1QX3qRhgCD9OdWmDJRodMAsyq4KtFkOZ2IYWxsGrwknnjVr4pnF3 xPfKHm4J4QjcIAzzRitZFD5dp77CpO8MsRpaspmg3geiIOyNJeHSWjqch31+F7XhVes0 bfMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=sXTDU9i0d/5c6YP+iUCNiapzmeptsQFxGufHlkniv+A=; b=QgJtnoUE0fLr9HtHJgwYH1KX4vona0GHaKn61YXZX4R/oRZLLHdXGmcFERi/6WVEvm bjDKopRBIWYUAfiqMVkbM+0/cD3Nl5pgwmMGmttLZSeRfzymsjeoIsd0DQlLeVKRr97r 3tNkwakbp0cfz83TQzvuo28JRG/PlGQz7pP16iL80jewT7UJTznA12xQ/APYeypupR0M Rw8F1L8/c/TiEpzsXhRGrEgbv9pmr6jX6ORDcp+3M/TroTegCM4PHw3OE/tl8MF5KV+6 BlyBdLItli6jtFfv88osMBmYAXv0W9UFkEQZ1ZQZogC9KyV116tDrCGNGm0lISL14of8 eemQ== X-Gm-Message-State: AOAM532BAW3vszQIBa3l0ioznPHMNtf5asX1e8wtE3y2uGMxOOwaHz3N 9QaPO/BnxhpqLHOLdO2+BcOmhhM03qo= X-Google-Smtp-Source: ABdhPJwdb+wvVM4QODThED5gMK8PQcyWym893zTs+D9axal7tJfzv8BBpVNjp58CiBTP6yFsSmfFgqAfQbM= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc 
job=sendgmr) by 2002:a05:6a00:228b:b0:4bb:1111:65cb with SMTP id f11-20020a056a00228b00b004bb111165cbmr4157476pfe.56.1640298237485; Thu, 23 Dec 2021 14:23:57 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:22:58 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-11-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 10/30] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Assert that mmu_lock is held for write by users of the yield-unfriendly TDP iterator. The nature of a shared walk means that the caller needs to play nice with other tasks modifying the page tables, which is more or less the same thing as playing nice with yielding. Theoretically, KVM could gain a flow where it could legitimately take mmu_lock for read in a non-preemptible context, but that's highly unlikely and any such case should be viewed with a fair amount of scrutiny. Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index cd093fa73d14..3b13249bbbe1 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -29,13 +29,16 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm) return true; } -static __always_inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm, +/* Arbitrarily returns true so that this may be used in if statements.
*/ +static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm, bool shared) { if (shared) lockdep_assert_held_read(&kvm->mmu_lock); else lockdep_assert_held_write(&kvm->mmu_lock); + + return true; } void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) @@ -187,11 +190,17 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, #define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, ALL_ROOTS) -#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ - list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \ - lockdep_is_held_type(&kvm->mmu_lock, 0) || \ - lockdep_is_held(&kvm->arch.tdp_mmu_pages_lock)) \ - if (kvm_mmu_page_as_id(_root) != _as_id) { \ +/* + * Iterate over all TDP MMU roots. Requires that mmu_lock be held for write, + * the implication being that any flow that holds mmu_lock for read is + * inherently yield-friendly and should use the yield-safe variant above. + * Holding mmu_lock for write obviates the need for RCU protection as the list + * is guaranteed to be stable.
+ */
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
+	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)	\
+		if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) &&	\
+		    kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
 static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
-- 
2.34.1.448.ga2b2bfdf31-goog

From nobody Wed Jul 1 09:55:47 2026
Date: Thu, 23 Dec 2021 22:22:59 +0000
In-Reply-To: <20211223222318.1039223-1-seanjc@google.com>
Message-Id: <20211223222318.1039223-12-seanjc@google.com>
References: <20211223222318.1039223-1-seanjc@google.com>
Subject: [PATCH v2 11/30] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon, David Matlack, Mingwei Zhang

Look for a !leaf=>leaf conversion instead of a PFN change when checking
if a SPTE change removed a TDP MMU shadow page.  Convert the PFN check
into a WARN, as KVM should never change the PFN of a shadow page (except
when it's being zapped or replaced).  From a purely theoretical
perspective, it's not illegal to replace a SP with a hugepage pointing at
the same PFN.  In practice, it's impossible as that would require mapping
guest memory overtop a kernel-allocated SP.  Either way, the check is odd.

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3b13249bbbe1..05f35541ff2f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -518,9 +518,12 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
-	 * the paging structure.
+	 * the paging structure.  Note the WARN on the PFN changing without the
+	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
+	 * pages are kernel allocations and should never be migrated.
 	 */
-	if (was_present && !was_leaf && (pfn_changed || !is_present))
+	if (was_present && !was_leaf &&
+	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_tdp_mmu_page(kvm,
 				spte_to_child_pt(old_spte, level), shared);
 }
-- 
2.34.1.448.ga2b2bfdf31-goog

From nobody Wed Jul 1 09:55:47 2026
Date: Thu, 23 Dec 2021 22:23:00 +0000
In-Reply-To: <20211223222318.1039223-1-seanjc@google.com>
Message-Id: <20211223222318.1039223-13-seanjc@google.com>
References: <20211223222318.1039223-1-seanjc@google.com>
Subject: [PATCH v2 12/30] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon, David Matlack, Mingwei Zhang

Batch TLB flushes (with other MMUs) when handling ->change_spte()
notifications in the TDP MMU.  The MMU notifier path in question doesn't
allow yielding and correctly flushes before dropping mmu_lock.

Signed-off-by: Sean Christopherson
Reviewed-by: Ben Gardon
---
 arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 05f35541ff2f..6c51548d89b1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1174,13 +1174,12 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
  */
 bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
-
-	/* FIXME: return 'flush' instead of flushing here. */
-	if (flush)
-		kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
-
-	return false;
+	/*
+	 * No need to handle the remote TLB flush under RCU protection, the
+	 * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
+	 * shadow page.  See the WARN on pfn_changed in __handle_changed_spte().
+	 */
+	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
 }
 
 /*
-- 
2.34.1.448.ga2b2bfdf31-goog

From nobody Wed Jul 1 09:55:47 2026
Date: Thu, 23 Dec 2021 22:23:01 +0000
In-Reply-To: <20211223222318.1039223-1-seanjc@google.com>
Message-Id: <20211223222318.1039223-14-seanjc@google.com>
References: <20211223222318.1039223-1-seanjc@google.com>
Subject: [PATCH v2 13/30] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon, David Matlack, Mingwei Zhang

Drop RCU protection after processing each root when handling MMU notifier
hooks that aren't the "unmap" path, i.e. aren't zapping.  Temporarily
drop RCU to let RCU do its thing between roots, and to make it clear that
there's no special behavior that relies on holding RCU across all roots.

Currently, the RCU protection is completely superficial; it's necessary
only to make rcu_dereference() of SPTE pointers happy.  A future patch
will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
rather, acknowledge the need for a flush), but in that case RCU needs to
be held until the flush is complete if and only if the flush is needed
because a shadow page may have been removed.  And except for the "unmap"
path, MMU notifier events cannot remove SPs (they don't toggle the
PRESENT bit and can't change the PFN of a SP).

Signed-off-by: Sean Christopherson
Reviewed-by: Ben Gardon
---
 arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c51548d89b1..47424e22a681 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1071,18 +1071,19 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 	struct tdp_iter iter;
 	bool ret = false;
 
-	rcu_read_lock();
-
 	/*
 	 * Don't support rescheduling, none of the MMU notifiers that funnel
 	 * into this helper allow blocking; it'd be dead, wasteful code.
 	 */
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		rcu_read_lock();
+
 		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
 			ret |= handler(kvm, &iter, range);
+
+		rcu_read_unlock();
 	}
 
-	rcu_read_unlock();
 
 	return ret;
 }
-- 
2.34.1.448.ga2b2bfdf31-goog

From nobody Wed Jul 1 09:55:47 2026
Date: Thu, 23 Dec 2021 22:23:02 +0000
In-Reply-To: <20211223222318.1039223-1-seanjc@google.com>
Message-Id: <20211223222318.1039223-15-seanjc@google.com>
References: <20211223222318.1039223-1-seanjc@google.com>
Subject: [PATCH v2 14/30] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon, David Matlack, Mingwei Zhang

Add helpers to read and write TDP MMU SPTEs instead of open coding
rcu_dereference() all over the place, and to provide a convenient
location to document why KVM doesn't exempt holding mmu_lock for write
from having to hold RCU (and any future changes to the rules).

No functional change intended.

Signed-off-by: Sean Christopherson
Reviewed-by: Ben Gardon
---
 arch/x86/kvm/mmu/tdp_iter.c |  6 +++---
 arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c  | 14 +++++++-------
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index caa96c270b95..de31f3e68668 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
 {
 	iter->sptep = iter->pt_path[iter->level - 1] +
 		SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
-	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
 }
 
 static gfn_t round_gfn_for_level(gfn_t gfn, int level)
@@ -87,7 +87,7 @@ static bool try_step_down(struct tdp_iter *iter)
 	 * Reread the SPTE before stepping down to avoid traversing into page
 	 * tables that are no longer linked from this entry.
 	 */
-	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
 
 	child_pt = spte_to_child_pt(iter->old_spte, iter->level);
 	if (!child_pt)
@@ -121,7 +121,7 @@ static bool try_step_side(struct tdp_iter *iter)
 	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
 	iter->next_last_level_gfn = iter->gfn;
 	iter->sptep++;
-	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
 
 	return true;
 }
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index e19cabbcb65c..3cdfaf391a49 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -9,6 +9,22 @@
 
 typedef u64 __rcu *tdp_ptep_t;
 
+/*
+ * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
+ * to be zapped while holding mmu_lock for read.  Holding RCU isn't required for
+ * correctness if mmu_lock is held for write, but plumbing "struct kvm" down to
+ * the lower depths of the TDP MMU just to make lockdep happy is a nightmare,
+ * so all accesses to SPTEs must be done under RCU protection.
+ */
+static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
+{
+	return READ_ONCE(*rcu_dereference(sptep));
+}
+static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
+{
+	WRITE_ONCE(*rcu_dereference(sptep), val);
+}
+
 /*
  * A TDP iterator performs a pre-order walk over a TDP paging structure.
  */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 47424e22a681..41c3a1cff3e7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -603,7 +603,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * here since the SPTE is going from non-present
 	 * to non-present.
 	 */
-	WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+	kvm_tdp_mmu_write_spte(iter->sptep, 0);
 
 	return true;
 }
@@ -642,7 +642,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	 */
 	WARN_ON(is_removed_spte(iter->old_spte));
 
-	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+	kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
 
 	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
 			      new_spte, iter->level, false);
@@ -807,7 +807,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			 * The iter must explicitly re-read the SPTE because
 			 * the atomic cmpxchg failed.
 			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
 			goto retry;
 		}
 	}
@@ -1011,7 +1011,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			 * because the new value informs the !present
 			 * path below.
 			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
@@ -1217,7 +1217,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			 * The iter must explicitly re-read the SPTE because
 			 * the atomic cmpxchg failed.
 			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
 			goto retry;
 		}
 		spte_set = true;
@@ -1288,7 +1288,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			 * The iter must explicitly re-read the SPTE because
 			 * the atomic cmpxchg failed.
 			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
 			goto retry;
 		}
 		spte_set = true;
@@ -1419,7 +1419,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			 * The iter must explicitly re-read the SPTE because
 			 * the atomic cmpxchg failed.
 			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
 			goto retry;
 		}
 	}
-- 
2.34.1.448.ga2b2bfdf31-goog

From nobody Wed Jul 1 09:55:47 2026
Date: Thu, 23 Dec 2021 22:23:03 +0000
In-Reply-To: <20211223222318.1039223-1-seanjc@google.com>
Message-Id: <20211223222318.1039223-16-seanjc@google.com>
References: <20211223222318.1039223-1-seanjc@google.com>
Subject: [PATCH v2 15/30] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon, David Matlack, Mingwei Zhang

WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
which is called out by the comment as being disallowed but not actually
checked.  Keep the WARN on the old_spte as well, because overwriting a
REMOVED_SPTE in the non-atomic path is also disallowed (as evidenced by
the lack of splats with the existing WARN).

Fixes: 08f07c800e9d ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
Cc: Ben Gardon
Signed-off-by: Sean Christopherson
Reviewed-by: Ben Gardon
---
 arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 41c3a1cff3e7..e2d217cbeca3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -634,13 +634,13 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	/*
-	 * No thread should be using this function to set SPTEs to the
+	 * No thread should be using this function to set SPTEs to or from the
 	 * temporary removed SPTE value.
 	 * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
 	 * should be used. If operating under the MMU lock in write mode, the
 	 * use of the removed SPTE should not be necessary.
 	 */
-	WARN_ON(is_removed_spte(iter->old_spte));
+	WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
 
 	kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
 
-- 
2.34.1.448.ga2b2bfdf31-goog

From nobody Wed Jul 1 09:55:47 2026
Date: Thu, 23 Dec 2021 22:23:04 +0000
In-Reply-To: <20211223222318.1039223-1-seanjc@google.com>
Message-Id: <20211223222318.1039223-17-seanjc@google.com>
References: <20211223222318.1039223-1-seanjc@google.com>
Subject: [PATCH v2 16/30] KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw vals
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon, David Matlack, Mingwei Zhang

Refactor __tdp_mmu_set_spte() to work with raw values instead of a
tdp_iter object so that a future patch can modify SPTEs without doing a
walk, and without having to synthesize a tdp_iter.

No functional change intended.

Signed-off-by: Sean Christopherson
Reviewed-by: Ben Gardon
---
 arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e2d217cbeca3..61596b4a8121 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -611,9 +611,13 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 
 /*
  * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
- * @kvm: kvm instance
- * @iter: a tdp_iter instance currently on the SPTE that should be set
- * @new_spte: The value the SPTE should be set to
+ * @kvm: KVM instance
+ * @as_id: Address space ID, i.e. regular vs. SMM
+ * @sptep: Pointer to the SPTE
+ * @old_spte: The current value of the SPTE
+ * @new_spte: The new value that will be set for the SPTE
+ * @gfn: The base GFN that was (or will be) mapped by the SPTE
+ * @level: The level _containing_ the SPTE (its parent PT's level)
  * @record_acc_track: Notify the MM subsystem of changes to the accessed state
  *		      of the page. Should be set unless handling an MMU
  *		      notifier for access tracking. Leaving record_acc_track
@@ -625,12 +629,10 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
  *		      Leaving record_dirty_log unset in that case prevents page
  *		      writes from being double counted.
  */
-static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
-				      u64 new_spte, bool record_acc_track,
-				      bool record_dirty_log)
+static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
+			       u64 old_spte, u64 new_spte, gfn_t gfn, int level,
+			       bool record_acc_track, bool record_dirty_log)
 {
-	WARN_ON_ONCE(iter->yielded);
-
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	/*
@@ -640,39 +642,48 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	 * should be used. If operating under the MMU lock in write mode, the
 	 * use of the removed SPTE should not be necessary.
 	 */
-	WARN_ON(is_removed_spte(iter->old_spte) || is_removed_spte(new_spte));
+	WARN_ON(is_removed_spte(old_spte) || is_removed_spte(new_spte));
 
-	kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
+	kvm_tdp_mmu_write_spte(sptep, new_spte);
+
+	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
 
-	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			      new_spte, iter->level, false);
 	if (record_acc_track)
-		handle_changed_spte_acc_track(iter->old_spte, new_spte,
-					      iter->level);
+		handle_changed_spte_acc_track(old_spte, new_spte, level);
 	if (record_dirty_log)
-		handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
-					      iter->old_spte, new_spte,
-					      iter->level);
+		handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
+					      new_spte, level);
+}
+
+static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+				     u64 new_spte, bool record_acc_track,
+				     bool record_dirty_log)
+{
+	WARN_ON_ONCE(iter->yielded);
+
+	__tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, iter->old_spte,
+			   new_spte, iter->gfn, iter->level,
+			   record_acc_track, record_dirty_log);
 }
 
 static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 				    u64 new_spte)
 {
-	__tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
+	_tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
 }
 
 static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
 						 struct tdp_iter *iter,
 						 u64 new_spte)
 {
-	__tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
+	_tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
 }
 
 static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 						 struct tdp_iter *iter,
 						 u64 new_spte)
 {
-	__tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
+	_tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
 }
 
 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
-- 
2.34.1.448.ga2b2bfdf31-goog

From nobody Wed Jul 1 09:55:47
2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3AFFEC433EF for ; Thu, 23 Dec 2021 22:25:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350710AbhLWWZx (ORCPT ); Thu, 23 Dec 2021 17:25:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38864 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350531AbhLWWYZ (ORCPT ); Thu, 23 Dec 2021 17:24:25 -0500 Received: from mail-pg1-x54a.google.com (mail-pg1-x54a.google.com [IPv6:2607:f8b0:4864:20::54a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4A47AC06139A for ; Thu, 23 Dec 2021 14:24:09 -0800 (PST) Received: by mail-pg1-x54a.google.com with SMTP id r4-20020a654984000000b0033ae6493472so3899153pgs.1 for ; Thu, 23 Dec 2021 14:24:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=0Ar+ik+Ixb7U4ZRrfqeBX6o/I8VZg+05g6JXxnNr4p4=; b=IJcs/ufo8XgwOG4jI9SZNlx2jrySMJYLnDwA/R8TIzelevXdK9hGKU3Wwvvo/IOCcT RvEtBugx65J+sCoFGlfntiS6TVXrI8QSeIYmWaKhaR1a4KR/z8H1N+S9KlEDbHhl8s2X +IGa1+0xtTlkg8ktrycfXGjU0CY7lG/zzRT42safJcYlOW2PfGR4HWGDkvNQ4rk0De/e eW85D1R0v+gkHObhkVwnFsPqO6kUeasl+yxTjEPKNAtMjTkklDUegArPRP9wse2gojKq hhRgcJSUKlqoTE4mpnprqT0Tb812N/cStq2TabVQmyWHLaCCL7EJGan+CcoXOPrJGGQp UdoQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=0Ar+ik+Ixb7U4ZRrfqeBX6o/I8VZg+05g6JXxnNr4p4=; b=eH+wf7J+ZgXdAGJ31PMCnG/l4qzABhmiw8X3+WMyWjKGGBoVun1Xh6uvx8XUp8S6cN uEGWQR22tZE0WwY48+tw82KLEifB2hjQWsyqrPNH9e3dyvoqaXyhLooW32jdM1RAFFXL 
hIwzWvCzyg/wikCfmQ602419njxW+W7OdFlWGD/OLUgRQY2qB6y7PWqPXbmdaZLWVCqA fkWVf98QNAuXChOQlTNs/v3+Um2AzIzJEd2OIgcvKJWRpMOcn8AZSCTdLsAM9fG2ttNz vyEn3QTRhkHIwhuc0D0LIyNZ8VofPlO4rGJH/8zrRFIzoJcgWZ2B7esD5DJ11QxdUKqf lbzg== X-Gm-Message-State: AOAM531g/P32kxHsLd7gOLlEH4FMCyuc6r67YMop2tbrYnXqjqhFLGr+ 8wCekZ4onB3VZ8q8aeuLHeEspNbdRlo= X-Google-Smtp-Source: ABdhPJyTYQ0iYmk4z5GT1h5ERJ9qs0AOt2sOEIubClOUu6HpStBFmEirYwwyHs0Xg2PnmsEJU3ouv1Q1upg= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a62:2546:0:b0:4ba:ef4e:c43f with SMTP id l67-20020a622546000000b004baef4ec43fmr4152523pfl.57.1640298248807; Thu, 23 Dec 2021 14:24:08 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:05 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-18-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 17/30] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When recovering a potential hugepage that was shattered for the iTLB multihit workaround, precisely zap only the target page instead of iterating over the TDP MMU to find the SP that was passed in. This will allow future simplification of zap_gfn_range() by having it zap only leaf SPTEs. 
Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu_internal.h | 7 ++++++- arch/x86/kvm/mmu/tdp_iter.h | 2 -- arch/x86/kvm/mmu/tdp_mmu.c | 28 ++++++++++++++++++++++++++-- arch/x86/kvm/mmu/tdp_mmu.h | 18 +----------------- 4 files changed, 33 insertions(+), 22 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_interna= l.h index da6166b5c377..be063b6c91b7 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -30,6 +30,8 @@ extern bool dbg; #define INVALID_PAE_ROOT 0 #define IS_VALID_PAE_ROOT(x) (!!(x)) =20 +typedef u64 __rcu *tdp_ptep_t; + struct kvm_mmu_page { /* * Note, "link" through "spt" fit in a single 64 byte cache line on @@ -59,7 +61,10 @@ struct kvm_mmu_page { refcount_t tdp_mmu_root_count; }; unsigned int unsync_children; - struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ + union { + struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ + tdp_ptep_t ptep; + }; DECLARE_BITMAP(unsync_child_bitmap, 512); =20 struct list_head lpage_disallowed_link; diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index 3cdfaf391a49..1de6c1c9ff7b 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -7,8 +7,6 @@ =20 #include "mmu.h" =20 -typedef u64 __rcu *tdp_ptep_t; - /* * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SP= TEs) * to be zapped while holding mmu_lock for read. Holding RCU isn't requir= ed for diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 61596b4a8121..d23c2d42ad60 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -305,12 +305,16 @@ static void handle_changed_spte_dirty_log(struct kvm = *kvm, int as_id, gfn_t gfn, * * @kvm: kvm instance * @sp: the new page + * @sptep: pointer to the new page's SPTE (in its parent) * @account_nx: This page replaces a NX large page and should be marked for * eventual reclaim. 
*/ static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp, - bool account_nx) + tdp_ptep_t sptep, bool account_nx) { + WARN_ON_ONCE(sp->ptep); + sp->ptep =3D sptep; + spin_lock(&kvm->arch.tdp_mmu_pages_lock); list_add(&sp->link, &kvm->arch.tdp_mmu_pages); if (account_nx) @@ -745,6 +749,26 @@ static inline bool __must_check tdp_mmu_iter_cond_resc= hed(struct kvm *kvm, return iter->yielded; } =20 +bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + u64 old_spte; + + rcu_read_lock(); + + old_spte =3D kvm_tdp_mmu_read_spte(sp->ptep); + if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) { + rcu_read_unlock(); + return false; + } + + __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0, + sp->gfn, sp->role.level + 1, true, true); + + rcu_read_unlock(); + + return true; +} + /* * Tears down the mappings for the range of gfns, [start, end), and frees = the * non-root pages mapping GFNs strictly within that range. Returns true if @@ -1041,7 +1065,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm= _page_fault *fault) !shadow_accessed_mask); =20 if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) { - tdp_mmu_link_page(vcpu->kvm, sp, + tdp_mmu_link_page(vcpu->kvm, sp, iter.sptep, fault->huge_page_disallowed && fault->req_level >=3D iter.level); =20 diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 6b9bdd652bca..ccb12f1914ba 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -22,24 +22,8 @@ static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm = *kvm, int as_id, { return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush); } -static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page= *sp) -{ - gfn_t end =3D sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level + 1); - - /* - * Don't allow yielding, as the caller may have a flush pending. 
Note, - * if mmu_lock is held for write, zapping will never yield in this case, - * but explicitly disallow it for safety. The TDP MMU does not yield - * until it has made forward progress (steps sideways), and when zapping - * a single shadow page that it's guaranteed to see (thus the mmu_lock - * requirement), its "step sideways" will always step beyond the bounds - * of the shadow page's gfn range and stop iterating before yielding. - */ - lockdep_assert_held_write(&kvm->mmu_lock); - return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp), - sp->gfn, end, false, false); -} =20 +bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm); --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CA184C433FE for ; Thu, 23 Dec 2021 22:25:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240971AbhLWWZ4 (ORCPT ); Thu, 23 Dec 2021 17:25:56 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38758 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350728AbhLWWY0 (ORCPT ); Thu, 23 Dec 2021 17:24:26 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5B201C0613A1 for ; Thu, 23 Dec 2021 14:24:11 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id i203-20020a2554d4000000b0060a529902b3so3024381ybb.21 for ; Thu, 23 Dec 2021 14:24:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; 
h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=hi15fkGhTPSKWg6xnBWyGTdHInFiOYLMR2QZxH+BCyM=; b=tV8xL3tDgDWPtfKAnLLJ9aMprxvrR8pzIXGPXy65phS75CA+Zasdrj0BFZdR+hrAmp 6qrsRJJoKEHZih5zMkJ8pMmdZtylmwvntAWefZogtB7YNZ00Aq5RT5hD/LB8LqFObeIU clcsY749fZf2Az67uWjoklCNEoXdL7WAO3pRiXSHe1L9kWTDz+ggbRbzHt4ipBdioI4G a7yJS+nlQYOo+Gvbmn3ZEYoLnwr0xB1XZ8VlNlcNuggHArJZy0j2V3xKhYFIGZbNaPkK g0sppACPwYYiVxnBwDuymuIbszW3wz7f61hooKdRHLCcYFWgaoaLWv/bw8aLkvbPT/eF 4W5Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=hi15fkGhTPSKWg6xnBWyGTdHInFiOYLMR2QZxH+BCyM=; b=x5KMKJ9+g11arzTInuw1vwOP7gt6xFUEgcJieILy+5WZ9VMYfU2I32/eEow3AIUlEa HPorE3sAel+qH7r3kfBcMnF92jvLiDppsRF+cTRQB+8W8Thrz5wTwF320ZaSUbN1ZFVI GHCWac2GARNniwHL5p2DB89sPjpmXRsS+87uzO4Qhld/TpNaMhl4vJ4CZN63/NB5n407 wlkY2/aCQj3EqIYh22wF1MPY+YkxyHhTjcz1XUHwMAuC8qiTcGs5fhDN1ZlBHst5pl8Y ARYSIUuytNno099r6zD6mq/18wocXyVKKXfIaD8IrIYqqFV5mMl2Sg11hkMUVh00VRrv lCvA== X-Gm-Message-State: AOAM530ssoZ4iwxH5B1/mF345AVxKU93UysyjkKhg/VJKdGqaomEuHVu pg4fmVlEBWPi8w0EzWpahGOWGSxQqb0= X-Google-Smtp-Source: ABdhPJz7aPgmvGfRlYLct3l0udDKcXntDEZJSyDbK0UMyC+yQCMJ/vgjWw6/jUNmPpbrxEbhTjlkQc0soPY= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a25:d4c:: with SMTP id 73mr5728847ybn.74.1640298250584; Thu, 23 Dec 2021 14:24:10 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:06 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-19-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 18/30] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng 
Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM uses the slow version of "zap everything" is when the VM is being destroyed or the owning mm has exited. In either case, KVM_RUN is unreachable for the VM, i.e. the guest TLB entries cannot be consumed. Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index d23c2d42ad60..e9232ef2194e 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -871,14 +871,15 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int= as_id, gfn_t start, =20 void kvm_tdp_mmu_zap_all(struct kvm *kvm) { - bool flush =3D false; int i; =20 + /* + * A TLB flush is unnecessary, KVM zaps everything if and only if the VM + * is being destroyed or the userspace VMM has exited. In both cases, + * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request. 
+ */ for (i =3D 0; i < KVM_ADDRESS_SPACE_NUM; i++) - flush =3D kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush); - - if (flush) - kvm_flush_remote_tlbs(kvm); + (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false); } =20 /* --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 781EBC433EF for ; Thu, 23 Dec 2021 22:26:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350749AbhLWWZ7 (ORCPT ); Thu, 23 Dec 2021 17:25:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38878 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350744AbhLWWY1 (ORCPT ); Thu, 23 Dec 2021 17:24:27 -0500 Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com [IPv6:2607:f8b0:4864:20::1049]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B014DC061757 for ; Thu, 23 Dec 2021 14:24:12 -0800 (PST) Received: by mail-pj1-x1049.google.com with SMTP id t7-20020a17090a5d8700b001a7604b85f5so4024338pji.8 for ; Thu, 23 Dec 2021 14:24:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=j7zjELqiKcyIXQzSS76hhwYO0SvmRw+6i72sM7ruJEc=; b=ZoC766AfGINLUGXjt38ST4kBuMMelcysWS4PXgZoAKD7GlcQtoqoLW8wlfUmGIxZg8 oRJwJzg0E+xIpxaWXFufdxBDNU/IfIHwyDQsNTWUTJacNadE7cLUTEXZqvlcnECUZJf7 NYl6NypBIB9uSh1e6QAW/ZAqOWkjvCM9Ilj2CAb4i3D2ak9I1PECGwstFf+PkaMf9mvv ca61Sx8c2gaIR6tkuY3SSBwQHGqYrDPABAuIERrpD25oxvfwg+4uFLext+lBOQQtetyd 3qGb6zN3HE+z/n0IWvCJt5NGe831LVxKlfpWF0+BdfZ0pIH0KMnz3cOQvCOT2rq9bXnI WtDQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id 
:mime-version:references:subject:from:to:cc; bh=j7zjELqiKcyIXQzSS76hhwYO0SvmRw+6i72sM7ruJEc=; b=pYzjj0HGjEx7KGq4GBMabed7uStfLTRbrJaKHxihHlLhgdLKXTqC0vE+yBJEQpeoeI qt0lbdjG6JPv3/CQVqqKQ9mIHrpt8HhWxmPpRx5crkwQNF1+recsheZtd29vzS3x+g+U Lr7/CcEgv0q9aJALmWKY5csJLBo3+p5+4pMxkRsyrh05oIEVzM9lAYHntchgg4Hvh/Is KXkXRAiKVxdVBZ5GXbt7AJKys8fD981luX57hRu3kzP0rpsVGsi8mPKUduUIc6jmi45p yaKEmYZ+88Hed8Du5XoBAyBzxjLRz9eg1K6rpmtrm9YnrUjK57rZBlBUH04Y2It5Ad4d q8ZA== X-Gm-Message-State: AOAM5328CUGDwyFKcL9hOEDLrAyRzFeyU6GpIy4RotCLHmJ867TKaqDt HtR9GoyaptElnF3z1bwYmPKgI1FPhXY= X-Google-Smtp-Source: ABdhPJwpuVkvPCVR6sbOFqDkFoBIMP3ReGI+OXovB6sQfYb005odNoXif0w3zQ/6eO1W3gmhZSbp79VEUFU= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:902:82c1:b0:149:5a6d:c59 with SMTP id u1-20020a17090282c100b001495a6d0c59mr1585675plz.83.1640298252210; Thu, 23 Dec 2021 14:24:12 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:07 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-20-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 19/30] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add a dedicated helper for zapping a TDP MMU root, and use it in the three flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs are zapped (zapping an entire root is safe if and only if it cannot be in use by any vCPU). 
Because a TLB flush is never required, unconditionally pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding. Opportunistically document why KVM must not yield when zapping roots that are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has reached zero, and further harden the flow to detect improper KVM behavior with respect to roots that are supposed to be unreachable. In addition to hardening zapping of roots, isolating zapping of roots will allow future simplification of zap_gfn_range() by having it zap only leaf SPTEs, and by removing its tricky "zap all" heuristic. By having all paths that truly need to free _all_ SPs flow through the dedicated root zapper, the generic zapper can be freed of those concerns. Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 100 +++++++++++++++++++++++++++++++------ 1 file changed, 84 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index e9232ef2194e..c4bfd6aac999 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -56,10 +56,6 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) rcu_barrier(); } =20 -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, - gfn_t start, gfn_t end, bool can_yield, bool flush, - bool shared); - static void tdp_mmu_free_sp(struct kvm_mmu_page *sp) { free_page((unsigned long)sp->spt); @@ -82,6 +78,9 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head = *head) tdp_mmu_free_sp(sp); } =20 +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root, + bool shared); + void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, bool shared) { @@ -104,7 +103,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_m= mu_page *root, * intermediate paging structures, that may be zapped, as such entries * are associated with the ASID on both VMX and SVM. 
*/ - (void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared); + tdp_mmu_zap_root(kvm, root, shared); =20 call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback); } @@ -749,6 +748,78 @@ static inline bool __must_check tdp_mmu_iter_cond_resc= hed(struct kvm *kvm, return iter->yielded; } =20 +static inline gfn_t tdp_mmu_max_gfn_host(void) +{ + /* + * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that + * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF, + * and so KVM will never install a SPTE for such addresses. + */ + return 1ULL << (shadow_phys_bits - PAGE_SHIFT); +} + +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root, + bool shared) +{ + bool root_is_unreachable =3D !refcount_read(&root->tdp_mmu_root_count); + struct tdp_iter iter; + + gfn_t end =3D tdp_mmu_max_gfn_host(); + gfn_t start =3D 0; + + kvm_lockdep_assert_mmu_lock_held(kvm, shared); + + rcu_read_lock(); + + /* + * No need to try to step down in the iterator when zapping an entire + * root, zapping an upper-level SPTE will recurse on its children. + */ + for_each_tdp_pte_min_level(iter, root->spt, root->role.level, + root->role.level, start, end) { +retry: + /* + * Yielding isn't allowed when zapping an unreachable root as + * the root won't be processed by mmu_notifier callbacks. When + * handling an unmap/release mmu_notifier command, KVM must + * drop all references to relevant pages prior to completing + * the callback. Dropping mmu_lock can result in zapping SPTEs + * for an unreachable root after a relevant callback completes, + * which leads to use-after-free as zapping a SPTE triggers + * "writeback" of dirty/accessed bits to the SPTE's associated + * struct page. 
+ */ + if (!root_is_unreachable && + tdp_mmu_iter_cond_resched(kvm, &iter, false, shared)) + continue; + + if (!is_shadow_present_pte(iter.old_spte)) + continue; + + if (!shared) { + tdp_mmu_set_spte(kvm, &iter, 0); + } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) { + /* + * cmpxchg() shouldn't fail if the root is unreachable. + * Re-read the SPTE and retry so as not to leak the page + * and its children. + */ + WARN_ONCE(root_is_unreachable, + "Contended TDP MMU SPTE in unreachable root."); + iter.old_spte =3D kvm_tdp_mmu_read_spte(iter.sptep); + goto retry; + } + /* + * WARN if the root is invalid and is unreachable, all SPTEs + * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(), + * and inserting new SPTEs under an invalid root is a KVM bug. + */ + WARN_ON_ONCE(root_is_unreachable && root->role.invalid); + } + + rcu_read_unlock(); +} + bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) { u64 old_spte; @@ -790,8 +861,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_m= mu_page *root, gfn_t start, gfn_t end, bool can_yield, bool flush, bool shared) { - gfn_t max_gfn_host =3D 1ULL << (shadow_phys_bits - PAGE_SHIFT); - bool zap_all =3D (start =3D=3D 0 && end >=3D max_gfn_host); + bool zap_all =3D (start =3D=3D 0 && end >=3D tdp_mmu_max_gfn_host()); struct tdp_iter iter; =20 /* @@ -800,12 +870,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_= mmu_page *root, */ int min_level =3D zap_all ? root->role.level : PG_LEVEL_4K; =20 - /* - * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will - * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF, - * and so KVM will never install a SPTE for such addresses. 
- */ - end =3D min(end, max_gfn_host); + end =3D min(end, tdp_mmu_max_gfn_host()); =20 kvm_lockdep_assert_mmu_lock_held(kvm, shared); =20 @@ -871,6 +936,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int a= s_id, gfn_t start, =20 void kvm_tdp_mmu_zap_all(struct kvm *kvm) { + struct kvm_mmu_page *root; int i; =20 /* @@ -878,8 +944,10 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm) * is being destroyed or the userspace VMM has exited. In both cases, * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request. */ - for (i =3D 0; i < KVM_ADDRESS_SPACE_NUM; i++) - (void)kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, false); + for (i =3D 0; i < KVM_ADDRESS_SPACE_NUM; i++) { + for_each_tdp_mmu_root_yield_safe(kvm, root, i, false) + tdp_mmu_zap_root(kvm, root, false); + } } =20 /* @@ -905,7 +973,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) * will still flush on yield, but that's a minor performance * blip and not a functional issue. */ - (void)zap_gfn_range(kvm, root, 0, -1ull, true, false, true); + tdp_mmu_zap_root(kvm, root, true); =20 /* * Put the reference acquired in kvm_tdp_mmu_invalidate_roots(). 
--=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E98E5C433F5 for ; Thu, 23 Dec 2021 22:26:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350807AbhLWW0H (ORCPT ); Thu, 23 Dec 2021 17:26:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38780 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350813AbhLWWYh (ORCPT ); Thu, 23 Dec 2021 17:24:37 -0500 Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com [IPv6:2607:f8b0:4864:20::1049]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DFFD5C061375 for ; Thu, 23 Dec 2021 14:24:14 -0800 (PST) Received: by mail-pj1-x1049.google.com with SMTP id l8-20020a17090b078800b001b1ea649932so4205917pjz.7 for ; Thu, 23 Dec 2021 14:24:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=tDGnOGhCcvxTaWavsfwTLmgOIvYC3gf4ygJSFWS9QqU=; b=kTOgrcc6BfePm0z+hvCdTFuh79DXEp5GuDndlgpAJVWC9FsuKerF77p60i8Amw364c A0YRrVZzYsMnaM4DfzZuKv1bn6YWYQX3XY8gpy3FGp7+sJD9s9ZB6tHTF/1mMVkH0djH gFphFTPgEquFo/vSlX88ZbEbCwx+ioHc4R3ur6f3p982PtAJSt7n5fr8/Bd6nv51l8M+ NUCmo0Eted2NZ54Y8RQMG9ni5K9eeXGrOrYwV0AakuEuQYuK8e/w8T0xKE4JBgRAT8Ee P6XL9aI7G9VXbC1/v5QwlPUgp/prj07a2SnR3Yo0iDn9+cr9YaMvA8HiVyDiKR4sq+wj +aaQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=tDGnOGhCcvxTaWavsfwTLmgOIvYC3gf4ygJSFWS9QqU=; b=OQv4vMVYXBWHITF0wYMSunuOl9/H4qoiDdFZ6FMDdfodGUUDlmRHfiQRKeODa5axNP 
eQLaCSox4Wub2l90SQq0qqC50D/ReedgBdnaHVlk7kRXAKj0WrDQ7BBPNrdZtaehBJqO gjLjmdo2TuOUY9gDpr6f00CzwvAz2TJKwfWN2y05kM3VYWSCNwgfHFdcARtcay6KJRCZ LM+vzSrRXADJnpXJucxZvcKSnJjKDcjCI7UUDDmxQ4zT5N03UIKYWC621cwvT3aYJLoo vUoyljilK224sc5tn6DiFINSkHet5FKzkq43ksgzbkRS7gNdr1+Tiz6pSRdsI7mrYgIA DAxA== X-Gm-Message-State: AOAM533Grf7ZRTeR4WXOp96NvEzvAr7loDeWvyyoqQCQccXoT94ShikJ dY6OdHwLLu+W28FIaqcQsOHVpyXeeSg= X-Google-Smtp-Source: ABdhPJynTFDTwq9OPWP3NGC2k4NnBfjNzEbTDYwrQOTcJqWU8FH5hTFzrUrGniBxYNwYQlCQwxy11xHnRYg= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90a:3d42:: with SMTP id o2mr738629pjf.1.1640298253827; Thu, 23 Dec 2021 14:24:13 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:08 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-21-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 20/30] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Now that all callers of zap_gfn_range() hold mmu_lock for write, drop support for zapping with mmu_lock held for read. That all callers hold mmu_lock for write isn't a random coincidence; now that the paths that need to zap _everything_ have their own path, the only callers left are those that need to zap for functional correctness. 
And when zapping is required for functional correctness, mmu_lock must be held for write, otherwise the caller has no guarantees about the state of the TDP MMU page tables after it has run, e.g. the SPTE(s) it zapped can be immediately replaced by a vCPU faulting in a page. Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 29 ++++++----------------------- 1 file changed, 6 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index c4bfd6aac999..c7529de776be 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -851,15 +851,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mm= u_page *sp) * function cannot yield, it will not release the MMU lock or reschedule a= nd * the caller must ensure it does not supply too large a GFN range, or the * operation can cause a soft lockup. - * - * If shared is true, this thread holds the MMU lock in read mode and must - * account for the possibility that other threads are modifying the paging - * structures concurrently. If shared is false, this thread should hold the - * MMU lock in write mode. 
*/ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, - gfn_t start, gfn_t end, bool can_yield, bool flush, - bool shared) + gfn_t start, gfn_t end, bool can_yield, bool flush) { bool zap_all =3D (start =3D=3D 0 && end >=3D tdp_mmu_max_gfn_host()); struct tdp_iter iter; @@ -872,15 +866,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm= _mmu_page *root, =20 end =3D min(end, tdp_mmu_max_gfn_host()); =20 - kvm_lockdep_assert_mmu_lock_held(kvm, shared); + lockdep_assert_held_write(&kvm->mmu_lock); =20 rcu_read_lock(); =20 for_each_tdp_pte_min_level(iter, root->spt, root->role.level, min_level, start, end) { -retry: if (can_yield && - tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) { + tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) { flush =3D false; continue; } @@ -899,17 +892,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_= mmu_page *root, !is_last_spte(iter.old_spte, iter.level)) continue; =20 - if (!shared) { - tdp_mmu_set_spte(kvm, &iter, 0); - flush =3D true; - } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) { - /* - * The iter must explicitly re-read the SPTE because - * the atomic cmpxchg failed. 
- */ - iter.old_spte =3D kvm_tdp_mmu_read_spte(iter.sptep); - goto retry; - } + tdp_mmu_set_spte(kvm, &iter, 0); + flush =3D true; } =20 rcu_read_unlock(); @@ -928,8 +912,7 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int a= s_id, gfn_t start, struct kvm_mmu_page *root; =20 for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) - flush =3D zap_gfn_range(kvm, root, start, end, can_yield, flush, - false); + flush =3D zap_gfn_range(kvm, root, start, end, can_yield, flush); =20 return flush; } --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D9817C433EF for ; Thu, 23 Dec 2021 22:26:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240798AbhLWW0N (ORCPT ); Thu, 23 Dec 2021 17:26:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38946 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350604AbhLWWYn (ORCPT ); Thu, 23 Dec 2021 17:24:43 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4FD76C061378 for ; Thu, 23 Dec 2021 14:24:16 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id q1-20020a17090a2dc100b001b151c90c5fso4041298pjm.3 for ; Thu, 23 Dec 2021 14:24:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=53dIIvXbBtU6BY54szQ924fsFmXOE3Wd7U+FFrRyNAM=; b=DaW3Unku37vmpqyg/y/GGM0WX9v6GYWyD9eHeDCzcKmiDS8a6De1slHbkEShtjWuRr WRFjorYjidBxDHrnVfU4rD8IZB1Uj+q/wZVC6WZdXGxlVDPPX1MjhSrMA3U49vw1Ei24 QzjJjiUgKAJ5+YpMGt1mlN+HwHq2oJNesb83lDjXVON2Bw8vsLWuh5z9rgqFqlgTZg6x 
skZxusSAgV7rlsqOmgOi9XRnQFmBq6AT3dGjsRWfqGxh2BUsyRZtkcKVFV9Q2H9qvEBX SdcL56ujKc+JMuQHH3rl0MVJoavS17rPC56lg0W5nb4d3zHRZc9a8lMTxSpQ64pehgGU L6Aw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=53dIIvXbBtU6BY54szQ924fsFmXOE3Wd7U+FFrRyNAM=; b=S9guVAz4LsDXkCNCa2Rx/5DCUv/ZSI1YJHg3TvaH3vm364g9PYabTQLnAfWnoBy6sP a/y6sFHopszVF6Eb7eF4nV8J+yOyA9I02lzZaSEd/D+Gqq9Fxnp9jrgLQ0tWay7fvd65 38hX0QYNVTUZLxSjEMvzpZQJ4o3YZPPzvYM7foazA5RSl5l1Q382If7zy2Pm1Nd5KEK6 6cTzubVDwjHE9Tf98OtQ3d1EeTZPD3Ey4lk53/R34jWOd8LtJqRKC7lpZH1DiAJFmb0m aIM4thp7PcPU2beuifivjwiMD3EVdhFI/n0V3L/7QR4xPnJBs3t0Zg+NN+SKzFFST43F EU5Q== X-Gm-Message-State: AOAM531N3JY+SaRkAnnaMljCm8oILjl1rJxjLPj+4JGD8JRMmsVIhkus KiCE/SGXFnziEkobuxTsBSyFCoR9LF8= X-Google-Smtp-Source: ABdhPJzMcvDl+dQmy+1vJzXwCKOHyNyNoCmQwaRTP06kqtOHqQM40R4XjuG2LOOsn9u8O3wpkplhn+llyCQ= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90a:e501:: with SMTP id t1mr4749018pjy.241.1640298255821; Thu, 23 Dec 2021 14:24:15 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:09 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-22-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 21/30] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Zap only leaf SPTEs in the TDP MMU's 
zap_gfn_range(), and rename various functions accordingly. When removing mappings for functional correctness (except for the stupid VFIO GPU passthrough memslots bug), zapping the leaf SPTEs is sufficient as the paging structures themselves do not point at guest memory and do not directly impact the final translation (in the TDP MMU). Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and kvm_unmap_gfn_range(). Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 4 ++-- arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++++---------------------------- arch/x86/kvm/mmu/tdp_mmu.h | 8 +------- 3 files changed, 14 insertions(+), 39 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f660906c8230..f40773dc4c92 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5804,8 +5804,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_sta= rt, gfn_t gfn_end) =20 if (is_tdp_mmu_enabled(kvm)) { for (i =3D 0; i < KVM_ADDRESS_SPACE_NUM; i++) - flush =3D kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start, - gfn_end, flush); + flush =3D kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, + gfn_end, true, flush); } =20 if (flush) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index c7529de776be..21d015b38ac1 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -841,10 +841,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mm= u_page *sp) } =20 /* - * Tears down the mappings for the range of gfns, [start, end), and frees = the - * non-root pages mapping GFNs strictly within that range. Returns true if - * SPTEs have been cleared and a TLB flush is needed before releasing the - * MMU lock. + * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SP= TEs + * have been cleared and a TLB flush is needed before releasing the MMU lo= ck. 
* * If can_yield is true, will release the MMU lock and reschedule if the * scheduler needs the CPU or there is contention on the MMU lock. If this @@ -852,18 +850,11 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_m= mu_page *sp) * the caller must ensure it does not supply too large a GFN range, or the * operation can cause a soft lockup. */ -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, - gfn_t start, gfn_t end, bool can_yield, bool flush) +static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root, + gfn_t start, gfn_t end, bool can_yield, bool flush) { - bool zap_all =3D (start =3D=3D 0 && end >=3D tdp_mmu_max_gfn_host()); struct tdp_iter iter; =20 - /* - * No need to try to step down in the iterator when zapping all SPTEs, - * zapping the top-level non-leaf SPTEs will recurse on their children. - */ - int min_level =3D zap_all ? root->role.level : PG_LEVEL_4K; - end =3D min(end, tdp_mmu_max_gfn_host()); =20 lockdep_assert_held_write(&kvm->mmu_lock); @@ -871,24 +862,14 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm= _mmu_page *root, rcu_read_lock(); =20 for_each_tdp_pte_min_level(iter, root->spt, root->role.level, - min_level, start, end) { + PG_LEVEL_4K, start, end) { if (can_yield && tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) { flush =3D false; continue; } =20 - if (!is_shadow_present_pte(iter.old_spte)) - continue; - - /* - * If this is a non-last-level SPTE that covers a larger range - * than should be zapped, continue, and zap the mappings at a - * lower level, except when zapping all SPTEs. - */ - if (!zap_all && - (iter.gfn < start || - iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) && + if (!is_shadow_present_pte(iter.old_spte) || !is_last_spte(iter.old_spte, iter.level)) continue; =20 @@ -906,13 +887,13 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm= _mmu_page *root, * SPTEs have been cleared and a TLB flush is needed before releasing the * MMU lock. 
*/ -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start, - gfn_t end, bool can_yield, bool flush) +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t = end, + bool can_yield, bool flush) { struct kvm_mmu_page *root; =20 for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) - flush =3D zap_gfn_range(kvm, root, start, end, can_yield, flush); + flush =3D tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); =20 return flush; } @@ -1143,8 +1124,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm= _page_fault *fault) bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *ra= nge, bool flush) { - return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start, - range->end, range->may_block, flush); + return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start, + range->end, range->may_block, flush); } =20 typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter, diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index ccb12f1914ba..9759cbd821ef 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -15,14 +15,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(str= uct kvm_mmu_page *root) void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, bool shared); =20 -bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start, +bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, bool can_yield, bool flush); -static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, - gfn_t start, gfn_t end, bool flush) -{ - return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush); -} - bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void kvm_tdp_mmu_zap_all(struct kvm *kvm); void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm); --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version:
SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DC9DAC433EF for ; Thu, 23 Dec 2021 22:26:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350830AbhLWW0Z (ORCPT ); Thu, 23 Dec 2021 17:26:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38994 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350643AbhLWWYz (ORCPT ); Thu, 23 Dec 2021 17:24:55 -0500 Received: from mail-pl1-x649.google.com (mail-pl1-x649.google.com [IPv6:2607:f8b0:4864:20::649]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 56EC2C06137E for ; Thu, 23 Dec 2021 14:24:18 -0800 (PST) Received: by mail-pl1-x649.google.com with SMTP id e4-20020a170903240400b001494e7b0d0cso961991plo.16 for ; Thu, 23 Dec 2021 14:24:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=JFtlBtomcDM/mOE9rf4RY2VzkMlwcz38nh6Jra8a7vQ=; b=IIM8HCEEXkjK2wABibtVd02wltmTvdnq19pbJ89croKvZSHb9ry7bNalx4Ow0ZE/Eq 7CfQAzl65itDRQH97+RwBLToD05uzzfhkiDVVscd68gD902ihnPINsLlW/qegWlyO8qn 315AK8mPp8C2uprSLFCE6hzoJm0zqVLqqFJ1pn/ALlRYMduSceoqeLoGf893JKYY5Ep/ vgcZSLn8gVyODJ1lg+CAG6pkrUYnTYoPhXuTTY1EWedngPg58hTWxKPuvQduKK5XB3bQ XDKKK2sh+HFrFEJdXh4GcX8PyXO0MtbaFZ3ziOuicFfDMsD093wy7gVpCsrpNngGnnnx CRUg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=JFtlBtomcDM/mOE9rf4RY2VzkMlwcz38nh6Jra8a7vQ=; b=FjKPKLYukqbQutunQi0gEpdn0TjCxSCs7LXd8x+WbY3Bl6ESJ7UoUzQKWSClRwsDOb 8MblCmo9n9RkXNI2KzdQ+H5HoJKVZUXmk37JB5ifgmszEG2scnUcz6nZTeQo3jh52elb 82pb3hGmYDeiBJIuZbSLJGNcWsA81WrRG7S+NXgO8AHrGqnK0iGeLazSH2CA8XdtjBEj 
NH4GhZWRYoQxT0bAYTTSoEOseNpq5miZnXHEKr8z7Rw6LIdloM7vqzYc1X9kkESlQDfD 8rpAQU0iAkWlyBEkwj94XDoioYGG3e48z8KI0Ft/ubLIR2x6Pg2ZtYVFQHLPn7ZfuX5z LWtQ== X-Gm-Message-State: AOAM530EEYibNa7z7puEpQL8oGx9IIlJUcLkagUAVvvSSID37jDQ19bE nofllzuGRj1P3H6Mn1ggJuxVL59QCOc= X-Google-Smtp-Source: ABdhPJy8bLcApcuZH7n/dxygYs5t4abmOxKJJcdsud6/UPcB4ZwdmJX+b3mdekBpTjrziqo6YIhwsm9Irfs= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90a:c203:: with SMTP id e3mr739912pjt.0.1640298257550; Thu, 23 Dec 2021 14:24:17 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:10 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-23-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 22/30] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When yielding in the TDP MMU iterator, service any pending TLB flush before dropping RCU protections in anticipation of using the caller's RCU "lock" as a proxy for vCPUs in the guest.
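The ordering described above can be sketched as a small userspace model. This is an illustration under stated assumptions, not the kernel code: model_cond_resched(), struct model_trace and the EV_* tags are hypothetical names, and each "event" only records which kernel call would run at that point. Only the order (flush first, then drop RCU, then reschedule) is taken from the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event tags; each one stands in for a kernel call. */
enum model_event {
	EV_FLUSH_REMOTE_TLBS,	/* models kvm_flush_remote_tlbs() */
	EV_RCU_READ_UNLOCK,	/* models rcu_read_unlock() */
	EV_COND_RESCHED,	/* models cond_resched_rwlock_{read,write}() */
};

#define MODEL_TRACE_MAX 8

/* Hypothetical trace buffer used to check the order of events. */
struct model_trace {
	enum model_event ev[MODEL_TRACE_MAX];
	size_t n;
};

static void model_record(struct model_trace *t, enum model_event e)
{
	assert(t->n < MODEL_TRACE_MAX);
	t->ev[t->n++] = e;
}

/*
 * Model of the yield path after this patch: a pending flush is serviced
 * while RCU is still held, and only then is RCU dropped and the lock
 * yielded.  Returns true if the model "yielded".
 */
static bool model_cond_resched(struct model_trace *t, bool need_resched,
			       bool flush)
{
	if (!need_resched)
		return false;

	if (flush)
		model_record(t, EV_FLUSH_REMOTE_TLBS);

	model_record(t, EV_RCU_READ_UNLOCK);
	model_record(t, EV_COND_RESCHED);
	return true;
}
```

Calling model_cond_resched() with need_resched and flush both true leaves the flush event ahead of the RCU unlock event in the trace, which is the property the patch relies on.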
Signed-off-by: Sean Christopherson Reviewed-by: Ben Gardon --- arch/x86/kvm/mmu/tdp_mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 21d015b38ac1..e7086eb35599 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -728,11 +728,11 @@ static inline bool __must_check tdp_mmu_iter_cond_res= ched(struct kvm *kvm, return false; =20 if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { - rcu_read_unlock(); - if (flush) kvm_flush_remote_tlbs(kvm); =20 + rcu_read_unlock(); + if (shared) cond_resched_rwlock_read(&kvm->mmu_lock); else --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 76F5CC433FE for ; Thu, 23 Dec 2021 22:26:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241232AbhLWW03 (ORCPT ); Thu, 23 Dec 2021 17:26:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38694 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350649AbhLWWY6 (ORCPT ); Thu, 23 Dec 2021 17:24:58 -0500 Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com [IPv6:2607:f8b0:4864:20::1049]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A3BDEC06118F for ; Thu, 23 Dec 2021 14:24:19 -0800 (PST) Received: by mail-pj1-x1049.google.com with SMTP id p2-20020a17090a2c4200b001b1866beecbso6392724pjm.5 for ; Thu, 23 Dec 2021 14:24:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=XnGypoQjpJLcdvN50ul/dpb0cgel0J6MKJVjKnPbCtA=; b=sG2UyMhvkZ8U2FX5mvM5cexvwtWbu7t/mCSJDI++GL1UzZe+g6T+1maB2HhgjKD0s1 
GI4/2Lx5FSZ9menQ82aQmAOUqmUAd6zKE0LdoGsXBBMXBC72/EPoW3/nM8BNWBv8YeXd jDHeEgoXYXXMlX/Q/pLhX4mi/dbb0jBfmkVk5ljVgEmKg9R4d+k96sTiLtoSFmCAtypX U57lGynZ5hGdh+94TufGd9wgMMue86300KTtpyvzCPMPGoALU5NRH4T2bpwsnc2K+UaZ cXVqMJQw+krN1ilxeGTr3cI5py9khd5m18cgFf065z9xS+uEq8MFK7AJ3b3bzaLy4Sg4 AekQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=XnGypoQjpJLcdvN50ul/dpb0cgel0J6MKJVjKnPbCtA=; b=zrVY5o9tLnKxHD5FGW7PEZ2MmUjmgMiEFqTVKZBDIm5vzO1erNyfR6Ja1giPBZBfvu VuN2PjhuQ3dhmTHJGDE451gnLfSHHt5QrrhJJItAOVkVfMFBtYYR6+FqR3QKx5s6M0Gh hC8KQau1R5MhGjOWev4epVCfCMin4o2HmzVGHsUAEIunb+PMxhusnv+Vw5vbAyXr5NNG FY0PqOxk3rIXAyDm2Vv/xmswkVuQqVy0FPun6bik+CFyVOjNki1gXLbL9saeR86PJ7yy 6p9mSwkB8XpypyqiCF4GcMhhXpQM3wcVpYd1vyARRMC1+2pe9whoyTzhTKFxWHC7DwJ3 zoZQ== X-Gm-Message-State: AOAM5305ltsikD3R+unlN1a7VwwBLPUHW972JiMRFld+I9kPGo51fpZU awuj1M58h7WefCNaCsxk5Yb9GcudQyI= X-Google-Smtp-Source: ABdhPJwnLEBHJ17525jR+sFbQ3rwDDvhujkjePkSsWp07gUh7QThUDsSbmUrYAKVUommlH4P7XnlqZWHanE= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:903:188:b0:149:512a:e69c with SMTP id z8-20020a170903018800b00149512ae69cmr3768689plg.40.1640298259192; Thu, 23 Dec 2021 14:24:19 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:11 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-24-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 23/30] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: 
bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead of immediately flushing. Because the shadow pages are freed in an RCU callback, so long as at least one CPU holds RCU, all CPUs are protected. For vCPUs running in the guest, i.e. consuming TLB entries, KVM only needs to ensure the caller services the pending TLB flush before dropping its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs running in the guest. Deferring the flushes allows batching flushes, e.g. when installing a 1gb hugepage and zapping a pile of SPs. And when zapping an entire root, deferring flushes allows skipping the flush entirely (because flushes are not needed in that case). Avoiding flushes when zapping an entire root is especially important as synchronizing with other CPUs via IPI after zapping every shadow page can cause significant performance issues for large VMs. The issue is exacerbated by KVM zapping entire top-level entries without dropping RCU protection, which can lead to RCU stalls even when zapping roots backing relatively "small" amounts of guest memory, e.g. 2tb. Removing the IPI bottleneck largely mitigates the RCU issues, though it's likely still a problem for 5-level paging. A future patch will further address the problem by zapping roots in multiple passes to avoid holding RCU for an extended duration. 
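The batching argument above can be illustrated with a minimal userspace model, offered as a sketch rather than KVM code. All names here (struct flush_stats, zap_immediate(), zap_deferred(), caller_deferred()) are hypothetical, and a "remote flush" is just a counter increment standing in for an IPI-based TLB flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter standing in for the number of remote TLB flushes. */
struct flush_stats {
	unsigned int remote_flushes;
};

/* Old behaviour in this model: one remote flush per freed shadow page. */
static void zap_immediate(struct flush_stats *st, unsigned int nr_pages)
{
	assert(st);
	for (unsigned int i = 0; i < nr_pages; i++)
		st->remote_flushes++;
}

/*
 * New behaviour in this model: the zap only reports that a flush is
 * pending and leaves servicing it to the caller.
 */
static bool zap_deferred(unsigned int nr_pages)
{
	return nr_pages > 0;
}

/*
 * Caller in this model: flushes at most once, and does so before it
 * would drop its RCU protection.
 */
static void caller_deferred(struct flush_stats *st, unsigned int nr_pages)
{
	bool flush;

	assert(st);
	/* rcu_read_lock() would be taken here. */
	flush = zap_deferred(nr_pages);
	if (flush)
		st->remote_flushes++;
	/* rcu_read_unlock() would follow, after the flush. */
}
```

In this model, zapping 512 pages costs 512 flushes with zap_immediate() and a single flush with caller_deferred(); the real savings depend on how many shadow pages a given flow frees.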
Reviewed-by: Ben Gardon Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 12 ++++++++++++ arch/x86/kvm/mmu/tdp_iter.h | 7 +++---- arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++++------------ 3 files changed, 26 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f40773dc4c92..ff5a7f763b1e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6268,6 +6268,12 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) rcu_idx =3D srcu_read_lock(&kvm->srcu); write_lock(&kvm->mmu_lock); =20 + /* + * Zapping TDP MMU shadow pages, including the remote TLB flush, must + * be done under RCU protection, the pages are freed via RCU callback. + */ + rcu_read_lock(); + ratio =3D READ_ONCE(nx_huge_pages_recovery_ratio); to_zap =3D ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0; for ( ; to_zap; --to_zap) { @@ -6292,12 +6298,18 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) =20 if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); + rcu_read_unlock(); + cond_resched_rwlock_write(&kvm->mmu_lock); flush =3D false; + + rcu_read_lock(); } } kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush); =20 + rcu_read_unlock(); + write_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, rcu_idx); } diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index 1de6c1c9ff7b..95bd54027e08 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -9,10 +9,9 @@ =20 /* * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SP= TEs) - * to be zapped while holding mmu_lock for read. Holding RCU isn't requir= ed for - * correctness if mmu_lock is held for write, but plumbing "struct kvm" do= wn to - * the lower* depths of the TDP MMU just to make lockdep happy is a nightm= are, - * so all* accesses to SPTEs are must be done under RCU protection. 
+ * to be zapped while holding mmu_lock for read, and to allow TLB flushes = to be + * batched without having to collect the list of zapped SPs. Flows that c= an + * remove SPs must service pending TLB flushes prior to dropping RCU prote= ction. */ static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep) { diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index e7086eb35599..72bcec2cd23c 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -424,9 +424,6 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm= , tdp_ptep_t pt, shared); } =20 - kvm_flush_remote_tlbs_with_address(kvm, base_gfn, - KVM_PAGES_PER_HPAGE(level + 1)); - call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback); } =20 @@ -822,21 +819,14 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct = kvm_mmu_page *root, =20 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) { - u64 old_spte; + u64 old_spte =3D kvm_tdp_mmu_read_spte(sp->ptep); =20 - rcu_read_lock(); - - old_spte =3D kvm_tdp_mmu_read_spte(sp->ptep); - if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) { - rcu_read_unlock(); + if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) return false; - } =20 __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0, sp->gfn, sp->role.level + 1, true, true); =20 - rcu_read_unlock(); - return true; } =20 @@ -878,6 +868,11 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct = kvm_mmu_page *root, } =20 rcu_read_unlock(); + + /* + * Because this flows zaps _only_ leaf SPTEs, the caller doesn't need + * to provide RCU protection as no 'struct kvm_mmu_page' will be freed. 
+ */ return flush; } =20 @@ -1007,6 +1002,10 @@ static int tdp_mmu_map_handle_target_level(struct kv= m_vcpu *vcpu, ret =3D RET_PF_SPURIOUS; else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte)) return RET_PF_RETRY; + else if (is_shadow_present_pte(iter->old_spte) && + !is_last_spte(iter->old_spte, iter->level)) + kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn, + KVM_PAGES_PER_HPAGE(iter->level + 1)); =20 /* * If the page fault was caused by a write but the page is write --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 37B91C433F5 for ; Thu, 23 Dec 2021 22:26:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230170AbhLWW0c (ORCPT ); Thu, 23 Dec 2021 17:26:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38812 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350654AbhLWWZC (ORCPT ); Thu, 23 Dec 2021 17:25:02 -0500 Received: from mail-pg1-x54a.google.com (mail-pg1-x54a.google.com [IPv6:2607:f8b0:4864:20::54a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 49789C0619D8 for ; Thu, 23 Dec 2021 14:24:21 -0800 (PST) Received: by mail-pg1-x54a.google.com with SMTP id p28-20020a63951c000000b0033f7b94305dso3885403pgd.11 for ; Thu, 23 Dec 2021 14:24:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=gISDifzPfM9NcqK+RFft8Pla2QecbjH1Rdd2hFIVXsM=; b=gc/pQdeI18xpDzB197wjMZ4F1vQ6UhUlP5O+H36rIZLcP6RTQnTKWf5oNJqQGDF5Q6 oqA2S8HleD8TgLsaL9sqGpq2ut8A8nsxqaa5+eONnB0kah7e7qduGKsezJYEwF6ntNtx MdFW8XBvXoiEuKTt407Ot7OfWxf22WugTW08bLizQkmNCcbicAVESz4zgQpcJ1uIJ/CP 
xURYU19G3EBWYvq+o1IBin+DGplfdJWSlOky0HaAPByy/DYrD4Q+Xl9TbQ75ZvYvOr1u zEBXC5to8DBzsS/tHyqujW7mO6qc+tEkOZEu3g/ikX6T8/1U+qh0zndP3qI/nkwaNJKf FQ6A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=gISDifzPfM9NcqK+RFft8Pla2QecbjH1Rdd2hFIVXsM=; b=fkV4Te9upKY47h80JRMoqAJM4FabMY31TKoxBv57IiXaM7r3QQFFLuOFvp+9FTZy2G c+FsGt53XJvkHzT16Wq6jY/xkPPoENrjnFQhE8Vd1/qZKf5vlJnascplZ/U1SGUsw/4c wD22CXiaz3sbM5hz/12vGzmwBugj8Lb3xFP5UOZIa/+rjRVIJIavIA260ciYW+y/Y8Iz 7PzsDhdKje1SwXpcpGS+IP460ngI4YoERgWJ6qLC9E8DS0U+6CsGAFqJxgomelIjUtMv ljqOQaxL4xSvqkiC0B1maoN5F4x7kSxzZcm7vAWtQI/KgSmwHYLbFhtzU7HlED4zGVJQ syww== X-Gm-Message-State: AOAM53096NsK68/v0ijdlPvNMlspYXV15j0yeX544LHLa/T9Q7lw/YFb xOENnVXYy1EhuNjZEiFpVgZKSRJQB3Y= X-Google-Smtp-Source: ABdhPJwQqLnDJ2MgZO8u3orclBM6jREZZOfxLRV0xmKGFxZt7/wmQ4/B39tbZ4domX7NBJBa1++uarDWVBU= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a05:6a00:1a03:b0:4ba:c23e:df67 with SMTP id g3-20020a056a001a0300b004bac23edf67mr4150250pfv.63.1640298260806; Thu, 23 Dec 2021 14:24:20 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:12 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-25-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 24/30] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; 
charset="utf-8" Allow yielding when zapping SPTEs after the last reference to a valid root is put. Because KVM must drop all SPTEs in response to relevant mmu_notifier events, mark defunct roots invalid and reset their refcount prior to zapping the root. Keeping the refcount elevated while the zap is in-progress ensures the root is reachable via mmu_notifier until the zap completes and the last reference to the invalid, defunct root is put. Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the root is being put has a massive paging structure, e.g. zapping a root that is backed entirely by 4kb pages for a guest with 32tb of memory can take hundreds of seconds to complete. watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368] RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm] __handle_changed_spte+0x1b2/0x2f0 [kvm] handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm] __handle_changed_spte+0x1f4/0x2f0 [kvm] handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm] __handle_changed_spte+0x1f4/0x2f0 [kvm] tdp_mmu_zap_root+0x307/0x4d0 [kvm] kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm] kvm_mmu_free_roots+0x22d/0x350 [kvm] kvm_mmu_reset_context+0x20/0x60 [kvm] kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm] kvm_vcpu_ioctl+0x5bd/0x710 [kvm] __se_sys_ioctl+0x77/0xc0 __x64_sys_ioctl+0x1d/0x20 do_syscall_64+0x44/0xa0 entry_SYSCALL_64_after_hwframe+0x44/0xae KVM currently doesn't put a root from a non-preemptible context, so other than the mmu_notifier wrinkle, yielding when putting a root is safe. Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't take a reference to each root (it requires mmu_lock be held for the entire duration of the walk). tdp_mmu_next_root() is used only by the yield-friendly iterator. kvm_tdp_mmu_zap_invalidated_roots() is explicitly yield friendly.
kvm_mmu_free_roots() =3D> mmu_free_root_page() is a much bigger fan-out, but is still yield-friendly in all call sites, as all callers can be traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or kvm_create_vm(). Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu_internal.h | 7 +- arch/x86/kvm/mmu/tdp_mmu.c | 145 +++++++++++++++++++++++--------- 2 files changed, 109 insertions(+), 43 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_interna= l.h index be063b6c91b7..8ce3d58fdf7f 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -65,7 +65,12 @@ struct kvm_mmu_page { struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ tdp_ptep_t ptep; }; - DECLARE_BITMAP(unsync_child_bitmap, 512); + union { + DECLARE_BITMAP(unsync_child_bitmap, 512); + struct { + bool tdp_mmu_defunct_root; + }; + }; =20 struct list_head lpage_disallowed_link; #ifdef CONFIG_X86_32 diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 72bcec2cd23c..aec97e037a8d 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -91,21 +91,67 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_m= mu_page *root, =20 WARN_ON(!root->tdp_mmu_page); =20 - spin_lock(&kvm->arch.tdp_mmu_pages_lock); - list_del_rcu(&root->link); - spin_unlock(&kvm->arch.tdp_mmu_pages_lock); + /* + * Ensure root->role.invalid is read after the refcount reaches zero to + * avoid zapping the root multiple times, e.g. if a different task + * acquires a reference (after the root was marked invalid+defunct) and + * puts the last reference, all while holding mmu_lock for read. Pairs + * with the smp_mb__before_atomic() below. + */ + smp_mb__after_atomic(); + + /* + * Free the root if it's already invalid. Invalid roots must be zapped + * before their last reference is put, i.e. there's no work to be done, + * and all roots must be invalidated (see below) before they're freed. 
+ * Re-zapping defunct roots, which are always invalid, would put KVM + * into an infinite loop (again, see below). + */ + if (root->role.invalid) { + spin_lock(&kvm->arch.tdp_mmu_pages_lock); + list_del_rcu(&root->link); + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); + + call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback); + return; + } + + /* + * Invalidate the root to prevent it from being reused by a vCPU, and + * mark it defunct so that kvm_tdp_mmu_zap_invalidated_roots() doesn't + * try to put a reference it didn't acquire. + */ + root->role.invalid =3D true; + root->tdp_mmu_defunct_root =3D true; + + /* + * Ensure tdp_mmu_defunct_root is visible if a concurrent reader acquires + * a reference after the root's refcount is reset. Pairs with the + * smp_mb__after_atomic() above and in kvm_tdp_mmu_zap_invalidated_roots(= ). + */ + smp_mb__before_atomic(); + + /* + * Note, if mmu_lock is held for read this can race with other readers, + * e.g. they may acquire a reference without seeing the root as invalid, + * and the refcount may be reset after the root is skipped. Both races + * are benign, as flows that must visit all roots, e.g. need to zap + * SPTEs for correctness, must take mmu_lock for write to block page + * faults, and the only flow that must not consume an invalid root is + * allocating a new root for a vCPU, which also takes mmu_lock for write. + */ + refcount_set(&root->tdp_mmu_root_count, 1); =20 /* - * A TLB flush is not necessary as KVM performs a local TLB flush when - * allocating a new root (see kvm_mmu_load()), and when migrating vCPU - * to a different pCPU. Note, the local TLB flush on reuse also - * invalidates any paging-structure-cache entries, i.e. TLB entries for - * intermediate paging structures, that may be zapped, as such entries - * are associated with the ASID on both VMX and SVM. + * Zap the root, then put the refcount "acquired" above. 
Recursively + * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an + * infinite loop by freeing invalid roots. By design, the root is + * reachable while it's being zapped, thus a different task can put its + * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a + * defunct root is unavoidable. */ tdp_mmu_zap_root(kvm, root, shared); - - call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback); + kvm_tdp_mmu_put_root(kvm, root, shared); } =20 enum tdp_mmu_roots_iter_type { @@ -758,12 +804,23 @@ static inline gfn_t tdp_mmu_max_gfn_host(void) static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root, bool shared) { - bool root_is_unreachable =3D !refcount_read(&root->tdp_mmu_root_count); struct tdp_iter iter; =20 gfn_t end =3D tdp_mmu_max_gfn_host(); gfn_t start =3D 0; =20 + /* + * The root must have an elevated refcount so that it's reachable via + * mmu_notifier callbacks, which allows this path to yield and drop + * mmu_lock. When handling an unmap/release mmu_notifier command, KVM + * must drop all references to relevant pages prior to completing the + * callback. Dropping mmu_lock with an unreachable root would result + * in zapping SPTEs after a relevant mmu_notifier callback completes + * and lead to use-after-free as zapping a SPTE triggers "writeback" of + * dirty accessed bits to the SPTE's associated struct page. + */ + WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count)); + kvm_lockdep_assert_mmu_lock_held(kvm, shared); =20 rcu_read_lock(); @@ -775,19 +832,7 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct k= vm_mmu_page *root, for_each_tdp_pte_min_level(iter, root->spt, root->role.level, root->role.level, start, end) { retry: - /* - * Yielding isn't allowed when zapping an unreachable root as - * the root won't be processed by mmu_notifier callbacks. When - * handling an unmap/release mmu_notifier command, KVM must - * drop all references to relevant pages prior to completing - * the callback. 
Dropping mmu_lock can result in zapping SPTEs - * for an unreachable root after a relevant callback completes, - * which leads to use-after-free as zapping a SPTE triggers - * "writeback" of dirty/accessed bits to the SPTE's associated - * struct page. - */ - if (!root_is_unreachable && - tdp_mmu_iter_cond_resched(kvm, &iter, false, shared)) + if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared)) continue; =20 if (!is_shadow_present_pte(iter.old_spte)) @@ -796,22 +841,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct k= vm_mmu_page *root, if (!shared) { tdp_mmu_set_spte(kvm, &iter, 0); } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) { - /* - * cmpxchg() shouldn't fail if the root is unreachable. - * to be unreachable. Re-read the SPTE and retry so as - * not to leak the page and its children. - */ - WARN_ONCE(root_is_unreachable, - "Contended TDP MMU SPTE in unreachable root."); iter.old_spte =3D kvm_tdp_mmu_read_spte(iter.sptep); goto retry; } - /* - * WARN if the root is invalid and is unreachable, all SPTEs - * should've been zapped by kvm_tdp_mmu_zap_invalidated_roots(), - * and inserting new SPTEs under an invalid root is a KVM bug. - */ - WARN_ON_ONCE(root_is_unreachable && root->role.invalid); } =20 rcu_read_unlock(); @@ -899,6 +931,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm) int i; =20 /* + * Zap all roots, including defunct roots, as all SPTEs must be dropped + * before returning to the caller. + * * A TLB flush is unnecessary, KVM zaps everything if and only the VM * is being destroyed or the userspace VMM has exited. In both cases, * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request. @@ -924,6 +959,12 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm) =20 for_each_invalid_tdp_mmu_root_yield_safe(kvm, root) { /* + * Zap the root, even if it's defunct, as all SPTEs must be + * dropped before returning to the caller, e.g. 
if the root was + * invalidated by a memslot update, then SPTEs associated with + * a deleted/moved memslot are unreachable via the mmu_notifier + * once the memslot update completes. + * * A TLB flush is unnecessary, invalidated roots are guaranteed * to be unreachable by the guest (see kvm_tdp_mmu_put_root() * for more details), and unlike the legacy MMU, no vCPU kick @@ -935,10 +976,24 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kv= m) tdp_mmu_zap_root(kvm, root, true); =20 /* - * Put the reference acquired in kvm_tdp_mmu_invalidate_roots(). + * Leverages kvm_tdp_mmu_get_root() in the iterator, pairs with + * the smp_mb__before_atomic() in kvm_tdp_mmu_put_root() to + * ensure a defunct root is seen as such. + */ + smp_mb__after_atomic(); + + /* + * Put the reference acquired in kvm_tdp_mmu_invalidate_roots() + * unless the root is defunct worker data, in which case no + * reference was taken. Roots become defunct only when a valid + * root has its last reference put, thus holding a reference + * means the root can't become defunct between invalidating the + * root and re-checking the data here. + * * Note, the iterator holds its own reference. */ - kvm_tdp_mmu_put_root(kvm, root, true); + if (!root->tdp_mmu_defunct_root) + kvm_tdp_mmu_put_root(kvm, root, true); } } =20 @@ -953,13 +1008,17 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *k= vm) * a vCPU drops the last reference to a root prior to the root being zappe= d, it * will get stuck with tearing down the entire paging structure. * - * Get a reference even if the root is already invalid, + * Get a reference even if the root is already invalid, unless it's defunc= t, as * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference t= o all * invalid roots, e.g. there's no epoch to identify roots that were invali= dated * by a previous call. 
Roots stay on the list until the last reference is * dropped, so even though all invalid roots are zapped, a root may not go= away * for quite some time, e.g. if a vCPU blocks across multiple memslot upda= tes. * + * Don't take a reference if the root is defunct, vCPUs cannot hold refere= nces + * to defunct roots and so will never get stuck with zapping the root. No= te, + * defunct roots still need to be zapped by kvm_tdp_mmu_zap_invalidated_ro= ots(). + * * Because mmu_lock is held for write, it should be impossible to observe a * root with zero refcount, i.e. the list of roots cannot be stale. * @@ -971,8 +1030,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm) struct kvm_mmu_page *root; =20 lockdep_assert_held_write(&kvm->mmu_lock); + list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) { - if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(kvm, root))) + if (!root->tdp_mmu_defunct_root && + !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) root->role.invalid =3D true; } } --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AE290C433EF for ; Thu, 23 Dec 2021 22:27:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350944AbhLWW07 (ORCPT ); Thu, 23 Dec 2021 17:26:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38700 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350461AbhLWWZs (ORCPT ); Thu, 23 Dec 2021 17:25:48 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EC042C061D5E for ; Thu, 23 Dec 2021 14:24:22 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id 
p4-20020a17090a348400b001b103a13f69so6382247pjb.8 for ; Thu, 23 Dec 2021 14:24:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=i4nDji1SAHJbfO4zc501SdWktURErA1mUSQbrDpHUlE=; b=Vxi1bRmXVR+CNUOdcODHvIqCIBGCK2G+5hIJ9+PzOmq1mK96Z4q4egX1RRZyJHLHbt jF8dMrqHbGaeuyWGDCWXRo2rENE0tvf6LPsdMhVpFX8tYhzsxymxaWwuvGyWft1jEQpr G/Fujwj3v+cC/YmylaG5Xiz2drcALyabHBoFJjp8l8DKtQTPCkRVaOCq9vQA7XzkIOI+ nSTS2VjQDNZKzXdA7m/bknnYeAP3GAq0l0aEExO1A1hhfAzrba5riS5ygrhwsJVZY8rT MK2xHJAKGNT/30Pz3m1J64UKRo56T+w7hNdDiVWptVRWGEcag1pwRBtQLY+v5PlP2e+Q oSWg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=i4nDji1SAHJbfO4zc501SdWktURErA1mUSQbrDpHUlE=; b=tiO0FmnF6J4CFs+auTpZg7Ct44nOKl2JyCAOnCtKsjVwpGBISJpIEQIQgTxejc8Hvl DkDWaqUa+kbRX9Ou+Xkpf4kROLDd0PijAnNzhj3tmfGZ/v4o/GSKnFY8eWF+3F+df6N2 z1rcp1nv/gWS8Ep0pjiqJ2VGef2+UniocaEjlO35XCwcvKzxrYnQ3mU6v0fKUcXUpBBP JBiqV3bJX/MmjDCDlhKAVwuyqaaYv5gGepUt5L6aAKmhf1BWSFZ/gI1NfpziELR8eaP1 CTbHDTTdsVVw3rHP73dxNL8Df3nTVIWXccw00V/ptyWo0rd2MixjTW9YVBoyAYBVH81c mZhw== X-Gm-Message-State: AOAM530MBcM8Jgc4FIY6/ndPOE5RTnsJlNywOvedp6zAVHwyFLOrCQsh gSts6ditl1zOOY7UabhZoIb0hnVAhTM= X-Google-Smtp-Source: ABdhPJwdARqSCG1hVN6DRUH0JKvPblGfx8biIe15EHWBYkiuHI2pwAJaGdlVBYMwzWDgbhK/EfoR/O/E8HE= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a05:6a00:809:b0:4ba:da08:496f with SMTP id m9-20020a056a00080900b004bada08496fmr4359945pfk.60.1640298262474; Thu, 23 Dec 2021 14:24:22 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:13 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-26-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: 
git-send-email 2.34.1.448.ga2b2bfdf31-goog
Subject: [PATCH v2 25/30] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

When zapping a TDP MMU root, perform the zap in two passes to avoid zapping an entire top-level SPTE while holding RCU, which can induce RCU stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then zap top-level entries in the second pass.

With 4-level paging, zapping a PGD that is fully populated with 4kb leaf SPTEs takes up to ~7 or so seconds (time varies based on kernel config, number of CPUs, vCPUs, etc.). With 5-level paging, that time can balloon well into hundreds of seconds. Before remote TLB flushes were omitted, the problem was even worse as waiting for all active vCPUs to respond to the IPI introduced significant overhead for VMs with large numbers of vCPUs.

By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass, the amount of work that is done without dropping RCU protection is strictly bounded, with the worst case latency for a single operation being less than 100ms.

Zapping at 1gb in the first pass is not arbitrary. First and foremost, KVM relies on being able to zap 1gb shadow pages in a single shot when replacing a shadow page with a hugepage. Zapping a 1gb shadow page that is fully populated with 4kb dirty SPTEs also triggers the worst case latency due to writing back the struct page accessed/dirty bits for each 4kb page, i.e. the two-pass approach is guaranteed to work so long as KVM can cleanly zap a 1gb shadow page.
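For illustration only, the bounding argument above can be sketched in a small userspace toy model. Nothing here is KVM code: struct toy_table, the toy_* helpers and the fan-out of 4 (real x86 tables hold 512 entries) are all hypothetical. The model counts how many 4kb leaves a single zap operation tears down, which is a rough stand-in for the work done between opportunities to yield.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_FANOUT   4	/* assumption: real x86 tables hold 512 entries */
#define TOY_LEVEL_4K 1
#define TOY_LEVEL_1G 3

/* Hypothetical stand-in for a shadow page, not struct kvm_mmu_page. */
struct toy_table {
	int level;				/* level of this table's entries */
	struct toy_table *child[TOY_FANOUT];	/* NULL == not present */
	int leaf[TOY_FANOUT];			/* only used at TOY_LEVEL_4K */
};

/* Build a fully populated table whose entries live at the given level. */
static struct toy_table *toy_build(int level)
{
	struct toy_table *t = calloc(1, sizeof(*t));
	int i;

	assert(t);
	t->level = level;
	for (i = 0; i < TOY_FANOUT; i++) {
		if (level == TOY_LEVEL_4K)
			t->leaf[i] = 1;
		else
			t->child[i] = toy_build(level - 1);
	}
	return t;
}

/* Zap one entry and everything below it, return the leaves torn down. */
static long toy_zap_entry(struct toy_table *t, int idx)
{
	struct toy_table *c = t->child[idx];
	long n = 0;
	int i;

	if (t->level == TOY_LEVEL_4K) {
		n = t->leaf[idx];
		t->leaf[idx] = 0;
		return n;
	}
	if (!c)
		return 0;
	for (i = 0; i < TOY_FANOUT; i++)
		n += toy_zap_entry(c, i);
	free(c);
	t->child[idx] = NULL;
	return n;
}

/*
 * Zap every entry at zap_level and record the largest single zap.  In this
 * model a yield could happen between any two toy_zap_entry() calls.
 */
static void toy_zap_at_level(struct toy_table *t, int zap_level, long *worst)
{
	int i;

	for (i = 0; i < TOY_FANOUT; i++) {
		if (t->level > zap_level) {
			if (t->child[i])
				toy_zap_at_level(t->child[i], zap_level, worst);
		} else {
			long n = toy_zap_entry(t, i);

			if (n > *worst)
				*worst = n;
		}
	}
}

/* Build a full tree, zap it in one or two passes, return the worst zap. */
long toy_worst_single_zap(int root_level, int two_pass)
{
	struct toy_table *root = toy_build(root_level);
	long worst = 0;

	if (two_pass && root_level > TOY_LEVEL_1G)
		toy_zap_at_level(root, TOY_LEVEL_1G, &worst);
	toy_zap_at_level(root, root_level, &worst);
	free(root);
	return worst;
}
```

In this model the worst single zap in the one-pass case grows with the depth of the root, whereas in the two-pass case it is capped at whatever sits below one "1gb" entry, and the second pass only walks tables that are already empty.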
rcu: INFO: rcu_sched self-detected stall on CPU rcu: 52-....: (20999 ticks this GP) idle=3D7be/1/0x4000000000000000 softirq=3D15759/15759 fqs=3D5058 (t=3D21016 jiffies g=3D66453 q=3D238577) NMI backtrace for cpu 52 Call Trace: ... mark_page_accessed+0x266/0x2f0 kvm_set_pfn_accessed+0x31/0x40 handle_removed_tdp_mmu_page+0x259/0x2e0 __handle_changed_spte+0x223/0x2c0 handle_removed_tdp_mmu_page+0x1c1/0x2e0 __handle_changed_spte+0x223/0x2c0 handle_removed_tdp_mmu_page+0x1c1/0x2e0 __handle_changed_spte+0x223/0x2c0 zap_gfn_range+0x141/0x3b0 kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130 kvm_mmu_zap_all_fast+0x121/0x190 kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10 kvm_page_track_flush_slot+0x5c/0x80 kvm_arch_flush_shadow_memslot+0xe/0x10 kvm_set_memslot+0x172/0x4e0 __kvm_set_memory_region+0x337/0x590 kvm_vm_ioctl+0x49c/0xf80 Reported-by: David Matlack Cc: Ben Gardon Cc: Mingwei Zhang Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/tdp_mmu.c | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index aec97e037a8d..2e28f5e4b761 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -809,6 +809,18 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct k= vm_mmu_page *root, gfn_t end =3D tdp_mmu_max_gfn_host(); gfn_t start =3D 0; =20 + /* + * To avoid RCU stalls due to recursively removing huge swaths of SPs, + * split the zap into two passes. On the first pass, zap at the 1gb + * level, and then zap top-level SPs on the second pass. "1gb" is not + * arbitrary, as KVM must be able to zap a 1gb shadow page without + * inducing a stall to allow in-place replacement with a 1gb hugepage. + * + * Because zapping a SP recurses on its children, stepping down to + * PG_LEVEL_4K in the iterator itself is unnecessary. 
+ */ + int zap_level =3D PG_LEVEL_1G; + /* * The root must have an elevated refcount so that it's reachable via * mmu_notifier callbacks, which allows this path to yield and drop @@ -825,12 +837,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct k= vm_mmu_page *root, =20 rcu_read_lock(); =20 - /* - * No need to try to step down in the iterator when zapping an entire - * root, zapping an upper-level SPTE will recurse on its children. - */ +start: for_each_tdp_pte_min_level(iter, root->spt, root->role.level, - root->role.level, start, end) { + zap_level, start, end) { retry: if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared)) continue; @@ -838,6 +847,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kv= m_mmu_page *root, if (!is_shadow_present_pte(iter.old_spte)) continue; =20 + if (iter.level > zap_level) + continue; + if (!shared) { tdp_mmu_set_spte(kvm, &iter, 0); } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) { @@ -846,6 +858,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct k= vm_mmu_page *root, } } =20 + if (zap_level < root->role.level) { + zap_level =3D root->role.level; + goto start; + } + rcu_read_unlock(); } =20 --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 770CBC4332F for ; Thu, 23 Dec 2021 22:26:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350925AbhLWW0s (ORCPT ); Thu, 23 Dec 2021 17:26:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38858 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350497AbhLWWZs (ORCPT ); Thu, 23 Dec 2021 17:25:48 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by 
lindbergh.monkeyblade.net (Postfix) with ESMTPS id 91DD3C061D5F for ; Thu, 23 Dec 2021 14:24:24 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id w19-20020a17090aea1300b001ad6e2148ccso4997444pjy.1 for ; Thu, 23 Dec 2021 14:24:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=VKRhc2W30aJ+XXbCyNQ1ulY913smbN678/b9erLTCZg=; b=TraPH2v75kx8IbE8VZzkiytfqpmKpo7bA6i3wHFNH8kWiJsWoKxjs5qifY9XeUUgX2 tq7skiqLI75wwsk+cjIoklE8kfO4Q026KdxSxhwOPCtGvLtOhmEwScoigtyX1CHem8so zkagxCu8O7dAfUgZFPfJkFwQSPFdHLQ1V2UtzGQHphYQQNHGEZllckQfQUX7cmuix4h2 nfcB7Hrux6pUuz9oXT0nkK1872SuYu0vCiaN3hkcW7oo3lT3LOG6nAbpjSDrgIL5zhwj LNRA0yctE8lTsttvPbZFQKIEkvpm5A9lSF7rYmq4mQ+6kupPT8W6/HfKU/DtISQgrr8F G76Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=VKRhc2W30aJ+XXbCyNQ1ulY913smbN678/b9erLTCZg=; b=la2lOdSnPLt0fO6kwMPhlOkM6XRClrcY/BHDM1xNVb8UX3f80tIaNhKuAXUketKtC6 xFFSzswe8YTPCINvpx0wnVLCYIwQtbDoJFQ2WDK6jQEvQ1g8obwKWGRiDDJ6QZDV4oEm /CYjIVeLH8Ex4VZvsPu0YP5sb5Wm/FYFtL7yPb0myVU7XN7ehDqKxguZOVslmEdQ4kI3 Q3duj+PBHH8CqqxUG3WDRmVV7RLScU/4/hZmJixRfz1pscXJuA02vMnC9T8gUXNyQrzp u0fl/IbKL6vKNB+Xm6s+4LhAovwayJaa/b658ncMxW8oRR2AY0P1oUvhv9ChH4ol93Nl OGqQ== X-Gm-Message-State: AOAM533c8YmgvmMPEulu0wDvi1OuxSytAxGxVrCyBDQbAt1NjNRRwH/5 sEuqn8+YYXGV5EsGcnvisMDbwYdg5Ao= X-Google-Smtp-Source: ABdhPJwddrNiJwRWA7gbWwHjC4Wsxlj0igcrrMuVa5jSNsVjb5YY/gIuu5OqR0d5fzyshMBMmcf917Kvr00= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90a:db0c:: with SMTP id g12mr5043010pjv.233.1640298264139; Thu, 23 Dec 2021 14:24:24 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:14 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: 
<20211223222318.1039223-27-seanjc@google.com>
Mime-Version: 1.0
References: <20211223222318.1039223-1-seanjc@google.com>
X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog
Subject: [PATCH v2 26/30] KVM: x86/mmu: Zap defunct roots via asynchronous worker
From: Sean Christopherson
To: Paolo Bonzini
Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Zap defunct roots, a.k.a. roots that have been invalidated after their last reference was initially dropped, asynchronously via the system work queue instead of forcing the work upon the unfortunate task that happened to drop the last reference.

If a vCPU task drops the last reference, the vCPU is effectively blocked by the host for the entire duration of the zap. If the root being zapped happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging being active, the zap can take several hundred seconds. Unsurprisingly, most guests are unhappy if a vCPU disappears for hundreds of seconds.

E.g. running a synthetic selftest that triggers a vCPU root zap with ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds. Offloading the zap to a worker drops the block time to <100ms.
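As a rough illustration of the hand-off described above, the following single-threaded userspace sketch models only the control flow. It is not the kernel implementation: struct toy_root, struct toy_vm, toy_put_root() and toy_run_worker() are hypothetical names, the "worker" is simply a function the caller invokes later, and all locking, RCU and memory ordering are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

/* Hypothetical stand-ins, not struct kvm_mmu_page / struct kvm. */
struct toy_root {
	int refcount;
	bool invalid;	/* can no longer be picked up by a vCPU */
	bool zapped;	/* all SPTEs torn down */
	bool freed;
};

struct toy_vm {
	int users;	/* stand-in for the VM's own refcount */
	struct toy_root *pending[TOY_MAX_PENDING];
	size_t nr_pending;
};

void toy_put_root(struct toy_vm *vm, struct toy_root *root)
{
	if (--root->refcount)
		return;

	/* Already invalid: the zap was handled earlier, just free it. */
	if (root->invalid) {
		root->freed = true;
		return;
	}

	/* Mark the root defunct and re-take a reference for the zapper. */
	root->invalid = true;
	root->refcount = 1;

	if (vm->users > 0) {
		/* VM is alive: pin it and hand the slow zap to the worker. */
		vm->users++;
		assert(vm->nr_pending < TOY_MAX_PENDING);
		vm->pending[vm->nr_pending++] = root;
	} else {
		/* VM is going away: no worker can be queued, zap here. */
		root->zapped = true;
		toy_put_root(vm, root);
	}
}

/* Stand-in for the work item; a real worker would run in another task. */
void toy_run_worker(struct toy_vm *vm)
{
	while (vm->nr_pending) {
		struct toy_root *root = vm->pending[--vm->nr_pending];

		root->zapped = true;
		toy_put_root(vm, root);	/* second pass frees the root */
		vm->users--;		/* drop the pin taken when queued */
	}
}
```

The point of the sketch is that the task dropping the last reference returns immediately when the VM is still alive, and the expensive zap plus the final put happen later in whatever context runs the queued work.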
Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu_internal.h | 2 + arch/x86/kvm/mmu/tdp_mmu.c | 65 ++++++++++++++++++++++++++++----- 2 files changed, 58 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_interna= l.h index 8ce3d58fdf7f..ac365631e4fe 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -69,6 +69,8 @@ struct kvm_mmu_page { DECLARE_BITMAP(unsync_child_bitmap, 512); struct { bool tdp_mmu_defunct_root; + struct work_struct tdp_mmu_async_work; + void *tdp_mmu_async_data; }; }; =20 diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 2e28f5e4b761..a706328a5658 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -81,6 +81,38 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head= *head) static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root, bool shared); =20 +static void tdp_mmu_zap_root_async(struct work_struct *work) +{ + struct kvm_mmu_page *root =3D container_of(work, struct kvm_mmu_page, + tdp_mmu_async_work); + struct kvm *kvm =3D root->tdp_mmu_async_data; + + read_lock(&kvm->mmu_lock); + + /* + * A TLB flush is not necessary as KVM performs a local TLB flush when + * allocating a new root (see kvm_mmu_load()), and when migrating vCPU + * to a different pCPU. Note, the local TLB flush on reuse also + * invalidates any paging-structure-cache entries, i.e. TLB entries for + * intermediate paging structures, that may be zapped, as such entries + * are associated with the ASID on both VMX and SVM. + */ + tdp_mmu_zap_root(kvm, root, true); + + /* + * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for + * avoiding an infinite loop. By design, the root is reachable while + * it's being asynchronously zapped, thus a different task can put its + * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an + * asynchronously zapped root is unavoidable. 
+ */ + kvm_tdp_mmu_put_root(kvm, root, true); + + read_unlock(&kvm->mmu_lock); + + kvm_put_kvm(kvm); +} + void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root, bool shared) { @@ -143,15 +175,26 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm= _mmu_page *root, refcount_set(&root->tdp_mmu_root_count, 1); =20 /* - * Zap the root, then put the refcount "acquired" above. Recursively - * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an - * infinite loop by freeing invalid roots. By design, the root is - * reachable while it's being zapped, thus a different task can put its - * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a - * defunct root is unavoidable. + * Attempt to acquire a reference to KVM itself. If KVM is alive, then + * zap the root asynchronously in a worker, otherwise it must be zapped + * directly here. Wait to do this check until after the refcount is + * reset so that tdp_mmu_zap_root() can safely yield. + * + * In both flows, zap the root, then put the refcount "acquired" above. + * When putting the reference, use kvm_tdp_mmu_put_root() to test the + * above logic for avoiding an infinite loop by freeing invalid roots. + * By design, the root is reachable while it's being zapped, thus a + * different task can put its last reference, i.e. flowing through + * kvm_tdp_mmu_put_root() for a defunct root is unavoidable. */ - tdp_mmu_zap_root(kvm, root, shared); - kvm_tdp_mmu_put_root(kvm, root, shared); + if (kvm_get_kvm_safe(kvm)) { + root->tdp_mmu_async_data =3D kvm; + INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async); + schedule_work(&root->tdp_mmu_async_work); + } else { + tdp_mmu_zap_root(kvm, root, shared); + kvm_tdp_mmu_put_root(kvm, root, shared); + } } =20 enum tdp_mmu_roots_iter_type { @@ -949,7 +992,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm) =20 /* * Zap all roots, including defunct roots, as all SPTEs must be dropped - * before returning to the caller. 
+ * before returning to the caller. Zap directly even if the root is + * also being zapped by a worker. Walking zapped top-level SPTEs isn't + * all that expensive and mmu_lock is already held, which means the + * worker has yielded, i.e. flushing the work instead of zapping here + * isn't guaranteed to be any faster. * * A TLB flush is unnecessary, KVM zaps everything if and only the VM * is being destroyed or the userspace VMM has exited. In both cases, --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8838CC433FE for ; Thu, 23 Dec 2021 22:28:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245415AbhLWW2b (ORCPT ); Thu, 23 Dec 2021 17:28:31 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39202 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350518AbhLWWZt (ORCPT ); Thu, 23 Dec 2021 17:25:49 -0500 Received: from mail-pf1-x44a.google.com (mail-pf1-x44a.google.com [IPv6:2607:f8b0:4864:20::44a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 472D7C06175C for ; Thu, 23 Dec 2021 14:24:26 -0800 (PST) Received: by mail-pf1-x44a.google.com with SMTP id n18-20020a056a00213200b004baa74aca72so3988443pfj.19 for ; Thu, 23 Dec 2021 14:24:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=NTWd/eLJg+oZzoP/2OC3KeQ8hrwm8MCA9hvEYELSgZQ=; b=BJbFbs6zx3b6NONCJh1BbEu1gCLTZ/pCoIckyh0v7CG8BpEflV8SaWjY3KyGRv6vty JqDIHR8WvJMmsGtNl7EhrpbYh85gLPUKYkLFHjDXbXdh642hy4OrInO4jyTA4Y3BzHi4 MVGEO5q7YTqz6IuTrBJD89czl7SrvD8H6H2jX/8XDs5EGiIL2zFIKOPIsGzx2LLjPqmX 
3HYr7IaDEPZw3zqpXNyA4hpSxFthe4nhZ2Bu5a8lzV8+colY89dLU7xYa5WnKL37rx5P vTDKgXLwN88QSRBxEVcE6jOjRh8nWzO9DPKkKUodkDHNtv5f2/rSPOprahd21ZEqgtkE A9yg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=NTWd/eLJg+oZzoP/2OC3KeQ8hrwm8MCA9hvEYELSgZQ=; b=mnwK67JsuxKbnNaOzQPLnjBsyZAxiFlFNFVxaeRIW7tPYeuS71vFFaTO6ZRC1UkvAk MxdgYRcIHT6hAf1jBP9HddyslXltBOWWUzjS7SAGX5vWruSZ6EM83t8ZTOcCr7hOjyrN lrPFkzzTohaQfWhL85tMbi88mmoDjynHa9uIVzW5ih1TqWyRfXia8mltTQ5z1hihG8ZJ qrQ0eZWRMWXdLWo/Uc825NB3UaVGMxQY5dSQLxYgE0Hbaehf8nkKkOdLfpYCqpo2j7YB shsddFTeQRs9wXGKtME5K4uELw33qyPeMUt3Mq31okT4Uf6XwD838379PJwFDI0PZeih D0+g== X-Gm-Message-State: AOAM532eV8zbE2SQ0tZtQGMREploNF4BG7nX1zniOjGQIwi7P7UUVC/G SR5Mld0MT0c+u4k3wAdW9UulFyMLXm4= X-Google-Smtp-Source: ABdhPJwsc10SoowlAMp1LP+rgARsIGn5+UkTI3VN7Tm8sl6hGezd+4vtPjQ03g92/ZC5uIwUkBnn8V3Mpxo= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90b:343:: with SMTP id fh3mr4768409pjb.35.1640298265813; Thu, 23 Dec 2021 14:24:25 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:15 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-28-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 27/30] KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Move set_memory_region_test's 
KVM_SET_USER_MEMORY_REGION helper to KVM's utils so that it can be used by other tests. Provide a raw version as well as an assert-success version to reduce the amount of boilerplate code needed for basic usage. No functional change intended. Signed-off-by: Sean Christopherson --- .../testing/selftests/kvm/include/kvm_util.h | 5 +++ tools/testing/selftests/kvm/lib/kvm_util.c | 24 +++++++++++++ .../selftests/kvm/set_memory_region_test.c | 35 +++++-------------- 3 files changed, 37 insertions(+), 27 deletions(-) diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing= /selftests/kvm/include/kvm_util.h index 2d62edc49d67..fba971189390 100644 --- a/tools/testing/selftests/kvm/include/kvm_util.h +++ b/tools/testing/selftests/kvm/include/kvm_util.h @@ -128,6 +128,11 @@ void vcpu_dump(FILE *stream, struct kvm_vm *vm, uint32= _t vcpuid, =20 void vm_create_irqchip(struct kvm_vm *vm); =20 +void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t = flags, + uint64_t gpa, uint64_t size, void *hva); +int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t= flags, + uint64_t gpa, uint64_t size, void *hva); + void vm_userspace_mem_region_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type, uint64_t guest_paddr, uint32_t slot, uint64_t npages, diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/sel= ftests/kvm/lib/kvm_util.c index 53d2b5d04b82..45f39dd9b4da 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -779,6 +779,30 @@ static void vm_userspace_mem_region_hva_insert(struct = rb_root *hva_tree, rb_insert_color(&region->hva_node, hva_tree); } =20 + +int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t= flags, + uint64_t gpa, uint64_t size, void *hva) +{ + struct kvm_userspace_memory_region region =3D { + .slot =3D slot, + .flags =3D flags, + .guest_phys_addr =3D gpa, + .memory_size =3D size, + .userspace_addr =3D
(uintptr_t)hva, + }; + + return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION, &region); +} + +void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t = flags, + uint64_t gpa, uint64_t size, void *hva) +{ + int ret =3D __vm_set_user_memory_region(vm, slot, flags, gpa, size, hva); + + TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed, errno =3D %d (%s)", + errno, strerror(errno)); +} + /* * VM Userspace Memory Region Add * diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/t= esting/selftests/kvm/set_memory_region_test.c index 72a1c9b4882c..73bc297dabe6 100644 --- a/tools/testing/selftests/kvm/set_memory_region_test.c +++ b/tools/testing/selftests/kvm/set_memory_region_test.c @@ -329,22 +329,6 @@ static void test_zero_memory_regions(void) } #endif /* __x86_64__ */ =20 -static int test_memory_region_add(struct kvm_vm *vm, void *mem, uint32_t s= lot, - uint32_t size, uint64_t guest_addr) -{ - struct kvm_userspace_memory_region region; - int ret; - - region.slot =3D slot; - region.flags =3D 0; - region.guest_phys_addr =3D guest_addr; - region.memory_size =3D size; - region.userspace_addr =3D (uintptr_t) mem; - ret =3D ioctl(vm_get_fd(vm), KVM_SET_USER_MEMORY_REGION, &region); - - return ret; -} - /* * Test it can be added memory slots up to KVM_CAP_NR_MEMSLOTS, then any * tentative to add further slots should fail.
@@ -382,23 +366,20 @@ static void test_add_max_memory_regions(void) TEST_ASSERT(mem !=3D MAP_FAILED, "Failed to mmap() host"); mem_aligned =3D (void *)(((size_t) mem + alignment - 1) & ~(alignment - 1= )); =20 - for (slot =3D 0; slot < max_mem_slots; slot++) { - ret =3D test_memory_region_add(vm, mem_aligned + - ((uint64_t)slot * MEM_REGION_SIZE), - slot, MEM_REGION_SIZE, - (uint64_t)slot * MEM_REGION_SIZE); - TEST_ASSERT(ret =3D=3D 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n" - " rc: %i errno: %i slot: %i\n", - ret, errno, slot); - } + for (slot =3D 0; slot < max_mem_slots; slot++) + vm_set_user_memory_region(vm, slot, 0, + ((uint64_t)slot * MEM_REGION_SIZE), + MEM_REGION_SIZE, + mem_aligned + (uint64_t)slot * MEM_REGION_SIZE); =20 /* Check it cannot be added memory slots beyond the limit */ mem_extra =3D mmap(NULL, MEM_REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); TEST_ASSERT(mem_extra !=3D MAP_FAILED, "Failed to mmap() host"); =20 - ret =3D test_memory_region_add(vm, mem_extra, max_mem_slots, MEM_REGION_S= IZE, - (uint64_t)max_mem_slots * MEM_REGION_SIZE); + ret =3D __vm_set_user_memory_region(vm, max_mem_slots, 0, + (uint64_t)max_mem_slots * MEM_REGION_SIZE, + MEM_REGION_SIZE, mem_extra); TEST_ASSERT(ret =3D=3D -1 && errno =3D=3D EINVAL, "Adding one more memory slot should fail with EINVAL"); =20 --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 87B01C433FE for ; Thu, 23 Dec 2021 22:27:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350971AbhLWW1C (ORCPT ); Thu, 23 Dec 2021 17:27:02 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39210 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S1350679AbhLWWZt (ORCPT ); Thu, 23 Dec 2021 17:25:49 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DABF0C061D76 for ; Thu, 23 Dec 2021 14:24:27 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id q1-20020a17090a2dc100b001b151c90c5fso4041526pjm.3 for ; Thu, 23 Dec 2021 14:24:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=SnbDD+EzeD9BP5nY71kpd1r6ATELuCUqfA5xKo++7G8=; b=qkJkWHJ3G2s/zKVhQ8uR/fZou/cYZTvXfkiCyMXWcchCHYF9pWHaMYTMFikBkYZMMM CafOg1sGqSPJ1Far8Uj6JGi5FQGZzvgg3xqQBMi2bdfIsPUNCufQ4S9Y2qLWKzbFm99y 6zjdRHwBmIeIwkWM1xT5pJMVc4GCB9l2kb+GCq7NJh7oU1DFaczYHqZFD+amwIv3Ep1g BmpS9v20IOhTgB1sqo2JQaja+VHj1OzwulQhSfmkDoEWCQVQp4vuI4peYJT6DCA+bt75 zLHzGMBoRITuUALu7/sK1HtgwaphUTokJFtglyvqu+pA1OSuaMtUIUUuuN/ucQtMmgTR 7Ubg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=SnbDD+EzeD9BP5nY71kpd1r6ATELuCUqfA5xKo++7G8=; b=aX9Ha6/w0AwfD05iOIh1gculT6rbNmpcv8v3gXh6fOatbaRaSZxxBkSgNbYBLiuNCS +BPrXwzzKkNGckW4M+H9XxDYmHSGrkP8rFRxLwDhGl7v9Q0cR1xq4qkLYRxG1+8MOcff KNU6NTY/iWNUHDtPh3Ki2cLz3stB7E6dIE9KVF3LIhVDGlVcq1qNlMHC7lENVswQV+Sb JiUvCiuSq59zg9+veZRuV5PtSXc34FDxrptRZrAxw649CB/wmiroPkL1H0JC7fAcaIT/ fSd9pShTemoAwnlJ9+UwhFKHyIZYcNjwU3G6Up6eNzkLXubi9TcFFJT9jXxMIe/6vkWh vocA== X-Gm-Message-State: AOAM530JUEGheqyN4pM0bYjKmbgRE3pwvqKHAV2/kV+kHyjYWS/eOVDn 7RyQthmw7X/EIQ0yoSxDJyhEdGX3Jx4= X-Google-Smtp-Source: ABdhPJwOQj/KrL8mxnlbl5L3LG8nyLgGl8HKJ6JDrgETpddqBapjsGehx4x4j5sGHZaH4agS/k5rQTUQM/Y= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:aa7:8210:0:b0:4ba:f05c:8ae9 with SMTP id 
k16-20020aa78210000000b004baf05c8ae9mr4188612pfi.64.1640298267418; Thu, 23 Dec 2021 14:24:27 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:16 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-29-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 28/30] KVM: selftests: Split out helper to allocate guest mem via memfd From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Extract the code for allocating guest memory via memfd out of vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc(). A future selftest to populate a guest with the maximum amount of guest memory will abuse KVM's memslots to alias guest memory regions to a single memfd-backed host region, i.e. needs to back a guest with memfd memory without a 1:1 association between a memslot and a memfd instance. No functional change intended. 
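For readers unfamiliar with why a memfd makes the aliasing described above possible, here is a minimal userspace sketch, not part of the patch. It creates one memfd-backed region and maps it twice, showing that both views are backed by the same pages, which is the property the future selftest relies on when it points many memslots at one host region. The names demo_memfd_alloc() and demo_views_alias() are hypothetical, assert() stands in for TEST_ASSERT(), and the fallocate() hole punch and MFD_HUGETLB handling of the real helper are omitted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for kvm_memfd_alloc(): memfd of 'size' bytes. */
static int demo_memfd_alloc(size_t size)
{
	int fd, r;

	fd = memfd_create("alias_demo", MFD_CLOEXEC);
	assert(fd != -1);

	r = ftruncate(fd, (off_t)size);
	assert(!r);

	return fd;
}

/*
 * Map the same memfd at two different host addresses and check that a
 * store through one view is visible through the other.  Returns 1 if the
 * views alias the same backing pages, 0 otherwise.
 */
static int demo_views_alias(size_t size)
{
	int fd = demo_memfd_alloc(size);
	volatile uint8_t *a, *b;
	int aliased;

	a = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	assert(a != MAP_FAILED && b != MAP_FAILED);

	a[0] = 0xaa;
	aliased = (a != b) && (b[0] == 0xaa);

	munmap((void *)a, size);
	munmap((void *)b, size);
	close(fd);

	return aliased;
}
```

Calling demo_views_alias() with one page (e.g. 4096 bytes) is enough to see the effect; MAP_SHARED is what makes both mappings observe the same data.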
Signed-off-by: Sean Christopherson --- .../testing/selftests/kvm/include/kvm_util.h | 1 + tools/testing/selftests/kvm/lib/kvm_util.c | 42 +++++++++++-------- 2 files changed, 25 insertions(+), 18 deletions(-) diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing= /selftests/kvm/include/kvm_util.h index fba971189390..ee4a7fafb442 100644 --- a/tools/testing/selftests/kvm/include/kvm_util.h +++ b/tools/testing/selftests/kvm/include/kvm_util.h @@ -104,6 +104,7 @@ int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, co= nst vm_vaddr_t gva, size_t len); =20 void kvm_vm_elf_load(struct kvm_vm *vm, const char *filename); +int kvm_memfd_alloc(size_t size, bool hugepages); =20 void vm_dump(FILE *stream, struct kvm_vm *vm, uint8_t indent); =20 diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/sel= ftests/kvm/lib/kvm_util.c index 45f39dd9b4da..97514ece798f 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -658,6 +658,27 @@ void kvm_vm_free(struct kvm_vm *vmp) free(vmp); } =20 +int kvm_memfd_alloc(size_t size, bool hugepages) +{ + int memfd_flags =3D MFD_CLOEXEC; + int fd, r; + + if (hugepages) + memfd_flags |=3D MFD_HUGETLB; + + fd =3D memfd_create("kvm_selftest", memfd_flags); + TEST_ASSERT(fd !=3D -1, "memfd_create() failed, errno: %i (%s)", + errno, strerror(errno)); + + r =3D ftruncate(fd, size); + TEST_ASSERT(!r, "ftruncate() failed, errno: %i (%s)", errno, strerror(err= no)); + + r =3D fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size); + TEST_ASSERT(!r, "fallocate() failed, errno: %i (%s)", errno, strerror(err= no)); + + return fd; +} + /* * Memory Compare, host virtual to guest virtual * @@ -910,24 +931,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm, region->mmap_size +=3D alignment; =20 region->fd =3D -1; - if (backing_src_is_shared(src_type)) { - int memfd_flags =3D MFD_CLOEXEC; - - if (src_type =3D=3D VM_MEM_SRC_SHARED_HUGETLB) - memfd_flags 
|=3D MFD_HUGETLB; - - region->fd =3D memfd_create("kvm_selftest", memfd_flags); - TEST_ASSERT(region->fd !=3D -1, - "memfd_create failed, errno: %i", errno); - - ret =3D ftruncate(region->fd, region->mmap_size); - TEST_ASSERT(ret =3D=3D 0, "ftruncate failed, errno: %i", errno); - - ret =3D fallocate(region->fd, - FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, - region->mmap_size); - TEST_ASSERT(ret =3D=3D 0, "fallocate failed, errno: %i", errno); - } + if (backing_src_is_shared(src_type)) + region->fd =3D kvm_memfd_alloc(region->mmap_size, + src_type =3D=3D VM_MEM_SRC_SHARED_HUGETLB); =20 region->mmap_start =3D mmap(NULL, region->mmap_size, PROT_READ | PROT_WRITE, --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1A4DBC433F5 for ; Thu, 23 Dec 2021 22:27:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1350994AbhLWW1J (ORCPT ); Thu, 23 Dec 2021 17:27:09 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38756 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350513AbhLWWZt (ORCPT ); Thu, 23 Dec 2021 17:25:49 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 852B0C0698C3 for ; Thu, 23 Dec 2021 14:24:29 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id p2-20020a17090a2c4200b001b1866beecbso6392932pjm.5 for ; Thu, 23 Dec 2021 14:24:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=Yr+H8LBjl0djQfSMzX2wORKjmAdREQqaSzyX3BhenYw=; 
b=XmQFD7vyhq5p8trN7cLD10q9mdheW4EOLvr7SHz8EpgoyH6oda/tiRX7ZBVlJwAHwm pxx0c47eP51bpRUXyCWELedYY8s3KrRuvX/xhBiZqPIDn05GKwYbUTdvZxtQSRJNcxLN kK77GUtf7L3Chh6OYUxSTJBxhHHfyiumxfeqDNePqdI5C0bFRb7f15c2McJzFQ0Fp95K 3JKbIWhgctO43HlTl2C0ZLUvmH5HJoTdyLkO9imR/gh6Mryu33xmQGLIzCenALuJ2lta kskFun16QDJJZ4g8xCW7JLFuo7WuGx2WS5vm87C3T5Hf5RgmmCuvx702nSwCrdp5oCix P32g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=Yr+H8LBjl0djQfSMzX2wORKjmAdREQqaSzyX3BhenYw=; b=vJi6tLgdNeG+ra8y9livYflze8KeqitFDdOgoVxYqRPcPivuPtF2obq9TlCcEJ8KKt CwrvfVAWd/8eaM5dPRkv8Ymsiz/vJDk2bQnhETol54gwp/zzpyvi90jZQOizlo0dpcoA mTgJ3HvvUzsrq+XdG7/xKc4SP+oAG1ZSuoU6x0+6uoP8j2nLlVx3CXmkcIpPlBZJWRui RgfOQKlg1jVH3ybL3s8LHL5uBb4v5oX1BCTpnBWtba1JOvu3OGBaHqk1ZJn7ESiDeu5T FgVzUPpENsXLTVGUKoT+Bv72eBpH9ZeNsyV+ovlLCaRbR600qcPukGEJy2xf7snWd1Wq IWpg== X-Gm-Message-State: AOAM533+Swo48mkWdA/QDWuAxijgd/Np3WXw6dpPpVQ04LQ/RGWwN/Gj +GuZmweAjckFnVRTvipYKZB+65lvdFI= X-Google-Smtp-Source: ABdhPJwUiLkdbtaG9+zrvhGoVsZmSI+nH0Nx0az82/C5LbcyzQYJaBgH11x2pnQDooEF32SEPCtIuz686jQ= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:90b:4a81:: with SMTP id lp1mr4969813pjb.19.1640298269028; Thu, 23 Dec 2021 14:24:29 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:17 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-30-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 29/30] KVM: selftests: Define cpu_relax() helpers for s390 and x86 From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang 
Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add cpu_relax() for s390 and x86 for use in arch-agnostic tests. arm64 already defines its own version. Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/include/s390x/processor.h | 8 ++++++++ tools/testing/selftests/kvm/include/x86_64/processor.h | 5 +++++ 2 files changed, 13 insertions(+) diff --git a/tools/testing/selftests/kvm/include/s390x/processor.h b/tools/= testing/selftests/kvm/include/s390x/processor.h index e0e96a5f608c..255c9b990f4c 100644 --- a/tools/testing/selftests/kvm/include/s390x/processor.h +++ b/tools/testing/selftests/kvm/include/s390x/processor.h @@ -5,6 +5,8 @@ #ifndef SELFTEST_KVM_PROCESSOR_H #define SELFTEST_KVM_PROCESSOR_H =20 +#include + /* Bits in the region/segment table entry */ #define REGION_ENTRY_ORIGIN ~0xfffUL /* region/segment table origin */ #define REGION_ENTRY_PROTECT 0x200 /* region protection bit */ @@ -19,4 +21,10 @@ #define PAGE_PROTECT 0x200 /* HW read-only bit */ #define PAGE_NOEXEC 0x100 /* HW no-execute bit */ =20 +/* Is there a portable way to do this? 
*/ +static inline void cpu_relax(void) +{ + barrier(); +} + #endif diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools= /testing/selftests/kvm/include/x86_64/processor.h index 05e65ca1c30c..224574ee9967 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -346,6 +346,11 @@ static inline unsigned long get_xmm(int n) return 0; } =20 +static inline void cpu_relax(void) +{ + asm volatile("rep; nop" ::: "memory"); +} + bool is_intel_cpu(void); =20 struct kvm_x86_state; --=20 2.34.1.448.ga2b2bfdf31-goog From nobody Wed Jul 1 09:55:47 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 29AF4C433EF for ; Thu, 23 Dec 2021 22:27:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241109AbhLWW1F (ORCPT ); Thu, 23 Dec 2021 17:27:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38878 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1350694AbhLWWZu (ORCPT ); Thu, 23 Dec 2021 17:25:50 -0500 Received: from mail-pg1-x54a.google.com (mail-pg1-x54a.google.com [IPv6:2607:f8b0:4864:20::54a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 40F53C0698CC for ; Thu, 23 Dec 2021 14:24:31 -0800 (PST) Received: by mail-pg1-x54a.google.com with SMTP id m15-20020a638c0f000000b0034056b46a05so3870027pgd.15 for ; Thu, 23 Dec 2021 14:24:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=reply-to:date:in-reply-to:message-id:mime-version:references :subject:from:to:cc; bh=bh9JWouoX1FI0l45cZfZSX2igrao9BXu99qa2pMBV/0=; b=U4Bm6CFaMoCHMmhEC8jA5D4g+EGS00FgylbV3W513R+SjiVionlfBd6Q+lrCMRuYk9 Lzgh4/4Qd5qi12xMNv53hKKyr3aBKDjoXDZvhQKIPWuSKLZYgEMrPYTo5/SV2ZcS/895 
qBHrGZxTcFNR9k78O8XimfjsC0NDZO/ABfinUg+oWtFDNztTVyzYb24INryRqYbrQ1o5 qvdeCl/q0z6FPtgVXiw593TUydNdXodLy61YF2Majq4eBniLcWLShlmXtVonOnTLr/Il 8lZD8stxeWwPBhcCePTzl+yYaiQ82W5WhnMTvDWgcwHVXlZCXpDpEK3t9wevFDaK5EfN 25sg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:reply-to:date:in-reply-to:message-id :mime-version:references:subject:from:to:cc; bh=bh9JWouoX1FI0l45cZfZSX2igrao9BXu99qa2pMBV/0=; b=6UH27zU1fF4UAARleG7hS75bnYf+LB1/xpKWf4jAUjx7IQrjfywdrUUgI/OAoax38G 9JnGYE3E5LiHHUJ+YY6U5/EBwSeLTeqm0Vek/9Cx6tYJP1GQkovtfwgvH5e2KP4B4nWU VWZh4dmTtHge9e81r5WFSbgP20rOUM/9MpU3pa3eZKRrbzNFGKP915yStu46TxLL/VAM B6nGs77JcP4quGLfgT5RZWDTIM+nLe952Mpp3/VOsVUIm4YMPRuAm7AnbbXkfepxFACf hbPxkcbNgjsjbpQuSvpbtUUGyLA8bnSxYrFAguEMlTfD7Jtpbk3rshSV/DzEeEg5Sr44 gTsQ== X-Gm-Message-State: AOAM5310MLhLRvthFFBUKXakUQWzf5quNtYybrcBv+JNpA4ZyZSTIgKD Pgl8ZGiYJGwcIvPRrouEJOtEgcoHYus= X-Google-Smtp-Source: ABdhPJwTBJy3DC88ncs30nk0eTkEcjJQSVUzWaAQff+PDs1j/Jecyi/RfKnWQeGIcgi7nAu2Ty7wrRbeLsE= X-Received: from seanjc.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:3e5]) (user=seanjc job=sendgmr) by 2002:a17:902:e884:b0:148:b91b:d7e2 with SMTP id w4-20020a170902e88400b00148b91bd7e2mr3980010plg.87.1640298270789; Thu, 23 Dec 2021 14:24:30 -0800 (PST) Reply-To: Sean Christopherson Date: Thu, 23 Dec 2021 22:23:18 +0000 In-Reply-To: <20211223222318.1039223-1-seanjc@google.com> Message-Id: <20211223222318.1039223-31-seanjc@google.com> Mime-Version: 1.0 References: <20211223222318.1039223-1-seanjc@google.com> X-Mailer: git-send-email 2.34.1.448.ga2b2bfdf31-goog Subject: [PATCH v2 30/30] KVM: selftests: Add test to populate a VM with the max possible guest mem From: Sean Christopherson To: Paolo Bonzini Cc: Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon , David Matlack , Mingwei Zhang Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 
Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

Add a selftest that enables populating a VM with the maximum amount of guest memory allowed by the underlying architecture. Abuse KVM's memslots by mapping a single host memory region into multiple memslots so that the selftest doesn't require a system with terabytes of RAM.

Default to 512gb of guest memory, which isn't all that interesting, but should work on all MMUs and doesn't take an exorbitant amount of memory or time. E.g. testing with ~64tb of guest memory takes the better part of an hour, and requires 200gb of memory for KVM's page tables when using 4kb pages.

To inflict maximum abuse on KVM's MMU, default to 4kb pages (or whatever the not-hugepage size is) in the backing store (memfd). Use memfd for the host backing store to ensure that hugepages are guaranteed when requested, and to give the user explicit control of the size of hugepage being tested.

By default, spin up as many vCPUs as there are CPUs available to the selftest, and distribute the work of dirtying each 4kb chunk of memory across all vCPUs. Dirtying guest memory forces KVM to populate its page tables, and also forces KVM to write back accessed/dirty information to struct page when the guest memory is freed.

On x86, perform two passes with an MMU context reset between the two passes to coerce KVM into dropping all references to the MMU root, e.g. to emulate a vCPU dropping the last reference. Perform both passes and all rendezvous on all architectures in the hope that arm64 and s390x can gain similar shenanigans in the future.

Measure and report the duration of each operation, which is helpful not only to verify the test is working as intended, but also to easily evaluate the performance differences between different page sizes.

Provide command line options to limit the amount of guest memory, set the size of each slot (i.e. of the host memory region), set the number of vCPUs, and to enable usage of hugepages.
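As an illustration of how the dirtying work is distributed across vCPUs, the following sketch reproduces the arithmetic spawn_workers() uses in the patch: the GPA range is divided evenly and each share is rounded down to a page boundary. The helper names bytes_per_vcpu() and vcpu_start_gpa() are hypothetical and exist only for this example; assert() stands in for TEST_ASSERT().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Per-vCPU share of [start_gpa, end_gpa), rounded down to a multiple of
 * page_size.  page_size is assumed to be a power of two.
 */
static uint64_t bytes_per_vcpu(uint64_t start_gpa, uint64_t end_gpa,
			       unsigned int nr_vcpus, uint64_t page_size)
{
	assert(nr_vcpus > 0);
	assert(page_size && !(page_size & (page_size - 1)));

	return ((end_gpa - start_gpa) / nr_vcpus) & ~(page_size - 1);
}

/* First GPA handed to vCPU 'i'; its range ends nr_bytes later. */
static uint64_t vcpu_start_gpa(uint64_t start_gpa, uint64_t nr_bytes,
			       unsigned int i)
{
	return start_gpa + (uint64_t)i * nr_bytes;
}
```

Because of the round-down, any tail of the range smaller than nr_vcpus pages is simply not assigned to a vCPU, and a result of zero means there are more vCPUs than pages to hand out, which is the condition the patch asserts against.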
Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/kvm/Makefile | 3 + .../selftests/kvm/max_guest_memory_test.c | 292 ++++++++++++++++++ 3 files changed, 296 insertions(+) create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftes= ts/kvm/.gitignore index 3cb5ac5da087..ffb4da5b9d03 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -52,6 +52,7 @@ /hardware_disable_test /kvm_create_max_vcpus /kvm_page_table_test +/max_guest_memory_test /memslot_modification_stress_test /memslot_perf_test /rseq_test diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests= /kvm/Makefile index 17342b575e85..63640f59e96b 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -83,6 +83,7 @@ TEST_GEN_PROGS_x86_64 +=3D dirty_log_perf_test TEST_GEN_PROGS_x86_64 +=3D hardware_disable_test TEST_GEN_PROGS_x86_64 +=3D kvm_create_max_vcpus TEST_GEN_PROGS_x86_64 +=3D kvm_page_table_test +TEST_GEN_PROGS_x86_64 +=3D max_guest_memory_test TEST_GEN_PROGS_x86_64 +=3D memslot_modification_stress_test TEST_GEN_PROGS_x86_64 +=3D memslot_perf_test TEST_GEN_PROGS_x86_64 +=3D rseq_test @@ -101,6 +102,7 @@ TEST_GEN_PROGS_aarch64 +=3D dirty_log_test TEST_GEN_PROGS_aarch64 +=3D dirty_log_perf_test TEST_GEN_PROGS_aarch64 +=3D kvm_create_max_vcpus TEST_GEN_PROGS_aarch64 +=3D kvm_page_table_test +TEST_GEN_PROGS_aarch64 +=3D max_guest_memory_test TEST_GEN_PROGS_aarch64 +=3D memslot_modification_stress_test TEST_GEN_PROGS_aarch64 +=3D memslot_perf_test TEST_GEN_PROGS_aarch64 +=3D rseq_test @@ -115,6 +117,7 @@ TEST_GEN_PROGS_s390x +=3D demand_paging_test TEST_GEN_PROGS_s390x +=3D dirty_log_test TEST_GEN_PROGS_s390x +=3D kvm_create_max_vcpus TEST_GEN_PROGS_s390x +=3D kvm_page_table_test +TEST_GEN_PROGS_s390x +=3D max_guest_memory_test TEST_GEN_PROGS_s390x +=3D 
rseq_test TEST_GEN_PROGS_s390x +=3D set_memory_region_test TEST_GEN_PROGS_s390x +=3D kvm_binary_stats_test diff --git a/tools/testing/selftests/kvm/max_guest_memory_test.c b/tools/te= sting/selftests/kvm/max_guest_memory_test.c new file mode 100644 index 000000000000..360c88288295 --- /dev/null +++ b/tools/testing/selftests/kvm/max_guest_memory_test.c @@ -0,0 +1,292 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "kvm_util.h" +#include "test_util.h" +#include "guest_modes.h" +#include "processor.h" + +static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stri= de) +{ + uint64_t gpa; + + for (gpa =3D start_gpa; gpa < end_gpa; gpa +=3D stride) + *((volatile uint64_t *)gpa) =3D gpa; + + GUEST_DONE(); +} + +struct vcpu_info { + struct kvm_vm *vm; + uint32_t id; + uint64_t start_gpa; + uint64_t end_gpa; +}; + +static int nr_vcpus; +static atomic_t rendezvous; + +static void rendezvous_with_boss(void) +{ + int orig =3D atomic_read(&rendezvous); + + if (orig > 0) { + atomic_dec_and_test(&rendezvous); + while (atomic_read(&rendezvous) > 0) + cpu_relax(); + } else { + atomic_inc(&rendezvous); + while (atomic_read(&rendezvous) < 0) + cpu_relax(); + } +} + +static void run_vcpu(struct kvm_vm *vm, uint32_t vcpu_id) +{ + vcpu_run(vm, vcpu_id); + ASSERT_EQ(get_ucall(vm, vcpu_id, NULL), UCALL_DONE); +} + +static void *vcpu_worker(void *data) +{ + struct vcpu_info *vcpu =3D data; + struct kvm_vm *vm =3D vcpu->vm; + struct kvm_sregs sregs; + struct kvm_regs regs; + + vcpu_args_set(vm, vcpu->id, 3, vcpu->start_gpa, vcpu->end_gpa, + vm_get_page_size(vm)); + + /* Snapshot regs before the first run. 
*/ + vcpu_regs_get(vm, vcpu->id, &regs); + rendezvous_with_boss(); + + run_vcpu(vm, vcpu->id); + rendezvous_with_boss(); + vcpu_regs_set(vm, vcpu->id, &regs); + vcpu_sregs_get(vm, vcpu->id, &sregs); +#ifdef __x86_64__ + /* Toggle CR0.WP to trigger a MMU context reset. */ + sregs.cr0 ^=3D X86_CR0_WP; +#endif + vcpu_sregs_set(vm, vcpu->id, &sregs); + rendezvous_with_boss(); + + run_vcpu(vm, vcpu->id); + rendezvous_with_boss(); + + return NULL; +} + +static pthread_t *spawn_workers(struct kvm_vm *vm, uint64_t start_gpa, + uint64_t end_gpa) +{ + struct vcpu_info *info; + uint64_t gpa, nr_bytes; + pthread_t *threads; + int i; + + threads =3D malloc(nr_vcpus * sizeof(*threads)); + TEST_ASSERT(threads, "Failed to allocate vCPU threads"); + + info =3D malloc(nr_vcpus * sizeof(*info)); + TEST_ASSERT(info, "Failed to allocate vCPU gpa ranges"); + + nr_bytes =3D ((end_gpa - start_gpa) / nr_vcpus) & + ~((uint64_t)vm_get_page_size(vm) - 1); + TEST_ASSERT(nr_bytes, "C'mon, no way you have %d CPUs", nr_vcpus); + + for (i =3D 0, gpa =3D start_gpa; i < nr_vcpus; i++, gpa +=3D nr_bytes) { + info[i].vm =3D vm; + info[i].id =3D i; + info[i].start_gpa =3D gpa; + info[i].end_gpa =3D gpa + nr_bytes; + pthread_create(&threads[i], NULL, vcpu_worker, &info[i]); + } + return threads; +} + +static void rendezvous_with_vcpus(struct timespec *time, const char *name) +{ + int i, rendezvoused; + + pr_info("Waiting for vCPUs to finish %s...\n", name); + + rendezvoused =3D atomic_read(&rendezvous); + for (i =3D 0; abs(rendezvoused) !=3D 1; i++) { + usleep(100); + if (!(i & 0x3f)) + pr_info("\r%d vCPUs haven't rendezvoused...", + abs(rendezvoused) - 1); + rendezvoused =3D atomic_read(&rendezvous); + } + + clock_gettime(CLOCK_MONOTONIC, time); + + /* Release the vCPUs after getting the time of the previous action.
*/ + pr_info("\rAll vCPUs finished %s, releasing...\n", name); + if (rendezvoused > 0) + atomic_set(&rendezvous, -nr_vcpus - 1); + else + atomic_set(&rendezvous, nr_vcpus + 1); +} + +static void calc_default_nr_vcpus(void) +{ + cpu_set_t possible_mask; + int r; + + r =3D sched_getaffinity(0, sizeof(possible_mask), &possible_mask); + TEST_ASSERT(!r, "sched_getaffinity failed, errno =3D %d (%s)", + errno, strerror(errno)); + + nr_vcpus =3D CPU_COUNT(&possible_mask); + TEST_ASSERT(nr_vcpus > 0, "Uh, no CPUs?"); +} + +int main(int argc, char *argv[]) +{ + /* + * Skip the first 4gb and slot0. slot0 maps <1gb and is used to back + * the guest's code, stack, and page tables. Because selftests creates + * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot + * just below the 4gb boundary. This test could create memory at + * 1gb-3gb, but it's simpler to skip straight to 4gb. + */ + const uint64_t size_1gb =3D (1 << 30); + const uint64_t start_gpa =3D (4ull * size_1gb); + const int first_slot =3D 1; + + struct timespec time_start, time_run1, time_reset, time_run2; + uint64_t max_gpa, gpa, slot_size, max_mem, i; + int max_slots, slot, opt, fd; + bool hugepages =3D false; + pthread_t *threads; + struct kvm_vm *vm; + void *mem; + + /* + * Default to 2gb so that maxing out systems with MAXPHYADDR=3D46, which + * are quite common for x86, requires changing only max_mem (KVM allows + * 32k memslots, 32k * 2gb =3D=3D ~64tb of guest memory). + */ + slot_size =3D 2 * size_1gb; + + max_slots =3D kvm_check_cap(KVM_CAP_NR_MEMSLOTS); + TEST_ASSERT(max_slots > first_slot, "KVM is broken"); + + /* All KVM MMUs should be able to survive a 512gb guest.
*/ + max_mem =3D 512 * size_1gb; + + calc_default_nr_vcpus(); + + while ((opt =3D getopt(argc, argv, "c:h:m:s:u")) !=3D -1) { + switch (opt) { + case 'c': + nr_vcpus =3D atoi(optarg); + TEST_ASSERT(nr_vcpus, "#DE"); + break; + case 'm': + max_mem =3D atoi(optarg) * size_1gb; + TEST_ASSERT(max_mem, "#DE"); + break; + case 's': + slot_size =3D atoi(optarg) * size_1gb; + TEST_ASSERT(slot_size, "#DE"); + break; + case 'u': + hugepages =3D true; + break; + case 'h': + default: + printf("usage: %s [-c nr_vcpus] [-m max_mem_in_gb] [-s slot_size_in_gb]= [-u [huge_page_size]]\n", argv[0]); + exit(1); + } + } + + vm =3D vm_create_default_with_vcpus(nr_vcpus, 0, 0, guest_code, NULL); + + max_gpa =3D vm_get_max_gfn(vm) << vm_get_page_shift(vm); + TEST_ASSERT(max_gpa > (4 * slot_size), "MAXPHYADDR <4gb "); + + fd =3D kvm_memfd_alloc(slot_size, hugepages); + mem =3D mmap(NULL, slot_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + TEST_ASSERT(mem !=3D MAP_FAILED, "mmap() failed"); + + TEST_ASSERT(!madvise(mem, slot_size, MADV_NOHUGEPAGE), "madvise() failed"= ); + + /* Pre-fault the memory to avoid taking mmap_sem on guest page faults. */ + for (i =3D 0; i < slot_size; i +=3D vm_get_page_size(vm)) + ((uint8_t *)mem)[i] =3D 0xaa; + + gpa =3D 0; + for (slot =3D first_slot; slot < max_slots; slot++) { + gpa =3D start_gpa + ((slot - first_slot) * slot_size); + if (gpa + slot_size > max_gpa) + break; + + if ((gpa - start_gpa) >=3D max_mem) + break; + + vm_set_user_memory_region(vm, slot, 0, gpa, slot_size, mem); + +#ifdef __x86_64__ + /* Identity map memory in the guest using 1gb pages. 
*/ + for (i =3D 0; i < slot_size; i +=3D size_1gb) + __virt_pg_map(vm, gpa + i, gpa + i, X86_PAGE_SIZE_1G); +#else + for (i =3D 0; i < slot_size; i +=3D vm_get_page_size(vm)) + virt_pg_map(vm, gpa + i, gpa + i); +#endif + } + + atomic_set(&rendezvous, nr_vcpus + 1); + threads =3D spawn_workers(vm, start_gpa, gpa); + + pr_info("Running with %lugb of guest memory and %u vCPUs\n", + (gpa - start_gpa) / size_1gb, nr_vcpus); + + rendezvous_with_vcpus(&time_start, "spawning"); + rendezvous_with_vcpus(&time_run1, "run 1"); + rendezvous_with_vcpus(&time_reset, "reset"); + rendezvous_with_vcpus(&time_run2, "run 2"); + + time_run2 =3D timespec_sub(time_run2, time_reset); + time_reset =3D timespec_sub(time_reset, time_run1); + time_run1 =3D timespec_sub(time_run1, time_start); + + pr_info("run1 =3D %ld.%.9lds, reset =3D %ld.%.9lds, run2 =3D %ld.%.9lds\= n", + time_run1.tv_sec, time_run1.tv_nsec, + time_reset.tv_sec, time_reset.tv_nsec, + time_run2.tv_sec, time_run2.tv_nsec); + + /* + * Delete even numbered slots (arbitrary) and unmap the first half of + * the backing (also arbitrary) to verify KVM correctly drops all + * references to the removed regions. + */ + for (slot =3D (slot - 1) & ~1ull; slot >=3D first_slot; slot -=3D 2) + vm_set_user_memory_region(vm, slot, 0, 0, 0, NULL); + + munmap(mem, slot_size / 2); + + /* Sanity check that the vCPUs actually ran. */ + for (i =3D 0; i < nr_vcpus; i++) + pthread_join(threads[i], NULL); + + /* + * Deliberately exit without deleting the remaining memslots or closing + * kvm_fd to test cleanup via mmu_notifier.release. + */ +} --=20 2.34.1.448.ga2b2bfdf31-goog
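The memslot loop in main() above stops at whichever limit is hit first: the number of memslots KVM supports, the guest's maximum physical address, or the requested amount of memory. The sketch below restates those bounds as a standalone function so the resulting slot count can be checked by hand. count_memslots() is a hypothetical helper, not part of the patch, and the values passed to it in the note that follows are illustrative assumptions rather than what any particular host reports.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ull << 30)

/*
 * Count how many slot_size-sized memslots the loop would create, using the
 * same break conditions as main(): stop when a slot would extend past
 * max_gpa, or when max_mem bytes have already been covered.
 */
static uint64_t count_memslots(uint64_t start_gpa, uint64_t slot_size,
			       uint64_t max_mem, uint64_t max_gpa,
			       uint64_t first_slot, uint64_t max_slots)
{
	uint64_t slot, gpa, nr_slots = 0;

	assert(slot_size > 0);

	for (slot = first_slot; slot < max_slots; slot++) {
		gpa = start_gpa + (slot - first_slot) * slot_size;
		if (gpa + slot_size > max_gpa)
			break;
		if (gpa - start_gpa >= max_mem)
			break;
		nr_slots++;
	}
	return nr_slots;
}
```

With the defaults from the patch (start at 4gb, 2gb slots, 512gb of memory) and an assumed 46-bit physical address space with 32768 slots, this yields 256 memslots, all aliasing the same 2gb host region.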