From nobody Mon May 11 00:45:15 2026
Reply-To: Sean Christopherson
Date: Wed, 20 Apr 2022 00:48:58 +0000
In-Reply-To: <20220420004859.3298837-1-seanjc@google.com>
Message-Id: <20220420004859.3298837-2-seanjc@google.com>
Mime-Version: 1.0
References: <20220420004859.3298837-1-seanjc@google.com>
X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog
Subject: [PATCH 1/2] KVM: Fix race between mmu_notifier invalidation and pfncache refresh
From: Sean Christopherson
To: Paolo Bonzini
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, David Woodhouse,
    Mingwei Zhang, Sean Christopherson
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

Key off of mn_active_invalidate_count to detect that a pfncache refresh
needs to wait for an in-progress mmu_notifier invalidation.  Using the
common "retry" approach is broken as the invalidation occurs _before_
mmu_notifier_count is elevated and before mmu_notifier_range_start is
set/updated.

The cache refresh handles the race where an invalidation occurs between
dropping and re-acquiring gpc->lock, but it doesn't handle the scenario
where the cache is refreshed after the cache was invalidated by the
notifier, but before the notifier elevates mmu_notifier_count.
  CPU0                                    CPU1
  ----                                    ----

  gfn_to_pfn_cache_invalidate_start()
  |
  -> gpc->valid = false;
                                          kvm_gfn_to_pfn_cache_refresh()
                                          |
                                          |-> gpc->valid = true;

                                          hva_to_pfn_retry()
                                          |
                                          -> acquire kvm->mmu_lock
                                             kvm->mmu_notifier_count == 0
                                             mmu_seq == kvm->mmu_notifier_seq
                                             drop kvm->mmu_lock
                                             return pfn 'X'
  acquire kvm->mmu_lock
  kvm_inc_notifier_count()
  drop kvm->mmu_lock()
  kernel frees pfn 'X'
                                          kvm_gfn_to_pfn_cache_check()
                                          |
                                          |-> gpc->valid == true

                                          caller accesses freed pfn 'X'

While mn_active_invalidate_count is not guaranteed to be stable, it is
guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
so either the refresh will see an active invalidation and wait, or the
invalidation will run after the refresh completes.

Waiting for in-progress invalidations to complete also obviates the need
to mark gpc->valid before calling hva_to_pfn_retry(), which is in itself
flawed as a concurrent kvm_gfn_to_pfn_cache_check() would see a "valid"
cache with garbage pfn/khva values.  That issue will be remedied in a
future commit.

Opportunistically change the do-while(1) to a for(;;) to make it easier
to see that it loops until a break condition is encountered; the size of
the loop body makes the while(1) rather difficult to see.

Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Cc: David Woodhouse
Cc: Mingwei Zhang
Signed-off-by: Sean Christopherson
---
 virt/kvm/kvm_main.c |  9 ++++++
 virt/kvm/pfncache.c | 70 +++++++++++++++++++++++++++++++--------------
 2 files changed, 58 insertions(+), 21 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfb7dabdbc63..0848430f36c6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -705,6 +705,15 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	kvm->mn_active_invalidate_count++;
 	spin_unlock(&kvm->mn_invalidate_lock);
 
+	/*
+	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
+	 * before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
+	 * each cache's lock.  There are relatively few caches in existence at
+	 * any given time, and the caches themselves can check for hva overlap,
+	 * i.e. don't need to rely on memslot overlap checks for performance.
+	 * Because this runs without holding mmu_lock, the pfn caches must use
+	 * mn_active_invalidate_count (see above) instead of mmu_notifier_count.
+	 */
 	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
 					  hva_range.may_block);
 
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index dd84676615f1..71c84a43024c 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -112,29 +112,63 @@ static void __release_gpc(struct kvm *kvm, kvm_pfn_t pfn, void *khva, gpa_t gpa)
 	}
 }
 
-static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva)
+static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
 {
+	bool first_attempt = true;
 	unsigned long mmu_seq;
 	kvm_pfn_t new_pfn;
-	int retry;
 
-	do {
+	lockdep_assert_held_write(&gpc->lock);
+
+	for (;;) {
 		mmu_seq = kvm->mmu_notifier_seq;
 		smp_rmb();
 
+		write_unlock_irq(&gpc->lock);
+
+		/* Opportunistically check for resched while the lock isn't held. */
+		if (!first_attempt)
+			cond_resched();
+
 		/* We always request a writeable mapping */
-		new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
+		new_pfn = hva_to_pfn(gpc->uhva, false, NULL, true, NULL);
+
+		write_lock_irq(&gpc->lock);
+
 		if (is_error_noslot_pfn(new_pfn))
 			break;
 
-		KVM_MMU_READ_LOCK(kvm);
-		retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
-		KVM_MMU_READ_UNLOCK(kvm);
-		if (!retry)
+		first_attempt = false;
+
+		/*
+		 * Wait for mn_active_invalidate_count, not mmu_notifier_count,
+		 * to go away, as the invalidation in the mmu_notifier event
+		 * occurs _before_ mmu_notifier_count is elevated.
+		 *
+		 * Note, mn_active_invalidate_count can change at any time as
+		 * it's not protected by gpc->lock.  But, it is guaranteed to
+		 * be elevated before the mmu_notifier acquires gpc->lock, and
+		 * isn't dropped until after mmu_notifier_seq is updated.  So,
+		 * this task may get a false positive of sorts, i.e. see an
+		 * elevated count and wait even though it's technically safe to
+		 * proceed (because the mmu_notifier will invalidate the cache
+		 * _after_ it's refreshed here), but the cache will never be
+		 * refreshed with stale data, i.e. won't get false negatives.
+		 */
+		if (kvm->mn_active_invalidate_count)
+			continue;
+
+		/*
+		 * Ensure mn_active_invalidate_count is read before
+		 * mmu_notifier_seq.  This pairs with the smp_wmb() in
+		 * mmu_notifier_invalidate_range_end() to guarantee either the
+		 * old (non-zero) value of mn_active_invalidate_count or the
+		 * new (incremented) value of mmu_notifier_seq is observed.
+		 */
+		smp_rmb();
+		if (kvm->mmu_notifier_seq == mmu_seq)
 			break;
-
-		cond_resched();
-	} while (1);
+	}
 
	return new_pfn;
 }
@@ -190,7 +224,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	 * drop the lock and do the HVA to PFN lookup again.
 	 */
 	if (!old_valid || old_uhva != gpc->uhva) {
-		unsigned long uhva = gpc->uhva;
 		void *new_khva = NULL;
 
 		/* Placeholders for "hva is valid but not yet mapped" */
@@ -198,15 +231,10 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 		gpc->khva = NULL;
 		gpc->valid = true;
 
-		write_unlock_irq(&gpc->lock);
-
-		new_pfn = hva_to_pfn_retry(kvm, uhva);
+		new_pfn = hva_to_pfn_retry(kvm, gpc);
 		if (is_error_noslot_pfn(new_pfn)) {
 			ret = -EFAULT;
-			goto map_done;
-		}
-
-		if (gpc->usage & KVM_HOST_USES_PFN) {
+		} else if (gpc->usage & KVM_HOST_USES_PFN) {
 			if (new_pfn == old_pfn) {
 				new_khva = old_khva;
 				old_pfn = KVM_PFN_ERR_FAULT;
@@ -222,10 +250,10 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 				new_khva += page_offset;
 			else
 				ret = -EFAULT;
+		} else {
+			/* Nothing more to do, the pfn is consumed only by the guest. */
 		}
 
-map_done:
-	write_lock_irq(&gpc->lock);
 		if (ret) {
 			gpc->valid = false;
 			gpc->pfn = KVM_PFN_ERR_FAULT;
-- 
2.36.0.rc0.470.gd361397f0d-goog

From nobody Mon May 11 00:45:15 2026
Reply-To: Sean Christopherson
Date: Wed, 20 Apr 2022 00:48:59 +0000
In-Reply-To: <20220420004859.3298837-1-seanjc@google.com>
Message-Id: <20220420004859.3298837-3-seanjc@google.com>
Mime-Version: 1.0
References: <20220420004859.3298837-1-seanjc@google.com>
X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog
Subject: [PATCH 2/2] KVM: Do not speculatively mark pfn cache valid to "fix" race
From: Sean Christopherson
To: Paolo Bonzini
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, David Woodhouse,
    Mingwei Zhang, Sean Christopherson
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

When refreshing a stale pfn cache, mark it valid only after the new pfn
(and maybe kernel mapping) has been acquired.  Speculatively marking the
cache as valid with garbage pfn/khva values can lead to other tasks
using the garbage values.

Presumably, the cache was previously marked valid before dropping
gpc->lock to address a race where an mmu_notifier event could invalidate
the cache between retrieving the pfn via hva_to_pfn() and re-acquiring
gpc->lock.  That race has been plugged by waiting for
mn_active_invalidate_count to reach zero before returning from
hva_to_pfn_retry(), i.e. waiting for any notifiers to fully complete
before returning.

Note, this could also be "fixed" by having kvm_gfn_to_pfn_cache_check()
also verify gpc->khva, but that's unnecessary now that the
aforementioned race between refresh and mmu_notifier invalidation has
been eliminated.

  CPU0                                    CPU1
  ----                                    ----

  gpc->valid == false
                                          kvm_gfn_to_pfn_cache_refresh()
                                          |
                                          |-> gpc->valid = true;

                                          hva_to_pfn_retry()
                                          |
                                          -> drop gpc->lock

  acquire gpc->lock for read
  kvm_gfn_to_pfn_cache_check()
  |
  |-> gpc->valid == true
      gpc->khva == NULL

  caller dereferences NULL khva

Opportunistically add a lockdep assertion to the check() helper to make
it slightly more obvious that there is no memory ordering issue when
setting "valid" versus "pfn" and/or "khva".
Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Cc: David Woodhouse
Cc: Mingwei Zhang
Signed-off-by: Sean Christopherson
---
 virt/kvm/pfncache.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 71c84a43024c..72eee096a7cd 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -81,6 +81,8 @@ bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
 
+	lockdep_assert_held_read(&gpc->lock);
+
 	if ((gpa & ~PAGE_MASK) + len > PAGE_SIZE)
 		return false;
 
@@ -226,11 +228,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	if (!old_valid || old_uhva != gpc->uhva) {
 		void *new_khva = NULL;
 
-		/* Placeholders for "hva is valid but not yet mapped" */
-		gpc->pfn = KVM_PFN_ERR_FAULT;
-		gpc->khva = NULL;
-		gpc->valid = true;
-
 		new_pfn = hva_to_pfn_retry(kvm, gpc);
 		if (is_error_noslot_pfn(new_pfn)) {
 			ret = -EFAULT;
@@ -259,7 +256,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 		gpc->pfn = KVM_PFN_ERR_FAULT;
 		gpc->khva = NULL;
 	} else {
-		/* At this point, gpc->valid may already have been cleared */
+		gpc->valid = true;
 		gpc->pfn = new_pfn;
 		gpc->khva = new_khva;
 	}
-- 
2.36.0.rc0.470.gd361397f0d-goog