From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: [PATCH v9 1/6] KVM: Assert that a page's refcount is elevated when marking accessed/dirty
Date: Mon, 11 Sep 2023 11:16:31 +0900
Message-ID: <20230911021637.1941096-2-stevensd@google.com>
In-Reply-To: <20230911021637.1941096-1-stevensd@google.com>
References: <20230911021637.1941096-1-stevensd@google.com>
Content-Type: text/plain; charset=utf-8
From: Sean Christopherson

Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.

Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free.  Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred, but yelling
does help detect, triage, and debug use-after-free bugs.

Note, the assertion must use page_count(), NOT page_ref_count()!  For
hugepages, the returned struct page may be a tail page and thus not have
its own refcount.

Signed-off-by: Sean Christopherson
---
 virt/kvm/kvm_main.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d63cf1c4f5a7..ee6090ecb1fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2914,6 +2914,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
 
 static bool kvm_is_ad_tracked_page(struct page *page)
 {
+	/*
+	 * Assert that KVM isn't attempting to mark a freed page as Accessed or
+	 * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug.  KVM
+	 * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+	 * instead relies on mmu_notifiers to know when a mapping needs to be
+	 * zapped/invalidated.  Unmapping from KVM's MMU must happen _before_
+	 * KVM returns from its mmu_notifier, i.e. the page should have an
+	 * elevated refcount at this point even though KVM doesn't hold a
+	 * reference of its own.
+	 */
+	if (WARN_ON_ONCE(!page_count(page)))
+		return false;
+
 	/*
 	 * Per page-flags.h, pages tagged PG_reserved "should in general not be
 	 * touched (e.g. set dirty) except by its owner".
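
The commit message above stresses that the assertion must use page_count()
rather than page_ref_count(). A minimal sketch, not part of the patch, of
why that matters for hugepage tail pages; the helper name is invented
purely for illustration:

/*
 * Illustrative only: page_count() reads the refcount of the compound
 * head (it is page_ref_count(compound_head(page)) under the hood), so
 * it is meaningful even when @page is a tail page of a hugepage.
 * page_ref_count(page) would read the tail page's own _refcount, which
 * is always zero, and the new WARN_ON_ONCE() would fire spuriously.
 */
static bool page_backing_still_referenced(struct page *page)
{
	return page_count(page) > 0;
}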
-- 
2.42.0.283.g2d96d420d3-goog

From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens
Subject: [PATCH v9 2/6] KVM: mmu: Introduce __kvm_follow_pfn function
Date: Mon, 11 Sep 2023 11:16:32 +0900
Message-ID: <20230911021637.1941096-3-stevensd@google.com>
In-Reply-To: <20230911021637.1941096-1-stevensd@google.com>
References: <20230911021637.1941096-1-stevensd@google.com>
Content-Type: text/plain; charset=utf-8
From: David Stevens

Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
__kvm_follow_pfn refactors the old API's arguments into a struct and,
where possible, combines the boolean arguments into a single flags
argument.

Signed-off-by: David Stevens
Reviewed-by: Maxim Levitsky
---
 include/linux/kvm_host.h |  16 ++++
 virt/kvm/kvm_main.c      | 171 ++++++++++++++++++++++-----------------
 virt/kvm/kvm_mm.h        |   3 +-
 virt/kvm/pfncache.c      |  10 ++-
 4 files changed, 123 insertions(+), 77 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..c2e0ddf14dba 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -97,6 +97,7 @@
 #define KVM_PFN_ERR_HWPOISON	(KVM_PFN_ERR_MASK + 1)
 #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
 #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
+#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
 
 /*
  * error pfns indicate that the gfn is in slot but faild to
@@ -1177,6 +1178,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+struct kvm_follow_pfn {
+	const struct kvm_memory_slot *slot;
+	gfn_t gfn;
+	unsigned int flags;
+	bool atomic;
+	/* Try to create a writable mapping even for a read fault */
+	bool try_map_writable;
+
+	/* Outputs of __kvm_follow_pfn */
+	hva_t hva;
+	bool writable;
+};
+
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
+
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 			  bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee6090ecb1fe..9b33a59c6d65 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2512,8 +2512,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
-static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
-			    bool *writable, kvm_pfn_t *pfn)
+static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 {
 	struct page *page[1];
 
@@ -2522,14 +2521,12 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 	 * or the caller allows to map a writable pfn for a read fault
 	 * request.
 	 */
-	if (!(write_fault || writable))
+	if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable))
 		return false;
 
-	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
 		*pfn = page_to_pfn(page[0]);
-
-		if (writable)
-			*writable = true;
+		foll->writable = true;
 		return true;
 	}
 
@@ -2540,35 +2537,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * The slow path to get the pfn of the specified host virtual address,
  * 1 indicates success, -errno is returned if error is detected.
*/ -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fau= lt, - bool interruptible, bool *writable, kvm_pfn_t *pfn) +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) { - unsigned int flags =3D FOLL_HWPOISON; + unsigned int flags =3D FOLL_HWPOISON | foll->flags; struct page *page; int npages; =20 might_sleep(); =20 - if (writable) - *writable =3D write_fault; - - if (write_fault) - flags |=3D FOLL_WRITE; - if (async) - flags |=3D FOLL_NOWAIT; - if (interruptible) - flags |=3D FOLL_INTERRUPTIBLE; - - npages =3D get_user_pages_unlocked(addr, 1, &page, flags); + npages =3D get_user_pages_unlocked(foll->hva, 1, &page, flags); if (npages !=3D 1) return npages; =20 - /* map read fault as writable if possible */ - if (unlikely(!write_fault) && writable) { + if (foll->flags & FOLL_WRITE) { + foll->writable =3D true; + } else if (foll->try_map_writable) { struct page *wpage; =20 - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) { - *writable =3D true; + /* map read fault as writable if possible */ + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) { + foll->writable =3D true; put_page(page); page =3D wpage; } @@ -2599,23 +2587,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn) } =20 static int hva_to_pfn_remapped(struct vm_area_struct *vma, - unsigned long addr, bool write_fault, - bool *writable, kvm_pfn_t *p_pfn) + struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn) { kvm_pfn_t pfn; pte_t *ptep; pte_t pte; spinlock_t *ptl; + bool write_fault =3D foll->flags & FOLL_WRITE; int r; =20 - r =3D follow_pte(vma->vm_mm, addr, &ptep, &ptl); + r =3D follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl); if (r) { /* * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does * not call the fault handler, so do it here. */ bool unlocked =3D false; - r =3D fixup_user_fault(current->mm, addr, + r =3D fixup_user_fault(current->mm, foll->hva, (write_fault ? FAULT_FLAG_WRITE : 0), &unlocked); if (unlocked) @@ -2623,7 +2611,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct = *vma, if (r) return r; =20 - r =3D follow_pte(vma->vm_mm, addr, &ptep, &ptl); + r =3D follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl); if (r) return r; } @@ -2635,8 +2623,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct = *vma, goto out; } =20 - if (writable) - *writable =3D pte_write(pte); + foll->writable =3D pte_write(pte); pfn =3D pte_pfn(pte); =20 /* @@ -2681,24 +2668,22 @@ static int hva_to_pfn_remapped(struct vm_area_struc= t *vma, * 2): @write_fault =3D false && @writable, @writable will tell the caller * whether the mapping is writable. 
*/ -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, - bool *async, bool write_fault, bool *writable) +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll) { struct vm_area_struct *vma; kvm_pfn_t pfn; int npages, r; =20 /* we can do it either atomically or asynchronously, not both */ - BUG_ON(atomic && async); + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT)); =20 - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn)) + if (hva_to_pfn_fast(foll, &pfn)) return pfn; =20 - if (atomic) + if (foll->atomic) return KVM_PFN_ERR_FAULT; =20 - npages =3D hva_to_pfn_slow(addr, async, write_fault, interruptible, - writable, &pfn); + npages =3D hva_to_pfn_slow(foll, &pfn); if (npages =3D=3D 1) return pfn; if (npages =3D=3D -EINTR) @@ -2706,83 +2691,123 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atom= ic, bool interruptible, =20 mmap_read_lock(current->mm); if (npages =3D=3D -EHWPOISON || - (!async && check_user_page_hwpoison(addr))) { + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))= ) { pfn =3D KVM_PFN_ERR_HWPOISON; goto exit; } =20 retry: - vma =3D vma_lookup(current->mm, addr); + vma =3D vma_lookup(current->mm, foll->hva); =20 if (vma =3D=3D NULL) pfn =3D KVM_PFN_ERR_FAULT; else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) { - r =3D hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn); + r =3D hva_to_pfn_remapped(vma, foll, &pfn); if (r =3D=3D -EAGAIN) goto retry; if (r < 0) pfn =3D KVM_PFN_ERR_FAULT; } else { - if (async && vma_is_valid(vma, write_fault)) - *async =3D true; - pfn =3D KVM_PFN_ERR_FAULT; + if ((foll->flags & FOLL_NOWAIT) && + vma_is_valid(vma, foll->flags & FOLL_WRITE)) + pfn =3D KVM_PFN_ERR_NEEDS_IO; + else + pfn =3D KVM_PFN_ERR_FAULT; } exit: mmap_read_unlock(current->mm); return pfn; } =20 -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t g= fn, - bool atomic, bool interruptible, bool *async, - bool write_fault, bool *writable, hva_t *hva) +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll) { - unsigned long addr =3D __gfn_to_hva_many(slot, gfn, NULL, write_fault); + foll->writable =3D false; + foll->hva =3D __gfn_to_hva_many(foll->slot, foll->gfn, NULL, + foll->flags & FOLL_WRITE); =20 - if (hva) - *hva =3D addr; - - if (addr =3D=3D KVM_HVA_ERR_RO_BAD) { - if (writable) - *writable =3D false; + if (foll->hva =3D=3D KVM_HVA_ERR_RO_BAD) return KVM_PFN_ERR_RO_FAULT; - } =20 - if (kvm_is_error_hva(addr)) { - if (writable) - *writable =3D false; + if (kvm_is_error_hva(foll->hva)) return KVM_PFN_NOSLOT; - } =20 - /* Do not map writable pfn in the readonly memslot. 
*/ - if (writable && memslot_is_readonly(slot)) { - *writable =3D false; - writable =3D NULL; - } + if (memslot_is_readonly(foll->slot)) + foll->try_map_writable =3D false; =20 - return hva_to_pfn(addr, atomic, interruptible, async, write_fault, - writable); + return hva_to_pfn(foll); +} +EXPORT_SYMBOL_GPL(__kvm_follow_pfn); + +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t g= fn, + bool atomic, bool interruptible, bool *async, + bool write_fault, bool *writable, hva_t *hva) +{ + kvm_pfn_t pfn; + struct kvm_follow_pfn foll =3D { + .slot =3D slot, + .gfn =3D gfn, + .flags =3D 0, + .atomic =3D atomic, + .try_map_writable =3D !!writable, + }; + + if (write_fault) + foll.flags |=3D FOLL_WRITE; + if (async) + foll.flags |=3D FOLL_NOWAIT; + if (interruptible) + foll.flags |=3D FOLL_INTERRUPTIBLE; + + pfn =3D __kvm_follow_pfn(&foll); + if (pfn =3D=3D KVM_PFN_ERR_NEEDS_IO) { + *async =3D true; + pfn =3D KVM_PFN_ERR_FAULT; + } + if (hva) + *hva =3D foll.hva; + if (writable) + *writable =3D foll.writable; + return pfn; } EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot); =20 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, bool *writable) { - return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false, - NULL, write_fault, writable, NULL); + kvm_pfn_t pfn; + struct kvm_follow_pfn foll =3D { + .slot =3D gfn_to_memslot(kvm, gfn), + .gfn =3D gfn, + .flags =3D write_fault ? FOLL_WRITE : 0, + .try_map_writable =3D !!writable, + }; + pfn =3D __kvm_follow_pfn(&foll); + if (writable) + *writable =3D foll.writable; + return pfn; } EXPORT_SYMBOL_GPL(gfn_to_pfn_prot); =20 kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn) { - return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true, - NULL, NULL); + struct kvm_follow_pfn foll =3D { + .slot =3D slot, + .gfn =3D gfn, + .flags =3D FOLL_WRITE, + }; + return __kvm_follow_pfn(&foll); } EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot); =20 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gf= n_t gfn) { - return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true, - NULL, NULL); + struct kvm_follow_pfn foll =3D { + .slot =3D slot, + .gfn =3D gfn, + .flags =3D FOLL_WRITE, + .atomic =3D true, + }; + return __kvm_follow_pfn(&foll); } EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic); =20 diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 180f1a09e6ba..ed896aee5396 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -20,8 +20,7 @@ #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock) #endif /* KVM_HAVE_MMU_RWLOCK */ =20 -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, - bool *async, bool write_fault, bool *writable); +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll); =20 #ifdef CONFIG_HAVE_KVM_PFNCACHE void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index 2d6aba677830..86cd40acad11 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_ca= che *gpc) kvm_pfn_t new_pfn =3D KVM_PFN_ERR_FAULT; void *new_khva =3D NULL; unsigned long mmu_seq; + struct kvm_follow_pfn foll =3D { + .slot =3D gpc->memslot, + .gfn =3D gpa_to_gfn(gpc->gpa), + .flags =3D FOLL_WRITE, + .hva =3D gpc->uhva, + }; =20 lockdep_assert_held(&gpc->refresh_lock); =20 @@ -182,8 +188,8 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cac= he *gpc) cond_resched(); } =20 - /* We always request a writeable mapping */ - new_pfn =3D 
hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+	/* We always request a writable mapping */
+	new_pfn = hva_to_pfn(&foll);
 	if (is_error_noslot_pfn(new_pfn))
 		goto out_error;
 
-- 
2.42.0.283.g2d96d420d3-goog

From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens
Subject: [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns
Date: Mon, 11 Sep 2023 11:16:33 +0900
Message-ID: <20230911021637.1941096-4-stevensd@google.com>
In-Reply-To: <20230911021637.1941096-1-stevensd@google.com>
References: <20230911021637.1941096-1-stevensd@google.com>
Content-Type: text/plain; charset=utf-8
From: David Stevens

KVM's handling of non-refcounted pfns has two problems:

 - struct pages without refcounting (e.g. tail pages of non-compound
   higher order pages) cannot be used at all, as gfn_to_pfn does not
   provide enough information for callers to handle the refcount.

 - pfns without struct pages can be accessed without the protection of a
   mmu notifier. This is unsafe because KVM cannot monitor or control
   the lifespan of such pfns, so it may continue to access the pfns
   after they are freed.

This patch extends the __kvm_follow_pfn API to properly handle these
cases. First, it adds an is_refcounted_page output parameter so that
callers can tell whether or not a pfn has a struct page that needs to be
passed to put_page. Second, it adds a guarded_by_mmu_notifier parameter
that is used to avoid returning non-refcounted pages when the caller
cannot safely use them.

Since callers need to be updated on a case-by-case basis to pay
attention to is_refcounted_page, the new behavior of returning
non-refcounted pages is opt-in via the allow_non_refcounted_struct_page
parameter. Once all callers have been updated, this parameter should be
removed.

The fact that non-refcounted pfns can no longer be accessed without mmu
notifier protection is a breaking change. Since there is no timeline for
updating everything in KVM to use mmu notifiers, this change adds an
opt-in module parameter called allow_unsafe_mappings to allow such
mappings. Systems which trust userspace not to tear down such unsafe
mappings while KVM is using them can set this parameter to re-enable the
legacy behavior.

Signed-off-by: David Stevens
---
 include/linux/kvm_host.h | 21 ++++++++++
 virt/kvm/kvm_main.c      | 84 ++++++++++++++++++++++++----------------
 virt/kvm/pfncache.c      |  1 +
 3 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c2e0ddf14dba..2ed08ae1a9be 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1185,10 +1185,31 @@ struct kvm_follow_pfn {
 	bool atomic;
 	/* Try to create a writable mapping even for a read fault */
 	bool try_map_writable;
+	/* Usage of the returned pfn will be guarded by a mmu notifier. */
+	bool guarded_by_mmu_notifier;
+	/*
+	 * When false, do not return pfns for non-refcounted struct pages.
+	 *
+	 * TODO: This allows callers to use kvm_release_pfn on the pfns
+	 * returned by gfn_to_pfn without worrying about corrupting the
+	 * refcount of non-refcounted pages. Once all callers respect
+	 * is_refcounted_page, this flag should be removed.
+	 */
+	bool allow_non_refcounted_struct_page;
 
 	/* Outputs of __kvm_follow_pfn */
 	hva_t hva;
 	bool writable;
+	/*
+	 * True if the returned pfn is for a page with a valid refcount. False
+	 * if the returned pfn has no struct page or if the struct page is not
+	 * being refcounted (e.g. tail pages of non-compound higher order
+	 * allocations from IO/PFNMAP mappings).
+	 *
+	 * When this output flag is false, callers should not try to convert
+	 * the pfn to a struct page.
+ */ + bool is_refcounted_page; }; =20 kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 9b33a59c6d65..235c5cb3fdac 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink; module_param(halt_poll_ns_shrink, uint, 0644); EXPORT_SYMBOL_GPL(halt_poll_ns_shrink); =20 +/* Allow non-struct page memory to be mapped without MMU notifier protecti= on. */ +static bool allow_unsafe_mappings; +module_param(allow_unsafe_mappings, bool, 0444); + /* * Ordering of locks: * @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned = long addr) return rc =3D=3D -EHWPOISON; } =20 +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll, + struct page *page) +{ + kvm_pfn_t pfn =3D page_to_pfn(page); + + foll->is_refcounted_page =3D true; + return pfn; +} + /* * The fast path to get the writable pfn which will be stored in @pfn, * true indicates success, otherwise false is returned. It's also the @@ -2525,7 +2538,7 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *fo= ll, kvm_pfn_t *pfn) return false; =20 if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { - *pfn =3D page_to_pfn(page[0]); + *pfn =3D kvm_follow_refcounted_pfn(foll, page[0]); foll->writable =3D true; return true; } @@ -2561,7 +2574,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *fol= l, kvm_pfn_t *pfn) page =3D wpage; } } - *pfn =3D page_to_pfn(page); + *pfn =3D kvm_follow_refcounted_pfn(foll, page); return npages; } =20 @@ -2576,16 +2589,6 @@ static bool vma_is_valid(struct vm_area_struct *vma,= bool write_fault) return true; } =20 -static int kvm_try_get_pfn(kvm_pfn_t pfn) -{ - struct page *page =3D kvm_pfn_to_refcounted_page(pfn); - - if (!page) - return 1; - - return get_page_unless_zero(page); -} - static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn) { @@ -2594,6 +2597,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct = *vma, pte_t pte; spinlock_t *ptl; bool write_fault =3D foll->flags & FOLL_WRITE; + struct page *page; int r; =20 r =3D follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl); @@ -2618,37 +2622,39 @@ static int hva_to_pfn_remapped(struct vm_area_struc= t *vma, =20 pte =3D ptep_get(ptep); =20 + foll->writable =3D pte_write(pte); + pfn =3D pte_pfn(pte); + + page =3D kvm_pfn_to_refcounted_page(pfn); + if (write_fault && !pte_write(pte)) { pfn =3D KVM_PFN_ERR_RO_FAULT; goto out; } =20 - foll->writable =3D pte_write(pte); - pfn =3D pte_pfn(pte); + if (!page) + goto out; =20 - /* - * Get a reference here because callers of *hva_to_pfn* and - * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the - * returned pfn. This is only needed if the VMA has VM_MIXEDMAP - * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will - * simply do nothing for reserved pfns. - * - * Whoever called remap_pfn_range is also going to call e.g. - * unmap_mapping_range before the underlying pages are freed, - * causing a call to our MMU notifier. - * - * Certain IO or PFNMAP mappings can be backed with valid - * struct pages, but be allocated without refcounting e.g., - * tail pages of non-compound higher order allocations, which - * would then underflow the refcount when the caller does the - * required put_page. Don't allow those pages here. 
- */ - if (!kvm_try_get_pfn(pfn)) - r =3D -EFAULT; + if (get_page_unless_zero(page)) + WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) !=3D pfn); =20 out: pte_unmap_unlock(ptep, ptl); - *p_pfn =3D pfn; + + /* + * TODO: Remove the first branch once all callers have been + * taught to play nice with non-refcounted struct pages. + */ + if (page && !foll->is_refcounted_page && + !foll->allow_non_refcounted_struct_page) { + r =3D -EFAULT; + } else if (!foll->is_refcounted_page && + !foll->guarded_by_mmu_notifier && + !allow_unsafe_mappings) { + r =3D -EFAULT; + } else { + *p_pfn =3D pfn; + } =20 return r; } @@ -2722,6 +2728,8 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll) kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll) { foll->writable =3D false; + foll->is_refcounted_page =3D false; + foll->hva =3D __gfn_to_hva_many(foll->slot, foll->gfn, NULL, foll->flags & FOLL_WRITE); =20 @@ -2749,6 +2757,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memor= y_slot *slot, gfn_t gfn, .flags =3D 0, .atomic =3D atomic, .try_map_writable =3D !!writable, + .allow_non_refcounted_struct_page =3D false, }; =20 if (write_fault) @@ -2780,6 +2789,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn,= bool write_fault, .gfn =3D gfn, .flags =3D write_fault ? FOLL_WRITE : 0, .try_map_writable =3D !!writable, + .allow_non_refcounted_struct_page =3D false, }; pfn =3D __kvm_follow_pfn(&foll); if (writable) @@ -2794,6 +2804,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_= slot *slot, gfn_t gfn) .slot =3D slot, .gfn =3D gfn, .flags =3D FOLL_WRITE, + .allow_non_refcounted_struct_page =3D false, }; return __kvm_follow_pfn(&foll); } @@ -2806,6 +2817,11 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm= _memory_slot *slot, gfn_t gf .gfn =3D gfn, .flags =3D FOLL_WRITE, .atomic =3D true, + /* + * Setting atomic means __kvm_follow_pfn will never make it + * to hva_to_pfn_remapped, so this is vacuously true. 
+	 */
+	.allow_non_refcounted_struct_page = true,
 	};
 	return __kvm_follow_pfn(&foll);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 86cd40acad11..6bbf972c11f8 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -149,6 +149,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		.gfn = gpa_to_gfn(gpc->gpa),
 		.flags = FOLL_WRITE,
 		.hva = gpc->uhva,
+		.allow_non_refcounted_struct_page = false,
 	};
 
 	lockdep_assert_held(&gpc->refresh_lock);
-- 
2.42.0.283.g2d96d420d3-goog

From: David Stevens
To: Sean Christopherson
Cc: Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, David Stevens Subject: [PATCH v9 4/6] KVM: Migrate kvm_vcpu_map to __kvm_follow_pfn Date: Mon, 11 Sep 2023 11:16:34 +0900 Message-ID: <20230911021637.1941096-5-stevensd@google.com> X-Mailer: git-send-email 2.42.0.283.g2d96d420d3-goog In-Reply-To: <20230911021637.1941096-1-stevensd@google.com> References: <20230911021637.1941096-1-stevensd@google.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: David Stevens Migrate kvm_vcpu_map to __kvm_follow_pfn. Track is_refcounted_page so that kvm_vcpu_unmap know whether or not it needs to release the page. Signed-off-by: David Stevens Reviewed-by: Maxim Levitsky --- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c | 24 ++++++++++++++---------- 2 files changed, 15 insertions(+), 11 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 2ed08ae1a9be..b95c79b7833b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -294,6 +294,7 @@ struct kvm_host_map { void *hva; kvm_pfn_t pfn; kvm_pfn_t gfn; + bool is_refcounted_page; }; =20 /* @@ -1228,7 +1229,6 @@ void kvm_release_pfn_dirty(kvm_pfn_t pfn); void kvm_set_pfn_dirty(kvm_pfn_t pfn); void kvm_set_pfn_accessed(kvm_pfn_t pfn); =20 -void kvm_release_pfn(kvm_pfn_t pfn, bool dirty); int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, int len); int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long l= en); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 235c5cb3fdac..913de4e86d9d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2886,24 +2886,22 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn) } EXPORT_SYMBOL_GPL(gfn_to_page); =20 -void kvm_release_pfn(kvm_pfn_t pfn, bool dirty) -{ - if (dirty) - kvm_release_pfn_dirty(pfn); - else - kvm_release_pfn_clean(pfn); -} - int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *ma= p) { kvm_pfn_t pfn; void *hva =3D NULL; struct page *page =3D KVM_UNMAPPED_PAGE; + struct kvm_follow_pfn foll =3D { + .slot =3D gfn_to_memslot(vcpu->kvm, gfn), + .gfn =3D gfn, + .flags =3D FOLL_WRITE, + .allow_non_refcounted_struct_page =3D true, + }; =20 if (!map) return -EINVAL; =20 - pfn =3D gfn_to_pfn(vcpu->kvm, gfn); + pfn =3D __kvm_follow_pfn(&foll); if (is_error_noslot_pfn(pfn)) return -EINVAL; =20 @@ -2923,6 +2921,7 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, st= ruct kvm_host_map *map) map->hva =3D hva; map->pfn =3D pfn; map->gfn =3D gfn; + map->is_refcounted_page =3D foll.is_refcounted_page; =20 return 0; } @@ -2946,7 +2945,12 @@ void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kv= m_host_map *map, bool dirty) if (dirty) kvm_vcpu_mark_page_dirty(vcpu, map->gfn); =20 - kvm_release_pfn(map->pfn, dirty); + if (map->is_refcounted_page) { + if (dirty) + kvm_release_page_dirty(map->page); + else + kvm_release_page_clean(map->page); + } =20 map->hva =3D NULL; map->page =3D NULL; --=20 2.42.0.283.g2d96d420d3-goog From nobody Thu Dec 18 18:18:31 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B145BEE49A4 for ; Mon, 11 Sep 2023 02:17:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232908AbjIKCSA (ORCPT ); Sun, 10 Sep 2023 22:18:00 -0400 Received: from 
lindbergh.monkeyblade.net ([23.128.96.19]:46560 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229803AbjIKCR6 (ORCPT ); Sun, 10 Sep 2023 22:17:58 -0400 Received: from mail-ot1-x335.google.com (mail-ot1-x335.google.com [IPv6:2607:f8b0:4864:20::335]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9D99D10CA for ; Sun, 10 Sep 2023 19:17:15 -0700 (PDT) Received: by mail-ot1-x335.google.com with SMTP id 46e09a7af769-6c0a42a469dso2765221a34.2 for ; Sun, 10 Sep 2023 19:17:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; t=1694398627; x=1695003427; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=pUMMszyUeB8rBOIlzKvaWgoejhjaS1Q5X7ygUgKYL7o=; b=MHcv5bWJiXCkl/D4jVPSQS5ACW4awitYOVwdMY9IK4b9lRQEtvvBWt4NKADegwtt0v QADeq793DwQYt1AvSmqRQWV4kiR/sOumC3g0/KO3e0s91bLG41Wrov6sYxvIlZowq0Z6 +6z38FQUhGVl+QECzHa5REOsA6mZ2DQImUjqg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1694398627; x=1695003427; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=pUMMszyUeB8rBOIlzKvaWgoejhjaS1Q5X7ygUgKYL7o=; b=o80ng3cniYkYnvLnWlXYrNsyd/MoGGEECohnPyaFB3p7Y/tkK0fReQMTpK4Z3QCANg h+Mx+ac66AYu8NPbnyFC9nJC5pPK01x50DPkw9Jx+YmlDHNm7bjDP2EANO74fouzp7AC abqTv2paIyDMoHpmxPB3/an7Pz5gy+u5rHBbnQ4vzw6Yfc/gc5e/X5YwvAMbSfjACXmW pR+xyE4jdg5+G6iOJ/S9R23ug8PvqozcJdw4WQLZGuYr510M+rOKCAxPZjGtVUbqNO4U jMbNXYbeCz8FrovXgNgjBE/TztGuSSJcHdDSgQONpXDHyRFot0ZLEd1kkXjjlcXAo9kV yc1w== X-Gm-Message-State: AOJu0YwJ+y8KDS/YKnS1yoOtUJ6cHdJAZTBeuj0CIdm8Y2h2RZwJc1Gt vU61EycvWsOJgNHqovC2I6LA6w== X-Google-Smtp-Source: AGHT+IG+ZZgm9Jt+l/g4kcQ+fj/q9jDIvKCZ7VIBYEeNAcdfOlMNqp3Rov4gWnDJFtxXh/SnKOIYOQ== X-Received: by 2002:a05:6830:1c3:b0:6bf:235c:41f4 with SMTP id r3-20020a05683001c300b006bf235c41f4mr10551348ota.3.1694398627544; Sun, 10 Sep 2023 19:17:07 -0700 (PDT) Received: from localhost ([2401:fa00:8f:203:282a:59c8:cc3a:2d6]) by smtp.gmail.com with UTF8SMTPSA id a23-20020a056a001d1700b006889664aa6csm1295974pfx.5.2023.09.10.19.17.05 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 10 Sep 2023 19:17:07 -0700 (PDT) From: David Stevens X-Google-Original-From: David Stevens To: Sean Christopherson Cc: Yu Zhang , Isaku Yamahata , Zhi Wang , kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens Subject: [PATCH v9 5/6] KVM: x86: Migrate to __kvm_follow_pfn Date: Mon, 11 Sep 2023 11:16:35 +0900 Message-ID: <20230911021637.1941096-6-stevensd@google.com> X-Mailer: git-send-email 2.42.0.283.g2d96d420d3-goog In-Reply-To: <20230911021637.1941096-1-stevensd@google.com> References: <20230911021637.1941096-1-stevensd@google.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: David Stevens Migrate functions which need access to is_refcounted_page to __kvm_follow_pfn. The functions which need this are __kvm_faultin_pfn and reexecute_instruction. The former requires replacing the async in/out parameter with FOLL_NOWAIT parameter and the KVM_PFN_ERR_NEEDS_IO return value. Handling non-refcounted pages is complicated, so it will be done in a followup. The latter is a straightforward refactor. 
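
As an aside (not part of the patch), the before/after calling convention
described above can be sketched as follows. The local variable names
(slot, gfn, write_fault, map_writable, hva, pfn) are placeholders; the
functions, flags, and error values are the ones used elsewhere in this
series, and the surrounding fault-handling logic is elided.

	kvm_pfn_t pfn;
	bool async = false, map_writable = false;
	hva_t hva;

	/* Old style: asynchronous behaviour is an in/out bool parameter. */
	pfn = __gfn_to_pfn_memslot(slot, gfn, false, false, &async,
				   write_fault, &map_writable, &hva);
	if (async) {
		/* queue an async page fault and bail out */
	}

	/* New style: opt in with FOLL_NOWAIT and check the returned pfn. */
	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_NOWAIT | (write_fault ? FOLL_WRITE : 0),
		.try_map_writable = true,
	};
	pfn = __kvm_follow_pfn(&foll);
	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
		/* queue an async page fault, or retry without FOLL_NOWAIT */
	}

The key difference is that asynchronous behaviour becomes an explicit
request via FOLL_NOWAIT, and "needs I/O" becomes a distinguished error
pfn rather than an out-parameter.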
APIC related callers do not need to migrate because KVM controls the memslot, so it will always be regular memory. Prefetch related callers do not need to be migrated because atomic gfn_to_pfn calls can never make it to hva_to_pfn_remapped. Signed-off-by: David Stevens Reviewed-by: Maxim Levitsky --- arch/x86/kvm/mmu/mmu.c | 43 ++++++++++++++++++++++++++++++++---------- arch/x86/kvm/x86.c | 12 ++++++++++-- 2 files changed, 43 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index e1d011c67cc6..e1eca26215e2 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4254,7 +4254,14 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu= , struct kvm_async_pf *work) static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault = *fault) { struct kvm_memory_slot *slot =3D fault->slot; - bool async; + struct kvm_follow_pfn foll =3D { + .slot =3D slot, + .gfn =3D fault->gfn, + .flags =3D fault->write ? FOLL_WRITE : 0, + .try_map_writable =3D true, + .guarded_by_mmu_notifier =3D true, + .allow_non_refcounted_struct_page =3D false, + }; =20 /* * Retry the page fault if the gfn hit a memslot that is being deleted @@ -4283,12 +4290,20 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu,= struct kvm_page_fault *fault return RET_PF_EMULATE; } =20 - async =3D false; - fault->pfn =3D __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &asyn= c, - fault->write, &fault->map_writable, - &fault->hva); - if (!async) - return RET_PF_CONTINUE; /* *pfn has correct page already */ + foll.flags |=3D FOLL_NOWAIT; + fault->pfn =3D __kvm_follow_pfn(&foll); + + if (!is_error_noslot_pfn(fault->pfn)) + goto success; + + /* + * If __kvm_follow_pfn() failed because I/O is needed to fault in the + * page, then either set up an asynchronous #PF to do the I/O, or if + * doing an async #PF isn't possible, retry __kvm_follow_pfn() with + * I/O allowed. All other failures are fatal, i.e. retrying won't help. + */ + if (fault->pfn !=3D KVM_PFN_ERR_NEEDS_IO) + return RET_PF_CONTINUE; =20 if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) { trace_kvm_try_async_get_page(fault->addr, fault->gfn); @@ -4306,9 +4321,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, = struct kvm_page_fault *fault * to wait for IO. Note, gup always bails if it is unable to quickly * get a page and a fatal signal, i.e. SIGKILL, is pending. */ - fault->pfn =3D __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL, - fault->write, &fault->map_writable, - &fault->hva); + foll.flags |=3D FOLL_INTERRUPTIBLE; + foll.flags &=3D ~FOLL_NOWAIT; + fault->pfn =3D __kvm_follow_pfn(&foll); + + if (!is_error_noslot_pfn(fault->pfn)) + goto success; + + return RET_PF_CONTINUE; +success: + fault->hva =3D foll.hva; + fault->map_writable =3D foll.writable; return RET_PF_CONTINUE; } =20 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6c9c81e82e65..2011a7e47296 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8556,6 +8556,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vc= pu, gpa_t cr2_or_gpa, { gpa_t gpa =3D cr2_or_gpa; kvm_pfn_t pfn; + struct kvm_follow_pfn foll; =20 if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF)) return false; @@ -8585,7 +8586,13 @@ static bool reexecute_instruction(struct kvm_vcpu *v= cpu, gpa_t cr2_or_gpa, * retry instruction -> write #PF -> emulation fail -> retry * instruction -> ... 
*/ - pfn =3D gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa)); + foll =3D (struct kvm_follow_pfn) { + .slot =3D gfn_to_memslot(vcpu->kvm, gpa_to_gfn(gpa)), + .gfn =3D gpa_to_gfn(gpa), + .flags =3D FOLL_WRITE, + .allow_non_refcounted_struct_page =3D true, + }; + pfn =3D __kvm_follow_pfn(&foll); =20 /* * If the instruction failed on the error pfn, it can not be fixed, @@ -8594,7 +8601,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vc= pu, gpa_t cr2_or_gpa, if (is_error_noslot_pfn(pfn)) return false; =20 - kvm_release_pfn_clean(pfn); + if (foll.is_refcounted_page) + kvm_release_page_clean(pfn_to_page(pfn)); =20 /* The instructions are well-emulated on direct mmu. */ if (vcpu->arch.mmu->root_role.direct) { --=20 2.42.0.283.g2d96d420d3-goog From nobody Thu Dec 18 18:18:31 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 261F6EEB57E for ; Mon, 11 Sep 2023 02:18:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229491AbjIKCSe (ORCPT ); Sun, 10 Sep 2023 22:18:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46540 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230037AbjIKCSc (ORCPT ); Sun, 10 Sep 2023 22:18:32 -0400 Received: from mail-pf1-x42d.google.com (mail-pf1-x42d.google.com [IPv6:2607:f8b0:4864:20::42d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 392A1E43 for ; Sun, 10 Sep 2023 19:17:47 -0700 (PDT) Received: by mail-pf1-x42d.google.com with SMTP id d2e1a72fcca58-68fc081cd46so368913b3a.0 for ; Sun, 10 Sep 2023 19:17:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; t=1694398631; x=1695003431; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=91mLy+G3O/xIpYQLZgTXP1iR98nXkRw1gYxVierAmLM=; b=BdvIMiTzDzYzyJroOoBF07UgFHQQAzzho3wIGtRbxNIw9ELXFKX66rc8aMI7Ki8dxl x/XrOAo6XWDKksTOeXN9LR6eZrkqpAy+aggAVP4Ru3IHK0rHIIlyalUhrvnrJ1+G4Xia 4XIWitulOhyNwIg3oN3Ec+3AyIdkLOwEauKTw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1694398631; x=1695003431; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=91mLy+G3O/xIpYQLZgTXP1iR98nXkRw1gYxVierAmLM=; b=nV3ceZ+KtK+bZpBms0uC10Bs2BdRg51Er1v6Vm9413QumnhaBmN3szs/ni14oUtWak NT0gHC9wlsCYQiqqTKdmutaOgQJH3UFMZbwIly1wNxVQW+ZmGbHDdmIUNS2ON+owhCEr svuLBRYuSBi/0mQj7XZIWhfVLzPi3Ph6/bYS8z+eT2SWISOdDY5KTVk6lIuiuuTy+qMw sChkOe5Dd7cadAJ9HIKGGehiRxmBEu+avZNR5tTyR/hpY+VlFDFF9gyxBbgBtQCKePa1 2UjmGJv3cHZU34Fx08j6FBdhdmMedvp1GjMaVIDnhFOYrYnhU3enmmXCwLMB5eDUQKH3 2yJg== X-Gm-Message-State: AOJu0YwbF8EteOLZT2Ip/wYI/EdcJHLwshXcm2qrIKakWof4mcrbZczH lsX8Zxl/ya2ADz4Y5/ukxpDhTg== X-Google-Smtp-Source: AGHT+IEF78He+7+PkQmYAuwbgimpofpg6YfXs4vDl0ZW8a5/SBfT4Pu8o+FolgBcnBH93tJqiem73Q== X-Received: by 2002:a05:6a00:1989:b0:68c:705d:78b3 with SMTP id d9-20020a056a00198900b0068c705d78b3mr7505251pfl.28.1694398631461; Sun, 10 Sep 2023 19:17:11 -0700 (PDT) Received: from localhost ([2401:fa00:8f:203:282a:59c8:cc3a:2d6]) by smtp.gmail.com with UTF8SMTPSA id m2-20020aa79002000000b0068702b66ab1sm4596005pfo.174.2023.09.10.19.17.09 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 10 
Sep 2023 19:17:11 -0700 (PDT) From: David Stevens X-Google-Original-From: David Stevens To: Sean Christopherson Cc: Yu Zhang , Isaku Yamahata , Zhi Wang , kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, David Stevens Subject: [PATCH v9 6/6] KVM: x86/mmu: Handle non-refcounted pages Date: Mon, 11 Sep 2023 11:16:36 +0900 Message-ID: <20230911021637.1941096-7-stevensd@google.com> X-Mailer: git-send-email 2.42.0.283.g2d96d420d3-goog In-Reply-To: <20230911021637.1941096-1-stevensd@google.com> References: <20230911021637.1941096-1-stevensd@google.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" From: David Stevens Handle non-refcounted pages in __kvm_faultin_pfn. This allows the host to map memory into the guest that is backed by non-refcounted struct pages - for example, the tail pages of higher order non-compound pages allocated by the amdgpu driver via ttm_pool_alloc_page. The bulk of this change is tracking the is_refcounted_page flag so that non-refcounted pages don't trigger page_count() =3D=3D 0 warnings. This is done by storing the flag in an unused bit in the sptes. There are no bits available in PAE SPTEs, so non-refcounted pages can only be handled on TDP and x86-64. Signed-off-by: David Stevens --- arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++++++++---------- arch/x86/kvm/mmu/mmu_internal.h | 1 + arch/x86/kvm/mmu/paging_tmpl.h | 8 +++-- arch/x86/kvm/mmu/spte.c | 4 ++- arch/x86/kvm/mmu/spte.h | 12 +++++++- arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++------ include/linux/kvm_host.h | 3 ++ virt/kvm/kvm_main.c | 6 ++-- 8 files changed, 76 insertions(+), 32 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index e1eca26215e2..b8168cc4cc96 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -545,12 +545,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte) =20 if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) { flush =3D true; - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); + if (is_refcounted_page_pte(old_spte)) + kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte))); } =20 if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) { flush =3D true; - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); + if (is_refcounted_page_pte(old_spte)) + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte))); } =20 return flush; @@ -588,14 +590,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm,= u64 *sptep) * before they are reclaimed. Sanity check that, if the pfn is backed * by a refcounted page, the refcount is elevated. */ - page =3D kvm_pfn_to_refcounted_page(pfn); - WARN_ON_ONCE(page && !page_count(page)); + if (is_refcounted_page_pte(old_spte)) { + page =3D kvm_pfn_to_refcounted_page(pfn); + WARN_ON_ONCE(!page || !page_count(page)); + } =20 - if (is_accessed_spte(old_spte)) - kvm_set_pfn_accessed(pfn); + if (is_refcounted_page_pte(old_spte)) { + if (is_accessed_spte(old_spte)) + kvm_set_page_accessed(pfn_to_page(pfn)); =20 - if (is_dirty_spte(old_spte)) - kvm_set_pfn_dirty(pfn); + if (is_dirty_spte(old_spte)) + kvm_set_page_dirty(pfn_to_page(pfn)); + } =20 return old_spte; } @@ -631,8 +637,8 @@ static bool mmu_spte_age(u64 *sptep) * Capture the dirty status of the page, so that it doesn't get * lost when the SPTE is marked for access tracking. 
 		 */
-		if (is_writable_pte(spte))
-			kvm_set_pfn_dirty(spte_to_pfn(spte));
+		if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
 
 		spte = mark_spte_for_access_track(spte);
 		mmu_spte_update_no_track(sptep, spte);
@@ -1261,8 +1267,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
 {
 	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
 					       (unsigned long *)sptep);
-	if (was_writable && !spte_ad_enabled(*sptep))
-		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+	if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
 
 	return was_writable;
 }
@@ -2913,6 +2919,11 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	/*
+	 * Prefetching uses gfn_to_page_many_atomic, which never gets
+	 * non-refcounted pages.
+	 */
+	bool is_refcounted = !fault || fault->is_refcounted_page;
 
 	if (unlikely(is_noslot_pfn(pfn))) {
 		vcpu->stat.pf_mmio_spte_created++;
@@ -2940,7 +2951,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
-			   true, host_writable, &spte);
+			   true, host_writable, is_refcounted, &spte);
 
 	if (*sptep == spte) {
 		ret = RET_PF_SPURIOUS;
@@ -4254,13 +4265,18 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
+	/*
+	 * There are no extra bits for tracking non-refcounted pages in
+	 * PAE SPTEs, so reject non-refcounted struct pages in that case.
+	 */
+	bool has_spte_refcount_bit = tdp_enabled && IS_ENABLED(CONFIG_X86_64);
 	struct kvm_follow_pfn foll = {
 		.slot = slot,
 		.gfn = fault->gfn,
 		.flags = fault->write ? FOLL_WRITE : 0,
 		.try_map_writable = true,
 		.guarded_by_mmu_notifier = true,
-		.allow_non_refcounted_struct_page = false,
+		.allow_non_refcounted_struct_page = has_spte_refcount_bit,
 	};
 
 	/*
@@ -4277,6 +4293,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		fault->slot = NULL;
 		fault->pfn = KVM_PFN_NOSLOT;
 		fault->map_writable = false;
+		fault->is_refcounted_page = false;
 		return RET_PF_CONTINUE;
 	}
 	/*
@@ -4332,6 +4349,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 success:
 	fault->hva = foll.hva;
 	fault->map_writable = foll.writable;
+	fault->is_refcounted_page = foll.is_refcounted_page;
 	return RET_PF_CONTINUE;
 }
 
@@ -4420,8 +4438,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = direct_map(vcpu, fault);
 
 out_unlock:
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
@@ -4496,8 +4515,9 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	r = kvm_tdp_mmu_map(vcpu, fault);
 
 out_unlock:
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b102014e2c60..7f73bc2a552e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -239,6 +239,7 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	bool is_refcounted_page;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..0ac4a4e5870c 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -848,7 +848,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (fault->is_refcounted_page)
+		kvm_set_page_accessed(pfn_to_page(fault->pfn));
 	return r;
 }
 
@@ -902,7 +903,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
  */
 static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
 {
-	bool host_writable;
+	bool host_writable, is_refcounted;
 	gpa_t first_pte_gpa;
 	u64 *sptep, spte;
 	struct kvm_memory_slot *slot;
@@ -959,10 +960,11 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
 	sptep = &sp->spt[i];
 	spte = *sptep;
 	host_writable = spte & shadow_host_writable_mask;
+	is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	make_spte(vcpu, sp, slot, pte_access, gfn,
 		  spte_to_pfn(spte), spte, true, false,
-		  host_writable, &spte);
+		  host_writable, is_refcounted, &spte);
 
 	return mmu_spte_update(sptep, spte);
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..ce495819061f 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte)
+	       bool host_writable, bool is_refcounted, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
+	if (is_refcounted)
+		spte |= SPTE_MMU_PAGE_REFCOUNTED;
 
 	if (shadow_memtype_mask)
 		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..4bf4a535c23d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -96,6 +96,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 /* Defined only to keep the above static asserts readable. */
 #undef SHADOW_ACC_TRACK_SAVED_MASK
 
+/*
+ * Indicates that the SPTE refers to a page with a valid refcount.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED	BIT_ULL(59)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
  * the memslots generation and is derived as follows:
@@ -345,6 +350,11 @@ static inline bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_refcounted_page_pte(u64 spte)
+{
+	return spte & SPTE_MMU_PAGE_REFCOUNTED;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
 				int level)
 {
@@ -475,7 +485,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
-	       bool host_writable, u64 *new_spte);
+	       bool host_writable, bool is_refcounted, u64 *new_spte);
 u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
 			      union kvm_mmu_page_role role, int index);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c63f2d1675f..185f3c666c2b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	bool is_refcounted = is_refcounted_page_pte(old_spte);
 
 	WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (is_leaf != was_leaf)
 		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
 
-	if (was_leaf && is_dirty_spte(old_spte) &&
+	if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 
-	if (was_leaf && is_accessed_spte(old_spte) &&
+	if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
 	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+		kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 }
 
 /*
@@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-				   fault->pfn, iter->old_spte, fault->prefetch, true,
-				   fault->map_writable, &new_spte);
+				   fault->pfn, iter->old_spte, fault->prefetch, true,
+				   fault->map_writable, fault->is_refcounted_page,
+				   &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(iter->old_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+		if (is_writable_pte(iter->old_spte) &&
+		    is_refcounted_page_pte(iter->old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));
 
 		new_spte = mark_spte_for_access_track(iter->old_spte);
 		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1628,7 +1631,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
 					       iter.old_spte,
 					       iter.old_spte & ~dbit);
-		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+		if (is_refcounted_page_pte(iter.old_spte))
+			kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
 	}
 
 	rcu_read_unlock();
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b95c79b7833b..6696925f01f1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1179,6 +1179,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
 struct kvm_follow_pfn {
 	const struct kvm_memory_slot *slot;
 	gfn_t gfn;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 913de4e86d9d..4d8538cdb690 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2979,17 +2979,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
 	return !PageReserved(page);
 }
 
-static void kvm_set_page_dirty(struct page *page)
+void kvm_set_page_dirty(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		SetPageDirty(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
-static void kvm_set_page_accessed(struct page *page)
+void kvm_set_page_accessed(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		mark_page_accessed(page);
 }
+EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
 
 void kvm_release_page_clean(struct page *page)
 {
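
Aside for reviewers, not part of the patch: every conversion in the diff follows one pattern. make_spte() tags the new SPTE with a software-available bit when the backing page is refcounted, and the struct-page accessed/dirty helpers are only called when that bit is set, since non-refcounted pfns (e.g. VM_PFNMAP mappings) have no struct page whose flags may be touched. The stand-alone sketch below approximates that pattern; the bit value and helper names mirror the diff, while the plain uint64_t SPTEs and the printf() stub are invented for illustration and build as ordinary C outside the kernel.

/*
 * Sketch of the refcounted-SPTE gating pattern used in the patch.
 * Only the bit and the two predicate helpers correspond to real code;
 * the rest is a userspace stand-in.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BIT_ULL(n)		(1ULL << (n))
#define PT_WRITABLE_MASK	BIT_ULL(1)
/* Software-available bit marking SPTEs that map refcounted pages. */
#define SPTE_MMU_PAGE_REFCOUNTED	BIT_ULL(59)

static bool is_writable_pte(uint64_t spte)
{
	return spte & PT_WRITABLE_MASK;
}

static bool is_refcounted_page_pte(uint64_t spte)
{
	return spte & SPTE_MMU_PAGE_REFCOUNTED;
}

/* Stand-in for kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte))). */
static void set_backing_page_dirty(uint64_t spte)
{
	printf("SPTE %#llx: mark backing struct page dirty\n",
	       (unsigned long long)spte);
}

int main(void)
{
	/* One SPTE tagged as refcounted at make_spte() time, one not. */
	uint64_t sptes[] = {
		PT_WRITABLE_MASK | SPTE_MMU_PAGE_REFCOUNTED,
		PT_WRITABLE_MASK,
	};

	for (int i = 0; i < 2; i++) {
		/* Only tagged SPTEs reach the struct-page A/D helpers. */
		if (is_writable_pte(sptes[i]) && is_refcounted_page_pte(sptes[i]))
			set_backing_page_dirty(sptes[i]);
	}
	return 0;
}

Running the sketch prints one line, for the tagged SPTE only; the untagged one is skipped, just as the patched MMU paths skip kvm_set_page_dirty()/kvm_set_page_accessed() for non-refcounted mappings.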
-- 
2.42.0.283.g2d96d420d3-goog