From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-52002.amazon.com (smtp-fw-52002.amazon.com [52.119.213.150]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7BADC13B2B8; Tue, 10 Sep 2024 16:31:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.119.213.150 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985866; cv=none; b=fczq0UIecb/YHe2f1VbbMnZZ0aRAEPksoMqaPwb7cw3cGq5enjGXY7AtTj21DD2Irx+7xfBX0p+D9chPaftwnJkaL3g0JEFmcQJ1tru9bMpm+k92mWCoXdSrgOJwYhe3Zw+T0UtAsEv9GmZRPvvF6neKVd5GJp4K7o1dPw5Eamw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985866; c=relaxed/simple; bh=O7JPYiXoBuIHGT0WVtgUBICmUu2YOZgdMk9le+zDeiI=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Wlvog+wKYNPuYcB1EdtypQnOjmnnApRveaVkXyigEeLww7d94qe+0BZ1P+SiYxxv7hdE3pUI4yAyIW3ENyUBoe6dkSWx3ODFdSDeBuZRd2QzHk6ikQxKbuYqtXVYEboA8UGJFuKrWKXzt7nKwzIu1iyTSeElgzRqePw8t/od+Ko= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=raPb7c53; arc=none smtp.client-ip=52.119.213.150 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="raPb7c53" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985865; x=1757521865; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=CXxTji2PsKU8ektMXjm+AW02ug0kAumv6veW9lxkEvg=; b=raPb7c53rp/oW/quupIhTi0SVN3m8GcvF7/yYdbiNQEaYwsd3TGWdFhR sjO0MWbP5090NbsDIgrXvJsF0H2Y0taPRBeQYpidkVH5J9znzNvWibTXF TOuUFOn8Vyd1iE+4rEBpYEbf+Om0O/PkYgNR04SGBcEk9VWNdn6sZkg1f U=; X-IronPort-AV: E=Sophos;i="6.10,217,1719878400"; d="scan'208";a="658021874" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.43.8.6]) by smtp-border-fw-52002.iad7.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Sep 2024 16:31:01 +0000 Received: from EX19MTAUEB001.ant.amazon.com [10.0.44.209:38231] by smtpin.naws.us-east-1.prod.farcaster.email.amazon.dev [10.0.48.28:2525] with esmtp (Farcaster) id a2b39b4e-e66c-464d-9ecd-b79c04647c96; Tue, 10 Sep 2024 16:31:00 +0000 (UTC) X-Farcaster-Flow-ID: a2b39b4e-e66c-464d-9ecd-b79c04647c96 Received: from EX19D008UEA003.ant.amazon.com (10.252.134.116) by EX19MTAUEB001.ant.amazon.com (10.252.135.108) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:30:52 +0000 Received: from EX19MTAUWB001.ant.amazon.com (10.250.64.248) by EX19D008UEA003.ant.amazon.com (10.252.134.116) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:30:51 +0000 Received: from ua2d7e1a6107c5b.home (172.19.88.180) by mail-relay.amazon.com (10.250.64.254) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34 via Frontend Transport; Tue, 10 Sep 2024 16:30:47 
+0000 From: Patrick Roy To: , , , , , , , , , , , , , , , , , , , , CC: Patrick Roy , , , , , Subject: [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map Date: Tue, 10 Sep 2024 17:30:27 +0100 Message-ID: <20240910163038.1298452-2-roypat@amazon.co.uk> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk> References: <20240910163038.1298452-1-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add a flag to the KVM_CREATE_GUEST_MEMFD ioctl that causes gmem pfns to be removed from the host kernel's direct map. Memory is removed immediately after allocation and preparation of gmem folios (after preparation, as the prepare callback might expect the direct map entry to be present). Direct map entries are restored before kvm_arch_gmem_invalidate is called (as ->invalidate_folio is called before ->free_folio), for the same reason. Use the PG_private flag to indicate that a folio is part of gmem with direct map removal enabled. While in this patch, PG_private does have a meaning of "folio not in direct map", this will no longer be true in follow up patches. Gmem folios might get temporarily reinserted into the direct map, but the PG_private flag needs to remain set, as the folios will have private data that needs to be freed independently of direct map status. This is why kvm_gmem_folio_clear_private does not call folio_clear_private. kvm_gmem_{set,clear}_folio_private must be called with the folio lock held. To ensure that failures in kvm_gmem_{clear,set}_private do not cause system instability due to leaving holes in the direct map, try to always restore direct map entries on failure. Pages for which restoration of direct map entries fails are marked as HWPOISON, to prevent the kernel from ever touching them again. 
Signed-off-by: Patrick Roy --- include/uapi/linux/kvm.h | 2 + virt/kvm/guest_memfd.c | 96 +++++++++++++++++++++++++++++++++++++--- 2 files changed, 91 insertions(+), 7 deletions(-) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 637efc0551453..81b0f4a236b8c 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1564,6 +1564,8 @@ struct kvm_create_guest_memfd { __u64 reserved[6]; }; =20 +#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0) + #define KVM_PRE_FAULT_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memor= y) =20 struct kvm_pre_fault_memory { diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 1c509c3512614..2ed27992206f3 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -4,6 +4,7 @@ #include #include #include +#include =20 #include "kvm_mm.h" =20 @@ -49,8 +50,69 @@ static int kvm_gmem_prepare_folio(struct inode *inode, p= goff_t index, struct fol return 0; } =20 +static bool kvm_gmem_test_no_direct_map(struct inode *inode) +{ + return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) =3D=3D = KVM_GMEM_NO_DIRECT_MAP; +} + +static int kvm_gmem_folio_set_private(struct folio *folio) +{ + unsigned long start, npages, i; + int r; + + start =3D (unsigned long) folio_address(folio); + npages =3D folio_nr_pages(folio); + + for (i =3D 0; i < npages; ++i) { + r =3D set_direct_map_invalid_noflush(folio_page(folio, i)); + if (r) + goto out_remap; + } + flush_tlb_kernel_range(start, start + folio_size(folio)); + folio_set_private(folio); + return 0; +out_remap: + for (; i > 0; i--) { + struct page *page =3D folio_page(folio, i - 1); + + if (WARN_ON_ONCE(set_direct_map_default_noflush(page))) { + /* + * Random holes in the direct map are bad, let's mark + * these pages as corrupted memory so that the kernel + * avoids ever touching them again. + */ + folio_set_hwpoison(folio); + r =3D -EHWPOISON; + } + } + return r; +} + +static int kvm_gmem_folio_clear_private(struct folio *folio) +{ + unsigned long npages, i; + int r =3D 0; + + npages =3D folio_nr_pages(folio); + + for (i =3D 0; i < npages; ++i) { + struct page *page =3D folio_page(folio, i); + + if (WARN_ON_ONCE(set_direct_map_default_noflush(page))) { + folio_set_hwpoison(folio); + r =3D -EHWPOISON; + } + } + /* + * no TLB flush here: pages without direct map entries should + * never be in the TLB in the first place. + */ + return r; +} + static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index= , bool prepare) { + int r; struct folio *folio; =20 /* TODO: Support huge pages. */ @@ -78,19 +140,31 @@ static struct folio *kvm_gmem_get_folio(struct inode *= inode, pgoff_t index, bool } =20 if (prepare) { - int r =3D kvm_gmem_prepare_folio(inode, index, folio); - if (r < 0) { - folio_unlock(folio); - folio_put(folio); - return ERR_PTR(r); - } + r =3D kvm_gmem_prepare_folio(inode, index, folio); + if (r < 0) + goto out_err; } =20 + if (!kvm_gmem_test_no_direct_map(inode)) + goto out; + + if (!folio_test_private(folio)) { + r =3D kvm_gmem_folio_set_private(folio); + if (r) + goto out_err; + } + +out: /* * Ignore accessed, referenced, and dirty flags. The memory is * unevictable and there is no storage to write back to. 
*/ return folio; + +out_err: + folio_unlock(folio); + folio_put(folio); + return ERR_PTR(r); } =20 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start, @@ -343,6 +417,13 @@ static int kvm_gmem_error_folio(struct address_space *= mapping, struct folio *fol return MF_DELAYED; } =20 +static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, s= ize_t end) +{ + if (start =3D=3D 0 && end =3D=3D folio_size(folio)) { + kvm_gmem_folio_clear_private(folio); + } +} + #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE static void kvm_gmem_free_folio(struct folio *folio) { @@ -358,6 +439,7 @@ static const struct address_space_operations kvm_gmem_a= ops =3D { .dirty_folio =3D noop_dirty_folio, .migrate_folio =3D kvm_gmem_migrate_folio, .error_remove_folio =3D kvm_gmem_error_folio, + .invalidate_folio =3D kvm_gmem_invalidate_folio, #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE .free_folio =3D kvm_gmem_free_folio, #endif @@ -442,7 +524,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_= guest_memfd *args) { loff_t size =3D args->size; u64 flags =3D args->flags; - u64 valid_flags =3D 0; + u64 valid_flags =3D KVM_GMEM_NO_DIRECT_MAP; =20 if (flags & ~valid_flags) return -EINVAL; base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-52005.amazon.com (smtp-fw-52005.amazon.com [52.119.213.156]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BAD361A00F4; Tue, 10 Sep 2024 16:31:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.119.213.156 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985869; cv=none; b=GoMfbKpb1ayihZUZk2c5L/cc0c5NedW87MbUo8e+AboHN989uMap4o2GZqL8Z6hrjEWA9PGV3fkSveFyG/2uke5DQe0zL/SIne/K2bfgm082R4hPmEQVtchsT0PfMscpB5Mor/5WKNctGbSIm4sOdstd9JL3Gs5uTQ8TZgU04gk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985869; c=relaxed/simple; bh=skUzRcCSPy9WgItUAh3+swWfjgUWXocwpc0g2Acwcq4=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=TL83mF++Io7T82EkaLdCaKShbUtMrJzHOmRx/uWnxgqtiq/mO9TBtaPWaAKUG3MlGbcE0RzOKKQvr2PmQRMiEhcZwvtZQUggL+aLk96M8zGyc76AoRGgZNYhSZMUKxIjnyx7h5gapar0sy+4PwLbR92grRrrFtZAiSyA6+DNi88= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=v0ugoCKA; arc=none smtp.client-ip=52.119.213.156 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="v0ugoCKA" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985868; x=1757521868; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=OSB6/wHaKh12iBDoF1XKstABhzcwNvqjtvF5UFffeGQ=; b=v0ugoCKAYz3H3AUhu7tLrSyiIPgVnhlzaT4HtAkiL3SpYf0U35n2yM+1 ibIu5HSNAPzYTqZuuaMOC37soH0KrKo5NzgzRDK7E+bf8uOKJAeSxmaue RWKsVEmXPdhWPclqFGy0eCGMpj2/4/nErQbzNgrMHWPS1CvNKPA7yo7T0 Q=; X-IronPort-AV: E=Sophos;i="6.10,217,1719878400"; d="scan'208";a="679397384" 
Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.43.8.6]) by smtp-border-fw-52005.iad7.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Sep 2024 16:31:05 +0000 Received: from EX19MTAUEA002.ant.amazon.com [10.0.29.78:10984] by smtpin.naws.us-east-1.prod.farcaster.email.amazon.dev [10.0.46.235:2525] with esmtp (Farcaster) id 37b8be63-b91b-41ff-9b88-4b6db0d84ee2; Tue, 10 Sep 2024 16:31:04 +0000 (UTC) X-Farcaster-Flow-ID: 37b8be63-b91b-41ff-9b88-4b6db0d84ee2 Received: from EX19D008UEA002.ant.amazon.com (10.252.134.125) by EX19MTAUEA002.ant.amazon.com (10.252.134.9) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:30:57 +0000 Received: from EX19MTAUWB001.ant.amazon.com (10.250.64.248) by EX19D008UEA002.ant.amazon.com (10.252.134.125) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:30:57 +0000 Received: from ua2d7e1a6107c5b.home (172.19.88.180) by mail-relay.amazon.com (10.250.64.254) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34 via Frontend Transport; Tue, 10 Sep 2024 16:30:52 +0000 From: Patrick Roy To: , , , , , , , , , , , , , , , , , , , , CC: Patrick Roy , , , , , Subject: [RFC PATCH v2 02/10] kvm: gmem: Add KVM_GMEM_GET_PFN_SHARED Date: Tue, 10 Sep 2024 17:30:28 +0100 Message-ID: <20240910163038.1298452-3-roypat@amazon.co.uk> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk> References: <20240910163038.1298452-1-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" If `KVM_GMEM_NO_DIRECT_MAP` is set, all gmem folios are removed from the direct map immediately after allocation. Add a flag to kvm_gmem_grab_folio to overwrite this behavior, and expose it via `kvm_gmem_get_pfn`. Only allow this flag to be set if KVM can actually access gmem (currently only if the vm type is KVM_X86_SW_PROTECTED_VM). KVM_GMEM_GET_PFN_SHARED defers the direct map removal for newly allocated folios until kvm_gmem_put_shared_pfn is called. For existing folios, the direct map entry is temporarily restored until kvm_gmem_put_shared_pfn is called. The folio lock must be held the entire time the folio is present in the direct map, to prevent races with concurrent calls kvm_gmem_folio_set_private that might remove direct map entries while the folios are being accessed by KVM. As this is currently not possible (kvm_gmem_get_pfn always unlocks the folio), the next patch will introduce a KVM_GMEM_GET_PFN_LOCKED flag. 
Signed-off-by: Patrick Roy --- arch/x86/kvm/mmu/mmu.c | 2 +- include/linux/kvm_host.h | 12 +++++++++-- virt/kvm/guest_memfd.c | 46 +++++++++++++++++++++++++++++++--------- 3 files changed, 47 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 901be9e420a4c..cb2f111f2cce0 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4349,7 +4349,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *v= cpu, } =20 r =3D kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn, - &max_order); + &max_order, 0); if (r) { kvm_mmu_prepare_memory_fault_exit(vcpu, fault); return r; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 689e8be873a75..8a2975674de4b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2432,17 +2432,25 @@ static inline bool kvm_mem_is_private(struct kvm *k= vm, gfn_t gfn) } #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */ =20 +#define KVM_GMEM_GET_PFN_SHARED BIT(0) +#define KVM_GMEM_GET_PFN_PREPARE BIT(31) /* internal */ + #ifdef CONFIG_KVM_PRIVATE_MEM int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, - gfn_t gfn, kvm_pfn_t *pfn, int *max_order); + gfn_t gfn, kvm_pfn_t *pfn, int *max_order, unsigned long flags); +int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn); #else static inline int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, - kvm_pfn_t *pfn, int *max_order) + kvm_pfn_t *pfn, int *max_order, int flags) { KVM_BUG_ON(1, kvm); return -EIO; } +static inline int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn) +{ + return -EIO; +} #endif /* CONFIG_KVM_PRIVATE_MEM */ =20 #ifdef CONFIG_HAVE_KVM_GMEM_PREPARE diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 2ed27992206f3..492b04f4e5c18 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -55,6 +55,11 @@ static bool kvm_gmem_test_no_direct_map(struct inode *in= ode) return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) =3D=3D = KVM_GMEM_NO_DIRECT_MAP; } =20 +static bool kvm_gmem_test_accessible(struct kvm *kvm) +{ + return kvm->arch.vm_type =3D=3D KVM_X86_SW_PROTECTED_VM; +} + static int kvm_gmem_folio_set_private(struct folio *folio) { unsigned long start, npages, i; @@ -110,10 +115,11 @@ static int kvm_gmem_folio_clear_private(struct folio = *folio) return r; } =20 -static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index= , bool prepare) +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index= , unsigned long flags) { int r; struct folio *folio; + bool share =3D flags & KVM_GMEM_GET_PFN_SHARED; =20 /* TODO: Support huge pages. */ folio =3D filemap_grab_folio(inode->i_mapping, index); @@ -139,7 +145,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *i= node, pgoff_t index, bool folio_mark_uptodate(folio); } =20 - if (prepare) { + if (flags & KVM_GMEM_GET_PFN_PREPARE) { r =3D kvm_gmem_prepare_folio(inode, index, folio); if (r < 0) goto out_err; @@ -148,12 +154,15 @@ static struct folio *kvm_gmem_get_folio(struct inode = *inode, pgoff_t index, bool if (!kvm_gmem_test_no_direct_map(inode)) goto out; =20 - if (!folio_test_private(folio)) { + if (folio_test_private(folio) && share) { + r =3D kvm_gmem_folio_clear_private(folio); + } else if (!folio_test_private(folio) && !share) { r =3D kvm_gmem_folio_set_private(folio); - if (r) - goto out_err; } =20 + if (r) + goto out_err; + out: /* * Ignore accessed, referenced, and dirty flags. 
The memory is @@ -264,7 +273,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff= _t offset, loff_t len) break; } =20 - folio =3D kvm_gmem_get_folio(inode, index, true); + folio =3D kvm_gmem_get_folio(inode, index, KVM_GMEM_GET_PFN_PREPARE); if (IS_ERR(folio)) { r =3D PTR_ERR(folio); break; @@ -624,7 +633,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot) } =20 static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *s= lot, - gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare) + gfn_t gfn, kvm_pfn_t *pfn, int *max_order, unsigned long flags) { pgoff_t index =3D gfn - slot->base_gfn + slot->gmem.pgoff; struct kvm_gmem *gmem =3D file->private_data; @@ -643,7 +652,7 @@ static int __kvm_gmem_get_pfn(struct file *file, struct= kvm_memory_slot *slot, return -EIO; } =20 - folio =3D kvm_gmem_get_folio(file_inode(file), index, prepare); + folio =3D kvm_gmem_get_folio(file_inode(file), index, flags); if (IS_ERR(folio)) return PTR_ERR(folio); =20 @@ -667,20 +676,37 @@ static int __kvm_gmem_get_pfn(struct file *file, stru= ct kvm_memory_slot *slot, } =20 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, - gfn_t gfn, kvm_pfn_t *pfn, int *max_order) + gfn_t gfn, kvm_pfn_t *pfn, int *max_order, unsigned long flags) { struct file *file =3D kvm_gmem_get_file(slot); int r; + int valid_flags =3D KVM_GMEM_GET_PFN_SHARED; + + if ((flags & valid_flags) !=3D flags) + return -EINVAL; + + if ((flags & KVM_GMEM_GET_PFN_SHARED) && !kvm_gmem_test_accessible(kvm)) + return -EPERM; =20 if (!file) return -EFAULT; =20 - r =3D __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, true); + r =3D __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, flags | KVM_GME= M_GET_PFN_PREPARE); fput(file); return r; } EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn); =20 +int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn) { + struct folio *folio =3D pfn_folio(pfn); + + if (!kvm_gmem_test_no_direct_map(folio_inode(folio))) + return 0; + + return kvm_gmem_folio_set_private(folio); +} +EXPORT_SYMBOL_GPL(kvm_gmem_put_shared_pfn); + long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,= long npages, kvm_gmem_populate_cb post_populate, void *opaque) { --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-80008.amazon.com (smtp-fw-80008.amazon.com [99.78.197.219]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D76B516C684; Tue, 10 Sep 2024 16:31:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.219 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985879; cv=none; b=mroDCstz6lnTrvDXTES7ke+uAn8c9lTR4K4TQvzJD36e6u89i78Lz/L2DKurXo3+b9wniHX2vstgZECrGAyhV7kCbi/AUD+UU4/EwYwQQWfUe8uCSa2dG2VQZrFV6u8ZdZ0AJ6EFBFo/gpnnA8fRj53TPy+ydWln+vRlYpplBX0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985879; c=relaxed/simple; bh=mxJU2TzVkIfcGGJro+cv+6yOzAahJ5oXQEU80AyRFq8=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=qsdSAFgWR1TloPAlvx5lypO6N9x+DJj0AA4e654cI4SaBuVxxR6qPcXM7hGNaqGcgmHvIhmAYXtwKUjWPcvLARTOty02t1dNGatRg3JGeHbUa12N/G0jL2dk8EmWpqqq9rEM4uvjzRdy+gw17ex1gCIk3/f8ktxEXAeJm/lkBI0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk 
header.b=Tkf1jH9e; arc=none smtp.client-ip=99.78.197.219 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="Tkf1jH9e" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985877; x=1757521877; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=v5e1Lttypa2CCeJrmZUnbjUqOb8jI+6pZUoDp94zyOM=; b=Tkf1jH9ejSX0KEEilAQPnWXaU+tMTvtl3FiAdNPhDHg+tl9nzU9DY00P mkpYYyN2cKbIAVvucggKCHLKpMgngmHt2A8ZgCnJbEMZFmHDTbuBa8nCf N2K0mGWNVBE0P6UsfT9Bj4EFmQtvSx25J62BtdC6unUnHONp3qc0ORtSA A=; X-IronPort-AV: E=Sophos;i="6.10,217,1719878400"; d="scan'208";a="124612846" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.214]) by smtp-border-fw-80008.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Sep 2024 16:31:09 +0000 Received: from EX19MTAUEA002.ant.amazon.com [10.0.29.78:9542] by smtpin.naws.us-east-1.prod.farcaster.email.amazon.dev [10.0.42.209:2525] with esmtp (Farcaster) id 7c6ae1ed-f922-4596-94d5-b5debded213c; Tue, 10 Sep 2024 16:31:08 +0000 (UTC) X-Farcaster-Flow-ID: 7c6ae1ed-f922-4596-94d5-b5debded213c Received: from EX19D008UEC004.ant.amazon.com (10.252.135.170) by EX19MTAUEA002.ant.amazon.com (10.252.134.9) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:03 +0000 Received: from EX19MTAUWB001.ant.amazon.com (10.250.64.248) by EX19D008UEC004.ant.amazon.com (10.252.135.170) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:02 +0000 Received: from ua2d7e1a6107c5b.home (172.19.88.180) by mail-relay.amazon.com (10.250.64.254) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34 via Frontend Transport; Tue, 10 Sep 2024 16:30:58 +0000 From: Patrick Roy To: , , , , , , , , , , , , , , , , , , , , CC: Patrick Roy , , , , , Subject: [RFC PATCH v2 03/10] kvm: gmem: Add KVM_GMEM_GET_PFN_LOCKED Date: Tue, 10 Sep 2024 17:30:29 +0100 Message-ID: <20240910163038.1298452-4-roypat@amazon.co.uk> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk> References: <20240910163038.1298452-1-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Allow kvm_gmem_get_pfn to return with the folio lock held by adding a KVM_GMEM_GET_PFN_LOCKED option to `flags`. When accessing the content of gmem folios, the lock must be held until kvm_gmem_put_pfn, to avoid concurrent direct map modifications of the same folio causing use-after-free-like problems. However, kvm_gmem_get_pfn so far unconditionally drops the folio lock, making it currently impossible to use the KVM_GMEM_GET_PFN_SHARED flag safely. 
Signed-off-by: Patrick Roy --- include/linux/kvm_host.h | 1 + virt/kvm/guest_memfd.c | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 8a2975674de4b..cd28eb34aaeb1 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2433,6 +2433,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm= , gfn_t gfn) #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */ =20 #define KVM_GMEM_GET_PFN_SHARED BIT(0) +#define KVM_GMEM_GET_PFN_LOCKED BIT(1) #define KVM_GMEM_GET_PFN_PREPARE BIT(31) /* internal */ =20 #ifdef CONFIG_KVM_PRIVATE_MEM diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 492b04f4e5c18..f637abc6045ba 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -670,7 +670,8 @@ static int __kvm_gmem_get_pfn(struct file *file, struct= kvm_memory_slot *slot, =20 r =3D 0; =20 - folio_unlock(folio); + if (!(flags & KVM_GMEM_GET_PFN_LOCKED)) + folio_unlock(folio); =20 return r; } @@ -680,7 +681,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory= _slot *slot, { struct file *file =3D kvm_gmem_get_file(slot); int r; - int valid_flags =3D KVM_GMEM_GET_PFN_SHARED; + int valid_flags =3D KVM_GMEM_GET_PFN_SHARED | KVM_GMEM_GET_PFN_LOCKED; =20 if ((flags & valid_flags) !=3D flags) return -EINVAL; --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-80009.amazon.com (smtp-fw-80009.amazon.com [99.78.197.220]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D9F831A42DB; Tue, 10 Sep 2024 16:31:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.220 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985894; cv=none; b=u28r33VW3SXZqNxR/Hr5/ZbhmRyvQNjLZVW/2ZcbcvugUAYHSK1l8LN/TjFTpDFMgjFhV0yglwh3p5JM8JnQmLkI8IUvAKvYTde8XidZWV2vj1sWTjHYvJ+P5pNdWk7LXs6wVcFh/yfa2P5FHKSOBqFaJ80li3JaJSR8Ps40RrU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985894; c=relaxed/simple; bh=BGx1rC1njIBlKD9diD98c69YvRK0p/zTsudpQQw1hYI=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=kbyHG4ZdZ+iikA0GJU+15jb48scK9FxHChUwhZ2VqRKFpQ6eTQFgCxtqTvI7KTxVuUffM2I9Cf9B3hZv26cGnCgyBYn2K+QeHWBR2gmoUw7vsUQ1kjK/16bF+XTtiuqrayFyDNdq/yRon8/TsvKSmYCLW7euxptR1jwJRdqDlts= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=O5pyT0+p; arc=none smtp.client-ip=99.78.197.220 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="O5pyT0+p" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985892; x=1757521892; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Mg6x7TwXBxF2qrgNjQW5TE5s/jw6SkTYEEB7LumyV7k=; b=O5pyT0+pbzl00PxxshCYQOm2J5DYsVr267+2ehBgluZBP6rjP0a5BwGp DJXQgd2h+ullHm4LJNlWsEq4L80ggZMYviTwbF5iosTKTe3SPg/IFtE78 RgKr+cKoaMLLELIqUuvYfvb6f9hPmKM9+P2TjjaUZCUDuDT3YSKffSee8 Y=; 
X-IronPort-AV: E=Sophos;i="6.10,217,1719878400"; d="scan'208";a="124249487" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-80009.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Sep 2024 16:31:20 +0000 Received: from EX19MTAUEA001.ant.amazon.com [10.0.29.78:64554] by smtpin.naws.us-east-1.prod.farcaster.email.amazon.dev [10.0.48.28:2525] with esmtp (Farcaster) id c989717d-c1d0-4610-a59c-ea42657013e3; Tue, 10 Sep 2024 16:31:19 +0000 (UTC) X-Farcaster-Flow-ID: c989717d-c1d0-4610-a59c-ea42657013e3 Received: from EX19D008UEA002.ant.amazon.com (10.252.134.125) by EX19MTAUEA001.ant.amazon.com (10.252.134.203) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:09 +0000 Received: from EX19MTAUWB001.ant.amazon.com (10.250.64.248) by EX19D008UEA002.ant.amazon.com (10.252.134.125) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:08 +0000 Received: from ua2d7e1a6107c5b.home (172.19.88.180) by mail-relay.amazon.com (10.250.64.254) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34 via Frontend Transport; Tue, 10 Sep 2024 16:31:04 +0000 From: Patrick Roy To: , , , , , , , , , , , , , , , , , , , , CC: Patrick Roy , , , , , Subject: [RFC PATCH v2 04/10] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Date: Tue, 10 Sep 2024 17:30:30 +0100 Message-ID: <20240910163038.1298452-5-roypat@amazon.co.uk> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk> References: <20240910163038.1298452-1-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" If KVM can access guest_memfd memory (or at least convert it into a state in which KVM can access it) without causing a host-kernel panic (e.g. currently only if the vm type is KVM_X86_SW_PROTECTED_VM), allow `kvm_{read,write}_guest` to access gfns that are backed by gmem. If KVM cannot access guest_memfd memory (say, because it is running a TDX VM), prepare a KVM_EXIT_MEMORY_FAULT (if possible) and return -EFAULT. KVM can only prepare the memory fault exit inside the `kvm_vcpu_{read,write}_guest` variant, as it needs a vcpu reference to assign the exit reason to. KVM accesses to gmem are done via the direct map (as no userspace mappings exist, and even if they existed, they wouldn't be reflected into the memslots). If `KVM_GMEM_NO_DIRECT_MAP` is set, then temporarily reinsert the accessed folio into the direct map. Hold the folio lock for the entire duration of the access to prevent concurrent direct map modifications from taking place (as these might remove the direct map entry while kvm_{read,write}_guest is using it, which would result in a panic). 
Signed-off-by: Patrick Roy --- virt/kvm/kvm_main.c | 83 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 83 insertions(+) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d0788d0a72cc0..13347fb03d4a9 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3286,11 +3286,51 @@ static int __kvm_read_guest_page(struct kvm_memory_= slot *slot, gfn_t gfn, return 0; } =20 +static int __kvm_read_guest_private_page(struct kvm *kvm, + struct kvm_memory_slot *memslot, gfn_t gfn, + void *data, int offset, int len) +{ + kvm_pfn_t pfn; + int r; + struct folio *folio; + + r =3D kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, NULL, + KVM_GMEM_GET_PFN_SHARED | KVM_GMEM_GET_PFN_LOCKED); + + if (r < 0) + return r; + + folio =3D pfn_folio(pfn); + memcpy(data, folio_address(folio) + offset, len); + r =3D kvm_gmem_put_shared_pfn(pfn); + folio_unlock(folio); + folio_put(folio); + return r; +} + +static int __kvm_vcpu_read_guest_private_page(struct kvm_vcpu *vcpu, + struct kvm_memory_slot *memslot, gfn_t gfn, + void *data, int offset, int len) +{ + int r =3D __kvm_read_guest_private_page(vcpu->kvm, memslot, gfn, data, of= fset, len); + + /* kvm not allowed to access gmem */ + if (r =3D=3D -EPERM) { + kvm_prepare_memory_fault_exit(vcpu, gfn + offset, len, false, + false, true); + return -EFAULT; + } + + return r; +} + int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, int len) { struct kvm_memory_slot *slot =3D gfn_to_memslot(kvm, gfn); =20 + if (kvm_mem_is_private(kvm, gfn)) + return __kvm_read_guest_private_page(kvm, slot, gfn, data, offset, len); return __kvm_read_guest_page(slot, gfn, data, offset, len); } EXPORT_SYMBOL_GPL(kvm_read_guest_page); @@ -3300,6 +3340,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, g= fn_t gfn, void *data, { struct kvm_memory_slot *slot =3D kvm_vcpu_gfn_to_memslot(vcpu, gfn); =20 + if (kvm_mem_is_private(vcpu->kvm, gfn)) + return __kvm_vcpu_read_guest_private_page(vcpu, slot, gfn, data, offset,= len); return __kvm_read_guest_page(slot, gfn, data, offset, len); } EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page); @@ -3390,11 +3432,50 @@ static int __kvm_write_guest_page(struct kvm *kvm, return 0; } =20 +static int __kvm_write_guest_private_page(struct kvm *kvm, + struct kvm_memory_slot *memslot, gfn_t gfn, + const void *data, int offset, int len) +{ + kvm_pfn_t pfn; + int r; + struct folio *folio; + + r =3D kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, NULL, + KVM_GMEM_GET_PFN_SHARED | KVM_GMEM_GET_PFN_LOCKED); + + if (r < 0) + return r; + + folio =3D pfn_folio(pfn); + memcpy(folio_address(folio) + offset, data, len); + r =3D kvm_gmem_put_shared_pfn(pfn); + folio_unlock(folio); + folio_put(folio); + return r; +} + +static int __kvm_vcpu_write_guest_private_page(struct kvm_vcpu *vcpu, + struct kvm_memory_slot *memslot, gfn_t gfn, + const void *data, int offset, int len) +{ + int r =3D __kvm_write_guest_private_page(vcpu->kvm, memslot, gfn, data, o= ffset, len); + + if (r =3D=3D -EPERM) { + kvm_prepare_memory_fault_exit(vcpu, gfn + offset, len, true, + false, true); + return -EFAULT; + } + + return r; +} + int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data, int offset, int len) { struct kvm_memory_slot *slot =3D gfn_to_memslot(kvm, gfn); =20 + if (kvm_mem_is_private(kvm, gfn)) + return __kvm_write_guest_private_page(kvm, slot, gfn, data, offset, len); return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len); } EXPORT_SYMBOL_GPL(kvm_write_guest_page); @@ -3404,6 +3485,8 @@ int kvm_vcpu_write_guest_page(struct 
kvm_vcpu *vcpu, = gfn_t gfn, { struct kvm_memory_slot *slot =3D kvm_vcpu_gfn_to_memslot(vcpu, gfn); =20 + if (kvm_mem_is_private(vcpu->kvm, gfn)) + return __kvm_vcpu_write_guest_private_page(vcpu, slot, gfn, data, offset= , len); return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len); } EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page); --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-80008.amazon.com (smtp-fw-80008.amazon.com [99.78.197.219]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E6D6B16C684; Tue, 10 Sep 2024 16:31:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.219 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985886; cv=none; b=KgxiejjSa9kz9zBH/DDUe4itqGK+7eLcgfkUzZzpvl453E8Af5zF7g2i9U81lccVVugqOeN/a+g7teMJpg4UOvcsRxW2+ZHJQoQ6PAniYiOQTEPuq4m3ysE+iMs8OPGnjrZmj4ylOY+GYbtZg6kI/oc+PTBQqt7siv8tmfNK3QY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985886; c=relaxed/simple; bh=wEvphmOWMZdu1sRx5FSK/ZuJYcq8ehXHEFENxlCUHAQ=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=lBJ4K+X66b7MSVhfHjHEcCIdXIfkBEQQLMC8fzRLBJECcQqv5bsDdcG9RuuV/jkt+6pLvbeuiSwS5tem67gw0nUENZaQVtbByKbBpaUlmVuI1/EcXd/Vs+Wk24gt/RCxjI3qVQ/CrBN+xk8cF3rqZ4loGQH5zqXdcdXv2suuYVY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=WzG2bs3o; arc=none smtp.client-ip=99.78.197.219 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="WzG2bs3o" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985884; x=1757521884; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=YAD9FwaD5CpRjHhI+oh6kuxBv4yx2jFRf65FGqhbrxU=; b=WzG2bs3oDGvTBivf2Tn/jJZvqQycr94wlVv2bWbnnRVojGA16ZYJUB+M S6odfKQPQo+vxuajw4pAgPBEMgKc+sAdUoc8CwzVP93H7VUaQWNBnf1Nu ZYEkF2iIxzYS0OF3q1NRv7kZzycOouMW3Iw6ynVBkbaQ9WIPiv813uEGC A=; X-IronPort-AV: E=Sophos;i="6.10,217,1719878400"; d="scan'208";a="124612986" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.214]) by smtp-border-fw-80008.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Sep 2024 16:31:23 +0000 Received: from EX19MTAUEA001.ant.amazon.com [10.0.29.78:42006] by smtpin.naws.us-east-1.prod.farcaster.email.amazon.dev [10.0.30.239:2525] with esmtp (Farcaster) id 0e99c686-f4d2-458c-998e-c58de7385fd9; Tue, 10 Sep 2024 16:31:22 +0000 (UTC) X-Farcaster-Flow-ID: 0e99c686-f4d2-458c-998e-c58de7385fd9 Received: from EX19D008UEA004.ant.amazon.com (10.252.134.191) by EX19MTAUEA001.ant.amazon.com (10.252.134.203) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:14 +0000 Received: from EX19MTAUWB001.ant.amazon.com (10.250.64.248) by EX19D008UEA004.ant.amazon.com (10.252.134.191) with 
Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:14 +0000 Received: from ua2d7e1a6107c5b.home (172.19.88.180) by mail-relay.amazon.com (10.250.64.254) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34 via Frontend Transport; Tue, 10 Sep 2024 16:31:09 +0000 From: Patrick Roy To: , , , , , , , , , , , , , , , , , , , , CC: Patrick Roy , , , , , Subject: [RFC PATCH v2 05/10] kvm: gmem: Refcount internal accesses to gmem Date: Tue, 10 Sep 2024 17:30:31 +0100 Message-ID: <20240910163038.1298452-6-roypat@amazon.co.uk> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk> References: <20240910163038.1298452-1-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Currently, if KVM_GMEM_NO_DIRECT_MAP is set and KVM wants to internally access a gmem folio, KVM needs to reinsert the folio into the direct map, and hold the folio lock until KVM is done using the folio (and the folio is removed from the direct map again). This means that long-term reinsertion into the direct map, and concurrent accesses to the same gmem folio are currently impossible. These are needed however for data structures of paravirtual devices, such as kvm-clock, which are shared between guest and host via guest memory pages (and multiple vCPUs can put their kvm-clock data into the same guest page). Thus, introduce the concept of a "sharing refcount", which gets incremented on every call to kvm_gmem_get_pfn with KVM_GMEM_GET_PFN_SHARED set. Direct map manipulations are only done when the first refcount is grabbed (direct map entries are restored), or when the last reference goes away (direct map entries are removed). While holding a sharing reference, the folio lock may be dropped, as the refcounting ensures that the direct map entry will not be removed as long as at least one reference is held. However, whoever is holding a reference will need to listen and respond to gmem invalidation events (such as the page being in the process of being fallocated away). Since refcount_t does not play nicely with references dropping to 0 and later being raised again (it will WARN), we use a refcount of 1 to mean "no sharing references held anywhere, folio not in direct map". Signed-off-by: Patrick Roy --- virt/kvm/guest_memfd.c | 61 +++++++++++++++++++++++++++++++++++++++--- 1 file changed, 58 insertions(+), 3 deletions(-) diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index f637abc6045ba..6772253497e4d 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -60,10 +60,37 @@ static bool kvm_gmem_test_accessible(struct kvm *kvm) return kvm->arch.vm_type =3D=3D KVM_X86_SW_PROTECTED_VM; } =20 +static int kvm_gmem_init_sharing_count(struct folio *folio) +{ + refcount_t *sharing_count =3D kmalloc(sizeof(*sharing_count), GFP_KERNEL); + + if (!sharing_count) + return -ENOMEM; + + /* + * we need to use sharing_count =3D=3D 1 to mean "no sharing", because + * dropping a refcount_t to 0 and later incrementing it again would + * result in a WARN. 
+ */ + refcount_set(sharing_count, 1); + folio_change_private(folio, (void *)sharing_count); + + return 0; +} + static int kvm_gmem_folio_set_private(struct folio *folio) { unsigned long start, npages, i; int r; + unsigned int sharing_refcount =3D refcount_read(folio_get_private(folio)); + + /* + * We must only remove direct map entries after the last internal + * reference has gone away, e.g. after the refcount dropped back + * to 1. + */ + WARN_ONCE(sharing_refcount !=3D 1, "%d unexpected sharing_refcounts pfn= =3D%lx", + sharing_refcount - 1, folio_pfn(folio)); =20 start =3D (unsigned long) folio_address(folio); npages =3D folio_nr_pages(folio); @@ -97,6 +124,15 @@ static int kvm_gmem_folio_clear_private(struct folio *f= olio) { unsigned long npages, i; int r =3D 0; + unsigned int sharing_refcount =3D refcount_read(folio_get_private(folio)); + + /* + * We must restore direct map entries on acquiring the first "sharing + * reference". The refcount is lifted _after_ the call to + * kvm_gmem_folio_clear_private, so it will still be 1 here. + */ + WARN_ONCE(sharing_refcount !=3D 1, "%d unexpected sharing_refcounts pfn= =3D%lx", + sharing_refcount - 1, folio_pfn(folio)); =20 npages =3D folio_nr_pages(folio); =20 @@ -156,13 +192,21 @@ static struct folio *kvm_gmem_get_folio(struct inode = *inode, pgoff_t index, unsi =20 if (folio_test_private(folio) && share) { r =3D kvm_gmem_folio_clear_private(folio); - } else if (!folio_test_private(folio) && !share) { - r =3D kvm_gmem_folio_set_private(folio); + } else if (!folio_test_private(folio)) { + r =3D kvm_gmem_init_sharing_count(folio); + if (r) + goto out_err; + + if (!share) + r =3D kvm_gmem_folio_set_private(folio); } =20 if (r) goto out_err; =20 + if (share) + refcount_inc(folio_get_private(folio)); + out: /* * Ignore accessed, referenced, and dirty flags. 
The memory is @@ -429,7 +473,10 @@ static int kvm_gmem_error_folio(struct address_space *= mapping, struct folio *fol static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, s= ize_t end) { if (start =3D=3D 0 && end =3D=3D folio_size(folio)) { + refcount_t *sharing_count =3D folio_get_private(folio); + kvm_gmem_folio_clear_private(folio); + kfree(sharing_count); } } =20 @@ -699,12 +746,20 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memo= ry_slot *slot, EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn); =20 int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn) { + int r =3D 0; struct folio *folio =3D pfn_folio(pfn); + refcount_t *sharing_count; =20 if (!kvm_gmem_test_no_direct_map(folio_inode(folio))) return 0; =20 - return kvm_gmem_folio_set_private(folio); + sharing_count =3D folio_get_private(folio); + refcount_dec(sharing_count); + + if (refcount_read(sharing_count) =3D=3D 1) + r =3D kvm_gmem_folio_set_private(folio); + + return r; } EXPORT_SYMBOL_GPL(kvm_gmem_put_shared_pfn); =20 --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-52003.amazon.com (smtp-fw-52003.amazon.com [52.119.213.152]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 71D241A3BBF; Tue, 10 Sep 2024 16:31:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.119.213.152 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985892; cv=none; b=CleV3QRO+YT1194AegwfRizqrnMVWuuiLn2N8/diPrILdOcqDRERaGiW0+TNHY5NbPVzm/UYK5zMg2SChkbau9YXH/Er4NZGR4X24baso8W62u290bBaF4WLc1Li2+5BoowFkX1WVU4C9jUdJpvisD94odcWfeo1HCqz7KosNAE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985892; c=relaxed/simple; bh=8fLwxy/sXjJUCG6iZQqh0XVJKhqxL9wQGHAmt38CmPk=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=mXkj2Yoxc0WUSID/U1JcLNv11R6dcEBJ8jXfeIAsHnf2SiAmpE1r6WnZK4NU2CZcWkS9x6JnTCLC/ZqQaIyfWlFiFlzy7Yr7ZXcHXl3ANJKVxTenDqjytJQ9RKc7jDQuv5HhTrTyeecrwZSzhWYeFnDB+8Gni/qDHQ6SzHMkfxM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=L6P1N3zg; arc=none smtp.client-ip=52.119.213.152 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="L6P1N3zg" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985891; x=1757521891; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=O39OHNSY2jvqQ+6nIWGz3rExj3YlDzBo0CAgdRJUjCQ=; b=L6P1N3zgMWEd/s2Vu6hq9O4Ld34W5/HIf2n0JjEnLv5o1x/hrnr7awt5 1MKTVsaRVUE2XEYbbdpM2+vqCXbI4Jd2u7fmI9qn/S8gAnrNRjmmjZJMY bq1nMej7iESpy2ksQGwl1Nb4t1/4IGIzlk4rudmdyYc335eTX/z595tEX A=; X-IronPort-AV: E=Sophos;i="6.10,217,1719878400"; d="scan'208";a="24649840" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.43.8.6]) by smtp-border-fw-52003.iad7.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Sep 2024 16:31:26 +0000 Received: from 
EX19MTAUEC002.ant.amazon.com [10.0.0.204:15768] by smtpin.naws.us-east-1.prod.farcaster.email.amazon.dev [10.0.46.235:2525] with esmtp (Farcaster) id a66d20fe-467b-4f84-9176-15708e5e7cff; Tue, 10 Sep 2024 16:31:25 +0000 (UTC) X-Farcaster-Flow-ID: a66d20fe-467b-4f84-9176-15708e5e7cff Received: from EX19D008UEA004.ant.amazon.com (10.252.134.191) by EX19MTAUEC002.ant.amazon.com (10.252.135.253) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:20 +0000 Received: from EX19MTAUWB001.ant.amazon.com (10.250.64.248) by EX19D008UEA004.ant.amazon.com (10.252.134.191) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:19 +0000 Received: from ua2d7e1a6107c5b.home (172.19.88.180) by mail-relay.amazon.com (10.250.64.254) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34 via Frontend Transport; Tue, 10 Sep 2024 16:31:15 +0000 From: Patrick Roy To: , , , , , , , , , , , , , , , , , , , , CC: Patrick Roy , , , , , Subject: [RFC PATCH v2 06/10] kvm: gmem: add tracepoints for gmem share/unshare Date: Tue, 10 Sep 2024 17:30:32 +0100 Message-ID: <20240910163038.1298452-7-roypat@amazon.co.uk> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk> References: <20240910163038.1298452-1-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add tracepoints for calls to kvm_gmem_get_folio that cause the returned folio to be considered "shared" (e.g. accessible by host KVM), and tracepoint for when KVM is done accessing a gmem pfn (kvm_gmem_put_shared_pfn). The above operations can cause folios to be insert/removed into/from the direct map. We want to be able to make sure that only those gmem folios that we expect KVM to access are ever reinserted into the direct map, and that all folios that are temporarily reinserted are also removed again at a later point. Processing ftrace output is one way to verify this. 
Signed-off-by: Patrick Roy --- include/trace/events/kvm.h | 43 ++++++++++++++++++++++++++++++++++++++ virt/kvm/guest_memfd.c | 7 ++++++- 2 files changed, 49 insertions(+), 1 deletion(-) diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h index 74e40d5d4af42..4a40fd4c22f91 100644 --- a/include/trace/events/kvm.h +++ b/include/trace/events/kvm.h @@ -489,6 +489,49 @@ TRACE_EVENT(kvm_test_age_hva, TP_printk("mmu notifier test age hva: %#016lx", __entry->hva) ); =20 +#ifdef CONFIG_KVM_PRIVATE_MEM +TRACE_EVENT(kvm_gmem_share, + TP_PROTO(struct folio *folio, pgoff_t index), + TP_ARGS(folio, index), + + TP_STRUCT__entry( + __field(unsigned int, sharing_count) + __field(kvm_pfn_t, pfn) + __field(pgoff_t, index) + __field(unsigned long, npages) + ), + + TP_fast_assign( + __entry->sharing_count =3D refcount_read(folio_get_private(folio)); + __entry->pfn =3D folio_pfn(folio); + __entry->index =3D index; + __entry->npages =3D folio_nr_pages(folio); + ), + + TP_printk("pfn=3D0x%llx index=3D%lu pages=3D%lu (refcount now %d)", + __entry->pfn, __entry->index, __entry->npages, __entry->sharing= _count - 1) +); + +TRACE_EVENT(kvm_gmem_unshare, + TP_PROTO(kvm_pfn_t pfn), + TP_ARGS(pfn), + + TP_STRUCT__entry( + __field(unsigned int, sharing_count) + __field(kvm_pfn_t, pfn) + ), + + TP_fast_assign( + __entry->sharing_count =3D refcount_read(folio_get_private(pfn_folio(pfn= ))); + __entry->pfn =3D pfn; + ), + + TP_printk("pfn=3D0x%llx (refcount now %d)", + __entry->pfn, __entry->sharing_count - 1) +) + +#endif + #endif /* _TRACE_KVM_MAIN_H */ =20 /* This part must be outside protection */ diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 6772253497e4d..742eba36d2371 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -7,6 +7,7 @@ #include =20 #include "kvm_mm.h" +#include "trace/events/kvm.h" =20 struct kvm_gmem { struct kvm *kvm; @@ -204,8 +205,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *= inode, pgoff_t index, unsi if (r) goto out_err; =20 - if (share) + if (share) { refcount_inc(folio_get_private(folio)); + trace_kvm_gmem_share(folio, index); + } =20 out: /* @@ -759,6 +762,8 @@ int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn) { if (refcount_read(sharing_count) =3D=3D 1) r =3D kvm_gmem_folio_set_private(folio); =20 + trace_kvm_gmem_unshare(pfn); + return r; } EXPORT_SYMBOL_GPL(kvm_gmem_put_shared_pfn); --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-33001.amazon.com (smtp-fw-33001.amazon.com [207.171.190.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 70CBB1A3BD3; Tue, 10 Sep 2024 16:31:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=207.171.190.10 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985896; cv=none; b=PpcNJAqWUOKaNz3L06XRUewK8OWOd8OkuYBjhLKlCTKR0E/8QDTCW+CAgQFrRov1fdPVcb8GpW/J6L0Oa1OqllPu+wrRMAyLccR3971BouPRvzvGEWTR4r2sHM63yYnTxVzHVAdrYyWHcs79QyC9ICWqkl1OiaBGhILxrH968aU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985896; c=relaxed/simple; bh=KuKlPSFBk+6o2cA7opO9wLlxrDIRu2/Eon9+LUTmQ8c=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=cd92UtdUtLTGa1T/h1OhBvVEgjTvxMavABZe1Cnh4oTEc2xKz4KP+eS8v31t3V+WjVpD+jM2296LyCO1zdC54xOQONmZDr3/7LZLz/eYQ/T5IepWUbWsEnX2mZSvRHvEi/pGsgmFOjFHkgg/I5I4mGNQdWObJM2yLMhD1hs9jLc= ARC-Authentication-Results: i=1; 
smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=vxctVS38; arc=none smtp.client-ip=207.171.190.10 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="vxctVS38" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985895; x=1757521895; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=umyhObMpA1DDGyKFrbGuBdzuz/jHr6ecIUxm4GlHskc=; b=vxctVS38/4rbtRl6GfW+W4P4w53xJpkOSXpta4LeJFkol9TZQWBd64fK sc0n4Xy0wD+CotskhTLU8vE4Kop3cvVgnWy83NIN4zCv/O0KPNYVgLrzC ebQfH+iE/XLDsnkO9PDhsl4m3wjKRYxNOHM6AP2840yVI7m1eQgWYP/9T Q=; X-IronPort-AV: E=Sophos;i="6.10,217,1719878400"; d="scan'208";a="367280463" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.43.8.6]) by smtp-border-fw-33001.sea14.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Sep 2024 16:31:33 +0000 Received: from EX19MTAUEC001.ant.amazon.com [10.0.44.209:40383] by smtpin.naws.us-east-1.prod.farcaster.email.amazon.dev [10.0.48.28:2525] with esmtp (Farcaster) id f846bd55-b4c7-479e-9859-a7c32f90264a; Tue, 10 Sep 2024 16:31:31 +0000 (UTC) X-Farcaster-Flow-ID: f846bd55-b4c7-479e-9859-a7c32f90264a Received: from EX19D008UEA004.ant.amazon.com (10.252.134.191) by EX19MTAUEC001.ant.amazon.com (10.252.135.222) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:25 +0000 Received: from EX19MTAUWB001.ant.amazon.com (10.250.64.248) by EX19D008UEA004.ant.amazon.com (10.252.134.191) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34; Tue, 10 Sep 2024 16:31:25 +0000 Received: from ua2d7e1a6107c5b.home (172.19.88.180) by mail-relay.amazon.com (10.250.64.254) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1258.34 via Frontend Transport; Tue, 10 Sep 2024 16:31:21 +0000 From: Patrick Roy To: , , , , , , , , , , , , , , , , , , , , CC: Patrick Roy , , , , , Subject: [RFC PATCH v2 07/10] kvm: pfncache: invalidate when memory attributes change Date: Tue, 10 Sep 2024 17:30:33 +0100 Message-ID: <20240910163038.1298452-8-roypat@amazon.co.uk> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk> References: <20240910163038.1298452-1-roypat@amazon.co.uk> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Invalidate gfn_to_pfn_caches when the memory attributes of the gfn it contains change. Since gfn_to_pfn_caches are not hooked up to KVM's MMU notifiers, but rather have to be invalidated right _before_ KVM's MMU notifiers are triggers, adopt the approach used by kvm_mmu_notifier_invalidate_range_start for invalidating gpcs inside kvm_vm_set_mem_attributes. 
Signed-off-by: Patrick Roy --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 5 +++++ virt/kvm/kvm_mm.h | 10 +++++++++ virt/kvm/pfncache.c | 45 ++++++++++++++++++++++++++++++++++++++++ 4 files changed, 61 insertions(+) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index cd28eb34aaeb1..7d36164a2cee5 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -840,6 +840,7 @@ struct kvm { #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES /* Protected by slots_locks (for writes) and RCU (for reads) */ struct xarray mem_attr_array; + bool attribute_change_in_progress; #endif char stats_id[KVM_STATS_NAME_SIZE]; }; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 13347fb03d4a9..183f7ce57a428 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2533,6 +2533,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm,= gfn_t start, gfn_t end, =20 mutex_lock(&kvm->slots_lock); =20 + /* Nothing to do if the entire range as the desired attributes. */ if (kvm_range_has_memory_attributes(kvm, start, end, attributes)) goto out_unlock; @@ -2547,6 +2548,9 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm,= gfn_t start, gfn_t end, goto out_unlock; } =20 + kvm->attribute_change_in_progress =3D true; + gfn_to_pfn_cache_invalidate_gfns_start(kvm, start, end); + kvm_handle_gfn_range(kvm, &pre_set_range); =20 for (i =3D start; i < end; i++) { @@ -2558,6 +2562,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm,= gfn_t start, gfn_t end, kvm_handle_gfn_range(kvm, &post_set_range); =20 out_unlock: + kvm->attribute_change_in_progress =3D false; mutex_unlock(&kvm->slots_lock); =20 return r; diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 715f19669d01f..5a53d888e4b18 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -27,12 +27,22 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, b= ool interruptible, void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, unsigned long end); + +void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, + gfn_t start, + gfn_t end); #else static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, unsigned long end) { } + +static inline void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, + gfn_t start, + gfn_t end) +{ +} #endif /* HAVE_KVM_PFNCACHE */ =20 #ifdef CONFIG_KVM_PRIVATE_MEM diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index f0039efb9e1e3..6de934a8a153f 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -57,6 +57,43 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, = unsigned long start, spin_unlock(&kvm->gpc_lock); } =20 +/* + * Identical to `gfn_to_pfn_cache_invalidate_start`, except based on gfns + * instead of uhvas. 
+ */ +void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, = gfn_t end) +{ + struct gfn_to_pfn_cache *gpc; + + spin_lock(&kvm->gpc_lock); + list_for_each_entry(gpc, &kvm->gpc_list, list) { + read_lock_irq(&gpc->lock); + + /* + * uhva based gpcs must not be used with gmem enabled memslots + */ + if (kvm_is_error_gpa(gpc->gpa)) { + read_unlock_irq(&gpc->lock); + continue; + } + + if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && + gpa_to_gfn(gpc->gpa) >=3D start && gpa_to_gfn(gpc->gpa) < end) { + read_unlock_irq(&gpc->lock); + + write_lock_irq(&gpc->lock); + if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && + gpa_to_gfn(gpc->gpa) >=3D start && gpa_to_gfn(gpc->gpa) < end) + gpc->valid =3D false; + write_unlock_irq(&gpc->lock); + continue; + } + + read_unlock_irq(&gpc->lock); + } + spin_unlock(&kvm->gpc_lock); +} + static bool kvm_gpc_is_valid_len(gpa_t gpa, unsigned long uhva, unsigned long len) { @@ -141,6 +178,14 @@ static inline bool mmu_notifier_retry_cache(struct kvm= *kvm, unsigned long mmu_s if (kvm->mn_active_invalidate_count) return true; =20 + /* + * Similarly to the above, attribute_change_in_progress is set + * before gfn_to_pfn_cache_invalidate_start is called in + * kvm_vm_set_mem_attributes, and isn't cleared until after + * mmu_invalidate_seq is updated. + */ + if (kvm->attribute_change_in_progress) + return true; /* * Ensure mn_active_invalidate_count is read before * mmu_invalidate_seq. This pairs with the smp_wmb() in --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-6002.amazon.com (smtp-fw-6002.amazon.com [52.95.49.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5B0261A76BD; Tue, 10 Sep 2024 16:31:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.95.49.90 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985903; cv=none; b=Qayr0fnFqp96GQpS+5cPjigCqScg/90HA5kp8EU6+KbDUj6nJehJdXci00nGuKgS8n9WY/JZ3zo5yl4S0/XXkskbzQQuFdZnZS5qhUmSamE2s8+UoHltQfk/Gs5SPJiWQxm9iQ6YEMKgyNvpyN8r7DRtOKXiHdIDl1mJr9vR43I= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985903; c=relaxed/simple; bh=WiMd1fQnJ68uRJs7o1MvaQ3hErm2TB0gEsTi8l3L3mo=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=WLEErpPk9dZ7zXYJeVdRi2b6kBwRoXGippwrFY7JmNz3EYwfQtOci+W8aR/3BGOSj8U9ycICrmgBSuNVeqFNhsQJweB4LGHWUUn53ho5nFO3Fj9GEMzW13O8WsBHoy4GMl5fev6zWMGEF4FVTegY2L2Hk4ukJqRmCYVYe7NE9gE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=L/x6q0sZ; arc=none smtp.client-ip=52.95.49.90 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="L/x6q0sZ" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985902; x=1757521902; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=nrEWCAm2NUlH3H3kjrWi9+hHd8dgkyixf/clN7dg47w=; 
From: Patrick Roy
Subject: [RFC PATCH v2 08/10] kvm: pfncache: Support caching gmem pfns
Date: Tue, 10 Sep 2024 17:30:34 +0100
Message-ID: <20240910163038.1298452-9-roypat@amazon.co.uk>
In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk>
References: <20240910163038.1298452-1-roypat@amazon.co.uk>

Inside the `hva_to_pfn_retry` loop, for gpa-based gpcs, check whether the
gpa has KVM_MEMORY_ATTRIBUTE_PRIVATE set, and if so, use `kvm_gmem_get_pfn`
with `KVM_GMEM_GET_PFN_SHARED` to resolve the pfn. Ignore uhva-based gpcs
for now, as they are only used with Xen, and we don't have guest_memfd
there (yet).

Gmem pfns that are cached by a gpc have their sharing refcount elevated
until the gpc gets invalidated (or rather: until it gets refreshed after
invalidation) or deactivated.

Since the memory attributes could flip between private and shared during
the refresh loop, store a uhva anyway, even if it ends up unused in the
final translation.
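A small stand-alone model of that sharing-refcount lifetime (names are illustrative, not the kernel's; in the series the count lives in the folio's private data, and a value of 1 means the folio has no shared users and may be made private again):

#include <assert.h>
#include <stdio.h>

struct folio_model {
	int sharing_count;	/* 1 == no shared users, folio may be private */
};

static void get_shared(struct folio_model *f)
{
	f->sharing_count++;	/* analogue of resolving a pfn for shared access */
}

static void put_shared(struct folio_model *f)
{
	assert(f->sharing_count > 1);
	f->sharing_count--;	/* analogue of dropping the shared pfn */
	if (f->sharing_count == 1)
		printf("last shared user gone; folio may become private again\n");
}

int main(void)
{
	struct folio_model folio = { .sharing_count = 1 };

	get_shared(&folio);	/* gpc caches the pfn: count stays elevated ... */
	/* ... across any number of accesses through the cache ... */
	put_shared(&folio);	/* ... until refresh-after-invalidation or deactivate */
	return 0;
}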
Signed-off-by: Patrick Roy --- include/linux/kvm_types.h | 1 + virt/kvm/pfncache.c | 63 ++++++++++++++++++++++++++++++++++----- 2 files changed, 56 insertions(+), 8 deletions(-) diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h index 827ecc0b7e10a..8903b8f46cf6c 100644 --- a/include/linux/kvm_types.h +++ b/include/linux/kvm_types.h @@ -70,6 +70,7 @@ struct gfn_to_pfn_cache { kvm_pfn_t pfn; bool active; bool valid; + bool private; }; =20 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index 6de934a8a153f..a4f935e80f545 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -16,6 +16,7 @@ #include #include #include +#include =20 #include "kvm_mm.h" =20 @@ -145,13 +146,20 @@ static void *gpc_map(kvm_pfn_t pfn) #endif } =20 -static void gpc_unmap(kvm_pfn_t pfn, void *khva) +static void gpc_unmap(kvm_pfn_t pfn, void *khva, bool private) { /* Unmap the old pfn/page if it was mapped before. */ if (is_error_noslot_pfn(pfn) || !khva) return; =20 if (pfn_valid(pfn)) { + if (private) { + struct folio *folio =3D pfn_folio(pfn); + + folio_lock(folio); + kvm_gmem_put_shared_pfn(pfn); + folio_unlock(folio); + } kunmap(pfn_to_page(pfn)); return; } @@ -203,6 +211,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cac= he *gpc) void *old_khva =3D (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva); kvm_pfn_t new_pfn =3D KVM_PFN_ERR_FAULT; void *new_khva =3D NULL; + bool private =3D gpc->private; unsigned long mmu_seq; =20 lockdep_assert_held(&gpc->refresh_lock); @@ -235,17 +244,43 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_c= ache *gpc) * the existing mapping and didn't create a new one. */ if (new_khva !=3D old_khva) - gpc_unmap(new_pfn, new_khva); + gpc_unmap(new_pfn, new_khva, private); =20 kvm_release_pfn_clean(new_pfn); =20 cond_resched(); } =20 - /* We always request a writeable mapping */ - new_pfn =3D hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL); - if (is_error_noslot_pfn(new_pfn)) - goto out_error; + /* + * If we do not have a GPA, we cannot immediately determine + * whether the area of guest memory gpc->uhva pointed to + * is currently set to shared. So assume that uhva-based gpcs + * never have their underlying guest memory switched to + * private (which we can do as uhva-based gpcs are only used + * with Xen, and guest_memfd is not supported there). + */ + if (gpc->gpa !=3D INVALID_GPA) { + /* + * mmu_notifier events can be due to shared/private conversions, + * thus recheck this every iteration. + */ + private =3D kvm_mem_is_private(gpc->kvm, gpa_to_gfn(gpc->gpa)); + } else { + private =3D false; + } + + if (private) { + int r =3D kvm_gmem_get_pfn(gpc->kvm, gpc->memslot, gpa_to_gfn(gpc->gpa), + &new_pfn, NULL, KVM_GMEM_GET_PFN_SHARED); + if (r) + goto out_error; + } else { + /* We always request a writeable mapping */ + new_pfn =3D hva_to_pfn(gpc->uhva, false, false, NULL, + true, NULL); + if (is_error_noslot_pfn(new_pfn)) + goto out_error; + } =20 /* * Obtain a new kernel mapping if KVM itself will access the @@ -274,6 +309,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cac= he *gpc) gpc->valid =3D true; gpc->pfn =3D new_pfn; gpc->khva =3D new_khva + offset_in_page(gpc->uhva); + gpc->private =3D private; =20 /* * Put the reference to the _new_ pfn. 
The pfn is now tracked by the @@ -298,6 +334,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *g= pc, gpa_t gpa, unsigned l kvm_pfn_t old_pfn; bool hva_change =3D false; void *old_khva; + bool old_private; int ret; =20 /* Either gpa or uhva must be valid, but not both */ @@ -316,6 +353,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *g= pc, gpa_t gpa, unsigned l old_pfn =3D gpc->pfn; old_khva =3D (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva); old_uhva =3D PAGE_ALIGN_DOWN(gpc->uhva); + old_private =3D gpc->private; =20 if (kvm_is_error_gpa(gpa)) { page_offset =3D offset_in_page(uhva); @@ -338,6 +376,11 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *= gpc, gpa_t gpa, unsigned l gpc->gpa =3D gpa; gpc->generation =3D slots->generation; gpc->memslot =3D __gfn_to_memslot(slots, gfn); + /* + * compute the uhva even for private memory, in case an + * invalidation event flips memory from private to + * shared while in hva_to_pfn_retry + */ gpc->uhva =3D gfn_to_hva_memslot(gpc->memslot, gfn); =20 if (kvm_is_error_hva(gpc->uhva)) { @@ -395,7 +438,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *g= pc, gpa_t gpa, unsigned l write_unlock_irq(&gpc->lock); =20 if (unmap_old) - gpc_unmap(old_pfn, old_khva); + gpc_unmap(old_pfn, old_khva, old_private); =20 return ret; } @@ -486,6 +529,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc) struct kvm *kvm =3D gpc->kvm; kvm_pfn_t old_pfn; void *old_khva; + bool old_private; =20 guard(mutex)(&gpc->refresh_lock); =20 @@ -508,6 +552,9 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc) old_khva =3D gpc->khva - offset_in_page(gpc->khva); gpc->khva =3D NULL; =20 + old_private =3D gpc->private; + gpc->private =3D false; + old_pfn =3D gpc->pfn; gpc->pfn =3D KVM_PFN_ERR_FAULT; write_unlock_irq(&gpc->lock); @@ -516,6 +563,6 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc) list_del(&gpc->list); spin_unlock(&kvm->gpc_lock); =20 - gpc_unmap(old_pfn, old_khva); + gpc_unmap(old_pfn, old_khva, old_private); } } --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-80007.amazon.com (smtp-fw-80007.amazon.com [99.78.197.218]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F0AFD1A2C00; Tue, 10 Sep 2024 16:31:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.218 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985917; cv=none; b=AiISTcV+QVmhfu6G0kO1ljPcYYor7FYZlLEF9+qBtExDt1J1IIvODEzxp3dtfTNUup4kxTES6vcqqjy72MVsq0e4ZuHgcbcTfQJJMdkDkmyKi8qjTfrDlMZDKN3bcsy4V4Zal6/KmVx1+Tf4ZY5BNXdb8ESi3Ia7CzS2T0x/PAM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985917; c=relaxed/simple; bh=27p9BHXqytoKqNPVT9HlrnpL6ztRh/qfZoFZTXZfoIs=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=X8zjoxg0V5591bBQiOxB1yaklOvNPDcEkxDjvt9axvJ8DD3FisvXzYyzSNqnqS/CjELPzghUCALUuz68YwJkxMD8qoVWiok+JzxIuw0j605No2Tfjq01AVIFx30kjJ3cboOQ1TCDZ5Jv/fGhHYRnCLROdY/iJ19kIG4zITZhUFM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=VMiYY4fW; arc=none smtp.client-ip=99.78.197.218 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk 
From: Patrick Roy
Subject: [RFC PATCH v2 09/10] kvm: pfncache: hook up to gmem invalidation
Date: Tue, 10 Sep 2024 17:30:35 +0100
Message-ID: <20240910163038.1298452-10-roypat@amazon.co.uk>
In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk>
References: <20240910163038.1298452-1-roypat@amazon.co.uk>

Invalidate gfn_to_pfn_caches that hold gmem pfns whenever gmem
invalidations occur (fallocate(FALLOC_FL_PUNCH_HOLE), error_remove_folio).

gmem invalidations are difficult to handle for gpcs. The unmap path for
gmem pfns in a gpc tries to decrement the sharing refcount, and potentially
modifies the direct map. However, these are not operations we can do after
the gmem folio that used to back the pfn has been freed (and after we drop
gpc->lock in gfn_to_pfn_cache_invalidate_gfns_start we are racing against
the freeing of the folio, and we cannot do direct map manipulations before
dropping the lock). Thus, in these cases (punch hole and
error_remove_folio), we must "leak" the sharing reference (which is fine
because either the folio has already been freed, or it is about to be freed
by ->invalidate_folio, which only reinserts into the direct map.
So if the folio already is in the direct map, no harm is done). So in these cases, we simply store a flag that tells gpc to skip unmapping of these pfns when the time comes to refresh the cache. A slightly different case are if just the memory attributes on a memslot change. If we switch from private to shared, the gmem pfn will still be there, it will simply no longer be mapped into the guest. In this scenario, we must unmap to decrement the sharing count, and reinsert into the direct map. Otherwise, if for example the gpc gets deactivated while the gfn is set to shared, and after that the gfn is flipped to private, something else might use the pfn, but it is still present in the direct map (which violates the security goal of direct map removal). However, there is one edge case we need to deal with: It could happen that a gpc gets invalidated by a memory attribute change (e.g. gpc->needs_unmap =3D true), then refreshed, and after the refresh loop has exited and the gpc->lock is dropped, but before we get to gpc_unmap, the gmem folio that occupies the invalidated pfn of the cache is fallocated away. Now needs_unmap will be true, but we are once again racing against the freeing of the folio. For this case, take a reference to the folio before we drop the gpc->lock, and only drop the reference after gpc_unmap returned, to avoid the folio being freed. For similar reasons, gfn_to_pfn_cache_invalidate_gfns_start needs to not ignore already invalidated caches, as a cache that was invalidated due to a memory attribute change will have needs_unmap=3Dtrue. If a fallocate(FALLOC_FL_PUNCH_HOLE) operation happens on the same range, this will need to get updated to needs_unmap=3Dfalse, even if the cache is already invalidated. Signed-off-by: Patrick Roy --- include/linux/kvm_host.h | 3 +++ include/linux/kvm_types.h | 1 + virt/kvm/guest_memfd.c | 19 +++++++++++++++- virt/kvm/kvm_main.c | 5 ++++- virt/kvm/kvm_mm.h | 6 +++-- virt/kvm/pfncache.c | 46 +++++++++++++++++++++++++++++++++------ 6 files changed, 69 insertions(+), 11 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 7d36164a2cee5..62e45a4ab810e 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -843,6 +843,9 @@ struct kvm { bool attribute_change_in_progress; #endif char stats_id[KVM_STATS_NAME_SIZE]; +#ifdef CONFIG_KVM_PRIVATE_MEM + atomic_t gmem_active_invalidate_count; +#endif }; =20 #define kvm_err(fmt, ...) 
\ diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h index 8903b8f46cf6c..a2df9623b17ce 100644 --- a/include/linux/kvm_types.h +++ b/include/linux/kvm_types.h @@ -71,6 +71,7 @@ struct gfn_to_pfn_cache { bool active; bool valid; bool private; + bool needs_unmap; }; =20 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 742eba36d2371..ac502f9b220c3 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -231,6 +231,15 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem = *gmem, pgoff_t start, struct kvm *kvm =3D gmem->kvm; unsigned long index; =20 + atomic_inc(&kvm->gmem_active_invalidate_count); + + xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) { + pgoff_t pgoff =3D slot->gmem.pgoff; + + gfn_to_pfn_cache_invalidate_gfns_start(kvm, slot->base_gfn + start - pgo= ff, + slot->base_gfn + end - pgoff, true); + } + xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) { pgoff_t pgoff =3D slot->gmem.pgoff; =20 @@ -268,6 +277,8 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gm= em, pgoff_t start, kvm_mmu_invalidate_end(kvm); KVM_MMU_UNLOCK(kvm); } + + atomic_dec(&kvm->gmem_active_invalidate_count); } =20 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t= len) @@ -478,7 +489,13 @@ static void kvm_gmem_invalidate_folio(struct folio *fo= lio, size_t start, size_t if (start =3D=3D 0 && end =3D=3D folio_size(folio)) { refcount_t *sharing_count =3D folio_get_private(folio); =20 - kvm_gmem_folio_clear_private(folio); + /* + * gfn_to_pfn_caches do not decrement the refcount if they + * get invalidated due to the gmem pfn going away (fallocate, + * or error_remove_folio) + */ + if (refcount_read(sharing_count) =3D=3D 1) + kvm_gmem_folio_clear_private(folio); kfree(sharing_count); } } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 183f7ce57a428..6d0818c723d73 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1161,6 +1161,9 @@ static struct kvm *kvm_create_vm(unsigned long type, = const char *fdname) #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES xa_init(&kvm->mem_attr_array); #endif +#ifdef CONFIG_KVM_PRIVATE_MEM + atomic_set(&kvm->gmem_active_invalidate_count, 0); +#endif =20 INIT_LIST_HEAD(&kvm->gpc_list); spin_lock_init(&kvm->gpc_lock); @@ -2549,7 +2552,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm,= gfn_t start, gfn_t end, } =20 kvm->attribute_change_in_progress =3D true; - gfn_to_pfn_cache_invalidate_gfns_start(kvm, start, end); + gfn_to_pfn_cache_invalidate_gfns_start(kvm, start, end, false); =20 kvm_handle_gfn_range(kvm, &pre_set_range); =20 diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 5a53d888e4b18..f4d0ced4a8f57 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -30,7 +30,8 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, =20 void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, - gfn_t end); + gfn_t end, + bool needs_unmap); #else static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, @@ -40,7 +41,8 @@ static inline void gfn_to_pfn_cache_invalidate_start(stru= ct kvm *kvm, =20 static inline void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, - gfn_t end) + gfn_t end, + bool needs_unmap) { } #endif /* HAVE_KVM_PFNCACHE */ diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index a4f935e80f545..828ba8ad8f20d 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -61,8 +61,15 @@ void 
gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, = unsigned long start, /* * Identical to `gfn_to_pfn_cache_invalidate_start`, except based on gfns * instead of uhvas. + * + * needs_unmap indicates whether this invalidation is because a gmem range= went + * away (fallocate(FALLOC_FL_PUNCH_HOLE), error_remove_folio), in which ca= se + * we must not call kvm_gmem_put_shared_pfn for it, or because of a memory + * attribute change, in which case the gmem pfn still exists, but simply + * is no longer mapped into the guest. */ -void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, = gfn_t end) +void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, = gfn_t end, + bool needs_unmap) { struct gfn_to_pfn_cache *gpc; =20 @@ -78,14 +85,16 @@ void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm = *kvm, gfn_t start, gfn_t continue; } =20 - if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && + if (!is_error_noslot_pfn(gpc->pfn) && gpa_to_gfn(gpc->gpa) >=3D start && gpa_to_gfn(gpc->gpa) < end) { read_unlock_irq(&gpc->lock); =20 write_lock_irq(&gpc->lock); - if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && - gpa_to_gfn(gpc->gpa) >=3D start && gpa_to_gfn(gpc->gpa) < end) + if (!is_error_noslot_pfn(gpc->pfn) && + gpa_to_gfn(gpc->gpa) >=3D start && gpa_to_gfn(gpc->gpa) < end) { gpc->valid =3D false; + gpc->needs_unmap =3D needs_unmap && gpc->private; + } write_unlock_irq(&gpc->lock); continue; } @@ -194,6 +203,9 @@ static inline bool mmu_notifier_retry_cache(struct kvm = *kvm, unsigned long mmu_s */ if (kvm->attribute_change_in_progress) return true; + + if (atomic_read_acquire(&kvm->gmem_active_invalidate_count)) + return true; /* * Ensure mn_active_invalidate_count is read before * mmu_invalidate_seq. This pairs with the smp_wmb() in @@ -425,20 +437,28 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache = *gpc, gpa_t gpa, unsigned l * Some/all of the uhva, gpa, and memslot generation info may still be * valid, leave it as is. */ + unmap_old =3D gpc->needs_unmap; if (ret) { gpc->valid =3D false; gpc->pfn =3D KVM_PFN_ERR_FAULT; gpc->khva =3D NULL; + gpc->needs_unmap =3D false; + } else { + gpc->needs_unmap =3D true; } =20 /* Detect a pfn change before dropping the lock! 
*/ - unmap_old =3D (old_pfn !=3D gpc->pfn); + unmap_old &=3D (old_pfn !=3D gpc->pfn); =20 out_unlock: + if (unmap_old) + folio_get(pfn_folio(old_pfn)); write_unlock_irq(&gpc->lock); =20 - if (unmap_old) + if (unmap_old) { gpc_unmap(old_pfn, old_khva, old_private); + folio_put(pfn_folio(old_pfn)); + } =20 return ret; } @@ -530,6 +550,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc) kvm_pfn_t old_pfn; void *old_khva; bool old_private; + bool old_needs_unmap; =20 guard(mutex)(&gpc->refresh_lock); =20 @@ -555,14 +576,25 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc) old_private =3D gpc->private; gpc->private =3D false; =20 + old_needs_unmap =3D gpc->needs_unmap; + gpc->needs_unmap =3D false; + old_pfn =3D gpc->pfn; gpc->pfn =3D KVM_PFN_ERR_FAULT; + + if (old_needs_unmap && old_private) + folio_get(pfn_folio(old_pfn)); + write_unlock_irq(&gpc->lock); =20 spin_lock(&kvm->gpc_lock); list_del(&gpc->list); spin_unlock(&kvm->gpc_lock); =20 - gpc_unmap(old_pfn, old_khva, old_private); + if (old_needs_unmap) { + gpc_unmap(old_pfn, old_khva, old_private); + if (old_private) + folio_put(pfn_folio(old_pfn)); + } } } --=20 2.46.0 From nobody Sat Nov 30 07:23:31 2024 Received: from smtp-fw-9102.amazon.com (smtp-fw-9102.amazon.com [207.171.184.29]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 50E0B1A7AE3; Tue, 10 Sep 2024 16:32:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=207.171.184.29 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985922; cv=none; b=UmgobuhFXI7ez/IHSbd/JVrzh6++tKQ8uvkCbuZZKFtdhp6ug2AGQ5Kn+TU+kC763rGkEM7B3tp8lLjlWdvhl+LgEC7SwhZvFOQ73lBKRUAZDu03quRhF2lmI8LPPBK2hDADYE4ZQqZk8Xm8RCKmYaq0jrXwHjVoTGQ1iLJ3I3k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725985922; c=relaxed/simple; bh=JFZvXTOsIoUxDPHdW5ElkrBePGcuWanipE892MdDx4w=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=e8KBEL/nZL8R5vbUvwIScExXQYyRQiLmkG27o6/RzlFR5+UOTGh0IkP2lsvOQsDIS+TBWpEu7QEUcnt7XqUb1zRzE4uLGEDL7fciQ0siKyElAair6+zClE+4+4bgopq4EiGK/wnHMdegMaaPJW8O9k9B8x9Mv/XwTdBHtGtZ4aE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk; spf=pass smtp.mailfrom=amazon.co.uk; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b=uhQ4LE41; arc=none smtp.client-ip=207.171.184.29 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.co.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.uk header.i=@amazon.co.uk header.b="uhQ4LE41" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.uk; i=@amazon.co.uk; q=dns/txt; s=amazon201209; t=1725985921; x=1757521921; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=nC+ZthG3ZVV5r6y+2ESHYBsZJWfc4T6QNh+zi4jDtGU=; b=uhQ4LE41B338EBg4lQTQUABmCr1kFHP4Y2mhZbMKSFnAIIC+VyqhtP72 GwbJcy4GKnLk4mN/DbY+pcSYcSIaJqbPLgEcJ4Smr09psf9hYQm454YXz N13/GrZXIlVrv6JApPN7WShQlQXkERuDFTz+yDJXNGf+dVnSEXX0OgyFa M=; X-IronPort-AV: E=Sophos;i="6.10,217,1719878400"; d="scan'208";a="452556269" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) 
From: Patrick Roy
Subject: [RFC PATCH v2 10/10] kvm: x86: support walking guest page tables in gmem
Date: Tue, 10 Sep 2024 17:30:36 +0100
Message-ID: <20240910163038.1298452-11-roypat@amazon.co.uk>
In-Reply-To: <20240910163038.1298452-1-roypat@amazon.co.uk>
References: <20240910163038.1298452-1-roypat@amazon.co.uk>

Update the logic in paging_tmpl.h to work with guest-private memory. If KVM
cannot access gmem and the guest's page tables are in gfns marked as
private, then error out.

Let the guest page table walker access gmem by making it use
gfn_to_pfn_caches, which are already gmem-aware, and also handle on-demand
mapping of gmem if KVM_GMEM_NO_DIRECT_MAP is set. We re-use the
gfn_to_pfn_cache here to avoid implementing yet another remapping solution
to support the cmpxchg used to set the "accessed" bit on guest PTEs.

The only case that now needs special handling is page tables in read-only
memslots, as gfn_to_pfn_caches cannot be used for read-only memory. In this
case, use kvm_vcpu_read_guest (which is also gmem-aware), as there is no
need to cache the gfn->pfn translation: the walker does not set the
accessed bit for read-only PTEs, so no cmpxchg on the PTE is needed.

gfn_to_pfn_caches are hooked up to the MMU notifiers, meaning that if
something about guest memory changes between the page table walk and
setting the dirty bits (for example a concurrent fallocate on gmem), the
gfn_to_pfn_caches will have been invalidated and the entire page table walk
is retried.
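A stand-alone sketch of the check/cmpxchg/retry pattern the walker relies on, with illustrative types only (the real code validates the cache with kvm_gpc_check under gpc->lock, refreshes with kvm_gpc_refresh, and restarts walk_addr_generic on failure; the bit-5 accessed mask below is likewise just an example value):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct pte_cache {
	atomic_bool valid;		/* cleared by invalidation events */
	_Atomic uint64_t *pte;		/* mapped guest PTE (stand-in for khva) */
};

/* Returns true on success, false if the whole walk must be retried. */
static bool set_accessed_bit(struct pte_cache *c, uint64_t accessed_mask)
{
	uint64_t old, newval;

	if (!atomic_load(&c->valid))
		return false;			/* cache invalidated: retry the walk */

	old = atomic_load(c->pte);
	newval = old | accessed_mask;
	/* The cmpxchg fails if the guest changed the PTE underneath us. */
	return atomic_compare_exchange_strong(c->pte, &old, newval);
}

int main(void)
{
	_Atomic uint64_t pte = 0x1000;
	struct pte_cache cache = { .pte = &pte };

	atomic_store(&cache.valid, true);
	while (!set_accessed_bit(&cache, 1u << 5))
		;	/* the real walker restarts the page table walk here */
	printf("pte now %#llx\n", (unsigned long long)atomic_load(&pte));
	return 0;
}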
Signed-off-by: Patrick Roy --- arch/x86/kvm/mmu/paging_tmpl.h | 95 ++++++++++++++++++++++++++++------ 1 file changed, 78 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 69941cebb3a87..d96fa423bed05 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -84,7 +84,7 @@ struct guest_walker { pt_element_t ptes[PT_MAX_FULL_LEVELS]; pt_element_t prefetch_ptes[PTE_PREFETCH_NUM]; gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; - pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; + struct gfn_to_pfn_cache ptep_caches[PT_MAX_FULL_LEVELS]; bool pte_writable[PT_MAX_FULL_LEVELS]; unsigned int pt_access[PT_MAX_FULL_LEVELS]; unsigned int pte_access; @@ -201,7 +201,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm= _vcpu *vcpu, { unsigned level, index; pt_element_t pte, orig_pte; - pt_element_t __user *ptep_user; + struct gfn_to_pfn_cache *pte_cache; gfn_t table_gfn; int ret; =20 @@ -210,10 +210,12 @@ static int FNAME(update_accessed_dirty_bits)(struct k= vm_vcpu *vcpu, return 0; =20 for (level =3D walker->max_level; level >=3D walker->level; --level) { + unsigned long flags; + pte =3D orig_pte =3D walker->ptes[level - 1]; table_gfn =3D walker->table_gfn[level - 1]; - ptep_user =3D walker->ptep_user[level - 1]; - index =3D offset_in_page(ptep_user) / sizeof(pt_element_t); + pte_cache =3D &walker->ptep_caches[level - 1]; + index =3D offset_in_page(pte_cache->khva) / sizeof(pt_element_t); if (!(pte & PT_GUEST_ACCESSED_MASK)) { trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte)); pte |=3D PT_GUEST_ACCESSED_MASK; @@ -246,11 +248,26 @@ static int FNAME(update_accessed_dirty_bits)(struct k= vm_vcpu *vcpu, if (unlikely(!walker->pte_writable[level - 1])) continue; =20 - ret =3D __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault); + read_lock_irqsave(&pte_cache->lock, flags); + if (!kvm_gpc_check(pte_cache, sizeof(pte))) { + read_unlock_irqrestore(&pte_cache->lock, flags); + /* + * If the gpc got invalidated, then the page table + * it contained probably changed, so we probably need + * to redo the entire walk. + */ + return 1; + } + ret =3D __try_cmpxchg((pt_element_t *)pte_cache->khva, &orig_pte, pte, s= izeof(pte)); + + if (!ret) + kvm_gpc_mark_dirty_in_slot(pte_cache); + + read_unlock_irqrestore(&pte_cache->lock, flags); + if (ret) return ret; =20 - kvm_vcpu_mark_page_dirty(vcpu, table_gfn); walker->ptes[level - 1] =3D pte; } return 0; @@ -296,6 +313,13 @@ static inline bool FNAME(is_last_gpte)(struct kvm_mmu = *mmu, =20 return gpte & PT_PAGE_SIZE_MASK; } + +static void FNAME(walk_deactivate_gpcs)(struct guest_walker *walker) { + for (unsigned int level =3D 0; level < PT_MAX_FULL_LEVELS; ++level) + if (walker->ptep_caches[level].active) + kvm_gpc_deactivate(&walker->ptep_caches[level]); +} + /* * Fetch a guest pte for a guest virtual address, or for an L2's GPA. 
*/ @@ -305,7 +329,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker= *walker, { int ret; pt_element_t pte; - pt_element_t __user *ptep_user; gfn_t table_gfn; u64 pt_access, pte_access; unsigned index, accessed_dirty, pte_pkey; @@ -320,8 +343,17 @@ static int FNAME(walk_addr_generic)(struct guest_walke= r *walker, u16 errcode =3D 0; gpa_t real_gpa; gfn_t gfn; + struct gfn_to_pfn_cache *pte_cache; =20 trace_kvm_mmu_pagetable_walk(addr, access); + + for (unsigned int level =3D 0; level < PT_MAX_FULL_LEVELS; ++level) { + pte_cache =3D &walker->ptep_caches[level]; + + memset(pte_cache, 0, sizeof(*pte_cache)); + kvm_gpc_init(pte_cache, vcpu->kvm); + } + retry_walk: walker->level =3D mmu->cpu_role.base.level; pte =3D kvm_mmu_get_guest_pgd(vcpu, mmu); @@ -362,11 +394,13 @@ static int FNAME(walk_addr_generic)(struct guest_walk= er *walker, =20 do { struct kvm_memory_slot *slot; - unsigned long host_addr; + unsigned long flags; =20 pt_access =3D pte_access; --walker->level; =20 + pte_cache =3D &walker->ptep_caches[walker->level - 1]; + index =3D PT_INDEX(addr, walker->level); table_gfn =3D gpte_to_gfn(pte); offset =3D index * sizeof(pt_element_t); @@ -396,15 +430,36 @@ static int FNAME(walk_addr_generic)(struct guest_walk= er *walker, if (!kvm_is_visible_memslot(slot)) goto error; =20 - host_addr =3D gfn_to_hva_memslot_prot(slot, gpa_to_gfn(real_gpa), - &walker->pte_writable[walker->level - 1]); - if (unlikely(kvm_is_error_hva(host_addr))) - goto error; + /* + * gfn_to_pfn_cache expects the memory to be writable. However, + * if the memory is not writable, we do not need caching in the + * first place, as we only need it to later potentially write + * the access bit (which we cannot do anyway if the memory is + * readonly). + */ + if (slot->flags & KVM_MEM_READONLY) { + if (kvm_vcpu_read_guest(vcpu, real_gpa + offset, &pte, sizeof(pte))) + goto error; + } else { + if (kvm_gpc_activate(pte_cache, real_gpa + offset, + sizeof(pte))) + goto error; =20 - ptep_user =3D (pt_element_t __user *)((void *)host_addr + offset); - if (unlikely(__get_user(pte, ptep_user))) - goto error; - walker->ptep_user[walker->level - 1] =3D ptep_user; + read_lock_irqsave(&pte_cache->lock, flags); + while (!kvm_gpc_check(pte_cache, sizeof(pte))) { + read_unlock_irqrestore(&pte_cache->lock, flags); + + if (kvm_gpc_refresh(pte_cache, sizeof(pte))) + goto error; + + read_lock_irqsave(&pte_cache->lock, flags); + } + + pte =3D *(pt_element_t *)pte_cache->khva; + read_unlock_irqrestore(&pte_cache->lock, flags); + + walker->pte_writable[walker->level - 1] =3D true; + } =20 trace_kvm_mmu_paging_element(pte, walker->level); =20 @@ -467,13 +522,19 @@ static int FNAME(walk_addr_generic)(struct guest_walk= er *walker, addr, write_fault); if (unlikely(ret < 0)) goto error; - else if (ret) + else if (ret) { + FNAME(walk_deactivate_gpcs)(walker); goto retry_walk; + } } =20 + FNAME(walk_deactivate_gpcs)(walker); + return 1; =20 error: + FNAME(walk_deactivate_gpcs)(walker); + errcode |=3D write_fault | user_fault; if (fetch_fault && (is_efer_nx(mmu) || is_cr4_smep(mmu))) errcode |=3D PFERR_FETCH_MASK; --=20 2.46.0