From: "Kalyazin, Nikita"
To: "kvm@vger.kernel.org", "linux-doc@vger.kernel.org",
 "linux-kernel@vger.kernel.org", "linux-arm-kernel@lists.infradead.org",
 "kvmarm@lists.linux.dev", "linux-fsdevel@vger.kernel.org",
 "linux-mm@kvack.org", "bpf@vger.kernel.org",
 "linux-kselftest@vger.kernel.org", "kernel@xen0n.name",
 "linux-riscv@lists.infradead.org", "linux-s390@vger.kernel.org",
 "loongarch@lists.linux.dev", "linux-pm@vger.kernel.org"
CC: "pbonzini@redhat.com", "corbet@lwn.net", "maz@kernel.org",
 "oupton@kernel.org", "joey.gouly@arm.com", "suzuki.poulose@arm.com",
 "yuzenghui@huawei.com", "catalin.marinas@arm.com", "will@kernel.org",
 "seanjc@google.com", "tglx@kernel.org", "mingo@redhat.com",
 "bp@alien8.de", "dave.hansen@linux.intel.com", "x86@kernel.org",
 "hpa@zytor.com", "luto@kernel.org", "peterz@infradead.org",
 "willy@infradead.org", "akpm@linux-foundation.org", "david@kernel.org",
 "lorenzo.stoakes@oracle.com", "vbabka@kernel.org", "rppt@kernel.org",
 "surenb@google.com", "mhocko@suse.com", "ast@kernel.org",
 "daniel@iogearbox.net", "andrii@kernel.org", "martin.lau@linux.dev",
 "eddyz87@gmail.com", "song@kernel.org", "yonghong.song@linux.dev",
 "john.fastabend@gmail.com", "kpsingh@kernel.org", "sdf@fomichev.me",
 "haoluo@google.com", "jolsa@kernel.org", "jgg@ziepe.ca",
 "jhubbard@nvidia.com", "peterx@redhat.com", "jannh@google.com",
 "pfalcato@suse.de", "skhan@linuxfoundation.org", "riel@surriel.com",
 "ryan.roberts@arm.com", "jgross@suse.com", "yu-cheng.yu@intel.com",
 "kas@kernel.org", "coxu@redhat.com", "ackerleytng@google.com",
 "yosry@kernel.org", "ajones@ventanamicro.com", "maobibo@loongson.cn",
 "tabba@google.com", "prsampat@amd.com", "wu.fei9@sanechips.com.cn",
 "mlevitsk@redhat.com", "jmattson@google.com", "jthoughton@google.com",
 "agordeev@linux.ibm.com", "alex@ghiti.fr", "aou@eecs.berkeley.edu",
 "borntraeger@linux.ibm.com", "chenhuacai@kernel.org",
 "baolu.lu@linux.intel.com", "dev.jain@arm.com", "gor@linux.ibm.com",
 "hca@linux.ibm.com", "palmer@dabbelt.com", "pjw@kernel.org",
 "shijie@os.amperecomputing.com", "svens@linux.ibm.com",
 "thuth@redhat.com", "yang@os.amperecomputing.com",
 "Liam.Howlett@oracle.com", "urezki@gmail.com",
 "zhengqi.arch@bytedance.com", "gerald.schaefer@linux.ibm.com",
 "jiayuan.chen@shopee.com", "lenb@kernel.org", "pavel@kernel.org",
 "rafael@kernel.org", "yangyicong@hisilicon.com",
 "vannapurve@google.com", "jackmanb@google.com", "patrick.roy@linux.dev",
 "Thomson, Jack", "Itazuri, Takahiro", "Manwaring, Derek",
 "Kalyazin, Nikita", Nikita Kalyazin
Subject: [PATCH v12 10/16] KVM: guest_memfd: Add flag to remove from direct map
Date: Fri, 10 Apr 2026 15:19:35 +0000
Message-ID: <20260410151746.61150-11-kalyazin@amazon.com>
References: <20260410151746.61150-1-kalyazin@amazon.com>
In-Reply-To: <20260410151746.61150-1-kalyazin@amazon.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Patrick Roy

Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
ioctl. When set, guest_memfd folios will be removed from the direct map
after preparation, with direct map entries only restored when the folios
are freed.

To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.

Note that this flag causes removal of direct map entries for all
guest_memfd folios independent of whether they are "shared" or "private"
(although current guest_memfd only supports either all folios in the
"shared" state, or all folios in the "private" state if
GUEST_MEMFD_FLAG_MMAP is not set).

The use case for also removing direct map entries of the shared parts of
guest_memfd is a special type of non-CoCo VM where host userspace is
trusted to have access to all of guest memory, but where Spectre-style
transient execution attacks through the host kernel's direct map should
still be mitigated. In this setup, KVM retains access to guest memory
via userspace mappings of guest_memfd, which are reflected back into
KVM's memslots via userspace_addr. This is needed for things like MMIO
emulation on x86_64 to work.

Direct map entries are zapped right before guest or userspace mappings
of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
kvm_gmem_get_pfn() [called from the KVM MMU code].

At present, direct map removal is not supported on platforms that
support kvm_gmem_populate(). In case such support is added in the
future, the following ordering is maintained: zap then prepare,
invalidate then restore, to avoid guest-owned pages being temporarily
mapped by the host. This assumes that preparation or invalidation code
does not access the page content.

Signed-off-by: Patrick Roy
Co-developed-by: Nikita Kalyazin
Signed-off-by: Nikita Kalyazin
---
 Documentation/virt/kvm/api.rst | 21 +++++-----
 include/linux/kvm_host.h       |  3 ++
 include/uapi/linux/kvm.h       |  1 +
 virt/kvm/guest_memfd.c         | 71 ++++++++++++++++++++++++++++++++--
 4 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 032516783e96..8feec77b03fe 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6439,15 +6439,18 @@ a single guest_memfd file, but the bound ranges must not overlap).
 The capability KVM_CAP_GUEST_MEMFD_FLAGS enumerates the `flags` that can be
 specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags:
 
-  ============================ ================================================
-  GUEST_MEMFD_FLAG_MMAP        Enable using mmap() on the guest_memfd file
-                               descriptor.
-  GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during
-                               KVM_CREATE_GUEST_MEMFD (memory files created
-                               without INIT_SHARED will be marked private).
-                               Shared memory can be faulted into host userspace
-                               page tables. Private memory cannot.
-  ============================ ================================================
+  ============================== ================================================
+  GUEST_MEMFD_FLAG_MMAP          Enable using mmap() on the guest_memfd file
+                                 descriptor.
+  GUEST_MEMFD_FLAG_INIT_SHARED   Make all memory in the file shared during
+                                 KVM_CREATE_GUEST_MEMFD (memory files created
+                                 without INIT_SHARED will be marked private).
+                                 Shared memory can be faulted into host userspace
+                                 page tables. Private memory cannot.
+  GUEST_MEMFD_FLAG_NO_DIRECT_MAP The guest_memfd instance will unmap the memory
+                                 backing it from the kernel's address space
+                                 before passing it off to userspace or the guest.
+  ============================== ================================================
 
 When the KVM MMU performs a PFN lookup to service a guest fault and the backing
 guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ce8c5fdf2752..c95747e2278c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -738,6 +738,9 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
 	if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
 		flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
 
+	if (!kvm || kvm_arch_gmem_supports_no_direct_map(kvm))
+		flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+
 	return flags;
 }
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 80364d4dbebb..d864f67efdb7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1642,6 +1642,7 @@ struct kvm_memory_attributes {
 #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
 #define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
 #define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1)
+#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 2)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 651649623448..80d4a6aca128 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 
 #include "kvm_mm.h"
 
@@ -76,6 +77,39 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
 	return 0;
 }
 
+#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
+
+static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
+{
+	return ((u64)folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
+}
+
+static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
+{
+	int r = 0;
+
+	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
+	if (WARN_ON_ONCE(!(GMEM_I(folio_inode(folio))->flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)))
+		return -EINVAL;
+
+	if (kvm_gmem_folio_no_direct_map(folio))
+		goto out;
+
+	r = folio_zap_direct_map(folio);
+	if (!r)
+		folio->private = (void *)((u64)folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+
+out:
+	return r;
+}
+
+static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
+{
+	folio_restore_direct_map(folio);
+	folio->private = (void *)((u64)folio->private & ~KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+}
+
 /*
  * Process @folio, which contains @gfn, so that the guest can use it.
  * The folio must be locked and the gfn must be contained in @slot.
@@ -388,11 +422,17 @@ static bool kvm_gmem_supports_mmap(struct inode *inode)
 	return GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_MMAP;
 }
 
+static bool kvm_gmem_no_direct_map(struct inode *inode)
+{
+	return GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+}
+
 static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct folio *folio;
 	vm_fault_t ret = VM_FAULT_LOCKED;
+	int err;
 
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
@@ -418,6 +458,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 		folio_mark_uptodate(folio);
 	}
 
+	if (kvm_gmem_no_direct_map(folio_inode(folio))) {
+		err = kvm_gmem_folio_zap_direct_map(folio);
+		if (err) {
+			ret = vmf_error(err);
+			goto out_folio;
+		}
+	}
+
 	vmf->page = folio_file_page(folio, vmf->pgoff);
 
 out_folio:
@@ -529,6 +577,9 @@ static void kvm_gmem_free_folio(struct folio *folio)
 	int order = folio_order(folio);
 
 	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+
+	if (kvm_gmem_folio_no_direct_map(folio))
+		kvm_gmem_folio_restore_direct_map(folio);
 }
 
 static const struct address_space_operations kvm_gmem_aops = {
@@ -591,6 +642,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
+	if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
+		mapping_set_no_direct_map(inode->i_mapping);
+
 	GMEM_I(inode)->flags = flags;
 
 	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
@@ -802,14 +856,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		folio_mark_uptodate(folio);
 	}
 
+	if (kvm_gmem_no_direct_map(folio_inode(folio))) {
+		r = kvm_gmem_folio_zap_direct_map(folio);
+		if (r)
+			goto out_unlock;
+	}
+
 	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
+	if (r)
+		goto out_unlock;
 
+	*page = folio_file_page(folio, index);
 	folio_unlock(folio);
+	return 0;
 
-	if (!r)
-		*page = folio_file_page(folio, index);
-	else
-		folio_put(folio);
+out_unlock:
+	folio_unlock(folio);
+	folio_put(folio);
 
 	return r;
 }
-- 
2.50.1
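
For readers looking at this from the userspace side, here is a minimal
sketch of how a VMM might decide whether it can request the new flag. It
is not part of the patch. The helper name gmem_pick_flags() is
hypothetical, the flag values are copied from the uapi hunk above, and
the `supported` mask is assumed to be what the VMM obtained for
KVM_CAP_GUEST_MEMFD_FLAGS. The choice of MMAP plus INIT_SHARED as the
base set is also an assumption, meant to match the non-CoCo use case
described in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in include/uapi/linux/kvm.h by this patch. */
#ifndef GUEST_MEMFD_FLAG_MMAP
#define GUEST_MEMFD_FLAG_MMAP          (1ULL << 0)
#endif
#ifndef GUEST_MEMFD_FLAG_INIT_SHARED
#define GUEST_MEMFD_FLAG_INIT_SHARED   (1ULL << 1)
#endif
#ifndef GUEST_MEMFD_FLAG_NO_DIRECT_MAP
#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 2)
#endif

/*
 * Hypothetical helper, not part of the patch.
 *
 * Builds the creation flags for a guest_memfd that the VMM intends to
 * mmap() (MMAP | INIT_SHARED) and optionally asks for direct map removal.
 * `supported` is assumed to be the mask reported for
 * KVM_CAP_GUEST_MEMFD_FLAGS.
 *
 * Returns true and stores the flags in *out if every requested flag is
 * supported; returns false and leaves *out untouched otherwise.
 */
static bool gmem_pick_flags(uint64_t supported, bool want_no_direct_map,
			    uint64_t *out)
{
	uint64_t flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED;

	if (want_no_direct_map)
		flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;

	/* Refuse rather than silently dropping a flag the caller asked for. */
	if ((flags & supported) != flags)
		return false;

	*out = flags;
	return true;
}
```

The resulting value would go into the `flags` field of struct
kvm_create_guest_memfd before issuing KVM_CREATE_GUEST_MEMFD. Failing
outright when the flag is unsupported, instead of falling back to a
mapping that stays in the direct map, is a design choice of this sketch:
a VMM relying on the mitigation probably should not lose it silently.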