From nobody Thu Oct 2 15:35:36 2025
From: "Kalyazin, Nikita"
To: "akpm@linux-foundation.org", "david@redhat.com", "pbonzini@redhat.com",
	"seanjc@google.com", "viro@zeniv.linux.org.uk", "brauner@kernel.org"
CC: "peterx@redhat.com", "lorenzo.stoakes@oracle.com", "Liam.Howlett@oracle.com",
	"willy@infradead.org", "vbabka@suse.cz", "rppt@kernel.org", "surenb@google.com",
	"mhocko@suse.com", "jack@suse.cz", "linux-mm@kvack.org", "kvm@vger.kernel.org",
	"linux-kernel@vger.kernel.org", "linux-fsdevel@vger.kernel.org",
	"jthoughton@google.com", "tabba@google.com", "vannapurve@google.com",
	"Roy, Patrick", "Thomson, Jack", "Manwaring, Derek", "Cali, Marco",
	"Kalyazin, Nikita"
Subject: [RFC PATCH v6 1/2] mm: guestmem: introduce guestmem library
Date: Mon, 15 Sep 2025 16:18:27 +0000
Message-ID: <20250915161815.40729-2-kalyazin@amazon.com>
References: <20250915161815.40729-1-kalyazin@amazon.com>
In-Reply-To: <20250915161815.40729-1-kalyazin@amazon.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Nikita Kalyazin

Move the MM-generic parts of guest_memfd from KVM to MM.  This allows
other hypervisors to reuse the guestmem code and enables a userfaultfd
implementation for guest_memfd [1].  Previously this was not possible
because KVM (and with it the guest_memfd code) may be built as a module.

Based on a patch by Elliot Berman [2].
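As an illustration (not part of this patch), below is a minimal sketch of
how a non-KVM hypervisor backend might plug into the library. The foo_*
names, the foo_gmem structure and foo_unmap_range() are hypothetical; only
the guestmem_* calls and struct guestmem_ops come from this series:

/* Hypothetical per-file backend state; 'entry' is linked into i_private_list. */
struct foo_gmem {
	struct list_head entry;
	/* ... hypervisor-specific fields ... */
};

static void foo_invalidate_begin(struct list_head *entry, pgoff_t start,
				 pgoff_t end)
{
	struct foo_gmem *gmem = container_of(entry, struct foo_gmem, entry);

	/* Hypothetical: tear down guest mappings of [start, end). */
	foo_unmap_range(gmem, start, end);
}

static const struct guestmem_ops foo_guestmem_ops = {
	.invalidate_begin	= foo_invalidate_begin,
	/* .invalidate_end, .release_folio and .supports_mmap are optional. */
};

/* Called when the backend creates its guest memory inode. */
static int foo_attach(struct inode *inode, struct foo_gmem *gmem)
{
	return guestmem_attach_mapping(inode->i_mapping, &foo_guestmem_ops,
				       &gmem->entry);
}

/* Called on release; the last detach frees guestmem state and truncates folios. */
static void foo_detach(struct inode *inode, struct foo_gmem *gmem)
{
	guestmem_detach_mapping(inode->i_mapping, &gmem->entry);
}

This mirrors what virt/kvm/guest_memfd.c does with kvm_guestmem_ops in the
diff below.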
[1] https://lore.kernel.org/kvm/20250404154352.23078-1-kalyazin@amazon.com
[2] https://lore.kernel.org/kvm/20241122-guestmem-library-v5-2-450e92951a15@quicinc.com

Signed-off-by: Nikita Kalyazin
---
 MAINTAINERS              |   2 +
 include/linux/guestmem.h |  46 +++++
 mm/Kconfig               |   3 +
 mm/Makefile              |   1 +
 mm/guestmem.c            | 380 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/Kconfig         |   1 +
 virt/kvm/guest_memfd.c   | 303 ++++---------------------------
 7 files changed, 465 insertions(+), 271 deletions(-)
 create mode 100644 include/linux/guestmem.h
 create mode 100644 mm/guestmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index fed6cd812d79..c468c4847ffd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15956,6 +15956,7 @@ W:	http://www.linux-mm.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
 T:	quilt git://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new
 F:	mm/
+F:	mm/guestmem.c
 F:	tools/mm/
 
 MEMORY MANAGEMENT - CORE
@@ -15973,6 +15974,7 @@ W:	http://www.linux-mm.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
 F:	include/linux/gfp.h
 F:	include/linux/gfp_types.h
+F:	include/linux/guestmem.h
 F:	include/linux/highmem.h
 F:	include/linux/memory.h
 F:	include/linux/mm.h
diff --git a/include/linux/guestmem.h b/include/linux/guestmem.h
new file mode 100644
index 000000000000..2a173261d32b
--- /dev/null
+++ b/include/linux/guestmem.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_GUESTMEM_H
+#define _LINUX_GUESTMEM_H
+
+#include
+
+struct address_space;
+struct list_head;
+struct inode;
+
+/**
+ * struct guestmem_ops - Hypervisor-specific maintenance operations
+ * @release_folio - try to bring the folio back to being fully owned by Linux,
+ *                  e.g. when the folio is about to be freed [optional]
+ * @invalidate_begin - start invalidating mappings between start and end offsets
+ * @invalidate_end - paired with ->invalidate_begin() [optional]
+ * @supports_mmap - return true if the inode supports mmap [optional]
+ */
+struct guestmem_ops {
+	bool (*release_folio)(struct address_space *mapping,
+			      struct folio *folio);
+	void (*invalidate_begin)(struct list_head *entry, pgoff_t start,
+				 pgoff_t end);
+	void (*invalidate_end)(struct list_head *entry, pgoff_t start,
+			       pgoff_t end);
+	bool (*supports_mmap)(struct inode *inode);
+};
+
+int guestmem_attach_mapping(struct address_space *mapping,
+			    const struct guestmem_ops *const ops,
+			    struct list_head *data);
+void guestmem_detach_mapping(struct address_space *mapping,
+			     struct list_head *data);
+
+struct folio *guestmem_grab_folio(struct address_space *mapping, pgoff_t index);
+
+int guestmem_punch_hole(struct address_space *mapping, loff_t offset,
+			loff_t len);
+int guestmem_allocate(struct address_space *mapping, loff_t offset, loff_t len);
+
+bool guestmem_test_no_direct_map(struct inode *inode);
+void guestmem_mark_prepared(struct folio *folio);
+int guestmem_mmap(struct file *file, struct vm_area_struct *vma);
+bool guestmem_vma_is_guestmem(struct vm_area_struct *vma);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..a3705099601f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1254,6 +1254,9 @@ config SECRETMEM
 	  memory areas visible only in the context of the owning process and
 	  not mapped to other processes and other kernel page tables.
 
+config GUESTMEM
+	bool
+
 config ANON_VMA_NAME
 	bool "Anonymous VMA name support"
 	depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/Makefile b/mm/Makefile
index ef54aa615d9d..c92892acd819 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,6 +138,7 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_GUESTMEM) += guestmem.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/guestmem.c b/mm/guestmem.c
new file mode 100644
index 000000000000..110087aff7e8
--- /dev/null
+++ b/mm/guestmem.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct guestmem {
+	const struct guestmem_ops *ops;
+};
+
+static inline bool __guestmem_release_folio(struct address_space *mapping,
+					    struct folio *folio)
+{
+	struct guestmem *gmem = mapping->i_private_data;
+
+	if (gmem->ops->release_folio) {
+		if (!gmem->ops->release_folio(mapping, folio))
+			return false;
+	}
+
+	return true;
+}
+
+static inline void
+__guestmem_invalidate_begin(struct address_space *const mapping, pgoff_t start,
+			    pgoff_t end)
+{
+	struct guestmem *gmem = mapping->i_private_data;
+	struct list_head *entry;
+
+	list_for_each(entry, &mapping->i_private_list)
+		gmem->ops->invalidate_begin(entry, start, end);
+}
+
+static inline void
+__guestmem_invalidate_end(struct address_space *const mapping, pgoff_t start,
+			  pgoff_t end)
+{
+	struct guestmem *gmem = mapping->i_private_data;
+	struct list_head *entry;
+
+	if (gmem->ops->invalidate_end) {
+		list_for_each(entry, &mapping->i_private_list)
+			gmem->ops->invalidate_end(entry, start, end);
+	}
+}
+
+static int guestmem_write_begin(const struct kiocb *kiocb,
+				struct address_space *mapping,
+				loff_t pos, unsigned int len,
+				struct folio **foliop,
+				void **fsdata)
+{
+	struct file *file = kiocb->ki_filp;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	struct folio *folio;
+
+	if (!PAGE_ALIGNED(pos) || len != PAGE_SIZE)
+		return -EINVAL;
+
+	if (pos + len > i_size_read(file_inode(file)))
+		return -EINVAL;
+
+	folio = guestmem_grab_folio(file_inode(file)->i_mapping, index);
+	if (IS_ERR(folio))
+		return -EFAULT;
+
+	if (WARN_ON_ONCE(folio_test_large(folio))) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return -EFAULT;
+	}
+
+	if (folio_test_uptodate(folio)) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return -ENOSPC;
+	}
+
+	*foliop = folio;
+	return 0;
+}
+
+static int guestmem_write_end(const struct kiocb *kiocb,
+			      struct address_space *mapping,
+			      loff_t pos, unsigned int len, unsigned int copied,
+			      struct folio *folio, void *fsdata)
+{
+	if (copied) {
+		if (copied < len) {
+			unsigned int from = pos & (PAGE_SIZE - 1);
+
+			folio_zero_range(folio, from + copied, len - copied);
+		}
+		guestmem_mark_prepared(folio);
+	}
+
+	folio_unlock(folio);
+	folio_put(folio);
+
+	return copied;
+}
+
+static void guestmem_free_folio(struct address_space *mapping,
+				struct folio *folio)
+{
+	WARN_ON_ONCE(!__guestmem_release_folio(mapping, folio));
+}
+
+static int guestmem_error_folio(struct address_space *mapping,
+				struct folio *folio)
+{
+	pgoff_t start, end;
+
+	filemap_invalidate_lock_shared(mapping);
+
+	start = folio->index;
+	end = start + folio_nr_pages(folio);
+
+	__guestmem_invalidate_begin(mapping, start, end);
+
+	/*
+	 * Do not truncate the range, what action is taken in response to the
+	 * error is userspace's decision (assuming the architecture supports
+	 * gracefully handling memory errors). If/when the guest attempts to
+	 * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON,
+	 * at which point KVM can either terminate the VM or propagate the
+	 * error to userspace.
+	 */
+
+	__guestmem_invalidate_end(mapping, start, end);
+
+	filemap_invalidate_unlock_shared(mapping);
+	return MF_FAILED;
+}
+
+static int guestmem_migrate_folio(struct address_space *mapping,
+				  struct folio *dst, struct folio *src,
+				  enum migrate_mode mode)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+
+static const struct address_space_operations guestmem_aops = {
+	.dirty_folio = noop_dirty_folio,
+	.write_begin = guestmem_write_begin,
+	.write_end = guestmem_write_end,
+	.free_folio = guestmem_free_folio,
+	.error_remove_folio = guestmem_error_folio,
+	.migrate_folio = guestmem_migrate_folio,
+};
+
+int guestmem_attach_mapping(struct address_space *mapping,
+			    const struct guestmem_ops *const ops,
+			    struct list_head *data)
+{
+	struct guestmem *gmem;
+
+	if (mapping->a_ops == &guestmem_aops) {
+		gmem = mapping->i_private_data;
+		if (gmem->ops != ops)
+			return -EINVAL;
+
+		goto add;
+	}
+
+	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
+	if (!gmem)
+		return -ENOMEM;
+
+	gmem->ops = ops;
+
+	mapping->a_ops = &guestmem_aops;
+	mapping->i_private_data = gmem;
+
+	mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+	mapping_set_inaccessible(mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(mapping));
+
+add:
+	list_add(data, &mapping->i_private_list);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(guestmem_attach_mapping);
+
+void guestmem_detach_mapping(struct address_space *mapping,
+			     struct list_head *data)
+{
+	list_del(data);
+
+	if (list_empty(&mapping->i_private_list)) {
+		/**
+		 * Ensures we call ->free_folio() for any allocated folios.
+		 * Any folios allocated after this point are assumed not to be
+		 * accessed by the guest, so we don't need to worry about
+		 * guestmem ops not being called on them.
+		 */
+		truncate_inode_pages(mapping, 0);
+
+		kfree(mapping->i_private_data);
+		mapping->i_private_data = NULL;
+		mapping->a_ops = &empty_aops;
+	}
+}
+EXPORT_SYMBOL_GPL(guestmem_detach_mapping);
+
+struct folio *guestmem_grab_folio(struct address_space *mapping, pgoff_t index)
+{
+	/* TODO: Support huge pages. */
+	return filemap_grab_folio(mapping, index);
+}
+EXPORT_SYMBOL_GPL(guestmem_grab_folio);
+
+int guestmem_punch_hole(struct address_space *mapping, loff_t offset,
+			loff_t len)
+{
+	pgoff_t start = offset >> PAGE_SHIFT;
+	pgoff_t end = (offset + len) >> PAGE_SHIFT;
+
+	filemap_invalidate_lock(mapping);
+	__guestmem_invalidate_begin(mapping, start, end);
+
+	truncate_inode_pages_range(mapping, offset, offset + len - 1);
+
+	__guestmem_invalidate_end(mapping, start, end);
+	filemap_invalidate_unlock(mapping);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(guestmem_punch_hole);
+
+int guestmem_allocate(struct address_space *mapping, loff_t offset, loff_t len)
+{
+	pgoff_t start, index, end;
+	int r;
+
+	/* Dedicated guest is immutable by default. */
+	if (offset + len > i_size_read(mapping->host))
+		return -EINVAL;
+
+	filemap_invalidate_lock_shared(mapping);
+
+	start = offset >> PAGE_SHIFT;
+	end = (offset + len) >> PAGE_SHIFT;
+
+	r = 0;
+	for (index = start; index < end; ) {
+		struct folio *folio;
+
+		if (signal_pending(current)) {
+			r = -EINTR;
+			break;
+		}
+
+		folio = guestmem_grab_folio(mapping, index);
+		if (IS_ERR(folio)) {
+			r = PTR_ERR(folio);
+			break;
+		}
+
+		index = folio_next_index(folio);
+
+		folio_unlock(folio);
+		folio_put(folio);
+
+		/* 64-bit only, wrapping the index should be impossible. */
+		if (WARN_ON_ONCE(!index))
+			break;
+
+		cond_resched();
+	}
+
+	filemap_invalidate_unlock_shared(mapping);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(guestmem_allocate);
+
+bool guestmem_test_no_direct_map(struct inode *inode)
+{
+	return mapping_no_direct_map(inode->i_mapping);
+}
+EXPORT_SYMBOL_GPL(guestmem_test_no_direct_map);
+
+void guestmem_mark_prepared(struct folio *folio)
+{
+	struct inode *inode = folio_inode(folio);
+
+	if (guestmem_test_no_direct_map(inode))
+		set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio), false);
+
+	folio_mark_uptodate(folio);
+}
+EXPORT_SYMBOL_GPL(guestmem_mark_prepared);
+
+static vm_fault_t guestmem_fault_user_mapping(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct folio *folio;
+	vm_fault_t ret = VM_FAULT_LOCKED;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return VM_FAULT_SIGBUS;
+
+	folio = guestmem_grab_folio(inode->i_mapping, vmf->pgoff);
+	if (IS_ERR(folio)) {
+		int err = PTR_ERR(folio);
+
+		if (err == -EAGAIN)
+			return VM_FAULT_RETRY;
+
+		return vmf_error(err);
+	}
+
+	if (WARN_ON_ONCE(folio_test_large(folio))) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_folio;
+	}
+
+	if (!folio_test_uptodate(folio)) {
+		clear_highpage(folio_page(folio, 0));
+		guestmem_mark_prepared(folio);
+	}
+
+	if (userfaultfd_minor(vmf->vma)) {
+		folio_unlock(folio);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+
+out_folio:
+	if (ret != VM_FAULT_LOCKED) {
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	return ret;
+}
+
+static const struct vm_operations_struct guestmem_vm_ops = {
+	.fault = guestmem_fault_user_mapping,
+};
+
+int guestmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct address_space *mapping = file_inode(file)->i_mapping;
+	struct guestmem *gmem = mapping->i_private_data;
+
+	if (!gmem->ops->supports_mmap || !gmem->ops->supports_mmap(file_inode(file)))
+		return -ENODEV;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+	    (VM_SHARED | VM_MAYSHARE)) {
+		return -EINVAL;
+	}
+
+	vma->vm_ops = &guestmem_vm_ops;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(guestmem_mmap);
+
+bool guestmem_vma_is_guestmem(struct vm_area_struct *vma)
+{
+	struct inode *inode;
+
+	if (!vma->vm_file)
+		return false;
+
+	inode = file_inode(vma->vm_file);
+	if (!inode || !inode->i_mapping || !inode->i_mapping->i_private_data)
+		return false;
+
+	return inode->i_mapping->a_ops == &guestmem_aops;
+}
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 1b7d5be0b6c4..41e26ad33c1b 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -114,6 +114,7 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
 
 config KVM_GUEST_MEMFD
 	select XARRAY_MULTI
+	select GUESTMEM
 	bool
 
 config HAVE_KVM_ARCH_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 6989362c056c..15ab13bf6d40 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include
 #include
+#include
 #include
 #include
 #include
@@ -43,26 +44,6 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
 	return 0;
 }
 
-static bool kvm_gmem_test_no_direct_map(struct inode *inode)
-{
-	return ((unsigned long) inode->i_private) & GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
-}
-
-static inline int kvm_gmem_mark_prepared(struct folio *folio)
-{
-	struct inode *inode = folio_inode(folio);
-	int r = 0;
-
-	if (kvm_gmem_test_no_direct_map(inode))
-		r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
-						 false);
-
-	if (!r)
-		folio_mark_uptodate(folio);
-
-	return r;
-}
-
 /*
  * Process @folio, which contains @gfn, so that the guest can use it.
  * The folio must be locked and the gfn must be contained in @slot.
@@ -98,7 +79,7 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 	index = ALIGN_DOWN(index, 1 << folio_order(folio));
 	r = __kvm_gmem_prepare_folio(kvm, slot, index, folio);
 	if (!r)
-		r = kvm_gmem_mark_prepared(folio);
+		guestmem_mark_prepared(folio);
 
 	return r;
 }
@@ -114,8 +95,7 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
  */
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
-	/* TODO: Support huge pages. */
-	return filemap_grab_folio(inode->i_mapping, index);
+	return guestmem_grab_folio(inode->i_mapping, index);
 }
 
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -167,79 +147,6 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
 	}
 }
 
-static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
-{
-	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
-	pgoff_t start = offset >> PAGE_SHIFT;
-	pgoff_t end = (offset + len) >> PAGE_SHIFT;
-	struct kvm_gmem *gmem;
-
-	/*
-	 * Bindings must be stable across invalidation to ensure the start+end
-	 * are balanced.
-	 */
-	filemap_invalidate_lock(inode->i_mapping);
-
-	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_begin(gmem, start, end);
-
-	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
-
-	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_end(gmem, start, end);
-
-	filemap_invalidate_unlock(inode->i_mapping);
-
-	return 0;
-}
-
-static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	pgoff_t start, index, end;
-	int r;
-
-	/* Dedicated guest is immutable by default. */
-	if (offset + len > i_size_read(inode))
-		return -EINVAL;
-
-	filemap_invalidate_lock_shared(mapping);
-
-	start = offset >> PAGE_SHIFT;
-	end = (offset + len) >> PAGE_SHIFT;
-
-	r = 0;
-	for (index = start; index < end; ) {
-		struct folio *folio;
-
-		if (signal_pending(current)) {
-			r = -EINTR;
-			break;
-		}
-
-		folio = kvm_gmem_get_folio(inode, index);
-		if (IS_ERR(folio)) {
-			r = PTR_ERR(folio);
-			break;
-		}
-
-		index = folio_next_index(folio);
-
-		folio_unlock(folio);
-		folio_put(folio);
-
-		/* 64-bit only, wrapping the index should be impossible. */
-		if (WARN_ON_ONCE(!index))
-			break;
-
-		cond_resched();
-	}
-
-	filemap_invalidate_unlock_shared(mapping);
-
-	return r;
-}
-
 static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 			       loff_t len)
 {
@@ -255,9 +162,9 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 		return -EINVAL;
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
-		ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
+		ret = guestmem_punch_hole(file_inode(file)->i_mapping, offset, len);
 	else
-		ret = kvm_gmem_allocate(file_inode(file), offset, len);
+		ret = guestmem_allocate(file_inode(file)->i_mapping, offset, len);
 
 	if (!ret)
 		file_modified(file);
@@ -299,7 +206,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
 	kvm_gmem_invalidate_end(gmem, 0, -1ul);
 
-	list_del(&gmem->entry);
+	guestmem_detach_mapping(inode->i_mapping, &gmem->entry);
 
 	filemap_invalidate_unlock(inode->i_mapping);
 
@@ -335,74 +242,8 @@ static bool kvm_gmem_supports_mmap(struct inode *inode)
 	return flags & GUEST_MEMFD_FLAG_MMAP;
 }
 
-static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
-{
-	struct inode *inode = file_inode(vmf->vma->vm_file);
-	struct folio *folio;
-	vm_fault_t ret = VM_FAULT_LOCKED;
-
-	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
-		return VM_FAULT_SIGBUS;
-
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
-	if (IS_ERR(folio)) {
-		int err = PTR_ERR(folio);
-
-		if (err == -EAGAIN)
-			return VM_FAULT_RETRY;
-
-		return vmf_error(err);
-	}
-
-	if (WARN_ON_ONCE(folio_test_large(folio))) {
-		ret = VM_FAULT_SIGBUS;
-		goto out_folio;
-	}
-
-	if (!folio_test_uptodate(folio)) {
-		int err = 0;
-
-		clear_highpage(folio_page(folio, 0));
-		err = kvm_gmem_mark_prepared(folio);
-
-		if (err) {
-			ret = vmf_error(err);
-			goto out_folio;
-		}
-	}
-
-	vmf->page = folio_file_page(folio, vmf->pgoff);
-
-out_folio:
-	if (ret != VM_FAULT_LOCKED) {
-		folio_unlock(folio);
-		folio_put(folio);
-	}
-
-	return ret;
-}
-
-static const struct vm_operations_struct kvm_gmem_vm_ops = {
-	.fault = kvm_gmem_fault_user_mapping,
-};
-
-static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
-{
-	if (!kvm_gmem_supports_mmap(file_inode(file)))
-		return -ENODEV;
-
-	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
-	    (VM_SHARED | VM_MAYSHARE)) {
-		return -EINVAL;
-	}
-
-	vma->vm_ops = &kvm_gmem_vm_ops;
-
-	return 0;
-}
-
 static struct file_operations kvm_gmem_fops = {
-	.mmap		= kvm_gmem_mmap,
+	.mmap		= guestmem_mmap,
 	.llseek		= default_llseek,
 	.write_iter	= generic_perform_write,
 	.open		= generic_file_open,
@@ -415,104 +256,24 @@ void kvm_gmem_init(struct module *module)
 	kvm_gmem_fops.owner = module;
 }
 
-static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
-				     struct address_space *mapping,
-				     loff_t pos, unsigned int len,
-				     struct folio **foliop,
-				     void **fsdata)
-{
-	struct file *file = kiocb->ki_filp;
-	pgoff_t index = pos >> PAGE_SHIFT;
-	struct folio *folio;
-
-	if (!PAGE_ALIGNED(pos) || len != PAGE_SIZE)
-		return -EINVAL;
-
-	if (pos + len > i_size_read(file_inode(file)))
-		return -EINVAL;
-
-	folio = kvm_gmem_get_folio(file_inode(file), index);
-	if (IS_ERR(folio))
-		return -EFAULT;
-
-	if (WARN_ON_ONCE(folio_test_large(folio))) {
-		folio_unlock(folio);
-		folio_put(folio);
-		return -EFAULT;
-	}
-
-	if (folio_test_uptodate(folio)) {
-		folio_unlock(folio);
-		folio_put(folio);
-		return -ENOSPC;
-	}
-
-	*foliop = folio;
-	return 0;
-}
-
-static int kvm_kmem_gmem_write_end(const struct kiocb *kiocb,
-				   struct address_space *mapping,
-				   loff_t pos, unsigned int len,
-				   unsigned int copied,
-				   struct folio *folio, void *fsdata)
+static void kvm_guestmem_invalidate_begin(struct list_head *entry, pgoff_t start,
+					  pgoff_t end)
 {
-	if (copied) {
-		if (copied < len) {
-			unsigned int from = pos & (PAGE_SIZE - 1);
-
-			folio_zero_range(folio, from + copied, len - copied);
-		}
-		kvm_gmem_mark_prepared(folio);
-	}
-
-	folio_unlock(folio);
-	folio_put(folio);
-
-	return copied;
-}
+	struct kvm_gmem *gmem = container_of(entry, struct kvm_gmem, entry);
 
-static int kvm_gmem_migrate_folio(struct address_space *mapping,
-				  struct folio *dst, struct folio *src,
-				  enum migrate_mode mode)
-{
-	WARN_ON_ONCE(1);
-	return -EINVAL;
+	kvm_gmem_invalidate_begin(gmem, start, end);
 }
 
-static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *folio)
+static void kvm_guestmem_invalidate_end(struct list_head *entry, pgoff_t start,
+					pgoff_t end)
 {
-	struct list_head *gmem_list = &mapping->i_private_list;
-	struct kvm_gmem *gmem;
-	pgoff_t start, end;
-
-	filemap_invalidate_lock_shared(mapping);
-
-	start = folio->index;
-	end = start + folio_nr_pages(folio);
-
-	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_begin(gmem, start, end);
+	struct kvm_gmem *gmem = container_of(entry, struct kvm_gmem, entry);
 
-	/*
-	 * Do not truncate the range, what action is taken in response to the
-	 * error is userspace's decision (assuming the architecture supports
-	 * gracefully handling memory errors). If/when the guest attempts to
-	 * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON,
-	 * at which point KVM can either terminate the VM or propagate the
-	 * error to userspace.
-	 */
-
-	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_end(gmem, start, end);
-
-	filemap_invalidate_unlock_shared(mapping);
-
-	return MF_DELAYED;
+	kvm_gmem_invalidate_end(gmem, start, end);
 }
 
-static void kvm_gmem_free_folio(struct address_space *mapping,
-				struct folio *folio)
+static bool kvm_gmem_release_folio(struct address_space *mapping,
+				   struct folio *folio)
 {
 	struct page *page = folio_page(folio, 0);
 	kvm_pfn_t pfn = page_to_pfn(page);
@@ -525,19 +286,19 @@ static void kvm_gmem_free_folio(struct address_space *mapping,
 	 * happened in set_direct_map_invalid_noflush() in kvm_gmem_mark_prepared().
 	 * Thus set_direct_map_valid_noflush() here only updates prot bits.
 	 */
-	if (kvm_gmem_test_no_direct_map(mapping->host))
+	if (guestmem_test_no_direct_map(mapping->host))
 		set_direct_map_valid_noflush(page, folio_nr_pages(folio), true);
 
 	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+
+	return true;
 }
 
-static const struct address_space_operations kvm_gmem_aops = {
-	.dirty_folio = noop_dirty_folio,
-	.write_begin = kvm_kmem_gmem_write_begin,
-	.write_end = kvm_kmem_gmem_write_end,
-	.migrate_folio = kvm_gmem_migrate_folio,
-	.error_remove_folio = kvm_gmem_error_folio,
-	.free_folio = kvm_gmem_free_folio,
+static const struct guestmem_ops kvm_guestmem_ops = {
+	.invalidate_begin	= kvm_guestmem_invalidate_begin,
+	.invalidate_end		= kvm_guestmem_invalidate_end,
+	.release_folio		= kvm_gmem_release_folio,
+	.supports_mmap		= kvm_gmem_supports_mmap,
 };
 
 static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
@@ -587,13 +348,12 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
 	inode->i_mode |= S_IFREG;
 	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+	err = guestmem_attach_mapping(inode->i_mapping, &kvm_guestmem_ops,
+				      &gmem->entry);
+	if (err)
+		goto err_putfile;
 
 	if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
 		mapping_set_no_direct_map(inode->i_mapping);
@@ -601,11 +361,12 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
-	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
 
 	fd_install(fd, file);
 	return fd;
 
+err_putfile:
+	fput(file);
 err_gmem:
 	kfree(gmem);
 err_fd:
@@ -869,7 +630,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		p = src ? src + i * PAGE_SIZE : NULL;
 		ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
 		if (!ret)
-			ret = kvm_gmem_mark_prepared(folio);
+			guestmem_mark_prepared(folio);
 
 put_folio_and_exit:
 		folio_put(folio);
-- 
2.50.1