From nobody Mon Dec 15 22:06:19 2025
From: "Kalyazin, Nikita"
To: "pbonzini@redhat.com", "shuah@kernel.org"
CC: "kvm@vger.kernel.org", "linux-kselftest@vger.kernel.org",
	"linux-kernel@vger.kernel.org", "seanjc@google.com", "david@kernel.org",
	"jthoughton@google.com", "ackerleytng@google.com",
	"vannapurve@google.com", "jackmanb@google.com", "patrick.roy@linux.dev",
	"Thomson, Jack", "Itazuri, Takahiro", "Manwaring, Derek",
	"Cali, Marco", "Kalyazin, Nikita"
Subject: [PATCH v7 1/2] KVM: guest_memfd: add generic population via write
Date: Fri, 14 Nov 2025 15:18:41 +0000
Message-ID: <20251114151828.98165-2-kalyazin@amazon.com>
References: <20251114151828.98165-1-kalyazin@amazon.com>
In-Reply-To: <20251114151828.98165-1-kalyazin@amazon.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Nikita Kalyazin

On systems that support shared guest memory, write() is useful, for
example, for populating the initial guest image.  While the same can be
achieved by mapping the memory into userspace and memcpying into it,
write() is more performant because it does not need to set up user page
tables and does not take a page fault for every page the way memcpy
does.  Note that the memcpy path cannot be accelerated via
MADV_POPULATE_WRITE either, because that relies on GUP, which
guest_memfd does not support.

Populating 512 MiB of guest_memfd on an x86 machine:
 - via memcpy: 436 ms
 - via write:  202 ms (-54%)

Only page-aligned offset and len are allowed.  Even though non-aligned
writes are technically possible, the restriction simplifies handling of
mixed shared/private huge pages once in-place conversion support is
implemented [1].

write() will only be allowed to populate shared pages.  When direct map
removal is implemented [2]:
 - write() will not be allowed to access pages that have already been
   removed from the direct map
 - on completion, write() will remove the populated pages from the
   direct map

While it is technically possible to implement the read() syscall on
systems with shared guest memory, it is not supported because there is
currently no use case for it.

[1] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com
[2] https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu.de

Signed-off-by: Nikita Kalyazin
---
 Documentation/virt/kvm/api.rst |  2 ++
 include/linux/kvm_host.h       |  2 +-
 include/uapi/linux/kvm.h       |  1 +
 virt/kvm/guest_memfd.c         | 52 ++++++++++++++++++++++++++++++++++
 4 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 57061fa29e6a..9541e95fc2ed 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6448,6 +6448,8 @@ specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags:
                               without INIT_SHARED will be marked private).
                               Shared memory can be faulted into host userspace
                               page tables. Private memory cannot.
+ GUEST_MEMFD_FLAG_WRITE       Enable using write() on the guest_memfd file
+                              descriptor.
 ============================ ================================================
 
 When the KVM MMU performs a PFN lookup to service a guest fault and the backing
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5bd76cf394fa..5fbf65f49586 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -736,7 +736,7 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
 	u64 flags = GUEST_MEMFD_FLAG_MMAP;
 
 	if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
-		flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+		flags |= GUEST_MEMFD_FLAG_INIT_SHARED | GUEST_MEMFD_FLAG_WRITE;
 
 	return flags;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52f6000ab020..5b73d6528f1c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1601,6 +1601,7 @@ struct kvm_memory_attributes {
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
 #define GUEST_MEMFD_FLAG_MMAP		(1ULL << 0)
 #define GUEST_MEMFD_FLAG_INIT_SHARED	(1ULL << 1)
+#define GUEST_MEMFD_FLAG_WRITE		(1ULL << 2)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index ffadc5ee8e04..2c71c21b9189 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -411,6 +411,8 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
 
 static struct file_operations kvm_gmem_fops = {
 	.mmap		= kvm_gmem_mmap,
+	.llseek		= default_llseek,
+	.write_iter	= generic_perform_write,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
@@ -421,6 +423,53 @@ void kvm_gmem_init(struct module *module)
 	kvm_gmem_fops.owner = module;
 }
 
+static bool kvm_gmem_supports_write(struct inode *inode)
+{
+	const u64 flags = (u64)inode->i_private;
+
+	return flags & GUEST_MEMFD_FLAG_WRITE;
+}
+
+static int kvm_gmem_write_begin(const struct kiocb *kiocb,
+				struct address_space *mapping,
+				loff_t pos, unsigned int len,
+				struct folio **folio, void **fsdata)
+{
+	struct inode *inode = file_inode(kiocb->ki_filp);
+
+	if (!kvm_gmem_supports_write(inode))
+		return -ENODEV;
+
+	if (pos + len > i_size_read(inode))
+		return -EINVAL;
+
+	if (!IS_ALIGNED(pos, PAGE_SIZE) || !IS_ALIGNED(len, PAGE_SIZE))
+		return -EINVAL;
+
+	*folio = kvm_gmem_get_folio(inode, pos >> PAGE_SHIFT);
+	if (IS_ERR(*folio))
+		return PTR_ERR(*folio);
+
+	return 0;
+}
+
+static int kvm_gmem_write_end(const struct kiocb *kiocb,
+			      struct address_space *mapping,
+			      loff_t pos, unsigned int len,
+			      unsigned int copied,
+			      struct folio *folio, void *fsdata)
+{
+	if (!folio_test_uptodate(folio)) {
+		folio_zero_range(folio, copied, len - copied);
+		folio_mark_uptodate(folio);
+	}
+
+	folio_unlock(folio);
+	folio_put(folio);
+
+	return copied;
+}
+
 static int kvm_gmem_migrate_folio(struct address_space *mapping,
 				  struct folio *dst, struct folio *src,
 				  enum migrate_mode mode)
@@ -469,6 +518,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
 
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio	= noop_dirty_folio,
+	.write_begin	= kvm_gmem_write_begin,
+	.write_end	= kvm_gmem_write_end,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
@@ -516,6 +567,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	}
 
 	file->f_flags |= O_LARGEFILE;
+	file->f_mode |= FMODE_LSEEK | FMODE_PWRITE;
 
 	inode = file->f_inode;
 	WARN_ON(file->f_mapping != inode->i_mapping);
-- 
2.50.1
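
For illustration only (not part of the patch): a minimal userspace sketch of
how the new flag might be used to populate a guest image.  The helper name,
the 512 MiB size, and the KVM_SET_USER_MEMORY_REGION2 note are illustrative
assumptions; it also assumes UAPI headers that already define
GUEST_MEMFD_FLAG_WRITE and a page-aligned image_size.

/*
 * Hypothetical usage sketch: create a guest_memfd that allows write()
 * and populate it with an initial image via pwrite().  Error handling
 * is abbreviated.
 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

static int populate_guest_image(int vm_fd, const void *image, size_t image_size)
{
	struct kvm_create_guest_memfd args = {
		.size  = 512UL << 20,			/* 512 MiB of guest memory */
		.flags = GUEST_MEMFD_FLAG_MMAP |
			 GUEST_MEMFD_FLAG_INIT_SHARED |	/* pages start shared */
			 GUEST_MEMFD_FLAG_WRITE,	/* allow write()/pwrite() */
	};
	int gmem_fd;

	gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
	if (gmem_fd < 0) {
		perror("KVM_CREATE_GUEST_MEMFD");
		return -1;
	}

	/* Both the offset and the length must be PAGE_SIZE aligned. */
	if (pwrite(gmem_fd, image, image_size, 0) != (ssize_t)image_size) {
		perror("pwrite");
		close(gmem_fd);
		return -1;
	}

	/* The fd can now back a memslot, e.g. via KVM_SET_USER_MEMORY_REGION2. */
	return gmem_fd;
}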