From nobody Mon Jun 15 23:15:09 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 617C23E866C for ; Tue, 14 Apr 2026 14:24:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176640; cv=none; b=fh0+e5k7UsfKOBTfCsYA+Ux/Db3lEjshjgiBdtoQ9Wlu4rxZXV+R8DdB2oSrNuyrvgpcHtzV/BsZHfC/uVTKfu9dslTvwjuNOCLChGW/ohj2ALKib+yPFqBKD9WYDWDL+ZUyfBbWcJ+0BDqcp4Rj3TWBCJhfi0QxxUdiv1XHzaA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176640; c=relaxed/simple; bh=D3Xb8PeTvLmTI+UVvNhEzjipl2DCTEFkVuBJpSK3e5A=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=bBnrnsTdZxJB5bo84meijjsOmuxy6NZUGQEc5Kxdnd8zcghrkuCBTJI9OKDUQQp2yL9ZmkIp9+xCeAETxpuTW+ZVaEjNXwi9+jFyYDnLAy/iN9EVIcwIffk3L88/zALug2sqhzmwMKt0FM5+oWg30Eqz9ygsqIwoUpPX2Qah/uw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Ne7QX6p4; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Ne7QX6p4" Received: by smtp.kernel.org (Postfix) with ESMTPSA id A2E10C19425; Tue, 14 Apr 2026 14:23:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776176640; bh=D3Xb8PeTvLmTI+UVvNhEzjipl2DCTEFkVuBJpSK3e5A=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=Ne7QX6p4nh9f/KVkal6ql22E5r/yWYe9JHa7nSlxz3j6Ic+3FzQVmlFp+3vAuTlat v0HqOSM0yhyjH6x1IQIW9w1yjn1IpsgQgDA/ke4oZiY/c7jkPkLvQfTXTCThogjp9c ZZ3+3pIh/vV6BvtlvbyB+osol2DpabaFs7qIIRZHHN4zTEmevQYfc0MyFMUA4qppiy 
sdM/Otrt5Eojdb4TREddxEgauSm1zgx3ntLtbVEHXfIQD7M+Bz89OCnyxFO+NHEVs6 XhYHb8/aDtAjTsZdr3Zbwz6M8YnRJuertzI2MGpRZHLYHmlPILZh/g6rUS8mxGV2eT Hh2UjFnmMz96Q== Received: from phl-compute-02.internal (phl-compute-02.internal [10.202.2.42]) by mailfauth.phl.internal (Postfix) with ESMTP id CED13F4006B; Tue, 14 Apr 2026 10:23:58 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-02.internal (MEProxy); Tue, 14 Apr 2026 10:23:58 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdegudefkecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpefhvfevufffkffojghfggfgsedtkeertdertddtnecuhfhrohhmpedfmfhirhihlhcu ufhhuhhtshgvmhgruhculdfovghtrgdmfdcuoehkrghssehkvghrnhgvlhdrohhrgheqne cuggftrfgrthhtvghrnhephfdujeefvdegkefffedvkeehkeekueevfedtleehgeetlefg feevveeukefhtdetnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilh hfrhhomhepkhhirhhilhhlodhmvghsmhhtphgruhhthhhpvghrshhonhgrlhhithihqddu ieduudeivdeiheehqddvkeeggeegjedvkedqkhgrsheppehkvghrnhgvlhdrohhrghessh hhuhhtvghmohhvrdhnrghmvgdpnhgspghrtghpthhtohepudelpdhmohguvgepshhmthhp ohhuthdprhgtphhtthhopegrkhhpmheslhhinhhugidqfhhouhhnuggrthhiohhnrdhorh hgpdhrtghpthhtohepphgvthgvrhigsehrvgguhhgrthdrtghomhdprhgtphhtthhopegu rghvihgusehkvghrnhgvlhdrohhrghdprhgtphhtthhopehljhhssehkvghrnhgvlhdroh hrghdprhgtphhtthhopehrphhptheskhgvrhhnvghlrdhorhhgpdhrtghpthhtohepshhu rhgvnhgssehgohhoghhlvgdrtghomhdprhgtphhtthhopehvsggrsghkrgeskhgvrhhnvg hlrdhorhhgpdhrtghpthhtoheplhhirghmrdhhohiflhgvthhtsehorhgrtghlvgdrtgho mhdprhgtphhtthhopeiiihihsehnvhhiughirgdrtghomh X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 14 Apr 2026 10:23:58 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . 
Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 01/12] userfaultfd: define UAPI constants for anonymous minor faults Date: Tue, 14 Apr 2026 15:23:35 +0100 Message-ID: <20260414142354.1465950-2-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add UAPI definitions for userfaultfd working set tracking on anonymous memory: - UFFD_FEATURE_MINOR_ANON: minor fault support for anonymous memory - UFFD_FEATURE_MINOR_ASYNC: auto-resolve minor faults without handler - UFFDIO_DEACTIVATE: mark pages as deactivated (protnone or PTE zap) Not yet added to UFFD_API_FEATURES or UFFD_API_RANGE_IOCTLS. 
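For illustration only, a minimal userspace sketch of how the constants above might be used. The feature bit values and the ioctl number are copied from this patch and guarded with #ifndef in case the installed headers already carry them. The helpers wss_features() and wss_deactivate() are hypothetical names that are not part of the series, and the sketch assumes the caller has already opened a userfaultfd, negotiated UFFDIO_API, and registered the range in MINOR mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Values taken from this patch; guarded in case the headers already have them. */
#ifndef UFFD_FEATURE_MINOR_ANON
#define UFFD_FEATURE_MINOR_ANON		(1 << 17)
#endif
#ifndef UFFD_FEATURE_MINOR_ASYNC
#define UFFD_FEATURE_MINOR_ASYNC	(1 << 18)
#endif
#ifndef UFFDIO_DEACTIVATE
#define _UFFDIO_DEACTIVATE		(0x09)
#define UFFDIO_DEACTIVATE		_IOR(UFFDIO, _UFFDIO_DEACTIVATE, \
					     struct uffdio_range)
#endif

/*
 * Hypothetical helper: build the feature mask a tracker would request
 * through UFFDIO_API. MINOR_ASYNC is only added on top of MINOR_ANON.
 */
static uint64_t wss_features(int async)
{
	uint64_t features = UFFD_FEATURE_MINOR_ANON;

	if (async)
		features |= UFFD_FEATURE_MINOR_ASYNC;
	return features;
}

/*
 * Hypothetical helper: deactivate [addr, addr + len) on a userfaultfd
 * that is assumed to be set up and registered in MINOR mode.
 * Returns the ioctl() result: 0 on success, -1 with errno set on error.
 */
static int wss_deactivate(int uffd, void *addr, size_t len)
{
	struct uffdio_range range = {
		.start = (uintptr_t)addr,
		.len = len,
	};

	return ioctl(uffd, UFFDIO_DEACTIVATE, &range);
}
```

Because the constants are not yet part of UFFD_API_FEATURES or UFFD_API_RANGE_IOCTLS at this point in the series, a kernel with only this patch applied is expected to reject the request; the sketch only shows the intended call shape.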
Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 --- include/uapi/linux/userfaultfd.h | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaul= tfd.h index 2841e4ea8f2c..336d07e1b6de 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -79,6 +79,7 @@ #define _UFFDIO_WRITEPROTECT (0x06) #define _UFFDIO_CONTINUE (0x07) #define _UFFDIO_POISON (0x08) +#define _UFFDIO_DEACTIVATE (0x09) #define _UFFDIO_API (0x3F) =20 /* userfaultfd ioctl ids */ @@ -103,6 +104,8 @@ struct uffdio_continue) #define UFFDIO_POISON _IOWR(UFFDIO, _UFFDIO_POISON, \ struct uffdio_poison) +#define UFFDIO_DEACTIVATE _IOR(UFFDIO, _UFFDIO_DEACTIVATE, \ + struct uffdio_range) =20 /* read() structure */ struct uffd_msg { @@ -230,6 +233,18 @@ struct uffdio_api { * * UFFD_FEATURE_MOVE indicates that the kernel supports moving an * existing page contents from userspace. + * + * UFFD_FEATURE_MINOR_ANON indicates that minor fault interception + * is supported for anonymous private memory. Pages are made + * inaccessible via UFFDIO_DEACTIVATE (sets PROT_NONE while + * preserving the page) and faults are delivered when the pages + * are re-accessed. + * + * UFFD_FEATURE_MINOR_ASYNC indicates asynchronous minor fault + * mode. When set, faults on deactivated pages are auto-resolved + * by the kernel (PTE permissions restored immediately) without + * delivering a message to the userfaultfd handler. Use + * PAGEMAP_SCAN to find pages that were not re-accessed. 
*/ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -248,6 +263,8 @@ struct uffdio_api { #define UFFD_FEATURE_POISON (1<<14) #define UFFD_FEATURE_WP_ASYNC (1<<15) #define UFFD_FEATURE_MOVE (1<<16) +#define UFFD_FEATURE_MINOR_ANON (1<<17) +#define UFFD_FEATURE_MINOR_ASYNC (1<<18) __u64 features; =20 __u64 ioctls; --=20 2.51.2 From nobody Mon Jun 15 23:15:09 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F35E63E928B; Tue, 14 Apr 2026 14:24:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176642; cv=none; b=l2oMal/lGEEQkVmLXtVmR3CjlRV5jm6+dk3gC382asPaLrirpqxSnSjz2FrD0JosQ4DGWxMYDLO6z3VEvveQUWfnPqtQNf5x/E/Su6kV2tQNHG7QF3DzGNtl/k4tREC39OsXriWoRh+fEYp+lsQI+OWnc0tKmP8QrYqaD1gYmHw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176642; c=relaxed/simple; bh=bxAJ7DEU0c8W9EDIeCuV7eYhdJmY4/plj4fpznSsaWc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=rNd3RjklCVEyM12Bp2zl0GxcFdOHha6d6qKm4P/20txuAXaLreJqrc0eegMgdtvTJcGQ1DU22kvvgzhUdyrsqLguhVG7YR4F4+QH6GFAv823nbVILzbIN9hvcJGtqmDXPVtLMjF7XkMEk78Gx9+UFl46i0CXV0Ogu6D5FRn97SQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=PNC07lkp; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="PNC07lkp" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 236CAC2BCB0; Tue, 14 Apr 2026 14:24:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776176641; 
bh=bxAJ7DEU0c8W9EDIeCuV7eYhdJmY4/plj4fpznSsaWc=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=PNC07lkpEWnmGuIw4RQ3p0swT7wwprKfoNCH9q69rZMjPQEKg3bcH8z7leYqtnN9d 3FOVETsOWs8tBhSFUGsMty3w/mVyjpNF7pVCos4JsK29ZyO1S1xkkF/Dat45eMwbxq OqOKqLDJAgoW/ilWIGQYDgu6Y9VhoDtcUwfvh4sldyUWm+XxBJTfiDe0qqs4dwN7D5 Xeo4vangahMPI1WZe1mc2KIPNnhlhLEVGzBRv/q283kkG6QhllAK3H1Ha2hLjGmOSA 1XVpynGj5DA1N0daVxjFVdwuIR985L6LCvLG9or9hMvZ3q6K7oJldWl7znkvB9ZvRq d0u2XfwJpNbZQ== Received: from phl-compute-05.internal (phl-compute-05.internal [10.202.2.45]) by mailfauth.phl.internal (Postfix) with ESMTP id 54360F4006E; Tue, 14 Apr 2026 10:24:00 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-05.internal (MEProxy); Tue, 14 Apr 2026 10:24:00 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdegudefkecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpefhvfevufffkffojghfggfgsedtkeertdertddtnecuhfhrohhmpedfmfhirhihlhcu ufhhuhhtshgvmhgruhculdfovghtrgdmfdcuoehkrghssehkvghrnhgvlhdrohhrgheqne cuggftrfgrthhtvghrnhephfdujeefvdegkefffedvkeehkeekueevfedtleehgeetlefg feevveeukefhtdetnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilh hfrhhomhepkhhirhhilhhlodhmvghsmhhtphgruhhthhhpvghrshhonhgrlhhithihqddu ieduudeivdeiheehqddvkeeggeegjedvkedqkhgrsheppehkvghrnhgvlhdrohhrghessh hhuhhtvghmohhvrdhnrghmvgdpnhgspghrtghpthhtohepudelpdhmohguvgepshhmthhp ohhuthdprhgtphhtthhopegrkhhpmheslhhinhhugidqfhhouhhnuggrthhiohhnrdhorh hgpdhrtghpthhtohepphgvthgvrhigsehrvgguhhgrthdrtghomhdprhgtphhtthhopegu rghvihgusehkvghrnhgvlhdrohhrghdprhgtphhtthhopehljhhssehkvghrnhgvlhdroh hrghdprhgtphhtthhopehrphhptheskhgvrhhnvghlrdhorhhgpdhrtghpthhtohepshhu rhgvnhgssehgohhoghhlvgdrtghomhdprhgtphhtthhopehvsggrsghkrgeskhgvrhhnvg hlrdhorhhgpdhrtghpthhtoheplhhirghmrdhhohiflhgvthhtsehorhgrtghlvgdrtgho mhdprhgtphhtthhopeiiihihsehnvhhiughirgdrtghomh X-ME-Proxy: 
Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 14 Apr 2026 10:23:59 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 02/12] userfaultfd: add UFFD_FEATURE_MINOR_ANON registration support Date: Tue, 14 Apr 2026 15:23:36 +0100 Message-ID: <20260414142354.1465950-3-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Allow UFFDIO_REGISTER_MODE_MINOR on anonymous VMAs when the UFFD_FEATURE_MINOR_ANON feature is enabled. Replace the bool wp_async parameter in vma_can_userfault() and userfaultfd_register_range() with an extensible ctx_flags bitmap. Add UFFD_CTX_WP_ASYNC and UFFD_CTX_MINOR_ANON flags, and userfaultfd_ctx_flags() to build the bitmap from ctx->features. Add userfaultfd_minor_async() helper for checking async minor mode from the fault path. Gate UFFD_FEATURE_MINOR_ANON and UFFD_FEATURE_MINOR_ASYNC on CONFIG_HAVE_ARCH_USERFAULTFD_MINOR. Validate that MINOR_ASYNC requires at least one minor feature. Not yet visible to userspace (not in UFFD_API_FEATURES). 
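The checks described above can be sketched as a small standalone model. This is not the kernel code: the MODEL_* names and both functions are hypothetical, and the bit positions are assumed to mirror the UAPI feature bits (MINOR_HUGETLBFS at bit 9, MINOR_SHMEM at bit 10, WP_ASYNC at bit 15, plus the two bits proposed in patch 01). It only illustrates how the ctx_flags bitmap is derived from the feature mask and why MINOR_ASYNC on its own is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the UAPI feature bits (assumed to match the kernel headers). */
#define MODEL_FEAT_MINOR_HUGETLBFS	(1ULL << 9)
#define MODEL_FEAT_MINOR_SHMEM		(1ULL << 10)
#define MODEL_FEAT_WP_ASYNC		(1ULL << 15)
#define MODEL_FEAT_MINOR_ANON		(1ULL << 17)
#define MODEL_FEAT_MINOR_ASYNC		(1ULL << 18)

/* Model of the in-kernel context flags introduced by this patch. */
#define MODEL_CTX_WP_ASYNC		(1U << 0)
#define MODEL_CTX_MINOR_ANON		(1U << 1)

/* Mirrors the idea of userfaultfd_ctx_flags(): features -> ctx_flags. */
static unsigned int model_ctx_flags(uint64_t features)
{
	unsigned int flags = 0;

	if (features & MODEL_FEAT_WP_ASYNC)
		flags |= MODEL_CTX_WP_ASYNC;
	if (features & MODEL_FEAT_MINOR_ANON)
		flags |= MODEL_CTX_MINOR_ANON;
	return flags;
}

/*
 * Mirrors the new UFFDIO_API validation: MINOR_ASYNC is only accepted
 * together with at least one of the minor fault features.
 */
static bool model_minor_async_valid(uint64_t features)
{
	const uint64_t minor_any = MODEL_FEAT_MINOR_ANON |
				   MODEL_FEAT_MINOR_HUGETLBFS |
				   MODEL_FEAT_MINOR_SHMEM;

	if (!(features & MODEL_FEAT_MINOR_ASYNC))
		return true;
	return (features & minor_any) != 0;
}
```

In this model, a request carrying only MODEL_FEAT_MINOR_ASYNC fails validation, which corresponds to the -EINVAL path added to userfaultfd_api().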
Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 --- fs/userfaultfd.c | 49 ++++++++++++++++++++++++++++++----- include/linux/userfaultfd_k.h | 19 +++++++++++--- mm/userfaultfd.c | 4 +-- 3 files changed, 59 insertions(+), 13 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index bdc84e5219cd..8d508ad19e89 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -89,6 +89,27 @@ static bool userfaultfd_wp_async_ctx(struct userfaultfd_= ctx *ctx) return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC); } =20 +static bool userfaultfd_minor_anon_ctx(struct userfaultfd_ctx *ctx) +{ + return ctx && (ctx->features & UFFD_FEATURE_MINOR_ANON); +} + +static bool userfaultfd_minor_async_ctx(struct userfaultfd_ctx *ctx) +{ + return ctx && (ctx->features & UFFD_FEATURE_MINOR_ASYNC); +} + +static unsigned int userfaultfd_ctx_flags(struct userfaultfd_ctx *ctx) +{ + unsigned int flags =3D 0; + + if (userfaultfd_wp_async_ctx(ctx)) + flags |=3D UFFD_CTX_WP_ASYNC; + if (userfaultfd_minor_anon_ctx(ctx)) + flags |=3D UFFD_CTX_MINOR_ANON; + return flags; +} + /* * Whether WP_UNPOPULATED is enabled on the uffd context. 
It is only * meaningful when userfaultfd_wp()=3D=3Dtrue on the vma and when it's @@ -1271,7 +1292,7 @@ static int userfaultfd_register(struct userfaultfd_ct= x *ctx, bool basic_ioctls; unsigned long start, end; struct vma_iterator vmi; - bool wp_async =3D userfaultfd_wp_async_ctx(ctx); + unsigned int ctx_flags =3D userfaultfd_ctx_flags(ctx); =20 user_uffdio_register =3D (struct uffdio_register __user *) arg; =20 @@ -1345,7 +1366,7 @@ static int userfaultfd_register(struct userfaultfd_ct= x *ctx, =20 /* check not compatible vmas */ ret =3D -EINVAL; - if (!vma_can_userfault(cur, vm_flags, wp_async)) + if (!vma_can_userfault(cur, vm_flags, ctx_flags)) goto out_unlock; =20 /* @@ -1398,7 +1419,7 @@ static int userfaultfd_register(struct userfaultfd_ct= x *ctx, VM_WARN_ON_ONCE(!found); =20 ret =3D userfaultfd_register_range(ctx, vma, vm_flags, start, end, - wp_async); + ctx_flags); =20 out_unlock: mmap_write_unlock(mm); @@ -1443,7 +1464,7 @@ static int userfaultfd_unregister(struct userfaultfd_= ctx *ctx, unsigned long start, end, vma_end; const void __user *buf =3D (void __user *)arg; struct vma_iterator vmi; - bool wp_async =3D userfaultfd_wp_async_ctx(ctx); + unsigned int ctx_flags =3D userfaultfd_ctx_flags(ctx); =20 ret =3D -EFAULT; if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister))) @@ -1505,7 +1526,7 @@ static int userfaultfd_unregister(struct userfaultfd_= ctx *ctx, * provides for more strict behavior to notice * unregistration errors. 
*/ - if (!vma_can_userfault(cur, cur->vm_flags, wp_async)) + if (!vma_can_userfault(cur, cur->vm_flags, ctx_flags)) goto out_unlock; =20 found =3D true; @@ -1526,7 +1547,7 @@ static int userfaultfd_unregister(struct userfaultfd_= ctx *ctx, goto skip; =20 VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx !=3D ctx); - VM_WARN_ON_ONCE(!vma_can_userfault(vma, vma->vm_flags, wp_async)); + VM_WARN_ON_ONCE(!vma_can_userfault(vma, vma->vm_flags, ctx_flags)); VM_WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)); =20 if (vma->vm_start > start) @@ -1890,6 +1911,11 @@ bool userfaultfd_wp_async(struct vm_area_struct *vma) return userfaultfd_wp_async_ctx(vma->vm_userfaultfd_ctx.ctx); } =20 +bool userfaultfd_minor_async(struct vm_area_struct *vma) +{ + return userfaultfd_minor_async_ctx(vma->vm_userfaultfd_ctx.ctx); +} + static inline unsigned int uffd_ctx_features(__u64 user_features) { /* @@ -1993,11 +2019,20 @@ static int userfaultfd_api(struct userfaultfd_ctx *= ctx, if (features & UFFD_FEATURE_WP_ASYNC) features |=3D UFFD_FEATURE_WP_UNPOPULATED; =20 + ret =3D -EINVAL; + /* MINOR_ASYNC requires at least one minor feature */ + if ((features & UFFD_FEATURE_MINOR_ASYNC) && + !(features & (UFFD_FEATURE_MINOR_ANON | + UFFD_FEATURE_MINOR_HUGETLBFS | + UFFD_FEATURE_MINOR_SHMEM))) + goto err_out; + /* report all available features and ioctls to userland */ uffdio_api.features =3D UFFD_API_FEATURES; #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR uffdio_api.features &=3D - ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM); + ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM | + UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC); #endif if (!pgtable_supports_uffd_wp()) uffdio_api.features &=3D ~UFFD_FEATURE_PAGEFAULT_FLAG_WP; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index fd5f42765497..d1d4ed4a08b0 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -208,9 +208,13 @@ static inline bool userfaultfd_armed(struct 
vm_area_st= ruct *vma) return vma->vm_flags & __VM_UFFD_FLAGS; } =20 +/* Flags for vma_can_userfault() describing uffd context capabilities */ +#define UFFD_CTX_WP_ASYNC (1 << 0) +#define UFFD_CTX_MINOR_ANON (1 << 1) + static inline bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags, - bool wp_async) + unsigned int ctx_flags) { vm_flags &=3D __VM_UFFD_FLAGS; =20 @@ -218,14 +222,15 @@ static inline bool vma_can_userfault(struct vm_area_s= truct *vma, return false; =20 if ((vm_flags & VM_UFFD_MINOR) && - (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))) + !is_vm_hugetlb_page(vma) && !vma_is_shmem(vma) && + !(vma_is_anonymous(vma) && (ctx_flags & UFFD_CTX_MINOR_ANON))) return false; =20 /* * If wp async enabled, and WP is the only mode enabled, allow any * memory type. */ - if (wp_async && (vm_flags =3D=3D VM_UFFD_WP)) + if ((ctx_flags & UFFD_CTX_WP_ASYNC) && (vm_flags =3D=3D VM_UFFD_WP)) return true; =20 /* @@ -270,6 +275,7 @@ extern void userfaultfd_unmap_complete(struct mm_struct= *mm, struct list_head *uf); extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma); extern bool userfaultfd_wp_async(struct vm_area_struct *vma); +extern bool userfaultfd_minor_async(struct vm_area_struct *vma); =20 void userfaultfd_reset_ctx(struct vm_area_struct *vma); =20 @@ -283,7 +289,7 @@ int userfaultfd_register_range(struct userfaultfd_ctx *= ctx, struct vm_area_struct *vma, vm_flags_t vm_flags, unsigned long start, unsigned long end, - bool wp_async); + unsigned int ctx_flags); =20 void userfaultfd_release_new(struct userfaultfd_ctx *ctx); =20 @@ -446,6 +452,11 @@ static inline bool userfaultfd_wp_async(struct vm_area= _struct *vma) return false; } =20 +static inline bool userfaultfd_minor_async(struct vm_area_struct *vma) +{ + return false; +} + static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct = *vma) { return false; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 927086bb4a3c..dba1ea26fdfe 100644 --- 
a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -2008,7 +2008,7 @@ int userfaultfd_register_range(struct userfaultfd_ctx= *ctx, struct vm_area_struct *vma, vm_flags_t vm_flags, unsigned long start, unsigned long end, - bool wp_async) + unsigned int ctx_flags) { VMA_ITERATOR(vmi, ctx->mm, start); struct vm_area_struct *prev =3D vma_prev(&vmi); @@ -2021,7 +2021,7 @@ int userfaultfd_register_range(struct userfaultfd_ctx= *ctx, for_each_vma_range(vmi, vma, end) { cond_resched(); =20 - VM_WARN_ON_ONCE(!vma_can_userfault(vma, vm_flags, wp_async)); + VM_WARN_ON_ONCE(!vma_can_userfault(vma, vm_flags, ctx_flags)); VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx && vma->vm_userfaultfd_ctx.ctx !=3D ctx); VM_WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)); --=20 2.51.2 From nobody Mon Jun 15 23:15:09 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 87F293E92A9; Tue, 14 Apr 2026 14:24:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176644; cv=none; b=jT4nxyCGTHgfaK/sene0CCA6wnoqCjGkz7Ig5vxF9gitFDHvOCIYOIaouIIdp2cA6jMDc+Kc0jRF14+JCU1TIZixyayROcCNDEkL5N10jcnu1ZILVqmpy/mv1g58PDNnrERCfF9Z+aVTHTnhtqIOPjAgyzSDDbKjGY7SwfbHGMI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176644; c=relaxed/simple; bh=Bf327wT/aGKA5T+U97BuB/g1VkBHVQ9CP0E9lDgK+RI=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=aBk7B9gj7wD/1DNL2PrJNAwu7oxfde9VCjd/M7c8TZ5puldCCFNlD9/ObxBPEKgoUSzWRp3YiKWzCPqVH/cDcGRLIgaj5c7EDizewL+Dlboi13FnmKIINtaeUMcVG+MalpU97XtFXDU2qIwTB32MghzyS7jvOd1eT2fsi/ni/4w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org 
header.i=@kernel.org header.b=qRHV1ldJ; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="qRHV1ldJ" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7710CC2BCB6; Tue, 14 Apr 2026 14:24:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776176644; bh=Bf327wT/aGKA5T+U97BuB/g1VkBHVQ9CP0E9lDgK+RI=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=qRHV1ldJS9pbPerE8dBDm2Wf8MN12erecWxLiWlpG2tq5SXFIjbZ2qAAZ4cTWSPoc ickV/Bk6jxpKZmt5wyoCD4LNOvvYZAXhcJlzQ89LR7M+4xN8vkIx8fAeoO2n7ppHCe YAnayuxl7/iPt6z625ZX4olJu19m/00RtwVtKKHDvTkGOD9WF5Z71wEqGydCdKGfQJ sTvzppWy2o2KQmSIXjtlQCb5Fp9LkBTXrWx216XOu5Fi1oxe95ijq5aC1fwFHB9TBS TuaajWjBx2i/EqdzShze9eptzOpciFRXWihONrvqbCbOASXfs/GgyrXUeMDgynnnug 1q/akBUCeA5yQ== Received: from phl-compute-05.internal (phl-compute-05.internal [10.202.2.45]) by mailfauth.phl.internal (Postfix) with ESMTP id A0E78F40068; Tue, 14 Apr 2026 10:24:02 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-05.internal (MEProxy); Tue, 14 Apr 2026 10:24:02 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdegudefkecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpefhvfevufffkffojghfgggtgfesthekredtredtjeenucfhrhhomhepfdfmihhrhihl ucfuhhhuthhsvghmrghuucdlofgvthgrmddfuceokhgrsheskhgvrhhnvghlrdhorhhgqe enucggtffrrghtthgvrhhnpefhvdefvdevjeevhefhhfevudefudejfeduvdekheeludfh iefhhedujeffffeigfenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrih hlfhhrohhmpehkihhrihhllhdomhgvshhmthhprghuthhhphgvrhhsohhnrghlihhthidq udeiudduiedvieehhedqvdekgeeggeejvdekqdhkrghspeepkhgvrhhnvghlrdhorhhgse hshhhuthgvmhhovhdrnhgrmhgvpdhnsggprhgtphhtthhopeduledpmhhouggvpehsmhht phhouhhtpdhrtghpthhtoheprghkphhmsehlihhnuhigqdhfohhunhgurghtihhonhdroh 
hrghdprhgtphhtthhopehpvghtvghrgiesrhgvughhrghtrdgtohhmpdhrtghpthhtohep uggrvhhiugeskhgvrhhnvghlrdhorhhgpdhrtghpthhtoheplhhjsheskhgvrhhnvghlrd horhhgpdhrtghpthhtoheprhhpphhtsehkvghrnhgvlhdrohhrghdprhgtphhtthhopehs uhhrvghnsgesghhoohhglhgvrdgtohhmpdhrtghpthhtohepvhgsrggskhgrsehkvghrnh gvlhdrohhrghdprhgtphhtthhopehlihgrmhdrhhhofihlvghtthesohhrrggtlhgvrdgt ohhmpdhrtghpthhtohepiihihiesnhhvihguihgrrdgtohhm X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 14 Apr 2026 10:24:01 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 03/12] userfaultfd: implement UFFDIO_DEACTIVATE ioctl Date: Tue, 14 Apr 2026 15:23:37 +0100 Message-ID: <20260414142354.1465950-4-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable UFFDIO_DEACTIVATE marks pages as deactivated within a VM_UFFD_MINOR range: - Anonymous memory: set protnone via change_protection(MM_CP_UFFD_DEACTIVAT= E). Pages stay resident with PFNs preserved, only permissions removed. MM_CP_UFFD_DEACTIVATE is handled independently from MM_CP_PROT_NUMA, bypassing folio_can_map_prot_numa() and CONFIG_NUMA_BALANCING guards. - Shared shmem/hugetlbfs: zap PTEs via zap_page_range_single(). Pages stay in page cache. 
- Private hugetlb: rejected with -EINVAL (zapping would destroy content). Cleanup on unregister/close: restore protnone PTEs to normal permissions in userfaultfd_clear_vma(), preventing permanently inaccessible pages. Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 --- fs/userfaultfd.c | 35 ++++++++++++++++ include/linux/mm.h | 2 + include/linux/userfaultfd_k.h | 2 + mm/huge_memory.c | 9 ++-- mm/mprotect.c | 9 +++- mm/userfaultfd.c | 78 +++++++++++++++++++++++++++++++++-- 6 files changed, 127 insertions(+), 8 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 8d508ad19e89..b317c9854b86 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1441,6 +1441,10 @@ static int userfaultfd_register(struct userfaultfd_c= tx *ctx, if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)) ioctls_out &=3D ~((__u64)1 << _UFFDIO_CONTINUE); =20 + /* DEACTIVATE is only supported for MINOR ranges. */ + if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)) + ioctls_out &=3D ~((__u64)1 << _UFFDIO_DEACTIVATE); + /* * Now that we scanned all vmas we can already tell * userland which ioctls methods are guaranteed to @@ -1788,6 +1792,34 @@ static int userfaultfd_writeprotect(struct userfault= fd_ctx *ctx, return ret; } =20 +static int userfaultfd_deactivate(struct userfaultfd_ctx *ctx, + unsigned long arg) +{ + int ret; + struct uffdio_range uffdio_range; + + if (atomic_read(&ctx->mmap_changing)) + return -EAGAIN; + + if (copy_from_user(&uffdio_range, (void __user *)arg, + sizeof(uffdio_range))) + return -EFAULT; + + ret =3D validate_range(ctx->mm, uffdio_range.start, uffdio_range.len); + if (ret) + return ret; + + if (mmget_not_zero(ctx->mm)) { + ret =3D mdeactivate_range(ctx, uffdio_range.start, + uffdio_range.len); + mmput(ctx->mm); + } else { + return -ESRCH; + } + + return ret; +} + static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long= arg) { __s64 ret; @@ -2108,6 +2140,9 @@ static long userfaultfd_ioctl(struct 
file *file, unsi= gned cmd, case UFFDIO_POISON: ret =3D userfaultfd_poison(ctx, arg); break; + case UFFDIO_DEACTIVATE: + ret =3D userfaultfd_deactivate(ctx, arg); + break; } return ret; } diff --git a/include/linux/mm.h b/include/linux/mm.h index abb4963c1f06..fc2841264d56 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3036,6 +3036,8 @@ int get_cmdline(struct task_struct *task, char *buffe= r, int buflen); #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */ #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ MM_CP_UFFD_WP_RESOLVE) +/* Whether this change is for uffd deactivation */ +#define MM_CP_UFFD_DEACTIVATE (1UL << 4) =20 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long add= r, pte_t pte); diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index d1d4ed4a08b0..c94b5c5b5f24 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -130,6 +130,8 @@ extern int mwriteprotect_range(struct userfaultfd_ctx *= ctx, unsigned long start, unsigned long len, bool enable_wp); extern long uffd_wp_range(struct vm_area_struct *vma, unsigned long start, unsigned long len, bool enable_wp); +extern int mdeactivate_range(struct userfaultfd_ctx *ctx, unsigned long st= art, + unsigned long len); =20 /* move_pages */ void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b298cba853ab..2ad736ff007c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2563,6 +2563,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm= _area_struct *vma, spinlock_t *ptl; pmd_t oldpmd, entry; bool prot_numa =3D cp_flags & MM_CP_PROT_NUMA; + bool uffd_deactivate =3D cp_flags & MM_CP_UFFD_DEACTIVATE; bool uffd_wp =3D cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve =3D cp_flags & MM_CP_UFFD_WP_RESOLVE; int ret =3D 1; @@ -2582,8 +2583,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct v= m_area_struct *vma, goto unlock; } =20 - if (prot_numa) { + /* Already 
protnone =E2=80=94 nothing to do for either NUMA or uffd */ + if ((prot_numa || uffd_deactivate) && pmd_protnone(*pmd)) + goto unlock; =20 + if (prot_numa) { /* * Avoid trapping faults against the zero page. The read-only * data is likely to be read-cached on the local CPU and @@ -2592,9 +2596,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm= _area_struct *vma, if (is_huge_zero_pmd(*pmd)) goto unlock; =20 - if (pmd_protnone(*pmd)) - goto unlock; - if (!folio_can_map_prot_numa(pmd_folio(*pmd), vma, vma_is_single_threaded_private(vma))) goto unlock; diff --git a/mm/mprotect.c b/mm/mprotect.c index c0571445bef7..7c612a680014 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -220,6 +220,7 @@ static long change_pte_range(struct mmu_gather *tlb, long pages =3D 0; bool is_private_single_threaded; bool prot_numa =3D cp_flags & MM_CP_PROT_NUMA; + bool uffd_deactivate =3D cp_flags & MM_CP_UFFD_DEACTIVATE; bool uffd_wp =3D cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve =3D cp_flags & MM_CP_UFFD_WP_RESOLVE; int nr_ptes; @@ -245,7 +246,8 @@ static long change_pte_range(struct mmu_gather *tlb, pte_t ptent; =20 /* Already in the desired state. */ - if (prot_numa && pte_protnone(oldpte)) + if ((prot_numa || uffd_deactivate) && + pte_protnone(oldpte)) continue; =20 page =3D vm_normal_page(vma, addr, oldpte); @@ -255,6 +257,8 @@ static long change_pte_range(struct mmu_gather *tlb, /* * Avoid trapping faults against the zero or KSM * pages. See similar comment in change_huge_pmd. + * Skip this filter for uffd deactivation which + * must set protnone regardless of NUMA placement. 
*/ if (prot_numa && !folio_can_map_prot_numa(folio, vma, @@ -651,6 +655,9 @@ long change_protection(struct mmu_gather *tlb, WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA); #endif =20 + if (cp_flags & MM_CP_UFFD_DEACTIVATE) + newprot =3D PAGE_NONE; + if (is_vm_hugetlb_page(vma)) pages =3D hugetlb_change_protection(vma, start, end, newprot, cp_flags); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index dba1ea26fdfe..3373b11b9d83 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -775,7 +775,7 @@ static __always_inline ssize_t mfill_atomic(struct user= faultfd_ctx *ctx, =20 if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) goto out_unlock; - if (!vma_is_shmem(dst_vma) && + if (!vma_is_shmem(dst_vma) && !vma_is_anonymous(dst_vma) && uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) goto out_unlock; =20 @@ -797,13 +797,16 @@ static __always_inline ssize_t mfill_atomic(struct us= erfaultfd_ctx *ctx, break; } dst_pmdval =3D pmdp_get_lockless(dst_pmd); + if (unlikely(!pmd_present(dst_pmdval))) { + err =3D -EEXIST; + break; + } /* * If the dst_pmd is THP don't override it and just be strict. * (This includes the case where the PMD used to be THP and * changed back to none after __pte_alloc().) 
*/ - if (unlikely(!pmd_present(dst_pmdval) || - pmd_trans_huge(dst_pmdval))) { + if (unlikely(pmd_trans_huge(dst_pmdval))) { err =3D -EEXIST; break; } @@ -996,6 +999,65 @@ int mwriteprotect_range(struct userfaultfd_ctx *ctx, u= nsigned long start, return err; } =20 +int mdeactivate_range(struct userfaultfd_ctx *ctx, unsigned long start, + unsigned long len) +{ + struct mm_struct *dst_mm =3D ctx->mm; + unsigned long end =3D start + len; + struct vm_area_struct *dst_vma; + long err; + VMA_ITERATOR(vmi, dst_mm, start); + + VM_WARN_ON_ONCE(start & ~PAGE_MASK); + VM_WARN_ON_ONCE(len & ~PAGE_MASK); + VM_WARN_ON_ONCE(start + len <=3D start); + + guard(mmap_read_lock)(dst_mm); + guard(rwsem_read)(&ctx->map_changing_lock); + + if (atomic_read(&ctx->mmap_changing)) + return -EAGAIN; + + err =3D -ENOENT; + for_each_vma_range(vmi, dst_vma, end) { + unsigned long vma_start =3D max(dst_vma->vm_start, start); + unsigned long vma_end =3D min(dst_vma->vm_end, end); + + if (!userfaultfd_minor(dst_vma)) { + err =3D -ENOENT; + break; + } + + /* + * Private hugetlb has no page cache to fall back on =E2=80=94 + * zapping PTEs would destroy page content. 
+		 */
+		if (is_vm_hugetlb_page(dst_vma) &&
+		    !(dst_vma->vm_flags & VM_SHARED)) {
+			err = -EINVAL;
+			break;
+		}
+
+		if (vma_is_anonymous(dst_vma)) {
+			/* Anonymous: set protnone, pages stay resident */
+			struct mmu_gather tlb;
+
+			tlb_gather_mmu(&tlb, dst_mm);
+			err = change_protection(&tlb, dst_vma, vma_start,
+						vma_end,
+						MM_CP_UFFD_DEACTIVATE);
+			tlb_finish_mmu(&tlb);
+			if (err < 0)
+				break;
+		} else {
+			/* Shared shmem/hugetlb: zap PTEs, pages stay in page cache */
+			zap_page_range_single(dst_vma, vma_start,
+					      vma_end - vma_start, NULL);
+		}
+		err = 0;
+	}
+	return err;
+}
 
 void double_pt_lock(spinlock_t *ptl1,
 		    spinlock_t *ptl2)
@@ -1988,6 +2050,16 @@ struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi,
 	if (userfaultfd_wp(vma))
 		uffd_wp_range(vma, start, end - start, false);
 
+	/* Restore protnone PTEs to normal permissions */
+	if (userfaultfd_minor(vma) && vma_is_anonymous(vma)) {
+		struct mmu_gather tlb;
+
+		tlb_gather_mmu(&tlb, vma->vm_mm);
+		change_protection(&tlb, vma, start, end,
+				  MM_CP_TRY_CHANGE_WRITABLE);
+		tlb_finish_mmu(&tlb);
+	}
+
 	ret = vma_modify_flags_uffd(vmi, prev, vma, start, end,
 				    vma->vm_flags & ~__VM_UFFD_FLAGS,
 				    NULL_VM_UFFD_CTX, give_up_on_oom);
-- 
2.51.2

ucfuhhhuthhsvghmrghuucdlofgvthgrmddfuceokhgrsheskhgvrhhnvghlrdhorhhgqe enucggtffrrghtthgvrhhnpefhvdefvdevjeevhefhhfevudefudejfeduvdekheeludfh iefhhedujeffffeigfenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrih hlfhhrohhmpehkihhrihhllhdomhgvshhmthhprghuthhhphgvrhhsohhnrghlihhthidq udeiudduiedvieehhedqvdekgeeggeejvdekqdhkrghspeepkhgvrhhnvghlrdhorhhgse hshhhuthgvmhhovhdrnhgrmhgvpdhnsggprhgtphhtthhopeduledpmhhouggvpehsmhht phhouhhtpdhrtghpthhtoheprghkphhmsehlihhnuhigqdhfohhunhgurghtihhonhdroh hrghdprhgtphhtthhopehpvghtvghrgiesrhgvughhrghtrdgtohhmpdhrtghpthhtohep uggrvhhiugeskhgvrhhnvghlrdhorhhgpdhrtghpthhtoheplhhjsheskhgvrhhnvghlrd horhhgpdhrtghpthhtoheprhhpphhtsehkvghrnhgvlhdrohhrghdprhgtphhtthhopehs uhhrvghnsgesghhoohhglhgvrdgtohhmpdhrtghpthhtohepvhgsrggskhgrsehkvghrnh gvlhdrohhrghdprhgtphhtthhopehlihgrmhdrhhhofihlvghtthesohhrrggtlhgvrdgt ohhmpdhrtghpthhtohepiihihiesnhhvihguihgrrdgtohhm X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 14 Apr 2026 10:24:03 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . 
Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 04/12] userfaultfd: UFFDIO_CONTINUE for anonymous memory Date: Tue, 14 Apr 2026 15:23:38 +0100 Message-ID: <20260414142354.1465950-5-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Allow UFFDIO_CONTINUE on anonymous VMAs with VM_UFFD_MINOR. For shmem, CONTINUE installs a PTE from page cache. For anonymous memory, the page is already mapped via a protnone PTE =E2=80=94 CONTINUE restores the original VMA permissions. PTE level: mfill_atomic_pte_continue_anon() walks to the PTE, verifies protnone, restores permissions. Rename the shmem path to mfill_atomic_pte_continue_shmem() for clarity. PMD/THP level: mfill_atomic_pmd_continue_anon() restores protnone PMD permissions in place without splitting. Handles PMD races with EAGAIN retry in the mfill_atomic loop. Add protnone PTE/PMD checks in userfaultfd_must_wait() so sync minor faults properly block until resolved. 
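For readers following the userfaultfd_must_wait() change, here is a minimal userspace sketch of the wait decision as it appears in the hunks of this patch. It is not kernel code: struct model_entry, the MODEL_* reason bits and model_must_wait() are hypothetical stand-ins, and it models only the write-protect and protnone checks, not the earlier none/marker handling in the real function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VM_UFFD_WP / VM_UFFD_MINOR; not the kernel values. */
enum model_reason {
	MODEL_UFFD_WP    = 1 << 0,
	MODEL_UFFD_MINOR = 1 << 1,
};

/* Hypothetical flattened view of one present page-table entry. */
struct model_entry {
	bool writable;  /* models pte_write() / pmd_write() */
	bool protnone;  /* models pte_protnone() / pmd_protnone() */
};

/*
 * Models only the checks visible in the userfaultfd_must_wait() hunks:
 * returns true when the faulting thread should keep waiting.
 */
static bool model_must_wait(struct model_entry e, unsigned int reason)
{
	/* Added by this patch: a deactivated (protnone) entry is unresolved. */
	if (e.protnone && (reason & MODEL_UFFD_MINOR))
		return true;
	/* Pre-existing check: a write-protect fault waits while read-only. */
	if (!e.writable && (reason & MODEL_UFFD_WP))
		return true;
	return false;
}

/* Small usage example exercising the two reasons independently. */
static void model_must_wait_selfcheck(void)
{
	struct model_entry deactivated = { .writable = false, .protnone = true };
	struct model_entry restored = { .writable = true, .protnone = false };

	assert(model_must_wait(deactivated, MODEL_UFFD_MINOR));
	assert(!model_must_wait(restored, MODEL_UFFD_MINOR));
	assert(!model_must_wait(restored, MODEL_UFFD_WP));
}
```

In this model a thread sleeping on a sync minor fault only stops waiting once the entry is no longer protnone, which is the state UFFDIO_CONTINUE produces on anonymous memory.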
Signed-off-by: Kiryl Shutsemau (Meta)
Assisted-by: Claude:claude-opus-4-6
---
 fs/userfaultfd.c |  9 +++++-
 mm/userfaultfd.c | 82 ++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b317c9854b86..43064238fd8d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -340,8 +340,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	if (!pmd_present(_pmd))
 		return false;
 
-	if (pmd_trans_huge(_pmd))
+	if (pmd_trans_huge(_pmd)) {
+		if (pmd_protnone(_pmd) && (reason & VM_UFFD_MINOR))
+			return true;
 		return !pmd_write(_pmd) && (reason & VM_UFFD_WP);
+	}
 
 	pte = pte_offset_map(pmd, address);
 	if (!pte)
@@ -366,6 +369,9 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 */
 	if (!pte_write(ptent) && (reason & VM_UFFD_WP))
 		goto out;
+	/* PTE is still protnone (deactivated), wait for userspace to resolve. */
+	if (pte_protnone(ptent) && (reason & VM_UFFD_MINOR))
+		goto out;
 
 	ret = false;
 out:
@@ -1820,6 +1826,7 @@ static int userfaultfd_deactivate(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+
 static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 {
 	__s64 ret;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3373b11b9d83..4c52fa5d1608 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -380,8 +380,61 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 	return ret;
 }
 
-/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
-static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
+static int mfill_atomic_pte_continue_anon(pmd_t *dst_pmd,
+					   struct vm_area_struct *dst_vma,
+					   unsigned long dst_addr,
+					   uffd_flags_t flags)
+{
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
+	int ret = -EFAULT;
+
+	ptep = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl);
+	if (!ptep)
+		return ret;
+
+	pte = ptep_get(ptep);
+	if (!pte_protnone(pte))
+		goto out_unlock;
+
+	pte = pte_modify(pte, dst_vma->vm_page_prot);
+	pte = pte_mkyoung(pte);
+	if (flags & MFILL_ATOMIC_WP)
+		pte = pte_wrprotect(pte);
+	set_pte_at(dst_vma->vm_mm, dst_addr, ptep, pte);
+	update_mmu_cache(dst_vma, dst_addr, ptep);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+	return ret;
+}
+
+static int mfill_atomic_pmd_continue_anon(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
+					   unsigned long addr,
+					   pmd_t *pmd, pmd_t orig_pmd,
+					   uffd_flags_t flags)
+{
+	spinlock_t *ptl;
+	pmd_t entry;
+
+	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_same(pmdp_get(pmd), orig_pmd))) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	entry = pmd_modify(orig_pmd, vma->vm_page_prot);
+	entry = pmd_mkyoung(entry);
+	if (flags & MFILL_ATOMIC_WP)
+		entry = pmd_wrprotect(entry);
+	set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, entry);
+	update_mmu_cache_pmd(vma, addr, pmd);
+	spin_unlock(ptl);
+	return 0;
+}
+
+static int mfill_atomic_pte_continue_shmem(pmd_t *dst_pmd,
 				     struct vm_area_struct *dst_vma,
 				     unsigned long dst_addr,
 				     uffd_flags_t flags)
@@ -667,7 +720,10 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	ssize_t err;
 
 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
-		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
+		if (vma_is_anonymous(dst_vma))
+			return mfill_atomic_pte_continue_anon(dst_pmd, dst_vma,
+							      dst_addr, flags);
+		return mfill_atomic_pte_continue_shmem(dst_pmd, dst_vma,
 						 dst_addr, flags);
 	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
 		return mfill_atomic_pte_poison(dst_pmd, dst_vma,
@@ -802,11 +858,25 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 			break;
 		}
 		/*
-		 * If the dst_pmd is THP don't override it and just be strict.
-		 * (This includes the case where the PMD used to be THP and
-		 * changed back to none after __pte_alloc().)
+		 * THP PMD: for anon CONTINUE, restore protnone PMD
+		 * permissions in place. For other operations, reject.
 		 */
 		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) &&
+			    vma_is_anonymous(dst_vma) &&
+			    pmd_protnone(dst_pmdval)) {
+				err = mfill_atomic_pmd_continue_anon(
+					dst_mm, dst_vma, dst_addr,
+					dst_pmd, dst_pmdval, flags);
+				if (err == -EAGAIN)
+					continue; /* PMD changed, re-read it */
+				if (err)
+					break;
+				dst_addr += HPAGE_PMD_SIZE;
+				src_addr += HPAGE_PMD_SIZE;
+				copied += HPAGE_PMD_SIZE;
+				continue;
+			}
 			err = -EEXIST;
 			break;
 		}
-- 
2.51.2

From: "Kiryl Shutsemau (Meta)"
To: Andrew Morton
Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
 Suren Baghdasaryan, Vlastimil Babka, "Liam R. Howlett", Zi Yan,
 Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
 kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)"
Subject: [RFC, PATCH 05/12] mm: intercept protnone faults on VM_UFFD_MINOR anonymous VMAs
Date: Tue, 14 Apr 2026 15:23:39 +0100
Message-ID: <20260414142354.1465950-6-kas@kernel.org>
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>
References: <20260414142354.1465950-1-kas@kernel.org>

When a protnone PTE/PMD fault occurs on a VMA with VM_UFFD_MINOR,
dispatch to the userfaultfd minor fault path instead of NUMA balancing.
Async: restore permissions inline. Sync: deliver via handle_userfault().

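The dispatch order described above can be sketched as a small userspace model. This is an illustration under stated assumptions, not the kernel implementation: struct model_fault, enum model_handler and model_dispatch() are hypothetical names, and each boolean simply stands in for the kernel predicate named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the handlers named in the patch. */
enum model_handler {
	MODEL_OTHER,       /* not a protnone fault on an accessible VMA */
	MODEL_NUMA_HINT,   /* do_numa_page() / do_huge_pmd_numa_page() */
	MODEL_UFFD_ASYNC,  /* permissions restored inline */
	MODEL_UFFD_SYNC,   /* handle_userfault(vmf, VM_UFFD_MINOR) */
};

/* Hypothetical flattened view of the state the fault path inspects. */
struct model_fault {
	bool protnone;     /* pte_protnone() or pmd_protnone() */
	bool accessible;   /* vma_is_accessible() */
	bool uffd_minor;   /* userfaultfd_minor() */
	bool minor_async;  /* userfaultfd_minor_async() */
};

/* Picks the handler in the same order as the patched fault path. */
static enum model_handler model_dispatch(struct model_fault f)
{
	if (!f.protnone || !f.accessible)
		return MODEL_OTHER;
	if (!f.uffd_minor)
		return MODEL_NUMA_HINT;
	return f.minor_async ? MODEL_UFFD_ASYNC : MODEL_UFFD_SYNC;
}

/* Small usage example covering the three protnone outcomes. */
static void model_dispatch_selfcheck(void)
{
	struct model_fault f = { .protnone = true, .accessible = true };

	assert(model_dispatch(f) == MODEL_NUMA_HINT);
	f.uffd_minor = true;
	assert(model_dispatch(f) == MODEL_UFFD_SYNC);
	f.minor_async = true;
	assert(model_dispatch(f) == MODEL_UFFD_ASYNC);
}
```

The point of the ordering is that the userfaultfd check sits in front of the NUMA hinting handler, so a VM_UFFD_MINOR VMA never reaches do_numa_page() for a protnone entry.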
Feed NUMA locality stats from the fault path via task_numa_fault() so
the scheduler retains placement data even though NUMA scanning is
skipped on these VMAs.

Signed-off-by: Kiryl Shutsemau (Meta)
Assisted-by: Claude:claude-opus-4-6
---
 include/linux/huge_mm.h |  6 +++++
 mm/huge_memory.c        | 24 +++++++++++++++++++
 mm/memory.c             | 51 +++++++++++++++++++++++++++++++++++++++--
 3 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfde..a900bb530998 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -519,6 +519,7 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 }
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_uffd_minor(struct vm_fault *vmf);
 
 vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
 
@@ -707,6 +708,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline vm_fault_t do_huge_pmd_uffd_minor(struct vm_fault *vmf)
+{
+	return 0;
+}
+
 static inline vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf)
 {
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ad736ff007c..264c646a8573 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2181,6 +2181,30 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
 	return pmd_dirty(pmd);
 }
 
+vm_fault_t do_huge_pmd_uffd_minor(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+
+	if (userfaultfd_minor_async(vma)) {
+		pmd_t pmd;
+
+		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+		if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) {
+			spin_unlock(vmf->ptl);
+			return 0;
+		}
+		pmd = pmd_modify(vmf->orig_pmd, vma->vm_page_prot);
+		pmd = pmd_mkyoung(pmd);
+		set_pmd_at(vma->vm_mm, vmf->address & HPAGE_PMD_MASK,
+			   vmf->pmd, pmd);
+		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+		spin_unlock(vmf->ptl);
+		return 0;
+	}
+
+	return handle_userfault(vmf, VM_UFFD_MINOR);
+}
+
 /* NUMA hinting page fault entry point for trans huge pmds */
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
diff --git a/mm/memory.c b/mm/memory.c
index c65e82c86fed..f068ff4027e8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6045,6 +6045,47 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
 	}
 }
 
+static void uffd_minor_feed_numa_fault(struct vm_fault *vmf)
+{
+	struct folio *folio;
+
+	folio = vm_normal_folio(vmf->vma, vmf->address, vmf->orig_pte);
+	if (folio) {
+		int nid = folio_nid(folio);
+		int flags = 0;
+
+		if (nid == numa_node_id())
+			flags |= TNF_FAULT_LOCAL;
+		task_numa_fault(folio_last_cpupid(folio), nid, 1, flags);
+	}
+}
+
+static vm_fault_t do_uffd_minor_anon(struct vm_fault *vmf)
+{
+	/* Feed NUMA stats even though we skip NUMA scanning on this VMA */
+	uffd_minor_feed_numa_fault(vmf);
+
+	if (userfaultfd_minor_async(vmf->vma)) {
+		pte_t pte;
+
+		spin_lock(vmf->ptl);
+		if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			return 0;
+		}
+		pte = pte_modify(vmf->orig_pte, vmf->vma->vm_page_prot);
+		pte = pte_mkyoung(pte);
+		set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, pte);
+		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+	/* Sync mode: unmap PTE and deliver to userfaultfd handler */
+	pte_unmap(vmf->pte);
+	return handle_userfault(vmf, VM_UFFD_MINOR);
+}
+
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -6319,8 +6360,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (!pte_present(vmf->orig_pte))
 		return do_swap_page(vmf);
 
-	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+		if (userfaultfd_minor(vmf->vma))
+			return do_uffd_minor_anon(vmf);
 		return do_numa_page(vmf);
+	}
 
 	spin_lock(vmf->ptl);
 	entry = vmf->orig_pte;
@@ -6434,8 +6478,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				return 0;
 		}
 		if (pmd_trans_huge(vmf.orig_pmd)) {
-			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
+				if (userfaultfd_minor(vma))
+					return do_huge_pmd_uffd_minor(&vmf);
 				return do_huge_pmd_numa_page(&vmf);
+			}
 
 			if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
 			    !pmd_write(vmf.orig_pmd)) {
-- 
2.51.2

From: "Kiryl Shutsemau (Meta)"
To: Andrew Morton
Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
 Suren Baghdasaryan, Vlastimil Babka, "Liam R. Howlett", Zi Yan,
 Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
 kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)"
Subject: [RFC, PATCH 06/12] userfaultfd: auto-resolve shmem and hugetlbfs minor faults in async mode
Date: Tue, 14 Apr 2026 15:23:40 +0100
Message-ID: <20260414142354.1465950-7-kas@kernel.org>
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>
References: <20260414142354.1465950-1-kas@kernel.org>

When UFFD_FEATURE_MINOR_ASYNC is enabled, skip handle_userfault() in
the shmem and hugetlbfs minor fault paths. The normal fault path
installs the PTE from page cache directly.

Signed-off-by: Kiryl Shutsemau (Meta)
Assisted-by: Claude:claude-opus-4-6
---
 mm/hugetlb.c | 3 ++-
 mm/shmem.c   | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 327eaa4074d3..c10d2432768c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5847,7 +5847,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 		}
 
 		/* Check for page in userfault range. */
-		if (userfaultfd_minor(vma)) {
+		if (userfaultfd_minor(vma) &&
+		    !userfaultfd_minor_async(vma)) {
 			folio_unlock(folio);
 			folio_put(folio);
 			/* See comment in userfaultfd_missing() block above */
diff --git a/mm/shmem.c b/mm/shmem.c
index b40f3cd48961..ce47e77fc090 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2489,7 +2489,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	fault_mm = vma ? vma->vm_mm : NULL;
 
 	folio = filemap_get_entry(inode->i_mapping, index);
-	if (folio && vma && userfaultfd_minor(vma)) {
+	if (folio && vma && userfaultfd_minor(vma) &&
+	    !userfaultfd_minor_async(vma)) {
 		if (!xa_is_value(folio))
 			folio_put(folio);
 		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
-- 
2.51.2

From: "Kiryl Shutsemau (Meta)"
To: Andrew Morton
Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
 Suren Baghdasaryan, Vlastimil Babka, "Liam R. Howlett", Zi Yan,
 Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
 kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)"
Subject: [RFC, PATCH 07/12] sched/numa: skip scanning anonymous VM_UFFD_MINOR VMAs
Date: Tue, 14 Apr 2026 15:23:41 +0100
Message-ID: <20260414142354.1465950-8-kas@kernel.org>
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>
References: <20260414142354.1465950-1-kas@kernel.org>

Skip NUMA scanning of anonymous VMAs registered for userfaultfd minor
faults. Both NUMA balancing and userfaultfd use protnone PTEs on
anonymous memory, so scanning these VMAs would conflict with uffd.

Shmem is unaffected: UFFDIO_DEACTIVATE zaps its PTEs instead of marking
them protnone. NUMA stats are fed from the uffd fault path instead.

Add the NUMAB_SKIP_UFFD_MINOR trace reason.

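As a quick illustration of which VMAs the new skip reason covers, the predicate can be modelled in userspace as follows. The struct and function names are hypothetical; the two booleans stand in for vma_is_anonymous() and userfaultfd_minor().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened view of the two VMA properties the check reads. */
struct model_vma {
	bool anonymous;   /* vma_is_anonymous() */
	bool uffd_minor;  /* userfaultfd_minor() */
};

/* True when the NUMA scanner would skip the VMA for the new reason. */
static bool model_skip_for_uffd_minor(struct model_vma v)
{
	return v.anonymous && v.uffd_minor;
}

/* Small usage example: only anonymous + minor-registered VMAs are skipped. */
static void model_skip_selfcheck(void)
{
	struct model_vma anon_minor = { .anonymous = true, .uffd_minor = true };
	struct model_vma shmem_minor = { .anonymous = false, .uffd_minor = true };

	assert(model_skip_for_uffd_minor(anon_minor));
	assert(!model_skip_for_uffd_minor(shmem_minor));
}
```

A shmem VMA registered for minor faults is still scanned in this model, matching the statement that shmem is unaffected.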
Signed-off-by: Kiryl Shutsemau (Meta)
Assisted-by: Claude:claude-opus-4-6
---
 include/linux/sched/numa_balancing.h |  1 +
 include/trace/events/sched.h         |  3 ++-
 kernel/sched/fair.c                  | 13 +++++++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 52b22c5c396d..5668074a4271 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -23,6 +23,7 @@ enum numa_vmaskip_reason {
 	NUMAB_SKIP_PID_INACTIVE,
 	NUMAB_SKIP_IGNORE_PID,
 	NUMAB_SKIP_SEQ_COMPLETED,
+	NUMAB_SKIP_UFFD_MINOR,
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..02e79b56db28 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -728,7 +728,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
 	EM( NUMAB_SKIP_SCAN_DELAY,		"scan_delay" )		\
 	EM( NUMAB_SKIP_PID_INACTIVE,		"pid_inactive" )	\
 	EM( NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )	\
-	EMe(NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )
+	EM( NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )	\
+	EMe(NUMAB_SKIP_UFFD_MINOR,		"uffd_minor" )
 
 /* Redefine for export. */
 #undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be7..57beb04562cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -3459,6 +3460,18 @@ static void task_numa_work(struct callback_head *work)
 			continue;
 		}
 
+		/*
+		 * Skip anonymous VMAs registered for userfaultfd minor faults.
+		 * Both NUMA balancing and uffd use protnone PTEs on anonymous
+		 * memory; let uffd own the hinting. For shmem, UFFDIO_DEACTIVATE
+		 * zaps PTEs entirely (no protnone conflict), so NUMA scanning
+		 * can proceed normally.
+ */ + if (vma_is_anonymous(vma) && userfaultfd_minor(vma)) { + trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UFFD_MINOR); + continue; + } + /* * Shared library pages mapped by multiple processes are not * migrated as it is expected they are cache replicated. Avoid --=20 2.51.2 From nobody Mon Jun 15 23:15:09 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6657D3E6DF9; Tue, 14 Apr 2026 14:24:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176652; cv=none; b=XPTd2nZgbkPxxVXiRaY3XAMa43da/yQWcwU/jP+41A7FGsLVWAt9be/C9rq8HBpE5lb0uL7sPqVYL6qtP9qH6HmG0wUK49JblqV67T5dzFjlc2LqtzuWrVUr6CgAytoL32g/fZvY5do56h1cjqWvgMsV4neT9OcHjuBC/iuIX0Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176652; c=relaxed/simple; bh=bdNn47+Nx91eslEieZ286KWtw/tnSdy4Rxa/tQlVFSk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Oe4cdFDKxc1DvX3rvy/3JTdMj/YT7EhgRVyWZ+ksgZ89fRBuTWQkZK5m2nbNv1OGn4Q3xBDjf0tmQXJzoAwVpZPou/xv4yOWt+24Goc4D9T+pPNKQmM7GKxUaMJf/NMcWhy4ffTeGy2kxjKahQirS+jCiyGH2kykLSHeTuoSuf8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=smzIiwzk; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="smzIiwzk" Received: by smtp.kernel.org (Postfix) with ESMTPSA id AE429C4AF0B; Tue, 14 Apr 2026 14:24:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776176652; bh=bdNn47+Nx91eslEieZ286KWtw/tnSdy4Rxa/tQlVFSk=; 
h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=smzIiwzkviWVR7lnj73JVWnCQ+zBpvJCTqLUI3rfAoHi80IfNutFZUb982WabLJdp RV1EHtoV0XM8U5Lvh6R6BERnAHbxeaIWOyOTq1r1cgtn4s17TLEIXC4Czy1xUKl5/N zmzLM2Ig+HUpqpMApLtLHsYyinprvvb5thjin3NmUf4IqX94OWT09zuu8KXP9bmTPB 0h3kNPAZ2HcYJCbuRfx63vGGdEnfLqOOYWavV5a2qtXHfA7mYS1UxHoDAbRR9ZLYLw D0I5zhRWXTgJDj0qq2R7C5PShhH88aEmokXZzPaoA1H0sftf5qJ2BSXT5I4nA8ZBZJ LNaGsT5Zhu3Ng== Received: from phl-compute-02.internal (phl-compute-02.internal [10.202.2.42]) by mailfauth.phl.internal (Postfix) with ESMTP id DF7A6F40068; Tue, 14 Apr 2026 10:24:10 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-02.internal (MEProxy); Tue, 14 Apr 2026 10:24:10 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdegudefkecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpefhvfevufffkffojghfggfgsedtkeertdertddtnecuhfhrohhmpedfmfhirhihlhcu ufhhuhhtshgvmhgruhculdfovghtrgdmfdcuoehkrghssehkvghrnhgvlhdrohhrgheqne cuggftrfgrthhtvghrnhephfdujeefvdegkefffedvkeehkeekueevfedtleehgeetlefg feevveeukefhtdetnecuvehluhhsthgvrhfuihiivgepudenucfrrghrrghmpehmrghilh hfrhhomhepkhhirhhilhhlodhmvghsmhhtphgruhhthhhpvghrshhonhgrlhhithihqddu ieduudeivdeiheehqddvkeeggeegjedvkedqkhgrsheppehkvghrnhgvlhdrohhrghessh hhuhhtvghmohhvrdhnrghmvgdpnhgspghrtghpthhtohepudelpdhmohguvgepshhmthhp ohhuthdprhgtphhtthhopegrkhhpmheslhhinhhugidqfhhouhhnuggrthhiohhnrdhorh hgpdhrtghpthhtohepphgvthgvrhigsehrvgguhhgrthdrtghomhdprhgtphhtthhopegu rghvihgusehkvghrnhgvlhdrohhrghdprhgtphhtthhopehljhhssehkvghrnhgvlhdroh hrghdprhgtphhtthhopehrphhptheskhgvrhhnvghlrdhorhhgpdhrtghpthhtohepshhu rhgvnhgssehgohhoghhlvgdrtghomhdprhgtphhtthhopehvsggrsghkrgeskhgvrhhnvg hlrdhorhhgpdhrtghpthhtoheplhhirghmrdhhohiflhgvthhtsehorhgrtghlvgdrtgho mhdprhgtphhtthhopeiiihihsehnvhhiughirgdrtghomh X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by 
mail.messagingengine.com (Postfix) with ESMTPA; Tue, 14 Apr 2026 10:24:10 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 08/12] userfaultfd: enable UFFD_FEATURE_MINOR_ANON Date: Tue, 14 Apr 2026 15:23:42 +0100 Message-ID: <20260414142354.1465950-9-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add UFFD_FEATURE_MINOR_ANON, UFFD_FEATURE_MINOR_ASYNC to UFFD_API_FEATURES and UFFDIO_DEACTIVATE to UFFD_API_RANGE_IOCTLS. The feature is now available to userspace. 
Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 --- include/uapi/linux/userfaultfd.h | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaul= tfd.h index 336d07e1b6de..775825da2596 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -42,7 +42,9 @@ UFFD_FEATURE_WP_UNPOPULATED | \ UFFD_FEATURE_POISON | \ UFFD_FEATURE_WP_ASYNC | \ - UFFD_FEATURE_MOVE) + UFFD_FEATURE_MOVE | \ + UFFD_FEATURE_MINOR_ANON | \ + UFFD_FEATURE_MINOR_ASYNC) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -54,13 +56,15 @@ (__u64)1 << _UFFDIO_MOVE | \ (__u64)1 << _UFFDIO_WRITEPROTECT | \ (__u64)1 << _UFFDIO_CONTINUE | \ - (__u64)1 << _UFFDIO_POISON) + (__u64)1 << _UFFDIO_POISON | \ + (__u64)1 << _UFFDIO_DEACTIVATE) #define UFFD_API_RANGE_IOCTLS_BASIC \ ((__u64)1 << _UFFDIO_WAKE | \ (__u64)1 << _UFFDIO_COPY | \ (__u64)1 << _UFFDIO_WRITEPROTECT | \ (__u64)1 << _UFFDIO_CONTINUE | \ - (__u64)1 << _UFFDIO_POISON) + (__u64)1 << _UFFDIO_POISON | \ + (__u64)1 << _UFFDIO_DEACTIVATE) =20 /* * Valid ioctl command number range with this API is from 0x00 to --=20 2.51.2 From nobody Mon Jun 15 23:15:09 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4DB8C3EBF08; Tue, 14 Apr 2026 14:24:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176654; cv=none; b=qETygD8hU8JVH6GbZalnww+wSVN4xH+ktorpBVKJxrWSW94L7mBLHKl0vTgqwSLLf38JeOs31D6yq4ucqO1Zw2LSObT8T/7b7TXzOstNsOML/G139xA/y27WTXXvm1Xkp5Dsjtg/d28BnRM+hUi6bGMIEA3WOfDrN01iXiF7heA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; 
s=arc-20240116; t=1776176654; c=relaxed/simple; bh=oOUfU0VXNyHpbUZg2Wz9Ad/eUIXrqX/a1NFbEQbuFDs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=OwlddiS9kqLK4XQKxLFWFE+2R5URqxZDz+w3unrW9Q+THs+ZyT+x4xbt+gQFghrPcy/H3NkroGIN+PLxbvyQKnt4KPBoHOuf/IQtBU9jm5OD28LFKd6wfkDYu8+tZenhyGopTdgZc6zUXVI6jHGMTX/+unUF43Cl3EqtwBFL3Ps= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=pvgJbxdT; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="pvgJbxdT" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 729E4C19425; Tue, 14 Apr 2026 14:24:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776176654; bh=oOUfU0VXNyHpbUZg2Wz9Ad/eUIXrqX/a1NFbEQbuFDs=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=pvgJbxdTr0RDMrC3+VgX2thE+rIGjtvnDb3enwSjLfMXMuVopngsjPo4ZiBxtCxwx zsJhdIncow0aTyL63tGjwnR5YTWcV1gopC6pQgfB8BintsS9xTXQwZOen5K3tPWOyF mW3BKMzJOCASL5UsyxhzU869et0u7Xh80dj0PDVPTJwy9HIWvoKWysNUOn/GmhSkru /GaIL8KoBJQEsJYKXdmHWyFGjZcwtu5KXT9N3dE6k7cwYPPiTprjLXIcFeahZZHkS2 81PF5La9qAT7ifZOQp8xRHIdBe8TaH/AUOoxqBKtK9IOwCeMVidRpsEG7U5LfGz2JJ lWy7Oeqoj9gwQ== Received: from phl-compute-01.internal (phl-compute-01.internal [10.202.2.41]) by mailfauth.phl.internal (Postfix) with ESMTP id 9EB9DF40068; Tue, 14 Apr 2026 10:24:12 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-01.internal (MEProxy); Tue, 14 Apr 2026 10:24:12 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdegudefkecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpefhvfevufffkffojghfggfgsedtkeertdertddtnecuhfhrohhmpedfmfhirhihlhcu 
ufhhuhhtshgvmhgruhculdfovghtrgdmfdcuoehkrghssehkvghrnhgvlhdrohhrgheqne cuggftrfgrthhtvghrnhephfdujeefvdegkefffedvkeehkeekueevfedtleehgeetlefg feevveeukefhtdetnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilh hfrhhomhepkhhirhhilhhlodhmvghsmhhtphgruhhthhhpvghrshhonhgrlhhithihqddu ieduudeivdeiheehqddvkeeggeegjedvkedqkhgrsheppehkvghrnhgvlhdrohhrghessh hhuhhtvghmohhvrdhnrghmvgdpnhgspghrtghpthhtohepudelpdhmohguvgepshhmthhp ohhuthdprhgtphhtthhopegrkhhpmheslhhinhhugidqfhhouhhnuggrthhiohhnrdhorh hgpdhrtghpthhtohepphgvthgvrhigsehrvgguhhgrthdrtghomhdprhgtphhtthhopegu rghvihgusehkvghrnhgvlhdrohhrghdprhgtphhtthhopehljhhssehkvghrnhgvlhdroh hrghdprhgtphhtthhopehrphhptheskhgvrhhnvghlrdhorhhgpdhrtghpthhtohepshhu rhgvnhgssehgohhoghhlvgdrtghomhdprhgtphhtthhopehvsggrsghkrgeskhgvrhhnvg hlrdhorhhgpdhrtghpthhtoheplhhirghmrdhhohiflhgvthhtsehorhgrtghlvgdrtgho mhdprhgtphhtthhopeiiihihsehnvhhiughirgdrtghomh X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 14 Apr 2026 10:24:11 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . 
Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 09/12] mm/pagemap: add PAGE_IS_UFFD_DEACTIVATED to PAGEMAP_SCAN Date: Tue, 14 Apr 2026 15:23:43 +0100 Message-ID: <20260414142354.1465950-10-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Report deactivated anonymous pages in PAGEMAP_SCAN results. Only set on anonymous VMAs (shmem cold =3D !PAGE_IS_PRESENT). Both PTE and PMD (THP) levels handled. Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 --- fs/proc/task_mmu.c | 11 ++++++++++- include/uapi/linux/fs.h | 1 + 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e091931d7ca1..fc42cfd5720a 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -2329,7 +2329,7 @@ static int pagemap_release(struct inode *inode, struc= t file *file) PAGE_IS_FILE | PAGE_IS_PRESENT | \ PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \ PAGE_IS_HUGE | PAGE_IS_SOFT_DIRTY | \ - PAGE_IS_GUARD) + PAGE_IS_GUARD | PAGE_IS_UFFD_DEACTIVATED) #define PM_SCAN_FLAGS (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC) =20 struct pagemap_scan_private { @@ -2354,6 +2354,10 @@ static unsigned long pagemap_page_category(struct pa= gemap_scan_private *p, =20 categories =3D PAGE_IS_PRESENT; =20 + if (pte_protnone(pte) && vma_is_accessible(vma) && + vma_is_anonymous(vma) && userfaultfd_minor(vma)) + categories |=3D PAGE_IS_UFFD_DEACTIVATED; + if (!pte_uffd_wp(pte)) categories |=3D PAGE_IS_WRITTEN; =20 @@ -2422,6 
+2426,11 @@ static unsigned long pagemap_thp_category(struct pag= emap_scan_private *p, struct page *page; =20 categories |=3D PAGE_IS_PRESENT; + + if (pmd_protnone(pmd) && vma_is_accessible(vma) && + vma_is_anonymous(vma) && userfaultfd_minor(vma)) + categories |=3D PAGE_IS_UFFD_DEACTIVATED; + if (!pmd_uffd_wp(pmd)) categories |=3D PAGE_IS_WRITTEN; =20 diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 70b2b661f42c..af5b28901800 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -455,6 +455,7 @@ typedef int __bitwise __kernel_rwf_t; #define PAGE_IS_HUGE (1 << 6) #define PAGE_IS_SOFT_DIRTY (1 << 7) #define PAGE_IS_GUARD (1 << 8) +#define PAGE_IS_UFFD_DEACTIVATED (1 << 9) =20 /* * struct page_region - Page region with flags --=20 2.51.2 From nobody Mon Jun 15 23:15:09 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8DDD63EBF37; Tue, 14 Apr 2026 14:24:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176655; cv=none; b=jh99smQ9GslYtEfB6LveHcvd+gHeockDlK4gbDLdHvGUZm89FGu6dY+fXECgpi4xgl3AUgB/cWzhtOxAFJ5BPFCbT/vE0H3mPBgrNDFC4I3VwMXCSZw9jCBRJaWHEPi0QxVztpl3QRHT5k9NJnEZmcG2D6xS09iBX9xW7AC7Mtc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176655; c=relaxed/simple; bh=wU9XhAWdOkm2epOMZqX4k9UHuR2i8O/RDnn4cDbflLk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=dAsZuw29gd3g4CP8D4VU3ciIYLw74ELWrzfVTVGWRG9QZ0vBL13SH+4uVgV3oWcHKleu6LfJI4OqWSR1Lb7dmG1SSQ/adAea6wYPEOxiXOtJS+snf3T75pogbpmXqBC9ws1s1kdf2YE48W7YseljwKCjvfX/DnAdDIXe8E7dCOY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org 
header.i=@kernel.org header.b=n1DIGAsn; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="n1DIGAsn" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 0F8A7C4AF09; Tue, 14 Apr 2026 14:24:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776176655; bh=wU9XhAWdOkm2epOMZqX4k9UHuR2i8O/RDnn4cDbflLk=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=n1DIGAsniyY8vj4+mAakxpkv9iBsInZPGyST0/WXXjMQGAuCTtS8/cmieVr3Ye5Ul rF6mcAKzF684kHK2IctebHCD8ljK9Tv2VFCyG7oSLj1PAGSJbN/gvuIWuE/aYe8iJP P7fUpR+L9IXRzv16wGuwtf7IFh5r6lQCx6AMmIlgF8Q0Lj8c2hdTI3/q4tfckAqL4i p2K6ICCBx6Rri0H/Uk8AhigoALkVc4ZBFRbIqXB2Tw23ajt2XK3+sGk4QQFD7BXWS4 PPHbcFBvo3JNfaj7Wku9j0LGRhHmnF3e8aH55BCFY4Qg0TGVUfKVBB8wyn3g9c6zu8 i6nJt/in9MD6w== Received: from phl-compute-06.internal (phl-compute-06.internal [10.202.2.46]) by mailfauth.phl.internal (Postfix) with ESMTP id 421D8F4006F; Tue, 14 Apr 2026 10:24:14 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-06.internal (MEProxy); Tue, 14 Apr 2026 10:24:14 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdegudefkecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpefhvfevufffkffojghfggfgsedtkeertdertddtnecuhfhrohhmpedfmfhirhihlhcu ufhhuhhtshgvmhgruhculdfovghtrgdmfdcuoehkrghssehkvghrnhgvlhdrohhrgheqne cuggftrfgrthhtvghrnhephfdujeefvdegkefffedvkeehkeekueevfedtleehgeetlefg feevveeukefhtdetnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilh hfrhhomhepkhhirhhilhhlodhmvghsmhhtphgruhhthhhpvghrshhonhgrlhhithihqddu ieduudeivdeiheehqddvkeeggeegjedvkedqkhgrsheppehkvghrnhgvlhdrohhrghessh hhuhhtvghmohhvrdhnrghmvgdpnhgspghrtghpthhtohepudelpdhmohguvgepshhmthhp ohhuthdprhgtphhtthhopegrkhhpmheslhhinhhugidqfhhouhhnuggrthhiohhnrdhorh 
hgpdhrtghpthhtohepphgvthgvrhigsehrvgguhhgrthdrtghomhdprhgtphhtthhopegu rghvihgusehkvghrnhgvlhdrohhrghdprhgtphhtthhopehljhhssehkvghrnhgvlhdroh hrghdprhgtphhtthhopehrphhptheskhgvrhhnvghlrdhorhhgpdhrtghpthhtohepshhu rhgvnhgssehgohhoghhlvgdrtghomhdprhgtphhtthhopehvsggrsghkrgeskhgvrhhnvg hlrdhorhhgpdhrtghpthhtoheplhhirghmrdhhohiflhgvthhtsehorhgrtghlvgdrtgho mhdprhgtphhtthhopeiiihihsehnvhhiughirgdrtghomh X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 14 Apr 2026 10:24:13 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 10/12] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle Date: Tue, 14 Apr 2026 15:23:44 +0100 Message-ID: <20260414142354.1465950-11-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add UFFDIO_SET_MODE ioctl to toggle UFFD_FEATURE_MINOR_ASYNC at runtime. Takes mmap_write_lock for serialization against all in-flight faults. On sync-to-async transition, wake threads blocked in handle_userfault() so they retry and auto-resolve. Since ctx->features can now be modified concurrently, add userfaultfd_features() helper that wraps READ_ONCE() and convert all ctx->features reads to use it. 
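For illustration only, the argument checks and the mask update described above can be modelled in plain userspace C. This is a sketch under stated assumptions, not the kernel code: DEMO_FEATURE_MINOR_ASYNC is a placeholder bit (the real value of UFFD_FEATURE_MINOR_ASYNC is defined elsewhere in the series), demo_set_mode() is a hypothetical name, and the mmap_write_lock serialization and the wakeup of blocked threads are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Placeholder bit for this sketch only. The real value of
 * UFFD_FEATURE_MINOR_ASYNC is not reproduced here.
 */
#define DEMO_FEATURE_MINOR_ASYNC	(UINT64_C(1) << 0)

/* Only this feature may be toggled at runtime in the model. */
#define DEMO_FEATURE_TOGGLEABLE		DEMO_FEATURE_MINOR_ASYNC

/*
 * Hypothetical model of the UFFDIO_SET_MODE checks:
 *  - a bit set in both enable and disable is rejected,
 *  - any bit outside the toggleable set is rejected,
 *  - otherwise enable bits are set first, then disable bits are cleared.
 * Returns 0 on success or -EINVAL, leaving *features untouched on error.
 */
static int demo_set_mode(uint64_t *features, uint64_t enable,
			 uint64_t disable)
{
	assert(features != NULL);

	if (enable & disable)
		return -EINVAL;

	if ((enable | disable) & ~DEMO_FEATURE_TOGGLEABLE)
		return -EINVAL;

	*features |= enable;
	*features &= ~disable;
	return 0;
}
```

A caller would pass the current feature mask by pointer and check the return value. An empty request (enable and disable both zero) passes both checks and leaves the mask unchanged.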
Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 --- fs/userfaultfd.c | 95 ++++++++++++++++++++++++++++---- include/uapi/linux/userfaultfd.h | 13 +++++ 2 files changed, 96 insertions(+), 12 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 43064238fd8d..0edb33599491 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -79,24 +79,33 @@ struct userfaultfd_wake_range { /* internal indication that UFFD_API ioctl was successfully executed */ #define UFFD_FEATURE_INITIALIZED (1u << 31) =20 +/* + * Read ctx->features with READ_ONCE() since UFFDIO_SET_MODE can + * modify it concurrently. + */ +static unsigned int userfaultfd_features(struct userfaultfd_ctx *ctx) +{ + return READ_ONCE(ctx->features); +} + static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx) { - return ctx->features & UFFD_FEATURE_INITIALIZED; + return userfaultfd_features(ctx) & UFFD_FEATURE_INITIALIZED; } =20 static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx) { - return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC); + return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_WP_ASYNC); } =20 static bool userfaultfd_minor_anon_ctx(struct userfaultfd_ctx *ctx) { - return ctx && (ctx->features & UFFD_FEATURE_MINOR_ANON); + return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_MINOR_ANON); } =20 static bool userfaultfd_minor_async_ctx(struct userfaultfd_ctx *ctx) { - return ctx && (ctx->features & UFFD_FEATURE_MINOR_ASYNC); + return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_MINOR_ASYNC); } =20 static unsigned int userfaultfd_ctx_flags(struct userfaultfd_ctx *ctx) @@ -122,7 +131,7 @@ bool userfaultfd_wp_unpopulated(struct vm_area_struct *= vma) if (!ctx) return false; =20 - return ctx->features & UFFD_FEATURE_WP_UNPOPULATED; + return userfaultfd_features(ctx) & UFFD_FEATURE_WP_UNPOPULATED; } =20 static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode, @@ -435,7 +444,7 @@ vm_fault_t handle_userfault(struct 
vm_fault *vmf, unsig= ned long reason) /* 0 or > 1 flags set is a bug; we expect exactly 1. */ VM_WARN_ON_ONCE(!reason || (reason & (reason - 1))); =20 - if (ctx->features & UFFD_FEATURE_SIGBUS) + if (userfaultfd_features(ctx) & UFFD_FEATURE_SIGBUS) goto out; if (!(vmf->flags & FAULT_FLAG_USER) && (ctx->flags & UFFD_USER_MODE_ONLY)) goto out; @@ -506,7 +515,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsig= ned long reason) init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function); uwq.wq.private =3D current; uwq.msg =3D userfault_msg(vmf->address, vmf->real_address, vmf->flags, - reason, ctx->features); + reason, userfaultfd_features(ctx)); uwq.ctx =3D ctx; uwq.waken =3D false; =20 @@ -668,7 +677,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct = list_head *fcs) if (!octx) return 0; =20 - if (!(octx->features & UFFD_FEATURE_EVENT_FORK)) { + if (!(userfaultfd_features(octx) & UFFD_FEATURE_EVENT_FORK)) { userfaultfd_reset_ctx(vma); return 0; } @@ -774,7 +783,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma, if (!ctx) return; =20 - if (ctx->features & UFFD_FEATURE_EVENT_REMAP) { + if (userfaultfd_features(ctx) & UFFD_FEATURE_EVENT_REMAP) { vm_ctx->ctx =3D ctx; userfaultfd_ctx_get(ctx); down_write(&ctx->map_changing_lock); @@ -824,7 +833,7 @@ bool userfaultfd_remove(struct vm_area_struct *vma, struct userfaultfd_wait_queue ewq; =20 ctx =3D vma->vm_userfaultfd_ctx.ctx; - if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_REMOVE)) + if (!ctx || !(userfaultfd_features(ctx) & UFFD_FEATURE_EVENT_REMOVE)) return true; =20 userfaultfd_ctx_get(ctx); @@ -863,7 +872,7 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma, = unsigned long start, struct userfaultfd_unmap_ctx *unmap_ctx; struct userfaultfd_ctx *ctx =3D vma->vm_userfaultfd_ctx.ctx; =20 - if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_UNMAP) || + if (!ctx || !(userfaultfd_features(ctx) & UFFD_FEATURE_EVENT_UNMAP) || has_unmap_ctx(ctx, unmaps, start, end)) return 0; =20 @@ 
-1826,6 +1835,65 @@ static int userfaultfd_deactivate(struct userfaultfd= _ctx *ctx, return ret; } =20 +/* + * Features that can be toggled at runtime via UFFDIO_SET_MODE. + * Only async features that were enabled at UFFDIO_API time may be toggled. + */ +#define UFFD_FEATURE_TOGGLEABLE (UFFD_FEATURE_MINOR_ASYNC) + +static int userfaultfd_set_mode(struct userfaultfd_ctx *ctx, + unsigned long arg) +{ + struct uffdio_set_mode mode; + struct mm_struct *mm =3D ctx->mm; + + if (copy_from_user(&mode, (void __user *)arg, sizeof(mode))) + return -EFAULT; + + /* enable and disable must not overlap */ + if (mode.enable & mode.disable) + return -EINVAL; + + /* only toggleable features are allowed */ + if ((mode.enable | mode.disable) & ~UFFD_FEATURE_TOGGLEABLE) + return -EINVAL; + + if (!mmget_not_zero(mm)) + return -ESRCH; + + /* + * mmap_write_lock serializes against all page faults. + * After we release, no in-flight faults from the old mode exist. + */ + { + unsigned int new_features; + + mmap_write_lock(mm); + new_features =3D userfaultfd_features(ctx); + new_features |=3D mode.enable; + new_features &=3D ~mode.disable; + WRITE_ONCE(ctx->features, new_features); + mmap_write_unlock(mm); + } + + /* + * If switching to async, wake threads blocked in handle_userfault(). + * They will retry the fault and auto-resolve under the new mode. + * len=3D0 means wake all pending faults on this context. 
+ */ + if (mode.enable & UFFD_FEATURE_MINOR_ASYNC) { + struct userfaultfd_wake_range range =3D { .len =3D 0 }; + + spin_lock_irq(&ctx->fault_pending_wqh.lock); + __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, + &range); + __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range); + spin_unlock_irq(&ctx->fault_pending_wqh.lock); + } + + mmput(mm); + return 0; +} =20 static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long= arg) { @@ -2150,6 +2218,9 @@ static long userfaultfd_ioctl(struct file *file, unsi= gned cmd, case UFFDIO_DEACTIVATE: ret =3D userfaultfd_deactivate(ctx, arg); break; + case UFFDIO_SET_MODE: + ret =3D userfaultfd_set_mode(ctx, arg); + break; } return ret; } @@ -2177,7 +2248,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *= m, struct file *f) * protocols: aa:... bb:... */ seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n", - pending, total, UFFD_API, ctx->features, + pending, total, UFFD_API, userfaultfd_features(ctx), UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS); } #endif diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaul= tfd.h index 775825da2596..f0f14f9db06c 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -84,6 +84,7 @@ #define _UFFDIO_CONTINUE (0x07) #define _UFFDIO_POISON (0x08) #define _UFFDIO_DEACTIVATE (0x09) +#define _UFFDIO_SET_MODE (0x0A) #define _UFFDIO_API (0x3F) =20 /* userfaultfd ioctl ids */ @@ -110,6 +111,8 @@ struct uffdio_poison) #define UFFDIO_DEACTIVATE _IOR(UFFDIO, _UFFDIO_DEACTIVATE, \ struct uffdio_range) +#define UFFDIO_SET_MODE _IOW(UFFDIO, _UFFDIO_SET_MODE, \ + struct uffdio_set_mode) =20 /* read() structure */ struct uffd_msg { @@ -395,6 +398,16 @@ struct uffdio_move { __s64 move; }; =20 +struct uffdio_set_mode { + /* + * Toggle async mode for features at runtime. + * Supported: UFFD_FEATURE_MINOR_ASYNC. + * Setting a bit in both enable and disable is invalid. 
+ */ + __u64 enable; + __u64 disable; +}; + /* * Flags for the userfaultfd(2) system call itself. */ --=20 2.51.2 From nobody Mon Jun 15 23:15:09 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B299E3EC2E7; Tue, 14 Apr 2026 14:24:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176657; cv=none; b=BbbXvf5jvlN86jtK7roFgE9XKKeQFNgR4Bql5h2cSEA6lWYScsE2OT5nEoICNYrclE+J/FmJ3zTT3hoGcG6NZTcw7+EHjtX+qkj/Zi+ohVLde1tzawDEQZn+g+h0k9OqVGVbIsVtMgK7WnC8UI8Te3OrESNBYueOs3nx4jEzwCQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176657; c=relaxed/simple; bh=E9uRJ76OlqqumyeaXsGedT7SJ1A+HCyjfEbuV1tzZtY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=XpPCUI4QnERPbyxicjLpaRpfcHitFsAdoZ7Hbie+59LdC+ywX+RYvG7iIlmiAyNXMsRMXEHAESPtIynrh3I4sV8FQukw8OABp/i1uuw7OJAlGOfkflx5dzykYFrC+4Qgs4qV3XT5x3lrTtep9z8NWDq77MlZ8E6ZO3wK/0uPRX0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=dyHCKV0q; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="dyHCKV0q" Received: by smtp.kernel.org (Postfix) with ESMTPSA id F160FC4AF0B; Tue, 14 Apr 2026 14:24:16 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776176657; bh=E9uRJ76OlqqumyeaXsGedT7SJ1A+HCyjfEbuV1tzZtY=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=dyHCKV0qo2zi1d5qDNF0EFTOh65ruVcvJwrkWK09JRKej+PuspN0jJyyq+hMeK2XY 6kqj9CyGRyPhtkYcB0CQI1za8guadvwkwtRxER1KMYvWORC+Ftpv/DD403IFW8QWiO 
Du2BqowmVBsUI8K65kyRndVfovOLR9nBr3mv2fIKaZMshj/WoK1/X4t+N5+dBb3Rkz /3cLEzaNOtHWmMcODRsmVhIilS2kUnXnEubjkVFDMKnmxeXastmM7CyK4fTuTBD8aR WpypYWf9xvZ9m5Js35lXLWyvhlIYkuykSufIvha7PO51cPFFrtz/OlgfKE9nOwjQ4K UBebfEQybzITg== Received: from phl-compute-06.internal (phl-compute-06.internal [10.202.2.46]) by mailfauth.phl.internal (Postfix) with ESMTP id 2F1CEF40068; Tue, 14 Apr 2026 10:24:16 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-06.internal (MEProxy); Tue, 14 Apr 2026 10:24:16 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdegudefkecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpefhvfevufffkffojghfgggtgfesthekredtredtjeenucfhrhhomhepfdfmihhrhihl ucfuhhhuthhsvghmrghuucdlofgvthgrmddfuceokhgrsheskhgvrhhnvghlrdhorhhgqe enucggtffrrghtthgvrhhnpefhvdefvdevjeevhefhhfevudefudejfeduvdekheeludfh iefhhedujeffffeigfenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrih hlfhhrohhmpehkihhrihhllhdomhgvshhmthhprghuthhhphgvrhhsohhnrghlihhthidq udeiudduiedvieehhedqvdekgeeggeejvdekqdhkrghspeepkhgvrhhnvghlrdhorhhgse hshhhuthgvmhhovhdrnhgrmhgvpdhnsggprhgtphhtthhopeduledpmhhouggvpehsmhht phhouhhtpdhrtghpthhtoheprghkphhmsehlihhnuhigqdhfohhunhgurghtihhonhdroh hrghdprhgtphhtthhopehpvghtvghrgiesrhgvughhrghtrdgtohhmpdhrtghpthhtohep uggrvhhiugeskhgvrhhnvghlrdhorhhgpdhrtghpthhtoheplhhjsheskhgvrhhnvghlrd horhhgpdhrtghpthhtoheprhhpphhtsehkvghrnhgvlhdrohhrghdprhgtphhtthhopehs uhhrvghnsgesghhoohhglhgvrdgtohhmpdhrtghpthhtohepvhgsrggskhgrsehkvghrnh gvlhdrohhrghdprhgtphhtthhopehlihgrmhdrhhhofihlvghtthesohhrrggtlhgvrdgt ohhmpdhrtghpthhtohepiihihiesnhhvihguihgrrdgtohhm X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 14 Apr 2026 10:24:15 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike 
Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 11/12] selftests/mm: add userfaultfd anonymous minor fault tests Date: Tue, 14 Apr 2026 15:23:45 +0100 Message-ID: <20260414142354.1465950-12-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Add tests for UFFD_FEATURE_MINOR_ANON, UFFD_FEATURE_MINOR_ASYNC, UFFDIO_DEACTIVATE, UFFDIO_SET_MODE, and PAGE_IS_UFFD_DEACTIVATED: - minor-anon-async: populate pages, register MODE_MINOR with MINOR_ASYNC, deactivate via UFFDIO_DEACTIVATE, re-access and verify content is preserved with no faults delivered to the handler. - minor-anon-sync: same setup but without MINOR_ASYNC. Verify that each deactivated page access delivers a MINOR fault to the handler, and UFFDIO_CONTINUE resolves it. Exercises both PTE and THP paths. - minor-anon-pagemap: deactivate a range, touch first half, use PAGEMAP_SCAN with PAGE_IS_UFFD_DEACTIVATED to verify the untouched second half is reported as cold. - minor-anon-gup: write() from a deactivated page into a pipe to exercise GUP resolution through protnone PTEs via async auto-restore. - minor-anon-async-toggle: full detection-to-eviction cycle using UFFDIO_SET_MODE. Start async (detection), flip to sync (eviction of cold pages), flip back to async. - minor-anon-close: deactivate pages, close the uffd fd, verify all pages are accessible again (protnone PTEs restored on cleanup). 
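As a reading aid, the flow of the minor-anon-pagemap case can be modelled without any kernel support. The sketch below is a toy simulation, not the selftest: every name in it (struct demo_range, demo_deactivate_all(), demo_touch(), demo_count_cold()) is hypothetical, and it assumes that a touched page is restored on access, as in the async tests, so that only untouched pages remain reported as deactivated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_PAGES 8

/* Hypothetical per-page state: true means deactivated and not touched since. */
struct demo_range {
	bool deactivated[DEMO_NR_PAGES];
};

/* Stand-in for UFFDIO_DEACTIVATE over the whole range. */
static void demo_deactivate_all(struct demo_range *r)
{
	for (size_t i = 0; i < DEMO_NR_PAGES; i++)
		r->deactivated[i] = true;
}

/* Stand-in for an access that restores the page (assumed async behaviour). */
static void demo_touch(struct demo_range *r, size_t page)
{
	assert(page < DEMO_NR_PAGES);
	r->deactivated[page] = false;
}

/* Stand-in for a PAGEMAP_SCAN filtered on PAGE_IS_UFFD_DEACTIVATED. */
static size_t demo_count_cold(const struct demo_range *r)
{
	size_t cold = 0;

	for (size_t i = 0; i < DEMO_NR_PAGES; i++)
		if (r->deactivated[i])
			cold++;
	return cold;
}
```

Deactivating the range, touching the first half and then counting should leave exactly the untouched second half reported as cold, which is the property the real test checks through PAGEMAP_SCAN.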
Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 --- tools/testing/selftests/mm/uffd-unit-tests.c | 458 +++++++++++++++++++ 1 file changed, 458 insertions(+) diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/s= elftests/mm/uffd-unit-tests.c index 6f5e404a446c..8bd5a642bd5a 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -7,6 +7,7 @@ =20 #include "uffd-common.h" =20 +#include #include "../../../../mm/gup_test.h" =20 #ifdef __NR_userfaultfd @@ -623,6 +624,423 @@ void uffd_minor_collapse_test(uffd_global_test_opts_t= *gopts, uffd_test_args_t * uffd_minor_test_common(gopts, true, false); } =20 +static void deactivate_range(int uffd, __u64 start, __u64 len) +{ + struct uffdio_range range =3D { .start =3D start, .len =3D len }; + + if (ioctl(uffd, UFFDIO_DEACTIVATE, &range)) + err("UFFDIO_DEACTIVATE failed"); +} + +static void set_async_mode(int uffd, bool enable) +{ + struct uffdio_set_mode mode =3D { }; + + if (enable) + mode.enable =3D UFFD_FEATURE_MINOR_ASYNC; + else + mode.disable =3D UFFD_FEATURE_MINOR_ASYNC; + + if (ioctl(uffd, UFFDIO_SET_MODE, &mode)) + err("UFFDIO_SET_MODE failed"); +} + +/* + * Test async minor faults on anonymous memory. + * Populate pages, register MODE_MINOR with MINOR_ASYNC, + * deactivate, re-access, verify content preserved and no faults delivered. 
+ */ +static void uffd_minor_anon_async_test(uffd_global_test_opts_t *gopts, + uffd_test_args_t *args) +{ + unsigned long nr_pages =3D gopts->nr_pages; + unsigned long page_size =3D gopts->page_size; + unsigned long p; + + /* Populate all pages with known content */ + for (p =3D 0; p < nr_pages; p++) + memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size); + + /* Register MODE_MINOR (uffd was opened with MINOR_ANON | MINOR_ASYNC) */ + if (uffd_register(gopts->uffd, gopts->area_dst, + nr_pages * page_size, + false, false, true)) + err("register failure"); + + /* Deactivate all pages =E2=80=94 sets protnone */ + deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst, + nr_pages * page_size); + + /* Access all pages =E2=80=94 should auto-resolve, no faults */ + for (p =3D 0; p < nr_pages; p++) { + unsigned char *page =3D (unsigned char *)gopts->area_dst + + p * page_size; + unsigned char expected =3D p % 255 + 1; + + if (page[0] !=3D expected) { + uffd_test_fail("page %lu content mismatch: %u !=3D %u", + p, page[0], expected); + return; + } + } + + uffd_test_pass(); +} + +/* + * Custom fault handler for anon minor =E2=80=94 just UFFDIO_CONTINUE, no = content + * modification (the page is protnone so we can't access it from here). + */ +static void uffd_handle_minor_anon(uffd_global_test_opts_t *gopts, + struct uffd_msg *msg, + struct uffd_args *uargs) +{ + struct uffdio_continue req; + + if (!(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR)) + err("expected minor fault, got 0x%llx", + msg->arg.pagefault.flags); + + req.range.start =3D msg->arg.pagefault.address; + req.range.len =3D gopts->page_size; + req.mode =3D 0; + if (ioctl(gopts->uffd, UFFDIO_CONTINUE, &req)) { + /* + * THP races with khugepaged collapse/split: + * EAGAIN: PMD changed under us + * EEXIST: THP present but already resolved + * In both cases the page is accessible =E2=80=94 the faulting + * thread retries and succeeds. 
+ */ + if (errno !=3D EEXIST && errno !=3D EAGAIN) + err("UFFDIO_CONTINUE failed"); + } + + uargs->minor_faults++; +} + +/* + * Test sync minor faults on anonymous memory. + * Populate pages, register MODE_MINOR (sync), deactivate, + * access from worker thread, verify fault delivered, UFFDIO_CONTINUE reso= lves. + */ +static void uffd_minor_anon_sync_test(uffd_global_test_opts_t *gopts, + uffd_test_args_t *args) +{ + unsigned long nr_pages =3D gopts->nr_pages; + unsigned long page_size =3D gopts->page_size; + pthread_t uffd_mon; + struct uffd_args uargs =3D { }; + char c =3D '\0'; + unsigned long p; + + uargs.gopts =3D gopts; + uargs.handle_fault =3D uffd_handle_minor_anon; + + /* Populate all pages */ + for (p =3D 0; p < nr_pages; p++) + memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size); + + /* Register MODE_MINOR (uffd opened with MINOR_ANON, no MINOR_ASYNC) */ + if (uffd_register(gopts->uffd, gopts->area_dst, + nr_pages * page_size, + false, false, true)) + err("register failure"); + + /* Deactivate all pages */ + deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst, + nr_pages * page_size); + + /* Start fault handler thread */ + if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs)) + err("uffd_poll_thread create"); + + /* Access all pages =E2=80=94 triggers sync minor faults, handler does CO= NTINUE */ + for (p =3D 0; p < nr_pages; p++) { + unsigned char *page =3D (unsigned char *)gopts->area_dst + + p * page_size; + + if (page[0] !=3D (p % 255 + 1)) { + uffd_test_fail("page %lu content mismatch", p); + goto out; + } + } + + if (uargs.minor_faults =3D=3D 0) { + uffd_test_fail("expected minor faults, got 0"); + goto out; + } + + uffd_test_pass(); +out: + if (write(gopts->pipefd[1], &c, sizeof(c)) !=3D sizeof(c)) + err("pipe write"); + if (pthread_join(uffd_mon, NULL)) + err("join() failed"); +} + +/* + * Test PAGEMAP_SCAN detection of deactivated (cold) pages. 
+ */ +static void uffd_minor_anon_pagemap_test(uffd_global_test_opts_t *gopts, + uffd_test_args_t *args) +{ + unsigned long nr_pages =3D gopts->nr_pages; + unsigned long page_size =3D gopts->page_size; + unsigned long p; + struct page_region regions[16]; + struct pm_scan_arg pm_arg; + int pagemap_fd; + long ret; + + /* Need at least 4 pages */ + if (nr_pages < 4) { + uffd_test_skip("need at least 4 pages"); + return; + } + + /* Populate all pages */ + for (p =3D 0; p < nr_pages; p++) + memset(gopts->area_dst + p * page_size, 0xab, page_size); + + /* Register and deactivate */ + if (uffd_register(gopts->uffd, gopts->area_dst, + nr_pages * page_size, + false, false, true)) + err("register failure"); + + deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst, + nr_pages * page_size); + + /* Touch first half of pages to re-activate them (async auto-resolve) */ + for (p =3D 0; p < nr_pages / 2; p++) { + volatile char *page =3D gopts->area_dst + p * page_size; + (void)*page; + } + + /* Scan for cold (still deactivated) pages */ + pagemap_fd =3D open("/proc/self/pagemap", O_RDONLY); + if (pagemap_fd < 0) + err("open pagemap"); + + memset(&pm_arg, 0, sizeof(pm_arg)); + pm_arg.size =3D sizeof(pm_arg); + pm_arg.start =3D (uint64_t)gopts->area_dst; + pm_arg.end =3D (uint64_t)gopts->area_dst + nr_pages * page_size; + pm_arg.vec =3D (uint64_t)regions; + pm_arg.vec_len =3D 16; + pm_arg.category_mask =3D PAGE_IS_UFFD_DEACTIVATED; + pm_arg.return_mask =3D PAGE_IS_UFFD_DEACTIVATED; + + ret =3D ioctl(pagemap_fd, PAGEMAP_SCAN, &pm_arg); + close(pagemap_fd); + + if (ret < 0) { + uffd_test_fail("PAGEMAP_SCAN failed: %s", strerror(errno)); + return; + } + + /* + * The second half of pages should be reported as deactivated. + * They may be coalesced into one region. 
+ */ + if (ret < 1) { + uffd_test_fail("expected cold pages, got %ld regions", ret); + return; + } + + /* Verify the cold region covers the second half */ + uint64_t cold_start =3D regions[0].start; + uint64_t expected_start =3D (uint64_t)gopts->area_dst + + (nr_pages / 2) * page_size; + + if (cold_start !=3D expected_start) { + uffd_test_fail("cold region starts at 0x%lx, expected 0x%lx", + (unsigned long)cold_start, + (unsigned long)expected_start); + return; + } + + uffd_test_pass(); +} + +/* + * Test that GUP resolves through protnone PTEs (async mode). + * Deactivate pages, then use a pipe to exercise GUP on the deactivated + * memory. write() from deactivated pages triggers GUP which must fault + * through the protnone PTE. + */ +static void uffd_minor_anon_gup_test(uffd_global_test_opts_t *gopts, + uffd_test_args_t *args) +{ + unsigned long page_size =3D gopts->page_size; + char *buf; + int pipefd[2]; + + buf =3D malloc(page_size); + if (!buf) + err("malloc"); + + /* Populate first page with known content */ + memset(gopts->area_dst, 0xCD, page_size); + + if (uffd_register(gopts->uffd, gopts->area_dst, page_size, + false, false, true)) + err("register failure"); + + deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst, page_size); + + if (pipe(pipefd)) + err("pipe"); + + /* + * write() from the deactivated page into the pipe. + * This triggers GUP on the protnone PTE. In async mode the + * kernel auto-restores permissions and GUP succeeds. 
+ */ + if (write(pipefd[1], gopts->area_dst, page_size) !=3D page_size) { + uffd_test_fail("write from deactivated page failed: %s", + strerror(errno)); + goto out; + } + + if (read(pipefd[0], buf, page_size) !=3D page_size) { + uffd_test_fail("read from pipe failed"); + goto out; + } + + if (buf[0] !=3D '\xCD' || memcmp(buf, buf + 1, page_size - 1)) { + uffd_test_fail("content mismatch: first byte 0x%02x, expected all 0xCD", + (unsigned char)buf[0]); + goto out; + } + + uffd_test_pass(); +out: + close(pipefd[0]); + close(pipefd[1]); + free(buf); +} + +/* + * Test runtime toggle between async and sync modes. + * Start in async mode (detection), flip to sync (eviction), verify faults + * block, resolve them, flip back to async. + */ +static void uffd_minor_anon_async_toggle_test(uffd_global_test_opts_t *gop= ts, + uffd_test_args_t *args) +{ + unsigned long nr_pages =3D gopts->nr_pages; + unsigned long page_size =3D gopts->page_size; + struct uffd_args uargs =3D { }; + pthread_t uffd_mon; + char c =3D '\0'; + unsigned long p; + + uargs.gopts =3D gopts; + uargs.handle_fault =3D uffd_handle_minor_anon; + + /* Populate */ + for (p =3D 0; p < nr_pages; p++) + memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size); + + if (uffd_register(gopts->uffd, gopts->area_dst, + nr_pages * page_size, + false, false, true)) + err("register failure"); + + /* Phase 1: async detection =E2=80=94 deactivate, access first half */ + deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst, + nr_pages * page_size); + + for (p =3D 0; p < nr_pages / 2; p++) { + volatile char *page =3D gopts->area_dst + p * page_size; + (void)*page; /* auto-resolves in async mode */ + } + + /* Phase 2: flip to sync for eviction */ + set_async_mode(gopts->uffd, false); + + /* Start handler =E2=80=94 will receive faults for cold pages */ + if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs)) + err("uffd_poll_thread create"); + + /* Access second half (cold pages) =E2=80=94 should trigger sync faults */ + for (p =3D nr_pages / 2; p < 
nr_pages; p++) { + unsigned char *page =3D (unsigned char *)gopts->area_dst + + p * page_size; + if (page[0] !=3D (p % 255 + 1)) { + uffd_test_fail("page %lu content mismatch", p); + goto out; + } + } + + if (uargs.minor_faults =3D=3D 0) { + uffd_test_fail("expected sync faults, got 0"); + goto out; + } + + /* Phase 3: flip back to async */ + set_async_mode(gopts->uffd, true); + + /* Deactivate and access again =E2=80=94 should auto-resolve */ + deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst, + nr_pages * page_size); + + for (p =3D 0; p < nr_pages; p++) { + volatile char *page =3D gopts->area_dst + p * page_size; + (void)*page; + } + + uffd_test_pass(); +out: + if (write(gopts->pipefd[1], &c, sizeof(c)) !=3D sizeof(c)) + err("pipe write"); + if (pthread_join(uffd_mon, NULL)) + err("join() failed"); +} + +/* + * Test that deactivated pages become accessible after closing uffd. + */ +static void uffd_minor_anon_close_test(uffd_global_test_opts_t *gopts, + uffd_test_args_t *args) +{ + unsigned long nr_pages =3D gopts->nr_pages; + unsigned long page_size =3D gopts->page_size; + unsigned long p; + + /* Populate */ + for (p =3D 0; p < nr_pages; p++) + memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size); + + if (uffd_register(gopts->uffd, gopts->area_dst, + nr_pages * page_size, + false, false, true)) + err("register failure"); + + deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst, + nr_pages * page_size); + + /* Close uffd =E2=80=94 should restore protnone PTEs */ + close(gopts->uffd); + gopts->uffd =3D -1; + + /* All pages should be accessible with original content */ + for (p =3D 0; p < nr_pages; p++) { + unsigned char *page =3D (unsigned char *)gopts->area_dst + + p * page_size; + unsigned char expected =3D p % 255 + 1; + + if (page[0] !=3D expected) { + uffd_test_fail("page %lu not accessible after close", p); + return; + } + } + + uffd_test_pass(); +} + static sigjmp_buf jbuf, *sigbuf; =20 static void sighndl(int sig, siginfo_t *siginfo, 
void *ptr) @@ -1625,6 +2043,46 @@ uffd_test_case_t uffd_tests[] =3D { /* We can't test MADV_COLLAPSE, so try our luck */ .uffd_feature_required =3D UFFD_FEATURE_MINOR_SHMEM, }, + { + .name =3D "minor-anon-async", + .uffd_fn =3D uffd_minor_anon_async_test, + .mem_targets =3D MEM_ANON, + .uffd_feature_required =3D + UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC, + }, + { + .name =3D "minor-anon-sync", + .uffd_fn =3D uffd_minor_anon_sync_test, + .mem_targets =3D MEM_ANON, + .uffd_feature_required =3D UFFD_FEATURE_MINOR_ANON, + }, + { + .name =3D "minor-anon-pagemap", + .uffd_fn =3D uffd_minor_anon_pagemap_test, + .mem_targets =3D MEM_ANON, + .uffd_feature_required =3D + UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC, + }, + { + .name =3D "minor-anon-gup", + .uffd_fn =3D uffd_minor_anon_gup_test, + .mem_targets =3D MEM_ANON, + .uffd_feature_required =3D + UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC, + }, + { + .name =3D "minor-anon-async-toggle", + .uffd_fn =3D uffd_minor_anon_async_toggle_test, + .mem_targets =3D MEM_ANON, + .uffd_feature_required =3D + UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC, + }, + { + .name =3D "minor-anon-close", + .uffd_fn =3D uffd_minor_anon_close_test, + .mem_targets =3D MEM_ANON, + .uffd_feature_required =3D UFFD_FEATURE_MINOR_ANON, + }, { .name =3D "sigbus", .uffd_fn =3D uffd_sigbus_test, --=20 2.51.2 From nobody Mon Jun 15 23:15:09 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DF4DE3E92AF; Tue, 14 Apr 2026 14:24:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776176659; cv=none; 
From: "Kiryl Shutsemau (Meta)" To: Andrew Morton Cc: Peter Xu , David Hildenbrand , Lorenzo Stoakes , Mike Rapoport , Suren Baghdasaryan , Vlastimil Babka , "Liam R . 
Howlett" , Zi Yan , Jonathan Corbet , Shuah Khan , Sean Christopherson , Paolo Bonzini , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, "Kiryl Shutsemau (Meta)" Subject: [RFC, PATCH 12/12] Documentation/userfaultfd: document working set tracking Date: Tue, 14 Apr 2026 15:23:46 +0100 Message-ID: <20260414142354.1465950-13-kas@kernel.org> X-Mailer: git-send-email 2.51.2 In-Reply-To: <20260414142354.1465950-1-kas@kernel.org> References: <20260414142354.1465950-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Document the new userfaultfd capabilities for VM working set tracking: - UFFD_FEATURE_MINOR_ANON and UFFD_FEATURE_MINOR_ASYNC for anonymous minor fault interception using the PROT_NONE hinting mechanism. - UFFDIO_DEACTIVATE for marking pages as inaccessible while keeping them resident. - Sync and async fault resolution modes, and UFFDIO_SET_MODE for runtime toggling between them. - PAGEMAP_SCAN with PAGE_IS_UFFD_DEACTIVATED for cold page detection. - Cleanup semantics on unregister and close. - NUMA balancing interaction on anonymous VMAs. - Complete VMM workflow example for the cold page eviction lifecycle, with a note on shmem applicability. Update the feature flag descriptions at the top of the guide to reference the new section. 
Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 --- Documentation/admin-guide/mm/userfaultfd.rst | 141 ++++++++++++++++++- 1 file changed, 140 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/a= dmin-guide/mm/userfaultfd.rst index e5cc8848dcb3..fc89e029060c 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -111,7 +111,11 @@ events, except page fault notifications, may be genera= ted: - ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating - support for shmem virtual memory areas. + support for shmem virtual memory areas. ``UFFD_FEATURE_MINOR_ANON`` + extends minor fault support to anonymous private memory using + PROT_NONE hinting; see the `Anonymous Minor Faults`_ section. + ``UFFD_FEATURE_MINOR_ASYNC`` enables asynchronous auto-resolution for + anonymous minor faults (requires ``UFFD_FEATURE_MINOR_ANON``). =20 - ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an existing page contents from userspace. @@ -297,6 +301,141 @@ transparent to the guest, we want that same address r= ange to act as if it was still poisoned, even though it's on a new physical host which ostensibly doesn't have a memory error in the exact same spot. =20 +Anonymous Minor Faults +---------------------- + +``UFFD_FEATURE_MINOR_ANON`` enables ``UFFDIO_REGISTER_MODE_MINOR`` on +anonymous private memory. Unlike shmem/hugetlbfs minor faults (where a page +exists in the page cache but has no PTE), anonymous minor faults use the +PROT_NONE hinting mechanism: pages remain resident in memory with their PF= Ns +preserved in the PTEs, but access permissions are removed so the next acce= ss +triggers a fault. 
+ +This is designed for VM memory managers that need to track the working set= of +anonymous guest memory for cold page eviction to tiered or remote storage. + +**Setup:** + +1. Open a userfaultfd and enable ``UFFD_FEATURE_MINOR_ANON`` (and optional= ly + ``UFFD_FEATURE_MINOR_ASYNC``) via ``UFFDIO_API``. + +2. Register the guest memory range with ``UFFDIO_REGISTER_MODE_MINOR`` + (and ``UFFDIO_REGISTER_MODE_MISSING`` if evicted pages will need to be + fetched back from storage). + +**Deactivation:** + +Use ``UFFDIO_DEACTIVATE`` to mark pages as inaccessible. This ioctl takes a +``struct uffdio_range`` and sets PROT_NONE on all present PTEs in the rang= e, +using the same mechanism as NUMA balancing. Pages stay resident and their +physical frames are preserved =E2=80=94 only access permissions are remove= d. + +**Fault Handling:** + +When a deactivated page is accessed: + +- **Sync mode** (default): The faulting thread blocks and a + ``UFFD_PAGEFAULT_FLAG_MINOR`` message is delivered to the userfaultfd + handler. The handler resolves the fault with ``UFFDIO_CONTINUE``, which + restores the PTE permissions and wakes the faulting thread. + +- **Async mode** (``UFFD_FEATURE_MINOR_ASYNC``): The kernel automatically + restores PTE permissions and the thread continues without blocking. No + message is delivered to the handler. 
+ +**Cold Page Detection with PAGEMAP_SCAN:** + +After deactivating a range and letting the application run, use the +``PAGEMAP_SCAN`` ioctl on ``/proc/pid/pagemap`` with the +``PAGE_IS_UFFD_DEACTIVATED`` category flag to efficiently find pages that = were +never re-accessed (cold pages):: + + struct pm_scan_arg arg =3D { + .size =3D sizeof(arg), + .start =3D guest_mem_start, + .end =3D guest_mem_end, + .vec =3D (uint64_t)regions, + .vec_len =3D regions_len, + .category_mask =3D PAGE_IS_UFFD_DEACTIVATED, + .return_mask =3D PAGE_IS_UFFD_DEACTIVATED, + }; + long n =3D ioctl(pagemap_fd, PAGEMAP_SCAN, &arg); + +The returned ``page_region`` array contains contiguous cold ranges that can +then be evicted. + +**Cleanup:** + +When the userfaultfd is closed or the range is unregistered, all protnone +PTEs are automatically restored to their normal VMA permissions. This +prevents pages from becoming permanently inaccessible. + +**Interaction with NUMA Balancing:** + +NUMA balancing is automatically disabled on anonymous VMAs registered with +``UFFDIO_REGISTER_MODE_MINOR``, since both mechanisms use PROT_NONE PTEs +as access hints and would interfere with each other. Shmem VMAs are not +affected since ``UFFDIO_DEACTIVATE`` zaps PTEs there instead of using +PROT_NONE. + +**VMM Working Set Tracking Workflow:** + +A typical VMM lifecycle for cold page eviction to tiered storage:: + + /* One-time setup */ + uffd =3D userfaultfd(O_CLOEXEC | O_NONBLOCK); + ioctl(uffd, UFFDIO_API, &(struct uffdio_api){ + .api =3D UFFD_API, + .features =3D UFFD_FEATURE_MINOR_ANON | + UFFD_FEATURE_MINOR_ASYNC, + }); + ioctl(uffd, UFFDIO_REGISTER, &(struct uffdio_register){ + .range =3D { guest_mem, guest_size }, + .mode =3D UFFDIO_REGISTER_MODE_MINOR | + UFFDIO_REGISTER_MODE_MISSING, + }); + + /* Tracking loop */ + while (vm_running) { + /* 1. Detection phase (async =E2=80=94 no vCPU stalls) */ + ioctl(uffd, UFFDIO_DEACTIVATE, &full_range); + sleep(tracking_interval); + + /* 2. 
Find cold pages */ + ioctl(pagemap_fd, PAGEMAP_SCAN, &(struct pm_scan_arg){ + .category_mask =3D PAGE_IS_UFFD_DEACTIVATED, + ... + }); + + /* 3. Switch to sync for safe eviction */ + ioctl(uffd, UFFDIO_SET_MODE, + &(struct uffdio_set_mode){ + .disable =3D UFFD_FEATURE_MINOR_ASYNC }); + + /* 4. Evict cold pages (vCPU faults block in handler) */ + for each cold range: + pwrite(storage_fd, cold_addr, len, offset); + madvise(cold_addr, len, MADV_DONTNEED); + + /* 5. Resume async tracking */ + ioctl(uffd, UFFDIO_SET_MODE, + &(struct uffdio_set_mode){ + .enable =3D UFFD_FEATURE_MINOR_ASYNC }); + } + +During step 4, if a vCPU accesses a cold page being evicted, it blocks +with a ``UFFD_PAGEFAULT_FLAG_MINOR`` fault. The handler can either let it +wait (the eviction completes, ``MADV_DONTNEED`` fires, the fault retries as +``MISSING`` and is resolved with ``UFFDIO_COPY`` from storage) or resolve +it immediately with ``UFFDIO_CONTINUE``. + +The same workflow applies to shmem-backed guest memory +(``UFFD_FEATURE_MINOR_SHMEM``). The only difference is the +``PAGEMAP_SCAN`` mask for cold page detection: match non-present pages +(``PAGE_IS_PRESENT`` in both ``category_mask`` and ``category_inverted``) +instead of ``PAGE_IS_UFFD_DEACTIVATED``, since ``UFFDIO_DEACTIVATE`` zaps +PTEs on shmem (pages stay in page cache) rather than setting PROT_NONE. + QEMU/KVM =3D=3D=3D=3D=3D=3D=3D=3D =20 --=20 2.51.2
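Note on step 4 of the documented workflow: the offset passed to pwrite() is left open. One possible reading, sketched below under the assumption that the storage file mirrors guest memory one to one, is offset = cold_addr - guest_base. evict_range() and guest_base are illustrative names, not part of the documented interface. The sketch involves no userfaultfd, so it runs on any plain private anonymous mapping; the cold range is supplied by the caller rather than taken from PAGEMAP_SCAN.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one cold range to backing storage, then drop its pages.
 * Assumption: the storage offset equals the range's offset from the start
 * of guest memory. EINTR is not retried in this sketch.
 * Returns 0 on success, -1 on failure.
 */
static int evict_range(int storage_fd, char *guest_base,
		       char *cold_addr, size_t len)
{
	off_t offset = cold_addr - guest_base;
	size_t done = 0;

	/* pwrite() may write less than requested, so loop until done. */
	while (done < len) {
		ssize_t n = pwrite(storage_fd, cold_addr + done, len - done,
				   offset + (off_t)done);

		if (n <= 0)
			return -1;
		done += (size_t)n;
	}

	/* On private anonymous memory the next read sees zero-filled pages. */
	return madvise(cold_addr, len, MADV_DONTNEED);
}
```

cold_addr and len must be page aligned for madvise() to succeed. A real VMM would also need to record which ranges were evicted so that a later MISSING fault can be served from the same offset.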