This series adds userfaultfd support for tracking the working set of
VM guest memory, enabling VMMs to identify cold pages and evict them
to tiered or remote storage.

== Problem ==

VMMs managing guest memory need to:
1. Track which pages are actively used (working set detection)
2. Safely evict cold pages to slower storage
3. Fetch pages back on demand when accessed again

For shmem-backed guest memory, working set tracking partially works
today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
re-access auto-resolves from cache. But safe eviction still requires
synchronous fault interception to prevent data loss races.

For anonymous guest memory (needed for KSM cross-VM deduplication),
there is no mechanism at all: clearing a PTE loses the page.

== Solution ==

The series introduces a unified userfaultfd interface that works
across both anonymous and shmem-backed memory:

UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
balancing) to make pages inaccessible without freeing them.

UFFD_FEATURE_MINOR_ASYNC: auto-resolves minor faults without handler
involvement. The kernel restores PTE permissions immediately and the
faulting thread continues. Works for anonymous, shmem, and hugetlbfs.

UFFDIO_DEACTIVATE: marks pages as deactivated. For anonymous memory,
sets PROT_NONE on PTEs (pages stay resident). For shmem/hugetlbfs,
zaps PTEs (pages stay in page cache).

UFFDIO_SET_MODE: toggles MINOR_ASYNC at runtime, synchronized via
mmap_write_lock. Enables the VMM workflow: async mode for lightweight
detection, sync mode for race-free eviction.

PAGE_IS_UFFD_DEACTIVATED: PAGEMAP_SCAN category flag for efficient
batch detection of cold (still-deactivated) anonymous pages.

== VMM Workflow ==

  UFFDIO_DEACTIVATE(all)               -- async, no vCPU stalls
  sleep(interval)
  PAGEMAP_SCAN                         -- find cold pages
  UFFDIO_SET_MODE(sync)                -- block faults for eviction
  pwrite + MADV_DONTNEED cold pages    -- safe, faults block
  UFFDIO_SET_MODE(async)               -- resume tracking

The same workflow applies to shmem, with a different PAGEMAP_SCAN mask
(!PAGE_IS_PRESENT instead of PAGE_IS_UFFD_DEACTIVATED).
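
To make the ordering of the workflow steps concrete, here is a toy
user-space model in C. It is only a sketch under stated assumptions, not
the proposed kernel interface: every toy_* name is hypothetical, the
"deactivated" flag stands in for a PROT_NONE (or zapped) PTE, and the
real argument layouts of UFFDIO_DEACTIVATE and UFFDIO_SET_MODE are
deliberately not shown because this letter does not spell them out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel or in the UAPI. */
enum toy_mode { TOY_ASYNC, TOY_SYNC };

struct toy_page {
	bool resident;     /* false once the page has been evicted */
	bool deactivated;  /* stands in for a PROT_NONE or zapped PTE */
};

struct toy_region {
	struct toy_page *pages;
	size_t nr;
	enum toy_mode mode;
};

/* Models UFFDIO_DEACTIVATE over the whole region. */
static void toy_deactivate_all(struct toy_region *r)
{
	for (size_t i = 0; i < r->nr; i++)
		if (r->pages[i].resident)
			r->pages[i].deactivated = true;
}

/*
 * Models a guest access. Returns true if the access completes at once,
 * false if it would block until a userspace handler resolves it.
 */
static bool toy_access(struct toy_region *r, size_t idx)
{
	struct toy_page *p;

	assert(idx < r->nr);
	p = &r->pages[idx];
	if (!p->resident)
		return false;           /* evicted: must be fetched back */
	if (p->deactivated) {
		if (r->mode == TOY_SYNC)
			return false;   /* sync mode: the fault blocks */
		p->deactivated = false; /* async mode: auto-resolved */
	}
	return true;
}

/* Models the PAGEMAP_SCAN step: report pages that are still deactivated. */
static size_t toy_scan_cold(const struct toy_region *r, size_t *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < r->nr && n < max; i++)
		if (r->pages[i].resident && r->pages[i].deactivated)
			out[n++] = i;
	return n;
}

/* Models the eviction window: switch to sync, drop cold pages, switch back. */
static size_t toy_evict_cold(struct toy_region *r)
{
	size_t evicted = 0;

	r->mode = TOY_SYNC;
	for (size_t i = 0; i < r->nr; i++) {
		struct toy_page *p = &r->pages[i];

		if (p->resident && p->deactivated) {
			p->resident = false;    /* pwrite + MADV_DONTNEED */
			p->deactivated = false;
			evicted++;
		}
	}
	r->mode = TOY_ASYNC;
	return evicted;
}
```

In this model a page touched between toy_deactivate_all() and
toy_scan_cold() drops out of the cold set, which is the property the
async phase is meant to provide; only pages that stayed deactivated are
evicted.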

== NUMA Balancing ==

NUMA balancing scanning is skipped on anonymous VM_UFFD_MINOR VMAs to
avoid protnone conflicts. NUMA locality stats are fed from the uffd
fault path via task_numa_fault() so the scheduler retains placement
data. Shmem VMAs are unaffected (UFFDIO_DEACTIVATE zaps PTEs there,
no protnone involved).

== Testing ==

The series includes 6 new selftests covering async/sync modes,
PAGEMAP_SCAN cold detection, GUP through protnone, UFFDIO_SET_MODE
toggling, and cleanup on close. All 73 uffd unit tests pass
(including hugetlb) across defconfig, allnoconfig, allmodconfig,
and randomized configs.

Kiryl Shutsemau (Meta) (12):
userfaultfd: define UAPI constants for anonymous minor faults
userfaultfd: add UFFD_FEATURE_MINOR_ANON registration support
userfaultfd: implement UFFDIO_DEACTIVATE ioctl
userfaultfd: UFFDIO_CONTINUE for anonymous memory
mm: intercept protnone faults on VM_UFFD_MINOR anonymous VMAs
userfaultfd: auto-resolve shmem and hugetlbfs minor faults in async
mode
sched/numa: skip scanning anonymous VM_UFFD_MINOR VMAs
userfaultfd: enable UFFD_FEATURE_MINOR_ANON
mm/pagemap: add PAGE_IS_UFFD_DEACTIVATED to PAGEMAP_SCAN
userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
selftests/mm: add userfaultfd anonymous minor fault tests
Documentation/userfaultfd: document working set tracking
Documentation/admin-guide/mm/userfaultfd.rst | 141 ++++-
fs/proc/task_mmu.c | 11 +-
fs/userfaultfd.c | 184 +++++-
include/linux/huge_mm.h | 6 +
include/linux/mm.h | 2 +
include/linux/sched/numa_balancing.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/trace/events/sched.h | 3 +-
include/uapi/linux/fs.h | 1 +
include/uapi/linux/userfaultfd.h | 40 +-
kernel/sched/fair.c | 13 +
mm/huge_memory.c | 33 +-
mm/hugetlb.c | 3 +-
mm/memory.c | 51 +-
mm/mprotect.c | 9 +-
mm/shmem.c | 3 +-
mm/userfaultfd.c | 164 +++++-
tools/testing/selftests/mm/uffd-unit-tests.c | 458 +++++++++++++++
18 files changed, 1096 insertions(+), 48 deletions(-)
--
2.51.2

On 4/14/26 16:23, Kiryl Shutsemau (Meta) wrote:
> This series adds userfaultfd support for tracking the working set of
> VM guest memory, enabling VMMs to identify cold pages and evict them
> to tiered or remote storage.
>
> == Problem ==
>
> VMMs managing guest memory need to:
> 1. Track which pages are actively used (working set detection)
> 2. Safely evict cold pages to slower storage
> 3. Fetch pages back on demand when accessed again
>
> For shmem-backed guest memory, working set tracking partially works
> today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> re-access auto-resolves from cache. But safe eviction still requires
> synchronous fault interception to prevent data loss races.
>
> For anonymous guest memory (needed for KSM cross-VM deduplication),
> there is no mechanism at all: clearing a PTE loses the page.
>
> == Solution ==
>
> The series introduces a unified userfaultfd interface that works
> across both anonymous and shmem-backed memory:
>
> UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> balancing) to make pages inaccessible without freeing them.

I would rather tackle this from the other direction: it's another form
of protection (like WP), not really a "minor" mode.

Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
and support it for anon+shmem, avoiding the zapping for shmem completely?

-- 
Cheers,

David

On Tue, Apr 14, 2026 at 05:37:50PM +0200, David Hildenbrand (Arm) wrote:
> On 4/14/26 16:23, Kiryl Shutsemau (Meta) wrote:
> > This series adds userfaultfd support for tracking the working set of
> > VM guest memory, enabling VMMs to identify cold pages and evict them
> > to tiered or remote storage.
> >
> > == Problem ==
> >
> > VMMs managing guest memory need to:
> > 1. Track which pages are actively used (working set detection)
> > 2. Safely evict cold pages to slower storage
> > 3. Fetch pages back on demand when accessed again
> >
> > For shmem-backed guest memory, working set tracking partially works
> > today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> > re-access auto-resolves from cache. But safe eviction still requires
> > synchronous fault interception to prevent data loss races.
> >
> > For anonymous guest memory (needed for KSM cross-VM deduplication),
> > there is no mechanism at all: clearing a PTE loses the page.
> >
> > == Solution ==
> >
> > The series introduces a unified userfaultfd interface that works
> > across both anonymous and shmem-backed memory:
> >
> > UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> > private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> > balancing) to make pages inaccessible without freeing them.
>
> I would rather tackle this from the other direction: it's another form
> of protection (like WP), not really a "minor" mode.
>
> Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
> and support it for anon+shmem, avoiding the zapping for shmem completely?

I like this idea.

It should be functionally equivalent, but your interface idea fits
better with the rest.

Thanks! Will give it a try.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

On Tue, Apr 14, 2026 at 06:10:44PM +0100, Kiryl Shutsemau wrote:
> On Tue, Apr 14, 2026 at 05:37:50PM +0200, David Hildenbrand (Arm) wrote:
> > On 4/14/26 16:23, Kiryl Shutsemau (Meta) wrote:
> > > This series adds userfaultfd support for tracking the working set of
> > > VM guest memory, enabling VMMs to identify cold pages and evict them
> > > to tiered or remote storage.
> > >
> > > == Problem ==
> > >
> > > VMMs managing guest memory need to:
> > > 1. Track which pages are actively used (working set detection)
> > > 2. Safely evict cold pages to slower storage
> > > 3. Fetch pages back on demand when accessed again
> > >
> > > For shmem-backed guest memory, working set tracking partially works
> > > today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> > > re-access auto-resolves from cache. But safe eviction still requires
> > > synchronous fault interception to prevent data loss races.
> > >
> > > For anonymous guest memory (needed for KSM cross-VM deduplication),
> > > there is no mechanism at all: clearing a PTE loses the page.
> > >
> > > == Solution ==
> > >
> > > The series introduces a unified userfaultfd interface that works
> > > across both anonymous and shmem-backed memory:
> > >
> > > UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> > > private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> > > balancing) to make pages inaccessible without freeing them.
> >
> > I would rather tackle this from the other direction: it's another form
> > of protection (like WP), not really a "minor" mode.
> >
> > Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
> > and support it for anon+shmem, avoiding the zapping for shmem completely?
>
> I like this idea.
>
> It should be functionally equivalent, but your interface idea fits
> better with the rest.
>
> Thanks! Will give it a try.

Here is an updated version:

https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/log/?h=uffd/rfc-v2

will post after -rc1 is tagged.

I like it more. It got substantially cleaner.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

On 4/16/26 15:49, Kiryl Shutsemau wrote:
> On Tue, Apr 14, 2026 at 06:10:44PM +0100, Kiryl Shutsemau wrote:
>> On Tue, Apr 14, 2026 at 05:37:50PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> I would rather tackle this from the other direction: it's another form
>>> of protection (like WP), not really a "minor" mode.
>>>
>>> Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
>>> and support it for anon+shmem, avoiding the zapping for shmem completely?
>>
>> I like this idea.
>>
>> It should be functionally equivalent, but your interface idea fits
>> better with the rest.
>>
>> Thanks! Will give it a try.
>
> Here is an updated version:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/log/?h=uffd/rfc-v2
>
> will post after -rc1 is tagged.
>
> I like it more. It got substantially cleaner.

I don't have time to look into the details just yet, but my thinking was
that

a) It would avoid the zap+refault

b) We could reuse the uffd-wp PTE bit + marker to indicate/remember the
   protection, making it co-exist with NUMA hinting naturally.

b) obviously means that we cannot use uffd-wp and uffd-rwp at the same
time in the same uffd area. I guess that should be acceptable for the
use cases you have in mind?

But I also haven't taken a closer look at this patch set, whether you
would already be using a PTE bit somehow (I suspect not :) )

-- 
Cheers,

David

On Thu, Apr 16, 2026 at 08:32:19PM +0200, David Hildenbrand (Arm) wrote:
> On 4/16/26 15:49, Kiryl Shutsemau wrote:
> > On Tue, Apr 14, 2026 at 06:10:44PM +0100, Kiryl Shutsemau wrote:
> >> On Tue, Apr 14, 2026 at 05:37:50PM +0200, David Hildenbrand (Arm) wrote:
> >>>
> >>> I would rather tackle this from the other direction: it's another form
> >>> of protection (like WP), not really a "minor" mode.
> >>>
> >>> Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
> >>> and support it for anon+shmem, avoiding the zapping for shmem completely?
> >>
> >> I like this idea.
> >>
> >> It should be functionally equivalent, but your interface idea fits
> >> better with the rest.
> >>
> >> Thanks! Will give it a try.
> >
> > Here is an updated version:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/log/?h=uffd/rfc-v2
> >
> > will post after -rc1 is tagged.
> >
> > I like it more. It got substantially cleaner.
>
> I don't have time to look into the details just yet, but my thinking was
> that
>
> a) It would avoid the zap+refault

Yep.

> b) We could reuse the uffd-wp PTE bit + marker to indicate/remember the
>    protection, making it co-exist with NUMA hinting naturally.
>
> b) obviously means that we cannot use uffd-wp and uffd-rwp at the same
> time in the same uffd area. I guess that should be acceptable for the
> use cases you have in mind?

I took a different path: I still use PROT_NONE PTEs, so it cannot
co-exist with NUMA balancing [fully], but WP + RWP should be fine. I
need to add a test for this.

I didn't give up on NUMA balancing completely. task_numa_fault() is
called on RWP fault. So it should help scheduler decisions somewhat.

I think an RWP user might want to use WP too.

Do you see this trade-off as reasonable?

> But I also haven't taken a closer look at this patch set, whether you
> would already be using a PTE bit somehow (I suspect not :) )

No. I didn't want to allocate a new bit or invent some arch-specific
trick for this. This functionality is available everywhere where
PAGE_NONE exists.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

On 4/16/26 22:25, Kiryl Shutsemau wrote:
> On Thu, Apr 16, 2026 at 08:32:19PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/16/26 15:49, Kiryl Shutsemau wrote:
>>>
>>> Here is an updated version:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/log/?h=uffd/rfc-v2
>>>
>>> will post after -rc1 is tagged.
>>>
>>> I like it more. It got substantially cleaner.
>>
>> I don't have time to look into the details just yet, but my thinking was
>> that
>>
>> a) It would avoid the zap+refault
>
> Yep.
>
>> b) We could reuse the uffd-wp PTE bit + marker to indicate/remember the
>>    protection, making it co-exist with NUMA hinting naturally.
>>
>> b) obviously means that we cannot use uffd-wp and uffd-rwp at the same
>> time in the same uffd area. I guess that should be acceptable for the
>> use cases you have in mind?
>
> I took a different path: I still use PROT_NONE PTEs, so it cannot
> co-exist with NUMA balancing [fully], but WP + RWP should be fine. I
> need to add a test for this.
>
> I didn't give up on NUMA balancing completely. task_numa_fault() is
> called on RWP fault. So it should help scheduler decisions somewhat.
>
> I think an RWP user might want to use WP too.
>
> Do you see this trade-off as reasonable?

One reason why the PTE bit was added for the WP case was to distinguish
it from other write faults.

I assume without a dedicated PTE bit your design will always suffer from
false positive notifications.

Leaving NUMA-balancing aside, a simple
mprotect(PROT_NONE)+mprotect(PROT_READ) would already be problematic to
distinguish both cases.

Zap+refault for shmem would likely have similar problems (we'd need a
marker).

I don't think a design that allows for false positives is what we really
want, especially as it would diverge from what we already have for WP.

Yes, using the PTE bit (that we already have) implies that we could, for
now, not allow the combination of WP + RWP.

-- 
Cheers,

David
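
The false-positive argument in the message above can be illustrated
with a toy model in C. This is a sketch under stated assumptions, not
kernel code: struct toy_pte and all toy_* helpers are hypothetical, and
the only thing modelled is whether a fault on an inaccessible entry
gets reported to the userfaultfd handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page-table entry; the fields are illustrative, not kernel definitions. */
struct toy_pte {
	bool protnone;  /* entry is currently inaccessible */
	bool uffd_bit;  /* dedicated marker, set only by uffd protection */
};

/* An entry made inaccessible by something other than userfaultfd. */
static struct toy_pte toy_pte_other_protnone(void)
{
	struct toy_pte pte = { .protnone = true, .uffd_bit = false };
	return pte;
}

/* An entry made inaccessible by userfaultfd protection. */
static struct toy_pte toy_pte_uffd_protected(void)
{
	struct toy_pte pte = { .protnone = true, .uffd_bit = true };
	return pte;
}

/* Design without a marker: every fault on a protnone entry is reported. */
static bool toy_notify_without_bit(struct toy_pte pte)
{
	return pte.protnone;
}

/* Design with a marker: only entries armed by userfaultfd are reported. */
static bool toy_notify_with_bit(struct toy_pte pte)
{
	/* In this model the marker is only ever set on inaccessible entries. */
	assert(!pte.uffd_bit || pte.protnone);
	return pte.uffd_bit;
}
```

Without the marker both kinds of entry look the same to the fault path,
so the handler is woken for faults it never asked about; with the
marker only the armed entry is reported.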

On Fri, Apr 17, 2026 at 01:43:36PM +0200, David Hildenbrand (Arm) wrote:
> On 4/16/26 22:25, Kiryl Shutsemau wrote:
> > On Thu, Apr 16, 2026 at 08:32:19PM +0200, David Hildenbrand (Arm) wrote:
> >> On 4/16/26 15:49, Kiryl Shutsemau wrote:
> >>>
> >>> Here is an updated version:
> >>>
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/log/?h=uffd/rfc-v2
> >>>
> >>> will post after -rc1 is tagged.
> >>>
> >>> I like it more. It got substantially cleaner.
> >>
> >> I don't have time to look into the details just yet, but my thinking was
> >> that
> >>
> >> a) It would avoid the zap+refault
> >
> > Yep.
> >
> >> b) We could reuse the uffd-wp PTE bit + marker to indicate/remember the
> >>    protection, making it co-exist with NUMA hinting naturally.
> >>
> >> b) obviously means that we cannot use uffd-wp and uffd-rwp at the same
> >> time in the same uffd area. I guess that should be acceptable for the
> >> use cases you have in mind?
> >
> > I took a different path: I still use PROT_NONE PTEs, so it cannot
> > co-exist with NUMA balancing [fully], but WP + RWP should be fine. I
> > need to add a test for this.
> >
> > I didn't give up on NUMA balancing completely. task_numa_fault() is
> > called on RWP fault. So it should help scheduler decisions somewhat.
> >
> > I think an RWP user might want to use WP too.
> >
> > Do you see this trade-off as reasonable?
>
> One reason why the PTE bit was added for the WP case was to distinguish
> it from other write faults.
>
> I assume without a dedicated PTE bit your design will always suffer from
> false positive notifications.
>
> Leaving NUMA-balancing aside, a simple
> mprotect(PROT_NONE)+mprotect(PROT_READ) would already be problematic to
> distinguish both cases.

Hm. I didn't consider this case (miss some uffd lore). Will rework to
reuse existing PTE bit.

Thanks for the feedback!

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

On Fri, Apr 17, 2026 at 01:26:34PM +0100, Kiryl Shutsemau wrote:
> > Leaving NUMA-balancing aside, a simple
> > mprotect(PROT_NONE)+mprotect(PROT_READ) would already be problematic to
> > distinguish both cases.
>
> Hm. I didn't consider this case (miss some uffd lore). Will rework to
> reuse existing PTE bit.

See https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git uffd/rfc-v3

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

On 4/19/26 16:33, Kiryl Shutsemau wrote:
> On Fri, Apr 17, 2026 at 01:26:34PM +0100, Kiryl Shutsemau wrote:
>>> Leaving NUMA-balancing aside, a simple
>>> mprotect(PROT_NONE)+mprotect(PROT_READ) would already be problematic to
>>> distinguish both cases.
>>
>> Hm. I didn't consider this case (miss some uffd lore). Will rework to
>> reuse existing PTE bit.
>
> See https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git uffd/rfc-v3
>

Quick feedback from skimming over it:


1) ARCH_SUPPORTS_PROT_NONE needs some thought, because I am pretty sure all
architectures support something like mprotect(PROT_NONE), and the config
option might be misleading.

So you very likely want to express different semantics here. You want to
know whether pte_protnone()/pmd_protnone() works.

2) The other stuff is really just an extension of existing WP handling.
I suspect we want to have some reasonable cleanups to not end up in
common code with

@@ -1841,7 +1841,7 @@ static void copy_huge_non_present_pmd(
        add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
        mm_inc_nr_ptes(dst_mm);
        pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-       if (!userfaultfd_wp(dst_vma))
+       if (!userfaultfd_wp(dst_vma) && !userfaultfd_rwp(dst_vma))
                pmd = pmd_swp_clear_uffd_wp(pmd);
        set_pmd_at(dst_mm, addr, dst_pmd, pmd);

All the uffd handling should be better isolated (i.e., a single vma check?),
and likely the uffd bit should be abstracted away from being called "wp" to
something more generic.

Maybe it's simply a "uffd" flag whose semantics depend
on the vma flags.

Maybe something like:

@@ -1841,7 +1841,7 @@ static void copy_huge_non_present_pmd(
        add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
        mm_inc_nr_ptes(dst_mm);
        pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
        if (!userfaultfd_uses_pte_bit(dst_vma))
                pmd = pmd_swp_clear_uffd(pmd);
        set_pmd_at(dst_mm, addr, dst_pmd, pmd);

Not sure, needs another thought. But I think there are some decent
cleanups to be had.

3) Some other stuff needs a second thought, like

diff --git a/mm/gup.c b/mm/gup.c
index 8e7dc2c6ee738..08fc18f1290d4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -695,7 +695,8 @@ static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
        /* ... and a write-fault isn't required for other reasons. */
        if (pmd_needs_soft_dirty_wp(vma, pmd))
                return false;
-       return !userfaultfd_huge_pmd_wp(vma, pmd);
+       return !userfaultfd_huge_pmd_wp(vma, pmd) &&
+              !userfaultfd_huge_pmd_rwp(vma, pmd);
 }

How can a pte be writable and prot_none at the same time? Maybe just confused AI
output that you should carefully double check before sending that out officially.

4) How do we want to handle PM_UFFD_WP?

We will be pretty much out of flags soon. Overloading PM_UFFD_WP means that
we will not be able to easily support using a separate bit.

But our internal design will not easily allow that either, and I am not really
sure we want to go down that path any time soon.

Maybe we could document this for now as "In WP VMAs, indicates WP PTEs.
Otherwise, in RWP VMAs, indicates RWP.". Whenever we would allow both at the
same time, we could change the semantics. User space would fail to create one
with both protection types for now either way.

-- 
Cheers,

David

On Tue, Apr 21, 2026 at 03:03:56PM +0200, David Hildenbrand (Arm) wrote:
> On 4/19/26 16:33, Kiryl Shutsemau wrote:
> > On Fri, Apr 17, 2026 at 01:26:34PM +0100, Kiryl Shutsemau wrote:
> >>> Leaving NUMA-balancing aside, a simple
> >>> mprotect(PROT_NONE)+mprotect(PROT_READ) would already be problematic to
> >>> distinguish both cases.
> >>
> >> Hm. I didn't consider this case (miss some uffd lore). Will rework to
> >> reuse existing PTE bit.
> >
> > See https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git uffd/rfc-v3
> >
>
> Quick feedback from skimming over it:
>
>
> 1) ARCH_SUPPORTS_PROT_NONE needs some thought, because I am pretty sure all
> architectures support something like mprotect(PROT_NONE), and the config
> option might be misleading.
>
> So you very likely want to express different semantics here. You want to
> know whether pte_protnone()/pmd_protnone() works.

We do support mprotect(PROT_NONE) everywhere, but we don't always have a
way to distinguish such entries from others without the VMA at hand. Like,
there are other PTEs that don't have the present bit set. In my context and
in the NUMA balancing context we cannot rely on the VMA, because we want to
install PAGE_NONE entries into an accessible VMA.

So we need two things: the pte/pmd_protnone() checks and PAGE_NONE itself.
The first is to test a PTE for PAGE_NONE, the second is for
pte/pmd_modify() to make the entry protnone.

Currently, generic code only uses this functionality for NUMA balancing,
and it is gated by the NUMA balancing config option. So I moved it under a
separate config option.

Do you want it to be named differently?

> 2) The other stuff is really just an extension of existing WP handling.
> I suspect we want to have some reasonable cleanups to not end up in
> common code with
>
> @@ -1841,7 +1841,7 @@ static void copy_huge_non_present_pmd(
>         add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>         mm_inc_nr_ptes(dst_mm);
>         pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> -       if (!userfaultfd_wp(dst_vma))
> +       if (!userfaultfd_wp(dst_vma) && !userfaultfd_rwp(dst_vma))
>                 pmd = pmd_swp_clear_uffd_wp(pmd);
>         set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>
> All the uffd handling should be better isolated (i.e., a single vma check?),
> and likely the uffd bit should be abstracted away from being called "wp" to
> something more generic.
>
> Maybe it's simply a "uffd" flag whose semantics depend
> on the vma flags.
>
> Maybe something like:
>
> @@ -1841,7 +1841,7 @@ static void copy_huge_non_present_pmd(
>         add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>         mm_inc_nr_ptes(dst_mm);
>         pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>         if (!userfaultfd_uses_pte_bit(dst_vma))
>                 pmd = pmd_swp_clear_uffd(pmd);
>         set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>
> Not sure, needs another thought. But I think there are some decent
> cleanups to be had.

That's fair. Maybe userfaultfd_protected() name is better for the VMA
check?

And about UFFD_WP bit name. Maybe we can just drop _WP: _PAGE_UFFD_WP ->
_PAGE_UFFD, pte_uffd_wp() -> pte_uffd()?

But it is a lot of changes. Can I do the bit rename as a follow up
patchset?

> 3) Some other stuff needs a second thought, like
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 8e7dc2c6ee738..08fc18f1290d4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -695,7 +695,8 @@ static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
>         /* ... and a write-fault isn't required for other reasons. */
>         if (pmd_needs_soft_dirty_wp(vma, pmd))
>                 return false;
> -       return !userfaultfd_huge_pmd_wp(vma, pmd);
> +       return !userfaultfd_huge_pmd_wp(vma, pmd) &&
> +              !userfaultfd_huge_pmd_rwp(vma, pmd);
>  }
>
> How can a pte be writable and prot_none at the same time? Maybe just confused AI
> output that you should carefully double check before sending that out officially.

Note that this path is for the !pmd_write() case to begin with. It serves
the FOLL_FORCE case. I believe this check is correct: we don't want to
allow writes to such pages even with FOLL_FORCE.

But looking around, I missed the gup_can_follow_protnone() modification. It
has to return false for RWP.

> 4) How do we want to handle PM_UFFD_WP?
>
> We will be pretty much out of flags soon. Overloading PM_UFFD_WP means that
> we will not be able to easily support using a separate bit.
>
> But our internal design will not easily allow that either, and I am not really
> sure we want to go down that path any time soon.
>
> Maybe we could document this for now as "In WP VMAs, indicates WP PTEs.
> Otherwise, in RWP VMAs, indicates RWP.". Whenever we would allow both at the
> same time, we could change the semantics. User space would fail to create one
> with both protection types for now either way.

Yeah. I am thinking about doing a documentation-only update for PM_UFFD_WP
for now.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

On 4/21/26 16:33, Kiryl Shutsemau wrote:
> On Tue, Apr 21, 2026 at 03:03:56PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/19/26 16:33, Kiryl Shutsemau wrote:
>>>
>>> See https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git uffd/rfc-v3
>>>
>>
>> Quick feedback from skimming over it:
>>
>>
>> 1) ARCH_SUPPORTS_PROT_NONE needs some thought, because I am pretty sure all
>> architectures support something like mprotect(PROT_NONE), and the config
>> option might be misleading.
>>
>> So you very likely want to express different semantics here. You want to
>> know whether pte_protnone()/pmd_protnone() works.
>
> We do support mprotect(PROT_NONE) everywhere, but we don't always have a
> way to distinguish such entries from others without the VMA at hand. Like,
> there are other PTEs that don't have the present bit set. In my context and
> in the NUMA balancing context we cannot rely on the VMA, because we want to
> install PAGE_NONE entries into an accessible VMA.

Exactly. So it's not ARCH_SUPPORTS_PROT_NONE.

>
> So we need two things: the pte/pmd_protnone() checks and PAGE_NONE itself.
> The first is to test a PTE for PAGE_NONE, the second is for
> pte/pmd_modify() to make the entry protnone.
>
> Currently, generic code only uses this functionality for NUMA balancing,
> and it is gated by the NUMA balancing config option. So I moved it under a
> separate config option.
>
> Do you want it to be named differently?

Would ARCH_SUPPORTS_PXX_PROTNONE or sth. like that better describe that
pte_protnone()/pmd_protnone() do what we want?

>
>> 2) The other stuff is really just an extension of existing WP handling.
>> I suspect we want to have some reasonable cleanups to not end up in
>> common code with
>>
>> @@ -1841,7 +1841,7 @@ static void copy_huge_non_present_pmd(
>>         add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>>         mm_inc_nr_ptes(dst_mm);
>>         pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> -       if (!userfaultfd_wp(dst_vma))
>> +       if (!userfaultfd_wp(dst_vma) && !userfaultfd_rwp(dst_vma))
>>                 pmd = pmd_swp_clear_uffd_wp(pmd);
>>         set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>>
>> All the uffd handling should be better isolated (i.e., a single vma check?),
>> and likely the uffd bit should be abstracted away from being called "wp" to
>> something more generic.
>>
>> Maybe it's simply a "uffd" flag whose semantics depend
>> on the vma flags.
>>
>> Maybe something like:
>>
>> @@ -1841,7 +1841,7 @@ static void copy_huge_non_present_pmd(
>>         add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>>         mm_inc_nr_ptes(dst_mm);
>>         pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>>         if (!userfaultfd_uses_pte_bit(dst_vma))
>>                 pmd = pmd_swp_clear_uffd(pmd);
>>         set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>>
>> Not sure, needs another thought. But I think there are some decent
>> cleanups to be had.
>
> That's fair. Maybe userfaultfd_protected() name is better for the VMA
> check?

Yes, something like that could also work.

>
> And about UFFD_WP bit name. Maybe we can just drop _WP: _PAGE_UFFD_WP ->
> _PAGE_UFFD, pte_uffd_wp() -> pte_uffd()?

Yes, I hinted at the above with pmd_swp_clear_uffd().

>
> But it is a lot of changes. Can I do the bit rename as a follow up
> patchset?

Let's get this clean. There is no need to rush that in ;)

I suspect it's a fairly mechanical change.

>
>> 3) Some other stuff needs a second thought, like
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 8e7dc2c6ee738..08fc18f1290d4 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -695,7 +695,8 @@ static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
>>         /* ... and a write-fault isn't required for other reasons. */
>>         if (pmd_needs_soft_dirty_wp(vma, pmd))
>>                 return false;
>> -       return !userfaultfd_huge_pmd_wp(vma, pmd);
>> +       return !userfaultfd_huge_pmd_wp(vma, pmd) &&
>> +              !userfaultfd_huge_pmd_rwp(vma, pmd);
>>  }
>>
>> How can a pte be writable and prot_none at the same time? Maybe just confused AI
>> output that you should carefully double check before sending that out officially.
>
> Note that this path is for the !pmd_write() case to begin with. It serves
> the FOLL_FORCE case. I believe this check is correct: we don't want to
> allow writes to such pages even with FOLL_FORCE.
>
> But looking around, I missed the gup_can_follow_protnone() modification. It
> has to return false for RWP.

Right, read-permission checks come before the write-permission checks.

>
>> 4) How do we want to handle PM_UFFD_WP?
>>
>> We will be pretty much out of flags soon. Overloading PM_UFFD_WP means that
>> we will not be able to easily support using a separate bit.
>>
>> But our internal design will not easily allow that either, and I am not really
>> sure we want to go down that path any time soon.
>>
>> Maybe we could document this for now as "In WP VMAs, indicates WP PTEs.
>> Otherwise, in RWP VMAs, indicates RWP.". Whenever we would allow both at the
>> same time, we could change the semantics. User space would fail to create one
>> with both protection types for now either way.
>
> Yeah. I am thinking about doing a documentation-only update for PM_UFFD_WP
> for now.

Ok, good!

-- 
Cheers,

David
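
The "single VMA check" cleanup discussed in point 2 might look roughly
like the following toy C sketch. Everything here is an assumption made
for illustration: the TOY_VM_* flag values, struct toy_vma and
struct toy_entry are hypothetical, and toy_userfaultfd_protected()
only borrows the name floated in the thread; it is not the code from
any posted revision.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this sketch only. */
#define TOY_VM_UFFD_WP   (1UL << 0)
#define TOY_VM_UFFD_RWP  (1UL << 1)

struct toy_vma {
	unsigned long vm_flags;
};

/* Stand-in for a page-table entry carrying the shared uffd marker bit. */
struct toy_entry {
	bool uffd;
};

/* One predicate that callers use instead of testing WP and RWP separately. */
static bool toy_userfaultfd_protected(const struct toy_vma *vma)
{
	unsigned long both = TOY_VM_UFFD_WP | TOY_VM_UFFD_RWP;

	/* The thread assumes WP and RWP are not combined in one VMA for now. */
	assert((vma->vm_flags & both) != both);
	return (vma->vm_flags & both) != 0;
}

/*
 * Models the copy path quoted in the thread: the marker survives the
 * copy only if the destination VMA uses it at all.
 */
static struct toy_entry toy_copy_entry(struct toy_entry src,
				       const struct toy_vma *dst_vma)
{
	if (!toy_userfaultfd_protected(dst_vma))
		src.uffd = false;
	return src;
}
```

The point of the single predicate is that call sites such as the copy
path no longer need to know which protection type is in use; the
meaning of the shared bit follows from the VMA flags.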
On Wed, Apr 22, 2026 at 08:39:50PM +0200, David Hildenbrand (Arm) wrote:
> On 4/21/26 16:33, Kiryl Shutsemau wrote:
> > On Tue, Apr 21, 2026 at 03:03:56PM +0200, David Hildenbrand (Arm) wrote:
> >> On 4/19/26 16:33, Kiryl Shutsemau wrote:
> >>>
> >>> See https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git uffd/rfc-v3
> >>>

RFCv4 addresses all your feedback plus more :)

https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git uffd/rfc-v4

I still plan to post it after v7.1-rc1 (unless you want it earlier).

The patchset is in pretty good shape in my eyes, and I will probably drop
the RFC tag.

--
Kiryl Shutsemau / Kirill A. Shutemov
Hello, Kiryl,

On Thu, Apr 23, 2026 at 03:27:11PM +0100, Kiryl Shutsemau wrote:
> The patchet is pretty good shape in my eyes and will probably drop RFC
> tag.

I still have some high-level questions that have not been answered yet. Do
you want to answer them?

https://lore.kernel.org/all/ad59TxAHNwFWH7Cc@x1.local/

In summary, it's about:

- Whether we have explored other approaches on page hotness tracking
- Whether read protection is required for an userspace swap system
  (e.g. did you get time to have a look at umap?)

Thanks,

--
Peter Xu
On Thu, Apr 23, 2026 at 10:50:06AM -0400, Peter Xu wrote:
> Hello, Kiryl,
>
> On Thu, Apr 23, 2026 at 03:27:11PM +0100, Kiryl Shutsemau wrote:
> > The patchet is pretty good shape in my eyes and will probably drop RFC
> > tag.
>
> I still have some high level questions not yet got answered. Do you want
> to answer them?
>
> https://lore.kernel.org/all/ad59TxAHNwFWH7Cc@x1.local/

Sorry, my reply to this got lost in my TODO list.

> In summary, it's about:
>
> - Whether we have explored other approaches on page hotness tracking

So, for read/write tracking we have clear_refs=1, page_idle and DAMON.
Did I miss something?

clear_refs is a process-wide hammer. And you can miss a hot page if it
races with LRU rotation.

page_idle needs rmap. It will not scale.

DAMON is built around sampling. It is good for working set estimation,
but I don't think it is directly useful for eviction decisions. It can
miss hot pages. LRU rotation will also lose info.

None of them gives comparable capabilities.

We also need a mechanism to atomically evict pages.

> - Whether read protection is required for an userspace swap system
> (e.g. did you get time to have a look at umap?)

I looked at it briefly, so I may be missing details.

IIUC, in the absence of read tracking it doesn't collect hotness information
at all. The eviction is based on fault-in time: the oldest faulted-in
page gets evicted first. I guess it is fine if you don't care much about
refault cost. Like, if your workload fits into memory completely and
refaults are rare.

That's not my case.

--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Apr 23, 2026 at 07:08:00PM +0100, Kiryl Shutsemau wrote: > On Thu, Apr 23, 2026 at 10:50:06AM -0400, Peter Xu wrote: > > Hello, Kiryl, > > > > On Thu, Apr 23, 2026 at 03:27:11PM +0100, Kiryl Shutsemau wrote: > > > The patchet is pretty good shape in my eyes and will probably drop RFC > > > tag. > > > > I still have some high level questions not yet got answered. Do you want > > to answer them? > > > > https://lore.kernel.org/all/ad59TxAHNwFWH7Cc@x1.local/ > > Sorry, reply to this got lost in my TODO list. No worries. > > > In summary, it's about: > > > > - Whether we have explored other approaches on page hotness tracking > > So, for read/write tracking we have clear_refs=1, page_idle and DAMON. > Did I miss something? > > clear_refs is process-wide hammer. And you can miss a hot page if it > races with LRU rotation. > > page_idle needs rmap. It will not scale. Yes. If you would benefit from a per-mm page_idle, then it may apply to us too if we will be enforced to implement full-userspace swap in QEMU. That's also why I suggested (in my previous reply) that we split the requirement: one is for hotness tracking, the other is about read-inclusive trapping (v.s. wr-protect only traps). > > DAMON is built around sampling. It is good for working set estimation, > but I don't think it is directly useful for eviction decision. It can > miss hot pages. LRU rotation will also loose info. Exactly. If we need to collect ACCESS bit (or anything similar) for eviction accuracy pusrpose, IIUC we need per-page info, we can't estimate by sampling. > > None of them gives comparable capabilities. I want to see if some of your work can be generalized so we can use too, and we can also work together. > > We also need a mechanism to atomically evict pages. Yes, this is the 2nd question below, and btw uffd-wp can also achieve this. > > > - Whether read protection is required for an userspace swap system > > (e.g. did you get time to have a look at umap?) 
> > I looked at it briefly, so I can miss details. > > IIUC, in absence of read tracking it doesn't collect hotness information > at all. The eviction is based on fault-in time: the oldest faulted-in For example, let's imagine if we can have a per-mm idle page tracker, would it work for you to collect hotness info? The other idea is, no matter whether we use MGLRU or legacy LRU, if we can expose a better interface to share hotness info from kernel to userspace, would it be possible? > page gets evicted first. I guess it is fine if you don't care much about > refault cost. Like, if your workload fits into memory completely and > refaults are rare. One thing to mention is, if we have any hotness tracking facility ready above (e.g. per-mm idle page tracking) we _will_ trap read faults too; it's just that it'll be much faster (when it's hardware ACCESS bit). So if I'm not wrong, what I am trying to discuss as a full userspace swap system will always trap read too for most of the cases. The difference is only about that 5ms (in case of 30s+5ms example I gave in the other email). Your RW protection will also trap that 5ms, what I described won't: when a decision is made, we wr-protect the page, any read on top of it will still go through so it will trigger a refault. My point is, that 5ms missing over 30s (in reality maybe more than 30s) sampling window (which covered read accesses) isn't a major issue, and IMHO it's not a strong enough reason to include the whole RW feature. The other thing is, as I mentioned in the other email, I still don't know how the current RW protection would work for anonymous. I don't yet think the user swapper can read the anon page with RW-protected pgtables. So far my understanding is maybe you only care about shmem so it's fine, but it'll always be great to confirm with you. Thanks, > > That's not my case. > > -- > Kiryl Shutsemau / Kirill A. Shutemov > -- Peter Xu
On Thu, Apr 23, 2026 at 02:57:34PM -0400, Peter Xu wrote:
> On Thu, Apr 23, 2026 at 07:08:00PM +0100, Kiryl Shutsemau wrote:
> > > - Whether read protection is required for an userspace swap system
> > > (e.g. did you get time to have a look at umap?)
> >
> > I looked at it briefly, so I can miss details.
> >
> > IIUC, in absence of read tracking it doesn't collect hotness information
> > at all. The eviction is based on fault-in time: the oldest faulted-in
>
> For example, let's imagine if we can have a per-mm idle page tracker, would
> it work for you to collect hotness info?
>
> The other idea is, no matter whether we use MGLRU or legacy LRU, if we can
> expose a better interface to share hotness info from kernel to userspace,
> would it be possible?

I don't see how either fits our problem.

Both page_idle and the LRUs (legacy or MGLRU) track accesses on physical
memory. We need visibility in the virtual address space domain.

We don't care which physical page backs a given guest address at any
moment. We want to know which piece of the user's dataset is cold, and
the answer has to be indifferent to kernel actions underneath: the
tracking must survive migration and swap-out. RWP gives us that: the
uffd-wp bit is preserved across swap PTEs and migration entries, so the
"this VA was declared cold" marker stays attached to the VA. A
physical-side tracker loses its state the moment the folio is freed or
replaced: a refaulted folio is a fresh object with no history.

Scaling goes the same way. Per-mm tracking of the kind RWP does can
scale with the working set. A physical-side tracker scales with all folios
on the LRU/memcg, then needs an rmap walk per folio to map back to a
VA, which is exactly the reason page_idle doesn't scale for this use
case today.

There is also a cgroup-level confound: memcg hotness mixes guest memory
with the VMM's own (worker threads, I/O buffers, vhost-user rings).
VMA-scoped tracking is the natural unit regardless of the migration
story.

--
Kiryl Shutsemau / Kirill A. Shutemov
On Fri, Apr 24, 2026 at 11:34:48AM +0100, Kiryl Shutsemau wrote:
> On Thu, Apr 23, 2026 at 02:57:34PM -0400, Peter Xu wrote:
> > On Thu, Apr 23, 2026 at 07:08:00PM +0100, Kiryl Shutsemau wrote:
> > > > - Whether read protection is required for an userspace swap system
> > > > (e.g. did you get time to have a look at umap?)
> > >
> > > I looked at it briefly, so I can miss details.
> > >
> > > IIUC, in absence of read tracking it doesn't collect hotness information
> > > at all. The eviction is based on fault-in time: the oldest faulted-in
> >
> > For example, let's imagine if we can have a per-mm idle page tracker, would
> > it work for you to collect hotness info?
> >
> > The other idea is, no matter whether we use MGLRU or legacy LRU, if we can
> > expose a better interface to share hotness info from kernel to userspace,
> > would it be possible?
>
> I don't see how either fits our problem.
>
> Both page_idle and the LRUs (legacy or MGLRU) track accesses on physical
> memory. We need visibility in the virtual address space domain.
Yes they are, but the ACCESS bit isn't. The ACCESS bit is only about the
virtual mapping or any similar mapping (like EPT's access bit).
What I described with per-mm tracking (whether we call it per-mm idle page
tracking or use some other interface) is about relying on the ACCESS bit,
not on pgtable changes using RWP. IMHO it's more efficient and it will also
achieve your goal of VA tracking.
In your case (and also ours), if you're looking at VMMs running virtual
machines, I think you need both the pgtable's ACCESS bit and the EPT-like
ACCESS bit. Here what's redundant is rmap, not ACCESS bit tracking. When both
the MMU and the secondary MMU support hardware access tracking, AFAIU it's
faster than RWP.
>
> We don't care which physical page backs a given guest address at any
> moment. We want to know which piece of the user's dataset is cold, and
> the answer has to be indifferent to kernel actions underneath: the
> tracking must survive migration and swap-out. RWP gives us that — the
This is exactly what we hit... that's the reason why I was trying to
propose a new API to read directly from swap (swap_access) or similar.
Btw, from another perspective, I believe we could also persist ACCESS bit
across migration or swap out.
For migration, see e.g. remove_migration_pte(), which has:
	if (!softleaf_is_migration_young(entry))
		pte = pte_mkold(pte);
For swap, it's different. Normally, if a user app manages page
hotness, it records the hotness in userspace with whatever
algorithm it wants. That record then survives host swap, because
the hotness is per-VA. It should be deduced from whatever hotness tracking
system the app previously used to sample (which can still be idle page
tracking, even if not efficient enough; when the VM page isn't mapped
anywhere else, rmap is pure overhead, but it doesn't introduce false positives).
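A minimal sketch of such a per-VA hotness record kept in userspace, assuming a simple "idle age" policy. All names here (struct hotness_table, hotness_update(), hotness_pick_cold()) are hypothetical and only illustrate that the record is indexed by virtual page, so nothing the host does to the backing folio can invalidate it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-VA hotness record kept by the VMM, one slot per guest page. */
struct hotness_table {
	uint8_t *idle_age;	/* scan intervals since the page was last seen accessed */
	size_t nr_pages;
};

/*
 * Fold one scan result into the table. accessed[i] is whatever the sampling
 * backend reported for page i in this interval; the table does not care
 * which mechanism produced the signal.
 */
static void hotness_update(struct hotness_table *t, const bool *accessed)
{
	assert(t && t->idle_age && accessed);

	for (size_t i = 0; i < t->nr_pages; i++) {
		if (accessed[i])
			t->idle_age[i] = 0;
		else if (t->idle_age[i] < UINT8_MAX)
			t->idle_age[i]++;	/* saturate instead of wrapping */
	}
}

/* Collect up to max page indexes that stayed idle for at least min_age intervals. */
static size_t hotness_pick_cold(const struct hotness_table *t, uint8_t min_age,
				size_t *out, size_t max)
{
	size_t n = 0;

	assert(t && t->idle_age && out);

	for (size_t i = 0; i < t->nr_pages && n < max; i++)
		if (t->idle_age[i] >= min_age)
			out[n++] = i;
	return n;
}
```

The accuracy of the table is only as good as the per-interval signal fed into hotness_update(), which is the part the rest of this thread is arguing about.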
> uffd-wp bit is preserved across swap PTEs and migration entries, so the
> "this VA was declared cold" marker stays attached to the VA. A
> physical-side tracker loses its state the moment the folio is freed or
> replaced: a refaulted folio is a fresh object with no history.
>
> Scaling goes the same way. Per-mm tracking of the form RWP does can
> scale with the working set. A physical-side tracker scales with all folios
> on the LRU/memcg, then needs an rmap walk per folio to map back to a
> VA — which is exactly the reason page_idle doesn't scale for this use
> case today.
>
> There is also a cgroup-level confound: memcg hotness mixes guest memory
> with the VMM's own (worker threads, I/O buffers, vhost-user rings).
> VMA-scoped tracking is the natural unit regardless of the migration
> story.
This kind of further proves you're using shmem and you have separate
mappings.
Again, with per-mm idle page tracking these issues should all be
gone. That per-mm idle page tracking needs to:
- Ignore rmap so it's VA based
- Still consider secondary MMUs, hence the mmu young notifier needs to be present
- Work based on the ACCESS bit (to leverage hardware tracking acceleration),
rather than relying on a kernel fault to set the access mark, which
should be more efficient.
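The three requirements above can be modeled in a few lines of userspace C. This is only a toy sketch: struct toy_mm, toy_secondary_clear_young() and toy_scan_range() are hypothetical names, and the boolean arrays stand in for the A-bits of the host page table and of a secondary MMU such as EPT. It is not a proposal for the actual kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/* Toy address space: one young flag per virtual page for each MMU. */
struct toy_mm {
	bool pte_young[NR_PAGES];	/* A-bit in the host page table */
	bool spte_young[NR_PAGES];	/* A-bit in the secondary MMU (e.g. EPT) */
};

/* Stand-in for a notifier-style "test and clear young" on the secondary MMU. */
static bool toy_secondary_clear_young(struct toy_mm *mm, size_t idx)
{
	bool young = mm->spte_young[idx];

	mm->spte_young[idx] = false;
	return young;
}

/*
 * Walk [start, end) by virtual page index, with no rmap involved, and record
 * which pages were accessed through either MMU since the previous scan.
 * Returns the number of accessed pages.
 */
static size_t toy_scan_range(struct toy_mm *mm, size_t start, size_t end,
			     bool *accessed)
{
	size_t hot = 0;

	assert(mm && accessed && start <= end && end <= NR_PAGES);

	for (size_t i = start; i < end; i++) {
		bool young = mm->pte_young[i];

		mm->pte_young[i] = false;
		if (toy_secondary_clear_young(mm, i))
			young = true;
		accessed[i] = young;
		if (young)
			hot++;
	}
	return hot;
}
```

The model deliberately has a single consumer of the young flags. It does not show what happens when reclaim clears the same bits concurrently.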
The other thing is, could you please still answer why RWP is required for
swap impl in general? It's not yet mentioned in the reply.
Personally I really feel like we're looking at very similar problems. That
is great news to me, because if you can convince me of the new API, it
means our use case will likely adopt the approach too, and vice versa.
It would be great to share the new interface no matter what it is, instead
of trying to push different ones.
Thanks,
--
Peter Xu
On Fri, Apr 24, 2026 at 07:51:44AM -0400, Peter Xu wrote:
> On Fri, Apr 24, 2026 at 11:34:48AM +0100, Kiryl Shutsemau wrote:
> > Both page_idle and the LRUs (legacy or MGLRU) track accesses on physical
> > memory. We need visibility in the virtual address space domain.
>
> Yes they are, but ACCESS bit isn't.

A-bit is not a reliable signal for userspace working-set tracking
because the kernel itself is a concurrent consumer. It is exactly why
page_idle needs PG_young on top of the A-bit: PG_young is the "kernel
ate the A-bit but the page was actually touched" escape hatch. And
bringing PG_young into the picture puts us right back into physical-side
tracking.

> For migration, see e.g. remove_migration_pte() has:
>
> 	if (!softleaf_is_migration_young(entry))
> 		pte = pte_mkold(pte);

remove_migration_pte() only propagates young-at-unmap. It does not
cover the common case: A-bit cleared by reclaim before migration
started. The concurrent-consumer problem is what breaks the signal,
not the migration boundary.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 4/24/26 15:49, Kiryl Shutsemau wrote:
> On Fri, Apr 24, 2026 at 07:51:44AM -0400, Peter Xu wrote:
>> On Fri, Apr 24, 2026 at 11:34:48AM +0100, Kiryl Shutsemau wrote:
>>> Both page_idle and the LRUs (legacy or MGLRU) track accesses on physical
>>> memory. We need visibility in the virtual address space domain.
>>
>> Yes they are, but ACCESS bit isn't.
>
> A-bit is not a reliable signal for userspace working-set tracking
> because the kernel itself is a concurrent consumer.

Right, I don't think we want to rely on the A bit, just like we don't want
to rely on the Dirty bit in other code (and even the SoftDirty bit is a
flawed concept).

I do see some value in a reliable RWP mechanism based on uffd. The real
question is how much benefit it would bring (which other use cases could
benefit from it).

--
Cheers,
David
On Sat, Apr 25, 2026 at 08:05:16AM +0200, David Hildenbrand (Arm) wrote:
> On 4/24/26 15:49, Kiryl Shutsemau wrote:
> > On Fri, Apr 24, 2026 at 07:51:44AM -0400, Peter Xu wrote:
> >> On Fri, Apr 24, 2026 at 11:34:48AM +0100, Kiryl Shutsemau wrote:
> >>> Both page_idle and the LRUs (legacy or MGLRU) track accesses on physical
> >>> memory. We need visibility in the virtual address space domain.
> >>
> >> Yes they are, but ACCESS bit isn't.
> >
> > A-bit is not a reliable signal for userspace working-set tracking
> > because the kernel itself is a concurrent consumer.
> Right, I don't think we want to rely on either the A bit just like we don't want
> to rely on the Dirty bit in other code. (and even SoftDirty bit is a flawed concept)
>
> I do see some value in a reliable RWP mechanism based on uffd. The real question
> is, how much benefit it would bring (which other use cases could benefit from it).

I have not put much thought into use cases beyond VMs, but I think other
mmap-heavy workloads (databases, GC'd runtimes, etc.) could make use of it
in a similar way: working-set tracking and userspace-driven memory tiering
based on the working-set data.

--
Kiryl Shutsemau / Kirill A. Shutemov
On Fri, Apr 24, 2026 at 02:49:58PM +0100, Kiryl Shutsemau wrote: > On Fri, Apr 24, 2026 at 07:51:44AM -0400, Peter Xu wrote: > > On Fri, Apr 24, 2026 at 11:34:48AM +0100, Kiryl Shutsemau wrote: > > > Both page_idle and the LRUs (legacy or MGLRU) track accesses on physical > > > memory. We need visibility in the virtual address space domain. > > > > Yes they are, but ACCESS bit isn't. > > A-bit is not a reliable signal for userspace working-set tracking > because the kernel itself is a concurrent consumer. It is exactly why > page_idle needs PG_young on top of the A-bit: PG_young is the "kernel I assume you meant PG_idle. I actually don't know whether PG_young is still actively used anywhere in the current code base. > ate the A-bit but the page was actually touched" escape hatch. And > bringing PG_young into the picture puts us right back into physical-side > tracking. > > > For migration, see e.g. remove_migration_pte() has: > > > > if (!softleaf_is_migration_young(entry)) > > pte = pte_mkold(pte); > > remove_migration_pte() only propagates young-at-unmap. It does not > cover the common case: A-bit cleared by reclaim before migration > started. The concurrent-consumer problem is what breaks the signal, > not the migration boundary. IMHO it's a separate problem, and AFAIU it was well solved at least with old LRUs with PG_idle. It's just slightly unfortunate it doesn't yet work with MGLRU. Also, when the extra bit is in folio->flags, it only works if both the consumers are reporting per-folio, not per-mm. I'm actually curious whether there're numbers or solid proof showing that in your case the per-folio perf is too bad already to justify a new per-mm API, like RWP. It's because currently this proposal is still so far very much about "let's implement a swap system". It really doesn't yet have a lot to prove on hotness tracking POV. 
Not asking for a time-consuming test immediately, but IMHO these should really be solid clues to first justify the overhead with current rmap in production. For us, we know the overhead in theory, but we never really measured how much. Even if so, I don't think it's unsolvable. I want to explore if there's something that can still be generic and work for per-mm tracking. I believe if we can have some bit in the ptes, then when mm reclaim code walks clearing ACCESS bit and sees some vma is being tracked, then instead of setting PG_idle, it can just move the access bit over to that special pte bit, and only to this vma this pte. IIUC that'll benefit from both worlds: fast HW-accelerated access bit, and no minor faults. Would something like that worth exploring? -- Peter Xu
On Fri, Apr 24, 2026 at 11:55:39AM -0400, Peter Xu wrote:
> On Fri, Apr 24, 2026 at 02:49:58PM +0100, Kiryl Shutsemau wrote:
> > On Fri, Apr 24, 2026 at 07:51:44AM -0400, Peter Xu wrote:
> > > On Fri, Apr 24, 2026 at 11:34:48AM +0100, Kiryl Shutsemau wrote:
> > > > Both page_idle and the LRUs (legacy or MGLRU) track accesses on physical
> > > > memory. We need visibility in the virtual address space domain.
> > >
> > > Yes they are, but ACCESS bit isn't.
> >
> > A-bit is not a reliable signal for userspace working-set tracking
> > because the kernel itself is a concurrent consumer. It is exactly why
> > page_idle needs PG_young on top of the A-bit: PG_young is the "kernel
>
> I assume you meant PG_idle. I actually don't know whether PG_young is
> still actively used anywhere in the current code base.
>
> > ate the A-bit but the page was actually touched" escape hatch. And
> > bringing PG_young into the picture puts us right back into physical-side
> > tracking.
> >
> > > For migration, see e.g. remove_migration_pte() has:
> > >
> > > if (!softleaf_is_migration_young(entry))
> > > pte = pte_mkold(pte);
> >
> > remove_migration_pte() only propagates young-at-unmap. It does not
> > cover the common case: A-bit cleared by reclaim before migration
> > started. The concurrent-consumer problem is what breaks the signal,
> > not the migration boundary.
>
> IMHO it's a separate problem, and AFAIU it was well solved at least with
> old LRUs with PG_idle. It's just slightly unfortunate it doesn't yet work
> with MGLRU. Also, when the extra bit is in folio->flags, it only works if
> both the consumers are reporting per-folio, not per-mm.
>
> I'm actually curious whether there're numbers or solid proof showing that
> in your case the per-folio perf is too bad already to justify a new per-mm
> API, like RWP.
Fair ask, and I don't have numbers I can point to right now. But I'd
flag that the case for RWP doesn't rest only on cost:
- LRU-agnostic. Per-folio approaches are bound to the current
reclaim backend (legacy, MGLRU, whatever is next);
- Race-free against reclaim's A-bit consumption;
- Deterministic preservation across swap and migration.
Numbers would strengthen the cost story but they don't change those
structural points.
> I want to explore if there's something that can still be generic and work
> for per-mm tracking. I believe if we can have some bit in the ptes, then
> when mm reclaim code walks clearing ACCESS bit and sees some vma is being
> tracked, then instead of setting PG_idle, it can just move the access bit
> over to that special pte bit, and only to this vma this pte. IIUC that'll
> benefit from both worlds: fast HW-accelerated access bit, and no minor
> faults.
>
> Would something like that worth exploring?
This could be interesting. But a spare pte bit is a big ask.
And when you start tracking, you need to clear the A-bit. Where do you move
it? Activate the folio?
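To make the quoted idea concrete, here is a toy userspace model of reclaim parking the A-bit in a spare PTE bit for tracked VMAs. Everything in it is hypothetical: TOY_PTE_TRACK_YOUNG is an invented software bit, and the helpers are not kernel functions. The tracker side simply clears both bits, which hides the access from reclaim and is exactly the open question raised above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PTE_YOUNG		(1u << 0)	/* hardware A-bit */
#define TOY_PTE_TRACK_YOUNG	(1u << 1)	/* hypothetical spare software bit */

typedef uint32_t toy_pte_t;

/* A CPU access sets the hardware A-bit. */
static void toy_touch(toy_pte_t *pte)
{
	assert(pte);
	*pte |= TOY_PTE_YOUNG;
}

/*
 * Reclaim consumes the A-bit. For a tracked VMA it parks the information in
 * the spare bit instead of dropping it. Returns whether the PTE was young.
 */
static bool toy_reclaim_clear_young(toy_pte_t *pte, bool vma_tracked)
{
	bool young;

	assert(pte);
	young = (*pte & TOY_PTE_YOUNG) != 0;
	if (young) {
		*pte &= ~TOY_PTE_YOUNG;
		if (vma_tracked)
			*pte |= TOY_PTE_TRACK_YOUNG;
	}
	return young;
}

/*
 * The userspace-facing tracker sees an access if either bit is set, then
 * clears both. Clearing TOY_PTE_YOUNG here takes the signal away from
 * reclaim; the model does not try to resolve that.
 */
static bool toy_tracker_test_and_clear(toy_pte_t *pte)
{
	bool young;

	assert(pte);
	young = (*pte & (TOY_PTE_YOUNG | TOY_PTE_TRACK_YOUNG)) != 0;
	*pte &= ~(TOY_PTE_YOUNG | TOY_PTE_TRACK_YOUNG);
	return young;
}
```

In this model an access to a tracked page that reclaim has already scanned is still visible to the tracker, while the same access to an untracked page is lost.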
--
Kiryl Shutsemau / Kirill A. Shutemov
On Fri, Apr 24, 2026 at 11:55:39AM -0400, Peter Xu wrote:
> For us, we know the overhead in theory, but we never really measured how
> much.
I think I didn't discuss the other side of things, where the hypervisor can
touch some pages and mark them accessed, even if you don't want it to.
IMHO we should track most such accesses, except for special cases.
One example is an access caused by emulated DMA: it should be counted
alongside guest accesses. That's where you mentioned vhost, and I believe
whatever pages vhost touched should be marked hot even if the guest didn't
touch them.
The only special case I can think of is really migration.
While at it, I also want to know whether you could benefit from something
like swap_access() / maccess(), which I can propose when it's ready. It
means we could shift some corner-case accesses from the hypervisor to that
API, so they won't cause false positives even if rmap is involved. Then rmap
would be a pure perf issue, if it is still an issue at all.
I can also share at least my plan on our side, in case it helps to find
shared goals. Basically we at least have two ways to go:
Plan A: propose maccess() and use it in QEMU for migration purposes.
        Migration is so far the only thing we want to rule out. We want
        DMA accesses to keep promoting pages, for example. This keeps
        relying on the kernel to do hotness tracking and eviction. IMHO
        it is the best and simplest solution so far, taking both user and
        kernel into account.
Plan B: if the above doesn't work, we can implement swap in QEMU. We need
        some API like what you're looking for, except that the major
        challenge for us is MGLRU compatibility.
So IIUC if you're looking for completely flexible swap backends, then Plan
A won't work for you: it will always stick with the kernel's swapfile
management, which is not always flexible enough. Then I want to see if we
can share a goal in plan B. But if we can find some shared goal in plan A /
maccess(), I'll also be more than willing to hear it. For that part I
can definitely see what I can do.
Thanks,
--
Peter Xu
On Thu, 23 Apr 2026 14:57:34 -0400 Peter Xu <peterx@redhat.com> wrote: > On Thu, Apr 23, 2026 at 07:08:00PM +0100, Kiryl Shutsemau wrote: > > On Thu, Apr 23, 2026 at 10:50:06AM -0400, Peter Xu wrote: [...] > > > - Whether we have explored other approaches on page hotness tracking [...] > > DAMON is built around sampling. It is good for working set estimation, > > but I don't think it is directly useful for eviction decision. It can > > miss hot pages. LRU rotation will also loose info. > > Exactly. If we need to collect ACCESS bit (or anything similar) for > eviction accuracy pusrpose, IIUC we need per-page info, we can't estimate > by sampling. That's a fair argument. Nonetheless, there are some companies who use DAMON [1] for a similar eviction purpose on their products. Also, page level accuracy issue was indeed concerns from many people. DAMON therefore provides page level DAMOS filter [2]. The idea is finding a large region of cold pages in low overhead first, then do page level access recheck on page of the region using the filter, just before doing the eviction. DAMON-based memory tiering also uses it [3], to avoid wrongly promoting/demoting cold/hot pages in DAMON-claimed hot/cold regions. The evaluation result was not very bad, and a few more users reported positive test results. Also, DAMON can be used for page level monitoring [5] and open to changes for users. Actually a work [6] for making DAMON-based page level monitoring more lightweight is ongoing. I understand no one fits all and the decision is up to each user :) Nevertheless, I will be happy to help if you have any question or request for DAMON. 
[1] https://cdn.amazon.science/ee/a4/41ff11374f2f865e5e24de11bd17/resource-management-in-aurora-serverless.pdf [2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#filters [3] https://github.com/damonitor/damo/blob/next/scripts/mem_tier.sh#L40 [4] https://www.phoronix.com/news/DAMON-Self-Tuned-Memory-Tiering [5] https://origin.kernel.org/doc/html/latest/mm/damon/faq.html#can-i-simply-monitor-page-granularity [6] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com Thanks, SJ [...]
On Thu, Apr 23, 2026 at 05:26:24PM -0700, SeongJae Park wrote: > On Thu, 23 Apr 2026 14:57:34 -0400 Peter Xu <peterx@redhat.com> wrote: > > > On Thu, Apr 23, 2026 at 07:08:00PM +0100, Kiryl Shutsemau wrote: > > > On Thu, Apr 23, 2026 at 10:50:06AM -0400, Peter Xu wrote: > [...] > > > > - Whether we have explored other approaches on page hotness tracking > [...] > > > DAMON is built around sampling. It is good for working set estimation, > > > but I don't think it is directly useful for eviction decision. It can > > > miss hot pages. LRU rotation will also loose info. > > > > Exactly. If we need to collect ACCESS bit (or anything similar) for > > eviction accuracy pusrpose, IIUC we need per-page info, we can't estimate > > by sampling. > > That's a fair argument. > > Nonetheless, there are some companies who use DAMON [1] for a similar eviction > purpose on their products. > > Also, page level accuracy issue was indeed concerns from many people. DAMON > therefore provides page level DAMOS filter [2]. The idea is finding a large > region of cold pages in low overhead first, then do page level access recheck > on page of the region using the filter, just before doing the eviction. > > DAMON-based memory tiering also uses it [3], to avoid wrongly > promoting/demoting cold/hot pages in DAMON-claimed hot/cold regions. The > evaluation result was not very bad, and a few more users reported positive test > results. > > Also, DAMON can be used for page level monitoring [5] and open to changes for > users. Actually a work [6] for making DAMON-based page level monitoring more > lightweight is ongoing. Good to know that, thanks for the info, SJ. I'll add a note and try to explore all these at some point. I recall I read a paper describing damon tracking overheads when granularity is small and when the memory scope is large (in VM's case, it can be e.g. 1TB or more). Would there be quick answer on whether this one still suffers (or maybe it was never a problem)? 
> > I understand no one fits all and the decision is up to each user :) > Nevertheless, I will be happy to help if you have any question or request for > DAMON. I'll definitely ask after digging more into that, thanks for the offer! > > [1] https://cdn.amazon.science/ee/a4/41ff11374f2f865e5e24de11bd17/resource-management-in-aurora-serverless.pdf > [2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#filters > [3] https://github.com/damonitor/damo/blob/next/scripts/mem_tier.sh#L40 > [4] https://www.phoronix.com/news/DAMON-Self-Tuned-Memory-Tiering > [5] https://origin.kernel.org/doc/html/latest/mm/damon/faq.html#can-i-simply-monitor-page-granularity > [6] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com > > > Thanks, > SJ > > [...] > -- Peter Xu
On Fri, 24 Apr 2026 07:55:11 -0400 Peter Xu <peterx@redhat.com> wrote: > On Thu, Apr 23, 2026 at 05:26:24PM -0700, SeongJae Park wrote: > > On Thu, 23 Apr 2026 14:57:34 -0400 Peter Xu <peterx@redhat.com> wrote: > > > > > On Thu, Apr 23, 2026 at 07:08:00PM +0100, Kiryl Shutsemau wrote: > > > > On Thu, Apr 23, 2026 at 10:50:06AM -0400, Peter Xu wrote: > > [...] > > > > > - Whether we have explored other approaches on page hotness tracking > > [...] > > > > DAMON is built around sampling. It is good for working set estimation, > > > > but I don't think it is directly useful for eviction decision. It can > > > > miss hot pages. LRU rotation will also loose info. > > > > > > Exactly. If we need to collect ACCESS bit (or anything similar) for > > > eviction accuracy pusrpose, IIUC we need per-page info, we can't estimate > > > by sampling. > > > > That's a fair argument. > > > > Nonetheless, there are some companies who use DAMON [1] for a similar eviction > > purpose on their products. > > > > Also, page level accuracy issue was indeed concerns from many people. DAMON > > therefore provides page level DAMOS filter [2]. The idea is finding a large > > region of cold pages in low overhead first, then do page level access recheck > > on page of the region using the filter, just before doing the eviction. > > > > DAMON-based memory tiering also uses it [3], to avoid wrongly > > promoting/demoting cold/hot pages in DAMON-claimed hot/cold regions. The > > evaluation result was not very bad, and a few more users reported positive test > > results. > > > > Also, DAMON can be used for page level monitoring [5] and open to changes for > > users. Actually a work [6] for making DAMON-based page level monitoring more > > lightweight is ongoing. > > Good to know that, thanks for the info, SJ. I'll add a note and try to > explore all these at some point. 
> > I recall I read a paper describing damon tracking overheads when > granularity is small and when the memory scope is large (in VM's case, it > can be e.g. 1TB or more). Would there be quick answer on whether this one > still suffers (or maybe it was never a problem)? I think that should still be same. In case of fixed granularity monitoring, the overhead is inherently proportional to the memory size. And we didn't make many effort on making the overhead lower. We have two ongoing works [1,2] for that, though. Nonetheless, whether the overhead is too high or not would depend on the use case, I'd say. That is, if the system has hundreds of CPUs, letting DAMON occupying one CPU might be no real problem. Rather, there were users who willing to give more than one CPUs to DAMON if DAMON can provide more accurate monitoring results or work faster. That kind of scaling is possible, by using multiple kdamonds that monitors different partitions of the address ranges. > > > > > I understand no one fits all and the decision is up to each user :) > > Nevertheless, I will be happy to help if you have any question or request for > > DAMON. > > I'll definitely ask after digging more into that, thanks for the offer! The pleasure is mine! :) > > > > > [1] https://cdn.amazon.science/ee/a4/41ff11374f2f865e5e24de11bd17/resource-management-in-aurora-serverless.pdf > > [2] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#filters > > [3] https://github.com/damonitor/damo/blob/next/scripts/mem_tier.sh#L40 > > [4] https://www.phoronix.com/news/DAMON-Self-Tuned-Memory-Tiering > > [5] https://origin.kernel.org/doc/html/latest/mm/damon/faq.html#can-i-simply-monitor-page-granularity > > [6] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com [1] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com [2] https://lore.kernel.org/20260423122340.138880-1-jiayuan.chen@linux.dev Thanks, SJ [...]
>
> The other thing is, as I mentioned in the other email, I still don't know
> how the current RW protection would work for anonymous. I don't yet think
> the user swapper can read the anon page with RW-protected pgtables. So far
> my understanding is maybe you only care about shmem so it's fine, but it'll
> always be great to confirm with you.

I wonder if uffdio_move could be used for a swapper implementation instead?

If we ever have to read from a protnone page, maybe we could teach ptrace access
to do it, or have something that can read from prot_none areas -- like
uffdio_copy, which can write to prot-none areas.

--
Cheers,

David
On Thu, Apr 23, 2026 at 09:25:30PM +0200, David Hildenbrand (Arm) wrote:
> >
> > The other thing is, as I mentioned in the other email, I still don't know
> > how the current RW protection would work for anonymous. I don't yet think
> > the user swapper can read the anon page with RW-protected pgtables. So far
> > my understanding is maybe you only care about shmem so it's fine, but it'll
> > always be great to confirm with you.
>
> I wonder if uffdio_move could be used for a swapper implementation instead?

If RW is justified to be useful first, maybe.

I had a gut feeling Kirill's use case doesn't use anon at all, then if
nobody needs it we can still decide to not support anon.

>
> If we ever have to read from a protnone page, maybe we could teach ptrace access
> to do it, or have something that can read from prot_none areas -- like
> uffdio_copy, which can write to prot-none areas.

Something like swap_access() in my proposal can also partly achieve that.

https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/

There, it was only about reading from swap so far, though. But that one
might be easier to extend to read PROT_NONE and directly put data into a
buffer the user specified (ps: in my local tree impl I named it maccess()
to pair with mincore(), but it doesn't really matter; it doesn't even need
to be a syscall..).

To me, the interfacing is not a major issue. The major question I have is
why RW protection can help in swap system impl when we already have uffd-wp.

So I want to make sure the use case can't be implemented by uffd-wp already.
Because that's really what we might do for QEMU.

The other thing is I want to see possibilities of reusing any new kernel
feature that can provide hotness. Currently idle page tracking has issues,
not only about perf and rmap, but also on not working with mglru (the
default for Fedora/RHEL). If we can split the requirements and solve the
hotness issue, then it'll also be very helpful.

Thanks,

--
Peter Xu
On 4/23/26 22:10, Peter Xu wrote:
> On Thu, Apr 23, 2026 at 09:25:30PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> The other thing is, as I mentioned in the other email, I still don't know
>>> how the current RW protection would work for anonymous. I don't yet think
>>> the user swapper can read the anon page with RW-protected pgtables. So far
>>> my understanding is maybe you only care about shmem so it's fine, but it'll
>>> always be great to confirm with you.
>>
>> I wonder if uffdio_move could be used for a swapper implementation instead?
>
> If RW is justified to be useful first, maybe.
>
> I had a gut feeling Kirill's use case doesn't use anon at all, then if
> nobody needs it we can still decide to not support anon.
>
>>
>> If we ever have to read from a protnone page, maybe we could teach ptrace access
>> to do it, or have something that can read from prot_none areas -- like
>> uffdio_copy, which can write to prot-none areas.
>
> Something like swap_access() in my proposal can also partly achieve that.

Looks more like the hammer for the nail here: we could fault the page in
just fine, while keeping it mapped prot_none and keeping the uffd-rwp pte
bit set.

I was rather thinking of some uffd-specific thing that can read from a
uffd-rwp protected pte without triggering uffd.

>
> https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/
>
> There, it was only about reading from swap so far, though. But that one
> might be easier to be extended to read PROT_NONE and directly put data into
> buffer user specified (ps: in my local tree impl I named it maccess() to
> pair with mincore(), but it doesn't really matter; it doesn't even need to
> be a syscall..).
>
> To me, the interfacing is not a major issue. The major question I have is
> why RW protection can help in swap system impl when we already have uffd-wp.
>
> So I want to make sure the use case can't be implemented by uffd-wp already.
> Because that's really what we might do for QEMU.

There has to be some added value indeed.

--
Cheers,

David
On Thu, Apr 23, 2026 at 04:10:30PM -0400, Peter Xu wrote:
> On Thu, Apr 23, 2026 at 09:25:30PM +0200, David Hildenbrand (Arm) wrote:
> > >
> > > The other thing is, as I mentioned in the other email, I still don't know
> > > how the current RW protection would work for anonymous. I don't yet think
> > > the user swapper can read the anon page with RW-protected pgtables. So far
> > > my understanding is maybe you only care about shmem so it's fine, but it'll
> > > always be great to confirm with you.

That's true. We use vhost and therefore shmem in our setup.

One idea I had about how to make atomic eviction for anon is extending
process_vm_readv() and process_madvise():

 - Add a flag to process_vm_readv() to bypass the protnone check on
   accessible (or only RWP?) VMAs.

 - Allow process_madvise(MADV_DONTNEED) when the caller already has
   ptrace write access to the target.

The standing objection to remote DONTNEED has been "destructive", but
process_vm_writev() already lets a ptrace-capable caller overwrite
arbitrary anon with attacker-chosen content. DONTNEED is strictly
weaker — it zeroes, it does not inject — so the trust model is already
established.

> > I wonder if uffdio_move could be used for a swapper implementation instead?

I considered it. UFFDIO_MOVE can in principle relocate the cold folio
into a staging VMA inside the VMM, which then reads it and drops it.
The downside is the VMM has to maintain a second address range and
serialise eviction through it. A purpose-built primitive — something
like UFFDIO_EVICT that zaps the PTE and returns the folio contents
(optionally to an fd for io_uring) — seems cleaner.

> If RW is justified to be useful first, maybe.
>
> I had a gut feeling Kirill's use case doesn't use anon at all, then if
> nobody needs it we can still decide to not support anon.
>
> >
> > If we ever have to read from a protnone page, maybe we could teach ptrace access
> > to do it, or have something that can read from prot_none areas -- like
> > uffdio_copy, which can write to prot-none areas.
>
> Something like swap_access() in my proposal can also partly achieve that.
>
> https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/

A maccess()-style primitive that reads through PROT_NONE is a reasonable
building block and overlaps with part of what UFFDIO_EVICT would need.

> There, it was only about reading from swap so far, though. But that one
> might be easier to be extended to read PROT_NONE and directly put data into
> buffer user specified (ps: in my local tree impl I named it maccess() to
> pair with mincore(), but it doesn't really matter; it doesn't even need to
> be a syscall..).
>
> To me, the interfacing is not a major issue. The major question I have is
> why RW protection can help in swap system impl when we already have uffd-wp.
>
> So I want to make sure the use case can't be implemented by uffd-wp already.
> Because that's really what we might do for QEMU.

Race-free eviction can definitely be implemented with uffd-wp already.
But not proper working set discovery.

--
  Kiryl Shutsemau / Kirill A. Shutemov
On Fri, Apr 24, 2026 at 12:37:35PM +0100, Kiryl Shutsemau wrote:
> On Thu, Apr 23, 2026 at 04:10:30PM -0400, Peter Xu wrote:
> > On Thu, Apr 23, 2026 at 09:25:30PM +0200, David Hildenbrand (Arm) wrote:
> > > >
> > > > The other thing is, as I mentioned in the other email, I still don't know
> > > > how the current RW protection would work for anonymous. I don't yet think
> > > > the user swapper can read the anon page with RW-protected pgtables. So far
> > > > my understanding is maybe you only care about shmem so it's fine, but it'll
> > > > always be great to confirm with you.
>
> That's true. We use vhost and therefore shmem in our setup.

I see, thanks for confirming.

Side note: I believe vhost works for anon too since GUP works for anon,
but it doesn't matter as long as we know anon isn't a must.

>
> One idea I had about how to make atomic eviction for anon is extending
> process_vm_readv() and process_madvise():
>
>  - Add a flag to process_vm_readv() to bypass the protnone check on
>    accessible (or only RWP?) VMAs.
>
>  - Allow process_madvise(MADV_DONTNEED) when the caller already has
>    ptrace write access to the target.
>
> The standing objection to remote DONTNEED has been "destructive", but
> process_vm_writev() already lets a ptrace-capable caller overwrite
> arbitrary anon with attacker-chosen content. DONTNEED is strictly
> weaker — it zeroes, it does not inject — so the trust model is already
> established.
>
> > > I wonder if uffdio_move could be used for a swapper implementation instead?
>
> I considered it. UFFDIO_MOVE can in principle relocate the cold folio
> into a staging VMA inside the VMM, which then reads it and drops it.
> The downside is the VMM has to maintain a second address range and
> serialise eviction through it. A purpose-built primitive — something
> like UFFDIO_EVICT that zaps the PTE and returns the folio contents
> (optionally to an fd for io_uring) — seems cleaner.

Right, the other thing is the unnecessary overhead of the extra pgtable
operations when moving to the staging VMA (e.g. tlb flush).

>
> > If RW is justified to be useful first, maybe.
> >
> > I had a gut feeling Kirill's use case doesn't use anon at all, then if
> > nobody needs it we can still decide to not support anon.
> >
> > >
> > > If we ever have to read from a protnone page, maybe we could teach ptrace access
> > > to do it, or have something that can read from prot_none areas -- like
> > > uffdio_copy, which can write to prot-none areas.
> >
> > Something like swap_access() in my proposal can also partly achieve that.
> >
> > https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/
>
> A maccess()-style primitive that reads through PROT_NONE is a reasonable
> building block and overlaps with part of what UFFDIO_EVICT would need.
>
> > There, it was only about reading from swap so far, though. But that one
> > might be easier to be extended to read PROT_NONE and directly put data into
> > buffer user specified (ps: in my local tree impl I named it maccess() to
> > pair with mincore(), but it doesn't really matter; it doesn't even need to
> > be a syscall..).
> >
> > To me, the interfacing is not a major issue. The major question I have is
> > why RW protection can help in swap system impl when we already have uffd-wp.
> >
> > So I want to make sure the use case can't be implemented by uffd-wp already.
> > Because that's really what we might do for QEMU.
>
> Race-free eviction can definitely be implemented with uffd-wp already.
> But not proper working set discovery.

Good. Then we can focus the discussion on hotness tracking with RWP and
its benefits, and compare it with a pure access bit focused tracking
system (as I mentioned in the other reply).

Thanks,

--
Peter Xu
On Tue, Apr 21, 2026 at 03:33:27PM +0100, Kiryl Shutsemau wrote:
> > 3) Some other stuff needs a second thought, like
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 8e7dc2c6ee738..08fc18f1290d4 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -695,7 +695,8 @@ static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
> >  	/* ... and a write-fault isn't required for other reasons. */
> >  	if (pmd_needs_soft_dirty_wp(vma, pmd))
> >  		return false;
> > -	return !userfaultfd_huge_pmd_wp(vma, pmd);
> > +	return !userfaultfd_huge_pmd_wp(vma, pmd) &&
> > +	       !userfaultfd_huge_pmd_rwp(vma, pmd);
> >  }
> >
> > How can a pte be writable and prot_none at the same time? Maybe just confused AI
> > output that you should carefully double check before sending that out officially.
>
> Note that this path is for !pmd_write() case to begin with. It serves
> FOLL_FORCE case. I believe this check is correct: we don't want to allow
> to write to such pages even with FOLL_FORCE.
>
> But looking around, I missed gup_can_follow_protnone() modification. It
> has to return false for RWP.

With gup_can_follow_protnone() fixed, the checks in
can_follow_write_pmd/pte() are redundant. Will drop them.

--
  Kiryl Shutsemau / Kirill A. Shutemov
On 4/22/26 11:27, Kiryl Shutsemau wrote:
> On Tue, Apr 21, 2026 at 03:33:27PM +0100, Kiryl Shutsemau wrote:
>>> 3) Some other stuff needs a second thought, like
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 8e7dc2c6ee738..08fc18f1290d4 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -695,7 +695,8 @@ static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
>>>  	/* ... and a write-fault isn't required for other reasons. */
>>>  	if (pmd_needs_soft_dirty_wp(vma, pmd))
>>>  		return false;
>>> -	return !userfaultfd_huge_pmd_wp(vma, pmd);
>>> +	return !userfaultfd_huge_pmd_wp(vma, pmd) &&
>>> +	       !userfaultfd_huge_pmd_rwp(vma, pmd);
>>>  }
>>>
>>> How can a pte be writable and prot_none at the same time? Maybe just confused AI
>>> output that you should carefully double check before sending that out officially.
>>
>> Note that this path is for !pmd_write() case to begin with. It serves
>> FOLL_FORCE case. I believe this check is correct: we don't want to allow
>> to write to such pages even with FOLL_FORCE.
>>
>> But looking around, I missed gup_can_follow_protnone() modification. It
>> has to return false for RWP.
>
> With gup_can_follow_protnone() fixed, the checks in
> can_follow_write_pmd/pte() are redundant. Will drop them.

Yes, that sounds better.

--
Cheers,

David
On Thu, Apr 16, 2026 at 09:25:25PM +0100, Kiryl Shutsemau wrote:
> > b) obviously means that we cannot use uffd-wp and uffd-rwp at the same
> > time in the same uffd area. I guess that should be acceptable for the
> > use cases you should have in mind?
>
> I took a different path: I still use PROT_NONE PTEs, so it cannot
> co-exist with NUMA balancing [fully], but WP + RWP should be fine. I
> need to add a test for this.

WP + RWP works. Test case added.

--
  Kiryl Shutsemau / Kirill A. Shutemov
Hi, Kiryl,

On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> This series adds userfaultfd support for tracking the working set of
> VM guest memory, enabling VMMs to identify cold pages and evict them
> to tiered or remote storage.

Thanks for sharing this work, it looks very interesting to me.

Personally I am also looking at some kind of VMM memtiering issues. I'm
not sure if you saw my lsfmm proposal, it mentioned the challenge we're
facing, it's slightly different but still a bit relevant:

https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/

Unfortunately, that proposal was rejected upstream.

For us, it's so far more about migration and how the migration process
can introduce zero impact to guest workloads, especially on hotness. I'm
not sure if we have any shared goals over that aspect.

>
> == Problem ==
>
> VMMs managing guest memory need to:
>   1. Track which pages are actively used (working set detection)
>   2. Safely evict cold pages to slower storage
>   3. Fetch pages back on demand when accessed again
>
> For shmem-backed guest memory, working set tracking partially works
> today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> re-access auto-resolves from cache. But safe eviction still requires
> synchronous fault interception to prevent data loss races.
>
> For anonymous guest memory (needed for KSM cross-VM deduplication),
> there is no mechanism at all — clearing a PTE loses the page.
>
> == Solution ==
>
> The series introduces a unified userfaultfd interface that works
> across both anonymous and shmem-backed memory:
>
> UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> balancing) to make pages inaccessible without freeing them.
>
> UFFD_FEATURE_MINOR_ASYNC: auto-resolves minor faults without handler
> involvement. The kernel restores PTE permissions immediately and the
> faulting thread continues. Works for anonymous, shmem, and hugetlbfs.
>
> UFFDIO_DEACTIVATE: marks pages as deactivated. For anonymous memory,
> sets PROT_NONE on PTEs (pages stay resident). For shmem/hugetlbfs,
> zaps PTEs (pages stay in page cache).
>
> UFFDIO_SET_MODE: toggles MINOR_ASYNC at runtime, synchronized via
> mmap_write_lock. Enables the VMM workflow: async mode for lightweight
> detection, sync mode for race-free eviction.
>
> PAGE_IS_UFFD_DEACTIVATED: PAGEMAP_SCAN category flag for efficient
> batch detection of cold (still-deactivated) anonymous pages.
>
> == VMM Workflow ==

AFAIU, this workflow provides two functionalities:

>
>   UFFDIO_DEACTIVATE(all)              -- async, no vCPU stalls
>   sleep(interval)
>   PAGEMAP_SCAN                        -- find cold pages

Until here it's only about page hotness tracking. I am curious whether you
evaluated idle page tracking. Is it because of perf overheads on rmap? To
me, your solution (until here.. on the hotness sampling) reads more like a
more efficient way to do idle page tracking but only per-mm, not per-folio.

That will also be something I would like to benefit from if QEMU decides
to do full userspace swap. I think that's our last resort, I'll likely
start with something that makes QEMU work together with Linux on swapping
(e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
currently uses, as long as efficient) then QEMU only cares about the rest,
which is what the migration problem is about.

The other issue about idle page tracking to us is, I believe MGLRU
currently doesn't work well with it (due to ignoring IDLE bits) where the
old LRU algo works. I'm not sure how much you evaluated above, so it'll be
great to share from that perspective too. I also mentioned some of these
challenges in the lsfmm proposal link above.

>   UFFDIO_SET_MODE(sync)               -- block faults for eviction
>   pwrite + MADV_DONTNEED cold pages   -- safe, faults block
>   UFFDIO_SET_MODE(async)              -- resume tracking

These operations are the 2nd function. It's, IMHO, a full userspace swap
system based on userfaultfd.

Have you thought about directly relying on userfaultfd-wp to do this work?
The relevant question is, why do we need to block guest reads on pages
being evicted by the userapp? Can we still allow that to happen, which
seems to be more efficient? IIUC, only writes / updates matter in such a
swap system.

Also, I'm not sure if you're aware of LLNL's umap library:

https://github.com/llnl/umap

That implemented the swap system using userfaultfd wr-protect mode only, so
no new kernel API needed.

Thanks,

>
> The same workflow applies to shmem, with a different PAGEMAP_SCAN mask
> (!PAGE_IS_PRESENT instead of PAGE_IS_UFFD_DEACTIVATED).
>
> == NUMA Balancing ==
>
> NUMA balancing scanning is skipped on anonymous VM_UFFD_MINOR VMAs to
> avoid protnone conflicts. NUMA locality stats are fed from the uffd
> fault path via task_numa_fault() so the scheduler retains placement
> data. Shmem VMAs are unaffected (UFFDIO_DEACTIVATE zaps PTEs there,
> no protnone involved).
>
> == Testing ==
>
> The series includes 6 new selftests covering async/sync modes,
> PAGEMAP_SCAN cold detection, GUP through protnone, UFFDIO_SET_MODE
> toggling, and cleanup on close. All 73 uffd unit tests pass
> (including hugetlb) across defconfig, allnoconfig, allmodconfig,
> and randomized configs.
>
> Kiryl Shutsemau (Meta) (12):
>   userfaultfd: define UAPI constants for anonymous minor faults
>   userfaultfd: add UFFD_FEATURE_MINOR_ANON registration support
>   userfaultfd: implement UFFDIO_DEACTIVATE ioctl
>   userfaultfd: UFFDIO_CONTINUE for anonymous memory
>   mm: intercept protnone faults on VM_UFFD_MINOR anonymous VMAs
>   userfaultfd: auto-resolve shmem and hugetlbfs minor faults in async
>     mode
>   sched/numa: skip scanning anonymous VM_UFFD_MINOR VMAs
>   userfaultfd: enable UFFD_FEATURE_MINOR_ANON
>   mm/pagemap: add PAGE_IS_UFFD_DEACTIVATED to PAGEMAP_SCAN
>   userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
>   selftests/mm: add userfaultfd anonymous minor fault tests
>   Documentation/userfaultfd: document working set tracking
>
>  Documentation/admin-guide/mm/userfaultfd.rst | 141 +++++-
>  fs/proc/task_mmu.c                           |  11 +-
>  fs/userfaultfd.c                             | 184 +++++++-
>  include/linux/huge_mm.h                      |   6 +
>  include/linux/mm.h                           |   2 +
>  include/linux/sched/numa_balancing.h         |   1 +
>  include/linux/userfaultfd_k.h                |  21 +-
>  include/trace/events/sched.h                 |   3 +-
>  include/uapi/linux/fs.h                      |   1 +
>  include/uapi/linux/userfaultfd.h             |  40 +-
>  kernel/sched/fair.c                          |  13 +
>  mm/huge_memory.c                             |  33 +-
>  mm/hugetlb.c                                 |   3 +-
>  mm/memory.c                                  |  51 ++-
>  mm/mprotect.c                                |   9 +-
>  mm/shmem.c                                   |   3 +-
>  mm/userfaultfd.c                             | 164 ++++++-
>  tools/testing/selftests/mm/uffd-unit-tests.c | 458 +++++++++++++++++++
>  18 files changed, 1096 insertions(+), 48 deletions(-)
>
> --
> 2.51.2
>

--
Peter Xu
On Tue, Apr 14, 2026 at 11:28:33AM -0400, Peter Xu wrote:
> Hi, Kiryl,
>
> On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> > This series adds userfaultfd support for tracking the working set of
> > VM guest memory, enabling VMMs to identify cold pages and evict them
> > to tiered or remote storage.
>
> Thanks for sharing this work, it looks very interesting to me.
>
> Personally I am also looking at some kind of VMM memtiering issues. I'm
> not sure if you saw my lsfmm proposal, it mentioned the challenge we're
> facing, it's slightly different but still a bit relevant:
>
> https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/

Thanks, will read up. I didn't follow userfaultfd work until recently.

> Unfortunately, that proposal was rejected upstream.

Sorry about that. We can chat about it in the hallway track, if you are
there :)

> > == VMM Workflow ==
>
> AFAIU, this workflow provides two functionalities:
>
> >
> >   UFFDIO_DEACTIVATE(all)              -- async, no vCPU stalls
> >   sleep(interval)
> >   PAGEMAP_SCAN                        -- find cold pages
>
> Until here it's only about page hotness tracking. I am curious whether you
> evaluated idle page tracking. Is it because of perf overheads on rmap?

I didn't give idle page tracking much thought. I needed uffd faults to
serialize reclaim against memory accesses. If we use it for one thing, we
might as well try to use it for tracking too. And it seems to fit together
nicely with sync/async mode flipping.

> To
> me, your solution (until here.. on the hotness sampling) reads more like a
> more efficient way to do idle page tracking but only per-mm, not per-folio.
>
> That will also be something I would like to benefit if QEMU will decide to
> do full userspace swap. I think that's our last resort, I'll likely start
> with something that makes QEMU work together with Linux on swapping
> (e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
> currently uses, as long as efficient) then QEMU only cares about the rest,
> which is what the migration problem is about.
>
> The other issue about idle page tracking to us is, I believe MGLRU
> currently doesn't work well with it (due to ignoring IDLE bits) where the
> old LRU algo works. I'm not sure how much you evaluated above, so it'll be
> great to share from that perspective too. I also mentioned some of these
> challenges in the lsfmm proposal link above.
>
> >   UFFDIO_SET_MODE(sync)               -- block faults for eviction
> >   pwrite + MADV_DONTNEED cold pages   -- safe, faults block
> >   UFFDIO_SET_MODE(async)              -- resume tracking
>
> These operations are the 2nd function. It's, IMHO, a full userspace swap
> system based on userfaultfd.

Right. And we want to decide where to put cold pages from userspace.

> Have you thought about directly relying on userfaultfd-wp to do this work?
> The relevant question is, why do we need to block guest reads on pages
> being evicted by the userapp? Can we still allow that to happen, which
> seems to be more efficient? IIUC, only writes / updates matters in such
> swap system.

But we do care about read accesses. We don't want to swap out pages that
got read-touched. And we cannot in practice switch to WP mode after
PAGEMAP_SCAN: it would require a lot of UFFDIO_WRITEPROTECT calls, each
with a TLB flush.

With my approach, switching between tracking and reclaiming is a single
bit flip under mmap lock.

> Also, I'm not sure if you're aware of LLNL's umap library:
>
> https://github.com/llnl/umap
>
> That implemented the swap system using userfaultfd wr-protect mode only, so
> no new kernel API needed.

Will look into it. Thanks.

--
  Kiryl Shutsemau / Kirill A. Shutemov
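The track-then-evict cycle discussed above (deactivate everything, let accesses auto-resolve in async mode, scan for pages that stayed cold, flip to sync mode, evict) can be modelled as a small userspace state machine. This is a toy sketch only: every `toy_*` name is hypothetical, no ioctl is issued, and UFFDIO_DEACTIVATE / UFFDIO_SET_MODE exist only as proposals in this series, so their real semantics may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; every name here is hypothetical, not kernel UAPI. */
enum toy_state { TOY_ACTIVE, TOY_DEACTIVATED, TOY_EVICTED };

struct toy_region {
	enum toy_state *pages;
	size_t nr_pages;
	bool sync_mode;	/* false: faults auto-resolve; true: faults block */
};

/* Models the proposed UFFDIO_DEACTIVATE over the whole region. */
void toy_deactivate_all(struct toy_region *r)
{
	for (size_t i = 0; i < r->nr_pages; i++)
		if (r->pages[i] == TOY_ACTIVE)
			r->pages[i] = TOY_DEACTIVATED;
}

/*
 * Models a guest access. Returns true if the access completes without
 * the handler, false if the faulting thread would have to wait for it.
 */
bool toy_access(struct toy_region *r, size_t idx)
{
	assert(idx < r->nr_pages);
	switch (r->pages[idx]) {
	case TOY_ACTIVE:
		return true;
	case TOY_DEACTIVATED:
		if (r->sync_mode)
			return false;		/* fault blocks */
		r->pages[idx] = TOY_ACTIVE;	/* async auto-resolve */
		return true;
	default:
		return false;	/* evicted: contents must be fetched back */
	}
}

/* Models the PAGEMAP_SCAN step: collect pages that are still deactivated. */
size_t toy_scan_cold(const struct toy_region *r, size_t *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < r->nr_pages && n < max; i++)
		if (r->pages[i] == TOY_DEACTIVATED)
			out[n++] = i;
	return n;
}

/* Evicts the listed pages; only allowed while faults block (sync mode). */
size_t toy_evict(struct toy_region *r, const size_t *idx, size_t n)
{
	size_t evicted = 0;

	assert(r->sync_mode);
	for (size_t i = 0; i < n; i++) {
		assert(idx[i] < r->nr_pages);
		if (r->pages[idx[i]] == TOY_DEACTIVATED) {
			r->pages[idx[i]] = TOY_EVICTED;
			evicted++;
		}
	}
	return evicted;
}
```

In the model, the mode switch is one field write for the whole region, which mirrors the "single bit flip" argument; the write-protect alternative would instead touch every cold page before eviction.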
On Tue, Apr 14, 2026 at 06:08:48PM +0100, Kiryl Shutsemau wrote:
> On Tue, Apr 14, 2026 at 11:28:33AM -0400, Peter Xu wrote:
> > Hi, Kiryl,
> >
> > On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> > > This series adds userfaultfd support for tracking the working set of
> > > VM guest memory, enabling VMMs to identify cold pages and evict them
> > > to tiered or remote storage.
> >
> > Thanks for sharing this work, it looks very interesting to me.
> >
> > Personally I am also looking at some kind of VMM memtiering issues. I'm
> > not sure if you saw my lsfmm proposal, it mentioned the challenge we're
> > facing, it's slightly different but still a bit relevant:
> >
> > https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/
>
> Thanks, will read up. I didn't follow userfaultfd work until recently.

Thanks. Note that the proposal doesn't have much to do with userfaultfd.
You'll see when you start reading.

>
> > Unfortunately, that proposal was rejected upstream.
>
> Sorry about that. We can chat about it in the hallway track, if you are
> there :)

I won't be there (as it's rejected.. hence not invited). But I'm always
happy to discuss this topic on the list or elsewhere. Along the way I
believe it'll also help us to know what is the most acceptable path
forward, as it's still very relevant.

>
> > > == VMM Workflow ==
> >
> > AFAIU, this workflow provides two functionalities:
> >
> > >
> > >   UFFDIO_DEACTIVATE(all)              -- async, no vCPU stalls
> > >   sleep(interval)
> > >   PAGEMAP_SCAN                        -- find cold pages
> >
> > Until here it's only about page hotness tracking. I am curious whether you
> > evaluated idle page tracking. Is it because of perf overheads on rmap?
>
> I didn't give idle page tracking much thought. I needed uffd faults to
> serialize reclaim against memory accesses. If use it for one thing we
> can as well try to use it for tracking as well. And it seems to be
> fitting together nicely with sync/async mode flipping.

Yes, I get your point. It's just that it partly redoes what the access
bit has already been doing for mm core in general on tracking hotness. So
I wonder if we should still try to see if we can separate the two problems.

One other quick thought is maybe we could also report hotness from the
kernel directly rather than relying on async faults, you can refer to "(2)
Hotness Information API" in my above proposal. Here when it's only about
knowing which page is less frequently used, it's only a READ interface.

>
> > To
> > me, your solution (until here.. on the hotness sampling) reads more like a
> > more efficient way to do idle page tracking but only per-mm, not per-folio.
> >
> > That will also be something I would like to benefit if QEMU will decide to
> > do full userspace swap. I think that's our last resort, I'll likely start
> > with something that makes QEMU work together with Linux on swapping
> > (e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
> > currently uses, as long as efficient) then QEMU only cares about the rest,
> > which is what the migration problem is about.
> >
> > The other issue about idle page tracking to us is, I believe MGLRU
> > currently doesn't work well with it (due to ignoring IDLE bits) where the
> > old LRU algo works. I'm not sure how much you evaluated above, so it'll be
> > great to share from that perspective too. I also mentioned some of these
> > challenges in the lsfmm proposal link above.
> >
> > >   UFFDIO_SET_MODE(sync)               -- block faults for eviction
> > >   pwrite + MADV_DONTNEED cold pages   -- safe, faults block
> > >   UFFDIO_SET_MODE(async)              -- resume tracking
> >
> > These operations are the 2nd function. It's, IMHO, a full userspace swap
> > system based on userfaultfd.
>
> Right. And we want to decide where to put cold pages from userspace.
>
> > Have you thought about directly relying on userfaultfd-wp to do this work?
> > The relevant question is, why do we need to block guest reads on pages
> > being evicted by the userapp? Can we still allow that to happen, which
> > seems to be more efficient? IIUC, only writes / updates matters in such
> > swap system.
>
> But we do care about read accesses. We don't want to swap out
> pages that got read-touched. And we cannot in practice switch to WP mode

This is a good point. When it's considered on top of your above "async
trapping to collect hotness with userfaultfd" idea, it flows naturally with
this idea indeed.

However, IMHO that should really be an extremely small window, and the
major part the userapp should rely on is the larger sampling window that
checks whether, in your current case, PROT_NONE (or PTE_NONE for shmem)
switched back to an accessible PTE. It means using RW protection v.s.
WR-ONLY protection will only differ very slightly, namely when by accident
some page gets read (but not written) during eviction.

For example, if the mgmt app monitors PROT_NONE state for 30 seconds, makes
a decision to evict, evicting takes 5ms, then within those 5ms someone
reads the page. It means it only misses the 5ms/30sec access pattern of
the guest.

So far I don't yet know if this would justify a new kernel API just for
that small false positive, reporting a page as cold when it's actually
hot. To me it's still fine to consider using WP-ONLY and just allow that
trivial window to get refaulted later, because it shouldn't be the majority.

> after PAGEMAP_SCAN: it would require a lot of UFFDIO_WRITEPROTECT calls
> with TLB flushing each.

This is indeed a concern, maybe a bigger one.

I don't know how much benefit we can get from avoiding one extra TLB flush
when evicting. IMHO some numbers might be more than great to justify this
part.

While at this, I do have a pure question that is relevant on the full
protection scheme (and it can be naive; please bear with me on not yet
reading the whole series): if you change anon mappings to PROT_NONE in
pgtables, then how does the mgmt app read this page before dumping it
anywhere? It's not like shmem where you can have a separate mapping. Do
you need to fork(), for example?

>
> With my approach switching tracking and reclaiming is single bit flip
> under mmap lock.
>
> > Also, I'm not sure if you're aware of LLNL's umap library:
> >
> > https://github.com/llnl/umap
> >
> > That implemented the swap system using userfaultfd wr-protect mode only, so
> > no new kernel API needed.
>
> Will look into it. Thanks.

Thanks,

--
Peter Xu
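The 5 ms / 30 s example in the message above is simple enough to work through. The sketch below only restates that arithmetic; `missed_read_fraction()` is a hypothetical helper, and the result is meaningful only under the assumption that guest reads are spread uniformly over the observation period, which real workloads do not guarantee.

```c
#include <assert.h>

/*
 * Fraction of the observation period during which a read-only access
 * could slip past write-only protection, i.e. the eviction window
 * divided by the monitoring window. Assumes accesses are spread
 * uniformly over time.
 */
double missed_read_fraction(double evict_ms, double monitor_ms)
{
	assert(evict_ms >= 0.0 && monitor_ms > 0.0);
	return evict_ms / monitor_ms;
}
```

For the numbers in the thread, `missed_read_fraction(5.0, 30000.0)` is 5/30000, roughly 0.017% of the monitoring window. Whether that is negligible depends on how bursty the guest's reads are, which is the open question in the discussion.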