DAMON is using Accessed bits of page table entries as the major source
of the access information. It lacks some additional information such as
which CPU was making the access. Page faults could be another source of
information for such additional information.
Implement another change_protection() flag for such use case, namely
MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag.
To avoid interfering with NUMA_BALANCING, which is also using PAGE_NON
protection, pass the faults to DAMON only when NUMA_BALANCING is
disabled.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
include/linux/mm.h | 1 +
mm/memory.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--
mm/mprotect.c | 5 +++++
3 files changed, 57 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21270f1664a4..ad92b77bf782 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2567,6 +2567,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
#define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
+#define MM_CP_DAMON (1UL << 4)
bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1..656e610867b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/damon.h>
#include <trace/events/kmem.h>
@@ -5972,6 +5973,47 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
return VM_FAULT_FALLBACK;
}
+static vm_fault_t do_damon_page(struct vm_fault *vmf, bool huge_pmd)
+{
+ struct damon_access_report access_report = {
+ .addr = vmf->address,
+ .size = 1,
+ };
+ struct vm_area_struct *vma = vmf->vma;
+ struct folio *folio;
+ pte_t pte, old_pte;
+ bool writable = false, ignore_writable = false;
+ bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
+
+ if (huge_pmd)
+ access_report.addr = PFN_PHYS(pmd_pfn(vmf->orig_pmd));
+ else
+ access_report.addr = PFN_PHYS(pte_pfn(vmf->orig_pte));
+
+ spin_lock(vmf->ptl);
+ old_pte = ptep_get(vmf->pte);
+ if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
+ pte = pte_modify(old_pte, vma->vm_page_prot);
+ writable = pte_write(pte);
+ if (!writable && pte_write_upgrade &&
+ can_change_pte_writable(vma, vmf->address, pte))
+ writable = true;
+ folio = vm_normal_folio(vma, vmf->address, pte);
+ if (folio && folio_test_large(folio))
+ numa_rebuild_large_mapping(vmf, vma, folio, pte,
+ ignore_writable, pte_write_upgrade);
+ else
+ numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
+ writable);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+ damon_report_access(&access_report);
+ return 0;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -6036,8 +6078,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
- if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+ if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+ if (sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED)
+ return do_damon_page(vmf, false);
return do_numa_page(vmf);
+ }
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
@@ -6159,8 +6204,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return 0;
}
if (pmd_trans_huge(vmf.orig_pmd)) {
- if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+ if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
+ if (sysctl_numa_balancing_mode ==
+ NUMA_BALANCING_DISABLED)
+ return do_damon_page(&vmf, true);
return do_huge_pmd_numa_page(&vmf);
+ }
if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
!pmd_write(vmf.orig_pmd)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 78bded7acf79..e8a76114e4f9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -714,6 +714,11 @@ long change_protection(struct mmu_gather *tlb,
WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
#endif
+#ifdef CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
+ if (cp_flags & MM_CP_DAMON)
+ newprot = PAGE_NONE;
+#endif
+
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot,
cp_flags);
--
2.39.5
On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote: > DAMON is using Accessed bits of page table entries as the major source > of the access information. It lacks some additional information such as > which CPU was making the access. Page faults could be another source of > information for such additional information. > > Implement another change_protection() flag for such use case, namely > MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag. > To avoid interfering with NUMA_BALANCING, which is also using PAGE_NON > protection, pass the faults to DAMON only when NUMA_BALANCING is > disabled. > > Signed-off-by: SeongJae Park <sj@kernel.org> This seems to not be an upstreamable series right now unless I'm missing something. Firstly, you're making a change in behaviour even when CONFIG_DAMON is not specified, and Linus already told you we can't have that default-on. I'm very very much not happy with us doing something for 'damon' unconditionaly when !CONFIG_DAMON on the assumption that an acessible mapping has PROT_NONE set. This change makes me nervous in general ANYWAY as you are now changing core mm and introducing a new faulting mechanism specifically for DAMON. And we are assuming that NUMA balancing being disabled is not racey in a way that will cause things to break. I really also dislike the idea of an _implicit_ assumption that we are good to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff. Is it really all that useful to provide a method that requires NUMA balancing to be diabled? Finally, you're making it so DAMON can mprotect() something to use the DAMON/NUMA balancing fault handler, which doesn't appaer to check to see if NUMA balacing is disabled, but anyway it could be re-enabled? And then now DAMON is making stuff fault as NUMA balancing faults incorrectly? This just seems broken. Unless there's really good justification I'm really not inclined for us to merge this as-is right now unfortunately. > --- > include/linux/mm.h | 1 + > mm/memory.c | 53 ++++++++++++++++++++++++++++++++++++++++++++-- > mm/mprotect.c | 5 +++++ > 3 files changed, 57 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 21270f1664a4..ad92b77bf782 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2567,6 +2567,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen); > #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */ > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > MM_CP_UFFD_WP_RESOLVE) /* Comment here :) */ > +#define MM_CP_DAMON (1UL << 4) Shouldn't this be something more specific like MM_CP_DAMON_REPORT_FAULT ? > > bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > pte_t pte); > diff --git a/mm/memory.c b/mm/memory.c > index 92fd18a5d8d1..656e610867b0 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -75,6 +75,7 @@ > #include <linux/ptrace.h> > #include <linux/vmalloc.h> > #include <linux/sched/sysctl.h> > +#include <linux/damon.h> > > #include <trace/events/kmem.h> > > @@ -5972,6 +5973,47 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) > return VM_FAULT_FALLBACK; > } > > +static vm_fault_t do_damon_page(struct vm_fault *vmf, bool huge_pmd) You need to explain what this function is doing at least. > +{ > + struct damon_access_report access_report = { > + .addr = vmf->address, > + .size = 1, > + }; > + struct vm_area_struct *vma = vmf->vma; > + struct folio *folio; > + pte_t pte, old_pte; > + bool writable = false, ignore_writable = false; > + bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma); > + > + if (huge_pmd) > + access_report.addr = PFN_PHYS(pmd_pfn(vmf->orig_pmd)); > + else > + access_report.addr = PFN_PHYS(pte_pfn(vmf->orig_pte)); > + > + spin_lock(vmf->ptl); > + old_pte = ptep_get(vmf->pte); > + if (unlikely(!pte_same(old_pte, vmf->orig_pte))) { > + pte_unmap_unlock(vmf->pte, vmf->ptl); > + return 0; > + } > + pte = pte_modify(old_pte, vma->vm_page_prot); This is little weird, you're applying VMA protection bits to the PTE then later asking about protection bits? Can't you just tell from the VMA flags? Seems like do_numa_page() copy/pasta. > + writable = pte_write(pte); > + if (!writable && pte_write_upgrade && > + can_change_pte_writable(vma, vmf->address, pte)) > + writable = true; > + folio = vm_normal_folio(vma, vmf->address, pte); > + if (folio && folio_test_large(folio)) > + numa_rebuild_large_mapping(vmf, vma, folio, pte, > + ignore_writable, pte_write_upgrade); > + else > + numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte, > + writable); > + pte_unmap_unlock(vmf->pte, vmf->ptl); > + All of this seems to be duplicating aspects of do_numa_page(). Seems more sensible, if it makes sense to accept this change (I'm still a bit dubious about changing the faulting mechanism here), we should probably implement a 'do_non_present_acessible()' function that avoids duplication + bit-rot, with #ifdef CONFIG_DAMON's in place, or even better, just a hook into damon code that's static inline void {} if not enabled. Note that the numa fault handling logic is called _even if numa balancing is disabled_. So we'd avoid all the other changes too. > + damon_report_access(&access_report); > + return 0; > +} > + > /* > * These routines also need to handle stuff like marking pages dirty > * and/or accessed for architectures that don't do it in hardware (most > @@ -6036,8 +6078,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > if (!pte_present(vmf->orig_pte)) > return do_swap_page(vmf); > > - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) > + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) { > + if (sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED) > + return do_damon_page(vmf, false); Making an assumption here that, even if damon is disabled, we should handle a 'damon' fault is just icky, sorry. Also are we 100% sure that there's no races here? When we disable numa balancing do we correctly ensure that absolutely no racing NUMA balancing faults can happen whatsever at this juncture? Have you checked that and ensured that to be the case? You really have to be 100% certain you're not going to wrongly handle NUMA page faults this way, on a potentially non-CONFIG_DAMON kernel to boot. Keep in mind fault handling is verrrry racey (purposefully) and can be done under VMA read lock alone (and that's only very loosely a lock!). This is just, yeah, concerning. > return do_numa_page(vmf); > + } > > spin_lock(vmf->ptl); > entry = vmf->orig_pte; > @@ -6159,8 +6204,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, > return 0; > } > if (pmd_trans_huge(vmf.orig_pmd)) { > - if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) > + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) { > + if (sysctl_numa_balancing_mode == > + NUMA_BALANCING_DISABLED) > + return do_damon_page(&vmf, true); Now we have _more_ weirdness around what CONFIG_TRANSPARENT_HUGEPAGE enables/disables. > return do_huge_pmd_numa_page(&vmf); This function will be called only if THP is enabled, but now damon overrides that... Let's try and make this consistent. > + } > > if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && > !pmd_write(vmf.orig_pmd)) { > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 78bded7acf79..e8a76114e4f9 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -714,6 +714,11 @@ long change_protection(struct mmu_gather *tlb, > WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA); > #endif > > +#ifdef CONFIG_ARCH_SUPPORTS_NUMA_BALANCING > + if (cp_flags & MM_CP_DAMON) > + newprot = PAGE_NONE; > +#endif OK this just seems broken. Firstly, again you're introducing behaviour that applies even if DAMON is disabled. Let's please not. And predicating on -the architecture even supporting NUMA balancing- seems rather bizarre? Secondly, somebody can disable/enable NUMA balancing, and thus you are now allowing this function to, when somebody specifies 'DAMON', get NUMA balancing fault handling?? If you don't bother checking whether NUMA balancing is disabled when you set it, then boom - you've broken the faulting mechanism, but even if you _do_, somebody can just switch it on again... Sorry but unless I'm wildly missing something here we're going to need a new faulting mechanism (if we even want to allow DAMON to have its own fault handling that is!) > + > if (is_vm_hugetlb_page(vma)) > pages = hugetlb_change_protection(vma, start, end, newprot, > cp_flags); > -- > 2.39.5 Cheers, Lorenzo
On Mon, 28 Jul 2025 06:19:35 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote: > > DAMON is using Accessed bits of page table entries as the major source > > of the access information. It lacks some additional information such as > > which CPU was making the access. Page faults could be another source of > > information for such additional information. > > > > Implement another change_protection() flag for such use case, namely > > MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag. > > To avoid interfering with NUMA_BALANCING, which is also using PAGE_NON > > protection, pass the faults to DAMON only when NUMA_BALANCING is > > disabled. > > > > Signed-off-by: SeongJae Park <sj@kernel.org> > > This seems to not be an upstreamable series right now unless I'm missing > something. > > Firstly, you're making a change in behaviour even when CONFIG_DAMON is not > specified, and Linus already told you we can't have that default-on. > > I'm very very much not happy with us doing something for 'damon' > unconditionaly when !CONFIG_DAMON on the assumption that an acessible > mapping has PROT_NONE set. > > This change makes me nervous in general ANYWAY as you are now changing core > mm and introducing a new faulting mechanism specifically for DAMON. > > And we are assuming that NUMA balancing being disabled is not racey in a > way that will cause things to break. > > I really also dislike the idea of an _implicit_ assumption that we are good > to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff. > > Is it really all that useful to provide a method that requires NUMA > balancing to be diabled? > > Finally, you're making it so DAMON can mprotect() something to use the > DAMON/NUMA balancing fault handler, which doesn't appaer to check to see if > NUMA balacing is disabled, but anyway it could be re-enabled? > > And then now DAMON is making stuff fault as NUMA balancing faults > incorrectly? > > This just seems broken. > > Unless there's really good justification I'm really not inclined for us to > merge this as-is right now unfortunately. Thank you for review and comments, Lorenzo. I fundamentally agree all your points. I don't aim to merge this as-is. Actually this patch series is more like POC, but apparently I was rushing. I will try to adjust your concerns in the next version. > > > --- > > include/linux/mm.h | 1 + > > mm/memory.c | 53 ++++++++++++++++++++++++++++++++++++++++++++-- > > mm/mprotect.c | 5 +++++ > > 3 files changed, 57 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 21270f1664a4..ad92b77bf782 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -2567,6 +2567,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen); > > #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */ > > #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ > > MM_CP_UFFD_WP_RESOLVE) > > /* Comment here :) */ I will add the comments on the next version. > > > +#define MM_CP_DAMON (1UL << 4) > > Shouldn't this be something more specific like MM_CP_DAMON_REPORT_FAULT ? Thank you for the suggestion, I will use it. > > > > > bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > > pte_t pte); > > diff --git a/mm/memory.c b/mm/memory.c > > index 92fd18a5d8d1..656e610867b0 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -75,6 +75,7 @@ > > #include <linux/ptrace.h> > > #include <linux/vmalloc.h> > > #include <linux/sched/sysctl.h> > > +#include <linux/damon.h> > > > > #include <trace/events/kmem.h> > > > > @@ -5972,6 +5973,47 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) > > return VM_FAULT_FALLBACK; > > } > > > > +static vm_fault_t do_damon_page(struct vm_fault *vmf, bool huge_pmd) > > You need to explain what this function is doing at least. This function lifts the protection and reports the access information to DAMON. I will add more comments on the next version. > > > +{ > > + struct damon_access_report access_report = { > > + .addr = vmf->address, > > + .size = 1, > > + }; > > + struct vm_area_struct *vma = vmf->vma; > > + struct folio *folio; > > + pte_t pte, old_pte; > > + bool writable = false, ignore_writable = false; > > + bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma); > > + > > + if (huge_pmd) > > + access_report.addr = PFN_PHYS(pmd_pfn(vmf->orig_pmd)); > > + else > > + access_report.addr = PFN_PHYS(pte_pfn(vmf->orig_pte)); > > + > > + spin_lock(vmf->ptl); > > + old_pte = ptep_get(vmf->pte); > > + if (unlikely(!pte_same(old_pte, vmf->orig_pte))) { > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > + return 0; > > + } > > + pte = pte_modify(old_pte, vma->vm_page_prot); > > This is little weird, you're applying VMA protection bits to the PTE then > later asking about protection bits? Can't you just tell from the VMA flags? > > Seems like do_numa_page() copy/pasta. > > > + writable = pte_write(pte); > > + if (!writable && pte_write_upgrade && > > + can_change_pte_writable(vma, vmf->address, pte)) > > + writable = true; > > + folio = vm_normal_folio(vma, vmf->address, pte); > > + if (folio && folio_test_large(folio)) > > + numa_rebuild_large_mapping(vmf, vma, folio, pte, > > + ignore_writable, pte_write_upgrade); > > + else > > + numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte, > > + writable); > > + pte_unmap_unlock(vmf->pte, vmf->ptl); > > + > > All of this seems to be duplicating aspects of do_numa_page(). You are correct. > > Seems more sensible, if it makes sense to accept this change (I'm still a > bit dubious about changing the faulting mechanism here), we should probably > implement a 'do_non_present_acessible()' function that avoids duplication + > bit-rot, with #ifdef CONFIG_DAMON's in place, or even better, just a hook > into damon code that's static inline void {} if not enabled. I will do so in the next version. > > Note that the numa fault handling logic is called _even if numa balancing > is disabled_. So we'd avoid all the other changes too. > > > + damon_report_access(&access_report); > > + return 0; > > +} > > + > > /* > > * These routines also need to handle stuff like marking pages dirty > > * and/or accessed for architectures that don't do it in hardware (most > > @@ -6036,8 +6078,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > > if (!pte_present(vmf->orig_pte)) > > return do_swap_page(vmf); > > > > - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) > > + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) { > > + if (sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED) > > + return do_damon_page(vmf, false); > > Making an assumption here that, even if damon is disabled, we should handle > a 'damon' fault is just icky, sorry. Agreed. I will make this happen only when CONFIG_DAMON is enabled. > > Also are we 100% sure that there's no races here? When we disable numa > balancing do we correctly ensure that absolutely no racing NUMA balancing > faults can happen whatsever at this juncture? Enabling CONFIG_DAMON will not automatically invoke change_protection() with MM_CP_DAMON. Such invocations will be made only if the user disables NUMA balancing and run DAMON in the reporting mode. So there can be two racy cases. If the user enables NUMA balancing and then disables it after a while, page faults for MM_CP_PROT_NUMA can be handled by DAMON. That could look odd, but there should be no real problem in practice. DAMON's fault handling will cleanup the PAGE_NONE protection like NUMA balancing, and DAMON has no problem at receiving such additional reports from MM_CP_PROT_NUMA-caused faults. DAMON may show a few more than expected accesses, but that's no problem for DAMON. Similarly, if the user starts DAMON in the report mode, stops DAMON, and then enables NUMA balancing, faults for MM_CP_DAMON that installed while DAMON was running in the report mode can be handled by NUMA balancing. This should also not make real problems in practice in my opinion. NUMA balancing could see more accesses and migrate pages little bit more than expected, but that should be just for a moment. Maybe I'm missing something, or not well explaining my thoughts. I will be happy to learn what I'm missing, or get chances to further clarify my points. > > Have you checked that and ensured that to be the case? > > You really have to be 100% certain you're not going to wrongly handle NUMA > page faults this way, on a potentially non-CONFIG_DAMON kernel to boot. I will ensure that never happens on CONFIG_DAMON disabled kernels, in the next version. > > Keep in mind fault handling is verrrry racey (purposefully) and can be done > under VMA read lock alone (and that's only very loosely a lock!). > > This is just, yeah, concerning. Thank you for the caution. DAMON's fault handling code only saves the information in its internal ring buffer. It doesn't touch vmas. So I think there should be no such problems. I will add the clarification on the next version. > > > return do_numa_page(vmf); > > + } > > > > spin_lock(vmf->ptl); > > entry = vmf->orig_pte; > > @@ -6159,8 +6204,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, > > return 0; > > } > > if (pmd_trans_huge(vmf.orig_pmd)) { > > - if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) > > + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) { > > + if (sysctl_numa_balancing_mode == > > + NUMA_BALANCING_DISABLED) > > + return do_damon_page(&vmf, true); > > Now we have _more_ weirdness around what CONFIG_TRANSPARENT_HUGEPAGE > enables/disables. > > > return do_huge_pmd_numa_page(&vmf); > > This function will be called only if THP is enabled, but now damon > overrides that... > > Let's try and make this consistent. I will do so in the next version. > > > + } > > > > if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && > > !pmd_write(vmf.orig_pmd)) { > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index 78bded7acf79..e8a76114e4f9 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -714,6 +714,11 @@ long change_protection(struct mmu_gather *tlb, > > WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA); > > #endif > > > > +#ifdef CONFIG_ARCH_SUPPORTS_NUMA_BALANCING > > + if (cp_flags & MM_CP_DAMON) > > + newprot = PAGE_NONE; > > +#endif > > OK this just seems broken. > > Firstly, again you're introducing behaviour that applies even if DAMON is > disabled. Let's please not. > > And predicating on -the architecture even supporting NUMA balancing- seems > rather bizarre? Agree, I will ensure there will be no change on CONFIG_DAMON disabled kernels, in the next version. > > Secondly, somebody can disable/enable NUMA balancing, and thus you are now > allowing this function to, when somebody specifies 'DAMON', get NUMA > balancing fault handling?? > > If you don't bother checking whether NUMA balancing is disabled when you > set it, then boom - you've broken the faulting mechanism, but even if you > _do_, somebody can just switch it on again... As I explained on the two racy cases aboe, faults that caused by DAMON or NUMA balancing can be handled by the other's handling code, but only for a limited time under the user's controls. But to my understanding that should not cause real problems in the practice, and users wouldn't be suggested to operate the system in such a way. I will add more documents and cautions about that in the next version. Again, I may be missing something, and would be happy to be enlightened or get more chances to clarify my points. > > Sorry but unless I'm wildly missing something here we're going to need a > new faulting mechanism (if we even want to allow DAMON to have its own > fault handling that is!) No worry, I completely agree to your main points. I think I was bettr to explain more about the racy cases and ensure no interference on CONFIG_DAMON disabled case, and thank you for letting me know that. Maybe we need more discussions on the racy cases, but anyway I will try to address all your concerns in the next version. Thanks, SJ [...]
On Mon, Jul 28, 2025 at 08:06:32PM -0700, SeongJae Park wrote: > On Mon, 28 Jul 2025 06:19:35 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote: > > > DAMON is using Accessed bits of page table entries as the major source > > > of the access information. It lacks some additional information such as > > > which CPU was making the access. Page faults could be another source of > > > information for such additional information. > > > > > > Implement another change_protection() flag for such use case, namely > > > MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag. > > > To avoid interfering with NUMA_BALANCING, which is also using PAGE_NON > > > protection, pass the faults to DAMON only when NUMA_BALANCING is > > > disabled. > > > > > > Signed-off-by: SeongJae Park <sj@kernel.org> > > > > This seems to not be an upstreamable series right now unless I'm missing > > something. > > > > Firstly, you're making a change in behaviour even when CONFIG_DAMON is not > > specified, and Linus already told you we can't have that default-on. > > > > I'm very very much not happy with us doing something for 'damon' > > unconditionaly when !CONFIG_DAMON on the assumption that an acessible > > mapping has PROT_NONE set. > > > > This change makes me nervous in general ANYWAY as you are now changing core > > mm and introducing a new faulting mechanism specifically for DAMON. > > > > And we are assuming that NUMA balancing being disabled is not racey in a > > way that will cause things to break. > > > > I really also dislike the idea of an _implicit_ assumption that we are good > > to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff. > > > > Is it really all that useful to provide a method that requires NUMA > > balancing to be diabled? > > > > Finally, you're making it so DAMON can mprotect() something to use the > > DAMON/NUMA balancing fault handler, which doesn't appaer to check to see if > > NUMA balacing is disabled, but anyway it could be re-enabled? > > > > And then now DAMON is making stuff fault as NUMA balancing faults > > incorrectly? > > > > This just seems broken. > > > > Unless there's really good justification I'm really not inclined for us to > > merge this as-is right now unfortunately. > > Thank you for review and comments, Lorenzo. I fundamentally agree all your > points. I don't aim to merge this as-is. Actually this patch series is more > like POC, but apparently I was rushing. I will try to adjust your concerns in > the next version. Thanks. I do wonder whether we really can have a whole new faulting mechanism just for DAMON. Because if in future, we wanted to change how this worked, we'd be constrained, and it is a very specific user. The issue is you need the PTE to be restored to its previous state, just like NUMA balancing. And I really really do not like this 'oh if you turn it off you can use it for DAMON' thing, it's just really odd and asking for trouble. I think the only _workable_ version of this would be to convert the numa handling to a generic 'fault then restore' type mechanism that could be hooked in to by either NUMA or DAMON. But really I think you'd _need_ to not have significantly different behaviour between the two and _not_ constrain this to being only when NUMA balancing is disabled. But then you'd need to know _this_ PTE is for NUMA balancing vs. another is for this stuff. I'm not really sure there is an upstreamable version of this, but it'd need to be made generic like that if there were. I think it might be worth taking some time to examine whether a version of this that can be sensibly generic (which could have hooks for things) is _possible_ before sending a v2. > > > > Also are we 100% sure that there's no races here? When we disable numa > > balancing do we correctly ensure that absolutely no racing NUMA balancing > > faults can happen whatsever at this juncture? > > Enabling CONFIG_DAMON will not automatically invoke change_protection() with > MM_CP_DAMON. Such invocations will be made only if the user disables NUMA > balancing and run DAMON in the reporting mode. > > So there can be two racy cases. > > If the user enables NUMA balancing and then disables it after a while, page > faults for MM_CP_PROT_NUMA can be handled by DAMON. That could look odd, but > there should be no real problem in practice. DAMON's fault handling will > cleanup the PAGE_NONE protection like NUMA balancing, and DAMON has no problem > at receiving such additional reports from MM_CP_PROT_NUMA-caused faults. DAMON > may show a few more than expected accesses, but that's no problem for DAMON. > > Similarly, if the user starts DAMON in the report mode, stops DAMON, and then > enables NUMA balancing, faults for MM_CP_DAMON that installed while DAMON was > running in the report mode can be handled by NUMA balancing. This should also > not make real problems in practice in my opinion. NUMA balancing could see > more accesses and migrate pages little bit more than expected, but that should > be just for a moment. I'm really concerned about these. We're now introducing unexpected behaviour based on a race and allowing faults to be mis-handled. I'm maybe not as confident as you are that everything will 'just work' and it seems like we're asking for obscure bugs in the future. > > You really have to be 100% certain you're not going to wrongly handle NUMA > > page faults this way, on a potentially non-CONFIG_DAMON kernel to boot. > > I will ensure that never happens on CONFIG_DAMON disabled kernels, in the next > version. Well, in the above you say that you can't help but do that when a race occurs? > > > > > Keep in mind fault handling is verrrry racey (purposefully) and can be done > > under VMA read lock alone (and that's only very loosely a lock!). > > > > This is just, yeah, concerning. > > Thank you for the caution. DAMON's fault handling code only saves the > information in its internal ring buffer. It doesn't touch vmas. So I think > there should be no such problems. I will add the clarification on the next > version. Right, I'm just saying that this all being super racey between NUMA enabled/disabled seems pretty unavoidable, but we covered above. > > > > Secondly, somebody can disable/enable NUMA balancing, and thus you are now > > allowing this function to, when somebody specifies 'DAMON', get NUMA > > balancing fault handling?? > > > > If you don't bother checking whether NUMA balancing is disabled when you > > set it, then boom - you've broken the faulting mechanism, but even if you > > _do_, somebody can just switch it on again... > > As I explained on the two racy cases aboe, faults that caused by DAMON or NUMA > balancing can be handled by the other's handling code, but only for a limited > time under the user's controls. But to my understanding that should not cause > real problems in the practice, and users wouldn't be suggested to operate the > system in such a way. I will add more documents and cautions about that in the > next version. I really don't think a version of the code that results in the wrong handler being used is upstreamable, sorry. I've not dug into the nitty gritty details on what would happen in both cases, but even if it were 100% fine _now_, this is _exactly_ the kind of thing that results in horrible hard-to-debug issues later, should something change. Implicitly 'just having to know' that we might be in the wrong fault handler seems like asking for trouble, and the RoI on an optional profiling tool (this is not to take away from DAMON which is a _great_ utility, I'm saying it's simply not part of the _core_) isn't there.
On Tue, 29 Jul 2025 10:40:11 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Mon, Jul 28, 2025 at 08:06:32PM -0700, SeongJae Park wrote: > > On Mon, 28 Jul 2025 06:19:35 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > > > On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote: > > > > DAMON is using Accessed bits of page table entries as the major source > > > > of the access information. It lacks some additional information such as > > > > which CPU was making the access. Page faults could be another source of > > > > information for such additional information. > > > > > > > > Implement another change_protection() flag for such use case, namely > > > > MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag. > > > > To avoid interfering with NUMA_BALANCING, which is also using PAGE_NON > > > > protection, pass the faults to DAMON only when NUMA_BALANCING is > > > > disabled. > > > > > > > > Signed-off-by: SeongJae Park <sj@kernel.org> > > > > > > This seems to not be an upstreamable series right now unless I'm missing > > > something. > > > > > > Firstly, you're making a change in behaviour even when CONFIG_DAMON is not > > > specified, and Linus already told you we can't have that default-on. > > > > > > I'm very very much not happy with us doing something for 'damon' > > > unconditionaly when !CONFIG_DAMON on the assumption that an acessible > > > mapping has PROT_NONE set. > > > > > > This change makes me nervous in general ANYWAY as you are now changing core > > > mm and introducing a new faulting mechanism specifically for DAMON. > > > > > > And we are assuming that NUMA balancing being disabled is not racey in a > > > way that will cause things to break. > > > > > > I really also dislike the idea of an _implicit_ assumption that we are good > > > to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff. > > > > > > Is it really all that useful to provide a method that requires NUMA > > > balancing to be diabled? > > > > > > Finally, you're making it so DAMON can mprotect() something to use the > > > DAMON/NUMA balancing fault handler, which doesn't appaer to check to see if > > > NUMA balacing is disabled, but anyway it could be re-enabled? > > > > > > And then now DAMON is making stuff fault as NUMA balancing faults > > > incorrectly? > > > > > > This just seems broken. > > > > > > Unless there's really good justification I'm really not inclined for us to > > > merge this as-is right now unfortunately. > > > > Thank you for review and comments, Lorenzo. I fundamentally agree all your > > points. I don't aim to merge this as-is. Actually this patch series is more > > like POC, but apparently I was rushing. I will try to adjust your concerns in > > the next version. > > Thanks. > > I do wonder whether we really can have a whole new faulting mechanism just for > DAMON. Because if in future, we wanted to change how this worked, we'd be > constrained, and it is a very specific user. > > The issue is you need the PTE to be restored to its previous state, just like > NUMA balancing. > > And I really really do not like this 'oh if you turn it off you can use it for > DAMON' thing, it's just really odd and asking for trouble. > > I think the only _workable_ version of this would be to convert the numa > handling to a generic 'fault then restore' type mechanism that could be hooked > in to by either NUMA or DAMON. I agree, and this is my current plan for the next version of this patch. > > But really I think you'd _need_ to not have significantly different behaviour > between the two and _not_ constrain this to being only when NUMA balancing is > disabled. I agree all the points. Especially the current interface is ambiguous and easy to mistake. > > But then you'd need to know _this_ PTE is for NUMA balancing vs. another is for > this stuff. Yes, this would be the ideal. But, memorizing something in page level is ... always an interesting challenge in my opinion, and I have no good idea to do this for now :) > > I'm not really sure there is an upstreamable version of this, but it'd need to > be made generic like that if there were. > > I think it might be worth taking some time to examine whether a version of this > that can be sensibly generic (which could have hooks for things) is _possible_ > before sending a v2. Agreed, I don't need to rush. Let's take time and discuss sufficiently. :) Nonetheless, I may post a followup version of this patch series that contains this one, even before we get a conclusion about this specific one. I think I may have to do that, for sharing the future idea in a way easy to understand and test. I think it might also help us at understanding the real ROI of this patch, and if there is another option to move forward. In the case, I will of course keep RFC tag and clearly note that this patch is still under the discussion and not willing to be merged as is before the discussion is done. > > > > > > > Also are we 100% sure that there's no races here? When we disable numa > > > balancing do we correctly ensure that absolutely no racing NUMA balancing > > > faults can happen whatsever at this juncture? > > > > Enabling CONFIG_DAMON will not automatically invoke change_protection() with > > MM_CP_DAMON. Such invocations will be made only if the user disables NUMA > > balancing and run DAMON in the reporting mode. > > > > So there can be two racy cases. > > > > If the user enables NUMA balancing and then disables it after a while, page > > faults for MM_CP_PROT_NUMA can be handled by DAMON. That could look odd, but > > there should be no real problem in practice. DAMON's fault handling will > > cleanup the PAGE_NONE protection like NUMA balancing, and DAMON has no problem > > at receiving such additional reports from MM_CP_PROT_NUMA-caused faults. DAMON > > may show a few more than expected accesses, but that's no problem for DAMON. > > > > Similarly, if the user starts DAMON in the report mode, stops DAMON, and then > > enables NUMA balancing, faults for MM_CP_DAMON that installed while DAMON was > > running in the report mode can be handled by NUMA balancing. This should also > > not make real problems in practice in my opinion. NUMA balancing could see > > more accesses and migrate pages little bit more than expected, but that should > > be just for a moment. > > I'm really concerned about these. > > We're now introducing unexpected behaviour based on a race and allowing faults > to be mis-handled. > > I'm maybe not as confident as you are that everything will 'just work' and it > seems like we're asking for obscure bugs in the future. > > > > You really have to be 100% certain you're not going to wrongly handle NUMA > > > page faults this way, on a potentially non-CONFIG_DAMON kernel to boot. > > > > I will ensure that never happens on CONFIG_DAMON disabled kernels, in the next > > version. > > Well, in the above you say that you can't help but do that when a race occurs? I mean, I can't help when CONFIG_DAMON is enabled. The race (or, handling faults that caused by other entity) can hppen if and only if all below things happen. 1. CONFIG_DAMON is enbled. 2. CONFIG_NUMA_BALANCING is enabled. 3. The user repeatedly turns on and off NUMA balancing and fault-based mode DAMON in runtime. If any of these are not true, the race can completely avoided. I was saying about the case that the first condition is not met. > > > > > > > > > Keep in mind fault handling is verrrry racey (purposefully) and can be done > > > under VMA read lock alone (and that's only very loosely a lock!). > > > > > > This is just, yeah, concerning. > > > > Thank you for the caution. DAMON's fault handling code only saves the > > information in its internal ring buffer. It doesn't touch vmas. So I think > > there should be no such problems. I will add the clarification on the next > > version. > > Right, I'm just saying that this all being super racey between NUMA > enabled/disabled seems pretty unavoidable, but we covered above. > > > > > > > Secondly, somebody can disable/enable NUMA balancing, and thus you are now > > > allowing this function to, when somebody specifies 'DAMON', get NUMA > > > balancing fault handling?? > > > > > > If you don't bother checking whether NUMA balancing is disabled when you > > > set it, then boom - you've broken the faulting mechanism, but even if you > > > _do_, somebody can just switch it on again... > > > > As I explained on the two racy cases aboe, faults that caused by DAMON or NUMA > > balancing can be handled by the other's handling code, but only for a limited > > time under the user's controls. But to my understanding that should not cause > > real problems in the practice, and users wouldn't be suggested to operate the > > system in such a way. I will add more documents and cautions about that in the > > next version. > > I really don't think a version of the code that results in the wrong handler > being used is upstreamable, sorry. > > I've not dug into the nitty gritty details on what would happen in both cases, > but even if it were 100% fine _now_, this is _exactly_ the kind of thing that > results in horrible hard-to-debug issues later, should something change. > > Implicitly 'just having to know' that we might be in the wrong fault handler > seems like asking for trouble, and the RoI on an optional profiling tool (this > is not to take away from DAMON which is a _great_ utility, I'm saying it's > simply not part of the _core_) isn't there. I completely understand your concerns, thank you for nicely and patiently keeping this discussion. I don't need to upstream this in short term, and open to every option. So let's take sufficient time and discussions. I will take more time to think about a few potential options to move forward, and share those with a level of details that can help us easily further discuss, hopefully in a few days. Thanks, SJ
On Tue, Jul 29, 2025 at 09:21:06PM -0700, SeongJae Park wrote: > > > Thank you for review and comments, Lorenzo. I fundamentally agree all your > > > points. I don't aim to merge this as-is. Actually this patch series is more > > > like POC, but apparently I was rushing. I will try to adjust your concerns in > > > the next version. > > > > Thanks. > > > > I do wonder whether we really can have a whole new faulting mechanism just for > > DAMON. Because if in future, we wanted to change how this worked, we'd be > > constrained, and it is a very specific user. > > > > The issue is you need the PTE to be restored to its previous state, just like > > NUMA balancing. > > > > And I really really do not like this 'oh if you turn it off you can use it for > > DAMON' thing, it's just really odd and asking for trouble. > > > > I think the only _workable_ version of this would be to convert the numa > > handling to a generic 'fault then restore' type mechanism that could be hooked > > in to by either NUMA or DAMON. > > I agree, and this is my current plan for the next version of this patch. I think we'd need some way to embed this information somehow, but I'll see what you come up with :) > > > > > But really I think you'd _need_ to not have significantly different behaviour > > between the two and _not_ constrain this to being only when NUMA balancing is > > disabled. > > I agree all the points. Especially the current interface is ambiguous and easy > to mistake. > > > > > But then you'd need to know _this_ PTE is for NUMA balancing vs. another is for > > this stuff. > > Yes, this would be the ideal. But, memorizing something in page level is ... > always an interesting challenge in my opinion, and I have no good idea to do > this for now :) Yes this is the crux, or another approach will need to be taken. > > > > > I'm not really sure there is an upstreamable version of this, but it'd need to > > be made generic like that if there were. > > > > I think it might be worth taking some time to examine whether a version of this > > that can be sensibly generic (which could have hooks for things) is _possible_ > > before sending a v2. > > Agreed, I don't need to rush. Let's take time and discuss sufficiently. :) Worth just playing wtih different solution. > > Nonetheless, I may post a followup version of this patch series that contains > this one, even before we get a conclusion about this specific one. I think I > may have to do that, for sharing the future idea in a way easy to understand > and test. I think it might also help us at understanding the real ROI of this > patch, and if there is another option to move forward. In the case, I will of > course keep RFC tag and clearly note that this patch is still under the > discussion and not willing to be merged as is before the discussion is done. OK, as long as you please highlight this, indicating that you ack it's not mergeable with the current approach. > > Well, in the above you say that you can't help but do that when a race occurs? > > I mean, I can't help when CONFIG_DAMON is enabled. > > The race (or, handling faults that caused by other entity) can hppen if and > only if all below things happen. > > 1. CONFIG_DAMON is enbled. > 2. CONFIG_NUMA_BALANCING is enabled. > 3. The user repeatedly turns on and off NUMA balancing and fault-based mode > DAMON in runtime. > > If any of these are not true, the race can completely avoided. I was saying > about the case that the first condition is not met. Well not necessarily continually, just at any stage. I do think you're limiting yourself rather by requiring NUMA balancing to be switched off. But in any case, for reasons previously mentioned, I don't like this approach and I don't think it's at all workable, rather we either need a generic 'restore PTE state' mechanism (that can somehow store 'both OR either' NUMA/DAMON state with them), or an entirely different approach. Another idea might be that of converting the current NUMA balancing to something generic, whose semantics are such that: - It restores PTE state - It 'happens' to relocate the mapping to the node the CPU from which the access was made resides upon - It also calls a hook. So treat the NUMA balancing aspect as sort of implicit. Then you can just use the machinery for marking ranges for NUMA balancing for this, having defined a hook. If you're indifferent as to whether you get the hook called for arbitrary NUMA balancing faults, or have a way to identify whether or not those mappings were marked for DAMON stats, then you have what you need without having to customise _anything_ in the general faulting code for DAMON. Instead you'd just set up the hook for DAMON. This is just an untested and highly theoretical thought :) > > > > I really don't think a version of the code that results in the wrong handler > > being used is upstreamable, sorry. > > > > I've not dug into the nitty gritty details on what would happen in both cases, > > but even if it were 100% fine _now_, this is _exactly_ the kind of thing that > > results in horrible hard-to-debug issues later, should something change. > > > > Implicitly 'just having to know' that we might be in the wrong fault handler > > seems like asking for trouble, and the RoI on an optional profiling tool (this > > is not to take away from DAMON which is a _great_ utility, I'm saying it's > > simply not part of the _core_) isn't there. > > I completely understand your concerns, thank you for nicely and patiently > keeping this discussion. I don't need to upstream this in short term, and open > to every option. So let's take sufficient time and discussions. > > I will take more time to think about a few potential options to move forward, > and share those with a level of details that can help us easily further > discuss, hopefully in a few days. Thanks!
© 2016 - 2025 Red Hat, Inc.