From: Bijan Tabatabai <bijantabatab@micron.com>

Make policy_nodemask() available outside mm/mempolicy.c so that DAMON can
call it to determine where to place a page when interleaving.

Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
---
 include/linux/mempolicy.h | 9 +++++++++
 mm/mempolicy.c            | 4 +---
 2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 0fe96f3ab3ef..e96bf493ff7a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
unsigned long addr, int order, pgoff_t *ilx);
bool vma_policy_mof(struct vm_area_struct *vma);
+nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+ pgoff_t ilx, int *nid);

extern void numa_default_policy(void);
extern void numa_policy_init(void);
@@ -232,6 +234,13 @@ static inline struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
return NULL;
}

+static inline nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+ pgoff_t ilx, int *nid)
+{
+ *nid = NUMA_NO_NODE;
+ return NULL;
+}
+
static inline int
vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3b1dfd08338b..54f539497e20 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -596,8 +596,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {

static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
unsigned long flags);
-static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
- pgoff_t ilx, int *nid);

static bool strictly_unmovable(unsigned long flags)
{
@@ -2195,7 +2193,7 @@ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
* Return a nodemask representing a mempolicy for filtering nodes for
* page allocation, together with preferred node id (or the input node id).
*/
-static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
pgoff_t ilx, int *nid)
{
nodemask_t *nodemask = NULL;
--
2.43.5
On 12.06.25 20:13, Bijan Tabatabai wrote:
> From: Bijan Tabatabai <bijantabatab@micron.com>
>
[...]
> -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> 		pgoff_t ilx, int *nid)
> {
> 	nodemask_t *nodemask = NULL;

You actually only care about the nid for your use case.

Maybe we should add

get_vma_policy_node() that internally does a get_vma_policy() to then
give you only the node back.

If get_vma_policy() is not the right thing (see my reply to patch #2),
of course a get_task_policy_node() could be added.

-- 
Cheers,

David / dhildenb
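As a rough sketch of the helper David describes (the name and signature
below are hypothetical, and it assumes policy_nodemask() is made
non-static, as this patch does):

#include <linux/mempolicy.h>
#include <linux/mm.h>

/*
 * Illustrative only: resolve the policy for a VMA range and hand back
 * just the preferred node id, hiding the nodemask plumbing.
 */
static int get_vma_policy_node(struct vm_area_struct *vma,
			       unsigned long addr, int order, gfp_t gfp)
{
	struct mempolicy *pol;
	pgoff_t ilx;
	int nid;

	pol = get_vma_policy(vma, addr, order, &ilx);
	policy_nodemask(gfp, pol, ilx, &nid);
	mpol_cond_put(pol);	/* drop the reference taken for shared policies */

	return nid;
}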
On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.06.25 20:13, Bijan Tabatabai wrote:
[...]
> You actually only care about the nid for your use case.
>
> Maybe we should add
>
> get_vma_policy_node() that internally does a get_vma_policy() to then
> give you only the node back.
>
> If get_vma_policy() is not the right thing (see my reply to patch #2),
> of course a get_task_policy_node() could be added.
>
> -- 
> Cheers,
>
> David / dhildenb

Hi David,

I did not use get_vma_policy() or mpol_misplaced(), which I believe is the
closest existing function to what I want in this patch, because those
functions seem to assume they are called from within the task that the
folio/VMA is mapped into. More specifically, mpol_misplaced() assumes it
is being called within a page fault. This doesn't work for us, because we
call it inside of a kdamond process.

I would be open to adding a new function that takes a folio, VMA, address,
and task_struct and returns the nid the folio should be placed on. It
could possibly be implemented as a function internal to mpol_misplaced(),
because the two would be very similar.

How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY in this
function? mpol_misplaced() chooses a nid based on the node and CPU the
fault occurred on, which we wouldn't have in a kdamond context. The two
options I see are either:
1. return the nid of the first node in the policy's nodemask, or
2. return NUMA_NO_NODE.
I think I would lean towards the first.

Thanks,
Bijan
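For illustration, the two options above could look roughly like this for a
helper that is handed an already-resolved mempolicy (the helper name is
made up):

#include <linux/mempolicy.h>
#include <linux/nodemask.h>
#include <linux/numa.h>

/* Option 1: take the first allowed node; option 2 would return NUMA_NO_NODE. */
static int damon_pick_policy_nid(struct mempolicy *pol)
{
	switch (pol->mode) {
	case MPOL_BIND:
	case MPOL_PREFERRED_MANY:
		return first_node(pol->nodes);
	default:
		return NUMA_NO_NODE;
	}
}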
Bijan Tabatabai <bijan311@gmail.com> writes:

> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.06.25 20:13, Bijan Tabatabai wrote:
[...]
> How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY in this
> function? mpol_misplaced() chooses a nid based on the node and CPU the
> fault occurred on, which we wouldn't have in a kdamond context. The two
> options I see are either:
> 1. return the nid of the first node in the policy's nodemask, or
> 2. return NUMA_NO_NODE.
> I think I would lean towards the first.

You can try numa_node_id() first, then fall back to the first nid in the
nodemask.

---
Best Regards,
Huang, Ying
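Ying's suggestion might look roughly like the following (again a
hypothetical helper, not existing code): prefer the node the calling
thread is running on if the policy allows it, otherwise fall back to the
first node in the mask.

#include <linux/mempolicy.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

static int damon_pick_bind_nid(struct mempolicy *pol)
{
	int nid = numa_node_id();	/* node of the CPU running kdamond */

	if (node_isset(nid, pol->nodes))
		return nid;
	return first_node(pol->nodes);
}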
On 13.06.25 18:33, Bijan Tabatabai wrote:
> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.06.25 20:13, Bijan Tabatabai wrote:
[...]
> Hi David,

Hi,

> I did not use get_vma_policy() or mpol_misplaced(), which I believe is the
> closest existing function to what I want in this patch, because those
> functions

I think what you mean is, that you are performing an rmap walk. But
there, you do have a VMA + MM available (stable).

> seem to assume they are called from within the task that the folio/VMA
> is mapped into.

But, we do have a VMA at hand, so why would we want to ignore any set
policy? (I think VMA policies so far only apply to shmem, but still).

I really think you want to use get_vma_policy() instead of the task policy.

> More specifically, mpol_misplaced() assumes it is being called within a
> page fault. This doesn't work for us, because we call it inside of a
> kdamond process.

Right.

But it uses the vmf only for ...

1) Obtaining the VMA
2) Sanity-checking that the ptlock is held.

Which, you also have during the rmap walk.

So what about factoring out that handling from mpol_misplaced(), having
another function where you pass the VMA instead of the vmf?

> I would be open to adding a new function that takes a folio, VMA, address,
> and task_struct and returns the nid the folio should be placed on. It
> could possibly be implemented as a function internal to mpol_misplaced(),
> because the two would be very similar.

Good, you had the same thought :)

> How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY in this
> function? mpol_misplaced() chooses a nid based on the node and CPU the
> fault occurred on, which we wouldn't have in a kdamond context. The two
> options I see are either:
> 1. return the nid of the first node in the policy's nodemask, or
> 2. return NUMA_NO_NODE.
> I think I would lean towards the first.

I guess we'd need a way for your new helper to deal with both cases
(is_fault vs. !is_fault), and make a decision based on that.

For your use case, you can then decide what would be appropriate. It's a
good question what the appropriate action would be: 1) sounds better,
but I do wonder if we would rather want to distribute the folios in a
different way across applicable nodes, not sure ...

-- 
Cheers,

David / dhildenb
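The factoring David suggests could be sketched like this; the helper name
is invented, and anything fault-specific (such as the NUMA-balancing
cpupid update raised later in the thread) would stay with the fault-path
caller:

#include <linux/mm.h>
#include <linux/mempolicy.h>

/* Hypothetical split: the policy decision takes a VMA, not a vmf. */
int mpol_misplaced_vma(struct folio *folio, struct vm_area_struct *vma,
		       unsigned long addr)
{
	/* ... the bulk of today's mpol_misplaced() body would live here ... */
	return NUMA_NO_NODE;
}

int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
		   unsigned long addr)
{
	/* Fault path: keep fault-only handling here, then reuse the helper. */
	return mpol_misplaced_vma(folio, vmf->vma, addr);
}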
On Mon, Jun 16, 2025 at 4:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.06.25 18:33, Bijan Tabatabai wrote:
[...]
> I think what you mean is, that you are performing an rmap walk. But
> there, you do have a VMA + MM available (stable).
>
> But, we do have a VMA at hand, so why would we want to ignore any set
> policy? (I think VMA policies so far only apply to shmem, but still).
>
> I really think you want to use get_vma_policy() instead of the task policy.

Sorry, I think I misunderstood you before. You are right, we should
consider the VMA policy before using the task policy. I will do this in
the next revision.

> But it uses the vmf only for ...
>
> 1) Obtaining the VMA
> 2) Sanity-checking that the ptlock is held.
>
> Which, you also have during the rmap walk.

There is another subtle dependency in get_vma_policy(): it first checks
whether a VMA policy exists, and if it doesn't, it uses the task policy
of the current task, which doesn't make sense when called from a kdamond
thread.

However, I don't think this changes what seems to be our consensus of
adding a new helper function.

> So what about factoring out that handling from mpol_misplaced(), having
> another function where you pass the VMA instead of the vmf?
>
> Good, you had the same thought :)
>
> I guess we'd need a way for your new helper to deal with both cases
> (is_fault vs. !is_fault), and make a decision based on that.
>
> For your use case, you can then decide what would be appropriate. It's a
> good question what the appropriate action would be: 1) sounds better,
> but I do wonder if we would rather want to distribute the folios in a
> different way across applicable nodes, not sure ...

Yes, I was thinking about that too, but I felt that adding state somewhere
or using randomness to distribute the folios was incorrect, especially
since those policies are not the focus of this patchset. I think I'll move
forward with option 1 for now.

Thanks,
Bijan
On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
> > Which, you also have during the rmap walk.
>
> There is another subtle dependency in get_vma_policy(): it first checks
> whether a VMA policy exists, and if it doesn't, it uses the task policy
> of the current task, which doesn't make sense when called from a kdamond
> thread.
>
> However, I don't think this changes what seems to be our consensus of
> adding a new helper function.

Hate to interject here, but this gets worse the further you dig in. The
mempolicy interface as a whole has many, many, many hidden references to
current->mempolicy and current->mems_allowed. External references to a
task or VMA mempolicy are a problem I explored somewhat extensively when
I prototyped `set_mempolicy2()`. It did not go well.

Generally, mempolicy is not well structured to allow external actors on
a task's mempolicy. Accessing a task's mempolicy requires operating in
a task's context or at least taking a reference on that task's mempolicy
(which requires taking the task_struct lock).

I will just say that mempolicy is *extremely* current-task centric - and
very much allocation-time centric (i.e. the internal workings don't
really want to consider migration all that much). You'll probably find
that this project requires rethinking mempolicy's external interfaces in
general (which is sorely needed anyway).

I think this path to modifying mempolicy to support DAMON is a bit
ambitious for where mempolicy is at the moment. You may be better off
duplicating the interleave-weight logic and making some helper functions
to get the weight data, and then coming back around to generalize it
later.

~Gregory
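For reference, the kind of locking Gregory is pointing at when reading
another task's mempolicy from outside that task's context (a sketch with a
made-up name, not an existing helper):

#include <linux/mempolicy.h>
#include <linux/sched.h>
#include <linux/sched/task.h>

static struct mempolicy *damon_get_task_mempolicy(struct task_struct *task)
{
	struct mempolicy *pol;

	task_lock(task);
	pol = task->mempolicy;
	mpol_get(pol);		/* pin it so it cannot be freed under us */
	task_unlock(task);

	return pol;		/* caller releases it with mpol_put() */
}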
Hi Gregory,

On Mon, Jun 16, 2025 at 12:43 PM Gregory Price <gourry@gourry.net> wrote:
>
> On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
[...]
> Generally, mempolicy is not well structured to allow external actors on
> a task's mempolicy. Accessing a task's mempolicy requires operating in
> a task's context or at least taking a reference on that task's mempolicy
> (which requires taking the task_struct lock).

Good point, I didn't take the lock in the second patch. Also, this made me
realize I need to make sure there isn't a race condition where a task
exits after we get a pointer to its task_struct from mm->owner.

> I will just say that mempolicy is *extremely* current-task centric - and
> very much allocation-time centric (i.e. the internal workings don't
> really want to consider migration all that much). You'll probably find
> that this project requires rethinking mempolicy's external interfaces in
> general (which is sorely needed anyway).
>
> I think this path to modifying mempolicy to support DAMON is a bit
> ambitious for where mempolicy is at the moment. You may be better off
> duplicating the interleave-weight logic and making some helper functions
> to get the weight data, and then coming back around to generalize it
> later.

This may be true, but I think I will be able to avoid a lot of this
nastiness with what I need. I am going to try the mempolicy approach for
the next revision, but if I get too much resistance, I will probably
switch to this approach.

Thanks,
Bijan
On Mon, 16 Jun 2025 17:16:16 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> Hi Gregory,
>
> On Mon, Jun 16, 2025 at 12:43 PM Gregory Price <gourry@gourry.net> wrote:
> >
> > On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
[...]
> > Hate to interject here,

Please don't hesitate :)

[...]
> > I will just say that mempolicy is *extremely* current-task centric - and
> > very much allocation-time centric (i.e. the internal workings don't
> > really want to consider migration all that much). You'll probably find
> > that this project requires rethinking mempolicy's external interfaces in
> > general (which is sorely needed anyway).
> >
> > I think this path to modifying mempolicy to support DAMON is a bit
> > ambitious for where mempolicy is at the moment. You may be better off
> > duplicating the interleave-weight logic and making some helper functions
> > to get the weight data, and then coming back around to generalize it
> > later.

Thank you for the nice clarification and opinion, Gregory.

> This may be true, but I think I will be able to avoid a lot of this
> nastiness with what I need. I am going to try the mempolicy approach for
> the next revision, but if I get too much resistance, I will probably
> switch to this approach.

I have no strong opinion about the use of mempolicy for now, as long as
the mempolicy folks are fine with it.

Nonetheless, I just wanted to mention that Gregory's suggestion also
sounds fairly good to me. It would avoid unnecessary coupling of the
concepts of allocation-time interleaving and after-allocation migration.
Also it feels even more aligned with a potential future extension of this
project that we discussed[1]: letting users set multiple target nodes for
DAMOS_MIGRATE_{HOT,COLD} with arbitrary weights.

[1] https://lore.kernel.org/20250613171237.44776-1-sj@kernel.org

Thanks,
SJ

[...]
On Tue, Jun 17, 2025 at 1:58 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Mon, 16 Jun 2025 17:16:16 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
[...]
> I have no strong opinion about the use of mempolicy for now, as long as
> the mempolicy folks are fine with it.
>
> Nonetheless, I just wanted to mention that Gregory's suggestion also
> sounds fairly good to me. It would avoid unnecessary coupling of the
> concepts of allocation-time interleaving and after-allocation migration.
> Also it feels even more aligned with a potential future extension of this
> project that we discussed[1]: letting users set multiple target nodes for
> DAMOS_MIGRATE_{HOT,COLD} with arbitrary weights.
>
> [1] https://lore.kernel.org/20250613171237.44776-1-sj@kernel.org

Given this discussion, as well as Joshua's comments earlier [1], it sounds
like while people aren't exactly opposed to using mempolicy for this, the
building consensus is that it would be best not to. I will move the
interleave logic into DAMON for the next revision. However, I still think
it makes sense to use the global weights (probably via get_il_weight())
for now, to avoid allocating pages a certain way and then migrating them
soon after.

I'll try to send the next version of the patch set by the end of the week.

Thanks everyone for the feedback,
Bijan

[1] https://lore.kernel.org/linux-mm/20250613152517.225529-1-joshua.hahnjy@gmail.com/

[...]
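A sketch of what an interleave decision inside DAMON might look like;
everything here is an assumption for illustration (the helper name, the
weights array standing in for the global interleave weights that
get_il_weight() reads, and using a per-folio index to spread pages), not
the actual implementation:

#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/types.h>

static int damos_pick_interleave_nid(unsigned long idx, const u8 *weights,
				     const nodemask_t *nodes)
{
	unsigned int total = 0, pos;
	int nid;

	for_each_node_mask(nid, *nodes)
		total += weights[nid];
	if (!total)
		return NUMA_NO_NODE;

	pos = idx % total;	/* deterministic slot for this folio */
	for_each_node_mask(nid, *nodes) {
		if (pos < weights[nid])
			return nid;
		pos -= weights[nid];
	}
	return NUMA_NO_NODE;
}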
On Tue, 17 Jun 2025 14:54:39 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
[...]
> Given this discussion, as well as Joshua's comments earlier [1], it sounds
> like while people aren't exactly opposed to using mempolicy for this, the
> building consensus is that it would be best not to. I will move the
> interleave logic into DAMON for the next revision. However, I still think
> it makes sense to use the global weights (probably via get_il_weight())
> for now, to avoid allocating pages a certain way and then migrating them
> soon after.

Makes sense to me :)

> I'll try to send the next version of the patch set by the end of the week.

Looking forward to it!

Thanks,
SJ

[...]
On 16.06.25 16:16, Bijan Tabatabai wrote:
> On Mon, Jun 16, 2025 at 4:46 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 13.06.25 18:33, Bijan Tabatabai wrote:
[...]
> There is another subtle dependency in get_vma_policy(): it first checks
> whether a VMA policy exists, and if it doesn't, it uses the task policy
> of the current task, which doesn't make sense when called from a kdamond
> thread.

Indeed, anything that depends on current task / nid / cpu might require
care.

-- 
Cheers,

David / dhildenb
David Hildenbrand <david@redhat.com> writes:

> On 13.06.25 18:33, Bijan Tabatabai wrote:
>> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 12.06.25 20:13, Bijan Tabatabai wrote:
[snip]
>> More specifically, mpol_misplaced() assumes it is being called within a
>> page fault. This doesn't work for us, because we call it inside of a
>> kdamond process.
>
> Right.
>
> But it uses the vmf only for ...
>
> 1) Obtaining the VMA
> 2) Sanity-checking that the ptlock is held.

3) update NUMA balancing per-folio cpupid state (via
   should_numa_migrate_memory()). This needs to be called by the NUMA
   page fault handler.

> Which, you also have during the rmap walk.

[snip]

---
Best Regards,
Huang, Ying
On 16.06.25 13:02, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
>
>> On 13.06.25 18:33, Bijan Tabatabai wrote:
[...]
>> But it uses the vmf only for ...
>>
>> 1) Obtaining the VMA
>> 2) Sanity-checking that the ptlock is held.
>
> 3) update NUMA balancing per-folio cpupid state (via
>    should_numa_migrate_memory()).

How is the vmf used for that ("it uses the vmf only for ...")?

But yes, anything that relies on "thiscpu" and "thisnid" might be
fault-specific as well (what I mentioned further below).

-- 
Cheers,

David / dhildenb