From: Bijan Tabatabai <bijantabatab@micron.com>

Make policy_nodemask() available outside mm/mempolicy.c so that DAMON can
call it to determine where to place a page when interleaving.

Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
---
 include/linux/mempolicy.h | 9 +++++++++
 mm/mempolicy.c            | 4 +---
 2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 0fe96f3ab3ef..e96bf493ff7a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
unsigned long addr, int order, pgoff_t *ilx);
bool vma_policy_mof(struct vm_area_struct *vma);
+nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+ pgoff_t ilx, int *nid);

extern void numa_default_policy(void);
extern void numa_policy_init(void);
@@ -232,6 +234,13 @@ static inline struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
return NULL;
}

+static inline nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+ pgoff_t ilx, int *nid)
+{
+ *nid = NUMA_NO_NODE;
+ return NULL;
+}
+
static inline int
vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3b1dfd08338b..54f539497e20 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -596,8 +596,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {

static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
unsigned long flags);
-static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
- pgoff_t ilx, int *nid);

static bool strictly_unmovable(unsigned long flags)
{
@@ -2195,7 +2193,7 @@ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
* Return a nodemask representing a mempolicy for filtering nodes for
* page allocation, together with preferred node id (or the input node id).
*/
-static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
+nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
pgoff_t ilx, int *nid)
{
nodemask_t *nodemask = NULL;
--
2.43.5
On 12.06.25 20:13, Bijan Tabatabai wrote:
> From: Bijan Tabatabai <bijantabatab@micron.com>
>
[...]
> -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
> 		pgoff_t ilx, int *nid)
> {
> 	nodemask_t *nodemask = NULL;

You actually only care about the nid for your use case.

Maybe we should add

get_vma_policy_node() that internally does a get_vma_policy() to then
give you only the node back.

If get_vma_policy() is not the right thing (see my reply to patch #2),
of course a get_task_policy_node() could be added.

-- 
Cheers,

David / dhildenb
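As a rough sketch of the helper David describes (the name and signature
below are hypothetical, and it assumes policy_nodemask() is made
non-static, as this patch does):

#include <linux/mempolicy.h>
#include <linux/mm.h>

/*
 * Illustrative only: resolve the policy for a VMA range and hand back
 * just the preferred node id, hiding the nodemask plumbing.
 */
static int get_vma_policy_node(struct vm_area_struct *vma,
			       unsigned long addr, int order, gfp_t gfp)
{
	struct mempolicy *pol;
	pgoff_t ilx;
	int nid;

	pol = get_vma_policy(vma, addr, order, &ilx);
	policy_nodemask(gfp, pol, ilx, &nid);
	mpol_cond_put(pol);	/* drop the reference taken for shared policies */

	return nid;
}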
On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.06.25 20:13, Bijan Tabatabai wrote:
[...]
> You actually only care about the nid for your use case.
>
> Maybe we should add
>
> get_vma_policy_node() that internally does a get_vma_policy() to then
> give you only the node back.
>
> If get_vma_policy() is not the right thing (see my reply to patch #2),
> of course a get_task_policy_node() could be added.
>
> -- 
> Cheers,
>
> David / dhildenb

Hi David,

I did not use get_vma_policy() or mpol_misplaced(), which I believe is the
closest existing function to what I want in this patch, because those
functions seem to assume they are called from within the task that the
folio/VMA is mapped into. More specifically, mpol_misplaced() assumes it
is being called within a page fault. This doesn't work for us, because we
call it inside of a kdamond process.

I would be open to adding a new function that takes a folio, VMA, address,
and task_struct and returns the nid the folio should be placed on. It
could possibly be implemented as a function internal to mpol_misplaced(),
because the two would be very similar.

How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY in this
function? mpol_misplaced() chooses a nid based on the node and CPU the
fault occurred on, which we wouldn't have in a kdamond context. The two
options I see are either:
1. return the nid of the first node in the policy's nodemask, or
2. return NUMA_NO_NODE.
I think I would lean towards the first.

Thanks,
Bijan
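For illustration, the two options above could look roughly like this for a
helper that is handed an already-resolved mempolicy (the helper name is
made up):

#include <linux/mempolicy.h>
#include <linux/nodemask.h>
#include <linux/numa.h>

/* Option 1: take the first allowed node; option 2 would return NUMA_NO_NODE. */
static int damon_pick_policy_nid(struct mempolicy *pol)
{
	switch (pol->mode) {
	case MPOL_BIND:
	case MPOL_PREFERRED_MANY:
		return first_node(pol->nodes);
	default:
		return NUMA_NO_NODE;
	}
}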
Bijan Tabatabai <bijan311@gmail.com> writes:

> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.06.25 20:13, Bijan Tabatabai wrote:
[...]
> How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY in this
> function? mpol_misplaced() chooses a nid based on the node and CPU the
> fault occurred on, which we wouldn't have in a kdamond context. The two
> options I see are either:
> 1. return the nid of the first node in the policy's nodemask, or
> 2. return NUMA_NO_NODE.
> I think I would lean towards the first.

You can try numa_node_id() first, then fall back to the first nid in the
nodemask.

---
Best Regards,
Huang, Ying
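Ying's suggestion might look roughly like the following (again a
hypothetical helper, not existing code): prefer the node the calling
thread is running on if the policy allows it, otherwise fall back to the
first node in the mask.

#include <linux/mempolicy.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

static int damon_pick_bind_nid(struct mempolicy *pol)
{
	int nid = numa_node_id();	/* node of the CPU running kdamond */

	if (node_isset(nid, pol->nodes))
		return nid;
	return first_node(pol->nodes);
}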
On 13.06.25 18:33, Bijan Tabatabai wrote:
> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.06.25 20:13, Bijan Tabatabai wrote:
[...]
> Hi David,

Hi,

> I did not use get_vma_policy() or mpol_misplaced(), which I believe is the
> closest existing function to what I want in this patch, because those
> functions

I think what you mean is, that you are performing an rmap walk. But
there, you do have a VMA + MM available (stable).

> seem to assume they are called from within the task that the folio/VMA
> is mapped into.

But, we do have a VMA at hand, so why would we want to ignore any set
policy? (I think VMA policies so far only apply to shmem, but still).

I really think you want to use get_vma_policy() instead of the task policy.

> More specifically, mpol_misplaced() assumes it is being called within a
> page fault. This doesn't work for us, because we call it inside of a
> kdamond process.

Right.

But it uses the vmf only for ...

1) Obtaining the VMA
2) Sanity-checking that the ptlock is held.

Which, you also have during the rmap walk.

So what about factoring out that handling from mpol_misplaced(), having
another function where you pass the VMA instead of the vmf?

> I would be open to adding a new function that takes a folio, VMA, address,
> and task_struct and returns the nid the folio should be placed on. It
> could possibly be implemented as a function internal to mpol_misplaced(),
> because the two would be very similar.

Good, you had the same thought :)

> How would you propose we handle MPOL_BIND and MPOL_PREFERRED_MANY in this
> function? mpol_misplaced() chooses a nid based on the node and CPU the
> fault occurred on, which we wouldn't have in a kdamond context. The two
> options I see are either:
> 1. return the nid of the first node in the policy's nodemask, or
> 2. return NUMA_NO_NODE.
> I think I would lean towards the first.

I guess we'd need a way for your new helper to deal with both cases
(is_fault vs. !is_fault), and make a decision based on that.

For your use case, you can then decide what would be appropriate. It's a
good question what the appropriate action would be: 1) sounds better,
but I do wonder if we would rather want to distribute the folios in a
different way across applicable nodes, not sure ...

-- 
Cheers,

David / dhildenb
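The factoring David suggests could be sketched like this; the helper name
is invented, and anything fault-specific (such as the NUMA-balancing
cpupid update raised later in the thread) would stay with the fault-path
caller:

#include <linux/mm.h>
#include <linux/mempolicy.h>

/* Hypothetical split: the policy decision takes a VMA, not a vmf. */
int mpol_misplaced_vma(struct folio *folio, struct vm_area_struct *vma,
		       unsigned long addr)
{
	/* ... the bulk of today's mpol_misplaced() body would live here ... */
	return NUMA_NO_NODE;
}

int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
		   unsigned long addr)
{
	/* Fault path: keep fault-only handling here, then reuse the helper. */
	return mpol_misplaced_vma(folio, vmf->vma, addr);
}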
On Mon, Jun 16, 2025 at 4:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.06.25 18:33, Bijan Tabatabai wrote:
[...]
> I think what you mean is, that you are performing an rmap walk. But
> there, you do have a VMA + MM available (stable).
>
> But, we do have a VMA at hand, so why would we want to ignore any set
> policy? (I think VMA policies so far only apply to shmem, but still).
>
> I really think you want to use get_vma_policy() instead of the task policy.

Sorry, I think I misunderstood you before. You are right, we should
consider the VMA policy before using the task policy. I will do this in
the next revision.

> But it uses the vmf only for ...
>
> 1) Obtaining the VMA
> 2) Sanity-checking that the ptlock is held.
>
> Which, you also have during the rmap walk.

There is another subtle dependency in get_vma_policy(): it first checks
whether a VMA policy exists, and if it doesn't, it uses the task policy
of the current task, which doesn't make sense when called from a kdamond
thread.

However, I don't think this changes what seems to be our consensus of
adding a new helper function.

> So what about factoring out that handling from mpol_misplaced(), having
> another function where you pass the VMA instead of the vmf?
>
> Good, you had the same thought :)
>
> I guess we'd need a way for your new helper to deal with both cases
> (is_fault vs. !is_fault), and make a decision based on that.
>
> For your use case, you can then decide what would be appropriate. It's a
> good question what the appropriate action would be: 1) sounds better,
> but I do wonder if we would rather want to distribute the folios in a
> different way across applicable nodes, not sure ...

Yes, I was thinking about that too, but I felt that adding state somewhere
or using randomness to distribute the folios was incorrect, especially
since those policies are not the focus of this patchset. I think I'll move
forward with option 1 for now.

Thanks,
Bijan
On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
> > Which, you also have during the rmap walk.
>
> There is another subtle dependency in get_vma_policy(): it first checks
> whether a VMA policy exists, and if it doesn't, it uses the task policy
> of the current task, which doesn't make sense when called from a kdamond
> thread.
>
> However, I don't think this changes what seems to be our consensus of
> adding a new helper function.

Hate to interject here, but this gets worse the further you dig in. The
mempolicy interface as a whole has many, many, many hidden references to
current->mempolicy and current->mems_allowed. External references to a
task or VMA mempolicy are a problem I explored somewhat extensively when
I prototyped `set_mempolicy2()`. It did not go well.

Generally, mempolicy is not well structured to allow external actors on
a task's mempolicy. Accessing a task's mempolicy requires operating in
a task's context or at least taking a reference on that task's mempolicy
(which requires taking the task_struct lock).

I will just say that mempolicy is *extremely* current-task centric - and
very much allocation-time centric (i.e. the internal workings don't
really want to consider migration all that much). You'll probably find
that this project requires rethinking mempolicy's external interfaces in
general (which is sorely needed anyway).

I think this path to modifying mempolicy to support DAMON is a bit
ambitious for where mempolicy is at the moment. You may be better off
duplicating the interleave-weight logic and making some helper functions
to get the weight data, and then coming back around to generalize it
later.

~Gregory
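For reference, the kind of locking Gregory is pointing at when reading
another task's mempolicy from outside that task's context (a sketch with a
made-up name, not an existing helper):

#include <linux/mempolicy.h>
#include <linux/sched.h>
#include <linux/sched/task.h>

static struct mempolicy *damon_get_task_mempolicy(struct task_struct *task)
{
	struct mempolicy *pol;

	task_lock(task);
	pol = task->mempolicy;
	mpol_get(pol);		/* pin it so it cannot be freed under us */
	task_unlock(task);

	return pol;		/* caller releases it with mpol_put() */
}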
Hi Gregory,

On Mon, Jun 16, 2025 at 12:43 PM Gregory Price <gourry@gourry.net> wrote:
>
> On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
[...]
> Generally, mempolicy is not well structured to allow external actors on
> a task's mempolicy. Accessing a task's mempolicy requires operating in
> a task's context or at least taking a reference on that task's mempolicy
> (which requires taking the task_struct lock).

Good point, I didn't take the lock in the second patch. Also, this made me
realize I need to make sure there isn't a race condition where a task
exits after we get a pointer to its task_struct from mm->owner.

> I will just say that mempolicy is *extremely* current-task centric - and
> very much allocation-time centric (i.e. the internal workings don't
> really want to consider migration all that much). You'll probably find
> that this project requires rethinking mempolicy's external interfaces in
> general (which is sorely needed anyway).
>
> I think this path to modifying mempolicy to support DAMON is a bit
> ambitious for where mempolicy is at the moment. You may be better off
> duplicating the interleave-weight logic and making some helper functions
> to get the weight data, and then coming back around to generalize it
> later.

This may be true, but I think I will be able to avoid a lot of this
nastiness with what I need. I am going to try the mempolicy approach for
the next revision, but if I get too much resistance, I will probably
switch to this approach.

Thanks,
Bijan
On Mon, 16 Jun 2025 17:16:16 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> Hi Gregory,
>
> On Mon, Jun 16, 2025 at 12:43 PM Gregory Price <gourry@gourry.net> wrote:
> >
> > On Mon, Jun 16, 2025 at 09:16:55AM -0500, Bijan Tabatabai wrote:
[...]
> > Hate to interject here,

Please don't hesitate :)

[...]
> > I will just say that mempolicy is *extremely* current-task centric - and
> > very much allocation-time centric (i.e. the internal workings don't
> > really want to consider migration all that much). You'll probably find
> > that this project requires rethinking mempolicy's external interfaces in
> > general (which is sorely needed anyway).
> >
> > I think this path to modifying mempolicy to support DAMON is a bit
> > ambitious for where mempolicy is at the moment. You may be better off
> > duplicating the interleave-weight logic and making some helper functions
> > to get the weight data, and then coming back around to generalize it
> > later.

Thank you for the nice clarification and opinion, Gregory.

> This may be true, but I think I will be able to avoid a lot of this
> nastiness with what I need. I am going to try the mempolicy approach for
> the next revision, but if I get too much resistance, I will probably
> switch to this approach.

I have no strong opinion about the use of mempolicy for now, as long as
the mempolicy folks are fine with it.

Nonetheless, I just wanted to mention that Gregory's suggestion also
sounds fairly good to me. It would avoid unnecessary coupling of the
concepts of allocation-time interleaving and after-allocation migration.
Also it feels even more aligned with a potential future extension of this
project that we discussed[1]: letting users set multiple target nodes for
DAMOS_MIGRATE_{HOT,COLD} with arbitrary weights.

[1] https://lore.kernel.org/20250613171237.44776-1-sj@kernel.org

Thanks,
SJ

[...]
On Tue, Jun 17, 2025 at 1:58 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Mon, 16 Jun 2025 17:16:16 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
[...]
> I have no strong opinion about the use of mempolicy for now, as long as
> the mempolicy folks are fine with it.
>
> Nonetheless, I just wanted to mention that Gregory's suggestion also
> sounds fairly good to me. It would avoid unnecessary coupling of the
> concepts of allocation-time interleaving and after-allocation migration.
> Also it feels even more aligned with a potential future extension of this
> project that we discussed[1]: letting users set multiple target nodes for
> DAMOS_MIGRATE_{HOT,COLD} with arbitrary weights.
>
> [1] https://lore.kernel.org/20250613171237.44776-1-sj@kernel.org

Given this discussion, as well as Joshua's comments earlier [1], it sounds
like while people aren't exactly opposed to using mempolicy for this, the
building consensus is that it would be best not to. I will move the
interleave logic into DAMON for the next revision. However, I still think
it makes sense to use the global weights (probably via get_il_weight())
for now, to avoid allocating pages a certain way and then migrating them
soon after.

I'll try to send the next version of the patch set by the end of the week.

Thanks everyone for the feedback,
Bijan

[1] https://lore.kernel.org/linux-mm/20250613152517.225529-1-joshua.hahnjy@gmail.com/

[...]
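A sketch of what an interleave decision inside DAMON might look like;
everything here is an assumption for illustration (the helper name, the
weights array standing in for the global interleave weights that
get_il_weight() reads, and using a per-folio index to spread pages), not
the actual implementation:

#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/types.h>

static int damos_pick_interleave_nid(unsigned long idx, const u8 *weights,
				     const nodemask_t *nodes)
{
	unsigned int total = 0, pos;
	int nid;

	for_each_node_mask(nid, *nodes)
		total += weights[nid];
	if (!total)
		return NUMA_NO_NODE;

	pos = idx % total;	/* deterministic slot for this folio */
	for_each_node_mask(nid, *nodes) {
		if (pos < weights[nid])
			return nid;
		pos -= weights[nid];
	}
	return NUMA_NO_NODE;
}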
On Tue, 17 Jun 2025 14:54:39 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
[...]
> Given this discussion, as well as Joshua's comments earlier [1], it sounds
> like while people aren't exactly opposed to using mempolicy for this, the
> building consensus is that it would be best not to. I will move the
> interleave logic into DAMON for the next revision. However, I still think
> it makes sense to use the global weights (probably via get_il_weight())
> for now, to avoid allocating pages a certain way and then migrating them
> soon after.

Makes sense to me :)

> I'll try to send the next version of the patch set by the end of the week.

Looking forward to it!

Thanks,
SJ

[...]
On 16.06.25 16:16, Bijan Tabatabai wrote:
> On Mon, Jun 16, 2025 at 4:46 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 13.06.25 18:33, Bijan Tabatabai wrote:
[...]
> There is another subtle dependency in get_vma_policy(): it first checks
> whether a VMA policy exists, and if it doesn't, it uses the task policy
> of the current task, which doesn't make sense when called from a kdamond
> thread.

Indeed, anything that depends on current task / nid / cpu might require
care.

-- 
Cheers,

David / dhildenb
David Hildenbrand <david@redhat.com> writes:

> On 13.06.25 18:33, Bijan Tabatabai wrote:
>> On Fri, Jun 13, 2025 at 8:45 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 12.06.25 20:13, Bijan Tabatabai wrote:
[snip]
>> More specifically, mpol_misplaced() assumes it is being called within a
>> page fault. This doesn't work for us, because we call it inside of a
>> kdamond process.
>
> Right.
>
> But it uses the vmf only for ...
>
> 1) Obtaining the VMA
> 2) Sanity-checking that the ptlock is held.

3) update NUMA balancing per-folio cpupid state (via
   should_numa_migrate_memory()). This needs to be called by the NUMA
   page fault handler.

> Which, you also have during the rmap walk.

[snip]

---
Best Regards,
Huang, Ying
On 16.06.25 13:02, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
>
>> On 13.06.25 18:33, Bijan Tabatabai wrote:
[...]
>> But it uses the vmf only for ...
>>
>> 1) Obtaining the VMA
>> 2) Sanity-checking that the ptlock is held.
>
> 3) update NUMA balancing per-folio cpupid state (via
>    should_numa_migrate_memory()).

How is the vmf used for that ("it uses the vmf only for ...")?

But yes, anything that relies on "thiscpu" and "thisnid" might be
fault-specific as well (what I mentioned further below).

-- 
Cheers,

David / dhildenb