From: Bijan Tabatabai <bijantabatab@micron.com>

A recent patch set automatically set the interleave weight for each node according to the node's maximum bandwidth [1]. In another thread, the patch set's author, Joshua Hahn, wondered if/how these weights should be changed if the bandwidth utilization of the system changes [2].

This patch set adds the mechanism for dynamically changing how application data is interleaved across nodes while leaving the policy of what the interleave weights should be to userspace. It does this by adding a new DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both paddr and vaddr operations sets. Using the paddr version is useful for managing page placement globally. Using the vaddr version limits tracking to one process per kdamond instance, but the va-based tracking better captures spatial locality.

DAMOS_INTERLEAVE interleaves pages within a region across nodes using the interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N> and the page placement algorithm in weighted_interleave_nid via policy_nodemask. We chose to reuse the mempolicy weighted interleave infrastructure to avoid reimplementing code. However, this has the awkward side effect that only pages that are mapped to processes using MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave weights. This might be fine because workloads that want their data to be dynamically interleaved will want their newly allocated data to be interleaved at the same ratio.

If exposing policy_nodemask is undesirable, we have two alternative methods for having DAMON access the interleave weights it should use. We would appreciate feedback on which method is preferred.

1. Use mpol_misplaced instead
   pros: mpol_misplaced is already exposed publicly
   cons: Would require refactoring mpol_misplaced to take a struct vm_area_struct instead of a struct vm_fault, and require refactoring mpol_misplaced and get_vma_policy to take in a struct task_struct rather than just using current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.

2. Add a new field to struct damos, similar to target_nid for the MIGRATE_HOT/COLD schemes.
   pros: Keeps changes contained inside DAMON. Would not require processes to use MPOL_WEIGHTED_INTERLEAVE.
   cons: Duplicates page placement code. Requires discussion on the sysfs interface to use for users to pass in the interleave weights.

This patchset was tested on an AMD machine with a NUMA node with CPUs attached to DDR memory and a cpu-less NUMA node attached to CXL memory. However, this patch set should generalize to other architectures and numbers of NUMA nodes.

Patches Sequence
________________

The first patch exposes policy_nodemask() in include/linux/mempolicy.h to let DAMON determine where a page should be placed for interleaving.
The second patch implements DAMOS_INTERLEAVE as a paddr action.
The third patch moves the DAMON page migration code to ops-common, allowing vaddr actions to use it.
Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.
[1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/

Bijan Tabatabai (4):
  mm/mempolicy: Expose policy_nodemask() in include/linux/mempolicy.h
  mm/damon/paddr: Add DAMOS_INTERLEAVE action
  mm/damon: Move damon_pa_migrate_pages to ops-common
  mm/damon/vaddr: Add vaddr version of DAMOS_INTERLEAVE

 Documentation/mm/damon/design.rst |   2 +
 include/linux/damon.h             |   2 +
 include/linux/mempolicy.h         |   2 +
 mm/damon/ops-common.c             | 136 ++++++++++++++++++++
 mm/damon/ops-common.h             |   4 +
 mm/damon/paddr.c                  | 198 +++++++++++++-----------------
 mm/damon/sysfs-schemes.c          |   1 +
 mm/damon/vaddr.c                  | 124 +++++++++++++++++++
 mm/mempolicy.c                    |   4 +-
 9 files changed, 360 insertions(+), 113 deletions(-)

--
2.43.5
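[Editorial note: the following is a minimal sketch of the placement step the cover letter describes, not the posted code. It assumes the policy_nodemask() signature from recent kernels (gfp, policy, interleave index, out-nid); damos_folio_misplaced() and the ilx derivation are illustrative names only.]

#include <linux/mempolicy.h>
#include <linux/mm.h>

/*
 * Illustrative sketch: reuse mempolicy's weighted-interleave placement
 * to decide whether one folio sits on the wrong node. policy_nodemask()
 * is the helper patch 1 exposes; everything else here is made up.
 */
static bool damos_folio_misplaced(struct folio *folio,
		struct mempolicy *pol, unsigned long addr)
{
	pgoff_t ilx = addr >> PAGE_SHIFT;	/* illustrative interleave index */
	int nid = NUMA_NO_NODE;

	/* Only weighted-interleave mappings are re-placed, per the cover letter. */
	if (!pol || pol->mode != MPOL_WEIGHTED_INTERLEAVE)
		return false;

	/* For this policy mode, this reaches weighted_interleave_nid(). */
	policy_nodemask(GFP_KERNEL, pol, ilx, &nid);
	return nid != NUMA_NO_NODE && nid != folio_nid(folio);
}

A caller in the action would then queue the folio for migration to nid only when this returns true, so already well-placed pages are never touched.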
On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> From: Bijan Tabatabai <bijantabatab@micron.com>
>
> A recent patch set automatically set the interleave weight for each node according to the node's maximum bandwidth [1]. In another thread, the patch set's author, Joshua Hahn, wondered if/how these weights should be changed if the bandwidth utilization of the system changes [2].

Hi Bijan,

Thank you for this patchset, and thank you for finding interest in my question!

> This patch set adds the mechanism for dynamically changing how application data is interleaved across nodes while leaving the policy of what the interleave weights should be to userspace. It does this by adding a new DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both paddr and vaddr operations sets. Using the paddr version is useful for managing page placement globally. Using the vaddr version limits tracking to one process per kdamond instance, but the va-based tracking better captures spatial locality.
>
> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N> and the page placement algorithm in weighted_interleave_nid via policy_nodemask. We chose to reuse the mempolicy weighted interleave infrastructure to avoid reimplementing code. However, this has the awkward side effect that only pages that are mapped to processes using MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave weights. This might be fine because workloads that want their data to be dynamically interleaved will want their newly allocated data to be interleaved at the same ratio.

I think this is generally true. Maybe until a user says that they have a usecase where they would like to have a non-weighted-interleave policy to allocate pages, but would like to place them according to a set weight, we can leave support for other mempolicies out for now.

> If exposing policy_nodemask is undesirable, we have two alternative methods for having DAMON access the interleave weights it should use. We would appreciate feedback on which method is preferred.
>
> 1. Use mpol_misplaced instead
>    pros: mpol_misplaced is already exposed publicly
>    cons: Would require refactoring mpol_misplaced to take a struct vm_area_struct instead of a struct vm_fault, and require refactoring mpol_misplaced and get_vma_policy to take in a struct task_struct rather than just using current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
>
> 2. Add a new field to struct damos, similar to target_nid for the MIGRATE_HOT/COLD schemes.
>    pros: Keeps changes contained inside DAMON. Would not require processes to use MPOL_WEIGHTED_INTERLEAVE.
>    cons: Duplicates page placement code. Requires discussion on the sysfs interface to use for users to pass in the interleave weights.

Here I agree with SJ's sentiment -- I think mpol_misplaced runs with the context of working with current / fault contexts, like you pointed out. Perhaps it is best to keep the scope of the changes as local as possible :-) As for duplicating page placement code, I think that is something we can refine over iterations of this patchset, and maybe SJ will have some great ideas on how this can best be done as well.

> This patchset was tested on an AMD machine with a NUMA node with CPUs attached to DDR memory and a cpu-less NUMA node attached to CXL memory. However, this patch set should generalize to other architectures and numbers of NUMA nodes.

I think moving the test results to the cover letter will help reviewers better understand the intent of the work. Also, I think it will be very helpful to include some potential use-cases here as well. That is, what workloads would benefit from placing pages according to a set ratio, rather than using existing migration policies that adjust this based on hotness / coldness?

One such use case that I can think of is using this patchset + weighted interleave auto-tuning, which would help alleviate bandwidth limitations by ensuring that past the allocation stage, pages are being accessed in a way that maximizes the bandwidth usage of the system (at the cost of latency, which may or may not even be true based on how bandwidth-bound the workload is).

Thank you again for the amazing patchset! Have a great day :-)
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
Hi Joshua,

On Fri, Jun 13, 2025 at 10:25 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
[...]
>
> I think this is generally true. Maybe until a user says that they have a usecase where they would like to have a non-weighted-interleave policy to allocate pages, but would like to place them according to a set weight, we can leave support for other mempolicies out for now.
>
[...]
>
> Here I agree with SJ's sentiment -- I think mpol_misplaced runs with the context of working with current / fault contexts, like you pointed out. Perhaps it is best to keep the scope of the changes as local as possible :-) As for duplicating page placement code, I think that is something we can refine over iterations of this patchset, and maybe SJ will have some great ideas on how this can best be done as well.

David Hildenbrand responded to this and proposed adding a new function that just returns the nid a folio should go on based on its mempolicy. I think that's probably the best way to go for now. I think the common case would want the weights used by this and mempolicy to be the same. However, if there is a use case where different weights are desired, I don't mind coming back and adding that functionality.

> > This patchset was tested on an AMD machine with a NUMA node with CPUs attached to DDR memory and a cpu-less NUMA node attached to CXL memory. However, this patch set should generalize to other architectures and numbers of NUMA nodes.
>
> I think moving the test results to the cover letter will help reviewers better understand the intent of the work. Also, I think it will be very helpful to include some potential use-cases here as well. That is, what workloads would benefit from placing pages according to a set ratio, rather than using existing migration policies that adjust this based on hotness / coldness?

Noted. I will be sure to include that in the next revision.

> One such use case that I can think of is using this patchset + weighted interleave auto-tuning, which would help alleviate bandwidth limitations by ensuring that past the allocation stage, pages are being accessed in a way that maximizes the bandwidth usage of the system (at the cost of latency, which may or may not even be true based on how bandwidth-bound the workload is).

This was the exact use case I envisioned for this patch. I talk about it in more detail in my reply to SeongJae.

> Thank you again for the amazing patchset! Have a great day :-)
> Joshua

I appreciate you taking the time to respond,
Bijan
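[Editorial note: for clarity, the helper David Hildenbrand proposed might have roughly the following shape. Every name here is hypothetical; treat this as a sketch of the idea, not an API that exists.]

/*
 * Hypothetical shape of the proposed helper: return the node a mapped
 * folio should live on under the owning task's mempolicy, or
 * NUMA_NO_NODE when the policy expresses no preference.
 */
int mpol_folio_target_nid(struct folio *folio, struct vm_area_struct *vma,
			  unsigned long addr);

/* DAMON-side usage sketch: */
nid = mpol_folio_target_nid(folio, vma, addr);
if (nid != NUMA_NO_NODE && nid != folio_nid(folio))
	damos_queue_migration(folio, nid);	/* hypothetical helper */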
On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> From: Bijan Tabatabai <bijantabatab@micron.com>
>
> A recent patch set automatically set the interleave weight for each node according to the node's maximum bandwidth [1]. In another thread, the patch set's author, Joshua Hahn, wondered if/how these weights should be changed if the bandwidth utilization of the system changes [2].
>
> This patch set adds the mechanism for dynamically changing how application data is interleaved across nodes while leaving the policy of what the interleave weights should be to userspace. It does this by adding a new DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both paddr and vaddr operations sets. Using the paddr version is useful for managing page placement globally. Using the vaddr version limits tracking to one process per kdamond instance, but the va-based tracking better captures spatial locality.

Hi Bijan,

Thank you for explaining the motivation and need behind this patch. I believe it's important to consider the case where a new memory node is added and the interleave weight values are recalculated.

If a new memory node (say, node2) is added, there are two possible approaches to consider.

1. Migrating pages to the newly added node2. In this case, there is a potential issue where pages may be migrated to node2, even though it is not part of the nodemask set by the user.

2. Ignoring the newly added node2 and continuing to use only the existing nodemask for migrations. However, if the weight values have been updated considering node2 performance, avoiding node2 might reduce the effectiveness of using Weighted Interleave.

It would be helpful to consider these two options or explore other possible solutions to ensure correctness.

Rakie

[...]
On Fri, Jun 13, 2025 at 4:55 AM Rakie Kim <rakie.kim@sk.com> wrote:
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
[...]
>
> Hi Bijan,
>
> Thank you for explaining the motivation and need behind this patch. I believe it's important to consider the case where a new memory node is added and the interleave weight values are recalculated.
>
> If a new memory node (say, node2) is added, there are two possible approaches to consider.
>
> 1. Migrating pages to the newly added node2. In this case, there is a potential issue where pages may be migrated to node2, even though it is not part of the nodemask set by the user.
>
> 2. Ignoring the newly added node2 and continuing to use only the existing nodemask for migrations. However, if the weight values have been updated considering node2 performance, avoiding node2 might reduce the effectiveness of using Weighted Interleave.
>
> It would be helpful to consider these two options or explore other possible solutions to ensure correctness.
>
> Rakie

Hi Rakie,

Thank you for the reply - this is not a problem I had considered, but it is important. I think option 2 is the correct choice, and it is what is already done by the policy_nodemask function we use to determine which node a page should be placed on. I do not think it makes sense to ignore the nodemask when it is explicitly set by the user.

However, if we decide to get the interleave weights from a DAMON-specific interface instead, then I think it would make sense to only migrate to the newly onlined node if the user sets a weight for that node, because in that case the user is explicitly telling DAMON to use that node.

Let me know if you have any other concerns,
Bijan
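[Editorial note: in code, option 2 amounts to a one-line guard in the migration loop. A minimal sketch, assuming 'nid' is the candidate destination and 'pol->nodes' is the policy's user-set nodemask (the surrounding loop is not shown):]

/* Sketch of option 2: never migrate a folio outside the user's nodemask. */
if (nid == NUMA_NO_NODE || !node_isset(nid, pol->nodes))
	continue;	/* leave the folio on its current node */

As Bijan notes, weighted_interleave_nid() already confines its result to pol->nodes, so when placement is delegated to policy_nodemask() this guard is implicit.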
Hi Bijan,

On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> From: Bijan Tabatabai <bijantabatab@micron.com>
>
> A recent patch set automatically set the interleave weight for each node according to the node's maximum bandwidth [1]. In another thread, the patch set's author, Joshua Hahn, wondered if/how these weights should be changed if the bandwidth utilization of the system changes [2].

Thank you for sharing the background. I do agree it is an important question.

> This patch set adds the mechanism for dynamically changing how application data is interleaved across nodes while leaving the policy of what the interleave weights should be to userspace. It does this by adding a new DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both paddr and vaddr operations sets. Using the paddr version is useful for managing page placement globally. Using the vaddr version limits tracking to one process per kdamond instance, but the va-based tracking better captures spatial locality.
>
> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N> and the page placement algorithm in weighted_interleave_nid via policy_nodemask.

So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region into multiple nodes, following interleaving weights, right?

We already have DAMOS actions for migrating pages of a given DAMON region, namely DAMOS_MIGRATE_{HOT,COLD}. The actions support only a single migration target node, though. To my perspective, hence, DAMOS_INTERLEAVE looks like an extended version of DAMOS_MIGRATE_{HOT,COLD} for flexible target node selections. In a way, DAMOS_INTERLEAVE is rather a restricted version of DAMOS_MIGRATE_{HOT,COLD}, since it prioritizes only hotter regions, if I read the second patch correctly.

What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case? For example, letting users enter a special keyword, say, 'weighted_interleave', to the 'target_nid' DAMON sysfs file. In that case, DAMOS_MIGRATE_{HOT,COLD} would work in the way you are implementing DAMOS_INTERLEAVE.

> We chose to reuse the mempolicy weighted interleave infrastructure to avoid reimplementing code. However, this has the awkward side effect that only pages that are mapped to processes using MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave weights. This might be fine because workloads that want their data to be dynamically interleaved will want their newly allocated data to be interleaved at the same ratio.

Makes sense to me. I'm not very familiar with interleaving and memory policy, though.

> If exposing policy_nodemask is undesirable,

I see you are exposing it in include/linux/mempolicy.h in the first patch of this series, and I agree it is not desirable to unnecessarily expose functions. But you could reduce the exposure by exporting it in mm/internal.h instead. The mempolicy maintainers and reviewers who you kindly Cc-ed on this mail could give us good opinions.

> we have two alternative methods for having DAMON access the interleave weights it should use. We would appreciate feedback on which method is preferred.
>
> 1. Use mpol_misplaced instead
>    pros: mpol_misplaced is already exposed publicly
>    cons: Would require refactoring mpol_misplaced to take a struct vm_area_struct instead of a struct vm_fault, and require refactoring mpol_misplaced and get_vma_policy to take in a struct task_struct rather than just using current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.

I feel the cons outweigh the pros. The mempolicy people's opinion would matter more, though.

> 2. Add a new field to struct damos, similar to target_nid for the MIGRATE_HOT/COLD schemes.
>    pros: Keeps changes contained inside DAMON. Would not require processes to use MPOL_WEIGHTED_INTERLEAVE.
>    cons: Duplicates page placement code. Requires discussion on the sysfs interface to use for users to pass in the interleave weights.

I agree this is also somewhat doable. In future, we might want to implement this anyway, for non-global and flexible memory interleaving. But if the memory policy people are ok with reusing policy_nodemask(), I don't think we need to do this now.

> This patchset was tested on an AMD machine with a NUMA node with CPUs attached to DDR memory and a cpu-less NUMA node attached to CXL memory. However, this patch set should generalize to other architectures and numbers of NUMA nodes.

I see the test results in the commit messages of the second and the fourth patches. In the next version, letting readers know that here would be nice. Also adding a short description of what you confirmed with the tests here (e.g., with the test we confirmed this patch functions as expected [and achieves X % Y metric wins]) would be nice.

> Patches Sequence
> ________________
>
> The first patch exposes policy_nodemask() in include/linux/mempolicy.h to let DAMON determine where a page should be placed for interleaving. The second patch implements DAMOS_INTERLEAVE as a paddr action. The third patch moves the DAMON page migration code to ops-common, allowing vaddr actions to use it. Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.

I'll try to take a look at the code and add comments if something stands out, but let's focus on the high-level discussion first, especially whether to implement this as a new DAMOS action or extend the DAMOS_MIGRATE_{HOT,COLD} actions.

I think it would also be nice if you could add more explanation about why you picked DAMON as a way to implement this feature. I assume that's because you found opportunities to utilize this feature in some access-aware way or utilizing DAMOS features. I was actually able to imagine some such usages. For example, we could do the re-interleaving for hot or cold pages of specific NUMA nodes or specific virtual address ranges first, to make the interleaving take effect faster.

Also we could apply a sort of speed limit to the interleaving migration to ensure it doesn't consume too much memory bandwidth. The limit could be arbitrarily user-defined or auto-tuned for a specific system metric's value (e.g., memory bandwidth balance?).

If you have such use cases in your mind or your test setups, sharing those here or in the next versions of this would be very helpful for reviewers.

> [1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
> [2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/

Thanks,
SJ

[...]
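[Editorial note: a rough sketch of what SJ's keyword idea could look like on the sysfs side. It only loosely mirrors the real target_nid handling in mm/damon/sysfs-schemes.c, and the 'use_mempolicy' field is hypothetical.]

/* Sketch: accept a 'weighted_interleave' keyword in the target_nid file. */
static ssize_t target_nid_store(struct kobject *kobj,
		struct kobj_attribute *attr, const char *buf, size_t count)
{
	struct damon_sysfs_scheme *scheme = container_of(kobj,
			struct damon_sysfs_scheme, kobj);
	int err;

	if (sysfs_streq(buf, "weighted_interleave")) {
		scheme->use_mempolicy = true;	/* hypothetical flag */
		scheme->target_nid = NUMA_NO_NODE;
		return count;
	}

	scheme->use_mempolicy = false;
	err = kstrtoint(buf, 0, &scheme->target_nid);
	return err ? err : count;
}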
Hi SeongJae,

Thank you for your comments.

On Thu, Jun 12, 2025 at 6:49 PM SeongJae Park <sj@kernel.org> wrote:
>
> Hi Bijan,
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
[...]
>
> So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region into multiple nodes, following interleaving weights, right?

That's correct.

> We already have DAMOS actions for migrating pages of a given DAMON region, namely DAMOS_MIGRATE_{HOT,COLD}. The actions support only a single migration target node, though. To my perspective, hence, DAMOS_INTERLEAVE looks like an extended version of DAMOS_MIGRATE_{HOT,COLD} for flexible target node selections. In a way, DAMOS_INTERLEAVE is rather a restricted version of DAMOS_MIGRATE_{HOT,COLD}, since it prioritizes only hotter regions, if I read the second patch correctly.
>
> What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case? For example, letting users enter a special keyword, say, 'weighted_interleave', to the 'target_nid' DAMON sysfs file. In that case, DAMOS_MIGRATE_{HOT,COLD} would work in the way you are implementing DAMOS_INTERLEAVE.

I like this idea. I will do this in the next version of the patch. I have a couple of questions about how to go about this, if you don't mind.

First, should I drop the vaddr implementation or implement DAMOS_MIGRATE_{HOT,COLD} in vaddr as well? I am leaning towards the former because I believe the paddr version is more important, though the vaddr version is useful if the user only cares about one application.

Second, do you have a preference for how we indicate that we are using the mempolicy rather than target_nid in struct damos? I was thinking of either setting target_nid to NUMA_NO_NODE or adding a boolean to struct damos for this.

Maybe it would also be a good idea to generalize it some more. I implemented this using just weighted interleave because I was targeting the use case where the best interleave weights for a workload change as the bandwidth utilization of the system changes, which I describe in more detail further down. However, we could apply the same logic for any mempolicy instead of just filtering for MPOL_WEIGHTED_INTERLEAVE. This might clean up the code a little bit because the logic dependent on CONFIG_NUMA would be contained in the mempolicy code.

[...]

> > This patchset was tested on an AMD machine with a NUMA node with CPUs attached to DDR memory and a cpu-less NUMA node attached to CXL memory. However, this patch set should generalize to other architectures and numbers of NUMA nodes.
>
> I see the test results in the commit messages of the second and the fourth patches. In the next version, letting readers know that here would be nice. Also adding a short description of what you confirmed with the tests here (e.g., with the test we confirmed this patch functions as expected [and achieves X % Y metric wins]) would be nice.

Noted. I'll include this in the cover letter of the next patch set.

> I'll try to take a look at the code and add comments if something stands out, but let's focus on the high-level discussion first, especially whether to implement this as a new DAMOS action or extend the DAMOS_MIGRATE_{HOT,COLD} actions.

Makes sense. Based on your reply, I will probably change the code significantly.

> I think it would also be nice if you could add more explanation about why you picked DAMON as a way to implement this feature. I assume that's because you found opportunities to utilize this feature in some access-aware way or utilizing DAMOS features. I was actually able to imagine some such usages. For example, we could do the re-interleaving for hot or cold pages of specific NUMA nodes or specific virtual address ranges first, to make the interleaving take effect faster.

Yeah, I'll give more detail on the use case I was targeting, which I will also include in the cover letter of the next patch set.

Basically, we have seen that the best interleave weights for a workload can change depending on the bandwidth utilization of the system. This was touched upon in the discussion in [1]. As a toy example, imagine some application that uses 75% of the local bandwidth. Assuming sufficient capacity, when running alone, we probably want to keep all of that application's data in local memory. However, if a second instance of that application begins, using the same amount of bandwidth, it would be best to interleave the data of both processes to alleviate the bandwidth pressure from the local node. Likewise, when one of the processes ends, the data should be moved back to local memory.

We imagine there would be a userspace application that would monitor system performance characteristics, such as bandwidth utilization or memory access latency, and use that information to tune the interleave weights. Others seemed to have come to a similar conclusion in previous discussions [2]. We are currently working on a userspace program that does this, but it's not quite ready to be published yet.

After the userspace application adjusts the interleave weights, we need some mechanism to migrate the application pages that have already been allocated. We think DAMON is the correct venue for this mechanism because we noticed that we don't have to migrate all of the application's pages to improve performance; we just need to migrate the frequently accessed pages. DAMON's existing hotness tracking is very useful for this. Additionally, as Ying pointed out [3], a complete solution must also handle when a memory node is at capacity. The existing DAMOS_MIGRATE_COLD action can be used in conjunction with the functionality in this patch set to provide that complete solution.

[1] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/
[3] https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/

> Also we could apply a sort of speed limit to the interleaving migration to ensure it doesn't consume too much memory bandwidth. The limit could be arbitrarily user-defined or auto-tuned for a specific system metric's value (e.g., memory bandwidth balance?).

I agree this is a concern, but I figured DAMOS's existing quota mechanism would handle it. If you could elaborate on why quotas aren't enough here, that would help me come up with a solution.

> If you have such use cases in your mind or your test setups, sharing those here or in the next versions of this would be very helpful for reviewers.

Answered above. I will include them in the next version.

Thanks,
Bijan

[...]
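[Editorial note: for readers unfamiliar with the quota mechanism Bijan refers to, each DAMOS scheme carries a struct damos_quota that bounds how much work the action may do per time window, which is exactly a migration speed limit. A minimal sketch with arbitrary numbers:]

/*
 * Illustrative numbers only: bound the interleaving migration to
 * 256 MiB of pages per second via the existing DAMOS quota, so
 * re-placement cannot monopolize memory bandwidth.
 */
struct damos_quota quota = {
	.sz = 256 * 1024 * 1024,	/* max bytes to act on ... */
	.reset_interval = 1000,		/* ... per 1000 ms window */
};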
On Fri, Jun 13, 2025 at 10:44:17AM -0500, Bijan Tabatabai wrote:
> Hi SeongJae,
>
> Thank you for your comments.
>
> On Thu, Jun 12, 2025 at 6:49 PM SeongJae Park <sj@kernel.org> wrote:
>
[...]
>
> > So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region into multiple nodes, following interleaving weights, right?
>
> That's correct.

Your approach sounds interesting. IIUC, the approach can be integrated with the existing NUMA hinting mechanism as well, so as to perform weighted-interleave migration for promotion, though with MPOL_WEIGHTED_INTERLEAVE set that may end up suppressing the migration anyway. Do you have a plan for that too?

Plus, it'd be best if you shared the improvement results rather than the placement data.

	Byungchul

[...]
On Mon, Jun 16, 2025 at 2:42 AM Byungchul Park <byungchul@sk.com> wrote:
[...]

Hi Byungchul,

> Your approach sounds interesting. IIUC, the approach can be integrated with the existing NUMA hinting mechanism as well, so as to perform weighted-interleave migration for promotion, though with MPOL_WEIGHTED_INTERLEAVE set that may end up suppressing the migration anyway. Do you have a plan for that too?

I do not currently have plans to support that, but this approach could be used there as well.

> Plus, it'd be best if you shared the improvement results rather than the placement data.

Sure, I could add some performance data in the cover letter of the next revision.

> 	Byungchul
> [...]

Thanks,
Bijan
On Fri, 13 Jun 2025 10:44:17 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote: > Hi SeongJae, > > Thank you for your comments. > > On Thu, Jun 12, 2025 at 6:49 PM SeongJae Park <sj@kernel.org> wrote: > > > > Hi Bijan, > > > > On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote: > > > > > From: Bijan Tabatabai <bijantabatab@micron.com> > > > [...] > > What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case? For > > example, letting users enter special keyword, say, 'weighted_interleave' to > > 'target_nid' DAMON sysfs file. In the case, DAMOS_MIGRATE_{HOT,COLD} would > > work in the way you are implementing DAMOS_INTERLEAVE. > > I like this idea. I will do this in the next version of the patch. Great, looking forward to that! > I > have a couple of questions > about how to go about this if you don't mind. Of course I don't :) > > First, should I drop the vaddr implementation or implement > DAMOS_MIGRATE_{HOT,COLD} > in vaddr as well? I am leaning towards the former because I believe > the paddr version is > more important, though the vaddr version is useful if the user only > cares about one > application. I show no problem at dropping the vaddr implementation. Please do what you want and need to do on your pace :) > > Second, do you have a preference for how we indicate that we are using > the mempolicy > rather than target_nid in struct damos? I was thinking of either > setting target_nid to > NUMA_NO_NODE or adding a boolean to struct damos for this. I'd prefer adding a boolean to 'struct damos'. > > Maybe it would also be a good idea to generalize it some more. I > implemented this using > just weighted interleave because I was targeting the use case where > the best interleave > weights for a workload changes as the bandwidth utilization of the > system changes, which > I will go describe in more detail further down. However, we could > apply the same logic for > any mempolicy instead of just filtering for MPOL_WEIGHTED_INTERLEAVE. This might > clean up the code a little bit because the logic dependent on > CONFIG_NUMA would be > contained in the mempolicy code. Yes, I agree. Such flexibility sounds useful :) In future, I think we could further let users set multiple target nodes for DAMOS_MIGRATE_{HOT,COLD} with arbitrary weights. [...] > > I show the test results on the commit messages of the second and the fourth > > patches. In the next version, letting readers know that here would be nice. > > Also adding a short description of what you confirmed with the tests here > > (e.g., with the test we confirmed this patch functions as expected [and > > achieves X % Y metric wins]) would be nice. > > > > Noted. I'll include this in the cover letter of the next patch set. Thank you! :) [...] > > I think it would also be nice if you could add more explanation about why you > > picked DAMON as a way to implement this feature. I assume that's because you > > found opportunities to utilize this feature in some access-aware way or > > utilizing DAMOS features. I was actually able to imagine some such usages. > > For example, we could do the re-interleaving for hot or cold pages of specific > > NUMA nodes or specific virtual address ranges first to make interleaving > > effective faster. > > Yeah, I'll give more detail on the use case I was targeting, which I > will also include > in the cover letter of the next patch set. 
> > Basically, we have seen that the best interleave weights for a workload can > change depending on the bandwidth utilization of the system. This was touched > upon in the discussion in [1]. As a toy example, imagine some > application that uses > 75% of the local bandwidth. Assuming sufficient capacity, when running alone, we > probably want to keep all of that application's data in local memory. > However, if a > second instance of that application begins, using the same amount of bandwidth, > it would be best to interleave the data of both processes to alleviate > the bandwidth > pressure from the local node. Likewise, when one of the processes ends, the data > should be moved back to local memory. > > We imagine there would be a userspace application that would monitor system > performance characteristics, such as bandwidth utilization or memory > access latency, > and uses that information to tune the interleave weights. Others seemed to have > come to a similar conclusion in previous discussions [2]. We are > currently working > on a userspace program that does this, but it's not quite ready to be > published yet. Sounds interesting, looking forward! Note that DAMOS has internal feedback loop for auto-tuning aggressiveness of a given scheme, and the feedback loop accepts system metrics or arbitrary user inputs. I think the userspace program _might_ be able to give the arbitrary feedback. We could also think about extending the list of DAMOS-accepting feedback system metrics to memory bandwidth. > > After the userspace application adjusts the interleave weights, we need some > mechanism to migrate the application pages that have already been allocated. > We think DAMON is the correct venue for this mechanism because we noticed > that we don't have to migrate all of the application's pages to > improve performance, > we just need to migrate the frequently accessed pages. DAMON's existing hotness > tracking is very useful for this. Additionally, as Ying pointed out > [3], a complete > solution must also handle when a memory node is at capacity. The existing > DAMOS_MIGRATE_COLD action can be used in conjunction with the functionality > in this patch set to provide that complete solution. > > [1] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/ > [2] https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/ > [3] https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/ Thank you for this nice and informative description of the use case! > > > Also we could apply a sort of speed limit for the interleaving-migration to > > ensure it doesn't consume memory bandwidth too much. The limit could be > > arbitrarily user-defined or auto-tuned for specific system metrics value (e.g., > > memory bandwidth balance?). > > I agree this is a concern, but I figured DAMOS's existing quota mechanism would > handle it. If you could elaborate on why quotas aren't enough here, > that would help > me come up with a solution. What I wanted to say is, we could use DAMOS's existing quota mechanism to handle it. DAMOS quota feature is just another name of [auto-tunable] speed limit. Sorry for confusing you. Anyway, happy to confirm this is yet another DAMOS feature that could be useful for your and future cases. > > > > If you have such use case in your mind or your test setups, sharing those here > > or on the next versions of this would be very helpful for reviewers. > > Answered above. I will include them in the next version. 
That was very helpful. Keeping that in the next version will be helpful for
new readers such as future SJ :)

[1] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning

Thanks,
SJ

[...]
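For illustration, the direction settled on above -- a boolean in 'struct
damos' plus a 'weighted_interleave' keyword accepted by the 'target_nid'
sysfs file -- might look roughly like the sketch below. The structure and
function names are hypothetical stand-ins, not the actual DAMON code:

/* Hypothetical sketch only; not the actual DAMON data structures. */
struct damos_sketch {
        /* ... existing scheme parameters ... */
        int target_nid;         /* single migration destination */
        bool use_interleave;    /* if true, ignore target_nid and place
                                 * pages per the weighted interleave
                                 * weights instead
                                 */
};

/* Sysfs store path sketch: accept a keyword in place of a node id. */
static int sketch_store_target_nid(struct damos_sketch *s, const char *buf)
{
        if (sysfs_streq(buf, "weighted_interleave")) {
                s->use_interleave = true;
                s->target_nid = NUMA_NO_NODE;
                return 0;
        }
        s->use_interleave = false;
        return kstrtoint(buf, 0, &s->target_nid);
}

On the userspace side of the use case described above, a monitoring daemon
that has computed new weights could write them back through the existing
weighted interleave sysfs files (the path below is the one cited in the cover
letter). A minimal userspace sketch, with error handling trimmed:

#include <stdio.h>

/* Write a new interleave weight for one NUMA node; returns 0 on success. */
static int set_interleave_weight(int nid, unsigned int weight)
{
        char path[96];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", nid);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%u\n", weight);
        return fclose(f);
}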
SeongJae Park <sj@kernel.org> writes:

> Hi Bijan,
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
>> From: Bijan Tabatabai <bijantabatab@micron.com>
>>
>> A recent patch set automatically set the interleave weight for each node
>> according to the node's maximum bandwidth [1]. In another thread, the patch
>> set's author, Joshua Hahn, wondered if/how these weights should be changed
>> if the bandwidth utilization of the system changes [2].
>
> Thank you for sharing the background. I do agree it is an important question.
>
>>
>> This patch set adds the mechanism for dynamically changing how application
>> data is interleaved across nodes while leaving the policy of what the
>> interleave weights should be to userspace. It does this by adding a new
>> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
>> paddr and vaddr operations sets. Using the paddr version is useful for
>> managing page placement globally. Using the vaddr version limits tracking
>> to one process per kdamond instance, but the va based tracking better
>> captures spacial locality.
>>
>> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
>> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
>> and the page placement algorithm in weighted_interleave_nid via
>> policy_nodemask.
>
> So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region
> into multiple nodes, following interleaving weights, right?

Hi, Bijan,

It's hard for me to understand what you want to do from your original patch
description; SeongJae's description is helpful. So, can you add more
description in the future version?

So, you will migrate allocated pages to follow the new weights? How will this
interact with the weights specified by users explicitly? Usually we will
respect explicit user policy.

> We already have DAMOS actions for migrating pages of a given DAMON region,
> namely DAMOS_MIGRATE_{HOT,COLD}. The actions support only a single migration
> target node, though. From my perspective, then, DAMOS_INTERLEAVE looks like
> an extended version of DAMOS_MIGRATE_{HOT,COLD} with flexible target node
> selection. In a way, DAMOS_INTERLEAVE is rather a restricted version of
> DAMOS_MIGRATE_{HOT,COLD}, since it prioritizes only hotter regions, if I read
> the second patch correctly.
>
> What about extending DAMOS_MIGRATE_{HOT,COLD} to support your use case? For
> example, letting users enter a special keyword, say, 'weighted_interleave',
> to the 'target_nid' DAMON sysfs file. In that case, DAMOS_MIGRATE_{HOT,COLD}
> would work in the way you are implementing DAMOS_INTERLEAVE.
>
>> We chose to reuse the mempolicy weighted interleave
>> infrastructure to avoid reimplementing code. However, this has the awkward
>> side effect that only pages that are mapped to processes using
>> MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
>> weights. This might be fine because workloads that want their data to be
>> dynamically interleaved will want their newly allocated data to be
>> interleaved at the same ratio.
>
> Makes sense to me. I'm not very familiar with interleaving and memory policy,
> though.
>
>>
>> If exposing policy_nodemask is undesirable,
>
> I see you are exposing it in include/linux/mempolicy.h in the first patch of
> this series, and I agree it is not desirable to unnecessarily expose
> functions. But you could reduce the exposure by exporting it in
> mm/internal.h instead.
> mempolicy maintainers and reviewers whom you kindly Cc-ed on this mail could
> give us good opinions.
>
>> we have two alternative methods
>> for having DAMON access the interleave weights it should use. We would
>> appreciate feedback on which method is preferred.
>> 1. Use mpol_misplaced instead
>> pros: mpol_misplaced is already exposed publically
>> cons: Would require refactoring mpol_misplaced to take a struct vm_area
>> instead of a struct vm_fault, and require refactoring mpol_misplaced and
>> get_vma_policy to take in a struct task_struct rather than just using
>> current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
>
> I feel the cons are larger than the pros. mempolicy people's opinions would
> matter more, though.
>
>> 2. Add a new field to struct damos, similar to target_nid for the
>> MIGRATE_HOT/COLD schemes.
>> pros: Keeps changes contained inside DAMON. Would not require processes
>> to use MPOL_WEIGHTED_INTERLEAVE.
>> cons: Duplicates page placement code. Requires discussion on the sysfs
>> interface to use for users to pass in the interleave weights.
>
> I agree this is also somewhat doable. In future, we might want to implement
> this anyway, for non-global and flexible memory interleaving. But if memory
> policy people are ok with reusing policy_nodemask(), I don't think we need to
> do this now.
>
>>
>> This patchset was tested on an AMD machine with a NUMA node with CPUs
>> attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
>> However, this patch set should generalize to other architectures and number
>> of NUMA nodes.
>
> You show the test results in the commit messages of the second and the fourth
> patches. In the next version, letting readers know that here would be nice.
> Also adding a short description of what you confirmed with the tests here
> (e.g., with the test we confirmed this patch functions as expected [and
> achieves X % Y metric wins]) would be nice.
>
>>
>> Patches Sequence
>> ________________
>> The first patch exposes policy_nodemask() in include/linux/mempolicy.h to
>> let DAMON determine where a page should be placed for interleaving.
>> The second patch implements DAMOS_INTERLEAVE as a paddr action.
>> The third patch moves the DAMON page migration code to ops-common, allowing
>> vaddr actions to use it.
>> Finally, the fourth patch implements a vaddr version of DAMOS_INTERLEAVE.
>
> I'll try to take a look at the code and add comments if something stands out,
> but let's focus on the high level discussion first, especially whether to
> implement this as a new DAMOS action, or extend the DAMOS_MIGRATE_{HOT,COLD}
> actions.
>
> I think it would also be nice if you could add more explanation about why you
> picked DAMON as a way to implement this feature. I assume that's because you
> found opportunities to utilize this feature in some access-aware way or
> utilizing DAMOS features. I was actually able to imagine some such usages.
> For example, we could do the re-interleaving for hot or cold pages of
> specific NUMA nodes or specific virtual address ranges first to make
> interleaving effective faster.
>
> Also we could apply a sort of speed limit for the interleaving-migration to
> ensure it doesn't consume memory bandwidth too much. The limit could be
> arbitrarily user-defined or auto-tuned for specific system metrics values
> (e.g., memory bandwidth balance?).
>
> If you have such use cases in your mind or your test setups, sharing those
> here or on the next versions of this would be very helpful for reviewers.
>
>>
>> [1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
>> [2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/

---
Best Regards,
Huang, Ying
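To make SeongJae's summary above concrete -- migrating the pages of a DAMON
region into multiple nodes following the interleave weights -- the paddr
action can be pictured as the rough sketch below. damon_get_folio() is real
DAMON code; pick_interleave_nid() and queue_for_migration() are hypothetical
stand-ins for the policy_nodemask()-based placement and the migration step:

/* Rough per-region sketch; simplified to PAGE_SIZE steps. */
static void sketch_interleave_region(struct damon_region *r)
{
        unsigned long addr;

        for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) {
                struct folio *folio = damon_get_folio(PHYS_PFN(addr));
                int nid;

                if (!folio)
                        continue;
                /* ask the mempolicy code where this page should live */
                nid = pick_interleave_nid(folio);       /* hypothetical */
                if (nid != folio_nid(folio))
                        queue_for_migration(folio, nid); /* hypothetical */
                folio_put(folio);
        }
}

SeongJae's reduced-exposure suggestion would amount to a declaration like the
following, visible to mm/ code (including mm/damon/) but not to the rest of
the kernel. The signature shown is believed to match mm/mempolicy.c at the
time of this thread, but double-check it before relying on it:

/* mm/internal.h */
nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
                            pgoff_t ilx, int *nid);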
On Thu, Jun 12, 2025 at 9:42 PM Huang, Ying <ying.huang@linux.alibaba.com> wrote:
>
> SeongJae Park <sj@kernel.org> writes:
>
> > Hi Bijan,
> >
> > On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> >
> >> From: Bijan Tabatabai <bijantabatab@micron.com>
> >>
> >> A recent patch set automatically set the interleave weight for each node
> >> according to the node's maximum bandwidth [1]. In another thread, the patch
> >> set's author, Joshua Hahn, wondered if/how these weights should be changed
> >> if the bandwidth utilization of the system changes [2].
> >
> > Thank you for sharing the background. I do agree it is an important question.
> >
> >>
> >> This patch set adds the mechanism for dynamically changing how application
> >> data is interleaved across nodes while leaving the policy of what the
> >> interleave weights should be to userspace. It does this by adding a new
> >> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> >> paddr and vaddr operations sets. Using the paddr version is useful for
> >> managing page placement globally. Using the vaddr version limits tracking
> >> to one process per kdamond instance, but the va based tracking better
> >> captures spacial locality.
> >>
> >> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> >> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> >> and the page placement algorithm in weighted_interleave_nid via
> >> policy_nodemask.
> >
> > So, what DAMOS_INTERLEAVE will do is, migrating pages of a given DAMON region
> > into multiple nodes, following interleaving weights, right?

Hi Ying,

> Hi, Bijan,
>
> It's hard for me to understand what you want to do from your original patch
> description; SeongJae's description is helpful. So, can you add more
> description in the future version?

Yes, sorry about that. I added more detail in my reply to SeongJae and will
include more detail in the cover letter of the next revision.

> So, you will migrate allocated pages to follow the new weights?

Yes.

> How will this interact with the weights specified by users explicitly?
> Usually we will respect explicit user policy.

I am not sure I understand the question completely, but I will try to answer
as best I can. We interact with the user-provided weights through the
policy_nodemask function, which gives us the node id a page should be on.
This patch only reads the user-provided weights and migrates pages to be
consistent with new weights provided by the user, so I believe these changes
do respect the explicit user policy. Please let me know if you disagree.

Thanks for the review,
Bijan

P.S. Sorry for sending this twice - I accidentally replied instead of
replying to all.
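Bijan's point about respecting explicit user policy can be read as a guard
like the hedged sketch below: only pages whose governing mempolicy is
MPOL_WEIGHTED_INTERLEAVE are interleaving candidates, so pages under any
other explicit policy are never touched, and the weights themselves are only
read, never modified. The function is an illustrative assumption, not the
patch's actual code:

/* Sketch: skip pages not governed by weighted interleave. */
static bool sketch_interleave_applies(struct mempolicy *pol)
{
        return pol && pol->mode == MPOL_WEIGHTED_INTERLEAVE;
}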