[RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by Bijan Tabatabai 3 months, 2 weeks ago
From: Bijan Tabatabai <bijantabatab@micron.com>

A recent patch set automatically sets the interleave weight for each node
according to the node's maximum bandwidth [1]. In another thread, the patch
set's author, Joshua Hahn, wondered if/how these weights should be changed if
the bandwidth utilization of the system changes [2].

This patch set adds the mechanism for dynamically changing how application
data is interleaved across nodes while leaving the policy of what the
interleave weights should be to userspace. It does this by modifying the
migrate_{hot,cold} DAMOS actions to allow passing in a list of migration
targets to their target_nid parameter. When this is done, the
migrate_{hot,cold} actions will migrate pages between the specified nodes
using the global interleave weights found at
/sys/kernel/mm/mempolicy/weighted_interleave/node<N>. This functionality
can be used to dynamically adjust how pages are interleaved by changing the
global weights. When only a single migration target is passed to
target_nid, the migrate_{hot,cold} actions will act the same as before.
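
To make the mechanism concrete, below is a minimal sketch of
weight-proportional target selection. This is a hypothetical helper
written for this cover letter, not the code in the patch;
pick_target_nid() and rr_pos are made-up names, and weights[] stands in
for the per-node values read from the weighted_interleave sysfs files.

  /*
   * Hypothetical sketch (kernel context; linux/nodemask.h, linux/numa.h):
   * choose a migration target from the allowed nodes in proportion to
   * the global interleave weights.
   */
  static int pick_target_nid(const nodemask_t *targets,
                             const u8 *weights, unsigned int *rr_pos)
  {
          unsigned int total = 0, pos;
          int nid;

          for_each_node_mask(nid, *targets)
                  total += weights[nid];
          if (!total)
                  return first_node(*targets);

          /* Walk the weights round-robin so that, over many pages,
           * node 'nid' receives weights[nid] / total of them. */
          pos = (*rr_pos)++ % total;
          for_each_node_mask(nid, *targets) {
                  if (pos < weights[nid])
                          return nid;
                  pos -= weights[nid];
          }
          return NUMA_NO_NODE; /* unreachable when total > 0 */
  }

With weights of 2:1 on nodes 0-1, a selection scheme like this sends two
of every three pages to node 0, which is the 684:343 split shown in the
functionality test below.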

There have been prior discussions about how changing the interleave weights
in response to the system's bandwidth utilization can be beneficial [2].
However, currently the interleave weights are only applied when data is
allocated. Migrating already allocated pages according to the dynamically
changing weights will better help balance the bandwidth utilization across
nodes.

As a toy example, imagine some application that uses 75% of the local
bandwidth. Assuming sufficient capacity, when running alone, we want to
keep that application's data in local memory. However, if a second
instance of that application begins, using the same amount of bandwidth,
it would be best to interleave the data of both processes to alleviate the
bandwidth pressure from the local node. Likewise, when one of the processes
ends, the data should be moved back to local memory.

We imagine there would be a userspace application that would monitor system
performance characteristics, such as bandwidth utilization or memory access
latency, and use that information to tune the interleave weights. Others
seem to have come to a similar conclusion in previous discussions [3].
We are currently working on a userspace program that does this, but it is
not quite ready to be published yet.

We believe DAMON is the correct venue for the interleaving mechanism for a
few reasons. First, we noticed that we don't have to migrate all of the
application's pages to improve performance. We just need to migrate the
frequently accessed pages. DAMON's existing hotness tracking is very useful
for this. Second, DAMON's quota system can be used to ensure we are not
using too much bandwidth for migrations. Finally, as Ying pointed out [4],
a complete solution must also handle when a memory node is at capacity. The
existing migrate_cold action can be used in conjunction with the
functionality added in this patch set to provide that complete solution.
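
As a rough illustration of the quota point, a scheme could be configured
along the lines below (struct and field names as in
include/linux/damon.h; the values are made up for illustration):

  /* Illustrative values only: cap a migrate_{hot,cold} scheme at
   * 256 MiB of migrated data per second, so that migration traffic
   * cannot crowd out the workload's own bandwidth. */
  struct damos_quota quota = {
          .sz             = 256 * 1024 * 1024,   /* bytes per window */
          .reset_interval = 1000,                 /* window length, in ms */
  };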

Functionality Test
==================
Below is an example of this new functionality in use to confirm that these
patches behave as intended.
In this example, the user initially sets the interleave weights to
interleave the pages at a 1:1 ratio and starts an application, alloc_data,
that allocates 1GB of data using those weights and then sleeps. Afterwards,
the weights are changed to interleave pages at a 2:1 ratio. Using numastat,
we show that DAMON has migrated the application's data to match the new
interleave weights.
  $ # Show that the migrate_hot action is used with multiple targets
  $ cd /sys/kernel/mm/damon/admin/kdamonds/0
  $ sudo cat ./contexts/0/schemes/0/action
  migrate_hot
  $ sudo cat ./contexts/0/schemes/0/target_nid
  0-1
  $ # Initially interleave at a 1:1 ratio
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1
  $ # Start alloc_data with the initial interleave ratio
  $ numactl -w 0,1 ~/alloc_data 1G &
  $ # Verify the initial allocation
  $ numastat -c -p alloc_data

  Per-node process memory usage (in MBs) for PID 12224 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     514    514  1027
  -------  ------ ------ -----
  Total       514    514  1027
  $ # Start interleaving at a 2:1 ratio
  $ echo 2 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ # Verify that DAMON has migrated data to match the new ratio
  $ numastat -c -p alloc_data

  Per-node process memory usage (in MBs) for PID 12224 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     684    343  1027
  -------  ------ ------ -----
  Total       684    343  1027

Performance Test
================
Below is a simple example showing that interleaving application data using
these patches can improve application performance.
To do this, we run a bandwidth intensive embedding reduction application
[5]. This workload is useful for this test because it reports the time it
takes each iteration to run and reuses its buffers across iterations,
allowing us to clearly see the benefits of the migration.

We evaluate this on a 128 core/256 thread AMD CPU, with 72 GB/s of local DDR
bandwidth and 26 GB/s of CXL memory bandwidth.

Before we start the workload, the system bandwidth utilization is low, so
we start with interleave weights biased as much as possible to the local
node. When the workload begins, it saturates the local bandwidth, making
the page placement suboptimal. To alleviate this, we modify the interleave
weights, triggering DAMON to migrate the workload's data.

  $ cd /sys/kernel/mm/damon/admin/kdamonds/0/
  $ sudo cat ./contexts/0/schemes/0/action
  migrate_hot
  $ sudo cat ./contexts/0/schemes/0/target_nid
  0-1
  $ echo 255 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1
  $ <path>/eval_baseline -d amazon_All -c 255 -r 100
  <clip startup output>
  Eval Phase 3: Running Baseline...

  REPEAT # 0 Baseline Total time : 9043.24 ms
  REPEAT # 1 Baseline Total time : 7307.71 ms
  REPEAT # 2 Baseline Total time : 7301.4 ms
  REPEAT # 3 Baseline Total time : 7312.44 ms
  REPEAT # 4 Baseline Total time : 7282.43 ms
  # Interleave weights changed to 3:1
  REPEAT # 5 Baseline Total time : 6754.78 ms
  REPEAT # 6 Baseline Total time : 5322.38 ms
  REPEAT # 7 Baseline Total time : 5359.89 ms
  REPEAT # 8 Baseline Total time : 5346.14 ms
  REPEAT # 9 Baseline Total time : 5321.98 ms

Updating the interleave weights and having DAMON migrate the workload
data according to the weights resulted in an approximately 25% speedup.

Questions for Reviewers
=======================
1. Are you happy with the changes to the DAMON sysfs interface?
2. Setting an interleave weight to 0 is currently not allowed. This makes
   sense when the weights are only used for allocation. Does it make sense
   to allow 0 weights now?

Patches Sequence
================
The first patch exposes get_il_weight() in ./mm/internal.h to let DAMON
access the interleave weights.
The second patch implements the interleaving mechanism in the
migrate_{hot,cold} actions.
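
In rough terms, the mm/internal.h side of the first patch amounts to the
sketch below (the exact signature and placement are whatever the patch
itself does; this is only to show the shape of the change):

  /* Sketch: un-static the mempolicy interleave weight lookup and
   * declare it in mm/internal.h so DAMON (mm/damon/paddr.c) can
   * query the per-node weights directly. */
  u8 get_il_weight(int node);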

Revision History
================
Changes from v1
(https://lore.kernel.org/linux-mm/20250612181330.31236-1-bijan311@gmail.com/)
- Reuse migrate_{hot,cold} actions instead of creating a new action
- Remove vaddr implementation
- Remove most of the use of mempolicy, instead duplicate the interleave
  logic and access interleave weights directly
- Write more about the use case in the cover letter
- Write about why DAMON was used for this in the cover letter
- Add correctness test to the cover letter
- Add performance test

[1] https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
[3] https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/
[4] https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/
[5] https://github.com/SNU-ARC/MERCI

Bijan Tabatabai (2):
  mm/mempolicy: Expose get_il_weight() to MM
  mm/damon/paddr: Allow multiple migrate targets

 include/linux/damon.h    |   8 +--
 mm/damon/core.c          |   9 ++--
 mm/damon/lru_sort.c      |   2 +-
 mm/damon/paddr.c         | 108 +++++++++++++++++++++++++++++++++++++--
 mm/damon/reclaim.c       |   2 +-
 mm/damon/sysfs-schemes.c |  14 +++--
 mm/internal.h            |   6 +++
 mm/mempolicy.c           |   2 +-
 samples/damon/mtier.c    |   6 ++-
 samples/damon/prcl.c     |   2 +-
 10 files changed, 138 insertions(+), 21 deletions(-)

-- 
2.43.5
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by Joshua Hahn 3 months, 2 weeks ago
On Fri, 20 Jun 2025 13:04:56 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

Hi Bijan,

I hope you are doing well! Sorry for the late response. It seems like SJ
already gave some great feedback, though, so I will just chime in
with my 2c.

[...snip...]

> However, currently the interleave weights are only applied when data is
> allocated. Migrating already allocated pages according to the dynamically
> changing weights will better help balance the bandwidth utilization across
> nodes.
> 
> As a toy example, imagine some application that uses 75% of the local
> bandwidth. Assuming sufficient capacity, when running alone, we want to
> keep that application's data in local memory. However, if a second
> instance of that application begins, using the same amount of bandwidth,
> it would be best to interleave the data of both processes to alleviate the
> bandwidth pressure from the local node. Likewise, when one of the processes
> ends, the data should be moved back to local memory.

I think the addition of this example helps illustrate the necessity for
interleaving, thank you for adding it in!

> We imagine there would be a userspace application that would monitor system
> performance characteristics, such as bandwidth utilization or memory access
> latency, and use that information to tune the interleave weights. Others
> seem to have come to a similar conclusion in previous discussions [3].
>
> Functionality Test
> ==================

[...snip...]

> Performance Test
> ================
> Below is a simple example showing that interleaving application data using
> these patches can improve application performance.
> To do this, we run a bandwidth intensive embedding reduction application
> [5]. This workload is useful for this test because it reports the time it
> takes each iteration to run and reuses its buffers across iterations,
> allowing us to clearly see the benefits of the migration.
> 
> We evaluate this on a 128 core/256 thread AMD CPU, with 72 GB/s of local DDR
> bandwidth and 26 GB/s of CXL memory bandwidth.
> 
> Before we start the workload, the system bandwidth utilization is low, so
> we start with interleave weights biased as much as possible to the local
> node. When the workload begins, it saturates the local bandwidth, making
> the page placement suboptimal. To alleviate this, we modify the interleave
> weights, triggering DAMON to migrate the workload's data.
> 
>   $ cd /sys/kernel/mm/damon/admin/kdamonds/0/
>   $ sudo cat ./contexts/0/schemes/0/action
>   migrate_hot
>   $ sudo cat ./contexts/0/schemes/0/target_nid
>   0-1
>   $ echo 255 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
>   $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1
>   $ <path>/eval_baseline -d amazon_All -c 255 -r 100
>   <clip startup output>
>   Eval Phase 3: Running Baseline...
> 
>   REPEAT # 0 Baseline Total time : 9043.24 ms
>   REPEAT # 1 Baseline Total time : 7307.71 ms
>   REPEAT # 2 Baseline Total time : 7301.4 ms
>   REPEAT # 3 Baseline Total time : 7312.44 ms
>   REPEAT # 4 Baseline Total time : 7282.43 ms
>   # Interleave weights changed to 3:1
>   REPEAT # 5 Baseline Total time : 6754.78 ms
>   REPEAT # 6 Baseline Total time : 5322.38 ms
>   REPEAT # 7 Baseline Total time : 5359.89 ms
>   REPEAT # 8 Baseline Total time : 5346.14 ms
>   REPEAT # 9 Baseline Total time : 5321.98 ms
> 
> Updating the interleave weights and having DAMON migrate the workload
> data according to the weights resulted in an approximately 25% speedup.

Thank you for sharing these very impressive results! So if I understand
correctly, this workload allocates once (mostly), and each iteration just
re-uses the same allocation, meaning the effects of the weighted interleave
change are isolated mostly to the migration portion.

Based on that understanding, I'm wondering if a longer benchmark would help
demonstrate the effects of this patch a bit better. That is, IIRC short-lived
workloads should see most of their benefits come from correct allocation,
while longer-lived workloads should see most of their benefits come from
correct migration policies. I don't have a good idea of what the threshold
is for characterizing short vs. long workloads, but I think this could be
another prospective test you can use to demonstrate the gains of your patch.

One last thing that I wanted to note is that it seems like iteration 5, where
I imagine there is some additional work needed to balance the page placement
from 255:1 to 3:1, *still* outperforms the normal case in the original
benchmark. Really awesome!!!
 
> Questions for Reviewers
> =======================
> 1. Are you happy with the changes to the DAMON sysfs interface?
> 2. Setting an interleave weight to 0 is currently not allowed. This makes
>    sense when the weights are only used for allocation. Does it make sense
>    to allow 0 weights now?

If the goal of 0 weights is to prevent migration to that node, I think that
we should try to re-use existing mechanisms. There was actually quite a bit
of discussion on whether 0 weights should be allowed (the entire conversation
was split across multiple versions, but I think this is the first instance [1]).

How about using nodemasks instead? I think that they serve a more explicit
purpose of preventing certain nodes from being used. Please let me know if
I'm missing something as to why we cannot use nodemasks here :-)

[...snip...]

One last thing that I wanted to note -- given that these weights now serve
a dual purpose of setting allocation & migration weights, does it make sense
to update the weighted interleave documentation with this information as well?
Or, since it really only affects DAMON users, should we be ok with leaving it
out?

My preference is that we include it in weighted interleave documentation
(Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave)
so that anyone who edits weighted interleave code in the future will at least
be aware that the changes they make will have effects in other subsystems.

Thank you for sharing these results! I hope you have a great day :):)
Joshua

[1] https://lore.kernel.org/all/87msfkh1ls.fsf@DESKTOP-5N7EMDA/

Sent using hkml (https://github.com/sjp38/hackermail)
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by Bijan Tabatabai 3 months, 2 weeks ago
Hi Joshua,

On Mon, Jun 23, 2025 at 8:45 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Fri, 20 Jun 2025 13:04:56 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
> Hi Bijan,
>
> I hope you are doing well! Sorry for the late response.

No need to be sorry. I have no expectation that patches sent on a Friday
afternoon will be looked at over the weekend.

[...]
> > Performance Test
> > ================
> > Below is a simple example showing that interleaving application data using
> > these patches can improve application performance.
> > To do this, we run a bandwidth intensive embedding reduction application
> > [5]. This workload is useful for this test because it reports the time it
> > takes each iteration to run and reuses its buffers across iterations,
> > allowing us to clearly see the benefits of the migration.
> >
> > We evaluate this on a 128 core/256 thread AMD CPU, with 72 GB/s of local DDR
> > bandwidth and 26 GB/s of CXL memory bandwidth.
> >
> > Before we start the workload, the system bandwidth utilization is low, so
> > we start with interleave weights biased as much as possible to the local
> > node. When the workload begins, it saturates the local bandwidth, making
> > the page placement suboptimal. To alleviate this, we modify the interleave
> > weights, triggering DAMON to migrate the workload's data.
> >
> >   $ cd /sys/kernel/mm/damon/admin/kdamonds/0/
> >   $ sudo cat ./contexts/0/schemes/0/action
> >   migrate_hot
> >   $ sudo cat ./contexts/0/schemes/0/target_nid
> >   0-1
> >   $ echo 255 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
> >   $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1
> >   $ <path>/eval_baseline -d amazon_All -c 255 -r 100
> >   <clip startup output>
> >   Eval Phase 3: Running Baseline...
> >
> >   REPEAT # 0 Baseline Total time : 9043.24 ms
> >   REPEAT # 1 Baseline Total time : 7307.71 ms
> >   REPEAT # 2 Baseline Total time : 7301.4 ms
> >   REPEAT # 3 Baseline Total time : 7312.44 ms
> >   REPEAT # 4 Baseline Total time : 7282.43 ms
> >   # Interleave weights changed to 3:1
> >   REPEAT # 5 Baseline Total time : 6754.78 ms
> >   REPEAT # 6 Baseline Total time : 5322.38 ms
> >   REPEAT # 7 Baseline Total time : 5359.89 ms
> >   REPEAT # 8 Baseline Total time : 5346.14 ms
> >   REPEAT # 9 Baseline Total time : 5321.98 ms
> >
> > Updating the interleave weights and having DAMON migrate the workload
> > data according to the weights resulted in an approximately 25% speedup.
>
> Thank you for sharing these very impressive results! So if I understand
> correctly, this workload allocates once (mostly), and each iteration just
> re-uses the same allocation, meaning the effects of the weighted interleave
> change are isolated mostly to the migration portion.

That's correct.

> Based on that understanding, I'm wondering if a longer benchmark would help
> demonstrate the effects of this patch a bit better. That is, IIRC short-lived
> workloads should see most of their benefits come from correct allocation,
> while longer-lived workloads should see most of their benefits come from
> correct migration policies. I don't have a good idea of what the threshold
> is for characterizing short vs. long workloads, but I think this could be
> another prospective test you can use to demonstrate the gains of your patch.

You might be right. I'll try to think of something for the next
revision, but no promises.

[...]
> > Questions for Reviewers
> > =======================
> > 1. Are you happy with the changes to the DAMON sysfs interface?
> > 2. Setting an interleave weight to 0 is currently not allowed. This makes
> >    sense when the weights are only used for allocation. Does it make sense
> >    to allow 0 weights now?
>
> If the goal of 0 weights is to prevent migration to that node, I think that
> we should try to re-use existing mechanisms. There was actually quite a bit
> of discussion on whether 0 weights should be allowed (the entire conversation
> was split across multiple versions, but I think this is the first instance [1]).

Thanks, I'll look over this.

> How about using nodemasks instead? I think that they serve a more explicit
> purpose of preventing certain nodes from being used. Please let me know if
> I'm missing something as to why we cannot use nodemasks here :-)

I think since we're moving towards DAMON having its own weights, this
would only apply to mempolicy. Changing an application's mempolicy
nodemask would be nice, but based on Gregory's previous comments,
having something outside the application change that application's
nodemask might be a bit difficult [1]. Also, I think it would be
easier to change one weight rather than every affected application's
mempolicy.

> [...snip...]
>
> One last thing that I wanted to note -- given that these weights now serve
> a dual purpose of setting allocation & migration weights, does it make sense
> to update the weighted interleave documentation with this information as well?
> Or, since it really only affects DAMON users, should we be ok with leaving it
> out?
>
> My preference is that we include it in weighted interleave documentation
> (Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave)
> so that anyone who edits weighted interleave code in the future will at least
> be aware that the changes they make will have effects in other subsystems.

I think if we continued to use the mempolicy weights, it would make
sense to document that. However, it seems like we are moving towards
using DAMON-specific weights.

Thanks for your feedback,
Bijan

[1] https://lore.kernel.org/damon/aFBXuTtwhAV7BHeY@gourry-fedora-PF4VCD3F/
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by SeongJae Park 3 months, 2 weeks ago
Hi Bijan,

On Fri, 20 Jun 2025 13:04:56 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

[...]
> This patch set adds the mechanism for dynamically changing how application
> data is interleaved across nodes while leaving the policy of what the
> interleave weights should be to userspace. It does this by modifying the
> migrate_{hot,cold} DAMOS actions to allow passing in a list of migration
> targets to their target_nid parameter. When this is done, the
> migrate_{hot,cold} actions will migrate pages between the specified nodes
> using the global interleave weights found at
> /sys/kernel/mm/mempolicy/weighted_interleave/node<N>. This functionality
> can be used to dynamically adjust how pages are interleaved by changing the
> global weights. When only a single migration target is passed to
> target_nid, the migrate_{hot,cold} actions will act the same as before.

This means users are required to manipulate two interfaces.  DAMON sysfs for
target nodes, and weighted_interleave sysfs for weights.  I don't think this
coupling is ideal.

Off the top of my head, I am concerned that users could mistakenly forget to
update one of those, since the requirement is not very clear.  I think the
interface should clearly explain that.  For example, writing a special keyword,
say, "use_interleave_weights", to the target_nid parameter sysfs file.  But,
even in that case, users who update weighted_interleave might forget to update
the target nodes on the DAMON interface.

I think letting DAMOS_MIGRATE_{HOT,COLD} use all nodes as migration targets
when the special keyword is given is one of the better options.  This is what I
suggested for the previous version of this patch series.  But now I think it
would be better if we could just remove this coupling.

I understand a coupling of this sort is inevitable if the kernel should make
the connection between DAMON and weighted interleaving itself, without
user-space help.  But now I think we could get user-space help, as described
below.  Please keep reading.

[...]
> As a toy example, imagine some application that uses 75% of the local
> bandwidth. Assuming sufficient capacity, when running alone, we want to
> keep that application's data in local memory. However, if a second
> instance of that application begins, using the same amount of bandwidth,
> it would be best to interleave the data of both processes to alleviate the
> bandwidth pressure from the local node. Likewise, when one of the processes
> ends, the data should be moved back to local memory.
> 
> We imagine there would be a userspace application that would monitor system
> performance characteristics, such as bandwidth utilization or memory access
> latency, and use that information to tune the interleave weights. Others
> seem to have come to a similar conclusion in previous discussions [3].
> We are currently working on a userspace program that does this, but it is
> not quite ready to be published yet.

So, at least in this toy example, we have user-space control.  Then, I think we
could decouple DAMON and weighted interleaving, and ask the user-space tool to
be the connection between those.  That is, extend DAMOS_MIGRATE_{HOT,COLD} to
let users specify migration target nodes and their weights.  And ask the
user-space tool to periodically read weighted interleaving parameters that
could be auto-tuned, and update DAMOS_MIGRATE_{HOT,COLD} parameters
accordingly.  Actually the user-space tool in this example is making the
weights by itself, so this should be easy work to do?

Also, even for the general use case, I think such user-space intervention is
not too much to request.  Please let me know if I'm wrong.

> 
> We believe DAMON is the correct venue for the interleaving mechanism for a
> few reasons. First, we noticed that we don't have to migrate all of the
> application's pages to improve performance. We just need to migrate the
> frequently accessed pages. DAMON's existing hotness tracking is very useful
> for this. Second, DAMON's quota system can be used to ensure we are not
> using too much bandwidth for migrations. Finally, as Ying pointed out [4],
> a complete solution must also handle when a memory node is at capacity. The
> existing migrate_cold action can be used in conjunction with the
> functionality added in this patch set to provide that complete solution.

These make perfect sense to me.  Thank you for adding this great summary.

> 
> Functionality Test
> ==================
[...]
> Performance Test
> ================
[...]
> Updating the interleave weights and having DAMON migrate the workload
> data according to the weights resulted in an approximately 25% speedup.

Awesome.  Thank you for conducting these great tests and sharing the results!

> 
> Questions for Reviewers
> =======================
> 1. Are you happy with the changes to the DAMON sysfs interface?

I'm happy with it for an RFC-level implementation.  And in my opinion, you
have now proved this is a good idea.  For the next steps toward mainline
landing, I'd like to suggest the interface change below.

Let's allow users to specify DAMOS_MIGRATE_{HOT,COLD} target nodes and weights
using only the DAMON interface.  And let the user-space tool do the
synchronization with weighted interleaving or other required work.

This may require writing a not-small amount of code, especially for the DAMON
sysfs interface.  I think it is doable, though.  If you don't mind, I'd like
to quickly make a prototype and share it with you.

What do you think?

> 2. Setting an interleave weight to 0 is currently not allowed. This makes
>    sense when the weights are only used for allocation. Does it make sense
>    to allow 0 weights now?

I have no opinion, and would like to let the mempolicy folks make their voices
heard.  But if we go with the decoupling approach I suggested above, we can
have this discussion in a separate thread :)

[...]
> Revision History
> ================
> Changes from v1
> (https://lore.kernel.org/linux-mm/20250612181330.31236-1-bijan311@gmail.com/)
> - Reuse migrate_{hot,cold} actions instead of creating a new action
> - Remove vaddr implementation
> - Remove most of the use of mempolicy, instead duplicate the interleave
>   logic and access interleave weights directly
> - Write more about the use case in the cover letter
> - Write about why DAMON was used for this in the cover letter
> - Add correctness test to the cover letter
> - Add performance test

Again, thank you for the revision.  Please bear with me on the next steps.
I believe this work is very promising.


Thanks,
SJ

[...]
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by Gregory Price 3 months, 2 weeks ago
On Fri, Jun 20, 2025 at 01:21:55PM -0700, SeongJae Park wrote:
> Hi Bijan,
> 
> On Fri, 20 Jun 2025 13:04:56 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> 
> [...]
> > This patch set adds the mechanism for dynamically changing how application
> > data is interleaved across nodes while leaving the policy of what the
> > interleave weights should be to userspace. It does this by modifying the
> > migrate_{hot,cold} DAMOS actions to allow passing in a list of migration
> > targets to their target_nid parameter. When this is done, the
> > migrate_{hot,cold} actions will migrate pages between the specified nodes
> > using the global interleave weights found at
> > /sys/kernel/mm/mempolicy/weighted_interleave/node<N>. This functionality
> > can be used to dynamically adjust how pages are interleaved by changing the
> > global weights. When only a single migration target is passed to
> > target_nid, the migrate_{hot,cold} actions will act the same as before.
> 
> This means users are required to manipulate two interfaces.  DAMON sysfs for
> target nodes, and weighted_interleave sysfs for weights.  I don't think this
> coupling is ideal.
> 

Just tossing this out there - weighted interleave sysfs entries *should*
be automatic, and the preferred weights shouldn't really ever change
over time.  Even if they did, if it's the result of devices coming and
going - the updates should also be automatic.

So, in practice, a user usually only has to twiddle DAMON.

I don't have a strong opinion on whether DAMON should leverage the
mempolicy interface, but I think the way it is designed now is
acceptable.

~Gregory
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by Bijan Tabatabai 3 months, 2 weeks ago
Hi Gregory,

On Mon, Jun 23, 2025 at 2:28 PM Gregory Price <gourry@gourry.net> wrote:
>
> On Fri, Jun 20, 2025 at 01:21:55PM -0700, SeongJae Park wrote:
> > Hi Bijan,
> >
> > On Fri, 20 Jun 2025 13:04:56 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> >
> > [...]
> > > This patch set adds the mechanism for dynamically changing how application
> > > data is interleaved across nodes while leaving the policy of what the
> > > interleave weights should be to userspace. It does this by modifying the
> > > migrate_{hot,cold} DAMOS actions to allow passing in a list of migration
> > > targets to their target_nid parameter. When this is done, the
> > > migrate_{hot,cold} actions will migrate pages between the specified nodes
> > > using the global interleave weights found at
> > > /sys/kernel/mm/mempolicy/weighted_interleave/node<N>. This functionality
> > > can be used to dynamically adjust how pages are interleaved by changing the
> > > global weights. When only a single migration target is passed to
> > > target_nid, the migrate_{hot,cold} actions will act the same as before.
> >
> > This means users are required to manipulate two interfaces.  DAMON sysfs for
> > target nodes, and weighted_interleave sysfs for weights.  I don't think this
> > coupling is ideal.
> >
>
> Just tossing this out there - weighted interleave sysfs entries *should*
> be automatic, and the preferred weights shouldn't really ever change
> over time.  Even if they did, if it's the result of devices coming and
> going - the updates should also be automatic.

I'm not convinced this is true. If you have a workload that can
saturate the local bandwidth but not the remote bandwidth, wouldn't
you want the interleave weights to be more biased towards local memory
than you would for a workload that can saturate both the local and
remote bandwidth?

> So, in practice, a user usually only has to twiddle DAMON.

That being said, I don't mind the idea of the mempolicy weights being
left untouched as a reasonable starting point for bandwidth intensive
applications and leaving the fine-tuning to DAMON.

Thanks,
Bijan
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by Gregory Price 3 months, 2 weeks ago
On Mon, Jun 23, 2025 at 06:21:20PM -0500, Bijan Tabatabai wrote:
> I'm not convinced this is true. If you have a workload that can
> saturate the local bandwidth but not the remote bandwidth, wouldn't
> you want the interleave weights to be more biased towards local memory
> than you would for a workload that can saturate both the local and
> remote bandwidth?
>

That sounds like an argument for task/process/cgroup-local weights, as
opposed to an argument to twiddle global weights.

These things need not solve all problems for everyone.

~Gregory
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by Bijan Tabatabai 3 months, 2 weeks ago
Hi SeongJae,

On Fri, Jun 20, 2025 at 3:21 PM SeongJae Park <sj@kernel.org> wrote:
>
> Hi Bijan,
>
> On Fri, 20 Jun 2025 13:04:56 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
> [...]
> > As a toy example, imagine some application that uses 75% of the local
> > bandwidth. Assuming sufficient capacity, when running alone, we want to
> > keep that application's data in local memory. However, if a second
> > instance of that application begins, using the same amount of bandwidth,
> > it would be best to interleave the data of both processes to alleviate the
> > bandwidth pressure from the local node. Likewise, when one of the processes
> > ends, the data should be moved back to local memory.
> >
> > We imagine there would be a userspace application that would monitor system
> > performance characteristics, such as bandwidth utilization or memory access
> > latency, and use that information to tune the interleave weights. Others
> > seem to have come to a similar conclusion in previous discussions [3].
> > We are currently working on a userspace program that does this, but it is
> > not quite ready to be published yet.
>
> So, at least in this toy example, we have user-space control.  Then, I think we
> could decouple DAMON and weighted interleaving, and ask the user-space tool to
> be the connection between those.  That is, extend DAMOS_MIGRATE_{HOT,COLD} to
> let users specify migration target nodes and their weights.  And ask the
> user-space tool to periodically read weighted interleaving parameters that
> could be auto-tuned, and update DAMOS_MIGRATE_{HOT,COLD} parameters
> accordingly.  Actually the user-space tool in this example is making the
> weights by itself, so this should be easy work to do?
>
> Also, even for the general use case, I think such user-space intervention is
> not too much to request.  Please let me know if I'm wrong.

You are correct. The userspace tool would be coming up with the
weights, so it would not be hard for it to write those weights to two
places. I coupled the weights used in DAMON and weighted interleaving
for this revision and the previous one because I could not think of a use
case where you would want to use different weights for allocation time
and migration, so it felt silly to have two different places with the
same data. However, I don't feel too strongly about this, so I'm
willing to defer to your judgement.

Also, our userspace tool updates these weights somewhat frequently,
several times per minute, when it detects a change in the bandwidth
utilization of the system to calibrate the interleave ratio. I am
concerned about how frequent changes to the scheme via the sysfs
interface will affect the effectiveness of DAMON's page sampling. From
what I understand, updates to the sysfs aren't saved until the user
writes to some sysfs file to commit them, and then the damon context is
recreated from scratch. Would this throw away all the previous
sampling work and the work done splitting and merging regions? I am not
super familiar with how the sysfs interface interacts with the rest of
the system, so this concern might be entirely unfounded, but I would
appreciate some clarification here.

[...]
> >
> > Questions for Reviewers
> > =======================
> > 1. Are you happy with the changes to the DAMON sysfs interface?
>
> I'm happy with it for an RFC-level implementation.  And in my opinion, you
> have now proved this is a good idea.  For the next steps toward mainline
> landing, I'd like to suggest the interface change below.
>
> Let's allow users to specify DAMOS_MIGRATE_{HOT,COLD} target nodes and weights
> using only the DAMON interface.  And let the user-space tool do the
> synchronization with weighted interleaving or other required work.
>
> This may require writing a not-small amount of code, especially for the DAMON
> sysfs interface.  I think it is doable, though.  If you don't mind, I'd like
> to quickly make a prototype and share it with you.
>
> What do you think?

That sounds good to me! Having a prototype from you for the sysfs
interface would certainly be helpful, but if you're busy, I can take a
pass at it as well.

> > 2. Setting an interleave weight to 0 is currently not allowed. This makes
> >    sense when the weights are only used for allocation. Does it make sense
> >    to allow 0 weights now?
>
> I have no opinion, and would like to let the mempolicy folks make their
> voices heard.  But if we go with the decoupling approach I suggested above,
> we can have this discussion in a separate thread :)
>
> [...]
> > Revision History
> > ================
> > Changes from v1
> > (https://lore.kernel.org/linux-mm/20250612181330.31236-1-bijan311@gmail.com/)
> > - Reuse migrate_{hot,cold} actions instead of creating a new action
> > - Remove vaddr implementation
> > - Remove most of the use of mempolicy, instead duplicate the interleave
> >   logic and access interleave weights directly
> > - Write more about the use case in the cover letter
> > - Write about why DAMON was used for this in the cover letter
> > - Add correctness test to the cover letter
> > - Add performance test
>
> Again, thank you for the revision.  Please bear with me on the next steps.
> I believe this work is very promising.

Thank you for your help and feedback!
Bijan

[...]
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by SeongJae Park 3 months, 2 weeks ago
On Fri, 20 Jun 2025 16:47:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> Hi SeongJae,
> 
> On Fri, Jun 20, 2025 at 3:21 PM SeongJae Park <sj@kernel.org> wrote:
> >
[...]
> > Also, even for the general use case, I think such user-space intervention is
> > not too much to request.  Please let me know if I'm wrong.
> 
> You are correct. The userspace tool would be coming up with the
> weights, so it would not be hard for it to write those weights to two
> places. I coupled the weights used in DAMON and weighted interleaving
> for this revision and the previous one because I could not think of a use
> case where you would want to use different weights for allocation time
> and migration, so it felt silly to have two different places with the
> same data. However, I don't feel too strongly about this, so I'm
> willing to defer to your judgement.

Thank you for being kind and flexible about the decision.  One such use case I
can think of off the top of my head is when users want to do memory tiering and
there are multiple nodes in the same tier.  For example, if users want to
migrate hot pages in a node to the upper tier, and there are multiple nodes in
that tier, they may want to do the migration with the same weight, or in
proportion to the nodes' free space.
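
For instance, the free-space variant could be sketched like below
(hypothetical helper name and scaling, just to show the idea;
si_meminfo_node() provides the per-node memory info):

  /* Hypothetical sketch: derive per-node migration weights in
   * proportion to each target node's free memory. */
  static void weights_from_free_mem(const nodemask_t *targets, u8 *weights)
  {
          struct sysinfo si;
          int nid;

          for_each_node_mask(nid, *targets) {
                  si_meminfo_node(&si, nid);
                  /* Scale free pages into the u8 weight range, keeping
                   * every weight nonzero. */
                  weights[nid] = clamp(si.freeram >> 18, 1UL, 255UL);
          }
  }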

So let's push this way for now.  Nothing is set in stone, so please feel
free to let me know if you feel differently later.

> 
> Also, our userspace tool updates these weights somewhat frequently,
> several times per minute, when it detects a change in the bandwidth
> utilization of the system to calibrate the interleave ratio. I am
> concerned about how frequent changes to the scheme via the sysfs
> interface will affect the effectiveness of DAMON's page sampling. From
> what I understand, updates to the sysfs aren't saved until the user
> writes to some sysfs file to commit them,

This is correct.

> and then the damon context is
> recreated from scratch.  Would this throw away all the previous
> sampling work and the work done splitting and merging regions?

This is how an early version of the DAMON sysfs interface worked, and your
concern was true for that version.  Hence, we implemented the online tuning
feature [1].  If you use the DAMON user-space tool, 'damo tune' [2] is the
command for using this feature.  So, as long as you use the feature, updating
weights several times per minute shouldn't cause such issues.

[1] https://lore.kernel.org/20220429160606.127307-1-sj@kernel.org
[2] https://github.com/damonitor/damo/blob/next/USAGE.md#damo-tune

> I am not
> super familiar with how the sysfs interface interacts with the rest of
> the system, so this concern might be entirely unfounded, but I would
> appreciate some clarification here.

Thank you for asking.  Please feel free to ask any more questions as you need!

[...]
> > This may require writing a not-small amount of code, especially for the
> > DAMON sysfs interface.  I think it is doable, though.  If you don't mind,
> > I'd like to quickly make a prototype and share it with you.
> >
> > What do you think?
> 
> That sounds good to me! Having a prototype from you for the sysfs
> interface would certainly be helpful, but if you're busy, I can take a
> pass at it as well.

Great.  I will try to do this by this weekend.

[...]
> Thank you for your help and feedback!

The pleasure is mine!


Thanks,
SJ

[...]
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by SeongJae Park 3 months, 2 weeks ago
On Fri, 20 Jun 2025 16:13:24 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Fri, 20 Jun 2025 16:47:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
[...]
> > That sounds good to me! Having a prototype from you for the sysfs
> > interface would certainly be helpful, but if you're busy, I can take a
> > pass at it as well.
> 
> Great.  I will try to do this by this weekend.

I just posted an RFC for this:
https://lore.kernel.org/20250621173131.23917-1-sj@kernel.org

Note that I didn't run exhaustive tests with it, so it may have silly bugs.  I
believe it could let you know what I'm thinking better than my cheap talk,
though ;)


Thanks,
SJ

[...]
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by Bijan Tabatabai 3 months, 2 weeks ago
On Sat, Jun 21, 2025 at 12:36 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Fri, 20 Jun 2025 16:13:24 -0700 SeongJae Park <sj@kernel.org> wrote:
>
> > On Fri, 20 Jun 2025 16:47:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> [...]
> > > That sounds good to me! Having a prototype from you for the sysfs
> > > interface would certainly be helpful, but if you're busy, I can take a
> > > pass at it as well.
> >
> > Great.  I will try to do this by this weekend.
>
> I just posted an RFC for this:
> https://lore.kernel.org/20250621173131.23917-1-sj@kernel.org
>
> Note that I didn't run exhaustive tests with it, so it may have silly bugs.  I
> believe it could let you know what I'm thinking better than my cheap talk,
> though ;)

I only skimmed through it, but this looks good to me, thanks for
throwing this together! I'll probe more when I use these patches for
the next revision.

I'll try to reply in that thread as well, but it was only sent to my
work email. I'll try to set up my work Outlook client to work with the
mailing list format, but in the future, could you also CC my personal
email (bijan311@gmail.com)? Gmail just makes formatting for the
mailing list easier for me.

Thanks,
Bijan

[...]
Re: [RFC PATCH v2 0/2] mm/damon/paddr: Allow interleaving in migrate_{hot,cold} actions
Posted by SeongJae Park 3 months, 2 weeks ago
On Mon, 23 Jun 2025 09:39:48 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> On Sat, Jun 21, 2025 at 12:36 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Fri, 20 Jun 2025 16:13:24 -0700 SeongJae Park <sj@kernel.org> wrote:
> >
> > > On Fri, 20 Jun 2025 16:47:26 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
[...]
> > I just posted an RFC for this:
> > https://lore.kernel.org/20250621173131.23917-1-sj@kernel.org
> >
> > Note that I didn't run exhaustive tests with it, so it may have silly bugs.  I
> > believe it could let you know what I'm thinking better than my cheap talk,
> > though ;)
> 
> I only skimmed through it, but this looks good to me, thanks for
> throwing this together! I'll probe more when I use these patches for
> the next revision.

Great, please feel free to add comments there.

> 
> I'll try to reply in that thread as well, but it was only sent to my
> work email. I'll try to set up my work Outlook client to work with the
> mailing list format,

I'm sorry to hear that.  I personally gave up working with my Outlook client
and am using a tool named hkml [1].  I started developing it for my personal
use, but now it is aimed at helping contributors to DAMON and the Linux kernel
in general.  So, if you are interested, please feel free to try it and submit
bug reports :)

> but in the future, could you also CC my personal
> email (bijan311@gmail.com)?

Sure, I will Cc your personal email as well from next time.

> Gmail just makes formatting for the
> mailing list easier for me.

Right, Gmail is better than Outlook.  To me, hkml is even better than Gmail,
though ;)

[1] https://github.com/sjp38/hackermail


Thanks,
SJ

[...]