[RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets

Posted by Bijan Tabatabai 3 months, 2 weeks ago
From: Bijan Tabatabai <bijantabatab@micron.com>

The migrate_{hot,cold} DAMOS actions take a parameter, target_nid, to
indicate what node the actions should migrate pages to. In this patch,
we allow passing a list of migration targets into target_nid. When
this is done, the migrate_{hot,cold} actions will migrate pages between
the specified nodes using the global interleave weights found at
/sys/kernel/mm/mempolicy/weighted_interleave/node<N>. This functionality
can be used to dynamically adjust how pages are interleaved in response
to changes in bandwidth utilization to improve performance, as discussed
in [1]. When only a single migration target is passed to target_nid, the
migrate_{hot,cold} actions behave as before.

Below is an example of this new functionality. The user initially sets
the interleave weights to interleave pages at a 1:1 ratio and starts an
application, alloc_data, that allocates 1GB of data using those weights
and then sleeps. Afterwards, the weights are changed to interleave pages
at a 2:1 ratio. Using numastat, we show that DAMON has migrated the
application's pages to match the new interleave weights.
  $ # Show that the migrate_hot action is used with multiple targets
  $ cd /sys/kernel/mm/damon/admin/kdamonds/0
  $ sudo cat ./contexts/0/schemes/0/action
  migrate_hot
  $ sudo cat ./contexts/0/schemes/0/target_nid
  0-1
  $ # Initially interleave at a 1:1 ratio
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ echo 1 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node1
  $ # Start alloc_data with the initial interleave ratio
  $ numactl -w 0,1 ~/alloc_data 1G &
  $ # Verify the initial allocation
  $ numastat -c -p alloc_data

  Per-node process memory usage (in MBs) for PID 12224 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     514    514  1027
  -------  ------ ------ -----
  Total       514    514  1027
  $ # Start interleaving at a 2:1 ratio
  $ echo 2 | sudo tee /sys/kernel/mm/mempolicy/weighted_interleave/node0
  $ # Verify that DAMON has migrated data to match the new ratio
  $ numastat -c -p alloc_data

  Per-node process memory usage (in MBs) for PID 12224 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     684    343  1027
  -------  ------ ------ -----
  Total       684    343  1027

[1] https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/
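For reference, the per-folio node selection used by the interleaving can be
sketched in userspace as follows. This is an illustrative model of the
ilx-based walk (same logic as weighted_interleave_nid()), not kernel code;
the node ids, weights, and helper names are assumptions for the example:

```c
#include <assert.h>

/*
 * Userspace sketch of target node selection: the folio's interleave
 * index (ilx) is reduced modulo the total weight, then walked across
 * the per-node weights until it falls inside one node's share.
 */
static int weighted_target_nid(unsigned long ilx, const int *nids,
			       const unsigned char *weights, int nr_nodes)
{
	unsigned int weight_total = 0;
	unsigned int target;
	int i;

	for (i = 0; i < nr_nodes; i++)
		weight_total += weights[nids[i]];

	target = ilx % weight_total;
	for (i = 0; i < nr_nodes; i++) {
		if (target < weights[nids[i]])
			return nids[i];
		target -= weights[nids[i]];
	}
	return -1;	/* unreachable while weight_total > 0 */
}

/* Example topology: node 0 weight 2, node 1 weight 1 (the 2:1 case above). */
static int example_nid(unsigned long ilx)
{
	static const int nids[] = { 0, 1 };
	static const unsigned char weights[] = { 2, 1 };

	return weighted_target_nid(ilx, nids, weights, 2);
}
```

With a 2:1 weight, consecutive interleave indices map to nodes 0, 0, 1,
0, 0, 1, ..., which is why roughly two thirds of alloc_data's pages end
up on node 0 in the numastat output above.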

Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
---
 include/linux/damon.h    |   8 +--
 mm/damon/core.c          |   9 ++--
 mm/damon/lru_sort.c      |   2 +-
 mm/damon/paddr.c         | 108 +++++++++++++++++++++++++++++++++++++--
 mm/damon/reclaim.c       |   2 +-
 mm/damon/sysfs-schemes.c |  14 +++--
 samples/damon/mtier.c    |   6 ++-
 samples/damon/prcl.c     |   2 +-
 8 files changed, 131 insertions(+), 20 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index a4011726cb3b..24e726ee146a 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -454,7 +454,7 @@ struct damos_access_pattern {
  * @apply_interval_us:	The time between applying the @action.
  * @quota:		Control the aggressiveness of this scheme.
  * @wmarks:		Watermarks for automated (in)activation of this scheme.
- * @target_nid:		Destination node if @action is "migrate_{hot,cold}".
+ * @target_nids:	Destination nodes if @action is "migrate_{hot,cold}".
  * @filters:		Additional set of &struct damos_filter for &action.
  * @ops_filters:	ops layer handling &struct damos_filter objects list.
  * @last_applied:	Last @action applied ops-managing entity.
@@ -472,7 +472,7 @@ struct damos_access_pattern {
  * monitoring context are inactive, DAMON stops monitoring either, and just
  * repeatedly checks the watermarks.
  *
- * @target_nid is used to set the migration target node for migrate_hot or
+ * @target_nids is used to set the migration target nodes for migrate_hot or
  * migrate_cold actions, which means it's only meaningful when @action is either
  * "migrate_hot" or "migrate_cold".
  *
@@ -517,7 +517,7 @@ struct damos {
 	struct damos_quota quota;
 	struct damos_watermarks wmarks;
 	union {
-		int target_nid;
+		nodemask_t target_nids;
 	};
 	struct list_head filters;
 	struct list_head ops_filters;
@@ -896,7 +896,7 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 			unsigned long apply_interval_us,
 			struct damos_quota *quota,
 			struct damos_watermarks *wmarks,
-			int target_nid);
+			nodemask_t *target_nids);
 void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
 void damon_destroy_scheme(struct damos *s);
 int damos_commit_quota_goals(struct damos_quota *dst, struct damos_quota *src);
diff --git a/mm/damon/core.c b/mm/damon/core.c
index b217e0120e09..b57eae393df8 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -378,7 +378,7 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 			unsigned long apply_interval_us,
 			struct damos_quota *quota,
 			struct damos_watermarks *wmarks,
-			int target_nid)
+			nodemask_t *target_nids)
 {
 	struct damos *scheme;
 
@@ -407,7 +407,10 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 	scheme->wmarks = *wmarks;
 	scheme->wmarks.activated = true;
 
-	scheme->target_nid = target_nid;
+	if (target_nids)
+		nodes_copy(scheme->target_nids, *target_nids);
+	else
+		nodes_clear(scheme->target_nids);
 
 	return scheme;
 }
@@ -1006,7 +1009,7 @@ static int damon_commit_schemes(struct damon_ctx *dst, struct damon_ctx *src)
 				src_scheme->action,
 				src_scheme->apply_interval_us,
 				&src_scheme->quota, &src_scheme->wmarks,
-				NUMA_NO_NODE);
+				NULL);
 		if (!new_scheme)
 			return -ENOMEM;
 		err = damos_commit(new_scheme, src_scheme);
diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 4af8fd4a390b..ef584c49ecf1 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -164,7 +164,7 @@ static struct damos *damon_lru_sort_new_scheme(
 			&quota,
 			/* (De)activate this according to the watermarks. */
 			&damon_lru_sort_wmarks,
-			NUMA_NO_NODE);
+			NULL);
 }
 
 /* Create a DAMON-based operation scheme for hot memory regions */
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 4102a8c5f992..cbd262d21779 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -19,6 +19,12 @@
 #include "../internal.h"
 #include "ops-common.h"
 
+struct damon_pa_migrate_rmap_arg {
+	nodemask_t *nids;
+	u8 *weights;
+	int *target_nid;
+};
+
 static bool damon_folio_mkold_one(struct folio *folio,
 		struct vm_area_struct *vma, unsigned long addr, void *arg)
 {
@@ -502,12 +508,83 @@ static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
 	return nr_migrated;
 }
 
+static bool damon_pa_migrate_rmap(struct folio *folio,
+				  struct vm_area_struct *vma,
+				  unsigned long addr,
+				  void *arg)
+{
+	struct damon_pa_migrate_rmap_arg *rmap_arg;
+	pgoff_t ilx;
+	int order;
+	unsigned int target;
+	unsigned int weight_total = 0;
+	int nid;
+
+	rmap_arg = (struct damon_pa_migrate_rmap_arg *)arg;
+
+	order = folio_order(folio);
+	ilx = vma->vm_pgoff >> order;
+	ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
+
+	/* Same logic as weighted_interleave_nid() */
+	for_each_node_mask(nid, *rmap_arg->nids) {
+		weight_total += rmap_arg->weights[nid];
+	}
+
+	target = ilx % weight_total;
+	nid = first_node(*rmap_arg->nids);
+	while (target) {
+		if (target < rmap_arg->weights[nid])
+			break;
+		target -= rmap_arg->weights[nid];
+		nid = next_node_in(nid, *rmap_arg->nids);
+	}
+
+	if (nid == folio_nid(folio))
+		*rmap_arg->target_nid = NUMA_NO_NODE;
+	else
+		*rmap_arg->target_nid = nid;
+	return false;
+}
+
 static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
 		unsigned long *sz_filter_passed)
 {
 	unsigned long addr, applied;
-	LIST_HEAD(folio_list);
+	struct rmap_walk_control rwc;
+	struct damon_pa_migrate_rmap_arg rmap_arg;
+	struct list_head *folio_lists;
 	struct folio *folio;
+	u8 *weights;
+	int target_nid;
+	int nr_nodes;
+
+	nr_nodes = nodes_weight(s->target_nids);
+	if (!nr_nodes)
+		return 0;
+
+	folio_lists = kmalloc_array(nr_node_ids, sizeof(struct list_head),
+		GFP_KERNEL);
+	if (!folio_lists)
+		return 0;
+
+	weights = kmalloc_array(nr_node_ids, sizeof(u8), GFP_KERNEL);
+	if (!weights) {
+		kfree(folio_lists);
+		return 0;
+	}
+
+	for (int i = 0; i < nr_node_ids; i++) {
+		INIT_LIST_HEAD(&folio_lists[i]);
+		weights[i] = get_il_weight(i);
+	}
+
+	memset(&rwc, 0, sizeof(struct rmap_walk_control));
+	rwc.rmap_one = damon_pa_migrate_rmap;
+	rwc.arg = &rmap_arg;
+	rmap_arg.nids = &s->target_nids;
+	rmap_arg.weights = weights;
+	rmap_arg.target_nid = &target_nid;
 
 	addr = r->ar.start;
 	while (addr < r->ar.end) {
@@ -522,15 +599,38 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
 		else
 			*sz_filter_passed += folio_size(folio);
 
+		/*
+		 * If there is only one target node, migrate there. Otherwise,
+		 * interleave across the nodes according to the global
+		 * interleave weights
+		 */
+		if (nr_nodes == 1) {
+			target_nid = first_node(s->target_nids);
+		} else {
+			target_nid = NUMA_NO_NODE;
+			/* Updates target_nid */
+			rmap_walk(folio, &rwc);
+		}
+
+		if (target_nid == NUMA_NO_NODE)
+			goto put_folio;
+
 		if (!folio_isolate_lru(folio))
 			goto put_folio;
-		list_add(&folio->lru, &folio_list);
+		list_add(&folio->lru, &folio_lists[target_nid]);
 put_folio:
 		addr += folio_size(folio);
 		folio_put(folio);
 	}
-	applied = damon_pa_migrate_pages(&folio_list, s->target_nid);
-	cond_resched();
+
+	applied = 0;
+	for (int i = 0; i < nr_node_ids; i++) {
+		applied += damon_pa_migrate_pages(&folio_lists[i], i);
+		cond_resched();
+	}
+
+	kfree(weights);
+	kfree(folio_lists);
 	s->last_applied = folio;
 	return applied * PAGE_SIZE;
 }
diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index a675150965e0..9b9546606424 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -178,7 +178,7 @@ static struct damos *damon_reclaim_new_scheme(void)
 			&damon_reclaim_quota,
 			/* (De)activate this according to the watermarks. */
 			&damon_reclaim_wmarks,
-			NUMA_NO_NODE);
+			NULL);
 }
 
 static int damon_reclaim_apply_parameters(void)
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 0f6c9e1fec0b..eb4e2ded5c83 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1583,7 +1583,7 @@ struct damon_sysfs_scheme {
 	struct damon_sysfs_scheme_filters *filters;
 	struct damon_sysfs_stats *stats;
 	struct damon_sysfs_scheme_regions *tried_regions;
-	int target_nid;
+	nodemask_t target_nids;
 };
 
 /* This should match with enum damos_action */
@@ -1611,7 +1611,7 @@ static struct damon_sysfs_scheme *damon_sysfs_scheme_alloc(
 	scheme->kobj = (struct kobject){};
 	scheme->action = action;
 	scheme->apply_interval_us = apply_interval_us;
-	scheme->target_nid = NUMA_NO_NODE;
+	nodes_clear(scheme->target_nids);
 	return scheme;
 }
 
@@ -1880,18 +1880,22 @@ static ssize_t target_nid_show(struct kobject *kobj,
 	struct damon_sysfs_scheme *scheme = container_of(kobj,
 			struct damon_sysfs_scheme, kobj);
 
-	return sysfs_emit(buf, "%d\n", scheme->target_nid);
+	return bitmap_print_to_pagebuf(true, buf, scheme->target_nids.bits, MAX_NUMNODES);
 }
 
 static ssize_t target_nid_store(struct kobject *kobj,
 		struct kobj_attribute *attr, const char *buf, size_t count)
 {
+	nodemask_t new;
 	struct damon_sysfs_scheme *scheme = container_of(kobj,
 			struct damon_sysfs_scheme, kobj);
 	int err = 0;
 
 	/* TODO: error handling for target_nid range. */
-	err = kstrtoint(buf, 0, &scheme->target_nid);
+	err = nodelist_parse(buf, new);
+
+	if (!err)
+		nodes_copy(scheme->target_nids, new);
 
 	return err ? err : count;
 }
@@ -2258,7 +2262,7 @@ static struct damos *damon_sysfs_mk_scheme(
 
 	scheme = damon_new_scheme(&pattern, sysfs_scheme->action,
 			sysfs_scheme->apply_interval_us, &quota, &wmarks,
-			sysfs_scheme->target_nid);
+			&sysfs_scheme->target_nids);
 	if (!scheme)
 		return NULL;
 
diff --git a/samples/damon/mtier.c b/samples/damon/mtier.c
index 36d2cd933f5a..b9ac075cbd25 100644
--- a/samples/damon/mtier.c
+++ b/samples/damon/mtier.c
@@ -47,6 +47,10 @@ static struct damon_ctx *damon_sample_mtier_build_ctx(bool promote)
 	struct damos *scheme;
 	struct damos_quota_goal *quota_goal;
 	struct damos_filter *filter;
+	nodemask_t target_node;
+
+	nodes_clear(target_node);
+	node_set(promote ? 0 : 1, target_node);
 
 	ctx = damon_new_ctx();
 	if (!ctx)
@@ -105,7 +109,7 @@ static struct damon_ctx *damon_sample_mtier_build_ctx(bool promote)
 				.weight_age = 100,
 			},
 			&(struct damos_watermarks){},
-			promote ? 0 : 1);	/* migrate target node id */
+			&target_node);
 	if (!scheme)
 		goto free_out;
 	damon_set_schemes(ctx, &scheme, 1);
diff --git a/samples/damon/prcl.c b/samples/damon/prcl.c
index 056b1b21a0fe..4d3e4e2e15cc 100644
--- a/samples/damon/prcl.c
+++ b/samples/damon/prcl.c
@@ -88,7 +88,7 @@ static int damon_sample_prcl_start(void)
 			0,
 			&(struct damos_quota){},
 			&(struct damos_watermarks){},
-			NUMA_NO_NODE);
+			NULL);
 	if (!scheme) {
 		damon_destroy_ctx(ctx);
 		return -ENOMEM;
-- 
2.43.5
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by SeongJae Park 3 months, 2 weeks ago
Hi Bijan,

On Fri, 20 Jun 2025 13:04:58 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> From: Bijan Tabatabai <bijantabatab@micron.com>
> 
> The migrate_{hot,cold} DAMOS actions take a parameter, target_nid, to
> indicate what node the actions should migrate pages to. In this patch,
> we allow passing in a list of migration targets into target_nid. When
> this is done, the migrate_{hot,cold} actions will migrate pages between
> the specified nodes using the global interleave weights found at
> /sys/kernel/mm/mempolicy/weighted_interleave/node<N>. This functionality
> can be used to dynamically adjust how pages are interleaved in response
> to changes in bandwidth utilization to improve performance, as discussed
> in [1]. When only a single migration target is passed to target_nid, the
> migrate_{hot,cold} actions will act the same as before.
[...]
>  include/linux/damon.h    |   8 +--
>  mm/damon/core.c          |   9 ++--
>  mm/damon/lru_sort.c      |   2 +-
>  mm/damon/paddr.c         | 108 +++++++++++++++++++++++++++++++++++++--
>  mm/damon/reclaim.c       |   2 +-
>  mm/damon/sysfs-schemes.c |  14 +++--
>  samples/damon/mtier.c    |   6 ++-
>  samples/damon/prcl.c     |   2 +-
>  8 files changed, 131 insertions(+), 20 deletions(-)

If we keep pursuing making DAMON users able to specify multiple migration
destination nodes and their weights[1], I think we may need only the paddr.c
part of this patch in the final version of this great work.

[...]
>  static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
>  		unsigned long *sz_filter_passed)
>  {
>  	unsigned long addr, applied;
> -	LIST_HEAD(folio_list);
> +	struct rmap_walk_control rwc;
[...]
>  
>  	addr = r->ar.start;
>  	while (addr < r->ar.end) {
> @@ -522,15 +599,38 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
>  		else
>  			*sz_filter_passed += folio_size(folio);
>  
> +		/*
> +		 * If there is only one target node, migrate there. Otherwise,
> +		 * interleave across the nodes according to the global
> +		 * interleave weights
> +		 */
> +		if (nr_nodes == 1) {
> +			target_nid = first_node(s->target_nids);
> +		} else {
> +			target_nid = NUMA_NO_NODE;
> +			/* Updates target_nid */
> +			rmap_walk(folio, &rwc);
> +		}

So we are doing rmap_walk(), which is known to be not very fast, to get
the target node id of this page, in a way very similar to that of weighted
interleaving, right?  I don't think we really need to behave the same as
weighted interleaving at that cost.

I'd hence suggest to implement and use a simple weights handling mechanism
here.  It could be a round-robin way, like weighted interleaving, or a
probabilistic way, using damon_rand().

The round-robin way may be simpler in my opinion.  For example,

unsigned int damos_pa_nid_to_migrate(struct damos_migrate_dest *dest)
{
	static unsigned int nr_migrated = 0;
	unsigned int total_weight = 0;
	unsigned int weights_to_ignore;
	size_t i;

	for (i = 0; i < dest->nr_dests; i++)
		total_weight += dest->weight_arr[i];
	weights_to_ignore = nr_migrated++ % total_weight;
	total_weight = 0;
	for (i = 0; i < dest->nr_dests; i++) {
		total_weight += dest->weight_arr[i];
		if (total_weight > weights_to_ignore)
			return dest->node_id_arr[i];
	}
	WARN_ONCE(1, "I don't know what I did wrong");
	return 0;
}

Then, we could replace the above rmap_walk() call with this one.  What do you
think?

Nothing stands out from other parts to me.  I will do a more thorough read on
the next revision, though.

[1] https://lore.kernel.org/20250621173639.35906-1-sj@kernel.org


Thanks,
SJ

[...]
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by Bijan Tabatabai 3 months, 2 weeks ago
On Sat, Jun 21, 2025 at 1:02 PM SeongJae Park <sj@kernel.org> wrote:
>
> Hi Bijan,
>
> On Fri, 20 Jun 2025 13:04:58 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
> > From: Bijan Tabatabai <bijantabatab@micron.com>
> >
> > The migrate_{hot,cold} DAMOS actions take a parameter, target_nid, to
> > indicate what node the actions should migrate pages to. In this patch,
> > we allow passing in a list of migration targets into target_nid. When
> > this is done, the migrate_{hot,cold} actions will migrate pages between
> > the specified nodes using the global interleave weights found at
> > /sys/kernel/mm/mempolicy/weighted_interleave/node<N>. This functionality
> > can be used to dynamically adjust how pages are interleaved in response
> > to changes in bandwidth utilization to improve performance, as discussed
> > in [1]. When only a single migration target is passed to target_nid, the
> > migrate_{hot,cold} actions will act the same as before.
> [...]
> >  include/linux/damon.h    |   8 +--
> >  mm/damon/core.c          |   9 ++--
> >  mm/damon/lru_sort.c      |   2 +-
> >  mm/damon/paddr.c         | 108 +++++++++++++++++++++++++++++++++++++--
> >  mm/damon/reclaim.c       |   2 +-
> >  mm/damon/sysfs-schemes.c |  14 +++--
> >  samples/damon/mtier.c    |   6 ++-
> >  samples/damon/prcl.c     |   2 +-
> >  8 files changed, 131 insertions(+), 20 deletions(-)
>
> If we keep pursuing making DAMON users be able to specify multiple migration
> destination nodes and their weights[1], I think we may need only paddr.c part
> change of this patch in the final version of this great work.

Sounds good to me.

> [...]
> >  static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> >               unsigned long *sz_filter_passed)
> >  {
> >       unsigned long addr, applied;
> > -     LIST_HEAD(folio_list);
> > +     struct rmap_walk_control rwc;
> [...]
> >
> >       addr = r->ar.start;
> >       while (addr < r->ar.end) {
> > @@ -522,15 +599,38 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> >               else
> >                       *sz_filter_passed += folio_size(folio);
> >
> > +             /*
> > +              * If there is only one target node, migrate there. Otherwise,
> > +              * interleave across the nodes according to the global
> > +              * interleave weights
> > +              */
> > +             if (nr_nodes == 1) {
> > +                     target_nid = first_node(s->target_nids);
> > +             } else {
> > +                     target_nid = NUMA_NO_NODE;
> > +                     /* Updates target_nid */
> > +                     rmap_walk(folio, &rwc);
> > +             }
>
> So we are doing rmap_walk(), which is known to be not very fast, for getting
> the target node id of this page, in a way very similar to that of weighted
> interleaving, right?  I don't think we really need to behave that same to
> weighted interleaving with the cost.
>
> I'd hence suggest to implement and use a simple weights handling mechanism
> here.  It could be a round-robin way, like weighted interleaving, or probabilistic
> way, using damon_rand().
>
> The round-robin way may be simpler in my opinion.  For example,
>
> unsigned int damos_pa_nid_to_migrate(struct damos_migrate_dest *dest)
> {
>         static unsigned int nr_migrated = 0;
>         unsigned int total_weight = 0;
>         unsigned int weights_to_ignore;
>         size_t i;
>
>         for (i = 0; i < dest->nr_dests; i++)
>                 total_weight += dest->weight_arr[i];
>         weights_to_ignore = nr_migrated++ % total_weight;
>         total_weight = 0;
>         for (i = 0; i < dest->nr_dests; i++) {
>                 total_weight += dest->weight_arr[i];
>                 if (total_weight > weights_to_ignore)
>                         return dest->node_id_arr[i];
>         }
>         WARN_ONCE(1, "I don't know what I did wrong");
>         return 0;
> }
>
> Then, we could replace the above rmap_walk() call with this one.  What do you
> think?

I do actually think doing the interleaving based on the VMA offset is
important for a couple of reasons.

1. If also using the weighted interleaving mempolicy, and the DAMON
weights are the same as the mempolicy weights, DAMON won't have to
migrate newly allocated pages. This is relatively minor, but helps
avoid unnecessary work.

2. More importantly, I believe this approach will cause a lot of
needless ping-ponging, where the same folios are moved around when
they don't need to be. For example, let's say folios A-F are hot and,
to simplify, assume that folios on the same node are in the same DAMON
region and that only those folios are in those DAMON regions. If all
the folios start in Node 0 and both nodes have a weight of 1, we have:

nr_migrated = 0
Node 0           Node 1
----------           ----------
A-F                  <empty>

After the scheme is first applied

nr_migrated = 6
Node 0           Node 1
----------           ----------
A,C,E              B,D,F

This is fine, but these folios are still hot, so the scheme will be
applied to them again

nr_migrated = 12
Node 0           Node 1
----------           ----------
A,D,E             B,C,F

If I am understanding your code sample correctly, this will continue
to happen each time the scheme is applied, causing folios to be
migrated for no reason. Using the VMA offset to determine where a page
should be placed avoids this problem because it gives a folio a single
node it can be in for a given set of interleave weights. This means
that in steady state, no folios will be migrated.
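This ping-pong can be reproduced with a toy model. The following is a hedged
sketch, not DAMON code: the per-node scan order stands in for per-region
scanning, and i % 2 stands in for the VMA-offset/ilx computation:

```c
#include <assert.h>

/*
 * Toy model of the A-F example above: 6 hot folios, 2 nodes with 1:1
 * weights, folios scanned grouped by their current node (as regions
 * are).  "deterministic" uses a fixed per-folio target (i % 2);
 * otherwise a shared counter picks targets round-robin.
 */
#define NR_FOLIOS 6

static int pass(int *node_of, int deterministic, unsigned int *counter)
{
	int order[NR_FOLIOS], n = 0, moved = 0;
	int nid, i, k;

	/* scan node 0 residents first, then node 1 residents */
	for (nid = 0; nid < 2; nid++)
		for (i = 0; i < NR_FOLIOS; i++)
			if (node_of[i] == nid)
				order[n++] = i;

	for (k = 0; k < n; k++) {
		int want;

		i = order[k];
		want = deterministic ? i % 2 : (int)((*counter)++ % 2);
		if (node_of[i] != want) {
			node_of[i] = want;
			moved++;
		}
	}
	return moved;
}

/* Folios migrated by the npass'th application of the scheme. */
static int nth_pass_moves(int deterministic, int npass)
{
	int node_of[NR_FOLIOS] = { 0 };	/* all folios start on node 0 */
	unsigned int counter = 0;
	int moved = 0;
	int p;

	for (p = 0; p < npass; p++)
		moved = pass(node_of, deterministic, &counter);
	return moved;
}
```

The offset-based variant migrates 3 folios on the first pass and 0 on every
later pass; the round-robin variant keeps migrating 2 folios on each
subsequent pass even though nothing about the folios has changed.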

I see what you're saying about rmap_walks being expensive, but since
DAMON operates off the critical path for the workload, I don't think
the cost is that problematic.

[...]

Let me know what you think,
Bijan
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by SeongJae Park 3 months, 2 weeks ago
On Mon, 23 Jun 2025 09:16:53 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> On Sat, Jun 21, 2025 at 1:02 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > Hi Bijan,
> >
> > On Fri, 20 Jun 2025 13:04:58 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> >
> > > From: Bijan Tabatabai <bijantabatab@micron.com>
> > >
> > > The migrate_{hot,cold} DAMOS actions take a parameter, target_nid, to
> > > indicate what node the actions should migrate pages to. In this patch,
> > > we allow passing in a list of migration targets into target_nid. When
> > > this is done, the migrate_{hot,cold} actions will migrate pages between
> > > the specified nodes using the global interleave weights found at
> > > /sys/kernel/mm/mempolicy/weighted_interleave/node<N>. This functionality
> > > can be used to dynamically adjust how pages are interleaved in response
> > > to changes in bandwidth utilization to improve performance, as discussed
> > > in [1]. When only a single migration target is passed to target_nid, the
> > > migrate_{hot,cold} actions will act the same as before.
> > [...]
> > >  include/linux/damon.h    |   8 +--
> > >  mm/damon/core.c          |   9 ++--
> > >  mm/damon/lru_sort.c      |   2 +-
> > >  mm/damon/paddr.c         | 108 +++++++++++++++++++++++++++++++++++++--
> > >  mm/damon/reclaim.c       |   2 +-
> > >  mm/damon/sysfs-schemes.c |  14 +++--
> > >  samples/damon/mtier.c    |   6 ++-
> > >  samples/damon/prcl.c     |   2 +-
> > >  8 files changed, 131 insertions(+), 20 deletions(-)
> >
> > If we keep pursuing making DAMON users be able to specify multiple migration
> > destination nodes and their weights[1], I think we may need only paddr.c part
> > change of this patch in the final version of this great work.
> 
> Sounds good to me.
> 
> > [...]
> > >  static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> > >               unsigned long *sz_filter_passed)
> > >  {
> > >       unsigned long addr, applied;
> > > -     LIST_HEAD(folio_list);
> > > +     struct rmap_walk_control rwc;
> > [...]
> > >
> > >       addr = r->ar.start;
> > >       while (addr < r->ar.end) {
> > > @@ -522,15 +599,38 @@ static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
> > >               else
> > >                       *sz_filter_passed += folio_size(folio);
> > >
> > > +             /*
> > > +              * If there is only one target node, migrate there. Otherwise,
> > > +              * interleave across the nodes according to the global
> > > +              * interleave weights
> > > +              */
> > > +             if (nr_nodes == 1) {
> > > +                     target_nid = first_node(s->target_nids);
> > > +             } else {
> > > +                     target_nid = NUMA_NO_NODE;
> > > +                     /* Updates target_nid */
> > > +                     rmap_walk(folio, &rwc);
> > > +             }
> >
> > So we are doing rmap_walk(), which is known to be not very fast, for getting
> > the target node id of this page, in a way very similar to that of weighted
> > interleaving, right?  I don't think we really need to behave that same to
> > weighted interleaving with the cost.
> >
> > I'd hence suggest to implement and use a simple weights handling mechanism
> > here.  It could be a round-robin way, like weighted interleaving, or probabilistic
> > way, using damon_rand().
> >
> > The round-robin way may be simpler in my opinion.  For example,
> >
> > unsigned int damos_pa_nid_to_migrate(struct damos_migrate_dest *dest)
> > {
> >         static unsigned int nr_migrated = 0;
> >         unsigned int total_weight = 0;
> >         unsigned int weights_to_ignore;
> >         size_t i;
> >
> >         for (i = 0; i < dest->nr_dests; i++)
> >                 total_weight += dest->weight_arr[i];
> >         weights_to_ignore = nr_migrated++ % total_weight;
> >         total_weight = 0;
> >         for (i = 0; i < dest->nr_dests; i++) {
> >                 total_weight += dest->weight_arr[i];
> >                 if (total_weight > weights_to_ignore)
> >                         return dest->node_id_arr[i];
> >         }
> >         WARN_ONCE(1, "I don't know what I did wrong");
> >         return 0;
> > }
> >
> > Then, we could replace the above rmap_walk() call with this one.  What do you
> > think?
> 
> I do actually think doing the interleaving based on the VMA offset is
> important for a couple of reasons.
> 
> 1. If also using the weighted interleaving mempolicy, and the DAMON
> weights are the same as the mempolicy weights, DAMON won't have to
> migrate newly allocated pages. This is relatively minor, but helps
> avoid unnecessary work.
> 
> 2. More importantly, I believe this approach will cause a lot of
> needless ping-ponging, where the same folios are being moved around
> when they don't need to be. For example, let's say folios A-F are hot,
> and just for simplification, if they are on the same node, they will
> be in the same DAMON region, and only those folios are in those DAMON
> regions. If all the folios start in Node 0 and both nodes have a
> weight of 1, we have:
> 
> nr_migrated = 0
> Node 0           Node 1
> ----------           ----------
> A-F                  <empty>
> 
> After the scheme is first applied
> 
> nr_migrated = 6
> Node 0           Node 1
> ----------           ----------
> A,C,E              B,D,F
> 
> This is fine, but these folios are still hot, so the scheme will be
> applied to them again
> 
> nr_migrated = 12
> Node 0           Node 1
> ----------           ----------
> A,D,E             B,C,F
> 
> If I am understanding your code sample correctly, this will continue
> to happen each time the scheme is applied, causing folios to be
> migrated for no reason.

Thank you for walking through this with me, Bijan.  I understand and agree
with your concerns.
Actually, this kind of unnecessary ping-pong is a general problem for DAMOS.
We hence made a few DAMOS features to avoid this issue.

The first feature is 'age' reset.  DAMOS sets the 'age' of regions to zero
when it applies an action.  Hence if your DAMOS scheme has a minimum 'age' for
the target access pattern, the region will not be selected as an action target
again very soon.

The second feature is the quota.  You can set a speed limit on a DAMOS action,
to avoid DAMOS being too aggressive.  When DAMOS finds memory regions that are
eligible for a given action and larger than the given quota, it calculates the
access temperature of the regions, and applies the action only to the hottest
or coldest regions, up to the quota amount.  Whether to prioritize hotter or
colder depends on the action; DAMOS_MIGRATE_HOT prefers hotter ones.  Together
with the age reset, this can reduce unnecessary ping-pong.

The third feature is quota auto-tuning.  You can ask DAMON to adjust the
quotas on its own, based on some metrics.  Let me describe an example with the
memory tiering use case.  Consider there are two NUMA nodes of different
speeds, where node 0 is faster than node 1 from every CPU.  Then you can ask
DAMON to migrate hot pages on node 1 to node 0 aiming for 99% of node 0 memory
to be allocated, while migrating cold pages on node 0 to node 1 aiming for 1%
of node 0 memory to be free.  Then, DAMON will adjust the quotas for the two
different schemes based on the current node 0 memory used/free amount.  If
node 0 memory is used less than 99%, the hot pages migration scheme will work.
The aggressiveness will be determined by the difference between the current
memory usage and the target usage.  For example, DAMON will try to migrate hot
pages faster when node 0 memory usage is 50%, compared to when node 0 memory
usage is 98%.  The cold pages migration scheme will do nothing when node 0
memory is used less than 99%, since its goal (1% node 0 free memory ratio) is
already over-achieved.  When the node 0 memory usage becomes 99% and no more
allocations are made, DAMON will be quiet.  Even if a few more allocations
happen, DAMON will work at a slow speed, and hence make only a reasonable and
healthy amount of noise.

Back to your use case, you could set the per-node ideal memory usage of the
interleaving as the quota goal.  For example, for 1:1 ratio interleaving on 2
NUMA nodes, you could use two DAMOS schemes, one aiming for 50% node 0 memused,
and the other aiming for 50% node 0 memfree.  Once pages are well interleaved,
both schemes will stop working, avoiding unnecessary ping-ponging.
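As a rough illustration of that feedback, the applied quota can scale with the
distance to the goal.  This is a schematic sketch only, not DAMON's actual
quota-tuning code; the helper name and the linear scaling are assumptions:

```c
#include <assert.h>

/*
 * Schematic of goal-based quota scaling: the applied quota shrinks
 * linearly as the current metric approaches its target, and drops to
 * zero once the goal is met or over-achieved, so a converged scheme
 * goes quiet instead of ping-ponging.
 */
static unsigned long effective_quota(unsigned long max_quota,
				     unsigned long current_val,
				     unsigned long target_val)
{
	if (current_val >= target_val)
		return 0;	/* goal met: no more action */

	/* scale aggressiveness by the remaining distance to the goal */
	return max_quota * (target_val - current_val) / target_val;
}
```

E.g. with a 99% node 0 memused goal, usage at 50% yields a much larger quota
than usage at 98%, matching the behavior described above.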

Note that one of the quota auto-tuning metrics that DAMON supports is
arbitrary user input.  When this is used, users can simply feed any value as
the current value of the goal metric.  For example, you can use the
application's performance metric, memory bandwidth, or whatever.  You could
observe the node0-node1 balance from your user-space tool and feed it to
DAMON's quota auto-tuning.  Then, DAMON will do more migration when it is
imbalanced, and no more migration when it is well balanced.

Finally, you can change DAMON parameters, including schemes, while DAMON is
running.  You can add and remove schemes whenever you want, while DAMON keeps
monitoring the access pattern.  Your user-space tool can determine how
aggressive migration needs to be based on the current memory balance and
adjust DAMOS quotas online, or even turn DAMOS schemes off/on on demand.

So I think you could avoid the problem using these features.  Does this make
sense to you?

In the future, we could add more DAMOS self-feedback metrics for this use
case.  For example, the memory usage balance of nodes.  My self-tuning example
above used two schemes since there is no DAMOS quota goal tuning metric that
can directly be used for your use case.  But I'd say that shouldn't be a
blocker for this work.

> Using the VMA offset to determine where a page
> should be placed avoids this problem because it gives a folio a single
> node it can be in for a given set of interleave weights. This means
> that in steady state, no folios will be migrated.

This makes sense for this use case.  But I don't think it makes the same sense
for other possible use cases, like memory tiering on systems having multiple
NUMA nodes of the same tier.  If you really need this virtual address space
based deterministic behavior, it would make more sense to use virtual address
space monitoring (damon-vaddr).

> 
> I see what you're saying about rmap_walks being expensive, but since
> DAMON operates off the critical path for the workload, I don't think
> the cost is that problematic.

You're right.  We try to make DAMON controllable (min/max_nr_regions or DAMOS
quotas) rather than always fast.  But we still try to be fast and invisible
when possible.  Since this change is not only for interleaving but also for
general multi-node migration, and we have features that can hopefully address
your concern, I'd like to think it through again with you.


Thanks,
SJ

[...]
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by Bijan Tabatabai 3 months, 2 weeks ago
[...]
> Thank you for walking with me, Bijan.  I understand and agree your concerns.
> Actually, this kind of unnecessary ping-pong is a general problem for DAMOS.
> We hence made a few DAMOS features to avoid this issue.
>
> The first feature is 'age' reset.  DAMOS sets the 'age' of regions to zero
> when it applies an action.  Hence if your DAMOS scheme has a minimum 'age'
> for the target access pattern, the region will not be selected as an action
> target again very soon.
>
> The second feature is the quota.  You can set a speed limit for a DAMOS
> action, to avoid DAMOS being too aggressive.  When DAMOS finds memory regions
> that are eligible for a given action and larger than the given quota, it
> calculates the access temperature of the regions, and applies the action to
> only the hottest or coldest regions up to the quota amount.  Whether to
> prioritize hotter or colder depends on the action.  DAMOS_MIGRATE_HOT prefers
> hotter ones.  Together with the age reset, this can reduce unnecessary
> ping-ponging.
>
> The third feature is quota auto-tuning.  You can ask DAMON to adjust the
> quotas on its own, based on some metrics.  Let me describe an example with a
> memory tiering use case.  Consider there are two NUMA nodes of different
> speeds.  Node 0 is faster than node 1, for every CPU alike.  Then you can ask
> DAMON to migrate hot pages on node 1 to node 0, aiming for 99% of node 0
> memory to be allocated, while migrating cold pages on node 0 to node 1,
> aiming for 1% of node 0 memory to be free.  Then, DAMON will adjust the
> quotas for the two schemes based on the current node 0 memory used/free
> amount.  If node 0 memory usage is less than 99%, the hot pages migration
> scheme will work.  The aggressiveness will be determined by the difference
> between the current memory usage and the target usage.  For example, DAMON
> will try to migrate hot pages faster when node 0 memory usage is 50% than
> when it is 98%.  The cold pages migration scheme will do nothing while node 0
> memory usage is less than 99%, since its goal (1% node 0 free memory ratio)
> is already over-achieved.  When node 0 memory usage becomes 99% and no more
> allocations are made, DAMON will be quiet.  Even if a few more allocations
> happen, DAMON will work at a slow speed, and hence make only a reasonable and
> healthy amount of noise.
>
> Back to your use case, you could set the per-node ideal memory usage of
> interleaving as the quota goal.  For example, for 1:1 ratio interleaving on 2
> NUMA nodes, you could use two DAMOS schemes, one aiming for 50% node 0
> memused, and the other aiming for 50% node 0 memfree.  Once pages are well
> interleaved, both schemes will stop working, avoiding unnecessary
> ping-ponging.
>
> Note that one of the quota auto-tuning metrics that DAMON supports is
> arbitrary user input.  When this is used, users can simply feed any value as
> the current value of the goal metric.  For example, you can use the
> application's performance metric, memory bandwidth, or whatever.  You could
> observe the node0-node1 balance from your user-space tool and feed it to
> DAMON's quota auto-tuning.  Then, DAMON will do more migration when it is
> imbalanced, and no more migration when it is well balanced.
>
> Finally, you can change DAMON parameters, including schemes, while DAMON is
> running.  You can add and remove schemes whenever you want, while DAMON keeps
> monitoring the access pattern.  Your user-space tool can determine how
> aggressive migration needs to be based on the current memory balance and
> adjust DAMOS quotas online, or even turn DAMOS schemes off/on on demand.
>
> So I think you could avoid the problem using these features.  Does this make
> sense to you?
>
> In the future, we could add more DAMOS self-feedback metrics for this use
> case.  For example, the memory usage balance of nodes.  My self-tuning
> example above used two schemes since there is no DAMOS quota goal tuning
> metric that can directly be used for your use case.  But I'd say that
> shouldn't be a blocker for this work.


Hi SeongJae,

I really appreciate your detailed response.
The quota auto-tuning helps, but I feel like it's still not exactly
what I want. For example, I think a quota goal that stops migration
based on the memory usage balance gets quite a bit more complicated
when instead of interleaving all data, we are just interleaving *hot*
data. I haven't looked at it extensively, but I imagine it wouldn't be
easy to identify how much data is hot in the paddr setting, especially
because the regions can contain a significant amount of unallocated
data. Also, if the interleave weights changed, for example, from 11:9
to 10:10, it would be preferable if only 5% of data is migrated;
however, with the round robin approach, 50% would be. Finally, and I
forgot to mention this in my last message, the round-robin approach
does away with any notion of spatial locality, which does help the
effectiveness of interleaving [1]. I don't think anything done with
quotas can get around that. I wonder if there's an elegant way to
specify whether to use rmap or not, but my initial feeling is that
might just add complication to the code and interface for not enough
benefit.

Maybe, as you suggest later on, this is an indication that my use case
is a better fit for a vaddr scheme. I'll get into that more below.

> > Using the VMA offset to determine where a page
> > should be placed avoids this problem because it gives a folio a single
> > node it can be in for a given set of interleave weights. This means
> > that in steady state, no folios will be migrated.
>
> This makes sense for this use case.  But I don't think this makes same sense
> for possible other use cases, like memory tiering on systems having multiple
> NUMA nodes of same tier.

I see where you're coming from. I think the crux of this difference is
that in my use case, the set of nodes we are monitoring is the same as
the set of nodes we are migrating to, while in the use case you
describe, the set of nodes being monitored is disjoint from the set of
migration target nodes. I think this in particular makes ping ponging
more of a problem for my use case, compared to promotion/demotion
schemes.

> If you really need this virtual address space based
> deterministic behavior, it would make more sense to use virtual address spaces
> monitoring (damon-vaddr).

Maybe it does make sense for me to implement vaddr versions of the
migrate actions for my use case. One thing that gives me pause about
this is that, from what I understand, it would be harder to have
vaddr schemes apply to processes that start after DAMON begins. I
think to do that, one would have to detect when a process starts, and
then do a damon tune to update the targets list? It would be nice if,
say, you could specify a cgroup as a vaddr target and track all
processes in that cgroup, but that would be a different patchset for
another day.

But, using vaddr has other benefits, like the sampling would take into
account the locality of the accesses. There are also ways to make
vaddr sampling more efficient by using higher levels of the page
tables, which I don't think apply to paddr schemes [2]. I believe the
authors of [2] said they submitted their patches to the kernel, but I
don't know if they have been upstreamed (sorry about derailing the
conversation slightly).

[1] https://elixir.bootlin.com/linux/v6.16-rc3/source/mm/mempolicy.c#L213
[2] https://www.usenix.org/conference/atc24/presentation/nair
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by SeongJae Park 3 months, 2 weeks ago
On Mon, 23 Jun 2025 18:15:00 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

[...]
> Hi SeongJae,
> 
> I really appreciate your detailed response.
> The quota auto-tuning helps, but I feel like it's still not exactly
> what I want. For example, I think a quota goal that stops migration
> based on the memory usage balance gets quite a bit more complicated
> when instead of interleaving all data, we are just interleaving *hot*
> data. I haven't looked at it extensively, but I imagine it wouldn't be
> easy to identify how much data is hot in the paddr setting,

I don't think so, and I don't see why you think so.  Could you please
elaborate?

> especially
> because the regions can contain a significant amount of unallocated
> data.

In that case, unallocated memory shouldn't be accessed at all, so the region
will just look cold to DAMON.

> Also, if the interleave weights changed, for example, from 11:9
> to 10:10, it would be preferable if only 5% of data is migrated;
> however, with the round robin approach, 50% would be. Finally, and I
> forgot to mention this in my last message, the round-robin approach
> does away with any notion of spatial locality, which does help the
> effectiveness of interleaving [1].

We could use the probabilistic interleaving, if this is the problem?

> I don't think anything done with
> quotas can get around that.

I think I'm not getting your points well, sorry.  More elaboration of your
concern would be helpful.

> I wonder if there's an elegant way to
> specify whether to use rmap or not, but my initial feeling is that
> might just add complication to the code and interface for not enough
> benefit.

Agreed.  Please note that I'm open to adding an interface for this behavior if
the benefit is clear.  I'm also thinking that adding non-rmap migration first
(if it shows some benefit), and adding rmap support later with additional
benefit confirmation, could also be an option.

> 
> Maybe, as you suggest later on, this is an indication that my use case
> is a better fit for a vaddr scheme. I'll get into that more below.
> 
> > > Using the VMA offset to determine where a page
> > > should be placed avoids this problem because it gives a folio a single
> > > node it can be in for a given set of interleave weights. This means
> > > that in steady state, no folios will be migrated.
> >
> > This makes sense for this use case.  But I don't think this makes same sense
> > for possible other use cases, like memory tiering on systems having multiple
> > NUMA nodes of same tier.
> 
> I see where you're coming from. I think the crux of this difference is
> that in my use case, the set of nodes we are monitoring is the same as
> the set of nodes we are migrating to, while in the use case you
> describe, the set of nodes being monitored is disjoint from the set of
> migration target nodes.

I understand and agree with this difference.

> I think this in particular makes ping ponging
> more of a problem for my use case, compared to promotion/demotion
> schemes.

But again I'm failing to understand this, sorry.  Could I ask for more
elaboration?

> 
> > If you really need this virtual address space based
> > deterministic behavior, it would make more sense to use virtual address spaces
> > monitoring (damon-vaddr).
> 
> Maybe it does make sense for me to implement vaddr versions of the
> migrate actions for my use case.

Yes, that could also be an option.

> One thing that gives me pause about
> this, is that, from what I understand, it would be harder to have
> vaddr schemes apply to processes that start after damon begins. I
> think to do that, one would have to detect when a process starts, and
> then do a damon tune to upgrade the targets list? It would be nice if,
> say, you could specify a cgroup as a vaddr target and track all
> processes in that cgroup, but that would be a different patchset for
> another day.

I agree that could be a future thing to do.  Note that the DAMON user-space
tool implements[1] a similar feature.

> 
> But, using vaddr has other benefits, like the sampling would take into
> account the locality of the accesses. There are also ways to make
> vaddr sampling more efficient by using higher levels of the page
> tables, that I don't think apply to paddr schemes [2]. I believe the
> authors of [2] said they submitted their patches to the kernel, but I
> don't know if it has been upstreamed (sorry about derailing the
> conversation slightly).

Thank you for the reminder.  It was a nice finding and approach[2], but
unfortunately it wasn't upstreamed.  I now realize the monitoring intervals
auto-tuning[3] idea was partly motivated by that nice discussion, though.

[1] https://github.com/damonitor/damo/blob/next/release_note#L33
[2] https://lore.kernel.org/damon/20240318132848.82686-1-aravinda.prasad@intel.com/
[3] https://lkml.kernel.org/r/20250303221726.484227-1-sj@kernel.org


Thanks,
SJ

[...]

> 
> [1] https://elixir.bootlin.com/linux/v6.16-rc3/source/mm/mempolicy.c#L213
> [2] https://www.usenix.org/conference/atc24/presentation/nair
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by Bijan Tabatabai 3 months, 2 weeks ago
On Mon, Jun 23, 2025 at 7:34 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Mon, 23 Jun 2025 18:15:00 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
>
> [...]
> > Hi SeongJae,
> >
> > I really appreciate your detailed response.
> > The quota auto-tuning helps, but I feel like it's still not exactly
> > what I want. For example, I think a quota goal that stops migration
> > based on the memory usage balance gets quite a bit more complicated
> > when instead of interleaving all data, we are just interleaving *hot*
> > data. I haven't looked at it extensively, but I imagine it wouldn't be
> > easy to identify how much data is hot in the paddr setting,
>
> I don't think so, and I don't see why you think so.  Could you please
> elaborate?

Elaborated below.

> > especially
> > because the regions can contain a significant amount of unallocated
> > data.
>
> In that case, unallocated memory shouldn't be accessed at all, so the region
> will just look cold to DAMON.

"Significant" was too strong of a word, but if physical memory is
fragmented, couldn't there be a non-negligible amount of unallocated
memory in a hot region? If so, I think it would mean that you cannot
simply take a sum of the sizes of the hot regions in each node to
compute how the hot data is interleaved because those regions may
contain unallocated memory that shouldn't count for that calculation.
Does that make sense?

It's very possible I might be overthinking this and it won't be an
issue in practice. It might be best to not worry about it until it
becomes an issue in practice.

> > Also, if the interleave weights changed, for example, from 11:9
> > to 10:10, it would be preferable if only 5% of data is migrated;
> > however, with the round robin approach, 50% would be.

Elaborating more on this:
Imagine a process begins with weights of 3 and 2 for nodes 0 and 1
respectively in both DAMON and the weighted interleave policy. If you
looked at which node a page resides in for a group of contiguous
pages, it would be something like this (using letters to represent the
virtual addresses):

A -> node 0
B -> node 0
C -> node 0
D -> node 1
E -> node 1
F -> node 0
G -> node 0
H -> node 0
I -> node 1
J -> node 1

If we use a user defined quota autotuning mechanism like you described
in [1] to stop DAMON interleaving when we detect that the data is
interleaved correctly, no interleaving would happen, which is good.
However, let's say we change the DAMON weights to be 4:1. My
understanding is that DAMON applies the scheme to regions in ascending
order of physical address (for paddr schemes), so if using the
round-robin algorithm you provided in [2], the interleaving would
apply to the pages in node 0 first, then node 1. For the sake of
simplicity, let's say in this scenario the pages in the same node are
sorted by their virtual address, so the interleaving would be applied
in the order ABCFGHDEIJ. This would result in the following page
placement

A -> node 0
B -> node 0
C -> node 0
D -> node 0
E -> node 0
F -> node 0
G -> node 1
H -> node 0
I -> node 0
J -> node 1

So, four pages, D, E, G, and I, have been migrated. However, if they
were interleaved using their virtual addresses*, only pages D and I
would have been migrated.

* Technically, the mempolicy code interleaves based on the offset from
the start of the VMA, but that difference doesn't change this example.
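
The weight-change example above can be replayed with a short simulation. This
is an illustrative sketch of the two target-selection policies discussed here,
not the actual DAMOS code, and it keeps the same simplifying assumption (pages
sorted by virtual address within each node):

```python
def round_robin_targets(order, weights):
    """Cycle through nodes, giving node n `weights[n]` consecutive pages
    of `order` per cycle."""
    targets, node, used = {}, 0, 0
    for page in order:
        targets[page] = node
        used += 1
        if used == weights[node]:
            node, used = (node + 1) % len(weights), 0
    return targets

# Initial 3:2 placement from the example (letters are virtual addresses).
initial = dict(zip("ABCDEFGHIJ", [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]))
new_weights = [4, 1]

# paddr-style round robin visits node 0's pages first, then node 1's.
paddr_order = [p for p in "ABCDEFGHIJ" if initial[p] == 0] + \
              [p for p in "ABCDEFGHIJ" if initial[p] == 1]  # "ABCFGHDEIJ"
paddr = round_robin_targets(paddr_order, new_weights)

# vaddr-style interleaving assigns targets by virtual address alone.
vaddr = round_robin_targets("ABCDEFGHIJ", new_weights)

def migrated(targets):
    return sorted(p for p in targets if targets[p] != initial[p])

print(migrated(paddr))  # ['D', 'E', 'G', 'I'] -> four migrations
print(migrated(vaddr))  # ['D', 'I']           -> two migrations
```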

> > Finally, and I
> > forgot to mention this in my last message, the round-robin approach
> > does away with any notion of spatial locality, which does help the
> > effectiveness of interleaving [1].

Elaborating more on this.
As implied by the comment in [3], interleaving works better the finer
grained it is done in virtual memory. As an extreme example, if you
had weights of 3:2, putting the first 60% of a process's data in node
0 and the remaining 40% in node 1 would satisfy the ratio globally,
but you would likely not see the benefits of interleaving. We can see
in the example above that your round-robin approach does not maintain
the desired interleave ratio locally, even though it does globally.
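
The locality point can be made numeric with another illustrative sketch: both
placements below have a 3:2 global ratio over ten virtual pages, but only the
fine-grained one keeps that ratio within every five-page window of the virtual
address space:

```python
def window_ratios(nodes, window=5):
    """Fraction of pages placed on node 0 within each sliding window of
    `window` consecutive virtual pages."""
    return [sum(1 for n in nodes[i:i + window] if n == 0) / window
            for i in range(len(nodes) - window + 1)]

blocked = [0] * 6 + [1] * 4             # first 60% on node 0, rest on node 1
fine = [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]   # 3:2 pattern repeated

print(window_ratios(blocked))  # [1.0, 1.0, 0.8, 0.6, 0.4, 0.2]
print(window_ratios(fine))     # [0.6, 0.6, 0.6, 0.6, 0.6, 0.6]
```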

> We could use the probabilistic interleaving, if this is the problem?

I don't think so. In the above example, with probabilistic
interleaving, you would still migrate, on average, 20% of the pages in
node 0 and 80% of the pages in node 1. Similarly, the probabilistic
interleaving also does not consider the virtual address, so it
wouldn't maintain the interleave ratio locally in the virtual address
space either.

> > I don't think anything done with
> > quotas can get around that.
>
> I think I'm not getting your points well, sorry.  More elaboration of your
> concern would be helpful.

I elaborated more above. Hopefully that clears up any confusion. If
you still have questions, maybe it would be easier to e-meet and have
a live discussion about it? I see you have a DAMON chat slot open
tomorrow at 9:30 PT [4]. If you have nothing else scheduled, maybe
that would be a good time to chat?

[...]

> > I see where you're coming from. I think the crux of this difference is
> > that in my use case, the set of nodes we are monitoring is the same as
> > the set of nodes we are migrating to, while in the use case you
> > describe, the set of nodes being monitored is disjoint from the set of
> > migration target nodes.
>
> I understand and agree with this difference.
>
> > I think this in particular makes ping ponging
> > more of a problem for my use case, compared to promotion/demotion
> > schemes.
>
> But again I'm failing to understand this, sorry.  Could I ask for more
> elaboration?

Sure, and sorry for needing to elaborate so much.

What I was trying to say is that in the case where a scheme is
monitoring the same nodes it is migrating to, when it detects a hot
region, it will interleave the pages in the region between the nodes.
If there are two nodes, and assuming the access pattern was uniform
across the region, we have now turned one hot region into two. Using
the algorithms you provided earlier, the next time the scheme is
applied, it will interleave both of those regions again because the
only information it has about where to place pages is how many pages
it has previously interleaved. Using virtual addresses to interleave
solves this problem by providing one and only one location a page
should be in given a set of interleave weights.
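
The ping-pong described above can be demonstrated with an illustrative sketch,
under the pessimistic assumption that the scheme keeps re-selecting the same
pages (i.e., no age reset or quota throttling):

```python
def round_robin_targets(order, weights):
    """Cycle through nodes, giving node n `weights[n]` consecutive pages
    of `order` per cycle."""
    targets, node, used = {}, 0, 0
    for page in order:
        targets[page] = node
        used += 1
        if used == weights[node]:
            node, used = (node + 1) % len(weights), 0
    return targets

pages, weights = "ABCD", [1, 1]
placement = dict(zip(pages, [0, 0, 1, 1]))  # already 1:1 globally

for step in range(4):
    # Placement-based round robin: visit node 0's pages (in address
    # order) first, then node 1's, and reassign targets from scratch.
    order = [p for p in pages if placement[p] == 0] + \
            [p for p in pages if placement[p] == 1]
    targets = round_robin_targets(order, weights)
    moved = [p for p in pages if targets[p] != placement[p]]
    placement = dict(targets)
    print(step, moved)  # two pages move on *every* pass: the ping-pong

# Virtual-address-based targets are a pure function of the addresses, so
# they never change; once the placement matches them, later passes
# migrate nothing.
vaddr = round_robin_targets(pages, weights)  # {'A': 0, 'B': 1, 'C': 0, 'D': 1}
```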

When a scheme is monitoring one set of nodes and migrating to another
disjoint set of nodes, you don't have this problem because once the
pages are migrated, they won't be considered by the scheme until some
other scheme moves those pages back into the monitored nodes.

Does that make sense?

> >
> > > If you really need this virtual address space based
> > > deterministic behavior, it would make more sense to use virtual address spaces
> > > monitoring (damon-vaddr).
> >
> > Maybe it does make sense for me to implement vaddr versions of the
> > migrate actions for my use case.
>
> Yes, that could also be an option.

Given how much my explanations here stressed that having access to the
virtual addresses solves the problems I mentioned, I think the path
forward for the next revision should be:

1) Have the paddr migration scheme use the round-robin interleaving
that you provided - This would be good for the use case you described
where you promote pages from a node into multiple nodes of the same
tier.
2) Implement a vaddr migration scheme that uses the virtual address
based interleaving - This is useful for my target use case of
balancing bandwidth utilization between nodes.

If the vaddr scheme proves insufficient down the line for my use case,
we can have another discussion at that time.
How does this sound to you?

> > One thing that gives me pause about
> > this, is that, from what I understand, it would be harder to have
> > vaddr schemes apply to processes that start after damon begins. I
> > think to do that, one would have to detect when a process starts, and
> > then do a damon tune to upgrade the targets list? It would be nice if,
> > say, you could specify a cgroup as a vaddr target and track all
> > processes in that cgroup, but that would be a different patchset for
> > another day.
>
> I agree that could be a future thing to do.  Note that the DAMON user-space
> tool implements[1] a similar feature.

Thanks, I'll take a look at that.

[...]

Thanks again for the time you are spending on these discussions. I do
appreciate it, and I hope I'm not taking up too much of your time.

Bijan

[1] https://lore.kernel.org/damon/20250623175204.43917-1-sj@kernel.org/
[2] https://lore.kernel.org/damon/20250621180215.36243-1-sj@kernel.org/
[3] https://elixir.bootlin.com/linux/v6.16-rc3/source/mm/mempolicy.c#L213
[4] https://lore.kernel.org/damon/20250620205819.98472-1-sj@kernel.org/T/#t
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by SeongJae Park 3 months, 2 weeks ago
On Tue, 24 Jun 2025 11:01:46 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

> On Mon, Jun 23, 2025 at 7:34 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Mon, 23 Jun 2025 18:15:00 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:
> >
> > [...]
> > > Hi SeongJae,
> > >
> > > I really appreciate your detailed response.
> > > The quota auto-tuning helps, but I feel like it's still not exactly
> > > what I want. For example, I think a quota goal that stops migration
> > > based on the memory usage balance gets quite a bit more complicated
> > > when instead of interleaving all data, we are just interleaving *hot*
> > > data. I haven't looked at it extensively, but I imagine it wouldn't be
> > > easy to identify how much data is hot in the paddr setting,
> >
> > I don't think so, and I don't see why you think so.  Could you please
> > elaborate?
> 
> Elaborated below.
> 
> > > especially
> > > because the regions can contain a significant amount of unallocated
> > > data.
> >
> > In that case, unallocated memory shouldn't be accessed at all, so the
> > region will just look cold to DAMON.
> 
> "Significant" was too strong of a word, but if physical memory is
> fragmented, couldn't there be a non-negligible amount of unallocated
> memory in a hot region? If so, I think it would mean that you cannot
> simply take a sum of the sizes of the hot regions in each node to
> compute how the hot data is interleaved because those regions may
> contain unallocated memory that shouldn't count for that calculation.
> Does that make sense?
> 
> It's very possible I might be overthinking this and it won't be an
> issue in practice. It might be best to not worry about it until it
> becomes an issue in practice.

So, I understand this is not a problem only for this migration-based
interleaving approach, but for general DAMON accuracy on the physical address
space.

I agree.  DAMON's region-based mechanism provides only best effort results,
and depending on the workload, virtual address space based monitoring has a
chance to work better than physical address space based monitoring.  Whether
the accuracy level is acceptable would depend on the use case.  What I can say
at the moment is, there are use cases of physical address based DAMON, and we
are working further to make it better.  The recent monitoring intervals
auto-tuning was one such effort.

> 
> > > Also, if the interleave weights changed, for example, from 11:9
> > > to 10:10, it would be preferable if only 5% of data is migrated;
> > > however, with the round robin approach, 50% would be.
> 
> Elaborating more on this:
> Imagine a process begins with weights of 3 and 2 for nodes 0 and 1
> respectively in both DAMON and the weighted interleave policy. If you
> looked at which node a page resides in for a group of contiguous
> pages, it would be something like this (using letters to represent the
> virtual addresses):
> 
> A -> node 0
> B -> node 0
> C -> node 0
> D -> node 1
> E -> node 1
> F -> node 0
> G -> node 0
> H -> node 0
> I -> node 1
> J -> node 1
> 
> If we use a user defined quota autotuning mechanism like you described
> in [1] to stop DAMON interleaving when we detect that the data is
> interleaved correctly, no interleaving would happen, which is good.
> However, let's say we change the DAMON weights to be 4:1. My
> understanding is that DAMON applies the scheme to regions in ascending
> order of physical address (for paddr schemes), so if using the
> round-robin algorithm you provided in [2], the interleaving would
> apply to the pages in node 0 first, then node 1. For the sake of
> simplicity, let's say in this scenario the pages in the same node are
> sorted by their virtual address,
> so the interleaving would be applied
> in the order ABCFGHDEIJ. This would result in the following page
> placement
> 
> A -> node 0
> B -> node 0
> C -> node 0
> D -> node 0
> E -> node 0
> F -> node 0
> G -> node 1
> H -> node 0
> I -> node 0
> J -> node 1
> 
> So, four pages, D, E, G, and I, have been migrated. However, if they
> were interleaved using their virtual addresses*, only pages D and I
> would have been migrated.

Thank you for this kind and detailed explanation.  Now I can understand your
concern.  To my understanding, one of the important assumptions in the example
is that the pages of the DAMON region are sorted in both the virtual address
space and the physical address space.  I'm not very sure that is a reasonable
assumption.

Hence, in such cases where a DAMON region mixes pages of random virtual
addresses, I think the physical address based approach could work better than
in the virtual address-sorted case, and sometimes even make fewer migrations
than the virtual address-based approach.  For example, let's say we found a
DAMON region of 6 pages in node 0, and the virtual address-based interleaving
targets of the 6 pages are all node 1, while the physical address-based
interleaving targets are 3 pages for node 0 and the other 3 pages for node 1
(weights 1:1).  Then virtual address-based interleaving does 6 migrations
while the other does only 3.
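
The counts in this counter-example can be checked with a trivial illustrative
sketch (which case occurs depends on how virtual addresses happen to map into
the region):

```python
# Six pages of one DAMON region, all currently on node 0.
current = [0] * 6

# Assumed worst case for the virtual address-based policy: the pages'
# virtual-address offsets all happen to select node 1.
vaddr_targets = [1] * 6
# Physical address-based round robin with 1:1 weights simply alternates.
paddr_targets = [i % 2 for i in range(6)]

def count(targets):
    return sum(c != t for c, t in zip(current, targets))

print(count(vaddr_targets), count(paddr_targets))  # 6 3
```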

So both approaches have worst-case scenarios, and my gut feeling is that this
might not result in a big difference as long as careful scheme aggressiveness
control is applied (whether using quota auto-tuning or manual online DAMON
parameter tuning).  Of course it would be nice if we had good test results.

Meanwhile, assuming the DAMON regions on the physical address space may have
pages of random virtual addresses, I find another concern about the behavior
from the users' perspective.  In my opinion, the commonly expected behavior of
the interface we are working on together is: "DAMON will migrate pages of the
regions of the given access pattern to the given NUMA nodes with the given
weights."  If there is a DAMON region having pages of random virtual
addresses, and the virtual address-based target node selection results in a
target node ratio that is different from the user-specified weights (like the
above example of 6 pages), I think users could be confused by the results.

To go in this direction, I think the interface should clearly explain the
behavior.  I should confess I have concerns even in that case, if the behavior
is unnecessarily complex for users and developers.

Alternatively, if we go with a fully virtual address based approach (monitor
on virtual addresses and operate with virtual addresses), I think the behavior
would be optimal and easier for users to predict.

> 
> * Technically, the mempolicy code interleaves based on the offset from
> the start of the VMA, but that difference doesn't change this example.
> 
> > > Finally, and I
> > > forgot to mention this in my last message, the round-robin approach
> > > does away with any notion of spatial locality, which does help the
> > > effectiveness of interleaving [1].
> 
> Elaborating more on this.
> As implied by the comment in [3], interleaving works better the finer
> grained it is done in virtual memory. As an extreme example, if you
> had weights of 3:2, putting the first 60% of a process's data in node
> 0 and the remaining 40% in node 1 would satisfy the ratio globally,
> but you would likely not see the benefits of interleaving. We can see
> in the example above that your round-robin approach does not maintain
> the desired interleave ratio locally, even though it does globally.

I'm not fully convinced that keeping the ratio process-locally is always
beneficial.  At least when people use DAMON on the physical address space, I'd
assume their concern is global memory bandwidth control.  And hence I imagine
they would care about the global ratio more than the local ratio.

I agree there could be a case where users care about the process-local ratio,
though.  In that case, I feel using the virtual address based mode sounds more
natural, unless there is another reason to use the physical address space
mode.

> 
> > We could use the probabilistic interleaving, if this is the problem?
> 
> I don't think so. In the above example, with probabilistic
> interleaving, you would still migrate, on average, 20% of the pages in
> node 0 and 80% of the pages in node 1. Similarly, the probabilistic
> interleaving also does not consider the virtual address, so it
> wouldn't maintain the interleave ratio locally in the virtual address
> space either.

Right.  But if the DAMON region contains pages at random virtual addresses, it
wouldn't always result in the worst case.

> 
> > > I don't think anything done with
> > > quotas can get around that.
> >
> > I think I'm not getting your points well, sorry.  More elaboration of your
> > concern would be helpful.
> 
> I elaborated more above. Hopefully that clears up any confusion. If
> you still have questions, maybe it would be easier to e-meet and have
> a live discussion about it? I see you have a DAMON chat slot open
> tomorrow at 9:30 PT [4]. If you have nothing else scheduled, maybe
> that would be a good time to chat?

Nice suggestion, thank you.  For readers of this mail other than Bijan and me:
we made an appointment for this off list.

> 
> [...]
> 
> > > I see where you're coming from. I think the crux of this difference is
> > > that in my use case, the set of nodes we are monitoring is the same as
> > > the set of nodes we are migrating to, while in the use case you
> > > describe, the set of nodes being monitored is disjoint from the set of
> > > migration target nodes.
> >
> > I understand and agree this difference.
> >
> > > I think this in particular makes ping ponging
> > > more of a problem for my use case, compared to promotion/demotion
> > > schemes.
> >
> > But again I'm failing at understanding this, sorry.  Could I ask more
> > elaborations?
> 
> Sure, and sorry for needing to elaborate so much.

No worries, thank you for walking through this with me.

> 
> What I was trying to say is that in the case where a scheme is
> monitoring the same nodes it is migrating to, when it detects a hot
> region, it will interleave the pages in the region between the nodes.
> If there are two nodes, and assuming the access pattern was uniform
> across the region, we have now turned one hot region into two. Using
> the algorithms you provided earlier, the next time the scheme is
> applied, it will interleave both of those regions again because the
> only information it has about where to place pages is how many pages
> it has previously interleaved.

You're correct, if we don't use a node imbalance state based scheme tuning
approach.  With that approach, though, this will not happen?

> Using virtual addresses to interleave
> solves this problem by providing one and only one location a page
> should be in given a set of interleave weights.

Even in that case, DAMON will do rmap walks on the pages.  If it is an entirely
virtual address mode, the rmap walking is not needed, but DAMON should still
iterate over each page.  For large systems, that might be a non-negligible
amount of system resources.  So I think even if we use virtual address based
target node selection, or an entirely virtual address mode, that kind of
imbalance status based scheme tuning could be beneficial.

> 
> When a scheme is monitoring one set of nodes and migrating to another
> disjoint set of nodes, you don't have this problem because once the
> pages are migrated, they won't be considered by the scheme until some
> other scheme moves those pages back into the monitored nodes.

Not necessarily, in my opinion.  In the tiering scenario, we have two schemes
migrating hot/cold pages to each other's node.  Without quotas or very good
hot/coldness thresholds, unnecessary ping-ponging can happen.

> 
> Does that make sense?
> 
> > >
> > > > If you really need this virtual address space based
> > > > deterministic behavior, it would make more sense to use virtual address spaces
> > > > monitoring (damon-vaddr).
> > >
> > > Maybe it does make sense for me to implement vaddr versions of the
> > > migrate actions for my use case.
> >
> > Yes, that could also be an option.
> 
> Given how much my explanations here stressed that having access to the
> virtual addresses solves the problems I mentioned, I think the path
> forward for the next revision should be:
> 
> 1) Have the paddr migration scheme use the round-robin interleaving
> that you provided - This would be good for the use case you described
> where you promote pages from a node into multiple nodes of the same
> tier.

If you will not use that, I don't think you need to implement it.  Let's work
only for our own selfish purposes ;)  I believe that is a better approach for
long term maintenance, too.

> 2) Implement a vaddr migration scheme that uses the virtual address
> based interleaving - This is useful for my target use case of
> balancing bandwidth utilization between nodes?
> 
> If the vaddr scheme proves insufficient down the line for my use case,
> we can have another discussion at that time.
> How does this sound to you?

I find no concern with the fully virtual address based approach off the top of
my head.  And if you think it will work for you and you will use it, I think
this is the right way to go.  Based on our previous discussion, I also think
this could work for your and others' multiple use cases.

We still have a few conflicting opinions about the physical address monitoring
based approach.  I love having these discussions with you.  But I don't think
those need to block 2).  So to me, 2) seems the right way to go.

> 
> > > One thing that gives me pause about
> > > this, is that, from what I understand, it would be harder to have
> > > vaddr schemes apply to processes that start after damon begins. I
> > > think to do that, one would have to detect when a process starts, and
> > > then do a damon tune to upgrade the targets list? It would be nice if,
> > > say, you could specify a cgroup as a vaddr target and track all
> > > processes in that cgroup, but that would be a different patchset for
> > > another day.
> >
> > I agree that could be a future thing to do.  Note that DAMON user-space tool
> > implements[1] a similar feature.
> 
> Thanks, I'll take a look at that.
> 
> [...]
> 
> Thanks again for the time you are spending on these discussions. I do
> appreciate it, and I hope I'm not taking up too much of your time.

I also appreciate your time and patience.  I see we are getting close to
alignment, though; please bear with me!


Thanks,
SJ

[...]
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by SeongJae Park 3 months, 2 weeks ago
On Sat, 21 Jun 2025 11:02:15 -0700 SeongJae Park <sj@kernel.org> wrote:

[...]
> I'd hence suggest to implement and use a simple weights handling mechanism
> here.  It could be the round-robin way, like weighted interleaving, or the
> probabilistic way, using damon_rand().
> 
> The round-robin way may be simpler in my opinion.  For example,
> 
> unsigned int damos_pa_nid_to_migrate(struct damos_migrate_dest *dest)
> {
> 	static unsigned int nr_migrated = 0;
> 	unsigned int total_weight = 0;
> 	unsigned int weights_to_ignore;
> 	size_t i;
> 
> 	for (i = 0; i < dest->nr_dests; i++)
> 		total_weight += dest->weight_arr[i];
> 	weights_to_ignore = nr_migrated++ % total_weight;

Actually, the probabilistic way may be not that complicated.  Maybe we could do
the below here, replacing the weights_to_ignore computation above.

	weights_to_ignore = damon_rand(0, total_weight);

> 	total_weight = 0;
> 	for (i = 0; i < dest->nr_dests; i++) {
> 		total_weight += dest->weight_arr[i];
> 		if (total_weight > weights_to_ignore)
> 			return dest->node_id_arr[i];
> 	}
> 	WARN_ONCE(1, "I don't know what I did wrong");
> 	return 0;
> }

But damon_rand() might be more expensive than the round-robin way, and
arguably the round-robin way is what users who are familiar with weighted
interleaving may easily expect and even prefer?  I have no preference here.


Thanks,
SJ

[...]
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by Bijan Tabatabai 3 months, 2 weeks ago
On Sat, Jun 21, 2025 at 1:11 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Sat, 21 Jun 2025 11:02:15 -0700 SeongJae Park <sj@kernel.org> wrote:
>
> [...]
> > I'd hence suggest to implement and use a simple weights handling mechanism
> > here.  It could be the round-robin way, like weighted interleaving, or the
> > probabilistic way, using damon_rand().
> >
> > The round-robin way may be simpler in my opinion.  For example,
> >
> > unsigned int damos_pa_nid_to_migrate(struct damos_migrate_dest *dest)
> > {
> >       static unsigned int nr_migrated = 0;
> >       unsigned int total_weight = 0;
> >       unsigned int weights_to_ignore;
> >       size_t i;
> >
> >       for (i = 0; i < dest->nr_dests; i++)
> >               total_weight += dest->weight_arr[i];
> >       weights_to_ignore = nr_migrated++ % total_weight;
>
> Actually, the probabilistic way may be not that complicated.  Maybe we could
> do the below here, replacing the weights_to_ignore computation above.
>
>         weights_to_ignore = damon_rand(0, total_weight);
>
> >       total_weight = 0;
> >       for (i = 0; i < dest->nr_dests; i++) {
> >               total_weight += dest->weight_arr[i];
> >               if (total_weight > weights_to_ignore)
> >                       return dest->node_id_arr[i];
> >       }
> >       WARN_ONCE(1, "I don't know what I did wrong");
> >       return 0;
> > }
>
> But damon_rand() might be more expensive than the round-robin way, and
> arguably the round-robin way is what users who are familiar with weighted
> interleaving may easily expect and even prefer?  I have no preference here.

I think my comments about the round-robin approach also apply here,
but all things equal, I generally prefer being deterministic over
being probabilistic when possible.

Bijan

[...]
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by SeongJae Park 3 months, 2 weeks ago
On Mon, 23 Jun 2025 09:27:16 -0500 Bijan Tabatabai <bijan311@gmail.com> wrote:

[...]
> I think my comments about the round-robin approach also apply here,
> but all things equal, I generally prefer being deterministic over
> being probabilistic when possible.

Makes sense, let's go with the round-robin approach :)


Thanks,
SJ

[...]
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by Joshua Hahn 3 months, 2 weeks ago
On Sat, 21 Jun 2025 11:11:27 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Sat, 21 Jun 2025 11:02:15 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> [...]
> > I'd hence suggest to implement and use a simple weights handling mechanism
> > here.  It could be the round-robin way, like weighted interleaving, or the
> > probabilistic way, using damon_rand().
> > 
> > The round-robin way may be simpler in my opinion.  For example,

[...snip...]
 
> Actually, the probabilistic way may be not that complicated.  Maybe we could
> do the below here, replacing the weights_to_ignore computation above.

[...snip...]

> But damon_rand() might be more expensive than the round-robin way, and
> arguably the round-robin way is what users who are familiar with weighted
> interleaving may easily expect and even prefer?  I have no preference here.

Hi SJ,

If you have no preference here, I would like to add some thoughts :-)

I think that code complexity aside, round-robin may be the better choice for
a few reasons. Like you mentioned, I think it is what users might be used to,
if they are coming from weighted interleave code. Also, I think a round-robin
way will prevent worst-case scenarios where we get a long stretch of allocations
on the "wrong" node (but maybe this isn't a big deal, since it is so unlikely).

Finally -- If we run workloads with mempolicy set to weighted interleave
*and* with the weights already set, then pages will be allocated in a
round-robin fashion. I think it may be best to try and minimize migration costs
by trying to keep these weights in-sync. That is, if we have a 2:1 ratio,
we will have the following allocation:

node0 | oo oo oo oo oo oo oo ...
node1 |   o  o  o  o  o  o   ...

Using a probabilistic migration, it might change the pattern:

node0 |   oooo oo o ooo oo ...
node1 | oo    o  o o   o   ...

That is, the ratio might be preserved, but we may be doing unnecessary
migrations, since a probabilistic allocation isn't aware of any underlying
patterns. With a round-robin allocation, we have a 1/total_weight chance that
there will be no additional migrations, depending on where the round-robin
begins. I also want to note that weighted interleave auto-tuning is written
to minimize total_weight.

I'm wondering what you think about this. Perhaps there is a way to know where
the "beginning" of round-robin should begin, so that we try to keep the
allocation & migration pattern as in-sync as possible? I have a suspicion
that I am way over-thinking this, and none of this really has a tangible
impact on performance as well ;)

Thank you as always SJ, have a great day!!
Joshua

> Thanks,
> SJ
> 
> [...]
> 

Sent using hkml (https://github.com/sjp38/hackermail)
Re: [RFC PATCH v2 2/2] mm/damon/paddr: Allow multiple migrate targets
Posted by SeongJae Park 3 months, 2 weeks ago
On Mon, 23 Jun 2025 07:08:07 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On Sat, 21 Jun 2025 11:11:27 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> > On Sat, 21 Jun 2025 11:02:15 -0700 SeongJae Park <sj@kernel.org> wrote:
> > 
> > [...]
> > > I'd hence suggest to implement and use a simple weights handling mechanism
> > > here.  It could be the round-robin way, like weighted interleaving, or the
> > > probabilistic way, using damon_rand().
> > > 
> > > The round-robin way may be simpler in my opinion.  For example,
> 
> [...snip...]
>  
> > Actually, the probabilistic way may be not that complicated.  Maybe we could
> > do the below here, replacing the weights_to_ignore computation above.
> 
> [...snip...]
> 
> > But damon_rand() might be more expensive than the round-robin way, and
> > arguably the round-robin way is what users who are familiar with weighted
> > interleaving may easily expect and even prefer?  I have no preference here.
> 
> Hi SJ,
> 
> If you have no preference here, I would like to add some thoughts : -)
> 
[...]
> I think that code complexity aside, round-robin may be the better choice for
> a few reasons. Like you mentioned, I think it is what users might be used to,
> if they are coming from weighted interleave code. Also, I think a round-robin
> way will prevent worst-case scenarios where we get a long stretch of allocations
> on the "wrong" node (but maybe this isn't a big deal, since it is so unlikely).
> 
> Finally -- If we run workloads with mempolicy set to weighted interleave
> *and* with the weights already set, then pages will be allocated in a
> round-robin fashion. I think it may be best to try and minimize migration costs
> by trying to keep these weights in-sync. That is, if we have a 2:1 ratio,
> we will have the following allocation:
> 
> node0 | oo oo oo oo oo oo oo ...
> node1 |   o  o  o  o  o  o   ...
> 
> Using a probabilistic migration, it might change the pattern:
> 
> node0 |   oooo oo o ooo oo ...
> node1 | oo    o  o o   o   ...
> 
> That is, the ratio might be preserved, but we may be doing unnecessary
> migrations, since a probabilistic allocation isn't aware of any underlying
> patterns. With a round-robin allocation, we have a 1/total_weight chance that
> there will be no additional migrations, depending on where the round-robin
> begins. I also want to note that weighted interleave auto-tuning is written
> to minimize total_weight.
> 
> I'm wondering what you think about this. Perhaps there is a way to know where
> the "beginning" of round-robin should begin, so that we try to keep the
> allocation & migration pattern as in-sync as possible? I have a suspicion
> that I am way over-thinking this, and none of this really has a tangible
> impact on performance as well ;)

The theory makes sense to me.  I'm also not very sure how much visible
difference it will make on large scale real workloads, though.  Since at least
the theory makes sense and we see no risk, I think taking the round-robin
approach would be the saner action, unless we get other opinions or test
results.

> 
> Thank you as always SJ, have a great day!!

Thank you, you too!


Thanks,
SJ

[...]