v2: change memtier mutex to semaphore
add source-node relative weighting
add remaining mempolicy integration code
= v2 Notes
Developed in collaboration with original authors to deconflict
similar efforts to extend mempolicy to take weights directly.
== Mutex to Semaphore change:
The memory tiering subsystem is extended in this patch set to have
externally available information (weights), and therefore additional
controls need to be added to ensure values are not changed (or tiers
changed/added/removed) during various calculations.
Since it is expected that many threads will be accessing this data
during allocations, a mutex is not appropriate.
Since write-updates (weight changes, hotplug events) are rare events,
a simple rw semaphore is sufficient.
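For illustration only, the intended locking pattern looks roughly like the
sketch below. The lock name and the helper signature are assumptions, not the
patch's actual code, and (as noted later in this thread) the rwsem approach
was subsequently dropped in favor of the existing RCU scheme in memory-tiers.c:

    #include <linux/rwsem.h>
    #include <linux/memory-tiers.h>

    /*
     * Sketch only: readers (allocation paths) take the lock shared;
     * writers (sysfs weight updates, hotplug) take it exclusive.
     */
    static DECLARE_RWSEM(memtier_rwsem);

    int memtier_get_node_weight(int from_node, int target_node)
    {
            int weight = 1;         /* default 1:1 interleave */

            down_read(&memtier_rwsem);
            /* elided: look up target_node's tier and its weight from from_node */
            up_read(&memtier_rwsem);
            return weight;
    }

    static void memtier_set_weight(struct memory_tier *memtier,
                                   int from_node, int new_weight)
    {
            down_write(&memtier_rwsem);
            /* elided: update memtier's weight entry for from_node */
            up_write(&memtier_rwsem);
    }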
== Source-node relative weighting:
Tiers can now be weighted differently based on the node requesting
the weight. For example, CPU nodes 0 and 1 may have different weights
for the same CXL memory tier because, topologically, the number of
NUMA hops is greater (or because of any other physical topology
difference resulting in different effective latency or bandwidth values).
1. Set weights for DDR (tier4) and CXL (tier22) tiers.
echo source_node:weight > /path/to/interleave_weight
# Set tier4 weight from node 0 to 85
echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
# Set tier4 weight from node 1 to 65
echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
# Set tier22 weight from node 0 to 15
echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
# Set tier22 weight from node 1 to 10
echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
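Under the hood, the sysfs store handler has to split that "source_node:weight"
pair. A minimal sketch of such a handler is shown below; the function name and
the omitted update step are illustrative, not the patch's actual code:

    #include <linux/device.h>
    #include <linux/nodemask.h>

    static ssize_t interleave_weight_store(struct device *dev,
                                           struct device_attribute *attr,
                                           const char *buf, size_t count)
    {
            unsigned int src_node, weight;

            /* expected input: "<source_node>:<weight>", e.g. "0:85" */
            if (sscanf(buf, "%u:%u", &src_node, &weight) != 2)
                    return -EINVAL;
            if (src_node >= MAX_NUMNODES)
                    return -EINVAL;

            /* elided: record 'weight' for this tier as seen from src_node */

            return count;
    }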
== Mempolicy integration
Two new functions have been added to memory-tiers.c:
* memtier_get_node_weight
  - Get the effective weight for a given node.
* memtier_get_total_weight
  - Get the "total effective weight" for a given nodemask.
These functions are used by the following functions in mempolicy:
* interleave_nodes
* offset_il_node
* alloc_pages_bulk_array_interleave
The weight values are used to determine how many pages should be
allocated per-node as interleave rounds occur.
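As a rough sketch of how the single-page path consumes these weights (the
il_count field and the exact memtier_get_node_weight() signature are
assumptions for illustration; the real patch keeps its own counter in the
task's mempolicy state):

    #include <linux/mempolicy.h>
    #include <linux/memory-tiers.h>
    #include <linux/sched.h>

    /*
     * Weighted round-robin: hand out 'weight' consecutive pages from the
     * current node before advancing to the next node in the policy mask.
     */
    static unsigned int weighted_interleave_nodes(struct mempolicy *pol)
    {
            unsigned int nid = current->il_prev;
            unsigned int weight;

            if (!current->il_count) {       /* il_count is hypothetical */
                    nid = next_node_in(nid, pol->nodes);
                    current->il_prev = nid;
                    weight = memtier_get_node_weight(numa_node_id(), nid);
                    current->il_count = weight ? weight : 1; /* default 1 */
            }
            current->il_count--;
            return nid;
    }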
To avoid holding the memtier semaphore for long periods of time
(e.g. during the calls that actually allocate pages), there is
a small race window during bulk allocation between calculating
the total weight of a nodemask and fetching each individual
node weight. This is managed by simply detecting the over/under
allocation conditions and handling them accordingly.
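In rough terms, the bulk path splits the request by weight and then absorbs
any error left by that race. The sketch below is a C-style illustration under
assumed helper signatures, not the actual patch code:

    #include <linux/gfp.h>
    #include <linux/mempolicy.h>
    #include <linux/memory-tiers.h>

    static unsigned long alloc_pages_bulk_weighted(gfp_t gfp,
                                                   struct mempolicy *pol,
                                                   unsigned long nr_pages,
                                                   struct page **page_array)
    {
            unsigned long total_weight, allocated = 0;
            int nid;

            /* snapshot; individual weights may change after this point */
            total_weight = memtier_get_total_weight(&pol->nodes);

            for_each_node_mask(nid, pol->nodes) {
                    /* this node's proportional share of the request */
                    unsigned long share = nr_pages *
                            memtier_get_node_weight(numa_node_id(), nid) /
                            total_weight;

                    /* over-allocation guard: weights may have grown since
                     * the total_weight snapshot above */
                    share = min(share, nr_pages - allocated);

                    allocated += __alloc_pages_bulk(gfp, nid, NULL, share,
                                                    NULL,
                                                    page_array + allocated);
            }

            /* under-allocation (rounding or racing weight changes): top up */
            if (allocated < nr_pages)
                    allocated += __alloc_pages_bulk(gfp,
                                            first_node(pol->nodes), NULL,
                                            nr_pages - allocated, NULL,
                                            page_array + allocated);

            return allocated;
    }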
~Gregory
=== original RFC ===
From: Ravi Shankar <ravis.opensrc@micron.com>
Hello,
The current interleave policy operates by interleaving page requests
among nodes defined in the memory policy. To accommodate the
introduction of memory tiers for various memory types (e.g., DDR, CXL,
HBM, PMEM, etc.), a mechanism is needed for interleaving page requests
across these memory types or tiers.
This can be achieved by implementing an interleaving method that
considers the tier weights.
The tier weight will determine the proportion of nodes to select from
those specified in the memory policy.
A tier weight can be assigned to each memory type within the system.
Hasan Al Maruf had put forth a proposal for interleaving between two
tiers, namely the top tier and the low tier. However, this patch was
not adopted due to constraints on the number of available tiers.
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
New proposed changes:
1. Introduce a sysfs entry to allow setting the interleave weight for each
memory tier.
2. Each tier has a default weight of 1, indicating a standard 1:1
proportion.
3. Distribute the weight of each tier uniformly across all nodes in that tier.
4. Modifications to the existing interleaving algorithm to support the
implementation of multi-tier interleaving based on tier-weights.
This is in line with Huang, Ying's presentation at LPC22, slide 16 of
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
We observed a significant increase (165%) in bandwidth utilization
with the newly proposed multi-tier interleaving compared to the
traditional 1:1 interleaving approach between DDR and CXL tier nodes,
where 85% of the bandwidth is allocated to the DDR tier and 15% to the
CXL tier, using the MLC -w2 option.
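To make the proportions concrete (a worked example, not data from the
original posting): with tier weights of 85 and 15, each interleave cycle
of 100 pages is split as

    DDR tier (weight 85):  100 * 85 / (85 + 15) = 85 pages
    CXL tier (weight 15):  100 * 15 / (85 + 15) = 15 pages

i.e. the 85:15 page split mirrors the intended 85%/15% bandwidth split.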
Usage Example:
1. Set weights for DDR (tier4) and CXL (tier22) tiers.
echo 85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
echo 15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
2. Interleave between DDR (tier4, node-0) and CXL (tier22, node-1) using numactl
numactl -i0,1 mlc --loaded_latency W2
Gregory Price (3):
mm/memory-tiers: change mutex to rw semaphore
mm/memory-tiers: Introduce sysfs for tier interleave weights
mm/mempolicy: modify interleave mempolicy to use memtier weights
include/linux/memory-tiers.h | 16 ++++
include/linux/mempolicy.h | 3 +
mm/memory-tiers.c | 179 +++++++++++++++++++++++++++++++----
mm/mempolicy.c | 148 +++++++++++++++++++++++------
4 files changed, 297 insertions(+), 49 deletions(-)
--
2.39.1
Gregory Price <gourry.memverge@gmail.com> writes:

> v2: change memtier mutex to semaphore
>     add source-node relative weighting
>     add remaining mempolicy integration code
>
> = v2 Notes
>
> Developed in collaboration with original authors to deconflict
> similar efforts to extend mempolicy to take weights directly.
>
> == Mutex to Semaphore change:
>
> The memory tiering subsystem is extended in this patch set to have
> externally available information (weights), and therefore additional
> controls need to be added to ensure values are not changed (or tiers
> changed/added/removed) during various calculations.
>
> Since it is expected that many threads will be accessing this data
> during allocations, a mutex is not appropriate.

IIUC, this is a change for performance.  If so, please show some
performance data.

> Since write-updates (weight changes, hotplug events) are rare events,
> a simple rw semaphore is sufficient.
>
> == Source-node relative weighting:
>
> Tiers can now be weighted differently based on the node requesting
> the weight. For example, CPU nodes 0 and 1 may have different weights
> for the same CXL memory tier because, topologically, the number of
> NUMA hops is greater (or because of any other physical topology
> difference resulting in different effective latency or bandwidth values).
>
> 1. Set weights for DDR (tier4) and CXL (tier22) tiers.
> echo source_node:weight > /path/to/interleave_weight

If source_node is considered, why not consider target_node too?  On a
system with only 1 tier (DRAM), do you want weighted interleaving among
NUMA nodes?  If so, why tie weighted interleaving with memory tiers?
Why not just introduce weighted interleaving for NUMA nodes?

> # Set tier4 weight from node 0 to 85
> echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> # Set tier4 weight from node 1 to 65
> echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> # Set tier22 weight from node 0 to 15
> echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
> # Set tier22 weight from node 1 to 10
> echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

--
Best Regards,
Huang, Ying
On Mon, Oct 16, 2023 at 03:57:52PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
>
> > == Mutex to Semaphore change:
> >
> > Since it is expected that many threads will be accessing this data
> > during allocations, a mutex is not appropriate.
>
> IIUC, this is a change for performance.  If so, please show some
> performance data.
>

This change will be dropped in v3 in favor of the existing
RCU mechanism in memory-tiers.c, as pointed out by Matthew.

> > == Source-node relative weighting:
> >
> > 1. Set weights for DDR (tier4) and CXL (tier22) tiers.
> > echo source_node:weight > /path/to/interleave_weight
>
> If source_node is considered, why not consider target_node too?  On a
> system with only 1 tier (DRAM), do you want weighted interleaving among
> NUMA nodes?  If so, why tie weighted interleaving with memory tiers?
> Why not just introduce weighted interleaving for NUMA nodes?
>

The short answer: practicality and ease-of-use.

The long answer: we have been discussing how to make this more flexible.

Personally, I agree with you.  If Task A is on Socket 0, the weight on
Socket 0 DRAM should not be the same as the weight on Socket 1 DRAM.
However, right now, DRAM nodes are lumped into the same tier together,
resulting in them having the same weight.

If you scroll back through the list, you'll find an RFC I posted for
set_mempolicy2 which implements weighted interleave in mm/mempolicy.
However, mm/mempolicy is extremely `current-centric` at the moment,
so that makes changing weights at runtime (in response to a hotplug
event, for example) very difficult.

I still think there is room to extend set_mempolicy to allow
task-defined weights to take preference over tier-defined weights.

We have discussed adding the following features to memory-tiers:

1) breaking up tiers to allow 1 tier per node, as opposed to defaulting
   to lumping all nodes of a similar quality into the same tier

2) enabling movement of nodes between tiers (for the purpose of
   reconfiguring due to hotplug and other situations)

For users that require fine-grained control over each individual node,
this would allow for weights to be applied per-node, because a
node=tier.  For the majority of use cases, it would allow clumping of
nodes into tiers based on physical topology and performance class, and
then allow the general weighting to apply.  This seems like the most
obvious use case that a majority of users would use, and also the
easiest to set up in the short term.

That said, there are probably 3 or 4 different ways/places to implement
this feature.  The question is: what is the clear and obvious way?
I don't have a definitive answer for that, hence the RFC.

There are at least 5 proposals that I know of at the moment:

1) mempolicy
2) memory-tiers
3) memory-block interleaving? (weighting among blocks inside a node)
   Maybe relevant if Dynamic Capacity devices arrive, but it seems
   like the wrong place to do this.
4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
5) "just do it in hardware"

> > # Set tier4 weight from node 0 to 85
> > echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> > # Set tier4 weight from node 1 to 65
> > echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> > # Set tier22 weight from node 0 to 15
> > echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
> > # Set tier22 weight from node 1 to 10
> > echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
>
> --
> Best Regards,
> Huang, Ying
Gregory Price <gregory.price@memverge.com> writes:

> On Mon, Oct 16, 2023 at 03:57:52PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>>
>> > == Mutex to Semaphore change:
>> >
>> > Since it is expected that many threads will be accessing this data
>> > during allocations, a mutex is not appropriate.
>>
>> IIUC, this is a change for performance.  If so, please show some
>> performance data.
>>
>
> This change will be dropped in v3 in favor of the existing
> RCU mechanism in memory-tiers.c, as pointed out by Matthew.
>
>> > == Source-node relative weighting:
>> >
>> > 1. Set weights for DDR (tier4) and CXL (tier22) tiers.
>> > echo source_node:weight > /path/to/interleave_weight
>>
>> If source_node is considered, why not consider target_node too?  On a
>> system with only 1 tier (DRAM), do you want weighted interleaving among
>> NUMA nodes?  If so, why tie weighted interleaving with memory tiers?
>> Why not just introduce weighted interleaving for NUMA nodes?
>>
>
> The short answer: practicality and ease-of-use.
>
> The long answer: we have been discussing how to make this more flexible.
>
> Personally, I agree with you.  If Task A is on Socket 0, the weight on
> Socket 0 DRAM should not be the same as the weight on Socket 1 DRAM.
> However, right now, DRAM nodes are lumped into the same tier together,
> resulting in them having the same weight.
>
> If you scroll back through the list, you'll find an RFC I posted for
> set_mempolicy2 which implements weighted interleave in mm/mempolicy.
> However, mm/mempolicy is extremely `current-centric` at the moment,
> so that makes changing weights at runtime (in response to a hotplug
> event, for example) very difficult.
>
> I still think there is room to extend set_mempolicy to allow
> task-defined weights to take preference over tier-defined weights.
>
> We have discussed adding the following features to memory-tiers:
>
> 1) breaking up tiers to allow 1 tier per node, as opposed to defaulting
>    to lumping all nodes of a similar quality into the same tier
>
> 2) enabling movement of nodes between tiers (for the purpose of
>    reconfiguring due to hotplug and other situations)
>
> For users that require fine-grained control over each individual node,
> this would allow for weights to be applied per-node, because a
> node=tier.  For the majority of use cases, it would allow clumping of
> nodes into tiers based on physical topology and performance class, and
> then allow the general weighting to apply.  This seems like the most
> obvious use case that a majority of users would use, and also the
> easiest to set up in the short term.
>
> That said, there are probably 3 or 4 different ways/places to implement
> this feature.  The question is: what is the clear and obvious way?
> I don't have a definitive answer for that, hence the RFC.
>
> There are at least 5 proposals that I know of at the moment:
>
> 1) mempolicy
> 2) memory-tiers
> 3) memory-block interleaving? (weighting among blocks inside a node)
>    Maybe relevant if Dynamic Capacity devices arrive, but it seems
>    like the wrong place to do this.
> 4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
> 5) "just do it in hardware"

It may be easier to start with the use case.  What are the practical use
cases in your mind that cannot be satisfied with a simple per-memory-tier
weight?  Can you compare the memory layouts under the different proposals?

>> > # Set tier4 weight from node 0 to 85
>> > echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
>> > # Set tier4 weight from node 1 to 65
>> > echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
>> > # Set tier22 weight from node 0 to 15
>> > echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
>> > # Set tier22 weight from node 1 to 10
>> > echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

--
Best Regards,
Huang, Ying
On Wed, Oct 18, 2023 at 04:29:02PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
>
> > There are at least 5 proposals that I know of at the moment:
> >
> > 1) mempolicy
> > 2) memory-tiers
> > 3) memory-block interleaving? (weighting among blocks inside a node)
> >    Maybe relevant if Dynamic Capacity devices arrive, but it seems
> >    like the wrong place to do this.
> > 4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
> > 5) "just do it in hardware"
>
> It may be easier to start with the use case.  What are the practical use
> cases in your mind that cannot be satisfied with a simple per-memory-tier
> weight?  Can you compare the memory layouts under the different proposals?
>

Before I delve in, one clarifying question:

When you asked whether weights should be part of node or memory-tiers,
I took that to mean whether it should be part of mempolicy or
memory-tiers.  Were you suggesting that weights should actually be part
of drivers/base/node.c?

Because I had not considered that, and this seems reasonable, easy to
implement, and would not require tying mempolicy.c to memory-tiers.c.

Beyond this, I think there have been 3 imagined use cases (now, including
this one):

a) numactl --weighted-interleave=Node:weight,0:16,1:4,...

b) echo weight > /sys/.../memory-tiers/memtier/access0/interleave_weight
   numactl --interleave=0,1

c) echo weight > /sys/bus/node/node0/access0/interleave_weight
   numactl --interleave=0,1

d) options b or c, but with --weighted-interleave=0,1 instead
   this requires libnuma changes to pick up, but it retains
   --interleave as-is to avoid user confusion.

The downside of an approach like A (which was my original approach) was
that the weights cannot really change should a node be hotplugged.
Tasks would need to detect this and change the policy themselves.
That's not a good solution.

However, in both B's and C's designs, weights can be rebalanced in
response to any number of events.

Ultimately B and C are equivalent, but the placement in nodes is cleaner
and more intuitive.  If memory-tiers wants to use/change this
information, there's nothing that prevents it.

Assuming this is your meaning, I agree and I will pivot to this.

~Gregory
On Mon, Oct 09, 2023 at 04:42:56PM -0400, Gregory Price wrote:
> == Mutex to Semaphore change:
>
> The memory tiering subsystem is extended in this patch set to have
> externally available information (weights), and therefore additional
> controls need to be added to ensure values are not changed (or tiers
> changed/added/removed) during various calculations.
>
> Since it is expected that many threads will be accessing this data
> during allocations, a mutex is not appropriate.
>
> Since write-updates (weight changes, hotplug events) are rare events,
> a simple rw semaphore is sufficient.

Given how you're using it, wouldn't the existing RCU mechanism be
better than converting this to an rwsem?
On Wed, Oct 11, 2023 at 10:15:02PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 09, 2023 at 04:42:56PM -0400, Gregory Price wrote:
> > == Mutex to Semaphore change:
> >
> > The memory tiering subsystem is extended in this patch set to have
> > externally available information (weights), and therefore additional
> > controls need to be added to ensure values are not changed (or tiers
> > changed/added/removed) during various calculations.
> >
> > Since it is expected that many threads will be accessing this data
> > during allocations, a mutex is not appropriate.
> >
> > Since write-updates (weight changes, hotplug events) are rare events,
> > a simple rw semaphore is sufficient.
>
> Given how you're using it, wouldn't the existing RCU mechanism be
> better than converting this to an rwsem?
>

... yes, and a smarter person would have just done that first :P

derp derp, thanks, I'll update.

~Gregory