[RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Gregory Price 2 years, 2 months ago
This patchset implements weighted interleave and adds a new sysfs
entry: /sys/devices/system/node/nodeN/accessM/il_weight.

The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked.  By default
il_weight for a node is always 1, which preserves the default round
robin interleave behavior.

Interleave weights may be set from 0-100, and denote the number of
pages that should be allocated from the node when interleaving
occurs.

For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.
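
As a purely illustrative sketch (userspace C, not the kernel
implementation), the weight-as-page-count semantic can be modeled as a
counter refilled from the next node's weight whenever it runs out; the
names weights[] and next_node() are hypothetical and only for this
example:

#include <stdio.h>

/* Weighted round-robin: return the node for the next allocation. */
static int next_node(const int *weights, int nr_nodes, int *cur, int *left)
{
	if (*left == 0) {			/* current node's quota used up */
		*cur = (*cur + 1) % nr_nodes;
		*left = weights[*cur];		/* refill from next node's weight */
	}
	(*left)--;
	return *cur;
}

int main(void)
{
	int weights[] = { 5, 1 };		/* node0 weight 5, node1 weight 1 */
	int cur = 0, left = weights[0];
	int i;

	for (i = 0; i < 12; i++)		/* prints 0 0 0 0 0 1, repeated */
		printf("%d ", next_node(weights, 2, &cur, &left));
	printf("\n");
	return 0;
}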

Additionally, "node accessors" (synonmous with cpu nodes) are used
to allow for accessor-relative weighting.  The "accessor" for a task
is defined as the node the task is presently running on.

# Set node weight for node0 accessed by tasks on node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

# Set node weight for node0 accessed by tasks on node1 to 3
echo 3 > /sys/devices/system/node/node0/access1/il_weight

In this way it becomes possible to set an interleaving strategy
that fits the available bandwidth of the devices in the system.
An example system:

Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex

In this setup, the effective weights for nodes 0-3 for a task
running on Node 0 may be [60, 20, 10, 10].

This spreads memory out across devices, which all have different
latency and bandwidth attributes, in a way that can maximize the
available resources.
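
For reference, a back-of-the-envelope way to derive such weights is to
normalize the per-node bandwidth visible from the accessor; the
[60, 20, 10, 10] above is a hand-rounded approximation of that.  A small
sketch, using only the example bandwidth figures above (real deployments
would measure these or read them from HMAT):

#include <stdio.h>

int main(void)
{
	/* as seen from node0: local DRAM, cross-socket DRAM, CXL, CXL (GB/s) */
	double bw[] = { 400.0, 200.0, 64.0, 64.0 };
	double total = 0;
	int i;

	for (i = 0; i < 4; i++)
		total += bw[i];
	for (i = 0; i < 4; i++)		/* roughly 55 / 27 / 9 / 9 */
		printf("node%d weight ~ %.0f\n", i, 100.0 * bw[i] / total);
	return 0;
}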

~Gregory

(sorry for the repeat send, automation failure)

================================================================

Version Notes:

v3: move weights into node rather than memtiers
    some additional fixes to node.c to support this

v1/v2: add weighted-interleave support to mempolicy

= v3 notes

This update effectively removes the connection between mempolicy
and memory-tiers by simply placing the interleave weights directly
in the node accessor information structure.

Placing the weights in the node was recommended by Huang, Ying.
The accessor-relative approach was recommended by Ravi Shankar.


== Move weights into node

Originally this work placed the weights in the memory tier.
In this patch set the weights are moved into the NUMA node
accessor structure, which allows for a more natural weighting scheme
and also supports source-node-relative weighting.

Interleave weight is located in:
/sys/devices/system/node/nodeN/accessM/il_weight

and is set with a value between 1 and 100:

# Set node weight for node0 accessed by node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

By default, il_weight is always set to 1, which mimics the default
interleave behavior (simple round-robin).


== Other Node fixes

2 other updates to node.c were required to support this:

1) The access list must be initialized prior to the node struct
   pointer being registered in the node array.

2) The accessors in the list must be registered regardless of
   whether HMAT/HMEM information is reported. Presently this
   results in 0-value information being present in the various
   access subgroups.


== Weighted interleave

mm/mempolicy: modify interleave mempolicy to use node weights

The node subsystem implements interleave weighting for the purpose
of bandwidth optimization.  Each node may have different weights in
relation to each compute node ("access node").

The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave.  Since all nodes default to a weight of 1, the
original interleave behavior is retained by default.

Examples

Weight settings:
  echo 4 > node0/access0/il_weight
  echo 3 > node1/access0/il_weight
  echo 2 > node1/access1/il_weight
  echo 1 > node0/access1/il_weight

Results:

Task A:
  cpunode:  0
  nodemask: [0,1]
  weights:  [4,3]
  allocation result: [0,0,0,0,1,1,1 repeat]

Task B:
  cpunode:  1
  nodemask: [0,1]
  weights:  [1,2]
  allocation result: [0,1,1 repeat]
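
A hedged userspace model of this lookup (not the kernel code;
il_weight[][] here is just a hypothetical array mirroring the sysfs
values above) reproduces both sequences:

#include <stdio.h>

/* il_weight[mem_node][accessor_node], from the echo commands above */
static const int il_weight[2][2] = {
	{ 4, 1 },	/* node0: access0 = 4, access1 = 1 */
	{ 3, 2 },	/* node1: access0 = 3, access1 = 2 */
};

static void show(int cpunode, int pages)
{
	int cur = 0, left = il_weight[0][cpunode];
	int i;

	printf("task on node%d:", cpunode);
	for (i = 0; i < pages; i++) {
		if (!left) {			/* advance to next node in mask */
			cur = (cur + 1) % 2;
			left = il_weight[cur][cpunode];
		}
		left--;
		printf(" %d", cur);
	}
	printf("\n");
}

int main(void)
{
	show(0, 14);	/* 0 0 0 0 1 1 1, repeated */
	show(1, 6);	/* 0 1 1, repeated */
	return 0;
}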

=== original RFCs ====

Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@micron.com/

Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/

N:M weighting in mempolicy
By: Hasan Al Maruf
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

Ying Huang's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

Gregory Price (4):
  base/node.c: initialize the accessor list before registering
  node: add accessors to sysfs when nodes are created
  node: add interleave weights to node accessor
  mm/mempolicy: modify interleave mempolicy to use node weights

 drivers/base/node.c       | 120 ++++++++++++++++++++++++++++++++-
 include/linux/mempolicy.h |   4 ++
 include/linux/node.h      |  17 +++++
 mm/mempolicy.c            | 138 +++++++++++++++++++++++++++++---------
 4 files changed, 246 insertions(+), 33 deletions(-)

-- 
2.39.1
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Michal Hocko 2 years, 2 months ago
On Mon 30-10-23 20:38:06, Gregory Price wrote:
> This patchset implements weighted interleave and adds a new sysfs
> entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> 
> The il_weight of a node is used by mempolicy to implement weighted
> interleave when `numactl --interleave=...` is invoked.  By default
> il_weight for a node is always 1, which preserves the default round
> robin interleave behavior.
> 
> Interleave weights may be set from 0-100, and denote the number of
> pages that should be allocated from the node when interleaving
> occurs.
> 
> For example, if a node's interleave weight is set to 5, 5 pages
> will be allocated from that node before the next node is scheduled
> for allocations.

I find this semantic rather weird TBH. First of all, why do you think it
makes sense to have those weights global for all users? What if
different applications have different views on how to spread their
interleaved memory?

I do get that you might have different tiers with largely different
runtime characteristics, but why would you want to interleave them into
a single mapping and have hard-to-predict runtime behavior?

[...]
> In this way it becomes possible to set an interleaving strategy
> that fits the available bandwidth for the devices available on
> the system. An example system:
> 
> Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> 
> In this setup, the effective weights for nodes 0-3 for a task
> running on Node 0 may be [60, 20, 10, 10].
> 
> This spreads memory out across devices which all have different
> latency and bandwidth attributes at a way that can maximize the
> available resources.

OK, so why is this any better than not using any memory policy and
relying on demotion to push cold memory down the tier hierarchy?

What is the actual real-life usecase, and what kind of benefits can you
present?
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Johannes Weiner 2 years, 2 months ago
On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > This patchset implements weighted interleave and adds a new sysfs
> > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > 
> > The il_weight of a node is used by mempolicy to implement weighted
> > interleave when `numactl --interleave=...` is invoked.  By default
> > il_weight for a node is always 1, which preserves the default round
> > robin interleave behavior.
> > 
> > Interleave weights may be set from 0-100, and denote the number of
> > pages that should be allocated from the node when interleaving
> > occurs.
> > 
> > For example, if a node's interleave weight is set to 5, 5 pages
> > will be allocated from that node before the next node is scheduled
> > for allocations.
> 
> I find this semantic rather weird TBH. First of all why do you think it
> makes sense to have those weights global for all users? What if
> different applications have different view on how to spred their
> interleaved memory?
> 
> I do get that you might have a different tiers with largerly different
> runtime characteristics but why would you want to interleave them into a
> single mapping and have hard to predict runtime behavior?
> 
> [...]
> > In this way it becomes possible to set an interleaving strategy
> > that fits the available bandwidth for the devices available on
> > the system. An example system:
> > 
> > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> > 
> > In this setup, the effective weights for nodes 0-3 for a task
> > running on Node 0 may be [60, 20, 10, 10].
> > 
> > This spreads memory out across devices which all have different
> > latency and bandwidth attributes at a way that can maximize the
> > available resources.
> 
> OK, so why is this any better than not using any memory policy rely
> on demotion to push out cold memory down the tier hierarchy?
> 
> What is the actual real life usecase and what kind of benefits you can
> present?

There are two things CXL gives you: additional capacity and additional
bus bandwidth.

The promotion/demotion mechanism is good for the capacity usecase,
where you have a nice hot/cold gradient in the workingset and want
placement accordingly across faster and slower memory.

The interleaving is useful when you have a flatter workingset
distribution and poorer access locality. In that case, the CPU caches
are less effective and the workload can be bus-bound. The workload
might fit entirely into DRAM, but concentrating it there is
suboptimal. Fanning it out in proportion to the relative performance
of each memory tier gives better results.

We experimented with datacenter workloads on such machines last year
and found significant performance benefits:

https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

This hopefully also explains why it's a global setting. The usecase is
different from conventional NUMA interleaving, which is used as a
locality measure: spread shared data evenly between compute
nodes. This one isn't about locality - the CXL tier doesn't have local
compute. Instead, the optimal spread is based on hardware parameters,
which is a global property rather than a per-workload one.
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Michal Hocko 2 years, 2 months ago
On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > > This patchset implements weighted interleave and adds a new sysfs
> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > > 
> > > The il_weight of a node is used by mempolicy to implement weighted
> > > interleave when `numactl --interleave=...` is invoked.  By default
> > > il_weight for a node is always 1, which preserves the default round
> > > robin interleave behavior.
> > > 
> > > Interleave weights may be set from 0-100, and denote the number of
> > > pages that should be allocated from the node when interleaving
> > > occurs.
> > > 
> > > For example, if a node's interleave weight is set to 5, 5 pages
> > > will be allocated from that node before the next node is scheduled
> > > for allocations.
> > 
> > I find this semantic rather weird TBH. First of all why do you think it
> > makes sense to have those weights global for all users? What if
> > different applications have different view on how to spred their
> > interleaved memory?
> > 
> > I do get that you might have a different tiers with largerly different
> > runtime characteristics but why would you want to interleave them into a
> > single mapping and have hard to predict runtime behavior?
> > 
> > [...]
> > > In this way it becomes possible to set an interleaving strategy
> > > that fits the available bandwidth for the devices available on
> > > the system. An example system:
> > > 
> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> > > 
> > > In this setup, the effective weights for nodes 0-3 for a task
> > > running on Node 0 may be [60, 20, 10, 10].
> > > 
> > > This spreads memory out across devices which all have different
> > > latency and bandwidth attributes at a way that can maximize the
> > > available resources.
> > 
> > OK, so why is this any better than not using any memory policy rely
> > on demotion to push out cold memory down the tier hierarchy?
> > 
> > What is the actual real life usecase and what kind of benefits you can
> > present?
> 
> There are two things CXL gives you: additional capacity and additional
> bus bandwidth.
> 
> The promotion/demotion mechanism is good for the capacity usecase,
> where you have a nice hot/cold gradient in the workingset and want
> placement accordingly across faster and slower memory.
> 
> The interleaving is useful when you have a flatter workingset
> distribution and poorer access locality. In that case, the CPU caches
> are less effective and the workload can be bus-bound. The workload
> might fit entirely into DRAM, but concentrating it there is
> suboptimal. Fanning it out in proportion to the relative performance
> of each memory tier gives better resuls.
> 
> We experimented with datacenter workloads on such machines last year
> and found significant performance benefits:
> 
> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

Thanks, this is a useful insight.
 
> This hopefully also explains why it's a global setting. The usecase is
> different from conventional NUMA interleaving, which is used as a
> locality measure: spread shared data evenly between compute
> nodes. This one isn't about locality - the CXL tier doesn't have local
> compute. Instead, the optimal spread is based on hardware parameters,
> which is a global property rather than a per-workload one.

Well, I am not convinced about that TBH. Sure, it is probably a good fit
for this specific CXL usecase, but it just doesn't fit many others I
can think of - e.g. proportional use of those tiers based on the
workload - you get what you pay for.

Is there any specific reason for not having a new interleave interface
which defines weights for the nodemask? Is this because the policy
itself is very dynamic or is this more driven by simplicity of use?

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Huang, Ying 2 years, 2 months ago
Michal Hocko <mhocko@suse.com> writes:

> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>> > > This patchset implements weighted interleave and adds a new sysfs
>> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
>> > > 
>> > > The il_weight of a node is used by mempolicy to implement weighted
>> > > interleave when `numactl --interleave=...` is invoked.  By default
>> > > il_weight for a node is always 1, which preserves the default round
>> > > robin interleave behavior.
>> > > 
>> > > Interleave weights may be set from 0-100, and denote the number of
>> > > pages that should be allocated from the node when interleaving
>> > > occurs.
>> > > 
>> > > For example, if a node's interleave weight is set to 5, 5 pages
>> > > will be allocated from that node before the next node is scheduled
>> > > for allocations.
>> > 
>> > I find this semantic rather weird TBH. First of all why do you think it
>> > makes sense to have those weights global for all users? What if
>> > different applications have different view on how to spred their
>> > interleaved memory?
>> > 
>> > I do get that you might have a different tiers with largerly different
>> > runtime characteristics but why would you want to interleave them into a
>> > single mapping and have hard to predict runtime behavior?
>> > 
>> > [...]
>> > > In this way it becomes possible to set an interleaving strategy
>> > > that fits the available bandwidth for the devices available on
>> > > the system. An example system:
>> > > 
>> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
>> > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
>> > > 
>> > > In this setup, the effective weights for nodes 0-3 for a task
>> > > running on Node 0 may be [60, 20, 10, 10].
>> > > 
>> > > This spreads memory out across devices which all have different
>> > > latency and bandwidth attributes at a way that can maximize the
>> > > available resources.
>> > 
>> > OK, so why is this any better than not using any memory policy rely
>> > on demotion to push out cold memory down the tier hierarchy?
>> > 
>> > What is the actual real life usecase and what kind of benefits you can
>> > present?
>> 
>> There are two things CXL gives you: additional capacity and additional
>> bus bandwidth.
>> 
>> The promotion/demotion mechanism is good for the capacity usecase,
>> where you have a nice hot/cold gradient in the workingset and want
>> placement accordingly across faster and slower memory.
>> 
>> The interleaving is useful when you have a flatter workingset
>> distribution and poorer access locality. In that case, the CPU caches
>> are less effective and the workload can be bus-bound. The workload
>> might fit entirely into DRAM, but concentrating it there is
>> suboptimal. Fanning it out in proportion to the relative performance
>> of each memory tier gives better resuls.
>> 
>> We experimented with datacenter workloads on such machines last year
>> and found significant performance benefits:
>> 
>> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
>
> Thanks, this is a useful insight.
>  
>> This hopefully also explains why it's a global setting. The usecase is
>> different from conventional NUMA interleaving, which is used as a
>> locality measure: spread shared data evenly between compute
>> nodes. This one isn't about locality - the CXL tier doesn't have local
>> compute. Instead, the optimal spread is based on hardware parameters,
>> which is a global property rather than a per-workload one.
>
> Well, I am not convinced about that TBH. Sure it is probably a good fit
> for this specific CXL usecase but it just doesn't fit into many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.

For "pay", per my understanding, we need some cgroup based
per-memory-tier (or per-node) usage limit.  The following patchset is
the first step for that.

https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/

--
Best Regards,
Huang, Ying
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Michal Hocko 2 years, 2 months ago
On Wed 01-11-23 10:21:47, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> > for this specific CXL usecase but it just doesn't fit into many others I
> > can think of - e.g. proportional use of those tiers based on the
> > workload - you get what you pay for.
> 
> For "pay", per my understanding, we need some cgroup based
> per-memory-tier (or per-node) usage limit.  The following patchset is
> the first step for that.
> 
> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/

Why do we need a sysfs interface if there are plans for cgroup API?

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Huang, Ying 2 years, 1 month ago
Michal Hocko <mhocko@suse.com> writes:

> On Wed 01-11-23 10:21:47, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
> [...]
>> > Well, I am not convinced about that TBH. Sure it is probably a good fit
>> > for this specific CXL usecase but it just doesn't fit into many others I
>> > can think of - e.g. proportional use of those tiers based on the
>> > workload - you get what you pay for.
>> 
>> For "pay", per my understanding, we need some cgroup based
>> per-memory-tier (or per-node) usage limit.  The following patchset is
>> the first step for that.
>> 
>> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/
>
> Why do we need a sysfs interface if there are plans for cgroup API?

They are for different targets.  The cgroup API proposed here is to
constrain the DRAM usage in a system with DRAM and CXL memory.  The less
you pay, the less DRAM and the more CXL memory you use.

--
Best Regards,
Huang, Ying
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Michal Hocko 2 years, 1 month ago
On Thu 02-11-23 14:11:09, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Wed 01-11-23 10:21:47, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> >> > for this specific CXL usecase but it just doesn't fit into many others I
> >> > can think of - e.g. proportional use of those tiers based on the
> >> > workload - you get what you pay for.
> >> 
> >> For "pay", per my understanding, we need some cgroup based
> >> per-memory-tier (or per-node) usage limit.  The following patchset is
> >> the first step for that.
> >> 
> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/
> >
> > Why do we need a sysfs interface if there are plans for cgroup API?
> 
> They are for different target.  The cgroup API proposed here is to
> constrain the DRAM usage in a system with DRAM and CXL memory.  The less
> you pay, the less DRAM and more CXL memory you use.

Right, but why does the usage distribution require its own interface
rather than being combined with the access control part?
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Johannes Weiner 2 years, 2 months ago
On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > > > This patchset implements weighted interleave and adds a new sysfs
> > > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > > > 
> > > > The il_weight of a node is used by mempolicy to implement weighted
> > > > interleave when `numactl --interleave=...` is invoked.  By default
> > > > il_weight for a node is always 1, which preserves the default round
> > > > robin interleave behavior.
> > > > 
> > > > Interleave weights may be set from 0-100, and denote the number of
> > > > pages that should be allocated from the node when interleaving
> > > > occurs.
> > > > 
> > > > For example, if a node's interleave weight is set to 5, 5 pages
> > > > will be allocated from that node before the next node is scheduled
> > > > for allocations.
> > > 
> > > I find this semantic rather weird TBH. First of all why do you think it
> > > makes sense to have those weights global for all users? What if
> > > different applications have different view on how to spred their
> > > interleaved memory?
> > > 
> > > I do get that you might have a different tiers with largerly different
> > > runtime characteristics but why would you want to interleave them into a
> > > single mapping and have hard to predict runtime behavior?
> > > 
> > > [...]
> > > > In this way it becomes possible to set an interleaving strategy
> > > > that fits the available bandwidth for the devices available on
> > > > the system. An example system:
> > > > 
> > > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> > > > 
> > > > In this setup, the effective weights for nodes 0-3 for a task
> > > > running on Node 0 may be [60, 20, 10, 10].
> > > > 
> > > > This spreads memory out across devices which all have different
> > > > latency and bandwidth attributes at a way that can maximize the
> > > > available resources.
> > > 
> > > OK, so why is this any better than not using any memory policy rely
> > > on demotion to push out cold memory down the tier hierarchy?
> > > 
> > > What is the actual real life usecase and what kind of benefits you can
> > > present?
> > 
> > There are two things CXL gives you: additional capacity and additional
> > bus bandwidth.
> > 
> > The promotion/demotion mechanism is good for the capacity usecase,
> > where you have a nice hot/cold gradient in the workingset and want
> > placement accordingly across faster and slower memory.
> > 
> > The interleaving is useful when you have a flatter workingset
> > distribution and poorer access locality. In that case, the CPU caches
> > are less effective and the workload can be bus-bound. The workload
> > might fit entirely into DRAM, but concentrating it there is
> > suboptimal. Fanning it out in proportion to the relative performance
> > of each memory tier gives better resuls.
> > 
> > We experimented with datacenter workloads on such machines last year
> > and found significant performance benefits:
> > 
> > https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
> 
> Thanks, this is a useful insight.
>  
> > This hopefully also explains why it's a global setting. The usecase is
> > different from conventional NUMA interleaving, which is used as a
> > locality measure: spread shared data evenly between compute
> > nodes. This one isn't about locality - the CXL tier doesn't have local
> > compute. Instead, the optimal spread is based on hardware parameters,
> > which is a global property rather than a per-workload one.
> 
> Well, I am not convinced about that TBH. Sure it is probably a good fit
> for this specific CXL usecase but it just doesn't fit into many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.
> 
> Is there any specific reason for not having a new interleave interface
> which defines weights for the nodemask? Is this because the policy
> itself is very dynamic or is this more driven by simplicity of use?

A downside of *requiring* weights to be paired with the mempolicy is
that it's then the application that would have to figure out the
weights dynamically, instead of having a static host configuration. A
policy of "I want to be spread for optimal bus bandwidth" translates
between different hardware configurations, but optimal weights will
vary depending on the type of machine a job runs on.

That doesn't mean there couldn't be usecases for having weights as
policy as well in other scenarios, like you allude to above. It's just
that so far such usecases haven't really materialized or been spelled
out concretely. Maybe we just want both - a global default, and the
ability to override it locally. Could you elaborate on the 'get what
you pay for' usecase you mentioned?
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Michal Hocko 2 years, 2 months ago
On Tue 31-10-23 12:22:16, Johannes Weiner wrote:
> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
[...]
> > Is there any specific reason for not having a new interleave interface
> > which defines weights for the nodemask? Is this because the policy
> > itself is very dynamic or is this more driven by simplicity of use?
> 
> A downside of *requiring* weights to be paired with the mempolicy is
> that it's then the application that would have to figure out the
> weights dynamically, instead of having a static host configuration. A
> policy of "I want to be spread for optimal bus bandwidth" translates
> between different hardware configurations, but optimal weights will
> vary depending on the type of machine a job runs on.

I can imagine this could be achieved by numactl(8), so that the process
management tool could set this up for the process at startup. Sure,
it wouldn't be very dynamic after that, and that is why I was asking
about how dynamic the situation might be in practice.

> That doesn't mean there couldn't be usecases for having weights as
> policy as well in other scenarios, like you allude to above. It's just
> so far such usecases haven't really materialized or spelled out
> concretely. Maybe we just want both - a global default, and the
> ability to override it locally. Could you elaborate on the 'get what
> you pay for' usecase you mentioned?

This is more or less just an idea that first came to mind when
hearing about bus bandwidth optimizations. I suspect that sooner or
later we will learn about usecases where the optimization function
maximizes not only bandwidth but also cost for that bandwidth. Consider
a hosting system serving different workloads, each paying for a
different QoS. Do I know about anybody requiring that now? No! But we
should really test the proposed interface for potential future
extensions. If such an extension is not reasonable and/or we can achieve
it by different means, then great.
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Huang, Ying 2 years, 1 month ago
Michal Hocko <mhocko@suse.com> writes:

> On Tue 31-10-23 12:22:16, Johannes Weiner wrote:
>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> [...]
>> > Is there any specific reason for not having a new interleave interface
>> > which defines weights for the nodemask? Is this because the policy
>> > itself is very dynamic or is this more driven by simplicity of use?
>> 
>> A downside of *requiring* weights to be paired with the mempolicy is
>> that it's then the application that would have to figure out the
>> weights dynamically, instead of having a static host configuration. A
>> policy of "I want to be spread for optimal bus bandwidth" translates
>> between different hardware configurations, but optimal weights will
>> vary depending on the type of machine a job runs on.
>
> I can imagine this could be achieved by numactl(8) so that the process
> management tool could set this up for the process on the start up. Sure
> it wouldn't be very dynamic after then and that is why I was asking
> about how dynamic the situation might be in practice.
>
>> That doesn't mean there couldn't be usecases for having weights as
>> policy as well in other scenarios, like you allude to above. It's just
>> so far such usecases haven't really materialized or spelled out
>> concretely. Maybe we just want both - a global default, and the
>> ability to override it locally. Could you elaborate on the 'get what
>> you pay for' usecase you mentioned?
>
> This is more or less just an idea that came first to my mind when
> hearing about bus bandwidth optimizations. I suspect that sooner or
> later we just learn about usecases where the optimization function
> maximizes not only bandwidth but also cost for that bandwidth. Consider
> a hosting system serving different workloads each paying different
> QoS.

I don't think a pure software solution can enforce memory bandwidth
allocation.  For that, we will need something like MBA (Memory Bandwidth
Allocation), as in the following URL:

https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-allocation.html

At least, something like MBM (Memory Bandwidth Monitoring), as in the
following URL, will be needed:

https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-monitoring.html

The interleave solution helps cooperative workloads only.

> Do I know about anybody requiring that now? No! But we should really
> test the proposed interface for potential future extensions. If such an
> extension is not reasonable and/or we can achieve that by different
> means then great.

--
Best Regards,
Huang, Ying
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Michal Hocko 2 years, 1 month ago
On Thu 02-11-23 14:21:49, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Tue 31-10-23 12:22:16, Johannes Weiner wrote:
> >> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> > [...]
> >> > Is there any specific reason for not having a new interleave interface
> >> > which defines weights for the nodemask? Is this because the policy
> >> > itself is very dynamic or is this more driven by simplicity of use?
> >> 
> >> A downside of *requiring* weights to be paired with the mempolicy is
> >> that it's then the application that would have to figure out the
> >> weights dynamically, instead of having a static host configuration. A
> >> policy of "I want to be spread for optimal bus bandwidth" translates
> >> between different hardware configurations, but optimal weights will
> >> vary depending on the type of machine a job runs on.
> >
> > I can imagine this could be achieved by numactl(8) so that the process
> > management tool could set this up for the process on the start up. Sure
> > it wouldn't be very dynamic after then and that is why I was asking
> > about how dynamic the situation might be in practice.
> >
> >> That doesn't mean there couldn't be usecases for having weights as
> >> policy as well in other scenarios, like you allude to above. It's just
> >> so far such usecases haven't really materialized or spelled out
> >> concretely. Maybe we just want both - a global default, and the
> >> ability to override it locally. Could you elaborate on the 'get what
> >> you pay for' usecase you mentioned?
> >
> > This is more or less just an idea that came first to my mind when
> > hearing about bus bandwidth optimizations. I suspect that sooner or
> > later we just learn about usecases where the optimization function
> > maximizes not only bandwidth but also cost for that bandwidth. Consider
> > a hosting system serving different workloads each paying different
> > QoS.
> 
> I don't think pure software solution can enforce the memory bandwidth
> allocation.  For that, we will need something like MBA (Memory Bandwidth
> Allocation) as in the following URL,
> 
> https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-allocation.html
> 
> At lease, something like MBM (Memory Bandwidth Monitoring) as in the
> following URL will be needed.
> 
> https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-monitoring.html
> 
> The interleave solution helps the cooperative workloads only.

Enforcement is an orthogonal thing IMO. We are talking about a
best-effort interface.

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Posted by Huang, Ying 2 years, 2 months ago
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:

[snip]

>>  
>> > This hopefully also explains why it's a global setting. The usecase is
>> > different from conventional NUMA interleaving, which is used as a
>> > locality measure: spread shared data evenly between compute
>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>> > compute. Instead, the optimal spread is based on hardware parameters,
>> > which is a global property rather than a per-workload one.
>> 
>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>> for this specific CXL usecase but it just doesn't fit into many others I
>> can think of - e.g. proportional use of those tiers based on the
>> workload - you get what you pay for.
>> 
>> Is there any specific reason for not having a new interleave interface
>> which defines weights for the nodemask? Is this because the policy
>> itself is very dynamic or is this more driven by simplicity of use?
>
> A downside of *requiring* weights to be paired with the mempolicy is
> that it's then the application that would have to figure out the
> weights dynamically, instead of having a static host configuration. A
> policy of "I want to be spread for optimal bus bandwidth" translates
> between different hardware configurations, but optimal weights will
> vary depending on the type of machine a job runs on.
>
> That doesn't mean there couldn't be usecases for having weights as
> policy as well in other scenarios, like you allude to above. It's just
> so far such usecases haven't really materialized or spelled out
> concretely. Maybe we just want both - a global default, and the
> ability to override it locally.

I think that this is a good idea.  The system-wide configuration with a
reasonable default makes applications' lives much easier.  If more
control is needed, some kind of workload-specific configuration can be
added.  And, instead of adding another memory policy, a cgroup-based
configuration may be easier to use.  The per-workload weight may need
to be adjusted when deploying different combinations of workloads on
the system.

Another question is whether the weight should be per-memory-tier or
per-node.  In this patchset, the weight is per source/target node
combination.  That is, the weight becomes a matrix instead of a vector.
IIUC, this is used to control cross-socket memory access in addition to
per-memory-type memory access.  Do you think the added complexity is
necessary?

> Could you elaborate on the 'get what you pay for' usecase you
> mentioned?

--
Best Regards,
Huang, Ying