[RFC PATCH v2 0/2] Node migration between memory tiers
Posted by sthanneeru.opensrc@micron.com 2 years ago
From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>

The memory tiers feature allows nodes with similar memory types
or performance characteristics to be grouped together in a
memory tier. However, there is currently no provision for
moving a node from one tier to another on demand.

This patch series adds support for on-demand node migration
between tiers by the sysadmin/root user, via a new per-node
sysfs attribute.

To migrate a node to a different tier, write the target tier ID
to the node's memtier_override sysfs attribute.

Example: move node2 from its default tier (memory_tier4) to memory_tier2.

1. Check the current memory tier of node2:
$ cat /sys/devices/system/node/node2/memtier_override
memory_tier4

2. Migrate node2 to memory_tier2:
$ echo 2 > /sys/devices/system/node/node2/memtier_override
$ cat /sys/devices/system/node/node2/memtier_override
memory_tier2
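
For illustration, this is roughly how the per-node attribute could be
wired up in drivers/base/node.c.  This is only a sketch, not the code
in this series; node_get_memtier() and node_set_memtier() are
hypothetical helpers that mm/memory-tiers.c would have to provide.

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

/* Sketch only, not the code from this series. */
static ssize_t memtier_override_show(struct device *dev,
                                     struct device_attribute *attr,
                                     char *buf)
{
        /* For node devices, dev->id is the NUMA node id. */
        return sysfs_emit(buf, "memory_tier%d\n",
                          node_get_memtier(dev->id));    /* hypothetical */
}

static ssize_t memtier_override_store(struct device *dev,
                                      struct device_attribute *attr,
                                      const char *buf, size_t count)
{
        int tier_id, ret;

        ret = kstrtoint(buf, 10, &tier_id);
        if (ret)
                return ret;

        ret = node_set_memtier(dev->id, tier_id);         /* hypothetical */
        return ret ? ret : count;
}
static DEVICE_ATTR_RW(memtier_override);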

Use cases:

1. Useful to move CXL nodes to the right tiers from userspace when
   the hardware fails to assign the tiers correctly based on
   memory types.

   On some platforms we have observed CXL memory being assigned to
   the same tier as DDR memory. This is arguably a system firmware
   bug, but it is true that tiers represent *ranges* of performance
   and we believe it's important for the system operator to have
   the ability to override bad firmware or OS decisions about tier
   assignment as a fail-safe against potential bad outcomes.

2. Useful if we want interleave weights to be applied to memory tiers
   instead of nodes.
   In a previous thread, Huang Ying <ying.huang@intel.com> thought
   this feature might be useful to overcome limitations of systems
   where nodes with different bandwidth characteristics are grouped
   in a single tier.
   https://lore.kernel.org/lkml/87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com/

=============
Version Notes:

V2 : Changed interface to memtier_override from adistance_offset.
memtier_override was recommended by
1. John Groves <john@jagalactic.com>
2. Ravi Shankar <ravis.opensrc@micron.com>
3. Brice Goglin <Brice.Goglin@inria.fr>

V1 : Introduced adistance_offset sysfs.

=============

Srinivasulu Thanneeru (2):
  base/node: Add sysfs for memtier_override
  memory tier: Support node migration between tiers

 Documentation/ABI/stable/sysfs-devices-node |  7 ++
 drivers/base/node.c                         | 47 ++++++++++++
 include/linux/memory-tiers.h                | 11 +++
 include/linux/node.h                        | 11 +++
 mm/memory-tiers.c                           | 85 ++++++++++++---------
 5 files changed, 125 insertions(+), 36 deletions(-)

-- 
2.25.1

Re: [RFC PATCH v2 0/2] Node migration between memory tiers
Posted by Huang, Ying 2 years ago
<sthanneeru.opensrc@micron.com> writes:

> From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
>
> The memory tiers feature allows nodes with similar memory types
> or performance characteristics to be grouped together in a
> memory tier. However, there is currently no provision for
> moving a node from one tier to another on demand.
>
> This patch series adds support for on-demand node migration
> between tiers by the sysadmin/root user, via a new per-node
> sysfs attribute.
>
> To migrate a node to a different tier, write the target tier ID
> to the node's memtier_override sysfs attribute.
>
> Example: move node2 from its default tier (memory_tier4) to memory_tier2.
>
> 1. Check the current memory tier of node2:
> $ cat /sys/devices/system/node/node2/memtier_override
> memory_tier4
>
> 2. Migrate node2 to memory_tier2:
> $ echo 2 > /sys/devices/system/node/node2/memtier_override
> $ cat /sys/devices/system/node/node2/memtier_override
> memory_tier2
>
> Use cases:
>
> 1. Useful to move CXL nodes to the right tiers from userspace when
>    the hardware fails to assign the tiers correctly based on
>    memory types.
>
>    On some platforms we have observed CXL memory being assigned to
>    the same tier as DDR memory. This is arguably a system firmware
>    bug, but it is true that tiers represent *ranges* of performance
>    and we believe it's important for the system operator to have
>    the ability to override bad firmware or OS decisions about tier
>    assignment as a fail-safe against potential bad outcomes.
>
> 2. Useful if we want interleave weights to be applied to memory tiers
>    instead of nodes.
>    In a previous thread, Huang Ying <ying.huang@intel.com> thought
>    this feature might be useful to overcome limitations of systems
>    where nodes with different bandwidth characteristics are grouped
>    in a single tier.
>    https://lore.kernel.org/lkml/87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> =============
> Version Notes:
>
> V2 : Changed interface to memtier_override from adistance_offset.
> memtier_override was recommended by
> 1. John Groves <john@jagalactic.com>
> 2. Ravi Shankar <ravis.opensrc@micron.com>
> 3. Brice Goglin <Brice.Goglin@inria.fr>

It appears that you ignored my comments for V1 as follows ...

https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/

--
Best Regards,
Huang, Ying

> V1 : Introduced adistance_offset sysfs.
>
> =============
>
> Srinivasulu Thanneeru (2):
>   base/node: Add sysfs for memtier_override
>   memory tier: Support node migration between tiers
>
>  Documentation/ABI/stable/sysfs-devices-node |  7 ++
>  drivers/base/node.c                         | 47 ++++++++++++
>  include/linux/memory-tiers.h                | 11 +++
>  include/linux/node.h                        | 11 +++
>  mm/memory-tiers.c                           | 85 ++++++++++++---------
>  5 files changed, 125 insertions(+), 36 deletions(-)
Re: [RFC PATCH v2 0/2] Node migration between memory tiers
Posted by Gregory Price 2 years ago
On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> <sthanneeru.opensrc@micron.com> writes:
> 
> > =============
> > Version Notes:
> >
> > V2 : Changed interface to memtier_override from adistance_offset.
> > memtier_override was recommended by
> > 1. John Groves <john@jagalactic.com>
> > 2. Ravi Shankar <ravis.opensrc@micron.com>
> > 3. Brice Goglin <Brice.Goglin@inria.fr>
> 
> It appears that you ignored my comments for V1 as follows ...
> 
> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/
> 

Not speaking for the group, just chiming in because I'd discussed it
with them.

"Memory Type" is a bit nebulous.  Is a Micron Type-3 with performance X
and an SK Hynix Type-3 with performance Y a "Different type", or are
they the "Same Type" given that they're both Type 3 backed by some form
of DDR?  Is socket placement of those devices relevant for determining
"Type"?  Is whether they are behind a switch relevant for determining
"Type"? "Type" is frustrating when everything we're talking about
managing is "Type-3" with different performance.

A concrete example:
To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
exactly the same as a standard SLD.  I may want to have some
combination of local memory expansion devices on the majority of my
expansion slots, but reserve 1 slot on each socket for a connection to
the MH-SLD.  As of right now, there is no good way to differentiate the
devices in terms of "Type" - and even if you had that, the tiering
system would still lump them together.

Similarly, an initial run of switches may or may not allow enumeration
of the devices behind them (depending on the configuration), so you may
end up with a static NUMA node that "looks like" another SLD - despite
it being some definition of "GFAM".  Does the number of hops matter in
determining "Type"?

So I really don't think "Type" is useful for determining tier placement.

As of right now, the system lumps DRAM nodes into one tier, and pretty
much everything else into "the other tier". To me, this patch set is an
initial pass meant to allow user control over tier composition while
the internal mechanism is sussed out and the environment develops.

In general, a release valve that lets you redefine tiers is very welcome
for testing and validation of different setups while the industry evolves.

Just my two cents.

~Gregory

> --
> Best Regards,
> Huang, Ying
>
Re: [RFC PATCH v2 0/2] Node migration between memory tiers
Posted by Huang, Ying 1 year, 12 months ago
Gregory Price <gregory.price@memverge.com> writes:

> On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> <sthanneeru.opensrc@micron.com> writes:
>> 
>> > =============
>> > Version Notes:
>> >
>> > V2 : Changed interface to memtier_override from adistance_offset.
>> > memtier_override was recommended by
>> > 1. John Groves <john@jagalactic.com>
>> > 2. Ravi Shankar <ravis.opensrc@micron.com>
>> > 3. Brice Goglin <Brice.Goglin@inria.fr>
>> 
>> It appears that you ignored my comments for V1 as follows ...
>> 
>> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> 
>
> Not speaking for the group, just chiming in because I'd discussed it
> with them.
>
> "Memory Type" is a bit nebulous.  Is a Micron Type-3 with performance X
> and an SK Hynix Type-3 with performance Y a "Different type", or are
> they the "Same Type" given that they're both Type 3 backed by some form
> of DDR?  Is socket placement of those devices relevant for determining
> "Type"?  Is whether they are behind a switch relevant for determining
> "Type"? "Type" is frustrating when everything we're talking about
> managing is "Type-3" with different performance.
>
> A concrete example:
> To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
> exactly the same as a standard SLD.  I may want to have some
> combination of local memory expansion devices on the majority of my
> expansion slots, but reserve 1 slot on each socket for a connection to
> the MH-SLD.  As of right now, there is no good way to differentiate the
> devices in terms of "Type" - and even if you had that, the tiering
> system would still lump them together.
>
> Similarly, an initial run of switches may or may not allow enumeration
> of the devices behind them (depending on the configuration), so you may
> end up with a static NUMA node that "looks like" another SLD - despite
> it being some definition of "GFAM".  Does the number of hops matter in
> determining "Type"?

In the original design, memory devices of the same memory type are
managed by the same device driver, linked with the system in the same
way (including switches), and built with the same media.  So, their
performance is the same too.  And, like memory tiers, memory types are
orthogonal to sockets.  Do you think the definition itself is clear
enough?
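
For concreteness, a driver ties its nodes to a memory type roughly as
in the sketch below, loosely modeled on drivers/dax/kmem.c (the
example_* names are made up for illustration):

#include <linux/err.h>
#include <linux/memory-tiers.h>

/*
 * Sketch loosely modeled on drivers/dax/kmem.c; example_* names are
 * illustrative only.  All nodes backed by the same kind of device share
 * one struct memory_dev_type, whose abstract distance determines the
 * memory tier those nodes are placed in.
 */
static struct memory_dev_type *example_type;

static int example_register_node(int nid)
{
        if (!example_type) {
                example_type = alloc_memory_type(MEMTIER_ADISTANCE_PMEM);
                if (IS_ERR(example_type))
                        return PTR_ERR(example_type);
        }

        /* All nodes of this device type share the same memory_dev_type. */
        init_node_memory_type(nid, example_type);
        return 0;
}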

I admit "memory type" is a confusing name.  Do you have a better
suggestion?

> So I really don't think "Type" is useful for determining tier placement.
>
> As of right now, the system lumps DRAM nodes into one tier, and pretty
> much everything else into "the other tier". To me, this patch set is an
> initial pass meant to allow user control over tier composition while
> the internal mechanism is sussed out and the environment develops.

The patchset to identify the performance of memory devices via HMAT and
put them in the proper "memory types" and memory tiers was merged in
v6.7-rc1.

      07a8bdd4120c (memory tiering: add abstract distance calculation algorithms management, 2023-09-26)
      d0376aac59a1 (acpi, hmat: refactor hmat_register_target_initiators(), 2023-09-26)
      3718c02dbd4c (acpi, hmat: calculate abstract distance with HMAT, 2023-09-26)
      6bc2cfdf82d5 (dax, kmem: calculate abstract distance with general interface, 2023-09-26)
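
As a simplified model of how this places nodes in tiers (the constants
and helper below are illustrative, not a copy of mm/memory-tiers.c):
each memory type carries an abstract distance, and the default tier ID
is derived by dividing that distance into fixed-size chunks, which is
why DRAM nodes show up as memory_tier4 by default.

#include <stdio.h>

/*
 * Simplified model; the real logic and exact constants live in
 * include/linux/memory-tiers.h and mm/memory-tiers.c.
 */
#define MEMTIER_CHUNK_BITS      8
#define MEMTIER_CHUNK_SIZE      (1 << MEMTIER_CHUNK_BITS)
/* DRAM is anchored in the middle of the fifth chunk -> tier id 4. */
#define ADISTANCE_DRAM          ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2))

/* Nodes whose abstract distance falls in the same chunk share a tier. */
static int adistance_to_tier_id(int adistance)
{
        return adistance >> MEMTIER_CHUNK_BITS;
}

int main(void)
{
        /* DRAM lands in memory_tier4 ... */
        printf("DRAM: memory_tier%d\n",
               adistance_to_tier_id(ADISTANCE_DRAM));
        /* ... while a slower device one chunk further away lands in
         * memory_tier5, i.e. a slower tier. */
        printf("slower device: memory_tier%d\n",
               adistance_to_tier_id(ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE));
        return 0;
}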

> In general, a release valve that lets you redefine tiers is very welcome
> for testing and validation of different setups while the industry evolves.
>
> Just my two cents.

--
Best Regards,
Huang, Ying