Kernel daemon for detecting and promoting hot pages

[RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

Posted by Bharata B Rao 11 months, 1 week ago

Hi,

This is an attempt towards having a single subsystem that accumulates
hot page information from lower memory tiers and does hot page
promotion.

At the heart of this subsystem is a kernel daemon named kpromoted that
does the following:

1. Exposes an API that other subsystems which detect/generate memory
   access information can use to inform the daemon about memory
   accesses from lower memory tiers.
2. Maintains the list of hot pages and attempts to promote them to
   toptiers.

Currently I have added AMD IBS driver as one source that provides
page access information as an example. This driver feeds info to
kpromoted in this RFC patchset. More sources were discussed in a
similar context here at [1].

This is just an early attempt to check what it takes to maintain
a single source of page hotness info and also separate hot page
detection mechanisms from the promotion mechanism. There are too
many open ends right now and I have listed a few of them below.

- The API that is provided to register memory access expects
  the PFN, NID and time of access at the minimum. This is
  described more in patch 2/4. This API currently can be called
  only from contexts that allow sleeping and hence this rules
  out using it from PTE scanning paths. The API needs to be
  more flexible with respect to this.
- Some sources like PTE A bit scanning can't provide the precise
  time of access or the NID that is accessing the page. The latter
  has been an open problem to which I haven't come across a good
  and acceptable solution.
- The way the hot page information is maintained is pretty
  primitive right now. Ideally we would like to store hotness info
  in such a way that it should be easily possible to lookup say N
  most hot pages.
- If PTE A bit scanners are considered as hotness sources, we will
  be bombarded with accesses. Do we want to accommodate all those
  accesses or just go with hotness info for fixed number of pages
  (possibly as a ratio of lower tier memory capacity)?
- Undoubtedly the mechanism to classify a page as hot and subsequent
  promotion needs to be more sophisticated than what I have right now.
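
As an illustration of the first and third points above (not code from this
patchset), here is a minimal userspace sketch of a per-page hotness store
that takes the PFN, NID and time of access and can return the N hottest
pages. The names HotPageTable, record_access and hottest are hypothetical,
and ranking by access count with recency as the tie-breaker is just one
possible policy.

```python
import heapq
from dataclasses import dataclass


@dataclass
class PageInfo:
    pfn: int
    nid: int            # node that most recently accessed the page
    last_access: int    # time of the most recent access (arbitrary ticks)
    frequency: int = 0  # number of accesses recorded so far


class HotPageTable:
    """Hypothetical hotness store; not the kpromoted data structure."""

    def __init__(self):
        self.pages = {}  # pfn -> PageInfo

    def record_access(self, pfn, nid, now):
        info = self.pages.get(pfn)
        if info is None:
            info = self.pages[pfn] = PageInfo(pfn, nid, now)
        info.nid = nid
        info.last_access = now
        info.frequency += 1

    def hottest(self, n):
        # Rank by access count, break ties by recency.
        return heapq.nlargest(
            n, self.pages.values(),
            key=lambda p: (p.frequency, p.last_access))


table = HotPageTable()
for pfn, nid, now in [(10, 0, 1), (11, 0, 2), (10, 1, 3),
                      (12, 0, 4), (10, 0, 5), (12, 1, 6)]:
    table.record_access(pfn, nid, now)

top2 = [p.pfn for p in table.hottest(2)]
```

A dictionary scan like this costs O(pages) per lookup; a real implementation
would likely want an ordered structure so that the hottest entries can be
found without walking everything.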

This is just an early RFC posted now to ignite some discussion
in the context of LSFMM [2].

I am also working with Raghu to integrate his kmmscand [3] as the
hotness source and use kpromoted for migration.

Also, I had posted the IBS driver earlier as an alternative to
hint faults based NUMA Balancing [4]. However here I am using
it as generic page hotness source.

[1] https://lore.kernel.org/linux-mm/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
[2] https://lore.kernel.org/linux-mm/20250123105721.424117-1-raghavendra.kt@amd.com/
[3] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
[4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/

Regards,
Bharata.

Bharata B Rao (4):
  mm: migrate: Allow misplaced migration without VMA too
  mm: kpromoted: Hot page info collection and promotion daemon
  x86: ibs: In-kernel IBS driver for memory access profiling
  x86: ibs: Enable IBS profiling for memory accesses

 arch/x86/events/amd/ibs.c           |  11 +
 arch/x86/include/asm/entry-common.h |   3 +
 arch/x86/include/asm/hardirq.h      |   2 +
 arch/x86/include/asm/ibs.h          |   9 +
 arch/x86/include/asm/msr-index.h    |  16 ++
 arch/x86/mm/Makefile                |   3 +-
 arch/x86/mm/ibs.c                   | 344 ++++++++++++++++++++++++++++
 include/linux/kpromoted.h           |  54 +++++
 include/linux/mmzone.h              |   4 +
 include/linux/vm_event_item.h       |  30 +++
 mm/Kconfig                          |   7 +
 mm/Makefile                         |   1 +
 mm/kpromoted.c                      | 305 ++++++++++++++++++++++++
 mm/migrate.c                        |   5 +-
 mm/mm_init.c                        |  10 +
 mm/vmstat.c                         |  30 +++
 16 files changed, 831 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/ibs.h
 create mode 100644 arch/x86/mm/ibs.c
 create mode 100644 include/linux/kpromoted.h
 create mode 100644 mm/kpromoted.c

-- 
2.34.1

Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

Posted by Balbir Singh 10 months, 3 weeks ago

On 3/6/25 16:45, Bharata B Rao wrote:
> Hi,
> 
> This is an attempt towards having a single subsystem that accumulates
> hot page information from lower memory tiers and does hot page
> promotion.
> 
> At the heart of this subsystem is a kernel daemon named kpromoted that
> does the following:
> 
> 1. Exposes an API that other subsystems which detect/generate memory
>    access information can use to inform the daemon about memory
>    accesses from lower memory tiers.
> 2. Maintains the list of hot pages and attempts to promote them to
>    toptiers.
> 
> Currently I have added AMD IBS driver as one source that provides
> page access information as an example. This driver feeds info to
> kpromoted in this RFC patchset. More sources were discussed in a
> similar context here at [1].
> 

Is hot page promotion mandated or good to have? Memory tiers today
are a function of latency and bandwidth, specifically in
mt_perf_to_adistance()

adist ~ k * R(B)/R(L) where R(x) is the relative performance of the
memory w.r.t DRAM. Do we want hot pages in the top tier all the time?
Are we optimizing for bandwidth or latency?

> This is just an early attempt to check what it takes to maintain
> a single source of page hotness info and also separate hot page
> detection mechanisms from the promotion mechanism. There are too
> many open ends right now and I have listed a few of them below.
> 


<snip>

> This is just an early RFC posted now to ignite some discussion
> in the context of LSFMM [2].
> 

I look forward to any summary of the discussions

Balbir Singh

Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

Posted by Bharata B Rao 10 months, 3 weeks ago

Hi Balbir,

On 18-Mar-25 10:58 AM, Balbir Singh wrote:
> On 3/6/25 16:45, Bharata B Rao wrote:
>> Hi,
>>
>> This is an attempt towards having a single subsystem that accumulates
>> hot page information from lower memory tiers and does hot page
>> promotion.
>>
>> At the heart of this subsystem is a kernel daemon named kpromoted that
>> does the following:
>>
>> 1. Exposes an API that other subsystems which detect/generate memory
>>     access information can use to inform the daemon about memory
>>     accesses from lower memory tiers.
>> 2. Maintains the list of hot pages and attempts to promote them to
>>     toptiers.
>>
>> Currently I have added AMD IBS driver as one source that provides
>> page access information as an example. This driver feeds info to
>> kpromoted in this RFC patchset. More sources were discussed in a
>> similar context here at [1].
>>
> 
> Is hot page promotion mandated or good to have?

If you look at the current hot page promotion (NUMAB=2) logic, IIUC an
accessed lower tier page is directly promoted to toptier if enough space
exists in the toptier node. In such cases, it doesn't even bother about
the hot threshold (a measure of how recently it was accessed) or migration
rate limiting. This tells me that in a tiered memory setup, having an
accessed page in toptier is preferable.

> Memory tiers today
> are a function of latency and bandwidth, specifically in
> mt_perf_to_adistance()
> 
> adist ~ k * R(B)/R(L) where R(x) is the relative performance of the
> memory w.r.t DRAM. Do we want hot pages in the top tier all the time?
> Are we optimizing for bandwidth or latency?

When memory tiering code converts BW and latency numbers into an opaque 
metric adistance based on which the node gets placed at an appropriate 
position in the tiering hierarchy, I wonder if it is still possible to 
say if we are optimizing for bandwidth or latency separately?
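
A small sketch of why a single combined metric hides this distinction. It
assumes the abstract distance grows in proportion to latency and in inverse
proportion to bandwidth, both relative to DRAM. The function name
perf_to_adistance, the base value and all device numbers are made up for
illustration and are not taken from the kernel.

```python
def perf_to_adistance(base, dram, dev):
    """Map (latency, bandwidth) of a device to a distance relative to DRAM.

    dram and dev are (latency, bandwidth) tuples in arbitrary units.
    Integer arithmetic is used, as kernel code would.
    """
    dram_lat, dram_bw = dram
    dev_lat, dev_bw = dev
    return base * dev_lat // dram_lat * dram_bw // dev_bw


BASE = 1000                    # assumed distance assigned to DRAM itself
DRAM = (100, 200)              # (latency, bandwidth), hypothetical units
high_latency_dev = (200, 200)  # twice the latency, same bandwidth
low_bandwidth_dev = (100, 100)  # same latency, half the bandwidth

a = perf_to_adistance(BASE, DRAM, high_latency_dev)
b = perf_to_adistance(BASE, DRAM, low_bandwidth_dev)
```

Under these assumptions both devices end up at the same distance, so the
tier position alone cannot say whether a node is slower because of latency
or because of bandwidth.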

>> This is just an early attempt to check what it takes to maintain
>> a single source of page hotness info and also separate hot page
>> detection mechanisms from the promotion mechanism. There are too
>> many open ends right now and I have listed a few of them below.
>>
> 
> 
> <snip>
> 
>> This is just an early RFC posted now to ignite some discussion
>> in the context of LSFMM [2].
>>
> 
> I look forward to any summary of the discussions

Sure. Thanks,
Bharata.

Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

Posted by Balbir Singh 10 months, 3 weeks ago

On 3/20/25 20:07, Bharata B Rao wrote:
> Hi Balbir,
> 
> On 18-Mar-25 10:58 AM, Balbir Singh wrote:
>> On 3/6/25 16:45, Bharata B Rao wrote:
>>> Hi,
>>>
>>> This is an attempt towards having a single subsystem that accumulates
>>> hot page information from lower memory tiers and does hot page
>>> promotion.
>>>
>>> At the heart of this subsystem is a kernel daemon named kpromoted that
>>> does the following:
>>>
>>> 1. Exposes an API that other subsystems which detect/generate memory
>>>     access information can use to inform the daemon about memory
>>>     accesses from lower memory tiers.
>>> 2. Maintains the list of hot pages and attempts to promote them to
>>>     toptiers.
>>>
>>> Currently I have added AMD IBS driver as one source that provides
>>> page access information as an example. This driver feeds info to
>>> kpromoted in this RFC patchset. More sources were discussed in a
>>> similar context here at [1].
>>>
>>
>> Is hot page promotion mandated or good to have?
> 
> If you look at the current hot page promotion (NUMAB=2) logic, IIUC an accessed lower tier page is directly promoted to toptier if enough space exists in the toptier node. In such cases, it doesn't even bother about the hot threshold (a measure of how recently it was accessed) or migration rate limiting. This tells me that in a tiered memory setup, having an accessed page in toptier is preferable.
> 

I'll review the patches. I don't agree with toptier; I think DRAM is the
right tier.

>> Memory tiers today
>> are a function of latency and bandwidth, specifically in
>> mt_perf_to_adistance()
>>
>> adist ~ k * R(B)/R(L) where R(x) is the relative performance of the
>> memory w.r.t DRAM. Do we want hot pages in the top tier all the time?
>> Are we optimizing for bandwidth or latency?
> 
> When memory tiering code converts BW and latency numbers into an opaque metric adistance based on which the node gets placed at an appropriate position in the tiering hierarchy, I wonder if it is still possible to say if we are optimizing for bandwidth or latency separately?

I think we need a notion of that, just higher tiers may not be right.
IOW, I think we need to promote to at-most the DRAM tier, not above it.


Balbir Singh

Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

Posted by SeongJae Park 10 months, 3 weeks ago

+ Harry, who was called Hyeonggon before.

Hello,

Thank you very much for sharing this great patchset.

On Thu, 6 Mar 2025 11:15:28 +0530 Bharata B Rao <bharata@amd.com> wrote:

> Hi,
> 
> This is an attempt towards having a single subsystem that accumulates
> hot page information from lower memory tiers and does hot page
> promotion.

That is one of DAMON's goals, too.  DAMON aims to be a kernel subsystem that can
provide access information accumulated from multiple sources and can be
useful for multiple use cases including profiling and access aware system
operations.

Hot page information and hot page promotion are examples of such information
and operations.  SK hynix developed their CXL memory tiering solution[1] using
DAMON.  I also shared an auto-tuning based memory tiering solution idea[2] before.
At LSFMMBPF 2025, I may share its prototype implementation and evaluation
results on CXL memory devices that I recently gained access to.

Of course, DAMON is still in the middle of its journey towards its north
star.  I'm looking into what DAMON really requires to reach that goal, what is
[not] available in today's DAMON, and what good future plans would be.
My LSFMMBPF 2025 topic proposals are for those.

Hence, this patchset is very helpful to me in showing what can be added to and
improved in DAMON.  As I understand it, the main thing is support for access
information sources other than page tables' accessed bits, such as AMD IBS.  I
admit that today's DAMON supports only page tables' accessed bits as the
primary source of the information.  But the DAMON of the future will be
different.  Let me share more thoughts below.

> 
> At the heart of this subsystem is a kernel daemon named kpromoted that
> does the following:
> 
> 1. Exposes an API that other subsystems which detect/generate memory
>    access information can use to inform the daemon about memory
>    accesses from lower memory tiers.

DAMON also provides such an API, namely, its monitoring operations set layer
interface[3].  Nevertheless, only page tables' accessed bit use cases exist
today.  Hence the interface may have hidden problems when extended to
other sources.

> 2. Maintains the list of hot pages and attempts to promote them to
>    toptiers.

DAMON provides its other half, DAMOS[4], for this kind of usage.

> 
> Currently I have added AMD IBS driver as one source that provides
> page access information as an example. This driver feeds info to
> kpromoted in this RFC patchset. More sources were discussed in a
> similar context here at [1].

I was imagining how I would be able to do this with DAMON via the operations set
layer interface.  And I find the current interface is not very optimized for
AMD IBS like sources that catch accesses as they happen.  That is, in a way,
we could call AMD IBS like primitives push-oriented, while page tables'
accessed bits information is pull-oriented.  The DAMON operations set layer
interface is easier to use in the pull-oriented case.  I don't think it cannot
be used for the push-oriented case, but the interface would definitely be better
if it were more optimized for that use case.

I'm curious if you also tried doing this by extending DAMON, and if you found
some hidden problems.

> 
> This is just an early attempt to check what it takes to maintain
> a single source of page hotness info and also separate hot page
> detection mechanisms from the promotion mechanism. There are too
> many open ends right now and I have listed a few of them below.
> 
> - The API that is provided to register memory access expects
>   the PFN, NID and time of access at the minimum. This is
>   described more in patch 2/4. This API currently can be called
>   only from contexts that allow sleeping and hence this rules
>   out using it from PTE scanning paths. The API needs to be
>   more flexible with respect to this.
> - Some sources like PTE A bit scanning can't provide the precise
>   time of access or the NID that is accessing the page. The latter
>   has been an open problem to which I haven't come across a good
>   and acceptable solution.

Agree.  PTE A bit scanning could be useful in many cases, but not every case.
There was an RFC patchset[7] that extends DAMON for NID.  I'm planning to do
that again using the DAMON operations layer interface.  My current plan is to
implement the prototype using prot_none page faults, and later extend it to AMD
IBS like h/w features.  Hopefully I will share a prototype or at least a more
detailed idea at LSFMMBPF 2025.

> - The way the hot page information is maintained is pretty
>   primitive right now. Ideally we would like to store hotness info
>   in such a way that it should be easily possible to lookup say N
>   most hot pages.

DAMON provides a feature for looking up the N hottest pages, namely DAMOS quotas'
access pattern based regions prioritization[5].

> - If PTE A bit scanners are considered as hotness sources, we will
>   be bombarded with accesses. Do we want to accommodate all those
>   accesses or just go with hotness info for fixed number of pages
>   (possibly as a ratio of lower tier memory capacity)?

I understand you're talking about memory space overhead.  Correct me if I'm
wrong, please.

Doesn't the same issue exist for the current implementation if the sampling
frequency is high and/or the aggregation window is long?

To me, hence, this looks like a problem not of the information source, but of
how the information is maintained.  The current implementation maintains it per
page, so I think the problem is inherent.

DAMON maintains the information in region abstraction that can save multiple
pages with one data structure.  The maximum number of regions can be set by
users, so the space overhead can be controlled.
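
To illustrate how a bound on the number of regions caps the space overhead,
here is a minimal userspace sketch. It is not DAMON's actual algorithm: the
function merge_to_limit is hypothetical, and merging the adjacent pair with
the most similar access counts is just one simple policy chosen for the
example.

```python
def merge_to_limit(regions, max_regions):
    """Merge adjacent regions until at most max_regions (>= 1) remain.

    regions: sorted, contiguous list of (start_pfn, end_pfn, nr_accesses),
    with end_pfn exclusive.
    """
    regions = list(regions)
    while len(regions) > max_regions:
        # Pick the adjacent pair whose access counts are most alike.
        i = min(range(len(regions) - 1),
                key=lambda k: abs(regions[k][2] - regions[k + 1][2]))
        (s1, e1, a1), (s2, e2, a2) = regions[i], regions[i + 1]
        # A size-weighted average keeps the merged count representative.
        merged_acc = (a1 * (e1 - s1) + a2 * (e2 - s2)) // (e2 - s1)
        regions[i:i + 2] = [(s1, e2, merged_acc)]
    return regions


# Five single-page regions collapse into three region descriptors.
per_page = [(0, 1, 9), (1, 2, 8), (2, 3, 0), (3, 4, 1), (4, 5, 9)]
bounded = merge_to_limit(per_page, 3)
```

The number of descriptors kept is independent of how many pages are tracked,
at the cost of losing per-page precision inside each merged region.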

> - Undoubtedly the mechanism to classify a page as hot and subsequent
>   promotion needs to be more sophisticated than what I have right now.

DAMON provides aim-based DAMOS aggressiveness auto-tuning[6] and monitoring
intervals auto-tuning[8] for this purpose.

> 
> This is just an early RFC posted now to ignite some discussion
> in the context of LSFMM [2].

This is really helpful.  Much appreciated, and I'm looking forward to more
discussions at LSFMM and on the mailing lists.

> 
> I am also working with Raghu to integrate his kmmscand [3] as the
> hotness source and use kpromoted for migration.

Raghu also mentioned he would try to take some time to look into DAMON to see
if there is anything that he could reuse for the purpose.  I'm curious if he
was able to find something there.

> 
> Also, I had posted the IBS driver earlier as an alternative to
> hint faults based NUMA Balancing [4]. However here I am using
> it as generic page hotness source.

This will also be very helpful for understanding how IBS can be used.
Appreciate!

> 
> [1] https://lore.kernel.org/linux-mm/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
> [2] https://lore.kernel.org/linux-mm/20250123105721.424117-1-raghavendra.kt@amd.com/
> [3] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> [4] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/

[1] https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion
[2] https://lore.kernel.org/all/20231112195602.61525-1-sj@kernel.org/
[3] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#operations-set-layer
[4] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#operation-schemes
[5] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#prioritization
[6] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
[7] https://lore.kernel.org/linux-mm/cover.1645024354.git.xhao@linux.alibaba.com/
[8] https://origin.kernel.org/doc/html/next/mm/damon/design.html#monitoring-intervals-auto-tuning

Thanks,
SJ


Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

Posted by Bharata B Rao 10 months, 3 weeks ago

Hi SJ,

Thanks for your detailed points and this surely sets up a good context 
for discussion in LSFMM.

Please see my replies to a few of your questions below:

On 17-Mar-25 3:30 AM, SeongJae Park wrote:
>>
>> Currently I have added AMD IBS driver as one source that provides
>> page access information as an example. This driver feeds info to
>> kpromoted in this RFC patchset. More sources were discussed in a
>> similar context here at [1].
> 
> I was imagining how I would be able to do this with DAMON via the operations set
> layer interface.  And I find the current interface is not very optimized for
> AMD IBS like sources that catch accesses as they happen.  That is, in a way,
> we could call AMD IBS like primitives push-oriented, while page tables'
> accessed bits information is pull-oriented.  The DAMON operations set layer
> interface is easier to use in the pull-oriented case.  I don't think it cannot
> be used for the push-oriented case, but the interface would definitely be better
> if it were more optimized for that use case.
> 
> I'm curious if you also tried doing this by extending DAMON, and if you found
> some hidden problems.

I remember discussing this with you during DAMON BoF in one of the 
earlier LPC events, but I didn't get to try it. Guess now is the time :-)

I see the challenge in integrating IBS provided access info with the
current DAMON interfaces. If you check my IBS driver, I store the incoming
access info from IBS into per-cpu buffers before pushing them on to the
subsystem that acts on them. I would think pull-based DAMON interfaces
can consume those buffered samples rather than IBS pushing samples into
DAMON. But I am yet to get clarity on how to honor the region based
sampling that is inherent to DAMON's functioning. Maybe using only
samples that are of interest to the region being tracked could be one way.
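
The idea in the paragraph above can be sketched in userspace as follows.
This is only a model, not the IBS driver or any DAMON interface: the names
percpu_buf, push_sample and pull_samples, the buffer size and the
drop-oldest overflow policy are all assumptions made for the example.

```python
from collections import deque

NR_CPUS = 2
BUF_SIZE = 4  # hypothetical per-CPU buffer capacity

# One bounded buffer per CPU; the oldest sample is dropped on overflow.
percpu_buf = [deque(maxlen=BUF_SIZE) for _ in range(NR_CPUS)]


def push_sample(cpu, pfn):
    """Push side: what a sampling interrupt handler would do."""
    percpu_buf[cpu].append(pfn)


def pull_samples(region):
    """Pull side: drain every buffer, keep only PFNs inside [start, end)."""
    start, end = region
    hits = []
    for buf in percpu_buf:
        while buf:
            pfn = buf.popleft()
            if start <= pfn < end:
                hits.append(pfn)
    return hits


for cpu, pfn in [(0, 100), (1, 205), (0, 150), (1, 101), (0, 300)]:
    push_sample(cpu, pfn)

# A monitor interested in PFNs 100..199 pulls only the matching samples.
hits = pull_samples((100, 200))
```

Note that draining discards the samples that fall outside the region of
interest, which is one of the trade-offs a real design would have to weigh.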

> 
>>
>> This is just an early attempt to check what it takes to maintain
>> a single source of page hotness info and also separate hot page
>> detection mechanisms from the promotion mechanism. There are too
>> many open ends right now and I have listed a few of them below.
>>
>> - The API that is provided to register memory access expects
>>    the PFN, NID and time of access at the minimum. This is
>>    described more in patch 2/4. This API currently can be called
>>    only from contexts that allow sleeping and hence this rules
>>    out using it from PTE scanning paths. The API needs to be
>>    more flexible with respect to this.
>> - Some sources like PTE A bit scanning can't provide the precise
>>    time of access or the NID that is accessing the page. The latter
>>    has been an open problem to which I haven't come across a good
>>    and acceptable solution.
> 
> Agree.  PTE A bit scanning could be useful in many cases, but not every case.
> There was an RFC patchset[7] that extends DAMON for NID.  I'm planning to do
> that again using the DAMON operations layer interface.  My current plan is to
> implement the prototype using prot_none page faults, and later extend it to AMD
> IBS like h/w features.  Hopefully I will share a prototype or at least a more
> detailed idea at LSFMMBPF 2025.
> 
>> - The way the hot page information is maintained is pretty
>>    primitive right now. Ideally we would like to store hotness info
>>    in such a way that it should be easily possible to lookup say N
>>    most hot pages.
> 
> DAMON provides a feature for looking up the N hottest pages, namely DAMOS quotas'
> access pattern based regions prioritization[5].
> 
>> - If PTE A bit scanners are considered as hotness sources, we will
>>    be bombarded with accesses. Do we want to accommodate all those
>>    accesses or just go with hotness info for fixed number of pages
>>    (possibly as a ratio of lower tier memory capacity)?
> 
> I understand you're talking about memory space overhead.  Correct me if I'm
> wrong, please.

Correct, and also the overhead of managing so much data. What I see is
that if I start pushing all the access info obtained from LRU pgtable
scanning, kpromoted would end up spending a lot of time in operations
like lookup, walking the list of hot pages etc.

So maybe it would be better to do some sort of early processing and/or
filtering at the hotness source level itself before letting
kpromoted-like subsystems do further tracking and action.
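
One possible shape for such source-level filtering, as a userspace sketch
only. The SourceFilter class, its threshold and window parameters, and the
policy of forwarding a PFN once it has been seen a given number of times
within a time window are all hypothetical and not part of this patchset.

```python
class SourceFilter:
    """Hypothetical pre-filter in front of a kpromoted-like consumer."""

    def __init__(self, threshold, window):
        self.threshold = threshold  # accesses needed before forwarding
        self.window = window        # how long an observation stays valid
        self.seen = {}              # pfn -> (first_seen_time, count)

    def observe(self, pfn, now):
        """Return True when this access should be forwarded downstream."""
        first, count = self.seen.get(pfn, (now, 0))
        if now - first > self.window:
            first, count = now, 0   # stale history, start over
        count += 1
        if count >= self.threshold:
            self.seen.pop(pfn, None)  # forwarded, forget local state
            return True
        self.seen[pfn] = (first, count)
        return False


flt = SourceFilter(threshold=3, window=10)
accesses = [(7, 0), (8, 1), (7, 2), (7, 3), (8, 20), (8, 21)]
forwarded = [pfn for pfn, now in accesses if flt.observe(pfn, now)]
```

Here six raw accesses turn into a single report for PFN 7, while PFN 8 is
held back because its accesses are too spread out in time.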

> 
> Doesn't the same issue exist for the current implementation if the sampling
> frequency is high and/or the aggregation window is long?
> 
> To me, hence, this looks like a problem not of the information source, but of
> how the information is maintained.  The current implementation maintains it per
> page, so I think the problem is inherent.

Well yes, but the goal could be to do better than NUMAB=2 which does
per-page level tracking.

> 
> DAMON maintains the information in region abstraction that can save multiple
> pages with one data structure.  The maximum number of regions can be set by
> users, so the space overhead can be controlled.

The granularity of tracking - per-page vs range/region is a topic of 
discussion I suppose.

Regards,
Bharata.

Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

Posted by Raghavendra K T 10 months, 3 weeks ago

On 3/17/2025 3:30 AM, SeongJae Park wrote:
> + Harry, who was called Hyeonggon before.
>>
>> I am also working with Raghu to integrate his kmmscand [3] as the
>> hotness source and use kpromoted for migration.
> 
> Raghu also mentioned he would try to take some time to look into DAMON to see
> if there is anything that he could reuse for the purpose.  I'm curious if he
> was able to find something there.
> 
[...]
Hello SJ,

I did take a look at the DAMON vaddr and paddr implementations. I am also
wondering how I can optimize the hotness data collected by kmmscand.

DAMON regions should be very helpful here, but I am not there yet.

I will surely need help in a brainstorming session after my next RFC.

Thanks and Regards
- Raghu

Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

Posted by Bharata B Rao 10 months, 2 weeks ago

> Hi,
> 
> This is an attempt towards having a single subsystem that accumulates
> hot page information from lower memory tiers and does hot page
> promotion.
> 
> At the heart of this subsystem is a kernel daemon named kpromoted that
> does the following:
> 
> 1. Exposes an API that other subsystems which detect/generate memory
>    access information can use to inform the daemon about memory
>    accesses from lower memory tiers.
> 2. Maintains the list of hot pages and attempts to promote them to
>    toptiers.
> 
> Currently I have added AMD IBS driver as one source that provides
> page access information as an example. This driver feeds info to
> kpromoted in this RFC patchset.

FWIW, here are some numbers from kpromoted driven hotpage promotion with
IBS as the hotness source:

Test 1
======
Memory allocated on DRAM and CXL nodes explicitly and no demotion activity
is seen.

Benchmark details
-----------------
* Memory is allocated initially on DRAM and CXL nodes separately.
* Two threads: One accessing DRAM-allocated and other CXL-allocated memory.
* Divides memory area into regions and accesses pages within the region randomly
  and repetitively. In the test config shown below, the allocated memory is
  divided into regions of 1GB size and each such region is repetitively (512
  times) accessed with 21474836480 random accesses in each repetition.
* Benchmark score is time taken for accesses to complete, lower is better
* Data accesses from CXL node are expected to trigger promotion
* Test system has 2 DRAM nodes (128G each) and a CXL node (128G)

kernel.numa_balancing		2 for base, 0 for kpromoted
demotion			true
Threads run on			Node 1
Memory allocated on		Node 1(DRAM) and Node 2(CXL)
Initial allocation ratio	75% on DRAM
Allocated memory size		160G (mmap, MAP_POPULATE)
Initial memory on DRAM node	120G
Initial memory on CXL node	40G
Hot region size			1G
Access pattern			random
Access granularity		4K
Load/store ratio		50% loads + 50% stores
Number of accesses		21474836480
Nr access repetitions		512

Benchmark completion time
-------------------------
Base, NUMAB=2		261s
kpromoted-ibs, NUMAB=0	281s

Stats comparison
----------------
				Base,NUMAB=2	kpromoted-IBS,NUMAB=0
pgdemote_kswapd			0		0
pgdemote_direct			0		0
numa_pte_updates		10485760	0
numa_hint_faults		4427809		0
numa_pages_migrated		388229		374765
kpromoted_recorded_accesses			1651130	/* nr accesses reported to kpromoted */
kpromoted_recorded_hwhints			1651130	/* nr accesses coming from IBS */
kpromoted_record_toptier			1269697	/* nr accesses from toptier/DRAM */
kpromoted_record_added				378090	/* nr accesses considered for promotion */
kpromoted_mig_promoted				374765	/* nr pages promoted */
hwhint_nr_events				1674227	/* nr events reported by IBS */
hwhint_dram_accesses				1269626	/* nr DRAM accesses reported by IBS */
hwhint_cxl_accesses				381435	/* nr Extmem (CXL) accesses reported by IBS */
hwhint_useful_samples				1651110	/* nr actionable samples as per IBS driver */


Test 2
======
Memory is allocated with DRAM and CXL nodes in the affinity mask with
MPOL_BIND + MPOL_F_NUMA_BALANCING.

Benchmark details
-----------------
* Initially, memory allocated spreads over from DRAM to CXL, involves demotion
* Single thread accesses the memory
* Divides memory area into regions and accesses pages within the region randomly
  and repetitively. In the test config shown below, the allocated memory is
  divided into regions of 1GB size and each such region is repetitively (512
  times) accessed with 21474836480 random accesses in each repetition.
* Benchmark score is time taken for accesses to complete, lower is better
* Data accesses from CXL node are expected to trigger promotion
* Test system has 2 DRAM nodes (128G each) and a CXL node (128G)

kernel.numa_balancing		2 for base, 0 for kpromoted
demotion			true
Threads run on			Node 1
Memory allocated on		Node 1(DRAM) and Node 2(CXL)
Allocated memory size		192G (mmap, MAP_POPULATE)
Hot region size			1G
Access pattern			random
Access granularity		4K
Load/store ratio		50% loads + 50% stores
Number of accesses		21474836480
Nr access repetitions		512

Benchmark completion time
-------------------------
Base, NUMAB=2		628s
kpromoted-ibs, NUMAB=0	626s

Stats comparison
----------------
				Base,NUMAB=2	kpromoted-IBS,NUMAB=0
pgdemote_kswapd			73187		2196028
pgdemote_direct			0		0
numa_pte_updates		27511631	0
numa_hint_faults		10010852	0
numa_pages_migrated		14		611177	/* such low number of promotions is unexpected in Base, need to recheck */
kpromoted_recorded_accesses			1883570
kpromoted_recorded_hwhints			1883570
kpromoted_record_toptier			1262088
kpromoted_record_added				616273
kpromoted_mig_promoted				611077
hwhint_nr_events				1904619
hwhint_dram_accesses				1261758
hwhint_cxl_accesses				621428
hwhint_useful_samples				1883543
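
A few ratios derived from the numbers reported for the two tests can be
computed as below. This is plain arithmetic on the values posted above; the
dictionary keys and the choice of which ratios to derive are made up for
the example and do not correspond to any kernel counter names.

```python
def ratio(part, whole):
    return part / whole


# Values copied from the Test 1 and Test 2 tables above.
tests = {
    "test1": {"base_s": 261, "kpromoted_s": 281,
              "added": 378090, "promoted": 374765,
              "cxl": 381435, "events": 1674227},
    "test2": {"base_s": 628, "kpromoted_s": 626,
              "added": 616273, "promoted": 611077,
              "cxl": 621428, "events": 1904619},
}

summary = {}
for name, t in tests.items():
    summary[name] = {
        # kpromoted-ibs completion time relative to Base, NUMAB=2
        "runtime_vs_base": ratio(t["kpromoted_s"], t["base_s"]),
        # kpromoted_mig_promoted / kpromoted_record_added
        "promoted_of_added": ratio(t["promoted"], t["added"]),
        # hwhint_cxl_accesses / hwhint_nr_events
        "cxl_share_of_events": ratio(t["cxl"], t["events"]),
    }

for name, s in summary.items():
    print(name, {k: round(v, 4) for k, v in s.items()})
```

With these inputs, kpromoted-ibs takes about 7.7% longer than the base in
Test 1 and about 0.3% less time in Test 2, and in both tests roughly 99% of
the pages considered for promotion were actually promoted.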